# About the Proteome Quality Index

We present the Proteome Quality Index (PQI), a measure of proteome quality available from a comprehensive database of downloadable proteomes. Completely sequenced genomes for which there is an available set of protein sequences (the proteome) are given a 5-star rating supported by 11 different metrics of quality.

PQI is a constantly updated web resource that currently includes over 3,200 annotated proteomes from multiple providers including all entries from NCBI and ENSEMBL.

## Motivation

The advent of large-scale de novo sequencing technologies has led to the resolution of a vast number of genomes across all domains of life. However, the assembly and annotation of a new genome is still a challenging task. Studies have shown that there is enormous variability in the quality and consistency of proteomes, both in terms of the individual sequences of each protein and in terms of the completeness of the collection and how representative it is of the proteins in the complete genome 1. Hence, there is a strong need in the scientific community for ways to quantify the quality of protein sequence datasets. Coming up with a good measure of proteome quality is difficult, but with PQI we hope to seed discussion and development towards an adequate solution.

## Content

For each proteome we provide information about sequencing technology used, publication count, and numerous automated scoring metrics based on protein composition and phylogenetic placement; the methods which compare a proteome to its local clade are flagged as "clade-based":

• NCBI PubMed Publication count: The raw score is the total number of publications related to the genome as listed for that entry in the NCBI.

• X content: The score is the percentage of amino acids in all proteins for this proteome that are undefined (i.e. represented by an 'X' in the sequence). The first residue of the protein is excluded from the statistics since there is a high bias for it to be uncertain, even in the highest quality proteomes, due to uncertain translation start sites.

• CEG Domain Combination Homology: This method checks for domain-based structural homology to the Core Eukaryotic Gene (CEG) library used by the CEGMA tool, originally based on the Eukaryotic Orthologous Group (KOG) sequence orthology. The SUPERFAMILY HMM library is scored against all instances of the KOG entries found in the CEG set to obtain structural domain assignments. The raw score is the proportion of the proteome's domain architectures that are homologous to the annotated KOG entries from the CEG set against the total number of KOGs.

• % of sequences in UniProt: The raw score is the percentage of sequences in the proteome that appear in the UniProt database with 100% sequence identity.

• % of sequence covered: The raw score is the percentage of the proteome's sequence that is covered by SCOP domain superfamily assignments. (clade-based)

• % of sequences with assignment: The raw score is the percentage of proteins in the proteome that have a SCOP superfamily assignment according to SUPERFAMILY. (clade-based)

• Number of domain superfamilies: The SUPERFAMILY database provides protein domain assignments at the structural classification of proteins (SCOP version 1.75) superfamily level using Hidden Markov Models. (clade-based)

• Mean sequence length: The raw score is the mean number of amino acids in all proteins from the given proteome. (clade-based)

• Mean hit length: The raw score is the mean number of amino acids in the superfamily assignments of the proteome. (clade-based)

• Number of domain families: The raw score is the number of distinct SCOP protein domain families that are annotated to the proteome using a hybrid HMM/pairwise similarity method from the SUPERFAMILY resource. (clade-based)

• Number of Domain Architectures: A 'domain architecture' is an assignment of a protein to a sequential order of SCOP protein domain superfamilies and gaps by the SUPERFAMILY resource. The raw score is the number of the unique domain architectures of the proteome. (clade-based)

• DOGMA: DOGMA measures the completeness of a given transcriptome or proteome and provides information about domain content 3. DOGMA compares a core set of Conserved Domain Arrangements (CDAs) against the proteome of interest. A CDA can also consist of a single domain. By default the core set consists of CDAs that are conserved across six eukaryotic model species (Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens, Mus musculus and Saccharomyces cerevisiae). Besides this eukaryotic core set, DOGMA also offers specialised core sets for mammals, insects, bacteria and archaea.
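The two simplest composition metrics in the list above, X content and mean sequence length, can be illustrated with a minimal sketch. The function names and the toy sequences are hypothetical and are not taken from the PQI code base; the sketch only assumes that a proteome is available as a list of amino-acid strings.

```python
def x_content(proteins):
    """Percentage of undefined residues ('X') over all proteins.

    The first residue of every protein is skipped, as described for the
    X content metric, because translation start sites are often uncertain.
    """
    counted = 0
    undefined = 0
    for seq in proteins:
        body = seq[1:].upper()  # drop the first residue
        counted += len(body)
        undefined += body.count("X")
    return 100.0 * undefined / counted if counted else 0.0


def mean_sequence_length(proteins):
    """Mean number of amino acids per protein (first residue included)."""
    return sum(len(seq) for seq in proteins) / len(proteins) if proteins else 0.0


# Toy proteome (made-up sequences, for illustration only).
proteins = ["XKTAYIAKQR", "MXXLLG", "MSEQ"]
print(x_content(proteins))
print(mean_sequence_length(proteins))
```

In the toy data the leading 'X' of the first sequence is ignored, so only the two internal 'X' residues of the second sequence count towards the percentage.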

## Methodology

Metrics which compare characteristics of proteomes to others (e.g. average sequence length) can indicate outliers, and hence suggest a possible systematic error in the creation of the proteome set (e.g. fragmentary assembly). Since there can be a great variation of these characteristics across the tree of life, the comparison should be done locally amongst similar organisms. This requires a procedure for rough selection of a local phylogenetic clade.

Defining the local clade for comparison:

An organism's local clade is defined as all nearest neighbours that originate from a parent node on the tree, such that:

• the clade includes at least 10 distinct species

• the branch length to the parent node is at least 0.01 (ensuring enough variation to compare against in the case of many closely-related species), as measured in sToL.
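One possible reading of the two conditions above is sketched below. It is an illustration under stated assumptions, not the PQI implementation: the tree is given as two hypothetical dictionaries (`parent` and `length`), every leaf is treated as a distinct species, and "branch length to the parent node" is interpreted as the cumulative path length from the organism up to the candidate ancestor.

```python
def local_clade(leaf, parent, length, min_species=10, min_length=0.01):
    """Walk up from `leaf` until both clade conditions hold.

    parent: dict mapping each non-root node to its parent node.
    length: dict mapping each non-root node to the length of the branch
            leading to its parent.
    Returns (clade_node, list_of_leaves_in_clade).
    """
    # Build a children map so leaves under any node can be collected.
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)

    def leaves_under(node):
        kids = children.get(node)
        if not kids:
            return [node]
        found = []
        for kid in kids:
            found.extend(leaves_under(kid))
        return found

    node = leaf
    distance = 0.0  # assumed meaning: path length from leaf to candidate node
    while node in parent:
        distance += length[node]
        node = parent[node]
        members = leaves_under(node)
        if len(members) >= min_species and distance >= min_length:
            return node, members
    # Assumption: fall back to the whole tree if no ancestor qualifies.
    return node, leaves_under(node)


# Toy tree; thresholds are lowered to 3 species so the example stays small.
parent = {"A": "n2", "B": "n2", "n2": "n1", "C": "n1", "n1": "root", "D": "root"}
length = {"A": 0.003, "B": 0.003, "n2": 0.004, "C": 0.03, "n1": 0.02, "D": 0.05}
print(local_clade("A", parent, length, min_species=3, min_length=0.01))
```

In the toy tree the node `n1` already contains three species, but the path from `A` to `n1` is shorter than 0.01, so the search continues up to `root`.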

The branch length also serves as a weighting scheme ensuring that the representation of each species is normalized with respect to its phylogenetic placement. As proposed by Gerstein, Sonnhammer & Chothia, the weight of a proteome (tree leaf) is set as the branch length to its parent's node plus a fraction of each consecutive branch length up the tree to the clade's node, shared between all descendants of each parent node normalized according to their current weights at that node. For an organism $o$, the increase in its total weight $\Delta w_{ij}(o)$ contributed from the edge between nodes $i$ and $j$ is given by:

$$\Delta w_{ij}(o) = t_{ij} \cdot \frac{w_{i}(o)}{\sum_{k} w_{i}(k)}$$

where $k$ is a summation index running through all leaves that descend from node $j$, $t_{ij}$ is the edge length between nodes $i$ and $j$ and $w_{i}(o)$ is the total weight for organism $o$ at node $i$.
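The weighting scheme can be sketched as a short recursion. This is a minimal illustration of the branch-length sharing described above, assuming a hypothetical nested-tuple tree format `(name, branch_length, children)`; it is not the code used by PQI.

```python
def clade_weights(children):
    """Leaf weights for the subtrees hanging off a clade node.

    children: list of (name, branch_length, grandchildren) tuples, where
    `grandchildren` is an empty list for a leaf (hypothetical format).
    """
    weights = {}
    for name, branch_length, grandchildren in children:
        if not grandchildren:
            # A leaf starts with the length of the branch to its parent.
            weights[name] = branch_length
            continue
        sub = clade_weights(grandchildren)
        total = sum(sub.values())
        for leaf, w in sub.items():
            # Each descendant receives a share of the edge above its
            # subtree, proportional to its current weight.
            share = w / total if total > 0 else 1.0 / len(sub)
            weights[leaf] = w + branch_length * share
    return weights


# Toy clade: leaf A, plus an internal node with leaves B and C.
clade = [
    ("A", 0.2, []),
    ("inner", 0.1, [("B", 0.1, []), ("C", 0.3, [])]),
]
print(clade_weights(clade))
```

A useful sanity check for this kind of scheme is that the weights sum to the total branch length of the clade (0.7 in the toy tree), since every edge is fully distributed among the leaves below it.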

All scores are mapped to a human-readable 1-5 star rating. The 5-star ratings are stored internally as a number between 0 and 1.

We developed a single universal function to map all metrics, independent of their distribution, to a 5-star rating as follows. First, modified Z-scores are obtained for each metric and fitted to a cumulative distribution function (CDF; either normal or exponential, depending on the empirical distribution). Next, a second scheme takes the mean of three quantities: the proportion of the metric's value against the maximum value $\max(x)$ of the distribution, its proportion of the population total, and the score's rank (scaled between 0 and 1). Finally, the mean of this second scheme and the CDF is taken:

$$\mathrm{score}(x_{i}) = \mathrm{mean}\left( \mathrm{CDF},\ \mathrm{mean}\left( \frac{x_{i}}{\max(x)},\ \frac{x_{i}}{\sum_{j}^{N} x_{j}},\ 1-\frac{\mathrm{rank}(x_{i})}{N} \right)\right)$$

The score is then rescaled to yield a value between 0 and 1 corresponding to the final rating. All metrics are re-mapped each time a new proteome is loaded into the database. The automated metrics, after scaling in this way, are averaged with equal weights to yield the final human-readable PQI 1-5 star rating.
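The mapping for a single metric can be sketched as follows. Several details are assumptions made for illustration and are not stated in the text: the modified Z-score is taken in its common median/MAD form with the constant 0.6745, only the normal CDF is shown, rank 1 is assigned to the largest value, larger raw values are treated as better, all raw values are positive, and the final rescaling is a min-max rescaling over the population.

```python
import math
from statistics import median


def modified_z_scores(values):
    """Median/MAD-based Z-scores (assumed form of the 'modified Z-score')."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def metric_scores(values):
    """Map raw metric values (assumed positive) to scores in [0, 1]."""
    n = len(values)
    top = max(values)
    total = sum(values)
    cdf = [normal_cdf(z) for z in modified_z_scores(values)]
    raw = []
    for x, c in zip(values, cdf):
        # Assumption: rank 1 is the largest value in the population.
        rank = 1 + sum(1 for v in values if v > x)
        second = (x / top + x / total + (1.0 - rank / n)) / 3.0
        raw.append((c + second) / 2.0)
    low, high = min(raw), max(raw)
    if high == low:
        return [0.5 for _ in raw]  # degenerate case: all values identical
    # Assumed rescaling step: min-max over the current population.
    return [(r - low) / (high - low) for r in raw]


# Toy mean-sequence-length values for five proteomes (made-up numbers).
values = [310.0, 420.0, 455.0, 480.0, 900.0]
print(metric_scores(values))
```

Because every term in the formula grows with the raw value, the resulting scores preserve the ordering of the inputs; a metric where lower values are better (such as X content) would need its direction reversed before this mapping is applied.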

1 Chothia, C. and Gough, J. Genomic and structural aspects of protein evolution. Biochem. J., 419, 15-28 (2009).

2 Fang, H. et al. A daily-updated tree of (sequenced) life as a reference for genome research. Scientific Reports, 3 (2013).

3 Dohmen, E. et al. DOGMA: Domain-based transcriptome and proteome quality assessment. Bioinformatics, 32 (2016).