About the Proteome Quality Index

We present the Proteome Quality Index (PQI), a measure of proteome quality available from a comprehensive database of downloadable proteomes. Completely sequenced genomes for which there is an available set of protein sequences (the proteome) are given a 5-star rating supported by 11 different metrics of quality.

PQI is a constantly updated web resource that currently includes over 3,200 annotated proteomes from multiple providers including all entries from NCBI and ENSEMBL.

Motivation

The advent of large-scale de novo sequencing technologies has led to the resolution of a vast number of genomes across all domains of life. However, the assembly and annotation of a new genome is still a challenging task. Studies have shown that there is enormous variability in the quality and consistency of proteomes, both in terms of the individual sequences of each protein and in terms of the completeness of the collection and how representative it is of the proteins in the complete genome [1]. Hence, the scientific community has a strong need for ways to quantify the quality of protein sequence datasets. Coming up with a good measure of proteome quality is difficult, but with PQI we hope to seed discussion and development towards an adequate solution.

Content

For each proteome we provide information about the sequencing technology used, the publication count, and numerous automated scoring metrics based on protein composition and phylogenetic placement. Methods that compare a proteome to its local clade are flagged as "clade-based".

Methodology

Clade-based metrics

Metrics which compare characteristics of proteomes to others (e.g. average sequence length) can indicate outliers, and hence suggest a possible systematic error in the creation of the proteome set (e.g. fragmentary assembly). Since there can be a great variation of these characteristics across the tree of life, the comparison should be done locally amongst similar organisms. This requires a procedure for rough selection of a local phylogenetic clade.

Defining the local clade for comparison:

An organism's local clade is defined as all nearest neighbours that originate from a parent node on the tree, such that:

- the clade includes at least 10 distinct species
- the branch length to the parent node is at least 0.01 (ensuring enough variation to compare against in the case of many closely-related species), as measured in sToL.
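The selection rule above can be sketched in Python under stated assumptions. The tree is given as a hypothetical `parent` mapping from each node to a `(parent, branch_length)` pair, and the 0.01 threshold is read as the cumulative branch length from the organism up to the candidate ancestor. Neither the data structure nor that reading is confirmed by PQI, and the names `local_clade` and `leaves_under` are illustrative only.

```python
from collections import defaultdict


def leaves_under(node, children):
    """Return the set of leaf names below `node` (or {node} if it is a leaf)."""
    stack, leaves = [node], set()
    while stack:
        current = stack.pop()
        kids = children.get(current, [])
        if kids:
            stack.extend(kids)
        else:
            leaves.add(current)
    return leaves


def local_clade(leaf, parent, min_species=10, min_length=0.01):
    """Walk up from `leaf` until both thresholds hold; return (ancestor, leaves).

    `parent` maps node -> (parent_node, branch_length); this layout is an
    assumption made for the sketch, not PQI's actual tree format.
    """
    children = defaultdict(list)
    for child, (par, _) in parent.items():
        children[par].append(child)

    node, dist = leaf, 0.0
    while node in parent:
        par, length = parent[node]
        dist += length  # cumulative branch length from the leaf (assumed reading)
        node = par
        clade = leaves_under(node, children)
        if len(clade) >= min_species and dist >= min_length:
            return node, clade
    # Root reached without meeting both thresholds: fall back to the whole tree.
    return node, leaves_under(node, children)
```

Walking upward one ancestor at a time keeps the clade as small as the two thresholds allow, so each proteome is compared against its closest relatives.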

The branch length also serves as a weighting scheme, ensuring that the representation of each species is normalized with respect to its phylogenetic placement. As proposed by Gerstein, Sonnhammer & Chothia, the weight of a proteome (tree leaf) is set to the branch length to its parent node plus a fraction of each consecutive branch length up the tree to the clade's node. Each of these branch lengths is shared among all descendants of the corresponding parent node in proportion to their current weights at that node. For an organism $o$, the increase in its total weight $\Delta w_{ij}(o)$ contributed from the edge between nodes $i$ and $j$ is given by:

$$\Delta w_{ij}(o)=t_{ij} \cdot \frac{w_{i}(o)}{\sum_{k}w_{i}(k)}$$

where $k$ is a summation index running through all leaves that descend from node $j$, $t_{ij}$ is the edge length between nodes $i$ and $j$ and $w_{i}(o)$ is the total weight for organism $o$ at node $i$.
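As an illustration of this weighting, the following Python sketch distributes each edge length among the leaves below it in proportion to their current weights. The `children` and `length` dictionaries, the function name `gsc_weights`, and the equal split used when all current weights are zero are assumptions made for the example, not details taken from PQI.

```python
def gsc_weights(root, children, length):
    """Return {leaf: weight} for the clade rooted at `root`.

    children: node -> list of child nodes (leaves are absent or map to []).
    length:   node -> length of the edge from that node to its parent.
    Both layouts are hypothetical, chosen to keep the sketch short.
    """
    def visit(node, is_root=False):
        kids = children.get(node, [])
        if kids:
            weights = {}
            for kid in kids:
                weights.update(visit(kid))
        else:
            weights = {node: 0.0}  # a leaf starts with no weight
        if is_root:
            return weights  # no edge above the clade's own node
        t = length[node]
        total = sum(weights.values())
        if total == 0:
            # Assumed fallback: share the edge equally. For a single leaf
            # this yields its own branch length, as in the text.
            share = t / len(weights)
            return {leaf: w + share for leaf, w in weights.items()}
        # Delta w = t * w(o) / sum_k w(k) for every leaf o below this edge.
        return {leaf: w + t * w / total for leaf, w in weights.items()}

    return visit(root, is_root=True)
```

A useful sanity check on any implementation of this scheme is that the weights sum to the total branch length of the clade, since every edge is distributed in full. The recursion here is fine for small clades; a very deep tree would need an iterative traversal.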

Human-readable 5-star rating:

All scores are mapped to a human-readable 1-5 star rating. The 5-star ratings are stored internally as a number between 0 and 1.

We developed a single universal function that maps every metric, independent of its distribution, to a 5-star rating. First, modified Z-scores are computed for each metric and fitted to a cumulative distribution function (CDF; either normal or exponential, depending on the empirical distribution). Next, a second scheme takes the mean of three quantities: the ratio of the metric's value to the maximum value $\max(x)$ of the distribution, the ratio of the value to the sum over the whole population, and the value's rank (scaled between 0 and 1). Finally, the mean of this second scheme and the CDF is taken:

$$\mathrm{score}(x_{i})=\mathrm{mean}\left( \mathrm{CDF}, \mathrm{mean}\left( \frac{x_{i}}{\max(x)}, \frac{x_{i}}{\sum_{j}^{N}x_{j}}, 1-\frac{\mathrm{rank}(x_{i})}{N} \right)\right)$$

The score is then rescaled to yield a value between 0 and 1 corresponding to the final rating. All metrics are re-mapped each time a new proteome is loaded into the database. The automated metrics, after scaling in this way, are averaged with equal weights to yield the final human-readable PQI 1-5 star rating.

[1] Chothia, C. and Gough, J. Genomic and structural aspects of protein evolution. Biochem. J., 419, 15-28 (2009).

[2] Fang, H. et al. A daily-updated tree of (sequenced) life as a reference for genome research. Scientific Reports, 3 (2013).

[3] Dohmen, E. et al. DOGMA: Domain-based transcriptome and proteome quality assessment. Bioinformatics, 32 (2016).