| Title: | Sequence conserved for sub-cellular localization |
| Author: | Rajesh Nair & Burkhard Rost |
| Quote: | Protein Science, 2002, 11, 2836-47 |
Sequence conserved for sub-cellular localization
| 1 | CUBIC, Dept. of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA |
| 2 | Dept. of Physics, Columbia Univ., 538 West 120th Street, New York, NY 10027, USA |
| 3 | Columbia University Center for Computational Biology and Bioinformatics (C2B2), Russ Berrie Pavilion, 1150 St. Nicholas Avenue, New York, NY 10032, USA |
| 4 | North East Structural Genomics Consortium (NESG), Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA |
| * | Corresponding author: email = rost@columbia.edu URL http://cubic.bioc.columbia.edu/ Tel: +1-212-305-3773, fax: +1-212-305-7932 |
This article is published in (Protein Science, issue, 2002 and pages) © copyright Cold Spring Harbor Laboratory Press (2002). CSHL Press is the only authorised source. All copying of this article including placing on another website requires the written permission of the copyright owner.
The more proteins diverged in sequence, the more difficult it becomes for bioinformatics to infer similarities of protein function and structure from sequence. The precise thresholds used in automated genome annotations depend on the particular aspect of protein function transferred by homology. Here, we presented the first large-scale analysis of the relation between sequence similarity and identity in sub-cellular localization. Three results stood out: (1) the sub-cellular compartment is generally more conserved than what might have been expected given that short sequence motifs like nuclear localization signals can alter the native compartment. (2) The sequence-conservation of localization is similar between different compartments, and (3) it is similar to the conservation of structure and enzymatic activity. In particular, we found the transition between the regions of conserved and non-conserved localization to be very sharp although the thresholds for conservation were less well defined than for structure and enzymatic activity. We found that a simple measure for sequence similarity accounting for pairwise sequence identity and alignment length, the HSSP-distance, distinguished accurately between protein pairs of identical and different localizations. In fact, BLAST expectation values outperformed the HSSP-distance only for alignments in the sub-twilight zone. We succeeded in slightly improving the accuracy of inferring localization through homology by fine-tuning the thresholds. Finally, we applied our results to the entire SWISS-PROT database and five entirely sequenced eukaryotes.
| EC | Enzyme commission classification of enzymes [1] |
| E-val | BLAST [2] expectation value |
| GO | GeneOntology, i.e. functional classification of proteins [3, 4] |
| HSSP | database of protein structure-sequence alignments [5] |
| HSSP-distance | distance from function relating sequence identity and alignment length ( eqn. 1 ) |
| PIDE | percentage pairwise sequence identity |
| proteome | we refer to the set of all proteins in an entirely sequenced organism as the proteome of that organism |
| LALI | length of sequence alignments |
| SWISS-PROT | curated database of protein sequences [6] . |
Sequence conservation of protein function. Proteins retain the 'memory' of their evolutionary ancestry in their sequence, structure and function. High sequence similarity alone is considered to be sufficient evidence for common ancestry and is routinely used to infer structural and functional similarity [7] . However, many proteins of similar structure have no discernible sequence similarity [8, 9, 10, 11] . The evolutionary conservation of protein structure, i.e. the relationship between similarity in sequence and in structure has been explored extensively [12, 5, 13, 14, 15, 16, 17, 18] . The resulting thresholds enable transferring structural annotations [5, 19, 20, 21, 22, 23] . The thresholds for sequence similarity that imply similarity in function cannot be inferred from those for structure [24, 25, 26, 27, 28, 29, 30] . One problem in establishing thresholds for the transfer of function is that the term 'protein function' is - albeit intuitive -not well defined. Function is a complex phenomenon associated with many mutually overlapping levels: chemical, biochemical, cellular, organism mediated, and developmental. These levels are related in complex ways; for example, protein kinases can be related to different cellular functions (such as cell cycle), and to a chemical function (transferase) plus a complex control mechanism by interaction with other proteins. This lack of a precise definition generates two specific problems for analysing the conservation of function. First, we have sufficiently large, machine-readable data only for very few aspects of function [27] . Second, the conservation differs significantly between different types of function [25, 26, 27, 29, 30] . Nevertheless, a better understanding of the relation between function and sequence is fundamentally important, since it can provide insights into the underlying mechanisms of evolving new functions through changes in sequence and structure [31] . Large-scale genome sequencing projects have led to a rapidly widening gap between the number of known sequences and their functional annotations. Efforts at addressing this situation have largely relied on exploiting sequence similarity to infer functional similarity [32, 33, 34, 35, 7, 36, 37, 38] . However, few large-scale studies have evaluated the accuracy of the transfer of function [39, 40, 25, 28, 30] .
Three zones of sequence comparisons: From trivial (safe) over problematic (twilight) to impossible (midnight). In general, we can separate between three regions of protein comparisons ( Fig. 1 ): (1) Safe zone: for very high levels of sequence similarity proteins have similar functions and structures; aligning the proteins is straightforward. (2) Twilight zone [41] : at some point of divergence the alignment of proteins becomes problematic, in fact, we no longer can safely infer similarity in a particular feature from sequence. However, typically a considerable fraction of the protein pairs identified in the twilight zone still have a particular feature in common. (3) Midnight zone: protein pairs in this zone have so low levels of sequence similarity that we can no longer detect their similarity from sequence alone [8, 10] . Interestingly, the vast majority of protein pairs with similar three-dimensional structure (3D) populate the midnight zone [8, 15, 42, 11] .
Statistical scores versus percentage sequence identity for conservation of structure. The most popular database search methods BLAST and PSI-BLAST [43, 44, 45] use neither percent sequence identity nor raw alignment scores to characterise sequence similarity. Instead, they use probabilities or expectation values that reflect the statistical significance of a given alignment. It has been claimed that statistical scoring schemes are superior to scores based on pairwise sequence identity in identifying structural homologues [46, 13, 15, 16, 47] . Statistical scores are clearly superior to simply measuring pairwise sequence identity. Sander & Schneider (1991) introduced another measure for sequence similarity to identify proteins of similar structure when building their HSSP database. Their measure, the HSSP-distance correlates alignment length and pairwise sequence identity. A minor modification of the original HSSP-curve resulted in a measure for sequence similarity that appears more successful than statistical scores in describing structural similarity for pairwise alignments [10] . Can we generalise the lessons learned from the conservation of structure to that of function?
Sharp transition of thresholds describing the sequence conservation of function. Recent efforts at investigating the sequence-function relation have utilised three different classifiers for function: (1) enzymatic activity as described by the Enzyme Commission (EC, [1] ) numbers [24, 48, 25, 26, 27, 29, 30] , (2) SWISS-PROT keywords [25] , and (3) the GeneOntology (GO, [4] ) classification as used in FLYBASE [3, 27] . Wilson et al. (2000) observed that the separation between proteins of similar and non-similar function is best described by sigmoidal curves that drop off sharply at particular thresholds for conservation. They claimed near perfect conservation of function as measured by EC and GO above 40-50% pairwise sequence identity. Other groups have published similar findings for enzymatic activity [24, 25, 26, 29] . More recently, we have found that most of these results appeared over-optimistic, i.e. that enzymatic function is less well conserved than anticipated [30] . Devos and Valencia (2000) have also investigated the sequence identity-to-function relationship for SWISS-PROT [6] keywords and binding sites. While they find that the functional shapes of the threshold separating conserved and non-conserved function are similar between different aspects of function, the precise thresholds differ between EC, keywords, and binding sites. In contrast to all other groups, Pawlowski et al. (2000) reported thresholds for conservation that were not sharp. Their measure of sequence similarity was based on their in-house alignment program BASIC [49] .
Statistical scores versus percentage sequence identity for conservation of function. Wilson et al (2000) found percent identity to be better at quantifying functional similarity than statistical scoring schemes. However, their results also indicated that statistical scoring schemes are better at discriminating between highly diverged sequences. In contrast, Pawlowski et al. (2000) and our recent results [30] suggested that both statistical Z-scores and BLAST expectation values were clearly superior to pairwise sequence identity at quantifying similarity in enzymatic function. However, our results also confirmed that a combination of alignment length and pairwise sequence identity outperforms BLAST scores at levels of low sequence divergence [30] .
Sequence conservation of sub-cellular localization. The sub-cellular localization of a protein is correlated with its function [50, 51, 52] . Obviously, we expect proteins with very similar sequences to be localised in similar cellular compartments. In fact, this assumption is applied in everyday sequence analysis and database annotations [39, 6] . However, while we do know that sub-cellular localization is evolutionarily imprinted onto the protein surface [53] , the precise threshold for the sequence conservation of sub-cellular localisation has not been explored, yet. Here, we investigated this conservation for different compartments, i.e. implicitly for different aspects of function [54, 55] . Using sub-cellular localization to quantify functional similarity, we re-discovered particular features of the sequence-function relationship noticed for other types of function before [24, 25, 17, 26, 27, 29, 30] . We benchmarked the effectiveness of various scoring schemes and proposed a new scheme for refined identification of functional homologues and compared the accuracy of homology transfer to the accuracy of prediction methods [56, 57, 58, 59] . Finally, we applied our scheme to annotate five entirely sequenced eukaryotes and the entire SWISS-PROT database. In conjunction with other studies relating structure-to-function, our work provided useful insights into the evolution of new functions through sequence and structural changes. Our results may lead to improvements in functional annotations of newly sequenced genomes.
Over one million pair comparisons between proteins of known localization. The data set of proteins with experimentally known sub-cellular localization extracted from SWISS-PROT contained 7405 proteins. Since many of these proteins were very similar in sequence (bias), we constructed a representative sequence-unique subset containing 1248 proteins ( Table 1 ). Our objective was to establish at which level of sequence similarity proteins reside in the same localization. In order to obtain estimates that are unbiased, we have to analyse the unbiased, sequence-unique set. However, if we were to align an all-against-all for subset of 1248, we would not find any close homologues since by construction no pair in the sequence-unique set is closely related. Consequently, we could not deduce thresholds for everyday sequence comparisons. Therefore, we have to accept some bias by aligning all proteins from the sequence-unique set against all proteins from the full set. Thus, our results were based on over nine million pair comparisons. Although this number might appear high enough to allow statements about statistical significance, the data set was not evenly distributed between the 11 different compartments ( Table 1 ). We distinguished between major compartments (nucleus, cytoplasm, mitochondria, extra-cellular space, and chloroplasts) for which we had sufficiently large sets, and minor compartments (vacuoles, Golgi, and periplasm) for which the sets were too small to generalise our findings. Three compartments (lysosome, ER, and peroxysome) ranged somewhere in between these two extremes.
Sub-cellular localisation | AllSWISS-PROT a | Sequence-uniqueb |
Nucleus | 2642 | 490 |
Cytoplasm | 2161 | 326 |
Mitochondria | 812 | 160 |
Extra-cellular space | 663 | 107 |
Chloroplast | 943 | 84 |
Lysosome | 112 | 26 |
Endoplasmic reticulum (ER) | 125 | 20 |
Peroxysome | 94 | 18 |
Vacuoles | 35 | 13 |
Golgi apparatus | 14 | 4 |
Periplasm | 4 | 0 |
SUM (all 11) | 7405 | 1248 |
a Numberof proteins with known localization found in SWISS-PROT;
b Numberof sequence-unique proteins, i.e. representative subset of all SWISS-PROTproteins (Methods).
Sequence conservation of localization similar for major compartments. The functional shapes for the sequence conservation of sub-cellular localization were similar across the major compartments ( Fig. 2 ). The thresholds for accurate inference of localization through homology were around HSSP-distances of 4 ( eqn. 1 Fig. 2
|
Fig. 2. : Sequence conservation for major classes of sub-cellular localization. For different thresholds in terms of the HSSP-distance (Eq. 1), we compiled the levels of cumulative accuracy (Eq. 3) and cumulative coverage (Eq. 4). The major compartments had very similar curves for cumulative accuracy. The transition from the safe zone to the twilight zone occurred around HSSP-distances of 4. In contrast to the perfect conservation of structure, the cumulative accuracy was observed to be as low as 80% (for mitochondrial proteins) in the safe zone. The cumulative coverage showed greater variation among the different compartments: the transition for coverage occurred between HSSP-distance 5 and –5. The coverage remained significantly low even at very low levels of accuracy. |
Sharp transition from safe to twilight zone. If we want to infer the localization of a protein through homology to a protein of experimentally known localization, we have to relate sequence similarity to the conservation of the compartment. We explored three different ways of measuring sequence similarity (Methods): (1) BLAST expectation values (E-VAL, [2] , (2) percentage pairwise sequence identity (PIDE), and (3) the distance from the HSSP threshold (DIST, eqn. 1 ). The curves describing the sequence conservation of localization resembled sigmoidal relationships ( Fig. 3
|
Fig. 3. : Average conservation of sub-cellular localization. The upper graphs show the performance of pairwise BLAST searches for the biased set, while the lower graphs show the performance of pairwise BLAST and PSI-BLAST searches on the sequence-unique subset. The filled symbols show cumulative accuracy and cumulative coverage (Eq. 3) for pairwise BLAST; open symbols give the results from PSI-BLAST searches. For the biased set, the cumulative coverage is 1% corresponding to the identification of about 274K pairs from identical localization (true pairs), while for the sequence-unique subset a cumulative coverage of 1% corresponds to the identification of about 21K true pairs. Conservation thresholds for BLAST and PSI-BLAST are indicated by open and filled arrows, respectively. For HSSP-distance (C and F), the conservation threshold using BLAST was at HSSP-distance=4 (open arrow) for the biased and sequence-unique sets, while using PSI-BLAST the conservation threshold was at HSSP-distance=0 (filled arrow) for the sequence-unique set. The cumulative accuracy and cumulative coverage when using BLAST for the sequence-unique set was 87% and 0.36% respectively and for PSI-BLAST it was 91% and 0.4% respectively. For the cumulative accuracy versus percent sequence identity graphs (A and D) no sharp conservation thresholds could be established. The percent sequence identity graphs showed the largest variation for the biased and sequence-unique sets. In contrast, the graphs for BLAST E-values (B and E) and HSSP distances (C and F, Eq. 1) were similar for the biased and the sequence-unique set. The conservation thresholds for PSI-BLAST occurred at a lower threshold than that for pairwise BLAST (D, E and F). The middle graphs plot the logarithm of the BLAST E-values (log to the base e). Note that BLAST E-values below 10-200 did not suffice to safely infer localization. In contrast, at very high HSSP distances and sequence identities localization could be reliably transferred. |
Expectation values outperformed HSSP distance for very diverged pairs. When analysing the relation between accuracy (percentage of localizations correctly inferred above given threshold) and coverage (number of correct inferences made above threshold), we noticed that the HSSP-distance outperformed simple pairwise sequence identity for all thresholds ( Fig. 4 A: circles vs. squares). However, the HSSP-distance was superior to the BLAST expectation values only for proteins of conserved structure (HSSP distances above 0, Fig. 4 A: circles vs. diamonds, pairwise BLAST transition marked by upper, left arrow). Surprisingly, the BLAST expectation values were slightly inferior to percentage pairwise sequence identity for very similar proteins ( Fig. 4 A indicated by lower, right-hand arrow). For PSI-BLAST, percent sequence identity performed even better. Since the HSSP-distance gave the best prediction above the conservation threshold ( Fig. 4 A), using HSSP-distances to infer localization by homology improves the accuracy of information transfer significantly. In fact, the HSSP-curve derived to describe the sequence conservation of protein structure [10] , described the basic difference between protein pairs of identical and of different localizations surprisingly well ( Fig. 5 ). We obtained similar graphs for the individual localizations and for PSI-BLAST profiles (data not shown).
|
Fig. 4. : Performance for different measures of sequence similarity. The black lines and open symbols show cumulative coverage versus cumulative accuracy for PSI-BLAST searches while grey lines and shaded symbols show the same for pairwise BLAST (A and B). The figure plots data only for cumulative accuracy above 80%, which is well below the threshold for conservation of localization. (A) For HSSP distance (circles) and percent sequence identity (squares), PSI-BLAST vastly outperforms pairwise BLAST. However, using BLAST E-values both BLAST and PSI-BLAST gave comparable performance at the conservation threshold (86% cumulative accuracy in figure). For both pairwise BLAST and PSI-BLAST scoring the alignments using HSSP distance (Eq. 1) gave the best coverage versus accuracy graphs. Using HSSP distance for PSI-BLAST alignments gave overall best performance. (B) For both pairwise BLAST and PSI-BLAST using scaled distance (Eq. 2) from the HSSP curve improved performance compared to HSSP distance. The performance was worse when perpendicular distance from the HSSP curve was used. Overall using PSI-BLAST alignments and scaled distance from the HSSP curve gave best performance.The curves for cumulative accuracy and coverage for the scaled HSSP-distance (C) were similar to those obtained for the standard HSSP-distance (Fig. 3F). |
|
Fig. 5. : Percentage pairwise sequence identity vs. length of alignment. The grey plus signs represent protein pairs experimentally observed in identical compartments, while the black squares represent pairs observed in different compartments. The grey line is the HSSP-curve (Eq. 1) optimised to describe the sequence conservation of protein structure [10] . The HSSP-curve was surprisingly accurate at reproducing the curve that may best separate proteins with identical localization from those of different localization. |
Refining the thresholds to infer localization by homology. We might improve the accuracy of transferring experimental information about localization by homology in two ways. We could refine the original HSSP-curve used to determine the thresholds for this inference. However, the incorrect predictions shown in Fig. 5 suggested that this might not be simple. Alternatively, we could modify the way of compiling the distance from the HSSP-curve used to establish thresholds. Towards this end, we explored the following alternatives: (1) standard HSSP-distance ( eqn. 1 note that this is the distance used for all previous figures), (2) perpendicular HSSP-distance (Methods), and (3) a scaled HSSP-distance ( eqn. 2 ). We found the scaled HSSP-distance slightly superior to the standard HSSP-distance, while the perpendicular HSSP-distance performed significantly worse ( Fig. 4 B). For pairwise BLAST searches, the scaled HSSP-distance discovered 13% more pairs of identical localization than the standard HSSP-distance at the same conservation threshold. The curves relating cumulative accuracy and cumulative coverage respectively to scaled HSSP-distance ( Fig. 4 C) were similar to those obtained for HSSP-distance ( Fig. 3 F).
Annotation transfer for entire SWISS-PROT and entire eukaryotes. Using the scaled HSSP-distance ( eqn. 2 ), we annotated the sub-cellular localization based on homology for approximately one-fifth of the proteins in five entirely sequenced eukaryotes and the entire SWISS-PROT database ( Table 2 ). The percentages of proteins for which could infer localization differed substantially between the highest value above 26% for human and the lowest around 13% for the worm. Previously, we predicted that about 20% of all fly proteins are nuclear [62] . In contrast, over 60% of all the proteins for which we could infer localization by homology for entire proteomes were nuclear. Obviously, this number reflected the current bias. The annotations of sub-cellular localization are available at http://cubic.bioc.columbia.edu/db/LocHom/.
| Dataset a | Nprotb | Nhomoc | Phomod | Percentageof homology inferred e | |||||
| nuc | cyt | ext | mit | pla | mis | ||||
| Arabidopsisthaliana (plant) | 25456 | 3997 | 15.7 | 62 | 11 | 11 | 5 | 5 | 7 |
| Caenorhabditiselegans (worm) | 18898 | 2478 | 13.1 | 68 | 8 | 13 | 6 | <1 | 5 |
| Drosophila melanogaster (fly) | 14184 | 3033 | 21.4 | 67 | 10 | 13 | 5 | <1 | 4 |
| Mus musculus (mouse) | 28096 | 7496 | 26.7 | 65 | 15 | 11 | 5 | <1 | 4 |
| Homosapiens (human) | 31073 | 8216 | 26.4 | 62 | 14 | 16 | 4 | <1 | 3 |
| SWISS-PROT | 109381 | 26651 | 24.4 | 35 | 19 | 17 | 14 | 9 | 5 |
a Dataset: Entirely sequenced eukaryotic organisms and the SWISS-PROT database;
b Nprot:number of proteins in the database;
c Nhomo:number of proteins for which localization could be predicted using homology at>70% accuracy (only alignments with scaled HSSP-distance > 4 retained,see methods);
d Panno:percentage of proteins annotated (100*Nhomo/Nprot);
e Percentagesof homology inferred proteins by compartment (= 100*Nx/Nhomo), with x = nuc(nuclear), = cyt (cytoplasmic), = ext (secreted to extra-cellular space), = mit(mitochondrial), = pla (chloroplast), and = mis (proteins predicted in one ofthe other localizations). Other localizations include lysosome, peroxysome, Endoplasmicreticulum, vacuoles, Golgi and periplasm.
Method a | Extra-cellular | Cytoplasmic | Nuclear | Mitochondrial | ||||
oLb | pLb | oL | pL | oL | pL | oL | pL | |
NNPSL | 62.6 | 30.3 | 43.6 | 50.2 | 65.1 | 72.2 | 67.3 | 35.5 |
SubLoc | 72.9 | 52.7 | 66.9 | 59.7 | 82 | 76.1 | 65.0 | 50.2 |
TargetP | 98.8 | 50.9 | - | - | - | - | 92.7 | 45.9 |
a Method used to predict localization: NNPSL: neural network based tool for predictingsubcellular localization based on amino acid composition [73] ; SubLoc: a support vector machinebased tool for predicting subcellular localization based on amino acidcomposition [74] ;TargetP:neural network based tool for large-scale subcellular localization predictionbased on N-terminal sequence information [75]
b twostate accuracy (Eq. 5): oL = correctly predicted in localization L of allobserved in L, pL = correctly predicted in L of all predicted in L. Forinstance, TargetP strongly over-predicts extra-cellular and mitochondrialproteins, therefore yielding high values for oL and low values for pL.
Ab initio prediction methods were not necessarily more accurate. The accuracy of inferring sub-cellular localization by homology exceeded 80% at the HSSP-distances of 4 ( Fig. 2 ). We compared the performance of three publicly available prediction methods on the same sequence-unique data set. Note that the resulting estimates for prediction accuracy are likely to constitute over-estimates since many of our test proteins may have been used to develop those prediction methods. Despite this likely over-estimate, we found that the prediction methods either did not reach levels of 80% accuracy (NNPSL and SubLoc, Table 3 ) or reached higher levels of accuracy at the expense of over-prediction (TargetP, Table 3 ). Thus, the annotation transfer by homology constitutes a powerful prediction method for proteins that are sufficiently sequence similar to proteins of experimentally known localization.
Possible reasons for a sharp transition from safe to twilight zone. Proteins have diverged constrained by the evolutionary pressure to maintain aspects of function and structure. Do proteins naturally cluster in sequence space, i.e. is it possible to identify regions in sequence space that are predominantly populated by proteins of similar function? We have to define criteria for sequence similarity and for the similarity in function to investigate this question. The criterion for functional similarity should be broad enough to accommodate some degree of divergent evolution within protein families. If proteins fall into well-defined clusters, we expect to find well-defined thresholds for sequence conservation. Once sequence similarity falls below this conservation threshold, the noise (false positives) rapidly overwhelms the signal (true homologues) showing an effect similar to a phase transition (for this family). If a large number of protein families show nearly identical behaviour in sequence space, then we will observe a very sharp 'phase transition' like effect. Note that this effect does not necessarily imply to find fewer proteins of similar function below than above the threshold. In general, we expect that the sharpness of the transition will decrease in proportion to the broadness of our criterion for functional similarity. The sub-cellular localization of a protein is a broad definition of functional similarity. Nevertheless, we observed a very sharp transition for the sequence conservation of localization, implying a well-defined conservation threshold. Gerstein and colleagues have argued that the threshold describing the transition from the safe to the twilight zone, indeed, appears like a phase-transitions in physics [27] . Godzik and colleagues have speculated that the sharp transition may be due to qualitative differences between homologous proteins with very similar functions and all other, unrelated, proteins with very little functional similarity [26] . Other groups argued that the particular shape of the transition is explained by a simple statistical model [63, 14, 10] . The following experiment may shed light on this issue: (a) take all proteins in a number of entirely sequenced organisms, (b) align these against all known proteins, and (c) plot the histograms of how many proteins are found at a given level of sequence similarity. We carried this out for 30 completely sequenced organisms using PSI-BLAST profiles to search [45] . We found that the number of hits at a given level of sequence similarity also underwent a sharp transition ( Fig. 6
|
|
|
Fig. 6. : Conservation of function and structure. (A) We aligned all proteins in 30 entirely sequenced organisms with PSI-BLAST against all known proteins. We considered all pairs identified above PSI-BLAST expectation values of 10-3 to constitute the respective family (100%). We plotted the percentage of proteins found at a given threshold for sequence similarity. Both for measuring sequence similarity by pairwise sequence identity (lower x-axis, thin line with triangles), or PSI-BLAST expectation values (upper x-axis, thick line with crosses), the number of members of a group increased non-linearly at some given threshold. (B + C) Sequence conservation of four different features of protein structure and function. The data for the conservation of protein structure (thick grey line with crossed boxes) was compiled according to [10] . The data for the conservation of enzymatic activity was compiled according to [30] . We identified similarity in enzymatic activity by the of the first EC digit distinguishing six classes (oxireductases, transferases, hydrolases, lyases, isomerases, and ligases; thin grey line with triangles), and by the identity of the detailed activity (all four digits conserved, thin black lines with crosses). Finally, we used the data set of sub-cellular localization explored in this study. Sequence similarity was measured by the HSSP-distance (B) and by the BLAST expectation values (C). All comparisons based on pairwise BLAST alignments. |
Sequence conservation of protein function. Our study differed from other investigations of the sequence-to-function relationship in that we did not use hierarchical functional classifications. Instead, we captured a very coarse-grained, physical aspect of function, namely the sub-cellular localization of a protein. Although the biological impacts of this particular aspect are limited, the advantage is that it is more clearly defined and easy to measure experimentally than other types of function. In practice, the sharp transition for the sequence conservation of localization ( Fig. 2 Fig. 3 and Fig. 4 ) implied that once the sequence similarity falls below a certain threshold for 'conserved localization', sampling proteins from the protein universe rapidly becomes random. The BLAST expectation values proved to yield sharper transitions than the HSSP-distance for the conservation of structure, of enzymatic activity and of localization ( Fig. 6B and C). Overall, we found the sequence conservation of localization to be most similar to that of the detailed enzymatic activity (Fig. 6B and C). The HSSP-curve ( eqn. 1 ) was introduced to separate proteins with similar and non-similar structure [5, 10] . We were, therefore, surprised by the success of this curve in distinguishing proteins with identical and different sub-cellular localizations ( Fig. 4 ).
Sub-cellular localization can be inferred accurately through homology. An intricate targeting mechanism helps many proteins to find their right localization in a cell. For example, proteins are targeted to the nucleus if they contain a nuclear localization signal (NLS) [64] ; often NLS motifs involve fewer than ten consecutive residues [62] . The absence of an NLS will cause an otherwise nuclear protein to remain in the cytoplasm. Similarly, proteins entering the secretory pathway usually contain N-terminal signal peptides [65, 66, 67] . Most signal peptides span over about 20-30 residues [68] that are cleaved while the protein is translocated through the extra-cellular membrane. Other N-terminal signals control targeting to the chloroplast and mitochondria [65, 66, 67] . Thus, the targeting mechanisms differ considerably between the compartments. Consequently, we were very surprised to find that the sequence conservation of proteins from different compartments was similar ( Fig. 2 ). The practical implications of our analysis are that we can accurately infer the sub-cellular compartment of a protein if we find close homologues of experimentally known localization. We showed that the inference through homology strikes a better compromise between over- and under-prediction than ab initio prediction methods. This observation is expected, since ab initio methods - like for the example of structure prediction [69, 70] - are designed to work when we cannot use homology-based inference. The improvement in sensitivity and accuracy through a combination of a scaled HSSP-distance and the BLAST expectation values are important for analysing entire proteomes ( Table 3 ). A particularly important aspect of this work is the detailed estimates for accuracy and coverage associated with any automatic annotation made.
Data set. We selected all eukaryotic proteins with annotated sub-cellular localization in SWISS-PROT release 37 [6] . We excluded sequences annotated as "POSSIBLE","PROBABLE" or "BY SIMILARITY". We also excluded membrane proteins and all sequences annotated with multiple localizations. This left 7405 proteins with experimentally annotated localization ( Table 1 All SWISS-PROT). To reduce bias, we selected a representative data set of sequence-unique proteins. Protein pairs were clustered using a simple greedy algorithm starting with the largest and longest families [71, 30] . We investigated different thresholds for clustering the sequences. The major results of our work were insensitive to the particular choice of the threshold (data not shown). Note that the data reported was obtained when using an HSSP-distance of 4 ( eqn. 1 ) to cluster since that value defined the threshold of sequence conservation. The database comparisons for the clustering were performed by pairwise BLAST [43, 2] .
Generating alignments for pair comparisons. We aligned all sequences from the sequence-unique subset ( Table 1 ) against all proteins of known localization using pairwise BLAST [43, 2] . For all proteins from the sequence unique subset we generated PSI-BLAST [45] profiles using a filtered version of all currently known sequences with three iterations [72] . These profiles were then aligned against all proteins of known localization.
Scores for measuring sequence similarity. The simplest way to measure sequence similarity is percentage pairwise sequence identity (PIDE), i.e. the percentage of residues identical between two proteins divided by residues aligned (not counting gaps). The second measure that we used was given by the statistical expectation values as reported by BLAST (E-VAL, note: we typically report the logarithm of this value in our figures). The third scoring scheme we used was the distance from the HSSP-curve [5, 10] :
HSSP-distance = PIDE – HSSP_curve(q)
HSSP_curve(q)= q +
(Eq. 1)
where L was the length of the alignment between two proteins, PIDE the percentage of pairwise identical residues, and HSSP_curve(q) the revised HSSP-threshold for the level q. As described above, we chose q = 4 to reduce the bias. However, to compile distances, we chose the threshold of q = 0.
Modifications to optimise detection of homologues. We introduced two modifications of the standard HSSP-distance ( eqn. 1 ):
(1) Perpendicular HSSP-distance: To calculate the 'perpendicular HSSP-distance' ( eqn. 1 ), percentage sequence identity and alignment length have to be measured in comparable units. This was done by first identifying approximate saturation points (slope 0 or ¥) on the HSSP-curve. Using these saturation points, we re-scaled the length of alignment axis (L in eqn. 1 ) and expressed it in terms of percent identity. For a given alignment, the normal to the re-scaled HSSP-curve was first identified. The length of the normal gave the perpendicular HSSP-distance. We experimented with various re-scaling constants. Finally a re-scaling constant of 0.26 for the length of alignment was observed to provide the best results.
(2) Scaled HSSP-distance: Two proteins with 100% identical residues (PIDE=100) over an alignment length of 25 residues have an HSSP-distance of only 33. However, we observed very few false positives even over relatively short fragments for very similar pairs ( Fig. 5 ). This suggested that better identification of homologues might be possible by using a relative distance from the curve. The scaled HSSP-distance was defined as:
(Eq. 2)
where PIDE was the percentage pairwise sequence identity, and the HSSP_curve was as defined in eqn. 1 (with a threshold of 0).
Definitions of accuracy and coverage. We used the following definitions to measure accuracy/specificity:
(Eq. 3)
with the thresholds given by either (1) percentage pairwise sequence identity, (2) BLAST expectations values, (3) the distance from the HSSP-curve ( eqn. 1 ), (4) the Scaled HSSP-distance ( eqn. 2 ), or (5) the Perpendicular HSSP-distance. We considered all pairs as 'true' that were experimentally found in the same sub-cellular compartment. In analogy, we used the following definitions for coverage/sensitivity:
(Eq. 4)
The accuracy of prediction was measuredusing the ratios:
(Eq. 5)
Note that pL and oL measure two different aspects of prediction methods, in particular, oL reflects how many of the known proteins are correctly predicted while pL reflects how many of the predicted proteins are correctly predicted. For example, a method that strongly over-predicts (like SignalP) yields a high oL and a low pL ( Table 3 ).
Prediction methods. The prediction accuracy of three publicly available subcellular localization predictors was evaluated using the sequence-unique test set (Table 1). The three predictors were: 1) NNPSL: neural network based tool for predicting subcellular localization based on amino acid composition [73] ; 2) SubLoc: a support vector machine based tool for predicting subcellular localization based on amino acid composition [74] ; and 3) TargetP: neural network based tool for large-scale subcellular localization prediction based on N-terminal sequence information [75] . All methods were run with default parameter settings.
Annotating localization based on homology. All proteins belonging to five entirely sequenced eukaryotic proteomes (Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans, and Arabidopsis thaliana) and all proteins in the SWISS-PROT database were aligned by pairwise BLAST to our data set of proteins with experimentally known localization ( Table 1 ). We measure sequence similarity by the scaled HSSP-distance, and considered only alignment pairs above the conservation threshold (scaled HSSP-distance=4). We estimated the accuracy of the annotation transfer by homology using the curves that were obtained for the different localizations ( Fig. 4 C). We annotated localization based on the known localization of the closest homologue. We annotated only those proteins for which localization could be inferred with greater than 70% accuracy ( Table 2 ).
Thanks to Dariusz Przybylski, Kazimierz Wrzeszczynski and Trevor Siggers (all Columbia University) for crucial discussions; to Jinfeng Liu (Columbia University) for providing genome data and computer assistance; to Cinque Soto and Yanay Ofran (both Columbia University) for helpful comments on the manuscript. Thanks to Astrid Reinhardt (Baylor College of Medicine, Texas), Tim Hubbard (Sanger Centre, Hinxton), Sujun Hua (Tsinghua University), Zhirong Hun (Tsinghua University), Olof Emanuelsson (Stockholm University), Henrik Nielsen (Technical University of Denmark), Søren Brunak (Technical University of Denmark) and Gunnar von Heijne (Stockholm University) for access to their prediction methods. The work of RN and BR was supported by the grants 1-P50-GM62413-01 and RO1-GM63029-01 from the National Institute of Health. Last, not least, thanks to all those who deposit their experimental data in public databases, and to those who maintain these databases.
| 1. | Webb, E. C. (1992). EnzymeNomenclature 1992. Recommendations of the Nomenclature committee of theInternational Union of Biochemistry and Molecular Biology. Academic Press, NewYork. |
| 2. | Altschul, S. F. & Gish, W.(1996). Local alignment statistics. Meth. Enzymol., 266, 460-480. |
| 3. | Ashburner, M. & Drysdale, R.(1994). FlyBase--the Drosophila genetic database. Development, 120, 2077-9. |
| 4. | Lewis, S., Ashburner, M. &Reese, M. G. (2000). Annotating eukaryote genomes. Curr Opin Struct Biol, 10,349-54. |
| 5. | Sander, C. & Schneider, R.(1991). Database of homology-derived structures and the structural meaning ofsequence alignment. Proteins, 9, 56-68. |
| 6. | Bairoch, A. & Apweiler, R.(2000). The SWISS-PROT protein sequence database and its supplement TrEMBL in2000. Nucl. Acids Res., 28, 45-48. |
| 7. | Bork, P. & Koonin, E. V. (1998).Predicting functions from protein sequences--where are the bottlenecks? NatGenet, 18, 313-8. |
| 8. | Rost, B. (1997). Protein structuressustain evolutionary drift. Folding & Design, 2, S19-S24. |
| 9. | Murzin, A. G. (1998). How fardivergent evolution goes in proteins. Curr Opin Struct Biol, 8, 380-7. |
| 10. | Rost, B. (1999). Twilight zone ofprotein sequence alignments. Protein Eng, 12, 85-94. |
| 11. | Yang, A. S. & Honig, B. (2000).An integrated approach to the analysis and modeling of protein sequences andstructures. III. A comparative study of sequence conservation in proteinstructural families using multiple structural alignments. J. Mol. Biol., 301,691-711. |
| 12. | Chothia, C. & Lesk, A. M.(1986). The relation between the divergence of sequence and structure inproteins. EMBO J., 5, 823-826. |
| 13. | Abagyan, R. A. & Batalov, S.(1997). Do aligned sequences share the same fold? J. Mol. Biol., 273, 355-368. |
| 14. | Alexandrov, N. N. & Soloveyev,V. V. (1998). Statistical significance of ungapped sequence alignments. InHICCS' 98: Pacific Symposium on Biocomputing' 98 (Altman, R. B., Dunker, A. K.,Hunter, L. & Klein, T. E., eds.), pp. 463-472, World Scientific, Maui,Hawaii, U.S.A.. |
| 15. | Brenner, S. E., Chothia, C. &Hubbard, T. J. P. (1998). Assessing sequence comparison methods with reliablestructurally identified distant evolutionary relationships. Proc. Natl. Acad.Sc. U.S.A., 95, 6073-6078. |
| 16. | Park, J., Karplus, K., Barrett, C.,Hughey, R., Haussler, D. et al. (1998). Sequence comparisons using multiplesequences detect three times as many remote homologues as pairwise methods. J.Mol. Biol., 284, 1201-1210. |
| 17. | Jaroszewski, L., Rychlewski, L.& Godzik, A. (2000). Improving the quality of twilight-zone alignments.Prot. Sci., 9, 1487-1496. |
| 18. | Blake, J. D. & Cohen, F. E.(2001). Pairwise sequence alignment below the twilight zone. J. Mol. Biol.,307, 721-735. |
| 19. | Teichmann, S., Park, J. &Chothia, C. (1998). Structural assignments to the Mycoplasma genitaliumproteins show extensive gene duplication and domain rearrangement. Proc. Natl.Acad. Sc. U.S.A., 14658-14663. |
| 20. | Teichmann, S. A., Chothia, C. &Gerstein, M. (1999). Advances in structural genomics. Curr. Opin. Str. Biol.,9, 390-399. |
| 21. | Liu, J. & Rost, B. (2001).Comparing function and structure between entire proteomes. Prot. Sci., 10,1970-1979. |
| 22. | Vitkup, D., Melamud, E., Moult, J.& Sander, C. (2001). Completeness in structural genomics. Nat Struct Biol,8, 559-66. |
| 23. | Liu, J. & Rost, B. (2002).Target space for structural genomics revisited. Bioinformatics, submitted. |
| 24. | Shah, I. & Hunter, L. (1997).Predicting enzyme function from sequence: a systematic appraisal. In FifthInternational Conference on Intelligent Systems for Molecular Biology(Gaasterland, T., Karp, P., Karplus, K., Ouzounis, C., Sander, C. et al.,eds.), pp. 276-283, AAAI Press, Halkidiki, Greece. |
| 25. | Devos, D. & Valencia, A.(2000). Practical limits of function prediction. Proteins, 41, 98-107. |
| 26. | Pawlowski, K., Jaroszewski, L.,Rychlewski, L. & Godzik, A. (2000). Sensitive sequence comparison asprotein function predictor. Pac Symp Biocomput, 8, 42-53. |
| 27. | Wilson, C. A., Kreychman, J. &Gerstein, M. (2000). Assessing annotation transfer for genomics: quantifyingthe relations between protein sequence, structure and function through traditionaland probabilistic scores. J. Mol. Biol., 297, 233-249. |
| 28. | Devos, D. & Valencia, A.(2001). Intrinsic errors in genome annotation. Trends in Genetics, 17, 429-431. |
| 29. | Todd, A. E., Orengo, C. A. &Thornton, J. M. (2001). Evolution of function in protein superfamilies, from astructural perspective. J. Mol. Biol., 307, 1113-1143. |
| 30. | Rost, B. (2002). Enzyme functionless conserved than anticipated. J. Mol. Biol., in press. |
| 31. | Thornton, J. M., Orengo, C. A.,Todd, A. E. & Pearl, F. M. (1999). Protein folds, functions and evolution.J. Mol. Biol., 293, 333-42. |
| 32. | Koonin, E. V., Bork, P. &Sander, C. (1994). Yeast chromosome III: new gene functions. EMBO J., 13,493-503. |
| 33. | Casari, G., Andrade, M. A., Bork,P., Boyle, J., Daruvar, A. et al. (1995). Challenging times for bioinformatics.Nature, 376, 647-648. |
| 34. | Ouzounis, C., Casari, G., Sander,C., Tamames, J. & Valencia, A. (1996). Computational comparisons of modelgenomes. TIBTECH, 14, 280-285. |
| 35. | Schneider, R., Casari, G., Antoine,d. D., Bremer, P., Schlenkrich, M. et al. (1997). GeneCrunch: Experiences onthe SGI POWER CHALLENGE array with bioinformatics applications. InSupercomputer 1996: Anwendungen, Architekturen, Trends eds.), pp. 109-119, K.G.Saur Verlag, . |
| 36. | Tamames, J., Ouzounis, C., Casari,G., Sander, C. & Valencia, A. (1998). EUCLID: automatic classification ofproteins in functional classes by their database annotations. Bioinformatics,14, 542-3. |
| 37. | Andrade, M. A., Brown, N. P.,Leroy, C., Hoersch, S., de Daruvar, A. et al. (1999). Automated genome sequenceanalysis and annotation. Bioinformatics, 15, 391-412. |
| 38. | Koonin, E. V. (2000). Bridging thegap between sequence and function. Trends Genet, 16, 16. |
| 39. | Eisenhaber, F. & Bork, P.(1998). Wanted: subcellular localization of proteins based on sequence. TrendsCell Biol, 8, 169-70. |
| 40. | Karp, P. D. (1998). What we do notknow about sequence analysis and sequence databases. Bioinformatics, 14, 753-4. |
| 41. | Doolittle, R. F. (1986). Of URFs andORFs: a primer on how to analyze derived amino acid sequences. UniversityScience Books, Mill Valley California. |
| 42. | Rost, B. (1998). Marrying structureand genomics. Structure, 6, 259-263. |
| 43. | Altschul, S. F., Gish, W., Miller,W., Myers, E. W. & Lipman, D. J. (1990). Basic local alignment search tool.J. Mol. Biol., 215, 403-410. |
| 44. | Altschul, S. F. (1993). A proteinalignment scoring system sensitive at all evolutionary distances. J. Mol.Evol., 36, 290-300. |
| 45. | Altschul, S., Madden, T., Shaffer,A., Zhang, J., Zhang, Z. et al. (1997). Gapped Blast and PSI-Blast: a newgeneration of protein database search programs. Nucl. Acids Res., 25,3389-3402. |
| 46. | Pearson, W. R. (1995). Comparisonof methods for searching protein sequenc databases. Prot. Sci., 4, 1145-1160. |
| 47. | Wood, T. C. & Pearson, W. R.(1999). Evolution of protein sequences and structures. J. Mol. Biol., 291,977-995. |
| 48. | Hegyi, H. & Gerstein, M.(1999). The relationship between protein structure and function: a comprehensivesurvey with application to the yeast genome. J. Mol. Biol., 288, 147-64. |
| 49. | Rychlewski, L., Zhang, B. &Godzik, A. (1999). Functional insights from structural predictions: analysis ofthe Escherichia coli genome. Prot. Sci., 8, 614-624. |
| 50. | Ferrigno, P. & Silver, P. A.(1999). Regulated nuclear localization of stress-responsive factors: how thenuclear trafficking of protein kinases and transcription factors contributes tocell survival. Oncogene, 18, 6129-34. |
| 51. | Faust, M. & Montenarh, M. (2000).Subcellular localization of protein kinase CK2. A key to its function? CellTissue Res, 301, 329-40. |
| 52. | Pearce, D. A. (2000). Localizationand processing of CLN3, the protein associated to Batten disease: where is itand what does it do? J Neurosci Res, 59, 19-23. |
| 53. | Andrade, M. A., O'Donoghue, S. I.& Rost, B. (1998). Adaptation of protein surfaces to subcellular location.J. Mol. Biol., 276, 517-25. |
| 54. | Liscovitch, M., Czarny, M., Fiucci,G., Lavie, Y. & Tang, X. (1999). Localization and possible functions ofphospholipase D isozymes. Biochim Biophys Acta, 1439, 245-63. |
| 55. | Sirover, M. A. (1999). New insightsinto an old protein: the functional diversity of mammalianglyceraldehyde-3-phosphate dehydrogenase. Biochim Biophys Acta, 1432, 159-84. |
| 56. | Claros, M. G., Brunak, S. & vonHeijne, G. (1997). Prediction of N-terminal protein sorting signals. Curr.Opin. Str. Biol., 7, 394-398. |
| 57. | Nielsen, H., Brunak, S. & vonHeijne, G. (1999). Machine learning approaches for the prediction of signalpeptides and other protein sorting signals. Prot. Engin., 12, 3-9. |
| 58. | Drawid, A. & Gerstein, M.(2000). A Bayesian system integrating expression data with sequence patternsfor localizing proteins: comprehensive application to the yeast genome. J. Mol.Biol., 301, 1059-75. |
| 59. | Nakai, K. (2001). Prediction of invivo fates of proteins in the era of genomics and proteomics. J. Struct. Biol.,134, 103-116. |
| 60. | Vogt, G., Etzold, T. & Argos,P. (1995). An assessment of amino acid exchange matrices in aligning proteinsequences: the twilight zone revisited. J. Mol. Biol., 249, 816-831. |
| 61. | Yang, A. S. & Honig, B. (2000).An integrated approach to the analysis and modeling of protein sequences andstructures. II. On the relationship between sequence and structural similarityfor proteins that are not obviously related in sequence. J. Mol. Biol., 301,679-689. |
| 62. | Cokol, M., Nair, R. & Rost, B.(2000). Finding nuclear localisation signals. EMBO Reports, 1, 411-415. |
| 63. | Reich, J. G. & Meiske, W.(1987). A simple statistical significance test of window scores in large dotmatrices obtained from protein or nucleic acid sequences. Comput. Appl.Biosci., 3, 25-30. |
| 64. | Mattaj, I. W. & Englmeier, L.(1998). Nucleocytoplasmic transport: the soluble phase. Annu Rev Biochem, 67,265-306. |
| 65. | Schwarz, E. & Neupert, W.(1994). Mitochondrial protein import: mechanisms, components and energetics.Biochim Biophys Acta, 1187, 270-4. |
| 66. | Schatz, G. & Dobberstein, B.(1996). Common principles of protein translocation across membranes. Science,271, 1519-26. |
| 67. | Bruce, B. D. (2000). Chloroplasttransit peptides: structure, function and evolution. Trends Cell Biol, 10,440-7. |
| 68. | Nielsen, H., Engelbrecht, J.,Brunak, S. & von Heijne, G. (1997). Identification of prokaryotic andeukaryotic signal peptides and prediction of their cleavage sites. Protein Eng,10, 1-6. |
| 69. | Rost, B. (2001). Protein secondarystructure prediction continues to rise. J. Struct. Biol., 134, 204-218. |
| 70. | Rost, B. & Eyrich, V. (2001).EVA: large-scale analysis of secondary structure prediction. Proteins, 45 Suppl5, S192-S199. |
| 71. | Hobohm, U., Scharf, M., Schneider,R. & Sander, C. (1992). Selection of representative protein data sets.Prot. Sci., 1, 409-17. |
| 72. | Przybylski, D. & Rost, B.(2002). Alignments grow, secondary structure prediction improves. Proteins, 46,195-205. |
| 73. | Reinhardt, A. & Hubbard, T.(1998). Using neural networks for prediction of the subcellular location ofproteins. Nucl. Acids Res., 26, 2230-2235. |
| 74. | Hua, S. & Sun, Z. (2001).Support vector machine approach for protein subcellular localizationprediction. Bioinformatics, 17, 721-8. |
| 75. | Emanuelsson, O., Nielsen, H.,Brunak, S. & von Heijne, G. (2000). Predicting subcellular localization ofproteins based on their N-terminal amino acid sequence. J. Mol. Biol., 300,1005-16. |
| Contact: rost@columbia.edu | Version: Sep 13, 2002 |