| Title: | Better prediction of sub-cellular localization by combining evolutionary and structural information |
| Author: | Rajesh Nair & Burkhard Rost |
| Quote: | Proteins, 2003, 53, 917-930 |
Better prediction of sub-cellular localization by combining evolutionary and structural information
| 1 | CUBIC, Dept. of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA |
| 2 | Columbia University Center for Computational Biology and Bioinformatics (C2B2), Russ Berrie Pavilion, 1150 St. Nicholas Avenue, New York, NY 10032, USA |
| 3 | North East Structural Genomics Consortium (NESG), Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA |
| 5 | Dept. of Physics, Columbia Univ., 538 West 120th Street, New York, NY 10027, USA |
| * | Corresponding author: email = cubic@cubic.bioc.columbia.edu URL http://cubic.bioc.columbia.edu/ Tel: +1-212-305-4018, fax: +1-212-305-7932 |
This article is published in (Proteins, issue, 2003 and pages) © copyright Proteins: Structure, Function, and Genetics Wiley (2003). Wiley is the only authorised source. All copying of this article including placing on another website requires the written permission of the copyright owner.
The native sub-cellular compartment of a protein is one aspect of its function. Thus, predicting localization is an important step toward predicting function. Short zip code-like sequence fragments regulate some of the shuttling between compartments. Cataloguing and predicting such motifs is the most accurate means of determining localization in silico. However, only a few motifs are currently known, and not all trafficking appears to be regulated in this way. The amino acid composition of a protein correlates with its localization. All general prediction methods have exploited this observation. Here, we explored the evolutionary information contained in multiple alignments and aspects of protein structure to predict localization in the absence of homology and targeting motifs. Our final system combined statistical rules and a variety of neural networks to achieve an overall four-state accuracy above 65%, a significant improvement over systems using only composition. The system was at its best for extra-cellular and nuclear proteins; it was significantly less accurate than TargetP for mitochondrial proteins. Interestingly, all methods that were developed on SWISS-PROT sequences failed grossly when fed with sequences from proteins of known structure taken from PDB. We therefore developed two separate systems: one for proteins of known structure and one for proteins of unknown structure. Finally, we applied the PDB-based system along with homology-based inferences and automatic text analysis to annotate all eukaryotic proteins in the PDB (http://cubic.bioc.columbia.edu/db/LOC3D). We imagine that this pilot method - certainly in combination with similar tools - may be valuable for target selection in structural genomics.
Key words: protein sub-cellular localization, protein structure, secondary structure, surface composition, sequence motifs, evolutionary profiles, neural network, bioinformatics, PDB, automatic genome annotation.
Abbreviations used: 1D structure, one-dimensional (e.g. sequence or string of secondary structure); 3D structure, three-dimensional co-ordinates of protein structure; ChloroP, prediction of proteins in the chloroplast [1] ; DSSP, program and database assigning secondary structure and solvent accessibility for proteins of known 3D structure [2] ; PDB, Protein Data Bank of experimentally determined 3D structures of proteins [3] ; PHD, Profile based neural networks for predicting secondary structure (PHDsec) [4, 5, 6] , and solvent accessibility (PHDacc) [7, 6] ; PredictNLS, prediction of nuclear proteins through nuclear localization signals [8, 9] ; HSSP, database of protein structure-sequence alignments [10] ; NNPSL, neural networks predicting localization [11] ; NLS, nuclear localization signal; SubLoc, support-vector machine-based prediction of localization [12] ; SignalP, neural network system predicting signal peptides [13, 14] ; SWISS-PROT, data base of protein sequences [15, 16] ; TargetP, combined method predicting chloroplast (ChloroP), extra-cellular (SignalP), and mitochondrial proteins [17] ; PSORT, knowledge-based expert system using amino acid composition and sequence motifs [18, 19, 20] ; TrEMBL, translation of the EMBL-nucleotide database coding DNA to protein sequences [15] .
Notations used: 'sequence-unique', we refer to sequence-unique sets as those in which no pair of proteins has more sequence similarity than a certain threshold (HSSP-distance < 10, eqn. 1 ).
Methods introduced here: LOC3Dnet, combination neural networks trained on PDB sequences using observed (LOC3DnetDSSP) or predicted (LOC3DnetPHD) 1D structure and evolutionary information from multiple alignments; LOC3D, localization prediction based on combination of four different methods; LOCnet, combination of neural networks trained on SWISS-PROT sequences using predicted 1D structure (PHD) and evolutionary information.
Sub-cellular localization important to elucidate protein function. Proteins must be localized in the same sub-cellular compartment to cooperate towards a common function. Therefore, experimentally unravelling the native compartment of a protein constitutes one step on the long way to determining its role. The explosion of sequence information through large-scale sequencing projects has widened the gap between the number of sequences deposited in public databases and the experimental characterisation of the corresponding proteins [21, 22, 23] . Using high-throughput methods of epitope-tagging and immuno-fluorescence analysis, Snyder et al. [24] have recently reported localization data for the entire proteome of Saccharomyces cerevisiae (baker's yeast). So far, the majority of large-scale experiments suggesting localization have been restricted to yeast. This is primarily due to the fidelity of homologous recombination in yeast and the concomitant ease with which integrated reporter gene fusions can be generated. In contrast, computational tools can provide fast and accurate localization predictions for any organism [25, 26] . In fact, predicting sub-cellular localization has become one of the central problems in bioinformatics [27, 28, 29] .
Inferring localization by homology relatively accurate but not always applicable. A variety of approaches have been used to classify proteins with respect to sub-cellular localization. One of the most reliable means is annotation transfer from homologues [30, 31, 32, 33] : If a protein of experimentally known localization is significantly similar in sequence to a query protein U, localization can be inferred for U. However, the level of 'significant sequence similarity' varies substantially between localizations, and is much higher than that required for correct inference of folds [30, 34, 35] . Thus, even when accepting many errors, less than 25% of the proteins in SWISS-PROT [15] can be classified by homology into one of ten compartments [35] . Using text analysis of SWISS-PROT keywords to infer localization, we can annotate sub-cellular localization for about 48% of all proteins in SWISS-PROT [36] .
Sequence motifs predict successfully for some compartments. Another way to predict localization is to identify local sequence motifs such as signal peptides [37, 13, 14, 17, 28, 29] or nuclear localization signals (NLS) [8, 28, 38, 39] . Proteins destined for the secretory pathway, the mitochondria and the chloroplast contain N-terminal targeting peptides that are recognised by the translocation machinery [40, 41] . Thus, prediction methods use only the N-terminal residues [13, 17] . Discriminant analysis has been applied to identify proteins imported into mitochondria [42] . Many proteins destined for the nucleus contain NLS motifs that may occur anywhere in the sequence [43, 44] . Recently, we have collected a data set of experimental and potential NLS motifs as an aid to predicting nuclear localization [8] ; some of these signals are also used for the export from the nucleus [8, 45] . However, the vast majority of proteins have no known motif. Furthermore, a particular problem for methods detecting N-terminal signals is that start-codons are predicted with less than 70% accuracy by genome projects [11] . Overall, known and predicted sequence motifs enable annotating about 30% of the proteins in six entirely sequenced eukaryotic proteomes [46, 47, 32] .
Ab initio methods predict localization for all proteins at lower accuracy. A third approach to predicting localization has been suggested by the observation that the overall amino acid composition correlates with the native compartment [48, 49, 50] . This observation has led to the development of a variety of prediction methods based solely on composition [51, 52, 53, 11, 20, 54] . Higher order correlations (residues i and (i+n), n=2,3,4) have been accounted for by using pseudo-amino acid composition [55, 56] . With the availability of many completely sequenced genomes, phylogenetic profiles have been employed to identify sub-cellular localization [57] . So far, this approach has been much less accurate than methods based solely on composition. PSORT II is a knowledge-based expert system that integrates rules based on amino acid composition with known sequence motifs [18, 19, 20] , and also uses other methods such as NNPSL [11] . Thus, the accuracy of PSORT II somehow depends on the accuracy of the underlying original methods. Drawid & Gerstein have proposed a Bayesian system based on a diverse range of 30 different features [58] . They applied their system to predicting localization of yeast proteins. Using a seven-fold jack-knife procedure on 1342 yeast proteins with known localization, they reported a prediction accuracy of 75% at 67% coverage [58] .
What determines sub-cellular localization? Experimental studies of protein targeting have shown that the sub-cellular localization of a protein is determined by its three-dimensional (3D) structure and/or the presence of local sequence motifs. Mutation studies have shown that disrupting the structure of a protein causes aberrant localization [59, 60, 61, 62] . One way of incorporating global information about protein structure into predictions of localization is by using secondary structure content. A number of local sequence motifs have been shown to mediate protein targeting [40, 37, 44] . Miguel Andrade (EMBL, Heidelberg), Sean O'Donoghue (LION, Heidelberg) et al. have previously concluded that the signal for sub-cellular localization is almost entirely due to the surface residues [30] . Here, we have utilised these aspects of protein structure information to aid predicting localization.
We describe LOC3Dnet and LOCnet ( Fig. 1 ), two systems of neural networks that sort proteins into one of four localization classes: extra-cellular, cytoplasmic, nuclear and mitochondrial. One (LOC3Dnet) is specialised on sequences from the PDB, the other (LOCnet) on sequences from SWISS-PROT. We excluded helical membrane proteins and used proteins from other minor compartments only as 'false positives'. In particular, proteins from the secretory pathway that are retained in the Golgi apparatus or the Endoplasmic reticulum were treated as 'non extra-cellular'. The method used 3+1 layers to make the final decision. The first layer consisted of four dedicated neural networks that used particular features from protein sequences, alignments, and structure to pre-sort proteins into L/not-L (L = cytoplasmic, nuclear, extra-cellular, mitochondrial). The second layer neural networks combined different input features. The third layer used a simple jury decision [63] to assign one of four localization-states to each protein. We applied the final system to predict the native sub-cellular compartment for all eukaryotic proteins in PDB. Toward this end, we added a fourth layer combining the information from PredictNLS [8] , sequence homology [35] , automatic text analysis [36] and the prediction system introduced here. The neural networks were trained and tested on PDB and SWISS-PROT sequences of experimentally annotated localizations. We distinguished the following input features: overall amino acid composition, the amino acid composition of surface residues, composition of three-state secondary structure (helix, strand, other), and the amino acid composition separated into three secondary structure states (helix, strand, other). We also predicted secondary structure and surface composition using PHDsec and PHDacc respectively [7, 6] . 
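The composition-based input features listed above can be illustrated with a minimal sketch. It assumes a protein sequence string, an aligned three-state secondary structure string (H = helix, E = strand, L = other) and a per-residue exposure flag; the function names, the one-letter state codes and the normalisation to fractions are illustrative assumptions, not a description of the original implementation.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
STATES = "HEL"  # helix, strand, other (assumed one-letter codes)


def composition(sequence):
    """Overall amino acid composition: 20 fractions (non-standard residues ignored)."""
    counts = [sequence.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


def surface_composition(sequence, exposed):
    """Composition restricted to residues flagged as exposed (20 values)."""
    surface = "".join(aa for aa, is_exposed in zip(sequence, exposed) if is_exposed)
    return composition(surface)


def secondary_structure_composition(states):
    """Fraction of residues in each of the three states (3 values)."""
    total = len(states)
    return [states.count(s) / total if total else 0.0 for s in STATES]


def composition_by_state(sequence, states):
    """Amino acid composition separated by state: 3 x 20 = 60 values."""
    if len(sequence) != len(states):
        raise ValueError("sequence and states must have equal length")
    counts = {(s, aa): 0 for s in STATES for aa in AMINO_ACIDS}
    for aa, s in zip(sequence, states):
        if (s, aa) in counts:  # skip non-standard residues or unknown states
            counts[(s, aa)] += 1
    total = sum(counts.values())
    return [counts[(s, aa)] / total if total else 0.0
            for s in STATES for aa in AMINO_ACIDS]
```

The vector lengths (20, 3 and 60) are in line with the 20-60 input units quoted for the first-level networks in Fig. 1.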
Since biased data sets tend to yield over-estimates in prediction accuracy [64] , we took great care to select sequence-unique subsets of the data to estimate performance. All results of the methods described here were based either on four-fold cross-validation experiments or on SWISS-PROT sequences that had no significant sequence similarity to any protein used for development. The LOC3D localization prediction server for protein structures and the results of our annotations for all eukaryotic chains in the PDB can be accessed through the web at http://c2b2.columbia.edu/db/LOC3D/.
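One common way to build a sequence-unique subset and to split it for four-fold cross-validation is sketched below. The greedy selection strategy, the interleaved fold assignment and the caller-supplied `hssp_distance` function are all assumptions made for illustration; the sketch only encodes the stated criterion that no retained pair may reach the similarity threshold (HSSP-distance < 10).

```python
def select_sequence_unique(proteins, hssp_distance, threshold=10.0):
    """Greedily keep proteins whose distance to every kept protein is below threshold.

    hssp_distance(p, q) is a hypothetical pairwise similarity function
    supplied by the caller; it is not defined here.
    """
    selected = []
    for p in proteins:
        if all(hssp_distance(p, q) < threshold for q in selected):
            selected.append(p)
    return selected


def k_fold_splits(items, k=4):
    """Yield (train, test) pairs for k-fold cross-validation."""
    folds = [items[i::k] for i in range(k)]  # interleaved assignment to folds
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

The outcome of a greedy selection depends on the order of the input list, so a fixed ordering is needed for reproducible subsets.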
Fig. 1: Final neural network architecture. For the first level of pairwise neural networks we used an architecture of 20-60 input units and 2 output units with a hidden layer consisting of 3-9 units. We used the output from the different first level pairwise neural networks as input to the second level integrating neural network. The second level pairwise networks consisted of 6 input units and 2 output units with a hidden layer consisting of 3 units. The final localization prediction was based on a jury decision of the outputs from the different pairwise integrating networks.
Structural information improves accuracy. The overall amino acid composition, the surface composition, the three-state secondary structure composition, and the combined sequence-structure composition all showed some correlation with sub-cellular localization in two dimensions (Appendix Fig. 5 ). The strongest signal was the residue composition separated by three secondary structure states (HEL, Appendix). We trained four different neural networks that were specialised to discriminate between one of four localizations (cytoplasmic, extra-cellular, nuclear, and mitochondrial, Table 1) and all others, i.e. each network had two output units. Then, we combined the outputs from each specialist through a statistical jury decision [63] to give the final four state prediction of localization ( eqn. 2 ). Networks based on the secondary structure state specific residue composition reached the highest accuracy (from 57% for sequence only to above 59% for secondary structure dependent composition, Fig. 2 A). Networks based on predicted surface ( Fig. 2 A) performed slightly worse than those based on the observed data (Fig. 2A). However, when using only the exposed surface predicted by PHD ( Fig. 2 A) prediction accuracy dropped below that obtained by sequence alone ( Fig. 2 A).
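The step from four two-output specialists to a single four-state prediction can be sketched as follows. Since eqn. 2 is not reproduced in this section, the scoring rule used here (mean difference between the L and not-L output units, followed by picking the class with the highest score) is an assumption chosen for illustration and may differ from the jury decision actually used.

```python
CLASSES = ("extra-cellular", "cytoplasmic", "nuclear", "mitochondrial")


def jury_decision(outputs):
    """Combine specialist outputs into one four-state prediction.

    outputs maps each class to a list of (out_L, out_notL) pairs,
    one pair per specialist network for that class.
    Returns (winning class, per-class scores).
    """
    scores = {}
    for cls, pairs in outputs.items():
        # Assumed score: average margin of the L unit over the not-L unit.
        scores[cls] = sum(o_l - o_not for o_l, o_not in pairs) / len(pairs)
    winner = max(scores, key=scores.get)
    return winner, scores
```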
Footnote: all estimates for performance reported here were obtained for the test sets or for data sets that had never been used for development. In particular, we never show results for the training sets, to avoid confusion.
| Method a | Extra-cellular | | | Cytoplasmic | | | Nuclear | | | Mitochondrial | | |
| | oL | s(oL) | MC | oL | s(oL) | MC | oL | s(oL) | MC | oL | s(oL) | MC |
| Composition only | 82 | 2.9 | 0.51 | 62 | 1.0 | 0.29 | 70 | 6.7 | 0.33 | 85 | 1.7 | 0.37 |
| Composition of surface DSSP | 79 | 5.3 | 0.47 | 57 | 7.6 | 0.22 | 70 | 7.9 | 0.35 | 73 | 5.2 | 0.23 |
| Composition of surface PHDacc | 76 | 3.6 | 0.42 | 62 | 2.5 | 0.23 | 69 | 3.5 | 0.28 | 65 | 1.4 | 0.20 |
| Composition by sec. str. DSSP | 84 | 3.0 | 0.57 | 61 | 3.2 | 0.22 | 72 | 3.0 | 0.40 | 78 | 3.4 | 0.31 |
| Composition by sec. str. PHDsec | 84 | 2.2 | 0.58 | 66 | 2.5 | 0.23 | 70 | 5.2 | 0.37 | 77 | 6.6 | 0.32 |
| Sum over DSSP networks | 83 | 3.3 | 0.57 | 59 | 2.2 | 0.24 | 73 | 6.5 | 0.41 | 77 | 5.7 | 0.31 |
| Sum over PHD networks | 83 | 2.9 | 0.56 | 65 | 2.7 | 0.28 | 71 | 5.4 | 0.38 | 77 | 7.0 | 0.31 |
| Network on DSSP networks | 81 | 5.0 | 0.54 | 62 | 0.9 | 0.26 | 72 | 7.6 | 0.39 | 77 | 5.1 | 0.30 |
| Network on PHD networks | 83 | 4.5 | 0.55 | 67 | 2.0 | 0.27 | 69 | 5.8 | 0.35 | 75 | 9.0 | 0.29 |
* Abbreviations used:
oL: percentage two-state accuracy on test set (Eqn. 7);
s(oL): standard deviation of oL for four-fold cross-validation;
MC: Matthews correlation coefficient [89] (Eqn. 10).
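For readers who want to recompute the MC column from raw counts, the sketch below uses the standard definition of the Matthews correlation coefficient for a two-state (L/not-L) prediction. Eqn. 10 is not reproduced in this section, so the assumption is that it matches this standard form; the function name and the convention of returning 0 for an undefined value are illustrative choices.

```python
import math


def matthews_correlation(tp, tn, fp, fn):
    """Matthews correlation coefficient from two-state confusion counts.

    tp/tn: correctly predicted L / not-L; fp/fn: over- / under-predictions.
    Returns 0.0 when the denominator vanishes (a convention assumed here).
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator
```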
Fig. 2: Structural and evolutionary information improves prediction accuracy. The curves show accuracy (Eqn. 5) for the pairwise prediction of all classes. (A) Single sequence information: Of the single pairwise first level neural networks, networks trained on amino acid composition separated into the three secondary structure states (HEL) were the most accurate. Here, DSSP represents the observed (from DSSP) and PHD the predicted (from PHD) three-state secondary structure. Both training and testing of the neural networks were performed on the observed or predicted features. Using the observed composition of surface residues from DSSP, prediction accuracy improved by up to a few percentage points over using sequence information alone. The three-state secondary structure predicted by PHDsec was accurate enough to preserve most of the gains in accuracy due to introduction of structural information. However, surface composition could not be predicted accurately enough: Using exposed surface predicted by PHDacc, prediction accuracy dropped below that obtained by using sequence alone. (B) Alignment information: Prediction accuracy increased on average by 2% for the profile based networks. The only exception was the networks based on surface composition for which there was no significant improvement in accuracy. Using sequence profiles as input, networks based on overall composition were more accurate than networks based on observed and predicted surface composition (Table 1 for estimates of standard deviations).
Evolutionary information improves by three percentage points. The advantages of using evolutionary information in the form of sequence profiles have been demonstrated for secondary structure prediction by a number of researchers [65, 4, 63, 5, 6, 66, 67, 68, 69, 70, 71] . Using sequence profiles as input (Methods), prediction accuracy increased by up to 3% ( Fig. 2 B) for pairwise networks based on overall sequence and secondary structure. The number of proteins in the alignments was on average similar to that obtained for all PDB proteins solved over the last two years (data not shown, [72, 73] ), in other words, the set of proteins that we used to evaluate performance did not stand out from what is typical for the PDB.
Nuclear and extra-cellular proteins predicted significantly better. The prediction accuracy for the different localization classes showed similar trends when using different sequence features as input to the neural network. Extra-cellular and nuclear localizations were predicted most accurately while the cytoplasmic and mitochondrial classes were predicted with a much lower accuracy (Fig. 3, Table 1).
Fig. 3: Pairwise first level neural networks accurate for some localizations. The accuracy versus coverage curve for the four pairwise neural networks trained on amino acid composition separated into the observed secondary structure states shows that extra-cellular and nuclear classes were predicted very accurately (above 80% accuracy at 75% coverage). However, prediction accuracy for the cytoplasmic and mitochondrial classes was much lower (accuracy less than 65% at 75% coverage). The standard deviation in prediction accuracy for each of the localization classes was roughly five percentage points. Networks trained on other composition features showed similar trends in prediction accuracy for the different localization classes (data not shown).
Second level combination of simple networks improves significantly. We obtained by far the best results for each of the four localizations using combinations of the previously developed networks. We tried two versions: Combine outputs from first level networks through a statistical jury decision, and feed first level network output into a second level network. Amongst the combined networks, using evolutionary information consistently improved over using single sequences by about two percentage points (Appendix Fig. 6 A+B). The combined networks tested on predicted surface and secondary structure were only slightly less accurate than those tested on the DSSP data. The accuracy of our final system was more than six percentage points higher than networks based only on amino acid composition. The profile-based combination networks performed best and were thus used to make the final localization predictions.
Over 73% accuracy for most reliably predicted 50% of all proteins. So far, we reported levels of accuracy valid when we forced a prediction for each protein. However, some of the predictions were 'stronger' than others. We translated this prediction strength into a reliability index ( eqn. 3 ) and investigated the dependency of prediction accuracy on this reliability ( Fig. 4 ). When we made predictions only for the most reliably predicted half of all proteins of known structure, prediction accuracy exceeded 75% for both the networks based on predicted and observed structural information ( Fig. 4 A). This corresponds approximately to a reliability index of 40 for both the profile based networks. Similarly, prediction accuracy reached about 73% for the most strongly predicted half of proteins of unknown structure; for this system, the scale was slightly different, and this point was reached at a reliability index of about 47 ( Fig. 4 B).
Fig. 4: Over 75% accuracy for the most reliably predicted half of all proteins. We converted the raw neural network output into a reliability index depending on the strength of the prediction. The left panel (A) shows the performance of the systems trained and tested on sequences from PDB, i.e. proteins of known structure, the right panel (B) gives the performance of the system trained and tested on sequences from SWISS-PROT, i.e. proteins of unknown structure. For example, at a reliability index of 40, the prediction accuracy was over 74% for the PDB specialists; more than half the proteins in the test set were predicted at this reliability level. While the scale was slightly different for the system trained and tested on SWISS-PROT sequences, the performance was similar: about half the proteins were predicted at a reliability index of 47, about 73% of the proteins predicted at this level were correctly predicted.
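The relation between reliability, accuracy and coverage shown in Fig. 4 can be sketched as below. Eqn. 3 is not reproduced in this section, so the form of the reliability index used here (the gap between the two strongest class scores, scaled to 0-100) is an assumption for illustration only, as are the function names.

```python
def reliability_index(scores):
    """Assumed reliability index: scaled gap between the two strongest scores (0-100)."""
    ranked = sorted(scores, reverse=True)
    return int(round(100 * (ranked[0] - ranked[1])))


def accuracy_at_threshold(predictions, threshold):
    """Accuracy and coverage when predicting only above a reliability threshold.

    predictions: list of (reliability_index, is_correct) pairs.
    Returns (accuracy in %, coverage in %).
    """
    kept = [is_correct for ri, is_correct in predictions if ri >= threshold]
    if not kept:
        return 0.0, 0.0
    accuracy = 100.0 * sum(kept) / len(kept)
    coverage = 100.0 * len(kept) / len(predictions)
    return accuracy, coverage
```

Sweeping the threshold from 0 upwards traces an accuracy-versus-coverage curve of the kind plotted in Fig. 3 and Fig. 4.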
Comparison to other methods. Two publicly available methods address a similar general-purpose prediction of localization without sequence motifs or homology, namely the neural network based program NNPSL [11] and the support vector machine-based program SubLoc [12] . We applied both methods to our test set of 359 sequence-unique eukaryotic PDB chains. Since some of these proteins were used for developing NNPSL and SubLoc, our test of these public methods may slightly over-estimate their performance. On the PDB data, SubLoc reached an overall four-state accuracy around 50%, NNPSL around 44% ( Table 2 ). Our networks that used amino acid composition alone reached a similar level (NetSeq in Table 2 ). Incorporating predicted surface, secondary structure and evolutionary information the final combination network performed significantly better (LOC3DnetPHD in Table 2 ). Our system trained on SWISS-PROT sequences (LOCnet) was conceptually identical to the one trained on PDB sequences (LOC3DnetPHD). Hence, we were surprised that it performed significantly worse (over ten percentage points reduction in Q4). This drop suggested that PDB sequences differ so significantly from SWISS-PROT sequences that we need a specialist to predict sub-cellular localization for PDB proteins. When comparing these methods on a data set of sequence-unique SWISS-PROT proteins of known localization that had been added between release 40 and 41 (and were neither used in our development nor in that of the other methods tested), we got different results ( Table 3 ): now SubLoc reached an overall four-state accuracy around 54%, NNPSL around 52%, and our system trained on SWISS-PROT sequences (LOCnet) clearly outperformed the system trained on PDB sequences (LOC3DnetPHD). Again, we observed that using all the information (predicted 1D structure and alignment profiles) yielded a sustained improvement around eight percentage points (NetSeq vs. LOCnet in Table 3 ). 
Our current system was only inferior to NNPSL and SubLoc for mitochondrial proteins. Comparing our system to methods that also utilise sequence motifs (PSORT II) or that specialise on particular general signals (TargetP), we confirmed that our method performed particularly poorly on mitochondrial proteins: TargetP performed clearly best for mitochondrial proteins. In contrast, it appeared that our system implicitly picked up the presence of signal peptides used in TargetP.
| Method | | Extra-cellular | | | Cytoplasmic | | | Nuclear | | | Mitochondrial | | |
| | Q4 | oL | pL | gAv | oL | pL | gAv | oL | pL | gAv | oL | pL | gAv |
| NNPSL | 43.7 | 70 | 52 | 0.61 | 47 | 51 | 0.50 | 28 | 63 | 0.42 | 30 | 8 | 0.16 |
| SubLoc | 50.1 | 51 | 56 | 0.54 | 67 | 47 | 0.56 | 45 | 61 | 0.53 | 43 | 22 | 0.31 |
| NetSeq | 57.4 | 65 | 69 | 0.67 | 74 | 45 | 0.58 | 53 | 71 | 0.62 | 21 | 25 | 0.23 |
| LOC3DnetDSSP | 65.5 | 78 | 78 | 0.78 | 55 | 52 | 0.54 | 76 | 73 | 0.75 | 43 | 34 | 0.39 |
| LOC3DnetPHD | 63.8 | 69 | 75 | 0.72 | 71 | 54 | 0.62 | 68 | 72 | 0.70 | 34 | 32 | 0.33 |
| LOCnet | 52 | 81 | 67 | 0.74 | 61 | 53 | 0.57 | 33 | 73 | 0.49 | 39 | 15 | 0.24 |
* Abbreviations used:
Methods: name of method/server used to predict localization (methods introduced here in italic); NNPSL: Neural network based prediction of sub-cellular localization [11]; SubLoc: Subcellular localization prediction using support-vector machines [12]; NetSeq: neural network trained only on amino acid composition of PDB sequences; LOC3DnetDSSP: network trained on PDB sequences using observed (DSSP) 1D structure and evolutionary profiles; LOC3DnetPHD: network trained on PDB sequences using predicted (PHD) 1D structure and evolutionary profiles; LOCnet: network trained on SWISS-PROT sequences using predicted 1D structure and evolutionary profiles.
Missing Methods:
Note that two general methods are missing in this table, namely PSORT II [18-20] and TargetP [17] since both explicitly use information about signal peptides that are usually not present in PDB sequences (Table 3 compares these two based on full SWISS-PROT sequences).
Classes:
As described in Methods all test proteins were experimentally annotated in SWISS-PROT as exclusively belonging to one of the four classes shown. (Note that extra-cellular excludes secreted proteins retained in interior compartments.)
Scores:
Q4: percentage four-state accuracy on test set (Eqn. 9); oL: two-state accuracy (correctly predicted as percentage of observed, Eqn. 7); pL: two-state specificity (correctly predicted as percentage of predicted, Eqn. 6); gAv: geometric average between oL and pL (Eqn. 8).
Significant differences:
For our networks the standard deviation in the four-state accuracy was about 5 percentage points. The following estimates for standard deviations were published: NNPSL [11], about 2.5 percentage points. No estimates of error have been provided for SubLoc [12]. The best methods for each class/score are marked in bold face. The standard deviations given above (assuming 2.5 percentage points for SubLoc as well) were used to mark indistinguishable methods.
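The scores defined in the notes above can be computed from paired lists of observed and predicted classes, as in the following sketch. Eqns. 6-9 are not reproduced in this section, so the code follows the verbal definitions only (oL as percentage of observed, pL as percentage of predicted, gAv as their geometric average expressed as a fraction, Q4 as the percentage of all proteins predicted correctly); the function name and the return layout are illustrative assumptions.

```python
import math


def four_state_scores(observed, predicted, classes):
    """Return (Q4, {class: (oL, pL, gAv)}) for paired class labels.

    oL and pL are percentages; gAv is reported as a fraction, matching
    the scale of the gAv columns in the tables.
    """
    n_correct = sum(o == p for o, p in zip(observed, predicted))
    q4 = 100.0 * n_correct / len(observed)
    per_class = {}
    for c in classes:
        n_obs = sum(o == c for o in observed)
        n_prd = sum(p == c for p in predicted)
        n_ok = sum(o == c and p == c for o, p in zip(observed, predicted))
        o_l = 100.0 * n_ok / n_obs if n_obs else 0.0
        p_l = 100.0 * n_ok / n_prd if n_prd else 0.0
        g_av = math.sqrt(o_l * p_l) / 100.0
        per_class[c] = (o_l, p_l, g_av)
    return q4, per_class
```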
| Method | | Extra-cellular | | | Cytoplasmic | | | Nuclear | | | Mitochondrial | | |
| | Q4 | oL | pL | gAv | oL | pL | gAv | oL | pL | gAv | oL | pL | gAv |
| NNPSL | 51.5 | 62 | 61 | 0.61 | 40 | 45 | 0.43 | 58 | 66 | 0.62 | 68 | 31 | 0.46 |
| SubLoc | 57.4 | 52 | 71 | 0.61 | 57 | 46 | 0.51 | 71 | 65 | 0.68 | 63 | 49 | 0.56 |
| PSORT II | 53.2 | 32 | 89 | 0.53 | 51 | 52 | 0.51 | 74 | 55 | 0.64 | 62 | 46 | 0.53 |
| TargetP | - | 77 | 77 | 0.77 | - | - | - | - | - | - | 78 | 57 | 0.67 |
| NetSeq | 56.2 | 70 | 74 | 0.72 | 55 | 44 | 0.49 | 68 | 65 | 0.66 | 28 | 29 | 0.29 |
| LOCnet | 64.2 | 86 | 76 | 0.81 | 56 | 54 | 0.54 | 73 | 71 | 0.72 | 53 | 45 | 0.49 |
| LOC3DnetPHD | 43.4 | 59 | 43 | 0.50 | 58 | 39 | 0.48 | 33 | 50 | 0.40 | 35 | 50 | 0.41 |
* Abbreviations used as in Table 2, with the following exceptions:
Data set: all sequence-unique proteins added between releases 40 and 41 of SWISS-PROT (Table 5).
Methods: Here the reference method NetSeq was a neural network trained only on the amino acid composition of SWISS-PROT sequences.
Additional methods: PSORT II: knowledge-based expert system using amino acid composition and sequence motifs [18, 19, 20] ; TargetP: combined method predicting extra-cellular (SignalP), chloroplast (ChloroP) and mitochondrial proteins [17] .
Significant differences:
For our networks the standard deviation in the four-state accuracy was about 5 percentage points. The following estimates for standard deviations were published: TargetP [17] about 1 percentage point, PSORT II [20] about 3.5 percentage points (see Table 2 for the other methods).
Predicting the localizations for all eukaryotic proteins in PDB. Finally, we annotated sub-cellular localization for all eukaryotic protein chains in PDB. The LOC3D system employed toward this end combined four different methods: (1) inferring nuclear localization based on the presence of NLS [8, 9] , (2) transferring experimental annotations from SWISS-PROT through sequence homology [35] , (3) inferring localization through automatic text-analysis of SWISS-PROT keywords [36] , and (4) predictions from the network-based system described here (LOC3DnetDSSP). The final annotation was based on the most accurate prediction from any of the four different methods (winner-take-all). Overall, transfer by homology accounted for 44% of the final annotations (Table 4). The other means of explicitly using experimental annotations (automatic text analysis of SWISS-PROT keywords) yielded another 37% of the annotations. Additionally, 130 PDB proteins contained nuclear localization signals [8, 9] . Thus, the success of the homology and motif-based methods left only 18% of the PDB proteins un-annotated. For about 40% of these the accuracy of LOC3DnetDSSP was above its average of 65%. Secreted proteins were predicted to be the most abundant class in PDB (Table 4). We made all predictions available on our web site ( http://cubic.bioc.columbia.edu/db/LOC3D/ ).
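The winner-take-all combination described above can be sketched as follows. The tuple layout, the use of `None` for methods that make no call, and the numeric confidence values in the usage example are hypothetical; the sketch only illustrates picking the annotation with the highest estimated accuracy among the methods that apply.

```python
def winner_take_all(candidates):
    """Pick the annotation with the highest estimated accuracy.

    candidates: list of (method, localization, confidence) tuples, where
    localization is None if the method makes no call for this chain.
    Returns the winning tuple, or None if no method applies.
    """
    usable = [c for c in candidates if c[1] is not None]
    if not usable:
        return None
    return max(usable, key=lambda c: c[2])


# Usage with made-up values for a single PDB chain:
example = [
    ("PredictNLS", None, 0),
    ("homology", "cytoplasmic", 95),
    ("keywords", "cytoplasmic", 80),
    ("LOC3DnetDSSP", "nuclear", 62),
]
best = winner_take_all(example)  # the homology-based annotation wins here
```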
| Confidence | Nprd | Phom | Pkwd | Pnet | Pnls | Pcyt | Pext | Pnuc | Pmit | Pother |
| 100 | 5015 | 67 | 31 | 0 | 2 | 23 | 49 | 10 | 9 | 9 |
| 95-99 | 182 | 99 | 1 | 0 | - | 57 | 2 | 40 | 1 | 2 |
| 90-94 | 674 | 18 | 52 | 31 | - | 6 | 60 | 23 | 5 | 6 |
| 85-89 | 296 | 26 | 39 | 35 | - | 23 | 45 | 20 | 4 | 7 |
| 80-84 | 589 | 7 | 81 | 12 | - | 16 | 69 | 9 | 2 | 3 |
| 75-79 | 566 | 4 | 82 | 15 | - | 45 | 28 | 9 | 17 | 1 |
| 70-74 | 359 | 1 | 78 | 21 | - | 25 | 26 | 17 | 16 | 16 |
| 65-69 | 118 | 69 | - | 31 | - | 71 | 2 | 17 | 10 | 0 |
| 60-64 | 195 | 5 | - | 95 | - | 14 | 23 | 14 | 49 | 0 |
| 55-59 | 236 | - | - | 100 | - | 8 | 13 | 7 | 71 | 0 |
| 50-54 | 94 | - | - | 100 | - | 64 | 14 | 7 | 15 | 0 |
| <50 | 469 | - | - | 100 | - | 70 | 9 | 3 | 18 | 0 |
| SUM | 8793 | 44 | 37 | 18 | 1 | 27 | 43 | 12 | 12 | 7 |
| | | Σ = 100% | | | | Σ = 100% | | | | |
* Abbreviations used:
Confidence: estimated annotation/prediction accuracy at this level of homology/prediction reliability;
Nprd: number of eukaryotic PDB chains predicted at this accuracy level;
Pnls: percentage of chains with nuclear localization signal (Nnls/Nprd);
Phom: percentage of chains annotated through homology (Nhom/Nprd);
Pkwd: percentage of chains for which localization could be inferred using text-analysis of SWISS-PROT keywords (Nkwd/Nprd);
Pnet: percentage of sequences predicted using LOC3DnetDSSP (Table 2), i.e. our final system for proteins of known structure;
Pcyt: percentage of sequences predicted cytoplasmic;
Pext: percentage of sequences predicted extra-cellular;
Pnuc: percentage of sequences predicted nuclear;
Pmit: percentage of sequences predicted mitochondrial;
Pother: percentage of sequences predicted in other localizations; the other localizations include: chloroplast, lysosome, peroxisome, endoplasmic reticulum, vacuoles, Golgi apparatus and periplasm.
'-': no predictions using this method at this confidence level.
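The winner-take-all combination of the four LOC3D methods described above can be illustrated with a minimal sketch. The `Annotation` type, its field names, and the example values are hypothetical and serve only to show the selection rule; the sketch assumes that each method reports an estimated accuracy (the 'Confidence' column of Table 4) alongside its predicted localization.

```python
from typing import NamedTuple, Optional, Sequence


class Annotation(NamedTuple):
    """One candidate annotation (hypothetical container, not from the paper)."""
    method: str        # e.g. "nls", "homology", "keywords", "network"
    localization: str  # e.g. "nuclear", "cytoplasmic"
    confidence: float  # estimated accuracy in percent


def winner_take_all(candidates: Sequence[Optional[Annotation]]) -> Optional[Annotation]:
    """Return the candidate with the highest estimated accuracy.

    Methods that could not annotate the chain are passed as None.
    """
    available = [c for c in candidates if c is not None]
    if not available:
        return None
    return max(available, key=lambda c: c.confidence)


# Toy example: homology is available and more reliable than the network.
example = [
    None,                                   # no NLS found
    Annotation("homology", "nuclear", 92.0),
    None,                                   # no informative keywords
    Annotation("network", "cytoplasmic", 61.0),
]
final = winner_take_all(example)
```

In this toy case the homology-based annotation is returned, because its estimated accuracy is the highest among the available candidates.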
Significant improvement through combining information. Our major finding was that integrating all sources of information, namely evolutionary information with overall, surface, and secondary structure compositions, yielded by far the best method to predict sub-cellular localization (Table 2, Appendix Fig. 6). Hence, all sources of information were crucial in combination. Nevertheless, we could clearly single out the following trends. First, networks using amino acid composition separated by secondary structure state gave the highest prediction accuracy (Fig. 2). Second, the accuracy of secondary structure predictions sufficed to significantly improve predictions of sub-cellular localization. Third, replacing single-sequence composition (Fig. 2A) by profile-composition (Fig. 2B) significantly improved prediction accuracy. The gain in accuracy was maximal (about three percentage points) for the profile-based combination networks that combined all information.
Significant improvement over existing methods for three compartments. Overall, our final prediction systems were significantly more accurate than other publicly available general-purpose methods (Table 2, Table 3). The only shortcoming of our methods was the relatively poor performance on mitochondrial proteins, for which TargetP [17] was significantly, and PSORT II [18, 19, 20] notably, better (Table 2). The difference between the performance of PDB-trained and SWISS-PROT-trained systems on mitochondria indicated that one reason for the poor performance was the lack of data. However, although our reference system (NetSeq in Table 3) was conceptually similar to NNPSL, it performed significantly worse for mitochondria. We are currently trying to improve this aspect of our method. The problem with mitochondrial proteins arises at the point of combining the pairwise (L/not-L) networks into a four-state prediction: our pairwise networks for mitochondria/other are reasonably accurate (data not shown).
The performance was slightly over-estimated for some methods. On the sequence-unique set of new SWISS-PROT proteins with experimental annotations of localization, we failed to fully verify the published levels of accuracy for some of the methods (Table 3). In particular, SubLoc [12] was estimated to achieve a level of 79% accuracy, while the method reached only about 58% on our data (Table 3). The difference may be explained by the fact that up to 90% pairwise sequence identity was allowed between testing and training set for the original publication of SubLoc. Cai et al. [74] also claim to reach a very high level of accuracy (73%). However, that value is not easy to compare. Firstly, only identical proteins were excluded between the training and testing sets; in other words, it is not clear how many of the proteins used for testing are close sequence homologues of the proteins used for training. Secondly, Cai et al. did not include mitochondrial proteins; instead, they included two other classes, namely plasma membrane and chloroplast. More recently this group published even higher estimates using similar data sets with unspecified sequence similarity between testing and training [75, 76]. Given the accuracy of our method, we imagine that it will be a good alternative to some methods for all four compartments that we target, and that it constitutes a reasonable complement for others like TargetP and PSORT II.
All methods not specialised on PDB sequences failed on these. Feeding sequences from the PDB directly into public prediction methods is very problematic. Indeed, our specialist for SWISS-PROT proteins (like all other methods tested, Table 2) performed significantly worse for sequences taken from the PDB and vice versa (Table 3). On the one hand, this implies that it is better to use a specialist system when we want to predict sub-cellular localization for proteins of known structure. On the other hand, this result may suggest that performance is also reduced for sequences from public genome sequencing efforts, as these may differ significantly from the well-characterised, functional sub-units of proteins deposited in SWISS-PROT. However, we currently have no handle on estimating such a potential error rate.
PDB annotations not representative of entire proteomes. For the majority of eukaryotic proteins in PDB (57%) the sub-cellular localization could be annotated with 100% accuracy through an appropriate parsing of the data and our automated text-analysis program [36]. This left us with 3778 proteins for which we could not annotate localization without errors. For the majority of these (59% = 2219 proteins), sub-cellular localization could be inferred most accurately through (1) known nuclear localization signals (NLS) [8, 9], (2) homology to proteins of experimentally known localization [35], and (3) text-analysis of SWISS-PROT keywords taken from homologues [36]. This large proportion of proteins for which localization can be annotated through homology (total of 82%, Table 4) is due to the significant amounts of experimental knowledge available for proteins of known structure. Previously, we annotated about 25% of the proteins in six entirely sequenced eukaryotes (human, mouse, fly, worm, weed, and yeast) through either sequence homology, text analysis of keywords, or sequence motifs [36, 33, 77]. Thus, the system introduced here will add annotations for far more proteins in entirely sequenced organisms than it did for the proteins of known structure. For PDB, our new method predicted the compartment for 1561 proteins; about 24% of these (382) were predicted at a reliability corresponding to >80% prediction accuracy (data not shown). Another aspect of the bias of PDB was the result that over 40% of the eukaryotic proteins of known structure appeared to be extra-cellular. For entire genomes, this number has previously been estimated to be at most half this size [29, 46].
Next step: annotate larger sequence databases and entire proteomes. In future work, we intend to investigate to what extent the GeneOntology database (GO [78]) adds experimental annotations about localization that are not in SWISS-PROT. Such information would be extremely valuable since neural networks are at their best when applied to large data sets. Support vector machines appear to perform better for small data sets [79]. Therefore, we intend to also apply these algorithms to the problem. Finally, preliminary results suggested that it might be possible to increase prediction accuracy by explicitly incorporating predictions from other methods such as SignalP/TargetP [13, 17] or PredictNLS [8] into our neural networks.
Good enough for annotating proteomes and for structural genomics? Our results for the SWISS-PROT based system might be valid for proteins from genome sequencing projects: about 64% of all proteins were correctly predicted by the profile-based networks using predicted surface and secondary structure. Although we hope to further improve this level of accuracy, we contend that the predictions are already good enough to become useful in the context of target selection for structural genomics [47] and to bridge the sequence-annotation gap in entirely sequenced genomes [80, 81, 82, 83, 22, 84, 46, 23].
Data sets used for development and evaluation. We selected all eukaryotic proteins with explicit annotations about sub-cellular localization in SWISS-PROT release 40 [15]. We excluded proteins annotated as MEMBRANE, POSSIBLE, PROBABLE, SPECIFIC PERIODS or BY SIMILARITY. We also excluded proteins annotated with multiple localizations. This left 8980 eukaryotic proteins in our SWISS-PROT data set of experimentally annotated localization ('trusted SWISS-PROT set', Table 5). Next, we assigned localization to PDB chains [85] by searching for homologues in the 'trusted SWISS-PROT set'. We transferred the annotated localization for all PDB chains that were aligned at HSSP-distances (eqn. 1) above 10 (number of PDB chains found given in Table 5). Above this homology threshold, sub-cellular localization annotation can be reliably transferred at over 90% accuracy [36, 35]. Training, test and validation sets were constructed such that no pair of proteins from any two sets had levels of sequence similarity above HSSP-distances of 5 (eqn. 1). We picked this value since, below this threshold, assigning sub-cellular localization based solely on homology leads to significant errors [35]. Furthermore, the test set was redundancy reduced at HSSP-distances <10 using a simple greedy search [86]. This ensured that no two proteins in the test set had greater than 40% sequence identity over more than 100 residues (number of sequence unique chains given in Table 5). The reason for this reduction was to find a balance between biased data known to yield over-estimates [87, 64] and data sets that were too small. Note that we did not have to define thresholds for significant sequence similarity between motifs, such as signal peptides [87], since we used the entire protein information. All non-eukaryotic proteins were also excluded for testing.
We identified eukaryotic proteins by using three methods: (1) PDB to SWISS-PROT links in the SWISS-PROT database, (2) the source and organism entry in PDB, and (3) checking whether the first hit was a eukaryotic protein when the chain was aligned against the SWISS-PROT database. Note: all data sets are available at: http://cubic.bioc.columbia.edu/results/2003/localization/.
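The redundancy reduction of the test set can be illustrated with a simple greedy pass. This is only a sketch under stated assumptions: the exact procedure of [86] is not reproduced here, the function name `greedy_unique_subset` and the callable `distance` are hypothetical, and `distance` is assumed to return the HSSP-distance between two proteins (a sketch of that measure accompanies eqn. 1).

```python
from typing import Callable, Hashable, List, Sequence


def greedy_unique_subset(
    ids: Sequence[Hashable],
    distance: Callable[[Hashable, Hashable], float],
    threshold: float = 10.0,
) -> List[Hashable]:
    """Keep a protein only if it is below `threshold` to every protein kept so far.

    The result depends on the input order, as for any single-pass greedy filter.
    """
    kept: List[Hashable] = []
    for pid in ids:
        if all(distance(pid, other) < threshold for other in kept):
            kept.append(pid)
    return kept


# Toy distances (hypothetical values): a and b are close homologues, c is unrelated.
toy = {
    frozenset(("a", "b")): 25.0,
    frozenset(("a", "c")): 2.0,
    frozenset(("b", "c")): 3.0,
}
unique = greedy_unique_subset(["a", "b", "c"], lambda x, y: toy[frozenset((x, y))])
```

Here `b` is dropped because it lies above the threshold with respect to `a`, while `c` is retained.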
| Sub-cellular localization | SWISS-PROT all | SWISS-PROT unique | SWISS-PROT new-unique | PDB all | PDB unique |
| Nucleus | 2673 | 556 | 178 | 769 | 124 |
| Cytoplasm | 2137 | 348 | 146 | 1958 | 94 |
| Extra-cellular space | 1936 | 361 | 128 | 1970 | 99 |
| Mitochondria | 914 | 200 | 60 | 504 | 23 |
| Lysosome | 117 | 28 | 7 | 111 | 6 |
| Endoplasmic reticulum | 112 | 15 | 13 | 55 | 3 |
| Golgi apparatus | 11 | 4 | 7 | 7 | 2 |
| SUM | 8976 | 1512 | 539 | 5906 | 359 |
# Datasets:
SWISS-PROT: number of all eukaryotic proteins with annotated experimentally determined sub-cellular localization taken from SWISS-PROT release 40 (Methods);
SWISS-PROT unique:
SWISS-PROT new unique: number of proteins in sequence-unique subset of all proteins found in SWISS-PROT release 41 and not in release 40 (chosen by same procedure as SWISS-PROT unique);
PDB: number of PDB chains that could be assigned the given localization (Methods);
PDB unique: sequence-unique subset of previous.
SWISS-PROT-new set used only for testing. After we completed the development of all our methods, we used an additional data set to re-examine performance, namely, we collected all proteins that had been added to SWISS-PROT between release 40 and 41 (labelled 'SWISS-PROT-new'). We filtered out all of these new proteins that had HSSP-distances >5 to any previously used protein and found the sequence-unique subset of the new proteins ( Table 5 ). We never used any of these proteins for development, and it is rather unlikely that the other methods tested used any of these ( Table 3 ).
HSSP-distance to measure pairwise sequence similarity. The HSSP-distance is defined as the distance from the HSSP threshold [34] ; it is given by:
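$$\text{HSSP-DISTANCE} = \text{PIDE} - \text{HSSP\_PIDE}(q)$$

$$\text{HSSP\_PIDE}(q) = q + \begin{cases} 100 & \text{for } L \le 11 \\ 480 \cdot L^{-0.32\,\left(1 + e^{-L/1000}\right)} & \text{for } 11 < L \le 450 \\ 19.5 & \text{for } L > 450 \end{cases} \qquad \text{(eqn. 1)}$$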
where L is the length of the alignment between two proteins, PIDE the percentage of pairwise identical residues, and HSSP_PIDE(q) the revised HSSP-threshold for the level q.
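As an illustration of eqn. 1, the following minimal sketch computes the HSSP-distance for a pairwise alignment. It assumes the length-dependent curve of the revised HSSP threshold [34] (100 for L ≤ 11, 480·L^(-0.32·(1+e^(-L/1000))) for 11 < L ≤ 450, and 19.5 for L > 450); the function names are hypothetical. Under this assumption, a distance of 10 at L = 100 corresponds to roughly 39% pairwise identity, in line with the '40% over more than 100 residues' quoted for the test set.

```python
import math


def hssp_pide(alignment_length: int, q: float = 0.0) -> float:
    """Revised HSSP threshold (percent identity) at level q for a given alignment length.

    The piecewise curve is an assumption taken from the revised HSSP threshold [34].
    """
    length = alignment_length
    if length <= 11:
        curve = 100.0
    elif length <= 450:
        curve = 480.0 * length ** (-0.32 * (1.0 + math.exp(-length / 1000.0)))
    else:
        curve = 19.5
    return q + curve


def hssp_distance(pide: float, alignment_length: int) -> float:
    """Distance of an alignment from the HSSP curve (q = 0), in percentage points."""
    return pide - hssp_pide(alignment_length, q=0.0)


# Example: 40% identity over 500 aligned residues lies 20.5 points above the curve.
d = hssp_distance(40.0, 500)
```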
Observed and predicted information about protein structure. The observed secondary structure was extracted from the DSSP assignments [2]. Exposed residue composition was calculated from the solvent accessible surface area [88] in the DSSP database [2]. Three-state secondary structure was predicted using PHDsec. We treated as exposed all residues that were predicted by PHDacc [6] to have relative solvent accessibility > 10%. We chose this threshold since it gave good prediction accuracy on a limited subset of the training sets.
Building profiles. Profile-based composition was calculated by aligning the sequences against the SWISS-PROT + TrEMBL database using the MaxHom dynamic programming algorithm [10]. The aligned sequences were filtered for redundancy at 95% pairwise sequence identity, i.e. pairs exceeding this limit were removed. We then included in the alignment only those proteins that were above an HSSP-distance of 5 and had a pairwise sequence identity above 50% with respect to the guide protein of known localization. These thresholds were found to be optimal on a limited subset of the training data. Finally, the profile composition was calculated by replacing each amino acid residue in the protein by the residue frequencies in the profile.
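The last step, replacing each residue by the residue frequencies observed at its alignment position and then averaging over positions, can be sketched as follows. This is a minimal illustration, not the original implementation: the function name is hypothetical, the input is assumed to be a gapped multiple alignment with the guide sequence first and '-' for gaps, and sequences are assumed to be unweighted.

```python
from typing import Dict, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def profile_composition(aligned_sequences: Sequence[str]) -> Dict[str, float]:
    """Average the per-column residue frequencies over all alignment columns.

    Gaps and non-standard residues are ignored; empty columns are skipped.
    """
    totals = dict.fromkeys(AMINO_ACIDS, 0.0)
    used_columns = 0
    for pos in range(len(aligned_sequences[0])):
        column = [seq[pos] for seq in aligned_sequences if seq[pos] in totals]
        if not column:
            continue
        used_columns += 1
        for residue in column:
            totals[residue] += 1.0 / len(column)  # column frequencies sum to one
    if used_columns == 0:
        return totals
    return {aa: totals[aa] / used_columns for aa in AMINO_ACIDS}


# Toy alignment: column 1 is all alanine, column 2 is half cysteine, half aspartate.
composition = profile_composition(["AC", "AD"])
```

For a single sequence without homologues this reduces to the ordinary amino acid composition.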
Cross-validation. We separated our data into three sets: training, validation and testing set. We then rotated through the sets such that each protein was used for testing exactly once. We never used any information from the test set to optimise parameters. In particular, we determined the number of hidden units based on the validation sets and did not change it when we rotated. We stopped training when the best classification was obtained on the validation set.
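One way to realise such a rotation is sketched below. The assignment of folds is an assumption made for illustration (the fold following the test fold serves as validation set, all remaining folds as training set), and the function name is hypothetical; the only property taken from the text is that every protein is tested exactly once and never contributes to training or validation in the same rotation.

```python
from typing import Iterator, List, Sequence, Tuple


def rotate_splits(folds: Sequence[Sequence[str]]) -> Iterator[Tuple[List[str], List[str], List[str]]]:
    """Yield (training, validation, test) for each rotation; needs at least three folds."""
    k = len(folds)
    for i in range(k):
        val_index = (i + 1) % k
        test = list(folds[i])
        validation = list(folds[val_index])
        training = [
            protein
            for j, fold in enumerate(folds)
            if j not in (i, val_index)
            for protein in fold
        ]
        yield training, validation, test


# Four folds of toy protein identifiers.
splits = list(rotate_splits([["p1", "p2"], ["p3", "p4"], ["p5", "p6"], ["p7", "p8"]]))
```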
Neural network training and architectures for PDB chains. We used three levels of networks (Fig. 1). First, we used a feed-forward neural network architecture [4] with one hidden layer and two output units, trained on class/non-class for each localization. Training was done with standard back-propagation including momentum term (details in [4]). The two output units represent different strengths of yes/no predictions for each localization class. Only localization classes with sufficient training examples in the PDB were considered. The neural networks were trained on PDB chains using overall amino acid composition, surface residue composition (both twenty input units), three-state secondary structure composition (three input units), and amino acid composition in the three secondary structure states (sixty input units) as input. We applied 'balanced training', i.e. examples belonging to the 'yes' and 'no' classes were alternately presented to the network during training. For networks with three and 20 input units a configuration with three hidden nodes was chosen, while for the network using amino acid composition in a secondary structure state as input, a configuration with 9 hidden nodes was chosen. For the second level, the different first level networks were combined using a jury decision (sum over all first level outputs) and combining neural networks (input first level output). The training, test and validation sets remained the same for the second level networks. The second level networks had 6 input units and three hidden units. In the third level the combination networks for the different localizations were combined in a jury to give the final four-state localization prediction.
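The sixty-unit input (amino acid composition in the three secondary structure states) can be illustrated with a short sketch. The state alphabet H/E/L, the normalisation by sequence length, and the function name are assumptions made for illustration; the text specifies only the number of input units.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
STATES = "HEL"  # helix, strand, other (assumed three-state alphabet)


def composition_by_state(sequence: str, sec_structure: str) -> List[float]:
    """Return 3 x 20 = 60 values: the fraction of residues of each type in each state.

    Values are normalised by sequence length (an assumption), so they sum to one
    when every residue and state symbol is recognised.
    """
    if len(sequence) != len(sec_structure):
        raise ValueError("sequence and secondary structure must have equal length")
    counts = {(state, aa): 0 for state in STATES for aa in AMINO_ACIDS}
    for aa, state in zip(sequence, sec_structure):
        if (state, aa) in counts:
            counts[(state, aa)] += 1
    n = len(sequence)
    return [counts[(state, aa)] / n for state in STATES for aa in AMINO_ACIDS]


# Toy example: two helical alanines followed by one cysteine in a strand.
features = composition_by_state("AAC", "HHE")
```

The first twenty values describe helix, the next twenty strand, and the last twenty the remaining state.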
Neural network training and architectures for SWISS-PROT proteins. We used basically the same architectures as for PDB chains, with two major differences. Firstly, we only trained and tested on predicted secondary structure and solvent accessibility (since structure is not known for most of these proteins). Secondly, we used additional input units for the final summary networks, namely each network 'saw' the composition of the 50 N-terminal (20 units) residues (for proteins shorter than this, we simply used the entire protein for both ends).
Final decision through simple winner-take-it-all on 2nd layer of networks. The second layer networks ( Fig. 6 ) all have two output units with the values:
out_L and out_¬L for L ∈ {cytoplasmic, extracellular, nuclear, mitochondrial}
We converted these values into probabilities:
Then we predicted the protein in the localization L' with:
( eqn. 2)
The strength of this prediction was measured using the reliability index RI:
( eqn. 3)
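The decision step can be illustrated with a hedged sketch. All formulas in it are assumptions made for illustration rather than the published equations: the two outputs of each network are normalised to a probability as out_L / (out_L + out_¬L), the predicted localization is the one with the highest probability (winner-take-all), and the reliability index is taken as the difference between the two highest probabilities scaled to integers 0 to 9. The function names are hypothetical.

```python
from typing import Dict, Tuple


def to_probability(out_yes: float, out_no: float) -> float:
    """Normalise the two output units of one network (assumed form)."""
    total = out_yes + out_no
    return out_yes / total if total > 0 else 0.5


def predict_localization(outputs: Dict[str, Tuple[float, float]]) -> Tuple[str, int]:
    """Winner-take-all over the per-localization networks.

    `outputs` maps each localization to its (out_L, out_notL) pair and must
    contain at least two localizations. Returns the winner and an assumed
    reliability index between 0 and 9.
    """
    probs = {loc: to_probability(yes, no) for loc, (yes, no) in outputs.items()}
    ranked = sorted(probs, key=probs.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    reliability = min(9, int(10 * (probs[best] - probs[runner_up])))
    return best, reliability


# Toy network outputs (hypothetical values).
winner, ri = predict_localization({
    "cytoplasmic": (3.0, 1.0),
    "extracellular": (1.0, 1.0),
    "nuclear": (1.0, 3.0),
    "mitochondrial": (1.0, 9.0),
})
```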
Evaluating performance. Four-fold cross-validation was applied to test the neural networks. As a simple measure for performance we used the percentage accuracy (Q, number of correctly predicted test proteins as percentage of total number of test proteins). The accuracy/specificity and coverage/sensitivity of the two-state networks were measured using four ratios derived from TP (number of proteins predicted to be in localization i and observed to be in localization i, the true positives), TN (number of proteins predicted not to be in localization i and observed to be so, the true negatives), FP (number of proteins predicted to be in localization i and observed not to be in i, the false positives) and FN (number of proteins predicted not to be in localization i and observed to be in i, the false negatives). We used:
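$$p_L = 100 \cdot \frac{TP_L}{TP_L + FP_L} \qquad \text{(eqn. 6)}$$

$$o_L = 100 \cdot \frac{TP_L}{TP_L + FN_L} \qquad \text{(eqn. 7)}$$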
In other words, pL are all correctly predicted in localization L as percentage of all predicted in L, and oL all correctly predicted as percentage of those observed in L. We combined these two numbers (pL and oL) through the geometric average:
$$\sqrt{p_L \cdot o_L} \qquad \text{(eqn. 8)}$$
The overall four-state accuracy was measured by the accuracy Q4:
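$$Q_4 = 100 \cdot \frac{\text{number of proteins correctly predicted in one of the four states}}{\text{number of test proteins}} \qquad \text{(eqn. 9)}$$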
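The performance measures can be computed as in the following minimal sketch, which follows the verbal definitions given above: pL is the percentage of correct predictions among all proteins predicted in L, oL the percentage among all proteins observed in L, and Q4 the percentage of test proteins predicted in the correct one of four states. Function names and the toy labels are hypothetical.

```python
import math
from typing import Sequence, Tuple


def two_state_measures(observed: Sequence[str], predicted: Sequence[str],
                       loc: str) -> Tuple[float, float, float]:
    """Return (p_L, o_L, geometric average) in percent for localization `loc`."""
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o == loc and p == loc)
    fp = sum(1 for o, p in pairs if o != loc and p == loc)
    fn = sum(1 for o, p in pairs if o == loc and p != loc)
    p_l = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    o_l = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    return p_l, o_l, math.sqrt(p_l * o_l)


def q4(observed: Sequence[str], predicted: Sequence[str]) -> float:
    """Overall accuracy: correctly predicted proteins as percentage of all test proteins."""
    correct = sum(1 for o, p in zip(observed, predicted) if o == p)
    return 100.0 * correct / len(observed)


# Toy test set of four proteins with one error.
obs = ["nuclear", "nuclear", "cytoplasmic", "extracellular"]
prd = ["nuclear", "cytoplasmic", "cytoplasmic", "extracellular"]
overall = q4(obs, prd)
p_nuc, o_nuc, g_nuc = two_state_measures(obs, prd, "nuclear")
```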
To determine the best two-st