| Title: | Annotating proteins from Endoplasmic Reticulum and Golgi apparatus in eukaryotic proteomes |
| Author: | Kazimierz O Wrzeszczynski & Burkhard Rost |
| 1 | CUBIC, Dept. of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA |
| 2 | Columbia University Center for Computational Biology and Bioinformatics (C2B2), Russ Berrie Pavilion, 1150 St. Nicholas Avenue, New York, NY 10032, USA |
| 3 | North East Structural Genomics Consortium (NESG), Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA |
| 3 | Integrated Program in Cellular, Molecular and Biophysical Studies, Columbia University, 630 West 168th Street, New York, NY 10032, USA |
| * | Corresponding author: cubic@cubic.bioc.columbia.edu URL http://cubic.bioc.columbia.edu/ Tel: +1-212-305-4018, fax: +1-212-305-7932 |
This article is published in (CMLS, issue, 2003 and pages) © copyright Cellular and Molecular Life Sciences, Birkhäuser (2003). OMJ is the only authorised source. All copying of this article including placing on another website requires the written permission of the copyright owner.
The sub-cellular localization of a native protein constitutes one coarse-grained aspect of its function. Transport between compartments is often regulated through short sequence motifs. Here, we analysed experimentally characterised ER/Golgi retrieval motifs and investigated the accuracy of homology-transfer. Only the C-terminal ER retrieval motifs KDEL, HDEL and AIAKE were sufficiently specific. However, even unspecific motifs may help, provided we know the probability for localization given this motif. We provided such estimates. We also rigorously estimated the accuracy and coverage for inferring ER and Golgi localization through homology-transfer by sequence similarity. In entire proteomes, we could thereby annotate 3304 ER (3182 membrane) and 1853 Golgi proteins (759 membrane). We identified another 5157 globular and 3941 membrane putative ER or Golgi proteins. Each experimental annotation yielded, on average, 1-3 high-accuracy and 5-6 low-accuracy homology-transfers in the six proteomes. These numbers will increase with each new experimental annotation.
Key words: endoplasmic reticulum, Golgi apparatus, genome sequence analysis, sub-cellular localization, protein sequence motifs.
| BIG | merger of three databases: PDB, Swiss-Prot and TrEMBL |
| C-term | carboxy-terminal, i.e. end of protein |
| ER | Endoplasmic reticulum |
| EVAL | BLAST expectation value |
| HVAL | HSSP-value, i.e. function correlating pairwise sequence identity and alignment length (eqn. 1) |
| ORF | open reading frame |
| PHDsec | Profile based neural network prediction of secondary structure [1, 2, 3] |
| PIDE | pairwise percentage of identical residues |
| PDB | database of protein structures [4] |
| Swiss-Prot | annotated protein sequence database [5] |
| TrEMBL | translated EMBL database of un-annotated protein sequences [5] |
Notation used: globular, all proteins that are neither integral membrane proteins nor attached to the membrane; proteome, all the proteins in an organism; retention/recycle/retrieval, strictly speaking, motifs such as KDEL have been shown to guarantee the retrieval or recycling of proteins back into the ER rather than their retention in the ER; trusted, a 'trusted data set' is a set for which the sub-cellular localization has been annotated experimentally; [XY], either amino acid X or Y at the given position.
Trafficking through eukaryotic cells. The major constituents of eukaryotic cells are: extra-cellular space, cytoplasm, nucleus, mitochondria, Golgi apparatus, endoplasmic reticulum, peroxisomes, and lysosomes. The native sub-cellular localization of a protein is assumed to be determined largely by a trafficking system that is reasonably well captured experimentally for some of the organelles [6, 7, 8, 9, 10, 11, 12]. The system has two main branches [13]. On the first branch, proteins are synthesised on cytoplasmic ribosomes, and from there can go to the nucleus, mitochondria or peroxisomes. The second branch leads from the ribosomes attached to the endoplasmic reticulum to the Golgi apparatus, then to lysosomes or secretory vesicles, and on to the extra-cellular space. At each branch point in the trafficking system, a 'decision is made': either retain the protein in the current compartment or transport it onward to the next. For many examples, we have experimental evidence that membrane transport complexes 'make these decisions' by recognising motifs on the proteins that are shuttled. The most comprehensively characterised branch point is the second one, leading to secretion [14, 15, 16, 17, 18]. Most proteins destined for this branch are assumed to have an N-terminal signal peptide that causes them to be transferred into the endoplasmic reticulum as they are being synthesised; most proteins lacking this signal are synthesised in the cytoplasm and follow the first branch of protein trafficking. Note that some proteins appear to be secreted through a different pathway and clearly lack signal peptides.
Protein sorting through the secretory pathway. The secretory pathway involves a complex protein transport system between its organelles that operates without significant loss of organelle-resident proteins. This highly selective process allows for the post-translational modification and maturation of newly synthesised proteins passing through the endoplasmic reticulum (ER) and Golgi apparatus while strictly sorting and retaining resident soluble and membrane-bound proteins [19, 20]. A small fraction of proteins undergo ER/Golgi-independent protein secretion. This process is performed through at least four distinct pathways under varying cellular conditions [21]. The common assumption is that proteins are kept within the ER, and to a much lesser extent within the Golgi apparatus (Golgi), through specific short peptides that act as signals for retention- and retrieval/recycle-mediated sorting mechanisms [22, 23, 24].
Technically, groups of a few, specific residues are referred to as sequence motifs. Secreted proteins usually contain an N-terminal signal peptide of 10-30 residues that is cleaved upon successful transport through the extra-cellular membranes in the ER [22]; these signal peptides have distinct sequence features and can therefore be predicted accurately for proteins of unknown localization [18, 25]. Conceptually, we can distinguish between three types of motifs that are recognised by transport proteins: (i) generic, sequence-consecutive motifs with common features such as cleaved signal peptides, (ii) specific, sequence-consecutive motifs like nuclear localization signals, and (iii) non-sequence-consecutive motifs recognisable only after the protein has folded. While binding motifs that require details of the three-dimensional, folded protein structure are common for all kinds of protein function, such as small molecule or enzyme binding, surprisingly few such cases have been implicated in the regulation of protein trafficking. One prominent exception is the mannose-6-phosphate (M6P) receptors that bind M6P-containing soluble acid hydrolases in the Golgi and transport them on to the endosomal-lysosomal system [26, 27]. Only the first type of motif for regulation (generic and sequence-consecutive) can currently be predicted, namely to identify proteins with signal peptides [18, 28], chloroplast transit peptides [29, 28], as well as peroxisomal [30] and mitochondrial targeting signals [31, 28]. For the second type of specific, sequence-consecutive motifs, all computational biology can do so far is to archive these in databases that can be queried to find unknown nuclear proteins [32, 33, 34, 35].
Computational analysis of ER and Golgi retrieval signals has been largely limited to PROSITE [36] and PSORT [37], both of which rely only on the classical ER and Golgi retrieval motifs (or conservative derivations of these classical motifs); few attempts analysed Golgi and ER proteins on the level of entirely sequenced proteomes [38]. Large-scale genomic consortia rely in part on valid function and sub-cellular localization information for their target selection decisions. Large-scale experimental efforts can adequately account for a proportion of a specific proteome [39, 40] but are often limited by the experimental design. Therefore, computational efforts are often needed to complete an entire proteomic analysis. Our lab has recently reported that sequence conservation established using the HSSP-distance value correlated well with sub-cellular localization [41]. We have further applied this technique specifically to the ER and Golgi organelles. Here, we analysed to what extent ER and Golgi proteins can be identified through such short sequence motifs and/or through sequence similarity to proteins known to reside in these two compartments. This analysis required three steps: (1) collect known signals from literature and databases, (2) build unbiased, trusted data sets of proteins experimentally known to reside in ER and Golgi, and (3) test specificity and accuracy of the signals found. (Note: we failed to uncover novel motifs through motif-finding algorithms.) The first part of our work explored the limits of how far we can reach when trying to predict ER and Golgi proteins from experimentally known and theoretically refined signals. Next, we established thresholds for significant sequence similarity, i.e. for when we can accurately infer ER and Golgi location through homology.
Finally, we applied our results to annotate ER and Golgi proteins in the proteomes of Saccharomyces cerevisiae (yeast), Drosophila melanogaster (fruit-fly), Caenorhabditis elegans (worm), Arabidopsis thaliana (weed), Homo sapiens (human), and Mus musculus (mouse).
Trusted data sets of proteins with known localization. We retrieved all proteins from Swiss-Prot [5] that had experimental annotations about sub-cellular localization, removing all with 'putatively known' localization that contained either 'PROBABLE', 'PUTATIVE', or 'BY SIMILARITY' as additional qualifiers in Swiss-Prot. We split these proteins into 'trusted ER/Golgi' and 'trusted non-ER/non-Golgi'. As another control data set, we also retrieved all non-eukaryotic Swiss-Prot proteins. The resulting trusted data sets contained 676 ER proteins, 131 lumenal ER proteins, 104 proteins with a [KH]DEL C-terminal motif, 545 ER membrane proteins, 312 trusted Golgi proteins, and 194 Golgi membrane proteins. A non-ER/non-Golgi set of 8417 localization annotated eukaryotic proteins was used to identify false positives (numbers summarised in Table in Supporting Online Material). Additional experimentally annotated yeast proteins were collected from the Yeast GFP Fusion Localization Database - http://yeastgfp.ucsf.edu. However, we considered only ORFs without the keyword 'Hypothetical Protein'. This increased the total number of ER proteins to 784 and the Golgi trusted set to 351. The trusted data sets can be obtained from 'ER-GolgiDB' at http://cubic.bioc.columbia.edu/db/ERGolgiDB.
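The filtering step described above can be sketched as follows. This is a minimal illustration, not the pipeline used for this work: it assumes each record is a hypothetical `(protein_id, annotation)` pair holding the free-text localization comment, and it assigns a protein to the ER/Golgi set by simple keyword matching, which is a simplification.

```python
# Qualifiers named in the text that mark a localization as not experimentally verified.
UNTRUSTED_QUALIFIERS = ("PROBABLE", "PUTATIVE", "BY SIMILARITY")


def is_trusted(annotation: str) -> bool:
    """Return True if the annotation carries none of the three qualifiers."""
    upper = annotation.upper()
    return not any(q in upper for q in UNTRUSTED_QUALIFIERS)


def split_trusted(records):
    """Split (protein_id, annotation) pairs into trusted ER/Golgi and trusted other.

    Assumption: keyword matching on the annotation text is enough to decide
    the compartment; real annotations may need more careful parsing.
    """
    er_golgi, other = [], []
    for protein_id, annotation in records:
        if not is_trusted(annotation):
            continue  # drop 'putatively known' localizations
        upper = annotation.upper()
        if "ENDOPLASMIC RETICULUM" in upper or "GOLGI" in upper:
            er_golgi.append(protein_id)
        else:
            other.append(protein_id)
    return er_golgi, other


# Hypothetical example records (identifiers and texts are made up).
example = [
    ("P1", "Endoplasmic reticulum lumen."),
    ("P2", "Golgi membrane (Probable)."),
    ("P3", "Nuclear."),
    ("P4", "Mitochondrial (By similarity)."),
]
trusted_er_golgi, trusted_other = split_trusted(example)
```

In this toy input only `P1` and `P3` survive the qualifier filter; `P2` and `P4` are dropped because their localization is only putative.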
Data sets for entire proteomes. The human proteins constituted the set currently available in the latest versions of Swiss-Prot (release 40) and TrEMBL (release 22) [5]; Drosophila melanogaster was obtained from http://www.fruitfly.org/ (release 2), Mus musculus was obtained from http://www.ensembl.org/Mus_musculus/ and Caenorhabditis elegans from http://www.sanger.ac.uk/Projects/C_elegans/wormpep/ (wormpep 65). All remaining proteomes (weed and yeast) were downloaded from ftp://ncbi.nih.gov/genbank/genomes/.
Aligning proteins. We aligned the trusted ER and Golgi proteins against all proteins of known localization using pairwise BLAST [42]. Next, we built PSI-BLAST profiles for all data sets using a filtered version of all currently known sequences with three iterations [43]. These profiles were then aligned against all proteins of known localization. After compiling the results for the sequence conservation (Fig. 1), we changed these profiles such that we only included homologues with HVALs ≥ 40 for ER and ≥ 20 for Golgi proteins. We based our identification of ER/Golgi proteins in entire proteomes on these three data sets: 'trusted', 'trusted families', and 'unique subset of trusted families'.
Scores for measuring sequence similarity. The simplest way to measure sequence similarity is percentage pairwise sequence identity (PIDE), i.e. the percentage of residues identical between two proteins (not counting gaps). Another measure is the statistical expectation value as reported by BLAST (EVAL, note: we typically report the logarithm of this value in our figures). As a third measure we used the HSSP-value (HVAL) [44, 45]:
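$$\mathrm{HVAL} = \mathrm{PIDE} - \begin{cases} 100 & \text{for } L \le 11 \\ 480 \cdot L^{-0.32 \cdot \left(1 + e^{-L/1000}\right)} & \text{for } 11 < L \le 450 \\ 19.5 & \text{for } L > 450 \end{cases} \qquad \text{(eqn. 1)}$$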
where L was the number of residues aligned between two proteins, and PIDE the percentage of pairwise identical residues. The HSSP-value reflects whether an alignment is above the HSSP-curve [44, 45] (HVAL>0) or below (HVAL<0). In the first case (HVAL>0), the HSSP-value can be seen as a degree of sequence proximity or similarity (the higher the value, the more similar the two proteins), whereas in the latter case (HVAL<0) it estimates the distance, or level of divergence, between two proteins (the more negative the value, the less similar the two proteins). An HSSP-value of 0 defines the line above which (almost) no two naturally evolved proteins differ grossly in their three-dimensional structures. To illustrate the curve: for alignment lengths around 100 residues, 33% pairwise sequence identity suffices to infer structure, above 250 residues 21% is significant, and below 11 residues even 100% identity is not enough to infer structural (or functional) similarity. Although the HSSP-curve was derived to describe structural similarity, we noted that it also constitutes a sensitive approach when distinguishing between proteins of similar and dissimilar enzymatic activity [46], between the largest four compartments (nucleus, extra-cellular space, cytoplasm and mitochondria) [41], and between proteins involved in cell-cycle control [47].
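A minimal sketch of how the HSSP-value could be computed from PIDE and the alignment length L. The piecewise curve and its constants (100, 480, 0.32, 19.5, and the length cut-offs 11 and 450) follow the form commonly given for the HSSP-curve in the cited literature [44, 45] and should be treated as assumptions here; the function names are hypothetical.

```python
import math


def hssp_threshold(length: int) -> float:
    """Percentage identity on the HSSP-curve for a given alignment length.

    Assumed piecewise form: 100 for very short alignments, a length-dependent
    power law in between, and a constant for long alignments.
    """
    if length <= 11:
        return 100.0
    if length <= 450:
        exponent = -0.32 * (1.0 + math.exp(-length / 1000.0))
        return 480.0 * length ** exponent
    return 19.5


def hssp_value(pide: float, length: int) -> float:
    """HVAL = PIDE minus the curve value; positive means above the curve."""
    return pide - hssp_threshold(length)


# Example: a 250-residue alignment at 61% identity lies roughly 40 points
# above the assumed curve.
hval = hssp_value(61.0, 250)
```

Under these assumptions the curve value at 250 aligned residues is about 21%, in line with the illustration in the text, and alignments of 11 residues or fewer can never reach HVAL>0.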
Sequence-unique subsets. We built sequence-unique subsets for all types of proteins under consideration to avoid bias that is likely to skew estimates for accuracy and coverage [46]. 'Sequence-unique' was defined such that no pair of proteins in the set had HVAL>0 (eqn. 1). Given an all-against-all pairwise alignment for the biased set, we simply used a greedy search to find the largest subset that fulfilled the above condition. (Note: a tool performing this type of reduction is available through the web [48].)
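One possible greedy reduction is sketched below. It is an illustration under stated assumptions rather than the tool referenced above: `similar_pairs` is a hypothetical list of protein pairs with HVAL>0 taken from the all-against-all alignment, and the heuristic (visit proteins with the fewest similar neighbours first) yields a large valid subset but does not guarantee the largest one.

```python
def sequence_unique_subset(proteins, similar_pairs):
    """Greedily pick proteins so that no two chosen proteins are similar.

    proteins:      iterable of protein identifiers
    similar_pairs: iterable of (a, b) pairs with HVAL > 0; both identifiers
                   must occur in `proteins`
    """
    proteins = list(proteins)
    neighbours = {p: set() for p in proteins}
    for a, b in similar_pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    chosen, chosen_set = [], set()
    # Proteins with few similar neighbours exclude the fewest others,
    # so they are considered first; ties are broken by identifier.
    for p in sorted(proteins, key=lambda p: (len(neighbours[p]), p)):
        if neighbours[p].isdisjoint(chosen_set):
            chosen.append(p)
            chosen_set.add(p)
    return chosen


# Toy example: B is similar to both A and C, D is similar to nothing.
subset = sequence_unique_subset(["A", "B", "C", "D"], [("A", "B"), ("B", "C")])
```

In the toy example the heuristic keeps A, C and D and drops B, which removes both similar pairs while discarding only one protein.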
Measuring accuracy and coverage. We used the following definition to measure accuracy/specificity:
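$$\mathrm{accuracy}(T) = 100 \cdot \frac{\text{number of true pairs found at threshold } T}{\text{number of all pairs found at threshold } T} \qquad \text{(eqn. 2)}$$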
with the thresholds given by either (1) percentage pairwise sequence identity (PIDE), (2) BLAST expectation values (EVAL), or (3) the HSSP-value (HVAL). We considered all pairs as 'true' that were experimentally found in the same sub-cellular compartment. In analogy, we used the following definition for coverage/sensitivity:
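$$\mathrm{coverage}(T) = 100 \cdot \frac{\text{number of true pairs found at threshold } T}{\text{number of all true pairs}} \qquad \text{(eqn. 3)}$$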
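The two measures can be illustrated with a short sketch. It assumes that accuracy is the percentage of pairs at or above the threshold that are 'true' and that coverage is the percentage of all true pairs recovered at that threshold; the input format (a list of hypothetical `(score, is_true)` tuples, with higher scores meaning more similar) is an assumption for illustration. For EVAL, where lower is better, the score would first have to be negated or log-transformed.

```python
def accuracy_coverage(hits, threshold):
    """Return (accuracy, coverage) in percent at a similarity threshold.

    hits: list of (score, is_true) tuples, one per aligned pair, where
          is_true says whether both proteins share the same compartment.
    """
    n_true_total = sum(1 for _, is_true in hits if is_true)
    above = [is_true for score, is_true in hits if score >= threshold]
    n_correct = sum(1 for is_true in above if is_true)

    accuracy = 100.0 * n_correct / len(above) if above else float("nan")
    coverage = 100.0 * n_correct / n_true_total if n_true_total else float("nan")
    return accuracy, coverage


# Toy data: three true pairs and two false pairs with made-up HVAL scores.
toy_hits = [(50, True), (45, True), (30, False), (10, True), (5, False)]
strict = accuracy_coverage(toy_hits, threshold=40)
lenient = accuracy_coverage(toy_hits, threshold=0)
```

The toy data show the usual trade-off: the strict threshold gives 100% accuracy but recovers only two of three true pairs, whereas the lenient threshold recovers all true pairs at 60% accuracy.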
Collecting retention and retrieval/recycle motifs from literature and databases. We retrieved experimentally annotated retention and recycle signals for ER and Golgi from the literature, Swiss-Prot [5] and PROSITE [49] . The resulting list was tiny in comparison to that obtained previously for nuclear localization signals [32, 35] . Supposedly, the reason is that many soluble and membrane proteins in the ER have rather specific retention signals ( Table 1 ). Predominantly, the C-terminal motifs KDEL, HDEL, and closely related derivatives have been experimentally related to the retrieval mechanism [19] . Other motifs implicated in ER and Golgi targeting include [20] : (1) the C-terminal motif HDEF in the Ca2+-binding protein Calumenin [50] , (2) the Di-lysine motif KK [51] , the Di-arginine motif RR [52] or RKR (RxR) [53] , (3) the tyrosine-based tetra-peptide motif Yxxh (where x can be any amino acid and h signifies a hydrophobic residue), predominately associated in vesicular traffic sorting mechanisms [54] , has also been shown as a localization motif as evident in YQRL of TGN38 [55] and for the retrieval of UCE [56] , (4) the Di-acidic ER-export motifs [DE] often associated with the Yxxh motif [57] , (5) the cytoplasmic tail FxFxD motif in DPAP-A necessary for retrieval back to the Golgi [58] , and (6) the targeting domain GRIP, found in peripheral Golgi membrane proteins [59, 60] . The only motifs that were previously available to automatic proteome searches were KDEL and HDEL, as well as some derivatives of these deposited in PROSITE [49] and PSORT [37] . For the Golgi apparatus, other than C-terminal YQRL also used by PSORT, there is currently no other specific sequence motif available for automatic database searches [24] .
| Sequence motif (1) | Total (N) | Eukaryotes (N) | Eukaryotes (%) | Non-Eukaryotes (N) | Non-Eukaryotes (%) | ER/Golgi (N) | ER/Golgi (%) | Non-ER/Non-Golgi (N) | Non-ER/Non-Golgi (%) | Non-Annotated (N) | Non-Annotated (%) |
| Endoplasmic reticulum (ER) (2) | |||||||||||
| KDEL-C-term | 67 | 60 | 90 | 7 | 10 | 60 | 90 | 0 | 0 | 0 | 0 |
| KDEL | 1201 | 636 | 53 | 565 | 47 | 76 | 6 | 560 | 47 | 230 | 41 |
| HDEL-C-term | 64 | 64 | 100 | 0 | 0 | 62 | 97 | 2 | 3 | 2 | 100 |
| HDEL | 498 | 261 | 52 | 237 | 48 | 68 | 14 | 193 | 38 | 121 | 63 |
| HDEF-C-term | 4 | 3 | 75 | 1 | 25 | 2 | 50 | 1 | 25 | 0 | 0 |
| HDEF | 91 | 50 | 55 | 41 | 45 | 2 | 2 | 48 | 53 | 28 | 58 |
| KKxx-C-term | 907 | 492 | 52 | 415 | 46 | 55 | 6 | 437 | 48 | 211 | 48 |
| KKxx-C-term (membrane protein subset) | 254 | 183 | 72 | 71 | 28 | 55 | 22 | 128 | 50 | 21 | 16 |
| KKxx | 57848 | 32493 | 56 | 25355 | 44 | 810 | 1 | 31683 | 55 | 15171 | 48 |
| KxKxx-C-term | 810 | 420 | 52 | 390 | 48 | 42 | 5 | 378 | 47 | 177 | 47 |
| KxKxx-C-term (membrane protein subset) | 230 | 139 | 60 | 91 | 40 | 42 | 18 | 97 | 42 | 25 | 26 |
| xxRR | 83869 | 39769 | 47 | 44100 | 53 | 1062 | 1 | 38707 | 46 | 16050 | 41 |
| KKFF-C-term | 8 | 5 | 63 | 3 | 37 | 3 | 38 | 3 | 25 | 2 | 67 |
| KKFF | 416 | 234 | 56 | 93 | 22 | 7 | 2 | 316 | 76 | 118 | 37 |
| KKAA-C-term | 29 | 7 | 24 | 22 | 76 | 5 | 17 | 2 | 7 | 0 | 0 |
| KKAA | 1639 | 824 | 50 | 815 | 50 | 40 | 2 | 784 | 48 | 267 | 34 |
| AIAKE-C-term | 10 | 10 | 100 | 0 | 0 | 10 | 100 | 0 | 0 | 0 | 0 |
| AIAKE | 161 | 55 | 34 | 106 | 66 | 11 | 7 | 44 | 27 | 11 | 25 |
| CRAR | 199 | 127 | 64 | 72 | 36 | 0 | 0 | 127 | 64 | 42 | 33 |
| Golgi apparatus (3) | |||||||||||
| YQRL | 442 | 212 | 48 | 230 | 52 | 10 | 2 | 202 | 46 | 83 | 41 |
| YKGL | 632 | 304 | 48 | 328 | 52 | 5 | 1 | 299 | 47 | 143 | 48 |
| YHPL | 150 | 70 | 47 | 80 | 53 | 7 | 5 | 65 | 43 | 29 | 45 |
| Yxxh | 135637 | 62800 | 46 | 72837 | 54 | 859 | 1 | 62941 | 45 | 27729 | 44 |
| NPFKD | 17 | 13 | 76 | 4 | 24 | 0 | 0 | 13 | 76 | 8 | 62 |
| FxFxD | 4971 | 2513 | 51 | 2458 | 49 | 67 | 1 | 2446 | 49 | 1101 | 45 |
| FQFND | 7 | 4 | 57 | 3 | 43 | 3 | 43 | 1 | 14 | 1 | 100 |
| PxPxP | 8856 | 2766 | 31 | 4023 | 45 | 139 | 2 | 4694 | 53 | 3088 | 66 |
| [DE] | 131139 | 59784 | 46 | 71355 | 54 | 834 | 1 | 58941 | 45 | 25843 | 44 |
| GRIP-motif (5) | 11 | 11 | 100 | 0 | 0 | 10 | 90 | 1 | 10 | 1 | 100 |
| GRIP-motif (shortened) (6) | 58 | 32 | 55 | 24 | 41 | 10 | 18 | 24 | 41 | 11 | 46 |
| C-term variations (4) | |||||||||||
| PROSITE Pattern (7) | 232 | 197 | 85 | 35 | 15 | 167 | 72 | 30 | 13 | 13 | 43 |
| [KH]DEL | 131 | 124 | 95 | 7 | 5 | 122 | 93 | 2 | 2 | 2 | 100 |
| [KHR][DENQ]EL | 203 | 174 | 86 | 29 | 14 | 157 | 77 | 17 | 8 | 9 | 52 |
| [KHR][DENQ] [87] L | 230 | 187 | 81 | 43 | 19 | 159 | 72 | 28 | 1 | 13 | 46 |
| [KHRDENQAS] [DENQIYCV] [DENQ]L | 696 | 428 | 61 | 268 | 39 | 193 | 28 | 235 | 33 | 107 | 45 |
| [KRDEAVYF][KRDEVYFMQ] [KHED][DK]EL | 80 | 59 | 74 | 21 | 26 | 50 | 63 | 9 | 11 | 5 | 55 |
Columns: Total: number of proteins found in Swiss-Prot that have the respective motif; N: number of proteins found in subset; %: percentage of proteins in subset (column 'Total' gives 100%); Eukaryotes: all eukaryotic proteins; Non-Eukaryotes: since only eukaryotes have ER and Golgi, this column estimates a lower bound for the false positives; ER/Golgi: subset of eukaryotic proteins that have the respective motif and are experimentally known to be in either ER (for ER motifs) or Golgi (for Golgi motifs), this column gives a lower bound for the true positives (percentage is based on total number); Non-ER/Non-Golgi: subset of eukaryotic proteins that have the respective motif and either are experimentally known to be neither in ER nor in Golgi (percentage is based on total number) or do not contain any sub-cellular localization information in the Swiss-Prot database; Non-Annotated: subset of Non-ER/Non-Golgi that does not contain any localization information in Swiss-Prot. The Non-Eukaryotes and Non-ER/Non-Golgi columns together provide the total false-positive (FP) percentage. The total numbers were: ER = 1060, of which 324 (30%) were annotated as 'Probable, Putative or By Similarity' and 72 (7%) were Viral/Prokaryotic/Archaea; Golgi = 495, of which 163 (33%) were annotated as 'Probable, Putative or By Similarity' and 9 (2%) were Viral/Prokaryotic/Archaea.
1 'C-term' indicates the carboxy-terminal (last) residue of the protein; motifs are given by the one-letter code of the respective amino acids with the following conventions: [AG] means either A or G, 'x' stands for 'any' amino acid, 'h' stands for any hydrophobic amino acid.
2 Source of ER motifs: KDEL [19] , HDEL [19] , KKxx [51] [88] , xxRR [52] , KKFF [89] , KKAA [90, 91] , HDEF [50] , AIAKE [92] , CRAR [93] .
3 Source of Golgi motifs: YQRL [94], YKGL [95], YHPL [56], Yxxh [55], NPFKD [56], FxFxD [58], FQFND [58], PxPxP [96], [DE] [57], GRIP-motif [59, 60].
4 C-term variations: most of these motifs were compiled for this work.
5 The consensus pattern of the GRIP-motif is described by:
[DEA]Y[LIT][KR][KHN][VI][VILF]XX[YF][MIL].
6 Shortened derivative of GRIP-motif: [DEA]Y[LIT][KR][KHN][VI][VILF]
7 ER retrieval motif found in PROSITE: [KHRQSA][DENQ]EL [36] .
Validating motifs against databases. For each motif found (Table 1), we retrieved all proteins with this motif deposited in Swiss-Prot [5], TrEMBL [5], and PDB [4]. Next, we extracted a subset of proteins annotated in Swiss-Prot by their experimentally known sub-cellular localization. This subset, along with a grouping of all Swiss-Prot species into eukaryotes and non-eukaryotes, provided two means of assessing the specificity/accuracy of a given motif. The most specific ER motifs were KDEL and HDEL when restricted to the carboxy-terminus (Table 1). These two retrieved 131 proteins from Swiss-Prot, most of which have already been experimentally characterised as 'retained in the ER' (data not shown). While the KDEL motif was also present in a few non-eukaryotic proteins, the HDEL motif was found in only two eukaryotic non-ER proteins, and those two were orthologues of the protein 'Protein Kinase C Substrate' in bovine and human (g19p_human and g19p_bovin). Whereas this identification of ER and Golgi localization from such motifs clearly seems very reliable, this finding illustrated the other problem of these two motifs: they occur frequently in non-ER proteins at positions other than the C-termini. In other words, in order to rely on KDEL/HDEL to infer localization, we must know the C-terminus of the full-length protein. All other ER motifs published were either very unspecific (found in many non-ER proteins) or far too specific (found in very few ER protein families), or both. For example, the Di-lysine (KKxx and KxKxx) motif retrieved all known ER proteins when located at the C-terminal position of membrane proteins; however, this included a set of 128 proteins (KKxx) and 378 proteins (KxKxx), most of which could not be classified as ER proteins (Table 1).
When this motif (and the more difficult to distinguish N-terminal Di-arginine motif, xxRR) was applied to a non-membrane subset of proteins and, more significantly, when it was not restricted to the termini, the high sensitivity came at the cost of an extremely low specificity: both motifs were found in most non-ER proteins. In fact, over 80% of the matches were wrong. Overall, the information contained in the published Golgi motifs was even less promising. For example, the most sensitive GRIP-motif [59, 60] was found in 11 proteins, mainly orthologues of each other. A generalised GRIP-motif matched in slightly more proteins, none from the Golgi, and many from non-eukaryotic proteins. Similarly, Yxxh matched in most known Golgi proteins; however, it also matched almost the entire Swiss-Prot database. Obviously, only the C-terminal motifs KDEL, HDEL, and AIAKE suffice to accurately annotate ER proteins. All other experimentally characterised retention and recycle motifs for ER and Golgi need to be combined with other means of annotation.
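The difference between a motif anywhere in the sequence and the same motif anchored at the C-terminus can be made concrete with a small sketch that turns the motif notation of Table 1 into regular expressions. The function names are hypothetical, and the residue set used for 'h' (hydrophobic) is an assumption, since the text does not enumerate it.

```python
import re

# Assumed set of hydrophobic residues for the 'h' placeholder.
HYDROPHOBIC = "AVLIMFWC"


def motif_to_regex(motif, c_terminal=False):
    """Translate Table-1-style notation into a compiled regular expression.

    'x' or 'X' = any residue, 'h' = hydrophobic residue, '[AG]' = A or G.
    With c_terminal=True the motif must end at the last residue.
    """
    parts = []
    i = 0
    while i < len(motif):
        ch = motif[i]
        if ch == "[":
            end = motif.index("]", i)
            parts.append(motif[i:end + 1])  # character class is valid regex as-is
            i = end + 1
            continue
        if ch in "xX":
            parts.append(".")
        elif ch == "h":
            parts.append("[" + HYDROPHOBIC + "]")
        else:
            parts.append(re.escape(ch))
        i += 1
    pattern = "".join(parts) + (r"\Z" if c_terminal else "")
    return re.compile(pattern)


def has_motif(sequence, motif, c_terminal=False):
    """Return True if the (upper-case, one-letter) sequence contains the motif."""
    return motif_to_regex(motif, c_terminal).search(sequence) is not None


# Made-up sequences: KDEL at the C-terminus versus KDEL in the interior.
at_end = has_motif("MKLLAAKDEL", "KDEL", c_terminal=True)
internal_only = has_motif("MKDELAAGG", "KDEL", c_terminal=True)
```

Anchoring with `c_terminal=True` is what separates the specific 'KDEL-C-term' rows of Table 1 from the unspecific 'KDEL' rows; the same check only works on full-length sequences, as noted above.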
We explored the power of using sequence similarity for the entire proteins to identify ER and Golgi proteins. Toward this end we had (1) to establish thresholds for sequence similarity that enable accurate inference by homology, and (2) to build family profiles of known ER/Golgi proteins. The final 'prediction step' requires searching with a query protein of unknown localization against these family profiles. We could have simplified this final step by aligning all query proteins against the known ER/Golgi proteins. However, sequence-profile alignments are more sensitive and more specific than sequence-sequence alignments. Note that we looked for similarities over the entire proteins, rather than for similarities between short signal peptides [61] .
ER and Golgi proteins correctly detected by homology at high levels of similarity. We aligned all experimentally known ER and Golgi (true positives) and all known non-ER and non-Golgi proteins (true negatives) by pairwise BLAST [62, 42] and by the more powerful PSI-BLAST [63] (Methods); alignments were ranked by expectation values [62] (EVAL), percentage pairwise sequence identity (PIDE), and the HSSP-value (HVAL, eqn. 1). At HVAL=0, the accuracy for homology inference was 65% (Fig. 1 top); it increased to 98% at HVAL>40. The majority of false positives (non-ER proteins) at high HSSP-values were of two specific types: heat shock protein 70 and elongation factor alpha. These two large families are not exclusive to the ER; rather, they are also abundant in other cellular compartments. They caused the transition between the regions of 'mostly incorrect inference' (HVAL<20) and 'mostly correct inference' (HVAL>40) to be more gradual for ER than for Golgi proteins. The accuracy for Golgi proteins (Fig. 1 bottom) was slightly higher than that for the ER proteins: 98% accuracy was reached at HVAL>20. We also investigated the effect of database bias [46], confirming that biased data sets (incorrectly) suggested much higher levels of accuracy at all thresholds (data not shown). At high levels of accuracy, the coverage versus accuracy curve was slightly higher for HSSP-values than for expectation values (data not shown). Thus, we relied on the HSSP-value for the annotations of entire proteomes.
Fig. 1 : Sequence conservation for Endoplasmic Reticulum (ER) and Golgi apparatus.
We aligned all experimentally annotated, sequence-unique ER and Golgi proteins (ER-top graphs, Golgi-bottom graphs) against all true negatives using BLAST (squares) and PSI-BLAST (circles). Solid lines with filled symbols describe cumulative accuracy/specificity (percentage of correctly identified localized proteins at given threshold, eqn. 2).
Fig. 2 : Sequence conservation for ER and Golgi membrane proteins.
We aligned all sequence-unique ER (top graphs) and Golgi (bottom) membrane proteins against all non-ER/non-Golgi proteins (triangles) and against all non-ER/non-Golgi membrane proteins (diamonds). Solid lines with filled symbols describe cumulative accuracy/specificity (eqn. 2); dotted lines with open symbols describe cumulative coverage/selectivity (eqn. 3). We measured sequence similarity (A) by the HSSP-value (eqn. 1, left graphs), and (B) by the logarithm of the BLAST E-values (right graphs).
Identifying ER and Golgi proteins in six proteomes. We aimed at annotating as many ER/Golgi proteins as possible through homology and retention and recycle signals in six entirely sequenced eukaryotes (yeast, weed, worm, fly, mouse, and human). Swiss-Prot currently annotates 257 ER and 204 Golgi proteins in these six proteomes (Table 2, column labelled 'ER(Golgi)-trusted'). Alignments using the trusted data sets added 718 potential ER and 800 potential Golgi proteins (Table 2, rows labelled by an HVAL corresponding to 98% accuracy). 41 of these proteins were previously annotated as 'Hypothetical protein'. Swiss-Prot also contains annotations for localization based on sequence similarity to proteins of experimentally known localization. In order to establish how many of the proteins identified by our homology-inference were also annotated by Swiss-Prot, we identified the closest Swiss-Prot homologue for each protein in any of the six proteomes from the PEP database (Predictions for Entire Proteomes) [64]. This revealed that most of our annotations corresponding to 98% accuracy were also annotated by Swiss-Prot as either 'probable', 'putative' or 'by similarity'. In contrast, most putative annotations according to sequence similarity thresholds that correspond to 75% accuracy are not annotated as ER/Golgi by Swiss-Prot. At this threshold, we could propose another 3304 possible ER and 1853 possible Golgi proteins (Table 2, rows labelled by '75%' and '78%' for ER and Golgi, respectively). While we expect that a majority of these annotations are likely to be false, these subsets constitute a good 'hunting-ground' for discovery of uncharacterised ER and Golgi proteins. Overall, each experimental annotation in our trusted set yielded about 1-3 (lower value for ER, higher for Golgi) homology-transfers at high accuracy (98%) and about 5-6 at low accuracy (>75%). The entire set of results is publicly available at http://cubic.bioc.columbia.edu.
Identifying ER and Golgi membrane proteins. We found a total of 3941 putative ER and Golgi membrane proteins in the six proteomes at a threshold corresponding to 75% accuracy. In most proteomes we could expand reliable annotations (98% accuracy threshold) for ER membrane proteins between 2.5-fold (human) and 8-fold (weed). At the same accuracy threshold, we also identified ER membrane proteins in worm, for which our initial trusted set contained no ER membrane proteins (Fig. 3). Homology inference allowed annotating between 190 (98% accuracy) and 759 (75% accuracy) Golgi membrane proteins (Fig. 3). We also identified 155 possible lumenal ER proteins at 75% accuracy (data not shown) based solely on the much smaller but less reliable motif-only data sets. Identifying lumenal ER proteins is particularly relevant, as 82% of the current 257 experimentally annotated ER proteins within our data set are membrane-associated.
| Proteome | HVAL (%) | Total | ER(Golgi)-trusted | Annotated-ER(Golgi) | Annotated-other | Hypothetical |
| A. Endoplasmic reticulum: | ||||||
| Saccharomyces cerevisiae (yeast) | 45 (98) | 53 | 51 | 51 | 2 | 0 |
| | 5 (75) | 149 | | 64 | 85 | 14 |
| Arabidopsis thaliana (weed) | 45 (98) | 38 | 9 | 22 | 16 | 0 |
| | 5 (75) | 570 | | 126 | 444 | 9 |
| Caenorhabditis elegans (worm) | 45 (98) | 12 | 5 | 9 | 3 | 1 |
| | 5 (75) | 394 | | 96 | 298 | 138 |
| Drosophila melanogaster (fruit-fly) | 45 (98) | 17 | 8 | 14 | 3 | 0 |
| | 5 (75) | 367 | | 169 | 198 | 2 |
| Mus musculus (mouse) | 45 (98) | 289 | 82 | 269 | 20 | 0 |
| | 5 (75) | 860 | | 412 | 448 | 5 |
| Homo sapiens (human) | 45 (98) | 309 | 102 | 274 | 35 | 0 |
| | 5 (75) | 964 | | 426 | 538 | 8 |
| All 6 proteomes | 45 (98) | 718 | 257 | 639 | 79 | 1 |
| | 38 (95) | 830 | | 686 | 144 | 3 |
| | 27 (90) | 1098 | | 795 | 303 | 14 |
| | 17 (85) | 1528 | | 930 | 598 | 44 |
| | 10 (80) | 2328 | | 1151 | 1177 | 123 |
| | 5 (75) | 3304 | | 1293 | 2011 | 176 |
| B. Golgi apparatus: |