
Title: Annotating proteins from Endoplasmic Reticulum and Golgi apparatus in eukaryotic proteomes
Author: Kazimierz O Wrzeszczynski & Burkhard Rost

Kazimierz O Wrzeszczynski 1 & Burkhard Rost ?

1 CUBIC, Dept. of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA
2 Columbia University Center for Computational Biology and Bioinformatics (C2B2), Russ Berrie Pavilion, 1150 St. Nicholas Avenue, New York, NY 10032, USA
3 North East Structural Genomics Consortium (NESG), Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA
3 Integrated Program in Cellular, Molecular and Biophysical Studies, Columbia University, 630 West 168th Street, New York, NY 10032, USA
* Corresponding author: cubic@cubic.bioc.columbia.edu URL http://cubic.bioc.columbia.edu/  Tel: +1-212-305-4018, fax: +1-212-305-7932

This article is published in (CMLS, issue, 2003 and pages) © copyright Cellular and Molecular Life Sciences, Birkhäuser (2003). OMJ is the only authorised source. All copying of this article including placing on another website requires the written permission of the copyright owner.



Abstract

The sub-cellular localization of a native protein constitutes one coarse-grained aspect of its function. Transport between compartments is often regulated through short sequence motifs. Here, we analysed experimentally characterised ER/Golgi retrieval motifs and investigated the accuracy of homology-transfer. Only the C-terminal ER retrieval motifs KDEL, HDEL and AIAKE were sufficiently specific. However, even unspecific motifs may help, provided we know the probability for localization given this motif. We provided such estimates. We also rigorously estimated the accuracy and coverage for inferring ER and Golgi localization through homology-transfer by sequence similarity. In entire proteomes, we could thereby annotate 3304 ER (3182 membrane) and 1853 Golgi proteins (759 membrane). We identified another 5157 globular and 3941 membrane putative ER or Golgi proteins. Each experimental annotation yielded, on average, 1-3 high-accuracy and 5-6 low-accuracy homology-transfers in the six proteomes. These numbers will increase with each new experimental annotation.

 

Key words: endoplasmic reticulum, Golgi apparatus, genome sequence analysis, sub-cellular localization, protein sequence motifs.

 

Abbreviations used

BIG: merger of three databases (PDB, Swiss-Prot and TrEMBL)
C-term: carboxy-terminal, i.e. end of protein
ER: endoplasmic reticulum
EVAL: BLAST expectation value
HVAL: HSSP-value, i.e. function correlating pairwise sequence identity and alignment length (eqn. 1)
ORF: open reading frame
PHDsec: profile-based neural network prediction of secondary structure [1, 2, 3]
PIDE: pairwise percentage of identical residues
PDB: database of protein structures [4]
Swiss-Prot: annotated protein sequence database [5]
TrEMBL: translated EMBL database of un-annotated protein sequences [5]


 

Notation used: globular describes all proteins that are neither integral membrane proteins nor attached to the membrane; proteome refers to all the proteins in an organism; retention/recycle/retrieval: strictly speaking, motifs such as KDEL have been shown to guarantee the retrieval or recycling of proteins back into the ER rather than their retention in the ER; trusted: with the term 'trusted data set' we refer to a set for which the sub-cellular localization has been annotated experimentally; [XY] means either amino acid X or Y at the given position.

 

Introduction

Trafficking through eukaryotic cells. The major constituents of eukaryotic cells are: extra-cellular space, cytoplasm, nucleus, mitochondria, Golgi apparatus, endoplasmic reticulum, peroxisomes, and lysosomes. The native sub-cellular localization of a protein is assumed to be determined largely by a trafficking system that is reasonably well captured experimentally for some of the organelles [6, 7, 8, 9, 10, 11, 12]. The system has two main branches [13]. On the first branch, proteins are synthesised on cytoplasmic ribosomes, and from there can go to the nucleus, mitochondria or peroxisomes. The second branch leads from the ribosomes attached to the endoplasmic reticulum to the Golgi apparatus, then to lysosomes or secretory vesicles, and on to the extra-cellular space. At each branch point in the trafficking system, a 'decision is made': either retain the protein in the current compartment or transport it onward to the next. For many examples, we have experimental evidence that membrane transport complexes 'make these decisions' by recognising motifs on the proteins that are shuttled. The most comprehensively characterised branch is the second one, leading to secretion [14, 15, 16, 17, 18]. Most proteins destined for this branch are assumed to have an N-terminal signal peptide that causes them to be transferred into the endoplasmic reticulum as they are being synthesised; most proteins lacking this signal are synthesised in the cytoplasm and follow the first branch of protein trafficking. Note that some proteins appear to be secreted through a different pathway and clearly lack signal peptides.

Protein sorting through the secretory pathway. The secretory pathway involves a complex protein transport system between its organelles that operates without significant loss of organelle-resident proteins. This highly selective process allows for the post-translational modification and maturation of newly synthesised proteins passing through the endoplasmic reticulum (ER) and Golgi apparatus while strictly sorting and retaining resident soluble and membrane-bound proteins [19, 20]. A small fraction of proteins undergo ER/Golgi-independent protein secretion. This process is performed through at least four distinct pathways under varying cellular conditions [21]. The common assumption is that proteins are kept within the ER, and to a much lesser extent within the Golgi apparatus (Golgi), through specific short peptides that act as signals for retention- and retrieval/recycle-mediated sorting mechanisms [22, 23, 24].

Technically, groups of a few, specific residues are referred to as sequence motifs. Secreted proteins usually contain an N-terminal signal peptide of 10-30 residues that is cleaved upon successful transport through the extra-cellular membranes in the ER [22]; these signal peptides have distinct sequence features and can therefore be predicted accurately for proteins of unknown localization [18, 25]. Conceptually, we can distinguish between three types of motifs that are recognised by transport proteins: (i) generic, sequence-consecutive motifs with common features such as cleaved signal peptides, (ii) specific, sequence-consecutive motifs like nuclear localization signals, and (iii) non-sequence-consecutive motifs recognisable only after the protein has folded. While binding motifs that require details of the three-dimensional, folded protein structure are common for all kinds of protein function, such as small molecule or enzyme binding, surprisingly few such cases have been implicated in the regulation of protein trafficking. One prominent exception is the mannose-6-phosphate (M6P) receptors that bind M6P-containing soluble acid hydrolases in the Golgi and transport them on to the endosomal-lysosomal system [26, 27]. Only the first type of motif for regulation (generic and sequence-consecutive) can currently be predicted, namely to identify proteins with signal peptides [18, 28], chloroplast transit peptides [29, 28], as well as peroxisomal [30] and mitochondrial targeting signals [31, 28]. For the second type of specific, sequence-consecutive motifs, all computational biology can do so far is to archive these in databases that can be queried to find unknown nuclear proteins [32, 33, 34, 35].

Computational analysis of ER and Golgi retrieval signals has been largely limited to PROSITE [36] and PSORT [37], both of which rely only on the classical ER and Golgi retrieval motifs (or conservative derivations of these classical motifs); few attempts analysed Golgi and ER proteins on the level of entirely sequenced proteomes [38]. Large-scale genomic consortia rely in part on valid function and sub-cellular localization information for their target selection decisions. Large-scale experimental efforts can adequately account for a proportion of a specific proteome [39, 40] but are often limited by the experimental design. Therefore, computational efforts are often needed to complete a proteome-wide analysis. Our lab has recently reported that sequence conservation established using the HSSP-distance value correlated well with sub-cellular localization [41]. We have further applied this technique specifically to the ER and Golgi organelles. Here, we analysed to what extent ER and Golgi proteins can be identified through such short sequence motifs and/or through sequence similarity to proteins known to reside in these two compartments. This analysis required three steps: (1) collect known signals from literature and databases, (2) build unbiased, trusted data sets of proteins experimentally known to reside in ER and Golgi, and (3) test specificity and accuracy of the signals found. (Note: we failed to uncover novel motifs through motif-finding algorithms.) The first part of our work explored the limits of how far we can reach when trying to predict ER and Golgi proteins from experimentally known and theoretically refined signals. Next, we established thresholds for significant sequence similarity, i.e. for when we can accurately infer ER and Golgi location through homology. Finally, we applied our results to annotate ER and Golgi proteins in the proteomes of Saccharomyces cerevisiae (yeast), Drosophila melanogaster (fruit-fly), Caenorhabditis elegans (worm), Arabidopsis thaliana (weed), Homo sapiens (human), and Mus musculus (mouse).

 

Methods

Trusted data sets of proteins with known localization. We retrieved all proteins from Swiss-Prot [5] that had experimental annotations about sub-cellular localization, removing all with 'putatively known' localization that contained either 'PROBABLE', 'PUTATIVE', or 'BY SIMILARITY' as additional qualifiers in Swiss-Prot. We split these proteins into 'trusted ER/Golgi' and 'trusted non-ER/non-Golgi'. As another control data set, we also retrieved all non-eukaryotic Swiss-Prot proteins. The resulting trusted data sets contained 676 ER proteins, 131 lumenal ER proteins, 104 proteins with a [KH]DEL C-terminal motif, 545 ER membrane proteins, 312 trusted Golgi proteins, and 194 Golgi membrane proteins. A non-ER/non-Golgi set of 8417 localization-annotated eukaryotic proteins was used to identify false positives (numbers summarised in Table in Supporting Online Material). Additional experimentally annotated yeast proteins were collected from the Yeast GFP Fusion Localization Database - http://yeastgfp.ucsf.edu. However, we considered only ORFs without the keyword 'Hypothetical Protein'. This increased the total number of ER proteins to 784 and the Golgi trusted set to 351. The trusted data sets can be obtained from 'ER-GolgiDB' at http://cubic.bioc.columbia.edu/db/ERGolgiDB.
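
The filtering step described above can be sketched as follows. This is a minimal illustration, not the pipeline used for this work: the record layout (an identifier paired with a free-text localization comment), the keyword test for ER/Golgi, and the example identifiers are all assumptions made for the sketch. Only the three qualifiers come from the text.

```python
# Qualifiers named in the text as marking 'putatively known' localization.
UNTRUSTED_QUALIFIERS = ("PROBABLE", "PUTATIVE", "BY SIMILARITY")


def is_experimental(comment):
    """True if the localization comment carries none of the qualifiers."""
    text = comment.upper()
    return not any(q in text for q in UNTRUSTED_QUALIFIERS)


def split_trusted(entries):
    """Split (protein_id, comment) pairs into trusted ER/Golgi and trusted other.

    ASSUMPTION: keyword matching on the comment stands in for the real
    annotation parsing, which is not described in detail in the text.
    """
    er_golgi, other = [], []
    for protein_id, comment in entries:
        if not is_experimental(comment):
            continue  # drop entries with putatively known localization
        text = comment.upper()
        if "ENDOPLASMIC RETICULUM" in text or "GOLGI" in text:
            er_golgi.append(protein_id)
        else:
            other.append(protein_id)
    return er_golgi, other


# Hypothetical records, for illustration only.
example = [
    ("P1", "Endoplasmic reticulum lumen."),
    ("P2", "Golgi membrane (By similarity)."),
    ("P3", "Nuclear."),
    ("P4", "Probable cytoplasmic."),
]
trusted_er_golgi, trusted_other = split_trusted(example)
```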

Data sets for entire proteomes. The human proteins constituted the set currently available in the latest versions of Swiss-Prot (release 40) and TrEMBL (release 22) [5]; Drosophila melanogaster was obtained from http://www.fruitfly.org/ (release 2), Mus musculus was obtained from http://www.ensembl.org/Mus_musculus/ and Caenorhabditis elegans from http://www.sanger.ac.uk/Projects/C_elegans/wormpep/ (wormpep 65). All remaining proteomes (weed and yeast) were downloaded from ftp://ncbi.nih.gov/genbank/genomes/.

Aligning proteins. We aligned the trusted ER and Golgi proteins against all proteins of known localization using pairwise BLAST [42]. Next, we built PSI-BLAST profiles for all data sets using a filtered version of all currently known sequences with three iterations [43]. These profiles were then aligned against all proteins of known localization. After compiling the results for the sequence conservation (Fig. 1), we changed these profiles such that we only included homologues with HVALs ≥ 40 for ER and ≥ 20 for Golgi proteins. We based our identification of ER/Golgi proteins in entire proteomes on these three data sets: 'trusted', 'trusted families', and 'unique subset of trusted families'.

Scores for measuring sequence similarity. The simplest way to measure sequence similarity is percentage pairwise sequence identity (PIDE), i.e. the percentage of residues identical between two proteins (not counting gaps). Another measure is the statistical expectation value as reported by BLAST (EVAL; note: we typically report the logarithm of this value in our figures). As a third measure we used the HSSP-value (HVAL) [44, 45]:

                     ( eqn. 1)

where L was the number of residues aligned between two proteins and PIDE the percentage of pairwise identical residues. The HSSP-value reflects whether an alignment is above the HSSP-curve [44, 45] (HVAL>0) or below it (HVAL<0). In the first case (HVAL>0), the HSSP-value can be seen as a degree of sequence proximity or similarity (the higher the value, the more similar the two proteins), whereas in the latter (HVAL<0) it estimates the distance, or level of divergence, between two proteins (the more negative the value, the less similar the two proteins). An HSSP-value of 0 defines the line above which (almost) no two naturally evolved proteins differ grossly in their three-dimensional structures. To illustrate the curve: for alignment lengths around 100 residues, 33% pairwise sequence identity suffices to infer structure, above 250 residues 21% is significant, and below 11 residues even 100% identity is not enough to infer structural (or functional) similarity. Although the HSSP-curve was derived to describe structural similarity, we noted that it also constitutes a sensitive approach when distinguishing between proteins of similar and dissimilar enzymatic activity [46], between the largest four compartments (nucleus, extra-cellular space, cytoplasm and mitochondria) [41], and between proteins involved in cell-cycle control [47].
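
As a rough illustration of the two quantities entering the HSSP-value, the sketch below computes PIDE over aligned, non-gap positions and subtracts a length-dependent curve. The constants of the curve (480, 0.32, the cut-offs at 11 and 450 residues, and the plateau of 19.5) are an assumption: they are the form commonly quoted for the HSSP-curve in the literature cited as [44, 45] and should be checked against those references. They may not reproduce the illustrative percentages quoted above exactly. The function names are hypothetical.

```python
import math


def pide(aligned_a, aligned_b):
    """Return (percentage identity, L) over positions without gaps ('-')."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0, 0
    identical = sum(a == b for a, b in pairs)
    return 100.0 * identical / len(pairs), len(pairs)


def hssp_curve(length):
    """Identity (in %) on the curve for alignment length L.

    ASSUMPTION: constants taken from the commonly quoted form of the curve,
    not from the text of this article.
    """
    if length <= 11:
        return 100.0
    if length <= 450:
        return 480.0 * length ** (-0.32 * (1.0 + math.exp(-length / 1000.0)))
    return 19.5


def hval(pide_value, length):
    """Distance of an alignment from the curve: >0 above, <0 below."""
    return pide_value - hssp_curve(length)
```

Under these assumed constants, very short alignments can never score above zero, which matches the statement that below 11 residues even 100% identity is not enough.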

Sequence-unique subsets. We built sequence-unique subsets for all types of proteins under consideration to avoid bias that is likely to skew estimates for accuracy and coverage [46]. 'Sequence-unique' meant that no pair of proteins in the set had an HVAL>0 (eqn. 1). Given an all-against-all pairwise alignment for the biased set, we simply used a greedy search to find the largest subset that fulfilled the above condition. (Note: a tool performing this type of reduction is available through the web [48].)
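
One way to sketch such a greedy reduction is shown below. The text does not specify the greedy strategy; removing, at each step, the protein with the most HVAL>0 partners is an assumption made here for illustration, and the input format (a dictionary mapping protein pairs to HVALs) is hypothetical.

```python
def sequence_unique_subset(proteins, hval_of_pair):
    """Greedy reduction to a set in which no pair has HVAL > 0.

    proteins: iterable of identifiers.
    hval_of_pair: dict mapping (id_a, id_b) to the HVAL of their alignment.
    ASSUMPTION: the protein with the most similar partners is dropped first.
    """
    neighbours = {p: set() for p in proteins}
    for (a, b), value in hval_of_pair.items():
        if a != b and value > 0 and a in neighbours and b in neighbours:
            neighbours[a].add(b)
            neighbours[b].add(a)
    while True:
        worst = max(neighbours, key=lambda p: len(neighbours[p]), default=None)
        if worst is None or not neighbours[worst]:
            break  # no remaining pair with HVAL > 0
        for other in neighbours.pop(worst):
            neighbours[other].discard(worst)
    return sorted(neighbours)


# Hypothetical example: B is similar to both A and C, so B is removed.
subset = sequence_unique_subset(
    ["A", "B", "C", "D"],
    {("A", "B"): 10.0, ("B", "C"): 5.0, ("C", "D"): -3.0},
)
```

A greedy search of this kind does not guarantee the largest possible subset; it only guarantees that the returned set satisfies the condition.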

Measuring accuracy and coverage. We used the following definition to measure accuracy/specificity:

                     ( eqn. 2)

with the thresholds given by either (1) percentage pairwise sequence identity (PIDE), (2) BLAST expectation values (EVAL), or (3) the HSSP-value (HVAL). We considered all pairs as 'true' that were experimentally found in the same sub-cellular compartment. In analogy, we used the following definition for coverage/sensitivity:

                     ( eqn. 3)
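
The two cumulative measures can be sketched as below. The reading encoded here is an assumption based on the surrounding definitions: accuracy is the percentage of pairs at or above the threshold that are 'true' (same compartment), and coverage is the percentage of all 'true' pairs found at or above the threshold. Scores are assumed to increase with similarity (as for HVAL or PIDE; for EVAL one would pass the negative logarithm). The function name and input format are hypothetical.

```python
def accuracy_and_coverage(pairs, threshold):
    """Cumulative accuracy and coverage (both in %) at a score threshold.

    pairs: iterable of (score, same_compartment) with same_compartment a bool.
    Returns (accuracy, coverage); a value is NaN when its denominator is zero.
    """
    pairs = list(pairs)
    above = [same for score, same in pairs if score >= threshold]
    true_above = sum(above)
    total_true = sum(same for _, same in pairs)
    accuracy = 100.0 * true_above / len(above) if above else float("nan")
    coverage = 100.0 * true_above / total_true if total_true else float("nan")
    return accuracy, coverage


# Hypothetical scored pairs, for illustration only.
scored = [(50, True), (45, True), (30, False), (10, True), (5, False)]
strict = accuracy_and_coverage(scored, 40)
loose = accuracy_and_coverage(scored, 0)
```

Raising the threshold trades coverage for accuracy, which is the behaviour the cumulative curves in the figures describe.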

 

Results and Discussion



Retention and recycle signals can predict unknown ER and Golgi proteins

Collecting retention and retrieval/recycle motifs from literature and databases. We retrieved experimentally annotated retention and recycle signals for ER and Golgi from the literature, Swiss-Prot [5] and PROSITE [49]. The resulting list was tiny in comparison to that obtained previously for nuclear localization signals [32, 35]. Supposedly, the reason is that many soluble and membrane proteins in the ER have rather specific retention signals (Table 1). Predominantly, the C-terminal motifs KDEL, HDEL, and closely related derivatives have been experimentally related to the retrieval mechanism [19]. Other motifs implicated in ER and Golgi targeting include [20]: (1) the C-terminal motif HDEF in the Ca2+-binding protein Calumenin [50]; (2) the di-lysine motif KK [51] and the di-arginine motifs RR [52] or RKR (RxR) [53]; (3) the tyrosine-based tetra-peptide motif Yxxh (where x can be any amino acid and h signifies a hydrophobic residue), which is predominantly associated with sorting in vesicular traffic [54] and has also been shown to act as a localization motif, as evident in YQRL of TGN38 [55] and in the retrieval of UCE [56]; (4) the di-acidic ER-export motifs [DE], often associated with the Yxxh motif [57]; (5) the FxFxD motif in the cytoplasmic tail of DPAP-A, necessary for retrieval back to the Golgi [58]; and (6) the targeting domain GRIP, found in peripheral Golgi membrane proteins [59, 60]. The only motifs that were previously available to automatic proteome searches were KDEL and HDEL, as well as some derivatives of these deposited in PROSITE [49] and PSORT [37]. For the Golgi apparatus, other than the C-terminal YQRL also used by PSORT, there is currently no specific sequence motif available for automatic database searches [24].



Table 1: Analysing ER and Golgi retention and retrieval signals ∆.
Columns, left to right: Sequence motif (1); Total (N); Eukaryotes (N, %); Non-Eukaryotes (N, %); ER/Golgi (N, %); Non-ER/Non-Golgi (N, %); Non-Annotated (N, %)
                       
Endoplasmic reticulum (ER) (2)                      
KDEL-C-term 67 60 90 7 10 60 90 0 0 0 0
KDEL 1201 636 53 565 47 76 6 560 47 230 41
HDEL-C-term 64 64 100 0 0 62 97 2 3 2 100
HDEL 498 261 52 237 48 68 14 193 38 121 63
HDEF-C-term 4 3 75 1 25 2 50 1 25 0 0
HDEF 91 50 55 41 45 2 2 48 53 28 58
KKxx-C-term 907 492 52 415 46 55 6 437 48 211 48
KKxx-C-term (membrane protein subset) 254 183 72 71 28 55 22 128 50 21 16
KKxx 57848 32493 56 25355 44 810 1 31683 55 15171 48
KxKxx-C-term 810 420 52 390 48 42 5 378 47 177 47
KxKxx-C-term (membrane protein subset) 230 139 60 91 40 42 18 97 42 25 26
xxRR 83869 39769 47 44100 53 1062 1 38707 46 16050 41
KKFF-C-term 8 5 63 3 37 3 38 3 25 2 67
KKFF 416 234 56 93 22 7 2 316 76 118 37
KKAA-C-term 29 7 24 22 76 5 17 2 7 0 0
KKAA 1639 824 50 815 50 40 2 784 48 267 34
AIAKE-C-term 10 10 100 0 0 10 100 0 0 0 0
AIAKE 161 55 34 106 66 11 7 44 27 11 25
CRAR 199 127 64 72 36 0 0 127 64 42 33
                       
Golgi apparatus (3)                      
YQRL 442 212 48 230 52 10 2 202 46 83 41
YKGL 632 304 48 328 52 5 1 299 47 143 48
YHPL 150 70 47 80 53 7 5 65 43 29 45
Yxxh 135637 62800 46 72837 54 859 1 62941 45 27729 44
NPFKD 17 13 76 4 24 0 0 13 76 8 62
FxFxD 4971 2513 51 2458 49 67 1 2446 49 1101 45
FQFND 7 4 57 3 43 3 43 1 14 1 100
PxPxP 8856 2766 31 4023 45 139 2 4694 53 3088 66
[DE] 131139 59784 46 71355 54 834 1 58941 45 25843 44
GRIP-motif (5) 11 11 100 0 0 10 90 1 10 1 100
GRIP-motif (shortened) (6) 58 32 55 24 41 10 18 24 41 11 46
                       
C-term variations (4)                      
PROSITE Pattern (7) 232 197 85 35 15 167 72 30 13 13 43
[KH]DEL 131 124 95 7 5 122 93 2 2 2 100
[KHR][DENQ]EL 203 174 86 29 14 157 77 17 8 9 52
[KHR][DENQ] [87] L 230 187 81 43 19 159 72 28 1 13 46
[KHRDENQAS] [DENQIYCV] [DENQ]L 696 428 61 268 39 193 28 235 33 107 45
[KRDEAVYF][KRDEVYFMQ] [KHED][DK]EL 80 59 74 21 26 50 63 9 11 5 55

∆ Columns: Total: number of proteins found in Swiss-Prot that have the respective motif; N: number of proteins found in subset; %: percentage of proteins in subset (column 'Total' gives 100%); Eukaryotes: all eukaryotic proteins; Non-Eukaryotes: since only eukaryotes have ER and Golgi, this column estimates a lower bound for the false positives; ER/Golgi: subset of eukaryotic proteins that have the respective motif and are experimentally known to be in either ER (for ER motifs) or Golgi (for Golgi motifs), this column gives a lower bound for the true positives (percentage is based on total number); Non-ER/Non-Golgi: subset of eukaryotic proteins that have the respective motif and either are experimentally known to be neither in ER nor in Golgi (percentage is based on total number) or do not contain any sub-cellular localization information in the Swiss-Prot database; Non-Annotated: subset of Non-ER/Non-Golgi that does not contain any localization information in Swiss-Prot. The Non-Eukaryotes and Non-ER/Non-Golgi columns together provide the total false-positive (FP) percentage. The total numbers were: ER = 1060, of which 324 (30%) were annotated as "Probable, Putative or By Similarity" and 72 (7%) as Viral/Prokaryotic/Archaea; Golgi = 495, of which 163 (33%) were annotated as "Probable, Putative or By Similarity" and 9 (2%) as Viral/Prokaryotic/Archaea.

1 'C-term' indicates the carboxy-terminal (last) residue of the protein; motifs are given by the one-letter code of the respective amino acids with the following conventions: [AG] means either A or G, 'x' stands for 'any' amino acid, 'h' stands for any hydrophobic amino acid.

2 Source of ER motifs: KDEL [19] , HDEL [19] , KKxx [51] [88] , xxRR [52] , KKFF [89] , KKAA [90, 91] , HDEF [50] , AIAKE [92] , CRAR [93] .

3 Source of Golgi motifs: YQRL [94], YKGL [95], YHPL [56], Yxxh [55], NPFKD [56], FxFxD [58], FQFND [58], PxPxP [96], [DE] [57], GRIP-motif [59, 60].

4 C-term variations: most of these motifs were compiled for this work.

5 The consensus pattern of the GRIP-motif is described by: [DEA]Y[LIT][KR][KHN][VI][VILF]XX[YF][MIL].

6 Shortened derivative of GRIP-motif: [DEA]Y[LIT][KR][KHN][VI][VILF]

7 ER retrieval motif found in PROSITE: [KHRQSA][DENQ]EL [36].
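
The motif conventions of footnote 1 translate directly into regular expressions. The sketch below is an illustration only: the function name, the example sequences, and in particular the set of residues treated as 'hydrophobic' are assumptions, since the text does not list which residues 'h' covers.

```python
import re

# ASSUMPTION: residues counted as hydrophobic for the 'h' position.
HYDROPHOBIC = "AVILMFWC"


def motif_to_regex(motif, c_terminal=False):
    """Compile a motif in the table's notation into a regular expression.

    'x' (or 'X') matches any residue, 'h' any hydrophobic residue, and
    bracketed groups such as [KH] are passed through unchanged. With
    c_terminal=True the motif must end at the last residue.
    """
    parts = []
    for ch in motif:
        if ch in "xX":
            parts.append(".")
        elif ch == "h":
            parts.append("[" + HYDROPHOBIC + "]")
        else:
            parts.append(ch)
    return re.compile("".join(parts) + ("$" if c_terminal else ""))


kdel_cterm = motif_to_regex("[KH]DEL", c_terminal=True)
grip = motif_to_regex("[DEA]Y[LIT][KR][KHN][VI][VILF]XX[YF][MIL]")

# Hypothetical sequences, for illustration only.
hit_at_end = kdel_cterm.search("MKTAYHDEL") is not None
hit_inside = kdel_cterm.search("MKDELTAY") is not None
```

Anchoring with `$` is what distinguishes, for example, 'KDEL-C-term' from 'KDEL' anywhere in the sequence.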



Validating motifs against databases. For each motif found (Table 1), we retrieved all proteins with this motif deposited in Swiss-Prot [5], TrEMBL [5], and PDB [4]. Next, we extracted a subset of proteins annotated in Swiss-Prot by their experimentally known sub-cellular localization. This subset, along with a grouping of all Swiss-Prot species into eukaryotes and non-eukaryotes, provided two means of assessing the specificity/accuracy of a given motif. The most specific ER motifs were KDEL and HDEL when restricted to the carboxy-terminus (Table 1). These two retrieved 131 proteins from Swiss-Prot, most of which have already been experimentally characterised as 'retained in the ER' (data not shown). While the KDEL motif was also present in a few non-eukaryotic proteins, the HDEL motif was found in only two eukaryotic non-ER proteins, and those two were orthologues of the protein 'Protein Kinase C Substrate' in bovine and human (g19p_human and g19p_bovin). Whereas identifying localization from such motifs clearly seems very reliable, this finding illustrated the other problem of these two motifs: they occur frequently in non-ER proteins at positions other than the C-terminus. In other words, in order to rely on KDEL/HDEL to infer localization, we must know the C-terminus of the full-length protein. All other ER motifs published were either very unspecific (found in many non-ER proteins) or far too specific (found in very few ER protein families), or both. For example, the di-lysine motif (KKxx and KxKxx) retrieved all known ER proteins when located at the C-terminal position of membrane proteins; however, the matches also included a set of 128 proteins (KKxx) and 378 proteins (KxKxx), most of which could not be classified as ER proteins (Table 1). When we searched for this motif (and for the N-terminal di-arginine motif xxRR, which is more difficult to distinguish) among non-membrane proteins, and even more so when we did not restrict the motif to the termini, the high sensitivity came at the cost of an extremely low specificity: both motifs were found in most non-ER proteins. In fact, over 80% of the matches were wrong. Overall, the information contained in the published Golgi motifs was even less promising. For example, the most sensitive GRIP-motif [59, 60] was found in 11 proteins, mainly orthologues of each other. A generalised GRIP-motif matched slightly more proteins, none from the Golgi, and many from non-eukaryotic proteins. Similarly, Yxxh matched most known Golgi proteins; however, it also matched almost the entire Swiss-Prot database. Obviously, only the C-terminal motifs KDEL, HDEL, and AIAKE suffice to accurately annotate ER proteins. All other experimentally characterised retention and recycle motifs for ER and Golgi need to be combined with other means of annotation.
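
The percentages in the ER/Golgi column of Table 1 can be read as a lower bound on the chance that a protein carrying a motif is experimentally known to reside in the ER. The sketch below simply recomputes that column for a few rows from the counts given in Table 1; the dictionary layout and the function name are hypothetical.

```python
# (total matches in Swiss-Prot, matches experimentally known to be in the ER),
# copied from the 'Total' and 'ER/Golgi N' columns of Table 1.
TABLE1_ER = {
    "KDEL-C-term": (67, 60),
    "KDEL": (1201, 76),
    "HDEL-C-term": (64, 62),
    "HDEL": (498, 68),
    "AIAKE-C-term": (10, 10),
    "KKxx-C-term": (907, 55),
}


def percent_known_er(motif):
    """Percentage of motif matches that are experimentally annotated as ER."""
    total, known_er = TABLE1_ER[motif]
    return round(100.0 * known_er / total)


summary = {motif: percent_known_er(motif) for motif in TABLE1_ER}
```

The contrast between the C-terminal and the unrestricted versions of KDEL and HDEL is the quantitative form of the argument made in the paragraph above.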

 



ER and Golgi localization conserved at high levels of sequence similarity

We explored the power of using sequence similarity for the entire proteins to identify ER and Golgi proteins. Toward this end we had (1) to establish thresholds for sequence similarity that enable accurate inference by homology, and (2) to build family profiles of known ER/Golgi proteins. The final 'prediction step' requires searching with a query protein of unknown localization against these family profiles. We could have simplified this final step by aligning all query proteins against the known ER/Golgi proteins. However, sequence-profile alignments are more sensitive and more specific than sequence-sequence alignments. Note that we looked for similarities over the entire proteins, rather than for similarities between short signal peptides [61] .

ER and Golgi proteins correctly detected by homology at high levels of similarity. We aligned all experimentally known ER and Golgi (true positives) and all known non-ER and non-Golgi proteins (true negatives) by pairwise BLAST [62, 42] and by the more powerful PSI-BLAST [63] (Methods); alignments were ranked by expectation values [62] (EVAL), percentage pairwise sequence identity (PIDE), and the HSSP-value (HVAL, eqn. 1). At HVAL=0, the accuracy for homology inference was 65% (Fig. 1 top); it increased to 98% at HVAL>40. The majority of false positives (non-ER proteins) at high HSSP-values were of two specific types: heat shock protein 70 and elongation factor alpha. These two large families are not exclusive to the ER; rather, they are also abundant in other cellular compartments. They caused the transition between the regions of 'mostly incorrect inference' (HVAL<20) and 'mostly correct inference' (HVAL>40) to be more gradual for ER than for Golgi proteins. The accuracy for Golgi proteins (Fig. 1 bottom) was slightly higher than that for the ER proteins: 98% accuracy was reached at HVAL>20. We also investigated the effect of database bias [46], confirming that biased data sets incorrectly suggested much higher levels of accuracy at all thresholds (data not shown). At high levels of accuracy, the coverage versus accuracy curve was slightly higher for HSSP-values than for expectation values (data not shown). Thus, we relied on the HSSP-value for the annotations of entire proteomes.
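
A threshold-based homology transfer of the kind described above might be sketched as follows. The two default thresholds are the HVAL levels at which the text reports 98% accuracy (ER: HVAL>40, Golgi: HVAL>20). The input format, the function name, and the rule of taking the hit with the highest HVAL when both compartments qualify are assumptions made for illustration.

```python
# HVAL levels reported in the text as reaching 98% accuracy.
THRESHOLDS_98 = {"ER": 40.0, "Golgi": 20.0}


def infer_localization(hits, thresholds=THRESHOLDS_98):
    """Transfer a compartment label from the best sufficiently similar hit.

    hits: iterable of (compartment, hval) for one query protein aligned
    against proteins (or family profiles) of known localization.
    Returns the compartment of the highest-scoring hit above its threshold,
    or None if no hit qualifies.
    ASSUMPTION: ties between compartments are resolved by the higher HVAL.
    """
    best = None
    for compartment, value in hits:
        if compartment in thresholds and value > thresholds[compartment]:
            if best is None or value > best[1]:
                best = (compartment, value)
    return best[0] if best else None
```

Passing a different `thresholds` dictionary, for example lower values corresponding to 75% accuracy, yields the larger but less reliable candidate sets discussed in the proteome annotation.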




Fig. 1 : Sequence conservation for Endoplasmic Reticulum (ER) and Golgi apparatus.
We aligned all experimentally annotated, sequence-unique ER and Golgi proteins (ER: top graphs, Golgi: bottom graphs) against all true negatives using BLAST (squares) and PSI-BLAST (circles). Solid lines with filled symbols describe cumulative accuracy/specificity (percentage of correctly identified localized proteins at given threshold, eqn. 2).

 


Detailed distinction of ER and Golgi proteins. We also collected data sets for more specific subsets of ER proteins: (1) lumenal, (2) proteins containing only the [KH]DEL motif (see below), (3) proteins containing the Swiss-Prot annotation 'PREVENT SECRETION FROM ER', and (4) membrane proteins. For the first three subsets of ER proteins, each sequence-unique set contained very few proteins: 21 of 131 total for the lumenal set, 14 of 102 total for those with a [KH]DEL motif, and 30 of 212 total for proteins with a Swiss-Prot ER retention annotation. While these sets were too specific and too small to establish reliable conservation thresholds, the detailed distinction of ER/Golgi-subtypes could be used to annotate proteomes. Although the sequence-unique sets of ER and Golgi membrane proteins were also rather small, we could analyse the sequence conservation for these subsets. We compared two different sets of true negatives: (i) all non-ER/non-Golgi proteins, and (ii) only non-ER/non-Golgi membrane proteins. Not surprisingly, inference by homology was more accurate when using the additional constraint that the protein had to be in the membrane (Fig. 2, diamonds vs. triangles). Due to the small numbers of proteins in the set, the higher levels of accuracy may not hold in general. However, the data certainly supported the assumption that homology inference for ER/Golgi membrane proteins is at least as accurate as that for all other ER/Golgi proteins. We applied this result to searching ER/Golgi membrane proteins in proteomes.




Fig. 2 : Sequence conservation for ER and Golgi membrane proteins.
We aligned all sequence-unique ER (top graphs) and Golgi (bottom graphs) membrane proteins against all non-ER/non-Golgi proteins (triangles) and against all non-ER/non-Golgi membrane proteins (diamonds). Solid lines with filled symbols describe cumulative accuracy/specificity (eqn. 2); dotted lines with open symbols describe cumulative coverage/sensitivity (eqn. 3). We measured sequence similarity (A) by the HSSP-value (eqn. 1; left graphs), and (B) by the logarithm of the BLAST E-values (right graphs).

 


 



Annotating ER and Golgi proteins in six eukaryotic proteomes

Identifying ER and Golgi proteins in six proteomes. We aimed at annotating as many ER/Golgi proteins as possible through homology and retention and recycle signals in six entirely sequenced eukaryotes (yeast, weed, worm, fly, mouse, and human). Swiss-Prot currently annotates 257 ER and 204 Golgi proteins in these six proteomes (Table 2, column labelled 'ER(Golgi)-trusted'). Alignments using the trusted data sets added 718 potential ER and 800 potential Golgi proteins (Table 2, rows labelled by an HVAL corresponding to 98% accuracy). 41 of these proteins were previously annotated as 'Hypothetical protein'. Swiss-Prot also contains annotations for localization based on sequence similarity to proteins of experimentally known localization. In order to establish how many of the proteins identified by our homology-inference were also annotated by Swiss-Prot, we identified the closest Swiss-Prot homologue for each protein in any of the six proteomes from the PEP database (Predictions for Entire Proteomes) [64]. This revealed that most of our annotations corresponding to 98% accuracy were also annotated by Swiss-Prot as either 'probable', 'putative' or 'by similarity'. In contrast, most putative annotations according to sequence similarity thresholds that correspond to 75% accuracy are not annotated as ER/Golgi by Swiss-Prot. At this threshold, we could propose another 3304 possible ER and 1853 possible Golgi proteins (Table 2, rows labelled by '75%' and '78%' for ER and Golgi, respectively). While we expect that a majority of these annotations are likely to be false, these subsets constitute a good 'hunting-ground' for discovery of uncharacterised ER and Golgi proteins. Overall, each experimental annotation in our trusted set yielded about 1-3 (lower value for ER, higher for Golgi) homology-transfers at high accuracy (98%) and about 5-6 at low accuracy (>75%). The entire set of results is publicly available at http://cubic.bioc.columbia.edu.

Identifying ER and Golgi membrane proteins. We found a total of 3941 putative ER and Golgi membrane proteins in the six proteomes at a threshold corresponding to 75% accuracy. In most proteomes we could expand reliable annotations (98% accuracy threshold) for ER membrane proteins between 2.5-fold (human) and 8-fold (weed). At the same accuracy threshold, we also identified ER membrane proteins in worm, for which our initial trusted set contained no ER membrane proteins (Fig. 3). Homology inference allowed annotating between 190 (98% accuracy) and 759 (75% accuracy) Golgi membrane proteins (Fig. 3). We also identified 155 possible lumenal ER proteins at 75% accuracy (data not shown) based solely on the much smaller but less reliable motif-only data sets. Identifying lumenal ER proteins is particularly relevant as 82% of the current 257 experimentally annotated ER proteins in our data set are membrane-associated.



Table 2: ER and Golgi proteins in eukaryotic proteomes. ◊

 
Proteome   HVAL (%)   Total   ER(Golgi)-trusted   Annotated-ER(Golgi)   Annotated-other   Hypothetical
A. Endoplasmic reticulum:            
             
Saccharomyces cerevisiae (yeast) 45 (98) 53 51 51 2 0
  5 (75) 149   64 85 14
             
Arabidopsis thaliana (weed) 45 (98) 38 9 22 16 0
  5 (75) 570   126 444 9
             
Caenorhabditis elegans (worm) 45 (98) 12 5 9 3 1
  5 (75) 394   96 298 138
             
Drosophila melanogaster (fruit-fly) 45 (98) 17 8 14 3 0
  5 (75) 367   169 198 2
             
Mus musculus (mouse)            
  45 (98) 289 82 269 20 0
  5 (75) 860   412 448 5
             
Homo sapiens (human)            
  45 (98) 309 102 274 35 0
  5 (75) 964   426 538 8
             
All 6 proteomes 45 (98) 718 257 639 79 1
  38 (95) 830   686 144 3
  27 (90) 1098   795 303 14
  17 (85) 1528   930 598 44
  10 (80) 2328   1151 1177 123
  5 (75) 3304   1293 2011 176
             
B. Golgi apparatus: