| Title: | Predicting protein subcellular localization using intelligent systems |
| Author: | Rajesh Nair & Burkhard Rost |
| Quote: | In: In Silico Technology in Drug Target Identification and Validation, Darryl Leon & Scott Markel (Eds.), Dekker, 2007, in press |
| 1 | Dept. of Biochemistry and Molecular Biophysics, Columbia University, 630 West 168th Street, New York, NY 10032, USA |
| 2 | Columbia University Center for Computational Biology and Bioinformatics (C2B2), 1130 St. Nicholas Avenue Rm 802, New York, NY 10032, USA |
| 3 | North East Structural Genomics Consortium (NESG), Department of Biochemistry and Molecular Biophysics, Columbia University, 630 West 168th Street, New York, NY 10032, USA |
| * | Corresponding author: nair@rostlab.org URL http://www.rostlab.org/ Tel: +1-212-851-4669 |
Key words: protein subcellular localization prediction, sorting signals, neural networks, support vector machines, hidden Markov models, amino acid composition, text analysis.
| DNA | deoxyribonucleic acid |
| ER | endoplasmic reticulum |
| GO | gene ontology |
| HMMs | hidden Markov models |
| NNs | neural networks |
| SVMs | support vector machines |
Decoding protein function - a major challenge for modern biology. The genetic information for life is stored in the nucleic acids, while proteins are the workhorses responsible for transforming this information into physical reality. Proteins are the macromolecules that perform most important tasks in organisms, such as the catalysis of biochemical reactions, the transport of nutrients, and the recognition and transmission of signals. The full range of roles played by any particular protein is referred to as its 'function'. The genome (DNA) sequences of over 180 organisms, including a draft sequence of the human genome, [1, 2] have now been completed. For over 105 of these, the data are publicly available and contribute about 413,000 protein sequences, i.e. about one fourth of all currently known protein sequences. [3, 4, 5] The number of entirely sequenced genomes is expected to continue growing exponentially for at least the next few years. With the availability of genome sequences of entire organisms we are for the first time in a position to understand the expression, function, and regulation of the entire set of proteins encoded by an organism. This information will be invaluable for understanding how complex biological processes occur at a molecular level, how they differ in various cell types, and how they are altered in disease states. [6] Identifying protein function is a big step toward understanding diseases and identifying novel drug targets. [7] However, experimentally determining protein function continues to be a laborious task requiring enormous resources. For example, more than a decade after its discovery, we still do not know the precise and entire functional role of the prion protein. [8] The rate at which expert annotators add experimental information into the more or less controlled vocabularies of databases snails along at an even slower pace.
This has left a huge and rapidly widening gap between the amount of sequences deposited in databases and the experimental characterization of the corresponding proteins. [9, 10] Bioinformatics plays a central role in bridging this sequence-function gap through the development of tools for faster and more effective prediction of protein function. [11, 12, 13]
'Protein function' has myriad meanings. The function of a protein is hard to define. Proteins can perform molecular functions like catalyzing metabolic reactions and transmitting signals to other proteins or to DNA. At the same time they can also be responsible for performing physiological functions as a set of cooperating proteins, such as the regulation of gene expression, metabolic pathways and signalling cascades. [11] What makes matters worse is that although many biologists may assume that they 'know it when they see it', in fact, their conclusion is likely to be biased by the department with which they are affiliated: for example geneticists attach a different meaning to the word function than chemists, pharmacologists, medical, structural, or cell biologists. This Babylonian confusion comes about since function is a complex phenomenon that is associated with many mutually overlapping levels: chemical, biochemical, cellular, organism mediated, developmental and physiological. [14] These levels are related in complex ways, for example, protein kinases can be related to different cellular functions (such as cell cycle), and to a chemical function (transferase) plus a complex control mechanism by interaction with other proteins; the same kinase may also be the culprit that leads to mis-function, or disease. The variety of functional roles of a protein often results in confusing database annotations which makes it difficult to develop tools for predicting protein function. [15] What is needed for reliable automatic predictions are computer-readable hierarchical descriptions of function. [16, 11, 17] But defining an ontology for protein function has proved to be an extremely difficult task.
What makes subcellular localization ideal for function prediction experiments? Since biological cells are subdivided into membrane-bound compartments, the subcellular localization of a protein is much more easily identifiable than its other roles in a cell. In contrast with other functional features, the protein trafficking mechanism is relatively well understood, and computer-readable subcellular localization data is available for large numbers of proteins. Proteins must be localized in the same subcellular compartment to cooperate towards a common physiological function. Though some proteins can localize in multiple compartments, the majority of proteins are localized within a single compartment for the largest part of their lifetime. Knowledge of the subcellular localization of a protein can significantly improve target identification during the drug discovery process. [18, 19] For example, secreted proteins and plasma membrane proteins are easily accessible by drug molecules due to their localization in the extracellular space or on the cell surface. A purified secreted protein or a receptor extracellular domain can be utilized directly as a therapeutic (e.g., growth hormone), or may be targeted by specific antibodies or small molecules. Important therapeutics have been created that target proteins present on the cell surface in a specific cell type or disease state. [20] Rituxan is an antibody therapeutic targeting the B lymphocyte-specific CD20 protein and is an effective therapeutic in the treatment of non-Hodgkin's lymphoma. Aberrant subcellular localization of proteins has been observed in the cells of several diseases, such as cancer and Alzheimer's disease. Therefore, unravelling the native compartment of a protein is an important step on the long way to determining its role.
[21, 11] Using experimental high-throughput methods for epitope and green fluorescent protein (GFP) tagging, two groups have recently reported localization data for most proteins in Saccharomyces cerevisiae (baker's yeast). [22, 23] So far, the majority of large-scale experiments suggesting localization have been restricted to yeast, or to particular compartments, such as a recent analysis of chloroplast proteins in Arabidopsis thaliana (thale cress). [24] As of now, these large-scale experiments cannot be repeated for mammalian or other higher eukaryotic proteomes. One significant obstacle is that large-scale production of a collection of cell lines, each with a defined gene chromosomally tagged at the 3′-end, is not yet possible. [25] In contrast, computational tools can provide fast and accurate localization predictions for any organism. [9, 26] This has resulted in subcellular localization prediction becoming one of the central challenges in bioinformatics. [27, 28, 29]
Protein trafficking proceeds via sorting signals. Bacterial cells generally consist of a single intracellular compartment surrounded by a plasma membrane. In contrast, eukaryotic cells are elaborately subdivided into functionally distinct, membrane-bounded compartments. The major constituents of eukaryotic cells are: extracellular space, cytoplasm, nucleus, mitochondria, Golgi apparatus, endoplasmic reticulum (ER), peroxisome, vacuoles, cytoskeleton, nucleoplasm, nucleolus, nuclear matrix and ribosomes. [30] Most eukaryotic proteins are encoded in the nuclear genome and synthesized in the cytosol, and many need to be further sorted before they reach their final destinations ( Fig. 1 ). The localization of a protein is largely determined by a trafficking system that is reasonably well understood for some organelles. [31, 32, 33, 34, 28] The system has two main branches. [35] On one, proteins are synthesized on cytoplasmic ribosomes, and from there can go to the nucleus, mitochondria, or peroxisomes. The second branch leads from the ER-ribosomes to the Golgi apparatus, and from there to lysosomes, or secretory vesicles, and on to the extracellular space. At each branch point, a decision must be made for each protein - either retain the protein in the current compartment or transport it to the next. These decisions are made by membrane transport complexes, which respond to targeting signals on the proteins themselves. In most cases, these targeting signals are short stretches of amino acid residues. The best understood branch point is the second one leading to secretion. Many proteins destined for this branch have an N-terminal signal peptide which is cleaved off proteolytically either during or after protein translocation through the membrane. Proteins lacking this signal are retained in the cytoplasm. The targeting signals used at the other branch points are not always so clear for two reasons. 
First, the signals are presented by folded proteins, and hence are not always contiguous in sequence. Second, even where the signals are contiguous in sequence, not all signal peptides have been documented. In the absence of a clear understanding of the principles governing protein translocation, computational methods for predicting subcellular localization have pursued a number of conceptually distinct approaches.
Fig. 1: A simplified roadmap of protein traffic. Proteins can move from one compartment to another by gated transport (white), transmembrane transport (dark grey), or vesicular transport (light grey). The signals that direct a given protein's movement through the system, and thereby determine its eventual location in the cell, are contained in each protein's amino acid sequence. The journey begins with the synthesis of a protein on a ribosome in the cytosol and terminates when the final destination is reached. At each intermediate station (boxes), a decision is made as to whether the protein is to be retained in that compartment or transported further. In principle, a signal could be required for either retention in or exit from a compartment. Proteins are synthesized in the cytosol from where they are sorted to their respective localizations. (From Alberts et al. [126] With permission.)
No straightforward strategy for predicting localization. Methods for predicting the subcellular localization of proteins have primarily explored four avenues: 1) annotation transfer from homologous sequences, 2) predicting the sorting signals that the cell uses as 'address labels', 3) mining the functional information deposited in databases and scientific literature, and 4) using the observation that the subcellular localization depends in subtle ways on the amino acid composition. Additionally, there are meta-methods which combine the outputs from a number of primary methods in an optimal way to enhance accuracy and coverage. Sequence similarity is perhaps the most frequently used method to annotate function for unknown proteins and accounts for the majority of annotations about function in public databases. [26, 36, 37] A major limitation of sequence homology based methods is that they are only applicable when another sequence-similar protein with experimentally known function is available. Hence, only a small fraction of known sequences can be annotated using this approach. [27] Since protein trafficking relies on the presence of sorting signals, ideally we would like to predict the signals responsible for targeting. However, our current knowledge of sorting signals is far from perfect and recent cell biological studies seem to indicate that the protein sorting mechanism is far more complex than previously thought. This makes it extremely difficult to accurately identify sorting signals. [38] In spite of their limited applicability, methods that predict sorting signals provide the most useful predictions since by pinpointing the 'targeting signal' they shed light on the molecular mechanisms of protein translocation. Traditionally, expert human annotators have been responsible for interpreting experimental data in the scientific literature and annotating protein function in public databases.
[39, 40] However, recent advances in data mining techniques have made it possible to deploy automatic methods to complement the role of 'expert annotators' and extract functional information directly from biological databases, MEDLINE abstracts [41] and even full scientific papers. Due to the exponential growth in the size of biological databases, a number of methods have recently been developed that infer subcellular localization using automatic text analysis. Many recent advances in predicting subcellular localization have been the result of using the amino acid composition and other sequence derived features. These ab initio methods utilize only the amino acid composition and features predicted from the primary sequence; hence, they have the advantage of being applicable to all protein sequences. A method for accurately predicting subcellular localization from the amino acid sequence alone would be invaluable in interpreting the wealth of data being provided by large-scale sequencing projects. Furthermore, predictions of localization can assist high-throughput techniques to determine localization from cDNAs. [42] However, prediction accuracy for ab initio methods still lags behind other approaches.
Next, we review the different approaches for predicting subcellular localization and describe the state-of-the-art methods for predicting localization.
Most annotations of function through homology transfer. Traditionally, the first approach for annotating function of an unknown protein relies on sequence similarity to proteins of known function. [43, 44] The method works by first identifying a database protein of experimentally known function with significant sequence similarity to a query protein, U, and then transferring the experimental annotations of function from the homologue to the unknown query U. Understanding the relation between function and sequence is of fundamental importance, since it provides insights into the underlying mechanisms of evolving new functions through changes in sequence and structure. [45] Several studies have explored the relationship of sequence and structure similarity to conservation of various aspects of protein function. [46, 47, 48, 49] One major observation is the existence of sharp 'conservation thresholds' for sequence similarity: above the threshold, sequence-similar pairs of proteins share the same function, and below it, they have dissimilar functions. In practice, ad hoc thresholds of 50-60% sequence identity are often used for transferring functional annotations. Recent studies indicate that these levels of sequence similarity may not be sufficient to accurately infer function. [50, 49] Several pitfalls in transferring annotations of function have been reported, for example, inadequate knowledge of thresholds for 'significant sequence similarity', using only the best database hit, or ignoring the domain organization of proteins. [9, 51, 52, 36] In spite of this, homology based approaches continue to be among the most reliable for annotating subcellular localization. [50, 53, 54]
LOChom: database of homology based annotations. By performing a large-scale analysis of the relationship between sequence similarity and subcellular localization, Nair & Rost [50] were able to establish sequence similarity thresholds for the conservation of subcellular localization. They observed a sharp transition separating the regions of conserved and non-conserved localization, although this transition was less well defined than those previously observed for the conservation of protein structure and enzymatic activity. [49] To their surprise, they found that pairwise sequence identities of over 80% were needed to safely infer localization based on homology. A simple measure for sequence similarity accounting for pairwise sequence identity and alignment length, the HSSP-distance, [55, 56] was found to accurately distinguish between protein pairs of identical and different localizations. In fact, BLAST expectation values [57, 58] outperformed the HSSP-distance only for sequence alignments in the sub-twilight zone, which is the region of sequence similarity where structure and function can no longer be safely inferred from sequence similarity alone. LOChom [50] is a comprehensive database containing homology based subcellular localization annotations for nearly a quarter of all proteins in the SWISS-PROT database [59] and around 20% of sequences from five entirely sequenced eukaryotic genomes. [50]
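The threshold-based transfer described above can be illustrated with a minimal sketch. It is not the LOChom implementation: the `Hit` record, the function name, and the choice to take the single most similar hit are assumptions made for illustration, and the sketch uses plain pairwise sequence identity (with the 80% cutoff quoted above) rather than the HSSP-distance.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """One alignment hit against a protein of experimentally known localization (hypothetical record)."""
    subject_id: str
    percent_identity: float  # pairwise sequence identity, in percent
    localization: str        # experimentally annotated compartment


def transfer_localization(hits: List[Hit], min_identity: float = 80.0) -> Optional[str]:
    """Transfer localization from the most similar hit above the identity threshold.

    Returns None when no hit is similar enough, i.e. no annotation is transferred.
    """
    trusted = [h for h in hits if h.percent_identity > min_identity]
    if not trusted:
        return None
    best = max(trusted, key=lambda h: h.percent_identity)
    return best.localization


hits = [Hit("P1", 85.0, "nucleus"), Hit("P2", 60.0, "cytoplasm")]
print(transfer_localization(hits))       # hit P1 exceeds the 80% threshold
print(transfer_localization(hits[1:]))   # only a 60% hit remains, nothing is transferred
```

Returning `None` rather than the best available guess mirrors the limitation noted in the text: homology transfer is reliable only where a sufficiently similar annotated protein exists.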
Prediction possible for some cellular classes. A number of methods have tried to predict localization by identifying local sequence motifs, such as signal peptides [60, 61] or nuclear localization signals (NLS) [62, 28] that are responsible for protein targeting. The prediction of N-terminal sorting signals has a long history originating from the early work on secretory signal peptides of von Heijne. [63, 64] N-terminal signal peptides are responsible for the transport of proteins between the ER and the Golgi apparatus and also for targeting proteins to the mitochondria [65] and to chloroplasts. [66] Early methods for predicting signal peptides were essentially based on consensus signals, using linear discriminant functions with weight matrices. Modern machine learning techniques can predict whether a protein contains an N-terminal targeting peptide or not by automatically extracting correlations from the sequence data without any prior knowledge of targeting signals. As a consequence, the output of these predictors offers little insight into the protein sorting mechanism itself. The introduction of machine learning techniques like neural networks (NNs) and hidden Markov models (HMMs) [67, 68] has resulted in spectacular improvements in prediction accuracy. Machine learning methods like NNs and HMMs learn to discriminate automatically from the data, using only a set of experimentally verified examples as input. It is now possible to predict secretory signal peptides (SPs), [69, 70] mitochondrial targeting peptides (mTPs) [71, 72] and chloroplast targeting peptides (cTPs) [73] quite reliably using machine learning techniques. A particular problem for methods detecting N-terminal signals is that start codons are predicted with less than 70% accuracy by genome projects. [74, 75, 1] For additional details, the reader can consult a number of excellent reviews on N-terminal sorting signal prediction.
[76, 67, 77] Sorting signals also mediate the import of proteins into the nucleus. A protein is imported into the nucleus if it contains a nuclear localization signal (NLS), which is a short stretch of amino acids. Extensive experimental research on nucleo-cytoplasmic transport [33] indicates that NLSs can occur anywhere in the amino acid sequence and in general have an abundance of positively charged residues. [78, 79] Efforts at NLS prediction started with the work of Cokol et al., [62] who successfully applied 'in silico mutagenesis' to discover new NLSs. Since the entire protein sequence has to be searched for NLSs, application of machine learning techniques has proved difficult. Overall, known and predicted sequence motifs enable annotating about 30% of the proteins in six eukaryotic proteomes. [80, 4] Here, we review TargetP and PredictNLS, which are the most accurate tools for predicting signal peptides and nuclear localization signals.
| Method | URL |
| Sequence homology based localization annotations | |
| LOChom [50] | cubic.bioc.columbia.edu/db/LOChom/ |
| Methods based on N-terminal sorting signals | |
| SignalP [127] | www.cbs.dtu.dk/services/SignalP/ |
| ChloroP [73] | www.cbs.dtu.dk/services/ChloroP/ |
| TargetP [68] | www.cbs.dtu.dk/services/TargetP/ |
| iPSORT [125] | biocaml.org/ipsort/iPSORT/ |
| MitoProt [128] | www.mips.biochem.mpg.de/cgibin/proj/medgen/mitofilter/ |
| Predotar [129] | www.inra.fr/Internet/Produits/Predotar/ |
| Prediction and analysis of nuclear localization signals | |
| PredictNLS [62] | cubic.bioc.columbia.edu/predictNLS/ |
| Inferring localization using text analysis | |
| LOCkey [93] | cubic.bioc.columbia.edu/services/LOCkey/ |
| Proteome Analyst | www.cs.ualberta.ca/~bioinfo/PA/ |
| GeneQuiz [44] | jura.ebi.ac.uk:8765/ext-genequiz/ |
| Meta_A [92] | mendel.imp.univie.ac.at/CELL_LOC/ |
| Methods based on amino acid composition | |
| LOCnet [116] | cubic.bioc.columbia.edu/services/LOCnet/ |
| SubLoc [109] | www.bioinfo.tsinghua.edu.cn/SubLoc/ |
| PLOC [111] | www.genome.jp/SIT/ploc.html |
| ProtComp | www.softberry.ru/berry.phtml?topic=index&group=programs&subgroup=proloc |
| General Methods | |
| PSORT II [61] | psort.nibb.ac.jp/ |
| PSORT-B [53] | www.psort.org/psortb/ |
| LOCtarget [119] | cubic.bioc.columbia.edu/services/LOCtarget/ |
TargetP: predicting N-terminal signal peptides. TargetP is a neural network based tool for predicting N-terminal sorting signals. The neural network can discriminate between proteins destined for the secretory pathway, mitochondria, chloroplast and other localizations with an accuracy of 85% (plant) or 90% (non-plant). The N-terminal signal peptide is proteolytically cleaved either during or after protein translocation. TargetP also predicts the cleavage site, though with lower accuracy: 40% to 50% of sites are correctly predicted for chloroplastic and mitochondrial presequences, and above 70% for secretory signal peptides. The neural network architecture consists of two layers. The first layer contains one dedicated network for each type of presequence (SP, mTP, cTP, Other), while the second is a decision neural network that makes the final choice between the different compartments (Fig. 2). The signal peptide problem was posed to the neural networks in two ways: (1) recognition of the cleavage sites against the background of all other sequence positions, and (2) classification of amino acids as belonging to the signal peptide or not. Sequence data was presented to the neural networks using a sliding window technique: a window of residues is presented to the neural network and the network is trained to predict the state of the central residue. The sliding window approach is remarkably successful at capturing sequence features correlated over long stretches of residues. [81] The window is then moved along the amino acid sequence and predictions are made in turn for each successive residue. Window sizes used ranged from 27 residues for the SP networks to 56 residues for the cTP networks.
A data set consisting of 269 SP, 368 mTP and 141 cTP sequences (for the plant version of TargetP), and 715 SP and 371 mTP sequences (for the non-plant version) was used to train pairwise feed-forward neural networks to accurately identify each type of targeting presequence. The scores for the 100 N-terminal residues were then fed to the second layer integrating network which determines the type of N-terminal targeting peptide. From a TargetP analysis of Arabidopsis thaliana and Homo sapiens, 10% of all plant proteins were estimated to be mitochondrial and 14% chloroplastic, and the abundance of secretory proteins in both Arabidopsis and Homo was estimated to be 10%.
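The sliding window presentation of sequence data can be sketched as follows. This is an illustrative sketch only, not TargetP code: the function names, the `-` padding character, the explicit left/right flank parameters, and the sparse 20-letter one-hot encoding are all assumptions chosen for clarity.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def sliding_windows(sequence, left, right, pad="-"):
    """Yield (position, window) pairs, one window per residue.

    Each window holds `left` residues before and `right` residues after the
    residue whose state is to be predicted; sequence ends are padded.
    """
    padded = pad * left + sequence + pad * right
    width = left + 1 + right
    for i in range(len(sequence)):
        yield i, padded[i:i + width]


def one_hot(window):
    """Encode a window as a flat 0/1 list, 20 values per position (padding is all zeros)."""
    vec = []
    for aa in window:
        vec.extend(1 if aa == ref else 0 for ref in AMINO_ACIDS)
    return vec


for pos, window in sliding_windows("MKTLL", left=1, right=1):
    print(pos, window, sum(one_hot(window)))
```

With `left=13` and `right=13` this yields the 27-residue windows mentioned for the SP networks; each encoded window would then serve as one input pattern for a network predicting the state of the residue at `pos`.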
Fig. 2: TargetP localization predictor architecture. TargetP is built from two layers of feed-forward neural networks, and on top a decision making unit, taking into account cutoff restrictions (if opted for) and outputting a prediction and a reliability class, RC, which is an indication of prediction certainty (see the text). The nonplant version lacks the cTP network unit in the first layer and does not have cTP as a prediction possibility.
PredictNLS: predicting nuclear localization signals. Over the last few years a large number of distinct NLSs have been experimentally implicated in nuclear transport. [33, 79] NLSdb [82] is the largest publicly available database of experimental NLSs. However, known experimental NLSs can account for fewer than 10% of known nuclear proteins. To remedy this, PredictNLS [62] uses a procedure of 'in silico mutagenesis' to discover new NLSs. Briefly, this procedure works as follows: (I) Change or remove some residues from the experimentally characterized NLS motifs and monitor the resulting true (nuclear) and false (non-nuclear) matches. Obviously, allowing alternative residues at particular positions increased the number of nuclear proteins found. However, often this also increased the number of matching non-nuclear proteins. (II) Discard any potential NLSs that are found in known non-nuclear proteins (false matches). (III) Require that potential NLSs be found in at least two distinct nuclear protein families. The 194 potential NLSs discovered using this procedure increased the coverage of known nuclear proteins to 43%. All proteins in the PDB [83] and SWISS-PROT databases were annotated using the full list of experimental and potential NLSs. NLSdb contains over 6,000 predicted nuclear proteins and their targeting signals from the PDB and SWISS-PROT databases. The database also contains over 12,500 predicted nuclear proteins from six entirely sequenced eukaryotic proteomes (Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans, Arabidopsis thaliana, and Saccharomyces cerevisiae). Approximately 20% of the NLS motifs were observed to co-localize with experimentally determined DNA-binding regions of proteins. [84, 62] This observation was used to also annotate over 1,500 DNA-binding proteins. We also annotated all sequences in the yeast, worm, fruit fly and human proteomes.
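Steps (II) and (III) of the procedure above amount to a filter over candidate motifs, which the following sketch illustrates. It is an assumption-laden toy, not the PredictNLS code: motifs are represented as regular expressions, the candidate variants of step (I) are supplied by hand, and the sequences, family labels, and function names are invented for the example.

```python
import re


def evaluate_motif(pattern, proteins):
    """Count true (nuclear) and false (non-nuclear) matches of one candidate motif.

    `proteins` is a list of (sequence, is_nuclear, family_id) tuples.
    Returns (true_matches, false_matches, set of nuclear families matched).
    """
    regex = re.compile(pattern)
    true_matches, false_matches, families = 0, 0, set()
    for sequence, is_nuclear, family in proteins:
        if regex.search(sequence):
            if is_nuclear:
                true_matches += 1
                families.add(family)
            else:
                false_matches += 1
    return true_matches, false_matches, families


def select_potential_nls(candidates, proteins, min_families=2):
    """Keep candidates with no non-nuclear matches that occur in >= min_families nuclear families."""
    accepted = []
    for pattern in candidates:
        _, false_matches, families = evaluate_motif(pattern, proteins)
        if false_matches == 0 and len(families) >= min_families:
            accepted.append(pattern)
    return accepted


# Toy data: (sequence, is_nuclear, family); all values are made up.
proteins = [
    ("MAKKRKVL", True, "famA"),
    ("GGKKKKAA", True, "famB"),
    ("TTKRKRTT", False, "famC"),
]
# Two hand-written variants standing in for the output of step (I).
candidates = ["KK[RK]K", "K[RK]K"]
print(select_potential_nls(candidates, proteins))
```

The shorter variant `K[RK]K` also matches the non-nuclear toy sequence and is discarded, which is the trade-off the text describes: relaxing a motif finds more nuclear proteins but tends to pick up false matches as well.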
Mining databases to annotate localization. Automatic text analysis methods can be classified into two broad categories: 1) extracting information directly from scientific literature and 2) inferring function from controlled vocabularies in protein databases. New experimental discoveries are first published in scientific journals. Mining scientific literature to automatically retrieve information is an appealing goal and a number of groups have worked on different aspects of this problem; machine-selection of articles of interest, [85] automated extraction of information using statistical methods [86, 87] and natural language processing techniques for extracting pathway information. [88, 89] However, usefulness of this class of methods for annotating protein function is hampered by a crucial bottleneck; the mapping of gene/protein names. [90, 37] To date no attempts have been made to directly annotate subcellular localization from scientific publications. The second class of methods has proved more successful for annotating function. Functional annotations in protein databases are mostly written in plain text using a rich biological vocabulary that often varies in different areas of research which makes it difficult to parse using computer programs. Additionally, databases like SWISS-PROT usually contain functional annotations at a very detailed level of biochemical function, e.g. a given sequence is annotated as a cdc2 kinase, but not as being involved in intra-cellular communication. [91] A number of text analysis tools have been implemented that infer various aspects of cellular function from database annotations of molecular function. Many methods explore the functional annotations in SWISS-PROT, especially the keyword annotations. [44, 92, 12, 93, 94] SWISS-PROT currently contains over 800 keyword functional descriptors. Semantic analysis of the keywords is used to categorise proteins into classes of cellular function. 
[95, 96] Both fully-automated and semi-automated methods have been applied to predicting subcellular localization. The fully-automatic methods extract rules from keywords by using statistical learning methods such as probabilistic Bayesian models, [97] symbolic rule learning [98] and M-ary (multiple category) classifiers like the k-Nearest Neighbour. [99] Some of the major methods in this category are LOCkey, [93] Proteome Analyst, [94] Spearmint [100, 101] and the SVM-based approach of Stapley et al. [102] The semi-automated methods are based on building dictionaries of rules. Keywords characteristic of each of the functional classes are first extracted from a set of classified example proteins. Using these keywords, a library of rules is created associating a certain pattern of occurrence of keywords with a functional class. The major methods in this category are EUCLID, [44] Meta_A [92] and RuleBase. [12] Function annotations from RuleBase and Spearmint have been integrated into UniProt, [50] which is the world's most comprehensive catalog of information on proteins. Below we review the LOCkey algorithm for predicting subcellular localization.
LOCkey: information theory based classifier. The LOCkey system [93] is a novel M-ary classifier which predicts the subcellular localization of a protein based on SWISS-PROT keywords. The LOCkey algorithm can be divided into two steps (Fig. 3): (1) building data sets of trusted vectors for known proteins, and (2) classifying unknown proteins. First, a list of keywords is extracted from SWISS-PROT for all proteins with known subcellular localization. Most proteins have between two and five keywords. A data set of binary vectors [103] is generated for each protein by representing the presence of a certain keyword in the protein by 1 and its absence by 0. Second, to infer subcellular localization of an unknown protein U, all keywords for U are read from SWISS-PROT. These keywords are translated into a binary keyword vector. From this original keyword vector, LOCkey generates a set of all possible combinations of alternative vectors by flipping vector components of value 1 (presence of keyword) to 0 in all possible combinations. For example, for a protein with three keywords, there are 2^3 - 1 = 7 possible sub-vectors: 111, 110, 101, 011, 100, 010 and 001. These sub-vectors constitute all possible keyword combinations for protein U. The keyword combination, i.e. sub-vector, that yields the best classification of U into one of ten classes of subcellular localizations is then found. This is done by retrieving all exact matches of each of the sub-vectors to any of the proteins in the trusted set, i.e. by finding all proteins in the trusted set that contain all the keywords present in the sub-vector. By construction, the proteins retrieved in this way may also contain keywords not found in U. The next task is to estimate the 'surprise value' of the given assignment. Toward this end, LOCkey simply compiles the number of proteins belonging to each type of subcellular localization.
This procedure is repeated in turn for each of the sub-vectors and localization is finally assigned to a protein by minimizing an entropy based objective function. The system accurately solves the classification problem when the number of data points (proteins) and dimensionality of the feature space (number of keywords) are not too large. LOCkey reached a level of more than 82% accuracy in a full cross-validation test. However, due to a lack of functional annotations, the coverage was low and the system failed to infer localization for more than half of all proteins in the test set. For five entirely sequenced proteomes, namely yeast, worm, fly, plant (Arabidopsis thaliana) and a subset of all human proteins, the LOCkey system automatically found about 8000 new annotations about subcellular localization. LOCkey has been optimized to provide fast annotations. Annotating the entire worm proteome took less than four hours on a PIII 900 MHz machine. The algorithm is limited to problems with relatively few data points (proteins) in the vector set (n << 1,000,000; << 10,000).
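The sub-vector enumeration and entropy-based selection can be sketched as below. This is a simplified illustration rather than the LOCkey objective function: plain Shannon entropy of the localization counts stands in for the published criterion, keyword sets replace binary vectors, ties are resolved in favour of the larger keyword combination encountered first, and the keyword strings and function names are invented for the example.

```python
from collections import Counter
from itertools import combinations
from math import log2


def sub_vectors(keywords):
    """Enumerate all 2^n - 1 non-empty keyword combinations, largest first."""
    kws = sorted(keywords)
    for size in range(len(kws), 0, -1):
        for combo in combinations(kws, size):
            yield frozenset(combo)


def entropy(counts):
    """Shannon entropy (bits) of a Counter of localization labels."""
    total = sum(counts.values())
    return sum((c / total) * log2(total / c) for c in counts.values())


def classify(query_keywords, trusted):
    """Return (entropy, localization, keyword subset) for the lowest-entropy sub-vector.

    `trusted` is a list of (keyword_set, localization) pairs. A trusted protein
    matches a sub-vector when it contains all keywords in that sub-vector.
    Returns None if no sub-vector matches any trusted protein.
    """
    best = None
    for subset in sub_vectors(query_keywords):
        counts = Counter(loc for kws, loc in trusted if subset <= kws)
        if not counts:
            continue
        h = entropy(counts)
        if best is None or h < best[0]:
            best = (h, counts.most_common(1)[0][0], subset)
    return best


# Toy trusted set; keywords and localizations are made up.
trusted = [
    ({"DNA-binding", "Nuclear protein"}, "nucleus"),
    ({"DNA-binding", "Transcription"}, "nucleus"),
    ({"Transcription", "Kinase"}, "cytoplasm"),
    ({"Kinase"}, "cytoplasm"),
]
print(classify({"DNA-binding", "Transcription"}, trusted))
```

Because the number of sub-vectors grows as 2^n - 1 with the number of query keywords, exhaustive enumeration of this kind stays cheap only while proteins carry a handful of keywords, as is typical according to the text.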
Fig. 3: The LOCkey algorithm. A sequence unique data set of localization annotated SWISS-PROT proteins was first compiled. Keywords were extracted for these proteins and merged with any keywords found in homologues. The keywords were represented as binary vectors in the 'Trusted Vector Set'. An unknown query was first annotated with keywords through identification of SWISS-PROT homologues. Keywords for the query were represented as binary vectors. All possible keyword combinations were constructed (the SUB vectors). The best matching vector was found based on entropy criteria (see methods). This vector was used to infer localization for the query.
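The two LOCkey steps can be illustrated with a minimal sketch. The text does not give the exact objective function, so the Shannon entropy of the localization counts, the tie-break by number of retrieved proteins, and the majority-class assignment used here are assumptions, as are the function names and the toy trusted set.

```python
from collections import Counter
from itertools import combinations
from math import log2


def sub_vectors(keywords):
    """All non-empty keyword subsets of the query (2**n - 1 of them)."""
    kws = sorted(keywords)
    return [frozenset(c) for r in range(1, len(kws) + 1)
            for c in combinations(kws, r)]


def entropy(counts):
    """Shannon entropy (bits) of a table of localization counts."""
    total = sum(counts.values())
    return sum((n / total) * log2(total / n) for n in counts.values())


def classify(query_keywords, trusted):
    """trusted: list of (keyword_set, localization) pairs.

    Returns (localization, entropy, sub_vector), or None if no sub-vector
    matches any trusted protein (no prediction, i.e. reduced coverage).
    """
    best = None
    for sub in sub_vectors(query_keywords):
        # Retrieve trusted proteins containing every keyword of the sub-vector.
        counts = Counter(loc for kws, loc in trusted if sub <= kws)
        if not counts:
            continue
        # Assumption: lower entropy wins; ties go to the larger retrieved set.
        key = (entropy(counts), -sum(counts.values()))
        if best is None or key < best[0]:
            best = (key, counts.most_common(1)[0][0], sub)
    if best is None:
        return None
    return best[1], best[0][0], best[2]


# Toy trusted set (illustrative only, not real annotations).
trusted = [
    ({"DNA-binding", "Nuclear protein"}, "nuclear"),
    ({"DNA-binding", "Transcription regulation"}, "nuclear"),
    ({"Glycoprotein", "Signal"}, "extra-cellular"),
    ({"Glycoprotein", "Transmembrane"}, "membrane"),
]
result = classify({"DNA-binding", "Glycoprotein", "Zinc-finger"}, trusted)
```

In this toy case the sub-vector containing only "DNA-binding" retrieves two nuclear proteins (entropy 0), whereas "Glycoprotein" alone retrieves proteins from two different classes (entropy 1 bit), so the query is assigned to the nucleus. Because the number of sub-vectors grows as 2^n - 1, exhaustive enumeration is only practical for proteins with few keywords.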
Ab initio methods predict localization for all proteins at lower accuracy. The breakthrough for ab initio prediction came from the pioneering work of Nishikawa et al. [104, 105] They observed that the total amino acid composition of a protein is correlated with its subcellular localization. An explanation for this observation was provided by Andrade et al. [106] who observed that the signal for subcellular localization was almost entirely due to the surface residues. Throughout evolution each subcellular compartment has maintained its characteristic physico-chemical environment, so it is not very surprising that protein surfaces have evolved to adapt to these conditions. A wide array of methods has been developed to exploit this correlation of subcellular localization with sequence composition. The first tool to use amino acid composition was the PSORT expert system from Nakai et al. [107] which used standard statistical methods. However, it is only with the recent application of machine learning techniques that composition based methods have started approaching the prediction accuracy of other methods. One of the earliest methods to use a machine learning approach was the NNPSL predictor, [108] which used feed forward neural networks (NNs) trained on the amino acid composition. The network classified proteins from eukaryotic organisms into one of four possible subcellular compartments with an accuracy of 66% and prokaryotic proteins into one of three compartments with an accuracy of 81%. Reinhardt and Hubbard [108] also showed that the neural network predictions were fairly insensitive to sequencing errors near the N-terminus, which adds weight to the predictions. Hua et al. [109] showed that support vector machines (SVMs) are even better at predicting localization from the amino acid composition, since SVMs are in general better at extracting correlations when the data set is relatively small and noisy. [110]
By training SVMs on the data set of Reinhardt et al. [108] their SubLoc system was able to improve prediction accuracy by over 13%. Park and Kanehisa [111] have shown that adding residue pair compositions to the amino acid composition can improve prediction accuracy by over 5%. Their PLOC system classifies proteins into one of nine subcellular compartments with an accuracy of over 79%. Cai et al. [112, 113, 114] have tried to incorporate higher order correlations among the amino acid residues (residues i and (i+n), n=2,3,4) by using pseudo-amino acid composition. The pseudo-amino acid composition accounts for sequence-order effects by defining a correlation factor, based on various biochemical properties, for every residue and its sequence neighbours. However, these methods are not publicly available and their prediction accuracy is hard to assess. With the availability of large numbers of completely sequenced genomes, phylogenetic profiles have been employed to identify subcellular localization. [115] So far, this approach has been much less accurate in predicting localization than methods based solely on composition. By incorporating predicted secondary structure, solvent accessibility and amino acid composition along with evolutionary information into a multi-level neural network architecture, Nair et al. [116] were able to significantly improve prediction accuracy over existing methods. Their LOCnet system is one of the most accurate ab initio methods for predicting localization from sequence.
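The composition features that these methods feed into a neural network or SVM are simple to compute. The sketch below builds the 20-dimensional amino acid composition and a 400-dimensional residue pair composition. The function names and the `gap` parameter (pairs of residues i and i + gap + 1) are illustrative assumptions and do not reproduce the exact encoding of any of the cited tools.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """20-dimensional vector of amino acid frequencies."""
    residues = [a for a in seq.upper() if a in AMINO_ACIDS]
    n = len(residues)
    if n == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [residues.count(a) / n for a in AMINO_ACIDS]


def pair_composition(seq, gap=0):
    """400-dimensional vector of frequencies of residue pairs (i, i + gap + 1).

    gap=0 gives adjacent pairs; larger gaps capture longer-range correlations.
    """
    residues = [a for a in seq.upper() if a in AMINO_ACIDS]
    step = gap + 1
    counts = {pair: 0 for pair in product(AMINO_ACIDS, repeat=2)}
    for a, b in zip(residues, residues[step:]):
        counts[(a, b)] += 1
    total = max(len(residues) - step, 0)
    return [counts[pair] / total if total else 0.0 for pair in counts]


# Feature vector for a (made-up) sequence: 20 + 400 = 420 values.
features = composition("MKKLAVALC") + pair_composition("MKKLAVALC")
```

Both vectors ignore where in the sequence a residue occurs, which is why such features remain usable when the N-terminus is missing or wrongly annotated.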
LOCnet: improving predictions using evolution. The LOCnet [116] system consists of three layers of neural networks and sorts proteins into one of four subcellular classes ( Fig. 4 ). The first layer consists of dedicated neural networks that use particular features from protein sequences, alignments, and structure to pre-sort proteins into L/not-L (where L = cytoplasmic, nuclear, extra-cellular, mitochondrial). The outputs from the first layer networks, which are trained on different sequence features, are combined using a second layer of networks. The third layer uses a simple jury decision [117] to assign one of four localization-states to each protein. Major sources of improvement over publicly available methods originated from using predicted secondary structure (from PROFsec [118] ), improved predictions of solvent accessibility (from PROFacc [118] ), and evolutionary information from sequence profiles. LOCnet has a module that implicitly predicts generic signal peptides (but not the cleavage sites) and target peptides. [119] Although LOCnet performs better for extra-cellular proteins with signal peptides, it can also identify proteins that are secreted using alternative pathways, such as fibroblast growth factors and the interleukin family of cytokines. In combination with other methods, it can distinguish between proteins with signal peptides that are retained in the endoplasmic reticulum or Golgi apparatus and those that are actually secreted. [54] LOCnet was found to be over 7% more accurate than the best publicly available system [119] on an independent test set of newly annotated proteins in the SWISS-PROT database. The LOCnet system has been applied to annotate subcellular localization for all proteins in the PDB [120] and in TargetDB. [119] TargetDB [121] is a database of structural genomics targets and provides registration and tracking information for the NIH structural genomics centers.
Fig. 4: Neural network architecture of LOCnet. The first-level pairwise neural networks use an architecture of 20-60 input units and 2 output units, with a hidden layer of 3-9 units. The outputs from the different first-level pairwise neural networks are used as input to the second-level integrating neural networks. The second-level pairwise networks consist of 6 input units and 2 output units, with a hidden layer of 3 units. The final localization prediction is based on a jury decision over the outputs from the different pairwise integrating networks.
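The final jury step can be sketched as follows. The exact rule is not spelled out here, so this example assumes that each integrating network emits an (L, not-L) output pair, that the jury averages the difference between the two outputs for each class, and that the class with the largest average wins. The function name and the numbers are hypothetical.

```python
CLASSES = ("cytoplasmic", "nuclear", "extra-cellular", "mitochondrial")


def jury_decision(outputs):
    """outputs: {class: [(score_L, score_notL), ...]}, one pair per network.

    Returns the winning class and the averaged margin for every class.
    """
    margins = {}
    for loc in CLASSES:
        pairs = outputs[loc]
        # Average of (L - not-L) over all networks voting on this class.
        margins[loc] = sum(yes - no for yes, no in pairs) / len(pairs)
    winner = max(margins, key=margins.get)
    return winner, margins


# Hypothetical network outputs for one protein.
example_outputs = {
    "cytoplasmic": [(0.30, 0.70)],
    "nuclear": [(0.85, 0.15), (0.75, 0.25)],
    "extra-cellular": [(0.10, 0.90)],
    "mitochondrial": [(0.40, 0.60)],
}
predicted, class_margins = jury_decision(example_outputs)
```

The margin of the winning class can double as a rough reliability index: a small margin signals that the networks disagree.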
Improving accuracy through combinations. The different strategies for predicting localization have their own strengths and weaknesses. High accuracy methods, such as those based on sequence motifs and homology, are plagued by low coverage and can provide annotations for less than one third of known sequences. In this era of whole-genome sequencing, what is needed are high quality annotations for all proteins in an organism. Currently the best solution available is to combine low coverage methods with state-of-the-art high coverage methods, such as those based on composition. This approach was pioneered by Nakai et al. [122, 123, 61] with their PSORT system. PSORT II is an expert system that combines a comprehensive database of sorting signals with predictions based on composition. The LOCtarget [119] system combines predictions based on sequence motifs, homology, text analysis and neural networks and can distinguish between nine localization classes. In its current implementation, the only sequence motifs used by LOCtarget are those responsible for sorting to the nucleus. Drawid & Gerstein [124] have proposed a system that uses Bayesian statistics to integrate 30 different features, ranging from SignalP predictions to microarray expression profiles. They applied their method to predicting localization for the full Saccharomyces cerevisiae proteome and provide estimates of the fraction of all yeast proteins found in different compartments. Below we review PSORT II, which is one of the most widely used methods for predicting localization.
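One simple way to combine low coverage and high coverage methods is a priority cascade: try the most accurate method first and fall back to the next one whenever a method has no answer. The sketch below illustrates only this idea; it is not the combination scheme of PSORT II or LOCtarget, and both predictors are toy stand-ins (a single nuclear localization motif from the SV40 large T antigen, and a placeholder for a trained composition-based model).

```python
def motif_predictor(seq):
    # High accuracy, low coverage: answers only when the example motif is found.
    return "nuclear" if "PKKKRKV" in seq else None


def composition_predictor(seq):
    # Placeholder for a trained composition-based model that always answers.
    return "cytoplasmic"


PREDICTORS = [
    ("motif", motif_predictor),
    ("composition", composition_predictor),
]


def combine(predictors, seq):
    """Return (localization, method) from the first predictor that answers."""
    for name, predict in predictors:
        loc = predict(seq)
        if loc is not None:
            return loc, name
    return None, None


with_motif = combine(PREDICTORS, "MAPKKKRKVEDP")
without_motif = combine(PREDICTORS, "MAAALLGGSS")
```

Reporting the method name alongside the prediction lets a user judge how much to trust each annotation, since the expected accuracy differs between the levels of the cascade.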
PSORT II: expert system for predicting localization. The PSORT system [61] predicts the localization of proteins from gram-negative bacteria, gram-positive bacteria, yeasts, animals and plants. For a query sequence, the program calculates the values of feature variables that reflect various characteristics of the sequence ( Table 2 ). Next, it uses the k-nearest-neighbor algorithm to interpret the set of values obtained and estimates the likelihood of the protein being sorted to each candidate site. Finally, it displays some of the most probable sites. The program achieved an overall prediction accuracy of 57% and can distinguish eleven subcellular classes. One reason for the lower accuracy of PSORT is our currently incomplete knowledge of sorting signals. Extensions to PSORT II have been proposed: iPSORT [125] for extensive feature detection of N-terminal sorting signals and PSORT-B [53] for predicting localization of proteins from gram-negative bacteria.
| Feature | Criteria |
| N-terminal signal peptide | Modified McGeoch's method and the cleavage-site consensus |
| Mitochondrial-targeting signal | Amino acid composition of the N-terminal 20 residues and some weak cleavage site consensus |
| Nuclear-localization signals | Combined score for various empirical rules |
| ER-lumen-retention signal | The KDEL-like motif at the C-terminus |
| ER-membrane-retention signal | Motifs: XXRR-like (N-terminal) or KKXX-like (C-terminal) |
| Peroxisomal-targeting signal | PTS1 motif at the C-terminus and the PTS2 motif |
| Vacuolar-targeting signal | [TIK]LP[NKI] motif |
| Golgi-transport signal | The YQRL motif (preferentially at the cytoplasmic tail) |
| Tyrosine-containing motif | Number of tyrosine residues in the cytoplasmic tail |
| Dileucine motif | At the cytoplasmic tail |
| Membrane span(s)/topology | Maximum hydrophobicity and the number of predicted spans; charge difference across the most N-terminal transmembrane segment |
| RNA-binding motif | RNP-1 motif |
| Actinin-type actin-binding motifs | From PROSITE |
| DNA-binding motifs | 63 motifs from PROSITE |
| Ribosomal-protein motifs | 71 motifs from PROSITE |
| Prokaryotic DNA-binding motifs | 33 motifs from PROSITE |
| N-myristoylation motif | At the N-terminus |
| Amino acid composition | Neural network score that discriminates between cytoplasmic and nuclear proteins |
| Coiled coil structure length | Number of residues in the predicted coiled-coil state |
| Length | Length of sequence |
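The k-nearest-neighbor step can be sketched as below: the feature values of a query are compared with those of proteins of known localization, and the fraction of neighbors at each site serves as a likelihood estimate. The Euclidean distance, the value k = 3, the two-feature vectors and the function name are assumptions chosen for illustration and are not PSORT II's actual settings.

```python
from collections import Counter
from math import dist


def knn_site_likelihoods(query, training, k=3):
    """training: list of (feature_vector, site) pairs.

    Returns {site: fraction of the k nearest neighbors}, most likely first.
    """
    nearest = sorted(training, key=lambda item: dist(query, item[0]))[:k]
    votes = Counter(site for _, site in nearest)
    return {site: n / len(nearest) for site, n in votes.most_common()}


# Toy training set with two made-up feature values per protein.
training = [
    ([0.9, 0.1], "nuclear"),
    ([0.8, 0.2], "nuclear"),
    ([0.1, 0.9], "cytoplasmic"),
    ([0.2, 0.8], "cytoplasmic"),
    ([0.5, 0.5], "mitochondrial"),
]
likelihoods = knn_site_likelihoods([0.85, 0.15], training)
```

In practice the features of Table 2 live on very different scales (motif counts, scores, sequence length), so they would need to be normalized before any distance is computed.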
Several pitfalls in assessing quality of annotations. To draw reliable inferences from a prediction, it is essential that the accuracy of the method be properly established. To obtain accurate estimates of performance, the testing procedure should mimic a blind prediction exercise as far as possible. One way of ensuring this is to choose the training data such that the test sequences have no sequence similarity to proteins in the training set. However, this is often not the case, and many methods test their performance only on a small sample of selected proteins, resulting in overestimates of prediction accuracy. Another problem that affects prediction accuracy is the number of redundant sequences in public databases. Adequate care must be taken during development to avoid biasing predictions towards large families of redundant protein sequences by using sequence unique test sets. Otherwise, estimated accuracy is likely to be much higher than the true prediction accuracy. Benchmarking prediction methods is difficult, since the methods have been developed at different times and database annotations of function are constantly growing. In addition, there are no standard procedures for reporting prediction accuracy. Some methods report only the overall prediction accuracy, which can be quite uninformative due to the large differences in the sizes of the data sets for the different subcellular classes. Functional annotations in standard databases usually contain large numbers of incorrect annotations, which makes development of prediction tools all the more difficult. Another problem without any obvious solution is choosing an appropriate tradeoff between sensitivity and specificity. Depending on the application, either high specificity or high sensitivity might be desirable. 
Hence, caution should be exercised when using predictions from automatic servers especially in cases where little is known about the function of the protein and the sequence signals that are involved in sorting. It is sometimes instructive to compare predictions from multiple servers which use different prediction strategies. Similar predictions from the servers might indicate some propensity of the protein for the predicted localization, while conflicting predictions might call for further research.
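A small calculation shows why overall accuracy alone can mislead when the classes differ greatly in size. The sketch below uses the standard definitions sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). Some localization papers instead report TP / (TP + FP) under the name specificity, so the definition in use should always be checked. The function name and the data are made up for illustration.

```python
def per_class_metrics(true_labels, predicted_labels):
    """Return overall accuracy and per-class sensitivity and specificity."""
    pairs = list(zip(true_labels, predicted_labels))
    overall = sum(t == p for t, p in pairs) / len(pairs)
    report = {}
    for c in sorted(set(true_labels) | set(predicted_labels)):
        tp = sum(t == c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        tn = len(pairs) - tp - fn - fp
        report[c] = {
            "sensitivity": tp / (tp + fn) if tp + fn else None,
            "specificity": tn / (tn + fp) if tn + fp else None,
        }
    return overall, report


# 90 nuclear and 10 mitochondrial proteins; a useless predictor that
# assigns every protein to the majority class.
truth = ["nuclear"] * 90 + ["mitochondrial"] * 10
predictions = ["nuclear"] * 100
overall_accuracy, class_report = per_class_metrics(truth, predictions)
```

Here the overall accuracy is 90% even though not a single mitochondrial protein is recognized, which is why per-class numbers on sequence unique test sets are more informative than a single overall figure.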
Prediction accuracy continues to grow. In spite of the difficulties in correctly assessing the accuracy of prediction methods, significant strides have been made during the last few years in tackling the problem of subcellular localization prediction. One reason is the application of advanced machine learning techniques, which can recognize subtle correlations among different kinds of sequence features. The second reason is the steady growth in the amount of functional information deposited in databases. Prediction tools are already proving useful for automatic annotation of sequence databases and for screening potentially interesting genes from genome data. In the near future it might be possible to predict the subcellular location of almost any given protein with high confidence. Future improvements are likely to result from integrated prediction methods that cleverly combine the output from programs that predict different functional features to provide a comprehensive prediction of subcellular localization. Integrated prediction methods better capture biological reality, since events affecting the fate of proteins are interrelated. For example, it is evident that a modification enzyme will not modify its potential substrates when a membrane separates them. Moreover, combination methods can be designed to fall naturally into an ontological scheme, which would help us achieve the goal of a unified framework for protein function prediction.
Thanks to the following members of our group: Jinfeng Liu, Dariusz Przybylski and Kazimierz Wrzeszczynski for helpful discussions. RN would also like to thank Christina Schlecht for proofreading the manuscript. Last, but not least, thanks to all those who deposit their experimental data in public databases, to those who maintain these databases, and to the World Wide Web for making so many resources easily accessible.
| 1. | Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J. et al. (2001). The sequence of the human genome. Science, 291, 1304-51. |
| 2. | Istrail, S., Sutton, G. G., Florea, L., Halpern, A. L., Mobarry, C. M. et al. (2004). Whole-genome shotgun assembly and comparison of human genome assemblies. Proc Natl Acad Sci U S A, 101, 1916-21. |
| 3. | Liu, J. & Rost, B. (2001). Comparing function and structure between entire proteomes. Protein Science, 10, 1970-1979. |
| 4. | Carter, P., Liu, J. & Rost, B. (2003). PEP: Predictions for Entire Proteomes. Nucleic Acids Res, 31, 410-3. |
| 5. | Pruess, M., Fleischmann, W., Kanapin, A., Karavidopoulou, Y., Kersey, P. et al. (2003). The Proteome Analysis database: a tool for the in silico analysis of whole proteomes. Nucleic Acids Res, 31, 414-7. |
| 6. | Zhu, H., Bilgin, M. & Snyder, M. (2003). Proteomics. Annu Rev Biochem, 72, 783-812. |
| 7. | Brutlag, D. L. (1998). Genomics and computational molecular biology. Curr Opin Microbiol, 1, 340-5. |
| 8. | Harrison, P. M., Bamborough, P., Daggett, V., Prusiner, S. & Cohen, F. E. (1997). The prion folding problem. Current Opinion in Structural Biology, 7, 53-59. |
| 9. | Bork, P. & Koonin, E. V. (1998). Predicting functions from protein sequences--where are the bottlenecks? Nat Genet, 18, 313-8. |
| 10. | Smith, T. F. (1998). Functional genomics--bioinformatics is ready for the challenge. Trends Genet, 14, 291-3. |
| 11. | Bork, P., Dandekar, T., Diaz-Lazcoz, Y., Eisenhaber, F., Huynen, M. et al. (1998). Predicting function: from genes to genomes and back. J Mol Biol, 283, 707-25. |
| 12. | Fleischmann, W., Moller, S., Gateau, A. & Apweiler, R. (1999). A novel method for automatic functional annotation of proteins. Bioinformatics, 15, 228-33. |
| 13. | Luscombe, N. M., Greenbaum, D. & Gerstein, M. (2001). What is bioinformatics? A proposed definition and overview of the field. Methods Inf Med, 40, 346-58. |
| 14. | Rost, B., Liu, J., Nair, R., Wrzeszczynski, K. O. & Ofran, Y. (2003). Automatic prediction of protein function. Cell Mol Life Sci, 60, 2637-50. |
| 15. | Apweiler, R., Attwood, T. K., Bairoch, A., Bateman, A., Birney, E. et al. (2000). InterPro--an integrated documentation resource for protein families, domains and functional sites. Bioinformatics, 16, 1145-50. |
| 16. | Overbeek, R., Larsen, N., Smith, W., Maltsev, N. & Selkov, E. (1997). Representation of function: the next step. Gene, 191, GC1-GC9. |
| 17. | Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H. et al. (2000). Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 25, 25-9. |
| 18. | Ohlstein, E. H., Ruffolo, R. R., Jr. & Elliott, J. D. (2000). Drug discovery in the next millennium. Annu Rev Pharmacol Toxicol, 40, 177-91. |
| 19. | Maliepaard, M., Scheffer, G. L., Faneyte, I. F., van Gastelen, M. A., Pijnenborg, A. C. et al. (2001). Subcellular localization and distribution of the breast cancer resistance protein transporter in normal human tissues. Cancer Res, 61, 3458-64. |
| 20. | Clark, H. F., Gurney, A. L., Abaya, E., Baker, K., Baldwin, D. et al. (2003). The secreted protein discovery initiative (SPDI), a large-scale effort to identify novel human secreted and transmembrane proteins: a bioinformatics assessment. Genome Res, 13, 2265-70. |
| 21. | Bork, P., Ouzounis, C. & Sander, C. (1994). From genome sequences to protein function. Current Opinion in Structural Biology, 4, 393-403. |
| 22. | Kumar, A., Agarwal, S., Heyman, J. A., Matson, S., Heidtman, M. et al. (2002). Subcellular localization of the yeast proteome. Genes Dev, 16, 707-19. |
| 23. | Huh, W. K., Falvo, J. V., Gerke, L. C., Carroll, A. S., Howson, R. W. et al. (2003). Global analysis of protein localization in budding yeast. Nature, 425, 686-91. |
| 24. | Kleffmann, T., Russenberger, D., von Zychlinski, A., Christopher, W., Sjolander, K. et al. (2004). The Arabidopsis thaliana chloroplast proteome reveals pathway abundance and novel protein functions. Curr Biol, 14, 354-62. |
| 25. | Davis, T. N. (2004). Protein localization in proteomics. Curr Opin Chem Biol, 8, 49-53. |
| 26. | Koonin, E. V. (2000). Bridging the gap between sequence and function. Trends Genet, 16, 16. |
| 27. | Eisenhaber, F. & Bork, P. (1998). Wanted: subcellular localization of proteins based on sequence. Trends in Cell Biology, 8, 169-170. |
| 28. | Nakai, K. (2000). Protein sorting signals and prediction of subcellular localization. Adv Protein Chem, 54, 277-344. |
| 29. | Schneider, G. & Fechner, U. (2004). Advances in the prediction of protein targeting signals. Proteomics, 4, 1571-80. |
| 30. | Lodish, H., Berk, A., Baltimore, D. & Darnell, J. (2000). Molecular Cell Biology. W H Freeman & Co, New York. |
| 31. | Bar-Peled, M., Bassham, D. C. & Raikhel, N. V. (1996). Transport of proteins in eukaryotic cells: more questions ahead. Plant Mol Biol, 32, 223-49. |
| 32. | Schatz, G. & Dobberstein, B. (1996). Common principles of protein translocation across membranes. Science, 271, 1519-26. |
| 33. | Mattaj, I. W. & Englmeier, L. (1998). Nucleocytoplasmic transport: the soluble phase. Annu Rev Biochem, 67, 265-306. |
| 34. | Bauer, M. F., Hofmann, S., Neupert, W. & Brunner, M. (2000). Protein translocation into mitochondria: the role of TIM complexes. Trends in Cell Biology, 10, 25-31. |
| 35. | Darnell, J., Lodish, H. & Baltimore, D. (1990). Molecular cell biology. Freeman, New York. |
| 36. | Devos, D. & Valencia, A. (2001). Intrinsic errors in genome annotation. Trends in Genetics, 17, 429-431. |
| 37. | Valencia, A. & Pazos, F. (2002). Computational methods for the prediction of protein interactions. Curr Opin Struct Biol, 12, 368-73. |
| 38. | Nakai, K. (2001). Review: prediction of in vivo fates of proteins in the era of genomics and proteomics. J Struct Biol, 134, 103-16. |
| 39. | Apweiler, R., Gateau, A., Contrino, S., Martin, M. J., Junker, V. et al. (1997). Protein sequence annotation in the genome era: the annotation concept of SWISS-PROT+TREMBL. Proc Int Conf Intell Syst Mol Biol, 5, 33-43. |
| 40. | Bairoch, A. & Apweiler, R. (1997). The SWISS-PROT protein sequence data bank and its new supplement TrEMBL. Nucleic Acids Research, 25, 31-36. |
| 41. | Airozo, D., Allard, R., Brylawski, B., Canese, K., Kenton, D. et al. (1999). MEDLINE. 1999. |
| 42. | Simpson, J. C., Wellenreuther, R., Poustka, A., Pepperkok, R. & Wiemann, S. (2000). Systematic subcellular localization of novel proteins identified by large-scale cDNA sequencing. EMBO Rep, 1, 287-92. |
| 43. | Koonin, E. V., Tatusov, R. L. & Rudd, K. E. (1996). Protein sequence comparison at genome scale. Methods in Enzymology, 266, 295-322. |
| 44. | Tamames, J., Ouzounis, C., Casari, G., Sander, C. & Valencia, A. (1998). EUCLID: automatic classification of proteins in functional classes by their database annotations. Bioinformatics, 14, 542-3. |
| 45. | Thornton, J. M., Orengo, C. A., Todd, A. E. & Pearl, F. M. (1999). Protein folds, functions and evolution. J Mol Biol, 293, 333-42. |
| 46. | Orengo, C. A., Todd, A. E. & Thornton, J. M. (1999). From protein structure to function. Curr Opin Struct Biol, 9, 374-82. |
| 47. | Wilson, C. A., Kreychman, J. & Gerstein, M. (2000). Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores. J Mol Biol, 297, 233-49. |
| 48. | Pawlowski, K. & Godzik, A. (2001). Surface map comparison: studying function diversity of homologous proteins. J Mol Biol, 309, 793-806. |
| 49. | Rost, B. (2002). Enzyme function less conserved than anticipated. J Mol Biol, 318, 595-608. |
| 50. | Nair, R. & Rost, B. (2002). Sequence conserved for subcellular localization. Protein Sci, 11, 2836-47. |
| 51. | Doerks, T., Bairoch, A. & Bork, P. (1998). Protein annotation: detective work for function prediction. Trends Genet, 14, 248-50. |
| 52. | Galperin, M. Y. & Koonin, E. V. (2000). Who's your neighbor? New computational approaches for functional genomics. Nat Biotechnol, 18, 609-13. |
| 53. | Gardy, J. L., Spencer, C., Wang, K., Ester, M., Tusnady, G. E. et al. (2003). PSORT-B: Improving protein subcellular localization prediction for Gram-negative bacteria. Nucleic Acids Res, 31, 3613-7. |
| 54. | Wrzeszczynski, K. O. & Rost, B. (2004). Annotating proteins from endoplasmic reticulum and Golgi apparatus in eukaryotic proteomes. Cell Mol Life Sci, 61, 1341-53. |
| 55. | Sander, C. & Schneider, R. (1991). Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins, 9, 56-68. |
| 56. | Rost, B. (1999). Twilight zone of protein sequence alignments. Protein Eng, 12, 85-94. |
| 57. | Altschul, S. F. & Gish, W. (1996). Local alignment statistics. Methods in Enzymology, 266, 460-480. |
| 58. | Altschul, S., Madden, T., Shaffer, A., Zhang, J., Zhang, Z. et al. (1997). Gapped Blast and PSI-Blast: a new generation of protein database search programs. Nucleic Acids Research, 25, 3389-3402. |
| 59. | Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M. C., Estreicher, A. et al. (2003). The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res, 31, 365-70. |
| 60. | von Heijne, G. (1995). Protein sorting signals: simple peptides with complex functions. Exs, 73, 67-76. |
| 61. | Nakai, K. & Horton, P. (1999). PSORT: a program for detecting sorting signals in proteins and predicting their subcellular localization. Trends Biochem Sci, 24, 34-6. |
| 62. | Cokol, M., Nair, R. & Rost, B. (2000). Finding nuclear localization signals. EMBO Rep, 1, 411-5. |
| 63. | von Heijne, G. (1981). On the hydrophobic nature of signal sequences. Eur. J. Biochem., 116, 419-422. |
| 64. | von Heijne, G. (1985). Signal sequences. The limits of variation. J. Mol. Biol., 184, 99-105. |
| 65. | Voos, W., Martin, H., Krimmer, T. & Pfanner, N. (1999). Mechanisms of protein translocation into mitochondria. Biochim Biophys Acta, 1422, 235-54. |
| 66. | Bruce, B. D. (2000). Chloroplast transit peptides: structure, function and evolution. Trends Cell Biol, 10, 440-7. |
| 67. | Nielsen, H., Brunak, S. & von Heijne, G. (1999). Machine learning approaches for the prediction of signal peptides and other protein sorting signals. Protein Engineering, 12, 3-9. |
| 68. | Emanuelsson, O., Nielsen, H., Brunak, S. & von Heijne, G. (2000). Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J Mol Biol, 300, 1005-16. |
| 69. | Nielsen, H., Engelbrecht, J., Brunak, S. & von Heijne, G. (1997). A neural network method for identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. International Journal of Neural Systems, 8, 581-599. |
| 70. | Kall, L., Krogh, A. & Sonnhammer, E. L. (2004). A combined transmembrane topology and signal peptide prediction method. J Mol Biol, 338, 1027-36. |
| 71. | Fujiwara, Y., Asogawa, M. & Nakai, K. (1997). Prediction of Mitochondrial Targeting Signals Using Hidden Markov Model. Genome Inform Ser Workshop Genome Inform, 8, 53-60. |
| 72. | Emanuelsson, O., von Heijne, G. & Schneider, G. (2001). Analysis and prediction of mitochondrial targeting peptides. Methods Cell Biol, 65, 175-87. |
| 73. | Emanuelsson, O., Nielsen, H. & von Heijne, G. (1999). ChloroP, a neural network-based method for predicting chloroplast transit peptides and their cleavage sites. Protein Science, 8, 978-984. |
| 74. | Gaasterland, T. & Oprea, M. (2001). Whole-genome analysis: annotations and updates. Curr Opin Struct Biol, 11, 377-81. |
| 75. | Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C. et al. (2001). Initial sequencing and analysis of the human genome. Nature, 409, 860-921. |
| 76. | Durbin, R., Eddy, S. R., Krogh, A. & Mitchison, G. (1998). Biological Sequence Analysis. Cambridge University Press, Cambridge. |
| 77. | Emanuelsson, O. & von Heijne, G. (2001). Prediction of organellar targeting signals. Biochim Biophys Acta, 1541, 114-9. |
| 78. | Moroianu, J. (1999). Nuclear import and export: transport factors, mechanisms and regulation. Crit Rev Eukaryot Gene Expr, 9, 89-106. |
| 79. | Jans, D. A., Xiao, C. Y. & Lam, M. H. (2000). Nuclear targeting signal recognition: a key control point in nuclear transport? Bioessays, 22, 532-44. |
| 80. | Liu, J. & Rost, B. (2002). Target space for structural genomics revisited. Bioinformatics, 18, 922-33. |
| 81. | Qian, N. & Sejnowski, T. J. (1988). Predicting the secondary structure of globular proteins using neural network models. Journal of Molecular Biology, 202, 865-884. |
| 82. | Nair, R., Carter, P. & Rost, B. (2003). NLSdb: database of nuclear localization signals. Nucleic Acids Res, 31, 397-9. |
| 83. | Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N. et al. (2000). The Protein Data Bank. Nucleic Acids Research, 28, 235-42. |
| 84. | LaCasse, E. C. & Lefebvre, Y. A. (1995). Nuclear localization signals overlap DNA- or RNA-binding domains in nucleic acid-binding proteins. Nucleic Acids Res, 23, 1647-56. |
| 85. | Iliopoulos, I., Enright, A. J. & Ouzounis, C. A. (2001). Textquest: document clustering of Medline abstracts for concept discovery in molecular biology. Pac Symp Biocomput, 384-95. |
| 86. | Stapley, B. J. & Benoit, G. (2000). Biobibliometrics: information retrieval and visualization from co-occurrences of gene names in Medline abstracts. Pac Symp Biocomput, 529-40. |
| 87. | Stephens, M., Palakal, M., Mukhopadhyay, S., Raje, R. & Mostafa, J. (2001). Detecting gene relations from Medline abstracts. Pac Symp Biocomput, 483-95. |
| 88. | Ng, S. K. & Wong, M. (1999). Toward Routine Automatic Pathway Discovery from On-line Scientific Text Abstracts. Genome Inform Ser Workshop Genome Inform, 10, 104-112. |
| 89. | Friedman, C., Kra, P., Yu, H., Krauthammer, M. & Rzhetsky, A. (2001). GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics, 17 Suppl 1, S74-82. |
| 90. | Hatzivassiloglou, V., Duboue, P. A. & Rzhetsky, A. (2001). Disambiguating proteins, genes, and RNA in text: a machine learning approach. Bioinformatics, 17 Suppl 1, S97-106. |
| 91. | Apweiler, R. (2001). Functional information in SWISS-PROT: the basis for large-scale characterisation of protein sequences. Brief Bioinform, 2, 9-18. |
| 92. | Eisenhaber, F. & Bork, P. (1999). Evaluation of human-readable annotation in biomolecular sequence databases with biological rule libraries. Bioinformatics, 15, 528-35. |
| 93. | Nair, R. & Rost, B. (2002). Inferring sub-cellular localization through automated lexical analysis. Bioinformatics, 18 Suppl 1, S78-S86. |
| 94. | Lu, Z., Szafron, D., Greiner, R., Lu, P., Wishart, D. S. et al. (2004). Predicting subcellular localization of proteins using machine-learned classifiers. Bioinformatics, 20, 547-56. |
| 95. | Ouzounis, C., Casari, G., Sander, C., Tamames, J. & Valencia, A. (1996). Computational comparisons of model genomes. Trends in Biotechnology, 14, 280-285. |
| 96. | Andrade, M. A., Ouzounis, C., Sander, C., Tamames, J. & Valencia, A. (1999). Functional classes in the three domains of life. Journal of Molecular Evolution, 49, 551-557. |
| 97. | Lewis, D. D. & Ringuette, M. (1994). Comparison of two learning algorithms for text categorization. Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval (SDAIR'94). |
| 98. | Apte, C., Damerau, F. & Weiss, S. (1994). Towards language independent automated learning of text categorization models. Proceedings of the 17th Annual ACM/SIGIR conference. |
| 99. | Dasarathy, B. V. (1991). Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. IEEE Computer Society Press, Los Alamitos, California. |
| 100. | Kretschmann, E., Fleischmann, W. & Apweiler, R. (2001). Automatic rule generation for protein annotation with the C4.5 data mining algorithm applied on SWISS-PROT. Bioinformatics, 17, 920-6. |
| 101. | Bazzan, A. L., Engel, P. M., Schroeder, L. F. & Da Silva, S. C. (2002). Automated annotation of keywords for proteins related to mycoplasmataceae using machine learning techniques. Bioinformatics, 18 Suppl 2, S35-43. |
| 102. | Stapley, B. J., Kelley, L. A. & Sternberg, M. J. (2002). Predicting the sub-cellular location of proteins from text using support vector machines. Pac Symp Biocomput, 374-85. |
| 103. | Salton, G. (1989). Automatic Text Processing. Addison-Wesley, Reading, MA. |
| 104. | Nishikawa, K. & Ooi, T. (1982). Correlation of the amino acid composition of a protein to its structural and biological characteristics. Journal of Biochemistry, 91, 1821-1824. |
| 105. | Nakashima, H. & Nishikawa, K. (1994). Discrimination of intracellular and extracellular proteins using amino acid composition and residue-pair frequencies. J Mol Biol, 238, 54-61. |
| 106. | Andrade, M. A., O'Donoghue, S. I. & Rost, B.(1998). Adaptation of protein surfaces to subcellular location. J Mol Biol, 276, 517-25. |
| 107. | Nakai, K. & Kanehisa, M. (1991). Expert systemfor predicting protein localization sites in gram-negative bacteria. Proteins:Structure, Function, and Genetics, 11,95-110. |
| 108. | Reinhardt, A. & Hubbard, T. (1998). Using neuralnetworks for prediction of the subcellular location of proteins. NucleicAcids Res, 26, 2230-6. |
| 109. | Hua, S. & Sun, Z. (2001). Support vector machineapproach for protein subcellular localization prediction. Bioinformatics, 17, 721-8. |
| 110. | Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer-Verlag. |
| 111. | Park, K. J. & Kanehisa, M. (2003). Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics, 19, 1656-63. |
| 112. | Cai, Y. D., Liu, X. J., Xu, X. B. & Chou, K. C. (2002). Support vector machines for prediction of protein subcellular location by incorporating quasi-sequence-order effect. J Cell Biochem, 84, 343-8. |
| 113. | Chou, K. C. & Cai, Y. D. (2003). Prediction and classification of protein subcellular location-sequence-order effect and pseudo amino acid composition. J Cell Biochem, 90, 1250-60. |
| 114. | Pan, Y. X., Zhang, Z. Z., Guo, Z. M., Feng, G. Y., Huang, Z. D. et al. (2003). Application of pseudo amino acid composition for predicting protein subcellular location: stochastic signal processing approach. J Protein Chem, 22, 395-402. |
| 115. | Marcotte, E. M., Xenarios, I., van Der Bliek, A. M. & Eisenberg, D. (2000). Localizing proteins in the cell from their phylogenetic profiles. Proc Natl Acad Sci U S A, 97, 12115-20. |
| 116. | Nair, R. & Rost, B. (2003). Better prediction of sub-cellular localization by combining evolutionary and structural information. Proteins, 53, 917-30. |
| 117. | Rost, B. & Sander, C. (1993). Prediction of protein secondary structure at better than 70% accuracy. Journal of Molecular Biology, 232, 584-599. |
| 118. | Rost, B., Yachdav, G. & Liu, J. (2004). The PredictProtein server. Nucleic Acids Res, 32, W321-6. |
| 119. | Nair, R. & Rost, B. (2004). LOCnet and LOCtarget: sub-cellular localization for structural genomics targets. Nucleic Acids Res, 32, W517-21. |
| 120. | Nair, R. & Rost, B. (2003). LOC3D: annotate sub-cellular localization for protein structures. Nucleic Acids Res, 31, 3337-40. |
| 121. | Westbrook, J., Feng, Z., Chen, L., Yang, H. & Berman, H. M. (2003). The Protein Data Bank and structural genomics. Nucleic Acids Res, 31, 489-91. |
| 122. | Nakai, K. & Kanehisa, M. (1992). A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics, 14, 897-911. |
| 123. | Horton, P. & Nakai, K. (1997). Better prediction of protein cellular localization sites with the k nearest neighbors classifier. Ismb, 5, 147-52. |
| 124. | Drawid, A. & Gerstein, M. (2000). A Bayesian system integrating expression data with sequence patterns for localizing proteins: comprehensive application to the yeast genome. J Mol Biol, 301, 1059-75. |
| 125. | Bannai, H., Tamada, Y., Maruyama, O., Nakai, K. & Miyano, S. (2002). Extensive feature detection of N-terminal protein sorting signals. Bioinformatics, 18, 298-305. |
| 126. | Alberts, B., Bray, D., Roberts, K. & Watson, J. (1994). Molecular Biology of the Cell. Garland Publishing, New York and London. |
| 127. | Nielsen, H., Engelbrecht, J., Brunak, S. & von Heijne, G. (1997). A neural network method for identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Int J Neural Syst, 8, 581-99. |
| 128. | Claros, M. G. (1995). MitoProt, a Macintosh application for studying mitochondrial proteins. Comput Appl Biosci, 11, 441-7. |
| 129. | Small, I., Peeters, N., Legeai, F. & Lurin, C. (2004). Predotar: A tool for rapidly screening proteomes for N-terminal targeting sequences. Proteomics, 4, 1581-90. |
| Contact: admin@rostlab.org | Version: Apr 9, 2007 |