LocTree2 - Protein sub-cellular localization prediction for all domains of life


Important Note: This version is outdated. Please use the newer LocTree3 instead.



Web server

Latest version:

https://rostlab.org/services/loctree3/

Introduction

Subcellular localization is one easily definable aspect of protein function. Computational prediction of localization remains invaluable, especially for whole-genome analysis and annotation. Several methods have been developed to predict localization, yet many challenges remain.

We (Tatyana Goldberg, Tobias Hamp and Burkhard Rost) at RostLab developed a novel method, LocTree2, that predicts localization for all proteins in all domains of life. Similar to our previous method, LocTree, it uses a system of hierarchically organized Support Vector Machines to mimic the protein trafficking mechanism in cells. Please note that, other than the hierarchy and the name, LocTree and LocTree2 have nothing in common.

Amongst the novel aspects of LocTree2 are:

  • prediction of 18 localization classes for Eukaryota
  • 6 classes for Bacteria and 3 classes for Archaea
  • incorporation of no other information than evolutionary profiles
  • highly accurate distinction between membrane and water-soluble globular proteins
  • high robustness against sequencing errors
  • top performance even for protein fragments

Method design

LocTree2 combines three different systems of classification trees to predict 3 localization classes in Archaea (cytosol, plasma membrane and extra-cellular), 6 classes in Bacteria (cytosol, plasma membrane, periplasmic space, outer membrane, fimbrium and extra-cellular) and 18 classes in Eukaryota (ER, Golgi, extra-cellular, vacuole, peroxisome, mitochondria, chloroplast, plastid, cytosol, nucleus, ER membrane, plasma membrane, Golgi membrane, nucleus membrane, vacuole membrane, peroxisome membrane, chloroplast membrane and mitochondria membrane) (Figure 1).

Fig 1: Hierarchical architecture of LocTree2. Prediction of protein localization follows a different tree for each of the three domains of life: (a) Archaea, (b) Bacteria and (c) Eukaryota. Abbreviations: CHL, chloroplast; CHLM, chloroplast membrane; CYT, cytosol; ER, endoplasmic reticulum; ERM, endoplasmic reticulum membrane; EXT, extra-cellular; FIM, fimbrium; GOL, Golgi apparatus; GOLM, Golgi apparatus membrane; MIT, mitochondria; MITM, mitochondria membrane; NUC, nucleus; NUCM, nucleus membrane; OM, outer membrane; PERI, periplasmic space; PER, peroxisome; PERM, peroxisome membrane; PM, plasma membrane; PLAS, plastid; VAC, vacuole; VACM, vacuole membrane.

Input

LocTree2 requires two input files for each sequence to be predicted: a .fasta file and a .profile file. A profile file can be obtained with, for example, PSI-BLAST (Position-Specific Iterated BLAST). We built our profiles using an 80% non-redundant database combining SWISS-PROT, TrEMBL and PDB.

Example input files can be viewed here.
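
As a rough illustration of how a profile might be produced, the sketch below assembles a command line for the BLAST+ psiblast program. It is not taken from the LocTree2 distribution: the function name, the database name "nr80" and the iteration count are placeholders, and it assumes that the ASCII PSSM written by psiblast is an acceptable .profile file. Check the example input files for the exact format expected.

```python
from pathlib import Path


def psiblast_command(fasta, database, iterations=3):
    """Build (but do not run) a psiblast call that writes <name>.profile.

    Assumption: the ASCII PSSM from BLAST+ psiblast is used as the profile.
    The database name and iteration count are placeholders, not LocTree2 settings.
    """
    fasta = Path(fasta)
    profile = fasta.with_suffix(".profile")  # query.fasta -> query.profile
    return [
        "psiblast",
        "-query", str(fasta),
        "-db", database,
        "-num_iterations", str(iterations),
        "-out_ascii_pssm", str(profile),
    ]


command = psiblast_command("query.fasta", "nr80")
print(" ".join(command))
```

The list can be passed to subprocess.run once psiblast and a suitable database are installed.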

Profile Kernel

To assign a localization class to a query protein, Support Vector Machines (SVMs; implemented as internal nodes in the LocTree2 decision trees) take the protein in its input space (as a sequence-profile tuple) and map it to a higher-dimensional feature space, where it is represented as a feature vector. The vector of the query protein is then 'compared' to the vectors of the proteins used for training. The predicted localization class of the query protein is the class of the most 'similar' vector of a protein used for training.

In our case, the mapping and the comparison are carried out by a string kernel function called the Profile Kernel. A feature vector built by the Profile Kernel is indexed by all possible subsequences of length k over the alphabet of 20 amino acids. Each element in the vector represents one particular k-mer with a score below a user-defined threshold sigma. This score is calculated as the ungapped cumulative substitution score in the corresponding sequence profile. The similarity between a training protein and a test protein is calculated as the dot product of their k-mer vector representations and is given as a single positive integer number.

In short, the Profile Kernel identifies sets of k-mers (stretches of k adjacent residues) that are most informative for the prediction of localization and then matches these in a query protein.
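
The idea can be sketched in a few lines of Python. This is a simplified illustration, not the LocTree2 implementation: it assumes a profile is a list of per-position residue probabilities, that the score of a residue is its negative log probability, and it uses a toy two-letter alphabet. All function names are hypothetical.

```python
import math
from collections import Counter


def to_scores(column):
    """Turn residue probabilities into non-negative scores (-log p)."""
    return {residue: -math.log(p) for residue, p in column.items() if p > 0}


def neighborhood(window, sigma):
    """Yield every k-mer whose cumulative, ungapped score stays below sigma."""
    def extend(position, prefix, total):
        if position == len(window):
            yield prefix
            return
        for residue, score in window[position].items():
            # Scores are non-negative, so a prefix at or above sigma can be dropped.
            if total + score < sigma:
                yield from extend(position + 1, prefix + residue, total + score)
    yield from extend(0, "", 0.0)


def feature_vector(profile, k, sigma):
    """Count, over all windows of length k, the k-mers scoring below sigma."""
    counts = Counter()
    for start in range(len(profile) - k + 1):
        counts.update(neighborhood(profile[start:start + k], sigma))
    return counts


def profile_kernel(profile_a, profile_b, k, sigma):
    """Similarity of two proteins: dot product of their k-mer count vectors."""
    vector_a = feature_vector(profile_a, k, sigma)
    vector_b = feature_vector(profile_b, k, sigma)
    return sum(count * vector_b[kmer] for kmer, count in vector_a.items())


# Toy profile over the alphabet {A, C}; real profiles cover 20 amino acids.
toy = [to_scores(column) for column in (
    {"A": 0.9, "C": 0.1},
    {"A": 0.5, "C": 0.5},
    {"A": 0.1, "C": 0.9},
)]
print(feature_vector(toy, k=2, sigma=1.0))
print(profile_kernel(toy, toy, k=2, sigma=1.0))
```

In this toy case only the k-mers AA, AC and CC fall below sigma, so the kernel value of the profile with itself is a small integer. The values of k and sigma here are arbitrary choices for the example.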

Additional information about the kernel function can be found in the LocTree2 manuscript and the corresponding Profile Kernel publication.

Prediction algorithm

LocTree2 combines three different systems of decision trees, one for each domain of life (Figure 1). The trees were built by incorporating a hierarchical ontology of localization classes modeled on the biological sorting mechanism in that domain. In eukaryotes, pathways for membrane and non-membrane proteins are treated separately. The branches represent the paths of protein sorting, the leaves (rectangles) the final prediction of one localization class, and the internal nodes (circles) the decision points along the path.

As in LocTree, biological similarities were incorporated from the description of cellular components provided by the Gene Ontology Consortium (GO). In cases of ambiguous relations (e.g. PER, MIT, CHL) we explored different trees in which these classes were placed at different levels in the hierarchy and selected the hierarchy with the highest prediction performance. LocTree2 was extremely successful at learning evolutionary similarities among subcellular localization classes and was significantly more accurate than other traditional networks at predicting localization.

Hierarchy of SVMs

The decision points along the path in the hierarchical trees were implemented as binary Support Vector Machines (SVMs). As the SVM model, we chose the WEKA version of Sequential Minimal Optimization. Each SVM was trained on a different set of proteins. For example, the SVM at the root node in the archaeal tree (Fig. 1a) was trained on the full set of proteins (comprising cytoplasmic and non-cytoplasmic classes), while the SVM at a lower level in the tree was trained on plasma-membrane and extra-cellular proteins only.
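
A minimal sketch of how such a hierarchy can be traversed is given below. The tree shape follows the archaeal example above (cytosol vs. the rest, then plasma membrane vs. extra-cellular), but the decision functions are simple stand-ins for trained SVMs, and the field names "cyt_score" and "pm_score" are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Leaf:
    label: str  # final localization class, e.g. "CYT"


@dataclass
class Node:
    decide: Callable[[dict], bool]  # stand-in for a trained binary SVM
    if_true: Union["Node", Leaf]
    if_false: Union["Node", Leaf]


def predict(tree, protein):
    """Follow binary decisions from the root until a leaf is reached."""
    path = []
    current = tree
    while isinstance(current, Node):
        went_true = current.decide(protein)
        path.append(went_true)
        current = current.if_true if went_true else current.if_false
    return current.label, path


# Hypothetical archaeal tree: the root separates cytosolic proteins from the
# rest; the lower node separates plasma-membrane from extra-cellular proteins.
archaea_tree = Node(
    decide=lambda p: p["cyt_score"] > 0,
    if_true=Leaf("CYT"),
    if_false=Node(
        decide=lambda p: p["pm_score"] > 0,
        if_true=Leaf("PM"),
        if_false=Leaf("EXT"),
    ),
)

print(predict(archaea_tree, {"cyt_score": -1.0, "pm_score": 0.5}))
```

Because each node only sees proteins that reach it, each decision function can be trained on a different subset of the data, as described above.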

Fig 2: More reliable predictions are more accurate. The curves show the percentage Accuracy vs. Coverage for LocTree2 predictions above a given RI threshold (from 0=unreliable to 100=most reliable). The curves were obtained on cross-validated test sets of bacterial (gray line) and eukaryotic (black line) proteins. Half of all eukaryotic proteins are predicted at RI>80; for these, Q18 is above 92% (black arrow). As the number of localization classes is lower for Bacteria, the corresponding accuracy is higher (Q6 is above 95% at 50% coverage, gray arrow).

Reliability index

In addition to the predicted localization class, LocTree2 provides a Reliability Index (RI) measuring the strength of a prediction. For a predicted class (leaf node), the RI is computed as the product of the reliabilities of all parental nodes. The RI is a value between 0 and 100, with 100 denoting the most confident predictions.
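
The product rule can be illustrated as follows. This sketch assumes that each node on the path reports a reliability between 0 and 1 and that the product is scaled to 0-100 and rounded; how LocTree2 derives the per-node reliabilities and performs the scaling is not specified here, so both are assumptions.

```python
import math


def reliability_index(node_reliabilities):
    """Combine per-node reliabilities (each in [0, 1]) into an RI in [0, 100].

    Assumption: scaling by 100 and rounding; the exact LocTree2 formula may differ.
    """
    for value in node_reliabilities:
        if not 0.0 <= value <= 1.0:
            raise ValueError("node reliabilities must lie in [0, 1]")
    return round(100 * math.prod(node_reliabilities))


# A leaf reached through two decisions with reliabilities 0.9 and 0.8.
print(reliability_index([0.9, 0.8]))  # 72
```

One consequence of taking a product is that leaves deeper in the tree can never be more reliable than the weakest decision on their path.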

We evaluated the reliability of LocTree2 predictions on a non-redundant test set of proteins (Fig. 2). The 50% of proteins with the highest reliability reached overall accuracy levels of Q6=98% for Bacteria (gray arrow) and Q18=92% for Eukaryota (black arrow). To pick another point, almost 40% of all eukaryotic proteins were predicted at RI greater than 85; for these, Q18 was above 95%. Thus, out of every 100 eukaryotic proteins, the 40 most reliable predictions contained about two (5% of 40) that were wrong in one of 18 states (e.g. nuclear instead of nuclear membrane).

  • Q6 is six-state accuracy for predicting localization to six classes
  • Q18 is eighteen-state accuracy
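
One point on an accuracy vs. coverage curve can be computed as in the sketch below. The data are invented, the function name is hypothetical, and the choice to keep predictions at or above the threshold (rather than strictly above it) is an assumption.

```python
def accuracy_and_coverage(predictions, threshold):
    """Return (accuracy, coverage) for predictions with RI >= threshold.

    predictions: list of (ri, is_correct) pairs, one per test protein.
    """
    kept = [is_correct for ri, is_correct in predictions if ri >= threshold]
    coverage = len(kept) / len(predictions)
    accuracy = sum(kept) / len(kept) if kept else float("nan")
    return accuracy, coverage


# Invented toy results: (reliability index, prediction correct?)
toy_predictions = [
    (95, True), (90, True), (85, True),
    (60, False), (40, True), (10, False),
]

for threshold in (0, 50, 80):
    print(threshold, accuracy_and_coverage(toy_predictions, threshold))
```

Sweeping the threshold from 0 to 100 and plotting the resulting pairs gives a curve of the kind shown in Fig. 2.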

Accuracy of localization prediction

We evaluated the performance of LocTree2 in a stratified five-fold cross-validation using non-redundant test sets of archaeal, bacterial and eukaryotic proteins. In each fold we learned a new classification model with four training splits and tested on the fifth. In doing so, we never used any information of the test split during a training phase.
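
The splitting scheme can be sketched as follows. This is a generic illustration of stratified five-fold cross-validation, not the script used for LocTree2: it assigns proteins of each class to folds in round-robin order, without shuffling and without any redundancy reduction, and the labels are invented.

```python
from collections import defaultdict


def stratified_folds(labels, n_folds=5):
    """Distribute sample indices over folds so each class is spread evenly."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


def cross_validation_splits(labels, n_folds=5):
    """Yield (train, test) index lists; the test fold is never used for training."""
    folds = stratified_folds(labels, n_folds)
    for held_out in range(n_folds):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


# Invented toy labels: 10 cytosolic, 5 plasma-membrane, 5 extra-cellular proteins.
toy_labels = ["CYT"] * 10 + ["PM"] * 5 + ["EXT"] * 5
for train, test in cross_validation_splits(toy_labels):
    print(len(train), len(test), sorted(toy_labels[i] for i in test))
```

Each test fold here contains two CYT, one PM and one EXT protein, mirroring the class proportions of the full set.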

LocTree2 achieved a sustained level of 65% accuracy for predicting eighteen localization classes for eukaryotes. The first decision made for a eukaryotic protein is whether or not it is membrane-spanning. This decision was correct for 94% of all proteins, a performance comparable to that of the best methods designed explicitly for the prediction of integral membrane helices. The most accurately predicted class for eukaryotes was extra-cellular, followed by nucleus. Overall, LocTree2 tended to predict membrane-bound classes better than the corresponding non-membrane-bound classes (e.g. MITM vs. MIT).

LocTree2 also performed well for prokaryotes, predicting six classes at 84% accuracy for Bacteria and three classes at 100% accuracy for Archaea. We assume that 100% is an overestimate of the performance for Archaea, due to the limited data we had. For Bacteria, the most accurate predictions were made for plasma membrane proteins, followed by cytosolic proteins.

We rigorously benchmarked LocTree2 in comparison to the best alternative methods for localization prediction. LocTree2 outperformed all other methods in nearly all benchmarks. We could show on a few examples that LocTree2 may discover annotation mistakes of high-throughput experiments. Finally, we suggest using our tool for large-scale genome projects as it proved to sustain high levels of performance and to surpass its competitors even for protein fragments.

Additional information can be found in the LocTree2 manuscript.

Runtime analysis

The runtime was measured on a Dell M605 machine with a Six-Core AMD Opteron processor (2.4 GHz, 6MB and 75W ACP) running on Linux.

Sequences   1       100     500     1000    3000    5000    10000
Archaea     0.8s    3.0s    10.4s   18.8s   51.2s   1m36s   3m43s
Bacteria    3.6s    1m09s   5m25s   9m12s   27m01s  1h4m    2h10m
Eukaryota   1m37s   8m43s   44m     1h13m   4h17m   7h47m   15h6m

Availability / Download

  • The program can be accessed online via the PredictProtein service or LocTree2 server
  • Standalone version can be downloaded as a Debian package here

Data

Data sets used for development and evaluation of LocTree2 can be accessed here.

Reference

LocTree2 predicts localization for all domains of life

Tatyana Goldberg; Tobias Hamp; Burkhard Rost

Bioinformatics 2012 28: i458-i465 (Full Text, PDF)

Supporting Online Material

Contact

For questions, please contact localization@rostlab.org
