LocTree3 for PP

From Rost Lab Open

How do we predict sub-cellular localization?

LocTree3 predicts sub-cellular localization by combining de-novo and homology-based predictions. LocTree3 classifies proteins into the largest number of cellular compartments of any method to date:

  • 18 classes for Eukaryota: chloroplast, chloroplast membrane, cytosol, ER, Golgi, ER membrane, Golgi membrane, extra-cellular, mitochondria, mitochondria membrane, nucleus, nucleus membrane, peroxisome, peroxisome membrane, plasma membrane, plastid, vacuole and vacuole membrane
  • 6 classes for Bacteria: cytosol, extra-cellular, fimbrium, outer membrane, periplasmic space and plasma membrane
  • 3 classes for Archaea: cytosol, extra-cellular and plasma membrane


What is predicted?

LocTree3 predicts the sub-cellular localization for all proteins in all domains of life. Water-soluble globular and trans-membrane proteins are predicted in one of 18 classes in Eukaryota (chloroplast, chloroplast membrane, cytosol, ER, Golgi, ER membrane, Golgi membrane, extra-cellular, mitochondria, mitochondria membrane, nucleus, nucleus membrane, peroxisome, peroxisome membrane, plasma membrane, plastid, vacuole and vacuole membrane), one of 6 classes in Bacteria (cytosol, extra-cellular, fimbrium, outer membrane, periplasmic space and plasma membrane) or one of 3 classes in Archaea (cytosol, extra-cellular and plasma membrane). Each prediction is accompanied by a confidence score (ranging from 0=unreliable to 100=reliable) and the Gene Ontology term of the predicted localization class.


Prediction visualization

Depending on the identified source organism (listed in the Dashboard Summary), we provide the sub-cellular localization prediction in one of the three domains of life (Archaea, Bacteria and Eukaryota) and highlight the results in green. For example:

  • Nucleus in Eukaryota (image: Euk-Cell-3D-nucleus.png)
  • Plasma membrane in Bacteria (image: Bacteria-outer-membrane.png)
  • Extra-cellular in Archaea (image: Archaea-secreted.png)

Prediction algorithm

LocTree3 is an extension of LocTree2, a de-novo predictor whose architecture was inspired by the sorting machinery of the cell (Figure 1). LocTree2 is a hierarchy of Support Vector Machines that make their predictions based on short stretches of consecutive amino acids (k-mers) extracted from protein sequence profiles. The new version, LocTree3, adds a module that infers localization from experimentally annotated sequence homologs using PSI-BLAST. In the absence of significant PSI-BLAST hits, the LocTree2 prediction is used.
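A hierarchy of binary classifiers can be pictured as a small decision tree that is walked from the root to a leaf. The sketch below is only an illustration: the `Node` class, the `predict` function and the two toy decision functions are hypothetical stand-ins for trained SVMs, and the tree layout for Archaea is assumed from the description in the Implementation section (root: cytoplasmic vs. non-cytoplasmic; lower level: plasma membrane vs. extra-cellular).

```python
from dataclasses import dataclass
from typing import Callable, Union

# A decision function maps a protein sequence to a signed score;
# a positive score sends the protein down the "right" branch.
Decision = Callable[[str], float]


@dataclass
class Node:
    decide: Decision            # stand-in for one trained SVM (assumption)
    left: Union["Node", str]    # subtree or leaf label taken when score <= 0
    right: Union["Node", str]   # subtree or leaf label taken when score > 0


def predict(tree: Union[Node, str], sequence: str) -> str:
    """Walk from the root to a leaf, asking one binary classifier per level."""
    while isinstance(tree, Node):
        tree = tree.right if tree.decide(sequence) > 0 else tree.left
    return tree


# Toy decision rules, purely hypothetical; they carry no biological meaning.
def looks_non_cytoplasmic(seq: str) -> float:
    return 1.0 if seq.startswith("MLLL") else -1.0


def looks_extracellular(seq: str) -> float:
    return 1.0 if seq.endswith("K") else -1.0


# Assumed layout of the archaeal tree (three leaves, two decision nodes).
archaea_tree = Node(
    decide=looks_non_cytoplasmic,
    left="cytosol",
    right=Node(
        decide=looks_extracellular,
        left="plasma membrane",
        right="extra-cellular",
    ),
)

print(predict(archaea_tree, "MLLLAAK"))
```

Each protein passes through only the classifiers on its own path, so adding a compartment means adding one node rather than retraining a single flat multi-class model.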

Implementation

For each query sequence, LocTree3 first retrieves a PSI-BLAST profile through the PredictProtein pipeline. This profile is used in a PSI-BLAST search (E-value ≤ 10⁻³) for a close homolog in the database of experimentally annotated proteins. If a homolog is identified, its annotation is transferred to the query protein; if no homolog is identified, the LocTree2 prediction is used. LocTree2 implements Support Vector Machines (SVMs) using the Sequential Minimal Optimization algorithm in WEKA. Each SVM was trained on a different set of proteins. For example, the SVM at the root node of the archaeal tree (Figure 1a) was trained on the full set of proteins (comprising cytoplasmic and non-cytoplasmic classes), while the SVM at the lower level of the tree was trained on plasma-membrane and extra-cellular proteins only.
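The fallback logic described above can be sketched in a few lines. This is not the LocTree3 code: the `Hit` record and the `combine` function are hypothetical, and picking the hit with the lowest E-value when several pass the threshold is an assumption (the text only says that the annotation of "a close homolog" is transferred).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

E_VALUE_CUTOFF = 1e-3  # threshold quoted in the text


@dataclass
class Hit:
    """One hit against the annotated database (hypothetical record)."""
    e_value: float
    localization: str


def combine(hits: List[Hit], de_novo: Callable[[], str]) -> Tuple[str, str]:
    """Return (localization, source), preferring homology over de-novo."""
    significant = [h for h in hits if h.e_value <= E_VALUE_CUTOFF]
    if significant:
        # Assumption: transfer the annotation of the most significant hit.
        best = min(significant, key=lambda h: h.e_value)
        return best.localization, "homology"
    # No significant homolog: fall back to the de-novo predictor.
    return de_novo(), "de-novo"


hits = [Hit(1e-10, "nucleus"), Hit(1e-5, "cytosol"), Hit(0.5, "ER")]
print(combine(hits, de_novo=lambda: "plasma membrane"))
print(combine([], de_novo=lambda: "plasma membrane"))
```

Passing the de-novo predictor as a callable means it only runs when no homolog is found.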

The SVM classification is based on the Profile Kernel, a kernel that identifies sets of k-mers (stretches of k adjacent residues) that are most informative for the prediction of localization and then matches these in a query protein.
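To make the idea of k-mers concrete, the sketch below counts k-mers in plain sequences and compares two proteins by their shared k-mers. Note that this is a plain spectrum kernel, a deliberately simpler relative of the Profile Kernel: it works on raw sequences rather than profiles and only counts exact matches. The function names and the choice k=3 are assumptions made for illustration.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int = 3) -> Counter:
    """Count every stretch of k adjacent residues in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def spectrum_kernel(a: str, b: str, k: int = 3) -> int:
    """Dot product of the two k-mer count vectors (exact matches only)."""
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items())


print(dict(kmer_counts("MKVLA")))
print(spectrum_kernel("MKVLA", "KVLAA"))
```

A kernel value like this can be plugged into an SVM as a similarity measure between proteins; the Profile Kernel follows the same principle but scores k-mers against the sequence profile.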

Data Sets

Data sets of proteins used for the development and evaluation of LocTree3 were extracted from SWISS-PROT release 2011_04. Proteins with non-experimental or ambiguous annotations were excluded. Homology reduction was performed at BLAST E-value ≤ 10⁻³ and HSSP-value > 0.
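The text does not say which procedure was used for homology reduction. One common approach is a greedy filter, sketched below under that assumption: a protein is kept only if it is not similar to any protein already kept. The `reduce_redundancy` function and the `similar` predicate are hypothetical; in practice the predicate would encode the E-value and HSSP-value thresholds above, while the toy predicate here merely compares first letters.

```python
from typing import Callable, Iterable, List


def reduce_redundancy(
    proteins: Iterable[str],
    similar: Callable[[str, str], bool],
) -> List[str]:
    """Greedy filter: keep a protein only if no kept protein is similar to it."""
    kept: List[str] = []
    for protein in proteins:
        if not any(similar(protein, other) for other in kept):
            kept.append(protein)
    return kept


# Toy similarity rule (hypothetical): same first letter counts as "similar".
def same_first_letter(a: str, b: str) -> bool:
    return a[0] == b[0]


print(reduce_redundancy(["abc1", "abc2", "xyz1", "abd9"], same_first_letter))
```

The result of a greedy filter depends on the input order, so the order should be fixed if the data set has to be reproducible.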


Fig 1: Hierarchical architecture of LocTree2. Prediction of protein localization follows a different tree for each of the three domains of life: (a) Archaea, (b) Bacteria and (c) Eukaryota. Abbreviations: CHL, chloroplast; CHLM, chloroplast membrane; CYT, cytosol; ER, endoplasmic reticulum; ERM, endoplasmic reticulum membrane; EXT, extra-cellular; FIM, fimbrium; GOL, Golgi apparatus; GOLM, Golgi apparatus membrane; MIT, mitochondria; MITM, mitochondria membrane; NUC, nucleus; NUCM, nucleus membrane; OM, outer membrane; PERI, periplasmic space; PER, peroxisome; PERM, peroxisome membrane; PM, plasma membrane; PLAS, plastid; VAC, vacuole; VACM, vacuole membrane.

Prediction confidence score

In addition to the predicted localization class we provide a Reliability Index (RI) measuring the strength of a prediction. The RI is a value between 0 and 100, with 100 denoting the most confident predictions.

We rigorously evaluated the reliability of LocTree3 predictions on a non-redundant test set of proteins. For bacteria, the 50% of proteins with the most reliable predictions had RI>80 and were predicted at an overall accuracy of Q6=95% (Figure 2; gray arrow). For eukaryotes, the corresponding 50% of proteins had RI>65 and were predicted at Q18=95% (Figure 2; black arrow).

  • Q6 is six-state accuracy for predicting localization to six classes
  • Q18 is eighteen-state accuracy
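As a minimal sketch, an N-state accuracy such as Q6 or Q18 can be computed as the percentage of proteins whose predicted class equals the observed class. The function name `q_accuracy` and the toy labels are illustrative only.

```python
from typing import Sequence


def q_accuracy(observed: Sequence[str], predicted: Sequence[str]) -> float:
    """Percentage of proteins predicted in the correct class."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("at least one protein is required")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


observed = ["cytosol", "plasma membrane", "extra-cellular", "cytosol"]
predicted = ["cytosol", "plasma membrane", "cytosol", "cytosol"]
print(q_accuracy(observed, predicted))  # 3 of 4 correct -> 75.0
```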


Fig 2: More reliable predictions are more accurate. The curves show the percentage accuracy vs. coverage for LocTree3 predictions above a given RI threshold. The curves were obtained on cross-validated test sets of bacterial (gray line) and eukaryotic (black line) proteins. Half of all eukaryotic proteins are predicted at RI>65; for these, Q18 is above 95% (black arrow). Half of all bacterial proteins are predicted at RI>80; for these, Q6 is above 95% (gray arrow).
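One point on such an accuracy vs. coverage curve can be computed as sketched below: keep only predictions above an RI threshold, then report how many are left (coverage) and how many of those are correct (accuracy). The function name and the toy data are hypothetical and are not LocTree3 results.

```python
from typing import List, Tuple


def accuracy_coverage(
    predictions: List[Tuple[int, bool]], threshold: int
) -> Tuple[float, float]:
    """predictions holds (RI, is_correct) pairs; returns (accuracy %, coverage %)."""
    selected = [ok for ri, ok in predictions if ri > threshold]
    coverage = 100.0 * len(selected) / len(predictions)
    # Accuracy is undefined when no prediction passes the threshold.
    accuracy = 100.0 * sum(selected) / len(selected) if selected else float("nan")
    return accuracy, coverage


# Toy data (made up for illustration): (RI, prediction was correct).
toy = [(95, True), (90, True), (85, True), (70, False),
       (60, True), (40, False), (30, True), (10, False)]

for threshold in (80, 50, 0):
    print(threshold, accuracy_coverage(toy, threshold))
```

Sweeping the threshold from 100 down to 0 traces the full curve, with coverage rising as the threshold is relaxed.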

Accuracy of localization prediction

We evaluated the performance of LocTree3 in a stratified five-fold cross-validation, never using any information from a test split during the training phase.
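A stratified five-fold split keeps the class proportions roughly equal across folds and uses each fold as the test set exactly once. The sketch below shows the bookkeeping only, under simplifying assumptions: the function names are hypothetical, proteins are assigned round-robin without shuffling, and nothing here reproduces the actual LocTree3 splits.

```python
from collections import defaultdict
from typing import Iterator, List, Sequence, Tuple


def stratified_folds(labels: Sequence[str], k: int = 5) -> List[List[int]]:
    """Distribute the indices of each class round-robin over k folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds: List[List[int]] = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def train_test_splits(
    labels: Sequence[str], k: int = 5
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; each fold is the test set once."""
    folds = stratified_folds(labels, k)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


labels = ["nucleus"] * 10 + ["cytosol"] * 5
for train, test in train_test_splits(labels):
    print(len(train), len(test))
```

Because training and test indices never overlap within a split, no test protein can leak into the training phase of the same fold.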

LocTree3, our simple protocol that uses PSI-BLAST where applicable and LocTree2 otherwise, outperformed both of its sources. It reached an overall accuracy of Q18=80±3% in classifying eukaryotic proteins into 18 classes (10 non-membrane and 8 membrane classes) and Q6=89±4% in classifying bacterial proteins into 6 classes. Among eukaryotic classes, LocTree3 predicted extra-cellular proteins best (Acc: 88% and Cov: 96%), followed by nuclear proteins (Acc: 81% and Cov: 86%). For bacteria, the prediction of plasma membrane proteins was most accurate (Acc: 96% and Cov: 95%), followed by cytosolic proteins (Acc: 91% and Cov: 90%).
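The page does not define Acc and Cov. The sketch below assumes the usual per-class reading: Acc is the percentage of proteins predicted in a class that truly belong to it (precision), and Cov is the percentage of proteins observed in a class that are predicted in it (recall). The function name and the toy labels are hypothetical.

```python
from typing import Sequence, Tuple


def class_acc_cov(
    observed: Sequence[str], predicted: Sequence[str], label: str
) -> Tuple[float, float]:
    """Per-class accuracy and coverage in percent (assumed definitions)."""
    correct = sum(o == label and p == label for o, p in zip(observed, predicted))
    n_predicted = sum(p == label for p in predicted)
    n_observed = sum(o == label for o in observed)
    acc = 100.0 * correct / n_predicted if n_predicted else float("nan")
    cov = 100.0 * correct / n_observed if n_observed else float("nan")
    return acc, cov


observed = ["nucleus", "nucleus", "nucleus", "nucleus", "cytosol"]
predicted = ["nucleus", "nucleus", "cytosol", "cytosol", "cytosol"]
print(class_acc_cov(observed, predicted, "nucleus"))  # (100.0, 50.0)
```

In this toy case every "nucleus" prediction is right (Acc 100%), but only half of the nuclear proteins are found (Cov 50%), which is why both numbers are reported per class.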

For proteins with little evolutionary information available (<11 homologs in the PSI-BLAST alignment), we observed only a slight drop in the performance:

Number of homologs in the PSI-BLAST alignment    Number of proteins in the test data set    LocTree3 performance (Q18)
0-10 hits                                        451                                        78±5%
11-100 hits                                      467                                        82±4%
101-1000 hits                                    517                                        80±4%
>1000 hits                                       247                                        80±6%
LocTree3's average performance                   1682                                       80±3%
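As a quick consistency check on the table, the four bin sizes sum to the total of 1682 proteins, and weighting each bin's Q18 by its size gives a value close to the reported average. Whether the reported 80% was computed as such a weighted mean is an assumption; the snippet only shows that the numbers agree.

```python
# (bin, number of proteins, Q18 in %) copied from the table above.
bins = [
    ("0-10 hits", 451, 78),
    ("11-100 hits", 467, 82),
    ("101-1000 hits", 517, 80),
    (">1000 hits", 247, 80),
]

total = sum(n for _, n, _ in bins)
weighted_q18 = sum(n * q for _, n, q in bins) / total

print(total)                   # 1682
print(round(weighted_q18, 1))  # 80.0
```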