LocTree3 for PP
How do we predict sub-cellular localization?
LocTree3 predicts sub-cellular localization by combining de-novo and homology-based predictions. LocTree3 classifies proteins into the largest number of cellular compartments to date:
- 18 classes for Eukaryota: chloroplast, chloroplast membrane, cytosol, ER, Golgi, ER membrane, Golgi membrane, extra-cellular, mitochondria, mitochondria membrane, nucleus, nucleus membrane, peroxisome, peroxisome membrane, plasma membrane, plastid, vacuole and vacuole membrane
- 6 classes for Bacteria: cytosol, extra-cellular, fimbrium, outer membrane, periplasmic space and plasma membrane
- 3 classes for Archaea: cytosol, extra-cellular and plasma membrane
What is predicted?
LocTree3 predicts the sub-cellular localization for all proteins in all domains of life. Water-soluble globular and trans-membrane proteins are assigned to one of 18 classes in Eukaryota (chloroplast, chloroplast membrane, cytosol, ER, Golgi, ER membrane, Golgi membrane, extra-cellular, mitochondria, mitochondria membrane, nucleus, nucleus membrane, peroxisome, peroxisome membrane, plasma membrane, plastid, vacuole and vacuole membrane), one of 6 classes in Bacteria (cytosol, extra-cellular, fimbrium, outer membrane, periplasmic space and plasma membrane) or one of 3 classes in Archaea (cytosol, extra-cellular and plasma membrane). Each prediction is accompanied by a confidence score (ranging from 0=unreliable to 100=reliable) and the Gene Ontology term of the predicted localization class.
Depending on the identified source organism (listed in the Dashboard Summary), we provide the sub-cellular localization prediction for one of the three domains of life (Archaea, Bacteria and Eukaryota) and highlight the result in green. For example:
|Nucleus in Eukaryota||Plasma membrane in Bacteria||Extra-cellular in Archaea|
LocTree3 is an extension of LocTree2, a de-novo predictor whose architecture is inspired by the sorting machinery in the cell (Figure 1). LocTree2 is a hierarchy of Support Vector Machines that make their predictions based on short stretches of consecutive amino acids (k-mers) extracted from protein sequence profiles. The new version, LocTree3, adds a module that infers localization from experimentally annotated sequence homologs using PSI-BLAST. In the absence of significant PSI-BLAST hits, LocTree2 is used.
For each query sequence, LocTree3 first retrieves a PSI-BLAST profile through the PredictProtein pipeline. This profile is used in a PSI-BLAST search (E-value ≤ 10^-3) for a close homolog in the database of experimentally annotated proteins. If a homolog is identified, its annotation is transferred to the query protein; if no homolog is identified, the LocTree2 prediction is used. LocTree2 implements Support Vector Machines (SVMs) using the Sequential Minimal Optimization algorithm in WEKA. Each SVM was trained on a different set of proteins. For example, the SVM at the root node of the archaeal tree (Figure 1a) was trained on the full set of proteins (comprising cytoplasmic and non-cytoplasmic classes), while the SVM at the lower level of the tree was trained on plasma-membrane and extra-cellular proteins only.
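The decision rule described above can be illustrated with a minimal Python sketch. This is not the LocTree3 implementation: the `Hit` record, the `predict_localization` function and the `de_novo` callback are hypothetical names, and choosing the hit with the lowest E-value as the "close homolog" is an assumption made here for illustration. Only the E-value threshold is taken from the text.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

E_VALUE_CUTOFF = 1e-3  # PSI-BLAST threshold stated in the text


@dataclass
class Hit:
    """Hypothetical record for one PSI-BLAST hit in the annotated database."""
    e_value: float
    localization: str  # experimentally annotated class of the hit


def predict_localization(
    hits: Sequence[Hit],
    de_novo: Callable[[], str],
) -> Tuple[str, str]:
    """Return (localization, source) for one query protein.

    If any hit passes the E-value cutoff, transfer the annotation of the
    best hit (assumption: "best" means lowest E-value). Otherwise fall
    back to the de-novo predictor, which stands in for LocTree2 here.
    """
    significant = [h for h in hits if h.e_value <= E_VALUE_CUTOFF]
    if significant:
        best = min(significant, key=lambda h: h.e_value)
        return best.localization, "homology"
    return de_novo(), "de-novo"
```

For example, `predict_localization([Hit(0.5, "nucleus")], lambda: "cytosol")` falls back to the de-novo branch because the only hit is above the cutoff.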
The SVM classification is based on the Profile Kernel, a kernel that identifies sets of k-mers (stretches of k adjacent residues) that are most informative for the prediction of localization and then matches these in a query protein.
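To make the notion of a k-mer concrete, the sketch below extracts all overlapping k-mers from a plain amino-acid sequence. It is a deliberate simplification: the Profile Kernel operates on sequence profiles rather than on the raw sequence, and the function names `kmers` and `kmer_counts` are assumptions introduced here.

```python
from collections import Counter
from typing import List


def kmers(sequence: str, k: int) -> List[str]:
    """Return all overlapping stretches of k adjacent residues."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence of length n contains n - k + 1 overlapping k-mers.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count how often each k-mer occurs in the sequence."""
    return Counter(kmers(sequence, k))
```

A sequence shorter than k yields an empty list, so very short queries contribute no k-mer features in this toy version.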
Data sets of proteins used for the development and evaluation of LocTree3 were extracted from SWISS-PROT release 2011_04. Proteins with non-experimental or ambiguous annotations were excluded. Homology reduction was performed at BLAST E-value ≤ 10^-3 and HSSP-value > 0.
Fig 1: Hierarchical architecture of LocTree2. Prediction of protein localization follows a different tree for each of the three domains of life: (a) Archaea, (b) Bacteria and (c) Eukaryota. Abbreviations: CHL, chloroplast; CHLM, chloroplast membrane; CYT, cytosol; ER, endoplasmic reticulum; ERM, endoplasmic reticulum membrane; EXT, extra-cellular; FIM, fimbrium; GOL, Golgi apparatus; GOLM, Golgi apparatus membrane; MIT, mitochondria; MITM, mitochondria membrane; NUC, nucleus; NUCM, nucleus membrane; OM, outer membrane; PERI, periplasmic space; PER, peroxisome; PERM, peroxisome membrane; PM, plasma membrane; PLAS, plastid; VAC, vacuole; VACM, vacuole membrane.
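A hierarchy of binary decisions such as the archaeal tree can be sketched as follows. The tree shape (a root node separating cytoplasmic from non-cytoplasmic proteins, and a lower node separating plasma membrane from extra-cellular) follows the description in the text. Everything else is assumed for illustration: the `Node` and `Leaf` classes, the feature names and the 0.5 thresholds are placeholders for the trained SVMs and carry no biological meaning.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union

Features = Dict[str, float]  # hypothetical per-protein scores


@dataclass
class Leaf:
    label: str  # localization class, e.g. "CYT"


@dataclass
class Node:
    decide: Callable[[Features], bool]  # stands in for one binary SVM
    if_true: Union["Node", Leaf]
    if_false: Union["Node", Leaf]


def classify(tree: Union[Node, Leaf], features: Features) -> str:
    """Walk from the root to a leaf, following one decision per node."""
    node = tree
    while isinstance(node, Node):
        node = node.if_true if node.decide(features) else node.if_false
    return node.label


# Toy archaeal tree: the lambdas are placeholders, not trained classifiers.
ARCHAEA_TREE = Node(
    decide=lambda f: f["cytoplasmic_score"] >= 0.5,
    if_true=Leaf("CYT"),
    if_false=Node(
        decide=lambda f: f["membrane_score"] >= 0.5,
        if_true=Leaf("PM"),
        if_false=Leaf("EXT"),
    ),
)
```

Because each node only has to separate the classes below it, every classifier can be trained on the subset of proteins that reaches that node, as described for the archaeal tree.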
Prediction confidence score
In addition to the predicted localization class we provide a Reliability Index (RI) measuring the strength of a prediction. The RI is a value between 0 and 100, with 100 denoting the most confident predictions.
We rigorously evaluated the reliability of LocTree3 predictions on a non-redundant test set of proteins. The 50% of bacterial proteins predicted with the highest reliability had RI>80 and reached an overall accuracy of Q6=95% (Figure 2; gray arrow). The 50% of eukaryotic proteins predicted with the highest reliability had RI>65 and reached Q18=95% (Figure 2; black arrow).
- Q6 is the six-state accuracy, i.e. the overall accuracy for predicting localization in the six bacterial classes
- Q18 is the eighteen-state accuracy, i.e. the overall accuracy for predicting localization in the eighteen eukaryotic classes
Fig 2: More reliable predictions are better. The curves show the percentage Accuracy vs. Coverage for LocTree3 predictions above a given RI threshold. The curves were obtained on cross-validated test sets of bacterial (gray line) and eukaryotic (black line) proteins. Half of all eukaryotic proteins are predicted at RI>65; for these, Q18 is above 95% (black arrow). Half of all bacterial proteins are predicted at RI>80; for these, Q6 is above 95% (gray arrow).
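One point on an accuracy-versus-coverage curve of this kind can be computed as in the following sketch. The function name `accuracy_coverage` and the `(ri, predicted, observed)` tuple layout are assumptions made for this example, and the data in the usage note are invented toy values, not LocTree3 results.

```python
from typing import List, Optional, Tuple

Prediction = Tuple[int, str, str]  # (RI, predicted class, observed class)


def accuracy_coverage(
    predictions: List[Prediction], threshold: int
) -> Tuple[Optional[float], float]:
    """Return (accuracy %, coverage %) for predictions with RI > threshold.

    Coverage is the share of all proteins that pass the threshold.
    Accuracy is the share of those proteins that are predicted correctly;
    it is None when no prediction passes the threshold.
    """
    if not predictions:
        return None, 0.0
    selected = [(p, o) for ri, p, o in predictions if ri > threshold]
    coverage = 100.0 * len(selected) / len(predictions)
    if not selected:
        return None, coverage
    correct = sum(1 for p, o in selected if p == o)
    return 100.0 * correct / len(selected), coverage
```

Sweeping `threshold` from 0 to 100 and plotting the resulting pairs gives a curve of the same type as Figure 2. Raising the threshold typically trades coverage for accuracy.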
Accuracy of localization prediction
We evaluated the performance of LocTree3 in a stratified five-fold cross-validation, never using any information from a test split during the training phase.
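A stratified five-fold split can be sketched in plain Python as shown below. The text does not describe how the splits were built, so this round-robin assignment per class is only one possible scheme, and the function names are assumptions. The sketch also ignores sequence similarity between proteins, which a real evaluation would have to handle separately.

```python
import random
from collections import defaultdict
from typing import Iterator, List, Sequence, Tuple


def stratified_folds(
    labels: Sequence[str], n_folds: int = 5, seed: int = 0
) -> List[List[int]]:
    """Distribute protein indices over folds, class by class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Round-robin keeps the class proportions similar in every fold.
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds


def train_test_splits(
    labels: Sequence[str], n_folds: int = 5, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; each fold is the test set once."""
    folds = stratified_folds(labels, n_folds, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each protein appears in exactly one test split, and the training indices of a split never overlap with its test indices.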
LocTree3, our simple protocol that uses PSI-BLAST where applicable and LocTree2 otherwise, outperformed both of its components. It reached an overall accuracy of Q18=80±3% in classifying eukaryotic proteins into 18 classes (10 non-membrane and 8 membrane classes) and of Q6=89±4% in classifying bacterial proteins into 6 classes. LocTree3 predicted eukaryotic extra-cellular proteins best (Acc: 88% and Cov: 96%), followed by nuclear proteins (Acc: 81% and Cov: 86%). For bacteria, the prediction of plasma membrane proteins was most accurate (Acc: 96% and Cov: 95%), followed by cytosolic proteins (Acc: 91% and Cov: 90%).
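The measures quoted above can be computed as in the sketch below. The page does not define Acc and Cov, so the definitions used here are an assumption: Acc is taken as the percentage of proteins predicted in a class that are observed in that class, and Cov as the percentage of proteins observed in a class that are predicted in it. The function names are likewise hypothetical.

```python
from typing import Optional, Sequence, Tuple


def overall_accuracy(predicted: Sequence[str], observed: Sequence[str]) -> float:
    """Q_n: percentage of proteins whose class is predicted correctly."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(1 for p, o in zip(predicted, observed) if p == o)
    return 100.0 * correct / len(predicted)


def class_accuracy_coverage(
    predicted: Sequence[str], observed: Sequence[str], cls: str
) -> Tuple[Optional[float], Optional[float]]:
    """Per-class (Acc %, Cov %) under the definitions assumed above."""
    pairs = list(zip(predicted, observed))
    correct = sum(1 for p, o in pairs if p == cls and o == cls)
    n_predicted = sum(1 for p, _ in pairs if p == cls)
    n_observed = sum(1 for _, o in pairs if o == cls)
    acc = 100.0 * correct / n_predicted if n_predicted else None
    cov = 100.0 * correct / n_observed if n_observed else None
    return acc, cov
```

A class that is never predicted has no defined Acc, and a class that never occurs in the test set has no defined Cov, so both cases return `None` instead of a number.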
For proteins with little evolutionary information available (<11 homologs in the PSI-BLAST alignment), we observed only a slight drop in the performance:
|Number of homologs in the PSI-BLAST alignment||Number of proteins in the test data set||LocTree3 performance (Q18)|
|LocTree3's average performance||1682||80±3%|