Difference between revisions of "PEffect - prediction of bacterial type III effector proteins"

Latest revision as of 22:14, 6 May 2017

Web server

http://bromberglab.org/services/pEffect/

Introduction

The type III secretion system is one of the causes of a wide range of bacterial infections in humans, animals and plants. This system comprises a hollow needle-like structure on the surface of bacterial cells that injects specific bacterial proteins, the so-called effectors, directly into the cytoplasm of a host cell. During infection, effectors divert host resources to the pathogen's advantage and promote pathogenicity.

We (Tatyana Goldberg, Burkhard Rost and Yana Bromberg) at BrombergLab (http://bromberglab.org) and RostLab (http://rostlab.org/cms/) developed a novel method, pEffect, that predicts bacterial type III effector proteins. The method combines sequence-based homology searches with machine learning, and it uses information encoded in the entire protein sequence for its predictions.

Method design

pEffect combines sequence similarity-based inference (PSI-BLAST) with de-novo prediction by machine learning (Support Vector Machines; SVM). For a query protein, it first runs PSI-BLAST to identify a homolog in the set of known, annotated effector proteins. If such a homolog is found, its annotation (i.e. type III effector) is transferred to the query protein. If no homolog is found, pEffect triggers an SVM that predicts effector proteins by searching for stretches of k consecutive residues (k-mers) that are known from annotated proteins.
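The two-stage decision described above can be sketched as follows. This is a minimal illustration, not pEffect's actual code: the names `predict_effector`, `find_homolog` and `svm_predict` are hypothetical, and the two callables stand in for a PSI-BLAST search against known effectors and for the trained SVM.

```python
from typing import Callable, NamedTuple, Optional


class Prediction(NamedTuple):
    is_effector: bool
    method: str              # "PSI-BLAST" or "SVM"
    homolog: Optional[str]   # identifier of the matched effector, if any


def predict_effector(
    sequence: str,
    find_homolog: Callable[[str], Optional[str]],  # hypothetical PSI-BLAST wrapper
    svm_predict: Callable[[str], bool],            # hypothetical SVM wrapper
) -> Prediction:
    """Homology-based inference first, de-novo SVM prediction as fallback."""
    homolog = find_homolog(sequence)
    if homolog is not None:
        # A known effector homolog exists: transfer its annotation.
        return Prediction(True, "PSI-BLAST", homolog)
    # No homolog found: fall back to the machine-learning prediction.
    return Prediction(bool(svm_predict(sequence)), "SVM", None)


# Toy stand-ins, for illustration only.
known_effectors = {"MKTAY": "effector_A"}
result = predict_effector("MKTAY", known_effectors.get, lambda seq: False)
```

The SVM is only consulted when the homology search returns nothing, which mirrors the order of steps given in the text.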

Input

The input to the server is:

1. One or more FASTA-formatted protein sequences. The sequences must be in one-letter amino acid code (not case-sensitive). The allowed amino acids are ACDEFGHIKLMNPQRSTVWY and X (unknown). Example: https://rostlab.org/services/peffect/help

2. An e-mail address (optional): a notification of the completed prediction and the access link to the results are sent to the address provided.
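Before submitting, sequences can be checked locally against the input rules above. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the first whitespace-delimited token of each header line is taken as the identifier, and the server's own validation may differ.

```python
ALLOWED_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYX")  # 20 amino acids plus X (unknown)


def parse_fasta(text: str) -> dict:
    """Parse FASTA text into {identifier: upper-case sequence}."""
    records = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            current = parts[0] if parts else ""
            if current in records:
                raise ValueError(f"duplicate identifier: {current!r}")
            records[current] = ""
        elif current is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            # Input is not case-sensitive, so normalise to upper case.
            records[current] += line.upper()
    return records


def invalid_residues(sequence: str) -> list:
    """Return the sorted list of characters that are not allowed."""
    return sorted(set(sequence.upper()) - ALLOWED_RESIDUES)


records = parse_fasta(">p1 example\nacde\nFGH\n>p2\nMKBX\n")
problems = {pid: invalid_residues(seq) for pid, seq in records.items()}
```

Here `problems` maps each identifier to the characters that would have to be replaced (for instance by X) before submission.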

Fig 1: Reliable predictions are more accurate. The figure shows the cumulative percent of accuracy/coverage of pEffect predictions at or above a given reliability index (RI). The graphs were obtained using the homology-reduced sets of 115 type III effector and 3,460 non-effector proteins in five-fold cross-validation. At the default reliability score of RI=50 (black vertical line), 95% of type III effectors are identified at 87% accuracy (black arrow). At a higher reliability score of RI=80 (gray vertical line), prediction accuracy increases to 97% at the cost of lower coverage of 78% (gray arrow).

Output

For every query protein, the result contains four basic values:

1. the protein identifier as provided by the user

2. the reliability score of the prediction on a 0-100 scale, with 100 being the most confident prediction

3. the prediction of whether the protein is a type III effector

4. the annotation type of the prediction (PSI-BLAST or SVM)

For PSI-BLAST predictions, clicking the annotation type (i.e. PSI-BLAST) on the web site shows the closest homolog and its PSI-BLAST alignment to the query. Example: https://rostlab.org/services/peffect/help
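For downstream processing, the four values listed above can be held in a small record type. This is only a sketch: the class name `PEffectResult` and its field names are assumptions, and the source does not specify the server's actual output file format.

```python
from dataclasses import dataclass

ANNOTATION_TYPES = ("PSI-BLAST", "SVM")


@dataclass(frozen=True)
class PEffectResult:
    protein_id: str     # identifier as provided by the user
    reliability: int    # 0-100, with 100 the most confident prediction
    is_effector: bool   # predicted to be a type III effector or not
    annotation: str     # "PSI-BLAST" or "SVM"

    def __post_init__(self):
        if not 0 <= self.reliability <= 100:
            raise ValueError("reliability must lie between 0 and 100")
        if self.annotation not in ANNOTATION_TYPES:
            raise ValueError(f"annotation must be one of {ANNOTATION_TYPES}")


example = PEffectResult("query_1", 87, True, "SVM")
```

Validating on construction keeps out-of-range reliability scores or unknown annotation types from propagating into later analysis.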

Prediction reliability

Every prediction result is supported by a Reliability Index (RI) measuring the strength of a prediction. The RI is a value between 0 and 100, with 100 denoting the most confident predictions.

We evaluated the reliability of pEffect predictions on a non-redundant test set of proteins (Fig. 1). At the default threshold of RI>50, over 87% of all predictions of type III effectors are correct and 95% of all effectors in our set are identified (Fig. 1; black arrow). At a higher threshold of RI>80, effector predictions are correct 96% of the time, but only 78% of all effectors in the set are identified (Fig. 1; gray arrow).
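The trade-off between the two measures can be reproduced on any labeled set of predictions. The sketch below assumes that accuracy is the fraction of effector predictions at or above the threshold that are correct, and that coverage is the fraction of all true effectors recovered by those predictions. The function name and the toy data are illustrative, not taken from the pEffect evaluation.

```python
def accuracy_coverage(predictions, threshold):
    """predictions: iterable of (ri, predicted_effector, true_effector)."""
    predictions = list(predictions)
    total_effectors = sum(1 for _, _, truth in predictions if truth)
    # Keep only effector predictions at or above the reliability threshold.
    called = [truth for ri, pred, truth in predictions
              if pred and ri >= threshold]
    correct = sum(1 for truth in called if truth)
    accuracy = correct / len(called) if called else 0.0
    coverage = correct / total_effectors if total_effectors else 0.0
    return accuracy, coverage


# Toy data: (reliability index, predicted effector?, truly an effector?)
toy = [
    (90, True, True),
    (85, True, True),
    (60, True, False),
    (55, True, True),
    (40, True, True),
    (70, False, False),
]
at_50 = accuracy_coverage(toy, 50)
at_80 = accuracy_coverage(toy, 80)
```

On this toy set, raising the threshold from 50 to 80 increases accuracy while lowering coverage, the same pattern reported for pEffect in Fig. 1.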

Runtime analysis

pEffect first runs a homology-based PSI-BLAST search; if no hit is identified, a de-novo SVM prediction is used.

While PSI-BLAST searches are fast, the SVM's runtime depends on the number of query protein sequences. We measured pEffect's runtime on a Dell M605 machine with a Six-Core AMD Opteron processor (2.4 GHz, 6MB and 75W ACP) running Linux.

Number of sequences | 1 | 100 | 500 | 1000 | 3000 | 5000 | 10000
Run 1 | 2.3s | 13.0s | 1m8.7s | 2m26.7s | 7m42.4s | 13m11.8s | 25m15.6s
Run 2 | 1.5s | 13.1s | 1m13.2s | 2m22.3s | 7m37.0s | 13m1.5s | 25m43.1s
Run 3 | 1.5s | 13.5s | 1m13.1s | 2m27.3s | 7m34.5s | 12m53.5s | 24m57.6s

Note: to speed up the server's response, we store all PSI-BLAST profile files in the PredictProtein cache (current size: results for more than 11 million sequences), from which they can be retrieved very quickly. For novel protein sequences without a PSI-BLAST profile in the cache, the runtime increases substantially.
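The timings in the table above can be converted to seconds and averaged across the three runs. The helper below is a small sketch, with hypothetical function names, that parses duration strings in the notation used in the table (e.g. 1m8.7s).

```python
import re

_DURATION = re.compile(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+(?:\.\d+)?)s)?")


def to_seconds(text: str) -> float:
    """Convert strings such as '2.3s' or '25m15.6s' to seconds."""
    match = _DURATION.fullmatch(text.strip())
    if not text.strip() or match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes, seconds = match.groups()
    return (int(hours or 0) * 3600
            + int(minutes or 0) * 60
            + float(seconds or 0.0))


def mean_seconds(durations) -> float:
    values = [to_seconds(d) for d in durations]
    return sum(values) / len(values)


# The three measurements for 10000 sequences, copied from the table.
runs_10000 = ["25m15.6s", "25m43.1s", "24m57.6s"]
mean_10000 = mean_seconds(runs_10000)
per_sequence = mean_10000 / 10000
```

For the 10000-sequence column this gives a mean of a little over 25 minutes, or roughly 0.15 seconds per sequence, on the hardware stated above.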

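Availability/ Download

  • pEffect's web server is available at http://bromberglab.org/services/pEffect/
  • The standalone version of pEffect can be downloaded as a zip file (https://rostlab.org/services/peffect/downloads) or a tar.gz file (ftp://rostlab.org/peffect/).
  • The Debian package can be downloaded from https://rostlab.org/services/peffect/db/peffect_1.0.0_amd64.deb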

Data

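  • Data sets used for the development and evaluation of pEffect can be accessed at https://rostlab.org/services/peffect/downloads
  • Predictions for whole proteomes of Gram-negative and Gram-positive Bacteria, as well as of Archaea, can be downloaded from https://rostlab.org/services/peffect/proteomes or https://rostlab.org/services/peffect/downloads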

Reference

Goldberg T, Rost B, Bromberg Y. Computational prediction shines light on type III secretion origins. Sci Rep. 2016 Oct 7;6:34516.

doi: 10.1038/srep34516; PubMed PMID: 27713481; PubMed Central PMCID: PMC5054392.

Contact

For questions, please contact localization@rostlab.org