PEffect - prediction of bacterial type III effector proteins

From Rost Lab Open
Revision as of 20:06, 18 November 2015 by Goldberg (Talk | contribs)



Web server


The type III secretion system is one of the causes of a wide range of bacterial infections in humans, animals and plants. This system comprises a hollow needle-like structure on the surface of bacterial cells that injects specific bacterial proteins, the so-called effectors, directly into the cytoplasm of a host cell. During infection, effectors convert host resources to the bacterium's advantage and promote pathogenicity.

We (Tatyana Goldberg, Burkhard Rost and Yana Bromberg) at BrombergLab and RostLab developed pEffect, a novel method that predicts bacterial type III effector proteins. pEffect combines sequence-based homology searches with advanced machine learning to accurately predict effector proteins, using information encoded in the entire protein sequence.

Method design

pEffect combines sequence similarity-based inference (PSI-BLAST) with de-novo prediction using machine learning (Support Vector Machines; SVM). For a query protein, it first runs PSI-BLAST to identify a homolog in the set of known, annotated effector proteins. If such a homolog is found, its annotation (i.e. type III effector) is transferred to the query protein. If no homolog is found, pEffect triggers an SVM that predicts effector proteins by searching for stretches of k consecutive residues that are known from annotated proteins.
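The two-stage decision described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the pEffect implementation: `find_homolog` and `svm_predict` are hypothetical stand-ins for the PSI-BLAST search and the trained SVM, and `kmers` merely shows what "stretches of k consecutive residues" look like for a sequence (the actual feature representation and the value of k are not specified here).

```python
from typing import Callable, List, Optional, Tuple


def kmers(sequence: str, k: int) -> List[str]:
    """Return all overlapping stretches of k consecutive residues."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def predict_effector(
    sequence: str,
    find_homolog: Callable[[str], Optional[str]],  # hypothetical PSI-BLAST wrapper
    svm_predict: Callable[[str], bool],            # hypothetical trained SVM
) -> Tuple[bool, str]:
    """Return (is_effector, annotation_type) for one query sequence."""
    homolog_id = find_homolog(sequence)
    if homolog_id is not None:
        # A homolog among the known effectors was found:
        # transfer its annotation to the query.
        return True, "PSI-BLAST"
    # No homolog available: fall back to the de-novo SVM prediction.
    return svm_predict(sequence), "SVM"
```

With stub functions in place of the real tools, `predict_effector("MKLV", lambda s: None, lambda s: True)` takes the SVM branch, while any non-`None` homolog identifier short-circuits to the PSI-BLAST branch.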


The input to the server is:

1. one or more FASTA-formatted protein sequences. The sequences must be in one-letter amino acid code (not case-sensitive). The allowed amino acids are: ACDEFGHIKLMNPQRSTVWY and X (unknown). Example.

2. an e-mail address (optional): a notification of the completed prediction and the access link to the result are sent to the address provided
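Before submitting, it can be useful to check locally that the sequences satisfy the input rules listed above. The following is a minimal sketch under our own assumptions: the function names `parse_fasta` and `invalid_residues` are illustrative and not part of the server, and the server itself may treat malformed input differently.

```python
from typing import Dict, List

# Allowed one-letter amino acid codes, plus X for unknown residues.
ALLOWED = set("ACDEFGHIKLMNPQRSTVWYX")


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {header: sequence}; sequences are upper-cased."""
    records: Dict[str, List[str]] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []  # a repeated header overwrites the earlier record
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[header].append(line.upper())
    return {h: "".join(parts) for h, parts in records.items()}


def invalid_residues(sequence: str) -> List[str]:
    """Return the sorted characters that are not allowed amino acid codes."""
    return sorted(set(sequence.upper()) - ALLOWED)
```

An empty list from `invalid_residues` means the sequence uses only the allowed alphabet; input is upper-cased first because the codes are not case-sensitive.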


For every query protein, the result contains four basic values:

1. the protein identifier as provided by the user

2. the reliability score of a prediction on a 0-100 scale with 100 being the most confident prediction

3. prediction of a protein to be a type III effector

4. annotation type of the prediction (PSI-BLAST or SVM)

For PSI-BLAST predictions, clicking on the annotation type (i.e. PSI-BLAST) on the web site shows information about the closest homolog and its PSI-BLAST alignment to the query.


Figure 1
Fig 1: More reliable predictions are better. The curves show the percentage Accuracy vs. Coverage for LocTree3 predictions above a given RI threshold (from 0=unreliable to 100=most reliable). The curves were obtained on cross-validated test sets of bacterial (gray line) and eukaryotic (black line) proteins. Half of all eukaryotic proteins are predicted at RI>65; for these Q18 is above 95% (black arrow). Half of all bacterial proteins are predicted at RI>80 and Q18 above 95% (black arrow).

Prediction reliability

Every prediction result is supported by a Reliability Index (RI) measuring the strength of a prediction. The RI is a value between 0 and 100, with 100 denoting the most confident predictions.

We rigorously evaluated the reliability of pEffect predictions on a non-redundant test set of proteins (Fig. 1). At the default threshold of RI>50, over 87% of all type III effector predictions are correct and 95% of all effectors in our set are identified (Fig. 1; black arrow). At the higher threshold of RI>80, effector predictions are correct 96% of the time, but only 78% of all effectors in the set are identified (Fig. 1; gray arrow).
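The trade-off between the two thresholds can be made concrete with a small sketch that filters predictions by RI and computes the fraction of correct effector predictions and the fraction of effectors recovered. The `Prediction` record mirrors the four result values listed above, but its field names are our own, and the toy data in the usage note is invented for illustration; it does not reproduce the evaluation reported here.

```python
from dataclasses import dataclass
from typing import Iterable, Set, Tuple


@dataclass
class Prediction:
    protein_id: str       # identifier as provided by the user
    reliability: int      # RI on a 0-100 scale
    is_effector: bool     # predicted to be a type III effector
    annotation_type: str  # "PSI-BLAST" or "SVM"


def accuracy_and_coverage(
    predictions: Iterable[Prediction],
    true_effectors: Set[str],
    threshold: int = 50,
) -> Tuple[float, float]:
    """Return (fraction of calls that are correct, fraction of effectors found)
    for effector predictions with RI strictly above the threshold."""
    called = {
        p.protein_id
        for p in predictions
        if p.is_effector and p.reliability > threshold
    }
    correct = len(called & true_effectors)
    accuracy = correct / len(called) if called else 0.0
    coverage = correct / len(true_effectors) if true_effectors else 0.0
    return accuracy, coverage
```

For example, with invented predictions `a` (RI 90), `b` (RI 60) and `c` (RI 40), all called effectors, and true effectors `{a, c}`, raising the threshold from 50 to 80 removes the false call `b`, so fewer predictions pass but a larger share of them is correct.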

Runtime analysis

pEffect is built to run a homology-based PSI-BLAST; if no hit is identified then a de-novo SVM prediction is used.

While PSI-BLAST searches are fast, the SVM's runtime depends on the number of query protein sequences. We measured pEffect's runtime on a Dell M605 machine with a Six-Core AMD Opteron processor (2.4 GHz, 6MB and 75W ACP) running Linux.

Sequences  1   100  500  1000  3000  5000  10000
Run 1      na  na   na   na    na    na    na
Run 2      na  na   na   na    na    na    na
Run 3      na  na   na   na    na    na    na

Note: to reduce the server's response time, we store all PSI-BLAST profile files in the PredictProtein cache (current size: results for >11 million sequences), from which they can be retrieved very quickly. For novel protein sequences without a PSI-BLAST profile in the cache, the runtime increases substantially.
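The effect of the cache can be illustrated with a generic look-up-then-compute pattern. This sketch does not describe how the PredictProtein cache is actually implemented: the hashing scheme, the in-memory dictionary and the `compute_profile` callback are all assumptions made for the example.

```python
import hashlib
from typing import Callable, Dict


def sequence_key(sequence: str) -> str:
    """Derive a cache key from the sequence (assumed scheme: SHA-256 of upper-case text)."""
    return hashlib.sha256(sequence.upper().encode("ascii")).hexdigest()


def get_profile(
    sequence: str,
    cache: Dict[str, str],
    compute_profile: Callable[[str], str],  # hypothetical slow PSI-BLAST run
) -> str:
    """Return the cached profile if present; otherwise compute and store it."""
    key = sequence_key(sequence)
    if key not in cache:
        # Cache miss: this is the slow path taken for novel sequences.
        cache[key] = compute_profile(sequence)
    return cache[key]
```

A second request for the same sequence is served from `cache` without calling `compute_profile` again, which is why previously seen sequences return much faster than novel ones.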

Availability / Download


  • Data sets used for development and evaluation of pEffect can be accessed here.
  • Predictions of whole proteomes in Gram-negative and Gram-positive Bacteria, as well as in Archaea can be downloaded from here.



For questions, please contact
