Evaluation of transmembrane helix predictions in 2014

J Reeb, E Kloppmann, M Bernhofer, & B Rost (2015). Proteins: Structure, Function, and Bioinformatics, 83(3), 473–484. doi:10.1002/prot.24749
Pubmed
PDF

Poster: ISMB 2015. TMSEG is available in PredictProtein or as a download.

Abstract

Experimental structure determination continues to be challenging for membrane proteins. Computational prediction methods are, therefore, needed and widely used to supplement experimental data. Here, we re-examine the state-of-the-art in transmembrane helix prediction based on a non-redundant dataset with 190 high-resolution structures. Analyzing 12 widely-used and well-known methods using a stringent performance measure, we largely confirmed the expected high level of performance. All methods performed worse for proteins that could not have been used for development.

A few results stood out. Firstly, all methods predicted proteins in eukaryotes better than those in bacteria. Secondly, methods worked less well for proteins with many transmembrane helices. Thirdly, most methods correctly discriminated between globular water-soluble and transmembrane proteins. However, several older methods often mistook signal peptides for transmembrane helices. Some newer methods have overcome this shortcoming. In our hands, PolyPhobius and MEMSAT-SVM appeared better than other methods.

Dataset

The dataset of 190 unique alpha-helical transmembrane proteins can be downloaded as an Excel spreadsheet or a tab-separated file.

The localization of transmembrane helices (TMHs) in our dataset was annotated using OPM and PDBTM and can be downloaded as a FASTA file (subset of 44 new proteins).

Evaluation scores

$$Q_{htm}^{\%obs} = \frac{\text{number of correctly predicted TMHs in data set}}{\text{number of TMHs observed in data set}}$$

$$Q_{htm}^{\%pred} = \frac{\text{number of correctly predicted TMHs in data set}}{\text{number of TMHs predicted in data set}}$$

$$Q_{ok} = \frac{1}{N_{prot}} \cdot \sum_{i=1}^{N_{prot}}{\delta_i} ; \quad \delta_i = \begin{cases}1&\text{, if } Q_{htm}^{\%obs} = 1 = Q_{htm}^{\%pred} \text{ for protein } i\\0&\text{, else}\end{cases}$$

A helix counts as correctly predicted if both end points of the predicted helix deviate by at most 5 residues from those of the observed helix. The implementation of this can be run interactively here.
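To make the three scores concrete, the following is a minimal Python sketch, not the implementation linked above. It assumes helices are given as inclusive `(start, end)` residue positions, that predicted and observed helices are matched one-to-one in sequence order, and that a single reference annotation is used per protein. The function names `count_correct` and `q_scores` are hypothetical.

```python
from typing import List, Tuple

Helix = Tuple[int, int]  # (start, end) residue positions, inclusive (assumed format)


def count_correct(observed: List[Helix], predicted: List[Helix], tol: int = 5) -> int:
    """Count predicted helices whose two end points both lie within `tol`
    residues of an observed helix. Each observed helix is matched at most once
    (one-to-one greedy matching is an assumption of this sketch)."""
    used = set()
    correct = 0
    for p_start, p_end in predicted:
        for j, (o_start, o_end) in enumerate(observed):
            if j in used:
                continue
            if abs(p_start - o_start) <= tol and abs(p_end - o_end) <= tol:
                used.add(j)
                correct += 1
                break
    return correct


def q_scores(proteins: List[Tuple[List[Helix], List[Helix]]], tol: int = 5):
    """Return (Q_htm_obs, Q_htm_pred, Q_ok) as fractions for a list of
    (observed, predicted) helix lists, one pair per protein."""
    n_correct = n_obs = n_pred = n_ok = 0
    for observed, predicted in proteins:
        c = count_correct(observed, predicted, tol)
        n_correct += c
        n_obs += len(observed)
        n_pred += len(predicted)
        # delta_i = 1 only if every observed helix is found and
        # no additional helix is predicted for this protein.
        if c == len(observed) == len(predicted):
            n_ok += 1
    q_obs = n_correct / n_obs if n_obs else 0.0
    q_pred = n_correct / n_pred if n_pred else 0.0
    q_ok = n_ok / len(proteins) if proteins else 0.0
    return q_obs, q_pred, q_ok


if __name__ == "__main__":
    example = [
        ([(10, 30), (50, 70)], [(12, 31), (48, 66)]),  # both helices within tolerance
        ([(10, 30), (50, 70)], [(12, 31)]),            # one helix missed
        ([(10, 30)], [(20, 40), (60, 80)]),            # end points off by 10 residues
    ]
    print(q_scores(example))
```

To follow the caption's "either OPM or PDBTM" criterion, one option would be to evaluate each protein against both annotations and count it as correct if either one yields a full match. That extension is not shown here.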

Transmembrane helix prediction performance

Qok scores for all 12 prediction methods on various sets of TMPs. Qok denotes the percentage of proteins for which all TMHs were correctly predicted (A, TMH endpoints within five or fewer residues of either the OPM or the PDBTM annotation for the whole protein; cf. Methods). The numbers above the bars give the number of proteins in each dataset. Error bars show the sample standard deviation obtained by bootstrapping with 1000 draws of half the set size each (cf. Methods). Qok is plotted for B: 190 redundancy-reduced TMPs, followed by 44 new TMPs (not used for development) and 146 old TMPs (used for development, either the protein itself or homologous proteins). All methods clearly performed worse for more recently determined protein structures. The old-new difference for TopPred2 suggested that a significant fraction of the differences might not be explained by over-training. C: All methods reached higher Qok values for eukaryotes than for bacteria. Note that we excluded the 9 archaeal sequences and the 2 sequences of viral origin. D: Performance declined from bitopic TMPs to those with 2-5 or more TMHs. For D, the number in brackets after the set size denotes the number of TMHs in the respective subset.
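The error bars described in the caption can be illustrated with a short Python sketch. It assumes that each bootstrap draw samples half the set size with replacement (the caption does not say whether draws are with or without replacement) and that each protein is represented by a 0/1 flag for "all TMHs correct". The function name `bootstrap_qok_sd` and the fixed seed are hypothetical choices made for reproducibility.

```python
import random
import statistics
from typing import Sequence


def bootstrap_qok_sd(ok_flags: Sequence[int], n_draws: int = 1000, seed: int = 0) -> float:
    """Estimate the spread of Qok by bootstrapping.

    ok_flags holds one 0/1 value per protein (1 = all TMHs correctly predicted).
    Each draw samples half the set size WITH replacement (an assumption) and
    computes Qok on that sample; the sample standard deviation over all draws
    is returned.
    """
    rng = random.Random(seed)
    k = max(1, len(ok_flags) // 2)  # half the set size, at least one protein
    estimates = []
    for _ in range(n_draws):
        sample = rng.choices(ok_flags, k=k)
        estimates.append(sum(sample) / k)
    return statistics.stdev(estimates)


if __name__ == "__main__":
    # Toy data: 190 proteins, half of them fully correct.
    flags = [1] * 95 + [0] * 95
    print(bootstrap_qok_sd(flags))
```

Because each draw uses only half of the proteins, the resulting spread is wider than it would be for draws of the full set size.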

Transmembrane helix prediction methods

Evaluated transmembrane helix prediction methods. Names link to the respective webservers.
Name Publication
¹Superseded by TMSEG (unpublished)
TopPred2

Claros, M. G., & Von Heijne, G. (1994). TopPred II: an improved software for membrane protein structure predictions. Computer Applications in the Biosciences CABIOS DOI

PHDhtm¹

Rost, B., Casadio, R., Fariselli, P., & Sander, C. (1995). Transmembrane helices predicted at 95% accuracy. Protein Science DOI

HMMTOP 2

Tusnády, G. E., & Simon, I. (2001). The HMMTOP transmembrane topology prediction server. Bioinformatics DOI

TMHMM 2

Krogh, A., Larsson, B., von Heijne, G., & Sonnhammer, E. L. (2001). Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. Journal of Molecular Biology DOI

SOSUI

Hirokawa, T., Boon-Chieng, S., & Mitaku, S. (1998). SOSUI: classification and secondary structure prediction system for membrane proteins. Bioinformatics DOI

Phobius

Käll, L., Krogh, A., & Sonnhammer, E. L. L. (2004). A combined transmembrane topology and signal peptide prediction method. Journal of Molecular Biology DOI

PolyPhobius

Käll, L., Krogh, A., & Sonnhammer, E. L. L. (2005). An HMM posterior decoder for sequence feature prediction that includes homology information. Bioinformatics DOI

MEMSAT3

Jones, D. T. (2007). Improving the accuracy of transmembrane protein topology prediction using evolutionary information. Bioinformatics DOI

Philius

Reynolds, S. M., Käll, L., Riffle, M. E., Bilmes, J. A., & Noble, W. S. (2008). Transmembrane topology and signal peptide prediction using dynamic Bayesian networks. PLoS Computational Biology DOI

SCAMPI

Bernsel, A., Viklund, H., Falk, J., Lindahl, E., Von Heijne, G., & Elofsson, A. (2008). Prediction of membrane-protein topology from first principles. Proceedings of the National Academy of Sciences DOI

SPOCTOPUS

Viklund, H., Bernsel, A., Skwark, M., & Elofsson, A. (2008). SPOCTOPUS: a combined predictor of signal peptides and membrane protein topology. Bioinformatics DOI

MEMSAT-SVM

Nugent, T., & Jones, D. T. (2009). Transmembrane protein topology prediction using support vector machines. BMC Bioinformatics DOI