Tmseg

The TMSEG method

TMSEG predicts transmembrane proteins (TMP) and transmembrane helices (TMH) using position-specific scoring matrices (PSSM) generated by PSI-BLAST, as well as the physico-chemical properties of the amino acids. The prediction is divided into three steps performed by three different classifiers.

In the first step, a random forest (RF) predicts the probability of each residue being in one of three states: transmembrane, soluble, or signal peptide. The RF uses a sliding window of 19 residues for the PSSM scores and 9 residues for the physico-chemical properties (charge, hydrophobicity, polarity). The protein sequence is then divided into transmembrane and soluble segments (and signal peptide, if applicable) based on the probabilities.
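To make the sliding-window idea concrete, the following minimal sketch lists, for every residue, the stretch of sequence that a window of a given width would cover. It is an illustration only and not TMSEG's implementation: the function name windows is hypothetical, the window is simply truncated at the termini (how TMSEG treats the termini is not stated here), and the real features are PSSM scores and physico-chemical properties rather than the residue letters shown.

```shell
# windows SEQUENCE WIDTH
# Print "position window" for every residue. The window is centred on the
# residue and truncated at the sequence termini (an assumption of this sketch).
windows() {
    printf '%s\n' "$1" | awk -v w="$2" '{
        half = int(w / 2)
        n = length($0)
        for (i = 1; i <= n; i++) {
            start = i - half
            if (start < 1) start = 1
            end = i + half
            if (end > n) end = n
            print i, substr($0, start, end - start + 1)
        }
    }'
}
```

Calling windows with a width of 19 or 9 would show the residues contributing to the PSSM features or the physico-chemical features of each position, respectively.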

In the second step a neural network (NN) refines the prediction by adjusting the position of the TMHs or potentially splitting very long TMHs (>34 residues). This NN is specifically trained on the length, amino acid composition, and physico-chemical properties of TMHs.

In the last step another RF predicts the inside/outside topology of the N-terminus. The prediction is based on the amino acid composition and positive charge of the residues on the two sides of the membrane (separated by the TMHs).

Performance

TMSEG was compared to four established methods: PolyPhobius [1], MEMSAT3 [2], MEMSAT-SVM [3], and PHDhtm [4]. Its performance was at least comparable to, and often better than, that of the other four methods (Bernhofer et al., submitted). The evaluation was performed on a dataset with 41 transmembrane proteins and 285 soluble proteins. The PSSM profiles were generated by running PSI-BLAST against the UniProt [5] Reference Cluster with 90% sequence identity (UniRef90).

TMSEG correctly identified 98±2% of the transmembrane proteins (40 out of 41 TMPs) and had a false positive rate of only 3±1% (8 out of 285 soluble proteins). Transmembrane helices were predicted with a precision of 87±4% and recall of 85±4%, and 66±7% of all transmembrane proteins were predicted with all their helices at the correct positions (i.e. no false positives/negatives).
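The protein-level percentages follow directly from the counts given in parentheses. As a quick check, the small helper below (percent is a hypothetical name) recomputes them with awk. It does not reproduce the ± error estimates, whose derivation is not described here.

```shell
# Percentage of COUNT out of TOTAL, rounded to one decimal place.
percent() {
    awk -v c="$1" -v t="$2" 'BEGIN { printf "%.1f\n", 100 * c / t }'
}

percent 40 41    # correctly identified TMPs: 40 of 41
percent 8 285    # false positives: 8 of 285 soluble proteins
```

The two calls print 97.6 and 2.8, which round to the reported 98% and 3%.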

A predicted helix was considered to be correct if its end-points did not deviate by more than five residues from the observed helix, and if the overlap between the predicted and observed helix was at least half of the length of the longer helix.
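The criterion above can be sketched as a small shell function. This is one reading of the rule, not the evaluation code used for TMSEG: it assumes 1-based inclusive start and end positions, checks each end-point separately against the five-residue limit, and treats "at least half" as 2 × overlap ≥ length of the longer helix. The function name helix_is_correct is hypothetical.

```shell
# helix_is_correct PRED_START PRED_END OBS_START OBS_END
# Exit status 0 if the predicted helix counts as correct, non-zero otherwise.
# The ( ... ) body runs in a subshell, so the variables stay local.
helix_is_correct() (
    ps=$1; pe=$2; os=$3; oe=$4

    # Absolute deviation of the two end-points
    ds=$((ps - os)); if [ "$ds" -lt 0 ]; then ds=$((-ds)); fi
    de=$((pe - oe)); if [ "$de" -lt 0 ]; then de=$((-de)); fi

    # Number of residues shared by both helices (0 if they do not overlap)
    lo=$ps; if [ "$os" -gt "$lo" ]; then lo=$os; fi
    hi=$pe; if [ "$oe" -lt "$hi" ]; then hi=$oe; fi
    overlap=$((hi - lo + 1)); if [ "$overlap" -lt 0 ]; then overlap=0; fi

    # Length of the longer of the two helices
    longer=$((pe - ps + 1))
    if [ $((oe - os + 1)) -gt "$longer" ]; then longer=$((oe - os + 1)); fi

    [ "$ds" -le 5 ] && [ "$de" -le 5 ] && [ $((2 * overlap)) -ge "$longer" ]
)
```

For example, a predicted helix at 10-30 against an observed helix at 12-31 passes (end-points off by 2 and 1, overlap of 19 residues against a longer helix of 21), while the same prediction against an observed helix at 20-40 fails because the start deviates by 10 residues.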

Influence of database size

TMSEG uses only the PSI-BLAST PSSM scores and features derived from those scores. Therefore, the quality of the prediction strongly depends on the quality of the PSSM. In order to estimate the effect of the database size on the prediction accuracy, PSSMs from a PSI-BLAST run against the UniRef50 Cluster and Swiss-Prot were used.

These PSSMs mainly affected the recall of transmembrane proteins and helices. The protein recall dropped to 95% (UniRef50) and 90% (Swiss-Prot), and the helix recall to 79% (UniRef50) and 77% (Swiss-Prot). The precision of the transmembrane helices dropped to 83% (UniRef50) and 82% (Swiss-Prot), and the percentage of transmembrane proteins with all helices at their correct positions was only 59% (UniRef50) and 49% (Swiss-Prot). However, the false positive rate (i.e. soluble proteins predicted as transmembrane proteins) was mostly unaffected and remained at 3% (UniRef50) and 2% (Swiss-Prot).

Tutorial on how to use PSI-BLAST and TMSEG

In order to use TMSEG you will also need a PSI-BLAST PSSM file for the query sequence. If you do not already have a PSSM file for your sequence, it can be obtained by running PSI-BLAST against a search database. A suitable search database would be the UniRef90 cluster from UniProt.

How to run PSI-BLAST

There are two commonly used executables for PSI-BLAST:

  • psiblast (BLAST+)
  • blastpgp (legacy BLAST)

Although they are functionally the same, the parameter names differ. The table below lists the parameters most relevant for use with TMSEG.

psiblast            blastpgp  Short description
-db                 -d        Path to search database
-query              -i        Path to input FASTA file
-out                -o        Path to output BLAST file
-out_ascii_pssm     -Q        Path to output (ASCII) PSSM file
-evalue             -e        E-value threshold to retain alignment for BLAST output
-inclusion_ethresh  -h        E-value threshold to retain alignment for PSSM generation
-num_iterations     -j        Number of PSI-BLAST iterations to perform
-num_threads        -a        Number of CPU cores to use for PSI-BLAST search
-seg                -F        Use SEG filter on input sequence
-use_sw_tback       -s        Calculate locally optimal Smith-Waterman alignments

We recommend calculating locally optimal Smith-Waterman alignments. While this increases the runtime, it often helps to generate better PSSMs. The E-value thresholds should be between 1e-3 and 1e-5. Lower values reduce runtime, but also reduce the number of alignments used to generate the PSSM (and thus its complexity). Furthermore, we recommend performing 3 PSI-BLAST iterations.

The needed PSSM file is then generated with the -Q or -out_ascii_pssm parameter.

Sample PSI-BLAST calls:

blastpgp -d path/to/database -i path/to/fasta -o path/to/output -Q path/to/pssm -e 1e-3 -h 1e-3 -j 3 -F F -s T -a 1

psiblast -db path/to/database -query path/to/fasta -out path/to/output -out_ascii_pssm path/to/pssm -evalue 1e-3 -inclusion_ethresh 1e-3 -num_iterations 3 -seg no -use_sw_tback -num_threads 1

Set up a search database

If you have not already set up a search database, you can create one from a FASTA file (e.g. uniref90.fasta) with the makeblastdb (BLAST+) or formatdb (legacy BLAST) application.

formatdb -i path/to/fasta

makeblastdb -in path/to/fasta -dbtype prot

How to run TMSEG

TMSEG parameter  Short description
-i               Path to input FASTA file/folder
-p               Path to input (ASCII) PSSM file/folder
-o               Path to output file/folder (human readable)
-r               Path to output file/folder (raw output)
-m               FLAG: switch to batch mode (multiple input files)
-x               FLAG: improve prediction provided in input FASTA file
-t               FLAG: improve inside/outside topology prediction only (requires -x)

Once you have a PSSM file for your protein sequence of interest, running TMSEG is easy:

tmseg -i path/to/fasta -p path/to/pssm -o path/to/output [-r path/to/raw-output]

The normal (human readable) output is generated with the -o parameter. You can also generate an output file with the raw prediction values of the neural network and random forests by supplying the -r parameter.

Multiple input sequences

TMSEG can also be used to predict multiple sequences at once. This significantly reduces runtime as TMSEG does not need to reload the prediction models for each call.

To use this feature, you have to add the -m parameter flag and prepare the input files as follows: Each protein sequence must be in a separate .fasta file. In addition, the PSSM file for each sequence must have the same base name as the sequence file and end with .pssm. For example, you could have query1.fasta, query2.fasta, and query3.fasta in one directory and query1.pssm, query2.pssm, and query3.pssm in the same or another directory. Instead of the usual path to the input/output files, you must then give the corresponding directories:

tmseg -i path/to/fasta-dir -p path/to/pssm-dir -o path/to/output-dir [-r path/to/raw-output-dir] -m
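A directory of PSSM files with matching names can be produced with a simple loop around PSI-BLAST. The sketch below assumes the BLAST+ psiblast executable and reuses the options from the sample call in the PSI-BLAST section. The function name make_pssms, the PSIBLAST override variable, and the .blast suffix for the alignment output are choices made for this illustration only.

```shell
# make_pssms FASTA_DIR PSSM_DIR DATABASE
# Run PSI-BLAST once per *.fasta file and write <name>.pssm into PSSM_DIR,
# so that the file names match what batch mode (-m) expects.
# Set PSIBLAST to use a different executable path (hypothetical override).
make_pssms() (
    fasta_dir=$1; pssm_dir=$2; db=$3
    mkdir -p "$pssm_dir" || exit 1
    for fasta in "$fasta_dir"/*.fasta; do
        [ -e "$fasta" ] || continue            # directory holds no .fasta files
        name=$(basename "$fasta" .fasta)
        "${PSIBLAST:-psiblast}" -db "$db" -query "$fasta" \
            -out "$pssm_dir/$name.blast" \
            -out_ascii_pssm "$pssm_dir/$name.pssm" \
            -evalue 1e-3 -inclusion_ethresh 1e-3 -num_iterations 3 \
            -seg no -use_sw_tback -num_threads 1 || exit 1
    done
)

# Usage (paths are placeholders):
#   make_pssms path/to/fasta-dir path/to/pssm-dir path/to/database
#   tmseg -i path/to/fasta-dir -p path/to/pssm-dir -o path/to/output-dir -m
```

The function stops at the first failing PSI-BLAST run, so a missing PSSM file is noticed before TMSEG is started.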

Applying TMSEG to other methods

In order to apply TMSEG to other methods' predictions, you have to parse their output. The input file must have the following three lines (in that order):

  • A header that must start with >
  • The protein sequence
  • The other methods' prediction (as a single string)

All lines prior to the header are ignored. See below for an example:

>P26789|1nkzE|LHA4_RHOAC PDBTM_annotations
MNQGKIWTVVNPAIGIPALLGSVTVIAILVHLAILSHTTWFPAYWQGGVKKAA
1111111111111111111HHHHHHHHHHHHHHHHHH2222222222222222

Legal characters for the prediction line are:

  • H/h/T/t for transmembrane helices
  • S/s for signal peptides
  • L/l for re-entrant regions
  • U/u for regions of unknown structure
  • all other characters are interpreted as non-TM
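When converting another method's output, a small helper can assemble the three-line input file and catch obvious mistakes. The sketch below is an assumption-laden illustration: the function name write_refine_input is hypothetical, and the length check assumes the prediction string carries exactly one character per residue, as in the example above.

```shell
# write_refine_input HEADER SEQUENCE PREDICTION
# Print the three-line input (header, sequence, prediction) to stdout.
# Fails if the header does not start with ">" or if the sequence and the
# prediction differ in length (one character per residue is assumed).
write_refine_input() (
    header=$1; seq=$2; pred=$3
    case $header in
        ">"*) ;;
        *) echo "header must start with >" >&2; exit 1 ;;
    esac
    if [ "${#seq}" -ne "${#pred}" ]; then
        echo "sequence and prediction differ in length" >&2
        exit 1
    fi
    printf '%s\n%s\n%s\n' "$header" "$seq" "$pred"
)

# Usage (file name is a placeholder):
#   write_refine_input ">query1" "$sequence" "$prediction" > path/to/prediction
```

The resulting file can then be passed to TMSEG with -i together with the -x flag.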

Sample TMSEG calls:

Process inside/outside topology only: tmseg -i path/to/prediction -p path/to/pssm -o path/to/output [-r path/to/raw-output] -x -t

Process TMHs and inside/outside topology: tmseg -i path/to/prediction -p path/to/pssm -o path/to/output [-r path/to/raw-output] -x

Outside resources

Availability

Publication

References

[1] L. Käll, A. Krogh, and E. L. Sonnhammer. An HMM posterior decoder for sequence feature prediction that includes homology information. Bioinformatics, 21 Suppl 1:i251–257, Jun 2005. [DOI:10.1093/bioinformatics/bti1014] [PubMed:15961464].

[2] D. T. Jones. Improving the accuracy of transmembrane protein topology prediction using evolutionary information. Bioinformatics, 23(5):538–544, Mar 2007. [DOI:10.1093/bioinformatics/btl677] [PubMed:17237066].

[3] T. Nugent and D. T. Jones. Transmembrane protein topology prediction using support vector machines. BMC Bioinformatics, 10:159, 2009. [DOI:10.1186/1471-2105-10-159] [PubMed:19470175].

[4] B. Rost, P. Fariselli, and R. Casadio. Topology prediction for helical transmembrane proteins at 86% accuracy. Protein Sci., 5(8):1704–1718, Aug 1996. [DOI:10.1002/pro.5560050824] [PubMed:8844859] [PubMed Central:PMC2143485].

[5] UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res., 43:D204–212, 2015. [DOI:10.1093/nar/gku989] [PubMed:25348405] [PubMed Central:PMC4384041].