| Title: | Online tools for predicting integral membrane proteins |
| Author: | Henry Bigelow and Burkhard Rost |
| Quote: |
In: M Peirce & R Wait (Eds.): |
Online tools for predicting integral membrane proteins
| 1 | Dept. of Biochemistry and Molecular Biophysics, Columbia University, 630 West 168th Street, New York, NY 10032, USA |
| 2 | Columbia University Center for Computational Biology and Bioinformatics (C2B2), 1130 St. Nicholas Avenue Rm 802, New York, NY 10032, USA |
| 3 | North East Structural Genomics Consortium (NESG), Department of Biochemistry and Molecular Biophysics, Columbia University, 630 West 168th Street, New York, NY 10032, USA |
| * | Corresponding author: hrbigelow@gmail.com URL http://www.rostlab.org/ Tel: +1-212-851-4669 |
We identify and describe a set of tools readily available for integral membrane protein prediction. These tools address two problems: finding potential transmembrane proteins in a pool of new sequences, and identifying their transmembrane regions. All methods involve comparing the query protein against one or more target models. In the simplest of these, the target "model" is another protein sequence, while the more elaborate methods group together the entire set of transmembrane helical or transmembrane beta barrel proteins. In general, prediction accuracy, whether in identifying new integral membrane proteins or in locating transmembrane regions of known ones, depends strongly on how closely the query fits the model. Because of this, the best approach is an opportunistic one: submit the protein of interest to all methods and choose the results with the highest confidence scores.
Key words: membrane protein structure prediction, transmembrane helix, transmembrane beta barrel, hidden Markov model, neural network, remote homolog detection, proteome searching
Basic concept of alignments. At a basic level, all methods follow the same paradigm, of which BLAST is the simplest example. BLAST aligns the query sequence with each target sequence in a database. The alignment algorithm assigns a score to each alignment of query and target using a 20 x 20 matrix of scores called a "substitution matrix". The substitution matrix quantifies how often proteins whose sequences are aligned, based on known structure, have the same or different amino acids at each position. The alignment score is the sum of the substitution matrix values for the aligned residue pairs plus the scores associated with gaps. Finally, a score threshold applied to all of these alignments identifies a subset of the targets as homologs.
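The scoring step can be illustrated with a minimal sketch. It scores an alignment that has already been computed (it does not search for the best alignment, as BLAST does), and the match, mismatch and gap values are toy numbers chosen for illustration, not entries of a real substitution matrix. The function names are hypothetical.

```python
# Toy scores for illustration only; a real search would use a full
# 20 x 20 substitution matrix and tuned gap scores.
MATCH_SCORE = 2
MISMATCH_SCORE = -1
GAP_OPEN = -5
GAP_EXTEND = -1


def substitution(a, b):
    """Stand-in for a substitution-matrix lookup."""
    return MATCH_SCORE if a == b else MISMATCH_SCORE


def alignment_score(query_row, target_row):
    """Sum substitution scores and gap scores over a gapped pairwise alignment.

    Both rows must have equal length and use '-' for gaps. For simplicity,
    consecutive gap columns count as one gap even if they switch rows.
    """
    if len(query_row) != len(target_row):
        raise ValueError("aligned rows must have equal length")
    score = 0
    in_gap = False
    for q, t in zip(query_row, target_row):
        if q == "-" or t == "-":
            score += GAP_EXTEND if in_gap else GAP_OPEN
            in_gap = True
        else:
            score += substitution(q, t)
            in_gap = False
    return score


def homologs_above_threshold(alignments, threshold):
    """Return names of targets whose alignment score reaches the threshold."""
    return [
        name
        for name, (query_row, target_row) in alignments.items()
        if alignment_score(query_row, target_row) >= threshold
    ]
```

With these toy values, an identical four-residue alignment scores 8, while the same alignment with a two-column gap scores -2 and would fall below a threshold of 5.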
Homology-transfer through alignments. Available experimental information for any of the targets can be transferred to the query (homology-transfer). For example, if one of the targets (database proteins) is experimentally known to be a transmembrane helical (TMH) protein, the homologous query is likely to also be a TMH protein. Moreover, if particular regions of a target protein are known to be TMHs, the regions in the query aligned to these regions are likely to also be TMHs. Of course, both inferences are subject to the accuracy of the alignment and the similarity between the two proteins.
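As a rough illustration of per-residue homology-transfer, the sketch below copies structural labels from an annotated target onto the query through a pairwise alignment. The label alphabet (T, I, O), the "?" placeholder for query residues aligned to a gap, and the function name are assumptions made for this example. The sequences in the usage note are made up.

```python
def transfer_labels(query_row, target_row, target_labels, unknown="?"):
    """Transfer per-residue labels from target to query through an alignment.

    query_row, target_row: gapped alignment rows of equal length ('-' = gap).
    target_labels: one label per ungapped target residue.
    Returns one label per ungapped query residue; query residues aligned
    to a gap in the target receive the `unknown` label.
    """
    if len(query_row) != len(target_row):
        raise ValueError("aligned rows must have equal length")
    if len(target_labels) != sum(1 for t in target_row if t != "-"):
        raise ValueError("need exactly one label per target residue")
    transferred = []
    target_index = 0
    for q, t in zip(query_row, target_row):
        if t != "-":
            label = target_labels[target_index]
            target_index += 1
        else:
            label = unknown
        if q != "-":
            transferred.append(label)
    return "".join(transferred)
```

For example, `transfer_labels("MK-LLVA", "MKALL-A", "OOTTTI")` returns `"OOTT?I"`: the query residue inserted relative to the target gets no label. As the text notes, the transferred labels are only as reliable as the alignment itself.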
As with all elements of living things, protein sequences originate from an evolutionary process of divergence and selection, creating a tree of proteins related in hierarchical fashion. Extending this idea to the homology search, a query protein can be compared to an entire family of related target proteins that are pre-aligned. Often, where a query might not have apparent similarity to any individual target protein in a family, it may have similarity to the target family taken as a whole. Essentially all advanced methods implement this idea.
Improved profile-based alignment methods. A well-known example of this extension is PSI-BLAST [1], which works as follows. First, the query is searched against a database of individual sequences using ordinary BLAST, resulting in a set of query-target alignments. Next, the query and set of target proteins are aligned to each other in a single multiple sequence alignment. The frequencies of each amino acid occurring in the columns of the multiple sequence alignment are calculated, resulting in a set of 20-element vectors, one for each position in the original query. This statistical representation, called a position specific score matrix (PSSM), can be seen as a substitution matrix custom-designed for each position in the query protein. In subsequent rounds, the PSSM, rather than the original query, is searched against the original database of individual sequences. For statistical reasons, conserved regions tend to be more influential in scoring subsequent alignments, allowing for improved detection of more diverged sequences.
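The column-frequency step can be sketched as follows. This is a simplified illustration, not PSI-BLAST's actual procedure: it assumes the first row of the alignment is the query, and it omits the sequence weighting and conversion to log-odds scores that a real PSSM involves. The optional pseudocount parameter and the function name are assumptions of this example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(msa_rows, pseudocount=0.0):
    """Return one 20-element frequency vector per query position.

    msa_rows: equal-length gapped sequences; msa_rows[0] is taken as the query.
    Columns where the query has a gap are skipped, so the result has one
    vector per query residue, ordered as in AMINO_ACIDS.
    """
    query = msa_rows[0]
    profile = []
    for col in range(len(query)):
        if query[col] == "-":
            continue
        counts = Counter(row[col] for row in msa_rows if row[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        if total == 0:
            # No standard residues observed: fall back to a uniform vector.
            profile.append([1.0 / len(AMINO_ACIDS)] * len(AMINO_ACIDS))
            continue
        profile.append(
            [(counts[aa] + pseudocount) / total for aa in AMINO_ACIDS]
        )
    return profile
```

For the toy alignment `["ACD", "ACE", "A-D"]`, the third vector assigns 2/3 to D and 1/3 to E, while the fully conserved first column assigns 1.0 to A.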
Like PSI-BLAST, Pfam [2] uses multiple sequence alignments. There are two differences, however. First, while PSI-BLAST iteratively re-queries a database of individual sequences with a PSSM, Pfam is the inverse: it is a database of protein families, and the individual query protein is aligned against each family in the database. Second, while PSI-BLAST uses PSSMs to represent a protein family multiple sequence alignment, Pfam uses a hidden Markov model (HMM). An HMM extends the idea of position specific substitution scores to include gap insertion and deletion scores that are also position-specific. These can be derived from the original multiple sequence alignment by observing how many aligned proteins contain insertions or deletions relative to the query protein at each position in the query. As in PSI-BLAST or BLAST, the query protein is aligned against each HMM in the Pfam database and assigned an e-value comparable to a BLAST e-value, representing the expected number of matches as good or better occurring by chance. Since HMM-based alignment methods are often more sensitive than BLAST or PSI-BLAST, they may succeed in finding a homologous family where those methods do not.
BLAST, PSI-BLAST and Pfam are very general methods capable of identifying sequence or family homologs of virtually any kind of protein, including specific kinds of membrane proteins. For integral membrane protein prediction, however, another generalization yields further improvement.
Two major classes of transmembrane proteins: TMB and TMH. Integral membrane proteins come in two general structural classes. Transmembrane alpha helical (TMH) proteins span the plasma membrane in one or more alpha helices in alternating direction (Fig. 2).
Specific prediction methods. Methods designed to predict TMH or TMB proteins in general are built on each class taken as a group. Because of the diversity in specific structure (different numbers of transmembrane helices or strands), it is impossible to derive a single multiple sequence alignment for such a class. Instead, these methods extract features in common to all TMH or all TMB proteins without the need for explicit multiple sequence alignment. Technically, this is achieved by assigning one of a set of discrete labels to each position in each sequence, based on its structure. For example, the set of labels T, I, and O can be assigned one per residue to each TMH sequence, identifying the transmembrane helices, inner, and outer loops. From the resulting set of labeled protein sequences, a general model (also often an HMM) can be derived that recognizes features common to all labeled protein sequences. Such general models are potentially able to detect TMH or TMB proteins that have diverged even further from any sequence homolog (perhaps an example of a previously undiscovered subfamily) than sequence-alignment based methods such as PSI-BLAST and Pfam can.
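To make the labeling concrete, the following sketch turns a per-residue label string over the alphabet T, I, O into a list of segments. The 1-based, inclusive coordinates and the function names are choices made for this example; individual prediction servers report their results in their own formats.

```python
from itertools import groupby


def label_segments(labels):
    """Collapse a per-residue label string into (label, start, end) runs.

    Coordinates are 1-based and inclusive.
    """
    segments = []
    position = 1
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, position, position + length - 1))
        position += length
    return segments


def transmembrane_helices(labels, tm_label="T"):
    """Return (start, end) for every run of the transmembrane label."""
    return [
        (start, end)
        for label, start, end in label_segments(labels)
        if label == tm_label
    ]
```

For the made-up label string `"IIITTTTOOOTTTII"`, `transmembrane_helices` returns `[(4, 7), (11, 13)]`, that is, two membrane-spanning segments separated by an outer loop.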
A different homology approach is exemplified in the PROSITE [3] and PRINTS [4] databases. They contain a set of local sequence patterns defined by strong association with a specific protein function or structure. Because protein function and structure can be modular, some of these patterns may be found within a collection of proteins differing in overall structure. Others are very well correlated with overall structure despite their sequence-local nature. For identifying TMH or TMB proteins, several patterns prove useful (Methods). Potentially, such patterns may be conserved in a protein whose overall sequence is so diverged from any homologs as to be unidentifiable by alignment-based methods.
In general, all methods relying on alignment of proteins work optimally in aligning proteins in a specific similarity range corresponding to the range of sequences from which they are derived. In a degenerate sense, BLAST can be thought of as searching a database of "models" consisting of individual sequences. It is optimized to find close-range homologs. PSI-BLAST and Pfam build statistical models from multiple alignments of very similar sequences, and they work best to find medium-range homologs. TMH and TMB-specific methods are single statistical models built from a diverse set of TMH or TMB proteins only related by broad structural category. Thus, they are optimized to find long-range homologs.
Optimal results of each of these methods will be obtained fortuitously when the query happens to have a single sequence homolog, homology to a sequence family, or homology to a structure family. It is impossible to know in advance which if any of these will be the case. Because of this, we recommend an opportunistic approach: run all prediction methods and select those giving highest confidence scores. We provide a guide for obtaining as much relevant information about your protein as possible, and some general principles for interpreting the information.
This guide is in three parts. Firstly, we describe how to obtain a quick, comprehensive set of homology based information and possible experimental information about your protein, and how to use it to identify whether it is an integral membrane protein. Secondly, we describe those methods suitable for screening an entire set of proteins for potential TMH or TMB proteins. Thirdly, we present the methods for predicting which residues in a known or suspected transmembrane protein are in the membrane, and the overall orientation in the membrane. For quick reference, we provide a list of selected programs (Table 1) and databases (Table 2).
| Method | Scope | Service | URL | Ref. |
| BLAST and PSI-BLAST | general | WP | www.ncbi.nlm.nih.gov/BLAST | [16, 1] |
| TMHMM | TMH | PR3, S | www.cbs.dtu.dk/services/TMHMM | [17] |
| PiMohtm | TMH | PR3 | www.predictprotein.org | [18] |
| Phobius | TMH | PR3, SP, S | phobius.cgb.ki.se | [19] |
| HMMTOP | TMH | PR3 | www.enzim.hu/hmmtop | [20] |
| MEMSAT | TMH | PR5 | bioinf.cs.ucl.ac.uk/psipred | [21] |
| Split4 | TMH | PR2 | split.pmfst.hr/split/4 | [22] |
| PRED-TMBB | TMB | PR3 | bioinformatics2.biol.uoa.gr/PRED-TMBB | [23, 24] |
| HMM-B2TMR | TMB | PR3 | gpcr.biocomp.unibo.it | [25] |
| PROFtmb | TMB | PR3, S | rostlab.org/services/proftmb | [26] |
| TMB-HUNT | TMB | S | www.bioinformatics.leeds.ac.uk | [27, 28] |
| BOMP | TMB | S | www.bioinfo.no/tools/bomp | [29] |
| SignalP | SP | SP, S | www.cbs.dtu.dk/services/SignalP | [30] |
| Pfam | domain | WP | www.sanger.ac.uk/Software/Pfam | [2] |
| Superfamily | domain | WP | supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY | [31] |
| Panther | domain | WP | www.pantherdb.org | [32] |
| SMART | domain | WP | smart.embl-heidelberg.de | [33, 34] |
| PROSITE | motif | WP | ca.expasy.org/prosite | [3] |
| PRINTS | motif | WP | umber.sbs.man.ac.uk/dbbrowser/PRINTS | [4] |
* Selected programs and databases for identification and per-residue prediction of integral membrane proteins. Scope. TMH, TMB: built on a representative collection of TMH or TMB proteins. motif: built on short sequence motifs associated with particular function or structure. domain: built on medium to long sequence regions of particular structure. Service. Per residue predictions PRn: all residues are assigned to one of a number of discrete structural states. PR2: (TM, non-TM). PR3: (TMB: TM-strand, extracellular loop, periplasmic loop; TMH: TM-helix, cytoplasmic loop, non-cytoplasmic loop). PR5: PR3, but distinguishing non-TM portions of helical overhang on both sides. SP: Signal peptide and cleavage site prediction. S: suitable for whole-proteome screening; these methods all allow multiple-sequence submission and have been evaluated for accuracy and coverage in whole protein discrimination. WP: whole protein prediction of individual proteins.
| Database | Common Name/ Description | URL | Ref. |
| GO | Gene Ontology | www.geneontology.org | [35] |
| PIR | Protein Information Resource | pir.georgetown.edu | [6] |
| PDB | Protein Data Bank | www.rcsb.org/pdb/Welcome.do | [36] |
| InterPro | Database of Protein Families, Domains and Functional Sites | www.ebi.ac.uk/interpro | [37] |
| SCOP | Structural Classification of Proteins | scop.mrc-lmb.cam.ac.uk/scop | [38, 39, 40] |
| InterProScan | Scanning of InterPro Database | www.ebi.ac.uk/InterProScan | [41, 42] |
| UniProt | Universal Protein Resource | www.pir.uniprot.org | [43] |
| OPM | Orientations of Proteins in Membranes | opm.phar.umich.edu | [44] |
| PDBTM | Protein Data Bank of Transmembrane Proteins | pdbtm.enzim.hu | [45, 46] |
| MPtopo | Membrane Protein Topology Database | blanco.biomol.uci.edu/mptopo | [47] |
Both TMH- or TMB-specific methods and general methods are available. The general methods are motif- and domain-based, and potentially identify the protein as one of a subtype of TMH or TMB proteins. TMH- or TMB-specific methods are designed to identify features common to all TMH (or all TMB) proteins, and do not identify subtypes. InterProScan is a portal that allows querying all of the general methods at once. UniProt provides a comprehensive view of previously analyzed results on many proteins and accompanying experimental information on structure or function.
TMB-specific methods. BOMP (β-barrel outer membrane protein predictor), TMB-HUNT and PROFtmb are specially designed to identify TMB proteins in a pool. They have all been evaluated for accuracy in discriminating TMBs from background. Unfortunately, a definitive comparison is complicated by the fact that the evaluations are all done on different data sets. It is recommended that you submit your query to all three and scrutinize the results. Taking a consensus of predictors has been found consistently to yield better accuracy than relying on one individual predictor.
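One simple way to combine the three predictors is a vote, sketched below. The sketch assumes that each server's output has already been reduced by hand to a yes/no call for each protein; the dictionary layout, the default of two agreeing votes and the function name are assumptions of this example, not part of any of the tools.

```python
def consensus_call(calls, min_agree=2):
    """Return True if at least `min_agree` predictors call the protein a TMB.

    calls: mapping of predictor name to a boolean call (True = predicted TMB).
    """
    if not calls:
        raise ValueError("need at least one predictor call")
    votes = sum(1 for is_tmb in calls.values() if is_tmb)
    return votes >= min_agree
```

For instance, `consensus_call({"BOMP": True, "TMB-HUNT": False, "PROFtmb": True})` returns `True`. Proteins on which the predictors disagree are the ones whose individual results deserve the closest scrutiny.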
TMH-specific methods. Of the six TMH-specific methods, only TMHMM has been rigorously evaluated for accuracy in discriminating TMH proteins from others. All methods implicitly predict whether a protein is a TMH protein by the presence of one or more predicted TM-helices. However, since the others have not been evaluated for accuracy in this task, we do not recommend using them to screen a pool for potential TMH proteins.
InterProScan. InterProScan submits your query to up to 13 individual predictors at once. Go to InterProScan, make sure all Applications to Run are selected, paste your sequence, and Submit. When results are returned, select Table View to see the individual scores associated with each hit. Individual scores are unfortunately not in any standard units. Though a thorough statistical comparison between different scoring systems has not been done, we will discuss this issue below.
UniProt. UniProt joins together all sequences from SWISS-PROT, TrEMBL [5] and PIR [6] (Protein Information Resource). Release 6.0 of September 2005 contains 2,299,834 sequences (see www.ebi.uniprot.org/support/docs/rel_notes/relnote6.0.html). Each protein is linked with a set of pre-run predictions and annotations in databases in an advanced searchable framework. Results of searches contain links to the original sources of prediction or annotation.
Analyzing the results. Looking at your InterProScan and UniProt results, your protein has hopefully matched homologs with associated structural or functional annotation in the databases PIR, PDB, InterPro, SCOP, and Gene Ontology (GO). Alignment-based models of interest are Pfam, SMART, Superfamily and TigrFAM. Finally, the functional motif databases are PROSITE and PRINTS.
The goal is to determine whether your query is indeed a TMH or TMB protein. However, it is not always easy to determine how closely these features are associated with integral membrane status. Because of the diversity of integral membrane proteins, there is no simple way to tell which annotations or features definitively indicate integral membrane status. Therefore, it will be necessary to do a careful reading of the descriptions, available through the links on the InterProScan and UniProt results tables. Below, we describe some of these in more detail. For PROSITE we present our own quick analysis of particular motifs closely associated with TMH or TMB proteins.
Pfam. As discussed above, Pfam is a database of HMMs, with each HMM built from an alignment of sequence-related proteins. As of December 2005, Pfam contains 8,183 families. 94% of all known protein sequences match at least one Pfam family. Its main purpose is to identify to what family or families a protein of unknown structure or function belongs. To search Pfam, one submits the protein of interest, and just as in BLAST, it is aligned to each HMM and significant alignments are reported with accompanying log2-odds (or "bits") and e-value scores. The Pfam e-value is comparable to a BLAST e-value. The bits score is the log2 (logarithm of base 2) of the odds of the sequence being a true match. For example, a bits score of 3 means the protein is 8 times more likely to be a true match than a false match (8/9, or 89% chance of being a true match).
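The arithmetic in this example can be written out as a short sketch. It follows the reading given in the text, in which the odds are treated directly as the ratio of true-match to false-match likelihood; the function names are hypothetical.

```python
def bits_to_odds(bits):
    """Convert a log2-odds ("bits") score to odds: odds = 2 ** bits."""
    return 2.0 ** bits


def bits_to_probability(bits):
    """Convert a bits score to odds / (1 + odds).

    Written as 1 / (1 + 2 ** -bits), which is the same quantity but avoids
    overflow for large positive scores.
    """
    return 1.0 / (1.0 + 2.0 ** (-bits))
```

A bits score of 3 gives odds of 8 and a probability of 8/9 (about 0.89), matching the example above; a score of 0 corresponds to even odds, a probability of 0.5.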
Pfam families are categorized as "family", "domain", "motif" and "repeat" and have an accompanying average length of alignment. Family and domain type families are most closely associated with structure. Families are organized into clans, one of which is "Outer membrane beta-barrel". There is no corresponding clan covering TMH associated Pfam models, but a keyword search "membrane" reveals many relevant families.
Using Pfam to determine integral membrane status is complicated by the possibility that TMH or TMB proteins may contain N- or C-terminal domains or extra-membrane loops that form domains also found in soluble proteins.
SMART. Like Pfam, SMART is a collection of HMMs built from seed alignments. SMART focuses exclusively on protein domains, and describes them as "extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues". A large fraction of SMART domains will overlap with Pfam.
Superfamily. Similar to Pfam and SMART, Superfamily is a database of HMMs built on multiply aligned protein sequences. While Pfam uses similar sequences, Superfamily groups together sequences which have no detectable homology but are structurally similar at the SCOP superfamily level of structural classification. Because these HMMs are built from more diverse groups of sequences, they may have greater power to detect TMH or TMB proteins evolutionarily more distant from known homologs than Pfam. Depending on whether your sequence has a close or remote homolog, you may see reliable hits to any or all of these HMM-based databases.
Gene Ontology (GO). According to their homepage, "The Gene Ontology project provides a controlled vocabulary to describe gene and gene product attributes in any organism". It is useful in computer searches in which the existing words typically used to describe a given function are imprecise or ambiguous. Using the Gene Ontology, the European Bioinformatics Institute in a project called GOA [7, 8] (Gene Ontology Annotation) manually assigns GO terms to existing proteins in UniProt based on the literature, thus allowing comprehensive searches for proteins by GO terms.
PROSITE. PROSITE is a database of usually short sequence patterns tightly associated with function or overall protein structure. The patterns are all evaluated for their predictive power of protein class. For example, PROSITE pattern [LIVMFYC]-{A}-[HY]-x-D-[LIVMFY]-[RSTAC]-{D}-x-N-[LIVMFYC](3), called "Tyrosine protein kinases specific active-site signature" detects 97.9% of all known protein tyrosine kinases at 95.4% accuracy. Since each PROSITE pattern is designed to identify specific elements of function or structure, it would be useful to know if any of these happen to correlate well with TMH or TMB proteins (Tables 3 and 4, and below).
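To show how such a pattern is read, the sketch below translates the common elements of PROSITE pattern syntax into a Python regular expression: `x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repetition, and `<` or `>` for N- or C-terminal anchors. It is a simplified converter written for this example, it does not cover every variant of the syntax (for instance an anchor placed inside brackets), and the test sequence in the usage note is made up.

```python
import re

# One pattern element: optional '<', a core, optional repetition, optional '>'.
_ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern string into a regular expression string."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unsupported pattern element: " + element)
        n_anchor, core, low, high, c_anchor = match.groups()
        if core == "x":
            core_regex = "."                      # any residue
        elif core.startswith("{"):
            core_regex = "[^" + core[1:-1] + "]"  # any residue except these
        else:
            core_regex = core                     # literal residue or [set]
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(
            ("^" if n_anchor else "") + core_regex + repeat + ("$" if c_anchor else "")
        )
    return "".join(parts)


TYR_KINASE_PATTERN = (
    "[LIVMFYC]-{A}-[HY]-x-D-[LIVMFY]-[RSTAC]-{D}-x-N-[LIVMFYC](3)"
)
```

Applied to the tyrosine kinase signature quoted above, the converter yields `[LIVMFYC][^A][HY].D[LIVMFY][RSTAC][^D].N[LIVMFYC]{3}`. `re.search` with this expression finds a match in the made-up fragment `LGHADLRAANLLL`, but not if the conserved D is replaced by E.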
PROSITE motifs specific for integral membrane proteins. Since PROSITE motifs are defined by association to specific function or structure which is often local, some motifs are found in more than one overall protein structure. Many other motifs are well correlated with overall structure, and are potentially useful in identifying integral membrane proteins.
To estimate how well each motif correlates with integral membrane protein structure, we did the following. We prepared lists of TMH and TMB proteins by querying UniProt as follows. For TMBs, we queried with GO: integral to membrane AND GO: outer membrane, returning 3,464 proteins. For TMH the query was GO: integral to membrane NOT GO: outer membrane, returning 309,360 proteins. Technically, we carried out these queries by parsing a file called gene_association.goa_uniprot from the GOA project at http://www.ebi.ac.uk/GOA/goaHelp.html since TMH proteins exceeded the download limit.
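The two queries amount to set operations on per-protein GO annotations, which can be sketched as follows. The sketch assumes a tab-separated gene association file in which the second column is the protein accession, the fourth the qualifier and the fifth the GO identifier, with comment lines starting with "!". The two GO identifiers are assumed to correspond to the terms named above and should be checked against the current ontology before use. Matching is on exact identifiers only, with no handling of child terms, and the function names are hypothetical.

```python
from collections import defaultdict

# Assumed identifiers for the two GO terms named in the text; verify before use.
INTEGRAL_TO_MEMBRANE = "GO:0016021"
OUTER_MEMBRANE = "GO:0019867"


def go_terms_by_protein(lines):
    """Collect the set of GO identifiers annotated to each protein accession."""
    terms = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and comment/header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not an annotation row
        accession, qualifier, go_id = fields[1], fields[3], fields[4]
        if "NOT" in qualifier.split("|"):
            continue  # negated annotation: the protein lacks this property
        terms[accession].add(go_id)
    return terms


def split_tmb_tmh(terms):
    """Return (TMB set, TMH set) following the two queries described above."""
    integral = {p for p, ids in terms.items() if INTEGRAL_TO_MEMBRANE in ids}
    outer = {p for p, ids in terms.items() if OUTER_MEMBRANE in ids}
    tmb = integral & outer   # integral to membrane AND outer membrane
    tmh = integral - outer   # integral to membrane NOT outer membrane
    return tmb, tmh
```

In practice the file would be read line by line, for example `with open(path) as handle: terms = go_terms_by_protein(handle)`, so that the full annotation file never has to be held in memory at once.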
With the lists of known TMH and TMB proteins, we counted, for each PROSITE motif, the number of TMH (or TMB) proteins containing the motif (true positives) and the number of non-TMH (or non-TMB) proteins containing the motif (false positives). We calculated the accuracy of a given motif in identifying TMH (TMB) proteins, selecting those patterns with a significant number of true positives at a given accuracy (Tables 3 and 4). This is the same procedure as that used for PROSITE pattern accuracies, but taken with respect to TMH or TMB proteins. Since our lists of TMH and TMB proteins will be incomplete due to missing GO annotation, the accuracies are necessarily lower bound estimates.
A similar analysis could be done for PRINTS, but we did not perform it. Since Pfam, SMART and Superfamily domains tend to match long (>100 residues) stretches of sequence, they tend to correlate well with overall protein structure. We advise simply reading the descriptions of each domain your protein may match.
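The accuracy calculation itself is TP / (TP + FP), as the following sketch shows. The selection thresholds are parameters of this example only; the cutoffs actually used to build the tables are not stated in the text. The function names are hypothetical.

```python
def minimum_accuracy(true_positives, false_positives):
    """Lower-bound accuracy of a motif: TP / (TP + FP)."""
    total = true_positives + false_positives
    if total == 0:
        raise ValueError("motif matches no proteins")
    return true_positives / total


def select_motifs(counts, min_true_positives, min_accuracy):
    """Keep motifs with enough true positives at or above a given accuracy.

    counts: mapping of motif accession to a (TP, FP) pair.
    Both thresholds are hypothetical parameters chosen by the caller.
    """
    return [
        accession
        for accession, (tp, fp) in counts.items()
        if tp >= min_true_positives and minimum_accuracy(tp, fp) >= min_accuracy
    ]
```

Using counts from Table 3, `minimum_accuracy(223, 97)` gives about 0.70 for the histidine kinase domain profile (PS50109), while `minimum_accuracy(303, 0)` gives 1.0 for the ABC transporter integral membrane type-1 domain profile (PS50928).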
Overall recommendation. Taking all information you can gather from both homology-based models and individual sequence homology to any proteins with useful annotation, decide which annotations you trust based on the respective E-values and other scores given by the search methods. Read the descriptions carefully, keeping in mind the possibility that integral membrane proteins contain motifs also found in soluble proteins. In general, matches to short regions of your protein are less reliable indicators of overall protein structure. This includes Pfam families of the "motif" and "repeat" types. Hopefully, this first step will give a very comprehensive view of what kind of domains or motifs your protein contains and what kind of protein it is likely to be.
| TP | FP | Minimum Accuracy | Accession | PROSITE Motif Description |
| 303 | 0 | 100% | PS50928 | ABC transporter integral membrane type-1 domain profile. |
| 212 | 0 | 100% | PS00236 | Neurotransmitter-gated ion-channels signature. |
| 209 | 0 | 100% | PS50261 | G-protein coupled receptors family 2 profile 2. |
| 209 | 0 | 100% | PS00077 | Heme-copper oxidase catalytic subunit, copper B binding region signature. |
| 1,273 | 5 | 100% | PS51003 | Cytochrome b/b6 C-terminal region profile. |
| 1,326 | 9 | 99% | PS51002 | Cytochrome b/b6 N-terminal region profile. |
| 265 | 2 | 99% | PS50999 | Cytochrome oxidase subunit II transmembrane region profile. |
| 2,136 | 20 | 99% | PS50262 | G-protein coupled receptors family 1 profile. |
| 209 | 2 | 99% | PS50855 | Cytochrome oxidase subunit I profile. |
| 250 | 4 | 98% | PS00232 | Cadherin domain signature. |
| 646 | 11 | 98% | PS50850 | Major facilitator superfamily (MFS) profile. |
| 253 | 5 | 98% | PS50268 | Cadherins domain profile. |
| 205 | 6 | 97% | PS00238 | Visual pigments (opsins) retinal binding site. |
| 334 | 11 | 97% | PS00154 | E1-E2 ATPases phosphorylation site. |
| 2,122 | 95 | 96% | PS00237 | G-protein coupled receptors family 1 signature. |
| 274 | 14 | 95% | PS50857 | Cytochrome oxidase subunit II copper A binding domain profile. |
| 271 | 14 | 95% | PS00078 | CO II and nitrous oxide reductase dinuclear copper centers signature. |
| 263 | 46 | 85% | PS00216 | Sugar transport proteins signature 1. |
| 256 | 45 | 85% | PS50929 | ABC transporter integral membrane type-1 fused domain profile. |
| 223 | 97 | 70% | PS50109 | Histidine kinase domain profile. |
| 245 | 113 | 68% | PS00217 | Sugar transport proteins signature 2. |
| 305 | 192 | 61% | PS50853 | Fibronectin type-III domain profile. |
| 993 | 754 | 57% | PS50835 | Ig-like domain profile. |
| 461 | 417 | 53% | PS01186 | EGF-like domain signature 2. |
| 209 | 204 | 51% | PS00109 | Tyrosine protein kinases specific active-site signature. |
| 292 | 289 | 50% | PS00290 | Immunoglobulins and major histocompatibility complex proteins signature. |
| 355 | 355 | 50% | PS50026 | EGF-like domain profile. |
| 418 | 441 | 49% | PS00022 | EGF-like domain signature 1. |
| 215 | 321 | 40% | PS00152 | ATP synthase alpha and beta subunits signature. |
| 337 | 522 | 39% | PS00142 | Neutral zinc metallopeptidases, zinc-binding region signature. |
| 317 | 1,301 | 20% | PS50893 | ATP-binding cassette, ABC transporter-type domain profile. |
| 333 | 1,412 | 19% | PS00211 | ABC transporters family signature. |
| 338 | 1,809 | 16% | PS50011 | Protein kinase domain profile. |
| 326 | 1,800 | 15% | PS00107 | Protein kinases ATP-binding region signature. |
* PROSITE motifs specific for TMH proteins. TP: True Positives; the number of TMH proteins containing the motif. FP: False Positives; the number of non-TMH proteins also containing the motif. Minimum Accuracy: TP / (TP + FP), a lower bound estimate of the probability that an unknown protein containing the motif in question is a TMH. It is a lower bound because some TMH proteins will be missing the appropriate GO terms, and will incorrectly be considered false positives in these lists (when in fact they should be true positives).
| TP | FP | Minimum Accuracy | Accession | PROSITE Motif Description |
| 39 | 0 | 100% | PS00576 | General diffusion Gram-negative porins signature. |