Design and Training of Profile-based HMM.
We use an HMM whose parameters are trained on a set
of labelled sequence profiles, and which accepts sequence profiles as
input for prediction. Training uses the Baum-Welch
(Expectation Maximization) algorithm, considering only valid paths when
calculating the expectations at each iteration. See the
Web supplement for the mathematical derivation. This idea was
originally proposed by Anders Krogh (reference below) and adapted to
profile-trained, profile-fed HMMs by Martelli et al.
Clustering and Profiles.  
We started with the 56 transmembrane beta barrel structures in the
PDB, clustering them at an HSSP distance of 3 to obtain
11 families, 3 of which were discarded as explained in the associated
PROFtmb
paper. Then, we used a representative of each of these 8 families
to build a PSI-BLAST profile.
Structure-based Labelling  
We labelled each amino acid position in the profile based on its
structural environment, recognizing individual latitudes along
the transmembrane strands, two abundant types of hairpins on
the periplasmic side (4- and 5-hairpins) plus a 'general' hairpin,
and extracellular loops. There were 75 labels in total, composed of
10 hairpin states, 1 extracellular loop state, 32 up-strand states,
and 32 down-strand states. The process is depicted in this
picture:
Baum-Welch Parameter Estimation
Having specified the model architecture, we used the Baum-Welch
Parameter Estimation procedure to train the model parameters.
During training, the expected number of times each parameter is
used to generate the training profile is calculated. The resulting
expectations are normalized over all emission parameters from
a given node, or all transition parameters from a given node.
The model parameters are then assigned these normalized expectations.
This cycle is iterated until the change in the total probability of
the profile between iterations falls below a given threshold.
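The normalization step described above can be sketched as follows. This is a minimal illustration, not the PROFtmb implementation: the function name, the array layout (one row per node), and the choice to leave unused nodes at zero are all assumptions made for the example.

```python
import numpy as np

def normalize_expectations(expected_trans, expected_emit):
    """Turn expected usage counts into probabilities (hypothetical helper).

    expected_trans[i, j]: expected number of times the i -> j transition is used.
    expected_emit[i, a]:  expected number of times node i emits symbol a.
    Each row is divided by its own total, so the transition (or emission)
    parameters leaving one node sum to 1. Rows with no expected usage are
    left as zeros (an assumption of this sketch).
    """
    def row_normalize(counts):
        counts = np.asarray(counts, dtype=float)
        totals = counts.sum(axis=1, keepdims=True)
        return np.divide(counts, totals,
                         out=np.zeros_like(counts), where=totals > 0)

    return row_normalize(expected_trans), row_normalize(expected_emit)
```

In a full training loop, the expectations would be recomputed from the re-estimated parameters and this step repeated until the profile probability stops changing.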
As mentioned above, only valid paths (paths through the architecture
consistent with the structure-based sequence labelling) are
used to calculate the expectations. For good introductions to
these procedures see:
A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition
Lawrence Rabiner
Proceedings of the IEEE vol. 77 no. 2 pp. 257-286
and
Bioinformatics: The Machine Learning Approach
Pierre Baldi and Søren Brunak
MIT Press, Feb. 1998
For the details specific to valid-path ('clamped') training,
see:
Hidden Markov Models for Labeled Sequences
Anders Krogh
Proceedings - International Conference on Pattern Recognition
vol. 2, 1994 pp. 140-144
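As a rough illustration of what restricting the calculation to valid paths means, the sketch below runs a scaled forward pass in which states inconsistent with the labelling at a position contribute zero probability. It is not the PROFtmb code. The function name and array layout are hypothetical, and scoring a profile column as the dot product of a state's emission vector with the column frequencies is an assumption of this example.

```python
import numpy as np

def clamped_forward(init, trans, emit, profile, allowed):
    """Log-probability summed over label-consistent paths only (sketch).

    init:    (N,)   initial state probabilities
    trans:   (N, N) transition probabilities, trans[i, j] = P(j | i)
    emit:    (N, A) per-state emission probabilities over A symbols
    profile: (L, A) profile columns (frequencies summing to 1)
    allowed: (L, N) boolean mask; allowed[t, i] is True when state i is
             consistent with the structure-based label at position t
    """
    init, trans, emit, profile = (np.asarray(x, dtype=float)
                                  for x in (init, trans, emit, profile))
    # Assumed profile emission: dot product of emission vector and column.
    col_prob = profile @ emit.T
    # Clamping: states that contradict the label get zero probability.
    col_prob = np.where(np.asarray(allowed, dtype=bool), col_prob, 0.0)

    alpha = init * col_prob[0]
    log_prob = 0.0
    for t in range(len(profile)):
        if t > 0:
            alpha = (alpha @ trans) * col_prob[t]
        scale = alpha.sum()
        if scale == 0.0:
            return -np.inf  # no valid path exists for this labelling
        log_prob += np.log(scale)
        alpha = alpha / scale  # rescale to avoid underflow on long profiles
    return log_prob
```

Passing a mask that is True everywhere recovers the ordinary sum over all paths, which makes the effect of clamping easy to compare on small examples.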
Per-residue 4-state Prediction.  
To use the trained model to predict the labelling of a new
sequence, we first generate the PSI-BLAST profile of that sequence,
then input it to the model. The prediction is achieved by
a process generally called decoding. In our case, decoding
is done in two steps. In the first step, we calculate
the probability that each position in the profile is in a
beta-strand state as the sum of the probabilities of all
64 individual beta-strand states. Then, given this sequence of
probabilities, we use the Viterbi algorithm combined with
the original transition probabilities to find the highest-probability
two-state path through the model. This path is what we use
as the per-residue two-state prediction. See this
Prediction showing true labelling.
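The two decoding steps can be sketched as below. This follows one possible reading of the description and is not taken from the PROFtmb source: it assumes the per-position state posteriors are already available, scores every strand state at a position with the summed strand probability and every other state with its complement, and then runs Viterbi over the full transition matrix. All names are hypothetical.

```python
import numpy as np

def two_step_decode(posterior, init, trans, is_strand):
    """Two-step decoding sketch.

    posterior: (L, N) posterior probability of each state at each position
    init:      (N,)   initial state probabilities
    trans:     (N, N) original transition probabilities
    is_strand: (N,)   True for the beta-strand states
    Returns the per-position strand probability and a boolean
    strand / non-strand label per position.
    """
    posterior = np.asarray(posterior, dtype=float)
    is_strand = np.asarray(is_strand, dtype=bool)

    # Step 1: strand probability = sum over all individual strand states.
    p_strand = np.clip(posterior[:, is_strand].sum(axis=1), 0.0, 1.0)

    # Step 2: Viterbi, scoring a state by p_strand or 1 - p_strand (assumed).
    score = np.where(is_strand[None, :], p_strand[:, None], 1.0 - p_strand[:, None])
    with np.errstate(divide="ignore"):  # log(0) = -inf is intended here
        log_init, log_trans, log_score = np.log(init), np.log(trans), np.log(score)

    length, n_states = score.shape
    best = log_init + log_score[0]
    back = np.zeros((length, n_states), dtype=int)
    for t in range(1, length):
        cand = best[:, None] + log_trans  # cand[i, j]: come from i, go to j
        back[t] = cand.argmax(axis=0)
        best = cand.max(axis=0) + log_score[t]

    # Trace back the best state path, then collapse it to two states.
    state = int(best.argmax())
    path = [state]
    for t in range(length - 1, 0, -1):
        state = int(back[t, state])
        path.append(state)
    path.reverse()
    return p_strand, is_strand[path]
```

Because the transition matrix constrains which states may follow one another, the resulting labels can differ from simply thresholding the strand probability at each position.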
Whole-protein Prediction.  
To detect TMBs in a database of proteins of unknown structure, we
start with the bits (log-odds) score, calculated as
log_2(P(s|M)/P(s|B)), where P(s|M) is the full sum-over-paths
probability of the model generating the sequence profile s, and P(s|B)
is the corresponding probability under the background model, a single
Markov state with emission probabilities equal to the database amino
acid composition. Since extreme bits scores seem to depend on protein
length, the background bits score distribution must be characterized
as a function of protein length. The Z-score adjustment we use is
that described by Richard Hughey and Anders Krogh in:
Hidden Markov models for sequence analysis:
extension and analysis of the basic method
Richard Hughey and Anders Krogh
Computer Applications in the Biosciences: CABIOS
vol. 12, no. 2, Apr. 1996, pp. 95-107
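The bits score defined above can be computed from log-probabilities as in the following sketch. The function names are hypothetical, and for simplicity the background probability is computed for a plain amino acid sequence rather than a profile, which is an assumption of this example.

```python
import math

def background_log_prob(sequence, composition):
    """Natural-log probability of a sequence under a single-state background.

    composition maps each amino acid letter to its database frequency.
    """
    return sum(math.log(composition[aa]) for aa in sequence)

def bits_score(log_p_model, log_p_background):
    """log_2(P(s|M) / P(s|B)), given both probabilities as natural logs."""
    return (log_p_model - log_p_background) / math.log(2)
```

Working with log-probabilities avoids numerical underflow, since the raw probability of a sequence several hundred residues long is far below the smallest representable float.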
In this procedure, the average length and bits score are calculated for
moving windows of 500 proteins, using a threshold of 2.0 standard
deviations as the cutoff for outlier removal. This yields a sequence
of means and standard deviations covering every protein length
encountered. See the z-score
calibration curve for PROFtmb.
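A simplified version of this length-dependent calibration is sketched below. It is not the Hughey and Krogh procedure itself: it assumes a single pass of outlier removal per window, a window that slides one protein at a time over the length-sorted list, and linear interpolation between windows. The function names are hypothetical.

```python
import numpy as np

def calibrate(lengths, scores, window=500, cutoff=2.0):
    """Build a table of (mean length, mean score, score SD) per window."""
    order = np.argsort(lengths, kind="stable")
    lengths = np.asarray(lengths, dtype=float)[order]
    scores = np.asarray(scores, dtype=float)[order]
    window = min(window, len(scores))

    rows = []
    for start in range(len(scores) - window + 1):
        win = scores[start:start + window]
        mean, sd = win.mean(), win.std()
        # One pass of outlier removal at `cutoff` standard deviations.
        kept = win[np.abs(win - mean) <= cutoff * sd]
        rows.append((lengths[start:start + window].mean(), kept.mean(), kept.std()))
    return np.array(rows)

def z_score(length, score, table):
    """Z-score of a bits score relative to background at this length."""
    mean = np.interp(length, table[:, 0], table[:, 1])
    sd = np.interp(length, table[:, 0], table[:, 2])
    return (score - mean) / sd
```

A protein is then judged by how many background standard deviations its bits score lies above the background mean for proteins of similar length.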
Using the z-score, we ran PROFtmb on a non-redundant database of
proteins with well-annotated subcellular location. This
dataset, called
SetROC, contained the following numbers of proteins:
- 13 Integral Outer Membrane
- 21 Peripheral Outer Membrane
- 106 Inner Membrane
- 197 Single Membrane
- 1455 Nonmembrane
The cluster plot using the PROFtmb Z-score is shown below:

Shown here is the original cluster plot of protein length
vs. Z-score. This plot reveals the overall shape of the background
distribution, as well as the fact that about half of the 13 TMBs
actually score moderately to very poorly.

Performance Evaluation.  
We evaluated 4-state per-residue performance using the jack-knife
(leave-one-out) procedure. To do this, a model is generated which
is trained on 7 of the 8 training profiles, and tested on the 8th.
This is repeated for each of the 8 profiles, and the results are
compiled together. We use Q2, MCC (Matthews correlation coefficient),
and Sov
(segment-overlap measure of prediction accuracy)
to evaluate the accuracy of this compiled set of results. Such a
set of jack-knifed predictions and the compiled results are shown
here.
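For reference, Q2 and MCC can be computed from pooled two-state labels as in this sketch. The function name is hypothetical, and returning an MCC of 0 when the denominator vanishes is a convention chosen for the example. Sov is omitted because its segment-based definition is more involved.

```python
import math

def q2_and_mcc(observed, predicted):
    """Q2 and Matthews correlation coefficient for two-state labels.

    observed, predicted: equal-length sequences of truthy (strand) or
    falsy (non-strand) values, e.g. pooled over all leave-one-out folds.
    """
    tp = tn = fp = fn = 0
    for obs, pred in zip(observed, predicted):
        if obs and pred:
            tp += 1
        elif not obs and not pred:
            tn += 1
        elif pred:
            fp += 1
        else:
            fn += 1

    q2 = (tp + tn) / (tp + tn + fp + fn)  # fraction of residues correct
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return q2, mcc
```

Q2 alone can look good when one class dominates, which is why it is reported together with MCC.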
Whole-protein discrimination was evaluated on a few different
datasets as described in the accompanying paper.
In each test, ROCn curves were calculated using bits scores as the
cutoff criterion. The positives were well-annotated
TMBs (none of which had significant homology to the 8 proteins
in the training set), and the negatives were well-annotated non-TMBs.
Annotations were based on SWISS-PROT keywords or on manual
inspection of the SUBCELLULAR LOCATION field. As can be seen in
the figure, the estimate of coverage vs. accuracy varies widely
depending on which evaluation set is used.
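A ROCn value summarizes such a curve in one number: proteins are ranked by score, and the true positives found before each of the first n false positives are counted. The sketch below uses that standard definition. The function name and the handling of lists with fewer than n negatives are assumptions of this example, not details taken from the accompanying paper.

```python
def roc_n(scores, is_positive, n):
    """ROCn: normalized area under the ROC curve up to n false positives."""
    ranked = sorted(zip(scores, is_positive), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(1 for _, pos in ranked if pos)
    if total_pos == 0 or n <= 0:
        raise ValueError("need at least one positive and n >= 1")

    tp = fp = area = 0
    for _, pos in ranked:
        if pos:
            tp += 1
        else:
            fp += 1
            area += tp  # positives ranked above this false positive
            if fp == n:
                break
    # Assumption: if fewer than n negatives exist, the missing ones are
    # treated as ranked below every positive.
    area += (n - fp) * tp
    return area / (n * total_pos)
```

A value of 1.0 means every positive outranks the first n false positives, and a value of 0.0 means none does.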