EMBL, 69012 Heidelberg, Germany, rost@embl-heidelberg.de
EBI, Hinxton Hall, Hinxton, Cambridge CB10 1RQ, England
Today, we have a detailed and ever-widening knowledge of the evolution of DNA sequences, but what do we really know about the evolution of protein structure? Until recently, the answer was: not much. The first detailed structures were determined 26 years ago 1, 2; 13 years ago, the database of atomic-resolution protein structures contained just 312 structures (PDB 3). Since then, due to advances in determination methods, the PDB has grown exponentially; presently it holds over 4000 entries. With this size, we can just begin to analyse the evolution of protein structure. Here, we report an analysis of all pairs of proteins in the PDB which have similar three-dimensional (3D) structures 4. For each pair, we aligned the 3D structures, and measured the sequence identity (pairwise identical residues) in the aligned regions. The resulting distribution of pair identity scores shows one prominent and unexpected feature: most pairs cluster in an approximately Gaussian peak centred at 8-9% sequence identity. The distribution is surprisingly similar to that expected for 'random' pairs of completely unrelated sequences. This result has implications for our understanding of protein folding, and of the effect of convergent (different ancestor) and divergent (same ancestor) evolution on protein structure.
It has long been accepted that each protein sequence folds into a unique 3D structure, and that proteins with similar sequences have similar structures. Indeed, for any two sequences with more than 30 out of 100 pairwise identical residues, it can be safely predicted that they will have similar structures 5. This high robustness of structures with respect to residue exchanges partly explains the robustness of organisms with respect to gene-replication errors, and it allows much scope for variety in evolution. Protein pairs below 25-30% pairwise sequence identity often have similar structures, but this cannot be reliably predicted from sequence. The surprising lesson of recent years has been that similar folds can arise even from sequences with as little as 5% sequence identity, which is the level expected for two randomly related sequences (Fig. 1A). In the present work, we sought to establish the distribution of sequence similarity in similar folds.
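The 5% level quoted for randomly related sequences can be illustrated with a back-of-the-envelope calculation. The sketch below rests on an assumption that is not part of the analysis itself: residues at aligned positions are drawn independently from a fixed amino-acid composition, so the expected identity is the sum of squared residue frequencies. The function name and the uniform composition are hypothetical choices for illustration.

```python
# Expected percent identity between two unrelated sequences, assuming
# (illustrative assumption) independent draws from one residue composition.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def expected_random_identity(freqs):
    """Return the expected percent identity for residue frequencies `freqs`."""
    total = sum(freqs.values())
    if abs(total - 1.0) > 1e-6:
        raise ValueError("residue frequencies must sum to 1")
    # Probability that two independent draws give the same residue type.
    return 100.0 * sum(p * p for p in freqs.values())

# Hypothetical uniform composition: each of the 20 residue types at 1/20.
uniform = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
print(round(expected_random_identity(uniform), 2))
```

With a uniform composition the expectation is 1/20, i.e. 5%. Any non-uniform composition gives a somewhat higher value, because the sum of squares is smallest when all frequencies are equal.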
Our analysis is based on a subset of 125 unique structures (i.e. distinctly different folds) in the current PDB 3. For each of the 125 structures, we searched for all similar structures in (1) the largest subset of PDB containing only proteins with less than 25% pairwise identical residues (to avoid bias from highly populated folds); and (2) the subset of the PDB containing only proteins with more than 25% sequence identity to the target structure. The FSSP database 4 was used to define structural similarity, and sequence identity based on structural alignments. The final set comprised over 1000 pairs of similar structure.
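The identity measure used here, the percentage of pairwise identical residues within the aligned regions, can be sketched as follows. This is a minimal illustration and not the FSSP procedure: it assumes the structural alignment has already been computed and is supplied as two equal-length gapped strings, and the function name and gap characters are hypothetical.

```python
# Percent sequence identity over aligned positions only.
# Assumption: the (structural) alignment is given as two gapped strings.
GAP_CHARS = "-."

def percent_identity(aligned_a, aligned_b):
    """Return percent identical residues among columns aligned in both sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    aligned = 0
    identical = 0
    for res_a, res_b in zip(aligned_a.upper(), aligned_b.upper()):
        # Columns with a gap in either sequence lie outside the aligned region.
        if res_a in GAP_CHARS or res_b in GAP_CHARS:
            continue
        aligned += 1
        if res_a == res_b:
            identical += 1
    if aligned == 0:
        raise ValueError("no aligned residue pairs")
    return 100.0 * identical / aligned

# Toy example: 5 aligned columns, 4 of them identical.
print(percent_identity("ACDE-F", "ACDQGF"))
```

Normalising by the number of aligned columns, rather than by the full sequence length, keeps the score comparable between pairs whose structurally equivalent regions differ in size.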
The complete distribution of sequence identity scores (Fig. 1C) has three distinct regions: (1) a large, approximately Gaussian, peak centred at 8%; (2) many smaller, sharp peaks between 15 and 95%; and (3) a large peak near 100%. This last peak arises largely from the use of engineered mutants to facilitate structure determination. But how are we to explain the first two regions?

A priori, we might suppose that divergent evolution of sequences from the same ancestor would give rise to a distribution of sequence homology scores with a peak value, D, at some probably low value (e.g. D < 30%), and a smooth, relatively flat tail for high values. Then the small, sharp peaks we see between 30 and 95% would be explained as 'incoherent noise' peaks arising from uneven sampling and the still relatively small size of the current PDB. In the case of convergent evolution, where two unrelated sequences evolve the same structure, we would expect to see a sharp Gaussian distribution with a peak value, C, at very low identity, e.g. C < 10%.
Surprisingly, we see only one peak; it occurs at very low average identity (8%) and is remarkably symmetric (Fig. 1A). The peak is also very similar to the distribution of sequence identity for pairs of protein sequences chosen at random from the database, which has a peak value, R, at about 5% (Fig. 1A). Since measuring sequence identity ignores the physico-chemical properties of amino acids, we repeated the analysis using sequence similarity (captured by the McLachlan similarity metric 6). As with sequence identity, the average residue similarity between remote homologues (35%) is close to the random average (31%, Fig. 1B). One possible explanation is that pairs arising from divergent evolution are somehow under-represented in this range. However, the most obvious interpretation seems to be that both divergent and convergent evolution give rise to Gaussian distributions which peak at similar values, say, C = 8% and D = 10%, and that what we observe here is a superposition of both distributions (with an observed average, O, between C and D). The dips in the Gaussian curves for the remote homologues (around 10% for identity, Fig. 1A; for similarity, Fig. 1B) may indicate the separation of the two events. Convergent evolution would then constitute the dominant effect we observe.
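The superposition argument can be checked numerically: two Gaussians whose means are close relative to their width add up to a single peak. The sketch below is illustrative only. The means C = 8% and D = 10% are the example values from the text, while the common standard deviation of 3%, the equal weights and the function names are hypothetical assumptions, not fitted values.

```python
import math

def gaussian(x, mean, sd):
    """Normal probability density at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def count_modes(mean_c, mean_d, sd, weight_c=0.5, lo=0.0, hi=30.0, steps=3000):
    """Count local maxima of a two-Gaussian mixture evaluated on a regular grid."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    ys = [weight_c * gaussian(x, mean_c, sd)
          + (1.0 - weight_c) * gaussian(x, mean_d, sd) for x in xs]
    # A grid point is a mode if it is strictly higher than both neighbours.
    return sum(1 for i in range(1, steps) if ys[i - 1] < ys[i] > ys[i + 1])

# Assumed width of 3 percentage points (hypothetical, for illustration).
print(count_modes(8.0, 10.0, 3.0))   # close means, C = 8 and D = 10
print(count_modes(8.0, 20.0, 3.0))   # well-separated means, for contrast
```

For equal weights and equal widths, the mixture has a single peak whenever the means differ by no more than two standard deviations. Under these assumptions, distributions peaking at 8% and 10% would therefore merge into one peak lying between C and D.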
Above 30% sequence identity, the observed distribution may be biased by the way in which sequences are chosen for experimental determination of protein structure. However, the distribution below 25% sequence identity should be largely unbiased. Thus, we draw the following conclusions from this analysis. (1) Most pairs of similar structures have sequence identity as low as expected from randomly related sequences. This does not imply that sequence changes were random but that to us, as observers of the effects of evolutionary history, the sequence variations look random. (2) On average only three to four percent of all residues (O - R) are 'anchor' residues (residues crucial for maintaining the structure). (3) Most structural homologues have less than 15% pairwise sequence identity, which implies that the rate of creation of new structures is much slower than the drift towards the mean (D). Furthermore, the symmetric shape of the distribution at low sequence identity suggests that for most structures, four billion years of evolution was sufficient to reach an equilibrium between these two processes. (4) Naïvely, we might have assumed that the level of pairwise sequence identity for remote homologues could be used to distinguish between convergent and divergent evolution. However, our results suggest that the mean identities for convergent and divergent evolution (C and D) are quite close, and hence, in most cases it is difficult to distinguish between the two effects. (5) The low value of the average pairwise sequence identity (O) was surprising to us and other biologists. Clearly, then, this distribution is an important lesson in advancing our understanding of the evolution of protein structures.



