Midnight zone of protein structure evolution

OR: Protein structures evolve at random - almost

Burkhard Rost, Sean I. O'Donoghue, and Chris Sander

EMBL, 69012 Heidelberg, Germany, rost@embl-heidelberg.de

EBI, Hinxton Hall, Hinxton, Cambridge CB10 1RQ, England

Dear Madam, Dear Sir

Today, we have a detailed and ever-widening knowledge of the evolution of DNA sequences, but what do we really know about the evolution of protein structure? Until recently, the answer was: not much. The first detailed structures were determined 26 years ago 1, 2; 13 years ago, the database of atomic-resolution protein structures contained just 312 structures (PDB 3). Since then, due to advances in determination methods, the PDB has grown exponentially; presently it holds over 4000 entries. With this size, we can just begin to analyse the evolution of protein structure. Here, we report an analysis of all pairs of proteins in the PDB which have similar three-dimensional (3D) structures 4. For each pair, we aligned the 3D structures, and measured the sequence identity (pairwise identical residues) in the aligned regions. The resulting distribution of pair identity scores shows one prominent and unexpected feature: most pairs cluster in an approximately Gaussian peak centred at 8-9% sequence identity. The distribution is surprisingly similar to that expected for 'random' pairs of completely unrelated sequences. This result has implications for our understanding of protein folding, and of the effect of convergent (different ancestor) and divergent (same ancestor) evolution on protein structure.

It has long been accepted that each protein sequence folds into a unique 3D structure, and that proteins with similar sequences have similar structures. Indeed, for any two sequences with more than 30 out of 100 pairwise identical residues, it can be safely predicted that they will have similar structures 5. This high robustness of structures with respect to residue exchanges partly explains the robustness of organisms with respect to gene-replication errors, and it allows much scope for variety in evolution. Protein pairs below 25-30% pairwise sequence identity often have similar structures; however, this cannot be reliably predicted from sequence alone. The surprising lesson of recent years has been that similar folds can arise even from sequences with as little as 5% sequence identity, which is the level expected for two randomly related sequences (Fig. 1A). In the present work, we sought to establish the distribution of sequence similarity in similar folds.

Our analysis is based on a subset of 125 unique structures (i.e. distinctly different folds) in the current PDB 3. For each of the 125 structures, we searched for all similar structures in (1) the largest subset of PDB containing only proteins with less than 25% pairwise identical residues (to avoid bias from highly populated folds); and (2) the subset of the PDB containing only proteins with more than 25% sequence identity to the target structure. The FSSP database 4 was used to define structural similarity and to derive sequence identity from structural alignments. The final set comprised over 1000 pairs of similar structures.
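
As an illustration of the quantity being measured, the following is a minimal sketch of how pairwise sequence identity over the aligned regions of a structural alignment might be computed. The function name `percent_identity`, the gap character `-`, and the choice to normalise by the number of positions aligned in both sequences are assumptions made for this sketch; the exact convention used by FSSP is not stated here.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of identical residues over positions aligned in both sequences.

    Assumption for this sketch: positions where either sequence has a gap
    are outside the aligned region and are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    aligned = 0
    identical = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # not part of the structurally aligned region
        aligned += 1
        if res_a == res_b:
            identical += 1
    if aligned == 0:
        return 0.0
    return 100.0 * identical / aligned


# Hypothetical toy alignment: 5 aligned positions, 4 of them identical.
score = percent_identity("ACDE-FG", "ACQEKF-")
```

In this toy case four of the five aligned positions match, so the score is 80%; the pairs discussed in this letter mostly fall far below that.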

The complete distribution of sequence identity scores (Fig. 1C) has three distinct regions: (1) a large, approximately Gaussian, peak centred at 8%; (2) many smaller, sharp peaks between 15 and 95%; and (3) a large peak near 100%. This last peak arises largely from the use of engineered mutants to facilitate structure determination. But how are we to explain the first two regions?


Fig. 1



A Distribution of pairwise sequence identity in remote homologues and in random alignments (¥). The average sequence identity of all remote homologues was about 8.3% (s = 15%); the hypothetical averages for convergent and divergent evolution are marked by C and D. The dashed line is a Gaussian envelope fitted to the observed distribution. The average sequence identity of random alignments was about 5.6% (s = 7%; R).
B Distribution of pairwise sequence similarity in remote homologues and in random alignments (¥). The average similarity was 35.3% (s = 11%) for the remote homologues and 31.1% (s = 7.3%) for the random alignments. The alignments were weighted by the McLachlan metric 6, i.e., non-identical amino acids with similar physico-chemical properties were treated as similar. The percentage similarity refers to normalising weighted similarity scores by the maximal score possible rather than to the percentage of residues for which the similarity metric yields a value above a certain cut-off.
C Distribution of sequence identity of all structural homologous pairs used in this analysis (cut at 98% to avoid biasing from engineered mutants). For these figures, sequence identity scores were taken from structural alignments from the FSSP database. Random 'alignments' (¥) were generated by randomly selecting more than 50,000 pairs of proteins from the same data set as the remote homologues, and peak values were scaled to match those of the distributions of structural homologues.
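
To make the random baseline concrete, here is a rough sketch of two ways such a baseline might be estimated. Both are assumptions for illustration rather than the procedure used for Fig. 1: `expected_random_identity` computes the chance that two residues drawn from the pooled amino-acid composition are identical, and `random_pair_identities` compares randomly chosen sequence pairs position by position without gaps. The function names and the ungapped comparison are hypothetical.

```python
import random
from collections import Counter


def expected_random_identity(sequences):
    """Expected % identity of two random residues: 100 * sum of p_i squared,
    where p_i is the pooled frequency of amino acid i."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return 100.0 * sum((n / total) ** 2 for n in counts.values())


def random_pair_identities(sequences, n_pairs, seed=0):
    """Sample random pairs of distinct entries and score identity over an
    ungapped, start-to-start comparison (an assumption of this sketch)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_pairs):
        seq_a, seq_b = rng.sample(sequences, 2)
        compared = min(len(seq_a), len(seq_b))
        identical = sum(1 for x, y in zip(seq_a, seq_b) if x == y)
        scores.append(100.0 * identical / compared)
    return scores


# With all 20 amino acids equally frequent, the expectation is 5%.
uniform = expected_random_identity(["ACDEFGHIKLMNPQRSTVWY"])
```

For a uniform composition the expectation is 5%, the level quoted in the text for randomly related sequences; uneven compositions raise it somewhat.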


A priori, we might suppose that divergent evolution of sequences from the same ancestor would give rise to a distribution of sequence homology scores with a peak value, D, at some probably low value (e.g. D < 30%), and a smooth, relatively flat tail for high values. Then the small, sharp peaks we see between 30 and 95% would be explained as 'incoherent noise' peaks arising from uneven sampling and the still relatively small size of the current PDB. In the case of convergent evolution, where two unrelated sequences evolve the same structure, we would expect to see a sharp Gaussian distribution with a peak value, C, at very low identity, e.g. C < 10%.

Surprisingly, we see only one peak; it occurs at very low average identity (8%) and is remarkably symmetric (Fig. 1A). The peak is also very similar to the distribution of sequence identity for pairs of protein sequences chosen at random from the database, which has a peak value, R, at about 5% (Fig. 1A). Since measuring sequence identity ignores the physico-chemical properties of amino acids, we repeated the analysis using sequence similarity (captured by the McLachlan similarity metric 6). As with sequence identity, the average residue similarity between remote homologues (35%) is also close to the random average (31%, Fig. 1B). One possible explanation is that pairs arising from divergent evolution are somehow under-represented in this range. However, the most obvious interpretation seems to be that both divergent and convergent evolution give rise to Gaussian distributions which peak at similar values, say, C = 8% and D = 10%, and that what we observe here is a superposition of both distributions (with an average O between C and D). The dips in the Gaussian curves for the remote homologues (around 10% for identity, Fig. 1A; for similarity, see Fig. 1B) may indicate the separation of the two events. Convergent evolution would then constitute the dominant effect we observe.
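
The superposition argument can be illustrated numerically. The sketch below sums two Gaussians centred at the example values C = 8% and D = 10% and counts the local maxima of the result. The common width `sd`, the mixing weight `w`, and the function names are hypothetical choices for this sketch; the letter does not give widths or weights for the two components.

```python
import math


def gaussian(x, mean, sd):
    """Normal probability density."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def mixture(x, w, mean_c, mean_d, sd):
    """Weighted sum of a 'convergent' and a 'divergent' Gaussian."""
    return w * gaussian(x, mean_c, sd) + (1 - w) * gaussian(x, mean_d, sd)


def count_peaks(w, mean_c, mean_d, sd, lo=0.0, hi=30.0, steps=3000):
    """Count strict local maxima of the mixture on an evenly spaced grid."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    ys = [mixture(x, w, mean_c, mean_d, sd) for x in xs]
    return sum(1 for i in range(1, steps) if ys[i - 1] < ys[i] > ys[i + 1])


# Hypothetical widths: broad components merge, narrow ones stay separate.
broad = count_peaks(0.5, 8.0, 10.0, sd=3.0)
narrow = count_peaks(0.5, 8.0, 10.0, sd=0.5)
```

With equal weights and equal widths, two Gaussians merge into a single peak whenever their means differ by no more than twice the common standard deviation, so components at 8% and 10% would be indistinguishable by eye unless each were narrower than about one percentage point.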

Above 30% sequence identity, the observed distribution may be biased by the way in which sequences are chosen for experimental determination of protein structure. However, the distribution below 25% sequence identity should be largely unbiased. Thus, we draw the following conclusions from this analysis. (1) Most pairs of similar structures have sequence identity as low as expected from randomly related sequences. This does not imply that sequence changes were random but that to us - as observers of the effects of evolutionary history - the sequence variations look random. (2) On average only three to four percent of all residues (O - R) are 'anchor' residues (residues crucial for maintaining the structure). (3) Since most structural homologues have less than 15% pairwise sequence identity, this implies that the rate of creation of new structures is much slower than the drift towards the mean (D). Furthermore, the symmetric shape of the distribution at low sequence identity suggests that for most structures, four billion years of evolution was sufficient to reach an equilibrium between these two processes. (4) Naïvely, we might have assumed that the level of pairwise sequence identity for remote homologues could be used to distinguish between convergent and divergent evolution. However, our results suggest that the mean identities for convergent and divergent evolution (C and D) are quite close, and hence, in most cases it is difficult to distinguish between the two effects. (5) The low value of the average pairwise sequence identity (O) was surprising to us and other biologists. Clearly, then, this distribution is an important lesson in advancing our understanding of the evolution of protein structures.
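
The arithmetic behind conclusion (2) can be written out as a short sketch. The helper name `excess_identity` is hypothetical; the input values are the averages quoted in the Fig. 1A caption and the rounded peak values quoted in the text, and reading the difference as the fraction of 'anchor' residues follows the interpretation given above.

```python
def excess_identity(observed_mean, random_mean):
    """Identity (in percentage points) observed beyond the random baseline, O - R."""
    return observed_mean - random_mean


# Averages quoted in the Fig. 1A caption (8.3% observed, 5.6% random).
caption_estimate = excess_identity(8.3, 5.6)

# Rounded peak values quoted in the text (8-9% observed, about 5% random).
text_range = (excess_identity(8.0, 5.0), excess_identity(9.0, 5.0))
```

The rounded peak values give the three to four percentage points stated in conclusion (2), while the caption averages give a slightly lower figure of about 2.7; either way the excess over random is only a few residues per hundred.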


References




Further material

fig 2


fig 3


fig 4


fig 5