
Title: Novel leverage of structural genomics
Author: Jinfeng Liu, Gaetano T Montelione, Burkhard Rost
Quote: Nature Biotechnology, 2007, 25:847-851

Novel leverage of structural genomics

Jinfeng Liu 1,3,*, Gaetano T Montelione 4,5 and Burkhard Rost 1,2,3

1 Dept. of Biochemistry and Molecular Biophysics, Columbia University, 630 West 168th Street, New York, NY 10032, USA
2 Columbia University Center for Computational Biology and Bioinformatics (C2B2), 1130 St. Nicholas Avenue Rm 802, New York, NY 10032, USA
3 North East Structural Genomics Consortium (NESG), Department of Biochemistry and Molecular Biophysics, Columbia University, 630 West 168th Street, New York, NY 10032, USA
4 Center for Advanced Biotechnology and Medicine (CABM), Rutgers University, and Department of Biochemistry, Robert Wood Johnson Medical School, Piscataway, New Jersey, USA
5 Northeast Structural Genomics Consortium (NESG), Department of Molecular Biology and Biochemistry, Rutgers University, Piscataway, New Jersey, USA
* Corresponding author: liu@rostlab.org URL http://www.rostlab.org/  Tel: +1-212-851-4669, fax: +1-212-305-7932

 

 

 

Genome sequencing has transformed biomedical research, but it is only the first step toward understanding biological systems. Subsequent steps include the determination of three-dimensional (3D) structures and functions for all proteins, and the study of networks. One long-term goal of Structural Genomics (SG) is to make 3D atomic-level structures easily obtainable for most proteins from their corresponding DNA sequences. In particular, the Protein Structure Initiative (PSI) from the National Institutes of Health (NIH) in the USA is expanding the impact of the Human Genome Project [1] by large-scale structure determination [2] . As part of the five-year pilot phase (PSI1), more than 1,200 protein structures were deposited into the PDB [3] . The second phase of PSI (PSI2), initiated July 1, 2005, supports four Large-scale Research Centers, along with six additional Specialized Research Centers responsible for development of new technologies for structural biology and structural genomics. Structural genomics projects have also been established in Europe [4] , Japan [5] and Canada [2] .

Six years of SG data provide the background needed to establish means for measuring success and for guiding future efforts. SG impacts biology in two important ways. The first is through pioneering high-throughput technologies and making available extensive structural biology, protein production, and biophysical data along with the corresponding biochemical reagents. The second is through the final product, i.e. the experimentally determined 3D structures that create insights into evolution and function. The simplest measure of success would be the number of experimentally determined structures. However, this measure poorly reflects the overall goal of increasing the impact of structural knowledge; e.g., many structures for very similar proteins do not increase the structural coverage. A recent analysis [6] suggested that about half of the novel structures in the PDB are now contributed by SG and that SG generates novel structures at lower cost than non-SG structural biology, whereas structures from non-SG groups are cited more often than those from SG centers.

Such comparisons should be viewed in light of an important difference between two overlapping approaches to structure determination. Traditional structural biology often focuses on subtle structural differences between proteins that affect their functions. These structural data are an essential and valuable component of biological research, often providing tests of specific biological hypotheses. Structural Genomics, in contrast, is usually not driven by specific hypotheses. Rather, it starts from the broad hypothesis that structural data can provide critical clues to biochemical function and/or evolutionary processes. The resulting structures often suggest specific, testable biological hypotheses. In this sense, structural genomics is a "hypothesis generating" activity.

Here, we focused on measuring one direct impact of experimental structures, namely the potential to provide structural coverage for all known proteins and for specific proteomes. Optimal structural coverage requires experimental and computational efforts to complement each other in the sense that the experimental structures will be used as seeds to build comparative models for many homologues [7, 8]. PSI centers report such a "modeling leverage" value annually, corresponding to the number of reliable comparative models that can be built from each experimental structure. The precise value depends on a variety of factors, including the specific programs and parameters used to build the models, the model assessment protocol and criteria, and the target protein database.

Modeling leverage can be measured in two ways. The first and more reliable approach is to actually build and assess homology models [8]. The downsides are that there are currently no broadly accepted standards for modeling and that building models may be too CPU-intensive for large-scale analyses. The second approach is sequence-based, using a calibrated degree of sequence similarity to identify those proteins that can potentially be modeled reliably. Here, we defined the leverage value for a structure as the number of proteins or residues that can be aligned with the query structure under a certain threshold in a specified version of a comprehensive sequence database (Supplementary Methods). We adopted the sequence-based approach through PSI-BLAST [9] because it is both easily reproducible and very fast. Furthermore, this definition can be used to steer target selection toward optimizing novel leverage because it requires only sequence information. The leverage value also depends on the database used to identify models; we opted for UniProt [10], the most widely used comprehensive protein database. The leverage for a specific protein structure becomes larger as the protein database expands. Although this reflects the genuine expanded value of the experimental structure, leverage values have to be compiled based on a particular release of the database to be comparable.
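The sequence-based definition can be illustrated with a minimal sketch. It is not the procedure of the Supplementary Methods: the `Hit` record, its field names, and the toy hits are assumptions made for illustration, and only the E-value cutoff of 1e-10 is taken from the Fig. 1 caption. Hits are assumed to have been produced already by a search such as PSI-BLAST.

```python
from dataclasses import dataclass

E_VALUE_CUTOFF = 1e-10  # cutoff quoted in the Fig. 1 caption


@dataclass(frozen=True)
class Hit:
    """One alignment between a query structure and a database protein (hypothetical record)."""
    protein_id: str
    start: int       # first aligned residue in the database protein (1-based, inclusive)
    end: int         # last aligned residue (inclusive)
    e_value: float


def leverage(hits, cutoff=E_VALUE_CUTOFF):
    """Return (protein_count, residue_count) covered by hits below the cutoff."""
    covered = {}  # protein_id -> set of aligned residue positions
    for hit in hits:
        if hit.e_value < cutoff:
            covered.setdefault(hit.protein_id, set()).update(range(hit.start, hit.end + 1))
    return len(covered), sum(len(positions) for positions in covered.values())


# Toy data, invented for illustration only.
toy_hits = [
    Hit("P1", 1, 100, 1e-30),
    Hit("P1", 51, 150, 1e-20),  # overlaps the first hit; residues are counted once
    Hit("P2", 10, 59, 1e-12),
    Hit("P3", 1, 80, 1e-5),     # fails the cutoff and is ignored
]
proteins, residues = leverage(toy_hits)
print(proteins, residues)
```

Storing aligned positions as sets keeps overlapping alignments to the same protein from being counted twice, which matters for the per-residue value.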

In compiling modeling leverage, it is critical to consider the new structural information provided by a specific experimental structure (Fig. 1). We introduced the concept of novel leverage to measure the addition of novel structural knowledge. To compile this measure, we generated a database with all the models that can be built reliably on day 0. When a new experimental structure is deposited on day 1, we use it to build all possible reliable models and subtract all those that were already available on day 0. The remainder is the novel leverage. Another problem is that of the biological unit for which to report leverage. The PDB contains structures of protein fragments that often, but not always, span entire domains. Using only proteins as the unit, albeit intuitive, may therefore be misleading. Reporting per-residue leverage, on the other hand, is not intuitive but is technically sound. The added benefit is that it encourages the determination of structures for larger fragments or even complexes.
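The day 0 / day 1 subtraction can be sketched as a set difference. This is a minimal illustration under stated assumptions, not the authors' implementation: coverage is represented as a hypothetical mapping from protein identifier to the set of residue positions that can be modeled, a protein counts as novel only if none of its residues could be modeled on day 0, and all identifiers and positions below are invented.

```python
def novel_leverage(day0_coverage, new_structure_coverage):
    """Return (novel_proteins, novel_residues) added by one new structure.

    Both arguments map protein_id -> set of residue positions that can be modeled.
    """
    novel_proteins = 0
    novel_residues = 0
    for protein_id, positions in new_structure_coverage.items():
        already_modeled = day0_coverage.get(protein_id, set())
        if positions and not already_modeled:
            novel_proteins += 1  # no model at all existed for this protein on day 0
        novel_residues += len(positions - already_modeled)
    return novel_proteins, novel_residues


# Toy data, invented for illustration only.
day0 = {
    "P1": set(range(1, 101)),   # residues 1..100 could be modeled on day 0
    "P2": set(range(1, 51)),
}
new_structure = {
    "P1": set(range(50, 151)),  # residues 101..150 are new, P1 itself is not
    "P3": set(range(1, 81)),    # entirely new protein with 80 residues
}
print(novel_leverage(day0, new_structure))
```

The toy case shows why the unit matters: the new structure adds only one novel protein, yet it extends coverage of an already modeled protein by 50 residues, which only the per-residue count captures.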

   

Fig. 1: Concept of novel leverage. The blue square represents all proteins in UniProt; the pink ellipse sketches the subset of UniProt proteins (UniProt release 7.6) that can be modeled by all PDB structures on October 21, 2004, based on a predefined sequence similarity cutoff (PSI-BLAST E-value <1e-10). For two experimental structures (1xtn [12] and 1xto; both deposited on October 22, 2004), the leverage values are indicated by green circles that mark all the proteins in UniProt that can be modeled through their deposition. The green circle cut-outs denote the novel leverage, i.e. the number of models that could not have been built on October 21, 2004. Although 1xtn provided a higher total leverage than 1xto (619 vs. 492), the novel leverage achieved by 1xto was more than 100 times higher than that for 1xtn (373 vs. 3).

We analyzed the novel leverage values for all protein structures deposited into the PDB between September 1, 2000 (PSI start date) and August 31, 2006. Some 600,519 novel 3D models could be built from those structures. Structures from SG centers accounted for 161,947 of these models (27% of the proteins and 21% of the residues, blue bars in Fig. 2 a,b, Supplementary Table 1 ); more than half of the SG contribution came from the four largest PSI centers (PSI-BIG4 in Fig. 2 a,b). The concept of novel leverage can also be applied to particular subsets of proteins, e.g., SG structures contributed about 19% of the novel leverage for eukaryotic proteins ( Fig. 2 b, green) and about 23% of all novel models for human proteins ( Fig. 2 b, red). These numbers are notably lower than the novel leverage counting all proteins in UniProt, suggesting a bias toward covering prokaryotic proteins in these six years of SG efforts.
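The per-protein share quoted above follows directly from the two model counts. The short check below uses only the numbers given in the text; the variable names are illustrative.

```python
total_novel_models = 600_519  # novel models from all PDB structures in the six-year window
sg_novel_models = 161_947     # novel models from SG structures

sg_share = sg_novel_models / total_novel_models
non_sg_novel_models = total_novel_models - sg_novel_models

print(f"SG share of novel models: {sg_share:.1%}")
print(f"Novel models from non-SG structures: {non_sg_novel_models:,}")
```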

   

Fig. 2: SG has made an increasingly significant contribution to covering novel proteins.
(a) Comparison between SG (world-wide structural genomics efforts) and non-SG in their novel leverage (number of novel models built); values are given as percentages of the novel leverage for the entire PDB during the same six-year period. PSI-BIG4: the four largest centers of PSI (JCSG, MCSG, NESG, and NYSGXRC). (b) Comparison of novel leverage at the residue level between SG and non-SG structures. (c) Annual statistics for the fraction of structures determined by SG, distinguishing between the contribution to all structures deposited in a given year (gray), the contribution toward novel proteins (shaded purple) and toward novel residues (purple). (d,e) Plotted are the percentages of proteins (d) and residues (e) in the entire UniProt database (release 7.6) covered by structural models (PSI-BLAST E-value <1e-10). The green line (SG) and the blue line (PSI) represent the structure coverage of UniProt if the only structures deposited into the PDB since September 2000 were from SG (or PSI). Overall, the coverage of UniProt has increased by about 15 percentage points over the last six years; between a third (per residue) and almost one half (per protein) of this increase originated from SG structures.

The novel leverage contribution from SG structures has increased dramatically over time (Fig. 2c): from 10% at the residue level in 2001 to 31% in 2005. On average, each SG structure deposited in 2005 provided novel leverage of 36 proteins and 6,600 residues, compared with 15 proteins and 3,618 residues for each structure deposited by non-SG groups. This increased performance comes at an opportune time, since achieving novel leverage becomes increasingly difficult: although the number of deposited structures has increased exponentially over the last six years, the coverage of UniProt by structural models has increased only linearly (Fig. 2d,e, Supplementary Fig. 1). The structural coverage for UniProt has grown by 15 percentage points since PSI began; almost half of this growth originated from SG structures (Fig. 2d,e). One goal of SG is to increase the efficiency and reduce the cost of structure determination through technology development. A detailed analysis demonstrates that SG is becoming more cost-efficient in obtaining novel leverage: the cost for obtaining each novel structural model has dropped to $2,600 for PSI structures in 2005 (Supplementary Fig. 2).
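The 2005 per-structure averages can be turned into a direct SG to non-SG ratio. The sketch below uses only the four averages quoted in the text; the dictionary layout is an illustrative choice.

```python
# Average novel leverage per structure deposited in 2005, as quoted in the text.
sg_per_structure = {"proteins": 36, "residues": 6_600}
non_sg_per_structure = {"proteins": 15, "residues": 3_618}

ratios = {
    unit: sg_per_structure[unit] / non_sg_per_structure[unit]
    for unit in sg_per_structure
}
for unit, ratio in ratios.items():
    print(f"{unit}: SG structures yield {ratio:.2f} times the non-SG average")
```

The advantage is larger per protein than per residue, in line with the lower residue-level share of SG reported for the six-year totals.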

The growing overlap of 3D fragments between structurally rather different proteins may suggest that the traditional concept of "the fold" is misleading. Even if folds had a scientific reality, the aim of structural genomics would not be the discovery of "novel folds". Rather, it is to assign a structural class to sequence families for which no reliable structural information is available. Our definition of novel leverage implicitly but directly measures the degree to which such a classification is provided by experimental structures.

In conclusion, we have proposed novel modeling leverage as a key measure for assessing the impact of structural information on protein databases. Novel leverage covers both the importance (in terms of family size) and the novelty of an experimental structure. We demonstrated that SG structures have been accounting for an increasingly large portion of the novel leverage of the entire PDB and that SG is very cost-efficient in obtaining novel leverage. The absolute novel leverage values that we presented are subject to change given different parameters, but the relative values are likely to remain similar. In particular, since protein databases continue to grow rapidly (e.g. UniProt increased by ~67% over the last year; environmental genome sequencing [11] will boost this increase even more), the absolute leverage values are only meaningful with regard to a fixed release of the protein database. Nevertheless, novel leverage values are still very useful not only for post hoc assessment, but also for establishing target selection strategies for SG, e.g., how to optimally increase the structural coverage of model organisms. With such guidance in target selection, dedication of limited resources toward systematically increasing the structural coverage of all proteins appears an extremely valuable objective. In this sense, our proposed measure also reflects the degree to which structural biology unravels evolutionary connections and the extent to which three-dimensional information enables discoveries accessible only through knowledge of protein structure.

 

Acknowledgements

This work was supported by a grant from the Protein Structure Initiative (PSI) of the National Institute of General Medical Sciences (NIGMS) at the National Institutes of Health (NIH U54-GM074958-01).

References

1. Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C. et al. (2001). Initial sequencing and analysis of the human genome. Nature, 409, 860-921.
2. Terwilliger, T. C. (2000). Structural genomics in North America. Nat Struct Biol, 7 Suppl, 935-9.
3. Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N. et al. (2000). The Protein Data Bank. Nucleic Acids Res, 28, 235-42.
4. Heinemann, U. (2000). Structural genomics in Europe: slow start, strong finish? Nat Struct Biol, 7 Suppl, 940-2.
5. Yokoyama, S., Hirota, H., Kigawa, T., Yabuki, T., Shirouzu, M. et al. (2000). Structural genomics projects in Japan. Nat Struct Biol, 7 Suppl, 943-5.
6. Chandonia, J. M. & Brenner, S. E. (2006). The impact of structural genomics: expectations and outcomes. Science, 311, 347-51.
7. Baker, D. & Sali, A. (2001). Protein structure prediction and structural genomics. Science, 294, 93-6.
8. Mirkovic, N., Li, Z., Parnassa, A. & Murray, D. (2006). Strategies for high-throughput comparative modeling: Applications to leverage analysis in structural genomics and protein family organization. Proteins, in press.
9. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z. et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 25, 3389-402.
10. Bairoch, A., Apweiler, R., Wu, C. H., Barker, W. C., Boeckmann, B. et al. (2005). The Universal Protein Resource (UniProt). Nucleic Acids Res, 33 Database Issue, D154-9.
11. Venter, J. C., Remington, K., Heidelberg, J. F., Halpern, A. L., Rusch, D. et al. (2004). Environmental genome shotgun sequencing of the Sargasso Sea. Science, 304, 66-74.
12. Xing, Y., Liu, D., Zhang, R., Joachimiak, A., Songyang, Z. et al. (2004). Structural basis of membrane targeting by the Phox homology domain of cytokine-independent survival kinase (CISK-PX). J Biol Chem, 279, 30662-9.

Contact:    admin@rostlab.org Version:    Mar 28, 2007