| Title: | Target space for structural genomics revisited |
| Author: | Jinfeng Liu & Burkhard Rost |
| Quote: | Bioinformatics (2002) 18, 922-933 |
Target space for structural genomics revisited
1 Dept. of Pharmacology, Columbia University, 630 West 168th Street, New York, NY 10032, USA, liu@cubic.bioc.columbia.edu
2 CUBIC, Dept. of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA
3 Columbia University Center for Computational Biology and Bioinformatics (C2B2), Russ Berrie Pavilion, 1150 St. Nicholas Avenue, New York, NY 10032, USA
* Corresponding author: rost@columbia.edu, http://cubic.bioc.columbia.edu/
Tel: +1-212-305-3773, fax: +1-212-305-7932
Motivation: Structural genomics eventually aims at determining structures for all proteins. However, in the beginning experimentalists are likely to focus on globular proteins to achieve a rapid basic coverage of protein sequence space. How many proteins will structural genomics have to target? How many proteins will be excluded since we already have structural information for these or since they are not globular? We had to answer these questions in the context of our target selection for the North-East Structural Genomics Consortium (NESG).
Results: We estimated that structural information is available for about 6-38% of all proteins; 6% if we require high accuracy in comparative modelling, 38% if we are satisfied with having a rough idea about the fold. Excluding all regions that are not globular, we found that structural genomics may have to target about 48% of all proteins. This corresponded to a similar percentage of residues of the entire proteomes (52%). We explored a number of different strategies to cluster protein space in order to find the number of families representing these 48% of structurally unknown proteins. For the subset of all entirely sequenced eukaryotes, we found over 18000 fragment clusters each of which may be a suitable target for structural genomics.
Availability: All data are available from the authors, most results are summarised at: http://cubic.bioc.columbia.edu/genomes/RES/2002_bioinformatics/
Contact: rost@columbia.edu
Key words: structural genomics; target selection; sequence space; structure space; protein sequence analysis; membrane proteins.
NOTE for authors (after publication): Upon publication the notice must be changed to read This article is published in (Bioinformatics, issue, date and pages) © copyright The Oxford University Press (2002). OUP is the only authorised source. All copying of this article including placing on another website requires the written permission of the copyright owner.
| COILS | coiled-coil prediction/region [1, 2] |
| EVA | server automatically evaluating structure prediction methods [3, 4] |
| HTM | Helical Trans-Membrane region |
| NORS | NO regular secondary structure, i.e. region of more than 70 consecutive residues with less than 12% regular secondary structure (helix, strand) |
| PDB | Protein Data Bank of experimentally determined 3D structures of proteins [5] |
| Pfam | database of expert-curated alignments of protein families [6, 7] |
| PHD | Profile based neural network prediction of secondary structure (PHDsec [8, 9, 10]), solvent accessibility (PHDacc [11, 10]), and transmembrane helices (PHDhtm [12, 10, 13]) |
| PrISM | Protein Informatics System for Modeling, used here for the definition of sequence-consecutive structural domains [14] |
| PSI-BLAST | position specific iterated database search [15] |
| rmsd | root mean square deviation |
| SEG | program detecting low-complexity regions [16] |
| SignalP | signal peptide prediction [17] |
| SWISS-PROT | data base of protein sequences [18] |
| TrEMBL | translation of the EMBL-nucleotide database coding DNA to protein sequences [18] . |
Structural genomics to determine all native protein structures. In 2000, the National Institute of Health (NIH) in the USA began to finance pilot projects for large-scale protein structure determination (structural genomics). Two major objectives of structural genomics have often been given. First, experimentally determine one protein structure for each natural protein [19] . Second, determine one structure for all missing links in pathways and biological mechanisms [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30] . These two objectives correspond to the two aspects of genome sequencing: (i) the mass of data, and (ii) the completeness of entirely sequenced organisms. One expected technical benefit from structural genomics is the development of techniques and protocols for large-scale expression, purification, crystallisation and structure-determination. An important benefit for molecular biology may be the determination of the structural scaffolds for most basic functional elements. A considerable increase in the fraction of proteins for which we have some structural information may also advance the determination of function for single proteins or entire proteomes. It is commonly assumed that the scaffolds of protein folds constitute one of the 'basic units' for evolution. If so, structural genomics will also help to better understand evolution. Structural genomics focuses on structural modules or domains. However, isolated domains do not always suffice to understand function. Instead, understanding function often requires studying complexes composed of many proteins. The difficulty of determining structures for large complexes will be prohibitive for the first round of structural genomics.
Determine one structure for each family of closely related proteins. The safest strategy to go about the goal of determining structures for all native proteins is to simply express, purify, crystallise and X-ray all protein sequences one by one, just in the way large-scale genome sequencing operates. However, sequencing is technically much simpler than is structure determination. None of the necessary steps - express, purify, crystallise, X-ray - has ever been accomplished on the scale of 'all proteins in a proteome'. Consequently, we have to find a way of focussing on some representative fraction of all proteins. Resources such as CATH [31] , FSSP [32] , HSSP [33] , or SCOP [34] illustrate that fewer than 1000 folds and about 2500 families are representative for over 20,000 structures deposited in PDB [5] . Hence, the conceptually simple refinement of the selection strategy is to determine one structure for each unknown fold. Unfortunately, this straightforward concept hides a number of severe problems. The first is that the absolute majority of similar folds have less than 12% pairwise sequence identity [35, 36, 37] , i.e. populate the midnight zone of sequence comparisons in which we cannot detect the fold similarity from sequence alone. Hence, we would have to determine the fold to find the set of representative folds. One way around this vicious circle is to reformulate the goal: Determine one structure for each family of proteins that are related by sequence. The levels of pairwise sequence similarity that imply similarity in structure are well established [38, 39, 40, 41, 42, 43, 36, 37] . Thus, it may seem that all bioinformatics has to do is to cluster all proteins into families of proteins with similar structures, exclude all clusters with known structures and define the remaining list as the target list for structural genomics [44] . In fact, this procedure describes the current modus operandi of structural genomics initiatives fairly well. 
Additionally, most groups exclude clusters that are particularly problematic due to the presence of membrane regions, and/or long regions of low-complexity.
The age of structural genomics began. The currently active structural genomics groups differ in their focus. Most groups focus on particular organisms: Mycoplasma genitalium and Mycoplasma pneumoniae by BSGC [45] , Caenorhabditis elegans by JCSG [46] , Mycobacterium tuberculosis by TBSGC [47] , Caenorhabditis elegans and Pyrococcus furiosus by SECSG [48] , Saccharomyces cerevisiae by the YSG [49] , Thermus thermophilus by SRG [50] , Homo sapiens by PSF [51] . Two groups focus on particular protein types (short proteins from eukaryotes by NESG [52] , disease related and 'easy' proteins by MCSG [53]), and one on particular functional types (enzymes by NYSGRC [54]). The nine initiatives currently financed by the National Institute of Health (NIH) in the USA together intend to add about 2000 structures over the next four years. Given that almost 3000 new protein chains have been added to PDB [5] over the last 12 months, this number may appear small. However, of the 3000 structures added to PDB in 2001, only about 500 belonged to families of unknown structure [5, 3, 4] . Since the structural genomics initiatives set out to determine structures exclusively for such families of unknown structures their yield would double the number of families for which structures will be added until 2005. Another implicit goal of all structural genomics initiatives is to reduce the costs of determining a protein structure from its current value of about $100K/protein. Interestingly, the US-based pilot groups receive about this amount from the NIH to determine their projected 2000 structures. However, the goal of the first round of structural genomics is not primarily to determine as many structures as possible, rather it is to pioneer the development of techniques that will be required for a cost-efficient large-scale structure determination.
Existing methods that cluster sequence-space. Over the last years, a number of groups have presented different approaches to cluster sequence space. CATH [31] , FSSP [32] , and SCOP [34] group proteins of known structure according to their fold. These classifications can then be extended to homologous proteins for which we do not experimentally know structure – a concept pioneered in the HSSP database [38] . When we want to group proteins into families without knowing the structure of any protein in that family, the problem becomes how to define the boundaries of which proteins to include. For example, should proteins A and F in Fig. 1 become part of the same family, or should we try to chop both A and F into domains and build a family of what is labelled 'Domain 1' in Fig. 1 ? PFAM [6, 7] is an expert annotated database of protein families, that tries to build multiple alignments of regions in proteins that are believed to constitute domains. One limitation of PFAM is that not all known proteins are included, yet. Even more limited in that respect is the similar approach toward listing all domains of secreted proteins in SMART [55] . COG [56, 57] builds clusters of orthologous groups (COGs: proteins in different species that evolved from a common ancestral protein) or orthologous sets of paralogues (proteins from the same organism, which are believed to be related by duplication) from at least three species. The authors try to split multi-domain clusters through pair relations. ProtoMap [58, 59, 60, 61] is an automatic, hierarchical classification of the entire SWISS-PROT database [62] that is based on pairwise relations. The particular algorithm introduced by ProtoMap for merging and splitting groups of pairwise related proteins, yields an implicit separation into clusters with single and multiple domains. 
An attempt at combining sequence-based and structure-based classifications is implemented in BioSpace that first clusters all proteins of known structures and then pulls in proteins of unknown structures in a way similar to the ProtoMap algorithm. Finding consensus motifs in alignments and then cutting according to some statistical criteria is the concept that leads to the automatic classification of all proteins in ProDom [63] . The particular problem of ProDom is that the domains found tend to be shorter than those assigned from known protein structures. The basic idea of using boundaries in alignments to identify domains has also been implemented by other groups [64, 65] . In particular, the GeneRAGE [65] algorithm appears to yield domains that resemble structural domains. ProClass classifies proteins into families based on PROSITE sequence motifs [66, 67] and PIR super-families [68] . Domains are not explicitly detected by ProClass, rather they are taken from previous annotations (from experts, PFAM, or ProDom). Picasso [69] is another approach clustering protein space based on pairwise relations. It seems that Picasso splits domains in a way similar to the GeneRAGE algorithm. The idea of mapping the space of all proteins implies that we have some sort of metric that defines a distance between two groups. The problem with this concept is that we can only measure the similarity not the distance between two proteins. For example, assume proteins A and B are both 100 residues long. If they have 33 pairwise identical residues, we can infer that they have similar structures [36] . If they only have 25 pairwise identical residues we know that the odds are one in ten that A and B have similar structure, however, these odds reflect our lack of knowledge of the relation between A and B rather than their actual structural similarity. In fact, A and B may structurally be more similar than a pair A'-B' with 33 identical residues. 
Furthermore, assume we have a globin, an immunoglobulin and a TIM-barrel. We know that the three are not similar, however, we cannot unambiguously define a distance relationship that concludes something such as the globin is more similar to a TIM-barrel than it is to the immunoglobulin. Amongst all the clustering attempts, ProtoMap appears to be the one that most successfully introduces a kind of distance metric [60] .
Here, we re-evaluated earlier estimates [25, 70, 44] for the number of structural families to target by structural genomics efforts. We also presented two clustering strategies that illustrated problems with the simple concept of 'one structure per family'. In particular, our maximal-size clusters illustrated that we fail to cluster sequence-space if we do not dissect the sequences into structural domains before we start clustering. Our preliminary implementation of a domain-dissection approach suggested that structural genomics initiatives might have to target over 18000 fragment clusters in eukaryotes alone. This estimate resulted from the proteins that we selected in our second round for the target selection of the North-East Structural Genomics Consortium (NESG; www.nesg.org).
Fig. 1: Concepts of clustering and domain splitting.
Assume we have six proteins aligned: A-F. Regions in which the proteins have significant pairwise sequence similarity between any pair shown are marked as black lines (A). The particular pairs of 'significant similarity' are given in the matrix (B: grey boxes mark similar pairs; note that we assume symmetry). The 'minimal-size' cluster concept naïvely realises the task of finding families. For the example shown, the six proteins group into two clusters, with protein F belonging to both (C: right). The first of the two clusters could for example describe all proteins in one HSSP file [33, 32]. The minimal-size clustering is appropriate if we assume that proteins have no domains and that the matrix contains no grey tones, i.e. that two proteins are either similar or not. The 'maximal-size' clustering assumes we have failed to dissect proteins into domains and want to ascertain that no two clusters have residual similarity (C: left). One rationale could be to separate the clustering task into two steps. (1) Coarse-grained: find the maximally separated clusters (those of largest sizes). (2) Fine-grained: find all minimal-size clusters within the big clusters. This approach is appropriate to some extent if sequence-space is not continuous. Assuming that the protein universe is built by arranging a limited number of well-defined, well-separated structural domains in all possible ways, we conclude that we should rather cluster these basic structural building blocks. We illustrate one particularly naïve way of dissecting proteins into domains: protein F is assumed to consist of two domains because of the following relation: F = E (read 'similar to'), F = A, but A ≠ E (read 'not similar to').
Note that C is not split into two domains because its similarity to D is assumed to be on the borderline, i.e. below some given threshold (indicated by light grey in B).
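The naïve splitting rule from the caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, similarity is treated as a symmetric yes/no relation, and the toy data contain only the three relations spelled out in the caption (F similar to E, F similar to A, A not similar to E), not the full matrix of Fig. 1.

```python
from itertools import combinations

def is_similar(pairs, a, b):
    """Symmetric lookup in a set of unordered similar pairs."""
    return frozenset((a, b)) in pairs

def naive_multidomain(protein, proteins, pairs):
    """Return partner pairs (x, y) with protein~x and protein~y but x not ~ y.

    A non-empty result is taken as evidence that `protein` has at least
    two domains (hypothetical helper, not the authors' implementation).
    """
    partners = [p for p in proteins
                if p != protein and is_similar(pairs, protein, p)]
    return [(x, y) for x, y in combinations(partners, 2)
            if not is_similar(pairs, x, y)]

# Toy relations stated in the caption: F~E, F~A, and A not similar to E.
pairs = {frozenset(("F", "E")), frozenset(("F", "A"))}
proteins = ["A", "E", "F"]

evidence_f = naive_multidomain("F", proteins, pairs)
evidence_a = naive_multidomain("A", proteins, pairs)
print(evidence_f, evidence_a)
```

A borderline similarity, such as the C-D pair in the caption, would simply be left out of `pairs` under this binary treatment.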
We obtained the sequences for the entire proteomes of the 31 organisms we analysed from the public domain [71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95] . All ORFs extracted from the genome sequences were downloaded from ftp://ncbi.nlm.nih.gov/genbank/genomes/, except for Homo sapiens (from SWISS-PROT release 39 and TrEMBL database release 15), Drosophila melanogaster (from www.fruitfly.org, release 2), and Caenorhabditis elegans (from www.sanger.ac.uk/Projects/C_elegans/wormpep/, wormpep 65).
| Latin name | Nprot b | Latin name | Nprot b |
| Archaebacteria | | Prokaryotes | |
| Aeropyrum pernix K1 | 2694 | Aquifex aeolicus | 1522 |
| Archaeoglobus fulgidus | 2383 | Bacillus subtilis | 4099 |
| Methanococcus jannaschii | 1735 | Borrelia burgdorferi | 850 |
| Methanobacterium thermoautotrophicum | 1871 | Campylobacter jejuni | 1731 |
| Pyrococcus abyssi | 1765 | Chlamydia pneumoniae | 1052 |
| Pyrococcus horikoshii | 2064 | Chlamydia trachomatis | 894 |
| Eukaryotes | | Deinococcus radiodurans | 3103 |
| Arabidopsis thaliana | 25462 | Escherichia coli | 4285 |
| Caenorhabditis elegans | 20011 | Haemophilus influenzae | 1716 |
| Drosophila melanogaster | 14334 | Helicobacter pylori | 1788 |
| Subset of Homo sapiens c | 31251 | Mycoplasma genitalium | 470 |
| Saccharomyces cerevisiae | 6357 | Mycoplasma pneumoniae | 677 |
| | | Mycobacterium tuberculosis | 3918 |
| | | Mycobacterium tuberculosis | 3918 |
| | | Neisseria meningitidis | 2081 |
| | | Rickettsia prowazekii | 834 |
| | | Synechocystis PCC6803 | 3169 |
| | | Thermotoga maritima | 1846 |
| | | Treponema pallidum | 1031 |
| | | Ureaplasma urealyticum | 613 |
a Abbreviations: taken from SWISS-PROT; b Nprot: the number of Open Reading Frames (predicted proteins) is taken from the respective original publication; c human sequences: the only non-complete data were the human sequences, taken from SWISS-PROT release 39 [62] and from TrEMBL release 15 [62]. Note: all sequences used are available on our web site [110].
Search for similar proteins. We detected similar sequences in two ways. (1) We ran PSI-BLAST [15] searches against all known sequences contained in SWISS-PROT [18], TrEMBL [18], and PDB [5]. For simplicity, we refer to the combination of these three databases as the set BIG. We first searched against a filtered version of BIG and then used the final profile to search against the unfiltered BIG [96, 97]. We included all hits below a PSI-BLAST E-value of 10^-3. We tested various thresholds for 'significant sequence similarity to protein of known structure'. Firstly, we included all protein pairs with more than 50% pairwise identical residues (corresponding to 'high accuracy in comparative modelling'). Secondly, we included all pairs above the refined HSSP-curve (medium accuracy in comparative modelling) relating the length of the alignment to the respective pairwise sequence identity/similarity [38, 36]. Thirdly, we included all pairs with PSI-BLAST E-values below 10^-3 (for most of these, comparative modelling supposedly identifies the basic fold schematically).
Predict membrane proteins. We used only the filtered MaxHom alignments [36] for predicting membrane regions by the program PHDhtm [12, 10, 13] using the default threshold of 0.8. We adjusted the total number of membrane proteins according to the false positive rate (1.6%) and false negative rate (3%) published in the original paper [13] :
(Eqn. 1)
where n was the final number of membrane proteins we reported, FP and FN were the false positive and false negative rates respectively, npred was the number of predicted membrane proteins in the genome, and ntotal was the total number of proteins in the genome. Note: our notion of 'membrane proteins' is restricted to integral helical membrane proteins. In particular, we ignored proteins anchoring helices in the membrane or those inserting beta-strands (porins) since these classes of proteins cannot be identified from sequence information alone.
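The kind of correction described here can be illustrated with a short sketch. The formula below is NOT taken from the paper (Eqn. 1 is not reproduced in this text); it is one plausible form that follows from an assumed error model in which the predicted count is made up of true membrane proteins that were detected plus non-membrane proteins that were falsely predicted: npred = n (1 - FN) + (ntotal - n) FP. The function name and the example numbers are hypothetical; only the default rates (1.6% and 3%) come from the text.

```python
def corrected_membrane_count(n_pred, n_total, fp=0.016, fn=0.03):
    """Estimate the number of membrane proteins from the predicted count.

    Assumed error model (an assumption, not the paper's Eqn. 1):
        n_pred = n * (1 - fn) + (n_total - n) * fp
    Solving for n gives the expression below.
    """
    denom = 1.0 - fn - fp
    if denom <= 0:
        raise ValueError("fp + fn must be below 1")
    n = (n_pred - fp * n_total) / denom
    # Clamp to the feasible range [0, n_total].
    return min(max(n, 0.0), float(n_total))

# Hypothetical proteome: 5000 proteins, 1000 of them predicted as membrane.
estimate = corrected_membrane_count(n_pred=1000, n_total=5000)
print(round(estimate, 1))
```

Plugging the estimate back into the assumed model recovers the predicted count, which is a quick sanity check on the algebra.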
Predicting signal peptides. We predicted signal peptides using the program SignalP [98, 17] . We considered a protein to contain a signal peptide if the “mean S” value in the prediction was above the default threshold. The accuracy of SignalP was estimated to be around 90% [17, 99] . We excluded archaebacteria from the analysis since SignalP was developed for prokaryotes and eukaryotes.
Predicting coiled-coil helices. We used the program COILS [1, 2] to predict coiled-coil regions, with the window-size set to 28 residues and the threshold for probability set to 0.9.
Identifying regions of low-complexity (SEG). We labelled regions of low-complexity using the program SEG [100, 16] with the default parameters.
Identifying regions with no regular secondary structure (NORS). Using the filtered MaxHom alignments, we used PHDsec [8, 9, 10] to predict secondary structures. We considered stretches of more than 70 consecutive residues with less than 12% predicted helix or strand as 'NORS' [101] .
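The NORS criterion can be sketched as a sliding-window scan over a predicted secondary structure string. This is a minimal illustration, not the published implementation: it assumes a three-state alphabet ('H' for helix, 'E' for strand, any other character for non-regular), reads 'more than 70 consecutive residues' as a window of 71 residues, and merges overlapping qualifying windows into one region. The function name `find_nors` is hypothetical.

```python
def find_nors(ss, min_len=71, max_regular=0.12):
    """Return half-open (start, end) regions built from qualifying windows.

    ss: predicted secondary structure, 'H' = helix, 'E' = strand,
        anything else = non-regular (alphabet is an assumption).
    A window qualifies if its fraction of H/E is below max_regular.
    """
    regular = [1 if c in "HE" else 0 for c in ss]
    regions = []
    if len(ss) < min_len:
        return regions
    count = sum(regular[:min_len])
    for start in range(len(ss) - min_len + 1):
        if start > 0:
            # Slide the window by one residue.
            count += regular[start + min_len - 1] - regular[start - 1]
        if count / min_len < max_regular:
            end = start + min_len
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], end)  # extend previous region
            else:
                regions.append((start, end))
    return regions

# Toy prediction: 30 helix residues, 100 loop residues, 30 strand residues.
toy = "H" * 30 + "L" * 100 + "E" * 30
print(find_nors(toy))
```

Because windows are merged, a reported region can be slightly richer in helix or strand than any single window; a stricter variant would re-check the merged region against the 12% limit.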
Operational definition for removing fragments from the 'to-do' list. Many proteins of known structure contain regions of low-complexity [102, 103]. However, proteins that contain almost no high-complexity regions constitute – at best – low-priority targets for structural genomics. We removed all proteins that had fewer than 50 residues outside membrane, coiled-coil, signal-peptide, SEG, and NORS regions.
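A minimal sketch of this filter, assuming that every predictor reports its regions as half-open (start, end) residue ranges; the helper names and the example ranges are hypothetical. The masks from all predictors are pooled, and a protein is kept only if at least 50 residues remain uncovered.

```python
def globular_residues(length, masked_regions):
    """Count residues not covered by any masked half-open (start, end) region."""
    masked = [False] * length
    for start, end in masked_regions:
        for i in range(max(start, 0), min(end, length)):
            masked[i] = True
    return masked.count(False)

def keep_as_target(length, masked_regions, min_globular=50):
    """Keep a protein if enough residues lie outside all masked regions."""
    return globular_residues(length, masked_regions) >= min_globular

# Hypothetical 200-residue protein: a signal peptide plus a long NORS region.
mostly_masked = [(0, 25), (20, 160)]
print(globular_residues(200, mostly_masked), keep_as_target(200, mostly_masked))
```

Overlapping masks are counted once, so a stretch flagged by two predictors does not reduce the globular count twice.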
Clustering sequence space. In order to cluster sequence space for eukaryotes, we tested the following three approximations (Fig. 1). (1) Maximal cluster size: merge all proteins that have some local similarity (BLAST E-value < 10^-3) to one another into one cluster; merge clusters as long as they have common members. (2) Minimal cluster size: given any two proteins A and B, group these into one cluster if the sequence similarity between the pair is above a threshold (BLAST E-value < 10^-3). While the maximal clustering is independent of the starting point, the final clusters resulting from the minimal clustering do differ. We followed the algorithm encoded in GeneRAGE [65] by starting from single-domain proteins (Fig. 1). Once we compiled the minimal-size clusters, we took the domains implied by the clustering and split those further.
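The maximal-size clustering amounts to finding connected components in the graph of significant pairwise hits. The sketch below uses a union-find structure for this; it is an illustration under stated assumptions, not the code used for the paper. The function name, the input format (triples of two protein identifiers and an E-value) and the toy hits are all hypothetical.

```python
def maximal_clusters(proteins, hits, e_cutoff=1e-3):
    """Single-linkage clusters: proteins linked by any chain of hits merge.

    hits: iterable of (protein_a, protein_b, evalue) triples (assumed format).
    """
    parent = {p: p for p in proteins}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, evalue in hits:
        if evalue < e_cutoff:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for p in proteins:
        clusters.setdefault(find(p), set()).add(p)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])

# Hypothetical hits (not the matrix of Fig. 1).
toy_hits = [("A", "F", 1e-10), ("E", "F", 1e-8), ("B", "C", 1e-5), ("C", "D", 0.5)]
toy_clusters = maximal_clusters(list("ABCDEF"), toy_hits)
print(toy_clusters)
```

In the toy example A and E end up in the same cluster only because both hit F, which is exactly how one multi-domain protein can chain otherwise unrelated families into a single large cluster.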
We have some idea about structure for 6-38% of all proteins. We have explicit experimental information about structure for less than 0.3% of all entirely sequenced proteomes. The answer to the question for which fraction of entire proteomes we can predict structure by comparative modelling depends on the accuracy we require for the model. One extreme point is to model only proteins for which the respective experimental structure has more than 50% pairwise identical residues. At that level, models are typically very accurate (< 3 Å Cα-rmsd) [104, 3, 4, 105]. For all the 28 proteomes that we analysed (Table 1), we found that about 6% of the proteins can be modelled at this level of accuracy (Fig. 2 left panel, black bars). Next, we tested a level of average accuracy at which the models provide a good idea of the basic fold (around 5-6 Å Cα-rmsd) [104, 3, 4, 105]. At that 'cartoon-level' of model accuracy, we found similar structural regions for about 20% of all proteins (Fig. 2 left panel, grey bars). Finally, we dropped the requirement for model accuracy entirely, and tested a threshold at which the model most often captures basic features of the respective structure. At that level, we found structurally known regions in 38% of all proteins (Fig. 2 central panel, grey bars). Note in particular the extreme increase in coverage when using PSI-BLAST searches against the BIG database. The reason for this non-linear behaviour was that pairs of fairly diverged sequences dominated most structural families (Fig. 3).
Fig. 2: Estimate for the percentage of protein targets.
Left panel: Percentages of proteins in respective proteome for which we found similarities to proteins of known structure above (1) pairwise sequence identities of 50% (PIDE), and (2) above the refined HSSP-threshold, e.g. given by 'more than 33% pairwise identity over 100 residues aligned' [36]. Right panel: Percentages of proteins predicted with membrane helices (HTM), coiled-coil regions (COILS), and signal peptides (SignalP) in all proteomes. Centre: The lowest threshold for which we can somehow reliably predict aspects of structure through comparative modelling is an E-value in PSI-BLAST of 10^-3. At this level, we found about 38% of all proteins to have similarity to known structures. To exclude all these proteins for target selection might be deemed highest priority. Next, we identified all the proteins without any globular region longer than 50 residues (UNWANTED). The sum over PSI-BLAST + UNWANTED marks the percentage of proteins that are certainly not interesting for target selection in the first round of structural genomics. For all proteomes this number added to about 52% leaving about 48% of all proteins as putative targets.
Fig. 3: Distribution of structural families.
Families are defined as all sequences identified by a PSI-BLAST search with all proteins from the 28 genomes (Table 1) against all known proteins below a threshold of 10^-3. The sum over all pairs thus identified constitutes the 100% mark on the y-axis. The curves (separated counts for eukaryotes and prokaryotes) give the cumulative percentage of protein pairs found (A) at a given PSI-BLAST E-value (x-axis), and (B) at a given level of pairwise sequence identity. If we were at an early point of evolution, we would expect families of structurally similar proteins to be dominated by pairs of very similar sequences. We observed the opposite situation: most pair relations unravel for low E-values, i.e. most pairs have diverged to levels of sequence similarity identifiable only through profile-based searches. The curves illustrate why the percentage of proteins for which we can utilise comparative modelling changes more than six-fold from 6% (high accuracy) to 38% (very low accuracy). An interesting side observation is the lack of a visible difference between prokaryotes and eukaryotes. In fact, the data suggested that prokaryotic families were – on average – as diverged as eukaryotic families.
30-40% of all proteins contain non-globular regions. We found at least one membrane helix for about 22% of all proteins (Fig. 2 right panel, black bars). About half of all predicted membrane proteins had more than five helices [70]. While the percentage of helical membrane proteins was similar between all three kingdoms (archae, eukaryotes, and prokaryotes), we found significantly more proteins with coiled-coil regions in eukaryotes (eukaryotes > 10%; prokaryotes + archae < 5%, total about 8%; Fig. 2 right panel, striped bars). Most coiled-coil proteins consisted of a single 28-residue coil [70]. We also found that the percentage of long NORS regions (Methods) differed significantly between eukaryotes and the other two kingdoms: eukaryotes had about 25% NORS proteins, prokaryotes and archae only about 3%, bringing the total percentage to 16% [101]. Initially, structural genomics initiatives will discard all those proteins. The total percentage of proteins with membrane helices, coiled-coils, or NORS regions added to about 30-40% (Fig. 2 central panel).
About 48% of all proteins constitute targets for structural genomics. Even when avoiding membrane regions, experimentalists may still want to determine the structure for the globular region of a membrane protein. We assumed rather daringly that any region of more than 50 consecutive residues without (i) membrane helices, (ii) coiled-coil helices, (iii) low-complexity stretches, (iv) similarity to a known structure, and (v) for which we predicted some regular secondary structure could be of interest to structural biology. After this reduction, we found about 48% of all proteins (slightly less for eukaryotes) to contain regions that could be of interest for structural genomics (Fig. 2 centre). Remarkably, the respective number for the subset of all human proteins we used (31K) was only 35%.
The immediate to-do list corresponds to about 54% of all residues. When estimating the percentage of proteins that structural genomics targets, we need to define arbitrary thresholds for when we consider the unwanted or structurally known regions to span enough of a protein to discard this from the to-do list. When estimating the percentage of residues in the entire proteomes that may become targets for structural genomics, we needed no assumptions about thresholds for 'minimal globular regions'. Rather, we could simply count all residues in transmembrane helices, coiled-coil helices, low-complexity stretches, signal peptides, NORS regions, and regions for which comparative modelling could provide an idea about structure. We found that on average structural genomics will have to contribute to adding in structural information for about 54% of all residues (Fig. 4). On the per-residue level, the subset of human (47%) did not differ as significantly from the average as for the protein level. This might suggest that the difference between human and others on the protein level has some reason other than that our subset was overly biased.
Fig. 4: Estimate for the percentage of residues in putative targets.
Right panel: Percentages of residues in transmembrane helices (HTM), coiled-coil helices (COILS), signal peptides (SignalP), low-complexity regions (SEG) and regions without regular secondary structure (NORS). Note: these numbers do not necessarily add up, since coiled-coil regions are occasionally detected by SEG. Left panel: Percentages of residues for which PSI-BLAST found similarities to known structures below an E-value of 10^-3, and percentage of UNWANTED residues, i.e. those that have any of the regions listed on the right panel. These are unwanted in that they may seriously hamper a high-throughput structural genomics effort. Interestingly, the percentage of residues for putative targets was rather similar to the percentage of proteins (Fig. 2).
Eukaryotes cluster into over 170000 fragments. We did not have the CPU resources to cluster all proteomes. Instead, we only had results for Methanococcus jannaschii, Saccharomyces cerevisiae, and the results for all known eukaryotic proteomes (Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, subset of Homo sapiens, and Saccharomyces cerevisiae). The 6357 Saccharomyces cerevisiae proteins fell into 3796 maximal-size and into 5448 minimal-size clusters (Fig. 1, Table 2). The largest single maximal-size cluster contained 1351 proteins. The simple domain-splitting algorithm similar to GeneRAGE [65] first separated and then grouped the minimal-size clusters into 3638-6867 clusters (Table 2). The data were similar for Methanococcus jannaschii (Table 2). When splitting all eukaryotes, the situation changed dramatically (Table 2): the 97K eukaryotic proteins fell into 22K maximal-size clusters with the largest single cluster containing almost half of all the proteins (46K). This result demonstrated that the maximal-size clustering was not reasonable. We were surprised by the separation of the 97K eukaryotic proteins into more than 170K fragments, i.e. by finding almost twice as many minimal-size fragment-clusters as proteins for the eukaryotes. The majority of these 170K fragments spanned 80-150 residues (Fig. 5). Overall, the length of the consensus region in each cluster corresponded to the length distribution of structural domains. The particular algorithm implemented in ProDom [63] that uses evolutionary relations to split proteins into domains yields fragments that are too short. The differences between the fragments generated by the GeneRAGE-type algorithm that we implemented and structural domains from PrISM [14, 37, 106, 107] indicated that the fragments we found were, on average, too long rather than too short.
Fig. 5: Distribution of fragment lengths for eukaryotes. We found 170K clusters for the 97K eukaryotic proteins (Table 2). We suspected that this number was inappropriately high due to over-splitting by the clustering algorithm applied (Enright et al., 2000). However, we could not verify this suspicion when comparing the lengths of the 170186 fragment clusters to those of structural domains from PrISM.
Table 2: Clustering and domain splitting of selected proteomes
| Set a | Nprot b | NminP c | NmergeP d | NmaxP e | Largest f | NminD g | NmaxD h |
| Methanococcus jannaschii | 1735 | 1432 | 1070 | 1211 | 72 | 1459 | 1229 |
| Saccharomyces cerevisiae | 6357 | 5448 | 3337 | 3796 | 1351 | 6867 | 3638 |
| Eukaryotes | 97421 | 170186 | | 22112 | 46318 | | |
| Eukaryotic targets | | | | | | 18127 | 15003 |
a Set: respective data set; 'Eukaryotes' = arabidopsis, worm, fly, yeast, and a subset of human; 'Eukaryotic targets' is the subset of eukaryotic clusters that may be targeted by structural genomics (at least one consecutive stretch of over 50 residues without homologue of known structure, membrane regions, low-complexity residues, or NORS regions); b Nprot: the number of predicted proteins from the respective original publication; c NminP: the number of 'minimal-size' clusters; d NmergeP: the number of 'minimal-size' clusters after merging the clusters again with a pairwise BLAST E-value of 10⁻³; e NmaxP: the number of 'maximal-size' clusters; f Largest: number of proteins in the largest single cluster; g NminD: the number of 'minimal-size' domain clusters; h NmaxD: the number of 'maximal-size' domain clusters.
More than 16000 targets for structural genomics found in eukaryotes alone. The five eukaryotic proteomes corresponded to over 170K minimal-size clusters. Next, we extracted the consensus regions for all these 170K clusters, and removed all clusters that did not have at least one fragment of more than 50 consecutive residues without homologue of known structure (according to PSI-BLAST), transmembrane or coiled-coil helix, low-complexity or NORS region. This reduction yielded 107410 eukaryotic fragments of potential interest to structural genomics (Table 2). An all-against-all comparison of these 107410 fragments resulted in 18127 minimal-size clusters (Table 2), the largest of which contained 81 eukaryotic proteins. Finally, we mapped the 18127 consensus regions to the Pfam-A database [6, 7]. Most of our clusters did not correspond to any of the known 2267 Pfam-A families (82-85%, Table 3). The 3213 clusters for which HMMer found similarities of any protein in that cluster to Pfam matched 1208 distinct Pfam families. Most of these Pfam families (57%) matched exclusively in one of our target clusters; 77% (935) matched in at most two clusters (Table 3). At most 210 of the 3213 clusters that matched in Pfam matched more than one family. This number may provide an upper-bound estimate for the error of our clustering if we assume that all Pfam families constitute structural domains. Thus, about 7% of our 18127 clusters may be problematic. Consequently, we expect that we have about 17000 targets for structural genomics in eukaryotes.
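The filter described above keeps a fragment only if it contains more than 50 consecutive residues free of any excluded feature. A minimal sketch of such a check is given below. The per-residue mask encoding ('-' for an unmarked residue, any other character for a residue marked as known structure, membrane or coiled-coil helix, low-complexity or NORS) and the function names are assumptions made for illustration, not the format used in our pipeline.

```python
def longest_unmasked_stretch(mask):
    """Return the length of the longest run of '-' characters in `mask`.

    Hypothetical encoding: '-' marks a residue with no excluded feature;
    any other character marks a residue that is excluded.
    """
    longest = current = 0
    for flag in mask:
        current = current + 1 if flag == "-" else 0
        longest = max(longest, current)
    return longest


def is_candidate_target(mask, min_stretch=50):
    """True if the fragment has MORE than `min_stretch` consecutive free residues."""
    return longest_unmasked_stretch(mask) > min_stretch


# Hypothetical fragments: 'M' stands for a predicted membrane helix residue.
mask_a = "-" * 60 + "M" * 20 + "-" * 30   # longest free stretch: 60
mask_b = "-" * 50 + "M" * 5 + "-" * 50    # longest free stretch: exactly 50

keep_a = is_candidate_target(mask_a)
keep_b = is_candidate_target(mask_b)
```

With these toy masks the first fragment is kept and the second is dropped, since a stretch of exactly 50 residues does not satisfy "more than 50".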
| Number of Pfam families matched by one cluster | Number of clusters (BLAST) | Number of clusters (HMMer) | Percentage of clusters (BLAST) | Percentage of clusters (HMMer) | Percentage of Pfam families matching N target clusters (BLAST) | Percentage of Pfam families matching N target clusters (HMMer) |
| 0 | 15443 | 14914 | 85.2 | 82.3 | | |
| 1 | 2565 | 3003 | 14.2 | 16.6 | 56.6 | 57.0 |
| 2 | 107 | 191 | 0.6 | 1.1 | 23.0 | 20.4 |
| 3 | 11 | 19 | 0.1 | 0.1 | 8.7 | 9.9 |
| 4 | 1 | 0.0 | 3.7 | 3.6 | ||
| 5 | 0 | 0 | 2.2 | 2.1 | ||
| ≥6 | 0 | 0 | 2.5 | 2.7 | ||
For each of the 107410 potential eukaryotic target fragments that we grouped into 18127 clusters (Table 2), we searched against all proteins in the Pfam-A database. For the search we applied two methods: (1) align all target fragments with pairwise BLAST against all Pfam proteins (BLAST E-value of 10⁻³, columns labelled BLAST), and (2) use the HMMer model for each Pfam family to find possible similarities in all our target clusters (Pfam E-value of 10⁻⁶, columns labelled HMMer). The vast majority of our target clusters (>82%) had no corresponding Pfam entry. Using pairwise BLAST to find similarities, 2683 of our target clusters matched to one protein in 1141 distinct Pfam families; 56.6% of these Pfam families matched exclusively in one of our clusters. Using HMMer to find similarities, 3213 of our target clusters matched to one protein in 1208 distinct Pfam families; 57% of these Pfam families matched exclusively in one of our clusters.
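The cluster percentages in Table 3 and the error estimate derived from them can be re-derived from the HMMer counts. The sketch below is only a worked check under one assumption: that the "about 7%" quoted in the text is the fraction of Pfam-matching clusters that hit more than one family (210 of 3213), applied to all 18127 clusters. This reading is ours for illustration; the variable names are hypothetical.

```python
TOTAL_CLUSTERS = 18127

# Clusters matching 0, 1, 2 or 3 Pfam-A families (HMMer columns of Table 3).
hmmer_counts = {0: 14914, 1: 3003, 2: 191, 3: 19}

# Percentage of all target clusters in each row, rounded as in Table 3.
percent = {n: round(100 * c / TOTAL_CLUSTERS, 1) for n, c in hmmer_counts.items()}

# Clusters with at least one Pfam match, and with more than one match.
matched = sum(c for n, c in hmmer_counts.items() if n >= 1)
multi_family = sum(c for n, c in hmmer_counts.items() if n >= 2)

# Assumed reading: multi-family matches bound the clustering error rate,
# which is then extrapolated to all clusters.
error_rate = multi_family / matched
remaining_targets = TOTAL_CLUSTERS * (1 - error_rate)

print(percent, matched, multi_family)
print(f"error rate {error_rate:.3f}, remaining targets {remaining_targets:.0f}")
```

Under this reading the error rate comes out between 6% and 7%, and the number of remaining clusters is just below 17000, in line with the rounded figures given in the text.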
About 48% of all proteins in the 28 proteomes constitute possible targets. Let us assume that structural genomics will have to experimentally determine structures for all proteins for which we do not have information about structure through experiments or through comparative modelling based on experimentally known homologues. We have explicit experimental information about structure for only a marginal fraction of all the proteins in currently sequenced proteomes (< 0.3%). Hence, the number of targets for structural genomics is not given by 'all minus structurally known'; rather, it is given by 'all minus models', where 'models' is the number of proteins for which we can obtain structural information through comparative modelling. The size of structural families increases exponentially when lowering the threshold for detecting structural similarities (Fig. 3). Lower thresholds imply lower accuracy in comparative modelling. Thus, the estimate for the number of targets for structural genomics is extremely sensitive to the accuracy we require in comparative modelling to remove a protein from the potential target list. While we have highly accurate information for only 6% of all proteins, we have low-accuracy information about structure for about 38%. In the first round of structural genomics, we may want to optimise the yield of 'new structures'. Hence, the low-accuracy number (38%) appears to be a reasonable choice.
About half of all proteins constitute targets for the first round. Initially, structural genomics may want to avoid experimental problems by targeting proteins that are as globular as possible. We found that about 48% of all the proteins contained fragments of over 50 residues that were not similar to known structures and did not contain problematic regions (membrane, coiled-coil, low-complexity, no regular secondary structure, or signal peptides, Fig. 2). Interestingly, this fraction was significantly lower (35%) for the subset of the 23K human sequences that we analysed. Comparative modelling predicts structure only for the fragments that correspond to known structures. The average protein length in PDB is clearly lower than the average length of the proteins found in entirely sequenced proteomes. Thus, we might expect that the percentage of all residues to target by structural genomics is significantly higher than the percentage of proteins. In fact, this expectation has recently been verified [44]. Surprisingly, we found that the 48% of all putative protein targets corresponded to about 52% of the entire residue mass of all proteomes (Fig. 4). This significant difference between our results (Fig. 2 and Fig. 4) and the results published previously [44] might have two reasons. Firstly, we used PSI-BLAST searches against the BIG database rather than pairwise BLAST searches against PDB (note that due to the small size of PDB, PSI-BLAST and BLAST searches against PDB basically yield the same results). Secondly, we marked all residues for which we predicted membrane or coiled-coil helices, and low-complexity or NORS regions. For all eukaryotic proteomes that we analysed, these regions added up to almost half of the 'residue mass' excluded from the target list of structural genomics (Fig. 4). Thus, our results suggested that about half of all the proteins in entire proteomes constitute potential targets for structural genomics.
Clustering raised more questions than it answered. How to best cluster all known sequences depends on the reason for the clustering. In the context of structural genomics, the reason appears clear: find a representative set of targets. However, this seemingly straightforward concept hides a can of worms. The first problem is that of a hierarchy: The HSSP database that relates all known structures to known sequences [38] implicitly treats the protein of experimentally known structure as the 'master-representative' of the structural family for that structure. If we use this concept, we find 4600 families in yeast, 1431 in Methanococcus jannaschii and about 30,000 families in all eukaryotes (data not shown). However, different structural genomics initiatives favour different organisms. Hence, we want to generate clusters without 'master-representatives'. The obvious problem with this task is to find the basic unit for the clustering. If we assume that the 'building blocks' are structural domains, the problem becomes to dissect proteins of unknown structure into structural domains. Arguing that we cannot accomplish this, we identified the 'maximal-size' protein clusters; by construction there is no sequence similarity between any two of the single-linkage clusters. We found 1211 such clusters in Methanococcus jannaschii with the largest cluster containing 72 proteins (Table 2); for yeast the largest of the 3796 clusters contained over 1300 proteins. For all eukaryotes, the number of clusters appeared reasonable (22112) but the largest cluster contained more than 46K proteins. These results suggested two conclusions. Firstly, sequence space appeared to be more continuous than we might have anticipated because almost half of all proteins are connected to one another by some local structural similarity. This may imply that domains were shuffled considerably during evolution [108, 109] and/or that structural domains are not the appropriate 'building blocks'. 
Secondly, the 'maximal-size' clustering obviously failed entirely to generate a reasonable map of sequence space when we do not split proteins into domains. Thus, we have to find some way to dissect proteins into domains. A particular way applicable to all protein sequences was suggested by Enright et al. (2000). For Methanococcus jannaschii this clustering/domain-splitting algorithm yielded about 1400 clusters (Table 2). The authors of GeneRAGE [65] published a similar number, suggesting that their implementation of the major concept did not differ substantially from ours. The number of minimal-size clusters for yeast also appeared reasonable (Table 2). Interestingly, the numbers we obtained with and without explicitly starting from the already split domains did not differ very much (for yeast 3337 vs. 3638, Table 2). When we applied the algorithm to the 97K eukaryotic proteins, we obtained over 170K fragment clusters. Obviously, the number appears rather high, suggesting that the algorithm might split proteins into regions that are too small. However, we found that the length distribution of the respective fragments was surprisingly similar to that of typical structural domains (Fig. 5). Thus, the 170K fragments may indeed constitute the base for clustering eukaryotic sequences. We continued by excluding all the clusters that appeared of no interest to initial structural genomics approaches. Thus, we obtained 45051 fragment clusters containing 107K eukaryotic fragments. Next, we re-applied our minimal-size clustering by comparing all 107K against each other. This yielded 18127 fragment clusters, the largest of which contained 81 proteins. Most (82%) of these clusters did not match any Pfam family [7] (Table 3); 99% of all the clusters that matched in Pfam matched one or two Pfam families. Matches to more than two Pfam families might constitute errors in defining our clusters; the problem, in particular, was that our domain-splitting approach missed many domains.
Further splitting clusters is likely to increase the number of putative eukaryotic targets. A step missing from our analysis that works in the opposite direction is the attempted merge of some of the clusters through PSI-BLAST rather than pairwise relations.
Structural genomics for eukaryotes may have to target 3000-18000 protein fragments. We could not reach a firm conclusion as to the number of putative targets for structural genomics. One extreme answer was: Fewer than 3000! This number was based on the observation that the current PDB consists of about 2600 sequence-unique families, which allow inferring low-resolution information about structure for about half of the proteins in all the proteomes we analysed. Assuming that a similar number of structures would fill in all unknowns, we need 2600 new structures to fill the white spaces. Another possible answer was: Over 16000 for eukaryotes alone! This number resulted when grouping the fragment clusters for eukaryotes that had more than 50 residues without known structure, membrane or coiled-coil helices, and NORS regions or low-complexity regions (Table 3). How many proteins will have to be added for prokaryotes and archaebacteria? To approach the answer to this question, we will first have to complete our clustering of all known proteomes. Clearly, our estimate puts the ball-park figure substantially higher than what was previously suggested [44]. While Vitkup and colleagues proposed a similar number (17600 for all species), their estimate was valid for a level of modelling accuracy that covers less than 10% of all residues in current proteomes. In contrast, our estimate of 16000-18000 for eukaryotes was valid for an accuracy level at which over 45% of all residues were already covered. Furthermore, we excluded many fragments that were not excluded by Vitkup et al. (NORS, coiled-coil, transmembrane helices, and signal peptides). Nevertheless, our results confirmed the work of Vitkup and colleagues in that structural genomics has a long way to go. If our estimates are correct, the first pilot phase of structural genomics will, at best, pave one fifth of the way by 2005.
Thanks to Dariusz Przybylski (Columbia) and to Volker Eyrich (Columbia) for providing programs. We are grateful to our hard-working wet-lab colleagues from the North-East Structural Genomics Initiative (NESG), in particular to Guy Montelione (Rutgers) for his continued support and optimism. The work of JL and BR was supported by the grants 1-P50-GM62413-01 and RO1-GM63029-01 from the National Institutes of Health. Last, not least, thanks to all those who deposit their experimental data in public databases, to those who maintain these databases, and to those heroes who will make structural genomics come true through their dedication and experiments.
| 1. | Lupas, A., Van Dyke, M. & Stock, J. (1991). Predicting coiled coils from protein sequences. Science, 252, 1162-1164. |
| 2. | Lupas, A. (1996). Prediction and analysis of coiled-coil structures. Methods in Enzymology, 266, 513-525. |
| 3. | Eyrich, V., Martí-Renom, M.A., Przybylski, D., Fiser, A., Pazos, F. et al. (2001). EVA: continuous automatic evaluation of protein structure prediction servers. 2001. |
| 4. | Eyrich, V., Martí-Renom, M.A., Przybylski, D., Fiser, A., Pazos, F. et al. (2001). EVA: continuous automatic evaluation of protein structure prediction servers. Bioinformatics, 17, 1242-1243. |
| 5. | Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N. et al. (2000). The Protein Data Bank. Nucleic Acids Research, 28, 235-242. |
| 6. | Sonnhammer, E. L., Eddy, S. R. & Durbin, R. (1997). Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins: Structure, Function, and Genetics, 28, 405-420. |
| 7. | Bateman, A., Birney, E., Durbin, R., Eddy, S. R., Howe, K. L. et al. (2000). The Pfam protein families database. Nucleic Acids Research, 28, 263-6. |
| 8. | Rost, B. & Sander, C. (1993). Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol., 232, 584-599. |
| 9. | Rost, B. & Sander, C. (1994). Combining evolutionary information and neural networks to predict protein secondary structure. Proteins, 19, 55-72. |
| 10. | Rost, B. (1996). PHD: predicting one-dimensional protein structure by profile based neural networks. Meth. Enzymol., 266, 525-539. |
| 11. | Rost, B. & Sander, C. (1994). Conservation and prediction of solvent accessibility in protein families. Proteins, 20, 216-226. |
| 12. | Rost, B., Casadio, R., Fariselli, P. & Sander, C. (1995). Prediction of helical transmembrane segments at 95% accuracy. Prot. Sci., 4, 521-533. |
| 13. | Rost, B., Casadio, R. & Fariselli, P. (1996). Topology prediction for helical transmembrane proteins at 86% accuracy. Prot. Sci., 5, 1704-1718. |
| 14. | Yang, A. S. & Honig, B. (1999). Sequence to structure alignment in comparative modeling using PrISM. Proteins: Structure, Function, and Genetics, Suppl, 66-72. |
| 15. | Altschul, S., Madden, T., Shaffer, A., Zhang, J., Zhang, Z. et al. (1997). Gapped Blast and PSI-Blast: a new generation of protein database search programs. Nucl. Acids Res., 25, 3389-3402. |
| 16. | Wootton, J. C. & Federhen, S. (1996). Analysis of compositionally biased regions in sequence databases. Meth. Enzymol., 266, 554-571. |
| 17. | Nielsen, H., Engelbrecht, J., Brunak, S. & von Heijne, G. (1997). Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Prot. Engin., 10, 1-6. |
| 18. | Bairoch, A. & Apweiler, R. (2000). The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucl. Acids Res., 28, 45-48. |
| 19. | NIGMS (2001). Structural genomics initiatives. 2001. |
| 20. | Lima, C. D., Klein, M. G. & Hendrickson, W. A. (1997). Structure-based analysis of catalysis and substrate definition in the HIT protein family. Science, 278, 286-290. |
| 21. | Gaasterland, T. (1998). Structural genomics: bioinformatics in the driver's seat. Nat. Biotechnol., 16, 625-627. |
| 22. | Gaasterland, T. (1998). Structural genomics taking shape. TIGS, 14, 135. |
| 23. | Rost, B. (1998). Marrying structure and genomics. Structure, 6, 259-263. |
| 24. | Burley, S. K., Almo, S. C., Bonanno, J. B., Capel, M., Chance, M. R. et al. (1999). Structural genomics: beyond the human genome project. Nat. Gen., 23, 151-157. |
| 25. | Teichmann, S. A., Chothia, C. & Gerstein, M. (1999). Advances in structural genomics. Current Opinion in Structural Biology, 9, 390-399. |
| 26. | Blundell, T. L. & Mizuguchi, K. (2000). Structural genomics: an overview. Prog Biophys Mol Biol, 73, 289-295. |
| 27. | Christendat, D., Yee, A., Dharamsi, A., Kluger, Y., Savchenko, A. et al. (2000). Structural proteomics of an archaeon. Nature Structural Biology, 7, 903-9. |
| 28. | Moult, J. & Melamud, E. (2000). From fold to function. Current Opinion in Structural Biology, 10, 384-389. |
| 29. | Shapiro, L. & Harris, T. (2000). Finding function through structural genomics. Curr. Opin. Biotech., 11, 31-35. |
| 30. | Thornton, J. (2001). Structural genomics takes off. Trends in Biochemical Sciences, 26, 88-89. |
| 31. | Orengo, C. A., Michie, A. D., Jones, D. T., Swindells, M. B. & Thornton, J. M. (1997). CATH - A hierarchic classification of protein domain structures. Structure, 5, 1093-1108. |
| 32. | Holm, L. & Sander, C. (1999). Protein folds and families: sequence and structure alignments. Nucl. Acids Res., 27, 244-247. |
| 33. | Schneider, R., de Daruvar, A. & Sander, C. (1997). The HSSP database of protein structure-sequence alignments. Nucl. Acids Res., 25, 226-230. |
| 34. | Lo Conte, L., Ailey, B., Hubbard, T. J., Brenner, S. E., Murzin, A. G. et al. (2000). SCOP: a structural classification of proteins database. Nucl. Acids Res., 28, 257-259. |
| 35. | Rost, B. (1997). Protein structures sustain evolutionary drift. Folding & Design, 2, S19-S24. |
| 36. | Rost, B. (1999). Twilight zone of protein sequence alignments. Protein Engineering, 12, 85-94. |
| 37. | Yang, A. S. & Honig, B. (2000). An integrated approach to the analysis and modeling of protein sequences and structures. II. On the relationship between sequence and structural similarity for proteins that are not obviously related in sequence. Journal of Molecular Biology, 301, 679-689. |
| 38. | Sander, C. & Schneider, R. (1991). Database of homology-derived structures and the structural meaning of sequence alignment. Proteins: Structure, Function, and Genetics, 9, 56-68. |
| 39. | Abagyan, R. A. & Batalov, S. (1997). Do aligned sequences share the same fold? Journal of Molecular Biology, 273, 355-368. |
| 40. | Park, J., Teichmann, S. A., Hubbard, T. & Chothia, C. (1997). Intermediate sequences increase the detection of distant sequence homologies. Journal of Molecular Biology, 273, 349-354. |
| 41. | Brenner, S. E., Chothia, C. & Hubbard, T. J. P. (1998). Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proceedings of the National Academy of Sciences, 95, 6073-6078. |
| 42. | Park, J., Karplus, K., Barrett, C., Hughey, R., Haussler, D. et al. (1998). Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. Journal of Molecular Biology, 284, 1201-1210. |
| 43. | Muller, A., MacCallum, R. M. & Sternberg, M. J. (1999). Benchmarking PSI-BLAST in genome annotation. Journal of Molecular Biology, 293, 1257-1271. |
| 44. | Vitkup, D., Melamud, E., Moult, J. & Sander, C. (2001). Completeness in structural genomics. Nature Structural Biology, 8, 559-566. |
| 45. | Kim, S.-H. (2001). Berkeley Structural Genomics Center. |
| 46. | Wilson, I. A. (2001). Joint Center for Structural Genomics. |
| 47. | Terwilliger, T. (2001). Mycobacterium tuberculosis (TB) Structural Genomics Consortium. |
| 48. | Wang, B.-C. (2001). Southeast Collaboratory for Structural Genomics. |
| 49. | YSG (2001). Yeast Structural genomics. |
| 50. | Yokoyama, S. & Kuramitsu, S. (2001). Structurome Research group, RIKEN. |
| 51. | Umbach, P. (2001). Protein Structure Factory. |
| 52. | Montelione, G. T. (2001). Northeast Structural Genomics Consortium. |
| 53. | Joachimiak, A. (2001). Midwest Center for Structural Genomics. |
| 54. | Burley, S. K. (2001). New York Structural Genomics Research Consortium. |
| 55. | Ponting, C. P., Schultz, J., Milpetz, F. & Bork, P. (1999). SMART: identification and annotation of domains from signalling and extracellular protein sequences. Nucl. Acids Res., 27, 229-32. |
| 56. | Tatusov, R. L., Koonin, E. V. & Lipman, D. J. (1997). A genomic perspective on protein families. Science, 278, 631-7. |
| 57. | Tatusov, R. L., Galperin, M. Y., Natale, D. A. & Koonin, E. V. (2000). The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res, 28, 33-36. |
| 58. | Yona, G., Linial, N., Tishby, N. & Linial, M. (1998). A map of the protein space--an automatic hierarchical classification of all protein sequences. Ismb, 6, 212-221. |
| 59. | Yona, G., Linial, N. & Linial, M. (1999). ProtoMap: automatic classification of protein sequences, a hierarchy of protein families, and local maps of the protein space. Proteins: Structure, Function, and Genetics, 37, 360-378. |
| 60. | Linial, M. & Yona, G. (2000). Methodologies for target selection in structural genomics. Progress in Biophysics and Molecular Biology, 73, 297-320. |
| 61. | Yona, G., Linial, N. & Linial, M. (2000). ProtoMap: automatic classification of protein sequences and hierarchy of protein families. Nucleic Acids Research, 28, 49-55. |
| 62. | Bairoch, A. & Apweiler, R. (2000). The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res, 28, 45-48. |
| 63. | Corpet, F., Servant, F., Gouzy, J. & Kahn, D. (2000). ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucl. Acids Res., 28, 267-9. |
| 64. | Marcotte, E. M., Pellegrini, M., Thompson, M. J., Yeates, T. O. & Eisenberg, D. (1999). A combined algorithm for genome-wide prediction of protein function. Nature, 402, 83-86. |
| 65. | Enright, A. J. & Ouzounis, C. A. (2000). GeneRAGE: a robust algorithm for sequence clustering and domain detection. Bioinformatics, 16, 451-457. |
| 66. | Bairoch, A., Bucher, P. & Hofmann, K. (1997). The PROSITE database, its status in 1997. Nucl. Acids Res., 25, 217-221. |
| 67. | Hofmann, K., Bucher, P., Falquet, L. & Bairoch, A. (1999). The PROSITE database, its status in 1999. Nucl. Acids Res., 27, 215-219. |
| 68. | Barker, W. C., Garavelli, J. S., Huang, H., McGarvey, P. B., Orcutt, B. C. et al. (2000). The protein information resource (PIR). Nucleic Acids Research, 28, 41-4. |
| 69. | Heger, A. & Holm, L. (2000). Towards a covering set of protein family profiles. Progress in Biophysics and Molecular Biology, 73, 321-337. |
| 70. | Liu, J. & Rost, B. (2001). Comparing function and structure between entire proteomes. Protein Science, 10, 1970-1979. |
| 71. | Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A., Kirkness, E. F. et al. (1995). Whole-genome random sequencing and assembly of Haemophilus influenzae Rd [see comments]. Science, 269, 496-512. |
| 72. | Fraser, C. M., Gocayne, J. D., White, O., Adams, M. D., Clayton, R. A. et al. (1995). The minimal gene complement of Mycoplasma genitalium [see comments]. Science, 270, 397-403. |
| 73. | Bult, C. J., White, O., Olsen, G. J., Zhou, L., Fleischmann, R. D. et al. (1996). Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii [see comments]. Science, 273, 1058-73. |
| 74. | Himmelreich, R., Hilbert, H., Plagens, H., Pirkl, E., Li, B. C. et al. (1996). Complete sequence analysis of the genome of the bacterium Mycoplasma pneumoniae. Nucleic Acids Res, 24, 4420-49. |
| 75. | Kaneko, T., Sato, S., Kotani, H., Tanaka, A., Asamizu, E. et al. (1996). Sequence analysis of the genome of the unicellular cyanobacterium Synechocystis sp. strain PCC6803. II. Sequence determination of the entire genome and assignment of potential protein-coding regions. DNA Res, 3, 109-36. |
| 76. | (1997). The yeast genome directory. Nature, 387, 5. |
| 77. | Blattner, F. R., Plunkett, G., 3rd, Bloch, C. A., Perna, N. T., Burland, V. et al. (1997). The complete genome sequence of Escherichia coli K-12 [comment] [see comments]. Science, 277, 1453-74. |
| 78. | Fraser, C. M., Casjens, S., Huang, W. M., Sutton, G. G., Clayton, R. et al. (1997). Genomic sequence of a Lyme disease spirochaete, Borrelia burgdorferi [see comments]. Nature, 390, 580-6. |
| 79. | Klenk, H. P., Clayton, R. A., Tomb, J. F., White, O., Nelson, K. E. et al. (1997). The complete genome sequence of the hyperthermophilic, sulphate-reducing archaeon Archaeoglobus fulgidus [published erratum appears in Nature 1998 Jul 2;394(6688):101]. Nature, 390, 364-70. |
| 80. | Kunst, F., Ogasawara, N., Moszer, I., Albertini, A. M., Alloni, G. et al. (1997). The complete genome sequence of the gram-positive bacterium Bacillus subtilis [see comments]. Nature, 390, 249-56. |
| 81. | Smith, D. R., Doucette-Stamm, L. A., Deloughery, C., Lee, H., Dubois, J. et al. (1997). Complete genome sequence of Methanobacterium thermoautotrophicum deltaH: functional analysis and comparative genomics. J Bacteriol, 179, 7135-55. |
| 82. | Tomb, J. F., White, O., Kerlavage, A. R., Clayton, R. A., Sutton, G. G. et al. (1997). The complete genome sequence of the gastric pathogen Helicobacter pylori [see comments] [published erratum appears in Nature 1997 Sep 25;389(6649):412]. Nature, 388, 539-47. |
| 83. | (1998). Genome sequence of the nematode C. elegans: a platform for investigating biology. The C. elegans Sequencing Consortium [published errata appear in Science 1999 Jan 1;283(5398):35 and 1999 Mar 26;283(5410):2103 and 1999 Sep 3;285(5433):1493]. Science, 282, 2012-8. |
| 84. | Andersson, S. G., Zomorodipour, A., Andersson, J. O., Sicheritz-Ponten, T., Alsmark, U. C. et al. (1998). The genome sequence of Rickettsia prowazekii and the origin of mitochondria [see comments]. Nature, 396, 133-40. |
| 85. | Cole, S. T., Brosch, R., Parkhill, J., Garnier, T., Churcher, C. et al. (1998). Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence [see comments] [published erratum appears in Nature 1998 Nov 12;396(6707):190]. Nature, 393, 537-44. |
| 86. | Deckert, G., Warren, P. V., Gaasterland, T., Young, W. G., Lenox, A. L. et al. (1998). The complete genome of the hyperthermophilic bacterium Aquifex aeolicus. Nature, 392, 353-8. |
| 87. | Fraser, C. M., Norris, S. J., Weinstock, G. M., White, O., Sutton, G. G. et al. (1998). Complete genome sequence of Treponema pallidum, the syphilis spirochete [see comments]. Science, 281, 375-88. |
| 88. | Kawarabayasi, Y., Sawada, M., Horikawa, H., Haikawa, Y., Hino, Y. et al. (1998). Complete sequence and gene organization of the genome of a hyper-thermophilic archaebacterium, Pyrococcus horikoshii OT3. DNA Res, 5, 55-76. |