| Title: | Natively unstructured loops differ from other loops |
| Author: | Avner Schlessinger , Jinfeng Liu & Burkhard Rost |
| Quote: | PLoS Computational Biology, 2007, 3:e140 |
Natively unstructured loops differ from other loops
| 1 | Dept. of Biochemistry and Molecular Biophysics, Columbia University, 630 West 168th Street, New York, NY 10032, USA |
| 2 | Columbia University Center for Computational Biology and Bioinformatics (C2B2), 1130 St. Nicholas Avenue Rm 802, New York, NY 10032, USA |
| 3 | North East Structural Genomics Consortium (NESG), Department of Biochemistry and Molecular Biophysics, Columbia University, 630 West 168th Street, New York, NY 10032, USA |
| 4 | Dept. of Pharmacology, Columbia Univ., 630 West 168th Street, New York, NY 10032, USA |
| * | Corresponding author: schles@rostlab.org URL http://www.rostlab.org/ Tel: +1-212-851-4669, fax: +1-212-305-7932 |
This article is published in (PLoS Computational Biology, 2007 and pages) ©copyright PLoS (2007). PLoS is the only authorized source. All copying of this article including placing on another website requires the written permission of the copyright owner.
Natively unstructured or disordered protein regions may increase the functional complexity of an organism; they are particularly abundant in eukaryotes and often evade structure determination. Many computational methods predict unstructured regions by training on outliers in otherwise well-ordered structures. Here, we introduce an approach that uses a neural network in a very different and novel way. We hypothesize that very long contiguous segments with non-regular secondary structure (NORS regions) differ significantly from regular, well-structured loops and that a method detecting such features could predict natively unstructured regions. Training our new method, NORSnet, on predicted information rather than on experimental data yielded three major advantages: it removed the overlap between testing and training, it systematically covered entire proteomes, and it explicitly focused on one particular aspect of unstructured regions with a simple structural interpretation, namely that they are loops. Our hypothesis was correct: well-structured and unstructured loops differ so substantially that NORSnet succeeded in their distinction. Benchmarks on previously used and new experimental data of unstructured regions revealed that NORSnet performed very well. Although it was not the best single prediction method, NORSnet was sufficiently accurate to flag unstructured regions in proteins that were previously not annotated. In one application, NORSnet revealed previously undetected unstructured regions in putative targets for structural genomics and may thereby contribute to increasing structural coverage of large eukaryotic families. NORSnet found unstructured regions more often in domain boundaries than expected at random. In another application, we estimated that 50-70% of all worm proteins observed to have over 7 protein-protein interaction partners have unstructured regions. 
The comparative analysis between NORSnet and DISOPRED2 suggested that long unstructured loops are a major part of unstructured regions in molecular networks.
The details of protein structures are important for function. Regions that do not adopt any regular structure in isolation (natively unstructured or disordered regions) initially appeared as a curious exception to this structure-function paradigm. It has become increasingly clear that unstructured regions are fundamental to many roles and that they are particularly important for multicellular organisms. Structural biology is just beginning to apprehend the stunning diversity of these roles. Here, we focused on unstructured regions dominated by a particular type of loop, namely the natively unstructured one. We developed a method that succeeded in the distinction between well-structured and natively unstructured loops. For the development we did not use any experimental data for unstructured regions; when tested on experimental data the method performed surprisingly well. Due to its different premises, the method captured very different aspects of unstructured regions than other methods that we tested. We applied the new method to two different problems. The first was the identification of proteins that may be difficult targets for structure determination. The second was the identification of worm proteins that have many interaction partners (>7) and unstructured regions. Surprisingly, we found unstructured regions of the loopy type in over 50% of all the promiscuous worm proteins.
Key words: protein disorder prediction; protein structure prediction; multiple alignments; secondary structure prediction; disordered regions; protein function; natively unstructured regions.
| 1D structure | one-dimensional (e.g. sequence or string of residue secondary structure or numbers for residue solvent accessibility) |
| 3D structure | three-dimensional structure, i.e. co-ordinates of all residue atoms in a protein |
| Ångström (Å) | =0.1 nm |
| CESG | Center for Eukaryotic Structural Genomics |
| DISOPRED2 | machine learning method for the identification of disorder as defined by missing residues from X-ray structures |
| FoldIndex | a method that predicts whether a given protein will fold based on hydrophobicity/net charge |
| NESG | NorthEast Structural Genomics consortium |
| NMR | nuclear magnetic resonance spectroscopy |
| NORS | long regions with no regular secondary structure (almost no helix and no strand predicted) |
| NORSnet | neural network based prediction method of short NORS-like unstructured regions described here |
| PDB | the Protein Data Bank with full coordinates of 3D structures |
| PROFacc | profile-based neural network prediction of solvent accessibility |
| PROFsec | profile-based neural network prediction of secondary structure. |
Unstructured regions define a new heterogeneous structural reality. One central paradigm of structural biology is that the intricate details of three-dimensional (3D) protein structures determine protein function [1, 2] . Over the last years, many studies have shown that the lack of a unique, native 3D structure in physiological conditions can often be crucial for function [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21] . Such proteins are variously called disordered, unfolded, natively unstructured, or intrinsically unstructured proteins. A typical example is a protein that adopts a unique 3D structure only upon binding to an interaction partner and thereby performs its biochemical function [8, 22, 23, 16] . The better our experimental and computational means of identifying such proteins, the more we realize that they come in a great variety: some adopt regular secondary structure (helix or strand) upon binding, some remain loopy. Some proteins are almost entirely unstructured, and others have only short unstructured regions. The more we can recognize short unstructured regions, the more we realize that the term "unstructured protein" would be misleading, as most unstructured proteins have relatively short unstructured regions. There is no single way to define unstructured regions. Here, we define an unstructured region as one that lacks a unique 3D structure according to one of the following experimental techniques: circular dichroism spectroscopy (CD), nuclear magnetic resonance spectroscopy (NMR), X-ray crystallography, or proteolysis [5, 11, 24] . Thanks to the outstanding data collection by the Dunker group, we could also describe this as regions that are the minimal common denominator between all proteins collected in DisProt [25] . However, as we learned from prediction methods, DisProt and similar databases cover only a small fraction of all unstructured regions ( Fig. 1 ), and as we learned from recent experiments [26, 27, 28] , many unstructured regions are covered neither by these databases nor by existing prediction methods.
Fig. 1 : Putative "map" of unstructured regions. Proteins with unstructured regions are likely to occupy large portions of sequence space [9, 10, 11, 15] as sketched by the light-grayed inner rectangle. The space of all proteins with unstructured regions is likely to be considerably larger than what today's experimental techniques capture. The rounded, darker gray rectangle labeled "experiment" sketches proteins for which some experimental method annotated natively unstructured regions. While most NORS regions (predicted long loops, striped gray ellipse) are likely to be natively unstructured, many unstructured regions are not NORS, i.e. contain helices and strands even in their native form. Previous methods for the prediction of unstructured regions (left lens) are optimized to reflect today's experiments. In contrast, the method introduced here (NORSnet, right lens) is developed based on predictions. This is an advantage because it avoids the bias of today's experimental techniques in a field that is just beginning to grasp its own dimensions, and it is a disadvantage because performance on today's data sets appears somewhat limited.
Unstructured regions can be defined and recognized in many ways. Methods that predict unstructured regions from sequence are mushrooming. Fast methods identify regions with high net charge and low hydrophobicity [4, 29] , monitor the differences in amino acid propensities between unstructured and other regions (GlobPlot) [30] , or identify motifs associated with regions depleted of regular structure [31, 32] . Most methods are based on a different definition of disordered regions that was introduced by the Dunker group [33] : residues for which X-ray structures do not have coordinates are considered as disordered. Methods based on this concept used neural networks [34, 13, 33, 35, 36] or support vector machines [15] . The meetings for the Critical Assessment of Structure Prediction (CASP) have exclusively assessed disorder predictions on subsets of the "non-coordinate" data [37, 38] . The major drawback of this approach is that the PDB is biased towards proteins for which structures can be determined; natively unstructured proteins are underrepresented in the PDB [10, 15, 16, 25] . This may be one reason why most prediction methods tested by Oldfield et al. [26, 27] missed a substantial number of the proteins with unstructured regions identified in a large-scale NMR study that was a spin-off from structural genomics. Other sequence features are also predictive of disorder. For example, functionally flexible regions (FFRs) are identified from known structures through molecular dynamics simulation and can be generalized through machine learning. The method Wiggle provides predictions that overlap with unstructured regions even though it is focused on a different aspect of protein flexibility [39] .
Regions with no regular secondary structure provide an alternative. Our group identified long regions with no regular secondary structure (NORS), which are stretches of 70 or more sequence-consecutive surface residues with few or no predicted helices and strands [10] . NORS regions showed considerable overlap with proteins predicted to have long unstructured regions by various disorder predictors. NORS regions are over-represented in eukaryotes (over five times more than in prokaryotes), over-represented in regulatory and interacting proteins [10, 40] , and share biophysical properties with unstructured regions. Additionally, when natively unstructured regions are co-crystallized with their binding partner, they are still enriched in non-regular structure compared to globular proteins: ~45% and ~31% of the residues are in coils, respectively [22] . Somewhat surprisingly, the method for predicting regular secondary structure in NORS regions, PROFsec [41, 42, 43] , accurately predicts the secondary structure state in unstructured regions [22] .
NORS regions capture only one particular aspect of unstructured regions ( Fig. 1 ). The major advantages of our focus on NORS regions are that this definition implies a simple structural interpretation, and that we can reliably identify thousands of such regions by scanning entire organisms. The thresholds for the minimal length (70 residues) and for the definition of "largely loop" were optimized in order to minimize the identification of any of these stretches in the PDB [10] . This procedure does not explicitly use any information about a protein other than its prediction of secondary structure and solvent accessibility. Thus, it mainly identifies extreme cases, e.g., highly exposed and long loop regions. Since many unstructured regions are shorter, one of our objectives was to capture much shorter NORS-like regions while ascertaining that we would not confuse long, well-structured loops with unstructured regions. One disadvantage of our focus on NORS was that some unstructured regions contain secondary structure elements (helix or strand) [22] , i.e. not all unstructured regions are captured by NORS ( Fig. 1 ).
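The NORS criterion described above can be illustrated as a sliding-window scan over a predicted secondary structure string. This is only a minimal sketch and not the NORSp implementation: the function name `find_nors_like`, the three-state alphabet (H = helix, E = strand, anything else = loop), and the 12% cutoff for "largely loop" are assumptions made for illustration. Only the minimal length of 70 residues is taken from the text, and the solvent accessibility criterion is left out.

```python
def find_nors_like(ss, min_len=70, max_regular_frac=0.12):
    """Return merged (start, end) half-open spans of 'largely loop' windows.

    ss: predicted secondary structure, one character per residue
        ('H' helix, 'E' strand, anything else counted as loop).
    min_len: window length; 70 is the minimal NORS length quoted in the text.
    max_regular_frac: ASSUMED cutoff for 'largely loop' (not given in the text).
    """
    n = len(ss)
    if n < min_len:
        return []
    regular = [1 if c in "HE" else 0 for c in ss]
    # Running count of helix/strand residues inside the current window.
    count = sum(regular[:min_len])
    spans = []
    for start in range(n - min_len + 1):
        if start > 0:
            # Slide the window by one residue: add the new one, drop the old one.
            count += regular[start + min_len - 1] - regular[start - 1]
        if count / min_len <= max_regular_frac:
            end = start + min_len
            if spans and start <= spans[-1][1]:
                spans[-1] = (spans[-1][0], end)  # merge with the previous span
            else:
                spans.append((start, end))
    return spans
```

Lowering `min_len` towards 30 in such a scan would admit shorter loop-rich stretches, but it would also start to flag long, well-structured loops, which is the confusion the text sets out to avoid.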
Eukaryotic disordered regions challenge structural genomics. One goal of structural genomics is the determination of a 3D structure representative for every protein family [44, 45] . Unstructured regions have not impeded structural genomics so far because almost all consortia have focused on bacterial proteins in order to increase the structure-to-clone ratio. However, consortia that focus on eukaryotes, such as the Northeast Structural Genomics (NESG) Consortium or the Center for Eukaryotic Structural Genomics (CESG), have to carefully exclude such problematic targets [46, 47] . Over 10,000 proteins have been cloned and over 3,000 purified by NESG. Many of these did not adopt regular structure, possibly because they have unstructured regions that were not filtered out by our original filter that discarded targets containing NORS regions [40] . To speed up structure determination we need to increase the sensitivity in identifying unstructured regions [26] , i.e. one goal of the development was to end up with a method that would be complementary to existing methods for the identification of unstructured regions.
Our first hypothesis was that NORS regions share commonalities that distinguish such long unstructured loops from well-structured loops. If so, we should be able to somehow distinguish between the two types of loops at least in the sense that all loops predicted to be unstructured by our method ought to have different average features than other loops. We assumed that the neural network will pick up local correlations in amino-acid preferences for the different structural states. Our second hypothesis was that what distinguishes NORS regions from regular loops is exactly what makes regions become unstructured. If so, our method for the identification of NORS regions would also accurately predict unstructured regions.
Here, we describe NORSnet, a new method that extends our NORS concept to also detect shorter (30-70 residues) NORS-like regions. The method was developed without ever using proteins with experimentally known unstructured regions. Instead, it was optimized to distinguish predicted NORS from all other regions. This unique approach, unprecedented in any machine learning method competing in a real-life application with other methods, has three important advantages. First, the data used for development and testing do not overlap. Second, because NORS regions were predicted from sequence, we could identify thousands of such regions. Although our data set was "dirty" in the sense that it contained many false negatives (all residues in PDB were considered as well-structured during training) as well as some false positives (incorrect NORS predictions), the positives (unstructured regions) sampled entirely sequenced organisms without any major bias with respect to this particular flavor of unstructured regions. Thereby, we identified unstructured regions that were missed by methods trained on more specialized data sets. The third advantage was that the resulting method explicitly focused on one feature of unstructured regions with a structural interpretation, namely that they are loops. Although we could have assessed NORSnet on any existing data set due to the lack of overlap, we added a new set with experimental data about unstructured regions different from existing data. Note that both sets differed from each other as well as from the set used for development.
Our three major results confirmed our hypotheses: (1) training on predictions succeeded in producing a powerful prediction method, (2) long loops are a major component of what is picked up by existing methods predicting unstructured regions, and (3) well-ordered and unstructured loops differ. In conjunction with existing methods, the one that we introduce here will allow a focus on particular structural aspects.
Accurate distinction between unstructured and regular loop regions
We trained our system on NORS (no regular secondary structure) regions that had been predicted by our previous high-accuracy/low-coverage method [10, 40] for the identification of very long regions depleted of predicted helices and strands (NORSp, Methods). Technically, the task was to separate between all residues predicted to be in a NORS region and all residues in the PDB. As we used neural networks for this task, the typical assessment of accuracy usually involves a cross-validation experiment. For the first time in our work, we did not do this. In fact, we completely ignored the performance of the network on the task it optimized. Our hypothesis simply was that the only aspect that consistently separates extreme NORS regions from all residues in the PDB are the building blocks for a particular type of unstructured regions, namely the NORS-like loopy ones. Therefore, we measured performance on rather different data sets and separation tasks.
First, we established success by predicting well-structured loops and NORS-like loops for DisProt, which consists of proteins with experimentally characterized unstructured regions. 88% of the residues predicted by NORSnet were also predicted to be loops by PROFsec while only 51% of the residues predicted as loops in DisProt also appeared NORS-like. In other words, most regions identified by NORSnet appeared to be in loops. Conversely, many loops were not predicted by NORSnet. Since residues in loops were identified through prediction, this difference may have been caused by prediction mistakes. To rule this out, we collected a set of 45 sequence-unique proteins that had been added to the PDB after we had completed developing our method (September 2005 to June 2006). We found that NORSnet classified only 1% of loop residues (DSSP states TSL) [48] as natively unstructured regions. In other words, NORSnet largely succeeded for these new proteins. In fact, it predicted only one region in these structures to be unstructured, namely a stretch in a viral protein of 52 residues (PDB identifier 2c55_A [49] ) the NMR structure of which indicated depletion of regular secondary structure. This protein has been shown to undergo conformational changes [49] , suggesting that our method correctly identified it as unstructured.
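The two overlap percentages quoted above are conditional fractions over per-residue predictions, and the asymmetry between them is the point of the comparison. The sketch below shows how such fractions could be computed from two boolean masks. The function names are hypothetical, and the treatment of DSSP states T, S, and L as "loop" follows the text.

```python
def dssp_loop_mask(dssp_states, loop_states="TSL"):
    """Mark residues whose DSSP state is one of T, S, L (loop, as in the text)."""
    return [s in loop_states for s in dssp_states]


def overlap_fractions(pred_unstructured, pred_loop):
    """Return (fraction of unstructured residues that are loop,
               fraction of loop residues that are unstructured)."""
    if len(pred_unstructured) != len(pred_loop):
        raise ValueError("masks must have equal length")
    both = sum(1 for u, l in zip(pred_unstructured, pred_loop) if u and l)
    n_unstructured = sum(1 for u in pred_unstructured if u)
    n_loop = sum(1 for l in pred_loop if l)
    # Guard against empty classes so the function never divides by zero.
    frac_u_in_loop = both / n_unstructured if n_unstructured else 0.0
    frac_loop_in_u = both / n_loop if n_loop else 0.0
    return frac_u_in_loop, frac_loop_in_u
```

A high first fraction combined with a low second fraction corresponds to the situation described in the text: most predicted unstructured residues are in loops, but many loops are not predicted to be unstructured.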
Very long NORS regions differ statistically from regularly structured or well-ordered loops [10] . In general, unstructured regions that are not NORS-like tend to be more loopy than well-structured regions [22] . Here, we showed that our ability to distinguish between well-ordered and unstructured loops also extended to much shorter loops. Medium-length (30-70 residues) unstructured loops differed from well-structured loops ( Fig. 2 ).
Fig. 2 : Regular, flexible, and predicted-to-be unstructured loops differed. We compared the amino acid compositions between four different subsets representing four types of "loops" (non-helix/non-strand): loops from regular, well-ordered structures, i.e. from proteins without natively unstructured regions (states TSL from DSSP; in light blue); unstructured loops as predicted by NORSnet (in green); "flexible loops" from regular structures (TSL states with normalized B-factors ≥1 [86] , in red); and unstructured regions as predicted by DISOPRED2 (in orange). The sign of the bar corresponds to over-representation (positive) or under-representation (negative) of an amino acid in a subset with respect to the PDB. The NORS and DISOPRED2 residue subsets were taken from the worm genome (from the IntAct database [71] ) and were predicted to be unstructured by NORSnet and DISOPRED2. Flexible loops were enriched in amino acids with net charges such as lysine and glutamate (as described before [30, 50] ). Predicted unstructured regions by NORSnet, however, differed in their composition from regular loops, flexible loops and from any type of disorder that has been described previously (data not shown) [50, 54] . Cysteines were not over-abundant in the unstructured regions predicted by DISOPRED2. Overall, these data suggested that NORSnet captured something other than just "loop" and other than what is captured by methods such as DISOPRED2.
NORSnet precisely distinguished between unstructured and well-structured loops. Although the amino acid composition of unstructured loops was similar to that in long disordered regions [50] , it was unique ( Fig. 2 ). For instance, the regions identified by our method contained significantly more cysteines than other PDB proteins and, within these, more than the set of residues unresolved in electron density maps. Thus, methods trained on unresolved residues, such as DISOPRED2, are likely to miss these regions. Furthermore, methods utilizing pairwise energy potentials, such as IUPred, to predict unstructured regions are also likely to miss these regions as many cysteines typically coincide with many paired cysteine-bonds that significantly contribute to protein stability [51, 52] .
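The composition comparison discussed here (over- or under-representation of each amino acid in a subset relative to the PDB) can be sketched as follows. The exact normalization used for Fig. 2 is not stated in this section, so the difference in percentage points is an assumption, and the function names are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seqs):
    """Fraction of each of the 20 standard amino acids over all sequences."""
    counts = Counter(aa for s in seqs for aa in s if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids found")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def composition_difference(subset_seqs, reference_seqs):
    """Percentage-point difference subset minus reference (ASSUMED measure).

    Positive values mean over-representation in the subset,
    negative values mean under-representation.
    """
    sub = composition(subset_seqs)
    ref = composition(reference_seqs)
    return {aa: 100.0 * (sub[aa] - ref[aa]) for aa in AMINO_ACIDS}
```

With residues predicted as unstructured as the subset and all PDB residues as the reference, a positive value for `"C"` would correspond to the cysteine enrichment described in the text.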
Proteins with unstructured regions accurately identified
About 30-60% of all eukaryotic proteins have been estimated to contain unstructured regions [53, 15] . However, DisProt [25] , the largest resource of experimentally verified unstructured regions, contains only a few hundred eukaryotic proteins and thus covers a small fraction of sequence space ( Fig. 1 ). Moreover, this small fraction is not representative, as many unstructured regions described experimentally are missing from existing databases and are not identified by prediction methods [26] . NORSnet attempted to solve both problems by sampling sequence space exhaustively (trained on all positives from entirely sequenced organisms) and focusing on unstructured loops.
In order to assess the accuracy of NORSnet and estimate to what extent unstructured loops dominate our current identification of unstructured regions, we investigated two different data sets. The first was built around the DisProt database used previously in the literature, and the second originated from careful NMR measurements and has not been used in many previous analyses.
DisProt data set. The first set included proteins with unstructured regions from DisProt as positives and 173 PDB structures from the EVA server as negatives (Methods). NORSnet correctly identified half of the DisProt proteins without false positives ( Fig. 3 A). DISOPRED2 [15] was ranked as one of the best three methods for predicting residues that are missing in electron density maps from X-ray crystallography at CASP6 [38] and CASP7 (L. Bordoli, unpublished). Many other studies corroborated the leading role of DISOPRED2 [52, 38, 36, 54, 55] . Overall, NORSnet performed almost on par with DISOPRED2 for the DisProt data set ( Fig. 3 A). Simply taking the average over the outputs of DISOPRED2 and NORSnet (DISOPRED2+NORSnet) outperformed both individually. The improvement was particularly important for the realm of very high accuracy ( Fig. 3 A). IUPred predicts unstructured regions based on a statistical potential optimized for this purpose [52, 56] . In our hands, IUPred clearly and consistently outperformed the other methods tested, including the averaged DISOPRED2/NORSnet output ( Fig. 3 A). IUPred is optimized to identify all unstructured regions in DisProt [56, 52] , but it cannot distinguish between unstructured regions dominated by loops and those dominated by regular secondary structure (as are often found in unstructured regions [22] ).
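The combined method mentioned above is an arithmetic average of two per-residue outputs. A minimal sketch follows, assuming both predictors report scores on the same 0 to 1 scale. The protein-level rule (a run of at least `min_len` consecutive residues at or above a threshold) is a hypothetical stand-in, since the actual criterion is described in Methods and not in this section.

```python
def average_scores(scores_a, scores_b):
    """Per-residue arithmetic mean of two prediction profiles."""
    if len(scores_a) != len(scores_b):
        raise ValueError("profiles must cover the same residues")
    return [(a + b) / 2.0 for a, b in zip(scores_a, scores_b)]


def has_unstructured_region(scores, threshold, min_len=30):
    """True if at least min_len consecutive residues score >= threshold.

    min_len=30 is an ASSUMPTION motivated by the 30-70 residue range
    mentioned in the text; it is not the published decision rule.
    """
    run = 0
    for s in scores:
        run = run + 1 if s >= threshold else 0
        if run >= min_len:
            return True
    return False
```

Averaging is the simplest possible combination; it can only help when the two profiles err in different places, which is consistent with the complementarity reported for DISOPRED2 and NORSnet.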
NORSnet predictions were not superior to those from DISOPRED2. However, the performance of these two methods was surprisingly similar despite the fact that NORSnet was not trained on a single experimentally verified unstructured region. Did the similarity in performance indicate that both methods picked up the same signal, i.e. that DISOPRED2 largely captured unstructured loops?
Fig. 3 : Predictions for DisProt. (A) ROC-like curve for NORSnet (green), DISOPRED2 (orange) and their combination (through arithmetic average, dark gray). While the performance of NORSnet and DISOPRED2 was similar, the combined method seemed to outperform both methods. In particular, at accuracy=100% (inset) the combined method covers significantly more sequences than either method individually. IUPred (purple) outperformed all other methods on this data set. Note that IUPred was optimized on a set similar to the one used in this study. In contrast, NORSnet and DISOPRED2 were optimized on different sets defining disorder differently. (B) Venn diagram of overlap between very accurate predictions by NORSnet, DISOPRED2 and the combined method. The numbers in the circles are mutually exclusive; for instance, 2 proteins were identified only by DISOPRED2 to have an unstructured region and 17 proteins were identified by both NORSnet and the combined method to have an unstructured region.
If two prediction methods are based on very different information, their combination typically improves performance over either of them [57, 54] . A more explicit way to demonstrate that methods focus on different aspects is the analysis of their predictions by Venn diagrams. We picked points for which each of the three methods (DISOPRED2, NORSnet, DISOPRED2+NORSnet) yielded 100% accuracy and compared the true positives predicted at those thresholds. DISOPRED2 and NORSnet identified the same 73 proteins, but each correctly identified proteins that the other missed ( Fig. 3 B). This agreement supported our initial hypothesis that many unstructured regions are loopy (considerable overlap in true positives). But the most important result was that the two methods complemented each other. At the same 100% accuracy threshold, the combined method (DISOPRED2+NORSnet) identified more proteins than either of the two individual methods and missed only two proteins that DISOPRED2 correctly identified. Although not surprising given the differences in training set and underlying optimizations, this result highlighted the difference in the types of unstructured regions identified.
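The Venn analysis described above can be sketched in two steps: collect, for each method, the positives that score above every negative (the point of 100% accuracy), then count proteins in mutually exclusive regions. The sketch assumes one score per protein, with higher scores meaning "unstructured"; the function names and the dictionary layout are hypothetical.

```python
def true_positives_at_full_accuracy(pos_scores, neg_scores):
    """Positives scoring strictly above the highest-scoring negative.

    pos_scores, neg_scores: dicts mapping protein id -> score.
    Raises ValueError if neg_scores is empty.
    """
    cutoff = max(neg_scores.values())
    return {p for p, s in pos_scores.items() if s > cutoff}


def venn_exclusive(named_sets):
    """Count items per exact combination of sets (mutually exclusive regions).

    named_sets: dict mapping method name -> set of protein ids.
    Returns a dict mapping a tuple of method names -> number of proteins
    found by exactly those methods.
    """
    names = sorted(named_sets)
    universe = set().union(*named_sets.values())
    counts = {}
    for item in universe:
        key = tuple(n for n in names if item in named_sets[n])
        counts[key] = counts.get(key, 0) + 1
    return counts
```

Because each protein is counted under exactly one key, the counts correspond to the mutually exclusive numbers shown in a Venn diagram rather than to the sizes of the individual sets.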
The combination of DISOPRED2 and NORSnet by averaging their outputs was better than either method alone. This did not work with IUPred and either of the two methods. This might suggest that IUPred covers the same aspects as the other two. However, this notion proved to be incorrect: IUPred missed proteins in the NESG data set that the others captured ( Fig. 4 B). Therefore, a beneficial combination of different methods predicting unstructured regions may require a more sophisticated algorithm.
Unstructured regions from NESG data set. Many prediction methods were optimized or benchmarked on data sets overlapping with DisProt. In contrast, the data set from the NESG contained proteins with unstructured regions that have not been used for training existing methods yet. The NESG set was collected with a unified definition of unstructured regions based on 2D NMR experiments [58] ; it included 30 proteins with unstructured regions as positives and 170 regular structures solved by NESG as negatives (Methods and Table 1 _App, Supporting Online Material). In the high accuracy region NORSnet captured a considerable fraction of the positives (40% coverage at 100% accuracy, Fig. 4 A). The performance of DISOPRED2 was clearly lower than that of NORSnet for high accuracy/low coverage ( Fig. 4 A, lower right), while the inverse was true for low accuracy/high coverage ( Fig. 4 A, upper left). False positives from NORSnet (unstructured regions predicted and not observed) were almost equally divided between X-ray and NMR structures, while DISOPRED2's false positives were predominantly from NMR structures. The most extreme examples for this were the ordered structures with PDB identifiers 1eij [59] and 1eo0 [60] .
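Statements such as "coverage at 100% accuracy" can be read off an accuracy versus coverage curve obtained by sweeping a score threshold. The sketch below assumes accuracy = TP/(TP+FP) and coverage = TP/(TP+FN) at the protein level; these definitions are given in Methods, not in this section, so they are assumptions here, as are the function names.

```python
def accuracy_coverage_curve(pos_scores, neg_scores):
    """Return [(threshold, accuracy, coverage)] for descending thresholds.

    pos_scores: non-empty list of scores for proteins with unstructured regions.
    neg_scores: list of scores for well-structured proteins.
    ASSUMED definitions: accuracy = TP / (TP + FP), coverage = TP / (TP + FN).
    """
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    curve = []
    for t in thresholds:
        tp = sum(s >= t for s in pos_scores)
        fp = sum(s >= t for s in neg_scores)
        # tp + fp >= 1 because t is itself one of the observed scores.
        curve.append((t, tp / (tp + fp), tp / len(pos_scores)))
    return curve


def coverage_at_full_accuracy(curve):
    """Largest coverage reached at any threshold without false positives."""
    return max((cov for _, acc, cov in curve if acc == 1.0), default=0.0)
```

Two methods can cross on such a curve, one being better at strict thresholds and the other at permissive ones, which is the pattern reported for NORSnet and DISOPRED2 on the NESG set.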
Fig. 4 : Predictions for NESG data. (A) The NESG set contains many proteins with unstructured regions that are not in DisProt and have never been used for method optimization. We compared NORSnet (in green), DISOPRED2 (in orange), their combined method (in dark gray) and IUPred (in purple) on these proteins. While DISOPRED2 performed better than all other methods in the low accuracy/high coverage region (top left), the combined method, NORSnet and IUPred individually excelled in the high accuracy/low coverage region (lower right). (B) Venn diagram of overlap between very accurate predictions by NORSnet, DISOPRED2 and IUPred. The numbers in the circles are mutually exclusive. Note that 5 proteins were identified only by NORSnet to have an unstructured region.
Case study: NORSnet differed from other predictions. As demonstrated above, NORSnet and other predictors give similar predictions with some exceptions. For instance, we applied NORSnet and two other prediction tools (DISOPRED2 and FoldIndex) on the Kappa-casein precursor protein (DisProt identifier DP00192) that is found in milk and stabilizes micelle formation by preventing casein precipitation. Raman optical activity and thermal stability experiments revealed the protein as entirely unstructured in isolation [61] . Secondary structure prediction methods such as PROFsec or PSIPRED [62] predicted the protein to be highly enriched in loops (Fig. S1, Supporting Online Material). We may therefore expect that the prediction of the Kappa-casein precursor as unstructured will be a simple task. However, the distinction between natively unstructured and well-structured loops is not trivial: DISOPRED2 did not identify the long loopy segment to be part of a natively unstructured region ( Fig. 5 A). In contrast, NORSnet identified most of this protein to be unstructured in its strictest cutoff (corresponding to 100% accuracy on the DisProt dataset; Fig. 5 B). FoldIndex, a method that uses only amino acid composition and calculates the hydrophobicity/net charge within a given window, predicted only short segments of this protein to be unstructured ( Fig. 5 C).
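The window-based idea behind composition-only methods such as FoldIndex can be illustrated with a short profile calculation. This is not the FoldIndex implementation: the sketch only reports mean hydrophobicity and mean net charge per window and leaves out the published scoring formula. The Kyte-Doolittle hydropathy scale rescaled to 0 to 1, the charge assignment (K, R = +1; D, E = -1), the window length of 51, and the function name are all assumptions made for illustration.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}  # simplified charges (assumption)


def window_profile(seq, window=51):
    """Return [(center_index, mean_hydrophobicity, mean_net_charge)].

    Hydrophobicity is rescaled to 0..1; sequences must contain only the
    20 standard one-letter codes (other letters raise KeyError).
    """
    if window < 1 or window > len(seq):
        return []
    profile = []
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        hydro = sum((KD[aa] + 4.5) / 9.0 for aa in chunk) / window
        charge = sum(CHARGE.get(aa, 0) for aa in chunk) / window
        profile.append((start + window // 2, hydro, charge))
    return profile
```

Windows with low mean hydrophobicity and high absolute net charge are the ones such methods tend to call unfolded, which explains why a loop-rich but not strongly charged segment can escape them.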
This example reveals that NORSnet and DISOPRED2 outputs are rather correlated. However, the signal from NORSnet clearly indicated unstructured regions, while the one from DISOPRED2 did not. One reason for this drastic difference may have been that NORSnet correctly captured some global feature from its global input units (Methods).
Fig. 5 : Outputs of different prediction methods for Kappa-casein precursor. Kappa-casein precursor has been shown to be unstructured by different experiments [61] . Despite its low content in predicted helices and strands, not all prediction methods identify it as unstructured. We compared outputs of DISOPRED2 (A), NORSnet (B) and FoldIndex (C) for this protein. For DISOPRED2 and NORSnet, higher values indicate unstructured regions; for FoldIndex, low values indicate unstructured regions (red). Note that FoldIndex and DISOPRED2 do not use any explicit information about secondary structure. The DISOPRED2 disorder probability, however, is somewhat correlated with coil predictions (Fig. S1, Supporting Online Material).
Natively unstructured loops are elements of domain boundaries. Although NORSnet was designed to identify all regions in any PDB structure as well-structured, the editor of this manuscript, Phil Bourne, suspected that NORSnet predictions of disorder may occur in domain boundaries more often than expected at random and more often than expected for loop residues in general. To address this, we started with a sequence-unique subset of all PDB proteins considered to be multi-domain by SCOP [63] (set taken from [64] ). Although a much more comprehensive answer will remain the subject for future investigation, we clearly confirmed this assumption (Fig. S4, Supporting Online Material), i.e. the regions in otherwise well-structured proteins that most resemble unstructured regions are domain linkers.
Case study: DFF correctly identified despite being a tough case. The DNA fragmentation factor (DFF) 45 must bind to DFF40 so that DFF40 can execute its catalytic function required for the onset of caspase-mediated apoptosis [65] . The N-terminal domain (NTD) of DFF45 is natively unstructured: its folding is induced upon binding to DFF40 NTD [66] ( Fig. 6 ). Methods that only use amino acid composition to predict unstructured regions are likely to perform better on such proteins than more complex prediction methods, since these proteins often have a high net charge that is neutralized upon binding to the target. For example, FoldIndex [29] identified about a third of DFF45 as unstructured.
Secondary structure prediction methods, such as PSIPRED and PROFsec usually predict the secondary structure of these regions the way they appear in substrate-bound form. Therefore, methods that use this type of information might be fooled by the rigidity and stability that are associated with regular secondary structure segments and identify these regions as well-structured. Since NORSnet uses secondary structure predictions as input, it may mispredict unstructured regions that become helices and strands upon binding. However, despite the fact that DFF45 NTD is enriched in regular secondary structure (Fig. S2A, Supporting Online Material), NORSnet identified NTD as an unstructured region at a rather stringent cutoff (the cutoff corresponded to 100% and 97.2% accuracy in the NESG and the DisProt sets respectively). DISOPRED2 also identified NTD as unstructured albeit at a less stringent cutoff (corresponding to 72.2% and 94.2% accuracy).
The unstructured regions in DFF are correctly identified by many prediction methods. NORSnet and DISOPRED2 are only two of those. This example was one of 24 proteins with unstructured regions that become structured upon binding and were extensively analyzed in a recent study [22] . NORSnet identified 14 of these proteins to have unstructured regions in its strictest cutoff. Again, this underlines the surprising finding that methods based on loop predictions can capture unstructured regions of this type. DFF and similar proteins are just some of many examples for unstructured regions involved in protein-protein interactions. How representative are they?
Fig. 6 : NORSnet captured unstructured regions related to high net charge/low hydrophobicity. DFF45 (white, yellow, and red) becomes structured upon complex formation with DFF40 (purple; PDB identifier 1ibx [66] ). The interface includes a buried hydrophobic patch surrounded by hydrophilic interactions. Usually, charged residues disrupt the formation of tertiary structure. In this case, however, when the complex is formed, the negative charge of the Asp groups in DFF45 is cancelled out by the positive charges of DFF40, allowing the protein to fold. Visualization was done using GRASP2 [89] . Since DFF45 has a high content of regular secondary structure, it is a relatively hard target for NORSnet prediction. However, NORSnet correctly identified its unstructured region at a rather stringent cutoff.
Predicted unstructured regions are abundant in protein-protein network hubs
The structural plasticity of proteins with unstructured regions may enable their binding to many proteins, i.e. may typify a protein-protein interaction hub (a protein with many binding partners in an interaction network) [23, 19, 67, 68, 69] . Several detailed studies have specifically identified unstructured regions in hub proteins that are involved in signaling [8, 14, 23, 16, 17, 18] . Natively unstructured regions are also predicted to be abundant in other regulatory processes (e.g. alternative splicing [21] and transcription [20] ) and in cancer-associated signaling proteins [9] .
We addressed this point by correlating sustained large-scale data sets of physical protein-protein interactions (Methods) with predictions for unstructured regions. We applied NORSnet, DISOPRED2, and IUPred to all proteins in the worm (Caenorhabditis elegans) proteome and considered only predictions at thresholds corresponding to 100% accuracy. The subset of interacting proteins resulted from the high-throughput experiment by Vidal et al. [70] and from IntAct [71] . Predictions for unstructured regions for all three methods correlated with the average number of interacting partners; in other words, proteins with more unstructured regions had more binding partners ( Fig. 7 ). Since we used two different data sets to determine the thresholds for what constituted reliable predictions (DisProt and NESG), we also obtained two different thresholds for each method. For the purpose of fishing for hubs in protein-protein networks, we counted the number of proteins with unstructured regions according to any of those thresholds. Using DisProt to tune thresholds, DISOPRED2 predicted more proteins with unstructured regions than did NORSnet (1279 ± 88 vs. 899 ± 76); using the NESG dataset, NORSnet predicted many times more proteins with unstructured regions than DISOPRED2 (1282 ± 87 vs. 321 ± 46, Fig. 7 ). These results agreed with recent studies that estimated hub proteins to be enriched in unstructured regions [67, 68, 69] . However, could NORSnet identify any new unstructured regions in hub proteins?
We chose the cutoff that yielded the highest number of unstructured regions (NORSnet: 1282, DISOPRED2: 1279) for each method and checked whether the two methods predicted unstructured regions in the same hub proteins. Both methods predicted unstructured regions in most (74) of the 140 proteins observed with more than 10 partners. DISOPRED2 predicted unstructured regions in another 13 of the promiscuous proteins, and NORSnet in another 21 proteins. If the reliable predictions of both methods are correct, 77% of all promiscuous proteins in the worm (74+13+21=108 of 140) have unstructured regions. While these data do not suffice to identify hubs from sequence, they show that methods such as NORSnet and DISOPRED2 have some capability in the identification of unstructured regions that will adopt 3D structures upon binding. While this finding was not new, our particular perspective was that the differences between DISOPRED2 and NORSnet resulted from the difference in the focus of the two. NORSnet focuses more on loopy regions than DISOPRED2, and it also identified more hub proteins. Similar results were obtained when we compared NORSnet and IUPred predictions on the same dataset. Again, IUPred identified the hub signal but much less clearly than did NORSnet (Fig. S3). All these observations suggested that the aspect of unstructured regions most relevant to hubs may be exactly the unstructured loops.
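The overlap arithmetic in this paragraph can be retraced with a few lines of Python. This is only a bookkeeping sketch: the four counts are the ones quoted in the text, while the variable names are ours and purely illustrative.

```python
# Counts quoted in the text for worm proteins with more than 10 partners.
hubs_total = 140        # all promiscuous proteins considered
both = 74               # predicted by NORSnet and DISOPRED2
only_disopred2 = 13     # predicted by DISOPRED2 only
only_norsnet = 21       # predicted by NORSnet only

# Union of the two prediction sets (inclusion-exclusion on disjoint parts).
predicted_by_either = both + only_disopred2 + only_norsnet
percent_either = round(100 * predicted_by_either / hubs_total)

# Per-method totals follow from the same three numbers.
norsnet_total = both + only_norsnet
disopred2_total = both + only_disopred2

print(predicted_by_either, percent_either)   # 108 77
print(norsnet_total, disopred2_total)        # 95 87
```

The per-method totals (95 for NORSnet, 87 for DISOPRED2) are derived here from the quoted counts and are not stated separately in the text.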
Fig. 7 : Unstructured regions over-represented in protein-protein hubs of worm. We ran both NORSnet and DISOPRED2 on worm proteins that are involved in protein-protein interactions (as identified by yeast two-hybrid [70] ). The number of proteins that are predicted to be either unstructured or well-structured is plotted against the number of interacting partners for two different thresholds of reliability of the two methods: A+B were compiled for thresholds at which both methods maintained 100% accuracy for the NESG data (Fig. 4), while graphs C+D were compiled for 100% accuracy on DisProt (Fig. 3). Since the number of observed interaction partners falls off dramatically, we had to group the data into bins of roughly equal sizes (x-axes). A+C show the results for the number of proteins predicted in each bin of interaction partners, while B+D show the normalized ratios to zoom into the difference between unstructured and structured proteins in each bin. These ratios were compiled as Ratio(bin)={#unstructured(bin)/#structured(bin)} / {#unstructured(1)/#structured(1)}. As all ratios are above 1, proteins with more than one interaction partner have more unstructured regions than proteins with one partner. (A) These graphs were compiled with the reliability threshold at which each method achieved 100% accuracy on the NESG data (Fig. 4). Overall, this threshold resulted in NORSnet (filled bars) predicting many more proteins with unstructured regions than DISOPRED2 (hashed bars). The difference was particularly relevant for proteins with more interacting partners. (B) NORSnet (filled, dark green) predicted many more unstructured regions in proteins with seven or more interaction partners than did DISOPRED2 (hashed, light green). (C) For the thresholds at which both methods achieved 100% accuracy on the DisProt dataset, DISOPRED2 identified more proteins with unstructured regions than did NORSnet.
In contrast to the situation for the NESG set (A), the difference was not as significant for promiscuous proteins (≥10 partners). (D) Although NORSnet (filled, dark green) predicted as many unstructured as structured regions in hubs (≥7), the corresponding ratio was significantly smaller for proteins with a single interaction partner. In other words, even on this data set NORSnet picked up a much stronger over-representation of unstructured regions in hubs than did DISOPRED2 (hashed, light green).
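The normalized ratio defined in this caption can be written out as a short function. The sketch below is illustrative only: the function name and the toy counts are hypothetical and are not the data behind Fig. 7; bins are keyed by their lowest number of partners, with bin 1 as the reference.

```python
def normalized_ratios(unstructured, structured, reference_bin=1):
    """Ratio(bin) = [#unstructured(bin) / #structured(bin)]
                  / [#unstructured(1) / #structured(1)].

    Both arguments map a bin label to a protein count.
    """
    reference = unstructured[reference_bin] / structured[reference_bin]
    return {
        b: (unstructured[b] / structured[b]) / reference
        for b in unstructured
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
toy_unstructured = {1: 100, 2: 60, 7: 30}
toy_structured = {1: 400, 2: 120, 7: 30}
toy_ratios = normalized_ratios(toy_unstructured, toy_structured)
print(toy_ratios)
```

By construction the reference bin has ratio 1, so any value above 1 indicates that unstructured proteins are over-represented in that bin relative to proteins with a single partner.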
While NORSnet has some ability to identify unstructured regions that are often involved in binding ( Fig. 6 ) it may miss many of these regions due to their enrichment in regular secondary structure (helix, strand) in their bound form. We may therefore wonder why NORSnet identified so many worm hub proteins to have unstructured regions in the first place. Interestingly, many of the hubs had several modules/domains, some of which were predicted not to contain unstructured regions. Some of these modules were DNA-binding domains (such as Homeobox domains) or protein-protein interaction binding motifs (such as EGF repeats). The majority of the unstructured regions predicted by NORSnet in these hubs bridged connections between well-structured domains: these bridges were often on the surface (data not shown). At first glance, the fact that these regions were predicted to be unstructured might seem biologically unimportant. However, there are several possible biological consequences of the abundance of hubs with unstructured loops. These exposed unstructured/loopy regions might serve as sites for proteolysis, allowing some parts of the protein to undergo proteolytic degradation under different cellular conditions. Such differential degradation could allow different modules of the same protein to be functional under different conditions.
Alternatively, these long connecting loops might function as extremely flexible connecting linkers that facilitate the modules to adopt different orientations, thereby allowing the binding of different targets or binding similar targets in different fashion. Each of these alternatives could be at the heart of a different function. These two hypotheses may explain some of the regulatory characteristics of hub proteins.
Mapping the sequence space of proteins with unstructured regions. Most likely, unstructured regions and NORS regions occupy slightly different parts in sequence space ( Fig. 1 ). Indications for the overlap between NORS and unstructured regions are that both are enriched in proline and both depleted of glycine ( [50] and Fig. 2 ). Some experimentally observed unstructured regions have been shown to contain cysteines. For instance, Zinc fingers often become structured only upon binding zinc. Nevertheless, most previous studies of unstructured regions did not find cysteines to be over-represented with respect to well-structured regions in the PDB. This may be due partially to the fact that in well-structured proteins cysteines often stabilize disulfide bonds. Methods optimized to identify regions missing in electron density maps from X-ray crystallography are therefore likely to miss many of the cysteines in unstructured regions. In contrast, NORSnet captured cysteines in unstructured regions ( Fig. 2 ). Additionally, the differences between DISOPRED2 and NORSnet that were revealed by both our head-to-head comparison on different sets of proteins with unstructured regions and by our analysis of protein hubs pointed to the different types of unstructured regions that we may have to separate ( Fig. 1 ). To complicate matters further, some proteins with unstructured regions may look just like any regular protein, while others may be generically different. Consequently, some of the proteins with unstructured regions may be missed by any prediction method.
Refining target selection for structural genomics. One goal of structural genomics projects is to contribute considerably to the increase in the fraction of proteins with known 3D structures. In order to achieve this goal, 3D structures are experimentally determined for representatives of as many large families as possible [72, 44, 73, 45, 64, 74] . In particular, the large structural genomics initiatives financed by the Protein Structure Initiative (PSI) from the National Institutes of Health (NIH) in the USA systematically target the experimental determination of structures for large families without representatives of known structure. Structural genomics also aims at making 3D structures more readily accessible to non-structural molecular biology and at reducing the costs and difficulty of determining structures. All of these goals require high-throughput determination of 3D structures. This implies that experimental high-throughput pipelines have to move on if structure determination fails for some families, and that targets are also chosen with the objective to increase the throughput. This does not imply that PSI consortia "go for the low hanging fruits". Quite to the contrary, they have succeeded where many small-scale studies failed.
Membrane proteins and proteins with unstructured regions are the two major types of proteins that are avoided not only by conventional small-scale structural biology but also by structural genomics efforts. Because proteins with unstructured regions are much more abundant in eukaryotic organisms, consortia that focus on eukaryotes, such as NESG and CESG, have to carefully avoid such difficult targets. Over the last six years, thousands of proteins have been cloned, expressed and purified by NESG. Although the NESG target selection filtered out many domains with strong predictions for the presence of unstructured regions [46, 47] , many were left for which biophysical data suggested that they contain unstructured regions [28] .
We applied NORSnet to 11,587 putative NESG targets that had already passed our previous and cruder NORS filter ( Table 1 ). Using two different cutoffs, NORSnet predicted that 13%-20% of the previously filtered targets have unstructured regions. Although NORSnet was not optimized to identify very short unstructured regions (≤30 residues), NORSnet predicted 47%-58% of the proteins to contain such regions. The same filter would not have excluded any of the proteins that succeeded in the experimental pipeline, suggesting that the application of NORSnet could have increased the structure/clone ratio. However, the ultimate proof for this assumption will have to wait until another hundred or so experimentally determined structures are added by NESG to the PDB over the next year(s).
The intricate details of protein 3D structures are crucial for their functional role, i.e. structure determines function. Natively unstructured regions do not necessarily contradict this structure-function paradigm. Nevertheless, a variety of proteins require unstructured regions in order to function as domain linkers, filling material and detergents. For other proteins with unstructured regions, changes in the environment (e.g. pH change, presence of target) or posttranslational modifications can trigger the formation of a regular 3D structure that will then again determine function. In an evolutionary sense, the required changes/modifications constitute an integral part of the function and are therefore likely to be somehow encoded in the sequence of such proteins. The unusual aspect is that the key structural feature of these proteins is to keep regions natively unstructured or adaptable. The experimental and in silico identification and characterization of proteins with unstructured regions is evolving into an increasingly important challenge for structural biology. In facing this challenge it becomes increasingly clear that the term "unstructured" describes a rather mixed bag of phenomena from regions that alter between different conformations to those that remain molten globule-like; from regions that adopt regular helices and strands to those that remain intrinsically loopy.
Here, we presented NORSnet, a neural network-based method that revisited the task of identifying unstructured regions from a different angle than that taken by other methods. It focused on the distinction between unstructured and well-structured loops. The success in this undertaking confirmed our initial hypothesis, namely that short unstructured loops resemble very long unstructured loops (NORS regions). Our application of machine learning was rather unconventional in two ways. First, we trained on positives (predicted NORS) that did not contain the feature we sought to predict (short unstructured loops) and on negatives (all regions in the PDB) that contained regions that we wanted the method to predict as positives, i.e. we implicitly hoped that our development would fail for many cases. Secondly, we did not optimize any parameters on the data set used for assessing the performance of our method. Due to the difference in our approach, NORSnet complemented existing methods that optimize on previous data sets of unstructured regions. Consequently, NORSnet will enable the application of additional filters for structural genomics. Lastly, through a comparison between our new and other prediction methods, we confirmed the importance of unstructured regions for protein-protein interactions. Moreover, we specifically touched on the importance of unstructured loops for network complexity.
Data set of NORS regions. We created our data set of residues in natively unstructured regions ('positives') in the following way. We grouped all proteins from 62 entirely sequenced proteomes into domain-like families using CHOP and CLUP [46, 75, 76] . We identified proteins with long NORS regions by the application of NORSp, i.e. we collected all residues that are located in a stretch of >70 consecutive residues with <12% predicted helix or strand [10, 40] by PROFsec [41, 42, 43] and that have at least one contiguous segment longer than 10 residues predicted to be on the protein surface [77] . The hope was that all residues in this pool have commonalities that we can extract through machine learning, and that these will also be shared by proteins with unstructured regions much shorter than 70 residues. Because PROFsec is especially accurate for natively unstructured regions [22] , the noise in these data that originated from prediction mistakes was likely very low. In order to distinguish between proteins with and without unstructured regions, we needed a set of 'negatives', i.e. residues that are well-structured. For this, we chose a sequence-unique subset of globular protein structures from the PDB. Technically, this sequence-unique subset was taken from the EVA server [78, 79] . Specifically, the sequence redundancy was removed above HSSP similarity values of 0 [80, 81] (corresponding to <22% pairwise sequence identity for long alignments). Any pair of sequences between training and testing sets that could be aligned at PSI-BLAST [82] E-values <10^-3 according to our standard procedure of three automated iterations [83] was also removed. In order to further amplify the signal from well-structured regions in the negative set, we also excluded all loops longer than 30 residues.
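The selection criteria for the positives can be illustrated with a short sketch. It is not the NORSp implementation: we assume a sliding-window reading of the rule (every window of 71 residues with <12% predicted helix or strand is flagged, overlapping windows are merged), and the state letters (H/E/L for secondary structure, e/b for exposed/buried) as well as all function names are hypothetical.

```python
def longest_run(text, symbol):
    """Length of the longest uninterrupted run of `symbol` in `text`."""
    best = current = 0
    for ch in text:
        current = current + 1 if ch == symbol else 0
        best = max(best, current)
    return best


def find_nors_regions(ss, acc, min_len=71, max_structured=0.12, min_exposed=11):
    """Return half-open (start, end) intervals of NORS-like regions.

    ss  : predicted secondary structure, one of H/E/L per residue
    acc : predicted accessibility, 'e' (exposed) or 'b' (buried) per residue
    """
    if len(ss) != len(acc):
        raise ValueError("ss and acc must have the same length")
    n = len(ss)
    # Prefix sums of helix/strand residues give each window count in O(1).
    prefix = [0]
    for state in ss:
        prefix.append(prefix[-1] + (state in "HE"))
    flagged = [False] * n
    for start in range(n - min_len + 1):
        end = start + min_len
        if (prefix[end] - prefix[start]) / min_len < max_structured:
            flagged[start:end] = [True] * min_len
    # Merge flagged residues into regions and apply the surface criterion.
    regions, i = [], 0
    while i < n:
        if not flagged[i]:
            i += 1
            continue
        j = i
        while j < n and flagged[j]:
            j += 1
        if longest_run(acc[i:j], "e") >= min_exposed:
            regions.append((i, j))
        i = j
    return regions


# Toy protein: 20 helix residues, an 80-residue exposed loop, 20 strand residues.
toy_ss = "H" * 20 + "L" * 80 + "E" * 20
toy_acc = "b" * 20 + "e" * 80 + "b" * 20
print(find_nors_regions(toy_ss, toy_acc))   # [(12, 108)]
```

In the toy case the region extends a few residues into the flanking helix and strand, because a window may contain up to 12% regular secondary structure and still qualify.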
Our data sets were not fully clean in the sense that our negative set of well-structured PDB proteins certainly contained some residues that did not appear in the X-ray structure (which were implicitly treated as well-structured), and that the positive set (predicted NORS) might contain some regular, ordered regions. However, due to the immense size of both data sets and to our use of neural networks, we did not worry about such outliers. In fact, our particular generation of a prediction-based training set that is more than 10 times larger, and certainly more representative than sets used previously might be the most important difference to all previous methods. In the context of a different problem, we showed how beneficial the use of prediction-based sets with errors may be [84] .
Training and testing set. In order to optimize the parameters of the method, we trained the network on 90% of the sequences and tested it on the remaining 10%. Note that these data were only used for the development of the method. We never reported the performance of the method on these data. The data sets on which we did assess NORSnet had no overlap (HSSP-value < 0, corresponding to <22% pairwise sequence identity for 250 aligned residues) with any of the proteins used for development. In particular, NORSnet was not optimized in any way on DisProt and the NESG data set, as these were solely used to assess its performance.
DisProt data. After optimizing our method to predict NORS regions (as described below in the prediction method section) we assessed NORSnet performance on different sets without any further optimization. In the first benchmark we used DisProt proteins that have unstructured regions longer than 30 residues as positives and a sequence-unique subset of 173 PDB X-ray structures as negatives. The latter subset was taken from the EVA server [78, 79] and did not include sequences that were in the original training set. One particular advantage of testing our method on DisProt was that we did not have to run any additional cross-validation experiment since we used different proteins, respectively the same proteins with different labels (ALL residues from PDB in DisProt were explicitly treated as "well-structured" by our training procedure).
NESG data set. In order to further validate the method, we tested it on a set of proteins from the NESG consortium. The positive set included 30 proteins that were identified to have unstructured regions ('NESG unfolded') and the negative set included 170 recently determined protein structures. Both sets were identified as such by the NESG consortium. The definition of 'unstructured region' was as follows: (1) HSQC was high signal to noise and very low dispersion; (2) hetNOE data was clean negative (Montelione, personal communication). Using this set contributed to removing two types of biases in DisProt and similar databases. (1) Structure determination method: The negative set was almost equally divided between X-ray and NMR structures. (2) Length bias: while usually sequences selected for NMR structure determination are shorter than for X-ray determination, the NESG consortium reduced this artifact by using both methods in parallel to determine the structures of the same sequences. Thus, the length distribution of the NESG unfolded set is similar to the one of the folded set, in contrast to the DisProt database, which consists of some much longer sequences (see Table 1 _App, Supporting Online Material).
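Both benchmark sets were also used to fix the reliability thresholds quoted in the Results (e.g. the cutoff at which a method maintains 100% accuracy). The following sketch shows one way such a cutoff could be derived. It assumes that 'accuracy' means the fraction of proteins predicted as unstructured that are truly unstructured and that each protein is summarized by a single score; function names and toy scores are hypothetical and not taken from NORSnet.

```python
def accuracy_at(cutoff, pos_scores, neg_scores):
    """Fraction of proteins scoring >= cutoff that are true positives."""
    tp = sum(s >= cutoff for s in pos_scores)
    fp = sum(s >= cutoff for s in neg_scores)
    return tp / (tp + fp) if tp + fp else None


def lowest_cutoff_with_full_accuracy(pos_scores, neg_scores):
    """Most permissive cutoff at which no negative is predicted as unstructured.

    Returns None if no positive scores above every negative.
    """
    highest_negative = max(neg_scores, default=float("-inf"))
    clean = [s for s in pos_scores if s > highest_negative]
    return min(clean) if clean else None


# Hypothetical per-protein scores for a tiny benchmark.
toy_pos = [0.95, 0.80, 0.40]   # proteins with unstructured regions
toy_neg = [0.50, 0.20, 0.10]   # well-structured proteins
toy_cutoff = lowest_cutoff_with_full_accuracy(toy_pos, toy_neg)
print(toy_cutoff, accuracy_at(toy_cutoff, toy_pos, toy_neg))   # 0.8 1.0
```

Because the cutoff depends on the highest-scoring negative, the two benchmark sets will in general give two different cutoffs for the same method, which is why each method has one threshold per data set.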
Protein-protein interaction set. For the large-scale predictions of proteins that are involved in protein-protein interactions we used the IntAct database (http://www.ebi.ac.uk/intact/). IntAct includes both large- and small-scale experiments for different organisms [71] . Specifically, we used proteins from interactions that were detected in a large-scale yeast two-hybrid screen of Caenorhabditis elegans (worm) proteins [70] . The set included 2622 proteins that participate in 4039 interactions.
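Counting interaction partners from a list of binary interactions is the step that turns such a data set into the 'number of partners' used in the hub analysis. The sketch below is a minimal illustration under our own assumptions: interactions are given as pairs of protein identifiers, duplicate pairs count once, and self-interactions are ignored. The names and the toy pairs are hypothetical.

```python
from collections import defaultdict


def count_partners(interactions):
    """Map each protein to its number of distinct interaction partners."""
    partners = defaultdict(set)
    for a, b in interactions:
        if a == b:
            continue  # assumption: self-interactions are not counted
        partners[a].add(b)
        partners[b].add(a)
    return {protein: len(found) for protein, found in partners.items()}


def proteins_with_more_than(degree, threshold):
    """Sorted identifiers of proteins with more than `threshold` partners."""
    return sorted(p for p, k in degree.items() if k > threshold)


# Hypothetical toy network; ("B", "A") repeats ("A", "B") in reverse order.
toy_pairs = [("A", "B"), ("A", "C"), ("B", "A"), ("A", "D"), ("D", "D")]
toy_degree = count_partners(toy_pairs)
print(toy_degree)                                # {'A': 3, 'B': 1, 'C': 1, 'D': 1}
print(proteins_with_more_than(toy_degree, 2))    # ['A']
```

With the real data, the threshold would be set to 10 to recover the promiscuous proteins discussed in the Results.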
Prediction method. We used a standard feed-forward neural network described elsewhere in more detail [85, 77, 41, 43] . The crucial novelty for the given task was the choice of input information. This choice was largely influenced by what we found to succeed in different contexts, namely for the prediction of normalized B-values [86] and of protein-protein interfaces [87] . Local input information was taken from a sliding window of 13 sequence-consecutive residues (the prediction was for the central residue in that window). For each residue, we used the evolutionary profile (from PSI-BLAST alignments according to our standard protocol [83] ), the three-state secondary structure predicted by PROFsec [41, 42, 43] , the two-state solvent accessibility predicted by PROFacc [77] and the two-state flexibility predicted by PROFbval [86, 88] . Global input information was represented by the global amino acid composition (20 units), the composition in predicted secondary structure (3 units) and solvent accessibility (2 units), as well as the length of the protein/domain-like fragment (3 units as in [86] ), and the mean hydrophobicity divided by the net charge, as was first suggested by Uversky et al. [4] .
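To make the layout of the input concrete, the sketch below assembles the local window and part of the global input for one residue. It is an illustration under stated assumptions, not the NORSnet code: we assume 20 profile values plus one-hot codes for the predicted states (27 units per window position), zero padding beyond the termini, and hypothetical state letters (H/E/L, e/b, f/r). The length units and the hydrophobicity/net-charge unit are omitted because their exact encoding is not given here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
WINDOW = 13                      # window length stated in the text
HALF = WINDOW // 2
UNITS_PER_POSITION = 20 + 3 + 2 + 2   # assumed layout: profile, SS, acc, flex


def one_hot(state, states):
    """One-hot code of `state` over the alphabet `states`."""
    return [1.0 if state == s else 0.0 for s in states]


def residue_units(profile_row, ss, acc, flex):
    """Input units for a single window position."""
    return (list(profile_row) + one_hot(ss, "HEL")
            + one_hot(acc, "eb") + one_hot(flex, "fr"))


def window_input(profile, ss, acc, flex, center):
    """Local input for the residue at index `center`."""
    units = []
    for i in range(center - HALF, center + HALF + 1):
        if 0 <= i < len(ss):
            units += residue_units(profile[i], ss[i], acc[i], flex[i])
        else:
            units += [0.0] * UNITS_PER_POSITION   # assumed padding at termini
    return units


def global_input(seq, ss, acc):
    """Global composition units: amino acids (20), SS (3), accessibility (2)."""
    n = len(seq)
    return ([seq.count(a) / n for a in AMINO_ACIDS]
            + [ss.count(s) / n for s in "HEL"]
            + [acc.count(s) / n for s in "eb"])


# Hypothetical 10-residue example; the 'profile' is a plain one-hot stand-in.
toy_seq = "ACDEFGHIKL"
toy_profile = [one_hot(a, AMINO_ACIDS) for a in toy_seq]
toy_ss, toy_acc, toy_flex = "LLLHHHHLLL", "eeebbbbeee", "ffrrrrrrff"
toy_local = window_input(toy_profile, toy_ss, toy_acc, toy_flex, center=0)
toy_global = global_input(toy_seq, toy_ss, toy_acc)
print(len(toy_local), len(toy_global))   # 351 25
```

In a real setting the profile rows would come from the PSI-BLAST alignment rather than from a one-hot code of the sequence, and the network would receive the local and global units concatenated into a single input vector.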
DISOPRED2, FoldIndex and IUPred. We downloaded the DISOPRED2 package from http://bioinf.cs.ucl.ac.uk/disopred/ and installed it locally. The package included DISOPRED2 V0.2 and PSIPRED Version 2.45 (from November 2003). To assess its performance on our datasets we ran the program using the default parameters. The prediction for casein precursor in Fig. 5 A was taken from the DISOPRED2 server. We ran FoldIndex using the server at http://bip.weizmann.ac.il/fldbin/findex (in September 2006) with default parameters. We ran IUPred using the server at http://iupred.enzim.hu/index.html (in December 2005 & January 2006) with default parameters.
Thanks to Dariusz Przybylski and Guy Yachdav (Columbia University) for providing preliminary information and programs, to Andrew Kernytsky and Marco Punta (Columbia) for valuable discussions, to Kazimierz Wrzeszczynski and Henry Bigelow (Columbia) for helpful comments on the manuscript. Thanks to Jonathan Ward and David Jones (Univ. College London) for making DISOPRED2 and PSIPRED available, to Jaime Prilusky and Joel Sussman (Weizmann Inst., Rehovot) for making FoldIndex available and to Zsuzsanna Dosztányi and István Simon (Institute of Enzymology, Budapest) for making IUPred available. Particular thanks to Guy Montelione and colleagues (Rutgers) for creating and providing the NESG data sets. Thanks for the constructive criticism of two anonymous reviewers and of the editor, Phil Bourne. The work was supported by the grants U54-GM074958-01 from the Protein Structure Initiative of National Institutes of Health to the Northeast Structural Genomics Consortium and 2R01-LM07329-01 from the National Library of Medicine (NLM). The work of BR was also supported partially by grant U54-GM072980 from the NIH. Last but not least, thanks to all those who deposit their experimental data in public databases, and to those who maintain these databases, in particular to Keith Dunker and his colleagues for the maintenance of DisProt.
| 1. | BrŠndŽn, C. & Tooze, J. (1991). Introductionto Protein Structure. Garland Publ., New York, London. |
| 2. | Lesk, A. M. (2004). Introduction to proteinarchitecture - The structural biology of proteins. OUP, Oxford. |
| 3. | Wright, P. E. & Dyson, H. J. (1999).Intrinsically unstructured proteins: re-assessing the proteinstructure-function paradigm. Journal of Molecular Biology, 293,321-331. |
| 4. | Uversky, V. N., Gillespie, J. R. & Fink, A.L. (2000). Why are "natively unfolded" proteins unstructured underphysiologic conditions? Proteins: Structure, Function, and Genetics, 41,415-427. |
| 5. | Dunker, A. K. & Obradovic, Z. (2001). Theprotein trinity-linking function and disorder. Nature Biotechnology, 19,805-806. |
| 6. | Dunker, A. K., Brown, C. J., Lawson, J. D.,Iakoucheva, L. M. & Obradovic, Z. (2002). Intrinsic disorder and proteinfunction. Biochemistry, 41, 6573-82. |
| 7. | Dunker, A. K., Brown, C. J. & Obradovic, Z.(2002). Identification and functions of usefully disordered proteins. AdvProtein Chem, 62, 25-49. |
| 8. | Dyson, H. J. & Wright, P. E. (2002).Coupling of folding and binding for unstructured proteins. Current Opinionin Structural Biology, 12, 54-60. |
| 9. | Iakoucheva, L. M., Brown, C. J., Lawson, J. D.,Obradovic, Z. & Dunker, A. K. (2002). Intrinsic disorder in cell-signalingand cancer-associated proteins. J Mol Biol, 323, 573-84. |
| 10. | Liu, J., Tan, H. & Rost, B. (2002). Loopyproteins appear conserved in evolution. Journal of Molecular Biology, 322,53-64. |
| 11. | Tompa, P. (2002). Intrinsically unstructuredproteins. Trends Biochem Sci,27, 527-533. |
| 12. | Uversky, V. N. (2002). What does it mean to benatively unfolded? Eur J Biochem,269, 2-12. |
| 13. | Linding, R., Jensen, L. J., Diella, F., Bork,P., Gibson, T. J. et al. (2003). Protein disorder prediction: implications forstructural proteomics. Structure,11, 1453-1459. |
| 14. | Iakoucheva, L. M., Radivojac, P., Brown, C. J.,O'Connor, T. R., Sikes, J. G. et al. (2004). The importance of intrinsicdisorder for protein phosphorylation. Nucleic Acids Res, 32,1037-49. |
| 15. | Ward, J. J., Sodhi, J. S., McGuffin, L. J.,Buxton, B. F. & Jones, D. T. (2004). Prediction and functional analysis ofnative disorder in proteins from the three kingdoms of life. Journal of MolecularBiology, 337, 635-645. |
| 16. | Dyson, H. J. & Wright, P. E. (2005).Intrinsically unstructured proteins and their functions. Nat Rev Mol CellBiol, 6, 197-208. |
| 17. | Fink, A. L. (2005). Natively unfolded proteins.Curr Opin Struct Biol, 15, 35-41. |
| 18. | Tompa, P. (2005). The interplay betweenstructure and function in intrinsically unstructured proteins. FEBS Letters, 579,3346-3354. |
| 19. | Uversky, V. N., Oldfield, C. J. & Dunker,A. K. (2005). Showing your ID: intrinsic disorder as an ID for recognition,regulation and cell signaling. J Mol Recognit, 18,343-84. |
| 20. | Liu, J., Perumal, N. B., Oldfield, C. J., Su,E. W., Uversky, V. N. et al. (2006). Intrinsic disorder in transcriptionfactors. Biochemistry, 45, 6873-88. |
| 21. | Romero, P. R., Zaidi, S., Fang, Y. Y., Uversky,V. N., Radivojac, P. et al. (2006). Alternative splicing in concert withprotein intrinsic disorder enables increased functional diversity inmulticellular organisms. Proc Natl Acad Sci U S A, 103,8390-5. |
| 22. | Fuxreiter, M., Simon, I., Friedrich, P. &Tompa, P. (2004). Preformed structural elements feature in partner recognitionby intrinsically unstructured proteins. Journal of Molecular Biology, 338,1015-1026. |
| 23. | Dunker, A. K., Cortese, M. S., Romero, P.,Iakoucheva, L. M. & Uversky, V. N. (2005). Flexible nets. The roles ofintrinsic disorder in protein interaction networks. Febs J, 272,5129-48. |
| 24. | Dyson, H. J. & Wright, P. E. (2004).Unfolded proteins and protein folding studied by NMR. Chem Rev, 104,3607-22. |
| 25. | Vucetic, S., Obradovic, Z., Vacic, V., Radivojac, P., Peng, K. et al. (2005). DisProt: a database of protein disorder. Bioinformatics, 21, 137-40. |
| 26. | Oldfield, C. J., Cheng, Y., Cortese, M. S., Brown, C. J., Uversky, V. N. et al. (2005). Comparing and combining predictors of mostly disordered proteins. Biochemistry, 44, 1989-2000. |
| 27. | Oldfield, C. J., Ulrich, E. L., Cheng, Y., Dunker, A. K. & Markley, J. L. (2005). Addressing the intrinsic disorder bottleneck in structural proteomics. Proteins, 59, 444-53. |
| 28. | Snyder, D. A., Chen, Y., Denissova, N. G., Acton, T., Aramini, J. M. et al. (2005). Comparisons of NMR spectral quality and success in crystallization demonstrate that NMR and X-ray crystallography are complementary methods for small protein structure determination. J Am Chem Soc, 127, 16505-11. |
| 29. | Prilusky, J., Felder, C. E., Zeev-Ben-Mordehai, T., Rydberg, E. H., Man, O. et al. (2005). FoldIndex: a simple tool to predict whether a given protein sequence is intrinsically unfolded. Bioinformatics, 21, 3435-8. |
| 30. | Linding, R., Russell, R. B., Neduva, V. & Gibson, T. J. (2003). GlobPlot: Exploring protein sequences for globularity and disorder. Nucleic Acids Res, 31, 3701-8. |
| 31. | Zetina, C. R. (2001). A conserved helix-unfolding motif in the naturally unfolded proteins. Proteins, 44, 479-83. |
| 32. | Lise, S. & Jones, D. T. (2005). Sequence patterns associated with disordered regions in proteins. Proteins, 58, 144-50. |
| 33. | Obradovic, Z., Peng, K., Vucetic, S., Radivojac, P., Brown, C. J. et al. (2003). Predicting intrinsic disorder from amino acid sequence. Proteins: Structure, Function, and Genetics, 53, 566-572. |
| 34. | Jones, D. T. & Ward, J. J. (2003). Prediction of disordered regions in proteins from position specific score matrices. Proteins: Structure, Function, and Genetics, 53, 573-578. |
| 35. | Cheng, J., Sweredoski, M. J. & Baldi, P. (2005). Accurate Prediction of Protein Disordered Regions by Mining Protein Structure Data. In Data Mining and Knowledge Discovery, pp. 213-222, Springer Science + Business Media, Inc. |
| 36. | Yang, Z. R., Thomson, R., McNeil, P. & Esnouf, R. M. (2005). RONN: the bio-basis function neural network technique applied to the detection of natively disordered regions in proteins. Bioinformatics, 21, 3369-76. |
| 37. | Melamud, E. & Moult, J. (2003). Evaluation of disorder predictions in CASP5. Proteins, 53 Suppl 6, 561-5. |
| 38. | Jin, Y. & Dunbrack, R. L., Jr. (2005). Assessment of disorder predictions in CASP6. Proteins, 61 Suppl 7, 167-75. |
| 39. | Gu, J., Gribskov, M. & Bourne, P. E. (2006). Wiggle-predicting functionally flexible regions from primary sequence. PLoS Comput Biol, 2, e90. |
| 40. | Liu, J. & Rost, B. (2003). NORSp: predictions of long regions without regular secondary structure. Nucleic Acids Research, 31, 3833-3835. |
| 41. | Rost, B. (1996). PHD: predicting one-dimensional protein structure by profile based neural networks. Methods in Enzymology, 266, 525-539. |
| 42. | Rost, B. (2001). Protein secondary structure prediction continues to rise. Journal of Structural Biology, 134, 204-218. |
| 43. | Rost, B. (2005). How to use protein 1D structure predicted by PROFphd. In The Proteomics Protocols Handbook (Walker, J. E., eds.), pp. 875-901, Humana, Totowa NJ. |
| 44. | Rost, B. (1998). Marrying structure and genomics. Structure, 6, 259-63. |
| 45. | Montelione, G. T. & Anderson, S. (1999). Structural genomics: keystone for a Human Proteome Project. Nat Struct Biol, 6, 11-2. |
| 46. | Liu, J., Hegyi, H., Acton, T. B., Montelione, G. T. & Rost, B. (2004). Automatic target selection for structural genomics on eukaryotes. Proteins, 56, 188-200. |
| 47. | Wunderlich, Z., Acton, T. B., Liu, J., Kornhaber, G., Everett, J. et al. (2004). The protein target list of the Northeast Structural Genomics Consortium. Proteins, 56, 181-7. |
| 48. | Kabsch, W. & Sander, C. (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen bonded and geometrical features. Biopolymers, 22, 2577-2637. |
| 49. | Fossen, T., Wray, V., Bruns, K., Rachmat, J., Henklein, P. et al. (2005). Solution structure of the human immunodeficiency virus type 1 p6 protein. J Biol Chem, 280, 42515-27. |
| 50. | Radivojac, P., Obradovic, Z., Smith, D. K., Zhu, G., Vucetic, S. et al. (2004). Protein flexibility and intrinsic disorder. Protein Science, 13, 71-80. |
| 51. | Garbuzynskiy, S. O., Lobanov, M. Y. & Galzitskaya, O. V. (2004). To be folded or to be unfolded? Protein Sci, 13, 2871-7. |
| 52. | Dosztanyi, Z., Csizmok, V., Tompa, P. & Simon, I. (2005). The pairwise energy content estimated from amino acid composition discriminates between folded and intrinsically unstructured proteins. J Mol Biol, 347, 827-39. |
| 53. | Dunker, A. K., Obradovic, Z., Romero, P., Garner, E. C. & Brown, C. J. (2000). Intrinsic protein disorder in complete genomes. Genome Inform Ser Workshop Genome Inform, 11, 161-71. |
| 54. | Peng, K., Radivojac, P., Vucetic, S., Dunker, A. K. & Obradovic, Z. (2006). Length-dependent prediction of protein intrinsic disorder. BMC Bioinformatics, 7, 208. |
| 55. | Su, C. T., Chen, C. Y. & Ou, Y. Y. (2006). Protein disorder prediction by condensed PSSM considering propensity for order or disorder. BMC Bioinformatics, 7, 319. |
| 56. | Dosztanyi, Z., Csizmok, V., Tompa, P. & Simon, I. (2005). IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics, 21, 3433-4. |
| 57. | Vucetic, S., Brown, C. J., Dunker, A. K. & Obradovic, Z. (2003). Flavors of protein disorder. Proteins, 52, 573-84. |
| 58. | Goh, C. S., Lan, N., Echols, N., Douglas, S. M., Milburn, D. et al. (2003). SPINE 2: a system for collaborative structural proteomics within a federated database framework. Nucleic Acids Res, 31, 2833-8. |
| 59. | Christendat, D., Yee, A., Dharamsi, A., Kluger, Y., Savchenko, A. et al. (2000). Structural proteomics of an archaeon. Nat Struct Biol, 7, 903-9. |
| 60. | Booth, V., Koth, C. M., Edwards, A. M. & Arrowsmith, C. H. (2000). Structure of a conserved domain common to the transcription factors TFIIS, elongin A, and CRSP70. J Biol Chem, 275, 31266-8. |
| 61. | Syme, C. D., Blanch, E. W., Holt, C., Jakes, R., Goedert, M. et al. (2002). A Raman optical activity study of rheomorphism in caseins, synucleins and tau. New insight into the structure and behaviour of natively unfolded proteins. Eur J Biochem, 269, 148-56. |
| 62. | McGuffin, L. J., Bryson, K. & Jones, D. T. (2000). The PSIPRED protein structure prediction server. Bioinformatics, 16, 404-5. |
| 63. | Andreeva, A., Howorth, D., Brenner, S. E., Hubbard, T. J., Chothia, C. et al. (2004). SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res, 32, D226-9. |
| 64. | Liu, J. & Rost, B. (2004). Sequence-based prediction of protein domains. Nucleic Acids Res, 32, 3522-30. |
| 65. | Enari, M., Sakahira, H., Yokoyama, H., Okawa, K., Iwamatsu, A. et al. (1998). A caspase-activated DNase that degrades DNA during apoptosis, and its inhibitor ICAD. Nature, 391, 43-50. |
| 66. | Zhou, P., Lugovskoy, A. A., McCarty, J. S., Li, P. & Wagner, G. (2001). Solution structure of DFF40 and DFF45 N-terminal domain complex and mutual chaperone activity of DFF40 and DFF45. Proc Natl Acad Sci U S A, 98, 6051-5. |
| 67. | Ekman, D., Light, S., Bjorklund, A. K. & Elofsson, A. (2006). What properties characterize the hub proteins of the protein-protein interaction network of Saccharomyces cerevisiae? Genome Biol, 7, R45. |
| 68. | Haynes, C., Oldfield, C. J., Ji, F., Klitgord, N., Cusick, M. E. et al. (2006). Intrinsic disorder is a common feature of hub proteins from four eukaryotic interactomes. PLoS Comput Biol, 2, e100. |
| 69. | Patil, A. & Nakamura, H. (2006). Disordered domains and high surface charge confer hubs with the ability to interact with multiple proteins in interaction networks. FEBS Lett, 580, 2041-5. |
| 70. | Li, S., Armstrong, C. M., Bertin, N., Ge, H., Milstein, S. et al. (2004). A map of the interactome network of the metazoan C. elegans. Science, 303, 540-3. |
| 71. | Hermjakob, H., Montecchi-Palazzi, L., Lewington, C., Mudali, S., Kerrien, S. et al. (2004). IntAct: an open source molecular interaction database. Nucleic Acids Res, 32, D452-5. |
| 72. | Gaasterland, T. (1998). Structural genomics taking shape. Trends in Genetics, 14, 135. |
| 73. | Sali, A. (1998). 100,000 protein structures for the biologist. Nature Structural Biology, 5, 1029-1032. |
| 74. | Redfern, O., Grant, A., Maibaum, M. & Orengo, C. (2005). Survey of current protein family databases and their application in comparative, structural and functional genomics. J Chromatogr B Analyt Technol Biomed Life Sci, 815, 97-107. |
| 75. | Liu, J. & Rost, B. (2004). CHOP proteins into structural domain-like fragments. Proteins, 55, 678-88. |
| 76. | Liu, J. & Rost, B. (2004). CHOP: parsing proteins into structural domains. Nucleic Acids Res, 32, W569-71. |
| 77. | Rost, B. (1994). Conservation and prediction of solvent accessibility in protein families. Proteins: Structure, Function, and Genetics, 20, 216-226. |
| 78. | Eyrich, V., Marti-Renom, M. A., Przybylski, D., Fiser, A., Pazos, F. et al. (2001). EVA: continuous automatic evaluation of protein structure prediction servers. Bioinformatics, 17, 1242-1243. |
| 79. | Koh, I. Y. Y., Eyrich, V. A., Marti-Renom, M. A., Przybylski, D., Madhusudhan, M. S. et al. (2003). EVA: evaluation of protein structure prediction servers. Nucleic Acids Research, 31, 3311-3315. |
| 80. | Sander, C. & Schneider, R. (1991). Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins, 9, 56-68. |
| 81. | Rost, B. (1999). Twilight zone of protein sequence alignments. Protein Engineering, 12, 85-94. |
| 82. | Altschul, S. F., Madden, T. L., Schaeffer, A. A., Zhang, J., Zhang, Z. et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25, 3389-3402. |
| 83. | Przybylski, D. & Rost, B. (2002). Alignments grow, secondary structure prediction improves. Proteins: Structure, Function, and Genetics, 46, 195-205. |
| 84. | Nair, R. & Rost, B. (2005). Mimicking cellular sorting improves prediction of subcellular localization. J Mol Biol, 348, 85-100. |
| 85. | Rost, B. & Sander, C. (1993). Prediction of protein secondary structure at better than 70% accuracy. Journal of Molecular Biology, 232, 584-599. |
| 86. | Schlessinger, A. & Rost, B. (2005). Protein flexibility and rigidity predicted from sequence. Proteins, 61, 115-26. |
| 87. | Ofran, Y. & Rost, B. (2003). Predict protein-protein interaction sites from local sequence information. FEBS Letters, 544, 236-239. |
| 88. | Schlessinger, A., Yachdav, G. & Rost, B. (2006). PROFbval: predict flexible and rigid residues in proteins. Bioinformatics. |
| 89. | Petrey, D., Xiang, Z., Tang, C. L., Xie, L., Gimpelev, M. et al. (2003). Using multiple structure alignments, fast model building, and energetic analysis in fold recognition and homology modeling. Proteins, 53 Suppl 6, 430-5. |
| Contact: admin@rostlab.org | Version: Jun 4, 2007 |