| Title: | Natively unstructured regions in proteins identified from contact predictions |
| Author: | Avner Schlessinger , Marco Punta & Burkhard Rost |
| Quote: | Bioinformatics, 2007, 23:2376-84 |
| 1 | Dept. of Biochemistry and Molecular Biophysics, Columbia University, 630 West 168th Street, New York, NY 10032, USA |
| 2 | Columbia University Center for Computational Biology and Bioinformatics (C2B2), 1130 St. Nicholas Avenue Rm 802, New York, NY 10032, USA |
| 3 | North East Structural Genomics Consortium (NESG), Department of Biochemistry and Molecular Biophysics, Columbia University, 630 West 168th Street, New York, NY 10032, USA |
| * | Corresponding author: Avner "Schlessinger AT rostlab.org" URL http://www.rostlab.org/ Tel: +1-212-851-4669, fax: +1-212-305-7932 |
This article is published in (Bioinformatics, issue, 2007 and pages) © copyright OUP (2007). OUP is the only authorized source. All copying of this article including placing on another website requires the written permission of the copyright owner.
Motivation: Natively unstructured (also dubbed intrinsically disordered) regions in proteins lack a defined 3D structure under physiological conditions and often adopt regular structures under particular conditions. Proteins with such regions are overly abundant in eukaryotes, they may increase functional complexity of organisms, and they usually evade structure determination in the unbound form. Low propensity for the formation of internal residue contacts has been previously used to predict natively unstructured regions.
Results: We combined PROFcon predictions for protein-specific contacts with a generic pairwise potential to predict unstructured regions. This novel method, Ucon, outperformed the best available methods in predicting proteins with long unstructured regions. Furthermore, Ucon correctly identified cases missed by other methods. By computing the difference between predictions based on specific contacts (approach introduced here) and those based on generic potentials (realized in other methods), we might identify unstructured regions that are involved in protein-protein binding. We discussed one example to illustrate this ambitious aim. Overall, Ucon added quality and an orthogonal aspect that may help in the experimental study of unstructured regions in network hubs.
Availability: http://www.predictprotein.org/submit_ucon.html
Contact: as2067@columbia.edu
Key words: natively unstructured proteins, predicting globularity, protein disorder, protein structure prediction, protein function prediction, internal residue contact prediction
| 1D structure | one-dimensional (e.g. sequence or string of residue secondary structure or numbers for residue solvent accessibility) |
| 2D structure | two-dimensional structure, i.e. maps that specify for all pairs of residues ij in a protein how close in space these residues are |
| 3D structure | three-dimensional structure, i.e. co-ordinates of all residues/atoms in a protein |
| Ångström (Å) | = 0.1 nm |
| CASP | Critical Assessment of methods of protein Structure Prediction [1, 2] |
| DISOPRED2 | SVM-based prediction of disordered regions (based on missing coordinates) [3] |
| DisProt | database of proteins with experimentally characterized disorder [4] |
| FoldIndex | a method that predicts whether a given protein will fold based on hydrophobicity/net charge [5] |
| DSSP | Dictionary of Secondary Structure of Proteins: program and database assigning secondary structure and solvent accessibility for proteins of known 3D structure [6] |
| IUPred | prediction of unstructured regions based on pairwise statistical potential (different versions for missing coordinates and for DisProt data) [7] |
| RONN | artificial neural-network based disorder predictor trained on missing coordinates [8] |
| NORSnet | prediction of unstructured loops [9] |
| PDB | the Protein Data Bank with full coordinates of 3D structures [10] |
| PROFcon | a profile neural network based method that predicts intra-chain interactions between residues [11] |
| PROFsec | a profile neural network based method that predicts secondary structure [12, 13, 14] |
| SVM | Support Vector Machine |
| SWISS-PROT | curated database of sequences and annotations [15] |
| Ucon | prediction of natively unstructured regions through contacts (method introduced here) |
Many different flavors of unstructured regions. Regions in proteins that do not adopt well-ordered three-dimensional (3D) structures under physiological conditions are often dubbed natively unstructured, disordered, intrinsically unstructured, or unfolded. Typical are proteins that adopt stable 3D structures only upon binding to substrates to carry out their function (Fig. 1), or proteins that perform a particular function in their "unstructured state" [16, 17, 18]. The better our experimental and computational means of identifying such proteins, the more we realize that they come in a great variety: some adopt regular secondary structure (helix or strand) upon binding, some remain loopy; some proteins are almost entirely unstructured, others have only short unstructured regions [19, 20, 21, 22, 23, 24, 25, 26, 27, 28]. There is no single way to define "unstructured regions". Here, we refer to a region as unstructured if it appears to lack a defined 3D structure by any of the following experimental techniques: circular dichroism spectroscopy (CD), nuclear magnetic resonance spectroscopy (NMR), X-ray crystallography, or protein proteolysis. This is a broad definition that collects a plethora of phenomena, as exercised, e.g., in DisProt [4]. However, many unstructured regions are covered neither by DisProt nor by existing prediction methods [29].
Fig. 1: Unstructured regions may become structured upon binding. A sensitive mechanism of regulation for many cellular processes is facilitated through the existence of proteins with natively unstructured regions. (A) Unstructured regions are often only unstructured in isolation and are therefore inactive in their native state. (B) Through binding a substrate (which can also be unstructured), many unstructured regions may become ordered/folded, thereby activating a particular function. (C) The interaction between unstructured region and substrate is often reversible, and the complex may dissociate upon small changes in the environment. (D) Some methods that predict unstructured regions resemble the identification of metal-like regions that have low preferences to bind to anything. (E) In contrast, some unstructured regions will have high binding affinity, but only for external binding. In the image, we picture this type as Velcro-like, i.e. the reversible connection between hooks (French: crochets) and loops (French: velours). In this image the unstructured regions are the loops.
Unstructured regions are important for biomedicine. Proteins with unstructured regions are increasingly implicated in important functional activities. Due to their intrinsic adaptability, they participate in many regulatory processes such as the transcription and translation machineries, signal transduction pathways, and macromolecular transport by the nuclear pore complex [30, 31, 32, 33] . Protein regions involved in alternative splicing and transcription are also often unstructured [34, 35] . Defective proteins in regulatory processes leading to uncontrolled proliferation of cells have been repeatedly associated with cancer [36] . Unstructured regions appear to play a critical role in initiating malignant tumors. For instance, translocation of genes that fuse the unstructured N-terminus of CBP (CREB binding protein) with the MLL (mixed lineage leukemia) gene is associated with leukemia [37] . A large-scale analysis revealed this to be a frequent theme: cancer-related proteins are significantly enriched in unstructured regions [31] . Partially unfolded intermediates can trigger or promote diseases, e.g. some human cataracts are associated with the aggregation of partially unfolded intermediates of HγD-Cryst [38] . The aggregation of proteins with unstructured regions has also been associated with neurodegenerative diseases, e.g., Huntington's disease is directly linked to the aggregation of polyglutamine from CBP [39] .
Many phenomena, many approaches to prediction. Many concepts are pursued to predict unstructured regions (Fig. S1, Supporting Online Material). Some methods utilize machine learning algorithms to discriminate between residues that appear to be regularly structured and residues that are invisible in the electron density maps from X-ray structures [30, 40, 3, 41, 8]. Other predictors target the difference in amino acid propensities between unstructured and well-ordered regions. Disordered regions tend to have high net charge, low hydrophobicity [20] and high loop content [42]. Methods that implement this idea usually identify long and biologically relevant unstructured regions [5] and usually miss short unstructured regions [43]. Unstructured regions appear to have lower contact densities (eqn. 1) than well-structured regions and can therefore be identified by average contact propensities [44, 7, 45]. Such methods use average amino acid contact propensity scores (derived from regular structures) with or without pairwise interaction energy matrices. The SCPRED method uses neural networks to identify clusters of internally contacting residues. This method was found to be useful in identifying residues in long unstructured regions for a few proteins [46, 47].
Some methods combine more than one approach. One example is a neural network trained both on residues missing in electron density maps and on residues in high B-factor loops [40]. Another example is a meta-method that uses two different prediction methods, each optimized on unstructured regions of different lengths [48]. On average, the combined methods typically outperform individual approaches.
NORS are long regions with no regular secondary structure, i.e. ≥ 70 sequence-consecutive residues depleted of predicted helices and strands that are relatively exposed to the solvent (Fig. S1D, Supporting Online Material). Most NORS regions are unstructured, but many unstructured regions are not NORS [49, 50] . NORSnet is a new method that succeeded in the distinction between natively unstructured and well-structured loops [51] . On the one hand, different methods capture different aspects from the plethora of unstructured regions. On the other hand, very few methods capture specific aspects and all methods still miss many long unstructured regions identified experimentally (GT Montelione, unpublished).
Here, we focused on one particular aspect of unstructured regions, namely their intrinsic low contact density ( eqn. 1). It is this intrinsic "flexibility" that makes unstructured regions so adaptable. Not surprisingly then, predictions of unstructured regions based on average contact propensities perform very well [44, 7] . We hypothesized that a method based on protein-specific internal contact predictions could specifically identify unstructured regions relevant for protein interactions. To test this assumption, we developed a novel method, Ucon (prediction of natively unstructured regions through contacts), that identified unstructured regions in DisProt [4] . In our analysis, Ucon appeared more accurate than commonly used methods by many measures, and it identified different proteins with unstructured regions.
"Unstructured" proteins. The terminology "unstructured proteins" can be misleading because we do not always have experimental data to establish an entire protein as natively unstructured. Typically, the literature refers to intrinsically disordered or natively unstructured proteins as those that have at least a certain number of residues in unstructured regions. Here, we distinguished between well-structured proteins and proteins with at least one stretch of ≥30 consecutive residues in a predicted unstructured region.
Contact density. The contact density was defined as:
| (Eq. 1) |
where w stood for a window of w sequence-consecutive residues; these could correspond to a short, local segment in a protein or to the entire protein (with w=N and N being the number of residues in a protein). The contact term described the spatial relation (contact/not) between two residues (i,k). Our method did not use any particular threshold for the definition of a contact; however, the underlying prediction method PROFcon was trained on the standard applied for CASP [52], i.e. (i,k) were considered to be in contact if their C-beta atoms were closer than 8 Ångström (0.8 nm).
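The contact definition above can be illustrated with a short sketch. It assumes that the contact density of a window is the number of contacting residue pairs inside the window divided by the window length w; this normalization is an assumption made for illustration and may differ from Eq. 1. The function names and the toy coordinates are hypothetical.

```python
import math

def contact_map(cb_coords, cutoff=8.0):
    """Binary contact map from C-beta coordinates given in Angstrom.

    Two residues count as in contact if their C-beta atoms are closer
    than `cutoff` (8 Angstrom is the CASP convention quoted in the text).
    """
    n = len(cb_coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for k in range(i + 1, n):
            if math.dist(cb_coords[i], cb_coords[k]) < cutoff:
                cmap[i][k] = cmap[k][i] = 1
    return cmap

def contact_density(cmap, start, w):
    """Contacting pairs inside the window [start, start + w), per residue.

    Assumed normalization: number of contacting pairs divided by w.
    """
    end = start + w
    pairs = sum(cmap[i][k] for i in range(start, end) for k in range(i + 1, end))
    return pairs / w
```

With w set to the chain length, the same function gives a whole-protein value; a sliding `start` gives a local profile.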
Basic concept of prediction method. We calculated the contribution of each residue i to the energy:
| (Eq. 2) |
where cik was the contact propensity for the pair (i,k) as predicted by PROFcon and Mik was the energy between the two residues estimated from a particular pairwise interaction potential (below). L was a free parameter that was optimized (below). We calculated this energy for each residue in a chain P and smoothed the values by a moving average of length S (below). Unstructured regions were then identified as peaks in this profile (Fig. 2). In practice, we defined a threshold T above which a residue was labeled as unstructured.
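The three steps described above (per-residue score, smoothing, thresholding) can be sketched as follows. This is a minimal illustration, not the Ucon implementation: it assumes the per-residue score is the sum over partners k of the predicted contact propensity times the pair energy, it leaves out the free parameter L because its exact role is not stated here, and it uses all residue pairs although PROFcon only reports pairs with |k-i|>5. All function names are hypothetical, and any potential passed in is the caller's choice.

```python
def residue_scores(contact_probs, seq, potential):
    """Per-residue sum of contact propensity times pair energy.

    contact_probs[i][k] plays the role of c_ik (values between 0 and 1);
    potential[(a, b)] plays the role of M_ik for residue types a and b.
    """
    n = len(seq)
    scores = []
    for i in range(n):
        total = 0.0
        for k in range(n):
            if k != i:
                total += contact_probs[i][k] * potential[(seq[i], seq[k])]
        scores.append(total)
    return scores

def moving_average(values, S):
    """Centered moving average of odd length S, truncated at the chain ends."""
    half = S // 2
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed

def label_unstructured(profile, T):
    """Label a residue as unstructured if its smoothed score exceeds T."""
    return [value > T for value in profile]
```

With a potential in which favorable contacts are negative, well-packed residues receive strongly negative scores, so residues with few predicted stabilizing contacts stand out as peaks above the threshold.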
Predicting the contact matrix. Contact propensities were predicted by PROFcon [11] , one of the best methods at CASP6 [52] . Upon submission of the protein sequence, PROFcon (using features such as evolutionary profiles, predicted secondary structure and solvent accessibility) returns for each residue i a list of predicted contact propensities cik (k=1,N; N: protein length and |k-i|>5) between 0 and 1. We used these propensities to predict a contact matrix for each protein.
Pairwise interaction potential. Statistical pairwise contact potentials are knowledge-based quantities that represent the interactions between amino acid types. Many groups obtain energy-like values through Boltzmann relations of observed pairwise contact frequencies. Over 30 such potentials exist; many appear similar [53] to at least one of the Miyazawa-Jernigan potentials [54] . We therefore used the latter, and also tested a potential [55] that was used to predict unstructured regions [7] .
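The Boltzmann relation mentioned above can be sketched in its simplest form. The example assumes a random-mixing reference state in which the expected frequency of a pair is the product of the two residue-type frequencies; this is a generic inverse-Boltzmann estimate for illustration and is not the procedure used to derive the Miyazawa-Jernigan potentials. The function name and inputs are hypothetical.

```python
import math
from collections import Counter

def inverse_boltzmann(contact_pairs, composition):
    """Energy-like score e(a, b) = -ln(observed / expected), in units of kT.

    contact_pairs: iterable of (type_a, type_b) tuples, one per observed contact.
    composition:   dict mapping residue type to its overall frequency.
    Only pair types that were observed at least once receive a score.
    """
    observed = Counter(tuple(sorted(pair)) for pair in contact_pairs)
    total = sum(observed.values())
    energies = {}
    for (a, b), count in observed.items():
        expected = composition[a] * composition[b]
        if a != b:
            expected *= 2  # an unordered pair (a, b) can arise as a-b or b-a
        energies[(a, b)] = -math.log((count / total) / expected)
    return energies
```

Pairs seen more often than expected by chance receive negative (favorable) values, and pairs seen less often receive positive values.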
Data sets. As positives, we used proteins with unstructured regions from DisProt (version 3.0) [4]. This set included proteins that have experimentally verified, biologically relevant unstructured regions. We optimized our method on a subset of DisProt that included proteins with unstructured regions of ≥30 residues. As negatives (well-structured), we chose PDB proteins that were used to test PROFcon, i.e. none of the proteins had been used for its development [11]. We excluded structures with backbone atoms missing from the middle of the chain and structures that had unstructured regions in the termini (>15 residues within C- or N-term). Since PROFcon currently fails on proteins that are longer than 550 residues, we discarded these proteins from our sets.
Testing/cross-validation. UniqueProt [56] generated sequence-unique subsets: the maximal similarity between any protein used for training and testing was an HSSP-value <10 [57, 58], e.g. <31% pairwise sequence identity for >250 aligned residues. Alignments were generated by three iterations of PSI-BLAST [59] vs. UniProt with a protocol established earlier [60]. Our final cross-validation set included 174 proteins with at least one unstructured region longer than 30 residues (these 174 were a subset of 243 proteins with unstructured regions of any length), and 223 structured proteins.
In order to optimize the free parameters (T, S, L), we performed a 5-fold cross validation: we randomly divided our data into five sets (no protein pair was sequence-similar between training and test set). We chose the parameters that maximized the area under the ROC curve (AUC) on 4/5 of the data, and tested on the remaining 1/5 (test set). We rotated five times and averaged over the five test sets (Table S1, Supporting Online Material).
Of all methods that we assessed only FoldIndex gives a per-protein prediction (i.e. decides whether or not a protein is predicted to be unstructured). Therefore, we had to introduce a criterion to assess RONN, DISOPRED2, IUPred and NORSnet. We did this in the same way in which we optimized our method: First, we labeled as positives (proteins with unstructured regions) all proteins for which the method identified at least 30 consecutive residues as unstructured (other methods, Table S1, Supporting Online Material, Fig. 3 A) at a given threshold. Second, we identified the minimal number of consecutive residues predicted to be unstructured that optimized the AUC for each method for training (4/5 of data) and recorded the results for testing (1/5 of data). As some methods (RONN, NORSnet) performed better without optimization, we provided both values (see Table S1, Supporting Online Material).
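The per-protein criterion described above (a protein counts as positive if at least a given number of consecutive residues are predicted as unstructured) can be sketched as follows. The function names are hypothetical; the default of 30 residues is the starting value quoted in the text, and the per-residue labels are assumed to come from any predictor after thresholding.

```python
def longest_run(labels):
    """Length of the longest stretch of consecutive True values."""
    best = current = 0
    for flag in labels:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best

def has_long_unstructured_region(labels, min_length=30):
    """Per-protein call: True if any predicted unstructured stretch is long enough."""
    return longest_run(labels) >= min_length
```

Varying `min_length` on the training folds and recording the result on the held-out fold mirrors the optimization of the minimal region length described in the text.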
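Performance measures: We measured accuracy/specificity (Acc), coverage/sensitivity (Cov) and false positive (FP) rate by the standard formulas:
| $\mathrm{Acc} = \frac{TP}{TP+FP}$, $\mathrm{Cov} = \frac{TP}{TP+FN}$, $\mathrm{FP\ rate} = \frac{FP}{FP+TN}$ | (Eq. 3) |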
TP are the true positives (proteins with unstructured regions experimentally observed AND correctly predicted); FP are the false positives (structured proteins that are predicted to be unstructured); TN are the true negatives (observed and predicted as well-structured), and FN are the false negatives (observed to be unstructured and predicted to be structured). In analogy, we computed the accuracy and coverage for the negatives, i.e., proteins that do not have regions with 30 or more consecutive residues that are unstructured:
| $\mathrm{Acc}_{neg} = \frac{TN}{TN+FN}$, $\mathrm{Cov}_{neg} = \frac{TN}{TN+FP}$ | (Eq. 4) |
We compiled two other frequently used measures, namely the two-state accuracy Q2 (percentage of proteins correctly predicted in either of the two groups: proteins with and proteins without unstructured regions) and the arithmetic average over Acc and Acc_neg (Average accuracy). In contrast to the measures presented in eqn. 3 and eqn. 4, these values strive to reflect all aspects of expected performance simultaneously:
| $Q_2 = 100 \cdot \frac{TP+TN}{TP+FP+TN+FN}$, $\text{Average accuracy} = \frac{\mathrm{Acc} + \mathrm{Acc}_{neg}}{2}$ |
Receiver-operator curves (ROCs) were constructed by calculating FP and TP rates at different thresholds defining a positive prediction. The curves were then integrated in order to calculate the area under the curve (AUC).
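The measures and the ROC integration described above can be sketched as follows. The formulas assume the conventional definitions (accuracy as TP/(TP+FP), coverage as TP/(TP+FN), FP rate as FP/(FP+TN)); the function names are hypothetical, and the AUC is obtained by the trapezoid rule over the points produced by sweeping the decision threshold.

```python
def protein_level_measures(tp, fp, tn, fn):
    """Per-protein measures for positives and negatives, plus Q2 in percent."""
    acc = tp / (tp + fp)
    acc_neg = tn / (tn + fn)
    return {
        "acc": acc,
        "cov": tp / (tp + fn),
        "fp_rate": fp / (fp + tn),
        "acc_neg": acc_neg,
        "cov_neg": tn / (tn + fp),
        "q2": 100 * (tp + tn) / (tp + fp + tn + fn),
        "average_accuracy": (acc + acc_neg) / 2,
    }

def roc_auc(scores, is_positive):
    """Area under the ROC curve; assumes both classes are present.

    Proteins are ranked by decreasing score; each distinct score is used
    as a threshold, giving one (FP rate, TP rate) point per threshold.
    """
    n_pos = sum(1 for flag in is_positive if flag)
    n_neg = len(is_positive) - n_pos
    ranked = sorted(zip(scores, is_positive), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Tied scores cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

A perfect ranking gives an area of 1.0 and a random ranking gives about 0.5, which is the baseline against which the AUC values in the text should be read.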
Fig. 2: Schematic representation of the prediction method. We run each protein against the database with PSI-BLAST in order to create position-specific profiles (a, b). The profiles, along with other sequence-derived information such as predicted secondary structure and solvent accessibility, constitute the input to the neural network-based contact prediction method, PROFcon (c), which predicts 2D contact maps (d). Each dot in the 2D-map represents a predicted residue-residue interaction. The darker the dot, the higher the probability that the corresponding two residues interact. Next, we multiply the 2D-maps with the energy-like statistical potential defined by Jernigan and Miyazawa (e) to derive a position-specific score, creating a profile for each sequence (f).
Predicted contacts capture natively unstructured regions. PROFcon predicted protein-specific internal contact maps that captured the low contact density (eqn. 1) of natively unstructured regions. Particular examples suggested that PROFcon alone could, to some extent, discriminate between well-ordered and unstructured regions (Fig. S2, Supporting Online Material). To test this hypothesis, we calculated a contact-based "unstructured propensity" (eqn. 2) for each residue in our data set (with Mik=1). After parameter optimization under five-fold cross-validation (Methods), this score performed substantially better than random (Fig. 3A, Table S1, Supporting Online Material). Performance was significantly higher when considering only the strongest PROFcon predictions (eqn. 2 with Mik=1, cik>0.5).
Statistical potential combined with predicted contacts improves. The energetic stability of a particular region ultimately determines whether or not that region is unstructured. Therefore, we weighted each predicted contact (cik in eqn. 2) with an energy term (Mik) representing the specific contribution of that predicted interaction to stability. This sequence-based score, which combined the intra-chain contacts explicitly predicted by PROFcon (cik, in eqn. 2) with statistical pairwise potentials (Mik in eqn. 2), clearly discriminated between well-structured and unstructured regions (Fig. S3, red and blue curves, Supporting Online Material).
Fig. 3: Assessing the contribution of different sources of information. (A) Comparison between five internal methods (dark blue - PROFcon predictions alone; light blue - high-probability PROFcon predictions only; yellow - Miyazawa-Jernigan potential; green - our final combined method, Ucon; Ucon_MJ (dark green) and Ucon_KD (light green) differed in the potential) and five external methods (gray - RONN; purple - IUPred; red - NORSnet; dark red - FoldIndex; orange - DISOPRED2). Ucon performed significantly better than the other methods (see Table S1, Supporting Online Material, for detail). (B) We compared the 30 proteins with the strongest signal for long unstructured regions by the three methods presented (numbers in circles are mutually exclusive). While Ucon (green) and the method that used only the pairwise Miyazawa-Jernigan statistical potential (yellow) yielded similar results, the method that utilized only predicted contact propensity (blue) identified very different proteins. (C) We compared the 80 proteins that gave the strongest signal for long unstructured regions by four methods. NORSnet and Ucon appeared the most orthogonal, identifying 9 and 8 unique proteins, respectively, that were not identified by any other method. Conversely, the most overlapping pairs of methods were IUPred-Ucon and DISOPRED2-NORSnet, sharing 66 and 64 proteins, respectively.
Final method performed best. When using the combined score to discriminate unstructured from well-structured regions, the results were considerably better than when using predicted contacts alone ( Fig. 3 A). Furthermore, Ucon appeared more accurate than the best state-of-the-art prediction methods for unstructured regions ( Fig. 3 A, Table S1, Supporting Online Material).
On the one hand, position-specific contact predictions provided an edge over other approaches. On the other hand, methods based on statistical potentials calculating the per-residue propensity for unstructured regions by simply assuming equal contact probability within a sliding window [7, 5] performed very well. FoldIndex, for instance, which uses a simple one-body hydrophobic potential [61] divided by the net-charge contribution, yielded an AUC of 0.747; IUPred using a pairwise energy potential specifically optimized to capture unstructured regions yielded an AUC=0.878.
Methods that utilize machine learning algorithms to learn the features of residues that do not have coordinates in the PDB, usually predict regions in DisProt rather accurately [7, 48] . This surprises since the two data sets (development on PDB, test on DisProt) differ, e.g. in amino acid composition and in the length of unstructured regions [62] . On our set, DISOPRED2 and RONN also performed very well (AUC of 0.868 and 0.887, respectively).
Specific contact maps identified different proteins. We looked at the 30 correct predictions that gave the strongest signal for long unstructured regions by several prediction methods. In particular, we compared predictions by the method using only statistical contact potentials (potentials-only), by the method based only on predicted PROFcon contact maps (PROFcon-only) and by the merger of the two, namely Ucon (Methods). PROFcon-only identified different proteins than the other methods: 25 of the 30 proteins (83%) were predicted only when using contact maps alone (Fig. 3B).
Next, we compared Ucon to three methods based on different concepts (Fig. 3C). Each method identified its list of 80 proteins with the strongest signal for unstructured regions. Note that for all these methods the cutoff that resulted in 80 true positives also yielded ≥95% accuracy on this set, i.e., the 80 were highly reliable predictions. This analysis revealed several points. First, only 46 of the 80 (<60%) were identified by all methods. This suggested that the methods focused on different aspects of unstructured regions. Second, the smallest overlap between any pair of prediction methods was between NORSnet and Ucon (54 proteins). As NORSnet focuses on the identification of natively unstructured loops, this result suggested that Ucon identified unstructured regions that were most different from unstructured loops. In contrast, IUPred and Ucon had the highest overlap (66 of 80 proteins). Since both methods estimate an energy-related score, for many of the overlapping proteins the energetic contribution might be the main determining factor for the lack of regular structure. For instance, proteins such as DNA repair protein XPAC, Stathmin, and Nucleoplasmin have long stretches of destabilizing charged residues. Although Ucon and IUPred overlapped, they still differed for 14 of the 80 proteins (17.5%), i.e. even these rather similar methods differed importantly. We did not find a statistically significant trend in the DisProt function annotations for the 80 proteins.
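The overlap analysis above amounts to set operations on the top-ranked protein lists. A minimal sketch, assuming each method contributes a list of protein identifiers; the function names and the identifiers in the usage note are hypothetical placeholders, not data from this study.

```python
from itertools import combinations

def pairwise_overlaps(top_lists):
    """Number of shared proteins for every pair of methods."""
    return {
        (a, b): len(set(top_lists[a]) & set(top_lists[b]))
        for a, b in combinations(sorted(top_lists), 2)
    }

def shared_by_all(top_lists):
    """Proteins that appear in the top list of every method."""
    return set.intersection(*(set(ids) for ids in top_lists.values()))

def unique_to(method, top_lists):
    """Proteins found only in the top list of the given method."""
    others = set()
    for name, ids in top_lists.items():
        if name != method:
            others |= set(ids)
    return set(top_lists[method]) - others
```

For example, with `top_lists = {"A": ["p1", "p2", "p3"], "B": ["p2", "p3", "p4"], "C": ["p3", "p5"]}`, the pair with the smallest overlap marks the two most orthogonal methods, and `unique_to` lists the proteins that only one method recovers.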
Case study: MAX. MAX is a sequence-specific transcription factor that binds DNA and activates or represses, depending on its binding partner [63]. While MAX is unstructured in isolation, it adopts a basic helix-loop-helix leucine zipper (bHLH-LZ) fold upon binding to DNA and to its target protein (PDB identifier: 1AN2 [64], Fig. S4A). Since unstructured regions that bind DNA are often enriched in charged residues, prediction methods based on statistical potentials (such as IUPred and FoldIndex, Fig. S5, Supporting Online Material) easily identify these regions as unstructured. In contrast, methods that are trained on missing residues might fail, as this protein contains residues that appear in the PDB (in its bound form). Indeed, DISOPRED2 [3] missed most of the unstructured regions in MAX (Fig. S6, Supporting Online Material). Conversely, PROFcon predicted MAX to have few high-probability internal contacts, suggesting that the helices do not interact internally and that MAX may need to bind to an external target in order to adopt regular structure (Fig. S4). Our final method, Ucon, which combined PROFcon output with the statistical potential, correctly identified this helical region as unstructured (Fig. S4C).
MAX becomes helical upon binding DNA; this is representative of many transitions from unstructured to well-structured [65]. PROFsec predicted 30% of the DisProt residues as helical (data not shown). Unstructured regions that undergo a disorder-order transition have slightly more helix (35-36%); in fact, their helix content resembles that of well-structured regions [65]. When we applied our new method that exclusively relied on predicted contacts (PROFcon-only) to create a list of the most likely unstructured residues (at an estimated level of 85% accuracy), we found that about 35% of these residues were predicted in helices. This was another indication that our new method Ucon captured "structured" proteins with unstructured regions. In other words, the residues that account for the differential between the two numbers (30% vs. 35%) are candidates for the difference between never-bind (Fig. 1D) and Velcro-like binding (Fig. 1E).
Fig. 4: Low intra-chain contact propensities for the lever arm of Myosin striated muscle. PROFcon can capture unstructured regions that bind other proteins. (A) The complex structure of Myosin striated muscle heavy chain (in green) bound to Myosin essential light chain (orange) and Myosin regulatory light chain (magenta) has been determined by Cohen et al. (PDB identifier 1SR6 [67]). The interface of the heavy chain with the two other molecules is mediated through a long extended helix (the lever arm) and contains many hydrophobic interactions. (B) The PROFcon score has been translated to a 1D score (eqn. 2 with Mik=1) reflecting the propensity of a residue to be internally in contact with other residues. The residues in the lever arm (in gray) have very low contact propensity; because they are also rather hydrophobic, they are likely to bind externally. Note that we omitted the first 55 residues of the protein because PROFcon cannot handle very long sequences.
Case study: unstructured region in protein-protein interaction. Myosin is a primary protein involved in muscle contraction. Its regulatory domain is composed of a long helical stretch (residues 765-832 in SWISS-PROT identifier MYS_AEQIR) named the Lever arm (Fig. 4A). Experimental evidence suggests that the Lever arm is natively unstructured [66, 67]. It has also been identified as a Molecular Recognition Feature (MoRF), i.e., it becomes structured upon binding to its target [28]. The Lever arm has relatively few charged residues, and interacts with its two binding partners through hydrophobic interfaces: according to the Protein-Protein Interaction server (http://www.biochem.ucl.ac.uk/bsm/PP/server/), about 65% of the interfaces between the Lever arm and its binding partners are hydrophobic. Since most statistical potential-based methods tend to take hydrophobic residues as indicators of well-structured regions, they likely miss this region (Fig. S7, Supporting Online Material). In contrast, PROFcon uses information such as predicted secondary structure and solvent accessibility; it considers the fact that the long helix does not have a break (that can lead to packing) and has surface-exposed hydrophobic residues. Thus, for the PROFcon-based contact-only method (Mik=1, eqn. 2), this region is likely to be unstructured because it will bind externally rather than internally. The Lever arm was exactly the type of example that we hoped to identify by the differential analysis of the methods that we introduced here. However, so far, we have not found any other example for which we can verify the prediction.
Why do unstructured regions have fewer contacts? Natively unstructured regions have unusually low contact densities. This appears to be due to several factors: (1) unstructured regions are deficient in hydrophobic residues, (2) they are enriched in residues such as glycine and proline that break helices or strands, and (3) they have high net charges [20, 68, 65, 62, 26]. These properties may prevent proteins with unstructured regions from folding independently through mechanisms such as hydrophobic collapse, the formation of regular secondary structures (nucleation), or their combination (nucleation condensation) [69]. The interactions between unstructured regions and their external binding partners often result in the formation of regular secondary structure (mainly helices [65, 18]) and in the cancellation of destabilizing charges.
However, even in their bound, well-structured form, unstructured regions tend to adopt unusual, relatively loopy conformations [65, 28] . We have shown [51] that the previously observed [32, 28, 70] abundance of unstructured regions in proteins with many interaction partners (hubs) is more extensive for loopy regions than for the type of unstructured region picked up by IUPred.
How well do we predict unstructured regions today? Comparing our cross-validated Ucon to methods that have been developed on largely overlapping data sets likely over-estimates the performance of the others. Nevertheless, Ucon performed best amongst publicly available methods that were shown to be highly accurate on DisProt. The difference between Ucon and the best runner-up not developed in our group may appear small, but it exceeded the levels that distinguished winners from runners-up at CASP7 [71], and it did so on much larger sets of positives.
DisProt captures only some aspects of unstructured regions. For instance, CASP focuses on a very different aspect, namely residues not visible in electron density maps from X-ray crystallography. This concept, originally introduced by Keith Dunker [72], has many limitations, but it clearly is the only completely automated and somewhat objective definition that creates large data sets. Irrespective of the additional filter applied to the minimal length of a region that qualifies as disorder, data from otherwise well-structured proteins is dominated by short regions. Ucon will most likely fail for regions shorter than 10 residues, because PROFcon is optimized to predict long-range contacts. Over 80% of the disorder residues considered in CASP7 were in regions with ≥10 residues [71]. In contrast, unstructured regions that actually prevent folding into a defined 3D structure in isolation are typically considerably longer than this.
New experimental techniques based on NMR are about to provide the first detailed, objective, and large-scale data sets on these types of regions (GT Montelione, in preparation). Preliminary tests [51] suggest that even methods developed for DisProt-type rather than CASP-type unstructured regions are significantly less successful when assessed in light of these new data. Clearly, the complete universe of unstructured regions is still unknown, and the regions mapped out today are already extremely varied. We will need different methods for different aspects. Ucon targets the identification of long unstructured regions, and appears to be most successful at this.
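The distinction between short and long regions discussed above depends on a minimal-length filter applied to per-residue annotations. The following Python sketch shows one way such a filter could be applied. The function name `disordered_regions` and the label encoding ("D" for disordered, "O" for ordered) are assumptions made for illustration. They are not the CASP or DisProt formats.

```python
from itertools import groupby


def disordered_regions(labels, min_length=1):
    """Return (start, end) pairs for runs of 'D' at least min_length long.

    Positions are 0-based and `end` is exclusive, as in Python slices.
    """
    regions = []
    pos = 0
    for state, run in groupby(labels):
        length = sum(1 for _ in run)
        if state == "D" and length >= min_length:
            regions.append((pos, pos + length))
        pos += length
    return regions


# Toy annotation: one 3-residue and one 12-residue disordered stretch.
labels = "OODDDOOOO" + "D" * 12 + "OO"
print(disordered_regions(labels))                 # both regions
print(disordered_regions(labels, min_length=10))  # only the long region
```

Raising `min_length` from 1 to 10 drops the short stretch. This is the kind of region for which the text expects a method built on long-range contact predictions to fail.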
The population of experimentally characterized unstructured regions is extremely heterogeneous [17]. It is therefore not surprising that different methods focus on different flavors of unstructured regions [68]. An extreme case is NORSnet, which focuses on the identification of unstructured loops [51]. The second most distinctive method was Ucon. The observation that the smallest overlap between the outputs of any two prediction methods was between NORSnet and Ucon (54 proteins), coupled with the observation that unstructured regions identified by Ucon have high predicted helix content, suggested that Ucon identified unstructured regions that become well-structured upon binding. In summary, Ucon adds three important virtues to the pool of prediction methods: it is highly accurate, quite orthogonal to other methods, and it enables some specific interpretation of the meaning of its differences to methods such as NORSnet.
Even low-accuracy predictions can be useful. Although predictions of inter-residue contacts have been improving over the years, many researchers continue to perceive contact predictions as relatively inaccurate [52]. However, Ucon is not the first example of a successful application of contact predictions to protein structure and function prediction [73, 74, 47, 75]. One of the most interesting aspects of this particular application might be that Ucon could only succeed because PROFcon predicted many specific long-range contacts correctly. Further evidence for the usefulness of contact predictions was that we could correctly identify unstructured regions using contact predictions alone; some of those were not identified by any other method that we tested.
Conclusions. We introduced the combination of two unique approaches to create Ucon, a new method for the prediction of unstructured regions. Ucon compared favorably with methods utilizing either one of these approaches alone for proteins with long (>30 residues) unstructured regions. We remained most surprised by the result that methods based on position-specific and position-independent preferences performed similarly on average. A position-independent method depends only on amino acid composition, i.e. it is "blind" to the specific positions in the sequence (e.g. the sequence AGEREG gives the same preference as does REGGAE). Such a simplification obviously ignores the importance of folding pathways that we know matter greatly. The explanation for the minute difference between these two methods might be that there is a great variety of unstructured regions. Some serve as buffering or filling material. These regions just have to be selected to stay clear of binding to anything. Other unstructured regions, in contrast, have strong binding preferences; however, they are selected not to bind internally but to bind through external transient protein-protein interactions. We provided evidence that our method identified different proteins with unstructured regions than existing methods. Furthermore, we showed that for at least one example we could specifically identify regions involved in external protein-protein interactions. Thus, this method might become rather useful for the prediction of protein function, as well as for more detailed experimental studies of natively unstructured regions.
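The AGEREG/REGGAE example can be checked directly. The short Python sketch below contrasts a purely compositional description with a simple position-aware one. Both helper functions (`composition`, `ordered_pairs`) are hypothetical illustrations and do not correspond to the pairwise potential or to the PROFcon features used in Ucon.

```python
from collections import Counter


def composition(seq):
    """Amino acid counts: a position-independent description."""
    return Counter(seq)


def ordered_pairs(seq, separation=1):
    """Counts of residue pairs (seq[i], seq[i + separation]).

    Unlike composition, this depends on the order of residues.
    """
    return Counter(zip(seq, seq[separation:]))


a, b = "AGEREG", "REGGAE"
print(composition(a) == composition(b))      # True: same composition
print(ordered_pairs(a) == ordered_pairs(b))  # False: different neighbors
```

Any score computed from `composition` alone cannot distinguish the two sequences, whereas any feature that records which residues sit at which relative positions can.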
Thanks to Lawrence Shapiro, Barry Honig (Columbia) and Mickey Kosloff (Duke) for discussions; to Andrew Kernytsky (Columbia) for comments on the manuscript; to Jinfeng Liu and Guy Yachdav (Columbia) for computer assistance; to Dariusz Przybylski (Columbia) for preliminary information and programs. This work was supported by grants from the National Library of Medicine (NLM, RO1-LM07329-01), by a grant to the Northeast Structural Genomics Consortium (P50 GM62413), and by the grant U54-GM072980 from the NIH. Last, not least, thanks to Keith Dunker (DisProt, Indiana University), and Phil Bourne (PDB, San Diego Univ.), and their crews for maintaining excellent databases and to all experimentalists who enabled this analysis by making their data publicly available.
| 1. | Moult, J., Pedersen, J. T., Judson, R. & Fidelis, K. (1995). A large-scale experiment to assess protein structure prediction methods. Proteins: Structure, Function, and Genetics, 23, ii-iv. |
| 2. | Moult, J., Fidelis, K., Rost, B., Hubbard, T. & Tramontano, A. (2005). Critical assessment of methods of protein structure prediction (CASP)-Round 6. Proteins, 61, 3-7. |
| 3. | Ward, J. J., Sodhi, J. S., McGuffin, L. J., Buxton, B. F. & Jones, D. T. (2004). Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. Journal of Molecular Biology, 337, 635-645. |
| 4. | Vucetic, S., Obradovic, Z., Vacic, V., Radivojac, P., Peng, K. et al. (2005). DisProt: a database of protein disorder. Bioinformatics, 21, 137-40. |
| 5. | Prilusky, J., Felder, C. E., Zeev-Ben-Mordehai, T., Rydberg, E. H., Man, O. et al. (2005). FoldIndex: a simple tool to predict whether a given protein sequence is intrinsically unfolded. Bioinformatics, 21, 3435-8. |
| 6. | Kabsch, W. & Sander, C. (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen bonded and geometrical features. Biopolymers, 22, 2577-2637. |
| 7. | Dosztanyi, Z., Csizmok, V., Tompa, P. & Simon, I. (2005). The pairwise energy content estimated from amino acid composition discriminates between folded and intrinsically unstructured proteins. J Mol Biol, 347, 827-39. |
| 8. | Yang, Z. R., Thomson, R., McNeil, P. & Esnouf, R. M. (2005). RONN: the bio-basis function neural network technique applied to the detection of natively disordered regions in proteins. Bioinformatics, 21, 3369-76. |
| 9. | Schlessinger, A., Liu, J. & Rost, B. (2007). Natively unstructured loops differ from other loops. PLoS Computational Biology, submitted, 2nd revision March 2007. |
| 10. | Berman, H. M., Battistuz, T., Bhat, T. N., Bluhm, W. F., Bourne, P. E. et al. (2002). The Protein Data Bank. Acta Crystallogr D Biol Crystallogr, 58, 899-907. |
| 11. | Punta, M. & Rost, B. (2005). PROFcon: novel prediction of long-range contacts. Bioinformatics, 21, 2960-8. |
| 12. | Rost, B. (1996). PHD: predicting one-dimensional protein structure by profile based neural networks. Methods in Enzymology, 266, 525-539. |
| 13. | Rost, B. (2001). Protein secondary structure prediction continues to rise. Journal of Structural Biology, 134, 204-218. |
| 14. | Rost, B. (2005). How to use protein 1D structure predicted by PROFphd. In The Proteomics Protocols Handbook (Walker, J. E., eds.), pp. 875-901, Humana, Totowa NJ. |
| 15. | Bairoch, A. & Apweiler, R. (2000). The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res, 28, 45-8. |
| 16. | Dyson, H. J. & Wright, P. E. (2002). Coupling of folding and binding for unstructured proteins. Current Opinion in Structural Biology, 12, 54-60. |
| 17. | Dyson, H. J. & Wright, P. E. (2005). Intrinsically unstructured proteins and their functions. Nat Rev Mol Cell Biol, 6, 197-208. |
| 18. | Oldfield, C. J., Cheng, Y., Cortese, M. S., Romero, P., Uversky, V. N. et al. (2005). Coupled Folding and Binding with alpha-Helix-Forming Molecular Recognition Elements. Biochemistry, 44, 12454-12470. |
| 19. | Wright, P. E. & Dyson, H. J. (1999). Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm. Journal of Molecular Biology, 293, 321-331. |
| 20. | Uversky, V. N., Gillespie, J. R. & Fink, A. L. (2000). Why are "natively unfolded" proteins unstructured under physiologic conditions? Proteins: Structure, Function, and Genetics, 41, 415-427. |
| 21. | Demchenko, A. P. (2001). Recognition between flexible protein molecules: induced and assisted folding. J Mol Recognit, 14, 42-61. |
| 22. | Dunker, A. K., Lawson, J. D., Brown, C. J., Williams, R. M., Romero, P. et al. (2001). Intrinsically disordered protein. J Mol Graph Model, 19, 26-59. |
| 23. | Namba, K. (2001). Roles of partly unfolded conformations in macromolecular self-assembly. Genes Cells, 6, 1-12. |
| 24. | Romero, P., Obradovic, Z. & Dunker, A. K. (2004). Natively disordered proteins: functions and predictions. Appl Bioinformatics, 3, 105-13. |
| 25. | Fink, A. L. (2005). Natively unfolded proteins. Curr Opin Struct Biol, 15, 35-41. |
| 26. | Tompa, P. (2005). The interplay between structure and function in intrinsically unstructured proteins. FEBS Letters, 579, 3346-3354. |
| 27. | Uversky, V. N., Oldfield, C. J. & Dunker, A. K. (2005). Showing your ID: intrinsic disorder as an ID for recognition, regulation and cell signaling. J Mol Recognit, 18, 343-84. |
| 28. | Mohan, A., Oldfield, C. J., Radivojac, P., Vacic, V., Cortese, M. S. et al. (2006). Analysis of Molecular Recognition Features (MoRFs). J Mol Biol, 362, 1043-59. |
| 29. | Oldfield, C. J., Cheng, Y., Cortese, M. S., Brown, C. J., Uversky, V. N. et al. (2005). Comparing and combining predictors of mostly disordered proteins. Biochemistry, 44, 1989-2000. |
| 30. | Romero, P., Obradovic, Z., Kissinger, C., Villafranca, J. E., Garner, E. et al. (1998). Thousands of proteins likely to have long disordered regions. Pac. Symp. Biocomput., 3, 437-448. |
| 31. | Iakoucheva, L. M., Brown, C. J., Lawson, J. D., Obradovic, Z. & Dunker, A. K. (2002). Intrinsic disorder in cell-signaling and cancer-associated proteins. J Mol Biol, 323, 573-84. |
| 32. | Dunker, A. K., Cortese, M. S., Romero, P., Iakoucheva, L. M. & Uversky, V. N. (2005). Flexible nets. The roles of intrinsic disorder in protein interaction networks. FEBS J, 272, 5129-48. |
| 33. | Devos, D., Dokudovskaya, S., Williams, R., Alber, F., Eswar, N. et al. (2006). Simple fold composition and modular architecture of the nuclear pore complex. Proc Natl Acad Sci U S A, 103, 2172-7. |
| 34. | Liu, J., Perumal, N. B., Oldfield, C. J., Su, E. W., Uversky, V. N. et al. (2006). Intrinsic disorder in transcription factors. Biochemistry, 45, 6873-88. |
| 35. | Romero, P. R., Zaidi, S., Fang, Y. Y., Uversky, V. N., Radivojac, P. et al. (2006). Alternative splicing in concert with protein intrinsic disorder enables increased functional diversity in multicellular organisms. Proc Natl Acad Sci U S A. |
| 36. | Hanahan, D. & Weinberg, R. A. (2000). The hallmarks of cancer. Cell, 100, 57-70. |
| 37. | Yang, X. J. (2004). The diverse superfamily of lysine acetyltransferases and their roles in leukemia and other diseases. Nucleic Acids Res, 32, 959-76. |
| 38. | Flaugh, S. L., Kosinski-Collins, M. S. & King, J. (2005). Interdomain side-chain interactions in human gammaD crystallin influencing folding and stability. Protein Sci, 14, 2030-43. |
| 39. | Nucifora, F. C., Jr., Sasaki, M., Peters, M. F., Huang, H., Cooper, J. K. et al. (2001). Interference by huntingtin and atrophin-1 with cbp-mediated transcription leading to cellular toxicity. Science, 291, 2423-8. |
| 40. | Linding, R., Jensen, L. J., Diella, F., Bork, P., Gibson, T. J. et al. (2003). Protein disorder prediction: implications for structural proteomics. Structure, 11, 1453-1459. |
| 41. | Cheng, J., Sweredoski, M. J. & Baldi, P. (2005). Accurate Prediction of Protein Disordered Regions by Mining Protein Structure Data. Data Mining and Knowledge Discovery, in press. |
| 42. | Linding, R., Russell, R. B., Neduva, V. & Gibson, T. J. (2003). GlobPlot: Exploring protein sequences for globularity and disorder. Nucleic Acids Res, 31, 3701-8. |
| 43. | Jin, Y. & Dunbrack, R. L., Jr. (2005). Assessment of disorder predictions in CASP6. Proteins, 61 Suppl 7, 167-75. |
| 44. | Garbuzynskiy, S. O., Lobanov, M. Y. & Galzitskaya, O. V. (2004). To be folded or to be unfolded? Protein Sci, 13, 2871-7. |
| 45. | Dosztanyi, Z., Csizmok, V., Tompa, P. & Simon, I. (2005). IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics, 21, 3433-4. |
| 46. | Dosztanyi, Z., Fiser, A. & Simon, I. (1997). Stabilization centers in proteins: identification, characterization and predictions. J Mol Biol, 272, 597-612. |
| 47. | Orosz, F., Kovacs, G. G., Lehotzky, A., Olah, J., Vincze, O. et al. (2004). TPPP/p25: from unfolded protein to misfolding disease: prediction and experiments. Biol Cell, 96, 701-11. |
| 48. | Peng, K., Radivojac, P., Vucetic, S., Dunker, A. K. & Obradovic, Z. (2006). Length-dependent prediction of protein intrinsic disorder. BMC Bioinformatics, 7, 208. |
| 49. | Liu, J., Tan, H. & Rost, B. (2002). Loopy proteins appear conserved in evolution. Journal of Molecular Biology, 322, 53-64. |
| 50. | Liu, J. & Rost, B. (2003). NORSp: predictions of long regions without regular secondary structure. Nucleic Acids Research, 31, 3833-3835. |
| 51. | Schlessinger, A., Liu, J. & Rost, B. (2007). Natively Unstructured Loops Differ from Other Loops. PLoS Computational Biology, preprint, e140.eor. |
| 52. | Grana, O., Baker, D., MacCallum, R. M., Meiler, J., Punta, M. et al. (2005). CASP6 assessment of contact prediction. Proteins, 61 Suppl 7, 214-24. |
| 53. | Pokarowski, P., Kloczkowski, A., Jernigan, R. L., Kothari, N. S., Pokarowska, M. et al. (2005). Inferring ideal amino acid interaction forms from statistical protein contact potentials. Proteins, 59, 49-57. |
| 54. | Miyazawa, S. & Jernigan, R. L. (1999). Evaluation of short-range interactions as secondary structure energies for protein fold and sequence recognition. Proteins, 36, 347-56. |
| 55. | Thomas, P. D. & Dill, K. A. (1996). An iterative method for extracting energy-like quantities from protein structures. Proc Natl Acad Sci U S A, 93, 11628-33. |
| 56. | Mika, S. & Rost, B. (2003). UniqueProt: creating representative protein sequence sets. Nucleic Acids Research, 31, 3789-3791. |
| 57. | Sander, C. & Schneider, R. (1991). Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins, 9, 56-68. |
| 58. | Rost, B. (1999). Twilight zone of protein sequence alignments. Protein Engineering, 12, 85-94. |
| 59. | Altschul, S. F., Madden, T. L., Schaeffer, A. A., Zhang, J., Zhang, Z. et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25, 3389-3402. |
| 60. | Przybylski, D. & Rost, B. (2002). Alignments grow, secondary structure prediction improves. Proteins: Structure, Function, and Genetics, 46, 195-205. |
| 61. | Kyte, J. & Doolittle, R. F. (1982). A simple method for displaying the hydropathic character of a protein. J Mol Biol, 157, 105-32. |
| 62. | Radivojac, P., Obradovic, Z., Smith, D. K., Zhu, G., Vucetic, S. et al. (2004). Protein flexibility and intrinsic disorder. Protein Science, 13, 71-80. |
| 63. | Patikoglou, G. & Burley, S. K. (1997). Eukaryotic transcription factor-DNA complexes. Annu Rev Biophys Biomol Struct, 26, 289-325. |
| 64. | Ferre-D'Amare, A. R., Prendergast, G. C., Ziff, E. B. & Burley, S. K. (1993). Recognition by Max of its cognate DNA through a dimeric b/HLH/Z domain. Nature, 363, 38-45. |
| 65. | Fuxreiter, M., Simon, I., Friedrich, P. & Tompa, P. (2004). Preformed structural elements feature in partner recognition by intrinsically unstructured proteins. Journal of Molecular Biology, 338, 1015-1026. |
| 66. | Houdusse, A., Kalabokis, V. N., Himmel, D., Szent-Gyorgyi, A. G. & Cohen, C. (1999). Atomic structure of scallop myosin subfragment S1 complexed with MgADP: a novel conformation of the myosin head. Cell, 97, 459-70. |
| 67. | Risal, D., Gourinath, S., Himmel, D. M., Szent-Gyorgyi, A. G. & Cohen, C. (2004). Myosin subfragment 1 structures reveal a partially bound nucleotide and a complex salt bridge that helps couple nucleotide and actin binding. Proc Natl Acad Sci U S A, 101, 8930-5. |
| 68. | Vucetic, S., Brown, C. J., Dunker, A. K. & Obradovic, Z. (2003). Flavors of protein disorder. Proteins, 52, 573-84. |
| 69. | Fersht, A. R. & Daggett, V. (2002). Protein folding and unfolding at atomic resolution. Cell, 108, 573-82. |
| 70. | Patil, A. & Nakamura, H. (2006). Disordered domains and high surface charge confer hubs with the ability to interact with multiple proteins in interaction networks. FEBS Lett, 580, 2041-5. |
| 71. | Bordoli, L., Kiefer, F. & Schwede, T. (2006). Assessment of Disorder Prediction. CASP7. |
| 72. | Dunker, A. K., Garner, E., Guilliot, S., Romero, P., Albrecht, K. et al. (1998). Protein disorder and the evolution of molecular recognition: theory, predictions and observations. Pac Symp Biocomput, 3, 473-484. |
| 73. | Ortiz, A. R., Kolinski, A., Rotkiewicz, P., Ilkowski, B. & Skolnick, J. (1999). Ab initio folding of proteins using restraints derived from evolutionary information. Proteins: Structure, Function, and Genetics, Suppl 3, 177-185. |
| 74. | Pazos, F., Rost, B. & Valencia, A. (1999). A platform for integrating threading results with protein family analyses. Bioinformatics, 15, 1062-1063. |
| 75. | Punta, M. & Rost, B. (2005). Protein folding rates estimated from contact predictions. J Mol Biol, 348, 507-12. |
| Contact: admin@rostlab.org | Version: Aug 10, 2007 |