Research goals of the lab involve using protein and DNA sequences along with evolutionary information to predict a protein's:
- overall function
- interaction partners
- secondary structure
- disordered regions
- subcellular localization
- membrane spanning protein structure
- intra-chain residue contacts
- cell cycle control
- and domain boundaries.
Another significant research focus is improving the effectiveness and efficiency with which structural genomics projects determine protein structures on a large scale.
PiNat is a platform for assessing protein-protein interaction networks. The platform integrates information about protein function and subcellular localization and outputs the reliable interactions involving the query proteins. The interactions are rendered as an image in the cellular context and can help elucidate biological pathways and processes. We have used the system to analyze proteins implicated in Alzheimer's disease and shown how the integrated view corroborates previous observations and helps formulate new hypotheses regarding the molecular underpinnings of the disease.
MD (Meta-Disorder predictor) is a neural-network based meta-predictor that uses different sources of information predominantly obtained from orthogonal approaches. MD significantly outperformed its constituents, and compared favorably to other top prediction methods. MD is capable of predicting disordered regions of all "flavors", and identifying new ones that are not captured by other predictors.
NORSnet is a neural-network based method that focuses on the identification of unstructured loops. NORSnet was trained to distinguish between very long contiguous segments with non-regular secondary structure (NORS regions) and well-folded proteins. Because NORSnet was trained on predicted information rather than on experimental data, it could be optimized on a large data set that is not biased by today's experimental means of capturing disorder. As a result, NORSnet reaches into regions of sequence space that are not covered by the specialized disorder predictors. One disadvantage of this approach is that it is not optimal for the identification of the "average" disordered region.
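To make the notion of a NORS-like region concrete, the following minimal sketch scans a per-residue secondary-structure string for long contiguous runs without helix or strand. It is not NORSnet, which is a neural network; the function name, the state letters (H, E, anything else) and the length threshold are assumptions made for illustration only.

```python
from itertools import groupby


def find_nonregular_segments(ss, min_len=70, regular_states="HE"):
    """Return (start, end) pairs (0-based, end exclusive) for contiguous
    runs without helix/strand states that span at least min_len residues.

    ss is a string with one (predicted) state per residue. The default
    min_len is an illustrative assumption, not a NORSnet parameter.
    """
    segments = []
    pos = 0
    for is_regular, run in groupby(ss, key=lambda s: s in regular_states):
        length = len(list(run))
        if not is_regular and length >= min_len:
            segments.append((pos, pos + length))
        pos += length
    return segments


# Toy input: 4 helix, 10 loop, 4 strand, 2 loop residues (hypothetical protein)
toy = "H" * 4 + "L" * 10 + "E" * 4 + "L" * 2
segments = find_nonregular_segments(toy, min_len=8)
print(segments)
```

With the short threshold used here, only the 10-residue loop run is reported; the trailing 2-residue run is too short.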
Ucon (prediction of natively unstructured regions through contacts) is a method that combines protein-specific internal contacts with generic pairwise energy potentials to accurately predict long and functional unstructured regions. One advantage of Ucon over statistical-potential based methods is that it incorporates the contribution of the specific order of the amino-acids rather than the amino acid composition alone.
PROFbval is a neural-network method that aims to predict flexible and rigid residues in proteins from sequence alone. PROFbval was trained on B-factor data from PDB X-ray structures and, to an extent, can capture disordered residues. Additionally, surface residues that are predicted to be rigid by PROFbval are correlated with the location of enzyme active sites.
LOCtree is a novel system of support vector machines (SVMs) that predicts the subcellular localization of proteins, and the DNA-binding propensity of nuclear proteins, by incorporating a hierarchical ontology of localization classes modeled on biological processing pathways. Biological similarities are incorporated from the description of cellular components provided by the Gene Ontology (GO) Consortium. GO definitions have been simplified and tailored to the problem of protein sorting. Technically, the ontology has been implemented using a decision tree with SVMs as the nodes. LOCtree was extremely successful at learning evolutionary similarities among subcellular localization classes and was significantly more accurate than other traditional networks at predicting subcellular localization. Whenever available, LOCtree also reports predictions based on the following: 1) nuclear localization signals found by PredictNLS, 2) localization inferred using Prosite motifs and Pfam domains found in the protein, and 3) SWISS-PROT keywords associated with a protein. Localization is inferred in the last two cases using the entropy-based LOCkey algorithm.
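The idea of a decision tree with binary classifiers as nodes can be sketched as follows. This is a toy illustration, not the LOCtree implementation: the tree layout, the feature names (`signal_peptide`, `nls`) and the hand-written scoring functions standing in for trained SVMs are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Node:
    """Internal node: a binary decision function and its two children."""
    decide: Callable[[dict], float]  # signed score, like an SVM decision value
    positive: Union["Node", str]     # followed when score >= 0
    negative: Union["Node", str]     # followed when score < 0


def classify(node, features):
    """Walk from the root to a leaf; return the leaf label and node scores."""
    scores = []
    while isinstance(node, Node):
        score = node.decide(features)
        scores.append(score)
        node = node.positive if score >= 0 else node.negative
    return node, scores


# Hypothetical two-level hierarchy; the lambdas are stand-ins for trained SVMs.
tree = Node(
    decide=lambda f: f["signal_peptide"] - 0.5,
    positive="extracellular",
    negative=Node(
        decide=lambda f: f["nls"] - 0.5,
        positive="nuclear",
        negative="cytoplasmic",
    ),
)

label, scores = classify(tree, {"signal_peptide": 0.1, "nls": 0.9})
print(label)
```

Because each node only separates the classes beneath it, a mistake near the root cannot be corrected further down; that trade-off is inherent to this kind of hierarchical design.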
PredictNLS is an automated tool for the analysis and in silico determination of nuclear localization signals (NLSs). In NLS discovery mode, PredictNLS searches a query protein for known and potential NLSs from NLSdb to determine whether the protein is likely to be targeted to the nucleus. If the protein is determined to be nuclear, the program also reports whether a known DNA-binding motif is found. In motif detection mode, the program can help you decide whether a sequence motif is likely to act as a nuclear localization signal. The PredictNLS website also documents the largest collection of experimentally determined NLSs.
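Searching a query protein for known motifs can be illustrated with a small regular-expression scan. This is a sketch under stated assumptions, not the PredictNLS code: the two patterns below (the SV40 large T antigen signal PKKKRKV and a simple basic-cluster pattern) are chosen for illustration and are not taken from NLSdb, and the function and variable names are hypothetical.

```python
import re

# Illustrative patterns only; NOT the NLSdb motif collection.
MOTIFS = {
    "SV40-like": re.compile(r"PKKKRKV"),
    "basic cluster": re.compile(r"K[KR].[KR]"),
}


def scan_for_motifs(sequence):
    """Return (motif name, 0-based start, matched text), sorted by start.

    finditer reports non-overlapping matches per pattern, which is
    sufficient for a first-pass scan.
    """
    hits = []
    for name, pattern in MOTIFS.items():
        for match in pattern.finditer(sequence):
            hits.append((name, match.start(), match.group()))
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical query sequence containing the SV40-like signal
hits = scan_for_motifs("MAPKKKRKVED")
for name, start, text in hits:
    print(name, start, text)
```

A real scan would also need an estimate of how specific each motif is to nuclear proteins, since short basic patterns match many non-nuclear sequences.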
LOCkey is a database of subcellular localization of eukaryotic proteins inferred using SWISS-PROT keywords. LOCkey was the first fully automated algorithm for inferring subcellular localization from database annotations. LOCkey outperformed semi-automated methods relying on expert annotators in benchmark tests.
NLSdb is a database of nuclear localization signals (NLSs) and of nuclear proteins targeted to the nucleus by NLS motifs. NLSdb contains over 12500 predicted nuclear proteins and over 1500 DNA-binding proteins from six entirely sequenced eukaryotic proteomes (human, mouse, fly, worm, grass and yeast).
ER/Golgi Localization: Analysis of experimentally characterized endoplasmic reticulum (ER) and Golgi apparatus retrieval motifs, with estimates of their specificity, to classify subcellular localization for the ER and Golgi. We are further investigating the inference of ER and Golgi localization by homology transfer from sequence similarity to ER- and Golgi-localized proteins.
Cell Cycle Protein Identification: Identification of cell cycle control proteins through homology transfer and machine learning techniques. We use database mining, literature searches and evolutionary conservation estimates to provide genome-wide annotations for cell cycle control proteins. We have also developed an SVM method to complement homology transfer in the identification of cell cycle kinases from sequence alone.
Cell Cycle Kinase Identification: Using information from highly conserved and semi-exposed protein residues from cell cycle kinases we are able to classify kinases involved in this specific biological pathway. We show the ability to correctly predict kinases involved in the cell cycle from all kinases by using a superset of highly conserved and semi-exposed residues. These residues, many of which reside in the nucleotide binding site of the enzymes, represent a majority of the functionally significant residues of the kinases and lead us towards their specific cell cycle functional classification.
PSI-BIG4: The PSI-BIG4 website assesses the progress of the large-scale structural genomics initiative (PSI) funded by NIGMS. Progress is assessed by reporting monthly statistics on the number of novel structures, progress with BIG and MEGA families, and novel leverage generated by the structures, as specified by the milestones document.
NYCOMPS: NYCOMPS targets are subdivided into three main categories: Pipeline, Nominated and Biological Theme targets. Pipeline targets are selected by a protocol that currently begins with E. coli proteins. These E. coli seeds are predicted membrane proteins that have been expressed successfully in a previous large-scale experiment on membrane proteins carried out by the lab of Gunnar von Heijne (Daley et al. Science. 2005 308:1321-3). We expand these seeds into 92 prokaryotic genomes (reagent genomes) from which we clone. Nominated targets are seeds selected by individual NYCOMPS experimental groups and expanded by the bioinformatics unit into the 92 reagent genomes. Biological Theme targets are proteins of exceptional biological interest that are cloned by the individual NYCOMPS experimental groups. Data from experimental trials (cloning, expression, purification, crystallization) are processed and analyzed with the aim of iteratively improving our target selection strategy.
Proteins are intrinsically flexible molecules, and their function is often associated with flexibility. Experimental methods to determine protein flexibility are expensive and often time-consuming. Over the past few years, molecular dynamics (MD) simulations have increasingly proved to be an efficient complementary method and a powerful tool for obtaining information on protein dynamics. In MD methods, successive conformations of a protein are calculated using Newton’s laws of motion. The result is a trajectory that describes how the positions and velocities of all atoms vary with time. Such trajectories allow important observations that help us better understand proteins, mutations and, eventually, associated diseases.
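How a trajectory arises from Newton's equations of motion can be shown with a minimal sketch: a velocity Verlet integrator applied to a single particle on a harmonic spring in one dimension. This is a toy example, not a protein simulation or a description of any MD package; the function name, force law and all parameter values are assumptions chosen for illustration.

```python
import math


def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate m * a = F(x) for one particle in 1D.

    Returns the trajectory as a list of (time, position, velocity) tuples.
    """
    trajectory = [(0.0, x, v)]
    a = force(x) / mass
    for step in range(1, n_steps + 1):
        x = x + v * dt + 0.5 * a * dt * dt  # advance position
        a_new = force(x) / mass             # force at the new position
        v = v + 0.5 * (a + a_new) * dt      # advance velocity
        a = a_new
        trajectory.append((step * dt, x, v))
    return trajectory


# Toy system: unit mass on a spring with force constant k (arbitrary units)
k = 1.0
traj = velocity_verlet(x=1.0, v=0.0, force=lambda x: -k * x,
                       mass=1.0, dt=0.01, n_steps=1000)

t_end, x_end, v_end = traj[-1]
energy_end = 0.5 * v_end ** 2 + 0.5 * k * x_end ** 2
print(x_end, math.cos(t_end))  # numerical vs. analytical position
print(energy_end)              # should stay close to the initial 0.5
```

Comparing the final position with the analytical solution cos(t) and checking that the total energy stays near its initial value are simple sanity checks on the integrator and the chosen time step.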
In a large-scale study we try to learn how, and to what extent, MD simulations can help us understand the effect of Single Nucleotide Polymorphisms (SNPs) on protein flexibility and thus on function. One main objective is to learn about the advantages of additional structural information over the sequence-only predictions produced by the SNAP software developed in our lab. For this project we have created a comprehensive data set of synonymous and non-synonymous SNPs that we have mapped to known PDB structures at different resolutions and sequence identity cutoffs.
This work is done in the context of the SCALALIFE (Scalable Software Services for Life Science) framework, together with the Leibniz-Rechenzentrum (LRZ), using the MD package GROMACS.