Completed Theses 2011

Improving predictions of functional effect of non-synonymous SNP in human

In this thesis, a machine learning approach is used to develop a classifier that predicts the functional effects of non-synonymous single nucleotide polymorphisms. This involved implementing various protein- and mutation-related features to describe the properties of neutral and non-neutral substitutions. An extensive search of the parameter and feature space was performed to evaluate the impact of different property combinations on prediction accuracy. The sample space used for training was clustered and refined to avoid over-optimistic estimates of performance. Evaluated in a careful 10-fold cross-validation and on an independent test set, the resulting neural network classifier achieved an overall prediction accuracy of 79%. For the final method, all extraction and prediction steps were optimized to reduce run-time in order to meet the needs of large-scale prediction studies.
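The idea of clustering the samples before cross-validation can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the `cluster_of` input mapping and the greedy balancing rule are assumptions made for illustration. The sketch only shows the general principle that all members of a cluster are placed in the same fold, so that similar samples never appear in both the training and the test part of a split.

```python
from collections import defaultdict


def cluster_folds(cluster_of, n_folds=10):
    """Assign samples to folds without splitting any cluster across folds.

    cluster_of: dict mapping sample id -> cluster id (hypothetical input,
    e.g. produced by a sequence-similarity clustering step).
    Returns a dict mapping sample id -> fold index in range(n_folds).
    """
    if n_folds < 1:
        raise ValueError("n_folds must be at least 1")

    members = defaultdict(list)
    for sample, cluster in cluster_of.items():
        members[cluster].append(sample)

    fold_sizes = [0] * n_folds
    fold_of = {}
    # Greedy balancing (an assumption of this sketch): place the largest
    # clusters first, each into the fold that is currently the smallest.
    for cluster in sorted(members, key=lambda c: (-len(members[c]), str(c))):
        target = fold_sizes.index(min(fold_sizes))
        for sample in members[cluster]:
            fold_of[sample] = target
        fold_sizes[target] += len(members[cluster])
    return fold_of


def split_for_fold(fold_of, test_fold):
    """Return (train, test) sample lists for one cross-validation round."""
    train = sorted(s for s, f in fold_of.items() if f != test_fold)
    test = sorted(s for s, f in fold_of.items() if f == test_fold)
    return train, test
```

Calling `split_for_fold(fold_of, k)` for each `k` in `range(n_folds)` yields the individual rounds. Because whole clusters move together, the fold sizes may be uneven when clusters differ strongly in size.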

Master thesis
Student: Maximilian Hecht
Supervisor: Burkhard Rost

Predict subcellular localization for proteins in all kingdoms

The prediction of protein subcellular localization is an important step towards understanding a protein's function. Here, a new method for predicting localization in all six taxonomic kingdoms is presented. The method was developed on a non-redundant data set of proteins of known localization from SWISS-PROT. Three localization classes were targeted for archaea, six for bacteria and eleven for eukaryota. Prediction requires an amino acid sequence and the taxonomic classification. For the development of the method, support vector machines were used and a range of string kernels was examined. The kernel using evolutionary profiles was selected as the most appropriate for detecting compartment-specific patterns. A number of multiclass classification techniques were then compared, including one-against-all, various types of ensembles of nested dichotomies and the nested dichotomy with a fixed structure. The latter allowed prediction of protein subcellular localization by mimicking the cascading mechanism of cellular sorting. Though its overall accuracy was comparable to or higher than that of the other classification techniques, its computational time was significantly lower. Three separate classifiers were trained on non-membrane proteins, transmembrane proteins and proteins of all types, allowing the latter to be applied to large-scale screenings of entire proteomes. When evaluated on the non-redundant test sets, the method developed on all types of proteins achieved the highest level of accuracy for archaeal proteins, 84% for bacterial and 64% for eukaryotic proteins, thus outperforming current state-of-the-art predictors. In addition, the prediction methods were benchmarked on three independent data sets that were not used during their development. The method developed here surpassed the other methods in nearly all benchmarks.
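A nested dichotomy with a fixed structure can be pictured as a tree of binary decisions that is walked from the root to a leaf. The following minimal sketch only illustrates that cascade. Everything concrete in it is hypothetical: the class labels, the toy feature names (`signal_peptide`, `tm_helices`) and the hand-written decision rules stand in for the trained support vector machines of the thesis and are not taken from it.

```python
class Leaf:
    """Terminal node holding a final class label."""

    def __init__(self, label):
        self.label = label


class Split:
    """One binary decision; decide(x) returns True to follow `yes`, else `no`."""

    def __init__(self, decide, yes, no):
        self.decide = decide
        self.yes = yes
        self.no = no


def predict(node, x):
    """Walk the fixed tree from the root and return (label, decisions taken)."""
    path = []
    while isinstance(node, Split):
        go_yes = bool(node.decide(x))
        path.append(go_yes)
        node = node.yes if go_yes else node.no
    return node.label, path


# Hypothetical three-class tree: first separate secreted proteins,
# then split the remainder into membrane and cytoplasmic proteins.
toy_tree = Split(
    decide=lambda x: x["signal_peptide"],      # placeholder for a trained binary classifier
    yes=Leaf("secreted"),
    no=Split(
        decide=lambda x: x["tm_helices"] > 0,  # placeholder for a trained binary classifier
        yes=Leaf("membrane"),
        no=Leaf("cytoplasm"),
    ),
)
```

A tree of this kind over K classes contains K-1 binary decisions, and a single prediction evaluates only the decisions along one root-to-leaf path, whereas one-against-all evaluates one classifier per class. This may be one reason for the lower computational time reported above, although the abstract does not state the cause.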

Master thesis
Student: Tatyana Goldberg
Supervisor: Burkhard Rost, Tobias Hamp

Compare effects of nsSNPs from human variations with disease related SNPs

Since the sequencing of the human genome was completed in 2000, the analysis of SNPs has gained increasing attention. SNPs make up about 90% of all human genetic variation and are known to cause several diseases such as cancer. Therefore, the analysis of their effects on protein function is of major interest. In this master thesis, a comprehensive analysis of the functional effects of SNPs identified by the 1000 Genomes Project was performed using the SNAP prediction method. The number of effect SNPs in the 1000 Genomes data was found to be as high as the number of neutral SNPs, and the cumulative distribution of SNAP scores resembled the distribution for completely randomly generated SNPs. These findings led to the conclusion that naturally occurring human variation contains more SNPs with functional effects than has been assumed in recent years. Furthermore, the 15 proteins that accumulated the most function-changing SNPs were identified. With these candidate proteins and their direct neighbors, a protein-protein interaction network was built and a GO enrichment analysis was performed. As a result, most of the genes were predicted to belong to the 'intracellular signaling' and 'protein binding' GO terms.

Master thesis
Student: Dominik Achten
Supervisor: Burkhard Rost, Shaila Roessle

Evaluation of methods to predict transmembrane alpha-helices in proteins

Proteins containing transmembrane α-helices constitute about 20-30% of all proteins in an organism. Due to their lipophilic transmembrane region, high-resolution structures of these proteins are even more challenging to obtain than those of soluble proteins. Therefore, methods to predict the position of transmembrane α-helices and their topology from sequence are of great interest, for example in whole genome analyses. Today, a large number of transmembrane α-helix prediction methods exist. The objective of this thesis is to evaluate a selected set of twelve transmembrane-helix prediction methods on an independent set of sequences for which three-dimensional structures are available. Most of these methods have not been independently evaluated before. The dataset was created from UniProt, using additional information from PDB and PDBTM. The set was divided into several subsets to allow for a fine-grained evaluation. One of these subsets consists of 41 sequences that are considered new proteins, as they, or similar sequences, were not present in the training set of any of the evaluated prediction methods. The scores used to assess prediction performance have been used in earlier studies and can be divided into per-segment and per-residue scores. Furthermore, the runtime is evaluated by performing a prediction of the complete human proteome retrieved from UniProt.
Although most prediction methods give good results, MEMSAT-SVM, PolyPhobius and SCAMPI stand out as performing generally better. However, a few weaknesses could be observed. MEMSAT-SVM loses most performance on multipass transmembrane proteins, and SCAMPI shows a large performance decrease on prokaryotic sequences. PolyPhobius' performance is the most stable, and the method also offers a good trade-off between reliable predictions and a reasonable runtime. MEMSAT-SVM, on the other hand, is one order of magnitude slower, while SCAMPI is around one hundred times faster and the method of choice when speed is the most important criterion. The results on the subset with new sequences also indicate an apparent problem with overfitting of the machine learning based approaches. It has to be expected that performances in previous evaluations are often largely overestimated. Interestingly, we observed a significantly better performance by all methods on prokaryotic compared to eukaryotic proteins. Depending on the origin of the sequence and the number of sequences for which predictions are required, we propose the use of one of the three top-scoring methods MEMSAT-SVM, PolyPhobius and SCAMPI.
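The distinction between per-residue and per-segment scores can be made concrete with a small sketch. The abstract does not define the exact scores used in the thesis, so the two measures below are generic stand-ins: the string encoding (`H` for a helix residue, `-` otherwise), the function names and the overlap threshold `min_overlap=5` are all assumptions made for illustration.

```python
def segments(states, helix="H"):
    """Return (start, end) index pairs, end exclusive, of consecutive helix residues."""
    out, start = [], None
    for i, state in enumerate(states):
        if state == helix and start is None:
            start = i
        elif state != helix and start is not None:
            out.append((start, i))
            start = None
    if start is not None:
        out.append((start, len(states)))
    return out


def per_residue_accuracy(observed, predicted):
    """Fraction of residues whose predicted state equals the observed state."""
    if not observed or len(observed) != len(predicted):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(o == p for o, p in zip(observed, predicted))
    return matches / len(observed)


def per_segment_recall(observed, predicted, min_overlap=5):
    """Fraction of observed helices overlapped by a predicted helix.

    An observed helix counts as found if some predicted helix shares at
    least `min_overlap` residues with it (threshold is an assumption).
    """
    obs, pred = segments(observed), segments(predicted)
    if not obs:
        return 0.0
    found = 0
    for o_start, o_end in obs:
        if any(min(o_end, p_end) - max(o_start, p_start) >= min_overlap
               for p_start, p_end in pred):
            found += 1
    return found / len(obs)
```

The two kinds of score can diverge: a prediction that misses one short helix entirely may still have a high per-residue accuracy on a long protein, while its per-segment score drops noticeably.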

Bachelor thesis
Student: Jonas Reeb
Supervisor: Burkhard Rost, Edda Kloppmann

Identification of DNA-binding residues from amino acid sequence data

Motivation - DNA-protein interactions are essential for many biological processes, e.g. DNA packaging, DNA replication, DNA recombination and DNA repair. There is currently no established high-throughput technology available to screen DNA-protein binding experimentally. At the same time, the number of known proteins is growing rapidly. We therefore need an in silico method to predict DNA-protein interactions.
Prediction level - There are methods which can predict whether a protein interacts with DNA in principle. Given such a DNA-interacting protein, our novel method focuses on the prediction of DNA-interactivity for each residue.
Input types - Some methods use the tertiary structure (3D), which promises to predict the DNA-binding residues with high accuracy. A weakness of tertiary structure based methods is that they depend on experimental data, which are mostly not available. Therefore, our method exclusively uses the raw amino acid sequence (1D). We generate all necessary features from the raw amino acid sequence, e.g. with PredictProtein. As a result, our method is applicable to all known proteins.
Novel dataset - For training, we used a novel dataset containing transcription factors, enzymes and structural / DNA-binding proteins, e.g. histone-like proteins. Our aim is to build a robust classifier for these three protein classes.

Diploma thesis
Student: Michael Menden
Supervisor: Burkhard Rost, Shaila Roessle


Sole usage of amino acid propensities results in robust performance for predicting structural change in protein fragments

Bachelor thesis
Student: Yannik Mahlich
Supervisor: Burkhard Rost, Christian Schaefer