Comparison of protein families for different organisms
The aim of this thesis is to compare the protein sequences of different organisms. Comparative studies of the genomes of different organisms have been shown to provide valuable information about the evolutionary relationships of the analysed organisms and to give insights into the regulation of genes and proteins and the role of highly conserved regions. Sequence comparison between proteins can also hint at the function of newly sequenced proteins. This thesis compares the proteomes of 29 different organisms. Their reference proteomes are available in the EBI database and include members of the three main domains of living organisms: bacteria, eukaryotes and archaea.
Bachelor thesis
Student: Christiane Gasperi
Supervisor: Burkhard Rost, Edda Kloppmann
An evaluation of SNP and functional site analysis methods based on structural and evolutionary inference approaches
In this thesis, several SNP effect prediction and functional site prediction methods were evaluated based on their ability to predict SNP effects and functionally important residues. This involved adapting the SNP effect prediction methods to functional site prediction and vice versa. Furthermore, a representative dataset was needed for evaluating both the prediction of SNP effects and the prediction of functionally important sites. Therefore, a SNP dataset and a functional site dataset were created. Next, the methods were adapted to their new applications, and the predictive power of each method on both applications was evaluated. To gain more insight into the predictive power of the methods, several subsets were created according to the cellular localization of the protein or the region of a SNP within a protein. Finally, an ensemble of several methods was developed for each application with the aim of outperforming the best method available so far for that prediction task.
Master thesis
Student: Verena Link
Supervisor: Burkhard Rost
Building PSSH2 - new comprehensive database of alignments between protein sequences and tertiary structures
Bachelor thesis
Student: Maria Kalemanov
Supervisor: Burkhard Rost, Andrea Schafferhans
In-depth comparison of predicted high- and low-impact SNPs from the 1000 Genomes Project
Since the human genome was completely sequenced and assembled in 2002, technologies in this area of research have made incredible progress. Today, the challenge for modern genetics is no longer the sequencing itself, but the processing of the resulting data. One focus in the analysis of large-scale sequencing data is the determination of differences between individuals at the DNA level, so-called single nucleotide polymorphisms (SNPs). The ultimate goal of this analysis is to measure the effects of SNPs on the organism or, more specifically, on protein function. For a few SNPs this can be done experimentally. However, wet lab experiments are too expensive and time-consuming to be applied, for example, in personalized medicine or extensive QTL studies. In silico predictions offer a potential remedy. The increase in available data and computational power has led to quite reliable results from current prediction tools. In this work, the convincing prediction performance of SNAP is used as a measure of the effect that single amino acid polymorphisms (SAAPs) have on protein function. This makes it possible to use all amino acid changing mutations in the 1000 Genomes Project data to analyse the effects of sequence- and structure-based mutation properties in detail. The resulting larger amount of data improves the statistical significance and reliability of the findings. Dependencies between the properties are also examined. It turned out that different properties depend strongly on each other. In particular, the type of exchange contains information about structure and even conservation. It is also possible to statistically estimate the damaging potential of every type of exchange. Used for prediction purposes, the derived matrix of damaging potentials (damaging matrix) outperforms all other matrix/sequence-based prediction tools and gives a quick and reliable idea of how a SAAP affects protein function.
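One way to picture the damaging matrix described in this abstract is as a table of per-exchange effect frequencies. The following is a minimal sketch under stated assumptions, not the method used in the thesis: the function name `damaging_matrix`, the input format and the toy data are all hypothetical, and the effect labels are assumed to come from thresholding a predictor's output.

```python
from collections import defaultdict


def damaging_matrix(saaps):
    """Estimate, per exchange type, the fraction of SAAPs labelled as having an effect.

    saaps: iterable of (wild_type, mutant, has_effect) tuples; has_effect is a
    boolean label (hypothetical input format, assumed to come from a predictor).
    """
    counts = defaultdict(lambda: [0, 0])  # (wt, mut) -> [effect count, total count]
    for wild_type, mutant, has_effect in saaps:
        entry = counts[(wild_type, mutant)]
        entry[0] += int(has_effect)
        entry[1] += 1
    # Damaging potential = share of observed exchanges of this type with an effect.
    return {pair: effect / total for pair, (effect, total) in counts.items()}


# Toy data, invented purely for illustration.
toy_saaps = [
    ("A", "V", False),
    ("A", "V", False),
    ("A", "V", True),
    ("A", "V", False),
    ("W", "R", True),
    ("W", "R", True),
]
matrix = damaging_matrix(toy_saaps)
```

In such a sketch, a new SAAP would be scored simply by looking up its exchange type in `matrix`; how the thesis actually derives and normalizes its matrix is not stated in the abstract.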
Master thesis
Student: Veit Höhn
Supervisor: Burkhard Rost, Marc Offman
Predicting protein function through gene ontology
Master thesis
Student: Vivien Klose
Supervisor: Burkhard Rost, Christian Schaefer
Automatic protein name recognition
The primary goal of this thesis is the development of a text mining tool for the automatic recognition of protein names in article abstracts. The tool’s design is shaped by its final purpose, namely the construction of a bioinformatics database with a comprehensive mapping between articles and amino acid sequences, which are in turn mapped to their (possibly multiple) names. Implemented as a web service, this system would be the first of its kind and would boost research in the field by providing new facilities including, but not limited to, searching articles by sequence and thus avoiding possible name ambiguity, directly finding papers on proteins that have a similar sequence, or notifying users upon publication of new experiments without the need to specify search keywords. The accuracy and coverage of current state-of-the-art protein taggers, especially in combination with protein normalizers, are still insufficient to make the proposed service feasible; consequently, the efforts of this thesis will be mainly directed at solving this problem.
Master thesis
Student: Juan Miguel Cejuela
Supervisor: Burkhard Rost
Improvement of DNA- and RNA-Protein Binding Prediction
Polynucleotide-protein interactions play an important role in many essential molecular processes, especially those dealing with the synthesis of proteins. A polynucleotide is either deoxyribonucleic acid (DNA) or ribonucleic acid (RNA). Transcription factors are a prominent example of proteins that bind directly to DNA to regulate gene expression. Many post-transcriptional modifications of RNA, such as alternative splicing, are known today; they are initiated by RNA-binding proteins, for example those of the spliceosome, a eukaryotic protein-RNA complex. RNA-protein complexes, e.g. ribosomes, are involved in a multitude of important processes in the cell.
Although various methods already predict polynucleotide binding on a per-residue basis with good performance, none of them distinguishes proteins that bind polynucleotides from those that do not. The approach presented in this work addresses this problem by using neural networks together with a clean dataset containing proteins that are definitely not involved in polynucleotide binding. This makes it possible to distinguish between proteins that bind DNA, RNA or neither, combined with a residue-based prediction of the binding. These predictions are intended to help experimentalists find new targets involved in those essential molecular processes and then identify the polynucleotide-binding regions inside those proteins, all in one tool: SomeNA.
Diploma thesis
Student: Peter Hönigschmid
Supervisor: Burkhard Rost, Edda Kloppmann
Extracting binding residues from the Protein Data Bank
The Protein Data Bank is an archive of 3D structure models of large biological molecules such as proteins and nucleic acids. Many entries in the database are complexes that contain several subunits or combine macromolecules with their ligands. With the structural information compiled in the collection, e.g. atom coordinates and molecular linkages, we can examine where and how the chemical compounds interact with each other. These models give clues about the binding affinity between enzymes and ligands, the structure of the catalytic site, the mechanism of the folding process, etc. The most obvious use of this information is data mining, through which predictions in structural, biochemical and medical contexts can be made. The purpose of this thesis is to implement a program that analyzes the structure of 3D models of macromolecules, categorizes the different interactions between proteins and various types of other biological molecules, and extracts the binding residues from the Protein Data Bank according to the properties of the different interactions.
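A common way to extract binding residues from atom coordinates is a simple distance cutoff between protein atoms and ligand atoms. The sketch below illustrates that idea only; the abstract does not state which criteria the thesis uses, so the function name `binding_residues`, the tuple-based input format and the 4.0 Å default cutoff are all assumptions made for illustration.

```python
import math


def binding_residues(protein_atoms, ligand_atoms, cutoff=4.0):
    """Return sorted (chain, residue_number) pairs with an atom near the ligand.

    protein_atoms: list of (chain, residue_number, (x, y, z)) tuples
    ligand_atoms:  list of (x, y, z) tuples
    cutoff:        distance threshold in Angstrom (assumed value, not from the thesis)
    """
    hits = set()
    for chain, residue_number, coord in protein_atoms:
        # A residue counts as binding if any of its atoms is within the cutoff.
        if any(math.dist(coord, ligand) <= cutoff for ligand in ligand_atoms):
            hits.add((chain, residue_number))
    return sorted(hits)


# Toy coordinates, invented purely for illustration.
protein = [
    ("A", 10, (0.0, 0.0, 0.0)),
    ("A", 10, (1.5, 0.0, 0.0)),
    ("A", 11, (10.0, 0.0, 0.0)),
    ("B", 5, (0.0, 3.0, 0.0)),
]
ligand = [(0.0, 0.0, 3.5)]

close_contacts = binding_residues(protein, ligand)
looser_contacts = binding_residues(protein, ligand, cutoff=5.0)
```

Categorizing interaction types, as the thesis sets out to do, would need more than distances (for example atom types and bond information); this sketch covers only the extraction step.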
Bachelor thesis
Student: Shen Wei
Supervisor: Burkhard Rost, Christian Schaefer
Evaluation of sequence-to-structure alignments
The sequence and structure visualisation tool SRS3D uses a pre-calculated database of sequence-to-structure alignments that is derived from an enhanced version of HSSP. Likewise, many sequence-based prediction methods use sequence-to-structure alignments as their input. Some of these are based on HSSP, others use PSI-BLAST results. The purpose of this project is to evaluate different sequence-to-structure alignment methods with respect to alignment quality and computational overhead, in order to develop guidelines for the efficient use of appropriate methods in the respective context. The standard of truth will be a comparison with structural alignments as well as the quality of prediction results (e.g. homology models, SNP effects) based on the alignments.
Master thesis
Student: Benjamin Wellmann
Supervisor: Andrea Schafferhans
Transmembrane protein 3D structure prediction from evolutionary sequence variation
Alpha-helical transmembrane proteins are an abundant class of proteins involved in a variety of important biological processes such as signaling or transport. Yet, due to the difficulty of solving membrane protein structures experimentally, many protein families remain without structural information inferrable by homology. In this master thesis, we aim to establish a de novo 3D structure prediction method for alpha-helical transmembrane proteins which is exclusively based on sequence information, without the use of homology modeling, threading or sequence fragments.
Master thesis
Student: Thomas Hopf
Supervisor: Burkhard Rost, Chris Sander, Debora Marks
Feature construction and selection for predicting structural change upon point mutation in proteins
In this bachelor thesis we investigate basic amino acid propensities with respect to their ability to improve the prediction of local structural change upon point mutation within protein sequences.
Bachelor thesis
Student: Yannick Mahlich
Supervisor: Burkhard Rost, Christian Schaefer
Improving predictions of functional effect of non-synonymous SNP in human
In the near future, personal genome sequencing and analysis will become more and more affordable for individuals and will therefore increase public interest in characterizing the effect of single nucleotide polymorphisms (SNPs) in our own genomes. This master thesis aims to improve both the speed and the accuracy of predictions of the functional effect of SNPs in humans. We will investigate how to limit the search space of homologous proteins to those of a few organisms that best reflect the spectrum of human proteins, thus reducing the computational time needed for every prediction. Additionally, a machine learning model will be trained and optimized for the prediction of SNP effects in humans using feature selection techniques.
Master thesis
Student: Maximilian Hecht
Supervisor: Burkhard Rost
Predict subcellular localization for proteins in all kingdoms
An automatic approach for predicting the subcellular localization of proteins, which is an important step towards understanding their function, is developed in this master's thesis project. The sequence-based approach utilizes a number of Support Vector Machines for mimicking the cascading mechanism of cellular sorting. The approach is applicable to soluble and membrane proteins in all taxonomic kingdoms.
Master thesis
Student: Tatyana Goldberg
Supervisor: Burkhard Rost, Tobias Hamp
Compare effects of nsSNPs from human variations with disease related SNPs
Since the sequencing of the human genome was completed in 2000, the analysis of SNPs has gained more and more attention. SNPs make up about 90% of all human genetic variation and are known to cause several diseases such as cancer. Therefore, the analysis of their effects on protein function is of major interest. In this master thesis, a comprehensive analysis of the functional effects of SNPs identified by the 1000 Genomes Project was performed using the SNAP prediction method. The number of effect SNPs in the 1000 Genomes data was found to be as high as the number of neutral SNPs, and the cumulative distribution of SNAP scores resembled the distribution for completely randomly generated SNPs. These findings led to the conclusion that naturally occurring human variation contains more SNPs with functional effects than previously assumed. Furthermore, the 15 proteins that accumulated the most function-changing SNPs were identified. With these candidate proteins and their direct neighbors, a protein-protein interaction network was built and a GO enrichment analysis was performed. As a result, most of the genes were predicted to belong to the ‘intracellular signaling’ and ‘protein binding’ GO terms.
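GO enrichment analyses are often based on a hypergeometric test: given a background of genes, how surprising is it that so many genes in the selected set carry a particular GO term? The abstract does not say which test or tool was used, so the following is only a sketch of that common choice; the function name `enrichment_p_value` and all numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, sample_annotated):
    """One-sided hypergeometric p-value P(X >= sample_annotated).

    population:       number of genes in the background set
    annotated:        background genes carrying the GO term
    sample:           number of genes in the selected set (e.g. the network)
    sample_annotated: selected genes carrying the GO term
    """
    upper = min(annotated, sample)
    # Sum the probabilities of observing sample_annotated or more annotated genes.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(sample_annotated, upper + 1)
    )
    return tail / comb(population, sample)


# Toy numbers, invented purely for illustration.
p_all_three = enrichment_p_value(population=10, annotated=3, sample=3, sample_annotated=3)
p_at_least_zero = enrichment_p_value(population=10, annotated=3, sample=3, sample_annotated=0)
```

In practice such p-values are computed for many GO terms at once and then corrected for multiple testing, a step this sketch leaves out.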
Master thesis
Student: Dominik Achten
Supervisor: Burkhard Rost, Shaila Roessle
Evaluation of methods to predict transmembrane alpha-helices in proteins
Proteins containing transmembrane alpha-helices are assumed to constitute about 25% of all proteins in an organism. Due to their lipophilic transmembrane region, high-resolution structures of these proteins are even more challenging to obtain than those of soluble proteins. Therefore, methods to predict the position of transmembrane helices and their topology from sequence are of great interest, for example in whole-genome analyses. Today, a large number of transmembrane alpha-helix prediction methods exist. The objective of this thesis is an evaluation of several well-known prediction methods.
Bachelor thesis
Student: Jonas Reeb
Supervisor: Burkhard Rost, Edda Kloppmann
Identification of DNA-binding residues from amino acid sequence data
Motivation - DNA-protein interactions are essential for many biological processes, e.g. DNA packaging, DNA replication, DNA recombination and DNA repair. There is currently no established high-throughput technology available to screen DNA-protein binding experimentally. At the same time, the number of known proteins is growing explosively. We therefore need an in silico method to predict DNA-protein interactions.
Prediction level - There are methods that predict whether a protein interacts with DNA at all. Given such a DNA-interacting protein, our novel method focuses on predicting DNA interaction for each residue.
Input types - Some methods use the tertiary structure (3D), which promises to predict the DNA-binding residues with high accuracy. A weakness of tertiary structure based methods is that they depend on experimental data, which are mostly not available. Therefore, our method uses exclusively the raw amino acid sequence (1D). We generate all necessary features from the raw amino acid sequence, e.g. with PredictProtein. As a result, our method is applicable to all known proteins.
Novel dataset - For training, we used a novel dataset containing transcription factors, enzymes and structural / DNA-binding proteins e.g. histone-like proteins. Our purpose is to build a robust classifier for these three protein classes.
Diploma thesis
Student: Michael Menden
Supervisor: Burkhard Rost, Shaila Roessle