LocTree2 - Protein sub-cellular localization prediction for all domains of life

Revision as of 11:54, 10 June 2012 by Goldberg

Introduction

Subcellular localization is one easily definable aspect of protein function. Computational prediction of localization continues to provide invaluable help, especially in whole-genome analysis and annotation. Several methods have been developed to predict localization, yet many challenges remain.

We at RostLab developed a novel method, LocTree2, that predicts localization for all proteins in all domains of life. Similar to our previous method, LocTree, it incorporates a system of hierarchically organized Support Vector Machines to mimic the protein trafficking mechanism in cells. Please note that, other than the hierarchy and the name, LocTree and LocTree2 have nothing in common.

Amongst the novel aspects of LocTree2 are:

  • 18 classes predicted for Eukaryota
  • 6 classes for Bacteria and 3 classes for Archaea
  • use of no information other than evolutionary profiles
  • very accurate distinction between membrane and water-soluble globular proteins
  • high robustness against sequencing errors
  • top performance even for protein fragments

Method design

LocTree2 combines three different systems of classification trees to predict 3 localization classes in Archaea (cytosol, plasma membrane and extra-cellular), 6 classes in Bacteria (cytosol, plasma membrane, periplasmic space, outer membrane, fimbrium and extra-cellular) and 18 classes in Eukaryota (ER, Golgi, extra-cellular, vacuole, peroxisome, mitochondria, chloroplast, plastid, cytosol, nucleus, ER membrane, plasma membrane, Golgi membrane, nucleus membrane, vacuole membrane, peroxisome membrane, chloroplast membrane and mitochondria membrane) (Figure 1).

Fig 1: Hierarchical architecture of LocTree2. Prediction of protein localization follows a different tree for each of the three domains of life: (a) Archaea, (b) Bacteria and (c) Eukaryota. Abbreviations: CHL, chloroplast; CHLM, chloroplast membrane; CYT, cytosol; ER, endoplasmic reticulum; ERM, endoplasmic reticulum membrane; EXT, extra-cellular; FIM, fimbrium; GOL, Golgi apparatus; GOLM, Golgi apparatus membrane; MIT, mitochondria; MITM, mitochondria membrane; NUC, nucleus; NUCM, nucleus membrane; OM, outer membrane; PERI, periplasmic space; PER, peroxisome; PERM, peroxisome membrane; PM, plasma membrane; PLAS, plastid; VAC, vacuole; VACM, vacuole membrane.
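The classes listed above can be written down as a small lookup table, which is handy when post-processing predictions. The following is a minimal sketch: the names ABBREVIATIONS and CLASSES_BY_DOMAIN are made up for illustration and are not part of LocTree2, while the abbreviations and class lists are copied from the text and the caption of Figure 1.

```python
# Abbreviations as given in the caption of Figure 1.
ABBREVIATIONS = {
    "CHL": "chloroplast", "CHLM": "chloroplast membrane",
    "CYT": "cytosol", "ER": "endoplasmic reticulum",
    "ERM": "endoplasmic reticulum membrane", "EXT": "extra-cellular",
    "FIM": "fimbrium", "GOL": "Golgi apparatus",
    "GOLM": "Golgi apparatus membrane", "MIT": "mitochondria",
    "MITM": "mitochondria membrane", "NUC": "nucleus",
    "NUCM": "nucleus membrane", "OM": "outer membrane",
    "PERI": "periplasmic space", "PER": "peroxisome",
    "PERM": "peroxisome membrane", "PM": "plasma membrane",
    "PLAS": "plastid", "VAC": "vacuole", "VACM": "vacuole membrane",
}

# Localization classes per domain of life, as listed in the text.
# The dictionary name and layout are illustrative assumptions.
CLASSES_BY_DOMAIN = {
    "Archaea": ["CYT", "PM", "EXT"],
    "Bacteria": ["CYT", "PM", "PERI", "OM", "FIM", "EXT"],
    "Eukaryota": [
        "ER", "GOL", "EXT", "VAC", "PER", "MIT", "CHL", "PLAS", "CYT",
        "NUC", "ERM", "PM", "GOLM", "NUCM", "VACM", "PERM", "CHLM", "MITM",
    ],
}

for domain, classes in CLASSES_BY_DOMAIN.items():
    names = ", ".join(ABBREVIATIONS[c] for c in classes)
    print(f"{domain} ({len(classes)} classes): {names}")
```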

Input

As input, LocTree2 requires a .fasta file and a .profile file for each sequence to be predicted. A profile file can be obtained, for example, with PSI-BLAST (Position-Specific Iterated BLAST). We built our profiles using an 80% non-redundant database combining SWISS-PROT, TrEMBL and PDB.

Example input files can be viewed here.
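Because every sequence needs both files, it can help to check a directory for complete pairs before starting a run. The sketch below is not part of LocTree2. It assumes, purely for illustration, that the two files of one protein share a base name (e.g. P12345.fasta and P12345.profile); the function name pair_inputs is hypothetical.

```python
from pathlib import Path


def pair_inputs(directory):
    """Match each .fasta file with a .profile file of the same base name.

    Returns (pairs, missing): pairs maps a base name to its
    (fasta_path, profile_path) tuple, missing lists base names that
    have a .fasta file but no .profile file.
    The shared-base-name convention is an assumption of this sketch.
    """
    directory = Path(directory)
    pairs, missing = {}, []
    for fasta in sorted(directory.glob("*.fasta")):
        profile = fasta.with_suffix(".profile")
        if profile.is_file():
            pairs[fasta.stem] = (fasta, profile)
        else:
            missing.append(fasta.stem)
    return pairs, missing


if __name__ == "__main__":
    pairs, missing = pair_inputs(".")
    print(f"{len(pairs)} complete input pairs")
    for name in missing:
        print(f"no profile found for {name}")
```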

Profile Kernel

To assign a localization class to a query protein, the Support Vector Machines (SVMs; implemented as internal nodes in the LocTree2 trees) take the protein in its input space (as a sequence-profile tuple) and map it to a higher-dimensional feature space, where it is represented as a feature vector. The vector of the query protein is then 'compared' to the vectors of the proteins used for training. The predicted localization class of the query protein is the class of the most 'similar' training vector.

In our case, the mapping and the comparison are carried out by a string kernel function called the Profile Kernel. A feature vector built by the Profile Kernel is indexed by all possible subsequences of length k over the alphabet of 20 amino acids. Each element in the vector represents one particular k-mer with a score below a user-defined threshold sigma. This score is calculated as the ungapped cumulative substitution score in the corresponding sequence profile. The similarity between a training protein and a test protein is calculated as the dot product of their k-mer vector representations and is given as a single positive integer.

In short, the Profile Kernel identifies sets of k-mers (stretches of k adjacent residues) that are most informative for the prediction of localization and then matches these in the query protein.

Additional information about the kernel function can be found in the LocTree2 manuscript and the corresponding Profile Kernel publication.
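The description above can be illustrated with a toy computation. The following is a minimal sketch under stated assumptions, not the LocTree2 implementation: it uses a two-letter alphabet instead of 20 amino acids so that all k-mers can be enumerated by brute force, it represents a profile as one dictionary of per-residue substitution scores per position (lower meaning more compatible), and the names kmer_features and profile_kernel are made up for illustration.

```python
from collections import Counter
from itertools import product

ALPHABET = "AC"  # toy alphabet; the real kernel uses the 20 amino acids


def kmer_features(profile, k, sigma):
    """Count, for every k-mer, the profile windows in which it scores below sigma.

    profile: list with one dict per sequence position, mapping each residue
    to its substitution score at that position (assumed: lower is better).
    The score of a k-mer in a window is the ungapped sum of its per-position
    scores.
    """
    features = Counter()
    for start in range(len(profile) - k + 1):
        window = profile[start:start + k]
        for kmer in product(ALPHABET, repeat=k):
            score = sum(column[res] for column, res in zip(window, kmer))
            if score < sigma:
                features["".join(kmer)] += 1
    return features


def profile_kernel(profile_a, profile_b, k, sigma):
    """Similarity of two proteins: dot product of their k-mer count vectors."""
    feats_a = kmer_features(profile_a, k, sigma)
    feats_b = kmer_features(profile_b, k, sigma)
    return sum(count * feats_b[kmer] for kmer, count in feats_a.items())


# Two made-up profiles (scores are illustrative, not real PSI-BLAST output).
profile_a = [{"A": 0.1, "C": 2.0}, {"A": 0.2, "C": 1.5}, {"A": 1.8, "C": 0.1}]
profile_b = [{"A": 0.1, "C": 2.0}, {"A": 0.3, "C": 0.4}]

print(kmer_features(profile_a, k=2, sigma=1.0))
print(profile_kernel(profile_a, profile_b, k=2, sigma=1.0))
```

Brute-force enumeration costs 20^k score evaluations per window for real proteins, so it only serves to make the definition concrete. The choice of k and sigma here is arbitrary.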

Prediction algorithm

Each hierarchy mimics the biological sorting mechanism in its domain (in eukaryotes, membrane and non-membrane proteins are treated separately). The branches represent the paths of protein sorting, the leaves represent the final prediction of one localization class, and the internal nodes are the decision points along a path. These decisions are implemented as binary Support Vector Machines (SVMs).
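Walking such a tree from the root to a leaf can be sketched as follows. This is an illustration only: the tree topology, the node order, the score names and the thresholds are hypothetical and are not taken from Figure 1, and each binary SVM is replaced by a plain Python function that returns True or False.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Leaf:
    label: str  # final localization class, e.g. "CYT"


@dataclass
class Node:
    decide: Callable[[dict], bool]  # stand-in for a binary SVM decision
    if_true: Union["Node", Leaf]
    if_false: Union["Node", Leaf]


def predict(tree, protein):
    """Follow binary decisions from the root until a leaf is reached."""
    node = tree
    while isinstance(node, Node):
        node = node.if_true if node.decide(protein) else node.if_false
    return node.label


# Hypothetical three-class tree using the Archaea class labels.
# The real topology and the real decision functions differ.
ARCHAEA_TOY_TREE = Node(
    decide=lambda p: p["secretion_score"] > 0.5,
    if_true=Leaf("EXT"),
    if_false=Node(
        decide=lambda p: p["membrane_score"] > 0.5,
        if_true=Leaf("PM"),
        if_false=Leaf("CYT"),
    ),
)

print(predict(ARCHAEA_TOY_TREE, {"secretion_score": 0.2, "membrane_score": 0.9}))
```

Because every internal node makes a binary choice, a prediction costs at most as many decisions as the tree is deep, rather than one decision per class.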

Classification trees of SVMs

Reliability index

Accuracy of localization prediction

Reference

Availability / Download

  • The program can be accessed online via the PredictProtein service
  • A standalone version can be downloaded as a zip file here

Data

Data sets used for development and evaluation of LocTree2 can be accessed here.

Contact

For questions, please contact localization@rostlab.org