| Title: | Distinguishing protein-coding from non-coding RNA through support vector machines |
| Author: | Jinfeng Liu, Julian Gough, & Burkhard Rost |
| Quote: | PLoS Genetics, 2006, Apr. 2(4):e29. Epub 2006 Apr 28. |
RIKEN's FANTOM project revealed many previously unknown coding sequences, as well as an unexpected degree of variation in transcripts resulting from alternative promoter usage and splicing. More generally, transcriptome studies have identified ever more transcripts that do not code for proteins. Increasing evidence points to the important cellular roles of such non-coding RNAs (ncRNAs). Distinguishing protein-coding from non-coding RNA transcripts is therefore an important problem in understanding the transcriptome and in annotating fully sequenced organisms. Very few in silico methods have specifically addressed this problem. Here, we describe a novel method based on support vector machines (SVMs) that classifies transcripts according to features they would have if they were coding for proteins. These features include peptide length, amino acid composition, predicted secondary structure content, predicted percentage of exposed residues, compositional entropy, number of homologues from database searches, and alignment entropy. Nucleotide frequencies were also incorporated into the method. Confirmed coding cDNAs for eukaryotic proteins from the Swiss-Prot database constituted the set of true positives, and non-coding RNAs from RNAdb and NONCODE the set of true negatives. Ten-fold cross-validation suggested that our SVM-based method distinguished coding from non-coding transcripts at about 97% specificity and 98% sensitivity. Applied to 102,801 mouse cDNAs from the FANTOM-3 data set, our method reliably identified over 14,000 ncRNAs and estimated the total number of ncRNAs to be about 28,000.
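As an illustration of the simplest sequence-derived features listed above (peptide length, amino acid composition, compositional entropy), the following is a minimal sketch, not the implementation used in the paper. The function name `peptide_features` is hypothetical, and the sketch assumes that compositional entropy is the Shannon entropy (in bits) of the amino acid frequencies over the 20 standard residues; the abstract does not give the exact definition.

```python
import math
from collections import Counter

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def peptide_features(peptide: str) -> dict:
    """Compute length, composition, and compositional entropy of a peptide.

    Assumption: entropy is the Shannon entropy (bits) of the residue
    frequencies; the paper's exact definition may differ.
    """
    # Keep only standard residues; anything else (e.g. 'X', '*') is skipped.
    residues = [aa for aa in peptide.upper() if aa in AMINO_ACIDS]
    n = len(residues)
    counts = Counter(residues)

    # Fraction of each standard amino acid (all zeros for an empty peptide).
    composition = {aa: (counts[aa] / n if n else 0.0) for aa in AMINO_ACIDS}

    # Shannon entropy over the non-zero frequencies.
    entropy = -sum(p * math.log2(p) for p in composition.values() if p > 0)

    return {"length": n, "composition": composition, "entropy": entropy}


# Usage with a made-up hexapeptide (not data from the paper).
features = peptide_features("MKKLLA")
print(features["length"], round(features["entropy"], 3))
```

In a pipeline of the kind described, such values would be concatenated with the remaining features (predicted structure, homologue counts, nucleotide frequencies) into one vector per transcript before being passed to a classifier; how the authors encoded and scaled them is not stated in the abstract.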