What is it about?

We revisited the classification of coding sequences (CDS) from nucleotide statistics using the Universal Feature Method (UFM). We show that two rules improve the success rate of CDS classification: (i) G1>G2, where G1 and G2 are the guanine levels in the 1st and 2nd positions of contiguous DNA triplets, and (ii) T1<A2, where T1 is the thymine level in the 1st position and A2 is the adenine level in the 2nd position of contiguous DNA triplets. Combining the G1>G2 and T1<A2 rules reduces the classification error caused by confusion between the +1 frame and the -1 or -2 frames, without significantly affecting the detection rate. We also show how the information from purine bias can be complemented by stop codon frequency to achieve a high success rate together with a low error rate.
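To make the two rules concrete, here is a minimal Python sketch that measures G1, G2, T1, A2 and the stop codon frequency in each of the six reading frames of a sequence. It is an illustration only, not the UFM implementation from the paper: the function names, the frame labels for the reverse strand, and the example sequence are all assumptions, and no decision thresholds are included because the summary does not give any.

```python
# Hypothetical helper names; this is a sketch, not the published UFM code.
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def codons(seq, offset=0):
    """Split seq into contiguous, complete triplets starting at offset 0, 1 or 2."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def base_level(triplets, base, position):
    """Fraction of triplets carrying `base` at codon position 1, 2 or 3."""
    if not triplets:
        return 0.0
    return sum(1 for c in triplets if c[position - 1] == base) / len(triplets)


def frame_features(seq, offset=0):
    """Compute the purine-bias levels, both rules and the stop codon frequency."""
    t = codons(seq, offset)
    g1 = base_level(t, "G", 1)
    g2 = base_level(t, "G", 2)
    t1 = base_level(t, "T", 1)
    a2 = base_level(t, "A", 2)
    stop = sum(1 for c in t if c in STOP_CODONS) / len(t) if t else 0.0
    return {
        "G1": g1, "G2": g2, "T1": t1, "A2": a2,
        "stop_freq": stop,
        "G1>G2": g1 > g2,   # rule (i)
        "T1<A2": t1 < a2,   # rule (ii)
    }


def six_frame_features(seq):
    """Features for frames +1..+3 and -1..-3 (the frame labels are an assumed convention)."""
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = frame_features(seq, offset)
        frames[f"-{offset + 1}"] = frame_features(rc, offset)
    return frames


if __name__ == "__main__":
    # Made-up example sequence, used only to exercise the functions.
    example = "ATGGCTGAAGGTGACGCTGAAAAAGGTTAA"
    for frame, feats in six_frame_features(example).items():
        print(frame, feats["G1>G2"], feats["T1<A2"], round(feats["stop_freq"], 2))
```

A frame that satisfies both rules and carries few stop codons would be a candidate coding frame under this reading of the summary; how the two signals are actually weighted and combined is described in the original paper, not here.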

Why is it important?

UFM provides a simple tool for gathering the necessary prior knowledge from transcriptome data to train other investigative tools, such as Markov models or other machine learning methods, that can be used for de novo genome annotation of eukaryotic species. Alternatively, it could be used to extract coding information from bulk metagenomic sequencing samples.

Perspectives

This method does not require any prior knowledge, which means that there is no theoretical impediment to sequencing any transcriptome or metagenomic sample. The only limitation is access to financial means. With a MinION sequencer, a Land Rover and a laptop, it would be possible to go into the wild and sequence exomes on the fly with 95% sensitivity and a 95% success rate.

Nicolas Carels
Oswaldo Cruz Foundation

Read the Original

This page is a summary of: THE CONTRIBUTION OF STOP CODON FREQUENCY AND PURINE BIAS TO THE CLASSIFICATION OF CODING SEQUENCES, June 2013, World Scientific Pub Co Pte Lt,
DOI: 10.1142/9789814520829_0018.