What is it about?
We revisited the classification of coding sequences (CDS) from nucleotide statistics using the Universal Feature Method (UFM). We show that two rules improve the success rate of CDS classification: (i) G1>G2, where G1 and G2 are the guanine levels at the 1st and 2nd positions of contiguous DNA triplets, and (ii) T1<A2, where T1 is the thymine level at the 1st position and A2 is the adenine level at the 2nd position. Combining the G1>G2 and T1<A2 rules reduces the classification error caused by confusion between the +1 frame and the -1 or -2 frames, without significantly affecting the detection rate. We also show how the information carried by purine bias can be complemented by stop codon frequency to achieve a high success rate together with a low error rate.
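To make the two rules concrete, the following is a minimal Python sketch of how G1, G2, T1, A2 and the stop codon frequency could be computed for a given reading frame. It is not the authors' UFM implementation: the function names (`codon_position_levels`, `stop_codon_frequency`, `passes_purine_rules`), the use of simple relative frequencies as "levels", the standard stop codons TAA, TAG and TGA, and the example sequence are all assumptions made for illustration. Only forward-strand offsets are handled; negative frames would require the reverse complement.

```python
from collections import Counter

# Assumption: the standard genetic code's stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codon_position_levels(seq, offset=0):
    """Relative frequency of each base at codon positions 1-3.

    `offset` (0, 1 or 2) selects the forward reading frame.
    Keys look like "G1", "A2", "T3".
    """
    seq = seq.upper()
    usable = (len(seq) - offset) // 3 * 3      # drop any trailing partial codon
    frame = seq[offset:offset + usable]
    n_codons = len(frame) // 3
    levels = {}
    for pos in range(3):
        counts = Counter(frame[pos::3])        # every third base from this position
        for base in "ACGT":
            levels[f"{base}{pos + 1}"] = counts[base] / n_codons if n_codons else 0.0
    return levels


def stop_codon_frequency(seq, offset=0):
    """Fraction of complete codons in the frame that are stop codons."""
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
    if not codons:
        return 0.0
    return sum(c in STOP_CODONS for c in codons) / len(codons)


def passes_purine_rules(seq, offset=0):
    """True when both G1 > G2 and T1 < A2 hold in the chosen frame."""
    lv = codon_position_levels(seq, offset)
    return lv["G1"] > lv["G2"] and lv["T1"] < lv["A2"]


if __name__ == "__main__":
    example = "GCTGAAGCAGATGCC"  # hypothetical sequence, codons: GCT GAA GCA GAT GCC
    for off in range(3):
        print(off, passes_purine_rules(example, off), stop_codon_frequency(example, off))
```

In a sketch like this, a candidate frame would be favored when it satisfies both purine rules and shows a low stop codon frequency; any thresholds used to combine the two signals would have to be chosen from training data.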
Why is it important?
UFM provides a simple tool for gathering the prior knowledge needed from transcriptome data to train other investigative tools, such as Markov models or other machine learning methods, that can be used for de novo genome annotation of eukaryotic species. Alternatively, it could be used to extract coding information from bulk metagenomic sequencing samples.
Read the Original
This page is a summary of: THE CONTRIBUTION OF STOP CODON FREQUENCY AND PURINE BIAS TO THE CLASSIFICATION OF CODING SEQUENCES, June 2013, World Scientific Pub Co Pte Lt,
DOI: 10.1142/9789814520829_0018.