What is it about?

Mutual information theory was used to certify annotated rice coding sequences from the GenBank and TIGR databases. Restricting the analysis to coding sequences longer than 600 bp, we screened out genes with aberrant compositional features. These genes made up about 10% of both datasets after redundant genes were removed. Most of the rejected accessions followed a different trend in the GC3% vs GC2% plot than the accessions that had been published in international journals.
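The two quantities named above can be illustrated with a minimal Python sketch. It is not the procedure from the paper: this summary does not give the exact mutual information formulation or the rejection threshold, so the code assumes the generic definition of mutual information between nucleotides separated by k positions. The function names, and the use of the 600 bp cutoff as a constant, are illustrative choices.

```python
from collections import Counter
from math import log2

MIN_CDS_LENGTH = 600  # the summary only considers coding sequences larger than 600 bp


def gc_at_codon_position(cds: str, position: int) -> float:
    """Percent G+C at codon position 1, 2 or 3 of an in-frame coding sequence."""
    if position not in (1, 2, 3):
        raise ValueError("position must be 1, 2 or 3")
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    bases = cds[position - 1:usable:3]
    if not bases:
        raise ValueError("sequence shorter than one codon")
    return 100.0 * sum(b in "GC" for b in bases) / len(bases)


def mutual_information(seq: str, k: int) -> float:
    """Mutual information (in bits) between nucleotides separated by k positions.

    Assumption: generic definition, sum over pairs of
    p(a, b) * log2(p(a, b) / (p(a) * p(b))), estimated from observed counts.
    """
    seq = seq.upper()
    # Keep only pairs made of unambiguous nucleotides.
    pairs = [(a, b) for a, b in zip(seq, seq[k:]) if a in "ACGT" and b in "ACGT"]
    if not pairs:
        return 0.0
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    return sum(
        (count / n) * log2(count * n / (left[a] * right[b]))
        for (a, b), count in joint.items()
    )


def long_enough(cds: str) -> bool:
    """Length filter matching the 'larger than 600 bp' criterion."""
    return len(cds) > MIN_CDS_LENGTH
```

For a set of accessions, plotting `gc_at_codon_position(cds, 3)` against `gc_at_codon_position(cds, 2)` would give a GC3% vs GC2% plot of the kind mentioned above. How the mutual information values were turned into an accept or reject decision is not stated in this summary.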

Why is it important?

These results were used to argue that coding sequence samples in public databases are contaminated with spurious non-coding sequences, a bias introduced by the pattern recognition algorithms of gene prediction software.

Perspectives

Mutual information has a low level of sensitivity, but it had the merit of showing that a systematic bias of nucleotide composition exists in coding sequences. It later became possible to identify the parameters of this bias, as reported in the publication "A Statistical Method without Training Step for the Classification of Coding Frame in Transcriptome Sequences" (doi:10.4137/BBI.S10053), and to use this bias to classify coding sequences without a training step.

Nicolas Carels
Oswaldo Cruz Foundation

Read the Original

This page is a summary of: The mutual information theory for the certification of rice coding sequences, FEBS Letters, May 2004, Wiley,
DOI: 10.1016/j.febslet.2004.05.026.