What is it about?
DNA sequences contain essential information about the genetic makeup of living organisms and are used in many fields, including genomics and medical research. However, the large volumes of data generated by DNA sequencing are difficult to store and transmit, which makes efficient compressors for DNA sequences essential. In this study, the authors present a new lossless data compressor that effectively compresses DNA sequences from different domains and kingdoms. For each symbol in the sequence, the compressor uses a competitive prediction model to choose between two classes of models (weighted context models and stochastic repeat models); the prediction of the chosen model is then passed to an arithmetic encoder. On a diverse benchmark, the proposed compressor outperforms state-of-the-art approaches in compression ratio while using a reasonable amount of computational resources. An efficient implementation of the compressor is publicly available. This work could improve the storage and transmission of DNA sequence data, as well as the performance of compression-based methods used in biomedical and anthropological research.
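The per-symbol competition described above can be illustrated with a toy sketch. This is not the authors' implementation: the class names `ContextModel` and `RepeatModel`, the function `code_length_bits`, the decayed-loss selection rule, and all parameter values are assumptions made for illustration. Each model class is reduced to a single simple model instead of a weighted mixture, and the arithmetic encoder is replaced by the ideal code length it would approach (the sum of -log2 of the probability given to each actual symbol).

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

class ContextModel:
    """Order-k context model (k >= 1): counts the symbols seen after each k-mer."""
    def __init__(self, k):
        self.k = k
        self.counts = defaultdict(lambda: [1, 1, 1, 1])  # start from uniform counts

    def predict(self, history):
        c = self.counts[history[-self.k:]]
        total = sum(c)
        return [x / total for x in c]

    def update(self, history, symbol):
        self.counts[history[-self.k:]][ALPHABET.index(symbol)] += 1

class RepeatModel:
    """Toy repeat model: remembers where each k-mer last occurred and bets
    that the symbol that followed it then will follow it again."""
    def __init__(self, k, confidence=0.9):
        self.k = k
        self.confidence = confidence  # assumed value, must be below 1
        self.last_pos = {}            # k-mer -> index of the symbol that followed it

    def predict(self, history):
        pos = self.last_pos.get(history[-self.k:]) if len(history) >= self.k else None
        if pos is None:
            return [0.25] * 4         # no repeat found: uniform prediction
        probs = [(1 - self.confidence) / 3] * 4
        probs[ALPHABET.index(history[pos])] = self.confidence
        return probs

    def update(self, history, symbol):
        if len(history) >= self.k:
            # The symbol being coded sits at index len(history).
            self.last_pos[history[-self.k:]] = len(history)

def code_length_bits(seq, models, decay=0.9):
    """Ideal code length in bits when, for each symbol, the model with the
    lowest decayed past loss is the one whose prediction gets encoded."""
    loss = [0.0] * len(models)
    bits = 0.0
    for i, symbol in enumerate(seq):
        history = seq[:i]
        s = ALPHABET.index(symbol)
        preds = [m.predict(history) for m in models]
        # Selection uses past symbols only, so a decoder can repeat it.
        best = min(range(len(models)), key=loss.__getitem__)
        bits += -math.log2(preds[best][s])
        for j, m in enumerate(models):
            loss[j] = decay * loss[j] - math.log2(preds[j][s])
            m.update(history, symbol)
    return bits

seq = "ACGTTGCA" * 50  # synthetic, highly repetitive input
bits = code_length_bits(seq, [ContextModel(k=3), RepeatModel(k=8)])
print(f"{bits / len(seq):.3f} bits per base (naive 2-bit packing uses 2.000)")
```

Because the choice of model depends only on symbols that have already been coded, a decoder running the same models can make the same choice without any side information. On real sequences the gain over 2 bits per base would be far smaller than on this synthetic repeat.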
Why is it important?
There is a growing need to store and transmit large amounts of DNA sequence data efficiently, which is crucial for reducing storage and bandwidth requirements and for facilitating analysis. This study presents a new lossless compression algorithm for DNA sequences that improves compression across various domains and kingdoms. The method is reference-free: it uses a competitive prediction model to select the best model for each symbol before applying arithmetic encoding. On a diverse benchmark, it achieved a higher compression ratio than current state-of-the-art approaches while using a competitive amount of computational resources. An efficient implementation of the method is publicly available under the GPLv3 license. The work is timely because it addresses the increasing need for efficient DNA sequence compressors, which directly affects the results of compression-based methods in anthropological and biomedical research.
Read the Original
This page is a summary of: A Reference-Free Lossless Compression Algorithm for DNA Sequences Using a Competitive Prediction of Two Classes of Weighted Models, Entropy, November 2019, MDPI AG, DOI: 10.3390/e21111074.