What is it about?

DNA sequences contain essential information about the genetic makeup of living organisms and are used in many fields, including genomics and medical research. However, the large volumes of data produced by DNA sequencing are challenging to store and transmit, which makes efficient compressors for DNA sequences essential. In this study, the authors present a new lossless, reference-free data compressor that effectively compresses DNA sequences from different domains and kingdoms. For each symbol in the sequence, the compressor uses a competitive prediction model to choose between two classes of models (weighted context models and stochastic repeat models); the chosen model's prediction is then passed to an arithmetic encoder. The proposed compressor achieves a higher compression ratio than state-of-the-art approaches on a diverse benchmark while using a reasonable amount of computational resources. An efficient implementation of the compressor is publicly available. This work could improve the storage and transmission of DNA sequence data, as well as the performance of compression-based methods used in biomedical and anthropological research.
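The per-symbol competition described above can be illustrated with a small sketch. This is not the authors' implementation: the class names, the smoothing constant, the repeat "hit" probability, and the selection rule (pick whichever model had the lower decayed coding cost on past symbols) are all assumptions made for illustration. In place of a real arithmetic encoder, the sketch sums the ideal code length, -log2 p, of each symbol under the selected model.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: input contains only these four symbols


class ContextModel:
    """Order-k context model with additive smoothing (illustrative)."""

    def __init__(self, k, alpha=1.0):
        self.k, self.alpha = k, alpha
        self.counts = defaultdict(lambda: [0, 0, 0, 0])

    def predict(self, seq, i):
        c = self.counts[seq[max(0, i - self.k):i]]
        total = sum(c) + 4 * self.alpha
        return [(n + self.alpha) / total for n in c]

    def update(self, seq, i):
        self.counts[seq[max(0, i - self.k):i]][ALPHABET.index(seq[i])] += 1


class RepeatModel:
    """Predicts the symbol that followed the last occurrence of the current k-mer."""

    def __init__(self, k, hit_prob=0.9):
        self.k, self.hit_prob = k, hit_prob
        self.next_pos = {}  # k-mer -> index of the symbol that followed it

    def predict(self, seq, i):
        pos = self.next_pos.get(seq[i - self.k:i]) if i >= self.k else None
        if pos is None:
            return [0.25] * 4  # no repeat found: fall back to uniform
        rest = (1.0 - self.hit_prob) / 3
        return [self.hit_prob if s == seq[pos] else rest for s in ALPHABET]

    def update(self, seq, i):
        if i >= self.k:
            self.next_pos[seq[i - self.k:i]] = i


def estimate_bits(seq, k_ctx=3, k_rep=8, decay=0.9):
    """Ideal code length (bits) when one of two models is selected per symbol."""
    models = [ContextModel(k_ctx), RepeatModel(k_rep)]
    cost = [0.0, 0.0]  # decayed past coding cost of each model
    total = 0.0
    for i, sym in enumerate(seq):
        j = ALPHABET.index(sym)
        probs = [m.predict(seq, i) for m in models]
        # Selection uses only past symbols, so a decoder could repeat it.
        best = 0 if cost[0] <= cost[1] else 1
        total += -math.log2(probs[best][j])
        for idx, m in enumerate(models):
            cost[idx] = decay * cost[idx] - math.log2(probs[idx][j])
            m.update(seq, i)
    return total


if __name__ == "__main__":
    dna = "ACGGTCATTGCA" * 40
    print(f"{estimate_bits(dna) / len(dna):.3f} bits per base")
```

A trivial encoding of four symbols costs 2 bits per base, so any estimate below that indicates the models are capturing structure in the sequence. Because the selection depends only on symbols already processed, it requires no side information in the compressed stream.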

Why is it important?

There is a growing need to store and transmit large volumes of DNA sequence data efficiently, both to reduce storage and bandwidth requirements and to support analysis. This study presents a new lossless compression algorithm for DNA sequences that achieves improved compression across various domains and kingdoms. The method is reference-free and uses a competitive prediction model to determine which models to use before applying arithmetic encoding. On a diverse benchmark, it achieved a higher compression ratio than current state-of-the-art approaches while using a competitive level of computational resources. An efficient implementation of the method is publicly available under the GPLv3 license. The work is timely because it addresses the increasing need for efficient DNA sequence compressors, whose quality directly affects the outcomes of compression-based methods in anthropological and biomedical research.

Perspectives

The development of efficient data compressors for DNA sequences is crucial for reducing storage and transmission bandwidth, as well as for analysis. This work presents a new lossless compressor with improved compression capabilities for DNA sequences representing different domains and kingdoms. The reference-free method uses a competitive prediction model to estimate which models to use before applying arithmetic encoding. The results show that the proposed method attains a higher compression ratio than state-of-the-art approaches on a diverse benchmark, using a competitive level of computational resources. This matters because it allows DNA sequences to be stored and analyzed more efficiently, which can significantly benefit fields such as genomics, anthropology, and biomedical research. The method is also publicly available, so others can use it and build upon it. Overall, this work represents a significant advance in DNA sequence compression and has the potential to greatly improve the storage and analysis of genomic data.

Dr. Jorge Miguel Silva
Universidade de Aveiro

Read the Original

This page is a summary of: A Reference-Free Lossless Compression Algorithm for DNA Sequences Using a Competitive Prediction of Two Classes of Weighted Models, Entropy, November 2019, MDPI AG, DOI: 10.3390/e21111074.