What is it about?
A common challenge in studying the genetic makeup of organisms is classifying DNA sequences that match no known biological sequence from previous research. Without a match, it is difficult to identify the organism and understand its characteristics. To address this, researchers have developed reference-free methods for taxonomic identification that analyze DNA sequences with tools called compressors. However, many compressors are available and running them takes computational resources, so it can be difficult to choose the best ones for classification tasks when resources are limited.

This paper presents a two-step pipeline to evaluate the performance of nine compressors for taxonomic identification. We selected 500 DNA sequences from five taxonomic groups and ran each sequence through the compressors to see how well the sequences could be classified.

Our results show that the Normalized Compression (NC) feature, calculated using the compressors, can provide valuable information about the nature and complexity of a given DNA sequence. However, we also found that neither the compression capability of a compressor nor the compressibility of the sequences themselves necessarily correlates with classification accuracy.

Our findings suggest that the NC feature can be a valuable tool for taxonomic identification, but that the specific compressor used may matter less. This could make it easier for researchers to choose compressors for their classification tasks without evaluating every available option. Our work also highlights the potential of compressor-derived features in machine learning algorithms to improve the accuracy of taxonomic identification and to better understand the characteristics of unknown organisms.
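To make the NC feature concrete, here is a minimal Python sketch. It rests on assumptions not stated in this summary: NC is taken to be the compressed size in bits divided by the size of a plain fixed-length encoding (2 bits per base for a four-letter DNA alphabet), and the general-purpose compressors zlib, bz2, and lzma from the Python standard library stand in for the nine compressors evaluated in the paper. The function name and the test sequences are hypothetical.

```python
import bz2
import lzma
import math
import random
import zlib

# Stand-in compressors (an assumption): NOT the nine tools evaluated in the paper.
COMPRESSORS = {
    "zlib": lambda data: zlib.compress(data, 9),
    "bz2": lambda data: bz2.compress(data, 9),
    "lzma": lambda data: lzma.compress(data),
}


def nc_features(sequence: str, alphabet_size: int = 4) -> dict:
    """Return one NC value per stand-in compressor for a DNA sequence.

    Assumed definition: compressed bits / (length * log2(alphabet size)).
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    data = sequence.encode("ascii")
    baseline_bits = len(sequence) * math.log2(alphabet_size)
    return {
        name: 8 * len(compress(data)) / baseline_bits
        for name, compress in COMPRESSORS.items()
    }


# Two synthetic sequences of equal length: one highly repetitive, one random.
rng = random.Random(0)
repetitive = "ACGT" * 2500
shuffled = "".join(rng.choice("ACGT") for _ in range(10_000))

features_repetitive = nc_features(repetitive)
features_shuffled = nc_features(shuffled)

for name in COMPRESSORS:
    print(name, round(features_repetitive[name], 3), round(features_shuffled[name], 3))
```

Under these assumptions, the repetitive sequence yields a much lower NC than the random one with every compressor. The resulting dictionary is the kind of per-sequence feature vector that could be passed to a classifier.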
Why is it important?
DNA sequencing technologies have dramatically expanded our understanding of the genetic makeup of organisms, yet classifying unknown DNA sequences remains a challenge. This work presents a two-step pipeline for evaluating the performance of compressors for taxonomic identification, which could help researchers more easily choose the best tools for their classification tasks. The findings also suggest that compressor-derived measures could serve as features for machine learning algorithms, improving the accuracy of taxonomic identification and our understanding of unknown organisms. This work is timely and relevant to metagenomics, where the ability to identify and classify organisms accurately is crucial for various applications.
Read the Original
This page is a summary of: The value of compression for taxonomic identification, July 2022, Institute of Electrical & Electronics Engineers (IEEE), DOI: 10.1109/cbms55023.2022.00055.