The value of compression for taxonomic identification

Jorge Miguel Silva; João Rafael Almeida

doi:10.1109/cbms55023.2022.00055

What is it about?

When studying the genetic makeup of organisms, a common challenge is classifying DNA sequences that do not match any known biological sequence from previous research. This makes it difficult to identify the organism and understand its characteristics. To address this challenge, researchers have developed methods for taxonomic identification that do not rely on pre-existing references and instead use data compression tools, called compressors, to analyze the DNA sequences. However, many compressors are available and each requires computational resources to run, so it can be difficult to choose the best ones for classification tasks when resources are limited.

This paper presents a two-step pipeline to evaluate the performance of nine different compressors for taxonomic identification. We selected 500 DNA sequences from five different taxonomic groups and ran each sequence through the compressors to see how well the resulting measurements could be used to classify the sequences.

Our results show that the Normalized Compression (NC) feature, calculated using the compressors, can provide valuable information about the nature and complexity of a given DNA sequence. However, we also found that neither the compression capabilities of the compressors nor the compressibility of the sequences themselves necessarily correlates with classification accuracy. Our findings suggest that the NC feature can be a valuable tool for taxonomic identification but that the specific compressor used may not be as important. This could make it easier for researchers to choose compressors for their classification tasks without evaluating every available option. Additionally, our work highlights the potential for using compressor output as a feature for machine learning algorithms to improve the accuracy of taxonomic identification and better understand the characteristics of unknown organisms.
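To make the Normalized Compression feature more concrete, here is a minimal Python sketch. It rests on assumptions that are not stated on this page: it uses a commonly cited definition, NC(x) = C(x) / (|x| * log2|A|), where C(x) is the compressed size of sequence x in bits, |x| is its length, and |A| is the alphabet size (4 for DNA). It also uses three general-purpose compressors from the Python standard library as stand-ins, not the nine compressors evaluated in the paper. The names normalized_compression and nc_features are hypothetical helpers introduced for illustration.

```python
import bz2
import lzma
import math
import random
import zlib

# Stand-in general-purpose compressors from the Python standard library.
# Assumption: these are NOT the nine compressors evaluated in the paper.
COMPRESSORS = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


def normalized_compression(sequence, compress, alphabet_size=4):
    """Compressed size in bits divided by len(sequence) * log2(alphabet_size).

    Assumed definition of NC; lower values indicate a more compressible
    (more redundant) sequence.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    compressed_bits = 8 * len(compress(sequence.encode("ascii")))
    return compressed_bits / (len(sequence) * math.log2(alphabet_size))


def nc_features(sequence):
    """Return one NC value per compressor, usable as a small feature vector."""
    return {
        name: normalized_compression(sequence, compress)
        for name, compress in COMPRESSORS.items()
    }


# Two synthetic sequences (illustrative only, not data from the study):
# a highly repetitive one and a pseudo-random one of the same length.
rng = random.Random(0)
repetitive = "ACGT" * 1250
shuffled = "".join(rng.choice("ACGT") for _ in range(5000))

for label, seq in (("repetitive", repetitive), ("pseudo-random", shuffled)):
    print(label, nc_features(seq))
```

In this sketch the repetitive sequence yields a much lower NC than the pseudo-random one for every compressor, which is the kind of signal about sequence complexity described above. A vector of such values, one per compressor, is one plausible way a classifier could consume compression-based features; the paper's actual feature set and classifiers may differ.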

Why is it important?

The use of DNA sequencing technologies has dramatically expanded our understanding of the genetic makeup of organisms. Still, one of the challenges of this field is the classification of unknown DNA sequences. This work presents a two-step pipeline for evaluating the performance of compressors for taxonomic identification, which could help researchers more easily choose the best tools for their classification tasks. Additionally, the findings suggest that compressors could be helpful as a feature for machine learning algorithms, improving the accuracy of taxonomic identification and our understanding of unknown organisms. This work is timely and relevant in metagenomics, where the ability to identify and classify organisms accurately is crucial for various applications.

Perspectives

Our work presents a novel approach to evaluating the performance of compressors for taxonomic identification. Our two-step pipeline allows for a comprehensive evaluation of multiple compressors. Our findings suggest that the Normalized Compression feature could be a valuable tool for taxonomic identification. Additionally, our work highlights the potential for using compressors as a feature for machine learning algorithms in metagenomics, which could improve the accuracy of organism classification. We hope that it will be helpful for researchers working on taxonomic identification.
Dr. Jorge Miguel Silva
Universidade de Aveiro

My colleagues and I have developed a novel method for improving the taxonomic identification of unknown organisms through the use of compression. The study, published by the Institute of Electrical & Electronics Engineers (IEEE) in July 2022, presents a two-step pipeline for evaluating the performance of nine different compressors in classifying 500 DNA sequences from five different taxonomic groups. We found that the Normalized Compression (NC) feature, calculated using the compressors, provides valuable information about the nature and complexity of a given DNA sequence. However, we also found that the compressibility of the sequences themselves does not necessarily correlate with classification accuracy. These findings suggest that the NC feature can be a valuable tool for taxonomic identification, but that the specific compressor used may not be as important. This could make it easier for researchers to choose compressors for their classification tasks without evaluating every available option. Furthermore, the results highlight the potential for using compressor output as a feature for machine learning algorithms to improve the accuracy of taxonomic identification and increase our understanding of unknown organisms. This work is highly relevant to metagenomics, where accurate identification and classification of organisms is crucial for various applications. The method has the potential to contribute to our understanding of the genetic makeup of organisms and to pave the way for advances in disease treatment and prevention.
João Rafael Almeida

This page is a summary of: The value of compression for taxonomic identification, July 2022, Institute of Electrical & Electronics Engineers (IEEE),
DOI: 10.1109/cbms55023.2022.00055.