What is it about?

In this study, the authors compare classification algorithms for RNA sequencing (RNA-Seq) data, a gene expression profiling technique based on next-generation sequencing. Traditional statistical methods assume data on a continuous scale and cannot be applied directly to RNA-Seq data, which consist of discrete counts. Two approaches are therefore evaluated: count-based classifiers, namely Poisson linear discriminant analysis (PLDA) with power transformation and negative binomial linear discriminant analysis (NBLDA), and microarray-based classifiers applied after an rlog or vst transformation. The study also examines how sample size, overdispersion, the number of genes, and the number of classes affect model performance. The results indicate that classification accuracy increases with larger sample sizes, smaller dispersion parameters, and fewer groups. The authors conclude that PLDA after a power transformation may be a good choice as a count-based classifier, while the performance of NBLDA is not satisfactory. RF, SVM, and bagSVM may give accurate results after an rlog or vst transformation, and the efficiency of bagSVM improves markedly with increasing sample size. An R/Bioconductor package, MLSeq, was developed for the classification of RNA-Seq data.
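To make the role of the dispersion parameter concrete, here is a minimal sketch. It assumes the common negative binomial parameterization in which the variance equals the mean plus the dispersion times the squared mean; the function names and the example values are illustrative and are not taken from the study. With a dispersion of zero the variance reduces to the Poisson case, and larger dispersions inflate the variance, which is the overdispersion the summary refers to.

```python
def poisson_variance(mean: float) -> float:
    """Under a Poisson model the variance equals the mean."""
    return mean


def negative_binomial_variance(mean: float, dispersion: float) -> float:
    """Variance under the assumed parameterization: mean + dispersion * mean^2."""
    if mean < 0 or dispersion < 0:
        raise ValueError("mean and dispersion must be non-negative")
    return mean + dispersion * mean ** 2


# Illustrative values only (hypothetical, not from the paper).
example_mean = 100.0
for example_dispersion in (0.0, 0.1, 1.0):
    variance = negative_binomial_variance(example_mean, example_dispersion)
    ratio = variance / poisson_variance(example_mean)
    print(f"dispersion={example_dispersion}: variance={variance}, "
          f"ratio to Poisson variance={ratio}")
```

The ratio to the Poisson variance grows with both the dispersion and the mean, so highly expressed genes show the strongest excess variability under this model.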

Why is it important?

The study is important because it addresses the classification of RNA-Seq data, a powerful technique for gene expression profiling. The increasing use of RNA-Seq in research and diagnostics highlights the need for classification algorithms that can handle the distinctive characteristics of these data, such as overdispersion and a discrete count scale.

Key Takeaways:

1. RNA-Seq data are overdispersed, which can negatively affect classification performance.
2. Count-based classifiers, such as PLDA with power transformation and NBLDA, model RNA-Seq counts directly, although NBLDA did not perform satisfactorily in this study.
3. Microarray-based classifiers can also be used to classify RNA-Seq data after an rlog or vst transformation.
4. PLDA after a power transformation may be a good choice as a count-based classifier due to its sparsity and efficiency.
5. Further research is needed to improve the performance of NBLDA as a count-based classifier and to extend it into a sparse classifier.
6. An R/Bioconductor package, MLSeq, is available for the classification of RNA-Seq data.

AI notice

Some of the content on this page has been created using generative AI.

Read the Original

This page is a summary of: A comprehensive simulation study on classification of RNA-Seq data, PLoS ONE, August 2017, PLOS, DOI: 10.1371/journal.pone.0182507.
