What is it about?

Outliers are rare occurrences within a population of data. In practice, they can represent unique or abnormal observations, or errors in a dataset. A common and effective way to build AI models that detect such rare occurrences relies on generating synthetic outliers. However, how to generate a "realistically rare" occurrence is not a well-defined problem. This paper explores classical properties of high-dimensional outliers observed in practice and uses them to build an efficient, well-motivated method for generating such data. As a use case, we use these synthetic data points to build a machine-learning model that rivals state-of-the-art (SOTA) models in outlier detection and in oversampling.

Why is it important?

Because outliers are scarce by definition, datasets containing labeled outliers are also rare. This makes it much harder to build models capable of detecting them. Being able to produce realistic outliers for oversampling purposes therefore plays a crucial role in data engineering.

Key takeaways: our realistic synthetic outliers have been shown to
1. be of higher quality than those produced by competing methods
2. be applicable where other oversampling techniques fail

Perspectives

It was a pleasure to collaborate with all my co-authors on this paper. The daunting task of contributing to a large area like outlier generation and oversampling was made much easier thanks to their help. I think (and hope) that we have made a great contribution to the field of outlier generation. I hope that any reader coming across this paper is now as excited as we are about realistic synthetic outliers, and considers using them in a future project!

Jose Cribeiro-Ramallo
Karlsruher Institut für Technologie

Read the Original

This page is a summary of: Efficient Generation of Hidden Outliers for Improved Outlier Detection, ACM Transactions on Knowledge Discovery from Data, August 2024, ACM (Association for Computing Machinery), DOI: 10.1145/3690827.
