Efficient Generation of Hidden Outliers for Improved Outlier Detection

Jose Cribeiro-Ramallo; Vadim Arzamasov; Klemens Böhm

doi:10.1145/3690827

What is it about?

Outliers are rare occurrences within a population of data. In practice, they can represent unique or abnormal observations, or errors in a dataset. A common and effective way to build AI models capable of detecting such rare occurrences relies on generating synthetic outliers. However, how to generate a "realistically rare" occurrence is not a well-defined problem. This paper explores classical properties of high-dimensional outliers observed in practice, and then builds an efficient and well-motivated method to generate such data. As a use case, we use these synthetic data points to build a machine-learning model capable of rivaling state-of-the-art (SOTA) models in Outlier Detection and in Oversampling.
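
To make the general idea of training a detector on synthetic outliers concrete, here is a minimal sketch. It is not the method proposed in the paper: it assumes the simplest possible generator (uniform sampling in an enlarged bounding box of the data) and a small nearest-neighbour scorer, and the function names uniform_synthetic_outliers and knn_outlier_score are hypothetical helpers written for this illustration only.

```python
import numpy as np

def uniform_synthetic_outliers(X, n_samples, margin=0.5, rng=None):
    """Sample points uniformly from the bounding box of X, enlarged by `margin`.

    Assumption: a naive baseline generator, NOT the paper's method.
    """
    rng = np.random.default_rng(rng)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = hi - lo
    return rng.uniform(lo - margin * span, hi + margin * span,
                       size=(n_samples, X.shape[1]))

def knn_outlier_score(X_train, y_train, X_query, k=5):
    """Fraction of the k nearest training points labeled as synthetic outliers."""
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

rng = np.random.default_rng(0)

# Toy "normal" data: a 2-D standard Gaussian cloud.
X_in = rng.normal(size=(500, 2))

# Synthetic outliers, labeled 1; the real data are labeled 0.
X_out = uniform_synthetic_outliers(X_in, 500, rng=rng)
X_train = np.vstack([X_in, X_out])
y_train = np.concatenate([np.zeros(len(X_in)), np.ones(len(X_out))])

# Score one point in the dense center and one far from the data.
queries = np.array([[0.0, 0.0], [5.0, 5.0]])
scores = knn_outlier_score(X_train, y_train, queries)
print(scores)
```

A score near 0 means the query sits among real data, and a score near 1 means it sits among synthetic outliers. Uniform sampling of this kind tends to become uninformative as the number of dimensions grows, which is one reason the question of what counts as a "realistically rare" point matters.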

Why is it important?

As outliers tend to be scarce by definition, datasets containing labeled outliers are also rare. This makes it much harder to build models capable of detecting them. Being able to produce realistic outliers for oversampling purposes therefore plays a crucial role in Data Engineering.

Key takeaways: our realistic synthetic outliers have been shown to:
1. be of higher quality than those of competing methods
2. be applicable where other oversampling techniques fail

Perspectives

It was a pleasure to collaborate with all my co-authors on this paper. The daunting task of contributing to a large area like Outlier Generation and Oversampling was made much easier thanks to their help. I think (and hope) that we have made a great contribution to the field of Outlier Generation. I hope that any reader coming across this paper is now as excited as we are about realistic synthetic outliers, and considers using them in a future project!
Jose Cribeiro-Ramallo
Karlsruher Institut für Technologie

This page is a summary of: Efficient Generation of Hidden Outliers for Improved Outlier Detection, ACM Transactions on Knowledge Discovery from Data, August 2024, ACM (Association for Computing Machinery),
DOI: 10.1145/3690827.

Contributors

The following have contributed to this page

Jose Cribeiro-Ramallo
Karlsruher Institut für Technologie

Generating realistic high-dimensional outlying data

Resources

Check out our official implementation!

Check out this great implementation of our project in Python! (supported by us)
