What is it about?
Outliers are rare occurrences within a population of data. In practice, they can represent unique or abnormal observations, or errors in your dataset. A common and effective way to build AI models capable of detecting such rare occurrences relies on generating synthetic outliers. However, how to generate a "realistically rare" occurrence is not a well-defined problem. This paper explores classical properties of high-dimensional outliers observed in practice and uses them to build an efficient, well-motivated method for generating such data. As a use case, we use these synthetic data points to build a machine-learning model that rivals state-of-the-art (SOTA) models in outlier detection and in oversampling.
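To make the idea of synthetic outliers concrete, here is a minimal sketch of a naive baseline, not the method from the paper. It draws candidate points uniformly from an enlarged bounding box around the data and keeps only those that lie far from every observed point. The function names, the `margin` and `quantile` parameters, and the distance-based threshold are all assumptions made for illustration.

```python
import math
import random


def nearest_distance(point, data):
    """Euclidean distance from `point` to its closest neighbour in `data`."""
    return min(math.dist(point, other) for other in data)


def generate_synthetic_outliers(inliers, n_outliers, margin=0.5,
                                quantile=0.95, seed=0, max_tries=100_000):
    """Rejection-sample points that lie far from every observed inlier.

    Hypothetical baseline for illustration only (NOT the paper's algorithm).
    `inliers` is a list of equal-length tuples of floats.
    """
    rng = random.Random(seed)
    dims = len(inliers[0])
    lows = [min(p[d] for p in inliers) for d in range(dims)]
    highs = [max(p[d] for p in inliers) for d in range(dims)]
    spans = [hi - lo for lo, hi in zip(lows, highs)]

    # Threshold: a high quantile of each inlier's distance to its nearest
    # other inlier, so accepted points are unusually isolated.
    nn = sorted(
        nearest_distance(p, inliers[:i] + inliers[i + 1:])
        for i, p in enumerate(inliers)
    )
    threshold = nn[min(len(nn) - 1, int(quantile * len(nn)))]

    outliers, tries = [], 0
    while len(outliers) < n_outliers:
        if tries >= max_tries:
            raise RuntimeError("could not sample enough outliers")
        tries += 1
        # Candidate drawn uniformly from the bounding box, enlarged by
        # `margin` times the range of each feature.
        candidate = tuple(
            rng.uniform(lo - margin * span, hi + margin * span)
            for lo, hi, span in zip(lows, highs, spans)
        )
        if nearest_distance(candidate, inliers) > threshold:
            outliers.append(candidate)
    return outliers, threshold


# Usage: 200 two-dimensional Gaussian inliers, 50 synthetic outliers.
data_rng = random.Random(42)
inliers = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
outliers, threshold = generate_synthetic_outliers(inliers, n_outliers=50)
```

The synthetic points could then be labeled as outliers and combined with the real data to train a classifier. A baseline like this treats "far from everything" as the only notion of rarity, which is one reason the question of what counts as a realistic outlier remains open.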
Why is it important?
Because outliers are scarce by definition, datasets containing labeled outliers are also rare. This makes it much harder to build models capable of detecting them. Being able to produce realistic outliers for oversampling purposes therefore plays a crucial role in data engineering.
Key takeaways: our realistic synthetic outliers have been shown to:
1. be of higher quality than those produced by competing methods
2. be applicable where other oversampling techniques fail
Read the Original
This page is a summary of: Efficient Generation of Hidden Outliers for Improved Outlier Detection, ACM Transactions on Knowledge Discovery from Data, August 2024, ACM (Association for Computing Machinery),
DOI: 10.1145/3690827.
Resources
Check out our official implementation!
Link to the official GitHub repo of the project (in R)
Check out this great implementation of our project in Python! (supported by us)
Third-party implementation of our algorithm, BISECT. We support this Python implementation and plan to update it in the future.