What is it about?
Do we really need more spreadsheets? Yes! Generative modelling is gaining traction beyond creating funny images of cats and helping polish your résumé. Tabular micro-data (e.g. patient or customer data) can be a treasure trove of insights in the right hands, but a serious vulnerability should it fall into the wrong ones. Synthetic data of sufficient quality can act as a proxy for real, sensitive records, making insights more accessible while protecting privacy. This survey documents recent trends in methods for generating synthetic tabular data and in the tools used to assess its quality.
Featured image: Photo by Mika Baumeister on Unsplash
Why is it important?
We investigate which models are in active use, which methods are used for evaluation, and how models can be compared. We find that while deep learning approaches are still the most popular, some more traditional machine learning approaches remain highly relevant yet have been overlooked in recent surveys. We attribute this to a disturbing lack of direction in the field of evaluation, which makes it difficult to compare generation methods effectively. Additionally, the lack of standardisation and of proper guidelines with respect to privacy makes it challenging to apply synthetic data to meet open data requirements.
Read the Original
This page is a summary of: Systematic Review of Generative Modelling Tools and Utility Metrics for Fully Synthetic Tabular Data, ACM Computing Surveys, November 2024, ACM (Association for Computing Machinery). DOI: 10.1145/3704437.