What is it about?

Do we really need more spreadsheets? Yes! Generative modelling is gaining traction beyond creating funny images of cats and helping improve your resumé. Tabular micro-data (e.g. patient or customer data) can be a treasure trove of insights in the right hands, but a serious vulnerability should it fall into the wrong ones. Synthetic data of sufficient quality can serve as a proxy for real sensitive records, making some of those insights more accessible while protecting privacy. This survey documents recent trends in methods for creating synthetic tabular data and the tools used to assess its quality.

Why is it important?

We investigate which models are in active use, which methods are used for evaluation, and how models can be compared. We find that while deep learning approaches are still the most popular, some more traditional machine learning approaches remain highly relevant but have been overlooked in recent surveys. We attribute this to a disturbing lack of direction in the field of evaluation, which makes it difficult to compare generation methods effectively. Additionally, the lack of standardisation and proper guidelines with respect to privacy makes it a challenge to apply synthetic data to meet open data requirements.

Perspectives

We hope this paper offers an interesting perspective on the current state of synthetic tabular data generation. We find that the principal challenge holding back progress in this field is evaluation: promoting new models is difficult without baselines, common denominators, and well-structured benchmarks. The challenges must be met on two fronts.

Practical (software): Better, more user-friendly tools must be created to promote thorough evaluation. As a result of this work, we designed SynthEval, a framework that can provide a foundation for such efforts.

Theoretical (legislative): Authorities need to adopt a stance on synthetic data and provide guidelines concerning its use. This issue mostly concerns privacy, and designing more effective metrics can be part of the solution. Guiding principles such as CAIR can greatly benefit this process.

Anton Lautrup
University of Southern Denmark (SDU)

Read the Original

This page is a summary of: Systematic Review of Generative Modelling Tools and Utility Metrics for Fully Synthetic Tabular Data, ACM Computing Surveys, November 2024, ACM (Association for Computing Machinery),
DOI: 10.1145/3704437.
