What is it about?

Survey data is widely used in social science research. However, sharing it has been hindered by privacy risks and by earlier, coarse de-identification techniques. In this study, we address this challenge by systematically evaluating four common synthetic data models (Synthpop, CTGAN, REaLTabFormer, and TVAE) across three key dimensions: utility, fidelity, and privacy.

Why is it important?

Our findings reveal that each model has distinct strengths: Synthpop excels in general utility, CTGAN prioritizes privacy, and REaLTabFormer and TVAE perform best in downstream applications. We recommend that future researchers choose a generative method by weighing the trade-offs among performance on each evaluation dimension, training set size, data type, and available computational infrastructure.

Perspectives

This paper introduces an end-to-end pipeline that streamlines and standardizes synthetic data generation and evaluation for survey researchers. We hope it serves as a practical guide to the strengths and limitations of these methods when applied to social science survey data.

Yanru Jiang
University of California Los Angeles

Read the Original

This page is a summary of: Synthetic Survey Data Generation and Evaluation, July 2025, ACM (Association for Computing Machinery), DOI: 10.1145/3690624.3709421.
