What is it about?
Survey data is widely used in social science research, but sharing it has been hampered by privacy risks and by the coarse de-identification techniques used so far. In this study, we tackle this challenge by systematically evaluating four common synthetic data models (Synthpop, CTGAN, REaLTabFormer, and TVAE) across three key dimensions: utility, fidelity, and privacy.
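To make the three dimensions concrete, here is a minimal sketch in Python that scores a toy synthetic table against a toy real one. It is not the paper's pipeline or its metrics: the function names, the column names (age_group, employed, trust), and the choice of proxies (a majority-rule classifier for utility, total variation distance for fidelity, exact-match rate for privacy) are all assumptions made for illustration.

```python
from collections import Counter

# Toy survey tables. Column names and values are hypothetical.
COLUMNS = ("age_group", "employed", "trust")
real = [dict(zip(COLUMNS, r)) for r in [
    ("18-29", "yes", "high"),
    ("18-29", "no", "low"),
    ("30-49", "yes", "high"),
    ("50+", "no", "low"),
]]
synth = [dict(zip(COLUMNS, r)) for r in [
    ("18-29", "yes", "high"),  # replicates a real respondent
    ("30-49", "yes", "high"),  # replicates a real respondent
    ("30-49", "no", "low"),
    ("50+", "yes", "low"),
]]


def marginal_tvd(real_rows, synth_rows, column):
    """Fidelity proxy: total variation distance between one column's
    category shares in the real and synthetic tables (0 = identical)."""
    r = Counter(row[column] for row in real_rows)
    s = Counter(row[column] for row in synth_rows)
    categories = set(r) | set(s)
    return 0.5 * sum(
        abs(r[c] / len(real_rows) - s[c] / len(synth_rows)) for c in categories
    )


def exact_match_rate(real_rows, synth_rows):
    """Privacy proxy: share of synthetic rows that copy a real row exactly."""
    real_set = {tuple(sorted(row.items())) for row in real_rows}
    hits = sum(tuple(sorted(row.items())) in real_set for row in synth_rows)
    return hits / len(synth_rows)


def majority_rule_accuracy(train_rows, test_rows, feature, target):
    """Utility proxy: learn the most common target value per feature value
    on the training rows, then measure accuracy on the test rows."""
    counts = {}
    for row in train_rows:
        counts.setdefault(row[feature], Counter())[row[target]] += 1
    rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    fallback = Counter(row[target] for row in train_rows).most_common(1)[0][0]
    correct = sum(
        rule.get(row[feature], fallback) == row[target] for row in test_rows
    )
    return correct / len(test_rows)


fidelity = marginal_tvd(real, synth, "age_group")
privacy = exact_match_rate(real, synth)
# Train on synthetic data, test on real data.
utility = majority_rule_accuracy(synth, real, "employed", "trust")
print(f"fidelity (TVD): {fidelity}, privacy (copy rate): {privacy}, utility: {utility}")
```

In this toy case the synthetic table supports the downstream prediction well while copying half of its rows from real respondents, which is the kind of trade-off between dimensions that an evaluation has to surface.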
Why is it important?
Our findings reveal that each model has distinct strengths: Synthpop excels in general utility, CTGAN prioritizes privacy, and REaLTabFormer and TVAE perform best in downstream applications. We recommend that future researchers select a generative method by weighing the trade-offs among performance on the different evaluation dimensions, training data size, data type, and computational infrastructure.
Perspectives
This paper introduces an end-to-end pipeline that streamlines and standardizes synthetic data generation and evaluation for survey researchers. We hope it offers a practical guide to the strengths and limitations of these methods when applied to social science survey data.
Yanru Jiang
University of California Los Angeles
Read the Original
This page is a summary of: Synthetic Survey Data Generation and Evaluation, July 2025, ACM (Association for Computing Machinery), DOI: 10.1145/3690624.3709421.