What is it about?

This article surveys methods for generating synthetic data, that is, data created algorithmically for testing and training machine learning models. It distinguishes probabilistic approaches, such as Monte Carlo simulation and Generative Adversarial Networks (GANs), which rely on random sampling, from non-probabilistic methods, such as inverse copula sampling and Cholesky decomposition, which preserve dependencies and covariance structures. The proposed algorithm generates synthetic data that preserves both the marginal distributions and the correlations of the original data, and its output is validated with the Kolmogorov-Smirnov (K-S) test. An empirical example walks through generating a synthetic dataset and checking it against the original distributions. The method is aimed at researchers and practitioners who need synthetic data for data analysis and model development.
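
The general idea of combining a Cholesky factor with a K-S check can be illustrated with a short sketch. This is not the paper's algorithm, only one common way to realize the description above, under several assumptions: the dependence is modeled with a Gaussian copula fitted on rank-based normal scores, marginals are reproduced through empirical quantiles, and the function name `synthesize` and the toy dataset are hypothetical. It relies on NumPy and SciPy.

```python
import numpy as np
from scipy import stats


def synthesize(original, n_samples, rng):
    """Draw synthetic rows that mimic the marginals and correlations of `original`.

    Hypothetical helper for illustration; not taken from the paper.
    """
    n, d = original.shape

    # Map each column to normal scores via its ranks, so the correlation
    # is estimated on a Gaussian scale regardless of the marginal shape.
    ranks = np.apply_along_axis(stats.rankdata, 0, original)
    scores = stats.norm.ppf((ranks - 0.5) / n)
    corr = np.corrcoef(scores, rowvar=False)

    # Cholesky factor with chol @ chol.T == corr; multiplying independent
    # standard normals by chol.T imposes the target correlation.
    chol = np.linalg.cholesky(corr)
    z = rng.standard_normal((n_samples, d)) @ chol.T

    # Convert to uniforms, then push each column through the empirical
    # quantile function of the matching original column.
    u = stats.norm.cdf(z)
    return np.column_stack(
        [np.quantile(original[:, j], u[:, j]) for j in range(d)]
    )


# Toy "original" data (an assumption): one normal and one lognormal column
# that are positively correlated.
rng = np.random.default_rng(0)
base = rng.standard_normal((2000, 2))
w = 0.8 * base[:, 0] + 0.6 * base[:, 1]
original = np.column_stack([base[:, 0], np.exp(0.5 * w)])

synthetic = synthesize(original, 2000, rng)

# Validate each marginal with the two-sample K-S test.
for j in range(original.shape[1]):
    result = stats.ks_2samp(original[:, j], synthetic[:, j])
    print(f"column {j}: KS statistic={result.statistic:.3f}, p-value={result.pvalue:.3f}")

# Compare the Pearson correlation before and after synthesis.
print("original corr: ", np.corrcoef(original, rowvar=False)[0, 1])
print("synthetic corr:", np.corrcoef(synthetic, rowvar=False)[0, 1])
```

A small K-S statistic with a large p-value for each column suggests the synthetic marginals are statistically hard to tell apart from the originals, while the two correlation values indicate how well the dependence structure carried over. Estimating the correlation on normal scores rather than on the raw columns is a design choice of this sketch, made so that skewed marginals do not distort the matrix passed to the Cholesky step.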

Why is it important?

Synthetic data is generated algorithmically rather than obtained by direct measurement or collection from the real world. It simulates real-world data and is often used for testing, validating, and training machine learning models and other computational systems.

Read the Original

This page is a summary of: Automated Algorithm for Multi-variate Data Synthesis with Cholesky Decomposition, October 2023, ACM (Association for Computing Machinery), DOI: 10.1145/3631908.3631909.
