Trustworthy AI is all about data privacy and accurate, fair, and explainable AI. However, training and validating robust AI models requires a large amount of data. At the same time, privacy regulations are tightening. Sharing real-world data with colleagues or partners raises concerns. Collecting data is costly and time-consuming. Missing and imbalanced data are a reality. So, how can we tackle these challenges?
From patient records and financial transactions to incident logs for predictive maintenance, accurately generated and rigorously validated synthetic data makes sense. Synthetic data retains statistical properties and patterns without exposing sensitive information. It helps to enrich datasets and balance rare events like disease outbreaks or fraud.
This article outlines the seven essential steps for a successful synthetic data approach in your organization. We will focus only on structured, tabular input data.
You do not generate synthetic data purely for its own sake. As usual, there is a business need, an end goal. The use case is everything: sometimes synthetic data is relevant; other times, real data is still irreplaceable.
Whether your goal is to detect insurance fraud, simulate patient cohorts and outcomes for clinical trials, or emulate electronic health records to ease information exchange between research teams, the process, methods, and tests will not be the same. Nevertheless, the goal is always to preserve quality and privacy.
Next, consider the challenges you need to address. Do you want to augment your data to improve accuracy when training AI models? Do you have underrepresented patient groups? Are you working with rare events? What type of privacy requirements do you need to meet?
Well, if you have the answers, you are ready for step 2 😉
The well-known principle ‘Garbage In, Garbage Out’ also applies to synthetic data generation. Do not neglect the essential steps of data profiling and cleaning: check for missing values and impute them if necessary, remove duplicates, and verify and correct errors. Some marginal data points or outliers also exist in real life: rare or unpredictable events, and extreme scenarios like climate risks. You may want to include these extreme cases in your data to reflect real-world variability. It is also important to consider whether you want to amplify certain minority groups, keeping in mind that the generated data reflects the biases in the original data.
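As a rough illustration, here is a minimal pandas sketch of this profiling and cleaning step. The file name and column names are hypothetical; adapt the imputation and outlier rules to your own use case.

```python
import pandas as pd

# Hypothetical input file and columns; adjust to your own data.
df = pd.read_csv("patients.csv")

# Profile: missing values and exact duplicates.
print(df.isna().sum())        # missing values per column
print(df.duplicated().sum())  # number of fully duplicated rows

# Impute missing values only where it makes sense for your use case.
df["age"] = df["age"].fillna(df["age"].median())

# Remove exact duplicates.
df = df.drop_duplicates()

# Flag (rather than drop) outliers, so you can decide whether to keep
# rare-but-real extreme cases in the training data.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
df["age_outlier"] = (df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)
```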
Do not limit yourself to a single table. Synthetic data engines support multi-table datasets and preserve statistical distributions across tables when training and generating large datasets. Imagine a patient whose information is spread across multiple tables: a patient information table, a visit schedule table, and a medication table.
SAS® Data Maker – multi-table support
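To make the multi-table idea concrete, here is a minimal pandas sketch with hypothetical patient, visit, and medication tables. Whatever engine you use, the generated output should preserve both the per-table distributions and the key relationships checked below.

```python
import pandas as pd

# Hypothetical linked tables: one parent (patients) and two children (visits, medications).
patients = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "age": [54, 37, 71],
    "condition": ["diabetes", "asthma", "hypertension"],
})
visits = pd.DataFrame({
    "visit_id": [10, 11, 12, 13],
    "patient_id": [1, 1, 2, 3],   # foreign key into patients
    "visit_date": pd.to_datetime(["2024-01-05", "2024-03-12", "2024-02-20", "2024-04-01"]),
})
medications = pd.DataFrame({
    "med_id": [100, 101, 102],
    "patient_id": [1, 2, 3],      # foreign key into patients
    "drug": ["metformin", "salbutamol", "lisinopril"],
})

# A generated multi-table dataset should keep referential integrity:
# every child row must still point to an existing parent row.
assert visits["patient_id"].isin(patients["patient_id"]).all()
assert medications["patient_id"].isin(patients["patient_id"]).all()

# ...and cross-table statistics, e.g. the distribution of visits per patient.
print(visits.groupby("patient_id").size().describe())
```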
The choice of algorithm depends on the structure and complexity of the data. While statistical methods may be enough for simple data, deep learning may be necessary for complex data and purposes. The generation technique will differ for a dataset containing information on each patient’s health status (age, health condition, treatment) compared with one containing time-series data for each patient’s consultations over the past ten years. The choice also depends heavily on your business goal: fraud detection and electronic health record emulation may require different algorithms.
Let’s talk about some of them.
Bayesian models are one-shot learning models that deliver very high-quality results on various data sets.
The model learns the structure of a Bayesian network that captures the dependencies between attributes in the dataset, then samples new records from that network.
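As a simplified illustration of the idea (not of any particular product’s implementation), the sketch below assumes a fixed dependency structure, age_group → condition → treatment, estimates the conditional probability tables from hypothetical data, and samples new records from them. Real engines also learn the network structure itself.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical "real" data with categorical attributes.
real = pd.DataFrame({
    "age_group": rng.choice(["young", "middle", "senior"], size=500, p=[0.3, 0.4, 0.3]),
})
real["condition"] = [
    rng.choice(["healthy", "chronic"], p=[0.8, 0.2] if a == "young" else [0.5, 0.5])
    for a in real["age_group"]
]
real["treatment"] = [
    rng.choice(["none", "drug_a"], p=[0.9, 0.1] if c == "healthy" else [0.2, 0.8])
    for c in real["condition"]
]

def conditional_table(df, child, parent):
    """Estimate P(child | parent) as a nested dict of probabilities."""
    return {
        p: sub[child].value_counts(normalize=True).to_dict()
        for p, sub in df.groupby(parent)
    }

p_age = real["age_group"].value_counts(normalize=True)
p_cond = conditional_table(real, "condition", "age_group")
p_treat = conditional_table(real, "treatment", "condition")

def sample_row():
    """Sample one synthetic record along the assumed dependency chain."""
    age = rng.choice(p_age.index.to_numpy(), p=p_age.to_numpy())
    cond_probs = p_cond[age]
    cond = rng.choice(list(cond_probs), p=list(cond_probs.values()))
    treat_probs = p_treat[cond]
    treat = rng.choice(list(treat_probs), p=list(treat_probs.values()))
    return {"age_group": age, "condition": cond, "treatment": treat}

synthetic = pd.DataFrame([sample_row() for _ in range(500)])
print(synthetic["condition"].value_counts(normalize=True))  # should track the real data
```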
GANs involve two deep learning networks: a generator that creates synthetic data and a discriminator that acts as an adversary by distinguishing artificial data from real data. The two networks train iteratively until the discriminator is no longer able to distinguish between synthetic and real data.
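The toy PyTorch sketch below shows this adversarial training loop on a small numeric table. It illustrates the mechanism only; production tabular GANs (CTGAN, for example) add conditioning and mode-specific normalization.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical numeric tabular data: 1000 rows, 3 columns with different scales.
real_data = torch.randn(1000, 3) * torch.tensor([1.0, 5.0, 0.5]) + torch.tensor([0.0, 10.0, -2.0])

noise_dim, data_dim = 8, real_data.shape[1]
generator = nn.Sequential(nn.Linear(noise_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
discriminator = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1))

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(2000):
    # Train the discriminator: real rows -> label 1, generated rows -> label 0.
    fake = generator(torch.randn(64, noise_dim)).detach()
    real = real_data[torch.randint(0, len(real_data), (64,))]
    d_loss = loss_fn(discriminator(real), torch.ones(64, 1)) + \
             loss_fn(discriminator(fake), torch.zeros(64, 1))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Train the generator to fool the discriminator.
    fake = generator(torch.randn(64, noise_dim))
    g_loss = loss_fn(discriminator(fake), torch.ones(64, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

# Column means of the synthetic data should approach those of the real data.
synthetic = generator(torch.randn(1000, noise_dim)).detach()
print(real_data.mean(dim=0), synthetic.mean(dim=0))
```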
Synthetic Minority Oversampling Technique (SMOTE) is an oversampling method used to address the problem of imbalanced datasets by generating synthetic samples. Instead of sampling with or without replacement, SMOTE randomly selects a minority-class sample and its k-nearest neighbors from the same class, then creates new samples by interpolating between them. While well suited to small datasets, SMOTE is not recommended for use cases requiring a high level of privacy protection.
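Here is a minimal sketch using the open-source imbalanced-learn package on a toy, hypothetical fraud-like dataset with 5% positives:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Toy imbalanced dataset: roughly 5% minority class (e.g., fraud).
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

# SMOTE interpolates between each minority sample and its k nearest minority neighbors.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print("after:", Counter(y_res))
```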
There are many privacy protection mechanisms available. If you have already tried some, you may have found that methods such as randomization, permutation, or pseudonymization carry a higher risk of re-identification, or that generalization reduces the usability of the data in machine learning models.
I would recommend trying differential privacy, considered the most effective technique for minimizing the influence that any individual’s data has on a result. Differential privacy is a mathematical framework that quantifies and controls privacy risk. It adds "noise", a calibrated random element, at the algorithm level, ensuring that the removal or addition of a single record does not significantly affect the output of any analysis.
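As a small illustration of the principle, the sketch below applies the classic Laplace mechanism to a simple count query. It shows how the privacy budget epsilon trades noise for privacy; it does not show how any specific synthetic data engine applies differential privacy internally.

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_mechanism(true_value, sensitivity, epsilon):
    """Release a noisy statistic satisfying epsilon-differential privacy."""
    scale = sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=scale)

# Hypothetical example: count of patients with a given condition.
true_count = 128
# Adding or removing one patient changes a count by at most 1 -> sensitivity = 1.
for epsilon in (0.1, 1.0, 10.0):
    print(epsilon, laplace_mechanism(true_count, sensitivity=1, epsilon=epsilon))
# Smaller epsilon -> more noise -> stronger privacy, lower utility.
```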
Well, your job does not end after the training phase and the first generation. Next comes the validation and evaluation of your generated data.
Evaluation metrics exist to assess the quality, utility and privacy of the synthetic dataset generated compared to the original.
Similarity metrics (histogram similarity, Mutual Information Similarity, Degree Distribution Similarity, Cross-Table Mutual Information Similarity...) quantify the similarity between the synthetic data and the original data.
The utility score reflects how well a synthetic dataset can be used to train AI models.
Caution: privacy metrics (density disclosure, presence disclosure...) do not constitute a guarantee of privacy, but only an estimate of the privacy risk.
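To make the similarity and utility checks concrete, here is a minimal sketch on hypothetical real and synthetic tables: it computes a histogram-overlap similarity for one column and a train-on-synthetic, test-on-real (TSTR) utility score. The metric names and thresholds used by a given product may differ.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Hypothetical real and synthetic datasets sharing the same schema:
# one numeric feature "age" and a binary target "fraud".
rng = np.random.default_rng(0)
real = pd.DataFrame({"age": rng.normal(45, 12, 2000)})
real["fraud"] = (rng.random(2000) < 0.05 + 0.002 * (real["age"] - 45).clip(lower=0)).astype(int)
synthetic = pd.DataFrame({"age": rng.normal(44, 13, 2000)})
synthetic["fraud"] = (rng.random(2000) < 0.05 + 0.002 * (synthetic["age"] - 44).clip(lower=0)).astype(int)

# 1. Similarity: histogram overlap of one column (1.0 = identical binned distributions).
bins = np.histogram_bin_edges(pd.concat([real["age"], synthetic["age"]]), bins=20)
h_real, _ = np.histogram(real["age"], bins=bins)
h_syn, _ = np.histogram(synthetic["age"], bins=bins)
overlap = np.minimum(h_real / h_real.sum(), h_syn / h_syn.sum()).sum()
print("histogram similarity:", round(overlap, 3))

# 2. Utility (train on synthetic, test on real): a model trained on synthetic data
#    should perform comparably to one trained on real data.
model = RandomForestClassifier(random_state=0).fit(synthetic[["age"]], synthetic["fraud"])
print("TSTR AUC:", roc_auc_score(real["fraud"], model.predict_proba(real[["age"]])[:, 1]))
```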
Data Maker evaluates synthetic data quality with visual evaluation metrics.
Generating synthetic data is not an isolated task but one part of an end-to-end process. There are upstream and downstream steps to consider.
The SAS Data Maker process: Plan, Prepare, and Produce.
Synthetic data generation is not a one-and-done action: it is not about generating synthetic data once, using it, and then forgetting about it. Synthetic data is an asset that forms an integral part of a larger AI system.
As with any AI modeling project, continuous monitoring must be implemented to anticipate model drift over time and maintain trust.
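One simple way to watch for drift is to compare the distribution of key features in the data the generator was trained on with newly collected data, for example with a two-sample Kolmogorov-Smirnov test. A minimal sketch with hypothetical values:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical example: the feature distribution used to train the generator
# versus the same feature in newly collected production data.
training_ages = rng.normal(45, 12, 5000)
current_ages = rng.normal(49, 12, 5000)   # the population has shifted

stat, p_value = ks_2samp(training_ages, current_ages)
if p_value < 0.01:
    print(f"Drift detected (KS statistic={stat:.3f}); consider retraining the generator.")
```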
Documentation is essential. Keep track of the privacy score, the utility metrics, and the similarity metrics for each iteration, and always document your data movements.
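A lightweight way to do this is to append one structured record per generation run to an audit log. The sketch below uses hypothetical metric names and values.

```python
import json
from datetime import datetime, timezone

# Hypothetical metrics from one generation run; append one record per iteration
# so quality, utility, and privacy scores stay auditable over time.
run_record = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "generator": "gan_v2",                 # hypothetical model identifier
    "similarity": {"histogram": 0.94, "mutual_information": 0.91},
    "utility": {"tstr_auc": 0.87},
    "privacy": {"presence_disclosure": 0.02},
    "source_data": "patients_2024Q4",      # document the data movement
}
with open("synthetic_data_runs.jsonl", "a") as log:
    log.write(json.dumps(run_record) + "\n")
```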
Robust synthetic data practice requires proper processes and governance. The essential steps are well-known to AI modelers: define your goal, prepare your data, choose the method that is best suited to your data and business goal, measure the quality and utility results and assess privacy risks.
Each step and each data movement require documentation. Under these conditions, you will be able to harness the potential of synthetic data within your organization.
And you, what would be your number 8?
Interesting reading
Step Into SAS Data Maker: A Practical First Look, October 16, 2025 by Manoj Singh
A Human Generated Introduction to Generative AI, Part 1: Synthetic Data Generation
PharmaSUG 2025, Paper RW-340: You Don't Have to Handle the Truth! Three Things to Know about Synthetic Data, Catherine Briggs, S. Robert Collins, Sundaresh Sankaran @Sundaresh1, SAS Institute
https://pharmasug.org/proceedings/2025/RW/PharmaSUG-2025-RW-340.pdf
To learn more about SAS Data Maker: https://www.sas.com/en_us/software/data-maker.html