Synthetic data: A secret ingredient for better language models
It's increasingly clear that the quality of a large language model (LLM) depends heavily on the quality of the data used during training. Take DeepSeek, for example: an open source frontier model developed at a remarkably low cost, which incorporated synthetic data in its training. In addition, for enterprises requiring specialized models for business use cases, real-world data can be challenging to obtain and annotate for model training. That's why today's artificial intelligence (AI) breakthroughs are being powered by synthetic data, a transformative approach to training and refining language models that's faster, cheaper and more scalable than traditional methods.
The data dilemma: Why synthetic data matters
Training high-performing LLMs requires vast amounts of diverse, high-quality data. Historically, this meant scraping large swaths of the internet, manually curating datasets or paying teams of annotators, a process that's expensive, slow and sometimes ethically complicated. At the same time, for specialized models in industries like healthcare, traditional data may contain personally identifiable information (PII) and be subject to regulations such as HIPAA or the EU AI Act.
For organizations developing custom AI models, the data preparation process can be quite intensive.
This is where synthetic data comes in. Synthetic data is artificially generated data used to augment real-world data, helping reduce the time, expense and legal hurdles of collecting and labeling large datasets. By leveraging LLMs, it’s possible to generate specialized queries that can be more diverse and comprehensive than human-curated alternatives.
How synthetic data generation works
Synthetic data isn't "fake" data; rather, it's designed to mimic real-world data in situations where that data is scarce. Let's take a look at some of the primary methods for generating synthetic data commonly used in today's language models.
Knowledge transfer: Model distillation for synthetic data
One of the most effective ways to generate synthetic data is known as distillation, where a larger, more advanced "teacher" model (such as Llama 405B) creates training examples for a smaller "student" model. This facilitates the transfer of knowledge into a more specialized, small language model (SLM), which has faster response times and is less resource-intensive once deployed. Notably, Meta’s Llama license changed in the 3.1 release to explicitly allow distillation, although it’s important to note that many proprietary models don’t permit these workflows.
With traditional model distillation, knowledge is transferred from the teacher model to the smaller student model.
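As a rough illustration of the pattern, the sketch below has a teacher model generate question-and-answer pairs that become supervised fine-tuning data for a student model. The prompt wording, topics and the call_teacher() placeholder are assumptions for illustration, not a specific product or API.

```python
# Minimal sketch of distillation-style synthetic data generation (illustrative only).
# The teacher call is a placeholder; in practice it would be a large model such as
# Llama 405B served behind an inference endpoint.

import json
from typing import Dict, List

SEED_TOPICS = ["refund policy", "password reset", "invoice disputes"]

PROMPT_TEMPLATE = (
    "You are generating training data for a customer-support assistant.\n"
    "Topic: {topic}\n"
    "Write one realistic customer question and a concise, correct answer.\n"
    'Respond as JSON: {{"question": "...", "answer": "..."}}'
)

def call_teacher(prompt: str) -> str:
    """Placeholder for a request to the large 'teacher' model."""
    # A real pipeline would send the prompt to a served model; a canned response
    # is returned here so the sketch runs end to end.
    return json.dumps({"question": "How do I reset my password?",
                       "answer": "Use the 'Forgot password' link on the sign-in page."})

def generate_examples(topics: List[str]) -> List[Dict[str, str]]:
    examples = []
    for topic in topics:
        raw = call_teacher(PROMPT_TEMPLATE.format(topic=topic))
        try:
            examples.append(json.loads(raw))  # keep only well-formed outputs
        except json.JSONDecodeError:
            continue
    return examples

if __name__ == "__main__":
    # The resulting (question, answer) pairs become supervised fine-tuning data
    # for the smaller 'student' model.
    print(json.dumps(generate_examples(SEED_TOPICS), indent=2))
```

In practice, the placeholder would be replaced by calls to the served teacher model, and the collected pairs would feed a standard fine-tuning run for the student.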
Iterative self-improvement
Models can also refine their own outputs through iterative self-improvement. Here the model starts with plain text and human-written prompts, and iteratively evolves those into complex queries covering all sorts of edge cases. This is an important part of post-training and fine-tuning, where a model that already has a "base" of knowledge is supplemented with additional domain-specific data to develop specialized expertise in a particular task or field.
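As a rough sketch of that loop, a pipeline might repeatedly rewrite a simple seed instruction into harder or more specific variants. The evolution prompts and the rewrite() placeholder below are illustrative assumptions, not any particular tool's implementation.

```python
# Illustrative sketch of iterative instruction evolution.
# rewrite() stands in for a generator model; the strategy prompts are assumptions.

import random
from typing import List

EVOLUTION_STRATEGIES = [
    "Add a constraint the answer must satisfy: {instruction}",
    "Rewrite this so it requires multi-step reasoning: {instruction}",
    "Make this more specific to an edge case: {instruction}",
]

def rewrite(prompt: str) -> str:
    """Placeholder for a call to the generator model."""
    # A real pipeline would send `prompt` to an LLM; here it is echoed back so
    # the sketch runs without a model behind it.
    return prompt

def evolve(seed: str, rounds: int = 3) -> List[str]:
    """Iteratively evolve a seed instruction into progressively harder variants."""
    variants = [seed]
    current = seed
    for _ in range(rounds):
        strategy = random.choice(EVOLUTION_STRATEGIES)
        current = rewrite(strategy.format(instruction=current))
        variants.append(current)
    return variants

if __name__ == "__main__":
    for variant in evolve("Explain how to rotate a TLS certificate."):
        print(variant)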
Microsoft's Evol-Instruct is one example of an automated pipeline, but at Red Hat, in partnership with IBM Research, we've been developing InstructLab to help overcome scalability challenges in the instruction-tuning phase of LLM training. By structuring a taxonomy of data and enabling subject matter experts to contribute without needing deep data science expertise, InstructLab offers a cost-effective and more scalable way to enhance LLM capabilities with knowledge and skills.
InstructLab works in part by using traditional data sources as seed examples for synthetic data generation, encouraging more diverse outputs, rigorously filtering the generated data and applying multiphase tuning to incrementally improve the model's performance. This phased approach helps maintain training stability and provides a replay buffer to help prevent catastrophic forgetting.
With InstructLab, a taxonomy organizes initial seed data used in synthetic data generation for model customization.
Data refinement and synthetic augmentation in model training
While we’ve been looking at ways to improve smaller and more specialized models, you might be wondering how the next evolution of foundation models is using synthetic data to improve benchmarks and evaluations.
One example is Microsoft's Phi-4 open source model, which incorporates a significant amount of synthetic data in its training process, using multi-agent prompting and self-revision workflows, and excels at complex reasoning. During pre-training, a combination of data refinement, filtering and classification can be applied to produce higher-quality corpora of data.
For example, the HuggingFace Cosmopedia dataset has over 25 billion tokens and is one of the largest open synthetic datasets to date. An initial web sample from a dataset like RefinedWeb might include articles on "baking techniques" with a straightforward explanation of how to bake bread. By using the Mixtral-8x7B-Instruct model to generate a synthetic re-representation of the data, the custom prompt might instruct the model to create a detailed guide for beginners, such as, "Write a step-by-step tutorial on baking sourdough bread, including tips for achieving the perfect crust." The resulting synthetic data expands on the original idea with added depth and clarity, increasing the diversity and quality of the original data sample.
In Cosmopedia, initial web extracts and seed examples are re-represented to provide more context and deeper background.
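A minimal sketch of that re-representation step might look like the following, where a web extract and audience/style hints are combined into a rewriting prompt for a generator model such as Mixtral-8x7B-Instruct. The prompt wording and the generate() placeholder are assumptions for illustration, not the actual Cosmopedia pipeline.

```python
# Sketch of a Cosmopedia-style re-representation prompt (illustrative assumptions).

from typing import Dict

def build_prompt(extract: str, audience: str, style: str) -> str:
    """Combine a web extract with audience/style hints into a rewriting prompt."""
    return (
        f"Here is an extract from a webpage:\n{extract}\n\n"
        f"Write a detailed {style} for {audience} based on this extract. "
        "Expand on the ideas with clear explanations and practical tips."
    )

def generate(prompt: str) -> str:
    """Placeholder for a call to a generator model such as Mixtral-8x7B-Instruct."""
    return "STEP-BY-STEP TUTORIAL: ..."  # stub so the sketch runs

if __name__ == "__main__":
    sample: Dict[str, str] = {
        "extract": "Baking bread starts with mixing flour, water, salt and yeast...",
        "audience": "complete beginners",
        "style": "step-by-step sourdough tutorial",
    }
    prompt = build_prompt(sample["extract"], sample["audience"], sample["style"])
    print(generate(prompt))
```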
Quality and bias risks with synthetic data
While synthetic data has immense potential, bottlenecked mainly by available compute, there are a number of risks to keep in mind. First is what's known as model collapse, where repeated training on synthetic data degrades model performance, leading to "hallucinations" or oversimplified outputs. In addition, initial flaws or biases in the training data or seed examples can persist or worsen during the generation process, so careful human curation of seed examples and post-generation review is necessary. Ultimately, a systematic pipeline with a critic system (an AI component that evaluates the performance of another AI model or system) can help filter for only high-quality examples, and a combination of LLM annotation and human annotation can provide the best results.
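One common shape for that filtering stage is a critic that scores each candidate and keeps only those above a quality threshold, as in this sketch. The scoring heuristic and threshold here are illustrative assumptions; a real critic would be another LLM or a trained classifier.

```python
# Sketch of critic-based filtering for synthetic examples (illustrative only).

from typing import Dict, List

def critic_score(example: Dict[str, str]) -> float:
    """Placeholder for a critic that rates quality/faithfulness from 0 to 1."""
    # A real critic would be another LLM or a trained classifier; a trivial
    # heuristic is used here so the sketch runs.
    return 0.9 if example["answer"].strip() else 0.0

def filter_examples(candidates: List[Dict[str, str]],
                    threshold: float = 0.8) -> List[Dict[str, str]]:
    """Keep only candidates the critic scores at or above the threshold."""
    return [ex for ex in candidates if critic_score(ex) >= threshold]

if __name__ == "__main__":
    candidates = [
        {"question": "What is model collapse?",
         "answer": "Degradation from repeatedly training on low-quality synthetic outputs."},
        {"question": "What is model collapse?", "answer": ""},  # filtered out
    ]
    print(filter_examples(candidates))
```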
The future is synthetic (and open)
As AI models continue to grow more sophisticated, synthetic data will become the default, not the exception, but its success depends on transparency and rigorous validation. With InstructLab, we are building a future that addresses the challenges of synthetic data through grounded source data, diverse contributions from subject matter experts and automated quality checks. This means organizations and enterprises can develop their own unique AI models, customized for their specific use cases and free from vendor lock-in. As we always say at Red Hat, the future is open!