
Synthetic data: A secret ingredient for better language models

March 25, 2025 | Cedric Clyburn, Carol Chen | 4-minute read


It's increasingly clear that the quality of a large language model (LLM) depends heavily on the quality of the data used during training. Take DeepSeek, for example: an open source frontier model developed at a remarkably low cost that incorporated synthetic data in its training. In addition, for enterprises requiring specialized models for business use cases, real-world data can be challenging to obtain and annotate for model training. That's why today's artificial intelligence (AI) breakthroughs are being powered by synthetic data, a transformative approach to training and refining language models that's faster, cheaper and more scalable than traditional methods.

The data dilemma: Why synthetic data matters

Training high-performing LLMs requires vast amounts of diverse, high-quality data. Historically, this meant scraping large swaths of the internet, manually curating datasets or paying teams of annotators, a process that's expensive, slow and sometimes ethically complicated. At the same time, for specialized models in industries like healthcare, traditional data may contain personally identifiable information (PII) and be subject to regulations such as HIPAA or the EU AI Act.

Figure: For organizations developing custom AI models, the data preparation process can be quite intensive.

This is where synthetic data comes in. Synthetic data is artificially generated data used to augment real-world data, helping reduce the time, expense and legal hurdles of collecting and labeling large datasets. By leveraging LLMs, it’s possible to generate specialized queries that can be more diverse and comprehensive than human-curated alternatives.

How synthetic data generation works

Synthetic data isn’t "fake" data; rather, it is designed to mimic real-world data where that data may be scarce. Let’s take a look at some of the primary methods for generating synthetic data that are commonly used in today's language models.

Knowledge transfer: Model distillation for synthetic data

One of the most effective ways to generate synthetic data is known as distillation, where a larger, more advanced "teacher" model (such as Llama 405B) creates training examples for a smaller "student" model. This facilitates the transfer of knowledge into a more specialized, small language model (SLM), which has faster response times and is less resource-intensive once deployed. Notably, Meta’s Llama license changed in the 3.1 release to explicitly allow distillation, although it’s important to note that many proprietary models don’t permit these workflows.

Figure: With traditional model distillation, knowledge from the teacher model is transferred to the smaller student model.
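To make the distillation idea concrete, here is a minimal, illustrative sketch (not Red Hat's or Meta's actual pipeline): a larger "teacher" model served behind an OpenAI-compatible endpoint is asked to produce question-and-answer pairs, which are saved as training examples for a smaller "student" model. The endpoint URL, model name, topics and prompt wording are all assumptions made for this example.

```python
# Illustrative sketch only: generate synthetic Q&A pairs from a "teacher" model
# (e.g., a Llama 405B-class model) served behind an OpenAI-compatible API such
# as vLLM. The base_url, model name and prompt are assumptions, not a real
# production pipeline.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

TOPICS = ["renewing an insurance policy", "filing a claim after a car accident"]

def generate_pair(topic: str) -> dict:
    """Ask the teacher model for one domain-specific Q&A training example."""
    response = client.chat.completions.create(
        model="teacher-llm",  # hypothetical serving name for the teacher model
        messages=[{
            "role": "user",
            "content": (
                f"Write one realistic customer question about {topic} "
                "and a concise, accurate answer. "
                'Return JSON with keys "question" and "answer".'
            ),
        }],
        temperature=0.8,
    )
    # In practice you would validate the JSON and retry on malformed replies.
    return json.loads(response.choices[0].message.content)

with open("distilled_pairs.jsonl", "w") as f:
    for topic in TOPICS:
        f.write(json.dumps(generate_pair(topic)) + "\n")
# The resulting JSONL file can then be used to fine-tune the smaller student model.
```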

Iterative self-improvement

Models can also refine their own outputs through iterative self-improvement. Here the model starts with plain text and human-written prompts, and iteratively evolves those into complex queries covering all sorts of edge cases. This is an important part of post-training and fine-tuning, where a model already has a "base" of knowledge that can be supplemented with additional domain-specific data to develop specialized expertise related to a certain task or field of knowledge.
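As a rough, illustrative sketch of that evolution loop (not any specific framework's actual implementation), the snippet below repeatedly asks a generator model to rewrite a simple seed prompt into a harder, more specific variant. The endpoint, model name and rewrite instruction are all assumptions made for the example.

```python
# Illustrative sketch of iterative prompt "evolution" for self-improvement.
# The model name, endpoint and instructions below are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

EVOLVE_INSTRUCTION = (
    "Rewrite the following prompt so it is more specific, requires deeper "
    "reasoning, and covers an edge case, while staying answerable:\n\n{prompt}"
)

def evolve(prompt: str, rounds: int = 3) -> list[str]:
    """Return progressively more complex variants of a seed prompt."""
    variants = [prompt]
    for _ in range(rounds):
        response = client.chat.completions.create(
            model="generator-llm",  # hypothetical model name
            messages=[{"role": "user",
                       "content": EVOLVE_INSTRUCTION.format(prompt=variants[-1])}],
            temperature=0.9,
        )
        variants.append(response.choices[0].message.content.strip())
    return variants

print(evolve("Explain how to rotate TLS certificates."))
```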

Microsoft’s Evol-Instruct is one example of an automated pipeline, but at Red Hat, in partnership with IBM Research, we’ve been developing InstructLab to help overcome scalability challenges in the instruction-tuning phase of LLM training. By structuring data as a taxonomy and enabling subject matter experts to contribute without needing deep data science expertise, InstructLab offers a cost-effective and more scalable solution for enhancing LLM capabilities with knowledge and skills.

InstructLab works in part by using traditional data sources as seed examples for synthetic data generation, enabling more diverse outputs, rigorously filtering the produced data and applying multiphase tuning to incrementally improve the model's performance. This phased approach helps maintain training stability and uses a replay buffer to help prevent catastrophic forgetting.
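InstructLab's actual taxonomy lives in YAML files on disk, but as a simplified, hypothetical illustration of the flow described above, the sketch below pairs a few seed examples with a basic post-generation filter that discards duplicates and overly short outputs. None of this is InstructLab's real code or schema; the field names and thresholds are assumptions.

```python
# Simplified, hypothetical illustration of seed-driven generation plus basic
# filtering. This is NOT InstructLab's actual code or taxonomy schema.
seed_examples = [
    {"question": "What does our travel policy say about booking flights?",
     "answer": "Flights must be booked at least 14 days in advance through "
               "the approved corporate travel portal."},
    {"question": "How do employees submit expense reports?",
     "answer": "Expense reports are submitted monthly through the finance "
               "system, with receipts attached for every item."},
]

def keep(sample: dict, seen: set[str]) -> bool:
    """Reject duplicate questions and answers too short to be useful."""
    key = sample["question"].strip().lower()
    if key in seen or len(sample["answer"].split()) < 10:
        return False
    seen.add(key)
    return True

seen: set[str] = set()
# generated_samples would normally come from the teacher model, prompted with
# the seed examples; we reuse the seeds here just to keep the sketch runnable.
generated_samples = seed_examples
filtered = [s for s in generated_samples if keep(s, seen)]
print(f"kept {len(filtered)} of {len(generated_samples)} samples")
```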

Figure: With InstructLab, a taxonomy organizes initial seed data to be used in synthetic data generation for model customization.

Data refinement and synthetic augmentation in model training

While we’ve been looking at ways to improve smaller and more specialized models, you might be wondering how the next evolution of foundation models is using synthetic data to improve benchmarks and evaluations.

One example is Microsoft’s Phi-4 open source model, which incorporates a significant amount of synthetic data in its training process through multi-agent prompting and self-revision workflows, and excels at complex reasoning. During pre-training, a combination of data refinement, filtering and classification can be performed to produce higher-quality training corpora.

For example, the Hugging Face Cosmopedia dataset has over 25 billion tokens and is one of the largest open synthetic datasets to date. An initial web sample from a dataset like RefinedWeb might include articles on "baking techniques" with a straightforward explanation of how to bake bread. By using the Mixtral-8x7B-Instruct model to generate a synthetic re-representation of the data, a custom prompt might instruct the model to create a detailed guide for beginners, such as, "Write a step-by-step tutorial on baking sourdough bread, including tips for achieving the perfect crust." The resulting synthetic data expands on the original idea with added depth and clarity, increasing the diversity and quality of the original data sample.
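A hedged sketch of that re-representation step is shown below: a short web extract and a style instruction are combined into a prompt for an instruction-tuned model (Mixtral in Cosmopedia's case), which writes a longer, tutorial-style synthetic document. The served model name, endpoint and prompt wording are assumptions, not the actual Cosmopedia pipeline.

```python
# Illustrative sketch of Cosmopedia-style re-representation: rewrite a short
# web extract into a richer synthetic document. Model name, endpoint and
# prompt wording are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

web_extract = (
    "Sourdough bread is made from flour, water, and salt, and rises using a "
    "natural starter instead of commercial yeast."
)

prompt = (
    "Using the extract below as inspiration, write a step-by-step tutorial "
    "on baking sourdough bread for complete beginners, including tips for "
    f"achieving the perfect crust.\n\nExtract: {web_extract}"
)

response = client.chat.completions.create(
    model="mixtral-8x7b-instruct",  # assumed serving name for the rewrite model
    messages=[{"role": "user", "content": prompt}],
    max_tokens=800,
)
synthetic_document = response.choices[0].message.content
print(synthetic_document[:200])  # preview of the expanded synthetic sample
```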

Figure: In Cosmopedia, initial web extracts and seed examples are re-represented to provide richer context and deeper background.

Quality and bias risks with synthetic data

While synthetic data has immense potential, bottlenecked only by available compute, there are a number of risks to keep in mind. First is what’s known as model collapse, where repeated training on synthetic data can degrade model performance, leading to "hallucinations" or oversimplified outputs. In addition, initial flaws or biases in the training data or seed samples can persist or worsen during the generation process, so careful human curation of seed examples and post-generation review is necessary. Ultimately, a systematic pipeline with a critic system (an AI component that evaluates the performance of another AI model or system) can help filter for only high-quality examples, and a combination of LLM annotation and human annotation provides the best results.
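As a rough, hypothetical example of such a critic, the sketch below asks a judge model to score each generated sample from 1 to 5 and keeps only those above a threshold. The judge model name, rubric and threshold are all assumptions for illustration, not a specific product's implementation.

```python
# Hypothetical LLM-as-judge critic: score generated samples and keep only the
# best. Judge model name, rubric and threshold are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def score(sample: dict) -> int:
    """Ask a judge model to rate a Q&A pair from 1 (poor) to 5 (excellent)."""
    response = client.chat.completions.create(
        model="judge-llm",  # hypothetical judge model
        messages=[{
            "role": "user",
            "content": (
                "Rate the factual accuracy and usefulness of this Q&A pair "
                "from 1 to 5. Reply with only the number.\n\n"
                f"Q: {sample['question']}\nA: {sample['answer']}"
            ),
        }],
        temperature=0.0,
    )
    try:
        return int(response.choices[0].message.content.strip())
    except ValueError:
        return 0  # treat unparseable replies as rejects

samples = [{"question": "How do I reset my VPN password?",
            "answer": "Open the self-service portal and choose 'Reset password'."}]
high_quality = [s for s in samples if score(s) >= 4]
# High-scoring samples would still go through human spot checks before training.
```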

The future is synthetic (and open)

As AI models continue to grow more sophisticated, synthetic data will become the default rather than the exception, but its success depends on transparency and rigorous validation. With InstructLab, we are building a future that addresses the challenges of synthetic data through grounded source data, diverse contributions from subject matter experts and automated quality checks. This means organizations and enterprises can develop their own unique AI models, customized for their specific use cases and free from vendor lock-in. As we always say at Red Hat, the future is open!


About the author

Cedric Clyburn (@cedricclyburn), Senior Developer Advocate at Red Hat, is an enthusiastic software technologist with a background in Kubernetes, DevOps, and container tools. He has experience speaking at and organizing conferences including DevNexus, WeAreDevelopers, The Linux Foundation, KCD NYC, and more. Cedric loves all things open source and works to make developers' lives easier! He is based in New York.
