
Hugging Face Publishes Guide on Efficient LLM Training across GPUs


Mar 04, 2025 2 min read

Hugging Face has published the Ultra-Scale Playbook: Training LLMs on GPU Clusters, an open-source guide that provides a detailed exploration of the methodologies and technologies involved in training LLMs across GPU clusters. The playbook is based on over 4000 scaling experiments conducted using up to 512 GPUs, with a focus on optimizing throughput, GPU utilization, and training efficiency. It aims to provide practical guidance for researchers and engineers working on large-scale model training, offering reproducible benchmarks, implementation details, and performance optimizations.

The guide covers various parallelism strategies essential for scaling LLM training. Data parallelism (DP) enables multiple GPUs to process different batches of data simultaneously, while tensor parallelism (TP) distributes the model’s weights across GPUs to balance memory usage and computation. Pipeline parallelism (PP) splits the model into segments distributed across GPUs, allowing different parts of the model to be processed concurrently. Context parallelism (CP) is also explored as an emerging technique to improve scalability.
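As a rough illustration of the data-parallel case, the sketch below uses PyTorch's DistributedDataParallel to give each GPU its own shard of the data while gradients are synchronized across ranks. The toy model, dataset, and hyperparameters are placeholders and are not taken from the playbook.

```python
# Minimal data-parallelism (DP) sketch with PyTorch DDP; launch with e.g.
# `torchrun --nproc_per_node=8 train.py`. Model and data are illustrative only.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    dist.init_process_group(backend="nccl")          # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()       # stand-in for an LLM block
    model = DDP(model, device_ids=[local_rank])      # gradients all-reduced across ranks

    data = TensorDataset(torch.randn(4096, 1024), torch.randn(4096, 1024))
    sampler = DistributedSampler(data)               # each rank sees a distinct shard
    loader = DataLoader(data, batch_size=32, sampler=sampler)

    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.MSELoss()
    for x, y in loader:
        opt.zero_grad()
        loss = loss_fn(model(x.cuda()), y.cuda())
        loss.backward()                              # DDP overlaps gradient all-reduce here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Tensor, pipeline, and context parallelism follow the same distributed setup but partition the model's weights, layers, or sequence dimension instead of the batch.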

Memory management is another key topic in the playbook, addressing challenges such as memory constraints and optimization techniques. Activation recomputation is introduced as a method to reduce memory consumption by recalculating intermediate activations when needed rather than storing them. Gradient accumulation is highlighted as a way to achieve larger effective batch sizes without exceeding memory limits, improving training stability and efficiency. These techniques are essential for training LLMs that exceed the memory capacity of individual GPUs.
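The hedged sketch below illustrates both ideas in plain PyTorch, using torch.utils.checkpoint for activation recomputation and a simple gradient-accumulation loop; the toy model, micro-batch size, and accumulation factor are assumptions for illustration, not values from the playbook.

```python
# Activation recomputation + gradient accumulation sketch (illustrative only).
import torch
from torch.utils.checkpoint import checkpoint

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
accum_steps = 8            # effective batch size = micro-batch size * accum_steps

opt.zero_grad()
for step in range(64):
    x = torch.randn(4, 1024, requires_grad=True)     # small micro-batch
    # Activation recomputation: intermediate activations inside `model` are not
    # stored; they are recomputed during the backward pass to save memory.
    out = checkpoint(model, x, use_reentrant=False)
    loss = out.pow(2).mean() / accum_steps            # scale so accumulated grads average
    loss.backward()                                    # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        opt.step()                                     # one update per 8 micro-batches
        opt.zero_grad()
```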

The playbook also provides extensive benchmarking insights, demonstrating the importance of empirical testing in optimizing training configurations: various setups are tested to determine the best balance between batch size, model architecture, and number of GPUs. Effective benchmarking helps refine training speed, resource allocation, and computational efficiency, all of which are crucial for large-scale training.
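A minimal sketch of what such an empirical sweep might look like, assuming a single GPU and a toy Transformer layer (none of which comes from the playbook), is to time a few training steps per candidate micro-batch size and compare throughput:

```python
# Illustrative throughput sweep over micro-batch sizes (not from the playbook).
import time
import torch

model = torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
seq_len = 512

for batch_size in (1, 2, 4, 8):
    x = torch.randn(batch_size, seq_len, 1024, device="cuda")
    # Warm-up step so CUDA kernels and allocator caches are initialized.
    model(x).mean().backward()
    opt.step()
    opt.zero_grad()
    torch.cuda.synchronize()

    steps = 10
    start = time.perf_counter()
    for _ in range(steps):
        opt.zero_grad()
        model(x).mean().backward()
        opt.step()
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    tokens_per_sec = batch_size * seq_len * steps / elapsed
    print(f"micro-batch {batch_size}: {tokens_per_sec:,.0f} tokens/s")
```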

Communication overhead between GPUs is another factor influencing training efficiency. The playbook discusses methods for reducing idle GPU time by overlapping communication with computation, such as using all-reduce operations during the backward pass. Strategies for optimizing network bandwidth and minimizing synchronization delays are also explored to improve overall training performance.
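The sketch below shows the general idea with an asynchronous all-reduce from PyTorch's torch.distributed API: the reduction for one tensor is launched without blocking while further computation is queued, and the handles are only waited on at the end. The tensors and the dummy computation are stand-ins; production frameworks such as DDP perform this overlap automatically by bucketing gradients during the backward pass.

```python
# Sketch of overlapping communication with computation via async all-reduce.
# Launch with e.g. `torchrun --nproc_per_node=8 overlap.py`. Illustrative only.
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    world_size = dist.get_world_size()

    # Pretend these are per-layer gradients produced one by one during backward.
    grads = [torch.randn(1024, 1024, device="cuda") for _ in range(4)]
    work = torch.randn(2048, 2048, device="cuda")

    handles = []
    for g in grads:
        # Launch the all-reduce for this "layer" without blocking...
        handles.append(dist.all_reduce(g, op=dist.ReduceOp.SUM, async_op=True))
        # ...and keep the GPU busy with the "next layer's" computation meanwhile.
        work = work @ work

    for h in handles:
        h.wait()                    # ensure all reductions completed before the update
    for g in grads:
        g /= world_size             # convert the summed gradients into an average

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```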

Posts about the playbook reflect a wave of excitement and appreciation for the open-source guide. Leandro von Werra, head of research at Hugging Face, announced the playbook and shared:

Learn how to train your own DeepSeek-V3 model using 5D parallelism, ZeRO, fast kernels, compute/comm overlap and bottlenecks with theory, interactive plots and 4000+ scaling experiments and audio!

And AI developer Denis Redozubov posted:

There are some very cool bits like a widget calculating a memory breakdown of the transformer model

Finally, the playbook also touches on future directions in LLM training, anticipating advancements in hardware and software that will continue to shape the field. Research into optimizing communication, reducing memory overhead, and refining parallelism techniques is expected to drive further improvements in scalability and efficiency.

About the Author

Daniel Dominguez

