DataStates-LLM: Asynchronous I/O Engine



DataStates-LLM is a high-performance I/O engine for GPU-accelerated workloads, with a particular focus on large-scale DeepSpeed/Megatron training. It provides lazy, asynchronous checkpointing backed by CUDA and io_uring, allowing you to overlap checkpoint I/O with forward/backward passes and reduce checkpoint overheads at scale.

For a detailed description of the design, implementation, and evaluation against state-of-the-art checkpointing engines, see our HPDC’24 paper:

Avinash Maurya, Robert Underwood, M. Mustafa Rafique, Franck Cappello, and Bogdan Nicolae.
"DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models".
HPDC’24: The 33rd International Symposium on High-Performance Parallel and Distributed Computing (Pisa, Italy, 2024).


1. Features Overview

  • Lazy, asynchronous checkpointing for large language model training
  • GPU-aware checkpoint engine (CUDA) with host/device tiers
  • io_uring-based file I/O path for low-overhead persistence
  • C++ core library with a thin Python binding (nanobind)
  • Integrates with DeepSpeed’s async checkpointing API
  • Designed for HPC/AI environments (multi-GPU, multi-node)
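To make the idea of overlapping checkpoint I/O with training concrete, here is a minimal, framework-free sketch. It is not the DataStates-LLM API: the `AsyncCheckpointer` class, the `train` loop, and the pickle file format are all hypothetical and chosen only for illustration. It also copies the state eagerly before handing it to a background thread, so it models only the asynchronous flush, not the lazy device-to-host copy or the io_uring path described above.

```python
import copy
import os
import pickle
import tempfile
import threading


class AsyncCheckpointer:
    """Toy background checkpoint writer (hypothetical, not the DataStates-LLM API)."""

    def __init__(self):
        self._thread = None

    def save(self, state, path):
        # Allow at most one flush in flight: wait for the previous one first.
        self.wait()
        # Snapshot the state so the training loop can keep mutating the original.
        snapshot = copy.deepcopy(state)
        self._thread = threading.Thread(target=self._flush, args=(snapshot, path))
        self._thread.start()

    @staticmethod
    def _flush(snapshot, path):
        # Write to a temporary file and rename it, so a reader never sees a partial file.
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(snapshot, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)

    def wait(self):
        # Block until the in-flight flush (if any) has reached storage.
        if self._thread is not None:
            self._thread.join()
            self._thread = None


def train(steps, ckpt_path):
    """Stand-in training loop that checkpoints every second step."""
    state = {"step": 0, "weights": [0.0] * 4}
    ckpt = AsyncCheckpointer()
    for step in range(1, steps + 1):
        state["step"] = step
        state["weights"] = [w + 1.0 for w in state["weights"]]  # fake "update"
        if step % 2 == 0:
            ckpt.save(state, ckpt_path)  # returns before the write completes
    ckpt.wait()  # make sure the last checkpoint is persisted before exiting
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        final_state = train(4, os.path.join(d, "ckpt.pkl"))
        print(final_state["step"])
```

The design point the sketch shares with a real engine is that `save` returns as soon as a consistent snapshot exists, and the training loop blocks only when a new checkpoint is requested while the previous one is still being written.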

2. Installation and Tests

2.1. Prerequisites

  • Linux (tested on recent x86_64 distributions)
  • C++17 compiler (GCC 11+ recommended)
  • CUDA toolkit (matching your cluster environment)
  • liburing development headers
  • CMake ≥ 3.15
  • Python ≥ 3.8
  • (For Python bindings)
    • PyTorch
    • fasteners
    • nanobind (pulled in automatically by install.sh)
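Before building, it can help to check which of these prerequisites are visible from the active environment. The helper below is a hypothetical convenience script and is not shipped with the repository. It only checks the Python version, whether `cmake` and `nvcc` are on `PATH`, and whether the `torch` and `fasteners` packages can be found. It does not verify the CMake or compiler version or the liburing headers, and a CUDA toolkit may be installed even if `nvcc` is not on `PATH`.

```python
import importlib.util
import shutil
import sys


def check_prerequisites():
    """Return a dict mapping each checked prerequisite to True or False."""
    return {
        "python>=3.8": sys.version_info >= (3, 8),
        "cmake": shutil.which("cmake") is not None,
        "nvcc (CUDA toolkit)": shutil.which("nvcc") is not None,
        "torch": importlib.util.find_spec("torch") is not None,
        "fasteners": importlib.util.find_spec("fasteners") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{'found  ' if ok else 'MISSING'}  {name}")
```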

2.2. Clone and Build

git clone https://github.com/DataStates/datastates-llm.git
cd datastates-llm/
# Activate your target Python/conda environment first.
# By default, install.sh will:
# - detect the active Python env
# - install into its site-packages
# - build C++ core + Python bindings
./install.sh

By default, install.sh builds the C++ core library and the Python bindings and installs into the active environment's site-packages.

You can also control the install prefix and whether Python bindings are built:

# 1st arg: install prefix (optional)
# 2nd arg: build Python bindings? [on/off/yes/no/1/0]
# Example: install into a custom prefix WITH Python bindings
./install.sh /path/to/prefix on
# Example: install core library only, no Python bindings,
# into the active Python environment’s site-packages
./install.sh "" off

2.3. Tests

C++ core engine test

# Run the test binary directly
./build/tests/test_core_engine
# Or run the tests through CTest
cd build/
ctest 

Python tests

python tests/python/test_base_state_provider.py # Without DeepSpeed
python tests/python/test_llm_ckpt_state_engine.py # With DeepSpeed

3. Using DataStates-LLM with DeepSpeed

DeepSpeed provides an official tutorial on enabling DataStates-based asynchronous checkpointing through a single JSON entry in the DeepSpeed config file.

That tutorial covers:

  • Configuring DeepSpeed to use DataStates-LLM as the asynchronous checkpoint backend
  • Relevant DeepSpeed configuration options
  • Example training scripts integrating DataStates-LLM
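As a rough illustration of what such a config entry might look like, the sketch below adds a checkpointing section to an existing DeepSpeed config dictionary and prints the resulting JSON. The key names `datastates_ckpt` and `host_cache_size` (host cache size in GB) are assumptions recalled from the DeepSpeed tutorial and should be checked against it before use. The helper `with_datastates` and the sample base config are hypothetical.

```python
import json


def with_datastates(ds_config, host_cache_gb=16):
    """Return a copy of a DeepSpeed config dict with a DataStates entry added.

    ASSUMPTION: the key names below follow the DeepSpeed tutorial as recalled;
    verify them against the tutorial for your DeepSpeed version.
    """
    cfg = dict(ds_config)  # shallow copy, so the caller's dict is not modified
    cfg["datastates_ckpt"] = {"host_cache_size": host_cache_gb}
    return cfg


if __name__ == "__main__":
    base = {"train_batch_size": 16}  # placeholder for an existing config
    print(json.dumps(with_datastates(base), indent=2))
```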

4. Contributions and Issues

We welcome feedback, bug reports, and contributions.

  • File issues and feature requests via the GitHub Issue tracker.
  • Contributions in the form of bug fixes, portability improvements, and integration with additional frameworks are particularly appreciated.
