korovod/datastates

DataStates

Efficient asynchronous checkpointing engine.

For a detailed description of the design principles, implementation, and performance evaluation against state-of-the-art checkpointing engines, see the HPDC'24 paper listed under Citation.

Usage

Requirements

  • Python
  • pybind11
  • PyTorch

Installation

Using Spack:

git clone -c feature.manyFiles=true --depth=2 https://github.com/spack/spack.git
git clone https://github.com/korovod/korovod-spack-packages.git
cd spack/bin
# The packages repository was cloned next to spack/, two levels above spack/bin
./spack repo add ../../korovod-spack-packages
./spack install py-datastates

Using pip:

git clone https://github.com/korovod/datastates.git
cd datastates
# Build and install the C++/Python bindings
pip install . -v

Using DataStates in your Python project

from datastates import CkptEngine
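
This README does not document the methods of CkptEngine, so the sketch below deliberately does not call it. It only illustrates the general idea behind asynchronous checkpointing: take a snapshot of the state, hand the write to a background worker, and let the training loop continue until it needs to wait. The class AsyncCheckpointer and its save/wait methods are hypothetical names chosen for this illustration. It copies state eagerly with copy.deepcopy and writes with pickle on a Python thread, which is a much simpler approach than the CUDA-copy-engine design this project describes.

```python
import copy
import os
import pickle
import tempfile
import threading


class AsyncCheckpointer:
    """Hypothetical stand-in for illustration; NOT the DataStates CkptEngine API."""

    def __init__(self):
        self._pending = []  # background writer threads not yet joined

    def save(self, state, path):
        # Snapshot synchronously so later mutations by the training loop
        # cannot leak into the checkpoint being written.
        snapshot = copy.deepcopy(state)
        writer = threading.Thread(target=self._write, args=(snapshot, path))
        writer.start()
        self._pending.append(writer)

    @staticmethod
    def _write(snapshot, path):
        # Write to a temporary file, then rename it into place, so a reader
        # never observes a partially written checkpoint.
        tmp_path = path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(snapshot, f)
        os.replace(tmp_path, path)

    def wait(self):
        # Block until every outstanding write has finished.
        for writer in self._pending:
            writer.join()
        self._pending.clear()


def demo():
    state = {"step": 1, "weights": [0.1, 0.2, 0.3]}
    ckpt = AsyncCheckpointer()
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "step1.ckpt")
        ckpt.save(state, path)
        # "Training" continues and mutates the live state during the write.
        state["step"] = 2
        state["weights"][0] = 9.9
        ckpt.wait()
        with open(path, "rb") as f:
            return pickle.load(f)


if __name__ == "__main__":
    print(demo())
```

The checkpoint read back in demo() reflects the state at the time save() was called, not the later mutations, which is the property any asynchronous checkpointing scheme has to preserve.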

Tests

# Test with a simple PyTorch script (DeepSpeed not required).
python tests/test_ckpt_engine.py
# Test with a simple DeepSpeed script (requires DeepSpeed).
python tests/test_datastates_llm.py

Citation

Avinash Maurya, Robert Underwood, M. Mustafa Rafique, Franck Cappello, and Bogdan Nicolae. "DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models". HPDC'24: The 33rd International Symposium on High-Performance Parallel and Distributed Computing (Pisa, Italy, 2024).

About

Efficient asynchronous checkpointing using CUDA copy engines
