recipes/distributed_async_checkpoint_recipe

Asynchronous Saving with Distributed Checkpoint (DCP)#

Created On: Jul 22, 2024 | Last Updated: Sep 29, 2025 | Last Verified: Nov 05, 2024

Author: Lucas Pasqualin, Iris Zhang, Rodrigo Kumpera, Chien-Chin Huang

Checkpointing is often a bottle-neck in the critical path for distributed training workloads, incurring larger and larger costs as both model and world sizes grow. One excellent strategy for offsetting this cost is to checkpoint in parallel, asynchronously. Below, we expand the save example from the Getting Started with Distributed Checkpoint Tutorial to show how this can be integrated quite easily with torch.distributed.checkpoint.async_save.

What you will learn

How to use DCP to generate checkpoints in parallel
Effective strategies to optimize performance

Prerequisites

PyTorch v2.4.0 or later
Getting Started with Distributed Checkpoint Tutorial

Asynchronous Checkpointing Overview#

Before getting started with Asynchronous Checkpointing, it’s important to understand it’s differences and limitations as compared to synchronous checkpointing. Specifically:

Memory requirements - Asynchronous checkpointing works by first copying models into internal CPU-buffers.
This is helpful since it ensures model and optimizer weights are not changing while the model is still checkpointing, but does raise CPU memory by a factor of checkpoint_size_per_rank X number_of_ranks. Additionally, users should take care to understand the memory constraints of their systems. Specifically, pinned memory implies the usage of page-lock memory, which can be scarce as compared to pageable memory.
Checkpoint Management - Since checkpointing is asynchronous, it is up to the user to manage concurrently run checkpoints. In general, users can
employ their own management strategies by handling the future object returned form async_save. For most users, we recommend limiting checkpoints to one asynchronous request at a time, avoiding additional memory pressure per request.

importos
importtorch
importtorch.distributedasdist
importtorch.distributed.checkpointasdcp
importtorch.multiprocessingasmp
importtorch.nnasnn
fromtorch.distributed.fsdpimport fully_shard
fromtorch.distributed.checkpoint.state_dictimport get_state_dict, set_state_dict
fromtorch.distributed.checkpoint.statefulimport Stateful
CHECKPOINT_DIR = "checkpoint"
classAppState(Stateful):
"""This is a useful wrapper for checkpointing the Application State. Since this object is compliant
 with the Stateful protocol, DCP will automatically call state_dict/load_stat_dict as needed in the
 dcp.save/load APIs.
 Note: We take advantage of this wrapper to hande calling distributed state dict methods on the model
 and optimizer.
 """
 def__init__(self, model, optimizer=None):
 self.model = model
 self.optimizer = optimizer
 defstate_dict(self):
 # this line automatically manages FSDP FQN's, as well as sets the default state dict type to FSDP.SHARDED_STATE_DICT
 model_state_dict, optimizer_state_dict = get_state_dict(self.model, self.optimizer)
 return {
 "model": model_state_dict,
 "optim": optimizer_state_dict
 }
 defload_state_dict(self, state_dict):
 # sets our state dicts on the model and optimizer, now that we've loaded
 set_state_dict(
 self.model,
 self.optimizer,
 model_state_dict=state_dict["model"],
 optim_state_dict=state_dict["optim"]
 )
classToyModel(nn.Module):
 def__init__(self):
 super(ToyModel, self).__init__()
 self.net1 = nn.Linear(16, 16)
 self.relu = nn.ReLU()
 self.net2 = nn.Linear(16, 8)
 defforward(self, x):
 return self.net2(self.relu(self.net1(x)))
defsetup(rank, world_size):
 os.environ["MASTER_ADDR"] = "localhost"
 os.environ["MASTER_PORT"] = "12355 "
 # initialize the process group
 dist.init_process_group("gloo", rank=rank, world_size=world_size)
 torch.cuda.set_device(rank)
defcleanup():
 dist.destroy_process_group()
defrun_fsdp_checkpoint_save_example(rank, world_size):
 print(f"Running basic FSDP checkpoint saving example on rank {rank}.")
 setup(rank, world_size)
 # create a model and move it to GPU with id rank
 model = ToyModel().to(rank)
 model = fully_shard(model)
 loss_fn = nn.MSELoss()
 optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
 checkpoint_future = None
 for step in range(10):
 optimizer.zero_grad()
 model(torch.rand(8, 16, device="cuda")).sum().backward()
 optimizer.step()
 # waits for checkpointing to finish if one exists, avoiding queuing more then one checkpoint request at a time
 if checkpoint_future is not None:
 checkpoint_future.result()
 state_dict = { "app": AppState(model, optimizer) }
 checkpoint_future = dcp.async_save(state_dict, checkpoint_id=f"{CHECKPOINT_DIR}_step{step}")
 cleanup()
if __name__ == "__main__":
 world_size = torch.cuda.device_count()
 print(f"Running async checkpoint example on {world_size} devices.")
 mp.spawn(
 run_fsdp_checkpoint_save_example,
 args=(world_size,),
 nprocs=world_size,
 join=True,
 )

Even more performance with Pinned Memory#

If the above optimization is still not performant enough, you can take advantage of an additional optimization for GPU models which utilizes a pinned memory buffer for checkpoint staging. Specifically, this optimization attacks the main overhead of asynchronous checkpointing, which is the in-memory copying to checkpointing buffers. By maintaining a pinned memory buffer between checkpoint requests users can take advantage of direct memory access to speed up this copy.

Note

The main drawback of this optimization is the persistence of the buffer in between checkpointing steps. Without the pinned memory optimization (as demonstrated above), any checkpointing buffers are released as soon as checkpointing is finished. With the pinned memory implementation, this buffer is maintained between steps, leading to the same peak memory pressure being sustained through the application life.

importos
importtorch
importtorch.distributedasdist
importtorch.distributed.checkpointasdcp
importtorch.multiprocessingasmp
importtorch.nnasnn
fromtorch.distributed.fsdpimport fully_shard
fromtorch.distributed.checkpoint.state_dictimport get_state_dict, set_state_dict
fromtorch.distributed.checkpoint.statefulimport Stateful
fromtorch.distributed.checkpointimport FileSystemWriter as StorageWriter
CHECKPOINT_DIR = "checkpoint"
classAppState(Stateful):
"""This is a useful wrapper for checkpointing the Application State. Since this object is compliant
 with the Stateful protocol, DCP will automatically call state_dict/load_stat_dict as needed in the
 dcp.save/load APIs.
 Note: We take advantage of this wrapper to hande calling distributed state dict methods on the model
 and optimizer.
 """
 def__init__(self, model, optimizer=None):
 self.model = model
 self.optimizer = optimizer
 defstate_dict(self):
 # this line automatically manages FSDP FQN's, as well as sets the default state dict type to FSDP.SHARDED_STATE_DICT
 model_state_dict, optimizer_state_dict = get_state_dict(self.model, self.optimizer)
 return {
 "model": model_state_dict,
 "optim": optimizer_state_dict
 }
 defload_state_dict(self, state_dict):
 # sets our state dicts on the model and optimizer, now that we've loaded
 set_state_dict(
 self.model,
 self.optimizer,
 model_state_dict=state_dict["model"],
 optim_state_dict=state_dict["optim"]
 )
classToyModel(nn.Module):
 def__init__(self):
 super(ToyModel, self).__init__()
 self.net1 = nn.Linear(16, 16)
 self.relu = nn.ReLU()
 self.net2 = nn.Linear(16, 8)
 defforward(self, x):
 return self.net2(self.relu(self.net1(x)))
defsetup(rank, world_size):
 os.environ["MASTER_ADDR"] = "localhost"
 os.environ["MASTER_PORT"] = "12355 "
 # initialize the process group
 dist.init_process_group("gloo", rank=rank, world_size=world_size)
 torch.cuda.set_device(rank)
defcleanup():
 dist.destroy_process_group()
defrun_fsdp_checkpoint_save_example(rank, world_size):
 print(f"Running basic FSDP checkpoint saving example on rank {rank}.")
 setup(rank, world_size)
 # create a model and move it to GPU with id rank
 model = ToyModel().to(rank)
 model = fully_shard(model)
 loss_fn = nn.MSELoss()
 optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
 # The storage writer defines our 'staging' strategy, where staging is considered the process of copying
 # checkpoints to in-memory buffers. By setting `cached_state_dict=True`, we enable efficient memory copying
 # into a persistent buffer with pinned memory enabled.
 # Note: It's important that the writer persists in between checkpointing requests, since it maintains the
 # pinned memory buffer.
 writer = StorageWriter(cache_staged_state_dict=True, path=CHECKPOINT_DIR)
 checkpoint_future = None
 for step in range(10):
 optimizer.zero_grad()
 model(torch.rand(8, 16, device="cuda")).sum().backward()
 optimizer.step()
 state_dict = { "app": AppState(model, optimizer) }
 if checkpoint_future is not None:
 # waits for checkpointing to finish, avoiding queuing more then one checkpoint request at a time
 checkpoint_future.result()
 checkpoint_future = dcp.async_save(state_dict, storage_writer=writer, checkpoint_id=f"{CHECKPOINT_DIR}_step{step}")
 cleanup()
if __name__ == "__main__":
 world_size = torch.cuda.device_count()
 print(f"Running fsdp checkpoint example on {world_size} devices.")
 mp.spawn(
 run_fsdp_checkpoint_save_example,
 args=(world_size,),
 nprocs=world_size,
 join=True,
 )

Conclusion#

In conclusion, we have learned how to use DCP’s async_save() API to generate checkpoints off the critical training path. We’ve also learned about the additional memory and concurrency overhead introduced by using this API, as well as additional optimizations which utilize pinned memory to speed things up even further.

Asynchronous Saving with Distributed Checkpoint (DCP)#

Asynchronous Checkpointing Overview#

Even more performance with Pinned Memory#

Conclusion#

Docs

Tutorials

Resources