HAMi-core: Hook library for CUDA Environments

English | 中文

Introduction

HAMi-core is an in-container GPU resource controller. It has been adopted by HAMi and Volcano.

Features

HAMi-core has the following features:

  1. Virtualize device memory
  2. Limit device utilization with a self-implemented time-slicing mechanism
  3. Monitor device utilization in real time


Design

HAMi-core operates by hijacking the API calls between the CUDA Runtime (`libcudart.so`) and the CUDA Driver (`libcuda.so`).
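
Because the interception happens at the dynamic-linker level via `LD_PRELOAD`, you can observe the mechanism from the shell. A minimal sketch, assuming the library has been built to `./libvgpu.so` (the exact exported symbols and the placeholder `./your_cuda_app` are illustrative):

```bash
# List CUDA driver entry points exported by the hook library; under
# LD_PRELOAD these resolve before the same symbols in the real libcuda.so
# (illustrative check; symbol names depend on the build)
nm -D ./libvgpu.so | grep -E 'cuMemAlloc|cuLaunchKernel'

# Optionally, ask the loader to show which library wins symbol binding
# for a CUDA binary (./your_cuda_app is a hypothetical placeholder)
LD_PRELOAD=./libvgpu.so LD_DEBUG=bindings ./your_cuda_app 2>&1 | grep libvgpu | head
```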

Build in Docker

```bash
make build-in-docker
```
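
A quick sanity check on the result, assuming the library lands at `./build/libvgpu.so` (the path used by the Docker example below; adjust if your build tree differs):

```bash
# Confirm the hook library was produced and is a shared object
file ./build/libvgpu.so
```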

Usage

`CUDA_DEVICE_MEMORY_LIMIT` sets the upper limit of device memory (e.g. `1g`, `1024m`, `1048576k`, `1073741824`)

`CUDA_DEVICE_SM_LIMIT` sets the SM utilization percentage allowed on each device

```bash
# Add a 1 GiB memory limit and cap SM utilization at 50% for all devices
export LD_PRELOAD=./libvgpu.so
export CUDA_DEVICE_MEMORY_LIMIT=1g
export CUDA_DEVICE_SM_LIMIT=50
```
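
The same limits can also be applied to a single run without exporting anything; `./your_cuda_app` below is a placeholder for any CUDA binary:

```bash
# One-off run with a 1 GiB memory cap and a 50% SM utilization cap
LD_PRELOAD=./libvgpu.so \
CUDA_DEVICE_MEMORY_LIMIT=1g \
CUDA_DEVICE_SM_LIMIT=50 \
./your_cuda_app
```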

If you run CUDA applications locally, create the lock directory first:

```bash
mkdir /tmp/vgpulock/
```

If you have updated `CUDA_DEVICE_MEMORY_LIMIT` or `CUDA_DEVICE_SM_LIMIT`, delete the local cache file so the new limits take effect:

```bash
rm /tmp/cudevshr.cache
```
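
Putting the steps together, changing limits between runs typically looks like this (a sketch; `./your_cuda_app` remains a placeholder):

```bash
# Drop the cached limits, then relaunch with new values
rm -f /tmp/cudevshr.cache
export CUDA_DEVICE_MEMORY_LIMIT=2g
export CUDA_DEVICE_SM_LIMIT=30
LD_PRELOAD=./libvgpu.so ./your_cuda_app
```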


## Docker Images
```bash
# Build docker image
docker build . -f=dockerfiles/Dockerfile -t cuda_vmem:tf1.8-cu90
# Configure GPU device and library mounts for container
export DEVICE_MOUNTS="--device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidiactl:/dev/nvidiactl"
export LIBRARY_MOUNTS="-v /usr/cuda_files:/usr/cuda_files -v $(which nvidia-smi):/bin/nvidia-smi"
# Run container and check nvidia-smi output
docker run ${LIBRARY_MOUNTS} ${DEVICE_MOUNTS} -it \
 -e CUDA_DEVICE_MEMORY_LIMIT=2g \
 -e LD_PRELOAD=/libvgpu/build/libvgpu.so \
 cuda_vmem:tf1.8-cu90 \
 nvidia-smi
```

After running, you will see nvidia-smi output similar to the following, showing memory limited to 2GiB:

```
...
[HAMI-core Msg(1:140235494377280:libvgpu.c:836)]: Initializing.....
Mon Dec  2 04:38:12 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.107.02             Driver Version: 550.107.02     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        Off |   00000000:03:00.0 Off |                  N/A |
| 30%   36C    P8              7W /  170W |       0MiB /  2048MiB  |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
[HAMI-core Msg(1:140235494377280:multiprocess_memory_limit.c:497)]: Calling exit handler 1
```

Log

Use the environment variable `LIBCUDA_LOG_LEVEL` to control the verbosity of the logs:

| LIBCUDA_LOG_LEVEL | Description                        |
|-------------------|------------------------------------|
| 0                 | errors only                        |
| 1 (default), 2    | errors, warnings, messages         |
| 3                 | infos, errors, warnings, messages  |
| 4                 | debugs, errors, warnings, messages |
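
For example, to enable debug-level logging for a single run:

```bash
# Raise verbosity to debug (level 4) for one invocation
# (./your_cuda_app is a placeholder for any CUDA binary)
LIBCUDA_LOG_LEVEL=4 LD_PRELOAD=./libvgpu.so ./your_cuda_app
```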

Test Raw APIs

```bash
./test/test_alloc
```
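
To observe enforcement, the test can also be run under a limit (a sketch; the exact failure behavior depends on the test's allocation pattern):

```bash
# Run the raw-API allocation test with a 1 GiB cap; allocations beyond
# the cap should fail once the limit is hit (illustrative)
LD_PRELOAD=./libvgpu.so CUDA_DEVICE_MEMORY_LIMIT=1g ./test/test_alloc
```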
