[infra] h200-nb_0 / h200-nb_1: enroot /mnt/image-storage full — can't unpack TRT-LLM v1.3.0rc14 image #1496
Summary
Two H200 self-hosted runners — h200-nb_0 and h200-nb_1 — are unable to unpack newer TensorRT-LLM container images because their /mnt/image-storage/enroot/data/ partition is full. Every sweep job that lands on either runner fails identically:
enroot-mount: failed to create directory:
/mnt/image-storage/enroot/data/pyxis_nvcr.io_nvidia_tensorrt-llm_release_1.3.0rc14-gharunner/var/run:
No space left on device
...and pyxis: failed to create container filesystem during the squashfs extraction step. The benchmark script never runs.
It's been temporarily worked around by removing the h200 SLURM partition tag from these two nodes — they currently can't pick up jobs at all — but they should be put back into service once the disk is freed.
How we got here
The recently-tagged nvcr.io/nvidia/tensorrt-llm/release:v1.3.0rc14 image is significantly larger than v1.1.0rc2.post2 (it bundles Python 3.12 + new CUDA libs). Old enroot caches haven't been pruned, so /mnt/image-storage filled up trying to extract the new image.
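To see which unpacked images are taking the space before pruning anything, a listing along these lines may help. It is a minimal sketch: the function name `largest_rootfs` is hypothetical, and the default path is taken from the error message above. It only reads the disk and deletes nothing.

```shell
#!/usr/bin/env bash
# Sketch: print the largest top-level directories under the enroot data path.
# largest_rootfs is a hypothetical helper name; the default path is the one
# that appears in the enroot-mount error in this issue.
largest_rootfs() {
    local data_dir="${1:-/mnt/image-storage/enroot/data}"
    local top_n="${2:-10}"

    if [ ! -d "$data_dir" ]; then
        echo "no such directory: $data_dir" >&2
        return 1
    fi

    # du -sk prints "<size in KiB><TAB><path>" for each directory;
    # sort -rn puts the biggest first, head keeps the top N.
    find "$data_dir" -mindepth 1 -maxdepth 1 -type d -exec du -sk -- {} + \
        | sort -rn | head -n "$top_n"
}
```

Running `largest_rootfs /mnt/image-storage/enroot/data 5` on an affected node would list the five largest unpacked container filesystems, which should show whether old v1.1-era directories are still present.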
Confirmed on these PRs (where every failure landed on h200-nb_* and every success landed on h200-dgxc-slurm_* or h200-cw_01):
- #1491: gptoss-fp4-h200-trt v1.3.0rc11 → v1.3.0rc14 (11/12 failures = disk; 1 = stale port-8888 leak on h200-cw_01)
- #1487: dsr1-fp8-h200-trt(+mtp) v1.1.0rc2.post2 → v1.3.0rc14 (12/12 failures = disk)
Failed CI runs:
- https://github.com/SemiAnalysisAI/InferenceX/actions/runs/26016892349
- https://github.com/SemiAnalysisAI/InferenceX/actions/runs/26016868638
Suggested SRE fix
Any of:
- Prune the enroot image cache on both nodes:
  ssh root@h200-nb_0            # and h200-nb_1
  enroot list                   # see what's still there
  enroot remove --force '*'     # nuclear option, drops everything
  # OR: rm -rf /mnt/image-storage/enroot/data/<stale-pyxis-dirs>
  df -h /mnt/image-storage
- Expand /mnt/image-storage on these nodes. They have been chronically near-full since the v1.1 → v1.3 image sizes diverged.
- Add a periodic cron job or systemd timer that prunes pyxis directories older than ~7 days, e.g. find /mnt/image-storage/enroot/data -mindepth 1 -maxdepth 1 -type d -name 'pyxis_*' -mtime +7 -exec rm -rf {} + (the -mindepth 1 keeps find from ever matching the data directory itself).
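The periodic-prune option could be wrapped in a small script that defaults to a dry run, so the candidate list can be reviewed before anything is removed. This is a sketch under assumptions: the function name `prune_stale_pyxis` and its `--delete` flag are hypothetical, and the `pyxis_*` name pattern is inferred from the directory name in the error message. Note that directory mtime reflects the last change to the directory's top-level entries, not the last time a job used the container, so a long-running job's rootfs could in principle match.

```shell
#!/usr/bin/env bash
# Sketch: list stale pyxis rootfs directories, or remove them with --delete.
# prune_stale_pyxis and the --delete flag are hypothetical names for this example.
prune_stale_pyxis() {
    local data_dir="${1:-/mnt/image-storage/enroot/data}"
    local max_age_days="${2:-7}"
    local mode="${3:-}"

    if [ ! -d "$data_dir" ]; then
        echo "no such directory: $data_dir" >&2
        return 1
    fi

    # -mindepth 1 ensures the data directory itself is never a match;
    # -maxdepth 1 limits matching to its immediate children.
    if [ "$mode" = "--delete" ]; then
        find "$data_dir" -mindepth 1 -maxdepth 1 -type d -name 'pyxis_*' \
            -mtime +"$max_age_days" -exec rm -rf -- {} +
    else
        # Dry run: only print what would be removed.
        find "$data_dir" -mindepth 1 -maxdepth 1 -type d -name 'pyxis_*' \
            -mtime +"$max_age_days" -print
    fi
}
```

A cautious rollout would call `prune_stale_pyxis /mnt/image-storage/enroot/data 7` by hand first, check the printed paths, and only then schedule the `--delete` form from cron or a systemd timer.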
Once any of those is done, re-add the h200 SLURM partition tag and the affected PRs can be re-swept.
Bonus issue on h200-cw_01
A separate one-off failure on h200-cw_01 during the same #1491 sweep was an "Address already in use" error on port 8888, left over from a previous trtllm-serve process that didn't shut down cleanly. Worth a kill pass on that node too.
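For the kill pass, one way to find whatever is still holding port 8888 is to filter the listening-socket table. The helper below is a sketch: `pids_on_port` is a hypothetical name, and it assumes the column layout printed by `ss -ltnpH` (local address in the fourth column, owner shown as `pid=<n>`). It only prints PIDs and kills nothing.

```shell
#!/usr/bin/env bash
# Sketch: read `ss -ltnpH` output on stdin and print the PIDs listening on a port.
# pids_on_port is a hypothetical helper name for this example.
pids_on_port() {
    local port="$1"
    # Column 4 is the local address:port; match an exact ":<port>" suffix
    # so that e.g. :18888 does not match when looking for :8888.
    awk -v port="$port" '$4 ~ (":" port "$")' \
        | grep -o 'pid=[0-9]*' \
        | cut -d= -f2 \
        | sort -un
}
```

On the node this would be used as `ss -ltnpH | pids_on_port 8888` (run as root so the owning process is visible), and each PID should be checked with `ps -p <pid> -o args=` before it is killed.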