[infra] h200-nb_0 / h200-nb_1: enroot /mnt/image-storage full — can't unpack TRT-LLM v1.3.0rc14 image #1496

Open

Description

Opened on May 18, 2026

Summary

Two H200 self-hosted runners — h200-nb_0 and h200-nb_1 — are unable to unpack newer TensorRT-LLM container images because their /mnt/image-storage/enroot/data/ partition is full. Every sweep job that lands on either runner fails identically:

enroot-mount: failed to create directory:
 /mnt/image-storage/enroot/data/pyxis_nvcr.io_nvidia_tensorrt-llm_release_1.3.0rc14-gharunner/var/run:
 No space left on device

...and pyxis: failed to create container filesystem during the squashfs extraction step. The benchmark script never runs.

As a temporary workaround, the h200 SLURM partition tag has been removed from these two nodes, so they currently can't pick up jobs at all. They should be put back into service once the disk is freed.
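"No space left on device" (ENOSPC) is returned both when a filesystem runs out of data blocks and when it runs out of inodes, so it may be worth checking both before deciding how much to prune. A minimal sketch, assuming GNU `df`; `fs_usage` is a hypothetical helper name, not something that exists on the runners:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<blocks-used%> <inodes-used%>" for the
# filesystem that holds the given path.
fs_usage() {
    local path="$1"
    local blocks
    local inodes
    # df -P: POSIX one-line-per-filesystem output; column 5 is the use percentage.
    blocks=$(df -P "$path" | awk 'NR == 2 { print $5 }')
    # df -i: same layout, but counting inodes instead of blocks.
    # Some filesystems do not report inode counts and show "-" here.
    inodes=$(df -P -i "$path" | awk 'NR == 2 { print $5 }')
    printf '%s %s\n' "$blocks" "$inodes"
}
```

On an affected node this could be run as `fs_usage /mnt/image-storage/enroot/data`. If the first value is well below 100% while the second is at 100%, the partition is out of inodes rather than blocks.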

How we got here

The recently-tagged nvcr.io/nvidia/tensorrt-llm/release:v1.3.0rc14 image is significantly larger than v1.1.0rc2.post2 (it bundles Python 3.12 + new CUDA libs). Old enroot caches haven't been pruned, so /mnt/image-storage filled up trying to extract the new image.

Confirmed on these PRs (where every disk-space failure landed on h200-nb_* and every success landed on h200-dgxc-slurm_* or h200-cw_01):

#1491 — gptoss-fp4-h200-trt v1.3.0rc11 → v1.3.0rc14 (11/12 failures = disk; 1 = stale port-8888 leak on h200-cw_01)
#1487 — dsr1-fp8-h200-trt (+mtp) v1.1.0rc2.post2 → v1.3.0rc14 (12/12 failures = disk)

Failed CI runs:

Suggested SRE fix

Any of:

Prune the enroot image cache on both nodes:

ssh root@h200-nb_0 # and h200-nb_1
enroot list # see what's still there
enroot remove --force '*' # nuclear option, drops everything
# OR: rm -rf /mnt/image-storage/enroot/data/<stale-pyxis-dirs>
df -h /mnt/image-storage
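As a gentler alternative to the nuclear option, the container directories can be ranked by size first so that only the stale ones are removed by hand. A minimal sketch, assuming GNU `find`, `xargs`, and `du`; `list_enroot_dirs_by_size` is a hypothetical helper name, and the default path is the one from the error message above:

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the directories directly under an enroot data
# path with their size in KiB, largest first.
list_enroot_dirs_by_size() {
    local data_dir="${1:-/mnt/image-storage/enroot/data}"
    # -mindepth 1 / -maxdepth 1: only direct children, never the data dir itself.
    # du -sk: one total per directory, in KiB, so sort -rn can order them.
    find "$data_dir" -mindepth 1 -maxdepth 1 -type d -print0 \
        | xargs -0 -r du -sk \
        | sort -rn
}
```

Calling `list_enroot_dirs_by_size | head` on each node would show the ten largest unpacked images, which are the likely candidates for removal.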

Expand /mnt/image-storage on these nodes — they're chronically near-full since the v1.1 → v1.3 image sizes diverged.
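Add a periodic cron job or systemd timer that prunes pyxis directories older than ~7 days (e.g. find /mnt/image-storage/enroot/data -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} +). The -mindepth 1 keeps find from ever matching, and deleting, the data directory itself.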
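The periodic prune could be wrapped in a small script that defaults to a dry run, so the list of directories can be reviewed before anything is deleted. A minimal sketch, assuming GNU `find`; `prune_stale_pyxis_dirs` and its arguments are hypothetical, and the `pyxis_*` name pattern is inferred from the directory name in the error message above:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (default) or delete pyxis container directories
# whose mtime is older than the given number of days.
# Usage: prune_stale_pyxis_dirs <data_dir> [max_age_days] [dry-run|delete]
prune_stale_pyxis_dirs() {
    local data_dir="$1"
    local max_age_days="${2:-7}"
    local mode="${3:-dry-run}"
    # A directory's mtime changes only when entries directly inside it are
    # created, removed, or renamed, so it is a rough proxy for "last used".
    # -mindepth 1 ensures the data directory itself is never matched.
    if [ "$mode" = "delete" ]; then
        find "$data_dir" -mindepth 1 -maxdepth 1 -type d -name 'pyxis_*' \
            -mtime "+$max_age_days" -exec rm -rf -- {} +
    else
        find "$data_dir" -mindepth 1 -maxdepth 1 -type d -name 'pyxis_*' \
            -mtime "+$max_age_days" -print
    fi
}
```

For example, `prune_stale_pyxis_dirs /mnt/image-storage/enroot/data 7` only lists the candidates, and adding `delete` as the third argument removes them. A cron entry or systemd timer would call the `delete` form once the dry-run output looks right.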

Once any of those is done, re-add the h200 SLURM partition tag and the affected PRs can be re-swept.

Bonus issue on `h200-cw_01`

A separate one-off failure on h200-cw_01 during the same #1491 sweep was an "Address already in use" error on port 8888, left over from a previous trtllm-serve process that didn't shut down cleanly. Worth a kill pass on that node too.
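A pre-flight check for a leaked listener could look like the following. This is a minimal sketch, assuming a Linux node where `/proc/net/tcp` is readable; `port_is_listening` is a hypothetical helper name. It only detects the stale listener and does not kill anything:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) if any socket in the given
# /proc/net/tcp-style tables is in LISTEN state on the given TCP port.
# Usage: port_is_listening <port> <table-file>...
port_is_listening() {
    local port="$1"
    shift
    local hex_port
    # The kernel prints ports as 4 uppercase hex digits, e.g. 8888 -> 22B8.
    hex_port=$(printf '%04X' "$port")
    # Columns: sl, local_address, rem_address, st, ...; st == 0A means LISTEN.
    # local_address ends in ":<hex port>" for both IPv4 and IPv6 tables.
    awk -v p=":${hex_port}" '
        $4 == "0A" && substr($2, length($2) - 4) == p { found = 1 }
        END { exit found ? 0 : 1 }
    ' "$@"
}
```

A job wrapper could call `port_is_listening 8888 /proc/net/tcp` before starting the server and fail early with a clear message. Identifying the owning process is a separate step, for example with `ss -ltnp 'sport = :8888'` run as root.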

cc @sre / @platform

Metadata

Assignees: No one assigned
Labels: No labels
Type: No type
Projects: InferenceMAX Board (Status: No status)
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests
