
feat(bootstrap,cli): switch GPU injection to CDI where supported#495

Open
elezar wants to merge 3 commits into main from feat/cdi-in-cluster

Conversation

@elezar elezar commented Mar 20, 2026

Summary

Switch GPU device injection in cluster bootstrap to use CDI (Container Device Interface) when it is enabled in Docker, i.e. when the `docker info` endpoint returns a non-empty list of CDI spec directories. When this is not the case, the existing `--gpus all` NVIDIA DeviceRequest path is used as a fallback. The `--gpu` flag on `gateway start` is extended to let users force the legacy injection mode.
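The detection described above can be sketched as a small predicate over the `docker info` payload. This is a minimal sketch, not code from the PR; the helper name `cdi_enabled` is hypothetical, and the `CDISpecDirs` key is the field Docker reports under `SystemInfo` in the `/info` response:

```python
def cdi_enabled(docker_info: dict) -> bool:
    """Return True when the Docker daemon reports at least one CDI spec directory.

    `docker_info` is the parsed JSON from the daemon's /info endpoint;
    older daemons omit the CDISpecDirs key entirely, which also counts
    as "CDI not enabled" and triggers the legacy fallback.
    """
    return bool(docker_info.get("CDISpecDirs"))
```

An empty list, a missing key, and a `null` value all fall through to the legacy `--gpus all` path, so no daemon-version check is needed for this part of the selection.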

Related Issue

Part of #398

Changes

  • feat(bootstrap): Auto-select CDI (driver="cdi", device_ids=["nvidia.com/gpu=all"]) if CDI is enabled on the daemon; fall back to legacy driver="nvidia" on older daemons or when CDI spec dirs are absent
  • feat(cli): --gpu now accepts an optional value: omit for auto-select, --gpu=legacy to force the legacy --gpus all path
  • test(e2e): Extend gateway start help smoke test to cover --gpu and --recreate flags

Testing

  • mise run pre-commit passes
  • Unit tests added/updated (resolve_gpu_device_ids coverage)
  • E2E tests added/updated

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

@elezar elezar force-pushed the feat/cdi-in-cluster branch from b1e6015 to dd2682c on March 20, 2026 08:06

github-actions bot commented Mar 20, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/OpenShell/pr-preview/pr-495/

Built to branch gh-pages at 2026-03-27 11:20 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@elezar elezar force-pushed the feat/cdi-in-cluster branch 2 times, most recently from 808270c to f304997 on March 20, 2026 09:56
@elezar elezar marked this pull request as ready for review March 20, 2026 10:04
@elezar elezar force-pushed the feat/cdi-in-cluster branch from f304997 to 501b0c1 on March 20, 2026 21:29
@elezar elezar requested a review from klueska March 23, 2026 12:15
@elezar elezar force-pushed the feat/cdi-in-cluster branch 2 times, most recently from abee8e5 to 7389f4b on March 25, 2026 13:42
elezar added 3 commits March 27, 2026 12:10
Use an explicit CDI device request (driver="cdi", device_ids=["nvidia.com/gpu=all"])
when the Docker daemon reports CDI spec directories via GET /info (SystemInfo.CDISpecDirs).
This makes device injection declarative and decouples spec generation from consumption.

When the daemon reports no CDI spec directories, fall back to the legacy NVIDIA device
request (driver="nvidia", count=-1) which relies on the NVIDIA Container Runtime hook.
Failure modes for both paths are equivalent: a missing or stale NVIDIA Container Toolkit
installation will cause container start to fail.

CDI spec generation is out of scope for this change; specs are expected to be
pre-generated out-of-band, for example by the NVIDIA Container Toolkit.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
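The two device-request shapes named in the commit message above can be sketched as follows. This is an illustrative sketch only: the function name `gpu_device_request` is hypothetical, and the dicts mirror the wire-level fields of a Docker `DeviceRequest` (as in `docker.types.DeviceRequest` from the Python SDK) rather than quoting the PR's actual types:

```python
def gpu_device_request(use_cdi: bool) -> dict:
    """Build the DeviceRequest payload for GPU injection.

    CDI path:    declarative, resolved from pre-generated CDI specs.
    Legacy path: driver="nvidia" with count=-1 ("all GPUs"), relying on
                 the NVIDIA Container Runtime hook at container start.
    """
    if use_cdi:
        return {"Driver": "cdi", "DeviceIDs": ["nvidia.com/gpu=all"]}
    return {"Driver": "nvidia", "Count": -1}
```

As the commit message notes, both shapes fail the same way at container start if the NVIDIA Container Toolkit is missing or stale, so the selection only affects how devices are requested, not the failure mode.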
The --gpu flag on `gateway start` now accepts an optional value:

  --gpu           Auto-select: CDI on Docker >= 28.2.0, legacy otherwise
  --gpu=legacy    Force the legacy nvidia DeviceRequest (driver="nvidia")

Internally, the gpu bool parameter to ensure_container is replaced with
a device_ids slice. resolve_gpu_device_ids resolves the "auto" sentinel
to a concrete device ID list based on the Docker daemon version, keeping
the resolution logic in one place at deploy time.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
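The resolution step described above can be sketched as a pure function of the flag value and the daemon version. The signature and return convention here are assumptions for illustration (the commit message only names `resolve_gpu_device_ids` and the 28.2.0 threshold); an empty list stands in for "use the legacy `driver=\"nvidia\"` request":

```python
def resolve_gpu_device_ids(mode: str, daemon_version: str) -> list[str]:
    """Resolve the "auto" sentinel to a concrete CDI device-ID list.

    mode: "auto" (pick CDI on Docker >= 28.2.0) or "legacy" (force the
          legacy nvidia DeviceRequest). Returning [] signals the legacy path.
    daemon_version: the Docker daemon version string, e.g. "28.2.0".
    """
    if mode == "legacy":
        return []
    major, minor, *_ = (int(part) for part in daemon_version.split("."))
    if (major, minor) >= (28, 2):
        return ["nvidia.com/gpu=all"]
    return []
```

Keeping this in one function at deploy time, as the commit message describes, means `ensure_container` only ever sees a concrete device-ID list and never the `"auto"` sentinel.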
Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar elezar force-pushed the feat/cdi-in-cluster branch from 7389f4b to c07c0f8 on March 27, 2026 11:19
@elezar elezar requested a review from klueska March 27, 2026 11:25

3 participants