feat(bootstrap,cli): switch GPU injection to CDI where supported#495
Open
klueska
reviewed
Mar 20, 2026
This was referenced Mar 25, 2026
klueska
reviewed
Mar 26, 2026
Use an explicit CDI device request (driver="cdi", device_ids=["nvidia.com/gpu=all"]) when the Docker daemon reports CDI spec directories via GET /info (SystemInfo.CDISpecDirs). This makes device injection declarative and decouples spec generation from consumption.

When the daemon reports no CDI spec directories, fall back to the legacy NVIDIA device request (driver="nvidia", count=-1), which relies on the NVIDIA Container Runtime hook.

Failure modes for both paths are equivalent: a missing or stale NVIDIA Container Toolkit installation will cause container start to fail.

CDI spec generation is out of scope for this change; specs are expected to be pre-generated out-of-band, for example by the NVIDIA Container Toolkit.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
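The selection rule in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `GpuDeviceRequest` and `select_gpu_device_request` are hypothetical names, and the struct is a simplified stand-in for the Docker API device request. Only the driver names, the device ID, and `count=-1` come from the commit message.

```rust
/// Simplified stand-in for a Docker device request.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub struct GpuDeviceRequest {
    pub driver: String,
    pub count: Option<i64>,
    pub device_ids: Vec<String>,
}

/// Pick the injection path from the CDI spec directories reported by the
/// daemon (the `CDISpecDirs` field of `GET /info`).
/// Hypothetical helper; the name and signature are assumptions.
pub fn select_gpu_device_request(cdi_spec_dirs: &[String]) -> GpuDeviceRequest {
    if cdi_spec_dirs.is_empty() {
        // No CDI spec directories reported: legacy NVIDIA request,
        // which relies on the NVIDIA Container Runtime hook.
        GpuDeviceRequest {
            driver: "nvidia".to_string(),
            count: Some(-1),
            device_ids: Vec::new(),
        }
    } else {
        // CDI is enabled: request all GPUs by their CDI device name.
        GpuDeviceRequest {
            driver: "cdi".to_string(),
            count: None,
            device_ids: vec!["nvidia.com/gpu=all".to_string()],
        }
    }
}
```

Keeping the decision in one pure function of the reported directories makes both branches easy to unit test without a running daemon.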
The --gpu flag on `gateway start` now accepts an optional value:

  --gpu           Auto-select: CDI on Docker >= 28.2.0, legacy otherwise
  --gpu=legacy    Force the legacy nvidia DeviceRequest (driver="nvidia")

Internally, the gpu bool parameter to ensure_container is replaced with a device_ids slice. resolve_gpu_device_ids resolves the "auto" sentinel to a concrete device ID list based on the Docker daemon version, keeping the resolution logic in one place at deploy time.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
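A rough sketch of how an "auto" sentinel might be resolved against the daemon version is shown below. It is not the PR's `resolve_gpu_device_ids`: the enum `GpuInjection`, the function names, the error type, and the choice to fall back to legacy when the version string cannot be parsed are all assumptions made for illustration. Only the 28.2.0 threshold, the "auto" and "legacy" values, and the CDI device ID come from the commit message.

```rust
/// Outcome of resolving the requested GPU mode (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub enum GpuInjection {
    /// Inject via CDI using these device IDs.
    Cdi { device_ids: Vec<String> },
    /// Use the legacy driver="nvidia" device request.
    Legacy,
}

/// Parse the leading run of ASCII digits in `s`, if any.
fn leading_number(s: &str) -> Option<u64> {
    let digits: String = s.chars().take_while(|c| c.is_ascii_digit()).collect();
    digits.parse().ok()
}

/// Parse "MAJOR.MINOR[.PATCH][suffix]" into a comparable tuple.
/// A missing or non-numeric patch component is treated as 0.
pub fn parse_version(v: &str) -> Option<(u64, u64, u64)> {
    let mut parts = v.trim().trim_start_matches('v').splitn(3, '.');
    let major = leading_number(parts.next()?)?;
    let minor = leading_number(parts.next()?)?;
    let patch = parts.next().and_then(leading_number).unwrap_or(0);
    Some((major, minor, patch))
}

/// Resolve "auto" or "legacy" to a concrete injection mode.
/// Assumption: an unparsable version falls back to the legacy path.
pub fn resolve_gpu_injection(
    requested: &str,
    docker_version: &str,
) -> Result<GpuInjection, String> {
    match requested {
        "legacy" => Ok(GpuInjection::Legacy),
        "auto" => match parse_version(docker_version) {
            // Tuples compare lexicographically: major, then minor, then patch.
            Some(v) if v >= (28, 2, 0) => Ok(GpuInjection::Cdi {
                device_ids: vec!["nvidia.com/gpu=all".to_string()],
            }),
            _ => Ok(GpuInjection::Legacy),
        },
        other => Err(format!("unsupported --gpu value: {other}")),
    }
}
```

Comparing versions as integer tuples avoids the pitfalls of string comparison, where "28.10.0" would sort before "28.2.0".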
Signed-off-by: Evan Lezar <elezar@nvidia.com>
johntmyers
approved these changes
Mar 30, 2026
Summary
Switch GPU device injection in cluster bootstrap to use CDI (Container Device Interface) when it is enabled in Docker, that is, when the `docker info` endpoint returns a non-empty list of CDI spec directories. Otherwise, the existing `--gpus all` NVIDIA `DeviceRequest` path is used as a fallback. The `--gpu` flag on `gateway start` is extended to let users force the legacy injection mode.
Related Issue
Part of #398
Changes
- feat(bootstrap): Auto-select CDI (`driver="cdi"`, `device_ids=["nvidia.com/gpu=all"]`) if CDI is enabled on the daemon; fall back to legacy `driver="nvidia"` on older daemons or when CDI spec dirs are absent
- feat(cli): `--gpu` now accepts an optional value: omit for auto-select, `--gpu=legacy` to force the legacy `--gpus all` path
- test(e2e): Extend gateway start help smoke test to cover `--gpu` and `--recreate` flags
Testing
- `mise run pre-commit` passes
- `resolve_gpu_device_ids` coverage
Checklist