[pull] master from NVIDIA:master by pull[bot] · Pull Request #80 · jolorunyomi/deepops

pull · 2021-12-04T02:03:06Z

See Commits and Changes for more details.

Created by pull[bot]


Add more tests to cover containerd/driver container k8s deployments

Update role nvidia.nvidia_driver to v2.1.0

Update Deep Learning Examples automation to handle containerd

Update jenkins matrix test to support driver container vs host driver

Add docker vs containerd to manual jenkins options

Hybrid clusters which run both Kubernetes and Slurm are not currently well-supported or tested in DeepOps. NVIDIA's preferred solution for this type of cluster is currently NVIDIA Bright Cluster Manager, which provides a more robust solution to managing multiple workload managers within the same cluster, and additionally supports UGE and PBS.

This PR makes the following changes:

- Clarify support in `README.md`
- Remove hybrid references in `docs/README.md`
- Completely remove `docs/deepops/dgx-pod.md`, which is also out of date in several respects (such as specifying a Ceph storage system)

A good future project would be to rewrite the DGX POD doc based on an up-to-date process with the latest DeepOps release.

Removed a debug variable.

- Keep ansible==4.8.0 for lint job (ansible-lint 5.4.0 is incompatible with ansible-core 2.16); use Python 3.10 for compatibility
- Use molecule-plugins[docker] instead of molecule[docker] (driver moved to separate package in newer molecule versions)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>

The packaging module is used for version comparisons but was not installed until after those comparisons ran. This caused ImportError when ansible was already installed in the venv. Install packaging immediately after pip upgrade, before the version check block.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>

- Remove deprecated apt_key tasks from nvidia_cuda and nvidia_dcgm (cuda-keyring .deb package supersedes old GPG key management)
- Replace action: keyword with proper module syntax in easy-build
- Replace inline key=value module args with YAML dict syntax in easy-build and kerberos_client
- Widen kerberos_client version checks for RHEL 8+ and Ubuntu 20+

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>

Remove dead code paths for EOL platforms (CentOS 7 EOL Jun 2024, Ubuntu 18.04 EOL Apr 2023).

Changes:

- setup.sh: Remove DEPS_EL7, simplify RHEL package install
- slurm: Remove CentOS 7 yum tasks, widen RHEL 8 dnf conditions
- lmod: Remove CentOS 7 yum task and Ubuntu 18.04 posix_c bugfix
- nfs: Remove RHEL 7 libsemanage-python task
- kerberos_client: Consolidate to single RHEL and Ubuntu task/vars
- openshift: Remove python2-openshift CentOS 7 task
- ood-wrapper: Update singularity image from 18.04 to 22.04
- molecule configs: Remove 1804/centos-7, add ubuntu-2204 platforms
- config.example: Update NGC container tags to current versions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>

- Ansible: 9.13.0 -> 10.7.0 (ansible-core 2.16 -> 2.17)
- ansible-lint: 5.4.0 -> 26.1.1 (now compatible with Ansible 10.x)
- kubespray: v2.27.0+88 -> v2.30.0 (latest stable)
- jmespath: 1.0.1 -> 1.1.0
- ansible.posix: 1.5.4 -> 2.1.0
- community.general: 7.2.0 -> 12.3.0
- community.docker: 3.10.2 -> 5.0.6
- nvidia.nvidia_driver: v2.3.0 -> v2.3.1
- dev-sec.ssh-hardening: 9.7.0 -> 10.5.0
- geerlingguy.ntp: 2.3.2 -> 4.0.0
- gantsign.golang: 3.1.6 -> 3.5.0

Also fixes:

- docker.yml: Update kubespray defaults path (main.yml -> main/main.yml)
- docker.yml, k8s-cluster.yml: Remove CentOS 7 docker repo overrides
- CI: Remove ansible-lint/ansible 4.8.0 version workaround

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>

- ansible.cfg: Replace removed community.general.yaml callback with ansible.builtin.default + result_format=yaml
- requirements.yml: Migrate dev-sec.ssh-hardening role to devsec.hardening collection (standalone role repo stopped at 9.7.0, 10.x+ is collection-only)
- playbooks: Update include_role references from dev-sec.ssh-hardening to devsec.hardening.ssh_hardening (FQCN)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>

kubespray v2.30.0 renamed kubespray-defaults to kubespray_defaults (underscore) and removed the defaults/ dir from the old location. Update vars_files path and role reference in docker.yml accordingly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>

Modern Ubuntu (22.04+) enforces PEP 668 'externally-managed-environment', which blocks system-wide pip installs. Replace pip: name=docker with package: name=python3-docker across all roles that need the Docker Python SDK. Also removes dead Python 2 code paths.

Affected roles: standalone-container-registry, docker-login, prometheus, alertmanager, nginx-docker-registry-cache

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
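To illustrate the PEP 668 condition described above, here is a minimal sketch of how a script could detect an externally managed interpreter. PEP 668 defines the marker as a file named `EXTERNALLY-MANAGED` in the interpreter's stdlib directory. The function name `is_externally_managed` and its `stdlib_dir` parameter are hypothetical and are not taken from DeepOps.

```python
import sysconfig
from pathlib import Path


def is_externally_managed(stdlib_dir=None):
    """Return True if the PEP 668 marker file exists in the stdlib directory.

    `stdlib_dir` is a hypothetical override used for testing; by default the
    running interpreter's stdlib path is inspected.
    """
    if stdlib_dir is None:
        stdlib_dir = sysconfig.get_path("stdlib")
    return (Path(stdlib_dir) / "EXTERNALLY-MANAGED").is_file()
```

When this returns True, a system-wide `pip install` is expected to be refused, which is why an OS package such as python3-docker is the safer way to provide the dependency.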

The passlib module is required by Ansible's password_hash filter used in the users playbook. Without it, password hashing fails with 'No module named passlib' on modern systems.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>

Adds a Python script that queries the MAAS REST API to auto-discover deployed machines and map MAAS tags to Ansible groups. Tag a VM with slurm-master in MAAS and it appears in the [slurm-master] group.

- scripts/maas_inventory.py: inventory script (Python stdlib only)
- config.example/maas-inventory.yml: example configuration
- docs/pxe/maas.md: added Dynamic Inventory section

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
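The tag-to-group mapping can be sketched as follows. This is not the code in scripts/maas_inventory.py; it is a simplified illustration that assumes each machine has already been fetched and reduced to a dict with the hypothetical keys `hostname`, `ip`, and `tags`. The output follows the JSON shape Ansible expects from a dynamic inventory script (groups with a `hosts` list, plus `_meta.hostvars`).

```python
def build_inventory(machines):
    """Map each machine's tags to Ansible groups.

    `machines` is a list of dicts with hypothetical keys:
    hostname (str), ip (str), tags (list of str).
    """
    inventory = {"_meta": {"hostvars": {}}}
    for machine in machines:
        host = machine["hostname"]
        # Per-host variables live under _meta so Ansible needs no extra calls.
        inventory["_meta"]["hostvars"][host] = {"ansible_host": machine["ip"]}
        for tag in machine["tags"]:
            group = inventory.setdefault(tag, {"hosts": []})
            if host not in group["hosts"]:
                group["hosts"].append(host)
    return inventory
```

Passing the result through `json.dumps` would give the text an inventory script prints for `--list`. A machine tagged slurm-master ends up in the `slurm-master` group, matching the behavior described above.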

… CodeQL alert

- Remove hardcoded StrictHostKeyChecking=no from SSH args (security)
- Return empty dict for --host since _meta provides all hostvars (perf)
- Remove URL from error message to avoid CodeQL sensitive data alert

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
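The `--host` change relies on a documented property of Ansible inventory scripts: when the `--list` output contains `_meta.hostvars`, Ansible does not need to call the script once per host. A minimal sketch of that argument handling follows; the function name `render` is hypothetical and the real script may be structured differently.

```python
import argparse
import json


def render(argv, inventory):
    """Return the JSON text an inventory script would print for `argv`."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--list", action="store_true")
    parser.add_argument("--host")
    args = parser.parse_args(argv)
    if args.host is not None:
        # All hostvars are already published under _meta, so nothing to add.
        return json.dumps({})
    return json.dumps(inventory)
```

For example, `render(["--host", "vm1"], inventory)` yields `{}` regardless of the inventory contents.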

etcd is a separate group in the static inventory, not a child of kube-master. Users should tag etcd nodes explicitly rather than having all kube-master nodes implicitly join etcd.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>

…C entry

- Update pre-requisites from Ubuntu 18.04 to 22.04/24.04
- Note MAAS 2.8 is original version, current is 3.x
- Update maas_repo PPA from 2.8 to 3.5
- Add Dynamic Inventory section to table of contents

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>

kubespray v2.30 requires underscored group names:

- kube-master -> kube_control_plane
- kube-node -> kube_node
- k8s-cluster -> k8s_cluster

Updated inventory templates, group_vars filename, group_vars content, and all playbook references. Directory paths (playbooks/k8s-cluster/) are unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
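The three renames can be expressed as a small lookup table. The sketch below is illustrative only: `GROUP_ALIASES` and `normalize_groups` are hypothetical names, and the order-preserving de-duplication is an assumption about how one might handle a host that carries both the old and the new name.

```python
# Legacy hyphenated group names and their kubespray v2.30 replacements.
GROUP_ALIASES = {
    "kube-master": "kube_control_plane",
    "kube-node": "kube_node",
    "k8s-cluster": "k8s_cluster",
}


def normalize_groups(names):
    """Translate legacy names, keep input order, and drop duplicates."""
    result = []
    for name in names:
        current = GROUP_ALIASES.get(name, name)
        if current not in result:
            result.append(current)
    return result
```

Names that are not in the table, such as `etcd`, pass through unchanged.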

The 'native' snapshotter was a workaround for old cri-tools issues (#436, #710) that are long resolved. It causes 'no unpack platforms defined' errors with containerd v2.x. Switch to 'overlayfs', which is kubespray's default and works correctly on ext4/xfs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>

- Add project-level .ansible-lint with profile:min and skip_list for pre-existing issues (fqcn, name casing, truthy, octal, etc.)
- Rewrite lint script to run from project root using project config
- Remove per-role .ansible-lint files (conflicted with v26 syntax)
- Molecule: drop Ubuntu 20.04 platforms (EOL), keep 22.04 only
- Molecule: use cgroupns_mode:host, remove command:/sbin/init and tmpfs that caused systemd temp dir failures on cgroup v2 hosts
- Molecule: add privileged:true where missing, remove max-parallel limit, set fail-fast:false, upgrade runner to ubuntu-24.04
- Add ANSIBLE_ROLES_PATH and passlib to molecule workflow

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>

- spack: Replace gcc-7/gfortran-7 with unversioned gcc/gfortran
- Remove abims_sbr.singularity from requirements.yml (dead project)
- Molecule CI: Remove 5 roles that can't run in Docker containers: nis_client, rsyslog_client, rsyslog_server, slurm (need systemd services), singularity_wrapper (broken upstream Galaxy dep). These are all verified end-to-end on real MAAS VMs.
- Remaining 11 molecule roles all pass in CI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>

Modernize for Ansible 10.x, Ubuntu 24.04, kubespray v2.30

Add MAAS dynamic inventory script

Add scripts/maas_deploy.sh with --os, --profile, --status, --release, and --tags-only flags for repeatable VM deploy/tag/test/redeploy cycles. Reads config from config/maas-inventory.yml (no hardcoded secrets).

Update maas_inventory.py to exit gracefully when MAAS is not configured (returns empty inventory instead of error), enabling dual inventory in ansible.cfg without breaking non-MAAS users.

Wire ansible.cfg for dual inventory (static + dynamic). Add machines field to config.example/maas-inventory.yml. Add test-playbooks skill.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>

Replace synchronize (rsync) with fetch module for copying kubectl from the cluster: rsync spawns its own SSH that bypasses ansible_ssh_common_args, failing when a bastion/proxy is required.

Add cross-platform support: detect when the ansible controller and cluster have different OS/arch (e.g., macOS ARM vs Linux x86-64) and download the correct kubectl binary from dl.k8s.io matching the cluster's K8s version.

Install kubectl to the active virtualenv ($VIRTUAL_ENV/bin/) instead of /usr/local/bin, eliminating the need for sudo and keeping it scoped to the DeepOps environment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
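The cross-platform download can be pictured as building a dl.k8s.io URL from the controller's OS and CPU architecture. The change itself is implemented in Ansible; the Python sketch below only illustrates the mapping. The URL layout follows the publicly documented `https://dl.k8s.io/release/<version>/bin/<os>/<arch>/kubectl` scheme, while `kubectl_url`, `OS_NAMES`, and `ARCH_NAMES` are hypothetical names, and the set of supported platforms here is an assumption.

```python
import platform

# Map values reported by the platform module to the names used in release URLs.
OS_NAMES = {"Linux": "linux", "Darwin": "darwin"}
ARCH_NAMES = {"x86_64": "amd64", "amd64": "amd64", "aarch64": "arm64", "arm64": "arm64"}


def kubectl_url(version, system=None, machine=None):
    """Build the kubectl download URL for a K8s version and a target platform.

    `version` should match the cluster (e.g. "v1.30.0"); `system` and `machine`
    default to the local controller's platform.
    """
    system = system or platform.system()
    machine = machine or platform.machine()
    os_name = OS_NAMES[system]            # KeyError signals an unsupported OS
    arch = ARCH_NAMES[machine.lower()]    # KeyError signals an unsupported arch
    return f"https://dl.k8s.io/release/{version}/bin/{os_name}/{arch}/kubectl"
```

For a macOS ARM controller talking to a v1.30.0 cluster, `kubectl_url("v1.30.0", "Darwin", "arm64")` points at the darwin/arm64 build rather than the Linux binary present on the cluster nodes.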

Follow-up to PR #1336: rename remaining kube-master references to kube_control_plane and k8s-cluster to k8s_cluster in config.example group_vars, example playbook, and helper scripts (debug.sh, deploy_rook.sh). Also update ssh-hardening collection reference (dev-sec.ssh-hardening -> devsec.hardening) in config.example/group_vars/all.yml.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>

Recognize placeholder values from config.example (angle brackets, CONSUMER_KEY:TOKEN_KEY:TOKEN_SECRET) as unconfigured and return empty inventory instead of attempting to connect and failing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
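A check of this kind might look like the sketch below. It is an assumption about the logic, not a copy of the script: `is_unconfigured` and `PLACEHOLDER_KEY` are hypothetical names, and the rule "any `<...>` token means placeholder" is one plausible reading of "angle brackets".

```python
import re

# The literal example API key shipped in config.example, per the commit message.
PLACEHOLDER_KEY = "CONSUMER_KEY:TOKEN_KEY:TOKEN_SECRET"


def is_unconfigured(value):
    """Return True for empty values and for untouched example placeholders."""
    value = (value or "").strip()
    if not value or value == PLACEHOLDER_KEY:
        return True
    # Treat any <angle-bracket> token as a placeholder left from the example.
    return re.search(r"<[^<>]+>", value) is not None
```

A caller would return an empty inventory when either the URL or the API key is unconfigured, instead of attempting a connection that is certain to fail.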

feat: MAAS deploy workflow and dynamic inventory integration

fix: kubectl binary copy works through bastion and cross-platform

fix: update config.example and scripts for kubespray v2.30 group names

- Set executable mode on fetched kubectl binary (fetch doesn't preserve)
- Add changed_when: false to kubectl version command for idempotency
- Add SHA256 checksum verification for cross-platform kubectl download
- Respect proxy_env for get_url in proxy environments

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
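The checksum step is done in Ansible in the actual change. As a language-neutral illustration of what SHA256 verification of a downloaded binary involves, here is a stdlib-only Python sketch; `sha256_of` and `verify_checksum` are hypothetical helper names.

```python
import hashlib
import hmac


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large binaries are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected_text):
    """Compare a file's digest with the contents of a published checksum file.

    `expected_text` may be a bare hex digest or "<digest>  <filename>".
    """
    parts = expected_text.split()
    if not parts:
        return False
    return hmac.compare_digest(sha256_of(path), parts[0].lower())
```

A download should be discarded when `verify_checksum` returns False, before the file is made executable.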

- Move API key validation from maas_auth_header() to load_config() so exit works properly (exit in command substitution only kills subshell)
- Accept MAAS_SSH_BASTION env var (consistent with inventory script) and convert to ProxyCommand; MAAS_SSH_PROXY still works as direct override
- Quote ssh_bastion value in proxy command to handle spaces/special chars
- Use os.environ instead of shell interpolation for network_filter in get_ip() to prevent potential code injection
- Deduplicate hosts in inventory when machine has both old and aliased tags (e.g., both kube-master and kube_control_plane)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
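The bastion handling and quoting described in the list above could be sketched as follows. The environment variable names come from the commit message; everything else is an assumption, including the `ssh -W %h:%p` form of the proxy command and the helper names `proxy_command` and `ssh_common_args`.

```python
import os
import shlex


def proxy_command(bastion):
    """Build a ProxyCommand that tunnels through `bastion`, quoted for the shell."""
    return "ssh -W %h:%p -q " + shlex.quote(bastion)


def ssh_common_args(environ=None):
    """Derive SSH arguments from the environment without shell interpolation.

    MAAS_SSH_PROXY is used verbatim as a direct override; otherwise
    MAAS_SSH_BASTION is converted into a ProxyCommand.
    """
    environ = os.environ if environ is None else environ
    override = environ.get("MAAS_SSH_PROXY")
    if override:
        return "-o ProxyCommand=" + shlex.quote(override)
    bastion = environ.get("MAAS_SSH_BASTION")
    if bastion:
        return "-o ProxyCommand=" + shlex.quote(proxy_command(bastion))
    return ""
```

Reading values from a mapping such as `os.environ` and quoting them with `shlex.quote` keeps a value containing spaces or shell metacharacters from being interpreted as extra commands.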

fix: MAAS deploy/inventory hardening from Copilot review

fix: kubectl download hardening from Copilot review

pull bot added the ⤵️ pull label Dec 4, 2021

ajdecon and others added 29 commits March 30, 2022 15:34

Add Jenkins tests for DLE deployment

4f30949

make the DLE test blocking

fa6f437

fix jenkins test for dle - source jenkins vars

4b669a4

Merge pull request #1139 from supertetelman/more-tests

704a097

Add more tests to cover containerd/driver container k8s deployments

Merge branch 'master' into dle-examples-kaniko

48d21db

Bump GPU Operator 1.9.1 -> 1.10.0

58e6719

Bump GPU Feature Discovery 1.4.1->1.5.0

ca97801

Bump GPU Device Plugin 0.10.0->0.11.0

9cfde15

Align DeepOps docs with EGX -> Cloud Native Core

8924a42

fix misspelling

5125245

remove DLE test from PR Jenkins test

380bd1a

Merge pull request #1143 from ajdecon/update-nv-driver-release-v2.10

8190461

Update role nvidia.nvidia_driver to v2.1.0

Merge pull request #1145 from ajdecon/dle-examples-kaniko

2e6a2f6

Update Deep Learning Examples automation to handle containerd

Update jenkins matrix test to support driver container vs host driver

9d1290c

Merge pull request #1150 from supertetelman/jenkins-matrix

d15a097

Update jenkins matrix test to support driver container vs host driver

Add docker vs containerd to manual jenkins options

de727c2

Merge pull request #1151 from supertetelman/jenkins-matrix

5e70871

Add docker vs containerd to manual jenkins options

Adding Nvidia network operator for DGX

837edc0

Moved software version number to variable files.

ab16747

Removed a debug variable.

Adding network operator README file.

b7191c2

Add more details on IB configure, removed a typo.

616c195

Fixed typos, minior editing

1dbe222

Added a section to NVIDIA network operator

fe21aa4

Use variables for URLs

fe69578

Add comment on why need openshift module

bf99f7e

Comment on the example of slurm-val run via ansible-playbook.

af08cba

Fix linting errors in network operator role

15c4c2b

Set molecule github action to use ansible 2.9.27 to match deepops

50f34dc

dholt and others added 30 commits February 18, 2026 13:38

Merge pull request #1336 from dholt/fix/setup-distutils-to-packaging

f2ffb8b

Modernize for Ansible 10.x, Ubuntu 24.04, kubespray v2.30

Merge pull request #1337 from dholt/feature/maas-dynamic-inventory

d462d56

Add MAAS dynamic inventory script

Merge pull request #1338 from dholt/feature/maas-deploy-workflow-v2

0618771

feat: MAAS deploy workflow and dynamic inventory integration

Merge pull request #1339 from dholt/fix/kubectl-cross-platform

d15df99

fix: kubectl binary copy works through bastion and cross-platform

Merge pull request #1340 from dholt/fix/config-group-renames

97051de

fix: update config.example and scripts for kubespray v2.30 group names

Merge pull request #1341 from dholt/fix/maas-copilot-feedback

91ed38e

fix: MAAS deploy/inventory hardening from Copilot review

Merge pull request #1342 from dholt/fix/kubectl-copilot-feedback

428ca13

fix: kubectl download hardening from Copilot review


pull[bot] wants to merge 584 commits into jolorunyomi:master from NVIDIA:master


20 participants
