
[pull] master from NVIDIA:master#80

Open
pull[bot] wants to merge 584 commits into jolorunyomi:master from NVIDIA:master

@pull pull bot commented Dec 4, 2021

See Commits and Changes for more details.


Created by pull[bot]

@pull pull bot added the ⤵️ pull label Dec 4, 2021
ajdecon and others added 29 commits March 30, 2022 15:34
Add more tests to cover containerd/driver container k8s deployments
Update role nvidia.nvidia_driver to v2.1.0
Update Deep Learning Examples automation to handle containerd
Update jenkins matrix test to support driver container vs host driver
Add docker vs containerd to manual jenkins options
Hybrid clusters that run both Kubernetes and Slurm are not well supported
or tested in DeepOps. NVIDIA's preferred solution for this type of cluster
is currently NVIDIA Bright Cluster Manager, which manages multiple workload
managers within the same cluster more robustly and additionally supports
UGE and PBS.

This PR makes the following changes:

- Clarify support in `README.md`
- Remove hybrid references in `docs/README.md`
- Completely remove `docs/deepops/dgx-pod.md`, which is also out of date
in several respects (such as specifying a Ceph storage system)

A good future project would be to rewrite the DGX POD doc based on an
up-to-date process with the latest DeepOps release.
dholt and others added 30 commits February 18, 2026 13:38
- Keep ansible==4.8.0 for lint job (ansible-lint 5.4.0 is incompatible
  with ansible-core 2.16); use Python 3.10 for compatibility
- Use molecule-plugins[docker] instead of molecule[docker] (driver
  moved to separate package in newer molecule versions)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
The packaging module is used for version comparisons but was not
installed until after those comparisons ran. This caused ImportError
when ansible was already installed in the venv. Install packaging
immediately after pip upgrade, before the version check block.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
- Remove deprecated apt_key tasks from nvidia_cuda and nvidia_dcgm
  (cuda-keyring .deb package supersedes old GPG key management)
- Replace action: keyword with proper module syntax in easy-build
- Replace inline key=value module args with YAML dict syntax
  in easy-build and kerberos_client
- Widen kerberos_client version checks for RHEL 8+ and Ubuntu 20+

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
Remove dead code paths for EOL platforms (CentOS 7 EOL Jun 2024,
Ubuntu 18.04 EOL Apr 2023). Changes:

- setup.sh: Remove DEPS_EL7, simplify RHEL package install
- slurm: Remove CentOS 7 yum tasks, widen RHEL 8 dnf conditions
- lmod: Remove CentOS 7 yum task and Ubuntu 18.04 posix_c bugfix
- nfs: Remove RHEL 7 libsemanage-python task
- kerberos_client: Consolidate to single RHEL and Ubuntu task/vars
- openshift: Remove python2-openshift CentOS 7 task
- ood-wrapper: Update singularity image from 18.04 to 22.04
- molecule configs: Remove 1804/centos-7, add ubuntu-2204 platforms
- config.example: Update NGC container tags to current versions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
- Ansible: 9.13.0 -> 10.7.0 (ansible-core 2.16 -> 2.17)
- ansible-lint: 5.4.0 -> 26.1.1 (now compatible with Ansible 10.x)
- kubespray: v2.27.0+88 -> v2.30.0 (latest stable)
- jmespath: 1.0.1 -> 1.1.0
- ansible.posix: 1.5.4 -> 2.1.0
- community.general: 7.2.0 -> 12.3.0
- community.docker: 3.10.2 -> 5.0.6
- nvidia.nvidia_driver: v2.3.0 -> v2.3.1
- dev-sec.ssh-hardening: 9.7.0 -> 10.5.0
- geerlingguy.ntp: 2.3.2 -> 4.0.0
- gantsign.golang: 3.1.6 -> 3.5.0

Also fixes:
- docker.yml: Update kubespray defaults path (main.yml -> main/main.yml)
- docker.yml, k8s-cluster.yml: Remove CentOS 7 docker repo overrides
- CI: Remove ansible-lint/ansible 4.8.0 version workaround

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
- ansible.cfg: Replace removed community.general.yaml callback with
  ansible.builtin.default + result_format=yaml
- requirements.yml: Migrate dev-sec.ssh-hardening role to devsec.hardening
  collection (standalone role repo stopped at 9.7.0, 10.x+ is collection-only)
- playbooks: Update include_role references from dev-sec.ssh-hardening to
  devsec.hardening.ssh_hardening (FQCN)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
kubespray v2.30.0 renamed kubespray-defaults to kubespray_defaults
(underscore) and removed the defaults/ dir from the old location.
Update vars_files path and role reference in docker.yml accordingly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
Modern Ubuntu (22.04+) enforces PEP 668 'externally-managed-environment'
which blocks system-wide pip installs. Replace pip: name=docker with
package: name=python3-docker across all roles that need the Docker
Python SDK. Also removes dead Python 2 code paths.

Affected roles: standalone-container-registry, docker-login, prometheus,
alertmanager, nginx-docker-registry-cache
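
For readers unfamiliar with PEP 668, the check that blocks system-wide pip installs can be approximated as follows. This is a minimal sketch, not code from this PR: it assumes the marker file location described in PEP 668 (`EXTERNALLY-MANAGED` in the stdlib directory), and the function names are hypothetical.

```python
import os
import sys
import sysconfig


def marker_path():
    """Path where PEP 668 places the EXTERNALLY-MANAGED marker file."""
    return os.path.join(sysconfig.get_path("stdlib"), "EXTERNALLY-MANAGED")


def in_virtualenv():
    """True inside a virtual environment, where the marker does not apply."""
    return sys.prefix != sys.base_prefix


def system_pip_blocked():
    """True if a plain system-wide `pip install` would likely be refused."""
    return not in_virtualenv() and os.path.exists(marker_path())


if __name__ == "__main__":
    print("marker file:", marker_path())
    print("system-wide pip blocked:", system_pip_blocked())
```

On a host where this reports a block, installing the distribution package (here `python3-docker`) avoids the restriction, which is the approach the commit takes.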

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
The passlib module is required by Ansible's password_hash filter used
in the users playbook. Without it, password hashing fails with
'No module named passlib' on modern systems.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
Adds a Python script that queries the MAAS REST API to auto-discover
deployed machines and map MAAS tags to Ansible groups. Tag a VM with
slurm-master in MAAS and it appears in the [slurm-master] group.

- scripts/maas_inventory.py: inventory script (Python stdlib only)
- config.example/maas-inventory.yml: example configuration
- docs/pxe/maas.md: added Dynamic Inventory section
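
The tag-to-group mapping described above can be sketched roughly as below. This is an illustration only and not the contents of `scripts/maas_inventory.py`: the field names (`hostname`, `status_name`, `tag_names`, `ip_addresses`) are assumed to match the MAAS machine listing, and the sample data is invented.

```python
import json


def build_inventory(machines):
    """Turn a list of MAAS machine records into an Ansible inventory dict."""
    inventory = {"_meta": {"hostvars": {}}}
    for machine in machines:
        # Assumption: only machines reported as "Deployed" are included.
        if machine.get("status_name") != "Deployed":
            continue
        host = machine["hostname"]
        addresses = machine.get("ip_addresses") or []
        inventory["_meta"]["hostvars"][host] = (
            {"ansible_host": addresses[0]} if addresses else {}
        )
        # Each MAAS tag becomes an Ansible group of the same name.
        for tag in machine.get("tag_names", []):
            group = inventory.setdefault(tag, {"hosts": []})
            if host not in group["hosts"]:
                group["hosts"].append(host)
    return inventory


if __name__ == "__main__":
    # Hypothetical sample records, not real MAAS output.
    sample = [
        {"hostname": "vm1", "status_name": "Deployed",
         "tag_names": ["slurm-master"], "ip_addresses": ["10.0.0.5"]},
        {"hostname": "vm2", "status_name": "Ready",
         "tag_names": ["slurm-master"], "ip_addresses": []},
    ]
    print(json.dumps(build_inventory(sample), indent=2))
```

With the sample above, only `vm1` lands in the `slurm-master` group, because `vm2` is not deployed.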

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
… CodeQL alert

- Remove hardcoded StrictHostKeyChecking=no from SSH args (security)
- Return empty dict for --host since _meta provides all hostvars (perf)
- Remove URL from error message to avoid CodeQL sensitive data alert
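
The `--host` change relies on the Ansible inventory-script convention that a `--list` response containing `_meta.hostvars` makes per-host lookups unnecessary. A minimal sketch of that argument handling, with a hypothetical `render` helper that is not taken from the PR:

```python
import argparse
import json


def render(argv, inventory):
    """Return the JSON an inventory script would print for the given args."""
    parser = argparse.ArgumentParser()
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--list", action="store_true")
    mode.add_argument("--host")
    args = parser.parse_args(argv)
    if args.host is not None:
        # Host variables already ship in _meta, so no lookup is needed here.
        return json.dumps({})
    return json.dumps(inventory)


if __name__ == "__main__":
    demo = {"_meta": {"hostvars": {"vm1": {"ansible_host": "10.0.0.5"}}},
            "slurm-master": {"hosts": ["vm1"]}}
    print(render(["--list"], demo))
    print(render(["--host", "vm1"], demo))
```

Returning an empty dict for `--host` avoids a second round trip to the API for every host.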

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
etcd is a separate group in the static inventory, not a child of
kube-master. Users should tag etcd nodes explicitly rather than having
all kube-master nodes implicitly join etcd.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
…C entry

- Update pre-requisites from Ubuntu 18.04 to 22.04/24.04
- Note MAAS 2.8 is original version, current is 3.x
- Update maas_repo PPA from 2.8 to 3.5
- Add Dynamic Inventory section to table of contents

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
kubespray v2.30 requires underscored group names:
- kube-master -> kube_control_plane
- kube-node -> kube_node
- k8s-cluster -> k8s_cluster

Updated inventory templates, group_vars filename, group_vars content,
and all playbook references. Directory paths (playbooks/k8s-cluster/)
are unchanged.
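
For anyone migrating a local inventory by hand, the rename can be sketched as a small text rewrite. This is an assumption-laden illustration, not a script from the PR: it only handles INI-style section headers and `:children` entries, and the sample inventory is hypothetical.

```python
import re

# Old hyphenated names mapped to the underscored names listed above.
GROUP_RENAMES = {
    "kube-master": "kube_control_plane",
    "kube-node": "kube_node",
    "k8s-cluster": "k8s_cluster",
}

# Matches "[group]" and "[group:children]" style headers.
_HEADER = re.compile(r"^\[([^\]:]+)(:[a-z]+)?\]\s*$")


def rename_groups(inventory_text):
    """Rewrite group headers and :children members; leave hosts untouched."""
    out = []
    in_children = False
    for line in inventory_text.splitlines():
        match = _HEADER.match(line)
        if match:
            name, suffix = match.group(1), match.group(2) or ""
            in_children = suffix == ":children"
            line = "[{}{}]".format(GROUP_RENAMES.get(name, name), suffix)
        elif in_children and line.strip() in GROUP_RENAMES:
            line = GROUP_RENAMES[line.strip()]
        out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    old = "[kube-master]\nmgmt01\n\n[k8s-cluster:children]\nkube-master\nkube-node"
    print(rename_groups(old))
```

Host lines are left alone, so a host that happens to be named like a group is not renamed outside a `:children` section.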

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
The 'native' snapshotter was a workaround for old cri-tools issues
(#436, #710) that are long resolved. It causes 'no unpack platforms
defined' errors with containerd v2.x. Switch to 'overlayfs' which
is kubespray's default and works correctly on ext4/xfs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
- Add project-level .ansible-lint with profile:min and skip_list for
  pre-existing issues (fqcn, name casing, truthy, octal, etc.)
- Rewrite lint script to run from project root using project config
- Remove per-role .ansible-lint files (conflicted with v26 syntax)
- Molecule: drop Ubuntu 20.04 platforms (EOL), keep 22.04 only
- Molecule: use cgroupns_mode:host, remove command:/sbin/init and
  tmpfs that caused systemd temp dir failures on cgroup v2 hosts
- Molecule: add privileged:true where missing, remove max-parallel
  limit, set fail-fast:false, upgrade runner to ubuntu-24.04
- Add ANSIBLE_ROLES_PATH and passlib to molecule workflow

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
- spack: Replace gcc-7/gfortran-7 with unversioned gcc/gfortran
- Remove abims_sbr.singularity from requirements.yml (dead project)
- Molecule CI: Remove 5 roles that can't run in Docker containers:
  nis_client, rsyslog_client, rsyslog_server, slurm (need systemd
  services), singularity_wrapper (broken upstream Galaxy dep).
  These are all verified end-to-end on real MAAS VMs.
- Remaining 11 molecule roles all pass in CI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
Modernize for Ansible 10.x, Ubuntu 24.04, kubespray v2.30
Add scripts/maas_deploy.sh with --os, --profile, --status, --release,
and --tags-only flags for repeatable VM deploy/tag/test/redeploy cycles.
Reads config from config/maas-inventory.yml (no hardcoded secrets).

Update maas_inventory.py to exit gracefully when MAAS is not configured
(returns empty inventory instead of error), enabling dual inventory in
ansible.cfg without breaking non-MAAS users.

Wire ansible.cfg for dual inventory (static + dynamic). Add machines
field to config.example/maas-inventory.yml. Add test-playbooks skill.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
Replace synchronize (rsync) with fetch module for copying kubectl from
the cluster — rsync spawns its own SSH that bypasses ansible_ssh_common_args,
failing when a bastion/proxy is required.

Add cross-platform support: detect when the ansible controller and cluster
have different OS/arch (e.g., macOS ARM vs Linux x86-64) and download the
correct kubectl binary from dl.k8s.io matching the cluster's K8s version.

Install kubectl to the active virtualenv ($VIRTUAL_ENV/bin/) instead of
/usr/local/bin, eliminating the need for sudo and keeping it scoped to
the DeepOps environment.
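
The cross-platform decision can be illustrated with a short sketch. It is not the playbook logic from this PR; the helper names are hypothetical, the platform pairs and version below are made-up examples, and the URL layout is assumed to follow the pattern used in the upstream kubectl install instructions.

```python
# Map common machine names to the architecture names used in release paths.
ARCH_NAMES = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
}


def normalize(system, machine):
    """Return (os, arch) in release-path form, e.g. ('darwin', 'arm64')."""
    os_name = system.lower()
    arch = ARCH_NAMES.get(machine.lower())
    if os_name not in ("linux", "darwin") or arch is None:
        raise ValueError("unsupported platform: {}/{}".format(system, machine))
    return os_name, arch


def needs_download(controller, cluster):
    """True when the cluster's kubectl binary cannot run on the controller."""
    return normalize(*controller) != normalize(*cluster)


def kubectl_url(k8s_version, system, machine):
    """Build the download URL for the controller's platform."""
    os_name, arch = normalize(system, machine)
    version = k8s_version if k8s_version.startswith("v") else "v" + k8s_version
    return "https://dl.k8s.io/release/{}/bin/{}/{}/kubectl".format(
        version, os_name, arch)


if __name__ == "__main__":
    # In practice these pairs could come from platform.system()/machine().
    controller = ("Darwin", "arm64")
    cluster = ("Linux", "x86_64")
    if needs_download(controller, cluster):
        print(kubectl_url("1.30.0", *controller))
```

When both sides normalize to the same pair, the binary fetched from the cluster can be used directly and no download is needed.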

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
Follow-up to PR #1336: rename remaining kube-master references to
kube_control_plane and k8s-cluster to k8s_cluster in config.example
group_vars, example playbook, and helper scripts (debug.sh, deploy_rook.sh).

Also update ssh-hardening collection reference (dev-sec.ssh-hardening ->
devsec.hardening) in config.example/group_vars/all.yml.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
Recognize placeholder values from config.example (angle brackets,
CONSUMER_KEY:TOKEN_KEY:TOKEN_SECRET) as unconfigured and return
empty inventory instead of attempting to connect and failing.
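
A placeholder check along these lines might look like the sketch below. The function name and the example values are hypothetical; only the two placeholder patterns (angle brackets and `CONSUMER_KEY:TOKEN_KEY:TOKEN_SECRET`) come from the commit message.

```python
PLACEHOLDER_API_KEY = "CONSUMER_KEY:TOKEN_KEY:TOKEN_SECRET"


def is_placeholder(value):
    """True if a config value is missing or still an example placeholder."""
    if value is None:
        return True
    text = str(value).strip()
    if not text or text == PLACEHOLDER_API_KEY:
        return True
    # Values such as "<maas-server>" are treated as unedited examples.
    return "<" in text and ">" in text


if __name__ == "__main__":
    for candidate in ("<maas-server>", PLACEHOLDER_API_KEY, "", "abc:def:ghi"):
        print(repr(candidate), is_placeholder(candidate))
```

A caller would return an empty inventory whenever any required setting is still a placeholder, rather than attempting a connection.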

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
feat: MAAS deploy workflow and dynamic inventory integration
fix: kubectl binary copy works through bastion and cross-platform
fix: update config.example and scripts for kubespray v2.30 group names
- Set executable mode on fetched kubectl binary (fetch doesn't preserve)
- Add changed_when: false to kubectl version command for idempotency
- Add SHA256 checksum verification for cross-platform kubectl download
- Respect proxy_env for get_url in proxy environments
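
The checksum step can be pictured with a small standalone sketch. The PR implements this in Ansible; the Python below only illustrates the idea, with hypothetical helper names and a temporary file standing in for the downloaded binary.

```python
import hashlib
import hmac
import os
import tempfile


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large binaries are not read into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Compare the file's digest with the published one (whitespace-tolerant)."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())


if __name__ == "__main__":
    payload = b"stand-in for a kubectl binary"
    with tempfile.NamedTemporaryFile(delete=False) as handle:
        handle.write(payload)
        path = handle.name
    try:
        print(verify(path, hashlib.sha256(payload).hexdigest()))
        print(verify(path, "0" * 64))
    finally:
        os.remove(path)
```

Stripping the expected value matters because published checksum files often end with a newline.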

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
- Move API key validation from maas_auth_header() to load_config() so
  exit works properly (exit in command substitution only kills subshell)
- Accept MAAS_SSH_BASTION env var (consistent with inventory script) and
  convert to ProxyCommand; MAAS_SSH_PROXY still works as direct override
- Quote ssh_bastion value in proxy command to handle spaces/special chars
- Use os.environ instead of shell interpolation for network_filter in
  get_ip() to prevent potential code injection
- Deduplicate hosts in inventory when machine has both old and aliased
  tags (e.g., both kube-master and kube_control_plane)
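
The deduplication in the last item can be sketched as follows. This is an illustrative guess at the behavior, not the inventory script itself; `TAG_ALIASES` and `merge_aliased_groups` are hypothetical names, and the alias table simply reuses the renames mentioned in this PR.

```python
# Old tag names that should resolve to the new underscored group names.
TAG_ALIASES = {
    "kube-master": "kube_control_plane",
    "kube-node": "kube_node",
    "k8s-cluster": "k8s_cluster",
}


def merge_aliased_groups(hosts_by_tag):
    """Fold aliased tags into one group and list each host only once."""
    groups = {}
    for tag, hosts in hosts_by_tag.items():
        group = TAG_ALIASES.get(tag, tag)
        members = groups.setdefault(group, [])
        for host in hosts:
            if host not in members:
                members.append(host)
    return groups


if __name__ == "__main__":
    tagged = {"kube-master": ["vm1", "vm2"], "kube_control_plane": ["vm1"]}
    print(merge_aliased_groups(tagged))
```

A machine carrying both the old and the new tag then appears once in the resulting group instead of twice.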

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
fix: MAAS deploy/inventory hardening from Copilot review
fix: kubectl download hardening from Copilot review