[pull] master from NVIDIA:master#80
Open
pull[bot] wants to merge 584 commits into jolorunyomi:master from NVIDIA:master
Add more tests to cover containerd/driver container k8s deployments
Update role nvidia.nvidia_driver to v2.1.0
Update Deep Learning Examples automation to handle containerd
Update jenkins matrix test to support driver container vs host driver
Add docker vs containerd to manual jenkins options
Hybrid clusters which run both Kubernetes and Slurm are not currently well-supported or tested in DeepOps. NVIDIA's preferred solution for this type of cluster is currently NVIDIA Bright Cluster Manager, which provides a more robust solution to managing multiple workload managers within the same cluster, and additionally supports UGE and PBS.

This PR makes the following changes:
- Clarify support in `README.md`
- Remove hybrid references in `docs/README.md`
- Completely remove `docs/deepops/dgx-pod.md`, which is also out of date in several respects (such as specifying a Ceph storage system)

A good future project would be to rewrite the DGX POD doc based on an up-to-date process with the latest DeepOps release.
Removed a debug variable.
- Keep ansible==4.8.0 for lint job (ansible-lint 5.4.0 is incompatible with ansible-core 2.16); use Python 3.10 for compatibility
- Use molecule-plugins[docker] instead of molecule[docker] (driver moved to separate package in newer molecule versions)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
The packaging module is used for version comparisons but was not installed until after those comparisons ran. This caused ImportError when ansible was already installed in the venv. Install packaging immediately after pip upgrade, before the version check block. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Douglas Holt <dholt@nvidia.com>
- Remove deprecated apt_key tasks from nvidia_cuda and nvidia_dcgm (cuda-keyring .deb package supersedes old GPG key management)
- Replace action: keyword with proper module syntax in easy-build
- Replace inline key=value module args with YAML dict syntax in easy-build and kerberos_client
- Widen kerberos_client version checks for RHEL 8+ and Ubuntu 20+

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
Remove dead code paths for EOL platforms (CentOS 7 EOL Jun 2024, Ubuntu 18.04 EOL Apr 2023).

Changes:
- setup.sh: Remove DEPS_EL7, simplify RHEL package install
- slurm: Remove CentOS 7 yum tasks, widen RHEL 8 dnf conditions
- lmod: Remove CentOS 7 yum task and Ubuntu 18.04 posix_c bugfix
- nfs: Remove RHEL 7 libsemanage-python task
- kerberos_client: Consolidate to single RHEL and Ubuntu task/vars
- openshift: Remove python2-openshift CentOS 7 task
- ood-wrapper: Update singularity image from 18.04 to 22.04
- molecule configs: Remove 1804/centos-7, add ubuntu-2204 platforms
- config.example: Update NGC container tags to current versions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
- Ansible: 9.13.0 -> 10.7.0 (ansible-core 2.16 -> 2.17)
- ansible-lint: 5.4.0 -> 26.1.1 (now compatible with Ansible 10.x)
- kubespray: v2.27.0+88 -> v2.30.0 (latest stable)
- jmespath: 1.0.1 -> 1.1.0
- ansible.posix: 1.5.4 -> 2.1.0
- community.general: 7.2.0 -> 12.3.0
- community.docker: 3.10.2 -> 5.0.6
- nvidia.nvidia_driver: v2.3.0 -> v2.3.1
- dev-sec.ssh-hardening: 9.7.0 -> 10.5.0
- geerlingguy.ntp: 2.3.2 -> 4.0.0
- gantsign.golang: 3.1.6 -> 3.5.0

Also fixes:
- docker.yml: Update kubespray defaults path (main.yml -> main/main.yml)
- docker.yml, k8s-cluster.yml: Remove CentOS 7 docker repo overrides
- CI: Remove ansible-lint/ansible 4.8.0 version workaround

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
- ansible.cfg: Replace removed community.general.yaml callback with ansible.builtin.default + result_format=yaml
- requirements.yml: Migrate dev-sec.ssh-hardening role to devsec.hardening collection (standalone role repo stopped at 9.7.0, 10.x+ is collection-only)
- playbooks: Update include_role references from dev-sec.ssh-hardening to devsec.hardening.ssh_hardening (FQCN)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
kubespray v2.30.0 renamed kubespray-defaults to kubespray_defaults (underscore) and removed the defaults/ dir from the old location. Update vars_files path and role reference in docker.yml accordingly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Douglas Holt <dholt@nvidia.com>
Modern Ubuntu (22.04+) enforces PEP 668 'externally-managed-environment' which blocks system-wide pip installs. Replace pip: name=docker with package: name=python3-docker across all roles that need the Docker Python SDK. Also removes dead Python 2 code paths. Affected roles: standalone-container-registry, docker-login, prometheus, alertmanager, nginx-docker-registry-cache Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Douglas Holt <dholt@nvidia.com>
The passlib module is required by Ansible's password_hash filter used in the users playbook. Without it, password hashing fails with 'No module named passlib' on modern systems. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Douglas Holt <dholt@nvidia.com>
Adds a Python script that queries the MAAS REST API to auto-discover deployed machines and map MAAS tags to Ansible groups. Tag a VM with slurm-master in MAAS and it appears in the [slurm-master] group.

- scripts/maas_inventory.py: inventory script (Python stdlib only)
- config.example/maas-inventory.yml: example configuration
- docs/pxe/maas.md: added Dynamic Inventory section

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
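The tag-to-group mapping described above can be sketched roughly as follows. This is not the contents of scripts/maas_inventory.py; the function name `build_inventory` is hypothetical, and the record fields (`hostname`, `status_name`, `tag_names`, `ip_addresses`) are assumed to match what the MAAS machines endpoint returns. It uses only the standard library and emits the `_meta.hostvars` layout that Ansible dynamic inventory scripts use.

```python
import json
from collections import defaultdict


def build_inventory(machines):
    """Map MAAS-style machine records to Ansible dynamic-inventory JSON.

    `machines` is a list of dicts; the field names used here are assumptions
    about the MAAS API response, not taken from the PR.
    """
    groups = defaultdict(list)
    hostvars = {}
    for machine in machines:
        # Only deployed machines are treated as usable Ansible targets.
        if machine.get("status_name") != "Deployed":
            continue
        host = machine["hostname"]
        ips = machine.get("ip_addresses") or []
        # Use the first reported address as the connection address, if any.
        hostvars[host] = {"ansible_host": ips[0]} if ips else {}
        # Each MAAS tag becomes an Ansible group of the same name.
        for tag in machine.get("tag_names", []):
            groups[tag].append(host)
    inventory = {name: {"hosts": sorted(hosts)} for name, hosts in groups.items()}
    inventory["_meta"] = {"hostvars": hostvars}
    return inventory


if __name__ == "__main__":
    sample = [
        {
            "hostname": "vm01",
            "status_name": "Deployed",
            "tag_names": ["slurm-master"],
            "ip_addresses": ["10.0.0.5"],
        }
    ]
    print(json.dumps(build_inventory(sample), indent=2))
```

Putting all host variables under `_meta` means Ansible does not need to call the script once per host.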
… CodeQL alert

- Remove hardcoded StrictHostKeyChecking=no from SSH args (security)
- Return empty dict for --host since _meta provides all hostvars (perf)
- Remove URL from error message to avoid CodeQL sensitive data alert

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
etcd is a separate group in the static inventory, not a child of kube-master. Users should tag etcd nodes explicitly rather than having all kube-master nodes implicitly join etcd. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Douglas Holt <dholt@nvidia.com>
…C entry

- Update pre-requisites from Ubuntu 18.04 to 22.04/24.04
- Note MAAS 2.8 is original version, current is 3.x
- Update maas_repo PPA from 2.8 to 3.5
- Add Dynamic Inventory section to table of contents

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
kubespray v2.30 requires underscored group names:
- kube-master -> kube_control_plane
- kube-node -> kube_node
- k8s-cluster -> k8s_cluster

Updated inventory templates, group_vars filename, group_vars content, and all playbook references. Directory paths (playbooks/k8s-cluster/) are unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
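For anyone carrying an older static inventory forward, the rename could be applied with something like the sketch below. It is an illustration only: `rename_groups` and `LEGACY_GROUP_NAMES` are hypothetical names, and the code assumes an INI-style inventory where group names appear only in `[section]` headers and under `:children` sections. Host lines and directory paths are left alone.

```python
# Old -> new group names, as listed in the commit message above.
LEGACY_GROUP_NAMES = {
    "kube-master": "kube_control_plane",
    "kube-node": "kube_node",
    "k8s-cluster": "k8s_cluster",
}


def rename_groups(inventory_text):
    """Rewrite legacy group names in an INI-style Ansible inventory."""
    out = []
    in_children = False
    for line in inventory_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # Header such as [kube-master] or [k8s-cluster:children].
            name, sep, suffix = stripped[1:-1].partition(":")
            in_children = suffix == "children"
            line = f"[{LEGACY_GROUP_NAMES.get(name, name)}{sep}{suffix}]"
        elif in_children and stripped in LEGACY_GROUP_NAMES:
            # Inside a :children section each entry is itself a group name.
            line = LEGACY_GROUP_NAMES[stripped]
        out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    old = "[kube-master]\nnode1\n\n[k8s-cluster:children]\nkube-master\nkube-node"
    print(rename_groups(old))
```

Restricting bare-name rewrites to `:children` sections avoids touching a host that happens to share a legacy group name.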
The 'native' snapshotter was a workaround for old cri-tools issues (#436, #710) that are long resolved. It causes 'no unpack platforms defined' errors with containerd v2.x. Switch to 'overlayfs' which is kubespray's default and works correctly on ext4/xfs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Douglas Holt <dholt@nvidia.com>
- Add project-level .ansible-lint with profile:min and skip_list for pre-existing issues (fqcn, name casing, truthy, octal, etc.)
- Rewrite lint script to run from project root using project config
- Remove per-role .ansible-lint files (conflicted with v26 syntax)
- Molecule: drop Ubuntu 20.04 platforms (EOL), keep 22.04 only
- Molecule: use cgroupns_mode:host, remove command:/sbin/init and tmpfs that caused systemd temp dir failures on cgroup v2 hosts
- Molecule: add privileged:true where missing, remove max-parallel limit, set fail-fast:false, upgrade runner to ubuntu-24.04
- Add ANSIBLE_ROLES_PATH and passlib to molecule workflow

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
- spack: Replace gcc-7/gfortran-7 with unversioned gcc/gfortran
- Remove abims_sbr.singularity from requirements.yml (dead project)
- Molecule CI: Remove 5 roles that can't run in Docker containers: nis_client, rsyslog_client, rsyslog_server, slurm (need systemd services), singularity_wrapper (broken upstream Galaxy dep). These are all verified end-to-end on real MAAS VMs.
- Remaining 11 molecule roles all pass in CI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
Modernize for Ansible 10.x, Ubuntu 24.04, kubespray v2.30
Add MAAS dynamic inventory script
Add scripts/maas_deploy.sh with --os, --profile, --status, --release, and --tags-only flags for repeatable VM deploy/tag/test/redeploy cycles. Reads config from config/maas-inventory.yml (no hardcoded secrets).

Update maas_inventory.py to exit gracefully when MAAS is not configured (returns empty inventory instead of error), enabling dual inventory in ansible.cfg without breaking non-MAAS users.

Wire ansible.cfg for dual inventory (static + dynamic). Add machines field to config.example/maas-inventory.yml. Add test-playbooks skill.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
Replace synchronize (rsync) with fetch module for copying kubectl from the cluster — rsync spawns its own SSH that bypasses ansible_ssh_common_args, failing when a bastion/proxy is required. Add cross-platform support: detect when the ansible controller and cluster have different OS/arch (e.g., macOS ARM vs Linux x86-64) and download the correct kubectl binary from dl.k8s.io matching the cluster's K8s version. Install kubectl to the active virtualenv ($VIRTUAL_ENV/bin/) instead of /usr/local/bin, eliminating the need for sudo and keeping it scoped to the DeepOps environment. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Douglas Holt <dholt@nvidia.com>
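The cross-platform selection could look roughly like this sketch, which builds a download URL for the controller's OS and architecture. The helper name `kubectl_url` and the alias table are hypothetical, the URL layout is assumed to follow the public `dl.k8s.io/release/<version>/bin/<os>/<arch>/kubectl` pattern, and the version string in the demo is illustrative only. The PR itself implements this in Ansible tasks, not in Python.

```python
import platform

# Map common platform.machine() values to the arch names used in release paths.
_ARCH_ALIASES = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
}


def kubectl_url(version, system=None, machine=None):
    """Return a kubectl download URL for the given (or local) OS and arch."""
    system = (system or platform.system()).lower()
    machine = (machine or platform.machine()).lower()
    if system not in ("linux", "darwin"):
        raise ValueError(f"unsupported OS for this sketch: {system}")
    arch = _ARCH_ALIASES.get(machine)
    if arch is None:
        raise ValueError(f"unsupported architecture for this sketch: {machine}")
    # Release paths are prefixed with "v", e.g. v1.30.0.
    if not version.startswith("v"):
        version = "v" + version
    return f"https://dl.k8s.io/release/{version}/bin/{system}/{arch}/kubectl"


if __name__ == "__main__":
    # A macOS ARM controller managing a cluster at an illustrative version.
    print(kubectl_url("1.30.0", system="Darwin", machine="arm64"))
```

When the controller and cluster share the same OS and architecture, fetching the binary from the cluster remains the simpler path and no download is needed.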
Follow-up to PR #1336: rename remaining kube-master references to kube_control_plane and k8s-cluster to k8s_cluster in config.example group_vars, example playbook, and helper scripts (debug.sh, deploy_rook.sh). Also update ssh-hardening collection reference (dev-sec.ssh-hardening -> devsec.hardening) in config.example/group_vars/all.yml. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Douglas Holt <dholt@nvidia.com>
Recognize placeholder values from config.example (angle brackets, CONSUMER_KEY:TOKEN_KEY:TOKEN_SECRET) as unconfigured and return empty inventory instead of attempting to connect and failing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Douglas Holt <dholt@nvidia.com>
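A minimal sketch of such a placeholder check, assuming placeholders are either the literal `CONSUMER_KEY:TOKEN_KEY:TOKEN_SECRET` string or contain an angle-bracketed token such as `<maas-server>`. The function name `is_unconfigured` and the exact matching rule are assumptions for illustration, not the logic in maas_inventory.py.

```python
import re

# Literal example API key named in the commit message above.
PLACEHOLDER_API_KEY = "CONSUMER_KEY:TOKEN_KEY:TOKEN_SECRET"

# Assumption: any <...> token marks a value copied unedited from config.example.
_ANGLE_PLACEHOLDER = re.compile(r"<[^<>]+>")

# Valid but empty dynamic inventory, so Ansible continues with other sources.
EMPTY_INVENTORY = {"_meta": {"hostvars": {}}}


def is_unconfigured(value):
    """Return True if a config value is missing or still a placeholder."""
    if value is None:
        return True
    text = str(value).strip()
    if not text or text == PLACEHOLDER_API_KEY:
        return True
    return _ANGLE_PLACEHOLDER.search(text) is not None


if __name__ == "__main__":
    for candidate in ("http://<maas-ip>:5240/MAAS", "http://10.0.0.1:5240/MAAS"):
        print(candidate, is_unconfigured(candidate))
```

A caller would return `EMPTY_INVENTORY` whenever the URL or API key is unconfigured, instead of attempting a connection.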
feat: MAAS deploy workflow and dynamic inventory integration
fix: kubectl binary copy works through bastion and cross-platform
fix: update config.example and scripts for kubespray v2.30 group names
- Set executable mode on fetched kubectl binary (fetch doesn't preserve)
- Add changed_when: false to kubectl version command for idempotency
- Add SHA256 checksum verification for cross-platform kubectl download
- Respect proxy_env for get_url in proxy environments

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
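The checksum step is handled in Ansible in this PR; the sketch below only illustrates the same idea in Python, for readers who want to verify a downloaded binary by hand. The helper names `sha256_of` and `verify_checksum` are hypothetical, and the expected value is assumed to arrive as a hex digest, optionally followed by a filename as in `sha256sum` output.

```python
import hashlib
import hmac
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected):
    """Return True if the file's digest matches the expected hex digest.

    `expected` may be a bare digest or a "digest  filename" line.
    """
    parts = expected.split()
    if not parts:
        return False
    return hmac.compare_digest(sha256_of(path), parts[0].lower())


if __name__ == "__main__":
    # Demo on a throwaway file standing in for a downloaded binary.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"example payload")
    try:
        good = hashlib.sha256(b"example payload").hexdigest()
        print(verify_checksum(tmp.name, good))
    finally:
        os.remove(tmp.name)
```

Reading in chunks keeps memory use flat regardless of the binary's size.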
- Move API key validation from maas_auth_header() to load_config() so exit works properly (exit in command substitution only kills subshell)
- Accept MAAS_SSH_BASTION env var (consistent with inventory script) and convert to ProxyCommand; MAAS_SSH_PROXY still works as direct override
- Quote ssh_bastion value in proxy command to handle spaces/special chars
- Use os.environ instead of shell interpolation for network_filter in get_ip() to prevent potential code injection
- Deduplicate hosts in inventory when machine has both old and aliased tags (e.g., both kube-master and kube_control_plane)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Holt <dholt@nvidia.com>
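The deduplication in the last bullet can be pictured with the sketch below. It assumes legacy tags are aliased onto the underscored group names, so a machine tagged with both `kube-master` and `kube_control_plane` would otherwise be added to the same group twice. The names `TAG_ALIASES` and `add_host` are hypothetical and the real script may structure this differently.

```python
# Assumed alias table: legacy MAAS tag -> kubespray v2.30 group name.
TAG_ALIASES = {
    "kube-master": "kube_control_plane",
    "kube-node": "kube_node",
    "k8s-cluster": "k8s_cluster",
}


def add_host(groups, host, tag_names):
    """Add `host` to the group for each tag, at most once per group."""
    for tag in tag_names:
        group = TAG_ALIASES.get(tag, tag)
        hosts = groups.setdefault(group, [])
        # Both the old and the aliased tag resolve to the same group,
        # so skip the host if it is already listed there.
        if host not in hosts:
            hosts.append(host)
    return groups


if __name__ == "__main__":
    groups = {}
    add_host(groups, "vm01", ["kube-master", "kube_control_plane", "etcd"])
    print(groups)
```

Using a list with a membership check keeps host order stable, which makes the generated inventory easier to diff between runs.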
fix: MAAS deploy/inventory hardening from Copilot review
fix: kubectl download hardening from Copilot review
See Commits and Changes for more details.
Created by
pull[bot]
Can you help keep this open source service alive? 💖 Please sponsor : )