CASMTRIAGE-9035/CASMTRIAGE-8987/CASMNET-2387 - NMN Isolation bugfixes #751

Open
spillerc-hpe wants to merge 8 commits into main from CASMTRIAGE-9035

spillerc-hpe commented Mar 4, 2026

Summary and Scope

Fix three bugs in the MANAGED_NODE_ISOLATION ACL and CDU switch templates affecting NMN-isolated systems.

Fix 1 — CASMTRIAGE-9035: BMC nodes cannot resolve DNS, FAS firmware updates fail

Root cause: OSPF advertises the HMNLB DNS VIP (10.94.100.x) with a nexthop of 10.252.0.x on vlan 2 (NMN). This means BMC DNS queries enter the spine via vlan 2 and are evaluated by MANAGED_NODE_ISOLATION. The previous DNS rules only matched:

  • permit udp any eq dns any — traffic from port 53 (DNS replies), not queries
  • permit udp MANAGED_NODES NMN_K8S_SERVICE eq dns — DNS queries to the NMN K8S service range only; the BMC source (10.104.0.x) is not in MANAGED_NODES and the HMNLB DNS is not in NMN_K8S_SERVICE

Fix: Replace the three narrow DNS rules with four unrestricted rules covering DNS queries (dst-port 53) and replies (src-port 53) for both UDP and TCP, regardless of source or destination. The over-specific permit udp MANAGED_NODES NMN_K8S_SERVICE eq dns is now subsumed and removed (net +1 TCAM entry).
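
The gap in the old rules can be reproduced with a minimal Python sketch that models each ACE as a predicate on protocol, addresses, and ports. It models only the two previous rules quoted above. The prefixes used for MANAGED_NODES and NMN_K8S_SERVICE and the individual host addresses are placeholder assumptions for illustration, not the real object-group contents; only the 10.104.0.x BMC range and the 10.94.100.x VIP range come from this description.

```python
from dataclasses import dataclass
from ipaddress import IPv4Network, ip_address, ip_network
from typing import Optional

ANY = ip_network("0.0.0.0/0")
# Hypothetical prefixes, chosen only for illustration.
MANAGED_NODES = ip_network("10.252.128.0/17")
NMN_K8S_SERVICE = ip_network("10.92.100.0/24")


@dataclass(frozen=True)
class Rule:
    """One permit ACE; a port of None means 'any port'."""
    proto: str
    src: IPv4Network
    dst: IPv4Network
    src_port: Optional[int] = None
    dst_port: Optional[int] = None

    def matches(self, proto, src, dst, sport, dport):
        return (
            proto == self.proto
            and ip_address(src) in self.src
            and ip_address(dst) in self.dst
            and self.src_port in (None, sport)
            and self.dst_port in (None, dport)
        )


# The two previous rules listed above.
OLD_RULES = [
    Rule("udp", ANY, ANY, src_port=53),                        # replies only
    Rule("udp", MANAGED_NODES, NMN_K8S_SERVICE, dst_port=53),  # narrow queries
]

# The four replacement rules: UDP and TCP, queries and replies, any to any.
NEW_RULES = [
    Rule(proto, ANY, ANY, **{field: 53})
    for proto in ("udp", "tcp")
    for field in ("src_port", "dst_port")
]


def permitted(rules, packet):
    """True if any rule in the list matches the packet tuple."""
    return any(rule.matches(*packet) for rule in rules)


# A BMC DNS query to the HMNLB VIP (host octets are made up).
bmc_query = ("udp", "10.104.0.10", "10.94.100.225", 40000, 53)
print("old rules permit:", permitted(OLD_RULES, bmc_query))
print("new rules permit:", permitted(NEW_RULES, bmc_query))
```

Under these assumptions the query falls through both old rules (wrong port direction for the first, wrong source and destination for the second) and matches the new UDP dst-port 53 rule.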


Fix 2 — CASMTRIAGE-8987: IMS remote build node inaccessible to k8s containers

Root cause: IMS remote build jobs map containerised SSH servers to host ports starting at 2022 (incrementing per concurrent job). The ACL only permitted eq ssh (port 22) between NCNs and managed nodes, blocking the non-standard ports.

Fix: Add an SSH_ALTERNATE port object group (eq ssh + range 2022 2040) to services_objects.j2 and replace the two eq ssh rules in services_acl.j2 with group SSH_ALTERNATE. The range 2022–2040 supports up to 19 concurrent IMS remote build jobs per node. Net cost: +2 TCAM entries.
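
As a quick check of the arithmetic, the port group can be sketched in Python. The helper `ims_job_port` is hypothetical and assumes strictly sequential assignment, one host port per concurrent job starting at 2022, which is how the root cause above describes the mapping.

```python
SSH_PORT = 22
IMS_FIRST_PORT = 2022
IMS_LAST_PORT = 2040

# Ports covered by the SSH_ALTERNATE group: eq ssh + range 2022 2040.
SSH_ALTERNATE = {SSH_PORT} | set(range(IMS_FIRST_PORT, IMS_LAST_PORT + 1))


def ims_job_port(job_index: int) -> int:
    """Host port for the Nth concurrent build job (0-based); hypothetical helper."""
    port = IMS_FIRST_PORT + job_index
    if job_index < 0 or port > IMS_LAST_PORT:
        raise ValueError(f"job index {job_index} falls outside the permitted range")
    return port


max_jobs = IMS_LAST_PORT - IMS_FIRST_PORT + 1
print("concurrent jobs covered:", max_jobs)
print("port for last covered job:", ims_job_port(max_jobs - 1))
```

A 20th concurrent job would land on port 2041, which is outside the group and would still be blocked.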


Fix 3 — CASMNET-2387: NCN cannot SSH or run UDP traceroute to HMN cabinet BMCs under NMN isolation

Root cause: NCNs route the HMN cabinet network (10.104.0.0/22, vlan3000) via bond0.hmn0, so all traffic from NCNs to BMCs arrives at CDU switches with source 10.254.1.x (HMN range), not the NMN /32s that make up the NCN object-group. Under NMN isolation, the MANAGED_NODE_ISOLATION ACL applied inbound on vlan 2 dropped this traffic before it could reach any routed-in ACL — the NCN-sourced SSH and established rules (rules 280/290) never matched, and rule 430 (implicit deny) caught everything. ICMP traceroute worked because a separate permit icmp any any rule existed; UDP traceroute did not.
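
The address mismatch can be illustrated with a short Python sketch. The HMN prefix 10.254.0.0/17 is taken from the commit message in this PR; the two NCN NMN /32 entries and the exact source host address are made-up placeholders, since the real object-group contents are site specific.

```python
from ipaddress import ip_address, ip_network

HMN = ip_network("10.254.0.0/17")  # from the commit message in this PR
# Hypothetical NCN object-group entries (NMN /32s), for illustration only.
NCN_GROUP = [ip_network("10.252.1.4/32"), ip_network("10.252.1.5/32")]


def in_group(addr: str, group) -> bool:
    """True if addr falls inside any network of the object-group."""
    return any(ip_address(addr) in net for net in group)


# Traffic leaving an NCN via bond0.hmn0 carries an HMN source address.
ncn_hmn_source = "10.254.1.4"
print("matches NCN group:", in_group(ncn_hmn_source, NCN_GROUP))
print("matches HMN:", in_group(ncn_hmn_source, [HMN]))
```

Because the source is an HMN address, any rule keyed on the NCN object-group can never match it, which is consistent with the failure described above.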

Fix:

  • Add permit any HMN HMN_MTN / permit any HMN_MTN HMN ACEs (rules 300/310) to MANAGED_NODE_ISOLATION in services_acl.j2, guarded by HMN_MTN. NCN HMN addresses need unrestricted management access to BMCs (SSH, HTTPS, IPMI/RMCP, traceroute, etc.) so permit any is intentional.
  • Add cdu_hmn_cabinet_acl.j2 (new): CDU-specific ACL for vlan3000 (cabinet HMN). Permits full HMN↔HMN_MTN traffic, denies NMN↔HMN_MTN, then permit any any any.
  • Add cdu_nmn_routed_acl.j2 (new): CDU routed-in/out ACL for interface vlan2/2000. Prepends HMN↔HMN_MTN permit any rules before the inter-zone deny rules, then permit any any any.
  • Update mtn_hmn_vlan.j2: apply cdu-hmn-cabinet on vlan3000 when NMN isolation is active (previously nmn-hmn, which had no HMN-sourced exceptions).
  • Update mtn_nmn_vlan.j2, sw-cdu.primary.j2, sw-cdu.secondary.j2: use cdu-nmn-routed for routed-in/out; fix HMN_MTN Jinja2 guard from HMN_MTN_NETWORK_IP (not always present in the variables dict) to HMN_MTN (always initialised).
  • Fix canu/config/network/network.py: HMN_MTN, HMN_MTN_NETWORK_IP, HMN_MTN_NETMASK, HMN_NETWORK_IP, and HMN_NETMASK were computed by parse_sls_for_config but omitted from the variables dict passed to templates, causing UndefinedError in test_config_network_dry_run.
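
The rule ordering described for cdu_hmn_cabinet_acl.j2 can be sketched as a first-match evaluation in Python. This is a simplified model, not the rendered switch config: it ignores protocols and ports, the NMN prefix 10.252.0.0/17 is an assumption, and the host addresses are placeholders. HMN (10.254.0.0/17) and HMN_MTN (10.104.0.0/22) are the values quoted in this PR.

```python
from ipaddress import ip_address, ip_network

ANY = ip_network("0.0.0.0/0")
HMN = ip_network("10.254.0.0/17")
HMN_MTN = ip_network("10.104.0.0/22")
NMN = ip_network("10.252.0.0/17")  # assumed prefix, for illustration only

# Order matters: HMN<->HMN_MTN permits, then NMN<->HMN_MTN denies, then permit all.
CDU_HMN_CABINET = [
    ("permit", HMN, HMN_MTN),
    ("permit", HMN_MTN, HMN),
    ("deny", NMN, HMN_MTN),
    ("deny", HMN_MTN, NMN),
    ("permit", ANY, ANY),
]


def evaluate(acl, src: str, dst: str) -> str:
    """Return the action of the first entry whose networks contain src and dst."""
    for action, src_net, dst_net in acl:
        if ip_address(src) in src_net and ip_address(dst) in dst_net:
            return action
    return "deny"  # nothing matched


print(evaluate(CDU_HMN_CABINET, "10.254.1.4", "10.104.0.10"))  # NCN HMN address to BMC
print(evaluate(CDU_HMN_CABINET, "10.252.1.4", "10.104.0.10"))  # NMN address to BMC
```

In this model the HMN-sourced packet is permitted by the first entry, while the NMN-sourced packet hits the deny before it can reach the trailing permit.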

The two new utility scripts and their README are also included to help future developers regenerate golden configs after ACL changes.

  • I have added new tests to cover the new code
  • If adding a new file, I have updated pyinstaller.py (scripts in tests/ are not bundled)
  • I have added entries in CHANGELOG.md for the changes in this PR (no CHANGELOG.md exists)

Issues and Related PRs

  • Resolves: CASMTRIAGE-9035 — Odin: DNS servers are not reachable from BMCs. FAS update fails.
  • Resolves: CASMTRIAGE-8987 — Remote build node inaccessible to k8s container.
  • Resolves: CASMNET-2387 — NCN cannot SSH or traceroute to HMN cabinet BMCs under NMN isolation.

Testing

Manual: results of hand testing are still to be filled in.

Automated: All golden configs updated across full, TDS, and custom 1.7 architectures; isolation configs regenerated and unit tests pass.

Changed files:

| File | Change |
| --- | --- |
| `network_modeling/configs/templates/1.7/aruba/common/services_acl.j2` | Replace 3 DNS rules with 4 broad unrestricted rules; replace eq ssh with group SSH_ALTERNATE; add permit any HMN↔HMN_MTN rules (Fix 3) |
| `network_modeling/configs/templates/1.7/aruba/common/services_objects.j2` | Add SSH_ALTERNATE port group; add HMN object-group; remove redundant dns from service groups; consolidate ports 6817–6819 to a range |
| `network_modeling/configs/templates/1.7/aruba/common/cdu_hmn_cabinet_acl.j2` | New: CDU ACL for vlan3000 (cabinet HMN); permits HMN↔HMN_MTN, denies NMN↔HMN_MTN |
| `network_modeling/configs/templates/1.7/aruba/common/cdu_nmn_routed_acl.j2` | New: CDU routed-in/out ACL for vlan2/2000; HMN↔HMN_MTN permit-any before inter-zone denies |
| `network_modeling/configs/templates/1.7/aruba/common/mtn_hmn_vlan.j2` | Apply cdu-hmn-cabinet on vlan3000 under NMN isolation |
| `network_modeling/configs/templates/1.7/aruba/common/mtn_nmn_vlan.j2` | Use cdu-nmn-routed for routed-in/out; fix HMN_MTN Jinja2 guard |
| `network_modeling/configs/templates/1.7/aruba/common/sw-cdu.primary.j2` | Include new ACL templates; apply cdu-nmn-routed on vlan2/2000; fix HMN_MTN guard |
| `network_modeling/configs/templates/1.7/aruba/common/sw-cdu.secondary.j2` | Same as primary |
| `canu/config/network/network.py` | Add missing HMN_MTN* and HMN_NETWORK* variables to template variables dict |
| `tests/data/golden_configs/**/*-isolation.cfg` (7 files) | Regenerated: updated DNS, SSH, and HMN↔HMN_MTN rules |
| `tests/data/golden_configs/**/*.cfg` (remaining files) | Regenerated: CANU version bump only |
| `tests/data/golden_configs/individual_templates_1.7/services_acl.j2.cfg` | Regenerated golden config for template unit test |
| `tests/data/golden_configs/individual_templates_1.7/services_objects.j2.cfg` | Regenerated golden config for template unit test |
| `tests/scripts/regenerate_golden_configs_1.7.sh` | New: shell script to regenerate all CSM 1.7 golden configs |
| `tests/scripts/regenerate_individual_templates_1.7.py` | New: Python script to regenerate individual template golden configs |
| `tests/scripts/README.md` | New: documentation for test helper scripts |

spillerc-hpe requested a review from a team as a code owner March 4, 2026 11:53
spillerc-hpe changed the title from "CASMTRIAGE-9035 - DNS servers are not reachable from BMCs. FAS update fails." to "CASMTRIAGE-9035/CASMTRIAGE-8987 - NMN Isolation bugfixes" Mar 4, 2026
spillerc-hpe and others added 3 commits March 6, 2026 10:45
…ror as the switch always renders it by name not port
…ASMNET-2387)

NCNs route the HMN cabinet network (10.104.0.0/22) via bond0.hmn0, so
all SSH/UDP/ICMP traffic from NCNs to BMCs arrives at CDU switches with
a source address in 10.254.0.0/17 (HMN), not the NMN /32s that make up
the NCN object-group. Under NMN isolation the MANAGED_NODE_ISOLATION ACL
on vlan2 dropped these packets before they could reach any routed-in ACL,
causing SSH and UDP traceroute to fail.

Changes:
- services_acl.j2: add 'permit any HMN HMN_MTN' ACEs (rules 300/310)
  inside the HMN_MTN guard so HMN-sourced traffic bidirectionally
  reaches cabinet BMCs; NMN-sourced rules retained as dead-code intent
- cdu_hmn_cabinet_acl.j2 (new): ACL for vlan3000 (cabinet HMN); permits
  full HMN<->HMN_MTN traffic, denies NMN<->HMN_MTN, then permits all
- cdu_nmn_routed_acl.j2 (new): ACL for CDU routed-in/out on vlan2/2000;
  prepends HMN<->HMN_MTN permit-any rules before inter-zone deny rules
- mtn_hmn_vlan.j2: apply cdu-hmn-cabinet on vlan3000 when NMN isolation
  is active (previously nmn-hmn, which had no HMN-sourced exceptions)
- mtn_nmn_vlan.j2: use cdu-nmn-routed for routed-in/out ACL on NMN VLANs
- sw-cdu.primary.j2, sw-cdu.secondary.j2: include new ACL templates and
  apply cdu-nmn-routed on interface vlan2/2000; fix HMN_MTN guard from
  HMN_MTN_NETWORK_IP to HMN_MTN
- services_objects.j2: add HMN object-group for use in ACL rules
- canu/config/network/network.py: include HMN_MTN, HMN_MTN_NETWORK_IP,
  HMN_MTN_NETMASK, HMN_NETWORK_IP, HMN_NETMASK in the variables dict
  passed to templates (parse_sls_for_config already computed them but
  they were omitted, causing UndefinedError in test_config_network_dry_run)
- golden configs updated to reflect new ACL content

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
spillerc-hpe changed the title from "CASMTRIAGE-9035/CASMTRIAGE-8987 - NMN Isolation bugfixes" to "CASMTRIAGE-9035/CASMTRIAGE-8987/CASMNET-2387 - NMN Isolation bugfixes" Mar 6, 2026