[Intel-SIG] Fix sched domain build error for GNR, CWF in SNC-3 mode #101
Open: aubreyli wants to merge 4 commits into openvelinux:6.6-velinux from openvelinux:intel-numa-sched-domain-6.6
Conversation
commit d12a828 upstream.

Add support for autopointers for bitmaps allocated with bitmap_alloc() et al.

Intel-SIG: commit d12a828 Define a cleanup function for bitmaps.
Backport fix NUMA sched domain build errors for GNR and CWF.

Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Yury Norov <yury.norov@gmail.com>
Link: https://lore.kernel.org/r/20240122124243.44002-2-brgl@bgdev.pl
[ Aubrey Li: amend commit log ]
Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
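As a rough illustration of the pattern this dependency commit enables, here is a userspace sketch of scope-based cleanup for a heap-allocated bitmap. This is an analogue built on the GCC/Clang cleanup attribute (which underlies the kernel's cleanup helpers), not the kernel code itself; the macro and helper names below are invented for the example.

```c
#include <stdlib.h>

/* Userspace stand-in for the kernel's scope-based cleanup helpers:
 * the compiler invokes free_bitmap() when the variable leaves scope. */
static void free_bitmap(unsigned long **p)
{
	free(*p);
}
#define __free_bitmap __attribute__((cleanup(free_bitmap)))

/* Toy bitmap_alloc(): one unsigned long per 64 bits, zero-filled. */
static unsigned long *toy_bitmap_alloc(unsigned int nbits)
{
	return calloc((nbits + 63) / 64, sizeof(unsigned long));
}

/* With the cleanup attribute attached, no explicit free is needed on
 * any return path -- the point of defining a cleanup function for
 * bitmaps allocated with bitmap_alloc() et al. */
int count_set_bits_demo(unsigned int nbits)
{
	unsigned long *bm __free_bitmap = toy_bitmap_alloc(nbits);
	int count = 0;

	if (!bm)
		return -1;
	bm[0] = 0xFFUL;	/* set the low 8 bits */
	for (unsigned int i = 0; i < nbits; i++)
		if (bm[i / 64] & (1UL << (i % 64)))
			count++;
	return count;	/* bm is freed automatically here */
}
```

The cleanup runs on every exit path, including early error returns, which is what removes the usual goto-based unwinding.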
commit 06f2c90885e92992d1ce55d3f35b65b44d5ecc25 upstream.

Allow architecture specific sched domain NUMA distances that are modified from the actual NUMA node distances for the purpose of building NUMA sched domains. Keep the actual NUMA distances separately if modified distances are used for building sched domains. Such distances are still needed, as NUMA balancing benefits from finding the NUMA nodes that are actually closer to a task's numa_group.

Consolidate the recording of unique NUMA distances in an array into sched_record_numa_dist(), so the function can be reused to record NUMA distances when the NUMA distance metric is changed.

No functional change, and no additional distance array is allocated if no arch specific NUMA distances are defined.

Intel-SIG: commit 06f2c90885e9 Create architecture specific sched domain distances.
Backport fix NUMA sched domain build errors for GNR and CWF.

Co-developed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
[ Aubrey Li: amend commit log ]
Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
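To make "recording of unique NUMA distances in an array" concrete, here is a minimal userspace sketch of what such a helper does: scan a node distance table and collect each distinct distance once, in ascending order (each unique distance becomes one NUMA sched-domain level). The function name echoes sched_record_numa_dist() but this is an illustration, not the kernel implementation.

```c
#define MAX_NODES 8

/* Record the unique distances found in a node distance table into
 * out[] in ascending order.  Returns the count of unique distances,
 * or -1 if out[] is too small.  Illustrative only. */
int record_numa_dist(const int dist[][MAX_NODES], int nr_nodes,
		     int *out, int out_cap)
{
	int n = 0;

	for (int i = 0; i < nr_nodes; i++) {
		for (int j = 0; j < nr_nodes; j++) {
			int d = dist[i][j], k;

			/* find insertion point, skipping duplicates */
			for (k = 0; k < n && out[k] < d; k++)
				;
			if (k < n && out[k] == d)
				continue;
			if (n == out_cap)
				return -1;
			for (int m = n; m > k; m--)
				out[m] = out[m - 1];
			out[k] = d;
			n++;
		}
	}
	return n;
}
```

Centralizing this in one helper is what lets the scheduler re-record the distances after an architecture swaps in a modified distance metric.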
commit 4d6dd05d07d00bc3bd91183dab4d75caa8018db9 upstream.
It is possible for Granite Rapids (GNR) and Clearwater Forest
(CWF) to have up to 3 dies per package. When sub-numa cluster (SNC-3)
is enabled, each die will become a separate NUMA node in the package
with different distances between dies within the same package.
For example, on GNR, we see the following NUMA distances for a 2-socket
system with 3 dies per socket:
 package 1        package 2
 ---------        ---------
 |   0   |        |   3   |
 ---------        ---------
 |   1   |        |   4   |
 ---------        ---------
 |   2   |        |   5   |
 ---------        ---------
node distances:
node    0    1    2    3    4    5
   0:  10   15   17   21   28   26
   1:  15   10   15   23   26   23
   2:  17   15   10   26   23   21
   3:  21   28   26   10   15   17
   4:  23   26   23   15   10   15
   5:  26   23   21   17   15   10
The node distances above lead to two problems:
1. Asymmetric routes taken between nodes in different packages lead to
an asymmetric scheduler domain perspective depending on which node you
are on. The current scheduler code fails to build domains properly with
asymmetric distances.
2. Multiple remote distances to the respective tiles on the remote
package create too many levels of domain hierarchy, grouping different
nodes between remote packages.
For example, the GNR topology above leads to the NUMA domains below.
Sched domains from the perspective of a CPU in node 0, where the numbers
in brackets represent node numbers:
NUMA-level 1 [0,1] [2]
NUMA-level 2 [0,1,2] [3]
NUMA-level 3 [0,1,2,3] [5]
NUMA-level 4 [0,1,2,3,5] [4]
Sched domains from the perspective of a CPU in node 4
NUMA-level 1 [4] [3,5]
NUMA-level 2 [3,4,5] [0,2]
NUMA-level 3 [0,2,3,4,5] [1]
Scheduler group peers for load balancing from the perspective of CPUs in
nodes 0 and 4 are different. An improper task could be chosen for load
balancing between groups such as [0,2,3,4,5] and [1]. Ideally, nodes 0
or 2, which are in the same package as node 1, should be chosen first.
Instead, tasks in the remote-package nodes 3, 4, and 5 could be chosen
with equal chance, which could lead to excessive remote-package
migrations and an imbalance of load between packages. We should not
group partial remote nodes and local nodes together.
Simplify the remote distances for CWF and GNR for the purpose of
sched domains building, which maintains symmetry and leads to a more
reasonable load balance hierarchy.
The sched domains from the perspective of a CPU in node 0 are now:
NUMA-level 1 [0,1] [2]
NUMA-level 2 [0,1,2] [3,4,5]
The sched domains from the perspective of a CPU in node 4 are now:
NUMA-level 1 [4] [3,5]
NUMA-level 2 [3,4,5] [0,1,2]
We have the same balancing perspective from node 0 or node 4. Loads are
now balanced equally between packages.
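One way to sketch the simplification described above: for sched-domain building only, replace each local-to-remote distance with the average distance between the two packages, leaving intra-package distances untouched. This is an illustration of the idea under the node-to-package mapping from the diagram above, not the kernel code.

```c
#define NR_NODES 6

/* Package of each node, per the GNR diagram above:
 * nodes 0-2 in package 1, nodes 3-5 in package 2. */
static const int node_pkg[NR_NODES] = { 0, 0, 0, 1, 1, 1 };

/* Build a simplified distance table: remote distances are replaced by
 * the average distance between the two packages, so every node sees
 * one symmetric remote distance.  Illustrative sketch only. */
void average_remote_dist(const int in[NR_NODES][NR_NODES],
			 int out[NR_NODES][NR_NODES])
{
	for (int i = 0; i < NR_NODES; i++) {
		for (int j = 0; j < NR_NODES; j++) {
			if (node_pkg[i] == node_pkg[j]) {
				out[i][j] = in[i][j];
				continue;
			}
			int sum = 0, cnt = 0;

			for (int a = 0; a < NR_NODES; a++)
				for (int b = 0; b < NR_NODES; b++)
					if (node_pkg[a] == node_pkg[i] &&
					    node_pkg[b] == node_pkg[j]) {
						sum += in[a][b];
						cnt++;
					}
			out[i][j] = sum / cnt;
		}
	}
}
```

Applied to the distance table above, every remote entry collapses to a single value, which removes the asymmetry between the node 0 and node 4 perspectives and leaves just one remote NUMA sched-domain level per package.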
Intel-SIG: commit 4d6dd05d07d0 Fix sched domain build error for GNR, CWF in SNC-3 mode.
Backport fix NUMA sched domain build errors for GNR and CWF.
Co-developed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: Zhao Liu <zhao1.liu@intel.com>
[ Aubrey Li: amend commit log ]
Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
commit 73cbcfe255f7edca915d978a7d1b0a11f2d62812 upstream.
A compile warning slipped through:
arch/x86/kernel/smpboot.c:548:5: warning: no previous prototype for function 'arch_sched_node_distance' [-Wmissing-prototypes]
Intel-SIG: commit 73cbcfe255f7 Fix build warning.
Backport fix NUMA sched domain build errors for GNR and CWF.
Fixes: 4d6dd05d07d0 ("sched/topology: Fix sched domain build error for GNR, CWF in SNC-3 mode")
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[ Aubrey Li: amend commit log ]
Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
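For context on the warning fixed here: -Wmissing-prototypes fires when an external (non-static) function is defined without a prior declaration, and the usual fix is to declare it in a shared header. A minimal sketch, where the function name follows the warning above but the signature and body are assumed for illustration:

```c
/* The prototype that was missing -- in the kernel this declaration
 * would live in a header visible to arch/x86/kernel/smpboot.c, so the
 * definition below matches a previous declaration and no longer
 * triggers -Wmissing-prototypes. */
int arch_sched_node_distance(int from, int to);

int arch_sched_node_distance(int from, int to)
{
	/* hypothetical stand-in body: local vs. remote distance only */
	return from == to ? 10 : 20;
}
```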
While testing Granite Rapids (GNR) and Clearwater Forest (CWF) systems in SNC-3 mode, we encountered sched domain build errors in dmesg. The scheduler domain code did not expect asymmetric node distances from a local node to multiple nodes in a remote package. As a result, remote nodes ended up partially grouped with local nodes in asymmetric groupings, creating too many levels in the NUMA sched domain hierarchy.
To address this, we simplify remote node distances for the purpose of sched domain construction on GNR and CWF. Specifically, we replace the individual distances to nodes within the same remote package with their average distance. This resolves the domain build errors and reduces the number of NUMA sched domain levels.
The actual SLIT NUMA node distances are still preserved separately, as they are needed beyond sched domain building: NUMA balancing continues to use the true distances when selecting a closer remote node for a task's numa_group.
The following two commits are backported:
as well as their necessary dependencies:
Testing result w/o fixes:
[ 8.260954] CPU0 attaching sched-domain(s):
[ 8.261112] domain-0: span=0,192 level=SMT
[ 8.262111] groups: 0:{ span=0 cap=976 }, 192:{ span=192 cap=1022 }
[ 8.263111] domain-1: span=0-31,192-223 level=MC
[ 8.264110] groups: 0:{ span=0,192 cap=1998 }, 1:{ span=1,193 cap=2046 }, 2:{ span=2,194 cap=2045 }, 3:{ span=3,195 cap=2046 }, 4:{ span=4,196 cap=2044 }, 5:{ span=5,197 cap=2045 }, 6:{ span=6,198 cap=2046 }, 7:{ span=7,199 cap=2045 }, 8:{ span=8,200 cap=2045 }, 9:{ span=9,201 cap=2047 }, 10:{ span=10,202 cap=2045 }, 11:{ span=11,203 cap=2047 }, 12:{ span=12,204 cap=2044 }, 13:{ span=13,205 cap=2045 }, 14:{ span=14,206 cap=2045 }, 15:{ span=15,207 cap=2045 }, 16:{ span=16,208 cap=2045 }, 17:{ span=17,209 cap=2048 }, 18:{ span=18,210 cap=2047 }, 19:{ span=19,211 cap=2045 }, 20:{ span=20,212 cap=2045 }, 21:{ span=21,213 cap=2046 }, 22:{ span=22,214 cap=2048 }, 23:{ span=23,215 cap=2045 }, 24:{ span=24,216 cap=2047 }, 25:{ span=25,217 cap=2046 }, 26:{ span=26,218 cap=2046 }, 27:{ span=27,219 cap=2045 }, 28:{ span=28,220 cap=2046 }, 29:{ span=29,221 cap=2046 }, 30:{ span=30,222 cap=2044 }, 31:{ span=31,223 cap=2046 }
[ 8.265119] domain-2: span=0-63,192-255 level=NUMA
[ 8.266110] groups: 0:{ span=0-31,192-223 cap=65413 }, 32:{ span=32-63,224-255 cap=65457 }
[ 8.267111] domain-3: span=0-95,192-287 level=NUMA
[ 8.268110] groups: 0:{ span=0-63,192-255 mask=0-31,192-223 cap=130870 }, 64:{ span=32-95,224-287 mask=64-95,256-287 cap=131001 }
[ 8.269111] domain-4: span=0-127,192-319 level=NUMA
[ 8.270110] groups: 0:{ span=0-95,192-287 cap=196381 }, 96:{ span=96-127,288-319 cap=65451 }
[ 8.271111] domain-5: span=0-127,160-319,352-383 level=NUMA
[ 8.272110] groups: 0:{ span=0-127,192-319 mask=0-31,192-223 cap=261832 }, 160:{ span=160-191,352-383 cap=65475 }
[ 8.273112] domain-6: span=0-383 level=NUMA
[ 8.274110] groups: 0:{ span=0-127,160-319,352-383 mask=0-31,192-223 cap=327307 }
[ 8.275111] ERROR: groups don't span domain->span

Testing result w/ fixes:
[ 8.187368] CPU0 attaching sched-domain(s):
[ 8.188143] domain-0: span=0,192 level=SMT
[ 8.189142] groups: 0:{ span=0 cap=887 }, 192:{ span=192 }
[ 8.190141] domain-1: span=0-31,192-223 level=MC
[ 8.191141] groups: 0:{ span=0,192 cap=1911 }, 1:{ span=1,193 cap=2021 }, 2:{ span=2,194 cap=2038 }, 3:{ span=3,195 cap=2040 }, 4:{ span=4,196 cap=2039 }, 5:{ span=5,197 cap=2045 }, 6:{ span=6,198 cap=2041 }, 7:{ span=7,199 cap=2041 }, 8:{ span=8,200 cap=2042 }, 9:{ span=9,201 cap=2033 }, 10:{ span=10,202 cap=2033 }, 11:{ span=11,203 cap=2033 }, 12:{ span=12,204 cap=2045 }, 13:{ span=13,205 cap=2027 }, 14:{ span=14,206 cap=2038 }, 15:{ span=15,207 cap=2035 }, 16:{ span=16,208 cap=2044 }, 17:{ span=17,209 cap=2044 }, 18:{ span=18,210 cap=2039 }, 19:{ span=19,211 cap=2042 }, 20:{ span=20,212 cap=2041 }, 21:{ span=21,213 cap=2048 }, 22:{ span=22,214 cap=2036 }, 23:{ span=23,215 cap=2048 }, 24:{ span=24,216 cap=2021 }, 25:{ span=25,217 cap=2043 }, 26:{ span=26,218 cap=2044 }, 27:{ span=27,219 cap=2041 }, 28:{ span=28,220 cap=2041 }, 29:{ span=29,221 cap=2037 }, 30:{ span=30,222 cap=2036 }, 31:{ span=31,223 cap=2048 }
[ 8.192149] domain-2: span=0-63,192-255 level=NUMA
[ 8.193141] groups: 0:{ span=0-31,192-223 cap=65115 }, 32:{ span=32-63,224-255 cap=65201 }
[ 8.194142] domain-3: span=0-95,192-287 level=NUMA
[ 8.195141] groups: 0:{ span=0-63,192-255 mask=0-31,192-223 cap=130316 }, 64:{ span=32-95,224-287 mask=64-95,256-287 cap=130714 }
[ 8.196142] domain-4: span=0-383 level=NUMA
[ 8.197141] groups: 0:{ span=0-95,192-287 cap=195692 }, 96:{ span=96-191,288-383 cap=195639 }