Reorder statements to improve spatial locality #43

yy214123 · 2025-04-08T05:16:10Z

Since row-wise and column-wise checks are logically equivalent in terms of win probability, evaluating row-wise directions first is more favorable due to better spatial locality.

This reordering does not alter functional behavior but may lead to more efficient execution in practice.
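
To make the locality argument concrete, here is a minimal sketch, assuming the board is a row-major 1D array with side `N = 4` (the size used in the traces in this thread). The names `cell_index`, `row_stride`, and `col_stride` are hypothetical and are not taken from the project.

```c
#include <assert.h>
#include <stddef.h>

#define N 4 /* assumed board side length */

/* Map (row, col) to an offset in a row-major 1D array. */
static size_t cell_index(size_t row, size_t col)
{
    assert(row < N && col < N);
    return row * N + col;
}

/* Distance in elements between a cell and its right-hand neighbour. */
static size_t row_stride(void)
{
    return cell_index(0, 1) - cell_index(0, 0);
}

/* Distance in elements between a cell and the cell directly below it. */
static size_t col_stride(void)
{
    return cell_index(1, 0) - cell_index(0, 0);
}
```

Stepping along a row advances one element at a time, while stepping down a column jumps `N` elements, so a row-wise scan touches adjacent memory.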

visitorckw · 2025-04-08T05:39:08Z

Looks good.

However, I would appreciate more explanation on why prioritizing row-wise checks improves spatial locality.
Also, I'm curious if there's any observable performance gain or data showing how much faster this change might be.

yy214123 · 2025-04-08T05:59:36Z

I have not yet conducted a performance benchmark for this optimization. The improvement was motivated by insights from the following references:

https://hackmd.io/@sysprog/CSAPP-ch6

https://en.wikipedia.org/wiki/Row-_and_column-major_order

After printing the board indices being checked by check_win and check_line_segment_win, we can observe the access patterns:

---- Line type: COL ----
Checking board indices:
(0,0)[0] → (1,0)[4] → (2,0)[8]
(0,1)[1] → (1,1)[5] → (2,1)[9]
(0,2)[2] → (1,2)[6] → (2,2)[10]
(0,3)[3] → (1,3)[7] → (2,3)[11]
(1,0)[4] → (2,0)[8] → (3,0)[12]
(1,1)[5] → (2,1)[9] → (3,1)[13]
(1,2)[6] → (2,2)[10] → (3,2)[14]
(1,3)[7] → (2,3)[11] → (3,3)[15]

---- Line type: ROW ----
Checking board indices:
(0,0)[0] → (0,1)[1] → (0,2)[2]
(0,1)[1] → (0,2)[2] → (0,3)[3]
(1,0)[4] → (1,1)[5] → (1,2)[6]
(1,1)[5] → (1,2)[6] → (1,3)[7]
(2,0)[8] → (2,1)[9] → (2,2)[10]
(2,1)[9] → (2,2)[10] → (2,3)[11]
(3,0)[12] → (3,1)[13] → (3,2)[14]
(3,1)[13] → (3,2)[14] → (3,3)[15]

---- Line type: PRIMARY ----
Checking board indices:
(0,0)[0] → (1,1)[5] → (2,2)[10]
(0,1)[1] → (1,2)[6] → (2,3)[11]
(1,0)[4] → (2,1)[9] → (3,2)[14]
(1,1)[5] → (2,2)[10] → (3,3)[15]

---- Line type: SECONDARY ----
Checking board indices:
(0,2)[2] → (1,1)[5] → (2,0)[8]
(0,3)[3] → (1,2)[6] → (2,1)[9]
(1,2)[6] → (2,1)[9] → (3,0)[12]
(1,3)[7] → (2,2)[10] → (3,1)[13]

Since the chances of winning via ROW and COLUMN are the same, let's consider a case where the current board already contains a winning sequence like ROW XXX or ROW OOO. Even in this situation, the current implementation still performs a complete COLUMN-major traversal of the board before identifying a winner.

Given that the board is only 4×4, the performance loss from accessing non-contiguous memory might not be significant at this scale. However, the access pattern is still worth considering for potential optimization.
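
The early-exit effect described above can be sketched as follows. This is an illustration under stated assumptions, not the project's `check_win`: the table `lines`, the function `segments_examined`, and the constants `N = 4` and `GOAL = 3` are hypothetical, chosen so that the start cells and directions match the printed trace.

```c
#include <assert.h>
#include <stddef.h>

#define N 4    /* assumed board side, as in the trace */
#define GOAL 3 /* assumed cells per winning segment, as in the trace */

enum line_type { ROW, COL, PRIMARY, SECONDARY };

/* Step (dr, dc) plus the half-open ranges of valid start cells. */
struct line {
    int dr, dc;
    int r_begin, r_end;
    int c_begin, c_end;
};

static const struct line lines[] = {
    [ROW]       = {0, 1, 0, N, 0, N - GOAL + 1},
    [COL]       = {1, 0, 0, N - GOAL + 1, 0, N},
    [PRIMARY]   = {1, 1, 0, N - GOAL + 1, 0, N - GOAL + 1},
    [SECONDARY] = {1, -1, 0, N - GOAL + 1, GOAL - 1, N},
};

/* Count the segments examined, up to and including the first winning
 * one, when line types are scanned in the given order. If there is no
 * win, the result is the total number of segments. */
static int segments_examined(const char *board,
                             const enum line_type *order, size_t n_types)
{
    int examined = 0;

    assert(board != NULL && order != NULL);
    for (size_t t = 0; t < n_types; t++) {
        const struct line *ln = &lines[order[t]];
        for (int r = ln->r_begin; r < ln->r_end; r++) {
            for (int c = ln->c_begin; c < ln->c_end; c++) {
                char first = board[r * N + c];
                int k = 1;

                examined++;
                if (first == ' ')
                    continue; /* an empty start cell cannot win */
                while (k < GOAL &&
                       board[(r + k * ln->dr) * N + (c + k * ln->dc)] == first)
                    k++;
                if (k == GOAL)
                    return examined; /* early exit on the first win */
            }
        }
    }
    return examined;
}
```

For a board whose top row begins with `XXX` and is otherwise empty, scanning ROW first examines 1 segment, while scanning COL first examines all 8 column segments and then 1 row segment, 9 in total.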

visitorckw · 2025-04-08T06:12:58Z

> I have not yet conducted a performance benchmark for this optimization. The improvement was motivated by insights from the following references:
>
> https://hackmd.io/@sysprog/CSAPP-ch6
>
> https://en.wikipedia.org/wiki/Row-_and_column-major_order

Maybe consider including them in the commit message using the Link: tag?

> After printing the board indices being checked by check_win and check_line_segment_win, we can observe the access patterns:
>
> ---- Line type: COL ----
> Checking board indices:
> (0,0)[0] → (1,0)[4] → (2,0)[8]
> (0,1)[1] → (1,1)[5] → (2,1)[9]
> (0,2)[2] → (1,2)[6] → (2,2)[10]
> (0,3)[3] → (1,3)[7] → (2,3)[11]
> (1,0)[4] → (2,0)[8] → (3,0)[12]
> (1,1)[5] → (2,1)[9] → (3,1)[13]
> (1,2)[6] → (2,2)[10] → (3,2)[14]
> (1,3)[7] → (2,3)[11] → (3,3)[15]
>
> ---- Line type: ROW ----
> Checking board indices:
> (0,0)[0] → (0,1)[1] → (0,2)[2]
> (0,1)[1] → (0,2)[2] → (0,3)[3]
> (1,0)[4] → (1,1)[5] → (1,2)[6]
> (1,1)[5] → (1,2)[6] → (1,3)[7]
> (2,0)[8] → (2,1)[9] → (2,2)[10]
> (2,1)[9] → (2,2)[10] → (2,3)[11]
> (3,0)[12] → (3,1)[13] → (3,2)[14]
> (3,1)[13] → (3,2)[14] → (3,3)[15]
>
> ---- Line type: PRIMARY ----
> Checking board indices:
> (0,0)[0] → (1,1)[5] → (2,2)[10]
> (0,1)[1] → (1,2)[6] → (2,3)[11]
> (1,0)[4] → (2,1)[9] → (3,2)[14]
> (1,1)[5] → (2,2)[10] → (3,3)[15]
>
> ---- Line type: SECONDARY ----
> Checking board indices:
> (0,2)[2] → (1,1)[5] → (2,0)[8]
> (0,3)[3] → (1,2)[6] → (2,1)[9]
> (1,2)[6] → (2,1)[9] → (3,0)[12]
> (1,3)[7] → (2,2)[10] → (3,1)[13]
>
> Since the chances of winning via ROW and COLUMN are the same, let's consider a case where the current board already contains a winning sequence like ROW XXX or ROW OOO. Even in this situation, the current implementation still performs a complete COLUMN-major traversal of the board before identifying a winner.
>
> Given that the board is only 4×4, the performance loss from accessing non-contiguous memory might not be significant at this scale. However, the access pattern is still worth considering for potential optimization.

I agree that, in theory, prioritizing row-wise checks can lead to better cache locality. However, the current commit message only states that spatial locality improves, without explaining WHY.

Maybe consider tweaking the commit message to clarify the relationship between row-wise access, memory layout, spatial locality, and potential performance benefits?

jserv · 2025-04-08T08:02:13Z

Instead of appending the references supporting this proposed change, show experimental evidence.

yy214123 · 2025-04-11T06:17:18Z

Experiment 1: Fairly Generated Test Data

In this experiment, each generated board guarantees exactly one winning condition, evenly covering all possible directions.

To eliminate bias from board position and access order, I applied the Fisher–Yates shuffle to randomly permute the entire board after inserting the winning condition. This ensures the spatial location of the win does not favor any particular access pattern and provides a fair baseline for performance comparison.
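
For reference, a minimal sketch of a Fisher–Yates shuffle over a board stored as a 1D array of cells. The thread does not show the author's generator, and in particular it does not say how the winning line is kept intact by the shuffle; `shuffle_cells` is a hypothetical helper, and `rand()` is used only for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Shuffle cells[0..n) in place with the Fisher-Yates algorithm.
 * rand() % i has a slight modulo bias, which is acceptable for a
 * sketch but not for a statistically careful benchmark. */
static void shuffle_cells(char *cells, size_t n)
{
    assert(cells != NULL || n == 0);
    for (size_t i = n; i > 1; i--) {
        size_t j = (size_t) rand() % i; /* 0 <= j < i */
        char tmp = cells[i - 1];

        cells[i - 1] = cells[j];
        cells[j] = tmp;
    }
}
```

The shuffle only permutes cells, so the number of X, O, and empty cells on the board is unchanged.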

On a 10 × 10 board:

On a 50 × 50 board:

On a 100 × 100 board:

Since all potential winning patterns are included during data generation, programs using row-major storage incur additional memory access costs when scanning for column-based wins, and vice versa for column-major storage.

Experiment 2: Biased Sample Distribution (Favoring Rows)

On a 100 × 100 board, the test data was deliberately skewed to increase the proportion of row-based wins.

Under this biased condition, the row-major implementation clearly outperforms column-major, with a much wider performance gap.

Perf analysis summary:

Metric	column-major	row-major
Instructions	1,585,771,268,229	1,426,533,169,134
CPU cycles	427,169,360,881	390,890,530,455
IPC (insn/cycle)	3.71	3.65
Cache references	3,505,765,178	3,411,224,518
Cache misses	2,549,701,006	2,526,058,997
Cache miss rate	72.73%	74.05%
Time elapsed (s)	85.00	77.88

These results demonstrate a clear performance benefit for row-major storage when handling row-aligned winning conditions. Therefore, if the AI algorithm is designed to favor row-based placements, it can reduce the cost of victory checks and improve overall efficiency.

Experiment 3: Simulating Real Gameplay

The previous experiments assume "guaranteed wins," which is unrealistic in actual gameplay, where wins often occur only after many turns. For example:

 O | O | O | O | O 
---+---+---+---+---
 O |   |   | X |   
---+---+---+---+---
 O |   | X |   |   
---+---+---+---+---
 O |   | X |   | X 
---+---+---+---+---
 X |   |   | X |

In this scenario, scanning from the top-left may require multiple invalid checks before identifying a win at the bottom-left, which can be costly for column-major access.

To simulate this, I created two 5 × 5 boards that are transposes of each other:

One favors row-major
The other favors column-major

Both require several invalid checks before locating a winning pattern, representing a more realistic edge-case scenario.

 O | O |   |   |  
---+---+---+---+---
 O | O |   |   | 
---+---+---+---+---
   |   |   |   | O
---+---+---+---+---
 O | O |   | O | O
---+---+---+---+---
 O | O |   | O | O

 O | O |   | O | O
---+---+---+---+---
 O | O |   | O | O
---+---+---+---+---
   |   |   |   | 
---+---+---+---+---
   |   |   | O | O
---+---+---+---+---
   |   | O | O | O
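
The "transposes of each other" relationship can be checked mechanically. A minimal sketch, assuming each board is stored as a row-major 1D array of characters; `transpose_board` is a hypothetical helper, not a function from the project.

```c
#include <assert.h>
#include <stddef.h>

/* Write the transpose of the n x n row-major board src into dst.
 * The two buffers must not overlap. */
static void transpose_board(const char *src, char *dst, size_t n)
{
    assert(src != NULL && dst != NULL && src != dst);
    for (size_t r = 0; r < n; r++)
        for (size_t c = 0; c < n; c++)
            dst[c * n + r] = src[r * n + c];
}
```

Applied to the first board above, read row by row, this yields the second board.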

After running 1,000,000 iterations with perf, the metrics were as follows:

Metric	column_major	row_major
Instructions	20,042,179,354	9,528,171,690
CPU cycles	4,974,042,956	3,064,415,787
IPC (insn/cycle)	4.03	3.11
Cache references	205,075	115,548
Cache misses	28,503	20,176
Cache miss rate	13.90%	17.46%
Time elapsed (s)	0.971	0.597

Even under near-realistic conditions, row-major consistently outperformed column-major across almost all metrics. Although its cache miss rate is higher, this is due to having nearly half the total cache references — the overall execution time is still significantly lower.

Summary & Outlook

Across all three experiments, row-major storage clearly benefits from better spatial locality, especially when the game favors horizontal wins or involves frequent win-checking logic.

As a result, future AI algorithms for board games could benefit from encouraging row-oriented placement strategies. This not only simplifies decision-making but also enhances performance, particularly on large boards or scenarios requiring frequent win evaluations.

visitorckw · 2025-04-11T14:22:48Z

> Experiment 1: Fairly Generated Test Data
>
> In this experiment, each generated board guarantees exactly one winning condition, evenly covering all possible directions.
>
> To eliminate bias from board position and access order, I applied the Fisher–Yates shuffle to randomly permute the entire board after inserting the winning condition. This ensures the spatial location of the win does not favor any particular access pattern and provides a fair baseline for performance comparison.
>
> [...]
>
> Since all potential winning patterns are included during data generation, programs using row-major storage incur additional memory access costs when scanning for column-based wins, and vice versa for column-major storage.

I'm not sure if "column-major storage" here refers to actually storing the board in column-major order. If so, I'm unclear about the point of this experiment, since we're currently using a 1D array in row-major order and aren't planning to change that.

> Experiment 2: Biased Sample Distribution (Favoring Rows)
>
> On a 100 × 100 board, the test data was deliberately skewed to increase the proportion of row-based wins.
>
> [...]
>
> Under this biased condition, the row-major implementation clearly outperforms column-major, with a much wider performance gap.
>
> Perf analysis summary:
>
> Metric	column-major	row-major
> Instructions	1,585,771,268,229	1,426,533,169,134
> CPU cycles	427,169,360,881	390,890,530,455
> IPC (insn/cycle)	3.71	3.65
> Cache references	3,505,765,178	3,411,224,518
> Cache misses	2,549,701,006	2,526,058,997
> Cache miss rate	72.73%	74.05%
> Time elapsed (s)	85.00	77.88
>
> These results demonstrate a clear performance benefit for row-major storage when handling row-aligned winning conditions. Therefore, if the AI algorithm is designed to favor row-based placements, it can reduce the cost of victory checks and improve overall efficiency.

I'm also unsure about the purpose of this experiment. For fairness, shouldn't we also test column-based wins and prioritize scanning columns?

> Experiment 3: Simulating Real Gameplay
>
> The previous experiments assume "guaranteed wins," which is unrealistic in actual gameplay, where wins often occur only after many turns. For example:
>
> [...]
>
> After running 1,000,000 iterations with perf, the metrics were as follows:
>
> Metric	column_major	row_major
> Instructions	20,042,179,354	9,528,171,690
> CPU cycles	4,974,042,956	3,064,415,787
> IPC (insn/cycle)	4.03	3.11
> Cache references	205,075	115,548
> Cache misses	28,503	20,176
> Cache miss rate	13.90%	17.46%
> Time elapsed (s)	0.971	0.597
>
> Even under near-realistic conditions, row-major consistently outperformed column-major across almost all metrics. Although its cache miss rate is higher, this is due to having nearly half the total cache references; the overall execution time is still significantly lower.

This experiment seems more reasonable, but I'm a bit confused - I expected spatial locality to help by reducing cache misses, but the results show more cache misses. The faster execution seems to come from fewer instructions instead.

> Summary & Outlook
>
> Across all three experiments, row-major storage clearly benefits from better spatial locality, especially when the game favors horizontal wins or involves frequent win-checking logic.

I didn't really see a clear benefit from the first experiment.

> As a result, future AI algorithms for board games could benefit from encouraging row-oriented placement strategies. This not only simplifies decision-making but also enhances performance, particularly on large boards or scenarios requiring frequent win evaluations.

yy214123 · 2025-04-12T13:14:33Z

> I'm not sure if "column-major storage" here refers to actually storing the board in column-major order. If so, I'm unclear about the point of this experiment, since we're currently using a 1D array in row-major order and aren't planning to change that.

You're right to point that out. I realize my previous wording may have caused some confusion. The data is still stored in a row-major 1D array; I didn't change the underlying layout. The main goal of Experiment 1 was to generate a complete set of winning patterns on an n × n board, and then evaluate the performance of both row-wise and column-wise scanning strategies using this dataset.

> I'm also unsure about the purpose of this experiment. For fairness, shouldn't we also test column-based wins and prioritize scanning columns?

You're right. For fairness, we should definitely include this experiment.
Here are the results for the column-biased case:

Perf analysis summary:

Metric	column-major	row-major
Instructions	1,430,153,246,436	1,585,481,506,424
CPU cycles	390,527,903,361	423,555,794,132
IPC (insn/cycle)	3.66	3.74
Cache references	3,416,572,300	3,496,775,391
Cache misses	2,515,143,524	2,549,693,611
Cache miss rate	73.62%	72.92%
Time elapsed (s)	77.70	84.30

> This experiment seems more reasonable, but I'm a bit confused - I expected spatial locality to help by reducing cache misses, but the results show more cache misses. The faster execution seems to come from fewer instructions instead.

I just want to clarify that the cache miss rate is typically calculated as:
Cache Miss Rate = Cache Misses / Cache References
(e.g., 20,176 / 115,548 ≈ 17.46%)

So while the row-major version does have a higher miss rate percentage (17.46% vs. 13.90%), it actually had fewer total cache references and fewer absolute cache misses (20,176 vs. 28,503).
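
The formula above, written as a small helper for recomputing the rates from the raw counters. `miss_rate_pct` is a hypothetical name; the inputs in the note below are the Experiment 3 counters reported in this thread.

```c
#include <assert.h>

/* Cache miss rate in percent: misses / references * 100. */
static double miss_rate_pct(unsigned long long misses,
                            unsigned long long references)
{
    assert(references > 0);
    return 100.0 * (double) misses / (double) references;
}
```

With 28,503 misses out of 205,075 references the rate is about 13.90%, and with 20,176 out of 115,548 it is about 17.46%: the rate rises even though the absolute number of misses falls, because the denominator shrinks faster than the numerator.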

> I didn't really see a clear benefit from the first experiment.

You're right, I misspoke earlier. The performance improvements are more clearly demonstrated in Experiment 2 (row-biased wins) and Experiment 3, not in Experiment 1.

yy214123 · 2025-04-16T13:06:15Z

@visitorckw — just wanted to follow up on this PR when you have time. Appreciate your thoughts!

visitorckw · 2025-04-17T16:35:04Z

> I just want to clarify that the cache miss rate is typically calculated as:
> Cache Miss Rate = Cache Misses / Cache References
> (e.g., 20,176 / 115,548 ≈ 17.46%)
>
> So while the row-major version does have a higher miss rate percentage (17.46% vs. 13.90%), it actually had fewer total cache references and fewer absolute cache misses (20,176 vs. 28,503).

Yes, I'm aware of how cache miss rate is calculated. I just initially thought the improved efficiency came from a lower miss rate, not fewer cache references.

yy214123 · 2025-04-18T03:29:34Z

> Yes, I'm aware of how cache miss rate is calculated. I just initially thought the improved efficiency came from a lower miss rate, not fewer cache references.

Just to summarize the improvements observed:

Cache references reduced from 205,075 to 115,548 → a 43.66% drop
Cache misses reduced from 28,503 to 20,176 → a 29.21% drop
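
These figures are relative reductions, (before - after) / before. A minimal helper for recomputing them from the raw counters; `reduction_pct` is a hypothetical name.

```c
#include <assert.h>

/* Relative reduction in percent when a counter drops from before to after. */
static double reduction_pct(double before, double after)
{
    assert(before > 0.0);
    return 100.0 * (before - after) / before;
}
```

Applied to the instruction counts in the Experiment 3 table (20,042,179,354 to 9,528,171,690), the same helper gives roughly 52.5%.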

Both metrics show a reasonable reduction, so I just wanted to check: do you see anything potentially questionable or misleading about these results?

Or do you have any suggestions for improving the experiment further? I'm happy to refine it if needed.

visitorckw · 2025-04-18T06:43:40Z

Please include at least your benchmark observations in the commit message.

Currently, the message mainly repeats that the change improves spatial locality.
It would be helpful to explain why better spatial locality reduces cache references, and how it still improves performance even if the cache miss rate increases.

I'm asking because I noticed the instruction count dropped by nearly 50%, so I'm wondering if the performance gain is actually due to fewer instructions being executed, rather than cache behavior alone.

Reorder win-check logic to favor row-wise over column-wise checks. This aligns better with C's row-major memory layout, enhancing spatial locality. It reduces cache line crossings, cuts down total instruction count, and improves sequential memory access, which is especially useful for large boards or frequent evaluations.

Benchmark on a 5×5 board (1M iterations):
- Instructions: 20,042,179,354 → 9,528,171,690 (-52.5%)
- Cache refs: 205,075 → 115,548 (-43.66%)
- Cache misses: 28,503 → 20,176 (-29.21%)
- Time elapsed: 0.971s → 0.597s (-38.5%)

Despite a minor rise in miss rate, total misses declined due to fewer cache references.

jserv · 2025-04-21T04:24:42Z

Thank you, @yy214123, for contributing!

yy214123 force-pushed the optimize-spatial-locality branch from ddf7e45 to c77acdb on April 8, 2025 06:31

yy214123 force-pushed the optimize-spatial-locality branch from c77acdb to e404f67 on April 19, 2025 04:56

yy214123 force-pushed the optimize-spatial-locality branch from e404f67 to cfe9548 on April 19, 2025 04:59

jserv merged commit 33e1e21 into jserv:main Apr 21, 2025
2 checks passed
