
Reassociate reduction chains for long Min/Max#2

Open
galderz wants to merge 51 commits into rwestrel:master from galderz:topic.reassoc-reduct-chain.roland

Conversation


@galderz galderz commented Jan 30, 2026

Here's my draft proposal for JDK-8351409.

I've run tier1-3 testing on x86_64 successfully. Emanuel ran tests as well and they looked good.

Here are the benchmark results I observe:

darwin/aarch64:

Benchmark                                         (SIZE)  (seed)   Mode  Cnt      Base     Patch   Units   Diff
VectorReduction2.NoSuperword.longMaxBig             2048       0  thrpt    3   935.919  1085.515  ops/ms   +16%
VectorReduction2.NoSuperword.longMaxDotProduct      2048       0  thrpt    3  1049.671  2756.752  ops/ms  +163%
VectorReduction2.NoSuperword.longMaxSimple          2048       0  thrpt    3  1095.269  2588.862  ops/ms  +136%
VectorReduction2.NoSuperword.longMinBig             2048       0  thrpt    3   990.134  1147.871  ops/ms   +16%
VectorReduction2.NoSuperword.longMinDotProduct      2048       0  thrpt    3  1067.192  2711.534  ops/ms  +154%
VectorReduction2.NoSuperword.longMinSimple          2048       0  thrpt    3  1077.393  2690.817  ops/ms  +150%
VectorReduction2.WithSuperword.longMaxBig           2048       0  thrpt    3  1001.981  1149.861  ops/ms   +15%
VectorReduction2.WithSuperword.longMaxDotProduct    2048       0  thrpt    3  1079.197  2772.765  ops/ms  +157%
VectorReduction2.WithSuperword.longMaxSimple        2048       0  thrpt    3  1206.340  1208.723  ops/ms     0%
VectorReduction2.WithSuperword.longMinBig           2048       0  thrpt    3  1004.450  1150.072  ops/ms   +14%
VectorReduction2.WithSuperword.longMinDotProduct    2048       0  thrpt    3  1080.833  2875.314  ops/ms  +166%
VectorReduction2.WithSuperword.longMinSimple        2048       0  thrpt    3  1119.842  1209.392  ops/ms    +8%

linux/x86_64 (xeon cascade lake with AVX-512):

Benchmark                                         (SIZE)  (seed)   Mode  Cnt      Base     Patch   Units   Diff
VectorReduction2.NoSuperword.longMaxBig             2048       0  thrpt    3   327.435   307.119  ops/ms    -6%
VectorReduction2.NoSuperword.longMaxDotProduct      2048       0  thrpt    3   406.023   606.387  ops/ms   +49%
VectorReduction2.NoSuperword.longMaxSimple          2048       0  thrpt    3   463.940   980.875  ops/ms  +111%
VectorReduction2.NoSuperword.longMinBig             2048       0  thrpt    3   324.709   330.053  ops/ms    +2%
VectorReduction2.NoSuperword.longMinDotProduct      2048       0  thrpt    3   414.097   605.047  ops/ms   +46%
VectorReduction2.NoSuperword.longMinSimple          2048       0  thrpt    3   461.952   825.363  ops/ms   +79%
VectorReduction2.WithSuperword.longMaxBig           2048       0  thrpt    3   547.252   557.059  ops/ms    +2%
VectorReduction2.WithSuperword.longMaxDotProduct    2048       0  thrpt    3   948.265   944.391  ops/ms     0%
VectorReduction2.WithSuperword.longMaxSimple        2048       0  thrpt    3  1290.747  1290.617  ops/ms     0%
VectorReduction2.WithSuperword.longMinBig           2048       0  thrpt    3   552.254   552.402  ops/ms     0%
VectorReduction2.WithSuperword.longMinDotProduct    2048       0  thrpt    3   955.592   955.159  ops/ms     0%
VectorReduction2.WithSuperword.longMinSimple        2048       0  thrpt    3  1290.704  1290.363  ops/ms     0%

The -6% drop for longMaxBig on the Xeon is noise: I did perfnorm and perfasm runs, and in those it came in at 330 ops/ms.

Please see some additional notes I've made in the code.
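For context, the optimization rewrites a serial min/max reduction chain so that the loop-carried value joins last, shortening the critical dependence path per iteration. A minimal Java sketch of the two shapes (method and variable names are illustrative; they mirror the u/t chains in the generated test below):

```java
public class ReassocSketch {
    // Serial chain: each min depends on the previous one AND on the
    // loop-carried 'result', so the whole chain is latency-bound.
    static long serial(long result, long v0, long v1, long v2, long v3) {
        long u0 = Long.min(v0, result);
        long u1 = Long.min(v1, u0);
        long u2 = Long.min(v2, u1);
        long u3 = Long.min(v3, u2);
        return u3;
    }

    // Reassociated: the loaded values are combined independently of
    // 'result', which only joins at the end, so the loop-carried
    // dependence is a single min per iteration.
    static long reassociated(long result, long v0, long v1, long v2, long v3) {
        long t0 = Long.min(v0, v1);
        long t1 = Long.min(v2, t0);
        long t2 = Long.min(v3, t1);
        return Long.min(result, t2);
    }

    public static void main(String[] args) {
        long r = serial(7, 5, -3, 9, 1);
        // Long.min is associative and commutative, so both shapes agree.
        if (r != reassociated(7, 5, -3, 9, 1) || r != -3) {
            throw new AssertionError();
        }
        System.out.println(r);
    }
}
```

Both shapes compute the same value because min is associative and commutative, which is exactly what the generated test's `Verify.checkEQ(results[0], results[1])` asserts.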

* a typical loop would generate additional IR nodes. Hence, the test uses
* a custom while loop with non-inlined helper methods.
*/
public class TestReductionReassociationForAssociativeAdds {
Author


I'm unsure about the name for this test. As you can see, I originally used it to test all associative adds, which it no longer does, but I thought the name could be left as-is for future enhancements. We know we would want to do this for min/max on Integer, and I think additions for long/int might also work well.

Author

galderz commented Jan 30, 2026

To help review, this is what the generated IR test looks like (formatted after test execution):

package compiler.loopopts.templated;

// --- IMPORTS start ---
import compiler.lib.generators.*;
import compiler.lib.ir_framework.*;
import compiler.lib.verify.*;

// --- IMPORTS end   ---
public class ReductionReassociationForAssociativeAdds {
  // --- CLASS_HOOK insertions start ---
  // --- CLASS_HOOK insertions end   ---
  public static void main(String[] vmFlags) {
    TestFramework framework = new TestFramework(ReductionReassociationForAssociativeAdds.class);
    framework.addFlags(
        "-classpath",
        "/Users/g/src/jdk-reassoc-reduct-chain/JTwork/classes/compiler/loopopts/TestReductionReassociationForAssociativeAdds.d:/Users/g/src/jdk-reassoc-reduct-chain/test/hotspot/jtreg/compiler/loopopts:/Users/g/src/jdk-reassoc-reduct-chain/JTwork/classes/compiler/loopopts/TestReductionReassociationForAssociativeAdds.d/test/lib:/Users/g/src/jdk-reassoc-reduct-chain/test/lib:/Users/g/src/jdk-reassoc-reduct-chain/test/hotspot/jtreg:/Users/g/opt/jtreg/build/images/jtreg/lib/javatest.jar:/Users/g/opt/jtreg/build/images/jtreg/lib/jtreg.jar:/Users/g/src/jdk-reassoc-reduct-chain/JTwork/scratch/./compile-framework-classes-4709488473270129801");
    framework.addFlags(vmFlags);
    framework.start();
  }

  // --- LIST OF TESTS start ---
  // --- test_2 start ---

  private static long[] input_2 = new long[10000];

  static {
    Generators.G.fill(Generators.G.longs(), input_2);
  }

  private Object[] expected_2 = test_2();

  @Test
  @IR(
      counts = {IRNode.MIN_L, "= 4"},
      phase = CompilePhase.AFTER_LOOP_OPTS)
  public Object[] test_2() {
    long result = Long.MAX_VALUE;
    long result2 = Long.MAX_VALUE;
    int i = 0;
    while (i < input_2.length) {
      long v0 = getArray_dontinline_input_2(0, i);
      long v1 = getArray_dontinline_input_2(1, i);
      long v2 = getArray_dontinline_input_2(2, i);
      long v3 = getArray_dontinline_input_2(3, i);
      long u0 = Long.min(v0, result);
      long u1 = Long.min(v1, u0);
      long u2 = Long.min(v2, u1);
      long u3 = Long.min(v3, u2);
      long t0 = Long.min(v0, v1);
      long t1 = Long.min(v2, t0);
      long t2 = Long.min(v3, t1);
      long t3 = Long.min(result, t2);
      result = u3;
      result2 = t3;
      i = sum_dontinline_input_2(i, 4);
    }
    return asArray_dontinline_input_2(result, result2);
  }

  @Check(test = "test_2")
  public void check_2(Object[] results) {
    Verify.checkEQ(expected_2[0], results[0]);
    Verify.checkEQ(expected_2[1], results[1]);
    Verify.checkEQ(results[0], results[1]);
  }

  static long getArray_dontinline_input_2(int pos, int base) {
    return input_2[pos + base];
  }

  static Object[] asArray_dontinline_input_2(long result, long result2) {
    return new Object[] {result, result2};
  }

  static int sum_dontinline_input_2(int a, int b) {
    return a + b;
  }

  // --- test_2 end ---
  // --- test_18 start ---

  private static long[] input_18 = new long[10000];

  static {
    Generators.G.fill(Generators.G.longs(), input_18);
  }

  private Object[] expected_18 = test_18();

  @Test
  @IR(
      counts = {IRNode.MAX_L, "= 4"},
      phase = CompilePhase.AFTER_LOOP_OPTS)
  public Object[] test_18() {
    long result = Long.MIN_VALUE;
    long result2 = Long.MIN_VALUE;
    int i = 0;
    while (i < input_18.length) {
      long v0 = getArray_dontinline_input_18(0, i);
      long v1 = getArray_dontinline_input_18(1, i);
      long v2 = getArray_dontinline_input_18(2, i);
      long v3 = getArray_dontinline_input_18(3, i);
      long u0 = Long.max(v0, result);
      long u1 = Long.max(v1, u0);
      long u2 = Long.max(v2, u1);
      long u3 = Long.max(v3, u2);
      long t0 = Long.max(v0, v1);
      long t1 = Long.max(v2, t0);
      long t2 = Long.max(v3, t1);
      long t3 = Long.max(result, t2);
      result = u3;
      result2 = t3;
      i = sum_dontinline_input_18(i, 4);
    }
    return asArray_dontinline_input_18(result, result2);
  }

  @Check(test = "test_18")
  public void check_18(Object[] results) {
    Verify.checkEQ(expected_18[0], results[0]);
    Verify.checkEQ(expected_18[1], results[1]);
    Verify.checkEQ(results[0], results[1]);
  }

  static long getArray_dontinline_input_18(int pos, int base) {
    return input_18[pos + base];
  }

  static Object[] asArray_dontinline_input_18(long result, long result2) {
    return new Object[] {result, result2};
  }

  static int sum_dontinline_input_18(int a, int b) {
    return a + b;
  }

  // --- test_18 end ---
  // --- LIST OF TESTS end   ---
}

return progress;
}

static AddNode* build_min_max(int opcode, Node* a, Node* b, PhaseIdealLoop* phase) {
Owner


There's already a MinMaxNode::build_min_max_long()

Owner


But maybe it's not convenient to use here.

Author


Hmmmm, I could use it for sure. I would just need an opcode branch around it, but then I could just reuse what is_associative does below.

}
}

static bool is_associative(Node* node) {
Owner


A comment that says only long min and max are supported at this point?

Author


Yeah I can add a comment. The more I think about it, maybe it can be renamed to can_reassociate.

return node->Opcode() == Op_MinL || node->Opcode() == Op_MaxL;
}

static Node* reassociate_chain(int add_opcode, Node* node, PhiNode* phi, Node* loop_head, PhaseIdealLoop* phase) {
Owner


Rather than static methods, create a ReassociateAssociativeOperations class and make the static methods part of the class.

Author


Are there some guidelines on when to use static vs instance methods?

I initially considered making these instance methods of PhaseIdealLoop, which would have been fine for me, but I ended up with static methods to speed up development. Is making a ReassociateAssociativeOperations class with static methods preferable to this?

Owner


There are no guidelines that I'm aware of. One benefit of a new class might be that, rather than passing things around as arguments to methods, they can be fields of the class. For instance, phase can be a field, so there's no need to pass it around. Same goes for loop_head, I suppose. And all the code that's specific to a particular optimization and unlikely to be needed elsewhere is clearly grouped together.

Owner


Maybe the new class is overkill here but I went that way for some patch I'm working on and I found that it worked fairly well.

Node* loop_head_use = loop_head->fast_out(i);
if (loop_head_use->is_Phi()) {
PhiNode* phi = loop_head_use->as_Phi();
Unique_Node_List wq;
Owner


I think you should be able to have this work without the extra Unique_Node_List. If you don't want to process new Phis in the loop, you can capture uint unique = Compile::unique() before entering the loop and then in the loop test phi->_idx < unique.
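The suggested trick can be illustrated with a toy Java analogue: every node carries a monotonically increasing index, so snapshotting the counter before a mutation loop lets you skip nodes created during that loop without keeping a separate worklist. All names here are illustrative, not HotSpot's actual API:

```java
import java.util.ArrayList;
import java.util.List;

public class UniqueSnapshot {
    static int nextIdx = 0;  // stands in for Compile::unique()

    static class Node {
        final int idx = nextIdx++;  // stands in for Node::_idx
    }

    public static void main(String[] args) {
        List<Node> nodes = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            nodes.add(new Node());
        }

        int unique = nextIdx;  // snapshot before entering the loop
        int processed = 0;
        for (int i = 0; i < nodes.size(); i++) {
            Node n = nodes.get(i);
            if (n.idx >= unique) {
                continue;  // created during this loop: skip it
            }
            processed++;
            if (processed == 1) {
                nodes.add(new Node());  // transformation creates a new node
            }
        }
        System.out.println(processed);  // only the 3 pre-existing nodes
    }
}
```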

break;
}

Node* use = nullptr;
Owner


current has only one use, so you can use current->unique_out() instead of the loop.

Owner


This said, wouldn't it be ok if there were multiple uses, as long as only one of them is in the loop?

Author


In theory, yeah, but we agreed to limit this as much as possible to avoid breaking other parts. The primary use of this would be loops unrolled by HotSpot itself (as opposed to user-unrolled ones).

Owner


Ok.

Node* loop_head = loop->head();
ReassociateReductionChains rrc(loop, this);

for (DUIterator_Fast imax, i = loop_head->fast_outs(imax); i < imax; i++) {
Owner


Can this be a method of ReassociateReductionChains?

}

bool transform(Node* n, PhiNode* phi) {
bool is_associative = n->Opcode() == Op_MinL || n->Opcode() == Op_MaxL;
Owner


I would put this in its own method
