
Reassociate reduction chains for long Min/Max#2

Open
galderz wants to merge 51 commits into rwestrel:master from galderz:topic.reassoc-reduct-chain.roland

Conversation


@galderz galderz commented Jan 30, 2026

Here's my draft proposal for JDK-8351409.

I've run tier1-3 testing on x86_64 successfully. Emanuel ran tests as well and they looked good.

Here are the benchmark results I observe:

darwin/aarch64:

Benchmark                                         (SIZE)  (seed)   Mode  Cnt      Base     Patch   Units   Diff
VectorReduction2.NoSuperword.longMaxBig             2048       0  thrpt    3   935.919  1085.515  ops/ms   +16%
VectorReduction2.NoSuperword.longMaxDotProduct      2048       0  thrpt    3  1049.671  2756.752  ops/ms  +163%
VectorReduction2.NoSuperword.longMaxSimple          2048       0  thrpt    3  1095.269  2588.862  ops/ms  +136%
VectorReduction2.NoSuperword.longMinBig             2048       0  thrpt    3   990.134  1147.871  ops/ms   +16%
VectorReduction2.NoSuperword.longMinDotProduct      2048       0  thrpt    3  1067.192  2711.534  ops/ms  +154%
VectorReduction2.NoSuperword.longMinSimple          2048       0  thrpt    3  1077.393  2690.817  ops/ms  +150%
VectorReduction2.WithSuperword.longMaxBig           2048       0  thrpt    3  1001.981  1149.861  ops/ms   +15%
VectorReduction2.WithSuperword.longMaxDotProduct    2048       0  thrpt    3  1079.197  2772.765  ops/ms  +157%
VectorReduction2.WithSuperword.longMaxSimple        2048       0  thrpt    3  1206.340  1208.723  ops/ms     0%
VectorReduction2.WithSuperword.longMinBig           2048       0  thrpt    3  1004.450  1150.072  ops/ms   +14%
VectorReduction2.WithSuperword.longMinDotProduct    2048       0  thrpt    3  1080.833  2875.314  ops/ms  +166%
VectorReduction2.WithSuperword.longMinSimple        2048       0  thrpt    3  1119.842  1209.392  ops/ms    +8%

linux/x86_64 (xeon cascade lake with AVX-512):

Benchmark                                         (SIZE)  (seed)   Mode  Cnt      Base     Patch   Units   Diff
VectorReduction2.NoSuperword.longMaxBig             2048       0  thrpt    3   327.435   307.119  ops/ms    -6%
VectorReduction2.NoSuperword.longMaxDotProduct      2048       0  thrpt    3   406.023   606.387  ops/ms   +49%
VectorReduction2.NoSuperword.longMaxSimple          2048       0  thrpt    3   463.940   980.875  ops/ms  +111%
VectorReduction2.NoSuperword.longMinBig             2048       0  thrpt    3   324.709   330.053  ops/ms    +2%
VectorReduction2.NoSuperword.longMinDotProduct      2048       0  thrpt    3   414.097   605.047  ops/ms   +46%
VectorReduction2.NoSuperword.longMinSimple          2048       0  thrpt    3   461.952   825.363  ops/ms   +79%
VectorReduction2.WithSuperword.longMaxBig           2048       0  thrpt    3   547.252   557.059  ops/ms    +2%
VectorReduction2.WithSuperword.longMaxDotProduct    2048       0  thrpt    3   948.265   944.391  ops/ms     0%
VectorReduction2.WithSuperword.longMaxSimple        2048       0  thrpt    3  1290.747  1290.617  ops/ms     0%
VectorReduction2.WithSuperword.longMinBig           2048       0  thrpt    3   552.254   552.402  ops/ms     0%
VectorReduction2.WithSuperword.longMinDotProduct    2048       0  thrpt    3   955.592   955.159  ops/ms     0%
VectorReduction2.WithSuperword.longMinSimple        2048       0  thrpt    3  1290.704  1290.363  ops/ms     0%

The -6% drop for longMaxBig on the Xeon is noise: I did perfnorm and perfasm runs, and in those it came in at 330 ops/ms.

Please see some additional notes I've made in the code.
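For context, the optimization rewrites a serial min/max reduction chain so that the loop-carried value joins last, shortening the critical dependence path per iteration. A minimal Java sketch of the two shapes (method and variable names are illustrative; they mirror the u/t chains in the generated test below):

```java
public class ReassocSketch {
    // Serial chain: each min depends on the previous one AND on the
    // loop-carried 'result', so the whole chain is latency-bound.
    static long serial(long result, long v0, long v1, long v2, long v3) {
        long u0 = Long.min(v0, result);
        long u1 = Long.min(v1, u0);
        long u2 = Long.min(v2, u1);
        long u3 = Long.min(v3, u2);
        return u3;
    }

    // Reassociated: the loaded values are combined independently of
    // 'result', which only joins at the end, so the loop-carried
    // dependence is a single min per iteration.
    static long reassociated(long result, long v0, long v1, long v2, long v3) {
        long t0 = Long.min(v0, v1);
        long t1 = Long.min(v2, t0);
        long t2 = Long.min(v3, t1);
        return Long.min(result, t2);
    }

    public static void main(String[] args) {
        long r = serial(7, 5, -3, 9, 1);
        // Long.min is associative and commutative, so both shapes agree.
        if (r != reassociated(7, 5, -3, 9, 1) || r != -3) {
            throw new AssertionError();
        }
        System.out.println(r);
    }
}
```

Both shapes compute the same value because min is associative and commutative, which is exactly what the generated test's `Verify.checkEQ(results[0], results[1])` asserts.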

* a typical loop would generate additional IR nodes. Hence, the test uses
* a custom while loop with non-inlined helper methods.
*/
public class TestReductionReassociationForAssociativeAdds {
Author


I'm unsure about the name for this test. As you can see, I originally used it to test all associative adds, which it no longer does, but I thought the name could be left as-is for future enhancements. We know we would want to do this for min/max on Integer, and I think additions for long/int might also work well.

Author

galderz commented Jan 30, 2026

To help review, this is what the generated IR test looks like (formatted after test execution):

package compiler.loopopts.templated;

// --- IMPORTS start ---
import compiler.lib.generators.*;
import compiler.lib.ir_framework.*;
import compiler.lib.verify.*;

// --- IMPORTS end   ---
public class ReductionReassociationForAssociativeAdds {
  // --- CLASS_HOOK insertions start ---
  // --- CLASS_HOOK insertions end   ---
  public static void main(String[] vmFlags) {
    TestFramework framework = new TestFramework(ReductionReassociationForAssociativeAdds.class);
    framework.addFlags(
        "-classpath",
        "/Users/g/src/jdk-reassoc-reduct-chain/JTwork/classes/compiler/loopopts/TestReductionReassociationForAssociativeAdds.d:/Users/g/src/jdk-reassoc-reduct-chain/test/hotspot/jtreg/compiler/loopopts:/Users/g/src/jdk-reassoc-reduct-chain/JTwork/classes/compiler/loopopts/TestReductionReassociationForAssociativeAdds.d/test/lib:/Users/g/src/jdk-reassoc-reduct-chain/test/lib:/Users/g/src/jdk-reassoc-reduct-chain/test/hotspot/jtreg:/Users/g/opt/jtreg/build/images/jtreg/lib/javatest.jar:/Users/g/opt/jtreg/build/images/jtreg/lib/jtreg.jar:/Users/g/src/jdk-reassoc-reduct-chain/JTwork/scratch/./compile-framework-classes-4709488473270129801");
    framework.addFlags(vmFlags);
    framework.start();
  }

  // --- LIST OF TESTS start ---
  // --- test_2 start ---

  private static long[] input_2 = new long[10000];

  static {
    Generators.G.fill(Generators.G.longs(), input_2);
  }

  private Object[] expected_2 = test_2();

  @Test
  @IR(
      counts = {IRNode.MIN_L, "= 4"},
      phase = CompilePhase.AFTER_LOOP_OPTS)
  public Object[] test_2() {
    long result = Long.MAX_VALUE;
    long result2 = Long.MAX_VALUE;
    int i = 0;
    while (i < input_2.length) {
      long v0 = getArray_dontinline_input_2(0, i);
      long v1 = getArray_dontinline_input_2(1, i);
      long v2 = getArray_dontinline_input_2(2, i);
      long v3 = getArray_dontinline_input_2(3, i);
      long u0 = Long.min(v0, result);
      long u1 = Long.min(v1, u0);
      long u2 = Long.min(v2, u1);
      long u3 = Long.min(v3, u2);
      long t0 = Long.min(v0, v1);
      long t1 = Long.min(v2, t0);
      long t2 = Long.min(v3, t1);
      long t3 = Long.min(result, t2);
      result = u3;
      result2 = t3;
      i = sum_dontinline_input_2(i, 4);
    }
    return asArray_dontinline_input_2(result, result2);
  }

  @Check(test = "test_2")
  public void check_2(Object[] results) {
    Verify.checkEQ(expected_2[0], results[0]);
    Verify.checkEQ(expected_2[1], results[1]);
    Verify.checkEQ(results[0], results[1]);
  }

  static long getArray_dontinline_input_2(int pos, int base) {
    return input_2[pos + base];
  }

  static Object[] asArray_dontinline_input_2(long result, long result2) {
    return new Object[] {result, result2};
  }

  static int sum_dontinline_input_2(int a, int b) {
    return a + b;
  }

  // --- test_2 end ---
  // --- test_18 start ---

  private static long[] input_18 = new long[10000];

  static {
    Generators.G.fill(Generators.G.longs(), input_18);
  }

  private Object[] expected_18 = test_18();

  @Test
  @IR(
      counts = {IRNode.MAX_L, "= 4"},
      phase = CompilePhase.AFTER_LOOP_OPTS)
  public Object[] test_18() {
    long result = Long.MIN_VALUE;
    long result2 = Long.MIN_VALUE;
    int i = 0;
    while (i < input_18.length) {
      long v0 = getArray_dontinline_input_18(0, i);
      long v1 = getArray_dontinline_input_18(1, i);
      long v2 = getArray_dontinline_input_18(2, i);
      long v3 = getArray_dontinline_input_18(3, i);
      long u0 = Long.max(v0, result);
      long u1 = Long.max(v1, u0);
      long u2 = Long.max(v2, u1);
      long u3 = Long.max(v3, u2);
      long t0 = Long.max(v0, v1);
      long t1 = Long.max(v2, t0);
      long t2 = Long.max(v3, t1);
      long t3 = Long.max(result, t2);
      result = u3;
      result2 = t3;
      i = sum_dontinline_input_18(i, 4);
    }
    return asArray_dontinline_input_18(result, result2);
  }

  @Check(test = "test_18")
  public void check_18(Object[] results) {
    Verify.checkEQ(expected_18[0], results[0]);
    Verify.checkEQ(expected_18[1], results[1]);
    Verify.checkEQ(results[0], results[1]);
  }

  static long getArray_dontinline_input_18(int pos, int base) {
    return input_18[pos + base];
  }

  static Object[] asArray_dontinline_input_18(long result, long result2) {
    return new Object[] {result, result2};
  }

  static int sum_dontinline_input_18(int a, int b) {
    return a + b;
  }

  // --- test_18 end ---
  // --- LIST OF TESTS end   ---
}

return progress;
}

static AddNode* build_min_max(int opcode, Node* a, Node* b, PhaseIdealLoop* phase) {
Owner


There's already a MinMaxNode::build_min_max_long()

Owner


But maybe it's not convenient to use here.

Author


Hmmmm, I could use it for sure. I would just need an opcode branch around it, but then I could just reuse what is_associative does below.

}
}

static bool is_associative(Node* node) {
Owner


A comment that says only long min and max are supported at this point?

Author


Yeah I can add a comment. The more I think about it, maybe it can be renamed to can_reassociate.

return node->Opcode() == Op_MinL || node->Opcode() == Op_MaxL;
}

static Node* reassociate_chain(int add_opcode, Node* node, PhiNode* phi, Node* loop_head, PhaseIdealLoop* phase) {
Owner


Rather than static methods, create a ReassociateAssociativeOperations class and make the static methods part of the class.

Author


Are there some guidelines on when to use static vs instance methods?

I initially considered making these instance methods of PhaseIdealLoop, which would have been fine for me, but I ended up with static methods to speed up development. Is making a ReassociateAssociativeOperations class with static methods preferable to this?

Owner


There are no guidelines that I'm aware of. One benefit of a new class might be that, rather than passing things around as arguments to methods, they can be fields of the class. For instance, phase can be a field, so there's no need to pass it around. Same goes for loop_head, I suppose. And all the code that's specific to a particular optimization and unlikely to be needed elsewhere is clearly grouped together.

Owner


Maybe the new class is overkill here but I went that way for some patch I'm working on and I found that it worked fairly well.

Node* loop_head_use = loop_head->fast_out(i);
if (loop_head_use->is_Phi()) {
PhiNode* phi = loop_head_use->as_Phi();
Unique_Node_List wq;
Owner


I think you should be able to have this work without the extra Unique_Node_List. If you don't want to process new Phis in the loop, you can capture uint unique = Compile::unique() before entering the loop and then in the loop test phi->_idx < unique.
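The suggested trick can be illustrated with a toy Java analogue: every node carries a monotonically increasing index, so snapshotting the counter before a mutation loop lets you skip nodes created during that loop without keeping a separate worklist. All names here are illustrative, not HotSpot's actual API:

```java
import java.util.ArrayList;
import java.util.List;

public class UniqueSnapshot {
    static int nextIdx = 0;  // stands in for Compile::unique()

    static class Node {
        final int idx = nextIdx++;  // stands in for Node::_idx
    }

    public static void main(String[] args) {
        List<Node> nodes = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            nodes.add(new Node());
        }

        int unique = nextIdx;  // snapshot before entering the loop
        int processed = 0;
        for (int i = 0; i < nodes.size(); i++) {
            Node n = nodes.get(i);
            if (n.idx >= unique) {
                continue;  // created during this loop: skip it
            }
            processed++;
            if (processed == 1) {
                nodes.add(new Node());  // transformation creates a new node
            }
        }
        System.out.println(processed);  // only the 3 pre-existing nodes
    }
}
```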

break;
}

Node* use = nullptr;
Owner


current has only one use, so you can use current->unique_out() instead of the loop.

Owner


This said, wouldn't it be ok if there were multiple uses, as long as only one of them is in the loop?

Author


In theory, yeah, but we agreed to limit this as much as possible to avoid breaking other parts. The primary use of this would be loops unrolled by HotSpot itself (as opposed to user-unrolled ones).

Owner


Ok.

Node* loop_head = loop->head();
ReassociateReductionChains rrc(loop, this);

for (DUIterator_Fast imax, i = loop_head->fast_outs(imax); i < imax; i++) {
Owner


Can this be a method of ReassociateReductionChains?

}

bool transform(Node* n, PhiNode* phi) {
bool is_associative = n->Opcode() == Op_MinL || n->Opcode() == Op_MaxL;
Owner


I would put this in its own method
