Description
While running indexgather at scale on some larger Slingshot-11 machines I hit verification failures that turned out to be from failed allocations during exstack_init().
I believe the cause is the THREADS*THREADS multiplication overflowing in the following allocation:
bale/src/bale_classic/exstack/exstack.upc
Line 86 in 1b8f673
XStk->wait_done = lgp_all_alloc(THREADS*THREADS,sizeof(int64_t));
For 2**16 PEs this is 2**16 * 2**16 = 2**32, which overflows a 32-bit int. I think signed overflow is UB, but in my runs the product wrapped to zero and the code attempted a zero-sized allocation. I'd expect any PE count above 46,340 (the floor of sqrt(2**31-1)) to trigger this behavior.
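To make the arithmetic concrete, here is a minimal C sketch of the wraparound and of a widened multiply. The helper names (low32_of_square, square_64, fits_in_int32) are hypothetical and not part of bale. The wrapped value is computed in unsigned 64-bit arithmetic so the sketch itself does not rely on UB. It only models what I saw in my runs, not what every compiler must do.

```c
#include <assert.h>
#include <stdint.h>

/* Low 32 bits of threads*threads: the value a wrapping 32-bit multiply
   would typically leave behind. Computed in 64-bit unsigned arithmetic,
   so this helper has no undefined behavior. (Hypothetical helper.) */
static uint32_t low32_of_square(int threads) {
    assert(threads > 0);
    uint64_t wide = (uint64_t)threads * (uint64_t)threads;
    return (uint32_t)(wide & 0xFFFFFFFFu);
}

/* The same product with both operands widened before the multiply,
   so the result cannot overflow for any positive 32-bit PE count. */
static int64_t square_64(int threads) {
    assert(threads > 0);
    return (int64_t)threads * (int64_t)threads;
}

/* True if v is representable as a 32-bit signed int. */
static int fits_in_int32(int64_t v) {
    return v >= INT32_MIN && v <= INT32_MAX;
}
```

With these helpers, low32_of_square(65536) is 0 (matching the zero-sized allocation), square_64(46340) still fits in an int, and square_64(46341) does not. A fix along the lines of casting one operand, e.g. (int64_t)THREADS*THREADS, would presumably work, assuming the allocator's count parameter is itself wider than int; I have not checked that for every lgp_all_alloc backend.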
Looking through the code, the only other obvious overflow I saw was:
bale/src/bale_classic/exstack/exstack.upc
Line 112 in 1b8f673
lgp_put_int64(Xstk->wait_done, MYTHREAD*THREADS + Xstk->put_order[i], 1L);
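The index expression has the same problem because MYTHREAD*THREADS is evaluated in int before the addition. As a rough sketch, the flattened index could be computed in 64 bits like this; wait_done_index is a hypothetical helper, and I am assuming the index parameter of lgp_put_int64 is already a 64-bit type, so that only the multiply needs widening.

```c
#include <assert.h>
#include <stdint.h>

/* Flattened index into a threads-by-threads array: row `my_thread`,
   column `target`. The row offset is widened to 64 bits before the
   multiply so it cannot overflow int. (Hypothetical helper, not bale code.) */
static int64_t wait_done_index(int my_thread, int threads, int64_t target) {
    assert(threads > 0);
    assert(my_thread >= 0 && my_thread < threads);
    assert(target >= 0 && target < threads);
    return (int64_t)my_thread * (int64_t)threads + target;
}
```

For the last PE at 2**16 PEs, wait_done_index(65535, 65536, 65535) gives 2**32 - 1, which is the largest index the THREADS*THREADS array needs and is well outside the range of a 32-bit int.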
I wanted to check if this issue was surprising and whether anybody else has run exstack at this scale. I think 2**16 PEs is 128G of aggregator buffers per node, so it may just be beyond the scale exstack was designed for.