|
| 1 | +> When the algorithm cannot go any faster, you exploit the hardware |
| 2 | +
|
| 3 | +## Multiple threads |
| 4 | + |
| 5 | +Consider this scenario: you have two variables (say, two `long long`s) and you perform a long-running task on each of them. To speed things up, you run the tasks on two threads, hoping the work finishes in half the time. |
| 6 | + |
| 7 | +```cpp |
| 8 | +long long x = 0; |
| 9 | +long long y = 0; |
| 10 | + |
| 11 | +void increment(long long& a) { |
| 12 | + for (int i=0; i<100'000'000; i++) { |
| 13 | + a++; |
| 14 | + } |
| 15 | +} |
| 16 | +``` |
| 17 | +
|
| 18 | +Now measure the time taken when `increment` is invoked on `x` and `y` on separate threads (this needs `<chrono>`, `<iostream>`, and `<thread>`). |
| 19 | +
|
| 20 | +```cpp |
| 21 | +int main() { |
| 22 | + auto start = std::chrono::high_resolution_clock::now(); |
| 23 | +
|
| 24 | + std::thread t1([&](){ increment(x); }); |
| 25 | + std::thread t2([&](){ increment(y); }); |
| 26 | + t1.join(); |
| 27 | + t2.join(); |
| 28 | +
|
| 29 | + auto end = std::chrono::high_resolution_clock::now(); |
| 30 | + std::cout << "time: " << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << " ms\n"; |
| 31 | +
|
| 32 | + return 0; |
| 33 | +} |
| 34 | +``` |
| 35 | + |
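The measurement pattern above can be factored into a small helper so the same timing code is reused for every variant. This is only a sketch: the name `time_two_threads_ms` is hypothetical, and it uses `std::chrono::steady_clock` rather than `high_resolution_clock`, since a steady clock is guaranteed to be monotonic.

```cpp
#include <chrono>
#include <thread>

// Hypothetical helper (not part of the original program): runs two
// callables on separate threads and returns the wall-clock time, in
// milliseconds, until both have finished.
template <typename F, typename G>
long long time_two_threads_ms(F f, G g) {
    auto start = std::chrono::steady_clock::now();

    std::thread t1(f);
    std::thread t2(g);
    t1.join();
    t2.join();

    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
}
```

It could be called as `time_two_threads_ms([&]{ increment(x); }, [&]{ increment(y); })`. Note that at higher optimization levels the compiler may collapse the counting loop into a single addition, so absolute timings depend heavily on build flags.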
| 36 | +Run the program and note the time taken. On my machine this turned out to be close to `500ms`. |
| 37 | +Another metric we can use here is IPC (instructions per cycle), which you can get using `perf`: |
| 38 | +```shell |
| 39 | +$ perf stat ./a.out |
| 40 | +time: 503 ms |
| 41 | + |
| 42 | + Performance counter stats for './a.out': |
| 43 | + |
| 44 | + 893.16 msec task-clock:u # 1.760 CPUs utilized |
| 45 | + 0 context-switches:u # 0.000 /sec |
| 46 | + 0 cpu-migrations:u # 0.000 /sec |
| 47 | + 139 page-faults:u # 155.627 /sec |
| 48 | + 1,602,565,788 instructions:u # 0.67 insn per cycle |
| 49 | + 2,385,620,324 cycles:u # 2.671 GHz |
| 50 | + 200,447,733 branches:u # 224.424 M/sec |
| 51 | + 14,482 branch-misses:u # 0.01% of all branches |
| 52 | + TopdownL1 # 68.2 % tma_backend_bound |
| 53 | + # 11.9 % tma_bad_speculation |
| 54 | + # 4.3 % tma_frontend_bound |
| 55 | + # 15.6 % tma_retiring |
| 56 | + |
| 57 | + 0.507340864 seconds time elapsed |
| 58 | + |
| 59 | + 0.892937000 seconds user |
| 60 | + 0.000000000 seconds sys |
| 61 | +``` |
| 62 | + |
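| 63 | +`perf` reports `0.67 insn per cycle`: on average, fewer than one instruction is retired per cycle. Keep this number in mind for the comparison below. |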
| 64 | + |
| 65 | +## Struct instead of long long |
| 66 | + |
| 67 | +Now let us wrap each counter in a struct padded to 64 bytes and use it instead of the plain long longs from earlier: |
| 68 | + |
| 69 | +```cpp |
| 70 | +struct PaddedStruct { |
| 71 | + long long value; |
| 72 | + char pad[64 - sizeof(long long)]; |
| 73 | +}; |
| 74 | + |
| 75 | +PaddedStruct pa = {}; |
| 76 | +PaddedStruct pb = {}; |
| 77 | +``` |
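The `64` in the padding presumably targets the cache line size, although the text does not say so explicitly. Under that assumption, a minimal sketch for checking whether two objects fall into the same 64-byte block looks like this. `same_cache_line` and `kAssumedLineSize` are hypothetical names, and the line size is hard-coded rather than queried from the CPU.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines, matching the padding used above.
constexpr std::size_t kAssumedLineSize = 64;

// Hypothetical helper: returns true if both addresses lie in the same
// aligned 64-byte block, i.e. would share a cache line under the
// assumption above.
inline bool same_cache_line(const void* p, const void* q) {
    auto a = reinterpret_cast<std::uintptr_t>(p);
    auto b = reinterpret_cast<std::uintptr_t>(q);
    return a / kAssumedLineSize == b / kAssumedLineSize;
}
```

Calling `same_cache_line(&x, &y)` will often return true when the two globals are placed next to each other, though the layout is up to the compiler and linker. `same_cache_line(&pa.value, &pb.value)` always returns false, because each `PaddedStruct` is 64 bytes long with `value` at offset 0, so the two `value` fields are at least 64 bytes apart.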
| 78 | +
|
| 79 | +Overload the `increment` function defined earlier to handle this struct as well: |
| 80 | +```cpp |
| 81 | +void increment(PaddedStruct& a) { |
| 82 | + for (int i=0; i<100'000'000; i++) { |
| 83 | + a.value++; |
| 84 | + } |
| 85 | +} |
| 86 | +``` |
| 87 | + |
| 88 | +Now invoke the function on two threads, as we did earlier: |
| 89 | +```cpp |
| 90 | +std::thread t1([&](){ increment(pa); }); |
| 91 | +std::thread t2([&](){ increment(pb); }); |
| 92 | +``` |
| 93 | +
|
| 94 | +This time the run is noticeably faster. On my machine it took about `300ms`, roughly 40% less than the `500ms` observed earlier. |
| 95 | +Again, we can get the IPC using `perf`: |
| 96 | +```shell |
| 97 | +$ perf stat ./a.out |
| 98 | +time: 297 ms |
| 99 | + |
| 100 | + Performance counter stats for './a.out': |
| 101 | + |
| 102 | + 594.66 msec task-clock:u # 1.975 CPUs utilized |
| 103 | + 0 context-switches:u # 0.000 /sec |
| 104 | + 0 cpu-migrations:u # 0.000 /sec |
| 105 | + 138 page-faults:u # 232.066 /sec |
| 106 | + 1,602,565,643 instructions:u # 1.06 insn per cycle |
| 107 | + 1,508,069,432 cycles:u # 2.536 GHz |
| 108 | + 200,447,663 branches:u # 337.080 M/sec |
| 109 | + 14,506 branch-misses:u # 0.01% of all branches |
| 110 | + TopdownL1 # 71.2 % tma_backend_bound |
| 111 | + # 1.5 % tma_bad_speculation |
| 112 | + # 2.8 % tma_frontend_bound |
| 113 | + # 24.6 % tma_retiring |
| 114 | + |
| 115 | + 0.301146813 seconds time elapsed |
| 116 | + |
| 117 | + 0.594276000 seconds user |
| 118 | + 0.000000000 seconds sys |
| 119 | +``` |
| 120 | +
|
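| 121 | +The IPC is now 1.06, roughly 1.6 times the 0.67 we saw with the plain long longs. The instruction count is essentially unchanged (about 1.6 billion), so the speedup comes entirely from spending fewer cycles: about 1.51 billion instead of 2.39 billion. |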
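To make the comparison concrete, the IPC figures can be recomputed from the raw counters in the two `perf stat` outputs. The sketch below only does that arithmetic; the function and constant names are hypothetical.

```cpp
#include <cstdint>

// Instructions per cycle from two raw perf counters.
constexpr double ipc(std::uint64_t instructions, std::uint64_t cycles) {
    return static_cast<double>(instructions) / static_cast<double>(cycles);
}

// Counter values copied from the two perf runs above.
constexpr double kIpcPlain  = ipc(1'602'565'788, 2'385'620'324);  // long long run
constexpr double kIpcPadded = ipc(1'602'565'643, 1'508'069'432);  // PaddedStruct run

// How much more work per cycle the padded version achieves.
constexpr double kIpcRatio = kIpcPadded / kIpcPlain;
```

Here `kIpcPlain` comes out at about 0.67, `kIpcPadded` at about 1.06, and `kIpcRatio` at about 1.58. Since both runs execute practically the same number of instructions, this ratio is also the ratio of the cycle counts.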
| 122 | +
|
| 123 | +
|
| 124 | +
|