TaskProf/index.html at master · rutgers-apl/TaskProf · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
<!DOCTYPE html>
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<title>TaskProf: A Fast Causal Profiler for Task Parallel Programs</title>

<style>
body {
    background:LightGoldenRodYellow ;
}

.topl {
position: absolute;
top:0;
    width: 20%;
    height: 120px;
background:white ;
position: fixed;
overflow-x: hidden;
}

.topl a {
    text-decoration: none;
    font-size: 30px;
    color: DarkSlateGray ;
    display: block;
}

.container {
position:absolute;
top:120px;
   width: 20%;
    height: 100%;
   background:white;
position: fixed;
overflow-x: hidden;
}

.container a {
    text-decoration: none;
    font-size: 20px;
    color: #818181;
    display: block;
}

.main {
position: absolute;
margin-left: 300px;
top:0;
    height: 100%;
    width: 70%
}

</style>
<!-- <link href="css/styles.css" rel="stylesheet"> -->
</head>

<body>

<div class="topl">
<a href="#top">TaskProf Home</a>
</div>

<div class="container">
  <a href="#TaskProf">1. TaskProf Installation and Usage</a>
  <a href="#Prerequisite">&nbsp &nbsp 1.1. Prerequisite</a>
  <a href="#Installation">&nbsp &nbsp 1.2. Installation</a>
  <a href="#Usage">&nbsp &nbsp 1.3. Usage</a>
  <br>
  <a href="#Evaluation">2. TaskProf Evaluation</a>
  <a href="#Effectiveness">&nbsp &nbsp 2.1. Effectiveness</a>
  <a href="#Performance">&nbsp &nbsp 2.2. Performance</a>
  <br>
  <a href="#remmachine">3. Artifact on Remote Machine</a>
  <a href="#reminstall">&nbsp &nbsp 3.1. Installation</a>
  <a href="#remusage">&nbsp &nbsp 3.2. Usage</a>
  <a href="#remevaluation">&nbsp &nbsp 3.3. Evaluation</a>
</div>

<!-- <div style="width:800px"> -->
<div class="main">
<br><br>
<center><font style="font-size:32px; font-family: Verdana;">TaskProf: A Fast Causal Profiler for Task Parallel Programs</font></center>
<br>

<!-- <br><br> -->
<h2>Artifact Overview</font></h2>
<p align="justify">We present TaskProf, a profiling tool for task parallel programs that use <a href="https://www.threadingbuildingblocks.org">Intel Threading Building Blocks(TBB)</a> C++ library. TaskProf assists developers in identifying parallelism bottlenecks in TBB programs. TaskProf provides an estimate of the increase in parallelism of a program on optimizing a region of code. This allows developers to focus optimization efforts of regions of code that matter in increasing the parallelism of the program.</p>
<p align="justify">The artifact contains our tool and the TBB applications that we used to evaluate it. We provide detailed instructions on how to build and use TaskProf. Using TaskProf, we were able to design parallelization techniques for five TBB applications. We describe how TaskProf assisted us in identifying the parallelization opportunities in the five applications.</p>

<h2><a name="TaskProf"/>1. TaskProf Installation and Usage</h2>
<p align="justify">TaskProf and the modified TBB library can be downloaded from <a href="https://github.com/rutgers-apl/TaskProf">GitHub</a>. The artifact is structured as follows. (a) <b>ptprof_lib</b> contains the TaskProf implementation, (b) <b>tprof-tbb-lib</b> contains the modified TBB library, and (c) <b>tests</b> contain two test programs to demonstrate TaskProf's usage. In addition, we provide all the benchmark applications that we used to evaluate TaskProf. The links to download the applications are provided in Section <a href="#Evaluation">2</a>.</p>

<p align="justify"><b>Since TaskProf requires a machine with hardware performance counters, we provide access to a remote machine that has hardware performance counters. Refer to Section <a href="#remmachine">3</a> for instructions on running the artifact on the remote machine.</b></p>

<h3><a name="Prerequisite"/>1.1. Prerequisite</h3>
<p align="justify">TaskProf relies on hardware support for performance counters. It must be executed on a machine that has performance counters enabled. In order to access the performance counters, TaskProf must be executed directly on the machine and not through a VM. Currently available VMs cannot access the host machine's performance counters. The build script that we provide for installation, checks if the machine supports hardware performance counters. Alternatively, one can check using the command,</p>
&nbsp;<code> > dmesg | grep PMU</code>
<p align="justify">If the output is "Performance Events: Unsupported...", then the machine does not support performance counters and TaskProf cannot be executed on the machine.</p>
<p align="justify"> TaskProf uses the <i>perf_events</i> API in Linux to read the performance counters. It must be executed on a linux machine with support for the <i>perf_events</i> API. Most recent versions of Linux support the <i>perf_events</i> API.We have tested TaskProf on Ubuntu 14.04 and 16.04.</p>

<h3><a name="Installation"/>1.2. Installation</h3>
<p align="justify">Install <i>perf</i> using,</p>
&nbsp;<code> > sudo apt-get install linux-tools-common linux-tools-generic linux-tools-`uname -r`</code>
<p align="justify">We provide a bash script to automate the installation of TaskProf and the TBB library. We use &lt;ART_ROOT&gt; to refer to the base directory of our artifact. To build TaskProf,</p>
&nbsp;<code> > cd &lt;ART_ROOT&gt;</code>
<br>
&nbsp;<code> > source build.sh</code>
<p align="justify">A successful build will create the shared libraries for TaskProf and TBB and also setup the appropriate environment variables. Note, the bash script has to be sourced at the command-line. The installation will fail if it is run as an executable (i.e, with ./).</p>

<h3><a name="Usage"/>1.3. Usage</h3>
<p align="justify">We demonstrate how to use TaskProf on the <i>tree sum</i> test program under the <b>tests</b> directory.</p>
<p align="justify">TaskProf generates the profile for the program in two phases. In the first phase TaskProf executes in parallel along with the program and generates the profile data for the program. In the next phase, TaskProf reads the profile data and generates the parallelism profile for the program similar to as shown in Figure 3(a) in our paper. To execute TaskProf on the <i>tree sum</i> test program,</p>
&nbsp;<code> > cd &lt;ART_ROOT&gt;/tests/tree_sum</code>
<br>
&nbsp;<code> > make</code>
<br>
&nbsp;<code> > ./tree_sum</code>
<p align="justify">This generates the profile data from the execution of the program. Next execute TaskProf's post-processing program. Note, TP_GENPROF is an environment variable already setup by the build script.</p>

&nbsp;<code> > $TP_GENPROF/gentprof</code>

<p align="justify">This generates the parallelism profile for <i>treesum</i> in the file <b>ws_profile.csv</b>. The first row of the profile describes what each column represents. The second row specifies the parallelism of the entire program and the percentage of work performed on the critical path by the main function. The rest of the rows specify the parallelism and the critical path percentage for each static spawn site in the program.</p>

<p align="justify">The parallelism profile for <i>treesum</i> shows that the program has very low parallelism (2.5~4.5) and the majority of the work on the critical path (99%) is performed by the main function (or any function called from main before the first task is spawned or after the last task completes). On examining the code, we find that the main function (in main.cpp) calls <i>create_and_time</i> (in TreeMaker.h) where half of the nodes is passed to <i>do_all_1</i> and the other half is passed to <i>do_all_2</i>. On examining both the functions we find that <i>do_all_1</i> constructs the tree serially whereas <i>do_all_2</i> constructs it in parallel.</p>
<p align="justify">We can use TaskProf's Causal Profiler to obtain an estimate of the increase in the parallelism of the program on optimizing <i>do_all_1</i>. To generate the causal profile annotate the region of code around the call to <i>do_all_1</i> with __OPTIMIZE__BEGIN__ and __OPTIMIZE__END__ (already added). Now compile and execute TaskProf on the <i>treesum</i> program as shown above. The post-processing program, gentprof, will now generate a file named <b>region_all.csv</b> along with the file <b>ws_profile.csv</b>. The <b>region_all.csv</b> file reports the parallelism for the whole program when the parallelism of the annotated region is optimized by 50X, 100X, 200X and 400X. The causal profile shows that the parallelism of the program increases if the annotated region is optimized.</p>

<h2><a name="Evaluation"/>2. TaskProf Evaluation</h2>

<p align="justify">We propose to evaluate our artifact by demonstrating the effectivness of TaskProf in finding parallelism bottlenecks in TBB programs while profiling them in parallel.</p>

<h3><a name="Effectiveness"/>2.1. Effectiveness (RQ1 in the paper)</h3>

<p align="justify">Using TaskProf we were able to optimize five applications. These applications and their optimized versions can be downloaded from <a href="https://drive.google.com/file/d/0B3345eAvaoMKSG5xMVdnY0dseU0/view?usp=sharing">here</a>. To unpack the applications, </p>
&nbsp;<code> > cd &lt;ART_ROOT&gt;</code>
<br>
&nbsp;<code> > tar -xvf &lt;path_to_opt_benchmarks.tar.gz&gt;</code>

<p align="justify">Next, we describe how to generate the effectiveness results for the <i>minimum spanning forest</i> application (Figure 5.I in the paper). The instructions to reproduce the results for the other four applications follow similarly. Alternatively, we provide a python script run_bmarks_opt.py to execute all the applications and their optimized versions. To execute the script,</p>
&nbsp;<code> > cd &lt;ART_ROOT&gt;/benchmarks </code><br>
&nbsp;<code> > nohup python run_bmarks_opt.py > report_opt.txt & </code>
<p align="justify">This script reproduces the results in Figure 5 in the paper. The original parallelism profile (Figure (a) for each application) and the causal profile (Figure (b)) can in found in the <b>ws_profile.csv</b> and <b>region_all.csv</b> files respectively in each application directory. The optimized parallelism profile (Figure (c)) can be found in the <b>ws_profile.csv</b> file in the directory containing the optimized version of each application (directory name ending with _opt).</p>

<h4>Minimum Spanning Forest</h4>

&nbsp;<code> > cd &lt;ART_ROOT&gt;/benchmarks/minSpanningForest/parallelKruskal</code>

<p align="justify">Compile the application with TaskProf by running make.</p>
&nbsp;<code> > make </code>

<p align="justify">To profile the application with TaskProf execute the run.sh shell script.</p>
&nbsp;<code> > sh run.sh </code>

<p align="justify">This generates the parallelism profile (figure 5.I.a in the paper) in the <b>ws_profile.csv</b> file. In addition to the parallelism profile, it also generates the causal profile (figure 5.I.b in the paper) in <b>region_all.csv</b> for the two regions that were identified for optimization. </p>

<p align="justify">The optimized version of the <i>minimum spanning forest</i> application is at the location,</p>

&nbsp;<code> > cd &lt;ART_ROOT&gt;/benchmarks/minSpanningForest_opt/parallelKruskal</code>

<p align="justify">To generate the parallelism profile (figure 5.I.c) in the <b>ws_profile.csv</b> file for the optimized version,</p>
&nbsp;<code> > make </code>
<br>
&nbsp;<code> > sh run.sh </code>

<h4>Convex Hull</h4>

<p align="justify">To generate the parallelism profile (figure 5.II.a) and causal profile (figure 5.II.b) ,</p>

&nbsp;<code> > cd &lt;ART_ROOT&gt;/benchmarks/convexHull/quickHull</code>
<br>
&nbsp;<code> > make </code>
<br>
&nbsp;<code> > sh run.sh </code>

<p align="justify">To generate the parallelism profile (figure 5.II.c) for the optimized version,</p>

&nbsp;<code> > cd &lt;ART_ROOT&gt;/benchmarks/convexHull_opt/quickHull</code>
<br>
&nbsp;<code> > make </code>
<br>
&nbsp;<code> > sh run.sh </code>

<h4>Delaunay Triangulation</h4>

<p align="justify">To generate the parallelism profile (figure 5.III.a) and causal profile (figure 5.III.b) ,</p>

&nbsp;<code> > cd &lt;ART_ROOT&gt;/benchmarks/delaunayTriangulation/incrementalDelaunay</code>
<br>
&nbsp;<code> > make </code>
<br>
&nbsp;<code> > sh run.sh </code>

<p align="justify">To generate the parallelism profile (figure 5.III.c) for the optimized version,</p>

&nbsp;<code> > cd &lt;ART_ROOT&gt;/benchmarks/delaunayTriangulation_opt/incrementalDelaunay</code>
<br>
&nbsp;<code> > make </code>
<br>
&nbsp;<code> > sh run.sh </code>

<h4>Delaunay Refinement</h4>

<p align="justify">To generate the parallelism profile (figure 5.IV.a) and causal profile (figure 5.IV.b) ,</p>

&nbsp;<code> > cd &lt;ART_ROOT&gt;/benchmarks/delaunayRefine/incrementalRefine</code>
<br>
&nbsp;<code> > make </code>
<br>
&nbsp;<code> > sh run.sh </code>

<p align="justify">To generate the parallelism profile (figure 5.IV.c) for the optimized version,</p>

&nbsp;<code> > cd &lt;ART_ROOT&gt;/benchmarks/delaunayRefine_opt/incrementalRefine</code>
<br>
&nbsp;<code> > make </code>
<br>
&nbsp;<code> > sh run.sh </code>

<h4>Blackscholes</h4>

<p align="justify">To generate the parallelism profile (figure 5.V.a) and causal profile (figure 5.V.b) ,</p>

&nbsp;<code> > cd &lt;ART_ROOT&gt;/benchmarks/blackscholes</code>
<br>
&nbsp;<code> > make </code>
<br>
&nbsp;<code> > sh run.sh </code>

<p align="justify">To generate the parallelism profile (figure 5.V.c) for the optimized version,</p>

&nbsp;<code> > cd &lt;ART_ROOT&gt;/benchmarks/blackscholes_opt</code>
<br>
&nbsp;<code> > make </code>
<br>
&nbsp;<code> > sh run.sh </code>

<h3><a name="Performance"/>2.2. Performance (RQ2 in the paper)</h3>

<p align="justify">In order to reproduce the speedup graph of parallel execution of TaskProf over its serial execution (Figure 6 in the paper), download the rest of the benchmarks from <a href="https://drive.google.com/file/d/0B3345eAvaoMKV09uXzRXc3pPeFk/view?usp=sharing">here</a>.</p>

&nbsp;<code> > cd &lt;ART_ROOT&gt;</code>
<br>
&nbsp;<code> > tar -xvf &lt;path_to_all_benchmarks.tar.gz&gt;</code>

<p align="justify">Our artifact uses <i>jgraph</i>, a postscript graphing tool, to generate the speedup graph. The <i>jgraph</i> tool is available as a package for installing on Ubuntu. To install <i>jgraph</i> on Ubuntu</p>

&nbsp;<code> > sudo apt-get install jgraph</code>

<p align="justify">To convert the postscript graph generated by <i>jgraph</i> to a pdf we use <i>epstopdf</i>. To install <i>epstopdf</i> on Ubuntu,</p>

&nbsp;<code> > sudo apt-get install texlive-font-utils</code>

<p align="justify">We provide a python script that executes TaskProf on all the benchmarks and generates the speedup graph as output. Since the benchmark applications are long running we suggest using the nohup command to run the script. To run the benchmarks,</p>

&nbsp;<code> > cd &lt;ART_ROOT&gt;/benchmarks</code>
<br>
&nbsp;<code> > nohup python run_bmarks_speedup.py > report_speedup.txt & </code>
<p align="justify">The script generates a bar graph similar to Figure 6 in the paper in the file named <b>Speedup_graph.pdf</b>.</p>

<h2><a name="remmachine"/>3. TaskProf Installation and Usage on Remote Machine</h2>
<p align="justify">Login to the remote machine with the provided user credentials. TaskProf and the accompanying benchmarks are at the location <b>"/freespace/local/TaskProf"</b></p>

<h3><a name="reminstall"/>3.1. Installation</h3>
&nbsp;<code> > cd /freespace/local/TaskProf</code>
<br>
&nbsp;<code> > source build.sh</code>
<p align="justify">Note, the bash script has to be sourced at the command-line. The installation will fail if it is run as an executable (i.e, with ./).</p>

<h3><a name="remusage"/>3.2. Usage</h3>
<p align="justify">We demonstrate how to use TaskProf on the <i>tree sum</i> test program under the <b>tests</b> directory.</p>
<p align="justify">TaskProf generates the profile for the program in two phases. In the first phase TaskProf executes in parallel along with the program and generates the profile data for the program. In the next phase, TaskProf reads the profile data and generates the parallelism profile for the program similar to as shown in Figure 3(a) in our paper. To execute TaskProf on the <i>tree sum</i> test program,</p>
&nbsp;<code> > cd /freespace/local/TaskProf/tests/tree_sum</code>
<br>
&nbsp;<code> > make</code>
<br>
&nbsp;<code> > ./tree_sum</code>
<p align="justify">This generates the profile data from the execution of the program. Next execute TaskProf's post-processing program. Note, TP_GENPROF is an environment variable already setup by the build script.</p>

&nbsp;<code> > $TP_GENPROF/gentprof</code>

<p align="justify">This generates the parallelism profile for <i>treesum</i> in the file <b>ws_profile.csv</b>. The first row of the profile describes what each column represents. The second row specifies the parallelism of the entire program and the percentage of work performed on the critical path by the main function. The rest of the rows specify the parallelism and the critical path percentage for each static spawn site in the program.</p>

<p align="justify">The parallelism profile for <i>treesum</i> shows that the program has very low parallelism (2.5~4.5) and the majority of the work on the critical path (99%) is performed by the main function (or any function called from main before the first task is spawned or after the last task completes). On examining the code, we find that the main function (in main.cpp) calls <i>create_and_time</i> (in TreeMaker.h) where half of the nodes is passed to <i>do_all_1</i> and the other half is passed to <i>do_all_2</i>. On examining both the functions we find that <i>do_all_1</i> constructs the tree serially whereas <i>do_all_2</i> constructs it in parallel.</p>
<p align="justify">We can use TaskProf's Causal Profiler to obtain an estimate of the increase in the parallelism of the program on optimizing <i>do_all_1</i>. To generate the causal profile annotate the region of code around the call to <i>do_all_1</i> with __OPTIMIZE__BEGIN__ and __OPTIMIZE__END__ (already added). Now compile and execute TaskProf on the <i>treesum</i> program as shown above. The post-processing program, gentprof, will now generate a file named <b>region_all.csv</b> along with the file <b>ws_profile.csv</b>. The <b>region_all.csv</b> file reports the parallelism for the whole program when the parallelism of the annotated region is optimized by 50X, 100X, 200X and 400X. The causal profile shows that the parallelism of the program increases if the annotated region is optimized.</p>

<h3><a name="remevaluation"/>3.3. Evaluation</h3>
<p align="justify">Using TaskProf we were able to optimize five applications. They are <i>minSpanningForest, convexHull, delaunayTriangulation, delaunayRefine and blackscholes</i>. We provide a python script run_bmarks_opt.py to execute all the five applications and their optimized versions. To execute the script,</p>
&nbsp;<code> > cd /freespace/local/TaskProf/benchmarks </code><br>
&nbsp;<code> > nohup python run_bmarks_opt.py > report_opt.txt &</code>
<p align="justify">This script reproduces the results in Figure 5 in the paper. The original parallelism profile (Figure (a) for each application) and the causal profile (Figure (b)) can in found in the <b>ws_profile.csv</b> and <b>region_all.csv</b> files respectively in each application directory. The directory and sub-directory names for the applications are <i>"minSpanningForest/parallelKruskal"</i>, <i>"convexHull/quickHull"</i>, <i>"delaunayTriangulation/incrementalDelaunay"</i>, <i>"delaunayRefine/incrementalRefine"</i> and <i>"blackscholes"</i>. The optimized parallelism profile (Figure (c)) can be found in the <b>ws_profile.csv</b> file in the directory containing the optimized version of each application (directory names ending with _opt, with same sub-directory names as above). For executing each application individually, the steps are similar to the ones detailed in Section <a href="#Effectiveness">2.1</a> </p>

<p align="justify">In order to reproduce the speedup graph of parallel execution of TaskProf over its serial execution (Figure 6 in the paper), we provide a python script that executes TaskProf on all the benchmark applications and generates the speedup graph as output. To execute the script,</p>

&nbsp;<code> > cd /freespace/local/TaskProf/benchmarks</code>
<br>
&nbsp;<code> > nohup python run_bmarks_speedup.py > report_speedup.txt & </code>
<p align="justify">The script generates a bar graph similar to Figure 6 in the paper in the file named <b>Speedup_graph.pdf</b>.</p>


<br><br><br>
</div>
</body>