1- # Description
1+ # BinKit 2.0
2+
23BinKit is a binary code similarity analysis (BCSA) benchmark. BinKit provides
34scripts for building a cross-compiling environment, as well as the compiled
4- dataset. The original dataset includes 1,352 distinct combinations of compiler
5- options of 8 architectures, 5 optimization levels, and 13 compilers. We
6- currently tested this code in Ubuntu 16.04.
5+ dataset. The current dataset includes 1,904 distinct combinations of compiler
6+ options of 8 architectures, 6 optimization levels, and 23 compilers. It includes
7+ 371,928 binaries.
8+
9+ Compared to the paper version (BinKit 1.0), the latest version of BinKit
10+ brings two main improvements: support for newer compiler versions for the
11+ major compilation options, and support for the Ofast optimization
12+ option.
13+
14+ In particular, BinKit now includes GCC versions up to 11 and Clang versions up
15+ to 13. A total of 6 optimization options (O0, O1, O2, O3, Os, Ofast) are
16+ currently supported. See the [Currently supported compile
17+ options](https://github.com/SoftSec-KAIST/BinKit#currently-supported-compile-options)
18+ section below for more detailed options.
19+
20+ In the BinKit 2.0 dataset, the gsl package is missing 8 binaries for the Ofast
21+ option due to compiler bugs. See the [Missing binaries](https://github.com/SoftSec-KAIST/BinKit#Missing-binaries)
22+ part of the [Issues](https://github.com/topcue/tmp#issues) section for more
23+ information.
24+
25+ ## BinKit 1.0 (paper version)
26+ The original dataset includes 1,352 distinct combinations of compiler options of
27+ 8 architectures, 5 optimization levels, and 13 compilers. It includes 243,128
28+ binaries. We tested this code in Ubuntu 16.04.
729
830For more details, please check [our
931paper](https://0xdkay.me/pub/2020/kim-arxiv2020.pdf).
@@ -19,7 +41,13 @@ You can download our dataset and toolchain as below. The link will be changed to
1941[//]: # (Cloning this repository also downloads below pre-compiled dataset and toolchain
2042with `git-lfs`. Please use `GIT_LFS_SKIP_SMUDGE=1` to skip the download.)
2143
22- ### Dataset
44+ ### Dataset (latest version)
45+
46+ - [BinKit 2.0 dataset](https://drive.google.com/file/d/1TrjFnv6BMpVEXYukVxrhlQ78S0NPKEXa/view?usp=share_link)
47+
48+ ### Dataset (old)
49+ The datasets below are for reproducing the results of the paper.
50+
2351- [Normal dataset](https://drive.google.com/file/d/1K9ef-OoRBr0X5u8g2mlnYqh9o1i6zFij/view?usp=sharing)
2452- [SizeOpt dataset](https://drive.google.com/file/d/1QgwbEfd8vdzg5glNZFL7dg4l4hrkoWO3/view?usp=sharing)
2553- [Noinline dataset](https://drive.google.com/file/d/1wt7GY-DDp8J_2zeBBVUrcfWIyerg_xLO/view?usp=sharing)
@@ -63,23 +91,35 @@ Below data is only used for our evaluation.
6391- O2
6492- O3
6593- Os
94+ - Ofast
6695
6796### Compilers
68- - gcc-4.9.4
69- - gcc-5.5.0
70- - gcc-6.4.0
71- - gcc-7.3.0
72- - gcc-8.2.0
73- - clang-4.0
74- - clang-5.0
75- - clang-6.0
76- - clang-7.0
77- - clang-8.0
78- - clang-9.0
79- - clang-obfus-fla (Obfuscator-LLVM - FLA)
80- - clang-obfus-sub (Obfuscator-LLVM - SUB)
81- - clang-obfus-bcf (Obfuscator-LLVM - BCF)
82- - clang-obfus-all (Obfuscator-LLVM - FLA + SUB + BCF)
97+ - gcc
98+ - gcc-4.9.4
99+ - gcc-5.5.0
100+ - gcc-6.4.0
101+ - gcc-6.5.0
102+ - gcc-7.3.0
103+ - gcc-8.2.0
104+ - gcc-9.4.0
105+ - gcc-10.3.0
106+ - gcc-11.2.0
107+ - clang
108+ - clang-4.0.0
109+ - clang-5.0.2
110+ - clang-6.0.1
111+ - clang-7.0.1
112+ - clang-8.0.0
113+ - clang-9.0.1
114+ - clang-10.0.1
115+ - clang-11.0.1
116+ - clang-12.0.1
117+ - clang-13.0.0
118+ - clang-obfus
119+ - clang-obfus-fla (Obfuscator-LLVM - FLA)
120+ - clang-obfus-sub (Obfuscator-LLVM - SUB)
121+ - clang-obfus-bcf (Obfuscator-LLVM - BCF)
122+ - clang-obfus-all (Obfuscator-LLVM - FLA + SUB + BCF)
83123
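
As a rough consistency check, the compiler and optimization lists above can be written down as plain data and paired up. This is only a sketch: the variable names are hypothetical and are not part of BinKit's scripts. It also covers only compiler and optimization level, so the pair count is not the 1,904 combinations quoted above, which additionally vary architecture and other options not listed in this section.

```python
from itertools import product

# Compiler names copied from the list above, grouped by family.
# The dictionary layout itself is an illustrative assumption.
COMPILERS = {
    "gcc": [
        "gcc-4.9.4", "gcc-5.5.0", "gcc-6.4.0", "gcc-6.5.0", "gcc-7.3.0",
        "gcc-8.2.0", "gcc-9.4.0", "gcc-10.3.0", "gcc-11.2.0",
    ],
    "clang": [
        "clang-4.0.0", "clang-5.0.2", "clang-6.0.1", "clang-7.0.1",
        "clang-8.0.0", "clang-9.0.1", "clang-10.0.1", "clang-11.0.1",
        "clang-12.0.1", "clang-13.0.0",
    ],
    "clang-obfus": [
        "clang-obfus-fla", "clang-obfus-sub",
        "clang-obfus-bcf", "clang-obfus-all",
    ],
}

# The six optimization options named in the README.
OPT_LEVELS = ["O0", "O1", "O2", "O3", "Os", "Ofast"]

# Flatten the families into one list and pair each compiler with each level.
all_compilers = [name for family in COMPILERS.values() for name in family]
pairs = list(product(all_compilers, OPT_LEVELS))

print(len(all_compilers))  # 23 compilers, matching the count stated above
print(len(pairs))          # 138 (compiler, optimization level) pairs
```
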
84124# How to use
85125### 1. Configure the environment in `scripts/env.sh`
@@ -126,7 +166,7 @@ You can download the source code of GNU packages of your interest as below.
126166- You must give *ABSOLUTE PATH* for `--base_dir`.
127167
128168``` bash
129- $ source scripts/env
169+ $ source scripts/env.sh
130170$ python gnu_compile_script.py \
131171 --base_dir "/home/dongkwan/binkit/dataset/gnu" \
132172 --num_jobs 8 \
@@ -137,7 +177,7 @@ $ python gnu_compile_script.py \
137177You can compile only the packages or compiler options of your interest as below.
138178
139179``` bash
140- $ source scripts/env
180+ $ source scripts/env.sh
141181$ python gnu_compile_script.py \
142182 --base_dir "/home/dongkwan/binkit/dataset/gnu" \
143183 --num_jobs 8 \
@@ -148,7 +188,7 @@ $ python gnu_compile_script.py \
148188You can check the compiled binaries as below.
149189
150190``` bash
151- $ source scripts/env
191+ $ source scripts/env.sh
152192$ python compile_checker.py \
153193 --base_dir "/home/dongkwan/binkit/dataset/gnu" \
154194 --num_jobs 8 \
@@ -194,6 +234,16 @@ $ python gnu_compile_script.py \
194234If compilation fails, you may have to adjust the number of jobs for parallel
195235processing in step 1, which is machine-dependent.
196236
237+ ### Missing binaries
238+
239+ In the BinKit 2.0 dataset, the gsl package is missing 8 binaries for the Ofast
240+ option due to compiler bugs: clang-8 and clang-9 hang when compiling
241+ the gsl package for 32-bit ARM with the Ofast option. We reported this issue to
242+ both bug-gsl and llvm-project. bug-gsl did not reply, and
243+ llvm-project replied that these versions are not currently supported. The bug
244+ reports are available at the following links:
245+ [bug-gsl](https://lists.gnu.org/archive/html/bug-gsl/2023-02/msg00000.html),
246+ [llvm-project](https://github.com/llvm/llvm-project/issues/60692)
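
When rebuilding with compilers that may hang, as described above, one generic way to keep a batch build from stalling is to run each command under a timeout. The following is a minimal sketch using only the Python standard library. It is not how BinKit's scripts handle the problem, and `run_with_timeout` is a hypothetical helper name.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_sec):
    """Run cmd and return 'ok', 'failed', or 'timeout'."""
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_sec,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this.
        return "timeout"
    return "ok" if proc.returncode == 0 else "failed"


if __name__ == "__main__":
    # Stand-in commands; a real build would pass a compiler invocation here.
    print(run_with_timeout([sys.executable, "-c", "pass"], 10))
    print(run_with_timeout(
        [sys.executable, "-c", "import time; time.sleep(30)"], 1))
```

Note that `subprocess.run` only terminates the direct child on timeout. If the command is a wrapper such as `make` that spawns the compiler itself, the hung compiler process may need to be cleaned up separately.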
197247
198248# Authors
199249This project has been conducted by the below authors at KAIST.
@@ -218,3 +268,5 @@ paper](https://ieeexplore.ieee.org/document/9813408) when using BinKit.
218268 doi={10.1109/TSE.2022.3187689}
219269}
220270```
271+
272+