This is the repository for the tool to compute the semantic similarity between the original C code and the translation.
The overview of this tool is as follows. The script `src/python/translationValidator.py` kicks off the process.
- It accepts a codebase as input and separates out each function into its own file. The script responsible for this is `src/python/functionAndDepsExtractor.py`. The codebase should reside inside `src/python/inputs-complex`.
- Then, it reads each file (containing one function) and asks the chosen LLM model for a translation (see the script `src/python/gptTranslation.py`).
- It tries to compile the Rust translation; if compilation fails, it feeds the error message back to the LLM model and asks it to fix the code. The number of attempts is capped by `COMPILATION_RETRIES` in the script `gptTranslation.py`.
- The output is typically placed in a directory named `individual-funcs_<options>`, where `options` encodes the model used and a timestamp. Each run of the script creates a new directory so that it does not clobber the previous result directory.
- Then, it compiles each `.i` and `.rs` file into LLVM IR and symbolizes it: every input argument and return value is marked as a symbol, and the corresponding KLEE symbolic value is printed.
- Finally, it uses KLEE to execute the symbolized IR, gathers all of the output graphs, and computes the edit distance between the C results and the Rust results.
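The compile-and-retry loop described above can be sketched as follows. This is an illustrative outline, not the actual code in `gptTranslation.py`; `ask_llm` and `compile_fn` are hypothetical stand-ins for the LLM query and the `rustc` invocation.

```python
COMPILATION_RETRIES = 3  # named after the cap in gptTranslation.py

def translate_with_retries(ask_llm, compile_fn, c_source, retries=COMPILATION_RETRIES):
    """Ask the LLM for a Rust translation; on compile failure, feed the
    compiler error back into the prompt and retry up to `retries` times."""
    prompt = f"Translate this C function to Rust:\n{c_source}"
    for _ in range(retries):
        rust = ask_llm(prompt)
        ok, error_msg = compile_fn(rust)  # e.g. a wrapper around rustc
        if ok:
            return rust
        prompt = (f"This Rust code failed to compile:\n{rust}\n"
                  f"Compiler error:\n{error_msg}\nPlease fix it.")
    return None  # give up after the retry cap
```

Making the compile step a parameter keeps the loop testable without invoking a real compiler.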
NOTE (IMPORTANT): When developing or debugging, PLEASE make sure the LLM model is set to GPT-3.5. You can do this by turning off `use-gpt4` and `use-claude` in `translationValidator.py`, either by hardcoding them or by passing these values explicitly as `False` on the command line. GPT-4o is very expensive, so please use GPT-3.5 for testing and development; when we are ready to generate the results, we will use GPT-4o.
If you are a member of the team and haven't received it yet, please email tpalit@ucdavis.edu for the OpenAI key, which you should store in an environment variable as described below.
The frontend extracts all individual `.i` files and translates them into Rust using an LLM.
- Run `git submodule update --init --recursive`. Inside `src/SVF`, execute `./build.sh`, and then inside `Release-Build` invoke `sudo make install`.
- Install `rust` using `rustup`. Then downgrade to version 1.64.0, which uses the LLVM 14 backend that we rely on: `rustup install 1.64.0` followed by `rustup default 1.64.0`.
- Please clone https://github.com/davsec-lab/typedefextractor and build it. Make sure it builds the `clang` project.
- Add the build directory to your `$PATH`. Make sure you can run `unused-typedef-extractor <src-dir>` from the terminal.
- Make sure you have `universal-ctags` installed: `sudo apt purge ctags && sudo apt install universal-ctags`.
- Make sure you have a GPT key stored in the environment variable `$OPENAI_KEY`.
- Install the Python modules `openai`, `tiktoken`, `more_itertools`, and `pycparser` using `pip3`. For the validator, also install `antlr4-tools`, `antlr4-python3-runtime`, `numpy`, `scipy`, `pygraphviz`, `pydot`, and `networkx`.
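For the `$OPENAI_KEY` step, a minimal sketch (the key value below is a placeholder; get the real key from tpalit@ucdavis.edu):

```shell
# Placeholder value -- substitute the real key you received.
export OPENAI_KEY="sk-placeholder"
# Add the line above to ~/.bashrc (or similar) to make it persistent.
# Quick check that the variable is visible to child processes:
python3 -c 'import os; assert os.environ.get("OPENAI_KEY"), "OPENAI_KEY not set"'
```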
The backend uses LLVM to instrument the Rust and C files and KLEE to obtain symbolic values, then runs the graph-comparison algorithm between them.
- Install the symbolic-execution dependencies: `sudo apt-get install z3 cmake` and `pip3 install cmake`.
- Download the LLVM and clang binaries: `wget https://github.com/llvm/llvm-project/releases/download/llvmorg-14.0.0/clang+llvm-14.0.0-x86_64-linux-gnu-ubuntu-18.04.tar.xz`.
- Extract them: `tar -Jxvf clang+llvm-14.0.0-x86_64-linux-gnu-ubuntu-18.04.tar.xz`.
- Add `<FULL_PATH>/clang+llvm-14.0.0-x86_64-linux-gnu-ubuntu-18.04/bin` to `$PATH`. This puts the binaries on your path so that you can invoke them like standard Linux tools.
- Once you initialize the LLVM submodules, you should have the KLEE repository.
- Create a `klee-build` directory in `<PATH>/rustassure/src`.
- Inside it, run `cmake -DCMAKE_BUILD_TYPE=Release -DENABLE_TCMALLOC=0 -DENABLE_SOLVER_Z3=ON ../klee`.
- Run `make -j4 && sudo make install`.
- Install the JSON dependency with `git submodule update --init`.
- Build the Symbolizer pass. This is the LLVM tool that automatically inserts the `klee_make_symbolic` and `klee_print_exprs` function calls into the LLVM bitcode. Inside `rustassure/src/Symbolizer`, run `./build.sh`.
NOTE: When pulling, please make sure that you also have the latest version of the typedefextractor repo.
This part gives a detailed introduction to the whole tool chain.
There is a wrapper (`inputs-complex/clang-wrapper.sh`) around the clang compiler that dumps out the preprocessed files. Configure and build the source code of the target application by passing `CC=<dir>/clang-wrapper.sh`.
For an example, check out `compile.sh` in `inputs-complex/zlib-1.3.1`.
This will generate a bunch of `.i` files in the source directory. We want those.
NOTE: The clang wrapper assumes that the Makefile commands compile a single file at a time. This is the common case. But if you have something that tries to compile multiple files (and link) in the same command, such as `$(CC) a.c b.c -o a.out`, the wrapper won't work. Please let me know in case it's not easy to adjust the Makefile.
The GPT translation module has the following steps:
- Parse the `.i` files, extract the individual functions, and create a `.i` file for each function. Each such file also contains all the `typedef` and `struct` definitions referenced by that function. The script automatically filters all unneeded dependencies from the preprocessor expansion by invoking `unused-typedef-extractor`; the code to do this is in `typedefFilter.py`.
- Then it takes each individual `.i` file and invokes `gptTranslation.py`. Currently, it uses GPT-3.5 by default (to prevent us from going bankrupt). To use GPT-4, pass `--use-gpt4` to `translationValidator.py`.
- This will (hopefully) use GPT to create a corresponding `.rs` file for each `.i` file.
- It automatically invokes both the Clang C compiler on the individual-function `.i` files and the `rustc` compiler on the corresponding `.rs` files. Any compilation failures are displayed on screen and also recorded in the log file `validator.log`.

The final files will be in the directory `<SRC_DIR>/individual-funcs`. This directory will contain the individual `.i` files, the Rust files for each function, and the compiled bitcodes for both the `.i` and `.rs` files (if successful).
As of 6/24/2024, we can only fine-tune GPT-3.5 models.
- Place the fine-tuning training file in `./training`, following the existing formatting.
- Run `python3 jsonifyTrainingData.py`.
- Go to https://platform.openai.com/finetune/ to check the progress. It should show the fine-tuning job. When it finishes, grab the name of the model (TODO: provide it as an argument to the script).
- Then, pass `--fine-tuned-model=<model_name>` when invoking `translationValidator.py`.
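The training data ends up in the chat-format JSONL that the OpenAI fine-tuning API expects for GPT-3.5 models. A rough sketch of that conversion follows; the exact prompts and formatting used by `jsonifyTrainingData.py` may differ, and the C/Rust pair below is made up:

```python
import json

# Hypothetical training pair; the real ones live under ./training.
pairs = [(
    "int add(int a, int b) { return a + b; }",
    "fn add(a: i32, b: i32) -> i32 { a + b }",
)]

def to_finetune_jsonl(pairs):
    """Emit one chat-format JSON object per line, the shape required
    by the OpenAI fine-tuning API."""
    lines = []
    for c_src, rust_src in pairs:
        lines.append(json.dumps({
            "messages": [
                {"role": "system", "content": "Translate C code to Rust."},
                {"role": "user", "content": c_src},
                {"role": "assistant", "content": rust_src},
            ]
        }))
    return "\n".join(lines)

jsonl = to_finetune_jsonl(pairs)
```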
The symbolic execution module proceeds as follows:
- Compiles each C file to LLVM bitcode.
- Applies a custom LLVM pass to generate symbolized LLVM IR.
- Runs KLEE on the IR files to extract symbolic-execution logs.
- Converts the symbolic expressions into tree structures and saves them as `.png` files.
- Runs the graph-comparison algorithm to measure the similarities and differences between the output symbolic structures.
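The final comparison step can be illustrated with `networkx` (one of the listed dependencies). This is a toy sketch with made-up expression trees, not the tool's actual graph-compare code:

```python
import networkx as nx

def expr_tree(labels, edges):
    """Build a small directed tree with a 'label' attribute per node."""
    g = nx.DiGraph()
    for node, label in labels.items():
        g.add_node(node, label=label)
    g.add_edges_from(edges)
    return g

# Symbolic result of a C function, say (a + b):
c_tree = expr_tree({0: "Add", 1: "a", 2: "b"}, [(0, 1), (0, 2)])
# A matching Rust result:
rust_same = expr_tree({0: "Add", 1: "a", 2: "b"}, [(0, 1), (0, 2)])
# A divergent Rust result, (a - b):
rust_diff = expr_tree({0: "Sub", 1: "a", 2: "b"}, [(0, 1), (0, 2)])

same_label = lambda m, n: m["label"] == n["label"]
d_equal = nx.graph_edit_distance(c_tree, rust_same, node_match=same_label)
d_diverge = nx.graph_edit_distance(c_tree, rust_diff, node_match=same_label)
# d_equal is 0.0 (identical trees); d_diverge is positive (one node differs).
```

An edit distance of 0 means the C and Rust symbolic results agree for that argument, which is what the `edit_distance_equal_0` statistic described later counts.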
This part describes several ways to run the rust-validator tool chain.
Rust-validator has two parts:
- Frontend: translates the C codebase to Rust.
- Backend: uses KLEE to verify translation similarity.
- Inside `src/python`, run `python3 translationValidator.py`.
- Give it an input codebase directory with `--src=<input directory>`.
- The default GPT model is GPT-3.5; it can be changed with `--use-gpt4=true` or `--use-claude=true`.
- The final files will be in the directory `<SRC_DIR>/individual-funcs`. This directory will contain the individual `.i` files, the Rust files for each function, and the compiled bitcodes for both the `.i` and `.rs` files (if successful).
- Inside `src/python`, run `python3 performSymbolExecution.py`. There are several existing evaluation codebases to choose from; select the codebase and GPT model you want to try according to the command-line hints.
- Or, specify any translated repository that you want to test with `python3 performSymbolExecution.py --src=<input directory>`.
- The output is in `src/Symbolizer`, in a directory named `codebase_gptmodel_data`.
There are several outputs inside the output directory:
- The Rust and C symbol results are in `graph_output`.
- The graph-comparison results (edit distance) are in `edit_distance`. The file `result.csv` includes all of the statistical results:
  - `total_functions`: total number of functions in the original input codebase.
  - `total_rust_functions_compiled`: total number of translated Rust functions that compile.
  - `total_arguments`: total number of arguments of the Rust functions that compile. If an argument is a struct, it is expanded.
  - `edit_distance_equal_0`: total number of arguments for which the edit distance between the C and Rust symbolic values is 0.
  - `overall_lines_sum`: total number of lines in the translated Rust target functions.
  - `overall_unsafe_sum`: total number of unsafe lines in the translated Rust target functions.
  - `overall_safe_lines`: total number of safe lines in the translated Rust target functions.
  - `coverage`: KLEE instruction coverage of the Rust execution.
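As an illustration of how these columns might be consumed, here is a short sketch; the CSV row below is fabricated, and only the column names come from the list above:

```python
import csv, io

# Fabricated one-row result.csv using the columns described above.
sample = """total_functions,total_rust_functions_compiled,total_arguments,edit_distance_equal_0,overall_lines_sum,overall_unsafe_sum,overall_safe_lines,coverage
50,40,120,96,800,80,720,0.65
"""

row = next(csv.DictReader(io.StringIO(sample)))
# Fraction of functions whose Rust translation compiled:
compile_rate = int(row["total_rust_functions_compiled"]) / int(row["total_functions"])
# Fraction of arguments whose C/Rust symbolic values matched exactly:
match_rate = int(row["edit_distance_equal_0"]) / int(row["total_arguments"])
# Fraction of translated lines that are unsafe:
unsafe_ratio = int(row["overall_unsafe_sum"]) / int(row["overall_lines_sum"])
print(f"compiled: {compile_rate:.0%}, matching args: {match_rate:.0%}, "
      f"unsafe lines: {unsafe_ratio:.0%}")
```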
- This mode combines the frontend and backend of the tool chain: it first translates the input codebase and then uses KLEE to verify the results.
- Run `python3 translateAndSymbolicValidate.py --src=<input_directory>`.
- Change the GPT model by passing `--use-gpt4=true`.