AlgoTwin: GNN-Based Detector for Algorithm-Level Binary Code Similarity Detection

This project implements a Graph Neural Network (GNN) for analyzing binary files using graph-based representations of functions. The GNN is trained to distinguish between positive and negative function pairs and can be used to detect and rank functions based on their similarity to known positive examples.
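
To make the graph-based representation concrete, the following minimal sketch encodes a toy control flow graph as a list of node features plus a COO-style edge index (one list of source indices, one list of target indices). The block names, the successor map, and the single instruction-count feature are all hypothetical; the features the project actually extracts are not specified here.

```python
# Hypothetical toy control flow graph: basic block -> successor blocks.
cfg = {
    "entry": ["loop_head"],
    "loop_head": ["loop_body", "exit"],
    "loop_body": ["loop_head"],
    "exit": [],
}
# Hypothetical per-block feature: number of instructions in the block.
instruction_counts = {"entry": 4, "loop_head": 2, "loop_body": 7, "exit": 1}


def encode_graph(successors, features):
    """Return (node_features, edge_index) with edge_index as [sources, targets]."""
    # Assign each block a stable integer index in insertion order.
    index_of = {name: i for i, name in enumerate(successors)}
    node_features = [[float(features[name])] for name in successors]
    sources, targets = [], []
    for name, succs in successors.items():
        for succ in succs:
            sources.append(index_of[name])
            targets.append(index_of[succ])
    return node_features, [sources, targets]


node_features, edge_index = encode_graph(cfg, instruction_counts)
print(node_features)
print(edge_index)
```

In PyTorch Geometric, these two lists would typically be converted to tensors and wrapped in a `torch_geometric.data.Data(x=..., edge_index=...)` object.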

Features

Graph-Based Representation: Extracts control flow and data flow graphs from binary functions using Ghidra's decompiler.
Graph Neural Network (GNN): Implements a GNN with contrastive loss for training on positive and negative function pairs.
Visualization: Provides t-SNE visualization of graph embeddings after training.
Detection: Detects and ranks functions in a binary file based on their similarity to positive embeddings.
Customizable: Easily extendable to other binary analysis tasks.
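
The contrastive objective listed above can be illustrated with the classic pairwise margin form: similar pairs are pulled together, and dissimilar pairs are pushed at least a margin apart. This is a sketch in plain Python under assumed choices (Euclidean distance, margin of 1.0); which variant of contrastive loss the project uses is not stated, and real training would operate on PyTorch tensors rather than lists.

```python
import math


def contrastive_loss(emb_a, emb_b, is_similar, margin=1.0):
    """Pairwise contrastive loss for one pair of embedding vectors.

    Similar pairs are penalized by squared distance; dissimilar pairs are
    penalized only when they are closer than `margin` (an assumed value).
    """
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(emb_a, emb_b)))
    if is_similar:
        return distance ** 2
    return max(0.0, margin - distance) ** 2


# A similar pair at distance 0.5 is penalized for being apart.
print(contrastive_loss([0.0, 0.0], [0.3, 0.4], is_similar=True))
# A dissimilar pair at distance 0.5 is penalized for being inside the margin.
print(contrastive_loss([0.0, 0.0], [0.3, 0.4], is_similar=False))
# A dissimilar pair well beyond the margin contributes no loss.
print(contrastive_loss([0.0, 0.0], [3.0, 4.0], is_similar=False))
```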

Requirements

Operating System

Windows

Dependencies

Java JDK 21
Python Libraries:
- torch, torch_geometric, torch_scatter
- numpy, matplotlib, scikit-learn
- pandas, tqdm, magic
- pyghidra

Installation

Clone the Repository:

git clone https://github.com/MathMasterMind/AlgoTwin.git
cd AlgoTwin

Install Java JDK 21:
- Download and install Java JDK 21.
Set Up the Environment:
- Run the setup.bat script to install Python dependencies and set up the environment:
```
setup.bat
```
Prepare Ghidra:
- Ensure the ghidra_11.3.1_PUBLIC_20250219 directory is present in the project folder.
Prepare Binaries:
- Place your binaries in the binaries directory for training and testing.

Usage

1. Train the Model

To train the GNN model on your dataset:

Ensure your dataset is defined in dataset.csv with the following format:
```
Name,Function1,Function2,...
binary1,function_name1,function_name2,...
binary2,function_name3,function_name4,...
```
In this example, Ghidra ingests every executable binary in the binaries/binary1 directory and selects all functions whose names contain function_name1 or function_name2.
Run the run.bat script to process binaries, train the model, and save the embeddings:
```
run.bat
```
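
The dataset.csv layout described above can be read as a mapping from each binary directory name to the function-name substrings to search for. The sketch below uses Python's standard `csv` module on an in-memory copy of the example. `load_dataset` and `is_target` are hypothetical helpers, not functions from ProcessDataset.py, and skipping empty cells is an assumption about how rows of differing length are handled.

```python
import csv
import io

# In-memory stand-in for dataset.csv, using the placeholder names from above.
SAMPLE = """Name,Function1,Function2
binary1,function_name1,function_name2
binary2,function_name3,function_name4
"""


def load_dataset(handle):
    """Map each binary name to its list of target function-name substrings."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    targets = {}
    for row in reader:
        if not row:
            continue  # ignore blank lines
        name, *functions = row
        targets[name] = [f.strip() for f in functions if f.strip()]
    return targets


def is_target(function_name, patterns):
    """Substring match: true if the name contains any of the patterns."""
    return any(p in function_name for p in patterns)


dataset = load_dataset(io.StringIO(SAMPLE))
print(dataset)
print(is_target("my_function_name1_v2", dataset["binary1"]))
```

To read a real file instead, pass `open("dataset.csv", newline="")` to `load_dataset`.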

2. Detect Functions in a Binary

To detect and rank functions in a binary file:

Place the binary file (e.g., arducopter) in the project directory.
Run the detector.py script:
```
python detector.py
```
Results will be saved in closest_functions.txt and printed to the console.
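
Ranking by similarity to positive embeddings can be sketched as follows. This example scores each candidate function by its highest cosine similarity to any positive embedding. The metric, the max aggregation, and all names and vectors are assumptions for illustration; how detector.py computes its ranking is not described here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_functions(candidates, positives):
    """Sort candidate names by best similarity to any positive embedding."""
    scored = [
        (name, max(cosine_similarity(emb, pos) for pos in positives))
        for name, emb in candidates.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical embeddings; real ones would come from the trained GNN.
positives = [[1.0, 0.0], [0.8, 0.6]]
candidates = {
    "FUN_00401000": [0.9, 0.1],
    "FUN_00402000": [0.0, 1.0],
    "FUN_00403000": [-1.0, 0.0],
}
ranking = rank_functions(candidates, positives)
for name, score in ranking:
    print(f"{name}\t{score:.3f}")
```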

Visualization

After training, the embeddings are visualized using t-SNE. The visualization is saved as graph_embeddings.png in the project directory.

Blue Points: Positive embeddings
Red Points: Negative embeddings

File Structure

AlgoTwin/
├── binaries/                     # Directory for training binaries
├── ghidra_11.3.1_PUBLIC_20250219/ # Ghidra installation directory
├── setup.bat                     # Script to set up the environment
├── run.bat                       # Script to train the model and run the detector
├── dataset.csv                   # CSV file defining the dataset
├── ProcessDataset.py             # Main script for processing binaries and training
├── detector.py                   # Script for detecting functions in a binary
├── positive_graphs.pkl           # Saved positive graphs (after processing binaries)
├── negative_graphs.pkl           # Saved negative graphs (after processing binaries)
├── positive_embeddings.json      # Saved positive embeddings (after training)
├── trained_gnn_model.pth         # Trained GNN model
├── model_dimensions.json         # Dimensions of the trained model
├── closest_functions.txt         # Output of the detector
└── graph_embeddings.png          # t-SNE visualization of graph embeddings

Example Workflow

Prepare Dataset:
- Add binaries and their target functions to dataset.csv.
Train the Model:
- Run run.bat to process binaries, train the model, and save embeddings.
Detect Functions:
- Use detector.py to rank functions in a binary based on their similarity to positive embeddings.
Visualize Embeddings:
- Check graph_embeddings.png for a 2D visualization of the embeddings.

Troubleshooting

CUDA PyTorch Fails to Install:
- The setup assumes a CUDA-capable graphics card. For a CPU-only installation, refer to the PyTorch Geometric installation guide.

References

Ghidra: Ghidra Official Website
PyTorch Geometric: PyTorch Geometric Documentation
t-SNE: t-SNE Algorithm

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contributors

Daniel Khoshkhoo - Developer
Luca Gianantonio, Caroline Huang - Dataset Construction
