This project implements a Graph Neural Network (GNN) for analyzing binary files using graph-based representations of functions. The GNN is trained to distinguish between positive and negative function pairs and can be used to detect and rank functions based on their similarity to known positive examples.
- Graph-Based Representation: Extracts control flow and data flow graphs from binary functions using Ghidra's decompiler.
- Graph Neural Network (GNN): Implements a GNN with contrastive loss for training on positive and negative function pairs.
- Visualization: Provides t-SNE visualization of graph embeddings after training.
- Detection: Detects and ranks functions in a binary file based on their similarity to positive embeddings.
- Customizable: Easily extendable to other binary analysis tasks.
- Windows
- Java JDK 21: Download here
- Python Libraries:
torch,torch_geometric,torch_scatternumpy,matplotlib,scikit-learnpandas,tqdm,magicpyghidra
-
Clone the Repository:
git clone https://github.com/MathMasterMind/AlgoTwin.git cd AlgoTwin -
Install Java JDK 21:
- Download and install Java JDK 21 from here.
-
Set Up the Environment:
- Run the setup.bat script to install Python dependencies and set up the environment:
setup.bat
- Run the setup.bat script to install Python dependencies and set up the environment:
-
Prepare Ghidra:
- Ensure the ghidra_11.3.1_PUBLIC_20250219 directory is present in the project folder.
-
Prepare Binaries:
- Place your binaries in the binaries directory for training and testing.
To train the GNN model on your dataset:
-
Ensure your dataset is defined in dataset.csv with the following format:
Name,Function1,Function2,... binary1,function_name1,function_name2,... binary2,function_name3,function_name4,...For example, Ghidra will ingest all executable binaries in the
binaries/binary1directory and look for all functions containingfunction_name1,function_name2. -
Run the run.bat script to process binaries, train the model, and save the embeddings:
run.bat
To detect and rank functions in a binary file:
- Place the binary file (e.g., arducopter) in the project directory.
- Run the detector.py script:
python detector.py
- Results will be saved in closest_functions.txt and printed to the console.
After training, the embeddings are visualized using t-SNE. The visualization is saved as graph_embeddings.png in the project directory.
- Blue Points: Positive embeddings
- Red Points: Negative embeddings
ECE6254-Project/
├── binaries/ # Directory for training binaries
├── ghidra_11.3.1_PUBLIC_20250219/ # Ghidra installation directory
├── setup.bat # Script to set up the environment
├── run.bat # Script to train the model and run the detector
├── dataset.csv # CSV file defining the dataset
├── ProcessDataset.py # Main script for processing binaries and training
├── detector.py # Script for detecting functions in a binary
├── positive_graphs.pkl # Saved positive graphs (after processing binaries)
├── negative_graphs.pkl # Saved negative graphs (after processing binaries)
├── positive_embeddings.json # Saved positive embeddings (after training)
├── trained_gnn_model.pth # Trained GNN model
├── model_dimensions.json # Dimensions of the trained model
├── closest_functions.txt # Output of the detector
└── graph_embeddings.png # t-SNE visualization of graph embeddings
-
Prepare Dataset:
- Add binaries and their target functions to dataset.csv.
-
Train the Model:
- Run run.bat to process binaries, train the model, and save embeddings.
-
Detect Functions:
- Use detector.py to rank functions in a binary based on their similarity to positive embeddings.
-
Visualize Embeddings:
- Check
graph_embeddings.pngfor a 2D visualization of the embeddings.
- Check
- CUDA PyTorch Could Not Install:
- Assumes you have a CUDA based graphics card. For the CPU version, refer to the PyTorch Geometric installation guide.
- Ghidra: Ghidra Official Website
- PyTorch Geometric: PyTorch Geometric Documentation
- t-SNE: t-SNE Algorithm
This project is licensed under the MIT License. See the LICENSE file for details.
- Daniel Khoshkhoo - Developer
- Luca Gianantonio, Caroline Huang - Dataset Construction
