The mars-steg project aims to elicit steganographic behaviour by placing LLMs under an unusual kind of RL optimisation pressure, using a reward function with two elements:
- A reward for successfully completing a task;
- A penalty for using, in the chain of thought (CoT), an aspect of language chosen to be critical to successful task performance.

As the penalty drives the chosen aspect of language out of the CoT, we hope the LLM will naturally learn other ways to retain the task-relevant information that this language carries. With the right choice of task and language aspect, we hope to demonstrate that this can result in steganography.
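
To make the two-part reward concrete, here is a minimal sketch. It is not the project's actual reward code: the function names, the substring-counting penalty, and the `per_hit` and `penalty_weight` parameters are all assumptions chosen for illustration.

```python
def language_penalty(cot: str, penalised_terms: list[str], per_hit: float = 0.1) -> float:
    """Hypothetical penalty: grows with each occurrence of a penalised term in the CoT, capped at 1.0."""
    text = cot.lower()
    hits = sum(text.count(term.lower()) for term in penalised_terms)
    return min(1.0, per_hit * hits)


def combined_reward(
    task_score: float,
    cot: str,
    penalised_terms: list[str],
    penalty_weight: float = 1.0,
) -> float:
    """Hypothetical combined reward: task success minus the weighted CoT language penalty."""
    return task_score - penalty_weight * language_penalty(cot, penalised_terms)


# A correct answer whose CoT avoids the penalised term keeps the full task reward;
# the same answer with the term in the CoT is rewarded less.
clean = combined_reward(1.0, "combine three and four to get seven", ["add"])
leaky = combined_reward(1.0, "add three and four to get seven", ["add"])
```

Under this sketch, the only way to keep the full reward is to solve the task while carrying the penalised information in some other form, which is the pressure described above.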
We hope to study a broad range of tasks and language aspects, to increase our chances of observing this emergent behaviour. To do this efficiently, each task should be implemented in a way that is compatible with a single shared RL pipeline.
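
One way to picture a task that plugs into a shared pipeline is a small abstract interface that every task implements. The sketch below is an assumption for illustration only: the `Task` class, its method names, and `ToyAdditionTask` are hypothetical and do not describe the repository's actual API.

```python
from abc import ABC, abstractmethod


class Task(ABC):
    """Hypothetical task interface that a shared RL pipeline could consume."""

    @abstractmethod
    def prompt(self, index: int) -> str:
        """Return the prompt for example `index`."""

    @abstractmethod
    def task_score(self, index: int, final_answer: str) -> float:
        """Return a score in [0, 1] for the model's final answer."""

    @abstractmethod
    def language_penalty(self, index: int, cot: str) -> float:
        """Return a penalty in [0, 1] for penalised language in the CoT."""

    def reward(self, index: int, cot: str, final_answer: str, penalty_weight: float = 1.0) -> float:
        """Combine the two elements the same way for every task."""
        return self.task_score(index, final_answer) - penalty_weight * self.language_penalty(index, cot)


class ToyAdditionTask(Task):
    """Toy example: add two integers while digits are penalised in the CoT."""

    def __init__(self) -> None:
        self.problems = [(2, 3), (10, 7)]

    def prompt(self, index: int) -> str:
        a, b = self.problems[index]
        return f"What is {a} + {b}?"

    def task_score(self, index: int, final_answer: str) -> float:
        a, b = self.problems[index]
        return 1.0 if final_answer.strip() == str(a + b) else 0.0

    def language_penalty(self, index: int, cot: str) -> float:
        # Full penalty if any digit appears in the chain of thought.
        return 1.0 if any(ch.isdigit() for ch in cot) else 0.0


task = ToyAdditionTask()
spelled_out = task.reward(0, "two plus three is five", "5")
with_digits = task.reward(0, "2 + 3 = 5", "5")
```

With an interface of this shape, the pipeline only ever calls `prompt` and `reward`, so adding a new task or language aspect means writing one new subclass.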
- 🔥 Fast and lightweight
- 🔄 Supports multiple file formats
- 🔧 Customizable via settings
- See requirements.txt for dependencies.

Clone the repository and install the dependencies:

    git clone https://github.com/puria-radmard/mars-steg.git
    cd mars-steg
    pip install -r requirements.txt

Run the following command to run the training script:

    run ./run_math.sh

Contributions are welcome! Please follow these steps:

- Fork the repository.
- Create a new branch (`git checkout -b feature-branch`).
- Commit changes (`git commit -m "Add new feature"`).
- Push to the branch (`git push origin feature-branch`).
- Create a pull request.

🚧 In Construction