Q: I'm used to working with static datasets for SFT/DPO. How can I use Atropos?
A: Atropos can work with static datasets. The main README.md mentions "📚 Dataset environments" like GSM8K and MMLU, which are designed to "Evaluate and improve LLM performance on static data."
More directly for SFT/DPO workflows, Atropos provides tools to convert data generated from its environments into formats suitable for Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO). See the next question for details on these tools.
Q: How do I convert data from Atropos environments into SFT or DPO format?
A: The "Offline Data Generation" section in the main README.md details how to use atropos-sft-gen and atropos-dpo-gen. These command-line tools allow you to collect rollouts from Atropos environments and convert them into SFT or DPO compatible dataset formats (e.g., .jsonl).
The "Offline Data Generation Quick Start" in the README.md provides an example:
# First, ensure run-api and your environment server are running
run-api &
python environments/gsm8k_server.py serve --slurm False # or an env of your choice
# Then, generate the SFT dataset
atropos-sft-gen path/to/output.jsonl --tokenizer Qwen/Qwen2.5-1.5B-Instruct
Refer to atropos-sft-gen -h and atropos-dpo-gen -h for more detailed usage and filtering options.
Q: What is group_size and how does it relate to my existing data if I'm used to non-grouped examples?
A: group_size is a configuration parameter mentioned in the README.md (e.g., in config_init and example CLI commands like python environments/gsm8k_server.py serve --config environments/configs/example.yaml --env.group_size 8). It typically controls how many rollouts are grouped together for scoring within an environment, usually several responses to the same prompt or situation.
If you're coming from static, non-grouped datasets, you can often start with a default group_size. The key takeaway is that tools like atropos-sft-gen and atropos-dpo-gen are designed to process these groups and produce datasets in familiar SFT/DPO formats, so you don't need to fundamentally change your understanding of individual data points for training.
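To make the relationship between groups and familiar training records concrete, here is a minimal sketch of how one scored group could be flattened into SFT-style and DPO-style examples. The dictionary layout, the field names (prompt, completions, scores) and both helper functions are assumptions made for illustration; they are not the schema or logic that atropos-sft-gen and atropos-dpo-gen actually use.

```python
from typing import Dict, List

# Hypothetical in-memory shape for one group: a prompt plus `group_size`
# scored completions. Field names are illustrative, not Atropos's schema.
group = {
    "prompt": "What is 7 * 8?",
    "completions": ["56", "54", "56, because 7 * 8 = 56", "I don't know"],
    "scores": [1.0, 0.0, 1.0, 0.0],
}


def group_to_sft(group: Dict, min_score: float = 1.0) -> List[Dict]:
    """Keep every completion whose score clears the threshold as an SFT example."""
    return [
        {"prompt": group["prompt"], "completion": completion}
        for completion, score in zip(group["completions"], group["scores"])
        if score >= min_score
    ]


def group_to_dpo(group: Dict) -> List[Dict]:
    """Pair the highest- and lowest-scoring completions as chosen/rejected."""
    ranked = sorted(
        zip(group["scores"], group["completions"]), key=lambda pair: pair[0]
    )
    (low_score, rejected), (high_score, chosen) = ranked[0], ranked[-1]
    if high_score == low_score:
        # No preference signal when every completion scored the same.
        return []
    return [{"prompt": group["prompt"], "chosen": chosen, "rejected": rejected}]


sft_rows = group_to_sft(group)
dpo_rows = group_to_dpo(group)
```

With group_size set to 4 as above, one group yields two SFT rows and one DPO pair, each of which is an ordinary individual training example.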
This section guides users familiar with SFT/DPO on how to explore the online Reinforcement Learning capabilities of Atropos.
Q: I've used Atropos to generate SFT/DPO datasets. How can I start experimenting with online RL training?
A: Once you're comfortable generating offline datasets, the next step is to explore Atropos's capabilities with "live" or interactive data. The main README.md highlights:
- 🎮 Online environments: Examples like Blackjack and Taxi allow training LLMs through interactive game-based learning.
- 🤖 RLAIF and RLHF: Atropos supports fine-tuning LLMs using human feedback and alignment, as demonstrated by the RLAIF experiment artifacts in the README.md.
- 🔄 Multi-Turn RL: Environments like gym_taxi.py or those involving internal tool calling can train LLMs on complex multi-step interactions.
To start, you'd typically select an environment that suits your interest from the environments/ directory.
Q: What's the easiest way to run and observe an interactive Atropos environment?
A: The "Quick Start Guide" and "Testing and Debugging Tools" sections in the main README.md provide the best entry points:
- Configure an Environment:
  - Edit the config_init section of an environment file (e.g., environments/gsm8k_server.py) to point to your inference server (like a running VLLM or SGLang instance, or an OpenAI API compatible endpoint) and make any other desired configuration changes.
- Start the API Server:
run-api
- Start the Environment Microservice:
  - In a separate terminal, start your chosen environment. For example:
    python environments/gsm8k_server.py serve --openai.model_name YourModelName --slurm false
    # Or using a config file:
    # python environments/gsm8k_server.py serve --config environments/configs/example.yaml
- Observe the Environment:
  - view-run: The README.md states you can "Launch a Gradio UI to inspect batches of rollouts generated by your environment runs. This is useful for visually debugging the interactions and data flow." This is a great tool for "poking around" without a full training setup.
  - process subcommand: For quick, server-free local testing of a single environment, you can use its process subcommand. For example:
    python environments/gsm8k_server.py process --env.data_path_to_save_groups gsm8k.jsonl
    This saves generated rollout groups to a .jsonl file and also creates an HTML page (e.g., gsm8k.html) to visualize these rollouts. See python your_env_server.py process --help for details.
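If you want a quick look at the saved .jsonl file from a script rather than the HTML page, a schema-agnostic summary is a safe starting point. The sketch below assumes only that the file is JSON Lines (one JSON value per line); it does not assume any particular field names, and summarize_jsonl is a hypothetical helper, not part of Atropos.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_jsonl(path: str) -> dict:
    """Count records and top-level keys in a JSON Lines file."""
    n_records = 0
    key_counts = Counter()
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            n_records += 1
            if isinstance(record, dict):
                key_counts.update(record.keys())
    return {"records": n_records, "keys": dict(key_counts)}
```

Calling summarize_jsonl("gsm8k.jsonl") after a process run would show how many groups were written and which top-level fields each record carries, which helps before writing any field-specific code.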
Q: What are the advantages of using Atropos for online RL training compared to just SFT/DPO?
A: While SFT and DPO are useful, online RL in Atropos enables models to learn directly from interactions:
- Dynamic learning: Multi-turn and asynchronous RL let models adapt to evolving states—something static datasets can't capture.
- Complex skills: Experiments show major gains in tasks like tool calling and financial prediction, illustrating how RL helps train nuanced behaviors.
- Real-time feedback: The model receives immediate rewards (or corrections) during training, allowing it to refine its responses quickly.
- Exploration of new strategies: RL encourages the model to try different actions and learn from trial and error, uncovering solutions static data might miss.
- Better generalisation: Sampling from the model's own pretrained, latent policy space and reinforcing it across a range of environments with objective rewards tends to encourage the model to surface its own best way of tackling problems, rather than imitating SFT data or human preference data via DPO. As seen with models like DeepSeek trained in a similar fashion, the reasoning behaviours discovered during math and code training tended to improve capabilities across the board, even outside those kinds of environments.
Q: I see the Atropos example trainer (example_trainer/grpo.py) uses an algorithm called GRPO. What is that, and why might it be used with Atropos?
A: GRPO stands for Group Relative Policy Optimization. It's a Reinforcement Learning (RL) algorithm that can be effective for training Large Language Models (LLMs) on complex tasks, like those found in many Atropos environments (e.g., mathematical reasoning or code generation). It can be used for RLHF or RLAIF (there are examples in the repo), but it's handy for RLVR (Reinforcement Learning from Verifiable Rewards) as well.
Why it suits LLMs:
- Memory Efficiency: One of GRPO's main advantages is that it's designed to be more memory-efficient than some other RL algorithms such as PPO. It achieves this by not needing a separate, often large, "value model" to estimate potential future rewards, which is very helpful when the LLM being trained is already large. The tradeoff is more inference for rollouts, but that is generally less of a constraint than VRAM. There's nothing wrong with PPO, but with compute always tight for LLMs, GRPO and similar memory-efficient algorithms have gained a lot of ground.
- How it Learns: Instead of a value model, GRPO typically has the LLM generate multiple different responses for a given prompt or situation. These responses are then scored by anything that yields a reward signal (e.g., an objective score from an environment, an AI judge, a specific metric, or a reward model). GRPO uses the average score of this group of responses as a baseline. It then looks at how much better or worse each individual response was compared to that average. The model is then updated to make it more likely to produce responses that scored above the average.
- Suitability for Atropos: The example trainer uses GRPO to showcase how Atropos environments can provide the interactive data needed for this kind of online RL. This aligns with Atropos's goal of enabling LLMs to learn through interaction and feedback.
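The group-average baseline described above can be sketched in a few lines. This is a minimal illustration of the advantage computation only, not the code in example_trainer/grpo.py; the function name and the optional normalize flag (dividing by the group's standard deviation, a common variant) are assumptions for the example.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(
    scores: List[float], normalize: bool = False, eps: float = 1e-8
) -> List[float]:
    """Score each response relative to the mean of its own group."""
    baseline = mean(scores)  # the group average acts as the baseline
    advantages = [score - baseline for score in scores]
    if normalize:
        # Optional variant: rescale by the group's standard deviation.
        spread = pstdev(scores)
        advantages = [a / (spread + eps) for a in advantages]
    return advantages


# Four responses to one prompt, scored 1.0 (correct) or 0.0 (incorrect).
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# → [0.5, -0.5, 0.5, -0.5]
```

Responses with a positive advantage are the ones the update makes more likely, and those with a negative advantage become less likely. If every response in a group receives the same score, all advantages are zero and that group contributes no learning signal.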
Q: What is the run-api command that's mentioned frequently, and why is it important?
A: The run-api command starts the central Atropos API server, which is a core component of the framework. Think of it as the main hub that connects your environments to your training process.
Here's its role:
- Trajectory Collector: When you run an Atropos environment (e.g., python environments/gsm8k_server.py serve), that environment generates interaction data (called "trajectories" or "rollouts"). It sends this data to the run-api server.
- Data Source for Trainers: Your RL training script (like example_trainer/grpo.py or a custom one you build) then fetches batches of these trajectories from the run-api server to update your LLM.
- Central Coordination: It allows multiple different environments to potentially contribute data to the same training process and enables tools like view-run to inspect the data flowing through the system.
So, in most Atropos workflows, you'll need the run-api server running in one terminal before you start your environment servers and your trainer.
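As a mental model of that hub role, the toy class below plays the part of the server within a single Python process: environments push scored groups in, and a trainer pulls fixed-size batches out. Everything here (ToyTrajectoryHub, submit, get_batch, the record fields) is hypothetical and exists only to illustrate the data flow; the real run-api is a separate server process, and its actual endpoints and payloads are not shown here.

```python
from collections import deque
from typing import Deque, Dict, List, Optional


class ToyTrajectoryHub:
    """In-process stand-in for a central trajectory server (illustrative only)."""

    def __init__(self) -> None:
        self._queue: Deque[Dict] = deque()

    def submit(self, env_name: str, group: Dict) -> None:
        """Called by an environment: enqueue one scored group of rollouts."""
        self._queue.append({"env": env_name, **group})

    def get_batch(self, batch_size: int) -> Optional[List[Dict]]:
        """Called by a trainer: return a full batch, or None if not ready yet."""
        if len(self._queue) < batch_size:
            return None
        return [self._queue.popleft() for _ in range(batch_size)]


hub = ToyTrajectoryHub()
# Two different environments contribute to the same queue.
hub.submit("gsm8k", {"scores": [1.0, 0.0]})
hub.submit("blackjack", {"scores": [0.5, 0.5]})
hub.submit("gsm8k", {"scores": [0.0, 0.0]})

batch = hub.get_batch(batch_size=2)  # first two groups, in arrival order
```

The sketch also shows why startup order matters: nothing can be submitted or fetched until the hub exists, which mirrors starting run-api before the environment servers and the trainer.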