
@teknium1

PR Type

  • RL Environment PR - Complete Environment Snapshot & Zero-Training sections
  • Non-Environment PR - Complete Description, Related Issues & Type of Change sections

📝 General Information

Description

This pull request introduces a new reinforcement learning environment, NemoRPSlidingPuzzleEnv, which implements a sliding puzzle game inspired by examples from NVIDIA's NeMo-RL framework and integrates it into the Atropos framework.

The environment is designed for multi-turn interactions where a Language Model (LLM) attempts to solve a sliding puzzle of configurable size (e.g., 3x3). The LLM receives the current board state and a list of possible moves (moving the blank tile LEFT, UP, RIGHT, or DOWN) and must output its thinking process within <think> </think> tags followed by an action number.

Key features include:

  • Configurable board size and maximum steps per puzzle.
  • Generation of solvable puzzle instances (see the solvability sketch after this list).
  • Strict enforcement of the <think> </think> tag before action output.
  • Scoring based on whether the puzzle is solved (1.0) or not (0.0).
  • A length-based reward penalty applied if all attempts in a group solve the puzzle, to encourage shorter solutions.
  • Collection of trajectories, evaluation, and WandB logging, adhering to Atropos conventions.
  • A debug flag for verbose logging of internal game and environment states.
  • The game logic itself is encapsulated in an internal _SlidingPuzzleGame class, adapted from NeMo-RL.
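For reference on the solvability guarantee mentioned above, here is a minimal sketch of the standard inversion-parity test for sliding puzzles. This is illustrative only; the actual environment may guarantee solvability differently (for example, by shuffling the solved board with random legal moves), and the function name is an assumption.

```python
# Hypothetical sketch of an inversion-parity solvability test; the actual
# environment may instead shuffle the solved board with random legal moves,
# which also guarantees solvability.
from typing import List, Optional


def is_solvable(flat: List[Optional[int]], size: int) -> bool:
    """flat: row-major board with None for the blank (size*size entries)."""
    tiles = [t for t in flat if t is not None]
    inversions = sum(
        1
        for i in range(len(tiles))
        for j in range(i + 1, len(tiles))
        if tiles[i] > tiles[j]
    )
    if size % 2 == 1:
        # Odd width (e.g. 3x3): solvable iff the inversion count is even.
        return inversions % 2 == 0
    # Even width: the blank's row, counted from the bottom (1-based), matters too.
    blank_row_from_bottom = size - (flat.index(None) // size)
    return (inversions + blank_row_from_bottom) % 2 == 1


# The 3x3 start state from the "good" example below has 2 inversions -> solvable.
assert is_solvable([1, 2, 3, 4, None, 5, 7, 8, 6], 3)
```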

This environment does not use explicit JSON-based tool calls; instead, the LLM's action is a direct numerical output representing a move.
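To make the format enforcement and scoring concrete, here is a hedged sketch of how the strict <think> </think> check, numeric action extraction, and group-level length penalty described above could be wired together. Function names, the regex, and the penalty scheme are assumptions for illustration, not the environment's actual implementation.

```python
# Illustrative sketch only: the real parsing and reward code may differ.
import re
from typing import List, Optional

_ACTION_RE = re.compile(r"^\s*<think>.*?</think>\s*([0-3])\s*$", re.DOTALL)


def parse_action(response: str) -> Optional[int]:
    """Return the move (0:LEFT, 1:UP, 2:RIGHT, 3:DOWN), or None if the
    response violates the required <think>...</think><action> format."""
    match = _ACTION_RE.match(response)
    return int(match.group(1)) if match else None


def score_group(solved: List[bool], token_lengths: List[int],
                max_penalty: float = 0.5) -> List[float]:
    """Base score is 1.0 for a solved puzzle, 0.0 otherwise. If every attempt
    in the group solved the puzzle, subtract a length-proportional penalty so
    shorter solutions are preferred (the penalty scheme here is an assumption)."""
    scores = [1.0 if s else 0.0 for s in solved]
    if solved and all(solved) and max(token_lengths) > 0:
        longest = max(token_lengths)
        scores = [s - max_penalty * (length / longest)
                  for s, length in zip(scores, token_lengths)]
    return scores


print(parse_action("<think>Move the blank right.</think>2"))  # -> 2
print(parse_action("I'll just answer 2"))                     # -> None (rejected)
print(score_group([True, True], [120, 240]))                  # shorter rollout scores higher
```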

Related Issues

N/A

Type of Change

N/A


🔖 Environment Snapshot

| Field | Your Entry |
| --- | --- |
| Environment Name | nemorl_sliding_puzzle |
| Short Description | An environment where an LLM plays a multi-turn sliding puzzle game, aiming to arrange numbers in ascending order by making sequential moves. |
| Category | RLAIF |
| Dataset Needed? | No (puzzles are procedurally generated) |
| External Deps | None |
| Environment Variables | None specific to the environment itself (standard Atropos LLM server API keys/URLs are needed for operation). |
| Compute Footprint Estimate | Puzzle logic: <10 MB RAM, negligible CPU per instance. Overall footprint is dominated by LLM inference calls; trajectory collection for a single item is very fast, excluding LLM time. |
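The configurable knobs mentioned above (board size, maximum steps, debug logging) might look like the following. This is a hypothetical sketch; the field names, defaults, and class name are assumptions, not the environment's actual config.

```python
# Hypothetical config sketch; names and defaults are assumptions based on the
# knobs described in this PR (board size, max steps, debug flag, seeding).
from dataclasses import dataclass
from typing import Optional


@dataclass
class SlidingPuzzleEnvConfig:
    board_size: int = 3            # 3 -> 3x3 puzzle with tiles 1..8 and one blank
    max_steps: int = 40            # episode ends unsolved after this many moves
    debug: bool = False            # verbose logging of internal game/env state
    seed: Optional[int] = None     # fix the puzzle instance for reproducibility


config = SlidingPuzzleEnvConfig(board_size=3, max_steps=40, debug=True)
```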

Example of a "good" interaction (conceptual):

Puzzle Seed: 123, Board Size: 3x3
Initial State:
1 2 3
4 B 5
7 8 6

LLM Turn 1:
User: Current board state:
1 2 3
4 B 5
7 8 6
Available actions for the blank tile (B): Move Blank LEFT, Move Blank UP, Move Blank RIGHT, Move Blank DOWN
Recall: 0:LEFT, 1:UP, 2:RIGHT, 3:DOWN for the blank tile.
Provide your thinking process and then the action number.

Assistant: <think>The blank is at (1,1). Moving the blank DOWN (action 3) would swap it with 8, RIGHT (action 2) with 5, LEFT (action 0) with 4, and UP (action 1) with 2. The 6 currently at (2,2) belongs at (1,2), so it needs to move up, which requires the blank to sit directly above it at (1,2). Moving the blank RIGHT puts it there, so I choose action 2.</think>2

... (further turns) ...

Final State:
1 2 3
4 5 6
7 8 B
Outcome: Solved in 15 steps. Score: 1.0
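To make the move semantics concrete, here is a minimal sketch of how the blank-tile moves (0:LEFT, 1:UP, 2:RIGHT, 3:DOWN) update the board in the example above. This is not the actual _SlidingPuzzleGame code; the helper and goal-state representation are illustrative assumptions.

```python
# Minimal illustration of the blank-move semantics; not the actual
# _SlidingPuzzleGame implementation.
GOAL = [[1, 2, 3], [4, 5, 6], [7, 8, None]]  # None marks the blank ("B")
DELTAS = {0: (0, -1), 1: (-1, 0), 2: (0, 1), 3: (1, 0)}  # LEFT, UP, RIGHT, DOWN


def apply_move(board, action):
    """Return a new board with the blank moved, or the same board if the
    move would leave the grid."""
    size = len(board)
    r, c = next((i, j) for i in range(size) for j in range(size)
                if board[i][j] is None)
    dr, dc = DELTAS[action]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < size and 0 <= nc < size):
        return board  # illegal move: no change
    new = [row[:] for row in board]
    new[r][c], new[nr][nc] = new[nr][nc], new[r][c]
    return new


board = [[1, 2, 3], [4, None, 5], [7, 8, 6]]  # initial state from the example
for action in (2, 3):  # move blank RIGHT, then DOWN
    board = apply_move(board, action)
print(board == GOAL)  # True
```

For this particular start state the optimal solution is the two-move sequence shown; the conceptual transcript above takes 15 steps, illustrating that the LLM need not solve the puzzle optimally to score 1.0.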

Example of a "bad" interaction (conceptual):

Puzzle Seed: 456, Board Size: 3x3
Initial State:
8 7 6
5 4 3
2 1 B

LLM Turn 1:
User: Current board state: ...
Assistant: <think>This looks complicated. I'll just try a random valid move. Moving blank UP (action 1) seems possible.</think>1

... (many turns later, or invalid action) ...

Outcome: Not solved within max_steps / Invalid action format. Score: 0.0

📄 Citation

It is recommended to add a citation or acknowledgment in the project documentation or relevant code comments, recognizing that the core sliding puzzle logic was adapted from NVIDIA's NeMo-RL examples. For instance:
"The sliding puzzle game logic in this environment is adapted from the examples provided in the NVIDIA NeMo-RL repository (https://github.com/NVIDIA/NeMo-RL)."


✅ Developer & Reviewer Checklist

  • Code follows project style (black, isort, flake8 pass with pre-commit)
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • New and existing unit tests pass locally with my changes
  • Docstrings added for all new public classes / functions
  • If .env vars are required, did you add them to the .env.example in the repo root?
