
@teknium1

PR Type

  • RL Environment PR - Complete Environment Snapshot & Zero-Training sections
  • Non-Environment PR - Complete Description, Related Issues & Type of Change sections

📝 General Information

Description

This pull request introduces a new reinforcement learning environment, NemoRPSlidingPuzzleEnv, which implements a sliding puzzle game inspired by examples from NVIDIA's NeMo-RL framework and integrates it into the Atropos framework.

The environment is designed for multi-turn interactions where a Language Model (LLM) attempts to solve a sliding puzzle of configurable size (e.g., 3x3). The LLM receives the current board state and a list of possible moves (moving the blank tile LEFT, UP, RIGHT, or DOWN) and must output its thinking process within <think> </think> tags followed by an action number.

Key features include:

  • Configurable board size and maximum steps per puzzle.
  • Generation of solvable puzzle instances (see the solvability sketch after this list).
  • Strict enforcement of the <think> </think> tag before action output.
  • Scoring based on whether the puzzle is solved (1.0) or not (0.0).
  • A length-based reward penalty applied if all attempts in a group solve the puzzle, to encourage shorter solutions.
  • Collection of trajectories, evaluation, and WandB logging, adhering to Atropos conventions.
  • A debug flag for verbose logging of internal game and environment states.
  • The game logic itself is encapsulated in an internal _SlidingPuzzleGame class, adapted from NeMo-RL.
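For reference on the solvability guarantee mentioned above, here is a minimal sketch of the standard inversion-parity test for sliding puzzles. This is illustrative only; the actual environment may guarantee solvability differently (for example, by shuffling the solved board with random legal moves), and the function name is an assumption.

```python
# Hypothetical sketch of an inversion-parity solvability test; the actual
# environment may instead shuffle the solved board with random legal moves,
# which also guarantees solvability.
from typing import List, Optional


def is_solvable(flat: List[Optional[int]], size: int) -> bool:
    """flat: row-major board with None for the blank (size*size entries)."""
    tiles = [t for t in flat if t is not None]
    inversions = sum(
        1
        for i in range(len(tiles))
        for j in range(i + 1, len(tiles))
        if tiles[i] > tiles[j]
    )
    if size % 2 == 1:
        # Odd width (e.g. 3x3): solvable iff the inversion count is even.
        return inversions % 2 == 0
    # Even width: the blank's row, counted from the bottom (1-based), matters too.
    blank_row_from_bottom = size - (flat.index(None) // size)
    return (inversions + blank_row_from_bottom) % 2 == 1


# The 3x3 start state from the "good" example below has 2 inversions -> solvable.
assert is_solvable([1, 2, 3, 4, None, 5, 7, 8, 6], 3)
```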

This environment does not use explicit JSON-based tool calls; instead, the LLM's action is a direct numerical output representing a move.
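To make the format enforcement and scoring concrete, here is a hedged sketch of how the strict <think> </think> check, numeric action extraction, and group-level length penalty described above could be wired together. Function names, the regex, and the penalty scheme are assumptions for illustration, not the environment's actual implementation.

```python
# Illustrative sketch only: the real parsing and reward code may differ.
import re
from typing import List, Optional

_ACTION_RE = re.compile(r"^\s*<think>.*?</think>\s*([0-3])\s*$", re.DOTALL)


def parse_action(response: str) -> Optional[int]:
    """Return the move (0:LEFT, 1:UP, 2:RIGHT, 3:DOWN), or None if the
    response violates the required <think>...</think><action> format."""
    match = _ACTION_RE.match(response)
    return int(match.group(1)) if match else None


def score_group(solved: List[bool], token_lengths: List[int],
                max_penalty: float = 0.5) -> List[float]:
    """Base score is 1.0 for a solved puzzle, 0.0 otherwise. If every attempt
    in the group solved the puzzle, subtract a length-proportional penalty so
    shorter solutions are preferred (the penalty scheme here is an assumption)."""
    scores = [1.0 if s else 0.0 for s in solved]
    if solved and all(solved) and max(token_lengths) > 0:
        longest = max(token_lengths)
        scores = [s - max_penalty * (length / longest)
                  for s, length in zip(scores, token_lengths)]
    return scores


print(parse_action("<think>Move the blank right.</think>2"))  # -> 2
print(parse_action("I'll just answer 2"))                     # -> None (rejected)
print(score_group([True, True], [120, 240]))                  # shorter rollout scores higher
```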

Related Issues

N/A

Type of Change

N/A


🔖 Environment Snapshot

| Field | Your Entry |
| --- | --- |
| Environment Name | nemorl_sliding_puzzle |
| Short Description | An environment where an LLM plays a multi-turn sliding puzzle game, aiming to arrange numbers in ascending order by making sequential moves. |
| Category | RLAIF |
| Dataset Needed? | No (puzzles are procedurally generated) |
| External Deps | None |
| Environment Variables | None specific to the environment itself (standard Atropos LLM server API keys/URLs are needed for operation). |
| Compute Footprint Estimate | Puzzle logic: <10 MB RAM, negligible CPU per instance. Overall footprint is dominated by LLM inference calls; trajectory collection for a single item is very fast, excluding LLM time. |
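The configurable knobs mentioned above (board size, maximum steps, debug logging) might look like the following. This is a hypothetical sketch; the field names, defaults, and class name are assumptions, not the environment's actual config.

```python
# Hypothetical config sketch; names and defaults are assumptions based on the
# knobs described in this PR (board size, max steps, debug flag, seeding).
from dataclasses import dataclass
from typing import Optional


@dataclass
class SlidingPuzzleEnvConfig:
    board_size: int = 3            # 3 -> 3x3 puzzle with tiles 1..8 and one blank
    max_steps: int = 40            # episode ends unsolved after this many moves
    debug: bool = False            # verbose logging of internal game/env state
    seed: Optional[int] = None     # fix the puzzle instance for reproducibility


config = SlidingPuzzleEnvConfig(board_size=3, max_steps=40, debug=True)
```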

Example of a "good" interaction (conceptual):

Puzzle Seed: 123, Board Size: 3x3
Initial State:
1 2 3
4 B 5
7 8 6

LLM Turn 1:
User: Current board state:
1 2 3
4 B 5
7 8 6
Available actions for the blank tile (B): Move Blank LEFT, Move Blank UP, Move Blank RIGHT, Move Blank DOWN
Recall: 0:LEFT, 1:UP, 2:RIGHT, 3:DOWN for the blank tile.
Provide your thinking process and then the action number.

Assistant: <think>The blank is at (1,1). Moving the blank DOWN (action 3) would swap it with 8, RIGHT (action 2) with 5, LEFT (action 0) with 4, and UP (action 1) with 2. The 6 currently at (2,2) belongs at (1,2), so it needs to move up, which requires the blank to sit directly above it at (1,2). Moving the blank RIGHT puts it there, so I choose action 2.</think>2

... (further turns) ...

Final State:
1 2 3
4 5 6
7 8 B
Outcome: Solved in 15 steps. Score: 1.0
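To make the move semantics concrete, here is a minimal sketch of how the blank-tile moves (0:LEFT, 1:UP, 2:RIGHT, 3:DOWN) update the board in the example above. This is not the actual _SlidingPuzzleGame code; the helper and goal-state representation are illustrative assumptions.

```python
# Minimal illustration of the blank-move semantics; not the actual
# _SlidingPuzzleGame implementation.
GOAL = [[1, 2, 3], [4, 5, 6], [7, 8, None]]  # None marks the blank ("B")
DELTAS = {0: (0, -1), 1: (-1, 0), 2: (0, 1), 3: (1, 0)}  # LEFT, UP, RIGHT, DOWN


def apply_move(board, action):
    """Return a new board with the blank moved, or the same board if the
    move would leave the grid."""
    size = len(board)
    r, c = next((i, j) for i in range(size) for j in range(size)
                if board[i][j] is None)
    dr, dc = DELTAS[action]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < size and 0 <= nc < size):
        return board  # illegal move: no change
    new = [row[:] for row in board]
    new[r][c], new[nr][nc] = new[nr][nc], new[r][c]
    return new


board = [[1, 2, 3], [4, None, 5], [7, 8, 6]]  # initial state from the example
for action in (2, 3):  # move blank RIGHT, then DOWN
    board = apply_move(board, action)
print(board == GOAL)  # True
```

For this particular start state the optimal solution is the two-move sequence shown; the conceptual transcript above takes 15 steps, illustrating that the LLM need not solve the puzzle optimally to score 1.0.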

Example of a "bad" interaction (conceptual):

Puzzle Seed: 456, Board Size: 3x3
Initial State:
8 7 6
5 4 3
2 1 B

LLM Turn 1:
User: Current board state: ...
Assistant: <think>This looks complicated. I'll just try a random valid move. Moving blank UP (action 1) seems possible.</think>1

... (many turns later, or invalid action) ...

Outcome: Not solved within max_steps / Invalid action format. Score: 0.0

📄 Citation

It is recommended to add a citation or acknowledgment in the project documentation or relevant code comments, recognizing that the core sliding puzzle logic was adapted from NVIDIA's NeMo-RL examples. For instance:
"The sliding puzzle game logic in this environment is adapted from the examples provided in the NVIDIA NeMo-RL repository (https://github.com/NVIDIA/NeMo-RL)."


✅ Developer & Reviewer Checklist

  • Code follows project style (black, isort, flake8 pass with pre-commit)
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • New and existing unit tests pass locally with my changes
  • Docstrings added for all new public classes / functions
  • If .env vars are required, did you add them to the .env.example in the repo root?
