Project
Course Unit: Topics in Intelligent Robotics 2024/2025
Course: Master's in Artificial Intelligence (MIA)
Faculty: Faculty of Engineering, University of Porto
Report
Presentation
This project extends the classic ball-and-plate control problem by adding maze navigation. The goal was to develop an agent that could guide a ball through generated mazes on a tilting platform. Rather than solving just one static maze, the focus was on creating agents that could generalize across different maze configurations.
The work progressed through three levels of complexity:
- Level 1 – Plate Balancing: At this initial stage, the agent learned to control the platform to keep the ball stable and prevent it from falling off the plate.
- Level 2 – Target Navigation: The task became more complex as the agent was required to move the ball to a specific coordinate on the plate, demonstrating controlled directional movement.
- Level 3 – Maze Navigation: The final and most challenging level involved guiding the ball from a designated start point to the maze exit, navigating around walls and obstacles. We explored two main strategies for this task: a Hierarchical approach and an End-to-End approach.
Level 1: Plate Balancing | Level 2: Target Navigation | Level 3: Maze Navigation
Figure 1: The agent's performance across the three levels of complexity: balancing the ball (left), navigating to a target (center), and solving a complete maze (right).
The project's architecture was designed to be modular and extensible, using a Strategy design pattern in C# within the Unity environment. This decouples the agent's core logic from its decision-making process, allowing for seamless switching between different algorithmic approaches. The two primary strategies implemented are HierarchicalStrategy and EndToEndStrategy. Further technical documentation and implementation details are available on the project's deepWiki.
Figure 2: Architecture implementing the Strategy pattern. The main agent delegates control, allowing it to switch between a direct EndToEndStrategy and a HierarchicalStrategy that combines an A* planner with a low-level RL controller.
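As a rough illustration of this decoupling, the sketch below shows how an agent can delegate the ML-Agents callbacks to an interchangeable strategy object. The interface members mirror the `MyNewStrategy` template shown later in this README; the exact signatures and the way the concrete strategy is created in the repository may differ.

```csharp
// Sketch of the Strategy pattern delegation (illustrative; the repository's
// StrategicPlatformAgent and IStrategy may differ in detail).
using Unity.MLAgents;
using Unity.MLAgents.Actuators;
using Unity.MLAgents.Sensors;
using UnityEngine;

public interface IStrategy
{
    void Initialize(StrategicPlatformAgent agent);
    void OnEpisodeBegin();
    void CollectObservations(VectorSensor sensor);
    void ProcessActions();
}

public class StrategicPlatformAgent : Agent
{
    public enum StrategyType { EndToEnd, Hierarchical }

    [SerializeField] private StrategyType selectedStrategy;
    private IStrategy _strategy;

    public override void Initialize()
    {
        // The agent only knows the IStrategy interface, so swapping algorithms
        // means swapping the concrete class created here (e.g. via a factory).
        _strategy = selectedStrategy == StrategyType.Hierarchical
            ? (IStrategy)new HierarchicalStrategy()   // concrete classes provided by the repository
            : new EndToEndStrategy();
        _strategy.Initialize(this);
    }

    public override void OnEpisodeBegin() => _strategy.OnEpisodeBegin();

    public override void CollectObservations(VectorSensor sensor) => _strategy.CollectObservations(sensor);

    public override void OnActionReceived(ActionBuffers actions)
    {
        // Apply the platform tilt from `actions` here, then let the strategy
        // compute rewards and check termination conditions.
        _strategy.ProcessActions();
    }
}
```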
Before tackling the full maze, we established baselines with simpler tasks. For both, we used a PPO (Proximal Policy Optimization) agent.
- Plate Balancing: The agent's goal was simply to keep the ball on the plate. It received a small positive reward for every step the ball remained on the plate and a penalty for falling. The agent quickly learned an effective stabilization policy, achieving a near-perfect success rate.
- Balance to Target: This task required the agent to move the ball to a random target coordinate. The reward was inversely proportional to the distance to the target, encouraging the agent to get closer. The agent successfully learned to perform this goal-directed navigation.
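To make the two baseline reward schemes concrete, here is a minimal hypothetical sketch; the constants and method names are illustrative and not taken from the repository.

```csharp
// Hypothetical sketch of the two baseline reward schemes described above;
// constants and method names are illustrative, not the project's tuned values.
using UnityEngine;

public static class BaselineRewards
{
    // Level 1 - Plate Balancing: small positive reward per step on the plate,
    // large penalty when the ball falls off.
    public static float BalancingReward(bool ballOnPlate)
    {
        return ballOnPlate ? 0.01f : -1.0f;
    }

    // Level 2 - Balance to Target: reward grows as the ball gets closer to the
    // target coordinate (inversely proportional to the distance).
    public static float TargetReward(Vector3 ballPosition, Vector3 targetPosition)
    {
        float distance = Vector3.Distance(ballPosition, targetPosition);
        return 1.0f / (1.0f + distance);   // bounded in (0, 1]
    }
}
```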
Hierarchical Approach: This approach decomposes the problem into two distinct components: a high-level planner and a low-level controller.
- High-Level Planner (A* Algorithm): We used the classic A* search algorithm as the planner (`AStarPlanner`). It operates on the discrete maze structure to compute a guaranteed optimal sequence of waypoints from the start to the exit before the agent begins to move.
- Low-Level Controller (PPO Agent): A PPO-based reinforcement learning agent (`RLController`) is responsible for the continuous control task of navigating the ball between the waypoints provided by the planner. This agent learns the complex physics-based manipulation through trial and error, focusing only on point-to-point navigation.
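For reference, the sketch below shows what a grid-based A* planner of this kind typically looks like: it searches the discrete maze with a Manhattan-distance heuristic and returns the waypoint sequence. It is only an illustration under the assumption of a 4-connected boolean grid; the repository's `AStarPlanner` may be implemented differently.

```csharp
// Illustrative A* search over a 4-connected grid of walkable cells. This is a
// sketch of what an AStarPlanner-style component does, not the repository's code.
using System.Collections.Generic;
using UnityEngine;

public static class GridAStar
{
    private static readonly Vector2Int[] Moves =
        { Vector2Int.up, Vector2Int.down, Vector2Int.left, Vector2Int.right };

    // walkable[x, y] is true for free cells and false for walls.
    public static List<Vector2Int> FindPath(bool[,] walkable, Vector2Int start, Vector2Int goal)
    {
        var open = new List<Vector2Int> { start };
        var gScore = new Dictionary<Vector2Int, int> { [start] = 0 };
        var cameFrom = new Dictionary<Vector2Int, Vector2Int>();

        while (open.Count > 0)
        {
            // Pick the open cell with the lowest f = g + h (a linear scan keeps the sketch simple).
            Vector2Int current = open[0];
            foreach (var cell in open)
                if (gScore[cell] + Heuristic(cell, goal) < gScore[current] + Heuristic(current, goal))
                    current = cell;
            open.Remove(current);

            if (current == goal)
                return Reconstruct(cameFrom, goal);

            foreach (var move in Moves)
            {
                Vector2Int next = current + move;
                if (!InBounds(walkable, next) || !walkable[next.x, next.y]) continue;

                int tentative = gScore[current] + 1; // unit cost per grid step
                if (!gScore.TryGetValue(next, out int best) || tentative < best)
                {
                    gScore[next] = tentative;
                    cameFrom[next] = current;
                    if (!open.Contains(next)) open.Add(next);
                }
            }
        }
        return null; // no path exists
    }

    private static int Heuristic(Vector2Int a, Vector2Int b) =>
        Mathf.Abs(a.x - b.x) + Mathf.Abs(a.y - b.y); // Manhattan distance

    private static bool InBounds(bool[,] grid, Vector2Int c) =>
        c.x >= 0 && c.y >= 0 && c.x < grid.GetLength(0) && c.y < grid.GetLength(1);

    private static List<Vector2Int> Reconstruct(Dictionary<Vector2Int, Vector2Int> cameFrom, Vector2Int goal)
    {
        var path = new List<Vector2Int> { goal };
        while (cameFrom.TryGetValue(goal, out Vector2Int prev))
        {
            path.Add(prev);
            goal = prev;
        }
        path.Reverse();
        return path; // waypoints from start to goal, handed to the low-level controller
    }
}
```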
The agent's learning is guided by a carefully engineered multi-component reward function:
- Directional Reward ($R_{dir}$): A shaping reward calculated from the dot product of the ball's velocity and the direction to the target waypoint, encouraging movement in the correct direction.
- Dynamic Time Penalty ($R_{time}$): A small negative reward at each step to incentivize speed. It is scaled based on maze complexity to ensure fairness.
- Waypoint Achievement Reward ($R_{waypoint}$): A sparse, positive reward of +1.0 given for reaching each waypoint, providing a clear signal for sub-goal completion.
Figure 3: Flowchart of the hierarchical reward function. Dense rewards guide the agent towards a waypoint, while a large sparse reward is given for reaching it.
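The sketch below shows one way the three terms could be combined each step. The coefficients, the waypoint radius, and the complexity scaling are placeholders, not the tuned values used in the project.

```csharp
// Illustrative combination of the three reward terms described above
// (coefficients and threshold are placeholders, not the repository's tuned values).
using UnityEngine;

public static class HierarchicalReward
{
    public static float Step(
        Vector3 ballVelocity,
        Vector3 ballPosition,
        Vector3 waypointPosition,
        int mazeComplexity,          // e.g. number of cells on the planned path
        out bool waypointReached)
    {
        Vector3 toWaypoint = waypointPosition - ballPosition;

        // R_dir: dot product of the ball's velocity with the direction to the waypoint.
        float rDir = 0.01f * Vector3.Dot(ballVelocity, toWaypoint.normalized);

        // R_time: small per-step penalty, scaled down for more complex mazes.
        float rTime = -0.01f / Mathf.Max(1, mazeComplexity);

        // R_waypoint: sparse +1.0 when the ball gets close enough to the waypoint.
        waypointReached = toWaypoint.magnitude < 0.05f;
        float rWaypoint = waypointReached ? 1.0f : 0.0f;

        return rDir + rTime + rWaypoint;
    }
}
```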
End-to-End Approach: This strategy uses a single, monolithic reinforcement learning policy to handle the entire navigation task autonomously.
- Unified Policy: The agent learns both high-level planning and low-level control simultaneously.
- Advanced Architecture: This approach requires more sophisticated observations and a more complex network. We used:
- A Convolutional Neural Network (CNN) to process a 20x20 grid representing the local wall layout around the ball, providing the agent with "spatial vision."
- A Long Short-Term Memory (LSTM) unit to give the agent short-term memory, enabling it to better handle the partially observable nature of the environment by tracking its movement history.
- Observations: The agent integrates multiple data sources, including its tilt angles, velocity, the difference in grid cells to the goal, its precise position within a cell, and the distance to the goal pre-calculated using a Breadth-First Search (BFS).
- Reward Function: The reward is primarily driven by reaching the goal (+10) or failing (-1). To guide its learning, the agent receives a continuous small negative reward based on its distance to the goal, encouraging it to make progress.
This approach achieved a notable success rate of 80.4% in testing.
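As an illustration of the BFS-precomputed distance observation mentioned above, the sketch below fills a distance-to-goal field over the maze grid. It is an assumption-laden sketch; the repository may compute and normalize this value differently.

```csharp
// Sketch of precomputing a BFS distance-to-goal field over the maze grid
// (illustrative; not the repository's implementation).
using System.Collections.Generic;
using UnityEngine;

public static class BfsDistanceField
{
    // Returns, for every walkable cell, the number of grid steps to the goal;
    // unreachable cells keep int.MaxValue.
    public static int[,] Compute(bool[,] walkable, Vector2Int goal)
    {
        int w = walkable.GetLength(0), h = walkable.GetLength(1);
        var distance = new int[w, h];
        for (int x = 0; x < w; x++)
            for (int y = 0; y < h; y++)
                distance[x, y] = int.MaxValue;

        var queue = new Queue<Vector2Int>();
        distance[goal.x, goal.y] = 0;
        queue.Enqueue(goal);

        Vector2Int[] moves = { Vector2Int.up, Vector2Int.down, Vector2Int.left, Vector2Int.right };
        while (queue.Count > 0)
        {
            Vector2Int cell = queue.Dequeue();
            foreach (var move in moves)
            {
                Vector2Int next = cell + move;
                if (next.x < 0 || next.y < 0 || next.x >= w || next.y >= h) continue;
                if (!walkable[next.x, next.y] || distance[next.x, next.y] != int.MaxValue) continue;
                distance[next.x, next.y] = distance[cell.x, cell.y] + 1;
                queue.Enqueue(next);
            }
        }
        return distance; // distance[ball.x, ball.y] can then be fed to the agent as an observation
    }
}
```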
Setup:

```bash
# Clone the repository
git clone https://github.com/Minipoloalex/maze-solver-rl.git
cd maze-solver-rl
```

Prerequisites:
- Python 3.10.12
- Unity Hub
- Unity Editor 6.1
Running the Project:
- (Optional) Install the required Python packages for training new models: `pip install -r requirements.txt`. This step is not required to run pre-trained models.
- Open the project in the Unity Hub using the specified Unity Editor version.
- Navigate to the `Assets/Scenes` folder and select the desired scene.
- The project is pre-configured to run with a trained model. Press the Play button in the Unity Editor to start the simulation.
To switch between the Hierarchical and End-to-End strategies, select the corresponding agent prefab in the Unity Editor before running the simulation:
- In the Hierarchy window, select the `Maze` GameObject. This GameObject contains the `Maze Spawner (Script)` component.
- In the Inspector window, find the `Maze Spawner` component.
- Locate the Platform Agent Prefab field. This field determines which agent will be spawned.
- In the Project window, navigate to the `Assets/Prefabs` folder.
- Drag either the `HierarchicalAgent` or `EndToEndAgent` prefab from the `Assets/Prefabs` folder into the Platform Agent Prefab slot in the Inspector.
- Press Play to run the simulation with the selected agent.
This project uses the Unity ML-Agents Toolkit to train the reinforcement learning agents.
1. Training Command
After installing the Python requirements, you can start training from your terminal. The main command is `mlagents-learn`. You must point it to a configuration file and give your training run a unique ID.
```bash
mlagents-learn <config_file_path> --run-id=<your_run_id>
```

- `<config_file_path>`: Path to the training configuration file. We have provided two:
  - `config/MazeTraining.yaml`: for the `Hierarchical` and `EndToEnd` maze-solving agents.
  - `config/Platform.yaml`: for the simpler `BalancePlatform` and `PlatformTarget` baseline agents.
- `--run-id`: A unique name for your training run. Trained models and summaries will be saved in a `results/<your_run_id>` folder.
2. Starting a Training Session
- Open the project in the Unity Editor and load the desired scene (e.g., `MazeScene`).
- Open a terminal in the project's root directory.
- Run the training command. For example, to train the maze agents:
  `mlagents-learn config/MazeTraining.yaml --run-id=MazeTestRun01`
- When you see the message "Start training by pressing the Play button in the Unity Editor", press Play in Unity to begin.
- You can monitor the training progress in the terminal or by using TensorBoard: `tensorboard --logdir results`.
3. Using the Trained Model
- Once training is complete (or you stop it with `Ctrl+C`), the trained model will be saved as an `.onnx` file. You can find it at `results/<your_run_id>/<Behavior_Name>.onnx`.
  - The `Behavior_Name` is defined in the agent's `Behavior Parameters` component in Unity (e.g., `HierarchicalMazeSolver`).
- Copy the generated `.onnx` file into your Unity project, for example, into the `Assets/Models` folder.
- In the Unity Editor, select the agent prefab you want to update (e.g., `HierarchicalAgent`).
- In the Inspector, find the Behavior Parameters component.
- Drag your new `.onnx` model file from the Project window into the Model field.
- The agent will now use your newly trained model when you run the simulation.
The project's Strategy Pattern makes it easy to implement and test new algorithms.
Step 1: Create a New Strategy Class
- In the Unity Project window, navigate to `Assets/Scripts/Strategies`.
- Create a new C# script (e.g., `MyNewStrategy.cs`).
- This class must implement the `IStrategy` interface. Use the following template as a starting point:
```csharp
// In MyNewStrategy.cs
using UnityEngine;
using Unity.MLAgents.Sensors;

[System.Serializable]
public class MyNewStrategy : IStrategy
{
    // Add public fields here to make them editable in the Inspector
    [Header("My New Strategy Settings")]
    public float myCustomParameter = 1.0f;

    private StrategicPlatformAgent _agent;

    public void Initialize(StrategicPlatformAgent agent)
    {
        _agent = agent;
        Debug.Log("MyNewStrategy Initialized!");
    }

    public void OnEpisodeBegin()
    {
        // Logic to reset the environment for your strategy
    }

    public void CollectObservations(VectorSensor sensor)
    {
        // Add observations relevant to your strategy
    }

    public void ProcessActions()
    {
        // Implement the core logic: calculate rewards, check for end conditions
    }
}
```

Step 2: Register the New Strategy
- Open the `Assets/Scripts/Agents/StrategicPlatformAgent.cs` script.
- Add your new strategy to the `StrategyType` enum:

  ```csharp
  public enum StrategyType
  {
      EndToEnd,
      Hierarchical,
      MyNewStrategy // Add your new strategy here
  }
  ```
- Open the `Assets/Scripts/Strategies/StrategyFactory.cs` script.
- Add a `case` to the `Create` method to instantiate your new class:

  ```csharp
  public static IStrategy Create(StrategicPlatformAgent.StrategyType type)
  {
      switch (type)
      {
          // ... existing cases
          case StrategicPlatformAgent.StrategyType.MyNewStrategy:
              return new MyNewStrategy();
          // ...
      }
  }
  ```
Step 3: Select and Configure in Unity
- Select the `StrategicPlatformAgent` prefab or the instance in your scene.
- In the Inspector, find the `Strategic Platform Agent (Script)` component.
- Use the Selected Strategy dropdown to choose `MyNewStrategy`.
- The public fields from your new strategy class (like `myCustomParameter`) will now appear in the Inspector, allowing you to configure them without changing the code.
You can now press Play to run the agent with your new custom logic! If it's an RL-based strategy, you'll also need to add a new behavior configuration to the .yaml file for training.
Technologies: Python, C#, Unity, Unity ML-Agents, PPO
- Adriano Machado (202105352)
- Félix Martins (202108837)
- Francisco da Ana (202108762)
- Martim Iglesias (202005380)



