This repository contains the code for my thesis.
To run my experiments, the environment and my reinforcement learning code first have to be installed as a package. Clone this repository, navigate into it with your terminal, and execute the following steps.
- Install the repository as a pip package:

```bash
pip install .
```

- Check whether the installation was successful:

```bash
python -c "import ThesisPackage"
```
The basic multi-agent collectors environment can be imported and trained like this:
```python
from ThesisPackage.Environments.collectors.collectors_env_discrete_onehot import Collectors
from ThesisPackage.RL.Centralized_PPO.multi_ppo import PPO_Multi_Agent_Centralized

def make_env(sequence_length):
    # Create a single Collectors environment instance.
    # (The constructor arguments shown here are an assumption; adapt them to your configuration.)
    return Collectors(sequence_length=sequence_length)

if __name__ == "__main__":
    num_envs = 64
    seed = 1
    total_timesteps = 6000000000
    sequence_length = 1

    # Build a batch of environments and train a centralized PPO agent on them.
    envs = [make_env(sequence_length) for i in range(num_envs)]
    agent = PPO_Multi_Agent_Centralized(envs, device="cpu")
    agent.train(total_timesteps, tensorboard_folder="OneHot", exp_name=f"collect_seq_{sequence_length}", anneal_lr=True, learning_rate=0.001, num_checkpoints=60)
    agent.save(f"models/collectors_seq_{sequence_length}")
```

If you want to train your setup as a self-play sender-receiver setup, you can do it like this:
```python
from ThesisPackage.Environments.multi_pong_sender_receiver import PongEnvSenderReceiver
from ThesisPackage.RL.Decentralized_PPO.multi_ppo import PPO_Multi_Agent

def make_env(seed, vocab_size, sequence_length, max_episode_steps):
    # Create a single self-play Pong environment with a sender-receiver language channel.
    env = PongEnvSenderReceiver(width=20, height=20, vocab_size=vocab_size, sequence_length=sequence_length, max_episode_steps=max_episode_steps, self_play=True, receiver="paddle_2", mute_method="zero")
    return env

if __name__ == "__main__":
    i = 4
    num_envs = 64
    seed = 1
    sequence_length = i
    vocab_size = 3
    max_episode_steps = 2048
    total_timesteps = 150000000

    # Build a batch of environments and train a decentralized PPO agent on them.
    envs = [make_env(seed, vocab_size, sequence_length, max_episode_steps) for i in range(num_envs)]
    agent = PPO_Multi_Agent(envs)
    agent.train(total_timesteps, exp_name="multi_pong_sender_receiver")
    agent.save(f"models/multi_pong_test_sender_receiver_{i}")
```

The Pong environment used in this example behaves as follows:
- The ball moves according to its current direction. If it hits the top or bottom wall, its vertical direction reverses, simulating a bounce.
- When the ball hits the left wall, its horizontal direction reverses, making it bounce back into play.
- Initially, all paddles have a reward of 0, indicating that no paddle has earned a reward yet.
- The environment checks if the ball is at the right edge and in line with any paddle. If a paddle successfully hits the ball (the ball's vertical position aligns with the paddle's position), that paddle receives a reward of 1, rewarding the paddle for hitting the ball back.
- The rewards are calculated based on the ball's interaction with the paddles and the walls, focusing on rewarding paddles for successfully hitting the ball back into play (a minimal code sketch of these rules follows below).
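
To make the bounce and reward rules above concrete, here is a minimal sketch of how such a step update could look. This is not the environment's actual implementation; the attribute names (`ball`, `paddles`, `width`, `height`) are assumptions chosen for illustration.

```python
def step_rewards(ball, paddles, width, height):
    """Sketch of the bounce and reward rules described above (assumed names)."""
    rewards = {name: 0.0 for name in paddles}  # initially, every paddle has a reward of 0

    # Bounce off the top and bottom walls: the vertical direction reverses.
    if ball["y"] <= 0 or ball["y"] >= height - 1:
        ball["dy"] *= -1

    # Bounce off the left wall: the horizontal direction reverses.
    if ball["x"] <= 0:
        ball["dx"] *= -1

    # At the right edge, a paddle whose position lines up with the ball hits it back and earns +1.
    if ball["x"] >= width - 1:
        for name, paddle_y in paddles.items():
            if paddle_y == ball["y"]:
                rewards[name] = 1.0
                ball["dx"] *= -1

    return rewards
```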
The language channel uses discrete language tokens. The length of the language channel (the sequence length) and the vocabulary size are passed to the Pong environment as hyperparameters.
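
As an illustration of what these two hyperparameters control, the sketch below builds a discrete message space with `sequence_length` token slots, each drawn from a vocabulary of size `vocab_size`. Whether the environment represents the channel with `gymnasium.spaces.MultiDiscrete` internally is an assumption made here only for illustration.

```python
import gymnasium as gym

vocab_size = 3        # number of distinct tokens available per slot
sequence_length = 4   # number of token slots in each message

# One possible representation of the language channel: a fixed-length sequence of
# discrete tokens, each taking a value in [0, vocab_size).
language_space = gym.spaces.MultiDiscrete([vocab_size] * sequence_length)

message = language_space.sample()  # e.g. array([2, 0, 1, 2])
print(message.shape)               # (sequence_length,)
```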
Cornelius Wolff - cowolff@uos.de

