
Fine tune Model with RLHF to Improve Agent Behavior #204

@kardSIM

Hi @PWhiddy

I like this project a lot, and as someone whose childhood was shaped by Pokémon, I find this work incredibly exciting!
While testing the pretrained model, I noticed that the agent rarely reaches Cerulean City and often gets stuck in confined areas like Pallet Town after a few steps.

I’ve been experimenting with adapting a Reinforcement Learning from Human Feedback (RLHF) approach to fine-tune the model. The goal is to correct some of the model’s behaviors (e.g., getting stuck in loops or confined spaces) by incorporating human feedback into the training process.

Here’s a high-level overview of the approach (rough sketches of each step follow the list):

1. Enable db_path in the baseline to record random segments of gameplay.
2. Use NiceGUI to annotate these segments with human feedback.
3. Train a reward model with train_reward.py on the annotated data.
4. Enable reward_path so the agent relies on the human-feedback-based reward model instead of the default state reward.
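
For step 1, the recording logic is roughly like the sketch below (simplified and self-contained, not the exact code in the branch; SegmentRecorder, SEGMENT_LEN, and the one-.npz-file-per-segment storage are just illustrative choices):

```python
import random
from pathlib import Path

import numpy as np

START_PROB = 0.00005   # chance per step of starting a new recorded segment
SEGMENT_LEN = 64       # illustrative segment length in frames

class SegmentRecorder:
    """Probabilistically records short gameplay segments for later annotation."""

    def __init__(self, db_path):
        self.db_path = Path(db_path)
        self.db_path.mkdir(parents=True, exist_ok=True)
        self.buffer = None  # None means "not currently recording"

    def maybe_record(self, frame):
        # Occasionally start a new segment; otherwise keep appending to the open one.
        if self.buffer is None and random.random() < START_PROB:
            self.buffer = []
        if self.buffer is not None:
            self.buffer.append(np.asarray(frame))
            if len(self.buffer) >= SEGMENT_LEN:
                self._flush()

    def _flush(self):
        # One compressed file per segment; a real database could be used instead.
        out = self.db_path / f"segment_{random.getrandbits(32):08x}.npz"
        np.savez_compressed(out, frames=np.stack(self.buffer))
        self.buffer = None
```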
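
For step 2, the NiceGUI annotation tool can stay very small. A rough sketch, assuming one preview .gif per recorded segment and a JSON file for the labels (the directory layout and file names here are placeholders):

```python
import json
from pathlib import Path

from nicegui import ui

SEGMENT_DIR = Path("segments")   # one preview .gif per recorded segment (placeholder layout)
LABEL_FILE = Path("labels.json")

segments = sorted(SEGMENT_DIR.glob("*.gif"))  # assumes at least one segment exists
labels = {}
state = {"i": 0}

def rate(score: int):
    # Store the rating for the current segment and advance to the next one.
    labels[segments[state["i"]].name] = score
    LABEL_FILE.write_text(json.dumps(labels, indent=2))
    state["i"] = min(state["i"] + 1, len(segments) - 1)
    viewer.set_source(str(segments[state["i"]]))

viewer = ui.image(str(segments[state["i"]]))
with ui.row():
    ui.button("Good", on_click=lambda: rate(1))
    ui.button("Bad", on_click=lambda: rate(0))

ui.run()
```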
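
Step 3 follows the usual RLHF reward-model recipe: a small network scores observations and is trained so that segments the annotator preferred get a higher summed score (a Bradley-Terry style loss when comparing two segments). Below is a simplified, self-contained sketch with an MLP over flattened frames; the real screen observations would probably want a small CNN instead, and this is the general recipe rather than the exact contents of train_reward.py:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Scores a single observation; a segment's reward is the sum over its frames."""

    def __init__(self, obs_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)

def preference_loss(model, seg_a, seg_b, pref):
    """Bradley-Terry loss: pref = 1.0 if the annotator preferred segment A, else 0.0.

    seg_a, seg_b: (batch, seg_len, obs_dim) tensors of flattened frames.
    """
    r_a = model(seg_a).sum(dim=1)   # (batch,) summed reward of each segment A
    r_b = model(seg_b).sum(dim=1)
    return F.binary_cross_entropy_with_logits(r_a - r_b, pref)

# Tiny smoke test on random data, just to show the shapes involved.
if __name__ == "__main__":
    obs_dim = 72 * 80  # placeholder for a flattened, downscaled screen
    model = RewardModel(obs_dim)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    seg_a = torch.rand(8, 64, obs_dim)
    seg_b = torch.rand(8, 64, obs_dim)
    pref = torch.randint(0, 2, (8,)).float()
    loss = preference_loss(model, seg_a, seg_b, pref)
    loss.backward()
    opt.step()
    print(float(loss))
```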
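
Step 4 then swaps the environment's state-based reward for the learned model's score. A hypothetical gymnasium-style wrapper just to illustrate the idea (not the exact mechanism behind reward_path in the branch):

```python
import gymnasium as gym
import numpy as np
import torch

class LearnedRewardWrapper(gym.Wrapper):
    """Replaces the env's default state reward with a learned reward model's score."""

    def __init__(self, env, reward_model, device: str = "cpu"):
        super().__init__(env)
        self.device = device
        self.reward_model = reward_model.to(device).eval()

    def step(self, action):
        obs, _default_reward, terminated, truncated, info = self.env.step(action)
        with torch.no_grad():
            flat = torch.as_tensor(
                np.asarray(obs, dtype=np.float32).flatten(), device=self.device
            )
            reward = float(self.reward_model(flat.unsqueeze(0)))
        return obs, reward, terminated, truncated, info
```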

Here are the changes I made: https://github.com/kardSIM/PokemonRedExperiments/tree/rlhf

I tried to run it myself, but due to hardware limitations (I only have 14 CPU cores) I'm not getting anywhere.

I still don’t know which hyperparameters to experiment with. START_PROB = 0.00005 controls how many gameplay segments get recorded. Also, during fine-tuning, the learning rate is reduced to keep the model from forgetting its pretrained behavior.
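
For the learning-rate part, with stable-baselines3 the saved rate can be overridden when loading the checkpoint, since SB3 refreshes the optimizer's learning rate from lr_schedule on every update. Roughly like this, where FINETUNE_LR and the checkpoint path are placeholders and env is the vectorized environment built the same way as in the baseline training script:

```python
from stable_baselines3 import PPO

FINETUNE_LR = 3e-5  # placeholder: some fraction of the original learning rate

model = PPO.load(
    "session/poke_checkpoint.zip",   # placeholder checkpoint path
    env=env,                         # same vectorized env as the baseline run (not shown here)
    custom_objects={
        "learning_rate": FINETUNE_LR,
        "lr_schedule": lambda _: FINETUNE_LR,
    },
)
model.learn(total_timesteps=1_000_000)
```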

I’d love to get your feedback on this approach and to hear whether you’ve tried something similar. If you think it’s promising, I’d be happy to collaborate further and refine the implementation.

Looking forward to hearing from you

Best,
