This repository contains the official implementation of the paper "CLARity: Reasoning Consistency Alone Can Teach Reinforced Experts".
- Install Verl (https://github.com/volcengine/verl). It is recommended to use the Docker image for installation.
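  If you prefer installing from source instead of Docker, a minimal sketch (assuming a CUDA-ready Python environment; see the Verl docs for the recommended image and version pins):

  ```bash
  git clone https://github.com/volcengine/verl.git
  cd verl
  pip install -e .
  ```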
- Place `src/training/stage-2/batch.py` into the `verl/workers/reward_manager` folder, overwriting the existing file.
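  A minimal sketch, assuming this repository and the Verl checkout sit side by side (adjust the paths to your layout):

  ```bash
  cp src/training/stage-2/batch.py ../verl/verl/workers/reward_manager/batch.py
  ```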
- Prepare the training data: Run `src/data_preparation/{jec-qa/medqa-usmle}.py` to convert the data into `.parquet` format.
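  For example, to produce the `.parquet` files for both datasets (assuming the scripts run without required arguments; check each script for its input/output paths):

  ```bash
  python src/data_preparation/jec-qa.py
  python src/data_preparation/medqa-usmle.py
  ```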
- Begin training:
  - Stage 1: Modify the training script by setting `custom_reward_function.path=src/training/stage-1/{law/med}_faithfulness_rule.py`, then start training.
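    A hedged sketch of a Stage-1 launch using Verl's standard PPO entry point, shown for the medical track; the data paths are illustrative placeholders:

    ```bash
    python3 -m verl.trainer.main_ppo \
        data.train_files=data/medqa-usmle/train.parquet \
        data.val_files=data/medqa-usmle/val.parquet \
        custom_reward_function.path=src/training/stage-1/med_faithfulness_rule.py
    ```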
  - Stage 2:
    - Modify the training script by setting `custom_reward_function.path=src/training/stage-2/{law/med}_faithfulness_model_batch.py`.
    - Set `reward_model.reward_manager=batch` in the training script and then start training.
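    A corresponding sketch for Stage 2, again assuming Verl's standard PPO entry point (data paths are placeholders):

    ```bash
    python3 -m verl.trainer.main_ppo \
        data.train_files=data/medqa-usmle/train.parquet \
        data.val_files=data/medqa-usmle/val.parquet \
        custom_reward_function.path=src/training/stage-2/med_faithfulness_model_batch.py \
        reward_model.reward_manager=batch
    ```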
Repository structure:

- `data`:
  - `original_data`: Contains the full original dataset for `jec-qa` and `medqa-usmle`.
  - `reformulated_data`: Contains the augmented data generated using our dynamic data reformulation method.
- `src/data_preparation`: Code for data preparation. It converts `.json`-format data into `.parquet` format and adds system prompts for use with the Verl framework.
- `src/data_reformulation`: Contains the full source code for our dynamic data reformulation method (execution order: concatenate -> polish -> diversify -> aggregate).
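  A hedged sketch of running the pipeline in the stated order, assuming each step is a standalone script named after its stage (actual file names and arguments may differ):

  ```bash
  python src/data_reformulation/concatenate.py
  python src/data_reformulation/polish.py
  python src/data_reformulation/diversify.py
  python src/data_reformulation/aggregate.py
  ```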
- `src/training`: Scripts for CLARity training.
  - `stage-1`: Code for the reward function in the stage-1 (refine) phase.
  - `stage-2`: Code for the reward function and reward manager in the stage-2 (monitor) phase.
If you use this code or find our work helpful, please cite our paper:
```bibtex
@misc{TBD}
```