Repository for running self-correction and self-diagnosis experiments on the BBQ dataset. The entry point is `experiments_final.py`, which orchestrates dozens of `run_experiment.py` calls across models, discourse templates, and bias slices.
- `experiments_final.py`: kicks off the full paper sweep (multiple LLMs, seeds, discourse constructions, and capabilities).
- `run_experiment.py`: single-run driver that builds datasets, fine-tunes with LoRA, and evaluates self-correction/self-diagnosis.
- `discourse_construction.py` & `constructionConfig.py`: prompt/discourse templates used to synthesize training/eval text.
- `processor.py`, `evaluator.py`, `sft_finetuning.py`: data processing, evaluation routines, and TRL-based fine-tuning.
- `experiment/dataset/`: BBQ splits already materialized as `bbq.<bias>.{train,test}.json`.
- Python 3.10+ with CUDA-capable GPUs recommended.
- Install deps (minimal set):

  ```
  pip install torch transformers datasets trl peft accelerate evaluate google-api-python-client tqdm scipy
  ```
- Populate `api_tokens.py` with your Hugging Face and API keys (required for gated models such as Llama/Mistral).
- Verify data is present under `experiment/dataset/`; add or swap in new BBQ splits if needed.
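The repository does not ship `api_tokens.py`, so you create it yourself. A minimal sketch is below — the variable names are assumptions for illustration; check what `utils.py`/`run_experiment.py` actually import before copying:

```python
# api_tokens.py -- hypothetical layout; the exact variable names this repo
# expects may differ, so verify against the imports in utils.py.

# Hugging Face access token (needed for gated models such as Llama/Mistral).
HF_TOKEN = "hf_your_token_here"  # placeholder -- substitute your real token

# Key for the Google API client installed above; leave empty if unused.
GOOGLE_API_KEY = ""
```

Keep this file out of version control so tokens are never committed.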
- Baseline evaluation only:

  ```
  python run_experiment.py --llm llama-3.2-3b-instruct --benchmark bbq.gender --baseline_only --eval_only
  ```
- Fine-tune and evaluate self-correction on SES bias:

  ```
  python run_experiment.py --llm llama-3.2-1b-instruct --benchmark bbq.SES \
    --num_train_epochs 10 --batch_size 32 \
    --discourse_construction situation-statement-action1-action2groundTruth \
    --epoch_wise_eval
  ```
- Self-diagnosis variant: add `--capability selfdiagnosis`.
- Cross-capability eval: append `--cross_capability_evaluation`.
- Choose different discourse templates via `--discourse_construction ...` (see `discourse_construction.py` for names).
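For reference, the flag surface exercised by the commands above can be mirrored with an `argparse` sketch. This is not the repo's actual parser — flag names are taken from this README, while defaults and the `--capability` choices are assumptions:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Mirror of the run_experiment.py flags shown in this README.
    Defaults here are illustrative guesses, not the repo's real values."""
    p = argparse.ArgumentParser(description="single BBQ self-correction/diagnosis run")
    p.add_argument("--llm", required=True)                    # e.g. llama-3.2-1b-instruct
    p.add_argument("--benchmark", required=True)              # e.g. bbq.SES
    p.add_argument("--capability", default="selfcorrection")  # or "selfdiagnosis" (assumed default)
    p.add_argument("--discourse_construction")
    p.add_argument("--num_train_epochs", type=int, default=10)
    p.add_argument("--batch_size", type=int, default=32)
    p.add_argument("--baseline_only", action="store_true")
    p.add_argument("--eval_only", action="store_true")
    p.add_argument("--epoch_wise_eval", action="store_true")
    p.add_argument("--cross_capability_evaluation", action="store_true")
    p.add_argument("--output_dir")
    return p

args = build_parser().parse_args(
    ["--llm", "llama-3.2-3b-instruct", "--benchmark", "bbq.gender",
     "--baseline_only", "--eval_only"]
)
print(args.llm, args.baseline_only)  # → llama-3.2-3b-instruct True
```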
To launch the exact sweep used in the paper (multiple LLMs, seeds, templates, and cross-capability checks), run:
```
python experiments_final.py
```

This script queues many fine-tuning/eval jobs; ensure you have sufficient GPU time and disk space. Intermediate checkpoints are stored under `experiment/finetuning/<llm>/<capability>_<discourse>/` and cleaned up after evaluation.
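If you only need a slice of the sweep, a minimal driver in the spirit of `experiments_final.py` can be sketched as follows. The grid values and the subprocess pattern here are assumptions for illustration — consult the actual script for the full model/seed/template grid used in the paper:

```python
import itertools
import subprocess
import sys

# Illustrative grid -- the real sweep covers more models, seeds, and templates.
LLMS = ["llama-3.2-1b-instruct", "llama-3.2-3b-instruct"]
CAPABILITIES = ["selfcorrection", "selfdiagnosis"]
BENCHMARKS = ["bbq.gender", "bbq.SES"]

def build_cmd(llm, capability, benchmark):
    """Assemble one run_experiment.py invocation as an argv list."""
    return [
        sys.executable, "run_experiment.py",
        "--llm", llm,
        "--benchmark", benchmark,
        "--capability", capability,
        "--num_train_epochs", "10",
        "--batch_size", "32",
    ]

if __name__ == "__main__":
    for llm, cap, bench in itertools.product(LLMS, CAPABILITIES, BENCHMARKS):
        cmd = build_cmd(llm, cap, bench)
        print("launching:", " ".join(cmd))
        # subprocess.run(cmd, check=True)  # uncomment to actually execute
```

Running jobs sequentially like this keeps one fine-tuning run on the GPU at a time; `check=True` aborts the sweep on the first failed run.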
- Use `--eval_only` when you only want zero-shot baselines.
- Set `--output_dir` explicitly if you prefer to keep checkpoints.
- Some models require Hugging Face authentication (`huggingface_hub.login` is triggered inside `utils.py`).
- GPU memory varies by model size; adjust `--batch_size` or pick smaller LLMs if you hit OOM.