Task description here. Essentially, this repo reimplements ICM WITHOUT the logical consistency fix and runs it on a subset of the TruthfulQA dataset, feeding the ICM-labeled examples as few-shot demonstrations to a base model.
Across four ICM runs, accuracy ranged from 56% to 60% on 100 examples. All intermediate results are saved in results/icm_history.json.
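For context, here is a minimal sketch of how ICM-labeled examples might be assembled into a few-shot prompt for the base model. The helper name, field layout, and prompt template are illustrative assumptions, not the repo's exact code:

```python
def build_few_shot_prompt(demos, question, claim):
    """Assemble a few-shot truthfulness prompt from ICM-labeled examples.

    `demos` is assumed to be a list of (question, claim, label) triples taken
    from an ICM run; the actual prompt template in this repo may differ.
    """
    blocks = []
    for demo_question, demo_claim, demo_label in demos:
        blocks.append(
            f"Question: {demo_question}\n"
            f"Claim: {demo_claim}\n"
            f"I think this claim is {demo_label}"
        )
    # The final block is left open so the base model completes the label.
    blocks.append(f"Question: {question}\nClaim: {claim}\nI think this claim is")
    return "\n\n".join(blocks)
```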
The results may not exactly match the paper's numbers due to:
- Different models used
- In-context learning
- Randomness in LLM generation and throughout the ICM pipeline
- This ICM version doesn't include the logical consistency fix (a sketch of the resulting simplified search loop is shown below)
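For reference, the simplified search this version performs can be sketched as a simulated-annealing-style loop that flips labels to maximize mutual predictability only, with no consistency-fix step. The function and parameter names (`score_fn`, `n_iters`, the label strings, etc.) are illustrative assumptions, not the repo's actual interface:

```python
import math
import random

def icm_search(examples, score_fn, n_iters=1000, t0=1.0, cooling=0.99, seed=0):
    """Simplified ICM-style label search WITHOUT the logical consistency fix.

    `score_fn(labels)` is assumed to return the mutual-predictability score:
    the sum of log-probabilities the model assigns to each label when all
    other (example, label) pairs are shown as in-context demonstrations.
    """
    rng = random.Random(seed)
    labels = [rng.choice(["True", "False"]) for _ in examples]  # random initialization
    score = score_fn(labels)
    temperature = t0
    history = [{"step": -1, "score": score}]

    for step in range(n_iters):
        i = rng.randrange(len(examples))                            # pick one example
        proposal = list(labels)
        proposal[i] = "False" if proposal[i] == "True" else "True"  # flip its label
        new_score = score_fn(proposal)
        # Accept improvements outright; accept worse proposals with a
        # temperature-dependent probability so early steps can explore.
        if new_score >= score or rng.random() < math.exp((new_score - score) / temperature):
            labels, score = proposal, new_score
        temperature *= cooling
        history.append({"step": step, "score": score})

    return labels, history
```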
Other notes:
- Zero-shot with the chat model (Llama-3.1-405B-instruct) initially gave lower accuracy (50%) than the base model (~65%), because 25-30% of its responses were empty and had to be skipped when parsing labels. Accuracy improved after adding a retry mechanism for empty responses (a sketch of the retry logic is shown below).
- Requires Python >=3.10, <=3.13 (tested with Python 3.12, as specified in .python-version).
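The retry logic is roughly of this shape (a minimal sketch; `client.complete` stands in for whatever Hyperbolic API call the repo actually makes, so the interface is assumed here):

```python
import time

def complete_with_retry(client, prompt, max_retries=3, backoff_seconds=2.0):
    """Retry wrapper for empty chat-model completions.

    `client.complete(prompt)` is a placeholder for the actual chat-completion
    call; it is assumed to return the generated text as a string.
    """
    for attempt in range(max_retries + 1):
        text = client.complete(prompt)
        if text and text.strip():                        # non-empty response: done
            return text
        if attempt < max_retries:
            time.sleep(backoff_seconds * (attempt + 1))  # back off before retrying
    return ""  # still empty after retries: caller skips this example when parsing labels
```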
- Get uv if you haven't already; it is used to manage packages and run scripts.
- Fork/clone the repo.
- Run the setup code in the terminal:

```
chmod +x setup.sh
sh setup.sh
```

You should now be in the virtual environment named "praxis-sprint-icm". If not, manually activate it with

```
source .venv/bin/activate
```

- Go to .env and fill in the secrets. You need a Hyperbolic API key (an illustrative example is shown below).
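A .env might look like the following; the variable name here is an illustrative guess, so keep whatever placeholder names the provided .env already uses:

```
# Illustrative only - use the variable names already present in .env
HYPERBOLIC_API_KEY=your-hyperbolic-api-key
```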
To run the full pipeline, from data loading and ICM prediction through evaluation and figure generation, run

```
uv run src/main.py
```

You can also run each evaluation scenario separately:

```
uv run src/main.py --<scenario>
```

where <scenario> can be one of zero_shot_base, zero_shot_chat, few_shot_golden, or few_shot_icm; afterwards, generate the figure with the full-pipeline command above. Note that running few_shot_icm will also run ICM prediction if it has not been done before, which can take around 30-40 minutes for the test set.
Run with -h to see all options, or check src/main.py for details.
```
uv run src/main.py -h
```

CRITICAL: Change the seed and other configurations in src/init.py if you want different ICM results. The current seed was set after the reported results were generated, so re-running with it may not reproduce them exactly.
