Author: gadwant
This repository contains the official implementation of the Chain-of-Logic Coverage Evaluator (CoL-CE), a framework for assessing reasoning validity in Retrieval-Augmented Generation (RAG) systems.
CoL-CE evaluates whether the logical connections (edges) in a generated answer's reasoning graph are explicitly supported by the retrieved context, as distinct from checking only atomic facts.
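The core idea, verifying reasoning *edges* rather than isolated facts, can be sketched as follows. This is a minimal illustration only: the names `ReasoningEdge` and `edge_supported` are hypothetical, and the actual verifier in `src/verifier.py` uses a Mistral model rather than string matching.

```python
from dataclasses import dataclass

@dataclass
class ReasoningEdge:
    """One logical step (premise -> conclusion) in the answer's reasoning graph."""
    premise: str
    conclusion: str

def edge_supported(edge: ReasoningEdge, context: str) -> bool:
    """Toy stand-in for the LLM edge verifier: an edge counts as supported
    only if BOTH of its endpoints appear in the retrieved context."""
    text = context.lower()
    return edge.premise.lower() in text and edge.conclusion.lower() in text

context = "Paris is the capital of France. France is in Europe."
edges = [
    ReasoningEdge("Paris is the capital of France", "France is in Europe"),
    ReasoningEdge("Paris is the capital of France", "Paris is in Asia"),  # unsupported
]
verdicts = [edge_supported(e, context) for e in edges]
print(verdicts)  # -> [True, False]
```

A fact-level checker would score the second answer's atoms individually; CoL-CE instead asks whether each *connection* between claims is licensed by the context.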
```
code/
├── src/                       # Source code
│   ├── run_reverification.py  # Main evaluation script (N=30)
│   ├── run_experiment.py      # Legacy experiment runner
│   ├── data_loader.py         # HotpotQA loader
│   ├── extractor.py           # DeepSeek-R1 logic extractor
│   ├── verifier.py            # Mistral edge verifier
│   └── experiment.py          # Core experiment logic
├── results/                   # Experimental data
│   ├── reverified/            # Final N=30 strict results (paper source)
│   │   ├── reverified_results.json
│   │   └── reverified_stats.json
│   ├── enhanced/              # Raw Llama-3 outputs
│   └── legacy/                # Pilot study data
├── PROMPTS.md                 # Exact prompt templates used
├── requirements.txt           # Python dependencies
└── README.md                  # This file
```
Prerequisites:
- Python 3.12+
- Ollama installed and running locally.
Install Models: Pull the required quantized models via Ollama:

```
ollama pull deepseek-r1:latest   # Logic Extractor
ollama pull mistral:latest       # Edge Verifier
ollama pull llama3:latest        # Answer Generator
```
Install Dependencies:

```
pip install -r requirements.txt
```
To reproduce the paper's final N=30 "Re-Verification" results (Strict Protocol + Factual Coverage Baseline):

```
python3 src/run_reverification.py
```

- Input: loads pre-generated answers from `results/enhanced/`.
- Output: saves strict verification results to `results/reverified/`.
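The re-verification pass boils down to: for each generated answer, verify every extracted reasoning edge against its retrieved context and record the verdicts. The sketch below stubs this out; the function name `strict_reverify`, the record schema, and the lambda verifier are assumptions for illustration, not the repo's actual API (the real verifier is a Mistral call).

```python
import json

def strict_reverify(records, verify_edge):
    """For each answer record, verify every extracted edge against its context.
    `verify_edge(edge, context) -> bool` stands in for the Mistral edge verifier."""
    out = []
    for rec in records:
        verdicts = [verify_edge(e, rec["context"]) for e in rec["edges"]]
        out.append({
            "id": rec["id"],
            "verdicts": verdicts,
            "evr": sum(verdicts) / len(verdicts) if verdicts else 0.0,
        })
    return out

# Tiny worked example with a trivial verifier stub:
records = [{"id": "q1", "context": "A implies B.", "edges": ["A -> B", "B -> C"]}]
results = strict_reverify(records, lambda e, ctx: e == "A -> B")
print(json.dumps(results))  # one record with per-edge verdicts and EVR = 0.5
```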
To start from scratch (generate new answers):

```
python3 src/run_experiment.py
```

| Metric | Result (N=30) | Interpretation |
|---|---|---|
| Edge Verification Rate (EVR) | 60.0% | Only 60% of logic is supported. |
| Logic Coverage Score (LCS) | 59.2% | ~40% of reasoning steps are hallucinations. |
| Factual Coverage (FC) | 90.9% | Baseline. Atomic facts are mostly correct. |
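The aggregate metrics can be derived from per-edge verdicts. One plausible formulation, which also explains why EVR and LCS can diverge slightly, is shown below; the exact definitions live in the repo's code, and this sketch assumes EVR is micro-averaged over all edges while LCS is the macro-average of per-question coverage.

```python
def edge_verification_rate(per_question_verdicts):
    """Micro-average: verified edges / total edges, pooled across questions."""
    all_v = [v for q in per_question_verdicts for v in q]
    return sum(all_v) / len(all_v)

def logic_coverage_score(per_question_verdicts):
    """Macro-average: mean of each question's own edge-coverage fraction."""
    return sum(sum(q) / len(q) for q in per_question_verdicts) / len(per_question_verdicts)

# Two questions: 2/3 edges verified, then 1/2 edges verified.
verdicts = [[True, True, False], [True, False]]
print(round(edge_verification_rate(verdicts), 3))  # 3/5 = 0.6
print(round(logic_coverage_score(verdicts), 3))    # mean(2/3, 1/2) ~= 0.583
```

Questions with many edges weigh more in the micro-average, which is why a pooled EVR (60.0%) need not equal the per-question LCS (59.2%).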
See results/reverified/reverified_stats.json for detailed statistics.
See PROMPTS.md for the exact prompt templates used for generation, logic extraction, and strict verification.
MIT License. See LICENSE file for details.