Author: gadwant
This repository contains the official implementation of the Chain-of-Logic Coverage Evaluator (CoL-CE), a framework for assessing reasoning validity in Retrieval-Augmented Generation (RAG) systems.
CoL-CE evaluates whether the logical connections (edges) in a generated answer's reasoning graph are explicitly supported by the retrieved context, as distinct from checking only atomic facts.
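The core idea, verifying reasoning *edges* rather than isolated facts, can be sketched as follows. This is a minimal illustration only: the names `ReasoningEdge` and `edge_supported` are hypothetical, and the actual verifier in `src/verifier.py` uses a Mistral model rather than string matching.

```python
from dataclasses import dataclass

@dataclass
class ReasoningEdge:
    """One logical step (premise -> conclusion) in the answer's reasoning graph."""
    premise: str
    conclusion: str

def edge_supported(edge: ReasoningEdge, context: str) -> bool:
    """Toy stand-in for the LLM edge verifier: an edge counts as supported
    only if BOTH of its endpoints appear in the retrieved context."""
    text = context.lower()
    return edge.premise.lower() in text and edge.conclusion.lower() in text

context = "Paris is the capital of France. France is in Europe."
edges = [
    ReasoningEdge("Paris is the capital of France", "France is in Europe"),
    ReasoningEdge("Paris is the capital of France", "Paris is in Asia"),  # unsupported
]
verdicts = [edge_supported(e, context) for e in edges]
print(verdicts)  # -> [True, False]
```

A fact-level checker would score the second answer's atoms individually; CoL-CE instead asks whether each *connection* between claims is licensed by the context.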
```
code/
├── src/                       # Source code
│   ├── run_reverification.py  # Main evaluation script (N=30)
│   ├── run_experiment.py      # Legacy experiment runner
│   ├── data_loader.py         # HotpotQA loader
│   ├── extractor.py           # DeepSeek-R1 logic extractor
│   ├── verifier.py            # Mistral edge verifier
│   └── experiment.py          # Core experiment logic
├── results/                   # Experimental data
│   ├── reverified/            # Final N=30 strict results (paper source)
│   │   ├── reverified_results.json
│   │   └── reverified_stats.json
│   ├── enhanced/              # Raw Llama-3 outputs
│   └── legacy/                # Pilot study data
├── PROMPTS.md                 # Exact prompt templates used
├── requirements.txt           # Python dependencies
└── README.md                  # This file
```
Prerequisites:
- Python 3.12+
- Ollama installed and running locally.
Install Models: Pull the required quantized models via Ollama:

```
ollama pull deepseek-r1:latest   # Logic Extractor
ollama pull mistral:latest       # Edge Verifier
ollama pull llama3:latest        # Answer Generator
```
Install Dependencies:

```
pip install -r requirements.txt
```
To reproduce the paper's final N=30 "Re-Verification" results (Strict Protocol + Factual Coverage Baseline):

```
python3 src/run_reverification.py
```

- Input: loads pre-generated answers from `results/enhanced/`.
- Output: saves strict verification results to `results/reverified/`.
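The re-verification pass boils down to: for each generated answer, verify every extracted reasoning edge against its retrieved context and record the verdicts. The sketch below stubs this out; the function name `strict_reverify`, the record schema, and the lambda verifier are assumptions for illustration, not the repo's actual API (the real verifier is a Mistral call).

```python
import json

def strict_reverify(records, verify_edge):
    """For each answer record, verify every extracted edge against its context.
    `verify_edge(edge, context) -> bool` stands in for the Mistral edge verifier."""
    out = []
    for rec in records:
        verdicts = [verify_edge(e, rec["context"]) for e in rec["edges"]]
        out.append({
            "id": rec["id"],
            "verdicts": verdicts,
            "evr": sum(verdicts) / len(verdicts) if verdicts else 0.0,
        })
    return out

# Tiny worked example with a trivial verifier stub:
records = [{"id": "q1", "context": "A implies B.", "edges": ["A -> B", "B -> C"]}]
results = strict_reverify(records, lambda e, ctx: e == "A -> B")
print(json.dumps(results))  # one record with per-edge verdicts and EVR = 0.5
```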
To start from scratch (generate new answers):

```
python3 src/run_experiment.py
```

| Metric | Result (N=30) | Interpretation |
|---|---|---|
| Edge Verification Rate (EVR) | 60.0% | Only 60% of logic is supported. |
| Logic Coverage Score (LCS) | 59.2% | ~40% of reasoning steps are hallucinations. |
| Factual Coverage (FC) | 90.9% | Baseline. Atomic facts are mostly correct. |
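The aggregate metrics can be derived from per-edge verdicts. One plausible formulation, which also explains why EVR and LCS can diverge slightly, is shown below; the exact definitions live in the repo's code, and this sketch assumes EVR is micro-averaged over all edges while LCS is the macro-average of per-question coverage.

```python
def edge_verification_rate(per_question_verdicts):
    """Micro-average: verified edges / total edges, pooled across questions."""
    all_v = [v for q in per_question_verdicts for v in q]
    return sum(all_v) / len(all_v)

def logic_coverage_score(per_question_verdicts):
    """Macro-average: mean of each question's own edge-coverage fraction."""
    return sum(sum(q) / len(q) for q in per_question_verdicts) / len(per_question_verdicts)

# Two questions: 2/3 edges verified, then 1/2 edges verified.
verdicts = [[True, True, False], [True, False]]
print(round(edge_verification_rate(verdicts), 3))  # 3/5 = 0.6
print(round(logic_coverage_score(verdicts), 3))    # mean(2/3, 1/2) ~= 0.583
```

Questions with many edges weigh more in the micro-average, which is why a pooled EVR (60.0%) need not equal the per-question LCS (59.2%).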
See results/reverified/reverified_stats.json for detailed statistics.
See PROMPTS.md for the exact prompt templates used for generation, logic extraction, and strict verification.
MIT License. See LICENSE file for details.