A robust, evidence-grounded evaluation system for non-canonical medical journal extraction using Semantic N-gram IoU and LLMs.
🎯 Goal
Build an objective, reproducible evaluation harness for extracting messy, non-canonical health data (Symptoms, Food, Emotion, Mind) from user journals. This pipeline uses Semantic Jaccard (Soft-IoU) scoring with multilingual embeddings to handle synonyms, Hinglish, and free-text evidence spans while strictly penalizing hallucinations.
- Evidence-Grounded Extraction: The LLM extracts an `evidence_span` (verbatim quote) alongside structured attributes, so every prediction is traceable back to the source text.
- Semantic N-gram Scorer: Soft-IoU over multilingual sentence-transformer embeddings matches semantically equivalent phrases (e.g., `dard` ≈ `pain`).
- Restraint Mechanism: Penalizes verbose or hallucinated outputs by enlarging the union in the Jaccard denominator.
- Safety First: Polarity accuracy (present vs. absent) is tracked separately to avoid dangerous false positives.
- Observability: Produces per-journal JSONL reports for fine-grained debugging.
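To make the evidence-grounding concrete, here is a minimal sketch of the shape of one extracted record. The real schema is a Pydantic model in `src/components/llm.py`; the field names below are assumptions for illustration, shown with a stdlib dataclass so the sketch runs anywhere.

```python
from dataclasses import dataclass

# Hypothetical shape of one extracted entity. The actual schema lives in
# src/components/llm.py as a Pydantic model; field names here are assumptions.
@dataclass
class ExtractedEntity:
    category: str        # one of: Symptoms, Food, Emotion, Mind
    value: str           # normalized attribute, e.g. "stomach pain"
    polarity: str        # "present" or "absent"
    evidence_span: str   # verbatim quote copied from the journal text

entity = ExtractedEntity(
    category="Symptoms",
    value="stomach pain",
    polarity="present",
    evidence_span="pet me dard ho raha hai",
)
```

Because `evidence_span` must be a verbatim quote, any prediction can be checked against the raw journal with a simple substring test.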
```mermaid
graph TD
    classDef input fill:#ff9,stroke:#333,stroke-width:2px,color:black;
    classDef process fill:#00f7ff,stroke:#000,stroke-width:2px,color:black;
    classDef logic fill:#39ff14,stroke:#000,stroke-width:2px,color:black;
    classDef output fill:#ff00ff,stroke:#000,stroke-width:2px,color:white;

    Input(["📂 Input Data<br/>(journals.jsonl + gold.jsonl)"]) --> Stage1

    subgraph Stage1 ["Stage 1: Extraction"]
        LLM["🤖 LLM <br/>(src/components/llm.py)"]
        Parser["📄 Pydantic Parser"]
        LLM --> Parser
    end

    Stage1 --> Extracted["📄 extracted_gold.jsonl"]

    subgraph Stage2 ["Stage 2: Scoring"]
        Extracted --> Scorer["⚖️ Scorer Engine<br/>(src/components/scorer.py)"]
        Gold["🏆 Gold Reference"] --> Scorer
        subgraph Logic ["Core Logic"]
            Ngram["✂️ Bi-gram Decomposition"]
            BERT["🧠 Multilingual BERT"]
            Jaccard["🧮 Semantic Jaccard (Soft-IoU)"]
            Ngram --> BERT --> Jaccard
        end
        Scorer --> Logic
    end

    subgraph Stage3 ["Stage 3: Reporting"]
        Logic --> Metrics["📊 Final Metrics"]
    end

    Metrics --> Summary["✅ score_summary.json"]
    Metrics --> Detail["📝 per_journal_scores.jsonl"]

    class Input input;
    class Stage1,Stage2,Stage3 process;
    class LLM,Parser,Scorer,Logic logic;
    class Summary,Detail output;
```
- Python 3.10+
- A modern CPU (GPU optional but recommended for faster embeddings/LLM)
- Hugging Face token (for any hosted models) if using remote endpoints
Dependencies are listed in `requirements.txt`.
```bash
# Clone the repository
git clone https://github.com/Nossks/EviScore
cd EviScore

# Create a virtual environment
python -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

Create a `.env` file in the project root with your Hugging Face token (if required):

```
HUGGINGFACEHUB_API_TOKEN=hf_xxxxxxxxxxxxxxxxx
```

Other configuration options (model selection, scoring thresholds) live in `ASSUMPTIONS.md` and in the constants in `src/components/scorer.py`.
Execute the full extraction + scoring loop from the CLI entrypoint:
```bash
python main.py
```

All outputs are written to `out/`:

- `score_summary.json` — high-level metrics:

```json
{
  "f1_score": 0.88,
  "precision": 0.92,
  "recall": 0.85,
  "polarity_accuracy": 1.0,
  "bucket_accuracy": 0.78
}
```

- `per_journal_scores.jsonl` — row-level debug data:

```jsonl
{"journal_id": "J006", "f1": 0.45, "note": "Failed on Hindi text"}
{"journal_id": "J007", "f1": 1.0, "note": "Perfect extraction"}
```

Why not exact IoU? Simple word overlap fails for semantically equivalent phrases such as "stomach ache" vs. "tummy pain".
Let G be the set of gold N-grams and P the set of predicted N-grams. Using precomputed multilingual embeddings and cosine similarity, the Soft-IoU is:

```
Soft-IoU(G, P) = ( Σ_{g ∈ G} max_{p ∈ P} cos_sim(g, p) ) / ( |G| + |P| − Intersection )
```
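A minimal, runnable sketch of this computation. Real scoring uses multilingual sentence-transformer embeddings; here `difflib`'s character-overlap ratio stands in for cosine similarity so the example is self-contained, and the 0.8 threshold is an assumption:

```python
from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    # Placeholder for cosine similarity between multilingual embeddings;
    # character-overlap ratio keeps the sketch dependency-free.
    return SequenceMatcher(None, a, b).ratio()

def soft_iou(gold: list[str], pred: list[str], threshold: float = 0.8) -> float:
    if not gold or not pred:
        return 0.0
    best = [max(sim(g, p) for p in pred) for g in gold]    # best match per gold n-gram
    intersection = sum(s for s in best if s >= threshold)  # sum of matched similarities
    return sum(best) / (len(gold) + len(pred) - intersection)

print(soft_iou(["stomach ache"], ["stomach ache"]))         # 1.0
print(soft_iou(["stomach ache"], ["stomach ache", "x y"]))  # 0.5
```

The second call illustrates the restraint mechanism: the hallucinated extra n-gram enlarges |P|, so the denominator grows and the score drops even though the true match is perfect.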
- N-gram decomposition uses bi-grams (N = 2) to preserve local context.
- Embeddings are produced with `paraphrase-multilingual-MiniLM-L12-v2` (or an equivalent multilingual model).
- Intersection is approximated by the sum of matched similarities above a cosine threshold.
- Restraint Mechanism: hallucinated tokens enlarge |P|, increasing the denominator and lowering the score.
- Polarity (present/absent) is scored separately: a wrong polarity incurs a strict penalty.
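The bi-gram decomposition step can be sketched as follows; backing off to raw tokens for one-word spans is an assumption, mirroring the short-span edge case noted in the best practices:

```python
def bigrams(span: str) -> list[str]:
    # Decompose an evidence span into overlapping bi-grams (N = 2).
    # For spans too short to form a bi-gram, fall back to the raw token(s).
    tokens = span.lower().split()
    if len(tokens) < 2:
        return tokens
    return [f"{tokens[i]} {tokens[i + 1]}" for i in range(len(tokens) - 1)]

print(bigrams("Stomach ache since morning"))  # ['stomach ache', 'ache since', 'since morning']
print(bigrams("dard"))                        # ['dard']
```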
```
├── out/                  # Generated outputs (JSON/JSONL)
├── data/                 # Input journals and gold labels
├── src/
│   ├── components/
│   │   ├── llm.py        # LLM extraction logic
│   │   └── scorer.py     # N-gram Semantic Jaccard logic
│   ├── pipelines/
│   │   ├── extraction.py
│   │   └── evaluation.py
│   ├── logger.py         # Custom logging setup
│   └── exception.py      # Error handling
├── main.py               # Runs both pipelines end-to-end
├── requirements.txt      # Project dependencies
├── ASSUMPTIONS.md        # Detailed logic explanation
└── README.md             # This file
```
- Model choices: The scorer is model-agnostic; for embeddings, prefer multilingual sentence-transformers variants.
- Thresholds: Tune the cosine "match" threshold and the restraint penalty on a held-out validation set.
- Edge cases: Short spans (1–2 tokens) can produce noisy embeddings; consider backing off to exact-string matching for very short tokens.
- Performance: Use batch embedding calls (sentence-transformers supports batching) to speed up scoring.
- Reproducibility: Seed all RNGs and log model versions and tokenizer checksums in `score_summary.json`.
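The reproducibility point can be sketched with the stdlib only. In a real run you would also seed numpy/torch and hash the actual tokenizer files; here hashing the model name is a stand-in, and the metadata keys are assumptions:

```python
import hashlib
import json
import random

def run_metadata(seed: int, model_name: str) -> dict:
    # Seed the stdlib RNG; numpy/torch need their own seeds in a real run.
    random.seed(seed)
    return {
        "seed": seed,
        "embedding_model": model_name,
        # Stand-in checksum: hash of the model name, not the real weights.
        "model_checksum": hashlib.sha256(model_name.encode()).hexdigest()[:12],
    }

# Merge run metadata into the summary before writing score_summary.json
summary = {"f1_score": 0.88, **run_metadata(42, "paraphrase-multilingual-MiniLM-L12-v2")}
print(json.dumps(summary, indent=2))
```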
- `ModuleNotFoundError` — ensure the `venv` is activated and `pip install -r requirements.txt` completed successfully.
- Embedding calls too slow — use batching or a local ONNX/accelerated model.
- Low polarity accuracy — check the label format in `gold.jsonl` and whether polarity is annotated as `present`/`absent` or as booleans.
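When debugging low scores, the per-journal report can be filtered directly to find the worst offenders; the 0.5 cutoff is arbitrary:

```python
import json

# Inline sample standing in for open("out/per_journal_scores.jsonl").readlines()
lines = [
    '{"journal_id": "J006", "f1": 0.45, "note": "Failed on Hindi text"}',
    '{"journal_id": "J007", "f1": 1.0, "note": "Perfect extraction"}',
]
worst = [row for row in map(json.loads, lines) if row["f1"] < 0.5]
for row in worst:
    print(row["journal_id"], "-", row["note"])  # J006 - Failed on Hindi text
```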
- Fork the repo
- Create a feature branch
- Open a PR with tests and a clear description
Please follow the existing code style and aim for minimal diffs when a change affects `scorer.py` or `llm.py`.
This project is released under the MIT License. See LICENSE for details.
Aryan