A robust, evidence-grounded evaluation system for non-canonical medical journal extraction using Semantic N-gram IoU and LLMs.
🎯 Goal
Build an objective, reproducible evaluation harness for extracting messy, non-canonical health data (Symptoms, Food, Emotion, Mind) from user journals. This pipeline uses Semantic Jaccard (Soft-IoU) scoring with multilingual embeddings to handle synonyms, Hinglish, and free-text evidence spans while strictly penalizing hallucinations.
- Evidence-Grounded Extraction: The LLM extracts an `evidence_span` (verbatim quote) alongside structured attributes, so every prediction is traceable back to the source text.
- Semantic N-gram Scorer: Soft-IoU over multilingual sentence-transformer embeddings matches semantically equivalent phrases (e.g., `dard` ≈ `pain`).
- Restraint Mechanism: Penalizes verbose or hallucinated outputs by enlarging the union in the Jaccard denominator.
- Safety First: Polarity accuracy (present vs. absent) is tracked separately to avoid dangerous false positives.
- Observability: Produces per-journal JSONL reports for fine-grained debugging.
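To make the evidence-grounding concrete, here is a minimal sketch of the shape of one extracted record. The real schema is a Pydantic model in `src/components/llm.py`; the field names below are assumptions for illustration, shown with a stdlib dataclass so the sketch runs anywhere.

```python
from dataclasses import dataclass

# Hypothetical shape of one extracted entity. The actual schema lives in
# src/components/llm.py as a Pydantic model; field names here are assumptions.
@dataclass
class ExtractedEntity:
    category: str        # one of: Symptoms, Food, Emotion, Mind
    value: str           # normalized attribute, e.g. "stomach pain"
    polarity: str        # "present" or "absent"
    evidence_span: str   # verbatim quote copied from the journal text

entity = ExtractedEntity(
    category="Symptoms",
    value="stomach pain",
    polarity="present",
    evidence_span="pet me dard ho raha hai",
)
```

Because `evidence_span` must be a verbatim quote, any prediction can be checked against the raw journal with a simple substring test.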
```mermaid
graph TD
    classDef input fill:#ff9,stroke:#333,stroke-width:2px,color:black;
    classDef process fill:#00f7ff,stroke:#000,stroke-width:2px,color:black;
    classDef logic fill:#39ff14,stroke:#000,stroke-width:2px,color:black;
    classDef output fill:#ff00ff,stroke:#000,stroke-width:2px,color:white;

    Input(["📂 Input Data<br/>(journals.jsonl + gold.jsonl)"]) --> Stage1

    subgraph Stage1 ["Stage 1: Extraction"]
        LLM["🤖 LLM <br/>(src/components/llm.py)"]
        Parser["📄 Pydantic Parser"]
        LLM --> Parser
    end

    Stage1 --> Extracted["📄 extracted_gold.jsonl"]

    subgraph Stage2 ["Stage 2: Scoring"]
        Extracted --> Scorer["⚖️ Scorer Engine<br/>(src/components/scorer.py)"]
        Gold["🏆 Gold Reference"] --> Scorer
        subgraph Logic ["Core Logic"]
            Ngram["✂️ Bi-gram Decomposition"]
            BERT["🧠 Multilingual BERT"]
            Jaccard["🧮 Semantic Jaccard (Soft-IoU)"]
            Ngram --> BERT --> Jaccard
        end
        Scorer --> Logic
    end

    subgraph Stage3 ["Stage 3: Reporting"]
        Logic --> Metrics["📊 Final Metrics"]
    end

    Metrics --> Summary["✅ score_summary.json"]
    Metrics --> Detail["📝 per_journal_scores.jsonl"]

    class Input input;
    class Stage1,Stage2,Stage3 process;
    class LLM,Parser,Scorer,Logic logic;
    class Summary,Detail output;
```
- Python 3.10+
- A modern CPU (GPU optional but recommended for faster embeddings/LLM)
- Hugging Face token (for any hosted models) if using remote endpoints
Dependencies are listed in `requirements.txt`.
```bash
# Clone the repository
git clone https://github.com/Nossks/EviScore
cd EviScore

# Create a virtual environment
python -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

Create a `.env` file in the project root with your Hugging Face token (if required):

```
HUGGINGFACEHUB_API_TOKEN=hf_xxxxxxxxxxxxxxxxx
```

Other configuration options (model selection, scoring thresholds) live in `ASSUMPTIONS.md` and in the constants in `src/components/scorer.py`.
Execute the full extraction + scoring loop from the CLI entrypoint:
```bash
python main.py
```

All outputs are written to `out/`:

- `score_summary.json` — high-level metrics:

```json
{
  "f1_score": 0.88,
  "precision": 0.92,
  "recall": 0.85,
  "polarity_accuracy": 1.0,
  "bucket_accuracy": 0.78
}
```

- `per_journal_scores.jsonl` — row-level debug data:

```jsonl
{"journal_id": "J006", "f1": 0.45, "note": "Failed on Hindi text"}
{"journal_id": "J007", "f1": 1.0, "note": "Perfect extraction"}
```

Why not exact IoU? Simple word overlap fails for semantically equivalent phrases such as "stomach ache" vs. "tummy pain".
Let G be the set of gold N-grams and P the set of predicted N-grams. Using precomputed multilingual embeddings and cosine similarity, the Soft-IoU is:

```
Soft-IoU(G, P) = ( Σ_{g ∈ G} max_{p ∈ P} cos_sim(g, p) ) / ( |G| + |P| − Intersection )
```
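A minimal, runnable sketch of this computation. Real scoring uses multilingual sentence-transformer embeddings; here `difflib`'s character-overlap ratio stands in for cosine similarity so the example is self-contained, and the 0.8 threshold is an assumption:

```python
from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    # Placeholder for cosine similarity between multilingual embeddings;
    # character-overlap ratio keeps the sketch dependency-free.
    return SequenceMatcher(None, a, b).ratio()

def soft_iou(gold: list[str], pred: list[str], threshold: float = 0.8) -> float:
    if not gold or not pred:
        return 0.0
    best = [max(sim(g, p) for p in pred) for g in gold]    # best match per gold n-gram
    intersection = sum(s for s in best if s >= threshold)  # sum of matched similarities
    return sum(best) / (len(gold) + len(pred) - intersection)

print(soft_iou(["stomach ache"], ["stomach ache"]))         # 1.0
print(soft_iou(["stomach ache"], ["stomach ache", "x y"]))  # 0.5
```

The second call illustrates the restraint mechanism: the hallucinated extra n-gram enlarges |P|, so the denominator grows and the score drops even though the true match is perfect.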
- N-gram decomposition uses bi-grams (N = 2) to preserve local context.
- Embeddings are produced with `paraphrase-multilingual-MiniLM-L12-v2` (or an equivalent multilingual model).
- Intersection is approximated by the sum of matched similarities above a cosine threshold.
- Restraint Mechanism: hallucinated tokens enlarge |P|, increasing the denominator and lowering the score.
- Polarity (present/absent) is scored separately: a wrong polarity incurs a strict penalty.
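The bi-gram decomposition step can be sketched as follows; backing off to raw tokens for one-word spans is an assumption, mirroring the short-span edge case noted in the best practices:

```python
def bigrams(span: str) -> list[str]:
    # Decompose an evidence span into overlapping bi-grams (N = 2).
    # For spans too short to form a bi-gram, fall back to the raw token(s).
    tokens = span.lower().split()
    if len(tokens) < 2:
        return tokens
    return [f"{tokens[i]} {tokens[i + 1]}" for i in range(len(tokens) - 1)]

print(bigrams("Stomach ache since morning"))  # ['stomach ache', 'ache since', 'since morning']
print(bigrams("dard"))                        # ['dard']
```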
```
├── out/                  # Generated outputs (JSON/JSONL)
├── data/                 # Input journals and gold labels
├── src/
│   ├── components/
│   │   ├── llm.py        # LLM extraction logic
│   │   └── scorer.py     # N-gram Semantic Jaccard logic
│   ├── pipelines/
│   │   ├── extraction.py
│   │   └── evaluation.py
│   ├── logger.py         # Custom logging setup
│   └── exception.py      # Error handling
├── main.py               # Runs both pipelines end-to-end
├── requirements.txt      # Project dependencies
├── ASSUMPTIONS.md        # Detailed logic explanation
└── README.md             # This file
```
- Model choices: The scorer is model-agnostic; for embeddings, prefer multilingual sentence-transformers variants.
- Thresholds: Tune the cosine "match" threshold and the restraint penalty on a held-out validation set.
- Edge cases: Short spans (1–2 tokens) can produce noisy embeddings; consider backing off to exact-string matching for very short tokens.
- Performance: Use batch embedding calls (sentence-transformers supports batching) to speed up scoring.
- Reproducibility: Seed all RNGs and log model versions and tokenizer checksums in `score_summary.json`.
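The reproducibility point can be sketched with the stdlib only. In a real run you would also seed numpy/torch and hash the actual tokenizer files; here hashing the model name is a stand-in, and the metadata keys are assumptions:

```python
import hashlib
import json
import random

def run_metadata(seed: int, model_name: str) -> dict:
    # Seed the stdlib RNG; numpy/torch need their own seeds in a real run.
    random.seed(seed)
    return {
        "seed": seed,
        "embedding_model": model_name,
        # Stand-in checksum: hash of the model name, not the real weights.
        "model_checksum": hashlib.sha256(model_name.encode()).hexdigest()[:12],
    }

# Merge run metadata into the summary before writing score_summary.json
summary = {"f1_score": 0.88, **run_metadata(42, "paraphrase-multilingual-MiniLM-L12-v2")}
print(json.dumps(summary, indent=2))
```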
- `ModuleNotFoundError` — ensure the `venv` is activated and `pip install -r requirements.txt` completed successfully.
- Embedding calls too slow — use batching or a local ONNX/accelerated model.
- Low polarity accuracy — check the label format in `gold.jsonl` and whether polarity is annotated as `present`/`absent` or as booleans.
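When debugging low scores, the per-journal report can be filtered directly to find the worst offenders; the 0.5 cutoff is arbitrary:

```python
import json

# Inline sample standing in for open("out/per_journal_scores.jsonl").readlines()
lines = [
    '{"journal_id": "J006", "f1": 0.45, "note": "Failed on Hindi text"}',
    '{"journal_id": "J007", "f1": 1.0, "note": "Perfect extraction"}',
]
worst = [row for row in map(json.loads, lines) if row["f1"] < 0.5]
for row in worst:
    print(row["journal_id"], "-", row["note"])  # J006 - Failed on Hindi text
```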
- Fork the repo
- Create a feature branch
- Open a PR with tests and a clear description
Please follow the existing code style and aim for minimal diffs when a change affects `scorer.py` or `llm.py`.
This project is released under the MIT License. See LICENSE for details.
Aryan