EviScore

A robust, evidence-grounded evaluation system for non-canonical medical journal extraction using Semantic N-gram IoU and LLMs.

🎯 Goal

Build an objective, reproducible evaluation harness for extracting messy, non-canonical health data (Symptoms, Food, Emotion, Mind) from user journals. This pipeline uses Semantic Jaccard (Soft-IoU) scoring with multilingual embeddings to handle synonyms, Hinglish, and free-text evidence spans while strictly penalizing hallucinations.


Key Features

  • Evidence-Grounded Extraction: The LLM extracts evidence_span (verbatim quotes) alongside structured attributes so every prediction is traceable to text.
  • Semantic N-gram Scorer: Soft-IoU using multilingual sentence-transformers to match semantically equivalent phrases (e.g., Hindi dard ≈ English pain).
  • Restraint Mechanism: Penalizes verbose or hallucinated outputs by increasing the union in the Jaccard denominator.
  • Safety First: Polarity accuracy (present vs absent) is tracked separately to avoid false positives.
  • Observability: Produces per-journal JSONL reports for fine-grained debugging.
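
The evidence-grounded record shape can be illustrated with a minimal sketch. The repo parses LLM output with Pydantic; a stdlib dataclass stands in here, and the field names (`category`, `value`, `polarity`, `evidence_span`) are illustrative assumptions, not the repo's actual schema:

```python
from dataclasses import dataclass, asdict

@dataclass
class ExtractedItem:
    """One extracted prediction, traceable to source text via evidence_span."""
    category: str        # hypothetical: one of Symptoms, Food, Emotion, Mind
    value: str           # normalized attribute, e.g. "stomach ache"
    polarity: str        # "present" or "absent" (scored separately)
    evidence_span: str   # verbatim quote copied from the journal

# Example record for a Hinglish journal line
item = ExtractedItem(
    category="Symptoms",
    value="stomach ache",
    polarity="present",
    evidence_span="pet me dard ho raha hai",
)
```

Because evidence_span must be verbatim, it can be validated with a simple substring check against the journal text before any semantic matching happens.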

System Architecture

graph TD
    classDef input fill:#ff9,stroke:#333,stroke-width:2px,color:black;
    classDef process fill:#00f7ff,stroke:#000,stroke-width:2px,color:black;
    classDef logic fill:#39ff14,stroke:#000,stroke-width:2px,color:black;
    classDef output fill:#ff00ff,stroke:#000,stroke-width:2px,color:white;

    Input(["📂 Input Data<br/>(journals.jsonl + gold.jsonl)"]) --> Stage1

    subgraph Stage1 ["Stage 1: Extraction"]
        LLM["🤖 LLM <br/>(src/components/llm.py)"]
        Parser["📄 Pydantic Parser"]
        LLM --> Parser
    end

    Stage1 --> Extracted["📄 extracted_gold.jsonl"]

    subgraph Stage2 ["Stage 2: Scoring"]
        Extracted --> Scorer["⚖️ Scorer Engine<br/>(src/components/scorer.py)"]
        Gold["🏆 Gold Reference"] --> Scorer

        subgraph Logic ["Core Logic"]
            Ngram["✂️ Bi-gram Decomposition"]
            BERT["🧠 Multilingual BERT"]
            Jaccard["🧮 Semantic Jaccard (Soft-IoU)"]
            Ngram --> BERT --> Jaccard
        end

        Scorer --> Logic
    end

    subgraph Stage3 ["Stage 3: Reporting"]
        Logic --> Metrics["📊 Final Metrics"]
    end

    Metrics --> Summary["✅ score_summary.json"]
    Metrics --> Detail["📝 per_journal_scores.jsonl"]

    class Input input;
    class Stage1,Stage2,Stage3 process;
    class LLM,Parser,Scorer,Logic logic;
    class Summary,Detail output;

Quickstart

Requirements

  • Python 3.10+
  • A modern CPU (GPU optional but recommended for faster embeddings/LLM)
  • Hugging Face token, if using hosted/remote model endpoints

Dependencies are listed in requirements.txt.

Setup

# Clone the repository
git clone https://github.com/Nossks/EviScore.git
cd EviScore

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Configure

Create a .env file in the project root with your Hugging Face token (if required):

HUGGINGFACEHUB_API_TOKEN=hf_xxxxxxxxxxxxxxxxx

Other configuration options (model selection, scoring thresholds) live in ASSUMPTIONS.md and in the src/components/scorer.py constants.

Run

Execute the full extraction + scoring loop from the CLI entrypoint:

python main.py

Outputs

All outputs are written to out/:

  • score_summary.json — high-level metrics
{
  "f1_score": 0.88,
  "precision": 0.92,
  "recall": 0.85,
  "polarity_accuracy": 1.0,
  "bucket_accuracy": 0.78
}
  • per_journal_scores.jsonl — row-level debug data
{"journal_id": "J006", "f1": 0.45, "note": "Failed on Hindi text"}
{"journal_id": "J007", "f1": 1.0, "note": "Perfect extraction"}
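
The per-journal rows lend themselves to quick triage. A minimal sketch (the default path and cutoff are assumptions) that surfaces low-scoring journals for debugging:

```python
import json

def low_scores(path="out/per_journal_scores.jsonl", cutoff=0.5):
    """Return rows from the per-journal JSONL report whose f1 falls below cutoff."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            row = json.loads(line)
            if row.get("f1", 0.0) < cutoff:
                rows.append(row)
    return rows
```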

Core Logic: Semantic N-gram IoU

Why not exact IoU? Simple word overlap fails for semantically equivalent phrases such as stomach ache vs tummy pain.

Let G be gold N-grams and P predicted N-grams. Using precomputed multilingual embeddings and cosine similarity, define Soft-IoU:

S-IoU = [ sum_{g in G} max_{p in P} cosine_sim(g, p) ] / ( |G| + |P| - Intersection )

  • N-Gram decomposition uses bi-grams (N=2) to preserve local context.
  • Embeddings are produced with paraphrase-multilingual-MiniLM-L12-v2 (or equivalent).
  • Intersection is approximated by the sum of matched similarities above a threshold.
  • Restraint Mechanism: Hallucinated tokens enlarge |P|, increasing denominator and lowering score.

Polarity (present/absent) is scored separately: a wrong polarity yields a strict penalty.
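
The scoring loop above can be sketched end to end. A toy character-trigram embedding stands in for paraphrase-multilingual-MiniLM-L12-v2 (so it captures only surface similarity, not true synonymy), and the intersection is taken as the count of matches above the threshold — one plausible reading of the approximation described above:

```python
import math

def bigrams(text):
    """Decompose text into word bi-grams; back off to unigrams for short spans."""
    words = text.lower().split()
    if len(words) < 2:
        return words
    return [f"{a} {b}" for a, b in zip(words, words[1:])]

def embed(phrase):
    """Toy embedding: bag of character trigrams (stand-in for MiniLM vectors)."""
    vec = {}
    s = f"  {phrase}  "
    for i in range(len(s) - 2):
        tri = s[i:i + 3]
        vec[tri] = vec.get(tri, 0) + 1
    return vec

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def soft_iou(gold_span, pred_span, threshold=0.5):
    """Semantic Jaccard: matched similarity mass over the soft union."""
    G = [embed(g) for g in bigrams(gold_span)]
    P = [embed(p) for p in bigrams(pred_span)]
    if not G or not P:
        return 0.0
    matched = [max(cosine(g, p) for p in P) for g in G]
    sims = [m for m in matched if m >= threshold]
    intersection = len(sims)                 # matched pairs above threshold
    union = len(G) + len(P) - intersection   # hallucinated n-grams inflate |P|
    return sum(sims) / union if union else 0.0
```

With real multilingual embeddings, stomach ache vs tummy pain would match; even with the toy embedding, the restraint mechanism is visible: padding the prediction with extra tokens inflates |P| and drags the score down.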


Project Structure

├── out/                    # Generated outputs (JSON/JSONL)
├── data/                   # Input journals and gold labels
├── src/
│   ├── components/
│   │   ├── llm.py          # LLM extraction logic
│   │   └── scorer.py       # N-gram Semantic Jaccard logic
│   ├── pipelines/
│   │   ├── extraction.py   # Extraction pipeline
│   │   └── evaluation.py   # Scoring/evaluation pipeline
│   ├── logger.py           # Custom logging setup
│   └── exception.py        # Error handling
├── requirements.txt        # Project dependencies
├── main.py                 # Entry point combining both pipelines
├── ASSUMPTIONS.md          # Detailed logic explanation
└── README.md               # This file

Implementation Notes & Best Practices

  • Model choices: The scorer is model-agnostic. For embeddings prefer sentence-transformers multilingual variants.
  • Thresholds: Tune cosine thresholds for "match" and the restraint penalty using a held-out validation set.
  • Edge cases: Short spans (1–2 tokens) can produce noisy embeddings; consider backing off to exact-string matching for extremely short tokens.
  • Performance: Use batch embedding calls (sentence-transformers supports batching) to speed scoring.
  • Reproducibility: Seed all RNGs and log model versions and tokenizer checksums in score_summary.json.
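
The reproducibility note can be made concrete with a small helper. The names (run_info and its fields) are illustrative assumptions, and a real tokenizer checksum would hash the tokenizer files on disk rather than the model name:

```python
import hashlib
import random

def reproducible_summary(metrics, model_name, seed=42):
    """Attach run metadata to the metrics dict before writing score_summary.json."""
    random.seed(seed)  # also seed numpy/torch here if they are in play
    summary = dict(metrics)
    summary["run_info"] = {
        "seed": seed,
        "embedding_model": model_name,
        # Stand-in fingerprint; a real checksum would hash the tokenizer files.
        "model_fingerprint": hashlib.sha256(model_name.encode()).hexdigest()[:12],
    }
    return summary
```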

Troubleshooting

  • ModuleNotFoundError — ensure venv is activated and pip install -r requirements.txt completed successfully.
  • Embedding calls too slow — use batching or a local ONNX/accelerated model.
  • Low polarity accuracy — check label format in gold.jsonl and whether polarity is annotated as present/absent or booleans.

Contributing

  1. Fork the repo
  2. Create a feature branch
  3. Open a PR with tests and a clear description

Please follow the existing code style; aim for minimal diffs if the change affects scorer.py or llm.py.


License

This project is released under the MIT License. See LICENSE for details.


Developer

Aryan
