grogers772/MicroTraitLLM-Improvements

MicroTraitLLM Validation Subsystems

This folder implements two core reliability components for MicroTraitLLM:

  1. Article Validation – scores and filters retrieved papers before they are passed to the LLM.
  2. Citation Accuracy Checking – post-processes model outputs to detect and correct hallucinated or mismatched citations.

These components correspond to the Article Validation and Citation Accuracy tasks in the overall MicroTraitLLM project.


1. Article Validation

Goal

Ensure that only high-quality, relevant, and accessible articles are used as context in the RAG pipeline.

Inputs

  • query – user query string.
  • articles – list of Article objects, each containing:
    • pmcid, doi
    • title, abstract
    • journal, year
    • citation_count
    • is_peer_reviewed, is_retracted

Outputs

  • Same list of Article objects, each annotated with:
    • validation_score ∈ [0, 1]
    • confidence_label ∈ {high, medium, low, invalid_id, unknown}

By default, low-confidence articles are filtered out.
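The `Article` record plus its annotation fields can be sketched as a dataclass. This is an illustrative reconstruction from the field lists above; the actual class in the repository may use different defaults or types.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch of the Article record described above.
@dataclass
class Article:
    pmcid: Optional[str] = None
    doi: Optional[str] = None
    title: str = ""
    abstract: str = ""
    journal: str = ""
    year: Optional[int] = None
    citation_count: int = 0
    is_peer_reviewed: bool = False
    is_retracted: bool = False
    # Fields annotated by validate_articles:
    validation_score: float = 0.0      # in [0, 1]
    confidence_label: str = "unknown"  # high / medium / low / invalid_id / unknown
```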

Scoring Model

The composite score is a weighted combination of:

  • Recency – newer articles score higher.
  • Source reputation – tiered by journal and peer-review status.
  • Citation count – log-scaled and normalized.
  • Topic relevance – cosine similarity between query and article text embeddings.

The implementation is exposed via:

from validation import validate_articles

filtered_articles = validate_articles(query, retrieved_articles)

Configuration points:

  • Journal tiers (JOURNAL_TIERS)
  • Weighting of score components
  • Identifier accessibility check (check_identifier_accessibility)
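A minimal sketch of the composite score, assuming illustrative weights, journal tiers, and a linear recency decay (the project's actual values live in `JOURNAL_TIERS` and the weighting configuration):

```python
import math

# Illustrative weights and tiers -- assumptions, not the project's actual values.
WEIGHTS = {"recency": 0.2, "reputation": 0.3, "citations": 0.2, "relevance": 0.3}
JOURNAL_TIERS = {"Nature Microbiology": 1.0, "PLOS ONE": 0.7}  # hypothetical tiers

def composite_score(year, journal, is_peer_reviewed, citation_count,
                    relevance, current_year=2025):
    # Recency: linear decay over a 20-year window, clamped to [0, 1].
    recency = max(0.0, 1.0 - (current_year - year) / 20.0)
    # Source reputation: journal tier, halved if not peer reviewed.
    reputation = JOURNAL_TIERS.get(journal, 0.5)
    if not is_peer_reviewed:
        reputation *= 0.5
    # Citation count: log-scaled, normalized against a cap of ~10^4 citations.
    citations = min(1.0, math.log1p(citation_count) / math.log1p(10_000))
    # Relevance is assumed to be a cosine similarity already in [0, 1].
    parts = {"recency": recency, "reputation": reputation,
             "citations": citations, "relevance": relevance}
    return sum(WEIGHTS[k] * parts[k] for k in WEIGHTS)
```

Because every component and every weight lies in [0, 1] and the weights sum to 1, the composite score stays in [0, 1] as required.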

2. Citation Accuracy Checking

Goal

Reduce citation hallucinations by verifying that:

  1. Each citation in the answer corresponds to a real article in the retrieved corpus.
  2. The cited passage is semantically consistent with the referenced article.

Inputs

  • answer_body – main answer text (with inline numeric citations like [1]).
  • ref_text – reference list generated by the model (e.g., lines starting with [1], [2], etc.).
  • retrieved_articles – same Article objects used in retrieval.

Outputs

  • cleaned_answer_body – answer text with hallucinated citations removed.
  • report_list – per-citation diagnostics:
    • raw_citation (e.g. [1])
    • identifier (PMCID/DOI or None)
    • status – valid, mismatch, or not_found
    • similarity – embedding similarity score

Processing Steps

  1. Extract numeric citations from the answer (e.g. [1], [2]).
  2. Parse the reference list and map each number to a PMCID or DOI.
  3. Match each identifier to a retrieved article.
  4. Extract a local context window around the citation token.
  5. Compute embedding similarity between the context and the article’s title + abstract.
  6. Flag citations as:
    • valid (similarity above threshold),
    • mismatch (article found but content does not align),
    • not_found (no article with that identifier in the corpus).
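A minimal sketch of steps 1–2 (citation extraction and reference-list parsing), assuming numeric `[n]` markers and PMCID/DOI identifiers; the regular expressions here are illustrative, not the module's actual patterns:

```python
import re

def extract_citations(answer_body):
    """Return the sorted numeric citation markers like [1], [2] in the answer."""
    return sorted({int(m) for m in re.findall(r"\[(\d+)\]", answer_body)})

def parse_reference_list(ref_text):
    """Map each reference number to the first PMCID or DOI found on its line."""
    id_pattern = re.compile(r"(PMC\d+|10\.\d{4,9}/\S+)")
    mapping = {}
    for line in ref_text.splitlines():
        num = re.match(r"\s*\[(\d+)\]", line)
        if not num:
            continue  # skip lines that do not start with a [n] marker
        ident = id_pattern.search(line)
        mapping[int(num.group(1))] = ident.group(1) if ident else None
    return mapping
```

Citations whose number is missing from the mapping, or whose identifier matches no retrieved article, would then be flagged not_found.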

Usage:

from validation import check_citations

cleaned_answer, report = check_citations(
    answer_body=answer_body,
    ref_text=ref_section,
    retrieved_articles=filtered_articles,
    similarity_threshold=0.6,
)

3. Integration Points

These subsystems plug into the MicroTraitLLM RAG pipeline as follows:

  1. Retrieval – given a query, retrieve candidate articles.
  2. Article Validation – call validate_articles to score and filter the candidates.
  3. LLM Generation – pass validated articles as context to the model.
  4. Citation Accuracy – call check_citations on the model’s answer and reference list.
  5. Final Output – return the cleaned answer plus an optional citation report.
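The five stages above chain together roughly as follows. The retrieval and generation callables are assumed to be supplied by the surrounding pipeline, so this sketch injects all four stages as parameters rather than importing the real modules:

```python
def answer_with_validation(query, retrieve, validate, generate, check):
    """Wire the five pipeline stages together.

    retrieve and generate are assumed to be provided elsewhere in the
    pipeline; validate and check correspond to validate_articles and
    check_citations from this folder.
    """
    candidates = retrieve(query)                       # 1. Retrieval
    articles = validate(query, candidates)             # 2. Article Validation
    answer_body, ref_text = generate(query, articles)  # 3. LLM Generation
    cleaned, report = check(answer_body, ref_text, articles)  # 4. Citation Accuracy
    return cleaned, report                             # 5. Final Output
```

Keeping the stages injectable makes each one testable in isolation and lets the citation report be dropped or surfaced depending on the caller.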

4. Future Work / TODOs

  • Replace embed_text stub with the production embedding model.
  • Implement real PMCID/DOI checks in check_identifier_accessibility using NCBI/PMC APIs or a local metadata cache.
  • Allow scoring weights and journal tiers to be configured via YAML/JSON.
  • Add unit tests and benchmark evaluation:
    • Precision/recall for valid article retention.
    • Precision/recall/F1 for citation validation.
  • Extend citation parsing to handle additional formats (e.g., inline PMCIDs, author-year styles).

About

This repository hosts the updates to MicroTraitLLM created during ECES450/650, Fall 2025.
