
MIND: Multilingual Inconsistent Notion Detection

This repository contains the code and data for reproducing experiments from our paper Discrepancy Detection at the Data Level: Toward Consistent Multilingual Question Answering.

Figure: Overview of the MIND pipeline.

Installation

We recommend uv for installing the necessary dependencies.

Steps for deployment with uv

  1. Clone the repository (include submodules):

    git clone --recurse-submodules https://github.com/lcalvobartolome/mind.git
    cd mind

    If you already cloned without --recurse-submodules, run:

    git submodule update --init --recursive
  2. Install uv by following the official guide.

  3. Create a local environment (it will use the Python version specified in pyproject.toml):

    uv venv .venv
  4. Activate the environment:

    source .venv/bin/activate   # On Linux/macOS
    .venv\Scripts\activate      # On Windows
  5. Install dependencies:

    uv pip install -e .
  6. Verify the installation:

    python -c "import mind; print(mind.__version__)"

Steps for deployment with Docker

The MIND web application provides an intuitive and efficient interface for MIND's tools, enabling users to perform data preprocessing, topic modeling, and discrepancy analysis. You can deploy the web application from the code in this repository to customize the interface, or access it online at https://mind.uc3m.es.

  1. Clone the repository (include submodules):

    git clone --recurse-submodules https://github.com/lcalvobartolome/mind.git
    cd mind

    If you already cloned without --recurse-submodules, run:

    git submodule update --init --recursive
  2. Navigate to the project root (where docker-compose.yml is located) and build the containers:

    docker compose build
  3. Start the services in detached mode:

    docker compose up -d
  4. Once started, the application will be accessible at the ports defined in docker-compose.yml.

Run MIND pipeline

To run the MIND pipeline, you need a collection of loosely aligned documents (e.g., corresponding Wikipedia articles in different languages). These do not need to be perfect translations—just share similar topics.

Steps for running the pipeline:

  1. Preprocess corpora: The mind.corpus_building module provides scripts for segmenting documents, creating loose alignments (translation), NLP preprocessing, and assembling the final dataset for the PLTM wrapper.

    You can run these scripts directly from the command line with flexible arguments:

    # Segment documents into passages
    python3 src/mind/corpus_building/segmenter.py --input INPUT_PATH --output OUTPUT_PATH --text_col TEXT_COLUMN --id_col ID_COLUMN
    
    # Translate passages from anchor to comparison language (and vice versa)
    python3 src/mind/corpus_building/translator.py --input INPUT_PATH --output OUTPUT_PATH --src_lang SRC_LANG --tgt_lang TGT_LANG --text_col TEXT_COLUMN --lang_col LANG_COLUMN
    
    # Preprocess and prepare the final DataFrame for the pipeline
    python3 src/mind/corpus_building/data_preparer.py --anchor ANCHOR_PATH --comparison COMPARISON_PATH --output OUTPUT_PATH --schema SCHEMA_JSON_OR_PATH
    • Replace INPUT_PATH, OUTPUT_PATH, TEXT_COLUMN, ID_COLUMN, SRC_LANG, TGT_LANG, LANG_COLUMN, ANCHOR_PATH, COMPARISON_PATH, and SCHEMA_JSON_OR_PATH with your actual file paths, column names, and language codes.
    • For segmenter.py, the optional MIN_LENGTH is the minimum paragraph length (default: 100) and SEPARATOR is the split character (default: "\n").
    • The --schema argument for data_preparer.py can be a JSON string or a path to a JSON file mapping the required columns.

    Alternatively, you can import and use these modules programmatically. See the Wikipedia use case for a complete example of how to use all these scripts in a workflow, or the scripted sketch after this list for a minimal way to chain the same commands.

  2. Train a PLTM model: Train a Polylingual Topic Model on the prepared dataset.

    python3 src/mind/topic_modeling/polylingual_tm.py \
      --input PREPARED_DATASET_PATH \
      --lang1 LANG1 \
      --lang2 LANG2 \
      --model_folder MODEL_OUTPUT_DIR \
      --num_topics NUM_TOPICS \
      [+ additional optional params]
    • Replace each argument (e.g., PREPARED_DATASET_PATH, LANG1, LANG2, MODEL_OUTPUT_DIR, NUM_TOPICS, etc.) with your actual file paths, language codes, and options.
    • See python3 src/mind/topic_modeling/polylingual_tm.py --help for full details and all available options.
  • (Optional) Label topics of a PLTM model: Assign meaningful labels to each discovered topic based on its top-ranked words and documents.

    python3 src/mind/topic_modeling/topic_label.py \
      --lang1 LANG1 \
      --lang2 LANG2 \
      --model_folder MODEL_OUTPUT_DIR \
      [+ additional optional params]
    • Replace each argument (e.g., LANG1, LANG2, MODEL_OUTPUT_DIR, etc.) with your actual file paths, language codes, and options.
    • See python3 src/mind/topic_modeling/topic_label.py --help for full details and all available options.
  3. Run the MIND pipeline: Detect discrepancies and perform downstream analysis.

    python3 src/mind/pipeline/cli.py \
        --src_corpus_path SRC_CORPUS_PATH \
        --src_thetas_path SRC_THETAS_PATH \
        --src_id_col SRC_ID_COL \
        --src_passage_col SRC_PASSAGE_COL \
        --src_full_doc_col SRC_FULL_DOC_COL \
        --src_lang_filter SRC_LANG \
        --tgt_corpus_path TGT_CORPUS_PATH \
        --tgt_thetas_path TGT_THETAS_PATH \
        --tgt_id_col TGT_ID_COL \
        --tgt_passage_col TGT_PASSAGE_COL \
        --tgt_full_doc_col TGT_FULL_DOC_COL \
        --tgt_lang_filter TGT_LANG \
        --topics TOPIC_IDS \
        --path_save RESULTS_DIR \
        [+ additional optional params]
    • Replace each argument (e.g., SRC_CORPUS_PATH, TGT_CORPUS_PATH, TOPIC_IDS, etc.) with your actual file paths, column names, and options.
    • --topics should be a comma-separated list of topic IDs, e.g. --topics 15,17.
    • See python3 src/mind/pipeline/cli.py --help for full details and all available options.
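
For readers who prefer to script the preprocessing stage, the following is a minimal sketch that writes a schema file and chains the documented corpus-building commands with subprocess. All paths, column names, language codes, and schema keys are illustrative placeholders rather than values required by MIND; check each script's --help (data_preparer.py in particular) for the exact schema it expects.

    # Hypothetical orchestration of the documented corpus-building CLIs.
    # Every path, column name, language code, and schema key below is a placeholder.
    import json
    import subprocess
    from pathlib import Path

    work = Path("work")
    work.mkdir(exist_ok=True)

    # 1. Segment the anchor-language documents into passages.
    subprocess.run([
        "python3", "src/mind/corpus_building/segmenter.py",
        "--input", "data/anchor_raw.csv",                 # placeholder input corpus
        "--output", str(work / "anchor_passages.csv"),
        "--text_col", "text",
        "--id_col", "doc_id",
    ], check=True)

    # 2. Translate passages from the anchor to the comparison language.
    subprocess.run([
        "python3", "src/mind/corpus_building/translator.py",
        "--input", str(work / "anchor_passages.csv"),
        "--output", str(work / "anchor_translated.csv"),
        "--src_lang", "en",
        "--tgt_lang", "de",
        "--text_col", "text",
        "--lang_col", "lang",
    ], check=True)

    # 3. Write a schema mapping for data_preparer.py; the keys shown here are
    #    hypothetical, so consult the script's --help for the columns it requires.
    schema_path = work / "schema.json"
    schema_path.write_text(json.dumps({"id": "doc_id", "passage": "text", "lang": "lang"}, indent=2))

    # 4. Assemble the final DataFrame for the PLTM wrapper. The comparison-language
    #    corpus (placeholder path below) is prepared with the same two steps above.
    subprocess.run([
        "python3", "src/mind/corpus_building/data_preparer.py",
        "--anchor", str(work / "anchor_translated.csv"),
        "--comparison", "data/comparison_translated.csv",
        "--output", str(work / "prepared_dataset.csv"),
        "--schema", str(schema_path),
    ], check=True)

The resulting prepared dataset would then be passed as PREPARED_DATASET_PATH when training the PLTM model in step 2.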

ROSIE-MIND

ROSIE-MIND is a dataset created by subsampling topics 12 (Pregnancy), 15 (Infant Care), and 25 (Pediatric Healthcare), and annotating them in two batches:

  • ROSIE-MIND-v1: Generated using the quora-distilbert-multilingual embedding model and qwen:32b LLM. Contains 80 annotated samples.
  • ROSIE-MIND-v2: Generated using BAAI/bge-m3 embeddings and llama3.3:70b LLM. Contains 651 annotated samples.

The complete ROSIE-MIND dataset is available on Hugging Face.
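
If you work in Python, the annotated samples can be loaded with the datasets library; the repository ID below is a placeholder, so substitute the actual ID listed in the project's Hugging Face collection.

    # Minimal sketch for loading ROSIE-MIND with the Hugging Face `datasets` library.
    # "ORG/ROSIE-MIND" is a placeholder repository ID, not the real one.
    from datasets import load_dataset

    rosie_mind = load_dataset("ORG/ROSIE-MIND")
    first_split = next(iter(rosie_mind))      # e.g., the first available split
    print(rosie_mind)                         # overview of splits and sizes
    print(rosie_mind[first_split][0])         # peek at one annotated sample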

Replication of ablation experiments

Question Answering

  1. Generate answers.

    For each question generated by the MIND question generator (using different LLMs), generate answers and detect discrepancies. This step requires already generated questions and relevant passages (from the Retrieval step).

    ./bash_scripts/run_answering_disc.sh
  2. Prepare human evaluation task.

    Prepare annotation files for human evaluation of both questions and answers. The evaluation is based on the following dimensions:

    • Questions: Verifiability, Passage Independence, Clarity, Terminology, Self-Containment, Naturalness
    • Answers: Faithfulness, Passage Dependence, Passage Reference Avoidance, Structured Response, Language Consistency
    python3 ablation/qa/prepare_eval_task.py
  3. Generate tables and figures.

    Analyze the results and create publication-ready tables and figures for the discrepancy evaluation. This analysis is provided as a notebook that summarizes the main findings and visualizations used in the paper.

Retrieval

  1. Run retrieval to get relevant passages for the questions.

    For each question generated by the MIND question generator (using different LLMs), this step produces two Excel files per run: one with weighted and one with unweighted retrieval results.

    ./bash_scripts/run_retrieval.sh
  2. Get gold passages for retrieval metric calculation.

    For each passage retrieved by any method (ANN, ENN, TB-ENN, TB-ANN), four LLMs independently rate its relevance to the question. A passage is considered relevant only if all four LLMs agree; a minimal sketch of this unanimity rule appears after this list.

    python3 ablation/retrieval/get_gold_passages.py
  3. Run statistical tests and generate tables.

    This step performs statistical analysis and generates summary tables for the retrieval experiments.

    ./bash_scripts/generate_tables.sh
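
The unanimity rule from step 2 amounts to a one-line aggregation once the per-LLM ratings are in a table. Below is a hedged pandas sketch of that rule only; the file name and column names are hypothetical, and the actual implementation is ablation/retrieval/get_gold_passages.py.

    # Sketch of the "relevant only if all four LLMs agree" gold-labeling rule.
    # File and column names are placeholders; ratings are assumed boolean (or 0/1).
    import pandas as pd

    ratings = pd.read_csv("llm_relevance_ratings.csv")   # one row per (question, passage)
    llm_cols = ["llm_1_relevant", "llm_2_relevant", "llm_3_relevant", "llm_4_relevant"]

    # A retrieved passage counts as a gold (relevant) passage only if every LLM marked it relevant.
    ratings["gold_relevant"] = ratings[llm_cols].all(axis=1)

    gold = ratings[ratings["gold_relevant"]]
    print(f"{len(gold)} of {len(ratings)} retrieved passages are unanimously relevant")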

Discrepancies

  1. Run discrepancy detection on the controlled dataset (FEVER-DPLACE-Q).

    Benchmarks the discrepancy detection module against a fixed reference set.

    python3 ablation/discrepancies/run_disc_ablation_controlled.py
  2. Prepare evaluation data from MIND-detected discrepancies and FEVER-DPLACE-Q.

    Builds a standardized annotation file for human evaluation and downstream analysis.

    python3 ablation/discrepancies/prepare_eval_task.py
  3. Generate tables and figures.

    Analyze the results and create publication-ready tables and figures for the discrepancy evaluation. This analysis is provided as a notebook that summarizes the main findings and visualizations used in the paper.

Use cases

Wikipedia

To evaluate the generalizability of MIND — both across languages and beyond the health domain — we applied the pipeline to a collection of German–English Wikipedia article pairs.

The use_cases.wikipedia module provides scripts for:

  • Retrieving parallel or loosely aligned Wikipedia pages,
  • Preprocessing the corpora, and
  • Training topic models to run the full MIND pipeline.

1. Retrieve bilingual Wikipedia data

Retrieve a dataset of German–English Wikipedia articles, either aligned (same topic in both languages) or partially aligned (based on a specified alignment percentage), then preprocess the retrieved data and prepare it for topic modeling and pipeline execution:

python3 -m use_cases.wikipedia.generate_dtset --output-path test

This script runs all necessary retrieval, cleaning, segmentation, and structuring steps, producing a dataset ready for model training.

2. Train models and run the MIND pipeline

Once the dataset is prepared, you can train various models and apply the full MIND pipeline using the provided bash script:

./bash_scripts/run_en_de.sh

This script demonstrates an end-to-end workflow, from dataset preparation to discrepancy detection, for the German–English Wikipedia use case.

For more details on the preprocessing and pipeline stages, see the Run MIND pipeline section above.

Data

All data used and generated for the paper, including ROSIE-MIND and the synthetic dataset FEVER-DPLACE-Q (with LLM annotations), are available in the Hugging Face collection. For both the ROSIE and ENDE datasets (see the use cases section), we include the preprocessed corpora augmented with topic modeling information, ready for applying MIND. Topic words in each language and topic labels are included as additional metadata. Data from the ablation studies is also available; the complete output of the MIND pipeline executions on ROSIE and ENDE is available upon request.
