The Detectability Paradox: Bilingual Medical Report Generation with Open-Weight Models and the Limits of Human Oversight
This project investigates the quality and safety risks of using large language models (LLMs) to automate medical report generation in English and French. We evaluate medical reports generated by several multilingual LLMs using automated metrics and a medical expert panel, demonstrating high-quality output while highlighting the need for automated tools to detect machine-generated content.
This repository contains the complete pipeline for:
- Data preprocessing - Process raw medical data
- EHR simulation - Generate synthetic electronic health records
- Report generation - Generate medical reports (zero-shot & few-shot)
- Authorship classification - Detect machine-generated vs human-written medical reports
Languages: English & French
```
├── data/
│   ├── raw/                      # MTSamples URLs, PubMed French PMIDs
│   └── processed/
│       ├── dev/                  # Development set (for few-shot prompting)
│       └── test/                 # Test set (for evaluation)
│
├── src/
│   ├── preprocessing/            # Data preprocessing scripts
│   │   ├── case_report_extractor.py      # Extract the French case reports
│   │   ├── preprocessing_pmc_patients.py # Extract the English case reports
│   │   └── medical_transcript_scraper.py # Extract the English medical transcripts
│   │
│   ├── llm_generation/
│   │   ├── ehr_simulation/       # EHR simulation
│   │   │   ├── generate_ehr.py
│   │   │   ├── config.py
│   │   │   ├── prompts.py
│   │   │   └── utils.py
│   │   │
│   │   └── report_generation/    # Medical report generation
│   │       ├── generate_report.py
│   │       ├── config.py
│   │       ├── prompts.py
│   │       └── utils.py
│   │
│   ├── evaluation/               # Automatic evaluation
│   │   ├── bertscore_evaluator.py
│   │   └── rouge_evaluator.py
│   │
│   ├── expert_annotation/        # Expert evaluation setup
│   │   └── randomize_data.py     # Randomize samples for expert panel
│   │
│   └── authorship_classifier/    # Machine vs human text detection
│       ├── training/
│       │   ├── train.py          # Main training script
│       │   ├── config.py         # Configuration
│       │   ├── dataset.py        # PyTorch Dataset
│       │   ├── evaluation.py     # Evaluation metrics
│       │   ├── inference.py      # Inference and predictions
│       │   ├── trainer.py        # Training logic
│       │   └── utils.py          # Utilities
│       │
│       └── ig_scores/            # Integrated Gradients analysis
│           └── compute_ig.py     # Attribution scores
│
├── README.md                     # This file
└── requirements.txt              # Python dependencies
```

Requirements:
- Python 3.11+
- CUDA-capable GPU (for vLLM)
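Automatic evaluation relies on ROUGE and BERTScore (see `src/evaluation/`). As a rough illustration of what `rouge_evaluator.py` measures, ROUGE-1 F1 is the unigram-overlap F-score between a generated report and its reference; the sketch below is a self-contained toy version, not the repository's implementation (which presumably uses an established library):

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate and a reference text."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("the patient was discharged home",
                "the patient was sent home"))  # → 0.8
```

With 4 of 5 unigrams shared, precision and recall are both 0.8, so F1 is 0.8.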
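`compute_ig.py` attributes the classifier's decisions to its inputs with Integrated Gradients. The repository presumably applies it to the trained transformer's embeddings, but the method itself is just a path integral of gradients between a baseline and the input, approximated by a Riemann sum. A toy NumPy sketch on a linear model, where the attribution has the closed form (x − baseline) · w:

```python
import numpy as np

def integrated_gradients(f_grad, x, baseline, steps=50):
    """Midpoint Riemann-sum approximation of Integrated Gradients:
    (x - baseline) * mean_alpha grad f(baseline + alpha * (x - baseline))."""
    alphas = (np.arange(steps) + 0.5) / steps
    diff = x - baseline
    grads = np.stack([f_grad(baseline + a * diff) for a in alphas])
    return diff * grads.mean(axis=0)

# Toy "model": f(x) = w . x, whose gradient is the constant w,
# so IG reduces exactly to (x - baseline) * w.
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 1.0, 1.0])
baseline = np.zeros(3)
attr = integrated_gradients(lambda z: w, x, baseline)
print(attr)  # equals (x - baseline) * w
```

A useful sanity check is the completeness property: the attributions sum to f(x) − f(baseline).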
```bash
# Clone the repository
git clone https://github.com/ds4dh/medical_report_generation
cd medical_report_generation

# Install dependencies
pip install -r requirements.txt
```

Scrape medical transcripts from MTSamples.com:
```bash
cd src/preprocessing

# Scrape medical transcripts from MTSamples
python medical_transcript_scraper.py \
    --input_dir ../../data/raw/mtsamples_urls.csv \
    --output_dir ../../data/raw/english_medical_transcripts.csv
```

Download each paper using its PMC ID and extract the case report section:
```bash
cd src/preprocessing
python case_report_extractor.py ../../data/raw/french_case_reports_pmc_ids.txt
```

To run the next script, you first need to download the source dataset from https://github.com/pmc-patients/pmc-patients:
```bash
cd src/preprocessing
python preprocessing_pmc_patients.py \
    --input_dir /path/to/PMC-Patients.json \
    --output_dir english_case_reports.csv
```

Generate synthetic EHRs from the processed data:

```bash
cd src/llm_generation/ehr_simulation

python generate_ehr.py \
    --task case_report \
    --language english \
    --input_file ../../../data/processed/test/case_reports.csv

python generate_ehr.py \
    --task transcript \
    --language french \
    --input_file ../../../data/processed/test/medical_transcripts_test.csv
```

Generate medical reports (zero-shot and few-shot):

```bash
cd src/llm_generation/report_generation
```
```bash
# Zero-shot English case reports
python generate_report.py \
    --task case_report \
    --approach zeroshot \
    --language english \
    --input_file ../../../data/processed/test/case_reports.csv

# Few-shot French transcripts
python generate_report.py \
    --task transcript \
    --approach fewshot \
    --language french \
    --num_shots 3 \
    --input_file ../../../data/processed/test/transcripts.csv \
    --dev_file ../../../data/processed/dev/transcripts.csv
```

Train the authorship classifier:

```bash
cd src/authorship_classifier/training
```
```bash
# Train with default settings
python train.py --data_folder /path/to/data

# Train with a custom model
python train.py \
    --data_folder /path/to/data \
    --model_name bert-base-multilingual-cased \
    --num_epochs 5 \
    --batch_size 16
```

Place these files in your data folder:
- `train.csv`: training data
- `dev.csv`: development data
- `test.csv`: test data

Required columns:
- `text`: text content to classify
- `label`: label (0 = machine, 1 = human)

Example:

```csv
text,label
"This is machine-generated text...",0
"This is human-written text...",1
```
For questions or inquiries, please contact us at hossein.rouhizadeh@unige.ch.