On the Factual Consistency of Text-based Explainable Recommendation Models

This repository provides a comprehensive framework for evaluating the factual consistency of text-based explainable recommendation models. It includes statement-level evaluation metrics, augmented benchmark datasets, and baseline implementations.

📄 Paper

On the Factual Consistency of Text-based Explainable Recommendation Models
Ben Kabongo, Vincent Guigue

Text-based explainable recommendation aims to generate natural-language explanations that justify item recommendations. While recent models produce fluent outputs, this work reveals a critical gap: high surface-level quality doesn't guarantee factual accuracy. We introduce a framework to evaluate factual consistency at the statement level and show that state-of-the-art models exhibit substantial hallucination, with precision ranging from 4.38% to 32.88%.

🗂️ Repository Structure

.
├── baselines/                    # Reference implementations + helpers
│   ├── Att2Seq/  CER/  NRT/      # RNN/Transformer baselines (train & module files)
│   ├── PETER/  PEPLER/  XRec/    # Transformer & LLM-enhanced baselines
│   ├── NL_profiles/              # User/item natural-language profile generation
│   └── output_process/           # Clean → STS extraction → post-processing
├── data/                         # Augmented datasets (schemas, loading tips)
├── evaluation/                   # LLM + NLI + QG-QA + text-similarity metrics
├── statement_topic_sentiment/    # STS extraction prompts & GT builder
└── README.md

🎯 Key Contributions

Statement-Level Ground-truth Construction: LLM-based pipeline to extract atomic explanatory statements with domain-specific topics and sentiment labels from reviews
Augmented Benchmark Datasets: Five Amazon Reviews categories (Toys, Clothes, Beauty, Sports, Cellphones) with statement-level annotations
Factuality Metrics: Comprehensive suite combining LLM-based and NLI-based approaches to assess precision (factual consistency) and recall (coverage)

📊 Datasets

We provide five augmented datasets from Amazon Reviews 2014: Toys, Clothes, Beauty, Sports, and Cellphones.

Each interaction includes:

Atomic statement–topic–sentiment (STS) triplets extracted from reviews
Ground-truth explanations constructed by rule-based aggregation (no LLM generation)
Domain-specific topics (10 topics per domain)
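
To make the triplet structure above concrete, here is a minimal Python sketch of an STS record and one possible rule-based aggregation. The field names, the `STS` class, and the `aggregate` rule (de-duplicate statements, then join them as sentences) are assumptions made for illustration. The actual schema and aggregation rule are documented in data/README.md and may differ.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class STS:
    """Hypothetical statement-topic-sentiment record (field names are assumed)."""
    statement: str
    topic: str
    sentiment: str  # e.g. "positive", "negative", "neutral"


def aggregate(triplets: Iterable[STS]) -> str:
    """Join unique statements, in order, into one explanation string.

    Illustrative rule only: case-insensitive de-duplication, no LLM involved.
    """
    seen = set()
    parts = []
    for t in triplets:
        text = t.statement.strip().rstrip(".")
        key = text.lower()
        if key and key not in seen:
            seen.add(key)
            parts.append(text)
    return (". ".join(parts) + ".") if parts else ""


if __name__ == "__main__":
    example = [
        STS("Sturdy build", "quality", "positive"),
        STS("sturdy build", "quality", "positive"),  # duplicate, dropped
        STS("Too pricey.", "price", "negative"),
    ]
    print(aggregate(example))
```

Because the aggregation is purely rule-based, every sentence in the resulting explanation can be traced back to an extracted statement.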

Dataset: https://huggingface.co/datasets/benkabongo25/amazon-reviews-statement-v0

Statistics:

Dataset	Users	Items	Interactions	Avg Statements/Interaction
Toys	19,398	11,924	163,711	5.03
Clothes	39,385	23,033	274,774	4.42
Beauty	22,362	12,101	197,621	5.45
Sports	35,596	18,357	293,244	4.93
Cellphones	27,873	10,429	190,194	4.54
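
For a rough sense of scale, the table's figures can be turned into two derived quantities: interaction density (interactions divided by users times items) and an approximate total statement count (interactions times the average statements per interaction). The sketch below only does this arithmetic. The `derived` helper is a hypothetical name, and the statement totals are approximate because the averages in the table are rounded.

```python
# Figures copied from the Statistics table:
# (users, items, interactions, avg statements per interaction)
STATS = {
    "Toys": (19_398, 11_924, 163_711, 5.03),
    "Clothes": (39_385, 23_033, 274_774, 4.42),
    "Beauty": (22_362, 12_101, 197_621, 5.45),
    "Sports": (35_596, 18_357, 293_244, 4.93),
    "Cellphones": (27_873, 10_429, 190_194, 4.54),
}


def derived(users, items, interactions, avg_statements):
    """Return (interaction density, approximate total number of statements)."""
    density = interactions / (users * items)
    approx_statements = round(interactions * avg_statements)
    return density, approx_statements


if __name__ == "__main__":
    for name, row in STATS.items():
        density, approx = derived(*row)
        print(f"{name:<11} density={density:.4%}  approx. statements={approx:,}")
```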

For detailed dataset documentation, see data/README.md.

🔧 Baselines

We evaluate six state-of-the-art models spanning three architectural families:

RNN-based: Att2Seq, NRT
Transformer-based: PETER, CER, PEPLER
LLM-enhanced: XRec (LightGCN + LLM)

For training details and hyperparameters, see baselines/README.md.

📈 Evaluation

Our framework includes multiple evaluation approaches:

LLM-based Statement Metrics: St2Exp-P/R/F1 (precision, recall, F1)
NLI-based Statement Metrics: StEnt-, StCoh- (entailment, coherence)
Standard NLI Metrics: SummaC, AlignScore
QG-QA Metrics: QuestEval
Text Similarity: BERTScore, STS, BARTScore, BLEURT
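
As a minimal sketch of how statement-level precision, recall, and F1 can be computed once each statement has been judged, consider the function below. It assumes that precision is the share of predicted statements supported by the ground-truth explanation and that recall is the share of ground-truth statements covered by the generated explanation. The function name and the boolean-list inputs are illustrative. The exact St2Exp protocol is described in evaluation/README.md.

```python
from typing import Sequence, Tuple


def statement_prf(
    pred_supported: Sequence[bool],
    ref_covered: Sequence[bool],
) -> Tuple[float, float, float]:
    """Compute (precision, recall, F1) from per-statement judgments.

    pred_supported: one flag per predicted statement (is it supported?)
    ref_covered:    one flag per ground-truth statement (is it covered?)
    Empty inputs yield 0.0 rather than raising a division error.
    """
    precision = sum(pred_supported) / len(pred_supported) if pred_supported else 0.0
    recall = sum(ref_covered) / len(ref_covered) if ref_covered else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # One of four predicted statements is supported;
    # one of two ground-truth statements is covered.
    print(statement_prf([True, False, False, False], [True, False]))
```

In this sketch the judgments themselves would come from an LLM or NLI model. The function only aggregates them.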

For complete evaluation protocols, see evaluation/README.md.

🚀 Quick Start

1. Extract Statement-Topic-Sentiment Triplets

cd statement_topic_sentiment

PYTHONPATH=. python sts_extraction.py \
  --model /path/to/Meta-Llama-3-8B-Instruct \
  --prompt_text_file prompts/Toys/toys.txt \
  --dataset_path /path/to/reviews.json \
  --output_dir /path/to/output/ \
  --batch_size 32 \
  --max_new_tokens 512

See statement_topic_sentiment/README.md for adapting to new domains.

2. Train a Baseline Model

Example with PETER:

PYTHONPATH=. python3 baselines/PETER/main.py \
    --dataset_name Toys \
    --dataset_dir /path/to/Toys/ \
    --save_dir /path/to/checkpoints/ \
    --emsize 512 \
    --epochs 100 \
    --batch_size 32 \
    --seed 42

3. Post-process Model Outputs

# Clean predictions (--outfile equals --infile here, so output.csv is overwritten in place)
PYTHONPATH=. python baselines/output_process/clean.py \
  --infile ${OUTPUT_DIR}/output.csv \
  --outfile ${OUTPUT_DIR}/output.csv

# Extract statements from predictions
PYTHONPATH=. python baselines/output_process/sts_extraction.py \
  --model /path/to/Meta-Llama-3-8B-Instruct \
  --prompt_text_file statement_topic_sentiment/prompts/toys.txt \
  --dataset_path ${OUTPUT_DIR}/output.csv \
  --output_dir ${OUTPUT_DIR}

See baselines/output_process/README.md for details.

4. Evaluate Factual Consistency

LLM-based evaluation:

PYTHONPATH=. python evaluation/llm/statement2doc.py \
    --model /path/to/llama-3.1-8B-instruct \
    --baseline_dir ${BASELINE_DIR} \
    --task statement2explanation \
    --batch_size 24

NLI-based evaluation:

PYTHONPATH=. python evaluation/nli/nli_batch_pairs.py \
  --sts_ref_path ${DATASET_DIR}/sts.csv \
  --sts_pred_path ${BASELINE_DIR}/sts.csv \
  --model_name microsoft/deberta-large-mnli \
  --batch_size 64

See evaluation/README.md for details.
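
The idea behind pairing predicted and reference statements with an NLI model can be sketched as follows. This is an illustrative outline, not the logic of nli_batch_pairs.py: the `entailment_precision` function, the `entail_prob` callable, and the 0.5 threshold are all assumptions. In practice `entail_prob` would wrap an MNLI classifier such as microsoft/deberta-large-mnli. Here it is left as a pluggable function so the aggregation logic can be read and tested on its own.

```python
from typing import Callable, Sequence


def entailment_precision(
    pred_statements: Sequence[str],
    ref_statements: Sequence[str],
    entail_prob: Callable[[str, str], float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the repository
) -> float:
    """Share of predicted statements entailed by at least one reference statement.

    entail_prob(premise, hypothesis) is a hypothetical scorer that returns
    the probability that the premise entails the hypothesis.
    """
    if not pred_statements:
        return 0.0
    supported = 0
    for hypothesis in pred_statements:
        # Best entailment score over all reference statements (0.0 if none).
        best = max(
            (entail_prob(premise, hypothesis) for premise in ref_statements),
            default=0.0,
        )
        if best >= threshold:
            supported += 1
    return supported / len(pred_statements)


if __name__ == "__main__":
    # Toy stand-in scorer based on substring matching, for demonstration only.
    def toy_scorer(premise: str, hypothesis: str) -> float:
        return 1.0 if hypothesis.lower() in premise.lower() else 0.0

    preds = ["battery lasts long", "screen is dim"]
    refs = ["The battery lasts long on a single charge"]
    print(entailment_precision(preds, refs, toy_scorer))
```

Swapping the roles of predicted and reference statements gives the corresponding recall-style score under the same assumptions.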

🔍 Key Findings

Our experiments reveal a dramatic disconnect between surface-level quality and factual accuracy:

High surface-level scores: BERTScore F1 ranges from 0.81 to 0.90
Low factual precision: Statement-level precision ranges from 4.38% (NRT on Toys) to 32.88% (XRec on Sports)
Poor recall: Models miss 70% or more of the ground-truth explanatory content
Standard metrics fail: Similarity metrics don't correlate with factual consistency

Implication: Current models generate fluent but factually inconsistent explanations, highlighting the need for factuality-aware evaluation and model development.

📚 Citation

@article{kabongo2025factual,
  title={On the Factual Consistency of Text-based Explainable Recommendation Models},
  author={Kabongo, Ben and Guigue, Vincent},
  journal={arXiv preprint},
  year={2025}
}
