A lightweight NLP pipeline for identifying medical and self-medication claims in Reddit posts.
This project builds a multi-stage NLP pipeline to detect and analyze claims in Reddit posts related to self-medication and health discussions.
Unlike simple text classification, the system explicitly models where and how claims are expressed.
- Detects whether a post contains a claim
- Classifies the claim as explicit or implicit
- Extracts the textual span where the claim is expressed
- Performs optional linguistic analyses (typology, hedging, error patterns)
The design emphasizes:
- Interpretability
- Linguistic grounding
- Modular design
- Robustness to noisy Reddit text
The system is implemented using BERT-mini for efficiency and reproducibility.
```
Reddit Post
    ↓
Sentence Scoring (claim likelihood)
    ↓
Claim Detection (binary)
    ↓
Claim Type Classification (explicit / implicit)
    ↓
Span Extraction (BIO tagging)
    ↓
Optional Linguistic Analysis
```
Key design choices:
- Span-aware modeling instead of document-only classification
- Sentence scoring used only at inference time (no retraining)
- Lightweight regex-based sentence splitting (no NLTK dependency)
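The regex-based splitter can be sketched roughly as below. This is an illustrative version (the function name `split_sentences` and the exact pattern are assumptions, not the repository's actual code): it splits on sentence-final punctuation followed by whitespace and a capitalized continuation.

```python
import re

# Illustrative sketch of a lightweight regex sentence splitter;
# the repository's actual pattern may differ.
_SENT_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z"\'(])')

def split_sentences(text: str) -> list[str]:
    """Split on punctuation followed by whitespace and a capital letter."""
    parts = _SENT_BOUNDARY.split(text.strip())
    return [p.strip() for p in parts if p.strip()]
```

A splitter this simple tolerates noisy Reddit text well, at the cost of occasionally splitting on abbreviations.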
Binary classification task:
- Question: Does this text contain a claim?
- Model: BERT-mini + [CLS] classifier
- Metric: F1 score
- Best validation F1: ~0.88
Classifies detected claims as:
- Explicit — clearly asserted
- Implicit — inferred, contrastive, experiential
Two modeling variants:
- Document-level (final model)
- Sentence-level (exploratory, noisier supervision)

Best Macro-F1 (document-level): ~0.62
Token-level BIO tagging with labels `B-CLAIM`, `I-CLAIM`, and `O`.
Training strategy accounts for noisy supervision:
- Exact span matches when available
- Fuzzy token-level alignment otherwise

Best validation token Macro-F1: ~0.83
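The exact-then-fuzzy alignment strategy can be sketched as follows. This is an illustrative version under assumed names (`bio_labels`, `span_tokens`), not the repository's exact alignment code:

```python
def bio_labels(tokens: list[str], span_tokens: list[str]) -> list[str]:
    """Assign B-CLAIM/I-CLAIM/O labels: prefer an exact contiguous match,
    fall back to fuzzy per-token matching. (Illustrative sketch only.)"""
    labels = ["O"] * len(tokens)
    n = len(span_tokens)
    # Exact match: find the annotated span as a contiguous token subsequence.
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == span_tokens:
            labels[i] = "B-CLAIM"
            for j in range(i + 1, i + n):
                labels[j] = "I-CLAIM"
            return labels
    # Fuzzy fallback: mark any token that appears in the span (case-folded).
    span_set = {t.lower() for t in span_tokens}
    inside = False
    for i, tok in enumerate(tokens):
        if tok.lower() in span_set:
            labels[i] = "I-CLAIM" if inside else "B-CLAIM"
            inside = True
        else:
            inside = False
    return labels
```

The fallback trades precision for coverage, which matches the noisy-supervision setting described above.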
These analyses operate purely on predicted spans and do not modify trained models.
Each predicted span is labeled using surface cues:
- Causal
- Contrastive
- Epistemic
- Normative
Observed trends:
- Implicit claims are frequently contrastive or epistemic
- Multi-label spans are common
- Many claims combine inference + contrast
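A surface-cue labeler of this kind can be sketched in a few lines. The cue lexicons below are hypothetical placeholders (the actual lists used by the analysis may differ), and multi-label output falls out naturally:

```python
# Hypothetical cue lexicons; the project's actual lists may differ.
TYPOLOGY_CUES = {
    "causal":      ["because", "due to", "leads to", "caused"],
    "contrastive": ["but", "however", "although", "doesn't mean"],
    "epistemic":   ["might", "seems", "probably", "i think"],
    "normative":   ["should", "must", "need to", "ought"],
}

def label_span(span: str) -> list[str]:
    """Return every typology label whose surface cues appear in the span."""
    text = span.lower()
    return [label for label, cues in TYPOLOGY_CUES.items()
            if any(cue in text for cue in cues)]
```

Substring matching is deliberately crude; it keeps the analysis transparent and easy to audit, consistent with its heuristic framing.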
Counts epistemic hedges (e.g., might, seems, I think).
Corpus-level findings:
- Explicit claims: ~16.6% contain hedging
- Implicit claims: ~13.5% contain hedging
- Explicit claims still hedge surprisingly often
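The hedge counting described above amounts to whole-word matching against a small lexicon. A minimal sketch (the lexicon and function name here are assumptions; the corpus analysis may use a larger cue list):

```python
import re

# Small illustrative hedge lexicon; the actual analysis may use more cues.
HEDGES = ["might", "may", "maybe", "seems", "perhaps", "probably",
          "i think", "i guess"]

def count_hedges(span: str) -> int:
    """Count hedge-cue occurrences in a span (case-insensitive, whole-word)."""
    text = span.lower()
    return sum(len(re.findall(r"\b" + re.escape(h) + r"\b", text))
               for h in HEDGES)
```

Word boundaries matter here: without them, "may" would spuriously match inside "maybe".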
Manual inspection of false positives and negatives revealed:
- Narrative framing
- Experiential reports
- Advice vs assertion
- Implicit causality
- Contrast without an actual claim
Many errors reflect annotation ambiguity, not clean model failure.
Question:
Should claim detection be performed on the full post or at sentence level?
Experiment:
The same trained model is evaluated under three inference strategies.
| Variant | Precision | Recall | F1 |
|---|---|---|---|
| Full Post | 0.845 | 0.974 | 0.905 |
| Any Sentence | 0.833 | 0.952 | 0.889 |
| Best Sentence | 0.833 | 0.952 | 0.889 |
Observation:
- Full-post inference yields highest recall and F1
- Sentence-level inference reduces noise but offers no clear F1 gain
- Claims in Reddit health posts are often globally distributed
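The three inference strategies can be expressed with one shared scorer. In this sketch, `claim_prob` stands in for the trained detector's probability function (a hypothetical name, assumed here for illustration):

```python
def detect(post: str, sentences: list[str], claim_prob,
           threshold: float = 0.5) -> dict:
    """Apply the same trained scorer under three inference granularities."""
    sent_scores = [claim_prob(s) for s in sentences] or [0.0]
    return {
        # Score the whole post in one pass.
        "full_post": claim_prob(post) >= threshold,
        # Positive if any single sentence clears the threshold.
        "any_sentence": any(s >= threshold for s in sent_scores),
        # Decide from the single highest-scoring sentence.
        "best_sentence": max(sent_scores) >= threshold,
    }
```

Note that "any sentence" and "best sentence" make identical binary decisions under a shared threshold, which is consistent with their identical rows in the table above.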
Zero-shot evaluation on the IBM Debater claim sentence benchmark (Wikipedia domain).
- 2,500 Wikipedia sentences
- 733 claims / 1,767 non-claims
- Query types: `q_strict`, `q_mc`, `q_that`, `q_cl`
| Metric | Value |
|---|---|
| Precision | 0.407 |
| Recall | 0.520 |
| F1 | 0.457 |
| Accuracy | 0.638 |
Confidence behavior:
- Avg confidence (correct): 0.408
- Avg confidence (wrong): 0.527
The model is over-confident on errors, a known failure mode under domain shift.
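The correct-vs-wrong confidence split can be computed with a small helper (the function name is hypothetical; this is an illustrative sketch, not the evaluation script itself):

```python
def confidence_by_correctness(probs: list[float], preds: list[int],
                              golds: list[int]) -> tuple[float, float]:
    """Average model confidence, split by whether the prediction was correct."""
    correct = [p for p, y, g in zip(probs, preds, golds) if y == g]
    wrong = [p for p, y, g in zip(probs, preds, golds) if y != g]
    avg = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return avg(correct), avg(wrong)
```

When the second value exceeds the first, as in the numbers above, the model is systematically over-confident on its mistakes.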
| Query | Precision | Recall | F1 |
|---|---|---|---|
| q_strict | 0.548 | 0.518 | 0.533 |
| q_that | 0.418 | 0.468 | 0.441 |
| q_mc | 0.280 | 0.574 | 0.377 |
| q_cl | 0.259 | 0.600 | 0.362 |
Interpretation:
- Best performance on strict argumentative claims
- Degradation on loosely phrased factual/contextual sentences
- Highlights discourse-level mismatch, not model size alone
- Implicit claims are linguistically weaker and less confident
- Span-based modeling improves interpretability
- Sentence-level inference helps localization but not raw F1
- Cross-domain transfer remains challenging
- Many “errors” reflect gray areas in claim definition
```shell
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Train the models:

```shell
python src/train.py
python src/train_claim_type.py
python src/train_span.py
```

Run the demo:

```shell
python run_sample.py
```

Example output:
```json
{
  "claim": true,
  "claim_confidence": 0.71,
  "claim_type": "explicit",
  "claim_type_confidence": 0.56,
  "span": "Just because an MRI shows a disc bulge doesn’t mean it’s the source of pain."
}
```

Run the analyses:

```shell
python -m analysis.run_claim_typology_corpus
python -m analysis.run_hedging_corpus
python -m analysis.summarize_errors
python -m ablation_analysis.run_input_granularity_ablation
python -m benchmark_analysis.eval_ibm_claim_detection
```

- Runs on CPU, CUDA, or Apple MPS
- Trained primarily on Apple M1/M2
- BERT-mini chosen for:
  - Fast iteration
  - Low memory usage
  - Reproducibility
- Implicit claims remain difficult to define and detect
- Span supervision is noisy
- Typology and hedging analyses are heuristic
- Reddit health data raises ethical considerations