Most detector projects stop at accuracy. This repo goes one step further and answers the question you actually need in a workflow:
When the model says “I’m confident,” should we trust it, and what should we do when it isn’t?
This project produces a Reliability Report Card (calibration + performance), learns a recommended abstention threshold to hit a target auto-decision coverage, and ships a Streamlit dashboard that presents everything in a clean, decision-ready format.
- What problem this solves
- Key ideas
- What the pipeline produces
- Repository structure
- Quickstart
- Data format
- How the pipeline works
- Metrics explained
- Abstention and coverage explained
- Figures explained (each one, deeply)
- Streamlit dashboard explained (tab-by-tab)
- Recommended threshold: what it means
- Decision safety: what this UI prevents
- How to make it production-grade
- Troubleshooting
Detectors are often deployed in environments where a wrong decision is costly:
- flagging humans as AI (false positives) can harm trust or policy enforcement
- missing AI (false negatives) may weaken moderation or integrity workflows
- post-edited AI is especially tricky: it can resemble both classes
Accuracy alone doesn’t tell you whether to trust the model on any specific decision.
This repo introduces two “real-world” requirements:
1. Calibration
   - If the model outputs “0.80 confidence,” it should be correct about ~80% of the time (on similar data).
   - If it’s overconfident, thresholding becomes dangerous: you think you’re safe when you’re not.
2. Abstention
   - A safe detector doesn’t need to auto-decide on everything.
   - It can abstain (send uncertain cases to human review) to reduce harmful errors.
   - But abstaining too often reduces usability and increases review cost.
So the product question becomes:
What confidence threshold gives us the best tradeoff between coverage (auto-decisions) and correctness?
That’s exactly what the report card + dashboard answer.
A model can be:
- accurate but miscalibrated (probabilities are misleading)
- well-calibrated but not very accurate (it “knows what it doesn’t know,” but still struggles)
You need both for decision-safe deployment.
Coverage = fraction of cases the model decides automatically.
- High coverage means fewer reviews but more exposure to wrong decisions.
- Lower coverage means safer auto-decisions but more review load.
This repo operationalizes that with:
- a recommended confidence threshold
- a clear rule in the UI showing why the system abstained
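The abstain rule itself is tiny. A minimal sketch is below; the 0.61 default is only the example value this repo prints, and in practice the threshold is read from `outputs/abstention_policy.json`:

```python
# Sketch of the abstain rule: auto-decide only when the top-class
# probability clears the recommended threshold. The 0.61 default is the
# example value from this repo's output, not a universal constant.

def decide(probs: dict, threshold: float = 0.61):
    """Return ("auto", label) or ("abstain", label) for one prediction."""
    label = max(probs, key=probs.get)       # most likely class
    confidence = probs[label]
    if confidence >= threshold:
        return ("auto", label)              # confident enough to auto-decide
    return ("abstain", label)               # route to human review

print(decide({"human": 0.15, "ai": 0.75, "post_edited_ai": 0.10}))  # ('auto', 'ai')
print(decide({"human": 0.40, "ai": 0.35, "post_edited_ai": 0.25}))  # ('abstain', 'human')
```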
After running the pipeline, you get:
- outputs/metrics_overall.json — accuracy, macro_f1, ECE, Brier score, labels, etc.
- outputs/abstention_policy.json — recommended threshold, expected coverage at that threshold, and (optionally) rule notes
- outputs/coverage_curve.csv — threshold → coverage → accuracy → macro_f1 (and potentially more)
- outputs/test_predictions.csv — predicted label, confidence, per-class probabilities (p_<label>), and optional “disagreement” features used in the abstain rule
Saved to: reports/figures/
- confusion matrix
- reliability diagram
- coverage vs performance
- confidence histogram
```
detector-reliability-report-card/
├─ app/
│  └─ app.py                     # Streamlit dashboard
├─ src/
│  └─ pipeline.py                # Training + evaluation + artifacts + plots
├─ data/
│  └─ raw/
│     └─ ai_human_detection.csv  # Example input
├─ outputs/
│  ├─ metrics_overall.json
│  ├─ abstention_policy.json
│  ├─ coverage_curve.csv
│  └─ test_predictions.csv
└─ reports/
   ├─ figures/
   │  ├─ confusion_matrix.png
   │  ├─ reliability_diagram.png
   │  ├─ coverage_vs_accuracy.png
   │  └─ probability_histograms.png
   └─ screenshots/               # Optional: UI screenshots for README
      ├─ ui_report_card.png
      ├─ ui_coverage_curve.png
      ├─ ui_triage_ui.png
      └─ ui_notes.png
```
```bash
pip install -r requirements.txt
python -m src.pipeline --input data/raw/ai_human_detection.csv
```

Expected outcome:

- outputs/ populated with JSON/CSV artifacts
- reports/figures/ populated with PNG plots
- terminal prints a recommended threshold + estimated coverage (example: threshold ≈ 0.61, coverage ≈ 0.71)

```bash
streamlit run app/app.py
```

At minimum, the pipeline needs:
- a text column (the content)
- a label column (ground truth class)
Typical labels used by this project:
- human
- ai
- post_edited_ai

Why 3-class matters: Post-edited AI behaves like an “in-between” distribution. It often creates:

- confusion with ai when edits are light
- confusion with human when edits are heavy
That’s why macro-F1 and confusion analysis are emphasized.
If your dataset uses different column names, preprocess to match the expected schema (or update pipeline mapping).
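A minimal preprocessing sketch with pandas; the source column names `body` and `source` and the target names `text`/`label` are assumptions here, so check src/pipeline.py for the names your copy actually reads:

```python
# Hypothetical schema adapter: rename your dataset's columns to the
# text/label schema the pipeline expects. Column names are assumptions;
# verify them against src/pipeline.py before relying on this.
import pandas as pd

df = pd.DataFrame({
    "body": ["An example essay ...", "Another essay ..."],
    "source": ["human", "post_edited_ai"],
})
df = df.rename(columns={"body": "text", "source": "label"})
# df.to_csv("data/raw/ai_human_detection.csv", index=False)  # then run the pipeline
```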
This section explains what the pipeline is doing conceptually, not just “it trains a model.”
The pipeline trains a simple, explainable baseline (fast and strong enough for report-card purposes). You may see “Primary model: char” in logs, indicating a character-based representation/model variant.
Why baseline models?
- quick to train and iterate
- easy to debug
- provide a strong reference point before heavier models
Decision safety requires probabilities because:
- thresholds operate on probabilities (confidence)
- calibration evaluates probability quality
Raw ML scores are often miscalibrated. Calibration reshapes predicted probabilities so that “0.8” behaves like “~80% correct.”
This repo supports common calibration choices:
- sigmoid (Platt scaling): stable, good default
- isotonic: more flexible but can overfit on small calibration sets
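Both options map directly onto scikit-learn's `CalibratedClassifierCV`. A sketch on synthetic 3-class data (this is not the repo's actual training code, just the calibration idiom):

```python
# Sketch of probability calibration with scikit-learn.
# method="sigmoid" is Platt scaling (the stable default above);
# switch to method="isotonic" when the calibration set is large.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features/labels
X, y = make_classification(n_samples=600, n_classes=3, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base = LogisticRegression(max_iter=1000)
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=3)
calibrated.fit(X_train, y_train)

probs = calibrated.predict_proba(X_test)  # each row sums to 1
```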
Standard performance metrics on held-out test data.
This is the “trust layer”:
- calibration answers whether confidence aligns with reality
- ECE/Brier quantify that alignment
The pipeline evaluates many confidence thresholds:
- for each threshold t, auto-decide only when confidence ≥ t
- compute coverage and metrics on the decided subset
This produces:
- a curve that shows coverage vs performance
- a recommended threshold based on a target coverage
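The sweep itself can be sketched in a few lines, assuming arrays of per-sample confidences and correctness flags from the test split (toy values below):

```python
# Sketch of the threshold sweep behind coverage_curve.csv:
# for each threshold t, auto-decide when confidence >= t, then measure
# coverage and accuracy on the decided subset. Values are illustrative.
import numpy as np

confidence = np.array([0.95, 0.9, 0.8, 0.7, 0.55, 0.5])
correct    = np.array([1,    1,   1,   0,   1,    0])

rows = []
for t in np.linspace(0.5, 0.95, 10):
    decided = confidence >= t                  # auto-decided subset
    coverage = decided.mean()
    accuracy = correct[decided].mean() if decided.any() else float("nan")
    rows.append((round(float(t), 2), float(coverage), float(accuracy)))

for t, cov, acc in rows:
    print(f"threshold={t:.2f} coverage={cov:.2f} accuracy={acc:.2f}")
```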
Everything is written out so you can:
- reproduce results
- compare runs (diff JSON/CSV)
- use figures in reports and posts
- power the dashboard without re-training every time
What it measures: overall correctness rate. What it hides: class imbalance and which mistakes matter.
If one class dominates, accuracy may look fine even if you fail on minority classes.
What it measures: F1 per class averaged equally across classes. Why it matters here: AI / human / post-edited classes often have different difficulty and prevalence.
Macro F1 helps prevent a misleading “it’s accurate!” conclusion that’s driven by the easiest class.
A model is well-calibrated if:
- among all predictions made at ~0.70 confidence, about 70% are correct
You can have:
- high accuracy but terrible calibration (dangerous thresholds)
- moderate accuracy but excellent calibration (safer abstention behavior)
ECE bins predictions by confidence:
- bin 0.6–0.7, 0.7–0.8, etc.
- for each bin compare:
  - mean confidence
  - empirical accuracy
Then it averages the absolute gap weighted by bin size.
Interpretation:
- lower is better
- high ECE means “confidence numbers are lying”
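A reference implementation of ECE under exactly those definitions (equal-width bins; the pipeline's binning may differ in detail):

```python
# Sketch of expected calibration error (ECE): bin by confidence,
# compare mean confidence to empirical accuracy per bin, average the
# absolute gaps weighted by bin size.
import numpy as np

def ece(confidence, correct, n_bins=10):
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidence > lo) & (confidence <= hi)
        if in_bin.any():
            gap = abs(confidence[in_bin].mean() - correct[in_bin].mean())
            total += in_bin.mean() * gap   # weight gap by bin size
    return total

# Well-calibrated toy case: 0.8-confidence predictions, 80% correct -> ~0
print(ece([0.8] * 5, [1, 1, 1, 1, 0]))
# Overconfident toy case: 0.9-confidence predictions, only 50% correct -> ~0.4
print(ece([0.9] * 10, [1] * 5 + [0] * 5))
```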
Brier measures the squared error of predicted probabilities. It rewards:
- accurate predictions
- probabilities that are close to the true outcome
Interpretation:
- lower is better
- sensitive to both correctness and probability sharpness
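For the multiclass case, the Brier score is just the mean squared error between the predicted probability vectors and one-hot true labels, as in this sketch:

```python
# Sketch of the multiclass Brier score: mean squared distance between
# predicted probability vectors and one-hot encodings of the true class.
import numpy as np

def brier(probs, true_idx):
    probs = np.asarray(probs, dtype=float)
    onehot = np.zeros_like(probs)
    onehot[np.arange(len(true_idx)), true_idx] = 1.0
    return float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))

# A confident, correct prediction scores near 0;
# a confident, wrong one scores near 2.
print(brier([[0.9, 0.05, 0.05]], [0]))  # low
print(brier([[0.9, 0.05, 0.05]], [1]))  # high
```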
Coverage = fraction of samples that the model auto-decides.
If you set threshold high:
- you keep only very confident predictions
- coverage drops
If you set threshold low:
- you decide more often
- coverage increases
When abstaining, you evaluate performance on the decided subset.
Performance on that subset usually increases as the threshold rises (and coverage drops), because the model keeps only easy, high-confidence examples.
The tradeoff:
- High coverage → less review cost, more wrong auto-decisions
- Low coverage → safer auto-decisions, higher review burden
This repo makes that tradeoff measurable and explicit.
All figures are in reports/figures/.
File: reports/figures/confusion_matrix.png
What it shows
- Rows: true label
- Columns: predicted label
- Each cell: count of examples
How to read it
- Diagonal: correct predictions
- Off-diagonal: confusions
Why it’s critical for 3-class detection: in AI vs human vs post-edited AI:
- post_edited_ai → ai may mean edits are subtle or the model sees AI artifacts
- post_edited_ai → human may mean edits “wash out” artifacts
- ai → human may indicate the generator’s style resembles human writing
Actionable use
- decide where to invest improvement:
- collect more post-edited examples
- add features for edit markers
- adjust class weights or thresholds per class (advanced)
File: reports/figures/reliability_diagram.png
What it shows
- x: predicted confidence (averaged within a bin)
- y: empirical accuracy (within that bin)
- dashed diagonal: perfect calibration
Interpretation
- Points below diagonal → overconfidence (high risk)
- Points above diagonal → underconfidence (model is safer than it claims)
Why it matters: if your decision rule uses “confidence ≥ 0.8,” but the model is overconfident:
- you may think you’re auto-deciding safely
- but your true correctness might be far lower than expected
This figure helps validate whether thresholding is trustworthy.
File: reports/figures/coverage_vs_accuracy.png
What it shows
- x: coverage (fraction auto-decided)
- y: performance (accuracy, macro-F1) on auto-decided subset
How to use it
- choose a target coverage your workflow can handle
- check the resulting expected performance
Example reasoning
- If you can only review 30% of items:
  - you need ~70% coverage
  - find what accuracy/macro-F1 you get there
- If you require minimum macro-F1 of 0.70:
  - find what coverage you must accept
This connects model behavior to operational constraints.
File: reports/figures/probability_histograms.png
What it shows
- distribution of max predicted probability per sample (“confidence”)
Why it’s useful
- reveals how often the model is uncertain
- shows whether a threshold will dramatically change coverage
- helps detect suspicious confidence behavior:
  - everything near 1.0 can be a sign of overconfidence (check calibration)
  - everything midrange suggests the model lacks separability
Practical threshold insight: a good threshold often lies where:
- you drop the “uncertain mass”
- without destroying coverage
The dashboard turns offline artifacts into a clean decision interface.
Purpose: one screen that answers: “should we trust this model?”
What it shows:
- top-line metrics:
  - accuracy, macro-F1 (quality)
  - ECE, Brier (trust)
- a clean 2×2 grid of figures:
  - confusion matrix
  - coverage vs performance
  - reliability diagram
  - confidence histogram
- recommended abstention policy JSON (if present)
Why the 2×2 grid matters:
- it prevents “scroll blindness”
- lets you compare diagnostics side-by-side
- keeps the view clean (same width and aligned)
Purpose: explore threshold tradeoffs interactively
It typically includes:
- performance vs coverage
- threshold vs coverage/metrics
What it helps you decide:
- “What threshold gives me ~70% auto-decisions?”
- “How much performance do I lose if I increase coverage?”
- “Where are diminishing returns?”
Purpose: show what a decision-safe output would look like to a user/operator
What it demonstrates:
- predicted class
- confidence
- auto-decide vs abstain decision
- a probability breakdown bar chart
Important note (current behavior):
- this demo uses saved test predictions to demonstrate the UI format
- for real inference:
  - persist the trained model
  - load it in the app
  - run prediction on pasted text
Purpose: document the policy philosophy + upgrade path
It explains:
- accuracy ≠ trust
- calibration and ECE
- coverage as a real product metric
- recommended next upgrades
The pipeline prints a “Recommended threshold” based on your chosen target coverage.
Interpretation:
- “Threshold = 0.61, coverage ≈ 0.71” means:
  - if you auto-decide when confidence ≥ 0.61
  - you will auto-decide about 71% of cases (on similar data)
  - the remaining ~29% should go to review
Important limitation: coverage estimates are only valid if:
- future data resembles evaluation data
- calibration remains stable
- class mix doesn’t drift heavily
That’s why drift monitoring is part of production-grade upgrades.
This project avoids common failure patterns:
Relying on accuracy alone is not safe. 80% accuracy can still mean:
- severe overconfidence
- unacceptable errors on minority classes
- catastrophic errors on specific slices
Thresholding raw confidence is not safe unless calibration supports it. If the model is overconfident, 0.9 is not truly 90% reliable.
Sending everything to human review is not efficient. Abstention focuses review on uncertain cases where humans add the most value.
If you want this to become a real internal tool:
- save trained model artifact (joblib)
- load it in app.py
- run prediction on input text
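A sketch of that upgrade path with joblib and a scikit-learn text pipeline; the model file name, features, and toy training data here are illustrative assumptions, not the repo's exact setup:

```python
# Sketch: persist the trained pipeline so app.py can run real inference.
# The char n-gram features mirror the "Primary model: char" idea; the
# file name "model.joblib" and the tiny corpus are assumptions.
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["written by a person", "generated by a model",
         "a person wrote this", "a model generated this"]
labels = ["human", "ai", "human", "ai"]

model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)
joblib.dump(model, "model.joblib")        # at the end of the pipeline

loaded = joblib.load("model.joblib")      # inside app.py
probs = loaded.predict_proba(["pasted text to score"])[0]
print(dict(zip(loaded.classes_, probs)))
```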
Track performance/calibration by:
- language
- topic/domain
- content length
- “post-edited intensity” bins
Monitor over time:
- confidence distribution drift
- class mix drift
- calibration drift (ECE over time)
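One cheap way to watch confidence-distribution drift is a two-sample Kolmogorov–Smirnov test between a reference window and recent traffic; this sketch uses scipy and synthetic confidence samples, and the 0.01 alert cutoff is an assumption to tune:

```python
# Sketch of a confidence-distribution drift check: compare recent
# confidences against a reference window with a two-sample KS test.
# The beta-distributed samples and the 0.01 cutoff are illustrative.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.beta(8, 2, size=500)   # historically high confidence
recent = rng.beta(4, 3, size=500)      # confidence has slipped

stat, p_value = ks_2samp(reference, recent)
if p_value < 0.01:
    print(f"confidence drift detected (KS={stat:.3f}, p={p_value:.2g})")
```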
Choose threshold by minimizing:
- error cost (wrong auto-decisions)
- review cost (abstentions)
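That cost-based choice can be sketched directly from the sweep data; the cost values (5.0 per wrong auto-decision, 1.0 per review) and the toy arrays are assumptions to replace with your own:

```python
# Sketch of cost-based threshold selection: pick the threshold minimizing
#   expected cost = ERROR_COST * wrong auto-decisions
#                 + REVIEW_COST * abstentions.
# Costs and the toy confidence/correctness arrays are assumptions.
import numpy as np

confidence = np.array([0.95, 0.9, 0.85, 0.7, 0.6, 0.55])
correct    = np.array([1,    1,   0,    1,   0,   1])
ERROR_COST, REVIEW_COST = 5.0, 1.0

def expected_cost(t):
    decided = confidence >= t
    wrong_auto = int((decided & (correct == 0)).sum())
    abstained = int((~decided).sum())
    return ERROR_COST * wrong_auto + REVIEW_COST * abstained

thresholds = np.linspace(0.5, 0.99, 50)
best = min(thresholds, key=expected_cost)
print(f"best threshold ~ {best:.2f}, cost = {expected_cost(best):.1f}")
```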
If you see:

“Please replace `use_container_width` with `width` ...”

Fix:

- use `width="stretch"` in `st.image()` and `st.plotly_chart()`
This repo’s UI layout is designed to use width="stretch" so figures align cleanly.
If the dashboard says outputs are missing:
- click Run / Refresh in the sidebar
- or run the pipeline from the command line
- ensure your out_dir and figures_dir are correct
- confirm the app points to the same project root