A systematic empirical study of prediction confidence, calibration, and failure modes in deep neural networks, focusing on the relationship between model confidence and correctness under in-distribution and out-of-distribution (OOD) conditions.
The project combines controlled experimentation, calibration metrics, and qualitative failure analysis to evaluate the reliability of modern classification models beyond standard accuracy-based evaluation.
- Convolutional Neural Network (CNN) implemented from scratch in PyTorch
- CIFAR-10: Training and in-distribution evaluation
- SVHN: Out-of-distribution evaluation
- In-distribution generalization
- Out-of-distribution confidence behavior
- Calibration error and confidence misalignment
- Relationship between prediction confidence and correctness
- Identification of high-confidence incorrect predictions
- Calibration behavior under distribution shift
- Failure modes invisible to accuracy-based metrics
- Limitations of softmax confidence as a reliability measure
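The last question concerns softmax confidence itself. As a minimal illustration (not the project's actual code), the confidence score under study is simply the maximum softmax probability computed from a model's logits:

```python
import numpy as np

def softmax_confidence(logits):
    """Return (confidence, predicted_class) where confidence is the
    maximum softmax probability -- the score whose reliability as a
    correctness signal this project examines."""
    logits = np.asarray(logits, dtype=float)
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return probs.max(axis=-1), probs.argmax(axis=-1)
```

Note that uniform logits yield a confidence of 1/num_classes, which is why confidence values are bounded well above zero even for pure guesses.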
- Classification accuracy
- Expected Calibration Error (ECE)
- Confidence histograms (correct vs incorrect predictions)
- Reliability diagrams
- In-distribution vs out-of-distribution confidence distributions
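Of these metrics, ECE is the least standard to compute by hand. A minimal NumPy sketch, assuming equal-width confidence bins (the project's actual implementation may differ in binning strategy):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Expected Calibration Error: bin predictions by confidence and
    average the |accuracy - mean confidence| gap per bin, weighted by
    the fraction of samples falling in that bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return float(ece)
```

A perfectly calibrated model (e.g., 80% accuracy among predictions made at 0.8 confidence) yields an ECE of zero; an overconfident one does not.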
- Confident misclassification on ambiguous samples
- Overconfidence under severe distribution shift
- Calibration degradation despite high training accuracy
- Overlapping confidence distributions between ID and OOD data
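The first two failure modes reduce to finding predictions that are wrong yet highly confident. A small NumPy sketch of that filter (the 0.9 threshold is an illustrative choice, not one taken from the project):

```python
import numpy as np

def high_confidence_error_rate(confidences, correct, threshold=0.9):
    """Fraction of all predictions that are both incorrect and made
    with confidence above `threshold` -- the failure mode that
    accuracy alone cannot surface."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    return float(((confidences > threshold) & ~correct).mean())
```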
- High training accuracy does not imply reliable confidence estimates
- A non-trivial fraction of incorrect predictions are made with high confidence
- Out-of-distribution samples receive confidence values comparable to in-distribution data
- Calibration error increases even when accuracy degradation is moderate
- `models/`: Neural network architectures
- `data/`: Dataset loaders (datasets downloaded automatically)
- `experiments/`: Configuration files for experimental settings
- `results/metrics/`: Quantitative evaluation results
- `results/plots/`: Calibration and confidence visualizations
- `results/summary.md`: Written analysis of findings
Machine learning models are increasingly deployed in real-world decision-making systems where incorrect but confident predictions can cause significant harm. While accuracy is the dominant evaluation metric, it fails to capture reliability failures that arise under distribution shift.
This project investigates these hidden failure modes through controlled experimentation and structured analysis, emphasizing the need for confidence-aware evaluation in deployed systems.
```
pip install -r requirements.txt
python train.py
python analyze_results.py
```

- Temperature scaling and post-hoc calibration
- Abstention mechanisms for low-confidence predictions
- Controlled label-noise experiments
- Architecture comparisons (e.g., ResNet variants)
- OOD detection metrics (AUROC)
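Temperature scaling, the first item above, divides logits by a scalar T > 1 fitted on a validation set, softening overconfident softmax outputs without changing the predicted class. A hedged sketch of the transformation itself (fitting T is omitted):

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def temperature_scale(logits, temperature):
    """Post-hoc temperature scaling: dividing logits by T > 1 lowers
    the maximum softmax probability while leaving the argmax intact."""
    return np.asarray(logits, dtype=float) / temperature
```

Because the argmax is unchanged, temperature scaling affects calibration metrics like ECE without altering accuracy.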