An empirical investigation into whether combining soft labels from multiple fine-tuned teachers improves knowledge distillation compared to using a single teacher. We evaluate four fusion strategies across four GLUE benchmark tasks and find that multi-teacher distillation provides marginal-to-no improvement over the best single teacher under our experimental conditions.
Distillation itself helps: the best distilled student closes ~28% of the teacher–student performance gap relative to training from scratch. However, multi-teacher fusion does not meaningfully outperform the best single-teacher baseline. Averaged across tasks, the best multi-teacher strategy trails the best single teacher by 0.01 percentage points (effectively zero), with task-level advantages ranging from +1.1 pp (SST-2) to -0.7 pp (MRPC).
This is a negative result, and we believe it is worth reporting. The findings suggest that for standard GLUE tasks with similarly-architected transformer teachers, the diversity of soft labels across teachers may be insufficient to justify the added complexity of multi-teacher fusion.
| Method | Type | SST-2 (acc) | MNLI (acc) | QQP (F1) | MRPC (F1) | AVG |
|---|---|---|---|---|---|---|
| Teacher: BERT-large | teacher | 93.00 | 86.63 | 88.10 | 91.62 | 89.84 |
| Teacher: RoBERTa-large | teacher | 95.76 | 90.10 | 85.36 | 92.71 | 90.98 |
| Teacher: DeBERTa-base | teacher | 95.18 | 88.42 | 88.24 | 91.88 | 90.93 |
| Student: no distillation | baseline | 90.60 | 81.80 | 87.07 | 89.17 | 87.16 |
| Distill: BERT-large | single | 91.40 | 82.39 | 87.57 | 89.90 | 87.82 |
| Distill: RoBERTa-large | single | 90.94 | 82.70 | 87.21 | 90.76 | 87.90 |
| Distill: DeBERTa-base | single | 91.28 | 82.67 | 87.87 | 91.12 | 88.24 |
| Multi: average | multi | 92.55 | 82.70 | 87.93 | 89.73 | 88.23 |
| Multi: weighted | multi | 91.06 | 82.61 | 87.81 | 90.45 | 87.98 |
| Multi: best-per-sample | multi | 90.94 | 82.40 | 87.72 | 90.14 | 87.80 |
| Multi: learned gating | multi | 91.63 | 82.72 | 87.94 | 89.80 | 88.02 |
Note: main table uses a single seed (seed=42). Differences below ~0.5 pp should be interpreted cautiously without confidence intervals. Multi-seed evaluation infrastructure is available in experiments/run_full_comparison.py.
- Distillation works. Every distilled student outperforms the no-distillation baseline, confirming that soft label training provides useful signal beyond hard labels alone.
- Multi-teacher ≈ single-teacher. The best multi-teacher strategy (average, 88.23 AVG) is effectively identical to the best single teacher (DeBERTa-base, 88.24 AVG): the difference is -0.01 pp, well within single-seed noise.
- No strategy consistently wins. Simple averaging is the best multi-teacher strategy overall, but it loses to single-teacher DeBERTa on MRPC by 1.4 pp. No fusion method dominates across all tasks.
- The learned gating network adds complexity without payoff. Despite having trainable parameters to attend over teachers, it does not outperform the parameter-free average.
| Task | Advantage | Verdict |
|---|---|---|
| SST-2 | +1.15 pp | Modest gain |
| MNLI | +0.02 pp | No difference |
| QQP | +0.07 pp | No difference |
| MRPC | -0.67 pp | Single teacher wins |
The best distilled student closes ~28% of the gap between teacher and no-distillation baseline. Multi-teacher and single-teacher strategies close nearly identical fractions — the gap is not meaningfully different between fusion methods.
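For concreteness, the ~28% gap-closure figure can be checked from the AVG column of the main results table, assuming the gap is measured against the strongest teacher (RoBERTa-large); that assumption is ours, not stated explicitly above:

```python
# Gap-closure arithmetic from the AVG column of the main results table.
# Assumes the "teacher" end of the gap is the strongest teacher overall.
best_teacher_avg = 90.98   # RoBERTa-large teacher
no_distill_avg = 87.16     # student trained without distillation
best_student_avg = 88.24   # best distilled student (single-teacher DeBERTa)

gap_closed = (best_student_avg - no_distill_avg) / (best_teacher_avg - no_distill_avg)
# gap_closed ≈ 0.283, i.e. ~28% of the teacher–student gap
```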
Student accuracy broken down by whether teachers agree or disagree on an example. Multi-teacher students do not outperform single-teacher students in disagreement regions, which is where fusion strategies should theoretically have an advantage.
A core diagnostic: if teachers are near-certain on almost every example, their soft labels carry little "dark knowledge" (Hinton et al., 2015), and averaging them produces distributions nearly identical to hard labels. This is precisely what we observe.
| Task | Mean Max Prob | Mean Entropy | % High Conf (>0.95) |
|---|---|---|---|
| SST-2 | 0.9922 | 0.0289 | 96.5% |
| MNLI | 0.9796 | 0.0656 | 92.3% |
| QQP | 0.9596 | 0.1181 | 80.0% |
| MRPC | 0.9212 | 0.2063 | 62.6% |
| Teacher | Mean Max Prob | Mean Entropy | % High Conf (>0.95) |
|---|---|---|---|
| BERT-large | 0.9628 | 0.1041 | 82.6% |
| RoBERTa-large | 0.9541 | 0.1279 | 78.6% |
| DeBERTa-base | 0.9725 | 0.0822 | 87.3% |
Interpretation: On SST-2 and MNLI, teachers assign >97% probability to their top prediction on average, with over 92% of examples classified as high-confidence. At this level, soft label distributions are nearly one-hot, leaving minimal "dark knowledge" for distillation to exploit. MRPC shows the most uncertainty (62.6% high-confidence), which correlates with it being the only task where single-teacher meaningfully beats multi-teacher — the noise in teacher signals is amplified, not averaged out, when teachers are uncertain.
This is the strongest evidence for why multi-teacher averaging did not help: the teachers are so confident that ensemble averaging produces distributions nearly identical to any single teacher's output.
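The confidence statistics above are straightforward to compute from saved logits. The sketch below is illustrative NumPy (the project's actual computation lives in analysis/teacher_confidence.py; the `(n_examples, n_classes)` array shape is our assumption about the saved `.npy` format):

```python
import numpy as np

def confidence_stats(logits, high_conf=0.95):
    """Summarize confidence from raw logits.

    logits: array of shape (n_examples, n_classes) -- an assumed layout
    for the precomputed .npy soft-label files.
    """
    # Softmax with max-subtraction for numerical stability
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    max_prob = probs.max(axis=1)
    # Shannon entropy in nats; clipping avoids log(0)
    entropy = -(probs * np.log(np.clip(probs, 1e-12, None))).sum(axis=1)
    return {
        "mean_max_prob": float(max_prob.mean()),
        "mean_entropy": float(entropy.mean()),
        "pct_high_conf": float((max_prob > high_conf).mean() * 100),
    }

# Toy example: two near-one-hot predictions and one uncertain one
logits = np.array([[8.0, 0.0], [7.5, 0.5], [0.6, 0.4]])
stats = confidence_stats(logits)
```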
We ran a joint grid sweep over temperature T ∈ {3, 6, 9} and alpha α ∈ {0.3, 0.5, 0.7} for both single-teacher DeBERTa and multi-teacher average. Results below reveal a key structural difference between the two methods.
| Task | Single-DeBERTa Best | Score | Multi-Average Best | Score |
|---|---|---|---|---|
| SST-2 | T=9, α=0.3 | 92.20 | T=3, α=0.5 | 92.55 |
| MNLI | T=9, α=0.7 | 83.23 | T=6, α=0.5 | 82.82 |
| QQP | T=3, α=0.5 | 90.98 | T=9, α=0.5 | 91.12 |
| MRPC | T=6, α=0.5 | 87.25 | T=9, α=0.7 | 86.76 |
SST-2 (accuracy), full T × α grid:

| Single-DeBERTa | α=0.3 | α=0.5 | α=0.7 |
|---|---|---|---|
| T=3 | 91.17 | 91.28 | 90.94 |
| T=6 | 90.60 | 91.17 | 91.51 |
| T=9 | 92.20 | 91.28 | 91.51 |

| Multi-Average | α=0.3 | α=0.5 | α=0.7 |
|---|---|---|---|
| T=3 | 91.17 | 92.55 | 91.40 |
| T=6 | 91.40 | 91.40 | 92.09 |
| T=9 | 91.97 | 90.94 | 91.40 |
MRPC (F1), full T × α grid:

| Single-DeBERTa | α=0.3 | α=0.5 | α=0.7 |
|---|---|---|---|
| T=3 | 85.29 | 87.01 | 86.27 |
| T=6 | 84.80 | 87.25 | 85.78 |
| T=9 | 85.78 | 86.03 | 84.56 |

| Multi-Average | α=0.3 | α=0.5 | α=0.7 |
|---|---|---|---|
| T=3 | 85.78 | 85.29 | 86.27 |
| T=6 | 85.05 | 86.03 | 86.52 |
| T=9 | 84.80 | 85.29 | 86.76 |
Key finding from the sweep: Single-teacher DeBERTa consistently prefers higher temperatures (T=9), which is needed to soften its already-sharp, high-confidence predictions. Multi-teacher average, by contrast, peaks at lower temperatures (T=3 on SST-2) — because ensemble averaging already softens the logit distribution. Applying a high temperature on top of pre-averaged logits over-smooths the signal, explaining why multi-teacher underperforms with T=9 on SST-2. This interaction between temperature and fusion method is a substantive finding, not an artifact of the search grid.
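The "averaging already softens" effect is easy to verify numerically. This toy two-class example (illustrative only, not project code) shows that averaging the logits of disagreeing teachers raises entropy on its own, and that a high temperature applied on top pushes the fused distribution close to uniform:

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a 1-D logit vector."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    """Shannon entropy in nats."""
    return float(-(p * np.log(p)).sum())

# Three confident teachers on one 2-class example; the third disagrees
teacher_logits = np.array([[6.0, 0.0], [5.0, 0.0], [0.0, 4.0]])
avg_logits = teacher_logits.mean(axis=0)

single_H = [entropy(softmax(l)) for l in teacher_logits]  # each near-one-hot
avg_H_T1 = entropy(softmax(avg_logits, T=1))  # averaging alone softens
avg_H_T9 = entropy(softmax(avg_logits, T=9))  # high T on top: near-uniform
```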
- Low soft-label diversity. All three teachers are pretrained transformer encoders (BERT, RoBERTa, DeBERTa) with similar inductive biases. Critically, they are extremely confident: teachers agree on 85-95% of examples and assign >96% probability to their top prediction on SST-2 and MNLI. Their soft label distributions are nearly identical to hard one-hot labels, leaving minimal dark knowledge for distillation to exploit.
- Task saturation. On well-studied GLUE benchmarks, even DistilBERT reaches ~87-91% accuracy. The remaining error budget is dominated by genuinely ambiguous examples where no amount of teacher fusion helps.
- Averaging washes out signal. Where teachers disagree, averaging their logits produces a blurred distribution that is less informative than the single most confident teacher's prediction. Our failure analysis shows that on disagreement examples, multi-teacher students do not outperform single-teacher students.
- Temperature mismatch. Optimal temperature for single-teacher (T=9) differs from multi-teacher (T=3 on SST-2) because averaging inherently softens distributions. A fixed temperature chosen for one method is suboptimal for the other.
```
Teachers (fine-tuned independently)                    Student (distilled)
┌──────────────────────┐
│  BERT-large-uncased  │──┐
└──────────────────────┘  │
┌──────────────────────┐  │  logits    ┌─────────────────────────┐
│  RoBERTa-large       │──┼──────────► │   Fusion Strategy       │
└──────────────────────┘  │            │  (avg / weighted / ...  │──► DistilBERT-base
┌──────────────────────┐  │            │   → fused soft labels)  │    (student)
│  DeBERTa-base        │──┘            └─────────────────────────┘
└──────────────────────┘
```
| # | Strategy | Description | Learnable? |
|---|---|---|---|
| 1 | Average | Element-wise mean of teacher logits | No |
| 2 | Weighted average | Weighted by each teacher's validation performance on the target task | No |
| 3 | Best-per-sample | Select the lowest-entropy (most confident) teacher per example | No |
| 4 | Learned gating | MLP maps student's [CLS] embedding to per-teacher attention weights | Yes |
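The three training-free strategies can be sketched in a few lines of NumPy (a simplified illustration; the project's implementations live in student/strategies.py and may differ in detail):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_average(teacher_logits):
    """Strategy 1: element-wise mean over the teacher axis.
    teacher_logits: (n_teachers, n_examples, n_classes)"""
    return teacher_logits.mean(axis=0)

def fuse_weighted(teacher_logits, val_scores):
    """Strategy 2: weight each teacher by its validation score on the task."""
    w = np.asarray(val_scores, dtype=float)
    w = w / w.sum()
    return np.einsum("t,tec->ec", w, teacher_logits)

def fuse_best_per_sample(teacher_logits):
    """Strategy 3: pick the lowest-entropy (most confident) teacher per example."""
    probs = softmax(teacher_logits)
    ent = -(probs * np.log(np.clip(probs, 1e-12, None))).sum(axis=-1)  # (T, E)
    best = ent.argmin(axis=0)                                          # (E,)
    return teacher_logits[best, np.arange(teacher_logits.shape[1])]

# Toy check: 2 teachers, 2 examples, 2 classes
tl = np.array([[[5.0, 0.0], [0.0, 1.0]],
               [[1.0, 0.0], [0.0, 5.0]]])
avg = fuse_average(tl)
best = fuse_best_per_sample(tl)
```

The learned-gating strategy (4) replaces the fixed weights with an MLP over the student's [CLS] embedding, so it cannot be shown without the training loop.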
Standard knowledge distillation loss combining soft and hard targets:
L = α · T² · KL(softmax(teacher_fused/T) ‖ softmax(student/T)) + (1-α) · CE(student, hard_labels)
Default hyperparameters: T=6.0, α=0.7 (tuned via grid sweep).
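The loss maps directly onto PyTorch. Below is a sketch using the standard convention KL(teacher ‖ student), which is what `F.kl_div` computes when the teacher distribution is the target; the project's version lives in student/distillation_trainer.py and may differ in reduction details:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, fused_teacher_logits, hard_labels,
                      T=6.0, alpha=0.7):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE(student, labels)."""
    # kl_div expects log-probs as input and probs as target.
    # "batchmean" matches the KL definition; the T^2 factor keeps the
    # soft-target gradients on the same scale as the hard-label term.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(fused_teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    hard = F.cross_entropy(student_logits, hard_labels)
    return alpha * soft + (1 - alpha) * hard

# Sanity check: matching teacher -> smaller loss than an opposed teacher
s = torch.tensor([[10.0, 0.0], [0.0, 10.0]])
labels = torch.tensor([0, 1])
loss_same = distillation_loss(s, s, labels)
loss_diff = distillation_loss(s, -s, labels)
```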
- Hinton et al. (2015) — "Distilling the Knowledge in a Neural Network". The original knowledge distillation paper. Introduces the temperature-scaled KL loss and the concept of "dark knowledge" in soft targets.
- You et al. (2017) — "Learning from Multiple Teacher Networks". One of the earliest works on multi-teacher distillation; motivates the ensemble approach.
- Liu et al. (2020) — "Adaptive Multi-Teacher Multi-Level Knowledge Distillation". Proposes task-adaptive weighting of teacher contributions; our weighted strategy is a simpler version of this idea.
- Tang et al. (2021) — "Why Does CTC Loss Result in Path Collapse with Shallow Transformers?" — context for why transformer teachers often converge to similar representations regardless of pretraining differences.
- Sanh et al. (2019) — "DistilBERT, a distilled version of BERT". The student architecture used in this work.
```
multi_teacher_distill/
├── configs/
│   └── config.py                  # Models, tasks, hyperparameters
├── data/
│   ├── load_data.py               # GLUE data loading + tokenization
│   └── soft_labels/               # Precomputed teacher logits (.npy)
├── teachers/
│   ├── train_teacher.py           # Fine-tune a single teacher
│   └── extract_logits.py          # Extract + save teacher soft labels
├── student/
│   ├── train_student.py           # Baseline student (no distillation)
│   ├── distillation_trainer.py    # Core distillation training loop
│   ├── strategies.py              # 4 fusion strategies + single-teacher passthrough
│   └── checkpoints/               # Saved models + eval_results.json
├── analysis/
│   ├── results_table.py           # Build master results CSV
│   ├── agreement_analysis.py      # Teacher agreement region analysis
│   ├── failure_analysis.py        # Error categorization by cause
│   ├── teacher_confidence.py      # Teacher confidence analysis
│   └── visualizations.py          # Gap analysis, heatmaps, bar charts
├── experiments/
│   ├── configs/sweep_config.py    # Temperature/alpha sweep definitions
│   ├── run_temperature_sweep.py   # Temperature grid search
│   ├── run_alpha_sweep.py         # Alpha grid search
│   ├── run_grid_sweep.py          # Joint T × α grid sweep
│   └── run_full_comparison.py     # Full comparison across seeds
├── results/
│   ├── master_results.csv         # Main results table
│   ├── success_metrics.json       # Aggregated metrics
│   ├── gap_analysis.png           # Teacher–student gap visualization
│   ├── agreement_breakdown.png    # Accuracy by agreement region
│   ├── failure_categories.png     # Error breakdown by cause
│   └── ranking_heatmap.png        # Strategy ranking across tasks
├── src/
│   ├── run_all_teachers.py        # Train all 3 teachers × 4 tasks
│   ├── run_extract_logits.py      # Extract logits for all teachers
│   ├── run_student_baseline.py    # Train student without distillation
│   ├── run_single_teacher.py      # Single-teacher distillation baselines
│   ├── run_distillation.py        # Multi-teacher distillation (all strategies)
│   ├── run_analysis.py            # Run all analysis scripts
│   └── verify_logits.py           # Sanity-check saved logits
└── requirements.txt
```
```bash
pip install -r requirements.txt
```

Or install manually (Python 3.8+, CUDA recommended):

```bash
pip install torch==2.3.0 transformers==4.44.2 datasets==4.6.0 evaluate==0.4.6 \
    numpy==1.26.4 pandas==2.3.3 matplotlib==3.10.8 seaborn==0.13.2 \
    tqdm==4.67.3 scipy
```

```bash
# 1. Fine-tune all teachers (3 models × 4 tasks = 12 runs)
python src/run_all_teachers.py

# 2. Extract teacher logits (precompute soft labels for training)
python src/run_extract_logits.py

# 3. Train student baseline (no distillation)
python src/run_student_baseline.py

# 4. Single-teacher distillation baselines
python src/run_single_teacher.py

# 5. Multi-teacher distillation (all 4 fusion strategies × 4 tasks)
python src/run_distillation.py

# 6. Run analysis (results table, agreement analysis, failure analysis, plots)
python src/run_analysis.py
```

```bash
# Temperature sweep (fixed α=0.5)
python experiments/run_temperature_sweep.py

# Alpha sweep (uses best T from temperature sweep)
python experiments/run_alpha_sweep.py

# Joint grid sweep over T × α
python experiments/run_grid_sweep.py
```

```bash
python analysis/teacher_confidence.py --split val
```

All experiments were run on a single GPU. Teacher fine-tuning is the most expensive step (~1-2 hours per teacher-task pair for the large models). Student distillation is fast (~10-20 minutes per configuration) since it trains DistilBERT-base with precomputed logits.
We split validation examples into three regions based on whether teachers agree:
- All agree (~85-95% of examples): All three teachers predict the same class. Student accuracy is high here regardless of distillation method.
- Partial disagree (~5-12%): Two teachers agree, one disagrees. This is where multi-teacher fusion could theoretically help by surfacing the majority vote.
- Full disagree (~0-3%): All teachers predict different classes (only possible for 3-class MNLI). These are the hardest examples.
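The three-way split can be sketched with NumPy (illustrative; the project's version lives in analysis/agreement_analysis.py and may differ):

```python
import numpy as np

def agreement_regions(teacher_preds):
    """Partition examples by how many distinct classes the teachers predict.
    teacher_preds: (n_teachers, n_examples) array of predicted class ids."""
    n_unique = np.array([len(set(col)) for col in teacher_preds.T])
    return {
        "all_agree": np.where(n_unique == 1)[0],
        "partial_disagree": np.where(n_unique == 2)[0],
        # 3+ distinct predictions is only possible with >= 3 classes (MNLI)
        "full_disagree": np.where(n_unique >= 3)[0],
    }

# Toy example: 3 teachers, 4 examples
preds = np.array([
    [0, 1, 0, 2],
    [0, 1, 1, 1],
    [0, 1, 1, 0],
])
regions = agreement_regions(preds)
```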
Finding: Multi-teacher students do not show meaningful accuracy gains over single-teacher students on disagreement examples. The fusion strategies fail to reliably extract signal from teacher disagreements.
Every student error is classified into:
- All teachers also wrong — distillation cannot fix these; the teachers lack the knowledge
- At least one teacher was right — knowledge transfer failed; the student didn't learn from the correct teacher
- Error on disagreement — teachers disagreed and the student picked the wrong side
This breakdown reveals that most student errors fall into category 1 (teachers are also wrong), limiting the ceiling for any distillation improvement.
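A minimal version of this bucketing, in NumPy (a sketch; the project's categorization lives in analysis/failure_analysis.py, and the third category overlaps the first two by construction):

```python
import numpy as np

def categorize_errors(student_preds, teacher_preds, labels):
    """Bucket student errors by what the teachers knew.
    student_preds: (E,), teacher_preds: (T, E), labels: (E,)"""
    errors = student_preds != labels
    any_teacher_right = (teacher_preds == labels).any(axis=0)
    teachers_disagree = np.array(
        [len(set(col)) > 1 for col in teacher_preds.T]
    )
    return {
        "all_teachers_also_wrong": int((errors & ~any_teacher_right).sum()),
        "some_teacher_was_right": int((errors & any_teacher_right).sum()),
        # Overlapping diagnostic: errors where teachers split
        "error_on_disagreement": int((errors & teachers_disagree).sum()),
    }

# Toy example: 2 teachers, 4 examples; student errs on examples 1, 2, 3
labels = np.array([0, 1, 0, 1])
student_preds = np.array([0, 0, 1, 0])
teacher_preds = np.array([[0, 0, 1, 1],
                          [0, 0, 1, 0]])
cats = categorize_errors(student_preds, teacher_preds, labels)
```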
- Teacher diversity is low. All teachers are pretrained transformer encoders. Using architecturally diverse teachers (e.g., CNN, LSTM, or differently pretrained transformers) might yield different conclusions.
- Single student architecture. We only test DistilBERT-base. Results may differ for smaller or larger students.
- GLUE tasks only. These are well-studied benchmarks where pretrained models already perform well. Multi-teacher distillation may show stronger benefits on harder or lower-resource tasks.
- No confidence intervals in the main table. The reported numbers are single-seed runs (seed=42). Differences below ~0.5 pp should not be interpreted as meaningful without multi-seed validation. The experiment infrastructure supports multi-seed evaluation (see experiments/run_full_comparison.py), but the main comparison table uses a fixed seed for cost reasons.
- Offline distillation only. We precompute teacher logits rather than running teachers at train time. Online distillation with data augmentation or curriculum learning might change the picture.
Based on our analysis, future work could explore:
- Architecturally diverse teachers (e.g., mixing transformers with CNNs or retrieval-augmented models) to increase logit diversity
- Task-adaptive fusion — learning different fusion weights per task or per difficulty level
- Harder tasks where single teachers have larger error rates and more room for complementary coverage
- Curriculum-based fusion — weighting teachers differently across training stages
- Per-temperature optimization — tuning temperature separately for each fusion method, since optimal T differs between single and multi-teacher configurations (as shown in the grid sweep)
This project is for research and educational purposes.

