An empirical investigation into whether combining soft labels from multiple fine-tuned teachers improves knowledge distillation compared to using a single teacher. We evaluate four fusion strategies across four GLUE benchmark tasks and find that multi-teacher distillation provides marginal-to-no improvement over the best single teacher under our experimental conditions.
Distillation itself helps: the best distilled student closes ~28% of the teacher–student performance gap relative to training from scratch. However, multi-teacher fusion does not meaningfully outperform the best single-teacher baseline. Averaged across tasks, the best multi-teacher strategy trails the best single teacher by 0.01 percentage points (effectively zero), with task-level advantages ranging from +1.1 pp (SST-2) to -0.7 pp (MRPC).
This is a negative result, and we believe it is worth reporting. The findings suggest that for standard GLUE tasks with similarly-architected transformer teachers, the diversity of soft labels across teachers may be insufficient to justify the added complexity of multi-teacher fusion.
| Method | Type | SST-2 (acc) | MNLI (acc) | QQP (F1) | MRPC (F1) | AVG |
|---|---|---|---|---|---|---|
| Teacher: BERT-large | teacher | 93.00 | 86.63 | 88.10 | 91.62 | 89.84 |
| Teacher: RoBERTa-large | teacher | 95.76 | 90.10 | 85.36 | 92.71 | 90.98 |
| Teacher: DeBERTa-base | teacher | 95.18 | 88.42 | 88.24 | 91.88 | 90.93 |
| Student: no distillation | baseline | 90.60 | 81.80 | 87.07 | 89.17 | 87.16 |
| Distill: BERT-large | single | 91.40 | 82.39 | 87.57 | 89.90 | 87.82 |
| Distill: RoBERTa-large | single | 90.94 | 82.70 | 87.21 | 90.76 | 87.90 |
| Distill: DeBERTa-base | single | 91.28 | 82.67 | 87.87 | 91.12 | 88.24 |
| Multi: average | multi | 92.55 | 82.70 | 87.93 | 89.73 | 88.23 |
| Multi: weighted | multi | 91.06 | 82.61 | 87.81 | 90.45 | 87.98 |
| Multi: best-per-sample | multi | 90.94 | 82.40 | 87.72 | 90.14 | 87.80 |
| Multi: learned gating | multi | 91.63 | 82.72 | 87.94 | 89.80 | 88.02 |
Note: main table uses a single seed (seed=42). Differences below ~0.5 pp should be interpreted cautiously without confidence intervals. Multi-seed evaluation infrastructure is available in experiments/run_full_comparison.py.
- Distillation works. Every distilled student outperforms the no-distillation baseline, confirming that soft label training provides useful signal beyond hard labels alone.
- Multi-teacher ≈ single-teacher. The best multi-teacher strategy (average, 88.23 AVG) is effectively identical to the best single teacher (DeBERTa-base, 88.24 AVG): the difference is -0.01 pp, well within single-seed noise.
- No strategy consistently wins. Simple averaging is the best multi-teacher strategy overall, but it loses to single-teacher DeBERTa on MRPC by 1.4 pp. No fusion method dominates across all tasks.
- The learned gating network adds complexity without payoff. Despite having trainable parameters to attend over teachers, it does not outperform the parameter-free average.
| Task | Advantage | Verdict |
|---|---|---|
| SST-2 | +1.15 pp | Modest gain |
| MNLI | +0.02 pp | No difference |
| QQP | +0.07 pp | No difference |
| MRPC | -0.67 pp | Single teacher wins |
The best distilled student closes ~28% of the gap between teacher and no-distillation baseline. Multi-teacher and single-teacher strategies close nearly identical fractions — the gap is not meaningfully different between fusion methods.
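For concreteness, the ~28% gap-closure figure can be checked from the AVG column of the main results table, assuming the gap is measured against the strongest teacher (RoBERTa-large); that assumption is ours, not stated explicitly above:

```python
# Gap-closure arithmetic from the AVG column of the main results table.
# Assumes the "teacher" end of the gap is the strongest teacher overall.
best_teacher_avg = 90.98   # RoBERTa-large teacher
no_distill_avg = 87.16     # student trained without distillation
best_student_avg = 88.24   # best distilled student (single-teacher DeBERTa)

gap_closed = (best_student_avg - no_distill_avg) / (best_teacher_avg - no_distill_avg)
# gap_closed ≈ 0.283, i.e. ~28% of the teacher–student gap
```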
Student accuracy broken down by whether teachers agree or disagree on an example. Multi-teacher students do not outperform single-teacher students in disagreement regions, which is where fusion strategies should theoretically have an advantage.
A core diagnostic: if teachers are near-certain on almost every example, their soft labels carry little "dark knowledge" (Hinton et al., 2015), and averaging them produces distributions nearly identical to hard labels. This is precisely what we observe.
| Task | Mean Max Prob | Mean Entropy | % High Conf (>0.95) |
|---|---|---|---|
| SST-2 | 0.9922 | 0.0289 | 96.5% |
| MNLI | 0.9796 | 0.0656 | 92.3% |
| QQP | 0.9596 | 0.1181 | 80.0% |
| MRPC | 0.9212 | 0.2063 | 62.6% |
| Teacher | Mean Max Prob | Mean Entropy | % High Conf (>0.95) |
|---|---|---|---|
| BERT-large | 0.9628 | 0.1041 | 82.6% |
| RoBERTa-large | 0.9541 | 0.1279 | 78.6% |
| DeBERTa-base | 0.9725 | 0.0822 | 87.3% |
Interpretation: On SST-2 and MNLI, teachers assign >97% probability to their top prediction on average, with over 92% of examples classified as high-confidence. At this level, soft label distributions are nearly one-hot, leaving minimal "dark knowledge" for distillation to exploit. MRPC shows the most uncertainty (62.6% high-confidence), which correlates with it being the only task where single-teacher meaningfully beats multi-teacher — the noise in teacher signals is amplified, not averaged out, when teachers are uncertain.
This is the strongest evidence for why multi-teacher averaging did not help: the teachers are so confident that ensemble averaging produces distributions nearly identical to any single teacher's output.
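The confidence statistics above are straightforward to compute from saved logits. The sketch below is illustrative NumPy (the project's actual computation lives in analysis/teacher_confidence.py; the `(n_examples, n_classes)` array shape is our assumption about the saved `.npy` format):

```python
import numpy as np

def confidence_stats(logits, high_conf=0.95):
    """Summarize confidence from raw logits.

    logits: array of shape (n_examples, n_classes) -- an assumed layout
    for the precomputed .npy soft-label files.
    """
    # Softmax with max-subtraction for numerical stability
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    max_prob = probs.max(axis=1)
    # Shannon entropy in nats; clipping avoids log(0)
    entropy = -(probs * np.log(np.clip(probs, 1e-12, None))).sum(axis=1)
    return {
        "mean_max_prob": float(max_prob.mean()),
        "mean_entropy": float(entropy.mean()),
        "pct_high_conf": float((max_prob > high_conf).mean() * 100),
    }

# Toy example: two near-one-hot predictions and one uncertain one
logits = np.array([[8.0, 0.0], [7.5, 0.5], [0.6, 0.4]])
stats = confidence_stats(logits)
```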
We ran a joint grid sweep over temperature T ∈ {3, 6, 9} and alpha α ∈ {0.3, 0.5, 0.7} for both single-teacher DeBERTa and multi-teacher average. Results below reveal a key structural difference between the two methods.
| Task | Single-DeBERTa Best | Score | Multi-Average Best | Score |
|---|---|---|---|---|
| SST-2 | T=9, α=0.3 | 92.20 | T=3, α=0.5 | 92.55 |
| MNLI | T=9, α=0.7 | 83.23 | T=6, α=0.5 | 82.82 |
| QQP | T=3, α=0.5 | 90.98 | T=9, α=0.5 | 91.12 |
| MRPC | T=6, α=0.5 | 87.25 | T=9, α=0.7 | 86.76 |
SST-2 (accuracy), full T × α grid:

| Single-DeBERTa | α=0.3 | α=0.5 | α=0.7 |
|---|---|---|---|
| T=3 | 91.17 | 91.28 | 90.94 |
| T=6 | 90.60 | 91.17 | 91.51 |
| T=9 | 92.20 | 91.28 | 91.51 |

| Multi-Average | α=0.3 | α=0.5 | α=0.7 |
|---|---|---|---|
| T=3 | 91.17 | 92.55 | 91.40 |
| T=6 | 91.40 | 91.40 | 92.09 |
| T=9 | 91.97 | 90.94 | 91.40 |
MRPC (F1), full T × α grid:

| Single-DeBERTa | α=0.3 | α=0.5 | α=0.7 |
|---|---|---|---|
| T=3 | 85.29 | 87.01 | 86.27 |
| T=6 | 84.80 | 87.25 | 85.78 |
| T=9 | 85.78 | 86.03 | 84.56 |

| Multi-Average | α=0.3 | α=0.5 | α=0.7 |
|---|---|---|---|
| T=3 | 85.78 | 85.29 | 86.27 |
| T=6 | 85.05 | 86.03 | 86.52 |
| T=9 | 84.80 | 85.29 | 86.76 |
Key finding from the sweep: Single-teacher DeBERTa consistently prefers higher temperatures (T=9), which is needed to soften its already-sharp, high-confidence predictions. Multi-teacher average, by contrast, peaks at lower temperatures (T=3 on SST-2) — because ensemble averaging already softens the logit distribution. Applying a high temperature on top of pre-averaged logits over-smooths the signal, explaining why multi-teacher underperforms with T=9 on SST-2. This interaction between temperature and fusion method is a substantive finding, not an artifact of the search grid.
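The "averaging already softens" effect is easy to verify numerically. This toy two-class example (illustrative only, not project code) shows that averaging the logits of disagreeing teachers raises entropy on its own, and that a high temperature applied on top pushes the fused distribution close to uniform:

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a 1-D logit vector."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    """Shannon entropy in nats."""
    return float(-(p * np.log(p)).sum())

# Three confident teachers on one 2-class example; the third disagrees
teacher_logits = np.array([[6.0, 0.0], [5.0, 0.0], [0.0, 4.0]])
avg_logits = teacher_logits.mean(axis=0)

single_H = [entropy(softmax(l)) for l in teacher_logits]  # each near-one-hot
avg_H_T1 = entropy(softmax(avg_logits, T=1))  # averaging alone softens
avg_H_T9 = entropy(softmax(avg_logits, T=9))  # high T on top: near-uniform
```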
- Low soft-label diversity. All three teachers are pretrained transformer encoders (BERT, RoBERTa, DeBERTa) with similar inductive biases. Critically, they are extremely confident: teachers agree on 85-95% of examples and assign >96% probability to their top prediction on SST-2 and MNLI. Their soft label distributions are nearly identical to hard one-hot labels, leaving minimal dark knowledge for distillation to exploit.
- Task saturation. On well-studied GLUE benchmarks, even DistilBERT reaches ~87-91% accuracy. The remaining error budget is dominated by genuinely ambiguous examples where no amount of teacher fusion helps.
- Averaging washes out signal. Where teachers disagree, averaging their logits produces a blurred distribution that is less informative than the single most confident teacher's prediction. Our failure analysis shows that on disagreement examples, multi-teacher students do not outperform single-teacher students.
- Temperature mismatch. Optimal temperature for single-teacher (T=9) differs from multi-teacher (T=3 on SST-2) because averaging inherently softens distributions. A fixed temperature chosen for one method is suboptimal for the other.
```
Teachers (fine-tuned independently)                    Student (distilled)
┌──────────────────────┐
│  BERT-large-uncased  │──┐
└──────────────────────┘  │
┌──────────────────────┐  │  logits    ┌─────────────────────────┐
│  RoBERTa-large       │──┼──────────► │   Fusion Strategy       │
└──────────────────────┘  │            │  (avg / weighted / ...  │──► DistilBERT-base
┌──────────────────────┐  │            │   → fused soft labels)  │    (student)
│  DeBERTa-base        │──┘            └─────────────────────────┘
└──────────────────────┘
```
| # | Strategy | Description | Learnable? |
|---|---|---|---|
| 1 | Average | Element-wise mean of teacher logits | No |
| 2 | Weighted average | Weighted by each teacher's validation performance on the target task | No |
| 3 | Best-per-sample | Select the lowest-entropy (most confident) teacher per example | No |
| 4 | Learned gating | MLP maps student's [CLS] embedding to per-teacher attention weights | Yes |
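The three training-free strategies can be sketched in a few lines of NumPy (a simplified illustration; the project's implementations live in student/strategies.py and may differ in detail):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_average(teacher_logits):
    """Strategy 1: element-wise mean over the teacher axis.
    teacher_logits: (n_teachers, n_examples, n_classes)"""
    return teacher_logits.mean(axis=0)

def fuse_weighted(teacher_logits, val_scores):
    """Strategy 2: weight each teacher by its validation score on the task."""
    w = np.asarray(val_scores, dtype=float)
    w = w / w.sum()
    return np.einsum("t,tec->ec", w, teacher_logits)

def fuse_best_per_sample(teacher_logits):
    """Strategy 3: pick the lowest-entropy (most confident) teacher per example."""
    probs = softmax(teacher_logits)
    ent = -(probs * np.log(np.clip(probs, 1e-12, None))).sum(axis=-1)  # (T, E)
    best = ent.argmin(axis=0)                                          # (E,)
    return teacher_logits[best, np.arange(teacher_logits.shape[1])]

# Toy check: 2 teachers, 2 examples, 2 classes
tl = np.array([[[5.0, 0.0], [0.0, 1.0]],
               [[1.0, 0.0], [0.0, 5.0]]])
avg = fuse_average(tl)
best = fuse_best_per_sample(tl)
```

The learned-gating strategy (4) replaces the fixed weights with an MLP over the student's [CLS] embedding, so it cannot be shown without the training loop.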
Standard knowledge distillation loss combining soft and hard targets:
L = α · T² · KL(softmax(teacher_fused/T) ‖ softmax(student/T)) + (1-α) · CE(student, hard_labels)
Default hyperparameters: T=6.0, α=0.7 (tuned via grid sweep).
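The loss maps directly onto PyTorch. Below is a sketch using the standard convention KL(teacher ‖ student), which is what `F.kl_div` computes when the teacher distribution is the target; the project's version lives in student/distillation_trainer.py and may differ in reduction details:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, fused_teacher_logits, hard_labels,
                      T=6.0, alpha=0.7):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE(student, labels)."""
    # kl_div expects log-probs as input and probs as target.
    # "batchmean" matches the KL definition; the T^2 factor keeps the
    # soft-target gradients on the same scale as the hard-label term.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(fused_teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    hard = F.cross_entropy(student_logits, hard_labels)
    return alpha * soft + (1 - alpha) * hard

# Sanity check: matching teacher -> smaller loss than an opposed teacher
s = torch.tensor([[10.0, 0.0], [0.0, 10.0]])
labels = torch.tensor([0, 1])
loss_same = distillation_loss(s, s, labels)
loss_diff = distillation_loss(s, -s, labels)
```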
- Hinton et al. (2015) — "Distilling the Knowledge in a Neural Network". The original knowledge distillation paper. Introduces the temperature-scaled KL loss and the concept of "dark knowledge" in soft targets.
- You et al. (2017) — "Learning from Multiple Teacher Networks". One of the earliest works on multi-teacher distillation; motivates the ensemble approach.
- Liu et al. (2020) — "Adaptive Multi-Teacher Multi-Level Knowledge Distillation". Proposes task-adaptive weighting of teacher contributions; our weighted strategy is a simpler version of this idea.
- Tang et al. (2021) — "Why Does CTC Loss Result in Path Collapse with Shallow Transformers?" — context for why transformer teachers often converge to similar representations regardless of pretraining differences.
- Sanh et al. (2019) — "DistilBERT, a distilled version of BERT". The student architecture used in this work.
```
multi_teacher_distill/
├── configs/
│   └── config.py                  # Models, tasks, hyperparameters
├── data/
│   ├── load_data.py               # GLUE data loading + tokenization
│   └── soft_labels/               # Precomputed teacher logits (.npy)
├── teachers/
│   ├── train_teacher.py           # Fine-tune a single teacher
│   └── extract_logits.py          # Extract + save teacher soft labels
├── student/
│   ├── train_student.py           # Baseline student (no distillation)
│   ├── distillation_trainer.py    # Core distillation training loop
│   ├── strategies.py              # 4 fusion strategies + single-teacher passthrough
│   └── checkpoints/               # Saved models + eval_results.json
├── analysis/
│   ├── results_table.py           # Build master results CSV
│   ├── agreement_analysis.py      # Teacher agreement region analysis
│   ├── failure_analysis.py        # Error categorization by cause
│   ├── teacher_confidence.py      # Teacher confidence analysis
│   └── visualizations.py          # Gap analysis, heatmaps, bar charts
├── experiments/
│   ├── configs/sweep_config.py    # Temperature/alpha sweep definitions
│   ├── run_temperature_sweep.py   # Temperature grid search
│   ├── run_alpha_sweep.py         # Alpha grid search
│   ├── run_grid_sweep.py          # Joint T × α grid sweep
│   └── run_full_comparison.py     # Full comparison across seeds
├── results/
│   ├── master_results.csv         # Main results table
│   ├── success_metrics.json       # Aggregated metrics
│   ├── gap_analysis.png           # Teacher–student gap visualization
│   ├── agreement_breakdown.png    # Accuracy by agreement region
│   ├── failure_categories.png     # Error breakdown by cause
│   └── ranking_heatmap.png        # Strategy ranking across tasks
├── src/
│   ├── run_all_teachers.py        # Train all 3 teachers × 4 tasks
│   ├── run_extract_logits.py      # Extract logits for all teachers
│   ├── run_student_baseline.py    # Train student without distillation
│   ├── run_single_teacher.py      # Single-teacher distillation baselines
│   ├── run_distillation.py        # Multi-teacher distillation (all strategies)
│   ├── run_analysis.py            # Run all analysis scripts
│   └── verify_logits.py           # Sanity-check saved logits
└── requirements.txt
```
```bash
pip install -r requirements.txt
```

Or install manually (Python 3.8+, CUDA recommended):

```bash
pip install torch==2.3.0 transformers==4.44.2 datasets==4.6.0 evaluate==0.4.6 \
    numpy==1.26.4 pandas==2.3.3 matplotlib==3.10.8 seaborn==0.13.2 \
    tqdm==4.67.3 scipy
```

```bash
# 1. Fine-tune all teachers (3 models × 4 tasks = 12 runs)
python src/run_all_teachers.py

# 2. Extract teacher logits (precompute soft labels for training)
python src/run_extract_logits.py

# 3. Train student baseline (no distillation)
python src/run_student_baseline.py

# 4. Single-teacher distillation baselines
python src/run_single_teacher.py

# 5. Multi-teacher distillation (all 4 fusion strategies × 4 tasks)
python src/run_distillation.py

# 6. Run analysis (results table, agreement analysis, failure analysis, plots)
python src/run_analysis.py
```

```bash
# Temperature sweep (fixed α=0.5)
python experiments/run_temperature_sweep.py

# Alpha sweep (uses best T from temperature sweep)
python experiments/run_alpha_sweep.py

# Joint grid sweep over T × α
python experiments/run_grid_sweep.py
```

```bash
python analysis/teacher_confidence.py --split val
```

All experiments were run on a single GPU. Teacher fine-tuning is the most expensive step (~1-2 hours per teacher-task pair for the large models). Student distillation is fast (~10-20 minutes per configuration) since it trains DistilBERT-base with precomputed logits.
We split validation examples into three regions based on whether teachers agree:
- All agree (~85-95% of examples): All three teachers predict the same class. Student accuracy is high here regardless of distillation method.
- Partial disagree (~5-12%): Two teachers agree, one disagrees. This is where multi-teacher fusion could theoretically help by surfacing the majority vote.
- Full disagree (~0-3%): All teachers predict different classes (only possible for 3-class MNLI). These are the hardest examples.
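The three-way split can be sketched with NumPy (illustrative; the project's version lives in analysis/agreement_analysis.py and may differ):

```python
import numpy as np

def agreement_regions(teacher_preds):
    """Partition examples by how many distinct classes the teachers predict.
    teacher_preds: (n_teachers, n_examples) array of predicted class ids."""
    n_unique = np.array([len(set(col)) for col in teacher_preds.T])
    return {
        "all_agree": np.where(n_unique == 1)[0],
        "partial_disagree": np.where(n_unique == 2)[0],
        # 3+ distinct predictions is only possible with >= 3 classes (MNLI)
        "full_disagree": np.where(n_unique >= 3)[0],
    }

# Toy example: 3 teachers, 4 examples
preds = np.array([
    [0, 1, 0, 2],
    [0, 1, 1, 1],
    [0, 1, 1, 0],
])
regions = agreement_regions(preds)
```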
Finding: Multi-teacher students do not show meaningful accuracy gains over single-teacher students on disagreement examples. The fusion strategies fail to reliably extract signal from teacher disagreements.
Every student error is classified into:
- All teachers also wrong — distillation cannot fix these; the teachers lack the knowledge
- At least one teacher was right — knowledge transfer failed; the student didn't learn from the correct teacher
- Error on disagreement — teachers disagreed and the student picked the wrong side
This breakdown reveals that most student errors fall into category 1 (teachers are also wrong), limiting the ceiling for any distillation improvement.
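A minimal version of this bucketing, in NumPy (a sketch; the project's categorization lives in analysis/failure_analysis.py, and the third category overlaps the first two by construction):

```python
import numpy as np

def categorize_errors(student_preds, teacher_preds, labels):
    """Bucket student errors by what the teachers knew.
    student_preds: (E,), teacher_preds: (T, E), labels: (E,)"""
    errors = student_preds != labels
    any_teacher_right = (teacher_preds == labels).any(axis=0)
    teachers_disagree = np.array(
        [len(set(col)) > 1 for col in teacher_preds.T]
    )
    return {
        "all_teachers_also_wrong": int((errors & ~any_teacher_right).sum()),
        "some_teacher_was_right": int((errors & any_teacher_right).sum()),
        # Overlapping diagnostic: errors where teachers split
        "error_on_disagreement": int((errors & teachers_disagree).sum()),
    }

# Toy example: 2 teachers, 4 examples; student errs on examples 1, 2, 3
labels = np.array([0, 1, 0, 1])
student_preds = np.array([0, 0, 1, 0])
teacher_preds = np.array([[0, 0, 1, 1],
                          [0, 0, 1, 0]])
cats = categorize_errors(student_preds, teacher_preds, labels)
```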
- Teacher diversity is low. All teachers are pretrained transformer encoders. Using architecturally diverse teachers (e.g., CNN, LSTM, or differently pretrained transformers) might yield different conclusions.
- Single student architecture. We only test DistilBERT-base. Results may differ for smaller or larger students.
- GLUE tasks only. These are well-studied benchmarks where pretrained models already perform well. Multi-teacher distillation may show stronger benefits on harder or lower-resource tasks.
- No confidence intervals in the main table. The reported numbers are single-seed runs (seed=42). Differences below ~0.5 pp should not be interpreted as meaningful without multi-seed validation. The experiment infrastructure supports multi-seed evaluation (see experiments/run_full_comparison.py), but the main comparison table uses a fixed seed for cost reasons.
- Offline distillation only. We precompute teacher logits rather than running teachers at train time. Online distillation with data augmentation or curriculum learning might change the picture.
Based on our analysis, future work could explore:
- Architecturally diverse teachers (e.g., mixing transformers with CNNs or retrieval-augmented models) to increase logit diversity
- Task-adaptive fusion — learning different fusion weights per task or per difficulty level
- Harder tasks where single teachers have larger error rates and more room for complementary coverage
- Curriculum-based fusion — weighting teachers differently across training stages
- Per-temperature optimization — tuning temperature separately for each fusion method, since optimal T differs between single and multi-teacher configurations (as shown in the grid sweep)
This project is for research and educational purposes.

