amir4javar/multi_teacher_distillation_experiment

Multi-Teacher vs. Single-Teacher Knowledge Distillation

An empirical investigation into whether combining soft labels from multiple fine-tuned teachers improves knowledge distillation compared to using a single teacher. We evaluate four fusion strategies across four GLUE benchmark tasks and find that multi-teacher distillation provides marginal-to-no improvement over the best single teacher under our experimental conditions.

Key Finding

Distillation itself helps: the best distilled student closes ~28% of the teacher–student performance gap relative to the no-distillation baseline. However, multi-teacher fusion does not meaningfully outperform the best single-teacher baseline. Averaged over the four tasks, the best multi-teacher strategy trails the best single teacher by 0.01 percentage points (effectively zero), with per-task differences ranging from +1.1 pp (SST-2) to -0.7 pp (MRPC).

This is a negative result, and we believe it is worth reporting. The findings suggest that for standard GLUE tasks with similarly-architected transformer teachers, the diversity of soft labels across teachers may be insufficient to justify the added complexity of multi-teacher fusion.

Results

Main Comparison (validation set, seed=42)

| Method | Type | SST-2 (acc) | MNLI (acc) | QQP (F1) | MRPC (F1) | AVG |
|---|---|---|---|---|---|---|
| Teacher: BERT-large | teacher | 93.00 | 86.63 | 88.10 | 91.62 | 89.84 |
| Teacher: RoBERTa-large | teacher | 95.76 | 90.10 | 85.36 | 92.71 | 90.98 |
| Teacher: DeBERTa-base | teacher | 95.18 | 88.42 | 88.24 | 91.88 | 90.93 |
| Student: no distillation | baseline | 90.60 | 81.80 | 87.07 | 89.17 | 87.16 |
| Distill: BERT-large | single | 91.40 | 82.39 | 87.57 | 89.90 | 87.82 |
| Distill: RoBERTa-large | single | 90.94 | 82.70 | 87.21 | 90.76 | 87.90 |
| Distill: DeBERTa-base | single | 91.28 | 82.67 | 87.87 | 91.12 | 88.24 |
| Multi: average | multi | 92.55 | 82.70 | 87.93 | 89.73 | 88.23 |
| Multi: weighted | multi | 91.06 | 82.61 | 87.81 | 90.45 | 87.98 |
| Multi: best-per-sample | multi | 90.94 | 82.40 | 87.72 | 90.14 | 87.80 |
| Multi: learned gating | multi | 91.63 | 82.72 | 87.94 | 89.80 | 88.02 |

Note: main table uses a single seed (seed=42). Differences below ~0.5 pp should be interpreted cautiously without confidence intervals. Multi-seed evaluation infrastructure is available in experiments/run_full_comparison.py.

Honest Interpretation

  • Distillation works. Every distilled student outperforms the no-distillation baseline, confirming that soft label training provides useful signal beyond hard labels alone.
  • Multi-teacher ≈ single-teacher. The best multi-teacher strategy (average, 88.23 AVG) is essentially tied with the best single teacher (DeBERTa-base, 88.24 AVG): a difference of -0.01 pp, well within single-seed noise.
  • No strategy consistently wins. Simple averaging is the best multi-teacher strategy overall, but it loses to single-teacher DeBERTa on MRPC by 1.4 pp. No fusion method dominates across all tasks.
  • The learned gating network adds complexity without payoff. Despite having trainable parameters to attend over teachers, it does not outperform the parameter-free average.
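For concreteness, the gating head described above can be sketched as follows. This is a hypothetical reimplementation, not the code in student/strategies.py; the MLP shape and dimension names are illustrative.

```python
import torch
import torch.nn as nn


class TeacherGate(nn.Module):
    """Learned-gating fusion: map the student's [CLS] embedding to
    attention weights over teachers, then fuse teacher logits."""

    def __init__(self, hidden_dim: int, num_teachers: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Linear(hidden_dim // 2, num_teachers),
        )

    def forward(self, cls_embedding: torch.Tensor, teacher_logits: torch.Tensor):
        # cls_embedding: (batch, hidden); teacher_logits: (batch, teachers, classes)
        weights = torch.softmax(self.mlp(cls_embedding), dim=-1)  # (batch, teachers)
        return torch.einsum("bt,btc->bc", weights, teacher_logits)
```

Because the gate's weights sum to 1 per example, it reduces to plain averaging whenever the MLP outputs uniform scores, which may help explain why it never beats the parameter-free average here.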

Per-Task Multi-Teacher Advantage (best multi - best single)

| Task | Advantage | Verdict |
|---|---|---|
| SST-2 | +1.15 pp | Modest gain |
| MNLI | +0.02 pp | No difference |
| QQP | +0.07 pp | No difference |
| MRPC | -0.67 pp | Single teacher wins |

Key Visualizations

Gap Analysis: How Much of the Teacher–Student Gap Distillation Closes

Gap Analysis

The best distilled student closes ~28% of the gap between teacher and no-distillation baseline. Multi-teacher and single-teacher strategies close nearly identical fractions — the gap is not meaningfully different between fusion methods.
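The ~28% figure can be recomputed from the AVG column of the main results table, assuming the gap is measured against the strongest teacher (RoBERTa-large):

```python
# Fraction of the teacher–student gap closed by distillation,
# using AVG scores from the main results table.
best_teacher = 90.98   # Teacher: RoBERTa-large, AVG
baseline     = 87.16   # Student: no distillation, AVG
best_distill = 88.24   # Distill: DeBERTa-base (best distilled student), AVG

gap_closed = (best_distill - baseline) / (best_teacher - baseline)
print(f"{gap_closed:.1%}")  # → 28.3%
```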

Teacher Agreement Breakdown: Accuracy by Agreement Region

Agreement Breakdown

Student accuracy broken down by whether teachers agree or disagree on an example. Multi-teacher students do not outperform single-teacher students in disagreement regions, which is where fusion strategies should theoretically have an advantage.

Teacher Confidence Analysis

A core diagnostic: if teachers are near-certain on almost every example, their soft labels carry little "dark knowledge" (Hinton et al., 2015), and averaging them produces distributions nearly identical to hard labels. This is precisely what we observe.
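A minimal numeric illustration with made-up probabilities: averaging three near-certain teachers still yields a near-one-hot distribution with almost no entropy left for the student to learn from.

```python
import numpy as np

# Illustrative only: three near-certain teacher distributions over 2 classes.
teachers = np.array([
    [0.995, 0.005],
    [0.990, 0.010],
    [0.985, 0.015],
])

avg = teachers.mean(axis=0)            # ensemble average: [0.99, 0.01]
entropy = -(avg * np.log(avg)).sum()   # Shannon entropy in nats

print(avg)      # still nearly one-hot
print(entropy)  # ≈ 0.056 nats — almost no "dark knowledge" left
```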

Per-Task Teacher Confidence (validation split, averaged across 3 teachers)

| Task | Mean Max Prob | Mean Entropy | % High Conf (>0.95) |
|---|---|---|---|
| SST-2 | 0.9922 | 0.0289 | 96.5% |
| MNLI | 0.9796 | 0.0656 | 92.3% |
| QQP | 0.9596 | 0.1181 | 80.0% |
| MRPC | 0.9212 | 0.2063 | 62.6% |

Per-Teacher Confidence (averaged across 4 tasks)

| Teacher | Mean Max Prob | Mean Entropy | % High Conf |
|---|---|---|---|
| BERT-large | 0.9628 | 0.1041 | 82.6% |
| RoBERTa-large | 0.9541 | 0.1279 | 78.6% |
| DeBERTa-base | 0.9725 | 0.0822 | 87.3% |

Interpretation: On SST-2 and MNLI, teachers assign >97% probability to their top prediction on average, with over 92% of examples classified as high-confidence. At this level, soft label distributions are nearly one-hot, leaving minimal "dark knowledge" for distillation to exploit. MRPC shows the most uncertainty (62.6% high-confidence), which correlates with it being the only task where single-teacher meaningfully beats multi-teacher — the noise in teacher signals is amplified, not averaged out, when teachers are uncertain.

This is the strongest evidence for why multi-teacher averaging did not help: the teachers are so confident that ensemble averaging produces distributions nearly identical to any single teacher's output.
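The three statistics in these tables can be recomputed from saved teacher probabilities. A standalone sketch (the repo's version lives in analysis/teacher_confidence.py; this function simply assumes an (N, C) array of softmaxed outputs):

```python
import numpy as np

def confidence_report(probs: np.ndarray, threshold: float = 0.95) -> dict:
    """Summarize teacher confidence for an (N, C) array of class probabilities."""
    max_prob = probs.max(axis=1)
    # Shannon entropy in nats; clipping avoids log(0) on exactly one-hot rows.
    entropy = -(probs * np.log(np.clip(probs, 1e-12, None))).sum(axis=1)
    return {
        "mean_max_prob": float(max_prob.mean()),
        "mean_entropy": float(entropy.mean()),
        "pct_high_conf": float((max_prob > threshold).mean()),
    }

# Tiny example: one confident and one uncertain prediction.
probs = np.array([[0.99, 0.01], [0.60, 0.40]])
print(confidence_report(probs))
```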

Hyperparameter Sensitivity: Grid Sweep Results

We ran a joint grid sweep over temperature T ∈ {3, 6, 9} and alpha α ∈ {0.3, 0.5, 0.7} for both single-teacher DeBERTa and multi-teacher average. Results below reveal a key structural difference between the two methods.

Optimal Configuration Per Task

| Task | Single-DeBERTa best (T, α) | Score | Multi-Average best (T, α) | Score |
|---|---|---|---|---|
| SST-2 | T=9, α=0.3 | 92.20 | T=3, α=0.5 | 92.55 |
| MNLI | T=9, α=0.7 | 83.23 | T=6, α=0.5 | 82.82 |
| QQP | T=3, α=0.5 | 90.98 | T=9, α=0.5 | 91.12 |
| MRPC | T=6, α=0.5 | 87.25 | T=9, α=0.7 | 86.76 |

Full Grid: SST-2 (accuracy)

Single-DeBERTa

| T | α=0.3 | α=0.5 | α=0.7 |
|---|---|---|---|
| T=3 | 91.17 | 91.28 | 90.94 |
| T=6 | 90.60 | 91.17 | 91.51 |
| T=9 | 92.20 | 91.28 | 91.51 |

Multi-Average

| T | α=0.3 | α=0.5 | α=0.7 |
|---|---|---|---|
| T=3 | 91.17 | 92.55 | 91.40 |
| T=6 | 91.40 | 91.40 | 92.09 |
| T=9 | 91.97 | 90.94 | 91.40 |

Full Grid: MRPC (F1)

Single-DeBERTa

| T | α=0.3 | α=0.5 | α=0.7 |
|---|---|---|---|
| T=3 | 85.29 | 87.01 | 86.27 |
| T=6 | 84.80 | 87.25 | 85.78 |
| T=9 | 85.78 | 86.03 | 84.56 |

Multi-Average

| T | α=0.3 | α=0.5 | α=0.7 |
|---|---|---|---|
| T=3 | 85.78 | 85.29 | 86.27 |
| T=6 | 85.05 | 86.03 | 86.52 |
| T=9 | 84.80 | 85.29 | 86.76 |

Key finding from the sweep: Single-teacher DeBERTa consistently prefers higher temperatures (T=9), which is needed to soften its already-sharp, high-confidence predictions. Multi-teacher average, by contrast, peaks at lower temperatures (T=3 on SST-2) — because ensemble averaging already softens the logit distribution. Applying a high temperature on top of pre-averaged logits over-smooths the signal, explaining why multi-teacher underperforms with T=9 on SST-2. This interaction between temperature and fusion method is a substantive finding, not an artifact of the search grid.
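The temperature interaction can be demonstrated with toy logits: averaging two sharp but disagreeing teachers already yields a softer distribution at any temperature, and applying T=9 on top pushes it close to uniform. The logit values here are illustrative, not taken from the experiment.

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()                  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    return -(p * np.log(np.clip(p, 1e-12, None))).sum()

# Two sharp teachers that disagree on a 2-class example (toy logits).
t1 = np.array([4.0, -4.0])
t2 = np.array([-2.0, 2.0])
fused = (t1 + t2) / 2                # averaged logits [1.0, -1.0]: already softer

for T in (1, 3, 9):
    print(T, entropy(softmax(t1, T)), entropy(softmax(fused, T)))
# At T=9 the fused distribution is near-uniform (entropy ≈ ln 2),
# i.e. high temperature over-smooths already-averaged logits.
```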

Why Multi-Teacher Didn't Help (Our Hypothesis)

  1. Low soft-label diversity. All three teachers are pretrained transformer encoders (BERT, RoBERTa, DeBERTa) with similar inductive biases. Critically, they are extremely confident: teachers agree on 85-95% of examples and assign >96% probability to their top prediction on SST-2 and MNLI. Their soft label distributions are nearly identical to hard one-hot labels, leaving minimal dark knowledge for distillation to exploit.
  2. Task saturation. On well-studied GLUE benchmarks, even DistilBERT reaches ~87-91% accuracy. The remaining error budget is dominated by genuinely ambiguous examples where no amount of teacher fusion helps.
  3. Averaging washes out signal. Where teachers disagree, averaging their logits produces a blurred distribution that is less informative than the single most confident teacher's prediction. Our failure analysis shows that on disagreement examples, multi-teacher students do not outperform single-teacher students.
  4. Temperature mismatch. Optimal temperature for single-teacher (T=9) differs from multi-teacher (T=3 on SST-2) because averaging inherently softens distributions. A fixed temperature chosen for one method is suboptimal for the other.

Architecture

Teachers (fine-tuned independently)     Student (distilled)
┌──────────────────────┐
│ BERT-large-uncased   │──┐
└──────────────────────┘  │
┌──────────────────────┐  │  logits    ┌─────────────────────────┐
│ RoBERTa-large        │──┼──────────► │ Fusion Strategy         │
└──────────────────────┘  │            │ (avg / weighted / ...   │──► DistilBERT-base
┌──────────────────────┐  │            │  → fused soft labels)   │    (student)
│ DeBERTa-base         │──┘            └─────────────────────────┘
└──────────────────────┘

Fusion Strategies

| # | Strategy | Description | Learnable? |
|---|---|---|---|
| 1 | Average | Element-wise mean of teacher logits | No |
| 2 | Weighted average | Weighted by each teacher's validation performance on the target task | No |
| 3 | Best-per-sample | Select the lowest-entropy (most confident) teacher per example | No |
| 4 | Learned gating | MLP maps student's [CLS] embedding to per-teacher attention weights | Yes |
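The three parameter-free strategies can be sketched in a few lines. This is a sketch, not the exact code in student/strategies.py; it assumes logits are stacked as (num_teachers, batch, classes), and the learned-gating strategy is omitted because it needs a trainable module.

```python
import numpy as np

def fuse(teacher_logits: np.ndarray, strategy: str, weights=None) -> np.ndarray:
    """Fuse teacher logits of shape (num_teachers, batch, num_classes)."""
    if strategy == "average":
        return teacher_logits.mean(axis=0)
    if strategy == "weighted":
        # weights: per-teacher validation scores, normalized to sum to 1.
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()
        return np.einsum("t,tbc->bc", w, teacher_logits)
    if strategy == "best_per_sample":
        # Pick, per example, the teacher whose softmax has the lowest entropy.
        z = teacher_logits - teacher_logits.max(axis=-1, keepdims=True)
        p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
        ent = -(p * np.log(np.clip(p, 1e-12, None))).sum(axis=-1)  # (T, B)
        best = ent.argmin(axis=0)                                   # (B,)
        return teacher_logits[best, np.arange(teacher_logits.shape[1])]
    raise ValueError(f"unknown strategy: {strategy}")
```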

Loss Function

Standard knowledge distillation loss combining soft and hard targets:

L = α · KL(softmax(student/T), softmax(teacher_fused/T)) · T² + (1-α) · CE(student, hard_labels)

Default hyperparameters: T=6.0, α=0.7 (tuned via grid sweep).
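A PyTorch sketch of this loss, following our reading of the formula above (not necessarily line-for-line identical to student/distillation_trainer.py):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, fused_teacher_logits, hard_labels,
                      T: float = 6.0, alpha: float = 0.7):
    """KD loss: α · KL(soft_student ∥ soft_teacher) · T² + (1−α) · CE(hard)."""
    # KL between temperature-softened distributions, scaled by T² so the
    # soft-target gradients keep the same magnitude as the hard-label term.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(fused_teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    hard_loss = F.cross_entropy(student_logits, hard_labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

Note that `F.kl_div` expects log-probabilities for the student and probabilities for the (fused) teacher, which is why the two softmaxes differ.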

Related Work

Project Structure

multi_teacher_distill/
├── configs/
│   └── config.py                  # Models, tasks, hyperparameters
├── data/
│   ├── load_data.py               # GLUE data loading + tokenization
│   └── soft_labels/               # Precomputed teacher logits (.npy)
├── teachers/
│   ├── train_teacher.py           # Fine-tune a single teacher
│   └── extract_logits.py          # Extract + save teacher soft labels
├── student/
│   ├── train_student.py           # Baseline student (no distillation)
│   ├── distillation_trainer.py    # Core distillation training loop
│   ├── strategies.py              # 4 fusion strategies + single-teacher passthrough
│   └── checkpoints/               # Saved models + eval_results.json
├── analysis/
│   ├── results_table.py           # Build master results CSV
│   ├── agreement_analysis.py      # Teacher agreement region analysis
│   ├── failure_analysis.py        # Error categorization by cause
│   ├── teacher_confidence.py      # Teacher confidence analysis
│   └── visualizations.py          # Gap analysis, heatmaps, bar charts
├── experiments/
│   ├── configs/sweep_config.py    # Temperature/alpha sweep definitions
│   ├── run_temperature_sweep.py   # Temperature grid search
│   ├── run_alpha_sweep.py         # Alpha grid search
│   ├── run_grid_sweep.py          # Joint T × α grid sweep
│   └── run_full_comparison.py     # Full comparison across seeds
├── results/
│   ├── master_results.csv         # Main results table
│   ├── success_metrics.json       # Aggregated metrics
│   ├── gap_analysis.png           # Teacher–student gap visualization
│   ├── agreement_breakdown.png    # Accuracy by agreement region
│   ├── failure_categories.png     # Error breakdown by cause
│   └── ranking_heatmap.png        # Strategy ranking across tasks
├── src/
│   ├── run_all_teachers.py        # Train all 3 teachers × 4 tasks
│   ├── run_extract_logits.py      # Extract logits for all teachers
│   ├── run_student_baseline.py    # Train student without distillation
│   ├── run_single_teacher.py      # Single-teacher distillation baselines
│   ├── run_distillation.py        # Multi-teacher distillation (all strategies)
│   ├── run_analysis.py            # Run all analysis scripts
│   └── verify_logits.py           # Sanity-check saved logits
└── requirements.txt

Reproducing the Experiments

Requirements

pip install -r requirements.txt

Or install manually (Python 3.8+, CUDA recommended):

pip install torch==2.3.0 transformers==4.44.2 datasets==4.6.0 evaluate==0.4.6 \
            numpy==1.26.4 pandas==2.3.3 matplotlib==3.10.8 seaborn==0.13.2 \
            tqdm==4.67.3 scipy

Step-by-Step

# 1. Fine-tune all teachers (3 models × 4 tasks = 12 runs)
python src/run_all_teachers.py

# 2. Extract teacher logits (precompute soft labels for training)
python src/run_extract_logits.py

# 3. Train student baseline (no distillation)
python src/run_student_baseline.py

# 4. Single-teacher distillation baselines
python src/run_single_teacher.py

# 5. Multi-teacher distillation (all 4 fusion strategies × 4 tasks)
python src/run_distillation.py

# 6. Run analysis (results table, agreement analysis, failure analysis, plots)
python src/run_analysis.py

Hyperparameter Sweeps (optional)

# Temperature sweep (fixed α=0.5)
python experiments/run_temperature_sweep.py

# Alpha sweep (uses best T from temperature sweep)
python experiments/run_alpha_sweep.py

# Joint grid sweep over T × α
python experiments/run_grid_sweep.py

Teacher Confidence Report

python analysis/teacher_confidence.py --split val

Hardware

All experiments were run on a single GPU. Teacher fine-tuning is the most expensive step (~1-2 hours per teacher-task pair for the large models). Student distillation is fast (~10-20 minutes per configuration) since it trains DistilBERT-base with precomputed logits.

Analysis Details

Teacher Agreement Analysis

We split validation examples into three regions based on whether teachers agree:

  • All agree (~85-95% of examples): All three teachers predict the same class. Student accuracy is high here regardless of distillation method.
  • Partial disagree (~5-12%): Two teachers agree, one disagrees. This is where multi-teacher fusion could theoretically help by surfacing the majority vote.
  • Full disagree (~0-3%): All teachers predict different classes (only possible for 3-class MNLI). These are the hardest examples.

Finding: Multi-teacher students do not show meaningful accuracy gains over single-teacher students on disagreement examples. The fusion strategies fail to reliably extract signal from teacher disagreements.
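The region split can be computed from stacked teacher predictions. A sketch (the repo's version is analysis/agreement_analysis.py; the prediction arrays below are illustrative):

```python
import numpy as np

def agreement_regions(teacher_preds: np.ndarray) -> np.ndarray:
    """Assign each example to an agreement region.

    teacher_preds: (num_teachers, N) array of predicted class ids.
    Returns an (N,) array of labels: 'all_agree', 'partial', or 'full_disagree'.
    """
    n_unique = np.array([len(set(col)) for col in teacher_preds.T])
    n_teachers = teacher_preds.shape[0]
    return np.where(n_unique == 1, "all_agree",
                    np.where(n_unique == n_teachers, "full_disagree", "partial"))
```

With three teachers, 'full_disagree' requires at least three classes, which is why it only occurs on MNLI.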

Failure Categorization

Every student error is classified into:

  1. All teachers also wrong — distillation cannot fix these; the teachers lack the knowledge
  2. At least one teacher was right — knowledge transfer failed; the student didn't learn from the correct teacher
  3. Error on disagreement — teachers disagreed and the student picked the wrong side

This breakdown reveals that most student errors fall into category 1 (teachers are also wrong), limiting the ceiling for any distillation improvement.
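A sketch of this categorization, assuming the three categories are made disjoint (category 2 restricted to cases where the teachers agreed on the correct answer); the function and names are illustrative, not the analysis/failure_analysis.py API:

```python
import numpy as np

def categorize_errors(student_pred, teacher_preds, labels) -> dict:
    """Count student errors per category.

    student_pred: (N,), teacher_preds: (num_teachers, N), labels: (N,).
    """
    wrong = student_pred != labels
    teacher_correct = (teacher_preds == labels).any(axis=0)   # >=1 teacher right
    teachers_agree = (teacher_preds == teacher_preds[0]).all(axis=0)

    return {
        "all_teachers_wrong": int((wrong & ~teacher_correct).sum()),
        "transfer_failed": int((wrong & teacher_correct & teachers_agree).sum()),
        "error_on_disagreement": int((wrong & teacher_correct & ~teachers_agree).sum()),
    }
```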

Limitations

  • Teacher diversity is low. All teachers are pretrained transformer encoders. Using architecturally diverse teachers (e.g., CNN, LSTM, or differently pretrained transformers) might yield different conclusions.
  • Single student architecture. We only test DistilBERT-base. Results may differ for smaller or larger students.
  • GLUE tasks only. These are well-studied benchmarks where pretrained models already perform well. Multi-teacher distillation may show stronger benefits on harder or lower-resource tasks.
  • No confidence intervals in the main table. The reported numbers are single-seed runs (seed=42). Differences below ~0.5 pp should not be interpreted as meaningful without multi-seed validation. The experiment infrastructure supports multi-seed evaluation (see experiments/run_full_comparison.py), but the main comparison table uses a fixed seed for cost reasons.
  • Offline distillation only. We precompute teacher logits rather than running teachers at train time. Online distillation with data augmentation or curriculum learning might change the picture.

What Would Make Multi-Teacher Work Better?

Based on our analysis, future work could explore:

  1. Architecturally diverse teachers (e.g., mixing transformers with CNNs or retrieval-augmented models) to increase logit diversity
  2. Task-adaptive fusion — learning different fusion weights per task or per difficulty level
  3. Harder tasks where single teachers have larger error rates and more room for complementary coverage
  4. Curriculum-based fusion — weighting teachers differently across training stages
  5. Per-temperature optimization — tuning temperature separately for each fusion method, since optimal T differs between single and multi-teacher configurations (as shown in the grid sweep)

License

This project is for research and educational purposes.
