Does fine-tuning a draft model on domain-specific preference data improve speculative decoding acceptance rate for that domain?
This project investigates the intersection of two modern LLM techniques — Direct Preference Optimization (DPO) and Speculative Decoding — in the context of assistive technology (AT) queries. We fine-tune a small draft model on AT-domain preference data and measure whether alignment improves token acceptance rates during speculative decoding against a larger, frozen verifier.
- Background
- Hypothesis
- Models
- Methodology
- Results
- Key Finding
- Project Structure
- Reproducing This Work
- Limitations
- Future Work
- References
Speculative decoding accelerates LLM inference by using a small draft model to speculatively generate K tokens, which a larger verifier model then accepts or rejects in a single parallel forward pass. The speedup depends primarily on the token acceptance rate α, the fraction of draft tokens the verifier accepts. Formally:
α = 1 - TV(p_verifier, q_draft)

where TV is the total variation distance between the two models' next-token distributions at a given position. The lower the divergence, the higher α, and the faster inference becomes.
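As a toy illustration of this relationship (not project code), α for a single position can be computed directly from two next-token distributions over a small vocabulary:

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Toy 4-token vocabulary: the closer the draft's distribution is to the
# verifier's, the higher the expected acceptance rate alpha = 1 - TV(p, q).
p_verifier = [0.70, 0.20, 0.05, 0.05]
q_draft    = [0.60, 0.25, 0.10, 0.05]

alpha = 1.0 - total_variation(p_verifier, q_draft)  # 0.9 for these toys
```

Here TV = 0.5 × (0.10 + 0.05 + 0.05 + 0) = 0.10, so α = 0.9: the verifier would accept roughly 9 of every 10 draft tokens at positions with this divergence.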
The natural question: if we fine-tune the draft model to be more similar to the verifier on a specific domain, does α improve for that domain?
DPO (Rafailov et al., 2023) allows us to align a model toward preferred
outputs without a separate reward model, using only preference pairs
(prompt, chosen, rejected). It optimizes:
L_DPO = -E log σ( β·log[π_θ(y_w|x)/π_ref(y_w|x)] - β·log[π_θ(y_l|x)/π_ref(y_l|x)] )
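For intuition, the loss can be evaluated by hand for a single preference pair. A minimal pure-Python sketch (the log-probability values are illustrative, not from the project):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    logp_* are summed token log-probabilities of each response under the
    policy pi_theta; ref_logp_* are the same under the frozen reference
    pi_ref. Implements -log(sigmoid(beta * margin)).
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Positive margin (policy prefers chosen more than the reference does)
# pushes the loss below log(2); negative margin pushes it above.
low  = dpo_loss(-10.0, -20.0, -12.0, -18.0)   # margin = 0.1 * 4
high = dpo_loss(-20.0, -10.0, -18.0, -12.0)   # margin = -0.4
```

At margin 0 the loss is exactly log 2 ≈ 0.693, so the two calls above bracket that value from below and above.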
This project applies DPO to the draft model using AT-domain preference data, then measures whether the aligned draft model achieves higher α during speculative decoding against an unaligned verifier.
A draft model fine-tuned via DPO on AT-domain preference pairs will exhibit lower token distribution divergence from the verifier model on AT-domain queries, resulting in measurably higher acceptance rate α and reduced inference latency for that domain.
Null hypothesis: DPO alignment does not significantly change acceptance rate for in-domain queries compared to the base draft model.
| Role | Model | Parameters | Quantization |
|---|---|---|---|
| Draft (base) | Phi-3 Mini Instruct | 3.8B | 4-bit NF4 |
| Draft (aligned) | Phi-3 Mini + DPO LoRA | 3.8B + ~14M trainable | 4-bit NF4 |
| Verifier | Phi-3 Medium Instruct | 14B | 4-bit NF4 |
Why Phi-3 Mini + Phi-3 Medium? Speculative decoding requires the draft and verifier to share an identical vocabulary. Phi-3 Mini and Phi-3 Medium use the same 32,011-token tokenizer, making them a valid model pair. Mistral-7B (32,000 tokens) was considered initially, but a vocabulary check found 991 token mismatches against Phi-3 Mini, making that pairing unusable for speculative decoding.
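A minimal sketch of the kind of check `check_vocab_compatibility.py` performs, using toy vocabularies (the function name and data here are hypothetical; the real script would compare `tokenizer.get_vocab()` for both checkpoints):

```python
def count_vocab_mismatches(vocab_a, vocab_b):
    """Count tokens missing from, or mapped to different ids in, the
    other vocabulary. 0 means a valid draft/verifier pair."""
    all_tokens = set(vocab_a) | set(vocab_b)
    return sum(1 for t in all_tokens if vocab_a.get(t) != vocab_b.get(t))

# Toy vocabularies (token -> id) standing in for real tokenizers.
mini    = {"<s>": 0, "the": 1, "cat": 2}
medium  = {"<s>": 0, "the": 1, "cat": 2}
mistral = {"<s>": 0, "the": 1, "dog": 2}

assert count_vocab_mismatches(mini, medium) == 0   # compatible pair
```

Any nonzero count means draft token ids would be scored against the wrong verifier logits, so the pair must be rejected outright.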
Generated 985 preference pairs covering the assistive technology domain using a structured rotation across disability types, user types, and task types to maximize diversity.
```
Disability types: low vision, blindness, motor impairment, cognitive
                  disability, hearing impairment, dyslexia, ADHD, ALS,
                  color blindness, tremor, multiple disabilities, deafblindness

User types:       end user, developer, IT admin, teacher, occupational
                  therapist, caregiver, procurement officer

Task types:       tool recommendation, configuration, troubleshooting,
                  standards compliance, enterprise deployment, comparison
```
Each preference pair:
```json
{
  "prompt": "What screen reader works best for a blind developer using VS Code?",
  "chosen": "NVDA with the VSCode accessibility plugin provides the best experience...",
  "rejected": "There are many screen readers available that could work for developers..."
}
```

Chosen responses name specific AT tools, cite relevant standards (WCAG, EN 301 549, Section 508), and give actionable guidance. Rejected responses are plausible but vague, generic, and non-actionable.
Dataset statistics:

```
Total pairs:            985
Avg chosen length:      555 chars
Avg rejected length:    281 chars
Length ratio (c/r):     1.97x
Near-duplicate prompts: 0 (verified with 12-word key deduplication)
```
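The 12-word-key deduplication can be sketched as follows (function name and sample data hypothetical; only the keying idea is taken from the stats above):

```python
def dedup_by_prefix_key(pairs, n_words=12):
    """Drop pairs whose prompt shares its first n_words (lowercased)
    with an earlier prompt -- a cheap near-duplicate filter."""
    seen, kept = set(), []
    for pair in pairs:
        key = " ".join(pair["prompt"].lower().split()[:n_words])
        if key not in seen:
            seen.add(key)
            kept.append(pair)
    return kept

pairs = [
    {"prompt": "What screen reader works best for a blind developer?"},
    {"prompt": "What screen reader works best for a blind developer using Vim?"},
    {"prompt": "Which magnifier suits low-vision spreadsheet users?"},
]
deduped = dedup_by_prefix_key(pairs, n_words=8)  # drops the near-duplicate
```

Keying on a word prefix rather than the full string catches prompts that differ only in a trailing clause, which exact-match deduplication would miss.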
Fine-tuned Phi-3 Mini using QLoRA + DPO on the generated dataset.
```
Base model:          microsoft/Phi-3-mini-4k-instruct
Method:              QLoRA (4-bit NF4) + DPO
LoRA rank:           16
LoRA alpha:          32
Target modules:      q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
DPO β:               0.1
Epochs:              2
Effective batch:     16 (batch=2, grad_accum=8)
Learning rate:       5e-5 (cosine schedule)
Max sequence length: 512
Hardware:            RTX 3090 24GB (Vast.ai)
Training time:       ~1.5 hours
Trainable params:    ~14M / 3.8B (0.37%)
```
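These hyperparameters map directly onto TRL's `DPOConfig` and PEFT's `LoraConfig`. A configuration sketch, not the project's actual `train_dpo.py` (exact argument names vary slightly across TRL versions):

```python
from peft import LoraConfig
from trl import DPOConfig

# LoRA adapter: rank 16, alpha 32, attention + MLP projections.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# DPO training arguments matching the table above.
training_args = DPOConfig(
    output_dir="final_adapter",
    beta=0.1,
    num_train_epochs=2,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # effective batch = 2 * 8 = 16
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    max_length=512,
)

# These would then be passed to trl.DPOTrainer along with the 4-bit
# quantized base model and the preference dataset, and trainer.train()
# would produce the LoRA adapter.
```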
Qualitative comparison on held-out AT prompt:
| Model | Response |
|---|---|
| Base | "Assistive technology tools can be helpful... DynaVox Communication Devices..." |
| DPO-tuned | "Eye-tracking Technology: Devices like Tobii Dynavox I-Series allow users to type by selecting letters and words on a screen with their eyes..." |
The DPO-tuned model produces structured, tool-specific responses consistent with chosen training examples.
Speculative decoding was implemented from scratch in PyTorch, following Leviathan et al. (2023).
Core loop:
1. Draft phase: Small model autoregressively generates K tokens, storing q(t_i)
2. Verify phase: Large model runs ONE forward pass over all K tokens, producing p(t_i)
3. Accept/reject: Accept token i with probability min(1, p(t_i)/q(t_i))
4. On rejection: Sample correction from residual(t) = normalize(max(0, p(t) - q(t)))
5. Free token: If all K accepted, sample one more from verifier's position K+1
The residual sampling step guarantees that the output distribution is mathematically identical to sampling from the verifier alone.
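Steps 3 and 4 can be sketched in a few lines of pure Python over toy distributions (an illustration, not the project's PyTorch implementation; step 5's bonus token is omitted):

```python
import random

def residual(p, q):
    """Correction distribution after a rejection: normalize(max(0, p - q))."""
    r = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    z = sum(r)
    return [ri / z for ri in r]

def verify_draft(p_dists, q_dists, draft_tokens, rng=random.random):
    """Accept/reject K draft tokens against verifier distributions.

    p_dists[i], q_dists[i] are the verifier/draft next-token distributions
    at position i; draft_tokens[i] is the token the draft sampled there.
    Returns the accepted prefix plus, on the first rejection, one token
    resampled from the residual distribution.
    """
    accepted = []
    for p, q, t in zip(p_dists, q_dists, draft_tokens):
        if rng() < min(1.0, p[t] / q[t]):        # accept w.p. min(1, p/q)
            accepted.append(t)
        else:                                    # reject: resample correction
            r = residual(p, q)
            u, acc = rng(), 0.0
            for tok, prob in enumerate(r):
                acc += prob
                if u < acc:
                    accepted.append(tok)
                    break
            break
    return accepted
```

Clipping the residual at zero and renormalizing is exactly what makes the combined accept/resample procedure reproduce the verifier's distribution: mass the draft over-allocates is rejected in proportion, and the shortfall is redistributed where the verifier has more mass.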
K = 4, max_new_tokens = 200, temperature = 0.7
Experiment A: Base Phi-3 Mini (draft) + Phi-3 Medium (verifier, frozen)
Experiment B: DPO-tuned Phi-3 Mini (draft) + Phi-3 Medium (verifier, frozen)
Evaluated on:
10 AT-domain prompts (in-domain)
2 general prompts (out-of-domain)
Primary metric: Token acceptance rate α
Secondary metrics: latency (ms), tokens/second
| Prompt | Query Type | Base α | DPO α | Δα |
|---|---|---|---|---|
| Screen reader for blind users | Recommendation | 0.777 | 0.750 | -0.027 |
| Keyboard alternatives (motor) | Recommendation | 0.754 | 0.665 | -0.089 |
| Eye tracking for ALS | Recommendation | 0.676 | 0.795 | +0.119 |
| AT tools for dyslexia | Recommendation | 0.754 | 0.824 | +0.070 |
| Captioning for deaf employees | Recommendation | 0.577 | 0.623 | +0.046 |
| NVDA configuration | Procedural | 0.704 | 0.668 | -0.036 |
| Low vision + motor (combined) | Recommendation | 0.696 | 0.768 | +0.072 |
| React screen reader testing | Procedural/Technical | 0.746 | 0.689 | -0.057 |
| WCAG 2.1 vs EN 301 549 | Definitional | 0.720 | 0.810 | +0.090 |
| IT admin AT deployment | Procedural | 0.887 | 0.892 | +0.005 |
| Average | | 0.729 | 0.748 | +0.019 |
| Prompt | Base α | DPO α | Δα |
|---|---|---|---|
| Photosynthesis | 1.000 | 0.934 | -0.066 |
| Binary search algorithm | 0.865 | 0.791 | -0.074 |
| Average | 0.933 | 0.863 | -0.070 |
| Condition | Base α | DPO α | Δα |
|---|---|---|---|
| AT-domain (all) | 0.729 | 0.748 | +0.019 |
| AT-domain — recommendation/definitional | 0.676 | 0.752 | +0.067 |
| AT-domain — procedural/technical | 0.784 | 0.749 | -0.052 |
| Out-of-domain | 0.933 | 0.863 | -0.070 |
DPO alignment effects are query-type dependent, not simply domain dependent.
The aggregate in-domain improvement (+0.019) conceals two opposing effects:
DPO consistently helps on recommendation and definitional queries (+0.067): these directly match the training distribution — "what tool works for X", "what is the difference between X and Y". The draft model's token distribution shifts closer to the verifier's for these response patterns.
DPO consistently hurts on procedural and technical queries (-0.052): "how do I configure X", "how do I programmatically test Y". These query types were underrepresented in the training dataset, so DPO shifted the model's distribution away from the step-by-step prose the verifier expects.
Out-of-domain degradation is consistent (-0.070) across both general prompts tested, indicating domain-specific alignment imposes a real out-of-distribution penalty.
Practical implication: The training dataset's query-type composition matters as much as its domain coverage. Domain alignment for speculative decoding requires query-type diversity in training data — not just topic diversity. A dataset of purely recommendation-style pairs will improve α on recommendations but may degrade it on procedural queries within the same domain.
```
Research_DPO/
├── dpo_clean.jsonl                 # 985 AT-domain preference pairs
├── data_generation.py              # Preference dataset generation
├── clean_dataset.py                # Dataset deduplication / cleaning
├── preprocess_and_load_dataset.py  # Tokenisation + dataset loading
├── train_dpo.py                    # QLoRA + DPO fine-tuning
├── speculative_decode.py           # Speculative decoding engine + A/B runner
├── test_trained_model.py           # Qualitative model output checks
├── check_vocab_compatibility.py    # Verifies Phi-3 Mini/Medium share identical vocabulary
├── check_dataset.py                # Validates dataset stats, tool coverage, and deduplication
├── final_adapter/                  # LoRA adapter config and tokenizer files
│   ├── adapter_config.json         # adapter_model.safetensors excluded via .gitignore
│   └── tokenizer files
└── README.md
```
```
pip install transformers trl peft bitsandbytes accelerate datasets torch
```

Dataset generation uses a rotation-based prompting strategy with ChatGPT.
Use the provided dpo_clean.jsonl directly, or regenerate via data_generation.py.
Requires 24GB+ VRAM (tested on RTX 3090 via Vast.ai, ~$0.40 total):
```
python train_dpo.py

# Dry run (verify setup before full training):
python train_dpo.py --dry-run
```

Adapter saved to `./final_adapter/`.
Also requires 24GB+ VRAM for both models simultaneously in 4-bit:
```
# Base draft model
python speculative_decode.py --mode base \
    --prompt "What screen reader works best for blind users?"

# DPO-tuned draft model
python speculative_decode.py --mode dpo \
    --prompt "What screen reader works best for blind users?"

# Direct A/B comparison
python speculative_decode.py --mode compare \
    --prompt "What screen reader works best for blind users?"
```

The `--adapter-path` flag defaults to `./final_adapter`. Override it if your adapter is saved elsewhere.
| Step | Cost |
|---|---|
| Dataset generation (985 pairs, ChatGPT free tier) | $0 |
| DPO fine-tuning (Vast.ai RTX 3090, ~1.5hrs) | ~$0.40 |
| Experiments (Vast.ai RTX 3090, ~2hrs) | ~$0.60 |
| Total | ~$1.00 |
Small evaluation set: 10 in-domain and 2 out-of-domain prompts. Statistical significance cannot be claimed. A robust study requires 100+ prompts per condition with variance analysis.
Single unaligned verifier: Phi-3 Medium was used as-is without any preference tuning. A DPO-aligned verifier may produce different acceptance patterns — the formatting mismatch between a preference-tuned draft and an instruction-tuned-only verifier may partially explain the procedural query degradation.
Synthetic preference data: Training data was generated with ChatGPT rather than by human AT experts. The chosen/rejected distinction reflects ChatGPT's quality judgments, not real user preferences.
Single domain: Only assistive technology was tested. The query-type dependency finding may generalize but this is unverified.
- Test with a DPO-aligned verifier to isolate the formatting mismatch confound
- Expand evaluation to 100+ prompts with confidence intervals
- Train with query-type balanced dataset (equal recommendation/procedural pairs) and measure whether procedural query degradation is recovered
- Extend to medical, legal, and code domains to test generalizability of the query-type dependency finding
- Investigate mixed-domain DPO training to mitigate out-of-domain degradation
- Leviathan et al. (2023). Fast Inference from Transformers via Speculative Decoding. ICML 2023.
- Chen et al. (2023). Accelerating Large Language Model Decoding with Speculative Sampling. DeepMind.
- Rafailov et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS 2023.
- Microsoft (2024). Phi-3 Technical Report.
- Sun et al. (2025). Training Domain Draft Models for Speculative Decoding: Best Practices and Insights. arXiv.