Does fine-tuning a draft model on domain-specific preference data improve speculative decoding acceptance rate for that domain?
This project investigates the intersection of two modern LLM techniques — Direct Preference Optimization (DPO) and Speculative Decoding — in the context of assistive technology (AT) queries. We fine-tune a small draft model on AT-domain preference data and measure whether alignment improves token acceptance rates during speculative decoding against a larger, frozen verifier.
- Background
- Hypothesis
- Models
- Methodology
- Results
- Key Finding
- Project Structure
- Reproducing This Work
- Limitations
- Future Work
- References
Speculative decoding accelerates LLM inference by using a small draft model to speculatively generate K tokens, which a larger verifier model then accepts or rejects in a single parallel forward pass. The speedup depends primarily on the token acceptance rate α, the fraction of draft tokens the verifier accepts. Formally:
α = 1 - TV(p_verifier, q_draft)

where TV is the total variation distance between the two models' next-token distributions at a given position. The lower the divergence, the higher α, and the faster inference becomes.
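As a toy illustration of this relationship (not project code), α for a single position can be computed directly from two next-token distributions over a small vocabulary:

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Toy 4-token vocabulary: the closer the draft's distribution is to the
# verifier's, the higher the expected acceptance rate alpha = 1 - TV(p, q).
p_verifier = [0.70, 0.20, 0.05, 0.05]
q_draft    = [0.60, 0.25, 0.10, 0.05]

alpha = 1.0 - total_variation(p_verifier, q_draft)  # 0.9 for these toys
```

Here TV = 0.5 × (0.10 + 0.05 + 0.05 + 0) = 0.10, so α = 0.9: the verifier would accept roughly 9 of every 10 draft tokens at positions with this divergence.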
The natural question: if we fine-tune the draft model to be more similar to the verifier on a specific domain, does α improve for that domain?
DPO (Rafailov et al., 2023) allows us to align a model toward preferred
outputs without a separate reward model, using only preference pairs
(prompt, chosen, rejected). It optimizes:
L_DPO = -E log σ( β·log[π_θ(y_w|x)/π_ref(y_w|x)] - β·log[π_θ(y_l|x)/π_ref(y_l|x)] )
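For intuition, the loss can be evaluated by hand for a single preference pair. A minimal pure-Python sketch (the log-probability values are illustrative, not from the project):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    logp_* are summed token log-probabilities of each response under the
    policy pi_theta; ref_logp_* are the same under the frozen reference
    pi_ref. Implements -log(sigmoid(beta * margin)).
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Positive margin (policy prefers chosen more than the reference does)
# pushes the loss below log(2); negative margin pushes it above.
low  = dpo_loss(-10.0, -20.0, -12.0, -18.0)   # margin = 0.1 * 4
high = dpo_loss(-20.0, -10.0, -18.0, -12.0)   # margin = -0.4
```

At margin 0 the loss is exactly log 2 ≈ 0.693, so the two calls above bracket that value from below and above.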
This project applies DPO to the draft model using AT-domain preference data, then measures whether the aligned draft model achieves higher α during speculative decoding against an unaligned verifier.
A draft model fine-tuned via DPO on AT-domain preference pairs will exhibit lower token distribution divergence from the verifier model on AT-domain queries, resulting in measurably higher acceptance rate α and reduced inference latency for that domain.
Null hypothesis: DPO alignment does not significantly change acceptance rate for in-domain queries compared to the base draft model.
| Role | Model | Parameters | Quantization |
|---|---|---|---|
| Draft (base) | Phi-3 Mini Instruct | 3.8B | 4-bit NF4 |
| Draft (aligned) | Phi-3 Mini + DPO LoRA | 3.8B + ~14M trainable | 4-bit NF4 |
| Verifier | Phi-3 Medium Instruct | 14B | 4-bit NF4 |
Why Phi-3 Mini + Phi-3 Medium? Speculative decoding requires the draft and verifier to share an identical vocabulary. Phi-3 Mini and Phi-3 Medium use the same 32,011-token tokenizer, making them a valid model pair. Mistral-7B (32,000 tokens) was considered initially, but a vocabulary check found 991 token mismatches against Phi-3 Mini, making that pairing unusable for speculative decoding.
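A minimal sketch of the kind of check `check_vocab_compatibility.py` performs, using toy vocabularies (the function name and data here are hypothetical; the real script would compare `tokenizer.get_vocab()` for both checkpoints):

```python
def count_vocab_mismatches(vocab_a, vocab_b):
    """Count tokens missing from, or mapped to different ids in, the
    other vocabulary. 0 means a valid draft/verifier pair."""
    all_tokens = set(vocab_a) | set(vocab_b)
    return sum(1 for t in all_tokens if vocab_a.get(t) != vocab_b.get(t))

# Toy vocabularies (token -> id) standing in for real tokenizers.
mini    = {"<s>": 0, "the": 1, "cat": 2}
medium  = {"<s>": 0, "the": 1, "cat": 2}
mistral = {"<s>": 0, "the": 1, "dog": 2}

assert count_vocab_mismatches(mini, medium) == 0   # compatible pair
```

Any nonzero count means draft token ids would be scored against the wrong verifier logits, so the pair must be rejected outright.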
Generated 985 preference pairs covering the assistive technology domain using a structured rotation across disability types, user types, and task types to maximize diversity.
```
Disability types: low vision, blindness, motor impairment, cognitive
                  disability, hearing impairment, dyslexia, ADHD, ALS,
                  color blindness, tremor, multiple disabilities, deafblindness

User types:       end user, developer, IT admin, teacher, occupational
                  therapist, caregiver, procurement officer

Task types:       tool recommendation, configuration, troubleshooting,
                  standards compliance, enterprise deployment, comparison
```
Each preference pair:
```json
{
  "prompt": "What screen reader works best for a blind developer using VS Code?",
  "chosen": "NVDA with the VSCode accessibility plugin provides the best experience...",
  "rejected": "There are many screen readers available that could work for developers..."
}
```

Chosen responses name specific AT tools, cite relevant standards (WCAG, EN 301 549, Section 508), and give actionable guidance. Rejected responses are plausible but vague, generic, and non-actionable.
Dataset statistics:

```
Total pairs:            985
Avg chosen length:      555 chars
Avg rejected length:    281 chars
Length ratio (c/r):     1.97x
Near-duplicate prompts: 0 (verified with 12-word key deduplication)
```
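The 12-word-key deduplication can be sketched as follows (function name and sample data hypothetical; only the keying idea is taken from the stats above):

```python
def dedup_by_prefix_key(pairs, n_words=12):
    """Drop pairs whose prompt shares its first n_words (lowercased)
    with an earlier prompt -- a cheap near-duplicate filter."""
    seen, kept = set(), []
    for pair in pairs:
        key = " ".join(pair["prompt"].lower().split()[:n_words])
        if key not in seen:
            seen.add(key)
            kept.append(pair)
    return kept

pairs = [
    {"prompt": "What screen reader works best for a blind developer?"},
    {"prompt": "What screen reader works best for a blind developer using Vim?"},
    {"prompt": "Which magnifier suits low-vision spreadsheet users?"},
]
deduped = dedup_by_prefix_key(pairs, n_words=8)  # drops the near-duplicate
```

Keying on a word prefix rather than the full string catches prompts that differ only in a trailing clause, which exact-match deduplication would miss.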
Fine-tuned Phi-3 Mini using QLoRA + DPO on the generated dataset.
```
Base model:          microsoft/Phi-3-mini-4k-instruct
Method:              QLoRA (4-bit NF4) + DPO
LoRA rank:           16
LoRA alpha:          32
Target modules:      q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
DPO β:               0.1
Epochs:              2
Effective batch:     16 (batch=2, grad_accum=8)
Learning rate:       5e-5 (cosine schedule)
Max sequence length: 512
Hardware:            RTX 3090 24GB (Vast.ai)
Training time:       ~1.5 hours
Trainable params:    ~14M / 3.8B (0.37%)
```
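These hyperparameters map directly onto TRL's `DPOConfig` and PEFT's `LoraConfig`. A configuration sketch, not the project's actual `train_dpo.py` (exact argument names vary slightly across TRL versions):

```python
from peft import LoraConfig
from trl import DPOConfig

# LoRA adapter: rank 16, alpha 32, attention + MLP projections.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# DPO training arguments matching the table above.
training_args = DPOConfig(
    output_dir="final_adapter",
    beta=0.1,
    num_train_epochs=2,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # effective batch = 2 * 8 = 16
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    max_length=512,
)

# These would then be passed to trl.DPOTrainer along with the 4-bit
# quantized base model and the preference dataset, and trainer.train()
# would produce the LoRA adapter.
```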
Qualitative comparison on held-out AT prompt:
| Model | Response |
|---|---|
| Base | "Assistive technology tools can be helpful... DynaVox Communication Devices..." |
| DPO-tuned | "Eye-tracking Technology: Devices like Tobii Dynavox I-Series allow users to type by selecting letters and words on a screen with their eyes..." |
The DPO-tuned model produces structured, tool-specific responses consistent with chosen training examples.
Speculative decoding was implemented from scratch in PyTorch, following Leviathan et al. (2023).
Core loop:
1. Draft phase: Small model autoregressively generates K tokens, storing q(t_i)
2. Verify phase: Large model runs ONE forward pass over all K tokens, producing p(t_i)
3. Accept/reject: Accept token i with probability min(1, p(t_i)/q(t_i))
4. On rejection: Sample correction from residual(t) = normalize(max(0, p(t) - q(t)))
5. Free token: If all K accepted, sample one more from verifier's position K+1
The residual sampling step guarantees that the output distribution is mathematically identical to sampling from the verifier alone.
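Steps 3 and 4 can be sketched in a few lines of pure Python over toy distributions (an illustration, not the project's PyTorch implementation; step 5's bonus token is omitted):

```python
import random

def residual(p, q):
    """Correction distribution after a rejection: normalize(max(0, p - q))."""
    r = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    z = sum(r)
    return [ri / z for ri in r]

def verify_draft(p_dists, q_dists, draft_tokens, rng=random.random):
    """Accept/reject K draft tokens against verifier distributions.

    p_dists[i], q_dists[i] are the verifier/draft next-token distributions
    at position i; draft_tokens[i] is the token the draft sampled there.
    Returns the accepted prefix plus, on the first rejection, one token
    resampled from the residual distribution.
    """
    accepted = []
    for p, q, t in zip(p_dists, q_dists, draft_tokens):
        if rng() < min(1.0, p[t] / q[t]):        # accept w.p. min(1, p/q)
            accepted.append(t)
        else:                                    # reject: resample correction
            r = residual(p, q)
            u, acc = rng(), 0.0
            for tok, prob in enumerate(r):
                acc += prob
                if u < acc:
                    accepted.append(tok)
                    break
            break
    return accepted
```

Clipping the residual at zero and renormalizing is exactly what makes the combined accept/resample procedure reproduce the verifier's distribution: mass the draft over-allocates is rejected in proportion, and the shortfall is redistributed where the verifier has more mass.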
K = 4, max_new_tokens = 200, temperature = 0.7
Experiment A: Base Phi-3 Mini (draft) + Phi-3 Medium (verifier, frozen)
Experiment B: DPO-tuned Phi-3 Mini (draft) + Phi-3 Medium (verifier, frozen)
Evaluated on:
10 AT-domain prompts (in-domain)
2 general prompts (out-of-domain)
Primary metric: Token acceptance rate α
Secondary metrics: latency (ms), tokens/second
| Prompt | Query Type | Base α | DPO α | Δα |
|---|---|---|---|---|
| Screen reader for blind users | Recommendation | 0.777 | 0.750 | -0.027 |
| Keyboard alternatives (motor) | Recommendation | 0.754 | 0.665 | -0.089 |
| Eye tracking for ALS | Recommendation | 0.676 | 0.795 | +0.119 |
| AT tools for dyslexia | Recommendation | 0.754 | 0.824 | +0.070 |
| Captioning for deaf employees | Recommendation | 0.577 | 0.623 | +0.046 |
| NVDA configuration | Procedural | 0.704 | 0.668 | -0.036 |
| Low vision + motor (combined) | Recommendation | 0.696 | 0.768 | +0.072 |
| React screen reader testing | Procedural/Technical | 0.746 | 0.689 | -0.057 |
| WCAG 2.1 vs EN 301 549 | Definitional | 0.720 | 0.810 | +0.090 |
| IT admin AT deployment | Procedural | 0.887 | 0.892 | +0.005 |
| Average | | 0.729 | 0.748 | +0.019 |
| Prompt | Base α | DPO α | Δα |
|---|---|---|---|
| Photosynthesis | 1.000 | 0.934 | -0.066 |
| Binary search algorithm | 0.865 | 0.791 | -0.074 |
| Average | 0.933 | 0.863 | -0.070 |
| Condition | Base α | DPO α | Δα |
|---|---|---|---|
| AT-domain (all) | 0.729 | 0.748 | +0.019 |
| AT-domain — recommendation/definitional | 0.676 | 0.752 | +0.067 |
| AT-domain — procedural/technical | 0.784 | 0.749 | -0.052 |
| Out-of-domain | 0.933 | 0.863 | -0.070 |
DPO alignment effects are query-type dependent, not simply domain dependent.
The aggregate in-domain improvement (+0.019) conceals two opposing effects:
DPO consistently helps on recommendation and definitional queries (+0.067): these directly match the training distribution — "what tool works for X", "what is the difference between X and Y". The draft model's token distribution shifts closer to the verifier's for these response patterns.
DPO consistently hurts on procedural and technical queries (-0.052): "how do I configure X", "how do I programmatically test Y". These query types were underrepresented in the training dataset, so DPO shifted the model's distribution away from the step-by-step prose the verifier expects.
Out-of-domain degradation is consistent (-0.070) across both general prompts tested, indicating domain-specific alignment imposes a real out-of-distribution penalty.
Practical implication: The training dataset's query-type composition matters as much as its domain coverage. Domain alignment for speculative decoding requires query-type diversity in training data — not just topic diversity. A dataset of purely recommendation-style pairs will improve α on recommendations but may degrade it on procedural queries within the same domain.
```
Research_DPO/
├── dpo_clean.jsonl                 # 985 AT-domain preference pairs
├── data_generation.py              # Preference dataset generation
├── clean_dataset.py                # Dataset deduplication / cleaning
├── preprocess_and_load_dataset.py  # Tokenisation + dataset loading
├── train_dpo.py                    # QLoRA + DPO fine-tuning
├── speculative_decode.py           # Speculative decoding engine + A/B runner
├── test_trained_model.py           # Qualitative model output checks
├── check_vocab_compatibility.py    # Verifies Phi-3 Mini/Medium share identical vocabulary
├── check_dataset.py                # Validates dataset stats, tool coverage, and deduplication
├── final_adapter/                  # LoRA adapter config and tokenizer files
│   ├── adapter_config.json         # adapter_model.safetensors excluded via .gitignore
│   └── tokenizer files
└── README.md
```
```
pip install transformers trl peft bitsandbytes accelerate datasets torch
```

Dataset generation uses a rotation-based prompting strategy with ChatGPT.
Use the provided dpo_clean.jsonl directly, or regenerate via data_generation.py.
Requires 24GB+ VRAM (tested on RTX 3090 via Vast.ai, ~$0.40 total):
```
python train_dpo.py

# Dry run (verify setup before full training):
python train_dpo.py --dry-run
```

Adapter saved to `./final_adapter/`.
Also requires 24GB+ VRAM for both models simultaneously in 4-bit:
```
# Base draft model
python speculative_decode.py --mode base \
    --prompt "What screen reader works best for blind users?"

# DPO-tuned draft model
python speculative_decode.py --mode dpo \
    --prompt "What screen reader works best for blind users?"

# Direct A/B comparison
python speculative_decode.py --mode compare \
    --prompt "What screen reader works best for blind users?"
```

The `--adapter-path` flag defaults to `./final_adapter`. Override it if your adapter is saved elsewhere.
| Step | Cost |
|---|---|
| Dataset generation (985 pairs, ChatGPT free tier) | $0 |
| DPO fine-tuning (Vast.ai RTX 3090, ~1.5hrs) | ~$0.40 |
| Experiments (Vast.ai RTX 3090, ~2hrs) | ~$0.60 |
| Total | ~$1.00 |
Small evaluation set: 10 in-domain and 2 out-of-domain prompts. Statistical significance cannot be claimed. A robust study requires 100+ prompts per condition with variance analysis.
Single unaligned verifier: Phi-3 Medium was used as-is without any preference tuning. A DPO-aligned verifier may produce different acceptance patterns — the formatting mismatch between a preference-tuned draft and an instruction-tuned-only verifier may partially explain the procedural query degradation.
Synthetic preference data: Training data was generated with ChatGPT rather than by human AT experts. The chosen/rejected distinction reflects ChatGPT's quality judgments, not real user preferences.
Single domain: Only assistive technology was tested. The query-type dependency finding may generalize but this is unverified.
- Test with a DPO-aligned verifier to isolate the formatting mismatch confound
- Expand evaluation to 100+ prompts with confidence intervals
- Train with query-type balanced dataset (equal recommendation/procedural pairs) and measure whether procedural query degradation is recovered
- Extend to medical, legal, and code domains to test generalizability of the query-type dependency finding
- Investigate mixed-domain DPO training to mitigate out-of-domain degradation
- Leviathan et al. (2023). Fast Inference from Transformers via Speculative Decoding. ICML 2023.
- Chen et al. (2023). Accelerating Large Language Model Decoding with Speculative Sampling. DeepMind.
- Rafailov et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS 2023.
- Microsoft (2024). Phi-3 Technical Report.
- Sun et al. (2025). Training Domain Draft Models for Speculative Decoding: Best Practices and Insights. arXiv.