protocorn/Research_DPO

Domain-Aligned Speculative Decoding for Assistive Technology

Does fine-tuning a draft model on domain-specific preference data improve speculative decoding acceptance rate for that domain?

This project investigates the intersection of two modern LLM techniques — Direct Preference Optimization (DPO) and Speculative Decoding — in the context of assistive technology (AT) queries. We fine-tune a small draft model on AT-domain preference data and measure whether alignment improves token acceptance rates during speculative decoding against a larger, frozen verifier.


Background

Speculative Decoding accelerates LLM inference by using a small draft model to speculatively generate K tokens, which a larger verifier model then accepts or rejects in a single parallel forward pass. The speedup depends primarily on the token acceptance rate α, the fraction of draft tokens the verifier accepts (along with K and the draft/verifier cost ratio). Formally:

α = 1 - TV(p_verifier, q_draft)

Where TV is total variation distance between the two models' output distributions. The lower the divergence, the higher α, and the faster inference becomes.
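The relationship between α and total variation distance can be sketched directly (a minimal illustration over toy probability vectors, not the project's evaluation code):

```python
import torch

def acceptance_rate(p_verifier: torch.Tensor, q_draft: torch.Tensor) -> torch.Tensor:
    """Expected per-token acceptance alpha = 1 - TV(p, q).

    Both inputs are probability vectors over the shared vocabulary.
    Equivalently, alpha = sum_t min(p(t), q(t)).
    """
    tv = 0.5 * (p_verifier - q_draft).abs().sum(dim=-1)
    return 1.0 - tv

p = torch.tensor([0.7, 0.2, 0.1])  # verifier distribution
q = torch.tensor([0.5, 0.3, 0.2])  # draft distribution
print(acceptance_rate(p, q))  # tensor(0.8000)
```

Identical distributions give α = 1 (every draft token accepted); disjoint distributions give α = 0.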

The natural question: if we fine-tune the draft model to be more similar to the verifier on a specific domain, does α improve for that domain?

DPO (Rafailov et al., 2023) allows us to align a model toward preferred outputs without a separate reward model, using only preference pairs (prompt, chosen, rejected). It optimizes:

L_DPO = -E log σ( β·log[π_θ(y_w|x)/π_ref(y_w|x)] - β·log[π_θ(y_l|x)/π_ref(y_l|x)] )
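Given sequence log-probabilities under the policy and the frozen reference, the per-pair loss can be sketched as (a minimal illustration of the objective above, not the training code):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss from per-sequence log-probabilities of (chosen, rejected) pairs."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps      # log pi/pi_ref for y_w
    rejected_logratio = policy_rejected_logps - ref_rejected_logps  # log pi/pi_ref for y_l
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```

At initialization (policy equals reference) the loss is log 2 ≈ 0.693; it falls as the policy's margin on chosen over rejected grows.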

This project applies DPO to the draft model using AT-domain preference data, then measures whether the aligned draft model achieves higher α during speculative decoding against an unaligned verifier.


Hypothesis

A draft model fine-tuned via DPO on AT-domain preference pairs will exhibit lower token distribution divergence from the verifier model on AT-domain queries, resulting in measurably higher acceptance rate α and reduced inference latency for that domain.

Null hypothesis: DPO alignment does not significantly change acceptance rate for in-domain queries compared to the base draft model.


Models

| Role | Model | Parameters | Quantization |
|---|---|---|---|
| Draft (base) | Phi-3 Mini Instruct | 3.8B | 4-bit NF4 |
| Draft (aligned) | Phi-3 Mini + DPO LoRA | 3.8B + ~14M trainable | 4-bit NF4 |
| Verifier | Phi-3 Medium Instruct | 14B | 4-bit NF4 |

Why Phi-3 Mini + Phi-3 Medium? Speculative decoding requires both models to share an identical vocabulary. Phi-3 Mini and Phi-3 Medium both use the same 32,011-token tokenizer, making them a valid model pair. Mistral-7B (32,000 tokens) was initially considered but verified to have 991 vocabulary mismatches with Phi-3 Mini — rendering it incompatible for speculative decoding.
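A compatibility check along these lines can be sketched as a pure function over the two tokenizers' vocab dicts (a hypothetical helper in the spirit of check_vocab_compatibility.py, not its exact logic; the dicts would come from `AutoTokenizer.from_pretrained(...).get_vocab()`):

```python
def vocab_mismatches(vocab_a: dict, vocab_b: dict) -> int:
    """Count tokens present in only one vocabulary, plus shared tokens mapped
    to different ids. Any mismatch breaks the draft/verifier pairing, since
    speculative decoding compares token probabilities id-by-id."""
    only_one_side = set(vocab_a) ^ set(vocab_b)
    shared = set(vocab_a) & set(vocab_b)
    id_conflicts = sum(vocab_a[t] != vocab_b[t] for t in shared)
    return len(only_one_side) + id_conflicts
```

A valid pair must return 0; any positive count rules the pair out.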


Methodology

1. Dataset Generation

Generated 985 preference pairs covering the assistive technology domain using a structured rotation across disability types, user types, and task types to maximize diversity.

Disability types:  low vision, blindness, motor impairment, cognitive
                   disability, hearing impairment, dyslexia, ADHD, ALS,
                   color blindness, tremor, multiple disabilities, deafblindness

User types:        end user, developer, IT admin, teacher, occupational
                   therapist, caregiver, procurement officer

Task types:        tool recommendation, configuration, troubleshooting,
                   standards compliance, enterprise deployment, comparison

Each preference pair:

{
  "prompt": "What screen reader works best for a blind developer using VS Code?",
  "chosen": "NVDA with the VSCode accessibility plugin provides the best experience...",
  "rejected": "There are many screen readers available that could work for developers..."
}

Chosen responses name specific AT tools, cite relevant standards (WCAG, EN 301 549, Section 508), and give actionable guidance. Rejected responses are plausible but vague, generic, and non-actionable.

Dataset statistics:

Total pairs:              985
Avg chosen length:        555 chars
Avg rejected length:      281 chars
Length ratio (c/r):       1.97x
Near-duplicate prompts:   0 (verified with 12-word key deduplication)
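The 12-word-key deduplication mentioned above can be sketched as follows (a hypothetical helper illustrating the idea, not necessarily the exact logic in clean_dataset.py):

```python
def near_duplicate_prompts(prompts, key_words=12):
    """Flag prompts whose first `key_words` lowercase words collide with an
    earlier prompt's key -- a cheap proxy for near-duplicate detection."""
    seen, dupes = set(), []
    for p in prompts:
        key = " ".join(p.lower().split()[:key_words])
        if key in seen:
            dupes.append(p)
        seen.add(key)
    return dupes
```

Run over all 985 prompts, this check returned zero collisions.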

2. DPO Fine-tuning

Fine-tuned Phi-3 Mini using QLoRA + DPO on the generated dataset.

Base model:          microsoft/Phi-3-mini-4k-instruct
Method:              QLoRA (4-bit NF4) + DPO
LoRA rank:           16
LoRA alpha:          32
Target modules:      q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
DPO β:               0.1
Epochs:              2
Effective batch:     16 (batch=2, grad_accum=8)
Learning rate:       5e-5 (cosine schedule)
Max sequence length: 512
Hardware:            RTX 3090 24GB (Vast.ai)
Training time:       ~1.5 hours
Trainable params:    ~14M / 3.8B (0.37%)
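Under these settings, the setup can be sketched with TRL and PEFT (a configuration fragment only: argument names vary across library versions, and `model` / `dataset` are placeholders for the 4-bit base model and the loaded preference pairs):

```python
from peft import LoraConfig
from trl import DPOConfig, DPOTrainer

# Hyperparameters from the table above; treat exact argument names as illustrative.
peft_config = LoraConfig(
    r=16, lora_alpha=32, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
args = DPOConfig(
    beta=0.1, num_train_epochs=2,
    per_device_train_batch_size=2, gradient_accumulation_steps=8,
    learning_rate=5e-5, lr_scheduler_type="cosine",
    max_length=512, output_dir="./final_adapter",
)
trainer = DPOTrainer(model=model, args=args,
                     train_dataset=dataset, peft_config=peft_config)
trainer.train()
```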

Qualitative comparison on held-out AT prompt:

| Model | Response |
|---|---|
| Base | "Assistive technology tools can be helpful... DynaVox Communication Devices..." |
| DPO-tuned | "Eye-tracking Technology: Devices like Tobii Dynavox I-Series allow users to type by selecting letters and words on a screen with their eyes..." |

The DPO-tuned model produces structured, tool-specific responses consistent with chosen training examples.

3. Speculative Decoding Engine

Implemented speculative decoding from scratch in PyTorch, following the algorithm of Leviathan et al. (2023).

Core loop:

1. Draft phase:   Small model autoregressively generates K tokens, storing q(t_i)
2. Verify phase:  Large model runs ONE forward pass over all K tokens, producing p(t_i)
3. Accept/reject: Accept token i with probability min(1, p(t_i)/q(t_i))
4. On rejection:  Sample correction from residual(t) = normalize(max(0, p(t) - q(t)))
5. Free token:    If all K accepted, sample one more from verifier's position K+1

The residual sampling step guarantees that the output distribution is mathematically identical to sampling from the verifier alone.
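Steps 3-4 of the loop can be sketched for a single position (toy distributions over a shared vocabulary, not the full engine):

```python
import torch

def accept_or_correct(p: torch.Tensor, q: torch.Tensor, token: int):
    """One accept/reject step: accept the drafted `token` with probability
    min(1, p[token]/q[token]); otherwise resample from the normalized
    residual max(0, p - q). Assumes q[token] > 0 (the draft sampled it)."""
    if torch.rand(()) < torch.clamp(p[token] / q[token], max=1.0):
        return token, True
    residual = torch.clamp(p - q, min=0.0)
    residual /= residual.sum()
    return torch.multinomial(residual, 1).item(), False
```

When p == q the drafted token is always accepted; when the verifier puts no mass on the drafted token, the correction is drawn entirely from the residual.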

K = 4, max_new_tokens = 200, temperature = 0.7

4. Experiment Design

Experiment A:  Base Phi-3 Mini (draft)      + Phi-3 Medium (verifier, frozen)
Experiment B:  DPO-tuned Phi-3 Mini (draft) + Phi-3 Medium (verifier, frozen)

Evaluated on:
  10 AT-domain prompts  (in-domain)
   2 general prompts    (out-of-domain)

Primary metric:   Token acceptance rate α
Secondary metric: Latency (ms), tokens/second

Results

In-Domain Results (AT Queries)

| Prompt | Query Type | Base α | DPO α | Δα |
|---|---|---|---|---|
| Screen reader for blind users | Recommendation | 0.777 | 0.750 | -0.027 |
| Keyboard alternatives (motor) | Recommendation | 0.754 | 0.665 | -0.089 |
| Eye tracking for ALS | Recommendation | 0.676 | 0.795 | +0.119 |
| AT tools for dyslexia | Recommendation | 0.754 | 0.824 | +0.070 |
| Captioning for deaf employees | Recommendation | 0.577 | 0.623 | +0.046 |
| NVDA configuration | Procedural | 0.704 | 0.668 | -0.036 |
| Low vision + motor (combined) | Recommendation | 0.696 | 0.768 | +0.072 |
| React screen reader testing | Procedural/Technical | 0.746 | 0.689 | -0.057 |
| WCAG 2.1 vs EN 301 549 | Definitional | 0.720 | 0.810 | +0.090 |
| IT admin AT deployment | Procedural | 0.887 | 0.892 | +0.005 |
| **Average** | | 0.729 | 0.748 | +0.019 |

Out-of-Domain Results

| Prompt | Base α | DPO α | Δα |
|---|---|---|---|
| Photosynthesis | 1.000 | 0.934 | -0.066 |
| Binary search algorithm | 0.865 | 0.791 | -0.074 |
| **Average** | 0.933 | 0.863 | -0.070 |

Summary by Condition

| Condition | Base α | DPO α | Δα |
|---|---|---|---|
| AT-domain (all) | 0.729 | 0.748 | +0.019 |
| AT-domain — recommendation/definitional | 0.676 | 0.752 | +0.067 |
| AT-domain — procedural/technical | 0.784 | 0.749 | -0.052 |
| Out-of-domain | 0.933 | 0.863 | -0.070 |

Key Finding

DPO alignment effects are query-type dependent, not simply domain dependent.

The aggregate in-domain improvement (+0.019) conceals two opposing effects:

DPO consistently helps on recommendation and definitional queries (+0.067): these directly match the training distribution — "what tool works for X", "what is the difference between X and Y". The draft model's token distribution shifts closer to the verifier's for these response patterns.

DPO consistently hurts on procedural and technical queries (-0.052): "how do I configure X", "how do I programmatically test Y". These query types were underrepresented in the training dataset, so DPO shifted the model's distribution away from the step-by-step prose the verifier expects.

Out-of-domain degradation is consistent (-0.070) across both general prompts tested, indicating domain-specific alignment imposes a real out-of-distribution penalty.

Practical implication: The training dataset's query-type composition matters as much as its domain coverage. Domain alignment for speculative decoding requires query-type diversity in training data — not just topic diversity. A dataset of purely recommendation-style pairs will improve α on recommendations but may degrade it on procedural queries within the same domain.


Project Structure

Research_DPO/
├── dpo_clean.jsonl                   # 985 AT-domain preference pairs
├── data_generation.py                # Preference dataset generation
├── clean_dataset.py                  # Dataset deduplication / cleaning
├── preprocess_and_load_dataset.py    # Tokenisation + dataset loading
├── train_dpo.py                      # QLoRA + DPO fine-tuning
├── speculative_decode.py             # Speculative decoding engine + A/B runner
├── test_trained_model.py             # Qualitative model output checks
├── check_vocab_compatibility.py      # Verifies Phi-3 Mini/Medium share identical vocabulary
├── check_dataset.py                  # Validates dataset stats, tool coverage, and deduplication
├── final_adapter/                    # LoRA adapter config and tokenizer files
│   ├── adapter_config.json           # adapter_model.safetensors excluded via .gitignore
│   └── tokenizer files
└── README.md

Reproducing This Work

Requirements

pip install transformers trl peft bitsandbytes accelerate datasets torch

Step 1 — Generate Dataset

Dataset generation uses a rotation-based prompting strategy with ChatGPT. Use the provided dpo_clean.jsonl directly, or regenerate via data_generation.py.

Step 2 — Fine-tune with DPO

Requires 24GB+ VRAM (tested on RTX 3090 via Vast.ai, ~$0.40 total):

python train_dpo.py

# Dry run (verify setup before full training):
python train_dpo.py --dry-run

Adapter saved to ./final_adapter/.

Step 3 — Run Experiments

Also requires 24GB+ VRAM for both models simultaneously in 4-bit:

# Base draft model
python speculative_decode.py --mode base \
  --prompt "What screen reader works best for blind users?"

# DPO-tuned draft model
python speculative_decode.py --mode dpo \
  --prompt "What screen reader works best for blind users?"

# Direct A/B comparison
python speculative_decode.py --mode compare \
  --prompt "What screen reader works best for blind users?"

The --adapter-path flag defaults to ./final_adapter. Override if your adapter is saved elsewhere.

Total Reproduction Cost

| Step | Cost |
|---|---|
| Dataset generation (985 pairs, ChatGPT free tier) | $0 |
| DPO fine-tuning (Vast.ai RTX 3090, ~1.5 hrs) | ~$0.40 |
| Experiments (Vast.ai RTX 3090, ~2 hrs) | ~$0.60 |
| **Total** | ~$1.00 |

Limitations

Small evaluation set: 10 in-domain and 2 out-of-domain prompts. Statistical significance cannot be claimed. A robust study requires 100+ prompts per condition with variance analysis.

Single unaligned verifier: Phi-3 Medium was used as-is without any preference tuning. A DPO-aligned verifier may produce different acceptance patterns — the formatting mismatch between a preference-tuned draft and an instruction-tuned-only verifier may partially explain the procedural query degradation.

Synthetic preference data: Training data was generated using ChatGPT rather than human AT expert judgments. The chosen/rejected distinction reflects ChatGPT's quality assessments, not real user preferences.

Single domain: Only assistive technology was tested. The query-type dependency finding may generalize but this is unverified.


Future Work

  • Test with a DPO-aligned verifier to isolate the formatting mismatch confound
  • Expand evaluation to 100+ prompts with confidence intervals
  • Train with query-type balanced dataset (equal recommendation/procedural pairs) and measure whether procedural query degradation is recovered
  • Extend to medical, legal, and code domains to test generalizability of the query-type dependency finding
  • Investigate mixed-domain DPO training to mitigate out-of-domain degradation

References

  • Leviathan et al. (2023). Fast Inference from Transformers via Speculative Decoding. ICML 2023.
  • Chen et al. (2023). Accelerating Large Language Model Decoding with Speculative Sampling. DeepMind.
  • Rafailov et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS 2023.
  • Microsoft (2024). Phi-3 Technical Report.
  • Sun et al. (2025). Training Domain Draft Models for Speculative Decoding: Best Practices and Insights. arXiv.
