A novel approach to language model text generation that processes multiple token candidates simultaneously by modifying Rotary Position Embeddings (RoPE).
TEMPO introduces a new generation paradigm that differs fundamentally from beam search:
- Parallel Token Processing: Multiple tokens at the same logical position are processed within a single forward pass
- RoPE Modification: Custom positional embeddings enable tokens to share positions while maintaining distinct identities
- Attention-Based Pruning: Uses attention patterns from future tokens to retroactively prune less coherent paths
Instead of sampling one token per position, TEMPO selects all tokens above a probability threshold:
```python
# Traditional: sample one token
next_token = sample(logits)

# TEMPO: select multiple tokens above threshold
parallel_tokens = [t for t, p in enumerate(probs) if p > selection_threshold]
```
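For concreteness, here is a minimal runnable sketch of the selection step, assuming PyTorch; the helper name and toy logits are illustrative, not the repository's API:

```python
# Hedged sketch: threshold-based parallel selection over a toy distribution.
import torch

def select_parallel_tokens(logits: torch.Tensor, selection_threshold: float) -> list[int]:
    """Return every token id whose probability exceeds the threshold."""
    probs = torch.softmax(logits, dim=-1)
    return torch.nonzero(probs > selection_threshold).flatten().tolist()

logits = torch.tensor([2.0, 1.9, 0.1, -1.0])  # toy 4-token vocabulary
print(select_parallel_tokens(logits, selection_threshold=0.1))  # -> [0, 1]
```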
The core innovation modifies RoPE to assign the same positional encoding to parallel tokens:

```python
# Map multiple physical positions to the same logical position
logical_position = position_map[physical_position]
# Apply RoPE with the logical position instead of the physical one
```
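To see why sharing a logical position works, here is a self-contained sketch using standard RoPE angle computation in PyTorch; the `position_map` and head dimension are made up for illustration (the repository's versions live in position_mapper.py and embedding_modifier.py):

```python
# Hedged sketch: two physical slots share one logical position, so they
# receive identical rotary embeddings (position_map here is made up).
import torch

def rope_angles(positions: torch.Tensor, head_dim: int = 8, base: float = 10000.0):
    """Standard RoPE angle computation, driven by *logical* positions."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = positions.float()[:, None] * inv_freq[None, :]
    return torch.cos(angles), torch.sin(angles)

# Physical slots 2 and 3 hold parallel tokens for logical position 2.
position_map = {0: 0, 1: 1, 2: 2, 3: 2}
logical = torch.tensor([position_map[p] for p in range(4)])

cos, sin = rope_angles(logical)
assert torch.equal(cos[2], cos[3]) and torch.equal(sin[2], sin[3])
```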
Retroactive pruning analyzes how future tokens attend to past parallel options (see the sketch after this list):
- Tokens receiving low attention from future tokens are pruned
- Maintains coherence while exploring multiple paths
- Dynamic threshold adjustment using Bezier curves
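A hedged sketch of the pruning rule in PyTorch: averaging attention from future tokens is one plausible aggregation (not necessarily the repository's), and the threshold corresponds to the `--attention-threshold` flag shown later:

```python
# Hedged sketch: drop candidates that receive little attention from later tokens.
import torch

def prune_by_attention(attn_to_candidates: torch.Tensor,
                       attention_threshold: float) -> list[int]:
    """attn_to_candidates: (num_future_tokens, num_candidates) attention weights.
    Keep candidates whose mean attention from future tokens clears the threshold."""
    scores = attn_to_candidates.mean(dim=0)
    return torch.nonzero(scores >= attention_threshold).flatten().tolist()

# Toy example: 3 future tokens attending to 2 parallel candidates.
attn = torch.tensor([[0.30, 0.002],
                     [0.25, 0.004],
                     [0.40, 0.001]])
print(prune_by_attention(attn, attention_threshold=0.01))  # keeps candidate 0
```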
RoPE Modifications (src/algorithms/rope/)
- position_mapper.py: Maps physical to logical positions
- embedding_modifier.py: Core RoPE modification functions
- model_patcher.py: Runtime model patching utilities
Attention Analysis (src/algorithms/attention/)
- mask_builder.py: Constructs masks for parallel token isolation
- pattern_analyzer.py: Analyzes attention for pruning decisions
- weight_extractor.py: Extracts attention weights from models
Generation Pipeline (src/algorithms/generation/)
- logits_processor.py: Processes logits and applies thresholds
- kv_cache_manager.py: Manages KV caches efficiently
- parallel_processor.py: Handles parallel token batching
Pruning Algorithms (src/algorithms/pruning/)
- attention_pruner.py: Prunes based on attention patterns
- threshold_manager.py: Dynamic threshold adjustment
- multi_scale_pruner.py: Multi-scale attention analysis
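As an illustration of what "parallel token isolation" can mean in the mask builder: alternatives that share a logical position attend to the common prefix but not to each other. The helper below is a hypothetical reading, not the `mask_builder.py` API:

```python
# Hypothetical sketch: parallel tokens at the same logical position attend
# to earlier positions and themselves, but not to one another.
import torch

def build_isolation_mask(logical_positions: torch.Tensor) -> torch.Tensor:
    """Boolean (seq, seq) mask: True where attention is allowed."""
    pos_q = logical_positions[:, None]
    pos_k = logical_positions[None, :]
    causal = pos_k < pos_q  # attend only to strictly earlier logical positions
    self_attn = torch.eye(len(logical_positions), dtype=torch.bool)
    return causal | self_attn

# Physical slots 2 and 3 are parallel alternatives at logical position 2.
mask = build_isolation_mask(torch.tensor([0, 1, 2, 2]))
print(mask.int())  # rows 2 and 3 see slots 0-1 and themselves, not each other
```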
- Monte Carlo Tree Search Integration: Explores generation paths systematically
- Dynamic Thresholding: Bezier curve-based threshold adjustment (see the sketch after this list)
- Multi-Scale Attention: Aggregates attention patterns across layers
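A minimal sketch of a Bezier-based threshold schedule; the cubic form and control-point values below are assumptions for illustration, not the values used in threshold_manager.py:

```python
# Hedged sketch: evaluate a cubic Bezier over generation progress to get
# a smoothly varying threshold (control points here are made up).
def bezier_threshold(step: int, max_steps: int,
                     p0: float = 0.01, p1: float = 0.02,
                     p2: float = 0.08, p3: float = 0.10) -> float:
    """Cubic Bezier evaluated at t = step / max_steps."""
    t = step / max_steps
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

print([round(bezier_threshold(s, 10), 3) for s in range(0, 11, 5)])
# thresholds ramp smoothly from 0.01 toward 0.10 as generation proceeds
```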
```bash
# Clone repository
git clone https://github.com/JoeLuker/tempo.git && cd tempo

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Run generation
python3 run_tempo.py --prompt "Your prompt" --selection-threshold 0.1
```

```bash
# Basic parallel generation
python3 run_tempo.py --prompt "The future of AI is" --selection-threshold 0.1
# With retroactive pruning
python3 run_tempo.py --prompt "Explain quantum computing" \
--selection-threshold 0.1 \
--use-retroactive-pruning \
--attention-threshold 0.01
# With MCTS exploration
python3 run_tempo.py --prompt "Write a story" \
--selection-threshold 0.15 \
--use-mcts \
    --mcts-simulations 100
```

- Novel Generation Paradigm: First approach to modify positional embeddings for parallel token processing
- Attention-Based Coherence: Uses model's own attention as pruning signal
- Efficient Implementation: Maintains single model state for multiple paths
- Diversity: 2-3x more diverse outputs compared to beam search
- Coherence: Attention-based pruning maintains quality
- Efficiency: Batch processing minimizes overhead
If you use TEMPO in your research, please cite:
```bibtex
@software{tempo2024,
  title={TEMPO: Threshold-Enabled Multipath Parallel Output},
  author={Luker, Joe},
  year={2024},
  url={https://github.com/JoeLuker/tempo}
}
```

MIT License - See LICENSE file for details.