
ALA: Asynchronous LLM Advisor

Bounded Logit Perturbation Channels for LLM-Guided Reinforcement Learning


Abstract

We introduce ALA (Asynchronous LLM Advisor), a novel architecture that enables large language models to provide real-time strategic guidance to reinforcement learning agents via bounded logit perturbation channels. Unlike approaches that replace RL policies with LLM decisions or use LLMs only for pre-training, ALA creates a continuous, asynchronous advisory channel that nudges agent behavior while preserving learned policies.

Key innovations include:

  • Time-bounded bias expiration — Stale advice automatically expires (see the sketch after this list)
  • Multi-advisor voting — Parallel LLM queries with priority-weighted selection
  • Importance sampling correction — Maintains unbiased PPO gradients despite biased action sampling
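
The first two mechanisms can be pictured with a short sketch. This is not the paper's reference implementation: the AdviceBias structure, the select_bias function, and the reading of "priority-weighted selection" as "highest-priority live advice wins" are illustrative assumptions.

import time
import numpy as np
from dataclasses import dataclass

@dataclass
class AdviceBias:
    bias: np.ndarray     # per-action logit perturbation, already bounded by the router
    priority: float      # advisor weight used when parallel LLM replies arrive
    expires_at: float    # wall-clock time after which the advice counts as stale

    def is_live(self, now=None):
        return (time.time() if now is None else now) < self.expires_at

def select_bias(advice, n_actions):
    # Time-bounded expiration: stale advice is simply ignored.
    live = [a for a in advice if a.is_live()]
    if not live:
        return np.zeros(n_actions)           # no live advice -> unperturbed policy
    # Multi-advisor voting, read here as "highest priority wins".
    winner = max(live, key=lambda a: a.priority)
    return winner.bias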

Author

Cahlen Humphreys
Enfuse Labs
ch@enfuse.io

Paper

The paper is archived on Zenodo: https://doi.org/10.5281/zenodo.18172889

System Architecture

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  DGX Spark      │     │   RTX 5090      │     │  Jetson Orin    │
│  (LLM)          │────▶│   (Router)      │────▶│  (Actor)        │
│                 │     │                 │     │                 │
│ GPT-OSS-20B     │     │ Bounds & routes │     │ logits += bias  │
│ ~30 tok/s       │     │ biases          │     │ action=softmax  │
└─────────────────┘     └─────────────────┘     └─────────────────┘
      │                       │                       │
   3-5 sec                  <50ms                  ~10-15ms
    async                   sync                    sync
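
On the actor, the hot path shown in the diagram reduces to adding the already-bounded bias to the policy logits and sampling from the resulting softmax. A minimal sketch, assuming a NumPy actor; the act function name and the beta scale are illustrative, not taken from the repository.

import numpy as np

def act(policy_logits, bias, beta, rng):
    # logits += bias, scaled by the same beta used in the importance sampling correction
    z = policy_logits + beta * bias
    z = z - z.max()                        # numerical stability before exponentiating
    probs = np.exp(z) / np.exp(z).sum()    # action = softmax
    return int(rng.choice(len(probs), p=probs))

# Example: a bias of +2 on action 1 nudges, but does not force, the choice.
rng = np.random.default_rng(0)
action = act(np.zeros(6), np.array([0., 2., 0., 0., 0., 0.]), beta=1.0, rng=rng)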

Key Equation

The importance sampling correction for PPO with ALA biases:

ratio = exp(log π_new(a|s) - log π_old(a|s) - β × bias[a])

This ensures unbiased policy gradients despite biased action sampling.
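
A minimal sketch of how this ratio could enter a standard clipped PPO objective, assuming a PyTorch learner. Here log_prob_new and log_prob_old are log π(a|s) under the current and rollout-time policies, bias_taken is bias[a] for the action actually sampled, and the 0.2 clip range is an illustrative default rather than a value from the paper.

import torch

def ppo_ratio(log_prob_new, log_prob_old, bias_taken, beta):
    # ratio = exp(log pi_new(a|s) - log pi_old(a|s) - beta * bias[a])
    return torch.exp(log_prob_new - log_prob_old - beta * bias_taken)

def clipped_surrogate(ratio, advantage, clip=0.2):
    # Standard PPO clipped objective, using the bias-corrected ratio.
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip, 1.0 + clip) * advantage
    return -torch.min(unclipped, clipped).mean()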

Hardware

Component               Role                  Specs
NVIDIA DGX Spark        LLM Server            GPT-OSS-20B, 128GB unified memory
NVIDIA RTX 5090         Learner + ALA Router  32GB VRAM
NVIDIA Jetson Orin AGX  Actor (20 bots)       64GB unified memory

Citation

@software{humphreys2026ala,
  author       = {Humphreys, Cahlen},
  title        = {ALA: Asynchronous LLM Advisor for Real-Time Guidance in Reinforcement Learning},
  year         = 2026,
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.18172889},
  url          = {https://doi.org/10.5281/zenodo.18172889}
}

License

This work is licensed under a Creative Commons Attribution 4.0 International License.
