# Awesome Edge AI for Multimodal Agents

A curated list of papers, frameworks, benchmarks, and applications for efficient multimodal agents (LLMs, text-to-image, speech, world models, etc.) on mobile and edge devices. Focused on inference engines, optimization, and deployment for real-world use.

The next generation of AI agents is multimodal: capable of understanding and generating text, images, speech, video, and embodied interactions. Running these models on mobile and edge devices unlocks:

- **Privacy**: data stays on-device
- **Low latency**: real-time interaction without cloud roundtrips
- **Accessibility**: AI everywhere, even offline
- **Efficiency**: tailored for constrained environments

This repo tracks the latest progress in making multimodal AI efficient, deployable, and agent-ready on edge hardware.
## 📚 Surveys

| Title | Venue | Year | Materials | Description |
| --- | --- | --- | --- | --- |
| A Comprehensive Survey on On-Device AI Models | ACM Comput. Surveys | 2024 | Paper | Broad on-device overview (models, systems). |
| Mobile Edge Intelligence for Large Language Models | arXiv | 2024 | Paper | Survey of LLMs at the mobile edge (latency, offloading). |
| Efficient Diffusion Models: A Survey | arXiv | 2025 | Paper | Efficient diffusion (algorithms & systems) for edge. |
| Efficient Diffusion Models (IEEE TPAMI) | TPAMI | 2025 | Paper | Practice-focused survey incl. deployment. |
## 🧠 LLM Inference on Edge

| Title | Venue | Year | Materials | Description |
| --- | --- | --- | --- | --- |
| LLM as a System Service on Mobile Devices (LLMS) | arXiv | 2024 | Paper | KV-cache management, compression & swapping on phones. |
| Bringing Open LLMs to Consumer Devices (MLC-LLM) | Blog | 2023 | Post | Universal deployment: phones, browsers, Apple/AMD/NVIDIA. |
| llama.cpp (GGML) | GitHub | 2023– | Repo | C/C++ local inference across CPUs/NPUs/GPUs. |
| Large Language Models on Mobile Devices: Measurements & Optimizations | MobiSys | 2024 | Paper | Empirical study of on-device LLM cost/latency. |
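A recurring theme in the systems papers above is treating the KV cache as a managed resource: keep hot chunks in scarce RAM and swap cold ones to slower storage. The toy sketch below illustrates only that LRU swapping idea; `KVCachePool` and its two tiers are hypothetical names, not the actual design of LLMS or any listed system (real systems swap quantized tensor chunks, not Python lists).

```python
from collections import OrderedDict

class KVCachePool:
    """Toy KV-cache pool with LRU swapping: hot chunks stay in a
    size-limited 'RAM' dict, cold chunks are evicted to a 'flash' dict.
    Purely illustrative of the management/swapping idea."""

    def __init__(self, ram_budget: int):
        self.ram_budget = ram_budget              # max chunks held "in RAM"
        self.ram: "OrderedDict[int, list]" = OrderedDict()
        self.flash: "dict[int, list]" = {}        # stand-in for slower storage

    def put(self, chunk_id: int, kv: list) -> None:
        self.ram[chunk_id] = kv
        self.ram.move_to_end(chunk_id)            # mark most recently used
        while len(self.ram) > self.ram_budget:
            victim, data = self.ram.popitem(last=False)  # evict LRU chunk
            self.flash[victim] = data

    def get(self, chunk_id: int) -> list:
        if chunk_id not in self.ram:              # swap back in on demand
            self.put(chunk_id, self.flash.pop(chunk_id))
        self.ram.move_to_end(chunk_id)
        return self.ram[chunk_id]

pool = KVCachePool(ram_budget=2)
for i in range(4):
    pool.put(i, [f"kv-{i}"])
# the two oldest chunks (0 and 1) have been swapped out to "flash"
```

Accessing a swapped-out chunk via `get` transparently pulls it back into RAM, evicting the new least-recently-used chunk in turn.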
## 🖼️ Multimodal & Generative Models

| Title | Venue | Year | Materials | Description |
| --- | --- | --- | --- | --- |
| MobileCLIP | CVPR | 2024 | Paper \| Code | Image-text models optimized for iPhone latency. |
| LLaVA-Mini (1 vision token) | arXiv | 2025 | Paper | Compresses vision tokens down to a single token for LMMs. |
| MobileVLM | arXiv | 2023–24 | Paper \| Code | VLM tuned for mobile throughput. |
| EdgeSAM | arXiv | 2023 | Paper \| Proj | Distilled SAM at 30+ FPS on iPhone 14. |
| MiniCPM-V (efficient MLLM) | Nat. Commun. | 2025 | Paper | On-device MLLM progress since 2024 releases. |
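The extreme vision-token reduction in the LLaVA-Mini row (hundreds of patch tokens collapsed to one before they reach the LLM) can be pictured as query-based pooling. The sketch below is a minimal stand-in, not the paper's actual module: one query vector attends over all patch tokens and their softmax-weighted average becomes the single fused token.

```python
import math

def compress_tokens(tokens: "list[list[float]]", query: "list[float]") -> "list[float]":
    """Collapse N vision tokens into one via softmax-weighted pooling
    against a single query vector -- a toy version of query-based
    vision-token compression (illustrative, not any specific model)."""
    scores = [sum(q * t for q, t in zip(query, tok)) for tok in tokens]
    m = max(scores)                                   # stabilize softmax
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    dim = len(tokens[0])
    return [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]

# e.g. 576 patch tokens (a 24x24 ViT grid) -> 1 token handed to the LLM
patches = [[float(i % 7), float(i % 5)] for i in range(576)]
fused = compress_tokens(patches, query=[1.0, 0.0])
```

The payoff is that LLM prefill cost scales with 1 vision token instead of 576, which dominates latency for on-device multimodal inference.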
## 🌍 World Models & Embodied AI

| Title | Venue | Year | Materials | Description |
| --- | --- | --- | --- | --- |
| AndroidWorld: Dynamic Benchmarking for Mobile Agents | arXiv | 2024 | Paper \| Site | 116 tasks across 20 Android apps; agent evaluation. |
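Benchmarks like the one above score agents by running episodes against programmatically checkable tasks. The harness below is a generic sketch of that evaluation loop (all names are illustrative; AndroidWorld's real environment drives an Android device, not a dict).

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    """A task with a goal string and a programmatic success checker."""
    goal: str
    check: Callable[[dict], bool]

def evaluate(agent: Callable[[dict, str], dict],
             tasks: "list[Task]", max_steps: int = 10) -> float:
    """Run each task episode for up to max_steps and return success rate."""
    successes = 0
    for task in tasks:
        state: dict = {}
        for _ in range(max_steps):
            state = agent(state, task.goal)       # agent acts, env updates
            if task.check(state):                 # success is checked on state
                successes += 1
                break
    return successes / len(tasks)

# trivial agent that just records the goal into state, for demonstration
rate = evaluate(
    lambda state, goal: {**state, "done": goal},
    [Task("open settings", lambda s: s.get("done") == "open settings")],
)
```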
## 🤖 Agent Systems on Edge

| Title | Venue | Year | Materials | Description |
| --- | --- | --- | --- | --- |
| MobiAgent: Systematic Framework for Customizable Mobile Agents | arXiv | 2025 | Paper | Mobile agent models + acceleration + benchmark suite. |
| EcoAgent: Edge–Cloud Collaborative Mobile Automation | arXiv | 2025 | Paper | Planner in the cloud + execution/observation on-edge. |
| LLM as a System Service (OS-level integration) | arXiv | 2024 | Paper | System support for stateful on-device LLMs. |
| Mobile-Agent-v3 / GUI-Owl (GUI automation) | arXiv | 2025 | Paper | SOTA open models on AndroidWorld/OSWorld. |
| Democratizing Agentic AI with Fast Test-Time Scaling on the Edge (FlashTTS) | arXiv | 2025 | Paper | Serving system for efficient test-time scaling on edge; 2.2× higher goodput and 38–68% lower latency vs. a vLLM baseline. |
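The edge–cloud split in the EcoAgent row (a cloud planner decomposes the goal once, then on-device agents execute and observe) can be sketched as a plain function split. These stubs are hypothetical illustrations of the control flow only, not EcoAgent's actual APIs.

```python
def cloud_plan(goal: str) -> "list[str]":
    """Stand-in for a cloud-hosted planner model: decompose a goal into
    steps. A real system would make one remote LMM call here."""
    return [f"step {i}: {part.strip()}"
            for i, part in enumerate(goal.split(","), 1)]

def edge_execute(step: str, screen: dict) -> dict:
    """Stand-in for on-device execution/observation agents: apply one
    step and return the newly observed screen state."""
    done = screen.get("done", [])
    return {**screen, "done": done + [step]}

def run_episode(goal: str) -> dict:
    plan = cloud_plan(goal)        # single cloud round-trip for planning
    screen: dict = {}
    for step in plan:              # all execution stays on-device
        screen = edge_execute(step, screen)
    return screen

final = run_episode("open camera, take photo")
```

The design point is that only the (small) plan crosses the network, while the latency-sensitive act/observe loop never leaves the device.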
## ⚙️ Frameworks & Inference Engines
## 🛠️ Optimization Techniques

| Category | Methods / Papers | Description | Paper | Code |
| --- | --- | --- | --- | --- |
| Quantization | GPTQ, AWQ, SmoothQuant, OmniQuant, QuaRot, QLoRA, DoRA | W4/W8A8, group-wise or NF4 quantization; activation-aware scaling; outlier rotation; low-bit PEFT; LoRA decomposition for fine-tuning. | GPTQ / AWQ / SmoothQuant / QuaRot / QLoRA / DoRA | GPTQ / AWQ / SmoothQuant / QuaRot / QLoRA |
| KV-cache Quantization | KVQuant, ZipCache, QAQ | 2–3 bit KV compression with <0.1 perplexity drop; enables million-token context windows and memory savings. | KVQuant / ZipCache / QAQ | KVQuant / ZipCache |
| Pruning & Sparsity | SparseGPT, Wanda, Wanda++, Movement pruning, N:M sparsity | Unstructured/structured sparsity up to 60% with minimal accuracy loss; block- and activation-aware pruning for LLMs. | SparseGPT / Wanda / Movement Pruning | SparseGPT / Wanda |
| Efficient Attention | FlashAttention-3, PagedAttention (vLLM), MQA/GQA | Mixed-precision & warp-specialized kernels; KV-cache paging; fewer KV heads for faster decode. | FlashAttention-3 / PagedAttention | FlashAttention / vLLM |
| Speculative & Multi-token Decoding | Medusa, EAGLE, EAGLE-3 | Multi-head speculative decoding; feature- and token-level prediction; 2–3.6× speedup. | Medusa / EAGLE | Medusa / EAGLE |
| Multimodal Compression | ToMe, DynamicViT, LLaVA-Mini | Token merging/pruning for ViTs; dynamic vision-token selection; extreme compression (1 vision token vs. 576). | ToMe / DynamicViT / LLaVA-Mini | ToMe / LLaVA-Mini |
| Efficient Diffusion | Consistency Models, LCM, LCM-LoRA, ADD, SDXL-Turbo, SnapFusion | Few-step or one-step generation; distillation & adversarial training; mobile-ready pipelines for <2 s inference. | Consistency Models / LCM / ADD / SDXL-Turbo / SnapFusion | LCM / SDXL-Turbo / SnapFusion |
| System-level Test-Time Scaling | FlashTTS | Fast test-time scaling for agentic LLMs on edge; speculative beam extension, dynamic prefix scheduling, memory-aware model allocation; 2.2× higher goodput, 38–68% latency reduction vs. vLLM. | FlashTTS | – |
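To make the quantization row concrete: the W4-style schemes share one scale per small group of weights, round to low-bit integers, and dequantize on the fly at inference. The round trip below is a minimal symmetric sketch of that idea (no activation awareness, outlier rotation, or the other refinements the listed papers add).

```python
def quantize_group(weights: "list[float]", bits: int = 4) -> "tuple[list[int], float]":
    """Symmetric group-wise quantization: map one group of weights to
    signed ints in [-(2**(bits-1)-1), 2**(bits-1)-1] with a shared scale.
    A toy version of W4 schemes -- real methods calibrate the scale."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0
    return [round(w / scale) for w in weights], scale

def dequantize_group(q: "list[int]", scale: float) -> "list[float]":
    """Recover approximate fp weights from ints + shared scale."""
    return [x * scale for x in q]

group = [0.31, -0.12, 0.55, -0.48]                   # one group of weights
q, scale = quantize_group(group)
recovered = dequantize_group(q, scale)
max_err = max(abs(a - b) for a, b in zip(group, recovered))
# per-weight error is bounded by scale / 2, i.e. by the largest weight
# in the group -- which is why smaller groups quantize more accurately
```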
## 📊 Benchmarks & Datasets

| Benchmark / Dataset | Category | Description | Link |
| --- | --- | --- | --- |
| MLPerf Tiny | Embedded / TinyML | Industry-standard inference benchmark suite for ultra-low-power embedded devices (microcontrollers); covers keyword spotting, visual wake words, image classification, anomaly detection. Measures accuracy, latency, and energy. | MLPerf Tiny |
| AI Benchmark | Mobile AI | Mobile AI performance suite that scores AI workloads across devices, measuring CPU, GPU, and NPU performance. | AI Benchmark |
| AndroidWorld | UI Agent / Autonomous | Dynamic benchmarking environment for autonomous agents controlling Android UIs. Contains 116 programmatically generated tasks across 20 apps; supports reproducible evaluation and robustness testing. | AndroidWorld (GitHub) |
| Geekbench AI | Device AI Scoring | AI-centric workload scoring benchmark that measures CPU, GPU, and NPU performance across a variety of AI tasks. | Geekbench AI |
| MLPerf Client | Client LLM / Desktop | Client-side benchmarking toolkit for evaluating LLM and AI workloads on desktops, laptops, and similar devices. | MLPerf Client benchmarks |
| AIoTBench | Mobile / Embedded (Legacy) | Older mobile/embedded benchmark suite evaluating inference speed across mobile frameworks (TensorFlow Lite, Caffe2, PyTorch Mobile). Introduces metrics such as VIPS and VOPS. | AIoTBench (arXiv) |
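All of these suites follow the same measurement discipline: warm up the workload, collect many per-call latencies, and report percentiles rather than a single number. The micro-benchmark below sketches that pattern in plain Python (it times an arbitrary callable; real suites additionally control thermals and measure energy and accuracy).

```python
import time
import statistics

def benchmark(fn, warmup: int = 3, iters: int = 50) -> dict:
    """Latency micro-benchmark in the style edge-AI suites report numbers:
    warm up (caches, JIT, clocks), then collect per-call latencies and
    summarize with percentiles. Timing only -- no energy or accuracy."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)   # milliseconds
    samples.sort()
    return {
        "p50_ms": statistics.median(samples),
        "p95_ms": samples[int(0.95 * (len(samples) - 1))],    # tail latency
        "mean_ms": statistics.fmean(samples),
    }

stats = benchmark(lambda: sum(range(10_000)))
```

Reporting p95 alongside the median matters on edge devices, where thermal throttling and scheduler jitter make the tail much worse than the typical call.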
## 📱 Applications & Use Cases

| Category | Examples / Papers | Description | Paper | Code |
| --- | --- | --- | --- | --- |
| On-device Chat Assistants | MobileLLM, MobiLlama, EdgeMoE | Sub-billion-parameter or sparse LLMs optimized for phones; low-memory, low-latency assistants. | MobileLLM / MobiLlama / EdgeMoE | MobileLLM / MobiLlama |
| Real-time Speech Translation & Vision | Whisper, SeamlessM4T, MobileCLIP | On-device ASR + translation; efficient vision-language for real-time apps. | Whisper / SeamlessM4T / MobileCLIP | Whisper |
| AR/VR Embodied & GUI Agents | Voyager, AppAgent, Mobile-Agent | Embodied agents (3D/VR) and GUI agents that operate smartphone apps. | Voyager / AppAgent / Mobile-Agent | Voyager / AppAgent / Mobile-Agent |
| Edge Creative Tools (Image/Video/Music) | SnapFusion, MobileDiffusion, LCM/LCM-LoRA, SDXL-Turbo | Distillation and few-step diffusion for on-device image/video; single-step accelerators; practical mobile T2I. | SnapFusion / MobileDiffusion / LCM / SDXL-Turbo | SnapFusion / MobileDiffusion / LCM |
| Wearable Voice Agents | ClawWatch (NullClaw + Vosk) | First AI agent running natively on a smartwatch: NullClaw (2.8 MB Zig binary) + Vosk offline STT (68 MB) + cloud LLM on a Galaxy Watch; ~71 MB total, ~1 MB RAM. | – | ClawWatch |
| Robotics & IoT AI | RT-2, Octo, OpenVLA, Mobile ALOHA | VLA policies and low-cost teleoperation datasets enabling general robot skills; efficient fine-tuning/serving. | RT-2 / Octo / OpenVLA / Mobile ALOHA | RT-2 / OpenVLA / Mobile ALOHA |
| Always-On AI Assistants | OpenClaw, ClawBox | Self-hosted AI assistant platform on NVIDIA Jetson Orin Nano (67 TOPS, 15 W). Multi-agent workflows, browser automation, messaging (Telegram/WhatsApp/Discord). | – | OpenClaw |
## 🌐 Community & Resources

Pull requests are welcome! Please follow the Awesome List Guidelines.

❤️ Inspired by the vision of efficient multimodal agents everywhere: from phones to IoT to autonomous systems.