# Awesome Edge AI for Multimodal Agents

A curated list of papers, frameworks, benchmarks, and applications for efficient multimodal agents (LLMs, text-to-image, speech, world models, etc.) on mobile and edge devices. Focused on inference engines, optimization, and deployment for real-world use.

The next generation of AI agents is multimodal: capable of understanding and generating text, images, speech, video, and embodied interactions. Running these models on mobile and edge devices unlocks:

- **Privacy**: data stays on-device
- **Low latency**: real-time interaction without cloud roundtrips
- **Accessibility**: AI everywhere, even offline
- **Efficiency**: tailored for constrained environments

This repo tracks the latest progress in making multimodal AI efficient, deployable, and agent-ready on edge hardware.
## 📚 Surveys

| Title | Venue | Year | Materials | Description |
| --- | --- | --- | --- | --- |
| A Comprehensive Survey on On-Device AI Models | ACM Comput. Surveys | 2024 | Paper | Broad on-device overview (models, systems). |
| Mobile Edge Intelligence for Large Language Models | arXiv | 2024 | Paper | Survey of LLMs at the mobile edge (latency, offloading). |
| Efficient Diffusion Models: A Survey | arXiv | 2025 | Paper | Efficient diffusion (algorithms & systems) for edge. |
| Efficient Diffusion Models (IEEE TPAMI) | TPAMI | 2025 | Paper | Practice-focused survey incl. deployment. |
## 🧠 LLM Inference on Edge

| Title | Venue | Year | Materials | Description |
| --- | --- | --- | --- | --- |
| LLM as a System Service on Mobile Devices (LLMS) | arXiv | 2024 | Paper | KV-cache management, compression & swapping on phones. |
| Bringing Open LLMs to Consumer Devices (MLC-LLM) | Blog | 2023 | Post | Universal deployment: phones, browsers, Apple/AMD/NVIDIA. |
| llama.cpp (GGML) | GitHub | 2023– | Repo | C/C++ local inference across CPUs/NPUs/GPUs. |
| Large Language Models on Mobile Devices: Measurements & Optimizations | MobiSys | 2024 | Paper | Empirical study of on-device LLM cost/latency. |
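A recurring theme in the systems papers above is treating the KV cache as a managed resource: keep hot chunks in scarce RAM and swap cold ones to slower storage. The toy sketch below illustrates only that LRU swapping idea; `KVCachePool` and its two tiers are hypothetical names, not the actual design of LLMS or any listed system (real systems swap quantized tensor chunks, not Python lists).

```python
from collections import OrderedDict

class KVCachePool:
    """Toy KV-cache pool with LRU swapping: hot chunks stay in a
    size-limited 'RAM' dict, cold chunks are evicted to a 'flash' dict.
    Purely illustrative of the management/swapping idea."""

    def __init__(self, ram_budget: int):
        self.ram_budget = ram_budget              # max chunks held "in RAM"
        self.ram: "OrderedDict[int, list]" = OrderedDict()
        self.flash: "dict[int, list]" = {}        # stand-in for slower storage

    def put(self, chunk_id: int, kv: list) -> None:
        self.ram[chunk_id] = kv
        self.ram.move_to_end(chunk_id)            # mark most recently used
        while len(self.ram) > self.ram_budget:
            victim, data = self.ram.popitem(last=False)  # evict LRU chunk
            self.flash[victim] = data

    def get(self, chunk_id: int) -> list:
        if chunk_id not in self.ram:              # swap back in on demand
            self.put(chunk_id, self.flash.pop(chunk_id))
        self.ram.move_to_end(chunk_id)
        return self.ram[chunk_id]

pool = KVCachePool(ram_budget=2)
for i in range(4):
    pool.put(i, [f"kv-{i}"])
# the two oldest chunks (0 and 1) have been swapped out to "flash"
```

Accessing a swapped-out chunk via `get` transparently pulls it back into RAM, evicting the new least-recently-used chunk in turn.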
## 🖼️ Multimodal & Generative Models

| Title | Venue | Year | Materials | Description |
| --- | --- | --- | --- | --- |
| MobileCLIP | CVPR | 2024 | Paper \| Code | Image-text models optimized for iPhone latency. |
| LLaVA-Mini (1 vision token) | arXiv | 2025 | Paper | Compresses vision tokens down to a single token for LMMs. |
| MobileVLM | arXiv | 2023–24 | Paper \| Code | VLM tuned for mobile throughput. |
| EdgeSAM | arXiv | 2023 | Paper \| Proj | Distilled SAM at 30+ FPS on iPhone 14. |
| MiniCPM-V (efficient MLLM) | Nat. Commun. | 2025 | Paper | On-device MLLM progress since 2024 releases. |
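The extreme vision-token reduction in the LLaVA-Mini row (hundreds of patch tokens collapsed to one before they reach the LLM) can be pictured as query-based pooling. The sketch below is a minimal stand-in, not the paper's actual module: one query vector attends over all patch tokens and their softmax-weighted average becomes the single fused token.

```python
import math

def compress_tokens(tokens: "list[list[float]]", query: "list[float]") -> "list[float]":
    """Collapse N vision tokens into one via softmax-weighted pooling
    against a single query vector -- a toy version of query-based
    vision-token compression (illustrative, not any specific model)."""
    scores = [sum(q * t for q, t in zip(query, tok)) for tok in tokens]
    m = max(scores)                                   # stabilize softmax
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    dim = len(tokens[0])
    return [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]

# e.g. 576 patch tokens (a 24x24 ViT grid) -> 1 token handed to the LLM
patches = [[float(i % 7), float(i % 5)] for i in range(576)]
fused = compress_tokens(patches, query=[1.0, 0.0])
```

The payoff is that LLM prefill cost scales with 1 vision token instead of 576, which dominates latency for on-device multimodal inference.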
## 🌍 World Models & Embodied AI

| Title | Venue | Year | Materials | Description |
| --- | --- | --- | --- | --- |
| AndroidWorld: Dynamic Benchmarking for Mobile Agents | arXiv | 2024 | Paper \| Site | 116 tasks across 20 Android apps; agent evaluation. |
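Benchmarks like the one above score agents by running episodes against programmatically checkable tasks. The harness below is a generic sketch of that evaluation loop (all names are illustrative; AndroidWorld's real environment drives an Android device, not a dict).

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    """A task with a goal string and a programmatic success checker."""
    goal: str
    check: Callable[[dict], bool]

def evaluate(agent: Callable[[dict, str], dict],
             tasks: "list[Task]", max_steps: int = 10) -> float:
    """Run each task episode for up to max_steps and return success rate."""
    successes = 0
    for task in tasks:
        state: dict = {}
        for _ in range(max_steps):
            state = agent(state, task.goal)       # agent acts, env updates
            if task.check(state):                 # success is checked on state
                successes += 1
                break
    return successes / len(tasks)

# trivial agent that just records the goal into state, for demonstration
rate = evaluate(
    lambda state, goal: {**state, "done": goal},
    [Task("open settings", lambda s: s.get("done") == "open settings")],
)
```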
## 🤖 Agent Systems on Edge

| Title | Venue | Year | Materials | Description |
| --- | --- | --- | --- | --- |
| MobiAgent: Systematic Framework for Customizable Mobile Agents | arXiv | 2025 | Paper | Mobile agent models + acceleration + benchmark suite. |
| EcoAgent: Edge–Cloud Collaborative Mobile Automation | arXiv | 2025 | Paper | Planner in the cloud + execution/observation on-edge. |
| LLM as a System Service (OS-level integration) | arXiv | 2024 | Paper | System support for stateful on-device LLMs. |
| Mobile-Agent-v3 / GUI-Owl (GUI automation) | arXiv | 2025 | Paper | SOTA open models on AndroidWorld/OSWorld. |
| Democratizing Agentic AI with Fast Test-Time Scaling on the Edge (FlashTTS) | arXiv | 2025 | Paper | Serving system for efficient test-time scaling on edge; 2.2× higher goodput and 38–68% lower latency vs. a vLLM baseline. |
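The edge–cloud split in the EcoAgent row (a cloud planner decomposes the goal once, then on-device agents execute and observe) can be sketched as a plain function split. These stubs are hypothetical illustrations of the control flow only, not EcoAgent's actual APIs.

```python
def cloud_plan(goal: str) -> "list[str]":
    """Stand-in for a cloud-hosted planner model: decompose a goal into
    steps. A real system would make one remote LMM call here."""
    return [f"step {i}: {part.strip()}"
            for i, part in enumerate(goal.split(","), 1)]

def edge_execute(step: str, screen: dict) -> dict:
    """Stand-in for on-device execution/observation agents: apply one
    step and return the newly observed screen state."""
    done = screen.get("done", [])
    return {**screen, "done": done + [step]}

def run_episode(goal: str) -> dict:
    plan = cloud_plan(goal)        # single cloud round-trip for planning
    screen: dict = {}
    for step in plan:              # all execution stays on-device
        screen = edge_execute(step, screen)
    return screen

final = run_episode("open camera, take photo")
```

The design point is that only the (small) plan crosses the network, while the latency-sensitive act/observe loop never leaves the device.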
## ⚙️ Frameworks & Inference Engines
## 🛠️ Optimization Techniques

| Category | Methods / Papers | Description | Paper | Code |
| --- | --- | --- | --- | --- |
| Quantization | GPTQ, AWQ, SmoothQuant, OmniQuant, QuaRot, QLoRA, DoRA | W4/W8A8, group-wise or NF4 quantization; activation-aware scaling; outlier rotation; low-bit PEFT; LoRA decomposition for fine-tuning. | GPTQ / AWQ / SmoothQuant / QuaRot / QLoRA / DoRA | GPTQ / AWQ / SmoothQuant / QuaRot / QLoRA |
| KV-cache Quantization | KVQuant, ZipCache, QAQ | 2–3 bit KV compression with <0.1 perplexity drop; enables million-token context windows and memory savings. | KVQuant / ZipCache / QAQ | KVQuant / ZipCache |
| Pruning & Sparsity | SparseGPT, Wanda, Wanda++, Movement pruning, N:M sparsity | Unstructured/structured sparsity up to 60% with minimal accuracy loss; block- and activation-aware pruning for LLMs. | SparseGPT / Wanda / Movement Pruning | SparseGPT / Wanda |
| Efficient Attention | FlashAttention-3, PagedAttention (vLLM), MQA/GQA | Mixed-precision & warp-specialized kernels; KV-cache paging; fewer KV heads for faster decode. | FlashAttention-3 / PagedAttention | FlashAttention / vLLM |
| Speculative & Multi-token Decoding | Medusa, EAGLE, EAGLE-3 | Multi-head speculative decoding; feature- and token-level prediction; 2–3.6× speedup. | Medusa / EAGLE | Medusa / EAGLE |
| Multimodal Compression | ToMe, DynamicViT, LLaVA-Mini | Token merging/pruning for ViTs; dynamic vision-token selection; extreme compression (1 vision token vs. 576). | ToMe / DynamicViT / LLaVA-Mini | ToMe / LLaVA-Mini |
| Efficient Diffusion | Consistency Models, LCM, LCM-LoRA, ADD, SDXL-Turbo, SnapFusion | Few-step or one-step generation; distillation & adversarial training; mobile-ready pipelines for <2 s inference. | Consistency Models / LCM / ADD / SDXL-Turbo / SnapFusion | LCM / SDXL-Turbo / SnapFusion |
| System-level Test-Time Scaling | FlashTTS | Fast test-time scaling for agentic LLMs on edge; speculative beam extension, dynamic prefix scheduling, memory-aware model allocation; 2.2× higher goodput, 38–68% latency reduction vs. vLLM. | FlashTTS | – |
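To make the quantization row concrete: the W4-style schemes share one scale per small group of weights, round to low-bit integers, and dequantize on the fly at inference. The round trip below is a minimal symmetric sketch of that idea (no activation awareness, outlier rotation, or the other refinements the listed papers add).

```python
def quantize_group(weights: "list[float]", bits: int = 4) -> "tuple[list[int], float]":
    """Symmetric group-wise quantization: map one group of weights to
    signed ints in [-(2**(bits-1)-1), 2**(bits-1)-1] with a shared scale.
    A toy version of W4 schemes -- real methods calibrate the scale."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0
    return [round(w / scale) for w in weights], scale

def dequantize_group(q: "list[int]", scale: float) -> "list[float]":
    """Recover approximate fp weights from ints + shared scale."""
    return [x * scale for x in q]

group = [0.31, -0.12, 0.55, -0.48]                   # one group of weights
q, scale = quantize_group(group)
recovered = dequantize_group(q, scale)
max_err = max(abs(a - b) for a, b in zip(group, recovered))
# per-weight error is bounded by scale / 2, i.e. by the largest weight
# in the group -- which is why smaller groups quantize more accurately
```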
## 📊 Benchmarks & Datasets

| Benchmark / Dataset | Category | Description | Link |
| --- | --- | --- | --- |
| MLPerf Tiny | Embedded / TinyML | Industry-standard inference benchmark suite for ultra-low-power embedded devices (microcontrollers); covers keyword spotting, visual wake words, image classification, anomaly detection. Measures accuracy, latency, and energy. | MLPerf Tiny |
| AI Benchmark | Mobile AI | Mobile AI performance suite that scores AI workloads across devices, measuring CPU, GPU, and NPU performance. | AI Benchmark |
| AndroidWorld | UI Agent / Autonomous | Dynamic benchmarking environment for autonomous agents controlling Android UIs. Contains 116 programmatically generated tasks across 20 apps; supports reproducible evaluation and robustness testing. | AndroidWorld (GitHub) |
| Geekbench AI | Device AI Scoring | AI-centric workload scoring benchmark that measures CPU, GPU, and NPU performance across a variety of AI tasks. | Geekbench AI |
| MLPerf Client | Client LLM / Desktop | Client-side benchmarking toolkit for evaluating LLM and AI workloads on desktops, laptops, and similar devices. | MLPerf Client benchmarks |
| AIoTBench | Mobile / Embedded (Legacy) | Older mobile/embedded benchmark suite evaluating inference speed across mobile frameworks (TensorFlow Lite, Caffe2, PyTorch Mobile). Introduces metrics such as VIPS and VOPS. | AIoTBench (arXiv) |
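All of these suites follow the same measurement discipline: warm up the workload, collect many per-call latencies, and report percentiles rather than a single number. The micro-benchmark below sketches that pattern in plain Python (it times an arbitrary callable; real suites additionally control thermals and measure energy and accuracy).

```python
import time
import statistics

def benchmark(fn, warmup: int = 3, iters: int = 50) -> dict:
    """Latency micro-benchmark in the style edge-AI suites report numbers:
    warm up (caches, JIT, clocks), then collect per-call latencies and
    summarize with percentiles. Timing only -- no energy or accuracy."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)   # milliseconds
    samples.sort()
    return {
        "p50_ms": statistics.median(samples),
        "p95_ms": samples[int(0.95 * (len(samples) - 1))],    # tail latency
        "mean_ms": statistics.fmean(samples),
    }

stats = benchmark(lambda: sum(range(10_000)))
```

Reporting p95 alongside the median matters on edge devices, where thermal throttling and scheduler jitter make the tail much worse than the typical call.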
## 📱 Applications & Use Cases

| Category | Examples / Papers | Description | Paper | Code |
| --- | --- | --- | --- | --- |
| On-device Chat Assistants | MobileLLM, MobiLlama, EdgeMoE | Sub-billion-parameter or sparse LLMs optimized for phones; low-memory, low-latency assistants. | MobileLLM / MobiLlama / EdgeMoE | MobileLLM / MobiLlama |
| Real-time Speech Translation & Vision | Whisper, SeamlessM4T, MobileCLIP | On-device ASR + translation; efficient vision-language for real-time apps. | Whisper / SeamlessM4T / MobileCLIP | Whisper |
| AR/VR Embodied & GUI Agents | Voyager, AppAgent, Mobile-Agent | Embodied agents (3D/VR) and GUI agents that operate smartphone apps. | Voyager / AppAgent / Mobile-Agent | Voyager / AppAgent / Mobile-Agent |
| Edge Creative Tools (Image/Video/Music) | SnapFusion, MobileDiffusion, LCM/LCM-LoRA, SDXL-Turbo | Distillation and few-step diffusion for on-device image/video; single-step accelerators; practical mobile T2I. | SnapFusion / MobileDiffusion / LCM / SDXL-Turbo | SnapFusion / MobileDiffusion / LCM |
| Wearable Voice Agents | ClawWatch (NullClaw + Vosk) | First AI agent running natively on a smartwatch: NullClaw (2.8 MB Zig binary) + Vosk offline STT (68 MB) + cloud LLM on a Galaxy Watch; ~71 MB total, ~1 MB RAM. | – | ClawWatch |
| Robotics & IoT AI | RT-2, Octo, OpenVLA, Mobile ALOHA | VLA policies and low-cost teleoperation datasets enabling general robot skills; efficient fine-tuning/serving. | RT-2 / Octo / OpenVLA / Mobile ALOHA | RT-2 / OpenVLA / Mobile ALOHA |
| Always-On AI Assistants | OpenClaw, ClawBox | Self-hosted AI assistant platform on NVIDIA Jetson Orin Nano (67 TOPS, 15 W). Multi-agent workflows, browser automation, messaging (Telegram/WhatsApp/Discord). | – | OpenClaw |
## 🌐 Community & Resources

Pull requests are welcome! Please follow the Awesome List Guidelines.

❤️ Inspired by the vision of efficient multimodal agents everywhere: from phones to IoT to autonomous systems.