The first of its kind: a fully offline, private AI chat app for Android.
The only Android LLM app that literally cannot phone home. All LLM inference runs entirely on-device via llama.cpp. No internet. No cloud. No tracking. Your conversations stay yours.
- 100% Offline — No INTERNET permission in the manifest. Cannot phone home.
- On-Device Inference — Runs GGUF models via llama.cpp with optimized ARM NEON/SVE/i8mm native libraries
- Streaming Responses — Token-by-token output (~25 tok/s on budget devices, 40-60+ on flagships)
- Import Any Model — Bring your own GGUF models at runtime via file picker
- Multiple Conversations — Auto-titled from your first message, renameable, searchable
- Advanced Sampling — Temperature, Top-P, Top-K, Min-P, Repeat Penalty with explanations
- Theming — System/Light/Dark/AMOLED Black + 9 accent colour options
- System Prompts — General, Coder, Creative Writer, Tutor, or write your own
- Text-to-Speech — Read AI responses aloud using your device's TTS engine
- Thinking Tag Stripping — Hides `<think>` blocks from reasoning models like Qwen
- Security — Encrypted settings, optional biometric lock, secure file deletion
- Chat Backup — Export/import all conversations as JSON
- Built-in Help — Guide for downloading models from HuggingFace
- Gemma 4 — now supported in Version 3
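The thinking-tag stripping above can be done with a single regex pass over the finished response. A minimal sketch of the idea (illustrative only, not OfflineLLM's actual implementation):

```java
import java.util.regex.Pattern;

// Sketch: strip <think>...</think> blocks emitted by reasoning models
// such as Qwen before showing the response to the user.
// Illustrative only; not OfflineLLM's actual code.
public class ThinkStripper {
    // DOTALL so the block may span multiple lines; trailing whitespace
    // after the closing tag is swallowed too.
    private static final Pattern THINK =
            Pattern.compile("<think>.*?</think>\\s*", Pattern.DOTALL);

    public static String stripThinking(String response) {
        return THINK.matcher(response).replaceAll("").trim();
    }

    public static void main(String[] args) {
        String raw = "<think>Reasoning about 2 + 2...</think>The answer is 4.";
        System.out.println(stripThinking(raw)); // prints "The answer is 4."
    }
}
```

For token-by-token streaming the app would need to buffer until the closing tag arrives; the regex form above only works on complete text.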
| Model | Size | Best For |
|---|---|---|
| Gemma 3 270M Q4_K_M | ~300 MB | 4GB RAM devices, fast responses |
| Qwen3.5 0.8B Q4_K_M | ~530 MB | Good balance for 4-6GB RAM |
| Gemma 3 1B Q4_K_M | ~750 MB | Recommended for 6-8GB RAM |
| Qwen3.5 4B Q4_K_M | ~2.5 GB | Best quality for 8GB+ RAM |
Search for the model name + "GGUF" on HuggingFace. Choose Q4_K_M quantization for best quality/speed balance.
- Download the APK from Releases
- On your device: Settings → Apps → Install unknown apps → allow your file manager
- Open the APK and tap Install
- Complete onboarding and import a GGUF model from Settings
Or via ADB:

```shell
adb install OfflineLLM-v3.0.0-signed_release.apk
```
- JDK 17, Android SDK (compileSdk 36), NDK r27, CMake 3.22.1
```shell
git clone --recurse-submodules https://github.com/jegly/OfflineLLM.git
cd OfflineLLM

# Optional: bundle a model in the APK
cp /path/to/model.gguf app/src/main/assets/model/

# Build
./gradlew assembleDebug
```

The first build compiles llama.cpp from source (~15-20 min); subsequent builds are fast.
```
OfflineLLM/
├── smollm/                  ← Native llama.cpp JNI module
│   └── src/main/
│       ├── cpp/             ← C++ inference engine + JNI bridge
│       └── java/            ← SmolLM.kt, GGUFReader.kt wrappers
├── app/                     ← Main Android application
│   └── src/main/java/com/jegly/offlineLLM/
│       ├── ai/              ← InferenceEngine, ModelManager, SystemPrompts
│       ├── data/            ← Room database, DAOs, repositories
│       ├── di/              ← Hilt dependency injection modules
│       ├── ui/              ← Compose screens, components, theme, navigation
│       └── utils/           ← BiometricHelper, MemoryMonitor, SecurityUtils, TTS
└── llama.cpp/               ← Git submodule
```
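GGUFReader.kt parses model metadata at import time. One detail of the format worth knowing: every GGUF file starts with the 4-byte ASCII magic `GGUF`, which is enough for a quick sanity check before a full parse. A sketch of such a check (illustrative; not GGUFReader.kt itself):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch: validate a candidate model file by its GGUF magic bytes.
// Illustrative pre-import check, not OfflineLLM's GGUFReader.kt.
public class GgufCheck {
    public static boolean looksLikeGguf(File file) throws IOException {
        if (!file.isFile() || file.length() < 4) return false;
        try (FileInputStream in = new FileInputStream(file)) {
            byte[] magic = new byte[4];
            if (in.read(magic) != 4) return false;
            // GGUF files begin with the ASCII bytes "GGUF".
            return Arrays.equals(magic, "GGUF".getBytes(StandardCharsets.US_ASCII));
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("model", ".gguf");
        java.nio.file.Files.write(f.toPath(),
                "GGUFtest".getBytes(StandardCharsets.US_ASCII));
        System.out.println(looksLikeGguf(f)); // prints "true"
        f.delete();
    }
}
```

Rejecting non-GGUF files early gives a clear error in the file picker instead of a native-side load failure.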
| Device Tier | RAM | Expected Speed |
|---|---|---|
| Budget (ZTE, etc.) | 4 GB | ~25 tok/s with 270M model |
| Mid-range (Pixel 7) | 6-8 GB | 30-50 tok/s with 1B model |
| Flagship (Pixel 10 Pro) | 12-16 GB | 40-60+ tok/s with 4B model |
OfflineLLM gives you full control over how the model generates text:
| Parameter | Default | What It Does |
|---|---|---|
| Temperature | 0.7 | Controls randomness. Lower = focused. Higher = creative. |
| Top-P | 0.9 | Nucleus sampling. Only considers tokens above this cumulative probability. |
| Top-K | 40 | Limits selection to the K most likely tokens. |
| Min-P | 0.1 | Filters tokens below this fraction of the top token's probability. |
| Repeat Penalty | 1.1 | Penalises repeated tokens. 1.0 = no penalty. |
| Context Size | 4096 | How many tokens of conversation history the model can see. |
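The table above describes a chain of filters applied to the model's next-token distribution. The actual filtering happens inside llama.cpp; the sketch below shows the general idea, and the filter order (Top-K, then Min-P, then Top-P) is an assumption for illustration:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of the sampling filters from the table (illustrative; the real
// implementation lives in llama.cpp). Returns surviving candidates as
// [tokenId, probability] pairs, renormalized to sum to 1.
public class SamplerSketch {
    public static List<double[]> filterCandidates(
            float[] logits, float temperature, int topK, float topP, float minP) {
        // Softmax with temperature: lower temperature sharpens the distribution.
        double max = Double.NEGATIVE_INFINITY;
        for (float l : logits) max = Math.max(max, l);
        double[] exps = new double[logits.length];
        double sum = 0;
        for (int i = 0; i < logits.length; i++) {
            exps[i] = Math.exp((logits[i] - max) / temperature);
            sum += exps[i];
        }
        List<double[]> cands = new ArrayList<>();
        for (int i = 0; i < logits.length; i++) cands.add(new double[]{i, exps[i] / sum});
        cands.sort(Comparator.comparingDouble((double[] c) -> -c[1]));

        // Top-K: keep only the K most likely tokens.
        if (cands.size() > topK) cands = new ArrayList<>(cands.subList(0, topK));
        // Min-P: drop tokens below minP * P(best token).
        double floor = minP * cands.get(0)[1];
        cands.removeIf(c -> c[1] < floor);
        // Top-P (nucleus): keep the smallest prefix whose mass reaches topP.
        List<double[]> nucleus = new ArrayList<>();
        double cum = 0;
        for (double[] c : cands) {
            nucleus.add(c);
            cum += c[1];
            if (cum >= topP) break;
        }
        // Renormalize the kept probabilities.
        double kept = 0;
        for (double[] c : nucleus) kept += c[1];
        for (double[] c : nucleus) c[1] /= kept;
        return nucleus;
    }

    public static void main(String[] args) {
        List<double[]> out = filterCandidates(
                new float[]{5f, 4f, 1f, 0f}, 1.0f, 40, 0.9f, 0.1f);
        for (double[] c : out)
            System.out.printf("token %d -> %.3f%n", (int) c[0], c[1]);
    }
}
```

With the defaults from the table, low-probability tail tokens are cut away first and the final token is drawn from what remains, which is why raising Temperature or Top-P makes output more varied.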
- Zero network permissions — no INTERNET, no ACCESS_NETWORK_STATE
- No Google Play Services or Firebase dependencies
- Encrypted settings via Jetpack Security
- Optional biometric lock
- Memory Tagging Extension enabled (`memtagMode="sync"`)
- Secure deletion — files overwritten before removal
- No logging of prompts or responses
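"Secure deletion" above means overwriting file contents before unlinking. A minimal sketch of the technique (illustrative only; OfflineLLM's SecurityUtils may differ, and flash wear-leveling makes overwriting best-effort on mobile storage):

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.security.SecureRandom;

// Sketch: overwrite a file with random bytes, sync to disk, then delete.
// Illustrative only; not OfflineLLM's actual SecurityUtils, and on
// flash storage wear-leveling means old blocks may still survive.
public class SecureDelete {
    public static boolean secureDelete(File file) throws IOException {
        if (!file.exists()) return false;
        SecureRandom rng = new SecureRandom();
        // "rws" syncs content and metadata on every write.
        try (RandomAccessFile raf = new RandomAccessFile(file, "rws")) {
            byte[] noise = new byte[8192];
            long remaining = raf.length();
            while (remaining > 0) {
                rng.nextBytes(noise);
                int n = (int) Math.min(remaining, noise.length);
                raf.write(noise, 0, n);
                remaining -= n;
            }
        }
        return file.delete();
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("secret", ".tmp");
        java.nio.file.Files.writeString(f.toPath(), "chat history");
        System.out.println(secureDelete(f)); // prints "true"
    }
}
```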
Apache License 2.0
llama.cpp backend: MIT License. Native wrapper adapted from SmolChat-Android (Apache 2.0).