
The first of its kind: a fully offline, private AI chat app for Android

The only Android LLM app that literally cannot phone home. All LLM inference runs entirely on-device via llama.cpp. No internet. No cloud. No tracking. Your conversations stay yours.



If this project helped you, please ⭐️ star it to help others find it

Screenshots


Features

  • 100% Offline — No INTERNET permission in the manifest. Cannot phone home.
  • On-Device Inference — Runs GGUF models via llama.cpp with optimized ARM NEON/SVE/i8mm native libraries
  • Streaming Responses — Token-by-token output (~25 tok/s on budget devices, 40-60+ on flagships)
  • Import Any Model — Bring your own GGUF models at runtime via file picker
  • Multiple Conversations — Auto-titled from your first message, renameable, searchable
  • Advanced Sampling — Temperature, Top-P, Top-K, Min-P, Repeat Penalty with explanations
  • Theming — System/Light/Dark/AMOLED Black + 9 accent colour options
  • System Prompts — General, Coder, Creative Writer, Tutor, or write your own
  • Text-to-Speech — Read AI responses aloud using your device's TTS engine
  • Thinking Tag Stripping — Hides <think> blocks from reasoning models like Qwen
  • Security — Encrypted settings, optional biometric lock, secure file deletion
  • Chat Backup — Export/import all conversations as JSON
  • Built-in Help — Guide for downloading models from HuggingFace
  • Gemma 4 — Now supported as of version 3
  • RAG — Persistent-memory feature coming soon
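The thinking-tag stripping feature above can be sketched in a few lines of Kotlin. This is a hypothetical illustration, not the app's actual filter; `stripThinking` is a made-up helper name:

```kotlin
// Hypothetical sketch of thinking-tag stripping: remove the <think>...</think>
// blocks that reasoning models such as Qwen emit before their final answer.
// (?s) makes '.' match newlines, since reasoning blocks usually span lines.
fun stripThinking(raw: String): String =
    raw.replace(Regex("(?s)<think>.*?</think>"), "").trim()
```

In a streaming UI the same idea has to be applied incrementally, since a `</think>` close tag may not have arrived yet when a chunk is rendered.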

Recommended Models

| Model (Q4_K_M) | Approx. Size | RAM Required / Best For |
| :--- | :--- | :--- |
| gemma-3-270m-it-qat-Q4_K_M.gguf | ~300 MB | 2-4 GB RAM devices, fast responses |
| Qwen3.5 0.8B Q4_K_M | ~530 MB | Good balance for 4-6 GB RAM |
| gemma-4-E2B-it-GGUF (2.3B effective) | ~1.3 GB | Recommended for 6-8 GB RAM |
| Qwen3.5 4B Q4_K_M | ~2.5 GB | Best quality for 8 GB+ RAM |
| gemma-4-E4B-it-GGUF (4.5B effective) | ~2.5 GB | Recommended for 6-8 GB RAM |

Search for the model name + "GGUF" on HuggingFace. Choose Q4_K_M quantization for best quality/speed balance.
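Before importing a downloaded file, it can be sanity-checked for the GGUF magic bytes: every GGUF file begins with the ASCII bytes "GGUF". A minimal sketch (illustrative only; `looksLikeGguf` is a hypothetical helper, and the repo's GGUFReader.kt does the real parsing):

```kotlin
import java.io.File

// Probe the first four bytes of a file for the GGUF magic ("GGUF" in ASCII).
// A passing check means the file is plausibly a GGUF model, not that it is valid.
fun looksLikeGguf(file: File): Boolean =
    file.length() >= 4 &&
        file.inputStream().use { ins ->
            val magic = ByteArray(4)
            ins.read(magic) == 4 && magic.contentEquals("GGUF".toByteArray())
        }
```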


Install

  1. Download the APK from Releases
  2. On your device: Settings → Apps → Install unknown apps → allow your file manager
  3. Open the APK and tap Install
  4. Complete onboarding and import a GGUF model from Settings

Or via ADB:

adb install OfflineLLM-v4.0.0-signed_release.apk

SHA256SUM: 9c36b1dc7e0eec0c4c32e0706c2769d1d612eafbfca6070acf022ac29f34c4f8
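The download can be verified against the checksum above with `sha256sum`, or programmatically. A minimal Kotlin sketch (`sha256Hex` is a hypothetical helper, not part of the app):

```kotlin
import java.io.File
import java.security.MessageDigest

// Compute a file's SHA-256 digest as lowercase hex, for comparison against the
// checksum published alongside the release. Reads the whole file into memory,
// which is fine for an APK-sized file.
fun sha256Hex(file: File): String =
    MessageDigest.getInstance("SHA-256")
        .digest(file.readBytes())
        .joinToString("") { "%02x".format(it) }
```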


Build from Source

Prerequisites

  • JDK 17, Android SDK (compileSdk 36), NDK r27, CMake 3.22.1

git clone --recurse-submodules https://github.com/jegly/OfflineLLM.git
cd OfflineLLM

# Optional: bundle a model in the APK
cp /path/to/model.gguf app/src/main/assets/model/

# Build
./gradlew assembleDebug

First build compiles llama.cpp from source (~15-20 min). Subsequent builds are fast.


Architecture

OfflineLLM/
├── smollm/              ← Native llama.cpp JNI module
│   └── src/main/
│       ├── cpp/         ← C++ inference engine + JNI bridge
│       └── java/        ← SmolLM.kt, GGUFReader.kt wrappers
├── app/                 ← Main Android application
│   └── src/main/java/com/jegly/offlineLLM/
│       ├── ai/          ← InferenceEngine, ModelManager, SystemPrompts
│       ├── data/        ← Room database, DAOs, repositories
│       ├── di/          ← Hilt dependency injection modules
│       ├── ui/          ← Compose screens, components, theme, navigation
│       └── utils/       ← BiometricHelper, MemoryMonitor, SecurityUtils, TTS
└── llama.cpp/           ← Git submodule

Performance

| Device Tier | RAM | Expected Speed |
| :--- | :--- | :--- |
| Budget (ZTE, etc.) | 4 GB | ~25 tok/s with 270M model |
| Mid-range (Pixel 7) | 6-8 GB | 30-50 tok/s with 1B model |
| Flagship (Pixel 10 Pro) | 12-16 GB | 40-60+ tok/s with 4B model |

Sampling Parameters

OfflineLLM gives you full control over how the model generates text:

| Parameter | Default | What It Does |
| :--- | :--- | :--- |
| Temperature | 0.7 | Controls randomness. Lower = focused. Higher = creative. |
| Top-P | 0.9 | Nucleus sampling. Keeps the smallest set of tokens whose cumulative probability reaches this value. |
| Top-K | 40 | Limits selection to the K most likely tokens. |
| Min-P | 0.1 | Filters tokens below this fraction of the top token's probability. |
| Repeat Penalty | 1.1 | Penalises repeated tokens. 1.0 = no penalty. |
| Context Size | 4096 | How many tokens of conversation history the model can see. |
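The chain of filters described in the table can be sketched as follows. This is an illustrative toy, not llama.cpp's actual sampler, and `candidateTokens` is a made-up name: softmax the temperature-scaled logits, keep the Top-K most likely tokens, then keep the smallest prefix whose cumulative probability reaches Top-P.

```kotlin
import kotlin.math.exp

// Toy sampling chain: temperature -> softmax -> Top-K -> Top-P.
// Returns the surviving candidate tokens; a real sampler would then draw one
// of them at random according to their (renormalised) probabilities.
fun candidateTokens(
    logits: Map<String, Double>,
    temperature: Double = 0.7,
    topK: Int = 40,
    topP: Double = 0.9,
): List<String> {
    // Softmax over temperature-scaled logits.
    val scaled = logits.mapValues { exp(it.value / temperature) }
    val z = scaled.values.sum()
    // Rank by probability and keep the Top-K candidates.
    val ranked = scaled.mapValues { it.value / z }
        .entries.sortedByDescending { it.value }
        .take(topK)
    // Keep the smallest prefix whose cumulative probability reaches Top-P.
    val kept = mutableListOf<String>()
    var cum = 0.0
    for ((token, p) in ranked) {
        kept += token
        cum += p
        if (cum >= topP) break
    }
    return kept
}
```

Min-P and the repeat penalty act as further filters in the same spirit, pruning the candidate set before the final random draw.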

Security & Privacy

  • Zero network permissions — no INTERNET, no ACCESS_NETWORK_STATE
  • No Google Play Services or Firebase dependencies
  • Encrypted settings via Jetpack Security
  • Optional biometric lock
  • Memory Tagging Extension enabled (memtagMode="sync")
  • Secure deletion — files overwritten before removal
  • No logging of prompts or responses
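The overwrite-before-delete idea behind secure deletion can be sketched like this. `secureDelete` is a hypothetical illustration, not the repo's SecurityUtils; note that on flash storage with wear-levelling and journaling filesystems, overwriting in place is best-effort only:

```kotlin
import java.io.File

// Overwrite a file's contents with zeros before deleting it, so the plaintext
// is less likely to survive in freed blocks. Capture the length first, because
// opening an output stream truncates the file.
fun secureDelete(file: File): Boolean {
    if (!file.exists()) return false
    val size = file.length()
    val zeros = ByteArray(8192)
    file.outputStream().use { out ->
        var remaining = size
        while (remaining > 0) {
            val n = minOf(remaining, zeros.size.toLong()).toInt()
            out.write(zeros, 0, n)
            remaining -= n
        }
    }
    return file.delete()
}
```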

License

Apache License 2.0

llama.cpp backend: MIT License. Native wrapper adapted from SmolChat-Android (Apache 2.0).