Native AI inference for Android devices
Run GGUF models directly on your Android device with optimized performance and zero cloud dependency!
This is an Android binding for llama.cpp, written in Kotlin and designed for native Android applications. It brings efficient large language model inference to mobile devices by making use of modern hardware capabilities. The project is inspired by cui-llama.rn and llama.cpp (inference of LLaMA and multimodal models in pure C/C++).
- Modernized Core: Native codebase synchronized with the latest `llama.cpp` upstream (via `cui-llama.rn`).
- Multimodal Support: Full support for vision/multimodal models (e.g., LLaVA) using `mmproj` files.
- Improved File Handling: Migrated to a robust File Descriptor (FD) passing mechanism, bypassing Android's scoped storage restrictions and `/proc/self/fd` limitations.
- Architecture Support: Expanded support beyond ARM-only; optimized for 64-bit platforms (`arm64-v8a` and `x86_64`).
- Real-time Streaming: Enhanced JNI streaming logic with robust UTF-8 buffering to prevent crashes during token generation.
- UI State Feedback: Improved `LlamaHelper` to provide immediate feedback during long-running image analysis/projection phases.
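The UTF-8 buffering mentioned above matters because a single token can end in the middle of a multi-byte character, and decoding such a fragment on its own yields garbage or an error. The library handles this in its JNI layer; the Kotlin sketch below only illustrates the idea, and the class `Utf8StreamBuffer` and its `push` method are hypothetical names, not part of the library.

```kotlin
/** Illustrative only: holds back trailing bytes until they form a complete UTF-8 sequence. */
class Utf8StreamBuffer {
    private var pending = ByteArray(0)

    /** Appends [chunk] and returns the longest prefix that ends on a character boundary. */
    fun push(chunk: ByteArray): String {
        val data = pending + chunk
        val cut = completePrefixLength(data)
        pending = data.copyOfRange(cut, data.size)
        return data.decodeToString(0, cut)
    }

    /** Length of the prefix of [data] that does not stop inside a multi-byte sequence. */
    private fun completePrefixLength(data: ByteArray): Int {
        var i = data.size - 1
        var steps = 0
        // A UTF-8 sequence is at most 4 bytes, so inspect at most the last 4 bytes.
        while (i >= 0 && steps < 4) {
            val b = data[i].toInt() and 0xFF
            if ((b and 0xC0) != 0x80) { // found a byte that is not a continuation byte
                val need = when {
                    (b and 0x80) == 0x00 -> 1
                    (b and 0xE0) == 0xC0 -> 2
                    (b and 0xF0) == 0xE0 -> 3
                    (b and 0xF8) == 0xF0 -> 4
                    else -> 1 // invalid lead byte: pass it on and let the decoder replace it
                }
                val have = data.size - i
                return if (have < need) i else data.size
            }
            i--
            steps++
        }
        return data.size
    }
}
```

Each call to `push` returns only text that is safe to display, so a UI can append the result directly to the output it is streaming.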
Modern Android devices (Snapdragon, MediaTek, Exynos, Tensor) now possess the power to run sophisticated AI models locally. Kotlin-LlamaCpp enables:
- True On-Device AI: Run large language models entirely on your phone or tablet, with no internet connection required and complete privacy.
- Hardware-Accelerated Inference: Automatic detection and use of CPU features (such as i8mm and dotprod) to speed up inference on ARM and x86.
- Mobile-First Design: Context management and batch interruption designed for the constraints of mobile processors.
- Multimodal Capabilities: Analyze images locally using multimodal projectors (`mmproj`).
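To make the CPU feature detection above more concrete, here is a minimal sketch of how such flags can be read on arm64 Linux, where the kernel lists them on the `Features` line of `/proc/cpuinfo` (dotprod appears there as `asimddp`). The library performs its detection in native code; `ArmFeatures` and `parseArmFeatures` are hypothetical names used only for illustration.

```kotlin
/** Illustrative only: the two ARM flags named in this README. */
data class ArmFeatures(val dotprod: Boolean, val i8mm: Boolean)

/** Parses the text of /proc/cpuinfo and reports which of the flags are present. */
fun parseArmFeatures(cpuInfo: String): ArmFeatures {
    val flags = cpuInfo.lines()
        .filter { it.startsWith("Features") }
        .flatMap { it.substringAfter(':').trim().split(Regex("\\s+")) }
        .toSet()
    // The kernel reports the dot-product extension as "asimddp".
    return ArmFeatures(dotprod = "asimddp" in flags, i8mm = "i8mm" in flags)
}
```

On a device, the input would typically be the contents of `/proc/cpuinfo`; taking the text as a parameter keeps the function easy to test off-device.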
- Multi-Architecture Support: Built for `arm64-v8a` and `x86_64` with automatic CPU feature detection.
- Multimodal Inference: Support for LLaVA and other vision models with per-prompt image injection.
- Efficient Mobile Inference: Context Shift support from kobold.cpp enables longer conversations without memory overflow.
- Kotlin-First Design: Helper class to handle initialization and context management seamlessly.
- Seamless Android Integration: Uses `ContentResolver` and File Descriptors to work naturally with Android 11+ scoped storage.
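The Context Shift feature listed above can be pictured as token bookkeeping: once the context window fills up, some of the oldest tokens are dropped while a pinned prefix (for example, a system prompt) is kept. The sketch below assumes a simple policy of discarding half of the non-pinned tokens; the real implementation lives in the native layer and also adjusts the KV cache, which is not shown. `shiftContext` and its parameters are hypothetical.

```kotlin
/**
 * Illustrative only: when [tokens] has filled the window, drops the oldest half
 * of the tokens that follow the first [keep] pinned tokens.
 */
fun shiftContext(tokens: List<Int>, contextLength: Int, keep: Int): List<Int> {
    require(keep in 0 until contextLength) { "keep must be smaller than contextLength" }
    if (tokens.size < contextLength) return tokens // still room, nothing to do
    val discard = (tokens.size - keep) / 2
    // Keep the pinned prefix, skip the discarded span, keep the most recent tokens.
    return tokens.take(keep) + tokens.drop(keep + discard)
}
```

With a window of 8 tokens and 2 pinned tokens, a full context keeps its first 2 tokens and its 3 most recent ones, leaving room for generation to continue.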
You can find a complete, ready-to-build demo application in the /app directory.
The demo showcases model loading, multimodal image selection, handling inference in a ViewModel, and displaying generated text in a Jetpack Compose UI.
Add the following to your project's build.gradle:

```groovy
dependencies {
    implementation 'io.github.ljcamargo:llamacpp-kotlin:0.4.0'
}
```

```kotlin
// Initialize and load
llamaHelper.load(
    path = modelUriString, // e.g. from a file picker
    contextLength = 2048,
) {
    // Model loaded!
}

// Generate
llamaHelper.predict("Why is the sky blue?")
```

To use multimodal features, you need a base GGUF model and its corresponding mmproj file.
```kotlin
// Load with multimodal projector
llamaHelper.load(
    path = baseModelUri,
    contextLength = 4096,
    mmprojPath = mmprojUri // Pass the projector file here
) {
    // Multimodal context ready!
}

// Generate with an image
llamaHelper.predict(
    prompt = "Describe this image in detail:",
    imagePath = selectedImageUri // Pass the image URI here
)
```

The native C++ core is synchronized with the latest llama.cpp developments. For instructions on how to update or rebuild the native components, see the llamaCpp README.
Contributions are welcome! Please feel free to submit a Pull Request.
MIT
This project builds upon the work of:
- llama.cpp by Georgi Gerganov
- cui-llama.rn by Vali-98
- llama.rn