TrueLarge-RT: Break the Memory Wall on Android.
Features • Installation • Quick Start • Architecture • Citation
Important
Performance Milestone: TrueLarge-RT successfully runs Llama-3.3-70B-Instruct-Q2_XS on the Realme 2 Pro with only ~1400MB available RAM (4GB device) at 0.01 TPS. This demonstrates absolute scalability on legacy hardware constrained by UFS 2.1 storage speeds.
TrueLarge-RT is a high-performance native inference engine that enables 32B+ parameter LLMs to run on consumer Android devices (4GB-8GB RAM) without crashing.
Traditional runtimes (ONNX, TFLite, etc.) require the full model to fit in RAM. TrueLarge-RT uses a proprietary Deep-Pipelined Layer-by-Layer (LBL) engine to stream weights from storage, offering infinite scalability limited only by your disk size.
By implementing a Smart Hybrid Engine, TrueLarge-RT automatically selects the optimal execution strategy for your hardware:
Condition: Free RAM > Model Size + 1GB
This mode pins the entire model into physical RAM using mlock(), preventing any OS swapping. It delivers zero-latency access to weights, making it perfect for smaller models (e.g., Llama-3-8B on a 12GB device) where maximum token speed is the priority.
Condition: Free RAM > 75% of Model Size
Leverages the operating system's virtual memory pager (mmap). The OS intelligently keeps frequently used layers in RAM while paging out unused ones. This is the most efficient mode for models that mostly fit in memory but need a little breathing room.
Condition: Free RAM < 75% of Model Size
The core innovation of TrueLarge-RT. It allocates a small, fixed compute buffer (approx. 800MB) and streams model weights layer-by-layer from disk during inference.
Benchmark: Successfully runs Llama-3.3-70B-Instruct-Q2_XS on a Realme 2 Pro (4GB RAM).
- No Quantization Required: Run full precision (FP16) or high-quality (Q4_K_M) GGUF models.
- Native Performance: Built on
llama.cppwith custom ARM NEON/DotProd kernels. - Resilient Model Manager: Integrated background downloader for massive model files.
- Professional Telemetry: Real-time graphs for TTFT (ms), RAM (MB), and CPU (GHz).
- Universal Support: Works with Llama 3, Qwen 2.5, Mistral, Phi-3, and more.
- Mixture-of-Experts (MoE): Support is actively being researched.
TrueLarge-RT is distributed as a standard Android Studio project.
# 1. Clone the repository
git clone https://github.com/nareshis21/Truelarge-RT.git
# 2. Open in Android Studio
# File -> Open -> Select 'Truelarge-RT' folder
# 3. Sync Gradle & Build
## Building from Source
You can build the APK using Android Studio or via command line:
### Option 1: Command Line
```bash
# Debug APK
./gradlew assembleDebug
# Release APK (requires signing config)
./gradlew assembleReleaseThe output APK will be in app/build/outputs/apk/debug/.
- Open the project in Android Studio.
- Go to Build > Build Bundle(s) / APK(s) > Build APK(s).
## Quick Start
### 1. Download a Model
Use the built-in `ModelDownloadManager` or push via adb:
```bash
adb push Qwen2.5-32B-Instruct-Q4_K_M.gguf /sdcard/Download/TrueLarge/models/
val engine = NativeEngine()
// Initialize (auto-detects LBL vs RAM mode)
val success = engine.init("/path/to/model.gguf", nThreads = 4, gpuLayers = 0)
if (success) {
// Start a session
engine.createSession("Explain quantum physics", keepHistory = true)
// Generate tokens
while (true) {
val tokenBytes = engine.step() ?: break
print(String(tokenBytes))
}
}TrueLarge-RT shatters the "Memory Wall" by strictly separating Compute Memory from Model Storage.
In conventional inference (e.g., standard llama.cpp or TFLite), the engine attempts to load the entire model weights into RAM. For a 32B model, even at 4-bit quantization, this is ~20GB. On a typical smartphone with 8GB RAM, this triggers the Low Memory Killer (LMK) instantly, making large-leaf inference physically impossible.
TrueLarge-RT treats the storage (UFS 4.0/3.1) as an extension of the memory hierarchy. Instead of RAM residency, we prioritize Streaming Throughput.
- Unlimited Model Size: Since RAM is only used for the current layer calculation, you can run a 70B model on a 4GB device.
- Kernel-Level Efficiency: We use
mmapto map the file once, and then usemadvise(MADV_DONTNEED)to purge the "hot" data from RAM immediately after the layer pass. This prevents the OS from thrashing. - Deep-Pipelined Overlap: Conventional LBL is slow because the CPU waits for the Disk. Our engine uses an Eager Prefetch Queue that peeks up to 3 layers ahead. On UFS 4.0, this hides 100% of the loading latency by saturating the storage pipeline while the CPU/NPU is compute-bound.
- Greedy RAM Window: On high-end devices, the engine uses a "Greedy" strategy—reducing the OS safety buffer to just 500MB and caching up to 80 layers simultaneously to minimize disk wear and maximize speed.
- Eager Prefetch Queue: Replaced the single-layer prefetcher with a thread-safe
std::deque. This ensures the disk I/O thread never starves, even during complex GQA/Attention computation. - Inter-Token Pipelining: We eliminate the sampling gap by proactively prefetching Layers 0-8 of the next token immediately while the current one is still being sampled.
- Multi-Layer Eviction Protection: A sophisticated tracker ensures the prefetcher never evicts a layer that is either currently active OR sitting in the eager queue.
- Micro-Pipelining: We optimize the "Ping-Pong" buffers so that compute and I/O are perfectly interleaved.
- I/O Hinting: Using
MADV_SEQUENTIALandMADV_WILLNEEDto trigger hardware-level read-ahead for 70B+ parameters.
As the inference progresses, the engine identifies which tensors are needed for the next token generation and maps only those specific pages into the NPU's address space.
To maintain high throughput, TrueLarge-RT implements a "Rolling Buffer" strategy. It does not just demand-page; it proactively manages the flow of tensors.
- prefetch(N+1): While the NPU is busy computing Layer N, the DMA engine is already pulling Layer N+1 from disk into a standby buffer.
- compute(N): The CPU/GPU executes matrix multiplication on the currently loaded weights.
- discard(N-1): Once a layer is done, its memory pages are immediately marked as
MADV_DONTNEED, telling the kernel to reclaim that physical RAM instantly.
We bypass the standard read() syscalls and CPU copy loops. By using mmap with MAP_PRIVATE and hardware-aligned offsets, the kernel maps the layer data directly from NVMe storage into the Neural Processing Unit (or CPU) address space.
- Code Reference:
LayerLoader::loadLayerMap - Benefit: Zero CPU overhead for data movement.
On initialization, the engine probes /proc/meminfo to calculate a precise "Safe Working Budget":
// simplified logic from TrueLargeRuntime.cpp
long safety_buffer = avail_ram > 8GB ? 400MB : 500MB;
long safe_budget = Available_RAM - safety_buffer - KV_Cache_Size;
int max_layers = safe_budget / Single_Layer_Size; // Cap at 80 for hybrid residenceThis ensures the OS never kills the app for OOM (Out of Memory), even when running 32B models on 4GB legacy devices.
To hide latency, TrueLarge-RT maintains two lightweight compute contexts (ctx_ping and ctx_pong). While one context validates the computation graph for the current layer, the next layer's weights can be pre-fetched into the alternate buffer, ensuring the GPU/CPU is never starved of work.
| Model | Precision | Device | Chipset | Storage | Device RAM | Used RAM (MB) | Speed (TPS) |
|---|---|---|---|---|---|---|---|
| Llama-3.3-70B | Q2_XS | Realme GT Neo 3T | SD870 | UFS 3.1 | 6GB | ~1500 | 0.040 |
| Llama-3.3-70B | Q2_XS | Poco M4 Pro 5G | Dimensity 810 | UFS 2.2 | 6GB | ~3000 | 0.023 |
| Llama-3.3-70B | Q2_XS | Realme 2 Pro | SD660 | UFS 2.1 | 4GB | ~1400 | 0.01 |
| Llama-3.1-8B | Q4_K_M | Poco M4 Pro 5G | Dimensity 810 | UFS 2.2 | 6GB | ~3000 | 0.1 |
| Llama-3.1-8B | Q4_K_M | Realme 2 Pro | SD660 | UFS 2.1 | 4GB | ~1400 | 0.045 |
| Qwen-2.5-0.5B | Q4_K_M | Poco M4 Pro 5G | Dimensity 810 | UFS 2.2 | 6GB | ~3000 | 13.0 |
| Qwen-2.5-0.5B | Q4_K_M | Realme 2 Pro | SD660 | UFS 2.1 | 4GB | ~1400 | 15.0 |
MIT License. See LICENSE for details.
@software{TrueLargeRT2026,
author = {Lahajal, Naresh Kumar},
title = {TrueLarge-RT: Layer-by-Layer Inference Engine for Android},
year = {2026},
url = {https://github.com/nareshis21/Truelarge-RT}
}


