
# TrueLarge-RT

TrueLarge-RT: Break the Memory Wall on Android.

License: MIT · Platform: Android · C++ / Kotlin

Features · Installation · Quick Start · Architecture · Citation

> [!IMPORTANT]
> **Performance Milestone:** TrueLarge-RT runs Llama-3.3-70B-Instruct-Q2_XS on a Realme 2 Pro (a 4GB device with only ~1400MB of available RAM) at 0.01 TPS, demonstrating that the engine scales down to legacy hardware bottlenecked by UFS 2.1 storage speed.

TrueLarge-RT is a high-performance native inference engine that enables 32B+ parameter LLMs to run on consumer Android devices (4GB-8GB RAM) without crashing.

Traditional runtimes (ONNX Runtime, TFLite, etc.) require the full model to fit in RAM. TrueLarge-RT uses a proprietary Deep-Pipelined Layer-by-Layer (LBL) engine to stream weights from storage, so model size is limited only by your disk capacity, not by RAM.

## The Breakthrough

By implementing a Smart Hybrid Engine, TrueLarge-RT automatically selects the optimal execution strategy for your hardware:

### 1. Turbo Mode (RAM-Lock)

Condition: Free RAM > Model Size + 1GB

This mode pins the entire model into physical RAM using mlock(), preventing any OS swapping. It delivers zero-latency access to weights, making it perfect for smaller models (e.g., Llama-3-8B on a 12GB device) where maximum token speed is the priority.
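
Conceptually, the RAM-lock path is just mmap followed by mlock. A minimal sketch, assuming a hypothetical `pinModel` helper rather than the engine's actual API:

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map the whole model file read-only and pin it into physical RAM.
// Returns nullptr on failure (e.g. RLIMIT_MEMLOCK too low for mlock).
void* pinModel(const char* path, size_t* out_size) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
    void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping keeps the file referenced
    if (base == MAP_FAILED) return nullptr;
    // mlock faults every page in and forbids the kernel from evicting it
    if (mlock(base, st.st_size) != 0) {
        munmap(base, st.st_size);
        return nullptr;
    }
    *out_size = (size_t)st.st_size;
    return base;
}
```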


### 2. Balanced Mode (mmap)

Condition: Free RAM > 75% of Model Size

Leverages the operating system's virtual memory pager (mmap). The OS intelligently keeps frequently used layers in RAM while paging out unused ones. This is the most efficient mode for models that mostly fit in memory but need a little breathing room.


### 3. Layer-by-Layer (LBL) Mode

Condition: Free RAM < 75% of Model Size

The core innovation of TrueLarge-RT. It allocates a small, fixed compute buffer (approx. 800MB) and streams model weights layer-by-layer from disk during inference.

Benchmark: Successfully runs Llama-3.3-70B-Instruct-Q2_XS on a Realme 2 Pro (4GB RAM).
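
The selection heuristic itself reduces to two comparisons. A minimal sketch of the decision (names and thresholds mirror the conditions above, not the engine's actual symbols):

```cpp
enum class ExecMode { Turbo, Balanced, LBL };

// Pick an execution strategy from free RAM vs. model size,
// mirroring the three conditions described above.
ExecMode selectMode(long free_ram, long model_size) {
    const long GiB = 1024L * 1024 * 1024;
    if (free_ram > model_size + GiB)
        return ExecMode::Turbo;     // mlock() the whole model
    if (free_ram > model_size * 3 / 4)
        return ExecMode::Balanced;  // plain mmap(), let the OS pager decide
    return ExecMode::LBL;           // stream weights layer-by-layer
}
```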


## Key Features

  • No Re-Quantization Required: Run full-precision (FP16) or pre-quantized high-quality (Q4_K_M) GGUF models as-is.
  • Native Performance: Built on llama.cpp with custom ARM NEON/DotProd kernels.
  • Resilient Model Manager: Integrated background downloader for massive model files.
  • Professional Telemetry: Real-time graphs for TTFT (ms), RAM (MB), and CPU (GHz).
  • Universal Support: Works with Llama 3, Qwen 2.5, Mistral, Phi-3, and more.
  • Mixture-of-Experts (MoE): Support is actively being researched.

## Installation

TrueLarge-RT is distributed as a standard Android Studio project.

```bash
# 1. Clone the repository
git clone https://github.com/nareshis21/Truelarge-RT.git

# 2. Open in Android Studio
# File -> Open -> Select 'Truelarge-RT' folder

# 3. Sync Gradle & Build
```

## Building from Source

You can build the APK using Android Studio or via command line:

### Option 1: Command Line
```bash
# Debug APK
./gradlew assembleDebug

# Release APK (requires signing config)
./gradlew assembleRelease

```

The output APK will be in `app/build/outputs/apk/debug/`.

### Option 2: Android Studio

  1. Open the project in Android Studio.
  2. Go to Build > Build Bundle(s) / APK(s) > Build APK(s).

## Quick Start

### 1. Download a Model
Use the built-in `ModelDownloadManager` or push via adb:
```bash
adb push Qwen2.5-32B-Instruct-Q4_K_M.gguf /sdcard/Download/TrueLarge/models/
```

### 2. Initialize Engine (Kotlin)

```kotlin
val engine = NativeEngine()

// Initialize (auto-detects LBL vs RAM mode)
val success = engine.init("/path/to/model.gguf", nThreads = 4, gpuLayers = 0)

if (success) {
    // Start a session
    engine.createSession("Explain quantum physics", keepHistory = true)

    // Generate tokens
    while (true) {
        val tokenBytes = engine.step() ?: break
        print(String(tokenBytes))
    }
}
```

## Technology Deep Dive: Layer-by-Layer (LBL)

TrueLarge-RT shatters the "Memory Wall" by strictly separating Compute Memory from Model Storage.

### The Problem: The "RAM Wall"

In conventional inference (e.g., standard llama.cpp or TFLite), the engine attempts to load the entire model's weights into RAM. For a 32B model, even at 4-bit quantization, this is ~20GB. On a typical smartphone with 8GB RAM, this triggers the Low Memory Killer (LMK) instantly, making large-model inference physically impossible.

### The Solution: The TrueLarge LBL Engine

TrueLarge-RT treats the storage (UFS 4.0/3.1) as an extension of the memory hierarchy. Instead of RAM residency, we prioritize Streaming Throughput.

Why LBL is Better for Edge:

  1. Unlimited Model Size: Since RAM is only used for the current layer calculation, you can run a 70B model on a 4GB device.
  2. Kernel-Level Efficiency: We use mmap to map the file once, then use madvise(MADV_DONTNEED) to purge a layer's "hot" pages from RAM immediately after its pass (see the sketch after this list). This prevents the OS from thrashing.
  3. Deep-Pipelined Overlap: Conventional LBL is slow because the CPU waits for the Disk. Our engine uses an Eager Prefetch Queue that peeks up to 3 layers ahead. On UFS 4.0, this hides 100% of the loading latency by saturating the storage pipeline while the CPU/NPU is compute-bound.
  4. Greedy RAM Window: On high-end devices, the engine uses a "Greedy" strategy—reducing the OS safety buffer to just 500MB and caching up to 80 layers simultaneously to minimize disk wear and maximize speed.
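
A minimal sketch of the purge step from point 2, assuming the layer's byte offset and length inside the mapping are known (the helper name is illustrative):

```cpp
#include <sys/mman.h>
#include <unistd.h>

// Drop a finished layer's pages from physical RAM.
// base: start of the mmap'd model file; off/len: the layer's byte range.
void evictLayer(char* base, size_t off, size_t len) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    // madvise requires a page-aligned address: round the start down
    // and extend the length by the same amount.
    size_t aligned = off & ~(page - 1);
    madvise(base + aligned, len + (off - aligned), MADV_DONTNEED);
}
```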

Our Custom Optimizations (Generation 2 & 3):

  • Eager Prefetch Queue: Replaced the single-layer prefetcher with a thread-safe std::deque, so the disk I/O thread never starves, even during complex GQA/Attention computation (a minimal sketch of such a queue follows this list).
  • Inter-Token Pipelining: We eliminate the sampling gap by proactively prefetching Layers 0-8 of the next token immediately while the current one is still being sampled.
  • Multi-Layer Eviction Protection: A sophisticated tracker ensures the prefetcher never evicts a layer that is either currently active OR sitting in the eager queue.
  • Micro-Pipelining: We optimize the "Ping-Pong" buffers so that compute and I/O are perfectly interleaved.
  • I/O Hinting: Using MADV_SEQUENTIAL and MADV_WILLNEED to trigger hardware-level read-ahead for 70B+ parameters.
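
A minimal sketch of such a thread-safe queue, carrying a plain `int` layer id for brevity (the real queue presumably tracks buffers and offsets as well):

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <optional>

// FIFO handoff between the disk I/O thread (producer)
// and the compute thread (consumer).
class PrefetchQueue {
public:
    void push(int layer_id) {
        { std::lock_guard<std::mutex> lk(mu_); q_.push_back(layer_id); }
        cv_.notify_one();
    }
    // Blocks until a layer id is available; std::nullopt once closed and drained.
    std::optional<int> pop() {
        std::unique_lock<std::mutex> lk(mu_);
        cv_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;
        int id = q_.front();
        q_.pop_front();
        return id;
    }
    void close() {
        { std::lock_guard<std::mutex> lk(mu_); closed_ = true; }
        cv_.notify_all();
    }
private:
    std::deque<int> q_;
    std::mutex mu_;
    std::condition_variable cv_;
    bool closed_ = false;
};
```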

*Figure 1: The High-Speed Data Bus acting as a virtual memory bridge*

As the inference progresses, the engine identifies which tensors are needed for the next token generation and maps only those specific pages into the NPU's address space.

### 2. Execution Flow (The "Rolling Buffer")

To maintain high throughput, TrueLarge-RT implements a "Rolling Buffer" strategy. It does not just demand-page; it proactively manages the flow of tensors.

*Figure 2: The Layer-by-Layer Lifecycle*

  1. `prefetch(N+1)`: While the NPU is busy computing Layer N, the DMA engine is already pulling Layer N+1 from disk into a standby buffer.
  2. `compute(N)`: The CPU/GPU executes matrix multiplication on the currently loaded weights.
  3. `discard(N-1)`: Once a layer is done, its memory pages are immediately marked as `MADV_DONTNEED`, telling the kernel to reclaim that physical RAM instantly.
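
Put together, the lifecycle is a short pipelined loop. A sketch with hypothetical `prefetchAsync`, `waitLoaded`, `compute`, and `evictLayer` helpers standing in for the engine's internals:

```cpp
// Hypothetical engine internals (declarations only, for illustration).
void prefetchAsync(int layer);  // enqueue a disk read for `layer`
void waitLoaded(int layer);     // block until `layer` is resident
void compute(int layer);        // run the layer's matmuls
void evictLayer(int layer);     // MADV_DONTNEED the layer's pages

// One token pass over n_layers, keeping I/O one layer ahead of compute.
void runTokenPass(int n_layers) {
    prefetchAsync(0);                 // warm up: start loading layer 0
    for (int n = 0; n < n_layers; ++n) {
        if (n + 1 < n_layers)
            prefetchAsync(n + 1);     // 1. pull layer N+1 into the standby buffer
        waitLoaded(n);                //    block only if I/O fell behind compute
        compute(n);                   // 2. matmul on the loaded weights
        if (n > 0)
            evictLayer(n - 1);        // 3. discard the finished layer
    }
    evictLayer(n_layers - 1);         // release the last layer too
}
```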

## Core Innovations

### 1. Zero-Copy mmap Loading

We bypass standard read() syscalls and CPU copy loops. By using mmap with MAP_PRIVATE and page-aligned offsets, the kernel maps the layer data directly from UFS storage into the CPU's (or Neural Processing Unit's) address space.

  • Code Reference: LayerLoader::loadLayerMap
  • Benefit: Zero CPU overhead for data movement.
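
The idea behind `LayerLoader::loadLayerMap` can be sketched as follows (illustrative, not the actual implementation; mmap only accepts page-aligned file offsets, hence the rounding):

```cpp
#include <sys/mman.h>
#include <unistd.h>

// Map one layer's byte range out of the model file with no read() copy.
// Returns the mapping base; *out_adjust is added to reach the tensor data.
void* mapLayer(int fd, size_t off, size_t len, size_t* out_adjust) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t aligned = off & ~(page - 1);  // round offset down to a page boundary
    *out_adjust = off - aligned;
    void* p = mmap(nullptr, len + *out_adjust, PROT_READ,
                   MAP_PRIVATE, fd, (off_t)aligned);
    return (p == MAP_FAILED) ? nullptr : p;
}
```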

### 2. Dynamic RAM Budgeting

On initialization, the engine probes /proc/meminfo to calculate a precise "Safe Working Budget":

```cpp
// Simplified logic from TrueLargeRuntime.cpp
constexpr long MiB = 1024L * 1024;
constexpr long GiB = 1024L * MiB;

long safety_buffer = (avail_ram > 8 * GiB) ? 400 * MiB : 500 * MiB;
long safe_budget   = avail_ram - safety_buffer - kv_cache_size;
// Cap at 80 for hybrid residence
int max_layers = (int)std::min(safe_budget / single_layer_size, 80L);
```

This ensures the app is never OOM-killed (Out of Memory), even when running 32B models on legacy 4GB devices.
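
A minimal sketch of that probe, parsing the standard `MemAvailable` field (the function name is illustrative):

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Returns MemAvailable from /proc/meminfo in bytes, or -1 on failure.
long readAvailableRam() {
    std::ifstream f("/proc/meminfo");
    std::string line;
    while (std::getline(f, line)) {
        if (line.rfind("MemAvailable:", 0) == 0) {  // line starts with the key
            std::istringstream ss(line.substr(13));
            long kb = -1;
            ss >> kb;               // the value is reported in kB
            return kb < 0 ? -1 : kb * 1024L;
        }
    }
    return -1;
}
```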

3. "Ping-Pong" Context Swapping

To hide latency, TrueLarge-RT maintains two lightweight compute contexts (ctx_ping and ctx_pong). While one context validates the computation graph for the current layer, the next layer's weights can be pre-fetched into the alternate buffer, ensuring the GPU/CPU is never starved of work.
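
A structural sketch of the double-buffer bookkeeping (types simplified; the real contexts wrap llama.cpp graph state):

```cpp
#include <cstdint>
#include <vector>

using Buffer = std::vector<uint8_t>;  // stand-in for a pinned weight buffer

// Two fixed buffers alternate roles every layer: one side is computed on
// while the prefetch thread fills the other from disk.
struct PingPongBuffers {
    Buffer buf[2];
    int active = 0;                         // side the compute path reads
    Buffer& computeSide() { return buf[active]; }
    Buffer& loadSide()    { return buf[active ^ 1]; }
    void swap()           { active ^= 1; }  // flip once the pending load completes
};
```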

## Performance Benchmark

| Model | Precision | Device | Chipset | Storage | Device RAM | Used RAM (MB) | Speed (TPS) |
|---|---|---|---|---|---|---|---|
| Llama-3.3-70B | Q2_XS | Realme GT Neo 3T | SD870 | UFS 3.1 | 6GB | ~1500 | 0.040 |
| Llama-3.3-70B | Q2_XS | Poco M4 Pro 5G | Dimensity 810 | UFS 2.2 | 6GB | ~3000 | 0.023 |
| Llama-3.3-70B | Q2_XS | Realme 2 Pro | SD660 | UFS 2.1 | 4GB | ~1400 | 0.01 |
| Llama-3.1-8B | Q4_K_M | Poco M4 Pro 5G | Dimensity 810 | UFS 2.2 | 6GB | ~3000 | 0.1 |
| Llama-3.1-8B | Q4_K_M | Realme 2 Pro | SD660 | UFS 2.1 | 4GB | ~1400 | 0.045 |
| Qwen-2.5-0.5B | Q4_K_M | Poco M4 Pro 5G | Dimensity 810 | UFS 2.2 | 6GB | ~3000 | 13.0 |
| Qwen-2.5-0.5B | Q4_K_M | Realme 2 Pro | SD660 | UFS 2.1 | 4GB | ~1400 | 15.0 |

## Acknowledgements

  • llama.cpp: The core tensor library.
  • AirLLM: Inspiration for divide-and-conquer inference.

## License

MIT License. See LICENSE for details.

## Citation

```bibtex
@software{TrueLargeRT2026,
  author = {Lahajal, Naresh Kumar},
  title  = {TrueLarge-RT: Layer-by-Layer Inference Engine for Android},
  year   = {2026},
  url    = {https://github.com/nareshis21/Truelarge-RT}
}
```
