Comprehensive guide for building LLM applications with liblloyal's composable primitives.
- Installation & Setup
- Quick Start
- Core Patterns
- Advanced Features
- Cache Management Strategies
- Development & Testing
- Best Practices
- Troubleshooting
liblloyal is a header-only C++ library providing composable primitives for llama.cpp inference.
# Add liblloyal to your project
git submodule add https://github.com/lloyal-ai/liblloyal.git
git submodule update --init --recursive
When you include llama.cpp and liblloyal via add_subdirectory(), liblloyal automatically sets up the required include paths. No manual configuration needed.
cmake_minimum_required(VERSION 3.18)
project(my_app)
set(CMAKE_CXX_STANDARD 20)
# 1. Add llama.cpp first (creates llama, ggml targets)
add_subdirectory(vendor/llama.cpp)
# 2. Add liblloyal (auto-configures include paths for llama/llama.h)
add_subdirectory(vendor/liblloyal)
# 3. Create your target and link
add_executable(my_app main.cpp)
target_link_libraries(my_app PRIVATE liblloyal::liblloyal)
What happens automatically:
- liblloyal detects the llama target and links to it
- Generates wrapper headers llama/llama.h and llama/ggml.h in the build directory (cross-platform compatible)
- Exports include paths via liblloyal::liblloyal target
Your code can use:
#include <lloyal/tokenizer.hpp> // liblloyal headers
#include <llama/llama.h> // llama.cpp headers (auto-resolved)
When running liblloyal tests with a custom llama.cpp location:
cmake -B build \
-DLLOYAL_BUILD_INTEGRATION_TESTS=ON \
-DLLAMA_CPP_DIR=/path/to/your/llama.cpp
For consumers not using add_subdirectory() (e.g., pre-built llama.cpp), you'll need to set up include paths manually:
# Include headers
target_include_directories(your_target PRIVATE
liblloyal/include
llama.cpp/include # For llama.h
llama.cpp/ggml/include # For ggml.h
)
# Link llama.cpp
target_link_libraries(your_target PRIVATE llama)
Note: liblloyal headers use #include <llama/llama.h> (with llama/ prefix). If your llama.cpp has flat includes, you may need to create a wrapper directory structure.
s.header_dir = "lloyal"
s.source_files = "liblloyal/include/**/*.{hpp,h}"
Minimal example demonstrating complete inference pipeline.
From: liblloyal/tests/integration/multi_sequence_integration_test.cpp:46-94
#include <lloyal/model_registry.hpp>
#include <lloyal/tokenizer.hpp>
#include <lloyal/decode.hpp>
#include <lloyal/sampler.hpp>
#include <llama/llama.h>
#include <iostream> // std::cout
#include <string>   // std::string
int main() {
// Initialize backend
llama_backend_init();
// Load model (shared across contexts)
auto model_params = llama_model_default_params();
model_params.n_gpu_layers = -1; // Full GPU offload
auto model = lloyal::ModelRegistry::acquire("model.gguf", model_params);
// Create inference context
auto ctx_params = llama_context_default_params();
ctx_params.n_ctx = 2048;
ctx_params.n_batch = 512;
llama_context* ctx = llama_init_from_model(model.get(), ctx_params);
// Get vocabulary
auto vocab = llama_model_get_vocab(model.get());
// Tokenize input
std::string prompt = "Explain quantum computing:";
auto tokens = lloyal::tokenizer::tokenize(vocab, prompt, false, false);
// Decode through model
lloyal::decoder::decode_tokens(ctx, tokens, 0, ctx_params.n_batch);
// Sample next token
llama_token next = lloyal::sampler::greedy(ctx, vocab);
// Convert to text
std::string output = lloyal::tokenizer::detokenize(vocab, next);
std::cout << output;
// Cleanup
llama_free(ctx);
llama_backend_free();
}
Key points:
- ModelRegistry::acquire() enables model weight sharing (ref-counted)
- decode_tokens() processes prompt through model
- greedy() selects highest probability token
- Each context has independent KV cache (isolated sessions)
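The ref-counted sharing mentioned in the key points can be illustrated with a small standalone sketch. This is not liblloyal's implementation: `Weights`, `WeightCache`, and the loader callback are hypothetical names, and the cache is keyed by path only (the guide says the real registry keys on path and params). It only shows the general weak_ptr pattern in which a resource is loaded once and freed when the last holder releases it.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for an expensive resource such as model weights.
struct Weights {
    std::string path;
};

// Minimal ref-counted cache: the map stores weak_ptr entries, so a resource
// stays alive only while at least one caller still holds its shared_ptr.
class WeightCache {
public:
    using Loader = std::function<std::shared_ptr<Weights>(const std::string&)>;

    explicit WeightCache(Loader loader) : loader_(std::move(loader)) {
        assert(loader_ && "a loader callback is required");
    }

    std::shared_ptr<Weights> acquire(const std::string& key) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = entries_.find(key);
        if (it != entries_.end()) {
            if (auto alive = it->second.lock()) {
                return alive;  // Still loaded: share the existing instance.
            }
        }
        auto fresh = loader_(key);  // Never loaded, or already released.
        entries_[key] = fresh;
        return fresh;
    }

private:
    Loader loader_;
    std::mutex mutex_;
    std::unordered_map<std::string, std::weak_ptr<Weights>> entries_;
};
```

Two calls to `acquire("model.gguf")` on the same cache return the same pointer and invoke the loader once; after both holders are reset, the next call loads again.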
From: include/lloyal/tokenizer.hpp
#include <lloyal/tokenizer.hpp>
// Tokenize text to token IDs
auto vocab = llama_model_get_vocab(model);
auto tokens = lloyal::tokenizer::tokenize(vocab, "Hello world", false, false);
// tokens = [1, 15043, 3186]
// Detokenize single token
std::string text = lloyal::tokenizer::detokenize(vocab, tokens[0]);
// Detokenize batch
std::string full_text = lloyal::tokenizer::detokenize(vocab, tokens);
// Check for end-of-generation
bool is_done = lloyal::tokenizer::is_eog(vocab, token);
// Vocab size
int n_vocab = lloyal::tokenizer::vocab_size(vocab);
Special tokens:
// Add BOS (beginning-of-sequence) token
bool add_bos = true;
auto tokens = lloyal::tokenizer::tokenize(vocab, prompt, add_bos, false);
// Parse special tokens (e.g., "<|im_start|>")
bool parse_special = true;
auto tokens = lloyal::tokenizer::tokenize(vocab, prompt, false, parse_special);
From: include/lloyal/decode.hpp
#include <lloyal/decode.hpp>
// Decode token batch (initial prompt)
std::vector<llama_token> tokens = {1, 100, 200, 300};
lloyal::decoder::decode_tokens(ctx, tokens, 0, n_batch);
// Decode single token (generation loop)
int n_past = static_cast<int>(tokens.size());
llama_token next_token = sample_next();
lloyal::decoder::decode_tokens(ctx, {next_token}, n_past, n_batch);
n_past++;
Multi-sequence decoding:
// Enable multi-sequence in context params
ctx_params.n_seq_max = 4; // Support 4 parallel sequences
// Decode to different sequences
lloyal::decoder::decode_tokens(ctx, tokens, 0, n_batch, /*seq_id=*/0);
lloyal::decoder::decode_tokens(ctx, tokens, 0, n_batch, /*seq_id=*/1);
// Each sequence maintains independent KV state
From: include/lloyal/sampler.hpp
#include <lloyal/sampler.hpp>
// Greedy sampling (argmax)
llama_token token = lloyal::sampler::greedy(ctx, vocab);
// Temperature sampling
auto params = llama_sampler_chain_default_params();
// ... configure params ...
llama_token token = lloyal::sampler::sample_with_params(ctx, vocab, params);
// Common parameter patterns
params.temp = 0.7f; // Temperature (0.0 = greedy, 1.0 = neutral)
params.top_k = 40; // Top-K filtering
params.top_p = 0.95f; // Nucleus (top-p) sampling
params.min_p = 0.05f; // Min-p filtering
params.typical_p = 1.0f; // Typical sampling
params.penalty_repeat = 1.1f; // Repetition penalty
From: include/lloyal/chat_template.hpp
#include <lloyal/chat_template.hpp>
#include <nlohmann/json.hpp>
// Build conversation
nlohmann::json messages = nlohmann::json::array();
messages.push_back({
{"role", "user"},
{"content", "What is the capital of France?"}
});
// Format with model's built-in template
auto result = lloyal::chat_template::format(
model,
messages.dump(),
"" // Empty string = use model's template
);
// result.prompt contains formatted text
// Example: "<|im_start|>user\nWhat is the capital of France?<|im_end|>\n<|im_start|>assistant\n"
// Tokenize formatted prompt
auto tokens = lloyal::tokenizer::tokenize(vocab, result.prompt, true, true);
Multi-turn conversation:
// Add assistant response
messages.push_back({
{"role", "assistant"},
{"content", "The capital of France is Paris."}
});
// Add follow-up question
messages.push_back({
{"role", "user"},
{"content", "What about Italy?"}
});
// Reformat entire conversation
auto result = lloyal::chat_template::format(model, messages.dump(), "");
From: include/lloyal/metrics.hpp
liblloyal provides dual-level uncertainty metrics for test-time alignment, adaptive sampling, and quality monitoring.
- Model metrics - Raw logits (before filters) → model's inherent belief
- Sampling metrics - Post-filter logits (after top-k/p/temp) → actual sampled distribution
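As a rough illustration of what an entropy or surprisal metric computes from a logit vector, here is a standalone sketch using only the standard library. The function names are hypothetical and the units are assumed to be nats; liblloyal's own functions may differ in units and numerical details. The same math applies to either level: pass raw logits for a model-level value or post-filter logits for a sampling-level value.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// log(sum_i exp(logits[i])), computed stably by subtracting the maximum.
inline double log_sum_exp(const std::vector<float>& logits) {
    assert(!logits.empty());
    const double max_logit = *std::max_element(logits.begin(), logits.end());
    double sum = 0.0;
    for (float l : logits) sum += std::exp(l - max_logit);
    return max_logit + std::log(sum);
}

// Shannon entropy in nats: H = -sum_i p_i * log(p_i), with p = softmax(logits).
inline double entropy_nats(const std::vector<float>& logits) {
    const double lse = log_sum_exp(logits);
    double h = 0.0;
    for (float l : logits) {
        const double log_p = l - lse;
        const double p = std::exp(log_p);
        if (p > 0.0) h -= p * log_p;  // Skip masked (-inf) or underflowed entries.
    }
    return h;
}

// Surprisal in nats of a single token: -log p(token).
inline double surprisal_nats(const std::vector<float>& logits, std::size_t token) {
    assert(token < logits.size());
    return log_sum_exp(logits) - logits[token];
}
```

For a uniform distribution over four tokens, both the entropy and the surprisal of any token equal log(4); for a sharply peaked distribution both approach zero for the dominant token.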
#include <lloyal/metrics.hpp>
#include <lloyal/logits.hpp>
// Get raw logits from model
float* logits = lloyal::logits::get(ctx);
int n_vocab = lloyal::tokenizer::vocab_size(vocab);
// Compute model entropy (uncertainty of next token)
float h = lloyal::metrics::model_entropy(logits, n_vocab);
// Use for routing decisions
if (h > 5.0f) {
// High entropy → trigger retrieval or context expansion
}
// Compute surprisal for sampled token
llama_token token = lloyal::sampler::greedy(ctx, vocab);
float s = lloyal::metrics::model_surprisal(logits, n_vocab, token);
if (s > 5.0f) {
// High surprisal → model is uncertain about this token
}
From: include/lloyal/metrics.hpp:327-361
// Create perplexity tracker
auto ppl_handle = lloyal::metrics::create_perplexity();
// Generation loop
for (int i = 0; i < max_tokens; i++) {
float* logits = lloyal::logits::get(ctx);
llama_token token = sample_next();
// Compute and track surprisal
float s = lloyal::metrics::model_surprisal(logits, n_vocab, token);
lloyal::metrics::add_surprisal(ppl_handle, s);
// Decode next token
lloyal::decoder::decode_tokens(ctx, {token}, n_past++, n_batch);
}
// Get perplexity (exp of average surprisal)
float ppl = lloyal::metrics::get_ppl(ppl_handle);
int count = lloyal::metrics::get_count(ppl_handle);
std::cout << "Perplexity: " << ppl << " over " << count << " tokens\n";
if (ppl > 50.0f) {
// High perplexity → consider retrieval or cache eviction
}
// Free tracker
lloyal::metrics::free_perplexity(ppl_handle);
- KV eviction gates: High entropy → trigger retrieval before cache pruning
- Adaptive sampling: Collapsed distribution → widen search parameters
- Quality monitoring: Track perplexity for confidence estimates
- Branch comparison: Compare perplexity across alternative continuations
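The relationship "perplexity = exp of average surprisal" used by the tracker above can be sketched in a few lines. `PerplexityTracker` is a hypothetical name, not the liblloyal handle type, and surprisal is assumed to be measured in nats. Because the sketch is a plain value type, copying it plays the role of cloning a tracker when a branch forks.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>

// Running perplexity: exp of the mean surprisal (in nats) seen so far.
class PerplexityTracker {
public:
    void add(double surprisal_nats) {
        assert(surprisal_nats >= 0.0);
        total_ += surprisal_nats;
        ++count_;
    }

    std::size_t count() const { return count_; }

    // Returns +infinity until at least one surprisal has been added.
    double perplexity() const {
        if (count_ == 0) return std::numeric_limits<double>::infinity();
        return std::exp(total_ / static_cast<double>(count_));
    }

private:
    double total_ = 0.0;
    std::size_t count_ = 0;
};
```

Adding surprisals log(2) and log(8) gives a mean of log(4), so the perplexity is 4. A copy taken at that point keeps the shared history and then diverges as each branch adds its own tokens.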
From: liblloyal/tests/integration/embedding_integration_test.cpp:243-290
Extract semantic embeddings for similarity search, semantic caching, or retrieval augmented generation.
#include <lloyal/embedding.hpp>
// Check if model supports embeddings
if (lloyal::embedding::has_embeddings(model)) {
int32_t dim = lloyal::embedding::dimension(model);
std::cout << "Embedding dimension: " << dim << "\n";
}
// Create dedicated context for embeddings
auto ctx_params = llama_context_default_params();
ctx_params.n_ctx = 512;
ctx_params.n_batch = 512;
ctx_params.embeddings = true; // Enable embeddings
ctx_params.pooling_type = LLAMA_POOLING_TYPE_MEAN; // Mean pooling
llama_context* embed_ctx = llama_init_from_model(model.get(), ctx_params);
From: examples/embed/embed.mjs:63-77
// Tokenize text
std::string query = "What is machine learning?";
auto tokens = lloyal::tokenizer::tokenize(vocab, query, true, true);
// Clear KV cache (each text needs fresh context)
lloyal::kv::clear_all(embed_ctx);
// Encode for embeddings (marks all tokens with logits=true)
lloyal::embedding::encode(embed_ctx, tokens, n_batch);
// Extract L2-normalized embedding (unit length for cosine similarity)
auto embedding = lloyal::embedding::get(embed_ctx, lloyal::embedding::Normalize::L2);
// embedding.size() == dimension
From: liblloyal/tests/integration/embedding_integration_test.cpp:292-345
// Embed multiple texts
auto emb1 = get_embedding(embed_ctx, "The cat sat on the mat");
auto emb2 = get_embedding(embed_ctx, "A cat rested on the rug");
auto emb3 = get_embedding(embed_ctx, "Stock prices rose sharply");
// Compute similarity (for L2-normalized vectors, this is dot product)
float sim_similar = lloyal::embedding::cosine_similarity(emb1, emb2);
float sim_different = lloyal::embedding::cosine_similarity(emb1, emb3);
std::cout << "Similar sentences: " << sim_similar << "\n"; // ~0.8
std::cout << "Different sentences: " << sim_different << "\n"; // ~0.3
- Semantic search: Find similar documents/passages by embedding similarity
- Semantic caching: Cache responses by embedding distance thresholds
- RAG pipelines: Embed queries and documents for retrieval
- Clustering: Group similar texts by embedding proximity
Note: For meaningful semantic embeddings, use dedicated embedding models like nomic-embed-text or bge-small-en. Standard LLMs work but aren't optimized for this task.
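The earlier remark that cosine similarity reduces to a dot product for L2-normalized vectors can be checked with a standalone sketch. The helper names below are hypothetical and independent of liblloyal's embedding API; they only illustrate the arithmetic.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

inline float dot(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
    return sum;
}

// Scales a vector to unit length; a zero vector is returned unchanged.
inline std::vector<float> l2_normalize(std::vector<float> v) {
    const float norm = std::sqrt(dot(v, v));
    if (norm > 0.0f) {
        for (float& x : v) x /= norm;
    }
    return v;
}

// Cosine similarity for arbitrary (not necessarily normalized) vectors.
inline float cosine(const std::vector<float>& a, const std::vector<float>& b) {
    const float denom = std::sqrt(dot(a, a)) * std::sqrt(dot(b, b));
    return denom > 0.0f ? dot(a, b) / denom : 0.0f;
}
```

For `{3, 4}` and `{4, 3}` the cosine is 24/25 = 0.96, and the plain dot product of the two normalized vectors gives the same value, which is why normalizing once at extraction time makes later comparisons cheaper.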
From: liblloyal/tests/integration/multi_sequence_integration_test.cpp:46-94
Enable parallel hypothesis exploration, speculative decoding, or A/B testing within a single context (shared model weights).
// Configure context for multiple sequences
auto ctx_params = llama_context_default_params();
ctx_params.n_ctx = 512;
ctx_params.n_batch = 128;
ctx_params.n_seq_max = 4; // Support 4 parallel sequences
llama_context* ctx = llama_init_from_model(model.get(), ctx_params);
std::string prompt = "Once upon a time";
auto tokens = lloyal::tokenizer::tokenize(vocab, prompt, false, false);
// Decode to sequence 0
lloyal::decoder::decode_tokens(ctx, tokens, 0, n_batch, /*seq_id=*/0);
// Decode to sequence 1 (independent KV state)
lloyal::decoder::decode_tokens(ctx, tokens, 0, n_batch, /*seq_id=*/1);
// Check positions independently
llama_pos pos0 = lloyal::kv::pos_max(ctx, 0);
llama_pos pos1 = lloyal::kv::pos_max(ctx, 1);
// Both sequences have same number of tokens, but independent KV state
From: include/lloyal/kv.hpp:114-125
// Fork sequence 0 to create sequence 1
lloyal::kv::seq_cp(ctx, /*src=*/0, /*dst=*/1);
// Now seq 1 has same KV state as seq 0
// Generate different continuations
llama_token token_a = sample_with_temperature(ctx, vocab, 0.7f, /*seq=*/0);
llama_token token_b = sample_with_temperature(ctx, vocab, 1.2f, /*seq=*/1);
// Continue each branch independently
lloyal::decoder::decode_tokens(ctx, {token_a}, n_past, n_batch, 0);
lloyal::decoder::decode_tokens(ctx, {token_b}, n_past, n_batch, 1);
From: include/lloyal/kv.hpp:138-148
// After comparing multiple branches, keep only the best
int best_seq = compare_branches(); // Your selection logic
// Remove all sequences except best_seq
lloyal::kv::seq_keep(ctx, best_seq);
// Now only best_seq remains, continue generation
From: include/lloyal/kv.hpp:54-75
// Remove specific sequence without affecting others
lloyal::kv::remove_range(ctx, /*seq=*/1, /*p0=*/-1, /*p1=*/-1);
// Verify it's gone
llama_pos pos = lloyal::kv::pos_max(ctx, 1);
// pos == -1 (empty)
- Parallel hypothesis exploration: Fork prompt, explore multiple continuations
- Speculative decoding: Draft with small model on seq=0, verify with large model on seq=1
- A/B testing: Compare different sampling strategies on identical context
- Beam search: Maintain top-k sequences, prune low-probability branches
liblloyal provides persistent handle-based APIs for efficient reuse of complex objects across generation loops.
From: include/lloyal/sampler.hpp
#include <lloyal/sampler.hpp>
// Create reusable sampler chain (configure once, use many times)
auto params = llama_sampler_chain_default_params();
params.temp = 0.7f;
params.top_k = 40;
params.top_p = 0.95f;
auto chain = lloyal::sampler::create_chain(model, params);
// Reuse chain across generation loop (no repeated initialization)
for (int i = 0; i < max_tokens; i++) {
// Apply filters (top-k, top-p, temperature)
lloyal::sampler::apply(chain, ctx, vocab);
// Sample token
llama_token token = lloyal::sampler::sample(chain, ctx);
// Decode and continue
lloyal::decoder::decode_tokens(ctx, {token}, n_past++, n_batch);
}
// Free when done
lloyal::sampler::free(chain);
Why handles?
- Efficiency: Avoid repeated sampler initialization (expensive)
- State management: Grammar samplers maintain internal state across tokens
- Reusability: Same chain for entire generation, or clone for branches
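The chain above is configured with temperature and top-k among its filters. As a rough standalone illustration of what those two steps do to a distribution, the sketch below keeps the k highest logits, divides them by the temperature, and renormalizes. `filtered_probs` is a hypothetical helper and does not reproduce the sampler implementation in liblloyal or llama.cpp; top-p, min-p, and penalties are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Returns the sampling distribution after top-k filtering and temperature
// scaling. Tokens outside the top k receive probability 0.
inline std::vector<double> filtered_probs(const std::vector<float>& logits,
                                          double temperature,
                                          std::size_t top_k) {
    assert(!logits.empty() && temperature > 0.0 && top_k > 0);
    const std::size_t k = std::min(top_k, logits.size());

    // Order token indices so the k highest logits come first.
    std::vector<std::size_t> order(logits.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::partial_sort(order.begin(),
                      order.begin() + static_cast<std::ptrdiff_t>(k),
                      order.end(),
                      [&logits](std::size_t a, std::size_t b) {
                          return logits[a] > logits[b];
                      });

    // Softmax over the survivors, with logits divided by the temperature.
    const double max_logit = logits[order[0]];
    std::vector<double> probs(logits.size(), 0.0);
    double sum = 0.0;
    for (std::size_t i = 0; i < k; ++i) {
        const std::size_t token = order[i];
        probs[token] = std::exp((logits[token] - max_logit) / temperature);
        sum += probs[token];
    }
    for (double& p : probs) p /= sum;
    return probs;
}
```

Lowering the temperature concentrates probability on the highest logit, while raising it flattens the surviving distribution; top-k only decides which tokens survive at all.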
From: include/lloyal/grammar.hpp
#include <lloyal/grammar.hpp>
// Convert JSON schema to GBNF grammar
nlohmann::json schema = {
{"type", "object"},
{"properties", {
{"name", {{"type", "string"}}},
{"age", {{"type", "integer"}}}
}},
{"required", {"name", "age"}}
};
std::string gbnf = lloyal::grammar::from_json_schema(schema.dump());
// Create grammar sampler handle (maintains parse state)
auto grammar_handle = lloyal::grammar::init_sampler(model, gbnf);
// Use throughout generation (grammar state tracks valid tokens)
for (int i = 0; i < max_tokens; i++) {
llama_token token = lloyal::grammar::sample(grammar_handle, ctx, vocab);
lloyal::decoder::decode_tokens(ctx, {token}, n_past++, n_batch);
}
// Result is guaranteed valid JSON
lloyal::grammar::free(grammar_handle);
From: include/lloyal/metrics.hpp:404-412
// Track perplexity on main branch
auto ppl_main = lloyal::metrics::create_perplexity();
// ... add tokens to ppl_main ...
// Fork branch and clone metrics (preserves history)
lloyal::kv::seq_cp(ctx, 0, 1);
auto ppl_alt = lloyal::metrics::clone_perplexity(ppl_main);
// Now both branches track perplexity independently
// Compare results
float ppl_1 = lloyal::metrics::get_ppl(ppl_main);
float ppl_2 = lloyal::metrics::get_ppl(ppl_alt);
// Free both
lloyal::metrics::free_perplexity(ppl_main);
lloyal::metrics::free_perplexity(ppl_alt);
From: include/lloyal/kv.hpp
#include <lloyal/kv.hpp>
// Clear entire cache (start new conversation)
lloyal::kv::clear_all(ctx);
// Check cache position
llama_pos pos = lloyal::kv::pos_max(ctx, 0);
// pos == -1 means empty, otherwise returns number of tokens - 1
// Remove range [p0, p1) from cache
lloyal::kv::remove_range(ctx, /*seq=*/0, /*p0=*/100, /*p1=*/200);
// Removes tokens at positions 100-199
From: liblloyal/tests/integration/kv_file_persistence_test.cpp:36-92
Save and restore conversation state across app restarts, fork decision points, or share context.
// Populate KV cache
std::vector<llama_token> conversation = {1, 100, 200, 300};
lloyal::decoder::decode_tokens(ctx, conversation, 0, n_batch);
// Save to file (includes KV state + tokens)
const std::string filepath = "session.llama";
size_t bytes = lloyal::kv::write_file(ctx, 0, filepath, conversation);
if (bytes > 0) {
std::cout << "Saved " << bytes << " bytes\n";
}
// Clear cache first
lloyal::kv::clear_all(ctx);
// Load state from file
auto data = lloyal::kv::read_file(ctx, 0, filepath);
// data.tokens contains the tokens
// data.bytes_read contains file size
// KV cache is automatically restored
// Verify restoration
llama_pos max_pos = lloyal::kv::pos_max(ctx, 0);
assert(max_pos == static_cast<llama_pos>(data.tokens.size() - 1));
// Continue generation from restored state
llama_token next = lloyal::sampler::greedy(ctx, vocab);
- Exit and Resume: Save before app termination, restore on next launch
- Conversation Forking: Save at decision points, load to explore alternatives
- Context Sharing: Upload session file to cloud, share across devices
Example: Forking Conversations
// Save state at decision point
lloyal::kv::write_file(ctx, 0, "fork_point.llama", tokens);
// Explore path A
generate_response("Option A prompt");
lloyal::kv::write_file(ctx, 0, "path_a.llama", tokens);
// Backtrack and explore path B
lloyal::kv::clear_all(ctx);
auto data = lloyal::kv::read_file(ctx, 0, "fork_point.llama");
generate_response("Option B prompt");
lloyal::kv::write_file(ctx, 0, "path_b.llama", tokens);
From: liblloyal/tests/integration/clear_and_reseed_test.cpp:172-260
One strategy for managing context limits: preserve anchor tokens (attention sinks) + recent tail, evict middle tokens via cache reconstruction.
Problem: Context window fills during long conversations (n_past → n_ctx).
Solution: Reconstruct cache with:
- Anchor tokens (original first N tokens, typically 4)
- Recent tail (last M tokens, typically 252)
- Evict middle (everything between anchors and tail)
This maintains contiguous positions [0, 1, 2, ..., anchor_size + tail_size - 1] instead of unbounded gaps.
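The token bookkeeping behind this strategy can be sketched without any inference calls. `anchors_plus_tail` is a hypothetical helper, `Token` stands in for `llama_token`, and the sketch assumes the anchors are the first tokens of the history. It only builds the list of tokens that survive compression; re-decoding them into the cache is what `clear_and_reseed` is described as doing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Token = std::int32_t;  // Stand-in for llama_token in this sketch.

// Builds the token list that remains after compression: the fixed anchors
// followed by the last `tail_size` tokens of the history. Everything in
// between is dropped. Index i of the result is the new position i.
inline std::vector<Token> anchors_plus_tail(const std::vector<Token>& anchors,
                                            const std::vector<Token>& history,
                                            std::size_t tail_size) {
    // Nothing to evict if anchors and tail already cover the whole history.
    if (history.size() <= anchors.size() + tail_size) return history;

    std::vector<Token> kept(anchors);
    kept.insert(kept.end(),
                history.end() - static_cast<std::ptrdiff_t>(tail_size),
                history.end());
    assert(kept.size() == anchors.size() + tail_size);
    return kept;
}
```

With anchors `{10, 11}`, history `{10, 11, 12, 13, 14, 15, 16}`, and a tail of 3, the result is `{10, 11, 14, 15, 16}`, occupying positions 0 through 4 with no gaps.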
From: include/lloyal/kv.hpp:544-604
// CRITICAL: Capture anchor tokens ONCE at conversation start
std::vector<llama_token> ORIGINAL_ANCHORS;
void start_conversation(const std::string& initial_prompt) {
auto tokens = lloyal::tokenizer::tokenize(vocab, initial_prompt, false, false);
// Capture first 4 tokens as anchors (NEVER change these)
const int ANCHOR_COUNT = 4;
ORIGINAL_ANCHORS.assign(tokens.begin(), tokens.begin() + ANCHOR_COUNT);
// Decode initial prompt
lloyal::decoder::decode_tokens(ctx, tokens, 0, n_batch);
n_past = static_cast<int>(tokens.size());
}
void compress_if_needed() {
llama_pos current_pos = lloyal::kv::pos_max(ctx, 0);
const int COMPRESSION_THRESHOLD = n_ctx - 10;
if (current_pos >= COMPRESSION_THRESHOLD) {
// Prepare tail (recent 252 tokens)
const int TAIL_SIZE = 252;
size_t tail_start = all_tokens.size() - TAIL_SIZE;
std::vector<llama_token> tail(
all_tokens.begin() + tail_start,
all_tokens.end()
);
// Reconstruct with ORIGINAL anchors (not rolling "first 4")
lloyal::kv::clear_and_reseed(ctx, ORIGINAL_ANCHORS, tail, n_batch);
// Update position counter
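// ANCHOR_COUNT is local to start_conversation(), so derive the count from the stored anchors
n_past = static_cast<int>(ORIGINAL_ANCHORS.size()) + TAIL_SIZE;
}
}
MUST: Use ORIGINAL_ANCHORS captured at conversation start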
MUST NOT: Use rolling "first 4" tokens on each compression
Incorrect (will degrade quality):
// ❌ WRONG: Reusing different anchors each time
auto sinks = std::vector<llama_token>(tokens.begin(), tokens.begin() + 4);
lloyal::kv::clear_and_reseed(ctx, sinks, tail, n_batch);
Correct:
// ✅ RIGHT: Same ORIGINAL_ANCHORS every time
lloyal::kv::clear_and_reseed(ctx, ORIGINAL_ANCHORS, tail, n_batch);
Theory: Xiao et al. (2023) "Efficient Streaming Language Models with Attention Sinks" demonstrated that transformer attention develops stable "sinks" at initial positions. Maintaining these sinks + recent context preserves perplexity while enabling unbounded generation.
Empirical: <10% perplexity increase with 4 anchors + 252 tail (paper: 3.7%).
See: liblloyal/tests/integration/clear_and_reseed_test.cpp for validation tests.
Use when:
- Long conversations beyond context limit
- Bounded memory is critical
- Initial prompt establishes important context
Don't use when:
- Context never fills (most single-turn tasks)
- You need full conversation history (use larger model or RAG)
- Initial tokens aren't representative (quality degrades)
Alternatives:
- Increase n_ctx (if memory allows)
- Summarization + re-prompt (higher quality, slower)
- Sliding window (simpler, loses early context)
- Retrieval augmented generation (best quality, most complex)
From: Current guide.md:384-427 (validated API)
Share model weights across independent user sessions for memory efficiency.
#include <lloyal/model_registry.hpp>
class InferenceService {
private:
std::shared_ptr<llama_model> model_;
std::unordered_map<std::string, llama_context*> contexts_;
public:
InferenceService(const std::string& model_path) {
// Single model load (4GB)
model_ = lloyal::ModelRegistry::acquire(
model_path,
llama_model_default_params()
);
}
~InferenceService() {
for (auto& [user_id, ctx] : contexts_) {
llama_free(ctx);
}
}
bool create_session(const std::string& user_id) {
auto ctx_params = llama_context_default_params();
ctx_params.n_ctx = 2048;
// Shares model_ weights, independent KV cache
llama_context* ctx = llama_init_from_model(model_.get(), ctx_params);
if (!ctx) return false;
contexts_[user_id] = ctx;
return true;
}
std::string infer(const std::string& user_id, const std::string& prompt) {
auto it = contexts_.find(user_id);
if (it == contexts_.end()) return "";
// Per-user isolated inference
lloyal::kv::clear_all(it->second);
auto tokens = lloyal::tokenizer::tokenize(
llama_model_get_vocab(model_.get()),
prompt,
false,
false
);
lloyal::decoder::decode_tokens(it->second, tokens, 0, 512);
llama_token next = lloyal::sampler::greedy(
it->second,
llama_model_get_vocab(model_.get())
);
return lloyal::tokenizer::detokenize(
llama_model_get_vocab(model_.get()),
next
);
}
};
Memory efficiency: 1 model (~4GB) + N KV caches (~200MB each) instead of N full models.
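To make the saving concrete, the arithmetic can be written out as a small sketch. `estimate_memory_mb` is a hypothetical helper, and the sizes are the approximate figures quoted above rather than measurements.

```cpp
#include <cassert>
#include <cstdint>

struct MemoryEstimate {
    std::uint64_t shared_mb;      // One model + one KV cache per session.
    std::uint64_t duplicated_mb;  // One model and one KV cache per session.
};

// Compares total memory when weights are shared against loading the model
// separately for every session. All sizes are in megabytes.
inline MemoryEstimate estimate_memory_mb(std::uint64_t sessions,
                                         std::uint64_t model_mb,
                                         std::uint64_t kv_mb_per_session) {
    assert(sessions > 0);
    MemoryEstimate e{};
    e.shared_mb = model_mb + sessions * kv_mb_per_session;
    e.duplicated_mb = sessions * (model_mb + kv_mb_per_session);
    return e;
}
```

For 10 sessions with a 4096 MB model and 200 MB of KV cache each, sharing needs about 6096 MB, compared with about 42960 MB if every session loaded its own copy of the weights.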
liblloyal has comprehensive test coverage with both stub-based unit tests and integration tests against real llama.cpp.
Stub-based tests validate API contracts without requiring real models (fast, no external dependencies).
cd liblloyal/tests
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
./build/TestRunner --success
What they test:
- 84+ test cases covering all primitives
- API contracts (null safety, error handling)
- Edge cases (empty inputs, boundary conditions)
- No real models needed (uses stubs)
Integration tests use real llama.cpp to validate correctness with actual models.
Setup llama.cpp:
# Reads version from .llama-cpp-version
.github/scripts/setup-llama-cpp.sh
# Build llama.cpp (uses build-llama.sh script)
LLAMA_DIR=llama.cpp .github/scripts/build-llama.sh
Build and run:
cd tests
cmake -B build_integration \
-DLLOYAL_BUILD_INTEGRATION_TESTS=ON \
-DLLAMA_CPP_DIR=../llama.cpp \
-DCMAKE_BUILD_TYPE=Release
cmake --build build_integration
# Run with test model (any GGUF model works)
LLAMA_TEST_MODEL=path/to/model.gguf ./build_integration/IntegrationRunner
# Some tests need dedicated embedding model
LLAMA_EMBED_MODEL=path/to/nomic-embed.gguf ./build_integration/IntegrationRunner
What they test:
- Multi-sequence operations (seq_cp, seq_keep)
- KV file persistence (write_file/read_file)
- Embeddings (encode, extract, cosine similarity)
- Context compression (clear_and_reseed position contiguity)
- Real model inference workflows
CI Configuration:
- Tests run on GitHub Actions
- llama.cpp version pinned in .llama-cpp-version
- Build cached (keyed by llama.cpp version)
- Matrix: macOS (arm64), Linux (x64), sanitizers (ASan, UBSan, LeakSan)
liblloyal pins llama.cpp version for reproducible builds. To update:
Edit .llama-cpp-version:
# Current content (example)
b8087
# Update to new version
echo "b7000" > .llama-cpp-version
Test locally:
# Setup will read new version
.github/scripts/setup-llama-cpp.sh
# Build llama.cpp
LLAMA_DIR=llama.cpp .github/scripts/build-llama.sh
# Run integration tests
cd tests
cmake -B build_integration \
-DLLOYAL_BUILD_INTEGRATION_TESTS=ON \
-DLLAMA_CPP_DIR=../llama.cpp
cmake --build build_integration
LLAMA_TEST_MODEL=path/to/model.gguf ./build_integration/IntegrationRunner
Commit:
git add .llama-cpp-version
git commit -m "chore: update llama.cpp to b7000"
git push
CI automatically:
- Reads .llama-cpp-version
- Clones llama.cpp at that commit
- Builds with .github/scripts/build-llama.sh
- Caches build (keyed by version)
- Runs integration tests
- Fails PR if tests break
See: .github/workflows/tests.yml for full CI configuration.
Efficient patterns:
// ✅ Share models via ModelRegistry (ref-counted)
auto model = lloyal::ModelRegistry::acquire("model.gguf", params);
// Model shared across all contexts using same path+params
// ✅ Destroy idle contexts
llama_free(ctx);
ctx = nullptr;
// ✅ Use clear_and_reseed for unbounded conversations
if (n_past > n_ctx - 10) {
lloyal::kv::clear_and_reseed(ctx, anchors, tail, n_batch);
}
Inefficient patterns:
// ❌ Loading same model multiple times (wastes GB)
auto model1 = std::shared_ptr<llama_model>(
llama_load_model_from_file("model.gguf", params),
llama_free_model
);
auto model2 = std::shared_ptr<llama_model>(
llama_load_model_from_file("model.gguf", params), // Loads again!
llama_free_model
);
// ❌ Keeping unused contexts alive (leaks KV memory)
// Don't keep contexts in map if user disconnected
// ❌ Unbounded cache growth
// Without compression, n_past → n_ctx → crash
Key parameters:
// Mobile-optimized (iPhone, iPad)
ctx_params.n_ctx = 1024; // Smaller context = less memory
ctx_params.n_batch = 128; // Smaller batch = less decode memory
ctx_params.n_threads = 2; // Match efficiency cores
model_params.n_gpu_layers = -1; // Full Metal offload (set on the model params, not the context params)
// Server-optimized (Linux, high-end GPU)
ctx_params.n_ctx = 4096; // Larger context for long conversations
ctx_params.n_batch = 512; // Larger batch = faster prompt processing
ctx_params.n_threads = 8; // Match physical cores
model_params.n_gpu_layers = -1; // Full GPU offload (set on the model params, not the context params)
Parameter effects:
- n_ctx: Larger = longer context, more KV memory (~200MB per 2048 ctx)
- n_batch: Larger = faster prompt processing, more decode memory
- n_threads: Match physical cores (diminishing returns beyond)
- n_gpu_layers: -1 for full offload (fastest), 0 for CPU-only
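The KV memory figure above depends on the model's architecture. A common back-of-the-envelope estimate for an unquantized cache is two tensors (K and V) per layer, each holding n_ctx × n_head_kv × head_dim elements. The sketch below encodes that estimate; `kv_cache_bytes` is a hypothetical helper, the example dimensions are assumptions about a typical mid-sized model, and the real allocation in llama.cpp may differ (padding, cache type, extra buffers).

```cpp
#include <cassert>
#include <cstdint>

// Rough KV cache size: K and V tensors for every layer and position.
inline std::uint64_t kv_cache_bytes(std::uint64_t n_layer,
                                    std::uint64_t n_ctx,
                                    std::uint64_t n_head_kv,
                                    std::uint64_t head_dim,
                                    std::uint64_t bytes_per_element) {
    assert(n_layer > 0 && n_ctx > 0 && n_head_kv > 0 && head_dim > 0);
    assert(bytes_per_element > 0);
    const std::uint64_t per_token_per_layer = n_head_kv * head_dim * bytes_per_element;
    return 2 * n_layer * n_ctx * per_token_per_layer;  // 2 = K and V.
}
```

With 32 layers, 8 KV heads of dimension 128, 16-bit elements, and n_ctx = 2048, the estimate is 268,435,456 bytes (256 MiB), which is the same order of magnitude as the figure quoted above. Doubling n_ctx doubles the estimate.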
Benchmarking:
auto start = std::chrono::high_resolution_clock::now();
// Your inference code here
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
std::cout << "Tokens/sec: " << (token_count * 1000.0 / duration.count()) << "\n";
Validate inputs and check return values:
#include <iostream> // std::cerr
#include <optional>
#include <string>
std::optional<std::string> safe_generate(
llama_context* ctx,
const llama_model* model,
const std::string& prompt
) {
try {
// Validate inputs
if (!ctx || !model || prompt.empty()) {
return std::nullopt;
}
auto vocab = llama_model_get_vocab(model);
auto tokens = lloyal::tokenizer::tokenize(vocab, prompt, false, false);
if (tokens.empty()) {
return std::nullopt;
}
// Check context capacity
if (tokens.size() > static_cast<size_t>(llama_n_ctx(ctx))) {
return std::nullopt; // Prompt too long
}
// Decode and sample
lloyal::decoder::decode_tokens(ctx, tokens, 0, 512);
llama_token next = lloyal::sampler::greedy(ctx, vocab);
return lloyal::tokenizer::detokenize(vocab, next);
} catch (const std::exception& e) {
std::cerr << "Generation error: " << e.what() << "\n";
return std::nullopt;
}
}
Defensive programming:
// Check model loaded
if (!model) {
std::cerr << "Failed to load model\n";
return 1;
}
// Check context created
llama_context* ctx = llama_init_from_model(model.get(), ctx_params);
if (!ctx) {
std::cerr << "Failed to create context (out of memory?)\n";
return 1;
}
// Check tokenization succeeded
auto tokens = lloyal::tokenizer::tokenize(vocab, prompt, false, false);
if (tokens.empty()) {
std::cerr << "Tokenization failed\n";
return 1;
}
Symptoms: ModelRegistry::acquire() returns nullptr
Solutions:
- Verify file path is absolute and accessible
- Confirm GGUF format (not GGML, PyTorch, safetensors, etc.)
- Check available disk space (models are memory-mapped)
- Try default params (disable GPU offload initially)
auto model = lloyal::ModelRegistry::acquire("model.gguf", llama_model_default_params());
if (!model) {
// Debug: Try absolute path
model = lloyal::ModelRegistry::acquire(
"/full/path/to/model.gguf",
llama_model_default_params()
);
}
if (!model) {
std::cerr << "Check: file exists, is GGUF format, is readable\n";
}
Symptoms: llama_init_from_model() returns nullptr
Solutions:
- Reduce n_ctx (context window)
- Reduce n_batch (batch size)
- Use smaller quantized model (Q4_K_M instead of F16)
- Reduce n_gpu_layers if GPU memory constrained
auto ctx_params = llama_context_default_params();
ctx_params.n_ctx = 512; // Reduce from 2048
ctx_params.n_batch = 128; // Reduce from 512
llama_context* ctx = llama_init_from_model(model.get(), ctx_params);
if (!ctx) {
std::cerr << "Still OOM? Try smaller model quantization\n";
}
Symptoms: decode_tokens() throws exception with message about context capacity
Cause: n_past + tokens.size() > n_ctx
Solutions:
- Use clear_and_reseed() for compression (see Cache Management)
- Increase n_ctx when creating context
- Truncate input prompt
- Summarize conversation history before continuing
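Before picking one of these solutions, it helps to test the overflow condition explicitly. The sketch below is a standalone illustration with hypothetical helper names; it uses signed 64-bit arithmetic to avoid mixing signed positions with unsigned sizes, and relies on the convention stated earlier that pos_max returns -1 for an empty cache and otherwise the number of tokens minus one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Converts a "max position" value (-1 when empty) into a token count.
inline std::int64_t tokens_in_cache(std::int64_t pos_max) {
    assert(pos_max >= -1);
    return pos_max + 1;
}

// True if n_new more tokens fit, keeping `reserve` positions free as headroom.
inline bool fits_in_context(std::int64_t n_past,
                            std::size_t n_new,
                            std::int64_t n_ctx,
                            std::int64_t reserve = 0) {
    assert(n_past >= 0 && n_ctx > 0 && reserve >= 0);
    return n_past + static_cast<std::int64_t>(n_new) <= n_ctx - reserve;
}
```

With n_ctx = 2048 and 2000 tokens already cached, 48 new tokens fit exactly and 49 do not; with a headroom of 10 positions, even 40 new tokens are rejected, which is the point at which compression or truncation would be triggered.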
llama_pos current_pos = lloyal::kv::pos_max(ctx, 0);
int n_ctx = llama_n_ctx(ctx);
if (current_pos + new_tokens.size() > n_ctx - 10) {
// Option 1: Compress
lloyal::kv::clear_and_reseed(ctx, anchors, tail, n_batch);
// Option 2 (alternative to Option 1): Clear and start over
// lloyal::kv::clear_all(ctx);
// Option 3 (alternative): Increase n_ctx (requires new context)
// llama_free(ctx);
// ctx_params.n_ctx = 4096;
// ctx = llama_init_from_model(model.get(), ctx_params);
}
Symptoms: Output quality drops after clear_and_reseed()
Likely cause: Changing anchor tokens between compressions
Solution: Verify anchor immutability
void compress() {
static bool first_call = true;
static std::vector<llama_token> EXPECTED_ANCHORS;
if (first_call) {
EXPECTED_ANCHORS = ORIGINAL_ANCHORS;
first_call = false;
} else {
// Verify anchors haven't changed
assert(ORIGINAL_ANCHORS == EXPECTED_ANCHORS);
}
auto tail = std::vector<llama_token>(
all_tokens.end() - 252,
all_tokens.end()
);
lloyal::kv::clear_and_reseed(ctx, ORIGINAL_ANCHORS, tail, n_batch);
}
Symptoms: decode_tokens() throws exception
Common causes:
- Null context or empty token vector
- Position overflow: n_past + tokens.size() > n_ctx
- Batch size exceeds context limit: tokens.size() > n_batch
tokens.size() > n_batch - Invalid sequence ID (multi-sequence)
Debug:
try {
lloyal::decoder::decode_tokens(ctx, tokens, n_past, n_batch);
} catch (const std::exception& e) {
std::cerr << "Decode error: " << e.what() << "\n";
std::cerr << " Context: " << (ctx ? "valid" : "null") << "\n";
std::cerr << " Tokens: " << tokens.size() << "\n";
std::cerr << " Position: " << n_past << "\n";
std::cerr << " n_batch: " << n_batch << "\n";
// Only query the context when it is non-null; llama_n_ctx(nullptr) would crash
if (ctx) {
std::cerr << " n_ctx: " << llama_n_ctx(ctx) << "\n";
if (n_past + tokens.size() > llama_n_ctx(ctx)) {
std::cerr << "Position overflow - trigger compression\n";
}
}
}
Symptoms: embedding::get() throws "embeddings unavailable"
Causes:
- Context created without embeddings = true
- Pooling not enabled (pooling_type = NONE)
- Tokens not encoded with embedding::encode() (need logits=true for all tokens)
Solution:
// Verify context configuration
auto ctx_params = llama_context_default_params();
ctx_params.embeddings = true; // REQUIRED
ctx_params.pooling_type = LLAMA_POOLING_TYPE_MEAN; // REQUIRED
llama_context* ctx = llama_init_from_model(model.get(), ctx_params);
// Verify pooling enabled
if (!lloyal::embedding::has_pooling(ctx)) {
std::cerr << "Pooling not enabled!\n";
}
// Use embedding::encode (not decoder::decode_tokens)
lloyal::kv::clear_all(ctx);
lloyal::embedding::encode(ctx, tokens, n_batch); // Marks all tokens with logits=true
// Now extraction should work
auto emb = lloyal::embedding::get(ctx, lloyal::embedding::Normalize::L2);
Within this repository:
- API headers: include/lloyal/*.hpp - Full API documentation in header comments
- Integration tests: tests/integration/ - Real-world usage examples
- Unit tests: tests/ - API contract validation
External:
- llama.cpp: https://github.com/ggml-org/llama.cpp
- StreamingLLM paper: Xiao et al. (2023) "Efficient Streaming Language Models with Attention Sinks"
Note: This guide documents liblloyal C++ primitives. For React Native bindings, see the parent lloyal.node project.