This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
quantum-llama.cpp is a modified fork of llama.cpp that integrates Quantum Random Number Generators (QRNGs) into token generation. The core idea: "the output is co-authored by the universe itself."
Key difference from upstream: Uses true quantum randomness instead of pseudo-random number generation for sampling.
```bash
# Standard build (MUST use -DLLAMA_CURL=OFF for quantum features)
cmake -B build -DLLAMA_CURL=OFF
cmake --build build --config Release -j

# With CUDA
cmake -B build -DLLAMA_CURL=OFF -DGGML_CUDA=ON
cmake --build build --config Release -j

# Debug build
cmake -B build -DCMAKE_BUILD_TYPE=Debug -DLLAMA_CURL=OFF
cmake --build build
```

Built binaries are placed in `build/bin/`.
```bash
# Run test suite
ctest --test-dir build --output-on-failure -j

# Server tests (requires Python venv)
cd tools/server/tests
source ../../../.venv/bin/activate
./tests.sh
```

Always format before committing:

```bash
git clang-format
```

Key style rules:
- 4-space indentation, 120-column limit
- Pointer/reference style: `void * ptr`, `int & a`
- `snake_case` for functions, variables, and types
- Enum values: `ENUM_NAME_VALUE` (uppercase with prefix)
- `src/` - Main llama library (`llama-*.cpp/h` modules)
- `include/llama.h` - Public C API
- `ggml/` - Vendored tensor library
- `tools/` - Executables (`llama-cli`, `llama-server`, etc.)
- `common/` - Shared utilities
- `libpsirngclient/` - Git submodule for gRPC QRNG client
- `psirngclient-manager.cpp/h` - Singleton managing QRNG connections
- `anu-qrng-client.cpp/h` - HTTP client for ANU QRNG API (hex16 mode-based sampling)
- `llama-sampling.cpp` - Sampling pipeline (integration point for quantum RNG)
```
Token Logits
     ↓
Calculate Entropy (normalized 0-1)
     ↓
[entropy < 0.50?] ─YES─→ GREEDY (no QRNG) → Done
     │
     NO
     ↓
Apply EDT Temperature: T = T₀ × 0.8^(θ/entropy)
     ↓
ANU API call (hex16, length=1024, size=10)
     ↓
Find mode of ~20K uint8 values
     ↓
Use mode/256 for inverse CDF sampling
     ↓
Done
```
Key principle: Each token selection makes a fresh API call. No buffering - this preserves temporal correlation between consciousness and token selection.
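The mode-extraction and inverse-CDF steps of the pipeline can be sketched roughly as follows. This is an illustrative reconstruction, not the repository's actual code: `quantum_bytes` stands in for the decoded hex16 payload from the ANU API, and both function names are invented for this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Find the most frequent byte value in the quantum payload.
// With ~20K samples spread over 256 buckets, the mode collapses
// the payload into a single representative value in [0, 255].
static uint8_t byte_mode(const std::vector<uint8_t> & quantum_bytes) {
    size_t counts[256] = {0};
    for (uint8_t b : quantum_bytes) {
        counts[b]++;
    }
    uint8_t mode = 0;
    for (int v = 1; v < 256; v++) {
        if (counts[v] > counts[mode]) {
            mode = (uint8_t) v;
        }
    }
    return mode;
}

// Treat mode/256 ∈ [0, 1) as the uniform draw for inverse-CDF sampling:
// walk the cumulative token probabilities until they exceed it.
static size_t inverse_cdf_sample(const std::vector<float> & probs, uint8_t mode) {
    float u   = mode / 256.0f;
    float cum = 0.0f;
    for (size_t i = 0; i < probs.size(); i++) {
        cum += probs[i];
        if (u < cum) {
            return i;
        }
    }
    return probs.size() - 1;  // guard against floating-point rounding
}
```

In this sketch the single uniform draw per token comes entirely from one fresh API payload, matching the no-buffering principle above.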
- entropy < 0.50 → Greedy sampling (no API call, saves bandwidth)
- entropy ≥ 0.50 → EDT temperature + QRNG sampling
- Typically saves 50-80% of API calls for predictable text
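The entropy gate described above might look like the following sketch. The normalization (dividing Shannon entropy by log of the vocabulary size) and the function names are assumptions; the real integration point is `llama-sampling.cpp`.

```cpp
#include <cmath>
#include <vector>

// Shannon entropy of the token distribution, normalized to [0, 1]
// by dividing by log(vocab_size), so the threshold is scale-free.
// Assumes probs sums to 1 and has at least two entries.
static float normalized_entropy(const std::vector<float> & probs) {
    float h = 0.0f;
    for (float p : probs) {
        if (p > 0.0f) {
            h -= p * std::log(p);
        }
    }
    return h / std::log((float) probs.size());
}

// Below the threshold the distribution is peaked enough that the
// argmax is taken directly, skipping the QRNG API call entirely.
static bool use_greedy(const std::vector<float> & probs, float threshold = 0.50f) {
    return normalized_entropy(probs) < threshold;
}
```

A uniform distribution has normalized entropy 1.0 and always goes through the QRNG path; a sharply peaked one falls below the threshold and is sampled greedily with no API call.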
- Formula: T = T₀ × 0.8^(θ/entropy)
- Defaults: T₀ = 2.0, θ = 1.0
- Higher entropy → higher temperature (more creative exploration)
- Lower entropy → lower temperature (more focused selection)
- At max entropy (1.0): T ≈ 1.6
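As a direct transcription of the formula (the function name is illustrative; defaults match the values above):

```cpp
#include <cmath>

// EDT: T = T0 * 0.8^(theta / entropy), entropy in (0, 1].
// As entropy rises, theta/entropy shrinks, so T climbs toward T0 * 0.8;
// as entropy falls toward the threshold, T drops sharply.
static float edt_temperature(float entropy, float t0 = 2.0f, float theta = 1.0f) {
    return t0 * std::pow(0.8f, theta / entropy);
}
```

With the defaults, entropy 1.0 gives T = 2.0 × 0.8 = 1.6, matching the "max entropy" figure above, and entropy 0.5 gives T = 2.0 × 0.8² = 1.28.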
1. Get your FREE API key at: https://quantumnumbers.anu.edu.au/
2. Set the environment variable:

```
# Linux/Mac
export ANU_API_KEY="your-api-key-here"

# Windows CMD
set ANU_API_KEY=your-api-key-here

# PowerShell
$env:ANU_API_KEY="your-api-key-here"
```

3. Run:
```bash
./build/bin/llama-cli -m model.gguf -p "prompt" -n 128 -no-cnv
```

| Argument | Description | Default |
|---|---|---|
| `--quantum-verbose` | Show entropy/temperature for each token | off |
| `--quantum-statistics` | Print sampling statistics at end | off |
| `--quantum-entropy-threshold N` | Entropy cutoff for greedy vs QRNG | 0.40 |
| `--quantum-edt-t0 N` | EDT upper bound temperature | 2.0 |
| `--quantum-edt-theta N` | EDT entropy sensitivity | 1.0 |
| `--no-quantum-adaptive-sampling` | Always use QRNG (no greedy) | - |
| `--no-quantum-edt` | Use fixed temperature instead of EDT | - |
```bash
export PSIRNG_HOST=192.0.2.10
export PSIRNG_GRPC_PORT=50051
export PSIRNG_CERT_PATH=/path/to/cert.pem
./build/bin/llama-cli -m model.gguf -p "prompt"
```

- NEVER buffer quantum random data - each token selection MUST use fresh quantum data from a new API call. Buffering destroys the temporal correlation between consciousness and token selection.
- Never use `-DLLAMA_CURL=ON` - incompatible with quantum features
- Clone with `--recurse-submodules` to get libpsirngclient
- Avoid adding third-party dependencies
- Use basic C++ patterns, avoid fancy STL constructs
- Vertical alignment for readability
- Tensor storage is row-major (dim 0=columns, 1=rows, 2=matrices)
- Matrix multiplication: `C = ggml_mul_mat(ctx, A, B)` means C^T = AB^T
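The C^T = AB^T convention can be checked with a plain row-major sketch, without ggml itself: store A as M rows of K contiguous elements and B as N rows of K, and the result has N rows of M, each entry a dot product of one row of A with one row of B. The helper below is purely illustrative, not a ggml call.

```cpp
#include <cstddef>
#include <vector>

// Row-major mul_mat semantics: A is (M x K), B is (N x K), with the
// K dimension contiguous (dim 0 = columns). The result C is (N x M):
// C[n][m] = dot(row m of A, row n of B), i.e. C^T = A * B^T.
static std::vector<std::vector<float>> mul_mat_rowmajor(
        const std::vector<std::vector<float>> & A,
        const std::vector<std::vector<float>> & B) {
    const size_t M = A.size();
    const size_t N = B.size();
    const size_t K = A[0].size();
    std::vector<std::vector<float>> C(N, std::vector<float>(M, 0.0f));
    for (size_t n = 0; n < N; n++) {
        for (size_t m = 0; m < M; m++) {
            for (size_t k = 0; k < K; k++) {
                C[n][m] += A[m][k] * B[n][k];
            }
        }
    }
    return C;
}
```

Multiplying by identity rows makes the transposition visible: the output holds A transposed, which is exactly the C^T = AB^T convention.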
- Pattern: `<class>_<method>` where method is `<action>_<noun>`
- Examples: `llama_model_init()`, `llama_sampler_chain_remove()`
- Optimize for longest common prefix: `number_small`, `number_big` (not `small_number`)
```bash
# Benchmark
./build/bin/llama-bench -m model.gguf

# Evaluate perplexity
./build/bin/llama-perplexity -m model.gguf -f dataset.txt

# Test backend ops
./build/bin/test-backend-ops
```