A comprehensive, Vulkan-accelerated Large Language Model (LLM) inference platform built on NCNN, intended as an alternative to tools like llama.cpp, LM Studio, and Ollama.
- Full Tokenization: BPE tokenizer with GGUF metadata extraction, fallback to basic token ID input
- Text Generation: Configurable generation with greedy/top-k/top-p sampling
- KV Caching: Efficient key-value caching for faster inference
- Multi-Model Support: Dynamic layer construction for LLaMA, GPT-2/NeoX/J, Mistral, Qwen, Phi-3, and other architectures
- Multi-Modal Ready: Framework for vision-language models (CLIP integration ready)
- CLI Tool: Command-line interface for chat, completion, and model management
- Tool CLI: Command-line interface for tool execution and management
- GUI Application: Qt-based graphical interface for model loading, chat, and tool management
- REST API: HTTP API for integration with applications (framework provided)
- C++ API: Direct API for embedding in C++ applications
- Vulkan Acceleration: GPU-accelerated inference on Vulkan-compatible devices
- Memory Efficient: Optimized memory usage for large models
- Quantization Support: Handles various quantization formats (Q4_0, Q4_1, Q8_0, etc.)
- Cross-Platform: Linux, Windows, macOS, Android, iOS support
- Plugin Architecture: Extensible tool framework for web access, code execution, file operations
- Tool Calling: Automatic detection and execution of tool calls in LLM responses
- Sandbox Execution: Secure tool execution with resource limits and timeout handling
- Built-in Tools: Calculator, web search, Python code execution, file reading
- Safety Measures: Input validation, path restrictions, memory limits
- Error Handling: Comprehensive error handling and validation
- GPU Detection: Automatic GPU detection and fallback
- Model Validation: Checks model integrity and compatibility
- NCNN library (built with Vulkan support)
- C++11 compatible compiler
- Vulkan-compatible GPU (optional, CPU fallback available)
mkdir build && cd build
cmake .. -DNCNN_VULKAN=ON -DBUILD_GUI=ON
make -j$(nproc)
make llm_platform_demo llm_cli llm_gui
cd examples
./llm_platform_demo
Output:
NCNN LLM Platform Demo
======================
Prompt: hello world
Response: hello <eos> how you ?
Platform Features Demonstrated:
- Tokenization (BPE fallback)
- Text generation with sampling
- Configurable parameters
- Multi-model architecture support (framework)
- Vulkan acceleration ready
./llm_cli --model models/Phi-3-mini-4k-instruct-Q4_K_M.gguf --prompt "Hello, how are you?" --max-tokens 50 --temperature 0.7
./llm_gui
GUI Features:
- Model Management: Load local models or download from URLs
- Interactive Chat: Real-time conversation with the AI
- Tool Integration: Enable/disable tools, monitor execution
- Parameter Controls: Adjust temperature, max tokens, sampling methods
- Progress Monitoring: Loading progress and generation status
- Error Handling: User-friendly error messages and recovery
./tool_cli
./tool_cli calculate expression="2+3*4"
./tool_cli read_file path="README.md" max_lines=10
./tool_cli execute_code code="print('Hello from Python!')"
./tool_cli web_search query="latest AI developments"
The web search tool uses DuckDuckGo's Instant Answer API, requiring no API key:
- Provides instant answers for factual queries
- Returns abstracts and summaries
- Includes related topics when available
- Completely free and accessible
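As a rough illustration of the kind of request such a tool can issue, the sketch below only builds the query URL for DuckDuckGo's public Instant Answer endpoint. The endpoint and parameters come from DuckDuckGo's public documentation, not from this project's web search tool, and the `url_encode` helper is an ad-hoc stand-in.

```cpp
#include <cctype>
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>

// Percent-encode a query string so it can be embedded in a URL.
static std::string url_encode(const std::string& s) {
    std::ostringstream out;
    out << std::uppercase << std::hex << std::setfill('0');
    for (unsigned char c : s) {
        if (std::isalnum(c) || c == '-' || c == '_' || c == '.' || c == '~')
            out << c;
        else
            out << '%' << std::setw(2) << static_cast<int>(c);
    }
    return out.str();
}

int main() {
    const std::string query = "latest AI developments";
    // format=json returns fields such as AbstractText and RelatedTopics,
    // which correspond to the abstracts and related topics listed above.
    const std::string url = "https://api.duckduckgo.com/?q=" + url_encode(query)
                          + "&format=json&no_html=1&skip_disambig=1";
    std::cout << url << "\n";  // the tool would GET this URL and parse the JSON reply
    return 0;
}
```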
The LLM platform automatically detects tool calls in responses and executes them:
// Tool calls are detected in JSON format: {"tool": "name", "args": {...}}
std::string response = llm.generate_text("Calculate 15 + 27");
auto tool_call = ToolCallParser::parse(response);
if (tool_call.valid) {
    ToolExecutor executor;
    ToolResult result = executor.execute_tool(tool_call.tool_name, tool_call.args);
}
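For example, a response containing {"tool": "calculate", "args": {"expression": "15 + 27"}} would parse as a valid tool_call, be dispatched to the built-in calculator tool, and come back as a ToolResult.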
- Tokenizer (src/tokenizer.h/cpp)
  - Loads BPE tokenizer from GGUF metadata
  - Fallback to basic token ID parsing
  - Encode/decode text to/from token sequences
- LLM Engine (src/llm_engine.h/cpp)
  - Loads and manages GGUF models
  - Architecture detection and dynamic layer construction
  - Forward pass implementation for different model types
  - Sampling and generation logic
- GGUF Loader (src/gguf.h/cpp)
  - Parses GGUF format files (header layout sketched below)
  - Extracts tensors, metadata, and tokenizer information
  - Dequantization support for various formats
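For orientation, here is a minimal sketch of reading just the fixed GGUF header fields (magic, version, tensor count, metadata key/value count). It assumes a little-endian host and only illustrates the on-disk layout; it is not the interface of src/gguf.h/cpp.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// A GGUF file starts with a fixed little-endian header:
//   4-byte magic "GGUF", uint32 version, uint64 tensor count, uint64 metadata KV count.
int main() {
    FILE* f = std::fopen("model.gguf", "rb");
    if (!f) return 1;

    char magic[4] = {0};
    uint32_t version = 0;
    uint64_t tensor_count = 0, kv_count = 0;

    bool ok = std::fread(magic, 1, 4, f) == 4
           && std::fread(&version, sizeof(version), 1, f) == 1
           && std::fread(&tensor_count, sizeof(tensor_count), 1, f) == 1
           && std::fread(&kv_count, sizeof(kv_count), 1, f) == 1;
    std::fclose(f);

    if (!ok || std::memcmp(magic, "GGUF", 4) != 0)
        return 1;  // not a readable GGUF file

    std::printf("GGUF v%u: %llu tensors, %llu metadata entries\n",
                (unsigned)version,
                (unsigned long long)tensor_count,
                (unsigned long long)kv_count);
    return 0;
}
```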
- Phi-3: Microsoft's Phi-3 models
- LLaMA: Meta's LLaMA models
- Mistral: Mistral AI models
- Qwen: Alibaba's Qwen models
- GPT-2: OpenAI's GPT-2 architecture
- Extensible: Framework for adding new architectures
- Sampling Methods:
  - Greedy decoding
  - Top-k sampling
  - Top-p (nucleus) sampling
  - Temperature scaling
- Parameters:
  - Max tokens
  - Temperature
  - Top-k, Top-p values
  - Repetition penalty
  - Stop tokens
#include "llm_engine.h"
ncnn::LLMEngine engine;
engine.load_model("model.gguf");

ncnn::GenerationConfig config;
config.max_tokens = 100;
config.temperature = 0.8f;
config.top_p = 0.9f;
std::string response = engine.generate_text("Hello, world!", config);
./llm_server --model model.gguf --port 8080
curl -X POST http://localhost:8080/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llm",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50,
    "temperature": 0.7
  }'
- Phi-3 Mini 4K Instruct (Q4_K_M)
- Mistral 7B Instruct v0.1 (Q4_0)
- Qwen 1.5B Instruct (Q4_0)
Important: Users are responsible for downloading their own AI models. This repository does not include model files due to their large size and licensing considerations.
- Hugging Face: https://huggingface.co/models?pipeline_tag=text-generation
- GGUF Models: Search for models with GGUF format
- Recommended Models:
  - microsoft/Phi-3-mini-4k-instruct (GGUF format)
  - mistralai/Mistral-7B-Instruct-v0.1 (GGUF format)
  - Qwen/Qwen2-1.5B-Instruct (GGUF format)
huggingface-cli download microsoft/Phi-3-mini-4k-instruct Phi-3-mini-4k-instruct-Q4_K_M.gguf --local-dir models/
- Add architecture detection in LLMEngine::detect_architecture()
- Implement forward pass in LLMEngine::forward_*() methods
- Update parameter extraction for the new architecture
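GGUF models record their architecture under the general.architecture metadata key, so detection typically switches on that string. The enum and function below are simplified stand-ins for illustration, not the project's actual LLMEngine methods.

```cpp
#include <cstdio>
#include <string>

// Simplified stand-in for an architecture-detection step: map the GGUF
// "general.architecture" metadata string to an internal architecture tag.
enum class Arch { LLAMA, PHI3, MY_ARCH, UNKNOWN };

static Arch detect_architecture(const std::string& arch_name) {
    if (arch_name == "llama")   return Arch::LLAMA;
    if (arch_name == "phi3")    return Arch::PHI3;
    if (arch_name == "my_arch") return Arch::MY_ARCH;  // the new architecture's name string
    return Arch::UNKNOWN;
}

int main() {
    // A matching forward_*() implementation would then be selected from this tag.
    std::printf("%d\n", static_cast<int>(detect_architecture("my_arch")));
    return 0;
}
```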
- Ensure Vulkan SDK is installed
- Use latest GPU drivers
- Set opt.use_vulkan_compute = true in Option
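On a plain NCNN Net, the option mentioned above is set as shown below; how LLMEngine surfaces this option is not shown here, so treat the wiring as an assumption.

```cpp
#include "net.h"  // NCNN

int main() {
    ncnn::Net net;
    net.opt.use_vulkan_compute = true;  // must be set before loading the model
    // net.load_param("model.param");
    // net.load_model("model.bin");
    return 0;
}
```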
- Use quantized models (Q4_K_M recommended)
- Enable KV caching for sequential generation
- Batch processing for multiple requests
- Automatic fallback when Vulkan unavailable
- Multi-threading support
- AVX/AVX2 optimizations
make test
./test_llm_engine
./benchmark_llm --model model.gguf --prompts test_prompts.txt
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure Vulkan acceleration works
- Submit pull request
- Test on multiple GPUs
- Verify quantization compatibility
- Add model-specific tests
- Update documentation
This project follows the same license as NCNN.
- Tool use system implementation
- GGUF loader fixes and compilation issues resolved
- GUI application with model loading and chat interface
- Web search tool with real API (no key required)
- REST API implementation
- Multi-modal support (CLIP integration)
- Streaming generation
- Model quantization tools
- Performance profiling tools
- Web UI interface with tool integration
- Mobile deployment optimizations