LLM inference in C/C++
- guide : running gpt-oss with llama.cpp
- [FEEDBACK] Better packaging for llama.cpp to support downstream consumers 🤗
- Support for the `gpt-oss` model with native MXFP4 format has been added | PR | Collaboration with NVIDIA | Comment
- Hot PRs: All | Open
- Multimodal support arrived in `llama-server`: #12898 | documentation
- VS Code extension for FIM completions: https://github.com/ggml-org/llama.vscode
- Vim/Neovim plugin for FIM completions: https://github.com/ggml-org/llama.vim
- Introducing GGUF-my-LoRA ggml-org/llama.cpp#10123
- Hugging Face Inference Endpoints now support GGUF out of the box! ggml-org/llama.cpp#9669
- Hugging Face GGUF editor: discussion | tool
For the basic llama.cpp quick start, follow the original repository.
IGNITE's main focus is on-device inference built on `llama-cli`, following the `llama-completion` workflow.
Download the evaluation models:

```bash
python downloader.py
```

This script downloads the models that are pre-selected for evaluation on IGNITE. If none of them fits your needs, you can also download and run your own GGUF models.
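For example, to fetch your own GGUF model into the `models/` directory used by the run command below, you could use the Hugging Face CLI. This is only an illustration, not part of IGNITE's tooling; the repository and file names are placeholders for whatever model you actually want.

```bash
# Illustrative only: pull a GGUF file from the Hugging Face Hub into models/.
# Substitute the repository and file name for the model you want to run.
pip install -U "huggingface_hub[cli]"
huggingface-cli download Qwen/Qwen1.5-0.5B-Chat-GGUF \
    qwen1_5-0_5b-chat-q4_k_m.gguf \
    --local-dir models
```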
Build for Android and run on-device via Termux:

```bash
cd scripts && sh build-android.sh && cd ..
chmod +x scripts-termux/run.sh
su -c "sh scripts-termux/run.sh"
```

Build on a desktop host:

```bash
cd scripts && sh build.sh && cd ..
```

Run the evaluation:

```bash
./build/bin/ignite \
-m models/qwen-1.5-0.5b-chat-q4k.gguf \
-cnv \
--temp 0 \
--top-k 1 \
--threads 1 \
--output-path outputs/hotpot_0_0.csv \
--json-path dataset/hotpot_qa_30.json
```

This section will be filled in later.
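A note on the flags, assuming `ignite` keeps `llama-cli`'s sampling options (the document states it is built on `llama-cli`): `--temp 0` together with `--top-k 1` makes decoding greedy and therefore deterministic, `-cnv` enables conversation mode, and `--threads 1` pins inference to a single CPU thread. If you downloaded your own GGUF model, the same invocation should work with only the paths changed; the model and output file names below are placeholders.

```bash
# Hypothetical run with a user-supplied model; assumes ignite keeps
# llama-cli's sampling flags (-cnv, --temp, --top-k, --threads).
# --temp 0 and --top-k 1 force greedy, deterministic decoding;
# --threads 1 keeps inference on a single CPU thread.
./build/bin/ignite \
    -m models/qwen1_5-0_5b-chat-q4_k_m.gguf \
    -cnv \
    --temp 0 \
    --top-k 1 \
    --threads 1 \
    --output-path outputs/custom_run.csv \
    --json-path dataset/hotpot_qa_30.json
```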
