Releases: AmpereComputingAI/llama.cpp
v3.4.2
Based on ggml-org/llama.cpp b7772 (https://github.com/ggml-org/llama.cpp/releases/tag/b7772)
Also available at: DockerHub
v3.4.1
Based on ggml-org/llama.cpp b7286 (https://github.com/ggml-org/llama.cpp/releases/tag/b7286)
- Automatic Flash Attention selection feature (-fa auto). See ampere.md for more details; a usage sketch follows below.
Also available at: DockerHub
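A minimal invocation sketch for the automatic selection feature, assuming the llama-cli binary from this release; the model path and prompt below are placeholders, not files shipped with the release:

```shell
# Let llama.cpp choose the Flash Attention path automatically (-fa auto).
# ./models/your-model.gguf is a placeholder GGUF model path.
./llama-cli -m ./models/your-model.gguf -fa auto -p "Hello"
```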
v3.4.0
Based on ggml-org/llama.cpp b6735 (https://github.com/ggml-org/llama.cpp/releases/tag/b6735)
- Fixed Flash Attention for SWA (sliding window attention) models
- New Flash Attention algorithm, optimized for long contexts (above 1024 tokens). See the "Flash Attention algorithm selection" section for details on how to select the attention algorithm manually.
Also available at: DockerHub
v3.3.1
v3.3.0
v3.2.1
v3.2.0
v3.1.2
v3.1.0
v2.2.1
- Updated benchmark.py