<p align="center">
  <img src="assets/uno-logo.png" alt="Uno.cpp Logo" width="128" height="128">
</p>

<h1 align="center">Uno.cpp</h1>

<p align="center">
  <strong>Un-official llama.cpp — run community-quantized models before they hit upstream</strong>
</p>

<p align="center">
  <a href="https://github.com/sumitchatterjee13/uno.cpp/releases">
    <img src="https://img.shields.io/github/v/release/sumitchatterjee13/uno.cpp?style=flat-square&label=download" alt="Download">
  </a>
  <a href="https://github.com/ggml-org/llama.cpp/blob/master/LICENSE">
    <img src="https://img.shields.io/badge/license-MIT-blue?style=flat-square" alt="License: MIT">
  </a>
  <a href="https://huggingface.co/Sumitc13/sarvam-30b-GGUF">
    <img src="https://img.shields.io/badge/🤗_Models-Sarvam_30B_GGUF-yellow?style=flat-square" alt="HuggingFace Models">
  </a>
</p>

---

## What is Uno.cpp?

Uno.cpp is a ready-to-use desktop application that lets you run GGUF models locally — especially models with new architectures that haven't been merged into the official [llama.cpp](https://github.com/ggml-org/llama.cpp) yet.

**The problem:** When a new model architecture is quantized to GGUF, the architecture support PR can take weeks to get reviewed and merged into llama.cpp. During this time, users have no easy way to run these models — they'd need to manually clone a fork, set up a C++ build environment, compile from source, and use the command line.

**The solution:** Uno.cpp packages a custom-built `llama-server` (with the new architecture support baked in) into a simple Windows installer with a GUI launcher. Download, install, pick your model file, chat. That's it.

## Currently Supported Models

| Model | Architecture | Parameters | HuggingFace | Upstream PR |
|-------|-------------|------------|-------------|-------------|
| [Sarvam-30B](https://huggingface.co/sarvamai/sarvam-30b) | `sarvam_moe` | ~30B (MoE) | [GGUF Downloads](https://huggingface.co/Sumitc13/sarvam-30b-GGUF) | [#20275](https://github.com/ggml-org/llama.cpp/pull/20275) |

> Sarvam-30B is an open-source Mixture-of-Experts model by [Sarvam AI](https://www.sarvam.ai/) with 128 routed experts (top-6 routing) + 1 shared expert, a 262K vocabulary, and strong multilingual capabilities across Indian languages.
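
The top-k routing described above can be sketched in a few lines. This is a toy illustration of how a Mixture-of-Experts router selects experts per token — not Sarvam's actual implementation; only the expert count and `k` follow the numbers quoted above:

```python
import math

def route_token(expert_logits, k=6):
    """Toy MoE router: softmax over expert scores, keep the top-k experts.

    Illustrative only -- not Sarvam's routing code. A shared expert would
    additionally process every token regardless of this selection.
    """
    m = max(expert_logits)
    exps = [math.exp(x - m) for x in expert_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the k highest-probability experts and renormalize their weights
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# 128 routed experts, top-6 routing, as described above
routing = route_token([0.0] * 127 + [5.0], k=6)
```

Each token's hidden state is then sent only to the selected experts, which is what lets a ~30B-parameter MoE activate far fewer parameters per token than a dense model of the same size.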

More models will be added as new architectures are quantized.

## Quick Start

### 1. Download and Install

Download the latest **`Unocpp-Setup-v*.exe`** from [Releases](https://github.com/sumitchatterjee13/uno.cpp/releases) and run the installer.

### 2. Download a Model

Grab a GGUF model file from HuggingFace:

| Quant | Size | VRAM Needed | Best For |
|-------|------|-------------|----------|
| [Q4_K_M](https://huggingface.co/Sumitc13/sarvam-30b-GGUF) | ~19 GB | ~20 GB | Most users with 24GB+ VRAM |
| [Q6_K](https://huggingface.co/Sumitc13/sarvam-30b-GGUF) | ~26 GB | ~27 GB | Higher quality, 32GB VRAM |
| [Q8_0](https://huggingface.co/Sumitc13/sarvam-30b-GGUF) | ~34 GB | ~35 GB | Best quantized quality |
| [BF16](https://huggingface.co/Sumitc13/sarvam-30b-GGUF) | ~64 GB | ~65 GB | Full precision, multi-GPU |
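
As a quick sanity check before downloading, the table reduces to "file size plus a little headroom must fit in VRAM". A rough helper using the sizes above (the ~1 GB headroom is an assumption that holds for small contexts; real usage grows with context size):

```python
# Approximate GGUF file sizes in GB, taken from the table above
QUANT_SIZES_GB = {"Q4_K_M": 19, "Q6_K": 26, "Q8_0": 34, "BF16": 64}

def best_quant(vram_gb, headroom_gb=1.0):
    """Return the highest-quality quant whose weights + headroom fit in VRAM."""
    fitting = [(size, name) for name, size in QUANT_SIZES_GB.items()
               if size + headroom_gb <= vram_gb]
    return max(fitting)[1] if fitting else None

print(best_quant(24))  # a 24 GB card (e.g. RTX 3090/4090) -> "Q4_K_M"
```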

### 3. Launch and Chat

Open **Uno.cpp** from your desktop → select the `.gguf` file → your browser opens with a chat UI. Done.

## Features

- **No terminal required** — GUI launcher with model file picker and settings
- **Remembers your config** — model path, GPU layers, and context size saved between sessions
- **Built-in chat UI** — powered by llama.cpp's web interface, opens in your browser
- **GPU accelerated** — CUDA support for NVIDIA GPUs, configurable layer offloading
- **Adjustable settings** — GPU layers, context size (2K–32K), and port configuration
- **Fully local & private** — no cloud, no API keys, everything runs on your machine
- **OpenAI-compatible API** — `llama-server` exposes `/v1/chat/completions` so you can connect other tools
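
Since the server speaks the OpenAI chat-completions protocol, any HTTP client can drive it. A minimal sketch using only the Python standard library, assuming the launcher started the server on the default port 8080 (the `model` field is required by the API shape, but the server answers with whichever model it loaded):

```python
import json
import urllib.request

def build_chat_request(prompt, host="http://127.0.0.1:8080"):
    """Build an OpenAI-style chat completion request for the local server."""
    payload = {
        "model": "local",  # placeholder; llama-server uses its loaded model
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    return urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

if __name__ == "__main__":
    # Requires a running server; prints the assistant's reply
    with urllib.request.urlopen(build_chat_request("Hello!")) as resp:
        reply = json.load(resp)
        print(reply["choices"][0]["message"]["content"])
```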

## System Requirements

| Component | Minimum | Recommended |
|-----------|---------|-------------|
| OS | Windows 10 64-bit | Windows 11 |
| GPU | NVIDIA with 20+ GB VRAM | RTX 3090 / 4090 / 5090 |
| CUDA | CUDA 12.x drivers | Latest NVIDIA drivers |
| RAM | 16 GB | 32 GB |
| Disk | Space for model files | SSD for faster loading |

> **No NVIDIA GPU?** Set GPU Layers to `0` in the launcher to run on CPU only (significantly slower but works).

## Building From Source

If you prefer to build from source instead of using the installer:

### Prerequisites

- Visual Studio 2022 with C++ workload
- CMake 3.21+
- NVIDIA CUDA Toolkit 12.x (for GPU support)
- Git

### Build Steps

```powershell
# Clone the repo
git clone https://github.com/sumitchatterjee13/uno.cpp.git
cd uno.cpp

# Create build directory
mkdir build
cd build

# Configure with CUDA
cmake .. -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release

# Build the server
cmake --build . --config Release --target llama-server
```

> **Tip:** Use the **Developer Command Prompt for VS 2022** (not regular cmd) so CMake and the compiler are in PATH.

The binary will be at `build/bin/Release/llama-server.exe`. Run it directly:

```powershell
llama-server.exe -m path/to/model.gguf -ngl 99 -c 4096
```

Then open `http://127.0.0.1:8080` in your browser.

## How It Works

Uno.cpp is a fork of [llama.cpp](https://github.com/ggml-org/llama.cpp) with added architecture support for models that haven't been merged upstream yet. The app consists of:

1. **`llama-server.exe`** — custom-built llama.cpp inference server with new architecture support
2. **A GUI launcher** — pick your model, configure GPU layers and context size, hit launch
3. **Built-in chat UI** — opens in your browser at `http://127.0.0.1:8080`

The server also exposes an OpenAI-compatible API, so you can connect tools like Open WebUI, SillyTavern, or any OpenAI-compatible client.

## FAQ

**Q: Is this a fork of llama.cpp?**
Yes. Uno.cpp is llama.cpp with additional architecture support patches. Once those patches are merged upstream, Uno.cpp will sync with the latest llama.cpp and focus on the next set of unsupported models.

**Q: Will my existing models work with this?**
If your model uses a standard llama.cpp architecture (LLaMA, Mistral, Qwen, Gemma, Phi, etc.), it will work out of the box. Uno.cpp adds support on top — it doesn't remove anything.

**Q: Is it safe to install?**
The code is fully open source. The installer bundles only `llama-server.exe`, its runtime DLLs, and launcher scripts. You can verify everything by building from source.

**Q: macOS / Linux support?**
Currently Windows only. macOS and Linux users can build from source — the llama.cpp codebase supports all platforms.

## Credits

- [llama.cpp](https://github.com/ggml-org/llama.cpp) by Georgi Gerganov and contributors — the foundation this project is built on
- [Sarvam AI](https://www.sarvam.ai/) — creators of the Sarvam-30B model

## License

This project is licensed under the [MIT License](LICENSE), same as llama.cpp.

---

<p align="center">
  <em>Built by <a href="https://github.com/sumitchatterjee13">Sumit Chatterjee</a> — because nobody should have to wait for a PR merge to run a model.</em>
</p>
