Next-generation open MoE models: GigaChat 3 Ultra Preview (702B-A36B) and GigaChat 3 Lightning (10B-A1.8B).
This year we already released the GigaChat 2 lineup, added Reasoning to the Web version (giga.chat), open-sourced GigaChat Lite and Giga-Embeddings, and took first place on the ruMTEB benchmark. Now we are publishing open weights for a new generation of MoE models trained from scratch, without relying on foreign base weights.
We are open-sourcing two models under the MIT license with commercial-use permission:
- GigaChat 3 Ultra Preview (702B-A36B) — flagship instruct model:
  - instruct model;
  - fine-tuning checkpoint (bf16): ai-sage/GigaChat3-702B-A36B-preview-bf16.
- GigaChat 3 Lightning (10B-A1.8B) — compact MoE model for local and high-load use:
  - base model (pretrain): GigaChat3-10B-A1.8B-base;
  - instruct model (for dialogs and instructions): GigaChat3-10B-A1.8B.
Both models use a custom Mixture-of-Experts (MoE) architecture with support for:
- Multi-head Latent Attention (MLA)
- Multi-Token Prediction (MTP)
Instead of standard Multi-head Attention, the models use Multi-head Latent Attention. The Key-Value (KV) cache is compressed into a latent representation, which:
- reduces KV-cache size and memory requirements;
- speeds up long-context inference.
This is especially noticeable for very long contexts (tens or hundreds of thousands of tokens) and for cluster inference.
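For intuition, here is a minimal, schematic sketch of the latent KV-compression idea. It is illustrative only: the dimensions and projection names are assumptions, not the actual GigaChat 3 implementation, and details such as positional encoding and the attention computation itself are omitted.

```python
import torch
import torch.nn as nn

class LatentKVCompression(nn.Module):
    """Schematic MLA-style KV compression: cache one small latent vector per token
    instead of full per-head keys and values (illustrative dimensions)."""

    def __init__(self, d_model=4096, d_latent=512, n_heads=32, d_head=128):
        super().__init__()
        self.down_kv = nn.Linear(d_model, d_latent, bias=False)        # compress hidden state
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand latent to keys
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand latent to values

    def forward(self, hidden_states):
        # Only `latent` needs to live in the KV cache:
        # (batch, seq, d_latent) instead of (batch, seq, 2 * n_heads * d_head).
        latent = self.down_kv(hidden_states)
        keys = self.up_k(latent)
        values = self.up_v(latent)
        return latent, keys, values

x = torch.randn(1, 8, 4096)
latent, k, v = LatentKVCompression()(x)
print(latent.shape, k.shape, v.shape)  # 512 floats per token cached vs 8192 for full K+V
```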
Both models are trained with the Multi-Token Prediction (MTP) objective:
- the model learns to predict multiple tokens per forward pass;
- this enables speculative / parallel generation and speeds up inference by roughly 40%;
- in typical setups (vLLM / SGLang), MTP delivers throughput comparable to that of much smaller dense models.
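As a rough sketch of the idea behind the training objective (shapes and the loss weight are assumptions; the actual MTP module is more involved), an auxiliary head is trained to predict the token one step further ahead:

```python
import torch
import torch.nn.functional as F

def mtp_loss(logits_next, logits_next2, targets, mtp_weight=0.3):
    """Illustrative multi-token prediction loss: the main head predicts token t+1,
    an auxiliary head predicts token t+2. Shapes: logits_* (batch, seq, vocab),
    targets (batch, seq). The weight 0.3 is an arbitrary example value."""
    vocab = logits_next.size(-1)
    # Standard next-token loss: position i predicts targets[i + 1].
    loss_main = F.cross_entropy(
        logits_next[:, :-1].reshape(-1, vocab), targets[:, 1:].reshape(-1)
    )
    # MTP loss: position i additionally predicts targets[i + 2].
    loss_mtp = F.cross_entropy(
        logits_next2[:, :-2].reshape(-1, vocab), targets[:, 2:].reshape(-1)
    )
    return loss_main + mtp_weight * loss_mtp

B, S, V = 2, 16, 1000
print(mtp_loss(torch.randn(B, S, V), torch.randn(B, S, V), torch.randint(0, V, (B, S))))
```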
GigaChat3 Ultra Preview is the flagship instruct model in the GigaChat family, built on a Mixture-of-Experts (MoE) architecture.
Key highlights:
- ~702B total parameters, with ~36B active per token thanks to sparsity;
- combines top-tier quality with practical inference speed;
- inspired by DeepSeek V3 (MoE + MLA + MTP), but trained from scratch on our own corpus;
- top-1 on MERA;
- context length up to 131k tokens;
- faster than GigaChat 2 Max.
A bf16 checkpoint is available for fine-tuning:
ai-sage/GigaChat3-702B-A36B-preview-bf16.
More details on the architecture and training will be published in an upcoming Habr article.
| Metric | GigaChat 3 Ultra | GigaChat 2 Max |
|---|---|---|
| MERA text | 0.683 | 0.663 |
| MERA industrial | 0.645 / 0.824 | — |
| MERA code | 0.338 | — |
| AUTOLOGI_EN_ZERO_SHOT | 0.6857 | 0.6489 |
| GPQA_COT_ZERO_SHOT | 0.5572 | 0.4714 |
| HUMAN_EVAL_PLUS_ZERO_SHOT | 0.8659 | 0.7805 |
| LBPP_PYTHON_ZERO_SHOT | 0.5247 | 0.4753 |
| MMLU_PRO_EN_FIVE_SHOT | 0.7276 | 0.6655 |
| GSM8K_FIVE_SHOT | 0.9598 | 0.9052 |
| MATH_500_FOUR_SHOT | 0.7840 | 0.7160 |
# lm-eval[api]==0.4.9.1
# sglang[all]==0.5.5
# or
# vllm==0.11.2
export HF_ALLOW_CODE_EVAL=1
# sglang server up
# 700B
python -m sglang.launch_server --model-path <path_to_model> --host 127.0.0.1 --port 30000 --nnodes 2 --node-rank <0/1> --tp 16 --ep 16 --dtype auto --mem-fraction-static 0.7 --trust-remote-code --allow-auto-truncate --speculative-algorithm EAGLE --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 --dist-init-addr <master_node_ip>:50000
# mmlu pro check
python -m lm_eval --model sglang-generate --output_path <path_to_model> --batch_size 16 --model_args base_url=http://127.0.0.1:30000/generate,num_concurrent=16,tokenized_requests=True,max_length=131072,tokenizer=<path_to_model> --trust_remote_code --confirm_run_unsafe_code --num_fewshot 5 --tasks mmlu_pro
GigaChat3-10B-A1.8B is a compact next-generation MoE model.
Key highlights:
- 10B total parameters, with 1.8B active per token;
- reaches the quality level of the leading open-source model in its class, Qwen3-4B;
- ~1.5× faster generation speed, comparable to Qwen3-1.7B;
- suitable for local use:
  - offline assistants;
  - prototyping and RAG;
  - LLM classifiers and high-load RAG scenarios;
- a strong base for a high-performance CPU embedding model (thanks to MoE);
- supported context up to 256k tokens.
- Base (pretrain) — GigaChat3-10B-A1.8B-base. Intended for fine-tuning and custom tasks.
- Instruct — GigaChat3-10B-A1.8B. Recommended for dialog scenarios, assistants, and instruction following.
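A minimal chat-style example with the instruct model via Transformers, assuming the ai-sage/GigaChat3-10B-A1.8B repository ships a chat template (the prompt and generation parameters here are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ai-sage/GigaChat3-10B-A1.8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain what a Mixture-of-Experts model is."}]
# Formats the dialog with the chat template bundled in the tokenizer.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=200)
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
```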
Even though the model has 10B total parameters, its direct analogs are 3–4B dense models; thanks to its high generation speed, we also compare it with smaller models.
| Metric | GigaChat 3 Lightning | Qwen3-1.7B-Instruct | Qwen3-4B-Instruct-2507 | SmolLM3 |
|---|---|---|---|---|
| MMLU_RU_FIVE_SHOT | 0.6833 | 0.4876 | 0.5972 | 0.4998 |
| RUBQ_ZERO_SHOT | 0.6516 | 0.2557 | 0.3170 | 0.6363 |
| MMLU_PRO_EN_FIVE_SHOT | 0.6061 | 0.410 | 0.6849 | 0.5013 |
| MMLU_EN_FIVE_SHOT | 0.7403 | 0.60 | 0.7080 | 0.5992 |
| BBH_THREE_SHOT | 0.4525 | 0.3317 | 0.7165 | 0.4161 |
| SuperGPQA | 0.2731 | 0.2092 | 0.3745 | 0.2459 |
| MATH_500_FOUR_SHOT | 0.7000 | 0.7520 | 0.8880 | 0.8020 |
| GPQA_COT_ZERO_SHOT | 0.3502 | 0.2651 | 0.5370 | 0.3704 |
| LiveCodeBench_ZERO_SHOT | 0.2031 | 0.0794 | 0.3046 | 0.1656 |
| HUMAN_EVAL_PLUS_ZERO_SHOT | 0.6951 | 0.6280 | 0.8780 | 0.7012 |
# lm-eval[api]==0.4.9.1
# sglang[all]==0.5.5
# or
# vllm==0.11.2
export HF_ALLOW_CODE_EVAL=1
# sglang server up
# 10B
python -m sglang.launch_server --model-path <path_to_model> --host 127.0.0.1 --port 30000 --tp 1 --dp 8 --dtype bfloat16 --mem-fraction-static 0.7 --trust-remote-code --allow-auto-truncate --speculative-algorithm EAGLE --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2
# mmlu pro check
python -m lm_eval --model sglang-generate --output_path <path_to_model> --batch_size 16 --model_args base_url=http://127.0.0.1:30000/generate,num_concurrent=16,tokenized_requests=True,max_length=131072,tokenizer=<path_to_model> --trust_remote_code --confirm_run_unsafe_code --num_fewshot 5 --tasks mmlu_pro
Both models were trained from scratch on a multilingual and diverse corpus.
Shared corpus properties:
- 10+ languages (including Russian, English, Chinese, Arabic, Uzbek, Kazakh, and others);
- sources:
  - books and non-fiction;
  - academic data;
  - code and math datasets;
  - dialog and instruction datasets;
- preprocessing (a simplified sketch follows this list):
  - deduplication;
  - language filtering;
  - automatic quality checks using heuristics and classifiers.
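To make these steps concrete, below is a deliberately simplified, illustrative sketch of exact deduplication plus a heuristic language filter. It is not the production pipeline; the thresholds and heuristics are assumptions for illustration only.

```python
import hashlib

def normalize(text: str) -> str:
    # Light normalization so trivially different copies hash identically.
    return " ".join(text.lower().split())

def cyrillic_ratio(text: str) -> float:
    letters = [c.lower() for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum("а" <= c <= "я" or c == "ё" for c in letters) / len(letters)

def preprocess(docs):
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:          # exact deduplication
            continue
        seen.add(digest)
        if len(doc.split()) < 5:    # crude length/quality heuristic
            continue
        lang = "ru" if cyrillic_ratio(doc) > 0.5 else "other"  # toy language filter
        yield {"text": doc, "lang": lang}

docs = [
    "Пример текста, который попадёт в обучающий корпус модели.",
    "Пример текста, который попадёт в обучающий корпус модели.",  # duplicate, dropped
    "too short",                                                   # dropped by heuristic
]
print(list(preprocess(docs)))
```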
Synthetic data made a substantial contribution to quality:
- ~5.5T tokens of synthetic data;
- the corpus includes:
  - QA pairs based on texts;
  - reverse-prompt chains for data structuring;
  - model comments and notes embedded into texts;
  - millions of synthetic tasks with solutions for math and competitive programming (with auto-generated tests; see the toy sketch below), based on PromptCot.
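As a toy illustration of the auto-generated-tests idea for synthetic programming tasks (not the actual generation pipeline): sample random inputs and let a reference solution produce the expected outputs.

```python
import random

def reference_solution(nums):
    """Reference solution for a toy synthetic task: maximum subarray sum (Kadane)."""
    best = cur = nums[0]
    for x in nums[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best

def generate_tests(n_tests=5, max_len=10, seed=0):
    """Sample random inputs; record the reference solution's output as the expected answer."""
    rng = random.Random(seed)
    tests = []
    for _ in range(n_tests):
        nums = [rng.randint(-10, 10) for _ in range(rng.randint(1, max_len))]
        tests.append({"input": nums, "expected": reference_solution(nums)})
    return tests

for case in generate_tests():
    print(case)
```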
GigaChat3 Ultra Preview is designed for cluster and on-prem scenarios with serious infrastructure:
- popular inference engines are supported:
  - vLLM
  - SGLang
  - LMDeploy
  - TensorRT-LLM
  - and other frameworks;
- BF16 and FP8 modes are supported (FP8 requires a separate build and GPU configuration);
- MLA and MTP reduce KV-cache size and speed up generation;
- a proxy/gateway layer is recommended for integration with external services, tools, and agent frameworks (see the client example below).
For a reference configuration, see guides for similarly sized models:
- DeepSeek-V3 — How to run locally: https://github.com/deepseek-ai/DeepSeek-V3?tab=readme-ov-file#6-how-to-run-locally
- Kimi-K2-Instruct — deployment guidance (vLLM / SGLang / LMDeploy): https://huggingface.co/moonshotai/Kimi-K2-Instruct/blob/main/docs/deploy_guidance.md
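For example, once the model is served behind an OpenAI-compatible endpoint (both vLLM and SGLang expose one), external services and agent frameworks can talk to it through a standard client. The gateway URL and served model name below are placeholders:

```python
from openai import OpenAI

# Point the client at your gateway or directly at the vLLM/SGLang server.
client = OpenAI(base_url="http://your-gateway:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="ai-sage/GigaChat3-702B-A36B-preview-bf16",  # use the name the server was launched with
    messages=[{"role": "user", "content": "Summarize the key ideas behind MoE architectures."}],
    max_tokens=300,
)
print(response.choices[0].message.content)
```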
One of the key advantages of GigaChat3-10B-A1.8B is inference speed.
The model (especially in MTP mode) delivers throughput comparable to much smaller dense models while being noticeably stronger in quality.
Measurements were done with vLLM v0.11.0, dtype bfloat16, batch_size=1.
Benchmark code: https://gist.github.com/ajpqs/ce941aa6f0f48ef36a65cb87a2a1d726.
| Model | request_throughput | output_throughput | total_token_throughput | mean_ttft_ms |
|---|---|---|---|---|
| Qwen3-1.7B | 1.689 | 357.308 | 726.093 | 11.824 |
| mtp-GigaChat3-10B-A1.8B-base | 1.533 | 333.620 | 678.894 | 26.345 |
| GigaChat3-10B-A1.8B-base | 1.077 | 234.363 | 476.912 | 31.053 |
| Qwen3-4B | 0.978 | 206.849 | 420.341 | 14.947 |
| Qwen3-8B | 0.664 | 140.432 | 285.375 | 16.663 |
| YandexGPT-5-Lite-8B-pretrain | 0.641 | 147.305 | 300.269 | 16.711 |
Despite having 10B parameters, in terms of speed and inference cost the model can be treated as an alternative to 3–4B dense models, and in MTP mode it can be compared even to smaller ones.
Both models:
- are not specialized reasoning models (but support a basic level of reasoning);
- can predict multiple tokens per step (MTP);
- use MLA, reducing KV-cache size and memory requirements;
- are trained from scratch without initialization from third-party weights;
- are compatible with:
  - Hugging Face;
  - vLLM / SGLang / LMDeploy;
  - standard inference and fine-tuning pipelines;
- are released under the MIT license and can be used in commercial products.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
model_name = "ai-sage/GigaChat3-10B-A1.8B-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
device_map="auto",
)
model.generation_config = GenerationConfig.from_pretrained(model_name)
prompt = "Ниже я написал подробное доказательство теоремы о неподвижной точке:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model.generate(
inputs.input_ids,
attention_mask=inputs.attention_mask,
max_new_tokens=400,
)
result = tokenizer.decode(
outputs[0][inputs.input_ids.shape[1]:],
skip_special_tokens=False,
)
print(result)

Start a vLLM server:
vllm serve ai-sage/GigaChat3-10B-A1.8B-base \
--disable-sliding-window \
--dtype "auto" \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "disable_padded_drafter_batch": false}'

Example request:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ai-sage/GigaChat3-10B-A1.8B-base",
"prompt": "Ниже я написал подробное доказательство теоремы о неподвижной точке:",
"max_tokens": 400,
"temperature": 0
}'

Start an SGLang server:
python -m sglang.launch_server \
--model-path ai-sage/GigaChat3-10B-A1.8B-base \
--host 0.0.0.0 \
--port 30000 \
--dtype auto \
--mem-fraction-static 0.88 \
--speculative-algorithm NEXTN \
--speculative-num-steps 1 \
--speculative-eagle-topk 1

Example request:
curl http://localhost:30000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ai-sage/GigaChat3-10B-A1.8B-base",
"prompt": "Ниже я написал подробное доказательство теоремы о неподвижной точке:",
"max_tokens": 400,
"temperature": 0
}'

The models are released under the MIT license. You may use them in research and commercial projects provided that the license text is preserved.
