Sovereign fine-tuning pipeline that takes low-resource languages from raw data to production inference — data collection, audio transcription, QLoRA training, evaluation, and serving — with V100-first GPU optimization and multi-language support out of the box.
Large language models perform well on high-resource languages (English, Chinese, German) but degrade significantly on low-resource languages — Turkish, Azerbaijani, Swahili, Kurdish — where pretraining data is scarce and evaluation benchmarks don't exist. Existing fine-tuning tools assume English data, unlimited GPU budgets, and established evaluation infrastructure.
This creates a structural barrier:
- No training pipelines designed for languages with fragmented, low-quality datasets scattered across HuggingFace, audio archives, and scraped corpora.
- No hardware-aware optimization for teams working with V100s instead of A100/H100 clusters — most QLoRA tools default to bf16, which V100s don't support.
- No evaluation methodology for languages without standardized benchmarks — you can't measure what you can't test.
- No end-to-end path from raw multilingual data to a deployed, benchmarked, published model that others can use.
LowResource-LLM-Forge solves this by providing a single pipeline that handles the entire lifecycle: collect and clean multilingual data, transcribe audio, fine-tune with V100-optimized QLoRA, evaluate with language-specific benchmarks, serve via vLLM, and publish to HuggingFace Hub — all from one CLI.
The project targets NLP researchers, language preservation teams, and engineering groups building sovereign AI capabilities for underserved language communities.
Models trained with this pipeline are published on HuggingFace:
| Model | Base | Method | HuggingFace |
|---|---|---|---|
| Turkish-LLM-7B-Instruct | Turkcell-LLM-7b-v1 | SFT | ogulcanaydogan/Turkish-LLM-7B-Instruct |
| Turkish-LLM-14B-Instruct | Qwen2.5-14B | SFT + DPO | ogulcanaydogan/Turkish-LLM-14B-Instruct |
| Turkish-LLM-32B-Instruct | Qwen2.5-32B | SFT | ogulcanaydogan/Turkish-LLM-32B-Instruct |
GGUF quantized versions are also available for local inference with llama.cpp and Ollama. All published models were trained on A100 GPUs.
| Model | MMLU_TR | XNLI_TR | XCOPA_TR | Notes |
|---|---|---|---|---|
| Qwen2.5-14B (base) | 59.47% | 41.53% | 66.80% | Pre-training baseline |
| + SFT v3 (144K samples) | 59.38% | 42.89% | 65.40% | Instruction tuning |
| + DPO v3 | 59.42% | 43.33% | 66.00% | Preference alignment |
| + SFT v4 (refined data) | 59.77% | 42.14% | 65.60% | Best MMLU_TR |
Honest assessment: SFT and DPO fine-tuning preserved the base model's strong benchmark performance while adding Turkish instruction-following capability. Academic benchmarks like MMLU measure factual knowledge that is largely fixed during pretraining — they are not designed to capture the improvements that SFT actually provides: coherent multi-turn dialogue, proper Turkish character handling, instruction adherence, and reduced hallucination in Turkish responses.
The real value of this pipeline is not in moving benchmark numbers — it is in turning a base model that outputs fragmented, inconsistent Turkish into one that can hold a structured conversation in Turkish.
The pipeline is designed to run on V100 GPUs (Volta architecture, compute capability 7.0), which present real engineering problems that don't exist on A100/H100. These were identified and fixed during V100 training runs:
| Problem | Root Cause | Solution |
|---|---|---|
| NaN loss during training | BitsAndBytes 4-bit + AMP autocast produces NaN on pre-Ampere GPUs when bnb_4bit_compute_dtype=float16 |
Detect GPU compute capability at runtime; fall back to float32 compute dtype on pre-Ampere hardware |
| No bf16 support | V100 only supports fp16, but most QLoRA guides and configs assume bf16 | All configs enforce fp16: true, bf16: false; training scripts validate dtype before launch |
| Loss spikes and overflow | fp16 dynamic range (max ~65504) causes gradient overflow without GradScaler | Added loss spike detection, consecutive-zero guard, and adaptive max_grad_norm tuning |
| OOM on 9B models | 32GB V100 VRAM is tight for 9B QLoRA with batch size > 1 | Gradient checkpointing + per-model memory budgets in config; automatic batch size reduction |
These fixes are baked into the pipeline — users targeting V100 hardware get stable training out of the box without needing to debug dtype and quantization issues themselves.
graph LR
A["Data Sources"] --> B["Data Pipeline"]
B --> C["QLoRA Training"]
C --> D["Evaluation"]
D --> E["Inference Serving"]
D --> F["HF Hub Publish"]
style A fill:#4A90D9,stroke:#2C5F8A,color:#fff
style B fill:#7B68EE,stroke:#5B48CE,color:#fff
style C fill:#FF6B6B,stroke:#CC4444,color:#fff
style D fill:#F5A623,stroke:#C7851A,color:#fff
style E fill:#50C878,stroke:#3AA862,color:#fff
style F fill:#4ECDC4,stroke:#36B5AC,color:#fff
Upload datasets or audio in any supported language — the pipeline normalizes formats, removes duplicates, filters by language, trains a QLoRA adapter on V100-compatible settings, evaluates against configurable benchmarks, and deploys the merged model for inference.
graph TD
subgraph DATA["Data Pipeline"]
HF["HuggingFace\nDatasets"] --> COLLECT["DataCollector\n(alpaca, sharegpt,\nraw_text, dpo)"]
AUDIO["Audio Files"] --> WHISPER["WhisperTranscriber\n(language forcing,\nconfidence filter)"]
WHISPER --> COLLECT
COLLECT --> PREPROC["DataPreprocessor\n(clean, MinHash dedup,\nlang-filter)"]
PREPROC --> BUILD["DatasetBuilder\n(train/eval split)"]
end
subgraph TRAIN["Training Pipeline"]
BUILD --> TRAINER["ForgeTrainer\n(Unsloth QLoRA,\nPEFT fallback)"]
TRAINER --> MERGE["LoRAMerger\n(merge + tokenizer patch)"]
end
subgraph EVAL["Evaluation Pipeline"]
MERGE --> EVALUATOR["ForgeEvaluator"]
EVALUATOR --> MMLU["Turkish MMLU\n(lm-eval)"]
EVALUATOR --> PPL["Perplexity\n(held-out text)"]
EVALUATOR --> GEN["Generation Quality\n(heuristic scoring)"]
end
subgraph SERVE["Serving & Publishing"]
MERGE --> VLLM["vLLM Server\n(systemd, Docker)"]
MERGE --> REPL["Replicate\n(Cog)"]
MERGE --> PUB["HF Hub Publish\n(auto model card)"]
end
style DATA fill:#1a1a2e,stroke:#4A90D9,color:#fff
style TRAIN fill:#1a1a2e,stroke:#FF6B6B,color:#fff
style EVAL fill:#1a1a2e,stroke:#F5A623,color:#fff
style SERVE fill:#1a1a2e,stroke:#50C878,color:#fff
style HF fill:#4A90D9,stroke:#2C5F8A,color:#fff
style AUDIO fill:#4A90D9,stroke:#2C5F8A,color:#fff
style COLLECT fill:#7B68EE,stroke:#5B48CE,color:#fff
style WHISPER fill:#7B68EE,stroke:#5B48CE,color:#fff
style PREPROC fill:#7B68EE,stroke:#5B48CE,color:#fff
style BUILD fill:#7B68EE,stroke:#5B48CE,color:#fff
style TRAINER fill:#FF6B6B,stroke:#CC4444,color:#fff
style MERGE fill:#FF8C42,stroke:#CC6A2E,color:#fff
style EVALUATOR fill:#F5A623,stroke:#C7851A,color:#fff
style MMLU fill:#F5A623,stroke:#C7851A,color:#fff
style PPL fill:#F5A623,stroke:#C7851A,color:#fff
style GEN fill:#F5A623,stroke:#C7851A,color:#fff
style VLLM fill:#50C878,stroke:#3AA862,color:#fff
style REPL fill:#50C878,stroke:#3AA862,color:#fff
style PUB fill:#4ECDC4,stroke:#36B5AC,color:#fff
graph LR
A["HuggingFace\nDatasets"] --> N["Format\nNormalizer"]
B["Audio\nRecordings"] --> W["Whisper\nTranscriber"]
W --> N
N --> C["Text\nCleaner"]
C --> D["MinHash\nDedup"]
D --> E{"Language\nFilter"}
E -->|Pass| F["SFT Dataset\n+ Eval Split"]
E -->|Reject| G["Filtered Out"]
style A fill:#4A90D9,stroke:#2C5F8A,color:#fff
style B fill:#4A90D9,stroke:#2C5F8A,color:#fff
style W fill:#7B68EE,stroke:#5B48CE,color:#fff
style N fill:#7B68EE,stroke:#5B48CE,color:#fff
style C fill:#7B68EE,stroke:#5B48CE,color:#fff
style D fill:#7B68EE,stroke:#5B48CE,color:#fff
style E fill:#F5A623,stroke:#C7851A,color:#fff
style F fill:#50C878,stroke:#3AA862,color:#fff
style G fill:#E74C3C,stroke:#C0392B,color:#fff
| Feature | Description |
|---|---|
| Format normalization | Alpaca, ShareGPT, raw text, DPO — all converted to unified SFT format |
| Whisper transcription | Audio-to-text with language forcing and log-probability confidence filtering |
| MinHash deduplication | Near-duplicate removal across large corpora |
| Language detection | Keyword-overlap heuristic for Turkic languages (Turkish, Azerbaijani) with extensible marker sets |
| Quality filtering | Minimum length enforcement, character validation, configurable thresholds |
graph TD
CFG["YAML Config\n(inherits base.yaml)"] --> LOAD["Load Base Model\n4-bit NF4 Quantization"]
LOAD --> LORA["Attach LoRA Adapters\n(r=32, α=64)"]
LORA --> OPT{"Unsloth\nAvailable?"}
OPT -->|Yes| FAST["Unsloth Optimized\n(2-5x speedup)"]
OPT -->|No| STD["Standard PEFT\n(Fallback)"]
FAST --> TRAIN["Train\n(fp16, gradient checkpoint,\nearly stopping)"]
STD --> TRAIN
TRAIN --> SAVE["Save Adapter\n+ Merge to Base"]
style CFG fill:#4A90D9,stroke:#2C5F8A,color:#fff
style LOAD fill:#7B68EE,stroke:#5B48CE,color:#fff
style LORA fill:#7B68EE,stroke:#5B48CE,color:#fff
style OPT fill:#F5A623,stroke:#C7851A,color:#fff
style FAST fill:#50C878,stroke:#3AA862,color:#fff
style STD fill:#FF8C42,stroke:#CC6A2E,color:#fff
style TRAIN fill:#FF6B6B,stroke:#CC4444,color:#fff
style SAVE fill:#4ECDC4,stroke:#36B5AC,color:#fff
| Constraint | Design Decision |
|---|---|
| fp16 only | V100 (Volta) does not support bf16 — all configs enforce fp16: true, bf16: false |
| 4-bit QLoRA | NF4 quantization enables 7B-9B models on 16-32GB VRAM |
| Gradient checkpointing | Trades compute for memory — critical for 9B models on 32GB V100 |
| Unsloth fast path | 2-5x training speedup when available, transparent PEFT fallback when not |
| Early stopping | EarlyStoppingOnPlateau with configurable patience and min-delta |
| Config inheritance | base.yaml → model-specific YAML via _base key, deep merge semantics |
| Benchmark | Method | Pass Threshold | Scope |
|---|---|---|---|
| Turkish MMLU | lm-evaluation-harness (turkishmmlu task) |
≥0.40 accuracy | Broad academic knowledge |
| Perplexity | Cross-entropy on held-out eval set | <50.0 | Language modeling quality |
| Generation Quality | Heuristic scoring on 10 diverse Turkish prompts | ≥3.5/5.0 | Fluency, coherence, character usage |
graph TD
CMD["forge train / evaluate / serve / publish"] --> GUARD{"Runtime\nGuard"}
GUARD -->|SSH Session| ALLOW["Execute"]
GUARD -->|CI Job| ALLOW
GUARD -->|FORGE_EXECUTION_CONTEXT=remote| ALLOW
GUARD -->|Local Shell| BLOCK["Block with\nRuntimeError"]
BLOCK -->|FORGE_ALLOW_LOCAL=1| ALLOW
style CMD fill:#4A90D9,stroke:#2C5F8A,color:#fff
style GUARD fill:#F5A623,stroke:#C7851A,color:#fff
style ALLOW fill:#50C878,stroke:#3AA862,color:#fff
style BLOCK fill:#E74C3C,stroke:#C0392B,color:#fff
Model-loading commands are blocked on local development machines by default. Only SSH sessions, CI jobs, and explicitly marked remote contexts are allowed — preventing accidental multi-hour GPU processes on laptops.
graph LR
MODEL["Merged Model"] --> VLLM["vLLM\n(systemd, SSH deploy)"]
MODEL --> DOCKER["Docker Compose\n(train + serve)"]
MODEL --> COG["Replicate\n(Cog)"]
MODEL --> HUB["HF Hub\n(auto model card)"]
VLLM --> API["OpenAI-compatible\n/v1/completions"]
DOCKER --> API
COG --> RAPI["Replicate API"]
style MODEL fill:#FF8C42,stroke:#CC6A2E,color:#fff
style VLLM fill:#50C878,stroke:#3AA862,color:#fff
style DOCKER fill:#50C878,stroke:#3AA862,color:#fff
style COG fill:#50C878,stroke:#3AA862,color:#fff
style HUB fill:#4ECDC4,stroke:#36B5AC,color:#fff
style API fill:#7B68EE,stroke:#5B48CE,color:#fff
style RAPI fill:#7B68EE,stroke:#5B48CE,color:#fff
| Target | Method | Features |
|---|---|---|
| vLLM | SSH + user-level systemd | Versioned model dirs, atomic symlink switching, API key enforcement, DGX Spark eager mode |
| Docker | docker compose |
Training + serving containers, GPU passthrough |
| Replicate | Cog build + push | Managed inference API |
| HuggingFace Hub | Auto model card generation | Training config, eval results, usage examples embedded in card |
graph TD
TEMPLATE["configs/data/template.yaml"] --> TR["Turkish\n(primary)"]
TEMPLATE --> AZ["Azerbaijani"]
TEMPLATE --> NEW["Your Language\n(copy template)"]
TR --> MARKERS["Language Markers\n(keyword detection)"]
AZ --> MARKERS
NEW --> MARKERS
style TEMPLATE fill:#4A90D9,stroke:#2C5F8A,color:#fff
style TR fill:#50C878,stroke:#3AA862,color:#fff
style AZ fill:#50C878,stroke:#3AA862,color:#fff
style NEW fill:#F5A623,stroke:#C7851A,color:#fff
style MARKERS fill:#7B68EE,stroke:#5B48CE,color:#fff
| Language | Config | Datasets | Status |
|---|---|---|---|
| Turkish | configs/data/turkish.yaml |
Turkce-Instruct-Merged, turkish-text-data | Primary |
| Azerbaijani | configs/data/azerbaijani.yaml |
AzInstruct_merged | Configured |
| New language | Copy configs/data/template.yaml |
Bring your own datasets | Template ready |
Adding a new language requires:
- A data config YAML with HuggingFace dataset sources
- Optionally, language marker words in
preprocessor.pyfor detection filtering - A model config inheriting from
configs/base.yaml
| Model | Base Architecture | V100 Compatible | Config | Trained Output |
|---|---|---|---|---|
| Turkcell-LLM-7b-v1 | Mistral | Yes (primary target) | configs/models/turkcell_7b.yaml |
Turkish-LLM-7B-Instruct |
| Qwen2.5-14B | Qwen2.5 | Yes (A100 recommended) | configs/models/qwen_14b.yaml |
Turkish-LLM-14B-Instruct |
| Qwen2.5-32B | Qwen2.5 | No (A100 required) | configs/models/qwen_32b.yaml |
Turkish-LLM-32B-Instruct |
| wiroai-turkish-llm-9b | Gemma | Yes (tight on 32GB) | configs/models/wiroai_9b.yaml |
— |
| cere-llama-3-8b-tr | Llama 3 | Yes | configs/models/llama3_8b_tr.yaml |
— |
# 1. Install
uv sync --extra dev
# 2. Download and preprocess Turkish data
forge download --config configs/data/turkish.yaml
# 3. Fine-tune on remote GPU host (SSH session)
forge train --config configs/models/turkcell_7b.yaml
# 4. Merge adapters into base model
forge merge --base-model TURKCELL/Turkcell-LLM-7b-v1 \
--adapter artifacts/training/turkcell-7b-sft-v1/final \
--output artifacts/merged/turkcell-7b-turkish-v1
# 5. Evaluate
forge evaluate --model artifacts/merged/turkcell-7b-turkish-v1
# 6. Publish to HuggingFace Hub
forge publish --model-dir artifacts/merged/turkcell-7b-turkish-v1 \
--hub-repo ogulcanaydogan/turkcell-7b-turkish-sft \
--training-config configs/models/turkcell_7b.yamlAll pipeline stages are accessible via the forge CLI:
| Command | Purpose |
|---|---|
forge download --config ... |
Download and preprocess training data |
forge transcribe --audio-dir ... |
Transcribe audio files via Whisper |
forge train --config ... |
Run QLoRA fine-tuning |
forge evaluate --model ... |
Run evaluation benchmarks |
forge merge --base-model ... --adapter ... --output ... |
Merge LoRA adapters into base model |
forge serve --config ... |
Start vLLM inference server |
forge publish --model-dir ... --hub-repo ... |
Publish model to HuggingFace Hub |
forge benchmark --base-url ... |
Benchmark an OpenAI-compatible endpoint |
All commands support --help for full option documentation. Run make help to see all Makefile targets.
Interactive Jupyter notebooks for exploration and analysis:
| Notebook | Purpose |
|---|---|
notebooks/01_data_exploration.ipynb |
Dataset statistics, language distribution, preprocessing quality, sample inspection |
notebooks/02_training_analysis.ipynb |
Loss curves, learning rate schedules, LoRA weight analysis, WandB integration |
uv sync --extra dev
make help # Show all available targets
make qa # Run all quality gates (lint + typecheck + test)
make test # pytest
make lint # ruff
make typecheck # mypy| Check | Tool | Status |
|---|---|---|
| Lint | ruff | 0 issues |
| Types | mypy (strict) | 0 issues in 25 source files |
| Tests | pytest | 78 passed |
| Coverage | pytest-cov | 47% (threshold: 40%) |
| Stage | Minimum | Recommended |
|---|---|---|
| Training (7B QLoRA) | V100 16GB | V100 32GB |
| Training (9B QLoRA) | V100 32GB | A100 40GB |
| Inference | 16GB VRAM | DGX Spark (GB10) |
| Data Pipeline | CPU only | 16GB RAM |
| Audio Transcription | CPU (slow) | GPU with 8GB VRAM |
Apache-2.0