
LowResource-LLM-Forge

Sovereign fine-tuning pipeline that takes low-resource languages from raw data to production inference — data collection, audio transcription, QLoRA training, evaluation, and serving — with V100-first GPU optimization and multi-language support out of the box.



Why This Exists

Large language models perform well on high-resource languages (English, Chinese, German) but degrade significantly on low-resource languages — Turkish, Azerbaijani, Swahili, Kurdish — where pretraining data is scarce and evaluation benchmarks don't exist. Existing fine-tuning tools assume English data, unlimited GPU budgets, and established evaluation infrastructure.

This creates a structural barrier:

  • No training pipelines designed for languages with fragmented, low-quality datasets scattered across HuggingFace, audio archives, and scraped corpora.
  • No hardware-aware optimization for teams working with V100s instead of A100/H100 clusters — most QLoRA tools default to bf16, which V100s don't support.
  • No evaluation methodology for languages without standardized benchmarks — you can't measure what you can't test.
  • No end-to-end path from raw multilingual data to a deployed, benchmarked, published model that others can use.

LowResource-LLM-Forge solves this by providing a single pipeline that handles the entire lifecycle: collect and clean multilingual data, transcribe audio, fine-tune with V100-optimized QLoRA, evaluate with language-specific benchmarks, serve via vLLM, and publish to HuggingFace Hub — all from one CLI.

The project targets NLP researchers, language preservation teams, and engineering groups building sovereign AI capabilities for underserved language communities.


Training Results

Models trained with this pipeline are published on HuggingFace:

| Model | Base | Method | HuggingFace |
|---|---|---|---|
| Turkish-LLM-7B-Instruct | Turkcell-LLM-7b-v1 | SFT | ogulcanaydogan/Turkish-LLM-7B-Instruct |
| Turkish-LLM-14B-Instruct | Qwen2.5-14B | SFT + DPO | ogulcanaydogan/Turkish-LLM-14B-Instruct |
| Turkish-LLM-32B-Instruct | Qwen2.5-32B | SFT | ogulcanaydogan/Turkish-LLM-32B-Instruct |

GGUF quantized versions are also available for local inference with llama.cpp and Ollama. All published models were trained on A100 GPUs.

Benchmark Results (MMLU_TR, XNLI_TR, XCOPA_TR)

| Model | MMLU_TR | XNLI_TR | XCOPA_TR | Notes |
|---|---|---|---|---|
| Qwen2.5-14B (base) | 59.47% | 41.53% | 66.80% | Pre-training baseline |
| + SFT v3 (144K samples) | 59.38% | 42.89% | 65.40% | Instruction tuning |
| + DPO v3 | 59.42% | 43.33% | 66.00% | Preference alignment |
| + SFT v4 (refined data) | 59.77% | 42.14% | 65.60% | Best MMLU_TR |

Honest assessment: SFT and DPO fine-tuning preserved the base model's strong benchmark performance while adding Turkish instruction-following capability. Academic benchmarks like MMLU measure factual knowledge that is largely fixed during pretraining — they are not designed to capture the improvements that SFT actually provides: coherent multi-turn dialogue, proper Turkish character handling, instruction adherence, and reduced hallucination in Turkish responses.

The real value of this pipeline is not in moving benchmark numbers — it is in turning a base model that outputs fragmented, inconsistent Turkish into one that can hold a structured conversation in Turkish.

V100 Engineering Challenges Solved

The pipeline is designed to run on V100 GPUs (Volta architecture, compute capability 7.0), which present real engineering problems that don't exist on A100/H100. These were identified and fixed during V100 training runs:

| Problem | Root Cause | Solution |
|---|---|---|
| NaN loss during training | BitsAndBytes 4-bit + AMP autocast produces NaN on pre-Ampere GPUs when bnb_4bit_compute_dtype=float16 | Detect GPU compute capability at runtime; fall back to float32 compute dtype on pre-Ampere hardware |
| No bf16 support | V100 only supports fp16, but most QLoRA guides and configs assume bf16 | All configs enforce fp16: true, bf16: false; training scripts validate dtype before launch |
| Loss spikes and overflow | fp16 dynamic range (max ~65504) causes gradient overflow without GradScaler | Added loss spike detection, consecutive-zero guard, and adaptive max_grad_norm tuning |
| OOM on 9B models | 32GB V100 VRAM is tight for 9B QLoRA with batch size > 1 | Gradient checkpointing + per-model memory budgets in config; automatic batch size reduction |

These fixes are baked into the pipeline — users targeting V100 hardware get stable training out of the box without needing to debug dtype and quantization issues themselves.
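The compute-capability fallback from the first row reduces to a small check. A minimal sketch, assuming the capability tuple comes from PyTorch's torch.cuda.get_device_capability(); the helper name is hypothetical, not the pipeline's actual API:

```python
def select_compute_dtype(capability: tuple[int, int]) -> str:
    """Pick a safe bnb_4bit_compute_dtype name for a GPU.

    In a real run the tuple would come from
    torch.cuda.get_device_capability(); it is a parameter here so the
    logic stays framework-free.
    """
    major, _minor = capability
    if major >= 8:
        return "bfloat16"  # Ampere (8.x) and newer support bf16 natively
    # Pre-Ampere (e.g. V100 = 7.0): float16 compute under AMP can produce
    # NaN with 4-bit BitsAndBytes, so fall back to float32.
    return "float32"
```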


How It Works

graph LR
    A["Data Sources"] --> B["Data Pipeline"]
    B --> C["QLoRA Training"]
    C --> D["Evaluation"]
    D --> E["Inference Serving"]
    D --> F["HF Hub Publish"]

    style A fill:#4A90D9,stroke:#2C5F8A,color:#fff
    style B fill:#7B68EE,stroke:#5B48CE,color:#fff
    style C fill:#FF6B6B,stroke:#CC4444,color:#fff
    style D fill:#F5A623,stroke:#C7851A,color:#fff
    style E fill:#50C878,stroke:#3AA862,color:#fff
    style F fill:#4ECDC4,stroke:#36B5AC,color:#fff

Upload datasets or audio in any supported language — the pipeline normalizes formats, removes duplicates, filters by language, trains a QLoRA adapter on V100-compatible settings, evaluates against configurable benchmarks, and deploys the merged model for inference.


Architecture

graph TD
    subgraph DATA["Data Pipeline"]
        HF["HuggingFace\nDatasets"] --> COLLECT["DataCollector\n(alpaca, sharegpt,\nraw_text, dpo)"]
        AUDIO["Audio Files"] --> WHISPER["WhisperTranscriber\n(language forcing,\nconfidence filter)"]
        WHISPER --> COLLECT
        COLLECT --> PREPROC["DataPreprocessor\n(clean, MinHash dedup,\nlang-filter)"]
        PREPROC --> BUILD["DatasetBuilder\n(train/eval split)"]
    end

    subgraph TRAIN["Training Pipeline"]
        BUILD --> TRAINER["ForgeTrainer\n(Unsloth QLoRA,\nPEFT fallback)"]
        TRAINER --> MERGE["LoRAMerger\n(merge + tokenizer patch)"]
    end

    subgraph EVAL["Evaluation Pipeline"]
        MERGE --> EVALUATOR["ForgeEvaluator"]
        EVALUATOR --> MMLU["Turkish MMLU\n(lm-eval)"]
        EVALUATOR --> PPL["Perplexity\n(held-out text)"]
        EVALUATOR --> GEN["Generation Quality\n(heuristic scoring)"]
    end

    subgraph SERVE["Serving & Publishing"]
        MERGE --> VLLM["vLLM Server\n(systemd, Docker)"]
        MERGE --> REPL["Replicate\n(Cog)"]
        MERGE --> PUB["HF Hub Publish\n(auto model card)"]
    end

    style DATA fill:#1a1a2e,stroke:#4A90D9,color:#fff
    style TRAIN fill:#1a1a2e,stroke:#FF6B6B,color:#fff
    style EVAL fill:#1a1a2e,stroke:#F5A623,color:#fff
    style SERVE fill:#1a1a2e,stroke:#50C878,color:#fff
    style HF fill:#4A90D9,stroke:#2C5F8A,color:#fff
    style AUDIO fill:#4A90D9,stroke:#2C5F8A,color:#fff
    style COLLECT fill:#7B68EE,stroke:#5B48CE,color:#fff
    style WHISPER fill:#7B68EE,stroke:#5B48CE,color:#fff
    style PREPROC fill:#7B68EE,stroke:#5B48CE,color:#fff
    style BUILD fill:#7B68EE,stroke:#5B48CE,color:#fff
    style TRAINER fill:#FF6B6B,stroke:#CC4444,color:#fff
    style MERGE fill:#FF8C42,stroke:#CC6A2E,color:#fff
    style EVALUATOR fill:#F5A623,stroke:#C7851A,color:#fff
    style MMLU fill:#F5A623,stroke:#C7851A,color:#fff
    style PPL fill:#F5A623,stroke:#C7851A,color:#fff
    style GEN fill:#F5A623,stroke:#C7851A,color:#fff
    style VLLM fill:#50C878,stroke:#3AA862,color:#fff
    style REPL fill:#50C878,stroke:#3AA862,color:#fff
    style PUB fill:#4ECDC4,stroke:#36B5AC,color:#fff

Key Capabilities

Multi-Format Data Pipeline

graph LR
    A["HuggingFace\nDatasets"] --> N["Format\nNormalizer"]
    B["Audio\nRecordings"] --> W["Whisper\nTranscriber"]
    W --> N
    N --> C["Text\nCleaner"]
    C --> D["MinHash\nDedup"]
    D --> E{"Language\nFilter"}
    E -->|Pass| F["SFT Dataset\n+ Eval Split"]
    E -->|Reject| G["Filtered Out"]

    style A fill:#4A90D9,stroke:#2C5F8A,color:#fff
    style B fill:#4A90D9,stroke:#2C5F8A,color:#fff
    style W fill:#7B68EE,stroke:#5B48CE,color:#fff
    style N fill:#7B68EE,stroke:#5B48CE,color:#fff
    style C fill:#7B68EE,stroke:#5B48CE,color:#fff
    style D fill:#7B68EE,stroke:#5B48CE,color:#fff
    style E fill:#F5A623,stroke:#C7851A,color:#fff
    style F fill:#50C878,stroke:#3AA862,color:#fff
    style G fill:#E74C3C,stroke:#C0392B,color:#fff
| Feature | Description |
|---|---|
| Format normalization | Alpaca, ShareGPT, raw text, DPO — all converted to unified SFT format |
| Whisper transcription | Audio-to-text with language forcing and log-probability confidence filtering |
| MinHash deduplication | Near-duplicate removal across large corpora |
| Language detection | Keyword-overlap heuristic for Turkic languages (Turkish, Azerbaijani) with extensible marker sets |
| Quality filtering | Minimum length enforcement, character validation, configurable thresholds |
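The keyword-overlap heuristic can be sketched in a few lines. The marker words and threshold below are illustrative, not the pipeline's actual lists:

```python
# Illustrative marker sets; the real lists live in preprocessor.py.
TURKISH_MARKERS = {"ve", "bir", "bu", "için", "ile", "olarak", "daha"}

def marker_overlap(text: str, markers: set[str]) -> float:
    """Fraction of the text's tokens that are known marker words."""
    tokens = set(text.lower().split())
    if not tokens:
        return 0.0
    return len(tokens & markers) / len(tokens)

def passes_language_filter(text: str, markers: set[str],
                           threshold: float = 0.2) -> bool:
    """Keep a sample when enough of its tokens look like the target language."""
    return marker_overlap(text, markers) >= threshold
```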

V100-Optimized QLoRA Training

graph TD
    CFG["YAML Config\n(inherits base.yaml)"] --> LOAD["Load Base Model\n4-bit NF4 Quantization"]
    LOAD --> LORA["Attach LoRA Adapters\n(r=32, α=64)"]
    LORA --> OPT{"Unsloth\nAvailable?"}
    OPT -->|Yes| FAST["Unsloth Optimized\n(2-5x speedup)"]
    OPT -->|No| STD["Standard PEFT\n(Fallback)"]
    FAST --> TRAIN["Train\n(fp16, gradient checkpoint,\nearly stopping)"]
    STD --> TRAIN
    TRAIN --> SAVE["Save Adapter\n+ Merge to Base"]

    style CFG fill:#4A90D9,stroke:#2C5F8A,color:#fff
    style LOAD fill:#7B68EE,stroke:#5B48CE,color:#fff
    style LORA fill:#7B68EE,stroke:#5B48CE,color:#fff
    style OPT fill:#F5A623,stroke:#C7851A,color:#fff
    style FAST fill:#50C878,stroke:#3AA862,color:#fff
    style STD fill:#FF8C42,stroke:#CC6A2E,color:#fff
    style TRAIN fill:#FF6B6B,stroke:#CC4444,color:#fff
    style SAVE fill:#4ECDC4,stroke:#36B5AC,color:#fff
| Constraint | Design Decision |
|---|---|
| fp16 only | V100 (Volta) does not support bf16 — all configs enforce fp16: true, bf16: false |
| 4-bit QLoRA | NF4 quantization enables 7B-9B models on 16-32GB VRAM |
| Gradient checkpointing | Trades compute for memory — critical for 9B models on 32GB V100 |
| Unsloth fast path | 2-5x training speedup when available, transparent PEFT fallback when not |
| Early stopping | EarlyStoppingOnPlateau with configurable patience and min-delta |
| Config inheritance | base.yaml → model-specific YAML via _base key, deep merge semantics |
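The deep-merge semantics behind config inheritance can be illustrated with a small sketch. The function name and config keys here are hypothetical, not the pipeline's actual code:

```python
import copy

def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base: override wins on leaf values,
    while nested dicts are merged key by key rather than replaced wholesale."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# A model config only states what it changes relative to base.yaml:
base = {"training": {"fp16": True, "bf16": False, "lr": 2e-4}, "lora": {"r": 32}}
model = {"training": {"lr": 1e-4}}
# deep_merge(base, model) keeps fp16/bf16/lora and replaces only lr.
```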

Evaluation Framework

| Benchmark | Method | Pass Threshold | Scope |
|---|---|---|---|
| Turkish MMLU | lm-evaluation-harness (turkishmmlu task) | ≥0.40 accuracy | Broad academic knowledge |
| Perplexity | Cross-entropy on held-out eval set | <50.0 | Language modeling quality |
| Generation Quality | Heuristic scoring on 10 diverse Turkish prompts | ≥3.5/5.0 | Fluency, coherence, character usage |
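The perplexity gate reduces to a short computation over per-token negative log-likelihoods. A minimal sketch with illustrative function names:

```python
import math

def perplexity(token_nlls: list[float]) -> float:
    """Perplexity = exp(mean per-token negative log-likelihood),
    i.e. the exponentiated cross-entropy on the held-out set."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def passes_perplexity_gate(token_nlls: list[float],
                           threshold: float = 50.0) -> bool:
    """Apply the <50.0 pass threshold from the table above."""
    return perplexity(token_nlls) < threshold
```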

Remote-First Execution Safety

graph TD
    CMD["forge train / evaluate / serve / publish"] --> GUARD{"Runtime\nGuard"}
    GUARD -->|SSH Session| ALLOW["Execute"]
    GUARD -->|CI Job| ALLOW
    GUARD -->|FORGE_EXECUTION_CONTEXT=remote| ALLOW
    GUARD -->|Local Shell| BLOCK["Block with\nRuntimeError"]
    BLOCK -->|FORGE_ALLOW_LOCAL=1| ALLOW

    style CMD fill:#4A90D9,stroke:#2C5F8A,color:#fff
    style GUARD fill:#F5A623,stroke:#C7851A,color:#fff
    style ALLOW fill:#50C878,stroke:#3AA862,color:#fff
    style BLOCK fill:#E74C3C,stroke:#C0392B,color:#fff

Model-loading commands are blocked on local development machines by default. Only SSH sessions, CI jobs, and explicitly marked remote contexts are allowed — preventing accidental multi-hour GPU processes on laptops.

Deployment Options

graph LR
    MODEL["Merged Model"] --> VLLM["vLLM\n(systemd, SSH deploy)"]
    MODEL --> DOCKER["Docker Compose\n(train + serve)"]
    MODEL --> COG["Replicate\n(Cog)"]
    MODEL --> HUB["HF Hub\n(auto model card)"]

    VLLM --> API["OpenAI-compatible\n/v1/completions"]
    DOCKER --> API
    COG --> RAPI["Replicate API"]

    style MODEL fill:#FF8C42,stroke:#CC6A2E,color:#fff
    style VLLM fill:#50C878,stroke:#3AA862,color:#fff
    style DOCKER fill:#50C878,stroke:#3AA862,color:#fff
    style COG fill:#50C878,stroke:#3AA862,color:#fff
    style HUB fill:#4ECDC4,stroke:#36B5AC,color:#fff
    style API fill:#7B68EE,stroke:#5B48CE,color:#fff
    style RAPI fill:#7B68EE,stroke:#5B48CE,color:#fff
| Target | Method | Features |
|---|---|---|
| vLLM | SSH + user-level systemd | Versioned model dirs, atomic symlink switching, API key enforcement, DGX Spark eager mode |
| Docker | docker compose | Training + serving containers, GPU passthrough |
| Replicate | Cog build + push | Managed inference API |
| HuggingFace Hub | Auto model card generation | Training config, eval results, usage examples embedded in card |

Multi-Language Support

graph TD
    TEMPLATE["configs/data/template.yaml"] --> TR["Turkish\n(primary)"]
    TEMPLATE --> AZ["Azerbaijani"]
    TEMPLATE --> NEW["Your Language\n(copy template)"]

    TR --> MARKERS["Language Markers\n(keyword detection)"]
    AZ --> MARKERS
    NEW --> MARKERS

    style TEMPLATE fill:#4A90D9,stroke:#2C5F8A,color:#fff
    style TR fill:#50C878,stroke:#3AA862,color:#fff
    style AZ fill:#50C878,stroke:#3AA862,color:#fff
    style NEW fill:#F5A623,stroke:#C7851A,color:#fff
    style MARKERS fill:#7B68EE,stroke:#5B48CE,color:#fff
| Language | Config | Datasets | Status |
|---|---|---|---|
| Turkish | configs/data/turkish.yaml | Turkce-Instruct-Merged, turkish-text-data | Primary |
| Azerbaijani | configs/data/azerbaijani.yaml | AzInstruct_merged | Configured |
| New language | Copy configs/data/template.yaml | Bring your own datasets | Template ready |

Adding a new language requires:

  1. A data config YAML with HuggingFace dataset sources
  2. Optionally, language marker words in preprocessor.py for detection filtering
  3. A model config inheriting from configs/base.yaml
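Following those steps, a data config for a new language might look like the sketch below. Every field name here is illustrative; consult configs/data/template.yaml for the real schema:

```yaml
# Hypothetical data config copied from configs/data/template.yaml.
# Field names and dataset id are placeholders, not the actual schema.
language: sw                            # e.g. Swahili
datasets:
  - source: your-org/swahili-instruct   # placeholder HuggingFace dataset id
    format: alpaca
preprocessing:
  min_length: 32
  dedup: minhash
```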

Supported Models

| Model | Base Architecture | V100 Compatible | Config | Trained Output |
|---|---|---|---|---|
| Turkcell-LLM-7b-v1 | Mistral | Yes (primary target) | configs/models/turkcell_7b.yaml | Turkish-LLM-7B-Instruct |
| Qwen2.5-14B | Qwen2.5 | Yes (A100 recommended) | configs/models/qwen_14b.yaml | Turkish-LLM-14B-Instruct |
| Qwen2.5-32B | Qwen2.5 | No (A100 required) | configs/models/qwen_32b.yaml | Turkish-LLM-32B-Instruct |
| wiroai-turkish-llm-9b | Gemma | Yes (tight on 32GB) | configs/models/wiroai_9b.yaml | |
| cere-llama-3-8b-tr | Llama 3 | Yes | configs/models/llama3_8b_tr.yaml | |

Quick Start

# 1. Install
uv sync --extra dev

# 2. Download and preprocess Turkish data
forge download --config configs/data/turkish.yaml

# 3. Fine-tune on remote GPU host (SSH session)
forge train --config configs/models/turkcell_7b.yaml

# 4. Merge adapters into base model
forge merge --base-model TURKCELL/Turkcell-LLM-7b-v1 \
    --adapter artifacts/training/turkcell-7b-sft-v1/final \
    --output artifacts/merged/turkcell-7b-turkish-v1

# 5. Evaluate
forge evaluate --model artifacts/merged/turkcell-7b-turkish-v1

# 6. Publish to HuggingFace Hub
forge publish --model-dir artifacts/merged/turkcell-7b-turkish-v1 \
    --hub-repo ogulcanaydogan/turkcell-7b-turkish-sft \
    --training-config configs/models/turkcell_7b.yaml

CLI Reference

All pipeline stages are accessible via the forge CLI:

| Command | Purpose |
|---|---|
| forge download --config ... | Download and preprocess training data |
| forge transcribe --audio-dir ... | Transcribe audio files via Whisper |
| forge train --config ... | Run QLoRA fine-tuning |
| forge evaluate --model ... | Run evaluation benchmarks |
| forge merge --base-model ... --adapter ... --output ... | Merge LoRA adapters into base model |
| forge serve --config ... | Start vLLM inference server |
| forge publish --model-dir ... --hub-repo ... | Publish model to HuggingFace Hub |
| forge benchmark --base-url ... | Benchmark an OpenAI-compatible endpoint |

All commands support --help for full option documentation. Run make help to see all Makefile targets.


Notebooks

Interactive Jupyter notebooks for exploration and analysis:

| Notebook | Purpose |
|---|---|
| notebooks/01_data_exploration.ipynb | Dataset statistics, language distribution, preprocessing quality, sample inspection |
| notebooks/02_training_analysis.ipynb | Loss curves, learning rate schedules, LoRA weight analysis, WandB integration |

Development

uv sync --extra dev
make help          # Show all available targets
make qa            # Run all quality gates (lint + typecheck + test)
make test          # pytest
make lint          # ruff
make typecheck     # mypy

Quality Gate Status

| Check | Tool | Status |
|---|---|---|
| Lint | ruff | 0 issues |
| Types | mypy (strict) | 0 issues in 25 source files |
| Tests | pytest | 78 passed |
| Coverage | pytest-cov | 47% (threshold: 40%) |

Hardware Requirements

| Stage | Minimum | Recommended |
|---|---|---|
| Training (7B QLoRA) | V100 16GB | V100 32GB |
| Training (9B QLoRA) | V100 32GB | A100 40GB |
| Inference | 16GB VRAM | DGX Spark (GB10) |
| Data Pipeline | CPU only | 16GB RAM |
| Audio Transcription | CPU (slow) | GPU with 8GB VRAM |

License

Apache-2.0
