A minimal fine-tuning repo for LFM2, fully built on Open Source.
⚠️ Important
- Hardware: We tested this tool on an H100 80GB GPU. Multi-GPU parallelization has been tested with up to 8 such GPUs.
- Operating system: This tool currently supports Linux machines with the x86_64 architecture.
- Python: Make sure you are running Python >= 3.12.
- Access token: Make sure you are logged in on Hugging Face to access models and datasets (see the example below).
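One way to authenticate, for example, is via the Hugging Face CLI (setting the `HF_TOKEN` environment variable also works):

```bash
# Log in interactively with your Hugging Face access token
huggingface-cli login
```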
For feature requests or if you have a different setup, reach out to support@liquid.ai and tell us about your specific configuration.
```bash
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repository and install dependencies
git clone <repository-url>
cd leap_finetune
uv sync
```

Go to `config.py` and follow the instructions there:
- Use `DatasetLoader` to load datasets from the HuggingFace Hub or local files (you can also add custom data loading logic here, as long as it's TRL compatible)
- Pick a default `TrainingConfig` and optionally override some of the config parameters. Pick a `PeftConfig`.
- Create a `JobConfig` with your desired settings (model, dataset, etc.); a rough sketch follows below
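As a rough sketch of that wiring (the import locations, enum members, and `JobConfig` fields shown here are assumptions for illustration; the inline instructions in `config.py` are the source of truth):

```python
# Illustrative sketch only -- names marked "assumed"/"hypothetical" are not guaranteed by the repo
from leap_finetune import DatasetLoader, JobConfig            # assumed import locations
from leap_finetune.configs import PeftConfig, TrainingConfig  # enums referenced in this README

dataset = DatasetLoader("HuggingFaceTB/smoltalk", "sft", subset="all")

job_config = JobConfig(
    job_name="my-sft-run",                 # hypothetical field
    model_name="LiquidAI/LFM2-1.2B",       # hypothetical field; an LFM2 checkpoint on the Hub
    dataset=dataset,
    training_config=TrainingConfig.SFT,    # hypothetical enum member: a default training config
    peft_config=PeftConfig.LORA,           # hypothetical enum member: optional LoRA setup
    user_config={"num_train_epochs": 1},   # hypothetical override keys
)
```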
Run locally:
```bash
uv run leap-finetune
```

It uses Ray Train + Accelerate for distributed training.
Unless you override `output_dir`, results will be stored in `outputs/<training_type>/<job_name>/`.
To enable experiment tracking (using Weights & Biases):
- Set `wandb_logging=True` in `config.py`, either in your `user_config` overrides or in the default configs.
- Offline mode (default): If no `WANDB_API_KEY` is set, wandb logs locally to the `./wandb/` directory. No API key needed!
- Online mode: Set the `WANDB_API_KEY` environment variable to sync to the wandb.ai dashboard:

```bash
export WANDB_API_KEY=your_api_key  # optional; for online syncing to wandb.ai
```

You can also customize the project name (defaults to "leap-finetune"):

```bash
export WANDB_PROJECT=my-custom-project  # optional; defaults to "leap-finetune"
```

After training, view your metrics:
- Online mode: View at `https://wandb.ai/<your-entity>/<project-name>/runs/<run-name>`
- Offline mode: Sync later with `wandb sync ./wandb/offline-run-*`, or view the logs locally
Runs are named after your `job_name`, and metrics are reported via TRL/Transformers. Training metrics (loss, learning rate, etc.) are logged every `logging_steps` steps (default: 10), and evaluation metrics are logged at the end of each epoch.
When training is done, you can bundle your output checkpoint with `leap-bundle` to use it directly within LEAP. Check out our Quick Start guide.
Datasets must follow one of the formats below, matching the `dataset_type` you pass to `DatasetLoader`.

SFT format (`"sft"`):

```json
{
  "messages": [
    { "role": "user", "content": "What is the capital of France?" },
    { "role": "assistant", "content": "The capital of France is Paris." }
  ]
}
```

DPO format (`"dpo"`):

```json
{
  "prompt": "What is the capital of France?",
  "chosen": "The capital of France is Paris.",
  "rejected": "The capital of France is London."
}
```

VLM SFT format (`"vlm_sft"`):

```json
{
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "You are an image-based assistant. Answer questions based on the provided image."
        }
      ]
    },
    {
      "role": "user",
      "content": [
        { "type": "image", "image": "/path/to/image.jpg" },
        { "type": "text", "text": "What do you see in this image?" }
      ]
    },
    {
      "role": "assistant",
      "content": [{ "type": "text", "text": "I see a car in the image." }]
    }
  ]
}
```

Note: VLM datasets commonly keep images in a separate column that is referenced from the `messages` column. If your image URLs or paths live in a separate column, you'll need to merge them into the `messages` structure as shown above.
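If your images do live in a separate column, a `preprocess_fn` (described in the preprocessing section below) can merge them in. A minimal sketch, assuming an `image` column and messages whose entries already use content-part lists (both assumptions about your data):

```python
import ray.data


def merge_image_column(ds: ray.data.Dataset) -> ray.data.Dataset:
    """Move a standalone image column into the first user message."""

    def merge(row: dict) -> dict:
        messages = row["messages"]
        for message in messages:
            if message["role"] == "user":
                # Prepend the image part to the user turn's content list
                message["content"] = [{"type": "image", "image": row["image"]}] + list(message["content"])
                break
        return {"messages": messages}

    return ds.map(merge)


# Hypothetical usage with DatasetLoader
DatasetLoader("/path/to/vlm-data.parquet", "vlm_sft", preprocess_fn=merge_image_column)
```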
The default configurations are located in:
- SFT Training: `src/leap_finetune/configs/sft_configs.py`
- DPO Training: `src/leap_finetune/configs/dpo_configs.py`
- PEFT/LoRA: `src/leap_finetune/configs/peft_configs.py`
To add a new training configuration, add it to the respective file and then reference it in `src/leap_finetune/configs/__init__.py` in the `TrainingConfig` and/or `PeftConfig` enum.
We also support Liger Kernel, and it comes pre-installed.
Just add `"use_liger_kernel": True` to your `user_config`, as in the sketch below.
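A minimal sketch (only `"use_liger_kernel"` comes from this section; the other key is an illustrative override):

```python
# Hypothetical user_config overrides; "use_liger_kernel" enables the pre-installed Liger Kernel
user_config = {
    "use_liger_kernel": True,
    "logging_steps": 10,  # assumed TRL-style override, shown only as an example
}
```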
`DatasetLoader` supports multiple data sources with automatic format detection and validation.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `dataset_path` | `str` | required | Path to dataset (local, cloud, or HuggingFace Hub ID) |
| `dataset_type` | `"sft"` \| `"dpo"` \| `"vlm_sft"` | required | Training format type |
| `limit` | `int` | `None` | Limit the number of samples (useful for testing) |
| `split` | `str` | `"train"` | Dataset split to use |
| `test_size` | `float` | `0.2` | Fraction of data for evaluation |
| `subset` | `str` | `None` | Dataset subset (for HuggingFace datasets with configs) |
Local files:

```python
# JSONL file
DatasetLoader("/path/to/data.jsonl", "sft")

# Parquet file (faster for large datasets)
DatasetLoader("/path/to/data.parquet", "sft")
```

HuggingFace Hub:

```python
# Public dataset
DatasetLoader("HuggingFaceTB/smoltalk", "sft", subset="all")

# Private dataset (requires HF login)
DatasetLoader("your-org/private-dataset", "sft")
```

Cloud storage requires the appropriate credentials to be configured (AWS credentials, GCP service account, Azure credentials):

```python
# Amazon S3
DatasetLoader("s3://bucket/path/to/data.parquet", "sft")
DatasetLoader("s3://bucket/path/to/data.jsonl", "sft")

# Google Cloud Storage
DatasetLoader("gs://bucket/path/to/data.parquet", "sft")

# Azure Blob Storage
DatasetLoader("az://container/path/to/data.parquet", "sft")
DatasetLoader("abfs://container@account.dfs.core.windows.net/path/data.parquet", "sft")
```

Use `limit` to quickly test your pipeline with a subset of the data:

```python
# Test with 100 samples
DatasetLoader("HuggingFaceTB/smoltalk", "sft", subset="all", limit=100)

# Full dataset
DatasetLoader("HuggingFaceTB/smoltalk", "sft", subset="all")
```

| Format | Best For | Notes |
|---|---|---|
| Parquet | Large datasets (>100K rows) | Columnar format, fast reads, smaller file size |
| JSONL | Smaller datasets, human-readable | Line-delimited JSON, easy to inspect |
| HuggingFace | Public datasets | Automatic streaming, no local storage needed |
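If you want to convert a JSONL dataset to Parquet, one quick way is with pandas (a sketch; pandas/pyarrow are not necessarily dependencies of this repo):

```python
import pandas as pd  # requires pyarrow (or fastparquet) for Parquet output

# Read line-delimited JSON and write a Parquet file
df = pd.read_json("data.jsonl", lines=True)
df.to_parquet("data.parquet", index=False)
```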
For datasets that need reformatting, filtering, or joining before training, use the `preprocess_fn` parameter. This function receives a Ray Dataset and must return a Ray Dataset in the expected format.
```python
import ray.data


def my_preprocess(ds: ray.data.Dataset) -> ray.data.Dataset:
    """Custom preprocessing - runs before validation."""
    # Example: Filter rows where content length > 100
    ds = ds.filter(lambda row: len(row.get("content", "")) > 100)

    # Example: Transform column names
    ds = ds.map(lambda row: {
        "messages": [
            {"role": "user", "content": row["input"]},
            {"role": "assistant", "content": row["output"]},
        ]
    })

    # Example: Sample 10% of data
    ds = ds.random_sample(0.1)

    return ds


# Use with DatasetLoader
DatasetLoader(
    "path/to/raw-data.jsonl",
    "sft",
    preprocess_fn=my_preprocess,
)
```

Common preprocessing operations:
| Operation | Ray Data Method |
|---|---|
| Filter rows | `ds.filter(lambda row: condition)` |
| Transform rows | `ds.map(lambda row: new_row)` |
| Batch transform | `ds.map_batches(fn, batch_format="pandas")` |
| Sample data | `ds.random_sample(fraction)` |
| Drop columns | `ds.drop_columns(["col1", "col2"])` |
| Rename columns | `ds.map(lambda row: {new_name: row[old_name], ...})` |
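For instance, a batched transform via `map_batches` might look like this (a minimal sketch; the `prompt` column and length threshold are illustrative):

```python
import pandas as pd
import ray.data


def drop_long_prompts(batch: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows whose prompt is shorter than 2,000 characters."""
    return batch[batch["prompt"].str.len() < 2000]


def batch_preprocess(ds: ray.data.Dataset) -> ray.data.Dataset:
    # batch_format="pandas" hands each batch to the function as a DataFrame
    return ds.map_batches(drop_long_prompts, batch_format="pandas")
```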
See the Ray Data documentation for all available operations.
- Hook `pre-commit` to git: `uv run pre-commit install`
- Open a PR with your changes

Pre-commit will now run automatically on commits, or run it manually:

```bash
uv run pre-commit run --all-files
```

Please include a thorough description of changes and additions in your PR.