fix: added LoRA Fine-Tuning Script and moved from hot to cold swapping in llama.cpp (#10)
Conversation
Pull request overview
This PR introduces an automated LoRA fine-tuning pipeline (trainer + orchestration script + registry + scheduler) and updates the llama-cpp inference engine to “cold-swap” model instances when switching between base and LoRA-adapted inference (mirror mode).
Changes:
- Add a LoRA training implementation (`LoraTrainer`) plus a standalone `scripts/train_lora.py` orchestrator for syncing examples, training, GGUF conversion, validation, and promotion.
- Add an adapter version registry (`AdapterRegistry`) backed by a SQLite `model_versions` table, and update inference to load the promoted GGUF adapter via cold swap.
- Add a time-triggered scheduler (`TrainingScheduler`) and an internal endpoint to manually trigger training.
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| backend/src/idio/main.py | Starts/stops the training scheduler during FastAPI lifespan and replaces prints with logging. |
| backend/src/idio/infrastructure/scheduler/training_scheduler.py | New APScheduler-based subprocess spawner for periodic training runs. |
| backend/src/idio/infrastructure/config/settings.py | Adds training scheduler toggles/intervals and LoRA hyperparameters. |
| backend/src/idio/infrastructure/ai/lora_trainer.py | New LoRA fine-tuning + GGUF conversion + validation utilities. |
| backend/src/idio/infrastructure/ai/llm_engine.py | Switches mirror mode to cold-swap model instances with optional lora_path. |
| backend/src/idio/infrastructure/ai/adapter_registry.py | New SQLite-backed registry for registering/promoting active adapters. |
| backend/src/idio/api/routes/training.py | Adds /trigger-training internal endpoint to spawn training immediately. |
| backend/scripts/train_lora.py | New CLI training pipeline script that pulls data from internal endpoints and trains/promotes adapters. |
| backend/pyproject.toml | Adds a [training] extras group for training/scheduler dependencies. |
```python
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.triggers.interval import IntervalTrigger
from loguru import logger

from idio.infrastructure.config.settings import settings
```
apscheduler is imported at module load time, but it’s only declared in the optional [training] extras. This will crash the backend on startup (and during tests that import idio.main) even when training_enabled is false. Consider moving the APScheduler imports into start() (or guarding them behind training_enabled) so inference-only installs don’t require training deps.
Suggested change:

```diff
-from apscheduler.schedulers.background import BackgroundScheduler
-from apscheduler.triggers.interval import IntervalTrigger
 from loguru import logger

 from idio.infrastructure.config.settings import settings

+if settings.training_enabled:
+    from apscheduler.schedulers.background import BackgroundScheduler
+    from apscheduler.triggers.interval import IntervalTrigger
+else:
+    BackgroundScheduler = None  # type: ignore[assignment]
+    IntervalTrigger = None  # type: ignore[assignment]
```
```python
def _run_training_subprocess(self) -> None:
    """Spawn ``scripts/train_lora.py`` as a non-blocking subprocess.

    Stdout and stderr stream to the parent process's log. The process
    is run in the background so the scheduler job completes immediately
    and does not block the APScheduler thread pool.
    """
    if not _SCRIPT_PATH.exists():
        logger.error(
            "Training script not found at {}. Cannot run training.", _SCRIPT_PATH
        )
        return

    cmd = [
        sys.executable,
        str(_SCRIPT_PATH),
        "--api-url", self._api_url,
        "--api-key", self._api_key,
        "--min-examples", str(settings.training_min_examples),
        "--batch-size", str(settings.training_sync_batch_size),
    ]

    logger.info("Spawning training subprocess: {}.", " ".join(cmd))

    try:
        # Popen (non-blocking): scheduler job returns immediately.
        # Stdout/stderr inherit the parent process streams.
        subprocess.Popen(
            cmd,
            stdout=None,
            stderr=None,
            cwd=str(_SCRIPT_PATH.parent.parent),  # backend/
        )
```
Because the job uses subprocess.Popen(...) and returns immediately, APScheduler will consider the job complete while the training process is still running. That means subsequent intervals (or manual triggers) can start overlapping training runs, which is likely to exhaust CPU/GPU and corrupt output artifacts. Consider adding a lock/PID check around spawning (or track the child process handle) to ensure only one training subprocess runs at a time.
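One way to implement this is to keep the child's `Popen` handle and refuse to spawn while it is still alive. A minimal sketch under stated assumptions — `SingleFlightSpawner` is a hypothetical wrapper, not the PR's scheduler class:

```python
import subprocess
import sys
from typing import Optional


class SingleFlightSpawner:
    """Allow at most one training subprocess at a time."""

    def __init__(self) -> None:
        self._proc: Optional[subprocess.Popen] = None

    def is_running(self) -> bool:
        # poll() returns None while the child is still alive.
        return self._proc is not None and self._proc.poll() is None

    def spawn(self, cmd: list) -> bool:
        """Spawn `cmd` unless a previous run is still active.

        Returns True if a new process was started, False if skipped.
        """
        if self.is_running():
            return False  # previous training run still active; skip this tick
        self._proc = subprocess.Popen(cmd)
        return True
```

The scheduler job would call `spawn(...)` instead of `subprocess.Popen(...)` directly, so overlapping intervals and manual triggers become no-ops while training is in flight.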
```diff
 # Cold swap: destroy the existing instance to free RAM/VRAM before
 # loading the new one, avoiding a brief period of double occupancy.
 if _model_instance is not None:
     _memory_tracker.capture_baseline()
     logger.info(
         "Cold-swapping adapter: {} → {}.",
         _current_lora_path or "base",
         lora_path or "base",
     )
     del _model_instance
     _model_instance = None
     gc.collect()
     _memory_tracker.capture_after()
     _memory_tracker.log_spike("Cold-swap unload")

-print(f"[llm_engine] Loading model from {model_path}...")
-print(f"[llm_engine] Model size: {model_path.stat().st_size / (1024*1024):.1f} MB")
-
-_model_instance = Llama(
-    model_path=str(model_path),
-    n_ctx=512,
-    n_threads=4,
-    n_gpu_layers=0,
-    verbose=False
-)
-print("[llm_engine] Model loaded successfully!")
+_model_instance = _load_model(lora_path)
+_current_lora_path = lora_path
```
MemoryTracker.capture_after() is called before the new model is loaded (it runs right after deleting the old instance). This makes the reported “spike” reflect unload-only (often negative) rather than the memory impact of the cold-swap/load, which defeats the purpose of tracking adapter swap memory usage. Consider capturing baseline before unload and capturing after after _load_model(...) completes (or tracking unload and load separately with distinct measurements).
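The suggested ordering can be illustrated with a simplified stand-in for the PR's tracker — `SwapTracker`, `cold_swap`, and the memory readings below are illustrative, not the actual implementation:

```python
class SwapTracker:
    """Simplified stand-in for MemoryTracker: records two readings."""

    def __init__(self) -> None:
        self._baseline = 0.0
        self._after = 0.0

    def capture_baseline(self, reading_mb: float) -> None:
        self._baseline = reading_mb

    def capture_after(self, reading_mb: float) -> None:
        self._after = reading_mb

    def spike_mb(self) -> float:
        return self._after - self._baseline


def cold_swap(tracker, read_memory, unload_old, load_new):
    """Baseline before unload, 'after' once the new model is resident."""
    tracker.capture_baseline(read_memory())  # old model still loaded
    unload_old()                             # free the old instance
    model = load_new()                       # load the replacement
    tracker.capture_after(read_memory())     # new model loaded
    return model
```

With this ordering, the reported spike is the net memory cost of the whole swap rather than an unload-only (often negative) delta.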
```python
# Scheduler is imported lazily to avoid circular imports.
_scheduler = None


def _get_scheduler():
    global _scheduler
    if _scheduler is None:
        from idio.infrastructure.scheduler.training_scheduler import TrainingScheduler
        _scheduler = TrainingScheduler()
    return _scheduler
```
_get_scheduler() will raise at runtime if training dependencies (notably apscheduler) aren’t installed, because it imports TrainingScheduler without handling ImportError. Since this route is part of the running backend, consider catching ImportError here and returning a clear 503/500 error telling the operator to install .[training] (or otherwise ensure the scheduler’s imports are gated behind training_enabled).
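A hedged sketch of the guard, with the lazy import factored out as a callable so the logic is testable without `apscheduler` installed. `TrainingUnavailableError` and the install hint are illustrative; in the FastAPI route this exception would map to a 503 response:

```python
_scheduler = None


class TrainingUnavailableError(RuntimeError):
    """Raised when optional training dependencies are not installed."""


def get_scheduler(import_scheduler_cls):
    """Return the singleton scheduler, or raise a clear, actionable error.

    `import_scheduler_cls` stands in for the lazy TrainingScheduler import,
    so an ImportError from missing extras surfaces as an operator-friendly
    message instead of a bare traceback.
    """
    global _scheduler
    if _scheduler is None:
        try:
            scheduler_cls = import_scheduler_cls()
        except ImportError as exc:
            raise TrainingUnavailableError(
                "Training dependencies are missing; install them with "
                "`pip install .[training]`."
            ) from exc
        _scheduler = scheduler_cls()
    return _scheduler
```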
```python
    api_key: Optional[str] = None,
) -> None:
    self._interval_hours = interval_hours or settings.training_interval_hours
    self._api_url = api_url or f"http://{settings.host}:{settings.port}"
```
Defaulting api_url to http://{settings.host}:{settings.port} can break when the server binds to 0.0.0.0 (valid bind address but not a connect target). Consider defaulting the client URL to http://127.0.0.1:{settings.port} (or introducing a dedicated settings.internal_api_url) so the scheduler always connects reliably regardless of bind host.
Suggested change:

```diff
-self._api_url = api_url or f"http://{settings.host}:{settings.port}"
+self._api_url = api_url or f"http://127.0.0.1:{settings.port}"
```
```python
lora_path = _resolve_active_adapter() if use_adapter else None

if use_adapter and lora_path is None:
    logger.warning(
        "Mirror mode requested but no trained adapter found. "
        "Falling back to base model. Run training to generate an adapter."
    )

# get_model handles the actual cold-swap and memory tracking internally.
get_model(lora_path=lora_path)
```
_swap_adapter() always calls _resolve_active_adapter() in mirror mode, which instantiates AdapterRegistry and hits SQLite on every mirror request, even when the correct adapter is already loaded and get_model(...) would take the fast path. This adds avoidable latency to the hot path. Consider caching the resolved active adapter path in-memory and only re-resolving when a swap is actually needed (e.g., when _current_lora_path differs, or when a reload signal indicates a new promoted version).
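A possible caching scheme, sketched with stand-in names: `resolve_adapter_cached` and the `invalidated` flag are hypothetical, and the PR's `_resolve_active_adapter()` would be passed in as `resolve_fn`:

```python
_cached_adapter_path = None
_cache_valid = False


def resolve_adapter_cached(resolve_fn, invalidated: bool):
    """Return the active adapter path, hitting the registry only when needed.

    Args:
        resolve_fn: callable that queries the SQLite registry for the
            currently promoted adapter path (an expensive lookup).
        invalidated: True when a reload signal (e.g. a new promotion)
            indicates the cached path may be stale.
    """
    global _cached_adapter_path, _cache_valid
    if invalidated or not _cache_valid:
        _cached_adapter_path = resolve_fn()  # registry hit only on miss/invalidation
        _cache_valid = True
    return _cached_adapter_path
```

Repeated mirror requests then serve the adapter path from memory, and the registry is consulted only on first use or after a promotion invalidates the cache.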
| """Write a signal file so the server knows to hot-swap the adapter. | ||
|
|
||
| The server's health-check or a background watcher can poll this file. | ||
| Using a file avoids introducing an additional HTTP endpoint just for reloading. | ||
|
|
||
| Args: | ||
| adapter_path: Path to the newly active GGUF adapter. | ||
| """ | ||
| _RELOAD_SIGNAL_FILE.parent.mkdir(parents=True, exist_ok=True) | ||
| _RELOAD_SIGNAL_FILE.write_text( | ||
| json.dumps({ | ||
| "adapter_path": str(adapter_path), | ||
| "timestamp": time.time(), | ||
| }) | ||
| ) | ||
| logger.info("Reload signal written to {}.", _RELOAD_SIGNAL_FILE) |
_write_reload_signal() writes storage/.adapter_reload, but there’s no corresponding reader/watcher in backend/src (no code references .adapter_reload). As-is, this step won’t cause the server to reload/cold-swap the adapter and may mislead operators. Either implement the polling/watcher mechanism (or an explicit internal reload endpoint) or remove this signal file step to avoid dead/unused behavior.
| """Write a signal file so the server knows to hot-swap the adapter. | |
| The server's health-check or a background watcher can poll this file. | |
| Using a file avoids introducing an additional HTTP endpoint just for reloading. | |
| Args: | |
| adapter_path: Path to the newly active GGUF adapter. | |
| """ | |
| _RELOAD_SIGNAL_FILE.parent.mkdir(parents=True, exist_ok=True) | |
| _RELOAD_SIGNAL_FILE.write_text( | |
| json.dumps({ | |
| "adapter_path": str(adapter_path), | |
| "timestamp": time.time(), | |
| }) | |
| ) | |
| logger.info("Reload signal written to {}.", _RELOAD_SIGNAL_FILE) | |
| """Placeholder for signaling the server to hot-swap the adapter. | |
| NOTE: As of now, there is no watcher/endpoint in this backend that consumes | |
| a `.adapter_reload` file. This function is intentionally a no-op to avoid | |
| implying a reload mechanism that does not exist yet. | |
| Args: | |
| adapter_path: Path to the newly active GGUF adapter. | |
| """ | |
| # Intentionally do not write any reload signal file until a corresponding | |
| # watcher/endpoint is implemented on the server side. | |
| logger.warning( | |
| "Adapter reload requested for %s, but no reload mechanism is currently " | |
| "wired up in the backend; skipping reload signal emission.", | |
| adapter_path, | |
| ) |
```python
# Maximum examples returned in a single sync-training batch.
training_sync_batch_size: int = 100

# Minimum unsynced examples required before the scheduler fires training.
```
training_min_examples is described as “required before the scheduler fires training”, but the scheduler currently spawns the subprocess purely on an interval; the script enforces the minimum-unsynced threshold. Consider updating the comment to reflect where the check actually happens (or add a pre-check in the scheduler before spawning).
Suggested change:

```diff
-# Minimum unsynced examples required before the scheduler fires training.
+# Minimum unsynced examples the training script requires before running;
+# the scheduler still triggers based solely on ``training_interval_hours``.
```
Pull Request Checklist

Type:

For Backend (Dev R1, R2, M):

- … `domain/`, not direct imports.
- … `scripts/setup.py` to ensure dependencies are correct.
- … `localhost:45500`.
- … `infrastructure/persistence/db.py`.

For Frontend (Dev S):

Validation: