The backend serves as a central hub for the WARP (Weapon & Armor Recognition Program) ecosystem. It facilitates community-driven data collection and automated model training.
- FastAPI Service (`main.py`):
  - Handles contributions from mobile clients.
  - Serves the merged knowledge base and model version metadata.
  - Manages rate limiting and basic data validation.
- Model Trainer (`admin_train.py`):
  - A standalone script for democratic voting on contributions.
  - Trains two PyTorch models: `icon_classifier` (EfficientNet-B0) and `screen_classifier` (MobileNetV3-Small).
  - Builds `ship_type_corrections.json` from Ship Type / Ship Tier OCR correction pairs.
  - Uploads trained models and correction map to Hugging Face Hub.
- Data Storage (Hugging Face Dataset):
  - Stores raw contributions, crops, and screenshots.
  - Hosts the final models, metadata, and correction maps.
- Client sends a base64-encoded PNG crop + metadata to `/contribute`.
- Backend validates the image (size, uniformity) and uploads it to the `contributions/` folder in the HF Dataset.
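
The validation step can be sketched as follows. This is a minimal illustration, not the code in `main.py`: the function name `validate_crop` and the `MIN_SIDE` / `MAX_SIDE` limits are assumptions, and "size" is interpreted here as pixel dimensions read from the PNG header. The uniformity check is left out because it needs an image decoder to inspect pixel values.

```python
import base64
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# Hypothetical limits; the real thresholds are not stated in this document.
MIN_SIDE = 8
MAX_SIDE = 512


def validate_crop(b64_png: str) -> tuple[int, int]:
    """Decode a base64 PNG crop and return (width, height) if it passes basic checks."""
    try:
        raw = base64.b64decode(b64_png, validate=True)
    except ValueError as exc:  # binascii.Error is a ValueError subclass
        raise ValueError("payload is not valid base64") from exc
    # A PNG starts with an 8-byte signature followed by the IHDR chunk:
    # 4-byte length, 4-byte type, then big-endian width and height.
    if len(raw) < 24 or not raw.startswith(PNG_SIGNATURE) or raw[12:16] != b"IHDR":
        raise ValueError("payload is not a PNG image")
    width, height = struct.unpack(">II", raw[16:24])
    if not (MIN_SIDE <= width <= MAX_SIDE and MIN_SIDE <= height <= MAX_SIDE):
        raise ValueError(f"unexpected crop size {width}x{height}")
    return width, height
```

Reading the dimensions from the header keeps the check cheap: an oversized or malformed payload is rejected before any pixel data is decoded.
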
- Triggered via CLI, GitHub Actions Schedule, or Webhook Dispatcher.
- `admin_train.py` scans `staging/` folders in the dataset.
- Applies democratic voting on icon crops: 1 unique `install_id` = 1 vote.
- Downloads winning samples and trains/fine-tunes models using PyTorch.
- Uploads updated models and `model_version.json` to the model repository.
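
The "1 unique `install_id` = 1 vote" rule can be sketched like this. The function name `tally_votes`, the `(crop_sha, install_id, label)` tuple layout, and the tie handling (tied crops are skipped) are assumptions made for illustration; `admin_train.py` may group and break ties differently.

```python
from collections import Counter, defaultdict


def tally_votes(contributions):
    """Return {crop_sha: winning_label}.

    contributions: iterable of (crop_sha, install_id, label) tuples.
    """
    # One ballot per (crop, install): repeated submissions from the same
    # install overwrite each other instead of stacking up extra votes.
    ballots = {}
    for crop_sha, install_id, label in contributions:
        ballots[(crop_sha, install_id)] = label

    counts = defaultdict(Counter)
    for (crop_sha, _install_id), label in ballots.items():
        counts[crop_sha][label] += 1

    winners = {}
    for crop_sha, counter in counts.items():
        ranked = counter.most_common(2)
        # Assumption: a tie between the two top labels means no winner.
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            continue
        winners[crop_sha] = ranked[0][0]
    return winners
```

Keying ballots on the `(crop, install)` pair is what prevents a single device from outvoting others by resubmitting the same label.
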
- `collect_text_corrections()` filters staging annotations where `slot` is `Ship Type` or `Ship Tier`.
- For each `(ml_name, name)` pair where `ml_name != name`: votes are cast per `install_id`.
- Majority wins per `ml_name` key → `ship_type_corrections.json`: `{"F1eet Support Cruiser": "Fleet Support Cruiser", ...}`
- Uploaded to `sets-sto/warp-knowledge/models/ship_type_corrections.json`.
- Clients download and apply corrections in `text_extractor.py` before ShipDB lookup.
- Community anchors threshold: groups with n=1 contributor are accepted as tentative truth; n≥2 contributors use median aggregation. (Changed from n≥3 on 2026-03-31.)
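
A rough sketch of how the correction map could be built and applied. `build_corrections` and `apply_correction` are hypothetical names, the annotation dict keys mirror the fields named above, and ties are resolved by first-seen order here, which is an assumption rather than documented behavior.

```python
from collections import Counter, defaultdict

CORRECTION_SLOTS = {"Ship Type", "Ship Tier"}


def build_corrections(annotations):
    """Return {ml_name: corrected_name} from staging annotations.

    annotations: iterable of dicts with keys slot, ml_name, name, install_id.
    """
    # One ballot per (ml_name, install_id), so each install votes once per OCR string.
    ballots = {}
    for ann in annotations:
        if ann["slot"] not in CORRECTION_SLOTS:
            continue
        if ann["ml_name"] == ann["name"]:
            continue  # OCR output already matched; nothing to correct
        ballots[(ann["ml_name"], ann["install_id"])] = ann["name"]

    votes = defaultdict(Counter)
    for (ml_name, _install_id), name in ballots.items():
        votes[ml_name][name] += 1

    # most_common(1) picks the highest count; ties fall back to insertion order.
    return {ml_name: counter.most_common(1)[0][0] for ml_name, counter in votes.items()}


def apply_correction(text, corrections):
    """Client side: swap a known OCR misread for its corrected form."""
    return corrections.get(text, text)
```

On the client, the lookup runs before the ShipDB query, so an OCR misread such as `F1eet Support Cruiser` is normalized first and unknown strings pass through unchanged.
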
- `admin_merge.py` (or the `/admin/merge` endpoint) consolidates raw contributions into a unified `knowledge.json` mapping.
- Hosting: Render (FastAPI service).
- CI/CD: GitHub Actions (for automated training triggers). See .github/workflows/train_central_model.yml.
- Storage: Hugging Face Hub (Dataset repo).
The workflow (.github/workflows/train_central_model.yml) runs on a schedule (every hour) and on manual dispatch. Key characteristics:
- Runner: `ubuntu-latest`, no GPU. PyTorch is installed with `--index-url https://download.pytorch.org/whl/cpu`. Training runs on CPU.
- Timeout: 60 minutes hard cap.
- Skip logic: `--skip-if-unchanged` exits early (~60s) if no new crops have arrived since the last training run.
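
This document does not say how `--skip-if-unchanged` detects new crops. One possible approach, shown purely as an illustration, is to fingerprint the set of staged crop paths and compare it with the fingerprint recorded by the previous run. The names `crops_fingerprint` and `should_skip`, and the idea of persisting the fingerprint, are assumptions.

```python
import hashlib


def crops_fingerprint(paths):
    """Order-independent SHA-256 over the set of staged crop paths."""
    digest = hashlib.sha256()
    for path in sorted(set(paths)):
        digest.update(path.encode("utf-8"))
        digest.update(b"\n")  # separator so "a" + "bc" differs from "ab" + "c"
    return digest.hexdigest()


def should_skip(current_paths, last_fingerprint):
    """True when the staged crops are identical to those seen by the last run."""
    if last_fingerprint is None:
        return False  # no previous run recorded: always train
    return crops_fingerprint(current_paths) == last_fingerprint
```

With an hourly schedule, a cheap check like this keeps most runs to a listing call plus one hash comparison.
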
| Issue | Root cause | Fix applied |
|---|---|---|
| `AssertionError: Torch not compiled with CUDA enabled` | Nested `torch.device()` in condition was always truthy → always selected `cuda` | Use `torch.cuda.is_available()` directly |
| `AttributeError: 'RepoFolder' object has no attribute 'type'` | `list_repo_tree()` returns `RepoFolder` objects which have no `.type` attribute | Use `isinstance(e, RepoFolder)` |
| Download timeout >1h | Per-contributor `snapshot_download` loop caused N full dataset metadata scans | Single bulk `snapshot_download` call with all patterns |
| CPU training exceeds 60 min limit | EfficientNet-B0 on CPU with ~3000 samples: one epoch >8 min × 30 epochs = hours | `deadline` parameter in both `train()` functions; `main()` sets now + 50 min |
| `snapshot_download` hangs indefinitely (~1h) on a specific file | `httpx` (used internally) is async, so `socket.setdefaulttimeout` has no effect on it; one file stalls the entire download silently | Replaced with `urllib.request` + `ThreadPoolExecutor(16)`: `urllib` uses blocking sockets, so a socket timeout of 120 s kills any stalled read |
Training budget is set at 50 minutes from when main() enters the training block. This leaves approximately 10 minutes for model upload to HF Hub. Both train() and train_screen_classifier() accept a deadline: float | None parameter (a time.monotonic() timestamp); each epoch checks the deadline before starting and exits early if exceeded. The best model state accumulated so far is still saved and uploaded.
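
The deadline check can be sketched without any PyTorch code. In this simplified version `run_epoch` is a hypothetical callable standing in for one real training epoch and returning a validation score; the actual `train()` signature in `admin_train.py` takes more parameters than shown.

```python
import time
from typing import Callable, Optional, Tuple

TRAIN_BUDGET_SECONDS = 50 * 60  # leaves roughly 10 minutes of the 60-minute cap for upload


def train(
    run_epoch: Callable[[int], float],
    max_epochs: int,
    deadline: Optional[float] = None,
) -> Tuple[Optional[int], float]:
    """Run up to max_epochs, stopping early once the monotonic deadline has passed.

    Returns (best_epoch, best_score); best_epoch is None if no epoch ran.
    """
    best_epoch, best_score = None, float("-inf")
    for epoch in range(max_epochs):
        # Checked before each epoch starts, so an epoch already running is not interrupted.
        if deadline is not None and time.monotonic() >= deadline:
            break
        score = run_epoch(epoch)
        if score > best_score:
            best_epoch, best_score = epoch, score
    return best_epoch, best_score


# In main(), the deadline would be derived once, when the training block starts:
# deadline = time.monotonic() + TRAIN_BUDGET_SECONDS
```

`time.monotonic()` is used rather than wall-clock time because it cannot jump backwards when the system clock is adjusted. Because the check runs only between epochs, the last epoch can overrun the deadline by up to one epoch's duration, which is one reason to keep a margin before the hard cap.
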
Files are downloaded in parallel using urllib.request in a ThreadPoolExecutor(max_workers=16). Only the exact files needed (known SHA + install_id pairs from the voting step) are fetched — no full-repo metadata scan. socket.setdefaulttimeout(120) is set globally before downloads begin; any stalled TCP read raises socket.timeout after 2 minutes, which is caught per-file and logged as a skip. snapshot_download (httpx/async) was abandoned because httpx async I/O bypasses Python's socket timeout mechanism entirely.
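
A minimal sketch of this download pattern, assuming a list of `(url, destination)` pairs has already been derived from the voting step. `fetch_one` and `fetch_all` are hypothetical names, and the real script's URL construction and logging are not shown.

```python
import http.client
import socket
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def fetch_one(url: str, dest: Path) -> bool:
    """Download one file; return False (and log a skip) on any network failure."""
    try:
        with urllib.request.urlopen(url) as resp:
            data = resp.read()
    except (OSError, http.client.HTTPException) as exc:
        # OSError covers URLError and socket timeouts raised by stalled reads.
        print(f"skip {url}: {exc}")
        return False
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(data)
    return True


def fetch_all(jobs, timeout: float = 120.0, max_workers: int = 16):
    """Fetch (url, dest) pairs in parallel; return the URLs that were skipped."""
    # Global default applied to every blocking socket created afterwards.
    socket.setdefaulttimeout(timeout)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lambda job: fetch_one(*job), jobs))
    return [url for (url, _dest), ok in zip(jobs, results) if not ok]
```

Failures are handled per file, so one stalled or missing file costs at most the timeout for that worker and never blocks the other downloads.
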
`_list_staging_folders()` uses `list_repo_tree(..., recursive=False)` to fetch only the top-level `staging/` directory, which takes a single API call instead of a full recursive scan. It falls back to `list_repo_files()` only if that call raises an exception.
The FastAPI service is deployed on Render as a web service (render.yaml). Key notes:
- Python version: 3.12 (specified in `render.yaml`)
- Start command: `uvicorn main:app --host 0.0.0.0 --port $PORT`
- Health check: `/health`
- Dependencies: `requirements.txt` with pinned exact versions (`==`). Unpinned deps caused `starlette-1.0.0` to be installed, which is incompatible with `fastapi 0.135.x` and caused uvicorn to start and immediately exit (port scan timeout on Render).
| Issue | Root cause | Fix |
|---|---|---|
| `Port scan timeout reached, no open ports detected` | `starlette-1.0.0` (major release, breaking changes) installed via unpinned `starlette>=...` | Pin all deps with `==` in `requirements.txt` |
| Deploy picks up wrong package versions | No upper bounds in requirements → new releases break compatibility | Use `==` pins, update deliberately |
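
To keep unpinned entries from creeping back in, a small check along these lines could run before deploy. It is a hypothetical helper, not part of the repository, and the pattern covers only plain `name==version` lines with optional extras and environment markers.

```python
import re

# name, optional [extras], then an exact == pin
PIN_RE = re.compile(r"^[A-Za-z0-9._-]+(\[[A-Za-z0-9._,-]+\])?==[^=\s]+$")


def unpinned_requirements(text: str) -> list[str]:
    """Return requirement lines that are not pinned with an exact == version."""
    bad = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options such as -r
            continue
        spec = line.split(";", 1)[0].strip()  # ignore environment markers
        if not PIN_RE.match(spec):
            bad.append(line)
    return bad
```

For example, passing the contents of `requirements.txt` and failing the build when the returned list is non-empty would have flagged an entry like `starlette>=...` before it reached Render.
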