External non-hf model URL download in embedders concerns #63
Description
Three models in our download pipeline are hosted on third-party servers outside HuggingFace. These links are fragile — they can go offline without notice, break silently in CI, and are outside our control. We should either ask the authors to publish to HuggingFace, or mirror the weights ourselves.
Affected Models
| Model | Current Host | URL | Risk |
|---|---|---|---|
| AgriFM | HKU Glass lab server | https://glass.hku.hk/casual/AgriFM/AgriFM.pth | Institutional server — no uptime guarantee |
| FoMo | Dropbox | https://www.dropbox.com/scl/fi/4ckmxlcbc0tcod8hknp7c/... | Shared links can expire or be disabled |
| WildSAT | Google Drive | https://drive.google.com/uc?export=download&id=1IxBpf3nbEMzny4YJWS6stMBxel6gMiYE | Requires confirmation flow; subject to download quotas |
All other models use HuggingFace Hub and are unaffected.
Why This Is a Problem
- Link rot: Dropbox links expire, Drive links hit quotas, and institutional servers go offline during maintenance or staff transitions.
- Silent CI failures: a broken `wget` or `gdown` call fails late in a pipeline run with no clear error.
- No versioning: unlike HuggingFace Hub, none of these hosts offer commit hashes or changelogs, so we can't pin to a known-good version.
- Reproducibility: anyone running the code from scratch is blocked with no fallback if a link is dead.
Proposed Solutions
Option A — Ask authors to upload to HuggingFace Hub ✅ Preferred
Contact the authors of each model and request they create a HuggingFace repo. Once uploaded, our code can use `hf_hub_download()` like everything else.
- AgriFM → HKU team (e.g. `hku-glass/AgriFM`)
- FoMo → check paper repo for contact info
- WildSAT → replace Drive link with HF upload
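For reference, a rough sketch of what the download path could look like once a repo exists. The repo id `hku-glass/AgriFM` is only the example name from the list above, and the filename `AgriFM.pth` and the helper name `fetch_from_hub` are assumptions; none of these exist yet.

```python
from pathlib import Path
from typing import Optional


def fetch_from_hub(repo_id: str, filename: str, revision: Optional[str] = None) -> Path:
    """Download one file from a HuggingFace repo and return its local cache path.

    Passing a commit hash as `revision` pins the download to a known-good version.
    """
    # Imported lazily so this module still imports without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    local_path = hf_hub_download(repo_id=repo_id, filename=filename, revision=revision)
    return Path(local_path)


# Hypothetical usage (repo id and filename are placeholders, not real uploads):
# weights = fetch_from_hub("hku-glass/AgriFM", "AgriFM.pth", revision="<commit-hash>")
```

The `revision` argument is what gives us the pinning that the current hosts lack.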
Option B — Mirror on HuggingFace ourselves
If authors are unresponsive, we can host the weights under a community namespace (e.g. `geofm-bench/AgriFM-mirror`) with a model card crediting the original authors. This is common practice in open-source ML.
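A minimal sketch of how a mirror upload might be scripted with `huggingface_hub`, assuming a write token is already configured locally. The helper names `build_model_card` and `mirror_weights` are hypothetical, and the model card text is a placeholder that would need the real license and citation filled in.

```python
from pathlib import Path


def build_model_card(model_name: str, original_url: str) -> str:
    """Return a minimal README.md body that credits the original authors."""
    return (
        f"# {model_name} (unofficial mirror)\n\n"
        "These weights are an unmodified copy of the file originally published at:\n"
        f"{original_url}\n\n"
        "All credit belongs to the original authors. "
        "Check the upstream license before redistributing.\n"
    )


def mirror_weights(weights_path: Path, repo_id: str, model_name: str, original_url: str) -> None:
    """Create (or reuse) a model repo, then upload the weights and a model card."""
    # Lazy import: only needed when actually uploading.
    from huggingface_hub import HfApi

    api = HfApi()
    api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)
    api.upload_file(
        path_or_fileobj=str(weights_path),
        path_in_repo=weights_path.name,
        repo_id=repo_id,
    )
    card = build_model_card(model_name, original_url).encode("utf-8")
    api.upload_file(path_or_fileobj=card, path_in_repo="README.md", repo_id=repo_id)
```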
Option C — CI health check (stopgap)
Add a lightweight periodic CI job that checks each non-HF URL returns HTTP 200 and isn't redirecting to a quota/confirmation page. Gives early warning before users hit a dead link.
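One possible shape for the check, using only the standard library. Treating a `text/html` response as a quota/confirmation page is a heuristic assumption (a weights file should not come back as HTML), and the names `classify`, `check_url`, and `run_checks` are made up for this sketch.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict


def classify(status: int, content_type: str) -> str:
    """Return "ok" or a short reason why the response looks unhealthy."""
    if status != 200:
        return f"unexpected HTTP status {status}"
    if content_type.startswith("text/html"):
        # Heuristic: an HTML body usually means a quota or confirmation page.
        return "got an HTML page instead of a weights file"
    return "ok"


def check_url(url: str, timeout: float = 30.0) -> str:
    """Open the URL (following redirects) and classify the final response."""
    request = urllib.request.Request(url, headers={"User-Agent": "weights-health-check"})
    try:
        # Only the status and headers are inspected; the body is not read.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return classify(response.status, response.headers.get_content_type())
    except urllib.error.HTTPError as exc:
        return f"unexpected HTTP status {exc.code}"
    except OSError as exc:  # covers URLError, DNS failures, and timeouts
        return f"request failed: {exc}"


def run_checks(urls: Dict[str, str], checker: Callable[[str], str] = check_url) -> Dict[str, str]:
    """Return a mapping of model name to failure reason for every unhealthy URL."""
    results = {name: checker(url) for name, url in urls.items()}
    return {name: reason for name, reason in results.items() if reason != "ok"}
```

A scheduled CI job could call `run_checks` on the non-HF URL mapping and exit non-zero when the returned dict is non-empty.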
Interim Code Change
Add a comment in the download logic so failures are obvious and traceable:
```python
# These URLs are hosted outside HuggingFace and may go offline without notice.
# See: https://github.com/<your-org>/<your-repo>/issues/<this-issue>
NON_HF_MODELS = {
    "AgriFM": "https://glass.hku.hk/casual/AgriFM/AgriFM.pth",
    "FoMo": "https://www.dropbox.com/scl/fi/4ckmxlcbc0tcod8hknp7c/fomo_single_embedding_layer_weights.pt?rlkey=26tlf3yaz93vvcosr0qrvklub&dl=1",
    "WildSAT": "https://drive.google.com/uc?export=download&id=1IxBpf3nbEMzny4YJWS6stMBxel6gMiYE",
}
```
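A comment only helps someone reading the source. As a possible extension, the download call itself could be wrapped so the runtime error names the model and the external URL. This is a sketch, not the current code: `fetch_external` and `ExternalWeightsError` are hypothetical names, and it assumes the underlying downloader (whatever wraps `wget` or `gdown`) raises an exception on failure.

```python
from typing import Callable, TypeVar

T = TypeVar("T")

# Placeholder copied from the comment above; fill in the real issue link.
ISSUE_URL = "https://github.com/<your-org>/<your-repo>/issues/<this-issue>"


class ExternalWeightsError(RuntimeError):
    """Raised when a model hosted outside HuggingFace cannot be downloaded."""


def fetch_external(name: str, url: str, fetch: Callable[[str], T]) -> T:
    """Run `fetch(url)` and re-raise any failure with the model name and URL attached."""
    try:
        return fetch(url)
    except Exception as exc:
        raise ExternalWeightsError(
            f"Failed to download '{name}' from non-HuggingFace host: {url}\n"
            f"Cause: {exc!r}\n"
            f"This link is outside our control, see {ISSUE_URL}"
        ) from exc
```

The original exception is kept as `__cause__`, so the full traceback is still available in CI logs.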
Action Items
- Contact AgriFM authors to request HuggingFace upload
- Contact FoMo authors to request HuggingFace upload
- Contact WildSAT authors to request HuggingFace upload
- If no response within ~4 weeks, evaluate Option B (community mirror)
- Add CI health check for non-HF URLs as a stopgap
- Update download code and docs once stable URLs are confirmed
References: HuggingFace — uploading models · gdown (documents Drive quota/confirmation issues)