Concerns about external (non-HF) model URL downloads in embedders #63

@danielkiv

Description

Three models in our download pipeline are hosted on third-party servers outside HuggingFace. These links are fragile — they can go offline without notice, break silently in CI, and are outside our control. We should either ask the authors to publish to HuggingFace, or mirror the weights ourselves.


Affected Models

| Model | Current Host | URL | Risk |
|---|---|---|---|
| AgriFM | HKU Glass lab server | https://glass.hku.hk/casual/AgriFM/AgriFM.pth | Institutional server, no uptime guarantee |
| FoMo | Dropbox | https://www.dropbox.com/scl/fi/4ckmxlcbc0tcod8hknp7c/... | Shared links can expire or be disabled |
| WildSAT | Google Drive | https://drive.google.com/uc?export=download&id=1IxBpf3nbEMzny4YJWS6stMBxel6gMiYE | Requires confirmation flow; subject to download quotas |

All other models use HuggingFace Hub and are unaffected.


Why This Is a Problem

  • Link rot — Dropbox links expire, Drive links hit quotas, and institutional servers go offline during maintenance or staff transitions.
  • Silent CI failures — a broken wget or gdown call fails late in a pipeline run with no clear error.
  • No versioning — unlike HuggingFace Hub, none of these hosts offer commit hashes or changelogs, so we can't pin to a known-good version.
  • Reproducibility — anyone running the code from scratch is blocked with no fallback if a link is dead.
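Until the weights live on a versioned host, one partial mitigation for the versioning and reproducibility points is to pin a checksum next to each URL. The following is a minimal sketch, not existing project code: `sha256_of` and `verify_checkpoint` are hypothetical helper names, and the expected digests would have to be recorded by hand from a known-good download.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checkpoint(path: Path, expected_sha256: str) -> None:
    """Raise ValueError if the file does not match the pinned digest."""
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise ValueError(
            f"{path} has SHA-256 {actual}, expected {expected_sha256}; "
            "the remote file may have changed or the download was truncated."
        )
```

A mismatch then surfaces immediately after download instead of as a confusing error when the checkpoint is loaded.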

Proposed Solutions

Option A — Ask authors to upload to HuggingFace Hub ✅ Preferred

Contact the authors of each model and request they create a HuggingFace repo. Once uploaded, our code can use hf_hub_download() like everything else.

  • AgriFM → HKU team (e.g. hku-glass/AgriFM)
  • FoMo → check paper repo for contact info
  • WildSAT → replace Drive link with HF upload
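For illustration, this is roughly what the download path could look like once a model is on the Hub. It is a sketch under assumptions: the repo id `hku-glass/AgriFM` is only the example name suggested above and is not known to exist, the filename is borrowed from the current URL, and `HFTarget`, `HF_TARGETS` and `fetch_from_hub` are hypothetical names.

```python
from typing import NamedTuple, Optional


class HFTarget(NamedTuple):
    repo_id: str
    filename: str
    revision: Optional[str] = None  # commit hash or tag to pin, once known


# Hypothetical entry: no such repo is confirmed to exist yet.
HF_TARGETS = {
    "AgriFM": HFTarget(repo_id="hku-glass/AgriFM", filename="AgriFM.pth"),
}


def fetch_from_hub(model_name: str) -> str:
    """Download a checkpoint from the Hub and return its local cache path."""
    # Imported lazily so this module loads even without huggingface_hub.
    from huggingface_hub import hf_hub_download

    target = HF_TARGETS[model_name]
    return hf_hub_download(
        repo_id=target.repo_id,
        filename=target.filename,
        revision=target.revision,
    )
```

Setting `revision` to a commit hash would give the version pinning that the current hosts cannot provide.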

Option B — Mirror on HuggingFace ourselves

If authors are unresponsive, we can host the weights under a community namespace (e.g. geofm-bench/AgriFM-mirror) with a model card crediting the original authors. Common practice in open-source ML.
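A mirror upload could be scripted along these lines. This is a sketch only: it assumes the model's license permits redistribution and that a write token is already configured for `huggingface_hub`. `build_model_card` and `mirror_to_hub` are hypothetical helpers, and the card wording is a placeholder.

```python
import os


def build_model_card(model_name: str, original_url: str) -> str:
    """Return a minimal README crediting the original authors."""
    return (
        f"# {model_name} (unofficial mirror)\n\n"
        f"These weights are an unmodified copy of the file published by the "
        f"original {model_name} authors at:\n\n{original_url}\n\n"
        "All credit belongs to the original authors.\n"
    )


def mirror_to_hub(local_path: str, repo_id: str, model_name: str, original_url: str) -> None:
    """Create the repo if needed, then upload the weights and a model card."""
    from huggingface_hub import HfApi  # imported lazily; needs a write token

    api = HfApi()
    api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)
    api.upload_file(
        path_or_fileobj=local_path,
        path_in_repo=os.path.basename(local_path),
        repo_id=repo_id,
    )
    api.upload_file(
        path_or_fileobj=build_model_card(model_name, original_url).encode("utf-8"),
        path_in_repo="README.md",
        repo_id=repo_id,
    )
```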

Option C — CI health check (stopgap)

Add a lightweight periodic CI job that checks each non-HF URL returns HTTP 200 and isn't redirecting to a quota/confirmation page. Gives early warning before users hit a dead link.
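One possible shape for that check, using only the standard library, is sketched below. The HTML `Content-Type` test is an assumed heuristic for spotting quota or confirmation pages, not a documented behavior of any of these hosts, and `classify_response`, `check_url` and `main` are hypothetical names.

```python
import urllib.request
from typing import Dict, Optional


def classify_response(status: int, content_type: str) -> Optional[str]:
    """Return a problem description, or None if the response looks healthy."""
    if status != 200:
        return f"unexpected HTTP status {status}"
    if content_type.lower().startswith("text/html"):
        # Heuristic: a weights file should not be served as an HTML page.
        return "got an HTML page instead of a binary file (possible quota or confirmation page)"
    return None


def check_url(url: str, timeout: float = 30.0) -> Optional[str]:
    """Open the URL, inspect only status and headers, and report any problem."""
    request = urllib.request.Request(url, headers={"User-Agent": "link-health-check"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return classify_response(
                response.status, response.headers.get("Content-Type", "")
            )
    except OSError as exc:  # covers URLError, HTTPError and timeouts
        return f"request failed: {exc}"


def main(urls: Dict[str, str]) -> int:
    """Check every URL; return a non-zero exit code if any looks unhealthy."""
    failures = 0
    for name, url in urls.items():
        problem = check_url(url)
        if problem is not None:
            print(f"[FAIL] {name}: {problem}")
            failures += 1
    return 1 if failures else 0
```

A scheduled job could call `main` with the non-HF URL mapping and pass the result to `sys.exit`, so a dead link turns the job red before a user hits it.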


Interim Code Change

Add a comment in the download logic so failures are obvious and traceable:

# These URLs are hosted outside HuggingFace and may go offline without notice.
# See: https://github.com/<your-org>/<your-repo>/issues/<this-issue>
NON_HF_MODELS = {
    "AgriFM": "https://glass.hku.hk/casual/AgriFM/AgriFM.pth",
    "FoMo": "https://www.dropbox.com/scl/fi/4ckmxlcbc0tcod8hknp7c/fomo_single_embedding_layer_weights.pt?rlkey=26tlf3yaz93vvcosr0qrvklub&dl=1",
    "WildSAT": "https://drive.google.com/uc?export=download&id=1IxBpf3nbEMzny4YJWS6stMBxel6gMiYE",
}
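A comment only helps people reading the source. To make a failure obvious at runtime as well, the download call could be wrapped so the error names the model, the URL and this issue. The sketch below is an assumption about how that might look, not the current download logic: `ExternalDownloadError` and `download_external` are hypothetical, `ISSUE_URL` keeps the placeholder from the snippet above, and plain `urlretrieve` stands in for whatever downloader is actually used (the Drive link would still need gdown or similar for its confirmation flow).

```python
import urllib.request
from typing import Callable

# Placeholder copied from the comment above; fill in the real issue link.
ISSUE_URL = "https://github.com/<your-org>/<your-repo>/issues/<this-issue>"


class ExternalDownloadError(RuntimeError):
    """Raised when a checkpoint hosted outside HuggingFace cannot be fetched."""


def download_external(
    name: str,
    url: str,
    dest: str,
    opener: Callable[[str, str], object] = urllib.request.urlretrieve,
) -> str:
    """Download url to dest, turning network errors into a traceable message."""
    try:
        opener(url, dest)
    except OSError as exc:  # URLError and HTTPError are OSError subclasses
        raise ExternalDownloadError(
            f"Could not download {name!r} weights from {url} ({exc}). "
            f"This URL is hosted outside HuggingFace and may be offline; see {ISSUE_URL}"
        ) from exc
    return dest
```

The `opener` parameter exists mainly so the error path can be exercised in tests without touching the network.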

Action Items

  • Contact AgriFM authors to request HuggingFace upload
  • Contact FoMo authors to request HuggingFace upload
  • Contact WildSAT authors to request HuggingFace upload
  • If no response within ~4 weeks, evaluate Option B (community mirror)
  • Add CI health check for non-HF URLs as a stopgap
  • Update download code and docs once stable URLs are confirmed

References: HuggingFace — uploading models · gdown (documents Drive quota/confirmation issues)
