Releases: forecast-bio/atdata-app
v0.4.0b1
Added
- sendInteractions XRPC procedure for anonymous usage telemetry: fire-and-forget reporting of download, citation, and derivative events on datasets (#21)
- Skeleton/hydration pattern for third-party dataset indexes: getIndexSkeleton, getIndex, listIndexes, and publishIndex endpoints following Bluesky's feed generator model (#20)
- subscribeChanges WebSocket endpoint for real-time change streaming: an in-memory event bus broadcasts create/update/delete events to subscribers with cursor-based replay (#22)
- Array format type recognition (sparseBytes, structuredBytes, arrowTensor, safetensors) and ndarray v1.1.0 annotation display (dtype, shape, dimensionNames) in frontend templates (#30)
- atdata-lexicon git submodule at lexicons/ pinned to v0.2.1b1 for reference and CI validation (#27)
- CI checkout steps now initialize submodules
Changed
- Ingestion processor refactored to use UPSERT_FNS dispatch dict instead of if/elif chain
- Index provider records (science.alt.dataset.index) added to COLLECTION_TABLE_MAP for firehose ingestion
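The dispatch-dict refactor can be pictured roughly as follows. This is a minimal sketch, not the project's code: the upsert helpers, their signatures, the in-memory `store`, and `process_commit` are all hypothetical; only the `UPSERT_FNS` name and the two collection NSIDs come from the notes above.

```python
from typing import Callable, Dict

# Hypothetical upsert helpers. The real ones presumably write to PostgreSQL;
# here they write to a plain dict so the sketch stays self-contained.
def upsert_entry(store: dict, uri: str, record: dict) -> None:
    store.setdefault("entries", {})[uri] = record

def upsert_index(store: dict, uri: str, record: dict) -> None:
    store.setdefault("indexes", {})[uri] = record

# One lookup table replaces a chain of if/elif branches on the collection NSID.
UPSERT_FNS: Dict[str, Callable[[dict, str, dict], None]] = {
    "science.alt.dataset.entry": upsert_entry,
    "science.alt.dataset.index": upsert_index,
}

def process_commit(store: dict, collection: str, uri: str, record: dict) -> bool:
    """Route a firehose record to its upsert function; skip unknown collections."""
    fn = UPSERT_FNS.get(collection)
    if fn is None:
        return False
    fn(store, uri, record)
    return True
```

With this shape, supporting a new collection means adding one dictionary entry rather than another branch.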
Security
- SSRF protection: Skeleton fetch now validates endpoint URLs with DNS resolution and blocks private/reserved IP ranges at both fetch time and firehose ingestion time
- Auth: sendInteractions endpoint now requires ATProto service auth (was previously unauthenticated)
- XSS: Storage URLs in dataset detail pages are only rendered as clickable links when using http(s):// schemes, preventing javascript: URI injection
- Input validation: publishIndex rejects endpoint URLs containing embedded credentials or fragments; sendInteractions validates that URIs reference the science.alt.dataset.entry collection
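The endpoint checks described above (scheme allow-list, no embedded credentials or fragments, DNS resolution, private/reserved range blocking) could look something like the sketch below. It uses only the Python standard library; the function names and the injectable `resolver` parameter are assumptions made for illustration, not the project's API.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

def resolve_host(host: str) -> list:
    """Resolve a hostname to its IP addresses using the system resolver."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    return [info[4][0] for info in infos]

def is_safe_endpoint(url: str, resolver=resolve_host) -> bool:
    """Return True only if the URL is http(s) and resolves to public addresses."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    # Reject embedded credentials (user:pass@host) and fragments.
    if parts.username or parts.password or parts.fragment:
        return False
    try:
        addresses = resolver(parts.hostname)
    except OSError:
        return False
    if not addresses:
        return False
    for addr in addresses:
        try:
            ip = ipaddress.ip_address(addr)
        except ValueError:
            return False
        # Every resolved address must be outside the blocked ranges.
        if (ip.is_private or ip.is_loopback or ip.is_link_local
                or ip.is_reserved or ip.is_multicast or ip.is_unspecified):
            return False
    return True
```

A check like this only covers the moment of validation. A hostname can resolve differently when the fetch actually happens, which may be why the release notes mention validating at fetch time as well as at ingestion time.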
Fixed
- ChangeStream backpressure: Subscribers that fall behind are now tracked and explicitly disconnected with WebSocket close code 4000, instead of silently dropping events
- ChangeStream subscriber limit: Capped at 1000 concurrent subscribers; new connections receive close code 1013 when full
- WebSocket keepalive: Restructured the subscribeChanges event loop so the 30-second idle keepalive correctly re-enters the processing loop (was previously broken)
- Replay deduplication: Track last replayed sequence number to prevent duplicate events when replay buffer overlaps with the live queue
- Task GC: Fire-and-forget analytics tasks now retain references to prevent garbage collection before completion
- Skeleton response cap: Enforce the requested limit on items returned by external index providers, and cap response body size to 1 MiB
- Skeleton item sanitization: Whitelist upstream skeleton items to only the uri field; validate cursor strings for length and null bytes
- Query guard: query_get_entries now rejects requests with more than 100 keys to prevent unbounded OR-clause queries
- Template robustness: shape and dimensionNames join filters now guard against non-iterable data from malformed firehose records
- Removed dead _validate_iso8601 timestamp validation code from sendInteractions
v0.3.0b1
Changed
- Breaking: Rename collection NSID from science.alt.dataset.record to science.alt.dataset.entry to align with upstream lexicon v0.2.1b1, avoiding ambiguity with ATProto's "record" concept
v0.2.3b1
Changed
- Breaking: Rename lexicon namespace from ac.foundation.dataset.* to science.alt.dataset.* across all XRPC endpoints, firehose filters, SQL schema, and configuration (#17)
- DID document service entry updated from #atproto_appview / AtprotoAppView to #atdata_appview / AtdataAppView
Added
- Dual-hostname DID document support: serve different did:web documents for api.atdata.app (appview identity) and atdata.app (atproto account identity) based on the Host header (#19)
- Host-based route gating middleware: frontend HTML routes are only served on the frontend hostname; the API subdomain serves only XRPC, health, and DID endpoints
- Optional verificationMethod (Multikey) in DID documents when signing keys are configured
- New config vars: ATDATA_FRONTEND_HOSTNAME, ATDATA_PDS_ENDPOINT, ATDATA_SIGNING_KEY, ATDATA_FRONTEND_SIGNING_KEY
- Startup validation requiring ATDATA_PDS_ENDPOINT when ATDATA_FRONTEND_HOSTNAME is set
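As an illustration of the dual-hostname behaviour, the sketch below picks a did:web document from the Host header. The function name, its parameters, and the exact document layout are assumptions. In particular, the `#atproto_pds` / `AtprotoPersonalDataServer` service entry for the account identity is the usual ATProto convention, not something these notes confirm.

```python
from typing import Optional

def build_did_document(
    host_header: str,
    api_hostname: str,
    frontend_hostname: Optional[str] = None,
    pds_endpoint: Optional[str] = None,
) -> Optional[dict]:
    """Return the did:web document for the requested host, or None if unknown."""
    # Normalise the header: drop any port and compare case-insensitively.
    # (IPv6 literals are out of scope for this sketch.)
    host = host_header.split(":", 1)[0].strip().lower()

    if host == api_hostname:
        # AppView identity served on the API subdomain.
        service = {
            "id": "#atdata_appview",
            "type": "AtdataAppView",
            "serviceEndpoint": f"https://{host}",
        }
    elif frontend_hostname and host == frontend_hostname:
        # Account identity; mirrors the startup rule that a PDS endpoint
        # must be configured whenever a frontend hostname is set.
        if not pds_endpoint:
            raise ValueError("pds_endpoint is required when frontend_hostname is set")
        service = {
            "id": "#atproto_pds",
            "type": "AtprotoPersonalDataServer",
            "serviceEndpoint": pds_endpoint,
        }
    else:
        return None

    return {
        "@context": ["https://www.w3.org/ns/did/v1"],
        "id": f"did:web:{host}",
        "service": [service],
    }
```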
v0.2.2b1
Added
- PostgreSQL integration test suite (58 tests) covering schema validation, upserts, queries, search, analytics, pagination, and edge cases
- Docker auto-start for local integration testing: conftest.py spins up a PostgreSQL container when TEST_DATABASE_URL is not set
- integration-test CI job running against PostgreSQL 15, 16, and 17
Fixed
- Schema: array_to_string() is STABLE, not IMMUTABLE; added immutable_array_to_string() wrapper so the search_tsv generated column works on all PostgreSQL versions
- Database: cursor pagination passed indexed_at as string instead of datetime, causing asyncpg DataError with extended query protocol
- Database: analytics interval queries passed string literals instead of timedelta objects, causing asyncpg encoding failures
- CI: schema-check job silently ignored SQL errors; added ON_ERROR_STOP=1 to psql invocations
v0.2.1b1
[0.2.1b1] - 2026-02-17
Security
- Validate lxm claim in service auth JWT to prevent cross-endpoint token reuse
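To make the lxm check concrete: a service auth token minted for one XRPC method should be rejected when presented to another. The sketch below assumes the JWT signature, audience, and expiry have already been verified elsewhere and only compares the decoded `lxm` claim. The function name and the example method NSIDs are illustrative, not taken from the codebase.

```python
def lxm_matches(claims: dict, expected_method: str) -> bool:
    """Return True only if the token was issued for this exact XRPC method.

    `claims` is the already-verified JWT payload; a missing lxm claim is
    treated as a mismatch rather than as a wildcard.
    """
    lxm = claims.get("lxm")
    return isinstance(lxm, str) and lxm == expected_method

# Example: a token bound to publishEntry cannot be replayed against publishSchema.
claims = {"iss": "did:plc:example", "lxm": "ac.foundation.dataset.publishEntry"}
allowed = lxm_matches(claims, "ac.foundation.dataset.publishEntry")
replayed = lxm_matches(claims, "ac.foundation.dataset.publishSchema")
```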
Added
- PostgreSQL version matrix in CI (schema-check job testing against PG 15, 16, 17)
Fixed
- Schema: use explicit 'english'::regconfig cast in search_tsv generated column for PostgreSQL 17 compatibility
- Fix last_time_us UnboundLocalError in Jetstream consumer on early cancellation
- Return 400 instead of 500 for invalid AT-URIs in getEntry, getEntries, getSchema
- Add missing database index on labels.dataset_uri for query_labels_for_dataset
- Deduplicate cursor pagination helpers into models.py
v0.2.0b1
[0.2.0b1] - 2026-02-17
Added
- Server-rendered dataset browser frontend with Jinja2 templates, HTMX, and PicoCSS — home/search, dataset detail, schema detail, schemas list, publisher profile, and about pages
- MCP (Model Context Protocol) server for agent-based dataset queries: exposes search, list, get, and describe tools for LLM agents (mcp_server.py)
- atdata-mcp CLI entry point for running the MCP server
- Lightweight server-side analytics: analytics_events table, analytics_counters summary table, fire-and-forget event recording via asyncio.create_task()
- XRPC analytics endpoints: getAnalytics (service-wide stats by period) and getEntryStats (per-dataset view/search counts)
- Analytics summary in describeService response (total views, searches, active publishers)
- query_labels_for_dataset database helper for retrieving labels by dataset URI
- PyPI publish workflow via GitHub Actions with OIDC trusted publishing
Fixed
- Dockerfile: added --no-editable to uv sync so the package installs into site-packages instead of using a dangling .pth reference in the runtime stage
v0.1.0b1
First beta release of the ATProto AppView for ac.foundation.dataset.
Added
- ATProto AppView serving XRPC endpoints for schemas, dataset entries, labels, and lenses
- Jetstream firehose ingestion via WebSocket with backfill support
- XRPC query endpoints: listSchemas, listEntries, listLabels, listLenses, getSchema, getEntry, getLabel, getLens
- XRPC procedure endpoints: publishSchema, publishEntry, publishLabel, publishLens with ATProto service auth and PDS proxying
- Keyset cursor pagination using (indexed_at, did, rkey) tuples
- did:web identity resolution for the AppView service
- Dockerfile with multi-stage uv build, non-root user, and Railway PORT env var support
- .dockerignore and railway.toml for Railway deployment
- PostgreSQL schema with migrations (sql/schema.sql)
- Comprehensive test suite (33 tests) with full mock coverage
- GitHub Actions CI (lint + test on Python 3.12/3.13)
- PyPI publish workflow on release
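For readers unfamiliar with keyset pagination over (indexed_at, did, rkey) tuples, here is a small in-memory sketch. The cursor encoding (base64 over JSON) and all function names are assumptions for illustration. The real implementation presumably does the comparison in SQL with a row-value predicate and passes indexed_at as a datetime, not a string.

```python
import base64
import json
from datetime import datetime, timezone

def encode_cursor(indexed_at: datetime, did: str, rkey: str) -> str:
    """Pack the last row's sort key into an opaque, URL-safe string."""
    payload = json.dumps([indexed_at.isoformat(), did, rkey])
    return base64.urlsafe_b64encode(payload.encode()).decode()

def decode_cursor(cursor: str) -> tuple:
    """Unpack a cursor, parsing the timestamp back into a datetime."""
    ts, did, rkey = json.loads(base64.urlsafe_b64decode(cursor.encode()))
    return datetime.fromisoformat(ts), did, rkey

def next_page(rows: list, cursor, limit: int) -> tuple:
    """Return (page, next_cursor) for rows sorted newest-first.

    Roughly equivalent to:
      WHERE (indexed_at, did, rkey) < ($1, $2, $3)
      ORDER BY indexed_at DESC, did DESC, rkey DESC LIMIT $4
    """
    ordered = sorted(rows, reverse=True)
    if cursor:
        key = decode_cursor(cursor)
        # Tuple comparison breaks ties on did, then rkey.
        ordered = [row for row in ordered if row < key]
    page = ordered[:limit]
    more = len(ordered) > limit
    return page, (encode_cursor(*page[-1]) if more else None)
```

Because the cursor carries the full sort key, rows sharing the same indexed_at are neither skipped nor repeated across pages, which offset-based paging cannot guarantee while new records are being ingested.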