Last Modified: 2026-03-28
Web crawl, scrape, extract, embed, and query — all in one binary backed by a self-hosted RAG stack.
Local dev mode:
`axon serve` supervises the local app stack (bridge backend, MCP HTTP, workers, shell server, Next.js). Infrastructure (Postgres, Redis, RabbitMQ, Qdrant, Chrome, TEI) runs via a separate Docker Compose file (`docker-compose.services.yaml`).
# 1. Start infrastructure only
docker compose -f docker-compose.services.yaml up -d
# or: just services-up
# 2. Recommended: use the wrapper script (auto-sources .env)
./scripts/axon doctor
./scripts/axon scrape https://example.com --wait true
# 3. Run the local app stack supervisor
cargo run --bin axon -- serve # starts bridge backend, MCP HTTP, workers, shell server, Next.js
# MCP server via CLI subcommand
./scripts/axon mcp
# Or build and run the binary directly
cargo build --release --bin axon
./target/release/axon --help
# Or build + run in one shot (does NOT auto-source .env)
cargo run --bin axon -- scrape https://example.com --wait true

Note: The binary is named `axon`. Build with `cargo build --bin axon`.
Axon ships an MCP server subcommand that exposes a single tool (axon) with action/subaction routing for crawl/extract/embed/ingest/RAG/discovery/ops workflows.
cargo build --release --bin axon
./target/release/axon mcp

MCP docs:
- `docs/MCP.md` (runtime/design guide)
- `docs/MCP-TOOL-SCHEMA.md` (wire contract schema source of truth)
| Command | Purpose | Async? |
|---|---|---|
| `scrape <url>...` | Scrape one or more URLs to markdown | No |
| `crawl <url>...` | Full site crawl for one or more start URLs | Yes (default) |
| `map <url>` | Discover all URLs without scraping | No |
| `extract <urls...>` | LLM-powered structured data extraction | Yes (default) |
| `search <query>` | Web search via Tavily, auto-queues crawl jobs for results | No |
| `research <query>` | Web research via Tavily AI search with LLM synthesis | No |
| `embed [input]` | Embed file/dir/URL into Qdrant | Yes (default) |
| `export` | Export full index manifest (jobs + ingest targets + refresh schedules + Qdrant summary) to JSON | No |
| `query <text>` | Semantic vector search | No |
| `retrieve <url>` | Fetch stored document chunks from Qdrant | No |
| `ask <question>` | RAG: search + LLM answer. Use `--graph` to inject Neo4j graph context when configured. | No |
| `evaluate <question>` | RAG vs baseline + independent LLM judge (accuracy, relevance, completeness, specificity, verdict) | No |
| `suggest [focus]` | Suggest new docs URLs to crawl | No |
| `ingest <target>` | Ingest external source (GitHub repo, Reddit subreddit/thread, YouTube video/playlist/channel) — auto-detects source type from target. GitHub: source code indexed by default with tree-sitter AST chunking; use `--no-source` to skip. | Yes (default) |
| `sessions [format]` | Ingest AI session exports (Claude/Codex/Gemini) into Qdrant | No |
| `sources` | List all indexed URLs + chunk counts | No |
| `domains` | List indexed domains + stats | No |
| `stats` | Qdrant collection stats | No |
| `status` | Show async job queue status | No |
| `doctor` | Diagnose service connectivity | No |
| `debug` | Run doctor + LLM-assisted troubleshooting | No |
| `mcp` | Start MCP stdio server | No |
| `refresh <url>` | Periodic URL re-indexing (schedule, status, cancel, list). Supports `github:owner/repo` schedules with `pushed_at` gating. | Yes (default) |
| `graph <sub>` | Knowledge graph operations: build, status, explore, stats, worker. Requires `AXON_NEO4J_URL`. | Depends |
| `serve` | Start web UI server (axum + WebSocket + Docker stats) | No |
| `watch <sub>` | Scheduled task management: create, list, get, update, run-now, pause, resume, delete, history, artifacts. Scheduler automation requires full mode (`AXON_LITE=1` disables watch_scheduler). | Depends |
| `migrate --from <src> --to <dst>` | Copy all points from an unnamed-vector collection to a new named-mode collection (dense + bm42 sparse), enabling RRF hybrid search. No re-embedding needed. | No |
axon crawl status <job_id>
axon crawl cancel <job_id>
axon crawl errors <job_id>
axon crawl list
axon crawl cleanup
axon crawl clear
axon crawl recover # reclaim stale/interrupted jobs
axon crawl worker # run a worker inline

All flags below are global (usable with any subcommand).
| Flag | Type | Default | Description |
|---|---|---|---|
| `--wait <bool>` | bool | `false` | Run synchronously and block until completion. Without this, async commands enqueue and return immediately. |
| `--yes` | flag | `false` | Skip confirmation prompts (non-interactive mode). |
| `--json` | flag | `false` | Machine-readable JSON output on stdout. |
| `--graph` | flag | `false` | Enable graph-enhanced retrieval for `ask` (requires Neo4j). |
| Flag | Type | Default | Description |
|---|---|---|---|
| `--max-pages <n>` | u32 | `0` | Page cap for crawl (0 = uncapped, default). |
| `--max-depth <n>` | usize | `5` | Maximum crawl depth from the start URL. |
| `--render-mode <mode>` | enum | `auto-switch` | `http`, `chrome`, or `auto-switch`. Auto-switch tries HTTP first, falls back to Chrome if >60% of pages are thin. |
| `--format <fmt>` | enum | `markdown` | Output format: `markdown`, `html`, `rawHtml`, `json`. |
| `--include-subdomains <bool>` | bool | `false` | Crawl all subdomains of the start URL's parent domain. Disabled by default — enable with `--include-subdomains true`. |
| `--respect-robots <bool>` | bool | `false` | Respect robots.txt directives. Note: defaults to false — consider the legal/ethical implications. |
| `--discover-sitemaps <bool>` | bool | `true` | Discover and backfill URLs from sitemap.xml after the crawl. |
| `--max-sitemaps <n>` | usize | `512` | Maximum sitemap URLs to backfill per crawl. |
| `--sitemap-since-days <n>` | u32 | `0` | Only backfill sitemap URLs with `<lastmod>` within the last N days (0 = no filter). URLs without `<lastmod>` are always included. |
| `--min-markdown-chars <n>` | usize | `200` | Minimum markdown character count; pages below this are flagged as "thin". |
| `--drop-thin-markdown <bool>` | bool | `true` | Skip thin pages — do not save or embed them. |
| `--delay-ms <ms>` | u64 | `0` | Delay between requests in milliseconds. Useful for polite crawling. |
| `--header <HEADER>` | string | — | Custom HTTP header in `Key: Value` format. Repeatable (`--header "Auth: Bearer ..." --header "X-Custom: val"`). Applied to crawl, scrape, extract, and Chrome re-fetch paths. |
| Flag | Type | Default | Description |
|---|---|---|---|
| `--output-dir <dir>` | path | `.cache/axon-rust/output` | Directory for saved markdown/HTML output files. |
| `--output <path>` | path | — | Explicit output file path (overrides `--output-dir` for single-file commands). |
| Flag | Type | Default | Description |
|---|---|---|---|
| `--collection <name>` | string | `cortex` | Qdrant collection name. Also settable via the `AXON_COLLECTION` env var. |
| `--embed <bool>` | bool | `true` | Auto-embed scraped content into Qdrant. |
| `--limit <n>` | usize | `10` | Result limit for search/query commands. |
| `--query <text>` | string | — | Query text (alternative to the positional argument for some commands). |
| `--urls <csv>` | string | — | Comma-separated URL list (alternative to positional arguments). |
| Flag | Type | Default | Description |
|---|---|---|---|
| `--performance-profile <p>` | enum | `high-stable` | `high-stable`, `extreme`, `balanced`, `max`. Sets defaults for concurrency, timeouts, retries. |
| `--batch-concurrency <n>` | usize | `16` | Concurrent connections for batch operations (clamped 1–512). |
| `--concurrency-limit <n>` | usize | — | Override all three concurrency limits (crawl, sitemap, backfill) at once. |
| `--crawl-concurrency-limit <n>` | usize | profile | Override crawl concurrency (profile default: CPUs × multiplier). |
| `--sitemap-concurrency-limit <n>` | usize | profile | Override sitemap backfill concurrency. |
| `--backfill-concurrency-limit <n>` | usize | profile | Override backfill concurrency. |
| `--request-timeout-ms <ms>` | u64 | profile | Per-request timeout in milliseconds. |
| `--fetch-retries <n>` | usize | profile | Number of retries on failed fetches. |
| `--retry-backoff-ms <ms>` | u64 | profile | Backoff between retries in milliseconds. |
| Flag | Type | Env Var | Fallback |
|---|---|---|---|
| `--pg-url <url>` | string | `AXON_PG_URL` | `postgresql://axon:postgres@127.0.0.1:53432/axon` |
| `--redis-url <url>` | string | `AXON_REDIS_URL` | `redis://127.0.0.1:53379` |
| `--amqp-url <url>` | string | `AXON_AMQP_URL` | `amqp://axon:axonrabbit@127.0.0.1:45535/%2f` |
| `--qdrant-url <url>` | string | `QDRANT_URL` | `http://127.0.0.1:53333` |
| `--tei-url <url>` | string | `TEI_URL` | (empty) |
| `--openai-base-url <url>` | string | `OPENAI_BASE_URL` | (empty) |
| `--openai-api-key <key>` | string | `OPENAI_API_KEY` | (empty) |
| `--openai-model <name>` | string | `OPENAI_MODEL` | (empty) |
| Flag | Type | Env Var | Default |
|---|---|---|---|
| `--shared-queue <bool>` | bool | — | `true` |
| `--crawl-queue <name>` | string | `AXON_CRAWL_QUEUE` | `axon.crawl.jobs` |
| `--extract-queue <name>` | string | `AXON_EXTRACT_QUEUE` | `axon.extract.jobs` |
| `--embed-queue <name>` | string | `AXON_EMBED_QUEUE` | `axon.embed.jobs` |
Canonical architecture and data-flow diagrams live in docs/ARCHITECTURE.md.
High-level subsystem map:
- Entrypoint and dispatch:
  - `main.rs` loads environment and calls `axon::run()`
  - `lib.rs` owns `run`/`run_once` and command dispatch
- Command + config:
  - `crates/cli/*` command handlers
  - `crates/core/config/{cli,parse,types}.rs` flag/env parsing and runtime config resolution
- Crawl + content:
  - `crates/crawl/engine.rs`, `crates/core/http.rs`, and `crates/core/content.rs`
- Async jobs:
  - `crates/jobs/crawl/` (manifest, processor, repo, sitemap, watchdog, worker, runtime)
  - `crates/jobs/{extract,embed}/` modules, `crates/jobs/ingest.rs`
  - `crates/jobs/common/*` and `crates/jobs/worker_lane.rs`
  - Job states in `crates/jobs/status.rs`
- Vector + RAG:
  - `crates/vector/ops/*` (TEI embedding, Qdrant upsert/search, ask/evaluate/query)
  - Hybrid search: new collections use named `dense` + `bm42` sparse vectors with Reciprocal Rank Fusion (RRF) via Qdrant `/query` when hybrid search is active; falls back to dense-only when the sparse query is empty or hybrid is disabled. Legacy collections use dense-only. See `crates/vector/CLAUDE.md`.
- Services layer (services-first contract) — see `crates/services/CLAUDE.md`:
  - `crates/services/` — typed entry points consumed by both CLI handlers and MCP/web routes
  - CLI commands call `crates/services::{query,retrieve,ask,sources,domains,stats,system}` — not raw `run_*_native()` functions (those public call-site entry points are removed from the API surface; callers must go through the services layer)
  - Each service function returns a typed result struct (defined in `crates/services/types/service.rs`) — no raw JSON printing or stdout side-effects
  - MCP handlers and web routes call the same service functions, mapping typed results to wire format
  - ACP orchestration lives in `crates/services/acp/` (session lifecycle, permission bridge, adapter subprocess)
  - ACP-backed LLM completions (fire-and-forget, pre-warmed) live in `crates/services/acp_llm/` — used by ask synthesis, research, and extract fallback; see `docs/ACP.md` for the full protocol reference
- MCP server:
  - `crates/mcp/` (schema, server routing, handler modules, config)
  - Single `axon` tool with `action`/`subaction` routing
- Web runtimes:
  - WebSocket execution bridge: `crates/web.rs`
  - Active UI: `apps/web/` (Next.js — omnibox, Pulse workspace, port 49010)
The stack is split across two compose files (plus a GPU override) sharing a named `axon` bridge network:

| File | Contents | Env file |
|---|---|---|
| `docker-compose.services.yaml` | Infrastructure (postgres, redis, rabbitmq, qdrant, chrome, TEI) | `services.env` |
| `docker-compose.yaml` | App containers (workers, web) | `.env` |
| `docker-compose.gpu.yaml` | GPU override — NVIDIA reservations for axon-tei and axon-ollama | (none) |
Start infra first, then app containers. Both compose files read .env for YAML ${VAR} interpolation (Docker Compose default). Container environment is injected via env_file:.
GPU acceleration: On NVIDIA hosts, layer the GPU override on top of the services file:
docker compose -f docker-compose.services.yaml -f docker-compose.gpu.yaml up -d

CPU-only hosts use `docker-compose.services.yaml` alone — no GPU block, no startup failure.
| Service | Image | Exposed Port | Purpose |
|---|---|---|---|
| `axon-postgres` | postgres:17-alpine | 53432 | Job persistence |
| `axon-redis` | redis:8.2-alpine | 53379 | Queue state / caching |
| `axon-rabbitmq` | rabbitmq:4.0-management | 45535 | AMQP job queue |
| `axon-qdrant` | qdrant/qdrant:v1.13.1 | 53333, 53334 (gRPC) | Vector store |
| `axon-tei` | ghcr.io/huggingface/text-embeddings-inference:latest | 52000 | Embedding generation (GPU, NVIDIA) |
| `axon-chrome` | built from docker/chrome/Dockerfile | 6000 (management), 9222 (CDP proxy) | headless_browser + chrome-headless-shell |
| Service | Image | Exposed Port | Purpose |
|---|---|---|---|
| `axon-workers` | built from docker/Dockerfile | 49000, 8001 | Workers + `axon serve` + MCP HTTP |
| `axon-web` | built from docker/web/Dockerfile | 49010 | Next.js dashboard |

For local dev, workers and web run as local processes instead:

| Service | Local dev | Command |
|---|---|---|
| `axon-workers` | local supervisor | `cargo run --bin axon -- serve` |
| `axon-web` | supervised child | started by `axon serve` (port 49010) |
All services live on the axon bridge network. Data volumes use ${AXON_DATA_DIR:-./data}/axon/... (override with AXON_DATA_DIR).
# Start infrastructure (postgres, redis, rabbitmq, qdrant, tei, chrome)
just services-up
# or: docker compose -f docker-compose.services.yaml up -d
# Run the local app stack supervisor
cargo run --bin axon -- serve
# Check infra health
docker compose -f docker-compose.services.yaml ps
# Tail infra logs
docker compose -f docker-compose.services.yaml logs -f
# Full local dev (infra + axon serve supervisor)
just dev
# Stop everything
just down-all

Two env files:
- `.env` — App runtime vars for workers/web + shared compose interpolation vars (credentials, `AXON_DATA_DIR`). Docker Compose reads this automatically for `${VAR}` substitution in both compose files.
- `services.env` — Infrastructure credentials injected into service containers via `env_file:`.
Copy .env.example → .env and services.env, then fill in values:
# === .env (app runtime + shared interpolation) ===
# Compose persistent data root on host
AXON_DATA_DIR=/home/yourname/appdata
# Postgres
AXON_PG_URL=postgresql://axon:CHANGE_ME@127.0.0.1:53432/axon
# Redis
AXON_REDIS_URL=redis://:CHANGE_ME@axon-redis:6379
# RabbitMQ
AXON_AMQP_URL=amqp://axon:CHANGE_ME@axon-rabbitmq:5672
# Qdrant
QDRANT_URL=http://axon-qdrant:6333
# TEI embeddings (on axon network — container DNS)
TEI_URL=http://axon-tei:80
# LLM / ACP completion settings
# ACP adapter is required for ask/evaluate/suggest/extract fallback/debug/research synthesis.
AXON_ACP_ADAPTER_CMD=codex
AXON_ACP_ADAPTER_ARGS=
# OPENAI_MODEL is used as ACP model override (compatibility key name retained).
OPENAI_BASE_URL=http://YOUR_LLM_HOST/v1
OPENAI_API_KEY=your-key-or-empty
OPENAI_MODEL=your-model-name
# CDP endpoint for headless_browser (axon-chrome management API)
AXON_CHROME_REMOTE_URL=http://axon-chrome:6000
# Optional queue name overrides
AXON_CRAWL_QUEUE=axon.crawl.jobs
AXON_EXTRACT_QUEUE=axon.extract.jobs
AXON_EMBED_QUEUE=axon.embed.jobs
AXON_INGEST_QUEUE=axon.ingest.jobs
AXON_GRAPH_QUEUE=axon.graph.jobs
AXON_COLLECTION=cortex # Qdrant collection (default: cortex)
# Neo4j / GraphRAG (optional — graph features are disabled when AXON_NEO4J_URL is empty)
AXON_NEO4J_URL=http://localhost:7474
AXON_NEO4J_USER=neo4j
AXON_NEO4J_PASSWORD=
AXON_GRAPH_CONCURRENCY=4
AXON_GRAPH_LLM_URL=http://localhost:11434
AXON_GRAPH_LLM_MODEL=qwen3.5:2b
AXON_GRAPH_SIMILARITY_THRESHOLD=0.75
AXON_GRAPH_SIMILARITY_LIMIT=20
AXON_GRAPH_CONTEXT_MAX_CHARS=2000
AXON_GRAPH_TAXONOMY_PATH=
# Search and research (required for search/research commands)
TAVILY_API_KEY=your-tavily-api-key
# Ingest credentials (Reddit required; GitHub optional for higher rate limits)
GITHUB_TOKEN= # optional — raises GitHub rate limits
REDDIT_CLIENT_ID= # required for Reddit ingest targets
REDDIT_CLIENT_SECRET= # required for Reddit ingest targets
# Worker tuning (optional, defaults shown)
AXON_INGEST_LANES=2 # parallel ingest worker lanes
AXON_EMBED_DOC_TIMEOUT_SECS=300 # per-document embed timeout
AXON_EMBED_STRICT_PREDELETE=true # delete existing points before re-embedding
AXON_JOB_STALE_TIMEOUT_SECS=300 # seconds before a running job is considered stale
AXON_JOB_STALE_CONFIRM_SECS=60 # additional grace period before stale reclaim

Three auth tokens cover two surfaces (`/api/*` and `/ws`):

| Token | Scope | Required |
|---|---|---|
| `AXON_WEB_API_TOKEN` | Primary token. Server-only — do NOT expose to the browser. Gates both `/api/*` (proxy.ts) and `/ws` (Rust WS gate via `?token=`). The `?token=` query param is a necessary limitation: WebSocket upgrade requests cannot carry custom headers. | Yes |
| `AXON_WEB_BROWSER_API_TOKEN` | Optional second-tier token for `/api/*` routes only. Does not gate `/ws`. If unset, `AXON_WEB_API_TOKEN` is used for all `/api/*` routes. Use this to keep the browser-exposed token separate from the primary WS gate token. | No |
| `NEXT_PUBLIC_AXON_API_TOKEN` | Browser-exposed token. `apiFetch()` sends it as `x-api-key` on `/api/*`; `use-axon-ws.ts` appends it as `?token=` on the WS URL. Must equal `AXON_WEB_BROWSER_API_TOKEN` when that is set, or `AXON_WEB_API_TOKEN` otherwise. Do not set this to `AXON_WEB_API_TOKEN` when `AXON_WEB_BROWSER_API_TOKEN` is configured. | Yes (when `AXON_WEB_API_TOKEN` is set) |
MCP OAuth (atk_ tokens) is a separate auth system for MCP clients only — it does not touch /ws or /api/*.
# Primary token — gates both /api/* (proxy.ts) and /ws (Rust WS gate)
AXON_WEB_API_TOKEN=CHANGE_ME
# Optional second-tier token — gates /api/* only (does NOT gate /ws).
# If unset, AXON_WEB_API_TOKEN is used for all routes.
AXON_WEB_BROWSER_API_TOKEN=
# Browser-exposed token — must equal AXON_WEB_BROWSER_API_TOKEN when set, else AXON_WEB_API_TOKEN
# apiFetch() sends it as x-api-key on /api/*; use-axon-ws.ts sends it as ?token= on /ws
NEXT_PUBLIC_AXON_API_TOKEN=
AXON_WEB_ALLOWED_ORIGINS=
AXON_WEB_ALLOW_INSECURE_DEV=false
# Optional shell websocket auth/origin overrides
AXON_SHELL_WS_TOKEN=
AXON_SHELL_ALLOWED_ORIGINS=
# Optional client-side shell websocket token
NEXT_PUBLIC_SHELL_WS_TOKEN=
# Optional allowlist for Pulse chat --betas values
AXON_ALLOWED_CLAUDE_BETAS=interleaved-thinking

The CLI auto-detects whether it's running inside Docker:
- Inside Docker (`/.dockerenv` exists): uses container-internal DNS (`axon-postgres:5432`, etc.)
- Outside Docker (local dev): rewrites to localhost with mapped ports (`127.0.0.1:53432`, etc.)

So `.env` can use container DNS — `normalize_local_service_url()` in `config.rs` handles the translation transparently.
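Outside Docker, the translation is essentially a host:port lookup against the compose port mappings. A Python sketch of the idea — the real implementation is the Rust `normalize_local_service_url()`, and the port map here is assembled from the defaults documented in this file:

```python
from urllib.parse import urlparse, urlunparse

# Container host:port -> host-mapped localhost port (from the compose files).
PORT_MAP = {
    ("axon-postgres", 5432): 53432,
    ("axon-redis", 6379): 53379,
    ("axon-rabbitmq", 5672): 45535,
    ("axon-qdrant", 6333): 53333,
    ("axon-tei", 80): 52000,
}

def normalize_local_service_url(url: str, in_docker: bool) -> str:
    """Rewrite container DNS to 127.0.0.1 + mapped port when outside Docker."""
    if in_docker:  # inside a container, container DNS resolves as-is
        return url
    p = urlparse(url)
    mapped = PORT_MAP.get((p.hostname, p.port))
    if mapped is None:
        return url  # unknown host:port -> leave untouched
    netloc = f"127.0.0.1:{mapped}"
    if p.username:  # preserve credentials (e.g. amqp://axon:pw@host)
        cred = p.username + (f":{p.password}" if p.password else "")
        netloc = f"{cred}@{netloc}"
    return urlunparse(p._replace(netloc=netloc))
```

This is why the same `.env` works both inside containers and for local processes.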
Lite mode runs axon without Postgres, Redis, or RabbitMQ. Jobs are stored in SQLite and workers run in-process inside the same tokio runtime.
AXON_LITE=1 axon scrape https://example.com # no external services needed
# or
axon --lite scrape https://example.com

What works in lite mode: scrape, crawl (sync), map, embed, query, ask, extract, ingest, search, research, sources, stats, doctor, MCP server.
Unsupported in lite mode: graph, refresh (including scheduling), watch scheduler, export.
# Env vars for lite mode
AXON_LITE=1 # enable lite mode
AXON_SQLITE_PATH=/path/to/jobs.db # optional; default: $AXON_DATA_DIR/axon/jobs.db

The `ServiceContext` (in `crates/services/context.rs`) is constructed at startup and carries a `ServiceCapabilities` struct that gates unsupported operations. MCP handlers check `ctx.capabilities.<cap>.supported` before executing.
See crates/jobs/CLAUDE.md for the JobBackend trait and backend selection details.
By default, crawl, extract, embed, and ingest enqueue jobs and return immediately. Use `--wait true` to block until completion. Without workers running, enqueued jobs remain pending indefinitely.
The default mode. Runs an HTTP crawl first; if >60% of pages are thin (<200 chars) or total coverage is too low, automatically retries with Chrome. Chrome requires a running Chrome instance — if none is available, the HTTP result is kept.
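The fallback decision boils down to a thin-page ratio check. A Python sketch of the heuristic, assuming "thin" means markdown under the 200-char default threshold (the real engine also weighs total coverage):

```python
THIN_RATIO_THRESHOLD = 0.6   # >60% thin pages triggers the Chrome retry
MIN_MARKDOWN_CHARS = 200     # --min-markdown-chars default

def should_retry_with_chrome(page_markdown: list, chrome_available: bool) -> bool:
    """Decide whether an HTTP crawl should be retried with Chrome rendering."""
    if not page_markdown or not chrome_available:
        # nothing crawled, or no Chrome instance -> keep the HTTP result
        return False
    thin = sum(1 for md in page_markdown if len(md) < MIN_MARKDOWN_CHARS)
    return thin / len(page_markdown) > THIN_RATIO_THRESHOLD
```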
When Chrome feature is compiled in, crawl() expects a Chrome instance. crawl_raw() is pure HTTP and always works. engine.rs calls crawl_raw() for RenderMode::Http and crawl() for Chrome/AutoSwitch.
ask, evaluate, suggest, extract fallback, debug, and research synthesis run through ACP (AXON_ACP_ADAPTER_CMD).
OPENAI_MODEL remains the model override knob for ACP-backed calls.
tei_embed() in vector/ops/tei.rs auto-splits batches on HTTP 413 (Payload Too Large). Set TEI_MAX_CLIENT_BATCH_SIZE env var to control default chunk size (default: 64, max: 128).
On HTTP 429, any 5xx status, transport errors, or response decode failures, tei_embed() makes up to 5 attempts (1 initial + 4 retries) with exponential backoff starting at 1s (1s, 2s, 4s, 8s) plus jitter (up to 500ms each). Override with TEI_MAX_RETRIES env var. Worst-case retry budget: 4 backoff sleeps (15s) + 5 request timeouts (5x30s=150s) + jitter (2s) = ~167s, well inside the 300s doc timeout.
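The schedule and the worst-case budget can be sanity-checked with a short Python sketch (the real retry loop lives in `tei_embed()`; jitter is shown but kept out of the deterministic schedule):

```python
import random

MAX_ATTEMPTS = 5          # 1 initial + 4 retries (TEI_MAX_RETRIES override)
BASE_BACKOFF_S = 1.0      # exponential: 1s, 2s, 4s, 8s
MAX_JITTER_S = 0.5        # up to 500ms extra per sleep
REQUEST_TIMEOUT_S = 30.0  # per-request timeout assumed from the 5x30s figure

def backoff_schedule(attempts: int = MAX_ATTEMPTS, jitter: bool = False) -> list:
    """Sleep durations between attempts: one fewer sleep than attempts."""
    sleeps = [BASE_BACKOFF_S * (2 ** i) for i in range(attempts - 1)]
    if jitter:
        sleeps = [s + random.uniform(0, MAX_JITTER_S) for s in sleeps]
    return sleeps

def worst_case_budget_s() -> float:
    """Backoff sleeps + all request timeouts + maximum jitter."""
    return (sum(backoff_schedule())
            + MAX_ATTEMPTS * REQUEST_TIMEOUT_S
            + (MAX_ATTEMPTS - 1) * MAX_JITTER_S)
```

`worst_case_budget_s()` reproduces the ~167s figure, comfortably inside the 300s document timeout.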
--exclude-path-prefix (and the default locale list) treats both / and - as word boundaries. This means /ja blocks both /ja/docs and /ja-jp/docs. Pass none to disable all locale filtering.
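A minimal sketch of the boundary rule (illustrative only; the real matching lives in the crawl engine):

```python
def prefix_blocks(prefix: str, path: str) -> bool:
    """True if `path` falls under `prefix`, treating `/` and `-` as
    word boundaries after the prefix (sketch of --exclude-path-prefix)."""
    if not path.startswith(prefix):
        return False
    rest = path[len(prefix):]
    # exact match, or the next character is a boundary (`/` or `-`)
    return rest == "" or rest[0] in "/-"
```

So `/ja` blocks `/ja/docs` and `/ja-jp/docs`, but not `/java/docs`.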
chunk_text() splits at 2000 chars with 200-char overlap. Each chunk = one Qdrant point. Very long pages produce many points.
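A sketch of the windowing, assuming plain fixed-size slicing (the real `chunk_text()` may pick split points differently, e.g. on whitespace):

```python
CHUNK_SIZE = 2000  # characters per chunk
OVERLAP = 200      # shared characters between consecutive chunks

def chunk_text(text: str, size: int = CHUNK_SIZE, overlap: int = OVERLAP) -> list:
    """Fixed-size windows with overlap; each chunk becomes one Qdrant point."""
    if len(text) <= size:
        return [text] if text else []
    step = size - overlap  # 1800-char stride at the defaults
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
    return chunks
```

At the defaults, a 100k-char page yields on the order of 56 chunks, hence 56 points.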
Pages with fewer than --min-markdown-chars (default: 200) are flagged as thin. If --drop-thin-markdown true (default), thin pages are skipped — not saved to disk or embedded.
build_transform_config() in crates/core/content.rs sets readability: false. Changing this to true causes Mozilla Readability to score VitePress/sidebar doc layouts as low-quality and strip them to just the page title — produces ~97% thin pages on most documentation sites. main_content: true handles structural extraction without the scoring penalty. This setting is the result of a confirmed production regression; do not "improve" it.
ensure_collection() does a GET first; only issues PUT on 404 (collection not found). This means it's safe on existing collections — no 409 Conflict. Safe to call on every embed.
axon migrate --from cortex --to cortex_v2 scrolls all points from the source, computes BM42 sparse vectors locally from chunk_text payload fields (no TEI calls), and upserts named-mode points to the destination. After migration, set AXON_COLLECTION=cortex_v2 in .env.
- Source must be an unnamed collection (`"vectors": {"size": N}` schema); named collections are rejected with a clear error.
- Destination is created automatically if it doesn't exist; if it already exists as a named collection, migration is idempotent (re-runs upsert existing points with fresh sparse vectors).
- Progress is logged every 100 pages (~25,600 points). At 256 points/page over 2.57M points, expect 1–2 hours.
- The scroll loop uses the raw Qdrant `/points/scroll` API directly (not the shared `qdrant_scroll_pages_while` helper) to enable async upserts after each page.
After migration, restart all worker processes. The process-wide VectorMode cache is not invalidated on migration — workers that embedded to the source collection before migration will retain stale Unnamed mode in memory and fall back to dense-only search even for the new named-mode destination collection.
After a crawl, append_sitemap_backfill() discovers URLs via sitemap.xml that the crawler missed and fetches them individually. Respects --max-sitemaps (default: 512) and --include-subdomains. Use --sitemap-since-days N to restrict backfill to URLs whose <lastmod> falls within the last N days; URLs without <lastmod> are always included.
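The `--sitemap-since-days` filter reduces to a simple rule, sketched here in Python (`lastmod_days_ago=None` models a URL with no `<lastmod>`):

```python
def include_sitemap_url(lastmod_days_ago, since_days: int) -> bool:
    """Sketch of the --sitemap-since-days backfill filter.

    URLs without <lastmod> always pass; since_days=0 disables the filter.
    """
    if since_days == 0 or lastmod_days_ago is None:
        return True
    return lastmod_days_ago <= since_days
```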
The Dockerfile builds from docker/Dockerfile. The build command inside the container is:
cargo build --release --bin axon

Both compose files set `context: .` — run `docker compose build` from this directory, not from a parent workspace.
Cargo.toml uses spider_agent = { path = "../spider/spider_agent", ... } for local dev with a sibling spider/ checkout. In CI or any environment without that sibling repo, switch to the registry version:
spider = { version = "2", default-features = false, features = [
"basic", "chrome", "regex", "sitemap", "adblock",
"chrome_stealth", "chrome_screenshot", "chrome_store_page",
"chrome_headless_new", "chrome_simd",
"simd", "inline-more", "cache_mem",
"ua_generator", "headers", "time", "control",
"firewall",
] }
spider_agent = { version = "2.45", default-features = false, features = ["search_tavily", "openai"] }

- `firewall`: Blocks known-bad domains (malware, phishing, spam) before fetch via the `spider_firewall` crate. Some URLs may be rejected that weren't before — this is defense-in-depth on top of `validate_url()`.
- `chrome_headless_new`: Uses `--headless=new` instead of legacy headless. Better DOM fidelity but slightly different rendering behavior on some sites.
- `balance`: NOT enabled — silently throttles concurrency with zero logging. We manage concurrency explicitly via performance profiles.
- `glob`: NOT enabled — glob URL patterns (`{a,b}`, `[0-9]`) change `crawl_establish` to use `is_allowed()` (budget-aware) instead of `is_allowed_default()`. With `with_limit(1)`, the budget check immediately returns `BudgetExceeded` for the FIRST URL, producing 0 pages from Chrome crawls. axon doesn't use URL glob patterns in its CLI, so this feature is excluded. Do NOT add it back.
- Full flag inventory: `docs/SPIDER-FEATURE-FLAGS.md`
CLI commands output JSON data to stdout and progress/logs to stderr (Spinner via indicatif, tracing via log_info/log_done). The web UI streams both: stdout as "type": "output", stderr as "type": "log". ANSI codes stripped via console::strip_ansi_codes().
New crawl job submissions check the count of pending jobs before inserting. If the count is ≥ AXON_MAX_PENDING_CRAWL_JOBS (default 100, 0 = unlimited), the submission is rejected with a human-readable error. Set to 0 to disable. Implemented in crates/jobs/crawl/runtime/db.rs via check_pending_cap().
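The cap check itself is a one-liner, sketched here in Python (the real `check_pending_cap()` queries Postgres for the pending count):

```python
MAX_PENDING_DEFAULT = 100  # AXON_MAX_PENDING_CRAWL_JOBS default; 0 = unlimited

def check_pending_cap(pending_count: int, cap: int = MAX_PENDING_DEFAULT) -> None:
    """Reject a new crawl submission when the pending queue is at capacity."""
    if cap != 0 and pending_count >= cap:
        raise RuntimeError(
            f"{pending_count} crawl jobs already pending (cap {cap}); "
            "drain the queue or raise AXON_MAX_PENDING_CRAWL_JOBS"
        )
```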
After an uncapped crawl completes (--max-pages 0, the default), if the total pages crawled exceeds AXON_CRAWL_SIZE_WARN_THRESHOLD (default 10,000), a warning is logged suggesting the user add --max-pages. Set to 0 to disable the warning.
When crawling a URL with ≥2 path segments and no explicit --url-whitelist, the crawl is automatically scoped to the directory subtree of the start URL via a derived whitelist regex. For example, crawling https://ai.google.dev/api/python/google/generativeai/GenerativeModel auto-scopes to ^https?://ai\.google\.dev/api/python/google/generativeai(/|$). Root paths (/) and single-segment paths (/docs) are not scoped — they're already broad enough. Pass --url-whitelist <pattern> to override auto-scoping.
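The derivation can be sketched in Python (illustrative; the real regex construction lives in the crawl engine):

```python
import re
from typing import Optional
from urllib.parse import urlparse

def derive_whitelist(url: str) -> Optional[str]:
    """Derive a directory-subtree whitelist regex from a start URL.

    Sketch of auto-scoping: applies only when the path has >= 2 segments
    and no explicit --url-whitelist was given.
    """
    p = urlparse(url)
    segments = [s for s in p.path.split("/") if s]
    if len(segments) < 2:
        return None  # root and single-segment paths stay unscoped
    # scope = the directory containing the start URL (drop the last segment)
    parent = "/".join(re.escape(s) for s in segments[:-1])
    host = re.escape(p.hostname)
    return rf"^https?://{host}/{parent}(/|$)"
```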
When a worker's AMQP channel dies (broker restart, consumer_timeout, network blip), the lane reconnects automatically with exponential backoff: starts at 2s, doubles each attempt, capped at 60s. On successful reconnect, the backoff resets to 2s only if the connection was alive for >=60 seconds (ran_for_secs >= AMQP_RECONNECT_MAX_SECS in worker_lane.rs). Short-lived connections that reconnect quickly retain their current backoff value. This prevents rapid reconnect loops from hammering the broker after a transient failure. The current job is not lost — it holds no AMQP reference and completes normally before the reconnect loop fires.
Note: The crawl worker's reconnect loop in crates/jobs/crawl/runtime/worker/loops.rs has different semantics: it resets backoff to RECONNECT_BACKOFF_INITIAL_SECS (2s) on every successful reconnect (i.e., when run_amqp_worker_lane returns Ok(())), regardless of how long the connection was alive.
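The two worker-lane rules side by side, as a Python sketch (constant names follow `worker_lane.rs`; a sketch of the stated semantics, not the actual code):

```python
RECONNECT_BACKOFF_INITIAL_SECS = 2
RECONNECT_BACKOFF_CAP_SECS = 60
AMQP_RECONNECT_MAX_SECS = 60  # minimum uptime before the backoff resets

def after_failed_attempt(backoff: int) -> int:
    """Each failed reconnect attempt doubles the backoff, capped at 60s."""
    return min(backoff * 2, RECONNECT_BACKOFF_CAP_SECS)

def after_connection_drop(backoff: int, ran_for_secs: int) -> int:
    """After a successful reconnect later drops: reset to 2s only if the
    connection lived >= 60s; short-lived connections keep their backoff."""
    if ran_for_secs >= AMQP_RECONNECT_MAX_SECS:
        return RECONNECT_BACKOFF_INITIAL_SECS
    return backoff
```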
When adding a new non-Option field to Config in crates/core/config.rs, you must also update the inline Config { .. } struct literals used in test helpers:
- `crates/cli/commands/research.rs`
- `crates/cli/commands/search.rs`
- Any `make_test_config()` helpers in `crates/jobs/common/`
These are struct literals — the compiler will fail if a new field is missing, but only when test code is compiled, not during a plain `cargo check`.
Concurrency tuned relative to available CPU cores:
| Profile | Crawl concurrency | Sitemap concurrency | Backfill concurrency | Timeout | Retries | Backoff |
|---|---|---|---|---|---|---|
| `high-stable` (default) | CPUs×8 (64–192) | CPUs×12 (64–256) | CPUs×6 (32–128) | 20s | 2 | 250ms |
| `balanced` | CPUs×4 (32–96) | CPUs×6 (32–128) | CPUs×3 (16–64) | 30s | 2 | 300ms |
| `extreme` | CPUs×16 (128–384) | CPUs×20 (128–512) | CPUs×10 (64–256) | 15s | 1 | 100ms |
| `max` | CPUs×24 (256–1024) | CPUs×32 (256–1536) | CPUs×20 (128–1024) | 12s | 1 | 50ms |
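Each cell reduces to one clamp formula: CPUs times a profile multiplier, clamped to the range in parentheses. A Python sketch (`profile_concurrency` is an illustrative name, not a function in the codebase):

```python
def profile_concurrency(cpus: int, multiplier: int, lo: int, hi: int) -> int:
    """CPUs x multiplier, clamped to [lo, hi]."""
    return max(lo, min(cpus * multiplier, hi))

# high-stable crawl concurrency on various machines: CPUs x 8, clamped 64-192
```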
cargo build --bin axon # debug
cargo build --release --bin axon # release
cargo check # fast type check

cargo test # run all tests
cargo test http # SSRF / URL validation tests (21)
cargo test engine # crawl engine tests (8)
cargo test chunk_text # text chunking tests (7)
cargo test -- --nocapture # show println! output

cargo clippy
cargo fmt --check

just verify # fmt-check + clippy + check + test (pre-PR gate)
just fix # cargo fmt + clippy --fix (auto-repair)
just precommit # full pre-commit: monolith check + verify
just watch-check # cargo watch: check + test-lib on every file save
just rebuild # check + test + docker-build (pre-deploy gate)
just services-up # start infra (postgres, redis, rabbitmq, qdrant, tei, chrome)
just services-down # stop infra
just up # build + start app containers (workers + web)
just down # stop app containers
just down-all # stop everything (app + infra)
just dev # full local dev (infra + axon serve supervisor)

# Debug binary
./target/debug/axon scrape https://example.com
# With env overrides
QDRANT_URL=http://localhost:53333 \
TEI_URL=http://myserver:52000 \
./target/release/axon query "embedding pipeline" --collection my_col

Size limits on changed `.rs` files are enforced in CI and via lefthook pre-commit:
- File size: ≤ 500 lines (hard fail)
- Function size: warn at 80 lines, hard fail at 120 lines
- Exempt: `tests/**`, `benches/**`, `config/**`, `**/config.rs`
- Exceptions: add to `.monolith-allowlist`
./scripts/install-git-hooks.sh # install lefthook once

axon doctor

Checks: Postgres, Redis, RabbitMQ, Qdrant, TEI, LLM endpoint reachability.
Tables are auto-created via ensure_schema() in each *_jobs.rs. Full column detail: docs/SCHEMA.md.
| Table | Key columns |
|---|---|
| `axon_crawl_jobs` | id, url, status, config_json, result_json — index on status |
| `axon_extract_jobs` | id, status, urls_json, config_json, result_json |
| `axon_embed_jobs` | id, status, input_text, config_json, result_json |
| `axon_ingest_jobs` | id, source_type, target, status, config_json, result_json — partial index on pending |
All tables share: created_at, updated_at, started_at, finished_at (TIMESTAMPTZ), error_text (TEXT).
axon_ingest_jobs differs from the others: it uses source_type (github/reddit/youtube) + target instead of url or urls_json to identify the ingest target.
- Rust standard style — run `cargo fmt` before committing
- `cargo clippy` clean before committing
- Errors bubble via `Box<dyn Error>` at command boundaries; internal helpers return typed errors
- Structured log output via `log_info`/`log_warn` (not `println!` in library code)
- `--json` flag enables machine-readable output on all commands that print results
Never use mod.rs. Use the Rust 2018+ file-per-module layout:
# WRONG — do not do this
foo/
└── mod.rs ← forbidden
# CORRECT
foo.rs ← module root lives here
foo/
├── bar.rs ← submodule
└── baz.rs ← submodule
- Module root always lives in `foo.rs`, never `foo/mod.rs`
- Submodules live in `foo/bar.rs`, declared with `mod bar;` inside `foo.rs`
- When splitting an existing `foo/mod.rs`: copy it to `foo.rs`, delete `foo/mod.rs` — the submodule files stay in `foo/` unchanged
- This applies everywhere: `crates/`, `crates/*/`, nested modules — no exceptions
This project uses bd (beads) for issue tracking. Run bd prime to see full workflow context and commands.
bd ready # Find available work
bd show <id> # View issue details
bd update <id> --claim # Claim work
bd close <id> # Complete work

- Use `bd` for ALL task tracking — do NOT use TodoWrite, TaskCreate, or markdown TODO lists
- Run `bd prime` for detailed command reference and session close protocol
- Use `bd remember` for persistent knowledge — do NOT use MEMORY.md files
When ending a work session, you MUST complete ALL steps below. Work is NOT complete until git push succeeds.
MANDATORY WORKFLOW:
- File issues for remaining work - Create issues for anything that needs follow-up
- Run quality gates (if code changed) - Tests, linters, builds
- Update issue status - Close finished work, update in-progress items
- PUSH TO REMOTE - This is MANDATORY:
   git pull --rebase
   bd dolt push
   git push
   git status # MUST show "up to date with origin"
- Clean up - Clear stashes, prune remote branches
- Verify - All changes committed AND pushed
- Hand off - Provide context for next session
CRITICAL RULES:
- Work is NOT complete until `git push` succeeds
- NEVER stop before pushing - that leaves work stranded locally
- NEVER say "ready to push when you are" - YOU must push
- If push fails, resolve and retry until it succeeds
Every feature branch push MUST bump the version in ALL version-bearing files.
Bump type is determined by the commit message prefix:
- `feat!:` or `BREAKING CHANGE` → major (X+1.0.0)
- `feat:` or `feat(...)` → minor (X.Y+1.0)
- Everything else (`fix`, `chore`, `refactor`, `test`, `docs`, etc.) → patch (X.Y.Z+1)
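The prefix rules can be sketched as follows (`bump_version` is an illustrative helper, not part of the repo):

```python
def bump_version(version: str, commit_msg: str) -> str:
    """Next semver from a conventional-commit message prefix."""
    major, minor, patch = map(int, version.split("."))
    first_line = commit_msg.splitlines()[0]
    if first_line.startswith("feat!") or "BREAKING CHANGE" in commit_msg:
        return f"{major + 1}.0.0"          # breaking change -> major
    if first_line.startswith(("feat:", "feat(")):
        return f"{major}.{minor + 1}.0"    # new feature -> minor
    return f"{major}.{minor}.{patch + 1}"  # everything else -> patch
```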
Files to update (if they exist in this repo):
- `Cargo.toml` — `version = "X.Y.Z"` in `[package]`
- `package.json` — `"version": "X.Y.Z"`
- `pyproject.toml` — `version = "X.Y.Z"` in `[project]`
- `.claude-plugin/plugin.json` — `"version": "X.Y.Z"`
- `.codex-plugin/plugin.json` — `"version": "X.Y.Z"`
- `gemini-extension.json` — `"version": "X.Y.Z"`
- `README.md` — version badge or header
- `CHANGELOG.md` — new entry under the bumped version
All files MUST have the same version. Never bump only one file. CHANGELOG.md must have an entry for every version bump.