Batch-seed the local OpenSonarX search index from known documentation sources. Designed to build a day-0 corpus of high-quality technical content for AI agents.
# Seed from a sitemap
oi seed sitemap https://docs.rs/sitemap/tokio/sitemap.xml
# Seed from a URL list
oi seed urls ./my-urls.txt
# Seed top 500 Rust crates from docs.rs
oi seed registry docs-rs --top 500
# Seed all English MDN Web Docs
oi seed registry mdn

oi seed registry docs-rs --top <N> [-c <concurrency>]

Fetches the top N crates by download count from the crates.io API, then builds a https://docs.rs/crate/{name}/latest entry URL for each.
- Content: Rust API documentation (auto-generated from doc comments)
- Scale: ~160k crates available; --top 500 covers the most-used ecosystem
- Rate limiting: 500ms delay between crates.io API pages
- Example crates at top-500: serde, tokio, rand, reqwest, clap, hyper, axum, tracing, regex, chrono, anyhow, thiserror, futures, bytes, log, syn, quote, proc-macro2, libc, etc.
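The URL construction described above can be sketched in Python (helper name hypothetical; pagination over the crates.io API and the 500ms delay are handled by the real crawler, not shown here):

```python
def docs_rs_entry_urls(crate_names):
    """Build a https://docs.rs/crate/{name}/latest entry URL per crate.

    crate_names is assumed to come from the crates.io API, already
    sorted by download count (e.g. ?sort=downloads&per_page=100&page=N).
    """
    return [f"https://docs.rs/crate/{name}/latest" for name in crate_names]
```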
oi seed registry mdn [-c <concurrency>]

Fetches the official English sitemap from https://developer.mozilla.org/sitemaps/en-us/sitemap.xml.gz, decompresses it, and extracts all documentation URLs.
- Content: Web platform reference — HTML, CSS, JavaScript, Web APIs, HTTP, accessibility, SVG, MathML
- Scale: ~12,000+ pages covering the full web platform
- Source: Mozilla's canonical sitemap (gzipped XML)
- Covers: MDN reference docs, guides, tutorials, and glossary entries
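The fetch-decompress-extract step can be sketched with the Python standard library (function name hypothetical; the real implementation may differ):

```python
import gzip
import xml.etree.ElementTree as ET

# Sitemaps use a single XML namespace; ElementTree needs it spelled out.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def extract_urls(gzipped_sitemap: bytes) -> list[str]:
    """Decompress a gzipped sitemap and return every <loc> URL in it."""
    xml_bytes = gzip.decompress(gzipped_sitemap)
    root = ET.fromstring(xml_bytes)
    return [loc.text for loc in root.iter(f"{SITEMAP_NS}loc")]
```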
oi seed sitemap <url> [-c <concurrency>]

Fetches a sitemap.xml (or sitemap.xml.gz) and crawls all listed URLs. Supports:
- Standard <urlset> sitemaps
- Sitemap index files (<sitemapindex>) with one level of nested sitemap resolution
- Gzipped sitemaps (.xml.gz) via binary fetch + gzip decompression
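A sketch of that resolution logic, assuming a fetch callable (url -> bytes) standing in for the crawler's HTTP client; the real code surely differs in detail:

```python
import gzip
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap(data: bytes, fetch, depth: int = 0) -> list[str]:
    """Return page URLs; follow a <sitemapindex> exactly one level deep."""
    if data[:2] == b"\x1f\x8b":          # gzip magic bytes => .xml.gz payload
        data = gzip.decompress(data)
    root = ET.fromstring(data)
    if root.tag == f"{NS}sitemapindex":
        if depth >= 1:                    # only one level of nesting
            return []
        urls = []
        for loc in root.iter(f"{NS}loc"):
            urls.extend(parse_sitemap(fetch(loc.text), fetch, depth + 1))
        return urls
    return [loc.text for loc in root.iter(f"{NS}loc")]
```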
A curated list of 43 verified sitemap URLs is available at examples/seed-sitemaps.txt. To seed all of them:
while IFS= read -r url; do
[[ "$url" =~ ^#.*$ || -z "$url" ]] && continue
oi seed sitemap "$url" -c 8
done < examples/seed-sitemaps.txt

The list covers:
| Category | Sources | Pages |
|---|---|---|
| Languages | Python, Node.js, Go, TypeScript, C#/.NET, Kotlin, PHP, Ruby | ~10k+ |
| Frameworks | Next.js, Vue, Angular, Django, FastAPI, Flask, Laravel, Rails, Spring | ~3k+ |
| Databases | PostgreSQL, Redis, MongoDB, Elasticsearch | ~2k+ |
| Cloud/Infra | AWS, GCP, Docker, Kubernetes, Cloudflare, Vercel, GitLab, Supabase, Firebase, Nginx | ~20k+ |
| AI/ML | PyTorch, TensorFlow, Hugging Face, LangChain, OpenAI, Anthropic | ~5k+ |
| DevTools | Git, Webpack, Vite, ESLint, gRPC, Grafana, Ansible | ~2k+ |
Not available (no sitemap found): Apple Developer, cppreference, React, Svelte, Tailwind CSS, MySQL, SQLite, HashiCorp, GitHub Docs, GraphQL, Prometheus.
oi seed urls <file> [-c <concurrency>]

Reads URLs from a text file, one per line. Lines starting with # are treated as comments; blank lines are skipped.
# Core Rust docs
https://doc.rust-lang.org/std/index.html
https://doc.rust-lang.org/book/title-page.html
https://doc.rust-lang.org/cargo/index.html
# Tokio ecosystem
https://docs.rs/tokio/latest/tokio/
https://docs.rs/hyper/latest/hyper/
https://docs.rs/axum/latest/axum/
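The file format above is simple enough to pin down as a parser (hypothetical helper, shown only to make the comment and blank-line rules precise):

```python
def read_url_file(text: str) -> list[str]:
    """Parse a seed URL file: one URL per line; skip # comments and blanks."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        urls.append(line)
    return urls
```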
| Flag | Default | Description |
|---|---|---|
| -c, --concurrency | 4 | Number of concurrent crawl workers |
| -v, --verbose | off | Print per-URL progress ([OK], [SKIP], [ERR]) |
| --top | 100 | (docs-rs only) Number of top crates to seed |
- Resolve URLs from the chosen source (sitemap, file, or registry)
- Check robots.txt for each domain (fetched once, cached per-domain)
- Crawl each URL through the Distiller pipeline: fetch HTML, extract to markdown, quality-gate, chunk, embed
- Ingest chunks into the local HybridIndex (HNSW + BM25) at ~/.opensonarx/index/
- Save periodically (every 500 indexed URLs) and on completion
- Print a summary: total, indexed, failed, robots-skipped, final doc count
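The fetched-once, cached-per-domain robots.txt check can be sketched with Python's urllib.robotparser (class and fetcher names are hypothetical; fetch_text stands in for the crawler's HTTP client):

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

class RobotsCache:
    """Fetch robots.txt once per domain and cache the parsed ruleset."""

    def __init__(self, fetch_text):
        self.fetch_text = fetch_text   # callable: url -> robots.txt body (str)
        self.cache = {}                # domain -> RobotFileParser

    def allowed(self, url: str, agent: str = "opensonarx") -> bool:
        domain = urlparse(url).netloc
        if domain not in self.cache:
            rp = RobotFileParser()
            try:
                body = self.fetch_text(f"https://{domain}/robots.txt")
                rp.parse(body.splitlines())
            except Exception:
                rp.parse([])           # unreachable robots.txt: allow all
            self.cache[domain] = rp
        return self.cache[domain].can_fetch(agent, url)
```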
Seeded content is stored at ~/.opensonarx/index/ — the same path used by oi network node and oi sentinel start. This means:
- Running oi network node after seeding will serve the seeded content to P2P peers
- Re-running oi seed adds to the existing index (duplicates are skipped via chunk_id dedup)
- Use oi index info to inspect the index
- Use oi index export to create a portable archive for sharing with other nodes
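The dedup behavior that makes re-runs safe can be sketched as follows. The actual chunk_id scheme is not documented here, so the hash below is purely illustrative:

```python
import hashlib

def chunk_id(source_url: str, chunk_text: str) -> str:
    """Illustrative chunk_id: a stable hash of source URL + chunk text."""
    return hashlib.sha256(f"{source_url}\x00{chunk_text}".encode()).hexdigest()

def ingest(index: dict, source_url: str, chunks: list[str]) -> int:
    """Add chunks to the index, skipping IDs already present; return #added."""
    added = 0
    for text in chunks:
        cid = chunk_id(source_url, text)
        if cid not in index:
            index[cid] = text
            added += 1
    return added
```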
For a comprehensive technical docs corpus:
# 1. Built-in sources
oi seed registry docs-rs --top 500 -c 8
oi seed registry mdn -c 8
# 2. All 43 verified sitemaps
while IFS= read -r url; do
[[ "$url" =~ ^#.*$ || -z "$url" ]] && continue
oi seed sitemap "$url" -c 8
done < examples/seed-sitemaps.txt
# 3. Verify
oi index info