feat(crawl): return output paths + artifact integration in crawl service #30
Description
Summary
When `crawl_start()` enqueues jobs and returns job IDs, callers (MCP tool, web UI, CLI) have no way to know where the crawled documents will be written. Add output path information to the `CrawlStartResult` and integrate crawl output with the MCP artifact system so crawled markdown is discoverable and inspectable via `artifacts` subactions.
Current State
`crates/services/crawl.rs` — `crawl_start()` returns:
```rust
pub struct CrawlStartResult {
    pub job_ids: Vec<String>,
}
```

The caller gets job IDs and nothing else. To find the crawled documents, they must:
- Know the `--output-dir` default (`.cache/axon-rust/output`)
- Know the URL→filename mapping convention
- Wait for the job to complete before any files exist
The MCP tool's `crawl` + `start` response today: a job ID. No paths, no artifact location, no way to inspect output via `artifacts head` / `artifacts grep` without guessing the file path.
Proposed Changes
1. Add output path info to `CrawlStartResult`
```rust
pub struct CrawlStartResult {
    pub job_ids: Vec<String>,
    /// Predicted output directory for crawled documents.
    /// Files will appear here as the crawl progresses.
    pub output_dir: String,
    /// Per-URL predicted output paths (best-effort; actual filenames
    /// may differ slightly based on URL normalization).
    pub predicted_paths: Vec<PredictedPath>,
}

pub struct PredictedPath {
    pub url: String,
    pub path: String, // relative to output_dir
}
```

Wire `cfg.output_dir` (already in `Config`) into `crawl_start()` and compute predicted paths using the same URL→filename logic the crawl engine uses.
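As a rough illustration of the path prediction step, the sketch below maps a URL to a markdown path relative to `output_dir`. The `predict_path` helper and its `host/path.md` convention are assumptions made for this example only; the real implementation should reuse whatever URL→filename logic the crawl engine already has rather than reimplement it.

```rust
/// Mirror of the proposed `PredictedPath`, repeated so the sketch is self-contained.
#[derive(Debug, Clone, PartialEq)]
pub struct PredictedPath {
    pub url: String,
    pub path: String, // relative to output_dir
}

/// Hypothetical URL -> relative markdown path mapping.
/// ASSUMPTION: the crawl engine's actual convention may differ.
pub fn predict_path(url: &str) -> PredictedPath {
    // Drop the scheme ("https://"), if present.
    let rest = url.split_once("://").map_or(url, |(_, r)| r);
    // Ignore any query string or fragment.
    let rest = rest
        .split(|c: char| c == '?' || c == '#')
        .next()
        .unwrap_or("");
    // Separate the host from the URL path.
    let (host, path) = rest.split_once('/').unwrap_or((rest, ""));
    let path = path.trim_matches('/');
    let relative = if path.is_empty() {
        // Site root maps to an index document.
        format!("{host}/index.md")
    } else {
        format!("{host}/{path}.md")
    };
    PredictedPath {
        url: url.to_string(),
        path: relative,
    }
}
```

Under these assumptions, `predict_path("https://docs.example.com")` yields the relative path `docs.example.com/index.md`, matching the shape of the example in the MCP response.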
2. MCP response includes artifact paths
When the MCP `crawl` + `start` handler calls `crawl_start()`, include the output paths in the response so callers can immediately use `artifacts` subactions:
```json
{
  "ok": true,
  "action": "crawl",
  "subaction": "start",
  "data": {
    "job_ids": ["b3085ef9-..."],
    "output_dir": ".cache/axon-rust/output",
    "predicted_paths": [
      { "url": "https://docs.example.com", "path": ".cache/axon-rust/output/docs.example.com/index.md" }
    ],
    "message": "1 crawl job(s) enqueued. Use artifacts subactions to inspect output as it arrives."
  }
}
```

This lets MCP callers immediately run:

```json
{ "action": "artifacts", "subaction": "head", "path": ".cache/axon-rust/output/docs.example.com/index.md" }
```

...without having to know the path convention.
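The struct comment describes `PredictedPath.path` as relative to `output_dir`, while the MCP example shows a path that already includes the output directory. One way to reconcile the two is for the handler to join them before building the response. The sketch below assumes that approach; `artifact_path` is a hypothetical helper, not an existing function in the codebase.

```rust
use std::path::Path;

/// Joins the crawl output directory with a path relative to it, producing a
/// string that could be passed to an `artifacts` subaction.
/// ASSUMPTION: the handler, not `crawl_start()`, performs this join.
pub fn artifact_path(output_dir: &str, relative: &str) -> String {
    // `Path::join` inserts a separator only when needed. Note that an
    // absolute `relative` argument would replace `output_dir` entirely.
    Path::new(output_dir)
        .join(relative)
        .to_string_lossy()
        .into_owned()
}
```

If `crawl_start()` ends up returning full paths instead, this step is unnecessary and the comment on `PredictedPath.path` should be updated to match.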
3. Crawl status includes completed artifact paths
Enhance the `crawl_status()` response to include the list of files actually written once the job completes:

```rust
pub struct CrawlJobResult {
    pub payload: serde_json::Value,
    /// Files written by this crawl job (populated after completion).
    pub output_files: Vec<String>,
    /// Total pages crawled, bytes written.
    pub stats: Option<CrawlOutputStats>,
}
```

The crawl job `result_json` in Postgres already contains page counts. Extract file paths from it, or scan `output_dir` for files modified during the job window.
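The directory-scan fallback could look roughly like the following. This is a minimal sketch using only the standard library; `files_written_between` is a hypothetical name, and it assumes the job's start and finish times are available as `SystemTime` values.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::time::SystemTime;

/// Recursively collects regular files under `dir` whose modification time
/// falls within the inclusive window [started, finished].
pub fn files_written_between(
    dir: &Path,
    started: SystemTime,
    finished: SystemTime,
) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    // Explicit stack instead of recursion, so deep trees cannot overflow.
    let mut pending = vec![dir.to_path_buf()];
    while let Some(current) = pending.pop() {
        for entry in fs::read_dir(&current)? {
            let entry = entry?;
            // `DirEntry::metadata` does not follow symlinks.
            let meta = entry.metadata()?;
            if meta.is_dir() {
                pending.push(entry.path());
            } else if meta.is_file() {
                let modified = meta.modified()?;
                if modified >= started && modified <= finished {
                    found.push(entry.path());
                }
            }
        }
    }
    // Sort for a stable, reproducible result.
    found.sort();
    Ok(found)
}
```

A modification-time scan is only a heuristic: if several jobs share one `output_dir` and overlap in time, it may attribute files to the wrong job, so extracting paths from `result_json` is likely the more reliable option where the data exists.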
4. `CrawlStartResult` in CLI output
When the `--json` flag is set, the CLI `crawl` command should include `output_dir` and `predicted_paths` in its JSON output so scripts can locate crawled files without polling.
Files
| File | Action |
|---|---|
| `crates/services/types/service.rs` | Add `output_dir`, `predicted_paths` to `CrawlStartResult`; add `output_files` to `CrawlJobResult` |
| `crates/services/crawl.rs` | Wire `cfg.output_dir` into `crawl_start()`; compute predicted paths |
| `crates/mcp/server/` (crawl handler) | Include output paths in MCP `crawl` + `start` response |
| `crates/cli/commands/crawl.rs` | Include output paths in `--json` output |
| `docs/MCP-TOOL-SCHEMA.md` | Update crawl response schema |
Acceptance Criteria
- `CrawlStartResult` includes `output_dir` and `predicted_paths`
- MCP `crawl` + `start` response includes `output_dir` and `predicted_paths`
- Returned paths are usable directly with `artifacts head` / `artifacts grep` — no guessing required
- `crawl_status()` response includes `output_files` list when job is complete
- CLI `--json` output includes `output_dir` and `predicted_paths`
- Existing `crawl_start()` callers unaffected (additive change)
- Monolith limits respected (≤500 lines/file)
- All existing crawl tests pass; new tests for path prediction added
- `cargo clippy` clean
- `docs/MCP-TOOL-SCHEMA.md` updated with new response shape