feat(crawl): return output paths + artifact integration in crawl service #30

@jmagar

Summary

When `crawl_start()` enqueues jobs and returns job IDs, callers (MCP tool, web UI, CLI) have no way to know where the crawled documents will be written. Add output path information to the `CrawlStartResult` and integrate crawl output with the MCP artifact system so crawled markdown is discoverable and inspectable via `artifacts` subactions.

Current State

`crates/services/crawl.rs` — `crawl_start()` returns:

pub struct CrawlStartResult {
    pub job_ids: Vec<String>,
}

The caller gets job IDs and nothing else. To find the crawled documents, they must:

  1. Know the `--output-dir` default (`.cache/axon-rust/output`)
  2. Know the URL→filename mapping convention
  3. Wait for the job to complete before any files exist

Today the MCP tool's `crawl` + `start` response contains only a job ID: no paths, no artifact location, and no way to inspect output via `artifacts head` / `artifacts grep` without guessing the file path.

Proposed Changes

1. Add output path info to `CrawlStartResult`

pub struct CrawlStartResult {
    pub job_ids: Vec<String>,
    /// Predicted output directory for crawled documents.
    /// Files will appear here as the crawl progresses.
    pub output_dir: String,
    /// Per-URL predicted output paths (best-effort — actual filenames
    /// may differ slightly based on URL normalization).
    pub predicted_paths: Vec<PredictedPath>,
}

pub struct PredictedPath {
    pub url: String,
    pub path: String,   // includes output_dir, so it can be passed to `artifacts` subactions as-is
}

Wire `cfg.output_dir` (already in `Config`) into `crawl_start()` and compute predicted paths using the same URL→filename logic the crawl engine uses.
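
A minimal sketch of what path prediction could look like, using only the standard library. The `predict_path` helper and its URL→filename convention (host as a directory, empty path mapped to `index.md`, `.md` appended otherwise) are assumptions for illustration; the real implementation should call the crawl engine's existing mapping rather than duplicate it. The returned path is prefixed with `output_dir`, matching the MCP response example in this issue.

```rust
/// Hypothetical mirror of the `PredictedPath` struct proposed in this issue.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PredictedPath {
    pub url: String,
    pub path: String,
}

/// Best-effort URL -> markdown path prediction.
/// The convention here is illustrative only; the crawl engine's own
/// URL -> filename logic is the source of truth.
pub fn predict_path(output_dir: &str, url: &str) -> PredictedPath {
    // Drop the scheme ("https://") if present.
    let rest = url.split_once("://").map(|(_, r)| r).unwrap_or(url);
    // Ignore the query string and fragment.
    let rest = rest
        .split(|c: char| c == '?' || c == '#')
        .next()
        .unwrap_or("");
    // Split the host from the path.
    let (host, path) = match rest.split_once('/') {
        Some((h, p)) => (h, p),
        None => (rest, ""),
    };
    // Skip empty and dot segments so the result cannot escape output_dir.
    let segments: Vec<&str> = path
        .split('/')
        .filter(|s| !s.is_empty() && *s != "." && *s != "..")
        .collect();

    let mut out = format!("{}/{}", output_dir.trim_end_matches('/'), host);
    if segments.is_empty() {
        out.push_str("/index.md");
    } else {
        for seg in &segments {
            out.push('/');
            out.push_str(seg);
        }
        out.push_str(".md");
    }

    PredictedPath {
        url: url.to_string(),
        path: out,
    }
}
```

Under these assumptions, `crawl_start()` could map each start URL through `predict_path(&cfg.output_dir, url)` to fill `predicted_paths`.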

2. MCP response includes artifact paths

When the MCP `crawl` + `start` handler calls `crawl_start()`, include the output paths in the response so callers can immediately use `artifacts` subactions:

{
  "ok": true,
  "action": "crawl",
  "subaction": "start",
  "data": {
    "job_ids": ["b3085ef9-..."],
    "output_dir": ".cache/axon-rust/output",
    "predicted_paths": [
      { "url": "https://docs.example.com", "path": ".cache/axon-rust/output/docs.example.com/index.md" }
    ],
    "message": "1 crawl job(s) enqueued. Use artifacts subactions to inspect output as it arrives."
  }
}

This lets MCP callers immediately run:

{ "action": "artifacts", "subaction": "head", "path": ".cache/axon-rust/output/docs.example.com/index.md" }

...without having to know the path convention.

3. Crawl status includes completed artifact paths

Enhance the `crawl_status()` response to include the list of files actually written once the job completes:

pub struct CrawlJobResult {
    pub payload: serde_json::Value,
    /// Files written by this crawl job (populated after completion).
    pub output_files: Vec<String>,
    /// Total pages crawled, bytes written.
    pub stats: Option<CrawlOutputStats>,
}

The crawl job `result_json` in Postgres already contains page counts. Populate `output_files` either by extracting file paths from it, or by scanning `output_dir` for files modified during the job window.
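
The directory-scan option could be sketched as follows, using only the standard library. The `scan_output_files` function and the fields of `CrawlOutputStats` are hypothetical (the issue names the type but does not define it), and the sketch assumes one output file per crawled page and that the job's start and finish times are available to the caller.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::time::SystemTime;

/// Hypothetical shape for the `CrawlOutputStats` type named in this issue.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct CrawlOutputStats {
    /// Number of files found; assumes one output file per crawled page.
    pub pages: usize,
    pub bytes_written: u64,
}

/// Recursively collect files under `output_dir` whose modification time
/// falls inside the job window [started_at, finished_at].
pub fn scan_output_files(
    output_dir: &Path,
    started_at: SystemTime,
    finished_at: SystemTime,
) -> io::Result<(Vec<String>, CrawlOutputStats)> {
    let mut files = Vec::new();
    let mut stats = CrawlOutputStats::default();
    // Explicit stack instead of recursion to walk subdirectories.
    let mut pending: Vec<PathBuf> = vec![output_dir.to_path_buf()];

    while let Some(dir) = pending.pop() {
        for entry in fs::read_dir(&dir)? {
            let entry = entry?;
            let meta = entry.metadata()?;
            if meta.is_dir() {
                pending.push(entry.path());
                continue;
            }
            let modified = meta.modified()?;
            if modified >= started_at && modified <= finished_at {
                stats.pages += 1;
                stats.bytes_written += meta.len();
                files.push(entry.path().to_string_lossy().into_owned());
            }
        }
    }

    // Sort for a stable, reproducible `output_files` list.
    files.sort();
    Ok((files, stats))
}
```

One caveat with this approach: if two crawl jobs write into the same `output_dir` during overlapping windows, a timestamp scan cannot tell their files apart, so recording paths in `result_json` may be the more reliable option.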

4. `CrawlStartResult` in CLI output

When the `--json` flag is set, the CLI `crawl` command should include `output_dir` and `predicted_paths` in the JSON output so scripts can locate crawled files without polling.

Files

| File | Action |
| --- | --- |
| `crates/services/types/service.rs` | Add `output_dir`, `predicted_paths` to `CrawlStartResult`; add `output_files` to `CrawlJobResult` |
| `crates/services/crawl.rs` | Wire `cfg.output_dir` into `crawl_start()`; compute predicted paths |
| `crates/mcp/server/` (crawl handler) | Include output paths in MCP `crawl` + `start` response |
| `crates/cli/commands/crawl.rs` | Include output paths in `--json` output |
| `docs/MCP-TOOL-SCHEMA.md` | Update crawl response schema |

Acceptance Criteria

  • `CrawlStartResult` includes `output_dir` and `predicted_paths`
  • MCP `crawl` + `start` response includes `output_dir` and `predicted_paths`
  • Returned paths are usable directly with `artifacts head` / `artifacts grep` — no guessing required
  • `crawl_status()` response includes `output_files` list when job is complete
  • CLI `--json` output includes `output_dir` and `predicted_paths`
  • Existing `crawl_start()` callers unaffected (additive change)
  • Monolith limits respected (≤500 lines/file)
  • All existing crawl tests pass; new tests for path prediction added
  • `cargo clippy` clean
  • `docs/MCP-TOOL-SCHEMA.md` updated with new response shape
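
On the "additive change" criterion: in Rust, new public fields do not affect callers that only read existing fields, but they do break any code that builds the struct with a full literal. A rough sketch of one way to keep both kinds of code compiling, assuming the structs can derive `Default` (the derives and the `first_job_id` / `legacy_result` helpers are hypothetical, not part of the current codebase):

```rust
/// Hypothetical mirror of the proposed `PredictedPath`.
#[derive(Debug, Clone, Default)]
pub struct PredictedPath {
    pub url: String,
    pub path: String,
}

/// Hypothetical mirror of the extended `CrawlStartResult`.
#[derive(Debug, Clone, Default)]
pub struct CrawlStartResult {
    pub job_ids: Vec<String>,
    pub output_dir: String,
    pub predicted_paths: Vec<PredictedPath>,
}

/// A caller that only reads `job_ids` compiles unchanged after the
/// new fields are added.
pub fn first_job_id(result: &CrawlStartResult) -> Option<&str> {
    result.job_ids.first().map(String::as_str)
}

/// Code that constructs the struct (e.g. test fixtures) can fill the
/// new fields with defaults instead of listing every field.
pub fn legacy_result(job_ids: Vec<String>) -> CrawlStartResult {
    CrawlStartResult {
        job_ids,
        ..Default::default()
    }
}
```

Existing test fixtures that construct `CrawlStartResult { job_ids }` directly would still need a one-line update to this form.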

Labels: enhancement
