Commit e764737

chore: periodic maintenance — deps update and spec sync (#83)
## What

Periodic maintenance per `specs/maintenance.md` checklist.

## Why

Keep dependencies current and specs aligned with code.

## How

- `cargo update` to bump minor/patch deps (aws-lc-sys, cc, cmake, iri-string, jni-sys, mio, simd-adler32)
- Sync `specs/initial.md`: MCP tool name (`web_fetch` not `fetchkit`), updated description and input schema to match implementation
- Sync `specs/fetchers.md`: document all 13 built-in fetchers (was 3), update module structure and response format values

## Risk

- Low
- Dependency updates are minor/patch only; spec changes are documentation-only

### Checklist

- [x] Unit tests are passed
- [x] Smoke tests are passed
- [x] Documentation is updated
- [x] Specs are up to date and not in conflict
- [x] All 8 maintenance sections reviewed
1 parent 9e4ea8c commit e764737

3 files changed, +124 -19 lines changed


Cargo.lock

Lines changed: 37 additions & 15 deletions
Some generated files are not rendered by default.

specs/fetchers.md

Lines changed: 84 additions & 1 deletion
@@ -69,14 +69,87 @@ Central dispatcher that:
 - Quoted tweets rendered as blockquotes
 - Both APIs are unauthenticated; syndication API is undocumented but widely used

+#### GitHubCodeFetcher
+
+- Matches: `https://github.com/{owner}/{repo}/blob/{ref}/{path}`
+- Excludes: Reserved owner paths (settings, issues, pulls, etc.)
+- Behavior: Fetches raw source files via GitHub API, detects language from extension, handles base64 decoding, returns metadata for files >1MB or binary
+- Response format field: `"github_file"`
+
+#### GitHubIssueFetcher
+
+- Matches: `https://github.com/{owner}/{repo}/issues/{number}` and `https://github.com/{owner}/{repo}/pull/{number}`
+- Excludes: Reserved owner paths, non-numeric IDs
+- Behavior: Fetches issue/PR metadata, labels, assignees, milestone, and up to 100 comments; PRs include diff stats and merge status
+- Response format field: `"github_issue"` or `"github_pull_request"`
+
+#### StackOverflowFetcher
+
+- Matches: `https://{stackoverflow.com|serverfault.com|superuser.com|askubuntu.com|mathoverflow.net|*.stackexchange.com}/questions/{id}`
+- Behavior: Fetches question and top 10 answers sorted by votes via Stack Exchange API
+- Response format field: `"stackoverflow_qa"`
+
+#### PackageRegistryFetcher
+
+- Matches: `https://pypi.org/project/{name}`, `https://crates.io/crates/{name}`, `https://www.npmjs.com/package/{name}` (including @scope/name)
+- Behavior: Fetches package metadata from respective registry APIs
+- Response format field: `"package_registry"`
+
+#### WikipediaFetcher
+
+- Matches: `https://{lang}.wikipedia.org/wiki/{title}`
+- Behavior: Fetches article summary via MediaWiki REST API and full HTML, converts to markdown
+- Response format field: `"wikipedia"`
+
+#### YouTubeFetcher
+
+- Matches: `https://youtube.com/watch?v={id}`, `https://youtu.be/{id}`
+- Behavior: Fetches video metadata via oEmbed API
+- Response format field: `"youtube_video"`
+
+#### ArXivFetcher
+
+- Matches: `https://arxiv.org/abs/{id}` and `https://arxiv.org/pdf/{id}`
+- Behavior: Fetches paper metadata via arXiv Atom XML API
+- Response format field: `"arxiv_paper"`
+
+#### HackerNewsFetcher
+
+- Matches: `https://news.ycombinator.com/item?id={id}`
+- Behavior: Fetches item via HN Firebase API with top 20 comments and one level of replies
+- Response format field: `"hackernews"`
+
+#### RSSFeedFetcher
+
+- Matches: URLs ending with `/feed`, `/rss`, `/atom`, `.rss`, `.xml` variants
+- Behavior: Detects RSS 2.0 or Atom 1.0, parses up to 20 entries
+- Response format field: `"rss_feed"`
+
+#### DocsSiteFetcher
+
+- Matches: Direct `/llms.txt` or `/llms-full.txt` URLs, or known docs sites (ReadTheDocs, docs.rs, GitBook, etc.)
+- Behavior: Probes for llms-full.txt/llms.txt at origin; if not found, fetches page and converts HTML to markdown
+- Response format field: `"documentation"` or `"markdown"`
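
Reviewer note: the match and exclude rules documented above for `GitHubCodeFetcher` and `GitHubIssueFetcher` can be illustrated with a small, standard-library-only sketch. It is not the crate's implementation. `GitHubTarget`, `classify_github_url`, and `RESERVED_OWNERS` are hypothetical names, the reserved list holds only the three owners the spec names (the "etc." is left unfilled), and `{ref}` is assumed to be a single path segment.

```rust
/// Hypothetical classification of a GitHub URL, following the spec's match rules.
#[derive(Debug, PartialEq)]
enum GitHubTarget {
    File { owner: String, repo: String, git_ref: String, path: String },
    Issue { owner: String, repo: String, number: u64 },
    PullRequest { owner: String, repo: String, number: u64 },
}

impl GitHubTarget {
    /// Response format value documented in the spec for each kind of target.
    fn format(&self) -> &'static str {
        match self {
            GitHubTarget::File { .. } => "github_file",
            GitHubTarget::Issue { .. } => "github_issue",
            GitHubTarget::PullRequest { .. } => "github_pull_request",
        }
    }
}

// Assumption: only the owners named in the spec; the real list is longer ("etc.").
const RESERVED_OWNERS: &[&str] = &["settings", "issues", "pulls"];

fn classify_github_url(url: &str) -> Option<GitHubTarget> {
    let rest = url.strip_prefix("https://github.com/")?;
    // Drop any query string or fragment before splitting into path segments.
    let path = rest.split(|c: char| c == '?' || c == '#').next()?;
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    let (owner, repo) = (*segments.first()?, *segments.get(1)?);
    if RESERVED_OWNERS.contains(&owner) {
        return None;
    }
    match segments.get(2).copied()? {
        // /{owner}/{repo}/blob/{ref}/{path}
        "blob" if segments.len() >= 5 => Some(GitHubTarget::File {
            owner: owner.to_string(),
            repo: repo.to_string(),
            // Assumption: refs containing '/' (e.g. "feature/x") are not handled here.
            git_ref: segments[3].to_string(),
            path: segments[4..].join("/"),
        }),
        // /{owner}/{repo}/issues/{number} or /{owner}/{repo}/pull/{number}
        kind @ ("issues" | "pull") if segments.len() == 4 => {
            // Non-numeric IDs are excluded: parse failure yields None.
            let number: u64 = segments[3].parse().ok()?;
            let (owner, repo) = (owner.to_string(), repo.to_string());
            Some(if kind == "issues" {
                GitHubTarget::Issue { owner, repo, number }
            } else {
                GitHubTarget::PullRequest { owner, repo, number }
            })
        }
        _ => None,
    }
}
```

Under these assumptions, `classify_github_url("https://github.com/rust-lang/rust/pull/123")` yields a `PullRequest` whose `format()` is `"github_pull_request"`, while a non-numeric ID or a reserved owner yields `None`.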
+
 ### Response Extensions

 `FetchResponse.format` values:
 - `"markdown"` - HTML converted to markdown
 - `"text"` - HTML converted to plain text
 - `"raw"` - Original content unchanged
 - `"github_repo"` - GitHub repository metadata + README
+- `"github_file"` - GitHub source file content
+- `"github_issue"` - GitHub issue content
+- `"github_pull_request"` - GitHub pull request content
 - `"twitter_tweet"` - Twitter/X tweet content with metadata
+- `"stackoverflow_qa"` - Stack Overflow Q&A
+- `"package_registry"` - Package registry metadata
+- `"wikipedia"` - Wikipedia article
+- `"youtube_video"` - YouTube video metadata
+- `"arxiv_paper"` - arXiv paper metadata
+- `"hackernews"` - Hacker News item with comments
+- `"rss_feed"` - RSS/Atom feed entries
+- `"documentation"` - Documentation site content
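
Reviewer note: a consumer that wants to branch on these 16 values might mirror the list in a typed enum. The sketch below assumes `FetchResponse.format` arrives as one of the strings listed above. `ResponseFormat` is a hypothetical name and may not match how the crate models the field.

```rust
use std::str::FromStr;

/// Hypothetical typed mirror of the documented `FetchResponse.format` strings.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ResponseFormat {
    Markdown, Text, Raw,
    GithubRepo, GithubFile, GithubIssue, GithubPullRequest,
    TwitterTweet, StackoverflowQa, PackageRegistry, Wikipedia,
    YoutubeVideo, ArxivPaper, Hackernews, RssFeed, Documentation,
}

impl ResponseFormat {
    /// Every variant, in the order the spec lists the values.
    const ALL: [ResponseFormat; 16] = [
        ResponseFormat::Markdown, ResponseFormat::Text, ResponseFormat::Raw,
        ResponseFormat::GithubRepo, ResponseFormat::GithubFile,
        ResponseFormat::GithubIssue, ResponseFormat::GithubPullRequest,
        ResponseFormat::TwitterTweet, ResponseFormat::StackoverflowQa,
        ResponseFormat::PackageRegistry, ResponseFormat::Wikipedia,
        ResponseFormat::YoutubeVideo, ResponseFormat::ArxivPaper,
        ResponseFormat::Hackernews, ResponseFormat::RssFeed,
        ResponseFormat::Documentation,
    ];

    /// The wire string exactly as written in the spec.
    fn as_str(self) -> &'static str {
        match self {
            ResponseFormat::Markdown => "markdown",
            ResponseFormat::Text => "text",
            ResponseFormat::Raw => "raw",
            ResponseFormat::GithubRepo => "github_repo",
            ResponseFormat::GithubFile => "github_file",
            ResponseFormat::GithubIssue => "github_issue",
            ResponseFormat::GithubPullRequest => "github_pull_request",
            ResponseFormat::TwitterTweet => "twitter_tweet",
            ResponseFormat::StackoverflowQa => "stackoverflow_qa",
            ResponseFormat::PackageRegistry => "package_registry",
            ResponseFormat::Wikipedia => "wikipedia",
            ResponseFormat::YoutubeVideo => "youtube_video",
            ResponseFormat::ArxivPaper => "arxiv_paper",
            ResponseFormat::Hackernews => "hackernews",
            ResponseFormat::RssFeed => "rss_feed",
            ResponseFormat::Documentation => "documentation",
        }
    }
}

impl FromStr for ResponseFormat {
    type Err = String;

    /// Looks the string up among the known values; unknown strings are an error.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        ResponseFormat::ALL
            .iter()
            .copied()
            .find(|f| f.as_str() == s)
            .ok_or_else(|| format!("unknown format value: {}", s))
    }
}
```

Keeping the strings in a single `match` means a value added to the spec later shows up as a compile error in `as_str` once the variant is added, rather than as a silent fallthrough.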

 ### Configuration

@@ -127,9 +200,19 @@ crates/fetchkit/src/
 ├── file_saver.rs # FileSaver trait, LocalFileSaver, SaveResult, FileSaveError
 ├── fetchers/
 │ ├── mod.rs # Fetcher trait, FetcherRegistry
+│ ├── arxiv.rs # ArXivFetcher
 │ ├── default.rs # DefaultFetcher (with binary-aware fetch_to_file override)
+│ ├── docs_site.rs # DocsSiteFetcher
+│ ├── github_code.rs # GitHubCodeFetcher
+│ ├── github_issue.rs # GitHubIssueFetcher
 │ ├── github_repo.rs # GitHubRepoFetcher
-│ └── twitter.rs # TwitterFetcher
+│ ├── hackernews.rs # HackerNewsFetcher
+│ ├── package_registry.rs # PackageRegistryFetcher
+│ ├── rss_feed.rs # RSSFeedFetcher
+│ ├── stackoverflow.rs # StackOverflowFetcher
+│ ├── twitter.rs # TwitterFetcher
+│ ├── wikipedia.rs # WikipediaFetcher
+│ └── youtube.rs # YouTubeFetcher
 ```

 ## API

specs/initial.md

Lines changed: 3 additions & 3 deletions
@@ -137,10 +137,10 @@ Provide a builder to configure tool options, including:

 ### MCP Server

-- Expose a single `fetchkit` tool over MCP.
-- Input schema: `{ url: string }` (required).
+- Expose a single `web_fetch` tool over MCP.
+- Input schema: derived from `FetchRequest` via tool builder (disabled options omitted).
 - Output: Markdown with YAML frontmatter (same format as CLI `--output md`).
-- Tool description: "Fetch URL and return markdown with metadata frontmatter. Optimized for LLM consumption."
+- Tool description: "Fetch URL content as text or markdown; return metadata for binary responses or save bytes to file."

 ### Python Bindings

