Description
Context
Spike PR: #88 (commit 5 of 7)
Merge order: 5/7 — Depends on: #91 (enrichment filter/clear API), #92 (query download/fulltext endpoints)
Reference branch: spike/validate-docker-compose-stack — commit b5666dd
Overview
Major CLI expansion with new utility modules, a new collections command, comprehensive query command enhancements (summaries, downloads, multi-collection search, deduplication), and enrichment command updates (filter, clear, rename).
Detailed Specification
1. New Utility: Environment Loading (cli/src/lib/env.ts)
loadDotEnvFromCwd(cwd?: string): Promise<void>
- Reads `.env` from `cwd` (defaults to `process.cwd()`)
- Parses `KEY=value` with: quoted values (single/double with escapes), `export` prefix stripping, inline comments (after `#`)
- Never overrides existing `process.env` values
- Silently skips if `.env` doesn't exist
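The parsing rules above can be sketched as a pure function plus a non-overriding merge step. This is a minimal illustration, not the code in `cli/src/lib/env.ts`: the names `parseDotEnv` and `applyEnv` are hypothetical, and the exact escape handling (here, `\n` inside double quotes becomes a newline and any other escaped character is taken literally) is an assumption.

```typescript
type Env = Record<string, string | undefined>;

/** Parse .env text into key/value pairs (hypothetical helper). */
export function parseDotEnv(text: string): Map<string, string> {
  const out = new Map<string, string>();
  for (const rawLine of text.split(/\r?\n/)) {
    let line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue; // blank or full-line comment
    line = line.replace(/^export\s+/, ""); // strip optional `export` prefix
    const eq = line.indexOf("=");
    if (eq <= 0) continue; // no key present
    const key = line.slice(0, eq).trim();
    let value = line.slice(eq + 1).trim();
    const quote = value.charAt(0);
    if (quote === '"' || quote === "'") {
      // Quoted value: read up to the closing quote, honoring backslash escapes.
      let result = "";
      for (let i = 1; i < value.length; i++) {
        const ch = value.charAt(i);
        if (ch === "\\" && i + 1 < value.length) {
          const next = value.charAt(i + 1);
          // Assumption: only \n is translated, and only inside double quotes.
          result += quote === '"' && next === "n" ? "\n" : next;
          i++;
        } else if (ch === quote) {
          break; // anything after the closing quote is ignored
        } else {
          result += ch;
        }
      }
      value = result;
    } else {
      // Unquoted value: an inline comment starts at whitespace followed by '#'.
      const hash = value.search(/\s#/);
      if (hash !== -1) value = value.slice(0, hash).trim();
    }
    out.set(key, value);
  }
  return out;
}

/** Copy parsed values into env without overriding keys that are already set. */
export function applyEnv(parsed: Map<string, string>, env: Env): void {
  for (const [key, value] of parsed) {
    if (env[key] === undefined) env[key] = value;
  }
}
```

Passing the environment in as a parameter (rather than touching `process.env` directly) keeps the sketch easy to test; a caller would pass `process.env`.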
getDefaultApiUrl(): string
Precedence: RAGED_URL → API_HOST_PORT (if numeric: http://localhost:{port}) → http://localhost:8080
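A small sketch of this precedence chain, assuming that "numeric" means digits only and that empty or whitespace-only values count as unset. The function name `resolveApiUrl` and the `env` parameter are illustrative; the spec's `getDefaultApiUrl()` takes no arguments and presumably reads `process.env`.

```typescript
type Env = Record<string, string | undefined>;

/** Resolve the API base URL from an environment-like object (hypothetical helper). */
export function resolveApiUrl(env: Env): string {
  const explicit = env["RAGED_URL"]?.trim();
  if (explicit) return explicit; // 1. explicit URL wins

  const port = env["API_HOST_PORT"]?.trim();
  // 2. Assumption: "numeric" means one or more ASCII digits.
  if (port && /^\d+$/.test(port)) return `http://localhost:${port}`;

  return "http://localhost:8080"; // 3. built-in default
}
```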
2. New Utility: URL Validation (cli/src/lib/url-check.ts)
checkUrl(url, apiKey, baseUrl?, model?): Promise<UrlCheckResult>
- Fetches URL (10s timeout, user-agent header)
- Skips binary content types (images, video, audio, archives)
- Strips HTML tags (removes scripts, styles, nav, footer)
- Requires >30 chars of text content
- Calls OpenAI chat completion to classify as "meaningful" or not
- Filters: login walls, cookie consent, empty pages, paywalls, error pages, redirect pages, <50 words
- Default model: `gpt-4o-mini`
- Graceful degradation: defaults to `meaningful=true` if OpenAI response unparseable
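The local checks that run before the OpenAI call (content-type skip, tag stripping, minimum length) might look roughly like the sketch below. All function names are hypothetical, the list of archive MIME types is an assumed sample, and the regex-based stripping is a heuristic rather than a real HTML parser.

```typescript
const BINARY_PREFIXES = ["image/", "video/", "audio/"];
// Assumption: a representative sample of archive types; the real list may differ.
const ARCHIVE_TYPES = [
  "application/zip",
  "application/gzip",
  "application/x-tar",
  "application/x-7z-compressed",
];

/** True when the Content-Type header indicates content that should be skipped. */
export function isBinaryContentType(contentType: string): boolean {
  const type = contentType.split(";")[0]?.trim().toLowerCase() ?? "";
  return BINARY_PREFIXES.some((p) => type.startsWith(p)) || ARCHIVE_TYPES.includes(type);
}

/** Drop script/style/nav/footer blocks, strip remaining tags, collapse whitespace. */
export function extractText(html: string): string {
  return html
    .replace(/<(script|style|nav|footer)\b[^>]*>[\s\S]*?<\/\1>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

/** The spec requires more than 30 characters of text content. */
export function hasEnoughText(text: string, minChars = 30): boolean {
  return text.length > minChars;
}
```

Running these checks first avoids spending an API call on pages that can be rejected locally.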
checkUrls(urls, apiKey, ...): Promise<UrlCheckResult[]>
Batch processing with configurable concurrency (default 5), progress logging.
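One common way to get bounded concurrency with progress reporting is a fixed pool of workers pulling from a shared index. The helper below is a generic sketch under that assumption; `mapWithConcurrency` and its `onProgress` callback are hypothetical names, not the actual `checkUrls` internals.

```typescript
/** Run `worker` over `items` with at most `limit` calls in flight; results keep input order. */
export async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
  onProgress?: (done: number, total: number) => void,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0; // index of the next unclaimed item
  let done = 0;

  async function run(): Promise<void> {
    while (next < items.length) {
      const index = next++; // claim synchronously, before any await
      results[index] = await worker(items[index] as T, index);
      done++;
      onProgress?.(done, items.length);
    }
  }

  const poolSize = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: poolSize }, () => run()));
  return results;
}
```

With the spec's default, a call would pass `5` as `limit` and a `checkUrl`-style function as `worker`. Note that a rejected worker rejects the whole batch here; per-URL error capture would need to happen inside `worker`.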
3. New Command: raged collections
File: cli/src/commands/collections.ts
`raged collections [--api <url>] [--token <token>] [--json]`
- Fetches from `GET /collections`
- Text output: `{name} docs={N} chunks={N} enriched={N}`
- JSON output: `{ collections: [...] }`
API client method: getCollections(api, token?): Promise<CollectionStats[]>
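For illustration, the two output modes could be produced by a formatter like the one below. The mapping from `CollectionStats` fields to the `docs`/`chunks`/`enriched` columns is inferred from the type in section 8, and the function name and pretty-printed JSON are assumptions.

```typescript
interface CollectionStats {
  collection: string;
  documentCount: number;
  chunkCount: number;
  enrichedChunkCount: number;
  lastSeenAt: string | null;
}

/** Render collection stats as text lines or as a JSON document (hypothetical helper). */
export function formatCollections(stats: CollectionStats[], json: boolean): string {
  if (json) return JSON.stringify({ collections: stats }, null, 2);
  return stats
    .map(
      (s) =>
        `${s.collection} docs=${s.documentCount} chunks=${s.chunkCount} enriched=${s.enrichedChunkCount}`,
    )
    .join("\n");
}
```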
4. Query Command Enhancements (cli/src/commands/query.ts)
4.1 Positional Query Support
raged query invoice INV89909018 # positional args joined
raged query --q "invoice" # flag-based (takes precedence)
4.2 New Flags
| Flag | Type | Default | Description |
|---|---|---|---|
| `--minScore <n\|auto>` | string | `"auto"` | Auto-calculates or accepts 0-1 value |
| `--summary [level]` | optional string | `"medium"` when flag present | Show document summary (short/medium/long) |
| `--keywords` | boolean | false | Extract keywords from tier2/tier3 metadata |
| `--unique` | boolean | false | Deduplicate by `payloadChecksum` |
| `--collections <names>` | string | none | Comma-separated collection names for multi-search |
| `--allCollections` | boolean | false | Auto-discover all collections, search all |
| `--full` | boolean | false | Download extracted text to `~/Downloads/{source}.txt` |
| `--stdout` | boolean | false | With `--full`: print to stdout instead of file |
| `--download` | boolean | false | Download raw payload to `~/Downloads/` |
| `--open` | boolean | false | Open first result (URL or downloaded file) |
4.3 Multi-Collection Search
When --collections a,b,c or --allCollections:
- Query each collection independently
- Merge results into single list, each tagged with `collection`
- Re-sort by score (descending)
- Apply `--unique` deduplication (by `payloadChecksum`, keeps highest score)
- Apply `--topK` limit after merge
--allCollections calls GET /collections, falls back to ["docs"] if empty.
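The merge pipeline described in this section can be sketched as a single pure function. The `Hit` shape and the name `mergeResults` are illustrative, and keeping hits that have no `payloadChecksum` during deduplication is an assumption the spec does not address.

```typescript
interface Hit {
  score: number;
  source: string;
  payloadChecksum?: string;
}
type TaggedHit = Hit & { collection: string };

/** Merge per-collection results: tag, sort by score, optionally dedupe, then apply topK. */
export function mergeResults(
  perCollection: Record<string, Hit[]>,
  opts: { unique: boolean; topK: number },
): TaggedHit[] {
  const merged: TaggedHit[] = [];
  for (const [collection, hits] of Object.entries(perCollection)) {
    for (const hit of hits) merged.push({ ...hit, collection });
  }
  merged.sort((a, b) => b.score - a.score); // descending

  let result = merged;
  if (opts.unique) {
    // The list is already sorted, so the first hit seen per checksum has the highest score.
    const seen = new Set<string>();
    result = merged.filter((hit) => {
      if (hit.payloadChecksum === undefined) return true; // assumption: keep unchecksummed hits
      if (seen.has(hit.payloadChecksum)) return false;
      seen.add(hit.payloadChecksum);
      return true;
    });
  }
  return result.slice(0, opts.topK); // limit applies after merge and dedupe
}
```

Deduplicating before applying `topK` matters: trimming first could leave fewer than `topK` distinct results even when more are available.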
4.4 Download Behavior
--full: Calls POST /query/fulltext-first, saves to ~/Downloads/{source-basename}.txt
--download: Calls POST /query/download-first, saves to ~/Downloads/{filename}
Conflict resolution: file (1).ext, file (2).ext, etc.
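The numbering scheme can be expressed as a pure function that takes an existence check, which keeps it testable without touching the filesystem. The name `resolveConflict` and the `exists` callback are hypothetical; the treatment of dotfiles (no extension split when the only dot is the first character) is an assumption.

```typescript
/** Return `fileName`, or the first free `stem (n).ext` variant if it is taken. */
export function resolveConflict(fileName: string, exists: (name: string) => boolean): string {
  if (!exists(fileName)) return fileName;
  const dot = fileName.lastIndexOf(".");
  // Assumption: a leading dot (e.g. ".env") is part of the name, not an extension separator.
  const stem = dot > 0 ? fileName.slice(0, dot) : fileName;
  const ext = dot > 0 ? fileName.slice(dot) : "";
  for (let n = 1; ; n++) {
    const candidate = `${stem} (${n})${ext}`;
    if (!exists(candidate)) return candidate;
  }
}
```

In the CLI, `exists` would typically check the target directory, for example with `fs.existsSync` on the joined path.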
--open: For HTTP(S) sources → opens URL directly. For files → downloads to {tmpdir}/raged-open/{filename} then opens. Cross-platform: open (macOS), xdg-open (Linux), cmd /c start (Windows).
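A sketch of the platform dispatch, assuming Node's `process.platform` values (`darwin`, `win32`, anything else treated as Linux-like). The function names are hypothetical. The empty string passed to `start` is a common precaution because `start` treats its first quoted argument as a window title; whether the actual implementation does this is not stated in the spec.

```typescript
interface OpenCommand {
  command: string;
  args: string[];
}

/** Pick the OS-specific opener for a URL or file path (hypothetical helper). */
export function openerCommand(platform: string, target: string): OpenCommand {
  switch (platform) {
    case "darwin":
      return { command: "open", args: [target] };
    case "win32":
      // Assumption: empty title argument so `start` does not misread the target.
      return { command: "cmd", args: ["/c", "start", "", target] };
    default:
      return { command: "xdg-open", args: [target] };
  }
}

/** HTTP(S) sources are opened directly; everything else is downloaded first. */
export function isHttpSource(source: string): boolean {
  return /^https?:\/\//i.test(source);
}
```

The returned pair is shaped so it can be handed to `child_process.spawn(command, args)`.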
4.5 Summary/Keywords Display
Summary: Selects best available: docSummary{Level} → docSummaryMedium → docSummaryShort → docSummary
Keywords: Checks tier3Meta.keywords → tier3Meta.key_entities → tier2Meta.keywords (handles both string[] and {text: string}[] formats)
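Both fallback chains can be sketched as follows. The `Payload` shape is inferred from the field names above; `docSummaryLong` is assumed to exist because the levels are short/medium/long, and the function names are hypothetical.

```typescript
type Level = "short" | "medium" | "long";

interface Payload {
  docSummary?: string;
  docSummaryShort?: string;
  docSummaryMedium?: string;
  docSummaryLong?: string; // assumption: implied by docSummary{Level}
  tier2Meta?: { keywords?: unknown };
  tier3Meta?: { keywords?: unknown; key_entities?: unknown };
}

/** Requested level first, then medium, short, and finally the generic summary. */
export function pickSummary(p: Payload, level: Level): string | undefined {
  const byLevel: Record<Level, string | undefined> = {
    short: p.docSummaryShort,
    medium: p.docSummaryMedium,
    long: p.docSummaryLong,
  };
  const candidates = [byLevel[level], p.docSummaryMedium, p.docSummaryShort, p.docSummary];
  return candidates.find((s) => typeof s === "string" && s.trim() !== "");
}

/** Accept both string[] and {text: string}[]; ignore anything else. */
function normalizeKeywords(value: unknown): string[] {
  if (!Array.isArray(value)) return [];
  const out: string[] = [];
  for (const item of value) {
    if (typeof item === "string") {
      out.push(item);
    } else if (item && typeof item === "object" && typeof (item as { text?: unknown }).text === "string") {
      out.push((item as { text: string }).text);
    }
  }
  return out;
}

/** First non-empty source wins: tier3 keywords, tier3 key_entities, tier2 keywords. */
export function pickKeywords(p: Payload): string[] {
  for (const candidate of [p.tier3Meta?.keywords, p.tier3Meta?.key_entities, p.tier2Meta?.keywords]) {
    const keywords = normalizeKeywords(candidate);
    if (keywords.length > 0) return keywords;
  }
  return [];
}
```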
4.6 API Client Changes
// Modified: added minScore parameter
export async function query(api, collection, q, topK, minScore, filter?, token?)
// New methods:
export async function downloadFirstQueryMatch(api, collection, q, topK, minScore, filter?, token?)
→ { data: Buffer, fileName: string, source: string, mimeType: string }
export async function downloadFirstQueryMatchText(api, collection, q, topK, minScore, filter?, token?)
→ { text: string, fileName: string, source: string }
Response header extraction: content-disposition → filename, x-raged-source → source, content-type → mimeType.
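The header extraction could be sketched as below. `HeaderBag` is a minimal stand-in for the Fetch API `Headers` object so the example stays self-contained; the function names, the fallback values, and support for the RFC 5987 `filename*=` form are assumptions, since the spec only names the three headers.

```typescript
/** Minimal subset of the Fetch `Headers` interface used here. */
interface HeaderBag {
  get(name: string): string | null;
}

/** Extract a file name from a Content-Disposition header value, if present. */
export function fileNameFromContentDisposition(value: string | null): string | undefined {
  if (!value) return undefined;
  // Extended form: filename*=UTF-8''percent-encoded-name
  const extended = /filename\*\s*=\s*[^']*'[^']*'([^;]+)/i.exec(value);
  if (extended?.[1]) {
    try {
      return decodeURIComponent(extended[1].trim());
    } catch {
      // Malformed encoding: fall through to the plain form.
    }
  }
  // Plain form: filename="name" or filename=name
  const plain = /filename\s*=\s*(?:"([^"]*)"|([^;]+))/i.exec(value);
  return (plain?.[1] ?? plain?.[2])?.trim() || undefined;
}

/** Map response headers to the fields returned by the download helpers. */
export function readDownloadHeaders(headers: HeaderBag, fallbackName = "download") {
  return {
    fileName: fileNameFromContentDisposition(headers.get("content-disposition")) ?? fallbackName,
    source: headers.get("x-raged-source") ?? "",
    mimeType: headers.get("content-type") ?? "application/octet-stream",
  };
}
```

A `fetch` response's `headers` property satisfies `HeaderBag`, so `readDownloadHeaders(response.headers)` would work unchanged.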
5. Enrich Command Updates (cli/src/commands/enrich.ts)
| Change | Old | New |
|---|---|---|
| Stats flag | `--stats-only` | `--stats` |
| Filter | (none) | `--filter <text>` |
| Clear | (none) | `--clear` |
--filter: Passed to stats endpoint as query param and to enqueue endpoint in body. Enables selective enrichment.
--clear: Calls POST /enrichment/clear with { collection, filter? }. Shows stats first, then clears.
API client updates:
- `getEnrichmentStats(api, collection?, filter?, token?)`
- `enqueueEnrichment(api, collection, force, filter?, token?)`
- `clearEnrichmentQueue(api, collection, filter?, token?)` (new)
6. Ingest Command Enhancements (cli/src/commands/ingest.ts)
| New Flag | Type | Description |
|---|---|---|
| `--urls-file <path>` | string | File with URLs (one per line, `#` comments) |
| `--url-check` | boolean | Validate URL meaningfulness via OpenAI before ingest |
| `--url-check-model` | string | OpenAI model for URL validation (default: `gpt-4o-mini`) |
| `--ignore <patterns>` | string | Comma-separated glob patterns to exclude |
| `--ignore-file <path>` | string | File with ignore patterns |
| `--batchSize <n>` | string | Files per batch (default: 10) |
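As an illustration of the `--urls-file` format, a parser for "one URL per line, `#` comments" might look like this. The name `parseUrlsFile` is hypothetical, and only full-line comments are handled; whether trailing comments on a URL line are supported is not specified.

```typescript
/** Parse a URLs file: one URL per line; blank lines and lines starting with '#' are skipped. */
export function parseUrlsFile(text: string): string[] {
  const urls: string[] = [];
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.trim();
    if (line === "" || line.startsWith("#")) continue;
    urls.push(line);
  }
  return urls;
}
```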
Batch error handling:
- HTTP 413 → split batch in half, retry both halves
- Other multi-file error → retry each file individually
- Single-file 413 → skip with advice message
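The three rules above map naturally onto a recursive retry. The sketch below assumes the upload function rejects with an error object carrying a numeric `status` property; that error shape, the names `ingestWithRetry` and `Upload`, and rethrowing non-413 errors for a single file are all assumptions.

```typescript
type Upload = (files: string[]) => Promise<void>;

/** Assumption: HTTP failures surface as errors with a numeric `status` property. */
function isPayloadTooLarge(err: unknown): boolean {
  return typeof err === "object" && err !== null && (err as { status?: unknown }).status === 413;
}

/** Upload a batch, shrinking it on failure. Returns the files skipped as too large. */
export async function ingestWithRetry(
  files: string[],
  upload: Upload,
  skipped: string[] = [],
): Promise<string[]> {
  if (files.length === 0) return skipped;
  try {
    await upload(files);
  } catch (err) {
    const tooLarge = isPayloadTooLarge(err);
    const [only] = files;
    if (files.length === 1 && only !== undefined) {
      if (!tooLarge) throw err; // assumption: other single-file errors propagate
      skipped.push(only); // single-file 413: skip (the CLI also prints advice)
    } else if (tooLarge) {
      // Multi-file 413: split in half and retry both halves.
      const mid = Math.ceil(files.length / 2);
      await ingestWithRetry(files.slice(0, mid), upload, skipped);
      await ingestWithRetry(files.slice(mid), upload, skipped);
    } else {
      // Other multi-file error: retry each file on its own.
      for (const file of files) await ingestWithRetry([file], upload, skipped);
    }
  }
  return skipped;
}
```

Halving on 413 converges quickly when one oversized file is the cause, since the batches that do not contain it go through after a few splits.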
Directory filtering: --doc-type with --dir filters during scan (before --maxFiles).
Source derivation: deriveIngestSource(filePath, { rootDir?, singleFile? }) — relative path for directories, basename for single files.
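A simplified sketch of the derivation using plain string handling on forward-slash paths. The real `deriveIngestSource` in `cli/src/lib/utils.ts` may well use Node's `path` module instead; falling back to the basename when no `rootDir` is given, or when the file lies outside it, is an assumption.

```typescript
/** Normalize Windows separators so the logic below only deals with '/'. */
function toPosix(p: string): string {
  return p.replace(/\\/g, "/");
}

/** Relative path under rootDir for directory ingests, basename for single files. */
export function deriveIngestSource(
  filePath: string,
  opts: { rootDir?: string; singleFile?: boolean } = {},
): string {
  const file = toPosix(filePath);
  const base = file.slice(file.lastIndexOf("/") + 1);
  if (opts.singleFile || !opts.rootDir) return base;
  const root = toPosix(opts.rootDir).replace(/\/+$/, "") + "/";
  // Assumption: files outside rootDir fall back to the basename.
  return file.startsWith(root) ? file.slice(root.length) : base;
}
```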
7. CLI Initialization (cli/src/index.ts)
- Calls `loadDotEnvFromCwd()` at startup
- Registers `collections` command via `registerCollectionsCommand(program)`
8. Type Additions (cli/src/lib/types.ts)
interface CollectionStats {
  collection: string;
  documentCount: number;
  chunkCount: number;
  enrichedChunkCount: number;
  lastSeenAt: string | null;
}
Payload type additions for query results: payloadChecksum, docSummary*, tier2Meta, tier3Meta.
Files (19 files)
| File | Status | Purpose |
|---|---|---|
| `cli/src/lib/env.ts` | New | .env loading, API URL resolution |
| `cli/src/lib/env.test.ts` | New | Tests for env parsing |
| `cli/src/lib/url-check.ts` | New | URL meaningfulness validation |
| `cli/src/lib/url-check.test.ts` | New | Tests for URL checking |
| `cli/src/lib/api-client.ts` | Modified | New API methods, minScore param |
| `cli/src/lib/types.ts` | Modified | CollectionStats, payload types |
| `cli/src/lib/utils.ts` | Modified | deriveIngestSource, listFiles enhancements |
| `cli/src/lib/utils.test.ts` | Modified | Tests for new utilities |
| `cli/src/commands/collections.ts` | New | Collections command |
| `cli/src/commands/collections.test.ts` | New | Collections tests |
| `cli/src/commands/query.ts` | Modified | All query enhancements |
| `cli/src/commands/query.test.ts` | Modified | Extensive query tests |
| `cli/src/commands/enrich.ts` | Modified | Filter, clear, stats rename |
| `cli/src/commands/enrich.test.ts` | Modified | Enrich tests |
| `cli/src/commands/ingest.ts` | Modified | Batch, URL, ignore enhancements |
| `cli/src/commands/ingest.test.ts` | Modified | Ingest tests |
| `cli/src/commands/graph.ts` | Modified | Minor adjustments |
| `cli/src/commands/index.ts` | Modified | Register collections command |
| `cli/src/index.ts` | Modified | .env loading at startup |
Acceptance Criteria
- `raged collections` lists collections with correct stats
- `raged query "text" --summary medium` displays document summaries
- `raged query --download` downloads raw payload to ~/Downloads
- `raged query --full` downloads extracted text; `--stdout` prints to terminal
- `raged query --open` opens URL or temp file
- `raged query --unique` deduplicates cross-collection results by checksum
- `raged query --allCollections` auto-discovers and searches all collections
- `raged enrich --filter "invoice"` filters enrichment targets
- `raged enrich --clear` removes queued tasks
- `raged ingest --urls-file urls.txt` bulk ingests URLs
- `raged ingest --ignore "tmp/**"` skips matching files
- `.env` file loading works at CLI startup
- All tests pass