feat(cli): Comprehensive CLI enhancements #93

@mfittko

Context

Spike PR: #88 (commit 5 of 7)
Merge order: 5/7 — Depends on: #91 (enrichment filter/clear API), #92 (query download/fulltext endpoints)
Reference branch: spike/validate-docker-compose-stack — commit b5666dd


Overview

Major CLI expansion with new utility modules, a new collections command, comprehensive query command enhancements (summaries, downloads, multi-collection search, deduplication), and enrichment command updates (filter, clear, rename).


Detailed Specification

1. New Utility: Environment Loading (cli/src/lib/env.ts)

loadDotEnvFromCwd(cwd?: string): Promise<void>

  • Reads .env from cwd (defaults to process.cwd())
  • Parses KEY=value with: quoted values (single/double with escapes), export prefix stripping, inline comments (after #)
  • Never overrides existing process.env values
  • Silently skips if .env doesn't exist
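The parsing rules above can be sketched as a pure function. This is a minimal illustration, not the code in cli/src/lib/env.ts: the names parseDotEnv and applyEnv are hypothetical, and the exact escape and comment handling (for example, requiring whitespace before an inline #) is an assumption.

```typescript
// Hypothetical helper: parse the text of a .env file into key/value pairs.
export function parseDotEnv(text: string): Record<string, string> {
  const out: Record<string, string> = {};
  for (const rawLine of text.split(/\r?\n/)) {
    let line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue; // blank or full-line comment
    if (line.startsWith("export ")) line = line.slice("export ".length).trim();
    const eq = line.indexOf("=");
    if (eq <= 0) continue; // not a KEY=value line
    const key = line.slice(0, eq).trim();
    let value = line.slice(eq + 1).trim();
    const quote = value.charAt(0);
    if (quote === '"' || quote === "'") {
      // Quoted value: read up to the matching unescaped quote.
      // Anything after the closing quote (such as a comment) is dropped.
      let result = "";
      for (let i = 1; i < value.length && value.charAt(i) !== quote; i++) {
        if (value.charAt(i) === "\\" && i + 1 < value.length) {
          i++;
          const c = value.charAt(i);
          // Assumption: \n becomes a newline only inside double quotes.
          result += quote === '"' && c === "n" ? "\n" : c;
        } else {
          result += value.charAt(i);
        }
      }
      value = result;
    } else {
      // Unquoted value: assume an inline comment starts at whitespace + "#".
      const hash = value.search(/\s#/);
      if (hash >= 0) value = value.slice(0, hash).trimEnd();
    }
    out[key] = value;
  }
  return out;
}

// Copy parsed values into env without overriding anything already set.
export function applyEnv(
  parsed: Record<string, string>,
  env: Record<string, string | undefined>,
): void {
  for (const key of Object.keys(parsed)) {
    if (env[key] === undefined) env[key] = parsed[key];
  }
}
```

In the CLI, the text would presumably come from reading .env under cwd, with a missing file treated as a no-op, and env would be process.env.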

getDefaultApiUrl(): string

Precedence: RAGED_URL → API_HOST_PORT (if numeric: http://localhost:{port}) → http://localhost:8080
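A sketch of that precedence chain, assuming the two environment variables are RAGED_URL and API_HOST_PORT (the separator between them appears to have been lost in the line above). The function name resolveApiUrl and the explicit env parameter are hypothetical; the real getDefaultApiUrl() takes no arguments and presumably reads process.env.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical stand-in for getDefaultApiUrl(); env is passed in for testability.
export function resolveApiUrl(env: Env): string {
  // 1. An explicit URL wins (variable name assumed to be RAGED_URL).
  const explicit = env["RAGED_URL"]?.trim();
  if (explicit) return explicit;

  // 2. A numeric port maps to localhost (variable name assumed to be API_HOST_PORT).
  const port = env["API_HOST_PORT"]?.trim();
  if (port && /^\d+$/.test(port)) return `http://localhost:${port}`;

  // 3. Fallback default from the spec.
  return "http://localhost:8080";
}
```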

2. New Utility: URL Validation (cli/src/lib/url-check.ts)

checkUrl(url, apiKey, baseUrl?, model?): Promise<UrlCheckResult>

  • Fetches URL (10s timeout, user-agent header)
  • Skips binary content types (images, video, audio, archives)
  • Strips HTML tags (removes scripts, styles, nav, footer)
  • Requires >30 chars of text content
  • Calls OpenAI chat completion to classify as "meaningful" or not
  • Filters: login walls, cookie consent, empty pages, paywalls, error pages, redirect pages, <50 words
  • Default model: gpt-4o-mini
  • Graceful degradation: defaults to meaningful=true if OpenAI response unparseable
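The local pre-filter steps (content-type skip, tag stripping, minimum length) and the fail-open parsing of the classifier reply could look roughly like this. All function names are hypothetical, the archive MIME list is illustrative, and the reply shape { "meaningful": boolean } is an assumption; the network fetch and the OpenAI call are deliberately left out.

```typescript
const BINARY_PREFIXES = ["image/", "video/", "audio/"];
// Illustrative list only; the real set of archive types is not given in the spec.
const ARCHIVE_TYPES = ["application/zip", "application/gzip", "application/x-tar"];

export function isBinaryContentType(contentType: string): boolean {
  const type = (contentType.split(";")[0] ?? "").trim().toLowerCase();
  return BINARY_PREFIXES.some((p) => type.startsWith(p)) || ARCHIVE_TYPES.indexOf(type) !== -1;
}

export function htmlToText(html: string): string {
  return html
    // Drop whole script/style/nav/footer elements, including their content.
    .replace(/<(script|style|nav|footer)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, " ")
    // Drop the remaining tags but keep their text.
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// The spec requires more than 30 characters of text content.
export function hasEnoughText(text: string, minChars = 30): boolean {
  return text.length > minChars;
}

// Assumed reply shape: {"meaningful": true|false}. Anything else counts as meaningful.
export function parseVerdict(raw: string): boolean {
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed === "object" && parsed !== null) {
      const verdict = (parsed as { meaningful?: unknown }).meaningful;
      if (typeof verdict === "boolean") return verdict;
    }
  } catch {
    // Unparseable reply: fall through to the permissive default.
  }
  return true;
}
```

Failing open keeps a flaky or malformed classifier reply from silently dropping URLs from an ingest run.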

checkUrls(urls, apiKey, ...): Promise<UrlCheckResult[]>

Batch processing with configurable concurrency (default 5), progress logging.
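One common way to get bounded concurrency is a small worker pool that pulls from a shared index. The sketch below is generic and does not claim to match checkUrls internally; mapWithConcurrency and its onProgress callback are hypothetical names.

```typescript
// Run worker over items with at most `limit` calls in flight; results keep input order.
export async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
  onProgress?: (done: number, total: number) => void,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item
  let done = 0; // number of completed items

  async function run(): Promise<void> {
    while (next < items.length) {
      const i = next++; // claim an item; safe because JS runs this synchronously
      results[i] = await worker(items[i] as T, i);
      done++;
      onProgress?.(done, items.length);
    }
  }

  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => run()));
  return results;
}
```

A batch URL check would then be something like mapWithConcurrency(urls, 5, (u) => checkUrl(u, apiKey)), with the progress callback doing the logging.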

3. New Command: raged collections

File: cli/src/commands/collections.ts

raged collections [--api <url>] [--token <token>] [--json]
  • Fetches from GET /collections
  • Text output: {name} docs={N} chunks={N} enriched={N}
  • JSON output: { collections: [...] }

API client method: getCollections(api, token?): Promise<CollectionStats[]>
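A sketch of the two output modes, using the CollectionStats shape from section 8. The mapping of documentCount, chunkCount, and enrichedChunkCount onto docs=, chunks=, and enriched= is an assumption, as are the function name formatCollections and the JSON indentation.

```typescript
interface CollectionStats {
  collection: string;
  documentCount: number;
  chunkCount: number;
  enrichedChunkCount: number;
  lastSeenAt: string | null;
}

// Hypothetical formatter for the collections command output.
export function formatCollections(stats: CollectionStats[], json: boolean): string {
  if (json) return JSON.stringify({ collections: stats }, null, 2);
  return stats
    .map(
      (s) =>
        `${s.collection} docs=${s.documentCount} chunks=${s.chunkCount} enriched=${s.enrichedChunkCount}`,
    )
    .join("\n");
}
```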

4. Query Command Enhancements (cli/src/commands/query.ts)

4.1 Positional Query Support

raged query invoice INV89909018   # positional args joined
raged query --q "invoice"          # flag-based (takes precedence)
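The precedence between the two forms can be expressed in a few lines. resolveQueryText is a hypothetical name, and joining positional arguments with a single space is an assumption based on the example above.

```typescript
// Returns the query text, preferring --q over positional arguments.
export function resolveQueryText(positional: string[], flagQ?: string): string | undefined {
  const fromFlag = flagQ?.trim();
  if (fromFlag) return fromFlag;
  const joined = positional.join(" ").trim();
  return joined === "" ? undefined : joined;
}
```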

4.2 New Flags

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| --minScore <n\|auto> | string | "auto" | Auto-calculates the threshold, or accepts a value between 0 and 1 |
| --summary [level] | optional string | "medium" when flag present | Show document summary (short/medium/long) |
| --keywords | boolean | false | Extract keywords from tier2/tier3 metadata |
| --unique | boolean | false | Deduplicate by payloadChecksum |
| --collections <names> | string | | Comma-separated collection names for multi-search |
| --allCollections | boolean | false | Auto-discover all collections, search all |
| --full | boolean | false | Download extracted text to ~/Downloads/{source}.txt |
| --stdout | boolean | false | With --full: print to stdout instead of file |
| --download | boolean | false | Download raw payload to ~/Downloads/ |
| --open | boolean | false | Open first result (URL or downloaded file) |
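Because --minScore arrives as a string, it needs validating before it reaches the API client. A minimal sketch, assuming out-of-range or non-numeric input should be rejected rather than clamped; parseMinScore is a hypothetical name, and how "auto" is later turned into a threshold is not covered here.

```typescript
export type MinScore = "auto" | number;

export function parseMinScore(raw: string | undefined): MinScore {
  if (raw === undefined) return "auto"; // flag omitted: spec default
  const text = raw.trim();
  if (text.toLowerCase() === "auto") return "auto";
  const value = Number(text);
  if (text === "" || !Number.isFinite(value) || value < 0 || value > 1) {
    throw new Error(`--minScore must be "auto" or a number between 0 and 1, got "${raw}"`);
  }
  return value;
}
```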

4.3 Multi-Collection Search

When --collections a,b,c or --allCollections:

  1. Query each collection independently
  2. Merge results into single list, each tagged with collection
  3. Re-sort by score (descending)
  4. Apply --unique deduplication (by payloadChecksum, keeps highest score)
  5. Apply --topK limit after merge

--allCollections calls GET /collections, falls back to ["docs"] if empty.
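Steps 2 to 5 can be sketched as a single merge function. The Hit shape, the function names, and the decision to keep hits that have no payloadChecksum are assumptions; only the ordering of the steps comes from the list above.

```typescript
interface Hit {
  score: number;
  payload?: { payloadChecksum?: string };
}
type TaggedHit = Hit & { collection: string };

// --allCollections fallback: use ["docs"] when discovery returns nothing.
export function resolveCollections(discovered: string[]): string[] {
  return discovered.length > 0 ? discovered : ["docs"];
}

export function mergeResults(
  perCollection: Record<string, Hit[]>,
  topK: number,
  unique: boolean,
): TaggedHit[] {
  // Step 2: merge into one list, tagging each hit with its collection.
  const merged: TaggedHit[] = [];
  for (const name of Object.keys(perCollection)) {
    for (const hit of perCollection[name] ?? []) merged.push({ ...hit, collection: name });
  }
  // Step 3: re-sort by score, descending.
  merged.sort((a, b) => b.score - a.score);
  // Step 4: dedupe by checksum. The list is sorted, so the first hit seen
  // for a checksum is also the highest-scoring one.
  let out = merged;
  if (unique) {
    const seen = new Set<string>();
    out = merged.filter((hit) => {
      const sum = hit.payload?.payloadChecksum;
      if (sum === undefined) return true; // assumption: keep hits without a checksum
      if (seen.has(sum)) return false;
      seen.add(sum);
      return true;
    });
  }
  // Step 5: apply topK only after merging and deduplication.
  return out.slice(0, topK);
}
```

Applying topK last matters: trimming per collection first could discard a high-scoring hit from one collection in favor of weaker hits from another.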

4.4 Download Behavior

  • --full: calls POST /query/fulltext-first and saves to ~/Downloads/{source-basename}.txt
  • --download: calls POST /query/download-first and saves to ~/Downloads/{filename}
  • Conflict resolution: file (1).ext, file (2).ext, etc.
  • --open: for HTTP(S) sources, opens the URL directly; for files, downloads to {tmpdir}/raged-open/{filename} and then opens it. Cross-platform openers: open (macOS), xdg-open (Linux), cmd /c start (Windows).
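Two pieces of this behavior are easy to isolate: picking a non-conflicting file name and choosing the opener per platform. Both functions below are hypothetical sketches; the exists predicate stands in for a file system check, and the platform strings follow Node's process.platform values.

```typescript
// Returns fileName, or "stem (n).ext" with the smallest n >= 1 that is free.
export function nextFreeName(fileName: string, exists: (name: string) => boolean): string {
  if (!exists(fileName)) return fileName;
  const dot = fileName.lastIndexOf(".");
  const stem = dot > 0 ? fileName.slice(0, dot) : fileName;
  const ext = dot > 0 ? fileName.slice(dot) : "";
  for (let n = 1; ; n++) {
    const candidate = `${stem} (${n})${ext}`;
    if (!exists(candidate)) return candidate;
  }
}

// Maps a platform identifier to the opener command named in the spec.
export function openerFor(platform: string): { command: string; args: string[] } {
  if (platform === "darwin") return { command: "open", args: [] };
  if (platform === "win32") return { command: "cmd", args: ["/c", "start"] };
  return { command: "xdg-open", args: [] }; // Linux and other Unix-likes
}
```

The target path or URL would be appended to args before spawning the command.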

4.5 Summary/Keywords Display

Summary: Selects the best available field: docSummary{Level} → docSummaryMedium → docSummaryShort → docSummary
Keywords: Checks tier3Meta.keywords → tier3Meta.key_entities → tier2Meta.keywords (handles both string[] and {text: string}[] formats)
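Reading both lines as fallback chains, a sketch might look like the following. The function names are hypothetical, and it is an assumption that docSummary{Level} expands to docSummaryShort, docSummaryMedium, or docSummaryLong and that an empty list falls through to the next source.

```typescript
type Level = "short" | "medium" | "long";
type Payload = Record<string, unknown>;

// First non-empty summary in the chain, starting at the requested level.
export function pickSummary(payload: Payload, level: Level): string | undefined {
  const preferred = "docSummary" + level.charAt(0).toUpperCase() + level.slice(1);
  for (const key of [preferred, "docSummaryMedium", "docSummaryShort", "docSummary"]) {
    const value = payload[key];
    if (typeof value === "string" && value.trim() !== "") return value;
  }
  return undefined;
}

// Accepts string[] or {text: string}[] and returns plain strings.
function normalizeKeywords(list: unknown): string[] {
  if (!Array.isArray(list)) return [];
  const out: string[] = [];
  for (const item of list) {
    if (typeof item === "string") {
      out.push(item);
    } else if (typeof item === "object" && item !== null) {
      const text = (item as { text?: unknown }).text;
      if (typeof text === "string") out.push(text);
    }
  }
  return out;
}

// First non-empty keyword list in the chain.
export function pickKeywords(payload: Payload): string[] {
  const tier3 = (payload["tier3Meta"] ?? {}) as Record<string, unknown>;
  const tier2 = (payload["tier2Meta"] ?? {}) as Record<string, unknown>;
  for (const candidate of [tier3["keywords"], tier3["key_entities"], tier2["keywords"]]) {
    const words = normalizeKeywords(candidate);
    if (words.length > 0) return words;
  }
  return [];
}
```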

4.6 API Client Changes

// Modified: added minScore parameter
export async function query(api, collection, q, topK, minScore, filter?, token?)

// New methods:
export async function downloadFirstQueryMatch(api, collection, q, topK, minScore, filter?, token?)
//   resolves to { data: Buffer, fileName: string, source: string, mimeType: string }

export async function downloadFirstQueryMatchText(api, collection, q, topK, minScore, filter?, token?)
//   resolves to { text: string, fileName: string, source: string }

Response header extraction: content-disposition → filename, x-raged-source → source, content-type → mimeType.
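The header mapping could be implemented along these lines. The HeaderReader interface is a hypothetical stand-in for the fetch Headers object, and support for the RFC 5987 filename*= form is an assumption; the spec only names the three headers.

```typescript
// Minimal interface satisfied by the fetch Headers object.
interface HeaderReader {
  get(name: string): string | null;
}

export function fileNameFromContentDisposition(header: string | null): string | undefined {
  if (!header) return undefined;
  // Prefer the extended form: filename*=UTF-8''percent-encoded-name
  const extended = /filename\*\s*=\s*UTF-8''([^;]+)/i.exec(header);
  if (extended && extended[1]) {
    try {
      return decodeURIComponent(extended[1].trim());
    } catch {
      // Malformed encoding: fall back to the plain form below.
    }
  }
  // Plain form: filename="name" or filename=name
  const plain = /filename\s*=\s*(?:"([^"]*)"|([^;]+))/i.exec(header);
  const name = plain?.[1] ?? plain?.[2];
  return name?.trim() || undefined;
}

export function extractDownloadMeta(headers: HeaderReader): {
  fileName: string | undefined;
  source: string | undefined;
  mimeType: string | undefined;
} {
  return {
    fileName: fileNameFromContentDisposition(headers.get("content-disposition")),
    source: headers.get("x-raged-source") ?? undefined,
    mimeType: headers.get("content-type") ?? undefined,
  };
}
```

What the client falls back to when a header is missing is not stated in the spec, so the sketch simply returns undefined for that field.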

5. Enrich Command Updates (cli/src/commands/enrich.ts)

| Change | Old | New |
| --- | --- | --- |
| Stats flag | --stats-only | --stats |
| Filter | | --filter <text> |
| Clear | | --clear |

--filter: Passed to stats endpoint as query param and to enqueue endpoint in body. Enables selective enrichment.

--clear: Calls POST /enrichment/clear with { collection, filter? }. Shows stats first, then clears.

API client updates:

  • getEnrichmentStats(api, collection?, filter?, token?)
  • enqueueEnrichment(api, collection, force, filter?, token?)
  • clearEnrichmentQueue(api, collection, filter?, token?) (new)

6. Ingest Command Enhancements (cli/src/commands/ingest.ts)

| New Flag | Type | Description |
| --- | --- | --- |
| --urls-file <path> | string | File with URLs (one per line, # comments) |
| --url-check | boolean | Validate URL meaningfulness via OpenAI before ingest |
| --url-check-model | string | OpenAI model for URL validation (default: gpt-4o-mini) |
| --ignore <patterns> | string | Comma-separated glob patterns to exclude |
| --ignore-file <path> | string | File with ignore patterns |
| --batchSize <n> | string | Files per batch (default: 10) |
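The two list-style inputs are simple to parse. A minimal sketch with hypothetical function names; it assumes # marks whole-line comments only (a # inside a URL is a fragment, so it is left alone) and that empty entries in the comma-separated list are ignored.

```typescript
// --urls-file: one URL per line; blank lines and lines starting with # are skipped.
export function parseUrlsFile(text: string): string[] {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line !== "" && !line.startsWith("#"));
}

// --ignore: comma-separated glob patterns.
export function parseIgnorePatterns(raw: string): string[] {
  return raw
    .split(",")
    .map((pattern) => pattern.trim())
    .filter((pattern) => pattern !== "");
}
```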

Batch error handling:

  • HTTP 413 → split batch in half, retry both halves
  • Other multi-file error → retry each file individually
  • Single-file 413 → skip with advice message
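The three rules above fit a small recursive helper. This is a sketch under assumptions: ingestBatch and SendBatch are hypothetical names, errors are assumed to carry a numeric status field, and a single file failing with anything other than 413 is rethrown because the spec does not say what happens in that case.

```typescript
type SendBatch = (files: string[]) => Promise<void>;

// Assumes failed requests surface as errors carrying a numeric `status`.
function isPayloadTooLarge(err: unknown): boolean {
  return typeof err === "object" && err !== null && (err as { status?: unknown }).status === 413;
}

// Sends files, splitting or retrying on failure. Returns the files that were skipped.
export async function ingestBatch(
  files: string[],
  send: SendBatch,
  skipped: string[] = [],
): Promise<string[]> {
  if (files.length === 0) return skipped;
  try {
    await send(files);
  } catch (err) {
    const tooLarge = isPayloadTooLarge(err);
    if (files.length === 1) {
      if (!tooLarge) throw err; // assumption: other single-file errors propagate
      skipped.push(files[0] as string); // single-file 413: skip (the CLI would print advice)
    } else if (tooLarge) {
      // HTTP 413 on a multi-file batch: split in half and retry both halves.
      const mid = Math.ceil(files.length / 2);
      await ingestBatch(files.slice(0, mid), send, skipped);
      await ingestBatch(files.slice(mid), send, skipped);
    } else {
      // Any other multi-file error: retry each file on its own.
      for (const file of files) await ingestBatch([file], send, skipped);
    }
  }
  return skipped;
}
```

Halving converges quickly on the oversized file while still sending everything else in as few requests as possible.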

Directory filtering: --doc-type with --dir filters during scan (before --maxFiles).

Source derivation: deriveIngestSource(filePath, { rootDir?, singleFile? }) — relative path for directories, basename for single files.
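A sketch of the derivation rule using plain string handling. The real deriveIngestSource in cli/src/lib/utils.ts probably relies on Node's path module; this hypothetical deriveSource normalizes separators itself so the example has no dependencies, and it falls back to the basename when the file is not under rootDir (an assumption).

```typescript
// Normalize to forward slashes and drop trailing slashes.
function toPosix(p: string): string {
  return p.replace(/\\/g, "/").replace(/\/+$/, "");
}

export function deriveSource(
  filePath: string,
  opts: { rootDir?: string; singleFile?: boolean } = {},
): string {
  const file = toPosix(filePath);
  const base = file.slice(file.lastIndexOf("/") + 1);
  // Single-file ingest (or no root given): the basename is the source.
  if (opts.singleFile || !opts.rootDir) return base;
  // Directory ingest: path relative to the scanned root.
  const root = toPosix(opts.rootDir);
  return file.startsWith(root + "/") ? file.slice(root.length + 1) : base;
}
```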

7. CLI Initialization (cli/src/index.ts)

  • Calls loadDotEnvFromCwd() at startup
  • Registers collections command via registerCollectionsCommand(program)

8. Type Additions (cli/src/lib/types.ts)

interface CollectionStats {
  collection: string;
  documentCount: number;
  chunkCount: number;
  enrichedChunkCount: number;
  lastSeenAt: string | null;
}

Payload type additions for query results: payloadChecksum, docSummary*, tier2Meta, tier3Meta.


Files (19 files)

| File | Status | Purpose |
| --- | --- | --- |
| cli/src/lib/env.ts | New | .env loading, API URL resolution |
| cli/src/lib/env.test.ts | New | Tests for env parsing |
| cli/src/lib/url-check.ts | New | URL meaningfulness validation |
| cli/src/lib/url-check.test.ts | New | Tests for URL checking |
| cli/src/lib/api-client.ts | Modified | New API methods, minScore param |
| cli/src/lib/types.ts | Modified | CollectionStats, payload types |
| cli/src/lib/utils.ts | Modified | deriveIngestSource, listFiles enhancements |
| cli/src/lib/utils.test.ts | Modified | Tests for new utilities |
| cli/src/commands/collections.ts | New | Collections command |
| cli/src/commands/collections.test.ts | New | Collections tests |
| cli/src/commands/query.ts | Modified | All query enhancements |
| cli/src/commands/query.test.ts | Modified | Extensive query tests |
| cli/src/commands/enrich.ts | Modified | Filter, clear, stats rename |
| cli/src/commands/enrich.test.ts | Modified | Enrich tests |
| cli/src/commands/ingest.ts | Modified | Batch, URL, ignore enhancements |
| cli/src/commands/ingest.test.ts | Modified | Ingest tests |
| cli/src/commands/graph.ts | Modified | Minor adjustments |
| cli/src/commands/index.ts | Modified | Register collections command |
| cli/src/index.ts | Modified | .env loading at startup |

Acceptance Criteria

  • raged collections lists collections with correct stats
  • raged query "text" --summary medium displays document summaries
  • raged query --download downloads raw payload to ~/Downloads
  • raged query --full downloads extracted text; --stdout prints to terminal
  • raged query --open opens URL or temp file
  • raged query --unique deduplicates cross-collection results by checksum
  • raged query --allCollections auto-discovers and searches all collections
  • raged enrich --filter "invoice" filters enrichment targets
  • raged enrich --clear removes queued tasks
  • raged ingest --urls-file urls.txt bulk ingests URLs
  • raged ingest --ignore "tmp/**" skips matching files
  • .env file loading works at CLI startup
  • All tests pass
