
Add indexed query support for field existence, regex, and full-text search (FTS) while staying local-first #38

@mchurichi

Description


Summary

Implement Lucene-style query string support for Peek so users can query logs with syntax that is as close to Lucene QueryParser as possible, focusing on:

  • Full-text search (FTS) on message and selected fields (analyzed)
  • Field existence queries (field:*)
  • Regex queries (field:/.../)
  • Wildcards (*, ?), phrases ("..."), required/prohibited (+, -), boosting (^), and boolean logic

Maintain local-first, single-binary distribution and the no-build-step UI model.

Motivation

Peek currently supports a small Lucene-like subset evaluated via scanning (with time-range key seeking). Users want Lucene-like expressiveness, specifically:

  • field:* existence
  • field:/regex/
  • real full-text search behavior (analysis/tokenization), not substring contains

This needs to work both for querying historical logs and for realtime filtering in the UI.

Goals

  1. Accept Lucene-style query string syntax in the UI and API, staying as close to Lucene QueryParser as practical.
  2. Add FTS with an analyzer-driven inverted index (default field behavior like Lucene).
  3. Add field existence query semantics compatible with Lucene (field:*).
  4. Add regex query semantics compatible with Lucene query string (field:/.../).
  5. Keep single binary, local-only, embedded UI in pkg/server/index.html, no new frontend dependencies, immutable VanJS updates.
  6. Add Playwright E2E tests for the new query features.

Non-goals (for this issue)

  • Remote collectors or multi-user deployments
  • Distributed search or external services
  • Full Solr/Elasticsearch feature parity (faceting, aggregations, scoring explanations, etc.)
  • Perfect Lucene scoring parity (ranking differences are acceptable; correctness of filtering is the priority)

Proposed approach (recommended)

Use an embedded Go search index to avoid implementing a full Lucene parser + inverted index from scratch.

Recommendation:

  • Use Bleve's query string query support as the parsing and execution engine for Lucene-like syntax.
  • Keep BadgerDB as the source of truth for stored log entries.

Rationale:

  • Query string syntax supports phrases, field scoping, regex, required/excluded operators, and boosting.
  • Bleve supports query types we need (regexp, wildcard, fuzzy, numeric/date ranges, query string).
  • Keeps local-first and single-binary (just adds a Go dependency and an on-disk index directory).

User-visible query syntax (Lucene-style)

Default field behavior (FTS)

  • Unfielded terms query the default field (configurable; message is the recommended default, optionally alongside a composite all-field).
    • timeout refused
    • "connection refused"

Field scoping

  • service:api-gateway
  • level:ERROR

Field existence (Lucene semantics)

  • request_id:*
  • user_id:*

Semantics: field is present and has at least one term indexed.

Regex (Lucene query string style)

  • service:/^api-(gateway|edge)$/
  • user_id:/^usr-[0-9]{4}$/

Semantics: regex applies to indexed terms for that field.
Important note:

  • For keyword fields (not analyzed), the term is the full field value, so regex behaves like "regex over the full value".
  • For analyzed fields (like message), regex is term-level, not substring-over-full-text, consistent with Lucene behavior.
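The term-level behavior can be illustrated with a simplified stand-in analyzer (lowercase plus whitespace split; Lucene regex queries are implicitly anchored to the whole term):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// analyze is a stand-in for a real analyzer: lowercase + whitespace split.
func analyze(s string) []string {
	return strings.Fields(strings.ToLower(s))
}

// matchTerm reports whether any indexed term fully matches the regex,
// mirroring Lucene's implicit whole-term anchoring.
func matchTerm(pattern string, terms []string) bool {
	re := regexp.MustCompile("^(?:" + pattern + ")$")
	for _, t := range terms {
		if re.MatchString(t) {
			return true
		}
	}
	return false
}

func main() {
	terms := analyze("Connection Timeout after 30s")
	fmt.Println(matchTerm(`time.*`, terms))             // matches the single term "timeout"
	fmt.Println(matchTerm(`connection timeout`, terms)) // no single term spans two tokens
}
```

This is why `message:/connection timeout/` will not behave like a substring search over the full message: the regex only ever sees one term at a time.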

Wildcards

  • service:api*
  • request_id:req-?????? (if ? is supported)
  • message:*timeout* (term-level wildcard implications apply)

Boolean and required/prohibited clauses

  • level:ERROR AND service:api
  • +level:ERROR -service:auth

Boosting

  • error^2 timeout

Architecture changes

Storage remains unchanged

  • BadgerDB key format remains: log:{timestamp_nano}:{id}
  • LogEntry JSON stays as-is.

Add embedded index

Introduce an index directory (default under Peek data dir):

  • ~/.peek/index (or ${db_path}/index)

Add configuration:

  • [search] enabled = true|false (default false initially)
  • [search] index_path = "~/.peek/index"
  • [search] default_field = "message"
  • [search] include_in_all = ["message", "raw"] (optional)
  • [search] field_mapping_mode = "dynamic|strict"
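The `[search]` keys above might look like this in a config file (illustrative values only):

```toml
[search]
enabled = false              # scan-based filtering remains the default
index_path = "~/.peek/index"
default_field = "message"
include_in_all = ["message", "raw"]
field_mapping_mode = "dynamic"
```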

CLI flags:

  • --search (enable embedded index)
  • --search-index-path
  • --search-default-field

Index document model

Index one document per log entry with a stable doc id:

  • docID = "{timestamp_nano}:{id}"
  • Badger key can be derived: log:{timestamp_nano}:{id}
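The docID/key mapping above is trivial to implement and keep stable (helper names here are illustrative):

```go
package main

import "fmt"

// docID builds the stable index document id for a log entry.
func docID(tsNano int64, id string) string {
	return fmt.Sprintf("%d:%s", tsNano, id)
}

// badgerKey derives the BadgerDB key from a docID by prefixing "log:".
func badgerKey(doc string) []byte {
	return []byte("log:" + doc)
}

func main() {
	d := docID(1700000000000000000, "a1")
	fmt.Println(d)                    // 1700000000000000000:a1
	fmt.Println(string(badgerKey(d))) // log:1700000000000000000:a1
}
```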

Indexed fields (suggested):

  • timestamp (datetime)
  • level (keyword)
  • message (text, analyzed)
  • raw (text or keyword, optional)
  • fields.* (dynamic)
    • strings: keyword by default
    • numbers: numeric
    • booleans: boolean
    • optional: allow marking specific fields as analyzed text via config (e.g. fields.stacktrace)

Query execution path

When search index is enabled:

  • /query executes the query string against the index to obtain matching docIDs (sorted by timestamp desc if possible).
  • Fetch corresponding LogEntry values from BadgerDB and return them.

When search index is disabled:

  • Use current scan-based filtering behavior (existing query engine), preserving backward compatibility.

Realtime filtering (WS /logs)

Requirement: subscriptions should use the same query semantics as /query.

Preferred implementation:

  • Compile subscription query once.
  • For each new entry, evaluate match without running a full index query per entry per client.

Options:
A) Fast path (recommended):

  • Implement a lightweight per-entry matcher for the supported query subset (existence, term, wildcard, regex, phrase on message) using the same analyzers as indexing.
  • Use the index for historical queries, and the matcher for streaming.

B) Simpler but potentially expensive:

  • Index the new entry, then run a docID-restricted query against the index to decide whether to push to each client.
  • Add guardrails (max clients, rate limits) if this path is used.

Pick A if performance matters for 1k+ logs/sec.
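A minimal sketch of option A's compiled matcher (the `entryMatcher` type and `compile` function are hypothetical; real code would reuse the index analyzers and distinguish keyword from analyzed fields):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// entryMatcher is a compiled per-subscription matcher for one clause.
type entryMatcher struct {
	field string
	re    *regexp.Regexp // nil means existence-only check (field:*)
}

// compile turns a single field clause into a matcher, compiling the
// regex once so per-entry evaluation stays cheap.
func compile(field, pattern string) (*entryMatcher, error) {
	if pattern == "*" { // field:* existence query
		return &entryMatcher{field: field}, nil
	}
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	return &entryMatcher{field: field, re: re}, nil
}

// Match evaluates one log entry without touching the index.
func (m *entryMatcher) Match(entry map[string]string) bool {
	v, ok := entry[m.field]
	if !ok {
		return false
	}
	if m.re == nil {
		return true
	}
	// Simplified: analyzed fields should be matched per term with the real
	// analyzer; keyword fields should use the full value as a single term.
	for _, term := range strings.Fields(strings.ToLower(v)) {
		if m.re.MatchString(term) {
			return true
		}
	}
	return false
}

func main() {
	m, _ := compile("service", `^api-(gateway|edge)$`)
	fmt.Println(m.Match(map[string]string{"service": "api-gateway"}))  // true
	fmt.Println(m.Match(map[string]string{"service": "auth-service"})) // false
	e, _ := compile("request_id", "*")
	fmt.Println(e.Match(map[string]string{"request_id": "req-123"})) // true
	fmt.Println(e.Match(map[string]string{"message": "no id"}))      // false
}
```

Compiling once per subscription keeps the per-entry cost to map lookups and precompiled regex checks, which is what makes the 1k+ logs/sec target realistic.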

Migration and operational tooling

Add DB command to build/rebuild index:

  • peek db reindex (scans existing Badger logs, builds index)

Add DB command to verify index health:

  • peek db index-stats (doc count, size, last indexed timestamp)

Retention and deletes:

  • Ensure when logs are deleted (db clean, retention), the corresponding documents are removed from the index.
  • If implementing incremental deletes is complex, document that reindex is needed after bulk deletes for v1, but aim to support deletes properly.

UI changes (pkg/server/index.html only)

  • Update syntax highlighting to recognize:
    • regex literals: field:/.../
    • required/prohibited prefixes + and -
    • boosting ^n
    • existence field:*
  • Autocomplete remains based on /fields. No new UI dependencies.

Critical invariants:

  • No scroll resets when queries run, columns change, or state restores.
  • Immutable VanJS state updates.

Testing plan

Unit tests (Go)

  • Query parsing acceptance tests for:
    • field:* exists
    • field:/regex/
    • phrases "..." and field-scoped phrases message:"..."
    • required/prohibited + / -
    • wildcards * and ? (if supported)
  • Indexing tests:
    • Correct docID mapping
    • Dynamic field mapping for string/number/bool
  • Query execution tests:
    • Results match expected docIDs
    • Time range filters (timestamp:[start TO end]) behave correctly
  • Delete/retention tests:
    • Deleting logs removes index docs (or documented reindex requirement)

E2E tests (Playwright)

Add: e2e/lucene-query.spec.mjs

Test cases (minimum):

  1. Field existence
  • Seed logs where some have request_id, others do not.
  • Query request_id:*
  • Assert only logs with that field are shown.
  2. Regex on keyword field
  • Seed services: api-gateway, api-edge, auth-service
  • Query service:/^api-(gateway|edge)$/
  • Assert only the api services match.
  3. FTS on message (default field)
  • Seed messages: "connection timeout", "connection refused", "all good"
  • Query timeout
  • Assert only timeout logs match.
  • Query "connection refused"
  • Assert phrase match returns the correct entry.
  4. Required/prohibited clauses
  • Seed mixed logs
  • Query +level:ERROR -service:auth
  • Assert results include only ERROR and exclude auth.
  5. Wildcard
  • Query service:api*
  • Assert correct matches.
  6. Backward compatibility path (optional)
  • Run same dataset with search index disabled and confirm old behavior still works (or document differences if unavoidable).

Follow existing E2E conventions:

  • Use e2e/helpers.mjs startServer/stopServer pattern
  • Isolated ports and temp DB path
  • Polling assertions, avoid timing flakiness

Acceptance criteria

  • Query string syntax supports: unfielded terms (FTS), field scoping, phrases, regex, existence field:*, wildcards, boolean operators, required/prohibited + and -, and boosting (syntax accepted even if scoring differs).
  • /query returns correct results using the embedded index when enabled.
  • WS subscriptions apply the same query semantics for streaming.
  • peek db reindex builds an index for an existing DB.
  • Index stays consistent with deletes/retention, or reindex requirement is clearly documented for v1.
  • E2E tests added and passing in CI.
  • AGENTS.md updated in the same PR to reflect new commands, files, and dependencies.
  • /docs/README.md updated with technical details; /README.md updated only with user-facing query syntax and flags.

Implementation checklist

  • Add embedded search index (config, path, enable flag)
  • Define mapping (keyword vs analyzed vs numeric/datetime)
  • Index on ingest and on reindex
  • Implement /query execution via index + Badger fetch
  • Implement WS per-entry matching strategy (prefer compiled matcher)
  • Update UI syntax highlighter (index.html only)
  • Add Go unit tests for parsing/matching and index integration
  • Add e2e/lucene-query.spec.mjs
  • Update docs and AGENTS.md
