## Summary

Implement Lucene-style query string support for Peek so users can query logs with syntax that is as close to Lucene QueryParser as possible, focusing on:

- Full-text search (FTS) on `message` and selected fields (analyzed)
- Field existence queries (`field:*`)
- Regex queries (`field:/.../`)
- Wildcards (`*`, `?`), phrases (`"..."`), required/prohibited (`+`, `-`), boosting (`^`), and boolean logic

Maintain local-first, single-binary distribution and the no-build-step UI model.
## Motivation

Peek currently supports a small Lucene-like subset evaluated via scanning (with time-range key seeking). Users want Lucene-like expressiveness, specifically:

- `field:*` existence queries
- `field:/regex/` queries
- Real full-text search behavior (analysis/tokenization), not substring `contains`

This needs to work both for querying historical logs and for realtime filtering in the UI.
## Goals

- Accept Lucene-style query string syntax in the UI and API, staying as close to Lucene QueryParser as practical.
- Add FTS with an analyzer-driven inverted index (default field behavior like Lucene).
- Add field existence query semantics compatible with Lucene (`field:*`).
- Add regex query semantics compatible with the Lucene query string (`field:/.../`).
- Keep the single binary, local-only, embedded UI in `pkg/server/index.html`, no new frontend dependencies, immutable VanJS updates.
- Add Playwright E2E tests for the new query features.
## Non-goals (for this issue)
- Remote collectors or multi-user deployments
- Distributed search or external services
- Full Solr/Elasticsearch feature parity (faceting, aggregations, scoring explanations, etc.)
- Perfect Lucene scoring parity (ranking differences are acceptable; correctness of filtering is the priority)
## Proposed approach (recommended)
Use an embedded Go search index to avoid implementing a full Lucene parser + inverted index from scratch.
Recommendation:
- Use Bleve's query string query support as the parsing and execution engine for Lucene-like syntax.
- Keep BadgerDB as the source of truth for stored log entries.
Rationale:
- Query string syntax supports phrases, field scoping, regex, required/excluded operators, and boosting.
- Bleve supports query types we need (regexp, wildcard, fuzzy, numeric/date ranges, query string).
- Keeps local-first and single-binary (just adds a Go dependency and an on-disk index directory).
## User-visible query syntax (Lucene-style)

### Default field behavior (FTS)

- Unfielded terms query the default field (configurable); recommended default: `message` (and optionally a composite field).

Examples: `timeout`, `refused`, `"connection refused"`
### Field scoping

Examples: `service:api-gateway`, `level:ERROR`
### Field existence (Lucene semantics)

Examples: `request_id:*`, `user_id:*`

Semantics: the field is present and has at least one term indexed.
### Regex (Lucene query string style)

Examples: `service:/^api-(gateway|edge)$/`, `user_id:/^usr-[0-9]{4}$/`

Semantics: the regex applies to indexed terms for that field.

Important note:

- For keyword fields (not analyzed), the term is the full field value, so the regex behaves like "regex over the full value".
- For analyzed fields (like `message`), the regex is term-level, not substring-over-full-text, consistent with Lucene behavior.
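The term-level semantics can be illustrated with a minimal sketch (the lowercase/whitespace analyzer and function names below are illustrative, not Peek's actual analyzer):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// analyze is a stand-in for a real analyzer: lowercase + whitespace split.
func analyze(text string) []string {
	return strings.Fields(strings.ToLower(text))
}

// matchRegexTermLevel mimics Lucene regex semantics on an analyzed field:
// the pattern must match an entire indexed term, not a substring of the
// whole field value (Lucene anchors regexes implicitly, hence ^...$ here).
func matchRegexTermLevel(pattern, text string) bool {
	re := regexp.MustCompile("^(?:" + pattern + ")$")
	for _, term := range analyze(text) {
		if re.MatchString(term) {
			return true
		}
	}
	return false
}

func main() {
	msg := "connection timeout after 30s"
	fmt.Println(matchRegexTermLevel("time.*", msg))         // true: matches the term "timeout"
	fmt.Println(matchRegexTermLevel("connection t.*", msg)) // false: no single term matches
}
```

This is why `message:/connection t.*/` finds nothing on an analyzed field even though the raw string contains that substring: the regex never sees text that spans token boundaries.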
### Wildcards

Examples: `service:api*`, `request_id:req-??????` (if `?` is supported), `message:*timeout*` (term-level wildcard implications apply)
### Boolean and required/prohibited clauses

Examples: `level:ERROR AND service:api`, `+level:ERROR -service:auth`
### Boosting

Example: `error^2 timeout`
## Architecture changes

### Storage remains unchanged

- BadgerDB key format remains `log:{timestamp_nano}:{id}`.
- LogEntry JSON stays as-is.

### Add embedded index

Introduce an index directory (default under the Peek data dir): `~/.peek/index` (or `${db_path}/index`).
Add configuration:

- `[search] enabled = true|false` (default `false` initially)
- `[search] index_path = "~/.peek/index"`
- `[search] default_field = "message"`
- `[search] include_in_all = ["message", "raw"]` (optional)
- `[search] field_mapping_mode = "dynamic|strict"`

CLI flags:

- `--search` (enable embedded index)
- `--search-index-path`
- `--search-default-field`
### Index document model

Index one document per log entry with a stable doc id:

- `docID = "{timestamp_nano}:{id}"`
- The Badger key can be derived: `log:{timestamp_nano}:{id}`

Indexed fields (suggested):

- `timestamp` (datetime)
- `level` (keyword)
- `message` (text, analyzed)
- `raw` (text or keyword, optional)
- `fields.*` (dynamic):
  - strings: keyword by default
  - numbers: numeric
  - booleans: boolean
  - optional: allow marking specific fields as analyzed text via config (e.g. `fields.stacktrace`)
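A sketch of the docID/key mapping and the flattened index document; the `LogEntry` field names here are illustrative, not Peek's actual schema:

```go
package main

import "fmt"

// LogEntry mirrors the stored Badger value; field names are illustrative.
type LogEntry struct {
	TimestampNano int64
	ID            string
	Level         string
	Message       string
	Fields        map[string]any
}

// DocID builds the stable index doc id "{timestamp_nano}:{id}".
func DocID(e LogEntry) string {
	return fmt.Sprintf("%d:%s", e.TimestampNano, e.ID)
}

// BadgerKey derives the storage key "log:{timestamp_nano}:{id}" from a
// doc id, so index hits resolve to stored entries without extra state.
func BadgerKey(docID string) []byte {
	return []byte("log:" + docID)
}

// IndexDoc flattens an entry into the shape handed to the index: core
// fields at the top level plus dynamic "fields.*" keys (strings would
// map to keyword, numbers to numeric, booleans to boolean).
func IndexDoc(e LogEntry) map[string]any {
	doc := map[string]any{
		"timestamp": e.TimestampNano,
		"level":     e.Level,
		"message":   e.Message,
	}
	for k, v := range e.Fields {
		doc["fields."+k] = v
	}
	return doc
}

func main() {
	e := LogEntry{TimestampNano: 1718000000000000000, ID: "abc123",
		Level: "ERROR", Message: "connection timeout",
		Fields: map[string]any{"service": "api-gateway"}}
	fmt.Println(DocID(e))                    // 1718000000000000000:abc123
	fmt.Println(string(BadgerKey(DocID(e)))) // log:1718000000000000000:abc123
}
```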
### Query execution path

When the search index is enabled:

- `/query` executes the query string against the index to obtain matching docIDs (sorted by timestamp desc if possible).
- Fetch the corresponding LogEntry values from BadgerDB and return them.

When the search index is disabled:

- Use the current scan-based filtering behavior (existing query engine), preserving backward compatibility.
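The index-then-fetch flow can be sketched against a stubbed index and store. The `SearchIndex` interface and all names here are hypothetical; in the real implementation the index would be Bleve and the store BadgerDB:

```go
package main

import (
	"fmt"
	"sort"
)

// SearchIndex abstracts the embedded index; the real implementation
// would run a query string query and return matching doc ids.
type SearchIndex interface {
	Search(query string) ([]string, error)
}

// queryLogs sketches the /query path when the index is enabled: resolve
// docIDs via the index, then fetch stored values keyed "log:{docID}".
func queryLogs(idx SearchIndex, store map[string][]byte, q string) ([][]byte, error) {
	docIDs, err := idx.Search(q)
	if err != nil {
		return nil, err
	}
	// Reverse lexicographic order approximates timestamp-desc only when
	// the timestamp prefix is fixed-width; a real implementation should
	// sort on the parsed timestamp.
	sort.Sort(sort.Reverse(sort.StringSlice(docIDs)))
	var out [][]byte
	for _, id := range docIDs {
		if val, ok := store["log:"+id]; ok { // skip ids whose entry was deleted
			out = append(out, val)
		}
	}
	return out, nil
}

type stubIndex struct{ hits []string }

func (s stubIndex) Search(string) ([]string, error) { return s.hits, nil }

func main() {
	store := map[string][]byte{
		"log:100:a": []byte(`{"message":"connection timeout"}`),
		"log:200:b": []byte(`{"message":"connection refused"}`),
	}
	idx := stubIndex{hits: []string{"100:a", "200:b"}}
	results, _ := queryLogs(idx, store, "connection")
	fmt.Println(len(results)) // 2
}
```

Skipping docIDs that no longer resolve in the store also gives a graceful fallback while index deletes lag behind retention.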
### Realtime filtering (WS /logs)

Requirement: subscriptions should use the same query semantics as `/query`.

Preferred implementation:

- Compile the subscription query once.
- For each new entry, evaluate the match without running a full index query per entry per client.

Options:

A) Fast path (recommended):

- Implement a lightweight per-entry matcher for the supported query subset (existence, term, wildcard, regex, phrase on `message`) using the same analyzers as indexing.
- Use the index for historical queries and the matcher for streaming.

B) Simpler but potentially expensive:

- Index the new entry, then run a docID-restricted query against the index to decide whether to push to each client.
- Add guardrails (max clients, rate limits) if this path is used.

Pick A if performance matters at 1k+ logs/sec.
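A compiled matcher for option A might look like the sketch below. It handles a single-clause subset only (existence, term regex, analyzed term); the names, whitespace analyzer, and flat entry shape are all illustrative, and the real matcher would share the parser and analyzers with the index:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Matcher is a compiled predicate evaluated once per incoming entry,
// avoiding an index round-trip per entry per client.
type Matcher func(entry map[string]string) bool

// compile supports "field:*" (existence), "field:/re/" (term regex),
// and "field:term" (analyzed term match) for illustration.
func compile(q string) Matcher {
	field, rest, _ := strings.Cut(q, ":")
	switch {
	case rest == "*":
		// Existence: the field is present on the entry.
		return func(e map[string]string) bool { _, ok := e[field]; return ok }
	case strings.HasPrefix(rest, "/") && strings.HasSuffix(rest, "/"):
		// Regex: anchored, applied per indexed term (Lucene-style).
		re := regexp.MustCompile("^(?:" + strings.Trim(rest, "/") + ")$")
		return func(e map[string]string) bool {
			for _, term := range strings.Fields(strings.ToLower(e[field])) {
				if re.MatchString(term) {
					return true
				}
			}
			return false
		}
	default:
		// Term: compare against analyzed (lowercased, split) terms.
		term := strings.ToLower(rest)
		return func(e map[string]string) bool {
			for _, t := range strings.Fields(strings.ToLower(e[field])) {
				if t == term {
					return true
				}
			}
			return false
		}
	}
}

func main() {
	m := compile("service:/api-.*/")
	fmt.Println(m(map[string]string{"service": "api-gateway"}))  // true
	fmt.Println(m(map[string]string{"service": "auth-service"})) // false
}
```

Compiling once per subscription keeps the per-entry cost at a few string operations, which is what makes option A viable at 1k+ logs/sec.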
## Migration and operational tooling

Add a DB command to build/rebuild the index:

- `peek db reindex` (scans existing Badger logs, builds the index)

Add a DB command to verify index health:

- `peek db index-stats` (doc count, size, last indexed timestamp)

Retention and deletes:

- Ensure that when logs are deleted (`db clean`, retention), the corresponding documents are removed from the index.
- If implementing incremental deletes is complex, document that `reindex` is needed after bulk deletes for v1, but aim to support deletes properly.
## UI changes (`pkg/server/index.html` only)

- Update syntax highlighting to recognize:
  - regex literals: `field:/.../`
  - required/prohibited prefixes `+` and `-`
  - boosting `^n`
  - existence `field:*`
- Autocomplete remains based on `/fields`. No new UI dependencies.

Critical invariants:

- No scroll resets when queries run, columns change, or state restores.
- Immutable VanJS state updates.
## Testing plan

### Unit tests (Go)

- Query parsing acceptance tests for:
  - `field:*` existence
  - `field:/regex/`
  - phrases `"..."` and field-scoped phrases `message:"..."`
  - required/prohibited `+` / `-`
  - wildcards `*` and `?` (if supported)
- Indexing tests:
  - Correct docID mapping
  - Dynamic field mapping for string/number/bool
- Query execution tests:
  - Results match expected docIDs
  - Time range filters (`timestamp:[start TO end]`) behave correctly
- Delete/retention tests:
  - Deleting logs removes index docs (or documented reindex requirement)
### E2E tests (Playwright)

Add: `e2e/lucene-query.spec.mjs`

Test cases (minimum):

- Field existence
  - Seed logs where some have `request_id` and others do not.
  - Query `request_id:*`
  - Assert only logs with that field are shown.
- Regex on keyword field
  - Seed services: `api-gateway`, `api-edge`, `auth-service`
  - Query `service:/^api-(gateway|edge)$/`
  - Assert only the api services match.
- FTS on message (default field)
  - Seed messages: "connection timeout", "connection refused", "all good"
  - Query `timeout`
  - Assert only timeout logs match.
  - Query `"connection refused"`
  - Assert the phrase match returns the correct entry.
- Required/prohibited clauses
  - Seed mixed logs
  - Query `+level:ERROR -service:auth`
  - Assert results include only ERROR and exclude auth.
- Wildcard
  - Query `service:api*`
  - Assert correct matches.
- Backward compatibility path (optional)
  - Run the same dataset with the search index disabled and confirm the old behavior still works (or document differences if unavoidable).
Follow existing E2E conventions:

- Use the `e2e/helpers.mjs` startServer/stopServer pattern
- Isolated ports and a temp DB path
- Polling assertions; avoid timing flakiness
## Acceptance criteria

- Query string syntax supports: unfielded terms (FTS), field scoping, phrases, regex, existence (`field:*`), wildcards, boolean logic, `+` and `-`; boosting syntax is accepted.
- `/query` returns correct results using the embedded index when enabled.
- WS subscriptions apply the same query semantics for streaming.
- `peek db reindex` builds an index for an existing DB.
- The index stays consistent with deletes/retention, or the reindex requirement is clearly documented for v1.
- E2E tests added and passing in CI.
- `AGENTS.md` updated in the same PR to reflect new commands, files, and dependencies.
- `/docs/README.md` updated with technical details; `/README.md` updated only with user-facing query syntax and flags.
## Implementation checklist

- Add embedded search index (config, path, enable flag)
- Define mapping (keyword vs analyzed vs numeric/datetime)
- Index on ingest and on `reindex`
- Implement `/query` execution via index + Badger fetch
- Implement WS per-entry matching strategy (prefer compiled matcher)
- Update UI syntax highlighter (`index.html` only)
- Add Go unit tests for parsing/matching and index integration
- Add `e2e/lucene-query.spec.mjs`
- Update docs and AGENTS.md