
feat: add rtk rgai command for semantic code search #124

Closed
heAdz0r wants to merge 2 commits into rtk-ai:master from heAdz0r:feat/rgai-command



@heAdz0r heAdz0r commented Feb 14, 2026

Summary

Extracted from #118 per reviewer feedback. This is the actual feature implementation — the command that #118's docs/hooks referenced but didn't include.

rtk rgai is a Rust-native semantic search that scores files by term relevance without requiring external embedding services.

Usage

rtk rgai "auth token refresh"              # multi-word semantic query
rtk rgai auth token refresh --compact      # unquoted, compact output
rtk rgai "error handler" --json            # machine-readable JSON
rtk rgai "database migration" -t rust      # filter by file type
rtk rgai "auth flow" -p ./src              # explicit path

How it works

  1. Parse query → remove stop words, stem tokens
  2. Walk project (gitignore-aware, skip binary/large files)
  3. Score each line: term match (+1.4-1.7), phrase match (+6.0), multi-term bonus (+1.2), symbol definition boost (+2.5), comment penalty (×0.7)
  4. Score each file: path relevance + top snippet scores + match density
  5. Rank files, emit top N with context snippets
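
The line-scoring step (step 3) can be sketched roughly as below. This is an illustration only, not the code in src/rgai_cmd.rs. The function name `score_line`, the flat +1.5 per-term weight (picked from inside the stated +1.4-1.7 range), the case-insensitive substring matching, and the keyword and `//` heuristics are all assumptions. Only the phrase, multi-term, symbol, and comment weights come from the list above.

```rust
/// Hypothetical line scorer using the weights quoted in the PR description.
/// `terms` are assumed to be lowercased, already-stemmed query tokens;
/// `phrase` is assumed to be the lowercased original query.
fn score_line(line: &str, terms: &[&str], phrase: &str) -> f64 {
    let lower = line.to_lowercase();
    let trimmed = lower.trim_start();

    // Term match: +1.5 per distinct query term found (assumed flat weight).
    let matched = terms.iter().filter(|t| lower.contains(**t)).count();
    if matched == 0 {
        return 0.0;
    }
    let mut score = matched as f64 * 1.5;

    // Phrase match: the whole query appears verbatim on the line.
    if !phrase.is_empty() && lower.contains(phrase) {
        score += 6.0;
    }
    // Multi-term bonus: more than one query term on the same line.
    if matched > 1 {
        score += 1.2;
    }
    // Symbol definition boost (assumed heuristic: common definition keywords).
    let definition_keywords = ["fn ", "pub fn ", "struct ", "class ", "def ", "function "];
    if definition_keywords.iter().any(|kw| trimmed.starts_with(*kw)) {
        score += 2.5;
    }
    // Comment penalty (assumed heuristic: `//` line comments only).
    if trimmed.starts_with("//") {
        score *= 0.7;
    }
    score
}
```

With these assumed weights, a definition line such as `fn refresh_token() {` outranks a comment that merely mentions one of the terms, which is the behaviour the boost and penalty are meant to produce.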

Token savings

  • 75-90% reduction vs raw grep output (fewer files, ranked by relevance, truncated lines)
  • Compact mode: 1 snippet per file, 0 context lines
  • JSON mode: structured output for programmatic consumption

Changes

  • src/rgai_cmd.rs: 789 lines — full search implementation with scoring, ranking, output formatting
  • src/main.rs: Commands::Rgai variant, match arm, normalize_rgai_args() for backward-compat path detection

Test plan

  • cargo test rgai — 8 tests pass (5 unit + 3 arg normalization)
  • cargo test — 321 total tests pass
  • cargo fmt --all --check — clean
  • cargo clippy --all-targets — no new warnings
  • Manual: rtk rgai "token tracking" -p . returns ranked results from this repo

Rust-native semantic search that scores files and lines by term
relevance, symbol definitions, and path matching. No external
dependencies (no grepai/embeddings required).

Features:
- Natural-language multi-word queries: rtk rgai "auth token refresh"
- File scoring with symbol definition boost (+2.5) and comment penalty
- Stop word removal + basic stemming for better recall
- Compact and JSON output modes
- File type filtering (--file-type ts/py/rust/etc.)
- gitignore-aware traversal via `ignore` crate
- Binary and large file skipping
- Backward-compat: trailing path token auto-detection

Includes 8 unit tests (5 in rgai_cmd, 3 for arg normalization).

let suffixes = ["ingly", "edly", "ing", "ed", "es", "s"];
for suffix in suffixes {
    if token.len() > suffix.len() + 2 && token.ends_with(suffix) {

Collaborator review comment:

stem_token("caches") → "cach", stem_token("services") → "servic", stem_token("changes") → "chang". Any word ending in -ce, -ge, -se + s loses its final e. These broken stems won't match actual occurrences in code.
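
The behaviour described above can be reproduced with a small reconstruction of the rule. Only the suffix list and the length guard come from the diff excerpt. The helper name `stem_with`, the first-match-wins loop, and the `String` return type are assumptions made so the example is self-contained.

```rust
/// Hypothetical reconstruction of the suffix-stripping rule under review.
/// The suffix list is passed in so two variants can be compared.
fn stem_with(token: &str, suffixes: &[&str]) -> String {
    for suffix in suffixes {
        // Length guard copied from the diff: keep at least 3 characters.
        if token.len() > suffix.len() + 2 && token.ends_with(suffix) {
            // First matching suffix wins (assumed); strip it and stop.
            return token[..token.len() - suffix.len()].to_string();
        }
    }
    token.to_string()
}

/// Suffix list as shown in the diff excerpt.
const WITH_ES: [&str; 6] = ["ingly", "edly", "ing", "ed", "es", "s"];
/// Same list with the "es" rule dropped.
const WITHOUT_ES: [&str; 5] = ["ingly", "edly", "ing", "ed", "s"];
```

With `WITH_ES`, `stem_with("caches", &WITH_ES)` yields `cach` because "es" is tried before "s". With `WITHOUT_ES` the same call yields `cache`. One trade-off of dropping the rule in this sketch is that `boxes` becomes `boxe`.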

src/main.rs Outdated
|| token == ".."
|| token.starts_with("./")
|| token.starts_with('/')
|| token.contains('/')

Collaborator review comment:

token.contains('/') is too greedy. A query like rtk rgai "client/server architecture" will treat "client/server" as a path and silently drop it from the query.
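
A narrower check, in the spirit of the fix described later in this thread, could look like the sketch below. The function name `looks_like_path_token` appears in the follow-up commit message, but this body is an assumption. In particular, accepting a bare `.` is a guess, since the start of the original condition is not visible in the excerpt.

```rust
/// Hypothetical path detector that accepts only explicit path prefixes,
/// so a query word containing a slash is not mistaken for a path.
fn looks_like_path_token(token: &str) -> bool {
    token == "."                      // assumed: current directory
        || token == ".."
        || token.starts_with("./")
        || token.starts_with("../")
        || token.starts_with('/')
        || token.starts_with("~/")
}
```

Under this sketch `client/server` stays in the query, while `./src` and `~/code` are still recognised as paths. A bare relative path such as `src/auth` would no longer be auto-detected and would need the explicit `-p` flag.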

src/rgai_cmd.rs Outdated

fn is_comment_line(line: &str) -> bool {
    let trimmed = line.trim_start();
    trimmed.starts_with("//")

Collaborator review comment:

starts_with('#') penalizes Markdown headers and YAML keys on top of Python comments. Can skew scoring on non-code files.
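
One way to avoid that skew is to make the `#` check depend on the file extension. The sketch below is an assumption, not the PR's code. The two-argument signature and the `HASH_COMMENT_EXTS` constant are hypothetical, and the extension list is the one quoted in the author's later reply.

```rust
/// Extensions where a leading `#` is treated as a comment marker
/// (list taken from the author's reply in this thread).
const HASH_COMMENT_EXTS: [&str; 10] = [
    "py", "sh", "rb", "pl", "r", "yaml", "yml", "toml", "conf", "ini",
];

/// Hypothetical extension-aware comment check.
/// `ext` is the file extension without the dot, if the file has one.
fn is_comment_line(line: &str, ext: Option<&str>) -> bool {
    let trimmed = line.trim_start();
    // `//` is treated as a comment regardless of extension, as in the excerpt.
    if trimmed.starts_with("//") {
        return true;
    }
    // `#` only counts when the extension is in the allow-list.
    let hash_is_comment = ext.map_or(false, |e| HASH_COMMENT_EXTS.contains(&e));
    hash_is_comment && trimmed.starts_with('#')
}
```

Here `# Title` in a `.md` file is not penalized, while `# setup` in a `.py` file still is.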

…ment scoring

- stem_token: remove "es" suffix to fix broken stems for -ce/-ge/-ve words
  (caches→cache, services→service, changes→change instead of cach/servic/chang)
- looks_like_path_token: remove bare contains('/') check that treated
  "client/server" as a path; now requires actual path prefixes (./  ../  /  ~/)
- is_comment_line: make '#' detection extension-aware to avoid penalizing
  Markdown headers and YAML in non-script files; only applies to py/sh/rb/etc.
- Add tests for all three fixes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

heAdz0r commented Feb 15, 2026

@pszymkowiak All three issues have been addressed in commit db322a6:

1. stem_token broken stems — Removed "es" suffix rule that broke -ce/-ge/-ve words. Now:

  • caches → cache (was cach)
  • services → service (was servic)
  • changes → change (was chang)
  • Tests added: stem_token_preserves_trailing_e

2. token.contains('/') too greedy — Replaced bare contains('/') with explicit path prefixes (./, ../, /, ~/). "client/server architecture" is now correctly treated as a query, not a path.

  • Test added: normalize_rgai_does_not_treat_slash_word_as_path

3. starts_with('#') penalizing Markdown/YAML → is_comment_line now accepts a file extension parameter. # is only treated as a comment marker for script and config extensions (py, sh, rb, pl, r, yaml, yml, toml, conf, ini). Markdown headers and other # lines in files outside that list are not penalized.

  • Test added: is_comment_line_extension_aware

Ready for re-review. All tests pass, CI checks are green.

@pszymkowiak (Collaborator) commented

I'm having trouble understanding the rationale here. Why would rtk auto-install a third-party binary (grepai) via rtk init? This raises supply chain concerns: rtk shouldn't be downloading and installing external tools on behalf of the user.

As a reminder, rtk's scope is intentionally narrow: it's a lightweight CLI proxy that compresses command output to save LLM tokens. It wraps existing commands; it doesn't implement new search engines or manage third-party dependencies.

Looking at the bigger picture, PRs #124, #125, #127, and #136 form a chain that progressively introduces grepai into rtk. Can you explain the relationship between you and the grepai project, and why this integration belongs in rtk rather than being a standalone tool?


heAdz0r commented Feb 17, 2026

Closing — agreed with maintainers to keep grepai/rgai activity in my fork (heAdz0r/rtk) and not mix it into upstream for now.

@heAdz0r heAdz0r closed this Feb 17, 2026