A local semantic search tool for querying code bases with NLP.

Sycritz/code-semantic-search

Code Semantic Searcher

Status: Work in Progress

A local semantic search tool for querying codebases with natural language. Point it at a directory, it indexes everything, and then you can ask things like "where do we parse config files" instead of grepping around.

Why I Built This

Mainly to learn. I wanted to get hands-on with:

  • Building a TUI that actually talks to a backend
  • Working with embeddings and vector search (FAISS specifically)
  • Wiring up a simple ML pipeline end-to-end
  • FastAPI for quick API prototyping

It's not meant to replace proper code search tools. It's a playground for understanding how semantic search works under the hood.

How It Works

  1. Preprocess - Reads source files, strips comments, outputs clean text chunks (currently one chunk per file)
  2. Embed - Runs each chunk through a sentence-transformer model (all-MiniLM-L6-v2)
  3. Index - Builds a FAISS index for fast similarity lookup
  4. Search - Query gets embedded, matched against the index, results come back ranked

The API serves results, the TUI consumes them. Nothing fancy.
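
To make the embed / index / search flow concrete, here is a minimal stdlib-only sketch. It is not the project's code: a toy bag-of-words counter stands in for the all-MiniLM-L6-v2 embeddings, and a brute-force cosine scan stands in for the FAISS index, so it only matches on shared words rather than meaning. The file names, chunk texts, and function names are all made up for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for the sentence-transformer: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks: dict[str, str]) -> dict[str, Counter]:
    """Stand-in for the FAISS index: one vector per chunk, keyed by path."""
    return {path: embed(text) for path, text in chunks.items()}


def search(index: dict[str, Counter], query: str, k: int = 3) -> list[tuple[str, float]]:
    """Embed the query, score every chunk, return the top k by similarity."""
    q = embed(query)
    scored = [(path, cosine(q, vec)) for path, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Hypothetical preprocessed chunks, one per file.
chunks = {
    "config.py": "def load_config(path): parse yaml config file and return settings",
    "server.py": "def start_server(port): bind socket and serve http requests",
}
index = build_index(chunks)
results = search(index, "where do we parse config files")
print(results[0][0])
```

Swapping `embed` for a real model and `build_index` / `search` for FAISS keeps the same shape: vectors in, ranked neighbours out.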

What I Picked Up Along the Way

  • Regex-based comment stripping is fragile but works for a prototype
  • FAISS IVF indexes (IndexIVFFlat and friends) have to be trained on sample vectors before you can add anything, which is awkward for tiny datasets
  • Keeping pipeline stages separate makes debugging way easier
  • Type hints pay off when you're wiring modules together
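
A quick illustration of why regex comment stripping is fragile. This is a sketch that assumes Python input and is not the project's actual preprocessor; both function names are hypothetical. The naive regex eats a `#` that sits inside a string literal, while asking the standard `tokenize` module where the real comments are avoids that.

```python
import io
import re
import tokenize


def strip_comments_regex(source: str) -> str:
    """Naive approach: drop everything from '#' to the end of the line."""
    return re.sub(r"#.*", "", source)


def strip_comments_tokenize(source: str) -> str:
    """Sturdier approach for Python input: let the tokenizer find real comments."""
    lines = source.splitlines()
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            row, col = tok.start  # rows are 1-based, columns are 0-based
            # A comment always runs to the end of its line, so cut there.
            lines[row - 1] = lines[row - 1][:col].rstrip()
    return "\n".join(lines)


sample = 'url = "http://host/#anchor"  # trailing comment\n'

broken = strip_comments_regex(sample)    # the string literal gets truncated
clean = strip_comments_tokenize(sample)  # only the real comment is removed
print(repr(broken))
print(repr(clean))
```

The tokenizer route only works for Python sources. Other languages would need their own lexer, which is part of why the regex shortcut is tempting for a prototype.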

Known Issues

This is learning code, not production code:

  • No incremental updates (you rebuild the whole index on changes)
  • Chunking is per-file, not per-function
  • No auth on the API
  • Probably breaks on edge cases I haven't hit yet
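
For the per-file chunking limitation, one possible direction (not implemented in this repo, and assuming the indexed code is Python) is to split files into one chunk per function with the standard `ast` module. `chunk_by_function` is a hypothetical helper name.

```python
import ast


def chunk_by_function(source: str) -> list[tuple[str, str]]:
    """Return (function name, source text) pairs in order of appearance.

    Nested functions show up twice: inside their parent's text and
    again as their own chunk.
    """
    tree = ast.parse(source)
    funcs = [
        node
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    chunks = []
    for node in sorted(funcs, key=lambda n: n.lineno):
        segment = ast.get_source_segment(source, node)
        if segment is not None:
            chunks.append((node.name, segment))
    return chunks


sample = '''
def load(path):
    return open(path).read()

async def parse(text):
    return text.split()
'''

chunks = chunk_by_function(sample)
for name, text in chunks:
    print(name, len(text))
```

Each chunk could then be embedded on its own, so a query lands on a specific function instead of a whole file.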

Running It

export PYTHONPATH=$PWD

# Index your code
python3 -m src.preprocessing.preprocess --input ~/your-code --output data/processed
python3 -m src.embedding.embedder
python3 -m src.indexing.build_index

# Start the API
uvicorn src.api.server:app --reload

# Open another terminal, run the TUI
python3 tui/client.py

Needs Python 3.10+, sentence-transformers, faiss-cpu, fastapi, and uvicorn (used above to serve the API).

Project Structure

src/
  preprocessing/   # cleans code files
  embedding/       # generates vectors
  indexing/        # builds FAISS index
  search/          # runs queries
  api/             # FastAPI server
tui/
  client.py        # terminal interface

License

MIT
