Status: Work in Progress
A local semantic search tool for querying codebases with natural language. Point it at a directory, it indexes everything, and then you can ask things like "where do we parse config files" instead of grepping around.
Mainly to learn. I wanted to get hands-on with:
- Building a TUI that actually talks to a backend
- Working with embeddings and vector search (FAISS specifically)
- Wiring up a simple ML pipeline end-to-end
- FastAPI for quick API prototyping
It's not meant to replace proper code search tools. It's a playground for understanding how semantic search works under the hood.
- Preprocess - Reads source files, strips comments, outputs clean text chunks
- Embed - Runs each chunk through a sentence-transformer model (all-MiniLM-L6-v2)
- Index - Builds a FAISS index for fast similarity lookup
- Search - Query gets embedded, matched against the index, results come back ranked
The API serves results; the TUI consumes them. Nothing fancy.
- Regex-based comment stripping is fragile but works for a prototype
- FAISS IndexIVF needs training data, which is awkward for tiny datasets
- Keeping pipeline stages separate makes debugging way easier
- Type hints pay off when you're wiring modules together
This is learning code, not production code:
- No incremental updates (you rebuild the whole index on changes)
- Chunking is per-file, not per-function
- No auth on the API
- Probably breaks on edge cases I haven't hit yet
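On the per-file chunking point: per-function chunks are straightforward for Python sources via the stdlib ast module. A sketch of what that could look like (not part of the current pipeline):

```python
import ast

def chunk_by_function(source: str) -> list[str]:
    """Return one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    return [
        ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]

src = "def a():\n    return 1\n\ndef b():\n    return 2\n"
chunks = chunk_by_function(src)  # two chunks, one per function
```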
```sh
export PYTHONPATH=$PWD

# Index your code
python3 -m src.preprocessing.preprocess --input ~/your-code --output data/processed
python3 -m src.embedding.embedder
python3 -m src.indexing.build_index

# Start the API
uvicorn src.api.server:app --reload

# Open another terminal, run the TUI
python3 tui/client.py
```

Needs Python 3.10+, sentence-transformers, faiss-cpu, fastapi, and uvicorn.
```
src/
  preprocessing/   # cleans code files
  embedding/       # generates vectors
  indexing/        # builds FAISS index
  search/          # runs queries
  api/             # FastAPI server
tui/
  client.py        # terminal interface
```
MIT