Open
96 commits
cdba77c
WIP(cli): database abstraction layer
Davidyz Sep 1, 2025
dac5f75
WIP(cli): Add wip chromadb connector
Davidyz Sep 1, 2025
8bda33f
feat(db): Improve database abstraction and vectorisation process
Davidyz Sep 4, 2025
6f42c52
feat(db): Implement delete and drop methods for database connectors
Davidyz Sep 5, 2025
9028981
refactor(cli): `drop` now use the DB adapter layer.
Davidyz Sep 5, 2025
416d7c8
fix(cli): default db_url for chroma0
Davidyz Sep 5, 2025
8cb0e2d
fix(cli): minor fixes.
Davidyz Sep 5, 2025
1f457f7
feat(cli): implement database builder with lazy import
Davidyz Sep 5, 2025
40b6d97
refactor(cli): `ls` in CLI mode now use the DB adapter layer.
Davidyz Sep 5, 2025
129ebd4
docs about database connectors.
Davidyz Sep 7, 2025
fdc6e27
feat(cli): support excluding files in queries.
Davidyz Sep 7, 2025
984cf15
fix query filter
Davidyz Sep 11, 2025
b082fa6
feat(db): Implement config update and replace methods for database co…
Davidyz Sep 12, 2025
9494a76
remove some parameters.
Davidyz Sep 12, 2025
6263c1b
update `chroma0` connector to adopt the new API definition.
Davidyz Sep 12, 2025
fd3dd3c
fix(cli): fix default value of `excluded_files`
Davidyz Sep 12, 2025
79e540e
fix(cli): improve assertions for `metadatas`, `documents` and `ids`
Davidyz Sep 12, 2025
1bea28a
refactor(cli): Use DB adapter layer for `ls` and `rm` commands
Davidyz Sep 12, 2025
4dd184e
docs(db): Improve docstrings for database connectors
Davidyz Sep 13, 2025
b5e3d79
docs(db): Add documentation for database connectors
Davidyz Sep 13, 2025
a34d84e
feat(db): Implement orphan removal functionality in the database conn…
Davidyz Sep 13, 2025
5f8b65f
feat(db): Refactor database connectors to use embeddings and improve …
Davidyz Sep 13, 2025
b578328
refactor(db): Implement `get_chunks` method in the database connectors
Davidyz Sep 13, 2025
a5546ee
wip(db): port `vectorise` command to the new db layer
Davidyz Sep 13, 2025
46b52c0
fix(db): Refactor `list_collection_content` to accept keyword arguments
Davidyz Sep 13, 2025
29f53d3
feat(cli): Refactor vectorise command to use DB adapter layer
Davidyz Sep 14, 2025
35db49f
feat(cli): Refactor `update` command to use DB adapter layer
Davidyz Sep 14, 2025
dcf8309
refactor(cli): Refactor query command to use DB adapter layer
Davidyz Sep 14, 2025
3f7b591
feat(db): Enforce Chroma v0 client version
Davidyz Sep 14, 2025
beb1d8b
feat(deps): Update chromadb dependency to latest and add `chromadb==0…
Davidyz Sep 14, 2025
c3ec51f
fix(cli): Fix query command by removing deprecated code and chunking …
Davidyz Sep 14, 2025
92cdc9d
refactor(cli): Refactor mcp_main to use DB adapter layer
Davidyz Sep 14, 2025
265a213
refactor(cli): Refactor query command to use `preprocess_query_keywor…
Davidyz Sep 15, 2025
5e3a0cc
fix(cli): Remove all references to chromadb-related APIs.
Davidyz Sep 15, 2025
80a8893
fix(chroma0): fix import origin for `get_uuid`
Davidyz Sep 15, 2025
7facf78
refactor(lsp): Refactor LSP server to use DB adapter layer
Davidyz Sep 15, 2025
7281c45
refactor(cli): Refactor `clean` command to use DB adapter layer
Davidyz Sep 15, 2025
236d57f
refactor(cli)!: remove `common.py`
Davidyz Sep 15, 2025
f064fa3
docs(database): Clarify terminology in database README
Davidyz Sep 15, 2025
07d8c53
feat(chroma0): add graceful shutdown for bundled chroma server
Davidyz Sep 16, 2025
5fdb3d1
test(chroma0): Add unittests.
Davidyz Sep 16, 2025
aed9e6d
remove unnecessary test
Davidyz Sep 16, 2025
bce3e28
tests(mcp): fix tests.
Davidyz Sep 16, 2025
a335964
fix(mcp): Improve error handling and test coverage
Davidyz Sep 17, 2025
8c29871
fix(lsp): use correct file lists
Davidyz Sep 17, 2025
3478ac7
tests(chroma0): make sure database is mocked.
Davidyz Sep 17, 2025
61628eb
fix(lsp): remove collection-related code and improved test coverage.
Davidyz Sep 18, 2025
3949592
tests(cli): Improve test coverage for `ls` subcommand and handle `Col…
Davidyz Sep 19, 2025
301525b
tests(cli): refactor tests and coverage for `files rm`
Davidyz Sep 19, 2025
341fc24
tests(cli): Refactor query subcommand and add tests
Davidyz Sep 19, 2025
03432d4
fix(lsp): Validate include parameters in query command
Davidyz Sep 19, 2025
7b9a633
tests(cli): Refactor drop subcommand and add exception handling
Davidyz Sep 19, 2025
333d663
feat(cli): Improve test coverage for `vectorise`
Davidyz Sep 19, 2025
c46118e
tests(cli): Improve test coverage and refactor update subcommand
Davidyz Sep 19, 2025
b6d23d5
tests(cli): Refactor tests and improve coverage for `main`
Davidyz Sep 19, 2025
e3518bf
tests: Refactor database types and improve test coverage
Davidyz Sep 19, 2025
a6a2d72
tests(cli): Refactor config and cleanup path handling
Davidyz Sep 19, 2025
c7e27d6
chore(cli): mark some code as nocover
Davidyz Sep 19, 2025
2ef9349
pin to chroma 0.6.3 for now.
Davidyz Sep 19, 2025
7edec0c
fix: use `os.path.samefile` for accurate directory comparison
Davidyz Sep 19, 2025
108b195
fix error message
Davidyz Sep 19, 2025
c15051f
docs(database): Document the database connector API
Davidyz Sep 20, 2025
778d84e
tests(chroma0): Add more tests to chroma0
Davidyz Sep 20, 2025
b05701b
tests(cli): Improve test coverage for SpecResolver
Davidyz Sep 20, 2025
e64d297
tests: Add tests for database types
Davidyz Sep 20, 2025
5894d7a
tests(chroma0): improve test coverages.
Davidyz Sep 20, 2025
734e330
docs(database): update database configuration documentation
Davidyz Sep 20, 2025
3abc848
Auto generate docs
Davidyz Sep 20, 2025
3c4f759
docs(database): Document helper methods and error handling in the dat…
Davidyz Sep 21, 2025
35c554f
docs: Document database configuration and connector development
Davidyz Sep 21, 2025
ff775cb
Auto generate docs
Davidyz Sep 21, 2025
4aa138d
docs: clarify default chroma version
Davidyz Sep 22, 2025
424ccb4
Auto generate docs
Davidyz Sep 22, 2025
1df845e
feat(cli): Report skipped files in vectorise stats
Davidyz Sep 25, 2025
34421db
build(cli): add `chroma0` dep group.
Davidyz Oct 3, 2025
7baa033
Auto generate docs
Davidyz Oct 3, 2025
11aecc4
chore(cli): Use chroma0 for CI in test workflow
Davidyz Oct 3, 2025
4f12fd5
chore(cli): Document how to install extra dependencies
Davidyz Oct 3, 2025
dc44ba9
docs(cli): reflect packaging changes.
Davidyz Oct 3, 2025
d064d99
Auto generate docs
Davidyz Oct 3, 2025
dad1eb9
tests(db): Add database connector initialization test
Davidyz Oct 3, 2025
78495a9
tests(db): skip tests when dependency's not met
Davidyz Oct 4, 2025
1684d14
chore: extra coverage args via env var.
Davidyz Oct 4, 2025
5c13bea
feat(db): Check ChromaDB version on startup
Davidyz Oct 4, 2025
ff1c3da
refactor(cli): remove obsolete opts
Davidyz Oct 5, 2025
d84fa81
fix(cli): Fixes db_url config retrieval
Davidyz Oct 5, 2025
db10d56
refactor(chroma0): extract some stuff that can be reused by chromadb …
Davidyz Oct 5, 2025
6a20a0e
fix(db): Upgrade chromadb version check and include enums
Davidyz Oct 5, 2025
5ea7d74
feat(chroma): a WIP chromadb connector for chroma 1.x
Davidyz Oct 5, 2025
19db7dd
feat(chroma0): Raise CollectionNotFoundError on missing collection
Davidyz Oct 6, 2025
f3dfdae
feat(db): Implemented all methods in the ChromaDB 1.x connector
Davidyz Oct 6, 2025
be217d6
coverage(cli): Remove static analysis and improve coverage pipeline
Davidyz Oct 6, 2025
29cb6aa
feat(db): Add inter-process and inter-thread locks for ChromaDB conne…
Davidyz Oct 8, 2025
9021967
feat(cli): Refactor lock manager and implement inter-thread locks for…
Davidyz Oct 10, 2025
c59136b
ci(cli): Run coverage via shell script and enable coredumpy
Davidyz Oct 10, 2025
799e1fe
Merge branch 'main' into feat/db_layer
Davidyz Nov 13, 2025
10 changes: 2 additions & 8 deletions .github/workflows/test_and_cov.yml
@@ -38,7 +38,7 @@ jobs:
run: pdm config use_uv true

- name: install pdm and dependencies
run: make deps
run: EXTRA_LOCK_ARGS="--group chroma0" make deps

- name: Set custom HF cache directory
run: |
@@ -47,18 +47,12 @@ jobs:
mkdir -p "$HF_HOME"
[ -z "$(ls "$HF_HOME")" ] || rm "${HF_HOME:?}/*" -rf && true

- name: run tests
run: pdm run pytest --enable-coredumpy --coredumpy-dir ${{ env.COREDUMPY_DUMP_DIR }}

- name: run coverage
run: |
pdm run coverage run -m pytest
sh ./scripts/coverage.sh
pdm run coverage report -m
pdm run coverage xml -i

- name: static analysis by basedpyright
run: pdm run basedpyright

- name: upload coverage reports to codecov
uses: codecov/codecov-action@v5
with:
13 changes: 10 additions & 3 deletions Makefile
@@ -1,10 +1,17 @@
.PHONY: multitest
EXTRA_LOCK_ARGS?=
EXTRA_DEPS?=
EXTRA_COVERAGEPY_ARGS?=

LOADED_DOT_ENV=@if [ -f .env ] ; then source .env; fi;

DEFAULT_GROUPS=--group dev --group lsp --group mcp --group debug
DEFAULT_GROUPS=--group dev --group lsp --group mcp --group debug $(EXTRA_LOCK_ARGS)

.PHONY: multitest

deps:
pdm lock $(DEFAULT_GROUPS) || pdm lock $(DEFAULT_GROUPS) --group legacy; \
pdm install
[ -z "$(EXTRA_DEPS)" ] || (pdm run python -m ensurepip && pdm run python -m pip install $(EXTRA_DEPS))

test:
make deps; \
@@ -18,7 +25,7 @@ multitest:

coverage:
make deps; \
pdm run coverage run -m pytest; \
pdm run coverage run $(EXTRA_COVERAGEPY_ARGS) -m pytest --enable-coredumpy --coredumpy-dir dumps; \
pdm run coverage html; \
pdm run coverage report -m

129 changes: 49 additions & 80 deletions doc/VectorCode-cli.txt
@@ -16,7 +16,6 @@ Table of Contents *VectorCode-cli-table-of-contents*

- |VectorCode-cli-installation|
- |VectorCode-cli-install-from-source|
- |VectorCode-cli-migration-from-`pipx`|
- |VectorCode-cli-chromadb|
- |VectorCode-cli-for-windows-users|
- |VectorCode-cli-legacy-environments|
@@ -66,7 +65,7 @@ virtual environments.
After installing `uv`, run:

>bash
uv tool install "vectorcode<1.0.0"
uv tool install "vectorcode[chroma0]"
<

in your shell. To specify a particular version of Python, use the `--python`
@@ -76,40 +75,25 @@ If you want a CPU-only installation without CUDA dependencies required by
default by PyTorch, run:

>bash
uv tool install "vectorcode<1.0.0" --index https://download.pytorch.org/whl/cpu --index-strategy unsafe-best-match
uv tool install "vectorcode[chroma0]" --index https://download.pytorch.org/whl/cpu --index-strategy unsafe-best-match
<

If you need to install multiple dependency groups (for |VectorCode-cli-lsp| or
|VectorCode-cli-mcp|), you can use the following syntax:

>bash
uv tool install "vectorcode[lsp,mcp]<1.0.0"
uv tool install "vectorcode[lsp,mcp,chroma0]"
<


[!NOTE] The command only installs VectorCode and `SentenceTransformer`, the
default embedding engine. If you need to install an extra dependency, you can
use `uv tool install vectorcode --with <your_deps_here>`
use `uv tool install vectorcode[chroma0] --with <your_deps_here>`

INSTALL FROM SOURCE ~

To install from source, either `git clone` this repository and run `uv tool
install <path_to_vectorcode_repo>`, or use `pipx`:

>bash
pipx install git+https://github.com/Davidyz/VectorCode
<


MIGRATION FROM PIPX ~

The motivation behind the change from `pipx` to `uv tool` is mainly the
performance. The caching mechanism in uv makes it a lot faster than `pipx` for
a lot of operations. If you installed VectorCode via `pipx`, you can continue
to use `pipx` to manage your VectorCode installation. If you wish to switch to
`uv`, you need to uninstall VectorCode using `pipx` and then use `uv` to
install it as described above. All your VectorCode configurations and database
files will work out of the box on your new install.
To install from source, please `git clone` this repository and run `uv tool
install <path_to_vectorcode_repo>`.


CHROMADB ~
@@ -124,9 +108,6 @@ instructions through docker
significantly reduce the IO overhead and avoid potential race condition.


If you’re setting up a standalone ChromaDB server, I recommend sticking to
v0.6.3, because VectorCode is not ready for the upgrade to ChromaDB 1.0 yet.

FOR WINDOWS USERS ~

Windows support is not officially tested at this moment. This PR
@@ -309,7 +290,10 @@ extension, the json5 syntax will be accepted. This allows you to leave trailing
comma in the config file, as well as writing comments (`//`). This can be very
useful if you’re experimenting with the configs.

The JSON configuration file may hold the following values: -
The JSON configuration file may hold the following values: - `db_type`: string,
default: `"ChromaDB0Connector"` (for chromadb 0.6.3), the database backend to
use; - `db_params`: dictionary. See the database connector documentation
<../src/vectorcode/database/README.md> for the default values; -
`embedding_function`: string, one of the embedding functions supported by
Chromadb <https://www.trychroma.com/> (find more here
<https://docs.trychroma.com/docs/embeddings/embedding-functions> and here
@@ -329,62 +313,45 @@ model_name="nomic-embed-text")`. Default: `{}`; - `embedding_dims`: integer or
model supports Matryoshka Representation Learning (MRL) before using this._
Learn more about MRL here
<https://sbert.net/examples/sentence_transformer/training/matryoshka/README.html#matryoshka-embeddings>.
When set to `null` (or unset), the embeddings won’t be truncated; - `db_url`:
string, the url that points to the Chromadb server. VectorCode will start an
HTTP server for Chromadb at a randomly picked free port on `localhost` if your
configured `http://host:port` is not accessible. Default:
`http://127.0.0.1:8000`; - `db_path`: string, Path to local persistent
database. If you didn’t set up a standalone Chromadb server, this is where
the files for your database will be stored. Default:
`~/.local/share/vectorcode/chromadb/`; - `db_log_path`: string, path to the
_directory_ where the built-in chromadb server will write the log to. Default:
`~/.local/share/vectorcode/`; - `chunk_size`: integer, the maximum number of
characters per chunk. A larger value reduces the number of items in the
database, and hence accelerates the search, but at the cost of potentially
truncated data and lost information. Default: `2500`. To disable chunking, set
it to a negative number; - `overlap_ratio`: float between 0 and 1, the ratio of
overlapping/shared content between 2 adjacent chunks. A larger ratio improves
the coherence of chunks, but at the cost of increasing number of entries in the
database and hence slowing down the search. Default: `0.2`. _Starting from
0.4.11, VectorCode will use treesitter to parse languages that it can
automatically detect. It uses pygments to guess the language from filename, and
tree-sitter-language-pack to fetch the correct parser. overlap_ratio has no
effects when treesitter works. If VectorCode fails to find an appropriate
parser, it’ll fallback to the legacy naive parser, in which case
overlap_ratio works exactly in the same way as before;_ - `query_multiplier`:
integer, when you use the `query` command to retrieve `n` documents, VectorCode
will check `n * query_multiplier` chunks and return at most `n` documents. A
larger value of `query_multiplier` guarantees the return of `n` documents, but
with the risk of including too many less-relevant chunks that may affect the
document selection. Default: `-1` (any negative value means selecting documents
based on all indexed chunks); - `reranker`: string, the reranking method to
use. Currently supports `NaiveReranker` (sort chunks by the "distance" between
the embedding vectors) and `CrossEncoderReranker` (using sentence-transformers
cross-encoder
When set to `null` (or unset), the embeddings won’t be truncated; -
`chunk_size`: integer, the maximum number of characters per chunk. A larger
value reduces the number of items in the database, and hence accelerates the
search, but at the cost of potentially truncated data and lost information.
Default: `2500`. To disable chunking, set it to a negative number; -
`overlap_ratio`: float between 0 and 1, the ratio of overlapping/shared content
between 2 adjacent chunks. A larger ratio improves the coherence of chunks, but
at the cost of increasing number of entries in the database and hence slowing
down the search. Default: `0.2`. _Starting from 0.4.11, VectorCode will use
treesitter to parse languages that it can automatically detect. It uses
pygments to guess the language from filename, and tree-sitter-language-pack to
fetch the correct parser. overlap_ratio has no effects when treesitter works.
If VectorCode fails to find an appropriate parser, it’ll fallback to the
legacy naive parser, in which case overlap_ratio works exactly in the same way
as before;_ - `query_multiplier`: integer, when you use the `query` command to
retrieve `n` documents, VectorCode will check `n * query_multiplier` chunks and
return at most `n` documents. A larger value of `query_multiplier` guarantees
the return of `n` documents, but with the risk of including too many
less-relevant chunks that may affect the document selection. Default: `-1` (any
negative value means selecting documents based on all indexed chunks); -
`reranker`: string, the reranking method to use. Currently supports
`NaiveReranker` (sort chunks by the "distance" between the embedding vectors)
and `CrossEncoderReranker` (using sentence-transformers cross-encoder
<https://sbert.net/docs/package_reference/cross_encoder/cross_encoder.html> ).
- `reranker_params`: dictionary, similar to `embedding_params`. The options
passed to the reranker class constructor. For `CrossEncoderReranker`, these are
the options passed to the `CrossEncoder`
<https://sbert.net/docs/package_reference/cross_encoder/cross_encoder.html#id1>
class. For example, if you want to use a non-default model, you can use the
following: `json { "reranker_params": { "model_name_or_path": "your_model_here"
} }` - `db_settings`: dictionary, works in a similar way to `embedding_params`,
but for Chromadb client settings so that you can configure authentication for
remote Chromadb <https://docs.trychroma.com/production/administration/auth>; -
`hnsw`: a dictionary of hnsw settings
<https://cookbook.chromadb.dev/core/configuration/#hnsw-configuration> that may
improve the query performances or avoid runtime errors during queries. **It’s
recommended to re-vectorise the collection after modifying these options,
because some of the options can only be set during collection creation.**
Example (and default): `json5 "hnsw": { "hnsw:M": 64, }` - `filetype_map`:
`dict[str, list[str]]`, a dictionary where keys are language name
} }` - `filetype_map`: `dict[str, list[str]]`, a dictionary where keys are
language name
<https://github.com/Goldziher/tree-sitter-language-pack?tab=readme-ov-file#available-languages>
and values are lists of Python regex patterns
<https://docs.python.org/3/library/re.html> that will match file extensions.
This allows overriding automatic language detection and specifying a treesitter
parser for certain file types for which the language parser cannot be correctly
identified (e.g., `.phtml` files containing both php and html). Example
configuration: `json5 "filetype_map": { "php": ["^phtml$"] }`
configuration: `json5 { "filetype_map": { "php": ["^phtml$"], }, }`

- `chunk_filters`: `dict[str, list[str]]`, a dictionary where the keys are
language name
@@ -395,10 +362,12 @@ configuration: `json5 "filetype_map": { "php": ["^phtml$"] }`
treesitter chunker. By default, no filters will be added. Example
configuration:
>json5
"chunk_filters": {
"python": ["^[^a-zA-Z0-9]+$"], // multiple patterns will be merged (unioned)
// or you can use wildcard to match any languages that has no dedicated filters:
"*": ["^[^a-zA-Z0-9]+$"],
{
"chunk_filters": {
"python": ["^[^a-zA-Z0-9]+$"], // multiple patterns will be merged (unioned)
// or you can use wildcard to match any languages that has no dedicated filters:
"*": ["^[^a-zA-Z0-9]+$"],
},
}
<
- `encoding`: string, alternative encoding used for this project. By default this
@@ -743,13 +712,13 @@ A JSON array of collection information of the following format will be printed:

>json
{
"project_root": str,
"user": str,
"hostname": str,
"collection_name": str,
"size": int,
"num_files": int,
"embedding_function": str
"project_root": "project_root",
"user": "user",
"hostname": "host",
"collection_name": "fuerbvo13571943ofuib",
"size": 10,
"num_files": 100,
"embedding_function": "SomeEmbeddingFunction"
}
<

35 changes: 35 additions & 0 deletions docs/CONTRIBUTING.md
@@ -40,6 +40,41 @@ You may also find it helpful to
[enable logging](https://github.com/Davidyz/VectorCode/blob/main/docs/cli.md#debugging-and-diagnosing)
for the CLI when developing new features or working on fixes.

### Local Dependencies

Sometimes you want `make deps` to install non-default dependencies. The
`Makefile` provides easy ways to do that.

When you want to install a dependency group of VectorCode:

```bash
EXTRA_LOCK_ARGS="--group chroma0" make deps
```

When you want to install a library that is not declared in any of the dependency
groups (like `openai`):

```bash
EXTRA_DEPS="openai\<2.0.0" make deps
```

Both environment variables apply to `make deps`, `make test` and `make coverage`.

### Database Connectors

Please take a look at [the database documentation](../src/vectorcode/database/README.md),
which contains a brief introduction on the API design that explains what you'd need
to do to add support for a new database.

### Coverage Across Multiple Runs

If, for some reason, you need to run the tests multiple times to get full coverage
(for example, when there are conflicting dependency groups like chromadb 0.6.3 vs chromadb 1.x), you can pass the `--append` flag to the `coverage run` command.
If you're using `make coverage`, you can set this flag via the `EXTRA_COVERAGEPY_ARGS` environment variable:
```bash
EXTRA_COVERAGEPY_ARGS="--append" make coverage
```
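
As a rough sketch of how two such runs could be chained: the first run writes a fresh coverage data file and the second appends to it. The `--group chroma` value in the second run is a placeholder (only the `chroma0` group is confirmed here), and the `run` helper and `DRY_RUN` switch are illustrative, not part of the repository. With `DRY_RUN=1` (the default in this sketch), the commands are only printed.

```bash
#!/usr/bin/env bash
# Sketch only: combine coverage data from two runs with different dependencies.
set -eu

# DRY_RUN=1 prints each command instead of executing it (hypothetical helper).
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    # `env NAME=value ... cmd` sets the variables for that one command.
    env "$@"
  fi
}

# First run: no --append, so coverage.py starts from a fresh data file.
run EXTRA_LOCK_ARGS="--group chroma0" make coverage

# Second run: --append keeps the data collected by the first run.
# "--group chroma" is a placeholder for whichever group conflicts with chroma0.
run EXTRA_LOCK_ARGS="--group chroma" EXTRA_COVERAGEPY_ARGS="--append" make coverage
```

Set `DRY_RUN=0` to actually execute both runs. `make coverage` prints a report after each run, so the report from the second run should reflect the combined data.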

## Neovim Plugin

At the moment, there isn't much to cover on here. As long as the code is