
feat(retrieval): add file-level chunked vectorization#860

Open
mildred522 wants to merge 4 commits into volcengine:main from mildred522:feat/file-chunked-vectorization-v2

Conversation

@mildred522
Contributor

Description

This PR contributes to Category 3 ("Embedding Processing & Chunking Strategy") in the token/cost optimization tracker
described in #744.

It focuses on the file-level part of that work. Previously, long text files could be indexed as a single
coarse file-level embedding, which reduced retrieval quality for oversized documents and left file-level
chunking behavior undefined.

This change adds configurable chunked vectorization for long text files and collapses chunk-level hits back
to the base file URI during retrieval, so long files gain finer-grained vector coverage internally while
retrieval results remain file-level externally.
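The file-level chunking described above can be sketched as a simple sliding window over characters. This is an illustrative sketch only, not the PR's actual implementation: the helper name `chunk_text` and its defaults are assumptions, standing in for whatever `embedding_utils.py` does with `file_chunk_chars` and `file_chunk_overlap`.

```python
def chunk_text(text: str, chunk_chars: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows with overlap.

    Hypothetical sketch of the chunking step; names and defaults are
    illustrative, not taken from the PR.
    """
    if chunk_chars <= 0 or overlap < 0 or overlap >= chunk_chars:
        raise ValueError("require chunk_chars > 0 and 0 <= overlap < chunk_chars")
    step = chunk_chars - overlap  # each window starts `step` chars after the last
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_chars])
        if start + chunk_chars >= len(text):
            break  # last window already reached the end of the text
    return chunks
```

Each chunk would then be embedded individually, giving a long file several vectors instead of one coarse file-level embedding.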

Related Issue

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test update

Changes Made

  • Add file_chunk_chars and file_chunk_overlap config options, plus validation in OpenVikingConfig
  • Add chunked vectorization for long text files in embedding_utils.py
  • Collapse chunk-level retrieval hits back to base file URIs in HierarchicalRetriever
  • Update English and Chinese configuration docs for the new file chunking settings
  • Add regression tests for config validation, long-file chunked vectorization, and retrieval-side chunk collapse
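The config validation in the first bullet could look roughly like the following. This is a hedged sketch: the option names `file_chunk_chars` and `file_chunk_overlap` come from the PR, but the `FileChunkingConfig` class, its defaults, and the exact constraints are assumptions about what `OpenVikingConfig` enforces.

```python
from dataclasses import dataclass


@dataclass
class FileChunkingConfig:
    """Illustrative stand-in for the file-chunking part of OpenVikingConfig."""

    file_chunk_chars: int = 4000   # max characters per file chunk (assumed default)
    file_chunk_overlap: int = 200  # overlap between adjacent chunks (assumed default)

    def validate(self) -> None:
        # Overlap must be strictly smaller than the chunk size, otherwise
        # the sliding window would never advance.
        if self.file_chunk_chars <= 0:
            raise ValueError("file_chunk_chars must be positive")
        if not (0 <= self.file_chunk_overlap < self.file_chunk_chars):
            raise ValueError(
                "file_chunk_overlap must be in [0, file_chunk_chars)"
            )
```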

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have tested this on the following platforms:
    • Linux
    • macOS
    • Windows

Targeted local verification completed:

  • py -3.11 -m ruff check openviking_cli\utils\config\open_viking_config.py openviking\utils\embedding_utils.py openviking\retrieve\hierarchical_retriever.py tests\misc\test_openviking_config_file_chunking.py tests\misc\test_file_chunk_vectorization.py tests\retrieve\test_hierarchical_retriever_chunk_collapse.py
  • py -3.11 -m pytest tests\misc\test_openviking_config_file_chunking.py tests\misc\test_file_chunk_vectorization.py tests\retrieve\test_hierarchical_retriever_chunk_collapse.py -q

Checklist

  • My code follows the project's coding style
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes

This PR is intentionally scoped to file-level chunked vectorization only.
It does not attempt to define chunking behavior for memory or directory indexing.

Chunk-level candidates are currently collapsed back to file-level results using the generated
chunk URI suffix convention, while preserving source_uri in metadata.
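The collapse step can be sketched as stripping the chunk suffix from each candidate URI and keeping the best score per base file. The `#chunk-N` suffix pattern, the `viking://` URIs, and the `(uri, score)` hit shape below are all assumptions for illustration; the PR's actual suffix convention and `HierarchicalRetriever` data structures may differ.

```python
import re

# Hypothetical chunk URI suffix convention, e.g. "viking://docs/a.txt#chunk-3".
CHUNK_SUFFIX = re.compile(r"#chunk-\d+$")


def collapse_chunk_hits(hits: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Collapse chunk-level hits back to base file URIs.

    Keeps the best (highest) score seen for each base file, then returns
    file-level results sorted by score descending.
    """
    best: dict[str, float] = {}
    for uri, score in hits:
        base = CHUNK_SUFFIX.sub("", uri)  # strip "#chunk-N" if present
        if base not in best or score > best[base]:
            best[base] = score
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)
```

This keeps the retrieval API file-level externally: callers never see chunk URIs, only the base file with its best-matching chunk's score.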

@github-actions

Failed to generate code suggestions for PR

@mildred522
Contributor Author

This PR overlaps with #858 in problem space, but the approach is different.

#858 focuses on configurable truncation / text source selection to avoid oversized embedding inputs.
This PR focuses on file-level chunked vectorization, so long text files keep finer-grained vector coverage, while
retrieval still returns file-level results by collapsing chunk hits back to the base URI.

If maintainers prefer the smaller config-only step first, I can rebase this PR later.
