
feat(retrieval): add file-level chunked vectorization#860

Open
mildred522 wants to merge 4 commits into volcengine:main from mildred522:feat/file-chunked-vectorization-v2

Conversation

@mildred522
Contributor

Description

This PR contributes to Category 3 ("Embedding Processing & Chunking Strategy") in the token/cost optimization tracker
described in #744.

It focuses on the file-level part of that work. Previously, long text files could be indexed as a single
coarse file-level embedding, which reduced retrieval quality for oversized documents and left file-level
chunking behavior undefined.

This change adds configurable chunked vectorization for long text files and collapses chunk-level hits back
to the base file URI during retrieval, so long files gain finer-grained vector coverage internally while
retrieval results remain file-level externally.
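The file-level chunking described above can be sketched as a simple sliding window over characters. This is an illustrative sketch only, not the PR's actual implementation: the helper name `chunk_text` and its defaults are assumptions, standing in for whatever `embedding_utils.py` does with `file_chunk_chars` and `file_chunk_overlap`.

```python
def chunk_text(text: str, chunk_chars: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows with overlap.

    Hypothetical sketch of the chunking step; names and defaults are
    illustrative, not taken from the PR.
    """
    if chunk_chars <= 0 or overlap < 0 or overlap >= chunk_chars:
        raise ValueError("require chunk_chars > 0 and 0 <= overlap < chunk_chars")
    step = chunk_chars - overlap  # each window starts `step` chars after the last
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_chars])
        if start + chunk_chars >= len(text):
            break  # last window already reached the end of the text
    return chunks
```

Each chunk would then be embedded individually, giving a long file several vectors instead of one coarse file-level embedding.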

Related Issue

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test update

Changes Made

  • Add file_chunk_chars and file_chunk_overlap config options, plus validation in OpenVikingConfig
  • Add chunked vectorization for long text files in embedding_utils.py
  • Collapse chunk-level retrieval hits back to base file URIs in HierarchicalRetriever
  • Update English and Chinese configuration docs for the new file chunking settings
  • Add regression tests for config validation, long-file chunked vectorization, and retrieval-side chunk collapse
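The config validation in the first bullet could look roughly like the following. This is a hedged sketch: the option names `file_chunk_chars` and `file_chunk_overlap` come from the PR, but the `FileChunkingConfig` class, its defaults, and the exact constraints are assumptions about what `OpenVikingConfig` enforces.

```python
from dataclasses import dataclass


@dataclass
class FileChunkingConfig:
    """Illustrative stand-in for the file-chunking part of OpenVikingConfig."""

    file_chunk_chars: int = 4000   # max characters per file chunk (assumed default)
    file_chunk_overlap: int = 200  # overlap between adjacent chunks (assumed default)

    def validate(self) -> None:
        # Overlap must be strictly smaller than the chunk size, otherwise
        # the sliding window would never advance.
        if self.file_chunk_chars <= 0:
            raise ValueError("file_chunk_chars must be positive")
        if not (0 <= self.file_chunk_overlap < self.file_chunk_chars):
            raise ValueError(
                "file_chunk_overlap must be in [0, file_chunk_chars)"
            )
```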

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have tested this on the following platforms:
    • Linux
    • macOS
    • Windows

Targeted local verification completed:

  • py -3.11 -m ruff check openviking_cli\utils\config\open_viking_config.py openviking\utils\embedding_utils.py openviking\retrieve\hierarchical_retriever.py tests\misc\test_openviking_config_file_chunking.py tests\misc\test_file_chunk_vectorization.py tests\retrieve\test_hierarchical_retriever_chunk_collapse.py
  • py -3.11 -m pytest tests\misc\test_openviking_config_file_chunking.py tests\misc\test_file_chunk_vectorization.py tests\retrieve\test_hierarchical_retriever_chunk_collapse.py -q

Checklist

  • My code follows the project's coding style
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes

This PR is intentionally scoped to file-level chunked vectorization only.
It does not attempt to define chunking behavior for memory or directory indexing.

Chunk-level candidates are currently collapsed back to file-level results using the generated
chunk URI suffix convention, while preserving source_uri in metadata.
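The collapse step can be sketched as stripping the chunk suffix from each candidate URI and keeping the best score per base file. The `#chunk-N` suffix pattern, the `viking://` URIs, and the `(uri, score)` hit shape below are all assumptions for illustration; the PR's actual suffix convention and `HierarchicalRetriever` data structures may differ.

```python
import re

# Hypothetical chunk URI suffix convention, e.g. "viking://docs/a.txt#chunk-3".
CHUNK_SUFFIX = re.compile(r"#chunk-\d+$")


def collapse_chunk_hits(hits: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Collapse chunk-level hits back to base file URIs.

    Keeps the best (highest) score seen for each base file, then returns
    file-level results sorted by score descending.
    """
    best: dict[str, float] = {}
    for uri, score in hits:
        base = CHUNK_SUFFIX.sub("", uri)  # strip "#chunk-N" if present
        if base not in best or score > best[base]:
            best[base] = score
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)
```

This keeps the retrieval API file-level externally: callers never see chunk URIs, only the base file with its best-matching chunk's score.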

@github-actions

Failed to generate code suggestions for PR

@mildred522
Contributor Author

This PR overlaps with #858 in problem space, but the approach is different.

#858 focuses on configurable truncation / text source selection to avoid oversized embedding inputs.
This PR focuses on file-level chunked vectorization, so long text files keep finer-grained vector coverage, while
retrieval still returns file-level results by collapsing chunk hits back to the base URI.

If maintainers prefer the smaller config-only step first, I can rebase this PR later.
