feat: make file vectorization strategy configurable#858

Open
ningfeemic-dev wants to merge 2 commits into volcengine:main from ningfeemic-dev:feat/configurable-file-vectorization
Conversation

@ningfeemic-dev
Summary

Refs #857.

This PR makes the text file vectorization strategy configurable, to reduce embedding oversize failures on long text inputs.

What changed

  • add `embedding.text_source` with supported values:
    • `summary_first`
    • `summary_only`
    • `content_only`
  • add `embedding.max_text_chars` to cap the raw text sent to embeddings
  • update `vectorize_file()` to respect the new config
  • add minimal validation/unit tests for the config and strategy behavior
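
As an illustrative sketch of the config surface described above (the field names `embedding.text_source` and `embedding.max_text_chars` come from the PR, but this dataclass, its default values, and the `validate()` helper are hypothetical, not the PR's actual code):

```python
from dataclasses import dataclass

# Supported strategy values listed in the PR description.
VALID_TEXT_SOURCES = {"summary_first", "summary_only", "content_only"}


@dataclass
class EmbeddingConfig:
    """Hypothetical sketch of the new embedding config section."""

    text_source: str = "summary_first"   # assumed default
    max_text_chars: int = 8000           # assumed default cap on raw text

    def validate(self) -> None:
        # Reject unknown strategies early instead of failing at embed time.
        if self.text_source not in VALID_TEXT_SOURCES:
            raise ValueError(
                f"unsupported embedding.text_source: {self.text_source!r}"
            )
        if self.max_text_chars <= 0:
            raise ValueError("embedding.max_text_chars must be positive")
```

Validation up front keeps misconfiguration a startup error rather than a per-file indexing failure.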

Why

Current upstream behavior still defaults to full-text embedding for text files. In real deployments using OpenAI-compatible embedding backends, this can trigger repeated oversize failures like `input (...) is too large to process`, reducing indexing completeness and operational stability.

Making this configurable gives operators a safer, backward-compatible way to balance:

  • stability
  • indexing completeness
  • retrieval quality

Notes

  • This PR intentionally keeps the scope small.
  • It does not introduce more complex chunking logic yet.
  • It focuses on configurable text source selection plus max raw-text length control.

Validation

  • source files and tests pass `py_compile`
  • full runtime validation in the current environment was limited by missing test dependencies for the upstream repo checkout, so this PR includes focused unit tests for the new config/strategy paths

@CLAassistant
CLAassistant commented Mar 22, 2026

CLA assistant check
All committers have signed the CLA.

@github-actions
Failed to generate code suggestions for PR
