feat: make file vectorization strategy configurable#858

Open
ningfeemic-dev wants to merge 2 commits into volcengine:main from ningfeemic-dev:feat/configurable-file-vectorization
Conversation

@ningfeemic-dev
Summary

Refs #857.

This PR makes the text file vectorization strategy configurable, to reduce embedding oversize failures on long text inputs.

What changed

  • add `embedding.text_source` with supported values:
    • `summary_first`
    • `summary_only`
    • `content_only`
  • add `embedding.max_text_chars` to cap the raw text sent to embeddings
  • update `vectorize_file()` to respect the new config
  • add minimal validation/unit tests for the config and strategy behavior
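
As an illustrative sketch of the config surface described above (the field names `embedding.text_source` and `embedding.max_text_chars` come from the PR, but this dataclass, its default values, and the `validate()` helper are hypothetical, not the PR's actual code):

```python
from dataclasses import dataclass

# Supported strategy values listed in the PR description.
VALID_TEXT_SOURCES = {"summary_first", "summary_only", "content_only"}


@dataclass
class EmbeddingConfig:
    """Hypothetical sketch of the new embedding config section."""

    text_source: str = "summary_first"   # assumed default
    max_text_chars: int = 8000           # assumed default cap on raw text

    def validate(self) -> None:
        # Reject unknown strategies early instead of failing at embed time.
        if self.text_source not in VALID_TEXT_SOURCES:
            raise ValueError(
                f"unsupported embedding.text_source: {self.text_source!r}"
            )
        if self.max_text_chars <= 0:
            raise ValueError("embedding.max_text_chars must be positive")
```

Validation up front keeps misconfiguration a startup error rather than a per-file indexing failure.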

Why

Current upstream behavior still defaults to full-text embedding for text files. In real deployments using OpenAI-compatible embedding backends, this can trigger repeated oversize failures like `input (...) is too large to process`, reducing indexing completeness and operational stability.

Making this configurable gives operators a safer, backward-compatible way to balance:

  • stability
  • indexing completeness
  • retrieval quality

Notes

  • This PR intentionally keeps the scope small.
  • It does not introduce more complex chunking logic yet.
  • It focuses on configurable text source selection plus max raw-text length control.

Validation

  • source files and tests pass `py_compile`
  • full runtime validation in the current environment was limited by missing test dependencies for the upstream repo checkout, so this PR includes focused unit tests for the new config/strategy paths

@CLAassistant
CLAassistant commented Mar 22, 2026

CLA assistant check
All committers have signed the CLA.

@github-actions
Failed to generate code suggestions for PR
