
Fix embedding edge cases: dynamic batch sizing and token limit fallback #7

Merged
drewburchfield merged 2 commits into master from fix/chunking-token-limits
Apr 3, 2026
Conversation

@drewburchfield (Owner) commented Apr 3, 2026

Closes NAS-989
Closes NAS-993

Summary

  • Replace hardcoded batch_size=60 with dynamic calculation based on chunk density (3 chars/token, 28k token budget)
  • Add fallback: if whole-note embed fails on token limit, auto-retry with chunking
  • Don't retry deterministic token limit errors in _call_api_with_retry
  • Extract shared _is_token_limit_error() helper for consistent detection across call sites
  • Update stale docs (TOOLS.md limitations, vector_store.py comment)
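The dynamic batch sizing described above can be sketched roughly as follows. This is a minimal illustration built only from the numbers in the PR description (3 chars/token estimate, 28k token budget, old hardcoded batch of 60); the function name and exact formula are assumptions, not the actual src/embedder.py code:

```python
# Hypothetical sketch only -- not the actual src/embedder.py implementation.
# Constants come from the PR description: ~3 chars/token estimate and a
# 28k token budget (safely under Voyage's 32k hard limit).
CHARS_PER_TOKEN = 3
TOKEN_BUDGET = 28_000

def compute_batch_size(chunks: list[str], max_batch: int = 60) -> int:
    """Pick a batch size so the batch's estimated token count fits the budget."""
    if not chunks:
        return 1
    avg_chars = sum(len(c) for c in chunks) / len(chunks)
    est_tokens_per_chunk = max(1.0, avg_chars / CHARS_PER_TOKEN)
    fits_in_budget = int(TOKEN_BUDGET // est_tokens_per_chunk)
    # Never exceed the old ceiling, never drop below one chunk per batch.
    return max(1, min(max_batch, fits_in_budget))
```

Dense chunks (many characters each) shrink the batch; short chunks keep it at the old ceiling.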

Root cause

Two edge cases in embed_with_chunks():

  1. Sending 60 chunks to contextualized_embed pushes against Voyage's 32k token limit when content is dense
  2. Borderline notes estimated under 30k tokens but actually over 32k had no fallback to chunking
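The second edge case suggests a catch-and-retry shape roughly like the sketch below. Only `_is_token_limit_error` and `embed_with_chunks` are names from this PR; `embed_note`, `embed_whole`, and the stand-in bodies (including the size thresholds) are hypothetical stubs for illustration:

```python
# Hedged sketch of the fallback path -- not the actual src/embedder.py code.
class EmbeddingError(Exception):
    pass

def _is_token_limit_error(err: Exception) -> bool:
    # Shared helper: detect deterministic token-limit failures by message.
    msg = str(err).lower()
    return "token" in msg and "limit" in msg

def embed_whole(note: str) -> list[list[float]]:
    # Stand-in for a single whole-note embedding request.
    if len(note) > 96_000:  # rough stand-in for ~32k tokens at 3 chars/token
        raise EmbeddingError("input exceeds model token limit")
    return [[0.0]]

def embed_with_chunks(note: str) -> list[list[float]]:
    # Stand-in for the chunked embedding path.
    return [[0.0]] * (len(note) // 30_000 + 1)

def embed_note(note: str) -> list[list[float]]:
    try:
        return embed_whole(note)            # fast path: one request
    except EmbeddingError as e:
        if _is_token_limit_error(e):
            return embed_with_chunks(note)  # fallback: chunk and retry
        raise                               # other errors propagate as before
```

The key point is that only deterministic token-limit failures trigger the chunking fallback; anything else still surfaces to the caller.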

Test plan

  • All security tests pass locally
  • Docker image builds and imports work
  • Server initializes successfully
  • Quality gate review: 1 important finding (inconsistent token detection) remediated with shared helper
  • Braintrust consultation (Claude + Gemini): aligned on dynamic batch sizing and catch-and-retry approach


- Replace hardcoded batch_size=60 with dynamic calculation based on
  chunk density (3 chars/token estimate, 28k token budget)
- Add fallback: if whole-note embed fails on token limit, retry with chunking
- Don't retry deterministic token limit errors in _call_api_with_retry
- Extract shared _is_token_limit_error() helper for consistent detection
- Update stale docs: TOOLS.md limitations, vector_store.py comment

Closes NAS-989, NAS-993

@devin-ai-integration bot left a comment

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 3 additional findings.


If a chunk batch exceeds the token limit, halve the batch size and
retry the same position. Keeps halving until it fits or batch_size
reaches 1. This handles dense content where the 3 chars/token
estimate is still too generous.
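The halving loop that commit describes might look roughly like this. It is a sketch with hypothetical names, not the actual src/embedder.py code; the guard here checks the actual batch length so a short tail batch is not retried unchanged:

```python
# Sketch only -- the real loop lives in src/embedder.py.
class EmbeddingError(Exception):
    pass

def _is_token_limit_error(err: Exception) -> bool:
    return "token limit" in str(err).lower()

def embed_all(chunks, batch_size, embed_batch):
    """Embed chunks in batches; on a token-limit error, halve the batch
    size and retry the same position until it fits or reaches one chunk."""
    results, i = [], 0
    while i < len(chunks):
        chunk_batch = chunks[i : i + batch_size]
        try:
            results.extend(embed_batch(chunk_batch))
            i += len(chunk_batch)                     # advance only on success
        except EmbeddingError as e:
            if _is_token_limit_error(e) and len(chunk_batch) > 1:
                batch_size = max(1, batch_size // 2)  # retry the same position
            else:
                raise  # single chunk still too big, or a different error
```

Progress is only made on success, so a failed batch is retried from the same index with a smaller size rather than skipped.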
@drewburchfield drewburchfield merged commit c0202e5 into master Apr 3, 2026
4 checks passed
@drewburchfield drewburchfield deleted the fix/chunking-token-limits branch April 3, 2026 18:06

@devin-ai-integration bot left a comment

Devin Review found 1 new potential issue.

View 5 additional findings in Devin Review.

        logger.debug(f"Embedded chunks {i + 1}-{i + len(chunk_batch)} of {len(chunks)}")
    except EmbeddingError as e:
        if _is_token_limit_error(e) and batch_size > 1:

🟡 Batch retry checks batch_size > 1 instead of len(chunk_batch) > 1, causing wasteful identical API calls

When the remaining chunks at position i are fewer than batch_size, reducing batch_size does not change the actual chunk_batch sent to the API. The loop at src/embedder.py:229 checks batch_size > 1 to decide whether to retry with a smaller batch, but the actual batch sent is chunks[i : i + batch_size] (src/embedder.py:207), which is bounded by the remaining chunks. This means the code can make multiple identical failing API calls (each with a rate-limit sleep at src/embedder.py:210) before batch_size shrinks below the actual remaining chunk count.

Example scenario: 5 remaining chunks with batch_size=42

  • batch_size=42 → chunk_batch = 5 chunks → API fails (token limit)
  • batch_size=21 → chunk_batch = 5 chunks → same request, fails again
  • batch_size=10 → chunk_batch = 5 chunks → same request, fails again
  • batch_size=5 → chunk_batch = 5 chunks → same request, fails again
  • batch_size=2 → chunk_batch = 2 chunks → different request, may succeed

That's 4 wasted API calls with rate-limit delays.

Suggested change

    - if _is_token_limit_error(e) and batch_size > 1:
    + if _is_token_limit_error(e) and len(chunk_batch) > 1:
