New OutlineTextSplitter class #88

@Stevenic

One of the things that drives me batty about RAG is that it often retrieves chunks that contain partial lists. It'll retrieve a chunk that contains Steps 2 and 3 of a task but completely drop Steps 1 and 4. My Document Sections algorithm is designed to improve that situation by at least keeping the order of steps correct, but it doesn't solve for dropped steps and list items. The core issue is that these missing steps are often in adjacent chunks that aren't semantically relevant to the query.

In an effort to solve this issue, I'm working with my tool (GPT-5.2) to design a new OutlineTextSplitter class. The idea is to break text splitting into a two-phase problem. You first create an outline that identifies the structure of the document you want to split, ignoring any token counts, and then you split the document based on the outline rather than on delimiters. This should give a sequence of steps a better chance of landing in the same chunk. The Document Sections algorithm will then do the rest of the work.
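To make the two-phase idea concrete, here is a rough Python sketch. It is not the OutlineTextSplitter design itself, which is still being worked out. Every name in it (`Block`, `build_outline`, `split_by_outline`, `count_tokens`) is hypothetical, the regex-based outline is a stand-in for whatever the real outline phase ends up doing, and whitespace word counts stand in for a real tokenizer.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern: a numbered ("1." / "1)") or bulleted ("-", "*", "+") item.
LIST_ITEM = re.compile(r"^\s*(?:\d+[.)]|[-*+])\s+")


@dataclass
class Block:
    kind: str  # "list" or "text"
    text: str


def build_outline(text: str) -> list[Block]:
    """Phase 1: group lines into structural blocks, ignoring token counts."""
    blocks: list[Block] = []
    current: list[str] = []
    kind = None

    def flush() -> None:
        nonlocal current, kind
        if current:
            blocks.append(Block(kind, "\n".join(current)))
        current, kind = [], None

    for line in text.splitlines():
        if not line.strip():
            # A blank line ends a paragraph, but a list may continue past it.
            if kind == "text":
                flush()
            continue
        # Indented lines inside a list are treated as item continuations.
        is_list = bool(LIST_ITEM.match(line)) or (kind == "list" and line[:1].isspace())
        line_kind = "list" if is_list else "text"
        if line_kind != kind:
            flush()
            kind = line_kind
        current.append(line)
    flush()
    return blocks


def count_tokens(text: str) -> int:
    """Assumption: whitespace-separated words approximate tokens."""
    return len(text.split())


def split_by_outline(text: str, max_tokens: int) -> list[str]:
    """Phase 2: pack whole outline blocks into chunks under the budget.

    A block is never split, so a list larger than max_tokens becomes one
    oversized chunk; a real splitter would need a fallback for that case.
    """
    chunks: list[str] = []
    current: list[str] = []
    used = 0
    for block in build_outline(text):
        size = count_tokens(block.text)
        if current and used + size > max_tokens:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(block.text)
        used += size
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

The point of the sketch is that chunk boundaries can only fall between outline blocks, so a run of steps is either kept whole or moved whole into the next chunk. A delimiter-based splitter has no such guarantee, because it cuts wherever the token budget runs out.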

Below is a partial discussion with my tool thinking through the problem:


Labels: enhancement
