Description
One of the things that drives me batty about RAG is that it often retrieves chunks containing partial lists. It'll retrieve a chunk that contains Steps 2 and 3 of a task but completely drop Steps 1 and 4. My Document Sections feature is designed to improve that situation by at least keeping the order of steps correct, but it doesn't solve for dropped steps and list items. The core issue is that these missing steps are often in adjacent chunks that aren't semantically relevant to the query.
In an effort to solve this issue, I'm working with my tool (GPT-5.2) to design a new OutlineTextSplitter class. The idea is to break text splitting into a two-phase problem. First, you create an outline that identifies the structure of the document you want to split, ignoring token counts entirely. Then you split the document based on that outline rather than on delimiters. This should give a sequence of steps a better chance of landing in the same chunk. The Document Sections algorithm will then do the rest of the work.
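To make the two-phase idea concrete, here is a rough sketch in Python. It is not the actual OutlineTextSplitter design, just an illustration under some loud assumptions: the names `Block`, `build_outline`, and `split_by_outline` are hypothetical, list items are detected with a simple regex, and a word count stands in for a real token count.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Assumption: a list item starts with "1.", "1)", "-", "*", or "+".
LIST_ITEM = re.compile(r"^\s*(?:\d+[.)]|[-*+])\s+")


@dataclass
class Block:
    """One structural unit of the outline (hypothetical type)."""
    kind: str          # "list" or "text"
    lines: list[str]

    @property
    def text(self) -> str:
        return "\n".join(self.lines)


def build_outline(text: str) -> list[Block]:
    """Phase 1: identify structure only; sizes are ignored here."""
    blocks: list[Block] = []
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines do not break a run of list items
        kind = "list" if LIST_ITEM.match(line) else "text"
        if blocks and blocks[-1].kind == kind == "list":
            # Consecutive list items belong to the same block.
            blocks[-1].lines.append(line)
        else:
            blocks.append(Block(kind, [line]))
    return blocks


def split_by_outline(blocks: list[Block], max_words: int = 50) -> list[str]:
    """Phase 2: pack whole blocks into chunks; a block is never cut.

    A list block larger than max_words still becomes a single chunk,
    so the budget is a soft limit in this sketch.
    """
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for block in blocks:
        n = len(block.text.split())  # word count as a token stand-in
        if current and size + n > max_words:
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(block.text)
        size += n
    if current:
        chunks.append("\n".join(current))
    return chunks


if __name__ == "__main__":
    doc = (
        "Intro paragraph about setup.\n"
        "1. Step one\n2. Step two\n3. Step three\n4. Step four\n"
        "Closing remarks here."
    )
    for chunk in split_by_outline(build_outline(doc), max_words=8):
        print(repr(chunk))
```

The point of the sketch is the ordering of concerns: the size budget only decides where to place boundaries between outline blocks, never inside one, so all four steps stay together even when the list exceeds the budget.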
Below is a partial discussion with my tool thinking through the problem: