Omitting too much data? #27
Hi,
I wanted to start a discussion about which data is worth keeping and which can be deleted.
It seems to me that the current approach omits data a bit too generously. For example, detailed tool outputs and subagent logs would both be useful to anyone who wanted to use this data for (1) mining successful model rollouts and (2) subsequently tuning on them.
If one of the use cases is to make this data useful for model training, the rollouts should be as complete as possible, right? For example, when tuning a model to leverage tool calls, one could teacher-force on all the tokens leading up to a tool call, mask the tool call's output from the loss computation, and then resume loss computation on the assistant response that follows the tool call. For this, one would want the entire tool call, including its output, in the dataset.
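To make the masking idea concrete, here is a minimal sketch in plain Python. The role names (`user`, `assistant`, `tool_call`, `tool_output`), the helper names `build_loss_mask` and `masked_mean_loss`, and the token ids are all hypothetical and not taken from this project's data format. It also assumes that only tool outputs are masked; a real training setup might mask user prompts as well.

```python
from typing import List, Sequence, Tuple

# Assumption: only tool outputs are excluded from the loss.
# These role names are hypothetical, not this project's schema.
MASKED_ROLES = {"tool_output"}


def build_loss_mask(
    segments: Sequence[Tuple[str, List[int]]],
) -> Tuple[List[int], List[int]]:
    """Flatten (role, tokens) segments into token ids and a 0/1 loss mask.

    Every token stays in the sequence, so the model still conditions on
    tool outputs; a mask value of 0 only removes a token from the loss.
    """
    token_ids: List[int] = []
    loss_mask: List[int] = []
    for role, tokens in segments:
        token_ids.extend(tokens)
        keep = 0 if role in MASKED_ROLES else 1
        loss_mask.extend([keep] * len(tokens))
    return token_ids, loss_mask


def masked_mean_loss(per_token_losses: Sequence[float], loss_mask: Sequence[int]) -> float:
    """Average the per-token losses over unmasked positions only."""
    total = sum(loss * m for loss, m in zip(per_token_losses, loss_mask))
    count = sum(loss_mask)
    return total / count if count else 0.0


# Toy rollout with made-up token ids.
rollout = [
    ("user", [11, 12]),
    ("assistant", [21, 22]),
    ("tool_call", [31, 32, 33]),
    ("tool_output", [41, 42, 43, 44]),
    ("assistant", [51, 52]),
]
token_ids, loss_mask = build_loss_mask(rollout)
```

Building this mask requires the tool output tokens to be present in the stored rollout: if they are omitted, the assistant response after the tool call is conditioned on context that the dataset no longer contains.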
Best,
Chris.