Omitting too much data?

Hi,

I just wanted to start a discussion here about which data is worth keeping and which is worth deleting.

It seems to me that the current approach omits data a little too generously. For example, detailed tool outputs and subagent logs seem useful if someone wanted to use this data for (1) mining successful model rollouts and (2) subsequent tuning on those.

If one of the use cases is to make this data useful for model training, the rollouts should be as complete as possible, right? For example, when tuning a model to leverage tool calls, one could imagine teacher-forcing on all the tokens that lead up to a tool call, then masking the output of the tool call from the loss computation, and continuing the loss computation on the assistant response that follows the tool call. For this, one would want the entire tool call, including its output, in the dataset.
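
To make the masking idea concrete, here is a rough sketch of how the labels could be built. It is only an illustration under my own assumptions: the `Segment` type, the role names, and `build_labels` are hypothetical and not part of this repository, and the token IDs are made up. The only external convention used is `-100`, which is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# PyTorch's cross-entropy loss skips positions labeled -100 by default.
IGNORE_INDEX = -100


@dataclass
class Segment:
    """One contiguous piece of a rollout (hypothetical structure)."""
    role: str             # e.g. "user", "assistant", or "tool" (assumed names)
    token_ids: List[int]  # already-tokenized content of this piece


def build_labels(
    segments: Sequence[Segment],
    masked_roles: Tuple[str, ...] = ("tool",),
) -> Tuple[List[int], List[int]]:
    """Return (input_ids, labels) for teacher-forced training.

    Every token stays in input_ids, so the model still conditions on the
    tool output. Tokens from masked roles get IGNORE_INDEX as their label,
    so they contribute nothing to the loss.
    """
    input_ids: List[int] = []
    labels: List[int] = []
    for seg in segments:
        input_ids.extend(seg.token_ids)
        if seg.role in masked_roles:
            labels.extend([IGNORE_INDEX] * len(seg.token_ids))
        else:
            labels.extend(seg.token_ids)
    return input_ids, labels


# Made-up rollout: prompt, assistant tool call, tool output, final answer.
rollout = [
    Segment("user", [1, 2]),
    Segment("assistant", [3, 4, 5]),   # tokens leading up to and including the tool call
    Segment("tool", [6, 7, 8, 9]),     # tool output: kept as context, masked from the loss
    Segment("assistant", [10, 11]),    # response following the tool call
]
input_ids, labels = build_labels(rollout)
print(labels)
```

The point is that the tool output has to be present in `input_ids` even though it is masked in `labels`; if the dataset drops it, the tokens after the tool call are trained against a context the model never actually saw. Passing `masked_roles=("user", "tool")` would additionally exclude the prompt from the loss, if that is preferred.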

Best,
Chris.


Omitting too much data? #27
