Omitting too much data? #27

@wendlerc

Hi,

I'd like to start a discussion about which data is worth keeping and which can be deleted.

It seems to me that the current approach omits data a little too generously. For example, detailed tool outputs and subagent logs seem useful if someone wants to use this data for (1) mining successful model rollouts and (2) subsequently tuning on them.

If one of the use cases is to make this data useful for model training, the rollouts should be as complete as possible, right? For example, when tuning a model to leverage tool calls, one could teacher-force on all the tokens leading up to a tool call, mask the tool call's output from the loss computation, and then resume loss computation on the assistant response that follows. For this, one would want the entire tool call, including its output, in the dataset.
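To make the masking idea concrete, here is a minimal sketch of how the labels could be built. It assumes a rollout is already tokenized and split into segments. The `Segment` class, the role names, and `build_labels` are hypothetical and not part of this repository. The value -100 is used because it is the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Label value that the loss should skip. -100 is the default ignore_index
# of PyTorch's CrossEntropyLoss; any other sentinel would work as well.
IGNORE_INDEX = -100


@dataclass
class Segment:
    """One contiguous piece of a rollout (hypothetical structure)."""
    role: str             # e.g. "assistant", "tool_call", "tool_output"
    token_ids: List[int]  # tokens of this piece, already tokenized


def build_labels(
    segments: Sequence[Segment],
    masked_roles: Tuple[str, ...] = ("tool_output",),
) -> Tuple[List[int], List[int]]:
    """Return (input_ids, labels) for teacher forcing.

    Tokens of masked roles stay in input_ids, so the model still
    conditions on them, but their labels are set to IGNORE_INDEX so
    they do not contribute to the loss. The next-token shift is left
    to the training loop.
    """
    input_ids: List[int] = []
    labels: List[int] = []
    for seg in segments:
        input_ids.extend(seg.token_ids)
        if seg.role in masked_roles:
            labels.extend([IGNORE_INDEX] * len(seg.token_ids))
        else:
            labels.extend(seg.token_ids)
    return input_ids, labels


# Toy rollout with made-up token ids.
rollout = [
    Segment("assistant", [1, 2]),
    Segment("tool_call", [3]),
    Segment("tool_output", [4, 5]),
    Segment("assistant", [6]),
]
input_ids, labels = build_labels(rollout)
print(input_ids)  # [1, 2, 3, 4, 5, 6]
print(labels)     # [1, 2, 3, -100, -100, 6]
```

Note that the tool output has to be present in the data for this to work: it is excluded from the loss, but the assistant tokens after it are conditioned on it.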

Best,
Chris.
