
avoid recency bias in prompt construction #104

@AndreasKarasenko

Description


Context
According to this paper, ChatGPT (and likely other LLMs) suffers from a recency bias: whatever class appears last in the prompt has a higher probability of being selected.
Issue
Currently, scikit-llm constructs prompts based on the order of the training data.
Since it is recommended to restrict the training data, I would usually do something like this:

df = df.groupby(label_col).apply(lambda x: x.sample(n_samples))
df = df.reset_index(drop=True)

This returns a dataframe sorted by label_col. Even if sort=False is passed to groupby, the instances are still clustered by label.
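
A small toy example illustrates the clustering (the text/label columns here are made up purely for demonstration):

import pandas as pd

# 12 rows, 3 classes in interleaved order
df_demo = pd.DataFrame({
    "text": [f"example {i}" for i in range(12)],
    "label": ["a", "b", "c"] * 4,
})

# even with sort=False, the sampled rows come out grouped by class
sampled = df_demo.groupby("label", sort=False).apply(lambda x: x.sample(2))
print(sampled["label"].tolist())  # ['a', 'a', 'b', 'b', 'c', 'c']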

Question/Solution
Should a method be implemented that randomizes the order of samples in the prompt / training data, or should users take care of that themselves?
The most straightforward way would be to simply add this after the sampling step:

df = df.sample(frac=1)

This leaves it up to chance whether the classes end up reasonably interleaved.
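
For reference, a minimal sketch of the two steps combined (assuming df, label_col, and n_samples are defined as above; the random_state values are arbitrary and only make the example reproducible):

# sample n_samples per class, then shuffle so classes are not clustered
df = (
    df.groupby(label_col)
      .apply(lambda x: x.sample(n_samples, random_state=0))
      .reset_index(drop=True)
      .sample(frac=1, random_state=0)
      .reset_index(drop=True)
)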
