Open
Description
Hi there, thank you for this great release!
I'm wondering whether the quality filter could be used to filter out documents under a certain length. For example, I'm looking to assemble a dataset where each sequence is between 64k and 128k in context length.
- Is this easily configurable in the quality filter?
- Would this filter be applied before or after tokenization?
Thank you for your help.