Conversation
pefontana left a comment
Nice one @dsocolobsky!
Just one thing: in the PreprocessedDataProvider we check the sequence length,
shared/data-provider/src/preprocessed.rs:99
let input_ids = list_to_vec(
&row,
inputs_column,
Some(num_tokens_per_sequence),
)?;
Maybe we can do the same here, but only if it doesn't involve a lot of code.
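For reference, a minimal self-contained sketch of the kind of length check being suggested; the function name and error type here are illustrative, not the crate's actual API:

```rust
// Hypothetical helper: validate that a fetched token list matches the
// expected sequence length before handing it to training.
fn check_sequence_len(tokens: Vec<u32>, expected: usize) -> Result<Vec<u32>, String> {
    if tokens.len() != expected {
        return Err(format!(
            "expected {} tokens per sequence, got {}",
            expected,
            tokens.len()
        ));
    }
    Ok(tokens)
}
```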
// For example if row_indices = [5, 6, 7, 10, 11] then the loop does:
// - [5, 6, 7] in first iteration since they are consecutive
// - [10, 11] in second iteration
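A self-contained sketch of that grouping logic (illustrative, not the PR's actual code):

```rust
// Split sorted row indices into runs of consecutive values, so each run can
// be fetched with a single ranged request instead of one request per index.
fn consecutive_runs(indices: &[u64]) -> Vec<(u64, u64)> {
    let mut runs = Vec::new();
    let mut iter = indices.iter().copied();
    let Some(mut start) = iter.next() else {
        return runs;
    };
    let mut prev = start;
    for idx in iter {
        if idx != prev + 1 {
            runs.push((start, prev)); // close the current run
            start = idx;
        }
        prev = idx;
    }
    runs.push((start, prev));
    runs
}

// consecutive_runs(&[5, 6, 7, 10, 11]) == vec![(5, 7), (10, 11)]
```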
I wonder how this works with the shuffled indices. There’s a small chance we could end up with consecutive numbers, which would mean making one request per index, so we’d likely get rate-limited sooner rather than later. Maybe we could request the data using ordered indices and then shuffle the results afterward.
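For concreteness, the reorder-after-fetch idea could look roughly like this; `fetch_sorted` is a stand-in for the provider's actual request logic, not an existing function:

```rust
use std::collections::HashMap;

// Sort the requested indices for fetching, then restore the caller's
// (shuffled) order afterwards. Purely illustrative.
fn fetch_in_shuffled_order<F>(shuffled: &[u64], fetch_sorted: F) -> Vec<String>
where
    F: Fn(&[u64]) -> Vec<String>,
{
    let mut sorted = shuffled.to_vec();
    sorted.sort_unstable();
    sorted.dedup();

    // fetch_sorted is assumed to return one row per index, in the same order.
    let rows = fetch_sorted(&sorted);
    let by_index: HashMap<u64, String> = sorted.iter().copied().zip(rows).collect();

    shuffled.iter().map(|i| by_index[i].clone()).collect()
}
```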
I've taken a look at this, but at first glance it seems we can't avoid fetching one by one when the indices are shuffled. If we have, e.g., 4 shuffled indices in the range [0, 40000], it's very unlikely that any of them will be consecutive.
if self.num_rows == 0 {
    return vec![];
}
I think we're already checking this in the get_samples function before calling this; can it be removed?
I think it's worth leaving the check in, in case we reuse this somewhere at some point. If we ever call it without first checking whether num_rows == 0, it will access out of bounds and panic. I prefer not to have this hidden precondition.
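A tiny illustration of the failure mode the guard avoids (hypothetical sampling code, not the PR's):

```rust
// Without the early return, `i % num_rows` would panic when num_rows == 0
// (remainder by zero), mirroring the out-of-bounds concern above.
fn sample_indices(num_rows: usize, count: usize) -> Vec<usize> {
    if num_rows == 0 {
        return vec![]; // defensive guard kept for callers that skip the outer check
    }
    (0..count).map(|i| i % num_rows).collect()
}
```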
I think in the …
it would be useful for me to learn about this if you could add information about this new data provider under the Provider Configuration subheading in …
Adds a new HuggingFacePreprocessedDataProvider data provider which fetches pre-tokenized datasets from HuggingFace via their streaming interface (so, on-demand).

I added a new test config hf-preprocessed-config.toml using emozilla/Hermes-3-Preprocessed-Llama3; training seemed to work alright, but I'm not that familiar with these datasets/API so it requires further testing.
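For anyone unfamiliar with the on-demand fetching involved, here is a rough, hypothetical sketch; it assumes the Hugging Face datasets-server /rows endpoint and is not the PR's actual implementation:

```rust
use serde_json::Value;

// Fetch a window of rows from a dataset on demand. The endpoint, query
// parameters, and blocking client are assumptions for illustration; the real
// provider may differ (async client, auth headers, different config/split).
// Assumed dependencies: reqwest (with "blocking" and "json" features), serde_json.
fn fetch_rows(dataset: &str, offset: u64, length: u64) -> Result<Value, reqwest::Error> {
    let url = format!(
        "https://datasets-server.huggingface.co/rows?dataset={dataset}&config=default&split=train&offset={offset}&length={length}"
    );
    let body: Value = reqwest::blocking::get(&url)?.json()?;
    Ok(body)
}
```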