Description
It seems that every available method of downloading the dataset comes up about 6k samples short. Using Hugging Face's load_dataset(), downloading the tar.xz files manually, git lfs, etc. all run into the same issue as of this date.
The downloaded dataset ends up containing ~94k samples (94164 samples per my most recent attempt at this), which makes attempts to reproduce the work or leverage the excellent dataset/dataloader work done already quite challenging.
Eyeballing it, data_10.tar.xz looks like the most likely culprit: the other archives are each around 7.8 GB in size, while data_10.tar.xz is only 3.25 GB.
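For anyone who wants to check this rather than eyeball it, here is a minimal sketch that reports the size and the number of regular files in each archive. It assumes the archives sit in one local directory and match data_*.tar.xz, and that each sample is stored as one or more regular files inside the archive; the function name count_files_per_archive is mine, not something from the repo.

```python
import tarfile
from pathlib import Path


def count_files_per_archive(archive_dir, pattern="data_*.tar.xz"):
    """Return {archive name: (size in bytes, number of regular files)}.

    Assumption: archives live directly in `archive_dir` and match `pattern`.
    """
    stats = {}
    for path in sorted(Path(archive_dir).glob(pattern)):
        # Iterating the archive streams through its headers without extracting.
        with tarfile.open(path, mode="r:xz") as archive:
            n_files = sum(1 for member in archive if member.isfile())
        stats[path.name] = (path.stat().st_size, n_files)
    return stats


if __name__ == "__main__":
    # Assumes the script is run from the directory holding the archives.
    for name, (size, n_files) in count_files_per_archive(".").items():
        print(f"{name}: {size / 1e9:.2f} GB, {n_files} files")
```

If data_10.tar.xz reports far fewer files than the others, or reading it fails part-way through, that would point to the uploaded archive itself being incomplete rather than to the download method.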
It's certainly possible I'm missing something, but I haven't been able to figure out an effective way around this issue. Any assistance in the matter would be appreciated!