Hi,
First of all, thank you for releasing the MINT-1T dataset :-)
I tried to load one of the MINT-1T datasets (MINT-1T-PDF-CC-2023-06) but encountered the following error.
from datasets import DownloadConfig, load_dataset

ds = load_dataset(
    "mlfoundations/MINT-1T-PDF-CC-2023-06",
    download_config=DownloadConfig(resume_download=True),
)
...
Downloading data: 100%|████████████████████| 65007/65007 [9:17:51<00:00, 1.94files/s]
Computing checksums: 100%|████████████████████| 65007/65007 [00:59<00:00, 1087.82it/s]
Traceback (most recent call last):
File "/home/hancheol/projects/mllm-data/mllm_data/a_preprocess/mint_1t_pdf/00_download.py", line 24, in <module>
main(args)
File "/home/hancheol/projects/mllm-data/mllm_data/a_preprocess/mint_1t_pdf/00_download.py", line 7, in main
ds = load_dataset(
^^^^^^^^^^^^^
File "/home/hancheol/anaconda3/lib/python3.11/site-packages/datasets/load.py", line 2609, in load_dataset
builder_instance.download_and_prepare(
File "/home/hancheol/anaconda3/lib/python3.11/site-packages/datasets/builder.py", line 1027, in download_and_prepare
self._download_and_prepare(
File "/home/hancheol/anaconda3/lib/python3.11/site-packages/datasets/builder.py", line 1789, in _download_and_prepare
super()._download_and_prepare(
File "/home/hancheol/anaconda3/lib/python3.11/site-packages/datasets/builder.py", line 1100, in _download_and_prepare
split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hancheol/anaconda3/lib/python3.11/site-packages/datasets/packaged_modules/webdataset/webdataset.py", line 82, in _split_generators
inferred_arrow_schema = pa.concat_tables(pa_tables, promote_options="default").schema
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/table.pxi", line 5245, in pyarrow.lib.concat_tables
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: struct fields don't match or are in the wrong order: Input fields: struct<bff_contained_ngram_count_before_dedupe: int64, image_metadata: list<item: struct<height: int64, page: int64, sha256: string, width: int64, xref: int64>>, images: list<item: string>, pdf_name: string, texts: list<item: string>, url: string> output fields: struct<bff_contained_ngram_count_before_dedupe: int64, image_metadata: list<item: struct<height: int64, page: int64, sha256: string, width: int64, xref: int64>>, images: list<item: string>, language_id_whole_page_fasttext: struct<en: double>, pdf_name: string, previous_word_count: int64, texts: list<item: string>, url: string>
The error message shows that the output schema contains two fields that are missing from the input schema: language_id_whole_page_fasttext and previous_word_count.
Input fields:
struct<
    bff_contained_ngram_count_before_dedupe: int64,
    image_metadata: list<item: struct<
        height: int64,
        page: int64,
        sha256: string,
        width: int64,
        xref: int64
    >>,
    images: list<item: string>,
    pdf_name: string,
    texts: list<item: string>,
    url: string
>
Output fields:
struct<
    bff_contained_ngram_count_before_dedupe: int64,
    image_metadata: list<item: struct<
        height: int64,
        page: int64,
        sha256: string,
        width: int64,
        xref: int64
    >>,
    images: list<item: string>,
    language_id_whole_page_fasttext: struct<en: double>,  # <= new field
    pdf_name: string,
    previous_word_count: int64,  # <= new field
    texts: list<item: string>,
    url: string
>
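To double-check the difference, here is a minimal sketch that compares the top-level field names of the two structs. The two lists are copied by hand from the error message, and `extra_fields` is only a hypothetical helper name for illustration; it is not part of `datasets` or `pyarrow`, and it only looks at names, not at field types.

```python
# Top-level field names copied by hand from the ArrowTypeError message.
input_fields = [
    "bff_contained_ngram_count_before_dedupe",
    "image_metadata",
    "images",
    "pdf_name",
    "texts",
    "url",
]
output_fields = [
    "bff_contained_ngram_count_before_dedupe",
    "image_metadata",
    "images",
    "language_id_whole_page_fasttext",
    "pdf_name",
    "previous_word_count",
    "texts",
    "url",
]


def extra_fields(smaller, larger):
    """Return the names in `larger` that are absent from `smaller`, in order."""
    seen = set(smaller)
    return [name for name in larger if name not in seen]


# Fields present only in the output struct.
missing = extra_fields(input_fields, output_fields)
print(missing)
# Fields present only in the input struct (none are expected here).
print(extra_fields(output_fields, input_fields))
```

If this comparison is right, the only difference is that some samples carry the two extra fields while others do not.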
Do you have any idea how to fix it?
Best regards,
Han-Cheol