Error when loading MINT-1T-PDF-2023-06

Hi,
First of all, thank you for releasing the MINT-1T dataset :-)

I tried to load one of the MINT-1T datasets (`mlfoundations/MINT-1T-PDF-CC-2023-06`), but after the download finished I encountered the following error.
```
In [1]: ds = load_dataset(
    "mlfoundations/MINT-1T-PDF-CC-2023-06",
    download_config=DownloadConfig(resume_download=True))


...
Downloading data: 100%|████████████████████| 65007/65007 [9:17:51<00:00,  1.94files/s]
Computing checksums: 100%|████████████████████| 65007/65007 [00:59<00:00, 1087.82it/s]
Traceback (most recent call last):
  File "/home/hancheol/projects/mllm-data/mllm_data/a_preprocess/mint_1t_pdf/00_download.py", line 24, in <module>
    main(args)
  File "/home/hancheol/projects/mllm-data/mllm_data/a_preprocess/mint_1t_pdf/00_download.py", line 7, in main
    ds = load_dataset(
         ^^^^^^^^^^^^^
  File "/home/hancheol/anaconda3/lib/python3.11/site-packages/datasets/load.py", line 2609, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/hancheol/anaconda3/lib/python3.11/site-packages/datasets/builder.py", line 1027, in download_and_prepare
    self._download_and_prepare(
  File "/home/hancheol/anaconda3/lib/python3.11/site-packages/datasets/builder.py", line 1789, in _download_and_prepare
    super()._download_and_prepare(
  File "/home/hancheol/anaconda3/lib/python3.11/site-packages/datasets/builder.py", line 1100, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hancheol/anaconda3/lib/python3.11/site-packages/datasets/packaged_modules/webdataset/webdataset.py", line 82, in _split_generators
    inferred_arrow_schema = pa.concat_tables(pa_tables, promote_options="default").schema
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/table.pxi", line 5245, in pyarrow.lib.concat_tables
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: struct fields don't match or are in the wrong order: Input fields: struct<bff_contained_ngram_count_before_dedupe: int64, image_metadata: list<item: struct<height: int64, page: int64, sha256: string, width: int64, xref: int64>>, images: list<item: string>, pdf_name: string, texts: list<item: string>, url: string> output fields: struct<bff_contained_ngram_count_before_dedupe: int64, image_metadata: list<item: struct<height: int64, page: int64, sha256: string, width: int64, xref: int64>>, images: list<item: string>, language_id_whole_page_fasttext: struct<en: double>, pdf_name: string, previous_word_count: int64, texts: list<item: string>, url: string>
```

The error message shows that the output schema has two fields that the input schema does not have: `language_id_whole_page_fasttext` and `previous_word_count`.
```
Input fields:
    struct<
        bff_contained_ngram_count_before_dedupe: int64, 
        image_metadata: list<item: struct<
            height: int64, 
            page: int64, 
            sha256: string, 
            width: int64, 
            xref: int64
        >>, 
        images: list<item: string>, 
        pdf_name: string, 
        texts: list<item: string>, 
        url: string
    >

Output fields:
    struct<
        bff_contained_ngram_count_before_dedupe: int64, 
        image_metadata: list<item: struct<
            height: int64, 
            page: int64, 
            sha256: string, 
            width: int64, 
            xref: int64
        >>, 
        images: list<item: string>, 
        language_id_whole_page_fasttext: struct<en: double>,    # <=New one 
        pdf_name: string, 
        previous_word_count: int64,    # <=New one
        texts: list<item: string>, 
        url: string
    >
```
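To double-check the difference without reading the long type strings by eye, here is a minimal sketch using only the standard library. It splits the two `struct<...>` strings copied from the error message on their top-level commas and compares the field names. `top_level_fields` is a hypothetical helper written for this issue, not part of `datasets` or `pyarrow`, and it assumes the plain `name: type` layout seen in the message above.

```python
# Struct type strings copied verbatim from the ArrowTypeError message.
INPUT_FIELDS = (
    "struct<bff_contained_ngram_count_before_dedupe: int64, "
    "image_metadata: list<item: struct<height: int64, page: int64, "
    "sha256: string, width: int64, xref: int64>>, "
    "images: list<item: string>, pdf_name: string, "
    "texts: list<item: string>, url: string>"
)
OUTPUT_FIELDS = (
    "struct<bff_contained_ngram_count_before_dedupe: int64, "
    "image_metadata: list<item: struct<height: int64, page: int64, "
    "sha256: string, width: int64, xref: int64>>, "
    "images: list<item: string>, "
    "language_id_whole_page_fasttext: struct<en: double>, "
    "pdf_name: string, previous_word_count: int64, "
    "texts: list<item: string>, url: string>"
)


def top_level_fields(struct_type):
    """Map each top-level field name of a 'struct<...>' string to its type string.

    Hypothetical helper: it only tracks '<' / '>' nesting depth so that commas
    inside nested list/struct types are not treated as field separators.
    """
    s = struct_type.strip()
    if not (s.startswith("struct<") and s.endswith(">")):
        raise ValueError("expected a 'struct<...>' type string")
    body = s[len("struct<"):-1]
    fields = {}
    depth = 0
    start = 0
    # The trailing comma acts as a sentinel that flushes the last field.
    for i, ch in enumerate(body + ","):
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        elif ch == "," and depth == 0:
            part = body[start:i].strip()
            if part:
                name, _, type_str = part.partition(":")
                fields[name.strip()] = type_str.strip()
            start = i + 1
    return fields


input_fields = top_level_fields(INPUT_FIELDS)
output_fields = top_level_fields(OUTPUT_FIELDS)

# Fields that appear only in the output schema.
only_in_output = sorted(set(output_fields) - set(input_fields))
# Fields that appear only in the input schema.
only_in_input = sorted(set(input_fields) - set(output_fields))

print("only in output:", only_in_output)
print("only in input:", only_in_input)
```

Running this on the two strings lists `language_id_whole_page_fasttext` and `previous_word_count` as present only in the output schema, with nothing present only in the input schema, which matches the manual comparison above.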

Do you have any idea how to fix it?

Best regards,
Han-Cheol

