Error when loading MINT-1T-PDF-2023-06 #13

@hancheolcho

Hi,
First of all, thank you for releasing the MINT-1T dataset :-)

I tried to load one of the MINT-1T datasets (MINT-1T-PDF-CC-2023-06). The download finished, but loading then failed with the following error.

from datasets import DownloadConfig, load_dataset

ds = load_dataset(
    "mlfoundations/MINT-1T-PDF-CC-2023-06",
    download_config=DownloadConfig(resume_download=True),
)


...
Downloading data: 100%|████████████████████| 65007/65007 [9:17:51<00:00,  1.94files/s]
Computing checksums: 100%|████████████████████| 65007/65007 [00:59<00:00, 1087.82it/s]
Traceback (most recent call last):
  File "/home/hancheol/projects/mllm-data/mllm_data/a_preprocess/mint_1t_pdf/00_download.py", line 24, in <module>
    main(args)
  File "/home/hancheol/projects/mllm-data/mllm_data/a_preprocess/mint_1t_pdf/00_download.py", line 7, in main
    ds = load_dataset(
         ^^^^^^^^^^^^^
  File "/home/hancheol/anaconda3/lib/python3.11/site-packages/datasets/load.py", line 2609, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/hancheol/anaconda3/lib/python3.11/site-packages/datasets/builder.py", line 1027, in download_and_prepare
    self._download_and_prepare(
  File "/home/hancheol/anaconda3/lib/python3.11/site-packages/datasets/builder.py", line 1789, in _download_and_prepare
    super()._download_and_prepare(
  File "/home/hancheol/anaconda3/lib/python3.11/site-packages/datasets/builder.py", line 1100, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hancheol/anaconda3/lib/python3.11/site-packages/datasets/packaged_modules/webdataset/webdataset.py", line 82, in _split_generators
    inferred_arrow_schema = pa.concat_tables(pa_tables, promote_options="default").schema
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/table.pxi", line 5245, in pyarrow.lib.concat_tables
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: struct fields don't match or are in the wrong order: Input fields: struct<bff_contained_ngram_count_before_dedupe: int64, image_metadata: list<item: struct<height: int64, page: int64, sha256: string, width: int64, xref: int64>>, images: list<item: string>, pdf_name: string, texts: list<item: string>, url: string> output fields: struct<bff_contained_ngram_count_before_dedupe: int64, image_metadata: list<item: struct<height: int64, page: int64, sha256: string, width: int64, xref: int64>>, images: list<item: string>, language_id_whole_page_fasttext: struct<en: double>, pdf_name: string, previous_word_count: int64, texts: list<item: string>, url: string>

The error message shows that the output schema has two fields that the input schema lacks: language_id_whole_page_fasttext and previous_word_count.

Input fields:
    struct<
        bff_contained_ngram_count_before_dedupe: int64, 
        image_metadata: list<item: struct<
            height: int64, 
            page: int64, 
            sha256: string, 
            width: int64, 
            xref: int64
        >>, 
        images: list<item: string>, 
        pdf_name: string, 
        texts: list<item: string>, 
        url: string
    >

Output fields:
    struct<
        bff_contained_ngram_count_before_dedupe: int64, 
        image_metadata: list<item: struct<
            height: int64, 
            page: int64, 
            sha256: string, 
            width: int64, 
            xref: int64
        >>, 
        images: list<item: string>, 
        language_id_whole_page_fasttext: struct<en: double>,    # <= new field
        pdf_name: string,
        previous_word_count: int64,    # <= new field
        texts: list<item: string>, 
        url: string
    >
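
For longer schemas, the same comparison can be done programmatically. Below is a minimal sketch that parses the two struct<...> strings quoted in the error message and lists the top-level fields that appear only in the output schema. The helper name top_level_fields is hypothetical and is not part of datasets or pyarrow. It only assumes the struct<name: type, ...> text format shown above.

```python
def top_level_fields(struct_repr: str) -> list[str]:
    """Return the top-level field names of a 'struct<name: type, ...>' string."""
    body = struct_repr.strip()
    if not (body.startswith("struct<") and body.endswith(">")):
        raise ValueError("expected a string of the form 'struct<...>'")
    body = body[len("struct<"):-1]

    names, depth, start = [], 0, 0
    # A trailing comma is appended so the last field is flushed like the others.
    for i, ch in enumerate(body + ","):
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        elif ch == "," and depth == 0:
            # Only commas outside nested <...> separate top-level fields.
            field = body[start:i].strip()
            if field:
                names.append(field.split(":", 1)[0].strip())
            start = i + 1
    return names


# The two schema strings, copied from the ArrowTypeError message.
image_metadata = (
    "image_metadata: list<item: struct<height: int64, page: int64, "
    "sha256: string, width: int64, xref: int64>>"
)
input_fields = (
    "struct<bff_contained_ngram_count_before_dedupe: int64, "
    + image_metadata
    + ", images: list<item: string>, pdf_name: string, "
    "texts: list<item: string>, url: string>"
)
output_fields = (
    "struct<bff_contained_ngram_count_before_dedupe: int64, "
    + image_metadata
    + ", images: list<item: string>, "
    "language_id_whole_page_fasttext: struct<en: double>, pdf_name: string, "
    "previous_word_count: int64, texts: list<item: string>, url: string>"
)

in_names = top_level_fields(input_fields)
out_names = top_level_fields(output_fields)
only_in_output = [name for name in out_names if name not in in_names]
print(only_in_output)
# ['language_id_whole_page_fasttext', 'previous_word_count']
```

This only diffs top-level field names. It does not compare field types or nested structs.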

Do you have any idea how to fix it?

Best regards,
Han-Cheol
