Hello team,
First, thank you for making the S2ORC dataset available. I have successfully downloaded and accessed it.
I've noticed that all image content appears to be missing from the dataset. I believe that images are typically extracted and stored separately when PDF papers go through OCR processing. However, I cannot locate any such image files in the current dataset distribution.
Could you please clarify:
1. Is image data supposed to be included in the S2ORC dataset?
2. If yes, where can I find the extracted images, or how should I access them?
3. If images are not included, is there a separate repository or method for obtaining the image data from the source papers?
Thank you for your assistance!