Missing Data in the abstracts dataset.

Hello Team,

I downloaded the "abstracts" dataset from the following link:
https://api.semanticscholar.org/datasets/v1/release/2025-11-11/dataset/abstracts

The dataset description states it contains "100M records." However, the 30 shards I downloaded (about 1.8 GB each when unzipped) contain only approximately 39 million abstracts.

Furthermore, the abstracts of some highly cited papers are missing from the data. For example, the paper with Corpus ID 195908774 (title: "ImageNet classification with deep convolutional neural networks") is not present.
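
To help reproduce the numbers above, here is a minimal sketch of how the record count and the presence of a given Corpus ID could be checked. It rests on assumptions that are not confirmed by the dataset README: that each shard is a JSON Lines file (optionally gzip-compressed), and that each record carries `corpusid` and `abstract` fields. The helper names `iter_records` and `audit_shards` are hypothetical.

```python
import gzip
import json
from pathlib import Path
from typing import Iterable, Iterator


def iter_records(paths: Iterable[Path]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of each shard."""
    for path in paths:
        # Assumption: compressed shards end in ".gz"; anything else is plain text.
        opener = gzip.open if path.suffix == ".gz" else open
        with opener(path, "rt", encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    yield json.loads(line)


def audit_shards(paths, wanted_ids, id_field="corpusid", text_field="abstract"):
    """Return (total records, records with non-empty abstract text, wanted IDs found).

    The field names are assumptions about the shard schema; adjust them if
    the actual records use different keys.
    """
    wanted = set(wanted_ids)
    total = 0
    with_text = 0
    found = set()
    for record in iter_records(paths):
        total += 1
        if record.get(text_field):
            with_text += 1
        cid = record.get(id_field)
        if cid in wanted:
            found.add(cid)
    return total, with_text, found
```

Calling `audit_shards(sorted(Path("abstracts").iterdir()), [195908774])`, where `abstracts` is a placeholder for the local download directory, would stream all shards once without loading them into memory and report whether the example Corpus ID appears.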

Could you please look into this discrepancy? Thank you for your assistance.


>     {
>         "name": "abstracts",
>         "description": "Paper abstract text, where available.\n100M records in 30 1.8GB files.",
>         "README": "Semantic Scholar Academic Graph Datasets\n\nThe \"abstracts\" dataset provides abstract text for selected papers.\n\nSCHEMA\n - openAccessInfo\n   - externalIds: IDs of this paper in different catalogs\n   - license/url/status: open-access information provided by Unpaywall, linked by DOI or PubMed Central ID\n\nLICENSE\nThis collection is licensed under ODC-BY. (https://opendatacommons.org/licenses/by/1.0/)\n\nBy downloading this data you acknowledge that you have read and agreed to all the terms in this license.\n\nATTRIBUTION\nWhen using this data in a product or service, or including data in a redistribution, please cite the following paper:\n\nBibTex format:\n@misc{https://doi.org/10.48550/arxiv.2301.10140,\n  title = {The Semantic Scholar Open Data Platform},\n  author = {Kinney, Rodney and Anastasiades, Chloe and Authur, Russell and Beltagy, Iz and Bragg, Jonathan and Buraczynski, Alexandra and Cachola, Isabel and Candra, Stefan and Chandrasekhar, Yoganand and Cohan, Arman and Crawford, Miles and Downey, Doug and Dunkelberger, Jason and Etzioni, Oren and Evans, Rob and Feldman, Sergey and Gorney, Joseph and Graham, David and Hu, Fangzhou and Huff, Regan and King, Daniel and Kohlmeier, Sebastian and Kuehl, Bailey and Langan, Michael and Lin, Daniel and Liu, Haokun and Lo, Kyle and Lochner, Jaron and MacMillan, Kelsey and Murray, Tyler and Newell, Chris and Rao, Smita and Rohatgi, Shaurya and Sayre, Paul and Shen, Zejiang and Singh, Amanpreet and Soldaini, Luca and Subramanian, Shivashankar and Tanaka, Amber and Wade, Alex D. and Wagner, Linda and Wang, Lucy Lu and Wilhelm, Chris and Wu, Caroline and Yang, Jiangjiang and Zamarron, Angele and Van Zuylen, Madeleine and Weld, Daniel S.},\n  publisher = {arXiv},\n  year = {2023},\n  doi = {10.48550/ARXIV.2301.10140},\n  url = {https://arxiv.org/abs/2301.10140},\n}"
>     },


Missing Data in the abstracts dataset. #58
