CC30k: A Citation Contexts Dataset for Reproducibility-Oriented Sentiment Analysis

This repository contains the files and code we used to prepare CC30k: A Citation Contexts Dataset for Reproducibility-Oriented Sentiment Analysis.

Folder structure

    .
    ├── dataset           # All the data files required to reproduce the results
    ├── figures           # Location where visualizations are stored
    ├── labeling-gui      # GUI used for crowdsourcing
    ├── notebooks         # .ipynb notebook files
    ├── scripts           # Scripts to reproduce the LLM fine-tuning results
    └── README.md

Dependencies

All the required dependencies are included in the requirements.txt file.

  • torch==1.13.0
  • transformers==4.39.1
  • pandas==2.0.3
  • beautifulsoup4==4.12.2
  • Selenium==4.11.2
  • webdriver-manager==4.0.0
  • bibtexparser==1.4.0
  • pdfminer.six
  • ipykernel==6.26.0
  • openpyxl==3.1.2
  • matplotlib==3.8.0
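
The dependencies can be installed in one step with pip:

    pip install -r requirements.txt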

Dataset

The CC30k dataset consists of labeled citation contexts obtained primarily through crowdsourcing, where each context is annotated by three independent workers. In addition, the dataset includes augmented negative labels generated to extend the coverage of negative citation contexts. This README describes the structure and columns of the dataset.

The dataset files are available in the dataset directory.


Dataset Description

The CC30k dataset is unique in its focus on reproducibility-oriented sentiments (ROS) in scientific literature. It introduces a novel approach to studying computational reproducibility by leveraging citation contexts, the textual fragments in scientific papers that reference prior work. The dataset comprises 30,734 labeled citation contexts from papers published at AI venues, each annotated with one of three ROS labels: Positive, Negative, or Neutral. These labels reflect the cited work's perceived reproducibility. Along with the ROS-labeled contexts, the dataset provides metadata about the workers, the reproducibility study, the related original paper, and the citing paper, as well as the final aggregated label and the label type. The columns in the dataset are described below:

input_index: Unique ID for each citation context.
input_context: Citation context that workers are asked to label.
input_file_key: Identifier linking the context to a reproducibility study.
input_first_author: Name or identifier of the first author of the cited paper.
worker_id_w1: Unique ID of the first worker who labeled this citation context.
work_time_in_seconds_w1: Time (in seconds) the first worker took to label the citation context.
worker_id_w2: Unique ID of the second worker who labeled this citation context.
work_time_in_seconds_w2: Time (in seconds) the second worker took to label the citation context.
worker_id_w3: Unique ID of the third worker who labeled this citation context.
work_time_in_seconds_w3: Time (in seconds) the third worker took to label the citation context.
label_w1: Label assigned by the first worker.
label_w2: Label assigned by the second worker.
label_w3: Label assigned by the third worker.
batch: Batch number of the posted Mechanical Turk job.
majority_vote: Final label based on the majority vote among the workers' labels (reproducibility-oriented sentiment: Positive, Negative, or Neutral).
majority_agreement: Number of the three workers who agreed on the final majority vote.
rs_doi: Digital Object Identifier (DOI) of the reproducibility study paper.
rs_title: Title of the reproducibility study paper.
rs_authors: List of authors of the reproducibility study paper.
rs_year: Publication year of the reproducibility study paper.
rs_venue: Venue (conference or journal) where the reproducibility study was published.
rs_selected_claims: Number of claims selected from the original paper for the reproducibility study (by manual inspection).
rs_reproduced_claims: Number of selected claims that were successfully reproduced (by manual inspection).
reproducibility: Final reproducibility label assigned to the original paper by manual inspection: reproducible, not-reproducible, or partially-reproducible (the latter when 0 < rs_reproduced_claims < rs_selected_claims).
org_doi: DOI of the original (cited) paper that was assessed for reproducibility.
org_title: Title of the original (cited) paper.
org_authors: List of authors of the original (cited) paper.
org_year: Publication year of the original (cited) paper.
org_venue: Venue where the original (cited) paper was published.
org_paper_url: URL for accessing the original (cited) paper.
org_citations: Number of citations received by the original (cited) paper.
org_s2ga_id: Semantic Scholar Graph API ID of the original (cited) paper.
citing_doi: DOI of the citing paper.
citing_year: Publication year of the citing paper.
citing_venue: Venue where the citing paper was published.
citing_title: Title of the citing paper.
citing_authors: List of authors of the citing paper.
citing_s2ga_id: Semantic Scholar Graph API ID of the citing paper.
label_type: Label source: crowdsourced, augmented_human_validated, or augmented_machine_labeled.
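
As a quick illustration, the labeled contexts can be loaded and inspected with pandas. This is a minimal sketch, not part of the released code; the file name dataset/cc30k.csv is a placeholder, so substitute the actual file shipped in the dataset directory.

    import pandas as pd
    from collections import Counter

    # Placeholder path; use the actual file name in the dataset/ directory.
    df = pd.read_csv("dataset/cc30k.csv")

    # Distribution of the final reproducibility-oriented sentiment labels.
    print(df["majority_vote"].value_counts())

    # Distribution of label sources (crowdsourced vs. augmented).
    print(df["label_type"].value_counts())

    # Recompute the majority vote from the three worker labels and compare it
    # with the stored majority_vote column (crowdsourced rows only; contexts
    # where all three workers disagree would need separate handling).
    crowd = df[df["label_type"] == "crowdsourced"]

    def majority(row):
        labels = [row["label_w1"], row["label_w2"], row["label_w3"]]
        return Counter(labels).most_common(1)[0][0]

    agreement = (crowd.apply(majority, axis=1) == crowd["majority_vote"]).mean()
    print(f"Recomputed majority matches stored label for {agreement:.1%} of rows")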

Jupyter Notebook Descriptions

Available inside the notebooks directory.

  • R001_AWS_Labelling_Dataset_Preprocessing_Mturk.ipynb

    • Used to pre-process data for Mechanical Turk (MTurk) labeling.
  • R001_AWS_MTurk_API.ipynb

    • Used to communicate with MTurk workers.
  • R001_AWS_MTurk_process_results.ipynb

    • Used to process crowdsourced results from MTurk.
  • R001_Extend_CC25k_Dataset.ipynb

    • Used to extend the crowdsourced labels with newly augmented ROS-Negative contexts.
  • R_001_Creating_the_RS_superset.ipynb

    • Used to collect the original and reproducibility studies.
  • R_001_Extract_Citing_Paper_Details.ipynb

    • Used to collect citing paper details and contexts using the Semantic Scholar Graph API (S2GA).
  • R_001_MTurk_Sentiment_Analysis_5_models.ipynb

    • Generates the performance measures for the five selected open-source multiclass sentiment analysis models (a minimal example of the classification task is sketched below).
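
To make the task concrete, the sketch below runs an off-the-shelf Hugging Face multiclass sentiment model on a citation context. The model name is only an illustrative placeholder and is not necessarily one of the five models evaluated in the notebook.

    from transformers import pipeline

    # Placeholder model; substitute one of the evaluated multiclass sentiment models.
    classifier = pipeline(
        "text-classification",
        model="cardiffnlp/twitter-roberta-base-sentiment-latest",
    )

    context = (
        "We were unable to reproduce the reported results even after using "
        "the released code and the original hyperparameters."
    )
    print(classifier(context))  # e.g. [{'label': 'negative', 'score': ...}]
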
Citation

If you use CC30k, please cite:

    @misc{obadage2025cc30kcitationcontextsdataset,
          title={CC30k: A Citation Contexts Dataset for Reproducibility-Oriented Sentiment Analysis},
          author={Rochana R. Obadage and Sarah M. Rajtmajer and Jian Wu},
          year={2025},
          eprint={2511.07790},
          archivePrefix={arXiv},
          primaryClass={cs.DL},
          url={https://arxiv.org/abs/2511.07790},
    }
Rochana R. Obadage, 11/14/2025
