This repository contains the Fact-Stance dataset, a collection of structured scientific claims annotated with factual triples and epistemic stance information.
Scientific literature is a rich source of domain knowledge—but how authors present their claims matters just as much as the claims themselves. Are they stating facts with confidence, hedging their assertions, or simply reporting others’ views?
Such subjective stances, encoded through subtle linguistic markers, are pervasive in scholarly writing yet remain largely overlooked in computational approaches that reduce claims to objective facts.
In this work, we present a stance-aware framework that models scientific claims as dual-layered structures comprising factual elements (entities, events, and relations) and stance markers (expressing epistemic, evaluative, and interpersonal positioning). To support large-scale analysis, we develop an LLM-assisted annotation pipeline that combines model-generated rationales with expert adjudication. This process yields Fact-Stance-9K, a dataset of 9,261 claims from 537 biomedical papers, annotated with fine-grained claim structure and achieving 72% inter-annotator agreement. Through a claim reconstruction task, we demonstrate that our representation preserves essential semantic and rhetorical content (BERTScore-F1=0.90, ROUGE-L=0.74), confirming the framework’s expressive power and utility for future stance-aware scientific data mining.
- Factual Elements refer to the core propositional content of the claim—verifiable information about entities, events, actions, or relationships. These elements represent what is being asserted as true within the scientific narrative.
- Stance Markers refer to the linguistic expressions that indicate how authors position themselves toward the claim. These markers capture epistemic commitment, evaluative judgment, and interpersonal engagement, shaping how the claim is framed within the scientific discourse.
This dataset includes:
- ✅ Structured factual triples (subject-predicate-object)
- ✅ Epistemic stance markers
- ✅ Original sentences extracted from scientific full-text papers
- ✅ Prompts used for LLM annotation
- ✅ Lexicon of stance-related expressions
```
.
├── data/
│   ├── fact_stance.json           # JSON version of the dataset
│   └── filter_markers.json        # JSON version of all the markers
├── prompts/
│   └── fact_stance_prompts.md     # Prompt templates used for LLM annotation
├── lexicon/
│   └── stance_lexicon.csv         # List of stance-related expressions
├── README.md                      # You are here!
└── LICENSE                        # License file
```
This dataset contains two views designed to support the analysis of factual language and stance markers extracted from textual data: a sentence-level view (data/fact_stance.json) and a marker-level view (data/filter_markers.json). Each view serves a specific purpose and provides structured access to relevant linguistic annotations and sentence-level metadata.
Schema (data/fact_stance.json):
| Column Name | Type | Description |
|---|---|---|
| sen_id | INTEGER | Unique identifier for the sentence. |
| article_id | INTEGER | Identifier for the article containing the sentence. |
| section_id | INTEGER | Identifier for the section within the article. |
| sen_text | TEXT | The original sentence text. |
| qwen_reason | TEXT | Reasoning generated by the Qwen model regarding the factual or stance content. |
| deepseek_reason | TEXT | Reasoning generated by the DeepSeek model regarding the factual or stance content. |
| moves | TEXT | Discourse moves identified in the sentence (e.g., claim, premise). |
| factual_elements | TEXT | Extracted factual elements present in the sentence. |
| markers_json | TEXT | JSON-formatted list of associated linguistic markers detected in the sentence. |
| reverse_text | TEXT | The sentence reconstructed from the extracted factual elements and stance markers, used in the claim reconstruction evaluation. |
Note: In total, 537 articles and 9,196 sentences identified as factual scientific claims are included in fact_stance.json.
Example:
```json
{
  "sen_id": 1,
  "article_id": "PMC11085310",
  "section_id": "1",
  "sen_text": "Gestational diabetes mellitus (GDM) refers to a carbohydrate intolerance that manifests as hyperglycemia of varying severity and initiates or is first observed during pregnancy [1].",
  "qwen_reason": "The sentence defines GDM by describing its relationship to carbohydrate intolerance, hyperglycemia, and pregnancy, presenting actionable scientific knowledge.",
  "deepseek_reason": "The sentence asserts a clear scientific claim about the nature of gestational diabetes mellitus (GDM), defining it as a carbohydrate intolerance that manifests as hyperglycemia and is first observed during pregnancy. It presents actionable scientific knowledge and advances understanding of GDM.",
  "moves": "Background",
  "factual_elements": "[{\"object\": \"carbohydrate intolerance\", \"subject\": \"Gestational diabetes mellitus (GDM)\", \"relation\": \"refers to\"}, {\"object\": \"hyperglycemia\", \"subject\": \"carbohydrate intolerance\", \"relation\": \"manifests as\"}, {\"value\": \"varying\", \"subject\": \"hyperglycemia\", \"attribute\": \"severity\"}, {\"object\": \"pregnancy\", \"subject\": \"GDM\", \"relation\": \"initiates or is first observed during\"}]",
  "markers_json": "[{\"marker\": \"is first observed during pregnancy\", \"primary_category\": \"Epistemic Stance\", \"secondary_category\": \"Evidentiality\"}, {\"marker\": \"[1]\", \"primary_category\": \"Epistemic Stance\", \"secondary_category\": \"Evidentiality\"}]",
  "reverse_text": "Gestational diabetes mellitus (GDM) refers to carbohydrate intolerance, which manifests as hyperglycemia of varying severity, and GDM is first observed during pregnancy [1]."
}
```
Schema (data/filter_markers.json):
| Column Name | Type | Description |
|---|---|---|
| marker_id | INTEGER | Unique identifier for each marker. |
| article_id | INTEGER | Identifier for the article in which the marker appears. |
| sen_id | INTEGER | Identifier for the sentence containing the marker. |
| marker | TEXT | The actual linguistic marker text. |
| primary_category | TEXT | High-level classification of the marker type (e.g., "Epistemic Stance", "Evaluative Stance"). |
| secondary_category | TEXT | Sub-category providing more granular classification (e.g., "Evidentiality", "Appreciative Evaluation"). |
| explanation | TEXT | Optional explanation or justification for the categorization. |
| token_position | INTEGER | Position of the marker within the sentence (token-level index). |
| relative_position | TEXT | Relative position of the marker within the sentence, expressed as a fraction of sentence length (e.g., 0.55). |
Note: In total, 543 articles and 11,517 sentences are included in filter_markers.json.
Example:
```json
{
  "marker_id": 5546,
  "article_id": "PMC11085310",
  "sen_id": 54,
  "marker": "should not be underestimated",
  "primary_category": "Evaluative Stance",
  "secondary_category": "Appreciative Evaluation",
  "explanation": "emphasizes the importance of maternal nutrition",
  "token_position": 6,
  "relative_position": 0.5455
}
```
All prompts used to extract factual triples and identify epistemic stances are included in prompts/fact_stance_prompts.md. These include:
- System Prompt: Role and Scope Definition
- Claim Identification
- Factual Elements Extraction
- Stance Markers Extraction
- Claim Reconstruction
A curated list of epistemic stance markers commonly found in scientific discourse is provided in lexicon/stance_lexicon.csv; a small usage sketch follows the list below. It can be used for:
- Training stance classifiers
- Fact verification / Scientific uncertainty recognition
- Writing assistance
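For instance, the lexicon can support a simple marker-matching pass over new text. The sketch below is illustrative only: the column names of stance_lexicon.csv are not documented in this README, so the marker and category columns are assumptions that should be adapted to the file's actual header.

```python
import csv

# Hypothetical column names -- check the actual header of stance_lexicon.csv
# and rename these accordingly.
MARKER_COL = "marker"
CATEGORY_COL = "category"

with open("lexicon/stance_lexicon.csv", newline="", encoding="utf-8") as f:
    lexicon = list(csv.DictReader(f))

sentence = ("These findings suggest that maternal nutrition "
            "should not be underestimated.")

# Naive case-insensitive substring matching of lexicon entries in a sentence.
hits = [
    (row[MARKER_COL], row.get(CATEGORY_COL, ""))
    for row in lexicon
    if row.get(MARKER_COL) and row[MARKER_COL].lower() in sentence.lower()
]
print(hits)
```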
If you use the Fact-Stance dataset in your work, please cite our paper:
```bibtex
@inproceedings{lin2025factstance,
  title={Fact-Stance: A Stance-Aware Dataset of Structured Scientific Claims with LLM Annotators},
  author={Lin, Xin and Zhao, Yang and Zhang, Zhixiong and Wang, Yajiao and Li, Yang and Zhang, Mengting},
  booktitle={Proceedings of the 25th ACM/IEEE Joint Conference on Digital Libraries (JCDL)},
  pages={1--4},
  year={2025},
  organization={IEEE}
}
```
🎉 This dataset has been accepted at JCDL 2025 (Resources Track).
This project is licensed under the MIT License.

