This repository contains the Fact-Stance dataset, a collection of structured scientific claims annotated with factual triples and epistemic stance information.
Scientific literature is a rich source of domain knowledge—but how authors present their claims matters just as much as the claims themselves. Are they stating facts with confidence, hedging their assertions, or simply reporting others’ views?
Such subjective stances, encoded through subtle linguistic markers, are pervasive in scholarly writing yet remain largely overlooked in computational approaches that reduce claims to objective facts.
In this work, we present a stance-aware framework that models scientific claims as dual-layered structures comprising factual elements (entities, events, and relations) and stance markers (expressing epistemic, evaluative, and interpersonal positioning). To support large-scale analysis, we develop an LLM-assisted annotation pipeline that combines model-generated rationales with expert adjudication. This process yields Fact-Stance-9K, a dataset of 9,261 claims from 537 biomedical papers, annotated with fine-grained claim structure and achieving 72% inter-annotator agreement. Through a claim reconstruction task, we demonstrate that our representation preserves essential semantic and rhetorical content (BERTScore-F1=0.90, ROUGE-L=0.74), confirming the framework’s expressive power and utility for future stance-aware scientific data mining.
- Factual Elements refer to the core propositional content of the claim—verifiable information about entities, events, actions, or relationships. These elements represent what is being asserted as true within the scientific narrative.
- Stance Markers refer to the linguistic expressions that indicate how authors position themselves toward the claim. These markers capture epistemic commitment, evaluative judgment, and interpersonal engagement, shaping how the claim is framed within the scientific discourse.
This dataset includes:
- ✅ Structured factual triples (subject-predicate-object)
- ✅ Epistemic stance markers
- ✅ Original sentences extracted from scientific full-text papers
- ✅ Prompts used for LLM annotation
- ✅ Lexicon of stance-related expressions
```
.
├── data/
│   ├── fact_stance.json           # JSON version of the dataset
│   └── filter_markers.json        # JSON version of all the markers
├── prompts/
│   └── fact_stance_prompts.md     # Prompt templates used for LLM annotation
├── lexicon/
│   └── stance_lexicon.csv         # List of stance-related expressions
├── README.md                      # You are here!
└── LICENSE                        # License file
```
This dataset contains two views designed to support the analysis of factual language and stance markers extracted from textual data: a sentence-level view (data/fact_stance.json) and a marker-level view (data/filter_markers.json). Each view serves a specific purpose and provides structured access to relevant linguistic annotations and sentence-level metadata.
Schema (data/fact_stance.json):
| Column Name | Type | Description |
|---|---|---|
| sen_id | INTEGER | Unique identifier for the sentence. |
| article_id | INTEGER | Identifier for the article containing the sentence. |
| section_id | INTEGER | Identifier for the section within the article. |
| sen_text | TEXT | The original sentence text. |
| qwen_reason | TEXT | Reasoning generated by the Qwen model regarding the factual or stance content. |
| deepseek_reason | TEXT | Reasoning generated by the DeepSeek model regarding the factual or stance content. |
| moves | TEXT | Discourse moves identified in the sentence (e.g., claim, premise). |
| factual_elements | TEXT | Extracted factual elements present in the sentence. |
| markers_json | TEXT | JSON-formatted list of associated linguistic markers detected in the sentence. |
| reverse_text | TEXT | The sentence reconstructed from the extracted factual elements and stance markers, used in the claim reconstruction evaluation. |
Note: In total, 537 articles and 9,196 sentences identified as factual scientific claims are included in fact_stance.json.
Example:
```json
{
  "sen_id": 1,
  "article_id": "PMC11085310",
  "section_id": "1",
  "sen_text": "Gestational diabetes mellitus (GDM) refers to a carbohydrate intolerance that manifests as hyperglycemia of varying severity and initiates or is first observed during pregnancy [1].",
  "qwen_reason": "The sentence defines GDM by describing its relationship to carbohydrate intolerance, hyperglycemia, and pregnancy, presenting actionable scientific knowledge.",
  "deepseek_reason": "The sentence asserts a clear scientific claim about the nature of gestational diabetes mellitus (GDM), defining it as a carbohydrate intolerance that manifests as hyperglycemia and is first observed during pregnancy. It presents actionable scientific knowledge and advances understanding of GDM.",
  "moves": "Background",
  "factual_elements": "[{\"object\": \"carbohydrate intolerance\", \"subject\": \"Gestational diabetes mellitus (GDM)\", \"relation\": \"refers to\"}, {\"object\": \"hyperglycemia\", \"subject\": \"carbohydrate intolerance\", \"relation\": \"manifests as\"}, {\"value\": \"varying\", \"subject\": \"hyperglycemia\", \"attribute\": \"severity\"}, {\"object\": \"pregnancy\", \"subject\": \"GDM\", \"relation\": \"initiates or is first observed during\"}]",
  "markers_json": "[{\"marker\": \"is first observed during pregnancy\", \"primary_category\": \"Epistemic Stance\", \"secondary_category\": \"Evidentiality\"}, {\"marker\": \"[1]\", \"primary_category\": \"Epistemic Stance\", \"secondary_category\": \"Evidentiality\"}]",
  "reverse_text": "Gestational diabetes mellitus (GDM) refers to carbohydrate intolerance, which manifests as hyperglycemia of varying severity, and GDM is first observed during pregnancy [1]."
}
```
Schema (data/filter_markers.json):
| Column Name | Type | Description |
|---|---|---|
| marker_id | INTEGER | Unique identifier for each marker. |
| article_id | INTEGER | Identifier for the article in which the marker appears. |
| sen_id | INTEGER | Identifier for the sentence containing the marker. |
| marker | TEXT | The actual linguistic marker text. |
| primary_category | TEXT | High-level classification of the marker type (e.g., "Epistemic Stance", "Evaluative Stance"). |
| secondary_category | TEXT | Sub-category providing more granular classification (e.g., "Evidentiality", "Appreciative Evaluation"). |
| explanation | TEXT | Optional explanation or justification for the categorization. |
| token_position | INTEGER | Position of the marker within the sentence (token-level index). |
| relative_position | TEXT | Relative position of the marker within the sentence, expressed as a fraction of sentence length (e.g., 0.55). |
Note: In total, 543 articles and 11,517 sentences are included in filter_markers.json.
Example:
```json
{
  "marker_id": 5546,
  "article_id": "PMC11085310",
  "sen_id": 54,
  "marker": "should not be underestimated",
  "primary_category": "Evaluative Stance",
  "secondary_category": "Appreciative Evaluation",
  "explanation": "emphasizes the importance of maternal nutrition",
  "token_position": 6,
  "relative_position": 0.5455
}
```
All prompts used to extract factual triples and identify epistemic stances are included in prompts/fact_stance_prompts.md. These include:
- System Prompt: Role and Scope Definition
- Claim Identification
- Factual Elements Extraction
- Stance Markers Extraction
- Claim Reconstruction
A curated list of epistemic stance markers commonly found in scientific discourse is provided in lexicon/stance_lexicon.csv; a small usage sketch follows the list below. It can be used for:
- Training stance classifiers
- Fact verification / Scientific uncertainty recognition
- Writing assistance
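For instance, the lexicon can support a simple marker-matching pass over new text. The sketch below is illustrative only: the column names of stance_lexicon.csv are not documented in this README, so the marker and category columns are assumptions that should be adapted to the file's actual header.

```python
import csv

# Hypothetical column names -- check the actual header of stance_lexicon.csv
# and rename these accordingly.
MARKER_COL = "marker"
CATEGORY_COL = "category"

with open("lexicon/stance_lexicon.csv", newline="", encoding="utf-8") as f:
    lexicon = list(csv.DictReader(f))

sentence = ("These findings suggest that maternal nutrition "
            "should not be underestimated.")

# Naive case-insensitive substring matching of lexicon entries in a sentence.
hits = [
    (row[MARKER_COL], row.get(CATEGORY_COL, ""))
    for row in lexicon
    if row.get(MARKER_COL) and row[MARKER_COL].lower() in sentence.lower()
]
print(hits)
```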
If you use the Fact-Stance dataset in your work, please cite our paper:
```bibtex
@inproceedings{lin2025factstance,
  title={Fact-Stance: A Stance-Aware Dataset of Structured Scientific Claims with LLM Annotators},
  author={Lin, Xin and Zhao, Yang and Zhang, Zhixiong and Wang, Yajiao and Li, Yang and Zhang, Mengting},
  booktitle={Proceedings of the 25th ACM/IEEE Joint Conference on Digital Libraries (JCDL)},
  pages={1--4},
  year={2025},
  organization={IEEE}
}
```
🎉 This dataset has been accepted at JCDL 2025 (Resources Track).
This project is licensed under the MIT License.

