
🧠 Fact-Stance: A Stance-Aware Dataset of Structured Scientific Claims

Figure: Overview of the Stance-Aware Scientific Claim Framework

This repository contains the Fact-Stance dataset, a collection of structured scientific claims annotated with factual triples and epistemic stance information.

🔍 Overview

Scientific literature is a rich source of domain knowledge—but how authors present their claims matters just as much as the claims themselves. Are they stating facts with confidence, hedging their assertions, or simply reporting others’ views?

Such subjective stances, encoded through subtle linguistic markers, are pervasive in scholarly writing yet remain largely overlooked in computational approaches that reduce claims to objective facts.

In this work, we present a stance-aware framework that models scientific claims as dual-layered structures comprising factual elements (entities, events, and relations) and stance markers that express epistemic, evaluative, and interpersonal positioning. To support large-scale analysis, we develop an LLM-assisted annotation pipeline that combines model-generated rationales with expert adjudication. This process yields Fact-Stance-9K, a dataset of 9,261 claims from 537 biomedical papers, annotated with fine-grained claim structure and achieving 72% inter-annotator agreement. Through a claim reconstruction task, we demonstrate that our representation preserves essential semantic and rhetorical content (BERTScore-F1 = 0.90, ROUGE-L = 0.74), confirming the framework’s expressive power and utility for future stance-aware scientific data mining.

  • Factual Elements refer to the core propositional content of the claim: verifiable information about entities, events, actions, or relationships. These elements represent what is being asserted as true within the scientific narrative.
  • Stance Markers refer to the linguistic expressions that indicate how authors position themselves toward the claim. These markers capture epistemic commitment, evaluative judgment, and interpersonal engagement, shaping how the claim is framed within the scientific discourse.

Figure: Annotated Examples of Stance Markers Across Different Categories

This dataset includes:

  • ✅ Structured factual triples (subject-predicate-object)
  • ✅ Epistemic stance markers
  • ✅ Original sentences extracted from scientific full-text papers
  • ✅ Prompts used for LLM annotation
  • ✅ Lexicon of stance-related expressions

📁 Repository Structure

.
├── data/
│   ├── fact_stance.json       # JSON version of the dataset 
│   └── filter_markers.json    # JSON version of all the markers
├── prompts/
│   └── fact_stance_prompts.md    # Prompt templates used for LLM annotation
├── lexicon/
│   └── stance_lexicon.csv     # List of stance-related expressions
├── README.md                  # You are here!
└── LICENSE                    # License file

📄 Dataset Description

This dataset contains two views designed to support the analysis of factual language and stance markers extracted from textual data. Each view serves a specific purpose and provides structured access to relevant linguistic annotations and sentence-level metadata.

1. fact_stance

Schema:

| Column Name | Type | Description |
|---|---|---|
| sen_id | INTEGER | Unique identifier for the sentence. |
| article_id | TEXT | Identifier for the article containing the sentence (e.g., "PMC11085310"). |
| section_id | TEXT | Identifier for the section within the article (e.g., "1"). |
| sen_text | TEXT | The original sentence text. |
| qwen_reason | TEXT | Reasoning generated by the Qwen model regarding the factual or stance content. |
| deepseek_reason | TEXT | Reasoning generated by the DeepSeek model regarding the factual or stance content. |
| moves | TEXT | Discourse move identified in the sentence (e.g., "Background"). |
| factual_elements | TEXT | JSON-formatted list of factual elements extracted from the sentence. |
| markers_json | TEXT | JSON-formatted list of linguistic markers detected in the sentence. |
| reverse_text | TEXT | A reverse-generated version of the sentence (in the example below, a rewording of the original claim), used for contrastive analysis. |

Note: In total, fact_stance includes 9,196 sentences from 537 articles that were identified as factual scientific claims.

Example:

  {
    "sen_id": 1,
    "article_id": "PMC11085310",
    "section_id": "1",
    "sen_text": "Gestational diabetes mellitus (GDM) refers to a carbohydrate intolerance that manifests as hyperglycemia of varying severity and initiates or is first observed during pregnancy [1].",
    "qwen_reason": "The sentence defines GDM by describing its relationship to carbohydrate intolerance, hyperglycemia, and pregnancy, presenting actionable scientific knowledge.",
    "deepseek_reason": "The sentence asserts a clear scientific claim about the nature of gestational diabetes mellitus (GDM), defining it as a carbohydrate intolerance that manifests as hyperglycemia and is first observed during pregnancy. It presents actionable scientific knowledge and advances understanding of GDM.",
    "moves": "Background",
    "factual_elements": "[{\"object\": \"carbohydrate intolerance\", \"subject\": \"Gestational diabetes mellitus (GDM)\", \"relation\": \"refers to\"}, {\"object\": \"hyperglycemia\", \"subject\": \"carbohydrate intolerance\", \"relation\": \"manifests as\"}, {\"value\": \"varying\", \"subject\": \"hyperglycemia\", \"attribute\": \"severity\"}, {\"object\": \"pregnancy\", \"subject\": \"GDM\", \"relation\": \"initiates or is first observed during\"}]",
    "markers_json": "[{\"marker\": \"is first observed during pregnancy\", \"primary_category\": \"Epistemic Stance\", \"secondary_category\": \"Evidentiality\"}, {\"marker\": \"[1]\", \"primary_category\": \"Epistemic Stance\", \"secondary_category\": \"Evidentiality\"}]",
    "reverse_text": "Gestational diabetes mellitus (GDM) refers to carbohydrate intolerance, which manifests as hyperglycemia of varying severity, and GDM is first observed during pregnancy [1]."
  }
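Note that factual_elements and markers_json are stored as JSON-encoded strings, so they need a second decoding pass before use. The following is a minimal sketch of that step, not an official loader: the record is an abbreviated copy of the example above, the helper names (load_records, decode_record, split_elements) are hypothetical, and load_records assumes that data/fact_stance.json holds a single JSON array of such records, which this README does not confirm.

```python
import json

# Abbreviated copy of the example record above; only the fields used here are kept.
record = {
    "sen_id": 1,
    "article_id": "PMC11085310",
    "factual_elements": (
        '[{"object": "carbohydrate intolerance", '
        '"subject": "Gestational diabetes mellitus (GDM)", "relation": "refers to"}, '
        '{"value": "varying", "subject": "hyperglycemia", "attribute": "severity"}]'
    ),
    "markers_json": (
        '[{"marker": "[1]", "primary_category": "Epistemic Stance", '
        '"secondary_category": "Evidentiality"}]'
    ),
}


def load_records(path="data/fact_stance.json"):
    """Read the dataset file. ASSUMPTION: the file is one JSON array of records."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def decode_record(rec):
    """Return a copy of rec with the JSON-encoded string fields parsed into lists."""
    decoded = dict(rec)
    for field in ("factual_elements", "markers_json"):
        decoded[field] = json.loads(rec[field])
    return decoded


def split_elements(elements):
    """Separate relation triples from attribute-value entries."""
    triples = [
        (e["subject"], e["relation"], e["object"])
        for e in elements
        if "relation" in e
    ]
    attributes = [
        (e["subject"], e["attribute"], e["value"])
        for e in elements
        if "attribute" in e
    ]
    return triples, attributes


decoded = decode_record(record)
triples, attributes = split_elements(decoded["factual_elements"])
print(triples)
print(attributes)
print([m["marker"] for m in decoded["markers_json"]])
```

As the example shows, a factual element is either a subject-relation-object triple or a subject-attribute-value entry, so the sketch keeps the two shapes apart rather than forcing both into one tuple layout.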

2. filter_markers

Schema:

| Column Name | Type | Description |
|---|---|---|
| marker_id | INTEGER | Unique identifier for each marker. |
| article_id | TEXT | Identifier for the article in which the marker appears (e.g., "PMC11085310"). |
| sen_id | INTEGER | Identifier for the sentence containing the marker. |
| marker | TEXT | The actual linguistic marker text. |
| primary_category | TEXT | High-level classification of the marker type (e.g., "Epistemic Stance", "Evaluative Stance"). |
| secondary_category | TEXT | Sub-category providing more granular classification (e.g., "Evidentiality", "Appreciative Evaluation"). |
| explanation | TEXT | Optional explanation or justification for the categorization. |
| token_position | INTEGER | Position of the marker within the sentence (token-level index). |
| relative_position | REAL | Relative position of the marker within the sentence, given as a fraction (e.g., 0.5455). |

Note: In total, filter_markers covers 543 articles and 11,517 sentences.

Example:

  {
    "marker_id": 5546,
    "article_id": "PMC11085310",
    "sen_id": 54,
    "marker": "should not be underestimated",
    "primary_category": "Evaluative Stance",
    "secondary_category": "Appreciative Evaluation",
    "explanation": "emphasizes the importance of maternal nutrition",
    "token_position": 6,
    "relative_position": 0.5455
  }
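Marker rows like the one above can be tallied by category or grouped back onto their sentences. The sketch below is one hedged way to do that on in-memory rows. The function names are hypothetical. The first row copies the example above, and the second is assembled from the markers_json field of the fact_stance example, so it has no marker_id or position fields. Keying sentences on (article_id, sen_id) is a conservative assumption, in case sen_id alone is not unique across articles.

```python
from collections import Counter, defaultdict

# Row 1 copies the filter_markers example; row 2 is assembled from the
# markers_json field of the fact_stance example (no marker_id or positions).
markers = [
    {
        "marker_id": 5546,
        "article_id": "PMC11085310",
        "sen_id": 54,
        "marker": "should not be underestimated",
        "primary_category": "Evaluative Stance",
        "secondary_category": "Appreciative Evaluation",
        "token_position": 6,
        "relative_position": 0.5455,
    },
    {
        "article_id": "PMC11085310",
        "sen_id": 1,
        "marker": "[1]",
        "primary_category": "Epistemic Stance",
        "secondary_category": "Evidentiality",
    },
]


def count_by_category(rows):
    """Count markers per (primary_category, secondary_category) pair."""
    return Counter((r["primary_category"], r["secondary_category"]) for r in rows)


def group_by_sentence(rows):
    """Collect marker texts per sentence, keyed on (article_id, sen_id)."""
    grouped = defaultdict(list)
    for r in rows:
        grouped[(r["article_id"], r["sen_id"])].append(r["marker"])
    return dict(grouped)


counts = count_by_category(markers)
by_sentence = group_by_sentence(markers)
for (primary, secondary), n in sorted(counts.items()):
    print(f"{primary} / {secondary}: {n}")
print(by_sentence)
```

As a side observation, 0.5455 equals 6/11 rounded to four decimal places, which would be consistent with relative_position being token_position divided by the sentence's token count. This README does not state the formula, so treat that reading as a guess.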

⚙️ Prompts Used

All prompts used to extract factual triples and identify epistemic stances are included in prompts/fact_stance_prompts.md. These include:

  • System Prompt: Role and Scope Definition
  • Claim Identification
  • Factual Elements Extraction
  • Stance Markers Extraction
  • Claim Reconstruction
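The output of the Claim Reconstruction step can be compared with the original sentence using an overlap metric such as the ROUGE-L score reported in the overview. The following is a rough, dependency-free sketch of a balanced ROUGE-L F-measure based on the longest common subsequence. Lowercasing and whitespace tokenization are assumptions made here for simplicity. The implementation and settings used in the paper are not stated in this README, so scores from this sketch are not expected to reproduce the reported value.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Balanced F-measure over LCS precision and recall (whitespace tokens)."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    lcs = lcs_length(ref_tokens, cand_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# sen_text and reverse_text taken from the fact_stance example record.
sen_text = (
    "Gestational diabetes mellitus (GDM) refers to a carbohydrate intolerance "
    "that manifests as hyperglycemia of varying severity and initiates or is "
    "first observed during pregnancy [1]."
)
reverse_text = (
    "Gestational diabetes mellitus (GDM) refers to carbohydrate intolerance, "
    "which manifests as hyperglycemia of varying severity, and GDM is first "
    "observed during pregnancy [1]."
)
print(f"ROUGE-L F1 (sketch): {rouge_l_f1(sen_text, reverse_text):.2f}")
```

Because punctuation stays attached to tokens under whitespace splitting ("intolerance" versus "intolerance,"), this sketch tends to score lower than a tokenizer-aware implementation would.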

📚 Stance Lexicon

A curated list of epistemic stance markers commonly found in scientific discourse is provided in lexicon/stance_lexicon.csv. It can be used for:

  • Training stance classifiers
  • Fact verification / Scientific uncertainty recognition
  • Writing assistance
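As a simple starting point for these uses, lexicon entries can be matched against a sentence with whole-word, case-insensitive search. The sketch below is illustrative only. The column name "expression" and the three sample rows are hypothetical stand-ins, since the actual header and contents of lexicon/stance_lexicon.csv are not described in this README. Pass the real column name once you have inspected the file.

```python
import csv
import io
import re

# Hypothetical stand-in for lexicon/stance_lexicon.csv; the real column
# name and entries may differ.
SAMPLE_LEXICON = """expression
should not be underestimated
may
suggest
"""


def load_lexicon(handle, column="expression"):
    """Read non-empty expressions from an open CSV file handle."""
    return [
        row[column].strip()
        for row in csv.DictReader(handle)
        if row[column].strip()
    ]


def find_markers(sentence, expressions):
    """Return (start_offset, matched_text) pairs, sorted by position."""
    hits = []
    for expr in expressions:
        # Lookarounds keep "may" from matching inside a longer word.
        pattern = r"(?<!\w)" + re.escape(expr) + r"(?!\w)"
        for match in re.finditer(pattern, sentence, flags=re.IGNORECASE):
            hits.append((match.start(), match.group(0)))
    return sorted(hits)


expressions = load_lexicon(io.StringIO(SAMPLE_LEXICON))
sentence = (
    "These results suggest that maternal nutrition may matter "
    "and should not be underestimated."
)
hits = find_markers(sentence, expressions)
print([text for _, text in hits])
```

To run this against the repository file, replace the io.StringIO wrapper with open("lexicon/stance_lexicon.csv", newline="", encoding="utf-8"). Plain string matching ignores context, so it will over-generate compared with the annotated markers in the dataset.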

🧾 Citation

If you use the Fact-Stance dataset in your work, please cite our paper:

@inproceedings{lin2025factstance,
  title={Fact-Stance: A Stance-Aware Dataset of Structured Scientific Claims with LLM Annotators},
  author={Lin, Xin and Zhao, Yang and Zhang, Zhixiong and Wang, Yajiao and Li, Yang and Zhang, Mengting},
  booktitle={Proceedings of the 25th ACM/IEEE Joint Conference on Digital Libraries (JCDL)},
  pages={1--4},
  year={2025},
  organization={IEEE}
}

🎉 This dataset has been accepted at JCDL 2025 (Resources Track).

📄 License

This project is licensed under the MIT License.
