Explainable Subjective Stance Classification with SetFit in Political Discourse

This repository contains the code and resources for the paper "Explainable Subjective Stance Classification with SetFit in Political Discourse". The project combines the SetFit few-shot learning framework, the Sentence Transformers architecture, and traditional linguistic approaches to make stance classification in political discourse more explainable.

Project Overview

Stance classification in NLP is a key tool for understanding political discourse and the attitudes underlying political statements. This research addresses the scarcity of annotated datasets in political science by proposing a practical sentence-level dataset for binary subjective stance classification (support or oppose) and by training classifiers on it with the SetFit few-shot learning framework.

The project focuses on identifying linguistic markers that predict subjective stance toward explicitly identified political targets or policy issues, using:

SetFit few-shot learning framework
Sentence Transformers architecture
Traditional linguistic approaches for explainability
SHAP (SHapley Additive exPlanations) analysis

Academic Context

This paper is part of Juan-Francisco Reyes's doctoral research at the Institute of Computer Science, Brandenburgische Technische Universität Cottbus-Senftenberg.

Repository Structure

Data Processing and Dataset Creation

paper_b_1_dataset_extract_sentences.py: Extracts sentences from source texts
paper_b_2_dataset_build_pools.py: Builds data pools for annotation
paper_b_3_dataset_build_unlabeled.py: Creates unlabeled dataset
paper_b_4_dataset_build_dataset_gsheets.py: Builds dataset from Google Sheets
paper_b_4_dataset_filter_unlabeled.py: Filters unlabeled data
paper_b_5_dataset_preprocess.py: Preprocesses dataset
paper_b_6_dataset_1a_split.py: Splits dataset into train/validation/test sets
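
As an illustration of the splitting step, here is a minimal sketch of a stratified train/validation/test split in plain Python. It is not the code from paper_b_6_dataset_1a_split.py: the function name, the row format ("text" and "label" keys), the 80/10/10 ratios, and the seed are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key="label", ratios=(0.8, 0.1, 0.1), seed=42):
    """Split rows into train/validation/test while keeping label proportions.

    Hypothetical helper: the ratios and seed are illustrative defaults.
    """
    rng = random.Random(seed)

    # Group rows by label so that each class is split independently.
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    train, validation, test = [], [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        n_train = round(len(group) * ratios[0])
        n_val = round(len(group) * ratios[1])
        train.extend(group[:n_train])
        validation.extend(group[n_train:n_train + n_val])
        test.extend(group[n_train + n_val:])  # remainder goes to test
    return train, validation, test


# Toy data: 50 "support" and 50 "oppose" sentences (placeholders, not real data).
rows = [
    {"text": f"sentence {i}", "label": "support" if i % 2 else "oppose"}
    for i in range(100)
]
train, validation, test = stratified_split(rows)
print(len(train), len(validation), len(test))
```

Splitting each class separately keeps the support/oppose balance the same in all three sets, which matters when only a few labeled examples are available.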

Feature Extraction and Analysis

paper_b_7_dataset_1a_feature_extraction.py: Extracts linguistic features
paper_b_7_dataset_1a_feature_extraction_token_level.py: Token-level feature extraction
paper_b_8_dataset_1a_feature_aggregation_binary.py: Aggregates features for binary classification
paper_b_9_frames_chi2.py: Chi-square analysis for frames
paper_b_10_logistic_regression.py: Logistic regression baseline model
paper_b_11_shap_analysis.py: SHAP analysis for model explainability
paper_b_12_shap_aggregate_and_rank.py: Aggregates and ranks SHAP values
paper_b_13_dataset_1a_data_analysis_binary.py: Binary data analysis
paper_b_14_dataset_1a_data_processing.py: Data processing for analysis
paper_b_15_dataset_1a_feature_aggregation_1.py: Feature aggregation
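
To make the chi-square step concrete, the sketch below computes a Pearson chi-square statistic for a 2x2 table of frame presence against stance label. It is a hand-rolled illustration, not the code in paper_b_9_frames_chi2.py; the function name and the counts are invented for the example, and no continuity correction is applied.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-square test for a 2x2 contingency table (no Yates correction).

    Table layout (hypothetical):
                      support   oppose
        frame present    a         b
        frame absent     c         d
    Returns (statistic, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("A row or column total is zero; the test is undefined.")
    statistic = n * (a * d - b * c) ** 2 / denominator
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Made-up counts: the frame appears in 30 support and 10 oppose sentences.
statistic, p_value = chi2_2x2(30, 10, 10, 30)
print(f"chi2 = {statistic:.2f}, p = {p_value:.2e}")
```

A small p-value suggests that the frame's presence is associated with one stance more than the other; it says nothing about the direction, which has to be read from the counts themselves.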

Model Training and Inference

paper_b_16_rb_frameBERT_sentence_analizer.py: FrameBERT sentence analyzer
paper_b_17_rb_frameBERT_visualizer.py: FrameBERT visualizer
paper_b_18_lime_analysis.ipynb: LIME analysis for model explainability
paper_b_19_dl_chat_gpt_inference.py: ChatGPT inference for comparison
paper_b_20_dl_setfit_train.py: Trains the SetFit model
paper_b_21_dl_setfit_hop.py: Hyperparameter optimization for SetFit
paper_b_22_dl_setfit_inference.py: Performs inference using the trained SetFit model
paper_b_23_dl_setfit_test.py: Tests the SetFit model performance
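
For orientation, the following is a rough sketch of few-shot training with SetFit. It is not taken from paper_b_20_dl_setfit_train.py. It assumes setfit 1.0 or later (which exposes SetFitModel, Trainer, and TrainingArguments) and the datasets package; the base model name, the label encoding, the hyperparameters, and the example sentences are placeholders rather than the values used in the paper.

```python
# Assumed label encoding; the repository may use a different one.
LABEL_TO_ID = {"oppose": 0, "support": 1}


def to_setfit_columns(rows):
    """Convert row dicts into the "text"/"label" columns SetFit expects by default."""
    return {
        "text": [row["text"] for row in rows],
        "label": [LABEL_TO_ID[row["label"]] for row in rows],
    }


def train_stance_model(train_rows, eval_rows,
                       base_model="sentence-transformers/paraphrase-mpnet-base-v2"):
    """Fine-tune a SetFit model on a handful of labeled sentences.

    Imports are deferred so that the helper above works without the heavy
    dependencies installed. base_model is a placeholder checkpoint.
    """
    from datasets import Dataset
    from setfit import SetFitModel, Trainer, TrainingArguments

    model = SetFitModel.from_pretrained(base_model)
    args = TrainingArguments(batch_size=16, num_epochs=1)  # illustrative values
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=Dataset.from_dict(to_setfit_columns(train_rows)),
        eval_dataset=Dataset.from_dict(to_setfit_columns(eval_rows)),
        metric="accuracy",
    )
    trainer.train()
    return model, trainer.evaluate()


# Invented example sentences, only to show the expected input shape.
example_rows = [
    {"text": "I fully support this reform.", "label": "support"},
    {"text": "We must reject this proposal.", "label": "oppose"},
]
columns = to_setfit_columns(example_rows)
print(columns)
```

Calling train_stance_model downloads the base checkpoint and fine-tunes it, so it is left out of the demo above. After training, model.predict accepts a list of sentences and returns the predicted label ids.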

Library Files (lib/)

The lib folder contains utility modules and specialized components for linguistic analysis and stance classification. Some of these files are imported directly by the root-level scripts; others are supporting modules or resources used by other library files.

Utility Functions

utils.py: General utility functions for file operations (JSON, JSONL, TXT) and Google Sheet interactions
utils2.py: Dataset manipulation utilities including deduplication, anonymization, and stratified splitting
utils_db.py: Database utility functions
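
The sketch below shows what JSONL helpers and text-based deduplication could look like. The function names, the "text" key, and the normalization rule (lowercasing and collapsing whitespace) are assumptions for illustration and may differ from what utils.py and utils2.py actually do.

```python
import json
import os
import tempfile


def save_jsonl(rows, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")


def load_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def deduplicate(rows, key="text"):
    """Keep the first row for each distinct text, ignoring case and extra spaces."""
    seen, unique = set(), []
    for row in rows:
        fingerprint = " ".join(row[key].lower().split())
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique


# Invented sentences; the second is a near-duplicate of the first.
rows = [
    {"text": "We support the bill."},
    {"text": "we  support the bill."},
    {"text": "We oppose the bill."},
]
unique = deduplicate(rows)

# Round-trip through a temporary file to show the helpers working together.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sentences.jsonl")
    save_jsonl(unique, path)
    loaded = load_jsonl(path)

print(len(rows), "->", len(loaded))
```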

Text Processing

text_utils.py: Comprehensive text preprocessing functions for cleaning and normalizing text
linguistic_utils.py: Utilities for linguistic analysis including checking sentence structure
count_tokens.py: Functions for token counting and analysis

Stance Lexicons

stance_markers_adj.py: Adjective-based stance markers for identifying stance expressions
stance_markers_adv.py: Adverb-based stance markers for identifying stance expressions
stance_markers_verb.py: Verb-based stance markers for identifying stance expressions
stance_markers_modals.py: Modal verb-based stance markers for expressing certainty and possibility

Semantic Analysis

frames.py: Semantic frame definitions for understanding conceptual structures in text
semantic_frames.py: Functions for processing and analyzing semantic frames
issues_matcher.py: Custom named entity recognition for identifying political issues

Visualization

visualizations.py: Functions for creating visualizations of analysis results including confusion matrices, dependency trees, and feature distributions
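
Before a confusion matrix can be plotted, its four cells have to be counted. The sketch below computes them for the binary support/oppose task, together with precision, recall, and F1 for the positive class. The function names and the toy labels are made up for the example; the plotting code in visualizations.py is not reproduced here.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive="support"):
    """Count true/false positives and negatives for a binary task."""
    counts = Counter()
    for truth, predicted in zip(y_true, y_pred):
        if truth == positive:
            counts["tp" if predicted == positive else "fn"] += 1
        else:
            counts["fp" if predicted == positive else "tn"] += 1
    return counts


def precision_recall_f1(counts):
    """Derive precision, recall, and F1 for the positive class from the counts."""
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy gold labels and predictions (not results from the paper).
y_true = ["support", "support", "support", "oppose", "oppose", "oppose"]
y_pred = ["support", "support", "oppose", "oppose", "oppose", "support"]

counts = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(counts)

# Plain-text rendering of the matrix: rows are gold labels, columns predictions.
print("           pred support  pred oppose")
print(f"support    {counts['tp']:>12}  {counts['fn']:>11}")
print(f"oppose     {counts['fp']:>12}  {counts['tn']:>11}")
```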

Key Features

The study leverages several approaches to enhance explainability:

Corpus Linguistics: Analysis of language patterns in political discourse
Tailored Lexicons: Custom lexicons for political language analysis
Lexicogrammatical Rules: Rules based on linguistic structures
SHAP Analysis: Quantifies the influence of linguistic features on model decisions
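
To show what "influence of a feature" means in SHAP terms, the sketch below computes Shapley values by hand for a linear model. For a linear score and features treated as independent, the Shapley value of feature i is its weight times the difference between the feature's value and its background mean. This is a simplified illustration, not the repository's SHAP pipeline; the weights, bias, feature names, and data are invented.

```python
def linear_shap_values(weights, x, background):
    """Shapley values for a linear score f(x) = bias + sum(w_i * x_i).

    Assumes features are independent, in which case
    phi_i = w_i * (x_i - mean_i), where mean_i is the background average.
    """
    n_rows = len(background)
    means = [sum(row[i] for row in background) / n_rows for i in range(len(weights))]
    return [w * (value - mean) for w, value, mean in zip(weights, x, means)], means


def linear_score(weights, bias, x):
    """Raw model score (for logistic regression, this is the log-odds)."""
    return bias + sum(w * value for w, value in zip(weights, x))


# Hypothetical model over three of the linguistic features.
feature_names = ["pro_polarity", "con_polarity", "hedges"]
weights = [0.5, -1.0, 0.25]
bias = 0.1

# Hypothetical background sample (feature counts per sentence) and one instance.
background = [[1, 0, 2], [0, 1, 0], [2, 2, 1]]
x = [3, 0, 1]

phi, means = linear_shap_values(weights, x, background)
for name, value in zip(feature_names, phi):
    print(f"{name}: {value:+.2f}")

# Additivity: the contributions sum to f(x) minus the score at the mean input.
difference = linear_score(weights, bias, x) - linear_score(weights, bias, means)
print(f"sum(phi) = {sum(phi):.2f}, f(x) - f(mean) = {difference:.2f}")
```

The additivity check at the end is the property that makes SHAP values easy to aggregate and rank: each sentence's score decomposes exactly into per-feature contributions plus a baseline.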

The project identifies eight distinct linguistic features for stance classification:

Positive affect
Negative affect
Pro polarity
Con polarity
Certainty
Emphatics
Doubt
Hedges
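
As a rough picture of how the eight features could be turned into numbers, the sketch below counts lexicon hits per sentence. The lexicon entries are tiny invented samples, not the contents of the stance_markers_*.py files, and the regex tokenizer is a simplification of whatever linguistic processing the feature extraction scripts perform.

```python
import re

# Hypothetical mini-lexicons, one per feature; the real lexicons are far larger.
LEXICONS = {
    "positive_affect": {"proud", "glad", "hope"},
    "negative_affect": {"afraid", "angry", "worried"},
    "pro_polarity": {"support", "endorse", "favor"},
    "con_polarity": {"oppose", "reject", "against"},
    "certainty": {"certainly", "undoubtedly", "clearly"},
    "emphatics": {"really", "absolutely", "strongly"},
    "doubt": {"doubt", "unsure", "questionable"},
    "hedges": {"perhaps", "maybe", "somewhat"},
}


def extract_features(sentence):
    """Return the number of lexicon matches for each of the eight features."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return {
        name: sum(token in lexicon for token in tokens)
        for name, lexicon in LEXICONS.items()
    }


features = extract_features(
    "I strongly support this bill, and I am proud to endorse it."
)
for name, count in features.items():
    print(f"{name}: {count}")
```

A vector of counts like this is the kind of input a logistic regression baseline can consume, which in turn is what makes per-feature explanations possible.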

Usage

Dataset Preparation: Run the dataset extraction and preprocessing scripts (paper_b_1 through paper_b_6)
Feature Extraction: Extract linguistic features using the feature extraction and aggregation scripts (paper_b_7 and paper_b_8)
Model Training: Train the SetFit model using paper_b_20_dl_setfit_train.py
Model Evaluation: Evaluate the model using paper_b_23_dl_setfit_test.py
Inference: Perform inference on new data using paper_b_22_dl_setfit_inference.py
Explainability Analysis: Analyze feature importance using the SHAP scripts (paper_b_11_shap_analysis.py, paper_b_12_shap_aggregate_and_rank.py) and the LIME notebook (paper_b_18_lime_analysis.ipynb)
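
The steps above can be sketched as a small runner that calls the scripts in sequence. This is only an illustration: the execution order is inferred from the numeric prefixes in the file names, and it assumes each script runs without command-line arguments from the repository root, which the README does not confirm. Some scripts may also need credentials or data that are not bundled (for example, the Google Sheets step).

```python
import subprocess
import sys

# Stage -> scripts, following the Usage steps; the order is an assumption.
PIPELINE = [
    ("Dataset preparation", [
        "paper_b_1_dataset_extract_sentences.py",
        "paper_b_2_dataset_build_pools.py",
        "paper_b_3_dataset_build_unlabeled.py",
        "paper_b_4_dataset_build_dataset_gsheets.py",
        "paper_b_4_dataset_filter_unlabeled.py",
        "paper_b_5_dataset_preprocess.py",
        "paper_b_6_dataset_1a_split.py",
    ]),
    ("Feature extraction", [
        "paper_b_7_dataset_1a_feature_extraction.py",
        "paper_b_8_dataset_1a_feature_aggregation_binary.py",
    ]),
    ("Model training", ["paper_b_20_dl_setfit_train.py"]),
    ("Model evaluation", ["paper_b_23_dl_setfit_test.py"]),
    ("Inference", ["paper_b_22_dl_setfit_inference.py"]),
    ("Explainability analysis", [
        "paper_b_11_shap_analysis.py",
        "paper_b_12_shap_aggregate_and_rank.py",
    ]),
]


def build_commands(stages=PIPELINE, python=sys.executable):
    """Flatten the stages into a list of argv lists, one per script."""
    return [[python, script] for _, scripts in stages for script in scripts]


def run_pipeline(dry_run=True):
    """Print each command, and execute it only when dry_run is False."""
    for command in build_commands():
        print(" ".join(command))
        if not dry_run:
            # check=True stops the pipeline at the first failing script.
            subprocess.run(command, check=True)


commands = build_commands()
run_pipeline(dry_run=True)
```

Keeping dry_run=True as the default makes it safe to inspect the planned commands before anything is executed.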

Requirements

The project dependencies are listed in the requirements.txt file. Install them using:

pip install -r requirements.txt

Key dependencies include:

setfit
transformers
sentence-transformers
datasets
scikit-learn
shap
spacy
torch
pandas
matplotlib

Research Findings

The findings demonstrate the efficacy of few-shot learning in subjective stance classification and highlight the importance of linguistic features, particularly pro/con polarity and affective expressions. The StanceSentences dataset and the hybrid analytical approach offer a benchmark for future research, emphasizing the need for nuanced, multi-layered analysis in political discourse.

Citation

If you use this code or the findings in your research, please cite the original paper:

Reyes, J. F. (2024). Explainable Subjective Stance Classification with SetFit in Political Discourse. Institute of Computer Science, Brandenburgische Technische Universität Cottbus-Senftenberg.

License

This project is released under the MIT License.
