This repository contains the code and resources for the paper "Explainable Subjective Stance Classification with SetFit in Political Discourse". The project leverages the SetFit few-shot learning framework, Sentence Transformers architecture, and traditional linguistic approaches to enhance explainability in stance classification for political discourse.
Stance classification is a key NLP tool for understanding political discourse and the attitudes underlying political statements. This research addresses the shortage of annotated datasets in political science by proposing a practical sentence-level dataset for binary subjective stance classification (support or oppose), modeled with the SetFit few-shot learning framework.
The project focuses on identifying linguistic markers that predict subjective stance toward explicitly identified political targets or policy issues, using:
- SetFit few-shot learning framework
- Sentence Transformers architecture
- Traditional linguistic approaches for explainability
- SHAP (SHapley Additive exPlanations) analysis
This paper is part of the doctoral research of Juan-Francisco Reyes at the Institute of Computer Science, Brandenburgische Technische Universität Cottbus-Senftenberg.
The root folder contains the following Python scripts and notebook:
- paper_b_1_dataset_extract_sentences.py: Extracts sentences from source texts
- paper_b_2_dataset_build_pools.py: Builds data pools for annotation
- paper_b_3_dataset_build_unlabeled.py: Creates unlabeled dataset
- paper_b_4_dataset_build_dataset_gsheets.py: Builds dataset from Google Sheets
- paper_b_4_dataset_filter_unlabeled.py: Filters unlabeled data
- paper_b_5_dataset_preprocess.py: Preprocesses dataset
- paper_b_6_dataset_1a_split.py: Splits dataset into train/validation/test sets
- paper_b_7_dataset_1a_feature_extraction.py: Extracts linguistic features
- paper_b_7_dataset_1a_feature_extraction_token_level.py: Token-level feature extraction
- paper_b_8_dataset_1a_feature_aggregation_binary.py: Aggregates features for binary classification
- paper_b_9_frames_chi2.py: Chi-square analysis for frames
- paper_b_10_logistic_regression.py: Logistic regression baseline model
- paper_b_11_shap_analysis.py: SHAP analysis for model explainability
- paper_b_12_shap_aggregate_and_rank.py: Aggregates and ranks SHAP values
- paper_b_13_dataset_1a_data_analysis_binary.py: Binary data analysis
- paper_b_14_dataset_1a_data_processing.py: Data processing for analysis
- paper_b_15_dataset_1a_feature_aggregation_1.py: Feature aggregation
- paper_b_16_rb_frameBERT_sentence_analizer.py: FrameBERT sentence analyzer
- paper_b_17_rb_frameBERT_visualizer.py: FrameBERT visualizer
- paper_b_18_lime_analysis.ipynb: LIME analysis for model explainability
- paper_b_19_dl_chat_gpt_inference.py: ChatGPT inference for comparison
- paper_b_20_dl_setfit_train.py: Trains the SetFit model
- paper_b_21_dl_setfit_hop.py: Hyperparameter optimization for SetFit
- paper_b_22_dl_setfit_inference.py: Performs inference using the trained SetFit model
- paper_b_23_dl_setfit_test.py: Tests the SetFit model performance
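To make the chi-square step more concrete, here is a minimal sketch of a Pearson chi-square statistic for a 2x2 table of frame presence against stance label. It is not taken from paper_b_9_frames_chi2.py: the function name `chi_square_2x2`, the table layout, and the counts in the usage note are assumptions for illustration only.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction).

    Assumed table layout (hypothetical, for illustration):
                    support  oppose
        frame          a        b
        no frame       c        d
    """
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column total must be non-zero")

    observed = ((a, b), (c, d))
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence of frame and stance.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed[i][j] - expected) ** 2 / expected
    return statistic
```

With made-up counts, `chi_square_2x2(30, 10, 20, 40)` gives roughly 16.67. A 2x2 table has one degree of freedom, so values above 3.841 would be significant at the 0.05 level.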
The lib folder contains utility modules and specialized components for linguistic analysis and stance classification. Note that some of these files are used directly by the root Python files, while others serve as supporting modules or resources for other library files:
- utils.py: General utility functions for file operations (JSON, JSONL, TXT) and Google Sheet interactions
- utils2.py: Dataset manipulation utilities including deduplication, anonymization, and stratified splitting
- utils_db.py: Database utility functions
- text_utils.py: Comprehensive text preprocessing functions for cleaning and normalizing text
- linguistic_utils.py: Utilities for linguistic analysis including checking sentence structure
- count_tokens.py: Functions for token counting and analysis
- stance_markers_adj.py: Adjective-based stance markers for identifying stance expressions
- stance_markers_adv.py: Adverb-based stance markers for identifying stance expressions
- stance_markers_verb.py: Verb-based stance markers for identifying stance expressions
- stance_markers_modals.py: Modal verb-based stance markers for expressing certainty and possibility
- frames.py: Semantic frame definitions for understanding conceptual structures in text
- semantic_frames.py: Functions for processing and analyzing semantic frames
- issues_matcher.py: Custom named entity recognition for identifying political issues
- visualizations.py: Functions for creating visualizations of analysis results including confusion matrices, dependency trees, and feature distributions
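The stratified splitting mentioned for utils2.py can be illustrated with a short standard-library sketch. This is not the repository's implementation: the function name `stratified_split`, the `label` key, and the default fraction and seed are all assumptions.

```python
import random
from collections import defaultdict


def stratified_split(records, label_key="label", test_fraction=0.2, seed=42):
    """Split records into (train, test) while keeping label proportions.

    Hypothetical helper: each record is assumed to be a dict with a label
    stored under `label_key`.
    """
    by_label = defaultdict(list)
    for record in records:
        by_label[record[label_key]].append(record)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for label in sorted(by_label):
        group = list(by_label[label])
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test
```

Applying the function a second time to the training portion would yield a validation set with the same label balance.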
The study leverages several approaches to enhance explainability:
- Corpus Linguistics: Analysis of language patterns in political discourse
- Tailored Lexicons: Custom lexicons for political language analysis
- Lexicogrammatical Rules: Rules based on linguistic structures
- SHAP Analysis: Quantifies the influence of linguistic features on model decisions
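One common way to turn per-sentence SHAP values into a global feature ranking is to average their absolute values. The sketch below assumes that approach; it is not taken from paper_b_12_shap_aggregate_and_rank.py. The function name `rank_by_mean_abs_shap` and the input layout (one dict of feature to SHAP value per sentence) are hypothetical.

```python
def rank_by_mean_abs_shap(shap_rows):
    """Rank features by mean absolute SHAP value, largest first.

    `shap_rows` is assumed to be a list of dicts, one per sentence, mapping
    a feature name to that sentence's SHAP value for the feature.
    """
    if not shap_rows:
        return []

    totals = {}
    for row in shap_rows:
        for feature, value in row.items():
            # Absolute value: the sign says direction, the size says influence.
            totals[feature] = totals.get(feature, 0.0) + abs(value)

    means = {feature: total / len(shap_rows) for feature, total in totals.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)
```

For two toy rows such as `{"pro_polarity": 0.5, "hedges": -0.125}` and `{"pro_polarity": -0.25, "hedges": 0.125}` (made-up numbers, not results from the paper), `pro_polarity` ranks first with a mean of 0.375 and `hedges` second with 0.125.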
The project identifies eight distinct linguistic features for stance classification:
- Positive affect
- Negative affect
- Pro polarity
- Con polarity
- Certainty
- Emphatics
- Doubt
- Hedges
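As a rough illustration of how these eight features could be turned into counts, the sketch below matches tokens against tiny lexicons. The word lists in `TOY_LEXICONS` and the function `count_stance_markers` are invented for this example; the actual markers live in the stance_markers_*.py modules and are likely richer (for instance, rule-based rather than plain word lists).

```python
import re

# Hypothetical two-word lexicons, one per feature, for illustration only.
TOY_LEXICONS = {
    "positive_affect": {"proud", "glad"},
    "negative_affect": {"afraid", "angry"},
    "pro_polarity": {"support", "endorse"},
    "con_polarity": {"oppose", "reject"},
    "certainty": {"certainly", "undoubtedly"},
    "emphatics": {"really", "absolutely"},
    "doubt": {"doubt", "unsure"},
    "hedges": {"perhaps", "maybe"},
}


def count_stance_markers(sentence, lexicons=TOY_LEXICONS):
    """Count how many tokens of the sentence fall into each feature lexicon."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return {
        feature: sum(token in words for token in tokens)
        for feature, words in lexicons.items()
    }
```

The resulting dictionary has one count per feature, which is the kind of sentence-level vector a logistic regression baseline could consume.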
- Dataset Preparation: Run the dataset extraction and preprocessing scripts
- Feature Extraction: Extract linguistic features using the feature extraction scripts
- Model Training: Train the SetFit model using paper_b_20_dl_setfit_train.py
- Model Evaluation: Evaluate the model using paper_b_23_dl_setfit_test.py
- Inference: Perform inference on new data using paper_b_22_dl_setfit_inference.py
- Explainability Analysis: Analyze feature importance using SHAP and LIME analysis scripts
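The training, evaluation, and inference steps can be chained from a small driver script. The sketch below is an assumption, not part of the repository: it presumes the three scripts run from the repository root without command-line arguments, and the helpers `build_commands` and `run_pipeline` are hypothetical.

```python
import subprocess
import sys

# Script names as listed in this README; the order follows the steps above.
SETFIT_STEPS = [
    "paper_b_20_dl_setfit_train.py",
    "paper_b_23_dl_setfit_test.py",
    "paper_b_22_dl_setfit_inference.py",
]


def build_commands(scripts, python=sys.executable):
    """Return one argument list per script, using the current interpreter."""
    return [[python, script] for script in scripts]


def run_pipeline(scripts):
    """Run each script in order and stop at the first failure."""
    for command in build_commands(scripts):
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(command, check=True)
```

Calling `run_pipeline(SETFIT_STEPS)` from the repository root would then execute the three stages in sequence. If the scripts expect arguments or configuration files, the command lists need to be extended accordingly.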
The project dependencies are listed in the requirements.txt file. Install them using:
pip install -r requirements.txt
Key dependencies include:
- setfit
- transformers
- sentence-transformers
- datasets
- scikit-learn
- shap
- spacy
- torch
- pandas
- matplotlib
The findings demonstrate the efficacy of few-shot learning in subjective stance classification and highlight the importance of linguistic features, particularly pro/con polarity and affective expressions. The StanceSentences dataset and the hybrid analytical approach offer a benchmark for future research, emphasizing the need for nuanced, multi-layered analysis in political discourse.
If you use this code or the findings in your research, please cite the original paper:
Reyes, J. F. (2024). Explainable Subjective Stance Classification with SetFit in Political Discourse. Institute of Computer Science, Brandenburgische Technische Universität Cottbus-Senftenberg.
This project is released under the MIT License.