Classification Experiments for A Comparative Study of Automatic Speech Act Classification - From Logistic Regression to GPT-4o

This repository contains the code and resources for the classification experiments described in the paper:

"A Comparative Study of Automatic Speech Act Classification - From Logistic Regression to GPT-4o"
[Anonymzied Authors for reviewing process]

Overview

This repository implements the classification experiments described in the paper, including:

Preprocessing and vectorization of the SPICE dataset
Baseline Code
Hyperparameter tuning on different models (Logistic Regression, Random Forest, XGBoost)
Using GPT API

Usage

Clone this repo:

git clone https://github.com/sophiakene/SpeechActClassification.git
cd repo-name

Preprocessing

Importantly, note that because of conflicting package versions, two different environments are needed. Example using conda:

** First environment for Preprocessing the data **

conda create -n preprocessing_env
conda activate preprocessing_env
pip install -r prep_requirements.txt

then run the cells in preprocessing.ipynb

Vectorizing

Prerequisites:

The SPICE Dataset: The data folder is expected to contain subfolders called SPICE Broadcast discussion, SPICE Broadcast interview, etc. as in the original dataset distribution. Each register subfolder contains two more folders for North and South Ireland, respectively, which in turn contain the annotated txt files.
Fill in the absolute or relative path to your data folder in the second code cell (directory = "...")
Jupyter Notebook or JupyterLab

Importantly, note that due to version incompabilities of libraries, another environment is needed for vectorizing the preprocessed data and running the classification scripts.

** Second environment for vectorization and classification experiments **

conda create -n preprocessing_env python=3.13.5
conda activate preprocessing_env
pip install -r prep_requirements.txt

Following that, you can run the preprocessing.ipynb Jupyter Notebook. Output:

preprocessed_data.csv: A CSV file containing the preprocessed data and meta data in a tabular format
Speech_Acts_Distribution.png: A bar plot with speech act counts

Classifying

Now you can run the cells of vectorize.ipynb
Input: preprocessed_data.csv (output from preprocessing
Outputs:

vectorized_data.npz
labels.npy
filenames.npy

Run Classification Experiments

baseline.ipynb
hyperparameter_tuning.ipynb
xgboost_oversampling.ipynb

Note that for the GPT classification and fine-tuning you need an API key and insert it in the following script:

zero-and-few-shot-clf.ipynb

The notebook gpt_fine_tuning.ipynb prepares the prompts and train, validate and test set for fine-tuning GPT-4o. The actual fine-tuning was carried out on platform.openai.com.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Classification Experiments for A Comparative Study of Automatic Speech Act Classification - From Logistic Regression to GPT-4o

Table of Contents

Overview

Usage

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
README.md		README.md
baseline.ipynb		baseline.ipynb
gpt_fine_tuning.ipynb		gpt_fine_tuning.ipynb
hyperparameter_tuning.ipynb		hyperparameter_tuning.ipynb
preprocessing.ipynb		preprocessing.ipynb
roberta_cv_full.py		roberta_cv_full.py
vectorize.ipynb		vectorize.ipynb
xgboost_oversampling.ipynb		xgboost_oversampling.ipynb
zero-and-few-shot-clf.ipynb		zero-and-few-shot-clf.ipynb

Folders and files

Latest commit

History

Repository files navigation

Classification Experiments for A Comparative Study of Automatic Speech Act Classification - From Logistic Regression to GPT-4o

Table of Contents

Overview

Usage

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages