Aspect-Based Sentiment Analysis (ABSA) for Healthcare Reviews

Project Overview

This repository implements Aspect-Based Sentiment Analysis (ABSA) for healthcare reviews using the CADEC (CSIRO Adverse Drug Event Corpus) dataset. The project analyzes healthcare and medication reviews at a fine-grained level, extracting sentiments about specific aspects of patient experiences (effectiveness, side effects, dosage, etc.) rather than just overall sentiment.

Features

Multiple Model Implementations:
- Domain-specific transformer-based ABSA model (DistilBERT) for binary (positive/non-positive) classification
- Baseline sentiment models (Naive Bayes, SVM)
- Performance comparison between approaches (including baseline vs ABSA format)
Comprehensive Healthcare Dataset:
- CADEC dataset with medication reviews and adverse drug events
- Aspect-level annotations and train/dev/test splits (default 70%/15%/15%)
- Cleaned and preprocessed datasets
Evaluation & Visualization:
- Accuracy, F1, and other metrics
- Performance and dataset visualizations (see dataset_stats/ and training_plots/)
- Detailed analysis of model results
Healthcare Domain Specificity:
- Medication efficacy and side effects analysis
- Patient-reported outcomes interpretation
- Healthcare-specific sentiment classification (binary: positive vs. non-positive)
Explainability:
- LIME and SHAP explanations for model predictions (see lime_explanations/ and shap_explanations/)
Interactive Dashboard:
- Streamlit dashboard for interactive ABSA exploration (streamlit_absa_dashboard.py)
Data Augmentation & Feature Engineering:
- Scripts for augmenting minority classes and extracting domain-specific features (used in models/absa/absa-cadec.py)

Repository Structure

aspectRx/
├── Cadec_Data/            # Raw CADEC dataset and metadata (Download separately)
├── Dataset/               # Processed CADEC dataset with ABSA annotations (Generated by script)
├── dataset_stats/         # Visualizations and statistics of dataset characteristics (Generated by scripts)
├── evaluation/            # Evaluation scripts and metrics
├── lime_explanations/     # LIME HTML explanations for model predictions
├── logs/                  # Training logs
├── models/                # Model implementations
│   ├── absa/              # Transformer-based ABSA models (absa-cadec.py)
│   ├── baseline/          # Naive Bayes baseline models
│   └── svm/               # SVM baseline models
├── results/               # Model checkpoints and evaluation results
├── scripts/               # Utility scripts for visualization and analysis
├── shap_explanations/     # SHAP visualizations for model predictions
├── training_plots/        # Training performance visualizations
├── utils/                 # Utility functions for data processing
└── streamlit_absa_dashboard.py # Streamlit dashboard for ABSA

Dataset

The project uses the CADEC (CSIRO Adverse Drug Event Corpus) dataset with healthcare and medication reviews containing aspect-based annotations:

- tokens: Tokenized medication review text
- absa1, absa2, absa3: Aspect annotations, each including:
  - Position indices of aspect terms
  - Aspect category (e.g., MEDICATION#EFFICACY, MEDICATION#SIDE-EFFECT, TREATMENT#DOSAGE)
  - Sentiment polarity (0=negative, 1=neutral, 2=positive)
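
The annotations carry three polarity codes, while the models described in this README are binary (positive vs. non-positive), so the labels have to be collapsed somewhere in the pipeline. A minimal sketch of one such mapping, assuming that both negative and neutral count as non-positive. The function name `to_binary_label` and the 1/0 encoding are illustrative and are not taken from the repository's code:

```python
# Polarity codes as documented for the absa1/absa2/absa3 annotations.
POLARITY_NAMES = {0: "negative", 1: "neutral", 2: "positive"}


def to_binary_label(polarity: int) -> int:
    """Collapse a 3-way polarity code into 1 (positive) or 0 (non-positive).

    Assumption: "non-positive" means negative or neutral.
    """
    if polarity not in POLARITY_NAMES:
        raise ValueError(f"unknown polarity code: {polarity!r}")
    return 1 if polarity == 2 else 0


if __name__ == "__main__":
    # Show how each documented code would be mapped under this assumption.
    for code, name in POLARITY_NAMES.items():
        print(f"{code} ({name}) -> {to_binary_label(code)}")
```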

The CADEC dataset is designed for research on adverse drug events and patient experiences with medications, making it ideal for healthcare-focused sentiment analysis applications.

Model Architecture

The primary ABSA model (models/absa/absa-cadec.py) uses DistilBERT, a lightweight transformer model, fine-tuned for aspect-based sentiment classification. The architecture includes:

- DistilBERT encoder for text representation
- Classification head for binary sentiment prediction (positive vs. non-positive)
- Custom data preprocessing for aspect extraction
- Domain-specific feature engineering (e.g., detecting side effects, benefits, negation)
- Class weighting to handle imbalanced data
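
To make the class-weighting item concrete, here is a minimal sketch of one common scheme, inverse-frequency ("balanced") weights, where each class gets `n_samples / (n_classes * class_count)`. Whether absa-cadec.py uses exactly this formula is an assumption; the helper name `balanced_class_weights` is hypothetical:

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} using n_samples / (n_classes * class_count).

    Rare classes receive weights above 1, frequent classes below 1.
    """
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


if __name__ == "__main__":
    # Three non-positive examples (0) and one positive example (1).
    print(balanced_class_weights([0, 0, 0, 1]))
```

Weights of this kind are typically handed to a weighted loss, for example the `weight` argument of `torch.nn.CrossEntropyLoss`, so that errors on the minority class cost more during fine-tuning.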

Data Processing

To process the CADEC dataset and generate ABSA-compatible files in the Dataset/ folder:

1. Download the CADEC dataset and place it in the Cadec_Data/ folder.
2. Run the processing script:
```
python scripts/process_cadec_data.py
```
- This script splits the data into training, validation, and test sets using a default ratio of 70%/15%/15% and a random seed of 42 for reproducibility.
- You can generate detailed dataset statistics and visualizations (saved to dataset_stats/) by adding the --generate-stats flag.
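
The split described above can be sketched as follows. This is an illustration of a seeded 70%/15%/15% shuffle-and-slice split, not the code in scripts/process_cadec_data.py; the function name `split_dataset` and its parameters are hypothetical, and the real script may split differently (for example, with stratification):

```python
import random


def split_dataset(items, train_frac=0.70, dev_frac=0.15, seed=42):
    """Shuffle items with a fixed seed and slice into train/dev/test lists.

    The test set receives whatever remains after train and dev are taken.
    """
    shuffled = list(items)
    # A dedicated Random instance keeps the split reproducible without
    # touching the global random state.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_dev = round(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test


if __name__ == "__main__":
    train, dev, test = split_dataset(range(100))
    print(len(train), len(dev), len(test))
```

Because the seed is fixed, repeated calls on the same input produce the same three partitions.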

Dataset Statistics

Detailed statistics and visualizations about the dataset (e.g., sentiment distribution, aspect category distribution, token lengths) can be found in the dataset_stats/ directory. These can be generated or updated using scripts:

# Option 1: Use the flag during initial processing
python scripts/process_cadec_data.py --generate-stats

# Option 2: Run dedicated analysis scripts (after processing)
python scripts/analyze_cadec_distribution.py
python evaluation/stats.py # Generates basic plots

Setup and Installation

# Clone the repository
git clone https://github.com/muhabdullahd/aspectRx.git
cd aspectRx

# Recommended: Create and activate a Python virtual environment
# python -m venv absa_env
# source absa_env/bin/activate  # On Windows use `absa_env\Scripts\activate`

# Install dependencies
pip install -r requirements.txt

# Download required NLTK data and spaCy model
python -m nltk.downloader punkt
python -m spacy download en_core_web_sm

Usage

Training the ABSA Model

The following commands train the main DistilBERT-based binary classification model:

cd models/absa
python absa-cadec.py

Training Baseline Models

# Naive Bayes baseline
cd models/baseline
python train_baseline.py

# OR SVM baseline
cd ../svm
python train_svm_baseline.py

Evaluation

The main ABSA model (absa-cadec.py) performs evaluation during training and saves metrics.
To evaluate the baseline models on the specific ABSA task format (binary classification):
```
cd evaluation
python test_baseline_on_absa.py # Tests Naive Bayes on ABSA format
```
General evaluation metrics can be calculated using:
```
cd evaluation
python evaluate_metrics.py
```
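
For reference, the accuracy and F1 figures reported for the binary task can be computed as in the sketch below. It is a plain-Python illustration of the standard definitions, not the contents of evaluation/evaluate_metrics.py; the function name `binary_metrics` is hypothetical, and label 1 is assumed to denote the positive class:

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("inputs must not be empty")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    print(binary_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
```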

Visualizing Results

cd scripts
python plot_training_stats.py # Plots metrics from training logs

Additional visualizations are generated during data processing/analysis (see dataset_stats/) and model training (see training_plots/).

Running the Streamlit Dashboard

streamlit run streamlit_absa_dashboard.py

Model Comparison with PyABSA

To evaluate the effectiveness of the domain-specific CADEC ABSA model, this project includes a script to compare its performance against a general-purpose ABSA model from the PyABSA library.

Script: `scripts/compare_with_pyabsa.py`

This script performs the following steps:

1. Runs PyABSA: Executes a pre-trained, general-purpose PyABSA model (Aspect Polarity Classification, APC) on the processed CADEC test set (Dataset/cadec_absa_test.tsv).
2. Loads Custom Model Results: Reads the evaluation metrics (Accuracy and F1 score) of the custom-trained CADEC ABSA model from results/cadec-absa/enhanced_results.json.
3. Generates Comparison Plot: Creates a bar chart comparing the Accuracy and F1 scores of the two models and saves it to results/absa_comparison.png.
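
The loading and comparison steps can be approximated with the sketch below. It is not the actual script: it prints a plain-text table instead of drawing the bar chart, and the JSON key names `accuracy` and `f1` are assumptions about the layout of enhanced_results.json. The helper names `load_scores` and `format_comparison` are hypothetical:

```python
import json
from pathlib import Path


def load_scores(path, keys=("accuracy", "f1")):
    """Read selected metrics from a JSON file (key names are assumptions)."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return {key: float(data[key]) for key in keys}


def format_comparison(scores_by_model):
    """Render {model: {metric: value}} as an aligned plain-text table."""
    metrics = sorted({m for scores in scores_by_model.values() for m in scores})
    header = f"{'model':<20}" + "".join(f"{m:>10}" for m in metrics)
    rows = [header]
    for model, scores in scores_by_model.items():
        # Metrics missing for a model are shown as nan.
        cells = "".join(f"{scores.get(m, float('nan')):>10.3f}" for m in metrics)
        rows.append(f"{model:<20}{cells}")
    return "\n".join(rows)
```

A caller would load one score dictionary per model (one from the custom model's JSON file, one from the PyABSA run) and pass both to `format_comparison` under descriptive model names.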

How to Run the Comparison

# Ensure you are in the aspectRx directory
# Activate your Python environment if you have one
# source absa_env/bin/activate 

python scripts/compare_with_pyabsa.py

This will output the metrics for both models and save the comparison plot.

Results

- Main CADEC ABSA model metrics: results/cadec-absa/metrics_cadec.json
- Baseline model metrics (general): results/metrics.json (may vary based on script run)
- Comparison plot: results/absa_comparison.png
- Comparison script input metrics: results/cadec-absa/enhanced_results.json
- Training visualizations: training_plots/
- Dataset statistics visualizations: dataset_stats/
- LIME and SHAP explanations: lime_explanations/ and shap_explanations/

License

This project is licensed under the terms of the LICENSE file included in the repository.

Acknowledgments

- Developed as a final project for a Machine Learning course.
- Uses the CADEC (CSIRO Adverse Drug Event Corpus) dataset of healthcare and medication reviews.

Notes

- The CADEC dataset is not included in this repository due to size constraints. Please download it separately and place it in Cadec_Data/.
- For more details on dataset processing and utilities, see the scripts/ and utils/ folders.
