This repository implements Aspect-Based Sentiment Analysis (ABSA) for healthcare reviews using the CADEC (CSIRO Adverse Drug Event Corpus) dataset. The project analyzes healthcare and medication reviews at a fine-grained level, extracting sentiments about specific aspects of patient experiences (effectiveness, side effects, dosage, etc.) rather than just overall sentiment.
- Multiple Model Implementations:
  - Domain-specific transformer-based ABSA models (DistilBERT), performing binary (positive/non-positive) classification
  - Baseline sentiment models (Naive Bayes, SVM)
  - Performance comparison between approaches (including baseline vs. ABSA format)
- Comprehensive Healthcare Dataset:
  - CADEC dataset with medication reviews and adverse drug events
  - Aspect-level annotations and robust train/dev/test splits (default 70%/15%/15%)
  - Cleaned and preprocessed datasets
- Evaluation & Visualization:
  - Accuracy, F1, and other metrics
  - Performance and dataset visualizations (see dataset_stats/ and training_plots/)
  - Detailed analysis of model results
- Healthcare Domain Specificity:
  - Medication efficacy and side effects analysis
  - Patient-reported outcomes interpretation
  - Healthcare-specific sentiment classification (binary: positive vs. non-positive)
- Explainability:
  - LIME and SHAP explanations for model predictions (see lime_explanations/ and shap_explanations/)
- Interactive Dashboard:
  - Streamlit dashboard for interactive ABSA exploration (streamlit_absa_dashboard.py)
- Data Augmentation & Feature Engineering:
  - Scripts for augmenting minority classes and extracting domain-specific features (used in models/absa/absa-cadec.py)
aspectRx/
├── Cadec_Data/ # Raw CADEC dataset and metadata (Download separately)
├── Dataset/ # Processed CADEC dataset with ABSA annotations (Generated by script)
├── dataset_stats/ # Visualizations and statistics of dataset characteristics (Generated by scripts)
├── evaluation/ # Evaluation scripts and metrics
├── lime_explanations/ # LIME HTML explanations for model predictions
├── logs/ # Training logs
├── models/ # Model implementations
│ ├── absa/ # Transformer-based ABSA models (absa-cadec.py)
│ ├── baseline/ # Naive Bayes baseline models
│ └── svm/ # SVM baseline models
├── results/ # Model checkpoints and evaluation results
├── scripts/ # Utility scripts for visualization and analysis
├── shap_explanations/ # SHAP visualizations for model predictions
├── training_plots/ # Training performance visualizations
├── utils/ # Utility functions for data processing
└── streamlit_absa_dashboard.py # Streamlit dashboard for ABSA
The project uses the CADEC (CSIRO Adverse Drug Event Corpus) dataset with healthcare and medication reviews containing aspect-based annotations:
- tokens: Tokenized medication review text
- absa1, absa2, absa3: Aspect annotations including:
  - Position indices of aspect terms
  - Aspect category (e.g., MEDICATION#EFFICACY, MEDICATION#SIDE-EFFECT, TREATMENT#DOSAGE)
  - Sentiment polarity (0=negative, 1=neutral, 2=positive)
The CADEC dataset is designed for research on adverse drug events and patient experiences with medications, making it ideal for healthcare-focused sentiment analysis applications.
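The annotations use three polarity codes, while the models described in this README classify positive vs. non-positive. The sketch below shows one way that mapping could work, assuming neutral and negative both fall into the non-positive class. The record layout (`aspects`, `span`) and the example sentence are hypothetical and only for illustration; the real column encoding is whatever scripts/process_cadec_data.py writes.

```python
# Polarity codes as documented above: 0 = negative, 1 = neutral, 2 = positive.
POSITIVE = 2


def to_binary_label(polarity: int) -> int:
    """Collapse a three-way polarity code into 1 (positive) or 0 (non-positive)."""
    if polarity not in (0, 1, 2):
        raise ValueError(f"unexpected polarity code: {polarity}")
    return 1 if polarity == POSITIVE else 0


# Hypothetical record, NOT the actual file format of the processed dataset.
# Each span is a (start, end) pair of token indices, end exclusive.
example = {
    "tokens": ["Lipitor", "lowered", "my", "cholesterol",
               "but", "caused", "muscle", "pain"],
    "aspects": [
        {"span": (1, 4), "category": "MEDICATION#EFFICACY", "polarity": 2},
        {"span": (5, 8), "category": "MEDICATION#SIDE-EFFECT", "polarity": 0},
    ],
}


def aspect_examples(record):
    """Return (aspect term, category, binary label) for each annotated aspect."""
    rows = []
    for aspect in record["aspects"]:
        start, end = aspect["span"]
        term = " ".join(record["tokens"][start:end])
        rows.append((term, aspect["category"], to_binary_label(aspect["polarity"])))
    return rows
```

Calling `aspect_examples(example)` yields one training example per annotated aspect, so a single review can contribute both a positive and a non-positive example.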
The primary ABSA model (models/absa/absa-cadec.py) uses DistilBERT, a lightweight transformer model, fine-tuned for aspect-based sentiment classification. The architecture includes:
- DistilBERT encoder for text representation
- Classification head for binary sentiment prediction (positive vs. non-positive)
- Custom data preprocessing for aspect extraction
- Domain-specific feature engineering (e.g., detecting side effects, benefits, negation)
- Class weighting to handle imbalanced data
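This README does not say which weighting scheme absa-cadec.py uses. As a minimal sketch, the common inverse-frequency ("balanced") heuristic assigns each class the weight n_samples / (n_classes * class_count), so the rarer class counts for more in the loss. The function name and the toy label list are assumptions for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in sorted(counts.items())}


# Toy labels: 8 non-positive (0) and 2 positive (1) examples.
toy_labels = [0] * 8 + [1] * 2
toy_weights = balanced_class_weights(toy_labels)
print(toy_weights)  # {0: 0.625, 1: 2.5}
```

Weights like these are typically passed to a weighted cross-entropy loss (for example the `weight` argument of `torch.nn.CrossEntropyLoss`), which makes an error on the minority class cost more than an error on the majority class.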
To process the CADEC dataset and generate ABSA-compatible files in the Dataset/ folder:
- Download the CADEC dataset and place it in the Cadec_Data/ folder.
- Run the processing script:
python scripts/process_cadec_data.py
- This script splits the data into training, validation, and test sets using a default ratio of 70%/15%/15% and a random seed of 42 for reproducibility.
- You can generate detailed dataset statistics and visualizations (saved to dataset_stats/) by adding the --generate-stats flag.
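The seeded 70%/15%/15% split can be sketched as below. This is an illustration of the idea, not the code in scripts/process_cadec_data.py: the function name, the shuffle-then-slice approach, and the rounding are assumptions.

```python
import random


def split_dataset(records, train=0.70, dev=0.15, seed=42):
    """Shuffle with a fixed seed, then slice into train/dev/test partitions."""
    items = list(records)
    random.Random(seed).shuffle(items)  # private RNG, so global state is untouched
    n_train = round(len(items) * train)
    n_dev = round(len(items) * dev)
    train_part = items[:n_train]
    dev_part = items[n_train:n_train + n_dev]
    test_part = items[n_train + n_dev:]  # whatever remains, roughly 15%
    return train_part, dev_part, test_part


train_set, dev_set, test_set = split_dataset(range(100))
print(len(train_set), len(dev_set), len(test_set))  # 70 15 15
```

Because the seed is fixed, repeated runs on the same input produce the same partitions, which is what makes results comparable across experiments.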
Detailed statistics and visualizations about the dataset (e.g., sentiment distribution, aspect category distribution, token lengths) can be found in the dataset_stats/ directory. These can be generated or updated using scripts:
# Option 1: Use the flag during initial processing
python scripts/process_cadec_data.py --generate-stats
# Option 2: Run dedicated analysis scripts (after processing)
python scripts/analyze_cadec_distribution.py
python evaluation/stats.py # Generates basic plots
# Clone the repository
git clone https://github.com/muhabdullahd/aspectRx.git
cd aspectRx
# Recommended: Create and activate a Python virtual environment
# python -m venv absa_env
# source absa_env/bin/activate # On Windows use `absa_env\\Scripts\\activate`
# Install dependencies
pip install -r requirements.txt
# Download required NLTK data and spaCy model
python -m nltk.downloader punkt
python -m spacy download en_core_web_sm
The following trains the main DistilBERT-based binary classification model:
cd models/absa
python absa-cadec.py
# Naive Bayes baseline
cd models/baseline
python train_baseline.py
# OR SVM baseline
cd ../svm
python train_svm_baseline.py
- The main ABSA model (absa-cadec.py) performs evaluation during training and saves metrics.
- To evaluate the baseline models on the specific ABSA task format (binary classification):
cd evaluation
python test_baseline_on_absa.py # Tests Naive Bayes on ABSA format
- General evaluation metrics can be calculated using:
cd evaluation
python evaluate_metrics.py
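For reference, accuracy and F1 on the binary task can be computed as in the sketch below. It is written independently of evaluate_metrics.py, whose exact metrics and averaging are not described here; the function name and the toy labels are assumptions, and F1 is computed for the positive class only.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy plus precision, recall, and F1 for the positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = correct / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels for illustration only, not project results.
toy_metrics = binary_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
```

On imbalanced data, accuracy can look good even when the minority class is mostly missed, which is why F1 is reported alongside it.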
cd scripts
python plot_training_stats.py # Plots metrics from training logs
- Additional visualizations are generated during data processing/analysis (see dataset_stats/) and model training (see training_plots/).
streamlit run streamlit_absa_dashboard.py
To evaluate the effectiveness of the domain-specific CADEC ABSA model, this project includes a script to compare its performance against a general-purpose ABSA model from the PyABSA library.
This script performs the following steps:
- Runs PyABSA: Executes a pre-trained, general-purpose PyABSA model (Aspect Polarity Classification, APC) on the processed CADEC test set (Dataset/cadec_absa_test.tsv).
- Loads Custom Model Results: Reads the evaluation metrics (accuracy and F1 score) of the custom-trained CADEC ABSA model from results/cadec-absa/enhanced_results.json.
- Generates Comparison Plot: Creates a bar chart comparing the accuracy and F1 scores of the two models and saves it to results/absa_comparison.png.
# Ensure you are in the aspectRx directory
# Activate your Python environment if you have one
# source absa_env/bin/activate
python scripts/compare_with_pyabsa.py
This will output the metrics for both models and save the comparison plot.
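The metric-loading and comparison steps of that script can be approximated as follows. This is a sketch, not the contents of compare_with_pyabsa.py: the JSON key names `accuracy` and `f1` are assumptions about the results files, the helper names are hypothetical, and the plotting step is omitted.

```python
import json
from pathlib import Path

# Assumed key names; check the actual JSON files before relying on them.
METRIC_KEYS = ("accuracy", "f1")


def metrics_from_json(text, keys=METRIC_KEYS):
    """Parse a JSON document and keep only the requested metric keys."""
    data = json.loads(text)
    missing = [k for k in keys if k not in data]
    if missing:
        raise KeyError(f"missing metric keys: {missing}")
    return {k: float(data[k]) for k in keys}


def load_metrics(path, keys=METRIC_KEYS):
    """Read metrics from a results file such as enhanced_results.json."""
    return metrics_from_json(Path(path).read_text(encoding="utf-8"), keys)


def metric_deltas(custom, general):
    """Custom minus general per shared metric; positive favors the custom model."""
    return {k: custom[k] - general[k] for k in custom if k in general}
```

With the repository's results in place, `load_metrics("results/cadec-absa/enhanced_results.json")` would supply the custom model's side of the comparison, provided the file uses those key names.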
- Main CADEC ABSA model metrics: results/cadec-absa/metrics_cadec.json
- Baseline model metrics (general): results/metrics.json (may vary based on script run)
- Comparison plot: results/absa_comparison.png
- Comparison script input metrics: results/cadec-absa/enhanced_results.json
- Training visualizations: training_plots/
- Dataset statistics visualizations: dataset_stats/
- LIME and SHAP explanations in their respective folders (lime_explanations/, shap_explanations/)
This project is licensed under the terms of the LICENSE file included in the repository.
- Developed as a final project for a Machine Learning course
- Utilizes the CADEC (CSIRO Adverse Drug Event Corpus) dataset for healthcare and medication reviews
- The CADEC dataset is not included in this repository due to size constraints. Please download it separately and place it in Cadec_Data/.
- For more details on dataset processing and utilities, see the scripts/ and utils/ folders.