This repository contains our code for the Computational Intelligence Lab (CIL) 2025 course project at ETH Zurich on sentiment analysis. The goal of this project is to implement and evaluate various machine learning pipelines for sentiment classification on a given dataset.
We use Python 3.12.3 with the following dependencies:

```shell
pip install -r requirements.txt
```

To run a pipeline, use the command

```shell
python scripts/run_pipeline.py --config config/<config_file>.yaml
```

We have implemented the following pipelines:
- `classical_ml_bow_*.yaml`: bag-of-words embeddings + classical machine learning models (e.g., logistic regression, random forest, SVM, XGBoost)
- `mlp_head.yaml`: pretrained embeddings + MLP head
- `boosted_mlp_head.yaml`: pretrained embeddings + boosted MLP head
- `pretrained_classifier.yaml`: pretrained language models (inference-only)
- `finetuned_classifier.yaml`: finetuned language models
- `mixture_of_experts.yaml`: multiple pretrained embeddings + mixture-of-experts module
For some pipelines, we use embeddings extracted from pretrained models. To extract and save these embeddings to the cache, use the `save_embeddings.py` script.
- To extract and save embeddings from `SentenceTransformer` models to the cache, run

  ```shell
  python scripts/save_embeddings.py --cache <cache_dir> --pipeline sentencetransformer --model <model_name>
  ```
- To extract and save embeddings and predictions from HuggingFace models to the cache, run

  ```shell
  python scripts/save_embeddings.py --cache <cache_dir> --pipeline huggingface --model <model_name>
  ```
To reproduce our final submission with a train score of 0.96351 and validation score of 0.90646, first finetune `FacebookAI/roberta-large` with the following command:

```shell
python scripts/run_pipeline.py --config config/finetuned_classifier.yaml
```

Then update `config/finetuned_classifier.yaml` with the following parameters:
- `pipeline.model.pretrained_model_name_or_path`: change it to your last checkpoint path
- `pipeline.preprocessing.difficulty_filter`: uncomment
- `pipeline.trainer.learning_rate`: change it to `1e-6`
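After editing, the relevant entries of `config/finetuned_classifier.yaml` might look like the sketch below. The nesting is inferred from the dotted key names above, and the checkpoint path is a placeholder for your own run's output:

```yaml
pipeline:
  model:
    pretrained_model_name_or_path: <path/to/your/last/checkpoint>
  preprocessing:
    difficulty_filter: true  # previously commented out
  trainer:
    learning_rate: 1e-6
```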
Finally, post-tune your finetuned model with the following command:
```shell
python scripts/run_pipeline.py --config config/finetuned_classifier.yaml
```

If you plan to contribute to this repository, run
```shell
pip install -r requirements_dev.txt
pre-commit install
nbstripout --install
```
to install the pre-commit and nbstripout hooks.
To contribute to this repository, please work on a branch named `<name>/<description>` and create pull requests.
To add new pipelines, create the following two files:

- `config/<config_file>.yaml`: the configuration file with the module name of the pipeline and all hyperparameters.
- `src/pipelines/<module_file>.py`: the module file with the pipeline definition.
To save intermediate outputs of expensive function calls to the cache, you can use the `CACHE` object provided by the `cache.py` module. To specifically save and load embeddings from the cache, you can use the `save_embeddings` and `load_embeddings` wrappers around the `CACHE` object.
To save custom embeddings to the cache, use
```python
from cache import CACHE, save_embeddings

CACHE.init(cache_dir=<cache_dir>)
save_embeddings(embeddings, <pipeline_name>, <model_name>)
```

To load the saved embeddings from the cache, use
```python
from cache import CACHE, load_embeddings

CACHE.init(cache_dir=<cache_dir>)
embeddings = load_embeddings(<pipeline_name>, <model_name>)
```

To cache the output of a function call, use
```python
from cache import CACHE

CACHE.init(cache_dir=<cache_dir>)
y = f(x)  # no cache
y = CACHE(lambda: f(x), "y.npz")  # cached to <cache_dir>/y.npz
```

Depending on the provided file ending, `CACHE` expects the following return type from the function call:
- `.npy`: expects `np.ndarray`
- `.npz`: expects `dict[str, np.ndarray]`
- `.pt`: expects `torch.Tensor`
- `.csv`: expects `pd.DataFrame`
- `.pkl`: expects any object
The following individuals contributed equally to this project:
- Redhardt, Florian - GitHub Profile
- Siebert, Tavis - GitHub Profile
- Stante, Samuel - GitHub Profile
- Yang, Daniel - GitHub Profile