This repository contains our code for the Computational Intelligence Lab (CIL) 2025 course project at ETH Zurich on sentiment analysis. The goal of this project is to implement and evaluate various machine learning pipelines for sentiment classification on a given dataset.
We use Python 3.12.3 with the following dependencies:

```shell
pip install -r requirements.txt
```

To run a pipeline, use the command

```shell
python scripts/run_pipeline.py --config config/<config_file>.yaml
```

We have implemented the following pipelines:
- `classical_ml_bow_*.yaml`: bag-of-words embeddings + classical machine learning models (e.g., logistic regression, random forest, SVM, XGBoost)
- `mlp_head.yaml`: pretrained embeddings + MLP head
- `boosted_mlp_head.yaml`: pretrained embeddings + boosted MLP head
- `pretrained_classifier.yaml`: pretrained language models (inference-only)
- `finetuned_classifier.yaml`: finetuned language models
- `mixture_of_experts.yaml`: multiple pretrained embeddings + mixture-of-experts module
For some pipelines, we use embeddings extracted from pretrained models. To extract and save these embeddings to the cache, use the `save_embeddings.py` script.
- To extract and save embeddings from `SentenceTransformer` models to the cache, run

  ```shell
  python scripts/save_embeddings.py --cache <cache_dir> --pipeline sentencetransformer --model <model_name>
  ```
- To extract and save embeddings and predictions from HuggingFace models to the cache, run

  ```shell
  python scripts/save_embeddings.py --cache <cache_dir> --pipeline huggingface --model <model_name>
  ```
To reproduce our final submission with a train score of 0.96351 and validation score of 0.90646, first finetune `FacebookAI/roberta-large` with the following command:

```shell
python scripts/run_pipeline.py --config config/finetuned_classifier.yaml
```

Then update `config/finetuned_classifier.yaml` with the following parameters:
- `pipeline.model.pretrained_model_name_or_path`: change it to your last checkpoint path
- `pipeline.preprocessing.difficulty_filter`: uncomment
- `pipeline.trainer.learning_rate`: change it to `1e-6`
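After editing, the relevant entries of `config/finetuned_classifier.yaml` might look like the sketch below. The nesting is inferred from the dotted key names above, and the checkpoint path is a placeholder for your own run's output:

```yaml
pipeline:
  model:
    pretrained_model_name_or_path: <path/to/your/last/checkpoint>
  preprocessing:
    difficulty_filter: true  # previously commented out
  trainer:
    learning_rate: 1e-6
```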
Finally, post-tune your finetuned model with the following command:
```shell
python scripts/run_pipeline.py --config config/finetuned_classifier.yaml
```

If you plan to contribute to this repository, run
```shell
pip install -r requirements_dev.txt
pre-commit install
nbstripout --install
```
to install the pre-commit and nbstripout hooks.
To contribute to this repository, please work on a branch named `<name>/<description>` and create pull requests.
To add new pipelines, create the following two files:

- `config/<config_file>.yaml`: the configuration file with the module name of the pipeline and all hyperparameters.
- `src/pipelines/<module_file>.py`: the module file with the pipeline definition.
To save intermediate outputs of expensive function calls to the cache, you can use the `CACHE` object provided by the `cache.py` module. To specifically save and load embeddings from the cache, you can use the `save_embeddings` and `load_embeddings` wrappers around the `CACHE` object.
To save custom embeddings to the cache, use
```python
from cache import CACHE, save_embeddings

CACHE.init(cache_dir=<cache_dir>)
save_embeddings(embeddings, <pipeline_name>, <model_name>)
```

To load the saved embeddings from the cache, use
```python
from cache import CACHE, load_embeddings

CACHE.init(cache_dir=<cache_dir>)
embeddings = load_embeddings(<pipeline_name>, <model_name>)
```

To cache the output of a function call, use
```python
from cache import CACHE

CACHE.init(cache_dir=<cache_dir>)
y = f(x)  # no cache
y = CACHE(lambda: f(x), "y.npz")  # cached to <cache_dir>/y.npz
```

Depending on the provided file ending, `CACHE` expects the following return type from the function call:
- `.npy`: expects `np.ndarray`
- `.npz`: expects `dict[str, np.ndarray]`
- `.pt`: expects `torch.Tensor`
- `.csv`: expects `pd.DataFrame`
- `.pkl`: expects any object
The following individuals contributed equally to this project:
- Redhardt, Florian - GitHub Profile
- Siebert, Tavis - GitHub Profile
- Stante, Samuel - GitHub Profile
- Yang, Daniel - GitHub Profile