Code for the practical contributions of my Master's Thesis "Fine-Tuning Embedding Models for Sustainability Analysis on Public Procurement"
As a heads-up, the cluster I used to run the experiments had very limited memory. For this reason I placed the data subfolder and any additional files, such as models, on a separate disk. You will therefore have to adjust some path names.
To train an embedding model, head to the training script of the corresponding loss function in train_scripts. Hyperparameters and the path to the desired training data need to be adjusted in the scripts themselves, with the exception of TSDAE_training.py, which takes the data path as a command line argument; see the sketch below.
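As a rough sketch of what an invocation looks like (the script locations within train_scripts and the exact argument handling of TSDAE_training.py may differ; the data path is a placeholder):

```
# adjust hyperparameters and data paths inside the script, then run it directly, e.g.:
python train_scripts/CL_training.py

# TSDAE_training.py instead takes the data path on the command line (assumed positional here):
python train_scripts/TSDAE_training.py /path/to/training_data
```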
To run the evaluation benchmarks, use the eval.py file. It is configured via command line arguments. Run the following to get an overview of the available arguments and what they do:

```
python eval.py --help
```

Note that you currently cannot set a custom output path, as the output path is used to prevent re-running the benchmarks on models that have already been evaluated.
The code to generate synthetic data is in notebooks/GenerateSentenceForCriteria.ipynb. You will find more information within the notebook.
I use uv as my Python package manager; the pyproject.toml and uv.lock files should let you set up a Python environment with the required dependencies.
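With uv installed, running the following from the repository root should recreate the environment from the lock file:

```
uv sync
```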
Alternatively, the requirements.txt file can be used:

```
pip install -r requirements.txt
```

The APIs that provide access to the LLMs for data generation and LLM-based evaluations require API keys. I set TOGETHER_API_KEY and OPENROUTER_API_KEY in a .env file and load them from there. Alternatively, you can modify the code directly to load your keys.
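A minimal .env file would look like the following (the values are placeholders; place the file where the code expects it, typically the repository root):

```
TOGETHER_API_KEY=your_together_api_key
OPENROUTER_API_KEY=your_openrouter_api_key
```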
Similarly, some embedding models require you to request access on Hugging Face and then authenticate, either with huggingface-cli or with an access token.
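For example, either of the following should work (the token value is a placeholder):

```
huggingface-cli login
# or provide an access token via the standard Hugging Face environment variable:
export HF_TOKEN=hf_your_token_here
```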
- data
- This folder contains the data I used for my thesis, both data I obtained from the Sinergia research project and data I generated synthetically. More details can be found in its own README.
- evaluation
- beir_spare_search.py: Fix of BEIR's sparse search class required for sparse encoder support
- expert_based_evaluation.py: Implementation of the Expert Sentence Retrieval (ESR) and Expert Criterion Retrieval (ECR) benchmarks
- LLM_based_evaluation.py: Implementation of LLM-Score (LLM-S) and LLM-Comparison (LLM-C) benchmarks
- synthetic_based_evaluation.py: Implementation of the Synthetic Sentence Retrieval (SSR) benchmark, which supports sparse encoders, bi-encoders, cross-encoders, and retrieve-and-rerank pipelines combining two of them.
- utils_LLM_as_a_Judge.py: Helper methods specific to LLM calls required in LLM-S and LLM-C.
- utils.py: General helper methods for all benchmark implementations.
- notebooks:
- human_evaluation_experiment: Files generated and used for the final human evaluation experiment.
- CreatePlots.ipynb: Code used to evaluate embedding models and to generate plots and tables.
- EvaluationMethodAnalysis.ipynb: Notebook used to evaluate the different evaluation benchmarks and take a deeper look into SSR. This is where the files for human_evaluation_experiment are generated and evaluated.
- gemini_annotation_dataset.ipynb: Notebook to generate different training datasets in the corresponding format and with the desired compositions.
- GenerateSentenceForCriteria.ipynb: Notebook for synthetic data generation. It covers only the first approach, the one I implemented myself.
- LLM_as_a_Judge_Execution.ipynb: Notebook for the LLM-as-a-Judge evaluation of generated samples.
- LLM_as_a_Judge_Setup.ipynb: Notebook used to test different LLM-as-a-Judge implementations and compute their Pearson correlation with a small set of manually evaluated samples.
- SentenceSplitting.ipynb: An implementation of my own sentence splitting component, which was temporarily used and then replaced by the more sophisticated one from the research group.
- Sparse_encoding.ipynb: Alternative implementation of sparse-encoder evaluation on SSR, used instead of fixing BEIR's sparse_search class to add support to the usual SSR implementation.
- TrainAndEval.ipynb: Notebook used to implement and test fine-tuning methods before transferring them into the train_scripts folder.
- train_scripts: Generally adaptations of examples found in the sentence-transformers library.
- CrossEncoder:
- BCE_training.py: Binary Cross Entropy Loss (BCE) fine-tuning script for cross-encoders on a custom dataset.
- LambdaLoss_training.py: LambdaLoss fine-tuning script for cross-encoders on a custom dataset.
- SparseEncoder:
- SpladeLoss_training.py: SpladeLoss fine-tuning script to train a custom sparse encoder. It uses MNRL as the main loss function and trains EuroBERT on a custom dataset.
- CL_training.py: Contrastive Loss (CL) fine-tuning script for bi-encoders on our generated dataset.
- MNRL_custom_batching_training.py: Multiple Negatives Ranking Loss (MNRL) fine-tuning script with my custom batching strategy, which avoids false in-batch negatives at the cost of limiting the batch size to 15.
- MNRL_standard_batching_training.py: MNRL fine-tuning script with random batching that allows for higher batch size.
- TSDAE_training.py: TSDAE training script, used either for further pretraining or as a fine-tuning method on its own.
- eval.py: The main python script used to run the evaluation benchmarks. It allows for efficient evaluation of trained models via the command line with multiple arguments.