chore(deps): update dependency sentence-transformers to v3 (#13)
Open

mend-for-github-com[bot] wants to merge 1 commit into `master`.
This PR contains the following updates:
sentence-transformers `^2.7.0` → `^3.0.0`

By merging this PR, the below vulnerabilities will be automatically resolved:
Release Notes
huggingface/sentence-transformers (sentence-transformers)
v3.0.0 - Sentence Transformer Training Refactor; new similarity methods; hyperparameter optimization; 50+ datasets release (Compare Source)
This release consists of a major refactor that overhauls the training approach (introducing multi-GPU training, bf16, loss logging, callbacks, and much more), adds convenient `similarity` and `similarity_pairwise` methods, adds extra keyword arguments, introduces Hyperparameter Optimization, and includes a massive reformatting and release of 50+ datasets for training embedding models. In total, this is the largest Sentence Transformers update since the project was first created. Install this version with `pip install sentence-transformers==3.0.0`.
Sentence Transformer training refactor (#2449)
The v3.0 release centers around this huge modernization of the training approach for `SentenceTransformer` models. Whereas training before v3.0 used to be all about `InputExample`, `DataLoader` and `model.fit`, the new training approach relies on 5 new components. You can learn more about these components in our Training and Finetuning Embedding Models with Sentence Transformers v3 blogpost. Additionally, you can read the new Training Overview, check out the Training Examples, or read this summary:

- A training `Dataset` or `DatasetDict`. This class is much more suited for sharing & efficient modifications than lists/DataLoaders of `InputExample` instances. A `Dataset` can contain multiple text columns that will be fed in order to the corresponding loss function. So, if the loss expects (anchor, positive, negative) triplets, then your dataset should also have 3 columns. The names of these columns are irrelevant. If there is a "label" or "score" column, it is treated separately and used as the labels during training. A `DatasetDict` can be used to train with multiple datasets at once. If a `DatasetDict` is used, the `loss` parameter to the `SentenceTransformerTrainer` must also be a dictionary with these dataset keys, e.g. `{'multi_nli': SoftmaxLoss(...), 'snli': SoftmaxLoss(...), 'stsb': CosineSimilarityLoss(...)}`.
- A loss function, or a dictionary of loss functions as described above. These loss functions do not require changes compared to before this PR.
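The key-matching rule between the dataset dictionary and the loss dictionary can be sketched with plain Python dictionaries. This is a stdlib-only illustration, not the library's API: `validate_losses` and the placeholder loss names are hypothetical.

```python
# Illustrative sketch: when training on multiple datasets at once, the loss
# mapping must provide exactly one loss per dataset key.

def validate_losses(dataset_dict: dict, loss_dict: dict) -> dict:
    """Check that every dataset key has a matching loss entry."""
    missing = set(dataset_dict) - set(loss_dict)
    extra = set(loss_dict) - set(dataset_dict)
    if missing or extra:
        raise ValueError(f"mismatched keys: missing={missing}, extra={extra}")
    return {name: loss_dict[name] for name in dataset_dict}

# Stand-ins for the datasets and loss functions:
datasets = {"multi_nli": [], "snli": [], "stsb": []}
losses = {
    "multi_nli": "SoftmaxLoss",
    "snli": "SoftmaxLoss",
    "stsb": "CosineSimilarityLoss",
}

paired = validate_losses(datasets, losses)
print(sorted(paired))  # ['multi_nli', 'snli', 'stsb']
```

A mismatched pair of dictionaries (e.g. a dataset with no loss entry) raises immediately, which mirrors the constraint described above.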
- A `SentenceTransformerTrainingArguments` instance, a subclass of the `transformers` `TrainingArguments`. This powerful class controls the specific details of the training.
- An optional `SentenceEvaluator` instance. Unlike before, models can now be evaluated both on an evaluation dataset with some loss function and/or a `SentenceEvaluator` instance.
- The new `SentenceTransformerTrainer` instance, based on the `transformers` `Trainer`. This instance is provided with a SentenceTransformer model, a `SentenceTransformerTrainingArguments` class, a `SentenceEvaluator`, a training and evaluation `Dataset`/`DatasetDict` and a loss function/dict of loss functions. Most of these parameters are optional. Once provided, all you have to do is call `trainer.train()`.

Some of the major features that are now implemented include:
This script is a minimal example (no evaluator, no training arguments) of training `mpnet-base` on a part of the `all-nli` dataset using `MultipleNegativesRankingLoss`. Additionally, trained models now automatically produce extensive model cards. Each of the following models were trained using some script from the Training Examples, and the model cards were not edited manually whatsoever:
Prior to the Sentence Transformers v3 release, all models would be trained using the `SentenceTransformer.fit` method. Rather than deprecating this method, starting from v3.0 it will use the `SentenceTransformerTrainer` behind the scenes. This means that your old training code should still work, and is even upgraded with new features such as multi-GPU training, loss logging, etc. That said, the new training approach is much more powerful, so it is recommended to write new training scripts using it. Many of the old training scripts were updated to use the new Trainer-based approach, but not all have been updated yet; we welcome Pull Requests that help update the remaining scripts.
Similarity Score (#2615, #2490)
Sentence Transformers v3.0 introduces two new useful methods, `similarity` and `similarity_pairwise`, and one property, `similarity_fn_name`.
These can be used to calculate the similarity between embeddings, and to specify which similarity function should be used, for example:
Additionally, you can compute the similarity between pairs of embeddings, resulting in a 1-dimensional vector of similarities rather than a 2-dimensional matrix:
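The matrix-versus-pairwise distinction can be illustrated with a stdlib-only cosine similarity. The helper names below are hypothetical stand-ins for the library's `model.similarity` and `model.similarity_pairwise` methods:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def similarity(embs1, embs2):
    """All pairs: returns a len(embs1) x len(embs2) matrix."""
    return [[cosine(a, b) for b in embs2] for a in embs1]

def similarity_pairwise(embs1, embs2):
    """Element-wise: returns a 1-D list with one score per aligned pair."""
    return [cosine(a, b) for a, b in zip(embs1, embs2)]

embs1 = [[1.0, 0.0], [0.0, 1.0]]
embs2 = [[1.0, 0.0], [1.0, 1.0]]

matrix = similarity(embs1, embs2)             # 2 x 2 matrix
pairwise = similarity_pairwise(embs1, embs2)  # length-2 vector
```

Note that the pairwise variant requires both inputs to have the same length, since the i-th embedding of one list is compared only against the i-th embedding of the other.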
The `similarity_fn_name` can now be specified via the `SentenceTransformer` constructor like so: valid options include "cosine" (the default), "dot", "euclidean" and "manhattan". The chosen `similarity_fn_name` will also be saved into the model configuration and loaded automatically. For example, the `msmarco-distilbert-dot-v5` model was trained to work best with `dot`, so it is configured to use that `similarity_fn_name` in its configuration. Big thanks to @ir2718 for helping set up this major feature.
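As a rough stdlib-only illustration of the four options, here is a hypothetical score function (not the library's implementation). One assumption worth flagging: the distance-based options are negated here so that higher always means more similar, which should be verified against the library's documentation.

```python
import math

def score(a, b, similarity_fn_name="cosine"):
    # Hypothetical stand-in for the library's similarity functions.
    if similarity_fn_name == "cosine":
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))
    if similarity_fn_name == "dot":
        return sum(x * y for x, y in zip(a, b))
    if similarity_fn_name == "euclidean":
        return -math.dist(a, b)  # negated distance: higher = more similar
    if similarity_fn_name == "manhattan":
        return -sum(abs(x - y) for x, y in zip(a, b))
    raise ValueError(f"unknown similarity_fn_name: {similarity_fn_name}")

a, b = [1.0, 2.0], [2.0, 4.0]
print(score(a, b, "cosine"))  # 1.0 (parallel vectors)
print(score(a, b, "dot"))     # 10.0
```

The example also shows why the choice matters: the same pair of vectors is "perfectly similar" under cosine but scores very differently under dot product or distance-based functions.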
Allow passing `model_kwargs`, `tokenizer_kwargs`, and `config_kwargs` to `SentenceTransformer` (#2578)

Those familiar with the internals of Sentence Transformers might know that we internally call `AutoModel.from_pretrained`, `AutoTokenizer.from_pretrained` and `AutoConfig.from_pretrained` from `transformers`. Each of these is rather powerful, and they are constantly improved with new features. For example, the `AutoModel` keyword arguments include:

- `torch_dtype` - this allows you to immediately load a model in `bfloat16` or `float16` (or `"auto"`, i.e. whatever dtype the model was stored in), which can speed up inference a lot.
- `quantization_config`
- `attn_implementation` - all models support `"eager"`, but some also support the much faster `"flash_attention_2"` (Flash Attention 2) and `"sdpa"` (Scaled Dot Product Attention).

These options allow for speeding up model inference. Additionally, via `AutoConfig` you can update the model configuration, e.g. the dropout probability during training, and with `AutoTokenizer` you can disable the fast Rust-based tokenizer via `use_fast=False` if you're having issues with it. Because these options can be so useful, the following arguments are added to `SentenceTransformer`:

- `model_kwargs` for `AutoModel.from_pretrained` keyword arguments
- `tokenizer_kwargs` for `AutoTokenizer.from_pretrained` keyword arguments
- `config_kwargs` for `AutoConfig.from_pretrained` keyword arguments

You can use it like so:
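The feature boils down to plain keyword-argument forwarding. Below is a stdlib-only sketch of that pattern; every name here is a hypothetical stand-in, not the actual `transformers` or `sentence-transformers` classes:

```python
def load_model(name, **kwargs):      # stand-in for AutoModel.from_pretrained
    return {"name": name, **kwargs}

def load_tokenizer(name, **kwargs):  # stand-in for AutoTokenizer.from_pretrained
    return {"name": name, **kwargs}

class ToySentenceTransformer:
    """Forwards the three kwarg dicts to the respective loaders."""

    def __init__(self, name, model_kwargs=None, tokenizer_kwargs=None,
                 config_kwargs=None):
        self.model = load_model(name, **(model_kwargs or {}))
        self.tokenizer = load_tokenizer(name, **(tokenizer_kwargs or {}))
        self.config = {"name": name, **(config_kwargs or {})}

st = ToySentenceTransformer(
    "some/model",
    model_kwargs={"torch_dtype": "bfloat16", "attn_implementation": "sdpa"},
    tokenizer_kwargs={"use_fast": False},
    config_kwargs={"hidden_dropout_prob": 0.2},
)
print(st.model["torch_dtype"])  # bfloat16
```

The design point is that Sentence Transformers does not need to re-implement or whitelist each upstream option: new `from_pretrained` features become usable automatically through these pass-through dicts.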
Big thanks to @satyamk7054 for starting this work.
Hyperparameter Optimization (#2655)
Sentence Transformers v3.0 introduces Hyperparameter Optimization (HPO) by extending the `transformers` HPO support. We recommend reading the all-new Hyperparameter Optimization documentation for many more details.

Datasets Release
Alongside Sentence Transformers v3.0, we reformat and release 50+ useful datasets in our Embedding Model Datasets Collection on Hugging Face. These can be used with at least one loss function in Sentence Transformers v3.0 out of the box. We recommend browsing through these to see if there are datasets akin to your use cases - training a model on them might just produce large gains on your task(s).
MSELoss extension (#2641)
The MSELoss now accepts multiple text columns for each label (where each label is a target/gold embedding), rather than only accepting one text column. This is extremely powerful for following the excellent Multilingual Models strategy to convert a monolingual model into a multilingual one. You can now conveniently train both English and (identical but translated) non-English texts to represent the same embedding (that was generated by a powerful English embedding model).
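The idea can be sketched with a stdlib-only mean-squared-error between one teacher (gold) embedding and the student's embeddings for several text columns, e.g. an English sentence and its translation trained toward the same target. The names are hypothetical; this is not the library's `MSELoss` implementation:

```python
def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def multi_column_mse(student_embeddings, teacher_embedding):
    """Average each text column's MSE against the single gold embedding."""
    losses = [mse(e, teacher_embedding) for e in student_embeddings]
    return sum(losses) / len(losses)

teacher = [0.0, 1.0]      # gold embedding from a strong English teacher model
english = [0.0, 0.9]      # student embedding of the English text
translated = [0.2, 1.0]   # student embedding of the translated text

loss = multi_column_mse([english, translated], teacher)
```

Minimizing this pulls both the English and the translated student embeddings toward the same teacher target, which is the core of the multilingual distillation strategy described above.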
Add `local_files_only` argument to SentenceTransformer & CrossEncoder (#2603)

You can now initialize a `SentenceTransformer` and `CrossEncoder` with `local_files_only`. If `True`, it will not try to download a model from Hugging Face; it will only look in the local filesystem for the model or try to load it from a cache. Thanks @debanjum for this change.
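A stdlib-only sketch of that behavior (the cache dict and loader below are hypothetical; the real implementation defers to the Hugging Face cache on disk):

```python
LOCAL_CACHE = {"cached/model": "<model weights>"}  # stand-in for the HF cache

def download(name):
    # Stand-in for fetching a model from the Hugging Face Hub.
    return f"<downloaded {name}>"

def load(name, local_files_only=False):
    if name in LOCAL_CACHE:
        return LOCAL_CACHE[name]  # cache hit: no network access either way
    if local_files_only:
        raise FileNotFoundError(
            f"{name} not found locally and downloads are disabled"
        )
    return download(name)  # only reached when downloads are allowed

print(load("cached/model", local_files_only=True))  # served from cache
print(load("remote/model"))                         # falls back to download
```

This makes the flag useful for air-gapped or offline environments: a cached model still loads, while an uncached one fails fast instead of attempting a download.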
All changes

- [v3] Training refactor - MultiGPU, loss logging, bf16, etc. by @tomaarsen in UKPLab#2449
- [v3] Add `similarity` and `similarity_pairwise` methods to Sentence Transformers by @tomaarsen in UKPLab#2615
- [v3] Fix various model card errors by @tomaarsen in UKPLab#2616
- [v3] Fix trainer `compute_loss` when evaluating/predicting if the `loss` updated the inputs in-place by @tomaarsen in UKPLab#2617
- [v3] Never return None in infer_datasets, could result in crash by @tomaarsen in UKPLab#2620
- [v3] Trainer: Implement resume from checkpoint support by @tomaarsen in UKPLab#2621
- `trust_remote_code` to `CrossEncoder.tokenizer` by @michaelfeil in UKPLab#2623
- [v3] Update example scripts to the new v3 training format by @tomaarsen in UKPLab#2622
- [v3] Remove "return_outputs" as it's not strictly necessary. Avoids OOM & speeds up training by @tomaarsen in UKPLab#2633
- [v3] Fix crash from inferring the dataset_id from a local dataset by @tomaarsen in UKPLab#2636
- [v3] Fix multilingual conversion script; extend MSELoss to multi-column by @tomaarsen in UKPLab#2641
- [v3] Update evaluation scripts to use HF Datasets by @tomaarsen in UKPLab#2642
- `b1` quantization for USearch by @ashvardanian in UKPLab#2644
- [v3] Fix `resume_from_checkpoint` by also updating the loss model by @tomaarsen in UKPLab#2648
- [v3] Fix backwards pass on MSELoss due to in-place update by @tomaarsen in UKPLab#2647
- [v3] Simplify `load_from_checkpoint` using `load_state_dict` by @tomaarsen in UKPLab#2650
- [v3] Use `torch.arange` instead of `torch.tensor(range(...))` by @tomaarsen in UKPLab#2651
- [v3] Resolve inplace modification error in DDP by @tomaarsen in UKPLab#2654
- [v3] Add hyperparameter optimization support by letting `loss` be a Callable that accepts a `model` by @tomaarsen in UKPLab#2655
- [v3] Add tag hinting at the number of training samples by @tomaarsen in UKPLab#2660
- [v3] For the Cached losses; ignore gradients if grad is disabled (e.g. eval) by @tomaarsen in UKPLab#2668
- [docs] Rewrite the https://sbert.net documentation by @tomaarsen in UKPLab#2632
- [v3] Chore - include import sorting in ruff by @tomaarsen in UKPLab#2672
- [v3] Prevent warning with 'model.fit' with transformers >= 4.41.0 due to evaluation_strategy by @tomaarsen in UKPLab#2673
- [v3] Add various useful Sphinx packages (copy code, link to code, nicer tabs) by @tomaarsen in UKPLab#2674
- [v3] Make the "primary_metric" for evaluators a bit more robust by @tomaarsen in UKPLab#2675
- [v3] Set `broadcast_buffers = False` when training with DDP by @tomaarsen in UKPLab#2663
- [v3] Warn about using DP instead of DDP + set dataloader_drop_last with DDP by @tomaarsen in UKPLab#2677
- [v3] Add warning that Evaluators only run on 1 GPU when multi-GPU training by @tomaarsen in UKPLab#2678
- [v3] Move training dependencies into a "train" extra by @tomaarsen in UKPLab#2676
- [v3] Docs: update references to the API reference by @tomaarsen in UKPLab#2679
- [v3] Add "dataset_size:" to the tag denoting the number of training samples by @tomaarsen in UKPLab#2680

New Contributors
A special shoutout to @Jakobhenningjensen, @smerrill, @b5y, @ScottishFold007, @pszemraj, @bwanglzu, and @igorkurinnyi for experimenting with the v3.0 release prior to launch, and to @matthewfranglen for the initial work on the training refactor back in October of 2022 in #1733.
cc @AlexJonesNLP as I know you are interested in this release!
Full Changelog: huggingface/sentence-transformers@v2.7.0...v3.0.0