Anchor Embedding

The official implementation of the paper "Training LLMs to be Better Text Embedders through Bidirectional Reconstruction" (EMNLP 2025 Main Conference).

Introduction

Large language models (LLMs) have increasingly been explored as powerful text embedders. Existing LLM-based text embedding approaches often leverage the embedding of the final token, typically a reserved special token such as [EOS]. However, these tokens have not been intentionally trained to capture the semantics of the whole context, limiting their capacity as text embeddings, especially for retrieval and re-ranking tasks.

We propose adding a new training stage before contrastive learning to enrich the semantics of the final token embedding. This stage employs two bidirectional generative reconstruction tasks, EBQ2D (Embedding-Based Query-to-Document) and EBD2Q (Embedding-Based Document-to-Query), which are interleaved to anchor the [EOS] embedding and reconstruct either side of a query-document pair. Experimental results demonstrate that this additional training stage significantly improves LLM performance on the Massive Text Embedding Benchmark (MTEB), achieving new state-of-the-art results across different LLM base models and scales.
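The interleaving of the two tasks can be pictured with a small sketch. It is illustrative only and does not mirror the code in this repository: the names `ReconstructionExample` and `interleave_tasks`, and the strict alternation from one pair to the next, are assumptions made for this example. In each example, `source` is the side whose [EOS] embedding serves as the anchor and `target` is the side to be reconstructed.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple


@dataclass(frozen=True)
class ReconstructionExample:
    task: str    # "EBQ2D" or "EBD2Q"
    source: str  # text encoded into the [EOS] anchor embedding
    target: str  # text the model is trained to reconstruct


def interleave_tasks(pairs: Iterable[Tuple[str, str]]) -> Iterator[ReconstructionExample]:
    """Alternate EBQ2D and EBD2Q over (query, document) pairs.

    Assumption: the tasks alternate one pair at a time; the schedule
    actually used in training may differ.
    """
    for index, (query, document) in enumerate(pairs):
        if index % 2 == 0:
            # EBQ2D: embed the query, reconstruct the document.
            yield ReconstructionExample("EBQ2D", source=query, target=document)
        else:
            # EBD2Q: embed the document, reconstruct the query.
            yield ReconstructionExample("EBD2Q", source=document, target=query)
```

Feeding three pairs through `interleave_tasks` yields the task sequence EBQ2D, EBD2Q, EBQ2D, so both directions are seen throughout training rather than in separate phases.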

Dataset preparation

For both training stages, we use the public portion of the dataset used in "Improving Text Embeddings with Large Language Models", curated by the authors of "Repetition Improves Language Model Embeddings". The dataset can be downloaded from the GitHub page of the Echo embeddings repository. Before running the training scripts, place the downloaded dataset in the cache directory.

Training

Stage I

Run Stage I training with the command below:

torchrun --nproc_per_node=8 experiments/run_stage_I.py train_configs/stage_I/MetaLlama3.2_1b_q2d_d2q.json

Note

Our main contribution lies in the llm2vec/llm2vec_q2d_d2q.py script, which implements the two bidirectional reconstruction tasks, EBQ2D and EBD2Q, via anchor embeddings.

Stage II

torchrun --nproc_per_node=8 experiments/run_stage_II.py train_configs/stage_II/MetaLlama3.2_1b_stage1_2000steps_stage2.json

Baseline

torchrun --nproc_per_node=8 experiments/run_stage_II.py train_configs/baseline/MetaLlama3.2_1b_baseline_stage2.json

Evaluation

Evaluating on MTEB:

python experiments/mteb_eval_custom.py \
  --base_model_name_or_path <path_or_name_of_base_model> \
  --peft_model_name_or_path <path_or_name_of_peft_model> \
  --task_name ${TASK_NAME} \
  --task_to_instructions_fp test_configs/mteb/task_to_instructions.json \
  --output_dir <output_directory>
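To evaluate several tasks in one go, the command above can be wrapped in a short driver. This is a sketch, not part of the repository: `build_eval_command` and `run_tasks` are hypothetical helpers, and the model paths and output directory are placeholders you must fill in. The flags are the ones shown in the command above.

```python
import subprocess
import sys
from typing import List, Sequence


def build_eval_command(task_name: str, base_model: str, peft_model: str,
                       output_dir: str) -> List[str]:
    """Return the argument list for one MTEB task (hypothetical helper)."""
    return [
        sys.executable, "experiments/mteb_eval_custom.py",
        "--base_model_name_or_path", base_model,
        "--peft_model_name_or_path", peft_model,
        "--task_name", task_name,
        "--task_to_instructions_fp", "test_configs/mteb/task_to_instructions.json",
        "--output_dir", output_dir,
    ]


def run_tasks(task_names: Sequence[str], base_model: str, peft_model: str,
              output_dir: str, dry_run: bool = True) -> List[List[str]]:
    """Build one command per task; execute them unless dry_run is set."""
    commands = [build_eval_command(t, base_model, peft_model, output_dir)
                for t in task_names]
    for command in commands:
        if dry_run:
            # Only show what would be executed.
            print(" ".join(command))
        else:
            # check=True stops at the first task that fails.
            subprocess.run(command, check=True)
    return commands
```

For example, `run_tasks(["STS16", "SciFact"], "<path_or_name_of_base_model>", "<path_or_name_of_peft_model>", "<output_directory>")` prints two commands without running them. The task names here are example values; check test_configs/mteb/task_to_instructions.json for the names the script accepts, and pass `dry_run=False` from the repository root to execute the commands.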
