
Conversation

@antimora
Collaborator

Adds a minilm-burn crate implementing the all-MiniLM-L12-v2 sentence-transformer model.

Features

  • Load pretrained weights from HuggingFace with a simple API: MiniLmModel::pretrained(&device)
  • Mean pooling and L2 normalization for sentence embeddings
  • Multi-backend support: ndarray, wgpu, tch-cpu, tch-gpu, cuda
  • Config loaded from HuggingFace's config.json via serde
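For reference, the config.json loading mentioned above can be pictured as deserializing HuggingFace's BERT-style fields into a plain struct. The struct and loader below are an illustrative sketch, not the crate's actual types; the field names follow the standard HF BERT schema:

```rust
use serde::Deserialize;
use std::{fs, path::Path};

// Hypothetical mirror of the HuggingFace BERT-style config.json fields a
// MiniLM model needs; unknown fields are ignored by serde's default behavior.
#[derive(Debug, Deserialize)]
struct HfBertConfig {
    vocab_size: usize,
    hidden_size: usize,
    num_hidden_layers: usize,
    num_attention_heads: usize,
    intermediate_size: usize,
    max_position_embeddings: usize,
}

fn load_config(path: &Path) -> Result<HfBertConfig, Box<dyn std::error::Error>> {
    let json = fs::read_to_string(path)?;
    Ok(serde_json::from_str(&json)?)
}
```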

Usage

// Load pretrained weights and tokenizer from the HF Hub (cached locally).
let (model, tokenizer) = MiniLmModel::<B>::pretrained(&device)?;
// Run the encoder; `None` omits the optional token-type ids.
let output = model.forward(input_ids, attention_mask.clone(), None);
// Average token states under the attention mask, then L2-normalize.
let embeddings = mean_pooling(output.hidden_states, attention_mask);
let embeddings = normalize_l2(embeddings);
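For intuition, mean_pooling and normalize_l2 compute the following, shown here on plain slices rather than Burn tensors. This is an illustrative sketch of the math, not the crate's implementation:

```rust
// Mask-weighted mean over token vectors: padded positions (mask == 0)
// contribute nothing to the sum or the count.
fn mean_pool(token_states: &[Vec<f32>], mask: &[u8]) -> Vec<f32> {
    let hidden = token_states[0].len();
    let mut sum = vec![0.0f32; hidden];
    let mut count = 0.0f32;
    for (state, &m) in token_states.iter().zip(mask) {
        if m != 0 {
            for (s, v) in sum.iter_mut().zip(state) {
                *s += *v;
            }
            count += 1.0;
        }
    }
    sum.iter().map(|s| s / count.max(1e-9)).collect()
}

// Scale the vector to unit length so cosine similarity becomes a dot product.
fn normalize_l2(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt().max(1e-12);
    for x in v.iter_mut() {
        *x /= norm;
    }
}
```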

Benchmarks (Apple M3 Max)

| Benchmark          | ndarray | wgpu  | tch-cpu |
| ------------------ | ------- | ----- | ------- |
| forward (batch=1)  | 102 ms  | 35 ms | 26 ms   |
| forward (batch=16) | 1.54 s  | 73 ms | 130 ms  |

Testing

  • Unit tests: cargo test --features ndarray
  • Integration tests verify outputs match Python sentence-transformers within 1e-4 tolerance
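A sketch of the kind of per-component tolerance check those integration tests describe (the helper name is illustrative, not the tests' actual code):

```rust
// Compare each embedding component against the Python reference,
// failing if any component drifts beyond the tolerance (e.g. 1e-4).
fn assert_close(actual: &[f32], expected: &[f32], tol: f32) {
    assert_eq!(actual.len(), expected.len(), "length mismatch");
    for (i, (a, e)) in actual.iter().zip(expected).enumerate() {
        assert!(
            (a - e).abs() <= tol,
            "component {i}: {a} vs {e} exceeds tol {tol}"
        );
    }
}
```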

Implements the all-MiniLM-L12-v2 model using Burn's built-in TransformerEncoder and burn-store for weight loading from safetensors.

- Load config from HuggingFace's config.json via serde
- Key remapping from HuggingFace BERT to Burn TransformerEncoder
- Mean pooling for sentence embeddings
- Example with HuggingFace download and cosine similarity
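For reference, since the embeddings are L2-normalized, the example's cosine similarity reduces to a dot product. A minimal sketch, not the example's exact code:

```rust
// Cosine similarity of two unit-length embeddings is just their dot
// product; assumes both slices have the same dimensionality.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```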

Reformatted code in loader.rs, model.rs, and pooling.rs for improved readability and consistency. Adjusted import order and indentation, and expanded some array initializations for clarity in tests. No functional changes were made.

Results were measured on an Apple M3 Max, showing a performance comparison across all supported backends.

Copilot AI left a comment


Pull request overview

Introduces a new minilm-burn crate implementing the all-MiniLM-L12-v2 sentence-transformer model on top of Burn, with support for multiple backends, pretrained weight loading from Hugging Face, and documentation/examples/benchmarks.

Changes:

  • Add MiniLM-specific embedding, encoder, pooling, and normalization modules plus a MiniLmModel configuration and forward pass.
  • Implement HF Hub-based weight and tokenizer loading, along with a pretrained convenience API, examples, and benchmarks across backends.
  • Add integration tests against Python sentence-transformers outputs and update repo-level documentation/README to list the new model.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| minilm-burn/src/embedding.rs | Defines MiniLmEmbeddingsConfig and MiniLmEmbeddings (word/position/token-type embeddings + layer norm + dropout) matching the MiniLM/BERT-style embedding stack. |
| minilm-burn/src/model.rs | Adds MiniLmConfig, MiniLmModel, and MiniLmOutput, wiring Burn's TransformerEncoder to MiniLM's config and attention mask semantics. |
| minilm-burn/src/pooling.rs | Implements mean_pooling and normalize_l2 utilities plus a unit test for mean pooling on the ndarray backend. |
| minilm-burn/src/loader.rs | Introduces LoadError, HF safetensor key remapping and loading, HF Hub download utilities, config loading, and MiniLmModel::pretrained. |
| minilm-burn/src/lib.rs | Exposes the MiniLM public API and adds crate-level documentation and a usage example. |
| minilm-burn/tests/integration_test.rs | Adds ndarray-based integration tests that compare MiniLM Rust embeddings and cosine similarities against Python sentence-transformers references. |
| minilm-burn/scripts/generate_reference.py | Script to generate reference embeddings and cosine similarities from Python sentence-transformers for use in integration tests. |
| minilm-burn/scripts/debug_embeddings.py | Small helper script to inspect raw MiniLM embeddings and norms in Python for debugging. |
| minilm-burn/examples/inference.rs | Demonstrates end-to-end inference with the pretrained MiniLM model, tokenization, pooling, and cosine similarity computation on the ndarray backend. |
| minilm-burn/benches/inference.rs | Adds Criterion benchmarks for forward passes, batching, full pipeline, and pooling/normalization across multiple backends. |
| minilm-burn/README.md | Documents the new crate's usage, features, testing strategy, and benchmark results. |
| minilm-burn/Cargo.toml | Declares the new crate, its features (including multi-backend and pretrained support), and dependencies (Burn, burn-store, tokenizers, hf-hub, tokio, etc.). |
| README.md | Updates the root repository overview and tables to include the MiniLM model and its subcrate, and switches to reference-style links. |


- Fix doc example to use MiniLmModel::pretrained (not MiniLmConfig)
- Update HfModelFiles doc to reflect struct with 3 fields
- Fix generate_reference.py to use normalize_embeddings=True
- Use dirs::cache_dir() for platform-appropriate default location
- Allow custom cache path via pretrained(device, Some(path))
- Downloads to ~/.cache/burn-models/ (Linux) or ~/Library/Caches/burn-models/ (macOS)
- Remove hardcoded hidden_size (384), derive from tensor dims
- Add normalize_l2 to example (matches sentence-transformers default)
- Remove debug_embeddings.py script

PyTorchToBurnAdapter handles the weight→gamma and bias→beta renames automatically.
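To picture the remapping step, something along these lines translates HuggingFace BERT parameter names to Burn TransformerEncoder names. The substitutions shown are assumptions about the naming on both sides, not the crate's actual table; the gamma/beta renames are left to PyTorchToBurnAdapter as noted above:

```rust
// Hypothetical HF-BERT → Burn TransformerEncoder key remapping; the
// concrete source and target names here are illustrative assumptions.
fn remap_key(hf_key: &str) -> String {
    hf_key
        .replace("bert.encoder.layer.", "encoder.layers.")
        .replace("attention.self.query", "mha.query")
        .replace("attention.self.key", "mha.key")
        .replace("attention.self.value", "mha.value")
        .replace("attention.output.dense", "mha.output")
}
```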
- Add MiniLmVariant enum (L6, L12) for model selection
- L6: 6 layers, faster inference
- L12: 12 layers, better quality (default)
- Update pretrained() to accept variant parameter

L6 is ~2x faster than L12 across all backends:
- ndarray: 53ms vs 105ms
- wgpu: 18ms vs 35ms
- tch-cpu: 14ms vs 27ms
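A sketch of selecting a variant; the parameter order of pretrained() is assumed from the commit notes above, with the cache-path argument being the one described earlier:

```rust
use burn::tensor::backend::Backend;
// Items from this PR's crate; the exact signature is an assumption.
use minilm_burn::{MiniLmModel, MiniLmVariant};

// Load the faster 6-layer variant; `None` keeps the default cache dir.
fn load_fast<B: Backend>(device: &B::Device) -> Result<(), Box<dyn std::error::Error>> {
    let (_model, _tokenizer) =
        MiniLmModel::<B>::pretrained(device, MiniLmVariant::L6, None)?;
    Ok(())
}
```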
Replaces use of `equal_elem(0)` with comparison to a zeros tensor for creating the padding mask. This ensures compatibility with tensor operations and device placement.
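A minimal sketch of that mask construction, assuming pad token id 0 and using Burn's zeros/equal tensor API:

```rust
use burn::tensor::{backend::Backend, Bool, Int, Tensor};

// Build a boolean padding mask by comparing token ids against a zeros
// tensor created on the same device (true where the id is the pad id 0).
fn padding_mask<B: Backend>(input_ids: Tensor<B, 2, Int>) -> Tensor<B, 2, Bool> {
    let zeros = Tensor::zeros(input_ids.dims(), &input_ids.device());
    input_ids.equal(zeros)
}
```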
Refactored lines where MiniLmModel is loaded to improve code readability by reducing line length and aligning with Rust formatting conventions. No functional changes were made.
