29 changes: 29 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,32 @@
# v0.6.0

## New 🔥

**BM25 Full-Text Search** is now available via the new `Torus.bm25/5` macro!

[BM25](https://en.wikipedia.org/wiki/Okapi_BM25) is a modern ranking algorithm that generally provides superior relevance scoring compared to traditional TF-IDF (used by `full_text/5`). This integration uses the [pg_textsearch](https://github.com/timescale/pg_textsearch) extension by Timescale.

Key features:

- State-of-the-art BM25 ranking with configurable index parameters (k1, b)
- Blazingly fast top-k queries via Block-Max WAND optimization (`Torus.bm25/5` + `limit`)
- Simple syntax: `Post |> Torus.bm25([p], p.body, "search term") |> limit(10)`
- Score selection with `:score_key` and post-filtering with `:score_threshold`
- Language/stemming configured at index creation via `text_config`

Requirements:

- PostgreSQL 17+
- pg_textsearch extension installed
- BM25 index on the search column (with `text_config` for language)

See the [BM25 Search Guide](https://dimamik.com/posts/bm25_search) for detailed setup instructions and examples.

**When to use BM25 vs full_text:**

- Use `bm25/5` for fast single-column search with modern relevance ranking
- Use `full_text/5` for multi-column search with weights or when using stored tsvector columns

# v0.5.3

## Fixes
52 changes: 36 additions & 16 deletions README.md
@@ -38,7 +38,7 @@ Post

See [`full_text/5`](https://hexdocs.pm/torus/Torus.html#full_text/5) for more details.

## 6 types of search:
## 7 types of search:

1. **Pattern matching**: Searches for a specific pattern in a string.

@@ -58,12 +58,13 @@ See [`full_text/5`](https://hexdocs.pm/torus/Torus.html#full_text/5) for more de
1. **Similarity:** Searches for records that closely match the input text using trigram distance.

```elixir
iex> insert_posts!(["Hogwarts Secrets", "Quidditch Fever", "Hogwart’s Secret"])
...> Post
...> |> Torus.similarity([p], [p.title], "hoggwarrds")
...> |> limit(2)
...> |> select([p], p.title)
...> |> Repo.all()
insert_posts!(["Hogwarts Secrets", "Quidditch Fever", "Hogwart’s Secret"])

Post
|> Torus.similarity([p], [p.title], "hoggwarrds")
|> limit(2)
|> select([p], p.title)
|> Repo.all()
["Hogwarts Secrets", "Hogwart’s Secret"]
```

@@ -74,20 +75,39 @@ See [`full_text/5`](https://hexdocs.pm/torus/Torus.html#full_text/5) for more de
1. **Full text**: Uses term-document matrix vectors, enabling efficient querying and ranking based on term frequency. Supports prefix search and is great for large datasets to quickly return relevant results. See [PostgreSQL Full Text Search](https://www.postgresql.org/docs/current/textsearch.html) for internal implementation details.

```elixir
iex> insert_post!(title: "Hogwarts Shocker", body: "A spell disrupts the Quidditch Cup.")
...> insert_post!(title: "Diagon Bombshell", body: "Secrets uncovered in the heart of Hogwarts.")
...> insert_post!(title: "Completely unrelated", body: "No magic here!")
...> Post
...> |> Torus.full_text([p], [p.title, p.body], "uncov hogwar")
...> |> select([p], p.title)
...> |> Repo.all()
insert_post!(title: "Hogwarts Shocker", body: "A spell disrupts the Quidditch Cup.")
insert_post!(title: "Diagon Bombshell", body: "Secrets uncovered in the heart of Hogwarts.")
insert_post!(title: "Completely unrelated", body: "No magic here!")

Post
|> Torus.full_text([p], [p.title, p.body], "uncov hogwar")
|> select([p], p.title)
|> Repo.all()
["Diagon Bombshell"]
```

Use it when you dont care about spelling, the documents are long, or if you need to order the results by rank.
Use it when you don't care about spelling, the documents are long, you need multi-column search with weights, or if you need to order the results by rank.

See [`full_text/5`](https://hexdocs.pm/torus/Torus.html#full_text/5) for more details.

1. **BM25 full text**: Modern BM25 ranking algorithm for superior relevance scoring using the [pg_textsearch](https://github.com/timescale/pg_textsearch) extension. BM25 generally provides better ranking than traditional built-in TF-IDF full text search and is optimized for top-k queries.

```elixir
insert_post!(title: "Hogwarts Shocker", body: "A spell disrupts the Quidditch Cup.")
insert_post!(title: "Diagon Bombshell", body: "Secrets uncovered in the heart of Hogwarts.")
insert_post!(title: "Completely unrelated", body: "No magic here!")

Post
|> Torus.bm25([p], p.body, "secrets hogwarts")
|> select([p], p.title)
|> Repo.all()
["Diagon Bombshell"]
```

Use it when you need state-of-the-art relevance ranking for single-column search, especially with LIMIT clauses. Requires PostgreSQL 17+.

See [`bm25/5`](https://hexdocs.pm/torus/Torus.html#bm25/5) and the [BM25 Search Guide](https://dimamik.com/posts/bm25_search) for detailed setup instructions and examples.

1. **Semantic Search**: Understands the contextual meaning of queries to match and retrieve related content utilizing natural language processing. Read more about semantic search in [Semantic search with Torus guide](/guides/semantic_search.md).

```elixir
@@ -131,7 +151,7 @@ Torus offers a few helpers to debug, explain, and analyze your queries before us

## Torus support

For now, Torus supports pattern match, similarity, full-text, and semantic search, with plans to expand support further. These docs will be updated with more examples on which search type to choose and how to make them more performant (by adding indexes or using specific functions).
For now, Torus supports pattern match, similarity, full-text (TF-IDF and BM25), and semantic search, with plans to expand support further. These docs will be updated with more examples on which search type to choose and how to make them more performant (by adding indexes or using specific functions).

<!-- MDOC -->

151 changes: 151 additions & 0 deletions lib/torus.ex
@@ -366,6 +366,157 @@ defmodule Torus do
Torus.Search.FullText.to_tsquery(column, query_text, opts)
end

@doc group: "Full text"
@doc """
BM25 ranked full-text search using the [pg_textsearch](https://github.com/timescale/pg_textsearch) extension.

BM25 is a modern ranking function that generally provides better relevance than traditional
TF-IDF (used by `full_text/5`). It's particularly effective for top-k queries with LIMIT clauses
due to Block-Max WAND optimization.

For detailed usage examples, performance tips, and migration guide, see the [BM25 Search guide](https://dimamik.com/posts/bm25_search).

> #### Requirements {: .warning}
>
> - Requires the `pg_textsearch` extension to be installed
> - PostgreSQL 17+ only
> - Requires a BM25 index on the search column
> - **Single column only** - unlike `full_text/5`, BM25 indexes work on one column at a time
> - **Language is set at index creation** - use `text_config` in the index `WITH` clause
>
> ```elixir
> defmodule YourApp.Repo.Migrations.CreatePgTextsearchExtension do
>   use Ecto.Migration
>
>   def change do
>     execute "CREATE EXTENSION IF NOT EXISTS pg_textsearch", "DROP EXTENSION IF EXISTS pg_textsearch"
>
>     # Create BM25 index with language configuration
>     execute \"\"\"
>     CREATE INDEX posts_body_bm25_idx ON posts
>     USING bm25(body) WITH (text_config='english')
>     \"\"\", "DROP INDEX posts_body_bm25_idx"
>   end
> end
> ```

## Options

* `:order` - Ordering of results. Note that BM25 returns **negative scores** (lower is better):
  - `:asc` (default) - orders by score ascending (best matches first)
  - `:desc` - orders by score descending (worst matches first)
  - `:none` - no ordering applied
* `:index_name` - Explicit index name. Required when using `score_threshold`.
* `:score_key` - Atom key to select the BM25 score into the result map.
  - `:none` (default) - score is not selected
  - `atom` - selects score as this key (use with `select_merge/3`)
* `:score_threshold` - Post-filter results by BM25 score (applied after ORDER BY).
  Since scores are negative and lower is better, use negative thresholds (e.g., `-3.0`
  keeps only results with score < -3.0, i.e., scores like -4.0, -5.0 which are better matches).
  May return fewer results than LIMIT.
* `:pre_filter` - Whether to exclude non-matching rows.
  - `false` (default) - no pre-filtering
  - `true` - adds a `WHERE score < 0` clause to exclude non-matches
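
The sign convention behind `:order` and `:score_threshold` is easy to get backwards. As a minimal sketch in plain Elixir (no database involved; `ThresholdSketch` and the sample rows are hypothetical and not part of Torus), filtering with `score < threshold` and then ordering ascending behaves like this:

```elixir
defmodule ThresholdSketch do
  # Hypothetical helper, not part of Torus: mimics `score < threshold`
  # followed by ascending order on rows that were already fetched.
  def keep_better_than(rows, threshold) do
    rows
    |> Enum.filter(fn row -> row.relevance < threshold end)
    |> Enum.sort_by(fn row -> row.relevance end, :asc)
  end
end

rows = [
  %{body: "a", relevance: -2.5},
  %{body: "b", relevance: -5.0},
  %{body: "c", relevance: -4.0}
]

ThresholdSketch.keep_better_than(rows, -3.0)
# => [%{body: "b", relevance: -5.0}, %{body: "c", relevance: -4.0}]
```

Only the rows scoring below `-3.0` survive, and the most negative (best) score comes first.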

## Examples

Basic search - returns top 10 most relevant posts:

    Post
    |> Torus.bm25([p], p.body, "database search")
    |> limit(10)
    |> select([p], p.body)
    |> Repo.all()

With score selection (`select/3` comes before `bm25/5` so that the score can be merged into it via `select_merge/3`):

    Post
    |> select([p], %{body: p.body})
    |> Torus.bm25([p], p.body, "database", score_key: :relevance)
    |> limit(5)
    |> Repo.all()
    # => [%{body: "...", relevance: -2.5}, ...]

With WHERE clause pre-filtering:

    Post
    |> where([p], p.category_id == 123)
    |> Torus.bm25([p], p.body, "database")
    |> limit(10)
    |> Repo.all()

With score threshold (post-filtering, may return fewer than LIMIT, `index_name` is required):

    Post
    |> Torus.bm25([p], p.body, "database", score_threshold: -5.0, index_name: "posts_body_bm25_idx")
    |> limit(10)
    |> Repo.all()

## When to use `bm25/5` vs `full_text/5`

**Use `bm25/5` when:**
- You need better relevance ranking than TF-IDF
- You need faster search with large datasets
- You have large result sets with LIMIT (top-k queries)
- Single column search is sufficient
- You're on PostgreSQL 17+

**Use `full_text/5` when:**
- You need multi-column search with different weights per column
- You want to use stored tsvector columns
- You're on PostgreSQL < 17
- You need the `concat` filter type

## Multi-column search workaround

Since BM25 indexes work on single columns, you can create a generated column:

```sql
ALTER TABLE posts
ADD COLUMN searchable_text TEXT
GENERATED ALWAYS AS (title || ' ' || body) STORED;

CREATE INDEX posts_searchable_bm25_idx
ON posts USING bm25(searchable_text)
WITH (text_config='english');
```

Then search the generated column:

```elixir
Post
|> Torus.bm25([p], p.searchable_text, "search term")
|> limit(10)
|> Repo.all()
```
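
If you prefer to keep this in a migration, the SQL above could be wrapped roughly as follows. This is a sketch under assumptions: the module name is hypothetical, and the rollback statements are not taken from the guide.

```elixir
defmodule YourApp.Repo.Migrations.AddPostsSearchableText do
  use Ecto.Migration

  def change do
    # Each execute/2 call takes the forward statement and its rollback counterpart.
    # On rollback they run in reverse order, so the index is dropped before the column.
    execute(
      "ALTER TABLE posts ADD COLUMN searchable_text TEXT GENERATED ALWAYS AS (title || ' ' || body) STORED",
      "ALTER TABLE posts DROP COLUMN searchable_text"
    )

    execute(
      "CREATE INDEX posts_searchable_bm25_idx ON posts USING bm25(searchable_text) WITH (text_config='english')",
      "DROP INDEX posts_searchable_bm25_idx"
    )
  end
end
```

Note that in PostgreSQL `title || ' ' || body` evaluates to `NULL` when either column is `NULL`, so nullable columns may need a `coalesce` in the generated expression.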

## Index options

BM25 indexes support these parameters in the `WITH` clause:

- `text_config` - PostgreSQL text search configuration (required). This determines
the language/stemming rules. Available configs: `'english'`, `'french'`, `'german'`,
`'simple'` (no stemming), etc. Run `SELECT cfgname FROM pg_ts_config;` to list all.
- `k1` - Term frequency saturation (default: 1.2, range: 0.1-10.0)
- `b` - Length normalization (default: 0.75, range: 0.0-1.0)

```sql
CREATE INDEX custom_idx ON documents
USING bm25(content)
WITH (text_config='english', k1=1.5, b=0.8);
```
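
To build intuition for `k1` and `b`, here is a sketch of the textbook Okapi BM25 score for a single query term. It is an illustration under assumptions: `BM25Sketch` is a hypothetical module, the IDF variant is one common choice, and pg_textsearch's internal computation may differ (the scores it returns are negative, as noted under Options, whereas the textbook value is positive).

```elixir
defmodule BM25Sketch do
  # Textbook BM25 contribution of one query term to one document.
  #
  # tf     - occurrences of the term in the document
  # dl     - document length in tokens
  # avgdl  - average document length across the corpus
  # n_docs - total number of documents
  # df     - number of documents that contain the term
  # k1     - term frequency saturation
  # b      - length normalization (0.0 disables it)
  def term_score(tf, dl, avgdl, n_docs, df, k1, b) do
    idf = :math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    length_norm = 1 - b + b * dl / avgdl
    idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)
  end
end

# With b = 0.75, a document twice the average length scores lower for the same tf:
BM25Sketch.term_score(3, 200, 100, 10, 1, 1.2, 0.75) <
  BM25Sketch.term_score(3, 100, 100, 10, 1, 1.2, 0.75)
# => true
```

Raising `k1` lets repeated occurrences of a term keep adding to the score for longer before it saturates, while raising `b` penalizes long documents more strongly.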

## Performance tips

- BM25 is most efficient with `ORDER BY + LIMIT` (enables Block-Max WAND optimization)
- For filtered searches, create a separate B-tree index on the filter column
- Pre-filtering works best when the filter is selective (<10% of rows)
- Post-filtering with `score_threshold` may return fewer results than LIMIT
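
For the filtered-search tip, a B-tree index on the filter column could be added with a regular Ecto migration. The sketch below assumes the `category_id` column from the pre-filtering example above; the module name is hypothetical.

```elixir
defmodule YourApp.Repo.Migrations.AddPostsCategoryIdIndex do
  use Ecto.Migration

  def change do
    # Ecto creates a B-tree index by default when no :using option is given.
    create index(:posts, [:category_id])
  end
end
```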
"""
defmacro bm25(query, bindings, qualifier, term, opts \\ []) do
Torus.Search.BM25.bm25(query, bindings, qualifier, term, opts)
end

@doc group: "Pattern matching"
@doc """
The substring function with three parameters provides extraction of a substring
122 changes: 122 additions & 0 deletions lib/torus/search/bm25.ex
@@ -0,0 +1,122 @@
defmodule Torus.Search.BM25 do
  @moduledoc false
  import Torus.Search.Common
  import Ecto.Query, warn: false

  @order_types ~w[asc desc none]a

  def bm25(query, bindings, qualifier, term, opts \\ []) do
    order = get_arg!(opts, :order, :asc, @order_types)
    index_name = Keyword.get(opts, :index_name, nil)
    pre_filter = get_arg!(opts, :pre_filter, false, [true, false])
    score_key = Keyword.get(opts, :score_key, :none)
    score_threshold = Keyword.get(opts, :score_threshold, nil)

    raise_if(
      score_key != :none and not is_atom(score_key),
      "The `score_key` option must be an atom or :none."
    )

    raise_if(
      score_threshold != nil and index_name == nil,
      "The `index_name` option is required when using `score_threshold`."
    )

    # Build the BM25 query fragments
    # When index_name is provided, use to_bm25query(?, ?) for explicit index specification
    # Otherwise use bare string literal (?) to let PostgreSQL auto-detect the index
    {bm25query_fragment, bm25query_params} =
      if index_name do
        {
          "to_bm25query(?, ?)",
          [term, index_name]
        }
      else
        {
          "?",
          [term]
        }
      end

    # Score fragment for ordering and selection
    score_fragment_string = "? <@> #{bm25query_fragment}"

    # Build score fragment AST
    score_fragment =
      quote do
        fragment(
          unquote(score_fragment_string),
          unquote(qualifier),
          unquote_splicing(
            Enum.map(bm25query_params, fn param ->
              quote do: ^unquote(param)
            end)
          )
        )
      end

    # Build order fragment if needed
    order_fragment =
      if order != :none do
        asc_desc = if order == :desc, do: :desc, else: :asc

        quote do
          [{unquote(asc_desc), unquote(score_fragment)}]
        end
      end

    # BM25 scores are negative (lower = better), so "better than threshold" means score < threshold
    # (e.g., -5.0 is better than -2.0, so threshold -3.0 keeps scores < -3.0 like -4.0, -5.0)
    threshold_fragment_string = "? <@> #{bm25query_fragment} < ?"

    # Pre-filtering by match (excludes non-matches)
    # Non-matches have score = 0, matches have score < 0
    pre_filter_fragment_string = "? <@> #{bm25query_fragment} < 0"

    # Build the query
    quote do
      unquote(query)
      |> apply_if(unquote(pre_filter), fn q ->
        where(
          q,
          [unquote_splicing(bindings)],
          fragment(
            unquote(pre_filter_fragment_string),
            unquote(qualifier),
            unquote_splicing(
              Enum.map(bm25query_params, fn param ->
                quote do: ^unquote(param)
              end)
            )
          )
        )
      end)
      |> apply_if(unquote(score_threshold) != nil, fn q ->
        where(
          q,
          [unquote_splicing(bindings)],
          fragment(
            unquote(threshold_fragment_string),
            unquote(qualifier),
            unquote_splicing(
              Enum.map(bm25query_params, fn param ->
                quote do: ^unquote(param)
              end)
            ),
            ^unquote(score_threshold)
          )
        )
      end)
      |> apply_if(unquote(order) != :none, fn q ->
        order_by(q, [unquote_splicing(bindings)], unquote(order_fragment))
      end)
      |> apply_if(unquote(score_key) != :none, fn q ->
        select_merge(
          q,
          [unquote_splicing(bindings)],
          %{unquote(score_key) => unquote(score_fragment)}
        )
      end)
    end
  end
end
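
The fragment strings assembled in `Torus.Search.BM25.bm25/5` depend only on whether `index_name` is given. A stand-alone sketch of that branching, with no Ecto involved, might look like the following; `FragmentSketch` is a hypothetical helper written for illustration and is not part of this PR.

```elixir
defmodule FragmentSketch do
  # Mirrors the string-building branch of the module above: an explicit index
  # name switches the right-hand side to to_bm25query(?, ?), otherwise a bare
  # placeholder is used and index detection is left to PostgreSQL.
  def fragments(term, index_name) do
    {bm25query, params} =
      if index_name do
        {"to_bm25query(?, ?)", [term, index_name]}
      else
        {"?", [term]}
      end

    score = "? <@> #{bm25query}"

    %{
      score: score,
      threshold: score <> " < ?",
      pre_filter: score <> " < 0",
      params: params
    }
  end
end

FragmentSketch.fragments("database", nil).score
# => "? <@> ?"

FragmentSketch.fragments("database", "posts_body_bm25_idx").threshold
# => "? <@> to_bm25query(?, ?) < ?"
```

In each fragment the first `?` is filled by the column qualifier, the placeholders for `params` follow, and the threshold variant takes the score threshold as its last parameter.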