2 changes: 2 additions & 0 deletions .gitignore
@@ -207,3 +207,5 @@ marimo/_lsp/
__marimo__/

local

.DS_Store
139 changes: 137 additions & 2 deletions README.md
@@ -9,6 +9,7 @@
<a href="https://pypi.org/project/pyversity/"><img src="https://img.shields.io/pypi/v/pyversity?color=%23007ec6&label=pypi%20package" alt="Package version"></a>
<a href="https://app.codecov.io/gh/Pringled/pyversity">
<img src="https://codecov.io/gh/Pringled/pyversity/graph/badge.svg?token=2CV5W0ZT7T" alt="Codecov">
</a>
<a href="https://github.com/Pringled/pyversity/blob/main/LICENSE">
<img src="https://img.shields.io/badge/license-MIT-green" alt="License - MIT">
</a>
@@ -17,7 +18,9 @@

[Quickstart](#quickstart) •
[Supported Strategies](#supported-strategies) •
[Motivation](#motivation)
[Motivation](#motivation) •
[Examples](#examples) •
[References](#references)

</div>

@@ -71,6 +74,7 @@ The following table describes the supported strategies, how they work, their tim
| **MSD** (Max Sum of Distances) | Prefers items that are both relevant and far from *all* previous selections. | **O(k · n · d)** | Use when you want stronger spread, i.e. results that cover a wider range of topics or styles. |
| **DPP** (Determinantal Point Process) | Samples diverse yet relevant items using probabilistic “repulsion.” | **O(k · n · d + n · k²)** | Ideal when you want to eliminate redundancy or ensure diversity is built into the selection. |
| **COVER** (Facility-Location) | Ensures selected items collectively represent the full dataset’s structure. | **O(k · n²)** | Great for topic coverage or clustering scenarios, but slower for large `n`. |
| **SSD** (Sliding Spectrum Decomposition) | Sequence‑aware diversification: rewards novelty relative to recently shown items. | **O(k · n · d)** | Great for content feeds and infinite scroll (social, news, or product feeds where users consume items sequentially), as well as for conversational RAG to avoid repeating similar chunks within the recent window. |


## Motivation
@@ -82,10 +86,138 @@ Each new item is chosen not only because it’s relevant, but also because it ad

This improves exploration, user satisfaction, and coverage across many domains, for example:

- E-commerce: Show different product styles, not multiple copies of the same black pants.
- E-commerce: Show different product styles, not multiple copies of the same product.
- News search: Highlight articles from different outlets or viewpoints.
- Academic retrieval: Surface papers from different subfields or methods.
- RAG / LLM contexts: Avoid feeding the model near-duplicate passages.
- Recommendation feeds: Keep content diverse and engaging over time.

## Examples

The following examples illustrate how to apply different diversification strategies in various scenarios.

<details> <summary><b>Product / Web Search</b> — Simple diversification with MMR or DPP</summary> <br>

MMR and DPP are great general-purpose diversification strategies. They are fast, easy to use, and work well in many scenarios.
For example, in a product search setting where you want to show diverse items to a user, you can diversify the top results as follows:

```python
from pyversity import diversify, Strategy

# Suppose you have:
# - item_embeddings: embeddings of the retrieved products
# - item_scores: relevance scores for these products

# Re-rank with MMR
result = diversify(
    embeddings=item_embeddings,
    scores=item_scores,
    k=10,
    strategy=Strategy.MMR,
)
```
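
The returned result carries the selected indices (as also used in the SSD examples below), so you can map them back to your own objects. A minimal sketch, assuming a hypothetical `products` list aligned with `item_embeddings` and `item_scores`:

```python
# Reorder the retrieved products according to the diversified ranking.
# `products` is a hypothetical list of product records aligned with item_embeddings.
diversified_products = [products[i] for i in result.indices]
```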
</details>

<details> <summary><b>Literature Search</b> — Represent the full topic space with COVER</summary> <br>

COVER (Facility-Location) is well-suited for scenarios where you want to ensure that the selected items collectively represent the entire dataset’s structure. For instance, when searching for academic papers on a broad topic, you might want to cover various subfields and methodologies:

```python
from pyversity import diversify, Strategy

# Suppose you have:
# - paper_embeddings: embeddings of the retrieved papers
# - paper_scores: relevance scores for these papers

# Re-rank with COVER
result = diversify(
    embeddings=paper_embeddings,
    scores=paper_scores,
    k=10,
    strategy=Strategy.COVER,
)
```
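
Since COVER scales quadratically with the number of candidates, it can help to pre-filter to a smaller pool before diversifying. A minimal sketch that would run before the `diversify` call above, assuming a hypothetical `top_n` cutoff:

```python
import numpy as np

# Keep only the top_n highest-scoring papers as the candidate pool (hypothetical cutoff).
top_n = 500
top_idx = np.argsort(paper_scores)[::-1][:top_n]
paper_embeddings = paper_embeddings[top_idx]
paper_scores = paper_scores[top_idx]
```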
</details>

<details>
<summary><b>Conversational RAG</b> — Avoid redundant chunks with SSD</summary>
<br>

In retrieval-augmented generation (RAG) for conversational AI, it’s crucial to avoid feeding the model redundant or similar chunks of information within the recent conversation context. The SSD (Sliding Spectrum Decomposition) strategy is designed for sequence-aware diversification, making it ideal for this use case:

```python
import numpy as np
from pyversity import diversify, Strategy

# Suppose you have:
# - chunk_embeddings: embeddings of the chunks retrieved this turn
# - chunk_scores: relevance scores for these chunks
# - recent_chunk_embeddings: embeddings of chunks shown in the last few turns (oldest→newest)

# Re-rank with SSD (sequence-aware)
result = diversify(
    embeddings=chunk_embeddings,
    scores=chunk_scores,
    k=10,
    strategy=Strategy.SSD,
    recent_embeddings=recent_chunk_embeddings,
)

# Maintain the rolling context window for recent chunks
recent_chunk_embeddings = np.vstack([recent_chunk_embeddings, chunk_embeddings[result.indices]])
```
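
In practice you would likely also cap the rolling window so that only the most recent chunks influence diversification. A minimal sketch, assuming a hypothetical `window_size`:

```python
# Keep only the most recent window_size chunk embeddings (hypothetical cap).
window_size = 50
recent_chunk_embeddings = recent_chunk_embeddings[-window_size:]
```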
</details>


<details> <summary><b>Infinite Scroll / Recommendation Feed</b> — Sequence-aware novelty with SSD</summary> <br>

In content feeds or infinite scroll scenarios, users consume items sequentially. To keep the experience engaging, it’s important to introduce novelty relative to recently shown items. The SSD strategy is well-suited for this:

```python
import numpy as np
from pyversity import diversify, Strategy

# Suppose you have:
# - feed_embeddings: embeddings of candidate items for the feed
# - feed_scores: relevance scores for these items
# - recent_feed_embeddings: embeddings of recently shown items in the feed (oldest→newest)

# Sequence-aware re-ranking with Sliding Spectrum Decomposition (SSD)
result = diversify(
    embeddings=feed_embeddings,
    scores=feed_scores,
    k=10,
    strategy=Strategy.SSD,
    recent_embeddings=recent_feed_embeddings,
)

# Maintain the rolling context window for recent items
recent_feed_embeddings = np.vstack([recent_feed_embeddings, feed_embeddings[result.indices]])
```
</details>


<details> <summary><b>Single Long Document</b> — Pick diverse sections with MSD</summary> <br>

When summarizing or extracting information from a single long document, it’s beneficial to select sections that are both relevant and cover different parts of the document. The MSD strategy helps achieve this by preferring items that are far apart from each other:

```python
from pyversity import diversify, Strategy

# Suppose you have:
# - doc_chunk_embeddings: embeddings of document chunks
# - doc_chunk_scores: relevance scores for these chunks

# Re-rank with MSD
result = diversify(
    embeddings=doc_chunk_embeddings,
    scores=doc_chunk_scores,
    k=10,
    strategy=Strategy.MSD,
)
```
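
For summarization it often helps to restore document order for the selected chunks before building the prompt. A minimal sketch, assuming a hypothetical `doc_chunks` list of chunk texts aligned with `doc_chunk_embeddings`:

```python
# Put the selected chunks back into document order (doc_chunks is a hypothetical list of chunk texts).
selected_in_order = sorted(result.indices)
context = "\n\n".join(doc_chunks[i] for i in selected_in_order)
```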

</details>

## References

@@ -102,6 +234,9 @@ The implementations in this package are based on the following research papers:
- **DPP (efficient greedy implementation)**: Chen, L., Zhang, G., & Zhou, H. (2018). Fast greedy MAP inference for determinantal point process to improve recommendation diversity.
[Link](https://arxiv.org/pdf/1709.05135)

- **SSD**: Huang, Y., Wang, W., Zhang, L., & Xu, R. (2021). Sliding Spectrum Decomposition for Diversified
Recommendation. [Link](https://arxiv.org/pdf/2107.05204)

## Author

Thomas van Dongen
15 changes: 13 additions & 2 deletions src/pyversity/__init__.py
@@ -1,6 +1,17 @@
from pyversity.datatypes import DiversificationResult, Metric, Strategy
from pyversity.pyversity import diversify
from pyversity.strategies import cover, dpp, mmr, msd
from pyversity.strategies import cover, dpp, mmr, msd, ssd
from pyversity.version import __version__

__all__ = ["diversify", "Strategy", "Metric", "DiversificationResult", "mmr", "msd", "cover", "dpp", "__version__"]
__all__ = [
    "diversify",
    "Strategy",
    "Metric",
    "DiversificationResult",
    "mmr",
    "msd",
    "cover",
    "dpp",
    "ssd",
    "__version__",
]
1 change: 1 addition & 0 deletions src/pyversity/datatypes.py
@@ -11,6 +11,7 @@ class Strategy(str, Enum):
MSD = "msd"
COVER = "cover"
DPP = "dpp"
SSD = "ssd"


class Metric(str, Enum):
4 changes: 3 additions & 1 deletion src/pyversity/pyversity.py
@@ -3,7 +3,7 @@
import numpy as np

from pyversity.datatypes import DiversificationResult, Strategy
from pyversity.strategies import cover, dpp, mmr, msd
from pyversity.strategies import cover, dpp, mmr, msd, ssd


def diversify(
@@ -36,4 +36,6 @@ def diversify(
        return cover(embeddings, scores, k, diversity, **kwargs)
    if strategy == Strategy.DPP:
        return dpp(embeddings, scores, k, diversity, **kwargs)
    if strategy == Strategy.SSD:
        return ssd(embeddings, scores, k, diversity, **kwargs)
    raise ValueError(f"Unknown strategy: {strategy}")
3 changes: 2 additions & 1 deletion src/pyversity/strategies/__init__.py
@@ -2,5 +2,6 @@
from pyversity.strategies.dpp import dpp
from pyversity.strategies.mmr import mmr
from pyversity.strategies.msd import msd
from pyversity.strategies.ssd import ssd

__all__ = ["mmr", "msd", "cover", "dpp"]
__all__ = ["mmr", "msd", "cover", "dpp", "ssd"]
2 changes: 1 addition & 1 deletion src/pyversity/strategies/cover.py
@@ -14,7 +14,7 @@ def cover(
    normalize: bool = True,
) -> DiversificationResult:
    """
    Select a subset of items that balances relevance and coverage/diversity.
    Cover (Facility Location) selection.

    This strategy chooses `k` items by combining pure relevance with
    diversity-driven coverage using a concave submodular formulation.