Search posts from X that you have liked, using the archive download files
- GitHub repository: https://github.com/cast42/search-x-likes/
- Documentation: https://cast42.github.io/search-x-likes/
- PyPI package: https://pypi.org/project/search-x-likes/
- Hugging Face Space: https://huggingface.co/spaces/cast42/x_likes_search/
First, create a repository on GitHub with the same name as this project, and then run the following commands:
```shell
git init -b main
git add .
git commit -m "init commit"
git remote add origin git@github.com:cast42/search-x-likes.git
git push -u origin main
```

Then, install the environment and the pre-commit hooks with:

```shell
make install
```

This will also generate your `uv.lock` file.
Initially, the CI/CD pipeline might fail due to formatting issues. To resolve those, run:

```shell
uv run pre-commit run -a
```

Lastly, commit the changes made by the two steps above to your repository:

```shell
git add .
git commit -m 'Fix formatting issues'
git push origin main
```

To call the OpenAI API (used to generate the synthetic evaluation dataset), export your API key:

```shell
export OPENAI_API_KEY=<your key>
```

You are now ready to start development on your project! The CI/CD pipeline will be triggered when you open a pull request, merge to main, or when you create a new release.
To finalize the set-up for publishing to PyPI, see here. For activating the automatic documentation with MkDocs, see here. To enable the code coverage reports, see here.
- Create an API Token on PyPI.
- Add the API Token to your project's secrets with the name `PYPI_TOKEN` by visiting this page.
- Create a new release on GitHub.
- Create a new tag in the form `*.*.*`.
For more details, see here.
Use ruff for linting and formatting, mypy for static code analysis, and pytest for testing.
The documentation is built with mkdocs, mkdocs-material and mkdocstrings.
To run `uv run python search_x_likes/fix_datasets.py`, set:

```shell
export PYTORCH_ENABLE_MPS_FALLBACK=1
```
Retrieve the first k exact matches. This approach is implemented as a textual TUI in `search_x_likes/exact_search.py`:

```python
k = 5  # retrieve at most k documents
retrieved = []
for document in documents:
    if query in document:
        retrieved.append(document)
        if len(retrieved) >= k:  # stop once k matches are found
            break
```

BM25S is an efficient Python-based implementation of BM25 that depends only on NumPy and SciPy. It achieves up to a 500x speedup compared to the most popular Python-based framework by eagerly computing BM25 scores during indexing and storing them in sparse matrices.
This approach is implemented in search_x_likes/bm25_search.py
BM25 for Python: Achieving high performance while simplifying dependencies with BM25S⚡
Xing Han Lù, BM25S: Orders of magnitude faster lexical search via eager sparse scoring
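To make the scoring concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is an illustration of the formula, not the bm25s library or the repository's `bm25_search.py`; the parameter defaults `k1=1.5` and `b=0.75` are common choices, not values taken from the project.

```python
import math
from collections import Counter

def bm25_scores(query: str, corpus: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document in `corpus` against `query` with Okapi BM25."""
    docs = [doc.lower().split() for doc in corpus]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # document frequency: in how many documents each term appears
    df: Counter = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

The "eager" trick of BM25S is to precompute these per-term contributions at indexing time and store them in a sparse matrix, so that querying reduces to summing a few matrix rows.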
Given the embedding of a document A and the embedding of a query B, score their similarity as the normalized dot product of the two vectors (cosine similarity):

score(A, B) = (A · B) / (‖A‖ ‖B‖)

This approach is implemented in `search_x_likes/cosine_search.py`.
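A minimal NumPy sketch of this scoring function (an illustration, not the code in `cosine_search.py`):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Normalized dot product of two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_documents(query_emb: np.ndarray, doc_embs: np.ndarray) -> np.ndarray:
    """Return document indices sorted from most to least similar to the query."""
    scores = np.array([cosine_similarity(query_emb, d) for d in doc_embs])
    return np.argsort(-scores)
```

In practice the document embeddings are normalized once up front, so ranking reduces to a single matrix-vector product.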
The code to generate the synthetic dataset with gpt-4o-mini is in `search_x_likes/generate_synthetic_eval_dataset.py`. It takes as input the dataset containing the liked posts on X.
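The general idea can be sketched as follows: for each liked post, ask the model to produce a search query that the post should answer, yielding (query, relevant post) pairs for evaluation. The prompt template below is hypothetical, not the one used in the repository's script.

```python
# Hypothetical prompt template for synthetic query generation (illustrative only).
PROMPT_TEMPLATE = (
    "Write a short search query (at most 8 words) that a user could type "
    "to find this post:\n\n{post}\n\nQuery:"
)

def build_prompt(post: str) -> str:
    """Build the generation prompt for one liked post."""
    return PROMPT_TEMPLATE.format(post=post)

# The actual API call would look roughly like this (requires the openai package):
# from openai import OpenAI
# client = OpenAI()  # reads OPENAI_API_KEY from the environment
# response = client.chat.completions.create(
#     model="gpt-4o-mini",
#     messages=[{"role": "user", "content": build_prompt(post)}],
# )
```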
The evaluation results are:
| Model | MRR | Recall@5 | NDCG@5 | Wall Time (CPU) | Wall Time (GPU) |
|---|---|---|---|---|---|
| BM25s | 0.7711 | 0.8367 | 0.3376 | 0.2s | 0.4s |
| sentence-transformers/all-MiniLM-L6-v2 | 0.6517 | 0.9246 | 0.3964 | 20s | 4.09s |
| nomic-ai/modernbert-embed-base | 0.6654 | 0.9472 | 0.4044 | 3m01s | 6.82s |
| intfloat/multilingual-e5-large | 0.7063 | 0.9246 | 0.3823 | 7m57s | 12.5s |
| minishlab/potion-retrieval-32M | 0.6346 | 0.8894 | 0.3813 | 2s | 1.64s |
| minishlab/potion-base-8M | 0.6128 | 0.8794 | 0.3887 | 0.7s | 1.99s |
| tomaarsen/static-retrieval-mrl-en-v1 | 0.5958 | 0.8543 | 0.3771 | 3.23s | 1.87s |
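For reference, the metrics in the table can be sketched as below. This is a minimal illustration of how MRR and Recall@k are typically computed per query and then averaged, not the repository's evaluation code; it assumes each query has a single relevant document.

```python
def mrr(ranked_ids: list, relevant_id) -> float:
    """Reciprocal rank of the relevant document, 0.0 if it is not retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids: list, relevant_id, k: int = 5) -> float:
    """1.0 if the relevant document appears in the top k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

# Averaged over all (ranking, relevant document) pairs in the eval set:
# mean_mrr = sum(mrr(r, rel) for r, rel in pairs) / len(pairs)
```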
Quality results with rerankers:
| Model | MRR | Recall@5 | NDCG@5 |
|---|---|---|---|
| BM25s - no reranker | 0.7711 | 0.8367 | 0.3376 |
| bi-encoder sentence-transformers/all-MiniLM-L6-v2 | 0.7106 | 0.7889 | 0.3243 |
| bi-encoder all-mpnet-base-v2 | 0.6778 | 0.7789 | 0.3315 |
| bi-encoder minishlab/potion-retrieval-32M | 0.5973 | 0.7638 | 0.3396 |
| bi-encoder nomic-ai/modernbert-embed-base | 0.7210 | 0.8065 | 0.3347 |
| cross encoder cross-encoder/ms-marco-MiniLM-L-6-v2 | 0.7958 | 0.8417 | 0.3347 |
| cross encoder mixedbread-ai/mxbai-rerank-xsmall-v1 | 0.7836 | 0.8417 | 0.3422 |
| cross encoder mixedbread-ai/mxbai-rerank-base-v1 | 0.7708 | 0.8417 | 0.3409 |
| cross encoder mixedbread-ai/mxbai-rerank-large-v1 | 0.7605 | 0.8342 | 0.3362 |
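The two-stage setup behind these numbers can be sketched as follows: a first-stage retriever (e.g. BM25) returns candidates, and a reranker rescores them. The `score_fn` here is a stand-in for a real cross-encoder; the sentence-transformers call in the comment is an assumed usage, not code from the repository.

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score_fn: Callable[[str, str], float], top_k: int = 5) -> list[str]:
    """Rescore first-stage candidates with score_fn and return the top_k best."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

# With a real cross-encoder, score_fn would be something like (assumed usage):
# from sentence_transformers import CrossEncoder
# model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
# score_fn = lambda q, d: float(model.predict([(q, d)])[0])
```

Because the cross-encoder only sees the short candidate list, its higher per-pair cost stays affordable while improving MRR over BM25 alone, as the table shows.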
Contributions of all kinds are welcome: more documentation, examples, code, tests, and even questions.
The package is open-sourced under the conditions of the MIT license.
Repository initiated with fpgmaas/cookiecutter-uv.