Search posts from X that you have liked, using the archive download files
- GitHub repository: https://github.com/cast42/search-x-likes/
- Documentation: https://cast42.github.io/search-x-likes/
- PyPI package: https://pypi.org/project/search-x-likes/
- Hugging Face Space: https://huggingface.co/spaces/cast42/x_likes_search/
First, create a repository on GitHub with the same name as this project, and then run the following commands:
```shell
git init -b main
git add .
git commit -m "init commit"
git remote add origin git@github.com:cast42/search-x-likes.git
git push -u origin main
```

Then, install the environment and the pre-commit hooks with:

```shell
make install
```

This will also generate your `uv.lock` file.
Initially, the CI/CD pipeline might fail due to formatting issues. To resolve those, run:

```shell
uv run pre-commit run -a
```

Lastly, commit the changes made by the two steps above to your repository:

```shell
git add .
git commit -m 'Fix formatting issues'
git push origin main
```

To call the OpenAI API (used to generate the synthetic evaluation dataset), export your API key:

```shell
export OPENAI_API_KEY=<your key>
```

You are now ready to start development on your project! The CI/CD pipeline will be triggered when you open a pull request, merge to main, or when you create a new release.
To finalize the set-up for publishing to PyPI, see here. For activating the automatic documentation with MkDocs, see here. To enable the code coverage reports, see here.
- Create an API Token on PyPI.
- Add the API Token to your project's secrets with the name `PYPI_TOKEN` by visiting this page.
- Create a new release on GitHub.
- Create a new tag in the form `*.*.*`.
For more details, see here.
Use ruff for linting and formatting, mypy for static code analysis, and pytest for testing.
The documentation is built with mkdocs, mkdocs-material and mkdocstrings.
To run `uv run python search_x_likes/fix_datasets.py`, set:

```shell
export PYTORCH_ENABLE_MPS_FALLBACK=1
```
Retrieve the first k exact matches. This approach is implemented as a textual TUI in `search_x_likes/exact_search.py`:

```python
k = 5  # retrieve at most k documents
retrieved = []
for document in documents:
    if query in document:
        retrieved.append(document)
        if len(retrieved) >= k:  # stop once k matches are found
            break
```

BM25S is an efficient Python-based implementation of BM25 that depends only on NumPy and SciPy. It achieves up to a 500x speedup compared to the most popular Python-based framework by eagerly computing BM25 scores during indexing and storing them in sparse matrices.
This approach is implemented in search_x_likes/bm25_search.py
BM25 for Python: Achieving high performance while simplifying dependencies with BM25S⚡
Xing Han Lù, BM25S: Orders of magnitude faster lexical search via eager sparse scoring
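To make the scoring concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is an illustration of the formula, not the bm25s library or the repository's `bm25_search.py`; the parameter defaults `k1=1.5` and `b=0.75` are common choices, not values taken from the project.

```python
import math
from collections import Counter

def bm25_scores(query: str, corpus: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document in `corpus` against `query` with Okapi BM25."""
    docs = [doc.lower().split() for doc in corpus]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # document frequency: in how many documents each term appears
    df: Counter = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

The "eager" trick of BM25S is to precompute these per-term contributions at indexing time and store them in a sparse matrix, so that querying reduces to summing a few matrix rows.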
Given the embedding of a document A and the embedding of a query B, score their similarity as the normalized dot product of the two vectors (cosine similarity):

score(A, B) = (A · B) / (‖A‖ ‖B‖)

This approach is implemented in `search_x_likes/cosine_search.py`.
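A minimal NumPy sketch of this scoring function (an illustration, not the code in `cosine_search.py`):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Normalized dot product of two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_documents(query_emb: np.ndarray, doc_embs: np.ndarray) -> np.ndarray:
    """Return document indices sorted from most to least similar to the query."""
    scores = np.array([cosine_similarity(query_emb, d) for d in doc_embs])
    return np.argsort(-scores)
```

In practice the document embeddings are normalized once up front, so ranking reduces to a single matrix-vector product.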
The code to generate the synthetic dataset with gpt-4o-mini is in `search_x_likes/generate_synthetic_eval_dataset.py`. It takes as input the dataset containing the liked posts on X.
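The general idea can be sketched as follows: for each liked post, ask the model to produce a search query that the post should answer, yielding (query, relevant post) pairs for evaluation. The prompt template below is hypothetical, not the one used in the repository's script.

```python
# Hypothetical prompt template for synthetic query generation (illustrative only).
PROMPT_TEMPLATE = (
    "Write a short search query (at most 8 words) that a user could type "
    "to find this post:\n\n{post}\n\nQuery:"
)

def build_prompt(post: str) -> str:
    """Build the generation prompt for one liked post."""
    return PROMPT_TEMPLATE.format(post=post)

# The actual API call would look roughly like this (requires the openai package):
# from openai import OpenAI
# client = OpenAI()  # reads OPENAI_API_KEY from the environment
# response = client.chat.completions.create(
#     model="gpt-4o-mini",
#     messages=[{"role": "user", "content": build_prompt(post)}],
# )
```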
The evaluation results are:
| Model | MRR | Recall@5 | NDCG@5 | Wall Time (CPU) | Wall Time (GPU) |
|---|---|---|---|---|---|
| BM25s | 0.7711 | 0.8367 | 0.3376 | 0.2s | 0.4s |
| sentence-transformers/all-MiniLM-L6-v2 | 0.6517 | 0.9246 | 0.3964 | 20s | 4.09s |
| nomic-ai/modernbert-embed-base | 0.6654 | 0.9472 | 0.4044 | 3m01s | 6.82s |
| intfloat/multilingual-e5-large | 0.7063 | 0.9246 | 0.3823 | 7m57s | 12.5s |
| minishlab/potion-retrieval-32M | 0.6346 | 0.8894 | 0.3813 | 2s | 1.64s |
| minishlab/potion-base-8M | 0.6128 | 0.8794 | 0.3887 | 0.7s | 1.99s |
| tomaarsen/static-retrieval-mrl-en-v1 | 0.5958 | 0.8543 | 0.3771 | 3.23s | 1.87s |
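For reference, the metrics in the table can be sketched as below. This is a minimal illustration of how MRR and Recall@k are typically computed per query and then averaged, not the repository's evaluation code; it assumes each query has a single relevant document.

```python
def mrr(ranked_ids: list, relevant_id) -> float:
    """Reciprocal rank of the relevant document, 0.0 if it is not retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids: list, relevant_id, k: int = 5) -> float:
    """1.0 if the relevant document appears in the top k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

# Averaged over all (ranking, relevant document) pairs in the eval set:
# mean_mrr = sum(mrr(r, rel) for r, rel in pairs) / len(pairs)
```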
Quality results with rerankers:
| Model | MRR | Recall@5 | NDCG@5 |
|---|---|---|---|
| BM25s - no reranker | 0.7711 | 0.8367 | 0.3376 |
| bi-encoder sentence-transformers/all-MiniLM-L6-v2 | 0.7106 | 0.7889 | 0.3243 |
| bi-encoder all-mpnet-base-v2 | 0.6778 | 0.7789 | 0.3315 |
| bi-encoder minishlab/potion-retrieval-32M | 0.5973 | 0.7638 | 0.3396 |
| bi-encoder nomic-ai/modernbert-embed-base | 0.7210 | 0.8065 | 0.3347 |
| cross encoder cross-encoder/ms-marco-MiniLM-L-6-v2 | 0.7958 | 0.8417 | 0.3347 |
| cross encoder mixedbread-ai/mxbai-rerank-xsmall-v1 | 0.7836 | 0.8417 | 0.3422 |
| cross encoder mixedbread-ai/mxbai-rerank-base-v1 | 0.7708 | 0.8417 | 0.3409 |
| cross encoder mixedbread-ai/mxbai-rerank-large-v1 | 0.7605 | 0.8342 | 0.3362 |
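The two-stage setup behind these numbers can be sketched as follows: a first-stage retriever (e.g. BM25) returns candidates, and a reranker rescores them. The `score_fn` here is a stand-in for a real cross-encoder; the sentence-transformers call in the comment is an assumed usage, not code from the repository.

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score_fn: Callable[[str, str], float], top_k: int = 5) -> list[str]:
    """Rescore first-stage candidates with score_fn and return the top_k best."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

# With a real cross-encoder, score_fn would be something like (assumed usage):
# from sentence_transformers import CrossEncoder
# model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
# score_fn = lambda q, d: float(model.predict([(q, d)])[0])
```

Because the cross-encoder only sees the short candidate list, its higher per-pair cost stays affordable while improving MRR over BM25 alone, as the table shows.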
Contributions of all kinds are welcome: more documentation, examples, code, tests, and even questions.
The package is open-sourced under the conditions of the MIT license.
Repository initiated with fpgmaas/cookiecutter-uv.