This repository contains small utilities to test and evaluate OpenSearch text embedding models on Yahoo Answers data.
- A running OpenSearch cluster with the ML Commons / OpenSearch-ML plugin installed
- curl, jq, and Python 3 available on your machine
- scripts/setup_embedding.sh – Configure ML cluster settings, register a pretrained TEXT_EMBEDDING model, deploy it, and run a small sanity-check prediction.
- scripts/test_text_embedding.sh – Call the TEXT_EMBEDDING _predict API once with two sample sentences, pretty-print the JSON response, and save it to output/embeddings/test_prediction.json.
- scripts/run_yahoo_embedding.sh – Batch-embed Yahoo Answers questions and write results to output/embeddings/yahoo_vecs.jsonl.
From the scripts directory:
cd scripts
./setup_embedding.sh
This script will:
- Check cluster health and that the ML plugin is installed.
- Configure basic ML-related cluster settings.
- Register a pretrained TEXT_EMBEDDING model.
- Wait for registration to finish and retrieve the model_id.
- Deploy the model and wait until the deployment task is COMPLETED.
- Run a single embedding request to verify the model works.
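The "wait until finished" steps above can be sketched as a small polling loop. This is only an illustration and not the code in setup_embedding.sh: the REST paths are the ML Commons task and model endpoints, while wait_for_task, its parameters, and the injected get_task callable are hypothetical names chosen for this example.

```python
import time
from typing import Callable, Dict

# ML Commons REST paths used by a register/deploy workflow.
REGISTER_PATH = "/_plugins/_ml/models/_register"
TASK_PATH = "/_plugins/_ml/tasks/{task_id}"
DEPLOY_PATH = "/_plugins/_ml/models/{model_id}/_deploy"

# Task states treated as terminal failures in this sketch.
FAILED_STATES = ("FAILED", "CANCELLED", "COMPLETED_WITH_ERROR")


def wait_for_task(
    get_task: Callable[[str], Dict],
    task_id: str,
    poll_seconds: float = 2.0,
    timeout_seconds: float = 300.0,
) -> Dict:
    """Poll a task until its state is COMPLETED.

    get_task is any callable that returns the parsed JSON body of
    GET /_plugins/_ml/tasks/<task_id>; the HTTP client is left to the caller.
    Raises RuntimeError on a failed task and TimeoutError on timeout.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        task = get_task(task_id)
        state = task.get("state")
        if state == "COMPLETED":
            return task
        if state in FAILED_STATES:
            raise RuntimeError(f"task {task_id} ended in state {state}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} still {state} after timeout")
        time.sleep(poll_seconds)
```

The same helper would serve both the registration task (whose completed body carries the model_id) and the deployment task, since both are tracked through the tasks endpoint.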
Environment variables you can override:
- OS_HOST – OpenSearch URL (default: http://localhost:9200)
- OS_USER, OS_PASS – Optional basic auth credentials
After a model is deployed, you can run a quick manual prediction:
cd scripts
./test_text_embedding.sh
The script automatically searches OpenSearch for a deployed TEXT_EMBEDDING model and uses its model_id. The raw JSON response is stored in:
output/embeddings/test_prediction.json
You can change the output path by setting the OUTPUT environment variable before running the script.
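The automatic model lookup could look roughly like the sketch below. It is an assumption about how such a search might be expressed, not a copy of what the script sends: the endpoint is the ML Commons model search API, but the filter fields (algorithm, model_state) and the helper names build_model_search_body and extract_model_id are illustrative.

```python
from typing import Dict, Optional

# ML Commons model search endpoint (accepts a standard query DSL body).
MODEL_SEARCH_PATH = "/_plugins/_ml/models/_search"


def build_model_search_body() -> Dict:
    """Build a query for one deployed TEXT_EMBEDDING model.

    Assumption: model documents expose 'algorithm' and 'model_state' fields
    that can be matched with term queries.
    """
    return {
        "size": 1,
        "query": {
            "bool": {
                "must": [
                    {"term": {"algorithm": "TEXT_EMBEDDING"}},
                    {"term": {"model_state": "DEPLOYED"}},
                ]
            }
        },
    }


def extract_model_id(search_response: Dict) -> Optional[str]:
    """Return the _id of the first hit, or None when nothing matched."""
    hits = search_response.get("hits", {}).get("hits", [])
    return hits[0]["_id"] if hits else None
```

Returning None instead of raising lets the caller decide whether a missing model should abort the run or trigger the setup script first.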
To generate embeddings for the Yahoo Answers dataset:
cd scripts
./run_yahoo_embedding.sh
This script will:
- Look up a deployed TEXT_EMBEDDING model in OpenSearch and obtain its model_id.
- Call the Python script src/yahoo_qeury_embedding.py to batch-embed queries.
- Write one JSON object per line to output/embeddings/yahoo_vecs.jsonl with fields id, query, and embedding.
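For readers who want to consume the output, here is a minimal sketch of the JSONL layout described above: one JSON object per line with the fields id, query, and embedding. The functions write_jsonl and read_jsonl are hypothetical helpers written for this example, not part of the repository.

```python
import json
from typing import Dict, Iterable, Iterator, TextIO


def write_jsonl(records: Iterable[Dict], out: TextIO) -> int:
    """Write each record as a single JSON line and return the line count."""
    count = 0
    for record in records:
        line = {
            "id": record["id"],
            "query": record["query"],
            "embedding": record["embedding"],
        }
        # ensure_ascii=False keeps non-ASCII question text readable in the file.
        out.write(json.dumps(line, ensure_ascii=False) + "\n")
        count += 1
    return count


def read_jsonl(source: TextIO) -> Iterator[Dict]:
    """Yield one parsed record per non-empty line."""
    for line in source:
        line = line.strip()
        if line:
            yield json.loads(line)
```

Because every line is an independent JSON document, the file can be streamed or split without loading all vectors into memory.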
Configurable parameters (via environment variables):
- OS_URL – OpenSearch URL (default: http://localhost:9200)
- INPUT – Path to the Yahoo Answers JSONL input file
- OUTPUT – Output file path for embeddings (default: output/embeddings/yahoo_vecs.jsonl)
- BATCH_SIZE – Batch size for embedding requests (default: 100)
- MAX_QUERIES – If set, only the first N queries are embedded (useful for quick tests)
The underlying Python script prints total embedding time and QPS, as well as a simple progress bar while processing.
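As a rough illustration of how the reported numbers relate, QPS here is taken to mean embedded queries divided by wall-clock seconds. The helpers timed and queries_per_second are hypothetical; the actual script may measure time differently.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed(fn: Callable[[], T]) -> Tuple[T, float]:
    """Run fn once and return its result with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def queries_per_second(num_queries: int, elapsed_seconds: float) -> float:
    """Return throughput, or 0.0 when no measurable time has elapsed."""
    if elapsed_seconds <= 0:
        return 0.0
    return num_queries / elapsed_seconds
```

For example, 1000 queries embedded in 4 seconds corresponds to 250 QPS.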
The main embedding logic lives in src/yahoo_qeury_embedding.py:
- Reads the Yahoo JSONL file and extracts question text.
- Batches queries and sends them to OpenSearch via OpenSearchTextEmbedder in src/opensearch/os_embeeding_client.py.
- Writes out per-query JSON lines with id, query, and embedding.
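The batching step can be pictured with the sketch below. It does not reproduce the real OpenSearchTextEmbedder interface, which this README does not document; embed_batch stands in for whatever call sends a list of texts and returns one vector per text, and embed_in_batches is a hypothetical name.

```python
from typing import Callable, Dict, List, Optional, Sequence

Vector = List[float]


def embed_in_batches(
    ids: Sequence[str],
    queries: Sequence[str],
    embed_batch: Callable[[List[str]], List[Vector]],
    batch_size: int = 100,
    max_queries: Optional[int] = None,
) -> List[Dict]:
    """Embed queries in fixed-size batches and pair each vector with its id."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if len(ids) != len(queries):
        raise ValueError("ids and queries must have the same length")

    # Optionally restrict the run to the first N queries.
    limit = len(queries) if max_queries is None else min(max_queries, len(queries))

    records: List[Dict] = []
    for start in range(0, limit, batch_size):
        end = min(start + batch_size, limit)
        vectors = embed_batch(list(queries[start:end]))
        if len(vectors) != end - start:
            raise RuntimeError("embedder returned a different number of vectors")
        for i, vector in zip(range(start, end), vectors):
            records.append({"id": ids[i], "query": queries[i], "embedding": vector})
    return records
```

Checking the vector count per batch catches a partial response early, before ids and embeddings drift out of alignment in the output file.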
You can also invoke it directly, for example:
python src/yahoo_qeury_embedding.py \
--input /path/to/yahoo_answers_title_answer.jsonl \
--output output/embeddings/yahoo_vecs.jsonl \
--os-url http://localhost:9200 \
--model-id <your_model_id> \
--batch-size 100 \
--max-queries 1000
Optional arguments:
- --username, --password – Basic auth for OpenSearch
- --insecure – Disable TLS verification if needed
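For reference, the flags listed above map naturally onto an argparse parser like the one sketched here. This is a guess at the shape of the interface, not the parser in the script itself: which flags are required, their types, and their defaults (taken from the defaults mentioned earlier in this README) are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented command-line flags."""
    parser = argparse.ArgumentParser(
        description="Batch-embed Yahoo Answers questions with OpenSearch."
    )
    # Assumed required: the input file and the deployed model's id.
    parser.add_argument("--input", required=True, help="Yahoo Answers JSONL file")
    parser.add_argument("--model-id", required=True, help="Deployed model id")
    parser.add_argument("--output", default="output/embeddings/yahoo_vecs.jsonl")
    parser.add_argument("--os-url", default="http://localhost:9200")
    parser.add_argument("--batch-size", type=int, default=100)
    parser.add_argument("--max-queries", type=int, default=None)
    # Optional basic auth and TLS settings.
    parser.add_argument("--username", default=None)
    parser.add_argument("--password", default=None)
    parser.add_argument("--insecure", action="store_true")
    return parser
```

Running the real script with --help is the authoritative way to check the actual options.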
This README only covers the basic workflow; you can adjust scripts and parameters to fit your own datasets or models.