USE 355 - index current embeddings for a source #377

ghukill · 2026-01-27T16:12:25Z

Purpose and background context

This PR allows us to use TIM to bulk update pre-existing documents for a given source with all current embeddings for that source.

As noted in both the ticket and commits, the first pass focused only on indexing a single ETL run, which remains the common path in the ETL StepFunction. But until we're relying on the ETL StepFunction entirely, there is value in being able to update a source in Opensearch with all current embeddings for that source, and specifically to do so without fully reindexing the source as reindex-source would do.

How can a reviewer manually see the effects of these changes?

1- Set Dev1 AWS credentials

2- Run a bulk update for a small source like libguides:

pipenv run tim --verbose \
bulk-update-embeddings \
--source libguides \
s3://timdex-extract-dev-222053980223/dataset

When complete, observe the following results:

{"updated": 0, "skipped": 279, "errors": 0, "total": 279}

Because embeddings already existed, and were identical to the ones used for updating, they are all skipped.

You could instead perform a full re-index of a source, e.g. gismit:

pipenv run tim --verbose reindex-source --source gismit s3://timdex-extract-dev-222053980223/dataset

Results:

{"index": {"created": 2043, "updated": 0, "errors": 16, "total": 2059}, "update": {"updated": 2043, "errors": 16, "total": 2059, "skipped": 0}}

This is interesting for a couple of reasons:

* we see that during indexing (creation) of documents we have 16 errors
* during updating of documents we have 16 errors because the associated Opensearch document didn't exist
* but the other 2043 are updated successfully

Includes new or updated dependencies?

YES | NO

Changes expectations for external applications?

YES

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/USE-355

Code review

Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

Why these changes are being introduced:

The first pass at bulk updating pre-existing documents, encapsulated in the command `bulk-update-embeddings`, required passing a `--run-id` to target a specific ETL run. This aligns with the most common use case of indexing embeddings within an ETL run. However, we have use cases now for indexing all current embeddings for a given source into Opensearch. These current embeddings may span multiple ETL runs.

How this addresses that need:

Updates the `bulk-update-embeddings` CLI command to require only `--source`, defaulting to retrieving all current embeddings for that source. This logic is identical to what `reindex-source` was already doing, but is decoupled from re-indexing the documents themselves, which is not always required.

While working on this, it was decided that raising an exception for a missing document when performing updates is not ideal. Some sources have indexing issues, and we have historically skipped those records. When we get to bulk updates, it's possible that we have embeddings for documents that were never indexed; we will log and skip them now in a similar fashion.

Side effects of this change:
* CLI supports ad-hoc indexing of all current embeddings for a source

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-355
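
The log-and-skip behavior described in this commit message can be sketched as follows. This is a minimal illustration and not TIM's actual code: the function name `count_update_errors` is hypothetical, and it assumes each item has the shape of an entry in an OpenSearch bulk API response, where a failed `update` action carries an `error` object whose `type` is `document_missing_exception` when the target document does not exist. Missing documents are logged and counted as errors, matching the results shown in the PR description, and nothing is raised.

```python
import logging

logger = logging.getLogger(__name__)


def count_update_errors(items):
    """Log failed update actions and return how many there were, without raising.

    `items` is assumed to be a list of bulk response entries such as
    {"update": {"_id": "...", "status": 404, "error": {"type": "..."}}}.
    """
    errors = 0
    for item in items:
        result = item.get("update", {})
        error = result.get("error")
        if error is None:
            # Action succeeded (or was a no-op); nothing to report here.
            continue
        if error.get("type") == "document_missing_exception":
            # Embedding exists for a record that was never indexed: log and move on.
            logger.warning(
                "Skipping embedding update, document not indexed: %s",
                result.get("_id"),
            )
        else:
            logger.error(
                "Update failed for document %s: %s", result.get("_id"), error
            )
        errors += 1
    return errors
```

Returning a count rather than raising keeps one bad record from aborting the update of every other record for the source.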

coveralls · 2026-01-27T16:15:13Z

Pull Request Test Coverage Report for Build 21405528265

Details

10 of 12 (83.33%) changed or added relevant lines in 2 files are covered.
1 unchanged line in 1 file lost coverage.
Overall coverage decreased (-0.2%) to 95.573%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
tim/opensearch.py	2	4	50.0%

Files with Coverage Reduction	New Missed Lines	%
tim/opensearch.py	1	93.81%

Totals
Change from base Build 20438471979:	-0.2%
Covered Lines:	475
Relevant Lines:	497

💛 - Coveralls

Why these changes are being introduced:

When bulk updating documents the result can be "noop" which means no operation was performed. This can happen if the update would have zero effect.

How this addresses that need:

Handle `result=noop` during bulk updates and set a new "skipped" result counter in the results.

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-355
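
As a rough sketch of the `noop` handling this commit describes, the tally below sorts each bulk update result into "updated", "skipped", or "errors". It is an illustration under stated assumptions, not the code in `tim/opensearch.py`: the function name `tally_bulk_update` is hypothetical, and the item shape assumed is that of an OpenSearch bulk API response entry, where a successful `update` action reports `"result": "updated"` or `"result": "noop"` and a failed one carries an `error` object.

```python
from collections import Counter


def tally_bulk_update(items):
    """Summarize bulk update results into updated/skipped/errors/total counts."""
    counts = Counter(updated=0, skipped=0, errors=0, total=0)
    for item in items:
        result = item.get("update", {})
        counts["total"] += 1
        if "error" in result:
            counts["errors"] += 1
        elif result.get("result") == "noop":
            # The stored document already matched the update, so nothing changed.
            counts["skipped"] += 1
        else:
            counts["updated"] += 1
    return dict(counts)
```

With this shape, a run in which every embedding already matches the indexed document reports all records as skipped, as in the libguides example in the PR description.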

ghukill added 2 commits January 27, 2026 09:39

Update dependencies

57696d3

ghukill marked this pull request as ready for review January 27, 2026 16:47

ghukill requested a review from a team as a code owner January 27, 2026 16:47
