@ghukill ghukill commented Jan 27, 2026

Purpose and background context

This PR allows us to use TIM to bulk update pre-existing documents for a given source with all current embeddings for that source.

As noted in both the ticket and the commits, the first pass focused only on indexing a single ETL run, which remains the common path in the ETL StepFunction. But until we rely on the ETL StepFunction entirely, there is value in being able to update a source in Opensearch with all current embeddings for that source and, more specifically, to do so without fully reindexing the source as reindex-source would.

How can a reviewer manually see the effects of these changes?

1- Set Dev1 AWS credentials

2- Run a bulk update for a small source like libguides:

pipenv run tim --verbose \
bulk-update-embeddings \
--source libguides \
s3://timdex-extract-dev-222053980223/dataset

When complete, observe the following results:

{"updated": 0, "skipped": 279, "errors": 0, "total": 279}

Because the embeddings already existed and were identical to the ones used for the update, all 279 are skipped.

You could instead perform a full re-index of a source, e.g. gismit:

pipenv run tim --verbose reindex-source --source gismit s3://timdex-extract-dev-222053980223/dataset

Results:

{"index": {"created": 2043, "updated": 0, "errors": 16, "total": 2059}, "update": {"updated": 2043, "errors": 16, "total": 2059, "skipped": 0}}

This is interesting for a couple of reasons:

  • we see that during indexing (creation) of documents we have 16 errors
  • during updating of documents we have 16 errors because the associated Opensearch document didn't exist
  • but the other 2043 are updated successfully

Includes new or updated dependencies?

YES | NO

Changes expectations for external applications?

YES

What are the relevant tickets?

Code review

  • Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

Why these changes are being introduced:

The first pass at bulk updating pre-existing documents, encapsulated in the
command `bulk-update-embeddings`, required passing a `--run-id` to target a
specific ETL run.  This aligns with the most common use case of indexing
embeddings within an ETL run.

However, we have use cases now for indexing all current embeddings for a given source
into Opensearch.  These current embeddings may span multiple ETL runs.

How this addresses that need:

Updates the `bulk-update-embeddings` CLI command to require only `--source`,
defaulting to retrieving all current embeddings for that source.  This logic is
identical to what `reindex-source` was already doing, but is decoupled from
re-indexing the documents themselves, which is not always required.

While working on this, it was decided that raising an exception for a missing
document when performing updates is not ideal.  Some sources have indexing issues,
and we have historically skipped those records.  When we get to bulk updates, it's
possible that we have embeddings for documents that were never indexed; we will log
and skip them now in a similar fashion.
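
The log-and-skip behavior described above could look roughly like the sketch below. This is not the code in `tim/opensearch.py`: the function name `count_failed_updates` is hypothetical, and the sketch assumes the bulk response items follow the usual Opensearch shape, where a failed update carries an `error` object whose `type` is `document_missing_exception` when the target document does not exist.

```python
import logging

logger = logging.getLogger(__name__)


def count_failed_updates(items):
    """Count failed update items, logging each one instead of raising.

    `items` is assumed to be the "items" list of a bulk API response.
    Hypothetical helper for illustration only.
    """
    errors = 0
    for item in items:
        result = item.get("update", {})
        error = result.get("error")
        if not error:
            continue
        errors += 1
        if error.get("type") == "document_missing_exception":
            # Embedding exists for a document that was never indexed: log and move on.
            logger.warning("Document not found, skipping update: %s", result.get("_id"))
        else:
            logger.error("Update failed for %s: %s", result.get("_id"), error)
    return errors
```

Counting the failure rather than raising keeps one problematic record from aborting the whole bulk run, which matches how indexing errors are already treated.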

Side effects of this change:
* CLI supports ad-hoc indexing of all current embeddings for a source

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-355

coveralls commented Jan 27, 2026

Pull Request Test Coverage Report for Build 21405528265

Details

  • 10 of 12 (83.33%) changed or added relevant lines in 2 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-0.2%) to 95.573%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
tim/opensearch.py | 2 | 4 | 50.0%

Files with Coverage Reduction | New Missed Lines | %
tim/opensearch.py | 1 | 93.81%

Totals
Change from base Build 20438471979: -0.2%
Covered Lines: 475
Relevant Lines: 497

💛 - Coveralls

Why these changes are being introduced:

When bulk updating documents, the result can be "noop", which means
no operation was performed.  This can happen if the update would have zero
effect, e.g. the embeddings are identical to those already on the document.

How this addresses that need:

Handle `result=noop` during bulk updates and set a new "skipped" result
counter in the results.
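
A minimal sketch of such a tally is shown below, assuming each item in the bulk response reports `result` as `"updated"` or `"noop"`, or carries an `error` object. The function name `tally_update_results` is hypothetical, and the choice to count an unrecognized `result` value as an error is an assumption of this sketch, not necessarily what TIM does.

```python
def tally_update_results(items):
    """Summarize bulk update response items into result counters.

    `items` is assumed to be the "items" list of a bulk API response.
    Hypothetical helper for illustration only.
    """
    counts = {"updated": 0, "skipped": 0, "errors": 0, "total": 0}
    for item in items:
        result = item.get("update", {})
        counts["total"] += 1
        if "error" in result:
            counts["errors"] += 1
        elif result.get("result") == "noop":
            # The update would not change the document, so nothing was written.
            counts["skipped"] += 1
        elif result.get("result") == "updated":
            counts["updated"] += 1
        else:
            # Assumption: anything unexpected is surfaced as an error.
            counts["errors"] += 1
    return counts
```

With this shape, a run where every embedding already matches the indexed document reports all items under "skipped" and none under "updated".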

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-355
@ghukill ghukill marked this pull request as ready for review January 27, 2026 16:47
@ghukill ghukill requested a review from a team as a code owner January 27, 2026 16:47