USE 355 - index current embeddings for a source #377
Purpose and background context
This PR allows us to use TIM to bulk update pre-existing documents for a given source with all current embeddings for that source.
As noted in both the ticket and the commits, the first pass focused only on indexing a single ETL run, which remains the common path in the ETL StepFunction. Until we rely on the ETL StepFunction entirely, though, there is value in being able to update a source in OpenSearch with the current embeddings for that source, and specifically to do so without fully reindexing the source as reindex-source would do.
How can a reviewer manually see the effects of these changes?
1- Set Dev1 AWS credentials
2- Run a bulk update for a small source like libguides:
When complete, observe the following results:
{"updated": 0, "skipped": 279, "errors": 0, "total": 279}
Because the embeddings already existed and were identical to the ones used for the update, all 279 documents are skipped.
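For reviewers who want a mental model of how a summary like the one above comes about, here is a minimal sketch in Python. It is not TIM's implementation: the function name, the in-memory dictionaries, and the choice to count a missing document as an error are all assumptions made for illustration. The PR description does not say where the "identical, so skip" comparison actually happens.

```python
from collections import Counter


def tally_embedding_updates(existing, incoming):
    """Summarize what a bulk embeddings update would do.

    Hypothetical helper, not part of TIM. Both arguments map a
    document id to its embedding (a list of floats).
    """
    counts = Counter(updated=0, skipped=0, errors=0)
    for doc_id, embedding in incoming.items():
        if doc_id not in existing:
            # Assumption: an update targets pre-existing documents only,
            # so a missing document is counted as an error.
            counts["errors"] += 1
        elif existing[doc_id] == embedding:
            # The stored embedding is identical: nothing to write.
            counts["skipped"] += 1
        else:
            counts["updated"] += 1
    counts["total"] = len(incoming)
    return dict(counts)


# Two documents whose stored embeddings already match the incoming ones.
existing = {"doc-a": [0.1, 0.2], "doc-b": [0.3, 0.4]}
incoming = {"doc-a": [0.1, 0.2], "doc-b": [0.3, 0.4]}
print(tally_embedding_updates(existing, incoming))
# {'updated': 0, 'skipped': 2, 'errors': 0, 'total': 2}
```

With every incoming embedding matching what is stored, the whole batch lands in "skipped", which mirrors the shape of the libguides result.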
You could instead perform a full re-index of a source, e.g. gismit:
Results:
{"index": {"created": 2043, "updated": 0, "errors": 16, "total": 2059}, "update": {"updated": 2043, "errors": 16, "total": 2059, "skipped": 0}}
This is interesting for a couple of reasons:
Includes new or updated dependencies?
YES | NO
Changes expectations for external applications?
YES
What are the relevant tickets?
Code review