optimization of master key initialization #1

Open
zane-neo wants to merge 32 commits into main from optimize-master-blocking-issue

@zane-neo
Owner

Description

[Describe what this change achieves]

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@zane-neo force-pushed the optimize-master-blocking-issue branch from 325878a to c46e970 on November 27, 2025 07:25
@zane-neo
Owner Author

@CodiumAI-Agent /improve

@zane-neo force-pushed the optimize-master-blocking-issue branch from c46e970 to a77a844 on December 9, 2025 01:28
zane-neo and others added 27 commits February 2, 2026 13:47
…t get error when executing PER agent (opensearch-project#4579)

* use dedicated thread pool in response handler

Signed-off-by: zane-neo <zaniu@amazon.com>

* address comments

Signed-off-by: zane-neo <zaniu@amazon.com>

* optimize code

Signed-off-by: zane-neo <zaniu@amazon.com>

---------

Signed-off-by: zane-neo <zaniu@amazon.com>
)

Signed-off-by: Sicheng Song <sicheng.song@outlook.com>
…pensearch-project#4586)

* [Gemini Model Support] Filter Agent final response, address comments opensearch-project#4570

Signed-off-by: rithin-pullela-aws <rithinp@amazon.com>

* address comment

Signed-off-by: rithin-pullela-aws <rithinp@amazon.com>

* Address comment + add coverage

Signed-off-by: rithin-pullela-aws <rithinp@amazon.com>

---------

Signed-off-by: rithin-pullela-aws <rithinp@amazon.com>
Signed-off-by: Sicheng Song <sicheng.song@outlook.com>
…rch-project#4591)

* Support OpenAI Chat Completions API with new Agent Interface

Signed-off-by: rithin-pullela-aws <rithinp@amazon.com>

* Address comments

Signed-off-by: rithin-pullela-aws <rithinp@amazon.com>

---------

Signed-off-by: rithin-pullela-aws <rithinp@amazon.com>
…4599)

Signed-off-by: rithin-pullela-aws <rithinp@amazon.com>
…ntic memory queries (opensearch-project#4597)

* Add feature flag for remote agentic memory type

Signed-off-by: Sicheng Song <sicheng.song@outlook.com>

* fix: get message in agentic memory not working

Signed-off-by: Sicheng Song <sicheng.song@outlook.com>

* address comments

Signed-off-by: Sicheng Song <sicheng.song@outlook.com>

---------

Signed-off-by: Sicheng Song <sicheng.song@outlook.com>
* use global variables and add validations

Signed-off-by: Mingshi Liu <mingshl@amazon.com>

* change exception type

Signed-off-by: Mingshi Liu <mingshl@amazon.com>

* fix client stashContext

Signed-off-by: Mingshi Liu <mingshl@amazon.com>

* consolidate test

Signed-off-by: Mingshi Liu <mingshl@amazon.com>

---------

Signed-off-by: Mingshi Liu <mingshl@amazon.com>
… of llm connectors (opensearch-project#4394)

* [FEATURE] Add an option to turn on and off the certificate validation in ML Commons

Resolves opensearch-project#4371

Signed-off-by: Abdul Muneer Kolarkunnu <muneer.kolarkunnu@netapp.com>

* [FEATURE] Add an option to turn on and off the certificate validation in ML Commons

Resolves opensearch-project#4371

Signed-off-by: Abdul Muneer Kolarkunnu <muneer.kolarkunnu@netapp.com>

* Fixed coderabbitai comments
Resolves opensearch-project#4371

Signed-off-by: Abdul Muneer Kolarkunnu <muneer.kolarkunnu@netapp.com>

* Added test cases

Resolves opensearch-project#4371

Signed-off-by: Abdul Muneer Kolarkunnu <muneer.kolarkunnu@netapp.com>

* Fixed review comments

Resolves opensearch-project#4371

Signed-off-by: Abdul Muneer Kolarkunnu <muneer.kolarkunnu@netapp.com>

* Fixed review comments

Resolves opensearch-project#4371

Signed-off-by: Abdul Muneer Kolarkunnu <muneer.kolarkunnu@netapp.com>

* Fixed review comments

Resolves opensearch-project#4371

Signed-off-by: Abdul Muneer Kolarkunnu <muneer.kolarkunnu@netapp.com>

* Converting log based tests to argument check

Resolves opensearch-project#4371

Signed-off-by: Abdul Muneer Kolarkunnu <muneer.kolarkunnu@netapp.com>

* Converting log based tests to argument check

Resolves opensearch-project#4371

Signed-off-by: Abdul Muneer Kolarkunnu <muneer.kolarkunnu@netapp.com>

* Converting log based tests to argument check

Resolves opensearch-project#4371

Signed-off-by: Abdul Muneer Kolarkunnu <muneer.kolarkunnu@netapp.com>

* Fixing test failures

Resolves opensearch-project#4371

Signed-off-by: Abdul Muneer Kolarkunnu <muneer.kolarkunnu@netapp.com>

* changing info logs to warn

Resolves opensearch-project#4371

Signed-off-by: Abdul Muneer Kolarkunnu <muneer.kolarkunnu@netapp.com>

---------

Signed-off-by: Abdul Muneer Kolarkunnu <muneer.kolarkunnu@netapp.com>
Signed-off-by: Muneer Kolarkunnu <33829651+akolarkunnu@users.noreply.github.com>
Co-authored-by: Dhrubo Saha <dhrubo@amazon.com>
Co-authored-by: Mingshi Liu <mingshl@amazon.com>
* fix previous tool results missing

Signed-off-by: Jiaping Zeng <jpz@amazon.com>

* add tool message support in agent revamp + update AGUI processing

Signed-off-by: Jiaping Zeng <jpz@amazon.com>

* support image in streaming

Signed-off-by: Jiaping Zeng <jpz@amazon.com>

* add/fix tests

Signed-off-by: Jiaping Zeng <jpz@amazon.com>

* add support for tool messages for OpenAI

Signed-off-by: Jiaping Zeng <jpz@amazon.com>

---------

Signed-off-by: Jiaping Zeng <jpz@amazon.com>
Signed-off-by: zane-neo <zaniu@amazon.com>
… Access (opensearch-project#4608)

* Fix: Restore Thread Context in MLAgentExecutor properly to fix Memory Access

Signed-off-by: rithin-pullela-aws <rithinp@amazon.com>

* Fix Rebase issue

Signed-off-by: rithin-pullela-aws <rithinp@amazon.com>

---------

Signed-off-by: rithin-pullela-aws <rithinp@amazon.com>
…ic memory (opensearch-project#4621)

Signed-off-by: Sicheng Song <sicheng.song@outlook.com>
…oject#4626)

* add overload constructor to unblock skills plugin

Signed-off-by: Jiaping Zeng <jpz@amazon.com>

* add test

Signed-off-by: Jiaping Zeng <jpz@amazon.com>

---------

Signed-off-by: Jiaping Zeng <jpz@amazon.com>
…uring agent register (opensearch-project#4637)

* add more tests

Signed-off-by: Mingshi Liu <mingshl@amazon.com>

add test notations

Signed-off-by: Mingshi Liu <mingshl@amazon.com>

* apply spotless

Signed-off-by: Mingshi Liu <mingshl@amazon.com>

* force run CI

Signed-off-by: Mingshi Liu <mingshl@amazon.com>

---------

Signed-off-by: Mingshi Liu <mingshl@amazon.com>
* Add 3.5.0 release notes

Signed-off-by: Jiaping Zeng <jpz@amazon.com>

* Update release notes with latest changes

Signed-off-by: Jiaping Zeng <jpz@amazon.com>

* update release notes with latest bug fixes

Signed-off-by: Jiaping Zeng <jpz@amazon.com>

---------

Signed-off-by: Jiaping Zeng <jpz@amazon.com>
Signed-off-by: opensearch-ci-bot <opensearch-infra@amazon.com>
Co-authored-by: opensearch-ci-bot <opensearch-infra@amazon.com>
…make ml-common fips build param aware (opensearch-project#4654)

* Fix ML build with 1) adapt to gradle shadow plugin v9 upgrade and 2) make ml-common fips build param aware

Signed-off-by: Craig Perkins <cwperx@amazon.com>

* Add to build tasks as well

Signed-off-by: Craig Perkins <cwperx@amazon.com>

* Add FipsBuildParam in plugin/build.gradle

Signed-off-by: Craig Perkins <cwperx@amazon.com>

---------

Signed-off-by: Craig Perkins <cwperx@amazon.com>
…h-project#4659)

Use "-Pcrypto.standard=FIPS-140-3" (quoted) instead of
-Pcrypto.standard=FIPS-140-3 for consistency with other OpenSearch
plugin repositories (e.g. flow-framework PR opensearch-project#1322).

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>
Signed-off-by: Nathalie Jonathan <nathhjo@amazon.com>
…tion (opensearch-project#4656)

* fix: fix integ test for ML inference range query rewrite

- RestMLInferenceSearchRequestProcessorIT: Remove pre/post process
  functions from Bedrock connector so raw response is available as
  dataAsMap. Use embedding.length() to get embedding dimension as
  integer for the range query. Use diary_embedding_size_int (integer
  field) instead of diary_embedding_size (keyword field).

- RestMLRAGSearchProcessorIT: Update Cohere model from
  command-a-03-2025 (v2 API only) to command-r-08-2024 (v1 API).

- plugin/build.gradle: Add bc-fips to unit test classpath in FIPS mode
  via detached configuration to fix NoClassDefFoundError.

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* fix: revert manual substitution, restore StringSubstitutor

The manual substitution was unnecessary. The correct fix is removing
the post_process_function from the connector so the raw Bedrock response
is available as dataAsMap, allowing embedding.length() to work directly.

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* test: remove unnecessary unit test for numeric type preservation

The test was added when the fix was in MLInferenceSearchRequestProcessor
but since the fix is now in the integration test (removing post-process
function from connector), this unit test adds no value.

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* test: restore missing javadoc for testExecute_rewriteListFromTermQueryToGeometryQuerySuccess

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

---------

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>
…ect#4666)

Signed-off-by: Peter Zhu <zhujiaxi@amazon.com>
Co-authored-by: Dhrubo Saha <dhrubo@amazon.com>
…asters (opensearch-project#4665)

* fix: improve integration test stability

- SearchModelGroupITTests: add supportsDedicatedMasters=false to prevent
  suite timeout when test framework randomly adds dedicated cluster-manager
  nodes based on random seed

- BedRockConnectorBodies.json, RestMLInferenceSearchResponseProcessorIT,
  RestMLRAGSearchProcessorIT: increase max_connection to 200 in Bedrock
  connector configs to prevent connection pool exhaustion under high
  concurrent request rates in CI

- RestMLRAGSearchProcessorIT: update Cohere connector model from
  command-a-03-2025 (v2 API only) to command-r-08-2024 (v1 API)

- MLCommonsRestTestCase: add isServiceReachable(hostname) helper for
  skipping tests when external services are unreachable

- RestMLInferenceIngestProcessorIT: skip OpenAI tests when api.openai.com
  is not reachable in addition to OPENAI_KEY null check

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* fix: fail fast in waitForTask when task reaches terminal failure state

When a model download fails (e.g. network error), the task goes to
FAILED state. The previous waitForTask only checked for the target
state (COMPLETED), so it would loop until CUSTOM_MODEL_TIMEOUT
(20,000 seconds), causing the 20-minute suite timeout to trigger first.

Now waitForTask also exits immediately on FAILED or CANCELLED states,
allowing the test to fail with a clear assertion error instead of a
suite timeout.

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>
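
For readers of this commit, the fail-fast behaviour can be sketched roughly as below. This is a minimal illustration, not the actual `waitForTask` in the test base class: the `TaskWaitSketch` class, the `TaskState` enum, and the `Supplier`-based polling signature are all assumptions made for the example.

```java
import java.util.Set;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical stand-in for the test helper; names are illustrative only. */
final class TaskWaitSketch {
    enum TaskState { CREATED, RUNNING, COMPLETED, FAILED, CANCELLED }

    // States from which the task can never reach the target state.
    private static final Set<TaskState> TERMINAL_FAILURES =
        Set.of(TaskState.FAILED, TaskState.CANCELLED);

    /**
     * Polls until the task reaches the target state, reaches a terminal
     * failure state, or the timeout elapses. Returns the last state seen,
     * so the caller can assert on it and get a clear failure message.
     */
    static TaskState waitForTask(Supplier<TaskState> poll, TaskState target,
                                 long timeout, TimeUnit unit) {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        TaskState state = poll.get();
        while (state != target
                && !TERMINAL_FAILURES.contains(state)
                && deadline - System.nanoTime() > 0) {
            try {
                Thread.sleep(10); // short pause between polls
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                break;
            }
            state = poll.get();
        }
        return state;
    }
}
```

With a check like this, a task that goes to FAILED returns on the next poll, and the caller's assertion on the returned state fails the test directly rather than waiting for the suite timeout.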

---------

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>
* fix: stats job collector

Signed-off-by: Pavan Yekbote <pybot@amazon.com>

* fix: test cases

Signed-off-by: Pavan Yekbote <pybot@amazon.com>

---------

Signed-off-by: Pavan Yekbote <pybot@amazon.com>
…x flaky IndexUtilsTests (opensearch-project#4668)

* fix: skip OpenAI RAG tests when api.openai.com is unreachable

When api.openai.com is not reachable on CI, model registration tasks
fail silently returning model_id=null, causing deployRemoteModel(null)
to hit /_plugins/_ml/models/null/_deploy and fail with 404.

Reuse the existing isServiceReachable() helper to skip all four OpenAI
tests when the service is not reachable.

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* fix: make testGetNumberOfDocumentsInIndex_SearchQuery synchronous

The assertion was running inside an async ActionListener callback on a
search thread. If it threw AssertionError, it was caught as an uncaught
exception on that thread rather than propagated to the test thread,
causing the test to appear to pass or fail non-deterministically.

Use PlainActionFuture to block the test thread until the result is
available, then assert on the test thread.

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>
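
The pattern described in this commit can be sketched as follows. The commit uses `PlainActionFuture`; to keep the example self-contained, this sketch substitutes the JDK's `CompletableFuture`, and `countDocsAsync` is a hypothetical callback-style API invented for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Illustrative only: CompletableFuture stands in for PlainActionFuture. */
final class SyncAssertSketch {
    /** Hypothetical async API that delivers a document count on another thread. */
    static void countDocsAsync(Consumer<Long> listener) {
        new Thread(() -> listener.accept(42L), "search-thread").start();
    }

    /**
     * Bridges the callback to a future and blocks the calling thread until
     * the result arrives, so any assertion runs on the caller's thread.
     */
    static long countDocsBlocking() {
        CompletableFuture<Long> future = new CompletableFuture<>();
        countDocsAsync(future::complete);
        // orTimeout guards against a callback that never fires.
        return future.orTimeout(5, TimeUnit.SECONDS).join();
    }
}
```

An assertion made on the value returned by `countDocsBlocking()` runs on the test thread, so a failure propagates to the test runner instead of surfacing as an uncaught exception on the callback thread.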

---------

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>
rithin-pullela-aws and others added 5 commits February 26, 2026 16:50
…#4667)

* Optimize integration test setup to eliminate redundant per-test work

- Transport IT: use @SuiteScopeTestCase to run expensive model training
  and data loading once per class instead of per test method
- Memory IT: change scope=TEST to scope=SUITE to reuse cluster across
  tests instead of restarting a 2-node cluster per test method
- REST IT: add static guard around disableClusterConnectorAccessControl()
  + Thread.sleep(20000) so cluster settings and sleep run once per class

Measured 82% reduction across tested classes (776s → 140s).
16 files changed, 0 test regressions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: rithin-pullela-aws <rithinp@amazon.com>

* Fix REST IT: remove base class settings guard, only guard Thread.sleep

OpenSearchRestTestCase.cleanUpCluster() calls wipeClusterSettings()
after every test method, clearing all persistent cluster settings.
The previous static guard in setupSettings() prevented re-applying
them, causing failures in MLModelAutoReDeployerIT, RestMLDeleteTaskActionIT,
RestMLMemoryCircuitBreakerIT, and others.

Fix:
- Remove baseSettingsInitialized guard from MLCommonsRestTestCase.setupSettings()
  (4 cheap REST calls, must run every test since settings get wiped)
- In subclasses, move disableClusterConnectorAccessControl() outside
  the guard (must re-run after wipe), only guard Thread.sleep(20000)
  (expensive, only needed once for initial propagation)

Verified: full integTest suite passes (same pre-existing OpenAI API
key failures as baseline, 0 new regressions).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: rithin-pullela-aws <rithinp@amazon.com>
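
A rough sketch of the guard placement described in this fix, assuming a per-class static flag. `SetupGuardSketch` and its `Runnable` parameters are hypothetical and only illustrate which step sits inside the guard.

```java
/** Illustrative only: shows which setup step is guarded and which is not. */
final class SetupGuardSketch {
    // Static, so the flag survives across test methods of the same class.
    private static boolean initialWaitDone = false;

    /**
     * Intended to run before every test method.
     * applySettings: cheap, must re-run because settings are wiped after each test.
     * expensiveWait: only needed once, for the initial propagation.
     */
    static synchronized void setUp(Runnable applySettings, Runnable expensiveWait) {
        applySettings.run();
        if (!initialWaitDone) {
            expensiveWait.run();
            initialWaitDone = true;
        }
    }
}
```

Calling `setUp` three times applies the settings three times but performs the expensive wait only once.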

* Replace Thread.sleep(20000) with active cluster settings polling

The 20-second sleep after disableClusterConnectorAccessControl() was a
brute-force wait for cluster settings propagation. Since PUT _cluster/settings
returns after master acknowledgment, the setting is available almost immediately
on 1-2 node test clusters.

Added waitForClusterSettingPropagation() utility in MLCommonsRestTestCase that
polls GET _cluster/settings?flat_settings=true until the setting appears, with
a 10-second timeout as safety net. Resolves the existing TODO comments asking
whether the sleep could be replaced with a cluster state check.

Measured: RestBedRockInferenceIT dropped from 32.6s to 15.4s. Full integTest
suite dropped from 8m22s to 6m11s.

Signed-off-by: rithin-pullela-aws <rithinp@amazon.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: rithin-pullela-aws <rithinp@amazon.com>

* Validate cluster setting value in polling check

Update waitForClusterSettingPropagation to verify the setting has the
expected value (not just that the key exists) for correctness.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: rithin-pullela-aws <rithinp@amazon.com>
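
The polling approach from the two commits above can be sketched like this. It is not the real `waitForClusterSettingPropagation()`: the `SettingPollSketch` class, the `Supplier` that stands in for the `GET _cluster/settings?flat_settings=true` call, and the 50 ms poll interval are assumptions for the example.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Illustrative only: polls a settings source until a key has the expected value. */
final class SettingPollSketch {
    /**
     * fetchFlatSettings stands in for a REST call that returns flat settings.
     * Returns true once the key maps to the expected value, false on timeout.
     */
    static boolean waitForSetting(Supplier<Map<String, String>> fetchFlatSettings,
                                  String key, String expected, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        do {
            // Compare the value, not just key presence, so a stale value does not pass.
            if (Objects.equals(expected, fetchFlatSettings.get().get(key))) {
                return true;
            }
            try {
                Thread.sleep(50); // assumed poll interval
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        } while (deadline - System.nanoTime() > 0);
        return false;
    }
}
```

Checking the value and not just the key matters when the setting already exists with a different value: a presence-only check would return immediately even though the update has not propagated.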

* Fix waitForTask timeout bug and skip redundant remote model deploy

The CUSTOM_MODEL_TIMEOUT (20_000) was passed with TimeUnit.SECONDS,
creating a 20,000-second (5.5 hour) effective timeout instead of the
intended 20 seconds. This caused tests to hang until suite timeout
killed them. Fixed by using TimeUnit.MILLISECONDS.

Also skip the deploy step for remote model registration in integration
tests since remote models do not require explicit deployment.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: rithin-pullela-aws <rithinp@amazon.com>
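
The unit mismatch is easy to see with `java.util.concurrent.TimeUnit` conversions. The sketch below reuses the constant name and value from the commit message; the `TimeoutUnitSketch` class and `effectiveSeconds` helper are hypothetical.

```java
import java.util.concurrent.TimeUnit;

/** Illustrative only: shows how the same number means different waits per unit. */
final class TimeoutUnitSketch {
    // Value quoted in the commit message.
    static final long CUSTOM_MODEL_TIMEOUT = 20_000;

    /** Converts a (timeout, unit) pair to whole seconds. */
    static long effectiveSeconds(long timeout, TimeUnit unit) {
        return unit.toSeconds(timeout);
    }

    public static void main(String[] args) {
        // Interpreted as seconds: 20000 s, more than five hours.
        System.out.println(effectiveSeconds(CUSTOM_MODEL_TIMEOUT, TimeUnit.SECONDS));
        // Interpreted as milliseconds: the intended 20 s.
        System.out.println(effectiveSeconds(CUSTOM_MODEL_TIMEOUT, TimeUnit.MILLISECONDS));
    }
}
```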

---------

Signed-off-by: rithin-pullela-aws <rithinp@amazon.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…nt (opensearch-project#4645)

* extend memory interface

Signed-off-by: Jiaping Zeng <jpz@amazon.com>

* clean up + support remote agentic memory

Signed-off-by: Jiaping Zeng <jpz@amazon.com>

* add tests

Signed-off-by: Jiaping Zeng <jpz@amazon.com>

* send MessagesSnapshot AGUI event

Signed-off-by: Jiaping Zeng <jpz@amazon.com>

* sort using only messageId

Signed-off-by: Jiaping Zeng <jpz@amazon.com>

* test: create memory session document using if provided sessionId does not exist

Signed-off-by: Jiaping Zeng <jpz@amazon.com>

* move initial memory saving logic from MLChatAgentRunner to MLAgentExecutor

Signed-off-by: Jiaping Zeng <jpz@amazon.com>

* fix streaming text accumulation

Signed-off-by: Jiaping Zeng <jpz@amazon.com>

* address comments

Signed-off-by: Jiaping Zeng <jpz@amazon.com>

* Revert "move initial memory saving logic from MLChatAgentRunner to MLAgentExecutor"

This reverts commit a364488.

Signed-off-by: Jiaping Zeng <jpz@amazon.com>

* disable conversation index memory when using messages array input

Signed-off-by: Jiaping Zeng <jpz@amazon.com>

* add messageId in MessagesSnapshot and remove page context from memory

Signed-off-by: Jiaping Zeng <jpz@amazon.com>

* Revert "disable conversation index memory when using messages array input"

This reverts commit 1e667ee.

Signed-off-by: Jiaping Zeng <jpz@amazon.com>

* refactor memory handling for unified interface

Signed-off-by: Jiaping Zeng <jpz@amazon.com>

* remove tool result from conv index memory

Signed-off-by: Jiaping Zeng <jpz@amazon.com>

* store text in message with image in conv index memory

Signed-off-by: Jiaping Zeng <jpz@amazon.com>

---------

Signed-off-by: Jiaping Zeng <jpz@amazon.com>
Signed-off-by: zane-neo <zaniu@amazon.com>
Signed-off-by: zane-neo <zaniu@amazon.com>
Signed-off-by: zane-neo <zaniu@amazon.com>
@zane-neo force-pushed the optimize-master-blocking-issue branch from a77a844 to b223c0a on February 27, 2026 12:14
Labels

None yet

Projects

None yet
