Skip to content

Conversation

@plaidfinch
Copy link
Contributor

@plaidfinch plaidfinch commented Dec 21, 2025

Adds version 13 (ADD_TRUST_QUORUM) to the Sled Agent API with the following endpoints for trust quorum reconfiguration:

  • POST /trust-quorum/reconfigure - Initiate a reconfiguration
  • POST /trust-quorum/upgrade-from-lrtq - Upgrade from low-rent (legacy) trust quorum
  • POST /trust-quorum/commit - Commit a trust-quorum
  • GET /trust-quorum/coordinator-status - Get coordinator status
  • POST /trust-quorum/prepare-and-commit - Prepare and commit a configuration
  • POST /trust-quorum/proxy/commit - Proxy a commit request to another TQ node
  • POST /trust-quorum/proxy/prepare-and-commit - Proxy a prepare-and-commit request to another TQ node
  • POST /trust-quorum/proxy/status - Request status from another TQ node

Types are organized per RFD 619 (via feeding Claude the RFD):

  • API types defined in sled-agent-types-versions/src/add_trust_quorum/
  • Re-exported via latest.rs and sled-agent-types/src/trust_quorum.rs
  • API trait uses latest:: paths for all trust quorum types

Also exports EncryptedRackSecrets, Salt, and Sha3_256Digest from trust-quorum-protocol for use in the prepare_and_commit handler.

@plaidfinch plaidfinch force-pushed the sled-agent-trust-quorum-api branch from cfdce9d to 79e918f Compare December 21, 2025 00:51
The following endpoints are created for trust quorum reconfiguration:

- POST `/trust-quorum/reconfigure` - Initiate a reconfiguration
- POST `/trust-quorum/upgrade-from-lrtq` - Upgrade from low-rent (legacy) trust quorum
- POST `/trust-quorum/commit` - Commit a trust-quorum
- GET `/trust-quorum/coordinator-status` - Get coordinator status
- POST `/trust-quorum/prepare-and-commit` - Prepare and commit a configuration

Types are organized per RFD 619 (via feeding Claude the RFD):

- API types defined in `sled-agent-types-versions/src/add_trust_quorum/`
- Re-exported via `latest.rs` and `sled-agent-types/src/trust_quorum.rs`
- API trait uses `latest::` paths for all trust quorum types

Also exports `EncryptedRackSecrets`, `Salt`, and `Sha3_256Digest` from `trust-quorum-protocol` for use in the `prepare_and_commit` handler.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@plaidfinch plaidfinch force-pushed the sled-agent-trust-quorum-api branch from 79e918f to 5ccf78a Compare December 21, 2025 00:52
@plaidfinch
Copy link
Contributor Author

This is still missing the proxy methods; these will be added in a separate commit. Planning to do a self-review of the PR to identify areas of uncertainty and cross-check for any LLM-introduced confabulations during the RFD 619 refactor.

TrustQuorumCommitResponse::Committed
}
trust_quorum::CommitStatus::Pending => {
TrustQuorumCommitResponse::Pending
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andrewjstone You mentioned that this response is always a fatal error during reconfiguration of the TQ. Does that mean we should return an error response here, or should that be generated higher-up in Nexus?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I don't think we'll ever actually get a response of Pending back here. We should end up with an error.

Pending is really only relevant for PrepareAndCommit operations, since those have to compute their own key share before they can commit. I'm unsure why I made the status the same. We should probably change that to just return () instead of CommitStatus, unless I remember why I did that.

TrustQuorumCommitResponse::Committed
}
trust_quorum::CommitStatus::Pending => {
TrustQuorumCommitResponse::Pending
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly here: should we ever return Pending from this level of the API?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is expected. Pending is a normal case for this. There's a bunch of background work to compute the key share - by requesting shares from other nodes - that require this response.

Comment on lines +45 to +46
EncryptedRackSecrets, RackSecret, ReconstructedRackSecret, Salt,
Sha3_256Digest,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it okay to export Salt and Sha3_256Digest here? Or is there a different or better way to handle the need for these in the API?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, totally fine. These types aren't going to change. We can create new types as needed. I did the same elsewhere.

@plaidfinch
Copy link
Contributor Author

plaidfinch commented Dec 21, 2025

Looks like the build is failing in CI because of: a missing implementation of the added SledAgentApi methods in nexus/mgs-updates/src/test_util/host_phase_2_test_state.rs but this does not show up on cargo build in omicron on my local workstation — is this file feature-gated? How should these methods be filled in for HostPhase2SledAgentImpl? It looks like the LRTQ bootstore-related methods are all unimplemented!(); should I just do the same here? Tentatively I have done this in 40eecf0.

plaidfinch and others added 2 commits December 20, 2025 20:38
Add three new proxy endpoints to the sled-agent API that allow forwarding
trust quorum operations to other nodes:

- POST `/trust-quorum/proxy/commit`
- POST `/trust-quorum/proxy/prepare-and-commit`
- POST `/trust-quorum/proxy/status`

Also refactors the trust quorum types to use `serde_with` with `Hex` encoding
for cleaner serialization of digests and encrypted rack secrets, replacing
manual hex string parsing with automatic encoding/decoding.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@plaidfinch plaidfinch force-pushed the sled-agent-trust-quorum-api branch from 60f26bb to 9f00be2 Compare December 21, 2025 23:30
Comment on lines +940 to +1037
async fn trust_quorum_reconfigure(
_request_context: RequestContext<Self::Context>,
_body: TypedBody<
sled_agent_types::trust_quorum::TrustQuorumReconfigureRequest,
>,
) -> Result<HttpResponseUpdatedNoContent, HttpError> {
unimplemented!()
}

async fn trust_quorum_upgrade_from_lrtq(
_request_context: RequestContext<Self::Context>,
_body: TypedBody<
sled_agent_types::trust_quorum::TrustQuorumLrtqUpgradeRequest,
>,
) -> Result<HttpResponseUpdatedNoContent, HttpError> {
unimplemented!()
}

async fn trust_quorum_commit(
_request_context: RequestContext<Self::Context>,
_body: TypedBody<
sled_agent_types::trust_quorum::TrustQuorumCommitRequest,
>,
) -> Result<
HttpResponseOk<
sled_agent_types::trust_quorum::TrustQuorumCommitResponse,
>,
HttpError,
> {
unimplemented!()
}

async fn trust_quorum_coordinator_status(
_request_context: RequestContext<Self::Context>,
) -> Result<
HttpResponseOk<
Option<sled_agent_types::trust_quorum::TrustQuorumCoordinatorStatus>,
>,
HttpError,
>{
unimplemented!()
}

async fn trust_quorum_prepare_and_commit(
_request_context: RequestContext<Self::Context>,
_body: TypedBody<
sled_agent_types::trust_quorum::TrustQuorumPrepareAndCommitRequest,
>,
) -> Result<
HttpResponseOk<
sled_agent_types::trust_quorum::TrustQuorumCommitResponse,
>,
HttpError,
> {
unimplemented!()
}

async fn trust_quorum_proxy_commit(
_request_context: RequestContext<Self::Context>,
_body: TypedBody<
sled_agent_types::trust_quorum::TrustQuorumProxyCommitRequest,
>,
) -> Result<
HttpResponseOk<
sled_agent_types::trust_quorum::TrustQuorumCommitResponse,
>,
HttpError,
> {
unimplemented!()
}

async fn trust_quorum_proxy_prepare_and_commit(
_request_context: RequestContext<Self::Context>,
_body: TypedBody<
sled_agent_types::trust_quorum::TrustQuorumProxyPrepareAndCommitRequest,
>,
) -> Result<
HttpResponseOk<
sled_agent_types::trust_quorum::TrustQuorumCommitResponse,
>,
HttpError,
> {
unimplemented!()
}

async fn trust_quorum_proxy_status(
_request_context: RequestContext<Self::Context>,
_body: TypedBody<
sled_agent_types::trust_quorum::TrustQuorumProxyStatusRequest,
>,
) -> Result<
HttpResponseOk<
sled_agent_types::trust_quorum::TrustQuorumNodeStatus,
>,
HttpError,
> {
unimplemented!()
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tentatively leaving these unimplemented!() because the rest of this impl appears to do the same.

/// Proxy a commit operation to another trust quorum node
#[endpoint {
method = POST,
path = "/trust-quorum/proxy/commit",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a nested path, or should it be more like /trust-quorum/proxy-commit? Is the proxy a "resource" from the REST point of view?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question. I think I like it the way you have it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also be a PUT request.

TrustQuorumCommitResponse::Committed
}
trust_quorum::CommitStatus::Pending => {
TrustQuorumCommitResponse::Pending
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another place where we return Pending: should this be an error?

Comment on lines +1437 to +1499
// We can't define a `From` impl between Alarm (the internal protocol type) and
// TrustQuorumAlarm (the external API type) due to orphan rules, and this conversion is only
// used here to bridge the API to the underlying trust-quorum types, so we'll just define a
// couple free functions in this scope for conversion:

fn convert_configuration(
config: trust_quorum_protocol::Configuration,
) -> TrustQuorumConfiguration {
TrustQuorumConfiguration {
rack_id: config.rack_id,
epoch: config.epoch.0,
coordinator: config.coordinator,
members: config
.members
.into_iter()
.map(|(id, digest)| (id, digest.0))
.collect(),
threshold: config.threshold.0,
}
}

fn convert_alarm(
alarm: trust_quorum_protocol::Alarm,
) -> TrustQuorumAlarm {
match alarm {
trust_quorum_protocol::Alarm::MismatchedConfigurations {
config1,
config2,
from,
} => TrustQuorumAlarm::MismatchedConfigurations {
config1: convert_configuration(config1),
config2: convert_configuration(config2),
from,
},
trust_quorum_protocol::Alarm::ShareComputationFailed { epoch, err } => {
TrustQuorumAlarm::ShareComputationFailed {
epoch: epoch.0,
error: err.to_string(),
}
}
trust_quorum_protocol::Alarm::CommittedConfigurationLost {
latest_committed_epoch,
collecting_epoch,
} => TrustQuorumAlarm::CommittedConfigurationLost {
latest_committed_epoch: latest_committed_epoch.0,
collecting_epoch: collecting_epoch.0,
},
trust_quorum_protocol::Alarm::RackSecretDecryptionFailed {
epoch,
err,
} => TrustQuorumAlarm::RackSecretDecryptionFailed {
epoch: epoch.0,
error: err.to_string(),
},
trust_quorum_protocol::Alarm::RackSecretReconstructionFailed {
epoch,
err,
} => TrustQuorumAlarm::RackSecretReconstructionFailed {
epoch: epoch.0,
error: err.to_string(),
},
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is annoyingly verbose to do, but I'm not sure how to do better while preserving API/protocol decoupling. Input?

Comment on lines +933 to +987
async fn trust_quorum_reconfigure(
_request_context: RequestContext<Self::Context>,
_body: TypedBody<TrustQuorumReconfigureRequest>,
) -> Result<HttpResponseUpdatedNoContent, HttpError> {
method_unimplemented()
}

async fn trust_quorum_upgrade_from_lrtq(
_request_context: RequestContext<Self::Context>,
_body: TypedBody<TrustQuorumLrtqUpgradeRequest>,
) -> Result<HttpResponseUpdatedNoContent, HttpError> {
method_unimplemented()
}

async fn trust_quorum_commit(
_request_context: RequestContext<Self::Context>,
_body: TypedBody<TrustQuorumCommitRequest>,
) -> Result<HttpResponseOk<TrustQuorumCommitResponse>, HttpError> {
method_unimplemented()
}

async fn trust_quorum_coordinator_status(
_request_context: RequestContext<Self::Context>,
) -> Result<HttpResponseOk<Option<TrustQuorumCoordinatorStatus>>, HttpError>
{
method_unimplemented()
}

async fn trust_quorum_prepare_and_commit(
_request_context: RequestContext<Self::Context>,
_body: TypedBody<TrustQuorumPrepareAndCommitRequest>,
) -> Result<HttpResponseOk<TrustQuorumCommitResponse>, HttpError> {
method_unimplemented()
}

async fn trust_quorum_proxy_commit(
_request_context: RequestContext<Self::Context>,
_body: TypedBody<TrustQuorumProxyCommitRequest>,
) -> Result<HttpResponseOk<TrustQuorumCommitResponse>, HttpError> {
method_unimplemented()
}

async fn trust_quorum_proxy_prepare_and_commit(
_request_context: RequestContext<Self::Context>,
_body: TypedBody<TrustQuorumProxyPrepareAndCommitRequest>,
) -> Result<HttpResponseOk<TrustQuorumCommitResponse>, HttpError> {
method_unimplemented()
}

async fn trust_quorum_proxy_status(
_request_context: RequestContext<Self::Context>,
_body: TypedBody<TrustQuorumProxyStatusRequest>,
) -> Result<HttpResponseOk<TrustQuorumNodeStatus>, HttpError> {
method_unimplemented()
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are all marked as unimplemented!() because the majority of the rest of this file is. I'm uncertain how to do anything other — should they be filled in?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's necessary. We'll probably want a live test at some point, but those operate against a real system.

Comment on lines +72 to +74
#[serde_as(as = "BTreeMap<_, Hex>")]
#[schemars(with = "BTreeMap<BaseboardId, String>")]
pub members: BTreeMap<BaseboardId, [u8; 32]>,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the right way to go about saying "this is a hex-encoded string" in the OpenAPI spec? Is there a more precise way to mark the encoding?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +90 to +92
#[serde_as(as = "BTreeMap<_, Hex>")]
#[schemars(with = "BTreeMap<BaseboardId, String>")]
pub members: BTreeMap<BaseboardId, [u8; 32]>,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly here, another hex-encoded string, which will not be precisely annotated as such in the OpenAPI spec. Can we do better?

Comment on lines +103 to +110
/// 32-byte salt used to derive the encryption key.
#[serde_as(as = "Hex")]
#[schemars(with = "String")]
pub salt: [u8; 32],
/// Encrypted data.
#[serde_as(as = "Hex")]
#[schemars(with = "String")]
pub data: Vec<u8>,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More hex-encoded strings not marked as such in the OpenAPI spec: can we do better?

Comment on lines +137 to +139
#[serde_as(as = "BTreeMap<_, Hex>")]
#[schemars(with = "BTreeMap<BaseboardId, String>")]
pub members: BTreeMap<BaseboardId, [u8; 32]>,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yet another hex-encoded string not marked as such in the OpenAPI spec.

/// An alarm indicating a protocol invariant violation in trust quorum.
#[derive(Debug, Clone, Serialize, Deserialize, JsonSchema)]
#[serde(rename_all = "snake_case")]
pub enum TrustQuorumAlarm {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is mostly copied over from the Alarm type but made so that it can be JsonSchema. I don't like how much duplication and conversion code this introduces, but I'm not sure how to maintain decoupling otherwise. Thoughts?

/// Initiate a trust quorum reconfiguration
#[endpoint {
method = POST,
path = "/trust-quorum/reconfigure",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These paths should all be named like resources - nouns, not verbs. I'd call this /trust-quorum/configuration

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still waffling on whether we should pluralize the endpoints such as configurations.

/// Initiate an upgrade from LRTQ
#[endpoint {
method = POST,
path = "/trust-quorum/upgrade-from-lrtq",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
path = "/trust-quorum/upgrade-from-lrtq",
path = "/trust-quorum/upgrade",


/// Commit a trust quorum configuration
#[endpoint {
method = POST,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a PUT, as it's idempotent.

/// Proxy a commit operation to another trust quorum node
#[endpoint {
method = POST,
path = "/trust-quorum/proxy/commit",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question. I think I like it the way you have it.

/// Proxy a commit operation to another trust quorum node
#[endpoint {
method = POST,
path = "/trust-quorum/proxy/commit",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also be a PUT request.

Comment on lines +45 to +46
EncryptedRackSecrets, RackSecret, ReconstructedRackSecret, Salt,
Sha3_256Digest,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, totally fine. These types aren't going to change. We can create new types as needed. I did the same elsewhere.

Comment on lines +72 to +74
#[serde_as(as = "BTreeMap<_, Hex>")]
#[schemars(with = "BTreeMap<BaseboardId, String>")]
pub members: BTreeMap<BaseboardId, [u8; 32]>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TrustQuorumCommitResponse::Committed
}
trust_quorum::CommitStatus::Pending => {
TrustQuorumCommitResponse::Pending
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I don't think we'll ever actually get a response of Pending back here. We should end up with an error.

Pending is really only relevant for PrepareAndCommit operations, since those have to compute their own key share before they can commit. I'm unsure why I made the status the same. We should probably change that to just return () instead of CommitStatus, unless I remember why I did that.

TrustQuorumCommitResponse::Committed
}
trust_quorum::CommitStatus::Pending => {
TrustQuorumCommitResponse::Pending
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is expected. Pending is a normal case for this. There's a bunch of background work to compute the key share - by requesting shares from other nodes - that require this response.

pub epoch: u64,
pub last_committed_epoch: Option<u64>,
pub members: BTreeSet<BaseboardId>,
pub threshold: u8,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to suggest using Epoch and Threshold from trust-quorum-protocol, but trust-quorum-protocolrelies onsled-agent-typesforBaseboardId`, so that would be a circular reference.

I'm starting to lean towards putting these these messages into trust-quorum-types-versions (and trust-quorum-types accordingly) and using them directly in the sled-agent API but not having sled-agent-types depend upon trust-quorum-types to break the circular dep.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants