Skip to content

Conversation

@jsnctl
Copy link
Contributor

@jsnctl jsnctl commented Nov 21, 2025

Pull Request check list

  • Commit conforms to CONTRIBUTING.md?
  • Proper tests/regressions included?
  • Documentation updated?

Affected functionality

Cache behaviour for JWT-SVIDs on the Agent

Description of change

As described in #5994, the Agent will attempt to get a new JWT-SVID if a cached JWT-SVID has an impending expiry. This request has a default timeout of 30s, which can and does cause problems with other clients using much shorter timeouts.

This PR shortens the timeout of this cache hit case down to 1s, so the cached SVID can be made use of when the NewJWTSVID RPC on spire-server is unresponsive

Which issue this PR fixes

#5994


⚠️ I did not add new unit test functionality to this as I think this might need some thinking on how to properly assert the difference in the RPC request. Open to suggestions on how to best achieve this.

Signed-off-by: Jason Costello <jason@cofide.io>
Signed-off-by: Jason Costello <jason@cofide.io>
Signed-off-by: Jason Costello <jason@cofide.io>
rpcTimeout = 30 * time.Second
// We can be more aggressive with timeouts in cases where a valid SVID
// exists in the cache but is old enough to try for a new SVID quickly
rpcTimeoutWithCacheHit = 1 * time.Second
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll want this to be a bit higher, 5 or at most 3 seconds. The higher we make the timeout the more likely it will be that someone will be hit by thy issue, but we also don't want it to be too low to make it impossible to get a SVID.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've increased to 5 seconds

}

newSVID, err := m.client.NewJWTSVID(ctx, entry.EntryId, audience)
newSVID, err := m.client.NewJWTSVID(ctx, entry.EntryId, audience, ok)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you hinted in the text, we should include more things in the decision to use the lower timeout. If the JWT-SVID is expired we should use the full timeout since there's nothing for us to return to the user in that case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added an explicit check to see if the JWT-SVID is expired and added this to the existing ok conditional - along with some commentary at the point this is used in the method

Signed-off-by: Jason Costello <jason@cofide.io>
@jsnctl jsnctl requested a review from sorindumitru November 28, 2025 21:07
@sorindumitru sorindumitru added this to the 1.14.1 milestone Dec 2, 2025
rpcTimeout = 30 * time.Second
// We can be more aggressive with timeouts in cases where a valid SVID
// exists in the cache but is old enough to try for a new SVID quickly
rpcTimeoutWithCacheHit = 5 * time.Second
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good in general, but to avoid breaking any existing users without any workaround and to allow some tuning for specific timeouts could we add an experimental config option to change this secondary timeout? It should allow setting it to any value smaller than the 30 seconds of the original timeout.

Setting the value to anything non-zero should also log a warning that this config is only temporary and that it's recommended to log an issue if you require a non-default timeout here. We can leave that option around for a few releases to see if anybody complaints and gather feedback and then we can remove it.

Copy link
Contributor Author

@jsnctl jsnctl Dec 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense. Should we introduce an Experimental struct to the Agent's Config for this purpose? It doesn't appear to exist for the Unix Agent Config object as yet

spire/pkg/agent/config.go

Lines 24 to 118 in 46bd7ba

type Config struct {
// Address to bind the workload api to
BindAddress net.Addr
// Directory to store runtime data
DataDir string
// Directory to bind the admin api to
AdminBindAddress net.Addr
// The Validation Context resource name to use when fetching X.509 bundle together with federated bundles with Envoy SDS
DefaultAllBundlesName string
// The Validation Context resource name to use for the default X.509 bundle with Envoy SDS
DefaultBundleName string
// Disable custom Envoy SDS validator
DisableSPIFFECertValidation bool
// The TLS Certificate resource name to use for the default X509-SVID with Envoy SDS
DefaultSVIDName string
// How the agent will behave when seeing an unknown x509 cert from the server
RebootstrapMode string
// The agent will rebootstrap after configured amount of time on unknown x509 cert from the server
RebootstrapDelay time.Duration
// HealthChecks provides the configuration for health monitoring
HealthChecks health.Config
// Configurations for agent plugins
PluginConfigs catalog.PluginConfigs
Log logrus.FieldLogger
// LogReopener facilitates handling a signal to rotate log file.
LogReopener func(context.Context) error
// Address of SPIRE server
ServerAddress string
// SVID key type
WorkloadKeyType workloadkey.KeyType
// SyncInterval controls how often the agent sync synchronizer waits
SyncInterval time.Duration
// UseSyncAuthorizedEntries controls if the new SyncAuthorizedEntries RPC
// is used to sync entries from the server.
UseSyncAuthorizedEntries bool
// X509SVIDCacheMaxSize is a soft limit of max number of X509-SVIDs that would be stored in cache
X509SVIDCacheMaxSize int
// JWTSVIDCacheMaxSize is a soft limit of max number of JWT-SVIDs that would be stored in cache
JWTSVIDCacheMaxSize int
// Trust domain and associated CA bundle
TrustDomain spiffeid.TrustDomain
// Sources for getting Trust Bundles
TrustBundleSources trustbundlesources.Bundle
// Join token to use for attestation, if needed
JoinToken string
// If true enables profiling.
ProfilingEnabled bool
// Port used by the pprof web server when ProfilingEnabled == true
ProfilingPort int
// Frequency in seconds by which each profile file will be generated.
ProfilingFreq int
// Array of profiles names that will be generated on each profiling tick.
ProfilingNames []string
// Telemetry provides the configuration for metrics exporting
Telemetry telemetry.FileConfig
AllowUnauthenticatedVerifiers bool
// List of allowed claims response when calling ValidateJWTSVID using a foreign identity
AllowedForeignJWTClaims []string
AuthorizedDelegates []string
// AvailabilityTarget controls how frequently rotate SVIDs
AvailabilityTarget time.Duration
// TLSPolicy determines the post-quantum-safe TLS policy to apply to all TLS connections.
TLSPolicy tlspolicy.Policy
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That Config is generated from the file based config, which has an experimental section. It's ok to add the field to the config from pkg/agent/config.go, even if it's exported. We treat it as an internal detail and not something that users should import and use.

So it just need to be in the experimental section in the config file, but internally we can do whatever we want.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants