-
Notifications
You must be signed in to change notification settings - Fork 558
fix: NewJWTSVID has shorter timeout when a cached JWT exists #6454
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
fix: NewJWTSVID has shorter timeout when a cached JWT exists #6454
Conversation
Signed-off-by: Jason Costello <jason@cofide.io>
Signed-off-by: Jason Costello <jason@cofide.io>
pkg/agent/client/client.go
Outdated
| rpcTimeout = 30 * time.Second | ||
| // We can be more aggressive with timeouts in cases where a valid SVID | ||
| // exists in the cache but is old enough to try for a new SVID quickly | ||
| rpcTimeoutWithCacheHit = 1 * time.Second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll want this to be a bit higher, 5 or at most 3 seconds. The higher we make the timeout the more likely it will be that someone will be hit by thy issue, but we also don't want it to be too low to make it impossible to get a SVID.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've increased to 5 seconds
pkg/agent/manager/manager.go
Outdated
| } | ||
|
|
||
| newSVID, err := m.client.NewJWTSVID(ctx, entry.EntryId, audience) | ||
| newSVID, err := m.client.NewJWTSVID(ctx, entry.EntryId, audience, ok) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you hinted in the text, we should include more things in the decision to use the lower timeout. If the JWT-SVID is expired we should use the full timeout since there's nothing for us to return to the user in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added an explicit check to see if the JWT-SVID is expired and added this to the existing ok conditional - along with some commentary at the point this is used in the method
Signed-off-by: Jason Costello <jason@cofide.io>
| rpcTimeout = 30 * time.Second | ||
| // We can be more aggressive with timeouts in cases where a valid SVID | ||
| // exists in the cache but is old enough to try for a new SVID quickly | ||
| rpcTimeoutWithCacheHit = 5 * time.Second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes look good in general, but to avoid breaking any existing users without any workaround and to allow some tuning for specific timeouts could we add an experimental config option to change this secondary timeout? It should allow setting it to any value smaller than the 30 seconds of the original timeout.
Setting the value to anything non-zero should also log a warning that this config is only temporary and that it's recommended to log an issue if you require a non-default timeout here. We can leave that option around for a few releases to see if anybody complaints and gather feedback and then we can remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense. Should we introduce an Experimental struct to the Agent's Config for this purpose? It doesn't appear to exist for the Unix Agent Config object as yet
Lines 24 to 118 in 46bd7ba
| type Config struct { | |
| // Address to bind the workload api to | |
| BindAddress net.Addr | |
| // Directory to store runtime data | |
| DataDir string | |
| // Directory to bind the admin api to | |
| AdminBindAddress net.Addr | |
| // The Validation Context resource name to use when fetching X.509 bundle together with federated bundles with Envoy SDS | |
| DefaultAllBundlesName string | |
| // The Validation Context resource name to use for the default X.509 bundle with Envoy SDS | |
| DefaultBundleName string | |
| // Disable custom Envoy SDS validator | |
| DisableSPIFFECertValidation bool | |
| // The TLS Certificate resource name to use for the default X509-SVID with Envoy SDS | |
| DefaultSVIDName string | |
| // How the agent will behave when seeing an unknown x509 cert from the server | |
| RebootstrapMode string | |
| // The agent will rebootstrap after configured amount of time on unknown x509 cert from the server | |
| RebootstrapDelay time.Duration | |
| // HealthChecks provides the configuration for health monitoring | |
| HealthChecks health.Config | |
| // Configurations for agent plugins | |
| PluginConfigs catalog.PluginConfigs | |
| Log logrus.FieldLogger | |
| // LogReopener facilitates handling a signal to rotate log file. | |
| LogReopener func(context.Context) error | |
| // Address of SPIRE server | |
| ServerAddress string | |
| // SVID key type | |
| WorkloadKeyType workloadkey.KeyType | |
| // SyncInterval controls how often the agent sync synchronizer waits | |
| SyncInterval time.Duration | |
| // UseSyncAuthorizedEntries controls if the new SyncAuthorizedEntries RPC | |
| // is used to sync entries from the server. | |
| UseSyncAuthorizedEntries bool | |
| // X509SVIDCacheMaxSize is a soft limit of max number of X509-SVIDs that would be stored in cache | |
| X509SVIDCacheMaxSize int | |
| // JWTSVIDCacheMaxSize is a soft limit of max number of JWT-SVIDs that would be stored in cache | |
| JWTSVIDCacheMaxSize int | |
| // Trust domain and associated CA bundle | |
| TrustDomain spiffeid.TrustDomain | |
| // Sources for getting Trust Bundles | |
| TrustBundleSources trustbundlesources.Bundle | |
| // Join token to use for attestation, if needed | |
| JoinToken string | |
| // If true enables profiling. | |
| ProfilingEnabled bool | |
| // Port used by the pprof web server when ProfilingEnabled == true | |
| ProfilingPort int | |
| // Frequency in seconds by which each profile file will be generated. | |
| ProfilingFreq int | |
| // Array of profiles names that will be generated on each profiling tick. | |
| ProfilingNames []string | |
| // Telemetry provides the configuration for metrics exporting | |
| Telemetry telemetry.FileConfig | |
| AllowUnauthenticatedVerifiers bool | |
| // List of allowed claims response when calling ValidateJWTSVID using a foreign identity | |
| AllowedForeignJWTClaims []string | |
| AuthorizedDelegates []string | |
| // AvailabilityTarget controls how frequently rotate SVIDs | |
| AvailabilityTarget time.Duration | |
| // TLSPolicy determines the post-quantum-safe TLS policy to apply to all TLS connections. | |
| TLSPolicy tlspolicy.Policy | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That Config is generated from the file based config, which has an experimental section. It's ok to add the field to the config from pkg/agent/config.go, even if it's exported. We treat it as an internal detail and not something that users should import and use.
So it just need to be in the experimental section in the config file, but internally we can do whatever we want.
Pull Request check list
Affected functionality
Cache behaviour for JWT-SVIDs on the Agent
Description of change
As described in #5994, the Agent will attempt to get a new JWT-SVID if a cached JWT-SVID has an impending expiry. This request has a default timeout of 30s, which can and does cause problems with other clients using much shorter timeouts.
This PR shortens the timeout of this cache hit case down to 1s, so the cached SVID can be made use of when the NewJWTSVID RPC on spire-server is unresponsive
Which issue this PR fixes
#5994