Add sticky sessions to premium target groups #989

Closed
milesAraya wants to merge 1731 commits into develop-main from fix/premium-assignment-manager

@milesAraya
Collaborator

Content

Summary

  • Premium target groups lack sticky session configuration, causing the ALB to round-robin requests across containers when a TG has multiple registered tasks — breaking in-memory state and log offsets mid-session.
  • Enable ALB cookie-based sticky sessions (lb_cookie, 300s) on every premium target group at creation time, matching the main TG config in compute.tf. Retroactively apply stickiness to pre-existing TGs via the DuplicateTargetGroupName fallback and assign_premium_user inline creation.
  • Call fix_incorrect_is_shared_flags() in scheduled monitoring before process_shared_instance_optimization() so stale flags are corrected before the optimizer attempts unnecessary migrations.
  • Add the ModifyTargetGroupAttributes IAM permission required by the new modify_target_group_attributes API call.

Design Decisions

  • Helper function _enable_sticky_sessions() instead of inline calls. Three call sites need the same modify_target_group_attributes invocation. A helper keeps them consistent and avoids duplicating the attribute list.
  • try/except in the helper, not at each call site. A TG without stickiness is better than no TG at all. If the API call fails (throttling, transient error), TG creation still succeeds and stickiness is retried on the next assignment via the DuplicateTargetGroupName handler.
  • Stickiness applied in the DuplicateTargetGroupName handler. Pre-existing TGs created before this change lack stickiness. The handler now applies it transparently — no manual migration needed.
  • fix_incorrect_is_shared_flags() called before process_shared_instance_optimization(). Correcting stale flags first prevents the optimizer from attempting unnecessary migrations for users already alone on their instance.
  • Isolated try/except for fix_incorrect_is_shared_flags(). Failure doesn't break the existing optimization step. Follows the file's convention of inline import traceback.
  • Module-level constant STICKY_SESSION_DURATION_SECONDS = 300. Follows codebase convention of extracting magic numbers into named constants alongside DEFAULT_IDLE_TIMEOUT_HOURS, LOCK_TIMEOUT_SECONDS, etc.
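
The ordering and isolation decision for the two monitoring calls can be illustrated with a small sketch. The wrapper name `run_shared_instance_step` and the idea of passing the two functions in as callables are assumptions made so the example is self-contained; only the call order, the isolated try/except, and the inline `import traceback` come from the description above.

```python
import logging
from typing import Callable, TypeVar

logger = logging.getLogger(__name__)
T = TypeVar("T")


def run_shared_instance_step(
    fix_incorrect_is_shared_flags: Callable[[], object],
    process_shared_instance_optimization: Callable[[], T],
) -> T:
    """Correct stale is_shared flags first, then run the optimizer regardless."""
    try:
        fix_incorrect_is_shared_flags()
    except Exception as exc:
        import traceback  # inline import, mirroring the convention noted above

        logger.error(
            "fix_incorrect_is_shared_flags failed: %s\n%s",
            exc,
            traceback.format_exc(),
        )
    # Runs whether or not the flag cleanup succeeded.
    return process_shared_instance_optimization()
```

Because only the cleanup call sits inside the try/except, an exception from the optimizer itself still propagates as it would have before this change.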

Evidence

  • compute.tf:85-89 defines the main TG stickiness config (lb_cookie, 300s) — this change mirrors it for premium TGs.
  • fix_incorrect_is_shared_flags() (line ~4558) is already implemented with @with_transaction safety.

Files changed

  • infrastructure/terraform/premium_manager.tf -- Add ModifyTargetGroupAttributes IAM permission to the Lambda's ELB policy; existing resource scoping (targetgroup/premium-*) already covers the target groups
  • infrastructure/terraform/premium_manager_package/premium_manager.py -- Add STICKY_SESSION_DURATION_SECONDS constant; add typed _enable_sticky_sessions() helper; call it after TG creation in create_or_get_target_group(), in the DuplicateTargetGroupName handler, and in assign_premium_user() inline creation; call fix_incorrect_is_shared_flags() before process_shared_instance_optimization() in step 10 of handle_scheduled_monitoring()

Manual Testcases

  • Assign a premium user — verify the new TG has stickiness enabled (aws elbv2 describe-target-group-attributes --target-group-arn <arn>)
  • Trigger assignment for a user whose TG already exists (DuplicateTargetGroupName path) — verify stickiness is applied to the pre-existing TG
  • Simulate modify_target_group_attributes failure (temporarily remove IAM permission) — verify TG creation still succeeds and a WARNING is logged
  • Manually set is_shared=1 for a user who is alone on their instance — trigger scheduled monitoring and verify the flag is corrected to 0
  • Verify process_shared_instance_optimization() still runs after fix_incorrect_is_shared_flags() succeeds or fails
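
The first two stickiness checks could also be scripted instead of read off the CLI output. The helper below is a hypothetical convenience, not part of the PR; it assumes the standard response shape of the ELBv2 `describe_target_group_attributes` call, a list of `Key`/`Value` string pairs under `Attributes`.

```python
from typing import Any


def stickiness_enabled(
    elbv2_client: Any, target_group_arn: str, expected_duration: int = 300
) -> bool:
    """Return True if the TG has lb_cookie stickiness with the expected duration."""
    response = elbv2_client.describe_target_group_attributes(
        TargetGroupArn=target_group_arn
    )
    # Attribute values come back as strings, e.g. "true" and "300".
    attrs = {item["Key"]: item["Value"] for item in response["Attributes"]}
    return (
        attrs.get("stickiness.enabled") == "true"
        and attrs.get("stickiness.type") == "lb_cookie"
        and attrs.get("stickiness.lb_cookie.duration_seconds")
        == str(expected_duration)
    )
```

With boto3 available, this would be called as `stickiness_enabled(boto3.client("elbv2"), "<arn>")` after triggering an assignment.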

Unit, Integration, Contract Test Coverage

  • There is no existing test suite for the premium_manager.py Lambda code, so this change neither adds nor breaks any tests.

Others

Difficulties

  • None.

Risk Assessment

| Area | Risk | Notes |
| --- | --- | --- |
| Sticky sessions | Low | Additive-only. Single-target TGs (most premium users) are unaffected. Multi-target TGs get consistent routing. Helper failure is swallowed with a warning, so there is no degradation from current behavior. |
| IAM permission | Low | ModifyTargetGroupAttributes is scoped to existing targetgroup/premium-* and targetgroup/subscr-* resource ARNs. Must be applied via terraform apply before deploying the Lambda code. |
| is_shared cleanup | Low | Reuses existing fix_incorrect_is_shared_flags() with @with_transaction. Isolated try/except, so failure doesn't break step 10. Idempotent. |
| Deployment order | Medium | Terraform IAM change must be applied before the Lambda deployment, otherwise modify_target_group_attributes calls will fail with AccessDenied. The helper's try/except makes this a soft failure (logged warning), but stickiness won't take effect until IAM is in place. |

tsuchiyama-araya and others added 30 commits February 10, 2026 15:47
Run enabled before S3 upload finished
milesAraya and others added 27 commits March 2, 2026 13:25
… appropriate one (RemoteStorageSimpleReader -> RemoteStorageReader)
…mUser && and add endpoint to send to cloudwatch log
…riments

Publish experiments on Records page regardless of lock
Remove outdated docstrings with references to development cases
Fix storage quota after upgrade premium
- Added dedicated reproduce api for private dataview
- Adjust the frontend to call the reproduce api of the private dataview
- Generalized find_dataview_record.DataviewService
- Add `Depends(is_workspace_available)` to dataview reproduce API
- Change the return type of _ensure_experiment_downloaded to the appropriate one
…d skip showing the assigning popup if the async call resolved quickly
Update frontend package.json to v1.0.0
milesAraya requested a review from itutu-tienday on March 4, 2026 01:27
milesAraya closed this on Mar 4, 2026
milesAraya deleted the fix/premium-assignment-manager branch on March 5, 2026 02:15

3 participants