feat(seer): Add lightweight supergroups backfill task #112507
Conversation
Add RCASource enum and rca_source field to supergroup query requests so Seer knows which embedding space to query. The source is determined by the organizations:supergroups-lightweight-rca-clustering feature flag. Replace the supergroups.lightweight-enabled-orgs sentry-option with the feature flag for both the write path (post_process task dispatch) and read path (supergroup query endpoints), consistent with how all other supergroup features are gated.
Add an org-scoped Celery task that iterates all error groups in an organization (seen in last 90 days) and sends each to Seer's lightweight RCA clustering endpoint for supergroup backfilling. The task processes groups in batches of 50 with cursor-based pagination and self-chains until all groups are processed. Designed to be triggered from a getsentry job. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
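The batch-and-self-chain pattern described above can be sketched in plain Python. Everything here is an illustrative stand-in (fetch_batch emulates the DB query, appending to a list stands in for the Seer HTTP call, and the tail recursion stands in for the task re-enqueueing itself); it is not the real Sentry task.

```python
# Sketch of cursor-based batching with self-chaining. All names here
# (BATCH_SIZE, fetch_batch, backfill_for_org) are illustrative stand-ins.
BATCH_SIZE = 50


def fetch_batch(group_ids, cursor):
    # Emulates a DB query: groups with id > cursor, ordered by id, limited.
    return sorted(g for g in group_ids if g > cursor)[:BATCH_SIZE]


def backfill_for_org(group_ids, cursor=0, sent=None):
    sent = sent if sent is not None else []
    batch = fetch_batch(group_ids, cursor)
    if not batch:
        return sent  # all groups processed; stop chaining
    for group_id in batch:
        sent.append(group_id)  # stand-in for the per-group Seer request
    # The real task would re-enqueue itself (apply_async) with the new
    # cursor; here we just recurse to show the chaining shape.
    return backfill_for_org(group_ids, cursor=batch[-1], sent=sent)
```

Because the cursor is the last id of the batch just fetched, each chained invocation picks up exactly where the previous one stopped.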
Add backfill_supergroups_lightweight to TASKWORKER_IMPORTS so the task is discovered in production. Fix mypy errors by asserting event.group is not None in tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ight-rca-backfill

# Conflicts:
#	src/sentry/features/temporary.py
#	src/sentry/options/defaults.py
#	src/sentry/seer/signed_seer_api.py
#	src/sentry/seer/supergroups/endpoints/organization_supergroup_details.py
#	src/sentry/seer/supergroups/endpoints/organization_supergroups_by_group.py
#	src/sentry/tasks/post_process.py
#	tests/sentry/seer/supergroups/endpoints/test_organization_supergroup_details.py
#	tests/sentry/seer/supergroups/endpoints/test_organization_supergroups_by_group.py
#	tests/sentry/tasks/test_post_process.py
Replace per-group get_latest_event() calls with batched Snuba queries via bulk_snuba_queries for the event fetching phase. Uses a tight timestamp window around each group's last_seen. Also reduces inter-batch delay to 1s, rewrites cursor resumption test to verify only post-cursor groups are processed, and adds exact batch boundary edge case test. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace hardcoded substatus list with the canonical UNRESOLVED_SUBSTATUS_CHOICES constant from sentry.types.group. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
cvxluo left a comment:
looks generally good. i think we'll find some problems when we actually run the script, but we can resolve those as we go. my primary concerns are that we'll send requests to seer too fast, that we'll get rate limited by snuba, and that we can't make this idempotent
```python
for group in groups:
    # Use a tight window around the group's last_seen to minimize scan range,
    # falling back to the full backfill window if last_seen is unavailable
    group_start = group.last_seen - timedelta(hours=1) if group.last_seen else timestamp_start
```
do we need this fallback? seems like the original backfill job did not do this
yea agreed, didn't notice this got added, probably because of some test case, will remove
```python
def _batch_fetch_events(groups: Sequence[Group], organization_id: int) -> list[tuple[Group, dict]]:
```
I think it's going to be pretty slow to make a query per group here. And you're probably also likely to start hitting snuba ratelimits.
Do you actually need the latest event for each group, or just any event? You could group by group_id, max(event_id) to just get some event id. I don't think snuba supports window queries or anything unfortunately
I think what I do here is the way they did it in V1, though things might have changed for sure. Right now I think I am okay with naively taking any event; that might change though.
Interesting suggestion about grouping with max, I'll try it and see if it's fast
Yeah, I think that at least this way you can send a batch of 100/1000/whatever groups in the same project and just get a result back. You could still batch this into multiple queries as needed, but I think it'll be much faster if you can do on average 1 query per org (probably most orgs have less than 1k groups)
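The group-by-max idea from this thread can be sketched in plain Python over result rows, standing in for what would be a single Snuba-side aggregation (GROUP BY group_id with max(event_id)) instead of one query per group. The function name and row shape are illustrative assumptions.

```python
def latest_event_per_group(rows):
    """Collapse event rows to one event id per group, emulating a single
    aggregation query (group by group_id, take max(event_id)) rather than
    issuing one Snuba query per group. `rows` is an iterable of
    (group_id, event_id) pairs; names are illustrative stand-ins."""
    best = {}
    for group_id, event_id in rows:
        # Keep the max event_id seen for each group -- "some event" is
        # enough here; it need not be the true latest by timestamp.
        if group_id not in best or event_id > best[group_id]:
            best[group_id] = event_id
    return best
```

With this shape, one query per org (or per project) returns a usable event id for every group in the batch at once.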
```python
success_count = 0
viewer_context = SeerViewerContext(organization_id=organization_id)

for group, serialized_event in group_event_pairs:
```
Should we add a threadpool here so that we can parallelize requests?
yea, v1 grouping had it; was again me trying to keep it simple, but maybe I'll just add it
I think it'll just be horribly slow without this. Ideally the api would just accept multiple groups but if that's not worth the effort then at least using a threadpool speeds things up somewhat
I realized that actually on the Seer side we just queue a task and return, so it's going to be fast. We could probably add a way to batch-send to reduce the overhead of all the requests, but it's not like we are going to be waiting a ton of time on these, so I don't think it's that important right now.
As I mentioned to Mark, I will probably need to optimize more before I run this for all orgs; this is just a task to be able to do it for Sentry and perhaps some more orgs to test it out, trying to not overcomplicate.
The problem isn't the speed of the api on the other side, it's that you're waiting for IO on this side to get anything done. The task on the other side could complete in 0 seconds and it'd still result in this being much slower. This isn't blocking though so I can approve
yea I understand, I am actually fine with this being slow, I am more worried about being too fast for Seer
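The bounded-threadpool approach discussed in this thread can be sketched as follows. `send_fn` stands in for the per-group Seer HTTP call and `max_workers` caps concurrency (so Seer isn't flooded); this is a sketch of the suggestion, not the merged implementation.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def send_all(group_event_pairs, send_fn, max_workers=8):
    """Parallelize per-group Seer requests with a bounded thread pool.
    `send_fn(group, event)` stands in for the HTTP call and returns True on
    success; `max_workers` limits in-flight requests. Illustrative sketch."""
    successes = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(send_fn, group, event): group
            for group, event in group_event_pairs
        }
        for future in as_completed(futures):
            if future.result():
                successes += 1
    return successes
```

Because the calls are IO-bound (waiting on Seer to ack the queued task), even a small pool removes most of the serial waiting the reviewer is pointing at.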
…reshold

Refactor to process one project at a time using the (project, status, substatus, last_seen, id) composite index for efficient cursor pagination at any scale. Add MAX_FAILURES_PER_BATCH=20 to stop processing if Seer is consistently failing. Filter by status=UNRESOLVED. Remove dead timestamp fallback. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
```python
@instrumented_task(
    name="sentry.tasks.seer.backfill_supergroups_lightweight.backfill_supergroups_lightweight_for_org",
    namespace=seer_tasks,
    processing_deadline_duration=15 * 60,
)
def backfill_supergroups_lightweight_for_org(
```
How will this task be scheduled/spawned? Will we be able to spawn them incrementally over time so that we don't generate a big backlog all at once that consumes all the worker capacity preventing other tasks from running?
I currently plan to add a custom run job on getsentry to trigger this manually per org; in the future, when we plan to backfill everything, we will just have a loop over orgs. I don't plan to run this on multiple orgs in parallel for now.
When this was done for AI grouping v1, I believe we actually did it all project by project and basically rate limited it, so it took a ton of time (months), but keeping the rate low meant we didn't overload worker capacity / bombard Seer.
Right now this task doesn't use any parallelism and just spawns the next batch after a batch is done, so I don't think it's capable of consuming all worker capacity for a single org. @wedamija commented on adding a threadpool; I am considering it so this task wouldn't be dead slow, but as you mention I will indeed need to make sure we don't spawn too many threads, for both Sentry's and Seer's sake.
This task is meant to be a tool to get a few orgs/projects backfilled and be able to POC this lightweight implementation. We will need to do more tweaking to be efficient when running it for everything.
```python
    project_id=project.id,
    type=DEFAULT_TYPE_ID,
    id__gt=last_group_id,
    last_seen__gte=cutoff,
```
fwiw I still think it's safer to remove this - you could also just filter out any groups outside this range on the python side, since they'll be a rare case.
yea sure, I don't feel strongly about it. The Snuba query will filter out anything it doesn't find events for anyway, and it's not like I really mind catching something old; as long as we retained it in Snuba, I don't really have to filter this
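The reviewer's alternative of filtering on the Python side rather than in the DB query can be sketched as below; the function name, the dict-of-timestamps input shape, and the 90-day default are illustrative assumptions (the 90-day window comes from the PR description).

```python
from datetime import datetime, timedelta, timezone


def filter_recent(last_seen_by_group, days=90, now=None):
    """Python-side replacement for a `last_seen__gte` DB filter: drop
    groups whose last_seen falls outside the backfill window. Input is a
    {group_id: last_seen_datetime} mapping; names are stand-ins."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    return {gid: seen for gid, seen in last_seen_by_group.items() if seen >= cutoff}
```

Since out-of-window groups are a rare case, doing this filter in Python keeps the DB query on the fast composite-index path without changing the result set meaningfully.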
```python
# Fetch full events from nodestore and serialize
group_event_pairs: list[tuple[Group, dict]] = []
for group, result in zip(groups, results):
```
I think there's some bulk fetching stuff we can use here to make this a little faster
- Track last_processed_group_id so early break on max failures doesn't skip unprocessed groups
- Stop self-chaining when max failures is reached to avoid hammering Seer when it's down
- Add project_id and last_processed_group_id to max failures log for easier resume
- Skip groups with failed event serialization instead of sending None
- Remove last_seen cutoff filter; old groups are naturally skipped when their events are gone from Snuba/nodestore
Use bind_nodes() for a single nodestore multi-get instead of 50 sequential get_event_by_id calls. Bulk serialize all events in one serialize() call to batch get_attrs(). Cuts the event fetch phase from ~3-5s to ~500ms per batch.
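The single-multi-get-instead-of-N-gets pattern described in that commit can be sketched generically; `multi_get` stands in for the nodestore's bulk fetch, and the function name and return shape are illustrative, not Sentry's actual bind_nodes() API.

```python
def bulk_fetch(node_ids, multi_get):
    """One nodestore round trip instead of N sequential gets. `multi_get`
    stands in for a bulk fetch returning {node_id: payload}, with missing
    ids simply absent. Preserves input order and skips ids the store no
    longer has. Illustrative sketch only."""
    found = multi_get(node_ids)  # single round trip for the whole batch
    return [(nid, found[nid]) for nid in node_ids if nid in found]
```

Replacing 50 sequential round trips with one is where the commit's ~3-5s to ~500ms per-batch improvement comes from: latency is paid once instead of 50 times.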
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit d45bf7e.
…ight-rca-backfill

# Conflicts:
#	src/sentry/conf/server.py
last_processed_group_id only tracks groups with Snuba events, so eventless groups at the end of a batch would never be skipped, causing an infinite re-fetch loop. Since we now return early on max failures (no self-chain), groups[-1].id is safe for the cursor.
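The infinite-loop fix described above, advancing the cursor past every fetched group rather than only those that still had events, can be simulated in a few lines; all names here are stand-ins for the real task's logic.

```python
def backfill(group_ids, has_event, batch_size=3):
    """Simulates the fixed cursor logic: the cursor advances to the last
    *fetched* group id each batch (groups[-1].id in the real task), so
    trailing eventless groups can never cause an infinite re-fetch loop.
    `has_event` stands in for the Snuba lookup; illustrative stand-ins."""
    cursor, processed = 0, []
    while True:
        batch = [g for g in group_ids if g > cursor][:batch_size]
        if not batch:
            return processed  # pagination exhausted; loop terminates
        processed.extend(g for g in batch if has_event(g))
        cursor = batch[-1]  # advance past eventless groups too
```

Had the cursor tracked only the last group with an event, a batch ending in eventless groups would re-fetch the same ids forever.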
Summary

Add an org-scoped Celery task (backfill_supergroups_lightweight_for_org) that iterates all error groups in an organization and sends each to Seer's lightweight RCA clustering endpoint for supergroup backfilling, paginating with a (project_id, group_id) cursor and self-chaining until complete.

Test plan