feat(seer): Add lightweight supergroups backfill task #112507
Conversation
Add RCASource enum and rca_source field to supergroup query requests so Seer knows which embedding space to query. The source is determined by the organizations:supergroups-lightweight-rca-clustering feature flag. Replace the supergroups.lightweight-enabled-orgs sentry-option with the feature flag for both the write path (post_process task dispatch) and read path (supergroup query endpoints), consistent with how all other supergroup features are gated.
Add an org-scoped Celery task that iterates all error groups in an organization (seen in last 90 days) and sends each to Seer's lightweight RCA clustering endpoint for supergroup backfilling. The task processes groups in batches of 50 with cursor-based pagination and self-chains until all groups are processed. Designed to be triggered from a getsentry job. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
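The batch-and-self-chain pattern described above can be sketched in plain Python. Everything here is an illustrative stand-in (fetch_batch emulates the DB query, appending to a list stands in for the Seer HTTP call, and the tail recursion stands in for the task re-enqueueing itself); it is not the real Sentry task.

```python
# Sketch of cursor-based batching with self-chaining. All names here
# (BATCH_SIZE, fetch_batch, backfill_for_org) are illustrative stand-ins.
BATCH_SIZE = 50


def fetch_batch(group_ids, cursor):
    # Emulates a DB query: groups with id > cursor, ordered by id, limited.
    return sorted(g for g in group_ids if g > cursor)[:BATCH_SIZE]


def backfill_for_org(group_ids, cursor=0, sent=None):
    sent = sent if sent is not None else []
    batch = fetch_batch(group_ids, cursor)
    if not batch:
        return sent  # all groups processed; stop chaining
    for group_id in batch:
        sent.append(group_id)  # stand-in for the per-group Seer request
    # The real task would re-enqueue itself (apply_async) with the new
    # cursor; here we just recurse to show the chaining shape.
    return backfill_for_org(group_ids, cursor=batch[-1], sent=sent)
```

Because the cursor is the last id of the batch just fetched, each chained invocation picks up exactly where the previous one stopped.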
Add backfill_supergroups_lightweight to TASKWORKER_IMPORTS so the task is discovered in production. Fix mypy errors by asserting event.group is not None in tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ight-rca-backfill

# Conflicts:
#	src/sentry/features/temporary.py
#	src/sentry/options/defaults.py
#	src/sentry/seer/signed_seer_api.py
#	src/sentry/seer/supergroups/endpoints/organization_supergroup_details.py
#	src/sentry/seer/supergroups/endpoints/organization_supergroups_by_group.py
#	src/sentry/tasks/post_process.py
#	tests/sentry/seer/supergroups/endpoints/test_organization_supergroup_details.py
#	tests/sentry/seer/supergroups/endpoints/test_organization_supergroups_by_group.py
#	tests/sentry/tasks/test_post_process.py
Replace per-group get_latest_event() calls with batched Snuba queries via bulk_snuba_queries for the event fetching phase. Uses a tight timestamp window around each group's last_seen. Also reduces inter-batch delay to 1s, rewrites cursor resumption test to verify only post-cursor groups are processed, and adds exact batch boundary edge case test. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace hardcoded substatus list with the canonical UNRESOLVED_SUBSTATUS_CHOICES constant from sentry.types.group. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
cvxluo left a comment:
looks generally good. i think we'll find some problems when we actually run the script, but we can resolve those as we go. my primary concerns are that we'll send requests to seer too fast, that we'll get rate limited by snuba, and that we can't make this idempotent
```python
for group in groups:
    # Use a tight window around the group's last_seen to minimize scan range,
    # falling back to the full backfill window if last_seen is unavailable
    group_start = group.last_seen - timedelta(hours=1) if group.last_seen else timestamp_start
```
do we need this fallback? seems like the original backfill job did not do this
yea agreed, didn't notice this got added, probably because of some test case, will remove
```python
def _batch_fetch_events(groups: Sequence[Group], organization_id: int) -> list[tuple[Group, dict]]:
```
I think it's going to be pretty slow to make a query per group here. And you're probably also likely to start hitting snuba ratelimits.
Do you actually need the latest event for each group, or just any event? You could group by group_id, max(event_id) to just get some event id. I don't think snuba supports window queries or anything unfortunately
I think what I do here is the way they did it in V1, though things might have changed for sure. Right now I think I am okay with naively taking any event; that might change though.
Interesting suggestion about grouping with max, I'll try it and see if it's fast
Yeah, I think that at least this way you can send a batch of 100/1000/whatever groups in the same project and just get a result back. You could still batch this into multiple queries as needed, but I think it'll be much faster if you can do on average 1 query per org (probably most orgs have less than 1k groups)
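The group-by-max idea from this thread can be sketched in plain Python over result rows, standing in for what would be a single Snuba-side aggregation (GROUP BY group_id with max(event_id)) instead of one query per group. The function name and row shape are illustrative assumptions.

```python
def latest_event_per_group(rows):
    """Collapse event rows to one event id per group, emulating a single
    aggregation query (group by group_id, take max(event_id)) rather than
    issuing one Snuba query per group. `rows` is an iterable of
    (group_id, event_id) pairs; names are illustrative stand-ins."""
    best = {}
    for group_id, event_id in rows:
        # Keep the max event_id seen for each group -- "some event" is
        # enough here; it need not be the true latest by timestamp.
        if group_id not in best or event_id > best[group_id]:
            best[group_id] = event_id
    return best
```

With this shape, one query per org (or per project) returns a usable event id for every group in the batch at once.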
```python
success_count = 0
viewer_context = SeerViewerContext(organization_id=organization_id)

for group, serialized_event in group_event_pairs:
```
Should we add a threadpool here so that we can parallelize requests?
yea, v1 grouping had it; was again me trying to keep it simple, but maybe I'll just add it
I think it'll just be horribly slow without this. Ideally the api would just accept multiple groups but if that's not worth the effort then at least using a threadpool speeds things up somewhat
I realized that actually on the Seer side we just queue a task and return, so it's going to be fast. We could probably add a way to batch-send to reduce the overhead of all the requests, but it's not like we are going to be waiting a ton of time on these, so I don't think it's that important right now.
As I mentioned to Mark, I will probably need to optimize more before I run this for all orgs; this is just a task to be able to do it for Sentry and perhaps some more orgs to test it out, trying to not overcomplicate.
The problem isn't the speed of the api on the other side, it's that you're waiting for IO on this side to get anything done. The task on the other side could complete in 0 seconds and it'd still result in this being much slower. This isn't blocking though so I can approve
yea I understand, I am actually fine with this being slow, I am more worried about being too fast for Seer
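The bounded-threadpool approach discussed in this thread can be sketched as follows. `send_fn` stands in for the per-group Seer HTTP call and `max_workers` caps concurrency (so Seer isn't flooded); this is a sketch of the suggestion, not the merged implementation.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def send_all(group_event_pairs, send_fn, max_workers=8):
    """Parallelize per-group Seer requests with a bounded thread pool.
    `send_fn(group, event)` stands in for the HTTP call and returns True on
    success; `max_workers` limits in-flight requests. Illustrative sketch."""
    successes = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(send_fn, group, event): group
            for group, event in group_event_pairs
        }
        for future in as_completed(futures):
            if future.result():
                successes += 1
    return successes
```

Because the calls are IO-bound (waiting on Seer to ack the queued task), even a small pool removes most of the serial waiting the reviewer is pointing at.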
…reshold

Refactor to process one project at a time using the (project, status, substatus, last_seen, id) composite index for efficient cursor pagination at any scale. Add MAX_FAILURES_PER_BATCH=20 to stop processing if Seer is consistently failing. Filter by status=UNRESOLVED. Remove dead timestamp fallback. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
```python
@instrumented_task(
    name="sentry.tasks.seer.backfill_supergroups_lightweight.backfill_supergroups_lightweight_for_org",
    namespace=seer_tasks,
    processing_deadline_duration=15 * 60,
)
def backfill_supergroups_lightweight_for_org(
```
How will this task be scheduled/spawned? Will we be able to spawn them incrementally over time so that we don't generate a big backlog all at once that consumes all the worker capacity preventing other tasks from running?
I currently plan to add a custom run job on getsentry to trigger this manually per org; in the future, when we plan to backfill everything, we will just have a loop over orgs. I don't plan to run this on multiple orgs in parallel for now.
When this was done for AI grouping v1, I believe we actually did it all project by project and basically rate limited it, so it took a ton of time (months), but keeping the rate low meant we didn't overload worker capacity / bombard Seer.
Right now this task doesn't use any parallelism and just spawns the next batch after a batch is done, so I don't think it's capable of consuming all worker capacity for a single org. @wedamija commented on adding a threadpool; I am considering it so this task wouldn't be dead slow, but as you mention I will indeed need to make sure we don't spawn too many threads, for both Sentry's and Seer's sake.
This task is meant to be a tool to get a few orgs/projects backfilled and be able to POC this lightweight implementation. We will need to do more tweaking to be efficient when running it for everything.
```python
    project_id=project.id,
    type=DEFAULT_TYPE_ID,
    id__gt=last_group_id,
    last_seen__gte=cutoff,
```
fwiw I still think it's safer to remove this - you could also just filter out any groups outside this range on the python side, since they'll be a rare case.
yea sure, I don't feel strongly about it. The Snuba query will filter out anything it doesn't find events for anyway, and it's not like I really mind catching something old; as long as we retained it in Snuba, I don't really have to filter this
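The reviewer's alternative of filtering on the Python side rather than in the DB query can be sketched as below; the function name, the dict-of-timestamps input shape, and the 90-day default are illustrative assumptions (the 90-day window comes from the PR description).

```python
from datetime import datetime, timedelta, timezone


def filter_recent(last_seen_by_group, days=90, now=None):
    """Python-side replacement for a `last_seen__gte` DB filter: drop
    groups whose last_seen falls outside the backfill window. Input is a
    {group_id: last_seen_datetime} mapping; names are stand-ins."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    return {gid: seen for gid, seen in last_seen_by_group.items() if seen >= cutoff}
```

Since out-of-window groups are a rare case, doing this filter in Python keeps the DB query on the fast composite-index path without changing the result set meaningfully.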
```python
# Fetch full events from nodestore and serialize
group_event_pairs: list[tuple[Group, dict]] = []
for group, result in zip(groups, results):
```
I think there's some bulk fetching stuff we can use here to make this a little faster
- Track last_processed_group_id so early break on max failures doesn't skip unprocessed groups
- Stop self-chaining when max failures is reached to avoid hammering Seer when it's down
- Add project_id and last_processed_group_id to max failures log for easier resume
- Skip groups with failed event serialization instead of sending None
- Remove last_seen cutoff filter; old groups are naturally skipped when their events are gone from Snuba/nodestore
Use bind_nodes() for a single nodestore multi-get instead of 50 sequential get_event_by_id calls. Bulk serialize all events in one serialize() call to batch get_attrs(). Cuts the event fetch phase from ~3-5s to ~500ms per batch.
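The single-multi-get-instead-of-N-gets pattern described in that commit can be sketched generically; `multi_get` stands in for the nodestore's bulk fetch, and the function name and return shape are illustrative, not Sentry's actual bind_nodes() API.

```python
def bulk_fetch(node_ids, multi_get):
    """One nodestore round trip instead of N sequential gets. `multi_get`
    stands in for a bulk fetch returning {node_id: payload}, with missing
    ids simply absent. Preserves input order and skips ids the store no
    longer has. Illustrative sketch only."""
    found = multi_get(node_ids)  # single round trip for the whole batch
    return [(nid, found[nid]) for nid in node_ids if nid in found]
```

Replacing 50 sequential round trips with one is where the commit's ~3-5s to ~500ms per-batch improvement comes from: latency is paid once instead of 50 times.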
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit d45bf7e.
…ight-rca-backfill

# Conflicts:
#	src/sentry/conf/server.py
last_processed_group_id only tracks groups with Snuba events, so eventless groups at the end of a batch would never be skipped, causing an infinite re-fetch loop. Since we now return early on max failures (no self-chain), groups[-1].id is safe for the cursor.
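The infinite-loop fix described above, advancing the cursor past every fetched group rather than only those that still had events, can be simulated in a few lines; all names here are stand-ins for the real task's logic.

```python
def backfill(group_ids, has_event, batch_size=3):
    """Simulates the fixed cursor logic: the cursor advances to the last
    *fetched* group id each batch (groups[-1].id in the real task), so
    trailing eventless groups can never cause an infinite re-fetch loop.
    `has_event` stands in for the Snuba lookup; illustrative stand-ins."""
    cursor, processed = 0, []
    while True:
        batch = [g for g in group_ids if g > cursor][:batch_size]
        if not batch:
            return processed  # pagination exhausted; loop terminates
        processed.extend(g for g in batch if has_event(g))
        cursor = batch[-1]  # advance past eventless groups too
```

Had the cursor tracked only the last group with an event, a batch ending in eventless groups would re-fetch the same ids forever.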
Summary

Add an org-scoped Celery task (backfill_supergroups_lightweight_for_org) that iterates all error groups in an organization and sends each to Seer's lightweight RCA clustering endpoint for supergroup backfilling, paginating with a (project_id, group_id) cursor and self-chaining until complete.

Test plan