Support for clustering detections #818

mohamedelabbas1996 · 2025-04-29T16:02:41Z

Summary

This PR introduces clustering support for detections in a source image collection, based on precomputed feature vectors (features_2048) stored in Classification records. The clustering helps organize detections into potential taxonomic groups, especially for unknown species.

List of Changes

Integrated agglomerative clustering logic using the features_2048 field stored on the Classification model.
Added a new job type (DetectionClusteringJob) to cluster detections in a source image collection.
Added logic to:
    Select a consistent feature_extraction_algorithm across detections.
    Allow the user to explicitly pass an algorithm, with fallback to the most used one.
    Create new Taxon entries per cluster and automatically assign them to Occurrence determinations.
Added an admin action on source image collections to trigger clustering.
Added a cluster_detections action to the SourceImageCollectionViewSet to trigger clustering via the API.
Added an unknown_species boolean field to the Taxon model to flag automatically generated taxa.

Related Issues

#774

Detailed Description

This PR adds clustering support for detections within a source image collection using existing visual feature vectors. The goal is to help researchers group similar-looking insect detections together, especially in collections where many specimens may not match known species. This is particularly useful in the context of the Panama trip, where a large number of insect images were collected with limited labeling. By clustering detections based on visual similarity, we can automatically group potential unknown species and assign them temporary taxa, making it easier for experts to review and identify patterns. The clustering job can be triggered from the admin interface or API.

The clustering works as follows:
First, Detection objects are filtered by the selected collection. Only detections that have at least one Classification containing a feature vector (features_2048) are considered. If the user provides a feature_extraction_algorithm, only features generated by that algorithm are used. If not, we select the most commonly used algorithm in the collection. Additionally, detections are filtered by an out-of-distribution (OOD) threshold: only those with an associated occurrence whose determination_ood_score exceeds the specified threshold are included.
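The selection step described above can be sketched in plain Python. This is only an illustration, not the PR's Django queryset code: `DetectionRecord`, `pick_algorithm`, and `select_detections` are hypothetical names, and the requested key is validated against the keys seen in the collection rather than against the Algorithm table.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class DetectionRecord:
    """Hypothetical in-memory stand-in for a detection and its related rows."""
    detection_id: int
    algorithm_key: Optional[str]  # algorithm that produced the feature vector, if any
    has_features: bool            # True if a classification carries features_2048
    ood_score: Optional[float]    # determination_ood_score of the occurrence, if any


def pick_algorithm(records, requested=None):
    """Return the requested algorithm key, or the most used one as a fallback."""
    available = Counter(
        r.algorithm_key for r in records if r.has_features and r.algorithm_key
    )
    if requested is not None:
        if requested not in available:
            raise ValueError(f"Invalid feature extraction algorithm key: {requested}")
        return requested
    if not available:
        raise ValueError("No feature vectors found in this collection")
    return available.most_common(1)[0][0]


def select_detections(records, ood_threshold=0.0, requested=None):
    """Keep detections with features from one algorithm and an OOD score above the threshold."""
    key = pick_algorithm(records, requested)
    return [
        r for r in records
        if r.has_features
        and r.algorithm_key == key
        and r.ood_score is not None
        and r.ood_score > ood_threshold
    ]
```

Using a single algorithm key for the whole run keeps the feature vectors comparable, which is why the fallback picks one key instead of mixing extractors.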

Once the set of valid detections is prepared, we apply PCA dimensionality reduction to the feature vectors (default: 384 components). The reduced vectors are then passed to the selected clustering algorithm (currently, agglomerative clustering is supported). The algorithm groups similar detections into clusters based on the provided clustering parameters (e.g., distance_threshold).
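To make the role of distance_threshold concrete, here is a naive, stdlib-only sketch of agglomerative clustering. It is not the PR's implementation: it skips the PCA step, uses average linkage with Euclidean distance as an assumed choice, and the function name `average_linkage_clusters` is hypothetical. A library such as scikit-learn offers both steps (`PCA` and `AgglomerativeClustering` with `n_clusters=None` and a `distance_threshold`) and would be the practical choice for 2048-dimensional vectors.

```python
import math


def average_linkage_clusters(vectors, distance_threshold):
    """Merge the two closest clusters until no pair is closer than the threshold.

    Returns one integer label per input vector.
    """
    clusters = [[i] for i in range(len(vectors))]

    def cluster_distance(a, b):
        # Average Euclidean distance over all cross-cluster pairs.
        total = sum(math.dist(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] >= distance_threshold:
            break  # every remaining pair is at least the threshold apart
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels
```

A lower threshold yields more, tighter clusters; a higher one merges more detections into each candidate group.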

After clustering:

A new Taxon is created for each cluster, marked with the unknown_species=True flag.
A new Classification is created for each detection, pointing to the corresponding cluster taxon with score=1.0. This score ensures that the associated Occurrence will automatically update its determination to use the new classification.
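The post-clustering bookkeeping can be illustrated with plain dataclasses. The `Taxon` and `Classification` classes below are simplified stand-ins for the Django models, and `assign_cluster_taxa`, the "Cluster N" naming scheme, and `best_classification` are hypothetical; the last one assumes that a determination follows the highest-scoring classification, which the description implies but does not spell out.

```python
from dataclasses import dataclass


@dataclass
class Taxon:
    name: str
    unknown_species: bool = False


@dataclass
class Classification:
    detection_id: int
    taxon: Taxon
    score: float


def assign_cluster_taxa(detection_ids, labels, name_prefix="Cluster"):
    """Create one flagged taxon per cluster label and one classification per detection."""
    taxa = {}
    classifications = []
    for detection_id, label in zip(detection_ids, labels):
        if label not in taxa:
            taxa[label] = Taxon(name=f"{name_prefix} {label}", unknown_species=True)
        classifications.append(Classification(detection_id, taxa[label], score=1.0))
    return taxa, classifications


def best_classification(classifications):
    """Assumed determination rule: the classification with the highest score wins."""
    return max(classifications, key=lambda c: c.score)
```

Under that assumption, a score of 1.0 lets the cluster taxon outrank any earlier, lower-scoring prediction for the same detection.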

The cluster_detections action accepts several parameters:

ood_threshold (float, default: 0.0): Filters out detections that have already been confidently classified. Only detections whose occurrence has an out-of-distribution (OOD) score above this threshold are included in the clustering.

feature_extraction_algorithm (string, optional): Specifies the name of the feature extraction algorithm to use. If not provided, the system will automatically select the most commonly used algorithm within the collection.

algorithm (string, default: "agglomerative"): The clustering algorithm to use. Currently only "agglomerative" is supported.
algorithm_kwargs (dict, default: {"distance_threshold": 0.5}): Extra configuration for the clustering algorithm, such as the distance threshold for agglomerative clustering.
pca (dict, default: {"n_components": 384}): Dimensionality reduction settings applied before clustering.
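The defaults listed above can be captured in a small stdlib sketch. The PR itself validates these parameters with a Django REST Framework serializer (ClusterDetectionsSerializer); the `ClusterParams` dataclass and `parse_params` helper here are hypothetical stand-ins, and rejecting unknown keys is a choice of this sketch rather than documented behaviour.

```python
from dataclasses import dataclass, field, fields
from typing import Optional

SUPPORTED_ALGORITHMS = {"agglomerative"}


@dataclass
class ClusterParams:
    ood_threshold: float = 0.0
    feature_extraction_algorithm: Optional[str] = None
    algorithm: str = "agglomerative"
    algorithm_kwargs: dict = field(default_factory=lambda: {"distance_threshold": 0.5})
    pca: dict = field(default_factory=lambda: {"n_components": 384})

    def __post_init__(self):
        if self.algorithm not in SUPPORTED_ALGORITHMS:
            raise ValueError(f"Unsupported clustering algorithm: {self.algorithm}")


def parse_params(payload):
    """Build ClusterParams from a request body, filling in the documented defaults."""
    allowed = {f.name for f in fields(ClusterParams)}
    unknown = set(payload) - allowed
    if unknown:
        raise ValueError(f"Unknown parameters: {sorted(unknown)}")
    return ClusterParams(**payload)
```

Calling `parse_params({})` yields the documented defaults, so a request with an empty body would cluster with distance_threshold 0.5 and 384 PCA components.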

Sample request:
POST /api/collections/42/cluster_detections/
Content-Type: application/json

{
    "ood_threshold": 0.4,
    "feature_extraction_algorithm": "resnet18",
    "algorithm": "agglomerative",
    "algorithm_kwargs": {
        "distance_threshold": 0.45
    },
    "pca": {
        "n_components": 256
    }
}
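For scripting, the same request could be built with Python's standard library. This is a sketch under assumptions: `build_cluster_request` is a hypothetical helper, the base URL is a placeholder for a local instance, and authentication is omitted because the description does not say which scheme the API expects.

```python
import json
from urllib.request import Request


def build_cluster_request(base_url, collection_id, params):
    """Prepare (but do not send) a POST to the cluster_detections action."""
    url = f"{base_url.rstrip('/')}/api/collections/{collection_id}/cluster_detections/"
    body = json.dumps(params).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_cluster_request(
    "http://localhost:8000",  # placeholder host, adjust to your deployment
    42,
    {
        "ood_threshold": 0.4,
        "feature_extraction_algorithm": "resnet18",
        "algorithm": "agglomerative",
        "algorithm_kwargs": {"distance_threshold": 0.45},
        "pca": {"n_components": 256},
    },
)
# To send it: urllib.request.urlopen(request), with credentials added as required.
```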

How to Test the Changes

Extract features for the target collection by running a pipeline that supports feature extraction (e.g. the Global pipeline, or Panama Plus).
Go to the Django admin panel and trigger the "cluster detections" action on the collection.
Confirm that:
a. A clustering job is created and runs successfully.
b. New taxa are created for clusters.
c. Associated occurrences receive updated determinations.
d. Results can be reviewed via the occurrences and taxa views.

Screenshots

Checklist

I have tested these changes appropriately.
I have added and/or modified relevant tests.
I updated relevant documentation or comments.
I have verified that this PR follows the project's coding standards.
Any dependent changes have already been merged to main.

netlify · 2025-04-29T16:03:03Z

✅ Deploy Preview for antenna-preview canceled.

Name	Link
🔨 Latest commit	`5fb5c43`
🔍 Latest deploy log	https://app.netlify.com/sites/antenna-preview/deploys/681ab7557736d80008bf3c47

netlify · 2025-04-30T05:45:46Z

✅ Deploy Preview for antenna-ood canceled.

Name	Link
🔨 Latest commit	`5fb5c43`
🔍 Latest deploy log	https://app.netlify.com/sites/antenna-ood/deploys/681ab75592fbef00088f5174

mihow · 2025-05-06T22:06:20Z

ami/ml/clustering_algorithms/cluster_detections.py

+        if not Algorithm.objects.filter(key=feature_extraction_algorithm).exists():
+            raise ValueError(f"Invalid feature extraction algorithm key: {feature_extraction_algorithm}")
+    else:
+        # Fallback to the most used feature extraction algorithm in this collection


Thanks for implementing this fallback to most used feature extractor!

mihow · 2025-05-06T22:09:21Z

ami/main/api/serializers.py

        ]
+
+
+class ClusterDetectionsSerializer(serializers.Serializer):


I feel much better about using a serializer for the params! thank you 🙏

mihow

Well done @mohamedelabbas1996!! Merging!

mohamedelabbas1996 added 13 commits April 14, 2025 11:11

feat: Added pgvector extension

132f9dc

feat: Added features field to the Classification model

dc48ccc

changed taxon and detection to autocomplete fields in the Classificat…

8dc0c00

…ionAdmin model

feat: added similar action to the ClassificationViewset

4bf07b3

chore: changed features vector field name to features_2048

b258c9b

chore: changed features vector field name to features_2048

9490045

feat: read features vector from processing service ClassificationResp…

0ff569f

…onse and save it to Classification object

test: added tests for PGVector distance metrics

89a3b6c

updated docker-compose.ci.yml to use the same postgres image

9e13cc4

updated docker-compose.ci.yml to use the same postgres image as docke…

5a51593

…r-compose.yml

updated docker-compose.ci.yml to use the same postgres image as docke…

9efff5f

…r-compose.yml

feat: Added support for clustering detections for source image collec…

1c66f34

…tions

feat: Allowed triggering collection detections clustering from admin …

99a7f3f

…page

mohamedelabbas1996 changed the title ~~Add support for clustering detections from a source image collection~~ [Draft] Add support for clustering detections from a source image collection Apr 29, 2025

mihow added 3 commits April 29, 2025 13:21

fix: show unobserved Taxa in view for now

83f2c08

fix: create & update occurrence determinations after clustering

5420f85

feat: add unknown species filter to admin

6b0020d

Base automatically changed from feat/save-classification-features to deployments/ood.antenna.insectai.org April 30, 2025 01:55

mihow force-pushed the deployments/ood.antenna.insectai.org branch from e7bed1a to 74d606b Compare April 30, 2025 03:39

Merge branch 'deployments/ood.antenna.insectai.org' of github.com:Rol…

856035d

…nickLab/antenna into feat/add-clustering

mihow and others added 6 commits April 29, 2025 23:00

Merge branch 'deployments/ood.antenna.insectai.org' of github.com:Rol…

036d81d

…nickLab/antenna into feat/add-clustering

fix: circular import

4f8b09b

fix: update migration ordering

d255085

Integrated Agglomerative clustering

2e12b56

Merge branch 'feat/add-clustering' of https://github.com/RolnickLab/a…

a301dc7

…ntenna into feat/add-clustering

updated clustering request params

10820bb

mihow reviewed Apr 30, 2025

ami/ml/clustering_algorithms/cluster_detections.py (outdated, resolved)

mihow reviewed Apr 30, 2025

ami/main/api/views.py (resolved)

mihow and others added 7 commits April 30, 2025 23:28

feat: allow sorting by OOD score

bf67d06

Merge branch 'deployments/ood.antenna.insectai.org' of github.com:Rol…

ce08f6a

…nickLab/antenna into feat/add-clustering

feat: add unknown species and other fields to Taxon serializer

853b69d

fix: remove missing field

6586872

fix: migration conflicts

b242079

feat: fields for investigating occurrence classifications in admin

fe744f0

fix: filter by feature extraction algorithm

4ac88da

mohamedelabbas1996 self-assigned this May 5, 2025

mohamedelabbas1996 added 3 commits May 5, 2025 10:36

chore: Used a serializer to handle job params instead of reading them…

6d44bdb

… directly from the request objects

set default ood threshold to 0.0

12b4ee4

test: added tests for clustering

2c73795

mohamedelabbas1996 linked an issue May 6, 2025 that may be closed by this pull request

Add clustering algo to Antenna #774

Closed

mihow added this to the OOD Integration milestone May 6, 2025

mihow reviewed May 6, 2025

mihow added 4 commits May 6, 2025 15:10

chore: migration for new algorithm type

e5d7ff0

Merge branch 'deployments/ood.antenna.insectai.org' of github.com:Rol…

9ad77f7

…nickLab/antenna into feat/add-clustering

fix: remove cluster action in Event admin until its ready

fdbbf75

chore: move algorithm selection to dedicated function

0e92904

mihow marked this pull request as ready for review May 7, 2025 00:09

mihow changed the title ~~[Draft] Add support for clustering detections from a source image collection~~ Add support for clustering detections May 7, 2025

mihow changed the title ~~Add support for clustering detections~~ Support for clustering detections May 7, 2025

mihow added 4 commits May 6, 2025 18:10

fix: update clustering tests and types

b26fbe0

chore: remove external network config in processing services

4032aff

feat: update GitHub workflows to run tests on other branches

0f8c544

fix: hide unobserved taxa by default

5fb5c43

todo: add frontend filter to toggle this

mihow approved these changes May 7, 2025

mihow merged commit edac77b into deployments/ood.antenna.insectai.org May 7, 2025
6 checks passed

mihow deleted the feat/add-clustering branch May 7, 2025 01:34

mihow mentioned this pull request May 12, 2025

Add clustering algo to Antenna #774

Closed
