Support for clustering detections #818
    if not Algorithm.objects.filter(key=feature_extraction_algorithm).exists():
        raise ValueError(f"Invalid feature extraction algorithm key: {feature_extraction_algorithm}")
else:
    # Fallback to the most used feature extraction algorithm in this collection
Thanks for implementing this fallback to most used feature extractor!
class ClusterDetectionsSerializer(serializers.Serializer):
I feel much better about using a serializer for the params! thank you 🙏
mihow left a comment
Well done @mohamedelabbas1996!! Merging!
Summary
This PR introduces clustering support for detections in a source image collection, based on precomputed feature vectors (`features_2048`) stored in `Classification` records. The clustering helps organize detections into potential taxonomic groups, especially for unknown species.
List of Changes
- Integrated agglomerative clustering logic using the `features_2048` field stored on the `Classification` model.
- Added a new job type (`DetectionClusteringJob`) to cluster detections in a source image collection.
- Added logic to:
  - Select a consistent feature_extraction_algorithm across detections.
  - Allow the user to explicitly pass an algorithm, with fallback to the most used one.
  - Create new Taxon entries per cluster and automatically assign them to Occurrence determinations.
- Added admin action on source image collections to trigger clustering.
- Added `cluster_detections` action to the SourceImageCollectionViewSet to trigger clustering via API.
- Added `unknown_species` boolean field to the `Taxon` model to flag automatically generated taxa.
Related Issues
#774
Detailed Description
This PR adds clustering support for detections within a source image collection using existing visual feature vectors. The goal is to help researchers group similar-looking insect detections together, especially in collections where many specimens may not match known species. This is particularly useful in the context of the Panama trip, where a large number of insect images were collected with limited labeling. By clustering detections based on visual similarity, we can automatically group potential unknown species and assign them temporary taxa, making it easier for experts to review and identify patterns. The clustering job can be triggered from the admin interface or API.
The clustering works as follows:
First, Detection objects are filtered by the selected collection. Only detections that have at least one Classification containing a feature vector (features_2048) are considered. If the user provides a feature_extraction_algorithm, only features generated by that algorithm are used. If not, we select the most commonly used algorithm in the collection. Additionally, detections are filtered by an out-of-distribution (OOD) threshold: only those with an associated occurrence whose determination_ood_score exceeds the specified threshold are included.
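The selection rules in this step can be sketched in plain Python. This is only an illustration, not the code from the PR: the dictionaries stand in for Detection and Classification records, and the names `select_algorithm`, `detections_to_cluster`, `algorithm`, `features`, and `ood_score` are assumptions made for the example. It also validates a requested key against the algorithms present in the collection, whereas the PR checks it against the stored `Algorithm` records.

```python
from collections import Counter


def select_algorithm(detections, requested_key=None):
    """Return the feature extraction algorithm key to cluster with."""
    # Count detections that actually carry a feature vector, per algorithm.
    counts = Counter(
        d["algorithm"] for d in detections if d.get("features") is not None
    )
    if requested_key is not None:
        if requested_key not in counts:
            raise ValueError(
                f"Invalid feature extraction algorithm key: {requested_key}"
            )
        return requested_key
    if not counts:
        raise ValueError("No detections with feature vectors in this collection")
    # Fallback: the most used feature extraction algorithm.
    return counts.most_common(1)[0][0]


def detections_to_cluster(detections, requested_key=None, ood_threshold=0.0):
    """Keep detections with features from one algorithm and a high OOD score."""
    key = select_algorithm(detections, requested_key)
    return [
        d
        for d in detections
        if d.get("features") is not None
        and d["algorithm"] == key
        and d["ood_score"] > ood_threshold  # strictly above the threshold
    ]


# Hypothetical records, for illustration only.
detections = [
    {"id": 1, "algorithm": "extractor_a", "features": [0.1, 0.2], "ood_score": 0.9},
    {"id": 2, "algorithm": "extractor_a", "features": [0.3, 0.1], "ood_score": 0.2},
    {"id": 3, "algorithm": "extractor_b", "features": [0.5, 0.5], "ood_score": 0.8},
    {"id": 4, "algorithm": "extractor_a", "features": None, "ood_score": 0.95},
]
selected = detections_to_cluster(detections, ood_threshold=0.4)
```

With no algorithm requested, `extractor_a` is chosen because it has the most feature vectors, and only detection 1 passes the OOD filter.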
Once the set of valid detections is prepared, we apply PCA dimensionality reduction to the feature vectors (default: 384 components). The reduced vectors are then passed to the selected clustering algorithm (currently, agglomerative clustering is supported). The algorithm groups similar detections into clusters based on the provided clustering parameters (e.g., distance_threshold).
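To make the role of `distance_threshold` concrete, here is a naive, standard-library-only sketch of agglomerative clustering. It is not the implementation used in the PR: it assumes average linkage (the PR text does not state the linkage), it skips the PCA step entirely, and the function names are invented for this example. A real pipeline would more likely rely on a library implementation.

```python
import math
from itertools import combinations


def average_linkage(c1, c2, points):
    """Mean pairwise Euclidean distance between two clusters of point indices."""
    total = sum(math.dist(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerative_cluster(points, distance_threshold):
    """Merge the two closest clusters until none are closer than the threshold.

    Returns one integer cluster label per input point.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        # Find the closest pair of clusters (a < b by construction).
        (a, b), dist = min(
            (
                ((a, b), average_linkage(clusters[a], clusters[b], points))
                for a, b in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[1],
        )
        # Stop once the closest pair is at or beyond the threshold.
        if dist >= distance_threshold:
            break
        clusters[a].extend(clusters[b])
        del clusters[b]
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Two tight groups of toy 2-D "feature vectors".
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
labels = agglomerative_cluster(points, distance_threshold=0.5)
```

A smaller threshold yields more, tighter clusters; a larger one merges more detections into each cluster, so the number of clusters is a result of the threshold, not an input.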
After clustering:
- A new Taxon is created for each cluster, marked with the unknown_species=True flag.
- A new Classification is created for each detection, pointing to the corresponding cluster taxon with score=1.0. This score ensures that the associated Occurrence will automatically update its determination to use the new classification.
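The two post-clustering steps can be illustrated with plain dataclasses standing in for the Django models. Everything here is a sketch under assumptions: the class shapes, the `Cluster N` naming scheme, and the rule that the highest-scoring classification becomes the determination are illustrative and are not taken from the PR code.

```python
from dataclasses import dataclass, field


@dataclass
class Taxon:
    name: str
    unknown_species: bool = False


@dataclass
class Classification:
    taxon: Taxon
    score: float


@dataclass
class Detection:
    id: int
    classifications: list = field(default_factory=list)

    def best_classification(self):
        # Assumption: the determination follows the highest-scoring classification.
        return max(self.classifications, key=lambda c: c.score)


def assign_cluster_taxa(detections, labels, name_prefix="Cluster"):
    """Create one placeholder taxon per cluster label and classify each detection."""
    taxa = {}
    for detection, label in zip(detections, labels):
        if label not in taxa:
            # Hypothetical naming scheme for the generated taxa.
            taxa[label] = Taxon(name=f"{name_prefix} {label}", unknown_species=True)
        detection.classifications.append(
            Classification(taxon=taxa[label], score=1.0)
        )
    return taxa


# Three detections, each with an earlier, less confident classification.
moth = Taxon(name="Some known moth")
detections = [Detection(id=i, classifications=[Classification(moth, 0.62)]) for i in range(3)]
taxa = assign_cluster_taxa(detections, labels=[0, 0, 1])
```

Because the new classification carries the maximum score, it outranks the earlier 0.62 classification in this sketch, which mirrors why the PR assigns score=1.0.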
The `cluster_detections` action accepts several parameters:
- `ood_threshold` (float, default: 0.0): Filters out detections that have already been confidently classified. Only detections whose out-of-distribution (OOD) score exceeds this value are included in the clustering.
- `feature_extraction_algorithm` (string, optional): The key of the feature extraction algorithm to use. If not provided, the system automatically selects the most commonly used algorithm within the collection.
- `algorithm` (string, default: "agglomerative"): The clustering algorithm to use. Currently supports "agglomerative".
- `algorithm_kwargs` (dict, default: {"distance_threshold": 0.5}): Extra configuration for the clustering algorithm, such as the distance threshold for agglomerative clustering.
- `pca` (dict, default: {"n_components": 384}): Dimensionality reduction settings applied before clustering.
Sample request:
POST /api/collections/42/cluster_detections/
Content-Type: application/json
{
  "ood_threshold": 0.4,
  "feature_extraction_algorithm": "resnet18",
  "algorithm": "agglomerative",
  "algorithm_kwargs": {
    "distance_threshold": 0.45
  },
  "pca": {
    "n_components": 256
  }
}
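For scripted use, the same request can be assembled with the Python standard library. This is a minimal sketch: the host `https://antenna.example.org` is a placeholder, `build_cluster_request` is a hypothetical helper, and any authentication the API requires is left out. The defaults mirror the parameter list above.

```python
import json
from urllib.request import Request

# Defaults as documented for the cluster_detections action.
DEFAULT_PARAMS = {
    "ood_threshold": 0.0,
    "algorithm": "agglomerative",
    "algorithm_kwargs": {"distance_threshold": 0.5},
    "pca": {"n_components": 384},
}


def build_cluster_request(base_url, collection_id, **overrides):
    """Build (but do not send) a POST request for the cluster_detections action."""
    # Shallow merge: an overridden dict such as "pca" replaces the default dict.
    params = {**DEFAULT_PARAMS, **overrides}
    return Request(
        url=f"{base_url}/api/collections/{collection_id}/cluster_detections/",
        data=json.dumps(params).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder host; replace with the real deployment and add credentials.
request = build_cluster_request(
    "https://antenna.example.org",
    42,
    ood_threshold=0.4,
    feature_extraction_algorithm="resnet18",
    algorithm_kwargs={"distance_threshold": 0.45},
    pca={"n_components": 256},
)
payload = json.loads(request.data)
```

Passing the prepared object to `urllib.request.urlopen` would send it; that step is omitted here because it depends on the deployment and its credentials.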
How to Test the Changes
1. Extract features for the target collection by running a pipeline that supports feature extraction (e.g. the Global pipeline, or Panama Plus).
2. Go to the Django admin panel and trigger the "cluster detections" action on the collection.
3. Confirm that:
   a. A clustering job is created and runs successfully.
   b. New taxa are created for clusters.
   c. Associated occurrences receive updated determinations.
   d. Results can be reviewed via the occurrences and taxa views.