Contributor

@mohamedelabbas1996 mohamedelabbas1996 commented Apr 29, 2025

Summary

This PR introduces clustering support for detections in a source image collection, based on precomputed feature vectors (features_2048) stored in Classification records. The clustering helps organize detections into potential taxonomic groups, especially for unknown species.

List of Changes

  • Integrated agglomerative clustering logic using features_2048 field stored on the Classification model.

  • Added a new job type (DetectionClusteringJob) to cluster detections in a source image collection.

  • Added logic to:

    Select a consistent feature_extraction_algorithm across detections.

    Allow the user to explicitly pass an algorithm, with fallback to the most used one.

  • Added logic to create a new Taxon entry per cluster and automatically assign it to Occurrence determinations.

  • Added admin action on source image collections to trigger clustering.

  • Added cluster_detections action to the SourceImageCollectionViewSet to trigger clustering via API.

  • Added unknown_species boolean field to the Taxon model to flag automatically generated taxa.

Related Issues

#774

Detailed Description

This PR adds clustering support for detections within a source image collection using existing visual feature vectors. The goal is to help researchers group similar-looking insect detections together, especially in collections where many specimens may not match known species. This is particularly useful in the context of the Panama trip, where a large number of insect images were collected with limited labeling. By clustering detections based on visual similarity, we can automatically group potential unknown species and assign them temporary taxa, making it easier for experts to review and identify patterns. The clustering job can be triggered from the admin interface or API.

The clustering works as follows:
First, Detection objects are filtered by the selected collection. Only detections that have at least one Classification containing a feature vector (features_2048) are considered. If the user provides a feature_extraction_algorithm, only features generated by that algorithm are used. If not, we select the most commonly used algorithm in the collection. Additionally, detections are filtered by an out-of-distribution (OOD) threshold: only those with an associated occurrence whose determination_ood_score exceeds the specified threshold are included.
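
The selection rules above can be sketched without Django as follows. This is only an illustration: `DetectionRecord` and `select_detections` are hypothetical names, and the record is an assumed flattened view of a Detection, its Classification features, and its occurrence's determination_ood_score, not the actual models or queryset used in this PR.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DetectionRecord:
    """Hypothetical flattened view of one detection (not the real model)."""
    detection_id: int
    feature_algorithm: str   # key of the algorithm that produced the features
    features: List[float]    # stand-in for features_2048
    ood_score: float         # stand-in for determination_ood_score


def select_detections(
    records: List[DetectionRecord],
    ood_threshold: float = 0.0,
    feature_extraction_algorithm: Optional[str] = None,
) -> List[DetectionRecord]:
    """Keep records from one feature extractor whose OOD score exceeds the threshold."""
    if not records:
        return []
    if feature_extraction_algorithm is None:
        # Fall back to the most commonly used algorithm in the collection.
        counts = Counter(r.feature_algorithm for r in records)
        feature_extraction_algorithm = counts.most_common(1)[0][0]
    return [
        r
        for r in records
        if r.feature_algorithm == feature_extraction_algorithm
        and r.ood_score > ood_threshold
    ]
```

Choosing a single algorithm before filtering keeps all vectors in the same feature space, which is what makes distances between them meaningful.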

Once the set of valid detections is prepared, we apply PCA dimensionality reduction to the feature vectors (default: 384 components). The reduced vectors are then passed to the selected clustering algorithm (currently, agglomerative clustering is supported). The algorithm groups similar detections into clusters based on the provided clustering parameters (e.g., distance_threshold).
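
A minimal sketch of this step, assuming scikit-learn's `PCA` and `AgglomerativeClustering` are used (the description does not name the library). The function name `cluster_features` is hypothetical, and the linkage is left at scikit-learn's default because the PR text does not specify one.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA


def cluster_features(features, n_components=384, distance_threshold=0.5):
    """Reduce feature vectors with PCA, then group them by agglomerative clustering.

    Returns one integer cluster label per input row.
    """
    features = np.asarray(features, dtype=float)
    # PCA cannot produce more components than min(n_samples, n_features),
    # so clamp the requested value for small collections.
    n_components = min(n_components, features.shape[0], features.shape[1])
    reduced = PCA(n_components=n_components).fit_transform(features)
    # With a distance threshold, n_clusters must be None: the number of
    # clusters is then determined by where merging stops.
    model = AgglomerativeClustering(
        n_clusters=None, distance_threshold=distance_threshold
    )
    return model.fit_predict(reduced)
```

Using a distance threshold rather than a fixed cluster count suits this use case, since the number of unknown species in a collection is not known in advance.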

After clustering:

  • A new Taxon is created for each cluster, marked with the unknown_species=True flag.

  • A new Classification is created for each detection, pointing to the corresponding cluster taxon and assigned a score=1.0. This score ensures that the associated Occurrence will automatically update its determination to use the new classification.
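
The post-clustering bookkeeping can be illustrated with plain dataclasses. Everything here is a stand-in: the classes only mimic the fields mentioned above, the "Cluster N" naming is invented for the example, and `best_classification` assumes the determination follows the highest-scoring classification, which is how the description explains the choice of score=1.0.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Taxon:
    name: str
    unknown_species: bool = False


@dataclass
class Classification:
    taxon: Taxon
    score: float


@dataclass
class Detection:
    detection_id: int
    classifications: List[Classification] = field(default_factory=list)

    def best_classification(self) -> Classification:
        # Assumption: the determination follows the highest-scoring classification.
        return max(self.classifications, key=lambda c: c.score)


def assign_cluster_taxa(detections, labels, name_prefix="Cluster") -> Dict[int, Taxon]:
    """Create one placeholder taxon per cluster label and classify each detection."""
    taxa: Dict[int, Taxon] = {}
    for detection, label in zip(detections, labels):
        label = int(label)
        if label not in taxa:
            # One new taxon per cluster, flagged as automatically generated.
            taxa[label] = Taxon(name=f"{name_prefix} {label}", unknown_species=True)
        # score=1.0 outranks any lower-scoring earlier classification.
        detection.classifications.append(Classification(taxon=taxa[label], score=1.0))
    return taxa
```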

The cluster_detections action accepts several parameters:

ood_threshold (float, default: 0.0): Only detections whose occurrence has a determination_ood_score above this threshold are included in the clustering. This filters out detections that have already been confidently classified.

feature_extraction_algorithm (string, optional): Specifies the name of the feature extraction algorithm to use. If not provided, the system will automatically select the most commonly used algorithm within the collection.

algorithm (string, default: "agglomerative"): The clustering algorithm to use. Currently only "agglomerative" is supported.

algorithm_kwargs (dict, default: {"distance_threshold": 0.5}): Extra configuration for the clustering algorithm, such as the distance threshold for agglomerative clustering.

pca (dict, default: {"n_components": 384}): Dimensionality reduction settings applied before clustering.
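
As a framework-free illustration of these defaults (the PR itself validates the parameters with ClusterDetectionsSerializer, whose fields are not shown here), the helper below fills in missing values. `resolve_params` and the two constants are hypothetical names, and replacing nested dicts wholesale rather than merging them is an assumption of this sketch.

```python
import copy

# Defaults as listed in the parameter descriptions above.
DEFAULT_PARAMS = {
    "ood_threshold": 0.0,
    "feature_extraction_algorithm": None,  # None means "use the most common one"
    "algorithm": "agglomerative",
    "algorithm_kwargs": {"distance_threshold": 0.5},
    "pca": {"n_components": 384},
}
SUPPORTED_ALGORITHMS = {"agglomerative"}


def resolve_params(payload):
    """Return the request payload with defaults applied, rejecting bad input."""
    unknown = set(payload) - set(DEFAULT_PARAMS)
    if unknown:
        raise ValueError(f"Unknown parameters: {sorted(unknown)}")
    params = copy.deepcopy(DEFAULT_PARAMS)
    # Nested dicts such as "pca" are replaced as a whole, not merged key by key.
    params.update(copy.deepcopy(payload))
    if params["algorithm"] not in SUPPORTED_ALGORITHMS:
        raise ValueError(f"Unsupported clustering algorithm: {params['algorithm']}")
    return params
```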

Sample request:
POST /api/collections/42/cluster_detections/
Content-Type: application/json

{
  "ood_threshold": 0.4,
  "feature_extraction_algorithm": "resnet18",
  "algorithm": "agglomerative",
  "algorithm_kwargs": {
    "distance_threshold": 0.45
  },
  "pca": {
    "n_components": 256
  }
}
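
For scripting, the same request could be assembled with the Python standard library. This is a sketch under assumptions: `build_cluster_request` is a hypothetical helper, the host name is a placeholder, and authentication is left out because the description does not say how the API is secured.

```python
import json
import urllib.request


def build_cluster_request(base_url, collection_id, params):
    """Build (but do not send) a POST request for the cluster_detections action."""
    url = f"{base_url.rstrip('/')}/api/collections/{collection_id}/cluster_detections/"
    return urllib.request.Request(
        url,
        data=json.dumps(params).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_cluster_request(
    "https://antenna.example.org",  # placeholder host, not a real deployment
    42,
    {
        "ood_threshold": 0.4,
        "feature_extraction_algorithm": "resnet18",
        "algorithm": "agglomerative",
        "algorithm_kwargs": {"distance_threshold": 0.45},
        "pca": {"n_components": 256},
    },
)
```

Passing `request` to `urllib.request.urlopen` would send it. The shape of the response is not described in this PR, so it is not handled here.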

How to Test the Changes

  1. Extract features for the target collection by running a pipeline that supports feature extraction (e.g. the Global pipeline, or Panama Plus).

  2. Go to the Django admin panel and trigger the "cluster detections" action on the collection.

  3. Confirm that:

    a. A clustering job is created and runs successfully.

    b. New taxa are created for clusters.

    c. Associated occurrences receive updated determinations.

    d. Results can be reviewed via the occurrences and taxa views.

Screenshots

[Two screenshots omitted]

Checklist

  • I have tested these changes appropriately.
  • I have added and/or modified relevant tests.
  • I updated relevant documentation or comments.
  • I have verified that this PR follows the project's coding standards.
  • Any dependent changes have already been merged to main.

netlify bot commented Apr 29, 2025

Deploy Preview for antenna-preview canceled.

🔨 Latest commit: 5fb5c43
🔍 Latest deploy log: https://app.netlify.com/sites/antenna-preview/deploys/681ab7557736d80008bf3c47

@mohamedelabbas1996 mohamedelabbas1996 changed the title from "Add support for clustering detections from a source image collection" to "[Draft] Add support for clustering detections from a source image collection" Apr 29, 2025
Base automatically changed from feat/save-classification-features to deployments/ood.antenna.insectai.org April 30, 2025 01:55
@mihow mihow force-pushed the deployments/ood.antenna.insectai.org branch from e7bed1a to 74d606b Compare April 30, 2025 03:39

netlify bot commented Apr 30, 2025

Deploy Preview for antenna-ood canceled.

🔨 Latest commit: 5fb5c43
🔍 Latest deploy log: https://app.netlify.com/sites/antenna-ood/deploys/681ab75592fbef00088f5174

@mohamedelabbas1996 mohamedelabbas1996 self-assigned this May 5, 2025
@mohamedelabbas1996 mohamedelabbas1996 linked an issue May 6, 2025 that may be closed by this pull request
@mihow mihow added this to the OOD Integration milestone May 6, 2025
    if not Algorithm.objects.filter(key=feature_extraction_algorithm).exists():
        raise ValueError(f"Invalid feature extraction algorithm key: {feature_extraction_algorithm}")
else:
    # Fallback to the most used feature extraction algorithm in this collection
Collaborator

Thanks for implementing this fallback to most used feature extractor!

]


class ClusterDetectionsSerializer(serializers.Serializer):
Collaborator

I feel much better about using a serializer for the params! thank you 🙏

@mihow mihow marked this pull request as ready for review May 7, 2025 00:09
@mihow mihow changed the title from "[Draft] Add support for clustering detections from a source image collection" to "Add support for clustering detections" May 7, 2025
@mihow mihow changed the title from "Add support for clustering detections" to "Support for clustering detections" May 7, 2025
Collaborator

@mihow mihow left a comment

Well done @mohamedelabbas1996!! Merging!

@mihow mihow merged commit edac77b into deployments/ood.antenna.insectai.org May 7, 2025
6 checks passed
@mihow mihow deleted the feat/add-clustering branch May 7, 2025 01:34
Development

Successfully merging this pull request may close these issues.

Add clustering algo to Antenna

3 participants