Add pipeline summary metrics to face clustering logs #878

@Sainava

Description

Summary

The face clustering pipeline in backend/app/utils/face_clusters.py performs several critical filtering and decision steps, but lacks high-level summary metrics that would help developers understand clustering outcomes during debugging, testing, and development.

This issue proposes adding lightweight summary logging at key decision boundaries to improve observability without changing any clustering behavior.


Current State: What Already Exists

The clustering pipeline already has good logging for individual operations:

  • Line 239: "Total valid faces to cluster: X"
  • Line 233: "Filtered out X invalid embeddings"
  • Line 258: "Applied similarity threshold: X"
  • Line 271: "DBSCAN found X clusters"
  • Line 498: Individual cluster merge notifications

These logs provide step-by-step visibility into the pipeline's execution.


What's Missing: Summary Metrics

While individual steps are logged, aggregate outcomes are not. This makes it difficult to answer:

  • "How many faces were excluded as DBSCAN noise?"
  • "Did most faces get clustered or rejected?"
  • "How much did post-merge clustering reduce the cluster count?"
  • "Did the pipeline behave as expected end-to-end?"

Specifically missing:

  • DBSCAN noise count - number of faces marked as outliers (label == -1)
  • Pre/post-merge comparison - cluster count before and after similarity merging
  • Final pipeline summary - total faces processed and final cluster count


Proposed Additions

Add three high-level summary logs:

1. After DBSCAN clustering (~line 280)

noise_count = sum(1 for label in cluster_labels if label == -1)
cluster_count = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)  # exclude the -1 noise label
logger.info(f"DBSCAN results: {cluster_count} clusters, {noise_count} noise faces")

2. Post-merge summary (~line 313)

pre_merge_count = len(set(r.cluster_uuid for r in results))  # before merge
# ... similarity merge runs here ...
post_merge_count = len(set(r.cluster_uuid for r in results))
logger.info(f"Post-merge: {pre_merge_count} → {post_merge_count} clusters")

3. Final pipeline summary (end of function)

logger.info(f"Clustering complete: {len(results)} faces assigned across {post_merge_count} clusters")

Expected Output Example

[INFO] Total valid faces to cluster: 412
[INFO] Filtered out 27 invalid embeddings
[INFO] Applied similarity threshold: 0.85 (max_distance: 0.150)
[INFO] DBSCAN results: 38 clusters, 91 noise faces
[INFO] Post-merge: 38 → 31 clusters
[INFO] Clustering complete: 321 faces assigned across 31 clusters

Why This Matters

This is not an ML accuracy issue. The clustering logic may be working perfectly, but without summary diagnostics:

  • Developers cannot easily verify expected pipeline behavior
  • Debugging requires breakpoints or manual print statements
  • Clustering outcomes appear "opaque" when investigating issues
  • New contributors have difficulty understanding system behavior

Open Questions

Before implementing, I wanted to confirm:

  • Would this level of clustering observability be useful?
  • Are there other metrics that would be valuable to surface?
  • Any preferences on log level (INFO vs DEBUG)?

Happy to adjust based on maintainer feedback.
