DeepLabCut Multi-animal Tracking with Improved re-ID
A Python pipeline for pose estimation, identity correction, tracklet stitching, and behavioral feature extraction from multi-animal videos using DeepLabCut. Originally developed for analysis of cichlid (fish) behavioral videos in the Streelman and McGrath labs at Georgia Tech.
- Pose Estimation: Multi-animal pose tracking using DeepLabCut
- Identity Correction: Per-video triplet loss CNN for consistent identity assignment across tracklets
- Tracklet Stitching: Combine short tracklets into continuous identity tracks
- Temporal Filtering: Smooth pose trajectories over time using DeepLabCut
- Feature Extraction: Automated behavioral feature detection and quantification
- Visualization: Create labeled videos with pose overlays (DeepLabCut) and other visualizations
- Statistical Analysis: Generate boxplots, correlation plots, and heatmaps for extracted features
This repository was developed for cichlid behavioral analysis and prioritizes applicability to that specific use case over generalizability. Many components (particularly feature extraction and analysis) are specialized for our experimental setup and behaviors of interest.
However, the CNN-based identity correction and hard-partition tracklet stitching approaches may be broadly useful and could potentially be integrated into DeepLabCut's official library:
CNN-based ReID vs. DLC's Transformer ReID:
- Learns from RGB image patches rather than pose backbone features
- May perform better when pre-trained features don't capture visual differences
- Includes silhouette-based quality filtering for uncertainty quantification
- Simpler architecture (CNN vs. Transformer), faster training
- Both enable a train-once-per-species, reID-per-video workflow
CNN-based ReID vs. DLC's Supervised Identity Tracking:
- No identity annotation required (unsupervised clustering)
- Uses a generalized pose model plus specialized per-video ID models, rather than a specialized pose+ID model for each video
- Better for experiments where individuals have similar body plans across videos, but the exact individuals in each video vary
Hard-partition stitching (vs. DLC's soft-constraint graph optimization):
- Strictly enforces identity boundaries by partitioning tracklets before stitching
- Simpler and more robust when identity preservation is critical
- Avoids the complexity and occasional failures (common on very long videos) of global graph optimization
See Comparison with DeepLabCut's Transformer ReID for detailed technical differences.
- Python 3.8 or higher
- CUDA-compatible GPU (recommended for faster processing)
git clone https://github.com/yourusername/DemBA.git
cd DemBA
pip install -e .

Core dependencies include:
- numpy
- pandas
- opencv-python
- torch
- scikit-learn
- matplotlib
- tqdm
- deeplabcut
DemBA expects a specific project structure for optimal functionality. While some operations may work with non-standard layouts, using the recommended structure is strongly advised for full pipeline compatibility:
MyProject/
├── config.yaml                         # DeepLabCut project config
└── Analysis_<date>/                    # Analysis directory (can be any name)
    ├── Videos/                         # Required: Trial video directories
    │   ├── trial1/                     # Each trial in its own directory
    │   │   ├── trial1.mp4              # Video file (name matches directory)
    │   │   ├── trial1_roi.png          # ROI image (optional)
    │   │   └── ...                     # Pipeline outputs created here
    │   ├── trial2/
    │   │   ├── trial2.mp4
    │   │   └── ...
    │   └── ...
    └── Annotations/                    # Optional: Manual annotations
        └── quivering_annotations.xlsx  # Auto-detected by batch mode
Key Requirements:
- config.yaml must be in or above the Analysis directory
- Each trial must be in its own subdirectory under Videos/
- Video filename must match the directory name (e.g., trial1/trial1.mp4)
- Annotations directory is optional but auto-detected if present
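The layout rules above can be checked programmatically. The following is a hypothetical helper (not part of DemBA) that verifies each trial directory under `Videos/` contains a video named after it:

```python
from pathlib import Path

def check_layout(analysis_dir):
    """Return the names of trial directories missing a matching <name>.mp4."""
    problems = []
    videos_dir = Path(analysis_dir) / 'Videos'
    for trial in sorted(videos_dir.iterdir()):
        if trial.is_dir() and not (trial / f'{trial.name}.mp4').exists():
            problems.append(trial.name)
    return problems  # an empty list means the layout looks correct
```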
DemBA provides a comprehensive CLI with modular stages, full pipeline mode, and batch mode.
# Run full pipeline on a single trial
python main.py full --video Videos/trial1/trial1.mp4 --dlc-config config.yaml
# Run batch mode on multiple trials (recommended for projects with many videos)
python main.py batch --project-dir /path/to/MyProject/Analysis_<date>
# Run individual stages
python main.py pose --video Videos/trial1/trial1.mp4 --dlc-config config.yaml
python main.py id-correction --video Videos/trial1/trial1.mp4 --dlc-config config.yaml
python main.py stitch --video Videos/trial1/trial1.mp4 --dlc-config config.yaml
python main.py filter --video Videos/trial1/trial1.mp4 --dlc-config config.yaml
python main.py features --video Videos/trial1/trial1.mp4 --dlc-config config.yaml
python main.py visualize --video Videos/trial1/trial1.mp4 --dlc-config config.yaml
python main.py analyze --project-dir /path/to/MyProject/Analysis_<date>

import demba
# Pose estimation
demba.estimate_pose(
    config_path='config.yaml',
    video_path='trial1.mp4',
    n_fish=2
)
# Identity correction
demba.prepare_id_correction('trial1_el.pickle')
demba.complete_id_correction('trial1_el.pickle')
# Tracklet stitching
demba.stitch_by_identity(
    tracklet_pickle_path='trial1_el.pickle',
    output_h5_path='trial1_el.h5',
    n_tracks=2
)
# Feature extraction
demba.process_video(
    video_path='trial1.mp4',
    pose_h5_path='trial1_el.h5',
    visualize=True
)

DemBA provides 8 modular stages that can be run individually or combined via full (single trial) or batch (multiple trials) modes.
Runs standard DeepLabCut multi-animal pose estimation (dlc.analyze_videos) and simple SORT-style stitching (dlc.convert_detections2tracklets).
python main.py pose --video trial1.mp4 --dlc-config config.yaml --n-fish 2

The Challenge: Multi-animal pose tracking produces identity swaps when individuals occlude each other, come into close proximity, or leave and return to the field of view.
DemBA's Approach: A per-video triplet loss CNN that learns to distinguish individuals using the pose tracking data itself; no manual annotations are required beyond the pose annotations used to train the DLC pose estimation model.
How It Works:
- Automatic Training Data Generation: Identifies "co-occupancy frames" where both animals are simultaneously detected, providing anchor-negative pairs for contrastive learning.
- Visual Embedding Learning: Trains a CNN to extract visual embeddings from image patches around each detection, using triplet loss to separate embeddings by identity.
- Unsupervised Clustering: Applies KMeans clustering to embeddings across all detections, grouping them into two identity clusters.
- Interactive Mapping: Creates a comparison video showing representative segments from each cluster. User labels which cluster corresponds to each individual.
- Quality-Filtered Assignment: Assigns identity labels using silhouette scores to filter low-confidence predictions.
Usage:
# Two-phase approach (recommended for large videos)
python main.py id-correction --tracklet-pickle trial1_el.pickle --n-epochs 150
# Or call phases separately for more control
# Phase 1: Train model and extract embeddings (non-interactive, can be cached)
demba.prepare_id_correction('trial1_el.pickle')
# Phase 2: Interactive cluster mapping and ID assignment
demba.complete_id_correction('trial1_el.pickle')

Key Parameters:
- --n-epochs: Training epochs (default: 200, but early stopping typically terminates much earlier)
- --min-silhouette: Quality threshold for ID assignment (default: 0.2, range: -1 to 1)
- --frame-stride: Sampling density for patch cache (default: 5, lower = more data)
- --min-overlap-frames: Minimum co-occupancy duration for training pairs (default: 30)
See Identity Correction: Technical Details for in-depth explanation.
Combines short tracklets into continuous tracks based on learned identities.
python main.py stitch --tracklet-pickle trial1_el.pickle --output-h5 trial1_el.h5 --n-tracks 2

Applies standard DeepLabCut temporal smoothing to reduce jitter in pose predictions.
python main.py filter --video trial1.mp4 --dlc-config config.yaml

Extracts behavioral features from the pose data. This module is specialized for our specific behaviors of interest and would need modification for more general applications.
python main.py features --video trial1.mp4 --pose-h5 trial1_el.h5 --visualize

Creates labeled videos with skeleton overlays and identity labels (using dlc.create_labeled_video).
python main.py visualize --video trial1.mp4 --dlc-config config.yaml

Generates statistical plots and correlation analyses across all trials in a project.
python main.py analyze --project-dir /path/to/MyProject/Analysis

The Challenge: When processing many videos, running the full pipeline sequentially requires you to be present for the interactive ID assignment step of each video, which interrupts the workflow.
Batch Mode Solution: Separates the pipeline into three phases:
- Phase 1 (Preparation): Runs pose estimation and ID model training for all videos (non-interactive, can run overnight)
- Phase 2 (Interactive): Interactive cluster mapping for all videos in one sitting
- Phase 3 (Finalization): Completes remaining pipeline stages for all videos (non-interactive)
This allows you to complete all interactive tasks at once, then let the rest run unattended.
Usage:
# Process entire project with one command
python main.py batch --project-dir /path/to/MyProject/Analysis
# Optional: manually specify annotations file
python main.py batch --project-dir /path/to/MyProject/Analysis \
    --quivering-annotations custom_annotations.xlsx

Features:
- Automatically discovers all trials in the Videos/ directory
- Auto-detects config file and annotations
- Tracks completion status per trial and can resume if interrupted
- Skips already-completed stages automatically
- Runs project-level analysis at the end
When to Use:
- ✅ Multiple videos to process (>3 trials)
- ✅ Want to minimize interactive time
- ✅ Processing overnight or on a remote server
- ❌ Single trial (use full mode instead)
- ❌ Need fine control over individual stages
Identity correction addresses a common challenge in multi-animal behavioral analysis: maintaining consistent identity labels across long videos despite occlusions, close proximity, and visual similarity between individuals.
In my experience working with long cichlid videos (3.5 hours), DeepLabCut's standard tracking pipeline often failed when it came to reID and stitching short tracklets into long trajectories. Multiple approaches were attempted:
- Standard multi-animal tracking with SORT-style tracklet generation and graph-based tracklet stitching:
- Produced identity swaps throughout the video (as expected for videos where all animals can leave the frame)
- Often failed during graph-based stitching (possibly due to video length and large spatiotemporal distances between individual tracklets when animals leave frame)
- DLC's Transformer ReID + graph-based stitching: Did not achieve ReID rates much above random, possibly because:
- The pose estimation backbone features didn't capture the subtle visual differences between the two fish (distinguishable only to a skilled annotator)
- Graph optimization still struggled, and the addition of often incorrect pose data likely did more harm than good
- Identity-aware pose estimation model (training with identity=True after manually annotating male/female separately): Also unsuccessful
  - Compared to our ID-agnostic pose model (trained on ~850 images representing dozens of unique individuals), the supervised ID model (trained on 30 images from a single 2-individual trial) performed poorly at pose estimation
  - Poor pose estimation results obscured any possible underlying ID success
  - This method might work given larger training sets, but annotating 100+ images every time we run a new trial was not feasible for us
The core challenge: maintaining consistent identity assignment across very long videos where visually similar animals frequently occlude each other and leave/return to the frame, while also minimizing manual annotation requirements
DemBA uses a per-video CNN trained with triplet loss to learn discriminative visual features for each individual, then clusters these features to assign consistent identities. The system generates its own training data from the tracklet structure itself; no manual labeling is required.
The system identifies frames where exactly two tracklets overlap, i.e. when both animals are visible but tracked separately. These "co-occupancy frames" are useful for contrastive learning because they contain:
- Same-individual examples (frames within a tracklet)
- Different-individual examples (frames from overlapping tracklets)
Parameters:
- min_conf: Minimum detection confidence (default: 0.5)
- min_overlap_frames: Minimum overlap duration to consider a tracklet pair valid (default: 10 frames)
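The co-occupancy search can be sketched in a few lines, under the assumption that each tracklet is reduced to the set of frame indices it covers (the real pipeline reads these from the DLC tracklet pickle; the function name is illustrative):

```python
def find_cooccupancy_pairs(tracklet_frames, min_overlap_frames=10):
    """Return (id_a, id_b, shared_frames) for tracklet pairs that coexist
    for at least min_overlap_frames frames."""
    pairs = []
    ids = sorted(tracklet_frames)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            # Frames where both tracklets have a detection
            shared = sorted(set(tracklet_frames[a]) & set(tracklet_frames[b]))
            if len(shared) >= min_overlap_frames:
                pairs.append((a, b, shared))
    return pairs
```

For example, `find_cooccupancy_pairs({0: range(100), 1: range(50, 150), 2: range(200, 210)})` keeps only the (0, 1) pair, whose animals coexist on frames 50-99.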
For each detection, extract a bounding box around all detected keypoints:
- Compute bbox from high-confidence keypoints (threshold: 0.25)
- Add padding (default: 10 pixels)
- Ensure minimum dimensions (4× padding) to avoid degenerate crops
- Resize to square patches (default: 128×128), maintaining aspect ratio
- Pad to square with black pixels if needed
This produces consistent visual representations centered on each animal regardless of pose or orientation.
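The extraction steps above can be sketched as follows. This is illustrative only: a nearest-neighbour resize stands in for the real implementation's interpolation, and the function and parameter names are assumptions:

```python
import numpy as np

def _resize_nn(img, new_h, new_w):
    # Nearest-neighbour resize, used here to keep the sketch dependency-free
    ys = np.arange(new_h) * img.shape[0] // new_h
    xs = np.arange(new_w) * img.shape[1] // new_w
    return img[ys][:, xs]

def extract_patch(frame, keypoints, confidences, conf_thresh=0.25,
                  padding=10, patch_size=128):
    pts = keypoints[confidences >= conf_thresh]
    if len(pts) == 0:
        return None                                # no confident keypoints
    x0 = max(int(pts[:, 0].min()) - padding, 0)
    y0 = max(int(pts[:, 1].min()) - padding, 0)
    x1 = min(int(pts[:, 0].max()) + padding, frame.shape[1])
    y1 = min(int(pts[:, 1].max()) + padding, frame.shape[0])
    # Enforce minimum dimensions (4x padding) to avoid degenerate crops
    min_dim = 4 * padding
    x1 = min(max(x1, x0 + min_dim), frame.shape[1])
    y1 = min(max(y1, y0 + min_dim), frame.shape[0])
    crop = frame[y0:y1, x0:x1]
    # Scale so the longer side equals patch_size, preserving aspect ratio
    h, w = crop.shape[:2]
    scale = patch_size / max(h, w)
    crop = _resize_nn(crop, max(1, round(h * scale)), max(1, round(w * scale)))
    # Pad with black pixels to a square patch
    out = np.zeros((patch_size, patch_size, frame.shape[2]), dtype=frame.dtype)
    out[:crop.shape[0], :crop.shape[1]] = crop
    return out
```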
For each training iteration, sample a triplet:
Anchor: Detection from Tracklet A at time t₁ (co-occupancy frame)
Positive: Detection from Tracklet A at time t₂ (different time, same animal)
Negative: Detection from Tracklet B at time t₁ (same time, different animal)
The co-occupancy constraint ensures anchor and negative show different individuals, while temporal continuity within tracklets ensures anchor and positive show the same individual.
Data augmentation: Random 90° rotations during preprocessing.
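A minimal sketch of this sampling rule, assuming cached patches are indexed by (tracklet_id, frame_index) — the indexing scheme and function name are assumptions, not DemBA's actual API:

```python
import random
import numpy as np

def sample_triplet(patches, tracklet_a, tracklet_b, shared_frames):
    t1 = random.choice(list(shared_frames))          # co-occupancy frame
    other_frames = [f for (tid, f) in patches
                    if tid == tracklet_a and f != t1]
    t2 = random.choice(other_frames)                 # same animal, other time
    anchor, positive = patches[(tracklet_a, t1)], patches[(tracklet_a, t2)]
    negative = patches[(tracklet_b, t1)]             # other animal, same time
    # Augmentation: independent random 90-degree rotations
    return tuple(np.rot90(p, k=random.randrange(4)).copy()
                 for p in (anchor, positive, negative))
```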
A compact and simple convolutional encoder optimized for embedding extraction:
Input: 128×128×3 RGB patch
→ Conv Block 1: 32 filters, BatchNorm, ReLU, MaxPool (→64×64)
→ Conv Block 2: 64 filters, BatchNorm, ReLU, MaxPool (→32×32)
→ Conv Block 3: 128 filters, BatchNorm, ReLU, MaxPool (→16×16)
→ Conv Block 4: 256 filters, BatchNorm, ReLU, MaxPool (→8×8)
→ Flatten + FC(512) + ReLU + Dropout(0.5)
→ FC(embedding_dim) [default: 128]
→ L2 Normalization
Output: 128-dimensional unit vector
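A PyTorch sketch of this encoder, with layer sizes taken from the listing above (details of the real implementation in the `demba` package may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleCNN(nn.Module):
    def __init__(self, embedding_dim=128):
        super().__init__()
        blocks, in_ch = [], 3
        for out_ch in (32, 64, 128, 256):
            blocks += [nn.Conv2d(in_ch, out_ch, 3, padding=1),
                       nn.BatchNorm2d(out_ch), nn.ReLU(), nn.MaxPool2d(2)]
            in_ch = out_ch
        self.features = nn.Sequential(*blocks)       # 128x128 -> 8x8x256
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(256 * 8 * 8, 512),
                                  nn.ReLU(), nn.Dropout(0.5),
                                  nn.Linear(512, embedding_dim))

    def forward(self, x):
        # L2-normalize so embeddings lie on the unit hypersphere
        return F.normalize(self.head(self.features(x)), dim=1)
```

In eval mode, `SimpleCNN()(torch.rand(2, 3, 128, 128))` yields two unit-norm 128-dim embeddings, and the parameter count lands near the ~8.9M figure quoted for the architecture.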
Architecture Visualization:
SimpleCNN Architecture
┌──────────────────────────────────────────────────────────────────────┐
│ INPUT: 128×128×3                                                     │
└────────────────────────────────┬─────────────────────────────────────┘
                                 │
                    ┌────────────▼────────────┐
                    │ Conv2d (3 → 32)         │ kernel=3, padding=1
                    │ BatchNorm2d(32)         │
                    │ ReLU                    │
                    │ MaxPool2d(2×2)          │
                    └────────────┬────────────┘
                                 │ 64×64×32
                    ┌────────────▼────────────┐
                    │ Conv2d (32 → 64)        │ kernel=3, padding=1
                    │ BatchNorm2d(64)         │
                    │ ReLU                    │
                    │ MaxPool2d(2×2)          │
                    └────────────┬────────────┘
                                 │ 32×32×64
                    ┌────────────▼────────────┐
                    │ Conv2d (64 → 128)       │ kernel=3, padding=1
                    │ BatchNorm2d(128)        │
                    │ ReLU                    │
                    │ MaxPool2d(2×2)          │
                    └────────────┬────────────┘
                                 │ 16×16×128
                    ┌────────────▼────────────┐
                    │ Conv2d (128 → 256)      │ kernel=3, padding=1
                    │ BatchNorm2d(256)        │
                    │ ReLU                    │
                    │ MaxPool2d(2×2)          │
                    └────────────┬────────────┘
                                 │ 8×8×256
                    ┌────────────▼────────────┐
                    │ Flatten                 │
                    └────────────┬────────────┘
                                 │ 16,384 features
                    ┌────────────▼────────────┐
                    │ Linear(16384 → 512)     │
                    │ ReLU                    │
                    │ Dropout(p=0.5)          │
                    └────────────┬────────────┘
                                 │ 512 features
                    ┌────────────▼────────────┐
                    │ Linear(512 → 128)       │
                    └────────────┬────────────┘
                                 │ 128 features
                    ┌────────────▼────────────┐
                    │ L2 Normalize            │
                    └────────────┬────────────┘
                                 │
┌────────────────────────────────▼─────────────────────────────────────┐
│ OUTPUT: 128-dim unit vector embedding                                │
└──────────────────────────────────────────────────────────────────────┘
Total Parameters: ~8.9M
Input: RGB image patches (128×128×3)
Output: L2-normalized embeddings (128-dim)
Loss: Triplet loss with margin=1.0
The L2 normalization ensures embeddings lie on a hypersphere, making cosine/Euclidean distances equivalent and improving clustering.
The loss function:
L = max(0, ||f(anchor) - f(positive)||Β² - ||f(anchor) - f(negative)||Β² + margin)
This encourages:
- Small distances between same-individual pairs
- Large distances between different-individual pairs
- A margin of separation (default: 1.0)
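The loss above transcribes directly into PyTorch. This sketch uses squared Euclidean distances and margin=1.0 as in the formula (PyTorch's built-in `nn.TripletMarginLoss` is similar but defaults to unsquared distances):

```python
import torch

def triplet_loss(anchor, positive, negative, margin=1.0):
    d_pos = (anchor - positive).pow(2).sum(dim=1)   # ||f(a) - f(p)||^2
    d_neg = (anchor - negative).pow(2).sum(dim=1)   # ||f(a) - f(n)||^2
    # Hinge: zero loss once the negative is margin further than the positive
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()
```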
Training features:
- Optimizer: Adam (lr=0.001)
- Learning rate scheduling: ReduceLROnPlateau (factor=0.5, patience=3)
- Early stopping: Stops if no improvement for 10 epochs
- Batch size: 32
- Epochs: 200 (but early stopping typically terminates much earlier)
- Samples per epoch: 1000 triplets
Patch Caching: To speed up training, patches are pre-extracted and cached with sparse sampling (every frame_stride frames, default: 5). This reduces cache size significantly while maintaining training diversity. Cache is saved to disk and reused across runs.
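The caching logic can be sketched as below (hypothetical function name and cache layout; DemBA's actual cache stores image patches keyed by tracklet and frame):

```python
import os
import pickle

def load_or_build_patch_cache(cache_path, n_frames, extract_fn, frame_stride=5):
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            return pickle.load(f)                  # reuse cache across runs
    # Sparse sampling: extract only every frame_stride-th frame
    cache = {i: extract_fn(i) for i in range(0, n_frames, frame_stride)}
    with open(cache_path, 'wb') as f:
        pickle.dump(cache, f)
    return cache
```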
After training, extract embeddings for all detections in all tracklets (not just training samples). This produces an embedding for every detection in the video.
Cluster all embeddings into k=2 groups (assuming 2 animals):
kmeans = KMeans(n_clusters=2, random_state=42)
cluster_labels = kmeans.fit_predict(embeddings)

Silhouette Score: For each detection, compute its silhouette score:
s(i) = (b(i) - a(i)) / max(a(i), b(i))
where:
- a(i): Mean distance to other points in the same cluster
- b(i): Mean distance to the nearest different cluster
Range: -1 (wrong cluster) to +1 (perfect cluster fit)
This provides a confidence metric for each ID assignment.
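The clustering and confidence steps use standard scikit-learn calls; here is a sketch on synthetic, well-separated embeddings (the data is fabricated for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

# Two tight synthetic clusters standing in for the two animals' embeddings
rng = np.random.default_rng(0)
embeddings = np.vstack([rng.normal(+1, 0.1, (50, 128)),
                        rng.normal(-1, 0.1, (50, 128))])

kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(embeddings)
scores = silhouette_samples(embeddings, cluster_labels)  # one score per detection
confident = scores >= 0.2                                # min_silhouette filter
```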
Since clusters are unlabeled (we don't know if cluster 0 is male or female), the system creates a side-by-side comparison video:
- Left panel: Segments from Cluster 0
- Right panel: Segments from Cluster 1
- Shows n_segments (default: 3) representative tracklets from each cluster
- Each segment is segment_duration_sec (default: 3) seconds long
- Only uses tracklets with >70% purity (most frames in one cluster)
User watches the video and provides mapping:
Cluster 0 is (male/female): male
Cluster 1 is (male/female): female
This interactive step establishes the clusterβidentity mapping for the entire video, and is the only manual input required for the entire pipeline.
Assign numeric IDs based on:
- Cluster membership (from KMeans)
- User mapping (cluster β semantic label β numeric ID)
- Silhouette threshold (default: 0.2)
if silhouette_score >= min_silhouette:
    ID = label_to_id[cluster_mapping[cluster]]  # 0 for male, 1 for female
else:
    ID = -1  # Low confidence, keep unassigned

The original tracklet file is backed up with a timestamp, then replaced with corrected IDs.
Generated in <video_dir>/id_correction/:
- encoder_model.pth: Trained CNN weights
- patch_cache.pkl: Pre-extracted image patches (~100-500 MB)
- embeddings.pkl: Extracted embeddings for all detections
- training_loss.png: Loss curve and learning rate schedule
- cluster_comparison.mp4: Interactive comparison video
- embedding_visualization.png: PCA/t-SNE projections with silhouette scores
- summary_statistics.txt: Comprehensive statistics report
- backup_<timestamp>.pickle: Backup of original tracklets
Silhouette Score Interpretation:
- s > 0.5: Strong confidence, well-separated
- 0.2 < s ≤ 0.5: Moderate confidence (default threshold: 0.2)
- 0 < s ≤ 0.2: Weak confidence, near cluster boundary
- s < 0: Likely misclassified
Typical Results (for well-separated individuals):
- Mean silhouette: 0.4-0.7
- Assigned IDs: 85-95% of detections
- Low confidence: 5-10% of detections
- Failed extractions: <5% of detections
Tuning min_silhouette:
- 0.0: Lenient, maximize coverage (use if individuals are very distinct)
- 0.2: Moderate, balanced precision/recall (recommended default)
- 0.5: Strict, maximize precision (use if identities are critical)
When Identity Correction Works Best:
- Individuals with visual differences (size, color, pattern)
- Videos with substantial co-occupancy (both animals visible together)
- Good pose estimation quality (high confidence keypoints)
- Consistent lighting and camera angle
Troubleshooting:
- Low silhouette scores (<0.3 mean): Individuals may be too similar visually. Try decreasing frame_stride to get more diverse training samples, or lowering min_silhouette to accept more assignments.
- Few co-occupancy frames (<100): Check if animals actually overlap in the video. May need to adjust the min_conf threshold.
- Poor clustering separation: Individuals may be visually identical. Consider using behavioral features (location preferences, movement patterns) for post-hoc ID refinement.
- High training loss: Increase n_epochs or adjust lr for more stable training.
Computational Requirements:
- GPU strongly recommended, especially for long videos
- The default batch size may need to be lowered for older GPUs
DemBA's identity correction was inspired by DeepLabCut's unsupervised ReID module but differs in key aspects of both the learning approach and how identity information is used during tracklet stitching.
DeepLabCut Transformer ReID:
- Architecture: Transformer-based (4 attention heads, 4 blocks, 768-dim) with MLP output to 128-dim embeddings
- Input features: 2,048-dimensional features from the last layer of the pre-trained pose estimation CNN backbone
- Feature type: High-level "keypoint embeddings" containing visual information around each keypoint
- Training data: Triplets sampled from co-occupancy frames (anchor-positive from same tracklet, anchor-negative from different tracklets)
- Output: 128-dimensional appearance embeddings per detection
DemBA Identity Correction:
- Architecture: Compact CNN (4 conv blocks, 256 filters max) with fully connected layers to 128-dim embeddings
- Input features: Raw RGB image patches (128×128) extracted from bounding boxes around all keypoints
- Feature type: Full visual appearance of the animal (color, texture, patterns)
- Training data: Same triplet strategy from co-occupancy frames, with sparse patch caching for efficiency
- Output: 128-dimensional L2-normalized embeddings per detection
Key Differences:
- DLC leverages features from your existing ID-agnostic pose estimation model; DemBA learns appearance features from scratch per video
- DLC uses keypoint-level features; DemBA uses whole-animal appearance
- DemBA includes silhouette scores for quality filtering; DLC uses cosine similarity directly
The more significant difference lies in how identity information is used during tracklet stitching:
DeepLabCut Approach (Soft Constraint):
- Builds a single graph containing all tracklets
- Uses ReID embeddings as an additional cost term in the graph optimization
- Identity similarity provides a soft bias: edges between same-identity tracklets get lower weights (0.01× base cost) while different-identity edges get higher weights (1.0× base cost)
- The global optimization can still connect different identities if spatial/temporal costs are favorable
- Relies on the graph optimization to balance identity, distance, motion, and temporal constraints
DemBA Approach (Hard Constraint):
- Identifies and splits conjoined tracklets: Detects tracklets that switch between tracking different individuals and splits them at identity boundaries (based on runs of consecutive frames with same ID)
- Partitions tracklets by identity before stitching: tracklets with ID=0 go into one group, ID=1 into another, ID=-1 (unassigned) are discarded
- Each identity group is stitched independently using simple temporal concatenation
- No graph optimization across identities: identity boundaries are strictly enforced
- Discards low-confidence detections (silhouette < threshold) rather than attempting to stitch them
- Much simpler stitching: just concatenates tracklets of the same ID in temporal order
Code Comparison:
DLC's soft constraint (from stitch.py:1239-1244):
with_id = any(tracklet.identity != -1 for tracklet in stitcher)
if with_id and weight_func is None:
    def weight_func(t1, t2):
        w = 0.01 if t1.identity == t2.identity else 1
        return w * stitcher.calculate_edge_weight(t1, t2)

DemBA's hard partition (from tracklet_stitching.py:93-151):
# Partition by identity
tracklets_by_id = {i: [] for i in range(n_tracks)}
tracklets_by_id[-1] = []  # For unassigned
for t in tracklets:
    identity = t.identity
    if identity in tracklets_by_id:
        tracklets_by_id[identity].append(t)

# Stitch each identity group independently
for identity in valid_ids:
    identity_tracklets = tracklets_by_id[identity]
    combined_track = identity_tracklets_sorted[0]
    for t in identity_tracklets_sorted[1:]:
        if len(t) >= min_length:
            combined_track = combined_track + t

Default parameters can be found in demba/config.py:
DEFAULT_N_FISH = 2
DEFAULT_MIN_LIKELIHOOD = 0.5
DEFAULT_PATCH_SIZE = 128
DEFAULT_MOUTHING_DIST_MM = 10
VIDEO_FPS = 30

Parameters can be overridden via command-line arguments.
DemBA/
├── demba/                        # Main package
│   ├── pose_estimation.py        # DeepLabCut integration
│   ├── identity_correction.py    # Identity correction pipeline
│   ├── tracklet_stitching.py     # Tracklet combination
│   ├── filtering.py              # Temporal filtering
│   ├── feature_extraction.py     # Behavioral feature detection
│   ├── visualization.py          # Video visualization
│   ├── analysis.py               # Statistical analysis
│   ├── file_manager.py           # TrialManager & ProjectManager
│   ├── config.py                 # Configuration constants
│   └── utils/                    # Utility functions
│       ├── dlc.py                # DeepLabCut helpers
│       ├── roi.py                # ROI estimation
│       ├── gen_utils.py          # Batch mode utilities
│       └── metrics.py            # Evaluation metrics
├── main.py                       # CLI entry point
├── setup.py                      # Package installation
└── README.md                     # This file
Contributions are welcome! Please feel free to submit a Pull Request. I'm especially open to help/suggestions regarding feature integration into the official DLC library
This project is licensed under the GNU Lesser General Public License v3.0 - see the LICENSE file for details.
If you use DemBA in your research, please cite:
@software{demba2023,
title = {DemBA: DeepLabCut-augmented Multi-animal Behavioral Analysis},
author = {Lancaster, Tucker},
year = {2023},
url = {https://github.com/tlancaster6/DemBA}
}

- Built around, and heavily inspired by, DeepLabCut
- Data and biological inspiration courtesy of Kathryn Leatherbury, Streelman Lab, Georgia Tech