- 🔍 Proposes DANCE, a Concept Bottleneck Model framework for explainable video action recognition.
- 🔧 Uses disentangled concepts from motion dynamics, objects, and scenes.
- 🔄 Includes concept-level interventions, concept swapping, and concept ablations (see the sketch below).
- 📈 Demonstrates strong interpretability and competitive performance on Penn Action, UCF101, and other benchmarks.
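For readers new to concept bottleneck models, the sketch below illustrates the idea behind the highlights above: backbone video features are projected into disentangled concept activations (motion, object, scene), a linear head predicts the action from those activations, and a test-time intervention simply overwrites selected activations. All class names, dimensions, and the `concept_override` argument are illustrative assumptions, not the actual DANCE code.

```python
# Toy sketch of a concept bottleneck with disentangled concept groups.
# NOT the DANCE implementation; names and sizes are illustrative only.
import torch
import torch.nn as nn


class DisentangledCBM(nn.Module):
    def __init__(self, feat_dim=768, n_motion=45, n_object=60, n_scene=30, n_classes=15):
        super().__init__()
        # One projection per concept group: video features -> concept activations.
        self.to_motion = nn.Linear(feat_dim, n_motion)
        self.to_object = nn.Linear(feat_dim, n_object)
        self.to_scene = nn.Linear(feat_dim, n_scene)
        # Interpretable head: a single linear layer over the concatenated concepts.
        self.head = nn.Linear(n_motion + n_object + n_scene, n_classes)

    def forward(self, video_feat, concept_override=None):
        # video_feat: (B, feat_dim) clip-level features from a frozen video backbone.
        concepts = torch.cat(
            [self.to_motion(video_feat), self.to_object(video_feat), self.to_scene(video_feat)],
            dim=-1,
        )
        if concept_override is not None:
            # Concept-level intervention: overwrite selected concept activations.
            for idx, value in concept_override.items():
                concepts[:, idx] = value
        return self.head(concepts), concepts


model = DisentangledCBM()
feats = torch.randn(2, 768)                                  # dummy backbone features
logits, concepts = model(feats)                              # normal prediction
edited_logits, _ = model(feats, concept_override={0: 1.0})   # intervene on concept 0
```

Because the classifier is linear over concepts, each prediction decomposes into per-concept contributions, which is what makes interventions, swaps, and ablations meaningful.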
We provide two ways to set up the environment:
Option 1: create the environment from the provided `environment.yml`:

```bash
# Create and activate environment
conda env create -f environment.yml
conda activate dance
```

Option 2: set up the environment manually:

```bash
# Create and activate environment
conda create -n dance python=3.10 -y
conda activate dance

# Install PyTorch (modify CUDA version if needed)
conda install pytorch=2.5.1 torchvision=0.20.1 torchaudio=2.5.1 pytorch-cuda=12.1 -c pytorch -c nvidia

# Install additional dependencies
pip install -r requirements.txt
```
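As an optional sanity check after either option (not part of the repository), you can confirm that the expected PyTorch build is installed and that CUDA is visible:

```python
# Optional sanity check: confirm the PyTorch build and CUDA visibility.
import torch

print(torch.__version__)          # expected: 2.5.1
print(torch.cuda.is_available())  # True if the CUDA 12.1 build can see a GPU
```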
The repository is organized as follows:

```
DANCE/
│── CBM_training/
│   ├── train_video_cbm.py
│   ├── Feature_extraction/
│   ├── model/
│   └── ...
│── Concept_extraction/
│   ├── keyframe_selection/
│   ├── Motion_discovery/
│   └── ...
│── Dataset/
│── Experiments/
│   ├── Evaluation.ipynb
│   └── Intervention.ipynb
│── result/
│── requirements.txt
│── README.md
```

We used Penn Action, UCF101, and other action recognition datasets in our experiments.
Run motion concept discovery (example for Penn Action):

```bash
python Concept_discovery/main.py \
--dataset Penn_action \
--anno_path ./Dataset/Penn_action \
--json_path PATH_TO_SKELETON_JSON \
--output_path ./result/Penn_Action_motion_label \
--keyframe_path ./result/Penn_Action_keyframe \
--num_subsequence 12 \
--len_subsequence 25 \
--use_partition_num 3 \
--subsampling_mode sim+conf \
--confidence 0.5 \
--save_fps 10 \
--clustering_mode partition \
--req_cluster 45
```

Train the video CBM with the discovered motion concepts and the text concept sets:

```bash
python CBM_training/train_video_cbm.py \
--data_set penn-action \
--nb_classes 15 \
--spatial_concept_set ./result/Penn_Action_text_concept/Penn_action_object_concept.txt \
--place_concept_set ./result/Penn_Action_text_concept/Penn_action_scene_concept.txt \
--batch_size 64 \
--finetune PATH_TO_BACKBONE.pt \
--dual_encoder internvid_200m \
--activation_dir ./result/Penn_Action_result \
--save_dir ./result/Penn_Action_result \
--n_iters 30000 \
--interpretability_cutoff 0.3 \
--clip_cutoff 0.2 \
--backbone vmae_vit_base_patch16_224 \
--proj_steps 3000 \
--train_mode pose spatial place \
--data_path PATH_TO_DATASET \
--backbone_features ./result/Penn_Action_feature/Video_feature/penn-action_train_vmae_vit_base_patch16_224.pt \
--vlm_features ./result/Penn_Action_feature/VLM_feature/penn-action_train_internvid_200m.pt \
--pose_label ./result/Penn_Action_motion_label \
--proj_batch_size 50000 \
--saga_batch_size 128 \
--loss_mode concept \
--use_mlp
```

Evaluation and concept intervention scripts are included in:
- `Experiments/Evaluation.ipynb`
- `Experiments/Intervention.ipynb`
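As a rough illustration of what a concept ablation looks like, a hypothetical sketch (not the notebooks' actual code; `ablate_concepts`, the toy sizes, and the concept grouping are all assumptions) zeroes out a group of concept activations and measures how many predictions flip:

```python
# Hypothetical concept-ablation check, in the spirit of Experiments/Intervention.ipynb.
import torch

def ablate_concepts(head: torch.nn.Linear, concepts: torch.Tensor, idx: list[int]) -> torch.Tensor:
    """Zero out the concept activations in `idx` and return the new logits."""
    edited = concepts.clone()
    edited[:, idx] = 0.0
    return head(edited)

# Toy example: 10 videos, 135 concepts, 15 classes.
head = torch.nn.Linear(135, 15)
concepts = torch.randn(10, 135)

before = head(concepts).argmax(dim=-1)
after = ablate_concepts(head, concepts, idx=list(range(45))).argmax(dim=-1)  # ablate the first 45 concepts
print((before != after).float().mean())  # fraction of predictions flipped by the ablation
```

Ablating an entire concept group (for example, all motion concepts) indicates how much the classifier relies on that group relative to object or scene evidence.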
If you find this work useful, please cite:

```bibtex
@inproceedings{,
  title={Disentangled Concepts Speak Louder Than Words: Explainable Video Action Recognition},
  author={Lee, Jongseo and Lee, Wooil and Park, Gyeong-Moon and Kim, Seong Tae and Choi, Jinwoo},
  booktitle={NeurIPS},
  year={2025}
}
```

This project is licensed under the CC BY 4.0 license.

This work builds upon:

- Trustworthy-ML-Lab/Label-free-CBM


