Official implementation of the method described in:
“Depthtective: A Depth-Aware Framework for Spatio-Temporal Deepfake Detection”
Warning
Code Release Status: This paper is currently under review. The full source code and pre-trained models will be released publicly after the first review notification.
The documentation below serves as a preview of the framework's usage.
Depthtective is a data-efficient framework that detects manipulated facial videos by analyzing spatio-temporal inconsistencies in estimated depth. The method draws on the observation that modern deepfake generation techniques, while photorealistic, exhibit subtle violations of geometric coherence that become evident when comparing depth estimates between temporally adjacent frames.
Instead of relying on heavy temporal models such as 3D CNNs or Transformers, Depthtective focuses on the temporal residuals between two consecutive frames. The absolute differences in both RGB and depth domains are fused into a compact four-channel tensor that exposes motion-related inconsistencies and geometric distortions introduced by manipulation. This representation enables accurate video-level classification without the need for extended temporal sequences.
For each pair of aligned frames, a depth map is estimated through MiDaS (DPT-Large).
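As a minimal sketch of this step, the snippet below shows how per-frame depth could be obtained with MiDaS (DPT-Large) through its standard torch.hub entry points; the helper name estimate_depth and the resizing choices are illustrative, not the repository's actual code.

```python
import cv2
import torch

# Load MiDaS DPT-Large and its matching preprocessing transform
# (standard torch.hub entry points published by the MiDaS authors).
midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large")
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = midas_transforms.dpt_transform

device = "cuda" if torch.cuda.is_available() else "cpu"
midas.to(device).eval()

def estimate_depth(frame_bgr):
    """Return a depth map of shape (H, W) for a single BGR frame (illustrative helper)."""
    img = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    batch = transform(img).to(device)
    with torch.no_grad():
        depth = midas(batch)
        # Resize the prediction back to the original frame resolution.
        depth = torch.nn.functional.interpolate(
            depth.unsqueeze(1), size=img.shape[:2],
            mode="bicubic", align_corners=False,
        ).squeeze()
    return depth.cpu().numpy()
```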
The temporal variation in appearance and geometry is quantified through the absolute inter-frame residuals in RGB and depth. Their fusion forms a four-channel tensor (RGBD residual) that serves as the sole input to the classifier.
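The sketch below illustrates how such an RGBD residual could be assembled from two aligned frames and their depth maps; the function name, the [0, 1] normalization, and the channel ordering are assumptions made for illustration.

```python
import numpy as np

def build_rgbd_residual(frame_t, frame_t1, depth_t, depth_t1):
    """Fuse RGB and depth inter-frame residuals into a 4-channel tensor.

    frame_t, frame_t1: aligned RGB frames as float arrays in [0, 1], shape (H, W, 3).
    depth_t, depth_t1: corresponding depth maps normalized to [0, 1], shape (H, W).
    """
    rgb_residual = np.abs(frame_t1 - frame_t)                # (H, W, 3)
    depth_residual = np.abs(depth_t1 - depth_t)[..., None]   # (H, W, 1)
    # Channel-wise concatenation -> (H, W, 4) RGBD residual fed to the classifier.
    return np.concatenate([rgb_residual, depth_residual], axis=-1)
```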
The residual tensor is processed by an adapted Xception or ResNet50 architecture supporting four-channel input while retaining ImageNet pretraining. The network is fine-tuned to discriminate between authentic and manipulated videos using a standard binary classification objective.
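One common way to extend a pretrained backbone to four input channels while keeping its ImageNet weights is to replace the stem convolution, copy the RGB filters, and initialize the extra channel from their mean. The ResNet50 sketch below follows that recipe under these assumptions; it is not necessarily the exact adaptation used in the paper.

```python
import torch
import torch.nn as nn
from torchvision import models

def resnet50_4ch(num_classes=2):
    """ResNet50 with a 4-channel stem that retains ImageNet pretraining (illustrative)."""
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

    old_conv = model.conv1  # Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
    new_conv = nn.Conv2d(4, old_conv.out_channels,
                         kernel_size=old_conv.kernel_size,
                         stride=old_conv.stride,
                         padding=old_conv.padding,
                         bias=False)
    with torch.no_grad():
        new_conv.weight[:, :3] = old_conv.weight                              # reuse RGB filters
        new_conv.weight[:, 3:] = old_conv.weight.mean(dim=1, keepdim=True)    # init depth-residual channel
    model.conv1 = new_conv

    # Binary real/fake head on top of the pooled features.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model
```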
Despite its simplicity, this formulation captures the core temporal inconsistencies typical of deepfake generation.
A second formulation adopts a contrastive representation learning approach.
The CNN is trained using a Triplet Loss to produce embeddings in which real and fake samples occupy well-separated regions of the latent space. A lightweight MLP head is then trained on top of the frozen encoder.
This strategy enhances separability especially for challenging manipulations such as NeuralTextures, where the artifacts are subtle and stochastic.
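A minimal sketch of this two-stage setup is given below, using PyTorch's built-in TripletMarginLoss for the encoder and a small two-layer MLP head on the frozen embeddings; the margin value, embedding dimension, and class names are illustrative assumptions (the 256 hidden units mirror the --hidden_features flag shown in the usage example).

```python
import torch
import torch.nn as nn

# Stage 1: train the 4-channel CNN encoder with a triplet objective so that
# embeddings of real and fake residuals are pushed apart (margin is illustrative).
triplet_loss = nn.TripletMarginLoss(margin=1.0, p=2)

def contrastive_step(encoder, anchor, positive, negative, optimizer):
    """One training step on an (anchor, positive, negative) residual triplet."""
    optimizer.zero_grad()
    loss = triplet_loss(encoder(anchor), encoder(positive), encoder(negative))
    loss.backward()
    optimizer.step()
    return loss.item()

# Stage 2: freeze the encoder and train a lightweight MLP head on its embeddings.
class ClassifierHead(nn.Module):
    def __init__(self, embed_dim, hidden_features=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_features),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_features, 2),  # real vs. fake logits
        )

    def forward(self, x):
        return self.net(x)
```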
The effectiveness of Depthtective has been validated through experiments on the FaceForensics++ (FF++) benchmark (C23 compression) and the Celeb-DF (v2) dataset. We report the performance of our method implemented with standard CNN backbones (Xception, ResNet50) and the Contrastive Learning variant. The radar charts below illustrate the Accuracy, F1-Score, and Area Under the Curve (AUC) across all manipulation types.
git clone https://github.com/Luigina2001/Depthtective.git
cd Depthtective

Using Conda:

conda env create -f environment.yml
conda activate Depthtective

Using pip:

pip install -r requirements.txt

Depthtective provides a unified script for classifying a video. The script performs frame extraction, depth estimation, residual construction, and prediction.
python main.py \
--video_path path/to/video.mp4 \
--contrastive_encoder_path models/best_contrastive_model.pth \
--classifier_head_path models/best_classifier_head.pth \
--hidden_features 256

Example output:
Video: test_video.mp4
Prediction: Deepfake
Confidence: 98.45%



