Dear authors,
First of all, thank you for sharing the NarVid code and paper — it’s an excellent contribution to the text-video retrieval field.
I have tried to reproduce the experiments following the instructions in your GitHub repository and the paper:
Same ViT-B/32 backbone
Caption generation using LLaVA-7B (v1.5) (my assumed pipeline is sketched after this list)
Same hyperparameters (batch size, epochs, filtering threshold p, etc.)
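For reference, this is roughly how I generate the per-frame captions. The `llava-hf/llava-1.5-7b-hf` checkpoint and the prompt text are my own assumptions, since the exact prompt is part of what I am asking about:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Assumption: the public llava-hf/llava-1.5-7b-hf checkpoint, not a fine-tuned one.
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def caption_frame(image: Image.Image) -> str:
    # LLaVA-1.5 conversation format; the question text is my guess, not necessarily the paper's prompt.
    prompt = "USER: <image>\nDescribe this video frame in one sentence. ASSISTANT:"
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device, torch.float16)
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    text = processor.decode(output_ids[0], skip_special_tokens=True)
    # Keep only the generated answer after the assistant tag.
    return text.split("ASSISTANT:")[-1].strip()
```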
However, on MSR-VTT, I obtained the following results:
| Caption Model | R@1 | R@5 | R@10 |
|---|---|---|---|
| LLaVA-7B (v1.5) | 45.3 | 73.2 | 83.0 |
| Qwen2.5-VL-72B | 49.3 | 75.8 | 84.0 |
| Reported in paper | 51.0 | 76.4 | 85.2 |
May I ask:
Which exact LLaVA version and prompt type (e.g., A/B/C in Supplementary Table 7) were used for the reported results?
Were there any extra optimizations (e.g., caption post-processing, prompt engineering, or fine-tuned LLaVA checkpoints)?
Could there be dataset preprocessing or filtering differences (e.g., frame sampling strategy)? My assumed sampling is sketched below.
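For reference, my frame sampling is a simple uniform strategy; the use of decord and the 12-frame count are my own choices and may well differ from yours:

```python
import numpy as np
from decord import VideoReader, cpu

def sample_frames(video_path: str, num_frames: int = 12) -> np.ndarray:
    """Uniformly sample num_frames RGB frames across the whole video."""
    vr = VideoReader(video_path, ctx=cpu(0))
    # Evenly spaced indices over the full clip length.
    indices = np.linspace(0, len(vr) - 1, num_frames).round().astype(int)
    return vr.get_batch(indices).asnumpy()  # shape: (num_frames, H, W, 3)
```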
Thank you again for your valuable research and open-source implementation!
Looking forward to your reply.
Best regards,
Ruizhi Zhao