
Reproduction gap on MSR-VTT: LLaVA-7B (v1.5) yields R@1=45.3 vs. paper’s 51.0 #2

@zhaoruizhi


Dear authors,

First of all, thank you for sharing the NarVid code and paper — it’s an excellent contribution to the text-video retrieval field.
I attempted to reproduce the experiments following the instructions in your GitHub repository and the paper:

- Same ViT-B/32 backbone
- Caption generation using LLaVA-7B (v1.5)
- Same hyperparameters (batch size, epochs, filtering threshold p, etc.)

However, on MSR-VTT, I obtained the following results:

| Caption Model | R@1 | R@5 | R@10 |
|---|---|---|---|
| LLaVA-7B (v1.5) | 45.3 | 73.2 | 83.0 |
| Qwen2.5-VL-72B | 49.3 | 75.8 | 84.0 |
| Reported in paper | 51.0 | 76.4 | 85.2 |
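For reference, this is how I computed the Recall@K numbers above — a minimal sketch assuming a precomputed text-to-video similarity matrix where query `i` matches video `i` (the `sim` name and the toy values are mine, not from the NarVid code):

```python
import numpy as np

def recall_at_k(sim: np.ndarray, ks=(1, 5, 10)) -> dict:
    """Recall@K for text-to-video retrieval.

    sim[i, j] is the similarity between text query i and video j;
    the ground-truth match for query i is assumed to be video i.
    """
    n = sim.shape[0]
    # Sort candidates by descending similarity, then find the rank
    # (0 = best) of the ground-truth video for each query.
    order = np.argsort(-sim, axis=1)
    ranks = np.argmax(order == np.arange(n)[:, None], axis=1)
    return {f"R@{k}": float(100.0 * np.mean(ranks < k)) for k in ks}

# Toy example: diagonal is always the best match.
sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.8, 0.1],
                [0.2, 0.4, 0.7]])
print(recall_at_k(sim))  # → {'R@1': 100.0, 'R@5': 100.0, 'R@10': 100.0}
```

If your evaluation handles multiple ground-truth captions per video differently (e.g., best-of-many), that alone could shift R@1 by a point or two.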

May I ask:

1. Which exact LLaVA version and prompt type (e.g., A/B/C in Supplementary Table 7) were used for the reported results?
2. Were there any extra optimizations (e.g., caption post-processing, prompt engineering, or fine-tuned LLaVA checkpoints)?
3. Could there be dataset preprocessing or filtering differences (e.g., frame sampling strategy)?
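On the last point, for what it's worth, my frame sampling was plain uniform sampling over the decoded video — a sketch of what I used (the segment-midpoint strategy and the 12-frame count are my choices, not taken from your config):

```python
def uniform_frame_indices(num_total: int, num_sample: int) -> list[int]:
    """Pick num_sample frame indices spread evenly over a video.

    Takes the midpoint of each of num_sample equal segments, a common
    strategy in CLIP-based retrieval pipelines.
    """
    if num_total <= num_sample:
        # Short video: keep every frame (callers typically pad afterwards).
        return list(range(num_total))
    step = num_total / num_sample
    return [int(step * (i + 0.5)) for i in range(num_sample)]

print(uniform_frame_indices(300, 12))
# → [12, 37, 62, 87, 112, 137, 162, 187, 212, 237, 262, 287]
```

If the reported results instead sample the first frame of each segment, or sample at a fixed FPS before splitting, the captions (and thus retrieval) could differ noticeably.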

Thank you again for your valuable research and open-sourced implementation!
Looking forward to your reply.

Best regards,
Ruizhi Zhao
