Hi, thanks for sharing the code!
I have some questions about how you construct the clip-caption pairs from one video.
1. What if one caption crosses multiple clips?
In your paper, Figure 2 shows the clip-caption pairs. The caption "two stiches on two and we'll slip stitch" corresponds to two clips, as your figure shows. Did you segment the video into shots and assign each shot its nearest caption? (Another way would be to segment the video by captions and, for each caption, find its nearest clip. I don't think you used this approach, since then one caption could not match two clips.)
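To make the first strategy concrete, here is a minimal sketch of what I mean by "assign each shot its nearest caption". All names, timestamps, and the midpoint-distance criterion are my own assumptions for illustration, not your actual implementation:

```python
# Hypothetical sketch: clips and captions are (start_sec, end_sec) intervals;
# each clip is paired with the caption whose temporal midpoint is nearest.

def midpoint(interval):
    start, end = interval
    return (start + end) / 2.0

def assign_nearest_caption(clips, captions):
    """For each clip, pick the temporally nearest caption.
    Under this scheme one caption can be paired with several clips."""
    pairs = []
    for clip in clips:
        nearest = min(captions,
                      key=lambda cap: abs(midpoint(cap[:2]) - midpoint(clip)))
        pairs.append((clip, nearest[2]))
    return pairs

# Example: two adjacent shots, one long caption spanning both of them.
clips = [(0.0, 4.0), (4.0, 8.0)]
captions = [(0.0, 8.0, "two stiches on two and we'll slip stitch"),
            (8.0, 12.0, "next caption")]
pairs = assign_nearest_caption(clips, captions)
print(pairs)  # both clips map to the first caption
```

In this toy example both shots get the same caption, which would explain how one caption can correspond to two clips in your Figure 2.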
2. Did you use all the clip-caption pairs within one video?
Since one video might contain lots of clip-caption pairs (suppose a video contains 1000 of them), did you use all 1000 pairs in the HowTo100M dataset? Is there any selection step applied to those pairs?
I would appreciate your reply. Thanks.