Used an encoder-decoder model with LSTM units to encode the temporal sequence of video frames and generate a fixed-length caption for each video. A pre-trained image-recognition model was used to extract features from the frames, and captions were embedded using skip-thought vectors for matching at query time.
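The encoder-decoder pipeline described above can be sketched roughly as follows. This is a minimal illustration in PyTorch, not the project's actual implementation: the feature dimension (2048, typical of a pre-trained CNN), hidden size, vocabulary size, and caption length are all assumed values, and the skip-thought embedding step is omitted.

```python
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    """Hypothetical sketch: an LSTM encoder summarizes per-frame CNN
    features; an LSTM decoder conditioned on the encoder's final state
    emits a fixed-length caption token sequence."""

    def __init__(self, feat_dim=2048, hidden=512, vocab=10000):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab, hidden)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, n_frames, feat_dim); captions: (batch, cap_len)
        _, state = self.encoder(frame_feats)      # summarize the video
        dec_in = self.embed(captions)             # teacher-forced caption tokens
        dec_out, _ = self.decoder(dec_in, state)  # decode conditioned on video
        return self.out(dec_out)                  # (batch, cap_len, vocab) logits

model = VideoCaptioner()
feats = torch.randn(2, 30, 2048)          # e.g. 30 frames of CNN features
caps = torch.randint(0, 10000, (2, 20))   # fixed-length (20-token) captions
logits = model(feats, caps)
print(tuple(logits.shape))
```

At training time the logits would be scored against the reference captions with cross-entropy; at inference the decoder would instead be unrolled token by token from its own predictions.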
For more information, please visit https://ravibansal.github.io/portfolio/portfolio-1/