Repository for Applied Deep Learning Course at TU Wien
In recent years, Computer Vision and Natural Language Processing have taken a huge step forward thanks to major advances in Deep Learning and the large volumes of data available. Tasks such as image classification, object detection, caption generation and sentiment analysis, which were once thought intractable, can now be solved by well-designed models.
Having always been fascinated by these two fields, I decided to focus this project on Image Captioning, a topic that lies at the intersection of Computer Vision and Natural Language Processing.
Show, Attend and Tell by Kelvin Xu et al., published in 2015, served as a good introduction to this topic. It proposes a CNN-RNN (Convolutional Neural Network - Recurrent Neural Network) model for generating image captions on the Flickr30k and COCO datasets. Another interesting paper I read was VLP by Luowei Zhou et al., which proposes a unified encoder-decoder model that is pre-trained on another dataset, fine-tuned, and then used to generate captions for the COCO and Flickr30k datasets.
According to my research, CNN-RNN model combinations are used quite often for this kind of task. Therefore, I will initially try to build a similar model for my project.
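To make this concrete, here is a minimal sketch of such a CNN-RNN combination in PyTorch: a pretrained CNN encodes the image into feature maps and an LSTM decoder generates the caption token by token. All class names, dimensions and hyperparameters are illustrative placeholders, not the final architecture of this project, and the attention mechanism of Show, Attend and Tell is omitted for brevity.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CNNEncoder(nn.Module):
    """Extracts spatial image features with a pretrained CNN (ResNet-50 here)."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet50(pretrained=True)
        # Drop the final pooling and classification layers, keep the feature maps.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])

    def forward(self, images):
        # (batch, 3, H, W) -> (batch, 2048, H/32, W/32)
        return self.backbone(images)

class RNNDecoder(nn.Module):
    """Generates a caption word by word with an LSTM conditioned on the image features."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, feature_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.init_h = nn.Linear(feature_dim, hidden_dim)  # image features -> initial hidden state
        self.init_c = nn.Linear(feature_dim, hidden_dim)  # image features -> initial cell state
        self.lstm = nn.LSTMCell(embed_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, features, captions):
        # Mean-pool the spatial features to initialise the LSTM state.
        pooled = features.flatten(2).mean(dim=2)          # (batch, feature_dim)
        h, c = self.init_h(pooled), self.init_c(pooled)
        outputs = []
        for t in range(captions.size(1)):                 # teacher forcing over the caption tokens
            h, c = self.lstm(self.embed(captions[:, t]), (h, c))
            outputs.append(self.fc(h))
        return torch.stack(outputs, dim=1)                # (batch, seq_len, vocab_size)
```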
At first I wanted to use the COCO dataset for my project, but I decided against it because of its sheer size: over 1.5 million captions describing over 330,000 images. Even though I am going to use the Pro version of Google Colab for this project, computation might still be a bottleneck when training neural networks. Therefore, I decided to use the Flickr30k dataset instead. It contains around 31,000 images collected from Flickr, each paired with 5 reference sentences provided by human annotators.
- Dataset Collection: I will use an already available dataset.
- Design and build of a model: I believe this will be the most challenging part of the project. It will require a lot of trial and error on my side, and I think it might take up to 2 weeks.
- Training the model: As I plan to use the GPU of Google Colab, I hope it won't take more than 3 days, considering that my model might not run on the first try.
- Fine-tuning: This might also take up to 1 week, as I have to do a lot of research and consider methods to improve my model.
- Application: This might take up to 1 week. I will probably build a website where I can upload a picture and get back a caption describing it.
- Report and Presentation: 2 days.
As mentioned above, Show, Attend and Tell was one of the main papers that introduced me to the world of image captioning, and I relied on it heavily for this project. Since it was my first time working with RNNs and attention-based models, my code was strongly inspired by this paper as well as by the PyTorch tutorial a-PyTorch-Tutorial-to-Image-Captioning. Even though the paper no longer represents the state of the art, its many mentions and implementations in various projects made its content easier to understand than that of other papers.
To evaluate my models I used the BLEU metric, which is a de facto standard for image captioning architectures. The table below summarizes the results of my implementations on the test set (a sketch of how such scores can be computed follows the table):
| Implementation | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 |
|---|---|---|---|---|
| Model I (w/o fine-tuning Encoder) | 54.46 | 34.82 | 21.12 | 12.29 |
| Model II (w/ fine-tuning Encoder) | 55.78 | 35.71 | 22.13 | 13.82 |
| Model III (Model II, trained w/ changed params for 3 more epochs) | 55.93 | 34.82 | 21.12 | 13.71 |
| Model IV (sorted captions) | 64.86 | 41.61 | 25.43 | 15.71 |
| Model V (sorted captions & fine-tuning encoder) | 65.27 | 42.49 | 26.39 | 16.43 |
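For reference, corpus-level BLEU scores of this kind can be computed with NLTK along the following lines; the tokenised references and hypothesis below are toy data for illustration, not my actual evaluation code.

```python
# Minimal sketch of corpus-level BLEU with NLTK: `references` holds the
# tokenised ground-truth captions per image (Flickr30k has 5 per image),
# `hypotheses` holds one generated caption per image.
from nltk.translate.bleu_score import corpus_bleu

references = [[
    ["a", "dog", "runs", "on", "the", "beach"],
    ["a", "brown", "dog", "running", "along", "the", "shore"],
]]
hypotheses = [["a", "brown", "dog", "runs", "on", "the", "beach"]]

bleu1 = corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0))
bleu4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25))
print(f"BLEU-1: {bleu1:.4f}, BLEU-4: {bleu4:.4f}")
```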
For Model I, I implemented the model with essentially the same parameters as in the paper and did not fine-tune the encoder; nevertheless, my results differed from those reported. One reason could be that I used ResNet-50 as the encoder, while the paper uses VGGnet. To improve the results, I then decided to fine-tune the encoder as well. As seen in the table above, this led to a slight improvement. What was interesting during the training of Model II was that it improved for the first few epochs and then stagnated, so despite early stopping being triggered, I decided to continue training it for a few more epochs. However, I changed some parameters, such as:
- Decreased the regularization parameter alpha_c (i.e., decreased the strength of regularization of the model)
- Decreased lr_decay_factor, which might lead to the model converging more slowly
Even though the BLEU-4 of Model III seemed to increase during training, this did not hold when evaluating the model on the test set.
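Purely as an illustration of the encoder fine-tuning switch mentioned above, the sketch below freezes or unfreezes a ResNet-50 backbone and gives it its own optimizer; the module definitions and learning rates are placeholders, not my actual training code.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Illustrative stand-ins: a ResNet-50 feature extractor as the encoder and a
# bare LSTM cell in place of the attention decoder.
encoder = nn.Sequential(*list(models.resnet50(pretrained=True).children())[:-2])
decoder = nn.LSTMCell(256, 512)

def set_fine_tuning(module, fine_tune):
    """Freeze (fine_tune=False) or unfreeze (fine_tune=True) all parameters."""
    for param in module.parameters():
        param.requires_grad = fine_tune

# Model I keeps the encoder frozen; Models II and V unfreeze it and train it
# with its own (typically smaller) learning rate. The values here are illustrative.
set_fine_tuning(encoder, fine_tune=True)
encoder_optimizer = torch.optim.Adam(
    (p for p in encoder.parameters() if p.requires_grad), lr=1e-4)
decoder_optimizer = torch.optim.Adam(decoder.parameters(), lr=4e-4)
```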
Soon after fitting Model III, I realized that something could be slightly wrong with my implementation. Initially I had not sorted the captions fed to the decoder by their length, and I realized that this might really matter. Sorting the captions aligns them with each other and lets the model focus on the meaningful tokens rather than on the pad tokens. Therefore, I implemented this change in Model IV, and as a result the BLEU-4 metric increased. Last but not least, I trained the model once more, this time also fine-tuning the encoder; its results are shown in the last row of the table above.
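The sketch below illustrates this with a toy batch: the captions are sorted by length and packed with PyTorch's pack_padded_sequence, so that the pad positions never reach the loss. The tensors are made up for illustration and this is not my exact data pipeline.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

# Toy batch of padded caption indices (0 stands for the <pad> token).
captions = torch.tensor([[5, 2, 9, 0, 0],
                         [7, 3, 1, 4, 2],
                         [8, 6, 0, 0, 0]])
lengths = torch.tensor([3, 5, 2])

# Sort the batch by caption length, descending, as done for Models IV and V.
lengths, sort_idx = lengths.sort(descending=True)
captions = captions[sort_idx]

# Packing keeps only the real tokens, so <pad> positions never enter the loss.
packed = pack_padded_sequence(captions, lengths, batch_first=True)
print(packed.data)         # the real tokens, in time-major order
print(packed.batch_sizes)  # how many sequences are still active at each step
```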
The table below summarizes the results of two papers I have mentioned above (VLP is state-of-the-art) and my best model:
| Implementation | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 |
|---|---|---|---|---|
| Show, Attend and Tell | 66.7 | 43.4 | 28.8 | 19.1 |
| VLP | - | - | - | 31.1 |
| My Best Implementation | 65.27 | 42.49 | 26.39 | 16.43 |
Unfortunately, while I was hoping to achieve a BLEU-4 score of 20, the best I could reach was 16.43.
- `loader.py` - data preparation of my dataset for the model
- `vocab.py` - creates the vocabulary of my dataset with size 10000 (a rough sketch of this idea follows the list)
- `model.py` - contains the Encoder, Attention and Decoder classes
- `model_utils.py` - contains functions for training, evaluating, and saving the best model
- `train&evaluate_model***.ipynb` - the notebooks 1 to 5 are used for training the different models and evaluating them on the test set
- `inference.py` - the beam_search function for generating captions
- `inference_notebook.ipynb` - notebook that displays the models' performance using the beam_search function
- `model_files` - files generated while training the models, which contain information on epochs, train loss, valid loss and BLEU scores
- Due to their large size, the trained models are not uploaded
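A rough sketch of the vocabulary-building idea behind `vocab.py`: the special token names and the helper below are illustrative placeholders, not the actual implementation; only the cap of 10000 words comes from my setup.

```python
from collections import Counter

SPECIAL_TOKENS = ["<pad>", "<start>", "<end>", "<unk>"]

def build_vocab(tokenised_captions, max_size=10000):
    """Map the most frequent words to indices; everything else maps to <unk>."""
    counter = Counter(word for caption in tokenised_captions for word in caption)
    most_common = [w for w, _ in counter.most_common(max_size - len(SPECIAL_TOKENS))]
    return {w: i for i, w in enumerate(SPECIAL_TOKENS + most_common)}

# Toy usage with two tokenised captions.
captions = [["a", "dog", "runs"], ["a", "cat", "sleeps"]]
vocab = build_vocab(captions)
print(vocab["a"], vocab.get("zebra", vocab["<unk>"]))  # unknown words fall back to <unk>
```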
- Dataset Collection - I used an already available dataset
- Data Preparation - took me up to 1.5 weeks to explore my data and write the `loader.py` and `vocab.py` files.
- Design and build of a model - As expected, this was the most challenging part, as it was my first time creating a model from scratch. It took me 2.5 weeks.
- Training the model - 1 week
- Fine-tuning - 1 week
For this part of the assignment, a small demo application was implemented to show the solution of the chosen task. I created a small web application using the Streamlit library, which can be found in the app.py file. To get the application running, the user needs to run the following command: `streamlit run app.py`. Unfortunately, since the model is larger than 25 MB, it cannot be uploaded as a release to this GitHub repo. Therefore, I provide a download link for everyone who wants to give it a try: https://drive.google.com/file/d/1-06OhCe1OSY69aLAs7tvbbVBy2jtCCPZ/view?usp=sharing
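The rough structure of such a Streamlit app is sketched below; the commented-out helpers `load_model` and `generate_caption` are hypothetical placeholders standing in for the actual model loading and beam search code, so this is not the literal content of app.py.

```python
# Rough sketch of a Streamlit captioning demo; `load_model` and
# `generate_caption` are hypothetical placeholders, not the real code.
import streamlit as st
from PIL import Image

st.title("Image Captioning Demo")

uploaded = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png"])
if uploaded is not None:
    image = Image.open(uploaded).convert("RGB")
    st.image(image, caption="Uploaded image")

    # model = load_model("path/to/downloaded_model.pth")  # model from the Drive link
    # caption = generate_caption(model, image)            # e.g. beam search as in inference.py
    caption = "a placeholder caption"
    st.write(f"**Generated caption:** {caption}")
```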
As part of Assignment III, the project report and presentation have also been added to the repository.
@article{xu2015show,
  title={Show, Attend and Tell: Neural Image Caption Generation with Visual Attention},
  author={Kelvin Xu and Jimmy Ba and Ryan Kiros and Kyunghyun Cho and Aaron Courville and Ruslan Salakhutdinov and Richard Zemel and Yoshua Bengio},
  journal={arXiv preprint arXiv:1502.03044},
  year={2015}
}
@article{zhou2019vlp,
  title={Unified Vision-Language Pre-Training for Image Captioning and VQA},
  author={Luowei Zhou and Hamid Palangi and Lei Zhang and Houdong Hu and Jason J. Corso and Jianfeng Gao},
  journal={arXiv preprint arXiv:1909.11059},
  year={2019}
}