jmaxrdgz/YOLO-v1

YOLO-from-paper

Annotated code for the full YOLO computer vision model, implemented from the research paper.

1. Implementing the model

[Figure: the YOLO network architecture]

The model can be split into two parts:

  • the darknet (not sure why it is called that), consisting of the first 24 convolutional layers.
  • the 2 fully-connected layers.

The darknet implementation is quite straightforward, as it is fully specified in the figure above. One nice trick I discovered while coding this project is describing the model's layers in a list and instantiating them from it. Props to Aladdin Persson for showcasing it in his YOLO implementation video. The list uses tuples for layer arguments, nested lists for repeated blocks of layers, and single characters for max-pooling layers. A _create_layers function then acts as a parser, dispatching on the type of each entry. This makes it much easier to create YOLO architecture variants, such as tiny-YOLO for embedded systems or the reduced model used for pretraining.
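
To make the idea concrete, here is a minimal sketch of such a config list and parser, in plain Python with no deep learning framework. It is not the repository's actual `_create_layers`: the names `ARCHITECTURE`, `create_layers` and `output_size` are hypothetical, the layer sizes are read off the paper's architecture figure, and the padding values are my assumption (chosen so that stride-1 convolutions keep the spatial size). In a real model each `("conv", ...)` entry would presumably become a convolution followed by a Leaky ReLU.

```python
# Entry types in the config:
#   tuple -> one conv layer: (kernel_size, out_channels, stride, padding)
#   "M"   -> one 2x2 max-pooling layer with stride 2
#   list  -> [conv_a, conv_b, n]: the pair (conv_a, conv_b) repeated n times
ARCHITECTURE = [
    (7, 64, 2, 3),
    "M",
    (3, 192, 1, 1),
    "M",
    (1, 128, 1, 0),
    (3, 256, 1, 1),
    (1, 256, 1, 0),
    (3, 512, 1, 1),
    "M",
    [(1, 256, 1, 0), (3, 512, 1, 1), 4],
    (1, 512, 1, 0),
    (3, 1024, 1, 1),
    "M",
    [(1, 512, 1, 0), (3, 1024, 1, 1), 2],
    (3, 1024, 1, 1),
    (3, 1024, 2, 1),
    (3, 1024, 1, 1),
    (3, 1024, 1, 1),
]


def create_layers(config, in_channels=3):
    """Expand the config into a flat list of layer descriptions."""
    layers = []
    for item in config:
        if isinstance(item, str):
            layers.append(("maxpool", 2, 2))  # (name, kernel, stride)
            continue
        if isinstance(item, tuple):
            convs = [item]
        elif isinstance(item, list):
            *block, repeats = item
            convs = block * repeats  # repeat the whole block n times
        else:
            raise TypeError(f"unsupported config entry: {item!r}")
        for kernel, out_channels, stride, padding in convs:
            layers.append(("conv", in_channels, out_channels, kernel, stride, padding))
            in_channels = out_channels  # next layer consumes this output
    return layers


def output_size(layers, size=448):
    """Spatial size (height = width) after running through all layers."""
    for layer in layers:
        if layer[0] == "conv":
            _, _, _, kernel, stride, padding = layer
            size = (size + 2 * padding - kernel) // stride + 1
        else:
            _, kernel, stride = layer
            size = (size - kernel) // stride + 1
    return size


layers = create_layers(ARCHITECTURE)
num_convs = sum(1 for layer in layers if layer[0] == "conv")
print(num_convs, output_size(layers), layers[-1][2])
```

With this layout, a variant such as a smaller network only needs a different list; the parser stays the same.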

The fully-connected part requires a bit of reading.

The image is divided into an S x S grid, and each grid cell predicts B bounding boxes, confidence for those boxes, and C class probabilities, encoded as an S x S x (B x 5 + C) tensor.
[...]
For evaluating YOLO on PASCAL VOC, we use S = 7, B = 2. PASCAL VOC has 20 labelled classes so C = 20. Our final prediction is a 7 × 7 × 30 tensor.

The prediction vector for each cell has B x 5 + C values. The "5" corresponds to the four coordinates of a bounding box plus its confidence score: c, x, y, w, h.
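
The arithmetic above can be checked with a few lines of Python. This is only a sketch: the helper `split_cell` is hypothetical, and the layout it assumes (the B box groups first, each ordered c, x, y, w, h, followed by the C class scores) is my assumption, not something the paper fixes.

```python
S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (PASCAL VOC values)

values_per_cell = B * 5 + C                # 5 = c, x, y, w, h for each box
output_shape = (S, S, values_per_cell)     # shape of the prediction tensor
flat_size = S * S * values_per_cell        # size of the last fully-connected output


def split_cell(cell, num_boxes=B, num_classes=C):
    """Split one cell's flat prediction into boxes and class scores.

    Assumed layout: num_boxes groups of (c, x, y, w, h), then num_classes scores.
    """
    if len(cell) != num_boxes * 5 + num_classes:
        raise ValueError("unexpected cell length")
    boxes = [tuple(cell[i * 5:(i + 1) * 5]) for i in range(num_boxes)]
    classes = list(cell[num_boxes * 5:])
    return boxes, classes


# Dummy cell whose values are just their own indices, to show the slicing.
boxes, classes = split_cell(list(range(values_per_cell)))
print(output_shape, flat_size, boxes, len(classes))
```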

A dropout layer with rate = .5 after the first connected layer prevents co-adaptation between layers.

We therefore add a dropout layer between the two fully-connected layers.

Leaky ReLU is used for all activations except the output layer, which is linear because detection is treated as a regression problem.
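
For reference, the two activations can be written out as below. This is a plain-Python sketch on scalars rather than the repository's code; the function names are hypothetical, and the negative slope of 0.1 is the value given in the paper.

```python
def leaky_relu(x, negative_slope=0.1):
    """Identity for positive inputs, a small linear slope for the rest."""
    return x if x > 0 else negative_slope * x


def linear(x):
    """Identity activation, used on the output layer."""
    return x


print(leaky_relu(2.0), leaky_relu(-2.0), linear(-2.0))
```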

Note that the model is different during pretraining. This is described in more detail in the "training" folder.

2. Loss function

This was the hardest part of the project for me, but it ended up being an interesting challenge. YOLO's loss function is quite extensive: understanding such a large mathematical expression, translating it into code and making it work was "fun".
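
To give an idea of the terms involved, here is a simplified sketch of the loss for a single grid cell, in plain Python. It is not the repository's implementation. The weights 5 and 0.5 are the paper's values. The assumptions are mine: the names `iou` and `cell_loss` are hypothetical, all boxes are taken to be (x, y, w, h) in the same coordinate frame, at most one object falls in a cell, and the confidence target of the responsible box is its IoU with the ground truth.

```python
import math

LAMBDA_COORD = 5.0  # weight on box coordinate and size errors
LAMBDA_NOOBJ = 0.5  # weight on confidence errors for boxes without an object


def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes (centre and size)."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def cell_loss(pred_boxes, pred_classes, target_box, target_classes):
    """Sum-squared-error loss for one cell.

    pred_boxes: list of B predictions, each (c, x, y, w, h).
    target_box: ground-truth (x, y, w, h), or None if the cell is empty.
    """
    if target_box is None:
        # Empty cell: only the confidences are penalised (target is 0).
        return LAMBDA_NOOBJ * sum(box[0] ** 2 for box in pred_boxes)

    # The predictor with the highest IoU is "responsible" for the object.
    ious = [iou(box[1:], target_box) for box in pred_boxes]
    best = max(range(len(pred_boxes)), key=ious.__getitem__)
    c, x, y, w, h = pred_boxes[best]
    tx, ty, tw, th = target_box

    centre = (x - tx) ** 2 + (y - ty) ** 2
    # Square roots make errors on small boxes count more than on large ones.
    # Sizes are clamped at 0 so that sqrt is always defined.
    size = (math.sqrt(max(w, 0.0)) - math.sqrt(tw)) ** 2 \
        + (math.sqrt(max(h, 0.0)) - math.sqrt(th)) ** 2
    obj_conf = (c - ious[best]) ** 2
    noobj_conf = sum(box[0] ** 2 for i, box in enumerate(pred_boxes) if i != best)
    classes = sum((p - t) ** 2 for p, t in zip(pred_classes, target_classes))

    return (LAMBDA_COORD * (centre + size) + obj_conf
            + LAMBDA_NOOBJ * noobj_conf + classes)
```

The full loss is the sum of this quantity over all S x S cells. A vectorised version over a whole batch follows the same five terms.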

3. Training

At the moment I don't have access to the hardware necessary to train YOLO as it was done for the original paper (published in 2016). Detailed steps from the paper are described in the training folder README. Briefly, training was done in several stages. First, the feature extractor, consisting of the first 20 convolutional layers followed by an average-pooling layer and a fully-connected layer, was trained on ImageNet at 224x224 resolution. The weights were then transferred to the full model, which was trained on Pascal-VOC at 448x448.

The feature extractor training lasted one week on a Titan X GPU. A quicker solution, which I aim to implement and which is described in this blog post, is to use a pretrained ResNet-50 as the feature extractor and train the YOLO classifier head on Pascal-VOC.

This approach is studied in the training folder. Nonetheless, the implementation of the "clean" YOLO (true to the original) was checked by overfitting a 100-example sample from the Pascal-VOC dataset.
