A small, computationally efficient image captioning model with fast token processing. An image encoder and a text encoder are fused with a cross-attention module [2], and captions are generated autoregressively.
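A minimal sketch of the fusion step, built on PyTorch's standard `nn.MultiheadAttention`; the class name, dimensions, and residual/norm layout here are illustrative assumptions, not the repository's exact module:

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative sketch: text tokens attend over visual features."""
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, vis_features):
        # Query: text tokens; Key/Value: visual features from the image encoder.
        fused, _ = self.attn(query=text_tokens, key=vis_features, value=vis_features)
        # Residual connection plus normalization (an assumption about the layout).
        return self.norm(text_tokens + fused)
```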
This repository provides utilities to export the trained captioning model to ONNX and to run it:
- `model2onnx.py`: converts the model to ONNX (the opset is selectable; see `--help` for arguments). A minimal export sketch follows this list.
- `run_onnx.py`: runs the exported model on a single image (see arguments).
- `run_onnx_wcam.py`: runs the model on a webcam stream (the camera ID is selectable via command-line arguments).
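At its core the export is a single `torch.onnx.export` call; a minimal sketch, assuming the model's forward pass takes an image tensor and a token-ID tensor (the actual signature in `model2onnx.py` may differ, and `test_model` refers to the loaded model from the snippet further down):

```python
import torch

# Dummy inputs used to trace the forward pass; shapes are assumptions.
dummy_image = torch.randn(1, 3, 224, 224)            # assumed input resolution
dummy_tokens = torch.zeros(1, 50, dtype=torch.long)  # assumed token input shape

torch.onnx.export(
    test_model,                        # trained ImageCaptioner (see loading snippet below)
    (dummy_image, dummy_tokens),
    "sic_model.onnx",
    opset_version=14,                  # corresponds to the script's selectable opset
    input_names=["image", "tokens"],
    output_names=["logits"],
)
```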
The model is trained on the COCO captioning dataset [1].
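For reference, the images and captions can be loaded with torchvision's built-in wrapper; a minimal sketch, assuming the standard COCO 2017 directory layout (the paths are illustrative):

```python
from torchvision import datasets, transforms

# Illustrative paths; adjust to your local COCO download.
train_set = datasets.CocoCaptions(
    root="./coco/train2017",
    annFile="./coco/annotations/captions_train2017.json",
    transform=transforms.ToTensor(),
)

image, captions = train_set[0]  # one image and its list of reference captions
```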
During training a vocabulary is generated, which is needed to decode the predicted token IDs back into text. To caption an image, call the `generate_caption` function with the model, the image, and the vocabulary; it returns a list of captions, as shown below.
```python
import torch
from torchvision import transforms

# Rebuild the model with the training hyperparameters and load the checkpoint.
test_model = ImageCaptioner(vocab_size=len(voc),
                            dim=768,
                            num_heads=8,
                            num_layers=4,
                            vis_out_dimension=512,
                            vis_hxw_out=49,
                            max_len=MAX_LEN,
                            VisionEncoder=CNNEncoder).to(device)
test_model.load_state_dict(torch.load('./sic_model.pth', map_location=device), strict=False)

# Preprocessing: resize and normalize with ImageNet statistics.
transform_nn = transforms.Compose([
    transforms.Resize((IMG_SIZE, IMG_SIZE)),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))
])

# Generate up to 50 tokens for the given image.
cap = generate_caption(test_model, image, voc, 50, return_att=return_att)
print(cap[0])
```
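Under the hood, generation is autoregressive; a minimal greedy-decoding sketch of what a function like `generate_caption` does. The special-token names and the model's call signature are assumptions, and the repository's implementation may differ (e.g. it also supports returning attention maps):

```python
import torch

@torch.no_grad()
def greedy_caption(model, image, voc, max_len=50):
    """Illustrative greedy decode; the repo's generate_caption may use a different API."""
    model.eval()
    tokens = [voc["<sos>"]]                       # assumed start-of-sequence token
    for _ in range(max_len):
        inp = torch.tensor(tokens).unsqueeze(0)   # (1, seq_len)
        logits = model(image.unsqueeze(0), inp)   # (1, seq_len, vocab_size), assumed signature
        next_tok = logits[0, -1].argmax().item()  # pick the most likely next token
        if next_tok == voc["<eos>"]:              # assumed end-of-sequence token
            break
        tokens.append(next_tok)
    return tokens
```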
| Backbone | FLOPs [G] | Params [M] |
|---|---|---|
| MobileNetV2Encoder | 1.48 | 6.56 |
| CNNEncoder (ResNet-18 based) | 6.41 | 39.31 |
| VitEncoder (Vision Transformer) | 11.46 | 87.40 |
Note: The hybrid approach (CNN + transformer) keeps the model small and fast, but it can miss some long-range scene dependencies. A ViT backbone achieves better accuracy at a higher computational cost.
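Numbers like those in the table can be reproduced with a FLOP counter such as fvcore; a minimal sketch, assuming a 224x224 input (the exact resolution used for the table is not stated here) and that the encoder classes are importable from this repo:

```python
import torch
from fvcore.nn import FlopCountAnalysis, parameter_count

encoder = CNNEncoder()                # any of the backbones in the table
dummy = torch.randn(1, 3, 224, 224)   # assumed input resolution

flops = FlopCountAnalysis(encoder, dummy)
print(f"FLOPs: {flops.total() / 1e9:.2f} G")
print(f"Params: {parameter_count(encoder)[''] / 1e6:.2f} M")
```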
1. Lin et al., "Microsoft COCO: Common Objects in Context", ECCV 2014.
2. Vaswani et al., "Attention Is All You Need", NeurIPS 2017.
Enjoy!


