A PyTorch implementation of 'An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale' [2020]

vit

A PyTorch implementation of An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Installation

git clone https://github.com/andregaio/vit.git
cd vit
conda create -n vit python=3.8
conda activate vit
pip install -r requirements.txt

Models

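| Name     | Patch Size | Accuracy (%) | Params (M) | GMACs |
|----------|------------|--------------|------------|-------|
| ViTBase  | 4          | 86.432       | 78         | 5.1   |
| ViTBase  | 8          | 81.151       | 78         | 1.3   |
| ViTLarge | 4          | 82.198       | 277        | 18.0  |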
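
The patch size determines how many tokens the transformer processes per image, which is the main reason the compute cost differs between the two ViTBase rows. The sketch below assumes the 32x32 input resolution stated in the notes and square, non-overlapping patches; `num_patches` is a hypothetical helper, not a function from this repository.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


# Assumed input resolution: 32x32 (CIFAR10), as stated in the notes.
for patch_size in (4, 8):
    print(f"patch size {patch_size}: {num_patches(32, patch_size)} patches")
```

Going from patch size 4 to 8 cuts the patch count by a factor of four (64 to 16), which is in line with the roughly fourfold drop in GMACs reported above. If the model prepends a class token, as the original paper does, the sequence length is one more than the patch count.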

Dataset

CIFAR10

Training

python train.py --model vit_base

Eval

python eval.py --model vit_base --weights weights/checkpoint_00070.pt
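
The notes state that accuracy is the evaluation metric. As a rough illustration of what that means, and not the code in `eval.py`, the sketch below computes top-1 accuracy as a percentage from predicted and true class indices. `top1_accuracy` is a hypothetical helper, and reporting the value as a percentage is an assumption.

```python
from typing import Sequence


def top1_accuracy(predicted: Sequence[int], targets: Sequence[int]) -> float:
    """Percentage of samples whose predicted class equals the target class."""
    if len(predicted) != len(targets):
        raise ValueError("predicted and targets must have the same length")
    if len(targets) == 0:
        raise ValueError("at least one sample is required")
    correct = sum(int(p == t) for p, t in zip(predicted, targets))
    return 100.0 * correct / len(targets)


# Made-up class indices for illustration only; three of four match.
example_accuracy = top1_accuracy([3, 5, 1, 1], [3, 5, 2, 1])
print(example_accuracy)
```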

Inference

python infer.py --model vit_base --weights weights/checkpoint_00070.pt --image assets/cat.png

Notes

  • This implementation is not intended to be an exact replica of the original
  • Classification performance is measured by accuracy
  • The models were trained on CIFAR10
  • The input resolution was changed to 32x32 to match the dataset
  • Training uses Automatic Mixed Precision (AMP) with gradient scaling and autocasting
  • The model architecture code is borrowed from https://github.com/huggingface/pytorch-image-models
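
To illustrate the AMP point above, here is a minimal sketch of one training step with autocasting and gradient scaling in PyTorch. It is not taken from `train.py`: the linear stand-in model, the random batch, the optimizer settings, and the `train_step` helper are all assumptions made for the example. AMP is enabled only when a CUDA device is available, so the same code also runs on CPU in full precision.

```python
import torch
import torch.nn as nn


def train_step(model, images, labels, optimizer, scaler, device_type):
    """One optimization step with autocasting and gradient scaling."""
    optimizer.zero_grad(set_to_none=True)
    # Run the forward pass in mixed precision when AMP is enabled.
    with torch.autocast(device_type=device_type, enabled=scaler.is_enabled()):
        logits = model(images)
        loss = nn.functional.cross_entropy(logits, labels)
    # Scale the loss before backward so small fp16 gradients do not underflow.
    scaler.scale(loss).backward()
    # step() unscales gradients and skips the update if they contain inf/NaN.
    scaler.step(optimizer)
    scaler.update()
    return loss.item()


use_cuda = torch.cuda.is_available()
device_type = "cuda" if use_cuda else "cpu"

# Stand-in classifier for 32x32 RGB inputs and 10 classes (assumption, not the ViT).
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).to(device_type)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

# Random batch standing in for CIFAR10 images and labels.
images = torch.randn(8, 3, 32, 32, device=device_type)
labels = torch.randint(0, 10, (8,), device=device_type)

loss_value = train_step(model, images, labels, optimizer, scaler, device_type)
print(f"loss: {loss_value:.4f}")
```

When the scaler is disabled, `scale` returns the loss unchanged and `step` simply calls `optimizer.step()`, so the function behaves like an ordinary training step.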
