vit

A PyTorch implementation of the Vision Transformer (ViT) from the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale".

Installation

git clone https://github.com/andregaio/vit.git
cd vit
conda create -n vit python=3.8
conda activate vit
pip install -r requirements.txt

Models

Name	Patch Size	Accuracy (%)	Params (M)	GMACs
ViTBase	4	86.432	78	5.1
ViTBase	8	81.151	78	1.3
ViTLarge	4	82.198	277	18.0
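The patch size determines how many tokens the transformer processes. The sketch below works out the sequence length for each patch size in the table. It assumes the 32x32 input resolution mentioned in the Notes, non-overlapping square patches, and one extra class token (the standard ViT layout, not confirmed against this repository's code). `num_tokens` is a hypothetical helper, not part of this repository.

```python
def num_tokens(image_size: int, patch_size: int, class_token: bool = True) -> int:
    """Sequence length for a square image split into non-overlapping square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    # One token per patch, plus an optional class token (assumed here).
    return patches_per_side ** 2 + (1 if class_token else 0)


for patch_size in (4, 8):
    print(patch_size, num_tokens(32, patch_size))
```

Under these assumptions, patch size 4 gives 64 patches (65 tokens) and patch size 8 gives 16 patches (17 tokens), which is consistent with the roughly 4x lower GMACs listed for ViTBase with patch size 8.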

Dataset

CIFAR10

Training

python train.py --model vit_base

Eval

python eval.py --model vit_base --weights weights/checkpoint_00070.pt
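The Notes state that accuracy is the evaluation metric. Here is a minimal sketch of top-1 accuracy in plain Python, assuming the model produces one list of class scores per image. `top1_accuracy` is a hypothetical helper, and `eval.py` may compute the metric differently.

```python
from typing import Sequence


def top1_accuracy(logits: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Percentage of samples whose highest-scoring class equals the label."""
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("at least one sample is required")
    correct = 0
    for scores, label in zip(logits, labels):
        # Index of the largest score is the predicted class.
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == label)
    return 100.0 * correct / len(labels)


# Toy example with 3 classes and 4 samples (made-up numbers).
example_logits = [[2.0, 0.1, 0.3], [0.2, 1.5, 0.1], [0.9, 0.8, 0.7], [0.1, 0.2, 3.0]]
example_labels = [0, 1, 2, 2]
print(top1_accuracy(example_logits, example_labels))  # 75.0
```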

Inference

python infer.py --model vit_base --weights weights/checkpoint_00070.pt --image assets/cat.png
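The output format of `infer.py` is not documented here. As an illustration only, assuming the model returns 10 raw scores ordered by the standard CIFAR10 class indices, the predicted label and its probability could be derived as below. `predict_label` is a hypothetical helper, not a function from this repository.

```python
import math
from typing import Sequence, Tuple

# Standard CIFAR10 class order (indices 0-9).
CIFAR10_CLASSES = ("airplane", "automobile", "bird", "cat", "deer",
                   "dog", "frog", "horse", "ship", "truck")


def predict_label(logits: Sequence[float]) -> Tuple[str, float]:
    """Return the most likely class name and its softmax probability."""
    if len(logits) != len(CIFAR10_CLASSES):
        raise ValueError("expected one score per CIFAR10 class")
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    best = max(range(len(exps)), key=exps.__getitem__)
    return CIFAR10_CLASSES[best], exps[best] / total


# Made-up scores where index 3 ("cat") is the largest.
name, prob = predict_label([0.1, 0.0, 0.2, 4.0, 0.3, 1.0, 0.0, 0.1, 0.0, 0.2])
print(name, round(prob, 3))
```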

Results

Notes

This implementation is not intended to be an exact replica of the original.
Classification performance is evaluated using accuracy.
The models were trained on CIFAR10.
The input resolution was changed to 32x32 to match the dataset.
Training uses Automatic Mixed Precision (AMP) with gradient scaling and autocasting.
The model architecture code is borrowed from timm (https://github.com/huggingface/pytorch-image-models).
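For readers unfamiliar with AMP, a single training step with autocasting and gradient scaling might look like the sketch below. This is not taken from `train.py`: the tiny linear model, the SGD optimizer, the random batch, and the `train_step` helper are all stand-ins chosen so that the snippet runs on its own. On a machine without CUDA, both autocast and the scaler are disabled and the step falls back to ordinary full-precision training.

```python
import torch
from torch import nn


def train_step(model, images, labels, optimizer, scaler, device):
    """One optimisation step using autocast and gradient scaling."""
    use_amp = device.type == "cuda"
    optimizer.zero_grad(set_to_none=True)
    # Run the forward pass and loss in reduced precision where it is safe.
    with torch.autocast(device_type=device.type, enabled=use_amp):
        logits = model(images)
        loss = nn.functional.cross_entropy(logits, labels)
    # Scale the loss so small float16 gradients do not underflow.
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # unscales gradients, skips the step on inf/NaN
    scaler.update()         # adjusts the scale factor for the next step
    return loss.item()


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Stand-in classifier for 32x32 RGB inputs and 10 classes (not the ViT model).
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler(enabled=device.type == "cuda")

# Random batch standing in for CIFAR10 images and labels.
images = torch.randn(8, 3, 32, 32, device=device)
labels = torch.randint(0, 10, (8,), device=device)
loss_value = train_step(model, images, labels, optimizer, scaler, device)
print(loss_value)
```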
