SFTok: Bridging the Performance Gap in Discrete Tokenizers

SFTok resolves the training-inference inconsistency in the multi-step process, significantly enhancing image reconstruction quality.
If you are interested in the training stability of discrete tokenizers, please check out our previous works, OptVQ and ReVQ.

News

| 2025-12-15: We release the training code of SFTok.
| 2025-12-17: We release the checkpoints of SFTok-B and SFTok-L.

Abstract

Recent advances in multimodal models highlight the pivotal role of image tokenization in high-resolution image generation. By compressing images into compact latent representations, tokenizers enable generative models to operate in lower-dimensional spaces, thereby improving computational efficiency and reducing complexity. Discrete tokenizers naturally align with the autoregressive paradigm but still lag behind continuous ones, limiting their adoption in multimodal systems. To address this, we propose SFTok, a discrete tokenizer that incorporates a multi-step iterative mechanism for precise reconstruction. By integrating self-forcing guided visual reconstruction and a debias-and-fitting training strategy, SFTok resolves the training-inference inconsistency in the multi-step process, significantly enhancing image reconstruction quality. At a high compression rate of only 64 tokens per image, SFTok achieves state-of-the-art reconstruction quality on ImageNet (rFID = 1.21) and demonstrates exceptional performance in class-to-image generation tasks (gFID = 2.29).
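
To put "64 tokens per image" in perspective, the following is a rough back-of-the-envelope sketch of the implied compression ratio. Only the token count comes from the abstract. The 256x256 resolution mirrors the ImageNet 256 reference batch used in the evaluation section, and the codebook size of 4096 is a purely hypothetical value chosen for illustration; the helper `compression_ratio` is not part of this repository.

```python
import math

def compression_ratio(height, width, num_tokens, codebook_size,
                      channels=3, bits_per_channel=8):
    """Ratio of raw image bits to the bits needed to store the token indices."""
    raw_bits = height * width * channels * bits_per_channel
    # Each token index needs ceil(log2(codebook_size)) bits.
    bits_per_token = math.ceil(math.log2(codebook_size))
    return raw_bits / (num_tokens * bits_per_token)

# 64 tokens per image is from the abstract; the resolution and the
# codebook size of 4096 are illustrative assumptions.
ratio = compression_ratio(256, 256, num_tokens=64, codebook_size=4096)
print(f"compression ratio: {ratio:.0f}x")
```

The actual ratio depends on the codebook size defined in the model config, so treat this number as an order-of-magnitude estimate only.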

Installation

pip3 install -r requirements.txt

Get Started

In the following, we use SFTok-B as an example to describe how to utilize SFTok for the training and evaluation of the image reconstruction task. The sole distinction between SFTok-L and SFTok-B lies in their decoder architectures: SFTok-L utilizes a ViT-Large framework while SFTok-B employs a ViT-Base framework.
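
As a rough guide to the size difference between the two decoders, the sketch below compares the standard ViT-Base and ViT-Large configurations from the original ViT paper. It is an assumption that the SFTok decoders follow these exact sizes; the values are not read from the configs in this repository, and `ViTSize` is a hypothetical helper used only for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ViTSize:
    depth: int   # number of transformer blocks
    width: int   # hidden dimension
    heads: int   # attention heads per block

    def approx_block_params(self) -> int:
        # 4*d^2 for the attention projections (Q, K, V, output) plus 8*d^2
        # for an MLP with 4x expansion; biases, norms and embeddings are ignored.
        return 12 * self.depth * self.width ** 2

# Standard sizes from the original ViT paper, NOT read from the SFTok configs.
VIT_BASE = ViTSize(depth=12, width=768, heads=12)
VIT_LARGE = ViTSize(depth=24, width=1024, heads=16)

for name, size in (("ViT-Base", VIT_BASE), ("ViT-Large", VIT_LARGE)):
    millions = size.approx_block_params() / 1e6
    print(f"{name}: ~{millions:.0f}M parameters in transformer blocks")
```

Under these assumptions the Large decoder has roughly 3.5 times as many transformer parameters as the Base decoder, which is worth keeping in mind when budgeting GPU memory.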

Inference

Please download the pre-trained models from the following links:

Model	Link (Tsinghua)	Link (Hugging Face)
SFTok-B	Download	Download
SFTok-L	Download	Download

Option 1: Load from Hugging Face

You can load from the Hugging Face model hub by running the following code:

# Example: load the SFTok-B
from modeling.sftok import SFTok
model = SFTok.from_pretrained("AndyRaoTHU/SFTok-B")

Option 2: Load from the local checkpoint

You can also load the pre-trained model from a local checkpoint:

# Example: load the SFTok-B
import torch
from omegaconf import OmegaConf
from modeling.sftok import SFTok

# config_path and model_ckpt_path are placeholders: point them at the model
# config file and the downloaded checkpoint before running.
config = OmegaConf.load(config_path)
# set up the model
model = SFTok(config)

# load the pre-trained weights
state_dict = torch.load(model_ckpt_path, map_location='cpu')
model.load_state_dict(state_dict, strict=True)

Perform inference

After loading the model, you can perform inference (reconstruction):

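import torch
from omegaconf import OmegaConf
from modeling.sftok import SFTok

# config_path and model_ckpt_path are placeholders for your local files
config = OmegaConf.load(config_path)
# set up the model
model = SFTok(config)

# load the pre-trained weights and switch to evaluation mode
state_dict = torch.load(model_ckpt_path, map_location='cpu')
model.load_state_dict(state_dict, strict=True)
model.eval()

dataset = ...
image = dataset[...]
with torch.no_grad():
    # encode the image into discrete token indices
    encoded_tokens = model.encode(image)[1]["min_encoding_indices"]
    # decode the token indices back into an image
    reconstructed_images = model.decode_tokens(encoded_tokens)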

Alternatively, you can directly use the reconstruction example code we provide (toy_example/sftok_recon.py).
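
For a quick sanity check of a reconstruction, a simple pixel-level metric can be computed alongside rFID. The sketch below uses PSNR, which is a stand-in chosen for simplicity and is not the metric reported in the paper. It assumes both images are arrays of the same shape with values in [0, 1]; the actual value range of the model output is not stated here, so adjust `max_value` accordingly. The function `psnr` is a hypothetical helper, not part of this repository.

```python
import numpy as np

def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two arrays of equal shape."""
    reference = np.asarray(reference, dtype=np.float64)
    reconstruction = np.asarray(reconstruction, dtype=np.float64)
    if reference.shape != reconstruction.shape:
        raise ValueError("inputs must have the same shape")
    mse = np.mean((reference - reconstruction) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

# Stand-in arrays; in practice pass the input image and the decoder output
# after converting them to NumPy (e.g. tensor.detach().cpu().numpy()).
original = np.zeros((3, 256, 256))
noisy = original + 0.1
print(f"PSNR: {psnr(original, noisy):.2f} dB")
```

Higher values indicate a closer pixel-level match, but PSNR does not capture perceptual quality the way rFID does.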

Training Preparation

We use the webdataset format for data loading. To begin, convert your dataset into the webdataset format.
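
A webdataset shard is a plain tar archive in which all files belonging to one sample share the same base name. The following is a minimal sketch of writing such a shard with the Python standard library only. The `.jpg` and `.cls` extensions and the shard file name are assumptions for illustration; check the data loading code in this repository for the keys it actually expects. `write_shard` and `list_shard` are hypothetical helpers.

```python
import io
import tarfile
import tempfile
from pathlib import Path

def write_shard(samples, shard_path):
    """Write (key, image_bytes, class_index) samples into one tar shard."""
    with tarfile.open(shard_path, "w") as tar:
        for key, image_bytes, class_index in samples:
            # Files of one sample share the key and differ only in extension.
            members = (
                (f"{key}.jpg", image_bytes),
                (f"{key}.cls", str(class_index).encode("utf-8")),
            )
            for name, payload in members:
                info = tarfile.TarInfo(name=name)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))

def list_shard(shard_path):
    """Return member names in the order they are stored in the shard."""
    with tarfile.open(shard_path, "r") as tar:
        return tar.getnames()

# Toy usage with placeholder bytes instead of real JPEG data.
shard_path = Path(tempfile.mkdtemp()) / "train-000000.tar"
write_shard([("000000", b"fake-jpeg-0", 3), ("000001", b"fake-jpeg-1", 7)], shard_path)
print(list_shard(shard_path))
```

For a real dataset, read each image file as bytes and split the samples across many shards so that the loader can stream and shuffle them efficiently.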

Furthermore, Stage 1 and Stage 2 training rely on a pre-trained MaskGIT as the teacher model. You need to download the pre-trained MaskGIT model and use it to generate the proxy codes for the training dataset.

Train Tokenizer (SFTok-B as example)

We provide the following example commands for training SFTok, using SFTok-B as an example. During training, you can monitor metrics such as rFID through the TensorBoard logs, which track training statistics over time.

# Training for SFTok-B
# Stage 1
bash run_stage1.sh

# Stage 2
bash run_stage2.sh

# Stage 3
bash run_stage3.sh

Train Generator

We adopt the generative model architecture proposed by MaskGIT as our base framework. The commands below train and evaluate the SFTok generator, using SFTok-B as an example.

Training (SFTok-B as example)

Run the following command to train the SFTok generator:

bash run_generator.sh

Evaluation (SFTok-B as example)

We provide example commands to evaluate the SFTok generator as follows, using SFTok-B-64 as an example. Evaluation uses the open-source evaluation code and the ImageNet 256 reference batch provided by OpenAI.

git clone https://github.com/openai/guided-diffusion.git
wget https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/imagenet/256/VIRTUAL_imagenet256_labeled.npz

bash run_eval_generation.sh
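
For orientation, the guided-diffusion evaluator compares the reference batch against a sample batch stored as an `.npz` file. The sketch below shows one way to pack generated images into that layout, assuming the evaluator reads the first array (key `arr_0`) as uint8 images of shape (N, H, W, 3). Whether `run_eval_generation.sh` already writes this file for you is not stated here, and `save_sample_batch` is a hypothetical helper, not part of this repository.

```python
import os
import tempfile
import numpy as np

def save_sample_batch(images, out_path):
    """Save generated images as an .npz whose first array is (N, H, W, 3) uint8."""
    images = np.asarray(images)
    if images.ndim != 4 or images.shape[-1] != 3:
        raise ValueError("expected an array of shape (N, H, W, 3)")
    if images.dtype != np.uint8:
        raise ValueError("expected uint8 pixel values in [0, 255]")
    # Arrays passed positionally to np.savez are stored under the key "arr_0".
    np.savez(out_path, images)

# Toy usage with random pixels standing in for generated images.
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(4, 256, 256, 3), dtype=np.uint8)
out_path = os.path.join(tempfile.mkdtemp(), "samples.npz")
save_sample_batch(fake_images, out_path)
with np.load(out_path) as batch:
    saved_shape = batch["arr_0"].shape
    saved_dtype = batch["arr_0"].dtype
print(saved_shape, saved_dtype)
```

If your generator outputs float tensors in channel-first layout, convert them to channel-last uint8 arrays before saving.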
