This repo is the official implementation of "Flow2GAN: Hybrid Flow Matching and GAN with Multi-Resolution Network for Few-step High-Fidelity Audio Generation". Flow2GAN enables high-fidelity audio generation from Mel-spectrograms or discrete audio tokens with efficient few-step (even one-step) inference.
Existing dominant methods for audio generation include Generative Adversarial Networks (GANs) and diffusion-based methods such as Flow Matching. GANs suffer from slow convergence and potential mode collapse during training, while diffusion-based methods require multi-step inference that introduces considerable computational overhead. In this work, we introduce Flow2GAN, a two-stage framework that combines Flow Matching training for learning generative capabilities with GAN fine-tuning for efficient few-step inference. Specifically, motivated by audio's unique properties, we first adapt Flow Matching to audio modeling by: 1) reformulating the objective as endpoint estimation, which avoids the difficulties of velocity estimation in empty regions; and 2) applying spectral energy-based loss scaling to emphasize perceptually salient quieter regions. Building on these adaptations, we show that a further stage of lightweight GAN fine-tuning yields a one-step generator that produces high-quality audio. In addition, we develop a multi-branch network architecture that processes Fourier coefficients at different time-frequency resolutions, improving modeling capability over prior single-resolution designs. Experimental results show that Flow2GAN delivers high-fidelity audio generation from Mel-spectrograms or discrete audio tokens, achieving better quality-efficiency trade-offs than existing state-of-the-art GAN-based and Flow Matching-based methods.
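To make the endpoint-estimation idea concrete, here is a minimal training-loss sketch, assuming a standard linear interpolation path between noise `x0` and data `x1`. The function name, model signature, tensor shapes, and the energy-based weighting are illustrative assumptions, not the repo's actual training code; the intuition is that predicting the endpoint `x1` directly avoids regressing a velocity target that, in empty regions, is dominated by the noise sample.

```python
# Illustrative sketch of an endpoint-estimation flow-matching loss.
# Names, shapes, and the weighting scheme are assumptions, not the repo's code.
import torch

def endpoint_fm_loss(model, x1, cond):
    """x1: target audio representation; cond: Mel-spectrogram conditioning."""
    x0 = torch.randn_like(x1)                       # noise sample
    t = torch.rand(x1.shape[0], device=x1.device)   # random time in [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))        # broadcast t over feature dims
    xt = (1 - t_) * x0 + t_ * x1                    # linear interpolation path
    x1_pred = model(xt, t, cond)                    # predict the endpoint x1 directly
    # (the corresponding velocity target would be x1 - x0; predicting x1 sidesteps it)
    err = (x1_pred - x1) ** 2
    # Illustrative energy-based scaling: down-weight high-energy regions so that
    # quieter, perceptually salient regions contribute more to the loss.
    weight = 1.0 / (x1.abs().mean(dim=-1, keepdim=True) + 1.0)
    return (weight * err).mean()
```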
Comparison of Mel-conditioned results on the 24 kHz universal audio test set from the RFWave repo:
| Model | PESQ ↑ | ViSQOL ↑ |
|---|---|---|
| BigVGAN-v2 | 3.945 | 4.935 |
| Flow2GAN, 1-step | 3.925 | 4.918 |
| Flow2GAN, 2-step | 4.121 | 4.949 |
| Flow2GAN, 4-step | 4.203 | 4.958 |
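As a reference for reproducing numbers like those above, wideband PESQ can be computed with the open-source `pesq` package after resampling to 16 kHz. This is only a sketch of one possible setup (file names are placeholders); the repo's own evaluation pipeline (see run_libritts.sh) may differ.

```python
# Hedged evaluation sketch: wideband PESQ between a reference wav and a generated wav.
# Assumes the `pesq` pip package; not the repo's official evaluation code.
import torchaudio
from pesq import pesq

ref, sr_ref = torchaudio.load("ref.wav")      # reference audio, (channels, samples)
deg, sr_deg = torchaudio.load("output.wav")   # generated audio
assert sr_ref == sr_deg

# Wideband PESQ is defined at 16 kHz, so convert to mono and resample first.
ref16 = torchaudio.functional.resample(ref.mean(dim=0), sr_ref, 16000).numpy()
deg16 = torchaudio.functional.resample(deg.mean(dim=0), sr_deg, 16000).numpy()

n = min(len(ref16), len(deg16))               # align lengths
print("PESQ (wb):", pesq(16000, ref16[:n], deg16[:n], "wb"))
```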
2025/12/30: Flow2GAN with Mel-spectrogram conditioning is released, with checkpoints available at
2025/12/31: Chinese WeChat blog post about Flow2GAN: "BigVGAN 平替：高保真声码器 Flow2GAN" (A BigVGAN alternative: the high-fidelity vocoder Flow2GAN)
2026/01/20: Mel-spectrogram conditioned model checkpoints trained on a large-scale universal audio dataset (24kHz, 44kHz) are now available at
git clone https://github.com/k2-fsa/Flow2GAN.git
cd Flow2GAN
pip install -r requirements.txt
pip install -r requirements_eval.txt
# For ViSQOL, please follow the instructions at https://github.com/google/visqol
export PYTHONPATH=$PWD:$PYTHONPATH

The following examples will automatically download the required pretrained models from the HuggingFace Hub. If you have trouble accessing HuggingFace, you can set the HF mirror endpoint:

export HF_ENDPOINT=https://hf-mirror.com

Example: reconstruct audio from a wav file by extracting its Mel-spectrogram and running few-step inference.

import os
import soundfile as sf
import torch
import torchaudio
from flow2gan import get_model
from flow2gan.models.modules import LogMelSpectrogram
step = 4  # can be set to 1, 2, or 4
# 24kHz model
model_name = "mel_24k_base"
hf_model_name = f"libritts-mel-{step}-step"
# hf_model_name = f"universal-24k-mel-{step}-step" # For universal audio model
# 44kHz model
# model_name = "mel_44k_128band_512x_base"
# hf_model_name = f"universal-44k-mel-128band-512x-{step}-step" # For universal audio model
# Required model will be downloaded from HuggingFace Hub automatically
model, model_cfg = get_model(model_name=model_name, hf_model_name=hf_model_name)
cond_module = LogMelSpectrogram(
sampling_rate=model_cfg.sampling_rate,
n_fft=model_cfg.mel_n_fft,
hop_length=model_cfg.mel_hop_length,
n_mels=model_cfg.n_mels,
center=True,
power=1,
)
device = torch.device("cpu")
if torch.cuda.is_available():
    device = torch.device("cuda", 0)
model.to(device)
model.eval()
cond_module.to(device)
cond_module.eval()
input_path = "./test_data/wav/1089_134686_000002_000000.wav"
# input_path = "./test_data/wav_44k/mixture.wav"
output_path = "output.wav"
audio, sr = torchaudio.load(input_path)
assert sr == model_cfg.sampling_rate
audio = audio.to(device)
if audio.shape[0] > 1:
    audio = torch.mean(audio, dim=0, keepdim=True)  # to mono
# audio: (1, samples)
with torch.inference_mode():
    mel_spec = cond_module(audio)  # (1, n_mels, frames)
    pred_audio = model.infer(cond=mel_spec, n_timesteps=step, clamp_pred=True)
sf.write(output_path, pred_audio.cpu().squeeze(0).numpy(), sr)
print(f"Wrote output to {output_path}")import os
import soundfile as sf
import torch
from flow2gan import get_model
step = 4  # can be set to 1, 2, or 4
# 24kHz model
model_name = "mel_24k_base"
hf_model_name = f"libritts-mel-{step}-step"
# hf_model_name = f"universal-24k-mel-{step}-step" # For universal audio model
# 44kHz model
# model_name = "mel_44k_128band_512x_base"
# hf_model_name = f"universal-44k-mel-128band-512x-{step}-step" # For universal audio model
# Required model will be downloaded from HuggingFace Hub automatically
model, model_cfg = get_model(model_name=model_name, hf_model_name=hf_model_name)
device = torch.device("cpu")
if torch.cuda.is_available():
    device = torch.device("cuda", 0)
model.to(device)
model.eval()
input_path = "./test_data/mel/1089_134686_000002_000000.pt"
# input_path = "./test_data/mel_44k_128band_512x/mixture.pt"
output_path = "output.wav"
mel_spec = torch.load(input_path) # (1, n_mels, frames)
mel_spec = mel_spec.to(device)
with torch.inference_mode():
    pred_audio = model.infer(cond=mel_spec, n_timesteps=step, clamp_pred=True)
sf.write(output_path, pred_audio.cpu().squeeze(0).numpy(), model_cfg.sampling_rate)
print(f"Wrote output to {output_path}")step=4 # Could be 1,2,4
# 24kHz model
model_name=mel_24k_base
hf_model_name=libritts-mel-${step}-step
# hf_model_name=universal-24k-mel-${step}-step # For universal audio model
# 44kHz model
# model_name=mel_44k_128band_512x_base
# hf_model_name=universal-44k-mel-128band-512x-${step}-step # For universal audio model
# Required model will be downloaded from HuggingFace Hub automatically
# Infer from a directory of mel files (*.pt)
python -m flow2gan.bin.infer_dir \
--model-name $model_name \
--n-timesteps $step \
--hf-model-name $hf_model_name \
--input-type mel \
--input-dir ./test_data/mel/ \
--output-dir ./output_from_mel/
# Infer from a directory of audio files (*.wav)
python -m flow2gan.bin.infer_dir \
--model-name $model_name \
--n-timesteps $step \
--hf-model-name $hf_model_name \
--input-type audio \
--input-dir ./test_data/wav/ \
--output-dir ./output_from_wav/

Run the full LibriTTS pipeline to reproduce the reported results, including data preparation, training, inference, and metric evaluation (see run_libritts.sh for details).
bash run_libritts.sh
# You can also extend it to your own dataset with the necessary modifications.

You can discuss directly in GitHub Issues.
You can also scan the QR codes below to join our WeChat group or follow our WeChat official account.
| WeChat Group | WeChat Official Account |
|---|---|
| ![]() | ![]() |
@article{yao2025flow2gan,
  title={Flow2GAN: Hybrid Flow Matching and GAN with Multi-Resolution Network for Few-step High-Fidelity Audio Generation},
  author={Zengwei Yao and Wei Kang and Han Zhu and Liyong Guo and Lingxuan Ye and Fangjun Kuang and Weiji Zhuang and Zhaoqing Li and Zhifeng Han and Long Lin and Daniel Povey},
  journal={arXiv preprint arXiv:2512.23278},
  year={2025}
}



