Phantom-Data: Towards a General Subject-Consistent Video Generation Dataset
Zhuowei Chen * , Bingchuan Li * †, Tianxiang Ma * , Lijie Liu * , Mingcong Liu, Yi Zhang, Gen Li, Xinghui Li, Siyu Zhou, Qian He, Xinglong Wu
* Equal contribution, † Project lead
Intelligent Creation Lab, ByteDance
- We released the dataset, built upon Koala-36M, on Hugging Face: Phantom-data-Koala36M.
- More detailed instructions on how to use this dataset will be added after the national holiday.
Download the meta info from Phantom-data-Koala36M. There are two files:
- `koala36M_multi_ref_meta_info_merged.parquet`: This file contains the metadata of all clips. The columns are mostly from Koala-36M, with one additional column `vid` to uniquely identify each clip.
- `koala36M_multi_ref_merged_filtered.parquet`: This file contains the training data meta info. The columns are:
  - `vid`: The target clip identifier.
  - `video_caption`: The caption describing the clip content.
  - `cross_pair`: A dictionary mapping noun phrases from `video_caption` to cross-modal reference data. Each entry contains:
    - `obj_from_tgt_video`: Source objects detected in the target clip.
    - `refer_result`: List of matching reference images with bounding boxes.
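The two files can be joined on `vid` to attach clip-level metadata to each training row. A minimal sketch with pandas, using hypothetical miniature rows that mimic the schemas above (the real files would be loaded with `pd.read_parquet(...)` instead):

```python
import pandas as pd

# Hypothetical rows mimicking koala36M_multi_ref_meta_info_merged.parquet.
meta = pd.DataFrame({
    "vid": ["clip_0001", "clip_0002"],
    "youtube_url": ["https://youtube.com/watch?v=a", "https://youtube.com/watch?v=b"],
})

# Hypothetical row mimicking koala36M_multi_ref_merged_filtered.parquet.
train = pd.DataFrame({
    "vid": ["clip_0001"],
    "video_caption": ["a dog runs across a field"],
    "cross_pair": [{"a dog": {"obj_from_tgt_video": ["dog"], "refer_result": []}}],
})

# Join the training meta info with the clip-level metadata on `vid`.
merged = train.merge(meta, on="vid", how="left")
print(merged.loc[0, "youtube_url"])  # https://youtube.com/watch?v=a
```

All column values here are placeholders for illustration; only the column names follow the description above.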
Download all clips listed in `koala36M_multi_ref_meta_info_merged.parquet`. Each clip can be downloaded using its `youtube_url` and `timestamp`. We refer to Panda-70M for the download implementation.
After downloading the clips, extract reference images from koala36M_multi_ref_merged_filtered.parquet. The refer_result field contains lists of reference images with their corresponding bounding boxes.
- Get the reference frame: the frame comes from the clip `vid`, and its index can be calculated as `frame_index = int(num_frames * frame_idx)`. Resize it with `resize_image(frame_list[frame_index], long_size=768)`. The `resize_image` function is:
```python
from PIL import Image

def resize_image(img_pil, long_size=1024):
    width, height = img_pil.size
    # Check if the longest side exceeds the limit (long_size)
    if max(width, height) > long_size:
        # Calculate new dimensions
        if width > height:
            new_width = long_size
            new_height = int((new_width / width) * height)
        else:
            new_height = long_size
            new_width = int((new_height / height) * width)
        # Resize the image
        img_pil = img_pil.resize((new_width, new_height), Image.LANCZOS)
    return img_pil
```
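As a quick sanity check on the indexing and resizing math, with hypothetical clip values (a 120-frame clip and a 1920x1080 frame, neither taken from the dataset):

```python
# Hypothetical values: a 120-frame clip whose parquet row stores frame_idx = 0.5.
num_frames, frame_idx = 120, 0.5
frame_index = int(num_frames * frame_idx)
print(frame_index)  # 60

# Dimension math performed by resize_image for a 1920x1080 frame at long_size=768:
width, height, long_size = 1920, 1080, 768
new_width = long_size
new_height = int((new_width / width) * height)
print((new_width, new_height))  # (768, 432)
```

The aspect ratio is preserved, and frames whose long side is already at most `long_size` are returned unchanged.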
- Get reference subjects with the bounding box. The bounding box in the parquet is organized as `<x_min, y_min, x_max, y_max>`.
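A minimal cropping sketch with PIL, assuming the bounding box is given in pixel coordinates of the resized frame (the frame size and box values here are hypothetical, not taken from the dataset):

```python
from PIL import Image

# Hypothetical frame and box; real boxes come from the refer_result field.
frame = Image.new("RGB", (768, 432))
x_min, y_min, x_max, y_max = 100, 50, 300, 250

# PIL's crop takes (left, upper, right, lower), matching <x_min, y_min, x_max, y_max>.
subject = frame.crop((x_min, y_min, x_max, y_max))
print(subject.size)  # (200, 200)
```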
Finally, we obtain the `<reference objects, video_caption>` ===> target video triplets.
We would like to thank the Koala-36M team for their valuable work, and our excellent engineering team, Ronggui Peng, Bingqian Yi, and Xiaojun Lin, for their engineering support.
Our team does not use this dataset for any commercial purposes. The Phantom-Data dataset is released for non-commercial research purposes only.
If Phantom-Data is helpful, please help to ⭐ the repo.
If you find this project useful for your research, please consider citing our paper.
```bibtex
@article{chen2025phantom-data,
  title={Phantom-Data: Towards a General Subject-Consistent Video Generation Dataset},
  author={Chen, Zhuowei and Li, Bingchuan and Ma, Tianxiang and Liu, Lijie and Liu, Mingcong and Zhang, Yi and Li, Gen and Li, Xinghui and Zhou, Siyu and He, Qian and Wu, Xinglong},
  journal={arXiv preprint arXiv:2506.18851},
  year={2025}
}
```

If you have any comments or questions regarding this open-source project, please open a new issue or contact Zhuowei Chen.