Hi, I am reproducing the pretraining in your work. The ETA shows it would take over 20 days to complete pretraining on WebVid-2.5M + CC3M for 10 epochs, which is far from the 1.8 days reported in the paper. Here are all the configs I think are relevant:
- 8 × A10 cards without Slurm; each card has 23 GB (similar to the A5000)
- `OMP_NUM_THREADS=64` (for torchrun)
- Dataset: WebVid-2.5M + CC3M (using the `.sqlite.db` files), pre-processed by `preprocess/compress.py`. Videos are sampled at 2 fps; the resolution is 224.
- `num_workers = 32`
- `batch_size = 64`
- Model: BEiT-base + BERT-base
The ETA for one epoch is now over 2 days, so 10 epochs would take 20+ days. Here is part of the training log:
```
utils.basic_utils: Train Epoch: [0]  [  200/10175]  eta: 2 days, 11:04:48
  lr: 0.000002  temperature: 0.0702  image-loss_vtc: 6.2285  video-loss_vtc: 6.2430
  image-loss_mlm: 5.3662  video-loss_mlm: 5.8240  image-loss_vtm: 0.6576  video-loss_vtm: 0.6384
  time: 40.1906  data: 38.0570  max mem: 10768  res mem: 11456
```
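For what it's worth, the log above suggests the bottleneck is data loading rather than compute: `data: 38.0570` accounts for almost all of `time: 40.1906` per step. To double-check, I timed iteration over the loader alone, without any model forward pass (a minimal sketch; `time_loader` is my own helper, not part of the repo):

```python
import time

def time_loader(loader, num_steps=50):
    """Average seconds per batch for pure data loading (no model forward)."""
    it = iter(loader)
    start = time.perf_counter()
    for _ in range(num_steps):
        next(it)
    return (time.perf_counter() - start) / num_steps

# Usage with the real training loader:
#   per_step = time_loader(train_loader)
#   print(f"data-only time per step: {per_step:.2f}s")
```

If this number is close to the logged `time` per step, the GPUs are starved by the dataloader.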
In addition, following #9 I set `DataLoader(multiprocessing_context="spawn", ...)` during pretraining, but it raises an error:
```
Traceback (most recent call last):
  File "tasks/pretrain.py", line 285, in <module>
    main(cfg)
  File "tasks/pretrain.py", line 214, in main
    config,
  File "tasks/pretrain.py", line 59, in train
    train_loader = MetaLoader(name2loader=dict(list(zip(media_types, train_loaders))))
  File "/efs/users/cftao/Eff_VLP/dataset/dataloader.py", line 21, in __init__
    self.name2iter = {name: iter(l) for name, l in name2loader.items()}
  File "/efs/users/cftao/Eff_VLP/dataset/dataloader.py", line 21, in <dictcomp>
    self.name2iter = {name: iter(l) for name, l in name2loader.items()}
  File "/opt/conda/envs/vl3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 439, in __iter__
    self._iterator = self._get_iterator()
  File "/opt/conda/envs/vl3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 390, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/opt/conda/envs/vl3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1077, in __init__
    w.start()
  File "/opt/conda/envs/vl3/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/opt/conda/envs/vl3/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/opt/conda/envs/vl3/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/conda/envs/vl3/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/opt/conda/envs/vl3/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/opt/conda/envs/vl3/lib/python3.7/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'create_dataset.<locals>.<lambda>'
```
How does this happen? Thank you for your time!
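My guess, in case it helps others: with the `spawn` start method, worker processes receive the dataset by pickling it, and a lambda defined inside a function (here, inside `create_dataset`) cannot be pickled. A self-contained sketch reproducing the error and a possible fix (the `create_dataset_*`/`scale` names are mine, just for illustration; I have not checked what the actual lambda in the repo does):

```python
import pickle
from functools import partial

def create_dataset_broken():
    # A transform defined as a local lambda: not picklable under spawn.
    return {"transform": lambda x: x * 2}

def scale(x, factor):
    # Module-level function: picklable, since it can be looked up by name.
    return x * factor

def create_dataset_fixed():
    # functools.partial of a module-level function survives pickling.
    return {"transform": partial(scale, factor=2)}

# The lambda version fails exactly like the traceback above:
try:
    pickle.dumps(create_dataset_broken()["transform"])
except AttributeError as e:
    print("broken:", e)

# The fixed version round-trips through pickle:
transform = pickle.loads(pickle.dumps(create_dataset_fixed()["transform"]))
print("fixed:", transform(21))
```

So replacing the lambda in `create_dataset` with a named module-level function (or a `functools.partial` of one) should make `spawn` work; but I'd still like to understand whether `spawn` is the intended way to reach the reported 1.8-day runtime.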