
Problems with pretraining speed #10

@ChaofanTao

Description


Hi, I am reproducing the pretraining in your work. The ETA shows it would take over 20 days to complete pretraining on WebVid-2.5M + CC3M for 10 epochs, which is far from the 1.8 days reported in the paper. Here are all the configs I think are relevant (a dataloader sketch follows the list):

8 * A10 cards without Slurm; each card has 23 GB (similar to the A5000)
OMP_NUM_THREADS=64 # for torchrun
Dataset = WebVid-2.5M + CC3M (using the .sqlite.db files), pre-processed by preprocess/compress.py. Videos are sampled at 2 fps; the resolution is 224.
num_workers = 32
batch_size = 64
Model: BEiT-base + BERT-base
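
For concreteness, the per-GPU dataloader corresponding to the settings above would look roughly like the sketch below. The dataset class is a hypothetical stand-in (the real one is built in the repo's create_dataset and reads the .sqlite.db files), and the clip length of 4 frames is illustrative only:

```python
import torch
from torch.utils.data import DataLoader, Dataset

# Hypothetical stand-in for the repo's pretraining dataset; the real one
# reads compressed frames from the .sqlite.db files produced by
# preprocess/compress.py.
class DummyPretrainDataset(Dataset):
    def __len__(self):
        return 2_500_000  # roughly the size of WebVid-2.5M

    def __getitem__(self, idx):
        frames = torch.zeros(4, 3, 224, 224)  # sampled at 2 fps, 224x224
        return frames, idx

# Per-GPU loader with the settings listed above.
loader = DataLoader(DummyPretrainDataset(), batch_size=64,
                    num_workers=32, pin_memory=True, shuffle=True)
```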

Now the ETA for one epoch is over 2 days, so 20+ days for 10 epochs. Part of the training log follows:

 utils.basic_utils: Train Epoch: [0]  [  200/10175]  eta: 2 days, 11:04:48  
lr: 0.000002  temperature: 0.0702  image-loss_vtc: 6.2285  
video-loss_vtc: 6.2430  image-loss_mlm: 5.3662  video-loss_mlm: 5.8240  image-loss_vtm: 0.6576  video-loss_vtm: 0.6384  
time: 40.1906  data: 38.0570  max mem: 10768 res mem: 11456

In addition, the log shows that data loading dominates the iteration time (data: 38.0570 of time: 40.1906), so I followed #9 and set DataLoader(multiprocessing_context="spawn", ...) during pretraining, but that raises an error:

Traceback (most recent call last):
  File "tasks/pretrain.py", line 285, in <module>
    main(cfg)
  File "tasks/pretrain.py", line 214, in main
    config,
  File "tasks/pretrain.py", line 59, in train
    train_loader = MetaLoader(name2loader=dict(list(zip(media_types, train_loaders))))
  File "/efs/users/cftao/Eff_VLP/dataset/dataloader.py", line 21, in __init__
    self.name2iter = {name: iter(l) for name, l in name2loader.items()}
  File "/efs/users/cftao/Eff_VLP/dataset/dataloader.py", line 21, in <dictcomp>
    self.name2iter = {name: iter(l) for name, l in name2loader.items()}
  File "/opt/conda/envs/vl3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 439, in __iter__
    self._iterator = self._get_iterator()
  File "/opt/conda/envs/vl3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 390, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/opt/conda/envs/vl3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1077, in __init__
    w.start()
  File "/opt/conda/envs/vl3/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/opt/conda/envs/vl3/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/opt/conda/envs/vl3/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/conda/envs/vl3/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/opt/conda/envs/vl3/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/opt/conda/envs/vl3/lib/python3.7/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'create_dataset.<locals>.<lambda>'

Why does this happen? Thank you for your time!
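
From the traceback, my guess is that this is a general Python limitation rather than a repo bug: with multiprocessing_context="spawn", everything handed to the worker processes (dataset, collate_fn, worker_init_fn) is pickled, and a lambda defined inside a function such as create_dataset cannot be pickled. Below is a minimal, self-contained sketch of the failure mode and a workaround; the collate_fn here is a hypothetical stand-in for whatever lambda create_dataset builds:

```python
import functools
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    """Hypothetical stand-in for the datasets built in create_dataset."""
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return torch.tensor([float(idx)])

def stack_collate(batch, dim=0):
    # A module-level function is picklable, so spawn workers can receive it.
    return torch.stack(batch, dim=dim)

def make_loader():
    # Works: collate_fn is a functools.partial over a module-level function.
    return DataLoader(
        ToyDataset(), batch_size=4, num_workers=2,
        multiprocessing_context="spawn",
        collate_fn=functools.partial(stack_collate, dim=0),
    )
    # Fails: collate_fn=lambda b: torch.stack(b) defined here would raise
    # AttributeError: Can't pickle local object 'make_loader.<locals>.<lambda>'

if __name__ == "__main__":  # the guard is required with the spawn start method
    for batch in make_loader():
        print(batch.shape)  # torch.Size([4, 1])
```

If that reading is right, moving the lambda in create_dataset to module level (or replacing it with functools.partial over a named function) should make it spawn-compatible. Is that the intended way to use spawn here?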
