
Problems with pretraining speed #10

@ChaofanTao

Description


Hi, I am reproducing the pretraining in your work. The ETA shows it would take over 20 days to complete pretraining on WebVid-2.5M + CC3M for 10 epochs, which is far from the 1.8 days reported in the paper. Here are all the configs I think are relevant (a dataloader sketch follows the list):

8 * A10 cards without Slurm; each card has 23 GB (similar to the A5000)
OMP_NUM_THREADS=64 # for torchrun
Dataset = WebVid-2.5M + CC3M (using the .sqlite.db files), pre-processed by preprocess/compress.py. Videos are sampled at 2 fps; the resolution is 224.
num_workers = 32
batch_size = 64
Model: BEiT-base + BERT-base
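
For concreteness, the per-GPU dataloader corresponding to the settings above would look roughly like the sketch below. The dataset class is a hypothetical stand-in (the real one is built in the repo's create_dataset and reads the .sqlite.db files), and the clip length of 4 frames is illustrative only:

```python
import torch
from torch.utils.data import DataLoader, Dataset

# Hypothetical stand-in for the repo's pretraining dataset; the real one
# reads compressed frames from the .sqlite.db files produced by
# preprocess/compress.py.
class DummyPretrainDataset(Dataset):
    def __len__(self):
        return 2_500_000  # roughly the size of WebVid-2.5M

    def __getitem__(self, idx):
        frames = torch.zeros(4, 3, 224, 224)  # sampled at 2 fps, 224x224
        return frames, idx

# Per-GPU loader with the settings listed above.
loader = DataLoader(DummyPretrainDataset(), batch_size=64,
                    num_workers=32, pin_memory=True, shuffle=True)
```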

Now the ETA for one epoch is over 2 days, so 20+ days for 10 epochs. Part of the training log follows:

 utils.basic_utils: Train Epoch: [0]  [  200/10175]  eta: 2 days, 11:04:48  
lr: 0.000002  temperature: 0.0702  image-loss_vtc: 6.2285  
video-loss_vtc: 6.2430  image-loss_mlm: 5.3662  video-loss_mlm: 5.8240  image-loss_vtm: 0.6576  video-loss_vtm: 0.6384  
time: 40.1906  data: 38.0570  max mem: 10768 res mem: 11456

In addition, the log shows that data loading dominates the iteration time (data: 38.0570 of time: 40.1906), so I followed #9 and set DataLoader(multiprocessing_context="spawn", ...) during pretraining, but that raises an error:

Traceback (most recent call last):
  File "tasks/pretrain.py", line 285, in <module>
    main(cfg)
  File "tasks/pretrain.py", line 214, in main
    config,
  File "tasks/pretrain.py", line 59, in train
    train_loader = MetaLoader(name2loader=dict(list(zip(media_types, train_loaders))))
  File "/efs/users/cftao/Eff_VLP/dataset/dataloader.py", line 21, in __init__
    self.name2iter = {name: iter(l) for name, l in name2loader.items()}
  File "/efs/users/cftao/Eff_VLP/dataset/dataloader.py", line 21, in <dictcomp>
    self.name2iter = {name: iter(l) for name, l in name2loader.items()}
  File "/opt/conda/envs/vl3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 439, in __iter__
    self._iterator = self._get_iterator()
  File "/opt/conda/envs/vl3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 390, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/opt/conda/envs/vl3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1077, in __init__
    w.start()
  File "/opt/conda/envs/vl3/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/opt/conda/envs/vl3/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/opt/conda/envs/vl3/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/conda/envs/vl3/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/opt/conda/envs/vl3/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/opt/conda/envs/vl3/lib/python3.7/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'create_dataset.<locals>.<lambda>'

Why does this happen? Thank you for your time!
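
From the traceback, my guess is that this is a general Python limitation rather than a repo bug: with multiprocessing_context="spawn", everything handed to the worker processes (dataset, collate_fn, worker_init_fn) is pickled, and a lambda defined inside a function such as create_dataset cannot be pickled. Below is a minimal, self-contained sketch of the failure mode and a workaround; the collate_fn here is a hypothetical stand-in for whatever lambda create_dataset builds:

```python
import functools
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    """Hypothetical stand-in for the datasets built in create_dataset."""
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return torch.tensor([float(idx)])

def stack_collate(batch, dim=0):
    # A module-level function is picklable, so spawn workers can receive it.
    return torch.stack(batch, dim=dim)

def make_loader():
    # Works: collate_fn is a functools.partial over a module-level function.
    return DataLoader(
        ToyDataset(), batch_size=4, num_workers=2,
        multiprocessing_context="spawn",
        collate_fn=functools.partial(stack_collate, dim=0),
    )
    # Fails: collate_fn=lambda b: torch.stack(b) defined here would raise
    # AttributeError: Can't pickle local object 'make_loader.<locals>.<lambda>'

if __name__ == "__main__":  # the guard is required with the spawn start method
    for batch in make_loader():
        print(batch.shape)  # torch.Size([4, 1])
```

If that reading is right, moving the lambda in create_dataset to module level (or replacing it with functools.partial over a named function) should make it spawn-compatible. Is that the intended way to use spawn here?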
