
Training with larger input sizes (e.g., 1280x1280) leads to NaN loss #3

@Ibrah-N

Description


Hi Sebastian,

First of all, thank you for releasing this great work — I’ve been experimenting with LINEA and it has been performing really well on my dataset.

I trained the model on 1024x1024 image patches without changing any of the original repo configuration, and it gave very good results. Now I want to train on larger patch sizes such as 2048 (or maybe higher) to test whether the model generalizes better on larger images, since larger patches should provide more context-level spatial features.

However, as soon as I try a larger input size, the issues start. I updated all of the parameters below using a rule of thumb (multiplying everything by 2). Below are the configuration changes I made, followed by a short sketch of the doubling rule after the config listings.


configs/linea/include/dataset.py
===== original configuration =====

data_aug_scales = [(640, 640)]
data_aug_max_size = 1333
data_aug_scales2_resize = [400, 500, 600]
data_aug_scales2_crop = [384, 600]

===== changed configuration =====

data_aug_scales = [(1280, 1280)]
data_aug_max_size = 2666
data_aug_scales2_resize = [800, 1000, 1200]
data_aug_scales2_crop = [768, 1200]

configs/linea/include/linea.py
===== original configuration =====

modelname = 'LINEA'
eval_spatial_size = (640, 640)
eval_idx = 5 # 6 decoder layers
num_classes = 2
...
...
...

===== changed configuration =====

modelname = 'LINEA'
eval_spatial_size = (1280, 1280)
eval_idx = 5 # 6 decoder layers
num_classes = 2
...
...
...
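For reference, here is a small standalone sketch of the doubling rule I applied (just an illustration I wrote for this issue, not code from the LINEA repo); it reproduces the changed values above from the original ones:

# Standalone illustration of the "multiply by 2" rule of thumb used above.
ORIGINAL = {
    "data_aug_scales": [(640, 640)],
    "data_aug_max_size": 1333,
    "data_aug_scales2_resize": [400, 500, 600],
    "data_aug_scales2_crop": [384, 600],
    "eval_spatial_size": (640, 640),
}

def doubled(value):
    # Recursively multiply every integer in a (possibly nested) value by 2.
    if isinstance(value, int):
        return value * 2
    if isinstance(value, (list, tuple)):
        return type(value)(doubled(v) for v in value)
    return value

CHANGED = {key: doubled(value) for key, value in ORIGINAL.items()}
print(CHANGED)
# {'data_aug_scales': [(1280, 1280)], 'data_aug_max_size': 2666,
#  'data_aug_scales2_resize': [800, 1000, 1200],
#  'data_aug_scales2_crop': [768, 1200], 'eval_spatial_size': (1280, 1280)}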


Model Training:

%cd /content/LINEA
!torchrun --nproc_per_node=1 main.py \
  --config_file configs/linea/linea_hgnetv2_l.py \
  --coco_path data/wireframe_processed \
  --resume weights/linea_hgnetv2_l.pth \
  --num_workers 2

OUTPUT LOGS:
/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py:4807: UserWarning: No device id is provided via init_process_group or barrier . Using the current device set by the user.
warnings.warn(  # warn only once
[rank0]:[W906 13:14:41.801293515 ProcessGroupNCCL.cpp:5023] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
Initialized distributed mode...
Namespace(config_file='configs/linea/linea_hgnetv2_l.py', options=None, coco_path='data/wireframe_processed', device='cuda', seed=42, resume='', start_epoch=0, eval=False, num_workers=2, find_unused_params=False, world_size=1, rank=0, local_rank=None, amp=False, gpu=0, distributed=True, data_aug_scales=[(1280, 1280)], data_aug_max_size=2666, data_aug_scales2_resize=[800, 1000, 1200], data_aug_scales2_crop=[768, 1200], data_aug_scale_overlap=None, batch_size_train=2, batch_size_val=2, lr=0.00025, weight_decay=0.000125, betas=[0.9, 0.999], epochs=16, lr_drop_list=[11], clip_max_norm=0.1, save_checkpoint_interval=1, modelname='LINEA', eval_spatial_size=(1024, 1024), eval_idx=5, num_classes=2, pretrained=True, use_checkpoint=False, return_interm_indices=[1, 2, 3], freeze_norm=True, freeze_stem_only=True, hybrid_encoder='hybrid_encoder_asymmetric_conv', in_channels_encoder=[512, 1024, 2048], pe_temperatureH=20, pe_temperatureW=20, transformer_activation='relu', batch_norm_type='FrozenBatchNorm2d', masks=False, aux_loss=True, num_queries=1100, query_dim=4, num_feature_levels=3, dec_n_points=[4, 1, 1], dropout=0.0, pre_norm=False, use_dn=True, dn_number=300, dn_line_noise_scale=1.0, dn_label_noise_ratio=0.5, embed_init_tgt=True, dn_labelbook_size=2, match_unstable_error=True, set_cost_class=2.0, set_cost_lines=5.0, criterionname='LINEACRITERION', criterion_type='default', weight_dict={'loss_logits': 4, 'loss_line': 5}, losses=['labels', 'lines'], focal_alpha=0.1, matcher_type='HungarianMatcher', nms_iou_threshold=-1, use_ema=False, ema_decay=0.9997, ema_epoch=0, output_dir='output/linea_hgnetv2_l', backbone='HGNetv2_B4', param_dict_type='hgnetv2_b4', use_lab=False, feat_strides=[8, 16, 32], hidden_dim=256, dim_feedforward=1024, nheads=8, use_lmap=False, expansion=0.5, depth_mult=1.0, feat_channels_decoder=[256, 256, 256], dec_layers=6, num_select=300, reg_max=16, reg_scale=4, use_warmup=False, model_parameters=[{'params': '^(?=.backbone)(?!.norm|bn).$', 'lr': 1.25e-05}, {'params': '^(?=.(?:encoder|decoder))(?=.(?:norm|bn)).$', 'weight_decay': 0.0}])
Loading stage1
Loaded stage1 B4 HGNetV2 from local file.
building train_dataloader with batch_size=2...
loading annotations into memory...
Done (t=0.03s)
creating index...
index created!
building val_dataloader with batch_size=2...
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
Multi-scaling uses the following size: [768, 800, 832, 864, 896, 928, 960, 992, 1024, 1024, 1024, 1056, 1088, 1120, 1152, 1184, 1216, 1248, 1280]

------------------------------------- Calculate Flops Results -------------------------------------
Notations:
number of parameters (Params), number of multiply-accumulate operations(MACs),
number of floating-point operations (FLOPs), floating-point operations per second (FLOPS),
fwd FLOPs (model forward propagation FLOPs), bwd FLOPs (model backward propagation FLOPs),
default model backpropagation takes 2.00 times as much computation as forward propagation.

Total Training Params: 25.04 M
fwd MACs: 91.9696 GMACs
fwd FLOPs: 184.327 GFLOPS
fwd+bwd MACs: 275.909 GMACs
fwd+bwd FLOPs: 552.98 GFLOPS

{'flops': '184.327 GFLOPS', 'macs': '91.9696 GMACs', 'params': 25036882}
----------------------------------------- Start training ------------------------------------------
Epoch: [0] [ 0/422] eta: 0:27:19 lr: 0.000013 loss: 79.0159 (79.0159) loss_line: 3.3012 (3.3012) loss_line_0: 3.3077 (3.3077) loss_line_1: 3.3077 (3.3077) loss_line_2: 3.3012 (3.3012) loss_line_3: 3.3012 (3.3012) loss_line_4: 3.3012 (3.3012) loss_line_dn_0: 5.7845 (5.7845) loss_line_dn_1: 5.7915 (5.7915) loss_line_dn_2: 5.8021 (5.8021) loss_line_dn_3: 5.8139 (5.8139) loss_line_dn_4: 5.8252 (5.8252) loss_line_dn_5: 5.8357 (5.8357) loss_line_interm: 3.3012 (3.3012) loss_logits: 1.6663 (1.6663) loss_logits_0: 1.4373 (1.4373) loss_logits_1: 1.5000 (1.5000) loss_logits_2: 1.5760 (1.5760) loss_logits_3: 1.6247 (1.6247) loss_logits_4: 1.6522 (1.6522) loss_logits_dn_0: 1.6217 (1.6217) loss_logits_dn_1: 1.6513 (1.6513) loss_logits_dn_2: 1.7024 (1.7024) loss_logits_dn_3: 1.7407 (1.7407) loss_logits_dn_4: 1.7579 (1.7579) loss_logits_dn_5: 1.7574 (1.7574) loss_logits_interm: 1.3535 (1.3535) time: 3.8844 data: 1.3720 max mem: 6224
Loss is nan, stopping training
{'loss_line': tensor(0., device='cuda:0', grad_fn=), 'loss_line_0': tensor(0., device='cuda:0', grad_fn=), 'loss_line_1': tensor(0., device='cuda:0', grad_fn=), 'loss_line_2': tensor(0., device='cuda:0', grad_fn=), 'loss_line_3': tensor(0., device='cuda:0', grad_fn=), 'loss_line_4': tensor(0., device='cuda:0', grad_fn=), 'loss_line_dn_0': tensor(0., device='cuda:0', grad_fn=), 'loss_line_dn_1': tensor(0., device='cuda:0', grad_fn=), 'loss_line_dn_2': tensor(0., device='cuda:0', grad_fn=), 'loss_line_dn_3': tensor(0., device='cuda:0', grad_fn=), 'loss_line_dn_4': tensor(0., device='cuda:0', grad_fn=), 'loss_line_dn_5': tensor(0., device='cuda:0', grad_fn=), 'loss_line_interm': tensor(0., device='cuda:0', grad_fn=), 'loss_logits': tensor(0.6740, device='cuda:0', grad_fn=), 'loss_logits_0': tensor(1.3308, device='cuda:0', grad_fn=), 'loss_logits_1': tensor(0.4909, device='cuda:0', grad_fn=), 'loss_logits_2': tensor(0.5068, device='cuda:0', grad_fn=), 'loss_logits_3': tensor(0.4487, device='cuda:0', grad_fn=), 'loss_logits_4': tensor(0.5344, device='cuda:0', grad_fn=), 'loss_logits_dn_0': tensor(nan, device='cuda:0', grad_fn=), 'loss_logits_dn_1': tensor(nan, device='cuda:0', grad_fn=), 'loss_logits_dn_2': tensor(nan, device='cuda:0', grad_fn=), 'loss_logits_dn_3': tensor(nan, device='cuda:0', grad_fn=), 'loss_logits_dn_4': tensor(nan, device='cuda:0', grad_fn=), 'loss_logits_dn_5': tensor(nan, device='cuda:0', grad_fn=), 'loss_logits_interm': tensor(0.4288, device='cuda:0', grad_fn=)}
[rank0]:[W906 13:15:48.271268219 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
E0906 13:15:50.690000 21783 torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 21790) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 10, in
sys.exit(main())
^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 357, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 143, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 277, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

main.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2025-09-06_13:15:50
host : d51c69c24c24
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 21790)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
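For context, I believe the "Loss is nan, stopping training" message comes from a DETR-style finite-loss guard in the training loop, something along the lines of the sketch below (my paraphrase, not the exact LINEA code). The printed dict shows that the denoising logits terms (loss_logits_dn_*) are the ones that go NaN:

import math
import sys

import torch

# Rough paraphrase of the kind of guard DETR-style training loops use
# (assumed, not copied from LINEA). The keys mirror the dict in the log above.
loss_dict_reduced = {
    "loss_line": torch.tensor(0.0),
    "loss_logits": torch.tensor(0.6740),
    "loss_logits_dn_0": torch.tensor(float("nan")),  # non-finite term in my run
}

loss_value = sum(loss_dict_reduced.values()).item()

if not math.isfinite(loss_value):
    print("Loss is nan, stopping training")
    # Printing the individual terms shows which component became non-finite;
    # in my log it is the denoising logits losses (loss_logits_dn_*).
    print(loss_dict_reduced)
    sys.exit(1)

So the NaN seems to originate in the denoising branch once the input size is doubled. Any idea what else needs to be adjusted for larger inputs?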
