Conversation

@aire-meta-bot (Collaborator)

This PR is auto-generated by AIRE Meta Bot.

  • Please check the CI results before merging it.
  • Please use rebase merge to keep the pt-nightly-compatible branch rebased against the main branch.
  • If one or more tests fail, please commit the necessary changes to this branch (update-pinned-pytorch-nightly-1.13.0.dev20220801).

comaniac (Contributor) commented Aug 1, 2022

Diagnosis:

  1. Make kl_div a composite function: pytorch/pytorch#80334 removes the dedicated kl_div op and makes it a composite function (see the sketch below).
  2. Replace all CHECK_ and DCHECK_ with TORCH_* macros: pytorch/pytorch#82032 adds the TORCH_ prefix to all CHECK* macros.
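
For reference, here is a minimal eager-mode sketch (my own illustration, not part of this PR) of what the kl_div change looks like from Python: the user-facing call is unchanged, but, as I understand the upstream PR, the op is now decomposed into existing ops instead of dispatching to a dedicated aten::kl_div kernel.

    import torch
    import torch.nn.functional as F

    # Illustration only (not ratex code): the Python-level kl_div API is unchanged
    # by pytorch/pytorch#80334; the op is now composite, i.e. expressed in terms
    # of existing ops rather than a dedicated aten::kl_div kernel.
    log_probs = torch.log_softmax(torch.randn(4, 10), dim=-1)  # log-probabilities
    targets = torch.softmax(torch.randn(4, 10), dim=-1)        # probabilities
    loss = F.kl_div(log_probs, targets, reduction="batchmean")
    print(loss.item())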

comaniac (Contributor) commented Aug 1, 2022

Somehow PyTorch upstream now defers the initialization of a lazy tensor, even for its shape. This causes a failure in jit.script, because we rely on the input shape (and the input is already on the lazy device in the training loop) to convert the PyTorch model to RAF.

While I don't have a clue about where the proper fix should go, I worked around the problem by using the meta device, so that we can still get the input shape without copying the tensor back to the CPU.
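
Roughly, the idea is the following (a minimal sketch with a made-up helper, not the actual ratex change): a tensor on the meta device carries shape and dtype metadata but no storage, so its shape can be read without forcing the lazy tensor to materialize or copying anything back to the CPU.

    import torch

    def make_meta_twin(shape, dtype=torch.float32):
        # A meta tensor records shape/dtype only and allocates no storage, so
        # reading its .shape never triggers a device sync or a host copy.
        return torch.empty(shape, dtype=dtype, device="meta")

    # Hypothetical usage: the training loop keeps the real input on the lazy
    # device, while the model-to-RAF conversion reads shapes from the meta twin.
    meta_input = make_meta_twin((32, 3, 32, 32))
    print(meta_input.shape)  # torch.Size([32, 3, 32, 32]), no data movement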

cc @hzfan @zachzzc @zhouyuan1119

comaniac (Contributor) commented Aug 1, 2022

Hmm, the above solution does not always work. This CI run failed at test_image_model.py:test_compile_lenet_zero1. The stack trace shows:

#3  0x00007fff6ef3680a in torch_lazy_tensors::Helpers::GetPromotedShape (shape1_dims=..., shape2_dims=...)
    at /home/ubuntu/torch_in_conda/ratex/ratex/lazy_tensor_core/csrc/helpers.cpp:235
235         LTC_CHECK(dim1 == dim2 || dim1 == 1 || dim2 == 1)

This is an add op, with dim1=3 and dim2=0. Now I feel this is more like a bug...
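
For context, the LTC_CHECK above mirrors PyTorch's regular broadcasting rule (two dimensions are compatible only if they are equal or one of them is 1), which is why the 3-vs-0 pair trips it. A quick eager-mode illustration, independent of the lazy backend:

    import torch

    a = torch.randn(4, 3)
    b = torch.randn(4, 1)   # a size-1 dimension broadcasts against size 3
    print((a + b).shape)    # torch.Size([4, 3])

    c = torch.randn(4, 0)   # a size-0 dimension, like the dim2=0 in the trace
    try:
        a + c               # 3 vs 0: neither equal nor 1, so broadcasting fails
    except RuntimeError as err:
        print(err)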

@zachzzc please help take a look when you are available. Note that this is against the PyTorch nightly version 20220801, and it is not urgent for us to fix it.
