In model.py, the following code appears:
```python
with torch.no_grad():
    ref_outputs = self.codi(input_ids=ref_input_ids, output_hidden_states=True, attention_mask=ref_attention_mask)
ref_outputs_with_grad = self.codi(input_ids=ref_input_ids, output_hidden_states=True, attention_mask=ref_attention_mask)
```
Since these are two independent forward passes over the same inputs, their outputs may differ slightly (e.g., due to dropout in training mode or non-deterministic CUDA kernels).
Is the `torch.no_grad()` pass primarily intended to reduce GPU memory usage?
Would it be feasible to run a single forward pass with gradients and call `.detach()` on its output to obtain the reference values, achieving the same effect while avoiding the redundant computation?
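For reference, here is a minimal sketch of the single-forward-pass alternative being asked about. A small `nn.Linear` stands in for `self.codi` (which is assumed to be a Hugging Face-style model) so the snippet runs on its own. Note that `.detach()` gives identical values and blocks gradient flow, but unlike `torch.no_grad()` it does not free the activations saved for backward, so peak memory is not reduced:

```python
import torch
import torch.nn as nn

# Tiny stand-in for self.codi, used only so this sketch is self-contained.
model = nn.Linear(8, 8)
ref_input = torch.randn(4, 8)

# One forward pass that keeps the autograd graph...
out_with_grad = model(ref_input)

# ...and a detached view of the same tensor for "reference" use.
# .detach() shares storage with out_with_grad, so the values are
# bit-identical, but the graph built for out_with_grad still holds
# its saved activations until backward() or deletion.
ref_out = out_with_grad.detach()

assert torch.equal(ref_out, out_with_grad)  # same values, no drift
assert not ref_out.requires_grad            # gradients blocked
assert out_with_grad.requires_grad          # original graph intact
```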