Hi.
Thank you for this great contribution. I would like to know whether MRFI can be used at training time, e.g. to perform fault-tolerant training. As I understand it, faults are injected via PyTorch hooks, but will gradients propagate correctly during training in the presence of the injected faults (i.e. for the outputs affected by an injection, is the output correctly decoupled from the input in the backward pass)?
Taking a look at the code, I see the forward hooks added in mrfi.MRFI.__add_hoks(), but I don't see any backward hooks implemented. So I assume that MRFI currently can't be used at training time. Please correct me if I'm wrong.
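For context, here is a minimal sketch (not MRFI's actual API) of why this matters: if a forward hook modifies the module output *out-of-place* with differentiable operations, autograd traces the modification and the backward pass accounts for the fault automatically, with no backward hook needed. The toy model, hook, and stuck-at-zero mask below are all hypothetical names chosen for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy layer standing in for a module under fault injection.
model = nn.Linear(4, 3)

def fault_hook(module, inputs, output):
    # Simulate a stuck-at-zero fault on output channel 0.
    # Masking out-of-place keeps the op in the autograd graph,
    # so gradients through the faulted channel are zeroed too.
    mask = torch.ones_like(output)
    mask[..., 0] = 0.0
    return output * mask

model.register_forward_hook(fault_hook)

x = torch.randn(2, 4, requires_grad=True)
y = model(x)
y.sum().backward()

# Weight rows feeding the faulted channel receive zero gradient,
# because the mask participated in the backward pass.
print(model.weight.grad[0])  # zeros: channel 0 was masked
print(model.weight.grad[1])  # generally nonzero
```

By contrast, if an injector writes faults in-place under `torch.no_grad()` or on detached tensors (as bit-flip injectors often do), the backward pass will compute gradients as if no fault occurred, which is the decoupling concern raised above.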
If that's the case, are there any plans to implement this feature? It would be very useful for building fault-tolerant training pipelines.
Thanks