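Should be pretty straightforward. AnimateDiff adds motion layers to the UNet, while ELLA works on the conditioning side with a timestep-aware semantic connector and the prompt encoder, so the two touch different parts of the pipeline. In theory they should work fine together; let's see.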
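
To make the "in theory" part a bit more concrete, here is a rough shape-only sketch of one denoising step. It is not the AnimateDiff or ELLA code: every function name is hypothetical, and the sizes (64 conditioning tokens of width 768, 320 latent channels) are assumptions picked for illustration. It only tracks tensor shapes, to show that the conditioning path and the temporal path do not collide, and that the conditioning would probably need to be recomputed per timestep since the connector is timestep-aware.

```python
from typing import Tuple

Shape = Tuple[int, ...]


def ella_conditioning(batch: int, timestep: int,
                      tokens: int = 64, dim: int = 768) -> Shape:
    """Hypothetical stand-in for the timestep-aware connector.

    Only the output shape is modelled; `timestep` is accepted to stress that
    the conditioning is a function of the current step (sizes are assumed).
    """
    return (batch, tokens, dim)


def repeat_for_frames(cond: Shape, frames: int) -> Shape:
    """Repeat the conditioning so every video frame sees the same prompt."""
    b, n, d = cond
    return (b * frames, n, d)


def spatial_cross_attention(latents: Shape, cond: Shape) -> Shape:
    """Cross-attention consumes the conditioning; latent shape is unchanged."""
    if cond[0] != latents[0]:
        raise ValueError("conditioning batch must equal batch * frames")
    return latents


def temporal_attention(latents: Shape, frames: int) -> Shape:
    """Motion layer: attends across frames and ignores the text conditioning."""
    bf, c, h, w = latents
    if bf % frames != 0:
        raise ValueError("leading dim must be a multiple of the frame count")
    b = bf // frames
    _attention_view = (b * h * w, frames, c)  # layout used while attending
    return latents  # reshaped back to (batch * frames, c, h, w)


def denoise_step(batch: int, frames: int, timestep: int,
                 channels: int = 320, height: int = 64,
                 width: int = 64) -> Tuple[Shape, Shape]:
    """One toy step: recompute conditioning, then spatial and temporal attention."""
    latents: Shape = (batch * frames, channels, height, width)
    cond = repeat_for_frames(ella_conditioning(batch, timestep), frames)
    latents = spatial_cross_attention(latents, cond)
    latents = temporal_attention(latents, frames)
    return latents, cond
```

If this picture holds, the main thing to watch for is any pipeline that encodes the prompt once up front and reuses it for every step; with a timestep-aware connector that call would likely have to move inside the sampling loop.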