AnimateDiff + ELLA

Should be pretty straightforward:
AnimateDiff adds motion (temporal) layers to the UNet,
ELLA works on the prompt side, with an LLM prompt encoder plus a timestep-aware semantic connector, and keeps the UNet weights frozen.

In theory the two should work fine together, since they touch different parts of the pipeline. Let's see.
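
Rough shape-only sketch of why the two should compose, no real models involved. Everything here is a made-up stand-in (ella_conditioning, repeat_for_frames, unet_with_motion, sample are hypothetical names, and the 64 tokens / 768 dims are assumed SD1.5-style numbers, not checked against either codebase). The two points it illustrates: the conditioning depends on the timestep, so it has to be recomputed at every denoising step instead of once per prompt, and it has to be repeated per frame if the spatial layers see frames folded into the batch.

```python
# Toy, shape-only sketch. No tensors, no real AnimateDiff or ELLA code.
# All function names and default sizes below are assumptions for illustration.

def ella_conditioning(batch, timestep, n_tokens=64, dim=768):
    """Stand-in for the timestep-aware connector: output depends on timestep."""
    return {"shape": (batch, n_tokens, dim), "timestep": timestep}


def repeat_for_frames(cond, n_frames):
    """Repeat the conditioning once per frame (frames folded into batch)."""
    b, n, d = cond["shape"]
    return {"shape": (b * n_frames, n, d), "timestep": cond["timestep"]}


def unet_with_motion(latent_shape, cond, timestep, cross_attn_dim=768):
    """Stand-in for a UNet with motion layers: only checks that shapes line up."""
    b, f, c, h, w = latent_shape
    # Spatial cross-attention sees (batch * frames) samples.
    assert cond["shape"][0] == b * f, "conditioning not repeated per frame"
    assert cond["shape"][2] == cross_attn_dim, "conditioning dim mismatch"
    # Conditioning must have been computed for the current timestep.
    assert cond["timestep"] == timestep, "stale conditioning for this timestep"
    return latent_shape  # denoising keeps the latent shape unchanged


def sample(latent_shape, timesteps):
    """Denoising loop: recompute the conditioning at every step."""
    b, f = latent_shape[:2]
    for t in timesteps:
        cond = repeat_for_frames(ella_conditioning(b, t), f)
        latent_shape = unet_with_motion(latent_shape, cond, t)
    return latent_shape


if __name__ == "__main__":
    # 1 video, 16 frames, 4 latent channels, 64x64 latents
    print(sample((1, 16, 4, 64, 64), [999, 500, 1]))
```

If the conditioning were computed once and cached (the usual way with a plain CLIP prompt embedding), the timestep check in unet_with_motion would fail on the second step. That's the main thing to watch when wiring the two together.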