Megatron support

### 🚀 The feature, motivation and pitch

OLMo is currently trained with FSDP, but we want to reproduce the results with Megatron.

### Alternatives

_No response_

### Additional context

_No response_