Problem
The current data loader in prepare.py strictly packs documents to MAX_SEQ_LEN (2048). Training on the full context length from step 0 is often computationally wasteful, since attention cost grows quadratically with sequence length, and it slows down early convergence.
Proposal
Modify the data loader to support sequence-length warmup (a form of curriculum learning). For example, start training with a sequence length of 256 for the first 10% of the training budget, then progressively double it until reaching 2048. This requires updating the make_dataloader logic to resize the sequence packing on the fly.
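One possible shape for the schedule is sketched below. This is a hypothetical helper, not existing code in prepare.py; the interpretation (hold 256 for the first 10% of the budget, then double at each subsequent 10% milestone until 2048) and the parameter names are assumptions:

```python
def seq_len_at(progress: float, start: int = 256, max_len: int = 2048,
               stage_frac: float = 0.10) -> int:
    """Packing length for a given training progress in [0, 1].

    Hypothetical sketch: holds `start` for the first `stage_frac` of the
    training budget, then doubles at each subsequent `stage_frac`
    milestone until `max_len` is reached.
    """
    stage = int(progress / stage_frac)  # 0 during the first 10%, then 1, 2, ...
    return min(start << stage, max_len)
```

make_dataloader could call this once per stage boundary and re-pack the token stream at the new length. Note that if the global batch size is held constant in sequences rather than tokens, shorter stages use proportionally fewer tokens per step; keeping the token budget per step fixed (larger batches at shorter lengths) may be preferable.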