Question about training loss #27

@zjr2000

Description

Hi,

I’m reproducing continued pre-training with LLaMA Factory, initializing from qwen3-base. The loss starts around 30 in the first few steps, which is much higher than in standard AR training. When I initialize from your released checkpoints instead, the loss starts around 8, which also seems relatively high.
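For context, here is a back-of-envelope check I used: a model that predicts uniformly at random over the vocabulary has cross-entropy ln(V). I'm assuming a vocabulary of roughly 151,936 tokens here (my reading of Qwen's tokenizer; please correct me if that's wrong for qwen3-base), which puts chance level near 11.9, so a starting loss of 30 is well above even a random predictor:

```python
import math

# Sanity check: the cross-entropy of a uniform predictor over a
# vocabulary of size V is ln(V). The vocab size is an assumption
# on my side (~151,936 for Qwen's tokenizer).
vocab_size = 151_936
uniform_loss = math.log(vocab_size)
print(f"uniform-predictor loss: {uniform_loss:.2f}")  # ~11.93
```

So a loss near 30 would mean the model is assigning far lower probability to the targets than uniform guessing, which is why I suspect a configuration issue rather than normal warm-up behavior.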

From your experience, is this expected behavior, or does it suggest a configuration issue on my side?

Thanks!
