
Conversation

@HosseinKaviani-H (Contributor)

  • Move compile from training.compile to compile.enable in llama3_8b.yaml
  • Move compile from training.compile to compile.enable in qwen3_8b.yaml
  • Update main.py to use job_config.compile.enable instead of job_config.training.compile

This aligns the YAML config structure with ForgeJobConfig's expected schema, where compile is a separate top-level dataclass, not nested under training.

See: https://github.com/pytorch/torchtitan/blob/29aafb91b7fbffe2ee259919a3249a0eb1d70779/torchtitan/experiments/forge/job_config.py#L42 where compile is a class not under training.
The meta-cla bot added the "CLA Signed" label (managed by the Meta Open Source bot), Dec 23, 2025.
@felipemello1 (Contributor) commented Dec 23, 2025

We have to compile the reference model, the training model, the loss, and the generator. If we do `compile.enable`, how do we manage those? I think the solution should be the same for SFT/GRPO. Please take a look and cmd+F "compile" here: https://github.com/meta-pytorch/torchforge/blob/main/apps/grpo/llama3_8b.yaml

I am inclined to think it's better for each actor to have its own `compile: bool`, rather than a single top-level `compile`, so the user can manage it at the actor level, as in the GRPO config. SFT doesn't have this problem because it's a single actor.

I do see the value in your PR though, because it brings our YAML closer to the JobConfig. Not sure how to reconcile the two.

@HosseinKaviani-H (Contributor, Author) commented Dec 23, 2025

> We have to compile the reference model, the training model, the loss, and the generator. If we do `compile.enable`, how do we manage those? I think the solution should be the same for SFT/GRPO. Please take a look and cmd+F "compile" here: https://github.com/meta-pytorch/torchforge/blob/main/apps/grpo/llama3_8b.yaml
>
> I am inclined to think it's better for each actor to have its own `compile: bool`, and the user can manage it at the actor level, as in the GRPO config. SFT doesn't have this problem because it's a single actor.
>
> I do see the value in your PR though, because it brings our YAML closer to the JobConfig. Not sure how to reconcile the two.

@felipemello1 This is a different issue, though. SFT and GRPO have fundamental differences, as you mentioned: GRPO has multiple actors (trainer, ref_model, generator) that each need independent compile control, while SFT is a single actor.

I see two goals in this PR:

1. Align with TorchTitan's YAML config schema. In ForgeJobConfig, `compile` is a top-level dataclass (`Compile`), not a field under `Training`. Keeping `training.compile` causes `Training.__init__() got an unexpected keyword argument 'compile'` when the config is parsed.

2. Enable future refactoring. For the TitanTrainer refactor, `compile` not being part of the `Training` dataclass means we'd need a monkey-patch or special handling to pop it from the config before passing it to the trainer. Moving it to the top level avoids this.
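The schema mismatch can be sketched with simplified stand-in dataclasses (illustrative only; the real definitions live in `torchtitan/experiments/forge/job_config.py` and have many more fields):

```python
from dataclasses import dataclass, field

# Simplified stand-ins for the torchtitan dataclasses (illustration only).
@dataclass
class Training:
    steps: int = 1000  # note: no `compile` field here

@dataclass
class Compile:
    enable: bool = False  # compile lives in its own top-level dataclass

@dataclass
class ForgeJobConfig:
    training: Training = field(default_factory=Training)
    compile: Compile = field(default_factory=Compile)

# A YAML `training.compile: true` key becomes Training(compile=True),
# which the dataclass rejects with a TypeError:
try:
    Training(compile=True)  # type: ignore[call-arg]
    raised = False
except TypeError:
    raised = True
print(raised)  # True

# The top-level form maps cleanly onto the schema:
cfg = ForgeJobConfig(compile=Compile(enable=True))
print(cfg.compile.enable)  # True
```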

For GRPO's multi-actor case, the current pattern already works well:

```yaml
compile: true  # top-level toggle
trainer:
  compile:
    enable: ${compile}
ref_model:
  compile:
    enable: ${compile}
```

This PR just makes SFT consistent with that same structure. Each recipe can still manage compile at the actor level as needed.
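The same fan-out can be sketched in plain Python (names are illustrative, not the actual torchforge actor configs): a single toggle propagates to every actor's `compile.enable`, and any actor can still be overridden independently, mimicking the `enable: ${compile}` interpolation above.

```python
from dataclasses import dataclass, field

@dataclass
class Compile:
    enable: bool = False

@dataclass
class ActorConfig:
    compile: Compile = field(default_factory=Compile)

def build_actor_configs(compile_all, overrides=None):
    """Fan a single top-level toggle out to every actor, with
    optional per-actor overrides (hypothetical helper)."""
    overrides = overrides or {}
    return {
        name: ActorConfig(Compile(enable=overrides.get(name, compile_all)))
        for name in ("trainer", "ref_model", "generator")
    }

# Global toggle on, but the generator opts out at the actor level.
cfgs = build_actor_configs(True, overrides={"generator": False})
print(cfgs["trainer"].compile.enable)    # True
print(cfgs["generator"].compile.enable)  # False
```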

@JenniferWang (Contributor) left a comment

ForgeJobConfig is a Titan-specific integration detail. I'd recommend starting by consolidating the forge trainer interface to make backend-specific configurable parameters clearer.
