Conversation

@MaHaWo (Collaborator) commented Nov 28, 2025

  • refactor Trainer and TrainerDDP
  • move the JSON schema and a more general optimizer setup into the Trainer class
  • generalize and unify initialization of models, optimizers, early_stopping, validator, tester... (see the config sketch after this list)
  • adjust tests
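
For illustration, a config-driven setup along the following lines is what the description suggests. All key names below ("model", "optimizer", "criterion", ...) are assumptions, not taken from the diff; check the package's JSON schema for the real ones.

    # Hypothetical sketch of the config-driven Trainer API; key names assumed.
    from QuantumGrav.train import Trainer  # module path as it appears in this PR

    config = {
        "model": {"name": "example_model", "hidden_dim": 64},
        "optimizer": {"type": "Adam", "lr": 1e-3},
        "criterion": "cross_entropy",
        "early_stopping": {"patience": 10},
        "training": {"epochs": 100, "checkpoint_path": "./checkpoints"},
    }

    trainer = Trainer(config)  # model, optimizer, validator, tester built internally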

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 10 comments.

Comments suppressed due to low confidence (3)

QuantumGravPy/src/QuantumGrav/train.py:710

  • The docstring parameter type doesn't match the actual type annotation. The parameter is typed as pd.DataFrame in the signature, but the docstring says list[Any]. Update the docstring to reflect the correct type.
        """Check the status of the model during training.

        Args:
            eval_data (list[Any]): The evaluation data from the training epoch.
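
A minimal corrected version, keeping everything else as-is (the enclosing method name is not visible in the snippet, so it is assumed here):

    def _check_model_status(self, eval_data: pd.DataFrame) -> None:  # name assumed
        """Check the status of the model during training.

        Args:
            eval_data (pd.DataFrame): The evaluation data from the training epoch.
        """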

QuantumGravPy/src/QuantumGrav/train.py:320

  • The docstring is outdated. The parameters criterion, apply_model, early_stopping, validator, and tester are no longer function parameters; they are now extracted from the config dictionary. The docstring should only document the config parameter and describe what keys it should contain.
        """Initialize the trainer.

        Args:
            config (dict[str, Any]): The configuration dictionary.
            criterion (Callable): The loss function to use.
            apply_model (Callable | None, optional): A function to apply the model. Defaults to None.
            early_stopping (Callable[[Collection[Any]], bool] | None, optional): A function for early stopping. Defaults to None.
            validator (DefaultValidator | None, optional): A validator for model evaluation. Defaults to None.
            tester (DefaultTester | None, optional): A tester for model evaluation. Defaults to None.
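
A sketch of the suggested rewrite; the description of the config keys is inferred from the removed parameters and should be checked against the actual JSON schema:

    def __init__(self, config: dict[str, Any]):
        """Initialize the trainer.

        Args:
            config (dict[str, Any]): The configuration dictionary. Among other
                keys, it is expected to specify the loss criterion, an optional
                apply_model callable, and optional early_stopping, validator,
                and tester settings (exact key names per the JSON schema).
        """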

QuantumGravPy/src/QuantumGrav/train_ddp.py:110

  • The docstring is outdated. The parameters criterion, apply_model, early_stopping, validator, and tester are no longer function parameters; they are now extracted from the config dictionary. The docstring should only document the rank and config parameters.
        """Initialize the distributed data parallel (DDP) trainer.

        Args:
            rank (int): The rank of the current process.
            config (dict[str, Any]): The configuration dictionary.
            criterion (Callable): The loss function.
            apply_model (Callable | None, optional): The function to apply the model. Defaults to None.
            early_stopping (Callable[[list[dict[str, Any]]], bool] | None, optional): The early stopping function. Defaults to None.
            validator (DefaultValidator | None, optional): The validator for model evaluation. Defaults to None.
            tester (DefaultTester | None, optional): The tester for model testing. Defaults to None.
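
Analogously for the DDP trainer (again a sketch, with details assumed):

    def __init__(self, rank: int, config: dict[str, Any]):
        """Initialize the distributed data parallel (DDP) trainer.

        Args:
            rank (int): The rank of the current process.
            config (dict[str, Any]): The configuration dictionary; see the
                JSON schema for the keys it must contain.
        """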

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 11 comments.

Comments suppressed due to low confidence (1)

QuantumGravPy/src/QuantumGrav/train.py:877

  • Missing validation: The docstring at line 863 states "Raises: ValueError: If the model is not initialized" but there's no check for self.model is None before calling self.model.save(outpath) at line 877. This will result in an AttributeError instead of the documented ValueError. Since the model is now initialized in __init__, this might be intentional, but the docstring should be updated to reflect the actual behavior or the check should be added back.
    def save_checkpoint(self, name_addition: str = ""):
        """Save model checkpoint.

        Raises:
            ValueError: If the model is not initialized.
            ValueError: If the model configuration does not contain 'name'.
            ValueError: If the training configuration does not contain 'checkpoint_path'.
        """
        self.logger.info(
            f"Saving checkpoint for model at epoch {self.epoch} to {self.checkpoint_path}"
        )
        outpath = self.checkpoint_path / f"model_{name_addition}.pt"

        if outpath.exists() is False:
            outpath.parent.mkdir(parents=True, exist_ok=True)
            self.logger.debug(f"Created directory {outpath.parent} for checkpoint.")

        self.latest_checkpoint = outpath
        self.model.save(outpath)
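
One way to add the check back, sketched against the snippet above (only the last two lines change):

    if self.model is None:
        raise ValueError("Model is not initialized; cannot save checkpoint.")

    self.latest_checkpoint = outpath
    self.model.save(outpath)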

@MaHaWo mentioned this pull request Dec 12, 2025

@mephistoteles-whatever (Collaborator) left a comment

Looks good. I've got some small comments. My main question was about determinism and the (non-)seeding of numpy in the trainer class.
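
For reference, a typical way to seed all relevant RNGs would look like the sketch below; this is a generic recipe, not what the trainer currently does.

    import random

    import numpy as np
    import torch

    def seed_everything(seed: int) -> None:
        """Seed Python, NumPy, and PyTorch RNGs for reproducible training runs."""
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # deferred no-op when CUDA is unavailable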

@MaHaWo (Collaborator, Author) commented Dec 17, 2025

Added remarks, and added the option to use learning rate schedulers. These are useful to stop optimizers from jumping around a minimum in the later stages of optimization, while still allowing large steps at the start so they can escape bad minima.
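
For example, with a standard PyTorch scheduler (a generic sketch of the idea; how the Trainer wires this up via the config may differ):

    import torch

    model = torch.nn.Linear(10, 2)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
    # Halve the learning rate every 10 epochs: large steps early, small steps late.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

    for epoch in range(30):
        # ... forward/backward passes and optimizer.step() per batch go here ...
        optimizer.step()  # placeholder so the scheduler is stepped after the optimizer
        scheduler.step()  # adjust the learning rate once per epoch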

@MaHaWo merged commit 092e627 into main Dec 17, 2025
8 checks passed