Skip to content

Conversation

@2015aroras
Copy link
Collaborator

@2015aroras 2015aroras commented Oct 7, 2024

Issue: If we make a backward incompatible change or a regression, we don't have a mechanism to catch it. Also, if we start running jobs on a new platform we don't have an easy way to tell if the training will be identical to existing platforms.

Fix: Add training regression tests that run 2 steps of training and compare the model activations against an already-prepared set of model activations from beaker. The saved model activations can be updated by changing the flags passed to the tests.

The added tests also run on CPU, but this required making minor changes to OLMo training code. Autocast works differently on CPU, so the model activations are different for CPU compared to GPU.

The saved model-activations are about 26Mb total, which is a reasonable increase to repo size...

@2015aroras 2015aroras changed the title Shanea/add training regression tests Add regression tests for training Oct 7, 2024
@2015aroras 2015aroras marked this pull request as ready for review October 7, 2024 22:09
@2015aroras 2015aroras requested review from dirkgr and epwalsh October 7, 2024 22:09
@epwalsh
Copy link
Member

epwalsh commented Oct 8, 2024

I'm in favor of adding trainer tests but I have a couple concerns about this approach, the foremost being that I think comparing activations is way too brittle and strict. It would be enough to just check that loss is decreasing over a few steps.

we start running jobs on a new platform we don't have an easy way to tell if the training will be identical to existing platforms.

I wouldn't assume training would or even should be identical across platforms.

The saved model-activations are about 26Mb total, which is a reasonable increase to repo size...

I think this is a lot actually especially considering new versions of these may have to committed over time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants