Conversation
Signed-off-by: Hao Wu <skyw@nvidia.com>
Greptile Summary

This PR improves test coverage and CI reporting across several optimizer components: it fixes a shape bug in soap_utils.py (an empty Kronecker factor now yields a 2-D torch.empty(0, 0) instead of a 1-D torch.empty(0)). Key points:
Confidence Score: 4/5
Important Files Changed
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["L0_Tests_CPU.sh"] -->|"torchrun n=8,4"| B["test_distributed_muon_utils_cpu.py"]
    B -->|"rank-suffix XML per process"| C["test-results/tests/\ntest_distributed_muon_utils_cpu_n8_rank0.xml\n..."]
    A -->|"coverage run loop"| D["test_scalar_optimizers.py\ntest_procrustes_step.py"]
    D -->|"XML report"| E["test-results/tests/\ntest_scalar_optimizers.py.xml\n..."]
    F["test_soap.py\n(ScheduleTest, SoapFunctionsTest)"] -.->|"NOT in CI loop"| A
    G["test_soap_utils.py\n(SoapUtilsTest)"] -.->|"NOT in CI loop"| A
    H["soap_utils.py\nget_eigenbasis_eigh / qr"] -->|"empty factor → torch.empty(0,0)"| I["Downstream code\nexpects 2-D tensor"]
    J["eig.py\nmet_approx_eigvals_criteria"] -->|"algebraically equivalent\nreformulation"| K["tolerance * ||K|| ≥ ||K|| - ||diag||"]
```

Last reviewed commit: b4ea359
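The J→K edge above describes the reformulated check in eig.py. A minimal sketch of what that reformulation might look like, assuming Frobenius norms and a conjugate-then-compare structure — the function body below is a guess based only on the flowchart inequality, not the actual eig.py code:

```python
import numpy as np


def met_approx_eigvals_criteria(K: np.ndarray, Q: np.ndarray, tolerance: float = 1e-7) -> bool:
    """Hypothetical sketch: does the candidate eigenbasis Q approximately diagonalize K?"""
    D = Q.T @ K @ Q                        # conjugate K into the candidate eigenbasis
    norm_K = np.linalg.norm(D)             # ||K|| is preserved under orthogonal conjugation
    norm_diag = np.linalg.norm(np.diag(D))  # Frobenius norm of the diagonal part
    # Reformulated inequality from the flowchart: tolerance * ||K|| >= ||K|| - ||diag||.
    # Equivalent to (||K|| - ||diag||) / ||K|| <= tolerance, but with no division,
    # so it stays well-defined when ||K|| is zero.
    return bool(tolerance * norm_K >= norm_K - norm_diag)
```

Under this sketch a true eigenbasis makes D diagonal up to floating-point error, so the left-over off-diagonal mass is far below tolerance * ||K||.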
```python
def test_all_eigenbases_met_criteria_true_eigenbasis_returns_true(self, N: int) -> None:
    kronecker_factor_list = [torch.randn(N, N, device=self.device)]
    eigenbasis_list = [torch.diag(torch.linalg.eigh(K).eigenvalues) for K in kronecker_factor_list]
    self.assertTrue(soap_utils.all_eigenbases_met_criteria(kronecker_factor_list, eigenbasis_list))
```
Wrong eigh attribute used — .eigenvalues instead of .eigenvectors
torch.linalg.eigh returns a named tuple with .eigenvalues (1-D vector λ) and .eigenvectors (N×N orthogonal matrix Q). The test wraps the 1-D eigenvalues in torch.diag(), producing a diagonal eigenvalue matrix D, and passes that to all_eigenbases_met_criteria.
However, the conjugate function (used internally) assumes its second argument is an orthogonal matrix. Passing a diagonal eigenvalue matrix instead breaks this invariant. The met_approx_eigvals_criteria check will compute a meaningless result and likely pass by chance, so the test does not validate the intended mathematical property.
Additionally, K = torch.randn(N, N) on line 200 is not symmetric; calling torch.linalg.eigh on it is undefined behaviour (PyTorch silently uses only the lower triangular part).
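This silent lower-triangular behaviour is easy to demonstrate. NumPy's numpy.linalg.eigh has the same default (UPLO='L'), so a quick sketch of the analogous behaviour:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))         # deliberately non-symmetric
A_sym = np.tril(A) + np.tril(A, -1).T   # symmetrized from A's lower triangle only
# eigh ignores A's upper triangle (UPLO='L' by default), so both calls
# effectively decompose the same symmetric matrix:
w_asym = np.linalg.eigh(A)[0]
w_sym = np.linalg.eigh(A_sym)[0]
```

Here w_asym and w_sym agree exactly even though A itself is not symmetric — the upper triangle of A never influences the result.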
The test should construct a symmetric matrix and use the eigenvectors:
```python
def test_all_eigenbases_met_criteria_true_eigenbasis_returns_true(self, N: int) -> None:
    g = torch.randn(N, N, device=self.device)
    K_sym = g @ g.T + torch.eye(N, device=self.device) * 1e-5  # symmetric PSD
    kronecker_factor_list = [K_sym]
    eigenbasis_list = [torch.linalg.eigh(K_sym).eigenvectors]
    self.assertTrue(soap_utils.all_eigenbases_met_criteria(kronecker_factor_list, eigenbasis_list))
```
@mkhona-nvidia, should we use eigenvalues or eigenvectors?
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Hao Wu <skyw@users.noreply.github.com>
```diff
 exp_avg_sq_for_sim_ademamix,
-num_beta_fast_warmup_steps=None,
+num_beta_fast_warmup_steps=num_beta_fast_warmup_steps,
 min_beta_fast=0.0,
```
Parameterized correct_bias is shadowed by local reassignment
Same issue as in the test_calculate_ademamix_update_with_alpha_zero_equals_adam function: the parameterized correct_bias argument (with values True and False) is immediately overwritten with correct_bias = False on this line. Both parameterized variants therefore always run with correct_bias=False, so the correct_bias=True branch (which would set centered=True in the RMSProp reference) is never actually exercised.
```python
min_beta_fast=0.0,
step = 10
lr = 0.25
```
Remove the correct_bias = False override so the parameterized value flows through to both the calculate_sim_ademamix_update call (line 236) and the centered=correct_bias argument passed to torch.optim.RMSprop (line 252).
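The failure mode is plain Python parameter shadowing. A stripped-down sketch (names here are illustrative, not the actual test code):

```python
# Illustrative only: mimics the bug where a parameterized argument is
# immediately overwritten, so every variant exercises the same branch.
def run_case(correct_bias: bool) -> bool:
    correct_bias = False  # local reassignment shadows the parameterized value
    return correct_bias   # always False, regardless of the caller's argument
```

Both "parameterized" variants collapse to the same behaviour: run_case(True) and run_case(False) both return False, so the True branch is never tested.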
```python
self.assertEqual(len(Q_new_list), 2)
self.assertEqual(Q_new_list[0].shape, (N, N))
self.assertEqual(Q_new_list[1].numel(), 0)
```
Test doesn't verify the actual (0, 0) shape fix
The assertion only checks numel() == 0, which would have passed even before the soap_utils.py fix that changed torch.empty(0) to torch.empty(0, 0). Both a 1-D empty tensor and a 2-D empty tensor have numel() == 0, so this test doesn't actually validate the fix it was written for.
Compare with test_get_eigenbasis_eigh which correctly asserts Q.shape == (0, 0) for the zero-dim case. This test should do the same:
```diff
-self.assertEqual(Q_new_list[1].numel(), 0)
+self.assertEqual(Q_new_list[1].shape, (0, 0))
```
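The distinction is easy to see with empty arrays. A NumPy sketch of the analogous behaviour (torch's numel()/shape distinction works the same way):

```python
import numpy as np

a1 = np.empty(0)        # analogue of torch.empty(0): 1-D, zero elements
a2 = np.empty((0, 0))   # analogue of torch.empty(0, 0): 2-D, zero elements
# An element-count check cannot distinguish the two...
assert a1.size == a2.size == 0
# ...but a shape check can, which is why asserting (0, 0) is the stronger test:
assert a1.shape == (0,)
assert a2.shape == (0, 0)
```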
```python
@parameterized.parameters(
    {"correct_bias": True, "num_beta_fast_warmup_steps": None},
    {"correct_bias": False, "num_beta_fast_warmup_steps": 2},
)
def test_calculate_ademamix_update_with_alpha_zero_equals_adam(
    self, correct_bias: bool, num_beta_fast_warmup_steps: int | None
) -> None:
```
Misleading parameter name num_beta_fast_warmup_steps in AdEMAMix test
The parameterized argument is named num_beta_fast_warmup_steps, but inside the function body it is passed to the calculate_ademamix_update call as num_beta_slow_warmup_steps=num_beta_fast_warmup_steps (line 181). "Fast" and "slow" are separate concepts in AdEMAMix — using the wrong name here can mislead a reader into thinking the wrong warmup schedule is being varied.
By contrast, the sibling test test_calculate_sim_ademamix_update_with_zero_momentum_and_alpha_equals_rmsprop (line 207) uses the same parameter name and passes it correctly as num_beta_fast_warmup_steps=. Renaming the parameter in this test to num_beta_slow_warmup_steps would make the intent clear:
```diff
-@parameterized.parameters(
-    {"correct_bias": True, "num_beta_fast_warmup_steps": None},
-    {"correct_bias": False, "num_beta_fast_warmup_steps": 2},
-)
-def test_calculate_ademamix_update_with_alpha_zero_equals_adam(
-    self, correct_bias: bool, num_beta_fast_warmup_steps: int | None
-) -> None:
+@parameterized.parameters(
+    {"correct_bias": True, "num_beta_slow_warmup_steps": None},
+    {"correct_bias": False, "num_beta_slow_warmup_steps": 2},
+)
+def test_calculate_ademamix_update_with_alpha_zero_equals_adam(
+    self, correct_bias: bool, num_beta_slow_warmup_steps: int | None
+) -> None:
```
And update the usage on line 181 accordingly:

```python
num_beta_slow_warmup_steps=num_beta_slow_warmup_steps,
```
/ok to test b4ea359
```shell
for test in "tests/test_scalar_optimizers.py" "tests/test_procrustes_step.py"; do
    report_name="test-results/${test}.xml"
    coverage run -p --source=emerging_optimizers $test --device=cpu -v -2 --xml_output_file="$report_name" || error=1
done
```
New test files excluded from CI
test_soap.py and test_soap_utils.py are not added to this script, so all the new test cases added in this PR — including ScheduleTest, test_all_eigenbases_met_criteria_*, and test_get_eigenbasis_qr_empty_factor — will not be executed in CI.
The for-loop only covers test_scalar_optimizers.py and test_procrustes_step.py. Since the PR's stated goal is to "improve test coverage and report", the new tests should also be wired into the CI script, for example:
```shell
for test in "tests/test_scalar_optimizers.py" "tests/test_procrustes_step.py" "tests/test_soap.py" "tests/test_soap_utils.py"; do
    report_name="test-results/${test}.xml"
    coverage run -p --source=emerging_optimizers $test --device=cpu -v -2 --xml_output_file="$report_name" || error=1
done
```

```python
def test_all_eigenbases_met_criteria_random_eigenbasis_returns_false(self, N: int) -> None:
    kronecker_factor_list = [torch.randn(N, N, device=self.device)]
    eigenbasis_list = [torch.diag(torch.randn(N, device=self.device))]
    self.assertFalse(soap_utils.all_eigenbases_met_criteria(kronecker_factor_list, eigenbasis_list))
```
Flaky negative test — eigenbasis is not a valid orthonormal matrix
torch.diag(torch.randn(N, device=self.device)) produces a random diagonal matrix, not a random orthonormal matrix. The all_eigenbases_met_criteria function documents that eigenbasis_list should contain "orthonormal eigenbases", and internally calls eig_utils.conjugate(kronecker_factor, eigenbasis, diag=True) which assumes p is orthogonal.
While the test is very unlikely to fail in practice (the 1e-7 tolerance is extremely tight), a semantically correct negative test should use a randomly-rotated but misaligned orthonormal basis — e.g. via QR — rather than a random diagonal matrix that violates the function's own preconditions:
```python
Q_random = torch.linalg.qr(torch.randn(N, N, device=self.device)).Q
eigenbasis_list = [Q_random]
```

Using a proper orthonormal matrix makes the test verify the intended contract ("a correct orthonormal basis that doesn't diagonalize this particular matrix fails the criteria") rather than relying on an invalid input to produce the desired False return.
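A quick check that the QR-based basis actually satisfies the orthonormality precondition, sketched with NumPy's equivalent numpy.linalg.qr:

```python
import numpy as np

rng = np.random.default_rng(0)
# Q from the QR decomposition of a random square matrix is orthonormal:
Q_random = np.linalg.qr(rng.standard_normal((5, 5)))[0]
# Q.T @ Q recovers the identity (up to floating-point error), unlike a
# random diagonal matrix, which generally satisfies no such constraint.
orthonormal_err = np.abs(Q_random.T @ Q_random - np.eye(5)).max()
```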
Test Results

46 files +12   96 suites +38   1m 27s ⏱️ +18s

Results for commit b4ea359. ± Comparison against base commit 7056267.

This pull request removes 2 and adds 40 tests. Note that renamed tests count towards both.
Half of this was done by Claude Code, and I reviewed the AI-written code.
Some tiny bugs were fixed along with it.