
Spectron optimizer for low-rank LLM pretraining #104

Draft
mkhona-nvidia wants to merge 15 commits into NVIDIA-NeMo:main from mkhona-nvidia:mkhona/spectron

Conversation

@mkhona-nvidia
Contributor

@mkhona-nvidia mkhona-nvidia commented Feb 17, 2026

Added the Spectron optimizer

Also added power iteration and a Rayleigh quotient method for estimating the spectral norm to utils/eig.py

Based on https://arxiv.org/abs/2602.12429

Signed-off-by: mikail <mkhona@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Feb 17, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@mkhona-nvidia mkhona-nvidia self-assigned this Feb 17, 2026
@mkhona-nvidia mkhona-nvidia requested a review from skyw February 17, 2026 03:44
@greptile-apps

greptile-apps bot commented Feb 17, 2026

Greptile Summary

Adds Spectron, a low-rank spectral optimizer with orthogonalized momentum for LLM pretraining based on https://arxiv.org/abs/2602.12429. Maintains weights as low-rank factorizations W = A @ B^T, applies momentum with Newton-Schulz orthogonalization, and scales learning rates by spectral radii.

Major changes:

  • New Spectron optimizer class with SVD-based initialization and low-rank factor updates
  • power_iteration function in utils/eig.py for spectral norm estimation
  • Comprehensive test suite with 13 test cases
  • Integration into CI/CD pipelines
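
The power_iteration spectral-norm estimator mentioned above can be sketched in pure Python. This is a hypothetical illustration of the idea (helper names are made up); the actual utils/eig.py implementation operates on torch tensors:

```python
import math

def matvec(M, v):
    # M is a list of rows; returns M @ v
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def spectral_norm(M, num_iters=50):
    """Estimate the largest singular value of M via power iteration
    on M^T M, finishing with a Rayleigh quotient.
    Hypothetical pure-Python sketch, not the library's API."""
    Mt = transpose(M)
    v = [1.0] * len(M[0])
    for _ in range(num_iters):
        w = matvec(Mt, matvec(M, v))          # v <- M^T M v
        n = math.sqrt(sum(x * x for x in w))  # renormalize
        v = [x / n for x in w]
    w = matvec(Mt, matvec(M, v))
    # Rayleigh quotient v^T (M^T M) v ~= sigma_max^2
    return math.sqrt(sum(vi * wi for vi, wi in zip(v, w)))

# diag(3, 1) has spectral norm 3
print(round(spectral_norm([[3.0, 0.0], [0.0, 1.0]]), 6))  # → 3.0
```

Iterating on M^T M rather than M keeps the sketch valid for rectangular factors, which is the case the optimizer cares about.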

Critical issues preventing production use:

  • Dtype mismatch bugs will cause runtime failures with bfloat16 parameters (the standard dtype for LLM pretraining)
  • Tensor used as scalar in add_ operation may cause issues under torch.compile
  • No test coverage for mixed-precision dtypes that would catch these bugs

Confidence Score: 1/5

  • Critical runtime failures expected with bfloat16 parameters - the stated use case for this optimizer
  • Multiple critical dtype-related bugs will cause runtime errors when using bfloat16 parameters (standard for LLM pretraining). The gradient@factor matmul on line 179 and momentum updates on line 187 will fail due to dtype mismatches between bfloat16 gradients and float32 factors. These are not edge cases - they affect the primary use case stated in the docstring.
  • emerging_optimizers/orthogonalized_optimizers/spectron.py requires dtype casting fixes before merge, tests/test_spectron.py needs bfloat16 test coverage

Important Files Changed

| Filename | Overview |
| --- | --- |
| emerging_optimizers/orthogonalized_optimizers/spectron.py | New Spectron optimizer implementation with critical dtype bugs preventing bfloat16 usage in LLM pretraining |
| emerging_optimizers/utils/eig.py | Added power_iteration function implementing Algorithm 3 from the Spectron paper; clean implementation |
| tests/test_spectron.py | Comprehensive test suite, but missing the bfloat16 coverage that would catch the dtype bugs |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    Start([Optimizer Step]) --> CheckGrad{Gradient exists?}
    CheckGrad -->|No| End([Skip parameter])
    CheckGrad -->|Yes| InitCheck{First step?}
    
    InitCheck -->|Yes| SVDInit[SVD Initialization:<br/>W = U·S·V^T<br/>A = U·√S, B = V·√S]
    SVDInit --> InitState[Initialize:<br/>momentum_A, momentum_B<br/>u_A, u_B vectors]
    InitState --> Compute
    
    InitCheck -->|No| Compute[Compute factor gradients:<br/>grad_A = grad @ B<br/>grad_B = grad^T @ A]
    
    Compute --> WD[Apply weight decay<br/>to both factors]
    WD --> Momentum[Update momentum:<br/>momentum_A ← β·momentum_A + (1-β)·grad_A<br/>momentum_B ← β·momentum_B + (1-β)·grad_B]
    
    Momentum --> NS[Orthogonalize using<br/>Newton-Schulz iteration<br/>requires float32]
    
    NS --> PowerIter[Power iteration:<br/>estimate σ_A, σ_B<br/>spectral radii]
    
    PowerIter --> Scale[Scale learning rate:<br/>η_scaled = η / (σ_A + σ_B + 1)]
    
    Scale --> Update[Update factors:<br/>A ← A - η_scaled·orth_momentum_A<br/>B ← B - η_scaled·orth_momentum_B]
    
    Update --> Reconstruct[Reconstruct weight:<br/>W ← A @ B^T]
    
    Reconstruct --> End
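The Newton-Schulz orthogonalization step in the flowchart can be sketched with the simplest (quadratic) iteration. This is a hedged pure-Python illustration under the assumption that singular values lie in (0, √3); it is not the tuned float32 polynomial the optimizer actually uses, and real implementations normalize the input first (e.g. by its Frobenius norm) to guarantee that precondition:

```python
def matmul(A, B):
    # plain list-of-rows matrix multiply
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def newton_schulz(M, num_iters=12):
    """Drive the singular values of M toward 1 via the quadratic
    iteration X <- 1.5*X - 0.5*(X X^T) X, which converges to the
    orthogonal polar factor when all singular values are in (0, sqrt(3)).
    Hypothetical sketch only."""
    X = [row[:] for row in M]
    for _ in range(num_iters):
        Xt = [list(col) for col in zip(*X)]
        XXtX = matmul(matmul(X, Xt), X)
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)]
             for rx, ry in zip(X, XXtX)]
    return X

# singular values 0.5 and 0.8 are both pushed toward 1
Q = newton_schulz([[0.5, 0.0], [0.0, 0.8]])
```

Each scalar singular value s follows s ← 1.5s - 0.5s³, whose attracting fixed point is 1, so the iterate approaches an orthogonal matrix without ever computing an SVD.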

Last reviewed commit: d2686bb


@greptile-apps greptile-apps bot left a comment


7 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

from emerging_optimizers.orthogonalized_optimizers.orthogonalized_optimizer import *
from emerging_optimizers.orthogonalized_optimizers.scion import *
from emerging_optimizers.orthogonalized_optimizers.spectral_clipping_utils import *
from emerging_optimizers.orthogonalized_optimizers.spectron import *
\ No newline at end of file

Missing trailing newline

The file is missing a trailing newline after the new import line. This is flagged by most linters and POSIX standards, and the previous version of the file had one.

Suggested change
from emerging_optimizers.orthogonalized_optimizers.spectron import *
from emerging_optimizers.orthogonalized_optimizers.spectron import *

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!


@greptile-apps greptile-apps bot left a comment


7 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile


@greptile-apps greptile-apps bot left a comment


7 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile


@greptile-apps greptile-apps bot left a comment


7 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

@mkhona-nvidia
Contributor Author

/ok to test 326f3f6


@greptile-apps greptile-apps bot left a comment


7 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

@mkhona-nvidia
Contributor Author

/ok to test 326f3f6

@mkhona-nvidia mkhona-nvidia requested a review from a team February 17, 2026 17:50
factor_B.add_(orth_momentum_B, alpha=-scaled_lr)

# Reconstruct full weight matrix: W = A @ B^T
p.copy_(factor_A @ factor_B.mT)


I am guessing this reconstruction is for compatibility with the rest of the library. Otherwise, the whole implementation looks correct.

Contributor Author


I leave the model weights as a single dense matrix and keep the low-rank decomposition as optimizer state (rather than storing the low-rank factors as two separate matrices in the model, which would make them harder to access inside the optimizer). This is functionally identical but makes the software easier to use.


I agree


@Pauljanson002 Pauljanson002 left a comment


This implementation is correct, with one minor difference: in our work we train the models with only the factors. Here the model weights remain in dense form while optimization happens on the low-rank factors, reducing optimizer state.
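
The layout both reviewers describe (dense weight in the model, factors living only in optimizer state) can be sketched as follows. All names here are hypothetical; the real optimizer works on torch parameters and registers the factors in its state dict:

```python
def matmul(A, B):
    # plain list-of-rows matrix multiply
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

# Hypothetical sketch: the model holds only the dense weight W, while
# the optimizer state holds the low-rank factors A (m x r) and B (n x r).
A = [[1.0], [0.0]]   # m x r with m=2, r=1
B = [[2.0], [3.0]]   # n x r with n=2, r=1

# ... factor updates (momentum, orthogonalization, scaled lr) happen here ...

# After updating the factors, overwrite the dense weight: W = A @ B^T,
# so the rest of the library only ever sees an ordinary dense parameter.
W = matmul(A, transpose(B))
print(W)  # → [[2.0, 3.0], [0.0, 0.0]]
```

This mirrors the `p.copy_(factor_A @ factor_B.mT)` line in the diff: the factorization is an optimizer-internal detail, invisible to the model definition.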


@greptile-apps greptile-apps bot left a comment


7 files reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

Comment on lines +178 to +181
with utils.fp32_matmul_precision("highest"):
    grad_A = grad @ factor_B  # shape: (m, r)
    grad_B = grad.mT @ factor_A  # shape: (n, r)



Gradient dtype mismatch with non-fp32 parameters

grad = p.grad inherits p's dtype, but factor_B is always float32 (initialized from torch.linalg.svd(p.float(), ...)). When the parameter is bfloat16 — the standard dtype for LLM pretraining, which is the stated use case — the line grad @ factor_B will raise a RuntimeError at runtime:

RuntimeError: expected scalar type Float but found BFloat16

Even if PyTorch silently promotes the dtype in some contexts, momentum_A.lerp_(grad_A, ...) on line 187 will then fail because momentum_A is float32 but grad_A would be bfloat16.

The gradient should be explicitly cast to float32 before the matmul:

Suggested change
with utils.fp32_matmul_precision("highest"):
    grad_A = grad @ factor_B  # shape: (m, r)
    grad_B = grad.mT @ factor_A  # shape: (n, r)
with utils.fp32_matmul_precision("highest"):
    grad_A = grad.float() @ factor_B  # shape: (m, r)
    grad_B = grad.float().mT @ factor_A  # shape: (n, r)


@greptile-apps greptile-apps bot left a comment


7 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile


@greptile-apps greptile-apps bot left a comment


7 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

@skyw skyw marked this pull request as draft February 23, 2026 16:14