Conversation

@thanay-sisir (Contributor)

⚡ Optimization Summary: compute_synchronisation

1. Technical Mechanism

  • Vectorization: Refactors the operation from a dense $O(N^2)$ outer product to a sparse, index-driven Hadamard product using torch.triu_indices (see the sketch after this summary).
  • Memory Efficiency: Eliminates the intermediate $(B, N, N)$ tensor allocation, reducing auxiliary memory complexity from $O(B \cdot N^2)$ to $O(1)$.
  • Device Awareness: Enforces explicit device placement for indices to prevent implicit host-to-device transfer overhead.

2. Stability & Scalability

  • Prevents "Quadratic Trap": Removes the bottleneck where memory usage scales quadratically with $d_{model}$, avoiding OOM errors on larger model configurations.
  • Eliminates Compute Waste: Bypasses the calculation of the symmetric lower triangle, saving $\approx 50\%$ of the FLOPs and memory writes previously spent on discarded data.
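A minimal before/after sketch of the change is below. The function names, the $(B, N)$ input shape, and the assumption that the upper triangle (including the diagonal) is what gets retained are illustrative only; the actual signature of compute_synchronisation in this repo may differ.

```python
import torch

def sync_dense(x: torch.Tensor) -> torch.Tensor:
    """Old formulation (sketch): dense (B, N, N) outer product,
    of which only the upper triangle is used downstream."""
    outer = x.unsqueeze(2) * x.unsqueeze(1)          # (B, N, N) intermediate
    i, j = torch.triu_indices(x.size(1), x.size(1))
    return outer[:, i, j]                            # (B, N*(N+1)//2)

def sync_sparse(x: torch.Tensor) -> torch.Tensor:
    """New formulation (sketch): index the upper-triangular pairs directly
    and take an element-wise (Hadamard) product, never building the dense
    tensor or computing the redundant lower triangle."""
    i, j = torch.triu_indices(x.size(1), x.size(1), device=x.device)
    return x[:, i] * x[:, j]                          # (B, N*(N+1)//2)

# Both paths agree on the retained upper-triangular entries.
x = torch.randn(4, 8)
assert torch.allclose(sync_dense(x), sync_sparse(x))
```

The key saving is that the indexed product never materialises the $(B, N, N)$ tensor, which is where the quadratic memory spike came from; the explicit device= keyword keeps the index tensor on the same device as the activations, avoiding implicit host-to-device copies.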

@lukedarlow (Collaborator) left a comment

This PR does follow the rules.

However, I would like you to add commentary above these altered lines that shows the old version, explaining their equivalence.

The reason for this is simply to aid readers in understanding what the code is actually doing in relation to the paper.
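For example, something along these lines (purely illustrative; the variable names and the exact old expression are placeholders, not the real diff):

```python
# Old (dense) version, kept here for reference:
#   sync = (z.unsqueeze(2) * z.unsqueeze(1))[:, idx_i, idx_j]
# The line below is equivalent: selecting the upper-triangular pairs first and
# multiplying element-wise reproduces exactly those entries of the dense outer
# product used in the paper's synchronisation definition, without allocating
# the (B, N, N) tensor.
sync = z[:, idx_i] * z[:, idx_j]
```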

@thanay-sisir (Contributor, Author)

@lukedarlow when are you going to get to this, Luke?
Please try to merge it whenever you are free. 😊
Thanks in advance!

@lukedarlow (Collaborator)

I already requested changes from you.

@thanay-sisir (Contributor, Author)

@lukedarlow okay Luke, how about it now?
