I saw your post on Twitter about your new method for attention approximation, and I think this is a cool idea! Could you clarify a few things?
Approximation Method: Is your method genuinely approximating attention, or is it doing something fundamentally different? From what I gather, if one uses the Random Maclaurin (RM) method while retaining the query (q), key (k), and value (v) components, it would seem similar to approaches like the Performer or Random Feature Attention (RFA). These methods approximate the RBF kernel as

$$\exp\!\left(-\frac{\lVert q - k \rVert^2}{2\sigma^2}\right) \approx \phi(q)^\top \phi(k),$$

where $\phi(\cdot)$ is a random feature map. If I understand correctly, methods like these work because they reorganize the matrix multiplications, thereby removing the quadratic dependence on sequence length: the full $N \times N$ attention matrix never has to be materialized.
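To make sure I'm describing the same trick you have in mind, here is a minimal NumPy sketch of that matmul reorganization (roughly the Performer/RFA recipe with plain random Fourier features; the function names and `num_features` are just my placeholders, and I'm ignoring the positive-feature and normalization refinements those papers add for stability):

```python
import numpy as np

def random_fourier_features(x, W, b):
    # Random Fourier features: phi(x)^T phi(y) ~= exp(-||x - y||^2 / 2)
    D = W.shape[1]
    return np.sqrt(2.0 / D) * np.cos(x @ W + b)

def linear_attention(Q, K, V, num_features=256, seed=0):
    """Kernelized attention sketch: softmax(QK^T)V is replaced by
    phi(Q) (phi(K)^T V), so the N x N attention matrix is never built."""
    rng = np.random.default_rng(seed)
    d = Q.shape[-1]
    W = rng.normal(size=(d, num_features))             # random projections
    b = rng.uniform(0, 2 * np.pi, size=num_features)   # random phases

    phi_q = random_fourier_features(Q, W, b)           # (N, D)
    phi_k = random_fourier_features(K, W, b)           # (N, D)

    kv = phi_k.T @ V                                   # (D, d_v) -- O(N * D * d_v)
    z = phi_k.sum(axis=0)                              # (D,) normalizer
    return (phi_q @ kv) / (phi_q @ z)[:, None]         # (N, d_v)

# toy usage
N, d = 8, 16
Q, K, V = np.random.default_rng(1).normal(size=(3, N, d))
print(linear_attention(Q, K, V).shape)  # (8, 16)
```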
Query, Key, Value Components: It appears you're not keeping the traditional query, key, and value framework. How does this approach approximate attention without those components? I initially thought 'h' in your diagram played the role of the queries, but after examining the diagram (linked below), that doesn't seem to be the case; it looks more like a context vector. Also, why average over the token (sequence-length) dimension? Is that how tokens mix and communicate with each other?
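For concreteness, this is the averaging step I'm asking about, as I currently (perhaps wrongly) read the diagram; `context_by_mean` and the shapes are purely my guess, not taken from your code:

```python
import numpy as np

# My guess at the 'h' computation: a per-sequence context vector obtained
# by averaging the token representations, so the only cross-token
# communication is this mean over the length axis.
def context_by_mean(x):
    # x: (N, d) token representations -> (d,) context vector "h"
    return x.mean(axis=0)

tokens = np.random.default_rng(0).normal(size=(8, 16))
h = context_by_mean(tokens)
print(h.shape)  # (16,)
```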

Can you explain a bit more about what your algorithm is trying to accomplish? It looks like it replaces the self-attention mechanism, but does it require additional heads or extra capacity to become akin to multi-head attention (MHA)?