Attention matrix
Dot-product of Query and Key


The attend function receives Query (Q) and Key (K), where:
- n_seq: length of the input word sequence
- n_q: dimension of each query and key vector

\(QK^{T}\) measures the similarity of Q and K: each entry of the resulting dot-product matrix (Dot) compares one query vector with one key vector, so Dot is a complete (n_seq, n_seq) map of the similarity of every entry of Q against every entry of K.
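A minimal NumPy sketch of this step (the sizes n_seq = 4 and n_q = 8 are arbitrary choices for illustration, and random Q and K stand in for real projections):

```python
import numpy as np

n_seq, n_q = 4, 8                    # sequence length and query/key size, as defined above
rng = np.random.default_rng(0)
Q = rng.normal(size=(n_seq, n_q))    # one query vector per input position
K = rng.normal(size=(n_seq, n_q))    # one key vector per input position

# Dot[i, j] = similarity of query i with key j -> an (n_seq, n_seq) map
Dot = Q @ K.T
print(Dot.shape)   # (4, 4)
```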
Masking

A mask is applied to Dot, either to exclude positions that occur later in time (causal masking) or to ignore padding and other unwanted inputs.
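One common way to implement a causal mask, sketched here with a random score matrix in place of the real Dot: excluded entries are set to negative infinity so that the subsequent softmax sends them to zero.

```python
import numpy as np

n_seq = 4
rng = np.random.default_rng(0)
Dot = rng.normal(size=(n_seq, n_seq))   # stand-in for the similarity scores

# Causal mask: position i may only attend to positions j <= i,
# i.e. everything strictly above the diagonal is masked out.
mask = np.triu(np.ones((n_seq, n_seq), dtype=bool), k=1)
Dot_masked = np.where(mask, -np.inf, Dot)
```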
Softmax

\[softmax(x_i)=\frac{\exp(x_i)}{\sum_j \exp(x_j)}\tag{1}\]
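Eq. (1) translates almost directly into code. The sketch below subtracts the row maximum before exponentiating; this cancels in the ratio, so the result is unchanged, but it avoids overflow for large scores (and handles masked \(-\infty\) entries, which become exactly 0):

```python
import numpy as np

def softmax(x):
    # Numerically stable version of Eq. (1): subtracting the row max
    # cancels in the ratio but prevents exp() from overflowing.
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

row = np.array([1.0, 2.0, 3.0])
print(softmax(row))   # entries are positive and sum to 1
```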
Applying attention to V

The purpose of the dot-product is to 'focus attention' on some of the inputs. After masking and softmax, the attention matrix (call it W) has entries appropriately scaled to enhance some values and suppress others.
V is of size (n_seq, n_v)

Multiplying the attention weights W by V gives the output \(Z = WV\), where each row of Z is a weighted combination of the rows of V. For example, the first entry is

\(Z_{00} = W[0, :] \cdot V[:, 0]\)
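The final step can be sketched as follows; the sizes and the random score matrix are illustrative assumptions, with the attention weights W built by softmaxing each row so they sum to 1:

```python
import numpy as np

n_seq, n_v = 4, 8
rng = np.random.default_rng(0)
scores = rng.normal(size=(n_seq, n_seq))             # stand-in for masked Dot

# Row-wise softmax: each row of W is a set of weights summing to 1.
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
W = e / e.sum(axis=-1, keepdims=True)

V = rng.normal(size=(n_seq, n_v))
Z = W @ V                                            # (n_seq, n_v) output

# The Z_00 identity above: row 0 of W dotted with column 0 of V.
assert np.isclose(Z[0, 0], W[0, :] @ V[:, 0])
```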