Transformer
Attention Is All You Need - arXiv
Feed Forward: Two fully connected layers (Linear) with a ReLU activation in between.
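A minimal PyTorch sketch of this block, using the dimensions from the paper (d_model = 512, d_ff = 2048); the class name is illustrative:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise feed-forward block: Linear -> ReLU -> Linear."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),  # expand to the inner dimension
            nn.ReLU(),                 # non-linearity between the two layers
            nn.Linear(d_ff, d_model),  # project back to the model dimension
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); applied independently at each position
        return self.net(x)
```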
Multi-head Attention:
Attention
GPT (Generative Pre-Training)
BERT (Bidirectional Encoder Representations from Transformers)
Unlike GPT, which is trained using only the left context, BERT uses a bidirectional encoder that makes use of both the left and the right context. The [MASK] token is used to hide some of the words so that the model cannot indirectly see the word it is asked to predict.
Pre-training of BERT uses two strategies: MLM (Masked Language Model) and NSP (Next Sentence Prediction). The model is trained on both objectives jointly.
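A rough sketch of how the two objectives can be combined into a single training loss; the function and tensor names here are illustrative, not BERT's actual implementation:

```python
import torch.nn.functional as F

def pretraining_loss(mlm_logits, mlm_labels, nsp_logits, nsp_labels):
    """Joint BERT pre-training loss: masked-LM loss + next-sentence loss."""
    # mlm_logits: (batch, seq_len, vocab_size); mlm_labels: (batch, seq_len),
    # with -100 at positions that were not masked (ignored by cross_entropy).
    mlm_loss = F.cross_entropy(
        mlm_logits.view(-1, mlm_logits.size(-1)),
        mlm_labels.view(-1),
        ignore_index=-100,
    )
    # nsp_logits: (batch, 2); nsp_labels: (batch,), 1 = "is the next sentence".
    nsp_loss = F.cross_entropy(nsp_logits, nsp_labels)
    return mlm_loss + nsp_loss
```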
As shown below, the input embeddings of BERT are the sum of the token embeddings, the segment embeddings, and the position embeddings. Note that a segment may consist of multiple sentences.
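A minimal PyTorch sketch of this sum, using BERT-base sizes (vocabulary 30522, hidden size 768, maximum length 512); the class name is illustrative and dropout is omitted:

```python
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    """Sum of token, segment, and position embeddings (BERT-base sizes)."""
    def __init__(self, vocab_size=30522, hidden=768, max_len=512, n_segments=2):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)
        self.segment = nn.Embedding(n_segments, hidden)  # segment A / B
        self.position = nn.Embedding(max_len, hidden)    # learned positions
        self.norm = nn.LayerNorm(hidden)

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.token(token_ids) + self.segment(segment_ids) + self.position(positions)
        return self.norm(x)
```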
In the MLM task, the final hidden vectors of the masked tokens are fed into an output softmax over the vocabulary.
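A minimal sketch of this step in PyTorch; the shapes and variable names are illustrative:

```python
import torch
import torch.nn as nn

hidden, vocab_size = 768, 30522
decoder = nn.Linear(hidden, vocab_size)                  # output projection to vocabulary

hidden_states = torch.randn(2, 16, hidden)               # dummy encoder output: (batch, seq_len, hidden)
mask_positions = torch.tensor([[0, 3], [1, 7]])          # (batch_index, token_index) of two masked tokens

# Gather the final hidden vectors at the masked positions, then softmax over the vocabulary.
masked_vectors = hidden_states[mask_positions[:, 0], mask_positions[:, 1]]   # (num_masked, hidden)
probs = torch.softmax(decoder(masked_vectors), dim=-1)                       # (num_masked, vocab_size)
```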
BERT can also be used for embedding words and sentences.
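In practice this is often done with the Hugging Face `transformers` library; below is a sketch that mean-pools the last hidden states into a simple sentence embedding (assuming the library and the `bert-base-uncased` checkpoint are available; other pooling strategies are also common):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentence = "BERT can also produce sentence embeddings."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

token_embeddings = outputs.last_hidden_state              # (1, seq_len, 768): per-token vectors
mask = inputs["attention_mask"].unsqueeze(-1)              # ignore padding when pooling
sentence_embedding = (token_embeddings * mask).sum(1) / mask.sum(1)   # (1, 768)
```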
GloVe
The rough idea of GloVe (Global Vectors) is to find an embedding vector for each word in the corpus such that the dot product of two word vectors approximates the logarithm of the co-occurrence count of the two words in the corpus. The optimization objective is (see the code sketch after the variable list below):
$$ J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2 $$
Where,
- \( w_i \) and \( \tilde{w}_j \) are the embedding vectors for word i and context word j.
- \( b_i \) and \( \tilde{b}_j \) are biases.
- \( X_{ij} \) is the co-occurrence count of words i and j.
- \( f(X_{ij}) \) is a weighting function that avoids overweighting both rare and frequent co-occurrences, and satisfies \( f(0) = 0 \).
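A minimal PyTorch sketch of this objective over a dense co-occurrence matrix, using the weighting function from the paper with \( x_{max} = 100 \) and \( \alpha = 0.75 \); the function and variable names are illustrative:

```python
import torch

def weight_fn(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(X_ij): zero at zero counts, capped at 1 for frequent pairs."""
    return torch.clamp(x / x_max, max=1.0) ** alpha

def glove_loss(w, w_tilde, b, b_tilde, X):
    """Weighted least-squares objective J over a dense co-occurrence matrix X."""
    # w, w_tilde: (V, d) embeddings; b, b_tilde: (V,) biases; X: (V, V) counts.
    log_X = torch.log(X.clamp(min=1.0))   # log X_ij; zero-count entries get weight 0 anyway
    pred = w @ w_tilde.T + b[:, None] + b_tilde[None, :]
    return (weight_fn(X) * (pred - log_X) ** 2).sum()

# Tiny usage example with random data.
V, d = 5, 8
X = torch.randint(0, 20, (V, V)).float()
w, w_t = torch.randn(V, d, requires_grad=True), torch.randn(V, d, requires_grad=True)
b, b_t = torch.zeros(V, requires_grad=True), torch.zeros(V, requires_grad=True)
loss = glove_loss(w, w_t, b, b_t, X)
loss.backward()   # gradients for a step of (e.g.) Adagrad, as used in the paper
```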