NLP with Deep Learning
2021-09-09
Machine Learning
Transformer
Attention is all you need - Arxiv
Feed Forward: Two fully connected layers (Linear) with a ReLU activation in between.
Multi-head Attention: Several scaled dot-product attention heads run in parallel on linear projections of the queries, keys, and values; their outputs are concatenated and projected back to the model dimension. A minimal sketch of this block and the feed-forward block follows the list below.
Attention
Attention - Qiita
Self-Attention
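A minimal PyTorch sketch of scaled dot-product attention, the multi-head attention block, and the feed-forward block described above. The default sizes (d_model=512, n_heads=8, d_ff=2048) follow the paper; the class and variable names are my own, not from any particular library.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        # Self-attention: Q, K, V are all projections of the same input x.
        b, t, _ = x.shape
        def split(h):  # (batch, seq, d_model) -> (batch, heads, seq, d_head)
            return h.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        out = scaled_dot_product_attention(
            split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x)), mask
        )
        # Concatenate the heads and project back to d_model.
        return self.w_o(out.transpose(1, 2).reshape(b, t, -1))

class FeedForward(nn.Module):
    # Position-wise FFN: two fully connected layers with a ReLU in between.
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        return self.net(x)
```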
GPT (Generative Pre-Training)
GPT - paper
BERT (Bidirectional Encoder Representations from Transformers)
BERT - Arxiv
BERT Explained
Unlike GPT, which is trained using only the left context, BERT uses a bidirectional encoder that makes use of both the left and the right context. [MASK] is used to replace some of the words so that the model does not indirectly see the word it is asked to predict.
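A tiny illustration of the difference: GPT enforces the left-context-only restriction with a causal (lower-triangular) attention mask, while BERT's encoder attends in both directions. The tensor names here are illustrative; such a mask could be passed as the `mask` argument of the attention sketch above.

```python
import torch

seq_len = 5

# GPT: causal mask - position i may only attend to positions <= i (left context).
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# BERT: bidirectional - every position may attend to every other position.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

print(causal_mask.int())
```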
Pre-training of BERT makes use of two strategies: MLM (Masked Language Model) and NSP (Next Sentence Prediction). The model is trained on both objectives jointly.
As shown in the sketch below, the input embeddings of BERT are the sum of the token embeddings, the segment embeddings, and the position embeddings. Note that a segment may consist of multiple sentences.
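A rough sketch of how the three embeddings are summed, assuming learned position embeddings and BERT-base sizes; the token ids in the example are illustrative placeholders, not real tokenizer output.

```python
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 30522, 512, 768  # BERT-base sizes

tok_emb = nn.Embedding(vocab_size, d_model)
seg_emb = nn.Embedding(2, d_model)        # segment A / segment B
pos_emb = nn.Embedding(max_len, d_model)  # learned position embeddings

# "[CLS] sentence A [SEP] sentence B [SEP]" as already-tokenised ids (illustrative):
input_ids   = torch.tensor([[101, 7592, 102, 2088, 102]])
segment_ids = torch.tensor([[0,    0,    0,   1,    1]])  # 0 = segment A, 1 = segment B
positions   = torch.arange(input_ids.size(1)).unsqueeze(0)

# Input representation = token + segment + position embeddings.
embeddings = tok_emb(input_ids) + seg_emb(segment_ids) + pos_emb(positions)
```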
In the MLM task, 15% of the input tokens are chosen at random as prediction targets; of these, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged, and the model is trained to predict the original tokens at those positions.
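A rough sketch of this 80/10/10 masking procedure, using a `mask_tokens` helper of my own; a real implementation would also exclude special tokens such as [CLS] and [SEP] from being selected.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    # Select 15% of the tokens as prediction targets.
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mlm_prob
    labels[~selected] = -100  # ignore non-selected positions in the loss

    # Of the selected tokens: 80% -> [MASK].
    replace_mask = selected & (torch.rand(input_ids.shape) < 0.8)
    input_ids = input_ids.masked_fill(replace_mask, mask_token_id)

    # Of the remaining selected tokens: half (10% overall) -> random token,
    # the rest (10% overall) are left unchanged.
    random_mask = selected & ~replace_mask & (torch.rand(input_ids.shape) < 0.5)
    random_tokens = torch.randint(vocab_size, input_ids.shape)
    input_ids = torch.where(random_mask, random_tokens, input_ids)
    return input_ids, labels

# Usage (ids and [MASK] id are illustrative):
ids = torch.tensor([[101, 7592, 2088, 102]])
masked_ids, labels = mask_tokens(ids, mask_token_id=103, vocab_size=30522)
```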