


Transformer
Attention Is All You Need - arXiv
Feed Forward: Two fully connected (Linear) layers with a ReLU activation in between.
Multi-head Attention:

Attention
Attention - Qiita

Self-Attention

GPT (Generative Pre-Training)
GPT - paper

BERT (Bidirectional Encoder Representations from Transformers)
BERT - arXiv
BERT Explained

Unlike GPT, a Transformer that is trained using only the left context, BERT uses a bidirectional encoder that makes use of both the left and the right context. Because conditioning on both directions would let each word indirectly see itself, some of the input words are replaced with [MASK] and the model has to predict them.

Pre-training of BERT uses two strategies: MLM (Masked Language Model) and NSP (Next Sentence Prediction). The model is trained on both strategies together.

As shown below, the input embeddings of BERT consist of the token embeddings, the segment embeddings, and the position embeddings, which are summed. Note that a segment may consist of multiple sentences.

In the MLM task,