Transformer
- Attention is all you need - Arxiv
- Feed Forward: two fully connected layers (Linear) with a ReLU activation in between (see the sketch after this list)
- Multi-head Attention
  - Attention - Qiita
- Self-Attention

GPT (Generative Pre-Training)
- GPT - paper

BERT (Bidirectional Encoder Representations from Transformers)
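
As a rough illustration of the Transformer items above, here is a minimal sketch (PyTorch assumed; the sizes 512 and 2048 are just the defaults from the paper, and all names are illustrative, not from these notes) of the position-wise Feed Forward block, two Linear layers with a ReLU in between, and a single-head version of Self-Attention. Multi-head Attention simply runs several such heads in parallel and concatenates their outputs.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class FeedForward(nn.Module):
    """Two fully connected (Linear) layers with a ReLU activation in between."""

    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)   # expand
        self.linear2 = nn.Linear(d_ff, d_model)   # project back

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear2(F.relu(self.linear1(x)))


class SelfAttention(nn.Module):
    """Single-head self-attention: Q, K, V are all projections of the same input."""

    def __init__(self, d_model: int = 512):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        return F.softmax(scores, dim=-1) @ v


if __name__ == "__main__":
    x = torch.randn(2, 10, 512)        # (batch, sequence length, d_model)
    print(FeedForward()(x).shape)      # torch.Size([2, 10, 512])
    print(SelfAttention()(x).shape)    # torch.Size([2, 10, 512])
```

Both blocks keep the model dimension unchanged, which is what lets the Transformer stack them with residual connections.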