NLP with Deep Learning

September 9, 2021
September 16, 2021



Attention is all you need - Arxiv

The Input and Output of Transformer

Model Architecture of the Transformer

Feed Forward: Two fully connected layers (Linear) with a ReLU activation in between.
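The feed-forward block described above can be sketched in a few lines of NumPy. This is only an illustration: the function name `feed_forward`, the weight names, and the toy sizes are assumptions made for this example, and the weights are random rather than trained.

```python
import numpy as np

def feed_forward(x, w1, b1, w2, b2):
    """Position-wise feed-forward block: Linear -> ReLU -> Linear."""
    hidden = np.maximum(0.0, x @ w1 + b1)  # first linear layer followed by ReLU
    return hidden @ w2 + b2                # second linear layer

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 5          # toy sizes chosen for illustration
x = rng.normal(size=(seq_len, d_model))    # one vector per position
w1 = rng.normal(size=(d_model, d_ff))
b1 = np.zeros(d_ff)
w2 = rng.normal(size=(d_ff, d_model))
b2 = np.zeros(d_model)

out = feed_forward(x, w1, b1, w2, b2)
print(out.shape)  # (5, 8)
```

The same weights are applied to every position independently, so the output keeps the input shape.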

Multi-head Attention:

Multi-head Attention


Attention - Qiita

General Architecture of the Attention Mechanism


Self Attention

GPT (Generative Pre-Training)

GPT - paper

The workflow of GPT

BERT (Bidirectional Encoder Representations from Transformers)

BERT - Arxiv

BERT Explained

Unlike GPT, a Transformer trained using only the left context, BERT uses a bidirectional encoder that draws on both the left and the right context. Some of the input words are replaced with [MASK] so that the model cannot indirectly see the word it is asked to predict.
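One way to picture the difference between left-only and bidirectional context is through the attention mask. The sketch below is an illustration only; the helper name `attention_mask` is hypothetical and does not come from either paper's code.

```python
import numpy as np

def attention_mask(seq_len, bidirectional):
    """Boolean (seq_len, seq_len) mask; mask[i, j] is True if position i may attend to position j."""
    full = np.ones((seq_len, seq_len), dtype=bool)
    # Left-context only: keep the lower triangle, so position i sees positions 0..i.
    return full if bidirectional else np.tril(full)

gpt_mask = attention_mask(4, bidirectional=False)   # left context only
bert_mask = attention_mask(4, bidirectional=True)   # left and right context

print(gpt_mask.astype(int))
print(bert_mask.astype(int))
```

With the full mask every token can attend to every other token, which is why BERT needs [MASK] to hide the words it has to predict.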

Pre-training of BERT uses two tasks: MLM (Masked Language Model) and NSP (Next Sentence Prediction). The model is trained on both tasks jointly.

The Training Process of BERT

As shown below, the input embeddings of BERT are the sum of the token embeddings, the segment embeddings, and the position embeddings. Note that a segment may consist of multiple sentences.
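A minimal sketch of how the three embeddings combine, assuming an element-wise sum as in the BERT paper. The table names, toy sizes, and token ids are made up for illustration, and the layer normalization and dropout that follow in the real model are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, num_segments, max_len, hidden = 100, 2, 16, 8   # toy sizes

# Randomly initialised lookup tables (learned parameters in the real model).
token_table = rng.normal(size=(vocab_size, hidden))
segment_table = rng.normal(size=(num_segments, hidden))
position_table = rng.normal(size=(max_len, hidden))

def bert_input_embeddings(token_ids, segment_ids):
    """Sum the token, segment, and position embeddings for one sequence."""
    token_ids = np.asarray(token_ids)
    segment_ids = np.asarray(segment_ids)
    positions = np.arange(len(token_ids))   # 0, 1, 2, ... for each position
    return token_table[token_ids] + segment_table[segment_ids] + position_table[positions]

token_ids = [1, 7, 8, 2, 9, 2]      # hypothetical ids, not real vocabulary entries
segment_ids = [0, 0, 0, 0, 1, 1]    # first four tokens in segment A, last two in segment B
emb = bert_input_embeddings(token_ids, segment_ids)
print(emb.shape)  # (6, 8)
```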

The Input Embedding of BERT

In the MLM task, the final hidden vectors of the masked tokens are fed into an output softmax over the vocabulary.
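The MLM output step can be sketched as follows. This is a simplified illustration: it assumes a single linear projection from the hidden size to the vocabulary size, and the function and variable names are hypothetical. The hidden states here are random stand-ins for the encoder output.

```python
import numpy as np

def mlm_probabilities(hidden_states, masked_positions, output_weights, output_bias):
    """Softmax over the vocabulary for each masked position."""
    masked_hidden = hidden_states[masked_positions]          # (num_masked, hidden)
    logits = masked_hidden @ output_weights + output_bias    # (num_masked, vocab_size)
    logits = logits - logits.max(axis=-1, keepdims=True)     # subtract max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, hidden, vocab_size = 6, 8, 100                      # toy sizes
hidden_states = rng.normal(size=(seq_len, hidden))           # stand-in for final encoder output
output_weights = rng.normal(size=(hidden, vocab_size))
output_bias = np.zeros(vocab_size)

probs = mlm_probabilities(hidden_states, [2, 4], output_weights, output_bias)
print(probs.shape)  # (2, 100)
```

Each row of `probs` is a distribution over the vocabulary; training would compare it against the original token at that masked position.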

BERT can also be used for embedding words and sentences.

BERT word embeddings tutorial


GloVe - Paper

The rough idea of GloVe (Global Vectors) is to find an embedding vector for each word in the corpus, such that the dot product of two word vectors (plus bias terms) approximates the logarithm of the co-occurrence count of the two words in the corpus. The optimization target is:

$$ J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2 $$


  • \( w_i \) and \( \tilde{w}_j \) are the embedding vectors for word i and context word j.
  • \( b_i \) and \( \tilde{b}_j \) are the corresponding biases.
  • \( X_{ij} \) is the number of times word j occurs in the context of word i.
  • \( f(X_{ij}) \) is a weighting function that keeps both rare and frequent co-occurrences from being overweighted, and that satisfies \( f(0) = 0 \).
  • \( V \) is the size of the vocabulary.
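The objective above can be evaluated directly with NumPy. The sketch below assumes the weighting function used in the GloVe paper, \( f(x) = (x / x_{max})^{\alpha} \) for \( x < x_{max} \) and 1 otherwise, with \( x_{max} = 100 \) and \( \alpha = 0.75 \). The function names and the toy co-occurrence matrix are made up for illustration.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f: grows as (x / x_max)^alpha, capped at 1, with f(0) = 0."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, W_tilde, b, b_tilde, X):
    """Weighted least-squares objective J summed over all word pairs (i, j)."""
    nonzero = X > 0
    log_X = np.zeros_like(X, dtype=float)
    log_X[nonzero] = np.log(X[nonzero])     # entries with X_ij = 0 get weight f(0) = 0 anyway
    residual = W @ W_tilde.T + b[:, None] + b_tilde[None, :] - log_X
    return float(np.sum(glove_weight(X) * residual ** 2))

# Toy co-occurrence counts for a vocabulary of V = 2 words.
X = np.array([[0.0, 2.0],
              [2.0, 0.0]])
dim = 3
W = np.zeros((2, dim))          # word vectors
W_tilde = np.zeros((2, dim))    # context word vectors
b = np.zeros(2)
b_tilde = np.zeros(2)

loss = glove_loss(W, W_tilde, b, b_tilde, X)
print(loss)
```

Training would minimize this value with respect to `W`, `W_tilde`, `b`, and `b_tilde`; the sketch only computes the objective for fixed parameters.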