# NLP with Deep Learning

Donny
September 9, 2021
September 17, 2021

## Transformer

Attention Is All You Need - arXiv

Feed-Forward: two fully connected (Linear) layers with a ReLU activation in between, applied independently at each position.
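The layer described above can be sketched in a few lines of NumPy. The toy dimensions below are assumptions for illustration; the original paper uses d_model = 512 and d_ff = 2048.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward: Linear -> ReLU -> Linear."""
    h = np.maximum(0.0, x @ W1 + b1)  # first linear layer + ReLU
    return h @ W2 + b2                # second linear layer

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))                    # 3 tokens, d_model = 4
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)  # expand to d_ff = 8
W2, b2 = rng.normal(size=(8, 4)), np.zeros(4)  # project back to d_model
out = feed_forward(x, W1, b1, W2, b2)
print(out.shape)  # (3, 4)
```

The same weights are applied at every position, so the sequence length can vary freely.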

## Attention

Attention - Qiita

Self-Attention
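A minimal sketch of (single-head) scaled dot-product self-attention, where the queries, keys, and values are all projections of the same input sequence; all sizes here are toy values chosen for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # numerically stable
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Scaled dot-product self-attention: softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # token-to-token similarity
    A = softmax(scores)                      # each row sums to 1
    return A @ V, A

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))  # 5 tokens, d_model = 4
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
out, A = self_attention(x, Wq, Wk, Wv)
```

Each output vector is a weighted average of the value vectors of all tokens, with weights given by the attention matrix `A`.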

GPT - paper

## BERT (Bidirectional Encoder Representations from Transformers)

BERT - arXiv

BERT Explained

Unlike the Transformer decoder used by GPT, which is trained on only the left context, BERT uses a bidirectional encoder that makes use of both the left and the right context. [MASK] tokens are used to hide some of the words so that the model cannot indirectly see the word it is asked to predict.

Pre-training of BERT uses two strategies: MLM (Masked Language Model) and NSP (Next Sentence Prediction). The model is trained on both objectives jointly.
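The masking step of MLM can be sketched as follows. Per the BERT paper, 15% of token positions are selected; of these, 80% are replaced by [MASK], 10% by a random token, and 10% are left unchanged (this 80/10/10 split is from the paper; the helper below is a toy illustration, not the authors' code):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """BERT-style masking: of the selected tokens, 80% -> [MASK],
    10% -> a random token, 10% -> unchanged."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the original token is the prediction target
            r = rng.random()
            if r < 0.8:
                masked.append("[MASK]")
            elif r < 0.9:
                masked.append(rng.choice(vocab))
            else:
                masked.append(tok)
        else:
            labels.append(None)  # not a prediction target
            masked.append(tok)
    return masked, labels

sentence = "the model predicts the masked words".split()
masked, labels = mask_tokens(sentence, vocab=sentence)
```

Only the selected positions contribute to the MLM loss; the `None` labels are ignored.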

The input embeddings of BERT are the sum of the token embeddings, the segment embeddings, and the position embeddings. Note that a segment may consist of multiple sentences.
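The element-wise sum of the three embedding tables can be sketched as below; the tables here are random stand-ins, whereas real BERT learns all three jointly:

```python
import numpy as np

vocab_size, max_len, d = 10, 8, 4  # toy sizes for illustration
rng = np.random.default_rng(0)
token_emb = rng.normal(size=(vocab_size, d))
segment_emb = rng.normal(size=(2, d))     # segment A / segment B
position_emb = rng.normal(size=(max_len, d))

token_ids = np.array([2, 5, 7, 3])    # a 4-token input sequence
segment_ids = np.array([0, 0, 1, 1])  # first vs. second segment
positions = np.arange(len(token_ids))

# The input embedding is the element-wise sum of the three lookups.
x = token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]
print(x.shape)  # (4, 4)
```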

In the MLM task, the final hidden vectors of the masked tokens are fed into an output softmax over the vocabulary.
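That output step amounts to projecting each masked position's hidden vector onto the vocabulary and normalizing. In the sketch below the output weights are tied to the token embedding matrix, which is a common implementation choice and an assumption here, not something the text above states:

```python
import numpy as np

def mlm_output_probs(h_masked, token_emb, b_out):
    """Softmax over the vocabulary for each masked position.
    Output weights tied to the token embeddings (assumed choice)."""
    logits = h_masked @ token_emb.T + b_out
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
token_emb = rng.normal(size=(10, 4))  # vocabulary of 10, hidden size 4
h_masked = rng.normal(size=(2, 4))    # hidden vectors of two masked positions
probs = mlm_output_probs(h_masked, token_emb, np.zeros(10))
```

Training minimizes the cross-entropy between `probs` and the original tokens at the masked positions.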

BERT can also be used for embedding words and sentences.

BERT word embeddings tutorial

## GloVe

GloVe - Paper

The rough idea of GloVe (Global Vectors) is to find an embedding vector for each word in the corpus such that the dot product of two word vectors approximates the logarithm of the co-occurrence count of the two words. The optimization target is:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

where:

• $$w_i$$ and $$\tilde{w}_j$$ are the embedding vectors for word i and context word j.
• $$b_i$$ and $$\tilde{b}_j$$ are biases.
• $$X_{ij}$$ is the co-occurrence count of words i and j.
• $$f(X_{ij})$$ is a weighting function that avoids overweighting both rare and frequent co-occurrences, and satisfies $$f(0) = 0$$.
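The objective above can be evaluated directly in NumPy. The weighting function below uses the form and constants from the GloVe paper, $$f(x) = (x/x_{max})^\alpha$$ capped at 1, with $$x_{max} = 100$$ and $$\alpha = 0.75$$; the toy matrices are assumptions for illustration:

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(X_ij): down-weights rare pairs, caps frequent ones; f(0) = 0."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0) * (x > 0)

def glove_loss(W, W_tilde, b, b_tilde, X):
    """The weighted least-squares objective J summed over all word pairs."""
    # Guard the log at X_ij = 0; those terms get weight f(0) = 0 anyway.
    logX = np.log(np.where(X > 0, X, 1.0))
    diff = W @ W_tilde.T + b[:, None] + b_tilde[None, :] - logX
    return np.sum(glove_weight(X) * diff ** 2)

rng = np.random.default_rng(0)
V, d = 6, 3  # 6 words, 3-dimensional embeddings
W, W_t = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_t = np.zeros(V), np.zeros(V)
X = rng.integers(0, 50, size=(V, V)).astype(float)  # toy co-occurrence counts
J = glove_loss(W, W_t, b, b_t, X)
```

In practice `J` is minimized with stochastic gradient updates over the nonzero entries of `X`, and the final word vector is typically taken as $$w_i + \tilde{w}_i$$.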