# NLP with Deep Learning

2021-09-09

Machine Learning

332

Transformer
Attention is all you need - Arxiv
Feed Forward: Two fully connected layers (Linear) with a ReLU activation in between.
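A minimal sketch of this feed-forward sub-layer in plain Python. The weight names `W1`, `b1`, `W2`, `b2` and their tiny values are hypothetical and chosen only so the arithmetic is easy to follow; real models use much larger, learned matrices.

```python
def relu(v):
    # Element-wise ReLU: negative entries become 0.
    return [max(0.0, x) for x in v]


def linear(W, b, v):
    # Fully connected layer: W is a list of rows, b is the bias vector.
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def feed_forward(x, W1, b1, W2, b2):
    # Linear -> ReLU -> Linear, applied to one position's vector.
    return linear(W2, b2, relu(linear(W1, b1, x)))


# Hypothetical weights: model dimension 2, inner dimension 3.
W1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 1.0, 1.0], [1.0, -1.0, 0.0]]
b2 = [0.5, 0.0]

y = feed_forward([2.0, -3.0], W1, b1, W2, b2)
print(y)  # [3.5, 2.0]
```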
Multi-head Attention:
Attention
Attention - Qiita
Self-Attention
GPT (Generative Pre-Training)
GPT - paper
BERT (Bidirectional Encoder Representations from Transformers)
BERT - Arxiv
BERT Explained
Unlike GPT, a Transformer that is trained using only the left context, BERT uses a bidirectional encoder that makes use of both the left and the right context. [MASK] is used to mask some of the words so that the model cannot indirectly see the word it is supposed to predict.
Pre-training of BERT uses two strategies: MLM (Masked Language Model) and NSP (Next Sentence Prediction). The model is trained on both tasks together.
The input embeddings of BERT are the sum of the token embeddings, the segment embeddings, and the position embeddings. Note that a segment may consist of multiple sentences.
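How the three embeddings combine can be sketched as follows. The lookup tables, their sizes, and the function name `bert_input_embeddings` are all hypothetical; the only thing taken from BERT is that the three vectors are added element-wise for each position.

```python
def bert_input_embeddings(token_ids, segment_ids, tok_table, seg_table, pos_table):
    # For each position, add the token, segment, and position vectors.
    out = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        vec = [t + s + p for t, s, p in
               zip(tok_table[tok], seg_table[seg], pos_table[pos])]
        out.append(vec)
    return out


# Hypothetical 2-dimensional tables with easy-to-read values.
tok_table = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]          # 3 tokens in the vocabulary
seg_table = [[10.0, 10.0], [20.0, 20.0]]                   # segment A, segment B
pos_table = [[100.0, 0.0], [200.0, 0.0], [300.0, 0.0]]     # positions 0..2

embedded = bert_input_embeddings([0, 2, 1], [0, 0, 1],
                                 tok_table, seg_table, pos_table)
print(embedded)  # [[111.0, 12.0], [215.0, 16.0], [323.0, 24.0]]
```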
In MLM task,

# Decision Tree

2019-03-16

Machine-Learning

1037

Information Content
$$I_X(x) = -\log_b P_X(x)$$
where,
I : the information content of X. An outcome with a greater I value carries more information.
P : the probability mass function.
b : the base of the logarithm used. Common values of b are 2 (bits), Euler's number e (nats), and 10 (bans).
Entropy (Information Theory)[1]
$$H(X) = E[I(X)] = E[-\log_b P(X)]$$
where,
H : the entropy. Named after Boltzmann's H-theorem (but the definition was proposed by Shannon). H indicates the uncertainty of X.
P : the probability mass function.
I : the information content of X.
E : the expected value operator.
The entropy can explicitly be written as:
$$H(X) = \sum_i P(x_i) I(x_i) = -\sum_i P(x_i) \log_b P(x_i)$$
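As a small illustration, entropy can be estimated from a list of class labels by using the relative frequencies as the probabilities. The function name `entropy` and the toy label lists are assumptions made for this sketch.

```python
import math
from collections import Counter


def entropy(labels, base=2):
    # Estimate P(x) by relative frequency, then sum -P(x) * log_b P(x).
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())


mixed = entropy(["yes", "yes", "no", "no"])    # maximally uncertain for 2 classes
pure = entropy(["yes", "yes", "yes", "yes"])   # no uncertainty at all
print(mixed, pure)
```

With base 2 the result is in bits: an evenly split two-class sample has an entropy of 1 bit, and a sample with a single class has an entropy of 0.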
ID3[2]
Use ID3 to build a decision tree:
Calculate the entropy of the samples under the current node.
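Find a feature F that maximizes the information gain. With S the samples under the current node and S_v the subset of S where F takes the value v, the information gain is calculated by:
$$IG(S, F) = E(S) - \sum_{v \in values(F)} \frac{|S_v|}{|S|} E(S_v)$$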
where E is the entropy of
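The feature-selection step of ID3 can be sketched as below, assuming the usual definition of information gain (entropy of the node minus the size-weighted entropy of the subsets produced by the split). The helper names and the toy weather-style data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    # Shannon entropy in bits, with probabilities taken as relative frequencies.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    # Group the labels by the value the feature takes in each row.
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[feature]].append(label)
    # Size-weighted entropy of the subsets after the split.
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_feature(rows, labels, features):
    # ID3 picks the feature with the largest information gain.
    return max(features, key=lambda f: information_gain(rows, labels, f))


rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "rainy", "windy": False},
    {"outlook": "rainy", "windy": True},
]
labels = ["no", "no", "yes", "yes"]

chosen = best_feature(rows, labels, ["outlook", "windy"])
print(chosen)  # outlook
```

Here `outlook` separates the two classes perfectly (gain of 1 bit), while `windy` leaves both subsets evenly mixed (gain of 0).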

# Convolutional Neural Network

2018-08-19

Machine-Learning

333

This article is a reflection on my previous work FaceLock, a project that recognizes the user's face and locks the computer if the user is not present for a certain time. A CNN is used to recognize different faces. I watched the Coursera course Convolutional Neural Networks by Andrew Ng to understand more about CNNs, so this article also serves as my learning notes on the course.
One Layer of a Convolutional Network
In a non-convolutional network, we have the following formula:
$$a^{[l]} = g(W^{[l]} a^{[l-1]} + b^{[l]})$$
Similarly, in the convolutional network, we can have:
$$a^{[l]} = g(W^{[l]} * a^{[l-1]} + b^{[l]})$$
$*$ is the convolution operation.
$a^{[l-1]}$ is the input matrix.
$W^{[l]}$ is the filter. Different filters can detect different features, e.g. vertical edges, diagonal edges, etc.
$b^{[l]}$ is the bias.
$g$ is an activation function.
$a^{[l]}$ is the output matrix, and can be fed to the next layer.
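One layer with a single filter can be sketched in plain Python as follows. The sketch assumes a "valid" convolution with stride 1, ReLU as the activation function, and a hypothetical 2x2 vertical-edge filter. As is common in deep learning, the filter is not flipped, so strictly speaking this is a cross-correlation.

```python
def conv2d_valid(image, kernel):
    # Slide the kernel over the image (no padding, stride 1) and take
    # the sum of element-wise products at every position.
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def conv_layer(image, kernel, bias):
    # Convolve, add the bias, then apply the activation function (ReLU here).
    z = conv2d_valid(image, kernel)
    return [[max(0.0, v + bias) for v in row] for row in z]


# Bright left half, dark right half: there is a vertical edge in the middle.
image = [
    [10, 10, 0, 0],
    [10, 10, 0, 0],
    [10, 10, 0, 0],
    [10, 10, 0, 0],
]
vertical_edge = [[1, -1], [1, -1]]  # hypothetical filter values

feature_map = conv_layer(image, vertical_edge, bias=0.0)
print(feature_map)  # [[0.0, 20.0, 0.0], [0.0, 20.0, 0.0], [0.0, 20.0, 0.0]]
```

The output is large only in the middle column, which is where the vertical edge sits in the input.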
Calculating the Number
The Number of the Parameters
Suppose we have 10 filters which are in one layer of a neural

# Mathematical Basis - Squashing Function

2018-08-11

Machine-Learning

444

This article is about some squashing functions used in deep learning, including the Softmax Function, the Sigmoid Function, and the Hyperbolic Functions. All three are used to squash values into a certain range.
Softmax Function
Softmax Function: A generalization of the logistic function that "squashes" a K-dimensional vector z of arbitrary real values to a K-dimensional vector of real values, where each entry is in the range (0, 1], and all the entries add up to 1.
$$\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \quad j = 1, \dots, K$$
In probability theory, the output of the softmax function can be used to represent a categorical distribution - that is, a probability distribution over K different possible outcomes.
The softmax function is the gradient of the LogSumExp function.
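A minimal softmax sketch in plain Python. Subtracting the maximum before exponentiating is a common numerical-stability trick that is not part of the definition; it leaves the result unchanged but avoids overflow for large inputs.

```python
import math


def softmax(z):
    # Shift by the maximum so the largest exponent is exp(0) = 1.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
print(probs, sum(probs))

# Large inputs would overflow a naive exp(), but work with the shift.
large = softmax([1000.0, 1000.0])
print(large)  # [0.5, 0.5]
```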
LogSumExp Function
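LogSumExp Function: The LogSumExp (LSE) function is a smooth approximation to the maximum function.
$$LSE(x_1, \dots, x_n) = \log\left(\exp(x_1) + \cdots + \exp(x_n)\right)$$
($\log$ stands for the natural logarithm function, i.e. the logarithm to the base e.)
When directly encountered, LSE can be well approximated by $\max\{x_1, \dots, x_n\}$:
$$\max\{x_1, \dots, x_n\} \le LSE(x_1, \dots, x_n) \le \max\{x_1, \dots, x_n\} + \log(n)$$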
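The relation between LSE and the maximum can be checked numerically with a short sketch. The function name `logsumexp` is chosen for this example, and factoring out the maximum is a standard stability trick rather than part of the definition.

```python
import math


def logsumexp(xs):
    # log(sum(exp(x))) computed as m + log(sum(exp(x - m))) to avoid overflow.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


xs = [1.0, 2.0, 5.0]
lse = logsumexp(xs)
print(max(xs), lse, max(xs) + math.log(len(xs)))

# When one value dominates, LSE is practically equal to the maximum.
dominated = logsumexp([1000.0, 0.0])
print(dominated)  # 1000.0
```

The printed values show that LSE lies between the maximum and the maximum plus log(n).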
Sigmoid

# Recurrent Neural Network

2018-07-30

Machine-Learning

388

This article contains my learning notes for the Coursera course Sequence Models by Andrew Yan-Tak Ng.
According to Andrew Ng, there are two typical units used in the hidden layers of an RNN. One is the GRU (Gated Recurrent Unit), the other is the LSTM (Long Short-Term Memory).
Note: Please refer to Mathematical Basis - Squashing Function for some basic math knowledge about the squashing functions.
GRU - Gated Recurrent Unit
The GRU is a gating mechanism in recurrent neural networks, introduced in 2014 by Kyunghyun Cho et al.
The fully gated version :
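The formulas:
$$\tilde{c}^{\langle t \rangle} = \tanh(W_c [\Gamma_r * c^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_c)$$
$$\Gamma_u = \sigma(W_u [c^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_u)$$
$$\Gamma_r = \sigma(W_r [c^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_r)$$
$$c^{\langle t \rangle} = \Gamma_u * \tilde{c}^{\langle t \rangle} + (1 - \Gamma_u) * c^{\langle t-1 \rangle}$$
$c^{\langle t \rangle}$ : The memory cell.
$x^{\langle t \rangle}$ : The input sequence.
$y^{\langle t \rangle}$ : The output sequence.
$\Gamma_r$ : Gate gamma r. It tells us how relevant $c^{\langle t-1 \rangle}$ is to computing the next candidate for $c^{\langle t \rangle}$.
$\Gamma_u$ : Gate gamma u. The update gate vector. It decides whether or not we actually update $c^{\langle t \rangle}$, the memory cell.
$\tilde{c}^{\langle t \rangle}$ : The candidate value for the memory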
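A single step of a fully gated unit can be sketched in plain Python, assuming the formulation from the Sequence Models course: an update gate and a relevance gate, both sigmoid, and a tanh candidate. The parameter names (`Wu`, `Wr`, `Wc`, `bu`, `br`, `bc`) and the all-zero demo weights are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, b, v):
    # W is a list of rows; returns W @ v + b.
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def gru_step(c_prev, x, params):
    concat = c_prev + x  # concatenation [c_prev, x]
    gamma_u = [sigmoid(v) for v in affine(params["Wu"], params["bu"], concat)]
    gamma_r = [sigmoid(v) for v in affine(params["Wr"], params["br"], concat)]
    # The relevance gate scales the old memory before the candidate is computed.
    gated = [r * c for r, c in zip(gamma_r, c_prev)] + x
    c_tilde = [math.tanh(v) for v in affine(params["Wc"], params["bc"], gated)]
    # The update gate blends the candidate with the old memory.
    return [u * ct + (1.0 - u) * c
            for u, ct, c in zip(gamma_u, c_tilde, c_prev)]


# Hypothetical parameters: memory size 1, input size 1, all weights zero.
zero_params = {
    "Wu": [[0.0, 0.0]], "bu": [0.0],
    "Wr": [[0.0, 0.0]], "br": [0.0],
    "Wc": [[0.0, 0.0]], "bc": [0.0],
}
c_next = gru_step([2.0], [1.0], zero_params)
print(c_next)  # [1.0]
```

With all-zero weights both gates equal 0.5 and the candidate is 0, so the new memory is half of the old one. A strongly negative update-gate bias would instead keep the old memory almost unchanged, which is how the unit carries information across many time steps.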

# AdaBoost

2018-03-30

Machine-Learning

1004

Python implementation: AdaBoost - Donny-Hikari - Github
Introduction
AdaBoost is short for Adaptive Boosting. Boosting is an Ensemble Learning method. Other Ensemble Learning methods include Bagging and Stacking. The differences among Bagging, Boosting, and Stacking are as follows:
Bagging:
Equal-weight voting. Trains each model on a randomly drawn subset of the training set.
Boosting:
Trains each new model instance to emphasize the training instances that previous models misclassified. Often achieves better accuracy than bagging, but is also more prone to overfitting.
Stacking:
Trains a learning algorithm to combine the predictions of several other learning algorithms.
The Formulas
Given an N*M matrix X (the features) and an N-dimensional vector y (the labels), where N is the number of samples and M is the number of features per sample, AdaBoost trains T weak classifiers with the following steps:
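One common formulation is discrete AdaBoost with decision stumps as the weak classifiers. The sketch below assumes labels in {-1, +1} and uses hypothetical helper names; it illustrates the general technique and is not necessarily how the linked implementation works.

```python
import math


def train_stump(X, y, w):
    # Exhaustively pick the (feature, threshold, polarity) with the
    # lowest weighted error. Returns (error, feature, threshold, polarity).
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                pred = [polarity if row[j] <= thr else -polarity for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best


def stump_predict(stump, row):
    _, j, thr, polarity = stump
    return polarity if row[j] <= thr else -polarity


def adaboost_fit(X, y, T):
    n = len(X)
    w = [1.0 / n] * n  # start with uniform sample weights
    ensemble = []
    for _ in range(T):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)     # classifier weight
        ensemble.append((alpha, stump))
        # Increase the weights of misclassified samples, decrease the others.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def adaboost_predict(ensemble, row):
    # Weighted vote of all weak classifiers.
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Toy 1-D data that no single threshold can separate.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]

one_stump = [adaboost_predict(adaboost_fit(X, y, 1), row) for row in X]
three_stumps = [adaboost_predict(adaboost_fit(X, y, 3), row) for row in X]
print(one_stump)
print(three_stumps)
```

On this toy data a single stump misclassifies two samples, while the weighted vote of three stumps reproduces all labels.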

# Classification And Overfitting

2017-11-20

Machine-Learning

658

These are my learning notes on Logistic Regression from the Machine Learning course by Andrew Ng on Coursera.
Hypothesis Representation
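Logistic regression uses the "Sigmoid Function," also called the "Logistic Function":
$$h_\theta(x) = g(z), \quad g(z) = \frac{1}{1 + e^{-z}}$$
which turns linear regression into classification.
The sigmoid function is an S-shaped curve that maps any real number z into the interval (0, 1).
$h_\theta(x)$ gives us the probability that the output is 1.
In fact, $h_\theta(x)$ is simplified as
$$h_\theta(x) = g(\theta^T x)$$
for logistic regression, and is
$$h_\theta(x) = \theta^T x$$
for linear regression. In some complicated cases, z might be something like:
$$z = \theta_0 + \theta_1 x_1^2 + \theta_2 x_2^2$$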
Decision Boundary
The decision boundary is the line (or hyperplane) that separates the area where y = 0 from the area where y = 1 (or separates different classes). It is created by our hypothesis function.
The input to the sigmoid function does not need to be linear; it could be a function that describes a circle (e.g. $z = \theta_0 + \theta_1 x_1^2 + \theta_2 x_2^2$) or any other shape that fits the data.
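A circular decision boundary can be sketched as follows, assuming z = θ0 + θ1·x1² + θ2·x2² and the hypothetical parameter values θ = (-1, 1, 1), which put the boundary on the unit circle x1² + x2² = 1. The prediction is 1 whenever the hypothesis is at least 0.5, i.e. whenever z ≥ 0.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hypothesis(theta, x1, x2):
    # Non-linear input to the sigmoid: a circle centered at the origin.
    z = theta[0] + theta[1] * x1 ** 2 + theta[2] * x2 ** 2
    return sigmoid(z)


def predict(theta, x1, x2):
    # Predict y = 1 when the estimated probability is at least 0.5.
    return 1 if hypothesis(theta, x1, x2) >= 0.5 else 0


theta = [-1.0, 1.0, 1.0]  # hypothetical values: boundary is the unit circle

inside = predict(theta, 0.0, 0.0)        # inside the circle
outside = predict(theta, 2.0, 0.0)       # outside the circle
on_boundary = hypothesis(theta, 1.0, 0.0)
print(inside, outside, on_boundary)  # 0 1 0.5
```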
Cost Function
Using the cost function for linear regression (squared error) in classification makes the cost surface wavy, i.e. non-convex, resulting in many local optima.