Transformer. Attention Is All You Need - Arxiv. Feed Forward: two fully connected layers (Linear) with a ReLU activation in between. Multi-head Attention: Attention - Qiita; Self-Attention. GPT (Generative Pre-Training): GPT - paper. BERT (Bidirectional Encoder Representations from
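The position-wise feed-forward block mentioned above (two linear layers with a ReLU in between) can be sketched in NumPy; the dimensions `d_model = 512` and `d_ff = 2048` are assumptions taken from the "Attention Is All You Need" paper:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # First fully connected layer followed by ReLU
    hidden = np.maximum(0, x @ W1 + b1)
    # Second fully connected layer projects back to d_model
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048
W1 = rng.standard_normal((d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

x = rng.standard_normal((10, d_model))  # 10 token positions
out = feed_forward(x, W1, b1, W2, b2)
print(out.shape)  # same shape as the input: (10, 512)
```

The block is applied to each position independently, which is why a plain matrix multiply over the position axis suffices.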



Information Content. where, I : the information content of X. An X with a greater I value contains more information. P : the probability mass function. b : the base of the logarithm used. Common values of b are 2 (bits),
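With the definitions above, the self-information of an outcome with probability P(x) is I(x) = -log_b P(x); a quick sketch in Python:

```python
import math

def information_content(p, base=2):
    # I(x) = -log_b P(x): the rarer the event, the larger I
    return -math.log(p, base)

print(information_content(0.5))   # 1.0 bit (a fair coin flip)
print(information_content(0.25))  # 2.0 bits: a rarer event carries more information
```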



This article is my reflection on my previous work FaceLock, a project that recognizes the user's face and locks the computer if the user isn't present for a certain time. A CNN is used to recognize different faces. I watched the Coursera course Convolutional Neural Networks by
This article is about some squashing functions in deep learning, including the Softmax Function, the Sigmoid Function, and the Hyperbolic Functions. All three functions squash values into a certain range. Softmax Function. Softmax Function: A generalization of the logistic function that "squashes" a
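The three squashing functions named above can be sketched in a few lines of NumPy; note the different output ranges:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; outputs sum to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sigmoid(z):
    # Squashes each value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([1.0, 2.0, 3.0])
print(softmax(z))   # a probability distribution: sums to 1
print(sigmoid(z))   # each element in (0, 1)
print(np.tanh(z))   # hyperbolic tangent: each element in (-1, 1)
```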
This article is my learning note on the Coursera course Sequence Models by Andrew Yan-Tak Ng. According to Andrew Ng, there are two typical RNN units for the hidden layers of an RNN. One is the GRU (Gated Recurrent Unit), the other is the LSTM (Long Short
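A minimal sketch of one of those two units, a GRU cell, roughly in the notation of Ng's course (the weight names and sizes here are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell(x, c_prev, Wu, Wr, Wc, bu, br, bc):
    # Gates are computed from the previous memory and the current input
    concat = np.concatenate([c_prev, x])
    gamma_u = sigmoid(Wu @ concat + bu)  # update gate: how much to overwrite
    gamma_r = sigmoid(Wr @ concat + br)  # relevance gate: how much past to use
    c_cand = np.tanh(Wc @ np.concatenate([gamma_r * c_prev, x]) + bc)
    # New memory is a gated blend of the candidate and the old memory
    return gamma_u * c_cand + (1 - gamma_u) * c_prev

rng = np.random.default_rng(0)
n_c, n_x = 4, 3  # memory size and input size (illustrative)
Wu, Wr, Wc = (rng.standard_normal((n_c, n_c + n_x)) for _ in range(3))
bu = br = bc = np.zeros(n_c)

c = gru_cell(rng.standard_normal(n_x), np.zeros(n_c), Wu, Wr, Wc, bu, br, bc)
print(c.shape)  # (4,): same shape as the memory cell
```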
Python implementation: AdaBoost - Donny-Hikari - Github. Introduction. AdaBoost is short for Adaptive Boosting. Boosting is an Ensemble Learning method. Other Ensemble Learning methods include Bagging, Stacking, etc. The differences between Bagging, Boosting, and Stacking are as follows: Bagging: Equal weight voting. Trains
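The boosting idea can be sketched as a single AdaBoost round, assuming binary labels in {-1, +1}; `adaboost_round` is a hypothetical helper name for illustration:

```python
import numpy as np

def adaboost_round(pred, y, w):
    """Given one weak learner's predictions, return its vote weight alpha
    and the re-weighted sample weights for the next round."""
    err = w[pred != y].sum() / w.sum()     # weighted error rate
    alpha = 0.5 * np.log((1 - err) / err)  # this learner's voting weight
    w = w * np.exp(-alpha * y * pred)      # up-weight misclassified samples
    return alpha, w / w.sum()

y = np.array([1, 1, -1, -1])
pred = np.array([1, -1, -1, -1])  # the weak learner misclassifies sample 1
w = np.ones(4) / 4                # start from equal weights
alpha, w_new = adaboost_round(pred, y, w)
print(alpha > 0)                  # True: better than random, so positive vote
print(np.argmax(w_new))           # 1: the mistake gets the largest weight
```

Unlike Bagging's equal-weight voting, each AdaBoost learner votes with its own weight alpha, and later learners focus on the samples earlier ones got wrong.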



This is a learning note on Logistic Regression from Machine Learning by Andrew Ng on Coursera. Hypothesis Representation. Uses the "Sigmoid Function," also called the "Logistic Function," which turns linear regression into classification. The sigmoid function looks like this: it gives us the probability that
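The hypothesis described above, h(x) = g(theta^T x), can be sketched as follows; the sigmoid squashes the linear output into (0, 1), read as the probability that y = 1:

```python
import numpy as np

def sigmoid(z):
    # The logistic function: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    # Linear regression output passed through the sigmoid
    return sigmoid(theta @ x)

theta = np.array([0.5, -0.25])
x = np.array([1.0, 2.0])  # first component is the bias term x0 = 1
p = hypothesis(theta, x)
print(p)  # 0.5: theta^T x = 0, so this point sits on the decision boundary
```

Classifying y = 1 when p >= 0.5 is equivalent to checking the sign of theta^T x.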