Recurrent Neural Network

Machine-Learning
July 30, 2018
Donny Donny.

This article is my learning note for the Coursera course Sequence Models by Andrew Yan-Tak Ng.

According to Andrew Ng, there are two typical RNN units for the hidden layers of an RNN. One is the GRU (Gated Recurrent Unit); the other is the LSTM (Long Short-Term Memory).

Notice: Please refer to Mathematical Basis - Squashing Function for some basic math background on squashing functions.

GRU - Gated Recurrent Unit

The GRU is a gating mechanism for recurrent neural networks, introduced in 2014 by Kyunghyun Cho et al.


The fully gated version:

$$ %GRN \tikzstyle{Entity}=[] \tikzstyle{Function}=[shape=rect, draw=blue, fill=cyan] \tikzstyle{Operation}=[shape=circle, draw=orange, fill=orange!30] \tikzstyle{arrow}=[draw, -latex, rounded corners=5pt] \tikzstyle{Region}=[draw=green, fill=green!30, line width=3pt, rounded corners=10pt] \begin{tikzpicture} %\draw [green] (-1,1) grid (10,-10); %\draw [line, brown] (-1,0) -- (10,0); %\draw [line, brown] (0,1) -- (0,-10); \draw [Region] (0.2,-1.0) rectangle (8.5,-7.7); \node [Entity] (c^{<t-1>}) at (-1,-2) {$c^{<t-1>}$}; \node [Entity] (c^{<t>}) at (10,-2) {$c^{<t>}$}; \node [Entity] (x) at (1,-8.5) {$x^{<t>}$}; \node [Entity] (y) at (7,0) {$y^{<t>}$}; \node [Entity, right] (Gamma_r) at (3,-4.4) {$\Gamma_r$}; \node [Entity, left] (Gamma_u) at (5,-4.4) {$\Gamma_u$}; \node [Entity, right] (tilde_{c}^{<t>}) at (7,-5) {$\tilde{c}^{<t>}$}; \node [Function] (sigma_r) at (3.0,-5) {$\sigma$}; \node [Function] (sigma_u) at (5.0,-5) {$\sigma$}; \node [Function] (tanh_cc) at (7.0,-6) {$tanh$}; % cc: candidate c \node [Operation] (multi_r) at (2.0,-4) {$*$}; \node [Operation] (subby1_u) at (5.0,-3.2) {$1-$}; \node [Operation] (multi_u) at (5.0,-2) {$*$}; \node [Operation] (multi_cc) at (7.0,-4) {$*$}; \node [Operation] (add_c) at (7.0,-2) {$+$}; \draw [arrow] (c^{<t-1>}) -| (1,-6) -| (sigma_r); \draw [arrow] (x) |- (3,-6) -| (sigma_r); \draw [arrow] (2,-6) -| (sigma_u); \draw [arrow] (c^{<t-1>}) -| (multi_r); \draw [arrow] (sigma_r) |- (multi_r); \draw [arrow] (x) |- (7,-7) -| (tanh_cc); \draw [arrow] (multi_r) |- (7,-7) -| (tanh_cc); \draw [arrow] (tanh_cc) -- (multi_cc); \draw [arrow] (multi_cc) -- (add_c); \draw [arrow] (sigma_u) |- (multi_cc); \draw [arrow] (sigma_u) -- (subby1_u); \draw [arrow] (subby1_u) -- (multi_u); \draw [arrow] (c^{<t-1>}) -- (multi_u); \draw [arrow] (multi_u) -- (add_c); \draw [arrow] (add_c) -- (y); \draw [arrow] (add_c) -- (c^{<t>}); \end{tikzpicture} $$


The formulas:

$$ \tilde{c}^{<t>} = tanh( W_c [\Gamma_r \star c^{<t-1>}, x^{<t>}] + b_c ) \\ \Gamma_u = \sigma( W_u [c^{<t-1>}, x^{<t>}] + b_u ) \\ \Gamma_r = \sigma( W_r [c^{<t-1>}, x^{<t>}] + b_r ) \\ c^{<t>} = \Gamma_u \star \tilde{c}^{<t>} + (1-\Gamma_u) \star c^{<t-1>} $$

@ $$c$$ : The memory cell.

@ $$x$$ : The input sequence.

@ $$y$$ : The output sequence.

@ $$\Gamma_r$$ : Gate gamma r. The relevance gate vector. It tells us how relevant $$c^{<t-1>}$$ is to computing the next candidate $$\tilde{c}^{<t>}$$.

@ $$\Gamma_u$$ : Gate gamma u. The update gate vector. It decides whether or not we actually update the memory cell $$c$$.

@ $$\tilde{c}^{<t>}$$ : The candidate value for the memory cell.

@ $$tanh$$ : Hyperbolic tangent function. It squashes a real-valued number to the range [-1,1]. Defined as $$ tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} $$. More in the Hyperbolic Function section of my article RNN-Base.

@ $$\sigma$$ : Sigmoid function. It squashes a real-valued number to the range [0,1]. Often the output value will be very close to either 0 or 1.

@ $$W_{*}$$ : The weight matrix for each gate, applied to the concatenation of $$c^{<t-1>}$$ and $$x^{<t>}$$.

@ $$b_{*}$$ : The bias vector for each gate.
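The formulas above can be sketched as a single GRU time step in NumPy. This is a minimal illustration, not a trained model: the function name `gru_step`, the shape conventions, and the random weights in the usage note are my own assumptions.

```python
import numpy as np

def sigmoid(z):
    """Squashes a real-valued number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x_t, Wc, bc, Wu, bu, Wr, br):
    """One fully gated GRU step, following the formulas above.

    Illustrative shapes: c_prev (n_c,), x_t (n_x,),
    each W_* (n_c, n_c + n_x), each b_* (n_c,).
    """
    concat = np.concatenate([c_prev, x_t])            # [c^{<t-1>}, x^{<t>}]
    gamma_r = sigmoid(Wr @ concat + br)               # relevance gate
    gamma_u = sigmoid(Wu @ concat + bu)               # update gate
    concat_r = np.concatenate([gamma_r * c_prev, x_t])
    c_tilde = np.tanh(Wc @ concat_r + bc)             # candidate memory cell
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev
    return c_t
```

Note that with $$c^{<0>} = 0$$ the new memory cell stays in $$[-1, 1]$$, since it is a convex combination of $$\tilde{c}^{<t>} \in (-1,1)$$ and the previous cell.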


The $$tanh$$ Function:

$$ \tikz \node [scale=1.1] { \begin{tikzpicture}[] \begin{axis}[ samples=120, axis line style=gray, xmin=-2, xmax=2, ymin=-2, ymax=2, axis equal, axis x line=center, axis y line=center, xlabel=$x$, ylabel=$y$, ] \addplot[blue]{(exp(x)-exp(-x))/(exp(x)+exp(-x))}; \addplot[blue] coordinates{(1.5,1.5)} node{$tanh(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}$}; \end{axis} \end{tikzpicture} }; $$
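A quick numerical check of the two squashing functions used above (the `sigmoid` helper is my own; NumPy provides `np.tanh` directly):

```python
import numpy as np

def sigmoid(z):
    """Squashes a real-valued number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

xs = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(np.tanh(xs))    # squashed into (-1, 1); tanh(0) = 0
print(sigmoid(xs))    # squashed into (0, 1); sigmoid(0) = 0.5
# Large-magnitude inputs saturate: sigmoid(5) ≈ 0.993, tanh(5) ≈ 0.9999,
# which is why gate activations are often very close to 0 or 1.
```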

LSTM - Long Short-Term Memory

The LSTM is a slightly more powerful and more general version of the GRU, with a more complicated structure.


The structure graph:

$$ %LSTM \tikzstyle{Entity}=[] \tikzstyle{Function}=[shape=rect, draw=blue, fill=cyan] \tikzstyle{Operation}=[shape=circle, draw=orange, fill=orange!30] \tikzstyle{arrow}=[draw, -latex, rounded corners=5pt] \tikzstyle{Region}=[draw=green, fill=green!30, line width=3pt, rounded corners=10pt] \begin{tikzpicture} %\draw [green] (-1,1) grid (10,-10); %\draw [line, brown] (-1,0) -- (10,0); %\draw [line, brown] (0,1) -- (0,-10); \draw [Region] (0.2,-1.0) rectangle (9.5,-8); \node [Entity] (c^{<t-1>}) at (-1,-2) {$c^{<t-1>}$}; \node [Entity] (c^{<t>}) at (11,-2) {$c^{<t>}$}; \node [Entity] (a^{<t-1>}) at (-1,-7) {$a^{<t-1>}$}; \node [Entity] (a^{<t>}) at (11,-7) {$a^{<t>}$}; \node [Entity] (x) at (1,-9) {$x^{<t>}$}; \node [Entity] (y) at (9,1) {$y^{<t>}$}; \node [Entity, left] () at (9,-0.5) {$a^{<t>}$}; \node [Entity, above] () at (5,-2) {$c^{<t>}$}; \node [Entity, right] (Gamma_f) at (2,-5.5) {$\Gamma_f$}; \node [Entity, right] (Gamma_u) at (3.5,-5.5) {$\Gamma_u$}; \node [Entity, right] (Gamma_o) at (6.8,-5.7) {$\Gamma_o$}; \node [Entity, right] (tilde_{c}^{<t>}) at (5,-5.4) {$\tilde{c}^{<t>}$}; \node [Function] (sigma_f) at (2,-6) {$\sigma_f$}; \node [Function] (sigma_u) at (3.5,-6) {$\sigma_u$}; \node [Function] (sigma_o) at (6.5,-6) {$\sigma_o$}; \node [Function] (tanh_cc) at (5,-6) {$tanh_{cc}$};% cc: candiate c \node [Function] (tanh_a) at (8,-4) {$tanh_a$}; \node [Operation] (multi_f) at (2,-2) {$*$}; \node [Operation] (multi_u) at (3.5,-4) {$*$}; \node [Operation] (add_c) at (3.5,-2) {$+$}; \node [Operation] (multi_o) at (8,-6) {$*$}; \draw [arrow] (a^{<t-1>}) -| (sigma_f); \draw [arrow] (a^{<t-1>}) -| (sigma_u); \draw [arrow] (a^{<t-1>}) -| (sigma_o); \draw [arrow] (a^{<t-1>}) -| (tanh_cc); \draw [arrow] (x) |- (2,-7) -- (sigma_f); \draw [arrow] (x) |- (3.5,-7) -- (sigma_u); \draw [arrow] (x) |- (6.5,-7) -- (sigma_o); \draw [arrow] (x) |- (5,-7) -- (tanh_cc); \draw [arrow] (c^{<t-1>}) -- (multi_f); \draw [arrow] (sigma_f) -- (multi_f); \draw [arrow] (multi_f) -- 
(add_c); \draw [arrow] (sigma_u) -- (multi_u); \draw [arrow] (tanh_cc) |- (multi_u); \draw [arrow] (multi_u) -- (add_c); \draw [arrow] (add_c) -- (c^{<t>}); \draw [arrow] (sigma_o) -- (multi_o); \draw [arrow] (add_c) -| (tanh_a); \draw [arrow] (tanh_a) -- (multi_o); \draw [arrow] (multi_o) |- (a^{<t>}); \draw [arrow] (multi_o) |- (9,-7) -- (y); \end{tikzpicture} $$


The formulas:

$$ \tilde{c}^{<t>} = tanh( W_c [a^{<t-1>}, x^{<t>}] + b_c ) \\ \Gamma_u = \sigma( W_u [a^{<t-1>}, x^{<t>}] + b_u ) \\ \Gamma_f = \sigma( W_f [a^{<t-1>}, x^{<t>}] + b_f ) \\ \Gamma_o = \sigma( W_o [a^{<t-1>}, x^{<t>}] + b_o ) \\ c^{<t>} = \Gamma_u \star \tilde{c}^{<t>} + \Gamma_f \star c^{<t-1>} \\ a^{<t>} = \Gamma_o \star tanh^{[:1]} ( c^{<t>} ) $$

@ $$c$$ : The memory cell.

@ $$a$$ : The output activation.

@ $$x$$ : The input sequence.

@ $$y$$ : The output sequence.

@ $$\Gamma_u$$ : The update gate's activation vector.

@ $$\Gamma_f$$ : The forget gate's activation vector.

@ $$\Gamma_o$$ : The output gate's activation vector.

@ $$\tilde{c}$$ : The candidate value for the memory cell.

# $$[comment:1]$$ : The hyperbolic tangent function $$tanh$$, or, as the peephole LSTM paper suggests, simply the identity function $$y=x$$ instead.
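The LSTM formulas can likewise be sketched as one time step in NumPy. As with the GRU sketch, this is a minimal illustration under my own naming and shape assumptions (`lstm_step`, `n_a`, `n_x`), and it uses $$tanh$$ rather than the identity variant mentioned in the footnote.

```python
import numpy as np

def sigmoid(z):
    """Squashes a real-valued number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x_t, Wc, bc, Wu, bu, Wf, bf, Wo, bo):
    """One LSTM step, following the formulas above.

    Illustrative shapes: a_prev, c_prev (n_a,), x_t (n_x,),
    each W_* (n_a, n_a + n_x), each b_* (n_a,).
    """
    concat = np.concatenate([a_prev, x_t])    # [a^{<t-1>}, x^{<t>}]
    c_tilde = np.tanh(Wc @ concat + bc)       # candidate memory cell
    gamma_u = sigmoid(Wu @ concat + bu)       # update gate
    gamma_f = sigmoid(Wf @ concat + bf)       # forget gate
    gamma_o = sigmoid(Wo @ concat + bo)       # output gate
    c_t = gamma_u * c_tilde + gamma_f * c_prev
    a_t = gamma_o * np.tanh(c_t)              # tanh here, per footnote [comment:1]
    return a_t, c_t
```

Unlike the GRU, the LSTM keeps separate update and forget gates, so $$c^{<t>}$$ is not forced to be a convex combination of the candidate and the previous cell; the output activation $$a^{<t>}$$, however, always stays in $$(-1, 1)$$.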