GRU_LSTM
0. RNN
.2. conditional RNN
$$ h_t = f(x_t, h_{t-1}), \qquad h_t := \tanh(W_{xh}x_t + W_{hh}h_{t-1}) $$
- Computation goal: in backpropagation, the partial derivative of the loss $l$ with respect to the hidden state vector $h_t$ at time step $t$.
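A sketch of the recursive backpropagation-through-time step for this quantity, assuming the tanh RNN above and a per-step loss term $l_t$ (my own formulation, not taken from the source note):

$$ \frac{\partial l}{\partial h_t} = \frac{\partial l_t}{\partial h_t} + W_{hh}^{\top}\,\mathrm{diag}\!\left(1 - h_{t+1} \odot h_{t+1}\right)\frac{\partial l}{\partial h_{t+1}} $$

Unrolling this recursion multiplies by $W_{hh}^{\top}$ and the tanh derivative once per step, which is why the gradient tends to vanish or explode over long sequences.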
1. GRU
GRU (Gated Recurrent Unit) is a type of recurrent neural network (RNN). Like LSTM (Long Short-Term Memory), it was proposed to handle long-range dependencies and the gradient problems that arise during backpropagation. Compared with LSTM, GRU achieves comparable results while being easier to train and considerably more efficient, so in many cases GRU is the preferred choice.
【Input/Output Structure】
【Internal Structure】
- r: the reset gate signal;
- z: the update gate signal; the closer the gate value is to 1, the more information is "remembered", and the closer it is to 0, the more is "forgotten".
【Update Equation】 $$ h^t = (1-z) \odot h^{t-1} + z \odot h' $$
- $(1-z)\odot h^{t-1}$: selective "forgetting" of the previous hidden state. Here $1-z$ can be viewed as a forget gate that drops unimportant information in the dimensions of $h^{t-1}$.
- $z\odot h'$: selective "remembering" of $h'$, which carries the current node's information. Analogously to the above, $z$ here filters out some unimportant information in the dimensions of $h'$; or rather, it should be seen as selecting certain information from the dimensions of $h'$.
- $h^t=(1-z)\odot h^{t-1}+z\odot h'$: combining the above, this step forgets some dimensions of the incoming $h^{t-1}$ and adds in some dimensions carried by the current node's input (a sketch implementing these equations follows below).
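A minimal GRU-cell sketch in PyTorch implementing the equations above, with $\odot$ as element-wise multiplication; the weight names (W_z, W_r, W_h, U_z, U_r, U_h) are my own and bias terms are omitted, so this is not the exact parameterization of torch.nn.GRU:

```python
import torch

def gru_cell(x_t, h_prev, W_z, W_r, W_h, U_z, U_r, U_h):
    """One GRU step: z/r gates, candidate state h', convex-combination update."""
    z = torch.sigmoid(x_t @ W_z + h_prev @ U_z)           # update gate
    r = torch.sigmoid(x_t @ W_r + h_prev @ U_r)           # reset gate
    h_cand = torch.tanh(x_t @ W_h + (r * h_prev) @ U_h)   # candidate h'
    return (1 - z) * h_prev + z * h_cand                  # h^t

# shapes: x_t (batch, d_in), h_prev (batch, d_h), W_* (d_in, d_h), U_* (d_h, d_h)
d_in, d_h, batch = 10, 20, 3
Ws = [torch.randn(d_in, d_h) for _ in range(3)]
Us = [torch.randn(d_h, d_h) for _ in range(3)]
h_t = gru_cell(torch.randn(batch, d_in), torch.zeros(batch, d_h), *Ws, *Us)
```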
.1. PyTorch API
CLASS torch.nn.GRU(*args, **kwargs)
>>> import torch
>>> import torch.nn as nn
>>> rnn = nn.GRU(10, 20, 2)            # input_size=10, hidden_size=20, num_layers=2
>>> input = torch.randn(5, 3, 10)      # (seq_len, batch, input_size)
>>> h0 = torch.randn(2, 3, 20)         # (num_layers, batch, hidden_size)
>>> output, hn = rnn(input, h0)
Learned from: https://zhuanlan.zhihu.com/p/32481747
2. LSTM
CLASS torch.nn.LSTM(*args, **kwargs)
- input_size – The number of expected features in the input x
- hidden_size – The number of features in the hidden state h
- num_layers – Number of recurrent layers. E.g., setting num_layers=2 would mean stacking two LSTMs together to form a stacked LSTM, with the second LSTM taking in outputs of the first LSTM and computing the final results. Default: 1
- bias – If False, then the layer does not use bias weights b_ih and b_hh. Default: True
- batch_first – If True, then the input and output tensors are provided as (batch, seq, feature). Default: False
- dropout – If non-zero, introduces a Dropout layer on the outputs of each LSTM layer except the last layer, with dropout probability equal to dropout. Default: 0
- bidirectional – If True, becomes a bidirectional LSTM. Default: False
- proj_size – If > 0, will use LSTM with projections of corresponding size. Default: 0
Inputs: input, (h_0, c_0)
- input of shape (seq_len, batch, input_size): tensor containing the features of the input sequence. The input can also be a packed variable length sequence. See torch.nn.utils.rnn.pack_padded_sequence() or torch.nn.utils.rnn.pack_sequence() for details.
- h_0 of shape (num_layers * num_directions, batch, hidden_size): tensor containing the initial hidden state for each element in the batch. If the LSTM is bidirectional, num_directions should be 2, else it should be 1. If proj_size > 0 was specified, the shape has to be (num_layers * num_directions, batch, proj_size).
- c_0 of shape (num_layers * num_directions, batch, hidden_size): tensor containing the initial cell state for each element in the batch.

If (h_0, c_0) is not provided, both h_0 and c_0 default to zero.
Outputs: output, (h_n, c_n)
- output of shape (seq_len, batch, num_directions * hidden_size): tensor containing the output features (h_t) from the last layer of the LSTM, for each t. If a torch.nn.utils.rnn.PackedSequence has been given as the input, the output will also be a packed sequence. If proj_size > 0 was specified, output shape will be (seq_len, batch, num_directions * proj_size). For the unpacked case, the directions can be separated using output.view(seq_len, batch, num_directions, hidden_size), with forward and backward being direction 0 and 1 respectively. Similarly, the directions can be separated in the packed case.
- h_n of shape (num_layers * num_directions, batch, hidden_size): tensor containing the hidden state for t = seq_len. If proj_size > 0 was specified, h_n shape will be (num_layers * num_directions, batch, proj_size). Like output, the layers can be separated using h_n.view(num_layers, num_directions, batch, hidden_size), and similarly for c_n.
- c_n of shape (num_layers * num_directions, batch, hidden_size): tensor containing the cell state for t = seq_len.
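A usage example mirroring the GRU snippet above, with shapes matching seq_len=5, batch=3, input_size=10, hidden_size=20, num_layers=2:

>>> rnn = nn.LSTM(10, 20, 2)                  # input_size=10, hidden_size=20, num_layers=2
>>> input = torch.randn(5, 3, 10)             # (seq_len, batch, input_size)
>>> h0 = torch.randn(2, 3, 20)                # (num_layers * num_directions, batch, hidden_size)
>>> c0 = torch.randn(2, 3, 20)                # (num_layers * num_directions, batch, hidden_size)
>>> output, (hn, cn) = rnn(input, (h0, c0))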
3. LSTM_paper
Paper《Long Short-Term Memory RNN Architectures for Large Scale Acoustic Modeling》
Note:
- First distributed training of LSTM RNNs using asynchronous stochastic gradient descent (ASGD) optimization on a large cluster of machines.
- Evaluated on the TIMIT speech database, then tested on a large-vocabulary speech recognition task (Google Voice Search).
- Shows how to calculate the total number of parameters and the computational complexity for a moderate number of inputs (see the sketch after this list).
- Eigen: a C++ matrix computation library.
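A minimal sketch of the parameter bookkeeping for a plain torch.nn.LSTM layer (no projection, no peepholes); the paper's LSTMP variant adds peephole weights and a recurrent projection matrix, so its count differs, and this is only meant to illustrate how such a count is assembled:

```python
import torch.nn as nn

def lstm_param_count(input_size: int, hidden_size: int) -> int:
    # Each of the 4 gates (input, forget, cell, output) has an input->hidden
    # weight, a hidden->hidden weight, and two bias vectors (b_ih, b_hh),
    # matching torch.nn.LSTM's parameterization for a single layer.
    return 4 * (input_size * hidden_size      # W_ih
                + hidden_size * hidden_size   # W_hh
                + 2 * hidden_size)            # b_ih + b_hh

lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=1)
assert lstm_param_count(10, 20) == sum(p.numel() for p in lstm.parameters())
```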
Paper《Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks》
Note:
- CNNs are good at reducing frequency variations, LSTMs are good at temporal modeling, and DNNs are appropriate for mapping features to a more separable space.
- Takes advantage of the complementarity of CNNs, LSTMs, and DNNs by combining them into one unified architecture; the proposed CLDNN architecture is evaluated on a variety of large-vocabulary continuous speech recognition (LVCSR) tasks (a rough sketch of the idea follows this list).
- Previous work trained the three models separately and then combined their outputs through a combination layer; here the three are trained jointly in one unified structure.
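A rough PyTorch sketch of the CNN→LSTM→DNN stacking idea; the layer sizes, the Conv1d-over-frequency choice, and the class name CLDNNSketch are my own placeholders, not the exact architecture from the paper (which also inserts a linear dimensionality-reduction layer between the CNN and LSTM blocks):

```python
import torch
import torch.nn as nn

class CLDNNSketch(nn.Module):
    """Conv over the frequency axis -> stacked LSTM over time -> DNN classifier."""

    def __init__(self, feat_dim=40, conv_channels=32, hidden_size=128, num_classes=10):
        super().__init__()
        self.conv = nn.Sequential(                      # reduce frequency variation
            nn.Conv1d(1, conv_channels, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.lstm = nn.LSTM(conv_channels * (feat_dim // 2), hidden_size,
                            num_layers=2, batch_first=True)   # temporal modeling
        self.dnn = nn.Sequential(                       # map to a more separable space
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, num_classes),
        )

    def forward(self, x):                        # x: (batch, time, feat_dim)
        b, t, f = x.shape
        x = self.conv(x.reshape(b * t, 1, f))    # per-frame frequency convolution
        x = x.reshape(b, t, -1)                  # back to a (batch, time, features) sequence
        x, _ = self.lstm(x)
        return self.dnn(x)                       # per-frame class scores

out = CLDNNSketch()(torch.randn(4, 100, 40))     # -> (4, 100, 10)
```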
Paper《Improved Semantic Representations from Tree-Structured LSTM》
Note: author: Kai Sheng Tai (Stanford)
- Models in which real-valued vectors are used to represent meaning fall into three classes: 1. bag-of-words models, 2. sequence models, 3. tree-structured models.
- Tested on two tasks: semantic relatedness prediction on sentence pairs, and sentiment classification of sentences drawn from movie reviews.
- Available code: https://github.com/stanfordnlp/treelsm , project: https://nlp.stanford.edu/projects/glove/
- Previous work:
  - A problem with RNNs with transition functions of this form is that, during training, components of the gradient vector can grow or decay exponentially over long sequences; exploding or vanishing gradients make it difficult for the RNN to learn long-distance correlations in a sequence.
  - Bidirectional LSTMs allow the hidden state to capture both past and future information, and multilayer LSTMs (also known as stacked or deep LSTMs) let the higher layers capture longer-term dependencies of the input sequence; however, both only allow strictly sequential information propagation.
- Data structure:
  - The difference from the standard LSTM is that gating vectors and memory cell updates depend on the states of possibly many child units; a Tree-LSTM unit contains one forget gate $f_{jk}$ for each child $k$, which lets it selectively incorporate information from each child (the transition equations are reproduced after this list).
- Classification model: predict a label for a node from a discrete set of classes (e.g., sentiment classes) using a softmax classifier over the node's hidden state.
- Semantic relatedness of sentence pairs: given a sentence pair, predict a real-valued similarity score in some range.
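For reference, the Child-Sum Tree-LSTM transition equations (written from memory of the paper; notation may differ slightly from the original), where $C(j)$ is the set of children of node $j$:

$$ \tilde{h}_j = \sum_{k \in C(j)} h_k, \qquad i_j = \sigma\!\left(W^{(i)} x_j + U^{(i)} \tilde{h}_j + b^{(i)}\right), \qquad f_{jk} = \sigma\!\left(W^{(f)} x_j + U^{(f)} h_k + b^{(f)}\right) $$

$$ o_j = \sigma\!\left(W^{(o)} x_j + U^{(o)} \tilde{h}_j + b^{(o)}\right), \qquad u_j = \tanh\!\left(W^{(u)} x_j + U^{(u)} \tilde{h}_j + b^{(u)}\right) $$

$$ c_j = i_j \odot u_j + \sum_{k \in C(j)} f_{jk} \odot c_k, \qquad h_j = o_j \odot \tanh(c_j) $$

The per-child forget gate $f_{jk}$ is exactly what allows the unit to selectively incorporate information from each child.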
Paper《Learning to Forget: Continual Prediction with LSTM》
cited: keyword:
Phenomenon&Challenge:
- backpropagated error quickly either vanishes or blows up
Chart&Analyse:
Code:
Shortcoming&Confusion:
- embedded Reber grammar
- Did not fully understand the derivation of the formulas.
Paper《SPEECH RECOGNITION WITH DEEP RECURRENT NEURAL NETWORKS》
cited: keyword:
Chart&Analyse:
$F_t$ denotes the forget gate, $I_t$ the input gate, $\tilde{C}_t$ the candidate cell state, $C_t$ the cell state (this is where the recurrence happens), $O_t$ the output gate, $H_t$ the output of the current unit, and $H_{t-1}$ the output of the previous time step.
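For reference, the standard LSTM equations in this notation (a generic textbook formulation, not necessarily the exact variant used in the paper):

$$ F_t = \sigma(W_f [H_{t-1}, x_t] + b_f), \qquad I_t = \sigma(W_i [H_{t-1}, x_t] + b_i), \qquad \tilde{C}_t = \tanh(W_C [H_{t-1}, x_t] + b_C) $$

$$ C_t = F_t \odot C_{t-1} + I_t \odot \tilde{C}_t, \qquad O_t = \sigma(W_o [H_{t-1}, x_t] + b_o), \qquad H_t = O_t \odot \tanh(C_t) $$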