NBIS
03-May-2026
A single neuron has n inputs x_i and an output y. To each input is associated a weight w_i.
The activity rule is given by two steps:
a = \sum_{i} w_ix_i, \quad i=0,...,n \quad (\text{where } x_0 \equiv 1)
\begin{array}{ccc} \mathrm{activation} & & \mathrm{activity}\\ a & \rightarrow & y(a) \end{array}
a = w_0 + \sum_{i} w_ix_i, \quad i=1,...,n
y = y(a) = g\left( w_0 + \sum_{i=1}^{n} w_ix_i \right)
or in vector notation
y = g\left(w_0 + \mathbf{X^T} \mathbf{W} \right)
where:
\quad\mathbf{X}= \begin{bmatrix}x_1\\ \vdots \\ x_n\end{bmatrix}, \quad \mathbf{W}=\begin{bmatrix}w_1\\ \vdots \\ w_n\end{bmatrix}
Vectorized versions: input \boldsymbol{x}, weights \boldsymbol{w}, output \boldsymbol{y}
a = \boldsymbol{wx}
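A minimal numerical sketch of the two-step activity rule (the input and weight values are made up, and tanh is used here only as an example activation g):

```python
import numpy as np

def g(a):
    return np.tanh(a)                 # example choice of activation function

x = np.array([0.5, -1.0, 2.0])        # inputs x_1, ..., x_n
w = np.array([0.1, 0.4, -0.3])        # weights w_1, ..., w_n
w0 = 0.2                              # bias weight

a = w0 + w @ x                        # activation: a = w_0 + sum_i w_i x_i
y = g(a)                              # activity: y = g(a)
```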
- one to one: Image classification
- many to one: Sentiment analysis
- one to many: Image captioning
- many to many: Machine translation
Assume multiple time points.
- Dependencies between inputs are not modelled \Rightarrow ambiguous sequences cannot be distinguished:
“dog bites man” vs “man bites dog”
Folded representation
Unfolded representation
Add a hidden state h that introduces a dependency on the previous step:
\hat{Y}_t = f(X_t, h_{t-1})
h_t is a summary of the inputs we’ve seen so far.
RNNs have what one could call “sequential memory” (Phi 2020)
Exercise: say the alphabet in your head
A B C … X Y Z
Modification: start from e.g. letter F
May take time to get started, but from there on it’s easy
Now read the alphabet in reverse:
Z Y X … C B A
Memory access is associative and context-dependent
Add a recurrence relation where the current hidden state h_t depends on the input x_t and the previous hidden state h_{t-1} via a function f_\mathbf{W} parameterized by the network weights:
h_t = f_\mathbf{W}(x_t, h_{t-1})
Note that the same function and weights are used across all time steps!
```python
import numpy as np

class RNN:
    # ...
    # Description of forward pass
    def step(self, x):
        # update the hidden state
        self.h = np.tanh(np.dot(self.W_hh, self.h) + np.dot(self.W_xh, x))
        # compute the output vector
        y = np.dot(self.W_hy, self.h)
        return y

rnn = RNN()
ff = FeedForwardNN()
for word in input:
    output = rnn.step(word)
    prediction = ff(output)
```

\hat{Y}_t = \mathbf{W_{hy}}^{\mathsf{T}} h_t

h_t = \mathsf{tanh}(\mathbf{W_{xh}}^{\mathsf{T}} X_t + \mathbf{W_{hh}}^{\mathsf{T}} h_{t-1})
Note: \mathbf{W_{xh}}, \mathbf{W_{hh}}, and \mathbf{W_{hy}} are shared across all cells!
Shared weights (parameters) are a fundamental characteristic that, among other things, addresses the following:

- Not all inputs are of equal length
- Long-range dependencies: “I grew up in England, and … I speak fluent English”
- Order matters: “dog bites man” != “man bites dog”
```
    time        passengers
0   1949-01-01  112
1   1949-02-01  118
2   1949-03-01  132
3   1949-04-01  129
4   1949-05-01  121
5   1949-06-01  135
6   1949-07-01  148
7   1949-08-01  148
8   1949-09-01  136
9   1949-10-01  119
10  1949-11-01  104
11  1949-12-01  118
```
Partition the time series into training and test data sets at, e.g., a 2:1 ratio
Generate input-output pairs by sliding a window across time points, where the input (X) corresponds to the values in the window and the output (Y) to the value following the last window index (see the toy example below)
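As a toy illustration (using the first passenger values from the table above and a window_size of 3), each window of consecutive values is paired with the value that follows it:

```python
series = [112, 118, 132, 129, 121, 135]
window_size = 3
pairs = [(series[j - window_size:j], series[j])
         for j in range(window_size, len(series))]
# pairs == [([112, 118, 132], 129), ([118, 132, 129], 121), ([132, 129, 121], 135)]
```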
```python
import torch
from torch.utils.data import Dataset

class PassengerDataset(Dataset):
    def __init__(self, data, time, window_size=12, **kw):
        self.data = data
        self.time = time
        # Inputs: the values inside each sliding window of length window_size
        self.X = torch.stack([self.data[j - window_size:j]
                              for j in range(window_size, len(self.data))])
        # Outputs: the value following the last index of each window
        self.Y = torch.stack([self.data[j]
                              for j in range(window_size, len(self.data))])

    def __getitem__(self, idx):
        return self.X[idx], self.Y[idx]

    def __len__(self):
        return len(self.X)
```

Data is well-formatted and, as a bonus, plotting is easy:
```python
train_fraction = 0.7
split_index = int(df.shape[0] * train_fraction)
data = torch.tensor(df.passengers,
                    dtype=torch.float32).unsqueeze(-1)
train = PassengerDataset(
    data[:split_index],
    time=df.time[:split_index],
    window_size=12
)
test = PassengerDataset(
    data[split_index:],
    time=df.time[split_index:],
    window_size=12
)
```

Let PassengerDataset take care of data setup, formatting, custom functionality etc., and let torch.utils.data.DataLoader handle data loading:
```python
import torch
import torch.nn as nn

class AirlineRNN(nn.Module):
    def __init__(self, hidden_size=3, output_size=1):
        super().__init__()
        self.rnn = nn.RNN(input_size=1, hidden_size=hidden_size,
                          num_layers=1, batch_first=True)
        self.fc1 = nn.Linear(hidden_size, output_size)
        self.fc2 = nn.Linear(output_size, 1)

    def forward(self, x):
        rnn_out, hidden = self.rnn(x)
        rnn_out = rnn_out[:, -1, :]  # Select final timestep
        x = self.fc1(rnn_out)
        return self.fc2(x)

    def predict(self, x):
        self.eval()
        with torch.no_grad():
            return self(x)
```

Define training and test loops that iterate over the dataloaders. The training loop updates model parameters by backpropagation and an optimizer; the test loop evaluates the model on an independent dataset.
```python
def train_loop(dataloader, model, loss_fn, optimizer, device):
    model.train()  # Set model to training mode
    total_loss = 0.0
    for batch, (X, y) in enumerate(dataloader):
        X = X.to(device)  # Copy data to GPU
        y = y.to(device)
        output = model(X)
        loss = loss_fn(output, y)
        optimizer.zero_grad()  # Clear gradients
        loss.backward()        # Backpropagation
        optimizer.step()       # Optimization
        total_loss += loss.item()  # Keep track of total loss
    return total_loss
```

The test loop is similar but does not need the optimization step (run it inside the with torch.no_grad() context manager!).
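A sketch of what such a test_loop could look like under these assumptions:

```python
def test_loop(dataloader, model, loss_fn, device):
    model.eval()                      # Set model to evaluation mode
    total_loss = 0.0
    with torch.no_grad():             # No gradients needed for evaluation
        for X, y in dataloader:
            X = X.to(device)
            y = y.to(device)
            output = model(X)
            total_loss += loss_fn(output, y).item()
    return total_loss
```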
Iterate over train_loop and test_loop in epochs and keep track of performance:
```python
epochs = 200
# Dictionary to store validation metrics; use same keys as Keras
metrics = {'accuracy': [], 'loss': [], 'val_accuracy': [], 'val_loss': []}
for t in range(epochs):
    loss = train_loop(train_dataloader, model,
                      loss_fn, optimizer, device)
    metrics["loss"].append(loss)
    val_loss = test_loop(test_dataloader, model,
                         loss_fn, device)
    metrics["val_loss"].append(val_loss)
print("Done!")
```

Plot / examine training metrics to evaluate performance.
```
AirlineRNN(
  (rnn): RNN(1, 3, batch_first=True)
  (fc1): Linear(in_features=3, out_features=1, bias=True)
  (fc2): Linear(in_features=1, out_features=1, bias=True)
)
```
NB! In PyTorch, RNN input is a 3D tensor with shape [timesteps, batch size, input size] by default; pass batch_first=True to instead use shape [batch size, timesteps, input size].
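A small sketch illustrating the expected shapes with batch_first=True (12 timesteps and 1 feature to match the airline model; the batch size of 8 is arbitrary):

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=1, hidden_size=3, batch_first=True)
x = torch.randn(8, 12, 1)   # [batch size, timesteps, input size]
out, h_n = rnn(x)
print(out.shape)            # torch.Size([8, 12, 3]): hidden state at every timestep
print(h_n.shape)            # torch.Size([1, 8, 3]): final hidden state (not batch_first)
```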
Example network trained on “hello”, showing activations in the forward pass given the input “hell”. The outputs contain confidence scores over the vocabulary {h, e, l, o}. We want the blue numbers to be high and the red numbers to be low. P(e) is in the context of “h”, P(l) in the context of “he”, and so on.
What is the topology of the network?
4 input units h, e, l, o (features), 4 time steps, 3 hidden units, 4 output units
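A sketch of how such a network could be set up in PyTorch, assuming one-hot encoded inputs (names and sizes are illustrative, not the trained network from the figure):

```python
import torch
import torch.nn as nn

vocab = ["h", "e", "l", "o"]
seq = "hell"
# One-hot encode the input: shape [batch=1, timesteps=4, features=4]
x = torch.stack([torch.eye(len(vocab))[vocab.index(ch)] for ch in seq]).unsqueeze(0)

rnn = nn.RNN(input_size=4, hidden_size=3, batch_first=True)
fc = nn.Linear(3, 4)          # map each hidden state to scores over {h, e, l, o}
hidden_states, _ = rnn(x)     # [1, 4, 3]: hidden state at each of the 4 time steps
scores = fc(hidden_states)    # [1, 4, 4]: one score per vocabulary letter per step
```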
Using the AirlineRNN (Vanilla RNN) class, see if you can improve the airline passenger model. Some things to try:
Errors are propagated backwards in time from the final time step t back to t=0.
Problem: calculating the gradient may involve large powers of \mathbf{W_{hh}}^{\mathsf{T}} (e.g., \partial\mathcal{L} / \partial h_0 \sim f\left((\mathbf{W_{hh}}^{\mathsf{T}})^t\right))
In layer i the gradient size \sim (\mathbf{W_{hh}}^{\mathsf{T}})^{t-i}
\downarrow
Weight adjustments depend on size of gradient
\downarrow
Early layers tend to “see” small gradients and do very little updating
\downarrow
Parameters become biased towards learning recent events
\downarrow
RNNs suffer from short-term memory
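The effect can be illustrated with a small numpy sketch (toy weight matrices chosen only for illustration): repeatedly multiplying by \mathbf{W_{hh}}^{\mathsf{T}} shrinks the gradient when the weights are small and blows it up when they are large.

```python
import numpy as np

W_small = 0.5 * np.eye(3)   # recurrent weights with norm < 1
W_large = 1.5 * np.eye(3)   # recurrent weights with norm > 1

g_small = np.ones(3)
g_large = np.ones(3)
for t in range(20):                 # 20 steps of backpropagation through time
    g_small = W_small.T @ g_small   # each step multiplies by W_hh^T
    g_large = W_large.T @ g_large

print(np.linalg.norm(g_small))      # ~1.7e-06: vanishing gradient
print(np.linalg.norm(g_large))      # ~5.8e+03: exploding gradient
```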
ReLU (or leaky ReLU) instead of sigmoid or tanh.
Prevents small gradients: for x > 0, the gradient is a positive constant (1)
Derivatives of \sigma, \mathsf{tanh} and \mathsf{ReLU} activation functions.
Set biases to 0 and initialize the recurrent weights \mathbf{W_{hh}} to the identity matrix
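A sketch of how this initialization could be done for an nn.RNN (combined here with the ReLU suggestion above):

```python
import torch.nn as nn

rnn = nn.RNN(input_size=1, hidden_size=3, nonlinearity="relu", batch_first=True)
nn.init.eye_(rnn.weight_hh_l0)    # recurrent weights -> identity matrix
nn.init.zeros_(rnn.bias_hh_l0)    # recurrent bias -> 0
nn.init.zeros_(rnn.bias_ih_l0)    # input bias -> 0
```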
For example the LSTM. The idea is to control what information is retained within each RNN unit.
Gated units make use of elementwise multiplication (×) and addition (+) to combine signals.
LSTM
GRU
Long Short Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) and Gated Recurrent Unit (GRU) (Cho et al., 2014) architectures were proposed to solve the vanishing gradient problem.
In this paper, we propose a novel neural network model called RNN Encoder-Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder-Decoder as an additional feature in the existing log-linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases.
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (Cho et al., 2014)
Remember the important parts, pay less attention to (forget) the rest.
LSTM adds a cell state that in effect provides the long-term memory
Information flows in the cell state from c_{t-1} to c_t.
Gates affect the amount of information let through. The sigmoid layer outputs anything from 0 (nothing) to 1 (everything).
In our preliminary experiments, we found that it is crucial to use this new unit with gating units. We were not able to get meaningful result with an oft-used tanh unit without any gating.
Purpose: reset content of cell state
Purpose: decide when to read data into cell state
Purpose: read entries from cell state
Purpose: decide what information to keep or throw away
Sigmoid squishes vector [\boldsymbol{h_{t-1}}, \boldsymbol{x_t}] (previous hidden state + input) to (0, 1) for each value in cell state c_{t-1}, where 0 means “forget entry”, 1 “keep it”
f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)
Two steps to adding new information:
i_t = \sigma (W_i \cdot [h_{t-1}, x_t] + b_i)\\ \tilde{c}_t = \mathsf{tanh}(W_c \cdot [h_{t-1}, x_t] + b_c)
c_t = f_t * c_{t-1} + i_t * \tilde{c}_t
Output is filtered version of cell state.
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)\\ h_t = o_t * \mathsf{tanh}(c_t)
f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)\\ i_t = \sigma (W_i \cdot [h_{t-1}, x_t] + b_i)\\ \tilde{c}_t = \mathsf{tanh}(W_c \cdot [h_{t-1}, x_t] + b_c)\\ c_t = f_t * c_{t-1} + i_t * \tilde{c}_t\\ o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)\\ h_t = o_t * \mathsf{tanh}(c_t)
x_t \in \mathbb{R}^{n\times d}, \quad h_{t-1} \in \mathbb{R}^{n \times h}, \quad i_t, f_t, o_t \in \mathbb{R}^{n\times h}, \quad c_t \in \mathbb{R}^{n\times h}, \quad b_f, b_i, b_c, b_o \in \mathbb{R}^{1\times h}
and
W_f, W_i, W_c, W_o \in \mathbb{R}^{h \times (h+d)}
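A minimal, unbatched numpy sketch of a single LSTM step following the equations above (the dimensions and random weights are chosen only for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step for a single example, following the equations above."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])   # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde       # update cell state
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate
    h_t = o_t * np.tanh(c_t)                 # new hidden state
    return h_t, c_t

d, h = 2, 3                                  # toy input and hidden dimensions
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(h, h + d)) for k in "fico"}
b = {k: np.zeros(h) for k in "fico"}
h_t, c_t = lstm_step(rng.normal(size=d), np.zeros(h), np.zeros(h), W, b)
```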
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
Attention Is All You Need (Vaswani et al., 2017)
Transformers were introduced in 2017 by a team at Google Brain and are increasingly the model of choice for NLP problems, replacing RNN models such as long short-term memory (LSTM). The additional training parallelization allows training on larger datasets.
Modify the airline passenger model to use an LSTM and compare the results. Try out different parameters to improve test predictions.
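One possible starting point, sketched below (AirlineLSTM is a hypothetical name; note that nn.LSTM returns both hidden and cell states):

```python
import torch.nn as nn

class AirlineLSTM(nn.Module):
    def __init__(self, hidden_size=3, output_size=1):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size,
                            num_layers=1, batch_first=True)
        self.fc1 = nn.Linear(hidden_size, output_size)
        self.fc2 = nn.Linear(output_size, 1)

    def forward(self, x):
        lstm_out, (hidden, cell) = self.lstm(x)
        lstm_out = lstm_out[:, -1, :]   # Select final timestep
        return self.fc2(self.fc1(lstm_out))
```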
Predict the next letter in the alphabet
Recurrent neural networks