NBIS
03-May-2026
A single neuron has n inputs x_i and an output y. To each input is associated a weight w_i.
The activity rule is given by two steps:
a = \sum_{i} w_ix_i, \quad i=0,...,n \quad (\text{where } x_0 \equiv 1)
\begin{array}{ccc} \mathrm{activation} & & \mathrm{activity}\\ a & \rightarrow & y(a) \end{array}
a = w_0 + \sum_{i} w_ix_i, \quad i=1,...,n
y = y(a) = g\left( w_0 + \sum_{i=1}^{n} w_ix_i \right)
or in vector notation
y = g\left(w_0 + \mathbf{X^T} \mathbf{W} \right)
where:
\quad\mathbf{X}= \begin{bmatrix}x_1\\ \vdots \\ x_n\end{bmatrix}, \quad \mathbf{W}=\begin{bmatrix}w_1\\ \vdots \\ w_n\end{bmatrix}
Vectorized versions: input \boldsymbol{x}, weights \boldsymbol{w}, output \boldsymbol{y}
a = \boldsymbol{wx}
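A minimal numerical sketch of the two-step activity rule (the input and weight values are made up, and tanh is used here only as an example activation g):

```python
import numpy as np

def g(a):
    return np.tanh(a)                 # example choice of activation function

x = np.array([0.5, -1.0, 2.0])        # inputs x_1, ..., x_n
w = np.array([0.1, 0.4, -0.3])        # weights w_1, ..., w_n
w0 = 0.2                              # bias weight

a = w0 + w @ x                        # activation: a = w_0 + sum_i w_i x_i
y = g(a)                              # activity: y = g(a)
```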
- one to one: Image classification
- many to one: Sentiment analysis
- one to many: Image captioning
- many to many: Machine translation
Assume multiple time points.
- Dependencies between inputs are not modelled \Rightarrow ambiguous sequences cannot be distinguished:
“dog bites man” vs “man bites dog”
Folded representation
Unfolded representation
Add a hidden state h that introduces a dependency on the previous step:
\hat{Y}_t = f(X_t, h_{t-1})
h_t is a summary of the inputs we’ve seen so far.
RNNs have what one could call “sequential memory” (Phi 2020)
Exercise: say the alphabet in your head
A B C … X Y Z
Modification: start from e.g. letter F
May take time to get started, but from there on it’s easy
Now read the alphabet in reverse:
Z Y X … C B A
Memory access is associative and context-dependent
Add a recurrence relation where the current hidden state h_t depends on the input x_t and the previous hidden state h_{t-1} via a function f_\mathbf{W} parameterized by the network weights:
h_t = f_\mathbf{W}(x_t, h_{t-1})
Note that the same function and weights are used across all time steps!
```python
import numpy as np

class RNN:
    # ...
    # Description of forward pass
    def step(self, x):
        # update the hidden state
        self.h = np.tanh(np.dot(self.W_hh, self.h) + np.dot(self.W_xh, x))
        # compute the output vector
        y = np.dot(self.W_hy, self.h)
        return y

rnn = RNN()
ff = FeedForwardNN()
for word in input:
    output = rnn.step(word)
    prediction = ff(output)
```

\hat{Y}_t = \mathbf{W_{hy}}^{\mathsf{T}} h_t

h_t = \mathsf{tanh}(\mathbf{W_{xh}}^{\mathsf{T}} X_t + \mathbf{W_{hh}}^{\mathsf{T}} h_{t-1})
Note: \mathbf{W_{xh}}, \mathbf{W_{hh}}, and \mathbf{W_{hy}} are shared across all cells!
Shared weights (parameters) are a fundamental characteristic that, among other things, addresses the following:

- Not all inputs are of equal length
- Long-range dependencies: “I grew up in England, and … I speak fluent English”
- Order matters: “dog bites man” != “man bites dog”
```
    time        passengers
0   1949-01-01  112
1   1949-02-01  118
2   1949-03-01  132
3   1949-04-01  129
4   1949-05-01  121
5   1949-06-01  135
6   1949-07-01  148
7   1949-08-01  148
8   1949-09-01  136
9   1949-10-01  119
10  1949-11-01  104
11  1949-12-01  118
```
Partition the time series into training and test data sets at, e.g., a 2:1 ratio
Generate input-output pairs by sliding a window across time points, where the input (X) corresponds to the values in the window and the output (Y) to the value following the last window index (see the toy example below)
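As a toy illustration (using the first passenger values from the table above and a window_size of 3), each window of consecutive values is paired with the value that follows it:

```python
series = [112, 118, 132, 129, 121, 135]
window_size = 3
pairs = [(series[j - window_size:j], series[j])
         for j in range(window_size, len(series))]
# pairs == [([112, 118, 132], 129), ([118, 132, 129], 121), ([132, 129, 121], 135)]
```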
```python
import torch
from torch.utils.data import Dataset

class PassengerDataset(Dataset):
    def __init__(self, data, time, window_size=12, **kw):
        self.data = data
        self.time = time
        # Inputs: the values inside each sliding window of length window_size
        self.X = torch.stack([self.data[j - window_size:j]
                              for j in range(window_size, len(self.data))])
        # Outputs: the value following the last index of each window
        self.Y = torch.stack([self.data[j]
                              for j in range(window_size, len(self.data))])

    def __getitem__(self, idx):
        return self.X[idx], self.Y[idx]

    def __len__(self):
        return len(self.X)
```

Data is well-formatted and, as a bonus, plotting is easy:
```python
train_fraction = 0.7
split_index = int(df.shape[0] * train_fraction)
data = torch.tensor(df.passengers,
                    dtype=torch.float32).unsqueeze(-1)
train = PassengerDataset(
    data[:split_index],
    time=df.time[:split_index],
    window_size=12
)
test = PassengerDataset(
    data[split_index:],
    time=df.time[split_index:],
    window_size=12
)
```

Let PassengerDataset take care of data setup, formatting, custom functionality etc., and let torch.utils.data.DataLoader handle data loading:
```python
import torch
import torch.nn as nn

class AirlineRNN(nn.Module):
    def __init__(self, hidden_size=3, output_size=1):
        super().__init__()
        self.rnn = nn.RNN(input_size=1, hidden_size=hidden_size,
                          num_layers=1, batch_first=True)
        self.fc1 = nn.Linear(hidden_size, output_size)
        self.fc2 = nn.Linear(output_size, 1)

    def forward(self, x):
        rnn_out, hidden = self.rnn(x)
        rnn_out = rnn_out[:, -1, :]  # Select final timestep
        x = self.fc1(rnn_out)
        return self.fc2(x)

    def predict(self, x):
        self.eval()
        with torch.no_grad():
            return self(x)
```

Define training and test loops that iterate over the dataloaders. The training loop updates model parameters by backpropagation and an optimizer; the test loop evaluates the model on an independent dataset.
```python
def train_loop(dataloader, model, loss_fn, optimizer, device):
    model.train()  # Set model to training mode
    total_loss = 0.0
    for batch, (X, y) in enumerate(dataloader):
        X = X.to(device)  # Copy data to GPU
        y = y.to(device)
        output = model(X)
        loss = loss_fn(output, y)
        optimizer.zero_grad()  # Clear gradients
        loss.backward()        # Backpropagation
        optimizer.step()       # Optimization
        total_loss += loss.item()  # Keep track of total loss
    return total_loss
```

The test loop is similar but does not need the optimization step (run it inside the with torch.no_grad() context manager!).
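A sketch of what such a test_loop could look like under these assumptions:

```python
def test_loop(dataloader, model, loss_fn, device):
    model.eval()                      # Set model to evaluation mode
    total_loss = 0.0
    with torch.no_grad():             # No gradients needed for evaluation
        for X, y in dataloader:
            X = X.to(device)
            y = y.to(device)
            output = model(X)
            total_loss += loss_fn(output, y).item()
    return total_loss
```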
Iterate over train_loop and test_loop in epochs and keep track of performance:
```python
epochs = 200
# Dictionary to store validation metrics; use same keys as Keras
metrics = {'accuracy': [], 'loss': [], 'val_accuracy': [], 'val_loss': []}
for t in range(epochs):
    loss = train_loop(train_dataloader, model,
                      loss_fn, optimizer, device)
    metrics["loss"].append(loss)
    val_loss = test_loop(test_dataloader, model,
                         loss_fn, device)
    metrics["val_loss"].append(val_loss)
print("Done!")
```

Plot / examine training metrics to evaluate performance.
```
AirlineRNN(
  (rnn): RNN(1, 3, batch_first=True)
  (fc1): Linear(in_features=3, out_features=1, bias=True)
  (fc2): Linear(in_features=1, out_features=1, bias=True)
)
```
NB! In PyTorch, RNN input is a 3D tensor with shape [timesteps, batch size, input size] by default; pass batch_first=True to instead use shape [batch size, timesteps, input size].
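A small sketch illustrating the expected shapes with batch_first=True (12 timesteps and 1 feature to match the airline model; the batch size of 8 is arbitrary):

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=1, hidden_size=3, batch_first=True)
x = torch.randn(8, 12, 1)   # [batch size, timesteps, input size]
out, h_n = rnn(x)
print(out.shape)            # torch.Size([8, 12, 3]): hidden state at every timestep
print(h_n.shape)            # torch.Size([1, 8, 3]): final hidden state (not batch_first)
```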
Example network trained on “hello”, showing activations in the forward pass given the input “hell”. The outputs contain confidence scores over the vocabulary {h, e, l, o}. We want the blue numbers to be high and the red numbers to be low. P(e) is in the context of “h”, P(l) in the context of “he”, and so on.
What is the topology of the network?
4 input units h, e, l, o (features), 4 time steps, 3 hidden units, 4 output units
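A sketch of how such a network could be set up in PyTorch, assuming one-hot encoded inputs (names and sizes are illustrative, not the trained network from the figure):

```python
import torch
import torch.nn as nn

vocab = ["h", "e", "l", "o"]
seq = "hell"
# One-hot encode the input: shape [batch=1, timesteps=4, features=4]
x = torch.stack([torch.eye(len(vocab))[vocab.index(ch)] for ch in seq]).unsqueeze(0)

rnn = nn.RNN(input_size=4, hidden_size=3, batch_first=True)
fc = nn.Linear(3, 4)          # map each hidden state to scores over {h, e, l, o}
hidden_states, _ = rnn(x)     # [1, 4, 3]: hidden state at each of the 4 time steps
scores = fc(hidden_states)    # [1, 4, 4]: one score per vocabulary letter per step
```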
Using the AirlineRNN (Vanilla RNN) class, see if you can improve the airline passenger model. Some things to try:
Errors are propagated backwards in time from the final time step t back to t=0.
Problem: calculating the gradient may involve large powers of \mathbf{W_{hh}}^{\mathsf{T}} (e.g., \partial\mathcal{L} / \partial h_0 \sim f\left((\mathbf{W_{hh}}^{\mathsf{T}})^t\right))
In layer i the gradient size \sim (\mathbf{W_{hh}}^{\mathsf{T}})^{t-i}
\downarrow
Weight adjustments depend on size of gradient
\downarrow
Early layers tend to “see” small gradients and do very little updating
\downarrow
Parameters become biased towards learning recent events
\downarrow
RNNs suffer from short-term memory
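The effect can be illustrated with a small numpy sketch (toy weight matrices chosen only for illustration): repeatedly multiplying by \mathbf{W_{hh}}^{\mathsf{T}} shrinks the gradient when the weights are small and blows it up when they are large.

```python
import numpy as np

W_small = 0.5 * np.eye(3)   # recurrent weights with norm < 1
W_large = 1.5 * np.eye(3)   # recurrent weights with norm > 1

g_small = np.ones(3)
g_large = np.ones(3)
for t in range(20):                 # 20 steps of backpropagation through time
    g_small = W_small.T @ g_small   # each step multiplies by W_hh^T
    g_large = W_large.T @ g_large

print(np.linalg.norm(g_small))      # ~1.7e-06: vanishing gradient
print(np.linalg.norm(g_large))      # ~5.8e+03: exploding gradient
```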
ReLU (or leaky ReLU) instead of sigmoid or tanh.
Prevents small gradients: for x > 0, the gradient is a positive constant (1)
Derivatives of \sigma, \mathsf{tanh} and \mathsf{ReLU} activation functions.
Set biases to 0 and initialize the recurrent weights \mathbf{W_{hh}} to the identity matrix
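A sketch of how this initialization could be done for an nn.RNN (combined here with the ReLU suggestion above):

```python
import torch.nn as nn

rnn = nn.RNN(input_size=1, hidden_size=3, nonlinearity="relu", batch_first=True)
nn.init.eye_(rnn.weight_hh_l0)    # recurrent weights -> identity matrix
nn.init.zeros_(rnn.bias_hh_l0)    # recurrent bias -> 0
nn.init.zeros_(rnn.bias_ih_l0)    # input bias -> 0
```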
For example the LSTM. The idea is to control what information is retained within each RNN unit.
Gated units make use of elementwise multiplication (×) and addition (+) to combine signals.
LSTM
GRU
Long Short Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) and Gated Recurrent Unit (GRU) (Cho et al., 2014) architectures were proposed to solve the vanishing gradient problem.
In this paper, we propose a novel neural network model called RNN Encoder-Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder-Decoder as an additional feature in the existing log-linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases.
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (Cho et al., 2014)
Remember the important parts, pay less attention to (forget) the rest.
LSTM adds a cell state that in effect provides the long-term memory
Information flows in the cell state from c_{t-1} to c_t.
Gates affect the amount of information let through. The sigmoid layer outputs anything from 0 (nothing) to 1 (everything).
In our preliminary experiments, we found that it is crucial to use this new unit with gating units. We were not able to get meaningful result with an oft-used tanh unit without any gating.
Purpose: reset content of cell state
Purpose: decide when to read data into cell state
Purpose: read entries from cell state
Purpose: decide what information to keep or throw away
Sigmoid squishes vector [\boldsymbol{h_{t-1}}, \boldsymbol{x_t}] (previous hidden state + input) to (0, 1) for each value in cell state c_{t-1}, where 0 means “forget entry”, 1 “keep it”
f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)
Two steps to adding new information:
i_t = \sigma (W_i \cdot [h_{t-1}, x_t] + b_i)\\ \tilde{c}_t = \mathsf{tanh}(W_c \cdot [h_{t-1}, x_t] + b_c)
c_t = f_t * c_{t-1} + i_t * \tilde{c}_t
Output is filtered version of cell state.
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)\\ h_t = o_t * \mathsf{tanh}(c_t)
f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)\\ i_t = \sigma (W_i \cdot [h_{t-1}, x_t] + b_i)\\ \tilde{c}_t = \mathsf{tanh}(W_c \cdot [h_{t-1}, x_t] + b_c)\\ c_t = f_t * c_{t-1} + i_t * \tilde{c}_t\\ o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)\\ h_t = o_t * \mathsf{tanh}(c_t)
x_t \in \mathbb{R}^{n\times d}, \quad h_{t-1} \in \mathbb{R}^{n \times h}, \quad i_t, f_t, o_t \in \mathbb{R}^{n\times h}, \quad c_t \in \mathbb{R}^{n\times h}, \quad b_f, b_i, b_c, b_o \in \mathbb{R}^{1\times h}
and
W_f, W_i, W_c, W_o \in \mathbb{R}^{h \times (h+d)}
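A minimal, unbatched numpy sketch of a single LSTM step following the equations above (the dimensions and random weights are chosen only for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step for a single example, following the equations above."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])   # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde       # update cell state
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate
    h_t = o_t * np.tanh(c_t)                 # new hidden state
    return h_t, c_t

d, h = 2, 3                                  # toy input and hidden dimensions
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(h, h + d)) for k in "fico"}
b = {k: np.zeros(h) for k in "fico"}
h_t, c_t = lstm_step(rng.normal(size=d), np.zeros(h), np.zeros(h), W, b)
```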
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
Attention Is All You Need (Vaswani et al., 2017)
Transformers were introduced in 2017 by a team at Google Brain and are increasingly the model of choice for NLP problems, replacing RNN models such as long short-term memory (LSTM). The additional training parallelization allows training on larger datasets.
Modify the airline passenger model to use an LSTM and compare the results. Try out different parameters to improve test predictions.
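One possible starting point, sketched below (AirlineLSTM is a hypothetical name; note that nn.LSTM returns both hidden and cell states):

```python
import torch.nn as nn

class AirlineLSTM(nn.Module):
    def __init__(self, hidden_size=3, output_size=1):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size,
                            num_layers=1, batch_first=True)
        self.fc1 = nn.Linear(hidden_size, output_size)
        self.fc2 = nn.Linear(output_size, 1)

    def forward(self, x):
        lstm_out, (hidden, cell) = self.lstm(x)
        lstm_out = lstm_out[:, -1, :]   # Select final timestep
        return self.fc2(self.fc1(lstm_out))
```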
Predict the next letter in the alphabet
Recurrent neural networks