Feedforward Propagation
What is Feedforward Propagation?
It is the first step in the training of a neural network (after the initialization of the weights, which will be covered in the next lecture). The forward direction means going from the input nodes to the output nodes.
Feedforward propagation, also called the forward pass, is the process of computing and storing the output values of all the network's nodes, from the first hidden layer to the last (output) layer, using as input either a subset of the dataset or all of its samples.
Forward propagation thus leads to a list of the neural network's predictions, one for each data instance (row) used as input. At each node, the computation is the key equation (50) we saw in the previous Section Model Representation, written again for convenience:

\[
a = f\left( \sum_{j} w_j \, x_j \; + \; b \right) ,
\]

where the \(x_j\) are the node's inputs, the \(w_j\) the weights of the incoming connections, \(b\) the node's bias and \(f\) the activation function.
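To make this concrete, here is a minimal numerical sketch of a single node's computation; the input values, weights, bias and the choice of a sigmoid activation below are made up purely for illustration:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation, squashing the weighted sum into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up values: three input features feeding a single node
x = np.array([0.5, -1.2, 3.0])   # inputs to the node
w = np.array([0.1,  0.4, -0.2])  # one weight per incoming connection
b = 0.05                         # bias of the node

# The node's output: activation function applied to (weighted sum + bias)
a = sigmoid(np.dot(w, x) + b)
print(a)                         # a single number, the node's activation
```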
Let’s define everything in the next subsection.
Notations
Let’s say we have the following network with \(n\) input features, a first hidden layer with \(q\) activation units and a second hidden layer with \(r\) activation units. For simplicity, we will choose an output layer with only one node:
There are lots of subscripts and superscripts here. Let’s explain the conventions we will use.
Input data
We saw in Lecture 2 that the dataset in supervised learning can be represented as a matrix \(X\) of \(m\) data instances (rows) of \(n\) input features (columns). For clarity of notation, we will focus for now on only one data instance, the \(i^{\text{th}}\) sample. We will write it as a column vector \(\boldsymbol{x}^{(i)}\):

\[
\boldsymbol{x}^{(i)} = \begin{pmatrix} x^{(i)}_1 \\ x^{(i)}_2 \\ \vdots \\ x^{(i)}_n \end{pmatrix}
\]
The vector elements are all the features of that data instance. The superscript indicates the sample index \(i\), going from 1 to \(m\), the total number of samples in the dataset.
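As a quick illustration with a made-up data matrix, the \(i^{\text{th}}\) sample can be pulled out of \(X\) as a column vector; note that Python indexing starts at 0, while the text counts samples from 1:

```python
import numpy as np

# Made-up dataset: m = 4 samples (rows), n = 3 features (columns)
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0],
              [0.5, 1.5, 2.5]])

i = 2                          # pick one sample (0-based index here)
x_i = X[i].reshape(-1, 1)      # column vector of shape (n, 1)
print(x_i.shape)               # (3, 1)
```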
Activation units
In a given layer \(\ell = 1, 2, \cdots, N^\text{layer}\), the activation units produce outputs that we will also write as a column vector:

\[
\boldsymbol{a}^{(i, \: \ell)} = \begin{pmatrix} a^{(i, \: \ell)}_1 \\ a^{(i, \: \ell)}_2 \\ \vdots \\ a^{(i, \: \ell)}_q \end{pmatrix}
\]

where the subscript is the row of the activation unit in the layer, starting from the top (here the layer has \(q\) units). The superscript indicates the sample index \(i\) and the layer number \(\ell\). Why does the sample index appear here? We will see soon that these activation units take a different value for each data sample.
Biases
The biases are also column vectors, one for each layer they connect to, with dimension equal to the number of nodes in that layer:

\[
\boldsymbol{b}^{(\ell)} = \begin{pmatrix} b^{(\ell)}_1 \\ b^{(\ell)}_2 \\ \vdots \\ b^{(\ell)}_q \end{pmatrix}
\]

If the last layer has only one node, as in our example above, then its bias vector reduces to a scalar. Note that the biases do not depend on the sample index \(i\).
Weights
Now the weights. You may see different ways to represent them in the literature. Here we use a convention we could write as:

\[
w^{(\ell)}_{j, \: k} \;\equiv\; w^{(\ell)}_{j \; \to \; k} \;,
\]

the weight connecting node \(j\) of layer \(\ell - 1\) to node \(k\) of layer \(\ell\).
In other words, the weights from layer \(\ell - 1\) to layer \(\ell\) have as their first index the row of the node in the previous layer (the departing node of the weight’s arrow). The second index is the row of the node in layer \(\ell\) to which the weight’s arrow points. The weight \(w^{(\ell)}_{j \; \to \; k}\) thus departs from node \(j\) in layer \(\ell - 1\) and connects to node \(k\) in layer \(\ell\). For instance, \(w^{(2)}_{3,1}\) is the weight from the third node of layer (1) going to the first node of layer (2).
We can actually collect all the weights from layer \(\ell - 1\) to layer \(\ell\) into a matrix \(W^{(\ell)}\). If the previous layer \(\ell - 1\) has \(n\) activation units and the layer \(\ell\) has \(q\) activation units, we will have:

\[
W^{(\ell)} = \begin{pmatrix}
w^{(\ell)}_{1,1} & w^{(\ell)}_{1,2} & \cdots & w^{(\ell)}_{1,q} \\
w^{(\ell)}_{2,1} & w^{(\ell)}_{2,2} & \cdots & w^{(\ell)}_{2,q} \\
\vdots & \vdots & \ddots & \vdots \\
w^{(\ell)}_{n,1} & w^{(\ell)}_{n,2} & \cdots & w^{(\ell)}_{n,q}
\end{pmatrix} ,
\]

a matrix of \(n\) rows and \(q\) columns.
Note that we do not have an index \(i\) for the weight matrix \(W^{(\ell)}\). Why? Because the weights are unique for a given network: the weights and the biases are optimized after the network has processed all the data samples. We will determine the optimal weights and biases in the next chapter.
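To make the indexing convention concrete, here is a small sketch of how such a weight matrix could be stored as a NumPy array; the layer sizes and the random values are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

n, q = 3, 4                  # previous layer has n nodes, current layer has q nodes
W = rng.normal(size=(n, q))  # weight matrix W^(l): one row per departing node,
                             # one column per receiving node

# With this convention, W[j-1, k-1] is the weight from node j of layer l-1
# to node k of layer l (the -1 accounts for Python's 0-based indexing).
w_3_to_1 = W[2, 0]           # e.g. the weight from node 3 (previous layer) to node 1
print(W.shape)               # (3, 4)
```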
Let’s now see how we calculate all the values of the activation units!
Step-by-step calculations
General rule for Forward Propagation
If we rewrite the first layer of inputs for a given sample \(\boldsymbol{x}^{(i)}\) as a “layer zero” \(\boldsymbol{a}^{(i, \: 0)}\):

\[
\boldsymbol{a}^{(i, \: 0)} = \boldsymbol{x}^{(i)} ,
\]
then we can write a general rule for computing the outputs of a fully connected layer \(\ell\) from the outputs of the previous layer \(\ell - 1\):

\[
\boldsymbol{a}^{(i, \: \ell)} = f\left( \left( W^{(\ell)} \right)^T \boldsymbol{a}^{(i, \: \ell - 1)} \; + \; \boldsymbol{b}^{(\ell)} \right) ,
\]

where \(f\) is the activation function, applied element-wise.
This is the general rule for computing all outputs of a fully connected feedforward neural network.
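This rule translates almost line by line into code. Below is a minimal sketch of the forward pass for one sample, assuming a sigmoid activation function and made-up random weights and biases; in practice their values come from the training procedure covered in the next lectures:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation, applied element-wise (an assumed choice for this sketch)."""
    return 1.0 / (1.0 + np.exp(-z))

def forward_pass(x, weights, biases):
    """Feedforward propagation for one sample.

    x       : column vector of shape (n, 1), the 'layer zero' a^(i,0)
    weights : list of matrices W^(l), each of shape (nodes in l-1, nodes in l)
    biases  : list of column vectors b^(l), each of shape (nodes in l, 1)

    Returns the list of activations [a^(i,0), a^(i,1), ..., a^(i,last)].
    """
    activations = [x]                 # a^(i,0) = x^(i)
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W.T @ a + b)      # a^(i,l) = f( W^(l)T a^(i,l-1) + b^(l) )
        activations.append(a)
    return activations

# Made-up example matching the network above: n = 3 inputs, q = 4 and r = 2
# hidden units, and a single output node.
rng = np.random.default_rng(42)
weights = [rng.normal(size=(3, 4)), rng.normal(size=(4, 2)), rng.normal(size=(2, 1))]
biases  = [rng.normal(size=(4, 1)), rng.normal(size=(2, 1)), rng.normal(size=(1, 1))]

x = rng.normal(size=(3, 1))           # one sample x^(i)
activations = forward_pass(x, weights, biases)
y_pred = activations[-1]              # the network's prediction for this sample
print(y_pred.shape)                   # (1, 1)
```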
Summary
Feedforward propagation is the computation of the values of all activation units of a fully connected feedforward neural network.
As the process includes the last layer (output), feedforward propagation also leads to predictions.
These predictions will be compared to the observed values.
Feedforward propagation is a step in the training of a neural network.
The next step of the training is to go ‘backward’, from the output error \(\hat{y}^\text{pred} - y^\text{obs}\) back to the first layer, and then adjust all weights using a gradient-based procedure. This is the core of backpropagation, which we will cover in the next lecture.
Learn More
Very nice animations here illustrating the forward propagation process.
Source: Xinyu You’s course An online deep learning course for humanists