Monday, February 27, 2017

Backward Propagation Algorithm in a Neural Network

This post was inspired by the Coursera course on Machine Learning, in particular by the lesson in week 5. In that lesson the backward propagation algorithm is presented, but the students are asked to take the formulae on faith; a derivation is not given. To rectify that, I present here a derivation of the main equations.

Suppose we have an input vector \( \underline{x} \) and we wish to make a prediction for the resultant vector \(\underline{y}\). We start by defining activation level zero to be: \[ \underline{a}^0 = \underline{x} \] We define \[ z^{k+1}_i = \sum_{j}\theta^{k}_{ij} a^k_j \hspace{24 mm} (eqn 1) \] where \(\theta^k\) is the matrix of weights connecting layer k to layer k+1, and we use the sigmoid function S(z) to evaluate: \[ a^{k+1}_i= S(z^{k+1}_i) \hspace{24 mm} (eqn 2) \] Once we have \(\underline{a}^0\), we can use equations 1 and 2 to evaluate \(\underline{a}^1\).
We then use equations 1 and 2 again to evaluate \(\underline{a}^2\), and so on up to \(\underline{a}^L\).

Our forecast for \(\underline{y}\) is activation level L: \(\underline{a}^L\).
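The forward pass above can be sketched in a few lines of NumPy. This is a minimal illustration rather than code from the course: the function and variable names are my own, I index the weight matrices so that thetas[k] maps layer k to layer k+1, and there are no bias terms (the equations above don't have them either).

```python
import numpy as np

def sigmoid(z):
    """The sigmoid function S(z)."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, thetas):
    """Run the forward pass: thetas[k] is the weight matrix from
    layer k to layer k+1. Returns the lists of z vectors and of
    activation vectors (with a^0 = x as the first activation)."""
    zs, acts = [], [x]           # a^0 = x
    for theta in thetas:
        z = theta @ acts[-1]     # eqn 1: z_i = sum_j theta_ij a_j
        zs.append(z)
        acts.append(sigmoid(z))  # eqn 2: a = S(z)
    return zs, acts
```

The final entry of the activations list, acts[-1], is the forecast \(\underline{a}^L\).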

Suppose we are given a training set of T input vectors \(\underline{x}^t\)
and a corresponding set of resultant vectors \(\underline{y}^t\) where \(1 \leq t \leq T \).
For each input vector \(\underline{x}^t\) we will have a corresponding activation level L: \(\underline{a}^{Lt}\).
We can measure the discrepancy between \(\underline{y}^t\) and \(\underline{a}^{Lt}\); we'll call this D. For example, we could take: \[ D(\underline{y}^t , \underline{a}^{Lt} )= \parallel \underline{y}^t - \underline{a}^{Lt} \parallel^2 \] For the rest of this post we won't refer again to the right hand side of that equation, so if the discrepancy function D were defined differently it would make no difference to the formulae below.

We sum up all the discrepancies and call the result the cost function: \[ J = \frac{1}{2T} \sum_t D(\underline{y}^t , \underline{a}^{Lt} ) \] We are interested in finding the \(\theta\)s which minimise this cost function and so give us accurate predictions.
One approach is to find the partial derivatives \[ \frac{\partial J}{\partial \theta^k_{ij}} \] for all i, j, k and then use a gradient descent method to minimise J. \[ \frac{\partial J}{\partial \theta^k_{ij}} = \frac{1}{2T} \sum_t \frac{\partial D(\underline{y}^t , \underline{a}^{Lt})}{\partial \theta^k_{ij}} \hspace{24 mm} (eqn 3) \] We are now going to focus on the term: \[ \frac{\partial D(\underline{y}^t , \underline{a}^{Lt})}{\partial \theta^k_{ij}} \] For simplicity we are going to drop the t superscript and write: \[ \frac{\partial D(\underline{y} , \underline{a}^L)}{\partial \theta^k_{ij}} \] where the t is implied. In the end, when we want to evaluate the partial derivatives of J, we will need to perform the sum over t.

We define \[ \delta^k_i = \frac{\partial D}{\partial z^k_i} \hspace{24 mm} (eqn 4) \] and when we apply the chain rule to the right hand side we get \[ \delta^k_i = \sum_j \frac{\partial D}{\partial z^{k+1}_j} \frac {\partial z^{k+1}_j}{\partial z^k_i} \] Substituting in \( \delta^{k+1}_j\), we find: \[ \delta^k_i = \sum_j \delta^{k+1}_j \frac {\partial z^{k+1}_j}{\partial z^k_i} \hspace{24 mm} (eqn 5) \] Now we want to evaluate the last term \[ \frac {\partial z^{k+1}_j}{\partial z^k_i} \] To do that we first combine equations 1 and 2 to obtain: \[ z^{k+1}_j = \sum_l \theta^k_{jl} S(z^k_l) \] and so \[ \frac {\partial z^{k+1}_j}{\partial z^k_i} = \sum_l \theta^k_{jl} S'(z^k_l) I_{li} \hspace{24 mm} (eqn 6) \] where \[ S'(z) = \frac {d S(z) }{dz} \] and \[ I_{li} = \begin{cases} 1 & \text{when } l = i \\ 0 & \text{otherwise} \end{cases} \] So, returning to equation 6, we find only one term in the sum survives, namely the one with l = i.
Hence eqn 6 becomes: \[ \frac {\partial z^{k+1}_j}{\partial z^k_i} = \theta^k_{ji} S'(z^k_i) \hspace{24 mm} (eqn 7) \] When we substitute that into equation 5 we find \[ \delta^k_i = \sum_j \delta^{k+1}_j \theta^k_{ji} S'(z^k_i) \hspace{24 mm} (eqn 8) \] It follows from equation 1 that: \[ \frac{\partial z^{k+1}_i}{\partial \theta^k_{jl}} = a^k_l I_{ij} \hspace{24 mm} (eqn 9) \] Using the chain rule we find: \[ \frac{\partial D}{\partial \theta^k_{ij}} = \sum_l \frac{\partial D}{\partial z^{k+1}_l} \frac{\partial z^{k+1}_l}{\partial \theta^k_{ij}} \] Substituting in equation 9, only the l = i element of the sum survives and we obtain: \[ \frac{\partial D}{\partial \theta^k_{ij}} = \delta^{k+1}_i a^k_j \hspace{24 mm} (eqn 10) \] So we can go forward (increasing k's) using equations 1 and 2 to evaluate \( a^k_i\)
and then, using equation 8, go backwards (decreasing k's) to evaluate \( \delta^k_i \), starting from \( \delta^L_i = \frac{\partial D}{\partial z^L_i} = \frac{\partial D}{\partial a^L_i} S'(z^L_i) \), which follows from equation 2.
We will then be able to work out all the partial derivatives of D with respect to \( \theta\).
Remember there is an implied superscript t in equation 10, and to work out the partial derivatives of J we will need to perform the sum over all t's as shown in equation 3.
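Putting the forward and backward passes together, here is a sketch of the gradient of D for a single training example (the implied superscript t). Two assumptions beyond the derivation itself, both flagged in the comments: D is the squared-error discrepancy, so \( \partial D / \partial a^L_i = -2(y_i - a^L_i) \), and S is the sigmoid, for which the standard identity \( S'(z) = S(z)(1-S(z)) \) lets us get S' from the stored activations. The function name is my own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_single(x, y, thetas):
    """Partial derivatives of D(y, a^L) = ||y - a^L||^2 with respect
    to each theta^k, for one training example."""
    # Forward pass (eqns 1 and 2), storing the activations.
    acts = [x]                              # a^0 = x
    for theta in thetas:
        acts.append(sigmoid(theta @ acts[-1]))
    # Output delta: delta^L_i = dD/da^L_i * S'(z^L_i), with
    # dD/da^L_i = -2(y_i - a^L_i) for squared error, and
    # S'(z) = S(z)(1 - S(z)) for the sigmoid.
    delta = -2.0 * (y - acts[-1]) * acts[-1] * (1 - acts[-1])
    grads = [None] * len(thetas)
    for k in range(len(thetas) - 1, -1, -1):
        # eqn 10: dD/dtheta^k_{ij} = delta^{k+1}_i a^k_j
        grads[k] = np.outer(delta, acts[k])
        if k > 0:
            # eqn 8: delta^k_i = sum_j delta^{k+1}_j theta^k_{ji} S'(z^k_i)
            delta = (thetas[k].T @ delta) * acts[k] * (1 - acts[k])
    return grads
```

Averaging grads over all T examples (with the 1/(2T) factor) gives the partial derivatives of J from equation 3.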
After that, finding the optimal \( \theta \) 's using gradient descent will be as easy as sliding down a hill.
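For completeness, that final descent down the hill might be sketched like this. Here grad_J is a hypothetical callable returning the list of matrices \( \partial J / \partial \theta^k \), computed as described above; the function name and the learning-rate parameter alpha are my own choices.

```python
import numpy as np

def gradient_descent(thetas, grad_J, alpha=0.4, steps=100):
    """Repeatedly step each theta^k against its gradient:
    theta^k <- theta^k - alpha * dJ/dtheta^k."""
    for _ in range(steps):
        grads = grad_J(thetas)
        thetas = [th - alpha * g for th, g in zip(thetas, grads)]
    return thetas
```

In practice one would also monitor J between steps and tune alpha, but this is the whole idea.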