The math of neural networks | Marshall Shen

The math of neural networks

Building neural networks is at the heart of any deep learning technique. A neural network is trained through a series of forward and backward propagations that adjust the parameters of the model, and its basic building block is the logistic regression classifier. This post builds on the math of logistic regression to describe more advanced neural networks in mathematical terms.

A neural network is composed of three types of layers: one input layer, one output layer, and one or more hidden layers. Each layer has the same structure as a logistic regression classifier: a linear transformation followed by an activation function. Given a fixed input layer and output layer, we can build a more complex neural network by adding more hidden layers.

Before diving into the details of the mathematical model, we need to have a big picture of the computation. To quote from class:

the general methodology to build a Neural Network is to:

  1. Define the neural network structure (number of input units, number of hidden units, etc.)
  2. Initialize the model’s parameters
  3. Loop:
    • Implement forward propagation
    • Compute loss
    • Implement backward propagation to get the gradients
    • Update parameters (gradient descent)

To make the math easier to follow, we take an incremental approach: first we analyze a 2-layer neural network, then we generalize to an L-layer neural network.

Two-layer neural network

Consider the following hypothetical scenario: we have two nodes, $x_{1}$ and $x_{2}$, in the input layer, four nodes in the hidden layer, and one node, $y$, in the output layer. Converting the graph below into mathematical terms, we have:


The following parameters specify the 2-layer neural network:

  1. Input layer $X$ of shape $(2, 1)$, with its weight $W^{[1]}$ and bias $b^{[1]}$
  2. Output layer $Y$ of shape $(1, 1)$, with its weight $W^{[2]}$ and bias $b^{[2]}$
  3. Hidden layer $A$ of shape $(4, 1)$
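As a rough sketch of these shapes in NumPy: the helper name `init_params`, the 0.01 scaling factor, and the random seed are illustrative assumptions, not something the original post specifies.

```python
import numpy as np

def init_params(n_x=2, n_h=4, n_y=1, seed=0):
    """Hypothetical helper: small random weights and zero biases."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.standard_normal((n_h, n_x)) * 0.01,  # (4, 2): input -> hidden
        "b1": np.zeros((n_h, 1)),                      # (4, 1)
        "W2": rng.standard_normal((n_y, n_h)) * 0.01,  # (1, 4): hidden -> output
        "b2": np.zeros((n_y, 1)),                      # (1, 1)
    }

params = init_params()
for name, value in params.items():
    print(name, value.shape)
```

Each weight matrix has one row per unit in the layer it feeds into and one column per unit in the layer it reads from, which is what makes the matrix products in forward propagation line up.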

To perform forward propagation, we have the following calculation:

  • $z^{[1]} = W^{[1]} x^{(i)} + b^{[1]}$
  • $a^{[1]} = \tanh(z^{[1]})$
  • $z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$
  • $\hat{y}^{(i)} = a^{[2]} = \sigma(z^{[2]})$
  • If $a^{[2]} > 0.5$ then $\hat{y}^{(i)} = 1$, otherwise $\hat{y}^{(i)} = 0$.
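The forward pass above can be sketched in NumPy for a single example. The function name `forward`, the sample input, and the random initial weights are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    z1 = W1 @ x + b1                # linear step of the hidden layer
    a1 = np.tanh(z1)                # hidden activation
    z2 = W2 @ a1 + b2               # linear step of the output layer
    a2 = sigmoid(z2)                # predicted probability
    y_hat = (a2 > 0.5).astype(int)  # threshold at 0.5
    return z1, a1, z2, a2, y_hat

# Illustrative input and parameters (not from the original post).
rng = np.random.default_rng(0)
x = np.array([[0.5], [-1.2]])
W1, b1 = rng.standard_normal((4, 2)) * 0.01, np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)) * 0.01, np.zeros((1, 1))

z1, a1, z2, a2, y_hat = forward(x, W1, b1, W2, b2)
print(a1.shape, a2.shape, y_hat.shape)
```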

Given that we have computed $A^{[2]}$, which contains $a^{[2](i)}$ for every example, we can compute the cost function as follows:

  • $J = -\frac{1}{m} \sum\limits_{i = 1}^{m} \left( y^{(i)} \log\left(a^{[2](i)}\right) + (1 - y^{(i)}) \log\left(1 - a^{[2](i)}\right) \right)$
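A minimal sketch of this cost in NumPy, assuming the activations and labels are stored as row vectors of shape `(1, m)`. The function name `compute_cost` and the sample values are made up for illustration.

```python
import numpy as np

def compute_cost(A2, Y):
    """Cross-entropy cost averaged over the m examples (columns)."""
    m = Y.shape[1]
    losses = Y * np.log(A2) + (1 - Y) * np.log(1 - A2)
    return float(-np.sum(losses) / m)

A2 = np.array([[0.9, 0.2, 0.6]])  # predicted probabilities
Y = np.array([[1, 0, 1]])         # true labels
cost = compute_cost(A2, Y)
print(round(cost, 4))  # 0.2798
```

The cost shrinks toward zero as the predicted probabilities move toward the true labels, which is what makes it a useful signal for training.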

Given the loss function, we implement backward propagation, working from $z^{[2]}$ back to $z^{[1]}$:

  • $dz^{[2]} = a^{[2]} - y$
  • $dW^{[2]} = dz^{[2]} (a^{[1]})^{T}$
  • $db^{[2]} = dz^{[2]}$
  • $dz^{[1]} = (W^{[2]})^{T} dz^{[2]} * g^{[1]'}(z^{[1]})$
  • $dW^{[1]} = dz^{[1]} x^{T}$
  • $db^{[1]} = dz^{[1]}$

Here $*$ denotes the element-wise product, and $g^{[1]'}$ is the derivative of the hidden-layer activation, which is $\tanh$ in this network.
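The backward pass can be sketched for a single example as follows. The sketch assumes the hidden activation is $\tanh$, whose derivative is $1 - \tanh^2(z)$; the function names and sample values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    a1 = np.tanh(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    return a1, a2

def backward(x, y, a1, a2, W2):
    dz2 = a2 - y                        # (1, 1)
    dW2 = dz2 @ a1.T                    # (1, 4)
    db2 = dz2
    dz1 = (W2.T @ dz2) * (1 - a1 ** 2)  # tanh'(z1) = 1 - a1^2, element-wise
    dW1 = dz1 @ x.T                     # (4, 2)
    db1 = dz1
    return dW1, db1, dW2, db2

# Illustrative example and parameters (not from the original post).
rng = np.random.default_rng(1)
x, y = np.array([[0.5], [-1.2]]), np.array([[1.0]])
W1, b1 = rng.standard_normal((4, 2)), np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)), np.zeros((1, 1))

a1, a2 = forward(x, W1, b1, W2, b2)
dW1, db1, dW2, db2 = backward(x, y, a1, a2, W2)
print(dW1.shape, db1.shape, dW2.shape, db2.shape)
```

Each gradient has the same shape as the parameter it belongs to, which is a quick sanity check when implementing these formulas.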

Then we use gradient descent to update $W^{[1]}$, $b^{[1]}$ and $W^{[2]}$, $b^{[2]}$, with a specified learning rate $\alpha$:

  • $W^{[1]} = W^{[1]} - \alpha * dW^{[1]}$
  • $b^{[1]} = b^{[1]} - \alpha * db^{[1]}$
  • $W^{[2]} = W^{[2]} - \alpha * dW^{[2]}$
  • $b^{[2]} = b^{[2]} - \alpha * db^{[2]}$

After one iteration of the loop finishes, we run the model again on the training set, and we expect the value of the loss function to decrease.
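Putting the four steps together, the whole loop might look like the sketch below. It is vectorized over $m$ examples (so the gradients are averaged with a factor of $1/m$), and the toy dataset, learning rate, initialization scale, and iteration count are all assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_two_layer(X, Y, n_h=4, alpha=0.3, iterations=500, seed=0):
    rng = np.random.default_rng(seed)
    n_x, m = X.shape
    W1, b1 = rng.standard_normal((n_h, n_x)) * 0.5, np.zeros((n_h, 1))
    W2, b2 = rng.standard_normal((1, n_h)) * 0.5, np.zeros((1, 1))
    costs = []
    for _ in range(iterations):
        # Forward propagation
        A1 = np.tanh(W1 @ X + b1)
        A2 = sigmoid(W2 @ A1 + b2)
        # Cost
        costs.append(float(-np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))))
        # Backward propagation (averaged over the m examples)
        dZ2 = A2 - Y
        dW2 = dZ2 @ A1.T / m
        db2 = np.sum(dZ2, axis=1, keepdims=True) / m
        dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
        dW1 = dZ1 @ X.T / m
        db1 = np.sum(dZ1, axis=1, keepdims=True) / m
        # Gradient descent update
        W1, b1 = W1 - alpha * dW1, b1 - alpha * db1
        W2, b2 = W2 - alpha * dW2, b2 - alpha * db2
    return (W1, b1, W2, b2), costs

# Toy data: the label is 1 when x1 + x2 > 0 (illustrative only).
rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, size=(2, 40))
Y = (X[0:1] + X[1:2] > 0).astype(float)

params, costs = train_two_layer(X, Y)
print(f"cost: {costs[0]:.3f} -> {costs[-1]:.3f}")
```

Recording the cost at every iteration makes it easy to confirm that it trends downward as training proceeds.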

L-layer neural network

An L-layer neural network follows the same logical loop as the 2-layer neural network; however, the activation function for the hidden layers is different.

Rather than using $\tanh$ as the activation function, in recent years people have started using the rectified linear unit, ReLU for short. ReLU has two advantages. First, it is a non-linear function, so it provides a benefit similar to that of other non-linear functions such as $\tanh$ or sigmoid. Second, the derivative of ReLU is piecewise constant (0 for negative inputs and 1 for positive inputs), which makes the backward propagation step much faster to compute.
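As a small illustration, ReLU and its derivative can be written in NumPy as below. The derivative at exactly zero is undefined; returning 0 there is a common convention and an assumption of this sketch.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_grad(z):
    # 1 where z > 0, otherwise 0 (including at z == 0, by convention)
    return (z > 0).astype(float)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))       # [0. 0. 3.]
print(relu_grad(z))  # [0. 0. 1.]
```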

In addition, we need to initialize the weight matrices, such as $W^{[1]}$, with non-zero (typically small random) values. If the weights are all zeros, every hidden unit in a layer computes the same output and receives the same gradient, so forward and backward propagation will not effectively update the parameters during each iteration, making the model ineffective.
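A quick check of what happens when every weight starts at zero, assuming one hidden layer with ReLU and a sigmoid output. The data values are made up for illustration.

```python
import numpy as np

X = np.array([[0.5, -1.0, 2.0],
              [1.5, 0.3, -0.7]])   # (2, 3): three examples
Y = np.array([[1, 0, 1]])
m = X.shape[1]

# All-zero initialization
W1, b1 = np.zeros((4, 2)), np.zeros((4, 1))
W2, b2 = np.zeros((1, 4)), np.zeros((1, 1))

# Forward propagation
Z1 = W1 @ X + b1
A1 = np.maximum(0, Z1)                       # every hidden activation is 0
A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))   # every prediction is 0.5

# Backward propagation
dZ2 = A2 - Y
dW2 = dZ2 @ A1.T / m
dZ1 = (W2.T @ dZ2) * (Z1 > 0)
dW1 = dZ1 @ X.T / m

print(np.all(dW1 == 0), np.all(dW2 == 0))  # True True
```

Both weight gradients are exactly zero, so gradient descent leaves the weights unchanged and the network never learns to use its inputs.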

L-layer propagation

Following the general pattern for building a neural network, we can specify the input parameters in mathematical terms:

  • We have $L$ layers with input layer $X$ and output layer $Y$.

The forward propagation is computed using the following equations:

  • The first activation layer: $Z^{[1]} = W^{[1]} X + b^{[1]}$, $A^{[1]} = \mathrm{ReLU}(Z^{[1]})$
  • The nth activation layer: $Z^{[n]} = W^{[n]} A^{[n-1]} + b^{[n]}$, $A^{[n]} = \mathrm{ReLU}(Z^{[n]})$
  • The last activation layer: $Z^{[L]} = W^{[L]} A^{[L-1]} + b^{[L]}$, $A^{[L]} = \mathrm{sigmoid}(Z^{[L]})$
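A sketch of this forward pass for an arbitrary number of layers follows. The parameter dictionary layout (`"W1"`, `"b1"`, ...), the helper names, and the layer sizes `[2, 4, 3, 1]` are assumptions made for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(layer_dims, seed=0):
    """layer_dims[0] is the input size; the rest are the layer sizes."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        params[f"W{l}"] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params

def forward(X, params):
    L = len(params) // 2  # one W and one b per layer
    A = X
    caches = []
    for l in range(1, L + 1):
        Z = params[f"W{l}"] @ A + params[f"b{l}"]
        caches.append((A, Z))                  # keep A^{[l-1]} and Z^{[l]} for backprop
        A = sigmoid(Z) if l == L else relu(Z)  # sigmoid only on the last layer
    return A, caches

X = np.random.default_rng(1).standard_normal((2, 5))  # 5 examples, 2 features
params = init_params([2, 4, 3, 1])
AL, caches = forward(X, params)
print(AL.shape)  # (1, 5)
```

Caching the inputs and pre-activations of each layer during the forward pass avoids recomputing them when the gradients are calculated.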

Next we want to implement the loss function to check if our model is actually learning:

$J = -\frac{1}{m} \sum\limits_{i = 1}^{m} \left( y^{(i)} \log\left(a^{[L](i)}\right) + (1 - y^{(i)}) \log\left(1 - a^{[L](i)}\right) \right)$

Then we calculate the backward propagation, which follows steps similar to forward propagation:

  • linear backward
  • linear to activation backward, where activation computes the derivative of the ReLU or sigmoid activation
  • [linear to ReLU] × (L-1) to linear to sigmoid backward (whole model)
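The activation part of the backward step multiplies the incoming gradient `dA` element-wise by the derivative of the activation at `Z`. A minimal sketch, with hypothetical function names `relu_backward` and `sigmoid_backward`:

```python
import numpy as np

def relu_backward(dA, Z):
    # ReLU'(Z) is 1 where Z > 0 and 0 elsewhere
    return dA * (Z > 0)

def sigmoid_backward(dA, Z):
    s = 1.0 / (1.0 + np.exp(-Z))
    return dA * s * (1 - s)  # sigmoid'(Z) = s * (1 - s)

Z = np.array([[-1.0, 0.5],
              [2.0, -3.0]])
dA = np.ones_like(Z)
print(relu_backward(dA, Z))
print(sigmoid_backward(dA, Z))
```

The result `dZ` is then passed to the linear backward step to obtain the gradients of the weights and biases.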

For layer $l$, the linear part is: $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$ (followed by an activation).

Suppose we have already calculated the derivative $dZ^{[l]} = \frac{\partial \mathcal{L}}{\partial Z^{[l]}}$. We want to get $(dW^{[l]}, db^{[l]}, dA^{[l-1]})$:

  • $dW^{[l]} = \frac{\partial \mathcal{L}}{\partial W^{[l]}} = \frac{1}{m} dZ^{[l]} A^{[l-1] T}$
  • $db^{[l]} = \frac{\partial \mathcal{L}}{\partial b^{[l]}} = \frac{1}{m} \sum_{i = 1}^{m} dZ^{[l](i)}$
  • $dA^{[l-1]} = \frac{\partial \mathcal{L}}{\partial A^{[l-1]}} = W^{[l] T} dZ^{[l]}$
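These three formulas translate almost directly into NumPy. In the sketch below, the function name `linear_backward` and the random test shapes are assumptions for illustration.

```python
import numpy as np

def linear_backward(dZ, A_prev, W):
    m = A_prev.shape[1]                         # number of examples
    dW = dZ @ A_prev.T / m                      # same shape as W
    db = np.sum(dZ, axis=1, keepdims=True) / m  # same shape as b
    dA_prev = W.T @ dZ                          # same shape as A_prev
    return dW, db, dA_prev

# A layer with 3 units reading from 4 units, over 5 examples.
rng = np.random.default_rng(0)
A_prev = rng.standard_normal((4, 5))
W = rng.standard_normal((3, 4))
dZ = rng.standard_normal((3, 5))

dW, db, dA_prev = linear_backward(dZ, A_prev, W)
print(dW.shape, db.shape, dA_prev.shape)  # (3, 4) (3, 1) (4, 5)
```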

Now that we have $(dW^{[l]}, db^{[l]}, dA^{[l-1]})$, we can update our parameters using gradient descent:

  • $W^{[l]} = W^{[l]} - \alpha \, dW^{[l]}$
  • $b^{[l]} = b^{[l]} - \alpha \, db^{[l]}$

As with the 2-layer neural network, after one iteration of the loop finishes, we run the model again on the training set, and we expect the value of the loss function to decrease.
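To tie the pieces together, here is one possible end-to-end sketch of the L-layer loop. The toy dataset, layer sizes, learning rate, initialization scale, and the clipping of the output before taking logarithms are all assumptions of this sketch rather than part of the original post.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(layer_dims, seed=0):
    rng = np.random.default_rng(seed)
    # params[l] = (W, b) for layer l = 1..L
    return {
        l: (rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.5,
            np.zeros((layer_dims[l], 1)))
        for l in range(1, len(layer_dims))
    }

def forward(X, params):
    A, caches, L = X, [], len(params)
    for l in range(1, L + 1):
        W, b = params[l]
        Z = W @ A + b
        caches.append((A, Z))  # (A^{[l-1]}, Z^{[l]})
        A = sigmoid(Z) if l == L else np.maximum(0, Z)
    return A, caches

def compute_cost(AL, Y):
    AL = np.clip(AL, 1e-12, 1 - 1e-12)  # assumption: guard against log(0)
    return float(-np.mean(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)))

def backward(AL, Y, params, caches):
    m, L, grads = Y.shape[1], len(params), {}
    dZ = AL - Y  # sigmoid output combined with cross-entropy loss
    for l in range(L, 0, -1):
        A_prev = caches[l - 1][0]
        W = params[l][0]
        grads[l] = (dZ @ A_prev.T / m, np.sum(dZ, axis=1, keepdims=True) / m)
        if l > 1:
            dA_prev = W.T @ dZ
            Z_prev = caches[l - 2][1]
            dZ = dA_prev * (Z_prev > 0)  # ReLU derivative
    return grads

def train(X, Y, layer_dims, alpha=0.1, iterations=300):
    params, costs = init_params(layer_dims), []
    for _ in range(iterations):
        AL, caches = forward(X, params)
        costs.append(compute_cost(AL, Y))
        grads = backward(AL, Y, params, caches)
        params = {l: (params[l][0] - alpha * grads[l][0],
                      params[l][1] - alpha * grads[l][1]) for l in params}
    return params, costs

# Toy data: the label is 1 when x1 + x2 > 0 (illustrative only).
rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, size=(2, 40))
Y = (X[0:1] + X[1:2] > 0).astype(float)

params, costs = train(X, Y, layer_dims=[2, 4, 3, 1])
print(f"cost: {costs[0]:.3f} -> {costs[-1]:.3f}")
```

The loop has the same four steps as the 2-layer case (forward propagation, cost, backward propagation, update); only the number of layers the forward and backward passes iterate over changes.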