## Step 1: Initialize parameters

• The bias vectors are initialized with zeros.
• The weight matrices are initialized with small random values:
   np.random.randn(d1, d2) * 0.01


The constant 0.01 can also be treated as a hyperparameter and tuned later, although this is not common. Choosing a large constant slows down gradient descent for saturating activation functions such as tanh(z): large weights push z to very negative or very positive values, where the activation is almost flat, so the gradients are close to zero and learning slows down.
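
The initialization above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name `initialize_parameters`, the `layer_dims` list, the `scale` and `seed` arguments, and the `W1`/`b1` key naming are all assumptions made for this example. The helper `tanh_gradient` is also hypothetical and only shows why large values of z slow learning.

```python
import numpy as np

def initialize_parameters(layer_dims, scale=0.01, seed=0):
    """Return zero biases and small random weights for each layer.

    layer_dims[0] is the input size; layer_dims[l] is the number of units in layer l.
    (Hypothetical helper, not taken from the original text.)
    """
    np.random.seed(seed)  # fixed seed only to make the sketch reproducible
    parameters = {}
    for l in range(1, len(layer_dims)):
        # W has shape (units in layer l, units in layer l-1)
        parameters[f"W{l}"] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * scale
        parameters[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return parameters

def tanh_gradient(z):
    """Derivative of tanh(z), which is 1 - tanh(z)^2."""
    return 1.0 - np.tanh(z) ** 2

params = initialize_parameters([4, 3, 1])
print({name: value.shape for name, value in params.items()})

# The tanh gradient is largest near z = 0 and vanishes for large |z|.
print(tanh_gradient(0.01), tanh_gradient(10.0))
```

With small weights, z stays near zero where the tanh gradient is close to 1; at z = 10 the gradient is already many orders of magnitude smaller.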

## Step 2: Implement forward propagation

• Compute $Z^{[l]}$ and $A^{[l]}$ for all layers
• Linear forward (compute $Z$)
• Activation forward (compute $A$)
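
The two sub-steps can be sketched like this. The function names, the `parameters` dictionary layout, and the choice of ReLU for hidden layers with a sigmoid output layer are assumptions made for this example; the text above does not fix a particular activation.

```python
import numpy as np

def linear_forward(A_prev, W, b):
    """Linear step: Z = W A_prev + b. The inputs are cached for the backward pass."""
    Z = W @ A_prev + b
    return Z, (A_prev, W, b)

def activation_forward(Z, activation):
    """Activation step: A = g(Z), with g either "relu" or "sigmoid"."""
    if activation == "relu":
        return np.maximum(0, Z)
    if activation == "sigmoid":
        return 1.0 / (1.0 + np.exp(-Z))
    raise ValueError(f"unknown activation: {activation}")

def forward_propagation(X, parameters):
    """Run linear + activation for every layer and collect the caches."""
    caches = []
    A = X  # the input is treated as the activation of layer 0
    L = len(parameters) // 2  # each layer contributes one W and one b
    for l in range(1, L + 1):
        Z, linear_cache = linear_forward(A, parameters[f"W{l}"], parameters[f"b{l}"])
        # Assumption: ReLU in hidden layers, sigmoid in the output layer
        A = activation_forward(Z, "sigmoid" if l == L else "relu")
        caches.append((linear_cache, Z))
    return A, caches

np.random.seed(1)
parameters = {
    "W1": np.random.randn(3, 2) * 0.01, "b1": np.zeros((3, 1)),
    "W2": np.random.randn(1, 3) * 0.01, "b2": np.zeros((1, 1)),
}
X = np.random.randn(2, 5)  # 2 features, 5 examples stored as columns
AL, caches = forward_propagation(X, parameters)
print(AL.shape, len(caches))
```

Each cache stores what the matching backward step needs: the inputs to the linear step and the pre-activation value Z.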

## Step 3: Compute cost function ($J$)

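• The cost value itself is not used in the backward propagation computations, but it’s very useful for debugging and visualization (for example, checking that the cost decreases across iterations).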
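
As an illustration, here is a sketch of one common choice of cost. It assumes binary classification with the cross-entropy cost, which the text above does not specify; the function name `compute_cost` and the clipping constant `eps` are also assumptions for this example.

```python
import numpy as np

def compute_cost(AL, Y, eps=1e-12):
    """Binary cross-entropy averaged over the m examples (columns).

    AL holds predicted probabilities, Y holds 0/1 labels, both of shape (1, m).
    """
    m = Y.shape[1]
    AL = np.clip(AL, eps, 1 - eps)  # avoid log(0)
    cost = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m
    return float(cost)

AL = np.array([[0.9, 0.2, 0.6]])
Y = np.array([[1, 0, 1]])
cost = compute_cost(AL, Y)
print(cost)  # roughly 0.28
```

Logging this value every few iterations is a cheap way to spot a learning rate that is too large (the cost oscillates or grows) or a bug in the gradients (the cost does not decrease).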

## Step 4: Implement backward propagation

• Use cached values from forward propagation to compute the gradients
• Linear backward (compute dW, db, dA_prev)
• Activation backward (compute dZ)
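
A minimal sketch of the two backward sub-steps follows. It assumes the cost is averaged over the m examples (hence the division by m), that the linear cache holds `(A_prev, W, b)`, and that the activations are ReLU and sigmoid; the function names are hypothetical.

```python
import numpy as np

def linear_backward(dZ, linear_cache):
    """Given dZ for a layer, compute dW, db and dA_prev from the cached inputs."""
    A_prev, W, b = linear_cache
    m = A_prev.shape[1]  # number of examples (columns)
    dW = dZ @ A_prev.T / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = W.T @ dZ
    return dA_prev, dW, db

def relu_backward(dA, Z):
    """dZ = dA * g'(Z) for ReLU: the gradient passes only where Z > 0."""
    return dA * (Z > 0)

def sigmoid_backward(dA, Z):
    """dZ = dA * g'(Z) for sigmoid, where g'(Z) = s * (1 - s)."""
    s = 1.0 / (1.0 + np.exp(-Z))
    return dA * s * (1 - s)

np.random.seed(2)
A_prev = np.random.randn(3, 4)  # 3 units in the previous layer, 4 examples
W = np.random.randn(2, 3)
b = np.zeros((2, 1))
Z = W @ A_prev + b              # value cached during forward propagation
dA = np.random.randn(2, 4)      # gradient arriving from the next layer

dZ = relu_backward(dA, Z)
dA_prev, dW, db = linear_backward(dZ, (A_prev, W, b))
print(dA_prev.shape, dW.shape, db.shape)
```

Each gradient has the same shape as the quantity it refers to, which is a quick sanity check when debugging.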

## Step 5: Update parameters

• Update the parameters using the gradients computed in step 4.
  theta = theta - alpha * dtheta


Here alpha is the learning rate and dtheta is the derivative of the cost function J with respect to theta.
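
Applied to every weight matrix and bias vector, the update rule might look like the sketch below. The dictionary layout (`W1`, `b1`, `dW1`, `db1`) and the function name `update_parameters` are assumptions made for this example.

```python
import numpy as np

def update_parameters(parameters, grads, learning_rate):
    """One gradient descent step: theta = theta - alpha * dtheta for every W and b."""
    L = len(parameters) // 2  # each layer contributes one W and one b
    updated = {}
    for l in range(1, L + 1):
        updated[f"W{l}"] = parameters[f"W{l}"] - learning_rate * grads[f"dW{l}"]
        updated[f"b{l}"] = parameters[f"b{l}"] - learning_rate * grads[f"db{l}"]
    return updated

parameters = {"W1": np.array([[1.0, 2.0]]), "b1": np.array([[0.0]])}
grads = {"dW1": np.array([[0.5, -0.5]]), "db1": np.array([[1.0]])}
new_parameters = update_parameters(parameters, grads, learning_rate=0.1)
print(new_parameters["W1"], new_parameters["b1"])
```

The sketch returns a new dictionary instead of modifying the input in place, which makes it easier to compare parameters before and after a step while debugging.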

## Step 7: Predict
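
For a binary classifier with a sigmoid output layer, prediction usually amounts to running forward propagation with the learned parameters and thresholding the output probabilities. The sketch below covers only the thresholding part; the 0.5 threshold, the function name, and the sample values are assumptions made for illustration.

```python
import numpy as np

def predict_from_probabilities(AL, threshold=0.5):
    """Turn output probabilities of shape (1, m) into 0/1 class predictions."""
    return (AL > threshold).astype(int)

AL = np.array([[0.9, 0.2, 0.6]])  # made-up output of forward propagation
Y = np.array([[1, 0, 0]])         # made-up true labels
predictions = predict_from_probabilities(AL)
accuracy = float(np.mean(predictions == Y))
print(predictions, accuracy)
```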

Notes:

• The input layer is not counted in the total number of NN layers
• The input layer is layer zero ($l$ = 0)
• $A^{[0]} = X$
• $n^{[0]} = n_x$
• $A^{[L]} = \widehat{Y}$
• A 1-layer NN is actually logistic regression (a shallow NN)
• An NN with more than 2 layers is called a deep NN
• In deep learning, the “[LINEAR->ACTIVATION]” (compute the forward linear step followed by forward activation step) computation is counted as a single layer in the neural network, not two layers.
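
The layer-counting convention can be illustrated with a short sketch. The `layer_dims` list and its values are hypothetical and chosen only for this example.

```python
import numpy as np

# Hypothetical network: 5 input features, two hidden layers, one output unit.
layer_dims = [5, 4, 3, 1]

n_x = layer_dims[0]      # size of layer 0 (the input layer)
L = len(layer_dims) - 1  # the input layer is not counted
print(L)

X = np.zeros((n_x, 10))  # 10 examples stored as columns
A0 = X                   # the activation of layer 0 is the input itself
print(A0.shape)
```

Each of the L layers here consists of one linear step followed by one activation step, so this network has 3 layers, not 6.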
