In this week you will:

• learn different optimization methods such as (Stochastic) Gradient Descent, Momentum, RMSProp and Adam.
• Use random minibatches to accelerate the convergence and improve the optimization.
• Know the benefits of learning rate decay and apply it to your optimization.

Recall the update rule for gradient descent:

for $l$ in $L$:

$W^{[l]} = W^{[l]} - \alpha \text{ } dW^{[l]} \tag{1}$ $b^{[l]} = b^{[l]} - \alpha \text{ } db^{[l]} \tag{2}$

Where:
$W: \text{ Weights}$
$b: \text{ biases}$
$\alpha: \text{ learning rate}$
$l: \text{ layer number}$

• We update parameter every m example.
• We take gradient steps with respect to all $m$ examples on each step.
• Equivalent to mini-batch gradient descent but with mini-batch size of $m$.

• We compute gradients on one training example.
• We update parameter every 1 example.
• Equivalent to mini-batch gradient descent but with mini-batch size of $1$.

• we update parameter every subset of examples.
• mini-batch size is between $1$ and $m$.
• It’s better to choose the mini-batch size to be powers of 2.
• it’s build in two steps:
• Shuffling (To ensures that examples will be split randomly into different mini-batches).
• Partition
• Note that the number of training examples is not always divisible by mini_batch_size.

Extra: Here is a comparison that combines the three algorithms.

• Using momentum can reduce the oscilation of mini-batch path toward convergence.
• Based on exponentially weigted averages.
• Takes into account past gradients to smooth out the update path.
• Direction of previous gradients is stored in $v$ which can be seen as the velocity of a ball rolling down the pow.
• Velocities are initialized with zeros, that’s why the algorithms takes some time (few iterations) to warm up.
• If $\beta = 0$ this will be the standard gradient descent.
• beta is another hyperparamter that could be tuned, to reduce the value of cost funcion $J$.
• The larger beta is the smother the path toward converence get, because more past gradient is taken into account.
• Common values for $\beta$ is between 0.8 and 0.999.
• 0.9 is often resonable choice if you don’t want to tune it.

Momentum update rule:

for $l$ in $L$:

$\begin{cases} v_{dW^{[l]}} = \beta v_{dW^{[l]}} + (1 - \beta) dW^{[l]} \\ W^{[l]} = W^{[l]} - \alpha v_{dW^{[l]}} \end{cases}\tag{3}$ $\begin{cases} v_{db^{[l]}} = \beta v_{db^{[l]}} + (1 - \beta) db^{[l]} \\ b^{[l]} = b^{[l]} - \alpha v_{db^{[l]}} \end{cases}\tag{4}$

Where:
$\beta: \text{ momentum}$
$v: \text{ velocity}$

## RMSProp

Surprisingly RMSProp was not introduced in some reasearch paper but a course in Coursera by Geoff Hinton!

$s_{dW^{[l]}} = \beta_2 s_{dW^{[l]}} + (1 - \beta_2)(dW^{[l]})^2$ $W^{[l]} = W^{[l]} - \alpha \frac{dW^{[l]}}{\sqrt{s_{dW^{[l]}} + \varepsilon}}$
• $\varepsilon$ is to avoid the division by zero.
• RMSProp stands for root mean squared prop.

• Combines the advantages of RMSProb and Momentum
• We have two variables $v,s$ that keeps track of the past information.

How does it work?

Simply it:

• Calculates an exponentially weighted average of past gradients, and stores it in variables v (before bias correction) and vcorrected (with bias correction).
• Calculates an exponentially weighted average of the squares of the past gradients, and stores it in variables s (before bias correction) and scorrected (with bias correction).
• Updates parameters in a direction based on combining information from “1” and “2”.

The update rule is:

$\text{for } l = 1, …, L:$

$\begin{cases} v_{dW^{[l]}} = \beta_1 v_{dW^{[l]}} + (1 - \beta_1) \frac{\partial \mathcal{J} }{ \partial W^{[l]} } \\ v^{corrected}_{dW^{[l]}} = \frac{v_{dW^{[l]}}}{1 - (\beta_1)^t} \\ s_{dW^{[l]}} = \beta_2 s_{dW^{[l]}} + (1 - \beta_2) (\frac{\partial \mathcal{J} }{\partial W^{[l]} })^2 \\ s^{corrected}_{dW^{[l]}} = \frac{s_{dW^{[l]}}}{1 - (\beta_1)^t} \\ W^{[l]} = W^{[l]} - \alpha \frac{v^{corrected}_{dW^{[l]}}}{\sqrt{s^{corrected}_{dW^{[l]}}} + \varepsilon} \end{cases}$

Where:

• $t$ counts the number of steps taken of Adam
• L is the number of layers
• $\beta_1$ and $\beta_2$ are hyperparameters that control the two exponentially weighted averages.
• $\varepsilon$ is a very small number to avoid dividing by zero often $10^{-8}$

## Learning rate decay

Sometimes you need decrease the learning rate especially when you are about to reach convergence.

one way is:

$\alpha = \frac{1}{1+decay-rate * epoch-num} \alpha_0$

Where

• $\alpha_0$: The initial learning rate.
• decay-rate is another hyperparamter you need to tune.

Other learning rate decay methods:

• $\alpha = 0.95^{epoch-num} \alpha_0$ (exponintially decay)
• $\alpha = \frac{k}{\sqrt{epoch-num}} \alpha_0$
• And others ..

Notes:

• In linear algebra terms, m[:,n] is taking the nth column vector of the matrix m so taking the first 3 examples in input data X, X[:, 0 : 3]. (slicing). (important for programming assignment)
• Different optimization algorithms can leads to different accuracy so it’s important to always choose a good one.
• One epoch passes through all the $m$ training examples whether you use batch, stochastic or mini-batch.
• Inside an epoch (iteration) you loop m times, 1 times or t times for stochastic, batch or mini-batches respectivaly.