In this week you will:

  • learn different optimization methods such as (Stochastic) Gradient Descent, Momentum, RMSProp and Adam.
  • Use random minibatches to accelerate the convergence and improve the optimization.
  • Know the benefits of learning rate decay and apply it to your optimization.

Gradient Descent

Recall the update rule for gradient descent:

for \(l\) in \(L\):

Where:
\(W: \text{ Weights} \)
\( b: \text{ biases} \)
\( \alpha: \text{ learning rate} \)
\( l: \text{ layer number} \)

Batch Gradient Descent

  • We update parameter every m example.
  • We take gradient steps with respect to all \(m\) examples on each step.
  • Equivalent to mini-batch gradient descent but with mini-batch size of \(m\).

Stochastic Gradient Descent

  • We compute gradients on one training example.
  • We update parameter every 1 example.
  • Equivalent to mini-batch gradient descent but with mini-batch size of \(1\).
SGD vs Batch

Mini-batch Gradient Descent

  • we update parameter every subset of examples.
  • mini-batch size is between \(1\) and \(m\).
  • It’s better to choose the mini-batch size to be powers of 2.
  • it’s build in two steps:
    • Shuffling (To ensures that examples will be split randomly into different mini-batches).
    • Partition
  • Note that the number of training examples is not always divisible by mini_batch_size.
SGD vs Mini-Batch

Extra: Here is a comparison that combines the three algorithms.

"From Stackexchange"

Gradient descent with momentum

  • Using momentum can reduce the oscilation of mini-batch path toward convergence.
  • Based on exponentially weigted averages.
  • Takes into account past gradients to smooth out the update path.
  • Direction of previous gradients is stored in \(v\) which can be seen as the velocity of a ball rolling down the pow.
  • Velocities are initialized with zeros, that’s why the algorithms takes some time (few iterations) to warm up.
  • If \(\beta = 0 \) this will be the standard gradient descent.
  • beta is another hyperparamter that could be tuned, to reduce the value of cost funcion \(J\).
  • The larger beta is the smother the path toward converence get, because more past gradient is taken into account.
  • Common values for \(\beta\) is between 0.8 and 0.999.
  • 0.9 is often resonable choice if you don’t want to tune it.
  • Momentum can be applied with batch gradient descent, mini-batch gradient descent or stochastic gradient descent.
"The red arrows shows the direction taken by one step of mini-batch gradient descent with momentum. The blue points show the direction of the gradient (with respect to the current mini-batch) on each step. Rather than just following the gradient, we let the gradient influence v and then take a step in the direction of v."

Momentum update rule:

for \(l\) in \(L\):

Where:
\(\beta: \text{ momentum} \)
\(v: \text{ velocity} \)

RMSProp

Surprisingly RMSProp was not introduced in some reasearch paper but a course in Coursera by Geoff Hinton!

  • \(\varepsilon\) is to avoid the division by zero.
  • RMSProp stands for root mean squared prop.

Adam optimization algorithm

  • Combines the advantages of RMSProb and Momentum
  • We have two variables \(v,s\) that keeps track of the past information.

How does it work?

Simply it:

  • Calculates an exponentially weighted average of past gradients, and stores it in variables v (before bias correction) and vcorrected (with bias correction).
  • Calculates an exponentially weighted average of the squares of the past gradients, and stores it in variables s (before bias correction) and scorrected (with bias correction).
  • Updates parameters in a direction based on combining information from “1” and “2”.

The update rule is:

\(\text{for } l = 1, …, L:\)

Where:

  • \(t\) counts the number of steps taken of Adam
  • L is the number of layers
  • \(\beta_1\) and \(\beta_2\) are hyperparameters that control the two exponentially weighted averages.
  • \(\varepsilon\) is a very small number to avoid dividing by zero often \(10^{-8}\)

Learning rate decay

Sometimes you need decrease the learning rate especially when you are about to reach convergence.

one way is:

Where

  • \(\alpha_0 \): The initial learning rate.
  • decay-rate is another hyperparamter you need to tune.

Other learning rate decay methods:

  • \( \alpha = 0.95^{epoch-num} \alpha_0 \) (exponintially decay)
  • \( \alpha = \frac{k}{\sqrt{epoch-num}} \alpha_0 \)
  • And others ..

Notes:

  • In linear algebra terms, m[:,n] is taking the nth column vector of the matrix m so taking the first 3 examples in input data X, X[:, 0 : 3]. (slicing). (important for programming assignment)
  • Different optimization algorithms can leads to different accuracy so it’s important to always choose a good one.
  • One epoch passes through all the \(m\) training examples whether you use batch, stochastic or mini-batch.
  • Inside an epoch (iteration) you loop m times, 1 times or t times for stochastic, batch or mini-batches respectivaly.
  • Very useful article: http://ruder.io/optimizing-gradient-descent/index.html

Resources: Deep Learning Specialization on Coursera, by Andrew Ng

Leave a Comment