Objectives:
 Master the process of hyperparameter tuning.
 Learn how to use TensorFlow.
Hyperparameter tuning
We have seen a lot of hyperparameters (grouped by importance of tuning):
 Group 1
 \(\alpha\) (learning rate).
 Group 2
 \(\beta\) (momentum term).
 Minibatch size.
 Number of hidden units.
 Group 3
 Number of layers.
 Learning rate decay.
 Group 4
 \(\beta_1, \beta_2, \varepsilon\) (Adam). We can use \(\beta_1 = 0.9, \beta_2 = 0.999, \varepsilon = 10^{-8}\), which works well most of the time.
Tuning Process
Efficiently finding the best combination of these hyperparameters is key to any DL/ML solution. There are two well-known ways to tune hyperparameters:
 Grid search: we try all possible combinations of a fixed set of values for each hyperparameter.
 Random search: we try randomly sampled values.
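To make the difference concrete, here is a minimal sketch of how the two strategies pick their trial points. The function names, the two hyperparameters chosen (hidden units and number of layers), and their ranges are illustrative assumptions, not values from the course.

```python
import itertools
import random

def grid_search_points(hidden_units, num_layers):
    """Grid search: every combination of the listed values."""
    return list(itertools.product(hidden_units, num_layers))

def random_search_points(n_points, units_range, layers_range, seed=0):
    """Random search: n_points independent random draws."""
    rng = random.Random(seed)
    points = []
    for _ in range(n_points):
        units = rng.randint(*units_range)    # inclusive on both ends
        layers = rng.randint(*layers_range)  # inclusive on both ends
        points.append((units, layers))
    return points

# Hypothetical search space, same budget of 9 trials for both strategies
grid = grid_search_points([50, 100, 150], [2, 3, 4])
rand = random_search_points(9, (50, 150), (2, 4))

print("grid  :", grid)
print("random:", rand)
```

With the same budget of 9 trials, the grid only ever tests 3 distinct values for the number of hidden units, while random search can test up to 9.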
Using an appropriate scale
Sampling at random doesn’t mean sampling uniformly at random over the range of valid values, so it’s important to pick the appropriate scale on which to explore the hyperparameters.
How? Imagine you want to tune the learning rate \(\alpha\) and you said that it could be as low as 0.0001 and as high as 1.
So we need to explore the values in \([0.0001, 1]\), but if we sampled the values uniformly at random, about 90% of the sampled values would come from \([0.1, 1]\) and only 10% from \([0.0001, 0.1]\). So instead of a linear scale we should use a logarithmic scale:
\([0.0001, …, 0.001, …, 0.01, …, 0.1, …, 1]\)
Sampling uniformly on this scale gives each of the following ranges the same share of the samples:
 \([0.0001, 0.001]\)
 \([0.001, 0.01]\)
 \([0.01, 0.1]\)
 \([0.1, 1]\)
This can be done in Python like this:
import numpy as np
r = -4 * np.random.rand()  # r is uniform over (-4, 0]
learning_rate = 10**r      # learning_rate lies in (0.0001, 1]
Recall: np.random.rand() returns a random sample from a uniform distribution over [0, 1).
To understand why this works, notice that \(r \in [-4, 0]\), thus \(\text{learning_rate }(\alpha) \in [10^{-4}, 10^0] = [0.0001, 1]\).
In general: if you want to sample values at random on a logarithmic scale between \(10^a\) and \(10^b\), you first find the values a and b:
\(a = \log_{10}(\text{min_value})\) (min_value in our example was 0.0001), \(b = \log_{10}(\text{max_value})\) (max_value in our example was 1),
then you sample \(r\) uniformly at random from \([a, b]\) and get the value \(\text{parameter} = 10^r\).
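The general recipe can be sketched as a small helper and checked against plain uniform sampling. The function name `sample_log_uniform` and the sample size are assumptions made for illustration.

```python
import numpy as np

def sample_log_uniform(min_value, max_value, size, rng):
    """Sample values between min_value and max_value on a log scale."""
    a = np.log10(min_value)           # e.g. log10(0.0001) = -4
    b = np.log10(max_value)           # e.g. log10(1) = 0
    r = rng.uniform(a, b, size=size)  # r is uniform over [a, b)
    return 10.0 ** r

rng = np.random.default_rng(0)
linear = rng.uniform(0.0001, 1.0, size=100_000)
log_scale = sample_log_uniform(0.0001, 1.0, size=100_000, rng=rng)

# Fraction of samples that fall in the top decade [0.1, 1]
frac_linear = np.mean(linear >= 0.1)
frac_log = np.mean(log_scale >= 0.1)
print(f"linear scale: {frac_linear:.2f}, log scale: {frac_log:.2f}")
```

On the linear scale about 90% of the samples land in \([0.1, 1]\). On the log scale that range is one of four decades, so it receives about 25% of the samples.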
Choosing the hyperparameter for exponentially weighted averages \((\beta)\) can be a little tricky:
Suppose we want to search over \(\beta \in [0.9, 0.999]\).
We can search instead for \(1 - \beta \in [0.001, 0.1]\),
thus \(r \in [\log_{10}(0.001), \log_{10}(0.1)] = [-3, -1]\),
so \(1 - \beta = 10^r\),
and finally:
\(\beta = 1 - 10^r\)
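A minimal sketch of this trick in NumPy follows. The helper name `sample_beta` is an assumption for illustration, and the range is the one from the example above.

```python
import numpy as np

def sample_beta(min_beta, max_beta, size, rng):
    """Sample beta so that 1 - beta is spread evenly on a log scale."""
    a = np.log10(1.0 - max_beta)      # e.g. log10(0.001) = -3
    b = np.log10(1.0 - min_beta)      # e.g. log10(0.1)   = -1
    r = rng.uniform(a, b, size=size)  # r is uniform over [a, b)
    return 1.0 - 10.0 ** r

rng = np.random.default_rng(0)
betas = sample_beta(0.9, 0.999, size=5, rng=rng)
print(betas)
```

The log scale matters here because an exponentially weighted average with parameter \(\beta\) averages over roughly \(1 / (1 - \beta)\) values. The result is therefore far more sensitive to small changes when \(\beta\) is close to 1.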
Pandas vs. Caviar
 Babysitting one model (the panda approach): you use this when you have huge data and limited computational resources (CPU/GPU).
 Training many models in parallel (the caviar approach): you train different models in parallel with different hyperparameter setups, then choose the best one.
 Choosing between pandas and caviar is determined by the computational power you can access.
Batch normalization
We could either normalize \(Z^{[l]} \text{ or } A^{[l]}\) but often we normalize \(Z^{[l]}\) before applying any activation.
Implementation:
\[\mu = \frac{1}{m} \sum_i Z^{(i)}\]
\[\sigma^2 = \frac{1}{m} \sum_i (Z^{(i)} - \mu)^2\]
\[Z_{norm}^{(i)} = \frac{Z^{(i)} - \mu}{\sqrt{\sigma^2 + \varepsilon}}\]
\[\tilde{Z}^{(i)} = \gamma Z_{norm}^{(i)} + \beta\]
Notes:
 \(\varepsilon\) is to avoid division by zero.
 \(\gamma \text{ and } \beta \) are learnable parameters in the model.
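The four equations above can be sketched as a forward pass in NumPy. This assumes \(Z\) is stored with shape (units, m), one column per example, and that \(\gamma\) and \(\beta\) have shape (units, 1). The function name and the test data are illustrative.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Batch-normalize Z of shape (n_units, m) over the m examples."""
    mu = np.mean(Z, axis=1, keepdims=True)               # mean per unit
    var = np.mean((Z - mu) ** 2, axis=1, keepdims=True)  # variance per unit
    Z_norm = (Z - mu) / np.sqrt(var + eps)               # zero mean, unit variance
    Z_tilde = gamma * Z_norm + beta                      # learnable scale and shift
    return Z_tilde

rng = np.random.default_rng(0)
Z = rng.normal(loc=3.0, scale=5.0, size=(4, 64))  # 4 units, 64 examples
gamma = np.ones((4, 1))   # with gamma = 1 and beta = 0 the output is Z_norm
beta = np.zeros((4, 1))

Z_tilde = batch_norm_forward(Z, gamma, beta)
print(Z_tilde.mean(axis=1))  # close to 0 for every unit
print(Z_tilde.std(axis=1))   # close to 1 for every unit
```

During training, \(\gamma\) and \(\beta\) would be updated by the optimizer like any other parameter. This sketch covers only the forward computation.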
Multiclass classification
Use Softmax in the last layer.
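As a short illustration, softmax turns the last layer's values \(z_j\) into probabilities \(a_j = e^{z_j} / \sum_k e^{z_k}\) that sum to 1. The sketch below assumes \(Z\) has shape (classes, m). The input values are made up for the example.

```python
import numpy as np

def softmax(Z):
    """Column-wise softmax for Z of shape (n_classes, m)."""
    # Subtracting the column max does not change the result
    # but avoids overflow in np.exp for large values.
    Z_shifted = Z - np.max(Z, axis=0, keepdims=True)
    exp_Z = np.exp(Z_shifted)
    return exp_Z / np.sum(exp_Z, axis=0, keepdims=True)

Z = np.array([[5.0], [2.0], [-1.0], [3.0]])  # 4 classes, 1 example
A = softmax(Z)
print(A.ravel())  # one probability per class
print(A.sum())    # probabilities sum to 1
```

The predicted class is the one with the highest probability, here the first one, since it has the largest \(z\).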
Intro to deep learning frameworks
Some deep learning frameworks:
 Caffe/Caffe2
 CNTK
 DL4J
 Keras
 Lasagne
 mxnet
 PaddlePaddle
 TensorFlow
 Theano
 Torch
Criteria for choosing a framework:
 Ease of programming (fast development cycle)
 Running speed
 Truly open (open source with good governance)
TensorFlow Code example
Some notes about TensorFlow:
 TensorFlow is a programming framework used in deep learning.
 The two main object classes in TensorFlow are Tensors (Variables, Placeholders, …) and Operations (tf.matmul, tf.add, …).
 When you code in TensorFlow you have to take the following steps:
  Create a graph containing Tensors and Operations.
  Create a session.
  Initialize the session.
  Run the session to execute the graph.
 You can execute the graph multiple times in a for loop.
 Backpropagation and optimization are done automatically when running the session on the “optimizer” object.
 A placeholder is a variable whose value you assign later.
  You use it to get your data into the cost function,
  or to feed different minibatches.
 TensorFlow knows how to compute the derivatives and minimize the cost function J.
 TensorFlow has the necessary backward functions built in, so there is no need to explicitly implement backpropagation.
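The steps above can be sketched with a toy cost function. This is a minimal example, assuming TensorFlow is installed with the `tensorflow.compat.v1` module available, since sessions and placeholders belong to the graph-mode API described in these notes. The cost \(J(w) = w^2 - 10w + 25 = (w - 5)^2\), the learning rate, and the iteration count are illustrative choices.

```python
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()  # graph mode with sessions, as in the notes

# Coefficients of J(w) = c0*w^2 + c1*w + c2, fed in through a placeholder
coefficients = np.array([[1.0], [-10.0], [25.0]], dtype=np.float32)

# Step 1: create a graph containing tensors and operations
x = tf.placeholder(tf.float32, shape=[3, 1])  # value assigned later
w = tf.Variable(0.0, dtype=tf.float32)        # parameter to optimize
cost = x[0][0] * w ** 2 + x[1][0] * w + x[2][0]
train = tf.train.GradientDescentOptimizer(0.01).minimize(cost)

# Steps 2 to 4: create a session, initialize it, run the graph
init = tf.global_variables_initializer()
with tf.Session() as session:
    session.run(init)
    for _ in range(1000):
        # Each run does one gradient descent step; the derivatives
        # are computed by TensorFlow, not written by hand.
        session.run(train, feed_dict={x: coefficients})
    w_final = session.run(w)

print(w_final)  # should be close to 5, the minimizer of (w - 5)^2
```

Only the forward computation of the cost is written out. Feeding a different array through `feed_dict` would minimize a different quadratic without rebuilding the graph, which is the same mechanism used to feed different minibatches.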
Resources: Deep Learning Specialization on Coursera, by Andrew Ng