Practical aspects of Deep Learning

Build a system first, then iteratively improve upon it with bias/variance analysis and error correction.

NumPy vs Tensor

| Framework  | Data structure | Property  | Targets     | Automatic differentiation | Data type        |
|------------|----------------|-----------|-------------|---------------------------|------------------|
| tensorflow | tensor         | immutable | CPU/GPU/TPU | Yes                       | Numerical/String |
| numpy      | array          | mutable   | CPU         | No                        | Numerical        |
  1. A tensor is not actually a grid of values but the computations that lead to those values.
  2. There are different types of tensors.
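A quick sketch of the mutability difference, assuming NumPy and TensorFlow are installed:

```python
import numpy as np
import tensorflow as tf

a = np.array([1.0, 2.0, 3.0])
a[0] = 9.0  # NumPy arrays can be modified in place

t = tf.constant([1.0, 2.0, 3.0])
# t[0] = 9.0  # would raise TypeError: tensors are immutable
t2 = tf.tensor_scatter_nd_update(t, indices=[[0]], updates=[9.0])  # returns a new tensor instead
```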

Bias and Variance

| Split     | High variance | High bias    | High bias & variance | Low bias & variance |
|-----------|---------------|--------------|----------------------|---------------------|
| Train set | low error     | medium error | medium error         | low error           |
| Val set   | high error    | medium error | high error           | low error           |

> More on that in the splitting-data section.
> - Error is measured with respect to the Bayes error (the optimal error).
> - Avoidable bias is the difference between Bayes error and training error.
> - It's not always easy to estimate Bayes error (human-level performance) on large structured datasets.

These steps help with orthogonalization:

  1. High bias? If yes, try training deeper models, training for longer, or a different optimizer, NN architecture, or hyperparameter search.

  2. High variance? If yes, try adding more data/augmentation, regularization, or error analysis (a numeric sketch of this diagnosis follows the list).

  3. If the model now doesn't perform well on the test set, increase the validation set.
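A minimal numeric sketch of the diagnosis, assuming the error rates have already been measured (the numbers here are illustrative):

```python
bayes_error = 0.04  # proxied by human-level performance
train_error = 0.07
val_error = 0.10

avoidable_bias = train_error - bayes_error  # high bias -> bigger model, train longer
variance = val_error - train_error          # high variance -> more data, regularization
print(f"avoidable bias: {avoidable_bias:.1%}, variance: {variance:.1%}")
```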

Orthogonalization

Tune one parameter at a time to have a clear view of its impact on the training process.

Hyperparameters

By order of importance:

  • \(\alpha\): learning rate
  • \(\beta\): momentum, \(\approx 0.9\)
  • \(\beta_1 = 0.9\), \(\beta_2 = 0.999\), \(\epsilon = 10^{-8}\) for the Adam optimizer
  • Number of hidden units
  • Mini-batch size
  • Number of layers
  • Learning rate decay

Grid search for optimal hyperparameters might not work well for deep learning, since not every hyperparameter carries equal weight in its importance.

Random sampling is good for initially figuring out the area of the grid that gives good results; later, reduce the search space to that area and sample more densely within it (a coarse-to-fine sketch follows).
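A coarse-to-fine random search sketch; `evaluate` is a hypothetical stand-in for training a model and returning its validation score:

```python
import numpy as np

rng = np.random.default_rng(0)

def evaluate(log_lr, hidden_units):
    # Hypothetical stand-in: train a model with these hyperparameters
    # and return its validation score (higher is better).
    return -(log_lr + 3.0) ** 2 - 1e-4 * (hidden_units - 128) ** 2

# Coarse pass: sample randomly over the full range.
coarse = [(rng.uniform(-6, 0), int(rng.integers(16, 512))) for _ in range(20)]
best_lr, best_units = max(coarse, key=lambda t: evaluate(*t))

# Fine pass: re-sample densely around the best coarse result.
fine = [(best_lr + rng.uniform(-0.5, 0.5), int(best_units * rng.uniform(0.8, 1.2)))
        for _ in range(20)]
best_lr, best_units = max(fine, key=lambda t: evaluate(*t))
print(best_lr, best_units)
```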

Ensure the scale of each hyperparameter is appropriate; for example, the learning rate \(\alpha\) should be sampled on a log scale.

```python
import numpy as np

r = -4 * np.random.rand()  # r is uniform in [-4, 0]
a = 10**r                  # a is log-uniform in [10^-4, 10^0]
r, a
```

Similarly, \(\beta\) lies in the range \([0.9, 0.999]\).

\((1-\beta)\) then lies in \([0.001, 0.1]\), so \(r\) is sampled in the range \([-3, -1]\).

```python
import numpy as np

r = -2 * np.random.rand() - 1  # r is uniform in [-3, -1]
beta = 1 - 10**r               # 10^r is in [10^-3, 10^-1], so beta is in [0.9, 0.999]
r, beta
```

Local optima

In higher dimensions, the chance of the gradient being 0 at a true local optimum is quite small; a zero-gradient point is most likely a saddle point. (Called a saddle because the surface is shaped like a saddle on a horse.)

Analysing performance

Evaluation metric

Use a single real-valued metric to evaluate the model's performance quickly.

The following can serve as error metrics:

\[
\text{Accuracy} - 0.5 \times \text{latency}
\]

\[
\frac{1}{\sum_i w^{(i)}} \sum_i^{m_{dev}} w^{(i)} \, \mathbb{1}\{y_{pred}^{(i)} \neq y^{(i)}\}
\]

where \(w^{(i)}\) is 1 if \(x^{(i)}\) is non-porn and 10 if \(x^{(i)}\) is porn.

F1 score: takes the harmonic mean of precision and recall, \(\frac{2}{\frac{1}{Precision} + \frac{1}{Recall}}\).
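A small sketch computing the weighted error and F1 score above on made-up labels:

```python
import numpy as np

y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 1, 1])
w = np.array([1, 1, 10, 1, 1])  # e.g. weight 10 where x^(i) is porn

weighted_error = np.sum(w * (y_pred != y_true)) / np.sum(w)

tp = np.sum((y_pred == 1) & (y_true == 1))
precision = tp / np.sum(y_pred == 1)
recall = tp / np.sum(y_true == 1)
f1 = 2 / (1 / precision + 1 / recall)
print(weighted_error, f1)
```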

If there are N metrics you care about:

  1. use 1 as the optimizing metric and
  2. the other N-1 as satisficing metrics (as long as each meets some acceptable threshold).

Then optimize your training for the optimizing metric (a model-selection sketch follows the cost formula below):

\(J = \frac{1}{\sum_{i}^{m} w^{(i)}} \sum_{i}^{m} w^{(i)} \, l(\hat{y}^{(i)}, y^{(i)})\)
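A minimal model-selection sketch for the "1 optimizing + N-1 satisficing" setup; the candidate models and threshold are made up:

```python
# (name, accuracy, latency in ms) for hypothetical candidate models
models = [("A", 0.92, 120), ("B", 0.95, 310), ("C", 0.90, 80)]

LATENCY_THRESHOLD_MS = 150  # satisficing metric: must stay under this
feasible = [m for m in models if m[2] <= LATENCY_THRESHOLD_MS]
best = max(feasible, key=lambda m: m[1])  # optimizing metric: accuracy
print(best)  # ('A', 0.92, 120): highest accuracy among models meeting the threshold
```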

Dataset

Splitting

The dataset is usually split into three.

| Splits     | Small dataset | Large dataset |
|------------|---------------|---------------|
| Train      | 70%           | 98%           |
| Validation | 15%           | 1%            |
| Test       | 15%           | 1%            |

> Note: 1% of a large dataset is enough to validate the model.

  • Ensure the validation and test sets come from the same distribution.
  • The test set exists to give an unbiased estimate of your model's performance (having a test set is not mandatory).
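A sketch of a 98/1/1 split for a large dataset; shuffling first helps the splits share one distribution:

```python
import numpy as np

n = 1_000_000
idx = np.random.permutation(n)

n_val = n_test = n // 100            # 1% each for validation and test
val_idx = idx[:n_val]
test_idx = idx[n_val:n_val + n_test]
train_idx = idx[n_val + n_test:]     # remaining 98% for training
```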

But when the training data does not come from the same distribution as the validation and test sets, the training data is split into a train and a train-val portion:

| Split                               | Error on distribution A (training data) | Error on distribution B (real-world data) | Type                                      |
|-------------------------------------|-----------------------------------------|-------------------------------------------|-------------------------------------------|
| Human level \(\approx\) Bayes error | 4%                                      | 6%                                        |                                           |
| Train                               | 7%                                      | 6%                                        | High bias (7% - 4% is avoidable bias)     |
| Train-val                           | 10%                                     |                                           | High variance (10% - 7% is variance)      |
| Validation                          |                                         | 12%                                       | Distribution shift (12% - 10%)            |
| Test                                |                                         | 12%                                       | Overfitted on the validation split        |

The distribution shift shows up across these two columns of error.

> Use this to find whether the error is due to high variance or to a different distribution.

> Note: the difference in error between the train-val and validation splits gives the error added by the distribution shift.
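The decomposition from the table above, as a worked sketch:

```python
human_level = 0.04  # proxy for Bayes error
train, train_val, val, test = 0.07, 0.10, 0.12, 0.12

avoidable_bias = train - human_level   # 3%
variance = train_val - train           # 3%
distribution_shift = val - train_val   # 2%
overfit_to_val = test - val            # 0%: not overfitted to the validation split
```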

Error analysis

  • Use a confusion matrix on the validation set and/or filter the mislabelled images as below.
  • Pro tips:
    1. Track mislabelled ground truth in a separate column.
    2. Random errors in a large dataset are fine; only systematic errors are an issue.
      • In the case of mislabelled data, it's wise to look at correctly predicted classes as well.
    3. Apply the same process to the validation set as well as the test set.

Addressing data mismatch

  • Make the training data more similar (data synthesis) and/or collect more data similar to the validation/test sets.

Transfer learning

When you have trained a deep learning model on a large dataset, you can expect the initial layers to learn useful features that can be reused for other similar tasks. For example:

Use the features from a model trained on the large ImageNet dataset \(\color{blue}{x,y}\) and retrain on a small radiology dataset \(\color{purple}{x,y}\) by replacing the final layers with one or more new layers that predict \(\color{purple}{\hat{y}}\).

Note:

  • Pretraining is on \(\color{blue}{x,y}\), and
  • Fine-tuning is on \(\color{purple}{x,y}\).
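A minimal transfer-learning sketch in Keras, assuming TensorFlow is available; `radiology_x`/`radiology_y` are hypothetical placeholders for the small dataset:

```python
import tensorflow as tf

# Pretrained ImageNet features, with the original classification head removed.
base = tf.keras.applications.ResNet50(include_top=False, weights="imagenet", pooling="avg")
base.trainable = False  # freeze the pretrained layers; only the new head is trained

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(1, activation="sigmoid"),  # new final layer predicting y_hat
])
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit(radiology_x, radiology_y, epochs=5)  # fine-tune on the small radiology dataset
```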

Multitask learning

In multitask learning a single example carries labels for several tasks at once, so \(y^{(i)}\) will be a vector such as \([0,1,1,0]\).

\[
J = \frac{1}{m} \sum_{i=1}^m \sum_{j=1}^{n_{\text{classes}}} L(\hat{y}_j^{(i)}, y_j^{(i)})
\]

> When some classes have missing label data, only sum over the classes where a 0/1 label is present in the cost function.

where the loss function is the same as the logistic loss; a masked-loss sketch follows.
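A sketch of the multitask logistic loss with missing labels masked out; marking missing labels with -1 is an assumed convention:

```python
import numpy as np

def multitask_loss(y_hat, y):
    mask = (y != -1)  # only sum over classes that actually have a 0/1 label
    eps = 1e-8        # avoid log(0)
    ll = y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps)
    return -np.sum(ll * mask) / y.shape[0]

y = np.array([[0, 1, 1, -1]])             # one example, 4 tasks, last label missing
y_hat = np.array([[0.1, 0.8, 0.7, 0.5]])  # predicted probabilities per task
print(multitask_loss(y_hat, y))
```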

Tips for training a multitask learning model:

  1. Ensure that sharing low-level features benefits the model in recognising the different tasks.
  2. A big model is needed to train well on all the tasks, else performance will take a hit.
  3. There should be a similar amount of data for each task.

End-to-end approach

When you have enough data to map a complex function from input to output, an end-to-end deep learning approach can be used rather than hand-designed features.

This is not always straightforward; it differs from application to application.

  • Speech recognition example: a hand-designed pipeline goes audio → features → phonemes → words → transcript, while the end-to-end approach maps audio → transcript directly and works well once enough data is available.
  • Face recognition example: the multi-step approach (detect and crop the face, then identify the person) outperforms end-to-end here, because plenty of data exists for each sub-task but far less for the direct mapping.
Licensed under CC BY-NC-SA 4.0