Activation Functions
- Sigmoid
- \[\sigma (x) = 1/(1+e^{-x})\]
- squashes numbers to range [0,1]
- historically popular
- problems
- saturated neurons kill the gradients (saturate: converging to a constant value); see the numpy sketch at the end of this section
- once a neuron saturates, its local gradient becomes (nearly) zero
- outputs are not zero-centered
- since the sigmoid output fed into the next layer is always positive, the gradients on w are either all positive or all negative
- exp() is a bit compute expensive
- it is minor compared with the convolutions and dot products in the network, but it can still be a problem
- tanh(x)
- squashes numbers to range [-1,1]
- zero centered
- kills gradients when saturated
- better than sigmoid
- ReLU (Rectified Linear Unit)
- \[f(x) = max(0,x)\]
- does not saturate (in + region)
- computationally efficient
- convergence is much faster than sigmoid/tanh in practice
- more biologically plausible than sigmoid
- issues
- not zero-centered output
- saturation (zero gradient) when \(x \leq 0\)
- dead ReLU: if a bad initialization or a too-large learning rate pushes the weights so that the neuron's input is always \(\leq 0\), its gradient is zero from that point on and the weights never update again
- Leaky ReLU
- \[f(x) = max(0.01x, x)\]
- does not saturate
- computationally efficient
- converges much faster than sigmoid/tanh in practice
- will not die
- PReLU (Parametric Rectifier)
- \[f(x) = max(\alpha x, x)\]
- \(\alpha\): a parameter learned by backprop
- more flexible
- ELU (Exponential Linear Units)
- \[f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha (e^x - 1) & \text{if } x \leq 0 \end{cases}\]
- all benefits of ReLU
- closer to zero mean outputs
- compared with Leaky ReLU, the negative saturation regime adds some robustness to noise
- computation requires exp()
- Maxout Neuron
- \[max(w_1^Tx + b_1, w_2^Tx + b_2)\]
- does not have the basic form of dot product -> nonlinearity
- generalizes ReLU and Leaky ReLU
- linear regime, no saturation, never dies
- doubles the number of parameters/neuron
- Conclusion: “Use ReLU!”
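To make the saturation and dead-ReLU arguments above concrete, here is a minimal numpy sketch of these activations and their local gradients; the function names are mine, not from the lecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # output in (0, 1), not zero-centered

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)                 # near 0 when |x| is large: the gradient is "killed"

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2         # zero-centered output, but still saturates at the tails

def relu(x):
    return np.maximum(0.0, x)            # does not saturate in the positive region

def relu_grad(x):
    return (x > 0).astype(float)         # exactly 0 for x <= 0: a "dead" ReLU never updates

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)      # small negative slope, so the gradient never fully dies

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid_grad(x))                   # vanishes at both tails
print(tanh_grad(x))                      # same saturation problem as sigmoid
print(relu_grad(x))                      # 1 in the positive region, 0 elsewhere
```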
Data Preprocessing

- Zero-centering
- subtract the mean image (e.g. AlexNet)
- subtract the per-channel mean (e.g. VGGNet); see the sketch after this list
- Normalization: usually not applied to images, since pixel values are already on a comparable scale
- PCA/Whitening: not applied to images either, because we want to keep their spatial structure
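A minimal sketch of the two zero-centering options above, assuming the images are stored as an array of shape (N, H, W, C); the placeholder data is only for illustration.

```python
import numpy as np

# placeholder batch of images: N x H x W x C
X = np.random.rand(8, 32, 32, 3).astype(np.float32)

# AlexNet-style: subtract the mean image (one mean per pixel location and channel)
mean_image = X.mean(axis=0)                 # shape (32, 32, 3)
X_centered_image = X - mean_image

# VGGNet-style: subtract a single mean per channel
per_channel_mean = X.mean(axis=(0, 1, 2))   # shape (3,)
X_centered_channel = X - per_channel_mean
```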
Weight Initialization
- W = 0
- every neuron computes the same output and receives the same gradient, so they never differentiate (this happens whenever all the weights are identical)
- Small random numbers
```python
W = 0.01 * np.random.randn(D, H)  # standard normal samples (D x H matrix), scaled to std 0.01
```
- works well for small networks, but causes problems with deeper networks
- with tanh as the activation function, the activation distribution collapses toward zero in deeper layers
- so the gradients become zero during backpropagation and the weights stop updating
- Bigger random numbers

- almost all neurons are completely saturated at either -1 or 1, so the gradients are all zero and nothing updates
- Xavier initialization
- divide by the square root of the number of inputs (fan-in), so the activation variance stays roughly constant across layers; the simulation sketch after this list shows the effect

```python
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)
```
- with a ReLU nonlinearity, half of the units are zeroed, so the variance assumption breaks and the activations shrink -> bad
- improvement for ReLU (He et al. 2015):

```python
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in / 2)
```
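The effect of these initialization choices can be checked with a small forward-pass simulation like the one shown in lecture; a rough sketch, where the layer width, depth, and batch size are arbitrary choices of mine.

```python
import numpy as np

np.random.seed(0)
D = 500                                   # units per layer (arbitrary)
x = np.random.randn(1000, D)              # fake input batch

def activation_stds(init_fn, layers=10):
    # run a forward pass through a stack of tanh layers and record activation std per layer
    h = x
    stds = []
    for _ in range(layers):
        W = init_fn(h.shape[1], D)
        h = np.tanh(h.dot(W))
        stds.append(float(h.std()))
    return stds

small = lambda fan_in, fan_out: 0.01 * np.random.randn(fan_in, fan_out)
xavier = lambda fan_in, fan_out: np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)

print(activation_stds(small))   # std shrinks toward 0 layer by layer -> gradients vanish
print(activation_stds(xavier))  # std stays roughly constant across layers
```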
Batch Normalization
- normalizes the activations over each mini-batch (a sketch of the forward pass follows this list)
- compute the mean and variance of each batch
- normalize
- \[\hat{x}^{(k)} = {x^{(k)} - E[x^{(k)}] \over \sqrt {Var[x^{(k)}]}}\]
- usually inserted after Fully Connected or Convolutional layers, and before the nonlinearity
- mitigates bad scaling effects caused by matrix operations
- normalized form -> original form
- \[y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}\]
- \[\gamma^{(k)} = \sqrt {Var[x^{(k)}]}\]
- \[\beta^{(k)} = E[x^{(k)}]\]
- \(\gamma^{(k)}, \beta^{(k)}\): determined by learning
- features
- improves gradient flow through the network
- allows higher learning rates
- reduces the strong dependence on initialization
- acts as a form of regularization
- slightly reduces the need for dropout
- is not always good
- the basic structure and spatial characteristics of the input are maintained after batch normalization
- at test time, a fixed mean and variance computed during training (e.g. running averages) are used instead of batch statistics
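A simplified batch-norm forward pass for fully connected activations of shape (N, D); the function name is mine, and using running averages of the batch statistics at test time is one common choice, assumed here.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      train=True, eps=1e-5, momentum=0.9):
    # x: activations (N, D); gamma, beta: learnable scale/shift of shape (D,)
    if train:
        mu = x.mean(axis=0)                   # per-feature mean over the batch
        var = x.var(axis=0)                   # per-feature variance over the batch
        # keep running estimates for use at test time
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mean, running_var   # fixed statistics at test time
    x_hat = (x - mu) / np.sqrt(var + eps)     # normalize to zero mean, unit variance
    return gamma * x_hat + beta, running_mean, running_var

# tiny usage example
N, D = 64, 100
x = np.random.randn(N, D) * 5 + 3
out, rm, rv = batchnorm_forward(x, np.ones(D), np.zeros(D), np.zeros(D), np.ones(D))
print(out.mean(), out.std())                  # roughly 0 and 1
```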
Babysitting the Learning Process
- preprocess the data
- choose the architecture
- check the loss is reasonable
- with regularization disabled, the initial loss should be about \(-\ln({1 \over C})\) for a C-class softmax (checked numerically in the sketch after this list)
- when regularization is cranked up, the loss should go up (sanity check)
- check the model
- make sure the model can overfit a very small training set to 1.00 training accuracy (being able to overfit shows the model and training code are working)
- start with small regularization and find learning rate that makes the loss go down
- loss barely changing = learning rate is too low
- cost: NaN = learning rate is too high
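The "loss is reasonable" check can be verified numerically: for a C-class softmax classifier with tiny random weights and no regularization, the initial loss should be close to \(-\ln(1/C) = \ln(C)\), about 2.3 for 10 classes. A small sketch:

```python
import numpy as np

C = 10                                    # number of classes (e.g. CIFAR-10)
N = 100                                   # size of the sanity-check batch

scores = 0.001 * np.random.randn(N, C)    # near-zero scores, as with tiny random weights
y = np.random.randint(0, C, size=N)       # arbitrary labels

# softmax cross-entropy loss with no regularization
shifted = scores - scores.max(axis=1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
loss = -log_probs[np.arange(N), y].mean()

print(loss, np.log(C))                    # both close to 2.3
```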
Hyperparameter Optimization
- cross-validation strategy
- coarse -> fine
- (tip) if the cost grows beyond 3 * the original cost, stop that run early
- (tip) sample in log space, i.e. use \(10^{\text{uniform}(a, b)}\) (e.g. \(10^{\text{uniform}(-5, 5)}\)); see the sketch after this list
- order
- select the candidates with the best validation accuracy
- narrow the range of hyperparameters around them
- if the best candidate lies at the edge of the range, shift or widen the range
- random search vs grid search
- grid search: treats all hyperparameters as equally important, so only a few distinct values of the truly important one get tried and the optimum is hard to find
- random search: since hyperparameters differ in importance, sampling random points tries many distinct values of each one and is more likely to land near the optimum
- visualization
- plot the loss curve and the training/validation accuracy over time
- a big gap between training and validation accuracy means overfitting -> increase regularization or reduce the number of features / model capacity
- track the update ratio
- ratio of the update magnitude to the parameter magnitude
- want this to be somewhere around 0.001 or so
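A rough sketch of random search in log space and of the update-ratio check; the search ranges and tensor shapes below are placeholders, not values from the lecture.

```python
import numpy as np

# random search: sample learning rate and regularization strength in log space
for _ in range(10):
    lr = 10 ** np.random.uniform(-5, -1)     # placeholder range
    reg = 10 ** np.random.uniform(-4, 0)     # placeholder range
    print("try lr=%.2e reg=%.2e" % (lr, reg))
    # ... train briefly with (lr, reg), record validation accuracy, then go coarse -> fine ...

# update ratio check for one parameter tensor W (shapes are placeholders)
W = np.random.randn(256, 256)
dW = np.random.randn(256, 256) * 1e-3        # stand-in for a real gradient
lr = 1e-3
update = -lr * dW
print(np.linalg.norm(update) / np.linalg.norm(W))  # want this to be roughly 1e-3
```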
This is written by me after taking CS231n Spring 2017 provided by Stanford University.
If you have questions, you can leave a reply on this post.