Stanford CS231n Lecture5 Convolutional Neural Networks

32 x 32 x 3 image -> stretch to 3072 x 1 -> 10 x 3072 weights -> 1 x 10 -> activation -> 1 number

Why is it convolutional?
- it is related to convolution of two signals
preserve spatial structure to take advantage of the spatial features of an image
use filters which always extend the full depth of the input volume
ex) 32 x 32 x 3 image, 5 x 5 x 3 filter
to calculate using filters, both a filter and a part of an image are increased to (the size of the filter) x 1, and they are calculated and bias is added. (\(w^Tx + b\))
applying multiple filters to an image stacks up activation maps (activation map is output of particular convolution layer)
ex) 32 x 32 x 3 image, 5 x 5 x 3 filter x 3 -> 28 x 28 x 3 activation map
Convnet: a sequence of Conbolution Layers, interspersed with activation functions
Sequence: image -> Low-level features -> Mid-level features -> High-level features -> Linearly separable classifer

Why is padding used?
- to ensure that the values of the edges are also used in the same number of operations as the internal values
- to maintain the size of the input
  - the size of the input is reduced quickly by using a filter, which is not good for performance
ex) 7 x 7 image, 3 x 3 image, stride 1, pad with 1 pixel
- ((7+2)-3)/1 + 1 = 7 => 7 x 7 output
ex) 32 x 32 x 3 image, 5 x 5 filters 10, stride 1, pad 2
in general, CONV layers with stride 1, filters of size FxF, and zero-padding with (F-1)/2

dot product between a part of an image and a filter = a neuron with local connectivity
an activation map is a (output length x output length) sheet of neuron outputs
- each is connected to a samll region in the input
- all of them share parameters (= since the same filter is used, the weight are all the same)
ex) 32 x 32 x 3 image, 5 x 5 filters 5 -> 28 x 28 x 5
- there are 5 different neurons all looking at the same reigon in the input volume
- each neuron looks at the full input volume

makes the representations smaller and more manageable, but the depth is not changed (= downsampling)
to prevent overfitting by reducing the number of parameters
operates over each activation map independently
Max pooling
- Max pooling is better than average pooling
  - because the biggest feature in that area is found
- trend towards controling stride than pooling for downsampling
common settings
- size: 2 x 2, stride: 2
- size: 3 x 3, stride: 2
trend towards getting rid of POOL/FC layers (just CONV)

This is written by me after taking CS231n Spring 2017 provided by Stanford University. If you have questions, you can leave a reply on this post.