Fully Connected Layer

  • 32 x 32 x 3 image -> stretch to 3072 x 1 -> 10 x 3072 weights -> 1 x 10 -> activation -> 1 number

Convolution Layer

  • Why is it convolutional?
    • it is related to convolution of two signals
  • preserve spatial structure to take advantage of the spatial features of an image
  • use filters which always extend the full depth of the input volume
  • ex) 32 x 32 x 3 image, 5 x 5 x 3 filter
  • to calculate using filters, both a filter and a part of an image are increased to (the size of the filter) x 1, and they are calculated and bias is added. (\(w^Tx + b\))
  • applying multiple filters to an image stacks up activation maps (activation map is output of particular convolution layer)
  • ex) 32 x 32 x 3 image, 5 x 5 x 3 filter x 3 -> 28 x 28 x 3 activation map 1
  • Convnet: a sequence of Conbolution Layers, interspersed with activation functions 2
  • Sequence: image -> Low-level features -> Mid-level features -> High-level features -> Linearly separable classifer

Filter

  • stride
    • 7 x 7 image + 3 x 3 filter -> 5 x 5 output
    • 7 x 7 image + 3 x 3 filter + stride 2 -> 3 x 3 output
    • 7 x 7 image + 3 x 3 filter + stride 3 -> Impossible (asymmetric output)
    • output size: (the length of an image - the length of an filter) / stride + 1
    • ex) image 7 x 7, 3 x 3 filter
      • stride 1: (7-3)/1 + 1 = 5
      • stride 2: (7-3)/2 + 1 = 3
      • stride 3: (7-3)/3 + 1 = 2.33
    • trend towards smaller filters and deeper architectures

Padding

  • Why is padding used?
    • to ensure that the values of the edges are also used in the same number of operations as the internal values
    • to maintain the size of the input
      • the size of the input is reduced quickly by using a filter, which is not good for performance
  • ex) 7 x 7 image, 3 x 3 image, stride 1, pad with 1 pixel
    • 3
    • ((7+2)-3)/1 + 1 = 7 => 7 x 7 output
  • ex) 32 x 32 x 3 image, 5 x 5 filters 10, stride 1, pad 2
    • 4
    • 5
  • in general, CONV layers with stride 1, filters of size FxF, and zero-padding with (F-1)/2

Common Settings

  • number of filters K, spatial extend F, stide S, amount of zero padding P
    • K: powers of 2 (32, 64, 128, …)
    • F = 3, S = 1, P = 1
    • F = 5, S = 1, P = 2
    • F = 5, S = 2, P = whatever
    • F = 1, S = 1, P = 0 -> to reduce depth while maintaining width and height of input

Brain/neuron view of CONV layer

  • dot product between a part of an image and a filter = a neuron with local connectivity
  • an activation map is a (output length x output length) sheet of neuron outputs
    • each is connected to a samll region in the input
    • all of them share parameters (= since the same filter is used, the weight are all the same)
  • ex) 32 x 32 x 3 image, 5 x 5 filters 5 -> 28 x 28 x 5
    • there are 5 different neurons all looking at the same reigon in the input volume
    • each neuron looks at the full input volume
    • 6

Pooling

  • makes the representations smaller and more manageable, but the depth is not changed (= downsampling)
  • to prevent overfitting by reducing the number of parameters
  • operates over each activation map independently
  • Max pooling
    • 7
    • Max pooling is better than average pooling
      • because the biggest feature in that area is found
    • trend towards controling stride than pooling for downsampling
  • common settings
    • size: 2 x 2, stride: 2
    • size: 3 x 3, stride: 2
  • trend towards getting rid of POOL/FC layers (just CONV)



This is written by me after taking CS231n Spring 2017 provided by Stanford University. If you have questions, you can leave a reply on this post.