Vanilla Neural Network vs Recurrent Neural Networks

  • vanilla neural network
    • limited to a fixed-length input and a fixed-length output
  • RNN
    • can handle variable-length sequences of inputs and outputs
      • e.g.
        • one to many: image captioning
        • many to one: sentiment classification
        • many to many: machine translation, video classification at the frame level

Recurrent Neural Network

  • an RNN keeps an internal hidden state that is updated as each element of a sequence is processed
  • the recurrence: \(h_t = f_W(h_{t-1}, x_t)\), where \(h_t\) is the new hidden state, \(h_{t-1}\) the previous hidden state, and \(x_t\) the input at the current time step
  • the same function \(f_W\) and the same set of parameters are used at every time step
  • Vanilla Recurrent Neural Network
    • \(h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)\), \(y_t = W_{hy} h_t\) (a numpy sketch follows this list)
    • \(W_{hh}\): weight from the previous hidden state
    • \(W_{xh}\): weight from the input x
    • \(W_{hy}\): weight from the hidden state to the output y
  • many to one
    • \(h_T\): final hidden state \(\rightarrow\) summarizes all of the context
  • one to many
    • a fixed input initializes the initial hidden state of the model
  • sequence to sequence: many to one + one to many
    • encoder: take variable inputs and summarize the whole thing at the end
    • decoder: make a variable output from the summarized data
  • Example: Character-level Language Model
    • training time
      • vocabulary: [h, e, l, o]
      • training sequence: “hello”
        • at first the output scores are far from the correct values, but as training minimizes the loss, the scores of the correct characters increase
    • test time
      • given the input “h”, the model produces “e”
      • the sampled “e” becomes the input of the next step and passes through the model, producing a new output, and so on
      • at the first step, the softmax value for “o” is actually the largest, but luckily “e” is sampled
      • this works because at test time we do not pick the character with the highest softmax value; instead we sample one at random according to the predicted probabilities
      • in general, sampling gives diversity in the model's outputs (see the sampling sketch after this list)
  • Truncated Backpropagation Through Time
    • full backpropagation through time: forward through the entire sequence to compute the loss, then backward through the entire sequence to compute the gradient
      • if the sequence is long, this becomes very slow
    • truncated backpropagation (see the chunking sketch after this list)
      • run forward and backward through chunks of the sequence instead of the whole sequence
      • carry hidden states forward in time forever, but only backpropagate for some smaller number of steps
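
To make the recurrence concrete, here is a minimal numpy sketch of the vanilla RNN step. The dimensions, initialization scale, and toy input sequence are my own assumptions, not code from the lecture:

```python
import numpy as np

# Vanilla RNN step: h_t = tanh(W_hh h_{t-1} + W_xh x_t), y_t = W_hy h_t.
# Assumed sizes: a 4-character one-hot vocabulary (D), a small hidden state (H).
D, H = 4, 8
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.01, size=(H, D))  # weight from the input x
W_hh = rng.normal(scale=0.01, size=(H, H))  # weight from the previous hidden state
W_hy = rng.normal(scale=0.01, size=(D, H))  # weight from hidden state to output

def rnn_step(x, h_prev):
    """One time step; the same weights are reused at every step."""
    h = np.tanh(W_hh @ h_prev + W_xh @ x)
    y = W_hy @ h                 # unnormalized scores over the vocabulary
    return h, y

# The hidden state carries context forward across the sequence.
h = np.zeros(H)
for x in np.eye(D):              # a toy sequence of one-hot inputs
    h, y = rnn_step(x, h)
```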
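
And a sketch of test-time sampling for the character-level language model, reusing `rnn_step`, `D`, and `H` from the sketch above. The `np.random.choice` call is the key point: we sample from the softmax distribution instead of taking the argmax, which is where the diversity comes from:

```python
vocab = ['h', 'e', 'l', 'o']

def softmax(s):
    e = np.exp(s - np.max(s))
    return e / e.sum()

def sample_text(seed_char, length):
    """Feed each sampled character back in as the next input."""
    h = np.zeros(H)
    x = np.eye(D)[vocab.index(seed_char)]
    out = [seed_char]
    for _ in range(length):
        h, y = rnn_step(x, h)
        p = softmax(y)                  # distribution over [h, e, l, o]
        idx = np.random.choice(D, p=p)  # sample rather than argmax
        out.append(vocab[idx])
        x = np.eye(D)[idx]
    return ''.join(out)

print(sample_text('h', 10))  # gibberish before training, “hello”-like after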
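
Truncated backpropagation through time then becomes a chunking loop around a per-chunk forward/backward pass, written here in the spirit of min-char-rnn and reusing the weights defined above; the toy data and learning rate are assumptions:

```python
def forward_backward(inputs, targets, h_prev):
    """Forward and backward over one chunk; returns loss, grads, last h."""
    xs, hs, ps = {}, {-1: h_prev}, {}
    loss = 0.0
    for t, (ix, tix) in enumerate(zip(inputs, targets)):
        xs[t] = np.eye(D)[ix]
        hs[t] = np.tanh(W_hh @ hs[t - 1] + W_xh @ xs[t])
        y = W_hy @ hs[t]
        ps[t] = np.exp(y - y.max()); ps[t] /= ps[t].sum()
        loss += -np.log(ps[t][tix])          # softmax cross-entropy loss
    dW_xh, dW_hh, dW_hy = (np.zeros_like(W) for W in (W_xh, W_hh, W_hy))
    dh_next = np.zeros(H)
    for t in reversed(range(len(inputs))):
        dy = ps[t].copy(); dy[targets[t]] -= 1
        dW_hy += np.outer(dy, hs[t])
        dh = W_hy.T @ dy + dh_next
        draw = (1 - hs[t] ** 2) * dh         # backprop through tanh
        dW_xh += np.outer(draw, xs[t])
        dW_hh += np.outer(draw, hs[t - 1])
        dh_next = W_hh.T @ draw
    return loss, (dW_xh, dW_hh, dW_hy), hs[len(inputs) - 1]

data = [0, 1, 2, 2, 3] * 50   # toy index sequence over [h, e, l, o]
chunk_len = 25
h_prev = np.zeros(H)          # carried forward across chunks...
for t in range(0, len(data) - chunk_len, chunk_len):
    inputs = data[t : t + chunk_len]
    targets = data[t + 1 : t + chunk_len + 1]
    # ...but gradients never cross a chunk boundary
    loss, grads, h_prev = forward_backward(inputs, targets, h_prev)
    for W, dW in zip((W_xh, W_hh, W_hy), grads):
        W -= 1e-1 * dW        # plain SGD update (learning rate assumed)
```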

Image Captioning

  • a vector containing the information in the image is obtained through a CNN, and a sentence is generated from it through an RNN
  • the softmax layer at the end of the CNN model is removed, and the fc-4096 layer produces a vector summarizing the whole image
  • a start token is needed to begin generation (see the sketch after this list)
  • the feature vector obtained from the CNN is fed into the hidden state computation (h0)
  • each output word is the input of the next step
  • the RNN stops when an end token is generated (e.g. y2)
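
A minimal sketch of this generation loop, assuming the recurrence from the lecture in which the image feature v enters through a third weight matrix \(W_{ih}\). The vocabulary size, token indices, and random (untrained) weights below are placeholders:

```python
import numpy as np

V, H, F = 1000, 512, 4096            # vocab size, hidden size, fc-4096 feature size
W_xh = np.random.randn(H, V) * 0.01
W_hh = np.random.randn(H, H) * 0.01
W_ih = np.random.randn(H, F) * 0.01  # injects the image feature at each step
W_hy = np.random.randn(V, H) * 0.01
START, END = 0, 1                    # assumed special-token indices

def generate_caption(v, max_len=20):
    h = np.zeros(H)
    word = START                     # generation begins with the start token
    caption = []
    for _ in range(max_len):
        x = np.eye(V)[word]          # previous word is the next input
        h = np.tanh(W_xh @ x + W_hh @ h + W_ih @ v)
        scores = W_hy @ h
        p = np.exp(scores - scores.max()); p /= p.sum()
        word = int(np.random.choice(V, p=p))
        if word == END:              # stop once the end token is produced
            break
        caption.append(word)
    return caption

print(generate_caption(np.random.randn(F)))
```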

Attention

  • Image Captioning
    • the RNN focuses its attention on a different spatial location when generating each word
    • h0 produces a distribution over image locations (a1)
    • a1 is used to weight the original feature grid (L \(\times\) D), creating an attention summary vector (z1)
    • z1 and the first word (y1) are fed into h1, producing the next attention distribution (a2) and a distribution over vocabulary words (d1)
    • results
      • soft attention: take a weighted combination of all features from all image locations (see the sketch after this list)
      • hard attention: select exactly one location to look at in the image at each time step
  • Visual Question Answering
    • get a summary of the image via a CNN and a summary of the input question via an RNN
    • combine the two summaries to predict the correct answer
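
A minimal sketch of soft attention over a grid of CNN features; the dot-product scoring function and the shapes are my own assumptions (the real model learns how to score each location):

```python
import numpy as np

L, D, H = 196, 512, 512              # e.g. a 14x14 grid of 512-dim CNN features
features = np.random.randn(L, D)     # stand-in for CNN conv features
W_att = np.random.randn(D, H) * 0.01

def soft_attention(h, features):
    scores = features @ (W_att @ h)                   # one score per location, (L,)
    a = np.exp(scores - scores.max()); a /= a.sum()   # attention distribution
    z = a @ features                                  # weighted sum over all locations, (D,)
    return z, a

z, a = soft_attention(np.random.randn(H), features)

# Hard attention would instead pick exactly one location, e.g.:
# idx = np.random.choice(L, p=a); z = features[idx]
```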

LSTM

  • vanilla RNN gradient flow
    • backpropagating through many time steps multiplies the gradient by the same weight matrix over and over, so gradients explode or vanish
    • exploding gradients can be handled with gradient clipping, which rescales the gradient when its norm gets too large (see the clipping sketch after this list), but vanishing gradients can only be addressed by changing the RNN architecture
  • LSTM adds a cell state \(c_t\) alongside the hidden state, updated via four gates: \(c_t = f \odot c_{t-1} + i \odot g\), \(h_t = o \odot \tanh(c_t)\) (see the step sketch after this list)
    • g: determines how much of the input will be written into the cell
  • backpropagation from \(c_t\) to \(c_{t-1}\) requires only an elementwise multiplication by f, with no matrix multiply by W, so the gradient flows through the cell state uninterrupted
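
A sketch of the gradient clipping mentioned above: rescale the gradient whenever its norm exceeds a threshold (the threshold value is an arbitrary assumption):

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    """Control exploding gradients by capping the gradient's L2 norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)  # keep the direction, shrink the magnitude
    return grad
```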
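
And a numpy sketch of a single LSTM step with the four gates; the stacked weight matrix and the sizes are illustrative assumptions:

```python
import numpy as np

H, D = 8, 4
W = np.random.randn(4 * H, H + D) * 0.01   # stacked weights for i, f, o, g

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def lstm_step(x, h_prev, c_prev):
    """One LSTM step: four gates from one matrix multiply, then cell update."""
    a = W @ np.concatenate([h_prev, x])
    i = sigmoid(a[0 * H : 1 * H])  # input gate: whether to write to the cell
    f = sigmoid(a[1 * H : 2 * H])  # forget gate: whether to erase the cell
    o = sigmoid(a[2 * H : 3 * H])  # output gate: how much of the cell to reveal
    g = np.tanh(a[3 * H : 4 * H])  # gate gate: what to write to the cell
    c = f * c_prev + i * g         # elementwise: the c_t -> c_{t-1} path
    h = o * np.tanh(c)             # involves no matrix multiply by W
    return h, c

h, c = lstm_step(np.eye(D)[0], np.zeros(H), np.zeros(H))
```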



I wrote this after taking CS231n (Spring 2017), offered by Stanford University. If you have questions, you can leave a reply on this post.