CPU vs GPU

  • CUDA: it is hard to write performant CUDA code yourself, so use existing optimized libraries (e.g. cuDNN)

Deep Learning Frameworks

  • comparing Numpy, TensorFlow, and PyTorch on the same computation:
    • Numpy
      • can’t run on GPU
      • the gradient of each expression must be derived and written out by hand (see the sketch below)
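A minimal sketch (not the slides' exact code) of this workflow: a two-layer ReLU network trained on random data, with every gradient derived by hand. The sizes N, D, H are arbitrary.

```python
import numpy as np

N, D, H = 64, 1000, 100
x, y = np.random.randn(N, D), np.random.randn(N, D)
w1, w2 = np.random.randn(D, H), np.random.randn(H, D)

for t in range(50):
    # forward pass
    h = np.maximum(0, x.dot(w1))        # ReLU hidden layer
    y_pred = h.dot(w2)
    loss = np.square(y_pred - y).sum()  # L2 loss

    # backward pass: every gradient is written out by hand
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_h[h <= 0] = 0                  # backprop through ReLU
    grad_w1 = x.T.dot(grad_h)

    # gradient descent step
    w1 -= 1e-6 * grad_w1
    w2 -= 1e-6 * grad_w2
```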
    • TensorFlow
      • create the forward computational graph
      • ask TensorFlow to compute gradients
      • tell TensorFlow to run on CPU or GPU (sketch below)
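A sketch of the same network in the TensorFlow 1.x style of the lecture era (tf.placeholder and Session no longer exist in TensorFlow 2): build the graph first, then run it inside a session.

```python
import numpy as np
import tensorflow as tf

N, D, H = 64, 1000, 100
x = tf.placeholder(tf.float32, shape=(N, D))
y = tf.placeholder(tf.float32, shape=(N, D))
w1 = tf.placeholder(tf.float32, shape=(D, H))
w2 = tf.placeholder(tf.float32, shape=(H, D))

# forward pass: this only builds the graph, no computation happens yet
h = tf.maximum(tf.matmul(x, w1), 0)
y_pred = tf.matmul(h, w2)
loss = tf.reduce_sum((y_pred - y) ** 2)

# ask TensorFlow to add gradient nodes to the graph
grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])

with tf.Session() as sess:
    values = {x: np.random.randn(N, D), y: np.random.randn(N, D),
              w1: np.random.randn(D, H), w2: np.random.randn(H, D)}
    # running the graph performs the actual computation
    loss_val, g1, g2 = sess.run([loss, grad_w1, grad_w2], feed_dict=values)
```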
    • PyTorch
      • define Variables to start building a computational graph
      • forward pass looks just like numpy
      • calling c.backward() computes all gradients
      • run on GPU by casting tensors with .cuda() (sketch below)
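A tiny autograd example in the pre-0.4 Variable API the lecture uses (in modern PyTorch, requires_grad lives directly on Tensors); the names a, b, c are arbitrary.

```python
import torch
from torch.autograd import Variable

a = Variable(torch.randn(3, 3), requires_grad=True)
b = Variable(torch.randn(3, 3), requires_grad=True)
c = (a * b).sum()   # forward pass looks just like numpy
c.backward()        # computes all gradients
print(a.grad.data)  # gradient of c with respect to a (equals b here)
```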
  • TensorFlow
    • high-level wrappers such as Keras make the common cases very simple (sketch below):
      • define model object as a sequence of layers
      • define optimizer object
      • build the model, specify loss function
      • train the model with a single line
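A sketch of that four-step workflow, assuming the standalone Keras package of that era; the layer sizes, loss, and random data are arbitrary choices.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.optimizers import SGD

N, D, H = 64, 1000, 100

# define model object as a sequence of layers
model = Sequential()
model.add(Dense(H, input_shape=(D,)))
model.add(Activation('relu'))
model.add(Dense(D))

# define optimizer object; build the model and specify the loss function
model.compile(loss='mean_squared_error', optimizer=SGD(lr=1e-4))

# train the model with a single line
x, y = np.random.randn(N, D), np.random.randn(N, D)
model.fit(x, y, epochs=50, batch_size=N, verbose=0)
```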
  • Theano
    • earlier framework, from the University of Montreal
    • typical workflow (sketch below):
      • define symbolic variables (similar to TensorFlow placeholder)
      • forward pass: compute predictions and loss (no computation performed yet)
      • ask Theano to compute gradients for us (no computation performed yet)
      • compile a function that computes loss, scores, and gradients from data and weights
      • run the compiled function many times to train the network
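A sketch of that Theano workflow with arbitrary sizes; T.grad builds symbolic gradients and theano.function compiles the graph into something runnable.

```python
import numpy as np
import theano
import theano.tensor as T

N, D, H = 64, 1000, 100

# define symbolic variables (similar to TensorFlow placeholders)
x, y = T.matrix('x'), T.matrix('y')
w1, w2 = T.matrix('w1'), T.matrix('w2')

# forward pass: compute predictions and loss (no computation performed yet)
h = T.maximum(x.dot(w1), 0)
y_pred = h.dot(w2)
loss = T.sqr(y_pred - y).sum()

# ask Theano to compute gradients for us (still no computation)
dw1, dw2 = T.grad(loss, [w1, w2])

# compile a function mapping data and weights to loss and gradients
f = theano.function(inputs=[x, y, w1, w2], outputs=[loss, dw1, dw2])

# run the compiled function many times to train the network
xx, yy = np.random.randn(N, D), np.random.randn(N, D)
ww1, ww2 = np.random.randn(D, H), np.random.randn(H, D)
for t in range(50):
    loss_val, g1, g2 = f(xx, yy, ww1, ww2)
    ww1 -= 1e-6 * g1
    ww2 -= 1e-6 * g2
```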
  • PyTorch
    • three levels of abstraction
      • Tensor: Imperative ndarray, but runs on GPU
      • Variable: Node in a computational graph; stores data and gradient
      • Module: A neural network layer; may store state or learnable weights
    • Tensors
      • example: train a two-layer network on random data (sketch below)
        • to run on GPU, just cast tensors to a cuda datatype (dtype)
        • create random tensors for data and weights
        • forward pass: compute predictions and loss
        • backward pass: manually compute gradients
        • gradient descent step on weights
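The Tensor-level walkthrough above as a sketch, in the pre-0.4 style the lecture uses: essentially the Numpy code, but the tensors can live on the GPU.

```python
import torch

dtype = torch.FloatTensor  # to run on GPU: torch.cuda.FloatTensor
N, D, H = 64, 1000, 100

# create random tensors for data and weights
x = torch.randn(N, D).type(dtype)
y = torch.randn(N, D).type(dtype)
w1 = torch.randn(D, H).type(dtype)
w2 = torch.randn(H, D).type(dtype)

for t in range(50):
    # forward pass: compute predictions and loss
    h = x.mm(w1).clamp(min=0)
    y_pred = h.mm(w2)
    loss = (y_pred - y).pow(2).sum()

    # backward pass: manually compute gradients
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h.t().mm(grad_y_pred)
    grad_h = grad_y_pred.mm(w2.t())
    grad_h[h <= 0] = 0                 # backprop through ReLU
    grad_w1 = x.t().mm(grad_h)

    # gradient descent step on weights
    w1 -= 1e-6 * grad_w1
    w2 -= 1e-6 * grad_w2
```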
    • Autograd
      • example: the same network using Autograd (sketch below)
        • x.data is a Tensor
        • x.grad is a Variable of gradients (same shape as x.data)
        • x.grad.data is a Tensor of gradients
        • Variables remember how they were created (for backprop)
        • forward pass looks exactly the same as the Tensor version, but everything is a Variable now
        • compute gradient of loss with respect to w1 and w2 (zero out grads first)
        • make gradient step on weights
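A sketch of the Autograd version: the same network, but PyTorch records the graph during the forward pass and loss.backward() computes all gradients. Pre-0.4 Variable API again.

```python
import torch
from torch.autograd import Variable

N, D, H = 64, 1000, 100

# data does not need gradients; weights do
x = Variable(torch.randn(N, D), requires_grad=False)
y = Variable(torch.randn(N, D), requires_grad=False)
w1 = Variable(torch.randn(D, H), requires_grad=True)
w2 = Variable(torch.randn(H, D), requires_grad=True)

learning_rate = 1e-6
for t in range(50):
    # forward pass looks exactly like the Tensor version
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    loss = (y_pred - y).pow(2).sum()

    # zero out gradients before the backward pass
    if w1.grad is not None:
        w1.grad.data.zero_()
        w2.grad.data.zero_()

    # compute gradient of loss with respect to w1 and w2
    loss.backward()

    # step on .data so the update itself stays outside the graph
    w1.data -= learning_rate * w1.grad.data
    w2.data -= learning_rate * w2.grad.data
```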
    • nn
      • higher-level wrapper
      • example: the same network using nn (sketch below)
        • define our model as a sequence of layers
        • defines common loss functions
        • forward pass: feed data to model, and prediction to loss function
        • backward pass: compute all gradients
        • make gradient step on each model parameter
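A sketch of the nn version: the model is a Sequential stack of layers, the loss is a built-in module, and the update loops over model.parameters(). Lecture-era API (size_average later became reduction='sum').

```python
import torch
from torch.autograd import Variable

N, D, H = 64, 1000, 100
x = Variable(torch.randn(N, D))
y = Variable(torch.randn(N, D), requires_grad=False)

# define our model as a sequence of layers
model = torch.nn.Sequential(
    torch.nn.Linear(D, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D))

# nn also defines common loss functions
loss_fn = torch.nn.MSELoss(size_average=False)

learning_rate = 1e-4
for t in range(50):
    # forward pass: feed data to model, and prediction to loss function
    y_pred = model(x)
    loss = loss_fn(y_pred, y)

    # backward pass: compute all gradients
    model.zero_grad()
    loss.backward()

    # make gradient step on each model parameter
    for param in model.parameters():
        param.data -= learning_rate * param.grad.data
```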
    • DataLoader
      • wraps a Dataset and provides minibatching, shuffling, multithreading
      • when you need to load custom data, just write your own Dataset class (sketch below)
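A minimal custom Dataset wrapped in a DataLoader; RandomDataset is a made-up stand-in for real loading logic.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class RandomDataset(Dataset):
    """Hypothetical Dataset: implement __len__ and __getitem__ for your data."""
    def __init__(self, n):
        self.x = torch.randn(n, 10)
        self.y = torch.randn(n, 1)

    def __len__(self):
        return self.x.size(0)

    def __getitem__(self, i):
        return self.x[i], self.y[i]

# DataLoader provides minibatching, shuffling, and multi-worker loading
loader = DataLoader(RandomDataset(256), batch_size=32, shuffle=True, num_workers=2)
for x_batch, y_batch in loader:
    pass  # training step goes here
```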
  • Static vs Dynamic Graphs
    • TensorFlow builds a static graph once and then runs it many times; PyTorch builds a new dynamic graph on every forward pass
    • Static
      • once graph is built, can serialize it and run it without the code that built the graph
      • good for reuse, since the graph structure is stored separately from the code
    • Dynamic
      • graph building and execution are intertwined, so always need to keep code around
      • bad for reuse
      • conditionals become ordinary Python control flow, since the graph is rebuilt on every pass (a static graph needs special ops such as tf.cond); see the sketch after this list
      • computation whose size or shape changes on every input can be expressed simply
        • recurrent networks, recursive networks, modular networks
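A sketch of why control flow is easy with dynamic graphs: because the graph is rebuilt each iteration, a plain Python conditional picks the branch, and backprop follows whichever path actually ran.

```python
import random
import torch
from torch.autograd import Variable

x = Variable(torch.randn(5, 5))
w1 = Variable(torch.randn(5, 5), requires_grad=True)
w2 = Variable(torch.randn(5, 5), requires_grad=True)

# ordinary Python control flow decides the graph structure at run time;
# a static framework would need a special graph op for this
if random.random() > 0.5:
    y = x.mm(w1)
else:
    y = x.mm(w2)

y.sum().backward()  # backprop through whichever branch actually ran
```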
  • Caffe / Caffe2
    • Core written in C++
    • Has Python and MATLAB bindings
    • Good for training or finetuning feedforward classification models
    • Often no need to write code
      1. Convert data (run a script)
      2. Define net (edit prototxt): define a model
      3. Define solver (edit prototxt): define a SolverParameter
      4. Train (with pretrained weights) (run a script)
    • Not used as much in research anymore, still popular for deploying models
    • Caffe Model Zoo: has pre-trained models
    • Pros / Cons
      • (+) good for feedforward networks
      • (+) good for finetuning existing networks
      • (+) train models without writing any code
      • (+) Python interface is pretty useful
      • (+) can deploy without Python
      • (-) need to write C++ / CUDA for new GPU layers
      • (-) not good for recurrent networks
      • (-) cumbersome for big networks (GoogLeNet, ResNet)
  • Summary
    • TensorFlow is a safe bet for most projects, and the choice when you need to run one graph across many machines
    • PyTorch is good for research



I wrote this after taking CS231n (Spring 2017), offered by Stanford University. If you have questions, you can leave a reply on this post.