Stanford CS231n from Spring 2017


https://www.youtube.com/watch?v=CIfsB_EYsVI&list=PL3FW7Lu3i5JvHM8ljYj-zLfQRF3EO8sYv&index=16
https://cs231n.github.io/
http://cs231n.stanford.edu/


================================================================================
Lecture 1
Introduction - History of Vision
================================================================================
History of vision
In the 60's-80's things were primitive, focused on building models of objects from lines.
In the 90's, this shifted to image segmentation.
In the 90's and 2000's machine learning started to develop. 
In 2001, Viola and Jones built real-time face detection, and within 5 years it shipped
as an application in Fuji cameras.

In 90's-2000's things were based on feature recognition.
Find important features of an object, and then match those to a dictionary of known objects.

Later work looked at "holistic features", overall patterns of the shape.

A lot of work now focuses on object recognition, look at pictures and decide whether it is a 
person or a plane or a car. How far can this go? Can we build a large dictionary of all common 
objects? Some of this is driving large image repositories, like ImageNet (download and organize 
15M images with 22K categories).

In 2012, a Convolutional Neural Net (AlexNet) won the challenge and improved significantly over 
other technologies. This course will focus on CNN's. 
It tries to solve "Image Classification", which looks at an image and then picks from a list of
categories to classify an image. This is what ImageNet did.
It is a versatile problem, but classifies the entire image.

The class will also work on object detection, which finds an object from within a larger image.
This will use Convolutional Neural Networks (CNN's)

CNN's were developed earlier, in 1998 (LeCun et al at Bell Labs) for recognizing hand-written
letters for checks.
They became popular in 2012 with AlexNet (Krizhevsky et al)

A few things changed since 1998 that combined to make the big breakthrough in 2012
- Computation speed, particularly GPU's
- Much larger data sets

This is a story of scale - changing the size of a thing will eventually change its nature.
Quantitative changes have qualitative effects.

The future is semantic segmentation, activity recognition, 3D modelling, and more.

Extract a group of objects from an image, then find connections between those objects.
This can also feedback and impose constraints on the image processing.
This is where digital path can go.
Build a semantic layer on top of the vision recognition system.
    Extract higher level features
    Find relationships between the objects we extract from the image
    Feed back to the image analysis (create context and expectations).


================================================================================
Lecture 2
Nearest Neighbors and Linear Classification
================================================================================

Image Classification Pipeline
You process an image, and categorize the entire image.
Typically, there are a fixed set of category labels, but there may be thousands of different labels.
Moreover, the camera may have different viewpoints, lighting, zoom and more.
The object may be partially occluded, and may have different poses.
There may also be background clutter that may resemble the object.

The classifier will not have hard-coded rules. Instead, it will be data-driven and
built on a large library of training images.
So, there is a train(image) module, and then a predict(image) module.

This data-driven approach relies on good training data. A simple example for vision is
the CIFAR-10 dataset, a group of images from 10 different classes.

Nearest Neighbor classifier
=============================
This is the simplest classifier

Train(image)
    Record every training Image
This is O(1)

Predict(image)
    Compare the new image to every training image, and select the closest match
    Return the category of the closest match image
This is O(n) for each example

So, how do we compare 2 images?
L1 distance (Manhattan distance)
================================
Distance = SUM[for all pixels](|image1 - image2|)
This is the sum of absolute value of the difference between pixels.
This is fast at train and slow at predict, which is backwards.
In practice, we want a classifier that is slow at training but fast at prediction.
This depends on the coordinate system, so it is affected by rotation.


K-Nearest Neighbor classifier
=============================
This is an improvement on nearest neighbor.
Find the K nearest neighbors, and each is a vote for one category. Return the category 
with the most votes.

K=5 typically works better than K=1. This smooths out the decision boundaries between the different categories.
There may be cases where there is no clear winner (each category gets 1 vote) so randomly
pick a winner.


Alternatively, we can use a different metric for comparing images.
L2 (Euclidean) Distance
========================
We can still do a pixel-by-pixel comparison but now compute 
    Distance = SQRT(SUM[for all pixels]((image1 - image2)**2))
Sqrt of the sum of the squares.
This does not depend on the coordinate system, so it is NOT affected by rotation.
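
Here is a minimal sketch of a nearest-neighbor / k-nearest-neighbor classifier using the L1 or L2
distance. The flattened-image array shapes and integer labels are illustrative assumptions, not the
actual assignment code.

import numpy as np

class NearestNeighbor:
    def train(self, X, y):
        # X: (N, D) array of flattened training images, y: (N,) array of integer labels.
        # "Training" just memorizes the data, so it is O(1).
        self.X_train = X
        self.y_train = y

    def predict(self, x, k=1, metric='L1'):
        # x: (D,) flattened test image. Compare against every training image: O(N).
        if metric == 'L1':
            dists = np.sum(np.abs(self.X_train - x), axis=1)
        else:  # L2
            dists = np.sqrt(np.sum((self.X_train - x) ** 2, axis=1))
        # Take the k closest training images and let them vote on the label.
        nearest = np.argsort(dists)[:k]
        votes = self.y_train[nearest]
        return np.bincount(votes).argmax()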


You can use K-nearest neighbor for non-image data, like text or any other data.

Hyperparameters
These are not learned from the training data, instead they are chosen by the programmer
- What is the best image comparison algorithm
- What is the best value of k


Testing the Algorithm
=====================
Bad Idea #1
Generally, do NOT pick the hyperparameters that work the best on the training data.
You run the risk of overfitting the training data.

Bad Idea #2
Split the data into 2 groups: training, and test data.
Train each algorithm on the training set, and then run on the test data.
Compare the performance of each algorithm on the test data.

Reasonable Approach
Split the data into 3 groups: training, validation and test data.
Train each algorithm on the training set, and then run on the validation data.
Compare the performance of each algorithm on the validation data.
Pick one final algorithm.
Then, run just that final algorithm on the test set.
There needs to be a separation of the validation and test data.
The test data will be used for the final results reported in any paper.

Cross Validation
This is the gold standard, but is not used too much in practice.
First, split the data into 2 groups: training, and test data.
Next, split the training into N groups called Folds.
For example, there may be k folds: fold1, fold2, ..., foldK
For each possible parameter:
    Train on K-1 folds, and then validate on the remaining fold.
Each algorithm will use a different fold for validation.

Note, in all cases, the validation and test data still have known correct labels.
This means we can measure how the algorithm does on these.
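
Here is a small sketch of k-fold cross-validation for choosing the hyperparameter k, reusing the
NearestNeighbor sketch above; the fold count and the accuracy bookkeeping are illustrative assumptions.

import numpy as np

def cross_validate(X, y, k_choices, num_folds=5):
    # Split the training data into num_folds folds.
    X_folds = np.array_split(X, num_folds)
    y_folds = np.array_split(y, num_folds)

    results = {}
    for k in k_choices:
        accuracies = []
        for i in range(num_folds):
            # Hold out fold i for validation, train on the rest.
            X_val, y_val = X_folds[i], y_folds[i]
            X_tr = np.concatenate(X_folds[:i] + X_folds[i+1:])
            y_tr = np.concatenate(y_folds[:i] + y_folds[i+1:])

            clf = NearestNeighbor()
            clf.train(X_tr, y_tr)
            preds = np.array([clf.predict(x, k=k) for x in X_val])
            accuracies.append(np.mean(preds == y_val))
        results[k] = np.mean(accuracies)
    # Pick the k with the best average validation accuracy.
    return max(results, key=results.get)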

Nearest-Neighbors ASSUMES that the training data will densely cover the problem space.
Otherwise, you will not learn about some regions of the problem space.



Linear Classification
=====================
This is an important step toward neural networks.

Each image is an array of WxHxdepth values
Compute a function f(X,W) where
    x is the input image
    W are some weights, that are tuned parameters
The output is a group of K numbers, each giving a likelihood score for a different "type" or class
of the image.
Then, take the classification with the highest score.

Note, K-Nearest-Neighbors kept the entire training set.
A linear classifier will only keep the parameter weights after it has been trained.

How do we define f(X,W)?
One simple approach is multiplication:
    Class Scores = f(X,W) = Wx + b

    The input image is serialized into a long column vector.
        For example, a 32x32 image of 3-channel RGB will have 3072 8-bit numbers where each
            RGB is flattened into 3 separate numbers.
        Call the size Z, where Z = imageWidth * imageHeight * #colorChannelsPerPixel
        The vector is Z x 1

    The multiplier is a C x Z matrix
        Z is the size of an image vector. The image vector has the same length as a row of the matrix
        C is the number of different classes. There is one row for each class.

    b may be a bias offset, which can bias one class over the others.
    This is a Cx1 vector. For example, it may be a pre-test probability for Bayesian reasoning.

    The product is a Cx1 vector, giving a score for each class
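
    A minimal sketch of this score computation with the shapes described above (the CIFAR-10-style
    sizes and the random weights are just placeholders):

    import numpy as np

    C, Z = 10, 3072                    # 10 classes, 32*32*3 = 3072 pixel values
    x = np.random.rand(Z, 1)           # input image, flattened into a Z x 1 column vector
    W = np.random.randn(C, Z) * 0.01   # one row (template) per class
    b = np.zeros((C, 1))               # bias: C x 1, can favor one class over another

    scores = W.dot(x) + b              # C x 1 vector of class scores
    predicted_class = int(np.argmax(scores))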

Each row of the matrix is like a template for the class. Multiplying the image vector
by one row calculates how well the image aligns with the template.
The numeric value allows some pixels to be more important in the template than others.

If each row of the matrix is a template, then it is just a vector representation of an image.
We can reverse the mapping, and convert each matrix row into the template image, and this shows
the templates.

The classifier makes one image template for an entire category.
So, this may be an amalgam of different perspectives.
Neural networks will solve this problem.


A linear classifier will try to fit a line through the problem space to separate one class
of images from others. It creates a decision boundary in the problem space.
This fails if the problem space does not have a simple linear separation between classes.



================================================================================
Lecture 3
Loss Functions and Optimization
================================================================================

Let's return to Linear Classifiers.
    f(x, W) = W * x
Note, the output is a vector of scores.
This is the default, and works well when we are doing classification. The vector has
one entry for each possible class, and a higher score means the classifier considers the
image more likely to belong to that class.


How do we use the training data to make the matrix W values?

We define a loss function, that measures the difference between a predicted class and the actual class.
We use this loss function to create a feedback, it tells us how to adjust the weight matrix W 
on each training example.

A loss function takes the training data and the correct answers.
    (Xi, yi)
Where x is an input image (a Z x 1 vector) and the yi is an integer label.
Note, y is a single number, while the recognizer output is a vector of scores for each possible
category, with highest score at the yth row meaning this is category y.
We use si to denote the ith score in the output vector.
So if the result is category 5, then s5 is the largest entry in the vector.

The loss over a training set is simply the sum of the loss for each training value
    L = 1/N SUM[i=1..N]( L( f(xi, W), yi ) )
Where N is the number of examples in the training set


Support Vector Machine loss
===========================
An example loss function is the multi-class Support Vector Machine loss.

Get the score given to the correct class. Call this Sy because we assume it is for class y.
Initialize TotalLossForThisTrainingExample = 0
Look at each score in the output vector
    Call this score Sj
    If this is an incorrect class, then compute the difference between this specific incorrect
        score and the correct score.
    ScoreDifference = Sj - Sy

    Compare ScoreDifference to some safety margin, which is 1
        1 is a somewhat arbitrary choice.
        We could scale it up and down, but that does not change the effect.
    if (ScoreDifference <= -1) then do nothing (the correct score wins by at least the margin).
    if (ScoreDifference > -1) then add (ScoreDifference + 1) to TotalLossForThisTrainingExample

    Optionally, you may divide TotalLossForThisTrainingExample by a constant (such as the number
        of classes) to report an average instead of a sum.
    However, this step is not required; rescaling the loss by a constant does not change which W
        minimizes it, so it will not change the ultimate error-feedback.
    

Now, repeat for all training examples, and add TotalLossForThisTrainingExample to 
    the total TotalLossForAllExamples

The min loss is 0, but the max loss is +infinity
If we initialize the matrix to be all 0's, then we get a loss of 1 for each wrong class so the loss
is #classes - 1 for each training example. But, this is only true if the matrix is all 0's.
We may often initialize it to all 0's so we can expect this on the first iteration.
This is nice for debugging.

Here is some numPy code for this:
import numpy as np

def L_I_vectorized(x, y, W):
    # x is the image as a column vector, y is the index of the correct class,
    # W is the weight matrix with one row of template weights per class.
    scores = W.dot(x)
    # Hinge loss: margin of each class score over the correct score, clamped at 0
    margins = np.maximum(0, scores - scores[y] + 1)
    # The correct class should not contribute to the loss
    margins[y] = 0
    loss_i = np.sum(margins)
    return loss_i


Note, there are many W's that can have 0 loss, because they can all produce score differences
that exceed the safety margin.
For example, take a W with 0 loss and scale it by a constant, like 2. All correct scores still
beat the incorrect scores by at least the safety margin. So, 2W also has 0 loss.

We don't need an exact loss metric, because we do not want to overfit the training data.
So, this intentional sloppiness is useful because it helps avoid overfitting.
We want a recognizer that will work well on the test data, and not be a perfect fit for 
the training data.

We solve overfitting with "regularization".
    Loss(W) = 1/N * SUM[i=1..N](Loss(f(X^i, W), y^i)) + lambda*R(W)
R(W) will favor the simplest model, so it is a penalty for large W
Lambda is a hyperparameter - it is chosen at algorithm design time.

Some examples of R(W)
- L2: R(W) = SUM[k]SUM[l] ((W[k][l])**2)
- L1: R(W) = SUM[k]SUM[l] (|W[k][l]|)
- Elastic (L1 + L2): SUM[k]SUM[l] ((W[k][l])**2 + |W[k][l]|)
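
A small numpy sketch of these three penalties (the elastic-net weighting is often tuned with an
extra hyperparameter, omitted here to match the formula above):

import numpy as np

def l2_reg(W):
    # Sum of squares of all weights
    return np.sum(W * W)

def l1_reg(W):
    # Sum of absolute values of all weights
    return np.sum(np.abs(W))

def elastic_reg(W):
    # Elastic net: L2 plus L1
    return l2_reg(W) + l1_reg(W)

# total loss = data loss + lambda * R(W), where lambda is a hyperparameter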



Multinomial Logistic Regression (or Softmax)
=============================================
Multinomial Logistic Regression (or Softmax) is a different loss function that can be used.
This is more common.

We interpret the scores as the (unnormalized) log probabilities of each particular class.
So, we can use the scores to compute a probability distribution over the classes.

The probability of each class K is P(Y = K) = (e**S[k]) / SUM[j](e**S[j])
This is the softmax.
The probability of each class is between 0 and 1 and the sum of probabilities for all classes is 1

We then compare this calculated probability distribution to the target probability distribution
The target probability distribution has 1 at the correct class and 0 for all other classes.
    L = -log(P(Y = yi | Xi))
We want the probability of the correct class to be high.
The loss is the -log of the probability of the true class. Loss will be 0 when the probability
of the true class is 1.
Note, we really do not care about the specific values of the incorrect probabilities.
All probabilities must add to 1, so we only care how much of this is allocated to the correct class.

The loss function measures "badness" not "goodness".
log(P(Y = yi)) increases as the correct class is assigned more probability.
So, we take the negative of this, to measure "badness": it increases as
less probability is allocated to the correct answer.
    Loss = -log(P(Y = yi | Xi))

Min loss is 0, and max loss is infinity.
But, both of these extremes would require infinitely large scores.
If the matrix initializes at all 0's, then the initial loss is log(#classes)
This is nice for debugging.
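
A minimal sketch of the softmax loss for a single example, in the same style as the SVM code above
(the max-subtraction is a standard numerical-stability trick, not something from the notes):

import numpy as np

def softmax_loss_i(x, y, W):
    # x: flattened image as a 1-D array, y: index of the correct class, W: weight matrix
    scores = W.dot(x)
    # Shift by the max score for numerical stability; this does not change the probabilities.
    scores = scores - np.max(scores)
    probs = np.exp(scores) / np.sum(np.exp(scores))
    # Loss is the negative log probability of the correct class.
    return float(-np.log(probs[y]))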


Optimization
============
This uses the loss function to update the matrix W
This is a search problem, through the space of possible W matrices.
We want to minimize the loss.
In practice, we do not try to do a closed form solution. Instead, we use iterative methods.

We start with an iterative, hill-climbing style algorithm called Gradient Descent (here we descend, to minimize the loss).
This is commonly used.

The gradient is the partial derivative of the Loss with respect to each input
We use the gradient of L with respect to W
    grad L (written as "del-w L")
*** Note, many of the examples below will use the notation df/dx where f is some function
for a node. That is correct when evaluating a single node. But, this output is really the loss.

The inputs may be vectors, and this is explained in the next lecture.
But, the key is you can have a single variable that is a very large vector (like a flattened image bitmap).
You do not need separate variables for each value in a matrix or vector.

This is a search algorithm.
Recall that these partial derivative vectors on each dimension can add to a vector that points
in the overall direction of the slope. The relative magnitudes of each component vector (the gradient along
each dimension) will determine how much the final vector points in that dimension.
The slope in any one particular dimension is the dot product of the gradient vector with a unit 
vector on that dimension.

The gradient is the direction that maximizes the function.
The negative gradient is the direction to minimize the function.

A coarse estimate of the gradient is the numerical (finite-difference) slope.
    df/dx = lim[h-->0] (f(x + h) - f(x)) / h
In practice, we could compute the loss for W and for (W + h), where h is a small change in one entry of W.
Then, compute the difference in the loss function
    dL/dW[i] = (Loss(W with entry i nudged by h) - Loss(W)) / h
Repeating this for every entry gives the full gradient, which has the same shape as W.
The update step then moves W in the negative gradient direction (see Gradient Descent below).
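
A minimal numerical-gradient sketch along these lines; loss_fn is assumed to be a function that
takes a weight matrix and returns the scalar loss:

import numpy as np

def numerical_gradient(loss_fn, W, h=1e-5):
    # Perturb one entry of W at a time; slow, but useful as a check on the analytic gradient.
    grad = np.zeros_like(W)
    base_loss = loss_fn(W)
    it = np.nditer(W, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old = W[idx]
        W[idx] = old + h                         # nudge one entry
        grad[idx] = (loss_fn(W) - base_loss) / h  # finite-difference slope
        W[idx] = old                             # restore
        it.iternext()
    return grad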


Gradient Descent
================
We compute the gradient of the loss with respect to each entry of W

while True:
    deltaW = EvaluateGradient(Loss, X, W)
    W = W - (stepSize * deltaW)

The stepSize is a hyper-parameter, and is one of the most important parameters set during
algorithm design.


Stochastic Gradient Descent
===========================
In general:
    grad-W Loss = 1/N * (SUM[i=1..N] (grad-W L(x^i, y^i, W)))
                  + (lambda * grad-W R(W))

For efficiency, we update W by just examining a subset of the training samples.
We randomly select some subset, like 32, 64, 128 or 256 elements.

while True:
    # Take 256 training samples
    dataBatch = sample_training_data(allData, 256)

    deltaW = EvaluateGradient(Loss, dataBatch, W)
    W = W - (stepSize * deltaW)


Image Features
==============
Feeding raw pixel values into a linear classifier has problems.
Instead, we can pre-process the image to convert it into an alternate description
Then, feed this processed feature description into the linear classifier.

Some example conversions:
    Convert pixel colors from cartesian to polar coordinates

Or change representation entirely:
    Color histogram
        (an array where the ith element is the number of pixels with color #i)

    Gradient direction histogram
        (an array where the ith element is the number of 8x8 pixel blocks with an edge with orientation #i)
        This is like horizontal, vertical, SW->NE, NW->SE and so on.

    Word histogram
        (for text data: an array where the ith element is the number of occurrences of word #i)

    Image Block Histogram
        Define a "dictionary" of small pixel block configurations (solids, line, etc)
        (an array where the ith element is the number of 8x8 pixel blocks with pattern #i)
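
As one concrete example, a minimal sketch of a color (hue) histogram feature; the bin count and
the use of a precomputed hue channel are illustrative assumptions:

import numpy as np

def color_histogram(hue, num_bins=16):
    # hue: 2-D array of per-pixel hue values in [0, 1).
    # Returns an array where the ith element counts the pixels falling in hue bin i.
    bins = np.linspace(0.0, 1.0, num_bins + 1)
    hist, _ = np.histogram(hue.ravel(), bins=bins)
    return hist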

In practice, convolutional neural networks will discover features at runtime.


            

================================================================================
Lecture 4
Backpropagation and Neural Networks
================================================================================

We will use a graphical notation (computation graph) which is a bit like a flowchart, or
really like a dataflow diagram. Data travels along edges, and the nodes represent functions
like multiplier, calculate loss, compute R, etc.

The computational graph is useful for calculating backpropagation
This is more than notation for this lecture, large libraries like Caffe seem to use this
and it is a standard way of structuring the problem.


Back Propagation
================
We want to see how the error/loss changes with each input.
    Here we use "Loss Function" to describe the error of the computation
Back Propagation computes the partial derivatives of the loss function in terms of the original inputs
to the network.

Back Propagation is basically a recurrent application of the Chain Rule for derivatives.
Consider a simple function
    f(x,y,z) = (x + y) * z

We want to compute the error with respect to any of the variables.
First, represent this as a computational graph

     --------
x---\|      |
     | Add  |- Q ----------+   --------
y---/|      |              |---|      |
     --------                  | Mult |- f ------>
                           |---|      |
z--------------------------+   --------

We have some intermediate nodes
    Q = x + y
    f = q * z

And, we can compute partial derivatives:
    dQ/dx = 1
    dQ/dy = 1

    df/dq = z
    df/dz = q

However, we don't want Q in the final result, we want the derivative in terms of
the inputs. How do we compute df/dx and df/dy? Use the chain rule
    df/dy = df/dq * dq/dy
    df/dx = df/dq * dq/dx

We will use this to compute the derivative of the loss vector in terms of the
original inputs. 
    dL/dx, dL/dy, dL/dz
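
A small worked example of this forward/backward pass for f(x,y,z) = (x + y) * z, using the values
x = -2, y = 5, z = -4:

# Forward pass
x, y, z = -2.0, 5.0, -4.0
q = x + y          # q = 3
f = q * z          # f = -12

# Backward pass (chain rule), starting from df/df = 1
df_df = 1.0
df_dq = z * df_df          # multiply gate: gradient = value of the other input = -4
df_dz = q * df_df          # = 3
df_dx = 1.0 * df_dq        # add gate passes the gradient through unchanged: -4
df_dy = 1.0 * df_dq        # -4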

The algorithm is
- Label all internal values with new variables

- Propagate the inputs forward to the final output node
    This assigns a numeric value to each intermediate variable on the graph

- Select the final output node as the current node

- Define the gradient of the output in terms of the output of the final node
        This is trivial, df/df = 1
    *** Note, this assumes that the output of the final node is the loss.
    Here we are computing dLoss/dLoss = 1. 

- Repeat until you get to the original input variables
    - Start with the upstream gradient coming down.
        This is the derivative of the Loss function in terms of the output of the current node.
        This is a numeric value (but may be a scalar or vector or matrix)

    - Compute the partial derivative of the loss function in terms of EACH of the inputs to 
            the current node
        Each node is a simple operation, like + or *, so this should be either a constant or a 
            simple expression in terms of its input
        This will create several expressions
            dL/dIn1, dL/dIn2, etc

    - Rewrite the derivative of the loss function in terms of the inputs
        This uses the chain rule: df/dy = df/dq * dq/dy

    - Plug in the numeric values:
        dL/dOut takes the derivative value we computed on the previous stage
        df/dx, df/dy, etc. take numeric inputs from original forward propagation
        Together, these compute a numeric value for each of dL/dIn1, dL/dIn2, etc 

    - Propagate the numeric values for the partial derivatives back to the output
        of the nodes that computed those input values

    - Repeat this loop at the next stage back in the graph, so iterate on the nodes that computed
        the input values

At each node, we only compute the local gradient.
*** This is where some of the confusion in the lecture comes from. She talks about df/dx, which is true.
*** But, the f() here is the Loss, not the predicted value.

Some useful derivatives
    d(e**x)/dx = e**x
    d(kx)/dx = k
    d(1/x)/dx = -1/x**2
    d(k + x)/dx = 1

We can define any math functions for nodes:
    Basic Arithmetic: +, -, *, /
    Sigmoid: 1/(1 + e**(-x))
        Written as sigma(x)
        dsigma(x)/dx = (1 - sigma(x))*(sigma(x))

A few patterns:
    An addition node will pass the output gradient to both inputs
    A multiplication node will scale the output gradient by the value of the other branch
    A MAX node will pass the output gradient to the larger input and 0 gradient to the smaller input
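
A small sketch of these three local-gradient patterns for scalar gates, where dout stands for the
upstream gradient dL/dout:

def add_backward(dout):
    # Add gate: distributes the upstream gradient unchanged to both inputs
    return dout, dout

def mul_backward(x, y, dout):
    # Multiply gate: scales the upstream gradient by the value of the other input
    return y * dout, x * dout

def max_backward(x, y, dout):
    # Max gate: routes the upstream gradient to the larger input, 0 to the other
    return (dout, 0.0) if x >= y else (0.0, dout)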


Vectors
=======
We can now apply this to vectors. The algorithm and logic is the same. 
Now, x and y are input vectors and Z is an output vector
    df/dx is a Jacobian matrix, so it is matrix of derivatives,
        The derivative of each element of Z with respect to the each element of X
        The ith row of the matrix is the derivative of the ith element of the output variable with respect
            to each element in the input vector
        The jth column of the matrix is the derivative of the output vector with respect to the jth element
            in the input vector
    A 4096-element output vector with a 4096-element input vector gives a 4096 x 4096 Jacobian matrix
        (and with a minibatch of 100 inputs processed together, effectively 409,600 x 409,600)
    
The Jacobian matrices can become impractically large.
However, these are very sparse (mainly diagonal). So, there are workarounds.


Example
f(x, W) = ||W * x||**2 = SUM[i=1..n](((W * x)[i])**2)
We can write this as a graph:
    q = W * x
    f(x,W) = ||q||**2 = SUM[i=1..n](q[i]**2)


Summation Nodes
---------------
Some nodes will take a vector as an input and compute a scalar result, like ||x|| computes the length
of the vector. 
    f(q) = ||q||**2 = q1**2 + q2**2 + .... + qn**2

The gradients always have the same dimension as the argument, so the output gradient may be
a scalar while the input gradient may be a vector.
The input is a vector, so the input gradient is a vector.

Compute the gradient of this node in terms of one entry in the vector
    df/dq[i] = 2*q[i]
Now, apply this operator to each element in the input vector.


Matrix Multiplication Nodes
---------------------------
Suppose q = Wx where q and x are vectors and W is a matrix
    dq[1]/dW[1][1] = x[1]
More generally:
    dq[k]/dW[i][j] = x[j] if k == i, and 0 otherwise
Now, use the chain rule
    df/dW[i][j] = df/dq[1]*dq[1]/dW[i][j] + .... + df/dq[N]*dq[N]/dW[i][j]
                = SUM[k=1..N]( df/dq[k] * dq[k]/dW[i][j] )
                = df/dq[i] * x[j]
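
A minimal numpy sketch of this whole example, f(x, W) = ||W * x||**2, using the formulas just
derived (the specific numbers are only for illustration):

import numpy as np

W = np.array([[0.1, 0.5],
              [-0.3, 0.8]])
x = np.array([[0.2],
              [0.4]])

# Forward pass
q = W.dot(x)                  # intermediate vector q = W*x
f = float(np.sum(q ** 2))     # f = ||q||**2

# Backward pass
df_dq = 2.0 * q               # gradient of the sum-of-squares node
df_dW = df_dq.dot(x.T)        # df/dW[i][j] = df/dq[i] * x[j]  (outer product)
df_dx = W.T.dot(df_dq)        # chain rule back through q = W*x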


Design
There are 2 separate tasks

1. Forward. Propagate values forward through the graph.
    This will compute a final predicted value,
    then compare that to the correct value and compute the loss.

2. Backward. Use the chain rule to compute gradients backward through the graph
    Take the dLoss/dOut for each node and use the chain rule to compute 
        dLoss/dInput for each of the inputs.
    This will require plugging in data values that were saved from the forward propagation.
    This takes in a numeric value (scalar or vector or matrix) and outputs a numeric value
    Note that the input gradients may have a different dimension than the output gradient
    The input gradients will have the same dimension as the input values
    For example, a vectorLength function may have an output gradient that is a scalar while
    its input gradients are vectors

We implement this in the same way for each type of neural net and node, with explicit functions
for forward and backward
Note, she uses the term "gate" and "node" interchangeably. For example, a node in the compute graph
has stylistic similarities to a logic gate in a digital circuit.

#######################################
# This computes the values of all internal nodes, and also the error (called "loss")
# at the final node.
# Note, here the final result is the loss (or also called the error), not a specific value.
# This is used for training, so we don't care about the actual value, only the error between the
# predicted and correct answer.
def forward(inputs):
    for gate in self.graph.nodes_topologically_sorted():
        gate.forward()

    # There is an unwritten step here. The loop will compute a final predicted value
    # Now, we must compare that to the correct value and compute the loss.
    return loss

#######################################
# This returns the gradients of the loss in terms of each input
def backward():
    for gate in reversed(self.graph.nodes_topologically_sorted()):
        gate.backward()

    return input_gradients


#######################################
# This is one sample gate, the multiplication function.
class MultiplyGate(object):
    #########################
    # x and y are scalars
    def forward(self, x, y):
        # Save copies of the inputs, since we will need these later 
        # in the backward pass.
        self.x = x
        self.y = y
        z = x * y
        return z

    #########################
    # dz is dL/dz
    def backward(self, dz):
        # dx is dL/dx and dy is dL/dy.
        # A multiply gate scales the upstream gradient by the value of the other input.
        dx = self.y * dz
        dy = self.x * dz
        return [dx, dy]

Large libraries, like Caffe have pre-built classes for lots of the common functions, such as
sigmoid, multiplication, and more.  




Neural Networks
===============
We have a score function that we want to optimize.
A linear classifier has a simple function
    f = W * x
A 2-layer neural network just has two stages of linear classifiers
    f = W2 * MAX(0, W1 * x)
This is just stacking the two matrix multiplies, W1 and W2, in sequence.
There is an intermediate stage that is non-linear: it passes either 0 or the result of the first 
    stage to the second stage.
This non-linear step is very important, it is like the sigmoid.
Without it, W1 and W2 will collapse, and really become just a single linear function.
The non-linear step separates the logic of adjacent stages.
But, it also seems to do something more, but that is not yet clear.

[Side Note: Winston (in 6.034) seemed to describe this as converting the linear score back to a boolean decision.
However, this seems to be more than that, and is needed even when the final output is a linear score. He also
discussed this as an analogy to the threshold voltage of a neuron, which is certainly true.]

We will stack multiple stages of "hierarchical" computation.
This seems to imply that the early stages do low level things, like identify edges or simple features
and the later stages identify higher level concepts based on the low level features.

Recall, that the matrix for a simple 1-stage linear classifier was like a template.
This had the problem that there was only 1 prototype for each class.
A multiple layer network will let you match the inputs against several different templates in parallel 
in the first stage and then combine the results in the second stage.

For example, the inputs may be an image of a car
The first stage may have several linear classifiers in parallel
    Sports car
    Sedan
    Front view
    Side view
    ...
And the second stage may be an OR-gate or weighting function that combines the results from the first stage.

The non-linear step says that if an image does not match one of the first stage multipliers, then ignore it.
This is key because it keeps negative results or noise from propagating along with the positive results.

The roles of first and second stage are not *exactly* like this, but it is a good approximation.
There may be some feature recognition in the second stage as well.

We can stack as many layers as we want
    Inputs --> W1 --> Non-Linear --> W2 ---> Non-Linear ---> W3
The middle layers are called a "Hidden Layer".
    A 2-layer network has 1 hidden layer
    A 3-layer network has 2 hidden layers
Note, some diagrams will also have an "output layer" which is not counted in the # layers.
But, it also does some computation. However, its results are not passed through a non-linear
step, they are the final outputs of the network.

There are many different possible non-linear functions
- out = MAX(in, 0)
- Sigmoid
- ReLU (hinge function): it is 0 up to a threshold and then a linearly growing function after that

The artificial neural network is stylistically similar to a network of biological neurons.

An example forward pass for a 3-layer neural network
-----------------------------------------
import numpy as np

# W1..W3 and b1..b3 are the learned weight matrices and bias vectors.
# Random values (and shapes) are used here just so the snippet runs.
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)
W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)

# Define the sigmoid
f = lambda x: 1.0/(1.0 + np.exp(-x))

# Make random inputs
x = np.random.randn(3, 1)

# First hidden layer
h1 = f(np.dot(W1, x) + b1)

# Second hidden layer
h2 = f(np.dot(W2, h1) + b2)

# Final outputs (no non-linearity on the output layer)
out = np.dot(W3, h2) + b3




================================================================================
Lecture 5
Convolutional Neural Networks
================================================================================

Review from last lecture - Neural Networks
Each layer is a linear function, f = W*x
    where W is the weights matrix and x is the input vector
    This computes a score.
We stack the layers, with non-linear function between each layer

Convolutional Neural Networks add a new type of layer, a convolution, to process
spatial information.

First a bit of History
======================
The Mark I Perceptron (1957, Frank Rosenblatt)
This had a similar design, it computed scores
    f(x) = 1 iff W*x + b > 0, and f(x) = 0 otherwise
There is an update rule, which is similar to back propagation
    w_i(t+1) = w_i(t) + alpha*(d^j - y^j(t))*x_i^j
This was a single layer, not a stack of layers.

Adaline/Madaline (1960, Widrow and Hoff)
This stacked the layers into a multi-layer perceptron network
This still does not have a formal back-propagation

1986, Rumelhart et al, introduced Backpropagation
2006, Hinton and Salakhutdinov reinvigorated research in deep learning.
    But, this still required careful initialization, and they used a pre-training stage.

The current craze in neural networks came in 2012
    Krizhevsky, Sutskever, Hinton (2012), 
        "Imagenet classification with deep convolutional neural networks"
    Mohamed, Dahl, Hinton, 2010
        "Acoustic Modelling using Deep Belief Networks"

In the 1950's, Hubel and Wiesel (1959, 1962, 1968) monitored electrical signals
in cat brains in response to different visual cues. This showed
- There is a topographical mapping in the cortex - nearby neurons respond to 
    nearby parts of the visual field.
- Neurons have a hierarchical organization
    Retinal neurons respond to circles and spots
    Simple cells respond to edges
    Complex cells respond to movement
    Hypercomplex cells respond to movement with an endpoint (like corners)

In 1980, Fukushima built an architecture resembling the one
found by Hubel and Wiesel, the Neocognitron.
This was a stack of alternating layers of simple and complex cells.

In 1998, LeCun, Bottou, Haffner, "Gradient Based Learning applied to document recognition"
used CNN's for recognizing zip codes. This is a neural net and was used in Post offices, but it
did "not scale well", maybe to other tasks or larger sets of images.

Krizhevsky, Sutskever, Hinton, 2012, built AlexNet in "ImageNet Classification
with Deep Convolutional Neural Networks"
This was very similar to the CNN in LeCun's work, but was scaled up.
This used GPU's and large data sets.

Now, CNN's are used for image classification, segmentation, bounding box,
and pixel classification.
    Ren, He, Girshick, Sun, 2015, "Faster R-CNN"
    Farabet et al, 2012
Facial recognition; also used for videos, which use temporal information
    Simonyan et al, 2014
Medical Images - Levy et al, 2016
Street signs, Sermanet et al, 2011 and Ciresan et al


How Convolutional Neural Networks Work
======================================
In any NN, a "Fully Connected Layer" takes an input, which is a vector, and multiplies
it by a weight matrix.
For example, 
    the input x is a 3072x1 vector (a 32x32 RGB image), 
    the weight matrix is 10x3072
    the output is a vector 10x1 of class activation values

A Convolutional layer tries to preserve spatial structure.
Instead of stretching the 2D image into a vector, we preserve the 2d matrix organization

The weights are small filters, like a 5x5x3 filter, and compute a dot product at each spatial location.
So, the weight matrix only covers a small portion of the total image, and we "slide" this along the image 
to compute a result.

The weight matrix always covers the full depth (the number of bytes per pixel), over a small region.
So, instead of breaking the R, G, B values apart into vector entries, each pixel is now a complete
(R,G,B) tuple, that is kept together.

The filter computes a number, which is the dot product of the weight matrix and a pixel region
that is a subset of the image.
    (W-T * x) + b
This number is 1 entry of the convolved result.
We will repeat this with the matrix centered on each pixel of the input image to compute 
an output image with the same dimension. For example, if the input is NxN then the output
will also be NxN but each output pixel is a function of adjacent pixels in the input
image.

Now, we actually convert the 5x5x3 weight matrix into a vector to compute
the dot product. But, we may keep the RGB for each pixel in adjacent positions.
For example, the 5x5x3 becomes a 1x75 vector.

We use the "convolution" term loosely. Really, this is more like a "correlation" which
is convolving with a flipped representation of the image.
    f[x,y] * g[x,y] 
        = SUM[n1 = -infinity..+infinity]SUM[n2 = -infinity..+infinity]( f[n1,n2] * g[x-n1,y-n2] )


Now, we slide the weight matrix so it is centered on pixels of the input matrix. There are several
ways to do this:
- Go by every pixel
- Slide by 2 pixels, or more

We call a weight matrix a "filter"    
The output is an activation map. We may have *several* filters, and each will produce
a different activation map.
So, we will produce a set of K activation maps, each the result of a different filter.
We stack these activation maps, so we end up with an NxNxK 3-dimensional matrix, with depth K. 
Each (i,j) will index a vector of K different values, each produced by a different filter
This forms a single layer of a CNN.

The CNN is a stack of Convolutional layers, separated by Activation Functions.
Recall, Activation Functions are the non-linear functions we defined for Neural Networks
(such as the sigmoid or ReLU), applied to the output of each layer.

    Image -> Convolution -> ReLU -> Convolution -> ReLU .....

The ReLU layers are Rectifier layers. This is a non-linear rectifier, 
    f(x) = max(0, x)
There may also be Fully-Connected layers (usually at the end), which multiply by a weight matrix: W-T*x + b

Remember that each convolution layer has many filters, and each filter produces a different 
plane of the output.

The final output is a vector, with one entry for each possible class. The values in the vector
are the likelihood of the image being in that class.

In practice, this stack forms a hierarchy of filters.
The early layers detect edges, the middle layers find corners or blobs, and the later layers
detect higher level features. This is consistent with what Hubel and Wiesel found in
cat visual systems.

Each layer will usually increase the depth. Really, the higher layers detect a richer vocabulary
of features.
In a CNN, we also may have Pooling layers, which downsamples the map.

In practice, each convolution is a bitmap of a pattern. We slide it over an image to look
for all occurrences of that pattern in the image. The dot-product value will be highest when
the pattern in the filter most closely matches the region of the image.


Now, some of the details...
===========================
We could start with an input 32x32x3 image and a 5x5x3 filter.
For now, assume we have a smaller 7x7 pixel image, and a 3x3 filter.

Sliding the filter, so it is centered on each pixel will produce a 5x5 output.
Note, there is no output pixel for the borders.

"Stride" is the number of pixels we advance the filter bitmap over the input image bitmap.
If we increase the stride, we produce a smaller output bitmap.
For example, a 7x7 image with a 3x3 filter and stride=1 makes a 5x5 output, but stride=2 makes a 3x3 output.
It seems that we apply the stride to advancing both the X and Y.

If we do NOT pad, then the output is smaller than the input.
For an NxN image, and a FxF filter matrix, and a stride, the output dimension will be
    ((N - F) / Stride) + 1
Note, this is the width and also the height size for the output matrix.
If the stride does not compute an integer size, then we do not use it, so not all strides
will work with all sizes of matrices.

So, how do we handle border pixels?
The filter matrix will overlap past the edge of the input matrix.
There are several ways to handle this:
- Use 0's to provide pixel values for the part of the image past its borders
- Mirror nearby values
- Copy the edge value

With padding, the output is the same size as the input.
We can still use ((N - F) / Stride) + 1, but now replace N with the size of the padded
image.

Note, each filter does a dot product over whole pixels (all color channels at once).
So, it converts NxNx3 into NxNx1. It makes a single plane of depth 1.
However, we may combine the planes from several filters to make a deep output, NxNxK

It is common to use filters of 3x3, or 5x5.
A 3x3 needs a padding of 1, and a 5x5 needs padding of 2.

Without padding, we shrink the bitmap matrix at each layer, and lose information
from the edges.

Example:
    Input is 32x32x3
    Filter is 5x5, pad=2
    If there are 10 filters, then the output is 32x32x10
    The padding preserves the 32x32 and 10 filters make 10 values per pixel

    Each filter is 5x5 so there are 25 entries.
    However, each entry has 3 weights, one for R, G, and B. 
    So, 1 filter has 5*5*3 = 75 weights, but we also add a bias constant, so there are 76 weights per filter.
    If there are 10 filters, then there are 760 parameters total.

The input is W x H x D (where D is the depth per pixel)
There are hyperparameters
    K = number of filters
    F is size of each filter in each dimension, also called "spatial extent"
    S is the stride
    P is the number of padding pixels
This produces an output with
    output width = ((W - F + 2P) / S) + 1
    output height = ((H - F + 2P) / S) + 1
    output depth = K
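
A small helper that applies these formulas (pure arithmetic; it raises an error when the stride
does not divide evenly):

def conv_output_size(W, H, D, K, F, S, P):
    # W, H, D: input width, height, depth; K filters of size FxF, stride S, padding P
    if (W - F + 2 * P) % S != 0 or (H - F + 2 * P) % S != 0:
        raise ValueError("filter size, stride and padding do not fit the input")
    out_w = (W - F + 2 * P) // S + 1
    out_h = (H - F + 2 * P) // S + 1
    return out_w, out_h, K

# Example from above: 32x32x3 input, 10 filters of 5x5, stride 1, pad 2 -> (32, 32, 10)
print(conv_output_size(32, 32, 3, 10, 5, 1, 2))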

Some common settings
    F=3, S=1, P=1
    F=5, S=1, P=2
    F=5, S=2, P=computed from input size
    F=1, S=1, P=0
K = 16, 32, 64, .... usually a power of 2
A stride of 2 or more essentially downsamples the image. This may or may not be desirable.

The size of the filter is also called the "receptive field" since this is the size
of the region the filter considers when computing each output value. So a 5x5 filter
has a 5x5 receptive field.

In a fully connected layer, each output is a function of every value in the input
In a convolutional layer, each output is a function of only a small region of pixels in the input


Pooling Layers
==============
This downsamples, and makes the matrix smaller. It focuses on key properties and makes the
data smaller, so fewer parameters (and also faster).

We only pool spatially, so the output depth is the same as the input depth.
Like filters, the pooling also uses a smaller matrix that only covers a part of the input image.
It uses a stride to slide over the entire image.
At each position, the pool filter will read several adjacent entries in the input and generate
a single value of output.

Often, the stride is the size of the pooling matrix, so there is NO overlap.
But, this is not required

There are several common pooling functions
- Max, the output is the max of all values in the input region
    This is intuitively reporting whether the property exists anywhere in the field.
- Average
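
A minimal sketch of 2x2 max pooling with stride 2 on one H x W x D activation volume (assumes H
and W are even):

import numpy as np

def max_pool_2x2(x):
    # x: activation volume of shape (H, W, D); pooling is spatial only, depth is preserved
    H, W, D = x.shape
    x = x.reshape(H // 2, 2, W // 2, 2, D)
    return x.max(axis=(1, 3))   # output shape: (H/2, W/2, D)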

Some people use larger convolution strides to do the downsampling instead of pooling.
However, pooling is more explicit and is more commonly used.

The input is W x H x D (where D is the depth per pixel)
There are hyperparameters
    F is size of each filter in each dimension, also called "spatial extent"
    S is the stride
This produces an output with (no padding, so P = 0)
    output width = ((W - F) / S) + 1
    output height = ((H - F) / S) + 1
    output depth = input depth

Some common settings
    F=2, S=2
    F=3, S=2
We usually do not use padding

This will usually take the output of a RELU layer


RELU Layers
===========
This is like the ReLU in a connected neural network
    f(x) = max(0, x)
But, it can also be a sigmoid or something else.


Fully Connected Layer
======================
This is the last layer at the end of the pipeline.
It takes as input the (flattened) output of the last convolutional layer and outputs a score for each class.


The trends are
- Smaller filters and deeper architectures
- Reduce/remove Pooling and fully connected layers, and just use Convolution layers




================================================================================
Lecture 6
Training Neural Networks Part I
================================================================================

Today, we want to learn the values of parameters for the convolution matrices.
We will do this with optimization.
So, we want to minimize loss, and use Stochastic Gradient Descent (SGD)
1. Sample a batch of data
2. Forward propagate it through the graph, and compute the loss
3. Backprop through the network to calculate the gradients
4. Update the parameters with the gradients

This lecture and the next ones will go through all of the details.

Activation Functions
====================
This is a non-linear function. There are several different functions we can use.

- Sigmoid
---------
f(x) = 1 / (1 + e**-x)
This returns values between 0 and 1.
This is popular, and resembles the saturating firing rate of a neuron.
But, it has problems. 
First, it zeros out the gradient in backprop.
In backprop, we compute dL/dx with the chain rule
    dL/dx = dSigmoid/dx * dL/dSigmoid
where L is the loss, and x is the input to the sigmoid.
But, when x is very negative, the sigmoid is flat so dSigmoid/dx ~= 0, which means the backprop gradient is ~0.
    So, a saturated Sigmoid function causes the backprop gradient to be 0.
Similarly, when x is very positive, then Sigmoid is also flat, so dSigmoid/dx ~= 0 and again the
backprop is 0.

Also, the outputs are not "zero-centered". The function returns values between 0 and 1, never a
negative value, so the inputs to the next layer are always positive.
This causes the gradients on that layer's weights to all be positive or all be negative.
(Why? Each weight's gradient is dL/dw_i = dL/df * x_i, and if every input x_i is positive,
every component shares the sign of dL/df.)
This means all parameters move as a group, so you cannot follow an arbitrary slope in gradient descent.
You still may reach the optimal answer, but it will be a slower, zig-zagging, less efficient process.

Finally, the exponent function is slightly expensive.

- tanh
---------
f(x) = tanh(x)
This returns values between -1 and 1, so it is 0-centered. It is sigmoid shaped, but now between -1 and +1.
Introduced by LeCun 1991
But, it has problems. 
Like sigmoid, it zeros out the gradient in backprop when it saturates.

- ReLU
---------
f(x) = max(0, x)
ReLU stands for "Rectified Linear Unit". Introduced by Krizhevsky 2012
It is commonly used.
This does not saturate in the positive region, so outputs can become arbitrarily large. It is also more "biologically plausible" than the sigmoid.
Very efficient to compute and also converges faster (up to 6x faster) than sigmoid or tanh

It is still not a zero-centered output. 
Moreover, if input < 0, then this is flat, so the gradient is 0. This means it zeros out
the gradient on backprop when the inputs are negative.
A neuron that only ever sees negative inputs (a "dead ReLU") gets zero gradient and its weights never update.

- Leaky ReLU
-------------
f(x) = max(0.01*x, x)
Introduced by Mass et al 2013, and also He et al 2015
This does not saturate, so can become infinitely positive.
Very efficient to compute and also converges faster (up to 6x faster) than sigmoid or tanh
There is no saturation, so even with negative values it will not zero out the gradient.

- PReLU
-----------
f(x) = max(alpha*x, x)
PReLU stands for "Parametric Rectified Linear Unit".
This does not saturate, so can become infinitely positive.
Very efficient to compute and also converges faster (up to 6x faster) than sigmoid or tanh
There is no saturation, so even with negative values it will not zero out the gradient.

The alpha parameter is also learned and optimized by backprop.

- ELU
---------
f(x) = x if x >= 0, or alpha*(e**x - 1) if x < 0
ELU stands for "Exponential Linear Unit". Introduced by Clevert 2015
All the benefits of the ReLU
Closer to zero-mean outputs (not exactly centered on 0), since it produces some negative output for negative inputs.
This has a saturation for negative values, and may be more robust than ReLU.
In practice it behaves somewhere between ReLU and Leaky ReLU
But, the exponentiation is computationally more expensive.

- Maxout
---------
f(x) = max(W1-T*x + b1, W2-T*X + b2)
Linear, does not saturate, does not kill off gradients.
The only problem is it doubles the number of parameters per neuron.
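
A small numpy sketch of several of the activation functions above (the alpha values shown are
common defaults, chosen here for illustration):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)

def elu(x, alpha=1.0):
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1))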



In general,
- Try ReLU, but be careful with learning rates
- Try Leaky ReLU to empirically see if it helps
- Do not use sigmoid, and don't bother with tanh.


Data Preprocessing
====================
You always want to preprocess the data. 
We do this with all data: train, validate, test

There are several steps

1. Zero-center the data
    X = X - meanValue
We do this for image data.
Compute the mean for all training data. (we will re-use this mean for validate and test).
We compute a mean for all data once, and then reuse this mean for each batch of training data.
The mean can be over all R, G, and B values together, or else a per-channel mean.
    R = R - mean-R
    G = G - mean-G
    B = B - mean-B

2. Normalize the data.
    X = X / StandardDeviation
All values have the same range, so all values contribute equally to the curve fitting.
We typically do NOT do this with image data, since all values are already
fairly well distributed over the same small range (usually 0-255 for a single RGB value)

3. Optional - PCA or Whitening
This maps the data to a lower dimensional space with other qualities.
We don't do this for image data
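
A minimal sketch of these preprocessing steps; the random X_train array is only a placeholder so
the snippet runs:

import numpy as np

# Hypothetical training set: N images of H x W x 3 (random values as placeholders)
X_train = np.random.rand(500, 32, 32, 3).astype(np.float32)

# Compute statistics on the training data only, then reuse them for validation and test.
mean_per_channel = X_train.mean(axis=(0, 1, 2))   # one mean for each of R, G, B
std_per_channel = X_train.std(axis=(0, 1, 2))

def preprocess(X):
    X = X - mean_per_channel       # zero-center
    # For image data we usually stop here; for other data, also normalize:
    # X = X / std_per_channel
    return X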


Weight Initialization
=====================
If all weights are initialized to 0, then all neurons initially have the
same response to an input. This means all neurons have the same gradient, and
so move in lockstep. This makes the system learn more slowly.

Additionally, if we initialize to small weights, then the signal shrinks as it passes 
forward through the network and the later stages are multiplying 2 very small numbers for
the dot products. As a result, the activations shrink and approach 0.

If the initialization is too big, then the activations explode, the neurons saturate
and the gradient approaches 0.

There are several fixes:

- Initialize with random values
    W = 0.01 * np.random.randn(D, H)
This works with a small network, but has problems with larger networks.
It continues to work well with the first layer, but the second and later layers
will quickly trend toward a constant output. The distribution of the outputs
of layers 2 and beyond will shrink to 0, so the outputs are producing a small
constant number.

The outputs are 0, so the gradient will then also go toward 0 in backprop.
This makes the net slow to improve, and actually can get stuck on 0.

This problem happens because we assign small random numbers. But, if we take
larger random values, then we risk going to the saturation point in the non-linearity.
This is the same sort of problem, before we had all neurons stuck in cutoff at 0 and now 
we have all neurons stuck in saturation. Again, the gradient is 0.

- Xavier Initialization
Introduced in Glorot and Bengio, "Understanding the difficulty of training deep feedforward neural Networks", 2010
    W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)
This is a standard gaussian that is scaled by number of inputs.
This has the property that the variance of input = variance of output
This means that 
    a small number of inputs will divide by small number and get larger weights
    a large number of inputs will divide by large number and get smaller weights
In all cases, they are evenly spread

However, this breaks if you have ReLU. ReLU kills all negative inputs (the output is 0, so
backprop dL/dX is 0 for them), which effectively halves the variance at each layer.

You can fix this for ReLU by adding a divide-by-2
    W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in / 2)
Introduced by He at al 2015, "Delving deep into Rectifiers: Surpassing Human-level performance on ImageNet classification"
This assumes that half the neurons get "killed", which means they are in cutoff.
*** This is an important optimization in practice. It may make the difference between a network that trains and one that does not.

See also:
Mishkin and Matas, "All you need is a good init", 2015
Krahenbuhl et al, "Data-dependent Initializations of Convolutional Neural Networks", 2015
Sussillo and Abbott, "Random Walk Initialization for training very deep feedforward networks" 2014
Saxe et al, "Exact solutions to the nonlinear dynamics of learning in deep linear neural networks"

Rule of Thumb
-------------
Use Xavier initialization, with the extra /2 factor when using ReLU:
    W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in / 2)



Batch Normalization
====================
In general, we want all activations at each layer fit in a Gaussian Distribution. 
We want 0-mean (0-centered) and also normalized.
Normalization means the system is less sensitive to small changes in weights. So, it is
    easier to optimize. 

We can explicitly force this:

    x^k = (x^k - E(x^k)) / SQRT(Var(x^k))

Ioffe and Szegedy, 2015
Here, we adjust the values after each forward pass. However, notice that this loses
the simple model of adjusting weights with dL/dW, since we now do a normalization after
that step.

The lecture leaves out a lot of important words, like mean and variance of what?
This seems to be a function that modifies the values X that are output by a Fully-Connected
layer, after the inputs are multiplied by the weight matrix W.
- Compute the mean and variance of each output value X across all training values in a batch.
- Normalize the output values
This is a differentiable function, so we can do backprop through this.

This is normally done as a post-processing layer that is performed after each 
fully-connected layer. 

    ... --> Fully-Connected --> Batch Normalization --> ReLU --> ....

In a sense, this is post-processing the data, and is analogous to the pre-processing
step discussed earlier.

It can ALSO be done after a convolutional layer.
In Convolutional layers we normalize jointly across all spatial locations, so one mean for the whole
activation map. This preserves the spatial relationships within the map.

Rule of Thumb:
This is very useful, and use this after each convolutional and fully connected layer.

However, this will push values into the linear part of each ReLU/tanh/etc. 
It is possible, however, that we want to still have some values in cutoff/saturation. 

In practice, we do 2 steps
- Normalize
    x^k = (x^k - E(x^k)) / SQRT(Var(x^k))

- Squash the range and shift
    y^k = (gamma^k * x^k) + beta^k
    where:
        gamma^k = SQRT(Var(x^k))
        beta^k = E(x^k)
Gamma and Beta are both learned parameters.

The Normalize step modifies the inputs before a stage, and the second (scale-and-shift) step
can undo the effects of the batch normalization on the outputs. The network can therefore learn
to remove the effects of the scaling on the final data, so the data are not forced into a fixed
perturbation by this scaling function.

    --> Normalize --> NeuralNet Stage --> Squash the range and shift

So, in summary:
Suppose the input is X and is an N-element vector, where each element is a tuple of D values
There are 2 learnable params: gamma and beta, both are tuples of D values
We compute some intermediates, mean, variance
We output a normalized vector Y which also is an N-element vector, where each element is a tuple of D values

- Compute mean across all values in the batch
    mean = 1/m ( SUM[i=1..m](x^i) )
- Compute variance across all values in the batch
    var = 1/m ( SUM[i=1..m]((x^i - mean)**2) )
- Normalize the inputs for each layer. This does not change the weights of W.
    xhat^i = (x^i - mean) / SQRT(var + epsilon)
- Scale and shift each output
    y^i = gamma * xhat^i + beta

In a sense, this is pre-processing the data but applied to EACH layer in the network.

All of this improves gradient flow through the network.
It allows faster learning rates, and reduces dependence on the initial values.

At test time, we do not re-compute the mean and variance. Instead, we use the
mean and variance we computed at training time.
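
A minimal sketch of a batch-normalization forward pass for a fully-connected layer, including
running statistics for test time; the exact bookkeeping (momentum value, epsilon) varies by
framework and is an assumption here:

import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      train=True, eps=1e-5, momentum=0.9):
    # x: (N, D) batch of layer outputs; gamma, beta: (D,) learned scale and shift
    if train:
        mean = x.mean(axis=0)                  # per-feature mean over the batch
        var = x.var(axis=0)                    # per-feature variance over the batch
        # Keep running statistics for use at test time
        running_mean = momentum * running_mean + (1 - momentum) * mean
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mean, var = running_mean, running_var  # reuse training-time statistics
    x_hat = (x - mean) / np.sqrt(var + eps)    # normalize
    y = gamma * x_hat + beta                   # scale and shift (learned)
    return y, running_mean, running_var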


Babysitting the Learning Process
================================
How do we monitor training

- Pre-process the data

- Choose the architecture (how many hidden layers)

- Initialize the network
    model = init_two_layer_model(32*32*3, num_hidden_neurons=50, numOutputs=10)
    loss, grad = two_layer_network(X_train, model, y_train, regularization=0.0)
    print loss    
Make sure the loss is reasonable.
After the first pass, the loss should be about -log(1/#classes) when we disable regularization

- Make sure the loss increases when we enable regularization
    model = init_two_layer_model(32*32*3, num_hidden_neurons=50, numOutputs=10)
    loss, grad = two_layer_network(X_train, model, y_train, regularization=3)
    print loss    

- Start training
    You should be able to overfit a small subset of the data.
    Try to make the loss decline and approach 0

    model = init_two_layer_model(32*32*3, num_hidden_neurons=50, numOutputs=10)
    trainer = ClassifierTrainer()
    x_tiny = x_train[:20]
    y_tiny = y_train[:20]
    best_model, stats = trainer.train(x_tiny, y_tiny, x_tiny, y_tiny, 
                                        model, two_layer_net,
                                        num_epochs=200, reg=0.0,
                                        update='sgd', learning_rate_decay=1,
                                        sample_batches=False,
                                        learning_rate=1e-3, verbose=True)

- Train with full training data
    Start with a small regularization, and find the learning rate

Try a very small learning rate
    model = init_two_layer_model(32*32*3, num_hidden_neurons=50, numOutputs=10)
    trainer = ClassifierTrainer()
    # Now train on the full training set and validate on a held-out split (x_val, y_val)
    best_model, stats = trainer.train(x_train, y_train, x_val, y_val, 
                                        model, two_layer_net,
                                        num_epochs=200, reg=0.0,
                                        update='sgd', learning_rate_decay=1,
                                        sample_batches=False,
                                        learning_rate=1e-6, verbose=True)   # NOTE: tiny learning rate
The loss changes slowly, so the learning rate is likely too small.

Try a very large learning rate
    model = init_two_layer_model(32*32*3, num_hidden_neurons=50, numOutputs=10)
    trainer = ClassifierTrainer()
    best_model, stats = trainer.train(x_train, y_train, x_val, y_val,
                                        model, two_layer_net,
                                        num_epochs=10, reg=0.000001,
                                        update='sgd', learning_rate_decay=1,
                                        sample_batches=True,
                                        learning_rate=1e6, verbose=True)
The loss explodes to NaN (the cost blew up). This means the learning rate is too big.




Hyperparameter Optimization
============================
Some hyperparameters to optimize
    - Number of hidden layers
    - Learning rate (decay schedule, update type)
    - Regularization (L2/Dropout strength)

We use cross-validation.
Train on the training set and then evaluate on a validation set.

First, try hyperparameters that are far apart, and only try a few epochs,
like 5 epochs. This quickly narrows down to a range.
It's better to optimize in log space (e.g., sample the learning rate as 10**uniform(-6, -3))
rather than on a linear scale, and then narrow the exponent range in the next stage.

Second, repeat with a narrower search range over more epochs.

Tip: In general, if the cost ever exceeds 3x the original cost, abort the trial early.
It is likely to blow up.


Monitor and visualize the loss curve
    Loss is the Y-axis and #epochs is the x-axis
    Very high learning rate - loss increases with epochs - this is bad
    High learning rate - loss declines for a while, then levels off and does not improve further
    Low learning rate - loss declines smoothly but at a slow rate
    Good learning rate - loss declines and pretty quickly hits a low loss level

If the loss is flat and high, but after a while starts to drop, then this is
likely due to bad initialization values.

If there is a big gap between training and validation, then this is likely overfitting.

Track the ratio of weight updates to weight magnitudes
    param_scale = np.linalg.norm(W.ravel()) # W is the weight matrix
    update = -learning_rate * dW  # dW is the gradient vector
    update_scale = np.linalg.norm(update.ravel())
    W += update
    print update_scale / param_scale
We want this ratio to be around 0.001

It is good to try random combinations of the hyperparameter. Don't do a fixed
grid search because one variable is likely to be more important than the others,
and in a grid they all change at the same rate. Randomly changing the hyperparameters
will expose which has a bigger effect.
    Bergstra and Bengio, "Random Search for Hyper-Parameter Optimization", 2012
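
As a sketch of random search in log space, using the same hypothetical helpers
(init_two_layer_model, ClassifierTrainer, two_layer_net) as the snippets above; the
stats['val_acc'] field is an assumption about what the trainer returns:

import numpy as np

best_val_acc, best_params = -1.0, None
for trial in range(100):
    # Sample the exponents uniformly, so the values are uniform in log space
    lr  = 10 ** np.random.uniform(-6, -3)   # learning rate
    reg = 10 ** np.random.uniform(-5, 5)    # regularization strength

    model = init_two_layer_model(32*32*3, num_hidden_neurons=50, numOutputs=10)
    trainer = ClassifierTrainer()
    _, stats = trainer.train(x_train, y_train, x_val, y_val,
                             model, two_layer_net,
                             num_epochs=5, reg=reg,
                             update='sgd', learning_rate=lr, verbose=False)
    val_acc = stats['val_acc'][-1]          # assumed key name
    if val_acc > best_val_acc:
        best_val_acc, best_params = val_acc, (lr, reg)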





================================================================================
Lecture 7
Training Neural Networks Part II
================================================================================

More details and tips. 

Fancier Optimization
====================
Training a NN is an optimization problem, where we try to minimize the loss.
We use stochastic gradient descent.
while True:
    weightsGradient = evalGradient(lossFunction, data, weights)
    weights += -stepSize * weightsGradient


There are some problems with stochastic gradient descent.
1. Loss may have a "high condition number"
The loss is very sensitive to one variable but not sensitive to another variable.
The loss curves may look like a taco shell - sensitive in 1 direction but not the other direction.
The condition number is the ratio of the largest to smallest singular value of the Hessian matrix
Stochastic gradient descent will zig-zag: the gradient over- and undershoots along the sensitive
direction, while it moves very slowly along the less sensitive direction.

2. There may be local minima of loss, or saddle points.
Stochastic gradient descent is hill climbing.
True local minima are less common, but saddle points are common.
In a Saddle point, the loss goes up in 1 direction and down in another.

3. Stochastic gradient descent is stochastic: it computes the gradient on
a random mini-batch, so each estimate is noisy and is not the true gradient.


Momentum
--------
We can solve several of these issues by adding a "momentum" term to stochastic gradient descent.

Classic stochastic gradient descent:
    while True:
        dW = computeGradient(x)
        W -= learningRate * dW
We always step in the direction of the negative gradient.
x[t+1] = x[t] - alpha*Gradient(x[t])

Stochastic gradient descent with Momentum
    vW = 0
    while True:
        dW = computeGradient(x)
        vW = (rho * vW) + dW
        W -= learningRate * vW

Now we take a step in the direction of the velocity, not just the gradient.
This builds up "velocity", which is a running sum of past gradients
    more recent gradients are more heavily weighted, there is an exponentially decaying weight.
Rho is a friction, and is usually 0.9

Initialize the velocity to 0.

The velocity accumulates, so it carries us over local minima/maxima and also through saddle points.
The momentum will also average out the zig/zags, and help with poor conditioning.
The velocity will also accumulate on a low sensitivity direction and still learn faster.
Random Noise gets averaged out in the velocity sum, so it has less effect.

We step in a direction that is the vector sum of the previous momentum and the latest gradient.
    Compute gradient
    Add gradient to velocity
    Step in direction of the updated velocity

    v[t+1] = rho*v[t] + Gradient(x[t])
    x[t+1] = x[t] - (alpha * v[t+1])

Momentum algorithms tend to overshoot the optimal point, and then change course and go back to it.
This is still much faster than typical stochastic gradient descent.
    
Nesterov Momentum
-----------------
There is a variant of this: Nesterov Momentum.
    Step in the direction of the previous momentum
    Compute the gradient at this new point
    Step in the direction of the gradient

    v[t+1] = rho*v[t] - alpha*Gradient(x[t] + rho*v[t])
    x[t+1] = x[t] + v[t+1]

We can also rewrite Nesterov Momentum so that the loss and gradient are computed at the same point.
Let xx[t] = x[t] + rho*v[t]. Then:

    v[t+1] = rho*v[t] - alpha*Gradient(xx[t])
    xx[t+1] = xx[t] - rho*v[t] + (1 + rho)*v[t+1]
            = xx[t] + v[t+1] + rho*(v[t+1] - v[t])

    dX = computeGradient(X)
    oldV = v
    v = (rho * v) - (learningRate * dX)
    X += (-rho * oldV) + ((1 + rho) * v)


AdaGrad
-------
Another optimization is AdaGrad
    Duchi et al
    "Adaptive subgradient methods for online learning and stochastic optimization"
    JMLR 2011

This keeps a running sum of the squared gradients.
Now, when we update the param vector, we divide by the square root of this running sum.
This works to oppose the problems of sensitive/insensitive gradients
    If the gradient is large, then dividing by a large number will slow our change
    If the gradient is small, then dividing by a small number will increase our change
So, this helps for the case when you have variables with different sensitivities.
However, there are problems
    As the gradientSquared grows over time, it makes the step size smaller over time.
    So, you may slow learning as you approach the global min error
    You may also get stuck on saddles.    

    gradientSquared = 0
    while True:
        dW = computeGradient(x)
        gradientSquared += (dW * dW)
        W -= (learningRate * dW) / (np.sqrt(gradientSquared) + 1e-7)

AdaGrad by itself is rarely practical for deep networks, but a decayed variant, RMSProp, works well.
We add 1e-7 as a safety margin to ensure we never divide by 0.

RMSProp
-------
This still keeps a running estimate of the squared gradients, but now it decays, so the step sizes
do not grind to a halt.

    gradientSquared = 0
    while True:
        dW = computeGradient(x)
        gradientSquared = (decayRate * gradientSquared) + ((1 - decayRate) * dW * dW)
        W -= (learningRate * dW) / (np.sqrt(gradientSquared) + 1e-7)

This is like momentum, but it uses a decaying average of the squared gradients, rather than the gradients themselves.


Adam
-------
This is a combination of momentum and scaling by a decaying history of squared gradients (like RMSProp).

    firstMoment = 0
    secondMoment = 0
    while True:
        dW = computeGradient(x)

        # firstMoment is like momentum.
        firstMoment = (beta1 * firstMoment) + ((1 - beta1) * dW) 

        # secondMoment is a decaying history of squared gradients, like RMSProp
        secondMoment = (beta2 * secondMoment) + ((1 - beta2) * dW * dW)

        # Move in the direction of momentum, but scale this by the sqrt of secondMoment
        # So we scale by the decaying history of gradients.
        W -= ((learningRate * firstMoment) / (np.sqrt(secondMoment) + 1e-7))

At the first time step, secondMoment is still close to 0, so we divide by a very small number and
may take a very large step at the beginning, whether or not that makes sense for the problem.
We add 1e-7 as a safety margin to ensure we never divide by 0.

See: Kingma and Ba, "Adam: A method for stochastic optimization", ICLR 2015

The real form of the Adam algorithm is

    firstMoment = 0
    secondMoment = 0
    for t in range(1, numIterations + 1):
        dW = computeGradient(x)

        # firstMoment is like momentum.
        firstMoment = (beta1 * firstMoment) + ((1 - beta1) * dW) 

        # secondMoment is a decaying history of squared gradients, like RMSProp
        secondMoment = (beta2 * secondMoment) + ((1 - beta2) * dW * dW)

        firstUnbias = firstMoment / (1 - beta1**t)
        secondUnbias = secondMoment / (1 - beta2**t)

        # Move in the direction of momentum, but scale this by the sqrt of secondMoment
        # So we scale by the decaying history of gradients.
        W -= ((learningRate * firstUnbias) / (np.sqrt(secondUnbias) + 1e-7))

We use these bias-corrected estimates of the first and second moments, rather than the raw
moments, so the early updates are not distorted by the zero initialization.
This is a very good algorithm, and works well as a starting point for lots of problems.
Some reasonable starting points:
    beta1= 0.9
    beta2 = 0.999
    learningRate = 1e-3 or 5e-4



Learning Rate
=============
In all algorithms, the learning rate does not need to be a constant.
Instead, it can decay over time, so you slow as you approach the final answer.

    Step decay: reduce the learning rate by some factor (e.g. halve it) every few epochs
or
    learningRate = initialValue * e**(-k*t)      (exponential decay)
or
    learningRate = initialValue / (1 + k*t)      (1/t decay)

Lowering the learning rate can stop the search from bouncing and resume making progress.
Start with no learning rate decay, watch the loss-vs-epochs curve, and then add decay later.
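
A small sketch of the three schedules above; alpha0, k, and the step-decay factor are
hyperparameters you would pick:

import math

alpha0, k = 1e-3, 0.1

def step_decay(epoch, drop_every=10, factor=0.5):
    # multiply the rate by `factor` every `drop_every` epochs
    return alpha0 * (factor ** (epoch // drop_every))

def exponential_decay(epoch):
    return alpha0 * math.exp(-k * epoch)

def one_over_t_decay(epoch):
    return alpha0 / (1.0 + k * epoch)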


Second Order Optimization
=========================
Gradient descent uses the first order derivative to approximate the function.
This is a crude estimate, and can be improved by using a second order derivative.

We compute a Hessian matrix (written as H), which is the matrix of second order derivatives.
So, we use both the gradient (first order derivatives) and the Hessian (second order derivatives):
    J(theta) ~= J(theta0) + (theta - theta0)^T * gradJ(theta0) + 1/2 * (theta - theta0)^T * H * (theta - theta0)

Setting the derivative of this approximation to zero gives the Newton update:
    theta = theta0 - Inverse(H) * gradJ(theta0)
This requires inverting the Hessian matrix, which is impractical and expensive for large networks.
There are quasi-Newton approximations of the inverse (like L-BFGS), but they do not work well in the
stochastic, mini-batch setting.



Reducing Overfitting
====================
Once we optimize a model, we want to somehow perform better on the final test data.

One way to do this is to train multiple independent models. Then, at test time,
use all the models and average their results.
Each model uses a different initial state for training.

This may add 2% to performance. It is often done in competitions.

You can also snapshot a single model at different times, and use those different
snapshots at test time and average their results.




Regularization
==============
This improves the performance of a single model.
We add some things to the models to prevent overfitting.

We add a term to the formula for the loss function.
Normally (multiclass SVM hinge loss):
    Loss = 1/N * SUM[i=1..N] SUM[j != y_i] max(0, f(x^i, W)_j - f(x^i, W)_{y_i} + 1)

But we add a regularization term:
    Loss = 1/N * SUM[i=1..N] SUM[j != y_i] max(0, f(x^i, W)_j - f(x^i, W)_{y_i} + 1) + lambda*R(W)

Some common forms of R(W) are:
- L2 Regularization (weight decay)
    R(W) = SUM[k]SUM[l](W[k][l]**2)
- L1 Regularization
    R(W) = SUM[k]SUM[l](|W[k][l]|)
- Elastic Net (L1 + L2)
    R(W) = SUM[k]SUM[l](W[k][l]**2 + |W[k][l]|)
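
A minimal numpy sketch of adding an L2 penalty on top of a data loss; the weight shape,
lambda value, and the placeholder data_loss are illustrative only:

import numpy as np

W = np.random.randn(10, 3073)     # example weight matrix
data_loss = 1.23                  # placeholder for the hinge/softmax loss over the batch
lam = 1e-4                        # regularization strength lambda (a hyperparameter)

reg_loss = lam * np.sum(W * W)    # L2 regularization R(W)
loss = data_loss + reg_loss
dW_reg = 2 * lam * W              # contribution of R(W) to the gradient dW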

Dropout
-------
Another regularization technique is Dropout
On every forward pass, randomly set the activation output of some neurons to 0
See Srivastava et al, "Dropout: A simple way to prevent neural networks from overfitting", JMLR 2014
Probability of dropping a neuron is a hyperparameter. 0.5 is a common value.

This is usually done on fully connected layers. 
For example, in a simple 3-layer neural network:
p = 0.5
def trainStep(X):
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    U1 = np.random.rand(*H1.shape) < p  # dropout mask for layer 1
    H1 *= U1

    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    U2 = np.random.rand(*H2.shape) < p  # dropout mask for layer 2
    H2 *= U2

    out = np.dot(W3, H2) + b3
    return out

Why does this work? It is not clear. It may be to prevent the network from depending
too much on a single feature.
It also emulates model ensemble within a single model. Each dropout emulates a different
simpler model.

Now, at test time, we want to remove randomness. We want to average out the randomness.
Let z be the randomness, and we could integrate the result for all possible values of z
    y = f(x) = E(f(x,z)) = Integral(p(z)f(x,z))dz

We could approximate this integral.
Consider a single neuron with two inputs x and y.
At test time (no dropout):
    a = w1*x + w2*y
During training with dropout (p = 0.5), the expected output is:
    E(a) = 1/4*(w1*x + w2*y) + 1/4*(w1*x + 0*y) + 1/4*(0*x + w2*y) + 1/4*(0*x + 0*y)
this simplifies to:
    E(a) = 1/2*(w1*x + w2*y)

So, at test time, we multiply the result by the dropout probability, but we do not
actually drop out any values.

People may use "inverted" dropout.
At training time, divide by p so at test time you do not need to do the multiply.

Note, during dropout we do not pass the gradient back through dropped out nodes.
This means training may take longer, but it still works.
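
A minimal sketch of the "inverted" dropout idea for one hidden layer (the shapes and p are
placeholders): dividing by p at train time means nothing extra is needed at test time.

import numpy as np

p = 0.5  # probability of keeping a unit

def dropout_train(H):
    U = (np.random.rand(*H.shape) < p) / p   # drop units and rescale at train time
    return H * U

def dropout_test(H):
    return H                                 # no scaling needed at test time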

Data Augmentation
------------------
We can increase the effective training set by applying simple transformations
- Flip horizontal or vertical
- Rotate
- Take crops of the image
- Color jittering
- Zoom/shrink
- Stretch
- Lens distortions
- Shearing (and possibly random occlusion)

Batch Normalization is another example of this.


Drop Connect
------------
Instead of randomly zeroing out activations, randomly zero out elements of the weight matrix


Fractional Max Pooling
----------------------
Randomly pool over different regions
This is rarely used


Stochastic Depth
----------------
We randomly drop entire layers during training.
During test time, we use the entire network.
See:
    Huang et al, "Deep Networks with Stochastic Depth", ECCV 2016



Transfer Learning
=================
We want as much data as possible. 
- Train a CNN on ImageNet
- Re-initialize ONLY the last fully-connected layer
   Keep all other layers the same as they were in ImageNet
- Retrain it on a related subset (like specific breeds of dog).

Additionally, you can update the last 1 or 2 layers.
This is useful when we start with a generic class, and then retrain with a subclass.

Transfer learning is the norm, not the exception.

We almost always pre-train. You can find related networks:
    PyTorch: https://github.com/pytorch/vision
    Tensorflow: https://github.com/tensorflow/models
    Caffe: https://github.com/BVLC/caffe/wiki/Model-Zoo
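
A minimal PyTorch/torchvision sketch of the recipe above: load a pretrained network, freeze it,
and re-initialize only the last fully-connected layer. The number of new classes (120 dog breeds)
is a made-up example.

import torch
import torchvision

model = torchvision.models.resnet101(pretrained=True)

# Freeze all the pretrained layers
for param in model.parameters():
    param.requires_grad = False

# Replace only the final fully-connected layer with a new, randomly initialized one
num_classes = 120   # hypothetical new label set (e.g. dog breeds)
model.fc = torch.nn.Linear(model.fc.in_features, num_classes)

# Only the new layer's parameters are passed to the optimizer
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)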



*** Tips
- Start with Adam optimization algorithm
- Regularization - Data Augmentation
- Pretrain


================================================================================
Lecture 8
Deep Learning Software
================================================================================

CPU vs GPU
===========
AI workloads generally prefer NVIDIA GPUs.
NVIDIA dominates because it has tried to optimize its architectures and software for AI.

A CPU may have 4 cores, support up to 8 threads with hyperthreading.
NVIDIA Titan has 3840 cores and GTX 1070 has 1920 cores.
However, a GPU core is very limited and is not comparable to a CPU core.
But, the many more cores in a GPU allow lots of parallelism.

A GPU has a lot of local on-chip RAM (8-12GB) so avoids the system bus.

In some benchmarks, GPU's run 60x-75x faster than a comparable CPU.

An ideal application for a GPU is matrix multiplication.
    Each element in the output matrix is computed in parallel
Also, convolution lends itself well to GPU's

You can write programs that run directly on the GPU.
NVIDIA provides CUDA, a C-like language that will run on the GPU.
However, it is hard to write high performance CUDA code.
Instead, NVIDIA publishes libraries that run CUDA code on the GPU for common tasks.
    In some benchmarks, cuDNN is 3x faster than generic CUDA

OpenCL is similar to CUDA, but runs on AMD and other GPU's in addition to NVIDIA.
However, it is slower and rarely used for deep learning.


Some of the common frameworks
    Caffe (Berkeley) ---> Caffe2 (Facebook)
    Theano (U Montreal) ---> TensorFlow(Google)
    Torch (NYU) ---> PyTorch (Facebook)
    CNTK (Microsoft)
    Paddle (Baidu)
    MXNet (Amazon)
These frameworks all manage GPU and also prefetching data.

All of these build on the idea of computation graphs.
    Data moves along edges between nodes that perform operations on it.
Generally, the program defines the graph, and the framework automatically handles the back propagation.


Numpy Example:
---------------
To compute a simple computational graph (a = x*y, b = a + z, c = SUM(b)):

import numpy as np
np.random.seed(0)
N, D = 3, 4

x = np.random.randn(N, D)
y = np.random.randn(N, D)
z = np.random.randn(N, D)

a = x * y
b = a + z
c = np.sum(b)

grad_c = 1.0
grad_b = grad_c * np.ones((N, D))
grad_a = grad_b.copy()
grad_z = grad_b.copy()
grad_x = grad_a * y
grad_y = grad_a * x

numpy does not run on a GPU, and computing the gradients by hand is a lot of boilerplate.
The goal of the frameworks is code that resembles the syntax of the numpy forward pass, but that
also computes the gradients automatically and runs on a GPU.

To do the same computation in TensorFlow
-----------------------------------------
import numpy as np
np.random.seed(0)
import tensorflow as tf

N, D = 3, 4

with tf.device('/gpu:0'):
    x = tf.placeholder(tf.float32)
    y = tf.placeholder(tf.float32)
    z = tf.placeholder(tf.float32)

    a = x * y
    b = a + z
    c = tf.reduce_sum(b)

# This computes all of the gradients
grad_x, grad_y, grad_z = tf.gradients(c, [x, y, z])

with tf.Session() as sess:
    values = {
        x: np.random.randn(N, D), 
        y: np.random.randn(N, D), 
        z: np.random.randn(N, D),
    }
    out = sess.run([c, grad_x, grad_y, grad_z],
                    feed_dict = values)
    c_val, grad_x_val, grad_y_val, grad_z_val = out



To do the same computation in PyTorch
-------------------------------------
import torch
from torch.autograd import Variable

N, D = 3, 4

# Create the graph
x = Variable(torch.randn(N, D), requires_grad=True)
y = Variable(torch.randn(N, D), requires_grad=True)
z = Variable(torch.randn(N, D), requires_grad=True)

#To use GPU
#x = Variable(torch.randn(N, D).cuda(), requires_grad=True)
#y = Variable(torch.randn(N, D).cuda(), requires_grad=True)
#z = Variable(torch.randn(N, D).cuda(), requires_grad=True)

# Forward pass
a = x * y
b = a + z
c = torch.sum(b)

# This computes all of the gradients
c.backward()

print(x.grad.data)
print(y.grad.data)
print(z.grad.data)






TensorFlow(Google)
==================

As an example, this is a 2-layer fully-connected ReLU network on random data with L2 loss


import numpy as np
np.random.seed(0)
import tensorflow as tf

##################
# PART I:
# Define the graph
N, D, H = 64, 1000, 100

# Define the input nodes for the graph. This does not allocate memory, but sets
# up input slots.
x = tf.placeholder(tf.float32, shape=(N, D))
y = tf.placeholder(tf.float32, shape=(N, D))
w1 = tf.Variable(tf.random_normal((D, H)))
w2 = tf.Variable(tf.random_normal((H, D)))
# w1 and w2 are Variables, rather than placeholders. This means the data
# stays on the GPU, which is a big performance gain. We do not have to copy
# w1 and w2 back to the CPU and then into the GPU on each iteration.

# Now, specify what computation we want to run on the variables.
# Here we specify the matrix multiplications
# This does not do any actual computation, it only creates a graph that will compute
# the specified calculations.
h = tf.maximum(tf.matmul(x, w1), 0)
y_pred = tf.matmul(h, w2)
diff = y_pred - y
loss = tf.reduce_mean(tf.reduce_sum(diff ** 2, axis=1))

# Update the weights as part of the computational graph.
optimizer = tf.train.GradientDescentOptimizer(1e-5)
# updates is a dummy node that creates a dependency, forcing TensorFlow to also compute the updates to w1 and w2
# This is needed because w1 and w2 are Variables that live inside the graph (on the GPU)
updates = optimizer.minimize(loss)

##################
# PART II:
# Now, run the graph
with tf.Session() as sess:
    # Set up the weight variables w1 and w2
    sess.run(tf.global_variables_initializer())

    # Create some data that will feed into the graph. In this example, we
    # make random matrices and store them as values in a dictionary
    values = {
        x: np.random.randn(N, D), 
        y: np.random.randn(N, D),
    }

    # This actually runs the graph. 
    losses = []
    for t in range(50):
        # sess.run takes 2 params
        # The first is the values we want to compute, and the second is the list of input values.
        # it returns the results in the out variable.
        loss_val = sess.run([loss, updates], feed_dict = values)


There are wrapper libraries, like Keras, that wrap around TensorFlow and
hide a lot of the boilerplate code.
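
For a sense of what that looks like, here is a minimal Keras sketch of the same 2-layer ReLU
network with an L2-style (mean squared error) loss; argument names vary between Keras versions,
so treat this as illustrative only.

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

N, D, H = 64, 1000, 100

model = Sequential()
model.add(Dense(H, input_dim=D, activation='relu'))   # first layer with ReLU
model.add(Dense(D))                                    # second (linear) layer
model.compile(loss='mean_squared_error', optimizer=SGD(lr=1e-5))

x = np.random.randn(N, D)
y = np.random.randn(N, D)
model.fit(x, y, epochs=50, batch_size=N, verbose=0)    # runs the training loop for us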




PyTorch (Facebook)
==================
There are 3 different levels of abstraction
- Tensor - like a numpy ndarray, but it can run on the GPU

- Variable - a node in a computational graph. It stores both the data and the gradient
    Variable is similar to the Variable or placeholder objects in TensorFlow

- Module - a neural net layer, so it may hold learnable weights
    Modules are similar to the wrapper libraries (like Keras) around TensorFlow

1. Tensors
-----------
Tensors are just numpy with GPU functionality.

#########################################################
# Tensors are like numpy arrays. There is no built-in notion of a 
# computational graph. But, Tensors can run on the GPU if specified.
# Here is a simple 2-layer network using ONLY Tensors
# No numpy needed
import torch

# Use CPU tensors by default; uncomment the second line to run on a GPU
dtype = torch.FloatTensor
#dtype = torch.cuda.FloatTensor

# Create some random tensors for data and weights
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in).type(dtype)
y = torch.randn(N, D_out).type(dtype)
w1 = torch.randn(D_in, H).type(dtype)
w2 = torch.randn(H, D_out).type(dtype)

learning_rate = 1e-6
for t in range(500):
    # Forward pass
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)
    loss = (y_pred - y).pow(2).sum()

    # Backward pass. Compute gradients for each layer
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update the weights using the gradients
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2



2. Variables
------------
Variables are a node in a computational graph.
If x is a Variable, then 
    x.data is a Tensor
    x.grad is a Variable of gradients with same dimension as x.data
    x.grad.data is a Tensor of gradients

Tensors and Variables have the exact same API. So, anything that works on a Tensor will
also work on Variables (the magic of subclassing with operator overloading)

Variables record how they were created, so they can do BackProp.

#########################################################
# Here is a simple 2-layer network using Variables
import torch
from torch.autograd import Variable

# Create some random Variables for data and weight
# The requires_grad flag says whether we will want to compute
# gradients with respect to this Variable.
N, D_in, H, D_out = 64, 1000, 100, 10
x = Variable(torch.randn(N, D_in), requires_grad=False)
y = Variable(torch.randn(N, D_out), requires_grad=False)
w1 = Variable(torch.randn(D_in, H), requires_grad=True)
w2 = Variable(torch.randn(H, D_out), requires_grad=True)

# Run the graph.
# Note, we build up a new graph on each iteration.
learning_rate = 1e-6
for t in range(500):
    # Forward pass
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    loss = (y_pred - y).pow(2).sum()

    # Backward pass. Compute gradients for each layer
    if w1.grad: w1.grad.data.zero_()
    if w2.grad: w2.grad.data.zero_()
    loss.backward()

    # Update the weights using the gradients
    w1.data -= learning_rate * w1.grad.data
    w2.data -= learning_rate * w2.grad.data


You can customize the autograd function which does the forward and backward layers:
You can define forward and backward for any Variable.
Here is a new ReLU

class ReLU(torch.autograd.Function):
    def forward(self, x):
        self.save_for_backward(x)
        return x.clamp(min=0)

    def backward(self, grad_y):
        x, = self.saved_tensors
        grad_input = grad_y.clone()
        grad_input[x < 0] = 0
        return grad_input


#########################################################
# Here is a simple 2-layer network using Variables and
# our custom ReLU
import torch
from torch.autograd import Variable

# Create some random Variables for data and weight
# The requires_grad flag says whether we will want to compute
# gradients with respect to this Variable.
N, D_in, H, D_out = 64, 1000, 100, 10
x = Variable(torch.randn(N, D_in), requires_grad=False)
y = Variable(torch.randn(N, D_out), requires_grad=False)
w1 = Variable(torch.randn(D_in, H), requires_grad=True)
w2 = Variable(torch.randn(H, D_out), requires_grad=True)

# Run the graph.
# Note, we build up a new graph on each iteration.
learning_rate = 1e-6
for t in range(500):
    # Forward pass
    relu = ReLU()
    y_pred = relu(x.mm(w1)).mm(w2)
    loss = (y_pred - y).pow(2).sum()

    # Backward pass. Compute gradients for each layer
    if w1.grad: w1.grad.data.zero_()
    if w2.grad: w2.grad.data.zero_()
    loss.backward()

    # Update the weights using the gradients
    w1.data -= learning_rate * w1.grad.data
    w2.data -= learning_rate * w2.grad.data




3. Modules for Higher level Neural Networks
----------------------------------------------

This is like Keras or other wrappers for TensorFlow

#########################################################
# Here is a simple 2-layer network using the nn module
import torch
from torch.autograd import Variable

# Create some random Variables for data and weight
N, D_in, H, D_out = 64, 1000, 100, 10
x = Variable(torch.randn(N, D_in))
y = Variable(torch.randn(N, D_out))

# Define the graph
model = torch.nn.Sequential(
            torch.nn.Linear(D_in, H),
            torch.nn.ReLU(),
            torch.nn.Linear(H, D_out))
loss_fn = torch.nn.MSELoss(size_average=False)

# Run the graph.
# This still builds up a new computational graph for each iteration
learning_rate = 1e-6
for t in range(500):
    # Forward pass
    y_pred = model(x)
    loss = loss_fn(y_pred, y)

    # Backward pass. Compute gradients for each layer
    model.zero_grad()
    loss.backward()

    # Update the weights using the gradients
    for param in model.parameters():
        param.data -= learning_rate * param.grad.data




#########################################################
# Here is a simple 2-layer network using the optimizer module
# These can use several different update rules, like Adam or more.
import torch
from torch.autograd import Variable

# Create some random Variables for data and weight
N, D_in, H, D_out = 64, 1000, 100, 10
x = Variable(torch.randn(N, D_in))
y = Variable(torch.randn(N, D_out))

# Define the graph
model = torch.nn.Sequential(
            torch.nn.Linear(D_in, H),
            torch.nn.ReLU(),
            torch.nn.Linear(H, D_out))
loss_fn = torch.nn.MSELoss(size_average=False)

# Run the graph.
# This still builds up a new computational graph for each iteration
learning_rate = 1e-6
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(500):
    # Forward pass
    y_pred = model(x)
    loss = loss_fn(y_pred, y)

    # Backward pass. Compute gradients for each layer
    optimizer.zero_grad()
    loss.backward()

    # Update the weights using the gradients
    optimizer.step()


You can also define new Modules.
A module is a single layer of a neural net, and takes Variables as inputs and outputs
Each module contains weights
Modules can nest so a parent module can be made up of child modules

This seems to be a common design pattern.


#########################################################
# Here is a simple 2-layer network defined as our own nn Module subclass,
# trained with the optim package (which supports update rules like SGD, Adam and more).
import torch
from torch.autograd import Variable

# Define our own nn module
class TwoLayerNet(torch.nn.Module):
    # The initializer. linear1 and linear2 are the two layers
    # Layers are of type torch.nn.Linear, which holds its own weight matrix and bias.
    def __init__(self, D_in, H, D_out):
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    # Forward. This will pass input to layer1, then the result of layer1 to layer2.    
    def forward(self, x):
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred

# Create some random Variables for data and weight
N, D_in, H, D_out = 64, 1000, 100, 10
x = Variable(torch.randn(N, D_in))
y = Variable(torch.randn(N, D_out), requires_grad=False)

# Define the graph
model = TwoLayerNet(D_in, H, D_out)

# Run the graph.
# This still builds up a new computational graph for each iteration
criterion = torch.nn.MSELoss(size_average=False)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
    # Forward pass
    y_pred = model(x)
    loss = criterion(y_pred, y)

    # Backward pass. Compute gradients for each layer
    optimizer.zero_grad()
    loss.backward()

    # Update the weights using the gradients
    optimizer.step()


You can also use the DataLoader library, which builds mini-batches of
data. It also has worker threads to prefetch the next round of data.
You will often write your own DataLoader for your particular files or data source.

#########################################################
# Here is a simple 2-layer network using a DataLoader to iterate over
# mini-batches of the data.
import torch
from torch.autograd import Variable
from torch.utils.data import TensorDataset, DataLoader

# Define our own nn module
class TwoLayerNet(torch.nn.Module):
    # The initializer. linear1 and linear2 are the two layers
    # Layers are of type torch.nn.Linear, which holds its own weight matrix and bias.
    def __init__(self, D_in, H, D_out):
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    # Forward. This will pass input to layer1, then the result of layer1 to layer2.    
    def forward(self, x):
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred

# Create some random Variables for data and weight
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Create the dataloader
loader = DataLoader(TensorDataset(x, y), batch_size=8)

# Define the graph
model = TwoLayerNet(D_in, H, D_out)

# Run the graph.
# This still builds up a new computational graph for each iteration
criterion = torch.nn.MSELoss(size_average=False)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
# Each iteration of the outer loop is one complete pass through all data
for epoch in range(10):
    # Each iteration of the inner loop is one chunk of data
    for x_batch, y_batch in loader:
        # A DataLoader returns Tensors, so we need to wrap these in variables.
        x_var, y_var = Variable(x_batch), Variable(y_batch)

        # Forward pass
        y_pred = model(x_var)
        loss = criterion(y_pred, y_var)

        # Backward pass. Compute gradients for each layer
        optimizer.zero_grad()
        loss.backward()

        # Update the weights using the gradients
        optimizer.step()



PyTorch provides pre-trained models.
Example:

https://github.com/pytorch/vision

import torch
import torchvision

alexnet = torchvision.models.alexnet(pretrained=True)
vgg16 = torchvision.models.vgg16(pretrained=True)
resnet101 = torchvision.models.resnet101(pretrained=True)


Use the Visdom package to watch loss statistics.


In TensorFlow, you first create a graph and then run many instances on the same graph.
    This is a "Static Graph"
    The framework can optimize the graph for speedup on later iterations.
    You can then run the graph without using the code that built the graph.
    The graph can be serialized to a data file, then run later in a different language.
    Train in Python and run in C++

In Pytorch, you create a new graph on each iteration.
    This is a "Dynamic Graph"
    The code is cleaner and simpler
    It lets the code make dynamic changes to the functions in the graph, for example
        use different weight matrices depending on the data. You can also do this in TF, but
        it requires special control-flow declarations in the graph. Now you are programming 
        with TF not Python
    Loops and conditionals in the graph are much simpler to declare. Again, you are
        programming in Python not TF's functional programming declarations

    
Dynamic graphs are useful in recurrent networks, which operate on sequences of varying length.
For example, a sentence may have a variable number of words

You may also have recursive networks (e.g., over parse trees).
You may also have modules of networks that are assembled in different ways depending on the input.





Caffe (UC Berkeley)
====================
Call pre-existing binaries, no need to write code
1. Convert data using binaries
2. Define a network with a text file
3. Define solver with another data file
4. Run the config files with another binary


Caffe2 (Facebook)
=================
New (released 2017)
Can define the graph with Python, like Pytorch, but then emit the graph data files
that can run later, and also be deployed in a production environment.

Advice
=====
TensorFlow is safe, and works well with many machines. Better for mobile.
PyTorch is best for research but is newer




================================================================================
Lecture 9
CNN Architectures
================================================================================

LeNet-5
==========
LeCun et al, 1998
First convnet that was used successfully in practice
Took an input image, used 5x5 filters, approx 5 layers.

AlexNet
========
Krizhevsky et al, 2012
First large scale conv net that did well on ImageNet. It outperformed all previous models.
It started the conv net trend. It was used widely for many different tasks until a few years ago, 
when it was surpassed in accuracy by newer models.

It uses a lot of data augmentation
This uses dropout 0.5
Batchsize = 128
It uses ReLU. It also used SGD with momentum 0.9
The learning rate is 1e-2, and it is manually reduced by a factor of 10 when the validation accuracy plateaus
L2 with weight decay 5e-4
An ensemble of 7 CNNs was used for the final result

Layers:
Conv1 -> MaxPool1 -> Norm1 -> Conv2 -> MaxPool2 -> Norm2 -> Conv3 -> Conv4 -> Conv5 -> MaxPool3 -> FC6 -> FC7 -> FC8
So, 5 conv layers and 3 fully connected layers

First layer
Inputs are 227 x 227 x 3 images
96 different 11 x 11 filters, stride=4
Size of output = ((227 - 11) / 4) + 1 = 55, so the output volume is 55 x 55 x 96
There are 96*(11*11*3) = ~35K parameters

Pool1
This is 3x3 filters with stride=2
Output = ((55-3)/2) + 1 = 27, so 27 x 27 x 96
There are no parameters, a pooling layer does not have params. 

Conv1 96 11x11 filters at stride=4, pad=0
MaxPool1 3x3 filters at stride=2
Norm1
CONV2 256 5x5 filters at stride=1, pad=2
MaxPool2 3x3 filters at stride=2
CONV3 384 3x3 filters at stride=1, pad=1
CONV4 384 3x3 filters at stride=1, pad=1
CONV5 256 3x3 filters at stride=1, pad=1
MaxPool3 3x3 filters at stride=2
FC6 4096 neurons
FC7 4096 neurons
FC8 1000 neurons


VGG
=======
In 2014, there was another jump in the ImageNet competition.
The 2014 networks were much deeper and much better than AlexNet.
GoogLeNet and VGG were both very strong networks.

This is a much deeper network with smaller filters.
They went from 8 layers in Alexnet to 16 or 19 layers
They also used much smaller convolutions - 3x3
They also used a 2x2 MAX POOL for downsampling, with stride of 2

So, why use smaller filters?
A stack of 3x3 filters in a deeper pipeline has the same effective receptive field as a 7x7 conv filter:
At layer 1 - the field is 3x3
At layer 2 - the field is 5x5: you take a 3x3 of 3x3's, but there is a lot of overlap. Only the first-layer
windows at the edges introduce new pixels, and they only add one extra row/column beyond the previous
field, on each side and at the top/bottom.
At layer 3 - the field is 7x7

A stack of three 3x3 layers has the same receptive field as a 7x7, but because it is 3 layers, there are more
non-linearities. You also have fewer parameters with the smaller, deeper stack.
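
As a worked comparison (assuming C input channels and C filters per layer, ignoring biases):
    One 7x7 conv layer:            7*7*C*C = 49*C**2 parameters
    Three stacked 3x3 conv layers: 3*(3*3*C*C) = 27*C**2 parameters
Same 7x7 receptive field, roughly half the parameters, and two extra non-linearities.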

This is a deeper network, which means it has more layers.
This does not refer to the depth per pixel, which is the number of filters generating 
properties for each pixel.

Each filter is a different pattern we are looking for.
Deeper nets may have more combinations of patterns
Often, there are more filters at deeper layers, but this is not required.

Notation: Conv3-64 means 3x3 convolutions with 64 filters.

It did not use "Local Response Normalization" (LRN), because it did not help, so they just removed it.


Classification - Identify the type of image
Localization - Classification, and draw a bounding box around the object to separate it from background
Detection - like localization, but there may be several instances of the object in the image



GoogLeNet
==========
Szegedy et al, 2014
This is a deep network, with 22 layers
The goal was to make a more computationally efficient network.
There are *NO* fully connected layers, which greatly reduces the number of parameters
    AlexNet had about 60 million params, while GoogLeNet has only 5 million

It created a building block called an "Inception module"
This is a building block of 3 layers. You then stack these inception modules

The input goes into 4 layers in parallel
    1x1 Conv layer, output is WxHx128
    3x3 Conv Layer, output is WxHx192
    5x5 Conv layer, output is WxHx96
    3x3 MaxPool layer, output is WxHx256

The output of all 4 branches then goes into a "Filter Concatenation" module.
The "Filter Concatenation" module stacks the outputs along the depth dimension, giving more channels per spatial position.
    So the output depth is 128 + 192 + 96 + 256 = 672 channels in this particular example
The output of the "Filter Concatenation" module is the output of the entire Inception module.

The problem is this creates a very large depth (number of channels) in the output.
It keeps the same spatial dimensions of the image but greatly increases the depth per position.
This requires a lot of dot-products and is computationally expensive
Also, the pooling branch preserves the input depth, so the total output depth
    will increase with each Inception module

Google added "bottleneck layers", which are 1x1 Convs (so they preserve the spatial dimensions)
but reduce the depth per position.

They put these bottleneck layers before the 3x3 and 5x5 convs, and after the pooling layer.
The total depth of the final output is still increasing, but the bottlenecks reduce the number
of calculations for all of the dot products.

The inception modules are stacked, 9 Inception modules (each has 2 layers)
There are also a few Conv modules before and after the stack of inception layers
Total 22 layers

There is a single backprop that goes through all layers in all modules.


ResNet
========
He et al, 2015
This won imageNet in 2015

It had 152 layers. This tried to explore what happens when we keep adding more and
more layers.
A deeper (56 layer) network may have *worse* test error and also *worse* training error
than a less deep (20 layer) net. The deeper net has worse training error as well, so the
problem is not overfitting. Instead, the problem is inadequate optimization.

For example, take a less deep model (like 20 layers) and add 20-30 identity-mapping layers on top.
That deeper net could at least reproduce the shallow net's training error, so the fact that it does
worse means the problem is how we optimize.

The solution is to add the input back in after every two layers.
It introduced the idea of "residual connections"

A typical map:
input X --> CONV --> ReLU --> CONV --> H(x) is the output

A "residual block":
input X --> CONV --> ReLU --> CONV --> F(x) is the output --> F(x) + Original input X --> ReLU
There is a "skip connection" where the original input is added

We always add x to the output: H(x) = F(x) + x
So the layers only have to learn the residual F(x) = H(x) - x, rather than the full mapping H(x).
The "residual" is F(x); the output is H(x) = F(x) + x.
In effect, the weights learn an adjustment to x, and then x is added back to whatever they compute.

Alternatively, instead of thinking of the function as a multiplicative scaling of the inputs,
we think of it as an added offset to the inputs, where the offset is itself a function of x.
So, where each output was M*x, it is now x + b(x), where b(x) is what the conv layers compute
(roughly x + N*x for some learned N).

Learning the residual seems to be less noisy than learning the multiplicative scaling.
This is a big shift, instead of thinking of the pipeline as scaling the input, it now adds
varying offsets to the inputs.

Or, it is just replacing each function from a dot product M*x to a slightly more
complex equation b(x) + M*x

As a side note, if the weights are 0, then this simplifies to
    H(x) = 0*x + x = x
In other words, the training can decide whether some layers should be removed by optimizing
the parameters to 0.

Also, the back propagation will send gradient down both inputs to the final
    H(x) = F(x) + x
This means the gradient backprops to *both* the conv layers (which compute F(x)) and
to the original input x.
This is significant, because it allows the gradient of the output of a layer to directly affect the 
inputs, which are the outputs of the previous layer.
This allows gradients to propagate directly to all layers of the network.

ResNet would stack Residual blocks.
ResNet used 3x3 CONV layers
Within each block there is a 1x1 CONV that reduces the depth per pixel, a bottleneck layer 
like GoogLeNet.
They would use Batch Normalization after each CONV layer
They used Xavier/2 initialization
They used SGD + Momentum (momentum rate 0.9) for optimization
Learning rate was 0.1, but divided by 10 when the learning plateaued
Mini batch size is 256
Weight decay of 1e-5
No dropout
Between residual blocks, they would also periodically double the number of filters
and spatially downsample.
There is a single Fully-connected layer at the end to output the classes
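
A minimal PyTorch sketch of a basic residual block (3x3 conv -> BN -> ReLU -> 3x3 conv -> BN,
plus the skip connection). This is a simplified illustration; the real ResNet blocks also handle
downsampling and the 1x1 bottleneck convolutions.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

class BasicResidualBlock(nn.Module):
    def __init__(self, channels):
        super(BasicResidualBlock, self).__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)               # H(x) = F(x) + x, then the nonlinearity

block = BasicResidualBlock(64)
y = block(Variable(torch.randn(1, 64, 56, 56)))   # output keeps the same shape as the input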


Overall:
VGG is the least efficient (most memory and computation).
AlexNet has the lowest accuracy and is not memory efficient.
ResNet has good accuracy and medium efficiency.
GoogLeNet is the most efficient and also very accurate.


Other layers:
Network in Network (Lin et al, 2014) introduced the idea of bottleneck layers


There are also some new efforts focused on improving on ResNet
==============================================================
- ResNet later versions
He et al, 2016
New block design

- Wide Residual Networks
Zagoruyko et al, 2016
Use residuals, but not a deep network.
Wide network (with more filters)

- ResNeXt
Xie et al, 2016
Wide network (with more filters)
Several parallel data paths in each computational block

- Stochastic Depth
Huang et al, 2016
Drop out layers during training, then use them in test.

Alternatives to Resnet
================================

- FractalNet
FractalNet: Ultra-Deep Neural Networks without Residuals
Larsson et al 2017
Both shallow and deep pathways
Drop out layers during train time

This adds shortcut paths to the network.
This allows gradients to flow more directly backward.


- Densely Connected Convolutional Networks
Huang et al, 2017

- SqueezeNet
Iandola et al, 2017





================================================================================
Lecture 10
Recurrent Neural Networks
================================================================================

A "vanilla" network takes a single input, passes it through several layers, and has a single output.
    In -> Layer1 -> layer2 -> out

A recurrent Neural network may have many different network topologies

One to many
-------------
For example, images are mapped to a sequence of words for captioning.
The caption may have a variable number of words.

In -> layer11 -> layer12 -> out1
        |
      layer 21 -> layer22 -> out2
        |
      layer 31 -> layer32 -> out3


Many to one
-------------
For example, input video may have a variable number of frames
Or input may be a sequence of words with a variable number of words

in1 -> layer11
        |
in2 -> layer 21
        |
in3 -> layer 31 -> out


Many to many
-------------
For example, map a sequence of words in one language to a sequence of words in another language.
The sequences may have different lengths, and the input may be a different size than the output.

In1 -> layer11 -> layer12 -> out1
        |
In2 -> layer 21 -> layer22 -> out2
        |
In3 -> layer 31 -> layer32 -> out3


Recurrent neural networks are also useful when the input and output data have fixed size.
For example, the network may look at N different parts of the image for classification.

A recurrent neural network is a network layer that has internal state.
The state is updated for each new input. 
So, the net is not a pure function of its current input; it is not a pure functional model.
The state is fed back into the model along with each new input,
and the updated state is part of the output of each computation.

    in -> Layer(state) -> out

Or as an expression
    h(t) = F(h(t-1), x(t))
Where:
    x(t) is input at step t
    h(t-1) is the old state
    h(t) is new state

A Fully-connected layer then may map h(t) to output(t)
F is the same at every time step, and uses the same weights matrix at every time step.
So, state is implemented as additional inputs, not the specification of F itself

So, the complete definition of a simple implementation of a recurrent is:
    h(t) = tanh(W1*h(t-1) + W2*x(t))
    y(t) = W3*h(t)
Note, this has 3 weight matrices
Note also we use tanh as the non-linear step, despite its previously discussed problems.
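
A minimal numpy sketch of this recurrence, using the same W1 (hidden-to-hidden), W2
(input-to-hidden) and W3 (hidden-to-output) naming; the sizes are illustrative only.

import numpy as np

D_in, D_h, D_out = 10, 8, 5
W1 = np.random.randn(D_h, D_h) * 0.01     # hidden-to-hidden
W2 = np.random.randn(D_h, D_in) * 0.01    # input-to-hidden
W3 = np.random.randn(D_out, D_h) * 0.01   # hidden-to-output

def rnn_step(h_prev, x):
    h = np.tanh(W1.dot(h_prev) + W2.dot(x))   # new hidden state
    y = W3.dot(h)                             # output at this time step
    return h, y

h = np.zeros(D_h)                             # initial state h(0)
for x_t in np.random.randn(4, D_in):          # a short input sequence
    h, y = rnn_step(h, x_t)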

We initialize this with an initial state h(0).
Alternatively, we can think of a recurrent neural network as taking 2 inputs
This "unrolls" the recursive function.

   h(0) --> f --> h1 --> f --> h2 --> f --> h3 ...... --> f --> hn
        x1           x2           x3           ...... xN

This continues until we have fed all available inputs x1...xN
Additionally, we can think of a single node W, which is the weight node, that feeds into all f() nodes

   W -------+------------+------------+-------....... ----+
            |            |            |                   |
   h(0) --> f --> h1 --> f --> h2 --> f --> h3 ...... --> f --> hn
        x1           x2           x3           ...... xN

Making this explicit is important. Backprop will send its gradient to W at each step,
so W gets the sum all of the gradients from each time step.

Note, each h(t) also feeds into the output weight matrix (W3 above), which computes the output Y(t)
We can also make each loss L(t) explicit

   W -------+------------+------------+-------....... ----+
            |            |            |                   |
   h(0) --> f --> h1 --> f --> h2 --> f --> h3 ...... --> f --> hn
        x1  |        x2  |        x3  |        ...... xN  |
            \/           \/           \/                  \/
            Y1           Y2           Y3       .......    Y4
            |            |            |                   |
            \/           \/           \/                  \/
            L1           L2           L3                  L4

This requires a groundtruth label for each timestep of the sequence
The loss for the entire training step is the SUM of the loss from each timestep.
The global loss is then fed back into each timestep to compute the local gradient.
Then, all of these local gradients are added together to compute the total gradient on W

If there are variable inputs, (many to one) then we can feed them all in one at a time.
    Run the net once for each input
If there is one to many, then feed in the one input, but run the neural net many times 
    to compute each output.
In both cases,
    Each run will generate a local gradient. Only after you have run it all the times do you
    add the gradients to finally update W

For many-to-many, you have a separate encoder and decoder neural net.
    The encoder is a many-to-one, and summarizes the variable length input as a single final state
    The decoder is a one-to-many, and will take the single final state and output a variable sized output



**********************
Many-to-one: Map a series of tokens onto a feature vector
One-to-Many: Map a feature vector onto a series of tokens
Many-To-Many: Map a series of tokens onto a series of tokens
**********************

Language Modelling
This is a common application of recurrent neural networks.
Each input may be a separate character or word.
Consider word completion. It takes a sequence of input characters and tries to predict the next character.

Each input is a one-hot vector (e.g. 26 elements for letters), all 0's except a 1 at the position corresponding
to that letter.
Each output is a vector of the same size, containing scores that predict the likelihood of each letter
being the next letter in the word.

Each input is combined with the previous hidden state to create a new hidden state, and the
output is computed from the hidden state:
    h = tanh(W1*h + W2*x)
    y = W3*h


We wait until the end of the sequence, and then add up all of the gradients and only then update W.
This is called "Backpropagation through time"
This can be a problem if the sequences are long, we do not update during each input, only at the end
of the whole sequence.

In practice, people use "Truncated Backpropagation through time"
Run the model forward for K steps (often K=100), then back-propagate and update W, then repeat
for the next K steps and so on.

Here is an example of some code:
https://gist.github.com/karpathy/d4dee566867f8291f086


We can train this, then seed the network with a few starter characters, and it will generate
reasonable text. Take each predicted character and feed that back in as the next character.
If a network is well trained, then it can generate reasonable looking text.



So, how is the system learning?
To understand it, look at the hidden state vector at each time step, and show the 
patterns. Some cells seem to turn on after you start a quote, or get near a line break.
So, larger structure of the text seems to be identified and explicitly represented in the
state vector.
See:
Karpathy, Johnson, and Fei-Fei: Visualizing and Understanding Recurrent Networks
ICLR Workshop 2016

**********************
Even though we only pass in one character at a time, the system learns larger scale
ideas (like structure of quotes or paragraphs or C-code structure) because the state 
spans characters.
**********************

**********************
The state acts like a context, or a Bayesian probability. It uses the first steps to
change the likelihood of later steps.

In General
- Up to now, we have described Neural Networks as pure functional programming.
- Recurrent NN introduce the idea of state.
- We enter inputs sequentially, and accumulate state that spans a series of inputs
Previously, we used ground truth to feedback to the weights.
How do we train the state? What defines a "correct" state?
    Suppose x --> W1 * x ---> State --> W2 * state ---> output
We could have a fixed matrix W2 that maps state to output
Then, use the ground truth to feedback to the W1 matrix that generates the internal state.

**********************

There are also applications to vision, such as image captioning.
This can be useful for labelling a single image with many different properties.

Karpathy and Fei-Fei, "Deep Visual-Semantic Alignments for Generating Image Descriptions"
Mao et al, "Explain Images with Multimodal Recurrent Neural Networks"
Vinyals et al, "Show and Tell: A Neural Image Caption Generator"

Pass an image into a Convolutional Neural Net, which will emit a vector of feature flags.
Then, pass the feature vector into a Recurrent Neural Net to map features into annotations.
Note, we may pass in the flattened feature vector from near the end of the CNN pipeline
(e.g. the last FC layer before the classifier), not the final class scores.

We initialize the Recurrent NN with an initial internal state, and pass in the vector
from the CNN output. The Recurrent NN has formula
    h = tanh((W1 * x) + (W2 * h) + (W3 * V))
    Y = W4 * h
Where
    h is the hidden state
    x is the previously generated word, which is Y[-1]
    v is the feature vector from the CNN
    Y is the output caption, the next X, X[+1]

*******************
Why do we need both X and h? They are both state because X is the previous Y.
This seems redundant.
    
*******************

There is a special <END> token that can be returned as a Y value.
This tells us to stop iterating over the model.


Other models can use "Attention"
The CNN may return a grid of vectors, each vector for one part of the image.
The RNN can also select which part of the image to process next. This is just
    a second output that is used to select the section of the image to examine for
    the next state.


At this point, neural nets are modules, and we can combine them into larger systems.


Multiple Layers
===============
Recurrent Neural Networks can also have more than one layer
For example, there are K layers (1...K) and T time periods (1...T).
The state of layer k at time t is:

    h[k][t] = F_k( h[k-1][t], h[k][t-1] )

You take 2 inputs:
1. The state from your own layer from the previous time period
2. The output of the previous layer from the same time period

You then do several things:
    Combine the inputs (usually just concatenate them)
    Multiply by a weight matrix
    Pass the result through a non-linear function, like tanh
In practice, 2, or 3, or 4 layer RNNs are useful. 

To Backprop
    We receive dL/dy
    Pass back through tanh
    Multiply by the transpose of the weight matrix.
If the sequence (or network) is deep, then repeatedly multiplying by the same weight matrix will
cause the gradient to either explode or vanish.

A hack is "Gradient Clipping". If the gradient is large, then just clamp it to a
max threshold.

    gradientNorm = np.sqrt(np.sum(grad * grad))
    if gradientNorm > threshold:
        grad *= (threshold / gradientNorm)

This is a bit crude, but seems to prevent exploding gradients. However, it does not
help with the problem of vanishing gradient. For that, we need LSTM:


Long Short Term Memory (LSTM)
=============================
Hochreiter and Schmidhuber, "Long Short Term Memory", Neural Computation, 1997

This uses a more complex recurrence relation
There are 2 hidden states at each time step: h[t] and c[t]. C[t] is called the "Cell state"

A module takes 2 inputs: X and h[t-1] just like a regular RNN
It then computes 4 intermediate values, called "gates" which are used to update the internal cell state

Like a vanilla RNN, the LSTM will first concatenate h[t-1] and X to make a longer vector.
However, the LSTM then multiplies this long vector by a large matrix to compute
4 gate vectors:
    i - input gate - how much to write into the cell state
    f - forget gate - how much to erase the cell state
    g - "gate" gate - what candidate values to write into the cell state (tanh, in [-1, 1])
    o - output gate - how much of the cell state to reveal in the hidden state h[t]

The cell state is multiplied (elementwise) by the forget gate, and the gated candidate values are added:
    c[t] = (f * c[t-1]) + (i * g)
    h[t] = o * tanh(c[t])
So, the cell state is a set of counters that are incremented/decremented at each time step
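
A minimal numpy sketch of one LSTM step with the four gates above. The single big matrix W maps
the concatenated [h[t-1], x] vector onto the stacked gate pre-activations; the sizes are
illustrative only.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

D_h, D_in = 8, 10
W = np.random.randn(4 * D_h, D_h + D_in) * 0.01   # one matrix computes all 4 gates

def lstm_step(h_prev, c_prev, x):
    z = W.dot(np.concatenate([h_prev, x]))
    i = sigmoid(z[0*D_h : 1*D_h])   # input gate
    f = sigmoid(z[1*D_h : 2*D_h])   # forget gate
    o = sigmoid(z[2*D_h : 3*D_h])   # output gate
    g = np.tanh(z[3*D_h : 4*D_h])   # candidate values ("gate" gate)
    c = f * c_prev + i * g          # update the cell state
    h = o * np.tanh(c)              # reveal part of the cell state
    return h, c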

Backpropagation
Along the cell-state path, the gradient is only multiplied (elementwise) by the forget gate. This is
a different value at each time step, so we do not repeatedly multiply by the same weight matrix, and there is
less chance of exploding or vanishing gradients.

This direct path, which allows gradients to flow backward with little change or computation
is similar to ResNet.

These equations seem arbitrary and magical, but they have been shown to work well in practice.


Other RNN Variants
=====================

1. Gated Recurrent Unit (GRU)
Cho et al, 2014
"Learning Phrase Representations Using MN Encoder-Decoder for Statistical Machine Translation"
This tried different recurrence equations for LFTM

2. Jozefowicz et al, 2015
An Empirical Exploration of Recurrent Network Architectures
This ran a search over different recurrent architectures.



================================================================================
Lecture 11
Detection and Segmentation
================================================================================


Until now, we have been doing just image classification.
A NN takes in an entire image, and outputs a vector that has an entry for every possible class.
It maps the entire image onto a class.


Semantic Segmentation
=====================
Identify each pixel as belonging to a different class. For example, Cat, grass, sky, tree.
This does not differentiate instances, so if there are 2 cats, the pixels for both are just 
labelled "cat" and we don't separate the instances.

One simplistic but bad approach is to use a sliding window, called a crop, and for each position,
classify the central pixel of the crop.
This might work, but it is extremely expensive - you need a whole crop (and a full forward pass) for each pixel.
It also replicates a lot of work.

A slightly better approach is a fully convolutional network, with no FC layers. Each layer is a
3x3 convolution with padding so the spatial size is preserved. If we pass an image through, the
output is a huge volume of size C x H x W, where C is the # of classes. So this gives a score for
every class at every pixel.
This just uses existing techniques. 
The training data is hard, because we need data that labels every pixel in an image.
The net also is expensive to run, because all convolutions must preserve all pixels, so you
do not shrink the image.

An approach that is actually used will (i) downsample the image, then (ii) upsample the image.
Downsampling clumps related pixels together and identifies regions.
Upsampling takes a class for a region and applies it to each pixel.
This does most convolutions on a downsampled, smaller image.
In simple image classification, we would then pass the last downsampled image to a fully-connected
layer and then generate the output.
Instead, here, we take the final downsampled image and then repeatedly upsample it.
This is efficient, because most computations are done on a smaller downsampled image.
It also intuitively makes sense that we think about the image as regions or blobs, and
look at the overall shape and arrangement of features rather than look at each pixel individually.
Downsampling combines the environment around each pixel with the pixel - it creates a
context for the pixel.

Examples of downsample include strided convolutions and pooling
Examples of upsampling are:
===========================
- Nearest Neighbor Unpooling
----------------------------
Each output is the nearest input pixel. This just copies each input pixel to
several positions in the output image. One input pixel value may be used several
times.

- Bed of nails Unpooling
----------------------------
Each output is either 0 or the nearest input pixel. Each input pixel is only used once,
and we fill in missing pixels with 0.
Each input pixel maps to a 2x2 array of output pixels. We always assign the input pixel to the 
same position in the output 2x2 array, and then all 0's to all other pixels.

- Max Unpooling
----------------------------
This is sort of like bed-of-nails, except the input pixel goes to a different position in
each 2x2 output pixel.
Each upsample layer is associated with a downsample layer in the network.
When we downsample, we record which element we used. For example, if we use max-pooling, 
then we recall which input pixel had the max value.
Later, when we upsample, we do Bed-Of-Nails, but we assign the input pixel to the output 
pixel in the position of the max value when we downsampled.

- Transpose Convolution
----------------------------
This is a learnable upsample. 
The input is smaller than the output - for example, input may be 2x2 and output
may be 4x4. 
Recall, when we downsample, the output pixel is the inner-product of the kernel matrix and
the input pixels. We may have a stride of 2, which means we move input window 2 pixels 
when we move output position by 1 pixel.

A Transpose Convolution Upsample maps a smaller input to a larger output, for example a
2x2 region to a 4x4 region.
We do NOT do a dot product with the kernel matrix and the input region.
Instead, we take a single input pixel and multiply it against each entry in the kernel matrix.
So, each input pixel makes a 3x3 array of output values. We copy those output values
to the output matrix. So, the kernel matrix is really a matrix of weights.
When we move each input pixel, we move by a stride, like 2, in the output image.
Each output pixel may be mapped by several different input pixels with overlapping
projections onto the output image. In this case, we just ADD all of the generated output pixel
results.

This same operation is also called "Deconvolution", which is a misuse of the term.
This is also called "Upconvolution" and "Fractionally Strided Convolution"
This is also called "Backward strided convolution" because the forward pass of a transpose
convolution is the same math as the backward pass of a normal convolution.

Consider a 1-dimensional example.
Input [a  b]
Filter [x  y  z]
Output [ax   ay   az+bx   by   bz]
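
A minimal numpy sketch of this 1-D transpose convolution with stride 2 (the values are made up to
match the [a b] / [x y z] example above):

    import numpy as np

    def transpose_conv1d(inp, kernel, stride=2):
        # each input value is multiplied by the whole kernel and added into the output
        out = np.zeros(stride * (len(inp) - 1) + len(kernel))
        for i, v in enumerate(inp):
            out[i * stride : i * stride + len(kernel)] += v * kernel
        return out

    print(transpose_conv1d(np.array([2.0, 3.0]), np.array([1.0, 4.0, 5.0])))
    # [ 2.  8. 13. 12. 15.]  ==  [ax, ay, az+bx, by, bz] with a=2, b=3, x=1, y=4, z=5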

The name Transpose convolution comes from the math.
A 3x1 convolution with stride 1 can always be written as 
Kernel:
x  y  z  0  0  0
0  x  y  z  0  0
0  0  x  y  z  0
0  0  0  x  y  z
Inputs:  [0 a b c d 0]-Transpose   (note the 0's are padding)
Output:  [(ay + bz)  (ax + by + cz)   (bx + cy + dz)   (cx + dy)]-Transpose

A transpose convolution is just the transpose of the weight matrix
Kernel:
x 0 0 0
y x 0 0 
z y x 0
0 z y x
0 0 z y
0 0 0 z
Inputs: [a b c d]-Transpose
Output: [ax  (ay + bx)  (az + by + cx)   (bz + cy + dx)   (cz + dy)   dz]-Transpose


So, Semantic Segmentation has a pipeline where the first convolutional layers downsample,
and then the second half does upsampling 




Classification and Localization
================================
Find a specific class of thing in an image and draw a bounding box around it
We assume there is only one instance of an object in the image.

The basic architecture is:
Input image into a CNN
The last stage of the CNN is a 4096 vector and will then feed into 2 
different fully connected layers
1. Map 4096 vector to a vector of category scores
2. Map 4096 vector to a bounding box (topLeftX, topLeftY, Width, Height)

There are now 2 training loss functions
1. Loss function of actual classification vector and predicted.
This does backprop to the first Fully connected layer
The result is categorical (discrete) data so the loss function is usually something
like softmax loss.

2. Loss function of the actual bounding box and the predicted bounding box.
The result is continuous values, so the loss function is a "regression loss" 
which is usually L1 or L2. Usually, people will use an L2 loss function.
    Distance = SQRT((x' - x)**2 + (y' - y)**2 + (H' - H)**2 + (W' - W)**2)

In practice, we use a weighted sum of the 2 losses when we backprop through
the CNN. This weighting is really tricky, and is a hyperparameter.
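
A minimal sketch of the two-headed loss (the box_weight hyperparameter and the function names are
assumptions, not from the lecture):

    import numpy as np

    def softmax_loss(scores, true_class):
        # cross-entropy over the class scores
        p = np.exp(scores - scores.max())
        p /= p.sum()
        return -np.log(p[true_class])

    def l2_box_loss(pred_box, true_box):
        # (x, y, w, h) regression loss
        return np.sum((pred_box - true_box) ** 2)

    def total_loss(scores, true_class, pred_box, true_box, box_weight=10.0):
        # box_weight is the tricky hyperparameter that balances the two losses
        return softmax_loss(scores, true_class) + box_weight * l2_box_loss(pred_box, true_box)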



Object Detection
================
Find all instances of a class in an image. There may be more than 1.
Draw bounding boxes around them.

One approach is to use a sliding window.
For each crop of the image, pass through a CNN, and do an image classification. 
So, we classify each crop, and then aggregate all of the results. 
However, deciding how to choose the crops is difficult.
To be safe this would require many many crops, and is very expensive.

Region Proposal
Use standard computer vision to detect objects
- Edge detection
- Pixel color density

Look for blobby regions in the image, select crops around these, and then
do image classification.

Girshick et al
"Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation"
CVPR 2014

There are many different region proposal methods (e.g. Selective Search); each will generate
~2000 regions of interest.
The regions are all of different sizes, so first normalize their size.
Run each region through a convolutional network to classify the image.

There are several problems
- Still computationally expensive because of many proposed regions
- Hard coded region proposal algorithm, not learned

Fast R-CNN
Pass initial image through a CNN to generate high level feature map
Also, Pass initial image through a region proposal network
For each proposed region, crop the corresponding part of the CNN feature map.
So, we do the expensive computation only once for the entire image, not separately
for each region.

Next, pass the cropped feature set through a fully connected NN.
Now, the main perf bottleneck is the region proposal.

Faster-R-CNN has a much faster algorithm for generating region proposals.
The neural network will do region proposals
So, now the CNN does 4 things:
- Propose regions of possible objects
- Make bounding boxes of possible proposed objects
- Final classification score
- Final box coordinates

Ren et al
"Facter R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
NIPS 2015


There are other approaches for faster object detection
------------------------------------------------------

YOLO - You Only Look Once
-------------------------
Redmon et al, "You Only Look Once: Unified Real-Time Object Detection"
CVPR 2016

SSD - Single Shot MultiBox Detector
---------------------------
Liu et al, "SSD: Single Shot MultiBox Detector" ECCV 2016


Divide the input image into a grid, like a fixed 7x7
For each grid cell, make several fixed size bounding boxes.
For each bounding box in each grid, predict several things:
- Offset from the base bounding box to where the object is
- Classification scores for each object class

For each bounding box in each grid cell, compute 5 values (dx, dy, dw, dh, confidence)
So, we have 7 x 7 x (5 * B + C)
B is the # bounding boxes, C is # object classes
Then, pass the entire thing into a CNN

These combine region proposal and bounding box classification into a single
forward pass.
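
A minimal sketch of the output tensor described above (B and C are assumed values):

    import numpy as np

    S, B, C = 7, 2, 20                       # grid size, boxes per cell, classes (assumed)
    output = np.zeros((S, S, 5 * B + C))     # (dx, dy, dw, dh, confidence) per box + class scores
    print(output.shape)                      # (7, 7, 30)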



Instance Segmentation
=====================
Identify each pixel as belonging to a different instance of a specific class of thing.
Find all instances of all classes.
This is a combination of the previous problems

He et al
"Mask R-CNN" arXiv 2017

This resembles Faster R-CNN a bit.
Pass initial image through a CNN to generate high level feature map
Also, Pass initial image through a region proposal network
For each proposed region, take the feature map from the CNN for the region that
   was proposed by the feature proposal. 
Now, predict the segmentation mask for each proposal. This is an image-segmentation
problem within each region generated by the region proposal network.

This extends Faster R-CNN and it works very well.


================================================================================
Lecture 12
Visualizing and Understanding Convolutional Networks
================================================================================

Neural Nets seem to work - they pass the testing data with good accuracy.
However, they are a black box, and it is difficult to trust this thing when we do
not know what it does.

This lecture covers some experiments that try to describe what a CNN is doing.
It is also potentially useful for debugging.


Draw Filters as Images
==================================
The first layer of a ConvNet is simple filters.
We can draw these, and see the basic patterns they are looking for (like horizontal line).
For example, AlexNet has 64 11x11 filters at the first stage.
Represent the weights of the convolutional filters as RGB values
  (one weight for each channel, which is either R or G or B).
Then draw these as 11x11 images. There are 64 filters, so draw 64 images.
The convolutional filter just does an inner-product with the input image patch.
The input that maximizes the inner product is proportional to the filter weights themselves - this
is just linear algebra.
So, each filter is maximally activated by an image patch that looks like the filter itself.
We scale the weights to [0..255] range

Krizchevsky, "One Weird trick for parallelizing convolutional Neural Networks" arXiv 2014
He et al, "Deep Residual Learning for Image Recognition" CVPR 2016
Huang et al, "Densly Connected Convolutional Networks" CVPR 2017


But, when we do the same trick for the 2nd or 3rd or later visual layers, the data is not 
easily interpreted. The net is not thinking about the image itself, but rather on the features 
that were found at the first layer.



Nearest Neighbor for Final Feature Vector
==================================
We can also look at the output of the fully-connected layer that is second to the last layer.
The last layer maps features to classes, and generates a vector of class scores.
However, the second to last layer generates the vector of final feature values.
So the algorithm is:
- Pass a set of training images and record the feature vector for each image
- For each test image, get its feature vector
- Find the 5 training images whose feature vectors are nearest to the feature vector of the
test image. Display those training images. They may have very different pixel values but similar
features.


Principal Component Analysis
==================================
Reduce a N-dimensional space of features down to 2-dimensions.
Then, plot the 2 dimensions on a graph.
There is a more complex algorithm, called t-SNE (t-distributed stochastic neighbor embedding).

See:
Van der Maaten and Hinton "Visualizing Data using t-SNE" JMLR 2008

This will often show clusters
We can apply t-SNE to the last feature vector we recorded for a group of training images.


Activation Maps
=================
We can also look at the "Activation Maps" of intermediate layers.
For example, a feature map is 128 features x Width x height. 
This is a 3D chunk of numbers, which is called an activation volume.
We can instead view this as 128 separate 2-D maps with size width x height.
Each slice is an "Activation Map". It is a 2D matrix with the values for a single feature.
Look at each WxH slice 2D map as a grayscale image.
This will show bright spots at the part of the image that the feature map is focusing on.


Find Images that Light Up One feature at One layer
===================================================
We can also look at all of our training images and record the intermediate matrices for each
training image.
Pick a layer of the net. This will output a C x W x H activation volume.
Pick one feature, so this is a W x H slice in the activation volume, and this 2D slice
is called an activation map.
Show the sections of input images that cause maximal activation for this activation map
in all of the training images.

Occlusion experiments
=================
Take an image and occlude different parts of it, recording the classification vector for each
occlusion position. Slide the occluder around the image.
Record which occluder positions cause the final classification to change the most.
Those are the parts of the image that seem to be used the most.

Saliency Maps
=================
Which pixels are important to make a classification?
This is the gradient of the class score with respect to the input pixels.
For each input pixel, if we change it, how much does the classification change.
Then, draw the gradients as grayscale, and it shows which parts of the image are
important for classification.

Simonyan, Vedaldi, and Zisserman
"Deep Inside Convolutional Networks: Visualizing Image Classification Models
and Saliency Maps"
ICLR Workshop 2014


Guided Back Propagation
=======================
Do saliency maps, BUT use guided backprop, which essentially only propagates positive gradients,
not negative gradients.
So, it is a custom backprop that produces cleaner-looking saliency maps.


Gradient Ascent
=======================
This asks a more general question. Until now, we have used a test image and tried to
see how it is processed in the network. 
Instead, we can ask what types of pixels in general will affect a specific filter in the network.

We generate a synthetic image that causes max output from a neuron.
    Gradient descent - change weights to minimize loss
    Gradient ascent - fixed weights, but change pixels to maximize output

I* = arg max( f(I) + R(I) )
    f(I) is the neuron output value
    R(I) is a "natural image regularizer"

We want the synthetic image to look like a "natural" image that we would test.
The R(I) "natural image regularizer" helps the synthetic image look normal.

Take a synthetic image (like all 0's etc)
Pass forward through the network, compute the neuron output we are interested in.
    It does not need to go to the end of the network
Back propagate to compute the gradient of the desired neurons output with respect to the pixel image
    There is no loss function. We only want the biggest possible neuron output
Make a gradient ascent update to the pixels themselves to maximize the score

An example of an R(I) "natural image regularizer" is
    -1 * lambda * ||I||**2
This is an L2 penalty on the sum of squares of all pixel values.

During optimization,
- Periodically Gaussian blur an image
- Clip pixels with small values to 0
- Clip pixels with small gradients to 0
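
A minimal sketch of the gradient-ascent loop (neuron_grad is an assumed helper that returns the
gradient of the chosen neuron's output with respect to the image):

    import numpy as np

    def generate_image(neuron_grad, shape=(224, 224, 3), steps=200, lr=1.0, l2_lambda=1e-4):
        img = np.zeros(shape)                    # start from a blank synthetic image
        for _ in range(steps):
            grad = neuron_grad(img)              # backprop from the chosen neuron to the pixels
            grad -= 2 * l2_lambda * img          # gradient of the L2 "natural image" regularizer
            img += lr * grad                     # gradient ascent on the pixels themselves
        return img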


Deep Dream
============================
This sounds like just a toy, but it is insightful.

Choose an image, and a layer in a CNN. This is a fully-trained CNN.
- Pass the image Forward through the net, compute the output of the desired layer
- Back propagate, using the layer's output as the initial gradient.
Propagate this back to the image, and then mutate the image.
Notice, the CNN layer is fixed, the weights are fixed.
This changes the image, not the CNN.

This amplifies whatever features were detected by the CNN in the image.
The initial gradient is the features detected, so this maximizes whatever
features were detected.
Another way of saying this is it maximizes the L2 norm of the detected features.

The code is simple. It does a few tricks (e.g. jitter the image before computing the gradient);
this acts as a regularizer. They also normalize the gradient using its L1 norm.

So, pass in an image of the sky, and it mutates the image with ghost like images.
The key is these ghosts are abstract forms of the data that was originally used to train 
the CNN.

So, really, a CNN has fixed "template images", that it looks for, but these seem 
to be stored as abstract features rather than a brittle pixel bitmap. 


Feature Inversion
=================================
Take an image, 
    pass it through a mature CNN
    get the feature vector
    reconstruct the image from its feature vector

If we reconstruct the image from the feature vectors of early layers, we can
make a detailed image.
If we use higher layers in the network, then we get a fuzzy image that loses a lot of the detail.
This confirms our suspicions - the early layers look for simple pixel patterns while
the later layers capture higher-level, more abstract features.



Texture Synthesis
=================================
Start with a small patch of an image, and try to generate a larger image with the same texture.
This is an old problem, and can be done without NNs.
However, those classical methods break down when applied to more complex images.

We pass the image sample through a CNN. Get an activation map which
is a C x W x H matrix where C is the number of features.
This is a H x W grid, and for every point in the grid there is a C-dimensional vector
that describes the appearance at that point.

Then, take the feature vectors for 2 different points, and compute their outer
product, to create a CxC matrix.
This CxC matrix tells us something about the co-occurrence of features: if entry (i,j) is big, then
features i and j tend to co-occur.
Do this for pairs of points across the grid and average the results.
(The outer product handles all pairs of features at once.)
    Now, we have a CxC "Gram matrix" that summarizes the co-occurrence of every feature with every other feature across the image.
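
A minimal numpy sketch of one common way to compute such a co-occurrence (Gram) matrix, summing
outer products over all grid points:

    import numpy as np

    def gram_matrix(features):
        # features: C x H x W activation volume from one layer of the CNN
        C, H, W = features.shape
        F = features.reshape(C, H * W)           # one C-dimensional vector per grid point
        return (F @ F.T) / (H * W)               # average of outer products over all points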

Now, use this Gram matrix as the texture target.
Pass an initial (e.g. random) image through the same CNN and compute its Gram matrix.
Compute the gradients of a loss between the two Gram matrices.
Backprop to update the image.

So, this uses the co-occurrence statistics of the features to create a new image with the same texture.




================================================================================
Lecture 13
Generative Models
================================================================================

Unsupervised Learning
In supervised learning, we provide the network with (X, Y) and it tries to find a
mapping from X-->Y
In unsupervised learning, we provide X, no Y, and the network tries
to find relationships.
Some examples:
- clustering. 
    Find groups of data items that are similar in some metric
- Dimensionality Reduction
    Find axes in which data has most variation, and remove low variability axes
- Learn feature representation
    Like training a NN, there is a loss function and gradients.
    Here, the loss function is the error when we use extracted features to 
        reconstruct the original data.
- Density Estimation
    Estimate the underlying distribution of the data

Today, we will focus on Generative Models
Generative Models are a type of Unsupervised learning
We use training data to generate new samples with the same distribution.

These also solve Density Estimation as part of the solution.
We can use Density Estimation either as an explicit or implicit part of the system.

Generating data has many applications:
- Fill in missing parts of an incomplete data set (such as partially occluded image)
- Estimating future steps in a time series
- Create realistic training data

There are many types of Generative Models
Tractable Density
    Fully Visible Belief Networks
Approximate Density
    Variational Autoencoder
    Markov Chain (Boltzmann Machine)
Implicit Density
    Markov Chain

See Ian Goodfellow, "Tutorial on Generative Adversarial Networks", 2017


PixelRNN and PixelCNN
==========================
These are a type of "Fully Visible Belief Networks". They explicitly model
the density.

*** A training image is used to train the CNN to generate the image.
Later, the program generates new pixels that conform to the same relationships that 
were extracted from the training image.

*** We generate new pixels using weights in a neural network.
The neural network is trained to have weights that maximize the probability of the 
training data being reproduced as output.

p(x) is the probability of an image. We define 
    p(x) = PRODUCT[i=1...n]( p(x[i] | x[1], x[2], ..., x[i-1]) )

We use the chain rule to express the likelihood of x as a product of 1-dimensional
conditional distributions. Here, x[1..i-1] are the previously generated values.

In practice, the x[i] are pixel values. So, the probability of the whole image
is the product, over pixels, of the probability of each pixel value given all previous pixels.

To train the model, we maximize the likelihood of the training images under the model, i.e. the
product of the per-pixel probabilities. Note the factors are probabilities (each <= 1), not pixel
values, so nothing grows over time; each pixel's value is drawn from its own conditional distribution.

Anyway, how do we model the probabilities of all previous pixel values over time?
We use a neural network.


PixelRNN:
van den Oord et al 2016
    - Start with a pixel in the top-left corner of the image. Not sure how this gets
    initialized.

    - Next, use the known pixel to model adjacent pixels. We use an RNN (specifically an LSTM) for
    each adjacent pixel.

    - Repeat. We use known pixels and LSTM's to compute new adjacent pixels.

This works but is incredibly slow.

PixelCNN:
van den Oord et al 2016 NIPS
Salimans et al 2017 (PixelCNN++)
    - Start with a pixel in the top-left corner of the image. Not sure how this gets
    initialized.

    - Use a CNN to generate new adjacent pixels.
    Each new pixel has a CNN neural network to generate it.

    p(x) = PRODUCT[i=1...n]( p(x[i] | x[1], x[2], ..., x[i-1]) )
    Training maximizes the likelihood of the training images
    The loss function is a softmax loss over pixel values

Training is faster. The CNN is configured to maximize the training data as output.
At generation time, it is still slow because you are running a CNN for each pixel.



Variational Autoencoders (VAE)
==============================
We now define a new variable z, and the probability of each p(x) is an integral over all
possible values of z.

    p(x) = Integral(p(z)*p(x | z) dz)

This is an issue, we cannot optimize it directly.
Instead, we optimize a lower bound on the likelihood.

VAE's are related to a more general class of programs, called Autoencoders.
Autoencoders take input data, X, and generate features, Z, which have lower
dimension than the input. We do dimensionality reduction. Z should represent a higher
semantic level.

In practice we should also be able to reverse this, and use the high level features
to construct an image.

    X (input data) --Encoder--> Z ---Decoder--> X'

We want X' to be similar to X.

Autoencoders are trained with some training data.
The encoders were linear layers with sigmoid nonlinearities.
Later, these evolved and now use CNN's.

The encoder is often a 4-layer Convnet
The decoder is then a 4-layer upconv net
We train by encoding and then reconstructing the inputs. We want the reconstructed
input to be as close to the original input as possible.

We use L2 loss
    L2 = (x - x')**2

Notice, there are no external labels. Instead, we use the original image as ground truth to 
compare with the reconstructed image.

Once we have trained the net, we do NOT use the decoder any more - it was only used
for training.

Instead, we can use the trained encoder as a feature extractor for a supervised task:

    X (input data) --Encoder--> Z ---Features CNN--> y'
The CNN generates predicted labels, and these can be used for supervised learning.

Variational Autoencoders
---------------------------
Variational Autoencoders use the high level features to generate new images
This is a probabilistic form of an autoencoder.
For example, z could encode facial features or landscape features.
    Kingma and Welling, "Auto Encoding Variational Bayes" ICLR 2014

We use an unobserved z vector to generate a new X. We make p(x | z)
We want Z to have a simple probability distribution, like a Gaussian.

            Z (input data) --Decoder--> X

We implement the decoder with a CNN
We train this to maximize the probability of reconstructing the training data.

    p(x) = Integral(p(z) * p(x | z) dz)
Where:
    p(z) is a Gaussian distribution
    p(x | z) is the output of the neural network given Z as input
This integral is intractable - we cannot compute it for every possible Z

Now, by Bayes' rule, p(z | x) = p(x | z) * p(z) / p(x), but this is also intractable (it contains p(x)).
Instead, we can create a separate encoder network q(z | x) to approximate it:

    X ----Encoder----> mean: mu(z | x) and diagonal covariance: Sigma(z | x)
    Z ----Decoder----> mean: mu(x | z) and diagonal covariance: Sigma(x | z)

So, both the encoder and decoder produce probability distributions.
To get actual Z and X values, we just randomly sample over these distributions.
The encoders are also called "recognition" or "inference" networks
The decoders are also called "generation" networks

Now, write log(p(x)) as an expectation over z ~ q(z | x) (p(x) does not depend on z),
writing E_z[.] for that expected value:
    log(p(x)) = E_z[ log(p(x)) ]

Next expand p(x) using Bayes' rule:
    log(p(x)) = E_z[ log( p(x | z) * p(z) / p(z | x) ) ]

Multiply inside the log by a constant (q(z | x) / q(z | x)):
    log(p(x)) = E_z[ log( p(x | z) * p(z) * q(z | x) / (p(z | x) * q(z | x)) ) ]

Split into 3 terms:
    log(p(x)) = E_z[ log(p(x | z)) ] - E_z[ log(q(z | x) / p(z)) ] + E_z[ log(q(z | x) / p(z | x)) ]

The last two terms are KL divergences (Kullback-Leibler divergence, a measure of how different two
distributions are):
    log(p(x)) = E_z[ log(p(x | z)) ] - KL(q(z | x) || p(z)) + KL(q(z | x) || p(z | x))

We cannot compute the last term (it contains the intractable p(z | x)), but a KL divergence is
always >= 0. So the first two terms are a lower bound on log(p(x)) (maybe not a tight lower bound,
but a lower bound nonetheless).
So, we train an encoder/decoder that maximizes this lower bound on the probability of reconstructing
the original training data.

To Train the system:
====================
1. Start with input data x, 
    X ----Encoder----> mean: mu(z | x) and diagonal covariance: Sigma(z | x)

2. Sample from the distribution of z, so make z values

3. Pass through a decoder
    Z ----Decoder----> mean: mu(x | z) and diagonal covariance: Sigma(x | z)

4. Sample from the distribution of x' to make reconstructed x' values.
The reconstruction part of the training loss compares x with x'.

5. Backprop through the network
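
A minimal sketch of the sampling in steps 2 and 4 (mu and sigma are the mean and diagonal
standard deviation produced by the encoder or decoder; this is the usual mu + sigma * noise trick):

    import numpy as np

    def sample_gaussian(mu, sigma):
        # sample from N(mu, diag(sigma**2)); used for z in step 2 and for x' in step 4
        return mu + sigma * np.random.randn(*mu.shape)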

To Use the System:
======================
1. Sample from the distribution of z, so make z values

2. Pass through a decoder
    Z ----Decoder----> mean: mu(x | z) and diagonal covariance: Sigma(x | z)

3. Sample from the distribution of x' to produce a new generated sample x'.
At generation time there is no loss - the decoder alone produces new data.




Generative Adversarial Networks (GAN)
=====================================
See Ian Goodfellow et al, "Generative Adversarial Networks", NIPS, 2014

Here we do not try to even model the probability density function.
Instead, we will generate samples with a simpler distribution like random noise.
We will learn a transformation from random noise to the training data. That transformation
will be a neural network.

   Random noise vector ---Generator Network---> generated sample

We train the network as a 2-player game between the generator and a discriminator network that
tries to distinguish between real and fake images.
    Player 1 (the generator) will try to fool the discriminator with convincing fakes
    Player 2 (the discriminator) will try to tell the difference between real and fake
So, we train with real images from a training set and also fake images made by the generator.
This requires that the discriminator itself is trained at the same time.

We will train with minimax:
    min-over-G( max-over-D( Expected(log(D(x))) + Expected(log(1 - D(G(z)))) ) )

Where:
D(x) is the likelihood that the data is a real image; it is the output of the discriminator
    X is a real image
G(z) is a fake image, made by G() from random vector z
D(G(z)) is the discriminator result for generated fake data

We want D(x) to be close to 1 and we want D(G(z)) to be close to 0

We will alternate
- Gradient ascent on the discriminator, to maximize
    [Expected(log(D(x))) + Expected(log(1 - D(G(z))))] 

- Gradient descent on the generator to minimize
    Expected(log(1 - D(G(z))))

In practice, this does not work great.
So, instead of minimizing the likelihood of the discriminator being correct, we maximize the
likelihood of the discriminator being wrong:
Gradient ascent on the generator to maximize
    Expected(log(D(G(z))))
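
A minimal numpy sketch of the loss terms used in the alternating updates (assuming D_real and
D_fake are arrays of discriminator probabilities for a real batch and a generated batch):

    import numpy as np

    def discriminator_loss(D_real, D_fake):
        # maximize log(D(x)) + log(1 - D(G(z)))  ->  minimize the negative
        return -(np.log(D_real) + np.log(1.0 - D_fake)).mean()

    def generator_loss(D_fake):
        # in practice: maximize log(D(G(z))) instead of minimizing log(1 - D(G(z)))
        return -np.log(D_fake).mean()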




There have been later implementations of GAN, and these are starting to look
very realistic.
A new system uses convolutional networks.
See:
    Radford et al, "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial"
    ICLR 2016

This still uses a random Z vector to produce each new sample.
- The generator is an upsampling network, with fractionally strided convolutions
    Replace any pooling layers with fractional strided convolutions
    Use ReLU activation in all layers except the output which uses Tanh

- The discriminator is a Convolutional network
    Replace any pooling layers with strided convolutions
    Use Leaky ReLU activation in all layers

In both:
    Use batchnorm
    Remove fully connected hidden layers in deep architectures


This still converts Z to an image, and again uses a generator and discriminator for training, but
just a generator to make the images.

The system makes detailed and sharp images.
You can also do vector math with the random Z vectors. Adding 2 input vectors will
make a coherent image that combines features of both z vectors.

Better training:
    LSGAN, Mao et al
    BEGAN, Berthelot et al 2017

Transform between domains or change features (change scene from summer to winter)
    CycleGAN, Zhu et al 2017

The GAN Zoo is a website with many papers on new GAN projects




================================================================================
Lecture 14 
Deep Reinforcement Learning
================================================================================

Reinforcement Learning (RL)
==========================
Reinforcement Learning is like planning.
    An agent takes actions in an environment (the current state)
    These actions change the state
    There are rewards for "good" actions

The environment can return a "terminal state" which ends the process. This may be a goal or end state.

Some examples:
- Balance a pole on top of a cart (inverted pendulum)
The action is a horizontal force applied to the cart. The state is the pole angle, angular speed,
cart position and cart velocity.
The reward depends on how upright the pole stays.

- Move a robot forward
Planning outputs torques to joints
Environment is joint positions, etc. 
Reward is how far you have walked

- Atari games
State is raw pixels of the game
Action: move game paddles
Reward: score increases with each step

- Go
State is position of all pieces
Action: place a new piece


Markov Decision Processes (MDP)
===============================
This is a form of Reinforcement Learning
There are variables
    S - set of possible states
    A - set of possible actions
    R - Distribution of rewards. There is a different reward for each (s, a) (state,action) pair
    P - Transition probability. This is the distribution over the next state, given the current (state, action) pair
    Y - Discount factor (gamma)

At time t=0, the environment samples an initial state s[0]
Then, repeat:
    The agent selects an action a[t]
    The environment samples a reward r[t] ~ R(. | s[t], a[t])
    The environment samples the next state s[t+1] ~ P(. | s[t], a[t])
    The agent collects the reward r[t] and moves to state s[t+1]

A policy function Pi maps S-->A to choose the next action for each state.
The objective is to find the policy Pi* that maximizes the total discounted reward SUM[t>=0](gamma**t * r[t])


Consider a simple example: Grid World
There is a 3x3 grid; you are in 1 square at each time, so the state is the current square.
Actions are to move to an adjacent square.
There are goal states, and the reward is a negative value for each move taken.
Goal is to reach a goal state in the least number of steps.

Note, there can be randomness in the initial state, in the transition function for each state,
and in the resulting state after an action.

The optimal policy maximizes the expected total reward
Pi* = arg max over Pi of Expected( SUM[t>=0]( gamma**t * r[t] ) )
with
    s[0] ~ p(s[0])  (the initial state distribution)
    a[t] ~ Pi(. | s[t])
    s[t+1] ~ P(. | s[t], a[t])


Value-Function
This is the expected cumulative reward from a start state to a goal state.
ValueFunction takes a start state as an argument, and computes the sum of all rewards
from that start state until a goal state.
    V(s) = Expected( SUM[t>=0]( gamma**t * r[t] ) | s[0] = s )

Q-Value Function
This takes as parameters a state s and an action a, and computes the expected cumulative
reward when starting in s and taking action a:
    Q(s, a) = Expected( SUM[t>=0]( gamma**t * r[t] ) | s[0] = s, a[0] = a )
So, the Q function is the value-function when the first action is a.

Q* is the optimal Q function for a given start state and action
    Q*(s, a) = MAX over policies of Expected( SUM[t>=0]( gamma**t * r[t] ) | s[0] = s, a[0] = a )
Q* satisfies the "Bellman Equation"
    Q*(s, a) = Expected( r + gamma * MAX over a' of Q*(s', a') )

Intuitively, if the optimal Q-values for the next step, Q*(s', a'), are known, then the best
strategy is to take the action that maximizes r + gamma * Q*(s', a').
In other words, take action to go from s-->s' and take the reward r 
for the optimal next step, and then find the optimal rest of the path.

By itself, this does not necessarily imply a greedy algorithm.
The optimal next step may be a step backward to take 2 steps forward later.

We can then iteratively apply this equation to compute the best next step by computing
the best path from all possible steps. It is sort of like a min/max game-theory tree, where
you consider all scenarios for each next possible step.
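
A minimal tabular Q-learning sketch built around this Bellman update (the env object and its
reset()/step() methods are hypothetical; the deep version below replaces the table with a CNN):

    import numpy as np

    def q_learning(env, n_states, n_actions, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
        Q = np.zeros((n_states, n_actions))
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                # epsilon-greedy: occasionally take a random action instead of the best one
                a = np.random.randint(n_actions) if np.random.rand() < epsilon else int(Q[s].argmax())
                s_next, r, done = env.step(a)
                # Bellman update: move Q(s,a) toward r + gamma * max_a' Q(s', a')
                Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
                s = s_next
        return Q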

This is not scalable. You must compute Q(s,a) for every state action pair. The state
space may be very large.

We can use a function approximator to estimate the Q(s,a).
In general, a neural network is a good way to approximate any function.
This is Q-Learning. Define a neural network 
    Q(s, a; theta) ~= Q*(s, a)
the parameters theta are the weights of the network.



Q-Learning or Deep Q-Learning
===================
A CNN approximates the Q-value function.

This is a Neural Net used to estimate the Q function that satisfies the Bellman equation
Note: this is really just using a neural net as a general mechanism to approximate a function.
Neural net classifiers were described as being able to compute an arbitrary boundary between
positive and negative cases; in the same way, a neural net can act as a general function approximator.

Forward Pass:
Loss is L(theta) = Expected( (y[i] - Q(s, a; theta))**2 )
    where y[i] = Expected( r + gamma * MAX over a' of Q*(s', a') ), which is the Bellman target

Backward Pass:
Gradient update with respect to the parameters theta
Gradient is del-L(theta) = Expected( (r + gamma * MAX over a' of Q*(s', a') - Q(s, a; theta)) * del-Q(s, a; theta) )


Example: Atari arcade games
---------------------------
The Q(s, a; theta) function is a CNN, and the current state is a stack of the last 4 screenshot
frames, each is 84x84 pixels (this is downsampled from the original screenshot, it is also converted from 
RGB to grayscale).

The CNN is 
    Input
    16-channel 8x8 conv with stride 4
    32-channel 4x4 conv with stride 2
    Fully connected with 256 channels
    fully-connected with output K channels (the Q values)

The output is k-channels because there are K possible actions at any given state in the game.
Each of the K output values is the Q(s, a) for one particular action a.
The different Atari games have between 4 and 18 different types of actions.
So, the CNN computes all Q-values for all actions.

At each step, pick the action that immediately leads to a higher value state.
This requires that we can assign a value to each state, and that a single action
can lead to an immediate difference in the state value.

This is just a greedy algorithm. Just do local optimization to find the next best state.
This is hill-climbing search, and so of course never takes a step backward to take 2 steps
forward.

They modify this by occasionally taking a random action rather than the seemingly best action.
A random action is chosen with some small probability at each step (epsilon-greedy exploration).

Experience Replay:
Learning from batches of consecutive states leads to inefficient learning, because
all the samples are correlated. It can also introduce bias; for example, a player may end up
favoring the left side of the game field.
So, to work around this, we record games and then randomly draw training batches from 
different points in the game
The state is just the game pixels, so we can get the state at any time with random access.
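
A minimal sketch of a replay buffer (the capacity and the stored tuple layout are assumptions):

    import random
    from collections import deque

    class ReplayBuffer:
        def __init__(self, capacity=100000):
            self.buffer = deque(maxlen=capacity)   # old transitions fall off the end

        def add(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size):
            # draw an uncorrelated training batch from different points of recorded play
            return random.sample(list(self.buffer), batch_size)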


Policy Gradients
===================
This is a different approach.
It does not compute Q-values and plan individual steps. The Q-function is complicated
and requires learning the value of every state-action pair. This may be a very large space.

The policy may be simpler and we can define a class of policies
Each policy is parameterized by weights theta
The value of a policy is J(theta)
    J(theta) = Expected( SUM[t>=0]( gamma**t * r[t] ) )
It is the expected sum of all future rewards for a policy

We want to find the optimal parameters theta* which lead to the most reward
    theta* = arg max J(theta)

We can do Gradient Ascent on the policy parameters.
Specifically, this is done by the REINFORCE algorithm

Each J(theta) is the expected sum of future rewards:
    J(theta) = Expected(R(tau))
             = Integral( R(tau) * p(tau; theta) dtau )
where tau = (s0, a0, r0, s1, a1, r1, .....) is a trajectory, a sequence of states, actions and rewards,
p(tau; theta) is the probability of that trajectory under the current policy, and R(tau) is the
total reward along the trajectory.

We cannot practically compute the gradient of this expectation with respect to theta directly;
differentiating the expectation is intractable.
Instead, we use the identity
    del-p(tau) = p(tau) * (del-p(tau) / p(tau)) = p(tau) * del-log(p(tau))
Substituting this back into the expression we are doing gradient ascent on:
    del-J(theta) = Integral( R(tau) * del-log(p(tau)) * p(tau) dtau )
                 = Expected( R(tau) * del-log(p(tau)) )
We have converted a gradient of an expected value into an expected value of gradients,
which we can estimate by sampling trajectories under the policy.

p(tau) is the probability of a trajectory.
It is the product of the probability of each step in the trajectory.
    p(tau) = PRODUCT[t>=0]( p(s[t+1] | s[t], a[t]) * pi(a[t] | s[t]) )
So, taking the log turns the product into a sum:
    log(p(tau)) = SUM[t>=0]( log(p(s[t+1] | s[t], a[t])) + log(pi(a[t] | s[t])) )

Can we compute the gradient *without* knowing the individual transition probabilities?
Differentiating this expression with respect to theta gives us:
    del-log(p(tau)) = SUM[t>=0]( del-log(pi(a[t] | s[t])) )
The first term, log(p(s[t+1] | s[t], a[t])), has NO theta in it, so it drops out.
Moreover, the remaining term does NOT depend on the individual transition probabilities;
it only involves the probabilities of the actions under the policy.

The intuition here is:
    If reward from a trajectory is high, then increase the probability of all 
            actions in that trajectory.
    If reward from a trajectory is low, then decrease the probability of all
            actions in that trajectory.
This is simplistic, but the expected value averages out to be true over many samples.
   This has a problem with high variance.

There are some heuristics to improve the estimator
All of these optimizations are used in the "Vanilla REINFORCE" algorithm.

1. Variance Reduction
   Gradient estimator was del-J = SUM[t>=0]( R(tau) * del-log(pi(a[t] | s[t])) )
We want to increase the probabilities of an action only using the future reward of the action from a state
So, instead of assigning a reward to a possible action based on the total reward of the entire state, we look
at the total reward for the subsection of the plan that starts at the action in question.
The value of the action only depends on the reward of the plan after that action.
    del-J = SUM[t>=0]( ( SUM[t'>=t] r[t'] ) * del-log(pi(a[t] | s[t])) )

2. Add a discount factor
This down-weights delayed effects:
    del-J = SUM[t>=0]( ( SUM[t'>=t] gamma**(t' - t) * r[t'] ) * del-log(pi(a[t] | s[t])) )
This is an exponential decay on much later rewards.
So, we only care about rewards that are near-term effects of the action in question.

3. Use a baseline
Don't look at the raw value of the reward.
The real question is whether a reward with a plan that includes an action is better or worse
than the expected value of the plan without the action.
This is similar to the value functions with Q-learning.
The intuition is we want to take an action in a state if the Q-value of taking that action is more
than the expected value of not taking the action.

We don't know Q and V, but we can learn them using Q-learning
Train both an actor and a critic.
    The actor decides which action to take
    The critic measures how good the action was

Initialize policy parameters theta and critic parameters phi
for iteration = 1, 2, .... do
    Sample m trajectories under the current policy
    del-theta = 0
    for i = 1, 2, ..., m do
        for t = 1, 2, ..., T do
            A[t] = SUM[t' >= t]( gamma**(t' - t) * r[t'] ) - V_phi(s[t])     # the advantage
            del-theta = del-theta + A[t] * del-log(pi(a[t] | s[t]))
    del-phi = SUM[i] SUM[t] del-phi ||A[t]||**2
    theta = theta + alpha * del-theta        # gradient ascent on the policy
    phi   = phi - beta * del-phi             # gradient descent on the critic

We alternate between learning and optimizing the policy and critic functions.


Recurrent Attention Model (RAM)
================================
This uses the REINFORCE algorithm
This still tries to classify an entire image
This will take a sequence of "glimpses" which look at only a subset of the image
This seems to be similar to the way people "see", and we can track eye movement
It saves computation, because you don't have to process the entire image.
It also lets you avoid clutter and distracting parts of the image

This uses reinforcement learning
    The state is the sum of glimpses
    The action is which part of the image to see next
    The reward is whether we correctly classify the image at the last time step

We use a simple neural network to process the new image section
    It outputs a x,y coordinate of the next image section
We use a recurrent neural network to process the series of image sections
    
The algorithm may start with a low-res version of the entire image to decide where to start.
It looks for things like edges or brightness to decide where to start.
Once it has started, it uses each glimpse to decide where to look next.



================================================================================
Lecture 15
Energy-Efficient Deep Learning
================================================================================

Reducing cost is important - the costs are significant
Memory access is 2-3x more energy consumption than math operations
So, one key is to reduce memory traffic if possible.
    We can represent floats as 32, 16 or even 8 bit floats. 
    The 8-bit floats are actually useful.
    Going from 32 to 16 bits reduces chip area and power consumption by 4x.

There is general purpose hardware (CPU and GPU) but also specialized hardware (FPGA and ASIC)
    CPU is latency oriented, but GPU is throughput oriented
    FPGA is programmable logic (field-programmable gate array)
        Less efficient, but good for prototyping
    ASIC - application-specific IC. Hard-wired logic, cannot be reprogrammed.
        Google uses 8-bit integers in their TPU


Algorithms for Efficient Inference
===================================
There are several techniques

- Pruning
Prune some of the weights in a neural network and have the same accuracy
Some weights are very small and contribute little to the result, like unnecessarily high-order
terms in an approximation of a curve.
The key is to retrain the remaining weights after pruning.
Human brains also seem to reduce the number of synapses between roughly age 1 and adolescence.


- Weight Sharing
Use an approximation that maps similar weight values to one shared value, and then re-use that value.
Then, store the weight values in a table, and store a table index for the actual
weights. This reduces the size of each number to only as many bits as are needed to index the table.
We can also use Huffman coding to use fewer bits for commonly used weights.

SqueezeNet will reduce the number of channels. 
This reduces the memory footprint and preserves accuracy.


- Quantization
Train the net with normal floating point numbers.
Quantize the results: find the min and max values of the weights, and choose a precision.
Then, represent each number with a code that only has enough bits to span ((max - min) / precision) values.
We may fine-tune in normal floating point, and then convert to the fixed-point representation.
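
A minimal numpy sketch of this kind of min/max quantization (the number of bits is an assumed parameter):

    import numpy as np

    def quantize(weights, bits=8):
        lo, hi = weights.min(), weights.max()
        levels = 2 ** bits - 1
        step = (hi - lo) / levels                                  # precision implied by min, max, #bits
        codes = np.round((weights - lo) / step).astype(np.int32)   # small integer codes
        return codes, lo, step

    def dequantize(codes, lo, step):
        return lo + codes * step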


- Low Rank Approximation
A convolution layer can be broken into two separate smaller convolution layers.
These smaller convolutions are faster

Similarly, break a single fully-connected matrix into a product of smaller matrices.


- Binary/Ternary Net
We can represent a neural network with only 2 or 3 different weights
See: Zhu, Han, Mao, Dally, "Trained Ternary Quantization", ICLR 2017


- Winograd Transformation
Transform the data to make the math simpler and faster
See Andrew Lavin, Scott Gray, "Fast Algorithms for Convolutional Neural Networks"
    https://arxiv.org/abs/1509.09308
We do a transform, then compute the normal convolution dot-products, and then do an inverse transform.
So, the dot products work on a simpler and smaller matrix.


Hardware for Efficient Inference
===================================
Google TPU, accessed as a special disk
    The core is a 256x256 matrix-multiply unit, so it can do 256x256 multiply-accumulates per cycle.
    The throughput is 92 tera-ops/second (8-bit integer)
    24 MB on-chip buffer that is software managed
    2 DDR3 2133 MHz DRAM channels

A good GPU for comparison is the Nvidia K80.

He then introduces his own ASIC for faster training. This provides a hardware implementation
of Quantization and Low Rank Approximation and more. For example, he does some pruning to avoid
computing any value with a 0 weight.



Algorithms for Efficient Training
===================================

- Parallelization
Data parallel - process K images in parallel. So you can increase the batch size
You can also divide a single image into blocks, and then process each block in parallel
You can split the output features of a fully-connected layer and do them in parallel


- Mixed Precision with FP16 and FP32
We can do multiplication in 16 bit, then do the summation-accumulator in 32 bit.
So, store weights as 16bits, feed forward with 16bit activations.
Backprop is also done in 16bit, including 16bit gradients.
But, the weight update (weight = weight + learningRate * gradient) is done in 32 bits.


- Model Distillation
Use several large/deep neural nets (ResNet, VGGNet, GoogLeNet) to teach a simpler neural net.
The complex neural networks output a prediction, a vector of probabilities. These nets
may have large dynamic range, so the correct answer has high probability (like 0.9) while
the wrong answers have low probability (like 0.1 or 10**-6).
But, we narrow the range, so the right answer still has the highest probability
but we compress the range of values, such as between 0.99 down to 10**-6  to a smaller
range like 0.3 to 0.05
The simpler net then trains on these "softer" labels.


- DSD: Dense-Sparse-Dense Training
Train the full net and prune it to make a sparse net; retrain the sparse net.
Then, restore the pruned weights and retrain the full (dense) net again.


Hardware for Efficient Training
===================================

Nvidia Volta GPUs (2017) introduced a tensor core
This is a GPU instruction that does a multiplication of two 4x4 matrices in 1 clock cycle.
This takes in 2 FP16 matrices, and outputs a FP32 matrix


Specialized accelerator boards are also appearing; Google's is the Tensor Processing Unit (TPU),
and Google also announced a Cloud TPU.



================================================================================
Lecture 16
Adversarial Examples
================================================================================


Szegedy et al, 2013, "fool ImageNet classifiers imperceptibly"
Goodfellow et al, 2014, "cheap closed form of attack"

An adversarial example is one that has been manipulated to be misclassified.
If we do this carefully, it still looks grossly unchanged to a person.
Often, the change will only affect the least significant bits of each RGB value, so
it is unnoticed by people.

Fooling networks gives us insight into how they work, and where the weakness is.

There are sequences that turn various objects into something that is recognized by a ConvNet
as an airplane, but it looks unchanged to a human.
You use gradient descent to modify the image, so rather than updating the weights in backprop,
you update the input image.

After training a neural net, there are false positives, images that are incorrectly identified.
But, this is systematic, and not a random effect.

A neural net has to draw a boundary between items of a class and items not in the class.
However, this boundary may partition the whole space, and so cover a lot of points in the space 
that have never been tested.
The model underfits, it partitions the space along a simple boundary, and so lumps
in lots of unintentional data points as belonging to a class.
Moreover, points far from the boundary are classified with high confidence.

The boundaries are formed by rectified linear units, sigmoid, Maxout, and LSTM.

The mapping from the input of the model to the output of the model is (roughly) linear.
The mapping from the parameters of the network to the output of the model is NON-linear.
In practice, the boundary between classes is often made of a few linear segments; it is not
a complex sequence of bends and twists.

When we build adversarial examples, we use a max-norm on the perturbation. This says that
no pixel can change more than a certain amount.

x' is the perturbed image. A first-order (Taylor) approximation of the cost gives
    J(x', theta) ~= J(x, theta) + ((x' - x)-Transpose * del-J(x))
such that we maximize
    ((x' - x)-Transpose * del-J(x))
subject to the max-norm constraint
    max over pixels of |x' - x| < epsilon

So, the L2-norm of the modified image can be large, but the changes cannot be concentrated in
some pixels. Instead, it has to be spread out as small changes to many pixels.

To make an adversarial example
Take the gradient of the cost used to build the network
    x' = x + epsilon*(sign(del-J(x)))
Of course, this is only *one* algorithm for making adversarial examples.
Some networks can defend against this but they cannot defend against other algorithms.
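
A minimal sketch of this fast-gradient-sign step (loss_grad is an assumed helper returning dJ/dx
for the trained model; the epsilon value and the [0,1] pixel range are assumptions):

    import numpy as np

    def fgsm(x, loss_grad, epsilon=0.007):
        x_adv = x + epsilon * np.sign(loss_grad(x))   # x' = x + epsilon * sign(del-J(x))
        return np.clip(x_adv, 0.0, 1.0)               # keep valid pixel values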

This assumes the network separates classes with a roughly linear boundary.
We use a Taylor series to approximate the cost function near x.

Adversarial examples are in linear subspaces.
They are not small pockets of the solution space. 

Adversarial examples are not noise. 
Generally, noise has little effect.

Instead, in high dimensional spaces, if you start with a reference vector, then take
a random vector, the random vector will typically have 0 dot product with the 
reference vector.

***
Adversarial examples rely on changing the image in a direction that crosses the decision
boundary - the surface that separates images of different classes.
The cost function has a gradient at the image, and we want to change the image along a
vector that has a large dot-product with this gradient. That is likely to be a small change
that moves the image across the separating boundary and changes it from one class to another.
So, the study of adversarial examples is largely a study of first finding the boundary that
separates classes and then finding small changes that push an image across that boundary.
***

Changes along most directions do not cross the boundary and so random changes (ie, noise)
do not make adversarial examples.
See Tramer et al, 2017

The signs of the weights matters, not necessarily the weights themselves.
This seems to be much more important to the algorithm than the actual values.
As a result, you can make small changes to the pixels to flip the signs - and that
may not affect how humans see the image but it has a big effect on how software sees the images.
The result is to make an image interpreted differently to a human and a machine.
These small changes may be low-order pixel differences, so not even drastically changing the color.

Linear models are very vulnerable
However, quadratic functions (like "RBF networks") are much more resistant
Unfortunately, these models are basically just template matchers and so do not do well
on image test databases.


These are a problem with the basic math underlying many models. You can build adversarial
examples that will fool one type of model and then it will also fool different types of
models.


Defenses
========
Many different defenses have been tried, and all have failed so far.
For example, using generative models is not, by itself, enough to solve the problem.