Lecture 1 - Introduction
-------------------------
This lecture was given in 2010, so it preceded the wider media coverage of AI.


Lecture 2: Goal Trees
------------------------------------
How do you solve complex problems? The big idea is to create a small set of problem transforms, each of which changes a complex problem
into simpler problems. Eventually, the problem is simple enough that it can be solved directly.
This is called "problem reduction". Essentially, this is recursion - repeatedly simplify the problem until you have a base case,
and then directly solve the base case.
 
Now, there are 2 sets of transformations
1. Safe Transformations - these always lead you toward a solution.
2. Heuristic Transformations - these may lead you toward a solution, but may not. You may have to backtrack if you hit a dead end.
 
For simple calculus problems some transforms are:
Safe Transformations - these always lead toward the solution
1. Integral(-f(x))dx = -Integral(f(x))dx
2. Integral(c*f(x))dx = c*Integral(f(x))dx
3. Integral(f(x) + g(x))dx = Integral(f(x))dx + Integral(g(x))dx
4. Integral(p(x)/q(x))dx --> divide the polynomials (when the degree of p is >= the degree of q)
 
Heuristic Transformations - these may lead toward a solution
A. Integral(f(tanx))dx = Integral(f(y)/(1+y^2))dy
 
You then have a table of simple "built-in" integrals and their solutions
 
The algorithm is
1. Apply transforms. Each will replace the problem with one or more simpler problems.
    This replaces a goal with a conjunction (logical-AND) of several simpler goals.
2. Look to see if any simpler goal is in the table of primitive problems. If so, then replace it with the solution.
 
Continue looping through steps 1 and 2 until you are done.
 
The state of the program is an AND-OR tree. Each node is a problem, and the original problem is the root of
the tree. You build the tree as you solve the problem. You start with just the root, which is the original
problem, and a transformation will extend the tree by adding children nodes to the current node.
So, you start working on a leaf node and either solve it or else add simpler sub-goals as child nodes, which
turns the leaf node into an intermediate node.
At any point, the program is examining one node of the tree. You can do any tree traversal, depth-first or breadth-first,
but depth-first is most intuitive. There are several other orders, and the program can use heuristics to choose which
leaf node to work on next.
 
To solve a node, either find that node in the table of known problems with their solutions, or else solve the children
of the node. The children of a node may be either an AND or an OR, so you must solve either all children (the AND case) or
at least one child (the OR case).
 
Applying multiple different transforms to a node will create an OR-Tree of children.
Each transform makes one child and solving any transform's child node is sufficient to solve the parent node.
Some transforms will make an AND-node, by simplifying a parent node into several child nodes, and you
must solve all child nodes.
 
If you hit a dead-end, then back up the tree until you hit an OR node and try an alternate child.
If you solve a node (you find its solution in the table of known solutions) then back up the tree until
you hit a root or an AND node. If you hit an AND node, then solve the next child. If you hit the root, then you are done.
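
A minimal sketch of this AND-OR recursion in Python (the lookup table and the single transform here are toy stand-ins for illustration, not the real calculus program):

    # AND-OR goal tree solver (illustrative sketch).
    KNOWN = {"integral(cos x dx)": "sin x", "integral(sin x dx)": "-cos x"}   # table of primitive problems

    def transforms(problem):
        """Return a list of alternatives (OR); each alternative is a list of simpler subproblems (AND)."""
        if problem == "integral((cos x + sin x) dx)":
            # Safe transform: the integral of a sum is the sum of the integrals (an AND of two subproblems).
            return [["integral(cos x dx)", "integral(sin x dx)"]]
        return []

    def solve(problem):
        if problem in KNOWN:                               # base case: found in the table
            return KNOWN[problem]
        for subproblems in transforms(problem):            # OR: any applicable transform is enough
            solutions = [solve(p) for p in subproblems]
            if all(s is not None for s in solutions):      # AND: every subproblem must be solved
                return " + ".join(solutions)
        return None                                        # dead end: the caller backs up to another OR branch

    print(solve("integral((cos x + sin x) dx)"))           # -> sin x + -cos x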
 
We start with a goal and reduce it until we have converted it into a simpler goal that we know how to solve.
So, we are working backwards, from a goal to a series of steps, and this is essentially "backward chaining".
 
In calculus, a fairly simple program can solve a lot of integrals.
    Average depth of the tree is 3, max depth is 7.
    Average number of unused nodes is approx 1. This means there are not a lot of unused dead-ends.
        So, the heuristic of which node to work on next is less important for this domain.
    Lookup table of known solutions was 26 integrals
    12 safe transformations and 12 heuristic transformations
 
We can characterize any domain with some basic questions
    What types of knowledge are involved?
    How is the knowledge represented?
    How is the knowledge used?
    How much knowledge is required?
 
 
 
 
 
 
Lecture 3: Goal Trees
------------------------------------
This starts with a demo of a simple planning program, to arrange blocks. This is backward chaining, and
the code will have to solve intermediate tasks, like clearing a block before a new block can be put on
top of it, or clearing a block so it can then be moved onto another block.
 
The program generates plans, and stores its state, so it can explain which sub-goal it was solving when
it made a particular step in the plan.
 
The program has several procedures:
1. PutOn(movedBlock, targetBlock)
    FindSpace(targetBlock)
    Grasp(movedBlock);
        ClearTop(movedBlock);
            While-block-has-things-on-top
                GetRidofBlock(topBlock);
                    PutOn(topBlock, table);
    Move(movedBlock);
    Ungrasp(movedBlock);
 
The code leaves a trace of which function was called in which order.
Why-questions: the reason a function was called is its parent procedure in the trace, with its parameters.
How-questions: how a function did something is given by its child procedures in the trace, with their parameters.
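
A toy Python sketch of this kind of program (it skips Grasp/Ungrasp/FindSpace and uses a made-up world state); the point is that every call records (id, parent, description), so a why-question walks up to the parent entry and a how-question walks down to the child entries:

    # Blocks-world planner that leaves a goal-tree trace (illustrative sketch).
    on = {"B2": "B1", "B1": "table", "B3": "table"}    # hypothetical world: B2 sits on B1
    trace = []                                         # list of (id, parent_id, description)

    def record(description, parent):
        trace.append((len(trace), parent, description))
        return len(trace) - 1

    def clear_top(block, parent):
        me = record("ClearTop(%s)" % block, parent)
        for other, support in list(on.items()):        # while the block has things on top
            if support == block:
                get_rid_of(other, me)

    def get_rid_of(block, parent):
        me = record("GetRidOf(%s)" % block, parent)
        put_on(block, "table", me)

    def put_on(block, target, parent=None):
        me = record("PutOn(%s, %s)" % (block, target), parent)
        clear_top(block, me)
        if target != "table":
            clear_top(target, me)
        record("Move(%s)" % block, me)
        on[block] = target

    put_on("B1", "B3")
    for ident, parent, description in trace:           # "why" = parent entry, "how" = child entries
        print(ident, "parent:", parent, description)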
 
This trace is a Goal-tree, which is really an AND-OR tree.
So, in lecture 2 and in lecture 3, we built an AND-OR tree to search a space.
In lecture 2, this was the explicit state of the program, and in lecture 3, we leave the tree around after completing the problem
so we can examine the reasoning of the program.
 
This is backward chained programming.
Next we switch to talk about forward-chained programs.
As an example, consider a program to identify animals in a zoo.
 
Some attributes of animals
    Has-hair
    Has claws
    Has Sharp teeth
    Has Forward pointing eyes
    Eats meat
 
Some simple rules
R1: IF Has-hair THEN is-mammal
R2: IF Has-claws AND Has-Sharp-teeth AND Has-Forward-pointing-eyes THEN Carnivore
R3: IF Eats-meat THEN Carnivore
R4: IF Is-Mammal AND Has-Spots AND Is-Fast THEN Cheetah
 
A forward-chained expert system also builds a Goal tree (an AND-OR tree) as it runs.
As a result, it too can answer questions about its own behavior.
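
A minimal forward chainer over rules in this style (a sketch; the rule set and starting facts are just the toy animal example, and the support dictionary is the goal tree that lets it answer how/why questions):

    # Forward-chaining rule engine (illustrative sketch).
    rules = [
        ("R1", ["has-hair"], "is-mammal"),
        ("R2", ["has-claws", "has-sharp-teeth", "has-forward-eyes"], "is-carnivore"),
        ("R3", ["eats-meat"], "is-carnivore"),
        ("R4", ["is-mammal", "has-spots", "is-fast"], "is-cheetah"),
    ]
    facts = {"has-hair", "eats-meat", "has-spots", "is-fast"}
    support = {}                      # conclusion -> (rule, preconditions): the goal tree it builds as it runs

    changed = True
    while changed:                    # keep firing rules until nothing new can be added
        changed = False
        for name, preconditions, conclusion in rules:
            if conclusion not in facts and all(p in facts for p in preconditions):
                facts.add(conclusion)
                support[conclusion] = (name, preconditions)
                changed = True

    print("is-cheetah" in facts)                                  # -> True
    print("How did you know is-cheetah?", support["is-cheetah"])  # the rule and preconditions that asserted it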
 
To answer a "How" question, like How did you know X, answer with the preconditions that allowed the rule to fire which stated X
To answer a "Why" question, like Why did you want to know Y, answer with the result of a rule that used Y as a precondition,
 
So, a goal tree is used for both forward and backward chaining.
In both cases, it can examine the parent and child nodes in a goal tree.
 
Similarly, the rules for a forward chained expert system can also be used in reverse to do backward chaining.
In this case, you start with a goal and ask: is this animal an X? Then you work backward through the rules that conclude X;
each rule's preconditions become new subgoals, and you keep simplifying until you reach properties that are known facts.
Here again, you build an AND-OR goal tree as you go.
 
So Forward and Backward chaining are very similar
    Both build AND-OR goal trees
    Both use rules to add child nodes to a leaf node
    Both can inspect their state (the resultant AND-OR tree) and explain why and how they did each step
 
Backward chaining starts with a goal or hypothesis and works backward
Forward chaining starts with known facts and works forward
 
An AND-node is a conjunction of preconditions to a rule
An OR node is several different rules, any one of which can fire
 
Forward chaining seems to be best suited for identification
Backward chaining seems to be best suited for making a plan or proving a hypothesis
 
In a Deduction mode, if a rule fires then the result remains true. Once you prove something then it remains true.
In other modes, you can add or remove facts in a database, so the results of one rule may be changed or removed by some future rule.
 
As an example, look at the problem of putting groceries in a bag.
 
Key Ideas
- Complexity may be due to the complexity of the environment or the problem, not the complexity of the program.
Complexity of behavior is MAX(complexity of program, complexity of environment)
So a simple program can have complex behavior when it solves a complex problem.
- Deal with specific cases, when studying a system
- Ask questions about things that appear to be the same but are handled differently.
This exposes significant properties, and lets you write new rules
- Build a system and when it fails you are missing a rule or some description in the vocabulary
 
 
 
 
 
 
Lecture 4: Search
------------------------------------
Humans can do simple search pretty well using just visual processing. We don't understand how this works, but it seems
to be important. "We will never have a complete theory of human intelligence until we can understand the contributions of the human visual system to solving everyday problems"
 
Search is about any system of choices. Finding a path on a map is one example, but solution spaces or any sequence of choices
need search.
- British Museum
Find every possible path (but never visit the same node twice within a path).
 
- Depth-First-Search (DFS)
Always take the left-most child. If you hit a dead end, then backtrack to the previous choice.
This uses backtracking.
 
- Breadth-First-Search (BFS)
Search level-by-level, so extend every path at the current level by one node before going deeper.
 
To implement search, we will create a queue of partial paths.
To implement DFS:
    Represent paths as lisp s-expressions.
    Initial path is a list (S) which is just the S start node
    Extend path Loop
        Pop path from head of queue
        Current node is last node in the path
        If the current node has no children, then discard this path and iterate the loop
        Otherwise:
            Look at each child of the node
                If child has not been seen before
                    Check if we are done, ie if the goal node is one of the children
                    Make a new path by appending each child of the current node to the list
                    Put each new path back onto the head of the queue (really, this is a LIFO stack)
                    Add new child to list of nodes we have seen before
        Iterate Loop
 
For BFS, just put new lists onto the tail of the queue not the head
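
A runnable Python sketch of this queue-of-partial-paths scheme (the graph is made up; for simplicity the goal test happens when a path is popped rather than when children are generated):

    # DFS/BFS with a queue of partial paths (illustrative sketch).
    graph = {"S": ["A", "B"], "A": ["C", "D"], "B": ["D"], "C": [], "D": ["G"], "G": []}

    def search(start, goal, depth_first=True):
        queue = [[start]]                      # queue of partial paths; initially just (S)
        seen = {start}
        while queue:
            path = queue.pop(0)                # pop path from the head of the queue
            node = path[-1]                    # current node is the last node in the path
            if node == goal:
                return path
            new_paths = []
            for child in graph.get(node, []):
                if child not in seen:          # never revisit a node
                    seen.add(child)
                    new_paths.append(path + [child])
            if depth_first:
                queue = new_paths + queue      # head of the queue: behaves like a LIFO stack
            else:
                queue = queue + new_paths      # tail of the queue: level-by-level
        return None

    print(search("S", "G", depth_first=True))      # DFS
    print(search("S", "G", depth_first=False))     # BFS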
 
Now we will start to assign a score to nodes, based on their proximity to the goal.
This will create a series of algorithms, the simplest is Hill-Climbing.
 
- Hill-Climbing
In DFS, we enumerate the child nodes in lexical order, which is just arbitrary.
In Hill-climbing, we enumerate them in order of their proximity to the goal node.
So, you add new paths to the front of the queue (as in DFS), but order them so the child closest to the goal comes first.
This is called an "informed search" and it requires that I know the distance of
each node to the goal. That may not always be possible, but when it is, hill climbing
is much faster than DFS.
You can get stuck at a local optimum that is not the global optimum.
Other possible problems: local maxima that are not the global maximum, and flat spaces (plateaus) with no best direction.
 
- Beam
This is to BFS what Hill-climbing was to DFS.
So, when we advance each level in the BFS, we limit to only the K best options, based
on how close each child node is to the goal. K is the "beam width".
 
In Beam search, add only K paths, and add them to the back of the queue.
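
Both informed orderings are small tweaks on the queue handling above (a sketch; heuristic is assumed to be a function estimating a node's distance to the goal):

    # Hill climbing: DFS, but push a node's children closest-to-goal first (illustrative sketch).
    def hill_climbing_order(new_paths, queue, heuristic):
        new_paths.sort(key=lambda path: heuristic(path[-1]))
        return new_paths + queue

    # Beam search: BFS, but keep only the k most promising paths at each level (beam width k).
    def beam_prune(level_paths, heuristic, k=2):
        return sorted(level_paths, key=lambda path: heuristic(path[-1]))[:k]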
 
- Best first search
It only expands the path whose leaf node is closest to the goal, no matter where that path
appears in the queue.
 
 


Lecture 5 - Search: Branch and Bound,  A*
---------------------------------------------------
How do you find the shortest path through a graph?
Once again, search is about examining a space of choices, not about maps specifically.
 
Branch-and-Bound
----------------
Maintain a queue of partial paths
This queue is sorted by length of the partial paths
Initialize the queue with a single path that is just the START node, the partial path length is 0.
Pick the next path to work on - this is the partial path with the shortest length
    This partial path has a last node where the partial path reaches
    You have a path length to get to this node
    Compute the length of a new partial path through the current route to each child of the last node
    Add each new partial path to a global list, keeping the list sorted by partial-path length
 
Note, once you find a path to the goal, you have to continue searching, until
every other partial path has a longer length than the path we found to the goal.
 
 
Why is this important? Before, in the simpler tree searches, we only looked
at the children of a single node. We kept minimal state, how we got to the current node
and the description of the current node. (we may also keep a record of which nodes we
have visited before, but that is just a boolean flag on every node).
Now, we keep a global state, of all partial paths we have developed so far, and make
a global decision. Of any partial path developed so far, which do we search next.
So, instead of choosing which child in a particular node to search, we now think
more globally and decide which partial path in the entire tree do we search next.
 
As an optimization, never extend a path to a node that has already been extended from before.
Keep an "extended" (visited) flag on each node, and never extend a path that ends at a node with the flag set.
Because we extend paths in order of length, the first path extended from a node is the shortest path to that node,
so any later path reaching it cannot be better.
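
A minimal branch-and-bound over a weighted graph (sketch; the graph and edge lengths are made up). The queue holds (length, path) pairs kept in length order, and the extended set implements the optimization just described:

    # Branch and bound with an extended (already-expanded) set (illustrative sketch).
    import heapq

    graph = {"S": {"A": 3, "B": 2}, "A": {"G": 5}, "B": {"A": 1, "G": 7}, "G": {}}

    def branch_and_bound(start, goal):
        queue = [(0, [start])]                    # (partial path length, partial path)
        extended = set()
        while queue:
            length, path = heapq.heappop(queue)   # always work on the shortest partial path
            node = path[-1]
            if node == goal:
                return length, path               # every other partial path is already at least this long
            if node in extended:
                continue                          # a shorter path already reached this node
            extended.add(node)
            for child, edge in graph[node].items():
                heapq.heappush(queue, (length + edge, path + [child]))
        return None

    print(branch_and_bound("S", "G"))             # -> (8, ['S', 'A', 'G'])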
 
A* Algorithm
----------------------------------------------
This uses predicted distance from each node to the goal to decide which
partial path to search next.
If you have 2 partial paths, one ending at node A and one ending at node B,
then compute the predicted distance of each partial path to the goal.
Predicted length of the path through A = (length so far to A) + (distance as the crow flies from A to the goal)
Predicted length of the path through B = (length so far to B) + (distance as the crow flies from B to the goal)
The actual length of any path to the goal through a node is >= its predicted length.
 
So, always search along the partial path with the lowest predicted length to the goal.
 
This *assumes* that you can measure "distance as the crow flies" between 2 nodes. That
may not be true in many problem spaces. This is an "admissible heuristic", which is a heuristic
that is guaranteed to be a lower bound on the true remaining distance (in this case it's "distance as the crow flies").
 
So, A* is branch-and-bound plus a previously-visited (extended) list plus an admissible heuristic (distance as the crow flies).
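
A sketch of A* as that combination (the graph is the same made-up one as above, and the heuristic values are made up but admissible and consistent for it):

    # A* = branch and bound + extended set + admissible heuristic (illustrative sketch).
    import heapq

    graph = {"S": {"A": 3, "B": 2}, "A": {"G": 5}, "B": {"A": 1, "G": 7}, "G": {}}
    h = {"S": 6, "A": 4, "B": 5, "G": 0}          # assumed "crow flies" estimates of distance to G

    def a_star(start, goal):
        queue = [(h[start], 0, [start])]          # (predicted total length, length so far, path)
        extended = set()
        while queue:
            _, length, path = heapq.heappop(queue)
            node = path[-1]
            if node == goal:
                return length, path
            if node in extended:
                continue
            extended.add(node)
            for child, edge in graph[node].items():
                g = length + edge
                heapq.heappush(queue, (g + h[child], g, path + [child]))
        return None

    print(a_star("S", "G"))                       # -> (8, ['S', 'A', 'G'])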
 
Note, an admissible heuristic may grossly underestimate the lower bound.
This is legal, after all, it only says the actual path will be >= the estimate and if the
estimate distance is too low (like 0), then this is still valid.
However, in this case, the admissible heuristic will always lead you to explore paths
that have actual distance much longer than the lower-bound estimate distance.
So, the admissible heuristic leads to an inefficient search.
 
Worse, if we use a visited flag (or a list of previously extended nodes) then we will
first reach a node through a long path with an artificially low admissible heuristic, and
then when we hit a common node with a shorter path (but a higher and more realistic admissible heuristic)
then we will not search that path because we visited the common node before.
 
So, a useful admissible heuristic must not only provide a guaranteed lower bound but also
a "realistic" lower bound.
 
Admissible Heuristic: H(X) <= actual distance from node X to the goal, for every node X

Consistent Heuristic: |H(X) - H(Y)| <= actual distance between nodes X and Y,
                    for every pair of nodes X and Y
 
A Consistent Heuristic does not have this problem of causing you to ignore the best path.
 


                                                                            
 
Lecture 6 - Search: Games, MinMax, Alpha-Beta
---------------------------------------------------
This is presented in the context of chess, since that is a common problem.
These are adversarial spaces, with 2 opponents trying to reach opposite goals.
 
Some ways to play:
- Human way: analysis + strategy + tactics ==> next move
    Nobody knows how to do this.
- If-Then rules: many many rules.
    These lead to weak play. This uses very local state, and no long-term strategy that spans rules
- Look ahead and evaluate.
Look at all the moves and all the possible consequences of each move.
Each is a new potential state of the game and we choose the best state
and then pick the move that will lead us there.
This uses a function
status = function(current board state)
where status is a score, a numeric quantification of the "goodness" of the board state.
 
The status score is a combination of a bunch of sub-scores, each looking at a different feature
of the board. For example, in chess this may be how many pieces you have, are you in check, is
the opponent in check, etc.
 
How do you combine the subscores? One possibility is:
        score = c1*f1 + c2*f2 + ...
Where c1, c2, etc. are weighting constants
      f1, f2, etc. are functions, each evaluating the board and returning a score about some aspect of the state of the board
This combination of subscores is just a linear polynomial, and is called a "linear scoring polynomial"
 
Consider a tree of moves. Each node is a state of the board and each edge is a move.
Branching factor b is the number of children of a node
If we only look a certain depth d down the tree
   # leaf nodes = b ** d
 
In chess, a whole game is ~100 moves (50 per player) and the branching factor is roughly 10-14.
So, there are about 10 ** 100 leaf nodes, and a British Museum search is impossible.
    There are about 10 ** 80 atoms in the universe
    There are about 3 * 10**7 seconds in a year
    There are 10**9 nanoseconds per second
    The universe is about 10**10 years old
 
So, we may look ahead a finite depth, but the entire search space is too big.
We look several levels deep, so don't just think 1 step ahead, but we cannot go to
the bottom of the tree.
 
As an example, consider a simple tree with branching factor of 2 and depth of 2.
 
There is 1 root, 2 intermediate nodes, and 4 leaves
The leaf values are 2, 7 and 1, 8
 
By convention, we consider 1 player to be the maximizing player. He wants the biggest possible score.
The opponent is the minimizing player, he wants the minimum possible score.
 
One simple implementation:
 
Min/Max Algorithm
---------------------------------------------
    First, decide who moves next at each depth in the tree
    You alternate moves, first max moves, then min moves then max moves, and so on.
    Each player takes turns, so each depth of the tree is associated with 1 player.
 
    You can decide the min or max move recursively.
        At a node with only leaves,
            Pick the child node that is either min or max depending on whether you are the min or max player.
            This will assign a min/max value to this node
 
        At a node whose children all have a min/max value
            Pick the child node that is either min or max depending on whether you are the min or max player.
            This will assign a min/max value to this node
    Start at the leaves, and work your way up the tree
This example assumes that we can identify the leaves.
You can start at the root and generate all possible permutations.
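
A minimal recursive minimax over that example (a sketch; the tree is written as nested lists with the static values 2, 7, 1, 8 at the leaves):

    # Minimax over a game tree given as nested lists (illustrative sketch).
    tree = [[2, 7], [1, 8]]                       # branching factor 2, depth 2

    def minimax(node, maximizing):
        if not isinstance(node, list):            # leaf: return its static evaluation
            return node
        values = [minimax(child, not maximizing) for child in node]
        return max(values) if maximizing else min(values)

    print(minimax(tree, maximizing=True))         # -> 2: max(min(2, 7), min(1, 8)) = max(2, 1)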
 
This is an exponential space, so it is computationally expensive.
How can we make it more efficient? In other words, how can we prune the search space to
reduce the amount of searching we have to do?
 
Start with a few observations:
1. If the parent node is a minimizer node, and it has a child node of value k, then the parent node will be <= k
The min player will be able to reduce it to at most k.
 
2. If the parent node is a maximizer node, and it has a child node of value k, then the parent node will be >= k
The max player will be able to raise it to at least k.
 
So, suppose a maximizer node has 2 children, each a minimizer node at the next depth layer.
If evaluating the first child shows its value is 2, and a partial evaluation of the second child shows its value is <= 1,
then do not evaluate the rest of the second child: the min player would drive it to 1 or less, so the maximizer will take the first child anyway.
 
So, you evaluate the tree and propagate up constraints. The constraints will let you ignore partially
evaluated subtrees.
This is alpha/beta, which is just an optimization on top of the min/max algorithm.
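
A sketch of the same recursion carrying the two bounds (alpha = the best score the maximizer can already guarantee, beta = the best the minimizer can); when they cross, the remaining children of that node are skipped:

    # Minimax with alpha-beta pruning (illustrative sketch).
    def alphabeta(node, maximizing, alpha=float("-inf"), beta=float("inf")):
        if not isinstance(node, list):
            return node
        if maximizing:
            best = float("-inf")
            for child in node:
                best = max(best, alphabeta(child, False, alpha, beta))
                alpha = max(alpha, best)
                if alpha >= beta:                 # the minimizer above will never let play reach here
                    break
            return best
        else:
            best = float("inf")
            for child in node:
                best = min(best, alphabeta(child, True, alpha, beta))
                beta = min(beta, best)
                if alpha >= beta:                 # the maximizer above will never let play reach here
                    break
            return best

    print(alphabeta([[2, 7], [1, 8]], maximizing=True))   # -> 2; the leaf 8 is never evaluated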
 
You have to completely evaluate one child to know whether to partially or completely
evaluate another child of the same parent. However, this is true at each depth you reach.
You start at the leaves and recurse up the tree.
In practice this means you completely evaluate the left side of the tree, and then start pruning
more and more right children as you move up.
 
In the simple case, you compare the known min/max of a left child with a constraint of the right child.
This compares two children of the same parent node.
 
Once you have a score for the entire left half of the tree, you can start pruning
nodes by comparing a min/max constraint with the left half-tree. In other words,
you can decide to prune some deep subtree because you know that the top-most child decision
would go another way.
This is deep cutoff.
 
This can save some work, but it's not huge.
Ideally, the number of static evaluations (given depth d and branching factor b) will
reduce from b**d to approximately 2 * b**(d/2).
In practice, it comes close to this ideal case.
 
This can double the number of levels you can generate.
 
 
So far, we assumed the branching factor is always constant.
In practice, the branching factor can vary with the game.
So, the program should track what branching factor it actually has, and adjust its
depth.
In practice, you compute the optimal move for level d-1 before starting to search level d.
This is an insurance policy. If you don't finish level d, then you use the best move
for level d-1.
 
How much does this insurance policy cost?
For level 0 (the root), the cost is fixed at 1; for level 1 it's b; for level 2 it's b**2, and so on,
where b is the branching factor
 
So, to go d-1 levels down, this is 1 + b + b**2 + b**3 + .... + b**(d-1)
This simplifies to:
    Cost = 1 + b + b**2 + b**3 + .... + b**(d-1)
    b*Cost = b + b**2 + b**3 + .... + b**(d)
    - subtract cost from b*Cost
    b*Cost - Cost = (b + b**2 + b**3 + .... + b**(d))
                     - (1 + b + b**2 + b**3 + .... + b**(d-1))
    b*Cost - Cost = b**(d) - 1
    Cost(b - 1) = b**(d) - 1
    Cost = (b**(d) - 1) / (b - 1)
This is approximately b**(d) / b = b**(d-1).
So computing all the earlier levels adds about 1/b of the cost of level d; for a branching factor of 10, that is just 10% added cost.
 
It gives you an insurance policy that if you ever run out of time you have
a nearly-as-good answer.
It also means you can keep searching deeper and deeper depths until you run out of time for the next move.
So, it adapts to the actual branching of the current game state and uses all
available time for making the best possible answer.
This strategy is called progressive deepening
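
A sketch of progressive deepening with a time budget (the depth-limited search here is a placeholder; in a real program it would be the alpha-beta search above with a static evaluator applied at the cutoff depth):

    # Progressive deepening: always keep the best move from the last completed depth (illustrative sketch).
    import time

    def search_to_depth(position, depth):
        # Placeholder for a depth-limited minimax/alpha-beta; assumed to return (best_move, score).
        return ("move-chosen-at-depth-%d" % depth, 0)

    def choose_move(position, time_budget_seconds=0.1, max_depth=20):
        deadline = time.time() + time_budget_seconds
        best_move = None
        depth = 1
        while time.time() < deadline and depth <= max_depth:
            best_move, _ = search_to_depth(position, depth)   # insurance: the answer from depth d-1 is always in hand
            depth += 1
        return best_move

    print(choose_move("start-position"))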
 
So, you start with minimax and optimize it with alpha/beta and progressive-deepening
 
In practice, you can do deeper subsearch on some subtrees if there is a particularly
good possible state.
 
Note, now the algorithm also goes top-down. So at each level d, it computes
optimal move for that level and also computes all possible game state permutations
to compute level d+1
 


     

Lecture 7: Constraints: Interpreting Line Drawings
---------------------------------------------------

An example is a 4-color map problem
Another example is interpreting line drawings to decide how many objects are in the image.


One approach - Guzman's Solution
--------------------------------
Look at the lines and find their intersections.
    Find arrows (->) and forks (-<)
    These tend to be 3 different faces of the same object.
    An arrow is looking at the far corner of a block from the middle
    A fork is looking at the far corner of a block from an angle above that corner

Number the faces
Represent the entire image as a graph.
    Each node is a face
    An edge exists between 2 nodes if they share one of these special junctions

If we say that at least one edge between faces means they are the same object, then we are too permissive, and everything becomes part of one giant object.
If we say that at least two edges between faces means they are the same object, then we are too conservative, and some faces are not part of one object

Instead, we say two edges between faces means they are the same object, and then we combine their nodes and iterate.
Combining the nodes will put more links between remaining nodes and usually/eventually all faces become part of an object.
    
This works because "the world is full of three-face vertexes", and this correlates to a arrow or fork
and "usually" correlates to 2-shared links in the greaph of faces. 


A second Approach - Huffman's Solution
--------------------------------------
This started with several assumptions
1 - The world is presented in "general position", so a small change in viewpoint does not change the junction types (no accidental alignments).
2 - The world is trihedral, so all vertices are the intersection of 3 planes
3 - There are 4 kinds of line labels: concave, convex, and boundary (which counts twice)
   These differ based on which side of the line the object is on.
   Concave (-). This is a corner on the inside of an object, like the corner of a room; the surfaces around the line recede from the viewer.
   Convex (+). This is a corner on the outside of an object, like the edge of a cube; the surfaces around the line project toward the viewer.
   Boundary (arrow). The object is on one side and the background on the other, so a boundary separates object from background.
      The arrow points in the direction such that the object's face is on the right when going in the direction of the arrow.
   Boundary counts as 2 labels, depending on the direction of the arrow.
This ignores shadows

Some vocabulary:
   The real world has vertexes and the model portrays these with junctions
   The real world has edges and the model portrays these with lines

Under these assumptions and labels, there are only 18 legal ways to label a junction:
6 L's, 5 forks, 4 T's, and 3 arrows.


If there are 3 faces, then there are 8 octants.
    I guess this is like the 8 octants of a xyz 3-dimensional graph.
If we fill all 8 octants or 0 octants, then there are no vertices, so we ignore those 2 cases.

If just one of the 8 octants is filled, then you can view it from one of the remaining 7 octants
    From a diagonally opposite octant, you see a solid in one of the corners of the octant space
        You see 3 convex lines meeting at a point at the corner.
    From an adjacent octant, you see only 2 boundary lines, the head of one arrow connects to the tail of the other
    From above and to the side, you see an arrow junction: 2 boundaries (the arrow head) and a convex spine that is the shaft

If 7 of the octants are filled, then you are viewing it from the one open octant, and you see the 
    Inside corner of a box. This is an arrow with 3 concave lines meeting at a point

If 5 of the octants are filled, then you are viewing it from one of the 3 open octants
    The solid is 4 filled octants (taking up half the octant space) plus a top block.
    The 3 open octants are like a tunnel that makes a 90-degree turn around the top block
    At the apex of the turn, you see the point of the solid, so this is an arrow junction with
        2 boundaries (arrow heads point toward each other) and a convex
    At either entry to the turn, there is a concave line along the floor and a boundary which is the edge that the tunnel turns around
        There are actually 2 ways to describe this, the arrow points toward the concave or the arrow points away from
        the concave. This is just which side of the tunnel's turn you are on.

If 3 of the octants are filled, then you are viewing it from one of the 5 open octants
    The solid is a 90-degree bend.

    You can also have a 3-way junction that is a groove and you are looking along the bottom of the groove.
        There are 2 edges on the 2 sides and a concave spine

    And, there are more combinations

    A T requires that one object obscures another object.
        The cross pieces of the T are all boundary lines with the obscuring object on the right
        There are 4 combinations. Note there are 8 permutations of a binary-direction arrow with a T of 3 arms (2**3) but half of these must be redundant

If 2 or 4 or 6 of the octants are filled, then you can have more than 3 faces visible.
    For now, we ignore the cases where there are more than 3 faces visible.

To label an image, you can start labelling the outside borders with arrows so this is an object in space against
a background.
Now, you can look at each junction where 2 lines are labelled and then look in the catalog to see the possible labellings
for the third line. There often is only one possible labelling.

Once you have labelled all lines, then make sure that every junction is a legal junction.
If one junction is not legal, then the image cannot correspond to a physical object.
If all junctions are legal this may or may not be possible to create - it is necessary but not sufficient.
It may be possible to build a shape with more than 3 faces - but that is not allowed in this model.


David Waltz Solution
-------------------------------------------------
This tries to add cracks and shadows and light and junctions of 4 or more faces
This now has 50+ labels, and there are 1,000's of possible junctions

This used a depth-first-search program to explore the possible space. But that is combinatorially explosive.
So, don't try to DFS the space because the space is too big.
So, he had an alternate model - constraint propagation.
For illustration, consider Huffman's labels, but in reality Waltz had a bigger vocabulary.

1. For each junction - list the possible junctions that could be that junction.

2. For each junction - look at the neighbor junctions. Discard any possible labelling that is not allowed by the list of possible labellings in a neighbor junction.
Two adjacent junctions share a line, and the line must have the same label in both junctions.
This will propagate constraints and prune the space of possible labellings.

So, constraints do a lot of pruning.

Now, in these examples, a constraint is binary. 
It excludes a possible labelling in an adjacent node, so a possible labelling is either possible/impossible.
Constraints don't give an analog "likelihood" score.
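
A generic sketch of this kind of constraint propagation: each junction keeps a set of candidate labelings, and a candidate is discarded if no candidate of a neighboring junction agrees on the label of their shared line. (The two-junction "catalog" here is a made-up miniature, not Huffman's real 18-entry catalog.)

    # Waltz-style constraint propagation over junction label sets (illustrative sketch).
    # Each candidate labeling assigns a label ('+', '-', '>', '<') to each line at the junction.
    candidates = {
        "J1": [{"a": "+", "b": ">"}, {"a": "-", "b": ">"}],
        "J2": [{"a": "+", "c": "<"}, {"a": "+", "c": ">"}],
    }
    shared_lines = {("J1", "J2"): "a", ("J2", "J1"): "a"}     # J1 and J2 share line "a"

    def supported(labeling, neighbor_candidates, line):
        # A labeling survives only if some neighbor labeling gives the shared line the same label.
        return any(labeling[line] == other[line] for other in neighbor_candidates)

    changed = True
    while changed:                                # keep pruning until nothing more can be discarded
        changed = False
        for (junction, neighbor), line in shared_lines.items():
            kept = [lab for lab in candidates[junction]
                    if supported(lab, candidates[neighbor], line)]
            if len(kept) < len(candidates[junction]):
                candidates[junction] = kept
                changed = True

    print(candidates)    # J1 keeps only its "a": "+" labeling, because J2 forces line "a" to be "+"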




                                   

Lecture 8: Constraints: Search Domains
---------------------------------------------------

As an example, consider map coloring
    No adjacent states can have the same color

A depth-first-search would be far too expensive (10**18 years or so at 30/sec).
Instead, we can use constraint propagation.

Initially, assign a list of possible colors to each state.
But, at first, there are no constraints. Any state can be any color. We need 
something to get the process started.

If we simply rotate through the colors in a fixed order, a state whose already-colored neighbors use 4 or more colors cannot get a color in a 4-color map problem.
So, we need to be able to back up, and allow 2 of its non-adjacent neighboring states to have the same color.

Vocabulary:
- Variable V can have an assignment
- Value X can be assigned to a variable
- Domain D is a set of values
- Constraint C is a limit on variable values. In map coloring it is a limit on pairs of variable values.

So, states are variables, colors are values, domains are remaining color possibilities
The constraint is no 2 border states can have the same value.

Initialize the map so every state has a list of all possible colors
We do a Depth-First-Search, assign a value to each state as we traverse the map
Essentially, when assigning a color to a state, make sure that this choice will not create a problem for an adjacent state

Domain Reduction Algorithm
--------------------------
For each DFS assignment of a color to a state
    For each variable Vi "considered" (this means examining the variables of adjacent states)
        For each value Xi in the domain (list of possible colors) of Vi
            For each constraint C(Xi, Xj) involving an adjacent variable Vj
                If there is no value Xj in the domain of Vj such that the constraint is satisfied, then discard Xi from the
                    domain (list of possible colors) of Vi
                If the domain of Vi becomes empty, then back up to the next iteration of the parent loop

The intuition is, when assigning a color to one state, check to see if that will create
problems for any adjacent state.


One open question is what it means to "consider", or how many validity checks we do when assigning a color.
There are several possibilities:

- Check nothing. This leads to invalid solutions.
- Check only adjacent states and make sure the 2 states are different colors. This is not enough in practice - you go down many dead ends by picking a value too soon.
- Check every state. This can solve the problem, but it is overkill - you solve the entire problem at each step of solving the problem.
- Check just the neighboring states. This is the Domain Reduction Algorithm.
    So, when you pick a color for a state, make sure that every border state still has at least one option.
    This works, but it still goes down a bunch of dead ends.
    Still, it is practical.
- Check just the neighboring states - but if we have to change a domain of a neighbor state, then make sure this
    does not conflict with any of its neighboring states.
    This propagates variables whose domains have been reduced at all
- Check just the neighboring states - but if we have to change a domain of a neighbor state down to a single value,
    then make sure this does not conflict with any of its neighboring states.
    This propagates variables whose domains have been reduced to a single color

There are some optimizations to this general algorithm.
You can order the states you visit in the DFS to search the most constrained states (those with most neighbors) first.
This finds a solution faster. It prunes the search space earliest in the search.
So, do the least constrained states last.
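
A small sketch of DFS assignment plus this neighbor check for map coloring (the four-state map is made up; most-constrained-first ordering is approximated by picking the unassigned state with the smallest remaining domain):

    # Map coloring: DFS assignment + domain reduction on the neighbors (illustrative sketch).
    colors = ["red", "green", "blue", "yellow"]
    neighbors = {"A": ["B", "C"], "B": ["A", "C", "D"], "C": ["A", "B", "D"], "D": ["B", "C"]}

    def solve(assignment, domains):
        if len(assignment) == len(neighbors):
            return assignment
        state = min((s for s in neighbors if s not in assignment), key=lambda s: len(domains[s]))
        for color in domains[state]:
            new_domains = {s: list(d) for s, d in domains.items()}
            new_domains[state] = [color]
            ok = True
            for nb in neighbors[state]:                      # check every bordering state
                if nb not in assignment and color in new_domains[nb]:
                    new_domains[nb].remove(color)
                    if not new_domains[nb]:                  # a neighbor has no color left: back up
                        ok = False
                        break
            if ok:
                result = solve(dict(assignment, **{state: color}), new_domains)
                if result:
                    return result
        return None

    print(solve({}, {s: list(colors) for s in neighbors}))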

Map coloring also maps onto many resource planning problems.
- Scheduling planes, deciding which plane flies which route.
You have a graph, where cities are nodes and flights are edges.
We can name the flights f1, f2, f3, f4, f5
You can try to fly this schedule with different aircraft.
Now, each flight has a list of possible aircraft.
Constraints: No aircraft can be on 2 routes at the same time.
            Each aircraft has to be on the ground for a minimum time between flights
Domains: the list of aircraft that could fly each flight
Variables: The flights
 





Lecture 9: Constraints: Visual Object Recognition
--------------------------------------------------

Recognizing Objects - faces are a good example

David Marr theories on Vision
-----------------------------
Vision is a series of steps
1. Start with camera input and first find edges
    This creates the "Primal Sketch"
2. Decorate the primal sketch with vectors that are normal to all faces
    This creates the 2-1/2D sketch
3. Convert the 2-1/2 D sketch into generalized cylinders
    A regular cylinder is a circle moving along an axis
    A generalized cylinder is a circle moving along an axis, but the circle can change diameter at different points
4. Match the generalized cylinder against a library of pre-known descriptions

But, nobody can make this work.

Alignment Theory of Recognition
-------------------------------
You can reconstruct any view of an object from 3 different 2-d images.
This requires a translucent image, so no vertices are obscured
This also assumes that you are far enough away that there are no perspective problems - you have an orthoscopic projection

Each object in a library has these 3 views
Now, you can map a new unknown object to a list of library objects.

For each library object, label the vertices p1, p2, p3, ...
Really, there is p1A for view A, p1B for view B and so on.
There is also a p1, p2, p3, .... for the unknown object.

To match an unknown objects, we create a series of equations 
p1 = alpha*P1A + beta*P1B + gamma*P1C + tau
p2 = alpha*P2A + beta*P2B + gamma*P2C + tau
.....
Note, the alpha, beta, gamma and tau are the same constants for all equations
This transforms the vertices from the unknown object into the space of the library images
All of these are 2d drawings, a 2-d camera view of the object.

There are 4 unknowns, so we can take 4 sets of corresponding points and then solve
the system of 4 equations with 4 unknowns.
This should solve for alpha, beta, gamma, tau.
To decide whether an unknown matches the model, then every set of corresponding points should also
work with the same values for alpha, beta, gamma, tau.
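
A numpy sketch of that matching test (the coordinates are made up): take 4 corresponding points, solve the 4x4 system for alpha, beta, gamma, tau, then check that the remaining correspondences are predicted by the same constants.

    # Alignment check: solve for alpha, beta, gamma, tau from 4 correspondences (illustrative sketch).
    import numpy as np

    # x-coordinates of points 1..5 in the three library views A, B, C (made-up numbers).
    xA = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    xB = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
    xC = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

    # The unknown view is built here from known constants, so the match should succeed.
    xU = 0.3 * xA + 0.4 * xB + 0.2 * xC + 1.0

    # Solve the 4x4 system using the first 4 correspondences.
    M = np.column_stack([xA[:4], xB[:4], xC[:4], np.ones(4)])
    alpha, beta, gamma, tau = np.linalg.solve(M, xU[:4])

    # Verify: the remaining correspondence must be predicted by the same constants.
    predicted = alpha * xA[4] + beta * xB[4] + gamma * xC[4] + tau
    print([round(v, 3) for v in (alpha, beta, gamma, tau)], np.isclose(predicted, xU[4]))   # -> [0.3, 0.4, 0.2, 1.0] True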

As a simpler example, lets consider an object that is in the XY plane, and rotated around a vertical line in the Z axis.
This is a simple example, and can generalize to rotations in X, Y, and Z axes.

Suppose there is a point s that is in the model object when it is in some normal position.
    This point has an X,y coordinate, which we represent as a vector from the origin 0,0 to x,y
     x is xs and y is ys
We take 2 views of this model, view A and View B
    The A and B views were made by simply rotating the object from its standard position around the Z axis.
    So there is a point in each rotation that corresponds to point s in the standard position
    Each of these is represented by a different vector 
        xa, ya,   and xb, yb
    So, the vector from 0,0 to the corresponding point in rotation A is thetaSA degrees off the vector from 0,0 to the point in standard position.
    So, the vector from 0,0 to the corresponding point in rotation B is thetaSB degrees off the vector from 0,0 to the point in standard position.

Let's say there is an unknown object, that is yet another rotation around the Z axis from standard position
    There is a point in each rotation that corresponds to point s in the standard position
    This is represented by a different vector 
        xu, yu
    The vector from 0,0 to the corresponding point in rotation U is thetaSU degrees off the vector from 0,0 to the point in standard position.

So, how do we express xa in terms of xs and ys and thetaSA?
    The vector from 0,0 to s has 2 component vectors, x and y
    To rotate s by thetaSA, 
        Rotate the x vector by thetaSA
        Rotate the Y vector by thetaSA
        Add up the contributions

    Xa = Xs*cos(thetaSA) - ys*sin(thetaSA)
And similarly:
    Xb = Xs*cos(thetaSB) - ys*sin(thetaSB)
    Xu = Xs*cos(thetaSU) - ys*sin(thetaSU)

Now, ALL points in the object go through the same rotation when the object is rotated once.
So, every set of corresponding points like p1a, p1b, p1u  or p2a, p2b, p2u  will all be at the same
angle offsets to each other.
We can treat sin(thetaSA), cos(thetaSA), sin(thetaSB), cos(thetaSB) as constants.

We know xa and xb, these are from our models in the library and we can empirically measure them.
Now, we can use the system of 2 equations and 2 unknowns, and derive xs and ys
    Xa = Xs*cos(thetaSA) - ys*sin(thetaSA)
    Xb = Xs*cos(thetaSB) - ys*sin(thetaSB)
The unknowns are xs and ys; everything else can be measured.
So, we go from the 2 observed rotations to derive the standard orientation

Now, we can derive an alpha and beta such that
    xu = alpha xA + beta xB
    for some complicated alpha and beta which is a function of all the angles (details not explained)

If you add translation (moving the object along an axis) then you must include a tau term in the
original expression.
    p1 = alpha*P1A + beta*P1B + gamma*P1C + tau
So, now the standard position is derived from a system of 3 unknowns (xs, ys, tau)
and so you need 3 equations. To get a third equation, you need a third reference view, C

For 3 degrees of rotation and translation you need 4 equations and 4 unknowns.
(BTW, this is why we use 4-dimensional matrices in graphics)

This scheme works great for simple rotation and translation.
But, in real life, organic things like human faces may stretch and skew, not just rotate and translate.



Correlation Theory of Recognition
-------------------------------
Don't try to match entire faces, or even individual features
Look for intermediate-sized features, like a nose and mouth or 2 eyes in a horizontal line

Consider a simpler model. A function that maps y=f(x). This could be a sine wave or anything.
Now, consider a stored library image, and it may be the same shape of signal but at an offset, so it starts 
at x=10. This is y=g(x)

The question is: can we decide that those 2 signals are the same?
Formally, compute integral(f(x) * g(x - offset)) dx for each candidate offset.
Try this for every offset and use the one with the biggest result. That will
be the offset that best aligns the two signals.
(BTW, this is intuitively what a Fourier transform does, except it uses as a pattern e**(jwx), which is cos(wx) + j*sin(wx).)

In 2-dimensions, we can use integral(f(x,y) * g(x - XOffset, y - YOffset))
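
A small numpy sketch of the offset search in one dimension (the signal is made up): slide g across every offset, sum the product, and take the offset with the biggest response.

    # Find the offset that best aligns g with f by maximizing the correlation (illustrative sketch).
    import numpy as np

    x = np.arange(100)
    f = np.exp(-(x - 50.0) ** 2 / 50.0)            # stored library signal: a bump centered at 50
    g = np.roll(f, 10)                             # unknown signal: the same shape shifted by 10

    scores = [np.sum(f * np.roll(g, -offset)) for offset in range(100)]
    print(int(np.argmax(scores)))                  # -> 10: shifting g back by 10 re-aligns it with f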





Lecture 10 - Introduction to Learning
==========================================
There are several approaches to machine learning

I. Regularity Based Learning (AKA "brute Force learning" or "bulldozer" learning)
    - Nearest Neighbor (based on pattern recognition)
    - Neural nets (derived from ideas in biology)
    - Boosting (based on a theoretical approach)

II. Constraint Based Learning (more human-like)
    - One shot based learning
    - Explanation-based learning

This lecture will go into Nearest Neighbor learning

Take an image and derive a vector of feature values
Then compare the vector of feature values to a library of known feature vectors
    Pick the closest match.

Example is matching electrical cover plates.
Measurements include 
    - Total area of the plate
    - Total area of all holes in the plate
There is error in recognition camera and variability of the subjects. So don't expect an 
exact match but pick the closest match.

In practice, each point in the library has a min and max legal value for each entry in the vector.

You can plot these on a N-vector graph.
The boundary between any two valid points is the midline that is a perpendicular 
bisector of the line between the two points.
This line is called a "decision boundary"
You can do this for every pair of points, and create a graph of lines that divides up the space.

This approach lets you identify things, if you have a pre-existing library.

If you have only some of the features in the value, then you can still try to find the closest match for those values.
This uses the principle that if 2 things are similar in some respects then they are similar in other
respects.

For example, for detecting blood cells, we might use values like:
    - Total area of cell
    - Area of nucleus

Alternatively, you may use the occurrence of keywords.
Identify the subject of a news article by:
    - Create a vector of 1/0 binary values that shows whether each keyword is present in the article
    - Compare the vector with a library of known vectors
For example, try to distinguish articles from Wired and "Town and Country"

Given an unknown, in the N-dimensional space, how do you decide which library item is closest?
The unknown object is some vector value, so it is a point in N-dimensional space.
Each known object in the library is also some vector value, so each is also a point in N-dimensional space.

There are several approaches:
1. You can use Euclidean distance from the unknown point to the "representative" point 
for each category in the library. This can lead to wrong answers.

2. You can draw a vector from the origin to the "representative" point for each category in the library
Decide which vector the new unknown point is closest to.
You can also draw a line from the origin to the unknown point.
Pick the vector to a reference point with the smallest angle to the vector to the unknown point.

Specifically, measure the cos(theta) where theta is the angle between each vector.

cos(theta) = ( Sum-Of(unknownVector[i] * referencePointVector[i]) ) 
                / ( magnitude(unknownVector) * magnitude(referencePointVector))

This is just a normalized dot-product.
There are more complicated metrics, but this metric works pretty well and is simple and fast.
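
A small sketch of that comparison (the feature vectors are made up): pick the library category whose reference vector makes the smallest angle (largest cosine) with the unknown's vector.

    # Nearest neighbor by angle between feature vectors (illustrative sketch).
    import numpy as np

    library = {
        "cover-plate-A": np.array([10.0, 2.0]),    # e.g. (total area, hole area); made-up values
        "cover-plate-B": np.array([6.0, 5.0]),
    }
    unknown = np.array([9.0, 2.5])

    def cosine(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    best = max(library, key=lambda name: cosine(unknown, library[name]))
    print(best)                                    # -> cover-plate-A (smallest angle to the unknown)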

As an alternate problem:
Consider a robot arm with 2 joints, so it can move 2 different joints, each in a plane and
we can describe the angles of these joints as theta1 and theta2.
Now, try to plan motion so it moves the tip of the arm along a line parallel to the X-axis
Solving this with Newtonian mechanics is possible, but the math turns out to be incredibly complex,
including Coriolis forces because of the constrained degrees of motion.

Instead, we can solve this with a table, with columns for theta1 and theta2, velocity of theta1,
velocity of theta2, acceleration of theta1 and acceleration of theta2, torque on motor for
joint1, and torque on motor for joint2

Now, let the arm wave around randomly, and record the torques. 
This makes a table of random associations.

Now, we will divide the trajectory into small sections, each is a part of the path.
- For each small piece of the path, determine the angles theta1 and theta2,
velocity of theta1, velocity of theta2, acceleration of theta1 and acceleration of theta2
- Look up in the table, and use the distance metric to find the closest row that has the same
values for the angles theta1 and theta2,
velocity of theta1, velocity of theta2, acceleration of theta1 and acceleration of theta2
- The closest match row, also has torque of motor1 and motor2, which control the joint for theta1 and
the joint for theta2.

The training phase is trying random torques and measuring the resulting positions.
This builds up the table.
Once the table is big enough, then there are enough data points that it can find the best torques.
In practice, there are some amazing robot demos of very dextrous actions that are learned.

Note, each row has some observed output values and some input values.
You pick the row based on the closest match of the output values, and then use the input values from the
same selected row to control the machine.


This can be applied to any system of control, including drugs and HTN.

You have 10**10 neurons in the brain outside the cerebellum.
You have 10**11 neurons in the cerebellum.
Each cerebellum neuron has approx 10**5 synapses.
So, there are 10**16 synapses in the cerebellum.
It's possible that the cerebellum is really just a massive lookup table.

There are some problems with this:
1. What if the space of samples does not cover the entire problem space? Then you cannot learn
enough to do the task.
For any dimension x:
Variance(x) = 1/n * SUM((x[i] - Mean(x))**2)
If the dimensions have very different spreads, then replace x with x' = x / sigma(x), where sigma(x) is the standard deviation (square root of the variance).
The variance of the new x' is 1.

2. What if one of the variables does not matter? This gives bad values




Lecture 11 : Learning: Identification Trees, Disorder 
===========================================================
As a (toy) example, consider a program to identify vampires
We have a list of properties (make a shadow, fear garlic, ....)
We start with a series of known true cases.
Some of the reference points may have slightly different values, but they are all examples of the same class.

This is different than nearest neighbor because all of the reference points are slightly different
- Characteristics are non-numeric
- Some of the characteristics may not matter
- Some of the characteristics may matter only for some cases
- Some characteristics may be more expensive to check than others.

Each characteristic is a test, not a value.
So, we start by building a tree of tests. Each test may have any number of outcomes. 
    Note, one of these outcomes may be "I don't know"
This is called an identification tree. This is different than a decision tree, but is similar.

A good tree has:
- No false negatives, no false positives
- A small, simple structure (Occam's razor - the simplest explanation is often the best explanation)
- Path through tree has the minimum cost of tests

Each test has a bunch of inputs, and will map each input test case to one of the 
result values.
Each result value has a group of reference cases that have that result.
So, a test will divide a group of input cases into several subsets, each corresponding to one result value.

For each test
    Apply this test to each known case
    For each outcome, record the number of cases with that outcome that 
        are a true positive, a true negative and an unknown
This tells you how each test divides up the known cases.

The best test to use is one that produces homogenous groups for each test result.
In other words, all cases that have a particular output from a test are either all 
    true-positive or else all true-negative reference points

So, a test results may unambiguously identify some reference cases, ie all reference cases
with a particular test result are either positive or negative.
But, other result values from the SAME test may be ambiguous, ie both positive
and negative reference cases will have that test result value.

Measure how many reference cases each test can unambiguously categorize
For each test case
    Look at each output value, and only consider output values with either all 
    true-positive or else all true-negative reference points 

    Count the number of reference points that ended up in any output for this test

    The sum of counts (all reference cases that ended up in any homogenous set) is the
    score for this test. It tells us how many reference samples seem to be in an unambiguous result.

    We assume that if a test output is unambiguous (all reference cases with that value 
    for that test are either positive or negative) then this is true - ie, there is not
    some reference case we have not yet seen that would make some output value ambiguous.

The first test in the tree is the root node.
This should be the test that has the largest score, so it will unambiguously identify the
most reference points.

So, all of the cases that have an output which is unambiguous are solved.
Now, the test cases that have an output that corresponds to both positive and negative cases
must be further subdivided.


So, now take the remaining reference cases that are ambiguous.
This REPEATS the above process for measuring how unambiguous each test is, but we ONLY
consider the reference cases that were ambiguous in the first test we used.


Continue repeating this until all cases are classified.


Now, in a large data set, usually no test will have a result that unambiguously identifies
all positives or all negatives.
This is ok, we will just always have to perform several tests to get one definite positive or
negative.
However, what is the best sequence of tests? (ie, fastest route to identification)
One way is to measure the amount of "disorder" or homogeneity in the cases that 
have each corresponding test result.

From Information Theory:
Disorder of a set = -((P/T) * log-base-2(P/T)) - ((N/T) * log-base-2(N/T))
P = positive
N = Negative
T = total (P + N)
The ratios are all < 1, so the logs are always negative, so -(-) is always positive.
Graph this with x-axis is (P/T) and y-axis is Disorder.
    The graph says
        If P/T=1/2 then disorder = 1
        If P/T=1 then disorder = 0
        If P/T=0 then disorder = 0
        L'Hopital's rule says you can use the limit (x * log(x) goes to 0 as x goes to 0).
The curve looks like an upside-down U that starts at 0, reaches a max of 1, and ends at 0.

Now, we can measure the quality of a test by measuring the disorder of cases for each test result value
and then add up the disorders for all sets corresponding to each output values.
Additionally, we will weight each disorder by the fraction of cases that flow down that branch.

Test-Quality = SUM-OVER-EACH-OUTPUT-VALUE( Disorder(set of cases with this test result value) * (#cases with this result value / #cases handled by the test) )
Here, we are measuring Disorder, so a lower score is better.
So, take the test with the lowest weighted disorder score, and that is the best test.
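
A sketch of the disorder computation and test scoring (the counts are a made-up toy, not the vampire data):

    # Weighted-disorder score for choosing the best test in an identification tree (illustrative sketch).
    import math

    def disorder(positives, negatives):
        total = positives + negatives
        if total == 0 or positives == 0 or negatives == 0:
            return 0.0                             # homogeneous (or empty) sets have zero disorder
        d = 0.0
        for count in (positives, negatives):
            ratio = count / total
            d -= ratio * math.log2(ratio)
        return d

    def test_quality(groups, total_cases):
        # groups: one (positives, negatives) pair per possible test outcome.
        return sum(disorder(p, n) * ((p + n) / total_cases) for p, n in groups)

    # Toy example with 8 cases: test 1 splits them 4/0 and 0/4, test 2 splits them 3/1 and 1/3.
    print(test_quality([(4, 0), (0, 4)], 8))       # -> 0.0: perfectly homogeneous, the best possible test
    print(test_quality([(3, 1), (1, 3)], 8))       # -> ~0.81: some disorder remains, so a worse test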

Now, we can do as before.
Start with the best test, then for each result group perform the next best test.


Now, you may use the same test twice, but use different thresholds for separating the input cases into output.

We can use this same approach - a tree of tests - for the cases we discussed in lecture 10 for
    decision boundaries in nearest neighbors.
But, with a tree of tests, the decision boundaries are made of separate straight-line segments, not one continuous dividing line.

This approach is better than simple nearest-neighbors.
    Does not need to use all tests, only enough to make a decision

This is used many times, but in practice in medicine we convert the tree to a set of rules.
Each rule is just a path down the tree to a leaf node.




Lecture 12 - Learning: Neural Nets, Back Propagation
======================================================
This mimics biology, and is the result of a lot of effort over the years

Let's start with a simple overview of the biology of neurons.
This has a cell body, an axon (the axon may be bifurcated, ie branch),  and dendrites (which heavily branch)
Dendrites form the dendritic tree.

    Dendrites take input from other axons.
    The axon feeds outputs to other dendrites
    Each junction is a synapse, which has a lot of control mechanisms.

Not all synapses are the same. Some have more influence on the overall neuron excitation than others.
    Synapses may be weighted, "synaptic weights"

The input from all dendritic synapses add/subtract to determine whether to excite the axon.
This is a binary decision, though, so the axon is either excited or not excited.
So, there is an analog blend of weighted inputs, and then it either meets a threshold or not.


Model of Neuron
- List of inputs: I1, I2, I3,....
    All inputs are 0 or 1

- Each input is multiplied by a weight: w1, w2, w3,....
    So, we have weighted inputs: I1w1, I2w2, I3w3, .....
    This converts the binary input into a scalar value
    Graphically, these are weighting boxes

- Combine the inputs.
    We sum all of the inputs.
    Simply summing the inputs is a big design decision
    I1w1 + I2w2 + I3w3 + .....
    Graphically, this is a summation box

- Measure the sum of weighted inputs against a Threshold.
    If (I1w1 + I2w2 + I3w3 + .....) >= Threshold, then the axon is 1
    If (I1w1 + I2w2 + I3w3 + .....) < Threshold, then the axon is 0
    Graphically, this is a threshold box
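
A direct Python translation of this model neuron (the weights, inputs, and threshold are made up):

    # One model neuron: weighted sum of binary inputs compared against a threshold (illustrative sketch).
    def neuron(inputs, weights, threshold):
        total = sum(i * w for i, w in zip(inputs, weights))
        return 1 if total >= threshold else 0      # the axon is either excited (1) or not (0)

    print(neuron([1, 0, 1], [0.5, 0.9, 0.4], threshold=0.8))   # 0.5 + 0.4 = 0.9 >= 0.8 -> 1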

Now, we can combine neurons to make a neural net
A group of neurons may take a list of inputs, I1, I2, I3,....
map them to different neurons, weight them differently, sum them, pass through one or more
thresholds
This will produce one or more outputs, z1, z2, z3, ....
So, there are several inputs and several outputs.
We can model these as vectors.

[z1, z2, z3,...] = F(I1, I2, I3,....; W)
Where W is the vector of input weights

We can train this: we want the net to exactly match the training inputs to
the training outputs
[Tz1, Tz2, Tz3,...] = G(TI1, TI2, TI3,....)

So, we want the regular function F to work like G for the training inputs.
We want a set of weights to make the general output match the training output.

We need a way of measuring how well the function is performing.
This is a "performance function", P, which is some difference between actual output and desired output.
Note this is only for training data, because we know the desired output, and we can run it through the net
to see the actual output for a specific set of weights.

In practice, make a vector that is the difference between the desired and observed outputs.
        P = [(Tz1 - z1), (Tz2 - z2), (Tz3 - z3),....]
Now, scale this to 1/2 * (P**2)
The squaring makes it always a positive number, and halving makes it mathematically convenient.
Graphically, 1/2 * (P**2) is a U-shaped graph with its trough at (0,0)

Now, again for convenience, make P = -1/2 * (P**2)
So, now this is an upside-down U, with its max at (0,0)

We can do a hill-climbing search to explore the space of weights, and then pick the
weights that maximize the performance function.

Intuitively:
    We can graph the Performance Function in a space of weights.
    We can also highlight the iso-performance lines, combinations of weights that make the same performance
        This is like isolines in a topographic map, or iso-utility curves in economics
    Consider just 2 weights, w1 and w2, so this is a 2-d cartesian coordinates.

    There may be some point that is optimal performance. This corresponds to the max of -1/2 * (P **2)
    or also to the min of 1/2 * (P **2)

    Now, start with a random weighting, w1, w2, we can see where this set of weightings lands
    on the graph that maps weightings to performance.
    Now, we use hillclimbing to pick a set of weights that maximizes performance.

    We can do this for every set of input and output training data.
    Of course, if there are N weights, then this may become computationally expensive.

In theory, we can also adjust thresholds.
This may make the exploration more complicated, but there is a trick.
Trick #1:
Add an extra input, I0, and a weight w0. I0 is always -1, and the w0 is the threshold, 
so w0I0 is -threshold. 
Now, the axon fires if summation is >= 0, forget the threshold.
So, this casts the threshold as just another weight, and we are back to just optimizing the weights.

Trick #2
In practice, for faster searching of a large space of many weights, ascend on a vector 
perpendicular to each contour line.
deltaW is the change to the weight vector [deltaW1, deltaW2, deltaW3,...]
For any i (1,2,3,...)
    deltaW[i] = k * (deltaP/deltaW[i])
so the full step deltaW = k * [deltaP/deltaW1, deltaP/deltaW2, ....] follows the gradient of P.

This works, except it requires a continuous function.
Trick #3
To make this a continuous function, we need the output to be not a pure binary step function,
but rather a very steep sigmoid.
So, make the output beta = 1 / (1 + e**(-alpha))   where alpha is the input to the threshold box

Now, let's consider the world's simplest neural net
One input, x (this plays the role of I1)
    p1 = w1*x
Simple thresholding function, a sigmoid function
    y = Sigmoid(p1)
Now, pass through a second weighting function
    p2 = w2*y
Pass through a second sigmoid function
    z = Sigmoid(p2)
Calculate the performance of z
    P = -1/2 * (desired - z)**2
The net is just a single chain, so the output is a scalar rather than a vector.
       w1                       w2
       |                        |
       \/                      \/
x ---(Mult)-P1->(Sigmoid)-y->(Mult)-P2->(Sigmoid)-z->(Performance)--> P = -1/2 * (d - z)**2

We want to optimize this, so we will need to travel along the direction of steepest ascent (the gradient).
So, we have to calculate partial derivatives of P with respect to different weights.

Specifically, we need partial derivatives of P with respect to w1 and w2
Start with w2, it's easier.

We use the chain rule, several times to go back each stage in the net.
    deltaP/deltaW2 = deltaP/deltaZ * deltaZ/deltaW2
but deltaP/deltaZ = (d - z)
    so deltaP/deltaW2 = (d - z) * deltaZ/deltaW2

We can use chain rule again to solve deltaZ/deltaW2
    deltaZ/deltaW2 = deltaZ/deltaP2 * deltaP2/deltaW2
But, deltaP2/deltaW2 is just y
    deltaZ/deltaW2 = deltaZ/deltaP2 * y
We can recombine these to get the final answer:
    deltaP/deltaW2 = (d - z) * deltaZ/deltaP2 * y

Now, we go back to the original question, and compute deltaP/deltaW1
Again, use the chain rule
    deltaP/deltaW1 = deltaP/deltaZ * deltaZ/deltaW1
deltaP/deltaZ was already solved above for the w2 case.
    so deltaP/deltaW1 = (d - z) * deltaZ/deltaW1
Again, use the chain rule again
    deltaZ/deltaW1 = deltaZ/deltaP2 * deltaP2/deltaW1
deltaP2/deltaW1 is again the chain rule
    deltaP2/deltaW1 = deltaP2/deltaY * deltaY/deltaW1
Now, deltaP2/deltaY is just w2
To solve deltaY/deltaW1, we again use the chain rule
    deltaY/deltaW1 = deltaY/deltaP1 * deltaP1/deltaW1
deltaP1/deltaW1 is just x
    so deltaY/deltaW1 = deltaY/deltaP1 * x

We can recombine these to get the final answer:
    deltaP/deltaW1 = (d - z) * deltaZ/deltaP2 * w2 * deltaY/deltaP1 * x

OK, but now we have to solve these partial derivatives for the sigmoid functions.
    deltaZ/deltaP2 and deltaY/deltaP1

Recall, we defined the sigmoid as
    beta = 1 / (1 + e**(-alpha))   where alpha is the input and beta is the output
So, we want deltaBeta/deltaAlpha
    beta = (1 + e**-alpha)**-1
    deltaBeta/deltaAlpha = -(1 + e**-alpha)**-2 * delta(1+e**-alpha)/deltaAlpha
    deltaBeta/deltaAlpha = -(1 + e**-alpha)**-2 * e**-alpha
    deltaBeta/deltaAlpha = e**-alpha / (1 + e**-alpha)**2
    deltaBeta/deltaAlpha = (1 + e**-alpha - 1) / (1 + e**-alpha)**2
    deltaBeta/deltaAlpha = 1/(1 + e**-alpha) * [(1 + e**-alpha)/(1 + e**-alpha) - 1/(1 + e**-alpha)]
But, 1/(1 + e**-alpha) = beta
    deltaBeta/deltaAlpha = beta * [1 - beta]

We chose the sigmoid function carefully, so it is continuous (which allows us to take derivatives)
and also acts *almost* like a digital threshold function (which is what we used originally),
and also the derivative of the output with respect to the input is just a function of the output.

So, now deltaZ/deltaP2 = z(1 - z)
We can recombine these to get the final answers:
    deltaP/deltaW2 = (d - z) * (z*(1 - z)) * y
    deltaP/deltaW1 = (d - z) * (z*(1 - z)) * w2 * (y *(1 - y)) * x

So, the deltaP/deltaW2 partial derivative depends only on quantities available at the last stage (d, z, and y).
And deltaP/deltaW1 reuses that work, adding only the local quantities w2, y, and x.
This is the back propagation algorithm
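
A minimal numeric sketch of training this two-neuron chain with the gradients just derived; the input, desired output, starting weights, and rate constant are invented for illustration:

    import math

    def sigmoid(a):
        return 1.0 / (1.0 + math.exp(-a))

    x, d = 1.0, 0.9          # training input and desired output
    w1, w2 = 0.3, 0.3        # initial weights
    rate = 0.5               # step size for hill-climbing on P

    for step in range(2000):
        y = sigmoid(w1 * x)              # first neuron output
        z = sigmoid(w2 * y)              # second neuron output (net output)
        dP_dw2 = (d - z) * z * (1 - z) * y
        dP_dw1 = (d - z) * z * (1 - z) * w2 * y * (1 - y) * x
        w1 += rate * dP_dw1              # ascend the performance function
        w2 += rate * dP_dw2

    print(round(sigmoid(w2 * sigmoid(w1 * x)), 3))   # moves toward the desired 0.9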

So, if there are 100 columns of neurons, the derivatives for each column reuse the results already computed for the column nearer the output.
Or, for a fixed number of neurons per column, the total amount of calculation is linear in
the depth (number of columns) of the net.

These are very simple neural nets, 
    There are no feedback loops so an output never becomes an input.

This is just scaling inputs to fit desired outputs. It is just fitting a curve.
It maps inputs to some desired curve output.

There are some problems with neural nets.
- How do we encode all the inputs into numeric values
- Sometimes, the curve gets tightly bent to fit the samples. This is called "overfitting" and
it leads to brittle behavior on new inputs.
- The rate constants can lead to positive feedback loops which will wildly oscillate 
rather than converge on the final answer.
	



Mega-Recitation 4: Neural Nets
======================================================
This lecture is going to present the specific strategies for solving
neural nets, including Back Propagation.

Each weight in the net may affect its final output result, so when we want to
explicitly describe this, and compute performance with respect to some
weight embedded deep inside the net, we have to do a lot of recursion.
Consider a neural net as a series of operations, with step 1, 2, 3, ... through N
and step N produces the final output. 

Now, stage N-1 will send its output to stage N. Eventually this reaches the final
output of the net. But, this means that any effect that stage N-1 has on the final
output is first filtered through N (and any directly connected outputs to stage N-1).
This is inherently a recursive problem. The effect of weight N-1 on the performance
can be described in terms of the effect of stage N on performance. 

So, rather than describe the effect of each stage on net performance in 
isolation, we start at the end and then describe the effect of the last
stage on net performance in absolute terms, then the effect of the
next-to-last stage effect on net performance in terms of the next to last
stage and the effect of the last stage on net performance.

So, we describe the effect of stage N-1 in terms of the effect of stage N.

This means technically, we need to recursively compute the network performance in terms of
the output of each stage. We call this delta(J) where J is the stage number.
delta(j) is not the true network performance, but it is the performance (desired - actual)
value for that specific stage in the network.


First, a few equations:
-----------------------
Sigmoid 
This is: y = 1 / (1 + e**-x)
The derivative of the Sigmoid is
    dSigmoid/dx = y(1 - y), where y is the sigmoid's output

Performance Function
    -1/2 (desired - actual) **2
This is 1/2 and squared to make the derivative simpler.
dPerformance/dx = (desiredOutput - actualOutput) where x is the final output.


W'
This is the new weight, after a previous step computed a value and then did back-propagation.
Each step will compute an observed output, then use backpropagation to revise the weights for
the next step.

Each W is the input for a multiplier.

W' = W + (alpha * Input * delta(J))
where 
    alpha is a constant, it's the size of steps you take in a hill-climbing search.
    Small alpha is small steps, and big alpha is big steps

    Input is the input coming into the multiplier. Note you use the value
    for the input from iteration N to compute the next weight used for
    iteration N+1

    delta(J) is a partial derivative of the performance of the net with respect 
    to this weight. It tells you how this weight affects the performance of the net.
    
    For the last node in the network, this is just:    
        delta(J) = dPerformance/dw = dPerformance/dSigmoidOut * dSigmoidOut/dw
    We already know that: dPerformance/dSigmoidOut = (desiredOutput - actualOutput)
    So, delta(J) = (desiredOutput - actualOutput) * dSigmoidOut/dw
    But, dSigmoidOut/dw = actualOutput(1 - actualOutput)
    So, for the last weight in the pipeline
        deltaJ = (desiredOutput - actualOutput) actualOutput(1 - actualOutput)
    This is the base case.

    Now, for any other node in the middle of the network, we can compute
    delta(J) recursively.
        dPerformance/dw = dPerformance/dSigmoidOut * dSigmoidOut/dw
    But, now dPerformance/dSigmoidOut is solved recursively, using the deltas of the later nodes.
    This is why we needed the base case, which was the last node.
    So, for a middle node K:
        delta(K) = actualOutput(K) * (1 - actualOutput(K)) * SUM(weight(K->i) * delta(i))
    where i is every node that takes an output from node K.
    This is the recursion: we already computed the delta(i) for the output-side
    nodes, and now we use those to compute this node that is one step back.

    ** Key Idea **
    So, this scheme requires that we compute the delta(i) for the output-side
    nodes first, and then use those (together with the weights that connect to them)
    to compute the weight changes further back in the net.

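
A minimal sketch of one back-propagation step written in this W' = W + alpha * Input * delta(J) form, for a simple two-stage chain (x -> *w1 -> sigmoid -> y -> *w2 -> sigmoid -> z); the specific numbers are invented for illustration:

    import math

    def sigmoid(a):
        return 1.0 / (1.0 + math.exp(-a))

    def one_step(x, desired, w1, w2, alpha=1.0):
        # forward pass
        y = sigmoid(w1 * x)
        z = sigmoid(w2 * y)
        # base case: delta for the final node
        delta_out = (desired - z) * z * (1 - z)
        # recursive case: delta for the earlier node uses the later node's delta
        delta_hidden = y * (1 - y) * w2 * delta_out
        # W' = W + alpha * (input to that multiplier) * delta
        w2_new = w2 + alpha * y * delta_out
        w1_new = w1 + alpha * x * delta_hidden
        return w1_new, w2_new

    print(one_step(x=1.0, desired=1.0, w1=0.5, w2=0.5))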


Now, let's REPLACE the sigmoid function with a different function.
Now, if we change the sigmoid function to something else, then we replace
actualOutput(1 - actualOutput) with the derivative of the new sigmoid function.

As an example, consider a new function that is just a step function that compares 
the input to a threshold. It now will add a negative weight to the input. 
    Out = In - weight, and then output either 0 or 1 depending on whether the 
    result is positive or negative.

In this example, there are no sigmoids in a particular net. 
In this case, our deltaJ for the final node SIMPLIFIES to 
        deltaJ = desiredOut - actualOut
In this case, our deltaJ for any earlier node K SIMPLIFIES to
        deltaK = SUM(w(i) * delta(i)) for all I that node K sends outputs to.
Note, these are the immediate outputs only.
Note, this multiplies the weight at the current node with the delta of an output node.

Now, in this case we will compute new weights for *both* scaling nodes (out = w * input)
and threshold nodes (out = -(weight) + input)




Now, different types of nets can do different kinds of identification.
This really comes down to the number and configuration of non-trivial nodes, ie, the Sigmoids, not adders or multipliers.
So, the real "logic" in the neural net is the placement of the threshold/sigmoid functions.
More specifically, the logic is defined by the number of sigmoids that take an original input to the net, so the
first stage of the net. Other sigmoids in the net that take inputs from earlier sigmoids really
do the equivalent of logical AND or OR, but do not put new constraints on the original inputs.
So, looking at the sigmoids at the first stage of the net will tell us about the behavior
of the whole net.

For now, assume there are just 2 inputs, x and y, and we can plot these on a 2D graph.
We can then plot which values of x,y get recognized (ie, have a positive net output) by the net.
Each sigmoid in a net will usually add a single "line" to the bounding region of inputs that are recognized.
So it can identify a linear relationship between the inputs x and y and identify all input combinations above or below that line.
Adding more sigmoid nodes will add more lines that define the boundary of "identified" x,y values, 
so it can recognize a more complicated region of inputs.

Each sigmoid in this first stage will recognize values in a half-plane, bounded by a line
    If the sigmoid takes both x,y then the recognized half plane can be bounded by a diagonal
    If the sigmoid takes only x or only y then the recognized half plane can be bounded by a horizontal or vertical line
        Note, you can take both x and y and still bound a halfplane by a horizontal or vertical, for example
        by assigning a weight of 0 to one of the inputs.
The second stage will combine the half-planes, with either a UNION (logical OR) or the Intersection (logical AND)
The third stage will combine the regions of the second stage, with either a UNION (logical OR) or the Intersection (logical AND)
And so on.

- A net with no sigmoids is really just a big weighted sum. Its output is just x and y, scaled and added.

- A net with one sigmoid detects inputs on one side of a single line.
    If the net takes both x and y as inputs, then the line can be at any angle.
    If the net takes only x OR y as input, then the line is either vertical (parallel to Y axis) or horizontal (parallel to x axis).

- A net with 3 sigmoids detects inputs in a region bounded by up to 3 lines
Note, these bounding lines may be horizontal or vertical or diagonal.
You can look at each bounding line, and 
    If it is a horizontal/vertical line, then it is formed by a sigmoid that takes only one input (just x or just y)
    If it is a diagonal line, then it is formed by a sigmoid that takes both inputs (x and y)
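
As a rough illustration of this geometry, here is a minimal sketch (with made-up lines) where each first-stage unit is treated as an ideal threshold recognizing a half-plane, and a later stage ANDs the half-planes together:

    def half_plane(a, b, c):
        # returns 1 for points on the positive side of the line a*x + b*y = c
        return lambda x, y: 1 if a * x + b * y >= c else 0

    # three bounding lines: one diagonal (uses x and y), one vertical, one horizontal
    units = [half_plane(1, 1, 1),    # x + y >= 1   (diagonal)
             half_plane(1, 0, 0),    # x >= 0       (vertical boundary)
             half_plane(0, 1, 0)]    # y >= 0       (horizontal boundary)

    def recognized(x, y):
        # second stage: logical AND (intersection) of the three half-planes
        return all(u(x, y) for u in units)

    print(recognized(1.0, 1.0))   # True: inside the bounded region
    print(recognized(-1.0, 0.5))  # False: outside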




Lecture 12a - Neural Nets Part A (Fall 2015)
==================================================================
In 2010, this course treated neural nets as less interesting - the topic was almost removed from the course.
By 2012, however, Geoffrey Hinton's group published work on image recognition using neural networks that categorized
pictures. This was powerful, and made neural networks popular again.

This is inspired by biology. First a quick review of neuron.
There is cell body, axon, dendrites. 
Axons connect to the dendrites of other neurons at synapses.
If there is enough stimulation in the dendritic tree, then the axon depolarizes, which will then feed into another neuron.
Note, this is digital, so the axon either depolarizes or it does not, it all depends on whether the inputs to the dendrites
exceeds some threshold.

In neural nets in the literature, there are several components:
- Inputs like x1,x2,..xn which are either 0 or 1. This may be the output of a pre-synaptic neuron
- Multipliers, multiply their input by a weight w1, w2, ...wn
- Adders will add up several weighted inputs
- Threshold box - it is 1 iff the input is above some threshold. Each will have a threshold t1, t2, ...tn
 The output is z, which is 0 or 1

 This model has some important properties we see in real neurons
 - all-or-none discrete behavior
 - Cumulative influence of inputs
 - Synaptic weight (each input is weighted differently)

It does not have some possibly important properties of neurons
- Refractory period
- Timing effects - in real neurons, the timing and sequence of arrival of inputs to the dendrites matters
- Axonal bifurcation - depolarization may go down either branch
So real neurons are, to a large degree, a time-based system.

A neural net is a collection of weights and thresholds. It takes inputs x1, x2, ...xn and
outputs z1, z2, .... zn

So, Z is a vector which is a function of input vector, weight vector, and threshold vector.
Training a neural net is just modifying the weights and thresholds so we get the desired output for known inputs.
So, a neural network is a function approximator - it can train to emulate any function.

We use some sample data with known outputs.
We compare the current function of the net as the difference between actual and desired value.
Performance is P = f(DesiredVector, ActualVector)
A simple model is just subtracting: P = AbsoluteValue(desired - actual).
But, this makes the math hard, so we pick another function that will make the math simpler:
    P = -1/2 * (desired - actual)**2
It's always negative (or zero), and the best value is 0.

So, we adjust weights and threshold to maximize performance, so make P the closest to 0.
You can think of this as searching in the space of all possible weights and thresholds. We can do
a simple hill-climbing. But, that is computationally intractable - a recent successful neural net had 60 million parameters.

A better way is to start taking partial derivatives.

Consider a simple net with 2 weights, w1 and w2
dPerformance/dw1
dPerformance/dw2

Change in weight vector = (dPerformance/dw1 * i) + (dPerformance/dw2 * j)
This moves us in the direction of locally steepest ascent of the performance.
Really, we will move at a rate constant, Rate, which encapsulates the size of the step in our search.
Change in weight vector = Rate * ((dPerformance/dw1 * i) + (dPerformance/dw2 * j))

Now, this approach requires that the functions be continuous, not some digital/discontinuous threshold. 
So, we use a steep sigmoid for the threshold, which is still continuous but close to a step function.

    First, we will add in an additional input, which is always -1, and is immediately multiplied by a weight w0.
    This is added to the summation. Now, the threshold is simply centered around 0. 

    Second we replace the discontinuous step function with a sigmoid.
    We can use many step functions, but for mathematical simplicity, we use F(x) = 1 / (1 + e**(-x))

Let's work with a simple net, just 2 neurons

x -- (Multiplier * w1, output P1)---> (Sigmoid, output y) --> (Multiplier * w2, output P2)---> (Sigmoid, output z)

So, Z is the final output.
Performance P = -1/2 * (D - z)**2
where z is the final output and D is the desired output.

So, now let's compute the partial derivatives.


dP/dw2
-------
We need to use the chain rule.
dP/dw2 = dP/dz * dz/dW2
Really, we can also use the chain rule to simplify dz/dW2 and replace it with dz/dP2 * dP2/dw2
So:
dP/dw2 = dP/dz * dz/dP2 * dP2/dw2
Note, this is simply backtracking through the neural net. Each partial derivative is dOutput/dInput for some stage of the neural
net. Multiplying these together lets us relate the performance back to the original weights.

dP/dw1
-------
We can also use the chain rule here
dP/dw1 = dP/dz * dz/dP2 * dP2/dy * dy/dP1 * dP1/dw1

Now, this is just the product of partial derivatives. We can re-order them to actually look
at the neural network from back to front.
dP/dw2 = dP2/dw2 * dz/dP2 * dP/dz
dP/dw1 = dP1/dw1 * dy/dP1 * dP2/dy * dz/dP2 * dP/dz

Some of these are easy to do:
    Performance P = -1/2 * (D - z)**2,
    so
    dP/dz = -(D - z) * (-1) = D - z

    P2 = Y*w2
    so
    dP2/dW2 = y
    Similarly, dP1/dW1 = x

    dz/dP2 is a bit tricky. This is the sigmoid.
        In general b = 1 / (1 + e**(-a))
        db/da = d(1 / (1 + e**(-a)))/da = d((1 + e**(-a))**-1)/da = (-1 * (1 + e**(-a))**-2) * de**(-a)/da
        but de**(-a)/da = -1 * e**(-a)
        so db/da = (-1 * (1 + e**(-a))**-2) * -1 * (e**(-a)) =  (1 + e**(-a))**-2) * e**(-a)
        This is really e**(-a) / (1 + e**(-a))**2)
        We can rewrite this as:
        e**(-a) / (1 + e**(-a))   *   (1/(1 + e**(-a)))
        We can rewrite this as:
        (1 + e**(-a) - 1 ) / (1 + e**(-a))   *   (1/(1 + e**(-a)))
        We can rewrite this as:
        [(1 + e**(-a))/(1 + e**(-a))  -  1/(1 + e**(-a))]   *   (1/(1 + e**(-a)))
        We can rewrite this as:
        (1 - 1/(1 + e**(-a)))   *   (1/(1 + e**(-a)))
        But remember that 1/(1 + e**(-a) is b
        We can rewrite this as:
        (1 - b)   *   b
        So, the derivative of the sigmoid output with respect to the input is expressed purely in terms of the output.
    So, dz/dP2 has output z, so this is z(1 - z)
    Similarly, dy/dP1 has output y, so this is y(1 - y)

Substituting these, we get
dP/dw2 = y * z(1 - z) * (D - z)
dP/dw1 = x * y(1 - y) * dP2/dy * z(1 - z) * (D - z)

And dP2/dy = w2 (since P2 = y * w2), so dP/dw1 = x * y(1 - y) * w2 * z(1 - z) * (D - z)

A small rate constant will lead to slow training.
A large rate constant may pass over the best answer
Really, you want to change the rate constant with performance
    If performance is decreasing, then the rate constant is too big
    If performance is increasing, then you are headed in the right direction, so try increasing the rate constant.

Now, suppose we have a slightly more complex neural net, 4 neurons.
It is just 2 paths of the 2-neuron example we did above.

x1 -- (Multiplier * w1, output P1)---> (Sigmoid, output y1) --> (Multiplier * w2, output P2)---> (Sigmoid, output z1)
x2 -- (Multiplier * w3, output P3)---> (Sigmoid, output y2) --> (Multiplier * w4, output P4)---> (Sigmoid, output z2)

Now, the performance function has 2 actual outputs and 2 desired outputs.
The 2 streams may also interact (an intermediate output of one path can feed steps in the other path).
This can be tricky, because there are many different paths through the network.

Now, a lot of these partial derivatives are shared between paths.
A lot of the work to compute one path will be reused for other paths.
The computation needed is linear with the depth of the network.
The computation needed grows only polynomially (roughly with the square of the width) - not exponentially,
even though the number of paths through the net grows much faster.


By the way, a fast Fourier transform also reuses partial results.




Lecture 12b - Deep Neural Nets Part B (Fall 2015)
==================================================================
This discusses applying neural nets to larger problems, so many many more parameters to the net and
many many more layers in the net.

Really, "deep nets" is a bit of a misnomer. They are usually very wide, and often not more than about 10 layers deep.

In a net with several horizontal paths, there may be cross-connections between paths.
However, if all paths lead to some intermediate node, then the performance will vary with the output of that intermediate
node, no matter how many inputs in turn affect the intermediate node.
These "bottlenecks" in the net help reduce the complexity somewhat, because once you express how
performance varies with the intermediate node, you don't have to explicitly show performance of
every path back to the beginning. Instead, then you separately show how the performance of the
intermediate node is affected by its inputs. In a sense, divide the neural net into vertical slices,
and solve each slice independently, using just the slice that immediately precedes it. Or, solve each node
just in terms of its immediate inputs, not every path back to the original inputs for the entire net.

The dot product is x1w1 + x2w2 + x3w3 + .... (i.e., SUM(xi*wi))
This is multiplying a 1xn vector by an nx1 vector. It is what a neural net does when n inputs are separately weighted and
then passed through a summation node.
If neural nets model what happens in our brain, then the dot-product seems to be something intrinsic
to biological computing.

Image recognition with neural nets use the idea of Convolution, which is NOT really the same as the convolution integral in Fourier analysis.
    Convolution in Fourier analysis has a concept of memory when it analyzes a signal.
This starts with a low res image, say 256 x 256
Then, go over the image, looking at only a 10 x 10 pixel square at a time - move the square to eventually scan the entire image.
    A neuron will produce an output for each 10 x 10 square
Then shift the square to the right by 1 pixel, so there is overlap between 2 consecutive squares.

The convolution function will eventually produce a 2-dimensional grid of "points" each is a value of applying a function to one 10 x 10 square.
I am not sure how the coordinates map. Maybe the top-left x,y of each 10x10 pixel square is the point in the new matrix?

Next we do "Max Pooling"
In the result 2D grid, we break it up into squares of points. I am not sure of the square size.
    For each square, find the maximum value in the square.
    Then shift the square to the right by 1 point and repeat the process, so there is overlap between 2 consecutive squares.

The maximum-selection will eventually produce a 2-dimensional grid of "points" where each point is the max value of one square of points.
I am not sure how the coordinates map. Maybe the top-left x,y of each square in the old matrix is the point in the new matrix?

This is called "Pooling", and basically we end up with a 2D matrix of points, each representing the strongest response
of a corresponding region in the original image.
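
A rough sketch of this convolve-then-pool idea, using numpy. The image size, window sizes, strides, and the kernel values are all assumptions for illustration, not the parameters of any particular net:

    import numpy as np

    rng = np.random.default_rng(0)
    image = rng.random((256, 256))          # a fake low-res grayscale image
    kernel = rng.random((10, 10))           # one learned 10x10 "neuron" of weights

    def convolve(img, k, stride=1):
        kh, kw = k.shape
        out_h = (img.shape[0] - kh) // stride + 1
        out_w = (img.shape[1] - kw) // stride + 1
        out = np.empty((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                patch = img[i*stride:i*stride+kh, j*stride:j*stride+kw]
                out[i, j] = np.sum(patch * k)        # one output per 10x10 window
        return out

    def max_pool(grid, size=2, stride=2):
        out_h = (grid.shape[0] - size) // stride + 1
        out_w = (grid.shape[1] - size) // stride + 1
        out = np.empty((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                out[i, j] = grid[i*stride:i*stride+size, j*stride:j*stride+size].max()
        return out

    feature_map = convolve(image, kernel)   # grid of responses, one per window
    pooled = max_pool(feature_map)          # smaller grid of local maxima
    print(feature_map.shape, pooled.shape)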

One step of convolution and pooling is called a kernel.
You may do up to 100 kernels on one image. I am not sure how they differ. Perhaps each is a different neuron, that tries to match
the image to a different image in the library?

Old fashioned neural nets would feed each pixel as an input to a neural net.
The new approach requires a lot more computation.
It will have a library of N different "types" of things, and M examples of each type.


There are several specific ideas layered on top of this basic architecture.

Auto-Coding
----------------
A vector of inputs (x1, x2, ..., xn) go into an initial layer of neurons.
A second layer of neurons, called a hidden layer, has fewer neurons (like 0.1 as many) and takes output from the first layer
A third layer of neurons then takes its input from the output of the second layer
This generates an output vector (z1, ..., zn). We can now compare this to the desired values (d1, ....,dn)

Initially, we can let the desired values be the original input values (x1, ...., xn)
Why is this useful? Because the second layer, the "hidden layer", is smaller, so it is a more concise (compressed) representation
of the input. This is interpreted as a "generalization" of the input.
Once we train the neural net, we have a middle layer that is some kind of encoding of a generalization of the properties we are looking for.
This is a non-obvious encoding - we do not know how the data is encoded, and it does not seem to correlate to obvious properties
in the input data. But, practically, it seems to work.

You can do this layer-by-layer, and use the first layer to train the second layer, and so on.
Another name for this approach is "Boltzmann machines" and so on.

The hidden layer often has 1 neuron for each class.

Consider a much simpler example: recognize animals by how tall they are.
There are 3 animals - a cheetah, a zebra, and a giraffe - and their shadow (height) is used as the input to a neural net.


Sigmoid Modelling
--------------------
We can also modify the weights and threshold to change the shape of the sigmoid to optimally
fit the positive and negative examples. We want both positive and negatives to land on the flat parts of the sigmoid.
This tends to make a steeper curve that changes from 0 to 1 after the last negative example 
and before the first positive example.


Soft Max
--------------------
We have a neuron that identifies each class. Really, it generates a probability that the object we
are examining is in the class.
So, for a picture, each neuron will assign a probability that the object is in one class.

Really, each neuron implements a function F1(x) which returns a probability that x is in class 1, and
F2(x) matches it to class 2, and so on.
Probability of class J is Fj(x) / SUM(Fi(x)) for all i.
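
A minimal sketch of the soft-max normalization: each class neuron produces a score, and the probability of class j is that score divided by the sum. Exponentiating the raw scores to keep them positive is an assumption here (the standard soft-max form); the notes above only say each class gets a score Fj(x):

    import math

    def softmax(scores):
        exps = [math.exp(s) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]

    print(softmax([2.0, 1.0, 0.1]))   # roughly [0.66, 0.24, 0.10], sums to 1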

Now, we can combine Soft-Max and Auto-Coding.
Start with a net with an input layer, a hidden internal layer and a large output layer, and train it.
Now, remove the output layer, and replace it with a new output layer. The new output layer has the same number of
neurons as the hidden internal layer, and each new output neuron takes input from every neuron in the hidden internal layer.
Now, freeze the weights to the hidden internal layer and re-train the new output layer. 
The new output layer is trained quickly and becomes very accurate.

Dropout
-----------------------
A common problem in neural nets is getting stuck on a local maximum.
A solution is that at every iteration, a neuron has a probability of being removed from the next step.
It's not clear whether it will return for future iterations or not. It probably does.

This seems to prevent getting stuck on a local maximum state.


There is redundancy here.
Once a neural net is trained, you can remove neurons and very quickly retrain it.

But, a wider net can avoid getting stuck on local maxima. 
So, you can train a net with 10 neurons per layer, then after it has trained, remove all but 2 neurons per layer and quickly retrain it.
But, you cannot start with a network of only 2 neurons per layer and train it because it will get stuck on local maxima.





Lecture 13 - Learning: Genetic Algorithms
======================================================
This is another effort at mimicking biology - specifically mimicking evolution.

A single cell has one copy of each chromosome from each parent.
Review of mitosis and meiosis, forming gametes.
There are several steps for randomization:
    - Whether paternal or maternal chromosome goes into each gamete
    - Crossing over

We will represent a chromosome as a simple string of values.
It's binary values for now, but they could be any alphabet.

There are several randomization steps
    - These can mutate 0->1 or 1->0
    - Crossing over - splice the prefix and suffix of adjacent chromosomes
    - Genotype->Phenotype expression, which depends on environment

Each individual may have different fitness.
We can measure the fitness with a simple numerical score

Fitness then determines a probability of survival
This will implement selection, which determines which individuals provide genetic material for the next generation.

This is all essentially hill-climbing. 
    Randomization just looks at all local directions and fitness picks the steepest slope.


We will select one survivor for each generation.
Probability[i] is probability that individual i survives to the next generation
Fitness[i] is fitness of individual i

Here is a first guess:
Probability[i] = Fitness[i]

The sum of all probabilities should add up to 1.0 because at each generation 
we will select one survivor.
One way to ensure all probabilities add up to 1 is to scale them by the sum of all fitnesses.

Probability[i] = Fitness[i] / SUM-for-all-i(Fitness[i])
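
A minimal sketch of this fitness-proportional selection, with invented fitness values:

    import random

    fitness = [3.0, 1.0, 6.0]                       # fitness of individuals 0, 1, 2
    total = sum(fitness)
    probability = [f / total for f in fitness]      # [0.3, 0.1, 0.6], sums to 1.0
    survivor = random.choices(range(len(fitness)), weights=probability, k=1)[0]
    print(probability, "picked individual", survivor)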

Here is an example:
------------------------
This is a genetic algorithm that looks for an optimal value in a 2D space

Fitness(x,y) = (sin(x*omega))**2 * (sin(y*omega))**2 * e**((x+y)/sigma)
where omega and sigma are arbitrary constants.
Really this function is chosen to make nice looking iso-utility lines that look
like a topographic map.

Each chromosome consists of 2 numbers, x and y.
The values may be fractions.

A mutation function may randomly add or multiply an x or y by some value
A crossover operation may randomly swap the x values between 2 adjacent chromosomes, or the y values between
2 adjacent chromosomes.

Without crossover, this is essentially a hill-climbing search.
It can get stuck at local maxima or minima

We care less of the absolute fitness value, and more about the ordering of fitness
So we order the candidates by fitness 
    the highest fitness has a fixed probability Pc of survival
    the second highest fitness has a probability (1 - Pc)*Pc of survival
        Where 1-Pc is the chance that the highest did not survive, 
    the third highest fitness has a probability ((1 - Pc)(1 - Pc))*Pc of survival
    In general, P(n-1) = (1 - Pc)**(n-2) * Pc
    The probability of the last candidate is (1 - Pc)**(n - 1)
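
A minimal sketch of these rank-ordered survival probabilities (the Pc and n values are invented for illustration):

    def rank_probabilities(n, pc=0.5):
        probs = [(1 - pc) ** i * pc for i in range(n - 1)]
        probs.append((1 - pc) ** (n - 1))     # last candidate gets the leftover mass
        return probs

    print(rank_probabilities(4))               # [0.5, 0.25, 0.125, 0.125], sums to 1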


The "step size" makes a big difference on whether these algorithms will get stuck on a local maxima.
A larger step size helps it jump over local maxima.
The "step size" is the amount of randomization in each step, and not the number of iterations the software uses for each step.

Another improvement is to select the survivors based not only on fitness but also diversity.
So, you want both a fit and diverse population.

Compute diversity as a numerical distance between each new candidate and all candidates that have
previously been selected.
So, you keep a history of all previously selected individuals to compute this diversity.

We have a 2D space, x axis is fitness, y axis is diversity
We want the max of both fitness and diversity, the top right corner of the space that maximizes both x and y
We can then draw diagonals from that top-right corner, that are iso-survival lines.

On one hand, you get to a maximal value, but once you reach it you don't narrow down because the diversity component 
forces you to remain spread out.
Ideally, at some point you want to reduce the importance of diversity.

There are a few cases where this has been useful.
- Planning. 
    Cross-over allows you to mix the front half of one plan with the back half of another plan
    Or you can use a forward-chain expert system and then randomize the rules slightly

Example
Making creatures from blocks
Each chromosome has bits for 
    Which blocks are included
    Where blocks connect


Now, this all uses a very simplistic idea of evolution.
For example, we really don't understand how genotype maps to phenotype, or things like speciation.
    
This is all essentially hill-climbing. 
    Randomization just looks at all local directions and fitness picks the steepest slope.

So, a lot of the value of these programs is in the selection of the fitness model and how to
represent the problem. However, genetic algorithms are not too magic - it's basically hill-climbing
in a space of possible solutions.




Lecture 14 - Learning: Sparse Spaces, Phonology
==================================================================
Winston does not feel Neural Nets are very promising - they can work but usually not very well.
This is his opinion, but this is from a guy with a lot of knowledge and experience. Neural Nets are
popular in the field, and Winston said his lab is doing some work on them, but they may not
live up to the promise. He feels simpler approaches, like Nearest Neighbor, are still valid for
learning problems when no other solution seems to fit.

Making a noun plural can use simple rules, like adding an "s" to the end. But while this may
be simple from a spelling perspective, the differences in sound are more complex. The "s" is pronounced
differently for dogs (a z sound) and cats (an s sound).

Phonology theory from Morris Halle
    A speaker translates a word to phonemes to create sounds.
    A listener hears these sounds, converts them to electrical signals then translates the signal to a vector of booleans for
        "distinctive features"
    It seems 14 different distinctive features can describe all phonemes. So, there are potentially about 16,000 (2**14) feature combinations.
        Some of these are: syllabic, voiced, continuant, strident
    But most languages have fewer than 100 phonemes (English has approx 40)

Oddly, a listener will often "infer" acoustic sounds.
The McGurk effect, we watch a speaker's lips and infer the sounds. 
But, there are other examples, including the "Phoneme Restoration Effect"

There are 5 muscles that form words
    Tongue
    Two Lips 
    ?? Palate? 

He is describing a machine (Yip and Sussman)
    Vision system
    List of words
    List of features
    Set of Constraints (plural constraints, etc)
    Buffer of phonemes

Step 1
    The vision input system will send inputs to the List of features, 
    The vision input will also to the list of words to select a word

Step 2
    A selected word will select features from the list of features
    A selected word will push pnonemes onto the phoneme buffer

Step 3
    The list of features will activate constraints
    Constraints examine the phonemes in the buffer and can also push phonemes onto the phoneme buffer

For example, if the buffer has a list of phonemes for a noun, and this is plural, then the constraint can push the appropriate plural phoneme (a z or an s sound) onto the buffer.

This system can run either way. To generate speech from a word, and also to map heard sounds to words.
The constraints can be learned, from positive and negative examples.

In Yip and Sussman, they tried to understand which phoneme is used for the plural "s", either a ZZZZZZ or a SSSSSSS.
They found examples of both types of words, then mapped each to its distinctive features, and decided which features seem to matter
and which do not.

The space of phonemes is sparse. This actually makes it easier to distinguish between different phonemes, because
any 2 phonemes differ in several ways.

In the Scientific Method - start with a problem and end with an experiment.

Most AI is done backwards - trying to fit a mechanism to a problem. It's "mechanism envy" or
"falling in love with the mechanism". For example, it tries to apply neural nets to every problem.

Sussman-Yip started with a problem, and then built an experiment to decide how the mechanism works.
They followed the scientific method
    They designed a good representation of the problem.
    From this representation, constraints emerge
    From the constraints, they can design an experiment

It all starts with a good representation of a problem.
1. It makes the right things explicit
2. It exposes constraints
3. Localness. Are related things in the world also related in the model
 




Lecture 15 - Learning: Near Misses, Felicity Conditions
==================================================================
As an aside, this is the type of learning algorithms that we learned in 1984 in 6.034, before neural nets and other types
of learning were added to the course (though those other techniques may have been understood by AI people at the time, they
were not in this particular introductory course in 1984 when I took it).

Example: Recognizing a simple arch. This is 2 vertical blocks and a horizontal block across their tops.

You start with an initial positive example and an initial negative example.
Both examples are the 3 blocks.

Start by representing this as a simple symbolic form.
A graph, each node is a block and each edge is a relationship between 2 blocks.
Simple edge is "supported by". So, for the positive example, the top cross piece has
"supported by" relationships to the 2 posts.

An abstract representation is important, because it removes unimportant details, like the color of the arches.

Basically, we compare the positive and negative models and find relationships in one and not the other.
    Relationships that are in positive and not in negative are required
    Relationships that are in negative and not in positive are vetos

The model evolves with pairs of positives and near-misses.
Negatives must be "near-misses" to be most useful. They expose one relationship that is missing or violated.

    If two positive examples have different values for a relationship, then either that relationship is Not-Necessary, or else it becomes an OR-disjunction of the 2 values.
    The model can say "Does not matter" if we have seen positive examples with all possible values.
    Otherwise, it is an OR-disjunction.

Now, there is also an Is-A property of each node.
In the arch example, the Is-A can be types of blocks: brick, wedge, dowel, etc.
We can combine positive examples with an OR operation.
Alternatively, the values may be arranged in a hierarchy. Brick and Wedge are both blocks. We can combine
2 positive examples by climbing the hierarchy to their common class (generalizing "brick" and "wedge" to "block").

Part of the algorithm depends on representation: describing the right features and making those explicit.
Another key part is the order that the positive and negative examples are presented.

Start with a "seed" which is the first positive example.
For each positive example, apply the heuristics to broaden the model
For each negative example, apply the heuristics to narrow the model

We can represent the space of possible models as a tree of nodes, each node is a possible model.
Edges connect a parent to a child node if the child is a refinement of the parent after seeing a positive or negative example.
We can search this space, looking for an optimal model, which is the goal node.
The space can be large, so do a Beam search to constrain the space.

The heuristics have been identified.
1. Require link heuristic. Must be present in positive cases
2. Forbid Link heuristic. Must not be present in a positive case
3. Extend set. Broaden a list of allowed values
4. Drop-Link. A link may have any value in positive cases
5. Climb-tree heuristic. A link may have any value in a more abstract class.

Note, this both generalizes and specializes the model. A required or forbidden link is a specialization
step. An extend-set, drop-link, or climb-tree is a generalization step.

Specialization uses near misses
Generalization uses positive examples
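
A minimal sketch of this comparison on symbolic relations; the relation tuples and the two near misses are invented for illustration:

    positive = {("top", "supported-by", "left-post"),
                ("top", "supported-by", "right-post")}

    # near miss 1: the top is not supported by the right post
    near_miss_1 = {("top", "supported-by", "left-post")}
    # near miss 2: same as the positive, but the two posts touch
    near_miss_2 = positive | {("left-post", "touches", "right-post")}

    required  = positive - near_miss_1     # require-link: the missing relation becomes required
    forbidden = near_miss_2 - positive     # forbid-link: the extra relation becomes a veto

    print("required: ", required)          # supported-by right-post
    print("forbidden:", forbidden)         # left-post touches right-post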

In general, there is a teacher (source of examples in a specific order) and a student.
Felicity Conditions (felicity: the state of being happy)
    The student knowledge is a graph, (nodes are states of knowledge, links are learning steps)
    The teacher must understand the initial state of the student.
    The teacher fixes mistakes in past lessons by refreshing
    The teacher fixes mistakes in future lessons by saying "wait"
    The student must trust the teacher
    
If a student has limited memory, then give series of specific examples
If a student has large memory, then give batches of many examples

As a general rule:
    Better students "talk to themselves". They explain the concepts to themselves.
        They say more problem-domain-specific statements
        They also say more "monitoring" statements, like "this is hard". But, these may also be metaevaluation, so
            monitoring and debugging the problem solving process

    Anecdotally, making you talk to yourself during the problem-solving process does seem to
    help you solve problems better.

Well-packaged ideas have 5 common qualities
1. Symbol that is associated with the idea. A visual "handle". Learning uses the "arch"
2. Slogan that is associated with the idea. A verbal handle. But, this also forces you to succinctly describe the essence of the idea.
    It's an elevator pitch. Learning uses the names "arch learning or one-shot learning"
3. Surprise. An idea does something very different. Arch Learning improves the model with each training example, not thousands.
4. Salient. Something that "sticks out". A significant idea that becomes the focal point. This is similar to the slogan.
5. Story. Present the idea as a story.
Your ideas are like children, you want them to have the best life possible. So, you package them for the most success.




Lecture 16 - Learning: Support Vector Machines
==================================================================

All of the learning systems discussed so far are pattern recognition systems.
In the abstract, this is identifying positive examples in a space of possible cases.
As a simple example, consider a 2-D space, and there are positive and negative examples at different points in this space.
All of the learning systems (nearest-neighbors, neural nets, etc) try to create a function that maps (x,y) into +/- in
a correct way. In reality, it is trying to identify which parts of the space are positive examples.

To me, the interesting thing about nearest neighbors is that it is a control system, not just a pattern recognition system.
It predicts actions to correctly respond to a state, not just recognize whether the state is an example of a specific class.

In these pattern recognition systems, the goal is to draw a line that separates the positive from negative examples.
If you have the function for this line, then you can do pattern recognition by deciding if any state (x,y) is
above or below this line. Or, in a more complex (realistic) scenario, this is a threshold which is the minimum
requirements to be in the class and recognize patterns by deciding whether any instance (x,y,z,a,b,...) is above
or below the threshold.

This function is called the "decision boundary", it is the threshold.
Now, this threshold function ideally is evenly between the positive and negative examples, not just skirting one of them.
This leaves "room" for error and other new positive or negative examples. 

|                   +           
|                                         
|    -          +               
|                                       
|                    +          
|       -                                     
|           -                        
---------------------------
Imagine a diagonal line separating the groups of + and - examples.
Suppose you have a vector, w, that starts from the origin that is perpendicular to the center dividing line between positive and negative examples.
Now, take any vector x from origin to a point (x1,y1). Project it onto the Perpendicular vector and decide whether it is 
on the + or - side.

Really, to recognize any example x, compute "w dot x >= c" where c is some constant.
If "w dot x" is >= c then it's a positive example.
Or, if we let b = -c, then the decision rule is just "(w dot x) + b >= 0" for positive and "(w dot x) + b < 0" for negative.

So, what is w? x is just a vector to some (x,y). 
Create a constraint, "(w dot x+) + b >= 1", where x+ is any positive example.
Create another constraint, "(w dot x-) + b <= -1", where x- is any negative example.
So, for any sample xi:
    "(w dot xi) + b >= 1", when xi is a positive example.
    "(w dot xi) + b <= -1", when xi is a negative example.

For convenience (a mathematical convenience), define a new variable yi such that yi = +1 for positive examples
and -1 for negative samples. yi has a different value (+1 or -1) for each known positive and negative sample.

Now, multiply both sides of each constraint by yi.
For positive examples (yi = +1): "yi * ((w dot xi) + b) >= 1", unchanged.
For negative examples (yi = -1): multiplying "(w dot xi) + b <= -1" by -1 flips the inequality,
            giving "yi * ((w dot xi) + b) >= 1" (the same form as the positive case).

So, for every sample, yi((w dot xi) + b) >= 1
or yi((w dot xi) + b) - 1 >= 0

Now, for samples that lie exactly on the edges of the boundary between positive and negative examples, this expression is exactly 0.
These are the positive examples closest to the - region and the negative examples closest to the + region
In the lecture, Winston calls these values "gutters" because they are the edges of the "street"
where the "street" is the strip of the sample space that is empty and separates the region of positive samples 
from the region of negative samples.

    yi((w dot xi) + b) - 1 = 0

Additionally, we want a wide as possible a separation between the region of positive samples 
from the region of negative samples. In Winston's words, we want to maximize "the width of the street".
Really, this means maximizing the distance, measured along the direction of w, between the gutter on the negative side
and the gutter on the positive side.

Let x+ be a positive example that lies on the positive gutter, and x- a negative example that lies on the negative gutter.
Each of these is a vector (note: here x+ and x- name whole vectors, not x-coordinates).

To start, take the difference between vectors x+ and x-. This is at some unknown angle.
But, then do the dot product of this and vector w, which is orthogonal to the "gutter"
or the side of the area dividing between negative and positive examples.
So:
    width = (x+ - x-) dot (w / ||w||)

Where w is the orthogonal vector
    ||w|| is the length of w
    (w / ||w||) is a unit length normal vector

This is a scalar value; it's the width of the street.

Now, combine this with the previous constraint, that for all points that lie on the gutter (the edge of the dividing area)
    yi((w dot xi) + b) - 1 = 0
If this is a positive example, then yi = 1
    (w dot xi) + b - 1 = 0
 or (w dot xi) = 1 - b
If it is a negative example (yi = -1), the same equation gives (w dot xi) = -1 - b

Now, we can substitute this into the width expression for x+ dot w
First, expand the original expresion
    width = [(x+ dot w) - (x- dot w)]  / ||w||
Now, substitute for the x dot w
    width = [(1 - b) - (-1 - b)]  / ||w||
    width = 2  / ||w||

Now, we want to maximize the width, to maximize the distance between the negative and positive example boundaries.
So, maximize 2 / ||w|| which means minimize ||w||
For mathematical convenience, this in turn means minimize (1/2 * ||w||**2)

We will use Lagrange multipliers. This lets us maximize/minimize an expression that is also subject to some constraints.
It allows us to find the min/max of the expression without thinking about the constraints separately.
Define a new function L = function - SUM(alphai * constraints)
where alphai is some weight constant
    L = (1/2 * ||w||**2) - SUM(alphai * (yi((w dot xi) + b) - 1))

Now, find the derivative and set it to 0.
(Winston is skimming over some of the algebraic details here)
The derivative is:
    dL / dw = w - d/dw [ SUM(alphai * (yi((w dot xi) + b) - 1)) ]
BTW, differentiating with respect to a vector is almost like differentiating with respect to a scalar.
Also, the derivative of 1/2 * ||w||**2 with respect to w is just w (because it's a sum of squares).
Also, derivative of d(alphai*yi*w dot xi)/dw is just alphai*yi*xi.
It seems that a derivative of a dot product is similar to a differential of a multiplicative product.

So....
    dL / dw = w - SUM(alphai*yi*xi)
Since we set the derivative to 0
    0 = w - SUM(alphai*yi*xi)
or:
    w = SUM(alphai*yi*xi)

Intuitively, w is a linear sum of the positive and negative samples.
alphai may be 0 for some i, so this is not necessarily all samples.
Also, remember that yi is +1 or -1

Also, take the derivative with respect to b
    dL / db = -SUM(alphai * yi)
Since we set the derivative to 0
    0 = SUM(alphai * yi)

OK, this is a bit tricky. Next, we have an expression for w, which is w = SUM(alphai*yi*xi)
Let's plug that back into the Lagrange function.
    L = (1/2 * ||w||**2) - SUM[alphai * (yi((w dot xi) + b) - 1)]
Or do some algebra, distributing the sum and the minus sign:
    L = (1/2 * (w dot w)) - SUM[alphai*yi*(w dot xi)] - b*SUM[alphai*yi] + SUM[alphai]
Here, Winston switches from ||w||**2 to (w dot w); these are the same thing, by the definition of the norm.
He also interchanges ordinary products and dot products when expanding; this works because the
dot product distributes over sums of vectors.

So, after substituting w = SUM(alphaj*yj*xj):
    L = 1/2 * (SUM(alphai*yi*xi) dot SUM(alphaj*yj*xj))
        - SUM[alphai*yi*(xi dot SUM(alphaj*yj*xj))]
        - b*SUM(alphai*yi) + SUM(alphai)
Note the second copy of the sum uses j subscripts. It runs over the same samples as the i-subscripted sum,
but we use a different subscript to tell the two sums apart when we start manipulating them.

But, recall that earlier we showed 0 = SUM(alphai * yi), so the term with b drops out:
    L = 1/2 * (SUM(alphai*yi*xi) dot SUM(alphaj*yj*xj))
        - SUM[alphai*yi*(xi dot SUM(alphaj*yj*xj))] + SUM(alphai)
The first two terms are the same double sum, once with a factor of 1/2 and once with a factor of 1, so they combine:
    L = SUM(alphai) - 1/2 * SUM-over-i( SUM-over-j( alphai*alphaj*yi*yj*(xi dot xj) ))

Why do we care? It tells us that the optimization only depends on the dot product of pairs of samples.

Let's change focus. Now, plug the expression for w (which is w = SUM(alphai*yi*xi)) back into the original
decision rule (if (w dot u) + b is >= 0 then it's positive) where u is an unknown vector
So:
    if (SUM(alphai*yi*xi) dot u) + b is >= 0 then it's positive

Why do we care? It tells us that the decision only depends on the dot product of pairs of samples.
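
A minimal sketch of the decision rule in this dual form; the support vectors, alphas, and b below are invented, not a trained machine:

    def dot(a, b):
        return sum(p * q for p, q in zip(a, b))

    samples = [((2.0, 2.0), +1, 0.5),    # (x_i, y_i, alpha_i)
               ((0.0, 0.0), -1, 0.5)]
    b = -2.0

    def classify(u):
        # SUM(alpha_i * y_i * (x_i dot u)) + b >= 0  means positive
        score = sum(alpha * y * dot(x, u) for x, y, alpha in samples) + b
        return +1 if score >= 0 else -1

    print(classify((3.0, 3.0)))   # +1: on the positive side
    print(classify((0.5, 0.0)))   # -1: on the negative side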

Note that this will NOT work if the positive and negative values are NOT linearly separable.
It only works if there is a straight line between the two regions.

|              -   -           
|       -   -     -                       
|    -        +     -          
|        +         -                    
|   -        +   -
|       -       -                              
|           -                        
---------------------------

Now, if the points are not separable in one space, then transform the points into another space
where they may be easily separable.
The maximization only depends on the dot products.
Really, we do not need the transformation of individual points from start space to solution space.
Instead, we only need a function that converts the dot products in start space into dot products in solution space.
If the transformation is Phi, we only need the kernel function K(u, v) = Phi(u) dot Phi(v), which computes the dot product in the new space directly from the original vectors.
Some popular kernel functions are:
1. Polynomial:    K(u, v) = (u dot v + 1)**n
2. Radial-Basis:  K(u, v) = e**(-||u - v|| / sigma)
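
A minimal sketch of these two kernels as plain functions of two vectors; the exponent n and the sigma scale are assumptions, and the squared-distance (Gaussian) form of the radial-basis kernel is also common:

    import math

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def polynomial_kernel(u, v, n=2):
        return (dot(u, v) + 1) ** n

    def rbf_kernel(u, v, sigma=1.0):
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
        return math.exp(-dist / sigma)   # a squared distance in the exponent is also common

    print(polynomial_kernel((1, 2), (3, 4)))   # (11 + 1)**2 = 144
    print(rbf_kernel((1, 2), (1, 2)))          # 1.0 for identical vectors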


So, this transformation allows us to use the optimal solution we derived for simple, linearly separable spaces.
This still has the problem of over-fitting, but not the problem of local minima/maxima that neural nets have.





Lecture 17 - Learning: Boosting
==================================================================
This is used for binary classification
    Is that a person or a tree?
    Is that coffee or tea?

The math looks hard, but this is easy to implement.
It uses the intuition that a group of problem solvers can do better than any single problem solver.

Assume that there are classifier functions, which output either -1 or +1
    Let h be a classifier function. 
    It has an error rate between 0.0 and 1.0 which is the fraction of cases that are wrong (so 0.0 is perfect)

We can make a good classifier that has a low error rate by letting several weaker classifiers each run and
vote.
    Let H (capital H) be a better composite classifier.
    H(x) is a function of h1(x), h2(x), h3(x)
    Simply: H(x) = sign(h1(x) + h2(x) + h3(x))
    If 2 or more are +1, then sign() is +1. So if 2 of 3 vote +1 then the answer is +1

Ideally, in the parts of the solution space where one h is wrong, the other 2 will be right.
Of course, if 2 or more are wrong for some value, then the composite will also be wrong.

So, which classifiers do we use?
- Start with the h1 which has lowest error rate
- h2, which is the same learning procedure run on a modified set of inputs - where the samples that h1 gets wrong are magnified/distorted
    We weigh the samples that have an error more heavily.
- h3, which is trained on another modified set of inputs - where the samples on which h1(x) != h2(x) are magnified/distorted
    We weigh those disagreement samples more heavily.

We can organize this as a tree. Each node is a function and edges link functions to input functions
H is the composite function, it is root
H has 3 child nodes: h1, h2, h3

Now, we can recurse. 
    h1 in turn may be a composite of h1.1, h1.2, h1.3.
    h2 in turn may be a composite of h2.1, h2.2, h2.3.
    h3 in turn may be a composite of h3.1, h3.2, h3.3.

We can create other classifiers from Decision Trees.
    This is not the full tree, but we may just use a "tree stump"
    If a decision tree has n tests, we can make a subtest from any subset.
    Additionally, we can also take the output unchanged, or invert it, so this makes twice the number of possible stumps.
    So, if there are 3 tests, then we could take any single test, any pair of tests, and so on.

But, we could use ANY type of classifiers for boosting: neuralNets, NearestNeighbors, and more.
For the rest of this discussion, we will use decision tree stumps.

For a decision tree stump, the error rate for N samples is
    1/N * SUM(wrong cases)
Each sample has a weight associated with it: w1, w2, w3.
At time 1, all weights are equal, 1/N. 
So, really, the error rate is
    SUM(weight of wrong cases)
Weights are constrained
    SUM(weight of ALL cases) = 1.0
So, if some weights are heavier then others must be smaller


So, how do we combine simpler classifiers into a composite classifier
    H(x) = sign( (alpha1 * h1(x)) + (alpha2 * h2(x)) + (alpha3 * h3(x))   ..... )
Where
    alpha1, alpha2, alpha3 are weights we use to scale each classifier function
Assume there are N samples. At time 1, 
    w1 = w2 = w3 = .... = 1/N

We will iterate. At step t
    Pick the h classifier function that has the lowest error rate
    This lowest error rate is used to compute alpha for that function
    This lets us compute the weights for step t+1

Let the weight for sample i be wi
   wi at step t+1 = ( wi at step t / Z ) * e**(-(alpha at step t) * h(xi) at step t * y(xi))
      where Z is a normalizing constant. It makes all new weights add up to 1, which is a constraint we identified earlier.
      where y is a function of x with value +1 or -1.
      y(xi) is the correct answer. The product h(xi) * y(xi) is positive or negative depending on whether h is correct.
        When h(x) is correct, then h(x) * y(x) = 1 * 1 = -1 * -1 = +1
        When h(x) is INcorrect, then h(x) * y(x) = 1 * -1 = -1 * 1 = -1
This is mathematically convenient

We want to choose alpha to minimize the resulting error of the composite classifier.
Essentially, compute a derivative with respect to alpha and set it to 0.
Winston skips the math here, but the result is that the error is minimized when
     alpha at step t = 1/2 * ln((1 - error at t) / error at t)
So, the error rate is bounded by a decaying exponential function.
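For example (numbers mine, not from the lecture): a classifier with a weighted error rate of 0.2 gets alpha = 1/2 * ln(0.8 / 0.2) = 1/2 * ln(4), roughly 0.69, while a classifier with error rate 0.5 (no better than a coin flip) gets alpha = 0 and so contributes nothing to the vote.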


So, this may combine many classifiers.
This raises a question - are we really *combining* classifiers, or are we making a new classifier out of arbitrary input functions?
Really, the input functions could be any functions, not just classifiers.

Now, in general:
    Weight at t+1 = ((weight at t) / Z) * (correction factor)
Where the correction factor is e**(-alpha * h(x) * y(x)), which depends on whether we classified correctly or incorrectly in step t
      Z is a scaling constant
But we derived above that alpha at step t = 1/2 * ln((1 - error at t) / error at t)
We can plug that expression for alpha into the correction factor.
Note, because the exponent is -alpha or +alpha, the ratio inside the square root flips depending on whether we were correct.

    If we classified correctly in step t, then:
    Weight at t+1 = ((weight at t) / Z) * (sqrt[  error at t / (1 - error at t) ])
    (h(x) and y have the same sign, so the exponent is -alpha)

    If we classified INcorrectly in step t, then:
    Weight at t+1 = ((weight at t) / Z) * (sqrt[  (1 - error at t) / error at t   ])


We choose Z such that all the new weights add up to 1.
    1 = SUM over Correct classifications + SUM over INcorrect classifications
Or, expanding the weight-update expressions:
    1 = SUM over correct[ sqrt[  error at t / (1 - error at t) ] * (Weight at t) / Z ]
        + SUM over incorrect[ sqrt[  (1 - error at t) / error at t ] * (Weight at t) / Z ]
We can factor out the error terms
    1 = (sqrt[  error at t / (1 - error at t) ] * (SUM over correct[ Weight at t ] / Z))
        + (sqrt[  (1 - error at t) / error at t ] * (SUM over incorrect[ Weight at t ] / Z))
Or
    Z = (sqrt[  error at t / (1 - error at t) ] * SUM over correct[ Weight at t ])
        + (sqrt[  (1 - error at t) / error at t ] * SUM over incorrect[ Weight at t ])
Then, Winston notes that for Correct cases, SUM[ Weight at t ] = (1 - error at t), and
                      for INcorrect cases, SUM[ Weight at t ] = error at t (that is the definition of the weighted error rate)
So:
    Z = (sqrt[  error at t / (1 - error at t) ] * (1 - error at t))
        + (sqrt[  (1 - error at t) / error at t ] * (error at t))
Do some algebra and divide the denominators into the numerators
    Z = (sqrt[ error at t ] * sqrt[1 - error at t])
        + (sqrt[1 - error at t] * sqrt[error at t])
But the two added terms are equivalent (a*b = b*a) so
    Z =  2 * sqrt[ (1 - error at t) * error at t ]
    
Let's now substitute this expression for Z into the expression for the weights
    If we classified correctly in step t, then:
    Weight at t+1 = ((weight at t) / (2 * sqrt[ (1 - error at t) * error at t ])) * (sqrt[  error at t / (1 - error at t) ])
    Weight at t+1 = (weight at t) / (2 * (1 - error at t))

    If we classified INcorrectly in step t, then:
    Weight at t+1 = ((weight at t) / (2 * sqrt[ (1 - error at t) * error at t ])) * (sqrt[  (1 - error at t) / error at t   ])
    Weight at t+1 = (weight at t) / (2 * error at t)

Next, sum the new weights over all the correctly classified samples:
    SUM over correct[ Weight at t+1 ] = (1 / (2 * (1 - error at t))) * SUM over correct[ weight at t ]
    But SUM over correct[ weight at t ] = 1 - error at t
    So
    SUM over correct[ Weight at t+1 ] = (1 / (2 * (1 - error at t))) * (1 - error at t)
    SUM over correct[ Weight at t+1 ] = 1 / 2

    Similarly, for the INcorrectly classified samples:
    Weight at t+1 = (weight at t) / (2 * error at t), and SUM over incorrect[ weight at t ] = error at t
    So SUM over incorrect[ Weight at t+1 ] = 1 / 2

So, summed over all correct answers, SUM[ Weight at t+1 ] = 1/2
The sum of ALL weights is 1, so summed over all INcorrect answers, SUM[ Weight at t+1 ] = 1/2

OK, before we said that all weights add to 1. Now we see that the weights of the correct samples add to 1/2 and the weights of the incorrect samples add to 1/2.
In other words, the correctly and incorrectly classified samples contribute equally at the next step.

There are some simplifications.
If we have N samples, we do NOT have to evaluate tests that fall between 2 correct or between 2 incorrect guesses.
We only care about tests that separate a correct guess from an incorrect one.
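Putting the whole loop together, here is a minimal sketch of the procedure (my own illustration, not Winston's code), using 1-dimensional threshold stumps as the weak classifiers; all names and the tiny data set are made up:

    import math

    def stump_classifiers(xs):
        # Build simple threshold stumps over 1-D inputs, plus their inverted versions.
        stumps = []
        for t in sorted(set(xs)):
            stumps.append(lambda x, t=t: 1 if x > t else -1)    # the test itself
            stumps.append(lambda x, t=t: -1 if x > t else 1)    # the inverted test
        return stumps

    def boost(xs, ys, rounds=5):
        # Boosting loop: returns a list of (alpha, stump) pairs.
        n = len(xs)
        w = [1.0 / n] * n                         # at time 1, all weights are 1/N
        ensemble = []
        for _ in range(rounds):
            # Pick the stump with the lowest weighted error rate.
            best, best_err = None, None
            for h in stump_classifiers(xs):
                err = sum(wi for xi, yi, wi in zip(xs, ys, w) if h(xi) != yi)
                if best is None or err < best_err:
                    best, best_err = h, err
            if best_err == 0 or best_err >= 0.5:
                break                             # perfect, or no better than chance
            alpha = 0.5 * math.log((1 - best_err) / best_err)
            ensemble.append((alpha, best))
            # Re-weight: wrong samples end up sharing 1/2, right samples the other 1/2.
            w = [wi / (2 * best_err) if best(xi) != yi else wi / (2 * (1 - best_err))
                 for xi, yi, wi in zip(xs, ys, w)]
        return ensemble

    def H(ensemble, x):
        # Composite classifier: sign of the alpha-weighted vote.
        total = sum(alpha * h(x) for alpha, h in ensemble)
        return 1 if total >= 0 else -1

    # Tiny made-up data set: label is +1 at the ends, -1 in the middle.
    xs = [1, 2, 3, 4, 5, 6]
    ys = [1, 1, -1, -1, 1, 1]
    model = boost(xs, ys)
    print([H(model, x) for x in xs])

The re-weighting line uses the simplified forms derived above (divide by 2 * error for mistakes, by 2 * (1 - error) for correct answers), which already include the normalization by Z.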




Lecture 18 - Representations: Classes, Trajectories, Transitions
==================================================================
None of the systems so far emulates human intelligence.
Humans evolved, and on an evolutionary timescale developed intelligence quickly.
Brain size has increased, BUT Neanderthals had bigger overall brains and were not as smart.

It looks like a small group of ~1000 humans (based on mitochondrial DNA) changed their mental
machinery and then quickly became smart and took over the planet.
To loosely paraphrase Winston who in turn loosely paraphrases Chomsky:
"The ability to take 2 concepts and make a 3rd concept without disturbing the original 2 concepts. And the 3rd concept can be used as input to another iteration, so this proceeds without limit."

In evolution, this change happened very rapidly, and seemed to not affect brain size.
Winston says, this "symbolic ability" enabled story telling and understanding. So, "stories" seems to be
a deep mechanism. Linguists distinguish between "inner language" the language with which we think, and
spoken language. Often, bilingual people will remember a conversation but not remember which language it was in.

Consider a semantic net. We start with nodes and links.
"Combinators" are links, they connect nodes
There are also links between related links. 

"Reification" is the process of making links themselves the endpoints of other links. It is making a link
an equivilent alternate to a node as the input to a link.

There is a localization mechanism, called a "frame" by Minsky, that is a group of related links and nodes.
This is like a context. It's a bunch of related nodes and links that values can be "plugged into", so
you can bind a frame to particular values and make an instance of the frame for one specific application.
Winston calls frames a "localization" layer. It "shines a spotlight" onto one relevant part of the network.

This is very general, but so low level it does not help much with specific problems.
There is a "parasitic meaning". We know something, then draw a diagram, and project our own
understanding onto the diagram, but the diagram itself is not a meaning.

We will now consider the types of representation that are needed to understand stories.

1. Classification.
We can create hierarchies of types, like tool->hammer->Ball Peen
Most people know a little about most things, but a lot about a few things.

2. Transitions
One representation is to consider changes in the state. This is different than how engineers think - they think 
in terms of "state", the state of something at one time. Alternatively, we can think of things as transitions or
changes or transformations. 

So, for every state, there is a timeline. Consider an event. We can model this as 3 time regions:
    Before event
    During event
    After event
There are state variables, and for each of the different time regions we can model each with a vocabulary of change:
    Increasing, not-Increasing
    Decreasing, not-Decreasing
    Changing, Not-changing
    Appear, Not-Appear
    Disappear, Not-Disappear
    Not applicable 

3. Trajectory
A lot of what we say is about objects moving along trajectories.
An object starts at a source, is moved by an agent, follows a path, and moves toward a destination.
An agent may affect the object, possibly with a tool.
There may be an assistant, and also a beneficiary of the transition.

This is really a Minsky frame, and these are slots in the frame that can be bound to different 
elements of different stories.

In some sample databases (like Wall St Journal corpus of 50,000 sentences), about 25% of sentences 
are transitions or trajectories.


Story Sequences
A story sequence can be a single sentence, like "Pat Comforted Chris"
We can construct a "role" frame for this with slots:
    Agent=Pat
    Action=? Comfort is vague
    Object=Chris
    Result: Transition frame
        Transition Frame
            Object=Chris
            Mood=improved
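As a rough sketch of how such frames might be written down in code (purely illustrative; the slot names are just my reading of the lecture's vocabulary):

    # A role frame whose Result slot holds a nested transition frame.
    transition_frame = {
        "type": "Transition",
        "object": "Chris",
        "mood": "improved",
    }

    role_frame = {
        "type": "Role",
        "agent": "Pat",
        "action": "comfort",          # vague; a later stage could refine it
        "object": "Chris",
        "result": transition_frame,
    }

    # Binding slots to particular values makes an instance of the frame for one specific story.
    print(role_frame["result"]["mood"])    # -> improved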
             
Multiple ways of thinking about something. Winston paraphrases Minsky as saying, "if you can only think 
of something in one way then you have no recourse when you get stuck".

If the sentence were instead "Pat Terrorized Chris", we get similar frames, but now in the
transition frame the Mood is "decreased"

If the sentence were instead "Pat Kissed Chris", we get similar frames, but now in the role frame there
is also a "hallucination" slow, which is actually a "Trajectory" type frame.
    Object=Pat's lips
    Destination=Chris' lips
But, this is still vague. A father kisses his daughter on the forehead, a boy kisses a girlfriend on the lips, etc.

Now, notice that our internal language also has a mechanism for understanding stories, "Sequence". 
We don't want just a giant jumble of frames, but rather a sequence over time. 
Our learning is rooted in sequence. It is hard to replay a piece of music starting in the middle; you have to start at a boundary.
Similarly, you usually recall your SSN from the beginning, rather than producing just the last 4 digits directly.

We also have libraries of stories. 
In particular, our frames are stored in a hierarchy.

Event
  - Disaster
      Earthquake
      Hurricane
  - Party
      Birthday
      Wedding

All of these frames have different slots, and they are the key attributes of a story that we look for.

Parsing stories is still difficult, because it can be confusing what pronouns refer to.
This leads to some general rules for good writing.
1. Don't use pronouns. It makes it hard for the reader to know what you are referring to.
2. Don't use "former" or "latter". They make the reader stop and look back to a previous sentence to see which you are referring to.
3. Don't call a "shovel" a "spade". Don't use different words for the same object.





Lecture 19 - Architectures: GPS, SOAR, Subsumption
==================================================================
This lecture discusses architectures - which combine the ideas presented so far.
This will be important in understanding how human intelligence works.

GPS - (General Problem Solver)
This is the "Problem Solving Hypothesis"
This models the world as states, with a start state and a goal state.
Measure the distance between them (in some metric), and then forward-chain through intermediate states that are
closer to the goal state. This is also called "means-ends analysis".


SOAR - (State Operator and Result)
This is a second generation from the general problem solver, built by the people behind GPS
This is the "Symbol System Hypothesis" - humans are symbol manipulators.
It has several parts
   Short term memory
   Long term memory
   Vision and actuators
Most of the work is in the short term memory. This was influenced by Cognitive Science modelling of human thought.
Short term memory held
   Assertions
   Rules
A process would shuttle rules and assertions between short and long term memory.
This is a focus mechanism, so only part of your knowledge is active at any time
There was also an elaborate "preference" system for choosing between rules when more than 1 rule can fire at the same time.

It also had "prblem spaces", which are spaces of dicrete states. So, each problem has a specific problem space that is searched using the rules.
So, it's a rule-based system.
It also has a "universal subgoaling". When you cant think of what to do next, that becomes the next problem. So, this is forward/backward chaining.


Emotion Machine model - Minsky
Thinking comes in layers
This is based on the "common sense hypothesis"
When we solve a problem, we think on different levels
1. Instinctive reaction. Built-in reflexes
2. Learned reaction. We learn responses and protocols
3. Deliberative thinking. This does problem solving
4. Reflective thinking. Think about how we solved problems. This implies we store past problem-solving sessions and then analyze/debug/optimize them later with more data, to learn.
5. Self-reflecting layer. Think about how different goals interact
6. Self-conscious thinking. Think about goals in a social context
The Open Mind project was an attempt to gather common sense ideas from the web.


Subsumption Architecture (Rod Brooks)
This is based on the "creature hypothesis" - make a machine as smart as an insect.
Traditionally, robots had separate vision systems, reasoning systems, action systems, and these were separate.
Brooks reorganized this around separate layers that each deal with the world, with each layer having some of its own sensing, reasoning, and action.
Example of a hierarchy
   1. Seek
   2. Explore
   3. Wander
   4. Avoid Objects
This is really a hierarchy of abstraction layers.
Each layer has its own vision and reasoning system, and each layer is built on top of the lower layers.
Some key features:
- No representation. Don't try to build a representation of all data and the world
- Use the world - the sensory input, as data rather than an abstract model. So, everything is reactive.
- All mechanisms (plans? rules?) are finite state machines
There are no plans, only reactionary behaviors. You do not have an abstract map of the terrain, but rather
wander and react to walls and obstacles.
It could build a map as it wanders, and it may do this. 
This was commercialized as the Roomba robot
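A toy sketch of the layered, reactive idea (my own simplification, not Brooks's design - real subsumption wires finite state machines together with suppression and inhibition): lower layers react directly to sensor input, and a lower layer's output overrides the layers above it.

    def avoid_objects(sensors):
        # Lowest layer: react to obstacles straight from sensor data, with no world model.
        return "turn" if sensors.get("obstacle_ahead") else None

    def wander(sensors):
        # Higher layer: just keep moving when nothing more urgent is happening.
        return "move_forward"

    LAYERS = [avoid_objects, wander]     # ordered from lowest (most urgent) to highest

    def act(sensors):
        # Each layer is purely reactive; the first layer with an opinion wins.
        for layer in LAYERS:
            action = layer(sensors)
            if action is not None:
                return action
        return "idle"

    print(act({"obstacle_ahead": True}))     # -> turn
    print(act({"obstacle_ahead": False}))    # -> move_forward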


Minsky's layers seem to be a superset of these:
Deliberative thinking - this is SOAR
Instinctive reaction - this is Subsumption
Learned reaction - this is Subsumption


Genesis Architecture (Winston's lab)
"Strong story hypothesis"
This is centered on language. The language system does 2 things
1. Guide and interact with perception/input systems
2. Describe events

Some questions use perception to find an answer - like "am I wearing blue jeans"
Some questions are not written down, but we can imagine (ie simulate) an event in our mind and tell from the simulation
what happens, like "What happens if you run with a bucket full of water"

A language allows you to describe events in the context of your culture and social environment.
This can also tell stories. Telling and understanding stories may be essential, and the core of "understanding"

To illustrate the essential quality of stories, consider a series of experiments from Psychology
A rectangular room, all walls are white. In each corner is a basket that may contain food.
   You put the food in one of the 4 baskets while the rat is watching.
   Spin the rat, then let it go for the food.
   It will go to the correct corner or to the diagonally opposite one, but only those 2. So, it understands that the room
   is a rectangle.
   This is true for rats, children, and adults.
   Now, paint 1 wall blue.
   The rat and the child still go to the 2 diagonal corners with equal probability.
   But adults always go to the right corner.
   Children always go to the right corner once they use the words "left" and "right" in language.
   So, language is more than communicating ideas, it is deeply linked with using and manipulating ideas.
   
   But an adult who is repeating aloud what they hear over headphones (shadowing speech) is back to going to
   the 2 diagonal corners with equal probability.
   So, distracting or otherwise occupying the language center will cause you to lose some reasoning
   function.

   This may be a bit like the layers in Brooks model, except now one of the layers is language.

Two key ideas
1. If you want to be smarter, look, listen, draw, talk
These seem to drive more mental machinery and cause better processing of ideas.
Taking notes in class helps because it engages your motor and linguistic machinery.

2. Beware of fast talkers. They are jamming your language processor and so you cannot think.





Lecture 21 - Probabilistic Inference - Part I
==================================================================
For an event, we can build a table.
Each column is a possible explanation. There are N possible explanations so N columns.
There are 2**N rows, one row for each possible combination of values of the columns.

Add another column, and keep a count of how often the combination represented by each row happens.
Then, the probability of each row is the #occurrences of that row divided by the total # observations
This is a number between 0 and 1.
Add a column for the probability of each row.

The probability of each event is the sum of the probability of each row in which the event is true.
The probability of each event given a precondition is the number of events with both the precondition AND the event 
divided by total # occurrences of precondition. 
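A small sketch of this tabulation (variable names and observations are invented for illustration):

    # Each observation assigns True/False to every variable (one column per variable).
    observations = [
        {"rain": True,  "sprinkler": False, "wet_grass": True},
        {"rain": False, "sprinkler": True,  "wet_grass": True},
        {"rain": False, "sprinkler": False, "wet_grass": False},
        {"rain": True,  "sprinkler": False, "wet_grass": True},
    ]
    total = len(observations)

    def prob(event):
        # P(event): the fraction of observations in which the event is true.
        return sum(1 for obs in observations if obs[event]) / total

    def prob_given(event, precondition):
        # P(event | precondition): rows with both, divided by rows with the precondition.
        pre = [obs for obs in observations if obs[precondition]]
        return sum(1 for obs in pre if obs[event]) / len(pre)

    print(prob("wet_grass"))                 # 3 of 4 observations -> 0.75
    print(prob_given("wet_grass", "rain"))   # both rainy observations had wet grass -> 1.0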

This is powerful, but there is a problem because any realistic application has *lots* of rows.
For N variables there are 2**N rows.

So, we will go over probability without explicitly making a table.
Some basics of probability:
    P(a) is probability of event a.
The basic axioms:
0 <= P(a) <= 1
P(true) = 1
P(false) = 0
P(a) + P(b) - P(a AND b) = P(a OR b)  (the intuition is to not double-count P(a AND b); it's part of both P(a) and P(b))

Conditional probability
    P(a | b) means probability of a given b
    P(a,b) means probability of a AND b
P(a | b) = P(a,b) / P(b)
A normal probability is #events / #samples.
Suppose we restrict the sample space to the cases where b is true. The count of a in this restricted world is
the same as the count of "a AND b", since we only consider the cases where b is true.
So, in this subset of all events,
P(a | b) = #events of a / #all events in the subset
         = P(a,b) / P(b)

Similarly, by just algebra, P(a,b) = P(a | b) * P(b)


Now, divide the space of all events into 3 parts.
Consider 3 variables a,b,c
Let y stand for the joint event (b,c)
P(a,b,c) = P(a,y) = P(a | y) * P(y) = P(a | b,c) * P(b,c) = P(a | b,c) * P(b | c) * P(c)

In general
P(x1,x2,...,xn) = Product of P(xi | xi-1,....,x1) for all i between 1 and n inclusive
This is the "Chain rule"

Independence
P(a | b) = P(a) if a is independent of b
Whether b occurred does not matter
So, in Venn diagram terms, P(a | b) is the ratio of the overlap of a and b to the entire area of b
If this is equal to P(a), then that ratio is the same as the ratio of the area of a to the size of the entire event space.

Conditional Independence
P(a | b,z) = P(a | z)  if a is independent of b given z
Whether b occurred does not matter
So, in Venn diagram terms, P(a | b,z) is the ratio of the overlap of a, b, and z to the overlap of b and z
That ratio is the same as the ratio of the overlap of a and z to the size of all of z.
So, when a and b are conditionally independent given z:
P(a,b | z) = P(a | z) * P(b | z)


Belief Nets
We can diagram events, as nodes, and causality as edges.
Several nodes may cause a single event, i.e., a raccoon or a burglar can both cause a dog to bark.

Given a node's parents, every node is independent of all other non-descendant nodes.
In other words, given its immediate parents, a node does not depend on any indirect ancestors.
A node *might* still depend on its descendants.

For each node, you can make a probability table, with all of its parent nodes as
variables, so if there are N parent nodes then there are 2**N rows.
Now, you still have to find some probabilities, but far fewer than if you just created a single giant table of all variables in the system.
So, a Belief Network is a way of simplifying work because you can safely ignore variables that do not have any effect on
a particular variable you are modelling.
Moreover, once you have a belief network, then you can create a single large table mechanically.

Start with a probability of all variables true
P(a,b,c,d) 
But, you can compute this with the chain rule
P(a,b,c,d) = P(a | b,c,d) * P(b | c,d) * P(c | d) * P(d)
But, P(a | b,c,d) = P(a | b) if a is independent of c and d given b (i.e., b is a's only parent).
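A sketch of reading a joint probability off a small belief net by multiplying the local tables (structure and numbers are invented; each node here has at most one parent, to keep it short):

    # A tiny chain-shaped net: d has no parents, c depends on d, b on c, a on b.
    P_d = 0.3                                  # P(d = true)
    P_c_given_d = {True: 0.8, False: 0.1}      # P(c = true | d)
    P_b_given_c = {True: 0.5, False: 0.2}      # P(b = true | c)
    P_a_given_b = {True: 0.9, False: 0.4}      # P(a = true | b)

    def p(value, prob_true):
        # Probability that a variable takes `value`, given the probability it is true.
        return prob_true if value else 1 - prob_true

    def joint(a, b, c, d):
        # Chain rule plus independence: P(a,b,c,d) = P(a|b) * P(b|c) * P(c|d) * P(d)
        return (p(a, P_a_given_b[b]) * p(b, P_b_given_c[c])
                * p(c, P_c_given_d[d]) * p(d, P_d))

    # Sanity check: the 16 joint probabilities add up to (approximately) 1.
    print(sum(joint(a, b, c, d)
              for a in (True, False) for b in (True, False)
              for c in (True, False) for d in (True, False)))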




Lecture 22 - Probabilistic Inference - Part II
==================================================================
A bit of review:
    The Joint Probability Table tells us everything, but it is huge and combinatorially explosive

We used "Inference nets" also called "Bayes nets" that have a node for each event and edges if one 
event will cause another event.
This assumes no loops.

Any variable in the graph is independent of any non-descendant, given the variable's parents.
All causal influence flows through the parents: once you know the parents' values, you can ignore anything further upstream.

The Inference graph allows us to derive a full Joint Probability Table with only a limited number of 
probabilities. It prunes non-causal relationships so it only requires the essential probabilities.
To do this, we use the "Chain rule"
P(x1,x2,...,xn) = Product of P(xi | xi-1,....,x1) for all i between 1 and n inclusive
So, this is a product of N terms, where each term conditions one variable on the variables that come after it in the list. For example:
   P(x1,x2,x3) = P(x1 | x2,x3) * P(x2 | x3) * P(x3)

Each node is a variable. So, how do we arrange the variables for the Chain rule expression?
Start with the graph, and take a leaf node. Append it to the list, and remove it from the graph.
    Now, iterate with the simpler graph
This orders the variables so each variable precedes all of its parents.

Now, we can use the chain rule with the variables in this order. For example:
P(C,D,B,T,R) = P(C | D,B,T,R) * P(D | B,T,R) * P(B | T,R) * P(T | R) * P(R)

But, a variable is independent of any non-descendant node, given its parents.
So, once its parents are known, a variable does not depend on non-descendant nodes that are not its parents.
So, now we can start to simplify the expressions to remove non-parents from the expression.
So, if only D and B are parents of C, then P(C | D,B,T,R) = P(C | D,B)


We can start with observed data and use that to build a table of probabilities.
Each observed event gives one True/False value for every variable.

For each variable that has no parents, we record the total number of observations and the number of
    cases when the variable is true. This gives us #true / #observations, so a probability for each parentless variable.

For each variable that depends on 1 or more parent variables, we keep a similar tally.
    Create a Probability table.
    Each row is a combination of input (parent) values. We record the number of times we saw each combination of values,
    and also the number of times the output variable was true when we saw that combination.
    So, again, for every row, we have a total #events and also #times output variable is true.

This uses observed data to build the probability tables.
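A sketch of that tally for one node and its parents (the variables are the raccoon/burglar/bark example from the belief-net discussion; the observations are invented):

    from collections import defaultdict

    # Observations for a node "bark" whose parents are "raccoon" and "burglar".
    data = [
        {"raccoon": True,  "burglar": False, "bark": True},
        {"raccoon": False, "burglar": True,  "bark": True},
        {"raccoon": False, "burglar": False, "bark": False},
        {"raccoon": True,  "burglar": False, "bark": False},
    ]

    totals = defaultdict(int)   # how often each combination of parent values was seen
    trues = defaultdict(int)    # how often "bark" was true for that combination

    for obs in data:
        row = (obs["raccoon"], obs["burglar"])
        totals[row] += 1
        trues[row] += obs["bark"]

    # The node's probability table: P(bark | raccoon, burglar) for each observed row.
    cpt = {row: trues[row] / totals[row] for row in totals}
    print(cpt)    # {(True, False): 0.5, (False, True): 1.0, (False, False): 0.0}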
Next, we use the probability tables to predict behavior.

Again, start with the Inference graph, but now begin at the nodes that have no parents (the input nodes).
For each such node, generate a random number and compare it to the node's probability to decide whether the node is true/false.
Once we have assigned values for all of the nodes that feed into a particular intermediate node,
then we can use those values to select a row in the probability table for that intermediate node.
Again, generate a random number, and if it meets the threshold for that row, then the intermediate
node is true; otherwise it is false.
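A sketch of that simulation step (illustrative; it reuses the hypothetical raccoon/burglar/bark tables from above, now with made-up probabilities):

    import random

    P_raccoon = 0.3                 # input node: no parents
    P_burglar = 0.1                 # input node: no parents
    P_bark = {                      # P(bark = true | raccoon, burglar), one entry per row
        (True, True): 0.95, (True, False): 0.6,
        (False, True): 0.9, (False, False): 0.05,
    }

    def sample_once():
        # Input nodes first: one random draw against each node's probability.
        raccoon = random.random() < P_raccoon
        burglar = random.random() < P_burglar
        # Then the intermediate node, using the row selected by its parents' values.
        bark = random.random() < P_bark[(raccoon, burglar)]
        return {"raccoon": raccoon, "burglar": burglar, "bark": bark}

    print(sample_once())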


Now, we need to decide whether we have designed the dependencies in the Inference graph correctly.
In other words, are there links between every pair of interdependent nodes and no links between independent nodes?
We can check this with the rules of Bayesian Inference.
By definition we know that:
    P(a | b) = P(a,b) / P(b)
so:
    P(a | b) * P(b) = P(a,b)
But, we can also start with P(b | a) = P(b,a) / P(a)
So, P(b | a) * P(a) = P(b,a) = P(a,b)
So, P(a | b) * P(b) = P(a,b) = P(b | a) * P(a)
So, P(a | b) * P(b) = P(b | a) * P(a) 
So, P(a | b) = P(b | a) * P(a) / P(b).   This is Bayes Rule.

This can be used to determine whether evidence will imply a conclusion.
Let
    a = conclusion
    b = evidence

It may be harder to decide P(a | b) which is conclusion given evidence
But, it may be easier to decide P(b | a) which is evidence given conclusion.
    P(conclusion | evidence) = P(evidence | conclusion) * P(conclusion) / P(evidence)

Suppose there are several possible conclusions. We have some evidence, and we know the probability of the evidence
given each conclusion. We also know the baseline pre-test probability of the conclusions.
So, this is all we need, to decide the probability of each conclusion.

I don't understand what he is saying here. 
    All probabilities add up to 1.
    1 = P(conclusion1) + P(conclusion2) + ....
    1 = P(conclusion1 | evidence) + P(conclusion2 | evidence) + .....
    1 = P(evidence | conclusion1)P(conclusion1)/P(evidence) + P(evidence | conclusion2)P(conclusion2)/P(evidence) + ....
    P(evidence) = P(evidence | conclusion1)P(conclusion1) + P(evidence | conclusion2)P(conclusion2) + ....

But, suppose there are several different pieces of evidence.
    P(conclusion | evidence) = P(evidence1,evidence2,...evidenceN | conclusion) * P(conclusion) / d
    where d is some denominator we are not specifying

Now, suppose the pieces of evidence are conditionally independent given the conclusion.
When they are conditionally independent,
    P(evidence1,evidence2,...evidenceN | conclusion) 
            = P(evidence1 | conclusion) * P(evidence2 | conclusion) * ..... * P(evidenceN | conclusion)
So, 
    P(conclusion | evidence) = (P(evidence1 | conclusion) * P(evidence2 | conclusion) * ..... * P(evidenceN | conclusion))  * P(conclusion) / d
    where d is some denominator we are not specifying

Now, do this for every possible conclusion.
The conclusion with the highest probability given the evidence is the one we select as being most likely.
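A sketch of choosing the most likely conclusion this way (all names and numbers are invented):

    # P(evidence_i | conclusion) for two hypothetical conclusions, plus their pre-test probabilities.
    likelihoods = {
        "flu":  {"fever": 0.9, "cough": 0.8},
        "cold": {"fever": 0.3, "cough": 0.7},
    }
    priors = {"flu": 0.1, "cold": 0.9}

    def score(conclusion, evidence):
        # Unnormalized P(conclusion | evidence): product of the likelihoods times the prior.
        s = priors[conclusion]
        for e in evidence:
            s *= likelihoods[conclusion][e]
        return s

    evidence = ["fever", "cough"]
    scores = {c: score(c, evidence) for c in likelihoods}
    print(max(scores, key=scores.get), scores)   # "cold" wins here (roughly 0.19 vs 0.07)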

We can also use this to select between competing conclusions after a series of repeated tests.
We use the same test, over and over, but it may give a different value each time. 
Consider flipping a coin - the same flip may come up heads or tails each time.
We build a sequence of probability values for each possible conclusion. Each possible conclusion
has a likelihood for each test result.
So, 
    Start with the pre-test probability of each possible conclusion.
    Run one test, and get the result.
    Find the probability of that result under each possible conclusion.
    Multiply that probability with the current probability of the possible conclusion.

We can evolve several possible conclusions in parallel. At any point, pick the possible conclusion with the highest
probability.

Similarly, if there are 2 competing Inference graph topologies.
Each has a probability of selecting a specific conclusion for some evidence.
Gather a piece of evidence and its result.
Compute the probability of each Inference graph reaching this result.
Repeat N times. Multiply the probabilities of all N results for each possible Inference graph.
Select the Inference graph with the highest probability of producing the observed results.

Now, we are selecting structures. 
But, the space is likely too large, when there are 30 or more variables.
So, this is a search problem. Take the losing model and modify it randomly. This is a random walk through the space of structures.

At each step, we compare models.
At the start, each model has the same N variables, and no edges.
For one model, pick 2 random nodes, and make an edge between them.
Loop
    Get a series of evidence
        Compute the probability that each model chooses the observed result
        Multiply that with the running probability of that model
    Pick the better model, which has highest probability.

    Now, take the loser, and randomly mutate it (add or remove an edge)
Repeat

Actually, instead of the product of probabilities, use the sum of the logs of the probabilities.
The probabilities get so small that we need to add logs rather than multiply the raw numbers.
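In code, the comparison might look like this sketch:

    import math

    # Multiplying many small probabilities underflows toward 0, so compare summed logs instead.
    model_a_probs = [1e-5, 3e-7, 2e-6]
    model_b_probs = [2e-5, 1e-7, 5e-6]
    score_a = sum(math.log(p) for p in model_a_probs)
    score_b = sum(math.log(p) for p in model_b_probs)
    print("A" if score_a > score_b else "B")   # the model with the higher log score wins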

Typical applications:
    Medical diagnosis
    Software debugging

Probability analysis is the right thing to do when you don't really understand the system
Like biology where we do not know all the pathways.





Lecture 23 - Model merging
==================================================================

Structure discovery was discussed in the previous lecture. 
This is important, and may be integral to machine learning.

Suppose you have 2 stories. Represent each story as a sequence of nodes.
We can model this as a basic graph. Events are nodes and edges are time sequencing
A-->B means "A followed by B".
Next, we can try to combine similar stories, make a graph that has the same basic structure
as 2 original graphs, but with optional detours or optional changes in details.
This is "Baysean Story Merging"

Each node is really a small mini-graph, of actor nodes and action nodes.
We can merge 2 larger nodes, if their mini-graphs have comparable action nodes.
Note, comparable means things like "run" and "Flee" are the same, but also looser similarities.

This can be used for story understanding.
- Parse the story into sentences, and each sentence into a mini-graph with actor nodes and action edges.
- The story is a sequence of sentences, so it is a sequence of minigraphs.
- Compare minigraphs to try to make an abstract minigraph that is a generalization

So, story understanding is a general idea, where "story" can also be "theory" or "scientific model".
A story is any sequence of events, and understanding stories is really just understanding sequences
of events and can be used for scientific investigation.

This is "cross model coupling". Each story graph is a model, and this tries to "couple" or find
similarities between 2 models.

This can be used for more than stories. Winston gives an example of finding similar Bird calls.

Think of a visual picture as a sequence of pen strokes. So, a picture is a linear story, it
is a linear sequence of graphical primitives, like "horizontal line", "box" and so on.
Now, we can do image recognition as comparing 2 sequences of graphical primitives.
This is a bit like string matching - take a pattern and try to see if it is a loose/sloppy fit
to a known prototype.

In speech recognition, we take a Fourier spectrum. This reduces the sound to a series of peaks.
But, we also watch speakers and see the movement of their lips. This is a second series of
input, but this time it is a series of visual shapes of the speaker's lips.
When we listen, we correlate the two models, so certain lip shapes correspond to certain sound peaks.


To make a match between 2 story lines:
For each node in one sequence, 
    Make weighted links to all possible matching nodes in the other sequence. 
        Each link is weighted by how close a match the 2 nodes are.

    Find pairs of nodes that are similar to the same things, in a constant proportion.
    For example, if 
        - node 1 has a link to prototype 1 of weight 4 and a link to prototype 2 of weight 2
        - node 2 has a link to prototype 1 of weight 2 and a link to prototype 2 of weight 1
    Node 2 has the same links as node 1 with a 2x weight. 
    So, Node 2 and node 1 are both instances of the same thing.
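A sketch of that proportionality check (my own formulation of the idea; names are invented):

    def proportional(links1, links2, tol=1e-6):
        # links1 and links2 map prototype names to link weights.
        # True if one node's weights are a constant multiple of the other's.
        if set(links1) != set(links2):
            return False
        ratios = [links1[k] / links2[k] for k in links1]
        return max(ratios) - min(ratios) < tol

    node1 = {"prototype1": 4, "prototype2": 2}
    node2 = {"prototype1": 2, "prototype2": 1}
    print(proportional(node1, node2))   # True: node1's links are 2x node2's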


    Review:
    * The right representation makes you smarter
    * Sleep makes you smarter
    * You cannot learn unless you almost know
    * We think with our mouths, eyes, and hands
    * The Strong Story Hypothesis
    "The mechanisms that enable us humans to tell and understand stories are the evolutionary augmentations
    that separate our intelligence from that of other primates"
    * All great Ideas are simple