Lecture 1 - Introduction
-------------------------
This lecture was given in 2010, so it preceded the wider media coverage of AI.

Lecture 2: Goal Trees
------------------------------------
How do you solve complex problems? The big idea is to create a small set of problem transforms, each of which changes a complex problem into simpler problems. Eventually, the problem is simple enough that it can be solved directly. This is called "problem reduction". Essentially, this is recursion - repeatedly simplify the problem until you have a base case, and then directly solve the base case.

There are 2 sets of transformations:
1. Safe Transformations - these always lead you toward a solution.
2. Heuristic Transformations - these may lead you toward a solution, but may not. You may have to backtrack if you hit a dead end.

For simple calculus problems some transforms are:
Safe Transformations - these always lead toward the solution
1. Integral(-f(x))dx = -Integral(f(x))dx
2. Integral(c*f(x))dx = c*Integral(f(x))dx
3. Integral(f(x) + g(x))dx = Integral(f(x))dx + Integral(g(x))dx
4. Integral(p(x)/q(x))dx --> divide
Heuristic Transformations - these may lead toward a solution
A. Integral(f(tan x))dx = Integral(f(y)/(1+y^2))dy

You then have a table of simple "built-in" integrals and their solutions.

The algorithm is:
1. Apply transforms. Each will replace the problem with one or more simpler problems. This replaces a goal with a conjunction (logical-AND) of several simpler goals.
2. Look to see if any simpler goal is in the table of primitive problems. If so, then replace it with the solution.
Continue looping through steps 1 and 2 until you are done.

The state of the program is an AND-OR tree. Each node is a problem, and the original problem is the root of the tree. You build the tree as you solve the problem. You start with just the root, which is the original problem, and a transformation will extend the tree by adding child nodes to the current node.
So, you start working on a leaf node and either solve it or else add simpler sub-goals as child nodes, which turns the leaf node into an intermediate node. At any point, the program is examining one node of the tree. You can do any tree traversal, depth-first or breadth-first search, but depth-first is most intuitive. There are several other orders, and the program can use heuristics to choose which leaf node to work on next.

To solve a node, either find that node in the table of known problems with their solutions, or else solve the children of the node. The children of a node may be either an AND or an OR, so you must solve either all children (the AND case) or at least one child (the OR case). Applying multiple different transforms to a node will create an OR-tree of children. Each transform makes one child, and solving any transform's child node is sufficient to solve the parent node. Some transforms will make an AND-node, by simplifying a parent node into several child nodes, and you must solve all child nodes.

If you hit a dead end, then back up the tree until you hit an OR node and try an alternate child. If you solve a node (you find its solution in the table of known solutions), then back up the tree until you hit the root or an AND node. If you hit an AND node, then solve the next child. If you hit the root, then you are done.

We start with a goal and reduce it until we have converted it into a simpler goal that we know how to solve. So, we are working backwards, from a goal to a series of steps, and this is essentially "backward chaining".

In calculus, a fairly simple program can solve a lot of integrals. Average depth of the tree is 3, max depth is 7. Average number of unused nodes is approximately 1. This means there are not a lot of unused dead ends. So, the heuristic of which node to work on next is less important for this domain.
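The problem-reduction scheme above can be sketched in a few lines of Python. This is a minimal sketch, not the actual integration program: the table of known integrals, the single sum-rule transform, and the string representation of problems are all hypothetical stand-ins.

```python
# A minimal sketch of problem reduction on an AND-OR tree (names hypothetical).
# A problem is solved directly if it is in the table of known solutions;
# otherwise each applicable transform proposes a list of simpler subproblems
# (an AND of subgoals), and the alternative transforms form the OR branch.

KNOWN = {"int(x)": "x^2/2", "int(1)": "x"}       # table of primitive integrals

def transforms(problem):
    """Yield lists of simpler subproblems (each list is an AND of subgoals)."""
    if problem == "int(x + 1)":                  # sum rule: split the integrand
        yield ["int(x)", "int(1)"]

def solve(problem):
    if problem in KNOWN:                         # base case: look up the answer
        return KNOWN[problem]
    for subgoals in transforms(problem):         # OR over transforms...
        parts = [solve(g) for g in subgoals]     # ...AND over the subgoals
        if all(p is not None for p in parts):
            return " + ".join(parts)
    return None                                  # dead end: caller must backtrack

print(solve("int(x + 1)"))                       # -> x^2/2 + x
```

The recursion mirrors the tree: the `for` loop over transforms is the OR node, and the `all(...)` over subgoals is the AND node.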
The lookup table of known solutions had 26 integrals, 12 safe transformations, and 12 heuristic transformations.

We can characterize any domain with some basic questions:
What are the types of knowledge involved?
How is the knowledge represented?
How is the knowledge used?
How much knowledge is required?

Lecture 3: Goal Trees
------------------------------------
This starts with a demo of a simple planning program, to arrange blocks. This is backward chaining, and the code will have to solve intermediate tasks, like clearing a block before a new block can be put on top of it, or clearing a block so it can then be moved onto another block. The program generates plans, and stores its state, so it can explain which sub-goal it was solving when it made a particular step in the plan.

The program has several procedures:
1. PutOn(movedBlock, targetBlock)
     FindSpace(targetBlock)
     Grasp(movedBlock)
     ClearTop(movedBlock)
         While-block-has-things-on-top
             GetRidOfBlock(topBlock)
             PutOn(topBlock, table)
     Move(movedBlock)
     Ungrasp(movedBlock)

The code leaves a trace: which function was called in which order.
Why-questions: the reason a function is called is the parent procedure in its trace and its parameters.
How-questions: how a function did something is the child procedure in the trace and its parameters.
This trace is a goal tree, which is really an AND-OR tree.

So, in lecture 2 and in lecture 3, we built an AND-OR tree to search a space. In lecture 2, this was the explicit state of the program, and in lecture 3, we leave the tree around after completing the problem so we can examine the reasoning of the program. This is backward-chained programming.

Next we switch to talk about forward-chained programs. As an example, consider a program to identify animals in a zoo.
Some attributes of animals:
Has hair
Has claws
Has sharp teeth
Has forward-pointing eyes
Eats meat

Some simple rules:
R1: IF Has-hair THEN Is-mammal
R2: IF Has-claws AND Has-sharp-teeth AND Has-forward-pointing-eyes THEN Carnivore
R3: IF Eats-meat THEN Carnivore
R4: IF Is-mammal AND Has-spots AND Is-fast THEN Cheetah

A forward-chained expert system also builds a goal tree (an AND-OR tree) as it runs. As a result, it too can answer questions about its own behavior.
To answer a "How" question, like "How did you know X?", answer with the preconditions that allowed the rule to fire which stated X.
To answer a "Why" question, like "Why did you want to know Y?", answer with the result of a rule that used Y as a precondition.

So, a goal tree is used for both forward and backward chaining. In both cases, it can examine the parent and child nodes in a goal tree. Similarly, the rules for a forward-chained expert system can also be used in reverse to do backward chaining. In this case, you start with a goal, and ask "Is this animal an X?" Then you fire rules where each rule will look for properties, and you simplify until you look for properties that are known. Here again, you build a goal tree.

So forward and backward chaining are very similar:
Both build AND-OR goal trees.
Both use rules to add child nodes to a leaf node.
Both can inspect their state (the resultant AND-OR tree) and explain why and how they did each step.
Backward chaining starts with a goal or hypothesis and works backward.
Forward chaining starts with known facts and works forward.
An AND-node is a conjunction of preconditions to a rule.
An OR-node is several different rules, any one of which can fire.
Forward chaining seems to be best suited for identification.
Backward chaining seems to be best suited for making a plan or proving a hypothesis.

In a deduction mode, if a rule fires then the result remains true. Once you prove something then it remains true.
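A minimal forward-chaining sketch over the zoo rules above (the hyphenated attribute names are informal stand-ins for the ones in the rules):

```python
# Forward chaining: each rule is (preconditions, conclusion); keep sweeping
# the rule set, adding conclusions as new facts, until a fixed point.

RULES = [
    ({"has-hair"}, "is-mammal"),                                            # R1
    ({"has-claws", "has-sharp-teeth", "has-forward-eyes"}, "is-carnivore"), # R2
    ({"eats-meat"}, "is-carnivore"),                                        # R3
    ({"is-mammal", "has-spots", "is-fast"}, "is-cheetah"),                  # R4
]

def forward_chain(facts):
    facts = set(facts)
    changed = True
    while changed:                      # sweep until no rule adds a new fact
        changed = False
        for pre, concl in RULES:
            if pre <= facts and concl not in facts:
                facts.add(concl)        # the rule fires, asserting its conclusion
                changed = True
    return facts

facts = forward_chain({"has-hair", "has-spots", "is-fast"})
print("is-cheetah" in facts)            # -> True (R1 fires, then R4)
```

Each firing adds an AND-node to the goal tree (the rule's preconditions are its children), which is exactly the record that answers the "How" and "Why" questions.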
In other modes, you can add or remove facts in a database, so the results of one rule may be changed or removed by some future rule. As an example, look at the problem of putting groceries in a bag.

Key Ideas:
- Complexity may be due to the complexity of the environment or the problem, not the complexity of the program.
  Complexity of behavior is MAX(complexity of program, complexity of environment).
  So a simple program can have complex behavior when it solves a complex problem.
- Deal with specific cases when studying a system.
- Ask questions about things that appear to be the same but are handled differently.
  This exposes significant properties, and lets you write new rules.
- Build a system, and when it fails you are missing a rule or some description in the vocabulary.

Lecture 4: Search
------------------------------------
Humans can do simple search pretty well using just visual processing. We don't understand how this works, but it seems to be important. "We will never have a complete theory of human intelligence until we can understand the contributions of the human visual system to solving everyday problems"

Search is about any system of choices. Finding paths in maps is one example, but solution spaces or any sequence of choices need search.
- British Museum
  Find every possible path. But, this will not visit the same node twice.
- Depth-First Search (DFS)
  Always take the left-most child. If you hit a dead end, then backtrack to the previous choice. This uses backtracking.
- Breadth-First Search (BFS)
  Search level-by-level, so advance every child of a node.

To implement search, we will create a queue of partial paths. To implement DFS: represent paths as Lisp s-expressions.
The initial path is a list (S), which is just the S start node.
Extend path:
Loop
    Pop path from head of queue
    Current node is last node in the path
    If there are no children, then delete the path and iterate the loop
    Otherwise:
        Look at each child of the node
        If child has not been seen before:
            Check if we are done, i.e., if the goal node is one of the children
            Make a new path by appending each child of the current node to the list
            Put each new path back onto the head of the queue (really, this is a LIFO stack)
            Add new child to list of nodes we have seen before
    Iterate the loop

For BFS, just put new lists onto the tail of the queue, not the head.

Now we will start to assign a score to nodes, based on their proximity to the goal. This will create a series of algorithms; the simplest is hill climbing.

- Hill Climbing
  In DFS, we enumerate the child nodes in lexical order, which is just arbitrary. In hill climbing, we enumerate them in order of their proximity to the goal node. So, you add new paths to the front of the queue, but add them so the path closest to the goal is first. This is called an "informed search" and it requires that I know the distance of each node to the goal. That may not always be possible, but when it is, hill climbing is much faster than DFS. You backtrack when you hit a local but not global minimum. This has possible problems: local but not global maxima, and flat spaces with no best direction.
- Beam
  This is to BFS what hill climbing was to DFS. So, when we advance each level in the BFS, we limit to only the K best options, based on how close each child node is to the goal. K is the "beam width". In beam search, add only K paths, and add them to the back of the queue.
- Best-First Search
  It always expands the path whose leaf node is closest to the goal, no matter where it appears in the queue.

Lecture 5 - Search: Branch and Bound, A*
---------------------------------------------------
How do you find the shortest path through a graph?
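Before answering that, the queue-of-partial-paths scheme from Lecture 4 can be sketched in Python rather than Lisp (the graph is a toy example; DFS and BFS differ only in which end of the queue receives the new paths):

```python
# A queue of partial paths: DFS pushes extensions onto the head (a LIFO
# stack), BFS appends them to the tail (level by level).
from collections import deque

GRAPH = {"S": ["A", "B"], "A": ["C", "G"], "B": ["C"], "C": ["G"], "G": []}

def search(start, goal, depth_first=True):
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()               # take the path at the head
        children = GRAPH[path[-1]]
        if depth_first:
            children = list(reversed(children))  # so the left-most child ends up first
        for child in children:
            if child in seen:
                continue                     # never revisit a node
            new_path = path + [child]
            if child == goal:                # check for the goal on extension
                return new_path
            seen.add(child)
            if depth_first:
                queue.appendleft(new_path)   # head of queue: behaves as a stack
            else:
                queue.append(new_path)       # tail of queue: breadth first
    return None

print(search("S", "G", depth_first=True))    # -> ['S', 'A', 'G']
```

This finds *a* path, not necessarily the shortest one; that is what branch-and-bound, below, adds.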
Once again, search is about examining a space of choices, not about maps specifically.

Branch-and-Bound
----------------
Maintain a queue of partial paths.
This queue is sorted by length of the partial paths.
Initialize the queue with a single path that is just the START node; the partial path length is 0.
Pick the next path to work on - this is the partial path with the shortest length.
This partial path has a last node where the partial path reaches, and you have a path length to get to this node.
Compute the length of a new partial path through the current route to each child of the last node.
Add the new partial paths to a global list, sorted by length of partial path.

Note, once you find a path to the goal, you have to continue searching until every other partial path has a longer length than the path we found to the goal.

Why is this important? Before, in the simpler tree searches, we only looked at the children of a single node. We kept minimal state: how we got to the current node and the description of the current node. (We may also keep a record of which nodes we have visited before, but that is just a boolean flag on every node.) Now, we keep a global state of all partial paths we have developed so far, and make a global decision: of the partial paths developed so far, which do we search next? So, instead of choosing which child of a particular node to search, we now think more globally and decide which partial path in the entire tree we search next.

As an optimization, never extend a path which visits a node we have already found a shorter path to before. This can keep a "visited" flag on each node, and never extend a path to a node that has previously been visited. We extend paths in order of length, so if we have visited a node before then we have already found a shorter path to that node.
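A sketch of branch-and-bound as described above, using a priority queue for the sorted list of partial paths (the weighted graph is a made-up example):

```python
# Branch-and-bound: keep all partial paths in a priority queue sorted by
# length so far, and always extend the globally shortest one.
import heapq

# toy weighted graph: node -> [(neighbor, edge_length), ...]
GRAPH = {"S": [("A", 2), ("B", 1)], "A": [("G", 2)], "B": [("A", 2), ("G", 5)], "G": []}

def branch_and_bound(start, goal):
    queue = [(0, [start])]                   # (length of partial path, path)
    extended = set()                         # nodes a shorter path already reached
    while queue:
        length, path = heapq.heappop(queue)  # globally shortest partial path
        node = path[-1]
        if node == goal:                     # every other partial path is longer,
            return length, path              # so this path is provably shortest
        if node in extended:
            continue                         # the "visited" optimization
        extended.add(node)
        for child, edge in GRAPH[node]:
            heapq.heappush(queue, (length + edge, path + [child]))
    return None

print(branch_and_bound("S", "G"))            # -> (4, ['S', 'A', 'G'])
```

Note the goal test happens when a path is *popped*, not when it is created; that is what implements "continue searching until every other partial path is longer."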
A* Algorithm
----------------------------------------------
This uses the predicted distance from each node to the goal to decide which partial path to search next. If you have 2 partial paths, one ending at node A and one ending at node B, then compute the predicted distance of each partial path to the goal.
Predicted length of path through A >= length to A + distance-as-the-crow-flies from A to goal
Predicted length of path through B >= length to B + distance-as-the-crow-flies from B to goal
So, always search along the partial path with the lowest predicted length to the goal.

This *assumes* that you can measure "distance as the crow flies" between 2 nodes. That may not be true in many problem spaces. This is an "admissible heuristic", which is a heuristic that can predict a lower bound (in this case, "distance as the crow flies"). So A* is branch-and-bound plus a previously-visited list plus an admissible heuristic (distance as the crow flies).

Note, an admissible heuristic may grossly underestimate the lower bound. This is legal; after all, it only says the actual path will be >= the estimate, and if the estimated distance is too low (like 0), then this is still valid. However, in this case, the admissible heuristic will always lead you to explore paths that have actual distances much longer than the lower-bound estimate. So, the admissible heuristic leads to an inefficient search. Worse, if we use a visited flag (or a list of previously extended nodes), then we may first reach a node through a long path with an artificially low admissible heuristic, and then when we hit that common node with a shorter path (but a higher and more realistic admissible heuristic), we will not search that path because we visited the common node before. So, a useful admissible heuristic must not only provide a guaranteed lower bound but also a "realistic" lower bound.
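A* can be sketched as branch-and-bound ordered by length-so-far plus the admissible estimate. The heuristic table here is a hypothetical stand-in for "distance as the crow flies":

```python
# A*: pop the partial path with the lowest (length so far + estimated
# remaining distance). The estimate never overestimates the true distance.
import heapq

GRAPH = {"S": [("A", 2), ("B", 1)], "A": [("G", 2)], "B": [("G", 5)], "G": []}
H = {"S": 3, "A": 2, "B": 4, "G": 0}          # admissible: each H[n] <= true distance

def a_star(start, goal):
    queue = [(H[start], 0, [start])]          # (predicted total, length so far, path)
    extended = set()
    while queue:
        _, length, path = heapq.heappop(queue)
        node = path[-1]
        if node == goal:
            return length, path
        if node in extended:
            continue
        extended.add(node)
        for child, edge in GRAPH[node]:
            g = length + edge
            heapq.heappush(queue, (g + H[child], g, path + [child]))
    return None

print(a_star("S", "G"))                       # -> (4, ['S', 'A', 'G'])
```

With H identically 0 this degenerates to plain branch-and-bound, illustrating the point above that a too-low admissible estimate is legal but wastes work.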
Admissible Heuristic: the estimated distance between any node X and the goal is <= the actual distance from node X to the goal.
Consistent Heuristic: |(estimated distance from any node X to goal) - (estimated distance from any node Y to goal)| <= actual distance between nodes X and Y.
A consistent heuristic does not have this problem of causing you to ignore the best path.

Lecture 6 - Search: Games, MinMax, Alpha-Beta
---------------------------------------------------
This is presented in the context of chess, since that is a common problem. These are adversarial spaces, with 2 opponents trying to reach opposite goals.

Some ways to play:
- Human way: analysis + strategy + tactics ==> next move. Nobody knows how to do this.
- If-Then rules: many, many rules. These lead to weak play. This uses very local state, and no long-term strategy that spans rules.
- Look ahead and evaluate. Look at all the moves and all the possible consequences of each move. Each is a new potential state of the game; we choose the best state and then pick the move that will lead us there.

This uses a function: status = function(current board state), where status is a score, a numeric quantification of the "goodness" of the board state. The status score is a combination of a bunch of sub-scores, each looking at a different feature of the board. For example, in chess this may be how many pieces you have, are you in check, is the opponent in check, etc.

How do you combine the subscores? One possibility is:
score = c1*f1 + c2*f2 + ...
where c1, c2, etc. are weighting constants and f1, f2, etc. are functions, each evaluating the board and returning a score about some aspect of the state of the board. This combination of subscores is just a linear polynomial, and is called a "linear scoring polynomial".

Consider a tree of moves. Each node is a state of the board and each edge is a move.
The branching factor is the number of children of a node. If we only look a certain depth down the tree:
# leaf nodes = branching-factor ** depth
In chess, a whole game is ~100 moves (50 per player) with a branching factor of roughly 10-14. So, there are about 10**100 nodes, and a British Museum search is impossible.
There are 10**80 atoms in the universe.
There are 3 * 10**7 seconds in a year.
There are 10**9 nanoseconds per second.
The universe is about 10**10 years old.
So, we may look ahead a finite depth, but the entire search space is too big. We look several levels deep, so we don't just think 1 step ahead, but we cannot go to the bottom of the tree.

As an example, consider a simple tree with branching factor of 2 and depth of 2. There is 1 root, 2 intermediate nodes, and 4 leaves. The leaves are 2, 7 and 1, 8.

By convention, we consider 1 player to be the maximizing player. He wants the biggest possible score. The opponent is the minimizing player; he wants the minimum possible score.

One simple implementation: Min/Max Algorithm
---------------------------------------------
First, decide who moves next at each depth in the tree. You alternate moves: first max moves, then min moves, then max moves, and so on. Each player takes turns, so each depth of the tree is associated with 1 player.

You can decide the min or max move recursively.
At a node with only leaves:
    Pick the child node that is either min or max, depending on whether you are the min or max player. This assigns a min/max value to this node.
At a node whose children all have a min/max value:
    Pick the child node that is either min or max, depending on whether you are the min or max player. This assigns a min/max value to this node.
Start at the leaves, and work your way up the tree.

This example assumes that we can identify the leaves. You can start at the root and generate all possible permutations. This is an exponential space, so it is computationally expensive. How can we make it more efficient?
In other words, how can we prune the search space to reduce the amount of searching we have to do? Start with a few observations:
1. If the parent node is a minimizer node, and it has a child node of value k, then the parent node will be <= k. The min player will be able to hold it to at most k.
2. If the parent node is a maximizer node, and it has a child node of value k, then the parent node will be >= k. The max player will be able to guarantee at least k.
So, if a maximizer looks at 2 child nodes, the next depth layer is min. If one child node is <= 2 and the other child node is <= 1, then do not evaluate the rest of the child node that is <= 1. If you do, then the min player will go to <= 1.

So, you evaluate the tree and propagate constraints up. The constraints will let you ignore partially evaluated subtrees. This is alpha-beta, which is just an optimization on top of the min/max algorithm.

You have to completely evaluate one child to know whether to partially or completely evaluate another child of the same parent. However, this is true at each depth you reach. You start at the leaves and recurse up the tree. In practice this means you completely evaluate the left side of the tree, and then start pruning more and more right children as you move up. In the simple case, you compare the known min/max of a left child with a constraint of the right child. This compares two children of the same parent node. Once you have a score for the entire left half of the tree, you can start pruning nodes by comparing a min/max constraint with the left half-tree. In other words, you can decide to prune some deep subtree because you know that the top-most child decision would go another way. This is deep cutoff. This can save some work, but it's not huge.

Ideally, the number of static evaluations (given depth d and branching factor b) will reduce from b**d to 2*b**(d/2). In practice, it comes close to this ideal case.
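Minimax with alpha-beta pruning can be sketched on the small example tree above (branching factor 2, depth 2, leaves 2, 7, 1, 8). In this run the leaf 8 is never statically evaluated:

```python
# Minimax with alpha-beta pruning. A tree is nested lists; a number is a
# leaf whose value comes from static evaluation. alpha is the best score
# the max player can already guarantee, beta the best for the min player.

def alphabeta(node, maximizing, alpha=float("-inf"), beta=float("inf")):
    if not isinstance(node, list):           # a leaf: static evaluation
        return node
    if maximizing:
        best = float("-inf")
        for child in node:
            best = max(best, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, best)
            if alpha >= beta:                # min player would never allow this branch
                break
        return best
    else:
        best = float("inf")
        for child in node:
            best = min(best, alphabeta(child, True, alpha, beta))
            beta = min(beta, best)
            if alpha >= beta:                # max player would never allow this branch
                break
        return best

tree = [[2, 7], [1, 8]]                      # the root is the maximizing player
print(alphabeta(tree, True))                 # -> 2
```

The left subtree evaluates fully to min(2, 7) = 2; in the right subtree, seeing the leaf 1 proves the min player can hold it to <= 1 < 2, so the 8 is cut off, exactly as argued above.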
This can double the number of levels you can generate. So far, we assumed the branching factor is always constant. In practice, the branching factor can vary with the game. So, the program should track what branching factor it has, and adjust its depth.

In practice, you compute the optimal move for level d-1 before starting to search level d. This is an insurance policy: if you don't finish level d, then you use the best move from level d-1. How much does this insurance policy cost? For level 0 it's a fixed cost, for level 1 it's b, for level 2 it's b**2, and so on, where b is the branching factor. So, to go d-1 levels down, this is 1 + b + b**2 + b**3 + ... + b**(d-1).

This simplifies to:
Cost = 1 + b + b**2 + b**3 + ... + b**(d-1)
b*Cost = b + b**2 + b**3 + ... + b**d
Subtract Cost from b*Cost:
b*Cost - Cost = (b + b**2 + ... + b**d) - (1 + b + b**2 + ... + b**(d-1))
Cost*(b - 1) = b**d - 1
Cost = (b**d - 1) / (b - 1)
This is about b**d / b = b**(d-1).

So for a branching factor of 10, this would be just 10% added cost. It gives you an insurance policy so that if you ever run out of time you still have a nearly-as-good answer. It also means you can keep searching deeper and deeper until you run out of time for the next move. So, it adapts to the actual branching of the current game state and uses all available time to make the best possible answer. This strategy is called progressive deepening.

So, you start with minimax and optimize it with alpha-beta and progressive deepening. In practice, you can do a deeper subsearch on some subtrees if there is a particularly good possible state. Note, now the algorithm also goes top-down.
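A quick numeric check of the insurance-policy cost derived above:

```python
# The total work for levels 0..d-1 is (b**d - 1)/(b - 1), roughly b**(d-1):
# about 1/b of the cost of searching level d itself.

def cost(b, d):
    return sum(b**k for k in range(d))       # 1 + b + b**2 + ... + b**(d-1)

b, d = 10, 6
assert cost(b, d) == (b**d - 1) // (b - 1)   # the closed form matches the sum
print(cost(b, d) / b**d)                     # -> 0.111111 (about 10% added cost)
```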
So at each level d, it computes the optimal move for that level and also computes all possible game state permutations needed to compute level d+1.

Lecture 7: Constraints: Interpreting Line Drawings
---------------------------------------------------
One example is the 4-color map problem. Another example is interpreting line drawings to decide how many objects are in the image.

One approach - Guzman's Solution
--------------------------------
Look at the lines and find their intersections. Find arrows (->) and forks (-<). These tend to be 3 different faces of the same object. An arrow is looking at the far corner of a block from the middle. A fork is looking at the far corner of a block from an angle above that corner.

Number the faces. Represent the entire image as a graph. Each node is a face, and an edge exists between 2 nodes if they share one of these special junctions.

If we say that at least one edge between faces means they are the same object, then we are too permissive, and everything becomes part of one giant object. If we say that at least two edges between faces means they are the same object, then we are too conservative, and some faces are not part of any object. Instead, we say two edges between faces means they are the same object, and then we combine their nodes and iterate. Combining the nodes will put more links between the remaining nodes, and usually/eventually all faces become part of an object.

This works because "the world is full of three-face vertexes", and this correlates to an arrow or fork, which "usually" correlates to 2 shared links in the graph of faces.

A second Approach - Huffman's Solution
--------------------------------------
This started with several assumptions:
1 - The world is presented in "general position", so no accidental alignments: a small change in viewpoint does not change the junction types.
2 - The world is trihedral, so all vertices are the intersection of 3 planes.
3 - There are 4 kinds of line labels: concave, convex, and boundary (which counts twice, once for each arrow direction). These differ based on which side of the line the object is on.
    Concave (-). This is a corner on the inside of a cube. The sides around the line project toward the viewer.
    Convex (+). This is a corner on the outside of a cube. The sides around the line project away from the viewer.
    Boundary (arrow), depending on which side you see the object. So a boundary separates object from background. The arrow points in the direction such that the object's plane is on the right when going in the direction of the arrow. Boundary counts as 2 labels, depending on the direction of the arrow.
This ignores shadows.

Some vocabulary: the real world has vertexes, and the model portrays these with junctions. The real world has edges, and the model portrays these with lines.

There are only 18 ways to label a junction using these assumptions and labels. There are 18 possible combinations of lines: 6 L's, 5 forks, 4 T's, 3 arrows.

If there are 3 faces, then there are 8 octants. I guess this is like the 8 octants of an xyz 3-dimensional graph. If we fill all 8 octants or 0 octants, then there are no vertices, so we ignore those 2 cases.

If just one of the 8 octants is filled, then you can view it from one of the remaining 7 octants.
    From a diagonally opposite octant, you see a solid in one of the corners of the octant space. You see 3 convex lines meeting at a point at the corner.
    From an adjacent octant, you see only 2 boundary lines; the head of one arrow connects to the tail of the other.
    From above and to the side, you see an arrow with 2 boundaries (the arrowhead) and a convex spine that is the shaft.
If 7 of the octants are filled, then you are viewing it from the one open octant, and you see the inside corner of a box.
This is a fork junction with 3 concave lines meeting at a point.

If 5 of the octants are filled, then you are viewing it from one of the 3 open octants. The solid is 4 filled octants (taking up half the octant space) plus a top block. The 3 open octants are like a tunnel that makes a 90-degree turn around the top block.
    At the apex of the turn, you see the point of the solid, so this is an arrow junction with 2 boundaries (arrow heads pointing toward each other) and a convex line.
    At either entry of the turn, there is a concave line along the floor and a boundary, which is the edge that the tunnel turns around. There are actually 2 ways to describe this: the arrow points toward the concave or the arrow points away from the concave. This is just which side of the tunnel's turn you are on.

If 3 of the octants are filled, then you are viewing it from one of the 5 open octants. The solid is a 90-degree bend. You can also have a 3-way junction that is a groove, where you are looking along the bottom of the groove: there are 2 edges on the 2 sides and a concave spine.

And, there are more combinations. A T requires that one object obscures another object. The cross pieces of the T are all boundary lines, with the obscuring object on the right. There are 4 combinations. Note there are 8 permutations of a binary-direction arrow with a T of 3 arms (2**3), but half of these must be redundant.

If 2 or 4 or 6 of the octants are filled, then you can have more than 3 faces visible. For now, we ignore the cases where more than 3 faces are visible.

To label an image, you can start labelling the outside borders with arrows, so this is an object in space against a background. Now, you can look at each junction where 2 lines are labelled and then look in the catalog to see the possible labellings for the third line. There often is only one possible labelling. Once you have labelled all lines, make sure that every junction is a legal junction.
If one junction is not legal, then the image cannot be constructed from physical objects. If all junctions are legal, the object may or may not be possible to create - legality is necessary but not sufficient. It may be possible to build a shape with more than 3 faces - but that is not allowed in this model.

David Waltz's Solution
-------------------------------------------------
This tries to add cracks and shadows and light and junctions of 4 or more faces. This now has 50+ labels, and there are thousands of possible junctions. This used a depth-first-search program to explore the possible space. But that is combinatorially explosive. So, don't try to DFS the space, because the space is too big.

So, he had an alternate model - constraint propagation. For illustration, consider Huffman's labels, but in reality Waltz had a bigger vocabulary.
1. For each junction - list the possible junctions that could be that junction.
2. For each junction - look at the neighbor junctions. Discard any possible labelling that is not allowed by the list of possible labellings in a neighbor junction. Two adjacent junctions share a line, and the line must have the same label in both junctions.

This will propagate constraints and prune the space of possible labellings. So, constraints do a lot of pruning. Now, in these examples, a constraint is binary: it excludes a possible labelling in an adjacent node, so a possible labelling is either possible or impossible. Constraints don't give an analog "likelihood" score.

Lecture 8: Constraints: Search Domains
---------------------------------------------------
As an example, consider map coloring. No adjacent states can have the same color. A depth-first search would be far too expensive (10**18 years or so at 30 assignments/sec). Instead, we can use constraint propagation.

Initially, assign a list of possible colors to each state. But, at first, there are no constraints. Any state can be any color. We need something to get the process started.
If we rotate colors, then a state that borders 4 or more colors cannot get a color in a 4-color map problem. So, we need to be able to back up, and allow 2 non-adjacent bordering states to have the same color.

Vocabulary:
- Variable V can have an assignment
- Value X can be assigned to a variable
- Domain D is a set of values
- Constraint C is a limit on variable values. In map coloring it is a limit on pairs of variable values.
So, states are variables, colors are values, and domains are the remaining color possibilities. The constraint is that no 2 bordering states can have the same value.

Initialize the map so every state has a list of all possible colors. We do a depth-first search, assigning a value to each state as we traverse the map. Essentially, when assigning a color to a state, make sure that this choice will not create a problem for an adjacent state.

Domain Reduction Algorithm
--------------------------
For each DFS assignment of a color to a state:
    For each variable Vi "considered" (this means examining variables of adjacent states):
        For each value Xi in the domain (list of possible colors) of Vi:
            For each constraint C(Xi, Xj):
                If there is no value Xj in the adjacent state such that the constraint is satisfied, then discard Xi from the domain of Vi
        If the domain is empty, then back up to the next iteration of the parent loop

The intuition is: when assigning a color to one state, check to see if that will create problems for any adjacent state. One open question is what it means to "consider", or how many validity checks we do when assigning a color. There are several possibilities:
- Check nothing. This leads to invalid solutions.
- Check only adjacent states and make sure the 2 states are different colors. This cannot solve the problem - you lose possible solutions by picking a value too soon.
- Check that every state can still be colored. This is overkill - you solve the entire problem for each step of solving the problem.
- Check just the neighboring states. This is the Domain Reduction Algorithm. So, when you pick a color for a state, make sure that every bordering state still has at least one option. This works, but it still goes down a bunch of dead ends. Still, it is practical.
- Check just the neighboring states - but if we have to change the domain of a neighbor state, then make sure this does not conflict with any of its neighboring states. This propagates through variables whose domains have been reduced at all.
- Check just the neighboring states - but if we have to change the domain of a neighbor state down to a single value, then make sure this does not conflict with any of its neighboring states. This propagates through variables whose domains have been reduced to a single color.

There are some optimizations to this general algorithm. You can order the states you visit in the DFS to search the most constrained states (those with the most neighbors) first. This finds a solution faster - it prunes the search space earliest in the search. So, do the least constrained last.

Map coloring is also the same as many resource planning problems.
- Scheduling planes: which plane travels which route. You have a graph, where cities are nodes and flights are edges. We can name the edges f1, f2, f3, f4, f5. You can try to fly this schedule with different aircraft. Now, each node has a list of possible aircraft.
  Constraints: no aircraft can be on 2 routes at the same time, and each aircraft has to be on the ground for a minimum time.
  Domains: the list of aircraft that can fly from this city.
  Variables: the cities.

Lecture 9: Constraints: Visual Object Recognition
--------------------------------------------------
Recognizes objects - faces are a good example.

David Marr's theories on Vision
-------------------------------
Vision is a series of steps:
1. Start with camera input and first find edges. This creates the "Primal Sketch".
2.
Decorate the primal sketch with vectors that are normal to all faces
   This creates the 2-1/2 D sketch
3. Convert the 2-1/2 D sketch into generalized cylinders
   A regular cylinder is a circle moving along an axis
   A generalized cylinder is a circle moving along an axis, but the circle can change diameter at different points
4. Match the generalized cylinder against a library of pre-known descriptions

But, nobody can make this work.

Alignment Theory of Recognition
-------------------------------
You can reconstruct any view of an object from 3 different 2-D images.
This requires a translucent image, so no vertices are obscured.
This also assumes that you are far enough away that there are no perspective problems - you have an orthographic projection.

Each object in a library has these 3 views. Now, you can map a new unknown object to a list of library objects.

For each library object, label the vertices p1, p2, p3, ...
Really, there is p1A for view A, p1B for view B, and so on.
There is also a p1, p2, p3, .... for the unknown object.

To match an unknown object, we create a series of equations
    p1 = alpha*p1A + beta*p1B + gamma*p1C + tau
    p2 = alpha*p2A + beta*p2B + gamma*p2C + tau
    .....
Note, alpha, beta, gamma, and tau are the same constants for all equations. This transforms the vertices from the unknown object into the space of the library of images. All of these are 2-D drawings, a 2-D camera view of the object.

There are 4 unknowns, so we can take 4 sets of corresponding points and then solve the system of 4 equations with 4 unknowns. This solves for alpha, beta, gamma, tau. To decide whether an unknown matches the model, every other set of corresponding points should also work with the same values for alpha, beta, gamma, tau.

As a simpler example, let's consider an object that is in the XY plane and rotated around a vertical line in the Z axis. This is a simple example, and can generalize to rotations in the X, Y, and Z axes.
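The 4-equations-in-4-unknowns solve above can be sketched directly. The view coordinates and the "unknown" view below are made-up numbers chosen so the system is solvable; the solver is plain Gaussian elimination.

```python
# Sketch of recovering alpha, beta, gamma, tau from 4 corresponding points.
def gauss_solve(A, b):
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

# x-coordinates of 4 corresponding points in library views A, B, C (made up):
xA, xB, xC = [1, 2, 3, 0], [0, 1, 2, 1], [2, 0, 1, 3]
xU = [7.5, 3.0, 4.5, 8.0]          # the same points seen in the unknown view

# One equation per point: xU[i] = alpha*xA[i] + beta*xB[i] + gamma*xC[i] + tau
A = [[xA[i], xB[i], xC[i], 1.0] for i in range(4)]
alpha, beta, gamma, tau = gauss_solve(A, xU)
```

A fifth corresponding point would then have to satisfy the same alpha, beta, gamma, tau for the unknown to count as a match.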
Suppose there is a point s in the model object when it is in some standard position. This point has an x,y coordinate, which we represent as a vector from the origin 0,0 to x,y. Call these coordinates xs and ys.

We take 2 views of this model, view A and view B. The A and B views were made by simply rotating the object from its standard position around the Z axis. So there is a point in each rotation that corresponds to point s in the standard position. Each of these is represented by a different vector: xa, ya and xb, yb.

So, the vector from 0,0 to the corresponding point in rotation A is thetaSA degrees off the vector from 0,0 to the point in standard position. Likewise, the vector from 0,0 to the corresponding point in rotation B is thetaSB degrees off the vector from 0,0 to the point in standard position.

Let's say there is an unknown object that is yet another rotation around the Z axis from standard position. There is a point in this rotation that corresponds to point s in the standard position. This is represented by a different vector: xu, yu. The vector from 0,0 to the corresponding point in rotation U is thetaSU degrees off the vector from 0,0 to the point in standard position.

So, how do we express xa in terms of xs and ys and thetaSA?
The vector from 0,0 to s has 2 component vectors, x and y.
To rotate s by thetaSA:
    Rotate the x vector by thetaSA
    Rotate the y vector by thetaSA
    Add up the contributions
xa = xs*cos(thetaSA) - ys*sin(thetaSA)
And similarly:
xb = xs*cos(thetaSB) - ys*sin(thetaSB)
xu = xs*cos(thetaSU) - ys*sin(thetaSU)

Now, ALL points in the object go through the same rotation when the object is rotated once. So, every set of corresponding points like p1a, p1b, p1u or p2a, p2b, p2u will all be at the same angle offsets to each other. We can treat sin(thetaSA), cos(thetaSA), sin(thetaSB), cos(thetaSB) as constants. We know xa and xb; these are from our models in the library and we can empirically measure them.
Now, we can use the system of 2 equations and 2 unknowns and derive xs and ys:
    xa = xs*cos(thetaSA) - ys*sin(thetaSA)
    xb = xs*cos(thetaSB) - ys*sin(thetaSB)
The unknowns are xs and ys; everything else can be measured. So, we go from the 2 observed rotations to derive the standard orientation.

Now, we can derive an alpha and beta such that xu = alpha*xa + beta*xb, for some complicated alpha and beta which are a function of all the angles (details not explained).

If you add translation (moving the object along an axis), then you must include a tau term in the original expression:
    p1 = alpha*p1A + beta*p1B + gamma*p1C + tau
So, now the standard position is derived by a system of 3 unknowns (xs, ys, tau), and so you need 3 equations. To get a third equation, you need a third reference orientation, C. For 3 degrees of rotation and translation you need 4 equations and 4 unknowns. (BTW, this is why we use 4-dimensional matrices in graphics.)

This scheme works great for simple rotation and translation. But, in real life, organic things like human faces may stretch and skew, not just rotate and translate.

Correlation Theory of Recognition
---------------------------------
Don't try to match entire faces, or even individual features. Look for intermediate-sized features, like a nose and mouth, or 2 eyes in a horizontal line.

Consider a simpler model: a function y=f(x). This could be a sine wave or anything. Now, consider a stored library image; it may be the same shape of signal but at an offset, so it starts at x=10. This is y=g(x). The question is whether we can decide those 2 signals are the same. Formally, compute integral(f(x) * g(x - offset))dx. So, try this for every offset and use the one with the biggest result. That will be the one with the closest offset. (BTW, this is intuitively what a Fourier transform does, except it uses as a pattern e**(jw), which is cos(w) + j*sin(w).)
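The try-every-offset idea above can be sketched for discrete 1-D signals. The signals here are toy data: g is the same bump as f, shifted right by 3.

```python
# Sketch of picking the best offset by correlation: slide g across f and keep
# the offset whose sum of products f(x) * g(x - offset) is largest.
f = [0, 1, 3, 1, 0, 0, 0, 0]
g = [0, 0, 0, 0, 1, 3, 1, 0]          # f's bump, shifted right by 3

def correlation(f, g, offset):
    # sum of f[x] * g[x - offset] over positions where both are defined
    return sum(f[x] * g[x - offset]
               for x in range(len(f)) if 0 <= x - offset < len(g))

best = max(range(-len(g) + 1, len(f)), key=lambda k: correlation(f, g, k))
```

With this sign convention the winning offset is -3: g(x - (-3)) = g(x + 3) slides g's bump back onto f's.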
In 2 dimensions, we can use integral(f(x,y) * g(x - XOffset, y - YOffset))

Lecture 10 - Introduction to Learning
==========================================
There are several approaches to machine learning
I. Regularity-Based Learning (AKA "brute force learning" or "bulldozer" learning)
   - Nearest Neighbor (based on pattern recognition)
   - Neural nets (derived from ideas in biology)
   - Boosting (based on a theoretical approach)
II. Constraint-Based Learning (more human-like)
   - One-shot learning
   - Explanation-based learning

This lecture will go into Nearest Neighbor learning. Take an image and derive a vector of feature values. Then compare the vector of feature values to a library of known feature vectors. Pick the closest match.

An example is matching electrical cover plates. Measurements include
- Total area of the plate
- Total area of all holes in the plate
There is error in the recognition camera and variability in the subjects. So don't expect an exact match, but pick the closest match. In practice, each point in the library has a min and max legal value for each entry in the vector.

You can plot these on an N-dimensional graph. The boundary between any two valid points is the midline that is a perpendicular bisector of the line between the two points. This line is called a "decision boundary". You can do this for every pair of points, and create a graph of lines that divides up the space.

This approach lets you identify things if you have a pre-existing library. If you have only some of the features in the vector, then you can still try to find the closest match for those values. This uses the principle that if 2 things are similar in some respects, then they are similar in other respects.

For example, for detecting blood cells, we might use values like:
- Total area of cell
- Area of nucleus

Alternatively, you may use the occurrence of keywords.
Identify the subject of a news article by:
- Create a vector of 1/0 binary values that shows whether each keyword is present in the article
- Compare the vector with a library of known vectors
For example, try to distinguish articles from "Wired" and "Town and Country".

Given an unknown, in the N-dimensional space, how do you decide which library item is closest? The unknown object is some vector value, so it is a point in N-dimensional space. Each known object in the library is also some vector value, so each is also a point in N-dimensional space. There are several approaches:
1. You can use Euclidean distance from the unknown point to the "representative" point for each category in the library. This can lead to wrong answers.
2. You can draw a vector from the origin to the "representative" point for each category in the library, and decide which vector the new unknown point is closest to. You can also draw a line from the origin to the unknown point. Pick the vector to a reference point with the smallest angle to the vector to the unknown point. Specifically, measure cos(theta), where theta is the angle between each pair of vectors:
    cos(theta) = ( Sum-Of(unknownVector[i] * referencePointVector[i]) ) / ( magnitude(unknownVector) * magnitude(referencePointVector) )
The numerator is just the dot product. There are more complicated metrics, but this metric works pretty well and is simple and fast.

As an alternate problem: Consider a robot arm with 2 joints, each moving in a plane, and we can describe the angles of these joints as theta1 and theta2. Now, try to plan motion so it moves the tip of the arm along a line parallel to the X axis. Solving this with Newtonian mechanics is possible, but the math turns out to be incredibly complex, including Coriolis forces because of the constrained degrees of motion.
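The angle metric above can be sketched for the keyword example. The magazine names and keyword vectors are made up for illustration.

```python
# Sketch of nearest-neighbor by angle: compare an unknown keyword vector
# against reference vectors using cos(theta) = dot(u, v) / (|u| * |v|).
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

library = {
    "Wired":            [1, 1, 0, 0],   # keywords: gadget, software, polo, estate
    "Town and Country": [0, 0, 1, 1],
}
unknown = [1, 1, 1, 0]                  # mentions gadget, software, and polo

best = max(library, key=lambda name: cosine(unknown, library[name]))
```

The unknown mentions 2 of Wired's keywords and 1 of Town and Country's, so its vector makes a smaller angle with the Wired reference vector.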
Instead, we can solve this with a table, with columns for theta1 and theta2, the velocity of theta1 and theta2, the acceleration of theta1 and theta2, the torque on the motor for joint 1, and the torque on the motor for joint 2.

Now, let the arm wave around randomly, and record the torques. This makes a table of random associations. Now, we will divide the trajectory into small sections, each a part of the path.
- For each small piece of the path, determine the angles, velocities, and accelerations of theta1 and theta2
- Look up in the table, and use the distance metric to find the closest row with the same angles, velocities, and accelerations of theta1 and theta2
- The closest matching row also has the torques for motor 1 and motor 2, which control the joints for theta1 and theta2

The training phase is trying random torques and measuring the resulting positions. This builds up the table. Once the table is big enough, there are enough data points that it can find the best torques. In practice, there are some amazing robot demos of very dexterous actions that are learned.

Note, each row has some observed output values and some input values. You pick the row based on the closest match of the output values, and then use the input values from the same selected row to control the machine. This can be applied to any system of control, including drugs and HTN.

You have 10**10 neurons in the brain outside the cerebellum, and 10**11 in the cerebellum. Each cerebellum neuron has approx 10**5 synapses, so there are 10**16 synapses in the cerebellum. It's possible that the cerebellum is really just a massive lookup table.

There are some problems with this:
1. What if the space of samples does not cover the entire problem space? Then you cannot learn enough to do the task.
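The table-lookup control scheme above can be sketched as follows. The state values, table size, and distance metric here are all made up.

```python
# Sketch of control by table lookup: record random (state, torque) pairs
# during a "wave around" phase, then for each desired state find the nearest
# recorded state and reuse its torques.
import random

random.seed(0)
# Each row: state = (theta1, theta2, vel1, vel2, acc1, acc2), torques = (t1, t2)
table = []
for _ in range(1000):
    state = tuple(random.uniform(-1, 1) for _ in range(6))
    torques = (random.uniform(-5, 5), random.uniform(-5, 5))
    table.append((state, torques))

def lookup(desired_state):
    # Euclidean distance in state space; the closest row supplies the torques.
    def dist(row):
        return sum((a - b) ** 2 for a, b in zip(row[0], desired_state))
    return min(table, key=dist)[1]

t1, t2 = lookup((0.1, -0.2, 0.0, 0.0, 0.3, 0.0))
```

With random training the returned torques are meaningless; the point is only the mechanism - the real version records torques that actually produced each state.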
For any dimension x:
    Variance(x) = 1/n * SUM((x[i] - Mean(x))**2)
If the dimensions have very different scales, then replace x with x', where x' = x / StdDev(x). The variance of the new x' is 1.
2. What if one of the variables does not matter? Then its contribution to the distance metric gives bad answers.

Lecture 11: Learning: Identification Trees, Disorder
===========================================================
As a (toy) example, consider a program to identify vampires. We have a list of properties (makes a shadow, fears garlic, ....). We start with a series of known true cases. Some of the reference points may have slightly different values, but they are all examples of the same class. This is different than nearest neighbor because:
- All of the reference points are slightly different
- Characteristics are non-numeric
- Some of the characteristics may not matter
- Some of the characteristics may matter only for some cases
- Some characteristics may be more expensive to check than others

Each characteristic is a test, not a value. So, we start by building a tree of tests. Each test may have any number of outcomes. (Note, one of these outcomes may be "I don't know".) This is called an identification tree. This is different than a decision tree, but is similar.

A good tree would have:
- No false negatives, no false positives
- A small simple tree (Occam's razor - the simplest explanation is often the best explanation)
- Paths through the tree with the minimum cost of tests

Each test has a bunch of inputs, and will map each input test case to one of the result values. Each result value has a group of reference cases that have that result. So, a test will divide a group of input cases into several subsets, each corresponding to one result value.

For each test
    Apply this test to each known case
    For each outcome, record the number of cases with that outcome that are a true positive, a true negative, and an unknown
This tells you how each test divides up the known cases.
The best test to use is one that produces homogeneous groups for each test result. In other words, all cases that have a particular output from a test are either all true-positive or else all true-negative reference points.

So, a test's results may unambiguously identify some reference cases, ie all reference cases with a particular test result are either positive or negative. But, other result values from the SAME test may be ambiguous, ie both positive and negative reference cases will have that test result value.

Measure how many reference cases each test can unambiguously categorize:
For each test
    Look at each output value, and only consider output values with either all true-positive or else all true-negative reference points
    Count the number of reference points that ended up in any such output for this test
The sum of counts (all reference cases that ended up in any homogeneous set) is the score for this test. It tells us how many reference samples seem to be in an unambiguous result. We assume that if a test output is unambiguous (all reference cases with that value for that test are either positive or negative) then this is true - ie, there is not some reference case we have not yet seen that would make some output value ambiguous.

The first test in the tree is the root node. This should be the test that has the largest score, so it will unambiguously identify the most reference points. So, all of the cases that have an output which is unambiguous are solved. Now, the test cases that have an output that corresponds to both positive and negative cases must be further subdivided.

So, now take the remaining reference cases that are ambiguous. This REPEATS the above process for measuring how unambiguous each test is, but we ONLY consider the reference cases that were ambiguous under the first test we used.
Continue repeating this until all cases are classified.

Now, in a large data set, usually no test will have a result that unambiguously identifies all positives or all negatives. This is ok; we will just always have to perform several tests to get one definite positive or negative. However, what is the best sequence of tests? (ie, the fastest route to identification)

One way is to measure the amount of "disorder" or homogeneity in the cases that have each corresponding test result. From Information Theory:
    Disorder of a set = -((P/T) * log-base-2(P/T)) - ((N/T) * log-base-2(N/T))
    P = number of positives
    N = number of negatives
    T = total (P + N)
The ratios are all < 1, so the logs are always negative, so -(-) is always positive.

Graph this with (P/T) on the x-axis and Disorder on the y-axis. The graph says:
    If P/T = 1/2 then disorder = 1
    If P/T = 1 then disorder = 0
    If P/T = 0 then disorder = 0 (L'Hopital's rule says you can use the limit)
The curve looks like an upside-down U that starts at 0, reaches a max of 1, and ends at 0.

Now, we can measure the quality of a test by measuring the disorder of the cases for each test result value and then adding up the disorders for all sets corresponding to each output value.
Additionally, we weight each disorder by the fraction of cases with that result value:
    Test-Quality = SUM-OVER-OUTPUT-VALUES( Disorder(set of cases with this result value) * (#cases with this result value / #cases handled by test) )
Here, we are measuring disorder, so a lower score is better. So, take the test with the lowest weighted disorder score; that is the best test.

Now, we can do as before. Start with the best test, then for each result group perform the next best test. Now, you may use the same test twice, but use different thresholds for separating the input cases into outputs.

We can use this same approach - a tree of tests - for the cases we discussed in lecture 10 for decision boundaries in Nearest Neighbors. But, with a tree of tests, the decision boundaries are discontinuous lines. This approach is better than simple nearest neighbors: it does not need to use all tests, only enough to make a decision. This is used widely, but in practice in medicine we convert the tree to a set of rules. Each rule is just a path down the tree to a leaf node.

Lecture 12 - Learning: Neural Nets, Back Propagation
======================================================
This mimics biology, and is the result of a lot of effort over the years. Let's start with a simple overview of the biology of neurons. A neuron has a cell body, an axon (the axon may be bifurcated, ie branch), and dendrites (which heavily branch). Dendrites form the dendritic tree. Dendrites take input from other axons. The axon feeds outputs to other dendrites. Each junction is a synapse, which has a lot of control mechanisms.

Not all synapses are the same. Some have more influence on the overall neuron excitation than others. Synapses may be weighted: "synaptic weights". The inputs from all dendritic synapses add/subtract to determine whether to excite the axon. This is a binary decision, though, so the axon is either excited or not excited.
So, there is an analog blend of weighted inputs, and then it either meets a threshold or not.

Model of a Neuron
- List of inputs: I1, I2, I3, .... All inputs are 0 or 1
- Each input is multiplied by a weight: w1, w2, w3, .... So, we have weighted inputs: I1w1, I2w2, I3w3, ..... This converts each binary input into a scalar value. Graphically, these are weighting boxes
- Combine the inputs. We sum all of the inputs: I1w1 + I2w2 + I3w3 + ..... (Simply summing the inputs is a big design decision.) Graphically, this is a summation box
- Measure the sum of weighted inputs against a threshold:
    If (I1w1 + I2w2 + I3w3 + .....) >= Threshold, then the axon is 1
    If (I1w1 + I2w2 + I3w3 + .....) < Threshold, then the axon is 0
  Graphically, this is a threshold box

Now, we can combine neurons to make a neural net. A group of neurons may take a list of inputs I1, I2, I3, ...., map them to different neurons, weight them differently, sum them, and pass them through one or more thresholds. This will produce one or more outputs: z1, z2, z3, ....

So, there are several inputs and several outputs. We can model these as vectors:
    [z1, z2, z3, ...] = F(I1, I2, I3, ....)
where F depends on the weights and thresholds.

We can train this, so we want the net to exactly match the training inputs to the training outputs:
    [Tz1, Tz2, Tz3, ...] = G(TI1, TI2, TI3, ....)
So, we want the regular function F to work like G for the training inputs. We want a set of weights that makes the general output match the training output.

We need a way of measuring how well the function is performing. This is a "performance function", P, which is some difference between the actual output and the desired output. Note this is only for training data, because we know the desired output, and we can run it through the net to see the actual output for a specific set of weights. In practice, make a vector that is the difference between the observed and desired outputs:
    P = [(Tz1 - z1), (Tz2 - z2), (Tz3 - z3), ....]
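The neuron model above (weighted binary inputs, a sum, and a threshold) can be sketched as:

```python
# Sketch of the single-neuron model: the "axon" fires iff the weighted sum
# of binary inputs reaches the threshold.
def neuron(inputs, weights, threshold):
    total = sum(i * w for i, w in zip(inputs, weights))
    return 1 if total >= threshold else 0

# Example: with weights (1, 1) and threshold 2 this unit behaves like AND.
out = neuron([1, 1], [1, 1], 2)
```

Choosing weights (1, 1) and threshold 1 instead would make the same unit behave like OR.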
Now, scale this to 1/2 * (P**2). Squaring makes it always a positive number, and halving makes it mathematically convenient. Graphically, 1/2 * (P**2) is a U-shaped graph with its trough at (0,0). Now, again for convenience, make P = -1/2 * (P**2). So, now this is an upside-down U, with its max at (0,0).

We can just do a hill-climbing search to explore the space of weights, and then pick the optimal weights that will maximize the performance function.

Intuitively: We can graph the performance function in a space of weights. We can also highlight the iso-performance lines: combinations of weights that give the same performance. This is like isolines on a topographical map, or iso-utility curves in economics. Consider just 2 weights, w1 and w2, so this is a 2-D Cartesian coordinate system. There may be some point that is optimal performance. This corresponds to the max of -1/2 * (P**2), or equivalently to the min of 1/2 * (P**2).

Now, start with a random weighting, w1, w2; we can see where this set of weightings lands on the graph that maps weightings to performance. Now, we use hill-climbing to pick a set of weights that maximizes performance. We can do this for every set of input and output training data. Of course, if there are N weights, then this may become computationally expensive.

In theory, we can also adjust thresholds. This may make the exploration more complicated, but there is a trick.

Trick #1: Add an extra input, I0, and a weight w0. I0 is always -1, and w0 is the threshold, so w0*I0 is -threshold. Now, the axon fires if the summation is >= 0; forget the threshold. So, this casts the threshold as just another weight, and we are back to just optimizing the weights.

Trick #2: In practice, for faster searching of a large space of many weights, ascend on a vector perpendicular to each contour line. deltaW is the change to the weight vector [deltaW1, deltaW2, deltaW3, ...]:
    For any i (1, 2, 3, ...): deltaW[i] = k * deltaP/deltaW[i]
In other words, follow the gradient of P with respect to the weights.
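The gradient-following step above can be sketched numerically for 2 weights. The tiny model (z = w1*x1 + w2*x2) and all constants are made up; the partial derivatives are estimated by finite differences.

```python
# Sketch of gradient ascent on two weights: nudge each weight along the
# numerically estimated partial derivative of the performance function.
def performance(w1, w2, x1=1.0, x2=2.0, desired=3.0):
    z = w1 * x1 + w2 * x2              # toy "net": a plain weighted sum
    return -0.5 * (desired - z) ** 2   # P = -1/2 * (d - z)**2, max is 0

w1, w2, k, eps = 0.0, 0.0, 0.1, 1e-6
for _ in range(200):
    dw1 = (performance(w1 + eps, w2) - performance(w1 - eps, w2)) / (2 * eps)
    dw2 = (performance(w1, w2 + eps) - performance(w1, w2 - eps)) / (2 * eps)
    w1, w2 = w1 + k * dw1, w2 + k * dw2
```

After the loop, performance is essentially at its maximum of 0: the weighted sum has been pulled to the desired output.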
This works, except it requires a continuous function.

Trick #3: To make this a continuous function, we need the output to be not a pure binary step function, but rather a very steep sigmoid. So, make the output beta = 1 / (1 + e**(-alpha)), where alpha is the input.

Now, let's consider the world's simplest neural net:
    One input x; p1 = w1*x
    A simple thresholding function, a sigmoid: y = Sigmoid(p1)
    Now, pass through a second weighting function: p2 = w2*y
    Pass through a second sigmoid function: z = Sigmoid(p2)
    Calculate the performance from z: P = -1/2 * (d - z)**2, where d is the desired output
This is just a single line of nodes, so the state is no longer a vector, and instead is just a scalar.

x ---(Mult w1)--p1-->(Sigmoid)--y-->(Mult w2)--p2-->(Sigmoid)--z-->(Performance)--> P = -1/2 * (d - z)**2

We want to optimize this, so we will need to travel along the max derivative. So, we have to calculate partial derivatives of P with respect to the different weights. Specifically, we need partial derivatives of P with respect to w1 and w2.

Start with w2; it's easier. We use the chain rule several times to go back each stage in the net.
    deltaP/deltaW2 = deltaP/deltaZ * deltaZ/deltaW2
    but deltaP/deltaZ = (d - z)
    deltaP/deltaW2 = (d - z) * deltaZ/deltaW2
We can use the chain rule again to solve deltaZ/deltaW2:
    deltaZ/deltaW2 = deltaZ/deltaP2 * deltaP2/deltaW2
    But, deltaP2/deltaW2 is just y
    deltaZ/deltaW2 = deltaZ/deltaP2 * y
We can recombine these to get the final answer:
    deltaP/deltaW2 = (d - z) * deltaZ/deltaP2 * y

Now, we go back to the original question and compute deltaP/deltaW1. Again, use the chain rule:
    deltaP/deltaW1 = deltaP/deltaZ * deltaZ/deltaW1
deltaP/deltaZ was already solved above for the w2 case.
    deltaP/deltaW1 = (d - z) * deltaZ/deltaW1
Again, use the chain rule:
    deltaZ/deltaW1 = deltaZ/deltaP2 * deltaP2/deltaW1
deltaP2/deltaW1 is again the chain rule:
    deltaP2/deltaW1 = deltaP2/deltaY * deltaY/deltaW1
Now, deltaP2/deltaY is just w2.
To solve deltaY/deltaW1, we again use the chain rule:
    deltaY/deltaW1 = deltaY/deltaP1 * deltaP1/deltaW1
    deltaP1/deltaW1 is just x
    so deltaY/deltaW1 = deltaY/deltaP1 * x
We can recombine these to get the final answer:
    deltaP/deltaW1 = (d - z) * deltaZ/deltaP2 * w2 * deltaY/deltaP1 * x

OK, but now we have to solve these partial derivatives for the sigmoid functions: deltaZ/deltaP2 and deltaY/deltaP1. Recall, we defined the sigmoid as beta = 1 / (1 + e**(-alpha)), where alpha is the input. So, we want deltaBeta/deltaAlpha:
    beta = (1 + e**-alpha)**-1
    deltaBeta/deltaAlpha = -(1 + e**-alpha)**-2 * delta(1 + e**-alpha)/deltaAlpha
    deltaBeta/deltaAlpha = -(1 + e**-alpha)**-2 * (-e**-alpha)
    deltaBeta/deltaAlpha = e**-alpha / (1 + e**-alpha)**2
    deltaBeta/deltaAlpha = (1 + e**-alpha - 1) / (1 + e**-alpha)**2
    deltaBeta/deltaAlpha = 1/(1 + e**-alpha) * [(1 + e**-alpha)/(1 + e**-alpha) - 1/(1 + e**-alpha)]
    But, 1/(1 + e**-alpha) = beta
    deltaBeta/deltaAlpha = beta * (1 - beta)

We chose the sigmoid function carefully, so it is continuous (which allows us to take derivatives), acts *almost* like the digital threshold function we used originally, and its derivative with respect to the input is just a function of the output.
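The chain-rule results above can be checked numerically on the world's simplest net. The input, target, and starting weights below are arbitrary.

```python
# Sketch of the two-weight net, with the analytic back-propagation gradients
# verified against a central-difference numerical derivative.
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, w1, w2):
    y = sigmoid(w1 * x)
    z = sigmoid(w2 * y)
    return y, z

x, d, w1, w2 = 1.5, 0.8, 0.4, 0.7
y, z = forward(x, w1, w2)

# Analytic gradients from the chain rule in the notes:
dP_dw2 = (d - z) * z * (1 - z) * y
dP_dw1 = (d - z) * z * (1 - z) * w2 * y * (1 - y) * x

# Numerical check:
def P(w1, w2):
    _, z = forward(x, w1, w2)
    return -0.5 * (d - z) ** 2

eps = 1e-6
numeric_w1 = (P(w1 + eps, w2) - P(w1 - eps, w2)) / (2 * eps)
numeric_w2 = (P(w1, w2 + eps) - P(w1, w2 - eps)) / (2 * eps)
```

The numerical derivatives agree with the chain-rule formulas to within the finite-difference error.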
So, now deltaZ/deltaP2 = z*(1 - z) and deltaY/deltaP1 = y*(1 - y). We can recombine these to get the final answers:
    deltaP/deltaW2 = (d - z) * (z*(1 - z)) * y
    deltaP/deltaW1 = (d - z) * (z*(1 - z)) * w2 * (y*(1 - y)) * x
So, the deltaP/deltaW2 partial derivative only depends on z and y, and the deltaP/deltaW1 partial derivative only depends on w2, x, y, and terms already computed for w2. This is the back propagation algorithm. So, if there are 100 stages, then the Nth stage's derivatives depend only on the (N-1)st. Or, for a fixed number of neurons in a column, the total amount of calculation is linear in the number of neurons.

These are very simple neural nets. There are no feedback loops, so an output never becomes an input. This is just scaling inputs to fit desired outputs. It is just fitting a curve: it maps inputs to some desired curve output.

There are some problems with neural nets:
- How do we encode all the inputs into numeric values?
- Sometimes, the curve gets tightly bent to fit the samples. This is called "overfitting" and it leads to brittle responsiveness.
- The rate constants can lead to positive feedback loops which will wildly oscillate rather than converge on the final answer.

Mega-Recitation 4: Neural Nets
======================================================
This lecture presents specific strategies for solving neural nets, including back propagation.

Each weight in the net may affect the final output, so when we want to explicitly describe this, and compute performance with respect to some weight embedded deep inside the net, we have to do a lot of recursion. Consider a neural net as a series of stages, 1, 2, 3, ... through N, where stage N produces the final output. Now, stage N-1 will send its output to stage N. Eventually this reaches the final output of the net. But, this means that any effect that stage N-1 has on the final output is first filtered through stage N (and any other stages directly connected to the outputs of stage N-1). This is inherently a recursive problem.
The effect of stage N-1 on performance can be described in terms of the effect of stage N on performance. So, rather than describe the effect of each stage on net performance in isolation, we start at the end: first describe the effect of the last stage on net performance in absolute terms, then describe the effect of the next-to-last stage in terms of the effect of the last stage. So, we describe the effect of stage N-1 in terms of the effect of stage N. This means, technically, we need to recursively compute the network performance in terms of the output of each stage. We call this delta(J), where J is the stage number. delta(J) is not the true network performance, but it is the performance (desired - actual) value for that specific stage in the network.

First, a few equations:
-----------------------
Sigmoid
    y = 1 / (1 + e**-x)
Derivative of Sigmoid
    dSigmoid/dx = y*(1 - y)
Performance Function
    P = -1/2 * (desired - actual)**2
    This is halved and squared to make the derivative simpler.
    dPerformance/dx = (desiredOutput - actualOutput), where x is the final output
W'
    This is the new weight, after a previous step computed a value and then did back-propagation. Each step will compute an observed output, then use backpropagation to revise the weights for the next step. Each W is the input for a multiplier.
    W' = W + (alpha * Input * delta(J))
    where alpha is a constant; it's the size of the steps you take in a hill-climbing search. A small alpha is small steps, and a big alpha is big steps.
    Input is the input coming into the multiplier. Note you use the value of the input from iteration N to compute the next weight used for iteration N+1.
    delta(J) is a partial derivative of the performance of the net with respect to this weight. It tells you how this weight affects the performance of the net.
For the last node in the network, this is just:
    delta(J) = dPerformance/dw = dPerformance/dSigmoidOut * dSigmoidOut/dw
We already know that:
    dPerformance/dSigmoidOut = (desiredOutput - actualOutput)
So,
    delta(J) = (desiredOutput - actualOutput) * dSigmoidOut/dw
But, dSigmoidOut/dw = actualOutput*(1 - actualOutput). So, for the last weight in the pipeline:
    delta(J) = (desiredOutput - actualOutput) * actualOutput*(1 - actualOutput)
This is the base case.

Now, for any other node in the middle of the network, we can compute delta(J) recursively.
    dPerformance/dw = dPerformance/dSigmoidOut * dSigmoidOut/dw
But, now dSigmoidOut/dw is solved recursively. This is why we needed the base case, which was the last node. For a middle node, the recursive part is SUM(weight(i) * delta(i)), where i ranges over every node that takes an output from the node we are evaluating. This is the recursion: we already computed the delta for the output nodes, and now we use that to compute the delta for this node, one step back.

** Key Idea ** This scheme requires that we compute the delta(i) for the output nodes first, and then use those (together with the weights between the nodes) to compute the changes in weights one step back.

Now, let's REPLACE the sigmoid function with a different function. If we change the sigmoid function to something else, then we replace actualOutput*(1 - actualOutput) with the derivative of the new function. As an example, consider a new function that is just a step function that compares the input to a threshold. It adds a negative weight to the input: Out = In - weight, and then outputs either 0 or 1 depending on whether the result is positive or negative. In this example, there are no sigmoids in the net.

In this case, our delta for the final node SIMPLIFIES to:
    delta(J) = desiredOut - actualOut
And our delta for a middle node K SIMPLIFIES to:
    delta(K) = SUM(w(i) * delta(i)), for all i that node K sends outputs to
Note, these are the immediate outputs only.
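The W' update rule and the delta recursion can be sketched on the world's simplest net from lecture 12. The input, target, starting weights, and alpha are arbitrary.

```python
# Sketch of repeated back-propagation updates on a 2-weight chain:
# x -> (*w1) -> sigmoid -> y -> (*w2) -> sigmoid -> z
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

x, desired, w1, w2, alpha = 1.0, 0.9, 0.3, 0.5, 1.0

for _ in range(5000):
    y = sigmoid(w1 * x)
    z = sigmoid(w2 * y)
    delta_out = (desired - z) * z * (1 - z)        # base case: last node
    delta_hidden = w2 * delta_out * y * (1 - y)    # recursion: one step back
    w2 = w2 + alpha * y * delta_out                # W' = W + alpha*Input*delta
    w1 = w1 + alpha * x * delta_hidden

y = sigmoid(w1 * x)
z = sigmoid(w2 * y)
```

After training, z is close to the desired output of 0.9. Note the hidden delta uses w2 and delta_out from the same iteration, exactly the SUM(w(i) * delta(i)) pattern (here the sum has one term).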
Note, this multiplies the weight at the current node with the delta of an output node. Now, in this case we will compute new weights for *both* scaling nodes (out = w * input) and threshold nodes (out = -(weight) + input).

Now, different types of nets can do different kinds of identification. This really comes down to the number and configuration of non-trivial nodes, ie the sigmoids, not the adders or multipliers. So, the real "logic" in the neural net is the placement of the threshold/sigmoid functions. More specifically, the logic is defined by the number of sigmoids that take an original input to the net, ie the first stage of the net. Other sigmoids in the net that take inputs from earlier sigmoids really do the equivalent of logical AND or OR, but do not put new constraints on the original inputs. So, looking at the sigmoids at the first stage of the net will tell us about the behavior of the whole net.

For now, assume there are just 2 inputs, x and y, and we can plot these on a 2-D graph. We can then plot which values of x,y get recognized (ie, have a positive net output) by the net. Each sigmoid in a net will usually add a single "line" to the bounding region of inputs that are recognized. So it can identify a linear relationship between the inputs x and y, and identify all input combinations above or below that line. Adding more sigmoid nodes will add more lines that define the boundary of "identified" x,y values, so the net can recognize a more complicated region of inputs.

Each sigmoid in the first stage recognizes values in a half-plane, bounded by a line:
- If the sigmoid takes both x and y, then the recognized half-plane can be bounded by a diagonal
- If the sigmoid takes only x or only y, then the recognized half-plane is bounded by a vertical or horizontal line
Note, you can take both x and y and still bound a half-plane by a horizontal or vertical line, for example by assigning a weight of 0 to one of the inputs.
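The geometric picture above can be sketched with step units instead of sigmoids. The weights and thresholds below are made up to bound a small region with one vertical, one horizontal, and one diagonal line.

```python
# Sketch of the half-plane view: each first-stage threshold unit recognizes
# one half-plane of (x, y) inputs, and a second stage ANDs them together.
def unit(x, y, wx, wy, threshold):
    return 1 if wx * x + wy * y >= threshold else 0

def net(x, y):
    h1 = unit(x, y, 1, 0, 0)       # vertical boundary:   x >= 0
    h2 = unit(x, y, 0, 1, 0)       # horizontal boundary: y >= 0
    h3 = unit(x, y, -1, -1, -2)    # diagonal boundary:   x + y <= 2
    # Second stage: logical AND (intersection) of the three half-planes.
    return 1 if h1 + h2 + h3 >= 3 else 0

inside = net(0.5, 0.5)
outside = net(3.0, 3.0)
```

Swapping the second-stage threshold from 3 to 1 would turn the AND into an OR, recognizing the union of the half-planes instead.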
The second stage will combine the half-planes, with either a UNION (logical OR) or an INTERSECTION (logical AND).
The third stage will combine the regions of the second stage, again with either a UNION (logical OR) or an INTERSECTION (logical AND). And so on.
- A net with no sigmoids is really just a big weighted sum. Its output is just x and y scaled.
- A net with one sigmoid detects inputs on one side of a single line. If the net takes both x and y as inputs, then the line can be at any angle. If the net takes only x OR y as input, then the line is either vertical (parallel to the y axis) or horizontal (parallel to the x axis).
- A net with 3 sigmoids detects inputs in a region bounded by 3 lines.
Note, these bounding lines may be horizontal, vertical, or diagonal. You can look at each bounding line:
If it is a horizontal/vertical line, then it is formed by a sigmoid that takes only one input (just x or just y).
If it is a diagonal line, then it is formed by a sigmoid that takes both inputs (x and y).

Lecture 12a - Neural Nets Part A (Fall 2015)
==================================================================
In 2010, this course felt that neural nets were less interesting - the topic was almost removed from the course. By 2012, however, Geoff Hinton published a paper on image recognition using neural networks that categorized pictures. This was powerful, and made neural networks popular again.
This is inspired by biology. First a quick review of a neuron. There is a cell body, an axon, and dendrites. Axons connect to the dendrites of other neurons at synapses. If there is enough stimulation in the dendritic tree, then the axon depolarizes, which will then feed into another neuron. Note, this is digital: the axon either depolarizes or it does not, depending on whether the input to the dendrites exceeds some threshold.
In neural nets in the literature, there are several components:
- Inputs like x1,x2,..xn which are either 0 or 1.
This may be the output of a pre-synaptic neuron.
- Multipliers, which multiply their input by a weight w1, w2, ...wn
- Adders, which add up several weighted inputs
- Threshold boxes - the output is 1 iff the input is above some threshold. Each has a threshold t1, t2, ...tn
The output is z, which is 0 or 1.

This model has some important properties we see in real neurons:
- All-or-none discrete behavior
- Cumulative influence of inputs
- Synaptic weight (each input is weighted differently)
It does not have some possibly important properties of real neurons:
- Refractory period
- Timing - the sequence and timing of arrival of inputs to the dendrites
- Axonal bifurcation - depolarization may go down either branch
So real neurons are in part a time-based system.

A neural net is a collection of weights and thresholds. It takes inputs x1, x2, ...xn and outputs z1, z2, .... zn
So, Z is a vector which is a function of the input vector, weight vector, and threshold vector.
Training a neural net is just modifying the weights and thresholds so we get the desired output for known inputs. So, a neural network is a function approximator - it can train to emulate any function. We use some sample data with known outputs.
We measure the current performance of the net by comparing actual and desired values:
Performance is P = f(DesiredVector, ActualVector)
A simple model is just subtracting: P = AbsoluteValue(desired - actual). But, this makes the math hard, so we pick a function that makes the math simpler:
P = -1/2 * (AbsoluteValue(desired - actual))**2
It's always negative and the best value is 0. So, we adjust weights and thresholds to maximize performance, i.e., make P as close to 0 as possible.
You can think of this as searching in the space of all possible weights and thresholds. We could do simple hill-climbing. But that is computationally intractable - a recent successful neural net had 60 million parameters.
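The component list above (multipliers, an adder, a threshold box) and the negative-squared performance measure might be sketched as follows; the input values, weights, and threshold are illustrative:

```python
def neuron(xs, ws, t):
    # multipliers and adder: weighted sum of the inputs
    total = sum(x * w for x, w in zip(xs, ws))
    # threshold box: output 1 iff the summed input exceeds t
    return 1 if total > t else 0

def performance(desired, actual):
    # negative squared difference: always <= 0, best value is 0
    return -(abs(desired - actual)) ** 2

z = neuron([1, 0, 1], [0.5, 0.9, 0.4], t=0.8)   # 0.5 + 0.4 = 0.9 > 0.8, so fires
p = performance(1, z)
```

Training would search over `ws` and `t` to push `p` toward 0; the point of the lectures that follow is that gradient information makes that search tractable where blind hill-climbing is not.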
A better way is to start taking partial derivatives. Consider a simple net with 2 weights, w1 and w2:
dPerformance/dw1
dPerformance/dw2
Change in weight vector = (dPerformance/dw1 * i) + (dPerformance/dw2 * j)
This moves us in the direction of steepest ascent toward a better solution. Really, we will move at a rate constant, Rate, which encapsulates the size of the step in our search.
Change in weight vector = Rate * ((dPerformance/dw1 * i) + (dPerformance/dw2 * j))
Now, this approach requires that the functions be continuous, not some digital/discontinuous threshold. So, we use a steep sigmoid for the threshold, which is still continuous but close to a step function.
First, we add an additional input, which is always -1, and is immediately multiplied by a weight w0. This is added to the summation. Now, the threshold is simply centered around 0.
Second, we replace the discontinuous step function with a sigmoid. We could use many sigmoid-like functions, but for mathematical simplicity, we use
F(x) = 1 / (1 + e**(-x))

Let's work with a simple net, just 2 neurons:
x -- (Multiplier * w1, output P1) --> (Sigmoid, output y) --> (Multiplier * w2, output P2) --> (Sigmoid, output z)
So, z is the final output.
Performance = -1/2 * (D - z)**2
where z is the final output variable and D is the desired output.
So, now let's compute the partial derivatives.

dP/dw2
-------
We need to use the chain rule.
dP/dw2 = dP/dz * dz/dw2
Really, we can also use the chain rule to simplify dz/dw2 and replace it with dz/dP2 * dP2/dw2
So: dP/dw2 = dP/dz * dz/dP2 * dP2/dw2
Note, this is simply backtracking through the neural net. Each partial derivative is dOutput/dInput for some stage of the net. Multiplying these together relates the output to the original inputs.

dP/dw1
-------
We can also use the chain rule here:
dP/dw1 = dP/dz * dz/dP2 * dP2/dy * dy/dP1 * dP1/dw1
Now, this is just a product of partial derivatives.
We can re-order them to actually look at the neural network from back to front:
dP/dw2 = dP2/dw2 * dz/dP2 * dP/dz
dP/dw1 = dP1/dw1 * dy/dP1 * dP2/dy * dz/dP2 * dP/dz
Some of these are easy to do:
Performance = -1/2 * (D - z)**2, so dP/dz = D - z
P2 = y*w2, so dP2/dw2 = y
Similarly, dP1/dw1 = x
Also, dP2/dy = w2 (again since P2 = y*w2)
dz/dP2 is a bit tricky. This is the sigmoid. In general
b = 1 / (1 + e**(-a))
db/da = d(1 / (1 + e**(-a)))/da
      = d((1 + e**(-a))**-1)/da
      = (-1 * (1 + e**(-a))**-2) * d(e**(-a))/da
but d(e**(-a))/da = -1 * e**(-a)
so db/da = (-1 * (1 + e**(-a))**-2) * (-1 * e**(-a))
         = e**(-a) / (1 + e**(-a))**2
We can rewrite this as: (e**(-a) / (1 + e**(-a))) * (1/(1 + e**(-a)))
We can rewrite this as: ((1 + e**(-a) - 1) / (1 + e**(-a))) * (1/(1 + e**(-a)))
We can rewrite this as: (1 - 1/(1 + e**(-a))) * (1/(1 + e**(-a)))
But remember that 1/(1 + e**(-a)) is b
We can rewrite this as: (1 - b) * b
So, the derivative of the sigmoid output with respect to the input is expressed purely in terms of the output.
So, dz/dP2 has output z, so this is z(1 - z)
Similarly, dy/dP1 has output y, so this is y(1 - y)
Substituting these, we get
dP/dw2 = y * z(1 - z) * (D - z)
dP/dw1 = x * y(1 - y) * w2 * z(1 - z) * (D - z)

A small rate constant will lead to slow training. A large rate constant may step over the best answer. Really, you want to change the rate constant with performance:
If performance is decreasing, then the rate constant is too big.
If performance is increasing, then you are headed in the right direction, so try increasing the rate constant.

Now, suppose we have a slightly more complex neural net, 4 neurons. It is just 2 copies of the 2-neuron path from the example above.
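The two gradient formulas for the 2-neuron net can be checked against a finite-difference approximation. This sketch uses the negative form of the performance measure, P = -(1/2)(D - z)**2 (so that dP/dz = D - z), and takes dP2/dy = w2; the values of x, w1, w2, and D are arbitrary:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, w1, w2):
    y = sigmoid(x * w1)   # first neuron
    z = sigmoid(y * w2)   # second neuron
    return y, z

def performance(x, w1, w2, D):
    _, z = forward(x, w1, w2)
    return -0.5 * (D - z) ** 2

x, w1, w2, D = 1.0, 0.3, -0.8, 1.0
y, z = forward(x, w1, w2)

# analytic gradients from the chain rule
dP_dw2 = y * z * (1 - z) * (D - z)
dP_dw1 = x * y * (1 - y) * w2 * z * (1 - z) * (D - z)

# central finite-difference check
eps = 1e-6
num_dw2 = (performance(x, w1, w2 + eps, D) - performance(x, w1, w2 - eps, D)) / (2 * eps)
num_dw1 = (performance(x, w1 + eps, w2, D) - performance(x, w1 - eps, w2, D)) / (2 * eps)
```

The analytic and numeric values agree to many decimal places, which also confirms the dP2/dy = w2 step.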
x -- (Multiplier * w1, output P1) --> (Sigmoid, output y) --> (Multiplier * w2, output P2) --> (Sigmoid, output z1)
and a second, parallel path of the same shape with its own weights and output z2.
Now, the performance function has 2 actual outputs and 2 desired outputs.
If the 2 streams interact (so intermediate outputs of one path are inputs to steps in the other path), this can get tricky, because there are many different paths through the network. But a lot of these partial derivatives are shared between paths, so a lot of the work to compute one path will be reused for other paths.
The computation needed is linear with the depth of the network. Without this reuse, the number of distinct paths (and so the amount of work) would grow exponentially.
By the way, a Fast Fourier Transform also reuses partial results.

Lecture 12b - Deep Neural Nets Part B (Fall 2015)
==================================================================
This discusses applying neural nets to larger problems, so many many more parameters in the net and many many more layers in the net.
Really, "deep nets" is a bit of a misnomer. They are usually very wide, and often not more than 10 layers deep.
In a net with several horizontal paths, there may be cross-connections between paths. However, if all paths lead through some intermediate node, then the performance will vary with the output of that intermediate node, no matter how many inputs in turn affect that node. These "bottlenecks" in the net help reduce the complexity somewhat, because once you express how performance varies with the intermediate node, you don't have to explicitly track the performance of every path back to the beginning. Instead, you separately express how the intermediate node is affected by its inputs.
In a sense, divide the neural net into vertical slices, and solve each slice independently, using just the slice that immediately precedes it.
Or, solve each node just in terms of its immediate inputs, not every path back to the original inputs of the entire net.

The dot-product is x1*w1 + x2*w2 + x3*w3 + ...
This is multiplying a 1xn row vector by an nx1 column vector. It is what a neural net does when n inputs are separately weighted and then pass through a summation node. If neural nets model what happens in our brain, then the dot-product seems to be something intrinsic to biological computing.

Image recognition with neural nets uses the idea of Convolution, which is NOT really the same as the convolution integral in Fourier analysis. Convolution in Fourier analysis has a concept of memory when it analyzes a signal.
This starts with a low-res image, say 256 x 256. Then, go over the image, looking at only a 10 x 10 pixel square at a time - move the square to eventually scan the entire image. A neuron will produce an output for each 10 x 10 square. Then shift the square to the right by 1 pixel, so there is overlap between 2 consecutive squares.
The convolution will eventually produce a 2-dimensional grid of "points", each of which is the value of applying a function to one 10 x 10 square. I am not sure how the coordinates map. Maybe the top-left x,y of each 10x10 pixel square is the point in the new matrix?

Next we do "Max Pooling". In the resulting 2D grid, we break it up into squares of points. I am not sure of the square size. For each square, find the maximum value in the square. Then shift the square to the right by 1 point and repeat the process, so there is overlap between 2 consecutive squares.
The maximum-selection will eventually produce a 2-dimensional grid of "points" where each point is the max value of one square of points. I am not sure how the coordinates map. Maybe the top-left x,y of each square in the old matrix is the point in the new matrix?
This is called "Pooling", and basically we end up with a 2D matrix of points, which represent the maximum function value over a corresponding region in the original image.
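The scan-and-shift procedure just described might be sketched as follows. The details are illustrative: windows here are 3x3 and 2x2 instead of 10x10, the per-square function is a plain weighted sum, and both scans use stride 1 (overlapping windows), with each output point indexed by the window's top-left corner:

```python
def convolve(image, kernel):
    # slide a k x k window over the image; each output point is the
    # weighted sum of one window
    n, k = len(image), len(kernel)
    out_size = n - k + 1
    out = [[0.0] * out_size for _ in range(out_size)]
    for i in range(out_size):
        for j in range(out_size):
            out[i][j] = sum(image[i + a][j + b] * kernel[a][b]
                            for a in range(k) for b in range(k))
    return out

def max_pool(grid, size):
    # slide a size x size window over the grid; each output point is
    # the maximum value inside one window
    out_size = len(grid) - size + 1
    return [[max(grid[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(out_size)] for i in range(out_size)]

# toy 6x6 image with ones down the diagonal, all-ones 3x3 kernel
image = [[1.0 if i == j else 0.0 for j in range(6)] for i in range(6)]
kernel = [[1.0] * 3 for _ in range(3)]
conv = convolve(image, kernel)   # 4 x 4 grid of window sums
pool = max_pool(conv, 2)         # 3 x 3 grid of window maxima
```

Each stage shrinks the grid a little while keeping one value per overlapping window, which matches the description above of ending up with a 2D matrix summarizing regions of the original image.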
One step of convolution and pooling is called a kernel. You may apply up to 100 kernels to one image. I am not sure how they differ. Perhaps each is a different neuron that tries to match the image to a different image in the library?
Old-fashioned neural nets would feed each pixel as an input to the net. The new approach requires a lot more computation. It will have a library of N different "types" of thing, and M examples of each type.
There are several specific ideas layered on top of this basic architecture.

Auto-Coding
----------------
A vector of inputs (x1, x2, ..., xn) goes into an initial layer of neurons.
A second layer of neurons, called a hidden layer, has fewer neurons (say, 10x fewer) and takes its input from the output of the first layer.
A third layer of neurons then takes its input from the output of the second layer.
This generates an output vector (z1, ..., zn). We can now compare this to the desired values (d1, ...., dn).
Initially, we can let the desired values be the original input values (x1, ...., xn).
Why is this useful? Because the second layer, the "hidden layer", is smaller, so it is a more concise representation (a compressed version) of the input. This is interpreted as a "generalization" of the input.
Once we train the neural net, we have a middle layer that is some kind of encoding of a generalization of the properties we are looking for. This is a non-obvious encoding - we do not know how the data is encoded, and it does not seem to correlate to obvious properties in the input data. But, practically, it seems to work.
You can do this layer-by-layer, and use the first layer to train the second layer, and so on. A related name here is "Boltzmann machines", and so on. The hidden layer often has 1 neuron for each class.
Consider a much simpler example: recognize animals by how tall they are. There are 3 animals - a cheetah, a zebra, and a giraffe - and we use their shadow as the input to a neural net.
Sigmoid Modelling
--------------------
We can also modify the weights and thresholds to change the shape of the sigmoid to optimally fit the positive and negative examples. We want both positive and negative examples to land on the flat parts of the sigmoid. This tends to make a steeper curve that changes from 0 to 1 after the last negative example and before the first positive example.

Soft Max
--------------------
We have a neuron that identifies each class. Really, it generates a probability that the object we are examining is in the class. So, for a picture, each neuron will assign a probability that the object is in one class. Really, the neurons implement functions: F1(x) returns a probability that x is in class 1, F2(x) matches it to class 2, and so on.
Probability of class j is Fj(x) / SUM(Fi(x)) for all i.
Now, we can combine Soft-Max and Auto-Coding. Start with a net with an input layer, a hidden internal layer, and a large output layer, and train it. Now, remove the output layer and replace it with a new output layer. The new output layer has the same number of neurons as the hidden internal layer, and each new output neuron takes input from every neuron in the hidden internal layer. Now, freeze the weights into the hidden internal layer and re-train just the new output layer. The new output layer trains quickly and becomes very accurate.

Dropout
-----------------------
A common problem in neural nets is getting stuck on a local maximum. A solution is that at every iteration, each neuron has some probability of being removed from the next step. It's not clear whether it will return for future iterations or not. It probably does. This seems to prevent getting stuck in a local maximum state.
There is redundancy here. Once a neural net is trained, you can remove neurons and very quickly retrain it. But, a wider net can avoid getting stuck on local maxima.
So, you can train a net with 10 neurons per layer, then after it has trained, remove all but 2 neurons per layer and quickly retrain it. But, you cannot start with a network of only 2 neurons per layer and train it, because it will get stuck on local maxima.

Lecture 13 - Learning: Genetic Algorithms
======================================================
This is another effort at mimicking biology - specifically mimicking evolution.
A single cell has 1 chromosome of each pair from each parent. Review of mitosis and meiosis, forming gametes. There are several steps for randomization:
- Whether the paternal or maternal chromosome goes into each gamete
- Crossing over
We will represent a chromosome as a simple string of values. It's binary values for now, but they could be any alphabet. There are several randomization steps:
- Mutations can flip 0->1 or 1->0
- Crossing over - splice the prefix and suffix of adjacent chromosomes
- Genotype->Phenotype expression, which depends on environment
Each individual may have a different fitness. We can measure the fitness with a simple numerical score. Fitness then determines a probability of survival. This implements selection, which determines which individuals provide genetic material for the next generation.
This is all essentially hill-climbing. Randomization just looks at all local directions, and fitness picks the steepest slope.
We will select one survivor for each generation.
Probability[i] is the probability that individual i survives to the next generation.
Fitness[i] is the fitness of individual i.
Here is a first guess: Probability[i] = Fitness[i]
The sum of all probabilities should add up to 1.0, because at each generation we will select one survivor. One way to ensure all probabilities add up to 1 is to scale them by the sum of all fitnesses.
Probability[i] = Fitness[i] / SUM-for-all-i(Fitness[i])

Here is an example:
------------------------
This is a genetic algorithm that looks for an optimal value in a 2D space.
Fitness(x,y) = (sin(x*omega))**2 * (sin(y*omega))**2 * e**((x+y)/sigma)
where omega and sigma are arbitrary constants. Really this function is chosen to make nice-looking iso-utility lines, like the contours of a topographic map.
Each chromosome consists of 2 numbers, x and y. The values may be fractions.
A mutation operation may randomly add to or multiply an x or y by some value.
A crossover operation may randomly swap the x's between 2 adjacent chromosomes, or the y's between 2 adjacent chromosomes.
Without crossover, this is essentially a hill-climbing search. It can get stuck at local maxima or minima.

We care less about the absolute fitness value, and more about the ordering of fitness. So we order the candidates by fitness:
The highest fitness has a fixed probability Pc of survival.
The second highest fitness has a probability (1 - Pc)*Pc of survival, where 1 - Pc is the chance that the highest did not survive.
The third highest fitness has a probability ((1 - Pc)(1 - Pc))*Pc of survival.
In general, the candidate at rank n-1 has probability (1 - Pc)**(n-2) * Pc.
The probability for the last candidate is (1 - Pc)**(n - 1).

The "step size" makes a big difference in whether these algorithms get stuck on a local maximum. A larger step size helps jump over local maxima. The "step size" is the amount of randomization in each step, not the number of iterations the software uses for each step.
Another improvement is to select the survivors based not only on fitness but also on diversity. So, you want both a fit and a diverse population. Compute diversity as a numerical distance between each new candidate and all candidates that have previously been selected. So, you keep a history of all previously selected individuals to compute this diversity.
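The rank-order survival probabilities from earlier in this section can be checked numerically; Pc = 0.6 and n = 4 below are arbitrary choices:

```python
def rank_probabilities(n, pc):
    # candidate at rank k (0-based) survives with probability (1-pc)**k * pc;
    # the last candidate takes whatever probability is left, (1-pc)**(n-1),
    # so the whole distribution sums to exactly 1
    probs = [(1 - pc) ** k * pc for k in range(n - 1)]
    probs.append((1 - pc) ** (n - 1))
    return probs

probs = rank_probabilities(4, 0.6)
```

For Pc = 0.6 and 4 candidates this gives roughly 0.6, 0.24, 0.096, and 0.064, and the values always sum to 1, which is why this rank-space scheme needs no separate normalization.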
We have a 2D space: the x axis is fitness, the y axis is diversity. We want the max of both fitness and diversity - the top-right corner of the space. We can then draw diagonals from that top-right corner, which are iso-survival lines.
On one hand, you get to a maximal value, but once you reach it you don't narrow down, because the diversity component forces you to remain spread out. Ideally, at some point you want to reduce the importance of diversity.
There are a few cases where this has been useful:
- Planning. Cross-over allows you to mix the front half of one plan with the back half of another plan.
- Or you can use a forward-chaining expert system and then randomize the rules slightly.
Example: Making creatures from blocks. Each chromosome has bits for which blocks are included and where the blocks connect.
Now, this all uses a very simplistic idea of evolution. For example, we really don't understand how genotype maps to phenotype, or things like speciation.
This is all essentially hill-climbing: randomization just looks at all local directions, and fitness picks the steepest slope. So, a lot of the value of these programs is in the selection of the fitness model and how to represent the problem. Genetic algorithms are not magic - basically hill-climbing in a space of possible solutions.

Lecture 14 - Learning: Sparse Spaces, Phonology
==================================================================
Winston does not feel Neural Nets are very promising - they can work, but usually not very well. This is his opinion, but it is from a guy with a lot of knowledge and experience. Neural Nets are popular in the field, and Winston said his lab is doing some work on them, but they may not live up to the promise. He feels simpler approaches, like Nearest Neighbors, are still valid for learning problems when no other solution seems to fit.
Making a noun plural can use simple rules, like adding an "s" to the end.
But what is simple from a spelling perspective can hide complex differences in the sounds. The "s" is pronounced differently for dogs (a z) and cats (an s).
Phonology theory from Morris Halle:
A speaker translates a word to phonemes to create sounds. A listener hears these sounds, converts them to electrical signals, then translates the signal to a vector of booleans for "distinctive features". It seems 14 different distinctive features can describe all phonemes. So, there are potentially 2**14, roughly 16,000, phonemes. Some of the features are: syllabic, voiced, continuant, strident. But most languages have fewer than 100 phonemes (English has approx 40).
Oddly, a listener will often "infer" acoustic sounds. In the McGurk effect, we watch a speaker's lips and infer the sounds. But, there are other examples, including the "Phoneme Restoration Effect".
There are 5 muscles that form words: Tongue, Two Lips, ?? Palate?

He is describing a machine (Yip and Sussman):
Vision system
List of words
List of features
Set of Constraints (plural constraints, etc)
Buffer of phonemes
Step 1: The vision input system will send inputs to the List of features. The vision input will also go to the list of words to select a word.
Step 2: A selected word will select features from the list of features. A selected word will push phonemes onto the phoneme buffer.
Step 3: The list of features will activate constraints. Constraints examine the phonemes in the buffer and can also push phonemes onto the phoneme buffer. For example, if the buffer has the phonemes for a noun, and the features say it is plural, then a constraint can push the appropriate plural phoneme onto the buffer.
This system can run either way: to generate speech from a word, and also to map heard sounds to words.
The constraints can be learned from positive and negative examples. In Yip and Sussman, they tried to understand which phoneme is used for the plural "s", either a ZZZZZZ or a SSSSSSS. They found examples of both types of words, then mapped each to phonemes, and decided which phonemes seem to matter and which do not.
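A toy version of the plural-sound decision might look like this. This is only an illustration, not Yip and Sussman's actual system: the feature vectors carry just 4 of the ~14 distinctive features, and the single constraint shown is the standard voicing rule (voiced final sound, as in dog, takes a z; unvoiced, as in cat, takes an s):

```python
# phonemes as boolean distinctive-feature vectors (a tiny, illustrative subset)
PHONEMES = {
    "g": {"syllabic": False, "voiced": True,  "continuant": False, "strident": False},
    "t": {"syllabic": False, "voiced": False, "continuant": False, "strident": False},
}

def plural_suffix(final_phoneme):
    # one learned constraint: the plural sound copies the voicing of the
    # noun's final phoneme
    return "z" if PHONEMES[final_phoneme]["voiced"] else "s"

suffix_dog = plural_suffix("g")   # dog ends in voiced g
suffix_cat = plural_suffix("t")   # cat ends in unvoiced t
```

The learning problem in the lecture is exactly to discover that, of all the features of all the phonemes in the word, only the voicing of the final one matters here.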
The space of phonemes is sparse. This actually makes it easier to distinguish between different phonemes, because any 2 phonemes differ in several ways.
In the Scientific Method, you start with a problem and end with an experiment. Most AI is done backwards - starting with a mechanism and trying to fit it to every problem. It's "mechanism envy" or "falling in love with the mechanism". For example, trying to apply neural nets to every problem.
Yip and Sussman started with a problem, and then built an experiment to decide how the mechanism works. They followed the scientific method:
They designed a good representation of the problem.
From this representation, constraints emerge.
From the constraints, they can design an experiment.
It all starts with a good representation of a problem:
1. It makes the right things explicit
2. It exposes constraints
3. Localness. Are related things in the world also related in the model?

Lecture 15 - Learning: Near Misses, Felicity Conditions
==================================================================
As an aside, this is the type of learning algorithm that we learned in 1984 in 6.034, before neural nets and other types of learning were added to the course (though those other techniques may have been understood by AI people at the time, they were not in this particular introductory course in 1984 when I took it).
Example: Recognizing a simple arch. This is 2 vertical blocks and a horizontal block across their tops. You start with an initial positive example and an initial negative example. Both examples are the 3 blocks.
Start by representing this in a simple symbolic form: a graph, where each node is a block and each edge is a relationship between 2 blocks. A simple edge is "supported by". So, for the positive example, the top cross piece has "supported by" relationships to the 2 posts.
An abstract representation is important, because it removes unimportant details, like the color of the arches.
Basically, we compare the positive and negative models and find relationships that are in one and not the other.
Relationships that are in the positive and not in the negative are required.
Relationships that are in the negative and not in the positive are vetoes.
The model evolves with pairs of positives and near-misses. Negatives must be "near-misses" to be most useful: they expose one relationship that is missing or violated.
Relationships that differ between two positive examples are either Not-Necessary or else an OR-disjunction of the 2 values. The model can say "does not matter" if we have seen positive examples with all possible values. Otherwise, it is an OR-disjunction.
Now, there is also an Is-A property of each node. In the arch example, the Is-A can be types of blocks: brick, wedge, dowel, etc. We can combine positive examples with an OR operation. Alternatively, the values may be arranged in a hierarchy - Brick and Wedge are both Blocks - so we can combine 2 positive examples by climbing to the more abstract class.
Part of the algorithm depends on representation: describing the right features and making those explicit. Another key part is the order in which the positive and negative examples are presented.
Start with a "seed", which is the first positive example.
For each positive example, apply the heuristics to broaden the model.
For each negative example, apply the heuristics to narrow the model.
We can represent the space of possible models as a tree of nodes, where each node is a possible model. Edges connect a parent to a child node if the child is a refinement of the parent after seeing a positive or negative example. We can search this space, looking for an optimal model, which is the goal node. The space can be large, so do a Beam search to constrain it.
The heuristics have been identified:
1. Require-link heuristic. The link must be present in positive cases.
2. Forbid-link heuristic. The link must not be present in a positive case.
3. Extend-set. Broaden a list of allowed values.
4. Drop-link. A link may have any value in positive cases.
5.
Climb-tree heuristic. A link may take any value in a more abstract class.
Note, these heuristics both generalize and specialize the model. Require-link and forbid-link are specialization steps. Extend-set, drop-link, and climb-tree are generalization steps.
Specialization uses near misses.
Generalization uses positive examples.
In general, there is a teacher (a source of examples in a specific order) and a student.
Felicity Conditions (felicity: the state of being happy):
The student's knowledge is a graph (nodes are states of knowledge, links are learning steps).
The teacher must understand the initial state of the student.
The teacher fixes mistakes in past lessons by refreshing.
The teacher fixes mistakes in future lessons by saying "wait".
The student must trust the teacher.
If a student has limited memory, then give a series of specific examples.
If a student has a large memory, then give batches of many examples.
As a general rule, better students "talk to themselves". They explain the concepts to themselves. They say more problem-domain-specific statements. They also say more "monitoring" statements, like "this is hard". These may also be meta-evaluation: monitoring and debugging the problem-solving process. Anecdotally, making you talk to yourself during the problem-solving process does seem to help you solve problems better.
Well-packaged ideas have 5 common qualities:
1. Symbol associated with the idea. A visual "handle". This learning idea uses the "arch".
2. Slogan associated with the idea. A verbal handle. This also forces you to succinctly describe the essence of the idea - an elevator pitch. This idea uses the names "arch learning" or "one-shot learning".
3. Surprise. The idea does something very different. Arch learning improves the model with each training example, not thousands.
4. Salient. Something that "sticks out". A significant idea that becomes the focal point. This is similar to the slogan.
5. Story. Present the idea as a story.
Your ideas are like children - you want them to have the best life possible. So, you package them for the most success.

Lecture 16 - Learning: Support Vector Machines
======================================================
All of the learning systems discussed so far are pattern recognition systems. In the abstract, this is identifying positive examples in a space of possible cases. As a simple example, consider a 2-D space with positive and negative examples at different points. All of the learning systems (nearest-neighbors, neural nets, etc) try to create a function that maps (x,y) into +/- in a correct way. In reality, each is trying to identify which parts of the space hold positive examples.
To me, the interesting thing about nearest neighbors is that it is a control system, not just a pattern recognition system. It predicts actions to correctly respond to a state, not just recognizing whether the state is an example of a specific class.
In these pattern recognition systems, the goal is to draw a line that separates the positive from the negative examples. If you have the function for this line, then you can do pattern recognition by deciding whether any state (x,y) is above or below the line. Or, in a more complex (realistic) scenario, there is a threshold which is the minimum requirement to be in the class, and you recognize patterns by deciding whether any instance (x,y,z,a,b,...) is above or below the threshold. This function is called the "decision boundary".
Now, this threshold function ideally runs evenly between the positive and negative examples, not just skirting one group. This leaves "room" for error and for new positive or negative examples.

|           +
|        +     +
|   -       +
|  -    -
---------------------------

Imagine a diagonal line separating the groups of + and - examples. Suppose you have a vector, w, that starts from the origin and is perpendicular to the dividing line between the positive and negative examples.
Now, take any vector x from the origin to a point. Project it onto the perpendicular vector w and decide whether it lands on the + or - side. Really, to classify any example x, compute "w dot x >= c" where c is some constant. If "w dot x" is >= c, then it's a positive example. Or, if we let b = -c, then the decision rule is just "(w dot x) + b >= 0".
So, what is w? x is just a vector to some (x,y).
Create a constraint, "(w dot x+) + b >= 1", where x+ is any positive example.
Create another constraint, "(w dot x-) + b <= -1", where x- is any negative example.
For convenience (a mathematical convenience), define a new variable yi such that yi = +1 for positive examples and -1 for negative examples. yi has a different value (+1 or -1) for each known positive and negative sample.
Now, multiply both sides of each constraint by yi:
For positive examples (yi = +1): "yi * ((w dot xi) + b) >= 1" is just "(w dot xi) + b >= 1".
For negative examples (yi = -1): multiplying "(w dot xi) + b <= -1" by -1 flips the inequality, giving "-((w dot xi) + b) >= 1", which is again "yi * ((w dot xi) + b) >= 1" - the same form as the positive case.
So, for every sample:
yi((w dot xi) + b) - 1 >= 0
Now, for the samples on the edges of the boundary between positive and negative examples, this expression is exactly 0. These are the positive examples closest to the - region and the negative examples closest to the + region. In the lecture, Winston calls these edges the "gutters", where the "street" is the strip of the sample space that is empty and separates the region of positive samples from the region of negative samples.
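The decision rule and the unified constraint can be checked on a hand-built example. The weight vector w, the offset b, and the sample points below are all invented for illustration (they are chosen so the boundary is the vertical line x = 0 and two samples sit exactly on the gutters):

```python
import math

w = (2.0, 0.0)   # hand-picked weight vector, perpendicular to the boundary
b = 0.0

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def classify(x):
    # decision rule: (w dot x) + b >= 0 means positive
    return 1 if dot(w, x) + b >= 0 else -1

# (point, yi) pairs; the first and third points lie on the gutters
samples = [((0.5, 1.0), +1), ((2.0, 0.0), +1), ((-0.5, 2.0), -1), ((-3.0, 1.0), -1)]

# the unified constraint: yi((w dot xi) + b) - 1 >= 0 for every sample,
# and exactly 0 for points on the gutters
margins = [y * (dot(w, x) + b) - 1 for x, y in samples]

# the street width works out to 2 / ||w||
width = 2 / math.hypot(*w)
```

For this w the gutters are the lines x = 0.5 and x = -0.5, so the width is 1, and every sample's margin is non-negative with the gutter points at exactly 0.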
yi((w dot xi) + b) - 1 = 0

Additionally, we want as wide as possible a separation between the region of positive samples and the region of negative samples. In Winston's words, we want to maximize "the width of the street". Really, this means maximizing the distance between the gutter on the positive side and the gutter on the negative side.

Let x+ be a positive example on the positive gutter and x- be a negative example on the negative gutter. Each of these is represented as a vector x+ = (x+,y+) and x- = (x-,y-) (note, I am using x+ as both the vector name and also the x-coord of the vector, sorry for the confusion. Same for x-).

To start, take the difference between vectors x+ and x-. This is at some unknown angle. But, then take the dot product of this and vector w, which is orthogonal to the "gutter", the edge of the dividing area between negative and positive examples. So:

width = (x+ - x-) dot (w / ||w||)

Where
w is the orthogonal vector
||w|| is the length of w
(w / ||w||) is a unit length normal vector

This is a scalar value; it's the width of the street.

Now, combine this with the previous constraint, that for all points that lie on the gutter (the edge of the dividing area):
yi((w dot xi) + b) - 1 = 0
If this is a positive example, then yi = +1:
(w dot x+) + b - 1 = 0
or
(w dot x+) = 1 - b
If this is a negative example, then yi = -1:
-((w dot x-) + b) - 1 = 0
or
(w dot x-) = -1 - b

Now, we can substitute these into the width expression. First, expand the original expression:
width = [(x+ dot w) - (x- dot w)] / ||w||
Now, substitute for the dot products:
width = [(1 - b) - (-1 - b)] / ||w||
width = 2 / ||w||

Now, we want to maximize the width, to maximize the distance between the negative and positive example boundaries. So, maximize 2 / ||w||, which means minimize ||w||. For mathematical convenience, this in turn means minimize (1/2 * ||w||**2).

We will use Lagrange multipliers. This lets us maximize/minimize an expression that is also subject to some constraints.
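The width formula can be checked numerically. Here is a small Python sketch with my own toy numbers (w, b, and the two support vectors are made up, chosen so the gutter constraints hold; none of this is from the lecture):

```python
import math

# Hypothetical 2-D example: w and b chosen so the gutter constraints hold.
w = (1.0, 0.0)
b = -1.0
x_pos = (2.0, 0.0)   # positive support vector, on the + gutter
x_neg = (0.0, 0.0)   # negative support vector, on the - gutter

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# Gutter constraints from the notes: yi * ((w . xi) + b) - 1 == 0
assert +1 * (dot(w, x_pos) + b) - 1 == 0
assert -1 * (dot(w, x_neg) + b) - 1 == 0

norm_w = math.sqrt(dot(w, w))
# Width two ways: projection of (x+ - x-) onto the unit normal, and 2/||w||
diff = tuple(p - n for p, n in zip(x_pos, x_neg))
width_projection = dot(diff, w) / norm_w
width_formula = 2 / norm_w
print(width_projection, width_formula)   # both 2.0
```

Both computations agree, as the derivation says they must.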
It allows us to find the min/max of the expression without thinking of the constraints separately. Define a new function
L = function - SUM(alphai * constraints)
where alphai is some weight constant:
L = (1/2 * ||w||**2) - SUM(alphai * (yi((w dot xi) + b) - 1))

Now, find the derivative and set it to 0. (Winston is skimming over some of the algebraic details here.) The derivative with respect to w is:
dL/dw = w - SUM(alphai*yi*xi)
BTW, differentiating with respect to a vector is almost like differentiating with respect to a scalar. The derivative of (1/2 * ||w||**2) is w (because ||w||**2 is a sum of squares). Also, the derivative d(alphai*yi*(w dot xi))/dw is just alphai*yi*xi. So the derivative of a dot product works like the derivative of a multiplicative product.

Since we set the derivative to 0:
0 = w - SUM(alphai*yi*xi)
or:
w = SUM(alphai*yi*xi)
Intuitively, w is a linear sum of the positive and negative samples. alphai may be 0 for some i, so this is not necessarily all samples. Also, remember that yi is +1 or -1.

Also, take the derivative with respect to b:
dL/db = -SUM(alphai * yi)
Since we set the derivative to 0:
0 = SUM(alphai * yi)

OK, this is a bit tricky. Next, we have an expression for w, which is
w = SUM(alphai*yi*xi)
Let's plug that back into the Lagrange function:
L = (1/2 * ||w||**2) - SUM[alphai * (yi((w dot xi) + b) - 1)]
Winston switches between ||w|| and the vector w here; this works because ||w||**2 is just w dot w. He also seems to interchange multiplication and dot-products, which works because the scalar factors (alphai, yi) can be pulled out of a dot product. But, now we can also expand the expression a bit by distributing the multiplication.
L = (1/2 * (w dot w)) - SUM[alphai*yi*(w dot xi)] - SUM[alphai*yi*b] + SUM[alphai]

Now substitute w = SUM(alphaj*yj*xj). Note the substituted copy uses j subscripts. This runs over the same samples as the i subscripts, but we use a different letter to tell the two sums apart when we start manipulating them.

L = (1/2 * SUMi SUMj [alphai*alphaj*yi*yj*(xi dot xj)]) - SUMi SUMj [alphai*alphaj*yi*yj*(xi dot xj)] - SUMi[alphai*yi*b] + SUMi[alphai]

Next, factor the constant b out of its sum, giving a term b*SUM(alphai*yi). But, recall that earlier we showed 0 = SUM(alphai * yi), so this whole term is b*0 and drops out:

L = (1/2 * SUMi SUMj [alphai*alphaj*yi*yj*(xi dot xj)]) - SUMi SUMj [alphai*alphaj*yi*yj*(xi dot xj)] + SUM(alphai)

The first two terms are the same double sum, one at half strength, so they combine:

L = SUM(alphai) - (1/2 * SUMi SUMj [alphai*alphaj*yi*yj*(xi dot xj)])

Why do we care? It tells us that the optimization only depends on the dot products of pairs of samples.

Let's change focus. Now, plug the expression for w (which is w = SUM(alphai*yi*xi)) back into the original decision rule (if (w dot u) + b >= 0 then it's positive), where u is an unknown vector. So:

if SUM(alphai*yi*(xi dot u)) + b >= 0 then it's positive

Why do we care? It tells us that the decision also only depends on dot products with the samples.

Note that this will NOT work if the positive and negative values are NOT linearly separable. It only works if there is a straight line between the two regions.

|  -    -
|    -    -   -
|  -    +    -
|     +    -
|  -    +    -
|    -    -
---------------------------
(a ring of - examples surrounding a cluster of + examples; no straight line separates them)

Now, if the points are not separable in one space, then transform the points into another space where they may be easily separable.
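The dots-products-only decision rule can be sketched in Python. The two samples, the alphas, and b below are my own toy values, hand-picked so that SUM(alphai*yi) = 0 and w = SUM(alphai*yi*xi) comes out to (1,0) (not from the lecture):

```python
# Hypothetical samples with alphas consistent with the derived constraints:
# SUM(alphai*yi) = 0 and w = SUM(alphai*yi*xi) = (1, 0), with b = -1.
samples = [((2.0, 0.0), +1, 0.5),   # (xi, yi, alphai)
           ((0.0, 0.0), -1, 0.5)]
b = -1.0

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def classify(u):
    # Decision rule written purely in terms of dot products with the samples:
    # sign of SUM(alphai*yi*(xi dot u)) + b
    s = sum(alpha * y * dot(x, u) for x, y, alpha in samples) + b
    return +1 if s >= 0 else -1

assert sum(alpha * y for _, y, alpha in samples) == 0
print(classify((3.0, 5.0)))   # prints 1  (right of the boundary x = 1)
print(classify((-1.0, 2.0)))  # prints -1
```

The point is that `classify` never materializes w; it only touches the unknown vector u through dot products with the samples, which is exactly what makes the kernel trick possible.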
The maximization only depends on the dot products. Really, we do not need the transformation of individual points from start space to solution space. Instead, we only need a function that converts dot products in start space into dot products in solution space. Call the transformation Phi; the function K(u,v) = Phi(u) dot Phi(v) is called a kernel function.

Some popular kernel functions are:
1. Polynomial: K(u,v) = (u dot v + 1)**n
2. Radial-Basis: K(u,v) = e**(-||u - v|| / sigma)

So, this transformation allows us to use the optimal solution we derived for simple spaces. This still has the problem of over-fitting, but not the problem of local min/max as neural nets do.

Lecture 17 - Learning: Boosting
==================================================================

This is used for binary classification. Is that a person or a tree? Is that coffee or tea? The math looks hard, but this is easy to implement. It uses the intuition that a group of problem solvers can do better than any single problem solver.

Assume that there are classifier functions, which output either -1 or +1. Let h be a classifier function. It has an error rate between 0.0 and 1.0, which is the fraction of cases that are wrong (so 0.0 is perfect). We can make a good classifier that has a low error rate by letting several weaker classifiers each run and vote.

Let H (capital H) be a better composite classifier. H(x) is a function of h1(x), h2(x), h3(x). Simply:
H(x) = sign(h1(x) + h2(x) + h3(x))
If 2 or more are +1, then sign() is +1. So if 2 of 3 vote +1 then the answer is +1. Ideally, in the parts of the solution space where one h is wrong, the other 2 will be right. Of course, if 2 or more are wrong for some value, then the composite will also be wrong.

So, which classifiers do we use?
- Start with the h1 which has the lowest error rate
- h2 is chosen using a modified set of inputs - where the inputs that h1 gets wrong are magnified/distorted. We weigh the samples that have an error more heavily.
- h3 is chosen using a modified set of inputs - where the inputs for which h1(x) != h2(x) are magnified/distorted. Again, we weigh the samples that have an error more heavily.

We can organize this as a tree. Each node is a function and edges link functions to their input functions. H is the composite function; it is the root. H has 3 child nodes: h1, h2, h3. Now, we can recurse. h1 in turn may be a composite of h1.1, h1.2, h1.3. h2 in turn may be a composite of h2.1, h2.2, h2.3. h3 in turn may be a composite of h3.1, h3.2, h3.3.

We can create the simple classifiers from Decision Trees. We do not use the full tree, just a "tree stump". If a decision tree has n tests, we can make a subtest from any subset of the tests. Additionally, we can also take the output unchanged, or invert it, so this makes twice the number of possible stumps. So, if there are 3 tests, then we could take any single test, any pair of tests, or all three. But, we could use ANY type of classifier for boosting: neural nets, nearest neighbors, and more. For the rest of this discussion, we will use decision tree stumps.

Each sample has a weight associated with it: w1, w2, w3, .... At time 1, all weights are equal, 1/N. For a decision tree stump, the error rate is
error = SUM(weight of wrong cases)
so with the initial equal weights this is just (#wrong cases) / N.
Weights are constrained:
SUM(weight of ALL cases) = 1.0
So, if some weights are heavier then others must be smaller.

So, how do we combine simpler classifiers into a composite classifier?
H(x) = sign( (alpha1 * h1(x)) + (alpha2 * h2(x)) + (alpha3 * h3(x)) + ..... )
where alpha1, alpha2, alpha3 are weights we use to scale each classifier function.

Assume there are N samples. At time 1, w1 = w2 = w3 = .... = 1/N. We will iterate.
At step t:
Pick the h classifier function that has the lowest error rate.
This lowest error rate is used to compute alpha for that function.
This lets us compute the weights for step t+1.

Let the weight for input i be wi:
wi at step t+1 = ( wi at step t / Z ) * e**(-(alpha at t) * h(xi) * y(xi))
where Z is some normalizing constant. It makes all the new weights add up to 1, which is a constraint we identified earlier.
where y is a function of x, with value +1 or -1. y is actually the correct answer. It is used to make the exponent positive or negative depending on whether the classifier is correct or not:
When h(x) is correct, then h(x) * y(x) = 1 * 1 = -1 * -1 = +1
When h(x) is INcorrect, then h(x) * y(x) = 1 * -1 = -1 * 1 = -1
This is mathematically convenient.

We want to minimize the error of the composite classifier built with these weights. Essentially, compute a derivative and set it to 0. Winston skips the math here, but the result is:
alpha at step t = 1/2 * ln((1 - error at t) / error at t)
With this choice, the error rate of the composite is bounded by a decaying exponential function.

So, this may combine many classifiers. This raises a question - is this really *combining* classifiers, or is it making a new classifier from arbitrary input functions? Really, the input functions could be any function, not just classifiers.

Now, in general:
Weight at t+1 = ((weight at t) / Z) * (correction factor)
where the correction factor depends on whether we correctly or incorrectly classified in step t, and Z is a scaling constant. The correction factor is e**(-alpha) or e**(+alpha), and we can plug in the expression for alpha that we derived above:
alpha at t = 1/2 * ln((1 - error at t) / error at t)
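One round of this update can be checked with a tiny Python sketch (my own toy numbers, not from the lecture; `correct[i]` holds the product h(xi)*yi for a made-up stump that gets 1 of 4 samples wrong):

```python
import math

# Toy round: 4 samples, a hypothetical stump that gets sample 4 wrong.
# correct[i] = h(xi) * yi, i.e. +1 if classified correctly, -1 if not.
correct = [+1, +1, +1, -1]
N = len(correct)
w = [1.0 / N] * N                     # step-1 weights: all equal

error = sum(wi for wi, c in zip(w, correct) if c == -1)   # weighted error, 0.25
alpha = 0.5 * math.log((1 - error) / error)

# Update rule: wi_next = (wi / Z) * e**(-alpha * h(xi) * yi)
unnorm = [wi * math.exp(-alpha * c) for wi, c in zip(w, correct)]
Z = sum(unnorm)
w_next = [u / Z for u in unnorm]

sum_correct = sum(wi for wi, c in zip(w_next, correct) if c == +1)
sum_incorrect = sum(wi for wi, c in zip(w_next, correct) if c == -1)
print(round(sum_correct, 6), round(sum_incorrect, 6))   # 0.5 0.5
```

After one update the correctly and incorrectly classified samples carry equal total weight, which matches the 1/2-and-1/2 result the notes derive next, and Z here equals 2*sqrt((1-error)*error).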
Note, because the exponent is -alpha or +alpha, the ratio inside the square root is inverted when the sign flips.

If we classified correctly in step t (h(x) and y have the same sign), then:
Weight at t+1 = ((weight at t) / Z) * sqrt[ error at t / (1 - error at t) ]
If we classified INcorrectly in step t, then:
Weight at t+1 = ((weight at t) / Z) * sqrt[ (1 - error at t) / error at t ]

We choose Z such that all the new weights add up to 1:
1 = (sum of new weights of Correct classifications) + (sum of new weights of INcorrect classifications)
Or, expanding with the update rule:
1 = SUMcorrect[ ((weight at t) / Z) * sqrt[ error at t / (1 - error at t) ] ] + SUMincorrect[ ((weight at t) / Z) * sqrt[ (1 - error at t) / error at t ] ]
We can factor out the sqrt terms, which are the same for every sample in each group:
1 = (sqrt[ error at t / (1 - error at t) ] * SUMcorrect[ weight at t ] / Z) + (sqrt[ (1 - error at t) / error at t ] * SUMincorrect[ weight at t ] / Z)
Or:
Z = (sqrt[ error at t / (1 - error at t) ] * SUMcorrect[ weight at t ]) + (sqrt[ (1 - error at t) / error at t ] * SUMincorrect[ weight at t ])

Then, Winston notes that for Correct cases, SUMcorrect[ weight at t ] = 1 - (error at t), and for INcorrect cases, SUMincorrect[ weight at t ] = error at t (that is the definition of the weighted error rate). So:
Z = (sqrt[ error at t / (1 - error at t) ] * (1 - error at t)) + (sqrt[ (1 - error at t) / error at t ] * (error at t))
Do some algebra and divide the denominators into the numerators:
Z = (sqrt[ error at t ] * sqrt[ 1 - error at t ]) + (sqrt[ 1 - error at t ] * sqrt[ error at t ])
But the two added terms are equivalent (a*b = b*a), so:
Z = 2 * sqrt[ (1 - error at t) * error at t ]

Let's now substitute this expression for Z into the expression for the weights.
If we classified correctly in step t, then:
Weight at t+1 = ((weight at t) / (2 * sqrt[ (1 - error at t) * error at t ])) * sqrt[ error at t / (1 - error at t) ]
Weight at t+1 = (weight at t) / (2 * (1 - error at t))
If we classified INcorrectly in step t, then:
Weight at t+1 = ((weight at t) / (2 * sqrt[ (1 - error at t) * error at t ])) * sqrt[ (1 - error at t) / error at t ]
Weight at t+1 = (weight at t) / (2 * error
at t)

Next, sum the new weights over all the correctly classified samples:
SUMcorrect[Weight at t+1] = (1 / (2 * (1 - error at t))) * SUMcorrect[weight at t]
But SUMcorrect[weight at t] = 1 - (error at t). So:
SUMcorrect[Weight at t+1] = (1 / (2 * (1 - error at t))) * (1 - error at t) = 1/2
So, over all correct answers, SUM[Weight at t+1] = 1/2. The sum of ALL weights is 1, so over all INcorrect answers, SUM[Weight at t+1] = 1/2 as well (the same substitution with SUMincorrect[weight at t] = error at t shows it directly).

OK, before we said that all weights add to 1. Now we see that all correct weights add to 1/2 and all incorrect weights add up to 1/2. In other words, the Correct and INcorrect samples contribute equally.

There are some simplifications. If we have N samples, we do NOT have to evaluate test thresholds that fall between 2 correct or between 2 incorrect guesses. We only care about thresholds that separate a correct from an incorrect.

Lecture 18 - Representations: Classes, Trajectories, Transitions
==================================================================

None of the systems so far emulate human intelligence. Humans evolved, and on an evolutionary time scale developed intelligence quickly. Brain size has increased, BUT Neanderthals had bigger overall brains and were not as smart. It looks like a small group of ~1000 humans (based on mitochondrial DNA) changed their mental machinery and then quickly became smart and took over the planet.

To loosely paraphrase Winston, who in turn loosely paraphrases Chomsky: "The ability to take 2 concepts and make a 3rd concept without disturbing the original 2 concepts. And the 3rd concept can be used as input to another iteration, so this proceeds without limit."

In evolution, this change happened very rapidly, and seemed to not affect brain size. Winston says this "symbolic ability" enabled story telling and understanding. So, "stories" seem to be a deep mechanism. Linguists distinguish between "inner language", the language with which we think, and spoken language.
Often, bilingual people will remember a conversation but not remember which language it was in.

Consider a semantic net. We start with nodes and links. "Combinators" are links; they connect nodes. There are also links between related links. "Reification" is the process of making links themselves the endpoints of other links. It makes a link an equivalent alternative to a node as the input to a link.

There is a localization process, called a "frame" by Minsky, that is a group of related links and nodes. This is like a context. It's a bunch of related nodes and links that values can be "plugged into", so you can bind a frame to particular values and make an instance of the frame for one specific application. Winston calls frames a "localization" layer. It "shines a spotlight" onto a small part of the semantic net at a time.

This is very general, but so low level it does not help much with specific problems. There is also "parasitic meaning". We know something, then draw a diagram, and project our own understanding onto the diagram, but the diagram itself carries no meaning.

We will now consider the types of representation that are needed to understand stories.

1. Classification
We can create hierarchies of types, like tool -> hammer -> Ball Peen hammer. Most people know a little about most things, but a lot about a few things.

2. Transitions
One representation is to consider changes in the state. This is different than how engineers think - they think in terms of "state", the state of something at one time. Alternatively, we can think of things as transitions or changes or transformations. So, for every state variable, there is a timeline. Consider an event. We can model this as 3 time regions:
Before event
During event
After event
There are state variables, and in each of the different time regions we can describe each with a vocabulary of change:
Increasing, not-Increasing
Decreasing, not-Decreasing
Changing, Not-changing
Appear, Not-Appear
Disappear, Not-Disappear
Not applicable

3.
Trajectory
A lot of what we say is about objects moving along trajectories. An object starts at a source, is moved by an agent, follows a path, and moves toward a destination. An agent may affect the object, possibly with a tool. There may be an assistant, and also a beneficiary of the transition. This is really a Minsky frame, and these are slots in the frame that can be bound to different elements of different stories. In some sample databases (like the Wall St Journal corpus of 50,000 sentences), about 25% of sentences are transitions or trajectories.

Story Sequences
A story sequence can be a single sentence, like "Pat comforted Chris". We can construct a "role" frame for this with slots:
Agent=Pat
Action=? ("comfort" is vague)
Object=Chris
Result: Transition frame
Transition Frame
Object=Chris
Mood=improved

Multiple ways of thinking about something: Winston paraphrases Minsky as saying, "if you can only think of something in one way then you have no recourse when you get stuck".

If the sentence were instead "Pat terrorized Chris", we get similar frames, but now in the transition frame the Mood is "decreased". If the sentence were instead "Pat kissed Chris", we get similar frames, but now in the role frame there is also a "hallucination" slot, which is actually a "Trajectory" type frame:
Object=Pat's lips
Destination=Chris' lips
But, this is still vague. A father kisses his daughter on the forehead, a boy kisses a girlfriend on the lips, etc.

Now, notice that our internal language also has a mechanism for understanding stories, "Sequence". We don't want just a giant jumble of frames, but rather a sequence over time. Our learning is rooted in sequence. It is hard to replay a piece of music from the middle; you have to start on a boundary. Similarly, you often recall your SSN from the beginning, not just the last 4 digits.

We also have libraries of stories. In particular, our frames are stored in a hierarchy.
Event
- Disaster
    Earthquake
    Hurricane
- Party
    Birthday
    Wedding

All of these frames have different slots, and the slots are the key attributes of a story that we look for. Parsing stories is still difficult, because it can be confusing what pronouns refer to. This leads to some general rules for good writing:
1. Don't use pronouns. They make it hard for the reader to know what you are referring to.
2. Don't use "former" or "latter". They make the reader stop and look back to a previous sentence to see which you are referring to.
3. Don't call a "shovel" a "spade". Don't use different words for the same object.

Lecture 19 - Architectures: GPS, SOAR, Subsumption
==================================================================

This lecture discusses architectures, which combine the ideas presented so far. This will be important in understanding how human intelligence works.

GPS - (General Problem Solver)
This embodies the "Problem Solving Hypothesis". It models the world as states, with a start state and a goal state. Measure the distance between them (in some metric), and then forward-chain to progress through intermediate states that are closer to the goal state. This is also called "means-ends analysis".

SOAR - (State Operator And Result)
This is a second generation of the General Problem Solver, built by the people behind GPS. It embodies the "Symbol System Hypothesis" - humans are symbol manipulators.
It has several parts:
Short term memory
Long term memory
Vision and actuators
Most of the work is in the short term memory. This was influenced by Cognitive Science modelling of human thought. Short term memory held assertions and rules. A process would shuttle rules and assertions between short and long term memory. This is a focus mechanism, so only part of your knowledge is active at any time. There was also an elaborate "preference" system for choosing between rules when more than 1 rule can fire at the same time. It also had "problem spaces", which are spaces of discrete states.
So, each problem has a specific problem space that is searched using the rules. So, it's a rule-based system. It also has "universal subgoaling": when you can't think of what to do next, that itself becomes the next problem. So, this does forward/backward chaining.

Emotion Machine model - Minsky
Thinking comes in layers. This is based on the "common sense hypothesis". When we solve a problem, we think on different levels:
1. Instinctive reaction. Built-in reflexes.
2. Learned reaction. We learn responses and protocols.
3. Deliberative thinking. This does problem solving.
4. Reflective thinking. Think about how we solved problems. This implies we store past problem solving sessions and then analyze/debug/optimize them later, with more data, to learn.
5. Self-reflecting layer. Think about how different goals interact.
6. Self-conscious thinking. Think about goals in a social context.
The Open Mind project was an attempt to gather common sense ideas from the web.

Subsumption Architecture (Rod Brooks)
This is based on the "creature hypothesis" - make a machine as smart as an insect. Traditionally, robots had separate vision systems, reasoning systems, and action systems. Brooks reorganized this around layers that each deal with the world, and each has some of its own sensing and reasoning.
Example of a hierarchy:
1. Seek
2. Explore
3. Wander
4. Avoid Objects
This is really a hierarchy of abstraction layers. Each layer has its own vision and reasoning system, and each layer is built on top of the lower layers.
Some key features:
- No representation. Don't try to build a representation of all data and the world.
- Use the world - the sensory input - as data, rather than an abstract model. So, everything is reactive.
- All mechanisms (plans? rules?) are finite state machines.
There are no plans, only reactive behaviors. You do not have an abstract map of the terrain, but rather wander and react to walls and obstacles. It could build a map as it wanders, and it may do this.
This was commercialized as the Roomba robot.

Minsky's layers seem to be a superset of these:
Deliberative thinking - this is SOAR
Instinctive reaction - this is Subsumption
Learned reaction - this is Subsumption

Genesis Architecture (Winston's lab)
"Strong story hypothesis"
This is centered on language. The language system does 2 things:
1. Guide and interact with perception/input systems
2. Describe events
Some questions use perception to find an answer - like "am I wearing blue jeans". Some answers are not written down anywhere, but we can imagine (ie simulate) an event in our mind and tell from the simulation what happens, like "What happens if you run with a bucket full of water". A language allows you to describe events in the context of your culture and social environment. It can also tell stories. Telling and understanding stories may be essential, the core of "understanding".

To illustrate the essential quality of stories, consider a series of experiments from Psychology. A rectangular room, all walls are white. In each corner is a basket that may contain food. You put the food in one of the 4 baskets while a rat is watching. Spin the rat, then let it go for food. It will go to the correct corner or the diagonally opposite one, but only those 2. So, it understands that the room is a rectangle. This is true for rats, children, and adults.

Now, paint 1 wall blue. The rat and the child still go to the 2 diagonal corners with equal probability. But adults always get the right corner. Children always get the right corner once they use the words "left" and "right" in language. So, language is more than communicating ideas; it is deeply linked with using and manipulating ideas. But, an adult that is repeating aloud what they hear over headphones is now back to going to the 2 diagonal corners with equal probability. So, distracting or otherwise occupying the language center will cause you to lose some reasoning function.
This may be a bit like the layers in Brooks' model, except now one of the layers is language.

Two key ideas:
1. If you want to be smarter: look, listen, draw, talk. These seem to drive more mental machinery and cause better processing of ideas. Taking notes in class helps because it engages your motor and linguistic machinery.
2. Beware of fast talkers. They are jamming your language processor so you cannot think.

Lecture 21 - Probabilistic Inference - Part I
==================================================================

For an event, we can build a table. Each column is a possible explanation (a variable). There are N possible explanations, so N columns. There are 2**N rows, one row for each possible combination of true/false values of the columns. Add another column, and keep a count of how often the combination represented by each row happens. Then, the probability of each row is the #occurrences of that row divided by the total #observations. This is a number between 0 and 1. Add a column for the probability of each row.

The probability of an event is the sum of the probabilities of the rows in which the event is true. The probability of an event given a precondition is the number of occurrences with both the precondition AND the event, divided by the total #occurrences of the precondition.

This is powerful, but there is a problem, because any realistic application has *lots* of rows. For N variables there are 2**N rows. So, we will go over probability without explicitly making a table.

Some basics of probability:
P(a) is the probability of event a.
The basic axioms:
0 <= P(a) <= 1
P(true) = 1
P(false) = 0
P(a) + P(b) - P(a AND b) = P(a OR b)
(the intuition is to not double-count P(a AND b); it's part of both P(a) and also P(b))

Conditional probability:
P(a | b) means probability of a given b
P(a,b) means probability of a AND b
P(a | b) = P(a,b) / P(b)
A normal probability is #events / #samples. Now restrict attention to the subset of samples where b is true; the size of that subset is proportional to P(b). In that subset, the #events of a has to be the same as the #events of "a AND b", since we only consider the cases where b is true. So, within this subset, P(a | b) = #events of a and b / #samples with b = P(a,b) / P(b).
Similarly, by just algebra, P(a,b) = P(a | b) * P(b).

Now, extend this to 3 variables a,b,c. Let y = (b AND c):
P(a,b,c) = P(a,y)
         = P(a | y) * P(y)
         = P(a | b,c) * P(b,c)
         = P(a | b,c) * P(b | c) * P(c)
In general:
P(x1,x2,...,xn) = Product of P(xi | xi-1,....,x1) for all i between 1 and n inclusive
This is the "Chain rule".

Independence:
P(a | b) = P(a) if a is independent of b. The probability of b does not matter. So, in Venn diagram terms, P(a | b) is the ratio of the overlap of a and b to the entire area of b. If this is equal to P(a), then that ratio is the same as the ratio of a to the size of the entire event space.

Conditional Independence:
P(a | b,z) = P(a | z) if a is independent of b given z. The probability of b does not matter. So, in Venn diagram terms, P(a | b,z) is the ratio of the overlap of a, b, and z to the overlap of b and z. That ratio is the same as the ratio of the overlap of a and z to the size of all of z. In general:
P(a,b | z) = P(a | z) * P(b | z)

Belief Nets:
We can diagram events as nodes, and causality as edges. Several nodes may cause a single event; ie, a raccoon or a burglar can both cause a dog to bark. Given a node's parents, every node is independent of all other non-descendant nodes. In other words, a node depends directly only on its immediate parents, not on any indirect ancestors.
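The joint-table mechanics above can be sketched directly in Python (my own made-up counts for two variables, not from the lecture):

```python
# Hypothetical joint table over two variables (a, b) built from counts,
# as in the notes: probability of a row = #occurrences / total.
counts = {(True, True): 30, (True, False): 10,
          (False, True): 20, (False, False): 40}
total = sum(counts.values())
P = {row: n / total for row, n in counts.items()}

def prob(pred):
    # probability of an event = sum of the probabilities of rows where it holds
    return sum(p for row, p in P.items() if pred(row))

p_a = prob(lambda r: r[0])
p_b = prob(lambda r: r[1])
p_ab = prob(lambda r: r[0] and r[1])
p_a_or_b = prob(lambda r: r[0] or r[1])

# Axiom check: P(a) + P(b) - P(a AND b) = P(a OR b)
assert abs(p_a_or_b - (p_a + p_b - p_ab)) < 1e-12
# Conditional probability: P(a | b) = P(a,b) / P(b)
p_a_given_b = p_ab / p_b
print(p_a_given_b)   # prints 0.6
```

With these counts, a and b are not independent (P(a | b) = 0.6 while P(a) = 0.4), which is exactly the kind of comparison the independence definition above calls for.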
Note that a node *might* still depend on its descendants. For each node, you can make a probability table with all of its parent nodes as variables, so if there are N parent nodes then there are 2**N rows. Now, you still have to find some probabilities, but far fewer than if you just created a single giant table of all the variables in the system. So, a Belief Network is a way of simplifying work, because you can safely ignore variables that do not have any effect on the particular variable you are modelling.

Moreover, once you have a belief network, you can create the single large table mechanically. Start with the probability of all variables: P(a,b,c,d). You can compute this with the chain rule:
P(a,b,c,d) = P(a | b,c,d) * P(b | c,d) * P(c | d) * P(d)
But, P(a | b,c,d) = P(a | b) if a is independent of c,d given b.

Lecture 22 - Probabilistic Inference - Part II
==================================================================

A bit of review:
The Joint Probability Table tells us everything, but it is huge and combinatorially explosive. We used "Inference nets", also called "Bayes nets", which have a node for each event and an edge where one event can cause another. This assumes no loops.

Any variable in the graph is independent of any non-descendant, given the variable's parents. All causality flows into a node through its parents; once you know the parents, you can ignore anything further upstream. The inference graph allows us to derive a full Joint Probability Table with only a limited number of probabilities. It prunes non-causal relationships, so it only requires the essential probabilities.

To do this, we use the "Chain rule":
P(x1,x2,...,xn) = Product of P(xi | xi-1,....,x1) for all i between 1 and n inclusive
So, this is a product of N terms, where each term is P(xi | xi-1,....,x1); each variable is conditioned only on the variables that come after it in the ordering. For example:
P(x1,x2,x3) = P(x1 | x2,x3) * P(x2 | x3) * P(x3)
Each node is a variable.
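The "build the big table mechanically" idea can be sketched in Python. The net and all its numbers below are hypothetical (a made-up rain/sprinkler/wet-grass net where R and S have no parents and both feed W; none of this is from the lecture):

```python
from itertools import product

# Hypothetical net: R (rain) -> W (wet grass) <- S (sprinkler).
# Simplified chain rule for this topology: P(R,S,W) = P(W | R,S) * P(R) * P(S)
P_R = {True: 0.2, False: 0.8}
P_S = {True: 0.1, False: 0.9}
P_W_given = {(True, True): 0.95, (True, False): 0.9,
             (False, True): 0.8, (False, False): 0.05}   # keyed by (R, S)

# Mechanically build the full 2**3-row joint table from the small tables.
joint = {}
for r, s, w in product([True, False], repeat=3):
    pw = P_W_given[(r, s)] if w else 1 - P_W_given[(r, s)]
    joint[(r, s, w)] = pw * P_R[r] * P_S[s]

# The generated table is a proper distribution...
assert abs(sum(joint.values()) - 1.0) < 1e-12
# ...and we can read off any marginal from it, e.g. P(W):
p_w = sum(p for (r, s, w), p in joint.items() if w)
print(round(p_w, 4))   # prints 0.281
```

Only 2 + 4 numbers were specified, yet the full 8-row joint table falls out, which is the whole point of the factorization.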
So, how do we arrange the variables for the Chain rule expression? Start with the graph and take a leaf node. Append it to the list, and remove it from the graph. Now, iterate with the simpler graph. This orders the variables so each variable precedes all of its parents. Now, we can use the chain rule with the variables in this order. For example:
P(C,D,B,T,R) = P(C | D,B,T,R) * P(D | B,T,R) * P(B | T,R) * P(T | R) * P(R)
But, a variable is independent of any non-descendant node given its parents. So, a variable does not depend on nodes that are not its parents, and we can simplify the expressions to remove the non-parents. So, if only D and B are parents of C, then:
P(C | D,B,T,R) = P(C | D,B)

We can start with observed data and use that to build the tables of probabilities. Each observed event gives one True/False value for every variable. For each variable that has no parents, we record the total number of observations and the number of cases when the variable is true. This gives us #true / #events, so a probability for each parentless variable. For each variable that depends on 1 or more parent variables, we keep a similar tally. Create a probability table: each row is a combination of parent values, and we record the number of times we saw each combination of values, and also the number of times the output variable was true when we saw that combination. So, again, for every row, we have a total #events and also a #times the output variable is true. This uses observed data to build the probability tables.

Next, we use the probability tables to predict behavior. Again, start with an inference graph, but now start at the nodes with no parents. For each such node, generate a random number to decide whether the node is true/false. Once we have assigned values for all the nodes that feed into a particular intermediate node, then we can use those values to select a row in the probability table for the intermediate node.
Again, generate a random number, and if it meets the threshold for that row,
then the intermediate node is true; otherwise it is false.

Now, we need to decide whether we have designed the dependencies in the
Inference graph correctly. In other words, are there links between every
pair of interdependent nodes and no links between independent nodes? We can
check this with the rules of Bayesian inference. By definition we know that:
    P(a | b) = P(a,b) / P(b)    so:    P(a | b) * P(b) = P(a,b)
But we can also start with:
    P(b | a) = P(b,a) / P(a)    so:    P(b | a) * P(a) = P(b,a) = P(a,b)
So:
    P(a | b) * P(b) = P(a,b) = P(b | a) * P(a)
    P(a | b) * P(b) = P(b | a) * P(a)
    P(a | b) = P(b | a) * P(a) / P(b)
This is Bayes Rule.

This can be used to determine whether evidence implies a conclusion. Let:
    a = conclusion
    b = evidence
It may be hard to estimate P(a | b), the conclusion given the evidence, but
easier to estimate P(b | a), the evidence given the conclusion.
    P(conclusion | evidence) = P(evidence | conclusion) * P(conclusion) / P(evidence)
Suppose there are several possible conclusions. We have some evidence, and
we know the probability of the evidence given each conclusion. We also know
the baseline pre-test probability of each conclusion. That is all we need to
decide the probability of each conclusion. (I don't understand what he is
saying here.) All the probabilities add up to 1:
    1 = P(conclusion1) + P(conclusion2) + ...
    1 = P(conclusion1 | evidence) + P(conclusion2 | evidence) + ...
    1 = P(evidence | conclusion1)*P(conclusion1)/P(evidence)
      + P(evidence | conclusion2)*P(conclusion2)/P(evidence) + ...
so:
    P(evidence) = P(evidence | conclusion1)*P(conclusion1)
                + P(evidence | conclusion2)*P(conclusion2) + ...
But suppose there are several different pieces of evidence.
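Bayes Rule can be checked numerically. A small sketch with an invented joint distribution over two variables, verifying that P(a | b) computed directly from the joint matches P(b | a) * P(a) / P(b):

```python
# Invented joint distribution over two booleans a, b (the four entries sum to 1).
joint = {
    (True, True): 0.12, (True, False): 0.08,
    (False, True): 0.18, (False, False): 0.62,
}

p_a = joint[(True, True)] + joint[(True, False)]    # marginal P(a)
p_b = joint[(True, True)] + joint[(False, True)]    # marginal P(b)

p_a_given_b = joint[(True, True)] / p_b             # P(a|b) = P(a,b) / P(b)
p_b_given_a = joint[(True, True)] / p_a             # P(b|a) = P(a,b) / P(a)

# Bayes Rule: P(a|b) = P(b|a) * P(a) / P(b)
bayes = p_b_given_a * p_a / p_b
print(abs(p_a_given_b - bayes) < 1e-12)  # True
```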
    P(conclusion | evidence1,...,evidenceN)
        = P(evidence1,evidence2,...,evidenceN | conclusion) * P(conclusion) / d
where d is some denominator we are not specifying.

Now, suppose the pieces of evidence are independent of each other given the
conclusion. When the variables are independent:
    P(evidence1,...,evidenceN | conclusion)
        = P(evidence1 | conclusion) * P(evidence2 | conclusion)
        * ... * P(evidenceN | conclusion)
So:
    P(conclusion | evidence1,...,evidenceN)
        = P(evidence1 | conclusion) * ... * P(evidenceN | conclusion)
        * P(conclusion) / d
where d is the same unspecified denominator. Now, do this for every possible
conclusion. The conclusion with the highest probability given the evidence
is the one we select as most likely.

We can also use this to select between competing conclusions after a series
of repeated tests. We use the same test over and over, but it may give a
different value each time. Consider flipping a coin - the same flip may come
up heads or tails each time. We build a running probability value for each
possible conclusion, and each possible conclusion has a likelihood for each
test result. So:
    Start each conclusion at its pre-test probability.
    Run one test and get the result.
    Find the probability of that result under each possible conclusion.
    Multiply that probability with the current probability of the conclusion.
We evolve several possible conclusions in parallel. At any point, pick the
possible conclusion with the highest probability.

Similarly, if there are 2 competing Inference graph topologies: each has a
probability of producing a specific result for some evidence. Gather a piece
of evidence and its result. Compute the probability of each Inference graph
producing this result. Repeat N times. Multiply the probabilities of all N
results for each candidate Inference graph. Select the Inference graph with
the highest probability of producing the observed results. Now, we are
selecting structures.
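The "multiply the per-evidence likelihoods, then pick the highest" procedure can be sketched as follows (my own illustration - the conclusions, priors, and likelihoods are invented; d is the same for every conclusion, so it can be dropped when comparing):

```python
# Invented priors and per-evidence likelihoods for two competing conclusions.
priors = {"flu": 0.1, "cold": 0.9}
likelihood = {                      # P(evidence_i = True | conclusion)
    "flu":  {"fever": 0.9, "cough": 0.8},
    "cold": {"fever": 0.2, "cough": 0.7},
}

def score(conclusion, observed):
    """P(conclusion) * product of P(evidence_i | conclusion); d is omitted
    because it is identical across conclusions."""
    s = priors[conclusion]
    for ev in observed:
        s *= likelihood[conclusion][ev]
    return s

observed = ["fever", "cough"]
best = max(priors, key=lambda c: score(c, observed))
print(best)  # cold
```

With many pieces of evidence the products get tiny, so in practice you would sum log-likelihoods instead of multiplying raw probabilities, as the lecture notes below.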
But the space of possible structures is likely too large when there are 30
or more variables, so this becomes a search problem. Take the losing model
and modify it randomly - a random walk through the space of structures. At
each step, we compare models. At the start, each model has the same N
variables and no edges. For one model, pick 2 random nodes and make an edge
between them. Then loop:
    Get a series of evidence.
    Compute the probability that each model produces the observed result.
    Multiply that with the running probability of that model.
    Pick the better model - the one with the highest probability.
    Take the loser and randomly mutate it (add or remove an edge).
    Repeat.
Actually, instead of a product of probabilities, use the sum of the logs of
the probabilities. The probabilities get so small that we need to add logs
rather than multiply raw numbers.

Typical applications:
    Medical diagnosis
    Software debugging
Probability analysis is the right thing to do when you don't really
understand the system - like biology, where we do not know all the pathways.

Lecture 23 - Model merging
==================================================================
Structure discovery was discussed in the previous lecture. This is
important, and may be integral to machine learning.

Suppose you have 2 stories. Represent each story as a sequence of nodes. We
can model this as a basic graph: events are nodes and edges are time
sequencing, so A-->B means "A followed by B". Next, we can try to combine
similar stories - make a graph that has the same basic structure as the 2
original graphs, but with optional detours or optional changes in details.
This is "Bayesian Story Merging". Each node is really a small mini-graph of
actor nodes and action nodes. We can merge 2 larger nodes if their
mini-graphs have comparable action nodes. Note, "comparable" means things
like "run" and "flee" are the same, but also looser similarities. This can
be used for story understanding.
- Parse the story into sentences, and each sentence into a mini-graph with
  actor nodes and action edges.
- The story is a sequence of sentences, so it is a sequence of mini-graphs.
- Compare mini-graphs to try to make an abstract mini-graph that is a
  generalization.

So, story understanding is a general idea, where "story" can also mean
"theory" or "scientific model". A story is any sequence of events, and
understanding stories is really just understanding sequences of events, so
it can be used for scientific investigation. This is "cross-modal coupling":
each story graph is a model, and this tries to "couple", or find
similarities between, 2 models.

This can be used for more than stories. Winston gives an example of finding
similar bird calls. Think of a visual picture as a sequence of pen strokes.
So a picture is a linear story: a linear sequence of graphical primitives,
like "horizontal line", "box", and so on. Now we can do image recognition by
comparing 2 sequences of graphical primitives. This is a bit like string
matching - take a pattern and try to see if it is a loose/sloppy fit to a
known prototype.

In speech recognition, we compute a Fourier spectrum. This reduces the sound
to a series of peaks. But we also watch speakers and see the movement of
their lips. This is a second series of input, but this time it is a series
of visual shapes of the speaker's lips. When we listen, we are correlating
the two models, so certain lip shapes correspond to certain sound peaks.

To make a match between 2 story lines:
    For each node in one sequence, make weighted links to all possible
    matching nodes in the other sequence. Each link is weighted by how
    close a match the 2 nodes are.
    Then find pairs of nodes that are similar to the same things, in a
    constant proportion.
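The weighted-link step can be sketched as follows (my own minimal illustration - the sequences and similarity scores are invented):

```python
# Two "story lines" as sequences of labeled nodes (invented example).
seq1 = ["run", "hide", "escape"]
seq2 = ["flee", "conceal", "party"]

# Invented similarity scores between labels ("run" and "flee" count as close).
similarity = {
    ("run", "flee"): 0.9, ("hide", "conceal"): 0.8,
    ("run", "conceal"): 0.1, ("hide", "flee"): 0.2,
    ("escape", "flee"): 0.5,
}

def weighted_links(a, b, threshold=0.0):
    """For each node in one sequence, link it to every node in the other
    sequence, weighted by how close a match the two nodes are."""
    links = []
    for x in a:
        for y in b:
            w = similarity.get((x, y), similarity.get((y, x), 0.0))
            if w > threshold:
                links.append((x, y, w))
    return links

for x, y, w in weighted_links(seq1, seq2, threshold=0.3):
    print(x, "->", y, w)
```

The threshold discards weak links, leaving only the plausible node pairings for the next step of the match.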
For example, if:
    - node 1 has a link to prototype 1 of weight 4 and a link to prototype 2
      of weight 2
    - node 2 has a link to prototype 1 of weight 2 and a link to prototype 2
      of weight 1
then node 1 has the same links as node 2 at 2x the weight. The link weights
are in a constant proportion, so node 1 and node 2 are both instances of the
same thing.

Review:
* The right representation makes you smarter
* Sleep makes you smarter
* You cannot learn unless you almost know
* We think with our mouths, eyes, and hands
* The Strong Story Hypothesis: "The mechanisms that enable us humans to tell
  and understand stories are the evolutionary augmentations that separate
  our intelligence from that of other primates"
* All great ideas are simple