
Concept of Learning
Rote Learning
Learning by Taking Advice
Learning in Problem Solving
Learning by Induction
Explanation-Based Learning
Learning Automata
Learning in Neural Networks

Unit 7
Learning

Learning Objectives

After reading this unit you should appreciate the following:

Concept of Learning

Learning Automata

Genetic Algorithm

Learning by Induction

Neural Networks

Learning in Neural Networks

Back Propagation Network


Concept of Learning

One of the most often heard criticisms of AI is that machines cannot be called intelligent until they are able to learn to do new things and to adapt to new situations, rather than simply doing as they are told to do. There can be little question that the ability to adapt to new surroundings and to solve new problems is an important characteristic of intelligent entities. Can we expect to see such abilities in programs? Ada Augusta, one of the earliest philosophers of computing, wrote that

The Analytical Engine has no pretensions whatever to originate anything. It can do whatever we know how to order it to perform.

Several AI critics have interpreted this remark as saying that computers cannot learn. In fact, it does not say that at all. Nothing prevents us from telling a computer how to interpret its inputs in such a way that its performance gradually improves.

Rather than asking in advance whether it is possible for computers to "learn," it is much more enlightening to try to describe exactly what activities we mean when we say "learning" and what mechanisms could be used to enable us to perform those activities. Simon has proposed that learning denotes changes in the system that are adaptive in the sense that they enable the system to do the same task or tasks drawn from the same population more efficiently and more effectively the next time.

As thus defined, learning covers a wide range of phenomena. At one end of the spectrum is skill refinement. People get better at many tasks simply by practicing. The more you ride a bicycle or play tennis, the better you get. At the other end of the spectrum lies knowledge acquisition. As we have seen, many AI programs draw heavily on knowledge as their source of power. Knowledge is generally acquired through experience and such acquisition is the focus of this chapter.

Knowledge acquisition itself includes many different activities. Simple storing of computed information, or rote learning, is the most basic learning activity. Many computer programs, e.g., database systems, can be said to "learn" in this sense, although most people would not call such simple storage learning. However, many AI programs are able to improve their performance substantially through rote-learning techniques and we will look at one example in depth, the checker-playing program of Samuel.

Another way we learn is through taking advice from others. Advice taking is similar to rote learning, but high-level advice may not be in a form simple enough for a program to use directly in problem solving. The advice may need to be first operationalized.

People also learn through their own problem-solving experience. After solving a complex problem, we remember the structure of the problem and the methods we used to solve it. The next time we see the problem, we can solve it more efficiently. Moreover, we can generalize from our experience to solve related problems more easily. In contrast to advice taking, learning from problem-solving experience does not usually involve gathering new knowledge that was previously unavailable to the learning program. That is, the program remembers its experiences and generalizes from them, but does not add to the transitive closure of its knowledge, in the sense that an advice-taking program would, i.e., by receiving stimuli from the outside world. In large problem spaces, however, efficiency gains are critical. Practically speaking, learning can mean the difference between solving a problem rapidly and not solving it at all. In addition, programs that learn through problem-solving experience may be able to come up with qualitatively better solutions in the future.

Another form of learning that does involve stimuli from the outside is learning from examples. We often learn to classify things in the world without being given explicit rules. For example, adults can differentiate between cats and dogs, but small children often cannot. Somewhere along the line, we induce a method for telling cats from dogs - based on seeing numerous examples of each. Learning from examples usually involves a teacher who helps us classify things by correcting us when we are wrong. Sometimes, however, a program can discover things without the aid of a teacher.

AI researchers have proposed many mechanisms for doing the kinds of learning described above. In this chapter, we discuss several of them. But keep in mind throughout this discussion that learning is itself a problem-solving process. In fact, it is very difficult to formulate a precise definition of learning that distinguishes it from other problem-solving tasks.


Rote Learning

When a computer stores a piece of data, it is performing a rudimentary form of learning. After all, this act of storage presumably allows the program to perform better in the future (otherwise, why bother?). In the case of data caching, we store computed values so that we do not have to recompute them later. When computation is more expensive than recall, this strategy can save a significant amount of time. Caching has been used in AI programs to produce some surprising performance improvements. Such caching is known as rote learning.

Figure 7.1: Storing Backed-Up Values

We mentioned one of the earliest game-playing programs, Samuel's checkers program. This program learned to play checkers well enough to beat its creator. It exploited two kinds of learning: rote learning and parameter (or coefficient) adjustment. Samuel's program used the minimax search procedure to explore checkers game trees. As is the case with all such programs, time constraints permitted it to search only a few levels in the tree. (The exact number varied depending on the situation.) When it could search no deeper, it applied its static evaluation function to the board position and used that score to continue its search of the game tree. When it finished searching the tree and propagating the values backward, it had a score for the position represented by the root of the tree. It could then choose the best move and make it. But it also recorded the board position at the root of the tree and the backed-up score that had just been computed for it, as shown in Figure 7.1(a).

Now consider a situation as shown in Figure 7.1(b). Instead of using the static evaluation function to compute a score for position A, the stored value for A can be used. This creates the effect of having searched an additional several ply since the stored value for A was computed by backing up values from exactly such a search.
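The mechanism of Figure 7.1 can be sketched in a few lines. The sketch below is only an illustration, not Samuel's actual program: the board representation and the successors() and static_eval() functions are assumed placeholders, and the rote-learned value is simply reused wherever the position is seen again.

stored_values = {}    # board position -> previously backed-up score

def evaluate(board, depth, maximizing, successors, static_eval):
    # A rote-learned value stands in for the static evaluation and effectively
    # adds the depth of the search that originally produced it (Figure 7.1(b)).
    if board in stored_values:
        return stored_values[board]
    moves = successors(board)
    if depth == 0 or not moves:
        return static_eval(board)
    scores = [evaluate(b, depth - 1, not maximizing, successors, static_eval)
              for b in moves]
    return max(scores) if maximizing else min(scores)

def choose_move(board, depth, successors, static_eval):
    # Search, pick the best successor, and record the backed-up score for the
    # root position so later searches can reuse it (Figure 7.1(a)).
    moves = successors(board)
    scores = [evaluate(b, depth - 1, False, successors, static_eval) for b in moves]
    best = scores.index(max(scores))
    stored_values[board] = scores[best]
    return moves[best]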

Rote learning of this sort is very simple. It does not appear to involve any sophisticated problem-solving capabilities. But even it shows the need for some capabilities that will become increasingly important in more complex learning systems. These capabilities include:

1. Organized Storage of Information - In order for it to be faster to use a stored value than it would be to recompute it, there must be a way to access the appropriate stored value quickly. In Samuel's program, this was done by indexing board positions by a few important characteristics, such as the number of pieces. But as the complexity of the stored information increases, more sophisticated techniques are necessary.

2. Generalization - The number of distinct objects that might potentially be stored can be very large. To keep the number of stored objects down to a manageable level, some kind of generalization is necessary. In Samuel's program, for example, the number of distinct objects that could be stored was equal to the number of different board positions that can arise in a game. Only a few simple forms of generalization were used in Samuel's program to cut down that number. All positions are stored as though White is to move. This cuts the number of stored positions in half. When possible, rotations along the diagonal are also combined. Again, though, as the complexity of the learning process increases, so too does the need for generalization.

At this point, we have begun to see one way in which learning is similar to other kinds of problem solving. Its success depends on a good organizational structure for its knowledge base.

Student Activity 7.1

Before reading next section, answer the following questions.

1. Discuss the role of A.I. in learning.

2. What do you mean by skill refinement and knowledge acquisition?

3. Would it be reasonable to apply rote learning procedure to chess? Why?

If your answers are correct, then proceed to the next section.


Learning by Taking Advice


A computer can do very little without a program for it to run. When a programmer writes a series of instructions into a computer, a rudimentary kind of learning is taking place. The programmer is a sort of teacher, and the computer is a sort of student. After being programmed, the computer is now able to do something it previously could not. Executing the program may not be such a simple matter, however. Suppose the program is written in a high-level language like LISP. Some interpreter or compiler must intervene to change the teacher's instructions into code that the machine can execute directly.

People process advice in an analogous way. In chess, the advice "fight for control of the center of the board" is useless unless the player can translate the advice into concrete moves and plans. A computer program might make use of the advice by adjusting its static evaluation function to include a factor based on the number of center squares attacked by its own pieces.

Mostow describes a program called FOO, which accepts advice for playing hearts, a card game. A human user first translates the advice from English into a representation that FOO can understand. For example, "Avoid taking points" becomes:

(avoid (take-points me) (trick))

FOO must operationalize this advice by turning it into an expression that contains concepts and actions FOO can use when playing the game of hearts. One strategy FOO can follow is to UNFOLD an expression by replacing some term by its definition. By UNFOLDing the definition of avoid, FOO comes up with:

(achieve (not (during (trick) (take-points me))))

FOO considers the advice to apply to the player called "me." Next, FOO UNFOLDs the definition of trick:

(achieve (not (during
               (scenario
                 (each p1 (players) (play-card p1))
                 (take-trick (trick-winner)))
               (take-points me))))

In other words, the player should avoid taking points during the scenario consisting of (1) players playing cards and (2) one player taking the trick. FOO then uses case analysis to determine which steps could cause one to take points. It rules out step 1 on the basis that it knows of no intersection of the concepts take-points and play-card. But step 2 could affect taking points, so FOO UNFOLDs the definition of take-points:

(achieve (not (there-exists c1 (cards-played)
                (there-exists c2 (point-cards)
                  (during (take (trick-winner) c1)
                          (take me c2))))))


This advice says that the player should avoid taking point-cards during the process of the trick-winner taking the trick. The question for FOO now is: Under what conditions does (take me c2) occur during (take (trick-winner) c1)? By using a technique called partial match, FOO hypothesizes that points will be taken if me = trick-winner and c2 = c1. It transforms the advice into:

(achieve (not (and (have-points (cards-played))
                   (= (trick-winner) me))))

This means, "Do not win a trick that has points." We have not travelled very far conceptually from "avoid taking points," but it is important to note that the current vocabulary is one that FOO can understand in terms of actually playing the game of t. hearts. Through a number of other transformations, FOO eventually settles on:

(achieve (=> (and (in-suit-led (card-of me))
                  (possible (trick-has-points)))
             (low (card-of me))))

In other words, when playing a card that is the same suit as the card that was played first, if the trick possibly contains points, then play a low card. At last, FOO has translated the rather vague advice "avoid taking points" into a specific, usable heuristic. FOO is able to play a better game of hearts after receiving this advice. A human can watch FOO play, detect new mistakes, and correct them through yet more advice, such as "play high cards when it is safe to do so." The ability to operationalize knowledge is critical for systems that learn from a teacher's advice. It is also an important component of explanation-based learning.


Learning in Problem Solving

In the last section, we saw how a problem solver could improve its performance by taking advice from a teacher. Can a program get better without the aid of a teacher? It can, by generalizing from its own experiences.

Learning by Parameter Adjustment

Many programs rely on an evaluation procedure that combines information from several sources into a single summary statistic. Game-playing programs do this in their static evaluation functions, in which a variety of factors, such as piece advantage and mobility, are combined into a single score reflecting the desirability of a particular board position. Pattern classification programs often combine several features to determine the correct category into which a given stimulus should be placed. In designing such programs, it is often difficult to know a priori how much weight should be attached to each feature being used. One way of finding the correct weights is to begin with some estimate of the correct settings and then to let the program modify the settings on the basis of its experience. Features that appear to be good predictors of overall success will have their weights increased, while those that do not will have their weights decreased, perhaps even to the point of being dropped entirely.

Samuel's checkers program exploited this kind of learning in addition to the rote learning described above, and it provides a good example of its use. As its static evaluation function, the program used a polynomial of the form


c1t1 + c2t2 + ... + c16t16

The t terms are the values of the sixteen features that contribute to the evaluation. The c terms are the coefficients (weights) that are attached to each of these values. As learning progresses, the c values will change.

The most important question in the design of a learning program based on parameter adjustment is "When should the value of a coefficient be increased and when should it be decreased?" The second question to be answered is then "By how much should the value be changed?" The simple answer to the first question is that the coefficients of terms that predicted the final outcome accurately should be increased, while the coefficients of poor predictors should be decreased. In some domains, this is easy to do. If a pattern classification program uses its evaluation function to classify an input and it gets the right answer, then all the terms that predicted that answer should have their weights increased. But in game-playing programs, the problem is more difficult. The program does not get any concrete feedback from individual moves. It does not find out for sure until the end of the game whether it has won. But many moves have contributed to that final outcome. Even if the program wins, it may have made some bad moves along the way. The problem of appropriately assigning responsibility to each of the steps that led to a single outcome is known as the credit assignment problem.

Samuel's program exploits one technique, albeit imperfect, for solving this problem. Assume that the initial values chosen for the coefficients are good enough that the total evaluation function produces values that are fairly reasonable measures of the correct score even if they are not as accurate as we hope to get them. Then this evaluation function can be used to provide feedback to itself. Move sequences that lead to positions with higher values can be considered good (and the terms in the evaluation function that suggested them can be reinforced).
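One simple way to formalize this self-supplied feedback is a delta-rule-style weight update. The sketch below is an illustrative assumption, not Samuel's actual adjustment procedure; the feature values, coefficients, and learning rate are made up.

def evaluate(weights, features):
    # Static evaluation: the weighted sum c1*t1 + c2*t2 + ...
    return sum(c * t for c, t in zip(weights, features))

def adjust(weights, features, error, rate=0.01):
    # Terms that contributed to a position whose later backed-up value was higher
    # than predicted get their coefficients increased; the others are decreased.
    return [c + rate * error * t for c, t in zip(weights, features)]

weights = [0.5, -0.2, 0.1]          # three of the coefficients, made-up values
features = [2.0, 0.0, -1.0]         # feature values for some position, made up
predicted = evaluate(weights, features)
backed_up = predicted + 0.8         # hypothetical later, deeper-search estimate
weights = adjust(weights, features, backed_up - predicted)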

Because of the limitations of this approach, however, Samuel's program did two other things, one of which provided an additional test that progress was being made and the other of which generated additional nudges to keep the process out of a rut.

When the program was in learning mode, it played against another copy of itself. Only one of the copies altered its scoring function during the game; the other remained fixed. At the end of the game, if the copy with the modified function won, then the modified function was accepted. Otherwise, the old one was retained. If, however, this happened very many times, then some drastic change was made to the function in an attempt to get the process going in a more profitable direction.

Periodically, one term in the scoring function was eliminated and replaced by another. This was possible because, although the program used only sixteen features at any one time, it actually knew about thirty-eight. This replacement differed from the rest of the learning procedure since it created a sudden change in the scoring function rather than a gradual shift in its weights.

This process of learning by successive modifications to the weights of terms in a scoring function has many limitations, mostly arising out of its lack of exploitation of any knowledge about the structure of the problem with which it is dealing and the logical relationships among the problem's components. In addition, because the learning procedure is a variety of hill climbing, it suffers from the same difficulties as do other hill-climbing programs. Parameter adjustment is certainly not a solution to the overall learning problem. But it is often a useful technique, either in situations where very little additional knowledge is available or in programs in which it is combined with more knowledge-intensive methods.

Learning with Macro-Operators

We saw that rote learning was used in the context of a checker-playing program. Similar techniques can be used in more general problem-solving programs. The idea is the same: to avoid expensive recomputation. For example, suppose you are faced with the problem of getting to the downtown post office. Your solution may involve getting in your car, starting it, and driving along a certain route. Substantial planning may go into choosing the appropriate route, but you need not plan about how to go about starting your car. You are free to treat START-CAR as an atomic action, even though it really consists of several actions: sitting down, adjusting the mirror, inserting the key, and turning the key. Sequences of actions that can be treated as a whole are called macro-operators.

Macro-operators were used in the early problem-solving system STRIPS. After each problem-solving episode, the learning component takes the computed plan and stores it away as a macro-operator, or MACROP. A MACROP is just like a regular operator except that it consists of a sequence of actions, not just a single one. A MACROP's preconditions are the initial conditions of the problem just solved and its postconditions correspond to the goal just achieved. In its simplest form, the caching of previously computed plans is similar to rote learning.

Suppose we are given an initial blocks world situation in which ON(C, B) and ON(A, Table) are both true. STRIPS can achieve the goal ON(A, B) by devising a plan with the four steps UNSTACK(C, B), PUTDOWN(C), PICKUP(A), STACK(A, B). STRIPS now builds a MACROP with preconditions ON(C, B), ON(A, Table) and postconditions ON(C, Table), ON(A, B). The body of the MACROP consists of the four steps just mentioned. In future planning, STRIPS is free to use this complex macro-operator just as it would use any other operator.

But rarely will STRIPS see the exact same problem twice. New problems will differ from previous problems. We would still like the problem solver to make efficient use of the knowledge it gained from its previous experiences. By generalizing MACROPs before storing them, STRIPS is able to accomplish this. The simplest idea for generalization is to replace all of the constants in the macro-operator by variables. Instead of storing the MACROP described in the previous paragraph, STRIPS can generalize the plan to consist of the steps UNSTACK(X1, X2), PUTDOWN(X1), PICKUP(X3), STACK(X3, X2), where X1, X2, and X3 are variables. This plan can then be stored with preconditions ON(X1, X2), ON(X3, Table) and postconditions ON(X1, Table), ON(X3, X2). Such a MACROP can now apply in a variety of situations.
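The constants-to-variables step can be sketched directly. The sketch below assumes a simple (name, arguments) tuple format for operators and literals; this representation is an illustration only, and STRIPS' actual data structures (triangle tables) are more elaborate.

from itertools import count

def generalize(preconds, steps, postconds):
    # Replace every constant (except Table) with a fresh variable X1, X2, ...
    mapping, counter = {}, count(1)
    def var(term):
        if term == "Table":
            return term
        if term not in mapping:
            mapping[term] = f"X{next(counter)}"
        return mapping[term]
    sub = lambda lits: [(name, tuple(var(a) for a in args)) for name, args in lits]
    return sub(preconds), sub(steps), sub(postconds)

pre   = [("ON", ("C", "B")), ("ON", ("A", "Table"))]
steps = [("UNSTACK", ("C", "B")), ("PUTDOWN", ("C",)),
         ("PICKUP", ("A",)), ("STACK", ("A", "B"))]
post  = [("ON", ("C", "Table")), ("ON", ("A", "B"))]

macrop = generalize(pre, steps, post)
# preconditions ON(X1, X2), ON(X3, Table); steps UNSTACK(X1, X2), PUTDOWN(X1),
# PICKUP(X3), STACK(X3, X2); postconditions ON(X1, Table), ON(X3, X2)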

Generalization is not so easy. Sometimes constants must retain their specific values. Suppose our domain included an operator called STACK-ON-B(X), with preconditions that both X and B be clear, and with postcondition ON(X, B).

STRIPS might come up with the plan UNSTACK(C, B), PUTDOWN(C), STACK-ON-B(A). Let's generalize this plan and store it as a MACROP. The precondition becomes ON(X3, X2), the postcondition becomes ON(X1, X2), and the plan itself becomes UNSTACK(X3, X2), PUTDOWN(X3), STACK-ON-B(X1). Now, suppose we encounter a slightly different problem, in which block E sits on block C and we want to achieve ON(A, C).


The generalized MACROP we just stored seems well-suited to solving this problem if we let X1 = A, X2 = C, and X3 = E. Its preconditions are satisfied, so we construct the plan UNSTACK(E, C), PUTDOWN(E), STACK-ON-B(A). But this plan does not work. The problem is that the postcondition of the MACROP is overgeneralized. This operation is only useful for stacking blocks onto B, which is not what we need in this new example. In this case, this difficulty will be discovered when the last step is attempted. Although we cleared C, which is where we wanted to put A, we failed to clear B, which is where the MACROP is going to try to put it. Since B is not clear, STACK-ON-B cannot be executed. If B had happened to be clear, the MACROP would have executed to completion, but it would not have accomplished the stated goal.

In reality, STRIPS uses a more complex generalization procedure. First, all constants are replaced by variables. Then, for each operator in the parameterized plan, STRIPS reevaluates its preconditions. In our example, the preconditions of steps 1 and 2 are satisfied, but the only way to ensure that B is clear for step 3 is to assume that block X2, which was cleared by the UNSTACK operator, is actually block B. Through "re-proving" that the generalized plan works, STRIPS locates constraints of this kind.

It turns out that the set of problems for which macro-operators are critical are exactly those problems with nonserializable subgoals. Nonserializability means that working on one subgoal will necessarily interfere with the previous solution to another subgoal. Recall that we discussed such problems in connection with nonlinear planning. Macro-operators can be useful in such cases, since one macro-operator can produce a small global change in the world, even though the individual operators that make it up produce many undesirable local changes.

For example, consider the 8-puzzle. Once a program has correctly placed the first four tiles, it is difficult to place the fifth tile without disturbing the first four. Because disturbing previously solved subgoals is detected as a bad thing by heuristic scoring functions, it is strongly resisted. For many problems, including the 8-puzzle and Rubik's cube, weak methods based on heuristic scoring are therefore insufficient. Hence, we either need domain-specific knowledge, or else a new weak method. Fortunately, we can learn the domain-specific knowledge we need in the form of macro-operators. Thus, macro-operators can be viewed as a weak method for learning. In the 8-puzzle, for example, we might have a macro (a complex, prestored sequence of operators) for placing the fifth tile without disturbing any of the first four tiles externally (although in fact they are disturbed within the macro itself). This approach contrasts with STRIPS, which learned its MACROPs gradually, from experience. Korf's algorithm runs in time proportional to the time it takes to solve a single problem without macro-operators.

Learning by Chunking

Chunking is a process similar in flavor to macro-operators. The idea of chunking comes from the psychological literature on memory and problem solving. SOAR also exploits chunking so that its performance can increase with experience. In fact, the designers of SOAR hypothesize that chunking is a universal learning method, i.e., it can account for all types of learning in intelligent systems.

SOAR solves problems by firing productions, which are stored in long-term memory. Some of those firings turn out to be more useful than others. When SOAR detects a useful sequence of production firings, it creates a chunk, which is essentially a large production that does the work of an entire sequence of smaller ones. As in MACROPs, chunks are generalized before they are stored.

Problems like choosing which subgoals to tackle and which operators to try (i.e., search control problems) are solved with the same mechanisms as problems in the original problem space. Because the problem solving is uniform, chunking can be used to learn general search control knowledge in addition to operator sequences. For example, if SOAR tries several different operators, but only one leads to a useful path in the search space, then SOAR builds productions that help it choose operators more wisely in the future. SOAR has used chunking to replicate the macro-operator results described in the last section. In solving the 8-puzzle, for example, SOAR learns how to place a given tile without permanently disturbing the previously placed tiles. Given the way that SOAR learns, several chunks may encode a single macro-operator, and one chunk may participate in a number of macro sequences. Chunks are generally applicable toward any goal state. This contrasts with macro tables, which are structured towards reaching a particular goal state from any initial state. Also, chunking emphasizes how learning can occur during problem solving, while macro tables are usually built during a preprocessing stage. As a result, SOAR is able to learn within trials as well as across trials. Chunks learned during the initial stages of solving a problem are applicable in the later stages of the same problem-solving episode. After a solution is found, the chunks remain in memory, ready for use in the next problem.

The price that SOAR pays for this generality and flexibility is speed. At present, chunking is inadequate for duplicating the contents of large, directly computed macro-operator tables.

The Utility Problem

PRODIGY employs several learning mechanisms. One mechanism uses explanation-based learning (EBL), a learning method we discuss later in this unit. PRODIGY can examine a trace of its own problem-solving behavior and try to explain why certain paths failed. The program uses those explanations to formulate control rules that help the problem solver avoid those paths in the future. So while SOAR learns primarily from examples of successful problem solving, PRODIGY also learns from its failures.

A major contribution of the work on EBL in PRODIGY was the identification of the utility problem in learning systems. While new search control knowledge can be of great benefit in solving future problems efficiently, there are also some drawbacks. The learned control rules can take up large amounts of memory and the search program must take the time to consider each rule at each step during problem solving. Considering a control rule amounts to seeing if its postconditions are desirable and seeing if its preconditions are satisfied. This is a time-consuming process. So while learned rules may reduce problem-solving time by directing the search more carefully, they may also increase problem-solving time by forcing the problem solver to consider them. If we only want to minimize the number of node expansions in the search space, then the more control rules we learn, the better. But if we want to minimize the total CPU time required to solve a problem, we must consider this trade-off.


PRODIGY maintains a utility measure for each control rule. This measure takes into account the average savings provided by the rule, the frequency of its application, and the cost of matching it. If a proposed rule has a negative utility, it is discarded (or "forgotten"). If not, it is placed in long-term memory with the other rules. It is then monitored during subsequent problem solving. If its utility falls, the rule is discarded. Empirical experiments have demonstrated the effectiveness of keeping only those control rules with high utility. Utility considerations apply to a wide range of AI learning systems; similar issues arise, for example, in deciding how to deal with large, expensive chunks in SOAR.
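The trade-off can be made concrete with a simple cost model. The linear formula below is an illustrative assumption, not PRODIGY's actual utility estimate: expected time saved per problem minus the cost of testing the rule.

def rule_utility(avg_savings, application_freq, match_cost):
    # avg_savings: time saved when the rule applies; application_freq: how often
    # it applies; match_cost: time spent testing the rule during problem solving.
    return avg_savings * application_freq - match_cost

def keep_rule(avg_savings, application_freq, match_cost):
    return rule_utility(avg_savings, application_freq, match_cost) > 0

# A rule that rarely applies but is expensive to match gets "forgotten".
print(keep_rule(avg_savings=2.0, application_freq=0.05, match_cost=0.5))   # False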


Learning by Induction

Classification is the process of assigning, to a particular input, the name of a class to which it belongs. The classes from which the classification procedure can choose can be described in a variety of ways. Their definition will depend on the use to which they will be put.

Classification is an important component of many problem-solving tasks. In its simplest form, it is presented as a straightforward recognition task. An example of this is the question "What letter of the alphabet is this?" But often classification is embedded inside another operation. To see how this can happen, consider a problem-solving system that contains the following production rule:

If: the current goal is to get from place A to place B, and

there is a WALL separating the two places

then: look for a DOORWAY in the WALL and go through it.

To use this rule successfully, the system's matching routine must be able to identify an object as a wall. Without this, the rule can never be invoked. Then, to apply the rule, the system must be able to recognize a doorway.

Before classification can be done, the classes it will use must be defined. This can be done in a variety of ways, including:

Isolate a set of features that are relevant to the task domain. Define each class by a weighted sum of values of these features. Each class is then defined by a scoring function that looks very similar to the scoring functions often used in other situations, such as game playing. Such a function has the form:

c1t1 + c2t2 + c3t3 + ...

Each t corresponds to a value of a relevant parameter, and each c represents the weight to be attached to the corresponding t. Negative weights can be used to indicate features whose presence usually constitutes negative evidence for a given class.

For example, if the task is weather prediction, the parameters can be such measurements as rainfall and location of cold fronts. Different functions can be written to combine these parameters to predict sunny, cloudy, rainy, or snowy weather.
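A sketch of such class-specific scoring functions, using the weather example, appears below. The features, weights, and classes are made-up illustrative values, not taken from any real forecasting system.

def score(weights, features):
    # The weighted sum c1*t1 + c2*t2 + ... for one class.
    return sum(c * t for c, t in zip(weights, features))

# Features: [rainfall_mm, cold_front_distance_km, cloud_cover_fraction]
class_weights = {
    "sunny":  [-0.5, 0.01, -2.0],
    "cloudy": [ 0.1, 0.00,  1.5],
    "rainy":  [ 0.8, -0.01, 1.0],
}

def classify(features):
    # Pick the class whose scoring function gives the highest value.
    return max(class_weights, key=lambda c: score(class_weights[c], features))

print(classify([12.0, 50.0, 0.9]))   # "rainy" with these toy weights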


Isolate a set of features that are relevant to the task domain. Define each class as a structure composed of those features. For example, if the task is to identify animals, the body of each type of animal can be stored as a structure, with various features representing such things as color, length of neck, and feathers.

There are advantages and disadvantages to each of these general approaches. The statistical approach taken by the first scheme presented here is often more efficient than the structural approach taken by the second. But the second is more flexible and more extensible.

Regardless of the way that classes are to be described, it is often difficult to construct, by hand, good class definitions. This is particularly true in domains that are not well understood or that change rapidly. Thus the idea of producing a classification program that can evolve its own class definitions is appealing. This task of constructing class definitions is called concept learning, or induction. The techniques used for this task must, of course, depend on the way that classes (concepts) are described. If classes are described by scoring functions, then concept learning can be done using the technique of coefficient adjustment. If, however, we want to define classes structurally, some other technique for learning class definitions is necessary. In this section, we present such techniques.

Winston's Learning Program

Winston describes an early structural concept-learning program. This program operated in a simple blocks world domain. Its goal was to construct representations of the definitions of concepts in the blocks domain. For example, it learned the concepts: House, Tent, and Arch shown in Figure 7.2. The figure also shows an example of a near miss for each concept. A near miss is an object that is not an instance of the concept in question but that is very similar to such instances.

The program started with a line drawing of a blocks world structure. It used procedures to analyze the drawing and construct a semantic net representation of the structural description of the object(s). This structural description was then provided as input to the learning program. An example of such a structural description for the House of Figure 7.2 is shown in Figure 7.3(a). Node A represents the entire structure, which is composed of two parts: node B, a Wedge, and node C, a Brick. Figures 7.3(b) and 7.3(c) show descriptions of the two Arch structures of Figure 7.2. These descriptions are identical except for the types of the objects on the top, one is a Brick while the other is a Wedge. Notice that the two supporting objects are related not only by left-of and right-of links, but also by a does-not-marry link, which says that the two objects do not marry. Two objects marry if they have faces that touch, and they have a common edge. The marry relation is critical in the definition of an Arch. It is the difference between the first arch structure and the near miss arch structure shown in Figure 7.2.


Figure 7.2: Some Blocks World Concepts

The basic approach that Winston's program took to the problem of concept formation can be described as follows:

1. Begin with a structural description of one known instance of the concept. Call that description the concept definition.

2. Examine descriptions of other known instances of the concept. Generalize the definition to include them.

3. Examine descriptions of near misses of the concept. Restrict the definition to exclude these.

Steps 2 and 3 of this procedure can be interleaved.

Steps 2 and 3 of this procedure rely heavily on a comparison process by which similarities and differences between structures can be detected. This process must function in much the same way as does any other matching process, such as one to determine whether a given production rule can be applied to a particular problem state. Because differences as well as similarities must be found, the procedure must perform not just literal but also approximate matching. The output of the comparison procedure is a skeleton structure describing the commonalities between the two input structures. It is annotated with a set of comparison notes that describe specific similarities and differences between the inputs.

To see how this approach works, we trace it through the process of learning what an arch is. Suppose that the arch description of Figure 7.3(b) is presented first. It then becomes the definition of the concept Arch. Then suppose that the arch description of Figure 7.3(c) is presented. The comparison routine will return a structure similar to the two input structures except that it will note that the objects represented by the nodes labeled C are not identical. This structure is shown as Figure 7.4. The c-note link from node C describes the difference found by the comparison routine. It notes that the difference occurred in the isa link, and that in the first structure the isa link pointed to Brick, and in the second it pointed to Wedge. It also notes that if we were to follow isa links from Brick and Wedge, these links would eventually merge. At this point, a new description of the concept Arch can be generated. This description could say simply that node C must be either a Brick or a Wedge. But since this particular disjunction has no previously known significance, it is probably better to trace up the isa hierarchies of Brick and Wedge until they merge. Assuming that that happens at the node Object, the Arch definition shown in Figure 7.5 can be built.


Figure 7.3: The Structural Description

Figure 7.4


Figure 7.5: The Arch Description after Two Examples

Next, suppose that the near miss arch is presented. This time, the comparison routine will note that the only difference between the current definition and the near miss is in the does-not-marry link between nodes B and D. But since this is a near miss, we do not want to broaden the definition to include it. Instead, we want to restrict the definition so that it is specifically excluded. To do this, we modify the link does-not-marry, which may simply be recording something that has happened by chance to be true of the small number of examples that have been presented. It must now say must-not-marry. Actually, must-not-marry should not be a completely new link. There must be some structure among link types to reflect the relationships between marry, does-not-marry, and must-not-marry.

Notice how the problem-solving and knowledge representation techniques we covered in earlier chapters are brought to bear on the problem of learning. Semantic networks were used to describe block structures, and an isa hierarchy was used to describe relationships among already known objects. A matching process was used to detect similarities and differences between structures, and hill climbing allowed the program to evolve a more and more accurate concept definition.

This approach to structural concept learning is not without its problems. One major problem is that a teacher must guide the learning program through a carefully chosen sequence of examples. In the next section, we explore a learning technique that is insensitive to the order in which examples are presented.


Goal

Winston's Arch Learning Program models how we humans learn a concept such as "What is an Arch?"

This is done by presenting to the "student" (the program) examples and counter examples (which are however very similar to examples). The latter are called "near misses."

The program builds (Winston's version of) a semantic network. With every example and with every near miss, the semantic network gets better at predicting what exactly an arch is.

Learning Procedure W

1. Let the description of the first sample, which must be an example, be the initial description.

2. For all other samples (a simplified sketch of the SPECIALIZE and GENERALIZE steps follows the lists below):


- If the sample is a near miss, call SPECIALIZE

- If the sample is an example, call GENERALIZE

Specialize

• Establish a matching between parts of the near miss and parts of the model

• IF there is a single most important difference THEN:

– If the model has a link that is not in the near miss, convert it into a MUST link

– If the near miss has a link that is not in the model, create a MUST-NOT link; ELSE ignore the near miss.

Generalize

1. Establish a matching between parts of the sample and the model.

2. For each difference:

– If the link in the model points to a super/sub-class of the link in the sample, a MUST-BE link is established at the join point.

– If the links point to classes of an exhaustive set, drop the link from the model.

– Otherwise create a new common class and a MUST-BE arc to it.

– If a link is missing in model or sample, drop it in the other.

– If numbers are involved, create an interval that contains both numbers.

– Otherwise, ignore the difference.
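A drastically simplified sketch of these GENERALIZE and SPECIALIZE steps is given below. Descriptions are flattened here into dictionaries mapping a (part, link) pair to a value, the isa hierarchy is a toy one, and "single most important difference" is approximated by "exactly one link present on one side only"; none of this is Winston's actual representation.

ISA_PARENT = {"Brick": "Object", "Wedge": "Object"}     # toy isa hierarchy

def common_ancestor(a, b):
    seen = set()
    while a:
        seen.add(a)
        a = ISA_PARENT.get(a)
    while b and b not in seen:
        b = ISA_PARENT.get(b)
    return b

def generalize(model, example):
    for key in set(model) | set(example):
        if model.get(key) != example.get(key):
            if key in model and key in example:
                # differing classes: climb the isa links to a MUST-BE join point
                model[key] = ("MUST-BE", common_ancestor(model[key], example[key]))
            else:
                # link missing in model or sample: drop it in the other
                model.pop(key, None)
    return model

def specialize(model, near_miss):
    diffs = list(set(model) ^ set(near_miss))       # links present on one side only
    if len(diffs) == 1:                             # single most important difference
        k = diffs[0]
        if k in model:
            model[k] = ("MUST", model[k])           # e.g. supports becomes a MUST link
        else:
            model[k] = ("MUST-NOT", near_miss[k])
    return model

arch = {("top", "isa"): "Brick", ("left", "supports"): "top", ("right", "supports"): "top"}
arch = generalize(arch, {("top", "isa"): "Wedge",
                         ("left", "supports"): "top", ("right", "supports"): "top"})
arch = specialize(arch, {("top", "isa"): "Wedge",
                         ("right", "supports"): "top"})   # near miss: left support missing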


Because the only difference is the missing support link, and the sample is a near miss, that arc must be important.

New Model

This uses SPECIALIZE. (Actually it uses it twice; this is an exception.)

However, the following principles may be applied.

1. The teacher must use near misses and examples judiciously.

2. You cannot learn if you cannot know (=> KR)

3. You cannot know if you cannot distinguish relevant from irrelevant features. (=> Teachers can help here, by providing near misses.)


4. No-Guessing Principle: When there is doubt what to learn, learn nothing.

5. Learning Conventions called "Felicity Conditions" help. The teacher knows that there should be only one relevant difference, and the Student uses it.

6. No-Altering Principle: If an example fails the model, create an exception. (Penguins are birds).

7. Martin's Law: You can't learn anything unless you almost know it already.

Learning Rules from Examples

Given are 8 instances:

1. DECCG

2. AGDBC

3. GCFDC

4. CDFDE

5. CEGEC

(Comment: "These are good")

1. CGCGF

2. DGFCD

3. ECGCD

(Comment: "These are bad")

The following rule perfectly describes the warnings:

(M(1,C) M(2,G) M(3,C) M(4,G) M(5,F))

(M(1,D) M(2,G) M(3,F) M(4,C) M(5,D))

(M(1,E) M(2,C) M(3,G) M(4,C) M(5,D) ) ->Warning

Now we eliminate the first conjunct from all 3 disjuncts:

(M(2,G) M(3,C) M(4,G) M(5,F) )

(M(2,G) M(3,F) M(4,C) M(5,D))

(M(2,C) M(3,G) M(4,C) M(5,D) ) -> Warning

Now we check whether any of the good strings 1 - 5 would cause a warning.

1, 4, 5 are out at the first position. (Actually it is the second now). 2, 3 are out at the next (third) position.

Thus, I don't really need the first conjunct in the Warning rule!

I keep repeating this dropping business, until I am left with

M(5,F) V M(5,D) -> Warning


This rule contains all the knowledge contained in the examples!
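The dropping procedure just traced can be sketched as follows. The representation (a disjunct is a dictionary mapping a 1-based position to a required letter) and the left-to-right order of attempted drops are assumptions made only to mirror the worked example.

GOOD = ["DECCG", "AGDBC", "GCFDC", "CDFDE", "CEGEC"]
BAD  = ["CGCGF", "DGFCD", "ECGCD"]

def matches(disjunct, s):
    return all(s[pos - 1] == letter for pos, letter in disjunct.items())

def fires(rule, s):
    return any(matches(d, s) for d in rule)

# Start with one disjunct per bad string, using all five positions.
rule = [{i + 1: ch for i, ch in enumerate(s)} for s in BAD]

# Try dropping positions 1..5; keep the drop only if no good string then fires.
for pos in range(1, 6):
    candidate = [{p: l for p, l in d.items() if p != pos} for d in rule]
    if not any(fires(candidate, g) for g in GOOD):
        rule = candidate

print(rule)   # [{5: 'F'}, {5: 'D'}, {5: 'D'}], i.e. M(5,F) V M(5,D) -> Warning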

Note that this elimination approach is order dependent!! Working from right to left you would get:

(M(1,C) M(2,G)) (M(1,D) M(2,G)) M(1,E) -> Warning

Therefore the resulting rule is order dependent! There are N! (factorial!!!) orders we could try to delete features! Also there is no guarantee that the rules will work correctly on new instances. In fact rules are not even guaranteed to exist!

To deal with the N! problem, try to find "higher level features". For instance, if 2 features always occur together, replace them by one feature.

Student Activity 7.2

Before reading next section, answer the following questions.

1. What do you mean by learning by induction? Explain with an example.

2. Describe learning by macro-operators.

3. Explain why a good knowledge representation is useful in learning.

If your answers are correct, then proceed to the next section.


Explanation-Based Learning

The previous section illustrated how we can induce concept descriptions from positive and negative examples. Learning complex concepts using these procedures typically requires a substantial number of training instances. But people seem to be able to learn quite a bit from single examples. Consider a chess player who, as Black, has reached the position shown in Figure 7.6. The position is called a "fork" because the white knight attacks both the black king and the black queen. Black must move the king, thereby leaving the queen open to capture. From this single experience, Black is able to learn quite a bit about the fork trap: the idea is that if any piece x attacks both the opponent's king and another piece y, then piece y will be lost. We don't need to see dozens of positive and negative examples of fork positions in order to draw these conclusions. From just one experience, we can learn to avoid this trap in the future and perhaps to use it to our own advantage.

What makes such single-example learning possible? The answer, not surprisingly, is knowledge. The chess player has plenty of domain-specific knowledge that can be brought to bear, including the rules of chess and any previously acquired strategies. That knowledge can be used to identify the critical aspects of the training example. In the case of the fork, we know that the double simultaneous attack is important while the precise position and type of the attacking piece is not.

Much of the recent work in machine learning has moved away from the empirical, data-intensive approach described in the last section toward this more analytical, knowledge-intensive approach. A number of independent studies led to the characterization of this approach as explanation-based learning. An EBL system attempts to learn from a single example x by explaining why x is an example of the target concept. The explanation is then generalized, and the system's performance is improved through the availability of this knowledge.

Figure 7.6: A Fork Position in Chess

Mitchell et al. and DeJong and Mooney both describe general frameworks for EBL programs and give general learning algorithms. We can think of EBL programs as accepting the following as input:

1. A Training Example - what the learning program "sees" in the world

2. A Goal Concept - a high-level description of what the program is supposed to learn

3. An Operationality Criterion - a description of which concepts are usable

4. A Domain Theory - a set of rules that describe relationships between objects and actions in a domain

From this, EBL computes a generalization of the training example that is sufficient to describe the goal concept, and also satisfies the operationality criterion. Let's look more closely at this specification. The training example is a familiar input-it is the same thing as the example in the version space algorithm. The goal concept is also familiar, but in previous sections, we have viewed the goal concept as an output of the program, not an input. The assumption here is that the goal concept is not operational, just like the high-level card-playing advice. An EBL program seeks to operationalize the goal concept by expressing it in terms that a problem-solving program can understand. These terms are given by the operationality criterion. In the chess example, the goal concept might be something like "bad position for Black," and the operationalized concept would be a generalized description of situations similar to the training example, given in terms of pieces and their relative positions. The last input to an EBL program is a domain theory, in our case, the rules of chess. Without such knowledge, it is impossible to come up with a correct generalization of the training example.

Explanation-based generalization (EBG) is an algorithm for EBL described in Mitchell et al. It has two steps: (1) explain and (2) generalize. During the first step, the domain theory is used to prune away all the unimportant aspects of the training example with respect to the goal concept. What is left is an explanation of why the training example is an instance of the goal concept. This explanation is expressed in terms that satisfy the operationality criterion. The next step is to generalize the explanation as far as possible while still describing the goal concept. Following our chess example, the first EBL step chooses to ignore White's pawns, king, and rook, and constructs an explanation consisting of White's knight, Black's king, and Black's queen, each in their specific positions. Operationality is ensured: all chess-playing programs understand the basic concepts of piece and position. Next, the explanation is generalized. Using domain knowledge, we find that moving the pieces to a different part of the board is still bad for Black. We can also determine that other pieces besides knights and queens can participate in fork attacks.

In reality, current EBL methods run into difficulties in domains as complex as chess, so we will not pursue this example further. Instead, let's look at a simpler case. Consider the problem of learning the concept Cup. Unlike the arch-learning program, we want to be able to generalize from a single example of a cup. Suppose the example is:

Training Example:

owner(Object23, Ralph)
has-part(Object23, Concavity12)
is(Object23, Light)
color(Object23, Brown)
...

Clearly, some of the features of Object23 are more relevant to its being a cup than others. So far in this unit we have seen several methods for isolating relevant features. All these methods require many positive and negative examples. In EBL we instead rely on domain knowledge, such as:

Domain Knowledge:

is(x, Light) has-part(x, y) isa(y, Handle) -> liftable(x)

has-part(x, y) isa(y, Bottom) is(y, Flat) -> stable(x)

has-part(x, y) isa(y, Concavity) is(y, Upward-Pointing) -> open-vessel(x)

We also need a goal concept to operationalize:

Goal Concept: Cup

x is a Cup if x is liftable, stable, and open-vessel.

Operationality Criterion: Concept definition must be expressed in purely structural terms (e.g., Light, Flat, etc.).

Given a training example and a functional description, we want to build a general structural description of a cup. The first step is to explain why Object23 is a cup. We do this by constructing a proof, as shown in Figure 7.7. Standard theorem-proving techniques can be used to find such a proof. Notice that the proof isolates the relevant features of the training example; nowhere in the proof do the predicates owner and color appear. The proof also serves as a basis for a valid generalization. If we gather up all the assumptions and replace constants with variables, we get the following description of a cup:

has-part(x, y) isa(y, Concavity) is(y, Upward-Pointing)

has-part(x, z) isa(z, Bottom) is(z, Flat)

has-part(x, w) isa(w, Handle) is(x, Light)


This definition satisfies the operationality criterion and could be used by a robot to classify objects.
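The operationalized rule can be written directly as a structural test. The sketch below assumes a flat (predicate, argument, argument) fact format and invents part names such as Handle16 and Bottom19 to complete the training example; a real EBL system would produce this rule by generalizing the proof of Figure 7.7, not by hand.

def parts_of(facts, x):
    return [t[2] for t in facts if t[0] == "has-part" and t[1] == x]

def is_cup(facts, x):
    # The generalized structural description: light, with a handle, a flat
    # bottom, and an upward-pointing concavity.
    has = lambda *f: f in facts
    liftable = has("is", x, "Light") and any(has("isa", y, "Handle")
                                             for y in parts_of(facts, x))
    stable = any(has("isa", y, "Bottom") and has("is", y, "Flat")
                 for y in parts_of(facts, x))
    open_vessel = any(has("isa", y, "Concavity") and has("is", y, "Upward-Pointing")
                      for y in parts_of(facts, x))
    return liftable and stable and open_vessel

object23 = {
    ("owner", "Object23", "Ralph"),       # irrelevant to the proof, ignored by the rule
    ("color", "Object23", "Brown"),       # irrelevant to the proof, ignored by the rule
    ("is", "Object23", "Light"),
    ("has-part", "Object23", "Handle16"),    ("isa", "Handle16", "Handle"),
    ("has-part", "Object23", "Bottom19"),    ("isa", "Bottom19", "Bottom"),
    ("is", "Bottom19", "Flat"),
    ("has-part", "Object23", "Concavity12"), ("isa", "Concavity12", "Concavity"),
    ("is", "Concavity12", "Upward-Pointing"),
}
print(is_cup(object23, "Object23"))       # True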

Figure 7.7: An Explanation

Simply replacing constants by variables worked in this example, but in some cases it is necessary to retain certain constants. To catch these cases, we must reprove the goal. This process, which we saw earlier in our discussion of learning in STRIPS, is called goal regression.

As we have seen, EBL depends strongly on a domain theory. Given such a theory, why are examples needed at all? We could have operationalized the goal concept Cup without reference to an example, since the domain theory contains all of the requisite information. The answer is that examples help to focus the learning on relevant operationalizations. Without an example cup, EBL is faced with the task of characterizing the entire range of objects that satisfy the goal concept. Most of these objects will never be encountered in the real world, and so the result will be overly general.

Providing a tractable domain theory is a difficult task. There is evidence that humans do not learn with very primitive relations. Instead, they create incomplete and inconsistent domain theories. For example, returning to chess, such a theory might include concepts like "weak pawn structure." Getting EBL to work in ill-structured domain theories is an active area of research.

EBL shares many features of all the learning methods described in earlier sections. Like concept learning, EBL begins with a positive example of some concept. As in learning by advice taking, the goal is to operationalize some piece of knowledge. And EBL techniques, like the techniques of chunking and macro-operators, are often used to improve the performance of problem-solving engines. The major difference between EBL and other learning methods is that EBL programs are built to take advantage of domain knowledge. Since learning is just another kind of problem solving, it should come as no surprise that there is leverage to be found in knowledge.


Learning Automata

The theory of learning automata was first introduced in 1961 (Tsetlin, 1961). Since that time these systems have been studied intensely, both analytically and through simulations (Lakshmivarahan, 1981). Learning automata systems are finite-state adaptive systems which interact iteratively with a general environment. Through a probabilistic trial-and-error response process they learn to choose or adapt to a behaviour that produces the best response. They are, essentially, a form of weak, inductive learners.

In Figure 7.8, we see that the learning model for learning automata has been simplified to just two components, an automaton (learner) and an environment. The learning cycle begins with an input to the learning automata system from the environment. This input elicits one of a finite number of possible responses from the automaton, and the environment in turn provides some form of feedback on that response. The automaton uses this feedback to alter its stimulus-response mapping structure so as to improve its behaviour in a more favourable way.

As a simple example, suppose a learning automaton is being used to learn the best temperature control setting for your office each morning. It may select any one of ten temperature range settings at the beginning of each day (Figure 7.9). Without any prior knowledge of your temperature preferences, the automaton randomly selects a first setting using the probability vector corresponding to the temperature settings.

Figure 7.8: Learning Automaton Model

Figure 7.9: Temperature Control Model

Since the probability values are uniformly distributed, any one of the settings will be selected with equal likelihood. After the selected temperature has stabilized, the environment may respond with a simple good-bad feedback response. If the response is good, the automaton will modify its probability vector by rewarding the probability corresponding to the good setting with a positive increment and reducing all other probabilities proportionately to maintain the sum equal to 1. If the response is bad, the automaton will penalize the selected setting by reducing the probability corresponding to the bad setting and increasing all other values proportionately. This process is repeated each day until the good selections have high probability values and all bad choices have values near zero. Thereafter, the system will always choose the good settings. If, at some point in the future, your temperature preferences change, the automaton can easily readapt.
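A linear reward-penalty scheme is one standard way to implement this update. The sketch below uses that scheme with made-up step sizes and a hypothetical preference test, so it is an illustration rather than any particular published algorithm's exact constants.

import random

N = 10                               # ten temperature range settings
p = [1.0 / N] * N                    # initial uniform probability vector

def select_setting():
    return random.choices(range(N), weights=p)[0]

def update(chosen, good, a=0.1, b=0.1):
    # Reward pushes the chosen setting's probability toward 1 and shrinks the rest;
    # penalty pushes it toward 0 and raises the rest; the vector always sums to 1.
    for i in range(N):
        if good:
            p[i] = p[i] + a * (1.0 - p[i]) if i == chosen else (1.0 - a) * p[i]
        else:
            p[i] = (1.0 - b) * p[i] if i == chosen else b / (N - 1) + (1.0 - b) * p[i]

setting = select_setting()
user_liked_it = setting in (6, 7)    # hypothetical preference for settings 6 and 7
update(setting, user_liked_it)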

Learning automata have been generalized and studied in various ways. One such generalization has been given the special name of collective learning automata (CLA). CLAs are standard learning automata systems except that feedback is not provided to the automaton after each response. In this case, several collective stimulus-response actions occur before feedback is passed to the automaton. It has been argued (Bock, 1976) that this type of learning more closely resembles that of human beings in that we usually perform a number or group of primitive actions before receiving feedback on the performance of such actions, such as solving a complete problem on a test or parking a car. We illustrate the operation of CLAs with an example of learning to play the game of Nim in an optimal way.

Nim is a two-person zero-sum game in which the players alternate in removing tokens from an array that initially has nine tokens. The tokens are arranged into three rows with one token in the first row, three in the second row, and five in the third row (Figure 7.10).

Figure 7.10: Nim Initial Configuration

The first player must remove at least one token but not more than all the tokens in any single row. Tokens can only be removed from a single row during each player's move. The second player responds by removing one or more tokens remaining in any row. Players alternate in this way until all tokens have been removed; the loser is the player forced to remove the last token.

We will use the triple (n1, n2, n3) to represent the state of the game at a given time, where n1, n2 and n3 are the numbers of tokens in rows 1, 2, and 3, respectively. We will also use a matrix to determine the moves made by the CLA for any given state. The matrix of Figure 7.11 has column headings which correspond to the state of the game when it is the CLA's turn to move, and row headings which correspond to the new game state after the CLA has completed a move. Fractional entries in the matrix are transition probabilities used by the CLA to execute each of its moves. Asterisks in the matrix represent invalid moves.

Beginning with the initial state (1, 3, 5), suppose the CLA's opponent removes two tokens from the third row, resulting in the new state (1, 3, 3). If the CLA then removes all three tokens from the second row, the resultant state is (1, 0, 3). Suppose the opponent now removes all remaining tokens from the third row. This leaves the CLA with a losing configuration of (1, 0, 0).

Figure 7.11: CLA Internal Representation of Game States

At the start of the learning sequence, the matrix is initialized such that the elements in each column are equal (uniform) probability values. For example, since there are eight valid moves from the state (1, 3, 4), each element in the column under this state that corresponds to a valid move is given the initial probability value 1/8.

The CLA selects moves probabilistically using the probability values in each column. So, for example, if the CLA had the first move, any row intersecting with the first column not containing an asterisk would be chosen with probability 1/9. This choice then determines the new game state from which the opponent must select a move. The opponent might have a similar matrix to record game states and choose moves. A complete game is played before the CLA is given any feedback, at which time it is informed whether its responses were good or bad. This is the collective feature of the CLA.

If the CLA wins a game, all moves made by the CLA during that game are rewarded by increasing the probability value in each column corresponding to the winning move. All non-winning probabilities in those columns are reduced equally to keep the sum in each column equal to 1. If the CLA loses a game, the moves leading to that loss are penalized by reducing the probability values corresponding to each losing move. All other probabilities in the columns having a losing move are increased equally to keep the column totals equal to 1.
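A rough sketch of this collective update is given below; the dictionary layout of the transition matrix and the value of the increment delta are assumptions made for illustration.

def update_columns(matrix, moves, won, delta=0.05):
    # matrix: {state: {next_state: probability}}, one dictionary per column.
    # moves: the (state, next_state) pairs the CLA actually played in the game just completed.
    for state, next_state in moves:
        column = matrix[state]
        if won:
            column[next_state] = min(column[next_state] + delta, 1.0)   # reward the winning move
        else:
            column[next_state] = max(column[next_state] - delta, 0.0)   # penalize the losing move
        total = sum(column.values())
        if total > 0:
            for s in column:                  # rescale so the column still sums to 1
                column[s] /= total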

After a number of games have been played by the CLA, the matrix elements that correspond to repeated wins will increase toward one, while all other elements in the column will decrease toward zero. Consequently, the CLA will choose the winning moves more frequently and thereby improve its performance.

Simulated games between a CLA and various types of opponents have been performed and the results plotted (Bock, 1985). It was shown, for example, that two CLAs playing against each other required about 300 games before each learned to play optimally. Note, however, that convergence to optimality can be accomplished with fewer games if the opponent always plays optimally (or poorly), since, in such a case, the CLA will repeatedly lose (win) and quickly reduce (increase) the losing (winning) move elements to zero (one). It is also possible to speed up the learning process through the use of other techniques such as learned heuristics.

Learning systems based on the learning automaton or CLA paradigm are fairly general for applications in which a suitable state representation scheme can be found. They are also quite robust learners. In fact, it has been shown that an LA will converge to an optimal distribution under fairly general conditions if the feedback is accurate with probability greater than 0.5 (Narendra and Thathachar, 1974). Of course, the rate of convergence is strongly dependent on the reliability of the feedback.

Learning automata are not very efficient learners as was noted in the game-playing example above. They are, however, relatively easy to implement, provided the number of states is not too large. When the number of states becomes large, the amount of storage and the computation required to update the transition matrix becomes excessive.

Potential applications for learning automata include adaptive telephone routing and control. Such applications have been studied using simulation programs (Narendra et al., 1977).

Genetic Algorithms

Genetic algorithm learning methods are based on models of natural adaptation and evolution. These learning systems improve their performance through processes that model population genetics and survival of the fittest. They have been studied since the early 1960s (Holland, 1962, 1975).

In the field of genetics, a population is subjected to an environment that places demands on the members. The members that adapt well are selected for mating and reproduction. The offspring of these better performers inherit genetic traits from both their parents. Members of this second generation of offspring, which also adapt well, are then selected for mating and reproduction and the evolutionary cycle continues. Poor performers die off without leaving offspring. Good performers produce good offspring and they, in turn, perform well. After some number of generations, the resultant population will have adapted optimally or at least very well to the environment.

Genetic algorithm systems start with a fixed-size population of data structures that are used to perform some given tasks. After requiring the structures to execute the specified tasks some number of times, the structures are rated on their performance, and a new generation of data structures is then created. The new generation is created by mating the higher-performing structures to produce offspring. These offspring and their parents are then retained for the next generation while the poorer-performing structures are discarded. The basic cycle is illustrated in Figure 7.12.

Mutations are also performed on the best performing structures to insure that the full space of possible structures is reachable. This process is repeated for a number of generations until the resultant population consists of only the highest performing structures.

Data structures that make up the population can represent rules or any other suitable types of knowledge structure. To illustrate the genetic aspects of the problem, assume for simplicity that the populations of structures are fixed-length binary strings such as the eight-bit string 11010001. An initial population of these eight-bit strings would be generated randomly or with the use of heuristics at time zero. These strings, which might be simple condition-action rules, would then be assigned some tasks to perform (like predicting the weather based on certain physical and geographic conditions or diagnosing a fault in a piece of equipment).

Figure 7.12: Genetic Algorithm

After multiple attempts at executing the tasks, each of the participating structures would be rated and tagged with a utility value, u, commensurate with its performance. The next population would then be generated using the higher performing structures as parents and the process would be repeated with the newly produced generation. After many generations the remaining population structures should perform the desired tasks well.

Mating between two strings is accomplished with the crossover operation which randomly selects a bit position in the eight-bit string and concatenates the head of one parent to the tail of the second parent to produce the offspring. Suppose the two parents are designated as xxxxxxxx and yyyyyyyy respectively, and suppose the third bit position has been selected as the crossover point (at the position of the colon in the structure xxx: xxxxx). After the crossover operation is applied, two offspring are then generated, namely xxxyyyyy and yyyxxxxx. Such offspring and their parents are then used to make up the next generation of structures.
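A minimal Python sketch of the crossover operation as just described, with the crossover point chosen at random:

import random

def crossover(parent1, parent2):
    # Pick a random crossover point, then join the head of one parent to the tail of the other.
    point = random.randint(1, len(parent1) - 1)
    child1 = parent1[:point] + parent2[point:]    # xxxyyyyy when point = 3
    child2 = parent2[:point] + parent1[point:]    # yyyxxxxx
    return child1, child2

offspring = crossover("11010001", "00111010")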

A second genetic operation often used is called inversion. Inversion is a transformation applied to a single string. A bit position is selected at random, and when applied to a structure, the inversion operation concatenates the tail of the string to the head of the same string. Thus, if the sixth position were selected (x1x2x3x4x5x6x7x8), the inverted string would be x7x8x1x2x3x4x5x6.

A third operator, mutation, is used to insure that all locations of the rule space are reachable, that is, that every potential rule in the rule space is available for evaluation. This insures that the selection process does not get caught at a local optimum. For example, it may happen that use of the crossover and inversion operators will only produce a set of structures that are better than all their local neighbors but not optimal in a global sense, since crossover and inversion may not be able to produce some undiscovered structures. The mutation operator can overcome this by simply selecting any bit position in a string at random and changing it. This operator is used only infrequently, to prevent random wandering in the search space.
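The inversion and mutation operators can be sketched in the same style; the random choice of position follows the description above.

import random

def inversion(s):
    # Concatenate the tail of the string to its head: x1..x6 x7x8 becomes x7x8 x1..x6.
    point = random.randint(1, len(s) - 1)
    return s[point:] + s[:point]

def mutation(s):
    # Flip one randomly chosen bit.
    i = random.randrange(len(s))
    return s[:i] + ("1" if s[i] == "0" else "0") + s[i + 1:]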

The genetic paradigm is best understood through an example. To illustrate similarities between the learning automaton paradigm and the genetic paradigm, we use the same learning task as in the previous section, namely learning to play the game of Nim optimally. We use a slightly different representation scheme here since we want a population of structures that are easily transformed. To do this, we let each member of the population consist of a pair of triples augmented with a utility value u, ((n1, n2, n3) (m1, m2, m3) u), where the first triple is the game state presented to the genetic algorithm system prior to its move, and the second triple is the state after the move. The u values represent the worth or current utility of the structure at any given time.

Before the game begins, the genetic system randomly generates an initial population of K triple-pair members. The population size K is one of the important parameters that must be selected. Here, we simply assume it is about 25 or 30, which should be more than the number of moves needed for any optimal play. All members are assigned an initial utility value of 0. The learning process then proceeds as follows:

1. The environment presents a valid triple to the genetic system.

2. The genetic system searches for all population triple pairs that have a first triple that matches the input triple. From those that match, the first one found having the highest utility value u is selected, and its second triple is returned as the new game state. If no match is found, the genetic system randomly generates a triple which represents a valid move, returns this as the new state, and stores the triple pair (the input triple and the newly generated triple) as an addition to the population.

3. The above two steps are repeated until the game is terminated, in which case the genetic system is informed whether a win or loss occurred. If the system wins, each of the participating member moves has its utility value increased. If the system loses, each participating member has its utility value decreased.

4. The above steps are repeated until a fixed number of games have been played. At this time a new generation is created.

The new generation is created from the old population by first selecting a fraction (say one half) of the members having the highest utility values. From these, offspring are obtained by application of appropriate genetic operators.

The three operators, crossover, inversion, and mutation, randomly modify the parent moves to give new offspring move sequences. (The best choice of genetic operators to apply in this example is left as an exercise). Each offspring inherits a utility value from one of the parents. Population members having low utility values are discarded to keep the population size fixed.

This whole process is repeated until the genetic system has learned all the optimal moves. This can be determined when the parent population ceases to change or when the genetic system repeatedly wins.

The similarity between the learning automaton and genetic paradigms should be apparent from this example. Both rely on random move sequences, and the better moves are rewarded while the poorer ones are penalized.

Neural Network

Neural networks are large networks of simple processing elements or nodes that process information dynamically in response to external inputs. The nodes are simplified models of neurons. The knowledge in a neural network is distributed throughout the network in the form of internode connections and weighted links that form the inputs to the nodes. The link weights serve to enhance or inhibit the input stimuli values that are then added together at the nodes. If the sum of all the inputs to a node exceeds some threshold value T, the node fires and produces an output that is passed on to other nodes or is used to produce some output response. In the simplest case, no output is produced if the total input is less than T. In more complex models, the output will depend on a nonlinear activation function.

Neural networks were originally inspired as being models of the human nervous system. They are greatly simplified models to be sure (neurons are known to be fairly complex processors). Even so, they have been shown to exhibit many "intelligent" abilities, such as learning, generalization, and abstraction.

A single node is illustrated in Figure 7.13. The inputs to the node are the values x1, x2, ..., xn, which typically take on values of -1, 0, 1, or real values within the range (-1, 1). The weights w1, w2, ..., wn correspond to the synaptic strengths of a neuron. They serve to increase or decrease the effects of the corresponding xi input values. The sum of the products xi wi, i = 1, 2, ..., n, serves as the total combined input to the node. If this sum is large enough to exceed the threshold amount T, the node fires and produces an output y, an activation function value placed on the node's output links. The output may then be the input to other nodes or the final output response from the network.
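The node computation just described can be written directly; the particular input, weight, and threshold values below are illustrative only.

def node_output(inputs, weights, threshold):
    # Fire (output 1) only if the weighted sum of the inputs exceeds the threshold T.
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > threshold else 0

y = node_output([1, -1, 0], [0.4, -0.3, 0.9], threshold=0.5)   # sum = 0.7, so the node fires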

Figure 7.13: Model of a single neuron (node)

Figure 7.14 illustrates three layers of a number of interconnected nodes. The first layer serves as the input layer, receiving inputs from some set of stimuli. The second layer (called the hidden layer) receives inputs from the first layer and produces a pattern of inputs to the third layer, the output layer. The pattern of outputs from the final layer is the network's response to the input stimuli patterns. Input links to layer j (j = 1, 2, 3) have weights wij for i = 1, 2, ..., n.

General multilayer networks having n nodes (number of rows) in each of m layers (number of columns of nodes) will have weights represented as an n x m matrix W. Using this representation, nodes having no interconnecting links will have a weight value of zero. Networks consisting of more than three layers would, of course, be correspondingly more complex than the network depicted in Figure 7.14.

A neural network can be thought of as a black box that transforms the input vector x to the output vector y where the transformation performed is the result of the pattern of connections and weights, that is, according to the values of the weight matrix W.

Consider the vector product

x . w = x1w1 + x2w2 + ... + xnwn

There is a geometric interpretation for this product. It is equivalent to projecting one vector onto the other vector in n-dimensional space. This notion is depicted in Figure 7.15 for the two-dimensional case. The magnitude of the resultant product is given by

x . w = |x| |w| cos θ

where |x| denotes the norm or length of the vector x. Note that this product is maximum when both vectors point in the same direction, that is, when θ = 0. The product is a minimum when both point in opposite directions, that is, when θ = 180 degrees. This illustrates how the vectors in the weight matrix W influence the inputs to the nodes in a neural network.

Figure 7.14: A Multilayer Neural Network

Figure 7.15: Vector Multiplication is Like Vector Projection

Learning pattern weights

The interconnections and weights W in the neural network store the knowledge possessed by the network. These weights must be preset or learned in some manner. When learning is used, the process may be either supervised or unsupervised. In the supervised case, learning is performed by repeatedly presenting the network with an input pattern and a desired output response. The training examples then consist of the vector pairs (x, y'), where x is the input pattern and y' is the desired output response pattern. The weights are then adjusted until the difference between the actual output response y and the desired response y' is negligible, that is, until D = y - y' is near zero.

One of the simpler supervised learning algorithms uses the following formula to adjust the weights W:

W_new = W_old + a (y' - y) x

where 0 < a < 1 is a learning constant that determines the rate of learning. When the difference D is large, the adjustment to the weights W is large, but when the output response y is close to the target response y', the adjustment will be small. When the difference D is near zero, the training process terminates, at which point the network will produce the correct response for the given input pattern x.
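A minimal sketch of this supervised adjustment for a single linear node follows; treating the node's response as the plain weighted sum y = W . x and the particular value of the learning constant are assumptions made for illustration.

def train_supervised(samples, n_inputs, a=0.1, epochs=100):
    # samples: list of (x, y_target) pairs; the node's response is taken to be y = W . x.
    w = [0.0] * n_inputs
    for _ in range(epochs):
        for x, y_target in samples:
            y = sum(wi * xi for wi, xi in zip(w, x))       # actual response
            d = y - y_target                               # D = y - y'
            w = [wi - a * d * xi for wi, xi in zip(w, x)]  # large |D| gives a large adjustment
    return w

# Example: learn to reproduce a simple linear response from a few training pairs.
data = [((1, 0), 0.5), ((0, 1), -0.2), ((1, 1), 0.3)]
w = train_supervised(data, n_inputs=2)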

In unsupervised learning, the training examples consist of the input vectors x only. No desired response y' is available to guide the system. Instead, the learning process must find the weights W with no knowledge of the desired output response.

A neural net expert system

An example of a simple neural network diagnostic expert system has been described by Stephen Gallant (1988). This system diagnoses and recommends treatments for acute sarcophagal disease. The system is illustrated in Figure 7.16. From six symptom variables u1, u2, ..., u6, one of two possible diseases, u7 or u8, can be diagnosed. From the resultant diagnosis, one of three treatments, u9, u10, or u11, can then be recommended.

Figure 7.16: A Simple Neural Network Expert System.

When a given symptom is present, the corresponding variable is given a value of +1 (true). Negative symptoms are given an input value of -1 (false), and unknown symptoms are given the value 0. Input symptom values are multiplied by their corresponding weights. Numbers within the nodes are initial bias weights wi0, and numbers on the links are the other node input weights. When the sum of the weighted products of the inputs exceeds 0, an output will be present on the corresponding node output and serve as an input to the next layer of nodes.

As an example, suppose the patient has swollen feet (u1 = +1) but not red ears (u2 = -1) nor hair loss (u3 = -1). This gives a value of u7 = +1 (since 0 + (2)(1) + (-2)(-1) + (3)(-1) = 1 > 0), suggesting the patient has superciliosis.

When it is also known that the other symptoms of the patient are false (u5 = u6 = -1), it may be concluded that namatosis is absent (u8 = -1), and therefore that birambio (u10 = +1) should be prescribed while placibin should not be prescribed (u9 = -1). In addition, it will be found that posiboost should also be prescribed (u11 = +1).

The training algorithm added the intermediate triangular shaped nodes. These additional nodes are needed so that weight assignments can be made which permit the computations to work correctly for all training instances.

Deductions can be made just as well when only partial information is available. For example, when a patient has swollen feet and suffers from hair loss, it may be concluded that the patient has superciliosis, regardless of whether or not the patient has red ears. This is so because the unknown variable cannot force the sum to become negative.
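The superciliosis deduction can be reproduced with the same weighted-sum rule. The bias 0 and link weights 2, -2, and 3 are the values quoted in the example above; the rest of the network layout is assumed for illustration.

def unit(bias, weights, inputs):
    # A unit in the diagnostic network: output +1 if the biased weighted sum is positive, else -1.
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else -1

# swollen feet = +1, red ears = -1, hair loss = -1 (an unknown symptom would be 0)
u7 = unit(0, [2, -2, 3], [1, -1, -1])    # 0 + 2 + 2 - 3 = 1 > 0, so superciliosis is diagnosed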

A system such as this can also explain how or why a conclusion was reached. For example, when inputs and outputs are regarded as rules, an output can be explained as the conclusion to a rule. If placibin is true, the system might explain why with a statement such as:

Placibin is TRUE due to the following rule:

IF Placibin Allergy (u6) is FALSE, and Superciliosis is TRUE

THEN Conclude Placibin is TRUE.

Expert systems based on neural network architectures can be designed to possess many of the features of other expert system types, including explanations for how and why, and confidence estimation for variable deduction.

Top

Learning in Neural Networks

The perceptron, an invention of Rosenblatt [1962], was one of the earliest neural network models. A perceptron models a neuron by taking a weighted sum of its inputs and sending the output 1 if the sum is greater than some adjustable threshold value (otherwise it sends 0). Figure 7.17 shows the device. Notice that in a perceptron, unlike a Hopfield network, connections are unidirectional.

The inputs (x1, x2, ..., xn) and connection weights (w1, w2, ..., wn) in the figure are typically real values, both positive and negative. If the presence of some feature xi tends to cause the perceptron to fire, the weight wi will be positive; if the feature xi inhibits the perceptron, the weight wi will be negative. The perceptron itself consists of the weights, the summation processor, and the adjustable threshold processor. Learning is a process of modifying the values of the weights and the threshold. It is convenient to implement the threshold as just another weight w0, as in Figure 7.18. This weight can be thought of as the propensity of the perceptron to fire irrespective of its inputs. The perceptron of Figure 7.18 fires if the weighted sum is greater than zero.

A perceptron computes a binary function of its input. Several perceptrons can be combined to compute more complex functions, as shown in Figure 7.19.
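Written out, the computation of a single perceptron with its threshold folded in as the weight w0 on a constant input x0 = 1 looks like this (the sample numbers are arbitrary):

def perceptron(x, w):
    # x and w include the constant input x0 = 1 and the threshold weight w0.
    g = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if g > 0 else 0

out = perceptron([1, 0.5, -0.2], [-0.1, 0.7, 0.3])   # g = 0.19 > 0, so the perceptron fires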

Figure 7.17: A Neuron and a Perceptron

Figure 7.18: Perceptron with Adjustable Threshold Implemented as Additional Weight

Figure 7.19: A Perceptron with Many Inputs and Many Outputs

Such a group of perceptrons can be trained on sample input-output pairs until it learns to compute the correct function. The amazing property of perceptron learning is this: Whatever a perceptron can compute, it can learn to compute. We demonstrate this in a moment. At the time perceptrons were invented, many people speculated that intelligent systems could be constructed out of perceptrons (see Figure 7.20).

Since the perceptrons of Figure 7.19 are independent of one another, they can be separately trained. So let us concentrate on what a single perceptron can learn to do. Consider the pattern classification problem shown in Figure 7.21. This problem is linearly separable, because we can draw a line that separates one class from another. Given values for x1 and x2, we want to train a perceptron to output 1 if it thinks the input belongs to the class of white dots and 0 if it thinks the input belongs to the class of black dots. We have no explicit rule to guide us; we must induce a rule from a set of training instances. We now see how perceptrons can learn to solve such problems.

First, it is necessary to take a close look at what the perceptron computes. Let x be an input vector (x1, x2, ..., xn). Notice that the weighted summation function g(x) and the output function o(x) can be defined as:

g(x) = w0 + w1x1 + w2x2 + ... + wnxn

o(x) = 1 if g(x) > 0, and o(x) = 0 otherwise

Consider the case where we have only two inputs (as in Figure 7.22). Then:

g(x) = w0 + w1x1 + w2x2

Figure 7.20: An Early Notion of an Intelligent System Built from Trainable Perceptrons

If g(x) is exactly zero, the perceptron cannot decide whether to fire. A slight change in inputs could cause the device to go either way. If we solve the equation g(x) = 0, we get the equation for a line:

x2 = -(w1/w2) x1 - w0/w2

The location of the line is completely determined by the weights w0, w1, and w2. If an input vector lies on one side of the line, the perceptron will output 1; if it lies on the other side, the perceptron will output 0. A line that correctly separates the training instances corresponds to a perfectly functioning perceptron. Such a line is called a decision surface. In perceptrons with many inputs, the decision surface will be a hyperplane through the multidimensional space of possible input vectors. The problem of learning is one of locating an appropriate decision surface.

We present a formal learning algorithm later. For now, consider the informal rule:

If the perceptron fires when it should not fire, make each wi smaller by an amount proportional to xi. If the perceptron fails to fire when it should fire, make each wi larger by a similar amount.

Suppose we want to train a three-input perceptron to fire only when its first input is on. If the perceptron fails to fire in the presence of an active x1, we will increase w1 (and we may increase other weights). If the perceptron fires incorrectly, we will end up decreasing weights that are not w1. (We will never decrease w1, because undesired firings only occur when x1 is 0, which forces the proportional change in w1 also to be 0.) In addition, w0 will find a value based on the total number of incorrect firings versus incorrect misfirings. Soon, w1 will become large enough to overpower w0, while w2 and w3 will not be powerful enough to fire the perceptron, even in the presence of both x2 and x3.

Figure 7.21: A Linearly Separable Pattern Classification Problem

Now let us return to the functions g(x) and o(x). While the sign of g(x) is critical in determining whether the perceptron will fire, the magnitude is also important. The absolute value of g(x) tells how far a given input vector lies from the decision surface. This gives us a way of characterizing how good a set of weights is. Let w be the weight vector (w0, w1, ..., wn), and let X be the subset of training instances misclassified by the current set of weights. Then define the perceptron criterion function, J(w), to be the sum of the distances of the misclassified input vectors from the decision surface:

J(w) = Σ |w . x|, summed over the misclassified inputs x in X

To create a better set of weights than the current set, we would like to reduce J(w). Ultimately, if all inputs are classified correctly, J(w) = 0.

How do we go about minimizing J(w)? We can use a form of local-search hill climbing known as gradient descent. For our current purposes, think of J as defining a surface in the space of all possible weights. Such a surface might look like the one in Figure 7.22.

In the figure, weight w0 should be part of the weight space, but it is omitted here because it is easier to visualize J in only three dimensions. Now, some of the weight vectors constitute solutions, in that a perceptron with such a weight vector will classify all its inputs correctly. Note that there are an infinite number of solution vectors. For any solution vector w, we know that J(w) = 0. Suppose we begin with a random weight vector that is not a solution vector. We want to slide down the J surface. There is a mathematical method for doing this: we compute the gradient of the function J(w). Before we derive the gradient function, we reformulate the perceptron criterion function to remove the absolute value sign. Let t(x) = 1 if the perceptron should fire for input x and t(x) = -1 if it should not; for a misclassified x, the weighted sum w . x has the opposite sign to t(x), so |w . x| = -t(x)(w . x) and we can write:

J(w) = Σ -t(x)(w . x), summed over the misclassified inputs x in X

Figure 7.22: Adjusting the Weights by Gradient Descent, Minimizing J

Recall that X is the set of misclassified input vectors.

Now, here is ∇J, the gradient of J with respect to the weight space:

∇J(w) = Σ -t(x) x, summed over the misclassified inputs x in X

The gradient is a vector that points in the direction of steepest increase of J, so to reduce J we move the weights in the opposite direction. In order to find a solution weight vector, we simply change the weights in the direction of the negative gradient, recompute J, recompute the new gradient, and iterate until J = 0. The rule for updating the weights at time t is:

w(t+1) = w(t) - η ∇J(w(t))

Or in expanded form:

w(t+1) = w(t) + η Σ t(x) x, summed over the misclassified inputs x in X

η is a scale factor that tells us how far to move at each step. A small η will lead to slower learning, while a very large η may cause the search to overshoot the solution vectors. Taking η to be a constant gives us what is usually called the "fixed-increment perceptron-learning algorithm"; a minimal code sketch of the whole procedure follows the numbered steps:

1. Create a perceptron with n+1 inputs and n+1 weights, where the extra input x0 is always set to 1.

2. Initialize the weights (w0, w1, ..., wn) to random real values.

3. Iterate through the training set, collecting all examples misclassified by the current set of weights.

4. If all examples are classified correctly, output the weights and quit.

5. Otherwise, compute the vector sum S of the misclassified input vectors, where each vector has the form (x0, x1, ..., xn). In creating the sum, add to S a vector x if x is an input for which the perceptron incorrectly fails to fire, but add the vector -x if x is an input for which the perceptron incorrectly fires. Multiply the sum by the scale factor η.

6. Modify the weights (w0, w1, ..., wn) by adding the elements of the vector S to them. Go to step 3.
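A compact sketch of steps 1 to 6 follows; the training-data format, the value of the scale factor, and the epoch limit are assumptions added for illustration.

import random

def train_perceptron(samples, n, eta=1.0, max_epochs=1000):
    # samples: list of (x, target) pairs, x of length n, target 0 or 1.
    # An extra input x0 = 1 carries the threshold weight w0 (step 1).
    w = [random.uniform(-1, 1) for _ in range(n + 1)]            # step 2
    for _ in range(max_epochs):
        s = [0.0] * (n + 1)
        misclassified = 0
        for x, target in samples:                                # step 3
            xv = [1.0] + list(x)
            fired = 1 if sum(wi * xi for wi, xi in zip(w, xv)) > 0 else 0
            if fired == target:
                continue
            misclassified += 1
            sign = 1 if target == 1 else -1                      # step 5: +x if it failed to fire, -x if it fired wrongly
            s = [si + sign * xi for si, xi in zip(s, xv)]
        if misclassified == 0:                                   # step 4
            return w
        w = [wi + eta * si for wi, si in zip(w, s)]              # steps 5-6: scale by eta and add to the weights
    return w

# Learning the AND function, which is linearly separable.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights = train_perceptron(data, n=2)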

The perceptron-learning algorithm is a search algorithm. It begins in a random initial state and finds a solution state. The search space is simply all possible assignments of real values to the weights of the perceptron, and the search strategy is gradient descent.

So far, we have seen two search methods employed by neural networks, gradient descent in perceptrons and parallel relaxation in Hopfield networks. It is important to understand the relation between the two. Parallel relaxation is a problem-solving strategy, analogous to state space search in symbolic AI. Gradient descent is a learning strategy, analogous to techniques such as version spaces. In both symbolic and connectionist AI, learning is viewed as a type of problem solving, and this is why search is useful in learning. But the ultimate goal of learning is to get a system into a position where it can solve problems better. Do not confuse learning algorithms with problem-solving algorithms.

The perceptron convergence theorem, due to Rosenblatt [1962], guarantees that the perceptron will find a solution state, i.e., it will learn to classify any linearly separable set of inputs. In other words, the theorem shows that in the weight space, there are no local minima that do not correspond to the global minimum. Figure 7.23 shows a perceptron learning to classify the instances of Figure 7.21. Remember that every set of weights specifies some decision surface, in this case some two-dimensional line. In the figure, k is the number of passes through the training data, i.e., the number of iterations of steps 3 through 6 of the fixed-increment perceptron-learning algorithm.

The introduction of perceptrons in the late 1950s created a great deal of excitement: here was a device that strongly resembled a neuron and for which well-defined learning algorithms were available. There was much speculation about how intelligent systems could be constructed from perceptron building blocks. In their book Perceptrons, Minsky and Papert put an end to such speculation by analyzing the computational capabilities of the devices. They noticed that while the convergence theorem guaranteed correct classification of linearly separable data, most problems do not supply such nice data. Indeed, the perceptron is incapable of learning to solve some very simple problems.

Figure 7.23: A Perceptron Learning to Solve a Classification Problem

Consider the exclusive-or (XOR) function of two inputs. The perceptron cannot learn a linear decision surface to separate the different outputs of this function, because no such decision surface exists. No single line can separate the "1" outputs from the "0" outputs. Minsky and Papert gave a number of problems with this property, including telling whether a line drawing is connected, and separating figure from ground in a picture. Notice that the deficiency here is not in the perceptron-learning algorithm, but in the way the perceptron represents knowledge.

If we could draw an elliptical decision surface, we could encircle the two "1" outputs in the XOR space. However, perceptrons are incapable of modeling such surfaces. Another idea is to employ two separate line-drawing stages. We could draw one line to isolate the point (1, 1) and then another line to divide the remaining three points into two categories. Using this idea, we can construct a "multilayer" perceptron (a series of perceptrons) to solve the problem.

Note how the output of the first perceptron serves as one of the inputs to the second perceptron, with a large, negatively weighted connection. If the first perceptron sees the input (1, 1), it will send a massive inhibitory pulse to the second perceptron, causing that unit to output 0 regardless of its other inputs. If either of the inputs is 0, the second perceptron gets no inhibition from the first perceptron, and it outputs 1 if either of the inputs is 1.
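One way to realize this two-stage idea is sketched below. The particular weights are hypothetical, chosen so that the first unit fires only on the input (1, 1) and then inhibits the second, OR-like unit.

def step(x, w, w0):
    # A threshold unit that fires when w0 + w . x > 0.
    return 1 if w0 + sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

def xor(x1, x2):
    hidden = step((x1, x2), (1, 1), -1.5)              # fires only on the input (1, 1)
    return step((x1, x2, hidden), (1, 1, -2), -0.5)    # OR of x1 and x2, strongly inhibited by the hidden unit

outputs = [xor(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]   # [0, 1, 1, 0]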

Backpropagation Networks

As suggested by Figure 7.20 and the Perceptrons critique, the ability to train multilayer networks is an important step in the direction of building intelligent machines from neuronlike components. Let's reflect for a moment on why this is so. Our goal is to take a relatively amorphous mass of neuronlike elements and teach it to perform useful tasks. We would like it to be fast and resistant to damage. We would like it to generalize from the inputs it sees. We would like to build these neural masses on a very large scale, and we would like them to be able to learn efficiently. Perceptrons got us part of the way there, but we saw that they were too weak computationally. So we turn to more complex, multilayer networks.

What can multilayer networks compute? The simple answer is: anything! Given a set of inputs, we can use summation-threshold units as simple AND, OR, and NOT gates by appropriately setting the threshold and connection weights. We know that we can build any arbitrary combinational circuit out of those basic logical units. In fact, if we are allowed to use feedback loops, we can build a general-purpose computer with them.

The major problem is learning. The knowledge representation system employed by neural nets is quite opaque: the nets must learn their own representations because programming them by hand is impossible. Perceptrons had the nice property that whatever they could compute, they could learn to compute. Does this property extend to multilayer networks? The answer is yes, sort of. Backpropagation is a step in that direction.

It will be useful to deal first with a subclass of multilayer networks, namely fully connected, layered, feedforward networks. A sample of such a network is shown in Figure 7.24. In this figure, xi, hi, and oi represent unit activation levels of input, hidden, and output units. Weights on connections between the input and hidden layers are denoted here by w1ij, while weights on connections between the hidden and output layers are denoted by w2ij. This network has three layers, although it is possible and sometimes useful to have more. Each unit in one layer is connected in the forward direction to every unit in the next layer. Activations flow from the input layer through the hidden layer, then on to the output layer. As usual, the knowledge of the network is encoded in the weights on connections between units. In contrast to the parallel relaxation method used by Hopfield nets, backpropagation networks perform a simpler computation. Because activations flow in only one direction, there is no need for an iterative relaxation process. The activation levels of the units in the output layer determine the output of the network.

The existence of hidden units allows the network to develop complex feature detectors, or internal representations. Figure 7.25 shows the application of a three layer network to the problem of recognizing digits. The two-dimensional grid containing the numeral “7” forms the input layer. A single hidden unit might be strongly activated by a horizontal line in the input, or perhaps a diagonal. The important thing to note is that the behaviour of these hidden units is automatically learned, not preprogrammed. In Figure 7.25, the input grid appears to be laid out in two dimensions, but the fully connected network is unaware of this 2-D structure. Because this structure can be important, many networks permit their hidden units to maintain only local connections to the input layer (e.g., a different 4 by 4 subgrid for each hidden unit).

The hope in attacking problems like handwritten character recognition is that the neural network will not only learn to classify the inputs it is trained on but that it will generalize and be able to classify inputs that it has not yet seen. We return to generalization in the next section.

A reasonable question at this point is: “All neural nets seem to be able to do is classification. Hard AI problems such as planning, natural language parsing, and theorem proving are not simply classification tasks, so how do connectionist models address these problems?” Most of the problems we see in this chapter are indeed classification problems, because these are the problems that neural networks are best suited to handle at present. A major limitation of current network formalisms is how they deal with phenomena that involve time. This limitation is lifted to some degree in work on recurrent networks but the problems are still severe. Hence, we concentrate on classification problems for now.

Figure 7.24: A Multilayer Network

Let's now return to backpropagation networks. The unit in a backpropagation network requires a slightly different activation function from the perceptron. A backpropagation unit still sums up its weighted inputs, but unlike the perceptron, it produces a real value between 0 and 1 as output, based on a sigmoid (or S-shaped) function, which is continuous and differentiable, as required by the backpropagation algorithm. Let sum be the weighted sum of the inputs to a unit. The equation for the unit's output is given by:

output = 1 / (1 + e^(-sum))

Notice that if the sum is 0, the output is 0.5 (in contrast to the perceptron, where it must be either 0 or 1). As the sum gets larger, the output approaches 1. As the sum gets smaller, on the other hand, the output approaches 0.

Figure 7.25: Using a Multilayer Network to Learn to Classify Handwritten Digits

Figure 7.26: The Stepwise Activation of the Perceptron and the Sigmoid Activation Function of the Backpropagation Unit (right)

Like a perceptron, a backpropagation network typically starts out with a random set of weights. The network adjusts its weights each time it sees an input-output pair. Each pair requires two stages: a forward pass and a backward pass. The forward pass involves presenting a sample input to the network and letting activations flow until they reach the output layer. During the backward pass, the network's actual output (from the forward pass) is compared with the target output and error estimates are computed for the output units. The weights connected to the output units can be adjusted in order to reduce those errors. We can then use the error estimates of the output units to derive error estimates for the units in the hidden layers. Finally, errors are propagated back to the connections stemming from the input units.

Unlike the perceptron-learning algorithm of the last section, the backpropagation algorithm usually updates its weights incrementally, after seeing each input-output pair. After it has seen all the input-output pairs (and adjusted its weights that many times), we say that one epoch has been completed. Training a backpropagation network usually requires many epochs.

Refer back to Figure 7.25 for the basic structure on which the following algorithm is based.

Algorithm: Backpropagation

Given: A set of input-output vector pairs.

Compute: A set of weights for a three-layer network that maps inputs onto corresponding outputs.

1. Let A be the number of units in the input layer, as determined by the length of the training input vectors. Let C be the number of units in the output layer. Now choose B, the number of units in the hidden layer. As shown in Figure 7.25, the input and hidden layers each have an extra unit used for thresholding; therefore, the units in these layers will sometimes be indexed by the ranges (0, ..., A) and (0, ..., B). We denote the activation levels of the units in the input layer by xi, in the hidden layer by hj, and in the output layer by oj. Weights connecting the input layer to the hidden layer are denoted by w1ij, where the subscript i indexes the input units and j indexes the hidden units. Likewise, weights connecting the hidden layer to the output layer are denoted by w2ij, with i indexing hidden units and j indexing output units.

2. Initialize the weights in the network. Each weight should be set randomly to a number between – 0.1 and 0.1.

3. Initialize the activations of the thresholding units. The values of these thresholding units should never change.

4. Choose an input-output pair. Suppose the input vector is (x1, ..., xA) and the target output vector is (y1, ..., yC). Assign activation levels to the input units.

5. Propagate the activations from the units in the input layer to the units in the hidden layer using the activation function:

hj = 1 / (1 + e^(-Σ w1ij xi))   for all j = 1, ..., B

Note that i ranges from 0 to A. w10j is the thresholding weight for hidden unit j (its propensity to fire irrespective of its inputs). x0 is always 1.0.

6. Propagate the activations from the units in the hidden layer to the units in the output layer:

oj = 1 / (1 + e^(-Σ w2ij hi))   for all j = 1, ..., C

where i ranges from 0 to B. Again, the thresholding weight w20j for output unit j plays a role in the weighted summation. h0 is always 1.0.

7. Compute the errors of the units in the output layer, denoted δ2j. Errors are based on the network's actual output (oj) and the target output (yj):

δ2j = oj (1 - oj) (yj - oj)

8. Compute the errors of the units in the hidden layer, denoted δ1j:

δ1j = hj (1 - hj) Σ δ2i w2ji   (where the sum is taken over i = 1 to C)

9. Adjust the weights between the hidden layer and the output layer. The learning rate is denoted η; its function is the same as in perceptron learning. A reasonable value of η is 0.35:

Δw2ij = η δ2j hi

10. Adjust the weights between the input layer and the hidden layer:

Δw1ij = η δ1j xi

11. Go to step 4 and repeat. When all the input-output pairs have been presented to the network, one epoch has been completed. Repeat steps 4 to 10 for as many epochs as desired.
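A condensed sketch of steps 1 to 11 for a single hidden layer is given below; the network sizes, the training data, and the epoch count are illustrative assumptions, while the learning rate 0.35 and the initial weight range follow steps 2 and 9.

import math, random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def train_backprop(pairs, A, B, C, eta=0.35, epochs=10000):
    # Steps 1-2: weights w1[i][j] (input unit i to hidden unit j) and w2[i][j]
    # (hidden unit i to output unit j); index 0 in each layer is the thresholding
    # unit whose activation is fixed at 1.0.
    w1 = [[random.uniform(-0.1, 0.1) for _ in range(B)] for _ in range(A + 1)]
    w2 = [[random.uniform(-0.1, 0.1) for _ in range(C)] for _ in range(B + 1)]
    for _ in range(epochs):                                     # step 11: repeat for many epochs
        for xs, ys in pairs:                                    # step 4
            x = [1.0] + list(xs)                                # step 3: x0 is the threshold input
            # step 5: input layer -> hidden layer
            h = [1.0] + [sigmoid(sum(x[i] * w1[i][j] for i in range(A + 1)))
                         for j in range(B)]
            # step 6: hidden layer -> output layer
            o = [sigmoid(sum(h[i] * w2[i][j] for i in range(B + 1)))
                 for j in range(C)]
            # step 7: output-layer errors
            d2 = [o[j] * (1 - o[j]) * (ys[j] - o[j]) for j in range(C)]
            # step 8: hidden-layer errors
            d1 = [h[j + 1] * (1 - h[j + 1]) * sum(d2[k] * w2[j + 1][k] for k in range(C))
                  for j in range(B)]
            # step 9: adjust hidden -> output weights
            for i in range(B + 1):
                for j in range(C):
                    w2[i][j] += eta * d2[j] * h[i]
            # step 10: adjust input -> hidden weights
            for i in range(A + 1):
                for j in range(B):
                    w1[i][j] += eta * d1[j] * x[i]
    return w1, w2

# Example: XOR with 0.1/0.9 targets, as suggested for sigmoid outputs.
data = [((0, 0), (0.1,)), ((0, 1), (0.9,)), ((1, 0), (0.9,)), ((1, 1), (0.1,))]
w1, w2 = train_backprop(data, A=2, B=3, C=1)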

The algorithm generalizes straightforwardly to networks of more than three layers. For each extra hidden layer, insert a forward propagation step between steps 6 and 7, an error computation step between steps 8 and 9, and a weight adjustment step between steps 10 and 11. Error computation for hidden units should use the equation in step 8, but with i ranging over the units in the next layer, not necessarily the output layer.

The speed of learning can be increased by modifying the weight modification steps 9 and 10 to include a momentum term α. The weight update formulas become:

Δw2ij(t+1) = η δ2j hi + α Δw2ij(t)
Δw1ij(t+1) = η δ1j xi + α Δw1ij(t)

where the activations and errors are measured at time t+1 and Δwij(t) is the change the weight experienced during the previous forward-backward pass. If α is set to 0.9 or so, learning speed is improved.

Recall that the activation function has a sigmoid shape. Since infinite weights would be required for the actual outputs of the network to reach 0.0 and 1.0, binary target outputs (the yj of steps 4 and 7 above) are usually given as 0.1 and 0.9 instead. The sigmoid is required by backpropagation because the derivation of the weight update rule requires that the activation function be continuous and differentiable.

The derivation of the weight update rule is more complex than the derivation of the fixed-increment update rule for perceptrons, but the idea is much the same. There is an error function that defines a surface over weight space, and the weights are modified so as to move down that surface. Interestingly, the error surface for multilayer nets is more complex than the error surface for perceptrons. One notable difference is the existence of local minima. Recall the bowl-shaped space we used to explain perceptron learning (Figure 7.22). As we modified the weights, we moved in the direction of the bottom of the bowl, and we eventually reached it. A backpropagation network, however, may slide down the error surface into a set of weights that does not solve the problem it is being trained on. If that set of weights is at a local minimum, the network will never reach the optimal set of weights. Thus, we have no analogue of the perceptron convergence theorem for backpropagation networks.

There are several methods of overcoming the problem of local minima. The momentum factor α, which tends to keep the weight changes moving in the same direction, allows the algorithm to skip over small minima. Adjusting the shape of a unit's activation function can also have an effect. In practice, however, local minima turn out to be rare: in large networks, the high-dimensional weight space provides plenty of degrees of freedom for the algorithm, so the lack of a convergence theorem is not a problem in practice. However, this pleasant feature of backpropagation was not discovered until recently, when digital computers became fast enough to support large-scale simulations of neural networks. The backpropagation algorithm was actually derived independently by a number of researchers in the past, but it was discarded as many times because of the potential problems with local minima. In the days before fast digital computers, researchers could judge their ideas only by proving theorems about them, and they had no idea that local minima would turn out to be rare in practice. The modern form of backpropagation is often credited to Werbos [1974], LeCun [1985], Parker [1985], and Rumelhart et al. [1986].

Backpropagation networks are not without real problems, however, with the most serious being the slow speed of learning. Even simple tasks require extensive training periods. The XOR problem, for example, involves only five units and nine weights, but it can require many, many passes through the four training cases before the weights converge, especially if the learning parameters are not carefully tuned. Also, simple backpropagation does not scale up very well. The number of training examples required is superlinear in the size of the network.

Since backpropagation is inherently a parallel, distributed algorithm, the idea of improving speed by building special-purpose backpropagation hardware is attractive. However, fast new variations of backpropagation and other learning algorithms appear frequently in the literature, e.g., Fahlman [1988]. By the time an algorithm is transformed into hardware and embedded in a computer system, the algorithm is likely to be obsolete.

Student Activity 7.3

Answer the following questions.

1. Describe Learning Automation with example.

2. How neural networks are useful in learning?

3. What are perceptrons?

Summary

The most important thing to conclude from the study of automated learning is that learning itself is a problem-solving process.

A learning machine is the dream system of AI. The key to intelligent behavior is having a lot of knowledge.

Getting all of that knowledge into a computer is a staggering task, which is why we would like machines to acquire knowledge independently, as people do.

We do not yet have programs that can extend themselves indefinitely. But we have discovered some of the reasons for our failure to create such systems.

If we look at actual learning programs, we find that the more knowledge a program starts with, the more it can learn. This finding is satisfying, in the sense that it corroborates our other discoveries about the power of knowledge.

Learning automata systems are finite state adaptive systems which interact iteratively with a general environment.

Research in machine learning has gone through several cycles of popularity.

A learning program needs to acquire new knowledge and new problem-solving abilities, but knowledge and problem solving are topics still under intensive study.

Genetic algorithm systems start with a fixed size population of data structures that are used to perform some given tasks. These structures are rated on their performance, and a new generation of data structures is then created

If we do not understand the nature of the thing we want to learn, learning is difficult. Not surprisingly, the most successful learning programs operate in fairly well-understood areas (like planning), and not in less well-understood areas (like natural language understanding).

Self-assessment Questions

Fill in the blanks (Solved)

1. _________ is generally acquired through experience.

2. A back propagation produces a real value between ______ and _______.

Answers

1. Knowledge

2. 0, 1

True or False (Solved)

1. Learning from examples does not involve stimuli from the outside.

2. A perception computes a binary function of its input.

Answers

1. False

2. True

Fill in the blanks (Unsolved)

1. Caching is known as ___________.

2. __________ is a process similar to macro-operators.

3. _____________ is a process of assigning, to a particular input, the name of a class to which it belongs.

True or False (Unsolved)

1. Learning Automata systems are called Rote learning.

2. Sequences of actions that can be treated as a whole are called macro-operators.

3. An EBL system attempts to learn from a set of examples.

Detailed Questions

1. Would it be reasonable to apply Samuel's rote-learning procedure to chess? Why (not)?

2. Consider the problem of building a program to learn a grammar for a language such as English. Assume that such a program would be provided, as input, with a set of pairs, each consisting of a sentence and a representation of the meaning of the sentence. This is analogous to the experience of a child who hears a sentence and sees something at the same time. How could such a program be built using the techniques discussed in this unit?
