
5

Learning

Learning can be defined as any deliberate or directed change in the knowledge structure of a system that allows it to perform better on later repetitions of some given type of task. Learning is an essential component of intelligent behavior: any organism that lives in a complex and changing environment cannot be designed to explicitly anticipate all possible situations, but must be capable of effective self-modification. There are several basic ways to learn, i.e., to add to one's ability to perform or to know about things in the world:

1. Genetically-endowed abilities. Knowledge can be stored in the genes of men or animals, or in the circuits of machines.

2. Supplied information. Someone can demonstrate how to perform an action or can describe or provide facts about an object or situation.

3. Outside evaluation. Someone can tell you when you are proceeding correctly, or when you have obtained the correct information.

4. Experience or observation. One can learn by feedback from his environment; the evaluation is usually done by the learner measuring his approach to some explicit goal.

5. Analogy. New facts or skills can be acquired by transforming and augmenting existing knowledge that bears a strong similarity in some respect to the desired new concept or skill.

In computers, the first two types of learning correspond to providing the machine with a predesigned set of procedures, or adding information to a database. If the supplied information is in the form of advice, rather than a complete procedure, then converting advice to an operational form is a problem in the domain of reasoning rather than learning. Types 3, 4, and 5 in the above list involve "learning on your own," or autonomous learning, and present some of the most challenging problems in our understanding of intelligent behavior. These types of learning are the subject of this chapter.

Autonomous learning largely depends on recognizing when a new problem situation is sufficiently similar to a previous one, so that either a single generalized solution approach can be formulated, or a previous solution can be applied. However, the recognition of similarity is a very complex issue, often involving the perception of analogy between objects or situations. We will first discuss some aspects of animal and human learning, and then representations and techniques for discovering and quantifying similarity (including methods for recognizing known patterns). Finally, techniques for automatic machine learning will be described. We will address the following questions:

• How do animals and people learn?

• How can we quantify the concept of similarity?

• How can we identify a particular situation as being similar to or an instance of an already-learned concept?

• How much do you have to know in order to be able to learn more?

• What are the different modes of learning, and how can such ability be embedded in a computer?

HUMAN AND ANIMAL LEARNING

Complicated behavior patterns can be found in all higher animals. When these behavior patterns are rigid and stereotyped, they are known as instincts. Instincts are sometimes alterable by experience, but only within narrow limits. In higher vertebrates, especially the mammals, behavior becomes increasingly modifiable by learning even though rigid instinctive behavior is still present.

It is now recognized that inheritance and learning are both fundamental in determining the behavior of higher animals, and the contributions of these two elements are inextricably intertwined in most behavior patterns. Inheritance can be regarded as determining the limits within which a particular behavior pattern can be modified, while learning determines, within these limits, the precise character of the behavior pattern. In some cases the limits imposed by inheritance leave little room for modification, while in other cases the inherited limits may be so wide that learning plays the major role in determining behavior.

Experiments conducted by W. H. Thorpe of Cambridge University [Thorpe 65] show some instances of the interaction between inheritance and learning. He raised chaffinches in isolation and found that such birds were unable to sing a normal chaffinch song. Thorpe then demonstrated that young chaffinches raised in isolation and permitted to hear a recording of a chaffinch song when about six months old would quickly learn to sing properly. However, young chaffinches permitted to hear recordings of songs of other species that sing similar songs did not ordinarily learn to sing these other songs. Apparently chaffinches learn to sing only by hearing other chaffinches; they inherit an ability to recognize and respond only to the songs of their own species.

When chaffinch songs are played backward to the young birds, they respond to these, even though to our ears such a recording does not sound like a chaffinch song.

Thus a chaffinch song is neither entirely inherited nor wholly learned; it is both inherited and learned. Chaffinches inherit the neural and muscular mechanisms responsible for chaffinch song, they inherit the ability to recognize chaffinch song when they hear it, and they inherit severe limits on the type of song they can learn. The experience of hearing another chaffinch sing is necessary to activate, and perhaps somewhat adjust, their inherited singing abilities, and in this sense their song is learned.

Types of Animal Learning

It is possible to classify the many types of animal learning according to the conditions which appear to induce the observed behavior modifications:

Habituation and sensitization. Habituation is the waning of a response as a result of repeated continuous stimulation not associated with any kind of reward or reinforcement. An animal drops those responses that experience indicates are of no value to its goals. Sensitization is an enhancement of an organism's responsiveness to a (typically harmful) stimulus. Instances of both habituation and sensitization can be found in the simplest organisms.

Associative learning. Associating a response with a stimulus with which it was not previously associated. An example of classical conditioning, or associative learning, is Pavlov's experiment in which a dog learned to associate the sound of a bell with food after repeated trials in which the bell always rang just before the food was provided. Another form of conditioned learning, called operant conditioning, or instrumental learning, is illustrated by the Skinner box. A hungry rat is given the opportunity to discover that a pellet of food will be given every time he presses on a bar when a signal light is lit, but no food will be provided when the light is off. The rat quickly learns to press on the bar only when the light is lit. The distinction between classical and operant conditioning is that in classical conditioning the conditioned stimulus (e.g., bell) is always followed by the unconditioned stimulus (e.g., food), regardless of the animal's response; in operant conditioning, reinforcement (e.g., food) is only provided when the animal responds to the conditioning stimulus (e.g., light is lit) in some desired way (e.g., pressing on the bar).

Trial-and-error learning. An association is established between an arbitrary action and the corresponding outcome. Various actions are tried, and future activity profits from the resulting reward or punishment. This type of learning can often be classified as a form of operant conditioning as described above.

Latent learning. Learning that takes place in the absence of a reward. For example, rats that run through a maze unrewarded will learn the maze faster and with fewer errors when finally rewarded (compared to other rats that never encountered the maze before, but received rewards from the start of the learning sessions).

Imprinting. The animal learns to make a strong association with another organism or sometimes an object. The sensitive phase in which learning is possible lasts for a short time at a specific point in the animal's maturation, say a day or two after birth, and once this phase has passed, the animal cannot be imprinted with any other object. For example, it was found that a duckling will learn to follow a large colored box with a ticking clock inside if the box is the first thing it observes after hatching. The duckling preferred the box to its mother!

Imitation. Behavior that is learned by observing the actions of others.

Insight. This type of learning is the ability to respond correctly to a situation significantly different from any previously encountered. The new stimuli may be qualitatively and quantitatively different from previous ones. An example would be (1) a monkey obtaining a bunch of bananas above his reach by moving a box under the bananas and standing on the box (assuming that he had never seen this procedure used before), and (2) subsequently using the box as a tool to obtain objects out of his normal reach. The first part of the monkey's activities involves problem solving, the second involves learning. We note here the difference between learning and problem solving: problem solving involves obtaining a solution to a specific problem, while learning provides an approach to solving a class of problems.

The evolution of anatomical complexity among animals has paralleled a steady advance in learning capability, from simple habituation to associative learning, which appears first as classical conditioning and then as trial-and-error learning. It is as if evolutionary progress in the powers of behavioral adjustment consists of elaborating and coordinating new types of responses that are built upon and act with the simple ones of more primitive animals.

Some remarkable examples of animal learning are presented in Box 5-1.

Piaget's Theory of Human Intellectual Development

A study of the intellectual development of a child can provide insight into learning mechanisms. Based on an extensive series of experiments and informal observations, Piaget [Flavell 63] has formulated a theory of human intellectual development. Piaget views the child as someone who is trying to make sense of the world by discovering the nature of physical objects and their interaction, and the behavior of people and their motivations.


BOX 5-1 Examples of Animal Learning

Visual Learning

Pigeons have been trained to respond to the presence or absence of human images in photographs. The pictures are so varied that simple stimulus characterization seems impossible. The results suggest remarkable powers of conceptualization by pigeons as to what is human.

Monkeys have been trained to select the pictures of three different insects from among pictures of leaves, fruit, and branches of similar size and color.

Counting

A raven learned to open a box that had the same number of spots on its lid as were on a key card. Eventually, the bird was trained to lift only the lid that had the same number of spots on it as the number of objects in front of the box.

Pigeons have been trained to eat only a specific number of grains out of a larger number offered. They have also learned to eat only a specific number N of peas dropped into a cup (one at a time at random intervals ranging from 1 to 30 seconds); the pigeon never sees more than one pea in the cup at a time, but only eats the first N.

A jackdaw learned to open black lids on boxes until it had secured two baits, green lids until it had three, red lids until it had four, and white lids until it had secured five.

A parrot, after being shown four, six, or seven light flashes, is able to take four, six, or seven pieces of irregularly distributed food out of trays. Numerous random changes in the time sequence of visual stimuli did not affect the percentage of correct solutions. After the bird learned this task, the signal of several light flashes was replaced by notes on a flute. The parrot was able to transfer immediately to these new stimuli without further training.

Learning Visual Landmarks

The female digger wasp can efficiently and consistently locate her nest when she returns from hunting. It has been shown that she locates the nest by noting both distant and adjacent landmarks.

When a beehive is moved to a new location, the worker bees coming out on foraging flights will pause and circle in increasing arcs around the new site for a few moments before flying off. The insect learns the relative positions of new landmarks in this manner, so as to be able to find its way back.

The goby can jump from pool to pool at low tide without being stranded on dry land, even though it cannot see the neighboring pools before leaping. The gobies apparently swim over rock depressions at high tide and thereby learn the general features and topography of the limited area around the home pool. They then use this information in making their jumps at low tide.

Thus, Piaget views the child as developing increasingly well-articulated and interrelated representations that are used to interpret the world. Interaction with the world is crucial because when there is sufficient mismatch between the representations and reality, the representations are modified or transformed. Piaget thinks of the child's intellectual development as having four discrete stages:[1]

[1] There has been some controversy as to whether the development occurs in discontinuous jumps or as a single continuous process that can be partitioned into discrete stages. Cunningham [Cunningham 72], in an attempt to formalize intelligence, describes these stages in terms of data structures and operations on these structures.

1. Sensorimotor stage (years 0-2). By discovery through trial-and-error manipulation experiments, children develop a simple cause-and-effect understanding of how they can physically interact with their immediate environment. In particular, they develop an ability to predict the effects on the environment of specific motor actions on their part; for example, they learn to correctly judge spatial relations, identify and pick up objects, learn that objects have permanence (even when they are no longer visible), and learn to move about. They also develop an ability to use signs and facial expressions to communicate.

2. Symbolic-operational or preoperational stage (years 2-7). Children start to develop a symbolic understanding of their environment. They develop an ability to communicate via natural language, to read and write, and form internal representations of the external world; they can now perform mental experiments to predict what will happen in new situations. However, their thinking is still dominated by visual impressions and direct experience; their generalization ability is poor. For example, if 5-year-old children are shown a row of black checkers directly opposite a row containing an equal number of red checkers, they will say that there are an equal number of each. If the red checkers are now spaced closer together, but none are removed, the children say there are more black than red checkers. Their visual comparison of the lengths of the two rows dominates the more abstract notion of numerical equality. The ability of preoperational children to form or to recognize class distinctions is also poor; they tend to categorize objects by single salient features, and assume that if two objects are alike in one important attribute, they must be alike in other (or all) attributes.

3. Concrete-operational stage (years 7-11). Children acquire some of the concepts and general principles which govern cause-and-effect relationships in their direct interaction with the environment. In particular, they develop an understanding of such concepts as invariance, reversibility, and conservation; e.g., they can correctly judge that a volume of water remains constant regardless of the shape of the container into which it is poured. Five-year-old children can follow a known route (e.g., getting home from school), but cannot give adequate directions to someone else, or draw a map. They do not get lost because they know that they must turn at certain recognized locations. The 8-year-old is able to form a conceptual picture of the overall route and can readily produce a corresponding map. Until children reach the age of 11 or 12 years, their method of discovery is generally based on trial-and-error experimentation. They do not formulate and systematically evaluate alternative hypotheses.

4. Formal-operational stage (years 11+). The young adult develops a full ability for symbolic reasoning, and the ability to conceive of possibilities beyond what is present in reality.

Recently, many of the Piaget experiments have been interpreted in terms of information-processing metaphors, seeking explanations as to why children perform better on intellectual tasks as they get older. Basically, the questions are (1) do children think better (because they can hold more information in working memory, can retrieve information faster, and can reason faster), or (2) do they know more (have more knowledge, enabling them to perform tasks more efficiently)? Anderson [Anderson 85] discusses experiments that indicate that both effects are present.

One basic aspect of Piaget's theory is that children develop an understanding of their physical environment by manipulation of objects within it; he believed that language played a much less important role. A challenge to his work is based on the idea that the child's performance in some of his experiments was less a matter of competence than of learning the correct meaning of certain verbal expressions such as "same amount" or "more than."

While Piaget's theory attempts to describe the changes in, and characteristics of, human learning ability, it provides very little insight into the specific mechanisms that underlie such learning ability. Thus it provides very little guidance with respect to the question of how to build machines capable of learning. As Margaret Boden has written [Boden 81]:

One of the weaknesses of Piagetian theory . . . is its lack of specification of detailed procedural mechanisms competent to generate the behavior it describes. . . . Even Piaget's careful description of behaviors . . . is related to uncomfortably vague remarks about the progression of stages, without it being made clear just how one stage (or set of conceptual structures) comes to follow another. . . .

SIMILARITY

Assigning names or labels to objects, events, and situations is one of the most important and recurring themes in the study of intelligent behavior. This pattern-matching problem typically involves measuring the degree of similarity between data describing the given state of the world, and previously stored models. Basically, a common representation for the objects under consideration must be found, and then a metric must be defined relative to this representation quantifying the degree of similarity between any of the objects and models.

The pattern-matching problem arises in many situations. For example, the <IF (condition) THEN (action)> template is a basic method of encoding knowledge within the computer paradigm, and is used in a variety of applications ranging from computer programming to the way rules are structured in expert systems (see Chapter 7). Determining when an existing situation matches the IF condition of a template is a pattern-matching problem. Another example of the pattern-matching problem is trying to find similar symbolic patterns in two logical or algebraic expressions so that they can be combined or simplified. (The unification problem in the predicate calculus, discussed in Chapter 4, is an example of a sophisticated matching problem.) Finally, a significant portion of vision and perception is concerned with recognizing a known sensory pattern (i.e., the pattern recognition problem).


Some similarity problems can require that we find an exact match, while others are satisfied with finding an approximate match. This critical distinction separates the methods based on the symbolic approaches typically employed in the cognitive areas of AI, from those based on measurement (feature) spaces and statistical decision theory that are often employed in the perceptual areas of AI. In both cases, the objects being compared are usually first transformed into some canonical (standard) form to allow the computational procedures to deal with a practical number of primitive elements and relationships.

Note that there is the question of the level of generality to be used in similarity evaluation, even after conversion to a canonical form. For example, if we are comparing two concepts, one of which contains the trigonometric term sine and the other cosine, then we have to move to the more general term trigonometric function in order to obtain a match. Knowing the level of generality to use in a matching operation is often crucial in performing similarity analysis.
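As a concrete illustration, the following minimal sketch (the hierarchy, names, and terms are illustrative, not from the text) matches two terms by climbing a generalization hierarchy until a common level is found:

    # Illustrative generalization hierarchy: each term maps to its parent.
    PARENT = {
        "sine": "trigonometric function",
        "cosine": "trigonometric function",
        "trigonometric function": "transcendental function",
        "exp": "transcendental function",
    }

    def ancestors(term):
        """Return the chain from a term up to its most general ancestor."""
        chain = [term]
        while chain[-1] in PARENT:
            chain.append(PARENT[chain[-1]])
        return chain

    def common_generalization(a, b):
        """Find the least general term at which the two terms match."""
        for candidate in ancestors(a):
            if candidate in ancestors(b):
                return candidate
        return None  # no common level in the hierarchy

    # "sine" and "cosine" match at the level "trigonometric function".
    print(common_generalization("sine", "cosine"))

The earlier the two chains meet, the more specific, and hence the more informative, the match.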

Similarity Based on Exact Match

The exact match problem usually arises in a precise world, such as the game world; e.g., when we try to match a chessboard configuration against a stored set of configurations. Sometimes partial exact matches are sufficient, while at other times a match of the entire description is required.

A representation often used for exact matching is the graph with labeled arcs and nodes, as shown in Fig. 5-1. Mathematically, in the complete match problem one tries to find a mapping (isomorphism) between two graphs so that when a pair of connected nodes in one graph maps to a pair in the other graph, the edge connecting the first pair will map to the edge connecting the second pair. The partial match problem of finding isomorphisms between a graph and a subgraph is computationally more difficult because there are many more combinations to be tested.

FIGURE 5-1
Graph Representation and the Partial Match Problem. The problem is to find instances of the subgraph in the graph (e.g., 9,4,8 corresponds to a,b,c).
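A brute-force sketch of the partial match conveys why the combinatorics grow so quickly: every assignment of subgraph nodes to graph nodes may have to be tried. (The edge lists below are assumptions for illustration; only the figure's noted correspondence of nodes 9,4,8 to a,b,c is taken from the text.)

    from itertools import permutations

    def subgraph_matches(sub_nodes, sub_edges, graph_nodes, graph_edges):
        """Yield every mapping of subgraph nodes onto graph nodes that
        preserves all subgraph edges (undirected, labels ignored)."""
        edge_set = {frozenset(e) for e in graph_edges}
        for target in permutations(graph_nodes, len(sub_nodes)):
            mapping = dict(zip(sub_nodes, target))
            if all(frozenset((mapping[u], mapping[v])) in edge_set
                   for u, v in sub_edges):
                yield mapping

    # Triangle a-b-c sought inside a larger graph containing triangle 9-4-8.
    sub = [("a", "b"), ("b", "c"), ("a", "c")]
    big = [(9, 4), (4, 8), (9, 8), (8, 2), (2, 7), (7, 1)]
    for m in subgraph_matches(["a", "b", "c"], sub, [1, 2, 4, 7, 8, 9], big):
        print(m)   # e.g., {'a': 9, 'b': 4, 'c': 8} and its permutations

The complete match problem fixes the node sets on both sides, which removes the choice of which graph nodes to use and leaves far fewer mappings to test.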

Another common matching problem involves strings of symbols, as shown below:

Find the string "SED" in the string:

"A REPRESENTATION OFTEN USED FOR MATCHING"

As in the graph case, the complete match of two strings is less difficult than trying to match a substring against a string.
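The same point in code (a minimal sketch; the naive scan below illustrates the idea and is not an efficient string-search algorithm):

    def substring_positions(pattern, text):
        """Partial match: try every alignment of the pattern inside the text."""
        return [i for i in range(len(text) - len(pattern) + 1)
                if text[i:i + len(pattern)] == pattern]

    # A complete match is a single aligned comparison: s == t.
    # A substring match must consider every possible alignment:
    print(substring_positions("SED", "A REPRESENTATION OFTEN USED FOR MATCHING"))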


Similarity Based on Approximate Match

Approximate match problems arise in attempting to deal with the physical world. For example, in visual perception the sensed object can differ from the reference object due to noisy measurements, i.e., measurements degraded due to the effects of perspective, illumination, etc. Because many of these differences are at too low a level of resolution to be meaningful, or are irrelevant to our ultimate purpose, we must define a metric for goodness-of-match, rather than expecting to find an exact match. Using this approach, two objects are given the same label if their match score is high enough. In Appendix 9-1 we describe matching of two shapes using a "rubber sheet" representation. We imagine one of the figures to be drawn on a transparent rubber sheet and superimposed on the other; we match the figures by stretching the rubber sheet as required. The measure of goodness of match is based both on the quality of match between individual components of the objects being compared, and the required stretching of the rubber sheet to attain a match between components of the figures.
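In the spirit of that idea (but not the actual Appendix 9-1 procedure), a goodness-of-match score can reward component similarity and penalize the distortion needed to bring components into registration; the weights and threshold below are assumptions:

    def match_score(pairs, alpha=1.0, beta=0.5):
        """Average goodness of match over corresponding components.
        Each pair holds a component similarity in [0, 1] and the amount
        of 'stretch' needed to align that component."""
        return sum(alpha * sim - beta * stretch
                   for sim, stretch in pairs) / len(pairs)

    # Three matched components that align well with little stretching:
    pairs = [(0.9, 0.1), (0.8, 0.2), (0.95, 0.05)]
    score = match_score(pairs)
    print(score, score > 0.6)   # assumed acceptance threshold of 0.6

Two objects receive the same label whenever the score clears the acceptance threshold.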

LEARNING

Unsupervised (unmonitored and undirected) learning is really a paradox: How can the learner determine progress if he does not know what he is supposed to learn from his environment? This situation presents severe problems in attempting to devise programs that can learn. If the designer introduces very specific criteria for success, then the learning program is given too much focus; in fact, an explicit statement of how to evaluate success can provide the machine with just those concepts that were supposed to be learned. Unfortunately, if criteria for success are not provided, then learning is minimal.

We take the view that a system learns to deal with its environment by instantiating given models or by devising new ones. It will be convenient to divide autonomous learning techniques into the following model-based approaches:

• Parameter learning. The learner has been supplied with one or more models of the world, each with parameters of shape, size, etc., and it is the purpose of the learner to select an appropriate model from the given set, and to derive the correct parameters for the chosen model. A model can be as simple as the description of a straight line, and in this case the parameters can be the slope of the line and its distance from some given point in space. Another model could be a decision-making device with weighted inputs, and the parameters to be found are the weights.

• Description learning. Here the learner has been given a vocabulary of names and relationships, and from these he must create a description which forms the desired model.

• Concept learning. This is the most difficult form of learning because the available primitives are no longer adequate. New vocabulary terms must be invented that allow efficient representation of the relevant concepts.


Two basic problems in autonomous learning are (1) the credit assignment problem: how to reward or punish each system component that contributes to an end result, and (2) local evaluation of progress: how to tell when a change is actually progress toward the desired end goal.

Model Instantiation: Parameter Learning

Almost all AI systems with learning ability are limited to learning within a given model or representation. That is, current AI systems do not significantly modify their basic vocabulary (representations). Typically, learning consists of adjusting the parameters of the given model based on statistical techniques, or inserting and deleting connections in a given graphical representation.

Parameter Learning Using an Implicit Model. In the implicit form of parameter learning, the model is not given directly, and may, in fact, never be known. Two examples of this type of learning are described below.

The Threshold Network. Early computer scientists thought of the brain as a loosely organized, randomly interconnected network of relatively simple devices. They felt that many simple elements, suitably connected, could yield interesting complex behavior, and could form a self-organizing system. The system could learn because the random connections were designed to become selectively strengthened or weakened as the system interacted with its environment. A form of implicit model would thus be formed. In the 1960s, the threshold device was studied intensively as the basic element in such a self-organizing system. The name "perceptron" was often used to designate the basic device in which weighted inputs are summed and compared to a threshold. In a typical implementation for visual pattern recognition, a two-dimensional retina is set up so that local regions of the retina can be analyzed by feature detectors, as shown in Fig. 5-2.

Sets or vectors of k retinal values, x = [x1, x2, . . . , xk], are passed to N feature detectors, each of which determines a score fi(x) representing the degree of presence or absence of some specific feature or object in the image. These feature scores[2] are then combined in a weighted vote:

SUM = w1*f1 + w2*f2 + w3*f3 + . . . + wN*fN

SUM is compared to a threshold, T, and if SUM is greater than T, we say that the system responds positively. We want the system to respond positively for one class of objects and negatively for objects not in this class, e.g., positively when it is presented with the character "B," and negatively when presented with "A."

An algorithm for adjustment of the weights (i.e., the training) of such a system is described in Appendix 5-1. Although such training capability is attractive, this type of device has serious innate limitations.

[2] The N feature detectors are typically threshold devices that produce an fi(x) equal to +1 if the feature is present, or a -1 if it is absent.
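The following sketch shows the device together with one common weight-adjustment scheme, the classical perceptron error-correction rule. (The book's own training procedure is in its Appendix 5-1, not reproduced here; the rule below is a standard stand-in, and the toy data are invented.)

    def classify(weights, threshold, features):
        """Threshold device: weighted vote over feature scores (each +1 or -1)."""
        total = sum(w * f for w, f in zip(weights, features))
        return 1 if total > threshold else -1

    def train(samples, n_features, rate=0.1, epochs=100):
        """Classical perceptron rule: adjust weights only on mistakes,
        nudging the weighted sum toward the correct side of the threshold."""
        weights, threshold = [0.0] * n_features, 0.0
        for _ in range(epochs):
            for features, label in samples:   # label +1 (e.g., "B") or -1 ("A")
                if classify(weights, threshold, features) != label:
                    weights = [w + rate * label * f
                               for w, f in zip(weights, features)]
                    threshold -= rate * label
        return weights, threshold

    # Toy, linearly separable training set over two feature detectors:
    samples = [([1, 1], 1), ([1, -1], 1), ([-1, 1], -1), ([-1, -1], -1)]
    w, t = train(samples, 2)
    print([classify(w, t, f) for f, _ in samples])   # [1, 1, -1, -1]

The rule converges for linearly separable data; the connectivity examples discussed below show a case where no choice of weights can succeed.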


FIGURE 5-2
A Threshold Device Used as a Pattern Classifier. Patterns are assigned to one of two categories, labeled +1 or -1. (Feature detectors 1 through N produce scores f1, . . . , fN; the weighted scores w1*f1, . . . , wN*fN are summed by a threshold device whose output is +1 or -1.)

If it must make a global decision about a pattern by examining local features, then there are certain attributes of the pattern that it cannot deduce. For example, connectivity of patterns as shown in the examples of Fig. 5-3 cannot be determined by local measurements (see Minsky [Minsky 67]). Intuitively, one can see the difficulty: different pieces of each pattern are critical for keeping the pattern connected, and there is no single weighting arrangement that can capture the connectivity concept.

FIGURE 5-3
The Connectedness Problem Cannot Be Solved by a Single Threshold Device. (A row of dot patterns, alternately connected and not connected.)

There has recently been renewed interest in threshold networks because of the development of new techniques for obtaining the network parameters (learning algorithms). The techniques are based on "simulated annealing," a statistical mechanics method for solving complex optimization problems. (The name comes from the fact that the procedure is analogous to the use of successively reduced temperature stages in the annealing of metals.) These techniques can be used to find the weights in a multilevel threshold network by considering the weight determination problem to be a minimization problem in many variables. This approach avoids being trapped in local minima by the use of two mechanisms: (1) the adjustments are made randomly, and (2) an adjustable "temperature" parameter, T, controls the probability of acceptance of any change in a variable.

To minimize a function E(xi) of the many variables xi, the minimum E is found by starting with a random set of xi and introducing small random perturbations into these xi. If a perturbation decreases E, it is accepted. If a perturbation increases E, it is accepted with a probability that is a function of the change in E and the parameter T: the probability decreases with the size of the change in E and increases with increasing T. By beginning with a high T there is a high probability of acceptance of a change in xi. T is decreased as the process proceeds, thus decreasing the probability of the acceptance of a bad perturbation. The random acceptance of "bad" perturbations as a function of the change in E and T is the mechanism that drives the system out of local minima.
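A minimal sketch of this procedure (the energy function, cooling schedule, and step size are illustrative assumptions):

    import math
    import random

    def simulated_annealing(energy, x, temp=10.0, cooling=0.95, steps=2000):
        """Minimize energy(x) by random perturbation, accepting some
        energy-increasing moves with probability exp(-delta/T)."""
        best = list(x)
        for _ in range(steps):
            trial = list(x)
            i = random.randrange(len(x))
            trial[i] += random.uniform(-0.5, 0.5)    # small random perturbation
            delta = energy(trial) - energy(x)
            # Always accept improvements; accept bad moves with a
            # probability that shrinks as T is lowered.
            if delta < 0 or random.random() < math.exp(-delta / temp):
                x = trial
                if energy(x) < energy(best):
                    best = list(x)
            temp *= cooling                          # cooling schedule
        return best

    # A bumpy one-variable energy surface with many local minima;
    # the global minimum is near x = 0.
    f = lambda v: v[0] ** 2 + 3 * math.sin(5 * v[0]) ** 2
    print(simulated_annealing(f, [4.0]))

Started in a local basin near x = 4, the occasional acceptance of uphill moves at high temperature lets the search escape toward the global minimum.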

Using this approach, Sejnowski [Sejnowski 85] was able to train a multilayer network to transform natural language text in English to the phonetic code for the sounds. The training set was a large sample of natural language text and the corresponding phonetic codes prepared by linguists. Using the training set, the network weights were adjusted using simulated annealing so that the phonetic codes developed by the network agreed with the codes in the training set prepared by the linguists. The spoken form of the text was obtained by feeding the phonetic code developed by the threshold network into a sound synthesizer. It is rather dramatic to hear the effects of the training process: the output of the network is first a babble of sounds, and then becomes more and more human-sounding as the training process proceeds. After training, the network is able to synthesize the sounds of text not in the training set.

It should be recognized, however, that adjustment of weights to correctly recognize the members of a training set is not the critical issue in building a learning system. What we require is that the corresponding recognition rule generalizes distinctions represented by the training set so as to be applicable to new members of the class. Since the weight-adjustment learning algorithms do not address this question, even successful generalization on a few selected problems provides very little assurance that this type of approach can be successful on a wide range of real problems.

The "Bucket Brigade"ProductionSystem. Holland [Holland 83] intro­duced a learning mechanism (the bucketbrigade algorithm or BBA)which offersan interestingapproach to the problemof howto credit portions of the systemthat are contributingto a solution, thecredit assignmentproblem. He usesthe production rule representation, tobe further discussed in Chapter 7, ofthe form, IF < condition> THEN<action>.

In a production system, all the pro­ductionsinteract with a global list (GL), or


FIGURE 5-4
The Bucket Brigade Approach to Parameter Learning. (Rule 1, "If A then B," posts B to the global list; Rule 2, "If B then C," posts C. Rule 2 pays Rule 1 because B was posted to the global list, and the outside world pays the active Rule 2 because C was posted.)

In the BBA approach, when the <condition> portion of a production is satisfied it does not automatically activate the <action> portion. Rather, the production is allowed to make a bid based on an associated utility parameter (which measures its past effectiveness and its current relevance); only the highest bidding productions are allowed to fire. As shown in Fig. 5-4, the BBA treats each production rule as a middleman in an economic system, and its utility parameter is a measure of its ability to make a profit. Whenever a production fires, it makes a payment by giving part of the value of its utility parameter to its suppliers. (The suppliers of a production P are those other productions that posted data on GL that allowed P's condition portion to be satisfied.) In a similar way, a production receives a payment from its consumers when it helps them satisfy the condition part of their rules.

If a production receives more from its consumers than it pays out to its suppliers, then it has made a profit and its capital (utility parameter) increases. Certain actions produce payoffs from the external environment, and all productions active at such a time share this payoff. The profitability of a production thus depends on its being part of a chain of transactions that ultimately obtains a payoff from the environment, and such that this payoff exceeds the incurred expenses. In this way, a production can receive due credit even though it is far removed from the successful action to which it contributed.
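A drastically simplified sketch of this economy follows (Holland's full algorithm fires many bidders in parallel and includes taxes and message lists; the single-winner rule, bid fraction, and payoff values here are illustrative):

    class Rule:
        """A production whose strength (utility parameter) backs its bids."""
        def __init__(self, name, condition, action, strength=10.0):
            self.name, self.condition, self.action = name, condition, action
            self.strength = strength

    def bucket_brigade_step(rules, global_list, suppliers, bid_frac=0.1):
        """Matched rules bid a fraction of their strength; the winner fires,
        paying its bid to the rule that posted the data it matched on."""
        matched = [r for r in rules if r.condition in global_list]
        if not matched:
            return
        winner = max(matched, key=lambda r: bid_frac * r.strength)
        bid = bid_frac * winner.strength
        winner.strength -= bid                    # winner pays its bid
        supplier = suppliers.get(winner.condition)
        if supplier is not None:
            supplier.strength += bid              # its supplier is rewarded
        global_list.add(winner.action)
        suppliers[winner.action] = winner         # winner supplies its action

    # The chain of Fig. 5-4: rule 1 posts B; rule 2, enabled by B, posts C.
    r1, r2 = Rule("1", "A", "B"), Rule("2", "B", "C")
    gl, suppliers = {"A"}, {}
    for _ in range(2):
        bucket_brigade_step([r1, r2], gl, suppliers)
    if "C" in gl:
        r2.strength += 5.0    # external payoff because C was posted
    print(r1.strength, r2.strength)   # 10.0 14.0: r1 recoups its bid, r2 profits

Repeated over many cycles, strength flows backward along successful chains, so early rules far removed from the final payoff still accumulate credit.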

Parameter Learning Using an Explicit Model. In parameter learning for an explicit model, a pattern or situation is expressed as a list of attribute measurements corresponding to the arguments of the model. The model itself also contains free parameters which may be adjusted to improve performance. Learning consists of changing the free parameters (which can be measurement weights, as in the case of an implicit model) based on observed performance. An example of learning using an explicit model is an adaptive control system. Here the system, e.g., an aircraft, is modeled using equations that describe its dynamic response. Algorithms are provided that adjust the parameters of the dynamic model, based on the measured in-flight performance of the aircraft.

One of the earliest and most effective parameter learning programs for an explicit model is Samuel's checker-playing program [Samuel 67], described in Box 5-2.

Most game-playing programs (including Samuel's) use a game tree representation in which branches at alternate levels of the tree represent successive moves by the player and his opponent, and the nodes represent the resulting game (or board) configuration. Typically, the complete tree is too large to be explicitly represented or evaluated. Instead, a value is assigned to nodes a few levels beyond the current position using an evaluation function that guides the program in the selection of desirable moves to be made. This evaluation function is not guaranteed to provide a correct ranking of potential moves; it usually takes into account the number, nature, and location of the pieces, and other attributes that the designer feels characterize the position.
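A minimal sketch of such an evaluation function and the move choice it supports (the features, weights, and move data are invented; a real program would search several levels down the tree rather than evaluate only immediate successors):

    def evaluate(features, weights):
        """Linear evaluation function: a weighted sum of board features
        such as piece advantage and mobility."""
        return sum(w * f for w, f in zip(weights, features))

    def best_move(moves, weights):
        """Choose the legal move whose resulting position scores highest."""
        return max(moves, key=lambda move: evaluate(move[1], weights))

    # Hypothetical features: (piece advantage, mobility, center control).
    weights = [1.0, 0.3, 0.2]
    moves = [("advance", [0, 4, 1]),    # score 1.4
             ("capture", [1, 2, 0]),    # score 1.6
             ("retreat", [0, 1, 2])]    # score 0.7
    print(best_move(moves, weights)[0])   # "capture"

Learning, in this setting, means adjusting the weights so that the ranking produced by evaluate agrees more often with good play.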

BOX 5-2 Checkers: Parameter Learning Using a Problem-Specific Model

Samuel's checker-playing program is a successful example of parameter learning using a problem-specific model; this program is able to beat all but a few of the best checker players in the world. A game tree representation is used in which each node corresponds to the board configuration after a possible move, and the move to be made is determined after assigning values to the nodes and choosing a legal move leading to the node with the highest score. Since a full exploration of all possible moves requires the evaluation of about 10^40 moves, the approach is to evaluate only a few moves by applying a heuristic evaluation function that assigns scores to the nodes being considered. To improve the performance of the checker player, one can allow it to search farther into the game tree, or one can improve the heuristic evaluation function to obtain a more accurate estimate of the value of each position.

Samuel improved the look-ahead capability of his system by saving every board position encountered during play, along with its most extensive evaluation. To take up little computer storage space, and to retrieve the results rapidly, required the use of clever indexing techniques and taking advantage of board symmetries. When a previously considered board position is encountered in a later game its evaluation score is retrieved from memory rather than recomputed. Improved look-ahead capability comes about as follows: If we look ahead only three levels to a board position P, but P has a previously stored evaluation value that represents a look-ahead of three levels, then we have performed the equivalent of a look-ahead of six levels.

The evaluation function depends on (1) measures of various aspects of the game situation, such as "mobility," (2) the function to be used to combine these measures, and finally (3) the specific weightings that should be used for each measure in the evaluation function. The designer supplies the measurement algorithms and the form of the function to be used, while the weightings are obtained by a "learning" procedure. The system learns to play better by adjusting weights which determine how much each parameter contributes to the evaluation function. Samuel investigated two forms of the evaluation function:

1. A function of the form MOVE_VALUE = SUM(wi*fi), where fi is the numerical value of the ith board feature, and wi is the corresponding weight assigned to that feature. Since each feature independently contributes to the overall score, the weight can be considered as a measure of the importance of that feature. Thus, features with low weights can be discarded.

2. An evaluation function in which features were grouped together to obtain a score by using a look-up table that says: if board feature A has a value of X, and board feature B has a value of Y, . . . , then that group of features has a score of Z. The values assigned to the grouped features in the look-up tables correspond to the weights in the first evaluation function; these are modified by a learning procedure.

Detailed records of games played by experts, book games, were used for training the system, the checker program taking the part of one of the experts. If the program makes the move that the expert made for a particular game situation, the evaluation function was assumed to be correct, since it caused the correct path in the game tree to be taken. If the wrong move was made, then the evaluation function was modified. Various ad hoc weight modification schemes were incorporated into the program. Samuel found that the second evaluation function was far more effective than the first.

The approach of relating specific board positions and corresponding expert responses has an important advantage. Reward and punishment training (learning) procedures can be based on the local situation, rather than on winning or losing the game (because the expert has a global view in mind when making his move). This local indication of ultimate success or failure is not usually available in non-game playing situations.


The problems in designing a good evaluation function are (1) the choice of the components of the evaluation function, (2) the method of combining these components into a composite evaluation function, and (3) how to modify the components and the composite functions as experience is gained in playing the game. (The credit assignment problem arises here: how to assign credit or blame due to something that happened early in the game, since the good or bad results do not show up until much later.)

The third factor is the one that typically constitutes the machine learning to play the game. Note, however, that this learning is crucially dependent on the cleverness of the human designer in solving the first two problems.

Model Construction: Description Models

Programs capable of forming their own description models take an initial set of data as their input, and form a theory or a set of rival theories with respect to the given data. The theories are descriptions or transformations that satisfy the initial data. The program then attempts to generalize these theories to cover additional data. This typically results in a number of admissible theories. Finally, the program chooses the strongest theory from these, i.e., one that best explains the data encountered to date.




Learning analogical relationships. An early investigation of this class of problems by T. G. Evans [Evans 68] still stands as a remarkably insightful work on analogy and the learning process (see Box 5-3). Evans dealt with the machine solution of so-called geometric-analogy intelligence test questions. Each member of this class of problems consists of a set of line drawings labeled Figures A, B, C, C1, C2, . . . .

BOX 5-3 Solving Geometric Analogy Problems

In 1963, Tom Evans [Evans 68] devised a computer program, ANALOGY, which could successfully answer questions of a type found on IQ tests: Figure A is to figure B, as figure C is to which of the following figures? The operation of this geometric analogy program can best be illustrated by a toy example. Suppose we have the patterns shown in Fig. 5-5. In describing figure A, the program would first assign names to each pattern. Thus,

(A1 = +), (A2 = -), and (A3 = 0).

Similarly, in figure B,

(B1 = -), (B2 = +), and (B3 = 0).

FIGURE 5-5
Geometric Analogy Example.

In figure A, analysis would reveal that A2 is to the right of A1, A3 is below A1, and A3 is below A2. Analysis of figure B would show that B2 is to the right of B1, B3 is below B1, and B3 is below B2.

Comparing the descriptions of the two figures, the program would find that A1 = B2, that A2 = B1, and A3 = B3. From this it would deduce that the transformation rule is: interchange A1 and A2 to obtain figure B from figure A.

Suppose we have the patterns shown in Fig. 5-6. We want the program to answer the question: figure A is to figure B as figure C is to which figure (C1, C2, or C3)? Again, the program has to find the correspondences between subfigures in figure C and the subfigures in figures C1, C2, and C3.

After determining these correspondences, the program can apply the transformation found between figures A and B, and will find that both figures C2 and C3 are possible answers. It finally determines that figure C3 is not acceptable because the Z pattern has also been transformed (from bottom to top). Thus, it is sometimes necessary to specify which things remain constant, as well as which things change, in specifying the transformation rule. For example, in the transformation rule relating figures A and B, it is necessary to assert that the bottom pattern does not change its location with respect to the two other patterns. Given this extended transformation rule, figure C2 is selected as the desired solution.

FIGURE 5-6
Analogy Test Patterns.

ANALOGY uses a fixed model provided by the designer (the set of transformation operators), and finds, by exhaustive search, the proper transformation parameters for the various pairs of figures. The approach depends on having a small number of patterns in each figure; otherwise the combinatorial growth of the pattern relationship lists would become excessive.
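The toy example above reduces to a few lines of code. (This sketch indexes patterns by position and assumes each shape occurs once per figure; the real ANALOGY program works from spatial relations such as "inside" and "above," not from indexed slots.)

    def transformation(fig_a, fig_b):
        """For each position in A, find where its shape reappears in B;
        e.g., {1: 2, 2: 1, 3: 3} means positions 1 and 2 are interchanged."""
        return {pos: next(p for p, shape in fig_b.items() if shape == s)
                for pos, s in fig_a.items()}

    def apply_rule(rule, fig):
        """Apply an A-to-B transformation rule to a new figure."""
        return {rule[pos]: shape for pos, shape in fig.items()}

    A = {1: "+", 2: "-", 3: "0"}
    B = {1: "-", 2: "+", 3: "0"}
    rule = transformation(A, B)          # interchange positions 1 and 2

    C = {1: "x", 2: "y", 3: "z"}
    candidates = {"C1": {1: "x", 2: "y", 3: "z"},
                  "C2": {1: "y", 2: "x", 3: "z"}}
    predicted = apply_rule(rule, C)
    print([name for name, fig in candidates.items() if fig == predicted])  # ['C2']

Selecting the candidate that equals the predicted figure mirrors the program's final comparison step.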


The task to be performed can be described by the question, Figure A is related to Figure B, as Figure C is related to which of the following figures? To solve the problem the program must find transformations that convert Figure A into Figure B and also convert Figure C into one of the answer figures. The sequence of steps in the solution is as follows:

• Overlapping subfigures must be identified in line drawings, e.g., two overlapping circles must be recognized as such.

• The relationships between the subfigures in a main figure must be determined, e.g., the dot is inside the square.

• The transformations between subfigures in going from Figure A to Figure B must be found, e.g., the triangle in Figure A corresponds to the small square in Figure B, or the dot inside the square in Figure A is outside the square in Figure B.

• The subfigures in Figure C that are analogous to the subfigures in Figure A must be recognized.

• The program must analyze what figure results when the transformations are applied to Figure C.

• The transformed Figure C must then be compared to the candidate figures.

A description of Evans' ANALOGY program is given in Box 5-3.

Learning descriptions. A study by Winston [Winston 75] deals with developing descriptions of classes of blocks-world objects (see Box 5-4). Given a set of objects identified as to class, known as a training set, the program is to produce (learn) a description of the class. The program is given a set of description primitives and develops a description of the object class as training objects are presented one at a time. The program notes the commonalities between positive instances, and differences between negative instances and the current description. Each example leads to a modified description that ultimately characterizes that class of objects. When a training example is presented that is incompatible with the present description, the program backtracks to a previous description and attempts to modify it so as to be compatible with the new instance.

The final description produced by Winston's system is dependent on the vocabulary provided and on the sequence of examples shown. In addition, the program must assume that the class labeling of the examples is correct, since the modification of the description depends on the class assignments given. Furthermore, some of the examples that are not of the class must differ only to a small extent from those that are in the class; otherwise, the program will not be able to discover the fine distinctions between membership and nonmembership in the class. (For example, in learning what an "arch" is, the program is given the example of a non-arch whose support posts touch, but is otherwise a valid arch. This allows the program to detect this important difference between a non-arch and an arch.)
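A minimal sketch of this near-miss idea (the feature vocabulary and update policy are simplified inventions, not Winston's actual semantic-net procedure):

    def update(description, example, positive):
        """Incremental description learning: a positive example that violates
        a requirement broadens it; a near-miss (a negative example differing
        in exactly one feature) turns that feature into a hard constraint."""
        diffs = [k for k in description
                 if description[k]["value"] not in ("any", example[k])]
        if positive:
            for k in diffs:
                description[k]["value"] = "any"       # generalize
        elif len(diffs) == 1:                          # usable only as a near-miss
            description[diffs[0]]["essential"] = True  # e.g., MUST-NOT-ABUT
        return description

    desc = {"top": {"value": "rectangular", "essential": False},
            "supports_abut": {"value": False, "essential": False}}
    # Near-miss: the supports touch, but the example is otherwise arch-like.
    update(desc, {"top": "rectangular", "supports_abut": True}, positive=False)
    # A valid arch with a triangular top broadens the 'top' requirement.
    update(desc, {"top": "triangular", "supports_abut": False}, positive=True)
    print(desc)

A negative example differing in several features would be ignored here, which is exactly why each training non-example must be a near-miss.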

Learning generalizations of descriptions and procedures. It is possible to consider the problem of generalizing a description (or a procedure) as a search problem in which the space of all possible descriptions is examined to find (learn) the most general description that satisfies a set of training examples. This approach is used in LEX [Mitchell 83].


BOX 5-4 Learning Descriptions Based on (Given) Descriptive Primitives

In 1970, Patrick Winston [Winston 75] devised a learning program which was able to produce descriptions of classes of simple block constructions presented to the program in the form of line drawings. (A specialized subprogram converted these line drawings into a semantic network description.)

The semantic net in Fig. 5-7 indicates that an arch, such as is shown in the figure, consists of three pieces, a, b, and c. Training instances cause the various links to be inserted, e.g., that piece a has the property of being supported by pieces b and c; pieces b and c must not abut; and piece a must be lying down while pieces b and c are standing.

FIGURE 5-7
Semantic Net for the Concept "Arch." (Piece a has property LYING and must be supported by pieces b and c, which must not abut; pieces b and c each have property STANDING.)

FIGURE 5-8
Block Structures That Do Not Form an "Arch."

The must not abut description is deduced by the system only after showing it a non-arch example consisting of a set of blocks in which the supports do abut, as shown in Fig. 5-8. An implicit requirement is that examples of a non-arch should not have characteristics that differ from the existing arches and are not important in defining the arch class. If the top block in an example is curved, and the supports touch, then the program will not know which of the two deviations is causing the blocks to be an instance of a non-arch. Thus, each non-arch used for training must be a near-miss to an actual arch.

If examples are shown to the system in which the top piece is always a rectangular solid, then the program will assume that this is a requirement for an arch. Only after the system is shown an arch consisting of a triangular top piece will the description be broadened to include both rectangular and triangular top pieces.

An approach such as Winston's that tries to find a description that is consistent with all of the training instances will be strongly affected by erroneous data, i.e., data in which a false-positive or false-negative instance has been given. A false-positive causes the description to be more general than it should be, and a false-negative causes the description to be overspecialized.


LEX is a program that learns "rules of thumb," known more formally as heuristics, for solving problems in integral calculus. Basically, a rule indicates that if a certain form of mathematical expression is found, then a specific transformation should be applied to the expression. Sometimes these procedures involve looking up the integration form in a table, while at other times they may involve carrying out operations on the expression to convert it to a more manageable form. The program begins with about 50 operators (procedures) for solving problems in the integral calculus, and it learns a set of successively more general heuristics which constitute a strategy for using these operators (Box 5-5).

The generalization language made available to LEX is crucial since it determines the range of concepts that LEX can describe, and thus learn. For example, the designer has provided a generalization hierarchy that asserts that sin and cos are specific instances of a trigonometric function, and that a trigonometric function is a specific instance of a transcendental function. LEX uses four modules:

1. The problem solver utilizes what­everoperatorsand heuristics arecurrently available to solve a givenpracticeproblem.

2. The critic analyzes the solution treegeneratedbythe problem solver toproduce a set ofpositive and negativetraining instances from which heuris­tics will be inferred. Positive in­stancescorrespond to steps along thebest solution path, and negative in­stances correspond to steps leadingaway from this solution path.

3. The generalizer proposes and refinesheuristics to produce more effective

problem solving behavior on subse­quent problems. It formulates heuris­tics bygeneralizing from the traininginstances provided by the critic.

4. The problem generator generates practice problems that will be informative (i.e., they will lead to training data useful for proposing and refining heuristics), yet easy enough to be solved using existing heuristics.

These modules work together to propose and incrementally refine heuristics from training instances collected over several propose-solve-criticize-generalize learning cycles.
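As a rough illustration of how the four modules compose, the skeleton below passes the modules in as callables; their internals, and the loop itself, are placeholders invented for the sketch, not Mitchell's implementation:

    # Skeleton of LEX's propose-solve-criticize-generalize cycle.
    # The four callables stand in for LEX's modules.

    def lex_learning_loop(generate, solve, criticize, generalize,
                          heuristics, cycles=10):
        for _ in range(cycles):
            problem = generate(heuristics)              # problem generator
            solution_tree = solve(problem, heuristics)  # problem solver
            pos, neg = criticize(solution_tree)         # critic: training data
            heuristics = generalize(heuristics, pos, neg)  # generalizer
        return heuristics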

Unlike most systems that retain only a single description at any given time, LEX describes each partially learned heuristic using a representation (called version space) that provides a way of characterizing all currently plausible descriptions (versions) of the heuristic. The basic ordering relationship used in the version space representation of rules is that of more-specific-than. For example, the precondition (large red circle) is more specific than (large ? circle), which is more specific than (? ? circle), where ? indicates that this condition need not be satisfied. More formally stated, a rule G1 is more-specific-than a rule G2 if and only if the preconditions of G1 match a proper subset of the instances that G2 matches, and provided that the two heuristics make the same recommendation.

LEX stores in version space only the maximally specific and maximally general descriptions that satisfy the training data. Thus the more-specific-than relation (provided as part of LEX's originally given, hierarchically ordered vocabulary) partially orders the space of possible heuristics, provides the basis for their efficient representation, and supplies a prescription for their generalization.
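The ordering itself is easy to state in code. The toy matcher below works over fixed-length attribute patterns with "?" as a wildcard; the position-by-position test is our simplification of the subset-of-instances definition, not LEX's code:

    # Toy illustration of the more-specific-than ordering over patterns
    # of attribute values, where "?" matches anything.

    def matches(pattern, instance):
        return all(p == "?" or p == i for p, i in zip(pattern, instance))

    def more_specific_than(g1, g2):
        # g1 is more-specific-than g2 when g2 matches every instance g1 does;
        # for wildcard patterns this reduces to a position-by-position check.
        return all(q == "?" or p == q for p, q in zip(g1, g2))

    print(matches(("large", "?", "circle"), ("large", "red", "circle")))             # True
    print(more_specific_than(("large", "red", "circle"), ("large", "?", "circle")))  # True
    print(more_specific_than(("?", "?", "circle"), ("large", "?", "circle")))        # False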

Concept Learning

In concept learning, the learning system must develop appropriate concepts (descriptive primitives) for dealing with its environment. Because of its difficulty, little progress of a general nature has been made in this problem area. One of the few significant efforts that both addresses the issue of concept learning, and has achieved some computational success, is AM. The AM program [Lenat 77] runs experiments in number theory and analyzes the results to develop new concepts or theorems in the subject.


BOX 5-5 Version Space: A Representation for Descriptions

The LEX program [Mitchell 83b] deals with the problem of learning rules of thumb to solve calculus integration problems. Each heuristic rule suggests an integration procedure (operator) to apply when the given expression satisfies a corresponding functional form. LEX uses a representation called version space to keep track of the rules that it is learning. Stored information is kept to a reasonable size by including only the most specific and the most general descriptions of the rules that satisfy the training examples.

The program is given a set of operators for solving calculus problems. A typical operator,

OP1: INTEGRAL (u dv) --> uv - INTEGRAL (v du),

indicates how the INTEGRAL of an expression can be solved, and often the solution is recursive, i.e., it requires solution of an additional INTEGRAL.

An example of a version space rule is shown in Fig. 5-9(1). The most specific heuristic is marked S and indicates that if a specific expression, {3x cos(x) dx}, is encountered, then operator 1 should be used, with the variables u and dv in this operator replaced by the specific values shown.

The most general rule, denoted by G, indicates the substitutions that should be made in operator 1 for the more general expressions f1(x) and f2(x). Thus, while many additional heuristics are implied, S and G completely delimit the range of alternatives in a predefined general-to-specific ordering.

Assume that a new training instance involving {INTEGRAL 3x sin(x) dx} is encountered, and the program discovers the solution

INTEGRAL 3x sin(x) dx --> Apply OP1 with u = 3x and dv = sin(x) dx.

Given the generalization hierarchy supplied by the designer,

sin(x), cos(x) --> trig(x) --> transc(x) --> f(x)
kx --> monom(x) --> poly(x) --> f(x),

LEX determines that the most specific version of the heuristic of Fig. 5-9(1) could be generalized to include both cos(x) and sin(x), by using trig(x) in place of cos(x). This generalization is shown in the revised version space representation of the heuristic rule, Fig. 5-9(2).

However, an example in the form {INTEGRAL sin(x) 3x dx} provides evidence that the most general form of the heuristic rule, as specified in Fig. 5-9(1), can fail unless it is further restricted. Specifically, the original generalization proposed is: Apply OP1 with u = f1(x) and dv = f2(x) dx; but the sin(x) 3x training example shows that this assignment of u and dv can result in a more complicated expression that OP1 cannot integrate, and that an interchange of f1(x) and f2(x) may be necessary, depending on the specific functional forms of f1 and f2.


The program begins with 115 incomplete data structures, each corresponding to an elementary set-theoretic concept (such as union or equality). Each data structure has 25 different facets or slots, such as examples, definitions, generalizations, analogies, interestingness, etc., to be filled out. Very interesting slots can be granted full concept module status. This is the space that AM begins to explore, guided by a large body of heuristic rules.

AM operates under the guidance of 250 heuristic rules which indirectly control the system through an agenda mechanism, a global list of tasks for the system to perform, together with reasons why each task is desirable. A task directive might cause AM to define a new concept, or to explore some facet of an existing concept, or to examine some empirical data for regularities, etc. The program selects from the agenda the task that has the best supporting reasons, and then executes it.
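A minimal reading of the agenda mechanism is a priority queue of tasks scored by the strength of their supporting reasons. In the sketch below, the task format, the reason weights, and the additive scoring rule are all invented for illustration; they are not Lenat's code:

    import heapq

    # Toy agenda: tasks carry the reasons that justify them, and the task
    # with the strongest combined support is executed first.

    def add_task(agenda, task, reasons):
        score = sum(weight for _, weight in reasons)     # invented scoring rule
        heapq.heappush(agenda, (-score, task, reasons))  # max-priority via negation

    def next_task(agenda):
        _, task, reasons = heapq.heappop(agenda)
        return task, reasons

    agenda = []
    add_task(agenda, "fill examples slot of PRIMES",
             [("few examples known", 0.7)])
    add_task(agenda, "define new concept from EQUALITY mutation",
             [("analogy to SET-UNION", 0.4), ("high interestingness", 0.5)])
    print(next_task(agenda)[0])   # the mutation task wins, 0.9 vs 0.7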

BOX 5-5 (continued)

(1) VERSION SPACE REPRESENTATION OF A HEURISTIC

S: INTEGRAL 3x cos(x) dx --> Apply OP1 with u = 3x and dv = cos(x) dx
G: INTEGRAL f1(x) f2(x) dx --> Apply OP1 with u = f1(x) and dv = f2(x) dx

(2) REVISED VERSION SPACE REPRESENTATION OF THE HEURISTIC

S: INTEGRAL 3x trig(x) dx --> Apply OP1 with u = 3x and dv = trig(x) dx
G: g1: INTEGRAL poly(x) f2(x) dx --> Apply OP1 with u = poly(x) and dv = f2(x) dx
   g2: INTEGRAL transc(x) f1(x) dx --> Apply OP1 with u = f1(x) and dv = transc(x) dx

GENERALIZATION HIERARCHY SUPPLIED BY THE DESIGNER

sin(x), cos(x) --> trig(x) --> transc(x) --> f(x)
kx --> monom(x) --> poly(x) --> f(x)

FIGURE 5-9 Version Space Representations of Various Forms of a Heuristic Rule for Symbolic Integration.

The description of the heuristic in version space is revised to take this new information into account by breaking the most general form of the heuristic into two statements. In addition, the generalization hierarchy is used to replace f1(x) in one generalization by its more specific form poly(x), and in the other expression to replace f2(x) by its more specific form transc(x). The revised form of the heuristic of Fig. 5-9(1), after the two training examples just discussed, is shown in Fig. 5-9(2).

We note that the learning process described here always involves the narrowing of a description represented in version space. This narrowing occurs in discrete steps, based on the predefined hierarchically ordered vocabulary for the given problem domain.
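The generalization step applied to the S boundary, climbing the designer's hierarchy to the least general term that covers both the old description and a new positive instance, can be caricatured in a few lines. The hierarchy encoding and function names below are our own simplifications of this idea (the matching specialization of the G boundary is omitted for brevity):

    # Toy version-space generalization over a fixed hierarchy.
    # parent[] encodes: sin, cos -> trig -> transc -> f ; 3x -> monom -> poly -> f

    parent = {"sin": "trig", "cos": "trig", "trig": "transc", "transc": "f",
              "3x": "monom", "monom": "poly", "poly": "f"}

    def ancestors(term):
        chain = [term]
        while chain[-1] in parent:
            chain.append(parent[chain[-1]])
        return chain

    def generalize_S(s_term, positive_term):
        # Climb the hierarchy to the first term covering both descriptions,
        # e.g., generalize_S("cos", "sin") == "trig".
        for candidate in ancestors(s_term):
            if candidate in ancestors(positive_term):
                return candidate
        return "f"

    print(generalize_S("cos", "sin"))   # trig
    print(generalize_S("trig", "3x"))   # f  (only the root covers both)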


The heuristic rules suggest which facet of which concept to enlarge next, and how and when to create new concepts. Lenat provided the system with a specific algorithm for rank-ordering concepts in terms of how interesting they are. The primary goal of AM is to maximize the interestingness level of the concept space it is enlarging.

Many heuristics used in AM embody the belief that mathematics is an empirical inquiry: the approach to discovery is to perform experiments, observe the results, gather statistically significant amounts of data, induce from that data some new conjectures or new concepts worth isolating, and then repeat this whole process again. An example of a heuristic dealing with this type of experimentation is: After trying in vain to find some non-examples of X, if many examples of X exist, consider the conjecture that X is universal, always-true. Consider specializing X.

Another large set of heuristics deals with focus of attention: when should AM keep on the same track, and when not. A final set of rules deals with assigning interestingness scores based on symmetry, coincidence, appropriateness, usefulness, etc. For example: Concept C is interesting if C is closely related to the very interesting concept X.

Experimental evidence indicates that AM's heuristics are powerful enough to take it a few levels away from the kind of knowledge it began with, but only a few levels. As evaluated by Lenat, of the 200 new concepts AM defined, about 130 were acceptable and about 25 were significant; 60 to 70 of the concepts were very poor.

Although AM is described as "exploring the space of mathematical concepts," in essence AM was an automatic programming system, whose primitive actions produced modifications to pieces of LISP code, which represent the characteristic functions of various mathematical concepts. For example, given a LISP program that detects when two lists are equal, it is possible to make a "mutation" in the code that now causes it to compare the lengths of two lists, and another mutation might cause it to test whether the first items on two lists are the same. Thus, because such mutations often result in meaningful mathematical concepts, AM was able to exploit the natural tie between LISP and mathematics, and was benefiting from the density of worthwhile mathematical concepts embedded in LISP.

The main limitation of AM was its inability to synthesize effective new heuristics based on the new concepts it discovered. It first appeared that the same machinery used to discover new mathematical concepts could also be used to discover new heuristics, but this was not the case. The reason is that the deep relationship between LISP and mathematics does not exist between LISP and heuristics. When AM applied its mutation operators to viable and useful heuristics, the almost inevitable result was useless new heuristic rules.
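The list-equality example can be made concrete. Below, a recursive equality test is "mutated" twice by hand (in Python rather than LISP, and with names invented for the sketch) to show how small code edits yield the new concepts same-length and same-first-element:

    # Characteristic function of the concept "equal lists" (recursive form).
    def lists_equal(a, b):
        if not a and not b:
            return True
        if not a or not b:
            return False
        return a[0] == b[0] and lists_equal(a[1:], b[1:])

    # Mutation 1: drop the element comparison; what remains is the
    # concept "same length" (an AM-style accidental discovery).
    def same_length(a, b):
        if not a and not b:
            return True
        if not a or not b:
            return False
        return same_length(a[1:], b[1:])

    # Mutation 2: drop the recursion; what remains is "same first element."
    def same_first(a, b):
        return bool(a) and bool(b) and a[0] == b[0]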

In evaluating the accomplishments of AM, there is also the question of the extent to which the designer implicitly supplied the limited set of mathematical concepts that were (or could be) generated by the system, by supplying the particular initial set of heuristics, frame slots, and the definitions and procedures that determine how these slots get instantiated. An extensive criticism of AM, and a later related program called EURISKO, was offered by Ritchie and Hanna [Ritchie 84]. In their response [Lenat 84], Lenat and Brown compare the AM/EURISKO approach to that of the perceptron, a device described earlier in this chapter:

The paradigm underlying AM and EURISKO may be thought of as the new generation of perceptrons, perceptrons based on collections or societies of evolving, self-organizing, symbolic knowledge structures. In classical perceptrons, all knowledge had to be encoded as topological networks of linked neurons, with weights on the links. The representation scheme used in EURISKO provides much more powerful linkages, taking the form of heuristics about concepts, including heuristics for how to use and evolve heuristics. Both types of perceptrons rely on the law of large numbers, on a kind of local-global property of achieving adequate performance through the interactions of many relatively simple parts.

DISCUSSION

There are several key issues that have arisen in our exposition of the learning process:

Representation. As is the case in other areas of AI, having an appropriate representation is crucial for learning; it is often necessary to be able to modify an existing representation, or create a new one, in dealing with a given problem domain. However, the learning programs reviewed had relatively fixed representations, provided by the designer, which bounded the learning process. For example, the problem of telling when a situation, object, or event is similar to another lies at the heart of the learning process; a related problem is determining when something is a more general or specific instance of something else, i.e., determining the generalization hierarchy. Most of the systems examined use a generalization hierarchy provided by the designer, and methods for measuring similarity almost invariably depend on comparing predefined attribute vectors, or on graph matching based on a predefined vocabulary. Thus, existing learning systems are inherently limited to instantiating a predefined model; their ability to really discover something new, or to exhibit continuing growth, is virtually nonexistent.

Problem generation. A system should be able to generate its own problems so that it can refine its learned strategies. This type of problem generation is seen in a child learning to use building blocks. Various structures are tried so that the physics of block building can be learned. LEX, and to some extent AM, were the only machine learning programs discussed that had the ability to effectively select problems. In all the others it is up to the human operator (or trial and error) to choose appropriate problems to promote learning.

Focus of attention. A learning system should be able to alter its focus of attention, so that problem solutions can be effectively found. Many of the programs that we examined used a fixed approach to problem solving, and did not have the ability to focus on a problem bottleneck. Capability in this area is related to self-monitoring, i.e., if one knows how well he is doing, then critical problem areas can be identified, and priorities can be altered accordingly.

Limits on learning ability. How is new learning related to the knowledge structures already possessed by a system? To take an extreme example, no matter how many ants we test, or how hard they try, it is inconceivable that an ant could devise the equivalent of Einstein's theory of relativity. On what grounds is this intuition based? It certainly appears to be the case that any system with a capacity for self-modification, or learning, has limits on what it can reasonably hope to achieve. At present, we do not even have a glimmering of a theory that quantifies such limits. The amount of knowledge current machine learning systems start off with is invariably based on practicality and other ad hoc criteria.

Human learning. There appears to be almost nothing in the physiological or psychological literature concerning human learning that can aid us in the design of a machine that learns. The mechanisms used by the human to learn autonomously when immersed in a complex environment remain a mystery. In particular, the ability of the human to form new concepts when required is not understood.

Perhaps the most interesting open question is whether it is possible for a mechanical process to create new concepts and representations using an approach more powerful than trial and error. At present, no such procedure is known.

Appendix 5-1

Parameter Learning for an Implicit Model

As was shown in Fig. 5-2, a threshold device accepts a set of binary inputs, e.g., TRUE or FALSE, 0 or 1, -1 or +1, and outputs a binary result. Each input is multiplied by a weight, wi, and if the sum of the weighted inputs exceeds a threshold value, T, then the output is one value; otherwise it is the other. The threshold device can be used for general computing purposes, since appropriate weight settings will cause it to behave like the AND, OR, and NOT functions mentioned in Chapter 4. However, this device (function) has special significance for classification-type (pattern recognition) computations.

Much of the work on threshold devices has dealt with the question of how different classes of objects could be recognized by automatic adjustment of the weights, wi. Suppose we have two classes A and B, and for objects in class B we want the sum of the weighted inputs, SUM = w1 f1 + w2 f2 + ..., to be greater than a threshold, T. We know intuitively that if in the training mode we obtain a SUM less than T for a test pattern in class B, then the weights corresponding to positive input terms should be increased, and those corresponding to negative input terms, along with the value of the threshold, should be decreased. (The converse action should be taken if SUM is incorrectly greater than T for test patterns in class A.)

It is possible to prove a surprisingly powerful theorem which says that if a single threshold device is capable of recognizing a class of objects, then it can learn to recognize the class by using the following weight adjustment procedure:

1. Start with any set of weights.
2. Present the device with a pattern (described by a vector of -1, +1 valued features) of class C, or not of class C.
3. If the pattern is classified correctly (SUM > T for a pattern in C; SUM <= T for a pattern not in C), then go to step 2.
4. If SUM <= T for a pattern X in C, replace each wi by (wi + fi(x)) and T by (T - 1); if SUM > T for a pattern X not in C, replace each wi by (wi - fi(x)) and T by (T + 1).
5. Go to step 2.

FIGURE 5-10 Classifying a Pattern as Belonging to One of Four Classes. (a) Threshold network. (b) Encoding of class designation with respect to the two output devices of the threshold network: class 1 --> (-1, -1); class 2 --> (-1, +1); class 3 --> (+1, -1); class 4 --> (+1, +1).
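The five steps translate almost line for line into code. The following is a minimal sketch, with our own function name and data layout, that trains a single threshold device until it completes an error-free pass over the training patterns:

    # Threshold-device weight adjustment, following the five-step
    # procedure above. Features and class labels are -1/+1 valued.

    def train_threshold(patterns, max_passes=100):
        """patterns: list of (features, label), label +1 (in C) or -1 (not in C)."""
        w = [0] * len(patterns[0][0])
        T = 0
        for _ in range(max_passes):
            changed = False
            for f, label in patterns:
                s = sum(wi * fi for wi, fi in zip(w, f))
                if label == +1 and s <= T:      # step 4, first case
                    w = [wi + fi for wi, fi in zip(w, f)]
                    T -= 1
                    changed = True
                elif label == -1 and s > T:     # step 4, second case
                    w = [wi - fi for wi, fi in zip(w, f)]
                    T += 1
                    changed = True
            if not changed:                     # a clean pass: all classified
                return w, T
        return w, T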

After a finite number of iterations, in which correctly labeled members of the input population are used in the weight adjustment procedure, the device will correctly recognize these patterns. The order in which the input patterns are presented is not important; it is only required that the number of appearances of each pattern is proportional to the length of the training sequence (i.e., to the total number of patterns actually presented). Note that the theorem only assures us that members in the training sequences will be correctly identified. Correct classification of new patterns will occur only if the training examples are truly representative of class C and its complement, and the features are well chosen. For example, if a feature such as color does not distinguish one class from another, then measurements of this feature cannot contribute to the classification process.

TABLE 5-1 • Outputs of Three Feature Detectors for a Set of Patterns

Feature detectors: FD1 counts enclosures (< 2 enclosures --> -1, >= 2 --> +1); FD2 counts verticals (< 2 verticals --> -1, >= 2 --> +1); FD3 counts horizontals (< 3 horizontals --> -1, >= 3 --> +1).

Input Pattern   Character   FD1   FD2   FD3
1               A           -1    +1    -1
2               B           +1    -1    +1
3               B           +1    +1    -1
4               A           -1    -1    -1
5               B           +1    -1    -1
6               A           -1    +1    +1
7               B           -1    -1    +1
8               B           +1    +1    +1


Although we have discussed classification using two classes, the approach can be extended to many classes (see Fig. 5-10). The figure shows how the binary output of two classifiers can be used to classify a pattern into one of four classes.
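A sketch of that two-device decoding, assuming two weight/threshold pairs trained with the procedure above (the decoding table mirrors Fig. 5-10(b); names and data layout are ours):

    # Two threshold devices jointly name one of four classes: their
    # (-1/+1) outputs act as a two-bit code, as in Fig. 5-10(b).

    def device_output(w, T, f):
        return +1 if sum(wi * fi for wi, fi in zip(w, f)) > T else -1

    CODE_TO_CLASS = {(-1, -1): 1, (-1, +1): 2, (+1, -1): 3, (+1, +1): 4}

    def classify(device1, device2, f):
        (w1, T1), (w2, T2) = device1, device2
        return CODE_TO_CLASS[(device_output(w1, T1, f),
                              device_output(w2, T2, f))]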

An example of threshold device parameter learning is presented in Tables 5-1 and 5-2. The problem is to tell whether a set of measurements derived from a pattern on a retina depicts the presence of the character A or the character B. The threshold device outputs -1 to indicate the character A and +1 to indicate the character B. Three feature detectors make measurements on the pattern values in the retina to determine three different characteristics or features of the pattern.

TABLE 5-2 • Learning to Recognize a Set of Patterns by Adjustment of Weights in a Threshold Network
(The training sequence is the set of input patterns 1-7 of Table 5-1)

Input                                 Weighted        New   New   New
Pattern   Class   FD1   FD2   FD3     Sum        T    w1    w2    w3
-         -       -     -     -       -          0    0     0     0
1         -1      -1    +1    -1      0          0    0     0     0
2         +1      +1    -1    +1      0         -1    +1    -1    +1   <-- change
3         +1      +1    +1    -1     -1         -2    +2     0     0   <-- change
4         -1      -1    -1    -1     -2         -2    +2     0     0
5         +1      +1    -1    -1     +2         -2    +2     0     0
6         -1      -1    +1    +1     -2         -2    +2     0     0
7         +1      -1    -1    +1     -2         -3    +1    -1    +1   <-- change
1         -1      -1    +1    -1     -3         -3    +1    -1    +1
2         +1      +1    -1    +1     +3         -3    +1    -1    +1
3         +1      +1    +1    -1     -1         -3    +1    -1    +1
4         -1      -1    -1    -1     -1         -2    +2     0    +2   <-- change
5         +1      +1    -1    -1      0         -2    +2     0    +2
6         -1      -1    +1    +1      0         -1    +3    -1    +1   <-- change
7         +1      -1    -1    +1     -1         -2    +2    -2    +2   <-- change
1         -1      -1    +1    -1     -6         -2    +2    -2    +2
2         +1      +1    -1    +1     +6         -2    +2    -2    +2
3         +1      +1    +1    -1     -2         -3    +3    -1    +1   <-- change
4         -1      -1    -1    -1     -3         -3    +3    -1    +1
5         +1      +1    -1    -1     +3         -3    +3    -1    +1
6         -1      -1    +1    +1     -3         -3    +3    -1    +1
7         +1      -1    -1    +1     -1         -3    +3    -1    +1
8         +1      +1    +1    +1     +3         -3    +3    -1    +1   <-- this pattern, not in the training set, is correctly identified by the "learned" set of weights


Table 5-1 shows the responses of the three feature detectors for a set of eight patterns. Note that the output of a feature detector is +1 or -1.

Table 5-2 shows the sequence of trial weights obtained using the weight adjustment procedure described previously. The three weights are initially zero, and are modified whenever the input pattern is misclassified. For example, input pattern 2 has a weighted sum that does not exceed the threshold. This is incorrect, since pattern 2 is in the +1 class. Therefore the weight modification procedure is used.

The procedure continues until all of the patterns are correctly classified. Once the correct set of weights has been obtained, the device can classify a pattern that it was not trained on, as shown in the response to pattern 8.
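As a check, feeding patterns 1-7 of Table 5-1 to the train_threshold sketch given earlier in this appendix reproduces the final row of Table 5-2, assuming the same cyclic presentation order:

    # Patterns 1-7 of Table 5-1: (features, class), with A = -1 and B = +1.
    training = [
        ([-1, +1, -1], -1),  # 1: A
        ([+1, -1, +1], +1),  # 2: B
        ([+1, +1, -1], +1),  # 3: B
        ([-1, -1, -1], -1),  # 4: A
        ([+1, -1, -1], +1),  # 5: B
        ([-1, +1, +1], -1),  # 6: A
        ([-1, -1, +1], +1),  # 7: B
    ]

    w, T = train_threshold(training)
    print(w, T)   # expected: [3, -1, 1] and -3, as in the last rows of Table 5-2

    # Pattern 8 (B: +1 +1 +1) was never trained on, yet is classified correctly:
    print(sum(wi * fi for wi, fi in zip(w, [+1, +1, +1])) > T)   # True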