
5

Learning

Learning can be defined as any deliberate or directed change in the knowledge structure of a system that allows it to perform better on later repetitions of some given type of task. Learning is an essential component of intelligent behavior: any organism that lives in a complex and changing environment cannot be designed to explicitly anticipate all possible situations, but must be capable of effective self-modification. There are several basic ways to learn, i.e., to add to one's ability to perform or to know about things in the world:

1. Genetically-endowed abilities. Knowledge can be stored in the genes of men or animals, or in the circuits of machines.

2. Supplied information. Someone can demonstrate how to perform an action or can describe or provide facts about an object or situation.

3. Outside evaluation. Someone can tell you when you are proceeding correctly, or when you have obtained the correct information.

4. Experience or observation. One can learn by feedback from his environment; the evaluation is usually done by the learner measuring his approach to some explicit goal.

5. Analogy. New facts or skills can be acquired by transforming and augmenting existing knowledge that bears a strong similarity in some respect to the desired new concept or skill.

In computers, the first two types of learning correspond to providing the machine with a predesigned set of procedures, or adding information to a database. If the supplied information is in the form of advice, rather than a complete procedure, then converting advice to an operational form is a problem in the domain of reasoning rather than learning. Types 3, 4, and 5 in the above list involve "learning on your own," or autonomous learning, and present some of the most challenging problems in our understanding of intelligent behavior. These types of learning are the subject of this chapter.

Autonomous learning largely depends on recognizing when a new problem situation is sufficiently similar to a previous one, so that either a single generalized solution approach can be formulated, or a previous solution can be applied. However, the recognition of similarity is a very complex issue, often involving the perception of analogy between objects or situations. We will first discuss some aspects of animal and human learning, and then representations and techniques for discovering and quantifying similarity (including methods for recognizing known patterns). Finally, techniques for automatic machine learning will be described. We will address the following questions:

• How do animals and people learn?

• How can we quantify the concept of similarity?

• How can we identify a particular situation as being similar to or an instance of an already-learned concept?

• How much do you have to know in order to be able to learn more?

• What are the different modes of learning, and how can such ability be embedded in a computer?

HUMAN AND ANIMAL LEARNING

Complicated behavior patterns can be found in all higher animals. When these behavior patterns are rigid and stereotyped, they are known as instincts. Instincts are sometimes alterable by experience, but only within narrow limits. In higher vertebrates, especially the mammals, behavior becomes increasingly modifiable by learning even though rigid instinctive behavior is still present.

It is now recognized that inheritance and learning are both fundamental in determining the behavior of higher animals, and the contributions of these two elements are inextricably intertwined in most behavior patterns. Inheritance can be regarded as determining the limits within which a particular behavior pattern can be modified, while learning determines, within these limits, the precise character of the behavior pattern. In some cases the limits imposed by inheritance leave little room for modification, while in other cases the inherited limits may be so wide that learning plays the major role in determining behavior.

Experiments conducted by W. H. Thorpe of Cambridge University [Thorpe 65] show some instances of the interaction between inheritance and learning. He raised chaffinches in isolation and found that such birds were unable to sing a normal chaffinch song. Thorpe then demonstrated that young chaffinches raised in isolation and permitted to hear a recording of a chaffinch song when about six months old would quickly learn to sing properly. However, young chaffinches permitted to hear recordings of songs of other species that sing similar songs did not ordinarily learn to sing these other songs. Apparently chaffinches learn to sing only by hearing other chaffinches; they inherit an ability to recognize and respond only to the songs of their own species.

When chaffinch songs are played backward to the young birds, they respond to these, even though to our ears such a recording does not sound like a chaffinch song.

Thus a chaffinch song is neither entirely inherited nor wholly learned; it is both inherited and learned. Chaffinches inherit the neural and muscular mechanisms responsible for chaffinch song, they inherit the ability to recognize chaffinch song when they hear it, and they inherit severe limits on the type of song they can learn. The experience of hearing another chaffinch sing is necessary to activate, and perhaps somewhat adjust, their inherited singing abilities, and in this sense their song is learned.

Types of Animal Learning

It is possible to classify the many types of animal learning according to the conditions which appear to induce the observed behavior modifications:

Habituation and sensitization. Habituation is the waning of a response as a result of repeated continuous stimulation not associated with any kind of reward or reinforcement. An animal drops those responses that experience indicates are of no value to its goals. Sensitization is an enhancement of an organism's responsiveness to a (typically harmful) stimulus. Instances of both habituation and sensitization can be found in the simplest organisms.

Associative learning. Associating a response with a stimulus with which it was not previously associated. An example of classical conditioning, or associative learning, is Pavlov's experiment in which a dog learned to associate the sound of a bell with food after repeated trials in which the bell always rang just before the food was provided. Another form of conditioned learning, called operant conditioning, or instrumental learning, is illustrated by the Skinner box. A hungry rat is given the opportunity to discover that a pellet of food will be given every time he presses on a bar when a signal light is lit, but no food will be provided when the light is off. The rat quickly learns to press on the bar only when the light is lit. The distinction between classical and operant conditioning is that in classical conditioning the conditioned stimulus (e.g., bell) is always followed by the unconditioned stimulus (e.g., food), regardless of the animal's response; in operant conditioning, reinforcement (e.g., food) is only provided when the animal responds to the conditioning stimulus (e.g., light is lit) in some desired way (e.g., pressing on the bar).

Trial-and-error learning. An association is established between an arbitrary action and the corresponding outcome. Various actions are tried, and future activity profits from the resulting reward or punishment. This type of learning can often be classified as a form of operant conditioning as described above.

Latent learning. Learning that takes place in the absence of a reward. For example, rats that run through a maze unrewarded will learn the maze faster and with fewer errors when finally rewarded (compared to other rats that never encountered the maze before, but received rewards from the start of the learning sessions).

Imprinting. The animal learns to make a strong association with another organism or sometimes an object. The sensitive phase in which learning is possible lasts for a short time at a specific point in the animal's maturation, say a day or two after birth, and once this phase has passed, the animal cannot be imprinted with any other object. For example, it was found that a duckling will learn to follow a large colored box with a ticking clock inside if the box is the first thing it observes after hatching. The duckling preferred the box to its mother!

Imitation. Behavior that is learned by observing the actions of others.

Insight. This type of learning is the ability to respond correctly to a situation significantly different from any previously encountered. The new stimuli may be qualitatively and quantitatively different from previous ones. An example would be (1) a monkey obtaining a bunch of bananas above his reach by moving a box under the bananas and standing on the box (assuming that he had never seen this procedure used before), and (2) subsequently using the box as a tool to obtain objects out of his normal reach. The first part of the monkey's activities involves problem solving, the second involves learning. We note here the difference between learning and problem solving: problem solving involves obtaining a solution to a specific problem, while learning provides an approach to solving a class of problems.

The evolution of anatomical complexity among animals has paralleled a steady advance in learning capability, from simple habituation to associative learning, which appears first as classical conditioning and then as trial-and-error learning. It is as if evolutionary progress in the powers of behavioral adjustment consists of elaborating and coordinating new types of responses that are built upon and act with the simple ones of more primitive animals.

Some remarkable examples of animal learning are presented in Box 5-1.

Piaget's Theory of Human Intellectual Development

A study of the intellectual development of a child can provide insight into learning mechanisms. Based on an extensive series of experiments and informal observations, Piaget [Flavell 63] has formulated a theory of human intellectual development. Piaget views the child as someone who is trying to make sense of the world by discovering the nature of physical objects and their interaction, and the behavior of people and their motivations.


BOX 5-1 Examples of Animal Learning

Visual Learning

Pigeons have been trained to respond to the presence or absence of human images in photographs. The pictures are so varied that simple stimulus characterization seems impossible. The results suggest remarkable powers of conceptualization by pigeons as to what is human.

Monkeys have been trained to select the pictures of three different insects from among pictures of leaves, fruit, and branches of similar size and color.

Counting

A raven learned to open a box that had the same number of spots on its lid as were on a key card. Eventually, the bird was trained to lift only the lid that had the same number of spots on it as the number of objects in front of the box.

Pigeons have been trained to eat only a specific number of grains out of a larger number offered. They have also learned to eat only a specific number N of peas dropped into a cup (one at a time at random intervals ranging from 1 to 30 seconds); the pigeon never sees more than one pea in the cup at a time, but only eats the first N.

A jackdaw learned to open black lids on boxes until it had secured two baits, green lids until it had three, red lids until it had four, and white lids until it had secured five.

A parrot, after being shown four, six, or seven light flashes, is able to take four, six, or seven pieces of irregularly distributed food out of trays. Numerous random changes in the time sequence of visual stimuli did not affect the percentage of correct solutions. After the bird learned this task, the signal of several light flashes was replaced by notes on a flute. The parrot was able to transfer immediately to these new stimuli without further training.

Learning Visual Landmarks

The female digger wasp can efficiently and consistently locate her nest when she returns from hunting. It has been shown that she locates the nest by noting both distant and adjacent landmarks.

When a beehive is moved to a new location, the worker bees coming out on foraging flights will pause and circle in increasing arcs around the new site for a few moments before flying off. The insect learns the relative positions of new landmarks in this manner, so as to be able to find its way back.

The goby can jump from pool to pool at low tide without being stranded on dry land, even though it cannot see the neighboring pools before leaping. The gobies apparently swim over rock depressions at high tide and thereby learn the general features and topography of the limited area around the home pool. They then use this information in making their jumps at low tide.

Thus, Piaget views the child as developing increasingly well-articulated and interrelated representations that are used to interpret the world. Interaction with the world is crucial because when there is sufficient mismatch between the representations and reality, the representations are modified or transformed. Piaget thinks of the child's intellectual development as having four discrete stages:[1]

[1] There has been some controversy as to whether the development occurs in discontinuous jumps or as a single continuous process that can be partitioned into discrete stages. Cunningham [Cunningham 72], in an attempt to formalize intelligence, describes these stages in terms of data structures and operations on these structures.

1. Sensorimotor stage (years 0-2). By discovery through trial-and-error manipulation experiments, children develop a simple cause-and-effect understanding of how they can physically interact with their immediate environment. In particular, they develop an ability to predict the effects on the environment of specific motor actions on their part; for example, they learn to correctly judge spatial relations, identify and pick up objects, learn that objects have permanence (even when they are no longer visible), and learn to move about. They also develop an ability to use signs and facial expressions to communicate.

2. Symbolic-operational or preoperational stage (years 2-7). Children start to develop a symbolic understanding of their environment. They develop an ability to communicate via natural language, to read and write, and form internal representations of the external world; they can now perform mental experiments to predict what will happen in new situations. However, their thinking is still dominated by visual impressions and direct experience; their generalization ability is poor. For example, if 5-year-old children are shown a row of black checkers directly opposite a row containing an equal number of red checkers, they will say that there are an equal number of each. If the red checkers are now spaced closer together, but none are removed, the children say there are more black than red checkers. Their visual comparison of the lengths of the two rows dominates the more abstract notion of numerical equality. The ability of preoperational children to form or to recognize class distinctions is also poor; they tend to categorize objects by single salient features, and assume that if two objects are alike in one important attribute, they must be alike in other (or all) attributes.

3. Concrete-operational stage (years 7-11). Children acquire some of the concepts and general principles which govern cause-and-effect relationships in their direct interaction with the environment. In particular, they develop an understanding of such concepts as invariance, reversibility, and conservation; e.g., they can correctly judge that a volume of water remains constant regardless of the shape of the container into which it is poured. Five-year-old children can follow a known route (e.g., getting home from school), but cannot give adequate directions to someone else, or draw a map. They do not get lost because they know that they must turn at certain recognized locations. The 8-year-old is able to form a conceptual picture of the overall route and can readily produce a corresponding map. Until children reach the age of 11 or 12 years, their method of discovery is generally based on trial-and-error experimentation. They do not formulate and systematically evaluate alternative hypotheses.

4. Formal-operational stage (years 11+). The young adult develops a full ability for symbolic reasoning, and the ability to conceive of possibilities beyond what is present in reality.

Recently, many of the Piaget experiments have been interpreted in terms of information-processing metaphors, seeking explanations as to why children perform better on intellectual tasks as they get older. Basically, the questions are (1) do children think better (because they can hold more information in working memory, can retrieve information faster, and can reason faster), or (2) do they know more (have more knowledge, enabling them to perform tasks more efficiently)? Anderson [Anderson 85] discusses experiments that indicate that both effects are present.

One basic aspect of Piaget's theory is that children develop an understanding of their physical environment by manipulation of objects within it; he believed that language played a much less important role. A challenge to his work is based on the idea that the child's performance in some of his experiments was less a matter of competence than of learning the correct meaning of certain verbal expressions such as "same amount" or "more than."

While Piaget's theory attempts to describe the changes in, and characteristics of, human learning ability, it provides very little insight into the specific mechanisms that underlie such learning ability. Thus it provides very little guidance with respect to the question of how to build machines capable of learning. As Margaret Boden has written [Boden 81]:

One of the weaknesses of Piagetian theory . . . is its lack of specification of detailed procedural mechanisms competent to generate the behavior it describes. . . . Even Piaget's careful description of behaviors . . . is related to uncomfortably vague remarks about the progression of stages, without it being made clear just how one stage (or set of conceptual structures) comes to follow another. . . .

SIMILARITY

Assigning names or labels to objects, events, and situations is one of the most important and recurring themes in the study of intelligent behavior. This pattern-matching problem typically involves measuring the degree of similarity between data describing the given state of the world, and previously stored models. Basically, a common representation for the objects under consideration must be found, and then a metric must be defined relative to this representation quantifying the degree of similarity between any of the objects and models.

The pattern-matching problem arises in many situations. For example, the <IF (condition) THEN (action)> template is a basic method of encoding knowledge within the computer paradigm, and is used in a variety of applications ranging from computer programming to the way rules are structured in expert systems (see Chapter 7). Determining when an existing situation matches the IF condition of a template is a pattern-matching problem. Another example of the pattern-matching problem is trying to find similar symbolic patterns in two logical or algebraic expressions so that they can be combined or simplified. (The unification problem in the predicate calculus, discussed in Chapter 4, is an example of a sophisticated matching problem.) Finally, a significant portion of vision and perception is concerned with recognizing a known sensory pattern (i.e., the pattern recognition problem).


Some similarity problems can require that we find an exact match, while others are satisfied with finding an approximate match. This critical distinction separates the methods based on the symbolic approaches typically employed in the cognitive areas of AI, from those based on measurement (feature) spaces and statistical decision theory that are often employed in the perceptual areas of AI. In both cases, the objects being compared are usually first transformed into some canonical (standard) form to allow the computational procedures to deal with a practical number of primitive elements and relationships.

Note that there is the question of the level of generality to be used in similarity evaluation, even after conversion to a canonical form. For example, if we are comparing two concepts, one of which contains the trigonometric term sine and the other cosine, then we have to move to the more general term trigonometric function in order to obtain a match. Knowing the level of generality to use in a matching operation is often crucial in performing similarity analysis.
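As a concrete illustration, the following minimal sketch (the hierarchy, names, and terms are illustrative, not from the text) matches two terms by climbing a generalization hierarchy until a common level is found:

    # Illustrative generalization hierarchy: each term maps to its parent.
    PARENT = {
        "sine": "trigonometric function",
        "cosine": "trigonometric function",
        "trigonometric function": "transcendental function",
        "exp": "transcendental function",
    }

    def ancestors(term):
        """Return the chain from a term up to its most general ancestor."""
        chain = [term]
        while chain[-1] in PARENT:
            chain.append(PARENT[chain[-1]])
        return chain

    def common_generalization(a, b):
        """Find the least general term at which the two terms match."""
        for candidate in ancestors(a):
            if candidate in ancestors(b):
                return candidate
        return None  # no common level in the hierarchy

    # "sine" and "cosine" match at the level "trigonometric function".
    print(common_generalization("sine", "cosine"))

The earlier the two chains meet, the more specific, and hence the more informative, the match.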

Similarity Based on Exact Match

The exact match problem usually arises in a precise world, such as the game world; e.g., when we try to match a chessboard configuration against a stored set of configurations. Sometimes partial exact matches are sufficient, while at other times a match of the entire description is required.

A representation often used for exact matching is the graph with labeled arcs and nodes, as shown in Fig. 5-1. Mathematically, in the complete match problem one tries to find a mapping (isomorphism) between two graphs so that when a pair of connected nodes in one graph maps to a pair in the other graph, the edge connecting the first pair will map to the edge connecting the second pair. The partial match problem of finding isomorphisms between a graph and a subgraph is computationally more difficult because there are many more combinations to be tested.

FIGURE 5-1
Graph Representation and the Partial Match Problem. The problem is to find instances of the subgraph in the graph (e.g., 9,4,8 corresponds to a,b,c).
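A brute-force sketch of the partial match conveys why the combinatorics grow so quickly: every assignment of subgraph nodes to graph nodes may have to be tried. (The edge lists below are assumptions for illustration; only the figure's noted correspondence of nodes 9,4,8 to a,b,c is taken from the text.)

    from itertools import permutations

    def subgraph_matches(sub_nodes, sub_edges, graph_nodes, graph_edges):
        """Yield every mapping of subgraph nodes onto graph nodes that
        preserves all subgraph edges (undirected, labels ignored)."""
        edge_set = {frozenset(e) for e in graph_edges}
        for target in permutations(graph_nodes, len(sub_nodes)):
            mapping = dict(zip(sub_nodes, target))
            if all(frozenset((mapping[u], mapping[v])) in edge_set
                   for u, v in sub_edges):
                yield mapping

    # Triangle a-b-c sought inside a larger graph containing triangle 9-4-8.
    sub = [("a", "b"), ("b", "c"), ("a", "c")]
    big = [(9, 4), (4, 8), (9, 8), (8, 2), (2, 7), (7, 1)]
    for m in subgraph_matches(["a", "b", "c"], sub, [1, 2, 4, 7, 8, 9], big):
        print(m)   # e.g., {'a': 9, 'b': 4, 'c': 8} and its permutations

The complete match problem fixes the node sets on both sides, which removes the choice of which graph nodes to use and leaves far fewer mappings to test.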

Another common matching problem involves strings of symbols, as shown below:

Find the string "SED" in the string:

"A REPRESENTATION OFTEN USED FOR MATCHING"

As in the graph case, the complete match of two strings is less difficult than trying to match a substring against a string.
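The same point in code (a minimal sketch; the naive scan below illustrates the idea and is not an efficient string-search algorithm):

    def substring_positions(pattern, text):
        """Partial match: try every alignment of the pattern inside the text."""
        return [i for i in range(len(text) - len(pattern) + 1)
                if text[i:i + len(pattern)] == pattern]

    # A complete match is a single aligned comparison: s == t.
    # A substring match must consider every possible alignment:
    print(substring_positions("SED", "A REPRESENTATION OFTEN USED FOR MATCHING"))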


Similarity Based on Approximate Match

Approximate match problems arise in attempting to deal with the physical world. For example, in visual perception the sensed object can differ from the reference object due to noisy measurements, i.e., measurements degraded due to the effects of perspective, illumination, etc. Because many of these differences are at too low a level of resolution to be meaningful, or are irrelevant to our ultimate purpose, we must define a metric for goodness-of-match, rather than expecting to find an exact match. Using this approach, two objects are given the same label if their match score is high enough. In Appendix 9-1 we describe matching of two shapes using a "rubber sheet" representation. We imagine one of the figures to be drawn on a transparent rubber sheet and superimposed on the other; we match the figures by stretching the rubber sheet as required. The measure of goodness of match is based both on the quality of match between individual components of the objects being compared, and the required stretching of the rubber sheet to attain a match between components of the figures.
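In the spirit of that idea (but not the actual Appendix 9-1 procedure), a goodness-of-match score can reward component similarity and penalize the distortion needed to bring components into registration; the weights and threshold below are assumptions:

    def match_score(pairs, alpha=1.0, beta=0.5):
        """Average goodness of match over corresponding components.
        Each pair holds a component similarity in [0, 1] and the amount
        of 'stretch' needed to align that component."""
        return sum(alpha * sim - beta * stretch
                   for sim, stretch in pairs) / len(pairs)

    # Three matched components that align well with little stretching:
    pairs = [(0.9, 0.1), (0.8, 0.2), (0.95, 0.05)]
    score = match_score(pairs)
    print(score, score > 0.6)   # assumed acceptance threshold of 0.6

Two objects receive the same label whenever the score clears the acceptance threshold.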

LEARNING

Unsupervised (unmonitored and undirected) learning is really a paradox: How can the learner determine progress if he does not know what he is supposed to learn from his environment? This situation presents severe problems in attempting to devise programs that can learn. If the designer introduces very specific criteria for success, then the learning program is given too much focus; in fact, an explicit statement of how to evaluate success can provide the machine with just those concepts that were supposed to be learned. Unfortunately, if criteria for success are not provided, then learning is minimal.

We take the view that a system learns to deal with its environment by instantiating given models or by devising new ones. It will be convenient to divide autonomous learning techniques into the following model-based approaches:

• Parameter learning. The learner has been supplied with one or more models of the world, each with parameters of shape, size, etc., and it is the purpose of the learner to select an appropriate model from the given set, and to derive the correct parameters for the chosen model. A model can be as simple as the description of a straight line, and in this case the parameters can be the slope of the line and its distance from some given point in space. Another model could be a decision-making device with weighted inputs, and the parameters to be found are the weights.

• Description learning. Here the learner has been given a vocabulary of names and relationships, and from these he must create a description which forms the desired model.

• Concept learning. This is the most difficult form of learning because the available primitives are no longer adequate. New vocabulary terms must be invented that allow efficient representation of the relevant concepts.


Two basic problems in autonomous learning are (1) the credit assignment problem: how to reward or punish each system component that contributes to an end result, and (2) local evaluation of progress: how to tell when a change is actually progress toward the desired end goal.

Model Instantiation: Parameter Learning

Almost all AI systems with learning ability are limited to learning within a given model or representation. That is, current AI systems do not significantly modify their basic vocabulary (representations). Typically, learning consists of adjusting the parameters of the given model based on statistical techniques, or inserting and deleting connections in a given graphical representation.

Parameter Learning Using an Implicit Model. In the implicit form of parameter learning, the model is not given directly, and may, in fact, never be known. Two examples of this type of learning are described below.

The Threshold Network. Early computer scientists thought of the brain as a loosely organized, randomly interconnected network of relatively simple devices. They felt that many simple elements, suitably connected, could yield interesting complex behavior, and could form a self-organizing system. The system could learn because the random connections were designed to become selectively strengthened or weakened as the system interacted with its environment. A form of implicit model would thus be formed. In the 1960s, the threshold device was studied intensively as the basic element in such a self-organizing system. The name "perceptron" was often used to designate the basic device in which weighted inputs are summed and compared to a threshold. In a typical implementation for visual pattern recognition, a two-dimensional retina is set up so that local regions of the retina can be analyzed by feature detectors, as shown in Fig. 5-2.

Sets or vectors of k retinal values, x = [x1, x2, . . . , xk], are passed to N feature detectors, each of which determines a score fi(x) representing the degree of presence or absence of some specific feature or object in the image. These feature scores[2] are then combined in a weighted vote:

SUM = w1*f1 + w2*f2 + w3*f3 + . . . + wN*fN

SUM is compared to a threshold, T, and if SUM is greater than T, we say that the system responds positively. We want the system to respond positively for one class of objects and negatively for objects not in this class, e.g., positively when it is presented with the character "B," and negatively when presented with "A."

An algorithm for adjustment of the weights (i.e., the training) of such a system is described in Appendix 5-1. Although such training capability is attractive, this type of device has serious innate limitations.

[2] The N feature detectors are typically threshold devices that produce an fi(x) equal to +1 if the feature is present, or a -1 if it is absent.
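The following sketch shows the device together with one common weight-adjustment scheme, the classical perceptron error-correction rule. (The book's own training procedure is in its Appendix 5-1, not reproduced here; the rule below is a standard stand-in, and the toy data are invented.)

    def classify(weights, threshold, features):
        """Threshold device: weighted vote over feature scores (each +1 or -1)."""
        total = sum(w * f for w, f in zip(weights, features))
        return 1 if total > threshold else -1

    def train(samples, n_features, rate=0.1, epochs=100):
        """Classical perceptron rule: adjust weights only on mistakes,
        nudging the weighted sum toward the correct side of the threshold."""
        weights, threshold = [0.0] * n_features, 0.0
        for _ in range(epochs):
            for features, label in samples:   # label +1 (e.g., "B") or -1 ("A")
                if classify(weights, threshold, features) != label:
                    weights = [w + rate * label * f
                               for w, f in zip(weights, features)]
                    threshold -= rate * label
        return weights, threshold

    # Toy, linearly separable training set over two feature detectors:
    samples = [([1, 1], 1), ([1, -1], 1), ([-1, 1], -1), ([-1, -1], -1)]
    w, t = train(samples, 2)
    print([classify(w, t, f) for f, _ in samples])   # [1, 1, -1, -1]

The rule converges for linearly separable data; the connectivity examples discussed below show a case where no choice of weights can succeed.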


FIGURE 5-2
A Threshold Device Used as a Pattern Classifier. Patterns are assigned to one of two categories, labeled +1 or -1. (Feature detectors 1 through N produce scores f1, . . . , fN; the weighted scores w1*f1, . . . , wN*fN are summed by a threshold device whose output is +1 or -1.)

If it must make a global decision about a pattern by examining local features, then there are certain attributes of the pattern that it cannot deduce. For example, connectivity of patterns as shown in the examples of Fig. 5-3 cannot be determined by local measurements (see Minsky [Minsky 67]). Intuitively, one can see the difficulty: different pieces of each pattern are critical for keeping the pattern connected, and there is no single weighting arrangement that can capture the connectivity concept.

FIGURE 5-3
The Connectedness Problem Cannot Be Solved by a Single Threshold Device. (A row of dot patterns, alternately connected and not connected.)

There has recently been renewed interest in threshold networks because of the development of new techniques for obtaining the network parameters (learning algorithms). The techniques are based on "simulated annealing," a statistical mechanics method for solving complex optimization problems. (The name comes from the fact that the procedure is analogous to the use of successively reduced temperature stages in the annealing of metals.) These techniques can be used to find the weights in a multilevel threshold network by considering the weight determination problem to be a minimization problem in many variables. This approach avoids being trapped in local minima by the use of two mechanisms: (1) the adjustments are made randomly, and (2) an adjustable "temperature" parameter, T, controls the probability of acceptance of any change in a variable.

To minimize a function E(xi) of the many variables xi, the minimum E is found by starting with a random set of xi and introducing small random perturbations into these xi. If a perturbation decreases E, it is accepted. If a perturbation increases E, it is accepted with a probability that is a function of the change in E and the parameter T: the probability decreases with the size of the change in E and increases with increasing T. By beginning with a high T there is a high probability of acceptance of a change in xi. T is decreased as the process proceeds, thus decreasing the probability of the acceptance of a bad perturbation. The random acceptance of "bad" perturbations as a function of the change in E and T is the mechanism that drives the system out of local minima.
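A minimal sketch of this procedure (the energy function, cooling schedule, and step size are illustrative assumptions):

    import math
    import random

    def simulated_annealing(energy, x, temp=10.0, cooling=0.95, steps=2000):
        """Minimize energy(x) by random perturbation, accepting some
        energy-increasing moves with probability exp(-delta/T)."""
        best = list(x)
        for _ in range(steps):
            trial = list(x)
            i = random.randrange(len(x))
            trial[i] += random.uniform(-0.5, 0.5)    # small random perturbation
            delta = energy(trial) - energy(x)
            # Always accept improvements; accept bad moves with a
            # probability that shrinks as T is lowered.
            if delta < 0 or random.random() < math.exp(-delta / temp):
                x = trial
                if energy(x) < energy(best):
                    best = list(x)
            temp *= cooling                          # cooling schedule
        return best

    # A bumpy one-variable energy surface with many local minima;
    # the global minimum is near x = 0.
    f = lambda v: v[0] ** 2 + 3 * math.sin(5 * v[0]) ** 2
    print(simulated_annealing(f, [4.0]))

Started in a local basin near x = 4, the occasional acceptance of uphill moves at high temperature lets the search escape toward the global minimum.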

Using this approach, Sejnowski [Sejnowski 85] was able to train a multilayer network to transform natural language text in English to the phonetic code for the sounds. The training set was a large sample of natural language text and the corresponding phonetic codes prepared by linguists. Using the training set, the network weights were adjusted using simulated annealing so that the phonetic codes developed by the network agreed with the codes in the training set prepared by the linguists. The spoken form of the text was obtained by feeding the phonetic code developed by the threshold network into a sound synthesizer. It is rather dramatic to hear the effects of the training process: the output of the network is first a babble of sounds, and then becomes more and more human-sounding as the training process proceeds. After training, the network is able to synthesize the sounds of text not in the training set.

It should be recognized, however, that adjustment of weights to correctly recognize the members of a training set is not the critical issue in building a learning system. What we require is that the corresponding recognition rule generalizes distinctions represented by the training set so as to be applicable to new members of the class. Since the weight-adjustment learning algorithms do not address this question, even successful generalization on a few selected problems provides very little assurance that this type of approach can be successful on a wide range of real problems.

The "Bucket Brigade"ProductionSystem. Holland [Holland 83] intro­duced a learning mechanism (the bucketbrigade algorithm or BBA)which offersan interestingapproach to the problemof howto credit portions of the systemthat are contributingto a solution, thecredit assignmentproblem. He usesthe production rule representation, tobe further discussed in Chapter 7, ofthe form, IF < condition> THEN<action>.

In a production system, all the pro­ductionsinteract with a global list (GL), or


FIGURE 5-4
The Bucket Brigade Approach to Parameter Learning. (Rule 1, "If A then B," posts B to the global list; Rule 2, "If B then C," posts C. Rule 2 pays Rule 1 because B was posted to the global list, and the outside world pays the active Rule 2 because C was posted.)

In the BBA approach, when the <condition> portion of a production is satisfied it does not automatically activate the <action> portion. Rather, the production is allowed to make a bid based on an associated utility parameter (which measures its past effectiveness and its current relevance); only the highest bidding productions are allowed to fire. As shown in Fig. 5-4, the BBA treats each production rule as a middleman in an economic system, and its utility parameter is a measure of its ability to make a profit. Whenever a production fires, it makes a payment by giving part of the value of its utility parameter to its suppliers. (The suppliers of a production P are those other productions that posted data on GL that allowed P's condition portion to be satisfied.) In a similar way, a production receives a payment from its consumers when it helps them satisfy the condition part of their rules.

If a production receives more from its consumers than it pays out to its suppliers, then it has made a profit and its capital (utility parameter) increases. Certain actions produce payoffs from the external environment, and all productions active at such a time share this payoff. The profitability of a production thus depends on its being part of a chain of transactions that ultimately obtains a payoff from the environment, and such that this payoff exceeds the incurred expenses. In this way, a production can receive due credit even though it is far removed from the successful action to which it contributed.
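A drastically simplified sketch of this economy follows (Holland's full algorithm fires many bidders in parallel and includes taxes and message lists; the single-winner rule, bid fraction, and payoff values here are illustrative):

    class Rule:
        """A production whose strength (utility parameter) backs its bids."""
        def __init__(self, name, condition, action, strength=10.0):
            self.name, self.condition, self.action = name, condition, action
            self.strength = strength

    def bucket_brigade_step(rules, global_list, suppliers, bid_frac=0.1):
        """Matched rules bid a fraction of their strength; the winner fires,
        paying its bid to the rule that posted the data it matched on."""
        matched = [r for r in rules if r.condition in global_list]
        if not matched:
            return
        winner = max(matched, key=lambda r: bid_frac * r.strength)
        bid = bid_frac * winner.strength
        winner.strength -= bid                    # winner pays its bid
        supplier = suppliers.get(winner.condition)
        if supplier is not None:
            supplier.strength += bid              # its supplier is rewarded
        global_list.add(winner.action)
        suppliers[winner.action] = winner         # winner supplies its action

    # The chain of Fig. 5-4: rule 1 posts B; rule 2, enabled by B, posts C.
    r1, r2 = Rule("1", "A", "B"), Rule("2", "B", "C")
    gl, suppliers = {"A"}, {}
    for _ in range(2):
        bucket_brigade_step([r1, r2], gl, suppliers)
    if "C" in gl:
        r2.strength += 5.0    # external payoff because C was posted
    print(r1.strength, r2.strength)   # 10.0 14.0: r1 recoups its bid, r2 profits

Repeated over many cycles, strength flows backward along successful chains, so early rules far removed from the final payoff still accumulate credit.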

Parameter Learning Using an Explicit Model. In parameter learning for an explicit model, a pattern or situation is expressed as a list of attribute measurements corresponding to the arguments of the model. The model itself also contains free parameters which may be adjusted to improve performance. Learning consists of changing the free parameters (which can be measurement weights, as in the case of an implicit model) based on observed performance. An example of learning using an explicit model is an adaptive control system. Here the system, e.g., an aircraft, is modeled using equations that describe its dynamic response. Algorithms are provided that adjust the parameters of the dynamic model, based on the measured in-flight performance of the aircraft.

One of the earliest and most effective parameter learning programs for an explicit model is Samuel's checker-playing program [Samuel 67], described in Box 5-2.

Most game-playing programs (including Samuel's) use a game tree representation in which branches at alternate levels of the tree represent successive moves by the player and his opponent, and the nodes represent the resulting game (or board) configuration. Typically, the complete tree is too large to be explicitly represented or evaluated. Instead, a value is assigned to nodes a few levels beyond the current position using an evaluation function that guides the program in the selection of desirable moves to be made. This evaluation function is not guaranteed to provide a correct ranking of potential moves; it usually takes into account the number, nature, and location of the pieces, and other attributes that the designer feels characterize the position.
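A minimal sketch of such an evaluation function and the move choice it supports (the features, weights, and move data are invented; a real program would search several levels down the tree rather than evaluate only immediate successors):

    def evaluate(features, weights):
        """Linear evaluation function: a weighted sum of board features
        such as piece advantage and mobility."""
        return sum(w * f for w, f in zip(weights, features))

    def best_move(moves, weights):
        """Choose the legal move whose resulting position scores highest."""
        return max(moves, key=lambda move: evaluate(move[1], weights))

    # Hypothetical features: (piece advantage, mobility, center control).
    weights = [1.0, 0.3, 0.2]
    moves = [("advance", [0, 4, 1]),    # score 1.4
             ("capture", [1, 2, 0]),    # score 1.6
             ("retreat", [0, 1, 2])]    # score 0.7
    print(best_move(moves, weights)[0])   # "capture"

Learning, in this setting, means adjusting the weights so that the ranking produced by evaluate agrees more often with good play.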

BOX 5-2 Checkers: Parameter Learning Using a Problem-Specific Model

Samuel's checker-playing program is a successful example of parameter learning using a problem-specific model; this program is able to beat all but a few of the best checker players in the world. A game tree representation is used in which each node corresponds to the board configuration after a possible move, and the move to be made is determined after assigning values to the nodes and choosing a legal move leading to the node with the highest score. Since a full exploration of all possible moves requires the evaluation of about 10^40 moves, the approach is to evaluate only a few moves by applying a heuristic evaluation function that assigns scores to the nodes being considered. To improve the performance of the checker player, one can allow it to search farther into the game tree, or one can improve the heuristic evaluation function to obtain a more accurate estimate of the value of each position.

Samuel improved the look-ahead capability of his system by saving every board position encountered during play, along with its most extensive evaluation. To take up little computer storage space, and to retrieve the results rapidly, required the use of clever indexing techniques and taking advantage of board symmetries. When a previously considered board position is encountered in a later game its evaluation score is retrieved from memory rather than recomputed. Improved look-ahead capability comes about as follows: If we look ahead only three levels to a board position P, but P has a previously stored evaluation value that represents a look-ahead of three levels, then we have performed the equivalent of a look-ahead of six levels.

The evaluation function depends on (1) measures of various aspects of the game situation, such as "mobility," (2) the function to be used to combine these measures, and finally (3) the specific weightings that should be used for each measure in the evaluation function. The designer supplies the measurement algorithms and the form of the function to be used, while the weightings are obtained by a "learning" procedure. The system learns to play better by adjusting weights which determine how much each parameter contributes to the evaluation function. Samuel investigated two forms of the evaluation function:

1. A function of the form MOVE_VALUE = SUM(wi*fi), where fi is the numerical value of the ith board feature, and wi is the corresponding weight assigned to that feature. Since each feature independently contributes to the overall score, the weight can be considered as a measure of the importance of that feature. Thus, features with low weights can be discarded.

2. An evaluation function in which features were grouped together to obtain a score by using a look-up table that says: if board feature A has a value of X, and board feature B has a value of Y, . . . , then that group of features has a score of Z. The values assigned to the grouped features in the look-up tables correspond to the weights in the first evaluation function; these are modified by a learning procedure.

Detailed records of games played by experts, book games, were used for training the system, the checker program taking the part of one of the experts. If the program makes the move that the expert made for a particular game situation, the evaluation function was assumed to be correct, since it caused the correct path in the game tree to be taken. If the wrong move was made, then the evaluation function was modified. Various ad hoc weight modification schemes were incorporated into the program. Samuel found that the second evaluation function was far more effective than the first.

The approach of relating specific board positions and corresponding expert responses has an important advantage. Reward and punishment training (learning) procedures can be based on the local situation, rather than on winning or losing the game (because the expert has a global view in mind when making his move). This local indication of ultimate success or failure is not usually available in non-game playing situations.


The problems in designing a good evaluation function are (1) the choice of the components of the evaluation function, (2) the method of combining these components into a composite evaluation function, and (3) how to modify the components and the composite functions as experience is gained in playing the game. (The credit assignment problem arises here: how to assign credit or blame due to something that happened early in the game, since the good or bad results do not show up until much later.)

The third factor is the one that typically constitutes the machine learning to play the game. Note, however, that this learning is crucially dependent on the cleverness of the human designer in solving the first two problems.

Model Construction: Description Models

Programs capable of forming their own description models take an initial set of data as their input, and form a theory or a set of rival theories with respect to the given data. The theories are descriptions or transformations that satisfy the initial data. The program then attempts to generalize these theories to cover additional data. This typically results in a number of admissible theories. Finally, the program chooses the strongest theory from these, i.e., one that best explains the data encountered to date.




Learning analogical relationships. An early investigation of this class of problems by T. G. Evans [Evans 68] still stands as a remarkably insightful work on analogy and the learning process (see Box 5-3). Evans dealt with the machine solution of so-called geometric-analogy intelligence test questions. Each member of this class of problems consists of a set of line drawings labeled Figures A, B, C, C1, C2, . . . .

BOX 5-3 Solving Geometric Analogy Problems

In 1963, Tom Evans [Evans 68] devised a computer program, ANALOGY, which could successfully answer questions of a type found on IQ tests: Figure A is to figure B, as figure C is to which of the following figures? The operation of this geometric analogy program can best be illustrated by a toy example. Suppose we have the patterns shown in Fig. 5-5. In describing figure A, the program would first assign names to each pattern. Thus,

(A1 = +), (A2 = -), and (A3 = 0).

Similarly, in figure B,

(B1 = -), (B2 = +), and (B3 = 0).

FIGURE 5-5
Geometric Analogy Example.

In figure A, analysis would reveal that A2 is to the right of A1, A3 is below A1, and A3 is below A2. Analysis of figure B would show that B2 is to the right of B1, B3 is below B1, and B3 is below B2.

Comparing the descriptions of the two figures, the program would find that A1 = B2, that A2 = B1, and A3 = B3. From this it would deduce that the transformation rule is: interchange A1 and A2 to obtain figure B from figure A.

Suppose we have the patterns shown in Fig. 5-6. We want the program to answer the question: figure A is to figure B as figure C is to which figure (C1, C2, or C3)? Again, the program has to find the correspondences between subfigures in figure C and the subfigures in figures C1, C2, and C3.

After determining these correspondences, the program can apply the transformation found between figures A and B, and will find that both figures C2 and C3 are possible answers. It finally determines that figure C3 is not acceptable because the Z pattern has also been transformed (from bottom to top). Thus, it is sometimes necessary to specify which things remain constant, as well as which things change, in specifying the transformation rule. For example, in the transformation rule relating figures A and B, it is necessary to assert that the bottom pattern does not change its location with respect to the two other patterns. Given this extended transformation rule, figure C2 is selected as the desired solution.

FIGURE 5-6
Analogy Test Patterns.

ANALOGY uses a fixed model provided by the designer (the set of transformation operators), and finds, by exhaustive search, the proper transformation parameters for the various pairs of figures. The approach depends on having a small number of patterns in each figure; otherwise the combinatorial growth of the pattern relationship lists would become excessive.
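The toy example above reduces to a few lines of code. (This sketch indexes patterns by position and assumes each shape occurs once per figure; the real ANALOGY program works from spatial relations such as "inside" and "above," not from indexed slots.)

    def transformation(fig_a, fig_b):
        """For each position in A, find where its shape reappears in B;
        e.g., {1: 2, 2: 1, 3: 3} means positions 1 and 2 are interchanged."""
        return {pos: next(p for p, shape in fig_b.items() if shape == s)
                for pos, s in fig_a.items()}

    def apply_rule(rule, fig):
        """Apply an A-to-B transformation rule to a new figure."""
        return {rule[pos]: shape for pos, shape in fig.items()}

    A = {1: "+", 2: "-", 3: "0"}
    B = {1: "-", 2: "+", 3: "0"}
    rule = transformation(A, B)          # interchange positions 1 and 2

    C = {1: "x", 2: "y", 3: "z"}
    candidates = {"C1": {1: "x", 2: "y", 3: "z"},
                  "C2": {1: "y", 2: "x", 3: "z"}}
    predicted = apply_rule(rule, C)
    print([name for name, fig in candidates.items() if fig == predicted])  # ['C2']

Selecting the candidate that equals the predicted figure mirrors the program's final comparison step.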


The task to be performed can be described by the question, Figure A is related to Figure B, as Figure C is related to which of the following figures? To solve the problem the program must find transformations that convert Figure A into Figure B and also convert Figure C into one of the answer figures. The sequence of steps in the solution is as follows:

• Overlapping subfigures must be identified in line drawings, e.g., two overlapping circles must be recognized as such.

• The relationships between the subfigures in a main figure must be determined, e.g., the dot is inside the square.

• The transformations between subfigures in going from Figure A to Figure B must be found, e.g., the triangle in Figure A corresponds to the small square in Figure B, or the dot inside the square in Figure A is outside the square in Figure B.

• The subfigures in Figure C that are analogous to the subfigures in Figure A must be recognized.

• The program must analyze what figure results when the transformations are applied to Figure C.

• The transformed Figure C must then be compared to the candidate figures.

A description of Evans' ANALOGY program is given in Box 5-3.

Learning descriptions. A study by Winston [Winston 75] deals with developing descriptions of classes of blocks-world objects (see Box 5-4). Given a set of objects identified as to class, known as a training set, the program is to produce (learn) a description of the class. The program is given a set of description primitives and develops a description of the object class as training objects are presented one at a time. The program notes the commonalities between positive instances, and differences between negative instances and the current description. Each example leads to a modified description that ultimately characterizes that class of objects. When a training example is presented that is incompatible with the present description, the program backtracks to a previous description and attempts to modify it so as to be compatible with the new instance.

The final description produced by Winston's system is dependent on the vocabulary provided and on the sequence of examples shown. In addition, the program must assume that the class labeling of the examples is correct, since the modification of the description depends on the class assignments given. Furthermore, some of the examples that are not of the class must differ only to a small extent from those that are in the class; otherwise, the program will not be able to discover the fine distinctions between membership and nonmembership in the class. (For example, in learning what an "arch" is, the program is given the example of a non-arch whose support posts touch, but is otherwise a valid arch. This allows the program to detect this important difference between a non-arch and an arch.)
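A minimal sketch of this near-miss idea (the feature vocabulary and update policy are simplified inventions, not Winston's actual semantic-net procedure):

    def update(description, example, positive):
        """Incremental description learning: a positive example that violates
        a requirement broadens it; a near-miss (a negative example differing
        in exactly one feature) turns that feature into a hard constraint."""
        diffs = [k for k in description
                 if description[k]["value"] not in ("any", example[k])]
        if positive:
            for k in diffs:
                description[k]["value"] = "any"       # generalize
        elif len(diffs) == 1:                          # usable only as a near-miss
            description[diffs[0]]["essential"] = True  # e.g., MUST-NOT-ABUT
        return description

    desc = {"top": {"value": "rectangular", "essential": False},
            "supports_abut": {"value": False, "essential": False}}
    # Near-miss: the supports touch, but the example is otherwise arch-like.
    update(desc, {"top": "rectangular", "supports_abut": True}, positive=False)
    # A valid arch with a triangular top broadens the 'top' requirement.
    update(desc, {"top": "triangular", "supports_abut": False}, positive=True)
    print(desc)

A negative example differing in several features would be ignored here, which is exactly why each training non-example must be a near-miss.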

Learning generalizations of descriptions and procedures. It is possible to consider the problem of generalizing a description (or a procedure) as a search problem in which the space of all possible descriptions is examined to find (learn) the most general description that satisfies a set of training examples. This approach is used in LEX [Mitchell 83].


BOX 5-4 Learning Descriptions Based on (Given) Descriptive Primitives

In 1970, Patrick Winston [Winston 75] devised a learning program which was able to produce descriptions of classes of simple block constructions presented to the program in the form of line drawings. (A specialized subprogram converted these line drawings into a semantic network description.)

The semantic net in Fig. 5-7 indicates that an arch, such as is shown in the figure, consists of three pieces, a, b, and c. Training instances cause the various links to be inserted, e.g., that piece a has the property of being supported by pieces b and c; pieces b and c must not abut; and piece a must be lying down while pieces b and c are standing.

FIGURE 5-7
Semantic Net for the Concept "Arch." (Piece a has property LYING and must be supported by pieces b and c, which must not abut; pieces b and c each have property STANDING.)

FIGURE 5-8
Block Structures That Do Not Form an "Arch."

The must not abut description is deduced by the system only after showing it a non-arch example consisting of a set of blocks in which the supports do abut, as shown in Fig. 5-8. An implicit requirement is that examples of a non-arch should not have characteristics that differ from the existing arches and are not important in defining the arch class. If the top block in an example is curved, and the supports touch, then the program will not know which of the two deviations is causing the blocks to be an instance of a non-arch. Thus, each non-arch used for training must be a near-miss to an actual arch.

If examples are shown to the system in which the top piece is always a rectangular solid, then the program will assume that this is a requirement for an arch. Only after the system is shown an arch consisting of a triangular top piece will the description be broadened to include both rectangular and triangular top pieces.

An approach such as Winston's that tries to find a description that is consistent with all of the training instances will be strongly affected by erroneous data, i.e., data in which a false-positive or false-negative instance has been given. A false-positive causes the description to be more general than it should be, and a false-negative causes the description to be overspecialized.


LEX is a program that learns "rules of thumb," known more formally as heuristics, for solving problems in integral calculus. Basically, a rule indicates that if a certain form of mathematical expression is found, then a specific transformation should be applied to the expression. Sometimes these procedures involve looking up the integration form in a table, while at other times they may involve carrying out operations on the expression to convert it to a more manageable form. The program begins with about 50 operators (procedures) for solving problems in the integral calculus, and it learns a set of successively more general heuristics which constitute a strategy for using these operators (Box 5-5).

The generalization language made available to LEX is crucial since it determines the range of concepts that LEX can describe, and thus learn. For example, the designer has provided a generalization hierarchy that asserts that sin and cos are specific instances of a trigonometric function, and that a trigonometric function is a specific instance of a transcendental function. LEX uses four modules:

1. The problem solver utilizes what­everoperatorsand heuristics arecurrently available to solve a givenpracticeproblem.

2. The critic analyzes the solution treegeneratedbythe problem solver toproduce a set ofpositive and negativetraining instances from which heuris­tics will be inferred. Positive in­stancescorrespond to steps along thebest solution path, and negative in­stances correspond to steps leadingaway from this solution path.

3. The generalizer proposes and refinesheuristics to produce more effective

problem solving behavior on subse­quent problems. It formulates heuris­tics bygeneralizing from the traininginstances provided by the critic.

4. The problem generator generates practice problems that will be informative (i.e., they will lead to training data useful for proposing and refining heuristics), yet easy enough to be solved using existing heuristics.

These modules work together to propose and incrementally refine heuristics from training instances collected over several propose-solve-criticize-generalize learning cycles.
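As a rough illustration of how the four modules compose, the skeleton below passes the modules in as callables; their internals, and the loop itself, are placeholders invented for the sketch, not Mitchell's implementation:

    # Skeleton of LEX's propose-solve-criticize-generalize cycle.
    # The four callables stand in for LEX's modules.

    def lex_learning_loop(generate, solve, criticize, generalize,
                          heuristics, cycles=10):
        for _ in range(cycles):
            problem = generate(heuristics)              # problem generator
            solution_tree = solve(problem, heuristics)  # problem solver
            pos, neg = criticize(solution_tree)         # critic: training data
            heuristics = generalize(heuristics, pos, neg)  # generalizer
        return heuristics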

Unlike most systems that retain only a single description at any given time, LEX describes each partially learned heuristic using a representation (called version space) that provides a way of characterizing all currently plausible descriptions (versions) of the heuristic. The basic ordering relationship used in the version space representation of rules is that of more-specific-than. For example, the precondition (large red circle) is more specific than (large ? circle), which is more specific than (? ? circle), where ? indicates that this condition need not be satisfied. More formally stated, a rule G1 is more-specific-than a rule G2 if and only if the preconditions of G1 match a proper subset of the instances that G2 matches, and provided that the two heuristics make the same recommendation.

LEX stores in version space only the maximally specific and maximally general descriptions that satisfy the training data. Thus the more-specific-than relation (provided as part of LEX's originally given, hierarchically ordered vocabulary) partially orders the space of possible heuristics, provides the basis for their efficient representation, and supplies a prescription for their generalization.
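The ordering itself is easy to state in code. The toy matcher below works over fixed-length attribute patterns with "?" as a wildcard; the position-by-position test is our simplification of the subset-of-instances definition, not LEX's code:

    # Toy illustration of the more-specific-than ordering over patterns
    # of attribute values, where "?" matches anything.

    def matches(pattern, instance):
        return all(p == "?" or p == i for p, i in zip(pattern, instance))

    def more_specific_than(g1, g2):
        # g1 is more-specific-than g2 when g2 matches every instance g1 does;
        # for wildcard patterns this reduces to a position-by-position check.
        return all(q == "?" or p == q for p, q in zip(g1, g2))

    print(matches(("large", "?", "circle"), ("large", "red", "circle")))             # True
    print(more_specific_than(("large", "red", "circle"), ("large", "?", "circle")))  # True
    print(more_specific_than(("?", "?", "circle"), ("large", "?", "circle")))        # False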

Concept Learning

In concept learning, the learning system must develop appropriate concepts (descriptive primitives) for dealing with its environment. Because of its difficulty, little progress of a general nature has been made in this problem area. One of the few significant efforts that both addresses the issue of concept learning, and has achieved some computational success, is AM. The AM program [Lenat 77] runs experiments in number theory and analyzes the results to develop new concepts or theorems in the subject.


BOX 5-5 Version Space: A Representation for Descriptions

The LEX program [Mitchell 83b] deals with the problem of learning rules of thumb to solve calculus integration problems. Each heuristic rule suggests an integration procedure (operator) to apply when the given expression satisfies a corresponding functional form. LEX uses a representation called version space to keep track of the rules that it is learning. Stored information is kept to a reasonable size by including only the most specific and the most general descriptions of the rules that satisfy the training examples.

The program is given a set of operators for solving calculus problems. A typical operator,

OP1: INTEGRAL (u dv) --> uv - INTEGRAL (v du),

indicates how the INTEGRAL of an expression can be solved, and often the solution is recursive, i.e., it requires solution of an additional INTEGRAL.

An example of a version space rule is shown in Fig. 5-9(1). The most specific heuristic is marked S and indicates that if a specific expression, {3x cos(x) dx}, is encountered, then operator 1 should be used, with the variables u and dv in this operator replaced by the specific values shown.

The most general rule, denoted by G, indicates the substitutions that should be made in operator 1 for the more general expressions f1(x) and f2(x). Thus, while many additional heuristics are implied, S and G completely delimit the range of alternatives in a predefined general-to-specific ordering.

Assume that a new training instance involving {INTEGRAL 3x sin(x) dx} is encountered, and the program discovers the solution

INTEGRAL 3x sin(x) dx --> Apply OP1 with u = 3x and dv = sin(x) dx.

Given the generalization hierarchy supplied by the designer,

sin(x), cos(x) --> trig(x) --> transc(x) --> f(x)
kx --> monom(x) --> poly(x) --> f(x),

LEX determines that the most specific version of the heuristic of Fig. 5-9(1) could be generalized to include both cos(x) and sin(x), by using trig(x) in place of cos(x). This generalization is shown in the revised version space representation of the heuristic rule, Fig. 5-9(2).

However, an example in the form {INTEGRAL sin(x) 3x dx} provides evidence that the most general form of the heuristic rule, as specified in Fig. 5-9(1), can fail unless it is further restricted. Specifically, the original generalization proposed is: Apply OP1 with u = f1(x) and dv = f2(x) dx; but the sin(x) 3x training example shows that this assignment of u and dv can result in a more complicated expression that OP1 cannot integrate, and that an interchange of f1(x) and f2(x) may be necessary, depending on the specific functional forms of f1 and f2.


The program begins with 115 incomplete data structures, each corresponding to an elementary set-theoretic concept (such as union or equality). Each data structure has 25 different facets or slots, such as examples, definitions, generalizations, analogies, interestingness, etc., to be filled out. Very interesting slots can be granted full concept module status. This is the space that AM begins to explore, guided by a large body of heuristic rules.

AM operates under the guidance of 250 heuristic rules which indirectly control the system through an agenda mechanism, a global list of tasks for the system to perform, together with reasons why each task is desirable. A task directive might cause AM to define a new concept, or to explore some facet of an existing concept, or to examine some empirical data for regularities, etc. The program selects from the agenda the task that has the best supporting reasons, and then executes it.
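A minimal reading of the agenda mechanism is a priority queue of tasks scored by the strength of their supporting reasons. In the sketch below, the task format, the reason weights, and the additive scoring rule are all invented for illustration; they are not Lenat's code:

    import heapq

    # Toy agenda: tasks carry the reasons that justify them, and the task
    # with the strongest combined support is executed first.

    def add_task(agenda, task, reasons):
        score = sum(weight for _, weight in reasons)     # invented scoring rule
        heapq.heappush(agenda, (-score, task, reasons))  # max-priority via negation

    def next_task(agenda):
        _, task, reasons = heapq.heappop(agenda)
        return task, reasons

    agenda = []
    add_task(agenda, "fill examples slot of PRIMES",
             [("few examples known", 0.7)])
    add_task(agenda, "define new concept from EQUALITY mutation",
             [("analogy to SET-UNION", 0.4), ("high interestingness", 0.5)])
    print(next_task(agenda)[0])   # the mutation task wins, 0.9 vs 0.7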

BOX 5-5 (continued)

(1) VERSION SPACE REPRESENTATION OF A HEURISTIC

S: INTEGRAL 3x cos(x) dx --> Apply OP1 with u = 3x and dv = cos(x) dx
G: INTEGRAL f1(x) f2(x) dx --> Apply OP1 with u = f1(x) and dv = f2(x) dx

(2) REVISED VERSION SPACE REPRESENTATION OF THE HEURISTIC

S: INTEGRAL 3x trig(x) dx --> Apply OP1 with u = 3x and dv = trig(x) dx
G: g1: INTEGRAL poly(x) f2(x) dx --> Apply OP1 with u = poly(x) and dv = f2(x) dx
   g2: INTEGRAL transc(x) f1(x) dx --> Apply OP1 with u = f1(x) and dv = transc(x) dx

GENERALIZATION HIERARCHY SUPPLIED BY THE DESIGNER

sin(x), cos(x) --> trig(x) --> transc(x) --> f(x)
kx --> monom(x) --> poly(x) --> f(x)

FIGURE 5-9 Version Space Representations of Various Forms of a Heuristic Rule for Symbolic Integration.

The description of the heuristic in version space is revised to take this new information into account by breaking the most general form of the heuristic into two statements. In addition, the generalization hierarchy is used to replace f1(x) in one generalization by its more specific form poly(x), and in the other expression to replace f2(x) by its more specific form transc(x). The revised form of the heuristic of Fig. 5-9(1), after the two training examples just discussed, is shown in Fig. 5-9(2).

We note that the learning process described here always involves the narrowing of a description represented in version space. This narrowing occurs in discrete steps, based on the predefined hierarchically ordered vocabulary for the given problem domain.
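The generalization step applied to the S boundary, climbing the designer's hierarchy to the least general term that covers both the old description and a new positive instance, can be caricatured in a few lines. The hierarchy encoding and function names below are our own simplifications of this idea (the matching specialization of the G boundary is omitted for brevity):

    # Toy version-space generalization over a fixed hierarchy.
    # parent[] encodes: sin, cos -> trig -> transc -> f ; 3x -> monom -> poly -> f

    parent = {"sin": "trig", "cos": "trig", "trig": "transc", "transc": "f",
              "3x": "monom", "monom": "poly", "poly": "f"}

    def ancestors(term):
        chain = [term]
        while chain[-1] in parent:
            chain.append(parent[chain[-1]])
        return chain

    def generalize_S(s_term, positive_term):
        # Climb the hierarchy to the first term covering both descriptions,
        # e.g., generalize_S("cos", "sin") == "trig".
        for candidate in ancestors(s_term):
            if candidate in ancestors(positive_term):
                return candidate
        return "f"

    print(generalize_S("cos", "sin"))   # trig
    print(generalize_S("trig", "3x"))   # f  (only the root covers both)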


The heuristic rules suggest which facet of which concept to enlarge next, and how and when to create new concepts. Lenat provided the system with a specific algorithm for rank-ordering concepts in terms of how interesting they are. The primary goal of AM is to maximize the interestingness level of the concept space it is enlarging.

Many heuristics used in AM embody the belief that mathematics is an empirical inquiry: the approach to discovery is to perform experiments, observe the results, gather statistically significant amounts of data, induce from that data some new conjectures or new concepts worth isolating, and then repeat this whole process again. An example of a heuristic dealing with this type of experimentation is: After trying in vain to find some non-examples of X, if many examples of X exist, consider the conjecture that X is universal, always-true. Consider specializing X.

Another large set of heuristics deals with focus of attention: when should AM keep on the same track, and when not. A final set of rules deals with assigning interestingness scores based on symmetry, coincidence, appropriateness, usefulness, etc. For example: Concept C is interesting if C is closely related to the very interesting concept X.

Experimental evidence indicates that AM's heuristics are powerful enough to take it a few levels away from the kind of knowledge it began with, but only a few levels. As evaluated by Lenat, of the 200 new concepts AM defined, about 130 were acceptable and about 25 were significant; 60 to 70 of the concepts were very poor.

Although AM is described as "exploring the space of mathematical concepts," in essence AM was an automatic programming system, whose primitive actions produced modifications to pieces of LISP code, which represent the characteristic functions of various mathematical concepts. For example, given a LISP program that detects when two lists are equal, it is possible to make a "mutation" in the code that now causes it to compare the lengths of two lists, and another mutation might cause it to test whether the first items on two lists are the same. Thus, because such mutations often result in meaningful mathematical concepts, AM was able to exploit the natural tie between LISP and mathematics, and was benefiting from the density of worthwhile mathematical concepts embedded in LISP.

The main limitation of AM was its inability to synthesize effective new heuristics based on the new concepts it discovered. It first appeared that the same machinery used to discover new mathematical concepts could also be used to discover new heuristics, but this was not the case. The reason is that the deep relationship between LISP and mathematics does not exist between LISP and heuristics. When AM applied its mutation operators to viable and useful heuristics, the almost inevitable result was useless new heuristic rules.
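The list-equality example can be made concrete. Below, a recursive equality test is "mutated" twice by hand (in Python rather than LISP, and with names invented for the sketch) to show how small code edits yield the new concepts same-length and same-first-element:

    # Characteristic function of the concept "equal lists" (recursive form).
    def lists_equal(a, b):
        if not a and not b:
            return True
        if not a or not b:
            return False
        return a[0] == b[0] and lists_equal(a[1:], b[1:])

    # Mutation 1: drop the element comparison; what remains is the
    # concept "same length" (an AM-style accidental discovery).
    def same_length(a, b):
        if not a and not b:
            return True
        if not a or not b:
            return False
        return same_length(a[1:], b[1:])

    # Mutation 2: drop the recursion; what remains is "same first element."
    def same_first(a, b):
        return bool(a) and bool(b) and a[0] == b[0]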

In evaluating the accomplishments of AM, there is also the question of the extent to which the designer implicitly supplied the limited set of mathematical concepts that were (or could be) generated by the system, by supplying the particular initial set of heuristics, frame slots, and the definitions and procedures that determine how these slots get instantiated. An extensive criticism of AM, and a later related program called EURISKO, was offered by Ritchie and Hanna [Ritchie 84]. In their response [Lenat 84], Lenat and Brown compare the AM/EURISKO approach to that of the perceptron, a device described earlier in this chapter:

The paradigm underlying AM and EURISKO may be thought of as the new generation of perceptrons, perceptrons based on collections or societies of evolving, self-organizing, symbolic knowledge structures. In classical perceptrons, all knowledge had to be encoded as topological networks of linked neurons, with weights on the links. The representation scheme used in EURISKO provides much more powerful linkages, taking the form of heuristics about concepts, including heuristics for how to use and evolve heuristics. Both types of perceptrons rely on the law of large numbers, on a kind of local-global property of achieving adequate performance through the interactions of many relatively simple parts.

DISCUSSION

There are several key issues that have arisen in our exposition of the learning process:

Representation. As is the case in other areas of AI, having an appropriate representation is crucial for learning; it is often necessary to be able to modify an existing representation, or create a new one, in dealing with a given problem domain. However, the learning programs reviewed had relatively fixed representations, provided by the designer, which bounded the learning process. For example, the problem of telling when a situation, object, or event is similar to another lies at the heart of the learning process; a related problem is determining when something is a more general or specific instance of something else, i.e., determining the generalization hierarchy. Most of the systems examined use a generalization hierarchy provided by the designer, and methods for measuring similarity almost invariably depend on comparing predefined attribute vectors, or on graph matching based on a predefined vocabulary. Thus, existing learning systems are inherently limited to instantiating a predefined model; their ability to really discover something new, or to exhibit continuing growth, is virtually nonexistent.

Problem generation. A system should be able to generate its own problems so that it can refine its learned strategies. This type of problem generation is seen in a child learning to use building blocks. Various structures are tried so that the physics of block building can be learned. LEX, and to some extent AM, were the only machine learning programs discussed that had the ability to effectively select problems. In all the others it is up to the human operator (or trial and error) to choose appropriate problems to promote learning.

Focus of attention. A learning system should be able to alter its focus of attention, so that problem solutions can be effectively found. Many of the programs that we examined used a fixed approach to problem solving, and did not have the ability to focus on a problem bottleneck. Capability in this area is related to self-monitoring, i.e., if one knows how well he is doing, then critical problem areas can be identified, and priorities can be altered accordingly.

Limits on learning ability. How is new learning related to the knowledge structures already possessed by a system? To take an extreme example, no matter how many ants we test, or how hard they try, it is inconceivable that an ant could devise the equivalent of Einstein's theory of relativity. On what grounds is this intuition based? It certainly appears to be the case that any system with a capacity for self-modification, or learning, has limits on what it can reasonably hope to achieve. At present, we do not even have a glimmering of a theory that quantifies such limits. The amount of knowledge current machine learning systems start off with is invariably based on practicality and other ad hoc criteria.

Human learning. There appears to be almost nothing in the physiological or psychological literature concerning human learning that can aid us in the design of a machine that learns. The mechanisms used by the human to learn autonomously when immersed in a complex environment remain a mystery. In particular, the ability of the human to form new concepts when required is not understood.

Perhaps the most interesting open question is whether it is possible for a mechanical process to create new concepts and representations using an approach more powerful than trial and error. At present, no such procedure is known.

Appendix 5-1

Parameter Learning for an Implicit Model

As was shown in Fig. 5-2, a threshold device accepts a set of binary inputs, e.g., TRUE or FALSE, 0 or 1, -1 or +1, and outputs a binary result. Each input is multiplied by a weight, wi, and if the sum of the weighted inputs exceeds a threshold value, T, then the output is one value; otherwise it is the other. The threshold device can be used for general computing purposes, since appropriate weight settings will cause it to behave like the AND, OR, and NOT functions mentioned in Chapter 4. However, this device (function) has special significance for classification-type (pattern recognition) computations.

Much of the work on threshold devices has dealt with the question of how different classes of objects could be recognized by automatic adjustment of the weights, wi. Suppose we have two classes A and B, and for objects in class B we want the sum of the weighted inputs, SUM = w1 f1 + w2 f2 + ..., to be greater than a threshold, T. We know intuitively that if in the training mode we obtain a SUM less than T for a test pattern in class B, then the weights corresponding to positive input terms should be increased, and those corresponding to negative input terms, along with the value of the threshold, should be decreased. (The converse action should be taken if SUM is incorrectly greater than T for test patterns in class A.)

It is possible to prove a surprisingly powerful theorem which says that if a single threshold device is capable of recognizing a class of objects, then it can learn to recognize the class by using the following weight adjustment procedure:

1. Start with any set of weights.
2. Present the device with a pattern (described by a vector of -1, +1 valued features) of class C, or not of class C.
3. If the pattern is classified correctly (SUM > T for a pattern in C; SUM <= T for a pattern not in C), then go to step 2.
4. If SUM <= T for a pattern X in C, replace each wi by (wi + fi(x)) and T by (T - 1); if SUM > T for a pattern X not in C, replace each wi by (wi - fi(x)) and T by (T + 1).
5. Go to step 2.

FIGURE 5-10 Classifying a Pattern as Belonging to One of Four Classes. (a) Threshold network. (b) Encoding of class designation with respect to the two output devices of the threshold network: class 1 --> (-1, -1); class 2 --> (-1, +1); class 3 --> (+1, -1); class 4 --> (+1, +1).
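The five steps translate almost line for line into code. The following is a minimal sketch, with our own function name and data layout, that trains a single threshold device until it completes an error-free pass over the training patterns:

    # Threshold-device weight adjustment, following the five-step
    # procedure above. Features and class labels are -1/+1 valued.

    def train_threshold(patterns, max_passes=100):
        """patterns: list of (features, label), label +1 (in C) or -1 (not in C)."""
        w = [0] * len(patterns[0][0])
        T = 0
        for _ in range(max_passes):
            changed = False
            for f, label in patterns:
                s = sum(wi * fi for wi, fi in zip(w, f))
                if label == +1 and s <= T:      # step 4, first case
                    w = [wi + fi for wi, fi in zip(w, f)]
                    T -= 1
                    changed = True
                elif label == -1 and s > T:     # step 4, second case
                    w = [wi - fi for wi, fi in zip(w, f)]
                    T += 1
                    changed = True
            if not changed:                     # a clean pass: all classified
                return w, T
        return w, T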

After a finite number of iterations, in which correctly labeled members of the input population are used in the weight adjustment procedure, the device will correctly recognize these patterns. The order in which the input patterns are presented is not important; it is only required that the number of appearances of each pattern is proportional to the length of the training sequence (i.e., to the total number of patterns actually presented). Note that the theorem only assures us that members in the training sequences will be correctly identified. Correct classification of new patterns will occur only if the training examples are truly representative of class C and its complement, and the features are well chosen. For example, if a feature such as color does not distinguish one class from another, then measurements of this feature cannot contribute to the classification process.

TABLE 5-1 • Outputs of Three Feature Detectors for a Set of Patterns

Feature detectors: FD1 counts enclosures (< 2 enclosures --> -1, >= 2 --> +1); FD2 counts verticals (< 2 verticals --> -1, >= 2 --> +1); FD3 counts horizontals (< 3 horizontals --> -1, >= 3 --> +1).

Input Pattern   Character   FD1   FD2   FD3
1               A           -1    +1    -1
2               B           +1    -1    +1
3               B           +1    +1    -1
4               A           -1    -1    -1
5               B           +1    -1    -1
6               A           -1    +1    +1
7               B           -1    -1    +1
8               B           +1    +1    +1


Although we have discussed classification using two classes, the approach can be extended to many classes (see Fig. 5-10). The figure shows how the binary output of two classifiers can be used to classify a pattern into one of four classes.
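A sketch of that two-device decoding, assuming two weight/threshold pairs trained with the procedure above (the decoding table mirrors Fig. 5-10(b); names and data layout are ours):

    # Two threshold devices jointly name one of four classes: their
    # (-1/+1) outputs act as a two-bit code, as in Fig. 5-10(b).

    def device_output(w, T, f):
        return +1 if sum(wi * fi for wi, fi in zip(w, f)) > T else -1

    CODE_TO_CLASS = {(-1, -1): 1, (-1, +1): 2, (+1, -1): 3, (+1, +1): 4}

    def classify(device1, device2, f):
        (w1, T1), (w2, T2) = device1, device2
        return CODE_TO_CLASS[(device_output(w1, T1, f),
                              device_output(w2, T2, f))]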

An example of threshold device parameter learning is presented in Tables 5-1 and 5-2. The problem is to tell whether a set of measurements derived from a pattern on a retina depicts the presence of the character A or the character B. The threshold device outputs -1 to indicate the character A and +1 to indicate the character B. Three feature detectors make measurements on the pattern values in the retina to determine three different characteristics or features of the pattern.

TABLE 5-2 • Learning to Recognize a Set of Patterns by Adjustment of Weights in a Threshold Network
(The training sequence is the set of input patterns 1-7 of Table 5-1)

Input                                 Weighted        New   New   New
Pattern   Class   FD1   FD2   FD3     Sum        T    w1    w2    w3
-         -       -     -     -       -          0    0     0     0
1         -1      -1    +1    -1      0          0    0     0     0
2         +1      +1    -1    +1      0         -1    +1    -1    +1   <-- change
3         +1      +1    +1    -1     -1         -2    +2     0     0   <-- change
4         -1      -1    -1    -1     -2         -2    +2     0     0
5         +1      +1    -1    -1     +2         -2    +2     0     0
6         -1      -1    +1    +1     -2         -2    +2     0     0
7         +1      -1    -1    +1     -2         -3    +1    -1    +1   <-- change
1         -1      -1    +1    -1     -3         -3    +1    -1    +1
2         +1      +1    -1    +1     +3         -3    +1    -1    +1
3         +1      +1    +1    -1     -1         -3    +1    -1    +1
4         -1      -1    -1    -1     -1         -2    +2     0    +2   <-- change
5         +1      +1    -1    -1      0         -2    +2     0    +2
6         -1      -1    +1    +1      0         -1    +3    -1    +1   <-- change
7         +1      -1    -1    +1     -1         -2    +2    -2    +2   <-- change
1         -1      -1    +1    -1     -6         -2    +2    -2    +2
2         +1      +1    -1    +1     +6         -2    +2    -2    +2
3         +1      +1    +1    -1     -2         -3    +3    -1    +1   <-- change
4         -1      -1    -1    -1     -3         -3    +3    -1    +1
5         +1      +1    -1    -1     +3         -3    +3    -1    +1
6         -1      -1    +1    +1     -3         -3    +3    -1    +1
7         +1      -1    -1    +1     -1         -3    +3    -1    +1
8         +1      +1    +1    +1     +3         -3    +3    -1    +1   <-- this pattern, not in the training set, is correctly identified by the "learned" set of weights


Table 5-1 shows the responses of the three feature detectors for a set of eight patterns. Note that the output of a feature detector is +1 or -1.

Table 5-2 shows the sequence of trial weights obtained using the weight adjustment procedure described previously. The three weights are initially zero, and are modified whenever the input pattern is misclassified. For example, input pattern 2 has a weighted sum that does not exceed the threshold. This is incorrect, since pattern 2 is in the +1 class. Therefore the weight modification procedure is used.

The procedure continues until all of the patterns are correctly classified. Once the correct set of weights has been obtained, the device can classify a pattern that it was not trained on, as shown in the response to pattern 8.
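As a check, feeding patterns 1-7 of Table 5-1 to the train_threshold sketch given earlier in this appendix reproduces the final row of Table 5-2, assuming the same cyclic presentation order:

    # Patterns 1-7 of Table 5-1: (features, class), with A = -1 and B = +1.
    training = [
        ([-1, +1, -1], -1),  # 1: A
        ([+1, -1, +1], +1),  # 2: B
        ([+1, +1, -1], +1),  # 3: B
        ([-1, -1, -1], -1),  # 4: A
        ([+1, -1, -1], +1),  # 5: B
        ([-1, +1, +1], -1),  # 6: A
        ([-1, -1, +1], +1),  # 7: B
    ]

    w, T = train_threshold(training)
    print(w, T)   # expected: [3, -1, 1] and -3, as in the last rows of Table 5-2

    # Pattern 8 (B: +1 +1 +1) was never trained on, yet is classified correctly:
    print(sum(wi * fi for wi, fi in zip(w, [+1, +1, +1])) > T)   # True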