
COM1070: Introduction to Artificial Intelligence: week 9

Yorick Wilks
Computer Science Department
University of Sheffield
www.dcs.shef.ac.uk/~yorick

Summary: What are Neural Nets?

Important characteristics:

- Large number of very simple neuron-like processing elements.
- Large number of weighted connections between these elements.
- Highly parallel.
- Graceful degradation and fault tolerance.

Key concepts

- Multi-layer perceptron.
- Backpropagation, and supervised learning.
- Generalisation: nets trained on one set of data, then tested on a previously unseen set of data. The percentage of the unseen set they get right shows their ability to generalise (sketched below).
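A minimal sketch of what measuring generalisation involves; the net and the predict function here are hypothetical placeholders, not anything from the lecture:

```python
def generalisation_score(net, unseen_pairs, predict):
    """Percentage of previously unseen (input, target) pairs the net
    gets right; `net` and `predict` stand for whatever trained model
    and prediction function are being evaluated."""
    correct = sum(1 for x, target in unseen_pairs if predict(net, x) == target)
    return 100.0 * correct / len(unseen_pairs)
```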

What does ‘brain-style computing’ mean?

Rough resemblance between units and weights in an Artificial Neural Network (ANN) and neurons in the brain and the connections between them.

- Individual units in a net are like real neurons.
- Learning in the brain is similar to modifying connection strengths.
- Nets and neurons operate in a parallel fashion.
- ANNs store information in a distributed manner, as do brains.
- ANNs and the brain degrade gracefully.
- BUT these structures still model logic gates as well, and are not a different kind of non-von Neumann machine.

BUT

The Artificial Neural Net account is simplified. Several aspects of ANNs don't occur in real brains; similarly, the brain contains many different kinds of neurons, with different cells in different regions.

e.g. it is not clear that backpropagation has any biological plausibility, and training with backpropagation needs enormous numbers of cycles.

Often what is modelled is not the kind of process that is likely to occur at the neuron level.

For example, if modelling our knowledge of kinship relationships, it is unlikely that we have individual neurons corresponding to 'Aunt' etc.

Edelman (1987) suggests that it may take units 'in the order of several thousand neurons to encode stimulus categories of significance to animals'.

Better to talk of neurally inspired or brain-style computation.

Remember too that (as with 'Aunt') even the best systems have nodes pre-coded with artificial notions like phonemes (corresponding to the phonetic alphabet). These cannot be precoded in the brain (as they are in Sejnowski's NETtalk) but must themselves be learned.

Getting closer to real intelligence?

Idea that intelligence is adaptive behaviour, i.e. an organism that can learn about its environment is intelligent.

Can contrast this with the approach that assumes that something like playing chess is an example of intelligent behaviour.

Connectionism is still in its infancy: still not impressive compared to ants, earthworms or cockroaches.

But arguably still closer to the computation that does occur in the brain than is the case in standard symbolic AI. Though remember McCarthy's definition of AI as common-sense reasoning (esp. of a prelinguistic child).

And might still be a better approach than the symbolic one.

Like the analogy of climbing a tree to reach the moon: may be able to perform certain tasks in symbolic AI, but may never be able to achieve real intelligence.

Ditto with connectionism/ANNs: both sides use this argument.

Past-tense learning model

References:

Rumelhart, D.E. and McClelland, J.L. (1986) On learning the past tenses of English verbs. Chapter 18 in McClelland, J.L., Rumelhart, D.E. and the PDP Research Group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 2: Psychological and Biological Models. Cambridge, MA: MIT Press/Bradford Books.

Bechtel, W. and Abrahamsen, A. (1991) Two simulations of higher cognitive processes. Chapter 6 in Connectionism and the Mind: An Introduction to Parallel Processing in Networks. Oxford: Basil Blackwell.

Past-tense model

A model of human ability to learn past-tenses of verbs.

Presented by Rumelhart and McClelland (1986) in their ‘PDP volumes’:

Main impact of these volumes: they introduced and popularised the idea of the multi-layer perceptron, trained by means of backpropagation.

Children learning to speak:

Baby: DaDa

Toddler: Daddy

Very young child: Daddy home!!!!

Slightly older child: Daddy came home!

Older child: Daddy comed home!

Even older child: Daddy came home!

Stages of acquisition in children:

Stage 1: past tense of a few specific verbs, some regular, e.g. looked, needed; most irregular: came, got, went, took, gave. As if learned by rote (memorised).

Stage 2: evidence of a general rule for the past tense, i.e. add -ed to the stem of the verb. Often overgeneralise irregulars, e.g. camed or comed instead of came. Also (Berko, 1958) can generate the past tense for an invented word: e.g. if they use rick to describe an action, they will tend to say ricked when using the word in the past tense.

Stage 3: produce correct forms for both regular and irregular verbs.

Table: Characteristics of 3 stages of past-tense acquisition

Verb type      Stage 1    Stage 2        Stage 3
Early verbs    Correct    Regularised    Correct
Regular        -          Correct        Correct
Irregular      -          Regularised    Correct
Novel          -          Regularised    Regularised

(Regular, irregular and novel verbs are not yet produced in Stage 1.)

U-shaped curve – correct past-tense form used for verbs in Stage 1, errors in Stage 2 (overgeneralising rule), few errors in Stage 3.

Suggests Stage 2 children have acquired rule, and Stage 3 children have acquired exceptions to rule.

Aim of Rumelhart and McClelland: to show that a connectionist network could show many of the same learning phenomena as children, i.e. the same stages and the same error patterns.

Overview of past-tense NN model

Not a full-blown language processor that learns past-tenses from full sentences heard in everyday experience.

Simplified: the model is presented with pairs corresponding to the root form of a word and the phonological structure of the correct past-tense version of that word.

Can test model by presenting root form of word, and looking at past-tense form it generates.

More detailed account

Input and Output Representation

To capture order information, the Wickelfeature method of encoding words was used.

460 inputs:

- Wickelphones represent a target phoneme and its immediate context, e.g. came (/kAm/): #kA, kAm, Am#.
- These are coarse-coded onto Wickelfeatures, where 16 Wickelfeatures correspond to each Wickelphone.
- Input and output of the net each consist of 460 units.
- Inputs are the 'standard present' forms of verbs, outputs are the corresponding past forms, regular or irregular, and all are in this special Wickelfeature format.

This is a good example of the need to find a good way of representing the input: you can't just present words to a net; you have to find a way of encoding those words so they can be presented as a set of inputs.
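A minimal sketch of the Wickelphone part of this encoding, using plain letters in place of the phonemic symbols and omitting the coarse-coding onto the 460 Wickelfeatures:

```python
def wickelphones(word):
    """Return each symbol with its immediate left and right context,
    using '#' to mark the word boundaries."""
    padded = "#" + word + "#"
    return [padded[i - 1:i + 2] for i in range(1, len(padded) - 1)]

print(wickelphones("kAm"))   # ['#kA', 'kAm', 'Am#']  (the root 'came')
```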

Assessing output: compare the pattern of output Wickelphone activations to the pattern that the correct response would have generated.

Hits: a 1 in the output where there is a 1 in the target, and a 0 in the output where there is a 0 in the target.

False alarms: 1s in the output that are not in the target.

Misses: 0s in the output where the target has a 1.
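As a sketch, the three counts above can be computed directly from binary output and target vectors (the function name is mine, not Rumelhart and McClelland's):

```python
def score_output(output, target):
    """Count hits, false alarms and misses for 0/1 vectors, as defined above."""
    hits = sum(1 for o, t in zip(output, target) if o == t)              # 1/1 or 0/0
    false_alarms = sum(1 for o, t in zip(output, target) if o == 1 and t == 0)
    misses = sum(1 for o, t in zip(output, target) if o == 0 and t == 1)
    return hits, false_alarms, misses

print(score_output([1, 0, 1, 0], [1, 1, 0, 0]))   # (2, 1, 1)
```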

Training and Testing

The verb is input, and activation is propagated across the weighted connections; this will activate Wickelfeatures in the output that correspond to the past tense of the verb.

Used the perceptron convergence procedure to train the net.

(NB not a multi-layer perceptron: no hidden layer, and not trained with backpropagation. The problem must therefore be linearly separable.)

The target tells each output unit what value it should have. When the actual output matches the target, no weights are adjusted. When the computed output is 0 and the target is 1, we need to increase the probability that the unit will be active the next time that pattern is presented: all weights from all active input units are increased by a small amount eta, and the threshold is also reduced by eta.

When the computed output is 1 and the target is 0, we want to reduce the likelihood of this happening: all weights from active units are reduced by eta, and the threshold is increased by eta.

The perceptron convergence procedure will find a set of weights that allows the model to get each output unit correct, provided such a set of weights exists. A sketch of the rule follows.
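A minimal sketch of this update rule for a single output unit, assuming 0/1 inputs and targets (a toy illustration, not Rumelhart and McClelland's code):

```python
def train_unit(patterns, eta=0.1, epochs=100):
    """Perceptron convergence procedure for one output unit.
    `patterns` is a list of (inputs, target) pairs, inputs a list of 0/1."""
    n_inputs = len(patterns[0][0])
    weights = [0.0] * n_inputs
    threshold = 0.0
    for _ in range(epochs):
        for inputs, target in patterns:
            net_input = sum(w * x for w, x in zip(weights, inputs))
            output = 1 if net_input > threshold else 0
            if output == target:
                continue                      # correct: no weights adjusted
            if target == 1:                   # computed 0, should be 1:
                weights = [w + eta * x for w, x in zip(weights, inputs)]
                threshold -= eta              # raise active weights, lower threshold
            else:                             # computed 1, should be 0:
                weights = [w - eta * x for w, x in zip(weights, inputs)]
                threshold += eta              # lower active weights, raise threshold
    return weights, threshold
```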

Before training: Divided 560 verbs into high frequency (regular and irregular), medium (regular and irregular) and low frequency (regular and irregular).

1. Train on 10 high frequency verbs (8 irregular)

Live – lived

Look – looked

Come – came

Get – got

Give – gave

Make – made

Take – took

Go – went

Have – had

Feel – felt

2. After 10 epochs, 410 medium frequency verbs added (76 irregular)

190 more epochs (training cycles)

The net showed a dip in performance on irregular verbs, which is like Stage 2 in children.

And when the net made errors, these errors were like children's, i.e. adding 'ed'.

e.g. for come – comed

3. Tested 86 low frequency verbs it had not been trained on.

It got 92% of the regular verbs right, and 84% of the irregulars.
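The three-phase regime just described, sketched as a schedule; the callables and verb lists here are hypothetical placeholders for whatever training and testing routines are used:

```python
def training_schedule(net, train_epoch, test, high_freq, medium_freq, low_freq):
    """Phase 1: 10 epochs on the 10 high-frequency verbs.
    Phase 2: 190 more epochs with the 410 medium-frequency verbs added.
    Phase 3: test on the 86 low-frequency verbs never trained on."""
    for _ in range(10):
        train_epoch(net, high_freq)
    for _ in range(190):
        train_epoch(net, high_freq + medium_freq)
    return test(net, low_freq)
```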

Results

With a simple network, and no explicit encoding of rules, it could simulate important characteristics of human children learning the English past tense. The same U-shaped curve was produced for irregular words.

Main point: past-tense forms can be described using a few general rules, but can be accounted for by connectionist net which has no explicit rules.

Both regular and irregular words handled by the same mechanism.

Objectives:

To show that past-tense formation could be carried out by a net, rather than by a rule system.

To capture U-shaped function.

Rule-system

Linguists: stress importance of rules in describing human behaviour.

We know the rules of language, in that we are able to speak grammatically, or even to make judgements of whether a sentence is or is not grammatical.

But this does not mean we know the rules in the way we know the rule 'i before e except after c': we may not be able to state them explicitly.

But it has been held (e.g. Pinker, 1984, following Chomsky) that our knowledge of language is stored explicitly as rules, only we cannot describe them verbally because they are written in a special code only the language processing system can understand:

Explicit inaccessible rule view

Alternative view: no explicit inaccessible rules. Our performance is characterisable by rules, but they are emergent from the system, and are not explicitly represented anywhere.

e.g. honeycomb: structure could be described by a rule, but this rule is not explicitly coded. Regular structure of honeycomb arises from interaction of forces that wax balls exert on each other when compressed.

Parallel distributed processing view: no explicit (albeit inaccessible) rules.

Advantages of using NNs to model aspects of human behaviour.

- Neurally plausible, or at least 'brain-style computing'.
- Learned: not explicitly programmed.
- No explicit rules; permits a new explanation of the phenomenon.
- The model both produces the behaviour and fits the data: errors emerge naturally from the operation of the model.

Contrast to symbolic models in all 4 respects (above)

Rumelhart and McClelland:

'…lawful behaviour and judgements may be produced by a mechanism in which there is no explicit representation of the rule. Instead, we suggest that the mechanisms that process language and make judgements of grammaticality are constructed in such a way that their performance is characterizable by rules, but that the rules themselves are not written in explicit form anywhere in the mechanism.'

Important counter-argument to linguists, who tend to think that people are applying syntactic rules.

Point: we can have syntactic rules that describe language, but that doesn't mean that when we speak syntactically (as if we were following those rules) we literally are following rules.

Many philosophers have made a similar point against the reality of explicit rules --e.g. Wittgenstein.

The ANN approach provides a computational model of how that might be possible in practice: to have the same behavioural effect as rules without there being any anywhere in the system.

On the other hand, the standard model of science is of many possible rule systems describing the same phenomenon; that also allows that the real rules (in a brain) could be quite different from the ones we invent to describe the phenomenon.

Some computer scientists (e.g. Charniak) refuse to accept incomprehensible explanations.

Specific criticisms of the model:

Criticism 1

Performance of the model depends on the use of the Wickelfeature representation, and this is an adaptation of standard linguistic featural analysis, i.e. it relies on a symbolic input representation (cf. phonemes in NETtalk).

i.e. what is the contribution of the architecture itself?

Criticism 2

Pinker and Prince (1988): role of input and U-shaped curve.

The model's entry to Stage 2 was due to the addition of the 410 medium-frequency verbs.

This change is more abrupt than is the case with children; there may be no relation between this method of partitioning the training data and what happens to children.

But later research (Plunkett and Marchman, 1989) showed that U-shaped curves can be achieved without abrupt changes in the input, training on all examples together (using a backpropagation net).

They presented more irregular verbs, but still found regularisation and other Stage 2 phenomena for certain verbs.

Criticism 3

Nets are not simply exposed to data so that we can then examine what they learn.

They are programmed in a sense: decisions have to be made about several things, as illustrated in the sketch below, including:

- the training algorithm to be used
- the number of hidden units
- how to represent the task in question
- the input and output representation
- the training examples, and the manner of their presentation
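A hypothetical illustration of how these choices end up as explicit parameters; none of these names or default values come from the model itself:

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentConfig:
    """The 'programming' hidden inside training a net (illustrative values)."""
    training_algorithm: str = "perceptron convergence"  # vs. backpropagation
    hidden_units: int = 0                               # 0 = no hidden layer
    task_encoding: str = "Wickelfeatures"               # input/output representation
    learning_rate: float = 0.1                          # eta
    schedule: list = field(default_factory=lambda: [(10, "high"), (190, "high+medium")])
```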

Criticism 4

At some point after or during learning this kind of thing, humans become able to articulate the rule.

e.g. regular past tenses end in -ed.

Also can control and alter these rules: e.g. a child could pretend to be a younger child and say 'runned' even though she knows it is incorrect (cf. some use learned and some learnt; lit is UK and lighted US).

It is hard to see how this kind of behaviour would emerge from a set of interconnected neurons.

Conclusions

Although the Past-tense model can be criticised, it is best to evaluate it in the context of the time (1986) when it was first presented.

At the time, it provided a tangible demonstration that it is:

- possible to use a neural net to model an aspect of human learning
- possible to capture apparently rule-governed behaviour in a neural net

Contrasting Neural Computing with Symbolic Artificial Intelligence

Overview of main differences between them.

Relationship to the brain:

(a) Similarities between Neural Computing and the brain

(b) Differences between brain and Symbolic AI – evidence that brain does not have a von Neumann architecture.

Ability to provide an account of thought and cognition

(a) Argument by symbolicists that only symbol system can provide an account of cognition

(b) Counter-argument that neural computing (subsymbolic) can also provide an account of cognition

(c) Hybrid account?

Main differences between Connectionism and Symbolic AI

Knowledge: knowledge represented by weights and activations versus explicit propositions.

Rules: rule-like behaviour without explicit rules versus explicit rules.

Learning: connectionist nets are trained, versus programmed. But there are now many machine learning algorithms that are wholly symbolic; both kinds only work in a specialised domain.

Examinability: can examine a symbolic program to 'see how it works'. Less easy in the case of Neural Computing: problems with its black-box nature; the set of weights is opaque.

Relationship to the brain: Brain-style computing versus manipulation of symbols. Different models of human abilities.

Ability to provide an account of human thought: see following discussion about need for symbol system to account for thought.

Applicability to problems: Neural computing more suited to pattern recognition problems, Symbolic computing to systems characterisable by rules.

But for a different view, that stresses similarities between GOFAI and NN approaches see Boden, M. (1991) Horses of a different colour, In Ramsey, W., Stich, S.P. and D.E. Rumelhart, ‘Philosophy and Connectionist Theory’, Lawrence Erlbaum Associates: Hillsdale, New Jersey, pp 3-19, where she points out some of the similarities. See also YW in Foundations of AI book, on web course list.

Fashions: historical tendency to model brain on fashionable technology.

mid 17th century: water clocks and hydraulic puppets popular

Descartes developed hydraulic theory of brain

Early 18th century: Leibniz likened the brain to a factory.

Freud: relied on electromagnetics and hydraulics in descriptions of mind.

Sherrington: likened nervous system to telegraph.

Brain also modelled as telephone switchboard.

Might use computer to model human brain; but is human brain itself a computer?

Differences between Brains and von Neumann machines

McCulloch and Pitts: simplified account of neurons as On/Off switch.

In early days seemed that neurons were like flip-flops in computers.

Flip-flops can be thought of as tiny switches that can be either off or on. But it is now clear that there are differences:

- the rate of firing of a neuron is important, as well as the on/off feature
- a neuron has an enormous number of input and output connections, compared to logic gates
- speed: a neuron is much slower. It takes a thousandth of a second to respond, whereas a flip-flop can shift from 0 to 1 in a thousand-millionth of a second, i.e. the brain takes a million times longer

Thus if the brain were running an AI program, stepping through instructions, it would take at least a thousandth of a second for each instruction.

The brain can extract meaning from a sentence, or recognise a visual pattern, in about a tenth of a second.

So, if this is being accomplished by stepping through a program, the program can only be about 100 instructions long (0.1 s / 0.001 s per instruction = 100 instructions).

But current AI programs contain thousands of instructions!

Suggests brain operates in parallel, rather than as sequential processor.

Symbol manipulators: most (NOT ALL) are sequential – carrying out instructions in sequence.

- human memories are content-addressable: access to a memory is via its content.

E.g. can retrieve a memory via a description (e.g. could refer to the Turing Test either as 'Turing Test' or as 'assessment of intelligence based on a Victorian parlour game', and would still access the memory).

But a memory in a computer has a unique address: cannot get at a memory without knowing its address (at the bottom level, that is!).
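A toy contrast between the two access styles (data and names invented for illustration):

```python
# Address-style access: you must know the location.
memory_cells = {0x2A: "Turing Test"}
print(memory_cells[0x2A])

# Content-style access: retrieve by (partial) description.
memories = [
    "Turing Test: assessment of intelligence based on a Victorian parlour game",
    "Lashley's rats: maze learning survived destruction of brain areas",
]

def recall(cue):
    """Return every stored memory whose content matches the cue."""
    return [m for m in memories if cue.lower() in m.lower()]

print(recall("parlour game"))   # finds the Turing Test memory by content
```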

- memory distribution: in a computer, a string of symbol tokens exists at a specific physical location in the hardware. But our memories do not seem to function like that. E.g. Lashley and the search for the engram:

Lashley trained rats to learn a route through a maze to food, then destroyed different areas of the brain. As long as only 10 percent was destroyed, there was no loss of memory, regardless of which area of the brain was destroyed.

Lashley (1950) ‘…There are no special cells reserved for special memories… The same neurons which retain memory traces of one experience must also participate in countless other activities…’

And conversely a single memory must be stored in many places across a brain; there was a brief fashion for 'the brain as a hologram' because of the way a hologram stores information.