Information-theoretic approaches to syntactic processing John Hale


Page 1:

Information-theoretic approaches to syntactic processing

John Hale

Page 2:

Thank you:

Zhong Chen

Tim Hunter

Jiwon Yun

Page 3:

sentence comprehension

easy ↔ hard

‣ reading times
‣ error scores
‣ eye fixations
‣ scalp potentials

Page 4:

Q. when is comprehension more (vs less) difficult?

A. where more (vs less) information is conveyed

leading idea

Page 5:

choice of continuation informs the hearer

the boy eats...

the boy eats shy...
the boy eats using...
the boy eats like...
the boy eats the...
the boy eats his...
the boy eats at...
the boy eats of...
the boy eats went...

Page 6:

conditional entropy

probability (let's pretend)

the boy eats shy people for breakfast       1.0 × 10−25
the boy eats using chopsticks on Tuesday    1.0 × 10−7
the boy eats like a hippopotamus            1.0 × 10−6
the boy eats the dog with a spoon           0.0001
the boy eats his sister's bicycle           0.0005
the boy eats at Denny's frequently          0.00001
the boy eats of the forbidden fruit         1.0 × 10−66
the boy eats went for a walk                0.0

avg uncertainty of this distribution
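
A concrete illustration (my own sketch, not part of the original slides): a minimal Python snippet that computes the average uncertainty, in bits, of the pretend distribution above. The probabilities are the slide's illustrative values; since they do not sum to one, the sketch renormalizes them first. Variable names are mine.

```python
import math

# The slide's "let's pretend" continuation probabilities (illustrative only).
pretend = {
    "the boy eats shy people for breakfast": 1.0e-25,
    "the boy eats using chopsticks on Tuesday": 1.0e-7,
    "the boy eats like a hippopotamus": 1.0e-6,
    "the boy eats the dog with a spoon": 0.0001,
    "the boy eats his sister's bicycle": 0.0005,
    "the boy eats at Denny's frequently": 0.00001,
    "the boy eats of the forbidden fruit": 1.0e-66,
    "the boy eats went for a walk": 0.0,
}

total = sum(pretend.values())
probs = [v / total for v in pretend.values() if v > 0.0]  # p = 0 terms contribute nothing

# Average uncertainty: H = -sum p log2 p, in bits
entropy = -sum(p * math.log2(p) for p in probs)
print(f"H(continuation | 'the boy eats') = {entropy:.3f} bits")
```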

Page 7:

H(Derivation | Prefix = “the boy eats”)

fluctuation

H(Derivation | Prefix = “the boy eats his”)

any downward change quantifies information gained from “his”

Page 8:

entropy reduction hypothesis

observed processing effort reflects decreases in Hi

where Hi abbreviates H(Derivation|Prefix = w0···i)

Page 9:

outline

Entropy reduction studies: relative clauses in English and Korean

How does it work? computing ↓Hi

Why does it work? reflections on information theory & linguistics

Page 10:

garden path sentences

Bever 70

the horse raced past the barn fell

Page 11:

1.00 S → NP VP
0.88 NP → DT NN
0.12 NP → NP RRC
1.00 PP → IN NP
1.00 RRC → Vppart PP
0.50 VP → Vpast
0.50 VP → Vppart PP
1.00 DT → the
0.50 NN → horse
0.50 NN → barn
0.50 Vppart → groomed
0.50 Vppart → raced
0.50 Vpast → raced
0.50 Vpast → fell
1.00 IN → past

naive probabilistic grammar

Hale JPR 03

Page 12:

the

[Four partial parses consistent with the prefix "the": the subject NP with or without a reduced relative clause (RRC), and the matrix VP headed by fell or raced.]

entropy: 4.65 bits

Page 13:

the horse

[The same four partial parses, now with horse as the head noun.]

entropy: 3.65 bits   ↓H = 1

Page 14:

the horse raced

[Partial parses consistent with the prefix "the horse raced": raced analyzed either as the matrix verb or as a reduced-relative participle, in each case followed by a PP, with the matrix VP (fell or raced) still open on the reduced-relative analyses.]

entropy: 5.2 bits   ↓H = none
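
The ↓H values in this walkthrough can be read directly off the successive prefix entropies. Here is a minimal sketch (an illustration of my own, not the talk's code) that treats an increase in entropy as zero reduction, matching the "↓H = none" step above; the numbers are the ones from the preceding slides.

```python
def entropy_reductions(prefix_entropies):
    """Per-word entropy reduction: the drop, if any, from H_{i-1} to H_i.
    Increases in entropy count as zero reduction."""
    return [max(0.0, h_prev - h_next)
            for h_prev, h_next in zip(prefix_entropies, prefix_entropies[1:])]

# Conditional entropies after "the", "the horse", "the horse raced"
print(entropy_reductions([4.65, 3.65, 5.2]))  # approximately [1.0, 0.0]
```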

Page 15:

the horse raced past the barn fell

[Bar chart: bits reduced at each word ("Garden-pathing"); the last word gives 4 bits; total: 6.2 bits.]

Page 16:

wide coverage dependency parser

Hall & Hale AMLaP 07

Page 17:

0.20 NP → SPECNP NBAR
0.40 NP → I
0.40 NP → John
1.00 SPECNP → DT
0.50 NBAR → NBAR S[+R]
0.50 NBAR → N
1.00 S → NP VP
0.87 S[+R] → NP[+R] VP
0.13 S[+R] → NP[+R] S/NP
1.00 S/NP → NP VP/NP
0.50 VP/NP → V[SUBCAT2] NP/NP
0.50 VP/NP → V[SUBCAT3] NP/NP PP[to]
0.33 VP → V[SUBCAT2] NP
0.33 VP → V[SUBCAT3] NP PP[to]
0.33 VP → V[SUBCAT4] PP[for]
0.33 V[SUBCAT2] → met
0.33 V[SUBCAT2] → attacked
0.33 V[SUBCAT2] → disliked
1.00 V[SUBCAT3] → sent
1.00 V[SUBCAT4] → hoped

1.00 PP[to] → PBAR[to] NP
1.00 PBAR[to] → P[to]
1.00 P[to] → to
1.00 PP[for] → PBAR[for] NP
1.00 PBAR[for] → P[for]
1.00 P[for] → for
1.00 NP[+R] → who
0.50 DT → the
0.50 DT → a
0.17 N → editor
0.17 N → senator
0.17 N → reporter
0.17 N → photographer
0.17 N → story
0.17 N → ADJ N
1.00 ADJ → good
1.00 NP/NP → ε

GPSG-style fragment

Page 18:

center embedding

21 bits   the reporter disliked the editor
39 bits   the reporter [ who the senator attacked ] disliked the editor
48 bits   the reporter [ who the senator [ who John met ] attacked ] disliked the editor

and yet

24 bits   John met the senator [ who attacked the reporter [ who disliked the editor ] ]

Page 19:

subject vs object-extracted RC

the reporter who ∅ sent the photographer to the editor hoped for a good story

the reporter who the photographer sent ∅ to the editor hoped for a good story

selectively slower

Grodner and Gibson CogSci 05, among others

Page 20:

bits ↔ reading time

RT(wi) = α (↓ Hi) + β

Page 21:

subject-extracted

the reporter who sent the photographer to the editor hoped for a good story

[Plot: observed vs. predicted reading times by region (msec scale roughly 360–460); panel title "Entropy reduction -- subject relative".]

Page 22:

object-extracted

the reporter who the photographer sent to the editor hoped for a good story

[Plot: observed vs. predicted reading times by region (msec scale roughly 360–480); panel title "Entropy reduction -- object relative".]

ERH: more work at embedded V

Page 23:

bits ↔ reading time

RT(wi) = α (↓ Hi) + β

α = 7.38, β = 377, r² = 0.49, p < 0.01
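
A minimal sketch (illustrative only, not the talk's own code) of this linking function, plugging in the fitted values reported on the slide; the function name is my own.

```python
def predicted_rt(entropy_reduction_bits, alpha=7.38, beta=377.0):
    """Predicted per-word reading time (ms) as a linear function of the
    entropy reduction (in bits) triggered by that word."""
    return alpha * entropy_reduction_bits + beta

# e.g. a word that reduces derivation entropy by 4 bits
print(predicted_rt(4.0))  # 7.38 * 4 + 377 = 406.52 ms
```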

Page 24:

Many types of RCs

indirect object

the man who Stephen explained the accident to ∅ is kind

oblique

the girl who Sue wrote the story with ∅ is proud

genitive subject

the boy whose brother ∅ tells lies is always honest

genitive object

the sailor whose ship Jim took ∅ had one leg

Page 25:

[Minimalist Grammar derivation tree for "the man who Stephen explained the accident to ∅ is kind", built from feature-annotated lexical items including the, man, who, Stephen, explain, -ed, the, accident, to, be, -s, and kind.]

minimalist grammars à la Stabler 97

Page 26:

predicted work vs human accuracy

[Scatterplot: error score (roughly 250–500) vs. bits reduced per sentence (roughly 30–55) for RC types along the Accessibility Hierarchy (SU, DO, IO, OBL, GEN); r² = 0.45, p < 0.001.]

Keenan & S. Hawkins 87, Hale 06

Page 27:

Korean Subj-RC advantage

[Plot: word-by-word reading times (ms, roughly 500–2000) for SRC vs. ORC across regions W1–W10, including Verb.ADN, Noun, and Verb.DECL; word-by-word reading time observation (Kwon 2008).]

Page 28:

Dependency width doesn’t derive it

SRC: [RC ∅ Object Verb ] HeadNoun
ORC: [RC Subject ∅ Verb ] HeadNoun

Page 29:

ERH+MG does derive the SRC advantage

[Plot: word-by-word comprehension difficulty predictions (0–25 bits) for SRC vs. ORC across the regions Noun, NOM/ACC, Verb.ADN, Noun, NOM, Verb.DECL.]

Page 30:

[Minimalist Grammar derivation tree for the Korean subject relative clause uywon-ul kongkyekhan kica-ka yumyenghaycyessta.]

“The reporter who attacked the senator became famous”

Yun, Whitman & Hale 10

Page 31:

One of four clause types

[Figure 5: Continuations signal clause-types. Lattice over word positions: N (0–1), NOM/ACC (1–2), then V-DECL (2–3), V-ADN (2–3), or V-ADV (2–3), with V-ADN followed by N (3–4) or N 'fact' (3–4); the paths correspond to matrix, adjunct, relative, and complement clauses.]

[Figure 6, panels (a) Matrix Clause, (b) Adjunct Clause, (c) Complement Clause, (d) Relative Clause: word-by-word comprehension difficulty predictions derived by the information-theoretical Entropy Reduction Hypothesis. Horizontal axis labels name word classes; SBJ abbreviates "subject-extracted", OBJ "object-extracted". Clause-types (a)–(d) are as in Figure 3.]

Page 32:

Novel prediction: selectively slower

Page 33:

Confirmed experimentally

Kwon, Yun et al. CUNY 2011

NCC NPs with subject pro (W1–W13):
Last month [pro = he_i] editor-ACC bribe taking suspicion-with threaten-ADN fact-NOM was.revealed-as chancellor_i-TOP immediately press.conference-ACC held.
'The chancellor_i immediately held a press conference as the fact that he_i threatened the editor for taking a bribe last month was revealed.'

NCC NPs with object pro (W1–W13):
Last month editor-NOM [pro = him_i] bribe taking suspicion-with threaten-ADN fact-NOM was.revealed-as chancellor_i-TOP immediately press.conference-ACC held.
'The chancellor_i immediately held a press conference as the fact that the editor threatened him_i for taking a bribe last month was revealed.'

[Plot: self-paced reading times (roughly 300–600 ms) at W1–W13 for four conditions (no context / context × Sbj pro / Obj pro); object-“extracted” conditions slower, p < 0.007.]

Page 34:

outline

Entropy reduction studies: relative clauses in English and Korean

How does it work? computing ↓Hi

Why does it work? reflections on information theory & linguistics

Page 35:

computing Hi

[Pipeline diagram: a Minimalist Grammar (or other formalism), given as a lexicon of feature-annotated entries (the, I, met, who, boy, ...), is compiled into a weighted MCFG G; parsing the input string w = w1w2w3...wi (a prefix of a sentence in L(G)) fills a chart.]

Page 36:

computing Hi

chart's items form a graph

= a system of equations, whose solutions are (sums of) probabilities

weighted "intersection" grammar G′

H(G′) = Hi

Page 37:

H = h + MH
h = H − MH = (I − M)H
H = (I − M)⁻¹ h

h : vector of 1-step rewriting entropies
H : vector of infinite-step rewriting entropies
M : "fertility matrix" giving the expected number of j symbols birthed by the i-th symbol
I : the identity matrix, with ones down the diagonal

entropy of probabilistic grammar

Grenander 67
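
Grenander's closed form is easy to apply directly. Below is a minimal, illustrative Python sketch (my own, not the talk's code) that builds h and M for the naive garden-path PCFG from the earlier slide and solves H = (I − M)⁻¹ h; a weighted intersection grammar G′ would be handled the same way, just with its own rules and weights. The rules and probabilities are the slide's; the variable names and structure are assumptions of this sketch.

```python
import numpy as np

# Naive PCFG from the garden-path slide (Hale JPR 03): (probability, lhs, rhs),
# with terminals in lowercase.
RULES = [
    (1.00, "S",      ["NP", "VP"]),
    (0.88, "NP",     ["DT", "NN"]),
    (0.12, "NP",     ["NP", "RRC"]),
    (1.00, "PP",     ["IN", "NP"]),
    (1.00, "RRC",    ["Vppart", "PP"]),
    (0.50, "VP",     ["Vpast"]),
    (0.50, "VP",     ["Vppart", "PP"]),
    (1.00, "DT",     ["the"]),
    (0.50, "NN",     ["horse"]),
    (0.50, "NN",     ["barn"]),
    (0.50, "Vppart", ["groomed"]),
    (0.50, "Vppart", ["raced"]),
    (0.50, "Vpast",  ["raced"]),
    (0.50, "Vpast",  ["fell"]),
    (1.00, "IN",     ["past"]),
]

nonterms = sorted({lhs for _, lhs, _ in RULES})
idx = {nt: i for i, nt in enumerate(nonterms)}
n = len(nonterms)

# h[i]: 1-step rewriting entropy of nonterminal i (entropy of its rule choice).
# M[i, j]: expected number of occurrences of nonterminal j introduced by one
# rewrite of nonterminal i (the "fertility matrix").
h = np.zeros(n)
M = np.zeros((n, n))
for p, lhs, rhs in RULES:
    i = idx[lhs]
    h[i] -= p * np.log2(p)
    for sym in rhs:
        if sym in idx:              # terminals contribute nothing
            M[i, idx[sym]] += p

# H = h + MH  =>  (I - M) H = h, finite when the grammar is consistent
# (spectral radius of M below 1, as here).
H = np.linalg.solve(np.eye(n) - M, h)

for nt in nonterms:
    print(f"H({nt}) = {H[idx[nt]]:.3f} bits")
print(f"Derivation entropy of the grammar: H(S) = {H[idx['S']]:.3f} bits")
```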


Page 40:

outline

Entropy reduction studies: relative clauses in English and Korean

How does it work? computing ↓Hi

Why does it work? reflections on information theory & linguistics

Page 41:

[Image: © 1949 Scientific American, Inc.]

Page 42:

Entropy reduction in 1953

“Psycholinguistics”, Sebeok and Osgood, eds., §5.3 Applications of Entropy Measures to Problems of Sequence Structure

Page 43:

Whatever Happened to Information Theory in Psychology?

R. Duncan Luce, University of California, Irvine

Although Shannon's information theory is alive and well in a number of fields, after an initial fad in psychology during the 1950s and 1960s it no longer is much of a factor, beyond the word bit, in psychological theory. The author discusses what seems to him (and others) to be the root causes of an actual incompatibility between information theory and the psychological phenomena to which it has been applied.


“The elements of choice in information theory are absolutely neutral and lack any internal structure. That is fine for a communication engineer ... [but] by and large, however, the stimuli of psychological experiments are to some degree structured, and so, in a fundamental way, they are not in any sense interchangeable.”

Page 44:

Formal grammar and information theory: together again?

By Fernando Pereira

AT & T Laboratories Research, A247, Shannon Laboratory, 180 Park Avenue, Florham Park, NJ 07932-0971, USA ([email protected])

In the last 40 years, research on models of spoken and written language has been split between two seemingly irreconcilable traditions: formal linguistics in the Chomsky tradition, and information theory in the Shannon tradition. Zellig Harris had advocated a close alliance between grammatical and information-theoretic principles in the analysis of natural language, and early formal-language theory provided another strong link between information theory and linguistics. Nevertheless, in most research on language and computation, grammatical and information-theoretic approaches had moved far apart.

Today, after many years on the defensive, the information-theoretic approach has gained new strength and achieved practical successes in speech recognition, information retrieval, and, increasingly, in language analysis and machine translation. The exponential increase in the speed and storage capacity of computers is the proximate cause of these engineering successes, allowing the automatic estimation of the parameters of probabilistic models of language by counting occurrences of linguistic events in very large bodies of text and speech. However, I will argue that information-theoretic and computational ideas are also playing an increasing role in the scientific understanding of language, and will help bring together formal-linguistic and information-theoretic perspectives.

Keywords: formal linguistics; information theory; machine learning

1. The great divide

In the last 40 years, research on models of spoken and written language has been split between two seemingly irreconcilable points of view: formal linguistics in the Chomsky tradition, and information theory in the Shannon tradition. The famous quote of Chomsky (1957) signals the beginning of the split.

(1) Colourless green ideas sleep furiously.

(2) Furiously sleep ideas green colourless.

... It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally 'remote' from English. Yet (1), though nonsensical, is grammatical, while (2) is not.

Phil. Trans. R. Soc. Lond. A (2000) 358, 1239–1253

“Probabilities can be assigned to complex linguistic events, even novel ones, by using the causal structure of the underlying models to propagate the uncertainty in the elementary decisions.”

⇒ these models incorporate linguistic theories!

April, 2000

Page 45:

Conclusions

Information theory helps identify which RCs are hard where

the account uses substantial syntactic claims

the difference since 1953 is the grammar

Page 46:

bonus slides

Page 47:

definitions

Let G be a (probabilistic) grammar, X a random variable whose outcomes x are derivations on G, and Y a related variable whose outcomes y are initial substrings of sentences in L(G).

I(X; Y) = H(X) − H(X|Y)
        = H(X) − E[H(X|y)]

mutual information of grammar and prefix string

information conveyed by a particular prefix

I(X; y) = H(X) − H(X|y)



Page 52:

I(X; ynew) − I(X; yold)
  = [H(X) − H(X|ynew)] − [H(X) − H(X|yold)]
  = H(X|yold) − H(X|ynew)
  = ↓Hi   where yold = w1 · · · wi−1, ynew = w1 · · · wi


info conveyed by increment

Page 53:

the best heuristic tracks uncertainty about the rest of the sentence

[Scatterplot: sum of expected derivation lengths for predicted categories (0–120) vs. mean steps to finishing the parse (0–100); r = 0.4, p < 0.0001.]

Hale 2011