
Identifying Minimalist Languages from Dependency Structures

Edward Stabler

UCLA Department of Linguistics

Johns Hopkins University, April 2002

Reporting work done in part with:

Edward Keenan, Greg Kobele UCLA Linguistics

Charles Taylor, Travis Collier UCLA Organismic Biology, Ecology, and Evolution

Stabler, JHU 0-1

Abstract: The human acquisition of human languages is based on the analysis of signal and context. To study how this might work, a simplified robotic setting is described in which the problem is divided into two basic steps: an analysis of the linguistic events in context that yields dependency structures, and the identification of grammars that generate those structures. A learnability result that generalizes (Kanazawa 1994) has been obtained, showing that non-CF, and even non-TAG languages can be identified in this setting, and more realistic assessments of the learning problem are under study.

Human languages are complex, and it is a challenge to find strategies which will let us understand their components. (In the same way, the physical properties of the table that my papers are on are complex, and so it is no surprise that physicists and chemists do not directly study tables, but simpler components of tables, first.)

The strategy I will describe here is to study some things that are similar to human languages in some crucial respects, but otherwise very much simpler.

Imagine a simplified language learner who cannot speak or hear but only read and write. Imagine this learner can see and reason (but perhaps only in some quite primitive ways), and that this learner encounters expressions – character sequences – like this in certain perceptual and cognitive contexts.

language learned in context

John ’s eat -ing cookie -s

covariation with context:

choice of elements: John vs Bill, eat vs buy, cookie vs pie

configuration: cookies, John’s eating

’s John eating cookies?

Stabler, JHU 1-1

Regular covariation with contexts C provides evidence about what the expressions mean. We notice:

1. not all elements can be replaced by something else to obtain a semantically related expression (the affixes cannot be replaced by anything else!)

2. not all permutations yield expressions, and in particular, semantically related expressions

3. the semantic variations related to choice of (non-affixal) elements are unlike variations related to order

expressions: elements and configuration

John ’s eat -ing cookie -s

covariation with context:

choice of elements: John vs Bill, eat vs buy, cookie vs pie

configuration: cookies, John’s eating

’s John eating cookies?

• Two different kinds of influences on meanings.

[[choice of elements]]

[[configuration, affixes, (prosody)]]

• Only certain permutations are wffs, & semantically related.

Why?

Stabler, JHU 2-1

It’s easy to provide a particular grammar with a formal semantics such that only certain permutations are well formed, and among the well-formed ones, only some are semantically related.

But such a grammar does not explain why this would be the case in human languages.

It would be perfectly easy to design a language in which john eats means John eats, while eats John means Mary eats.

And it would be perfectly easy to design a language in which every permutation of a sentence is also a sentence, semantically related (even synonymous).

Human languages are not like that. Why not? Our idea is that the answer should derive from fundamental properties of the human learner. This presentation will describe some first steps towards this goal.

We will imagine a first step that recognizes certain dependencies among elements in the string, and a second step which recognizes the kinds of dependencies the whole language allows. This work will build on a certain tradition in formal learning theory.

learning string languages

evidence: an infinite text containing (all and only) elements of L

learner: initial segments of texts → grammars

identification: converge on G such that L(G)=L

L is learnable iff some learner can identify it from any text for L

A class of languages 𝓛 is learnable iff every L ∈ 𝓛 is

(Gold, 1967) no proper superset of the finite languages can be learned

(Angluin, 1980) a class 𝓛 is learnable iff every L ∈ 𝓛 has a finite telltale D_L ⊆ L such that there is no intermediate language L′ ∈ 𝓛 with D_L ⊆ L′ ⊂ L!

Stabler, JHU 3-1

This first simple idea about evidence, learner and success (from Gold) yields mainly negative results. But there is one very interesting strand of positive results.
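Before turning to those positive results, here is a minimal illustration of the Gold model itself – a sketch of my own, not part of the talk, and the names are mine. The class of all finite languages is identifiable in the limit by the learner that simply conjectures the set of strings seen so far.

```python
# A minimal sketch of Gold-style identification in the limit (illustration only):
# the class of ALL finite languages is learnable by the conservative learner that
# conjectures exactly the set of strings seen so far.  Once every string of the
# finite target has appeared in the text, the conjecture never changes again.

def finite_language_learner(text_prefix):
    """Map an initial segment of a text to a grammar (here: just a finite set of strings)."""
    return frozenset(text_prefix)

def conjectures(text):
    """Feed the learner longer and longer initial segments of the text."""
    return [finite_language_learner(text[:i]) for i in range(1, len(text) + 1)]

if __name__ == "__main__":
    # A text for the finite language {"a", "ab", "abb"}; repetitions are allowed.
    text = ["a", "ab", "abb", "a", "ab", "abb", "a"]
    for i, g in enumerate(conjectures(text), 1):
        print(i, sorted(g))
    # From step 3 on, the conjecture is {"a", "ab", "abb"}: identification in the limit.
```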

k-reversible languages

[diagram: the k-reversible languages, situated among Fin, Reg, and CF]

(Angluin, 1982) learnable:

regular languages where xz, yz ∈ L implies: for all w, xw ∈ L iff yw ∈ L

Stabler, JHU 4-1

A canonical acceptor for a regular language classifies strings (prefixes) x according to their “good finals”: the set of suffixes w such that xw ∈ L. So Angluin considers languages where the classification of each prefix is unambiguously established by a single suffix.

Generalizing, we can let the classification of a prefix be given by the suffix z together with the last k symbols of the prefix.
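For concreteness, here is a small check of the k = 0 case of that condition on a finite sample – my own sketch, not from the talk; it can only test the condition relative to the strings actually in the sample, and the function names are mine.

```python
from collections import defaultdict

def zero_reversible_violations(sample):
    """Check the 0-reversible condition on a finite sample: whenever xz and yz are
    both in the sample (x and y share a good final), x and y should have exactly the
    same good finals.  Only finals visible in the sample are considered, so this is
    a test relative to the sample, not to the full language L."""
    finals = defaultdict(set)                 # prefix -> good finals seen in the sample
    for s in sample:
        for i in range(len(s) + 1):
            finals[s[:i]].add(s[i:])
    violations = []
    prefixes = list(finals)
    for i, x in enumerate(prefixes):
        for y in prefixes[i + 1:]:
            if finals[x] & finals[y] and finals[x] != finals[y]:
                violations.append((x, y))
    return violations

if __name__ == "__main__":
    # (ab)+ is not 0-reversible: e.g. "" and "ab" share the good final "ab",
    # but "ab" also has the good final "" while "" does not.
    print(zero_reversible_violations(["ab", "abab", "ababab"]))
```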

k-valued AB languages

[diagram: the k-valued AB languages, situated among Fin, Reg, and CF]

(Kanazawa, 1998) learnable from function-argument relations:

k-valued AB-languages

Stabler, JHU 5-1

A different but related dimension of complexity operates in a similar way in Kanazawa’s results. Developing earlier work (Shinohara, 1990; Sakakibara, 1992; Buszkowski and Penn, 1990), he shows that if we allow each string to have at most k categories in a simple Ajdukiewicz–Bar-Hillel categorial grammar, then the whole language of function-argument structures can be identified in the limit from a text of function-argument structures.

Each string is unambiguous, so its category is unambiguously the least complex category that can play the same role in all the derivations where the string occurs. This category can be computed by unifying the simplest categories needed in each derivation in the text of structures.

Interestingly, the corresponding string languages can also be identified from texts of strings.

Unfortunately, all the dependencies in these grammars are ones that can be expressed in context-free grammars, and we would like more than that to describe human languages.

configuration in human languages

De mannen hebben Hans de paarden leren voeren
the men have Hans the horses teach feed
‘the men taught Hans to feed the horses’

copying, formally: {xx | x ∈ {0,1}*} ∉ CF

Stabler, JHU 6-1

Letting each predicate point to its arguments, we have dependencies in this Dutch example that cannot be expressed with CF grammars (Bresnan et al., 1982; Bach, Brown, and Marslen-Wilson, 1986).

When crossing dependencies of this kind are syntactically marked, as has been claimed in the case of Swiss-German, even the string set is not generable by a CF grammar (Shieber, 1985).

configuration in human languages

rolling up: not will-1s want-inf begin-inf home-go-inf

Nem fogok akarni kezdeni haza-menni      (V1 V2 V3 [M V4])
Nem fogok akarni haza-menni kezdeni      (V1 V2 [M V4] V3)
Nem fogok haza-menni kezdeni akarni      (V1 [M V4 V3] V2)

formally: {aⁿbⁿcⁿ | n ∈ ℕ} ∉ CF;  {aⁿbⁿcⁿdⁿeⁿ | n ∈ ℕ} ∉ CF, TAG

Stabler, JHU 7-1

One analysis of verbal clusters in Hungarian (Koopman and Szabolcsi, 2000) suggests that they “roll up” from the end of the string as shown here: [M] moves around V4, then [M V4] rolls up around V3, then [M V4 V3] rolls up around V2, …

Dependencies like this cannot be expressed with CF grammars either. With this kind of derivation, we can define languages with any number of “counting dependencies,” beyond the power of both CF and TAG (though they can be captured by MC-TAGs).

configuration in human languages

mixed:

do the robots really all see themselves

Stabler, JHU 8-1

In our account of English, we might want to generate not only the right strings, but also to do so in a way that reflects the semantic dependencies, and the patterns of dependencies here (though quite controversial in detail) can be quite complex too.

There are various ideas about how to capture these kinds of patterns:

1. augment the CFG with features and unification, so that a constituent can be labeled with (possibly infinitely many) feature values

2. augment the CFG with another level of analysis of some kind (Kaplan and Bresnan, 1982)

- but both of these yield systems that can be difficult to decide (and difficult to analyze conceptually in a sufficiently abstract way to assess analyses and compare with alternative frameworks) (Johnson, 1988; Torenvliet and Trautwein, 1995)

3. there is a wide range of more disciplined and hence rigorously interrelated approaches: various versions of TAG, CCG, MCFG, PRCG, MGs

We will pursue the last strategy.

minimalist grammars

lexicon : finite set of lexical items, each a finite sequence of features:

category N,V,A,P,C,D,…

selector =N,=V,=A,=P,=C,=D,…

licensor +wh,+case,…

licensee -wh,-case,…

phonetic+semantic Beatrice,Benedick,criticize,…

Two structure building operations: merge, move

Stabler, JHU 9-1

The grammars described here are inspired by early work in the “minimalist tradition” in syntax (Chomsky, 1995), and by work in categorial grammar (Steedman, 1988; Moortgat, 1996), and have been proven to be expressively equivalent to MCFGs, MC-TAGs, and other systems. (These proofs of expressive equivalence reveal that the grammars are also quite similar in the sense that an easy derivation in one system corresponds to an easy derivation in the others, etc. These systems are fundamentally very similar.)

The grammar has two parts: a finite lexicon and two fixed structure-building rules that operate on it.

minimalist grammars

(1) Merge, triggered by =X: attaches the selected X phrase on the right if the selector is simple (a lexical item), on the left otherwise

kisses::=D =D V  +  Pierre::D   ⇒   [< kisses:=D V  Pierre]

[< making:=D V  tortillas]  +  Maria::D   ⇒   [> Maria [< making:V  tortillas]]

Stabler, JHU 10-1

Each structure building operation applies to expressions, deleting a pair of features (shown in red), and building headed binary tree structures like those shown here, with the order symbols “pointing” towards the head.

In this “bare” notation, a phrase is a maximal subtree with a given head.

The features are deleted in order, and the only affected features are those on the head of the two arguments.
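To make the mechanics concrete, here is a toy sketch of merge – my own encoding, not the talk’s implementation – with bare trees represented as nested tuples and the order symbol pointing at the head subtree; the function names are mine.

```python
# Toy bare trees: a lexical item is ("::", features, string); a derived tree is
# ("<", left, right) or (">", left, right), the symbol pointing toward the head subtree.

def head_features(t):
    """Follow the order symbols down to the head and return its feature list."""
    while t[0] in ("<", ">"):
        t = t[1] if t[0] == "<" else t[2]
    return t[1]

def check_head(t):
    """Return a copy of t with the first feature of its head deleted."""
    if t[0] == "::":
        return ("::", t[1][1:], t[2])
    return ("<", check_head(t[1]), t[2]) if t[0] == "<" else (">", t[1], check_head(t[2]))

def merge(a, b):
    """Merge a (selector, first feature =X) with b (first feature X): a simple (lexical)
    selector attaches b on its right; a derived selector attaches b on its left."""
    assert head_features(a)[0] == "=" + head_features(b)[0], "features do not match"
    a1, b1 = check_head(a), check_head(b)
    return ("<", a1, b1) if a[0] == "::" else (">", b1, a1)

if __name__ == "__main__":
    kisses = ("::", ["=D", "=D", "V"], "kisses")
    pierre = ("::", ["D"], "Pierre")
    maria  = ("::", ["D"], "Maria")
    vp = merge(kisses, pierre)   # kisses:=D V with Pierre attached on the right
    print(merge(vp, maria))      # Maria attached on the left of [kisses:V Pierre]
```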

minimalist grammars

(2) Move triggered by +X, moving maximal -X subconstituent to specifier:

[< will:+case T [> Maria:-case [< speak Nahuatl]]]   ⇒   [> Maria [< will:T [> ε [< speak Nahuatl]]]]

Stabler, JHU 11-1

When the head of an expression begins with +f, and there is exactly one other node N in the tree with -f as its first feature, we take the phrase that has N as its head and move it up to the left.

This is a unary simplification step, but like merge, it deletes a pair of features.
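A companion sketch of move, again my own toy encoding over the same bare-tree representation as the merge sketch above (its helpers are repeated so this runs on its own); like the talk’s formulation, it assumes there is exactly one -f subconstituent.

```python
# Bare trees as in the merge sketch: ("::", features, string), ("<", l, r), (">", l, r).

def head_features(t):
    while t[0] in ("<", ">"):
        t = t[1] if t[0] == "<" else t[2]
    return t[1]

def check_head(t):
    if t[0] == "::":
        return ("::", t[1][1:], t[2])
    return ("<", check_head(t[1]), t[2]) if t[0] == "<" else (">", t[1], check_head(t[2]))

EMPTY = ("::", [], "")   # the vacated launching site is left as an empty leaf

def extract(t, f):
    """Find the maximal subtree whose head's first feature is -f (searching top-down,
    so the largest projection is found first); return it with -f checked, together
    with the remainder of the tree, or None if there is no such subtree."""
    if head_features(t) and head_features(t)[0] == "-" + f:
        return check_head(t), EMPTY
    if t[0] in ("<", ">"):
        for i in (1, 2):
            found = extract(t[i], f)
            if found:
                moved, rest = found
                parts = list(t)
                parts[i] = rest
                return moved, tuple(parts)
    return None

def move(t):
    """Move, triggered by +f on the head: raise the (unique) maximal -f subtree to specifier."""
    f = head_features(t)[0]
    assert f.startswith("+"), "head must begin with a licensor feature"
    moved, rest = extract(check_head(t), f[1:])
    return (">", moved, rest)

if __name__ == "__main__":
    speak_nahuatl = ("<", ("::", [], "speak"), ("::", [], "Nahuatl"))
    tp = ("<", ("::", ["+case", "T"], "will"), (">", ("::", ["-case"], "Maria"), speak_nahuatl))
    print(move(tp))   # Maria raised to the specifier of will:T, leaving an empty leaf behind
```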

minimalist grammars: example

(MG1) criticize::=D V -v praise::=D V -v

-s::=v +v +case T ε::=V +case =D v

Beatrice::D -case Benedick::D -case and::=T =T T

[two trees for “Beatrice criticize -s Benedick”: the derived bare tree, and, on the right, the corresponding conventional X-bar tree (TP, with coindexed DP and VP traces marking the movements)]

Stabler, JHU 12-1

This grammar of 7 lexical items generates an infinite language, because of the recursive category given to “and”.

Here we show a “bare tree” of the sort just introduced, but it is easy to compute instead a representation more similar to the ones common in the linguistic literature, as shown on the right.

These structures represent the results of a derivation, but it is also easy to keep a complete record of the whole derivation, and to do this, we do not need the whole tree structures with all those empty nodes. It suffices to have categorized tuples of strings – as on the next slide.

minimalist grammars

The same derivation, showing every step

(but with categorized tuples of strings rather than trees):

Beatrice criticize -s Benedick:T
  criticize -s Benedick:+case T, Beatrice:-case
    -s Benedick:+i +case T, criticize:-i, Beatrice:-case
      -s::=v +i +case T
      Benedick:v, criticize:-i, Beatrice:-case
        Benedick:=D v, criticize:-i
          ε:+case =D v, criticize:-i, Benedick:-case
            ε::=V +case =D v
            criticize:V -i, Benedick:-case
              criticize::=D V -i
              Benedick::D -case
        Beatrice::D -case

Stabler, JHU 13-1

For formal study, these standard derivation trees are more useful than the linguists’ representations, since they show everything.

Notice that all the features that appear in the whole derivation originate in the lexical items, and each step checks and deletes a pair of features. E.g. the first feature of criticize checks the first feature of Benedick; the second feature of criticize gets checked by the first feature of the empty transitivizer, etc.

In fact, the pattern of feature checking steps determines the whole derivation, and so there is a simpler representation of the derivation, as a matching graph.

minimalist grammars

The same derivation as a matching graph:

[matching graph over the lexical items Beatrice, praise, Benedick, -s, and ε: each selector/licensor feature occurrence is connected by an arc to the category/licensee feature occurrence it checks]

Stabler, JHU 14-1

In these matching graphs, many of the feature-cancellation arcs correspond to semantic dependencies.

To get closer to showing just the semantic dependencies, let’s strip out a little bit of the information – we define d-structures as reduced matching graphs this way.

NB: “d” stands for dependency, not deep! (but it is interesting to consider the analogies between these structures and the underlying structures of early transformational grammars)

minimalist grammars

The same derivation as a matching graph:

[the same matching graph as on the previous slide]

D-structure = remove (i) features and (ii) order on incoming arcs:

[d-structure: arcs among Beatrice, praise, Benedick, -s, and ε, labeled by outgoing order (1, 2, 3)]

Stabler, JHU 15-1

Stripping out the features and incoming order, we have these “dependency structures,” d-structures, relations among the pronounced parts of these same 5 lexical items. The arcs are numbered (and color coded) according to outgoing order, which might correspond to some natural salience measure – most salient (least oblique) arguments are selected first.
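To make this concrete, here is one way to write down a matching graph and its reduction to a d-structure as data – my own encoding, not the talk’s, using the MG1 feature names and the checking steps described two slides back; the node name eps_v for the empty transitivizer is mine.

```python
# A toy sketch of the reduction from a matching graph to a d-structure: keep the arcs
# between lexical items, drop the features and the order of incoming arcs, keep the
# outgoing (checking) order.

# Matching graph: nodes are lexical items (name -> feature list); arcs connect a
# feature occurrence (node, feature index) to the occurrence it checks.
nodes = {
    "criticize": ["=D", "V", "-v"],
    "-s":        ["=v", "+v", "+case", "T"],
    "eps_v":     ["=V", "+case", "=D", "v"],     # the empty "transitivizer"
    "Beatrice":  ["D", "-case"],
    "Benedick":  ["D", "-case"],
}
arcs = [   # (checking node, feature index)  ->  (checked node, feature index)
    (("criticize", 0), ("Benedick", 0)),   # =D    checks D
    (("eps_v", 0),     ("criticize", 1)),  # =V    checks V
    (("eps_v", 1),     ("Benedick", 1)),   # +case checks -case
    (("eps_v", 2),     ("Beatrice", 0)),   # =D    checks D
    (("-s", 0),        ("eps_v", 3)),      # =v    checks v
    (("-s", 1),        ("criticize", 2)),  # +v    checks -v
    (("-s", 2),        ("Beatrice", 1)),   # +case checks -case
]

def d_structure(arcs):
    """Drop the features; keep, for each node, its outgoing arcs in checking order."""
    out = {}
    for (src, i), (tgt, _) in sorted(arcs, key=lambda a: a[0][1]):
        out.setdefault(src, []).append(tgt)
    return out

if __name__ == "__main__":
    for src, targets in d_structure(arcs).items():
        for k, tgt in enumerate(targets, 1):
            print(f"{src} -{k}-> {tgt}")
```

Running it lists arcs such as eps_v -1-> criticize, eps_v -2-> Benedick, eps_v -3-> Beatrice, and -s -1-> eps_v, which is essentially the d-structure pictured above (with criticize in place of praise).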

Some of the relations (all the red arrows – numbered 1 in case you don’t have color) represent selection relations which may correspond to semantic relations that a language learner might be able to figure out, and some of the other relations represent alterations in string order (movements) which the learner might be able to notice too.

So let’s use these structures to break the learning problem in two, as mentioned in the introduction. That is, we imagine the learner as recognizing d-structure relations, then computing a grammar that defines all the d-structures seen in a text.

minimalist grammars

the same D-structure:

[the same d-structure: arcs among Beatrice, praise, Benedick, -s, and ε, labeled by outgoing order]

(MG1) criticize::=D V -v praise::=D V -v

-s::=v +v +case T ε::=V +case =D v

Beatrice::D -case Benedick::D -case and::=T =T T

Stabler, JHU 16-1

I repeat MG1, the grammar generating this d-structure, just to remind you.

Now I want to notice another property of this grammar: it is unambiguous, 1-valued.

Following the tradition from Angluin and Kanazawa, can we use k-valuedness to drive a generalization step that can recover exactly the features of a target grammar? (And if so, maybe the string languages can also be learned from strings.)

It looks like the k-valued languages are the right kind of class to allow such a result (next slide).

k-valued minimalist languages

[diagram: the k-valued MG languages, situated among Fin, Reg, CF, CS, and MG]

Stabler, JHU 17-1

If the learning can work this way, it would be nice, because k-valued MGs can define copying and rolling-up dependencies of the sort we might want.

In fact, the languages definable by MGs where each string has at most k categorizations include some finite, some regular, some CF, some CS languages (including copying and rolling up), so we can place this infinite hierarchy of language classes into the Chomsky hierarchy as shown.

On the position of the MG languages cf. esp. Cornell (1996), Michaelis (1998), Harkema (2000), Michaelis, Mönnich, and Morawietz (2000), Harkema (2001a), Michaelis (2001b), Michaelis (2001a), Harkema (2001b).

Some of these, and also other papers, are available at http://www.ling.uni-potsdam.de/~michael/

1-valued MG learner: d-structures → G

Input: 〈d1〉

[the d-structure d1, for “Beatrice criticize -s Benedick”, over the items Beatrice, criticize, Benedick, -s, and ε]

Stabler, JHU 18-1

It turns out that 1-valued MGs can be learned from texts of d-structures (and the extension to k-valued looks like it will go through too).

I will quickly step through the learning algorithm to illustrate how it works

1-valued MG learner

Step 1.a: Label root category

[the same d-structure, with the root now labeled -s::T]

Stabler, JHU 19-1

Let’s assume for the moment that the root category is T, tense.

1-valued MG learner

Step 1.b.i: Identify least incoming arcs of non-root nodes; add new category labels:

[the d-structure with new category labels at each node: Beatrice::D, criticize::E, Benedick::F, ε::G (root: -s::T); the least incoming arc of each node is marked]

Stabler, JHU 20-1

OK – in this step, notice that new categories are introduced in each position: D, E, F, G.

1-valued MG learner

Step 1.b.ii: Add new licensee features for each later incoming arc:

[the d-structure with new licensee features added for later incoming arcs: Beatrice::D -H, criticize::E -J, Benedick::F -I, ε::G (root: -s::T)]

Stabler, JHU 21-1

A little work suffices to figure this out – again, new features are introduced in each position.

1-valued MG learner

Step 1.c: Add precategory features to match other end of each outgoing arc:

[the completed labeling: Beatrice::D -H, criticize::=F E -J, Benedick::F -I, -s::=G +J +H T, ε::=E +I =D G]

Stabler, JHU 22-1

With this step, we actually have a complete matching graph, with lexical items at the nodes.

1-valued MG learner

Step 2. collect lexicon: GF(〈d1〉) is then:

criticize::=F E -J

-s::=G +J +H T ε::=E +I =D G

Beatrice::D -H Benedick::F -I

Step 3. unify to make rigid: GF(〈d1〉) already rigid, so GF(〈d1〉) = RG(〈d1〉)

(MG1) criticize::=D V -v praise::=D V -v

-s::=v +v +case T ε::=V +case =D v

Beatrice::D -case Benedick::D -case and::=T =T T

Stabler, JHU 23-1

Notice that “and” has not been seen, and we don’t yet know that Benedick and Beatrice have the same category. The learner’s grammar now generates just the one d-structure that it was given.

1-valued MG learner

Input: 〈d1, d2〉

[d1 as before, together with d2: a d-structure in which “and” joins a clause built from Beatrice, criticize, -s, and ε with a clause built from Benedick, praise, Beatrice, -s, and ε]

Stabler, JHU 24-1

Going through the same labelling steps, we can enrich this to get a matching graph, and collect the lexical items out of it.

1-valued MG learner

Step 2: GF(〈d1, d2〉) is then:

Beatrice::P -U Benedick::O -V and::=L =K T

Beatrice::S -Y Benedick::C -X

-s::=M +W +U K ε::=N +V =P M

criticize::=O N -W praise::=S R -Z

-s::=Q +Z +X L ε::=R +Y =C Q

criticize::=F E -J

-s::=G +J +H T ε::=E +I =D G

Beatrice::D -H Benedick::F -I

NB: GF(〈d1, d2〉) does not generalize at all.

Stabler, JHU 25-1

Obviously this grammar is not rigid, but we can unify the different assignments to each vocabulary item.
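Here is a rough sketch of that unification step – my own simplification, not the talk’s code – treating the invented category names as variables and merging them with union-find; it assumes two assignments to the same item unify only when they have the same length and the same feature types, which is all these examples need.

```python
# Sketch of Step 3: make the general-form lexicon rigid by unifying the different
# feature-sequence assignments given to each item.  Category names are variables
# merged with union-find; sequences unify only if they have the same length and the
# same feature types (=X with =Y, +X with +Y, -X with -Y, base with base).

class UnionFind(dict):
    def find(self, x):
        self.setdefault(x, x)
        while self[x] != x:
            self[x] = self[self[x]]
            x = self[x]
        return x
    def union(self, x, y):
        self[self.find(x)] = self.find(y)

def split(f):
    """Split a feature into (type, category name): '=D' -> ('=', 'D'), 'D' -> ('', 'D')."""
    return (f[0], f[1:]) if f[0] in "=+-" else ("", f)

def unify_rigid(lexicon):
    """lexicon: list of (word, feature sequence).  Returns a rigid lexicon, or raises."""
    uf = UnionFind()
    by_word = {}
    for word, feats in lexicon:
        if word not in by_word:
            by_word[word] = feats
            continue
        old = by_word[word]
        assert len(old) == len(feats), f"cannot unify assignments for {word}"
        for f, g in zip(old, feats):
            (tf, cf), (tg, cg) = split(f), split(g)
            assert tf == tg, f"cannot unify {f} with {g}"
            uf.union(cf, cg)
    return {w: [split(f)[0] + uf.find(split(f)[1]) for f in fs] for w, fs in by_word.items()}

if __name__ == "__main__":
    # Two of the assignments to Beatrice and to -s from GF(<d1,d2>), abbreviated:
    gf = [("Beatrice", ["P", "-U"]), ("Beatrice", ["D", "-H"]),
          ("-s", ["=M", "+W", "+U", "K"]), ("-s", ["=G", "+J", "+H", "T"])]
    print(unify_rigid(gf))
    # Beatrice ends up with a single assignment, with P unified with D and U with H.
```

Run on the full GF(〈d1, d2〉) above, the same unions should collapse the three assignment sets and give an alphabetic variant of the rigid grammar shown in Step 3 on the next slide.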

1-valued MG learner

Step 3. unify to make rigid: RG(〈d1, d2〉) =
criticize::=D E -J praise::=D E -J

-s::=G +J +H T ε::=E +H =D G

Beatrice::D -H Benedick::D -H and::=T =T T

(MG1) criticize::=D V -v praise::=D V -v

-s::=v +v +case T ε::=V +case =D v

Beatrice::D -case Benedick::D -case and::=T =T T

Stabler, JHU 26-1

We have succeeded, after 2 d-structures, in getting a grammar for the infinite language: a grammar that is an alphabetic variant of the target.

1-valued MG learner: example 2 copying

[the d-structure for an example string of the copy language, over the lexical items a, b, 0, 1, 2, 3, and ε]

a::C -r -l b::=C +r +l T ε::T

2::=C +r A -r 3::=C +r B -r

0::=A +l C -l 1::=B +l C -l

Stabler, JHU 27-1

The strategy works for copying dependencies, but I won’t go through it.

The learning strategy is easily implemented, and I used it to check a range of examples.

1-valued MG learner: example 3 “rolling up”

[the d-structure for a “rolling up” example with five counting dependencies, over the lexical items a, b, c, d, e, 1, 2, 3, 4, 5, v, w, x, y, z]

a::=B A -a b::=C B -b c::=D C -c d::=E D -d e::E -e

1::=2 +a A -a 2::=3 +b 2 -b 3::=4 +c 3 -c 4::=5 +d 4 -d 5::=A +e 5 -e

v::=Z +a T w::=A +e W x::=W +d k y::=k +c Y z::=Y +b Z

Stabler, JHU 28-1

And it works for a grammar with 5 counting dependencies, but again I won’t go through it.

k-valued MGs learnable from d-structures

Thm: From each D-structure, the learner can compute lexical representations H which have the actual representations as instances. That is, ∃θ, Hθ ⊆ G.

Thm: Given a “text” of D-structures D1, D2, …, Di defined by any reduced, k-valued MG G, in the limit as i → ∞, we can compute a k-valued grammar G′ which is an alphabetic variant of G.

Stabler, JHU 29-1

The proof of these results follows Kanazawa’s very closely.

It also looks like these results extend fairly easily to a range of variants of the minimalist grammar considered here – see for example Kobele et al. (2002).

k-valued MGs learnable from d-structures

doesn’t relevance depend on unrealistic assumptions?

a. can learners really perceive d-structure relations?

b. human languages do not bound syntactic ambiguity

c. the learner does not get arbitrarily long texts: noisy, incomplete

Stabler, JHU 30-1

Each one of these deserves a talk by itself, and work on these issues is in progress. I’ll conclude with brief remarks about each.

semantic dependencies, string position, d-structures

[the same d-structure again: Beatrice, praise, Benedick, -s, ε]

Selection: from cognitively salient events and participants

Movement: evidenced by string position

Stabler, JHU 31-1

Now look more carefully at the d-structures. They encode two different kinds of relations: selection (always local), and movement (order changes).

semantic dependencies, string position, d-structures

[the same d-structure once more]

Selection: from cognitively salient events and participants

better: cognitively salient events + participants → lexical categories

language classification → tense, aspect, definiteness,…

Movement: evidenced by string position

Stabler, JHU 32-1

Some of the lexical predicate-argument relations are readily apprehended and can be identified by various kinds of “cross-situational” learning, which can be rigorously approached in a robotic setting (Pinker, 1989; Tishby and Gorin, 1994; Siskind, 1996; Thompson and Mooney, 1998).

It is perhaps not plausible that tense, aspect, definiteness, … are cognitively salient to the pre-linguistic infant. For these, it is more likely that the linguistic experience itself provides a basis for the categorization – some kind of a “bootstrapping” account.

A study of these strategies in a preliminary step for MG learning is in progress…

k-valuedness?

ambiguity in the basic syntax and verb classes from an intro syntax text:

two kinds?

-s::=v +v +case T -s::=N Num (also genitive case)

ε::=V +case =D v ε::=T C (and more…)

read::=D V read::V read::N (and more…)

bill::=D V bill::V bill::N

Stabler, JHU 33-1

Looking at the syntactic ambiguities in a broader grammar, it might seem that there are 2 very different kinds.

k-valuedness?

Different semantic types:

-s::=v +v +case T -s::=N Num

ε::=V +case =D v ε::=T C

read::=D V read::V read::N

Bill::=D V bill::V bill::N

plausibly: undoing the conflation of phonetic, semantic features and representing semantic types eliminates the ambiguity in the grammar

If not: learning still might be possible

Stabler, JHU 34-1

Undoing the conflation of phonetic, semantic features and representing semantic types eliminates the ambiguity in the grammar. Cf. Fulop (1999), Tellier (1998).

But the prospects for identifying these elements and the d-structure relations in the d-structures do not look hopeless, and first experiments with this setup in a limited robotic environment are promising.

(I don’t know quite how to get an empirical and mathematical grip on why this situation might be nice in this respect – help on this would be nice.)

the feasibility of the MG learner

(open, poor prospects)

Are the k-reversible languages efficiently PAC-identifiable?

• The k-reversible languages are efficiently PACS-identifiable (Li and Vitányi, 1991; Parekh and Honavar, 1997) and PEC-identifiable (Denis, D’Halluin, and Gilleron, 1996; Denis, 2001)

(open, good prospects) for each G ∈ MG there is a finite set D_L of d-structures such that
  i. no G′ ∈ MG is such that D_L ⊆ ds(G′) ⊂ ds(G)
  ii. |D_L| is bounded by a polynomial function of the number of features
  iii. φ(D_L) is an alphabetic variant of the target

(open, good prospects) learning in (“non-adversarial”, “white”) noise models (Angluin and Laird, 1984; Kearns, 1998)

Stabler, JHU 35-1

While PAC-identifiability in Valiant’s (1984) sense (probably approximately correct identifiability regardless of the distribution over the input) remains open, there is a closely related framework proposed by Li and Vitányi which does better, using a kind of a cheat: assume a “simple” distribution over the evidence.

Prospects look good for the second open problem: consider the sets of feature occurrences in the lexicon that check each other in derivations.

• for D_L we do not need a derivation for every =f (+f) / f (-f) occurrence pair in each set; rather it suffices that these sets be connected.

E.g., if +f in lexical item 1 and in lexical item 2 check -f in item 3 and -f in item 4, we do not need derivations witnessing all of (1,3), (1,4), (2,3), (2,4); it is enough to witness (1,3), (2,3), (2,4). This number of connections is polynomial in n.

• the size of examples needed to witness each type of feature checking is feasible in n (??)

I am not sure which learning model strikes the right balance between the stark situation in Valiant on the one hand, and outright “collusion” between teacher and learner, on the other.

I won’t talk about noise, but I am encouraged by the learnability results for certain (“non-adversarial”, “white”) noise models: cf. e.g. (Angluin and Laird, 1984; Kearns, 1998)

Conclusions

• prospects for 2-stage approach to learning look good

[diagram – Stage 1: non-linguistic perception and cognition, together with linguistic perception and analysis with G, yield hypotheses about language–world relations and a d-structure (a hypothesis about relations among words); Stage 2: modify the language model G]

• lacking good accounts of what non-linguistic material is cognitively salient to language-learning humans, a robotic setting allows us to introduce complexity as desired

• important properties of human language (e.g. UTAH) may be from stage 1

Stabler, JHU 36-1

Previous research has focused on stage 2, and the main results of this talk begin there too.

But our hope is that the methods of robotics/artificial life will enable a fruitful study of how stage 2 could be fed by stage 1, where the language is grounded and first semantic hypotheses about utterances begin.

On UTAH, see e.g. the discussions of this and related proposals in Baker (1988), Pesetsky (1995), and references cited there.

References

Angluin, Dana. 1980. Inductive inference of formal languages from positive data. Information and Control, 45:117–135.

Angluin, Dana. 1982. Inference of reversible languages. Journal of the Association for Computing Machinery, 29:741–765.

Angluin, Dana and Philip D. Laird. 1984. Learning from noisy examples. Machine Learning, 14:343–370.

Bach, Emmon, Colin Brown, and William Marslen-Wilson. 1986. Crossed and nested dependencies in German and Dutch. Language and Cognitive Processes, 1:249–262.

Baker, Mark. 1988. Incorporation: a theory of grammatical function changing. MIT Press, Cambridge, Massachusetts.

Bresnan, Joan, Ronald M. Kaplan, Stanley Peters, and Annie Zaenen. 1982. Cross-serial dependencies in Dutch. Linguistic Inquiry, 13(4):613–635.

Buszkowski, Wojciech and Gerald Penn. 1990. Categorial grammars determined from linguistic data by unification. Studia Logica, 49:431–454.

Chomsky, Noam. 1995. The Minimalist Program. MIT Press, Cambridge, Massachusetts.

Cornell, Thomas L. 1996. A minimalist grammar for the copy language. Technical report, SFB 340 Technical Report #79, University of Tübingen.

Denis, François. 2001. Learning regular languages from simple positive examples. Machine Learning, 44:37–66.

Denis, François, Cyrille D’Halluin, and Rémi Gilleron. 1996. PAC learning with simple examples. In 13th Annual Symposium on Theoretical Aspects of Computer Science, LNCS #1040, pages 231–242, Berlin. Springer-Verlag.

Dummett, Michael. 1993. The Seas of Language. Clarendon Press, Oxford.

Fulop, Sean. 1999. On the Logic and Learning of Language. Ph.D. thesis, University of California, Los Angeles.

Gold, E. Mark. 1967. Language identification in the limit. Information and Control, 10:447–474.

Harkema, Henk. 2000. A recognizer for minimalist grammars. In Sixth International Workshop on Parsing Technologies, IWPT’2000.

Harkema, Henk. 2001a. A characterization of minimalist languages. In Proceedings, Logical Aspects of Computational Linguistics, LACL’01, Port-aux-Rocs, Le Croisic, France.

Harkema, Henk. 2001b. Parsing Minimalist Languages. Ph.D. thesis, University of California, Los Angeles.


Johnson, Mark. 1988. Attribute Value Logic and The Theory of Grammar. Number 16 in CSLI Lecture Notes Series. Chicago University Press, Chicago.

Kanazawa, Makoto. 1998. Learnable Classes of Categorial Grammars. CSLI Publications/FOLLI, Stanford, California. (Revised 1994 Ph.D. thesis, Stanford University).

Kaplan, Ronald M. and Joan Bresnan. 1982. Lexical-functional grammar: A formal system for grammatical representation. In Joan Bresnan, editor, The Mental Representation of Grammatical Relations. MIT Press, chapter 4, pages 173–281.

Kearns, Michael. 1998. Efficient noise-tolerant learning from statistical queries. Journal of the Association for Computing Machinery, 45:392–401.

Kobele, Gregory M., Travis Collier, Charles Taylor, and Edward Stabler. 2002. Learning mirror theory. In 6th International Workshop on Tree Adjoining Grammars and Related Frameworks.

Koopman, Hilda and Anna Szabolcsi. 2000. Verbal Complexes. MIT Press, Cambridge, Massachusetts.

Li, Ming and Paul Vitányi. 1991. Learning concepts under simple distributions. SIAM Journal on Computing, 20(5):911–935.

Michaelis, Jens. 1998. Derivational minimalism is mildly context-sensitive. In Proceedings, Logical Aspects of Computational Linguistics, LACL’98, Grenoble.

Michaelis, Jens. 2001a. On Formal Properties of Minimalist Grammars. Ph.D. thesis, Universität Potsdam. Linguistics in Potsdam 13, Universitätsbibliothek, Potsdam, Germany.

Michaelis, Jens. 2001b. Transforming linear context free rewriting systems into minimalist grammars. In Philippe de Groote, Glyn Morrill, and Christian Retoré, editors, Logical Aspects of Computational Linguistics, Lecture Notes in Artificial Intelligence, No. 2099, pages 228–244, NY. Springer.

Michaelis, Jens, Uwe Mönnich, and Frank Morawietz. 2000. Algebraic description of derivational minimalism. In International Conference on Algebraic Methods in Language Processing, AMiLP’2000/TWLT16, University of Iowa.

Moortgat, Michael. 1996. Categorial type logics. In Johan van Benthem and Alice ter Meulen, editors, Handbook of Logic and Language. Elsevier, Amsterdam.

Parekh, Rajesh and Vasant Honavar. 1997. Learning DFA from simple examples. In M. Li and A. Maruoka, editors, Proceedings of the 8th International Workshop on Algorithmic Learning Theory (ALT’97), Lecture Notes in Artificial Intelligence Number 1316, pages 116–131.

Pesetsky, David. 1995. Zero Syntax: Experiencers and Cascades. MIT Press, Cambridge, Massachusetts.

Pinker, Steven. 1989. Learnability and Cognition: The acquisition of argument structure. MIT Press, Cambridge, Massachusetts.

Sakakibara, Yasubumi. 1992. Efficient learning of context-free grammars from positive structural examples. Information and Computation, 97:23–60.

Shieber, Stuart M. 1985. Evidence against the context-freeness of natural language. Linguistics and Philosophy, 8(3):333–344.

Shinohara, T. 1990. Inductive inference from positive data is powerful. In Annual Workshop on Computational Learning Theory, pages 97–110, San Mateo, California. Morgan Kaufmann.

Siskind, Jeffrey Mark. 1996. A computational study of cross-situational techniques for learning word-to-meaning mappings. Cognition, 61:39–91.

Stabler, Edward P. 1997. Derivational minimalism. In Christian Retoré, editor, Logical Aspects of Computational Linguistics. Springer-Verlag (Lecture Notes in Computer Science 1328), NY, pages 68–95.

Stabler, Edward P. 1999a. Computational Minimalism: Acquiring and parsing languages with movement. Basil Blackwell, Oxford. Forthcoming.

Stabler, Edward P. 1999b. Remnant movement and complexity. In Gosse Bouma, Erhard Hinrichs, Geert-Jan Kruijff, and Dick Oehrle, editors, Constraints and Resources in Natural Language Syntax and Semantics. CSLI, Stanford, California, pages 299–326.

Stabler, Edward P. 2001. Minimalist grammars and recognition. In Christian Rohrer, Antje Rossdeutscher, and Hans Kamp, editors, Linguistic Form and its Computation. CSLI, Stanford, California. (Presented at the SFB340 workshop at Bad Teinach, 1999).

Steedman, Mark J. 1988. Combinators and grammars. In R. Oehrle, E. Bach, and D. Wheeler, editors, Categorial Grammars and Natural Language Structures. Reidel, Dordrecht.

Tellier, Isabelle. 1998. Meaning helps learning syntax. In Actes du 4ième International Colloquium on Grammatical Inference, ICGI’98, Lecture Notes in Artificial Intelligence 1433, pages 25–36, NY. Springer.

Thompson, Cynthia A. and Raymond J. Mooney. 1998. Semantic lexicon acquisition for learning natural language interfaces. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 57–65. Also TR AI 98-273, Artificial Intelligence Lab, University of Texas at Austin.

Tishby, N. and A. Gorin. 1994. Algebraic learning of statistical association for language acquisition. Computer Speech and Language, 8:51–78.

Torenvliet, Leen and Marten Trautwein. 1995. A note on the complexity of restricted attribute-value grammars. In Proceedings of Computational Linguistics in the Netherlands, CLIN5, pages 145–164.

Valiant, Leslie. 1984. A theory of the learnable. Communications of the Association for Computing Machinery, 27:1134–1142.