
Testing Functional Explanations of Word Order Universals

Michael Hahn (Stanford) and Richard Futrell (UC Irvine)

(Greenberg 1963)

U3: 'Languages with dominant VSO order are always prepositional.'

U4: 'With overwhelmingly greater than chance frequency, languages with normal SOV order are postpositional.'

'Relative position of adposition & noun ~ relative position of verb & object'

OV languages with postpositions

VO languages with prepositions

Why do these universals hold?

Innate constraints on language, 'Universal Grammar'? (Chomsky 1981)

Facilitation of human communication? (Dryer 1992, Hawkins 1994)

Make languages learnable? (Culbertson 2017)

Approach: Test functional explanations by implementing efficiency measures, optimizing grammars, and checking whether universals hold in optimized grammars.

Three Efficiency Measures

Dependency Length Minimization (Rijkhoff, 1986; Hawkins, 1994, 2003)

Surprisal (Gildea and Jaeger, 2015; Ferrer-i-Cancho, 2017)

Parsability (Hawkins, 1994, 2003)

Dependency Length Minimization

[Figure: summed dependency lengths in an example sentence: 2 + 1 + 1 = 4]
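As a concrete illustration (a sketch, not the authors' code), the total dependency length of a sentence can be computed from word positions and head indices; the toy tree for "Mary met John" below is a hypothetical encoding.

```python
def total_dependency_length(heads):
    """Sum of distances between each word and its syntactic head.

    `heads` maps 1-based word positions to 1-based head positions;
    the root (head 0) contributes nothing.
    """
    return sum(abs(i - h) for i, h in heads.items() if h != 0)

# "Mary met John": met is the root, Mary <- met, John <- met
heads = {1: 2, 2: 0, 3: 2}
print(total_dependency_length(heads))  # 1 + 1 = 2
```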

Three Efficiency Measures: Surprisal

Surprisal(w1...wn) = -Σi log P(wi | w1...wi-1)

Estimated using recurrent neural networks, the strongest existing methods for estimating surprisal and predicting reading times.
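A minimal sketch of this sum, assuming `lm_prob(word, context)` is some estimate of P(wi | w1...wi-1); in the talk this estimate comes from an RNN language model, so the context-free toy probabilities below are only a stand-in.

```python
import math

def total_surprisal(sentence, lm_prob):
    """Surprisal(w1...wn) = -sum_i log P(w_i | w_1...w_{i-1})."""
    total = 0.0
    for i, word in enumerate(sentence):
        total -= math.log(lm_prob(word, sentence[:i]))
    return total

# Stand-in probability model: ignores the context entirely.
toy_probs = {"Mary": 0.2, "has": 0.3, "two": 0.1, "green": 0.1, "books": 0.3}
print(total_surprisal(["Mary", "has", "two", "green", "books"],
                      lambda w, ctx: toy_probs[w]))
```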

Three Efficiency Measures: Parsability

Mary has two green books.

Parsability(utterance) := log P(tree | utterance)

Estimated using a neural network model (Dozat and Manning 2017) with an extremely generic architecture.
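One way to make the definition concrete (a sketch under an assumed factorization, not the authors' implementation): if a parser outputs, for every word, a distribution over possible head positions, then log P(tree | utterance) decomposes into a sum of per-word head log-probabilities.

```python
import math

def parsability(head_probs, gold_heads):
    """log P(tree | utterance), assuming the parser factorizes over
    per-word head choices: head_probs[i][h] = P(head of w_i is h)."""
    return sum(math.log(head_probs[i][gold_heads[i]]) for i in gold_heads)

# Toy example for "Mary met John" (position 0 stands for the root):
head_probs = {1: {2: 0.9, 0: 0.1}, 2: {0: 0.95, 1: 0.05}, 3: {2: 0.8, 1: 0.2}}
gold_heads = {1: 2, 2: 0, 3: 2}
print(parsability(head_probs, gold_heads))  # close to 0 => easy to parse
```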

Combining Parsability + Surprisal

Utility = Informativity − λ · Cost

Informativity: the amount of meaning that can be extracted from the utterance ~ Parsability

Cost: the cost of processing the utterance ~ Surprisal

Long tradition as an explanation of language (Gabelentz 1903, Zipf 1949, Horn 1984, …)

Formalized in Rational Speech Acts models (Frank and Goodman 2012)

Related to Signal Processing (Rate-Distortion Theory, Information Bottleneck)
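As a sketch (not the authors' code), the combined objective over a counterfactual corpus could be computed as below; `parsability_of` and `surprisal_of` stand in for the neural estimates described above, and λ is the trade-off weight.

```python
def communicative_utility(corpus, parsability_of, surprisal_of, lam=1.0):
    """Utility = Informativity - lambda * Cost, with Informativity ~ parsability
    and Cost ~ surprisal, averaged over a (counterfactual) corpus."""
    informativity = sum(parsability_of(sent) for sent in corpus) / len(corpus)
    cost = sum(surprisal_of(sent) for sent in corpus) / len(corpus)
    return informativity - lam * cost
```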


Testing Functional Explanations

Approach: Optimize the word orders of languages for the three objectives, keeping syntactic structures unchanged.

Languages have word order regularities ⇒ it is not sufficient to optimize the word orders of individual sentences.

Instead: optimize the word order rules of entire languages.

That is: optimized languages have optimized but internally consistent grammatical regularities in word order, and agree with an actual natural language in all other respects.

[Figure: the pipeline from a real dependency corpus to a counterfactual corpus.]

Dependency Corpus: "Mary has two green books" annotated with the relations nsubj, dobj, nummod, amod.

Tree Topologies: the unordered dependency trees extracted from the corpus.

Ordering Grammar: one weight per dependency type, e.g. NOUN-ADJ (amod): 0.3, NOUN-NUM (nummod): 0.7, VERB-NOUN (nsubj): −0.2, VERB-NOUN (dobj): 0.8, …

These weights encode regularities such as "Object follows verb", "Adjective precedes noun", and "Numerals follow adjectives & precede nouns".

Counterfactual Corpus: the tree topologies linearized according to the ordering grammar, e.g. "Mary has two green books".

Each parameter setting generates a different counterfactual corpus (e.g. amod 0.9, nummod 0.1, nsubj 0.5, dobj 0.2 yields a different ordering of the same trees).

We compute the processing measures on the counterfactual corpora.

[Figure: Dependency Length / Surprisal / Parsability values for one grammar, e.g. 2.3 / 5.8 / 1.8.]

Each parameter setting results in different values for the processing measures (e.g. 2.9 / 4.5 / 2.9 and 3.4 / 7.8 / 1.2 for other settings).

Which settings optimise the measures?

Do the optimised settings replicate the Greenberg correlations?

For each objective, find parameters that optimise it: Minimize Dependency Length, Minimize Surprisal, Maximize Parsability, Optimize Parsability + Surprisal.

[Figure: four optimized ordering grammars, one per objective, each assigning different weights to the dependency types (amod, nummod, nsubj, dobj, …).]

Repeat this for corpora from 51 real languages from the Universal Dependencies project.

1. How do the objectives compare?
2. Which universals are predicted?
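A highly simplified sketch of this optimisation step; the real work optimises the neural objectives directly (e.g. with stochastic gradient methods), so the random-restart hill-climbing loop, the relation list, and the function names below are illustrative only.

```python
import random

def optimize_grammar(tree_corpus, objective, n_iters=1000, step=0.1):
    """Hill-climb over ordering-grammar parameters.

    `objective(grammar, tree_corpus)` should return a score to maximize,
    e.g. negative dependency length, negative surprisal, or parsability
    of the counterfactual corpus generated by `grammar`.
    """
    grammar = {rel: random.uniform(0, 1)
               for rel in ["amod", "nummod", "nsubj", "obj", "case"]}  # illustrative relations
    best = objective(grammar, tree_corpus)
    for _ in range(n_iters):
        rel = random.choice(list(grammar))
        candidate = dict(grammar)
        candidate[rel] = min(1.0, max(0.0, candidate[rel] + random.uniform(-step, step)))
        score = objective(candidate, tree_corpus)
        if score > best:
            grammar, best = candidate, score
    return grammar
```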

Surprisal and Parsability minimize Dependency Length.

Communicative Utility predicts Dependency Length Minimization.

Language optimizes Surprisal and Parsability

[Figure: grammars plotted by Parsability (better →) and Surprisal (lower →), z-transformed on the level of languages: Random Grammars, Grammars fit to Real Orderings, and grammars Optimized for Surprisal, for Parsability, and for Parsability + Surprisal.]

(Dryer 1992 in Language)

'Relative position of adposition & noun ~ relative position of verb & object'

We formalize the correlations in the Universal Dependencies format.

For any word order grammar, we can then check which correlations it satisfies.
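Once a correlation is stated over UD relation types, checking whether a grammar satisfies it reduces to comparing direction parameters, as in the sketch below. The relation names and the object-/verb-patterner assignment for `case` follow standard UD and Dryer-style conventions; the exact list of correlation pairs used in the work is not reproduced here.

```python
def object_patterner_precedes(grammar, relation, dependent_is_object_patterner):
    """Does the object-patterner of `relation` precede its verb-patterner?

    `grammar[rel]` = P(UD dependent precedes UD head). For the UD `case`
    relation the dependent is the adposition (a verb-patterner), so the
    object-patterner precedes iff the dependent does NOT.
    """
    dependent_first = grammar[relation] > 0.5
    return dependent_first if dependent_is_object_patterner else not dependent_first

def satisfies_correlation(grammar, relation, dependent_is_object_patterner):
    """Greenberg-style correlation: the object-patterner of `relation` is on
    the same side of its verb-patterner as the object is of the verb."""
    object_precedes_verb = grammar["obj"] > 0.5
    return object_patterner_precedes(
        grammar, relation, dependent_is_object_patterner) == object_precedes_verb

# Example: an OV, postpositional grammar (objects precede verbs,
# adpositions follow nouns). For "case", the dependent (the adposition)
# is the verb-patterner, so dependent_is_object_patterner=False.
grammar = {"obj": 0.8, "case": 0.2}
print(satisfies_correlation(grammar, "case", dependent_is_object_patterner=False))  # True
```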

Are the universals satisfied by models fit to the actual orderings for our 50 languages?

[Figure: percentage of fitted grammars satisfying each universal; annotations note the prevalence of SVO (Dryer 1992) and a limitation of the formalisation.]

Percentage of grammars optimized for each objective satisfying the universal

[Figure: per-universal satisfaction percentages for grammars optimized for each objective.]

Assessing Significance: X = "Object precedes verb", Y = "Object-patterner precedes verb-patterner"

Logistic model: Y ~ X + (1+X|family) + (1+X|language)

Predictions largely complementary

Predictions mostly agree

Communicative Utility replicates predictions of Dependency Length Minimization.

Both measures predict most of the correlation universals.

Conclusion

● Tested explanations of Greenberg correlation universals in terms of efficiency of human processing and communication

● Using corpora from 50 languages, constructed counterfactual optimized languages

● Most of the correlations can be derived from pressure to shorten dependencies, decrease surprisal, or increase parsability

● Clear evidence for functional explanations of word order universals

Optimized grammars are easier to parse even when sentences are presented in orders very different from natural language.

[Figure: parsability of a random grammar vs. an optimized grammar as training data increases; example orderings ACEBD, ADBEC, ACEDB vs. ABCDE.]

Random grammars remain hard to parse even as training data increases.

Formalizing Parsability

Neural parser (Dozat and Manning 2017):

1. A BiLSTM reads the sentence ("Mary met John").
2. Heads are identified by computing a score for each pair of words.

Generic architecture, no assumptions beyond the sequential nature of the input.
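A minimal PyTorch-style sketch of this kind of scorer (assumptions: word embeddings as input, a single BiLSTM layer, and one bilinear score per word pair; the actual parser of Dozat and Manning 2017 uses deeper biaffine attention with separate head/dependent MLPs).

```python
import torch
import torch.nn as nn

class PairScoreParser(nn.Module):
    """BiLSTM encoder + one score per (dependent, head-candidate) pair."""
    def __init__(self, vocab_size, emb_dim=100, hidden_dim=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.bilinear = nn.Bilinear(2 * hidden_dim, 2 * hidden_dim, 1)

    def forward(self, word_ids):
        # word_ids: (batch, seq_len); position 0 is a dummy ROOT token
        states, _ = self.lstm(self.embed(word_ids))      # (batch, seq, 2*hidden)
        n = states.size(1)
        dep = states.unsqueeze(2).expand(-1, n, n, -1)   # representation of dependent i
        head = states.unsqueeze(1).expand(-1, n, n, -1)  # representation of head candidate j
        scores = self.bilinear(dep.reshape(-1, dep.size(-1)),
                               head.reshape(-1, head.size(-1))).view(-1, n, n)
        # scores[b, i, j]: score for "word j is the head of word i";
        # a softmax over j gives P(head | word), trained with cross-entropy.
        return scores
```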

Formalizing Parsability

Information about the syntactic tree that can be extracted from the sentence ("Mary met John").

Formalizing Dependency Length

Distance between a word and its syntactic head, summed over all words in the sentence w = w1...wn.
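Spelled out (a standard formalization; here head(i) denotes the linear position of the head of wi, and root words contribute nothing):

DepLen(w1...wn) = Σi |i − head(i)|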

Formalizing Surprisal

Per-word surprisal, summed over all words in the sentence.

Surprisal depends on the probability model P, and the right choice of P depends on the entire language!

Given a word order grammar θ, choose the model P that minimizes surprisal on the resulting sentences.

We use LSTM recurrent neural networks, the state of the art in probabilistic modelling of natural language and in predicting reading times. These are very general sequence models, arguably minimizing architectural biases.

Formalizing Informativity

Information about the syntactic tree that can be extracted from the sentence.

We use a recent neural model (Dozat and Manning 2017) with a generic architecture and state-of-the-art performance on many languages.

Word Order Grammars

For each dependency type, there are two parameters:
a. α: the probability that the dependent precedes the head
b. β: determines the distance from the head

[Figure, α: with αverb-object = 0.1, αverb-subject = 0.95, αnoun-numeral = 0.99, αnoun-adjective = 0.8, the unordered tree of "Mary has two green books" is linearized as "Mary has two green books".]

[Figure, β: with βNoun-Adjective = −0.3 and βNoun-Numeral = 0.8, softmax(βNoun-Adjective, βNoun-Numeral) ~ (0.1, 0.9), i.e. "numeral first" is more likely than "adjective first", yielding "Mary has two green books".]

This specifies the space of possible grammars, within which we optimize.
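A simplified sketch of how such a grammar turns an unordered tree into a word order (assumptions: each dependent's side is sampled from α, and same-side dependents are placed so that a larger β ends up farther from the head; the paper's exact sampling procedure may differ, and the parameter values below are illustrative only).

```python
import random

def linearize(tree, alpha, beta, rng=random):
    """Recursively order an unordered dependency tree.

    tree:  {"word": str, "rel": str, "children": [subtrees]}
    alpha: rel -> P(dependent precedes head)
    beta:  rel -> distance weight (assumed: higher = farther from the head)
    """
    before, after = [], []
    for child in tree["children"]:
        side = before if rng.random() < alpha[child["rel"]] else after
        side.append(child)
    # Among same-side dependents, a larger beta ends up farther from the head.
    before.sort(key=lambda c: beta[c["rel"]], reverse=True)
    after.sort(key=lambda c: beta[c["rel"]])
    order = [w for c in before for w in linearize(c, alpha, beta, rng)]
    order.append(tree["word"])
    order += [w for c in after for w in linearize(c, alpha, beta, rng)]
    return order

# Toy grammar for "Mary has two green books"
alpha = {"nsubj": 0.95, "obj": 0.1, "nummod": 0.99, "amod": 0.8}
beta = {"nsubj": 0.0, "obj": 0.0, "nummod": 0.8, "amod": -0.3}
tree = {"word": "has", "rel": "root", "children": [
    {"word": "Mary", "rel": "nsubj", "children": []},
    {"word": "books", "rel": "obj", "children": [
        {"word": "two", "rel": "nummod", "children": []},
        {"word": "green", "rel": "amod", "children": []}]}]}
print(" ".join(linearize(tree, alpha, beta)))  # most often: Mary has two green books
```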


We will be working with trees in the Universal Dependencies format:

[Figure: the dependency tree of "Mary has two green books" with relations nsubj, obj, nummod, amod.]

To optimize grammars, we need a space of possible grammars.

[Figure: basic word order types SOV, SVO, VSO.]

SOV and VSO support the correlation; SVO does not.

Support for SVO (Gibson et al. 2013)

Dependency Length Minimization

Short syntactic dependencies ease processing (Gibson, 1998; Grodner and Gibson, 2005; Demberg and Keller, 2008; Bartek et al., 2011).

Quantitative corpus evidence from many languages confirms that languages have shorter dependencies than would be expected at random (Futrell et al., 2015).

Argued to explain several of the Greenberg correlations (Rijkhoff, 1986; Hawkins, 1994, 2003).

Two Objectives for Optimization

Dependency Length Minimization

Communicative Utility

Utility = Informativity − λ · Cost

Informativity: the amount of meaning that can be extracted from the utterance; Cost: the cost of processing the utterance.

Long tradition as an explanation of language (Gabelentz 1903, Zipf 1949, Horn 1984, …)

Mary has two green books.

Informativity(utterance) := log P(tree | utterance) − log P(tree)

We use a neural network model (Dozat and Manning 2017) with an extremely generic architecture.

Surprisal(wi | w1...wi-1) = −log P(wi | w1...wi-1)

We use recurrent neural networks, the state of the art in probabilistic modelling of natural language and in predicting reading times.

[Figure: processing measures (Dependency Length, Surprisal, Parsability) computed for a counterfactual grammar, e.g. 2.3 / 5.8 / 1.8.]

(1) For each objective, find parameters that optimise it.

(2) Which universals do the resulting counterfactual languages satisfy?
