

1

Text-based typology

Corpora, corpora of elicited texts and parallel corpora

(based on STUF 2007)

МД

2

Pros as compared to questionnaires

Contextualization of examples
Naturalistic discourse
Intralinguistic variation
Potentially, makes up for gaps in grammars

3

Frog stories

(Mercer Mayer)

4

Pear stories

W. Chafe et al.
A six-minute film shot at UC Berkeley in 1975
Widely used in cross-linguistic research, e.g. the referential density project

5

Referential density (Bickel 2003)
Relative frequency of overt NPs
(via Nichols 2014)
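A minimal sketch of how such a relative frequency might be computed, assuming referential density is taken as the ratio of overt NPs to the argument positions where an NP could have appeared. The `Clause` annotation scheme, the function name and the toy data are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass


@dataclass
class Clause:
    argument_positions: int  # slots where an NP could appear (hypothetical annotation)
    overt_nps: int           # slots actually filled by an overt NP


def referential_density(clauses):
    """Ratio of overt NPs to available argument positions across a text."""
    positions = sum(c.argument_positions for c in clauses)
    if positions == 0:
        raise ValueError("no argument positions annotated")
    return sum(c.overt_nps for c in clauses) / positions


# Invented toy narrative of three clauses, not real pear-story data.
toy_text = [Clause(2, 1), Clause(1, 1), Clause(2, 0)]
rd = referential_density(toy_text)
print(round(rd, 2))  # 2 overt NPs over 5 positions -> 0.4
```

A higher value would mean that speakers fill more of the available positions with overt NPs.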

6

Cons of elicited corpora

Not directly comparable: speakers focus on and omit different events; results are mostly quantitative
Require massive linguistic effort; limited data for each language
Any alternative? Parallel corpora

7

Massively parallel texts

Harry Potter
  Including subtitles (76, 21)
Biblical translations
  Pater Noster in 1,300 languages, 400 full texts, 1,000 gospels
Marxist texts
  State and Revolution: 71 translations in 36 languages
Legal databases:
  Proceedings of the European Parliament
  Universal Declaration of Human Rights (329)
UNESCO online database of literary translations (1.5 mln items)
  Andersen, Le Petit Prince, …

Cysouw and Wälchli 2007

8

Comparability (easy counts)

Parallel corpora: roughly comparable number of sentences (from 1,528 to 1,663 for Le Petit Prince)
Elicited texts: pear stories in the same language vary from 29 to 119 sentences (Bickel 2003 via Nichols 2014)
‘Free’ corpora: not applicable…
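To make the contrast concrete, the sentence counts quoted above can be turned into a simple longest-to-shortest ratio. This is only an illustrative sketch; the function name `spread` is made up here, and the ratio is one of several possible ways to express the variation.

```python
def spread(shortest, longest):
    """How many times longer the longest text is than the shortest."""
    return longest / shortest


petit_prince = spread(1528, 1663)  # parallel translations of Le Petit Prince
pear_stories = spread(29, 119)     # elicited pear stories within one language
print(f"{petit_prince:.2f} {pear_stories:.2f}")  # prints "1.09 4.10"
```

On these figures the parallel texts differ in length by under 10%, while the elicited texts differ by a factor of about four.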

9

Comparability (methodology)

Comparison by intension
  definition of a phenomenon
  browsing grammars
Comparison by extension
  linguistic structures used for expressing a contextualized situation
  truly functional

Wälchli 2007

10

Extensional typology in parallel corpora

The data we work with may be linguistically different but semantically identical
  cf. the much looser identity in elicited texts
rather, they are “defined as a selection of places in the parallel texts”
they may reflect linguistic variation at points where one language uses the same construction and another language uses several
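One way to picture comparison by extension is as a table of aligned places, each annotated with the construction a language uses there. The sketch below assumes such annotations exist; the place IDs, language names and construction labels are all invented for illustration.

```python
from collections import Counter

# Hypothetical annotations: for each aligned place (e.g. a verse or sentence ID),
# the construction label each language uses there. All labels are invented.
places = {
    "place_1": {"lang_A": "cx1", "lang_B": "cx1"},
    "place_2": {"lang_A": "cx1", "lang_B": "cx2"},
    "place_3": {"lang_A": "cx1", "lang_B": "cx3"},
}


def construction_profile(places, language):
    """Count how often each construction occurs in the selected places."""
    return Counter(row[language] for row in places.values() if language in row)


profile_a = construction_profile(places, "lang_A")
profile_b = construction_profile(places, "lang_B")
print(len(profile_a), len(profile_b))  # prints "1 3"
```

Here `lang_A` covers the whole selection with one construction while `lang_B` splits it three ways, which is the kind of variation the slide refers to.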

11

Parallel corpora support conventional typology

Newmeyer against Stassen
  Classical Greek, Latin and Tibetan have the ‘exceed’ type comparative, contra Stassen 1985
Wälchli supports Stassen
  A study of parallel corpora shows not the ‘exceed’ but the ‘separative’ construction
Parallel corpora reflect dominant patterns, exactly where the typology’s primary interests lie
But they also numerically reflect variation or competition between dominant patterns, rather than provide a yes-or-no typology

12

Case studies, among others:

Wälchli 2005: co-compounds
Auwera et al. 2004: epistemic possibility in Slavic
Wälchli 2006: ‘again’
Wälchli 2001: motion events
Wälchli & Zúñiga 2006: motion events ‘again’
Stolz 2004: total reduplication
Stolz et al. 2005: comitatives and instrumentals
Stolz et al.: absolute possessives

13

Stolz 2003, 2004 Le Petit Prince - quantitative

‘avec’-cline

Total-reduplication-cline

Does this require parallel corpora?

14

Stolz 2003, 2004 Le Petit Prince – qualitative?

Puis il s'épongea le front avec un mouchoir à carreaux rouges.
Then he mopped his forehead with a handkerchief decorated with red squares.
Zatim obriše čelo rupčićem s crvenim kvadratima.

Wells with a rusty pulley – ornative or a separate category?

15

Pitfalls: data analysis

Easier than raw texts
  we know what was intended and where to look
  still, like any grammatical analysis by a non-expert, subject to mistakes
Alignment issues
Anyway, the same as or easier than with elicited texts

Wälchli 2007

16

Pitfalls: sample bias

Europe overrepresented, convenience sampling:
  Europe > IE > other families
In his study of comitatives, Stolz ended up with an areal study rather than a sample-based one

17

Pitfalls: style/variant choice

Standard language bias
  Better to include texts reporting speech
‘Hagiolect’ effects
  ‘The sinners will-Evid not enter the heaven’
Style incomparability
  Bible translations are stylistically diverse
Purism

Wälchli 2007

18

Pitfalls: translation bias

“Incommensurability” of linguistic structures: some languages think differently…
  Australian languages prefer absolute over relative frame of reference
  In Australian Gospels, occurrences of the absolute frame of reference (AFR) are found but are significantly less frequent than in natural discourse from this area
“Inert” construction – a construction that tends to be imported from the source language

Wälchli 2007

19

Case study: MVC in ‘bring’ and ‘run’ events

Bible-based, Bernhard Wälchli

Multi-verb construction: clauses that contain more than one lexical verb

BRING and RUN events may be described as MVC or “solitarizing” verbs

20

BRING and RUN events (Wälchli)

Examples:

Minnin ti-bouay la ban mouin. (Haitian Creole)
lead little-boy def give I

Ač-i-ne Man pat-ăm-a il-se kil-ĕr. (Chuvash)

child-ps3-dat/acc I.gen to-poss1sg-dat take-conv come-imp2pl

‘… bring him unto me.’ (solitarizing)

Data usually unavailable from grammars…

21

BRING and RUN events (Wälchli)

Is there any correlation between the choice of either construction for encoding the two events?

22

BRING and RUN events (Wälchli)

            BRING Solit                 BRING MVC
RUN Solit   Dinka, Navajo, Russian      Ainu, Ewe, Khasi
RUN MVC     English, Guarani, Maltese   Choctaw, Chuvash, Khoekhoe

23

BRING and RUN events (Wälchli)

165 languages (Eurasia over-represented)
18 BRING events, six RUN events
Correlation between MVC in BRING and RUN is highly significant (Fisher’s test)

            BRING Solit   BRING MVC
RUN Solit   65            12
RUN MVC     46            42
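The significance claim can be checked against the four counts on this slide. The sketch below is a plain standard-library implementation of a two-sided Fisher exact test, not necessarily the procedure Wälchli used; the function name is made up here, and the "sum all tables no more likely than the observed one" rule is one common convention for the two-sided p-value.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x, given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Small tolerance so that ties with the observed table are included.
    p_value = sum(q for q in map(prob, range(lo, hi + 1)) if q <= p_obs * (1 + 1e-9))
    return min(p_value, 1.0)


# Rows: RUN (Solit, MVC); columns: BRING (Solit, MVC), as on the slide.
p = fisher_exact_two_sided(65, 12, 46, 42)
odds_ratio = (65 * 42) / (12 * 46)
print(f"odds ratio = {odds_ratio:.2f}, p = {p:.2g}")
```

An odds ratio well above 1 together with a very small p-value would be in line with the slide's claim that languages using an MVC for one event type tend to use it for the other.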

24

BRING and RUN events (Wälchli)

Is a language consistently MVC vs. solitarizing? Surely not – then, is this a typological parameter at all?

25

BRING and RUN events (Wälchli)

But: the distribution is bimodal

26

BRING and RUN events (Wälchli)

If we only consider LOW and HIGH, fewer (14) languages are inconsistent
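A rough sketch of how the LOW/HIGH comparison could be operationalized, assuming each language is scored by the share of MVC encodings among the 18 BRING events and the six RUN events. The cut-off values, function names and example counts are all invented for illustration and are not taken from Wälchli's study.

```python
def level(mvc_count, n_events, low=1 / 3, high=2 / 3):
    """Bin the share of MVC encodings; both cut-offs are invented for illustration."""
    share = mvc_count / n_events
    if share <= low:
        return "LOW"
    if share >= high:
        return "HIGH"
    return "MID"


def is_inconsistent(bring_mvc, run_mvc, n_bring=18, n_run=6):
    """True when one event type is LOW and the other is HIGH."""
    return {level(bring_mvc, n_bring), level(run_mvc, n_run)} == {"LOW", "HIGH"}


# Invented counts for three hypothetical languages (not Wälchli's data).
print(is_inconsistent(2, 5), is_inconsistent(16, 6), is_inconsistent(9, 1))
# prints "True False False"
```

Under this reading, languages falling in the middle band on either event type are simply left out of the consistency count.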

27

Case study: demonstratives

Harry Potter-based, Federica da Milano 2007

Distance-oriented systems: this near – that far

Person-oriented systems: this with us – that far from us

Is this a real distinction, or are these two subtypes of something more general?

28

Demonstratives (da Milano)

48 stimuli (da Milano 2005)
  Also include reciprocal orientation of the interlocutors: face to face, face to back, side by side

83 occurrences of deictic demonstratives in “… and the Chamber of Secrets”

29

Demonstratives (da Milano)

‘Tie that round the bars,’ said Fred, throwing the end of a rope to Harry.

‘Przywiąż to do kraty’, powiedział Fred, rzucając Harry’emu koniec liny.

30

Demonstratives (da Milano)

One term systems:

French – cela, ça (ceci not used)
German – der/die/das (dieser, jener not used)

31

Demonstratives (da Milano)

Two term systems:

Unmarked vs. proximal – Scandinavian, English, Northern Italian

Unmarked vs. distal – Polish, Russian, Czech, Hungarian, Modern Greek

Dyad oriented - Catalan

32

Demonstratives (da Milano)

Three term systems: proximal, medial, distal

Dual-anchored – medial (close to addressee or medium distance)
  Spanish (este~ese~aquel)
  Basque (hau~hori~hura)
Addressee-anchored – medial is close to addressee only – not verified on Harry Potter
  Portuguese (este~esse~aquele)
  Also Sardinian and Tuscan
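The classification on the last three slides can be held in a small lookup table, which makes it easy to group or count languages by system type. This is only an illustrative sketch covering a subset of the languages listed; the dictionary layout, the label "single term" and the function name are assumptions made here, not da Milano's own coding.

```python
from collections import defaultdict

# (number of terms, subtype) per language, transcribed from the slides above.
# The subtype label "single term" is a placeholder introduced for this sketch.
SYSTEMS = {
    "French": (1, "single term"),
    "German": (1, "single term"),
    "English": (2, "unmarked vs. proximal"),
    "Polish": (2, "unmarked vs. distal"),
    "Russian": (2, "unmarked vs. distal"),
    "Catalan": (2, "dyad-oriented"),
    "Spanish": (3, "dual-anchored"),
    "Basque": (3, "dual-anchored"),
    "Portuguese": (3, "addressee-anchored"),
}


def group_by_terms(systems):
    """Group language names by the number of demonstrative terms."""
    groups = defaultdict(list)
    for language, (n_terms, _subtype) in systems.items():
        groups[n_terms].append(language)
    return {n: sorted(langs) for n, langs in groups.items()}


groups = group_by_terms(SYSTEMS)
print(groups[3])  # ['Basque', 'Portuguese', 'Spanish']
```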

33

Demonstratives (da Milano)

da Milano then proceeds to build a similar typology for adverbs; her conclusions are as follows:

The map of adverbs is by and large isomorphic to the map of pronouns

Levinson’s (2004) suggestion is confirmed: “perhaps one can hazard the generalizations that speaker-centered degrees of distance are usually (more) fully represented in the adverbs than the pronominals”

“It has turned out to be fruitful to use parallel texts as a control test of data obtained through the questionnaire. The results from the parallel texts mainly confirmed the prior typological generalizations.”

34

‘Free’ corpora!

No translations – no risk of inert categories, closer to naturalistic discourse
Massive amounts of text
  Usually literary
Vast playground for quantitative analysis

35

‘Free’ corpora!

Examples:
Combinatorial statistics for property words
  Lexical typology by LexTyp
Comparative occurrences
  May be useful – cf. temperature domain

38

Comparison: texts in typology

Free corpora:
  No ‘meaning’ identity
  Massive collections: almost all kinds of phenomena
  But a shift towards intensional typology
  Natural discourse

Elicited texts:
  Weak ‘meaning’ identity
  Massive effort for transcription, poor collections
  Only frequent phenomena
  Natural discourse (with provisos)

Parallel corpora:
  Strong ‘meaning’ identity
  Natural written discourse (with provisos)

39

Summary (obvious):

Corpora have their limitations and cannot replace conventional methods, but they can go hand in hand with them
