
www.linguistik.fau.de | www.stefan-evert.de

Making sense of multivariate analyses of linguistic variation Stefan Evert

COMPUTATIONAL CORPUS LINGUISTICS GROUP | PROFESSUR FÜR KORPUSLINGUISTIK

Multidimensional analysis (Biber 1988)
§ 481 texts, 67 lexico-grammatical features
§ unsupervised FA
§ validation: separation of “known” genre categories

Problems
§ choice of features & texts
§ interpretation of FA weights

Biber, Douglas (1988). Variation Across Speech and Writing. Cambridge University Press, Cambridge.

Diwersy, Sascha; Evert, Stefan; Neumann, Stella (2014). A weakly supervised multivariate approach to the study of language variation. In: Aggregating Dialectology, Typology, and Register Analysis. Linguistic Variation in Text and Speech, pages 174–204. De Gruyter, Berlin, Boston.

Evert, Stefan & Neumann, Stella (2017). The impact of translation direction on characteristics of translated texts. A multivariate analysis for English and German. In: Empirical Translation Studies. New Theoretical and Methodological Traditions, TiLSM 300, pages 47–80. Mouton de Gruyter, Berlin.

Evert, Stefan; Proisl, Thomas; Jannidis, Fotis; Pielström, Steffen; Schöch, Christof; Vitt, Thorsten (2015). Towards a better understanding of Burrows's Delta in literary authorship attribution. In: Proceedings of the Fourth Workshop on Computational Linguistics for Literature, pages 79–88, Denver, CO.

Evert, Stefan; Proisl, Thomas; Jannidis, Fotis; Reger, Isabella; Pielström, Steffen; Schöch, Christof; Vitt, Thorsten (2017). Understanding and explaining Delta measures for authorship attribution. Digital Scholarship in the Humanities. Advance access https://doi.org/10.1093/llc/fqw046.

Case study II: Evidence for shining-through in translations
minimally supervised PCA (linear discriminant analysis)
§ 298 texts from CroCo corpus (78× EN➞DE, 71× DE➞EN)
§ 27 features grounded in SFL
§ LDA for DE vs. EN originals
§ position of translations ➞ evidence for shining-through

Problems
§ interpretation of LDA weights
§ are weights stable or do they depend on choice of texts?
§ is our selection of features crucial to the results?

Case study I: Authorship attribution with Burrows’s Delta
unsupervised clustering
§ 25 authors × 3 novels for EN, DE, FR
§ 200 – 5000 features
§ Ward clustering / PAM

Problems
§ only 75 texts
§ how & why does ΔB work so well?

\[ \Delta_B(D_1, D_2) = \sum_{i=1}^{n_w} \left| z_i(D_1) - z_i(D_2) \right| \]
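The distance ΔB defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the relative frequencies of the n_w most frequent words are taken as given, the function names and toy numbers are hypothetical, and the use of the sample standard deviation for the z-scores is an assumption not stated on the poster.

```python
from statistics import mean, stdev

def z_scores(rel_freqs):
    """Standardize each word's relative frequency across all texts.

    rel_freqs maps a text label to a list of relative frequencies,
    one per word, in the same word order for every text.
    """
    texts = list(rel_freqs)
    n_w = len(rel_freqs[texts[0]])
    columns = [[rel_freqs[t][i] for t in texts] for i in range(n_w)]
    mus = [mean(col) for col in columns]
    sigmas = [stdev(col) for col in columns]  # assumption: sample s.d.
    return {
        t: [
            (rel_freqs[t][i] - mus[i]) / sigmas[i] if sigmas[i] > 0 else 0.0
            for i in range(n_w)
        ]
        for t in texts
    }

def burrows_delta(z1, z2):
    """Sum of absolute z-score differences (Manhattan / L1 distance)."""
    return sum(abs(a - b) for a, b in zip(z1, z2))

# Hypothetical toy data: three texts, two words.
toy = {"a": [0.1, 0.2], "b": [0.2, 0.2], "c": [0.3, 0.5]}
z = z_scores(toy)
delta_ab = burrows_delta(z["a"], z["b"])
delta_ac = burrows_delta(z["a"], z["c"])
```

In this toy example texts a and b differ only in the first word, so delta_ab is smaller than delta_ac. A matrix of such pairwise distances is what a clustering method such as Ward or PAM would take as input.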

[Figure: dendrogram, Ward clustering (English, z-scores, BD, n=1000); height axis 0 to 2500; leaves are the 75 novels, labelled "author: title" (e.g. thackeray: virginians)]

[Figure: English Corpus | L2 normalization | PAM clustering. x-axis: number of mfw (10 to 10000); y-axis: adjusted Rand index (%), 0 to 100; legend: Cosine Delta, L1 2−Delta, Burrows (L1) Delta, Quadratic (L2) Delta, L4−Delta]

Evert et al. (2015, 2017)

[Figure: distribution of each of the 27 features. y-axis: z-score = standardized relative frequency (−5 to 5); x-axis: nn / T, adja / T, nominal / T, finites / S, past / F, passive / V, modals / V, imperatives / S, interrogatives / S, coordination / T, subordination / T, pronouns / T, place adv / T, time adv / T, adv theme / TH, text theme / TH, obj theme / TH, verb theme / TH, subj theme / TH, prep / T, modal adv / T, contractions / T, colloquialism / T, titles / T, lexical density, lexical TTR, token / S]

[Figure: density of discriminant scores. x-axis: discriminant score (−4 to 4); y-axis: density (0.0 to 0.8); legend: DE: orig, DE: trans, EN: orig, EN: trans]

www.stefan-evert.de/PUB/EvertNeumann2017/
Diwersy et al. (2014); Evert & Neumann (2017)

[Figure legend: DE, EN; orig, trans]

[Figure: standardized z-scores | L2 normalization (−0.3 to 0.3); C. Brontë: Jane Eyre, C. Brontë: Shirley]

Interpretation of dimension weights
§ standard approach based on magnitude and sign of weights (EN on positive side of axis)
§ interprets features as correlated rather than complementary
§ better approach: what does each feature contribute to the LDA positions of texts?
§ reveals entirely different patterns
§ correlated features help LDA to reduce within-group variance
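The "better approach" in the list above can be illustrated with a small Python sketch. It assumes that the position of a text on the discriminant axis is a plain linear projection, i.e. the sum over features of weight times standardized feature value, so that each summand is that feature's contribution. The function names, weights and feature values are hypothetical toy numbers, not results from the study.

```python
def contributions(weights, z):
    """Per-feature contributions w_j * z_j of one text to its axis position."""
    return [w * x for w, x in zip(weights, z)]

def position(weights, z):
    """Axis position of one text = sum of its per-feature contributions."""
    return sum(contributions(weights, z))

def mean_contributions(weights, texts):
    """Average contribution of each feature over a group of texts."""
    per_text = [contributions(weights, z) for z in texts]
    return [sum(col) / len(per_text) for col in zip(*per_text)]

# Hypothetical toy data: 3 features, 2 texts per group.
weights = [0.2, -0.1, 0.3]
de_texts = [[-1.0, 0.5, 0.0], [-2.0, 1.5, 0.0]]
en_texts = [[1.0, -0.5, 2.0], [2.0, -1.5, 2.0]]

de_mean = mean_contributions(weights, de_texts)
en_mean = mean_contributions(weights, en_texts)
```

In this toy setup the third feature has the largest weight but contributes nothing to the DE positions, because its standardized value is zero in both DE texts. Weight magnitude alone therefore does not show which features place a group on the axis.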

[Figure: EN / DE discriminant. y-axis: normalized feature weights (−0.2 to 0.2); x-axis: the 27 features (nn_T, adja_T, nominal_T, finites_S, past_F, passive_V, modals_V, imperatives_S, interrogatives_S, coordination_T, subordination_T, pronouns_T, place.adv_T, time.adv_T, adv.theme_TH, text.theme_TH, obj.theme_TH, verb.theme_TH, subj.theme_TH, prep_T, modal.adv_T, contractions_T, colloquialism_T, titles_T, lexical.density, lexical.TTR, token_S)]

[Figure: weight (−0.2 to 0.2) per feature, and contribution to axis scores (−1 to 2) per feature by group (DE, EN), for the 27 features. Feature labels carrying a (−) prefix: adja, finites, past, passive, modals, imperatives, interrogatives, coordination, pronouns, obj theme, modal adv]

DE / EN discriminant (original texts)

What are the characteristic words?
§ supervised recursive feature elimination ➞ 233 words as features
§ not just mfw, but none unique to one author
§ with, so, t, But, And, upon, don, head, Then, looking, almost, indeed, nor, …, XXXVII (df=34), XLI (df=29), XLIII (df=26), hereabout (df=11), vilest (df=15), contours (df=9), Ecod (df=4), …
§ validation for DE: new novels from same authors: 97% accuracy

Work in progress
§ contribution of features to silhouette width of clustering
§ assess relevance to each author
§ identify features responsible for misclassifications
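For orientation, the silhouette width referred to above can be computed from a pairwise distance matrix as in the following Python sketch. It uses the usual definition s_i = (b_i − a_i) / max(a_i, b_i), where a_i is the mean distance of text i to the other members of its own cluster and b_i is the smallest mean distance to the members of any other cluster; singleton clusters get s_i = 0. The function name and the toy data are hypothetical, and at least two clusters are assumed.

```python
def silhouette_widths(dist, labels):
    """Silhouette width s_i for every item.

    dist is a symmetric matrix (list of lists) of pairwise distances,
    labels[i] is the cluster label of item i. Needs >= 2 clusters.
    """
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)

    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            widths.append(0.0)  # singleton cluster: s_i = 0 by convention
            continue
        # a: mean distance to the other members of the own cluster
        a = sum(dist[i][j] for j in own) / len(own)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths

# Hypothetical toy data: four points on a line, two tight clusters.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]
widths = silhouette_widths(dist, ["A", "A", "B", "B"])
```

With Delta distances between novels as `dist` and author clusters as `labels`, the per-cluster averages of these values correspond to the kind of summary shown in a silhouette plot.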

[Figure: axis label "document frequency (# novels)"]

[Figure: Silhouette widths (English, z-scores, BD, n=1000, Ward). x-axis: silhouette width s_i (0.0 to 1.0); rows are the 75 novels, labelled "author: title"; n = 75, 25 clusters C_j]

j : n_j | ave_{i∈C_j} s_i
1 : 3 | 0.33
2 : 3 | 0.11
3 : 3 | 0.51
4 : 3 | 0.33
5 : 3 | 0.24
6 : 3 | 0.24
7 : 3 | 0.06
8 : 3 | 0.16
9 : 6 | 0.07
10 : 3 | 0.17
11 : 3 | 0.28
12 : 4 | 0.04
13 : 3 | 0.06
14 : 3 | 0.40
15 : 3 | 0.10
16 : 2 | 0.08
17 : 3 | 0.16
18 : 3 | 0.11
19 : 3 | 0.19
20 : 3 | 0.18
21 : 3 | 0.22
22 : 3 | 0.18
23 : 3 | 0.18
24 : 2 | 0.16
25 : 1 | 0.00

Reliability of the clustering
§ bootstrapping texts not applicable to clustering & high-dimensional feature space
§ bootstrapping features ➞ unclear
§ biggest factor: choice of authors (empirical study on Gutenberg archive)

Bootstrapping latent dimensions
§ bootstrapping / cross-validation can be used to assess stability of LDA & PCA dimensions (applicable because of small # of features)
§ LDA axis “wobbles” by approx. 10° across folds
§ moderate variability of feature weights: σ < 0.05
§ but positions of texts on LDA axis are stable (r = .987)
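The two stability measures named above (the angle between axes and the correlation of text positions) can be sketched in Python as follows. This is an illustration only: it assumes each fold yields one weight vector with a fixed orientation (e.g. EN on the positive side), and that text positions are linear projections onto that vector. All names and numbers are hypothetical.

```python
import math

def angle_degrees(u, v):
    """Angle between two weight vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / norms))  # guard against rounding
    return math.degrees(math.acos(cosine))

def pearson_r(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def project(weights, texts):
    """Positions of texts on the axis defined by a weight vector."""
    return [sum(w * z for w, z in zip(weights, t)) for t in texts]

# Hypothetical toy data: axes from two folds and four texts (2 features).
axis_a = [1.0, 0.0]
axis_b = [1.0, 1.0]
texts = [[-2.0, 0.1], [-1.0, -0.2], [1.0, 0.3], [2.0, -0.1]]

wobble = angle_degrees(axis_a, axis_b)
r = pearson_r(project(axis_a, texts), project(axis_b, texts))
```

In this toy case the texts vary mostly along the first feature, so their positions on the two axes stay highly correlated even though the axes themselves differ noticeably in direction.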
