Natural Language Processing in Digital Humanities: application examples Presentation at IXA, January 2016 Pablo Ruiz Fabo LATTICE Lab


Natural Language Processing in Digital Humanities: application examples

Presentation at IXA, January 2016

Pablo Ruiz Fabo — LATTICE Lab


Summary

• Digital Humanities’ needs in terms of text analysis tools
• Two examples of pipelines to address these needs:
  – Entity Linking and the PoliInformatics corpus
  – Proposition Extraction and the Earth Negotiations Bulletin Corpus
• Implementation choices, demo, evaluation


Digital Humanities (DH)

• Application of computational methods to questions in fields like Humanities or Social Sciences.
  – A goal is to allow for asking questions and attaining findings that would be impossible without computational means (Berry, 2012)
  – Critical reflection: “as much of a focus on what the computational techniques obscure as reveal” (Meeks and Weingart, 2012)


DH and Text: Topic Models

• Very popular tool in DH: LDA (MALLET) (discussion in Meeks and Weingart, 2012)

[Figure: Europarl Corpus topics for all speeches by the French Green Party (Les Verts), 1999-2004 session]

DH and Text …

• But there are many other Text Analysis / Language Technology / Natural Language Processing applications potentially useful for DH questions …


DH methods: Network analysis

• Transforming a collection of texts into a network (a graph)
• Network’s nodes:
  – Actors in the corpus: People, Institutions …
  – Concepts in the corpus


• Relevant NLP tasks: Named Entity Recognition / Disambiguation; Entity and Concept Linking
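As a minimal sketch of the text-to-network step: it assumes entity linking has already produced a set of entity identifiers per document, and it assumes (the slides do not specify the edge definition) that an edge joins two entities mentioned in the same document, weighted by the number of such documents. The function name and the sample identifiers are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(doc_entities):
    """Build weighted edges from per-document entity sets.

    doc_entities: iterable of sets of entity identifiers, one set per document.
    Returns a Counter mapping (entity_a, entity_b), sorted within the pair,
    to the number of documents that mention both entities.
    """
    edges = Counter()
    for entities in doc_entities:
        for a, b in combinations(sorted(entities), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical entity sets for three documents
docs = [
    {"Federal_Reserve", "Lehman_Brothers"},
    {"Federal_Reserve", "Lehman_Brothers", "Subprime_lending"},
    {"Subprime_lending"},
]
network = cooccurrence_network(docs)
```

Nodes are the entity identifiers; the edge weights could then be used to filter weak links before visualizing the graph.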


Text => Network via EL: User Needs

Venturini and Guido (2012), Once Upon a Text [médialab at Sciences Po, Paris]:

The careful use of natural language processing algorithms could provide better filtering metrics and support in expression merging

The manual filtering is crucial because it allows entities to be reduced to a set size appropriate for analysis, but also recovering important entities that could have been excluded by the automatic filtering.


Though AlchemyAPI offers a trustworthy service, we don’t like relying on it. In particular, we don’t like that the service is offered as a “black box” and that the exact extraction algorithm is secret.

Venturini and Guido (2012). Once upon a text: An Actor-Network Theory tale in text analytics.
https://github.com/medialab/ANTA


Entity Linking in DH: User Needs

• Variety of corpora needs to be treated
• EL literature shows that tools’ performance varies according to the corpus:

TOOL / CORPUS    AIDA/CoNLL (news, sports)   IITB (web, various topics)
                 P      R      F1            P      R      F1
Spotlight        31.2   40.4   35.2          46.2   50.0   48.0
TagMe            61.4   55.5   58.3          45.2   42.0   43.6
WikipediaMiner   46.9   52.8   49.7          56.8   48.2   43.6
AIDA             63.3   29.1   39.8          65.7    4.1    7.6

Data from Cornolti et al. (2013). Metric: Weak Annotation Match


• Correlations between number of occurrences of a textual feature and tool performance (Usbeck et al. 2015)

CORRELATIONS     Nbr. PER   Nbr. ORG   Nbr. LOC   Nbr. OTHER
Babelfy           0.769     -0.376      0.254     -0.431
Spotlight         0.217     -0.480     -0.461      0.26
TagMe             0.257     -0.272     -0.194      0.036
WikipediaMiner    0.082     -0.679     -0.632      0.497

Data from 20 Nov 2015, GERBIL platform (gerbil.aksw.org/gerbil/overview), A2KB/Ma task


Entity Linking approach given User needs

NEED → APPROACH

• Avoiding “black boxes” → Open source tools
• Treating a variety of corpora, knowing that tools’ performance varies with the corpora → Combining tools to get complementary results
• Manual filtering of entities; information to guide the filtering →
  – Providing annotation quality metrics to users
  – Simultaneous access to metrics and text to validate annotations
  – Optional automatic annotation selection


Open Source Tool Combination

• Open-source / public-domain tools which link to generic ontologies (DBpedia, YAGO, BabelNet)

[P.S.: OK, maybe Babelfy is not exactly open source … we could perhaps use AGDISTIS: https://github.com/AKSW/AGDISTIS]

[Figure: tool logos labeled with the years 2010, 2011, 2008, 2011, 2014]

Metrics to guide filtering?

• Confidence scores

• Coherence scores


EL: Annotation confidence

Example: SOCCER – JAPAN GET LUCKY WIN, CHINA IN SURPRISE DEFEAT

[Figure label: CONFIDENCE]


EL: Annotation coherence

• Wikipedia Link Based Measure: relies on common inbound links
• Milne & Witten (2008) [original proposal]
• Ferragina et al. (2010) [optimizations]
• Hoffart et al. (2011) [among other measures]
• Moro et al. (2014) [other measures]

Milne-Witten coherence between entities e1 and e2 (as in Hoffart et al. 2011)
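The formula itself did not survive in the text, so here is a hedged sketch of the Milne-Witten measure as it is commonly stated: coherence is 1 minus a normalized log-ratio of inbound-link set sizes, clipped at 0. The function name, the argument names and the toy link sets are assumptions for illustration, not the demo's code.

```python
import math


def milne_witten_coherence(in_e1, in_e2, n_entities):
    """Link-based coherence between two entities.

    in_e1, in_e2: sets of pages that link to e1 and e2 (inbound links).
    n_entities:   total number of entities (pages) in the knowledge base.

    coh = 1 - (log(max(|A|, |B|)) - log(|A & B|))
              / (log(N) - log(min(|A|, |B|)))
    Values below 0 are clipped to 0; no shared inbound link gives 0.
    """
    common = in_e1 & in_e2
    if not common:
        return 0.0
    larger = max(len(in_e1), len(in_e2))
    smaller = min(len(in_e1), len(in_e2))
    denominator = math.log(n_entities) - math.log(smaller)
    if denominator <= 0:
        return 0.0  # degenerate case: the link set covers the whole knowledge base
    score = 1.0 - (math.log(larger) - math.log(len(common))) / denominator
    return max(score, 0.0)


# Toy example: two entities sharing two of their four inbound links
score = milne_witten_coherence({1, 2, 3, 4}, {3, 4, 5, 6}, n_entities=100)
```

Entities with identical inbound-link sets score 1.0, and entities with no common inbound link score 0.0.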

Entity Linking Demo

Demo: PoliInformatics Corpus

• NLP Unshared Task in PoliInformatics 2014
  – Who was the financial crisis?
    • Participants: individuals, industries, …
  – What was the financial crisis?
    • Causes, proposals for reform, …
• Heterogeneous corpus containing
  – Congress Hearings transcripts
  – Official reports on the crisis by Congress
  – Bills
  – Etc.


EL Demo: Corpus

• Congress Hearings: interviews with witnesses
• Official report by Congress about the causes of the crisis


Entity Linking Demo

[Figure: tool logos labeled with the years 2010, 2011, 2008, 2011, 2013, grouped into “results displayed on demo” and “not displayed”]

Description: Ruiz, Poibeau & Mélanie (Demo at NAACL 2015).


Evaluation

• … as an NLP-related system

• … in terms of Digital Humanities


Evaluation as an NLP system

• Does a selection among the annotations provided by several systems outperform each of those systems’ annotations taken individually?
• Combination method: ROVER. Each system is weighted by its precision on a test corpus
  – Fiscus, 1997 for ASR
  – De la Clergerie et al. 2008, for parsing
  – Ruiz and Poibeau, 2015 (*SEM poster)
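A simplified weighted vote in the spirit of ROVER can be sketched as follows. This is not the authors' implementation: it assumes each system outputs one entity per mention span, that spans from different systems are already aligned, and that the weights are precision values measured beforehand. The system names, weights and spans are hypothetical.

```python
from collections import defaultdict


def weighted_vote(system_outputs, weights):
    """Pick, for each mention span, the entity with the highest summed weight.

    system_outputs: {system_name: {mention_span: entity}}
    weights:        {system_name: precision on a held-out corpus}
    """
    votes = defaultdict(lambda: defaultdict(float))
    for system, annotations in system_outputs.items():
        for span, entity in annotations.items():
            votes[span][entity] += weights[system]
    return {span: max(cands, key=cands.get) for span, cands in votes.items()}


# Hypothetical outputs for the mention at characters 9-14 of a headline
outputs = {
    "system_a": {(9, 14): "Japan_national_football_team"},
    "system_b": {(9, 14): "Japan"},
    "system_c": {(9, 14): "Japan"},
}
weights = {"system_a": 0.60, "system_b": 0.30, "system_c": 0.45}
selected = weighted_vote(outputs, weights)
```

Here the two lower-precision systems together outweigh the single higher-precision one (0.75 against 0.60), which is the kind of complementarity that combining tools is meant to exploit.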


ROVER: system weights

• Assume two application scenarios:
  – User wants Entities only
  – User wants Entities and Concepts
• For Entities only:
  – Weight according to Precision on AIDA/CoNLL B corpus (no concepts in gold)
• For Entities and Concepts:
  – Weight according to Precision on IITB corpus (many concepts in gold)

Corpora: Cornolti et al. 2013, BAT Framework https://github.com/marcocor/bat-framework/tree/master/benchmark/datasets


ROVER: testing

• Entities only was tested on the MSNBC corpus (reference set has no concepts)
• Entities and Concepts was tested on the AQUAINT corpus (reference set has concepts)


Evaluation as an NLP system

[Figure: evaluation results. Metric: Micro-averaged Strong Annotation Match; t = tool’s optimal confidence threshold; *p<0.05 (random permutation)]


Evaluation for DH

• Possible researcher objectives:
  – Finding evidence about a research question
  – Doing that faster or with less manual work
  – Obtaining networks that confirm (or challenge) previously available knowledge
  – Obtaining quantitative evidence where only qualitative evidence was available
• Do the following elements help?
  – The UI and workflows it allows
    • The confidence and coherence scores
    • The automatic annotation selection


Proposition Extraction


Prop Ext: Corpus

• Daily reports on international climate conferences (Conference of the Parties or COP), like COP-21, which took place in December 2015.
• Summary of participant countries’ proposals.


Pipeline Objectives

• Identify different countries’ participations
• Identify negotiation points supported or opposed by participants
• Help researchers compare countries’ positions via keyphrase and entity extraction on the negotiation points
• Provide more detailed analysis than prior work on this corpus, which was based on word co-occurrence methods (Venturini et al., 2014)


Typical corpus sentence

The EU, with NEW ZEALAND and opposed by CHINA, MALAYSIA and BHUTAN, supported including the promotion of natural regeneration within the definitions of "afforestation" and "reforestation."

Actors (Countries): The EU, NEW ZEALAND, CHINA, MALAYSIA, BHUTAN

Messages (negotiation points): including the promotion of natural regeneration within the definitions of "afforestation" and "reforestation."

Predicates (support/opposition): supported; opposed by


Actor + Predicate + Message = Proposition

Propositions

   ACTORS           VERBAL PREDICATES
1  European_Union   supported
2  New_Zealand      supported
3  China            ~supported
4  Malaysia         ~supported
5  Bhutan           ~supported

MESSAGES (shared by rows 1-5): including the promotion of natural regeneration within the definitions of "afforestation" and "reforestation."

   ACTORS                NOMINAL PREDICATES
1  Group_of_77 / China   proposal

MESSAGES: to include research and development in the transport and energy sectors in the priority areas to be financed by the SCCF.
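The proposition table above can be pictured as a small record type. This is an illustrative sketch only: the class name, field names and the `predicate_type` field are assumptions, while the `~` prefix for a negated predicate follows the table's own notation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposition:
    actor: str                      # normalized actor name, e.g. "China"
    predicate: str                  # "supported", or "~supported" when negated
    message: str                    # the negotiation point
    predicate_type: str = "verbal"  # "verbal" or "nominal"


MESSAGE = ('including the promotion of natural regeneration within '
           'the definitions of "afforestation" and "reforestation."')

supporters = ["European_Union", "New_Zealand"]
opponents = ["China", "Malaysia", "Bhutan"]

# One proposition per actor, all sharing the same message
propositions = (
    [Proposition(actor, "supported", MESSAGE) for actor in supporters]
    + [Proposition(actor, "~supported", MESSAGE) for actor in opponents]
)
```

Keeping one record per actor makes it straightforward to group propositions by message later and compare who supports or opposes each point.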


Sources of proposition info?

• Open Relation Extraction
  – OLLIE (Mausam et al., 2012 EMNLP)
    • https://github.com/knowitall/ollie
  – Open Information Extraction 4.0
    • https://github.com/knowitall/openie
• Traditional sources
  – Syntactic dependency parsing
  – Semantic Role Labeling
    (CoNLL 08/09 for both)
    • IXA Pipes wrapper for MATE-tools


Can we do this with patterns?

[ The EU, with NEW ZEALAND ] and [ opposed by CHINA, MALAYSIA and BHUTAN ], [ supported including the promotion of natural regeneration within the definitions of "afforestation" and "reforestation." ]

Cf. Salway et al., 2014, ACL, Grammar Induction approach exploiting the ADIOS algorithm from Solan et al. 2005, PNAS


• Maybe, but …
  – What about anaphora resolution?
  – What about negation?
• An NLP pipeline deals with these phenomena in a uniform way (unlike “linguistically-agnostic” patterns)
• IXA Pipes provides SRL info, coreference chains (and syntactic dependencies in case needed)

Using SRL info

• We have propositions (events) involving:
  – speakers: generally in the predicate’s A0 role
    • List of countries and other actors created manually from specialized sources (UNFCCC site)
  – reporting verbs/reporting-related nouns
    • Created a list based on VerbNet and NomBank (using NLTK interface).
  – messages communicated by the speakers: generally the predicate’s A1 role (complemented with adjunct roles …)
• What about “opposed by …”? (below)


Using SRL Info: Generic rule
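A minimal sketch of such a rule, based only on the A0/A1 description given earlier: if the predicate is in the reporting list, each known actor found in the A0 span yields one proposition whose message is the A1 span. The frame dictionary format, the actor table and the predicate list are hypothetical stand-ins; they are not the IXA Pipes output format or the lists actually used.

```python
import re

# Hypothetical actor table: surface form -> normalized name
ACTORS = {"EU": "European_Union", "NEW ZEALAND": "New_Zealand", "CHINA": "China"}
# Hypothetical list of reporting predicates
REPORTING_PREDICATES = {"supported", "proposed", "opposed"}


def propositions_from_frame(frame, actors=ACTORS, predicates=REPORTING_PREDICATES):
    """frame: {"predicate": str, "roles": {"A0": str, "A1": str, ...}}
    Returns a list of (actor, predicate, message) triples."""
    if frame["predicate"] not in predicates:
        return []
    a0 = frame["roles"].get("A0", "")
    message = frame["roles"].get("A1")
    if not message:
        return []
    # Whole-word match of each known actor's surface form inside the A0 span
    found = [name for surface, name in actors.items()
             if re.search(r"\b" + re.escape(surface) + r"\b", a0)]
    return [(actor, frame["predicate"], message) for actor in found]


frame = {
    "predicate": "supported",
    "roles": {
        "A0": "The EU, with NEW ZEALAND",
        "A1": "including the promotion of natural regeneration",
    },
}
triples = propositions_from_frame(frame)
```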


“Opposed by” spans

• A rule checks whether the roles contain a span introduced by “opposed by”. If so:
• The main verb and the negotiation point related to it are found
• A proposition is created with:
  – Each actor in the “opposed by” span
  – A negated form of the main verb
  – The main verb’s message
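The steps above can be sketched as a small function. It assumes the role span arrives as plain text and that the main verb and its message have already been found; actor splitting on commas and "and" is a simplification, and the function name is hypothetical. The `~` prefix marks the negated verb, as in the proposition tables.

```python
import re


def opposed_by_propositions(role_span, main_verb, message):
    """Create one negated proposition per actor in an "opposed by" span."""
    match = re.search(r"\bopposed by\b(.*)", role_span)
    if match is None:
        return []
    # Split the remainder of the span on commas and the word "and"
    parts = re.split(r",|\band\b", match.group(1))
    actors = [part.strip() for part in parts if part.strip()]
    return [(actor, "~" + main_verb, message) for actor in actors]


span = "The EU, with NEW ZEALAND and opposed by CHINA, MALAYSIA and BHUTAN"
negated = opposed_by_propositions(
    span, "supported", "including the promotion of natural regeneration"
)
```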


Treating Pronominal anaphora

• IXA Pipes’ CorefGraph coreference chains. A small subset of possible cases is treated
• Note: He/She can refer to a country (according to the country representative’s gender)
• Rules:
  – Pronoun antecedents are only searched in the same sentence or the preceding one
  – The actor in the main verb’s subject (dependency parsing) is the antecedent of a sentence-initial he/she in the following sentence
• Evaluation: Accurate. But coverage … ?
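The second rule (sentence-initial he/she takes the previous sentence's subject actor as antecedent) can be sketched as below. The input format is an assumption: each sentence is a dictionary with its tokens and the actor already identified in the main verb's subject, if any. It does not use CorefGraph output.

```python
PRONOUNS = {"he", "she"}


def resolve_sentence_initial_pronouns(sentences):
    """sentences: list of dicts with keys
         "tokens":        list of str
         "subject_actor": actor in the main verb's subject, or None
    Returns {sentence_index: antecedent_actor} for resolved pronouns."""
    resolved = {}
    for i in range(1, len(sentences)):
        tokens = sentences[i]["tokens"]
        first = tokens[0].lower() if tokens else ""
        antecedent = sentences[i - 1]["subject_actor"]
        # Only look one sentence back, as in the rule on the slide
        if first in PRONOUNS and antecedent is not None:
            resolved[i] = antecedent
    return resolved


sentences = [
    {"tokens": ["CHINA", "opposed", "the", "text", "."], "subject_actor": "China"},
    {"tokens": ["He", "proposed", "deleting", "it", "."], "subject_actor": None},
    {"tokens": ["Delegates", "agreed", "."], "subject_actor": None},
]
antecedents = resolve_sentence_initial_pronouns(sentences)
```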


Treating negation

• AM-NEG roles from SRL
• Surface cues: negative items in a small window preceding a predicate
  – Not, no, lack of, …
• Problems …
  – There was no lack of acceptance by …
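The surface-cue approach, and the problem case listed above, can be illustrated with a short sketch. The window size of 4 tokens and the three-item cue list are assumptions for illustration (the slide's list is open-ended), and the function names are hypothetical.

```python
NEGATIVE_CUES = ("not", "no", "lack of")


def count_negative_cues(tokens, predicate_index, window=4):
    """Count negative cues among the `window` tokens preceding the predicate."""
    start = max(0, predicate_index - window)
    context = " ".join(token.lower() for token in tokens[start:predicate_index])
    padded = f" {context} "  # padding so that cues only match whole words
    return sum(padded.count(f" {cue} ") for cue in NEGATIVE_CUES)


def is_negated(tokens, predicate_index, window=4):
    """Naive rule: any negative cue in the window marks the predicate as negated."""
    return count_negative_cues(tokens, predicate_index, window) > 0


# The problem case: two cues cancel out, but the naive rule still says "negated"
tokens = ["There", "was", "no", "lack", "of", "acceptance", "by"]
cues = count_negative_cues(tokens, predicate_index=5)
naive = is_negated(tokens, predicate_index=5)
```

On this sentence the sketch finds two cues ("no" and "lack of") and flags "acceptance" as negated even though the meaning is positive, which is why surface cues alone are unreliable here.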


Overall goal

• Help researchers to compare countries’ positions, via keyphrases and linked entities in the messages supported or opposed by countries

• Who agrees with whom?


Proposition Extraction Demo


KeyPhrase Extraction

• YaTeA (Aubin and Hamon, 2006)


Evaluation as an NLP system

• 311 propositions from ENB corpus: F1 .69
• 631 propositions from IPCC corpus (official scientific report creation negotiations): F1 .72
• Exact match of all three proposition elements

[Figure: ENB results]
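Exact-match scoring over (actor, predicate, message) triples can be sketched as follows. The slide only states that all three elements must match; treating predictions and gold annotations as sets and computing precision, recall and F1 over them is an assumption about the setup, and the sample triples are invented.

```python
def exact_match_prf(predicted, gold):
    """Precision, recall and F1 where a proposition counts as correct
    only if actor, predicate and message all match a gold proposition."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = {
    ("China", "~supported", "message A"),
    ("European_Union", "supported", "message A"),
    ("Bhutan", "~supported", "message A"),
}
predicted = {
    ("China", "~supported", "message A"),
    ("European_Union", "supported", "message B"),  # wrong message: no credit
}
precision, recall, f1 = exact_match_prf(predicted, gold)
```

One of two predictions is correct and one of three gold propositions is found, so precision is 0.5, recall is 1/3 and F1 is 0.4.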


Evaluation for DH

• Is this pipeline (and UI) helping researchers analyze climate negotiations?
  – Comparing actors’ positions
  – Looking for answers to research questions
  – Drawing attention to overlooked evidence
  – Confirming one view on a controversial point
  – …


Future work

• More implementation
• Actual domain-expert evaluation for DH purposes
• Assessing contribution of the work based on domain-expert evaluation


Other technologies

• Discourse Analysis (e.g. Xue et al., 2015 (CoNLL Task))
  – Testing Lin et al. 2014 PDTB parser on the subcorpora of the IPCC corpus
    • Assessment Report
    • Summary for Policymakers
    • Technical summary
• Semantic Textual Similarity (Agirre et al., 2012 onwards, SemEval)
  – Testing TakeLab system (from SemEval 2012)
  – Comparing to “non-semantic” similarity
• Perhaps Interpretable STS (Agirre et al. SemEval 2015 Pilot)


Summary

• Proposing to researchers in Humanities or Social Sciences:
  – Language technologies that help move beyond word co-occurrence methods
• Making researchers in other domains familiar with a broader variety of Language Technologies
• Evaluating impact of the tools on those researchers’ activities


Some references (1)

Sophie Aubin and Thierry Hamon. (2006). Improving Term Extraction with Terminological Resources. In Advances in Natural Language Processing: 5th International Conference on NLP, FinTAL 2006, pp. 380-387. LNAI 4139. Springer.

David Berry. (2012). Understanding Digital Humanities. Palgrave.

Marco Cornolti, Paolo Ferragina, and Massimiliano Ciaramita. (2013). A framework for benchmarking entity-annotation systems. In Proc. of WWW, 249–260.

Éric V. De La Clergerie, Olivier Hamon, Djamel Mostefa, Christelle Ayache, Patrick Paroubek, and Anne Vilnat. (2008). Passage: from French parser evaluation to large sized treebank. In Proc. of LREC 2008.

Paolo Ferragina and Ugo Scaiella. (2010). Tagme: on-the-fly annotation of short text fragments (by wikipedia entities). In Proc. of CIKM’10, 1625–1628.

Jonathan G. Fiscus. (1997). A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER). In Proc. of the IEEE ASRU Workshop, 1997, 347–354.

Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. (2011). Robust disambiguation of named entities in text. In Proc. of EMNLP, 782–792.

Heng Ji, Joel Nothman, and Ben Hachey. (2014). Overview of TAC-KBP2014 Entity Discovery and Linking Tasks. In Proc. Text Analysis Conference.

Ziheng Lin, Hwee Tou Ng, and Min-Yen Kan. (2014). A PDTB-Styled End-to-End Discourse Parser. Natural Language Engineering 20 (02): 151–84.

Mausam, Schmitz, Bart, Soderland, Etzioni, and others. (2012). Open Language Learning for Information Extraction. In Proc. EMNLP / CoNLL, 523–34.

Elijah Meeks and Scott B. Weingart. (2012). The Digital Humanities Contribution to Topic Modeling. Journal of Digital Humanities, 2:1.

Pablo N. Mendes, Max Jakob, Andrés García-Silva, and Christian Bizer. (2011). DBpedia spotlight: shedding light on the web of documents. In Proc. of the 7th Int. Conf. on Semantic Systems, I-SEMANTICS’11, 1–8.

David Milne and Ian H. Witten. (2008). An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In Proc. of AAAI Workshop on Wikipedia and Artificial Intelligence: an Evolving Synergy, 25–30.

Andrea Moro, Alessandro Raganato, and Roberto Navigli. (2014). Entity Linking meets Word Sense Disambiguation: A Unified Approach. Transactions of the ACL, 2, 231–244.

Thierry Poibeau, Horacio Saggion, Jakub Piskorski, and Roman Yangarber (eds.). (2012). Multi-source, Multilingual Information Extraction and Summarization. Springer Science & Business Media.

P. Ruiz, T. Poibeau, and F. Mélanie. (2015). Entity Linking with corpus coherence combining open source annotators. In Proc. NAACL-HLT Demos.

P. Ruiz and T. Poibeau. (2015). Combining Open Source Annotators for Entity Linking through Weighted Voting. In Proc. of *SEM. Denver, U.S.


Some references (2)

Satoshi Sekine, Kiyoshi Sudo, and Chikashi Nobata. (2002). Extended Named Entity Hierarchy. In Proc. LREC.

Eric F. Tjong Kim Sang and Fien De Meulder. (2003). Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In Proc. CoNLL. (ACL)

Ricardo Usbeck et al. (2015). GERBIL – General Entity Annotator Benchmarking Framework. In Proc. of WWW.

Tommaso Venturini and Daniele Guido. (2012). Once Upon a Text: An ANT Tale in Text Analysis. Sociologica 6 (3). [Note: ANT = Actor-Network Theory]

T. Venturini, N. Baya Laffite, J.-P. Cointet, I. Gray, V. Zabban, and K. De Pryck. (2014). Three Maps and Three Misunderstandings: A Digital Mapping of Climate Diplomacy. Big Data & Society 1 (2).

… plus several references to work by IXA


Thank you!

http://www.lattice.cnrs.fr/Pablo-Ruiz-Fabo,541


Supplemental Slides


Co-occurrence methods: Wordfish


Wikipedia-Link-Based Relatedness


Witten, I., and David Milne. 2008. “An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links.”


Coherence: other examples

Thomas and Mario are strikers playing in Munich (Moro and Navigli, 2014)