
Natural Language Processing in Digital Humanities: application examples

Presentation at IXA, January 2016

Pablo Ruiz Fabo — LATTICE Lab

Summary

• Digital Humanities’ needs in terms of text analysis tools

• Two examples of pipelines to address these needs:
  – Entity Linking and the PoliInformatics corpus
  – Proposition Extraction and the Earth Negotiations Bulletin corpus

• Implementation choices, demo, evaluation

2

Digital Humanities (DH)

• Application of computational methods to questions in fields like Humanities or Social Sciences.

  – A goal is to allow for asking questions and attaining findings that would be impossible without computational means (Berry, 2012)

  – Critical reflection: “as much of a focus on what the computational techniques obscure as reveal” (Meeks and Weingart, 2012)

3

DH and Text: Topic Models

• Very popular tool in DH: LDA (MALLET) (discussion in Meeks and Weingart, 2012)

4

Europarl Corpus Topics for all speeches by the French Green Party (Les Verts), 1999-2004 session

DH and Text …

• But there are many other Text Analysis / Language Technology / Natural Language Processing applications potentially useful for DH questions …

5


DH methods: Network analysis

• Transforming a collection of texts into a network (a graph)

• Network’s nodes:
  – Actors in the corpus: People, Institutions …
  – Concepts in the corpus

Named Entity Recognition / Disambiguation

Entity and Concept Linking

8

Text => Network via EL: User Needs

9

Venturini et al. (2012), Once Upon a Text [médialab at Sciences Po, Paris]

The careful use of natural language processing algorithms could provide better filtering metrics and support in expression merging

The manual filtering is crucial because it allows entities to be reduced to a set size appropriate for analysis, but also recovering important entities that could have been excluded by the automatic filtering.

Though AlchemyAPI offers a trustworthy service, we don’t like relying on it. In particular, we don’t like that the service is offered as a “black box” and that the exact extraction algorithm is secret.

10

Venturini and Guido (2012). Once upon a text: An Actor-Network Theory tale in text analytics.

Text => Network via EL: User needs

https://github.com/medialab/ANTA

Entity Linking in DH: User Needs

• Variety of corpora needs to be treated

• EL literature shows that tools’ performance varies according to the corpus:

11

                 AIDA/CoNLL (news, sports)    IITB (web, various topics)
TOOL             P      R      F1             P      R      F1
Spotlight        31.2   40.4   35.2           46.2   50.0   48.0
TagMe            61.4   55.5   58.3           45.2   42.0   43.6
WikipediaMiner   46.9   52.8   49.7           56.8   48.2   43.6
AIDA             63.3   29.1   39.8           65.7   4.1    7.6

Data from Cornolti et al. (2013). Metric: Weak Annotation Match.

• Correlations between number of occurrences of a textual feature and tool performance (Usbeck et al. 2015)

12

CORRELATIONS Nbr. PER Nbr. ORG Nbr. LOC Nbr. OTHER

Babelfy 0.769 -0.376 0.254 -0.431

Spotlight 0.217 -0.480 -0.461 0.26

TagMe 0.257 -0.272 -0.194 0.036

WikipediaMiner 0.082 -0.679 -0.632 0.497

Data from 20 Nov 2015, GERBIL platform (gerbil.aksw.org/gerbil/overview), A2KB/Ma task

Entity Linking in DH: User Needs

Entity Linking approach given User needs

NEED → APPROACH

• Avoiding “black boxes” → Open-source tools

• Treating a variety of corpora, knowing that tools’ performance varies with the corpora → Combining tools to get complementary results

• Manual filtering of entities; information to guide the filtering → Providing annotation quality metrics to users; simultaneous access to metrics and text to validate annotations; optional automatic annotation selection

13

Open Source Tool Combination

• Open-source / public-domain tools which link to generic ontologies (DBpedia, YAGO, BabelNet)

  [P.S.: OK, maybe Babelfy is not exactly open source … we could perhaps use AGDISTIS: https://github.com/AKSW/AGDISTIS]

14


Metrics to guide filtering?

15

• Confidence scores

• Coherence scores

EL: Annotation confidence

SOCCER – JAPAN GET LUCKY WIN, CHINA IN SURPRISE DEFEAT

16

CONFIDENCE

EL: Annotation coherence

• Wikipedia Link-Based Measure: relies on common inbound links

• Milne & Witten (2008) [original proposal]

• Ferragina et al. (2010) [optimizations]

• Hoffart et al. (2011) [among other measures]

• Moro et al. (2014) [other measures]

17

Milne-Witten coherence between entities e1 and e2 (as in Hoffart et al. 2011)
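For reference, the Milne–Witten coherence between two entities can be written as follows (a reconstruction following Milne & Witten, 2008, and Hoffart et al., 2011; In(e) is the set of Wikipedia pages linking to entity e, and W is the set of all Wikipedia pages):

\[
\mathrm{mw}(e_1, e_2) \;=\; 1 \;-\; \frac{\log\bigl(\max(|In(e_1)|,\,|In(e_2)|)\bigr) \;-\; \log\bigl(|In(e_1) \cap In(e_2)|\bigr)}{\log|W| \;-\; \log\bigl(\min(|In(e_1)|,\,|In(e_2)|)\bigr)}
\]

with the value clipped to 0 when the expression is negative.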

Entity Linking Demo

18

Demo: PoliInformatics Corpus

19

Demo: PoliInformatics Corpus

20

• NLP Unshared Task in PoliInformatics 2014
  – Who was the financial crisis?
    • Participants: individuals, industries, …
  – What was the financial crisis?
    • Causes, proposals for reform, …

• Heterogeneous corpus containing
  – Congress Hearings transcripts
  – Official reports on the crisis by Congress
  – Bills
  – Etc.


EL Demo: Corpus

22

Congress Hearings: Interviews with witnesses

EL Demo: Corpus

23

Official report by Congress about the causes of the crisis

Entity Linking Demo

[Demo screenshots; results shown in the live demo are not reproduced here]

Description: Ruiz, Poibeau & Mélanie (Demo at NAACL 2015).

Evaluation

• … as an NLP-related system

• … in terms of Digital Humanities

25

• Does a selection among the annotations provided by several systems outperform each of those systems’ annotations taken individually?

• Combination method: ROVER. Each system is weighted by its precision on a test corpus (a minimal voting sketch is given below):
  – Fiscus (1997), for ASR
  – De la Clergerie et al. (2008), for parsing
  – Ruiz and Poibeau (2015) (*SEM poster)

26
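To make the weighted-voting idea concrete, here is a minimal Python sketch of ROVER-style, precision-weighted voting over entity-linking annotations (the weights, threshold and data structures are illustrative assumptions, not the implementation of Ruiz and Poibeau, 2015):

from collections import defaultdict

# Hypothetical per-system weights, e.g. precision measured on a development corpus.
SYSTEM_WEIGHTS = {"tagme": 0.61, "spotlight": 0.31, "wikipediaminer": 0.47, "aida": 0.63}

def rover_select(annotations, min_score=0.5):
    """Pick one entity link per mention by precision-weighted voting.
    `annotations` is a list of (system, mention_span, entity_uri) triples,
    where mention_span identifies the annotated text span, e.g. (start, end) offsets.
    Returns a dict mapping mention_span -> winning entity_uri."""
    votes = defaultdict(lambda: defaultdict(float))
    for system, span, entity in annotations:
        votes[span][entity] += SYSTEM_WEIGHTS.get(system, 0.0)
    selected = {}
    for span, candidates in votes.items():
        entity, score = max(candidates.items(), key=lambda kv: kv[1])
        if score >= min_score:  # discard weakly supported annotations
            selected[span] = entity
    return selected

anns = [("tagme", (0, 6), "dbpedia:Lehman_Brothers"),
        ("aida", (0, 6), "dbpedia:Lehman_Brothers"),
        ("spotlight", (0, 6), "dbpedia:Lehman_(crater)")]
print(rover_select(anns))  # {(0, 6): 'dbpedia:Lehman_Brothers'}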

Evaluation as an NLP system

ROVER: system weights

• Assume two application scenarios:
  – User wants entities only
  – User wants entities and concepts

• For entities only:
  – Weight according to precision on the AIDA/CoNLL-B corpus (no concepts in the gold standard)

• For entities and concepts:
  – Weight according to precision on the IITB corpus (many concepts in the gold standard)

27

Corpora: Cornolti et al. 2013, BAT Framework https://github.com/marcocor/bat-framework/tree/master/benchmark/datasets

ROVER: testing

• “Entities only” was tested on the MSNBC corpus (the reference set has no concepts)

• “Entities and concepts” was tested on the AQUAINT corpus (the reference set has concepts)

28

29

Evaluation as an NLP system

[Results chart: Micro-averaged Strong Annotation Match. t = tool’s optimal confidence threshold. * p < 0.05 (random permutation)]

30

Evaluation as an NLP system

[Results chart (continued): same metric and notation as the previous slide]

31

Evaluation as an NLP system

[Results chart (continued): same metric and notation as the previous slides]

Evaluation for DH

• Possible researcher objectives:
  – Finding evidence about a research question
  – Doing that faster or with less manual work
  – Obtaining networks that confirm (or challenge) previously available knowledge
  – Obtaining quantitative evidence where only qualitative evidence was available

• Do the following elements help?
  – The UI and the workflows it allows
  – The confidence and coherence scores
  – The automatic annotation selection

32



Proposition Extraction

• Daily reports on international climate conferences (Conference of the Parties or COP), like COP-21, which took place in December 2015.

• Summary of participant countries’ proposals.

37

Prop Ext: Corpus

• Identify the different countries’ interventions

• Identify negotiation points supported or opposed by participants

• Help researchers compare countries’ positions via keyphrase and entity extraction on the negotiation points

• Provide a more detailed analysis than prior work on this corpus, which was based on word co-occurrence methods (Venturini et al., 2014)

38

Pipeline Objectives

Typical corpus sentence

The EU, with NEW ZEALAND and opposed by CHINA, MALAYSIA and BHUTAN, supported including the promotion of natural regeneration within the definitions of "afforestation" and "reforestation."

39

Actors (Countries)

The EU, with NEW ZEALAND and opposed by CHINA, MALAYSIA and BHUTAN, supported including the promotion of natural regeneration within the definitions of "afforestation" and "reforestation."

40

Messages (negotiation points)

The EU, with NEW ZEALAND and opposed by CHINA, MALAYSIA and BHUTAN, supported including the promotion of natural regeneration within the definitions of "afforestation" and "reforestation."

41

Predicates (support/opposition)

The EU, with NEW ZEALAND and opposed by CHINA, MALAYSIA and BHUTAN, supported including the promotion of natural regeneration within the definitions of "afforestation" and "reforestation."

42

Actor + Predicate + Message = Proposition

43

Propositions

ACTORS           PREDICATES   MESSAGES
European_Union   supported    including the promotion of natural regeneration within the definitions of "afforestation" and "reforestation."
New_Zealand      supported    (same message)
China            ~supported   (same message)
Malaysia         ~supported   (same message)
Bhutan           ~supported   (same message)

44

Propositions

   ACTORS           PREDICATES   MESSAGES
1  European_Union   supported    including the promotion of natural regeneration within the definitions of "afforestation" and "reforestation."
2  New_Zealand      supported    (same message)
3  China            ~supported   (same message)
4  Malaysia         ~supported   (same message)
5  Bhutan           ~supported   (same message)

45

Propositions

   ACTORS           VERBAL PREDICATES   MESSAGES
1  European_Union   supported           including the promotion of natural regeneration within the definitions of "afforestation" and "reforestation."
2  New_Zealand      supported           (same message)
3  China            ~supported          (same message)
4  Malaysia         ~supported          (same message)
5  Bhutan           ~supported          (same message)

46

   ACTORS                NOMINAL PREDICATES   MESSAGES
1  Group_of_77 / China   proposal             to include research and development in the transport and energy sectors in the priority areas to be financed by the SCCF.

Sources of proposition info?

• Open Relation Extraction
  – OLLIE (Mausam et al., 2012, EMNLP)
    • https://github.com/knowitall/ollie
  – Open Information Extraction 4.0
    • https://github.com/knowitall/openie

• Traditional sources
  – Syntactic dependency parsing
  – Semantic Role Labeling
  (CoNLL 08/09 for both)

47

Sources of proposition info?

• Open Relation Extraction
  – OLLIE (Mausam et al., 2012, EMNLP)
    • https://github.com/knowitall/ollie
  – Open Information Extraction 4.0
    • https://github.com/knowitall/openie

• Traditional sources
  – Syntactic dependency parsing
  – Semantic Role Labeling
    • IXA Pipes wrapper for MATE-tools

48

Can we do this with patterns?

The EU, with NEW ZEALAND and opposed by CHINA, MALAYSIA and BHUTAN, supported including the promotion of natural regeneration within the definitions of "afforestation" and "reforestation."

49

Can we do this with patterns?

[ The EU, with NEW ZEALAND ] and [ opposed by CHINA, MALAYSIA and BHUTAN ], [ supported including the promotion of natural regeneration within the definitions of "afforestation" and "reforestation." ]

50

Cf. Salway et al., 2014, ACL, Grammar Induction approach exploiting the ADIOS algorithm from Solan et al. 2005, PNAS

Can we do this with patterns?

• Maybe, but …
  – What about anaphora resolution?
  – What about negation?

• An NLP pipeline deals with these phenomena in a uniform way (unlike “linguistically-agnostic” patterns)

51

Can we do this with patterns?

• Maybe, but …
  – What about anaphora resolution?
  – What about negation?

• An NLP pipeline deals with these phenomena in a uniform way (unlike “linguistically-agnostic” patterns)

• IXA Pipes provides SRL info and coreference chains (and syntactic dependencies if needed)

52

Using SRL info

• We have propositions (events) involving:
  – speakers
  – reporting verbs / reporting-related nouns
  – messages communicated by the speakers

53

The EU, with NEW ZEALAND and opposed by CHINA, MALAYSIA and BHUTAN, supported including the promotion of natural regeneration within the definitions of "afforestation" and "reforestation."

Using SRL info

• We have propositions (events) involving:
  – speakers: generally in the predicate’s A0 role
    • A list of countries and other actors was created manually from specialized sources (UNFCCC site)
  – reporting verbs / reporting-related nouns
    • A list was created based on VerbNet and NomBank (using the NLTK interface)
  – messages communicated by the speakers: generally the predicate’s A1 role (complemented with adjunct roles …)

• What about “opposed by …”? (see below)

A minimal sketch of the resulting generic rule is given below.

54

Using SRL Info: Generic rule

55
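A minimal Python sketch of this generic rule (illustrative data structures, role labels and word lists; not the pipeline’s actual code): reporting predicates are checked against a VerbNet/NomBank-derived list, A0 arguments matching the actor list become speakers, and the A1 span becomes the message.

# Assumed, simplified inputs: each SRL frame is a dict with the predicate lemma,
# a list of A0 actor strings and an A1 message string.
ACTORS = {"EU", "NEW ZEALAND", "CHINA", "MALAYSIA", "BHUTAN"}          # from the UNFCCC-derived actor list
REPORTING_PREDICATES = {"support", "oppose", "propose", "suggest"}     # VerbNet/NomBank-derived list

def extract_propositions(srl_frames):
    """Return (actor, predicate, message) triples for reporting predicates."""
    propositions = []
    for frame in srl_frames:
        if frame["lemma"] not in REPORTING_PREDICATES:
            continue
        message = frame.get("A1", "")
        for actor in frame.get("A0", []):
            if actor in ACTORS:
                propositions.append((actor, frame["lemma"], message))
    return propositions

frames = [{"lemma": "support",
           "A0": ["EU", "NEW ZEALAND"],
           "A1": 'including the promotion of natural regeneration within the definitions of "afforestation" and "reforestation."'}]
print(extract_propositions(frames))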

“Opposed by” spans

• A rule checks whether any role contains a span introduced by “opposed by”. If so:
  – the main verb and the negotiation point related to it are found
  – a proposition is created with:
    • each actor in the “opposed by” span
    • a negated form of the main verb
    • the main verb’s message
  (a minimal sketch of this rule is given after this slide)

56
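A minimal Python sketch of the “opposed by” rule under the same illustrative assumptions (the actor list and input format are made up for the example):

import re

ACTORS = ("EU", "NEW ZEALAND", "CHINA", "MALAYSIA", "BHUTAN")

def opposed_by_propositions(role_text, main_predicate, main_message):
    """If `role_text` contains a span introduced by 'opposed by', return one
    proposition per actor in that span, with a negated form of the main predicate."""
    match = re.search(r"opposed by (.+)", role_text, flags=re.IGNORECASE)
    if not match:
        return []
    span = match.group(1).upper()
    return [(actor, "~" + main_predicate, main_message)
            for actor in ACTORS if actor in span]

print(opposed_by_propositions(
    "with NEW ZEALAND and opposed by CHINA, MALAYSIA and BHUTAN",
    "supported",
    "including the promotion of natural regeneration ..."))
# CHINA, MALAYSIA and BHUTAN are each paired with '~supported' and the main verb's message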

Treating Pronominal anaphora

• IXA Pipes’ CorefGraph coreference chains

• Rules: a small subset of possible cases is treated
  – Note: he/she can refer to a country (according to the country representative’s gender)
  – Pronoun antecedents are only searched for in the same sentence or the preceding one
  – The actor in the main verb’s subject (from dependency parsing) is the antecedent of a sentence-initial he/she in the following sentence (a minimal sketch of this rule is given below)

• Evaluation: accurate, but coverage … ?

57
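A minimal Python sketch of the sentence-initial pronoun rule (an illustration under assumed inputs; not the CorefGraph-based implementation):

ACTORS = {"China", "Japan", "EU"}

def resolve_sentence_initial_pronoun(prev_subject, sentence):
    """If the sentence starts with He/She and the previous sentence's main-verb
    subject is a known actor, return that actor and the rewritten sentence."""
    first_token = sentence.split(maxsplit=1)[0] if sentence else ""
    if first_token in {"He", "She"} and prev_subject in ACTORS:
        return prev_subject, sentence.replace(first_token, prev_subject, 1)
    return None, sentence

print(resolve_sentence_initial_pronoun(
    "China", "He also supported a review of the reporting guidelines."))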

Treating negation

• AM-NEG roles from SRL

• Surface cues: negative items in a small window preceding a predicate (a sketch of this check is given below)
  – not, no, lack of, …

• Problems …
  – “There was no lack of acceptance by …”

58
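A minimal Python sketch of the surface-cue check (the cue list and window size are illustrative assumptions); the final example shows the problem noted above: a double negation such as “no lack of” is wrongly flagged as negation by the naive check.

NEGATIVE_CUES = ("not", "no", "lack of", "n't")

def predicate_negated(tokens, pred_index, window=4):
    """Return True if any negative cue occurs in the `window` tokens before the predicate."""
    context = " ".join(tokens[max(0, pred_index - window):pred_index]).lower()
    return any(cue in context for cue in NEGATIVE_CUES)

tokens = "There was no lack of acceptance by delegates".split()
print(predicate_negated(tokens, tokens.index("acceptance")))  # True, although the sentence is not negative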

Overall goal

• Help researchers to compare countries’ positions, via keyphrases and linked entities in the messages supported or opposed by countries

• Who agrees with whom?

59

Proposition Extraction Demo

60

KeyPhrase Extraction

• YaTeA (Aubin and Hamon, 2006)

61

Evaluation as an NLP system

• 311 propositions from the ENB corpus: F1 = .69

• 631 propositions from the IPCC corpus (negotiations on the creation of official scientific reports): F1 = .72

• Metric: exact match of all three proposition elements

62

[Chart: ENB results]

Evaluation for DH

• Is this pipeline (and UI) helping researchers analyze climate negotiations?
  – Comparing actors’ positions
  – Looking for answers to research questions
  – Drawing attention to overlooked evidence
  – Confirming one view on a controversial point
  – …

63

Future work

• More implementation

• Actual domain-expert evaluation for DH purposes

• Assessing the contribution of the work based on domain-expert evaluation

64

Other technologies

• Discourse Analysis (e.g. Xue et al., 2015, CoNLL Shared Task)
  – Testing the Lin et al. (2014) PDTB parser on the subcorpora of the IPCC corpus:
    • Assessment Report
    • Summary for Policymakers
    • Technical Summary

• Semantic Textual Similarity (Agirre et al., 2012 onwards, SemEval)
  – Testing the TakeLab system (from SemEval 2012)
  – Comparing to “non-semantic” similarity
  – Perhaps Interpretable STS (Agirre et al., SemEval 2015 pilot)

65

Summary

• Proposing to researchers in the Humanities or Social Sciences:
  – Language technologies that help move beyond word co-occurrence methods

• Making researchers in other domains familiar with a broader variety of Language Technologies

• Evaluating the impact of the tools on those researchers’ activities

66

Some references (1)

Sophie Aubin and Thierry Hamon. (2006). Improving Term Extraction with Terminological Resources. In Advances in Natural Language Processing: 5th International Conference on NLP, FinTAL 2006, pp. 380–387. LNAI 4139. Springer.

David Berry. (2012). Understanding Digital Humanities. Palgrave.

Marco Cornolti, Paolo Ferragina, and Massimiliano Ciaramita. (2013). A framework for benchmarking entity-annotation systems. In Proc. of WWW, 249–260.

Éric V. De La Clergerie, Olivier Hamon, Djamel Mostefa, Christelle Ayache, Patrick Paroubek, and Anne Vilnat. (2008). PASSAGE: from French parser evaluation to large sized treebank. In Proc. of LREC 2008.

Paolo Ferragina and Ugo Scaiella. (2010). TAGME: on-the-fly annotation of short text fragments (by Wikipedia entities). In Proc. of CIKM ’10, 1625–1628.

Jonathan G. Fiscus. (1997). A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER). In Proc. of the IEEE ASRU Workshop, 1997, 347–354.

Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. (2011). Robust disambiguation of named entities in text. In Proc. of EMNLP, 782–792.

Heng Ji, Joel Nothman, and Ben Hachey. (2014). Overview of TAC-KBP2014 Entity Discovery and Linking Tasks. In Proc. of the Text Analysis Conference.

Ziheng Lin, Hwee Tou Ng, and Min-Yen Kan. (2014). A PDTB-Styled End-to-End Discourse Parser. Natural Language Engineering, 20(2), 151–184.

Mausam, Schmitz, Bart, Soderland, Etzioni, and others. (2012). Open Language Learning for Information Extraction. In Proc. of EMNLP-CoNLL, 523–534.

Elijah Meeks and Scott B. Weingart. (2012). The Digital Humanities Contribution to Topic Modeling. Journal of Digital Humanities, 2(1).

Pablo N. Mendes, Max Jakob, Andrés García-Silva, and Christian Bizer. (2011). DBpedia Spotlight: shedding light on the web of documents. In Proc. of the 7th Int. Conf. on Semantic Systems, I-SEMANTICS ’11, 1–8.

David Milne and Ian H. Witten. (2008). An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In Proc. of the AAAI Workshop on Wikipedia and Artificial Intelligence: an Evolving Synergy, 25–30.

Andrea Moro, Alessandro Raganato, and Roberto Navigli. (2014). Entity Linking meets Word Sense Disambiguation: a Unified Approach. Transactions of the ACL, 2, 231–244.

Thierry Poibeau, Horacio Saggion, Jakub Piskorski, and Roman Yangarber (eds.). (2012). Multi-source, Multilingual Information Extraction and Summarization. Springer Science & Business Media.

P. Ruiz, T. Poibeau, and F. Mélanie. (2015). Entity Linking with corpus coherence combining open source annotators. In Proc. of NAACL-HLT (Demos).

P. Ruiz and T. Poibeau. (2015). Combining Open Source Annotators for Entity Linking through Weighted Voting. In Proc. of *SEM. Denver, U.S.

67

Some references (2)

Satoshi Sekine, Kiyoshi Sudo, and Chikashi Nobata. (2002). Extended Named Entity Hierarchy. In Proc. of LREC.

Erik F. Tjong Kim Sang and Fien De Meulder. (2003). Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In Proc. of CoNLL (ACL).

Ricardo Usbeck et al. (2015). GERBIL – General Entity Annotator Benchmarking Framework. In Proc. of WWW.

Tommaso Venturini and Daniele Guido. (2012). Once Upon a Text: An ANT Tale in Text Analysis. Sociologica, 6(3). [Note: ANT = Actor-Network Theory]

T. Venturini, N. Baya Laffite, J.-P. Cointet, I. Gray, V. Zabban, and K. De Pryck. (2014). Three Maps and Three Misunderstandings: A Digital Mapping of Climate Diplomacy. Big Data & Society, 1(2).

… plus several references to work by IXA

68

Thank you !

pablo.ruiz.fabo@ens.fr http://www.lattice.cnrs.fr/Pablo-Ruiz-Fabo,541

Supplemental Slides

70

Co-occurrence methods: Wordfish

71

Wikipedia-Link-Based Relatedness

72

Milne, D., and Witten, I. (2008). “An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links.”

Coherence: other examples

73

Thomas and Mario are strikers playing in Munich (Moro and Navigli, 2014)
