Methods and Techniques for the
Analysis of Parliamentary Records:
Two Case Studies on Italian
Simonetta Montemagni
Istituto di Linguistica Computazionale “A. Zampolli”
ILC-CNR (Pisa, Italy)
Natural Language Processing and Knowledge extraction
◦ Extraction of Named Entities
◦ Extraction of semantic relations
◦ Extraction of domain-relevant entities
◦ Extraction of temporal expressions
◦ Graph-based Knowledge Representation
◦ Linguistic profiling of texts
◦ Textual genre assessment
◦ Readability level assessment
◦ Native Language Identification
◦ Monitoring of variation across language varieties
Linguistic Knowledge Extraction
Domain Knowledge Extraction
T2K system (architecture diagram)
◦ Tool groups: Linguistic Analysis Tools, Information Extraction Tools, Knowledge Graph Tools
◦ Modules: Named Entity tagger, Domain-specific Entities extractor, Relation extractor, Semantic annotator, Indexer, Graph creator, Graph Visualizer
◦ Outputs: Annotated corpus, Linguistic Profiling, Semantic annotation, Index of Content, Knowledge graph
◦ Stages: Linguistic pre-processing, Knowledge extraction
Text-to-Knowledge (T2K)
T2K combines a battery of tools for Natural Language Processing (NLP),
statistical text analysis and machine learning, which are dynamically integrated
to provide an accurate representation of the domain-specific content of text
corpora in different domains.
Case Study 1 –
Knowledge Extraction and Semantic Indexing
Focus on different types of parliamentary data:
◦ legislative texts (draft bills)
◦ parliamentary reports (work just starting)
The challenges
◦ The peculiarity of legal language and its impact on NLP tools
   Legal syntax is “convoluted and unnatural” (McCarty, NaLEA 2009) with respect to ordinary language
   Much lower performance of state-of-the-art NLP tools on legislative texts
   Need for Domain Adaptation of NLP tools
◦ “Twofold” terminology
   Domain-specific terms of law or parliamentary procedure are intertwined with terms from the regulated or discussed domain (world knowledge)
   e.g. autorità competente (“competent authority”) vs sostanza pericolosa (“dangerous substance”)
Case Study 2 –
Stylistic analysis of Italian Political
Speeches
A Readability Analysis of Campaign Speeches
from the 2016 US Presidential Campaign (Elliot Schumacher, Maxine Eskenazi)
From linguistic annotation …
Morpho-syntactic annotation (PoS tagger developed by Dell’Orletta, 2009)
◦ Evalita 2009: accuracy = 96.34% (without reference lexicon)
◦ State-of-the-art for Italian
Dependency syntactic annotation (DeSR parser, Attardi & Dell’Orletta, 2009)
◦ CoNLL-2007: 81.3% LAS
◦ Evalita 2009: 83.38% LAS
◦ State-of-the-art for Italian
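The parser figures above are labeled attachment scores (LAS): the share of tokens that receive both the correct head and the correct dependency label. A minimal sketch of that computation follows. The `Token` layout, the function name `attachment_scores` and the toy sentence are assumptions made for illustration; the official CoNLL and Evalita scorers apply their own conventions (for example on punctuation), which are not reproduced here.

```python
from typing import List, Tuple

# Assumed token layout: (head_index, dependency_label); head 0 stands for the root.
Token = Tuple[int, str]


def attachment_scores(gold: List[Token], predicted: List[Token]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over two aligned token lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        raise ValueError("cannot score an empty token list")
    # UAS: only the head has to match.
    head_ok = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS: head and label both have to match.
    both_ok = sum(g == p for g, p in zip(gold, predicted))
    n = len(gold)
    return 100.0 * head_ok / n, 100.0 * both_ok / n


# Toy four-token sentence (hypothetical labels).
gold = [(2, "det"), (3, "subj"), (0, "ROOT"), (3, "obj")]
pred = [(2, "det"), (3, "obj"), (0, "ROOT"), (2, "obj")]
uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.1f}%, LAS = {las:.1f}%")
```

In the toy sentence one token has the right head but the wrong label and one has the wrong head, so LAS is lower than the unlabeled score (UAS).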
… to linguistic profiling and readability
assessment
Automatically parsed corpus
Automatic extraction of linguistic
features (linguistic profiling)
Automatic readability assessment and
detection of complex text passages
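The profiling step above can be sketched as follows: a few features are read off an automatically parsed corpus. The `Tok` structure, the function name `profile` and the three features chosen (average sentence length, part-of-speech distribution, average dependency link length) are assumptions for illustration; the slides do not list the exact feature set used.

```python
from collections import Counter
from typing import Dict, List, NamedTuple


class Tok(NamedTuple):
    index: int  # 1-based position in the sentence
    form: str
    pos: str    # coarse part-of-speech tag
    head: int   # index of the syntactic head, 0 for the root


def profile(sentences: List[List[Tok]]) -> Dict[str, object]:
    """Compute a few illustrative profiling features from parsed sentences."""
    tokens = [t for s in sentences for t in s]
    if not tokens:
        raise ValueError("empty corpus")
    pos_counts = Counter(t.pos for t in tokens)
    # Link length: distance in tokens between a word and its head (root excluded).
    links = [abs(t.index - t.head) for t in tokens if t.head != 0]
    return {
        "avg_sentence_length": len(tokens) / len(sentences),
        "pos_distribution": {p: c / len(tokens) for p, c in pos_counts.items()},
        "avg_dependency_link_length": sum(links) / len(links) if links else 0.0,
    }


# Hypothetical two-sentence corpus.
corpus = [
    [Tok(1, "Il", "DET", 2), Tok(2, "governo", "NOUN", 3), Tok(3, "decide", "VERB", 0)],
    [Tok(1, "Votiamo", "VERB", 0)],
]
features = profile(corpus)
print(features)
```

Feature values of this kind can then be compared across speakers or fed to a readability classifier; that step is not shown here.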
READ-IT
http://www.italianlp.it/demo/read-it/
NLP-based automatic readability assessment software for the Italian language
(Dell’Orletta et al. 2011)
READ-IT results
Italian political speeches: yesterday
De Gasperi (1953), Berlinguer (1977), Craxi (1993)
A selection of grammatical features
Usage of verbal morpho-syntactic features (number and person)
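One way to obtain such a distribution is to count the person and number values of the verb forms in a morpho-syntactically annotated speech. The sketch below assumes that each verb has already been reduced to a (person, number) pair; the value names ("1", "3", "sing", "plur") and the function name `person_number_distribution` are hypothetical and do not reflect the tagset used by the author's tools.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def person_number_distribution(
    verbs: Iterable[Tuple[str, str]]
) -> Dict[Tuple[str, str], float]:
    """Return the percentage of verb forms for each (person, number) pair."""
    counts = Counter(verbs)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no verb forms given")
    return {key: 100.0 * c / total for key, c in counts.items()}


# Hypothetical annotations for four finite verbs.
verbs = [("1", "plur"), ("1", "plur"), ("3", "sing"), ("1", "sing")]
dist = person_number_distribution(verbs)
for (person, number), share in sorted(dist.items()):
    print(f"person {person}, {number}: {share:.1f}%")
```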
Speaker      % of word types in the Basic Italian Vocabulary (De Mauro)   Type/Token ratio
De Gasperi   73.88                                                        0.60
Berlinguer   77.60                                                        0.68
Craxi        61.93                                                        0.72
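The two lexical measures in the table can be sketched as follows. This is a simplified illustration, not the procedure behind the reported figures: it works on lower-cased surface forms rather than lemmas, uses a tiny stand-in word list (`toy_vocabulary`) instead of De Mauro's Basic Italian Vocabulary, and computes the type/token ratio over the whole text, whereas the slides do not state the sample size used.

```python
import re
from typing import Set, Tuple


def lexical_profile(text: str, basic_vocabulary: Set[str]) -> Tuple[float, float]:
    """Return (% of word types found in basic_vocabulary, type/token ratio)."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        raise ValueError("text contains no word tokens")
    types = set(tokens)
    # Share of distinct words (types) that belong to the reference vocabulary.
    basic_share = 100.0 * len(types & basic_vocabulary) / len(types)
    # Type/token ratio: distinct words divided by running words.
    ttr = len(types) / len(tokens)
    return basic_share, ttr


# Stand-in vocabulary and sentence, both hypothetical.
toy_vocabulary = {"il", "paese", "ha", "bisogno", "di"}
share, ttr = lexical_profile(
    "Il paese ha bisogno di riforme, il paese ha bisogno di fiducia",
    toy_vocabulary,
)
print(f"basic vocabulary: {share:.2f}% of types, type/token ratio: {ttr:.2f}")
```

Because the type/token ratio falls as a text gets longer, values are only comparable across texts of similar length or across samples of a fixed size.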
Italian political speeches: today
Speakers: Renzi, Berlusconi, Grillo
Speeches: Leopolda speech (28 October 2013), Piazza del Popolo (23 March 2013), Porta a Porta (21 May 2014)
What if the audience changes?
◦ Speech to the Senate (24 February 2014)
◦ E-news: Jobs Act (January 2014)
◦ Leopolda speech (28 October 2013)
Figure: readability of the three texts, on a scale from high readability (Alta leggibilità) to low readability (Bassa leggibilità)
A selection of grammatical features
Usage of verbal morpho-syntactic features (number and person)
Text             % of word types in the Basic Italian Vocabulary (De Mauro)   Type/Token ratio
Renzi-Leopolda   73.37                                                        0.63
Renzi-Jobsact    65.78                                                        0.73
Renzi-Senate     76.01                                                        0.63
Within the same text
Slogans
Rhetoric
Conclusion
Promising results
To be refined through
◦ Domain Adaptation of NLP tools
To be extended in different directions
◦ By widening
   the set of monitored linguistic features
   the dimensions of variation (genre, political parties, etc.)
◦ By taking into account the characterizing concepts and their evolution across speakers, political parties, genres and time
◦ By dealing with other languages besides Italian
The ItaliaNLP Lab
People:
Dominique Brunato
Andrea Cimino
Felice Dell’Orletta
Simonetta Montemagni
Giulia Venturi