BiographyNet
Project review, year 1
eScience Center, 18 September 2013
BiographyNet Review Meeting, eScience centre, September 18th, 2013
Slide 2
Agenda
- Project objectives and first year results (Piek)
- Methodology and historian perspective (Serge)
- Model, conversions and interface (Niels)
- NLP tools and research (Antske)
- Discussion
Slide 3
Starting point
http://www.biografischportaal.nl
- Academic discipline of writing histories: computational tools are only marginally used, and there is a long scholarly tradition of study by reading and of single-authored historical narratives, while more and more historical sources are digitally available.
- Project challenges:
  - Computational thinking in history: narrative historians are not used to framing research problems in computational terms, while computer-science researchers understand little of the subtleties of historical analysis.
  - Strong multi-disciplinary cooperation of front runners in both fields, and demonstrator development, to achieve common understanding.
  - Methodological and tool support.
Slide 4
Contribution to historical research
- New research on Dutch nation building and a revaluation of biographical information.
- Bridging a gap between life histories, qualitative historical research, and quantitative historical research.
- Open research on less static objects and relations such as events: the most important pieces of information, capturing the changes and processes that matter.
- Capture historiographic perspective:
  - Requires a model that takes different framings of the same event into account.
  - Adds to the who-knows-who: when, where and how did the lives of people cross; how did they affect each other's lives and the world they lived in?
  - How do and did we conceive historic events; how are different narratives created around the same history?
Slide 5
Expected outcome
- Demonstrator on top of the Biography Portal. Cyclic development.
- Links within the Biography Portal among the various (textual and visual) datasets.
- Open-source release of the e-science platform for analyzing biographical texts about people.
- Adherence to all relevant Web standards and APIs, maximizing reusability.
- Proposal for a methodology for extracting a network of relations between people and (historic) events.
Slide 6
Short term goals
1. Building a richer data repository by connecting different distributed sources of data through formalized links and metadata.
2. Detection of (co-referenced) named entities (persons, places and dates) and events.
3. Harmonizing the texts, which vary from 19th-century Dutch to contemporary Dutch and, where OCR-ed, also contain errors.
4. Development of visualization and analytic tools, as well as computational historiographical methods, on the structured data generated in goals 1 through 3.
Slide 7
Results first year
Methodology:
- Use cases and the anticipation of data- and process-driven biases
- Formal modeling of provenance
- Sustainability, replication, reproducibility
Software:
- Design of interfaces and analytic tools
- Text mining and evaluation
- Linked Data conversion scripts
Data:
- Linked Data version of the Portal
- Linking to Agora
- Discussions with Wikimedia/Wikipedia/DBpedia & Bibliotheek.nl
- Verrijkt Koninkrijk
- HuygensING exploitation to extend the Portal with the enriched data produced
6 accepted papers
Slide 8
BiographyNet and historical approaches to big and heterogeneous data
Slide 9
The historian's role
1. Methodology: work on a methodology to extract information, relationships and events from short biographical texts.
2. Question the data: develop use cases.
3. Contribute to the design of a user interface that challenges historians to dig deeper into the data.
4. Sensitize target user groups (historians) to both the possibilities and the limitations of computational methods in historical research.
Slide 10
1: Methodology
- Year 1, historians' focus: how reliable and representative are the texts from this particular dataset? Which questions can and cannot be answered? How well do tools perform compared to a real historian? See also the publications (below).
- Year 1, interdisciplinary focus: what is the provenance of the information, how is it manipulated to arrive at the answer to a query, and who is responsible for the tools that manipulate the data?
Slide 11
2: Use Cases
12 cases developed, ranging from simple to highly complex
- Simple: group analysis of Governors-General of the Dutch Indies
- More complex: when did Dutch elites get involved with the New World?
- Complex: what can we say about nationalism in biographical dictionaries from the nineteenth and twentieth centuries?
Slide 12
Governors-General of the Dutch Indies
- Highest official in the Dutch Indies
- 1610-1949
- 71 men
- What can we say about these men as a group?
- Who was appointed, and what qualities did he need to have?
- Etc.
Slide 13
3: User-friendly interface
- Mainly work in progress.
- Discussion about the impact of a design metaphor (such as time line, house of, building blocks for, family tree) on the type of questions raised by the user.
- See the presentation by Niels.
Slide 14
The House of History
Slide 15
Time line
Slide 16
Family Tree
Slide 17
4: Sensitize target user groups
- Publication in Tijdschrift voor Biografie (reaching the nearest target user group of the demonstrator): Serge ter Braake, Het individu en zijn tijdgenoten. Wat een biograaf kan doen met prosopografie en biografische woordenboeken, Tijdschrift voor Biografie 2 (summer 2013) vol. 2, 52-61.
- Biography and Computational Methods, joint paper in preparation (Ter Braake, Ockeloen and Fokkens), to be submitted before the end of the month to the Journal for Historical Biography.
- Research on nationalism and national biographies, to be published in 2014.
Slide 18
4: Sensitize target user groups
- Presentation at Huygens ING, 10 October 2013 (for circa 50 professional historians)
- Presentation on provenance at the KNAW Digital Humanities Workshop, 14-15 November 2013
- Introduction to e-Humanities in the current curriculum of BA1 students at the Vrije Universiteit (what is e-Humanities, and how does one use a source like the Oxford Dictionary of National Biography?)
- Design and development of a series of electives and a minor on e-history and e-humanities (BA 2-3; starting 2014/2015). The BiographyNet dataset will be used in a lab for history bachelor students.
Slide 19
BiographyNet: Towards the demonstrator
Slide 20
Overview
Main components of the demonstrator:
- Schema to structure the data
- Conversion of the Biography Portal (BP) to Linked Data
- NLP system setup
- Interface
Slide 21
A crash course on Linked Data
- Online machine-readable data with links
- Simple facts called RDF triples: Thorbecke > hasBirthPlace > Zwolle
- Some technology concepts:
  - Schemas: to structure Linked Data (LD)
  - RDF stores: to store LD
  - SPARQL: to access LD
- Huge growth in the past years:
  - More than 300 data sources
  - More than 30 billion triples
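The triple idea can be sketched in a few lines of plain Python. This is only an illustration and does not use an RDF store or SPARQL: triples are tuples, and the helper `match` mimics a single triple pattern with wildcards. The predicates `hasBirthYear` and `locatedIn` and the function name are assumptions made for the example; only `Thorbecke > hasBirthPlace > Zwolle` comes from the slide.

```python
# Triples as plain (subject, predicate, object) tuples.
# Only the first triple is taken from the slide; the others are illustrative.
triples = [
    ("Thorbecke", "hasBirthPlace", "Zwolle"),
    ("Thorbecke", "hasBirthYear", "1798"),
    ("Zwolle", "locatedIn", "Netherlands"),
]


def match(store, s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# "Who was born in Zwolle?" expressed as a single triple pattern.
born_in_zwolle = [s for s, _, _ in match(triples, p="hasBirthPlace", o="Zwolle")]
print(born_in_zwolle)
```

A SPARQL query against an RDF store performs the same kind of pattern matching, but over many linked sources and with joins between patterns.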
Slide 22
The conversion process
- Purely syntactic conversion
- Preserve the original structure of the data
- Prevent loss of information
- Allow for reinterpretation of the original data in the future
Slide 23
The conversion process
Conversion steps:
- Retrieval of the XML dump of the Biography Portal
- Initial conversion to crude RDF, using ClioPatria and the XMLRDF tool for ClioPatria
- RDF restructuring
- Linking to other sources, an essential step in the Linked Data philosophy
Slide 24
The conversion process
Data schema:
- Based on the structure of the original XML files
- Needs to facilitate the coupling of different biographies of the same person, without compromising the original data
- Needs to facilitate the incorporation of several enrichments, following from NLP, entity reconciliation, etc.
- Compatible with existing schemas such as the Europeana Data Model, PROV, P-PLAN, DC terms, etc.
Provenance: What is it?
- Provenance information is information on how entities come into existence
- What are entities?
  - Documents, articles, pictures, etc.
  - Basically anything that can be produced by something or someone
- What kind of information?
  - Who did what?
  - Using which entities?
  - In which processes?
Slide 27
Provenance in BiographyNet
For the demonstrator, provenance needs to be modeled:
- From several perspectives:
  - Information involved
  - Processes involved
  - People involved
- At multiple levels:
  - An aggregated level, i.e. per enrichment
  - A detailed level, i.e. all individual processes
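The two levels can be illustrated with a small Python sketch. It is not the BiographyNet schema: the `Activity` class, the three pipeline steps and the tool names are hypothetical, chosen only to show how a detailed chain of processes can be collapsed into one aggregated record per enrichment.

```python
from dataclasses import dataclass, field


@dataclass
class Activity:
    name: str
    agent: str                                      # person or tool responsible
    used: list = field(default_factory=list)        # input entities
    generated: list = field(default_factory=list)   # output entities


# Detailed level: every individual process behind one enrichment
# (hypothetical steps and tool names).
steps = [
    Activity("tokenize", "tokenizer tool", ["NNBW biography text"], ["tokens"]),
    Activity("recognize entities", "NER tool", ["tokens"], ["entity mentions"]),
    Activity("extract birth event", "NLP tool", ["entity mentions"], ["birth event"]),
]


def aggregate(activities):
    """Aggregated level: collapse a chain of steps into one enrichment record."""
    produced = {e for a in activities for e in a.generated}
    consumed = {e for a in activities for e in a.used}
    return {
        "original_sources": sorted(consumed - produced),  # never produced inside the chain
        "final_outputs": sorted(produced - consumed),     # never consumed inside the chain
        "agents": sorted({a.agent for a in activities}),
    }


summary = aggregate(steps)
print(summary)
```

The aggregated record answers the questions about sources and responsible agents, while the full `steps` list keeps what is needed to inspect each process.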
Slide 28
Why is provenance info important for BiographyNet?
- Needed to ensure the credibility of the demonstrator, to evaluate its performance and to improve the academic status of the tool
- Historians need to be able to validate results
  - Replication: retrieving the same results later using the demonstrator
  - Reproducibility: manually, by the historian
- The aggregated level
  - Targeted at the historian
  - Which original sources were involved?
  - Who to contact in case results are called into question?
- The detailed level
  - Targeted at the computer scientist
  - Detailed information on each individual step
  - Allows for debugging the internal processing pipeline
Slide 29
BiographyNet Enrichment example
[Diagram. Labels: Thorbecke; Biographical Description; Provenance Meta Data (NNBW); Person Meta Data (Thorbecke); Biography Parts; Event: Birth, 1798; Enrichment (NLP Tool) with Person Meta Data and Event: Birth, Zwolle, 1798-01-14. Text shown in the diagram, cut off in the original: "Johan Rudolph Thorbecke werd in 1798 geboren op 14 januari in Zwolle en komt uit een half-Duitse" ("Johan Rudolph Thorbecke was born in 1798 on 14 January in Zwolle and comes from a half-German").]
Slide 30
More than just Provenance
- P-PLAN is used to model not only what actually happened, but also what was supposed to happen
- Plans describe the original idea behind an activity
  - Describe what should happen in a certain activity
  - Each Plan corresponds with an Activity
- Variables describe the input/output of an activity
  - Structure, format, quantity, etc.
  - Each Variable corresponds with an input/output Entity of an Activity
- Plans have their own provenance info
  - E.g. who was responsible for the creation of a plan?
Slide 31
Why model plans besides provenance?
The benefits of modeling plans:
- Forces the recording of what an activity and its input/output should look like
- Provides information on the original idea behind an activity
  - As such, can provide info on possible assumptions and biases
- Allows for comparing the actual activity and its input/output with the original plan and its variables
  - Do they differ from each other, and to what extent?
  - Makes finding errors much easier, as more information is available about what the input/output should look like
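The comparison between a plan and what actually happened can be sketched as follows. This is a minimal illustration in plain Python, not P-PLAN itself: the classes `Variable` and `Entity`, the function `deviations` and all names and counts are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Variable:
    """What the plan says an input/output should look like."""
    name: str
    expected_format: str
    expected_count: int


@dataclass
class Entity:
    """What an activity actually used or produced."""
    name: str
    format: str
    count: int


def deviations(plan_variables, actual_entities):
    """Compare each planned variable with the entity that realised it."""
    actual = {e.name: e for e in actual_entities}
    problems = []
    for var in plan_variables:
        entity = actual.get(var.name)
        if entity is None:
            problems.append(f"{var.name}: missing")
            continue
        if entity.format != var.expected_format:
            problems.append(
                f"{var.name}: format {entity.format}, expected {var.expected_format}"
            )
        if entity.count != var.expected_count:
            problems.append(
                f"{var.name}: count {entity.count}, expected {var.expected_count}"
            )
    return problems


# Illustrative numbers only: the plan expects one birth event per biography.
plan = [Variable("biographies", "XML", 10), Variable("birth_events", "RDF", 10)]
run = [Entity("biographies", "XML", 10), Entity("birth_events", "RDF", 8)]
issues = deviations(plan, run)
print(issues)
```

Because the plan states what the output should look like, the two missing birth events show up as a deviation instead of going unnoticed.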
Slide 32
BiographyNet: Schema illustration
Slide 33
[Schema diagram. Node labels: Activity, Plan, Entity, Variable, Agent, Association, Person, NLP Tool]
Slide 34
Recap / Current Status
Main components of the demonstrator:
- Initial schema available (publication LISC @ISWC 2013)
  - Schema models enrichments and aggregations alongside original sources
  - Allows for storing various levels of provenance information
  - Model will be adapted while progressing with building the demonstrator
- Initial conversion to Linked Data available
  - Structured according to the schema presented
  - Next step is linking to external sources
- NLP system setup available (Antske)
- Interface
  - Presentation of general outline and ideas
Slide 35
Interface: Focus
- The interface should be easy to use
- The demonstrator should inspire historians to undertake new research and give direction, rather than being the closing factor in their research
- The interface should allow users to fine-tune the results returned upon an initial action
Slide 36
Interface: Options
- Query composition
- Faceted browsing
- A combination
Slide 37
Interface: Query composition
- Drop-down boxes to select verbs, data elements and relations
Slide 38
Interface: Faceted browsing
- No explicit querying, but convergence of the data through browsing and selecting
- Provides better feedback to the user
- Allows for more direct and easier adjustment of the selected data
Slide 39
Slide 40
Interface: A combination
- Query composition combined with faceted browsing
- Create new facets by defining a query
  - The result of the query is available as a subset of the data by selecting the defined facet
  - As such, combinable with other facets
- Method to integrate open querying of the data into a general interface and visualization
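One way to read the combination is that a query becomes just another facet. The Python sketch below illustrates that reading under stated assumptions: the records, field names and the helpers `facet`, `query_facet` and `select` are hypothetical and are not the demonstrator's actual data model or interface code.

```python
# Hypothetical records; only Thorbecke's birth place and year come from the slides.
people = [
    {"name": "Thorbecke", "birth_place": "Zwolle", "birth_year": 1798, "occupation": "politician"},
    {"name": "Person B", "birth_place": "Zwolle", "birth_year": 1850, "occupation": "politician"},
    {"name": "Person C", "birth_place": "Leiden", "birth_year": 1790, "occupation": "painter"},
]


def facet(field_name, value):
    """An ordinary facet: select records whose field equals a value."""
    return lambda record: record.get(field_name) == value


def query_facet(predicate):
    """A facet defined by a query: any yes/no test over a record."""
    return predicate


def select(records, *facets):
    """Keep the records that satisfy every selected facet."""
    return [r for r in records if all(f(r) for f in facets)]


# A user-defined query turned into a facet, then combined with a regular facet.
born_before_1800 = query_facet(lambda r: r["birth_year"] < 1800)
result = select(people, facet("occupation", "politician"), born_before_1800)
print([r["name"] for r in result])
```

Since both kinds of facet are plain predicates over a record, a query-defined facet can be switched on and off next to the ordinary ones.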
Slide 41
Question Analysis Selection Process Results Data Facets
Slide 42
Interface: Demonstrator
- Time and place are primary elements
- Results?
Slide 43
Slide 44
BiographyNet: Text Mining
Slide 45
First year goals for Text Mining
- Methodology
  - Requirements
  - Approach
- Basic system for data enrichment in text
  - Identify metadata in text
  - Setup that can easily be improved and extended
- (Co-referenced) named entities, events
- Deal with alternative spelling
Slide 46
Methodology Requirements
- Reproducing results in Natural Language Processing is non-trivial
- Details in implementations or experimental setup can influence results up to a point where they tell a different story
Slide 47
Reproducing results
Example: performance of WordNet similarity scores compared to human ranking
Slide 48
Reproducing results
- Clear registration of all steps involved and storage of (intermediate) system output can improve reproducibility
- Systematic testing can help to gain insight into the variation in the outcome of our systems and hence lead to more insight into their performance
Antske Fokkens, Marieke van Erp, Marten Postma, Ted Pedersen, Piek Vossen and Nuno Freire (2013). Offspring from Reproduction Problems: What Replication Failure Teaches Us. In: Proceedings of ACL 2013, Sofia, Bulgaria, August 2013.
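Registering every step and keeping intermediate output could look roughly like the sketch below. It is an assumption-laden illustration, not the project's pipeline: the function `run_step`, the step names and the version strings are made up, and the two toy steps stand in for real NLP tools.

```python
import hashlib
import json


def run_step(log, name, version, func, data):
    """Run one pipeline step and record its version, input hash and output."""
    output = func(data)
    log.append({
        "step": name,
        "version": version,
        "input_sha256": hashlib.sha256(json.dumps(data).encode("utf-8")).hexdigest(),
        "output": output,  # intermediate output is stored, not thrown away
    })
    return output


log = []
text = "Johan Rudolph Thorbecke werd in 1798 geboren op 14 januari in Zwolle"

# Two toy steps standing in for real tools (hypothetical names and versions).
tokens = run_step(log, "tokenize", "0.1", str.split, text)
lowered = run_step(log, "lowercase", "0.1", lambda ts: [t.lower() for t in ts], tokens)

for entry in log:
    print(entry["step"], entry["version"], entry["input_sha256"][:12])
```

With such a log, two runs that give different results can be compared step by step to find where the tool versions, inputs or intermediate outputs start to diverge.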
Slide 49
Methodology requirements
- The method used to extract information may introduce a bias that has an unintended influence on the outcome of the historian's questions
- For example: location identification with GeoNames
  - Heuristic: when multiple locations share the same name, take the one in or closest to the Netherlands
  - High precision, but consider 'America' and 'Willemstad': what if the historian investigates trips to the Netherlands by officials overseas?
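The heuristic and its bias can be made concrete with a short Python sketch. It does not call GeoNames: the gazetteer is a hand-made dictionary with approximate coordinates, and the reference point for the Netherlands, the function names and the `reference` parameter are assumptions for illustration.

```python
import math

NETHERLANDS = (52.1, 5.3)  # rough centre of the country, an assumption


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


# Hand-made stand-in for gazetteer lookups; coordinates are approximate.
gazetteer = {
    "Willemstad": [
        ("Willemstad, Noord-Brabant", (51.69, 4.44)),
        ("Willemstad, Curacao", (12.11, -68.93)),
    ],
}


def resolve(name, candidates, reference=NETHERLANDS):
    """Pick the candidate closest to the reference point."""
    return min(candidates[name], key=lambda c: haversine_km(c[1], reference))[0]


default_choice = resolve("Willemstad", gazetteer)
overseas_choice = resolve("Willemstad", gazetteer, reference=(12.2, -69.0))
print(default_choice, "|", overseas_choice)
```

With the default reference point every 'Willemstad' resolves to the town in Noord-Brabant, which is exactly the bias that would distort a study of officials travelling from overseas. Exposing the reference point as a parameter, and recording it in the provenance, makes that choice visible.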
Slide 50
Methodology requirements
- Maximize reuse of existing tools for BiographyNet
- Maximize reuse, by other researchers, of tools developed within BiographyNet
- How can we create a setup that facilitates this?
Slide 51
Methodology approach
- Provenance modeling:
  - Can help to improve reproducibility of research
  - Can support systematic testing
  - Can model the exact steps taken
- Flexible formats that support this:
  - NLP Annotation Format (NAF) to manage the output and input of NLP tools
  - Grounded Annotation Framework (GAF) for the final output of the NLP pipeline
Slide 52
NLP Annotation Format
- Sustainable, because it is close to existing linguistic formats (e.g. LAF, GRAF, NIF)
- Joint work across projects and with other institutes (notably the University of the Basque Country and Fondazione Bruno Kessler)
- Flexible, because the output of individual tools is added in separate layers
Slide 53
Grounded Annotation Framework
- RDF-compliant framework
- Introduces the denotedBy relation, which links mentions in text to formal representations of their instances
- Provenance is marked using Named Graphs
- This allows us to accumulate information from different sources and represent alternative perspectives
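The role of named graphs can be sketched with quads, i.e. triples that carry the name of the graph they belong to. The Python below is a plain-data illustration and not GAF itself: every identifier (the `graph:`, `event:` and `mention:` names, the `gaf:` prefix, `hasPlace`, `hasDate`) and the provenance table are hypothetical; only the `denotedBy` relation and the idea of marking provenance per graph come from the slide.

```python
# Quads: (graph, subject, predicate, object). All identifiers are hypothetical.
quads = [
    ("graph:nlp_run_1", "event:birth_thorbecke", "gaf:denotedBy", "mention:nnbw_sentence_1"),
    ("graph:nlp_run_1", "event:birth_thorbecke", "hasPlace", "Zwolle"),
    ("graph:portal_metadata", "event:birth_thorbecke", "hasPlace", "Zwolle"),
    ("graph:portal_metadata", "event:birth_thorbecke", "hasDate", "1798-01-14"),
]

# Provenance is attached to the graph, not to each individual statement.
provenance = {
    "graph:nlp_run_1": "NLP tool",
    "graph:portal_metadata": "Portal metadata",
}


def sources_for(store, s, p, o):
    """Which sources assert this statement?"""
    return sorted(provenance[g] for g, s2, p2, o2 in store if (s2, p2, o2) == (s, p, o))


def mentions_of(store, instance):
    """Text mentions that denote the given instance."""
    return [o for _, s, p, o in store if s == instance and p == "gaf:denotedBy"]


place_sources = sources_for(quads, "event:birth_thorbecke", "hasPlace", "Zwolle")
print(place_sources)
print(mentions_of(quads, "event:birth_thorbecke"))
```

Because each statement keeps its graph, the same birth place asserted by both sources is accumulated rather than overwritten, and a conflicting statement from a third source could sit alongside it as an alternative perspective.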
Slide 54
Slide 55
Provenance Modeling
- It must be clear where information comes from (original source, opinion holder, automatically retrieved or from metadata)
- For NLP research: model each step of the process
  - Resources used (preprocessing + version), system output
- For historical research:
  - What may introduce biases?
  - How can the process be represented in an understandable manner?
Slide 56
Basic System
Identifying metadata in text:
- Linguistically naive supervised machine learning
- Linguistic processing:
  - Named entity recognition (time and location)
  - Concept identification
Slide 57
First Evaluation
- Use case: Governors-General of the Dutch Indies
- 129 biographies describing 71 individuals
- Serge ter Braake extracted the information manually
Slide 58
Metadata versus text mining
Slide 59
Preliminary outcome of text mining
[Table flattened in extraction; the boundaries between the numeric cells are not recoverable, so the digit runs are kept as found.]
Columns: Category, Correct, Incorrect, Both, Correct text, Incorrect text
Education 202
Father 0092
Mother 0125
Occupation 1462214
Birthdate 212359
Recall problems (for birthdate):
1. Sentence not found (35): typical for wikipedia, bwn, vdaa
2. Value not found (7)
3. Wrong sentence (1), wrong date (1): date of marriage, date of death
Slide 60
Observations
- Recall problems (for birthdate): sentence identification
- Easy ways to improve:
  - Parents: named entity recognition
  - Occupation, Education: concept-tagged corpus
  - Source-specific training
- More difficult problems:
  - Relations, functions of other people
  - Negations or factuality (e.g. refused positions for occupations)
Slide 61
NLP outlook
- Evaluation: text-based annotations
- Metadata extraction:
  - Supervised with linguistically rich features
  - Rule-based approaches
- Beyond metadata:
  - Time lines of people's lives (2nd year)
  - Networks between people (2nd year)
  - Complex event modeling (3rd year)
Slide 62
Questions?
http://www.biographynet.nl/