Sasaki – LLD Datathon – Cercedilla, Spain, May 2015

Roundtripping of NIF based Linguistic Linked Data with non linked data sources

Felix Sasaki
DFKI / W3C Fellow

Slides: http://de.slideshare.net/atcfsenzoku/sasaki-datathonmadrid2015

What is NIF?

• Natural Language Processing Interchange Format
  – See http://nlp2rdf.org/
• LLD format to store annotations & to organize NLP pipelines
• API specification to create NIF workflows
• More details: after the coffee break
• Following slides: main roles for NIF

Example (Partial; JSON-LD Syntax)

{ "@graph" : [
  { "@id" : "p:char=0,18",
    "@type" : [ "nif:Context", "nif:Sentence", "nif:RFC5147String" ],
    "anchorOf" : "Welcome to Prague.",
    "beginIndex" : "0",
    "endIndex" : "18",
    "isString" : "Welcome to Prague.",
    "referenceContext" : "p:char=0,18" },
  { "@id" : "p:char=11,17",
    "@type" : [ "nif:RFC5147String", "nif:Word" ],
    …
    "referenceContext" : "p:char=0,18",
    "taIdentRef" : "http://dbpedia.org/resource/Prague" },
  …
] }

• Identifying and typing annotations

• Identifying annotation offsets

• Adding additional knowledge, e.g. named entity identifier

• Interrelating annotations
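The four roles listed above can be made concrete with a small Python sketch that rebuilds the two nodes of the example. This is an illustration only, not a NIF library: the helper names `nif_id`, `context_node` and `word_node` are hypothetical, and the `anchorOf`, `beginIndex` and `endIndex` properties on the word node are an assumption, since the slide elides part of that node.

```python
import json

def nif_id(begin, end, prefix="p:"):
    """Build an offset-based identifier such as p:char=11,17 (hypothetical helper)."""
    return f"{prefix}char={begin},{end}"

def context_node(text):
    """Identify and type the whole text; it serves as the reference context."""
    node_id = nif_id(0, len(text))
    return {
        "@id": node_id,
        "@type": ["nif:Context", "nif:Sentence", "nif:RFC5147String"],
        "isString": text,
        "beginIndex": "0",
        "endIndex": str(len(text)),
        "referenceContext": node_id,
    }

def word_node(text, surface, entity_uri):
    """Annotate the first occurrence of `surface` and link it to an entity."""
    begin = text.index(surface)  # raises ValueError if the word is absent
    end = begin + len(surface)
    return {
        "@id": nif_id(begin, end),                      # identifying
        "@type": ["nif:RFC5147String", "nif:Word"],     # typing
        "anchorOf": surface,                            # assumed; elided on the slide
        "beginIndex": str(begin),                       # offsets (assumed, see above)
        "endIndex": str(end),
        "referenceContext": nif_id(0, len(text)),       # interrelating annotations
        "taIdentRef": entity_uri,                       # additional knowledge
    }

text = "Welcome to Prague."
graph = {"@graph": [
    context_node(text),
    word_node(text, "Prague", "http://dbpedia.org/resource/Prague"),
]}
print(json.dumps(graph, indent=2))
```

The offsets are character positions in the context string, so the word node only makes sense together with the context node it refers to.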

A NIF workflow

Existing content

Content analytics, e.g. named entity recognition

Conversion to NIF

Deploying knowledge from the LLD cloud

Potential scenario: roundtripping

Existing content

Content analytics, e.g. named entity recognition

Conversion to NIF

Storing annotations in original content

Deploying knowledge from the LLD cloud

Roundtripping

• Roundtripping: storing the outcome of content processing (analytics) tasks in the original content
• Not always needed, but sometimes; examples:
  – Enriching Web content with named entity information; generating Schema.org markup via NIF pipelines. Format: HTML
  – Enriching localisation content, to add value beyond translation. Format: XLIFF

Example: HTML

Example roundtripping workflow

Before:
… <p>Welcome to Prague!</p> …

After:
… <p>Welcome to <span …itemtype="http://schema.org/Place">Prague</span>!</p> …

1) Conversion to NIF
2) NER processing
3) Back conversion to HTML
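The three steps can be sketched in Python as follows. The sketch starts from the plain text already extracted from the paragraph, replaces the NER service with a hypothetical one-entry gazetteer (`ner_stub`), and writes the result back as a span. The attribute set `itemscope itemtype` is an assumption for illustration; the slide elides the remaining attributes of the span.

```python
import html

def ner_stub(text):
    """Hypothetical stand-in for an NER service: a fixed one-entry gazetteer."""
    gazetteer = {"Prague": "http://schema.org/Place"}
    found = []
    for surface, type_uri in gazetteer.items():
        begin = text.find(surface)
        if begin != -1:
            found.append((begin, begin + len(surface), type_uri))
    return found

def to_html(text, annotations):
    """Write non-overlapping (begin, end, type_uri) annotations back as spans."""
    out = []
    cursor = 0
    for begin, end, type_uri in sorted(annotations):
        out.append(html.escape(text[cursor:begin]))
        out.append(
            f'<span itemscope itemtype="{html.escape(type_uri)}">'
            f"{html.escape(text[begin:end])}</span>"
        )
        cursor = end
    out.append(html.escape(text[cursor:]))
    return "<p>" + "".join(out) + "</p>"

text = "Welcome to Prague!"      # step 1: text extracted from the paragraph
annotations = ner_stub(text)     # step 2: NER processing (stubbed)
annotated = to_html(text, annotations)  # step 3: back conversion to HTML
print(annotated)
```

Working on character offsets keeps the NER step independent of the source format; only the first and last step need to know about HTML.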

Example: XLIFF

Example roundtripping workflow

Before:
… <xlf:source>Welcome to Prague!</xlf:source> …

After:
… <xlf:source>Welcome to <mrk …its:taClassRef="http://schema.org/Place">Prague</mrk>!</xlf:source> …

1) Conversion to NIF
2) NER processing
3) Back conversion to XLIFF
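The back conversion step for XLIFF can be sketched with Python's standard `xml.etree.ElementTree`. The function name `annotate_source` is hypothetical, the XLIFF 1.2 namespace is an assumption about the version in use, and a real file would need further attributes on `mrk`, which the slide elides.

```python
import xml.etree.ElementTree as ET

XLF = "urn:oasis:names:tc:xliff:document:1.2"  # assumed: XLIFF 1.2
ITS = "http://www.w3.org/2005/11/its"          # ITS 2.0 namespace

# Use readable prefixes instead of generated ones when serializing.
ET.register_namespace("xlf", XLF)
ET.register_namespace("its", ITS)

def annotate_source(text, begin, end, class_uri):
    """Build a source element with one mrk around text[begin:end]."""
    source = ET.Element(f"{{{XLF}}}source")
    source.text = text[:begin]                  # text before the annotation
    mrk = ET.SubElement(source, f"{{{XLF}}}mrk")
    mrk.set(f"{{{ITS}}}taClassRef", class_uri)  # entity class from the NER step
    mrk.text = text[begin:end]                  # the annotated string
    mrk.tail = text[end:]                       # text after the annotation
    return source

source = annotate_source("Welcome to Prague!", 11, 17, "http://schema.org/Place")
print(ET.tostring(source, encoding="unicode"))
```

Concatenating all text nodes of the result gives back the original string, which is a cheap sanity check that the roundtrip did not change the content.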

Example usage scenario: FREME project

• See http://www.freme-project.eu/
• Developing interfaces for multilingual and semantic enrichment of digital content
• Relies on NIF based enrichment workflows
  – See FREME API version 0.1 http://api.freme-project.eu/doc/0.1/
• Deploys aspects of the LIDER reference architecture for LLD processing
  – See D3.1.1 at http://lider-project.eu/?q=doc/deliverables
• Focuses on four business cases
  – Localization BC requires XLIFF roundtripping
  – Web content personalisation BC requires HTML roundtripping

Challenges for roundtripping

• Source format
  – How to store enrichment information (annotations)
  – How to handle existing information
• Annotation model
  – NIF = a general graph-based annotation model
  – Source format and annotation motivation may require restriction of the model

How to store annotations in various source formats

• Solvable for markup languages like HTML or XLIFF

• Challenge to preserve existing markup: “<p>Welcome to <b>Prague</b>!</p>”
• General issue with complex and proprietary formats:
  – “My own” storage mechanism = no tool support
  – Using existing storage mechanisms may mean: overloading semantics
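The challenge of preserving existing markup can be illustrated with a minimal Python sketch. It maps plain-text offsets back into a string with inline tags and wraps the annotated range only if it lies inside a single text run. The function name `wrap_preserving_markup` is hypothetical, the regular-expression tokenizer is a simplification (no entities, comments or CDATA), and a production tool would use a real parser.

```python
import re

TAG = re.compile(r"(<[^>]+>)")  # simplistic tag matcher, for illustration only

def wrap_preserving_markup(markup, begin, end, open_tag, close_tag):
    """Wrap plain-text range [begin, end) without disturbing existing tags."""
    out = []
    pos = 0        # offset in the plain text seen so far
    done = False
    for token in TAG.split(markup):
        if TAG.fullmatch(token):
            out.append(token)  # existing markup is copied unchanged
            continue
        chunk_start, chunk_end = pos, pos + len(token)
        if not done and begin < end and chunk_start <= begin and end <= chunk_end:
            b, e = begin - chunk_start, end - chunk_start
            token = token[:b] + open_tag + token[b:e] + close_tag + token[e:]
            done = True
        out.append(token)
        pos = chunk_end
    if not done:
        raise ValueError("annotation crosses existing markup boundaries")
    return "".join(out)

source = "<p>Welcome to <b>Prague</b>!</p>"
open_tag = '<span itemscope itemtype="http://schema.org/Place">'  # assumed attributes
result = wrap_preserving_markup(source, 11, 17, open_tag, "</span>")
print(result)
```

An annotation on "to Prague" (offsets 8 to 17) would cross the boundary of the `b` element; the sketch raises an error there instead of guessing how to split the span.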

Source format example: Word

Enrichment process; storing enrichment as comments

Content storage (Word file unzipped):
… <w:t>Welcome to Prague!</w:t> …

Content storage; change of original content: creation of anchor:
… <w:commentRangeStart w:id="0"/><w:t>Prague</w:t><w:commentRangeEnd w:id="0"/><w:r w:rsidR="00987079"> …

Comment storage; comment stored separately, refers to anchor (“standoff approach”):
<w:p w:rsidRPr="00987079">… Enrichment: type "http://schema.org/Place"…</w:p>

Annotation models

• NIF: like RDF = general graph model
  – Consisting of nodes and arcs

Node p:char=11,17, arc taIdentRef, node dbp:Prague
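To show what "nodes and arcs" means in practice, the following Python sketch flattens one JSON-LD style node into (subject, predicate, object) triples. It is not a JSON-LD processor (no context expansion, no nested nodes); the function name `node_to_triples` is hypothetical.

```python
def node_to_triples(node):
    """Turn a flat JSON-LD style node into a list of (s, p, o) triples."""
    subject = node["@id"]
    triples = []
    for key, value in node.items():
        if key == "@id":
            continue
        predicate = "rdf:type" if key == "@type" else key
        values = value if isinstance(value, list) else [value]
        for obj in values:
            triples.append((subject, predicate, obj))
    return triples

word = {
    "@id": "p:char=11,17",
    "@type": ["nif:RFC5147String", "nif:Word"],
    "referenceContext": "p:char=0,18",
    "taIdentRef": "http://dbpedia.org/resource/Prague",
}
triples = node_to_triples(word)
for triple in triples:
    print(triple)
```

Each triple is one arc of the graph; nothing in the model forces these arcs to form a tree, which is what makes writing them back into XML based formats harder.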

Restricting graphs: Tree structured annotations on several layers

• Tree structures for syntactic annotations

• Several annotation layers for the same text

• Concurrent hierarchies

• Only one of these can be represented in roundtripping with XML

Example taken from TEI http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html

Representing overlapping hierarchies with markup (1/2)

Solutions advertised by the TEI
• Multiple encoding of the same information
  – One XML document per annotation
• Boundary marking with empty “milestone” elements
  – Also used by XLIFF

Representing overlapping hierarchies with markup (2/2)

Solutions advertised by the TEI
• Fragmentation and reconstitution of virtual elements
  – One hierarchy explicit, others with interrelated marked-up spans
• Stand-off markup
  – Separation of text and annotations, interlinked via anchor and reference
  – Cf. Word example
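The stand-off idea can be sketched in Python: the text stays untouched and each annotation refers to it by offsets, so spans on different layers may overlap freely. The class name `Standoff`, the layer names and the "phrase" values are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Standoff:
    begin: int   # start offset in the text
    end: int     # end offset (exclusive)
    layer: str   # annotation layer, e.g. "entity"
    value: str   # annotation payload

def resolve(text, annotation):
    """Follow the reference from an annotation back to the text it covers."""
    return text[annotation.begin:annotation.end]

def overlaps(a, b):
    """True if two spans cross without one containing the other."""
    return a.begin < b.begin < a.end < b.end or b.begin < a.begin < b.end < a.end

text = "Welcome to Prague!"
annotations = [
    Standoff(11, 17, "entity", "http://schema.org/Place"),
    Standoff(0, 10, "phrase", "greeting"),      # hypothetical layer and value
    Standoff(8, 17, "phrase", "destination"),   # crosses the previous span
]

for a in annotations:
    print(a.layer, repr(resolve(text, a)))
print(overlaps(annotations[1], annotations[2]))
```

Crossing spans such as "Welcome to" and "to Prague" cannot both be written as nested inline elements; kept stand-off, they coexist, and a roundtripping step can choose which layer to write back inline.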

Representing overlapping hierarchies in RDF

POWLA (cf. Chiarcos, 2012)
• RDF representation for corpus annotation, based on PAULA XML Standoff format
• Allows representing hierarchical, multi-layer corpora in RDF and querying them in SPARQL
• Not relevant for roundtripping, but for linguistic annotation representation and processing in RDF

Lessons learned

• Choose the overlap solution that fits your roundtripping modelling and processing needs

• Consider off-the-shelf tooling
  – For 100% hierarchical data: XPath / CSS selectors, DOM, …
• Consider libraries
  – For extraction only: Tika http://tika.apache.org/
  – For roundtripping: Okapi http://okapi.opentag.com/ (in FREME currently being adapted for roundtripping in selected formats)
• Make sure the annotation survives in the original format; cf. Word example
  – Soon to be made easier by using Okapi
