Sasaki – LLD Datathon – Cercedilla, Spain, May 2015 1
Roundtripping of NIF based Linguistic Linked Data with non linked data sources
Felix Sasaki, DFKI / W3C Fellow
Slides: http://de.slideshare.net/atcfsenzoku/sasaki-datathonmadrid2015
What is NIF?
• Natural Language Processing Interchange Format
  – See http://nlp2rdf.org/
• LLD format to store annotations & to organize NLP pipelines
• API specification to create NIF workflows
• More details: after the coffee break
• Following slides: main roles for NIF
Example (Partial; JSON-LD Syntax)
{
  "@graph" : [ {
    "@id" : "p:char=0,18",
    "@type" : [ "nif:Context", "nif:Sentence", "nif:RFC5147String" ],
    "anchorOf" : "Welcome to Prague.",
    "beginIndex" : "0",
    "endIndex" : "18",
    "isString" : "Welcome to Prague.",
    "referenceContext" : "p:char=0,18"
  }, {
    "@id" : "p:char=11,17",
    "@type" : [ "nif:RFC5147String", "nif:Word" ],
    …
    "referenceContext" : "p:char=0,18",
    "taIdentRef" : "http://dbpedia.org/resource/Prague"
  }, … ]
}
• Identifying and typing annotations
• Identifying annotation offsets
• Adding additional knowledge, e.g. named entity identifier
• Interrelating annotations
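The four roles listed above can be traced in a small Python sketch that rebuilds the two nodes of the example. This is only an illustration, not a NIF library: the helper name `nif_string` and its parameters are assumptions; the property names and values are the ones shown in the example.

```python
import json

def nif_string(prefix, text, begin, end, types, context_id, **extra):
    """Build one NIF-style node for text[begin:end].

    Hypothetical helper for illustration only; the property names
    follow the example, the function itself is an assumption.
    """
    node = {
        "@id": f"{prefix}char={begin},{end}",   # identifying the annotation
        "@type": types,                         # typing the annotation
        "anchorOf": text[begin:end],
        "beginIndex": str(begin),               # annotation offsets
        "endIndex": str(end),
        "referenceContext": context_id,         # interrelating annotations
    }
    node.update(extra)                          # additional knowledge
    return node

text = "Welcome to Prague."
context_id = f"p:char=0,{len(text)}"

context = nif_string(
    "p:", text, 0, len(text),
    ["nif:Context", "nif:Sentence", "nif:RFC5147String"],
    context_id, isString=text,
)

begin = text.index("Prague")
word = nif_string(
    "p:", text, begin, begin + len("Prague"),
    ["nif:RFC5147String", "nif:Word"],
    context_id,
    taIdentRef="http://dbpedia.org/resource/Prague",
)

print(json.dumps({"@graph": [context, word]}, indent=2))
```

The word node points back to the sentence through `referenceContext`, which is how annotations on the same text stay connected.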
A NIF workflow
Existing content
Content analytics, e.g. named entity recognition
Conversion to NIF
Deploying knowledge from the LLD cloud
Potential scenario: roundtripping
Existing content
Content analytics, e.g. named entity recognition
Conversion to NIF
Storing annotations in original content
Deploying knowledge from the LLD cloud
Roundtripping
• Roundtripping: storing the outcome of content processing (analytics) tasks in the original content
• Not always needed, but sometimes – examples:
  – Enriching Web content with named entity information; generating Schema.org markup via NIF pipelines. Format: HTML
  – Enriching localisation content, to add value beyond translation. Format: XLIFF
Example: HTML
Example roundtripping workflow
Input: … <p>Welcome to Prague!</p> …
Output: … <p>Welcome to <span …itemtype="http://schema.org/Place">Prague</span>!</p> …
1) Conversion to NIF
2) NER processing
3) Back conversion to HTML
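Step 3, the back conversion, can be sketched in Python as follows. The sketch assumes the NER result is already available as character offsets (as in NIF) and that the paragraph contains plain text only. The function name `annotate_html` is hypothetical, and the `itemscope` attribute is an assumption of this sketch, not something taken from the slide.

```python
from html import escape

def annotate_html(text, annotations):
    """Wrap annotated character ranges of plain text in <span> elements.

    annotations: list of (begin, end, type_uri) tuples.
    Overlapping ranges are rejected; this sketch does not handle them.
    """
    out, pos = [], 0
    for begin, end, type_uri in sorted(annotations):
        if begin < pos:
            raise ValueError("overlapping annotations are not supported")
        out.append(escape(text[pos:begin]))
        out.append(
            f'<span itemscope itemtype="{escape(type_uri)}">'
            f"{escape(text[begin:end])}</span>"
        )
        pos = end
    out.append(escape(text[pos:]))
    return "".join(out)

text = "Welcome to Prague!"
result = "<p>" + annotate_html(text, [(11, 17, "http://schema.org/Place")]) + "</p>"
print(result)
```

Working from offsets rather than string matching keeps the back conversion independent of how the entity was found.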
Example: XLIFF
Example roundtripping workflow
Input: … <xlf:source>Welcome to Prague!</xlf:source> …
Output: … <xlf:source>Welcome to <mrk …its:taClassRef="http://schema.org/Place">Prague</mrk>!</xlf:source> …
1) Conversion to NIF
2) NER processing
3) Back conversion to XLIFF
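The back conversion for the XLIFF case can be sketched with Python's standard `xml.etree.ElementTree`. Several things here are assumptions: the XLIFF 1.2 namespace is chosen for illustration, the helper `add_mrk` is hypothetical, the `<source>` element is assumed to contain text only, and other attributes that an XLIFF schema may require on `<mrk>` are left out.

```python
import xml.etree.ElementTree as ET

XLF = "urn:oasis:names:tc:xliff:document:1.2"  # assumption: XLIFF 1.2
ITS = "http://www.w3.org/2005/11/its"          # ITS namespace
ET.register_namespace("xlf", XLF)
ET.register_namespace("its", ITS)

def add_mrk(source, begin, end, class_ref):
    """Wrap source.text[begin:end] in an <mrk> with its:taClassRef.

    Sketch only: refuses input that already contains inline markup.
    """
    if len(source):
        raise ValueError("existing inline markup is not handled here")
    text = source.text or ""
    mrk = ET.SubElement(source, f"{{{XLF}}}mrk")
    mrk.set(f"{{{ITS}}}taClassRef", class_ref)
    source.text = text[:begin]   # text before the entity
    mrk.text = text[begin:end]   # the entity itself
    mrk.tail = text[end:]        # text after the entity
    return mrk

source = ET.fromstring(
    f'<xlf:source xmlns:xlf="{XLF}">Welcome to Prague!</xlf:source>'
)
add_mrk(source, 11, 17, "http://schema.org/Place")
print(ET.tostring(source, encoding="unicode"))
```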
Example usage scenario: FREME project
• See http://www.freme-project.eu/
• Developing interfaces for multilingual and semantic enrichment of digital content
• Relies on NIF based enrichment workflows
  – See FREME API version 0.1 http://api.freme-project.eu/doc/0.1/
• Deploys aspects of the LIDER reference architecture for LLD processing
  – See D3.1.1 at http://lider-project.eu/?q=doc/deliverables
• Focuses on four business cases
  – Localization BC requires XLIFF roundtripping
  – Web content personalisation BC requires HTML roundtripping
Challenges for roundtripping
• Source format
  – How to store enrichment information (annotations)
  – How to handle existing information
• Annotation model
  – NIF = a general graph-based annotation model
  – Source format and annotation motivation may require restriction of the model
How to store annotations in various source formats
• Solvable for markup languages like HTML or XLIFF
• Challenge to preserve existing markup: “<p>Welcome to <b>Prague</b>!</p>”
• General issue with complex and proprietary formats:
  – “My own” storage mechanism = no tool support
  – Using existing storage mechanisms may mean: overloading semantics
Source format example: Word
Content storage (Word file unzipped):
… <w:t>Welcome to Prague!</w:t> …
Enrichment process; storing enrichment as comments
Content storage:
… <w:commentRangeStart w:id="0"/><w:t>Prague</w:t><w:commentRangeEnd w:id="0"/><w:r w:rsidR="00987079"> …
Change of original content: creation of anchor
Comment storage:
<w:p w:rsidRPr="00987079">… Enrichment: type "http://schema.org/Place"…</w:p>
Comment stored separately; refers to anchor: “standoff approach”
Annotation models
• NIF: like RDF = general graph model
  – Consisting of nodes and arcs
Example: p:char=11,17 --taIdentRef--> dbp:Prague
Restricting graphs: Tree structured annotations on several layers
• Tree structures for syntactic annotations
• Several annotation layers for the same text
• Concurrent hierarchies
• Roundtripping with XML can represent only one of these hierarchies
Example taken from TEI http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html
Representing overlapping hierarchies with markup (1/2)
Solutions advertised by the TEI
• Multiple encoding of the same information
  – One XML document per annotation
• Boundary marking with empty “milestone” elements
  – Also used by XLIFF
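The milestone approach can be sketched in Python as follows. Two annotations that overlap (characters 0–17 and 11–18) could not both be written as nested elements, but marking their boundaries with empty elements still gives well-formed XML. The element names `<s>`, `<start/>` and `<end/>` and the function name `with_milestones` are invented for this sketch; they are not TEI or XLIFF names.

```python
from html import escape

def with_milestones(text, annotations):
    """Mark annotation boundaries with empty start/end elements.

    annotations: list of (ann_id, begin, end) tuples; ranges may overlap.
    Element names are made up for this illustration.
    """
    events = []
    for ann_id, begin, end in annotations:
        events.append((begin, 1, f'<start id="{ann_id}"/>'))
        events.append((end, 0, f'<end id="{ann_id}"/>'))
    events.sort()  # at equal offsets, end markers sort before start markers

    out, pos = [], 0
    for offset, _, marker in events:
        out.append(escape(text[pos:offset]))
        out.append(marker)
        pos = offset
    out.append(escape(text[pos:]))
    return "<s>" + "".join(out) + "</s>"

text = "Welcome to Prague!"
marked = with_milestones(text, [("a1", 0, 17), ("a2", 11, 18)])
print(marked)
```

The price is that the annotated span is no longer an element of its own, so tools that work on the element hierarchy have to pair the markers themselves.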
Representing overlapping hierarchies with markup (2/2)
Solutions advertised by the TEI
• Fragmentation and reconstitution of virtual elements
  – One hierarchy explicit, others with interrelated marked-up spans
• Stand-off markup
  – Separation of text and annotations, interlinked via anchor and reference
  – Cf. Word example
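A minimal Python sketch of the stand-off idea: the text is stored once and never modified, and annotations live in a separate structure that refers back to it. As a simplification, character offsets play the role of the anchor here (the Word example instead inserts explicit anchor elements). The record layout, the `"sentence"` type label and the function name `resolve` are assumptions made for illustration.

```python
text = "Welcome to Prague!"  # source text, never modified

# Stand-off layer: each record refers to the text by character offsets.
standoff = [
    {"id": "a1", "begin": 11, "end": 17, "type": "http://schema.org/Place"},
    {"id": "a2", "begin": 0, "end": 18, "type": "sentence"},
]

def resolve(text, annotation):
    """Return the stretch of text that a stand-off annotation refers to."""
    return text[annotation["begin"]:annotation["end"]]

for annotation in standoff:
    print(annotation["id"], repr(resolve(text, annotation)))
```

Because the annotations are not nested in the text, overlapping or concurrent layers can coexist without any conflict.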
Representing overlapping hierarchies in RDF
POWLA (cf. Chiarcos, 2012)
• RDF representation for corpus annotation, based on PAULA XML Standoff format
• Allows representing hierarchical, multi-layer corpora in RDF and querying them in SPARQL
• Not relevant for roundtripping, but for linguistic annotation representation and processing in RDF
Lessons learned
• Choose the overlap solution that fits your roundtripping modelling and processing needs
• Consider off-the-shelf tooling
  – For 100% hierarchical data: XPath / CSS selectors, DOM, …
• Consider libraries
  – For extraction only: Tika http://tika.apache.org/
  – For roundtripping: Okapi http://okapi.opentag.com/ - in FREME currently being adapted for roundtripping in selected formats
• Make sure the annotation survives in the original format – cf. Word example
  – Soon to be made easier by using Okapi
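For purely hierarchical data, checking that an annotation survived the roundtrip can be done with off-the-shelf tooling. The Python sketch below uses the limited XPath support of the standard `xml.etree.ElementTree` module. The input string is a made-up roundtripping result in XML-compatible syntax, not output from Okapi or FREME.

```python
import xml.etree.ElementTree as ET

# Hypothetical roundtripping result, written so that it parses as XML.
doc = ET.fromstring(
    '<p>Welcome to <span itemscope="" '
    'itemtype="http://schema.org/Place">Prague</span>!</p>'
)

# ElementTree supports a subset of XPath, including attribute predicates.
places = doc.findall(".//span[@itemtype='http://schema.org/Place']")
surviving = [element.text for element in places]
print(surviving)
```

The same check could be written with CSS selectors or the DOM. What matters is that the annotation can be found again in the original format with standard tools.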