Sasaki datathon-madrid-2015

Sasaki – LLD Datathon – Cercedilla, Spain, May 2015

Roundtripping of NIF based Linguistic Linked Data with non linked data sources

Felix Sasaki, DFKI / W3C Fellow

Slides: http://de.slideshare.net/atcfsenzoku/sasaki-datathonmadrid2015


What is NIF?

• Natural Language Processing Interchange Format
  – See http://nlp2rdf.org/

• LLD format to store annotations & to organize NLP pipelines

• API specification to create NIF workflows

• More details: after the coffee break

• Following slides: main roles for NIF

http://nlp2rdf.org/


Example (Partial; JSON-LD Syntax)

{
  "@graph" : [
    {
      "@id" : "p:char=0,18",
      "@type" : [ "nif:Context", "nif:Sentence", "nif:RFC5147String" ],
      "anchorOf" : "Welcome to Prague.",
      "beginIndex" : "0",
      "endIndex" : "18",
      "isString" : "Welcome to Prague.",
      "referenceContext" : "p:char=0,18"
    },
    {
      "@id" : "p:char=11,17",
      "@type" : [ "nif:RFC5147String", "nif:Word" ],
      …
      "referenceContext" : "p:char=0,18",
      "taIdentRef" : "http://dbpedia.org/resource/Prague"
    },
    …
  ]
}



• Identifying and typing annotations

• Identifying annotation offsets

• Adding additional knowledge, e.g. named entity identifier

• Interrelating annotations
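
The four roles above can be illustrated with a minimal Python sketch. It is not the NIF reference implementation: the helper names `nif_id` and `build_nif` are hypothetical, the property names are copied from the partial example, and offsets are found with a plain `str.find`, so only the first occurrence of each mention is annotated.

```python
import json


def nif_id(prefix, begin, end):
    """Build an offset-based identifier such as p:char=11,17."""
    return f"{prefix}char={begin},{end}"


def build_nif(text, mentions, prefix="p:"):
    """Return a JSON-LD-like dict: one context node plus one node per mention.

    mentions is a list of (surface_form, entity_uri) pairs (assumed input).
    """
    context_id = nif_id(prefix, 0, len(text))
    graph = [{
        "@id": context_id,                                 # identifying
        "@type": ["nif:Context", "nif:RFC5147String"],     # typing
        "beginIndex": "0",
        "endIndex": str(len(text)),
        "isString": text,
    }]
    for surface, entity_uri in mentions:
        begin = text.find(surface)
        if begin == -1:
            continue  # mention not present in the text
        end = begin + len(surface)
        graph.append({
            "@id": nif_id(prefix, begin, end),
            "@type": ["nif:RFC5147String", "nif:Word"],
            "anchorOf": surface,
            "beginIndex": str(begin),                      # annotation offsets
            "endIndex": str(end),
            "referenceContext": context_id,                # interrelating
            "taIdentRef": entity_uri,                      # additional knowledge
        })
    return {"@graph": graph}


doc = build_nif("Welcome to Prague.",
                [("Prague", "http://dbpedia.org/resource/Prague")])
print(json.dumps(doc, indent=2))
```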








A NIF workflow

Existing content

Content analytics, e.g. named entity recognition

Conversion to NIF

Deploying knowledge from the LLD cloud


Potential scenario: roundtripping

Existing content

Content analytics, e.g. named entity recognition

Conversion to NIF

Storing annotations in original content

Deploying knowledge from the LLD cloud


Roundtripping

• Roundtripping: Storing the outcome of content processing (analytics) tasks in the original content

• Not always needed, but sometimes. Examples:
  – Enriching Web content with named entity information; generating Schema.org markup via NIF pipelines. Format: HTML
  – Enriching localisation content, to add value beyond translation. Format: XLIFF


Example: HTML

Example roundtripping workflow

… Welcome to Prague!…

…Welcome to Prague!…

1) Conversion to NIF
2) NER processing
3) Back conversion to HTML
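
Step 3 could look roughly like the following Python sketch. It assumes the input is plain text without existing markup, that annotations arrive as (begin, end, class URI) triples, and that the ITS 2.0 HTML attribute `its-ta-class-ref` is the chosen storage mechanism; the function name `annotate_html` is hypothetical.

```python
from html import escape


def annotate_html(text, annotations):
    """Wrap each annotated character span in a <span> carrying a class reference.

    annotations: list of (begin, end, class_ref) tuples with non-overlapping
    spans. Overlapping spans raise ValueError, since inline markup cannot
    express them directly.
    """
    out = []
    cursor = 0
    for begin, end, class_ref in sorted(annotations):
        if begin < cursor:
            raise ValueError("overlapping annotations are not supported")
        out.append(escape(text[cursor:begin]))
        out.append(
            f'<span its-ta-class-ref="{escape(class_ref, quote=True)}">'
            f"{escape(text[begin:end])}</span>"
        )
        cursor = end
    out.append(escape(text[cursor:]))
    return "".join(out)


print(annotate_html("Welcome to Prague!", [(11, 17, "http://schema.org/Place")]))
```

Content that already contains markup would need a DOM-aware variant, because the character offsets then no longer map directly onto the serialized HTML.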


Example: XLIFF

Example roundtripping workflow

… <xlf:source>Welcome to Prague!</xlf:source> …

… <xlf:source>Welcome to <mrk …its:taClassRef="http://schema.org/Place">Prague</mrk>!</xlf:source> …

1) Conversion to NIF
2) NER processing
3) Back conversion to XLIFF
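
A minimal sketch of the back conversion into XLIFF, using Python's standard `xml.etree.ElementTree`. The XLIFF namespace URI (version 1.2) is an assumption, since the slide does not say what the `xlf:` prefix is bound to; the function name `mark_entity` is hypothetical, and any further `mrk` attributes elided on the slide are omitted here.

```python
import xml.etree.ElementTree as ET

XLF_NS = "urn:oasis:names:tc:xliff:document:1.2"  # assumed XLIFF version
ITS_NS = "http://www.w3.org/2005/11/its"          # ITS namespace


def mark_entity(source, begin, end, class_ref):
    """Wrap source.text[begin:end] in an <mrk> element with its:taClassRef.

    Only handles a <source> element that contains plain text and no
    inline markup yet.
    """
    if len(source) != 0:
        raise ValueError("source already contains inline markup")
    text = source.text or ""
    source.text = text[:begin]
    mrk = ET.SubElement(source, f"{{{XLF_NS}}}mrk")
    mrk.set(f"{{{ITS_NS}}}taClassRef", class_ref)
    mrk.text = text[begin:end]
    mrk.tail = text[end:]   # text after the entity stays inside <source>
    return source


ET.register_namespace("xlf", XLF_NS)
ET.register_namespace("its", ITS_NS)
source = ET.fromstring(f'<source xmlns="{XLF_NS}">Welcome to Prague!</source>')
mark_entity(source, 11, 17, "http://schema.org/Place")
print(ET.tostring(source, encoding="unicode"))
```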


Example usage scenario: FREME project

• See http://www.freme-project.eu/
• Developing interfaces for multilingual and semantic enrichment of digital content
• Relies on NIF based enrichment workflows
  – See FREME API version 0.1: http://api.freme-project.eu/doc/0.1/
• Deploys aspects of the LIDER reference architecture for LLD processing
  – See D3.1.1 at http://lider-project.eu/?q=doc/deliverables
• Focuses on four business cases
  – Localization BC requires XLIFF roundtripping
  – Web content personalisation BC requires HTML roundtripping


Challenges for roundtripping

• Source format
  – How to store enrichment information (annotations)
  – How to handle existing information
• Annotation model
  – NIF = a general graph-based annotation model
  – Source format and annotation motivation may require restriction of the model


How to store annotations in various source formats

• Solvable for markup languages like HTML or XLIFF

• Challenge to preserve existing markup: “Welcome to Prague!”

• General issue with complex and proprietary formats:
  – “My own” storage mechanism = no tool support
  – Using existing storage mechanisms may mean: overloading semantics


Source format example: Word

Content storage (Word file unzipped):
… <w:t>Welcome to Prague!</w:t> …

Enrichment process; storing enrichment as comments

Content storage:
… <w:commentRangeStart w:id="0"/><w:t>Prague</w:t><w:commentRangeEnd w:id="0"/><w:r w:rsidR="00987079"> …
Change of original content: creation of anchor

Comment storage:
<w:p w:rsidRPr="00987079">… Enrichment: type "http://schema.org/Place"…</w:p>
Comment stored separately; refers to anchor: “standoff approach”
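
The standoff idea behind the Word example can be reduced to a short Python sketch: the content only receives a pair of anchors, while the enrichment itself lives in a separate store keyed by the anchor id. The `[[start:…]]` / `[[end:…]]` marker syntax and the function name are purely illustrative and are not what Word writes.

```python
def add_standoff_comment(content, comments, begin, end, note):
    """Insert anchor markers around content[begin:end]; keep the note separately.

    Returns the anchored content and a new comment store (dict: id -> note).
    The ids are simply counted up from the size of the existing store.
    """
    comment_id = str(len(comments))
    anchored = (
        content[:begin]
        + f"[[start:{comment_id}]]"
        + content[begin:end]
        + f"[[end:{comment_id}]]"
        + content[end:]
    )
    new_comments = dict(comments)       # do not mutate the caller's store
    new_comments[comment_id] = note
    return anchored, new_comments


anchored, comments = add_standoff_comment(
    "Welcome to Prague!", {}, 11, 17, 'type "http://schema.org/Place"'
)
print(anchored)
print(comments)
```

The only change to the original content is the anchor pair, which keeps the enrichment easy to remove again.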


Annotation models

• NIF: like RDF = general graph model
  – Consisting of nodes and arcs

p:char=11,17 --taIdentRef--> dbp:Prague
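
As a rough illustration of the graph model, the annotation can be written as (node, arc, node) triples in plain Python. The triples are taken from the partial JSON-LD example; the helper `outgoing` is hypothetical, and no RDF library is involved.

```python
from collections import defaultdict

# Each triple is (subject node, arc label, object node).
triples = [
    ("p:char=11,17", "referenceContext", "p:char=0,18"),
    ("p:char=11,17", "taIdentRef", "dbp:Prague"),
]


def outgoing(triples, node):
    """Group the arcs leaving a node by their label."""
    arcs = defaultdict(list)
    for subject, predicate, obj in triples:
        if subject == node:
            arcs[predicate].append(obj)
    return dict(arcs)


print(outgoing(triples, "p:char=11,17"))
```

Nothing in this structure forces the arcs to form a tree, which is why a tree-shaped source format may require restricting the model.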


Restricting graphs: Tree structured annotations on several layers

• Tree structures for syntactic annotations

• Several annotation layers for the same text

• Concurrent hierarchies

• Representation only of one of these in roundtripping with XML

Example taken from TEI http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html
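
Whether a set of annotations fits into a single XML tree can be checked with a small Python sketch: two spans are a problem only if they overlap without one containing the other. Spans are assumed to be (begin, end) character offsets with an exclusive end; both function names are hypothetical.

```python
def crosses(a, b):
    """True if spans a and b overlap without one containing the other."""
    (a_begin, a_end), (b_begin, b_end) = a, b
    return a_begin < b_begin < a_end < b_end or b_begin < a_begin < b_end < a_end


def tree_compatible(spans):
    """True if all spans can be nested in one hierarchy (no crossing pairs)."""
    return not any(
        crosses(a, b)
        for i, a in enumerate(spans)
        for b in spans[i + 1:]
    )


print(tree_compatible([(0, 18), (11, 17)]))   # nested spans
print(tree_compatible([(0, 10), (5, 15)]))    # crossing spans
```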


Representing overlapping hierarchies with markup (1/2)

Solutions advertised by the TEI
• Multiple encoding of the same information
  – One XML document per annotation
• Boundary marking with empty “milestone” elements
  – Also used by XLIFF
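
Boundary marking can be sketched in Python as follows. The element names (`<name>Start` / `<name>End`) and the `id` attribute are invented for illustration and follow neither the TEI nor the XLIFF vocabulary; spans are assumed to be non-empty (begin, end, name) triples.

```python
from xml.sax.saxutils import escape


def with_milestones(text, spans):
    """Mark span boundaries with empty elements so that overlaps stay well-formed.

    spans: list of (begin, end, name). At the same offset, end markers are
    emitted before start markers.
    """
    events = []
    for idx, (begin, end, name) in enumerate(spans):
        events.append((begin, 1, idx, f'<{name}Start id="{idx}"/>'))
        events.append((end, 0, idx, f'<{name}End id="{idx}"/>'))
    events.sort()

    out = []
    cursor = 0
    for position, _, _, tag in events:
        out.append(escape(text[cursor:position]))
        out.append(tag)
        cursor = position
    out.append(escape(text[cursor:]))
    return "".join(out)


# Two crossing spans that could not both be written as ordinary elements.
print(with_milestones("abcdefghij", [(0, 6, "a"), (3, 9, "b")]))
```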


Representing overlapping hierarchies with markup (2/2)

Solutions advertised by the TEI
• Fragmentation and reconstitution of virtual elements
  – One hierarchy explicit, others with interrelated marked-up spans
• Stand-off markup
  – Separation of text and annotations, interlinked via anchor and reference
  – Cf. Word example


Representing overlapping hierarchies in RDF

POWLA (cf. Chiarcos, 2012)
• RDF representation for corpus annotation, based on PAULA XML Standoff format
• Allows representing hierarchical, multi-layer corpora in RDF and querying them in SPARQL
• Not relevant for roundtripping, but for linguistic annotation representation and processing in RDF


Lessons learned

• Choose the overlap solution that fits your roundtripping modelling and processing needs

• Consider off-the-shelf tooling
  – For 100% hierarchical data: XPath / CSS selectors, DOM, …

• Consider libraries
  – For extraction only: Tika http://tika.apache.org/
  – For roundtripping: Okapi http://okapi.opentag.com/ (in FREME currently being adapted for roundtripping in selected formats)

• Make sure the annotation survives in the original format
  – Cf. Word example
  – Soon to be made easier by using Okapi

