MTAS Henny Brugman


Multi Tier Annotation Search (MTAS)

Matthijs Brouwer

Meertens Institute

December 8, 2015





1 Introduction

2 Lucene

3 MTAS

4 Tokenizer FoLiA

5 Search using CQL

6 Results





Provide Search on Combination of Text and Metadata

Example data

Author          Eduard Douwes Dekker
Place of birth  Amsterdam
Date of birth   1820, March 2
Pseudonym       Multatuli
Title           Max Havelaar
Published       1860

Text            Ik ben makelaar in koffie, en woon op de Lauriergracht no 37 . . .






Solution based on Apache Solr

Reverse Index

Apache Solr (based on Apache Lucene)

Index on both Text and Metadata

Advantages

Search

Facets

Scalable

Custom plugin (join)

Actively developed






Search Text

"Ik ben makelaar in koffie, en woon op de Lauriergracht no 37."

We can search for

"Makelaar"

"Makelaar in koffie"

"Makel.* in koffie"






Annotations


text      lemma     pos/features
Ik        ik        VNW(pers,pron,nomin,vol,1,ev)
ben       zijn      WW(pv,tgw,ev)
makelaar  makelaar  N(soort,ev,basis,zijd,stan)
in        in        VZ(init)
koffie    koffie    N(soort,ev,basis,zijd,stan)
,         ,         LET()
. . .     . . .     . . .






FoLiA

<text xml:id="untitled.text">
  <p xml:id="untitled.p.1">
    <s xml:id="untitled.p.1.s.1">
      <w xml:id="untitled.p.1.s.1.w.1" class="WORD">
        <t>Ik</t>
        <pos class="VNW(pers,pron,nomin,vol,1,ev)" confidence="0.999791" head="VNW">
          <feat class="pers" subset="vwtype"/>
          <feat class="pron" subset="pdtype"/>
          <feat class="nomin" subset="naamval"/>
          <feat class="vol" subset="status"/>
          <feat class="1" subset="persoon"/>
          <feat class="ev" subset="getal"/>
        </pos>
        <morphology>
          <morpheme>
            <t offset="0">ik</t>
          </morpheme>
        </morphology>
        <lemma class="ik"/>
      </w>
      . . .






Required functionality

Extend current Solr solution

Search on annotations like pos, lemma, features, . . .

Search on sentences, paragraphs, chapters, . . .

Search on entities and chunks

Search on dependencies

Statistics, grouping, facets, . . .

Important

Maintaining functionality and scalability

Upgradeable to new releases of Solr/Lucene





Tokenization

Something about Lucene internals

Focus on text

Tokenization

Text is split up into tokens

  value, e.g. "koffie"
  position, e.g. 4
  offset, e.g. 19-24
  payload, e.g. 1.000
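The token attributes above can be reproduced with a small sketch. This is not Lucene's tokenizer: the regular expression, the lowercasing and the Token type are assumptions made for illustration, and the end offset is inclusive only to match the example values on the slide. The payload is left out, since the slide does not say how it is computed.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    value: str      # token text, lowercased here (an assumption)
    position: int   # index of the token in the token sequence
    start: int      # offset of the first character
    end: int        # offset of the last character (inclusive, as on the slide)


def tokenize(text: str) -> list[Token]:
    """Split text into word tokens and single punctuation tokens."""
    tokens = []
    # \w+ matches a word; [^\w\s] matches one punctuation mark.
    for position, match in enumerate(re.finditer(r"\w+|[^\w\s]", text)):
        tokens.append(
            Token(match.group().lower(), position, match.start(), match.end() - 1)
        )
    return tokens


sentence = "Ik ben makelaar in koffie, en woon op de Lauriergracht no 37."
tokens = tokenize(sentence)
print(tokens[4])
```

Counting the comma as a token of its own is what puts "en" on position 6 and "de" on position 9.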







Reverse Index

Tokenstream used to construct Reverse Index

text      document  position  offset  payload
ben       0         1         3-5     0.500
de        0         9         38-39   0.200
en        0         6         27-28   0.250
in        0         3         16-17   0.350
koffie    0         4         19-24   0.900
makelaar  0         2         7-14    0.800
. . .     . . .     . . .     . . .   . . .

This enables fast search, since the locations of matching terms can be found very quickly.
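A minimal sketch of such an index, assuming a plain dictionary from term to (document, position) pairs; Lucene's on-disk structures are far more elaborate, and the function names are hypothetical. The phrase lookup shows why stored positions matter.

```python
from collections import defaultdict

# Token stream for document 0 as (term, position) pairs, taken from the example.
documents = {
    0: [("ik", 0), ("ben", 1), ("makelaar", 2), ("in", 3), ("koffie", 4)],
}


def build_reverse_index(docs):
    """Map every term to the sorted list of (document, position) pairs."""
    index = defaultdict(list)
    for doc_id, tokens in docs.items():
        for term, position in tokens:
            index[term].append((doc_id, position))
    return {term: sorted(postings) for term, postings in sorted(index.items())}


def find_phrase(index, terms):
    """Return (document, start position) wherever the terms occur consecutively."""
    hits = []
    for doc_id, start in index.get(terms[0], []):
        if all((doc_id, start + i) in index.get(term, [])
               for i, term in enumerate(terms)):
            hits.append((doc_id, start))
    return hits


index = build_reverse_index(documents)
print(find_phrase(index, ["makelaar", "in", "koffie"]))
```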






Limitations

Limitations of this approach

Heavily based on grouping by document
  Collecting statistics
  Grouping results

Not possible to include
  Structural information: sentences, paragraphs, . . .
  Annotations: pos, lemmas, . . .
  Relations: dependencies, chunking, . . .

No real forward index
  Finding all tokens for a given position






Alternatives

Alternative solutions

Graph Database
  Experiments with Neo4j: problems with scalability and performance
  Too general, doesn't use the sequential nature of textual data

BlackLab
  Based on Lucene, no integration with Solr
  Different fields for each annotation layer





Extension provided by MTAS

Store multiple tokens on the same position, and use prefixes to distinguish between different layers of annotations

Use the payload to encode additional information on each token

Construct forward indexes by extending the Lucene Codec

Implementation

Extension based on the Lucene Library

Provide query handlers for extended data structures

Provide Solr Plugin using the MTAS extension






Prefixes

Store multiple tokens on the same position, and use prefixes to distinguish between different layers of annotations

text        document  position
lemma:de    0         9
lemma:zijn  0         1
. . .       . . .     . . .
pos:LID     0         9
pos:WW      0         1
. . .       . . .     . . .
t:ben       0         1
t:de        0         9
. . .       . . .     . . .
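The prefix scheme can be sketched as follows. The layer names t, lemma and pos come from the table; the dictionary-based index and both function names are assumptions for illustration, not the MTAS implementation.

```python
from collections import defaultdict

# One dict per position; every layer contributes one token on that position.
words = [
    {"t": "Ik", "lemma": "ik", "pos": "VNW"},
    {"t": "ben", "lemma": "zijn", "pos": "WW"},
    {"t": "makelaar", "lemma": "makelaar", "pos": "N"},
]


def index_layers(doc_id, layered_words):
    """Index every layer as a 'prefix:value' term on the word's position."""
    index = defaultdict(list)
    for position, layers in enumerate(layered_words):
        for prefix, value in layers.items():
            index[f"{prefix}:{value}"].append((doc_id, position))
    return index


def positions_matching(index, *terms):
    """Positions on which all given prefixed terms occur together."""
    postings = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*postings))


index = index_layers(0, words)
print(positions_matching(index, "t:ben", "pos:WW"))
```

Because all layers share one field and one position space, a condition on several layers becomes a plain intersection of postings.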






Payload

Use the payload to encode additional information on each token

mtas id      integer identifying token within a document
position     type of position: single, range or set; additional information for range or set
offset       start and end offset
real offset  start and end real offset
parent       reference to another token by its mtas id
payload      original payload
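As an illustration only, the fields listed above could be modelled like this. The class name, field types and the covered_positions helper are hypothetical; MTAS stores this information encoded in the Lucene payload, and that encoding is not shown on the slide. The slide also does not explain how the real offset differs from the offset, so the example simply repeats the same values.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class MtasPayload:
    mtas_id: int                      # identifies the token within a document
    position_type: str                # "single", "range" or "set"
    positions: Tuple[int, ...]        # one position, (start, end), or a set of positions
    offset: Tuple[int, int]           # start and end offset
    real_offset: Tuple[int, int]      # start and end real offset
    parent: Optional[int] = None      # mtas id of another token
    payload: Optional[float] = None   # original payload

    def covered_positions(self):
        """Expand the position information into an explicit list."""
        if self.position_type == "range":
            start, end = self.positions
            return list(range(start, end + 1))
        return sorted(self.positions)


# A sentence covering positions 0..13 and one word inside it (illustrative values).
sentence = MtasPayload(0, "range", (0, 13), (0, 60), (0, 60))
word = MtasPayload(5, "single", (4,), (19, 24), (19, 24), parent=0, payload=0.9)
print(word.parent == sentence.mtas_id, sentence.covered_positions()[-1])
```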






Forward Indexes

Construct forward indexes by extending the Lucene Codec

Position         Given the position within the document, return references to all objects on that position.

Parent Id        Given the mtas id, return references to all objects referring to this mtas id as parent

Object Id        Given the id, return a reference to the object

Prefix/Position  Given prefix and position, return the value
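The four lookups can be sketched with in-memory dictionaries. In MTAS they are built by extending the Lucene Codec; the object layout and variable names below are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical objects for the first two words of the example sentence.
objects = [
    {"id": 0, "prefix": "s", "value": "", "positions": [0, 1], "parent": None},
    {"id": 1, "prefix": "t", "value": "Ik", "positions": [0], "parent": 0},
    {"id": 2, "prefix": "lemma", "value": "ik", "positions": [0], "parent": 0},
    {"id": 3, "prefix": "t", "value": "ben", "positions": [1], "parent": 0},
    {"id": 4, "prefix": "lemma", "value": "zijn", "positions": [1], "parent": 0},
]

by_position = defaultdict(list)   # position -> ids of all objects on it
by_parent = defaultdict(list)     # mtas id -> ids of objects naming it as parent
by_id = {}                        # id -> object
by_prefix_position = {}           # (prefix, position) -> value

for obj in objects:
    by_id[obj["id"]] = obj
    if obj["parent"] is not None:
        by_parent[obj["parent"]].append(obj["id"])
    for position in obj["positions"]:
        by_position[position].append(obj["id"])
        # Simplification: keeps one value per prefix and position.
        by_prefix_position[(obj["prefix"], position)] = obj["value"]

print(by_position[1], by_prefix_position[("lemma", 1)])
```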






Usage of the new structure

The additions make it possible to quickly retrieve the required information for queries and results based on the annotated text.

To take advantage of these additions to the Lucene structure, we need

A tokenizer mapping the original annotated data (FoLiA) onto the new structure

Query handlers, and query language: CQL





FoLiA

<text xml:id="untitled.text">
  <p xml:id="untitled.p.1">
    <s xml:id="untitled.p.1.s.1">
      <w xml:id="untitled.p.1.s.1.w.1" class="WORD">
        <t>Ik</t>
        <pos class="VNW(pers,pron,nomin,vol,1,ev)" confidence="0.999791" head="VNW">
          <feat class="pers" subset="vwtype"/>
          <feat class="pron" subset="pdtype"/>
          <feat class="nomin" subset="naamval"/>
          <feat class="vol" subset="status"/>
          <feat class="1" subset="persoon"/>
          <feat class="ev" subset="getal"/>
        </pos>
        <morphology>
          <morpheme>
            <t offset="0">ik</t>
          </morpheme>
        </morphology>
        <lemma class="ik"/>
      </w>
      . . .





Tokenizer FoLiA

Several elements can be distinguished:

Words : <w/>

Annotations on Words : <pos/>, <t/>, <lemma/>

Groups of Words : <p/>, <s/>, <div/>

Annotations on Groups : <lang/>

References : <wref/>

Relations : <entity/>

The configurable FoLiA tokenizer makes it possible to define these items and map them onto the new index structure.
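A rough sketch of such a mapping, using Python's standard XML parser on a shortened, namespace-free fragment. The fragment, the fixed choice of layers and the function name are assumptions; the actual MTAS tokenizer is driven by configuration and is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Shortened fragment modelled on the FoLiA example; the second word is
# filled in from the annotation table.
SNIPPET = """
<s>
  <w class="WORD">
    <t>Ik</t>
    <pos class="VNW(pers,pron,nomin,vol,1,ev)" head="VNW"/>
    <morphology><morpheme><t offset="0">ik</t></morpheme></morphology>
    <lemma class="ik"/>
  </w>
  <w class="WORD">
    <t>ben</t>
    <pos class="WW(pv,tgw,ev)" head="WW"/>
    <lemma class="zijn"/>
  </w>
</s>
"""


def folia_tokens(xml_text):
    """Return (prefixed term, start position, end position) tuples."""
    root = ET.fromstring(xml_text.strip())
    words = list(root.iter("w"))
    tokens = []
    for position, w in enumerate(words):
        # find() only looks at direct children, so the <t> inside
        # <morpheme> is not mistaken for the word text.
        tokens.append((f"t:{w.find('t').text}", position, position))
        tokens.append((f"pos:{w.find('pos').get('head')}", position, position))
        tokens.append((f"lemma:{w.find('lemma').get('class')}", position, position))
    if root.tag == "s" and words:
        # A group of words becomes one token spanning a range of positions.
        tokens.append(("s", 0, len(words) - 1))
    return tokens


tokens = folia_tokens(SNIPPET)
for token in tokens:
    print(token)
```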





Search using CQL

For new MTAS data structure

Query handlers provided

Support Corpus Query Language (CQL)

Makes it possible to define conditions on annotations

Confusion about the exact interpretation and implementation





Search using CQL

the  big  green  shiny  apple
LID  ADJ  ADJ    ADJ    N

Ambiguities illustrated by examples

[pos = "LID" | word = "the"]     (1)

[word = "b.*" | word = ".*g"]    (2)

[pos = "ADJ"]{2}                 (3)

[pos = "ADJ"]? [pos = "N"]       (4)





Search using CQL

Within MTAS

Results should be considered as equal if and only if the positions of both results exactly match.

Differs from the default query interpretation of Lucene and the CQL interpretation as used in other applications

No options to refer to parts of the matched pattern to e.g. sort, group or collect statistics
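The position-based rule can be tried on the phrase "the big green shiny apple". The hand-written matchers below are a sketch for three of the example queries, not a CQL engine, and all function names are hypothetical. Hits are (start position, end position) pairs.

```python
phrase = [("the", "LID"), ("big", "ADJ"), ("green", "ADJ"),
          ("shiny", "ADJ"), ("apple", "N")]


def query_1(tokens):
    """[pos = "LID" | word = "the"], one candidate per satisfied alternative."""
    hits = []
    for i, (word, pos) in enumerate(tokens):
        if pos == "LID":
            hits.append((i, i))
        if word == "the":
            hits.append((i, i))
    return hits


def query_3(tokens):
    """[pos = "ADJ"]{2}: every pair of adjacent adjectives."""
    return [(i, i + 1) for i in range(len(tokens) - 1)
            if tokens[i][1] == "ADJ" and tokens[i + 1][1] == "ADJ"]


def query_4(tokens):
    """[pos = "ADJ"]? [pos = "N"]: a noun, with or without a preceding adjective."""
    hits = []
    for i, (_, pos) in enumerate(tokens):
        if pos == "N":
            hits.append((i, i))
            if i > 0 and tokens[i - 1][1] == "ADJ":
                hits.append((i - 1, i))
    return hits


def unique_by_position(hits):
    """Treat results as equal if and only if their positions match exactly."""
    return sorted(set(hits))


print(query_1(phrase), unique_by_position(query_1(phrase)))
print(unique_by_position(query_3(phrase)))
print(unique_by_position(query_4(phrase)))
```

Under this rule query (1) yields a single hit for "the", query (3) yields two overlapping hits, and query (4) yields both "apple" and "shiny apple".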





Size indexes

Collection  # FoLiA    Zipped Size  Index  Positions
DBNL T      9,465      29GB         198GB  677,476,310
DBNL DT     131,177    -            95GB   395,530,191
SONAR       2,063,880  22GB         127GB  504,393,711

Search on combined indexes using Solr sharding

# FoLiA      2,204,522
# Positions  1,577,400,212
# Sentences  92,584,655
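The combined totals follow from the per-collection figures, as a short check shows. The averages are derived here for orientation and do not appear on the slide.

```python
# (number of FoLiA documents, number of positions) per collection, from the table.
corpora = {
    "DBNL T": (9_465, 677_476_310),
    "DBNL DT": (131_177, 395_530_191),
    "SONAR": (2_063_880, 504_393_711),
}
sentences = 92_584_655

total_docs = sum(docs for docs, _ in corpora.values())
total_positions = sum(positions for _, positions in corpora.values())

positions_per_sentence = total_positions / sentences
positions_per_document = total_positions / total_docs

print(total_docs, total_positions)
print(round(positions_per_sentence, 1), round(positions_per_document, 1))
```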

There are approximately 10 tokens on each position.






Performance

Virtual Machine, Ubuntu, 8 cores, 48GB (40GB Solr)

Computing stats (sum, mean, median, standard deviation, etc.) on full set of 2,204,522 documents and 1,577,400,212 positions.

CQL                          Time        Hits         Docs
[t = "de"]                   3,023 ms    57,531,353   1,801,583
[t = "de" & pos = "LID"]     7,877 ms    56,704,921   1,799,499
[t = "de" & !pos = "LID"]    3,105 ms    826,432      132,722
<s> [t = "De"]               11,568 ms   6,085,643    1,090,127
[pos = "N"]                  6,200 ms    259,942,340  2,189,750
[pos = "ADJ"] [pos = "N"]    42,977 ms   45,366,603   1,821,716
[pos = "ADJ"]? [pos = "N"]   207,795 ms  305,308,943  2,189,750
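The hit counts in the table are internally consistent, which a small check makes explicit. The sums are consistent with counting results by position, although they do not prove how the counts were produced.

```python
# Hit counts copied from the table.
hits = {
    '[t = "de"]': 57_531_353,
    '[t = "de" & pos = "LID"]': 56_704_921,
    '[t = "de" & !pos = "LID"]': 826_432,
    '[pos = "N"]': 259_942_340,
    '[pos = "ADJ"] [pos = "N"]': 45_366_603,
    '[pos = "ADJ"]? [pos = "N"]': 305_308_943,
}

# Every "de" is either tagged LID or not.
de_split = hits['[t = "de" & pos = "LID"]'] + hits['[t = "de" & !pos = "LID"]']

# An optional adjective before a noun: the noun alone plus adjective-noun pairs.
optional_adj = hits['[pos = "N"]'] + hits['[pos = "ADJ"] [pos = "N"]']

print(de_split == hits['[t = "de"]'])
print(optional_adj == hits['[pos = "ADJ"]? [pos = "N"]'])
```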






TODO

Group results

Facets

Performance

. . .






The end

