Multi Tier Annotation Search (MTAS)
Matthijs Brouwer, Meertens Institute
December 8, 2015

MTAS Henny Brugman


Page 1: MTAS Henny Brugman

Multi Tier Annotation Search (MTAS)

Matthijs Brouwer

Meertens Institute

December 8, 2015

Page 2: MTAS Henny Brugman

1 Introduction

2 Lucene

3 MTAS

4 Tokenizer FoLiA

5 Search using CQL

6 Results

Page 3: MTAS Henny Brugman

Provide Search on Combination of Text and Metadata

Example data

Author          Eduard Douwes Dekker
Place of birth  Amsterdam
Date of birth   1820, March 2
Pseudonym       Multatuli
Title           Max Havelaar
Published       1860

Text            Ik ben makelaar in koffie, en woon op de Lauriergracht no 37 ...

Page 4: MTAS Henny Brugman

Solution based on Apache Solr

Reverse Index

Apache Solr (based on Apache Lucene)

Index on both Text and Metadata

Advantages

Search

Facets

Scalable

Custom plugin (join)

Actively developed

Page 5: MTAS Henny Brugman

Search Text

'Ik ben makelaar in koffie, en woon op de Lauriergracht no 37.'

We can search for

"Makelaar"

"Makelaar in koffie"

"Makel.* in koffie"

Page 6: MTAS Henny Brugman

Annotations

'Ik ben makelaar in koffie, en woon op de Lauriergracht no 37.'

text      lemma     pos/features
Ik        ik        VNW(pers,pron,nomin,vol,1,ev)
ben       zijn      WW(pv,tgw,ev)
makelaar  makelaar  N(soort,ev,basis,zijd,stan)
in        in        VZ(init)
koffie    koffie    N(soort,ev,basis,zijd,stan)
,         ,         LET()
...       ...       ...

Page 7: MTAS Henny Brugman

FoLiA

<text xml:id="untitled.text">
  <p xml:id="untitled.p.1">
    <s xml:id="untitled.p.1.s.1">
      <w xml:id="untitled.p.1.s.1.w.1" class="WORD">
        <t>Ik</t>
        <pos class="VNW(pers,pron,nomin,vol,1,ev)" confidence="0.999791" head="VNW">
          <feat class="pers" subset="vwtype"/>
          <feat class="pron" subset="pdtype"/>
          <feat class="nomin" subset="naamval"/>
          <feat class="vol" subset="status"/>
          <feat class="1" subset="persoon"/>
          <feat class="ev" subset="getal"/>
        </pos>
        <morphology>
          <morpheme>
            <t offset="0">ik</t>
          </morpheme>
        </morphology>
        <lemma class="ik"/>
      </w>
      ...

Page 8: MTAS Henny Brugman

Required functionality

Extend current Solr solution

Search on annotations like pos, lemma, features, . . .

Search on sentences, paragraphs, chapters, . . .

Search on entities and chunks

Search on dependencies

Statistics, grouping, facets, . . .

Important

Maintaining functionality and scalability

Upgradeable to new releases of Solr/Lucene

Page 9: MTAS Henny Brugman

Tokenization

Something about Lucene internals

Focus on text

Tokenization

Text is split up into tokens

  value, e.g. "koffie"
  position, e.g. 4
  offset, e.g. 19-24
  payload, e.g. 1.000

'Ik ben makelaar in koffie, en woon op de Lauriergracht no 37.'
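The token attributes above can be illustrated with a minimal sketch. It is not Lucene's tokenizer: the regular expression, the dictionary layout and the inclusive end offset are assumptions chosen so that the output reproduces the position and offset values quoted on the slide. The payload is left out.

```python
import re

SENTENCE = "Ik ben makelaar in koffie, en woon op de Lauriergracht no 37."

def tokenize(text):
    """Split text into tokens with a position and an inclusive character offset range."""
    tokens = []
    # Assumption: each word or number, and each punctuation mark, is one token.
    for position, match in enumerate(re.finditer(r"\w+|[^\w\s]", text)):
        tokens.append({
            "value": match.group(),
            "position": position,
            "offset": (match.start(), match.end() - 1),  # inclusive end offset
        })
    return tokens

tokens = tokenize(SENTENCE)
print(tokens[4])  # the token for "koffie"
```

Counting the comma as a token of its own is what puts "en" on position 6 and "de" on position 9, as in the slides.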

Page 10: MTAS Henny Brugman

Reverse Index

Tokenstream used to construct Reverse Index

text      document  position  offset  payload
ben       0         1         3-5     0.500
de        0         9         38-39   0.200
en        0         6         27-28   0.250
in        0         3         16-17   0.350
koffie    0         4         19-24   0.900
makelaar  0         2         7-14    0.800
...       ...       ...       ...     ...

This enables fast search, since the locations of matching terms can be found very quickly.
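A minimal sketch of the idea, assuming a plain in-memory dictionary rather than Lucene's actual on-disk structures. The token stream is copied from the table (payloads omitted); `build_index` and `phrase_positions` are hypothetical helper names.

```python
from collections import defaultdict

# (value, position, (start offset, end offset)) for document 0, taken from the table.
TOKEN_STREAM = [
    ("ben", 1, (3, 5)),
    ("makelaar", 2, (7, 14)),
    ("in", 3, (16, 17)),
    ("koffie", 4, (19, 24)),
    ("en", 6, (27, 28)),
    ("de", 9, (38, 39)),
]

def build_index(doc_id, token_stream):
    """Map every term to the list of places where it occurs."""
    index = defaultdict(list)
    for value, position, offset in token_stream:
        index[value].append({"document": doc_id, "position": position, "offset": offset})
    return index

def phrase_positions(index, first, second):
    """Return (document, position) pairs where `second` directly follows `first`."""
    follow = {(p["document"], p["position"]) for p in index.get(second, [])}
    return [(p["document"], p["position"]) for p in index.get(first, [])
            if (p["document"], p["position"] + 1) in follow]

index = build_index(0, TOKEN_STREAM)
print(index["koffie"])
print(phrase_positions(index, "in", "koffie"))
```

Looking up a term is a single dictionary access, and a phrase such as "in koffie" only needs the position lists of its two terms.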

Page 11: MTAS Henny Brugman

Limitations

Limitations of this approach

Heavily based on grouping by document
  Collecting statistics
  Grouping results

Not possible to include
  Structural information: sentences, paragraphs, ...
  Annotations: pos, lemmas, ...
  Relations: dependencies, chunking, ...

No real forward index
  Finding all tokens for a given position

Page 12: MTAS Henny Brugman

Alternatives

Alternative solutions

Graph Database
  Experiments with Neo4j: problems with scalability and performance
  Too general, doesn't use the sequential nature of textual data

BlackLab
  Based on Lucene, no integration with Solr
  Different fields for each annotation layer

Page 13: MTAS Henny Brugman

Extension provided by MTAS

Store multiple tokens on the same position, and use prefixes to distinguish between different layers of annotations

Use the payload to encode additional information on each token

Construct forward indexes by extending the Lucene Codec

Implementation

Extension based on the Lucene Library

Provide query handlers for extended data structures

Provide Solr Plugin using the MTAS extension

Page 14: MTAS Henny Brugman

Prefixes

Store multiple tokens on the same position, and use prefixes to distinguish between different layers of annotations

text        document  position
lemma:de    0         9
lemma:zijn  0         1
...         ...       ...
pos:LID     0         9
pos:WW      0         1
...         ...       ...
t:ben       0         1
t:de        0         9
...         ...       ...
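The prefix scheme can be sketched as follows. This is an illustration only: the `ANNOTATED` input and the helper names are assumptions, and the real MTAS index lives inside Lucene rather than in a Python dictionary.

```python
from collections import defaultdict

# Assumption: annotation layers per position for the two words shown in the table.
ANNOTATED = {
    1: {"t": "ben", "lemma": "zijn", "pos": "WW"},
    9: {"t": "de", "lemma": "de", "pos": "LID"},
}

def index_layers(doc_id, annotated):
    """Store one term per layer on the same position, as 'prefix:value'."""
    index = defaultdict(list)
    for position, layers in annotated.items():
        for prefix, value in layers.items():
            index[f"{prefix}:{value}"].append((doc_id, position))
    return dict(index)

def positions(index, term):
    """Set of (document, position) pairs for a prefixed term."""
    return set(index.get(term, []))

index = index_layers(0, ANNOTATED)
# A condition on two layers becomes an intersection of position sets.
both = positions(index, "t:de") & positions(index, "pos:LID")
print(sorted(index))
print(both)
```

Because all layers share the position numbering, a condition such as "word is de and part of speech is LID" reduces to intersecting two position lists.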

Page 15: MTAS Henny Brugman

Payload

Use the payload to encode additional information on each token

mtas id      integer identifying token within a document
position     type of position: single, range or set
             additional information for range or set
offset       start and end offset
real offset  start and end real offset
parent       reference to another token by its mtas id
payload      original payload
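As a rough picture of what such a payload carries, the fields can be written down as a record. This is a sketch under assumptions: the class name, field types and the `covers` helper are hypothetical, and the binary encoding MTAS actually uses is not described in the slides.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class MtasTokenInfo:
    """Hypothetical record mirroring the payload fields listed above."""
    mtas_id: int
    position_type: str                 # "single", "range" or "set"
    positions: Tuple[int, ...]         # one position, (start, end) for a range, or all set members
    offset: Optional[Tuple[int, int]] = None
    real_offset: Optional[Tuple[int, int]] = None
    parent: Optional[int] = None       # mtas_id of another token
    payload: Optional[bytes] = None    # original payload, kept as is

    def covers(self, position: int) -> bool:
        """True if this token occupies the given position."""
        if self.position_type == "range":
            start, end = self.positions
            return start <= position <= end
        return position in self.positions

# Assumed example: a sentence spanning positions 0..13 and the word "koffie" inside it.
sentence = MtasTokenInfo(mtas_id=1, position_type="range", positions=(0, 13))
word = MtasTokenInfo(mtas_id=5, position_type="single", positions=(4,),
                     offset=(19, 24), parent=sentence.mtas_id)
print(sentence.covers(4), word.covers(5))
```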

Page 16: MTAS Henny Brugman

Forward Indexes

Construct forward indexes by extending the Lucene Codec

Position         Given the position within the document, return references to all objects on that position.

Parent Id        Given the mtas id, return references to all objects referring to this mtas id as parent.

Object Id        Given the id, return a reference to the object.

Prefix/Position  Given prefix and position, return the value.
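The four lookups can be sketched with in-memory dictionaries. This is an illustration under assumptions, not the Lucene codec extension itself: the `OBJECTS` data, the parent relations and the `ForwardIndex` class are hypothetical.

```python
from collections import defaultdict

# Hypothetical objects: (mtas_id, prefix, value, positions, parent mtas_id)
OBJECTS = [
    (0, "s", "", [0, 1], None),
    (1, "t", "Ik", [0], 0),
    (2, "lemma", "ik", [0], 0),
    (3, "t", "ben", [1], 0),
    (4, "lemma", "zijn", [1], 0),
]

class ForwardIndex:
    """One dictionary per lookup described above."""

    def __init__(self, objects):
        self.by_id = {}                       # Object Id -> object
        self.by_position = defaultdict(list)  # Position -> object ids
        self.by_parent = defaultdict(list)    # Parent Id -> object ids
        self.by_prefix_position = {}          # (prefix, position) -> value
        for mtas_id, prefix, value, positions, parent in objects:
            self.by_id[mtas_id] = (prefix, value)
            for position in positions:
                self.by_position[position].append(mtas_id)
                self.by_prefix_position[(prefix, position)] = value
            if parent is not None:
                self.by_parent[parent].append(mtas_id)

forward = ForwardIndex(OBJECTS)
print(forward.by_position[1])                    # all objects on position 1
print(forward.by_parent[0])                      # all objects with parent 0
print(forward.by_id[4])                          # the object with id 4
print(forward.by_prefix_position[("lemma", 1)])  # lemma on position 1
```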

Page 17: MTAS Henny Brugman

Usage new structure

The additions make it possible to quickly retrieve the required information for queries and results based on the annotated text.

To take advantage of these additions to the Lucene structure, we need

Tokenizer mapping the original annotated data (FoLiA) on the new structure

Query handlers, and query language: CQL

Page 18: MTAS Henny Brugman

FoLiA

<text xml:id="untitled.text">
  <p xml:id="untitled.p.1">
    <s xml:id="untitled.p.1.s.1">
      <w xml:id="untitled.p.1.s.1.w.1" class="WORD">
        <t>Ik</t>
        <pos class="VNW(pers,pron,nomin,vol,1,ev)" confidence="0.999791" head="VNW">
          <feat class="pers" subset="vwtype"/>
          <feat class="pron" subset="pdtype"/>
          <feat class="nomin" subset="naamval"/>
          <feat class="vol" subset="status"/>
          <feat class="1" subset="persoon"/>
          <feat class="ev" subset="getal"/>
        </pos>
        <morphology>
          <morpheme>
            <t offset="0">ik</t>
          </morpheme>
        </morphology>
        <lemma class="ik"/>
      </w>
      ...

Page 19: MTAS Henny Brugman

Tokenizer FoLiA

Several elements can be distinguished:

Words : <w/>

Annotations on Words : <pos/>, <t/>, <lemma/>

Groups of Words : <p/>, <s/>, <div/>

Annotations on Groups : <lang/>

References : <wref/>

Relations : <entity/>

The configurable FoLiA tokenizer makes it possible to define these items and map them onto the new index structure.
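A minimal sketch of such a mapping, assuming a small complete FoLiA-like fragment and a hypothetical configuration (`WORD_ANNOTATIONS`, `GROUPS`). It is not the MTAS tokenizer or its configuration format. The second word and its `head` value are assumed for illustration, and real FoLiA files may declare an XML namespace, in which case tag names would need to be qualified.

```python
import xml.etree.ElementTree as ET

FOLIA_FRAGMENT = """
<text xml:id="untitled.text">
  <p xml:id="untitled.p.1">
    <s xml:id="untitled.p.1.s.1">
      <w xml:id="untitled.p.1.s.1.w.1" class="WORD">
        <t>Ik</t>
        <pos class="VNW(pers,pron,nomin,vol,1,ev)" head="VNW"/>
        <lemma class="ik"/>
      </w>
      <w xml:id="untitled.p.1.s.1.w.2" class="WORD">
        <t>ben</t>
        <pos class="WW(pv,tgw,ev)" head="WW"/>
        <lemma class="zijn"/>
      </w>
    </s>
  </p>
</text>
"""

# Hypothetical configuration: annotations on words and how to read their value.
WORD_ANNOTATIONS = {
    "t": lambda element: element.text,
    "lemma": lambda element: element.get("class"),
    "pos": lambda element: element.get("head"),
}
GROUPS = ("p", "s")  # groups of words, stored as a range of positions

def tokenize_folia(xml_text):
    """Return (prefixed term, start position, end position) tuples."""
    tokens = []
    position = 0

    def walk(element):
        nonlocal position
        if element.tag == "w":
            for child in element:
                if child.tag in WORD_ANNOTATIONS:
                    value = WORD_ANNOTATIONS[child.tag](child)
                    tokens.append((f"{child.tag}:{value}", position, position))
            position += 1
            return
        start = position
        for child in element:
            walk(child)
        if element.tag in GROUPS and position > start:
            tokens.append((f"{element.tag}:", start, position - 1))

    walk(ET.fromstring(xml_text.strip()))
    return tokens

tokens = tokenize_folia(FOLIA_FRAGMENT)
for token in tokens:
    print(token)
```

Each `<w/>` advances the position by one, its annotations share that position, and each `<s/>` or `<p/>` becomes a token covering a range of positions.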

Page 20: MTAS Henny Brugman

Search using CQL

For new MTAS data structure

Query handlers provided

Support Corpus Query Language (CQL)

Makes it possible to define conditions on annotations

Confusion about the exact interpretation and implementation

Page 21: MTAS Henny Brugman

Search using CQL

the  big  green  shiny  apple
LID  ADJ  ADJ    ADJ    N

Ambiguities illustrated by examples

[pos = "LID" | word = "the"]     (1)

[word = "b.*" | word = ".*g"]    (2)

[pos = "ADJ"]{2}                 (3)

[pos = "ADJ"]? [pos = "N"]       (4)

Page 22: MTAS Henny Brugman

Search using CQL

Within MTAS

Results should be considered as equal if and only if the positions of both results exactly match.

Differs from the default query interpretation of Lucene and the CQL interpretation as used in other applications

No options to refer to parts of the matched pattern to e.g. sort, group or collect statistics
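The equality rule can be illustrated on the phrase "the big green shiny apple" and the four example queries. This is a hand-written sketch, not a CQL parser or the MTAS query handlers: each query is evaluated by a few lines of Python, and the helper names are assumptions.

```python
import re

# The example phrase: (word, pos) at positions 0..4.
TOKENS = [("the", "LID"), ("big", "ADJ"), ("green", "ADJ"),
          ("shiny", "ADJ"), ("apple", "N")]
LAYER = {"word": 0, "pos": 1}

def single(layer, pattern):
    """Spans (start, end) of single positions whose value fully matches the pattern."""
    return [(i, i) for i, token in enumerate(TOKENS)
            if re.fullmatch(pattern, token[LAYER[layer]])]

def unique(spans):
    """Two results are equal if and only if their positions match exactly."""
    return sorted(set(spans))

# (1) [pos = "LID" | word = "the"]: both alternatives hit position 0.
naive_1 = single("pos", "LID") + single("word", "the")
result_1 = unique(naive_1)

# (2) [word = "b.*" | word = ".*g"]: "big" satisfies both alternatives.
naive_2 = single("word", "b.*") + single("word", ".*g")
result_2 = unique(naive_2)

# (3) [pos = "ADJ"]{2}: every pair of adjacent adjectives is a result of its own.
adjectives = {start for start, _ in single("pos", "ADJ")}
result_3 = unique((i, i + 1) for i in adjectives if i + 1 in adjectives)

# (4) [pos = "ADJ"]? [pos = "N"]: noun alone and adjective plus noun differ in position.
nouns = [start for start, _ in single("pos", "N")]
result_4 = unique([(n, n) for n in nouns] +
                  [(n - 1, n) for n in nouns if n - 1 in adjectives])

print(naive_1, result_1)
print(naive_2, result_2)
print(result_3)
print(result_4)
```

In (1) and (2) a naive union of the alternatives reports the same position twice, while the position-based rule keeps one result. In (3) and (4) the overlapping matches are kept, because their positions differ.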

Page 23: MTAS Henny Brugman

Size indexes

Collection  # FoLiA    Zipped  Size Index  Positions
DBNL T      9,465      29GB    198GB       677,476,310
DBNL DT     131,177    -       95GB        395,530,191
SONAR       2,063,880  22GB    127GB       504,393,711

Search on combined indexes using Solr sharding

# FoLiA      2,204,522
# Positions  1,577,400,212
# Sentences  92,584,655

There are approximately 10 tokens on each position.

Page 24: MTAS Henny Brugman

Performance

Virtual Machine, Ubuntu, 8 cores, 48GB (40GB Solr)

Computing stats (sum, mean, median, standard deviation, etc.) on full set of 2,204,522 documents and 1,577,400,212 positions.

CQL                         Time        Hits         Docs
[t = "de"]                  3,023 ms    57,531,353   1,801,583
[t = "de" & pos = "LID"]    7,877 ms    56,704,921   1,799,499
[t = "de" & !pos = "LID"]   3,105 ms    826,432      132,722
<s> [t = "De"]              11,568 ms   6,085,643    1,090,127
[pos = "N"]                 6,200 ms    259,942,340  2,189,750
[pos = "ADJ"] [pos = "N"]   42,977 ms   45,366,603   1,821,716
[pos = "ADJ"]? [pos = "N"]  207,795 ms  305,308,943  2,189,750
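A short arithmetic check on the hit counts in the table, written as a sketch. The numbers are copied from the table. That they add up this way is an observation, and it is what one would expect if results are counted per distinct position span.

```python
# Hit counts copied from the performance table.
HITS = {
    '[t = "de"]': 57_531_353,
    '[t = "de" & pos = "LID"]': 56_704_921,
    '[t = "de" & !pos = "LID"]': 826_432,
    '[pos = "N"]': 259_942_340,
    '[pos = "ADJ"] [pos = "N"]': 45_366_603,
    '[pos = "ADJ"]? [pos = "N"]': 305_308_943,
}

# "de" with and without pos LID together give all occurrences of "de".
de_split = HITS['[t = "de" & pos = "LID"]'] + HITS['[t = "de" & !pos = "LID"]']

# The optional adjective yields the noun alone plus every adjective-noun pair.
optional_adj = HITS['[pos = "N"]'] + HITS['[pos = "ADJ"] [pos = "N"]']

print(de_split == HITS['[t = "de"]'])
print(optional_adj == HITS['[pos = "ADJ"]? [pos = "N"]'])
```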

Page 25: MTAS Henny Brugman

TODO

Group results

Facets

Performance

. . .

Page 26: MTAS Henny Brugman

The end
