A Technical Introduction to the Semantic Search Engine SeMedico

Erik Fäßler
Jena University Language & Information Engineering (JULIE) Lab
Friedrich Schiller University Jena, Jena, Germany
http://www.julielab.de

Talk in the semester project "Entwicklung einer Suchmaschine für Alternativmethoden zu Tierversuchen" (Development of a Search Engine for Alternative Methods to Animal Experiments)
January 12, 2018, Humboldt-Universität zu Berlin



SeMedico Front Page


SeMedico Auto Completion


SeMedico Result View I


SeMedico Result View II


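SeMedico System Overview

[Diagram: MEDLINE documents; JULIE Lab Server with PostgreSQL; UIMA pipeline of collection reader (CR), analysis engines (AE, AE, AE) and consumer (CO); ElasticSearch; Concept Database; NCBI Gene; SeMedico Web Application (Java Servlet); Frontend (Tapestry/JavaScript)]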


MEDLINE Document Storage I

•  MEDLINE comes in (G)ZIPed XML
•  30K documents per file

    <PubmedArticleSet>
      <PubmedArticle>
        <MedlineCitation>
          <PMID>1234567</PMID>
          <Article>
            <Journal>...</Journal>
            <ArticleTitle>...</ArticleTitle>
            <Abstract>...</Abstract>
            <AuthorList>...</AuthorList>
            <MeshHeadings>...</MeshHeadings>
          </Article>
        </MedlineCitation>
        <MedlineCitation>
          <PMID>...</PMID>
          ...
        </MedlineCitation>
      </PubmedArticle>
    </PubmedArticleSet>
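A file of this shape can be read citation by citation without loading all 30K documents into memory. The following is a minimal sketch in Python, not the reader SeMedico actually uses; the function name `iter_citations` and the in-memory `SAMPLE` are assumptions made for illustration.

```python
import gzip
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator, Tuple


def iter_citations(stream: BinaryIO) -> Iterator[Tuple[str, str]]:
    """Yield (pmid, citation_xml) pairs from a gzipped MEDLINE XML stream."""
    with gzip.open(stream, "rb") as xml_stream:
        # iterparse emits an element once its closing tag has been read
        for _event, elem in ET.iterparse(xml_stream, events=("end",)):
            if elem.tag == "MedlineCitation":
                pmid = elem.findtext("PMID")
                yield pmid, ET.tostring(elem, encoding="unicode")
                elem.clear()  # drop the subtree that was just emitted


# Hypothetical two-citation sample standing in for a real MEDLINE file.
SAMPLE = (
    b"<PubmedArticleSet>"
    b"<PubmedArticle><MedlineCitation><PMID>1234567</PMID>"
    b"<Article><ArticleTitle>A</ArticleTitle></Article>"
    b"</MedlineCitation></PubmedArticle>"
    b"<PubmedArticle><MedlineCitation><PMID>1729454</PMID>"
    b"</MedlineCitation></PubmedArticle>"
    b"</PubmedArticleSet>"
)

pairs = list(iter_citations(io.BytesIO(gzip.compress(SAMPLE))))
```

Each yielded pair already has the shape of one database row: the PMID as key and the citation XML as value.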


MEDLINE Document Storage II

•  Import of MEDLINE citations into database table
•  Size of MEDLINE: 27M abstracts

         pmid      xml
    1    1234567   <MedlineCitation><PMID>1234567</PMID>...</MedlineCitation>
    2    1729454   <MedlineCitation><PMID>1729454</PMID>...</MedlineCitation>
    3    1785742   <MedlineCitation><PMID>1785742</PMID>...</MedlineCitation>
    4    2264674   <MedlineCitation><PMID>2264674</PMID>...</MedlineCitation>
    ...  ...       ...
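The two-column layout can be sketched as follows. SeMedico stores the table in PostgreSQL; this example uses Python's built-in `sqlite3` only so that it runs without a server, and the table name `medline` and both function names are assumptions.

```python
import sqlite3


def create_table(conn: sqlite3.Connection) -> None:
    """Create the pmid/xml table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS medline ("
        "pmid TEXT PRIMARY KEY, xml TEXT NOT NULL)"
    )


def import_citations(conn: sqlite3.Connection, citations) -> int:
    """Insert or update (pmid, xml) rows and return the table size."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO medline (pmid, xml) VALUES (?, ?)",
            citations,
        )
    return conn.execute("SELECT COUNT(*) FROM medline").fetchone()[0]


conn = sqlite3.connect(":memory:")
create_table(conn)
count = import_citations(conn, [
    ("1234567", "<MedlineCitation><PMID>1234567</PMID></MedlineCitation>"),
    ("1729454", "<MedlineCitation><PMID>1729454</PMID></MedlineCitation>"),
])
```

Using the PMID as primary key means that a citation delivered again in a later update file replaces its older version instead of creating a duplicate row.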

From the Database into the Pipeline I

[Diagram: the pmid/xml table shown together with the system overview (MEDLINE documents, PostgreSQL on the JULIE Lab Server, CR/AE/CO pipeline, ElasticSearch, Concept Database, NCBI Gene, SeMedico Web Application, Frontend)]


From the Database into the Pipeline II

UIMA Medline DB Reader
•  DB concurrency handling
•  Parsing of XML
•  Populating UIMA CAS instance
   •  Title/Abstract
   •  Authors
   •  Journal Info
   •  etc.

[Diagram: PostgreSQL (JULIE Lab Server), then CAS (Common Analysis System), then to text analysis components]
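The parsing and populating steps can be illustrated with a small stand-in. This is a sketch only: `SimpleCas` is a hypothetical dataclass, not the UIMA CAS API, and the element paths (`Abstract/AbstractText`, `AuthorList/Author/LastName`, `Journal/Title`) are assumed from the PubMed XML format rather than taken from the slides.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class SimpleCas:
    """Hypothetical stand-in for a UIMA CAS holding one document."""
    pmid: str
    title: str
    abstract: str
    authors: List[str] = field(default_factory=list)
    journal: str = ""


def read_citation(xml: str) -> SimpleCas:
    """Parse one MedlineCitation string and fill the stand-in CAS."""
    root = ET.fromstring(xml)
    authors = [
        author.findtext("LastName", default="")
        for author in root.iterfind("Article/AuthorList/Author")
    ]
    abstract = " ".join(
        part.text or "" for part in root.iterfind("Article/Abstract/AbstractText")
    )
    return SimpleCas(
        pmid=root.findtext("PMID", default=""),
        title=root.findtext("Article/ArticleTitle", default=""),
        abstract=abstract,
        authors=authors,
        journal=root.findtext("Article/Journal/Title", default=""),
    )


# Hypothetical citation used only to exercise the function.
cas = read_citation(
    "<MedlineCitation><PMID>1234567</PMID><Article>"
    "<Journal><Title>Some Journal</Title></Journal>"
    "<ArticleTitle>mTOR signalling</ArticleTitle>"
    "<Abstract><AbstractText>mTOR regulates growth.</AbstractText></Abstract>"
    "<AuthorList><Author><LastName>Doe</LastName></Author></AuthorList>"
    "</Article></MedlineCitation>"
)
```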



SeMedico UIMA JCoRe Pipeline I

[Diagram: chain of analysis components, from reader to consumer]

•  Sentences, Tokens, Abbreviations, Parts of Speech
•  GeNo: Genes/Proteins
   •  Recognition
   •  Normalization (NCBI Gene)
•  Semantic layer
•  Molecular Event Extraction (BioSem)
•  MeSH Terms (Dictionary)
•  Ontology classes (GO, GRO; Dictionary)
•  Event Certainty Assessment
   •  Scale 1 to 6; 1: Negation, 6: No doubt
•  Species (LINNAEUS)

https://github.com/JULIELab/, Hahn & Matthies et al., LREC 2016
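The general idea of such a pipeline, where each component adds stand-off annotations to a shared document object, can be sketched as below. This is a toy illustration: the dictionary lookup stands in for gene tagging (GeNo itself is not a simple dictionary matcher), and all class and function names are assumptions.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Annotation:
    type: str
    begin: int
    end: int
    features: Dict[str, str] = field(default_factory=dict)


@dataclass
class Document:
    text: str
    annotations: List[Annotation] = field(default_factory=list)


def token_annotator(doc: Document) -> None:
    """Add one Token annotation per run of word characters."""
    for match in re.finditer(r"\w+", doc.text):
        doc.annotations.append(Annotation("Token", match.start(), match.end()))


def gene_annotator(doc: Document) -> None:
    """Toy dictionary tagger: mark tokens found in a gene dictionary."""
    gene_ids = {"mtor": "ncbigene:2475"}  # entry taken from the Concept IDs slide
    for token in [a for a in doc.annotations if a.type == "Token"]:
        gene_id = gene_ids.get(doc.text[token.begin:token.end].lower())
        if gene_id:
            doc.annotations.append(
                Annotation("Gene", token.begin, token.end, {"id": gene_id})
            )


def run_pipeline(doc: Document, engines: List[Callable[[Document], None]]) -> Document:
    """Run the engines in order; later engines see earlier annotations."""
    for engine in engines:
        engine(doc)
    return doc


doc = run_pipeline(Document("mTOR regulates growth."), [token_annotator, gene_annotator])
genes = [a for a in doc.annotations if a.type == "Gene"]
```

The order of the engines matters here in the same way as in the slide: the gene tagger relies on the tokens produced by the component before it.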


SeMedico UIMA JCoRe Pipeline II

ElasticSearch CAS Consumer
•  Transforms CAS into preanalyzed JSON document
•  Transformation configurable via API
•  JULIE Lab ES plugin required

[Diagram: from analysis pipeline; CAS (title, abstract, species, genes, events); transformation API; preanalyzed JSON; sent via http to ElasticSearch]

preanalyzed JSON:

    {
      "title": {…},
      "abstract": {…},
      "authors": {…},
      "…": {…}
    }


Full texts from PubMed Central

•  SeMedico integrates the open access subset of PMC

•  Use a specific reader from JCoRe: jcore-pmc-reader

•  The rest of the analysis is basically the same

•  But:

Matthies, Franz, & Hahn, Udo (2017). Scholarly information extraction is going to make a quantum leap with PubMed Central (PMC)® – But moving from abstracts to full texts seems harder than expected. in: MedInfo 2017: Precision Healthcare through Informatics – Proceedings of the 16th World Congress on Medical and Health Informatics. Hangzhou, China, 21-25 August 2017, 521-525.



Concept Database I

    Name                              Description                                           Number of Concepts
    Medical Subject Headings (MeSH)   Biomedical vocabulary, multi-hierarchy                26K
    MeSH Supplementary Concepts       Chemicals, proteins etc. connected to MeSH            150K
    NCBI Gene                         Gene database                                         650K (in SeMedico)
    NCBI Taxonomy                     Taxonomical classification of species                 1.1M
    Gene Ontology (GO)                Ontology about gene products and related processes    50K
    Gene Regulation Ontology (GRO)    Ontology about gene regulation processes              507


Concept Database II

•  Concepts are arranged taxonomically
   •  Squamous Cell Carcinoma IS-A Carcinoma

•  Neo4j is a graph database
   •  Terminologies and arbitrary relations between concepts can be modeled explicitly
   •  Appropriate query language:
      •  “Get descendants of concept”
      •  “Compute shortest path between two concepts”
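What the two example queries compute can be shown on a tiny in-memory taxonomy. In SeMedico these queries run inside Neo4j; the sketch below uses plain Python breadth-first search instead, and the concepts "Adenocarcinoma" and "Neoplasms" as well as both function names are assumptions added for illustration.

```python
from collections import defaultdict, deque

# (child, parent) IS-A edges; only the first pair comes from the slide.
IS_A = [
    ("Squamous Cell Carcinoma", "Carcinoma"),
    ("Adenocarcinoma", "Carcinoma"),
    ("Carcinoma", "Neoplasms"),
]


def descendants(concept, edges):
    """Return every concept reachable by following IS-A edges downwards."""
    children = defaultdict(set)
    for child, parent in edges:
        children[parent].add(child)
    found, queue = set(), deque([concept])
    while queue:
        for child in children[queue.popleft()]:
            if child not in found:
                found.add(child)
                queue.append(child)
    return found


def shortest_path(start, goal, edges):
    """Breadth-first search over the edges, ignoring their direction."""
    neighbours = defaultdict(set)
    for child, parent in edges:
        neighbours[child].add(parent)
        neighbours[parent].add(child)
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in neighbours[node]:
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None  # the two concepts are not connected
```

A graph database answers both questions with a single declarative query, which is the point of the slide; the code only makes the underlying traversal explicit.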


Neo4j Example Graph

[Figure: example concept graph containing the concept "Tauopathies"; labels type1, type2, type3, type4]


Neo4j Concept Node Properties


Zooming Out


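Concept IDs

[Diagram: concept IDs passing between Concept Database, CAS, transformation API, JSON document, ElasticSearch and the SeMedico Web Application (Java Servlet)]

•  Concept Database: tid2341, tid914, tid42
•  CAS: abstract; species: ncbitax:9606; genes: mTOR, ncbigene:2475
•  JSON: { "abstract": {["human", "tid914", "mTOR", "tid42"]} }
•  SeMedico Web Application (Java Servlet):
   –  query: "match: tid914"
   –  facet "tid42": {"name": "mTOR", "synonym": "FRAP", "description": "…"}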
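The role of the concept IDs can be sketched as two lookups: one at indexing time (source ID to internal `tid`) and one at search time (`tid` to display data for a facet). The pairing of `ncbitax:9606` with `tid914` and of `ncbigene:2475` with `tid42` is read off the order of the terms on the slide and is an assumption, as are the function names.

```python
# Source-database ID -> internal concept ID (pairing assumed from the slide).
SOURCE_TO_TID = {
    "ncbitax:9606": "tid914",
    "ncbigene:2475": "tid42",
}

# Concept database entry as shown for the facet; "…" is elided on the slide.
CONCEPTS = {
    "tid42": {"name": "mTOR", "synonym": "FRAP", "description": "…"},
}


def index_terms(mentions):
    """Turn (surface text, source ID) mentions into index terms.

    Each mention contributes its surface form followed by its concept ID,
    so a document can be found by either.
    """
    terms = []
    for surface, source_id in mentions:
        terms.append(surface)
        terms.append(SOURCE_TO_TID[source_id])
    return terms


def facet_info(tid):
    """Look up the display data for a concept ID returned by the index."""
    return CONCEPTS[tid]


terms = index_terms([("human", "ncbitax:9606"), ("mTOR", "ncbigene:2475")])
```

Because the index only stores the short `tid` strings, names, synonyms and descriptions can change in the concept database without re-indexing the documents.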


ElasticSearch I

• Manages Lucene index

•  Seamless index updates, no downtime

•  Easy to use index distribution model

•  Full text search

•  Faceting

• Highlighting


ElasticSearch II

•  Lucene generates index terms via “text analysis”
   –  Tokenization, case folding, synonym enrichment, stemming
   –  ElasticSearch does the same on sent document text

•  How to integrate UIMA?

•  First idea: Create a Lucene UIMA analyzer, but
   –  Moves (a lot of!) processing requirements into the ElasticSearch cluster
   –  Requires loading dictionaries and machine learning models
   –  Memory that is lost to Lucene and ElasticSearch
   –  Overall: Diminishes search performance
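The four analysis steps named above can be made concrete with a toy analyzer. This is a sketch of the idea only, not Lucene's implementation: the suffix-stripping `stem` function is deliberately naive and the synonym table is hypothetical.

```python
import re

# Hypothetical synonym table, keyed by the lower-cased token.
SYNONYMS = {"tumors": ["neoplasms"]}


def stem(token: str) -> str:
    """Naive stemmer: strip one common English suffix, if any."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def analyze(text: str) -> list:
    """Tokenize, case-fold, enrich with synonyms, then stem."""
    terms = []
    for token in re.findall(r"\w+", text):       # tokenization
        token = token.lower()                     # case folding
        variants = [token] + SYNONYMS.get(token, [])  # synonym enrichment
        terms.extend(stem(v) for v in variants)   # stemming
    return terms


terms = analyze("Tumors were evaluated")
```

Running a UIMA pipeline inside this analysis step would mean executing the full annotators, with their dictionaries and models, on the search cluster, which is the drawback listed under the first idea.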


ElasticSearch III

•  JULIE Lab ElasticSearch plugin to exactly specify index terms without ES-internal analysis
   –  https://github.com/JULIELab/elasticsearch-mapper-preanalyzed

•  Employs the JSON format created for the Solr JsonPreAnalyzedParser
   –  https://lucene.apache.org/solr/guide/6_6/working-with-external-files-and-processes.html#WorkingwithExternalFilesandProcesses-JsonPreAnalyzedParser

•  Preanalyzed documents are created by a (currently) JULIE Lab-internal CAS consumer


ElasticSearch IV Preanalyzed Format

    {"v": "1",
     "str": "Immunohistochemistry performed to evaluate the expression of phosphorylated mTOR (p-mTOR), phosphorylated p70S6K (p-p70S6K), phosphorylated 4E-binding protein 1 (p-4E-BP1), and Ki-67 using 105 surgically resected ESCC correlated with treatment outcome.",
     "tokens": [
       {"t": "immunohistochemistry", "s": 0, "e": 20, "i": 1},
       {"t": "tid94702", "s": 0, "e": 20, "i": 0},
       {"t": "perform", "s": 21, "e": 30, "i": 1},
       {"t": "evaluat", "s": 34, "e": 42, "i": 1},
       {"t": "event", "s": 34, "e": 42, "i": 0},
       …
     ]
    }
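How such a token list could be produced is sketched below. In the example above, a term with position increment `"i": 1` starts a new position, while `"i": 0` stacks a concept ID or event marker on the same position as the preceding word. The function names and the input layout are assumptions; this is not the JULIE Lab CAS consumer.

```python
import json


def build_tokens(words, concepts):
    """Build the preanalyzed token list.

    words:    list of (term, start, end) in text order
    concepts: dict mapping (start, end) to extra terms for that span
    """
    tokens = []
    for term, start, end in words:
        # the word itself advances the position by one
        tokens.append({"t": term, "s": start, "e": end, "i": 1})
        # extra terms share the word's position (increment 0)
        for extra in concepts.get((start, end), []):
            tokens.append({"t": extra, "s": start, "e": end, "i": 0})
    return tokens


def preanalyzed_field(text, words, concepts):
    """Serialize one field in the preanalyzed JSON format."""
    return json.dumps(
        {"v": "1", "str": text, "tokens": build_tokens(words, concepts)},
        ensure_ascii=False,
    )


text = "Immunohistochemistry performed to evaluate"
words = [
    ("immunohistochemistry", 0, 20),
    ("perform", 21, 30),
    ("evaluat", 34, 42),
]
concepts = {(0, 20): ["tid94702"], (34, 42): ["event"]}
field_json = preanalyzed_field(text, words, concepts)
```

Stacking the extra terms at increment 0 keeps phrase positions intact, so a concept ID occupies the same position as the word it annotates.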


ElasticSearch V Simple Query

    {
      "query": {
        "bool": {
          "must": [
            { "match": { "abstracttext": { "query": "cancer" } } },
            { "nested": {
                "path": "events",
                "inner_hits": {},
                "query": {
                  "bool": {
                    "must": [ { "match": { "events.allarguments": "mtor" } } ],
                    "filter": { "range": { "events.likelihood": { "lte": 5 } } }
                  }
                }
            } }
          ]
        }
      },
      "fields": [ "abstracttext", "title" ]
    }


ElasticSearch VI Concept Query

    {
      "query": {
        "bool": {
          "must": [
            { "match": { "abstracttext": { "query": "tid52310" } } },
            { "nested": {
                "path": "events",
                "inner_hits": {},
                "query": {
                  "bool": {
                    "must": [ { "match": { "events.allarguments": "tid42" } } ],
                    "filter": { "range": { "events.likelihood": { "lte": 5 } } }
                  }
                }
            } }
          ]
        }
      },
      "fields": [ "abstracttext", "title" ]
    }

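The simple query and the concept query have the same shape and differ only in the terms they search for: plain words in one case, concept IDs in the other. A minimal sketch of a builder for this query body follows; the function name `build_query` and its parameters are assumptions, and the code only constructs the JSON body without contacting ElasticSearch.

```python
import json


def build_query(text_term, event_argument, max_likelihood=5):
    """Build the query body used in both examples.

    text_term:      word or concept ID matched against the abstract text
    event_argument: word or concept ID matched against the event arguments
    max_likelihood: upper bound for the event certainty value
    """
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {"abstracttext": {"query": text_term}}},
                    {
                        "nested": {
                            "path": "events",
                            "inner_hits": {},
                            "query": {
                                "bool": {
                                    "must": [
                                        {"match": {"events.allarguments": event_argument}}
                                    ],
                                    "filter": {
                                        "range": {
                                            "events.likelihood": {"lte": max_likelihood}
                                        }
                                    },
                                }
                            },
                        }
                    },
                ]
            }
        },
        "fields": ["abstracttext", "title"],
    }


simple_query = build_query("cancer", "mtor")
concept_query = build_query("tid52310", "tid42")
body = json.dumps(concept_query)
```

Because concept IDs are indexed as ordinary terms, switching from a keyword search to a concept search needs no change to the query structure.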

ElasticSearch VII Highlighting


References

•  Semedico
   –  Faessler, Erik, & Hahn, Udo (2017). SEMEDICO: A comprehensive semantic search engine for the life sciences. in: ACL 2017 – Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Vancouver, British Columbia, Canada, August 1, 2017, 91–96.

•  GeNo
   –  Wermter, Joachim, & Tomanek, Katrin, & Hahn, Udo (2009). High-performance gene name normalization with GeNo. in: Bioinformatics, 25, 815-821.

•  BioSem
   –  Bui, Q., Mulligen, E. van, Campos, D., & Kors, J. (2013). A Fast Rule-based Approach for Biomedical Event Extraction. In Proceedings of the BioNLP 2013 Shared Task Workshop (pp. 104–108). Sofia, Bulgaria: Association for Computational Linguistics.

•  Certainty Assessment
   –  Engelmann, Christine, & Hahn, Udo (2014). An empirically grounded approach to extend the linguistic coverage and lexical diversity of verbal probabilities. in: CogSci 2014 - Proceedings of the 36th Annual Cognitive Science Conference. Cognitive Science Meets Artificial Intelligence: Human and Artificial Agents in Interactive Contexts. Québec City, Québec, Canada, July 23-26, 2014, 451-456.

•  JCoRe
   –  Hahn, Udo, & Matthies, Franz, & Faessler, Erik, & Hellrich, Johannes (2016). UIMA-based JCoRe 2.0 goes GitHub and Maven Central: State-of-the-art software resource engineering and distribution of NLP pipelines. in: LREC 2016 – Proceedings of the 10th International Conference on Language Resources and Evaluation. Portorož, Slovenia, 23-28 May 2016, 2502-2509.


Conclusion

[Diagram: SeMedico system overview with MEDLINE documents, PostgreSQL on the JULIE Lab Server, CR/AE/CO pipeline, ElasticSearch, Concept Database, NCBI Gene, SeMedico Web Application (Java Servlet) and Frontend (Tapestry/JavaScript)]

http://www.semedico.org/
