
Jena University Language & Information Engineering (JULIE) Lab Friedrich Schiller University Jena,

Jena, Germany

http://www.julielab.de

A Technical Introduction to the Semantic Search Engine SeMedico

Erik Fäßler

Talk in the semester project "Entwicklung einer Suchmaschine für Alternativmethoden zu Tierversuchen" (Development of a Search Engine for Alternative Methods to Animal Experiments)

January 12, 2018, Humboldt-Universität zu Berlin


SeMedico Front Page


SeMedico Auto Completion


SeMedico Result View I


SeMedico Result View II


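SeMedico System Overview

[Diagram: system components]

•  Document sources: MEDLINE documents, NCBI Gene
•  JULIE Lab Server: PostgreSQL and the UIMA pipeline, consisting of CR (collection reader), AE, AE, AE (analysis engines), CO (consumer)
•  ElasticSearch
•  Concept Database
•  SeMedico Web Application (Java Servlet)
•  Frontend (Tapestry/JavaScript)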


MEDLINE Document Storage I

•  MEDLINE comes as (G)ZIP-compressed XML
•  30K documents per file

<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1234567</PMID>
      <Article>
        <Journal>...</Journal>
        <ArticleTitle>...</ArticleTitle>
        <Abstract>...</Abstract>
        <AuthorList>...</AuthorList>
        <MeshHeadings>...</MeshHeadings>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>...</PMID>
      ...
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
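Reading such a file can be sketched in a few lines of Python. This is a minimal illustration, not the SeMedico import code: it assumes the simplified element layout shown on the slide, and the function names `iter_citations` and `iter_citations_from_gzip` are made up for this example. Streaming with `iterparse` keeps memory use flat even though a file holds about 30K documents.

```python
import gzip
import io
import xml.etree.ElementTree as ET


def iter_citations(xml_stream):
    """Yield (pmid, citation_xml) for every MedlineCitation in the stream."""
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag == "MedlineCitation":
            pmid = elem.findtext("PMID")
            yield pmid, ET.tostring(elem, encoding="unicode")
            elem.clear()  # drop the parsed subtree to keep memory use low


def iter_citations_from_gzip(path):
    """Open a gzipped MEDLINE file and stream its citations."""
    with gzip.open(path, "rb") as stream:
        yield from iter_citations(stream)


# Tiny in-memory sample instead of a real MEDLINE file (illustrative data).
SAMPLE = b"""<PubmedArticleSet>
  <PubmedArticle><MedlineCitation>
    <PMID>1234567</PMID>
    <Article><ArticleTitle>An example title</ArticleTitle></Article>
  </MedlineCitation></PubmedArticle>
</PubmedArticleSet>"""

citations = list(iter_citations(io.BytesIO(SAMPLE)))
print(citations[0][0])
```

Each yielded pair already has the shape needed for a two-column (pmid, xml) table.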


MEDLINE Document Storage II

•  Import of MEDLINE citations into database table
•  Size of MEDLINE: 27M abstracts

pmid    | xml
1234567 | <MedlineCitation><PMID>1234567</PMID>...</MedlineCitation>
...     | ...
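The two-column layout (pmid, xml) can be tried out with a short Python sketch. Two assumptions are made so that the example runs without a server: `sqlite3` stands in for PostgreSQL, and the table name `medline` as well as the function names are hypothetical.

```python
import sqlite3


def create_store(conn):
    """Create the two-column citation table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS medline "
        "(pmid TEXT PRIMARY KEY, xml TEXT NOT NULL)"
    )


def import_citations(conn, citations):
    """Insert or update (pmid, xml) pairs; returns the number of rows written."""
    rows = list(citations)
    with conn:  # one transaction for the whole batch
        conn.executemany(
            "INSERT OR REPLACE INTO medline (pmid, xml) VALUES (?, ?)", rows
        )
    return len(rows)


conn = sqlite3.connect(":memory:")  # assumption: SQLite instead of PostgreSQL
create_store(conn)
row = ("1234567", "<MedlineCitation><PMID>1234567</PMID></MedlineCitation>")
written = import_citations(conn, [row])
import_citations(conn, [row])  # re-importing an updated file replaces the row
row_count = conn.execute("SELECT COUNT(*) FROM medline").fetchone()[0]
stored_xml = conn.execute(
    "SELECT xml FROM medline WHERE pmid = ?", ("1234567",)
).fetchone()[0]
```

Using the PMID as primary key means that a re-imported citation overwrites the older version instead of creating a duplicate.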


From the Database into the Pipeline I


From the Database into the Pipeline II

UIMA Medline DB Reader
•  DB concurrency handling
•  Parsing of XML
•  Populating UIMA CAS instance
   –  Title/Abstract
   –  Authors
   –  Journal info
   –  etc.

[Diagram: JULIE Lab Server: PostgreSQL → CAS (Common Analysis System) → to text analysis components]
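The parsing and populating steps can be sketched as follows. This is only an illustration in Python, whereas the actual reader is a UIMA component: `SimpleCas` is a hypothetical stand-in for a UIMA CAS, and the nested element names (`AbstractText`, `Author/LastName`, `Journal/Title`) follow the PubMed XML format rather than the abbreviated slide example. DB concurrency handling is left out.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class SimpleCas:
    """Hypothetical stand-in for a UIMA CAS: document text plus metadata."""
    pmid: str
    title: str = ""
    abstract: str = ""
    authors: list = field(default_factory=list)
    journal: str = ""


def populate_cas(citation_xml):
    """Parse one stored MedlineCitation and fill a SimpleCas from it."""
    root = ET.fromstring(citation_xml)
    cas = SimpleCas(pmid=root.findtext("PMID", default=""))
    article = root.find("Article")
    if article is None:
        return cas
    cas.title = article.findtext("ArticleTitle", default="")
    cas.abstract = " ".join(
        (node.text or "") for node in article.iterfind("Abstract/AbstractText")
    )
    cas.authors = [
        author.findtext("LastName", default="")
        for author in article.iterfind("AuthorList/Author")
    ]
    cas.journal = article.findtext("Journal/Title", default="")
    return cas


# Illustrative citation, not real MEDLINE data.
SAMPLE_CITATION = """<MedlineCitation>
  <PMID>1234567</PMID>
  <Article>
    <Journal><Title>Example Journal</Title></Journal>
    <ArticleTitle>An example title</ArticleTitle>
    <Abstract><AbstractText>mTOR regulates cell growth.</AbstractText></Abstract>
    <AuthorList><Author><LastName>Doe</LastName></Author></AuthorList>
  </Article>
</MedlineCitation>"""

cas = populate_cas(SAMPLE_CITATION)
```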





SeMedico UIMA JCoRe Pipeline I

[Diagram: analysis components between reader and consumer]

•  Sentences
•  Tokens
•  Abbreviations
•  Parts of speech
•  Semantic layer
   –  GeNo: genes/proteins (recognition, normalization to NCBI Gene)
   –  Molecular event extraction (BioSem)
   –  MeSH terms (dictionary)
   –  Ontology classes (GO, GRO; dictionary)
   –  Event certainty assessment on a scale from 1 to 6 (1: negation, 6: no doubt)
   –  Species (LINNAEUS)

https://github.com/JULIELab/, Hahn & Matthies et al., LREC 2016
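The idea of chaining analysis components can be sketched in Python: each step reads the shared document structure and adds a layer of stand-off annotations (character offsets plus a label). This is a toy sketch, not JCoRe code. The dict-based `cas`, the function names, and the one-entry gene dictionary are assumptions, and the dictionary lookup is a plain substitute for GeNo, which does far more.

```python
import re

# Hypothetical one-entry dictionary: lower-cased gene name -> NCBI Gene ID.
GENE_DICTIONARY = {"mtor": "ncbigene:2475"}


def annotate_tokens(cas):
    """Add a token layer as (start, end, text) tuples."""
    cas["tokens"] = [
        (m.start(), m.end(), m.group()) for m in re.finditer(r"\w+", cas["text"])
    ]


def annotate_genes(cas):
    """Add a gene layer by looking up tokens in the toy dictionary."""
    cas["genes"] = [
        (start, end, GENE_DICTIONARY[text.lower()])
        for start, end, text in cas["tokens"]
        if text.lower() in GENE_DICTIONARY
    ]


# Order matters: the gene step relies on the token layer.
PIPELINE = [annotate_tokens, annotate_genes]


def run_pipeline(text):
    """Run all analysis steps in order on a fresh document structure."""
    cas = {"text": text}
    for analysis_engine in PIPELINE:
        analysis_engine(cas)
    return cas


result = run_pipeline("mTOR regulates cell growth.")
```

Because annotations only reference offsets into the unchanged text, later steps can be added or swapped without touching earlier ones.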


SeMedico UIMA JCoRe Pipeline II

ElasticSearch CAS Consumer
•  Transforms CAS into preanalyzed JSON document
•  Transformation configurable via API
•  JULIE Lab ES plugin required

[Diagram: from analysis pipeline → CAS (title, abstract, species, genes, events) → transformation API → preanalyzed JSON → http → ElasticSearch]

preanalyzed JSON:
{
  "title": {…}, "abstract": {…}, "authors": {…}, "…": {…}
}


Full texts from PubMed Central

•  SeMedico integrates the open access subset of PMC

•  Use a specific reader from JCoRe: jcore-pmc-reader

•  The rest of the analysis is basically the same

•  But:

Matthies, Franz, & Hahn, Udo (2017). Scholarly information extraction is going to make a quantum leap with PubMed Central (PMC)® – But moving from abstracts to full texts seems harder than expected. in: MedInfo 2017: Precision Healthcare through Informatics – Proceedings of the 16th World Congress on Medical and Health Informatics. Hangzhou, China, 21-25 August 2017, 521-525.





Concept Database I

Name                            | Description                                        | Number of concepts
Medical Subject Headings (MeSH) | Biomedical vocabulary, multi-hierarchy             | 26K
MeSH Supplementary Concepts     | Chemicals, proteins etc. connected to MeSH         | 150K
NCBI Gene                       | Gene database                                      | 650K (in SeMedico)
NCBI Taxonomy                   | Taxonomical classification of species              | 1.1M
Gene Ontology (GO)              | Ontology about gene products and related processes | 50K
Gene Regulation Ontology (GRO)  | Ontology about gene regulation processes           | 507


Concept Database II

•  Concepts are arranged taxonomically
   –  Squamous Cell Carcinoma IS-A Carcinoma
•  Neo4j is a graph database
   –  Terminologies and arbitrary relations between concepts can be modeled explicitly
•  Appropriate query language:
   –  "Get descendants of concept"
   –  "Compute shortest path between two concepts"
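What the two example queries compute can be shown with a small Python sketch. It does not use Neo4j or its query language. A plain dict holds a toy taxonomy (only the Squamous Cell Carcinoma IS-A Carcinoma edge comes from the slide, the other entries are illustrative), and breadth-first search answers both questions.

```python
from collections import deque

# Toy taxonomy: child -> parents (IS-A edges). Mostly illustrative data.
IS_A = {
    "Squamous Cell Carcinoma": ["Carcinoma"],
    "Adenocarcinoma": ["Carcinoma"],
    "Carcinoma": ["Neoplasms"],
}


def descendants(concept, is_a):
    """Return all concepts that reach `concept` via IS-A edges."""
    children = {}
    for child, parents in is_a.items():
        for parent in parents:
            children.setdefault(parent, []).append(child)
    seen, queue = set(), deque([concept])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def shortest_path(start, goal, is_a):
    """Breadth-first search over undirected IS-A edges; None if unreachable."""
    neighbours = {}
    for child, parents in is_a.items():
        for parent in parents:
            neighbours.setdefault(child, set()).add(parent)
            neighbours.setdefault(parent, set()).add(child)
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in sorted(neighbours.get(node, ())):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None


carcinoma_descendants = descendants("Carcinoma", IS_A)
path = shortest_path("Squamous Cell Carcinoma", "Adenocarcinoma", IS_A)
```

A graph database offers such traversals as built-in query constructs, so they do not have to be reimplemented in application code as done here.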


Neo4j Example Graph

[Figure: example graph around the concept "Tauopathies", with labels type1, type2, type3, type4]


Neo4j Concept Node Properties


Zooming Out


Concept IDs

[Diagram: concept IDs on the way from the CAS to ElasticSearch]

•  Concept Database: tid2341, tid914, tid42
•  CAS: abstract; species: ncbitax:9606; genes: mTOR, ncbigene:2475
•  Transformation API produces JSON for ElasticSearch:

{
  "abstract": {["human", "tid914", "mTOR", "tid42"]}
}

•  query: "match: tid914"
•  facet: "tid42": {"name": "mTOR", "synonym": "FRAP", "description": "…"}
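The mapping step on this slide can be sketched in Python. The lookup table below is a hypothetical two-entry excerpt of the concept database, built from the IDs shown on the slide, and the function name and input format are assumptions for illustration only.

```python
# Hypothetical excerpt of the concept database: source ID -> SeMedico concept ID.
SOURCE_TO_TID = {
    "ncbitax:9606": "tid914",
    "ncbigene:2475": "tid42",
}


def enrich_tokens(tokens, annotations):
    """Return the index terms for a field.

    tokens: list of token strings
    annotations: {token index: source ID found by the pipeline}
    Each annotated token is followed by its concept ID.
    """
    enriched = []
    for index, token in enumerate(tokens):
        enriched.append(token)
        tid = SOURCE_TO_TID.get(annotations.get(index))
        if tid is not None:
            enriched.append(tid)
    return enriched


index_terms = enrich_tokens(
    ["human", "mTOR"], {0: "ncbitax:9606", 1: "ncbigene:2475"}
)
print(index_terms)  # ['human', 'tid914', 'mTOR', 'tid42']
```

Since the concept ID is indexed like any other term, a query for `tid914` finds every document in which the species was recognized, regardless of the surface form used in the text.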


ElasticSearch I

• Manages Lucene index

•  Seamless index updates, no downtime

•  Easy to use index distribution model

•  Full text search

•  Faceting

• Highlighting


ElasticSearch II

•  Lucene generates index terms via "text analysis"
   –  Tokenization, case folding, synonym enrichment, stemming
   –  ElasticSearch does the same on sent document text
•  How to integrate UIMA?
•  First idea: Create a Lucene UIMA analyzer, but
   –  Moves a lot of processing requirements into the ElasticSearch cluster
   –  Requires loading dictionaries and machine learning models
   –  Takes memory away from Lucene and ElasticSearch
   –  Overall: Diminishes search performance
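The four text analysis steps named above can be illustrated with a toy analyzer in Python. It is not Lucene code: the suffix stripping is a crude stand-in for a real stemmer, and the synonym table is a hypothetical one-entry example (FRAP as a synonym of mTOR).

```python
import re

# Hypothetical synonym table; keys are case-folded terms.
SYNONYMS = {"frap": ["mtor"]}
# Crude stand-in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(term):
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def analyze(text):
    """Turn raw text into a list of index terms."""
    terms = []
    for match in re.finditer(r"\w+", text):       # tokenization
        folded = match.group().lower()            # case folding
        terms.append(stem(folded))                # stemming
        terms.extend(SYNONYMS.get(folded, []))    # synonym enrichment
    return terms


index_terms = analyze("FRAP signals performed")
print(index_terms)  # ['frap', 'mtor', 'signal', 'perform']
```

A search engine applies the same chain to the query text, so that query terms and index terms meet in the same normalized form.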


ElasticSearch III

•  JULIE Lab ElasticSearch plugin to exactly specify index terms without ES-internal analysis
   –  https://github.com/JULIELab/elasticsearch-mapper-preanalyzed
•  Employs the JSON format created for the Solr JsonPreAnalyzedParser
   –  https://lucene.apache.org/solr/guide/6_6/working-with-external-files-and-processes.html#WorkingwithExternalFilesandProcesses-JsonPreAnalyzedParser
•  Documents in this format are created by a (currently) JULIE Lab-internal CAS consumer


ElasticSearch IV Preanalyzed Format

{"v":"1",
 "str":"Immunohistochemistry performed to evaluate the expression of phosphorylated mTOR (p-mTOR), phosphorylated p70S6K (p-p70S6K), phosphorylated 4E-binding protein 1 (p-4E-BP1), and Ki-67 using 105 surgically resected ESCC correlated with treatment outcome.",
 "tokens":[
   {"t":"immunohistochemistry","s":0,"e":20,"i":1},
   {"t":"tid94702","s":0,"e":20,"i":0},
   {"t":"perform","s":21,"e":30,"i":1},
   {"t":"evaluat","s":34,"e":42,"i":1},
   {"t":"event","s":34,"e":42,"i":0},
   …
 ]
}
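A document in this format can be produced with a few lines of Python. This is a minimal sketch and not the CAS consumer itself: the function name and the input structure are assumptions. The keys follow the format shown above, where `t` is the term, `s` and `e` are character offsets, and `i` is the position increment. An increment of 0 stacks a term on the same position as the previous one, which is how a concept ID comes to sit on top of the word it annotates.

```python
import json


def to_preanalyzed(text, annotated_spans):
    """Serialize text and its index terms in the preanalyzed JSON format.

    annotated_spans: list of (start, end, [terms]) sorted by start offset.
    The first term of a span advances the position by one; additional
    terms (for example concept IDs) get increment 0 and share the position.
    """
    tokens = []
    for start, end, terms in annotated_spans:
        for rank, term in enumerate(terms):
            tokens.append(
                {"t": term, "s": start, "e": end, "i": 1 if rank == 0 else 0}
            )
    return json.dumps({"v": "1", "str": text, "tokens": tokens})


document = to_preanalyzed(
    "Immunohistochemistry performed",
    [
        (0, 20, ["immunohistochemistry", "tid94702"]),
        (21, 30, ["perform"]),
    ],
)
parsed = json.loads(document)
```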


ElasticSearch V Simple Query

{
  "query": {
    "bool": {
      "must": [
        { "match": { "abstracttext": { "query": "cancer" } } },
        { "nested": {
            "path": "events",
            "inner_hits": {},
            "query": {
              "bool": {
                "must": [ { "match": { "events.allarguments": "mtor" } } ],
                "filter": { "range": { "events.likelihood": { "lte": 5 } } }
              }
            }
        } }
      ]
    }
  },
  "fields": [ "abstracttext", "title" ]
}


ElasticSearch VI Concept Query

{
  "query": {
    "bool": {
      "must": [
        { "match": { "abstracttext": { "query": "tid52310" } } },
        { "nested": {
            "path": "events",
            "inner_hits": {},
            "query": {
              "bool": {
                "must": [ { "match": { "events.allarguments": "tid42" } } ],
                "filter": { "range": { "events.likelihood": { "lte": 5 } } }
              }
            }
        } }
      ]
    }
  },
  "fields": [ "abstracttext", "title" ]
}

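The simple query and the concept query differ only in the terms they search for, so both can come from one builder. The following Python sketch assembles the request body as a plain dict. The function name and its parameters are assumptions for illustration, and sending the body to an ElasticSearch server is left out.

```python
import json


def build_event_query(text_term, argument_term, max_likelihood=5):
    """Build a query body for documents that match `text_term` in the
    abstract and contain an event with `argument_term` as argument whose
    likelihood value is at most `max_likelihood`."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {"abstracttext": {"query": text_term}}},
                    {
                        "nested": {
                            "path": "events",
                            "inner_hits": {},
                            "query": {
                                "bool": {
                                    "must": [
                                        {"match": {"events.allarguments": argument_term}}
                                    ],
                                    "filter": {
                                        "range": {
                                            "events.likelihood": {"lte": max_likelihood}
                                        }
                                    },
                                }
                            },
                        }
                    },
                ]
            }
        },
        "fields": ["abstracttext", "title"],
    }


simple_query = build_event_query("cancer", "mtor")      # plain words
concept_query = build_event_query("tid52310", "tid42")  # concept IDs
request_body = json.dumps(concept_query)
```

Because concept IDs are ordinary index terms, the concept query needs no special query type, only different term strings.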

ElasticSearch VII Highlighting


References

•  Semedico
   –  Faessler, Erik, & Hahn, Udo (2017). SEMEDICO: A comprehensive semantic search engine for the life sciences. in: ACL 2017 – Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Vancouver, British Columbia, Canada, August 1, 2017, 91–96.
•  GeNo
   –  Wermter, Joachim, & Tomanek, Katrin, & Hahn, Udo (2009). High-performance gene name normalization with GeNo. in: Bioinformatics, 25, 815-821.
•  BioSem
   –  Bui, Q., Mulligen, E. van, Campos, D., & Kors, J. (2013). A Fast Rule-based Approach for Biomedical Event Extraction. In Proceedings of the BioNLP 2013 Shared Task Workshop (pp. 104–108). Sofia, Bulgaria: Association for Computational Linguistics.
•  Certainty Assessment
   –  Engelmann, Christine, & Hahn, Udo (2014). An empirically grounded approach to extend the linguistic coverage and lexical diversity of verbal probabilities. in: CogSci 2014 - Proceedings of the 36th Annual Cognitive Science Conference. Cognitive Science Meets Artificial Intelligence: Human and Artificial Agents in Interactive Contexts. Québec City, Québec, Canada, July 23-26, 2014, 451-456.
•  JCoRe
   –  Hahn, Udo, & Matthies, Franz, & Faessler, Erik, & Hellrich, Johannes (2016). UIMA-based JCoRe 2.0 goes GitHub and Maven Central: State-of-the-art software resource engineering and distribution of NLP pipelines. in: LREC 2016 – Proceedings of the 10th International Conference on Language Resources and Evaluation. Portorož, Slovenia, 23-28 May 2016, 2502-2509.


Conclusion


http://www.semedico.org/

