Erik Fäßler, Technical Introduction to Semedico
Jena University Language & Information Engineering (JULIE) Lab
Friedrich Schiller University Jena, Jena, Germany
http://www.julielab.de
A Technical Introduction to the Semantic Search Engine SeMedico
Erik Fäßler
Talk in the semester project "Entwicklung einer Suchmaschine für Alternativmethoden zu Tierversuchen" (Development of a Search Engine for Alternative Methods to Animal Experiments)
January 12, 2018, Humboldt-Universität zu Berlin
SeMedico Front Page
SeMedico Auto Completion
SeMedico Result View I
SeMedico Result View II
SeMedico System Overview
[Figure: system architecture. Labeled components: MEDLINE documents (Doc); JULIE Lab server with PostgreSQL; processing pipeline with CR (reader), three AE (analysis engine) boxes and CO (consumer); ElasticSearch; concept database; NCBI Gene; SeMedico web application (Java servlet); frontend (Tapestry/JavaScript).]
MEDLINE Document Storage I
• MEDLINE comes in (G)ZIPed XML
• 30K documents per file

<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1234567</PMID>
      <Article>
        <Journal>...</Journal>
        <ArticleTitle>...</ArticleTitle>
        <Abstract>...</Abstract>
        <AuthorList>...</AuthorList>
        <MeshHeadings>...</MeshHeadings>
      </Article>
    </MedlineCitation>
    <MedlineCitation>
      <PMID>...</PMID>
      ...
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
MEDLINE Document Storage II
• Import of MEDLINE citations into database table
• Size of MEDLINE: 27M abstracts

     pmid      xml
1    1234567   <MedlineCitation><PMID>1234567</PMID>...</MedlineCitation>
2    1729454   <MedlineCitation><PMID>1729454</PMID>...</MedlineCitation>
3    1785742   <MedlineCitation><PMID>1785742</PMID>...</MedlineCitation>
4    2264674   <MedlineCitation><PMID>2264674</PMID>...</MedlineCitation>
...  ...       ...
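
The import step can be sketched roughly as follows. This is only an illustration, not the JULIE Lab importer: it uses Python's built-in sqlite3 as a stand-in for PostgreSQL, and the table name `medline`, the function `import_medline` and the two-document sample are assumptions made for the example.

```python
import gzip
import io
import sqlite3
import xml.etree.ElementTree as ET


def import_medline(xml_gz_bytes, conn):
    """Store each MedlineCitation of one gzipped MEDLINE file as (pmid, xml)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS medline (pmid TEXT PRIMARY KEY, xml TEXT)"
    )
    with gzip.open(io.BytesIO(xml_gz_bytes)) as stream:
        root = ET.parse(stream).getroot()
    count = 0
    # iter() finds citations at any depth below <PubmedArticleSet>.
    for citation in root.iter("MedlineCitation"):
        pmid = citation.findtext("PMID")
        xml = ET.tostring(citation, encoding="unicode").strip()
        conn.execute("INSERT OR REPLACE INTO medline VALUES (?, ?)", (pmid, xml))
        count += 1
    conn.commit()
    return count


# Tiny made-up stand-in for a real MEDLINE file (assumption for the example).
SAMPLE = (
    "<PubmedArticleSet>"
    "<PubmedArticle><MedlineCitation><PMID>1234567</PMID></MedlineCitation></PubmedArticle>"
    "<PubmedArticle><MedlineCitation><PMID>1729454</PMID></MedlineCitation></PubmedArticle>"
    "</PubmedArticleSet>"
)

conn = sqlite3.connect(":memory:")
imported = import_medline(gzip.compress(SAMPLE.encode("utf-8")), conn)
rows = conn.execute("SELECT pmid, xml FROM medline ORDER BY pmid").fetchall()
print(imported, rows[0][0])
```

Keeping the raw citation XML next to its PMID means a later reader can fetch documents by ID and parse them on demand.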
From the Database into the Pipeline I
[Figure: the (pmid, xml) database table and the system overview diagram.]
From the Database into the Pipeline II
UIMA Medline DB Reader
• DB concurrency handling
• Parsing of XML
• Populating UIMA CAS instance
  • Title/Abstract
  • Authors
  • Journal Info
  • etc.
[Figure: JULIE Lab server (PostgreSQL) → CAS (Common Analysis System) → to text analysis components]
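
What the reader does with one database row might look like the sketch below. It is not the UIMA reader itself: `SimpleCas` is a hypothetical stand-in for a UIMA CAS, `read_citation` is an invented name, the sample row is made up, and DB concurrency handling is left out entirely.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class SimpleCas:
    """Hypothetical stand-in for a UIMA CAS: document fields plus annotations."""
    pmid: str
    title: str
    abstract: str
    authors: list = field(default_factory=list)
    journal: str = ""
    annotations: list = field(default_factory=list)


def read_citation(xml):
    """Parse one stored MedlineCitation string into a SimpleCas."""
    citation = ET.fromstring(xml)
    article = citation.find("Article")
    authors = [
        " ".join(filter(None, (a.findtext("ForeName"), a.findtext("LastName"))))
        for a in article.iterfind("AuthorList/Author")
    ]
    abstract = " ".join(
        t.text or "" for t in article.iterfind("Abstract/AbstractText")
    )
    return SimpleCas(
        pmid=citation.findtext("PMID"),
        title=article.findtext("ArticleTitle", default=""),
        abstract=abstract,
        authors=authors,
        journal=article.findtext("Journal/Title", default=""),
    )


# Made-up row content for illustration only.
ROW_XML = (
    "<MedlineCitation><PMID>1234567</PMID><Article>"
    "<Journal><Title>Example Journal</Title></Journal>"
    "<ArticleTitle>Example title</ArticleTitle>"
    "<Abstract><AbstractText>Example abstract.</AbstractText></Abstract>"
    "<AuthorList><Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author></AuthorList>"
    "</Article></MedlineCitation>"
)

cas = read_citation(ROW_XML)
print(cas.pmid, cas.title, cas.authors)
```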
SeMedico UIMA JCoRe Pipeline I
(from reader)
• Sentences, Tokens, Abbreviations, Parts of Speech
• GeNo: Genes/Proteins
  • Recognition
  • Normalization (NCBI Gene)
Semantic layer
• Molecular Event Extraction (BioSem)
• MeSH Terms (Dictionary)
• Ontology classes (GO, GRO; Dictionary)
• Event Certainty Assessment
  • Scale 1 to 6 (1: Negation, 6: No doubt)
• Species (LINNAEUS)
(to consumer)
https://github.com/JULIELab/, Hahn & Matthies et al., LREC 2016
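
The idea of chaining analysis engines that each add annotations to a shared document object can be illustrated with a toy pipeline. Nothing here is JCoRe code: the dict-based "CAS", the two engines and `run_pipeline` are assumptions for the sketch, and the dictionary contains only the two IDs that appear on the later "Concept IDs" slide.

```python
import re

# Toy dictionary (assumption); the two IDs are taken from the "Concept IDs" slide.
DICTIONARY = {"mtor": "ncbigene:2475", "human": "ncbitax:9606"}


def tokenizer(cas):
    """Add word tokens with character offsets."""
    cas["tokens"] = [
        {"text": m.group(), "begin": m.start(), "end": m.end()}
        for m in re.finditer(r"\w+", cas["text"])
    ]


def dictionary_tagger(cas):
    """Add a concept annotation for every token found in the dictionary."""
    cas["concepts"] = [
        {"id": DICTIONARY[t["text"].lower()], "begin": t["begin"], "end": t["end"]}
        for t in cas["tokens"]
        if t["text"].lower() in DICTIONARY
    ]


def run_pipeline(text, engines):
    """Pass one document through the engines in order; each engine mutates it."""
    cas = {"text": text}
    for engine in engines:
        engine(cas)
    return cas


cas = run_pipeline("mTOR signaling in human cells", [tokenizer, dictionary_tagger])
print(cas["concepts"])
```

Later engines rely on annotations written by earlier ones (here the tagger needs tokens), which is why the order of engines matters.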
SeMedico UIMA JCoRe Pipeline II
ElasticSearch CAS Consumer
• Transforms CAS into preanalyzed JSON document
• Transformation configurable via API
• JULIE Lab ES plugin required
[Figure: from analysis pipeline → CAS (title, abstract, species, genes, events) → transformation API → preanalyzed JSON → http → ElasticSearch]
preanalyzed JSON:
{
  "title": {…}, "abstract": {…}, "authors": {…}, "…": {…}
}
Full texts from PubMed Central
• SeMedico integrates the open access subset of PMC
• Use a specific reader from JCoRe: jcore-pmc-reader
• The rest of the analysis is basically the same
• But:
Matthies, Franz, & Hahn, Udo (2017). Scholarly information extraction is going to make a quantum leap with PubMed Central (PMC)® — But moving from abstracts to full texts seems harder than expected. In: MedInfo 2017: Precision Healthcare through Informatics – Proceedings of the 16th World Congress on Medical and Health Informatics. Hangzhou, China, 21-25 August 2017, 521-525.
Concept Database I

Name                              Description                                          Number of Concepts
Medical Subject Headings (MeSH)   Biomedical vocabulary, multi-hierarchy               26K
MeSH Supplementary Concepts       Chemicals, proteins etc. connected to MeSH           150K
NCBI Gene                         Gene database                                        650K (in SeMedico)
NCBI Taxonomy                     Taxonomical classification of species                1.1M
Gene Ontology (GO)                Ontology about gene products and related processes   50K
Gene Regulation Ontology (GRO)    Ontology about gene regulation processes             507
Concept Database II
• Concepts are arranged taxonomically
  • Squamous Cell Carcinoma IS-A Carcinoma
• Neo4j is a graph database
  • Terminologies and arbitrary relations between concepts can be modeled explicitly
  • Appropriate query language:
    • "Get descendants of concept"
    • "Compute shortest path between two concepts"
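
To make the two example queries concrete, here is a plain-Python sketch over a tiny in-memory IS-A graph. It does not use Neo4j or its query language; the functions `descendants` and `shortest_path` are invented for the example, and only the first edge comes from the slide (the other two are assumed for illustration).

```python
from collections import deque

# IS-A edges (child -> parents); only the first one is from the slide.
IS_A = {
    "Squamous Cell Carcinoma": ["Carcinoma"],
    "Adenocarcinoma": ["Carcinoma"],
    "Carcinoma": ["Neoplasm"],
}


def descendants(concept):
    """All concepts that reach `concept` by following IS-A edges upward."""
    children = {}
    for child, parents in IS_A.items():
        for parent in parents:
            children.setdefault(parent, []).append(child)
    found, queue = set(), deque([concept])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in found:
                found.add(child)
                queue.append(child)
    return found


def shortest_path(start, goal):
    """Breadth-first search over IS-A edges, ignoring edge direction."""
    neighbours = {}
    for child, parents in IS_A.items():
        for parent in parents:
            neighbours.setdefault(child, set()).add(parent)
            neighbours.setdefault(parent, set()).add(child)
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(neighbours.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


print(sorted(descendants("Carcinoma")))
print(shortest_path("Squamous Cell Carcinoma", "Adenocarcinoma"))
```

A graph database answers the same questions without the application having to rebuild adjacency maps per query, which is the motivation the slide gives for modeling relations explicitly.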
Neo4j Example Graph
[Figure: example graph; recoverable labels: type1, type2, type3, type4, Tauopathies.]
Neo4j Concept Node Properties
Zooming Out
Concept IDs
[Figure: data flow. Concept database (tid2341, tid914, tid42); CAS (abstract; species: ncbitax:9606; genes: mTOR, ncbigene:2475) → transformation API → JSON → ElasticSearch → SeMedico web application (Java servlet).]
JSON:
{
  "abstract": {["human", "tid914", "mTOR", "tid42"]}
}
query: "match: tid914"
facet "tid42": {"name": "mTOR", "synonym": "FRAP", "description": "…"}
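
The slide's example can be read as: source IDs found by the pipeline are mapped to SeMedico concept IDs, which are indexed next to the words, and facet entries are resolved back to names through the concept database. A minimal sketch of that reading follows; the lookup tables and function names are assumptions, and the ID pairings are inferred from the slide's JSON and facet example.

```python
# Source ID -> concept ID, as inferred from the slide's example (assumption).
SOURCE_TO_TID = {"ncbitax:9606": "tid914", "ncbigene:2475": "tid42"}

# Stand-in for the concept database entry shown in the facet example.
CONCEPTS = {"tid42": {"name": "mTOR", "synonym": "FRAP"}}


def index_terms(tokens, annotations):
    """Emit each token followed by the concept ID of its annotation, if any."""
    terms = []
    for token in tokens:
        terms.append(token)
        source_id = annotations.get(token)
        if source_id in SOURCE_TO_TID:
            terms.append(SOURCE_TO_TID[source_id])
    return terms


def facet_label(tid):
    """Look up the display name for a concept ID returned in a facet."""
    return CONCEPTS[tid]["name"]


terms = index_terms(
    ["human", "mTOR"],
    {"human": "ncbitax:9606", "mTOR": "ncbigene:2475"},
)
print(terms)
print(facet_label("tid42"))
```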
ElasticSearch I
• Manages Lucene index
• Seamless index updates, no downtime
• Easy to use index distribution model
• Full text search
• Faceting
• Highlighting
ElasticSearch II
• Lucene generates index terms via "text analysis"
  – Tokenization, case folding, synonym enrichment, stemming
  – ElasticSearch does the same on sent document text
• How to integrate UIMA?
• First idea: Create a Lucene UIMA analyzer, but
  – Moves (a lot of!) processing requirements into the ElasticSearch cluster
  – Requires loading dictionaries and machine learning models
  – Memory that is lost to Lucene and ElasticSearch
  – Overall: Diminishes search performance
ElasticSearch III
• JULIE Lab ElasticSearch plugin to exactly specify index terms without ES-internal analysis
  – https://github.com/JULIELab/elasticsearch-mapper-preanalyzed
• Employs the JSON format created for the Solr JsonPreAnalyzedParser
  – https://lucene.apache.org/solr/guide/6_6/working-with-external-files-and-processes.html#WorkingwithExternalFilesandProcesses-JsonPreAnalyzedParser
• Created by a (currently) JULIE Lab-internal CAS consumer
ElasticSearch IV Preanalyzed Format
{"v": "1",
 "str": "Immunohistochemistry performed to evaluate the expression of phosphorylated mTOR (p-mTOR), phosphorylated p70S6K (p-p70S6K), phosphorylated 4E-binding protein 1 (p-4E-BP1), and Ki-67 using 105 surgically resected ESCC correlated with treatment outcome.",
 "tokens": [
   {"t": "immunohistochemistry", "s": 0, "e": 20, "i": 1},
   {"t": "tid94702", "s": 0, "e": 20, "i": 0},
   {"t": "perform", "s": 21, "e": 30, "i": 1},
   {"t": "evaluat", "s": 34, "e": 42, "i": 1},
   {"t": "event", "s": 34, "e": 42, "i": 0},
   …
 ]
}
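
A document in this format could be assembled along the lines below. This is a sketch, not the JULIE Lab CAS consumer: the function `preanalyzed` and its tuple layout are invented, and it assumes that "s"/"e" are character offsets and "i" is the position increment, so that a term with "i": 0 is stacked on the position of the preceding term.

```python
import json


def preanalyzed(text, tokens, version="1"):
    """Build a preanalyzed field value.

    `tokens` holds (term, start, end, stacked) tuples; stacked terms get
    position increment 0 so they share the position of the preceding term.
    """
    return {
        "v": version,
        "str": text,
        "tokens": [
            {"t": term, "s": start, "e": end, "i": 0 if stacked else 1}
            for term, start, end, stacked in tokens
        ],
    }


text = "Immunohistochemistry performed to evaluate the expression of phosphorylated mTOR"
doc = preanalyzed(
    text,
    [
        ("immunohistochemistry", 0, 20, False),
        ("tid94702", 0, 20, True),   # concept ID stacked on the word
        ("perform", 21, 30, False),
        ("evaluat", 34, 42, False),
        ("event", 34, 42, True),     # event marker stacked on the trigger word
    ],
)
print(json.dumps(doc["tokens"][1]))
```

Stacking a concept ID on the same position as the surface word lets a search for either term hit and highlight the same stretch of text.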
ElasticSearch V Simple Query
{
  "query": {
    "bool": {
      "must": [
        { "match": { "abstracttext": { "query": "cancer" }}},
        { "nested": {
            "path": "events",
            "inner_hits": {},
            "query": {
              "bool": {
                "must": [{ "match": { "events.allarguments": "mtor" }}],
                "filter": { "range": { "events.likelihood": { "lte": 5 }}}
              }
            }
        }}
      ]
    }
  },
  "fields": [ "abstracttext", "title" ]
}
ElasticSearch VI Concept Query
{
  "query": {
    "bool": {
      "must": [
        { "match": { "abstracttext": { "query": "tid52310" }}},
        { "nested": {
            "path": "events",
            "inner_hits": {},
            "query": {
              "bool": {
                "must": [{ "match": { "events.allarguments": "tid42" }}],
                "filter": { "range": { "events.likelihood": { "lte": 5 }}}
              }
            }
        }}
      ]
    }
  },
  "fields": [ "abstracttext", "title" ]
}
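
The simple query and the concept query have the same shape and differ only in the terms they search for. A small builder makes that explicit; the function name `event_query` and its parameters are assumptions for the sketch, while the field names and the threshold of 5 are copied from the two query slides.

```python
import json


def event_query(text_term, event_argument, max_likelihood=5):
    """Build the query body: abstract match plus nested event match and filter."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {"abstracttext": {"query": text_term}}},
                    {
                        "nested": {
                            "path": "events",
                            "inner_hits": {},
                            "query": {
                                "bool": {
                                    "must": [
                                        {"match": {"events.allarguments": event_argument}}
                                    ],
                                    "filter": {
                                        "range": {
                                            "events.likelihood": {"lte": max_likelihood}
                                        }
                                    },
                                }
                            },
                        }
                    },
                ]
            }
        },
        "fields": ["abstracttext", "title"],
    }


simple = event_query("cancer", "mtor")       # plain words
concept = event_query("tid52310", "tid42")   # concept IDs as search terms
print(json.dumps(concept, indent=2))
```

Because concept IDs were indexed as ordinary terms, the concept query needs no special query type; it is a normal match on a different token.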
ElasticSearch VII Highlighting
References
• Semedico
  – Faessler, Erik, & Hahn, Udo (2017). SEMEDICO: A comprehensive semantic search engine for the life sciences. In: ACL 2017 – Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Vancouver, British Columbia, Canada, August 1, 2017, 91–96.
• GeNo
  – Wermter, Joachim, & Tomanek, Katrin, & Hahn, Udo (2009). High-performance gene name normalization with GeNo. In: Bioinformatics, 25, 815–821.
• BioSem
  – Bui, Q., Mulligen, E. van, Campos, D., & Kors, J. (2013). A Fast Rule-based Approach for Biomedical Event Extraction. In: Proceedings of the BioNLP 2013 Shared Task Workshop (pp. 104–108). Sofia, Bulgaria: Association for Computational Linguistics.
• Certainty Assessment
  – Engelmann, Christine, & Hahn, Udo (2014). An empirically grounded approach to extend the linguistic coverage and lexical diversity of verbal probabilities. In: CogSci 2014 – Proceedings of the 36th Annual Cognitive Science Conference. Cognitive Science Meets Artificial Intelligence: Human and Artificial Agents in Interactive Contexts. Québec City, Québec, Canada, July 23-26, 2014, 451–456.
• JCoRe
  – Hahn, Udo, & Matthies, Franz, & Faessler, Erik, & Hellrich, Johannes (2016). UIMA-based JCoRe 2.0 goes GitHub and Maven Central: State-of-the-art software resource engineering and distribution of NLP pipelines. In: LREC 2016 – Proceedings of the 10th International Conference on Language Resources and Evaluation. Portorož, Slovenia, 23-28 May 2016, 2502–2509.
Conclusion
[Figure: SeMedico system overview diagram (repeated).]
http://www.semedico.org/