Maschinelle Sprachverarbeitung
Ulf Leser
Ulf Leser: Maschinelle Sprachverarbeitung 2
Case Report
Case report from Univ. Hospital Geneva, thanks to Christian Meisel, Roche
• Patient with pneumonia and cough• Normal dosage of codeine• Patient not responding any more at day 4• What‘s going on?
– PubMed „Codeine intoxication“ -> 170 abstracts– Aren’t there better ways?
Ulf Leser: Maschinelle Sprachverarbeitung 3
Using AliBaba
Ulf Leser: Maschinelle Sprachverarbeitung 4
What we Need to doZ-100 is an arabinomannan extracted from Mycobacterium tuberculosis that has various immunomodulatory activities, such as the induction of interleukin 12, interferon gamma (IFN-gamma) and beta-chemokines. The effects of Z-100 on human immunodeficiency virus type 1 (HIV-1) replication in human monocyte-derived macrophages (MDMs) are investigated in this paper. In MDMs, Z-100 markedly suppressed the replication of not only macrophage-tropic (M-tropic) HIV-1 strain (HIV-1JR-CSF), but also HIV-1 pseudotypes that possessed amphotropic Moloney murine leukemia virus or vesicular stomatitis virus G envelopes. Z-100 was found to inhibit HIV-1 expression, even when added 24 h after infection. In addition, it substantially inhibited the expression of the pNL43lucDeltaenv vector (in which the env gene is defective and the nef gene is replaced with the firefly luciferase gene) when this vector was transfected directly into MDMs. These findings suggest that Z-100 inhibits virus replication, mainly at HIV-1 transcription. However, Z-100 also downregulated expression of the cell surface receptors CD4 and CCR5 in MDMs, suggesting some inhibitory effect on HIV-1 entry. Further experiments revealed that Z-100 induced IFN-beta production in these cells, resulting in induction of the 16-kDa CCAAT/enhancer binding protein (C/EBP) beta transcription factor that represses HIV-1 long terminal repeat transcription. These effects were alleviated by SB 203580, a specific inhibitor of p38 mitogen-activated protein kinases (MAPK), indicating that the p38 MAPK signalling pathway was involved in Z-100-induced repression of HIV-1 replication in MDMs. These findings suggest that Z-100 might be a useful immunomodulator for control of HIV-1 infection.
Ulf Leser: Maschinelle Sprachverarbeitung 5
Find EntitiesZ-100 is an arabinomannan extracted from Mycobacterium tuberculosis that has various immunomodulatory activities, such as the induction of interleukin 12, interferon gamma(IFN-gamma) and beta-chemokines. The effects of Z-100 on human immunodeficiency virus type 1 (HIV-1) replication in human monocyte-derived macrophages (MDMs) are investigated in this paper. In MDMs, Z-100 markedly suppressed the replication of not only macrophage-tropic (M-tropic) HIV-1 strain (HIV-1JR-CSF), but also HIV-1 pseudotypes that possessed amphotropic Moloney murine leukemia virus or vesicular stomatitis virus Genvelopes. Z-100 was found to inhibit HIV-1 expression, even when added 24 h after infection. In addition, it substantially inhibited the expression of the pNL43lucDeltaenv vector (in which the env gene is defective and the nef gene is replaced with the firefly luciferase gene) when this vector was transfected directly into MDMs. These findings suggest that Z-100 inhibits virus replication, mainly at HIV-1 transcription. However, Z-100 also downregulated expression of the cell surface receptors CD4 and CCR5 in MDMs, suggesting some inhibitory effect on HIV-1 entry. Further experiments revealed that Z-100induced IFN-beta production in these cells, resulting in induction of the 16-kDa CCAAT/enhancer binding protein (C/EBP) beta transcription factor that represses HIV-1 long terminal repeat transcription. These effects were alleviated by SB 203580, a specific inhibitor of p38 mitogen-activated protein kinases (MAPK), indicating that the p38 MAPK signalling pathway was involved in Z-100-induced repression of HIV-1replication in MDMs. These findings suggest that Z-100 might be a useful immunomodulator for control of HIV-1 infection.
Ulf Leser: Maschinelle Sprachverarbeitung 6
Find RelationshipsZ-100 is an arabinomannan extracted from Mycobacterium tuberculosis that has various immunomodulatory activities, such as the induction of interleukin 12, interferon gamma(IFN-gamma) and beta-chemokines. The effects of Z-100 on human immunodeficiency virus type 1 (HIV-1) replication in human monocyte-derived macrophages (MDMs) are investigated in this paper. In MDMs, Z-100 markedly suppressed the replication of not only macrophage-tropic (M-tropic) HIV-1 strain (HIV-1JR-CSF), but also HIV-1 pseudotypes that possessed amphotropic Moloney murine leukemia virus or vesicular stomatitis virus Genvelopes. Z-100 was found to inhibit HIV-1 expression, even when added 24 h after infection. In addition, it substantially inhibited the expression of the pNL43lucDeltaenv vector (in which the env gene is defective and the nef gene is replaced with the firefly luciferase gene) when this vector was transfected directly into MDMs. These findings suggest that Z-100 inhibits virus replication, mainly at HIV-1 transcription. However, Z-100 also downregulated expression of the cell surface receptors CD4 and CCR5 in MDMs, suggesting some inhibitory effect on HIV-1 entry. Further experiments revealed that Z-100induced IFN-beta production in these cells, resulting in induction of the 16-kDa CCAAT/enhancer binding protein (C/EBP) beta transcription factor that represses HIV-1 long terminal repeat transcription. These effects were alleviated by SB 203580, a specific inhibitor of p38 mitogen-activated protein kinases (MAPK), indicating that the p38 MAPK signalling pathway was involved in Z-100-induced repression of HIV-1replication in MDMs. These findings suggest that Z-100 might be a useful immunomodulator for control of HIV-1 infection.
inhibit
induction
Ulf Leser: Maschinelle Sprachverarbeitung 7
Detecting Gene Names
The human T cell leukemia lymphotropic virus type 1 Tax protein represses MyoD-dependent transcription by inhibiting MyoD-binding to the KIX domain of p300.
Ulf Leser: Maschinelle Sprachverarbeitung 8
Detecting Gene Names (Leser & Hakenberg, 2005)
The human T cell leukemia lymphotropic virus type 1 Tax protein represses MyoD-dependent transcription by inhibiting MyoD-binding to the KIX domain of p300.
• Also: hedgehog, soul, the, white, …• State-of-the-art methods reach ~85% in NEN
– Plus 10% for less stringent boundary definitions– Large dicts, CRF, species classification, large background corpus, …– That’s about as high as inter-annotator agreement
• Different performance for other classes (mutations, diseases, functional terms, cell lines, …)
Ulf Leser: Maschinelle Sprachverarbeitung 9
Typical IE-Workflow
Document Retrieval
Text Preprocessing
Linguistic Annotation
Named Entity Recognition
Named Entity Normalization
Relationship Extraction
Text Mining
NLP
InformationRetrieval
Ulf Leser: Maschinelle Sprachverarbeitung 10
Applications in Business Intelligence
• What problems are most frequently reported by our customers? Which products, product lines, parts etc.? – Mails, knowledge bases, repair reports, call centers, …
• How does our customer satisfaction change? – Tone in communication? – Reports in Blogs, Twitter, …?
• Can we improve customer self service?– “Entity Search”– Precise routing and prioritization of requests
• See, e.g., Lang, A. and Reinwald, B. (2008). "Nutzung unstrukturierter Datenfür Business Intelligence." Datenbank Spektrum 25.
Ulf Leser: Maschinelle Sprachverarbeitung 11
Some Recent Students Work
• Can we predict the results of elections using Twitter?– Tweet classification, sentiment detection
• What aspects of mobile apps are good / bad?– Aspect extraction, topic modelling, sentiment detection
• Can we find texts talking about the biology of stem cells?– Text classification, q-gram models
• Can we predict the success of a drug based on papers?– Named entity recognition, time series analysis, classification
• Can we semantically cluster tables from the web / papers?– Word similarity, text clustering
• Can active learning help for finding gene relationships?
Ulf Leser: Maschinelle Sprachverarbeitung 12
Modul Maschinelle Sprachverarbeitung
• Vorlesung 2 SWS• Übung 2 SWS
– Starts tomorrow, 18.10.2018• Slides are English
• ContactUlf LeserRaum: IV.401Tel: (030) 2093 – 3902eMail: leser (..) informatik . hu-berlin . de
Ulf Leser: Maschinelle Sprachverarbeitung 13
Literatur
• Highly recommended– Manning, C.D., Schütze, H. (1999). „Foundations of Statistical
Natural Language Processing”, MIT Press.• Other
– Weiss, Indurkhya, Zhang: „Fundamentals of Predictive Text Mining“, Springer, 2016
– Aggarwal, Zhai: „ Mining Text Data“, Springer, 2012 – Manning, C. D., Raghavan, P. and Schütze, H. (2008). “Introduction
to Information Retrieval", Cambridge University Press.– Lemnitzer, L. and Zinsmeister, H. (2010). "Korpuslinguistik - Eine
Einführung", narr Studienbücher.– Original papers
Ulf Leser: Maschinelle Sprachverarbeitung 14
Web
Ulf Leser: Maschinelle Sprachverarbeitung 15
Questions?
Ulf Leser: Maschinelle Sprachverarbeitung 16
Questions
• Diplominformatiker?• IBI / Wirtschaftsinformatik? • Bachelor?• Semester?
• Special expectations, experiences, questions?
Ulf Leser: Maschinelle Sprachverarbeitung 17
Feedback MaschSprach 2017/2018 (n=13)
Ulf Leser: Maschinelle Sprachverarbeitung 18
Free comments
Ulf Leser: Maschinelle Sprachverarbeitung 19
Was ich ändern wollte
• Rausnehmen: Scalable IE• Reinnehmen: Deep learning and word embeddings
• Was ich nicht gemacht habe: Auf 3+2 ausbauen– Question Answering, Opinion Mining, Topic modelling, …
Ulf Leser: Maschinelle Sprachverarbeitung 20
Content of this Lecture
• A very short primer on Information Retrieval• A very short primer on Natural Language Processing• A very short primer on Text Mining
Ulf Leser: Maschinelle Sprachverarbeitung 21
Phases in Text Mining
Document Retrieval
Text Preprocessing
Linguistic Annotation
Named Entity Recognition
Named Entity Normalization
Relationship Extraction
Text Mining
NLP
InformationRetrieval
Text Classification
Text Clustering
Ulf Leser: Maschinelle Sprachverarbeitung 22
Information Retrieval (aka “Search”)
• Find all documents which contain the following words• „Leading the user to those documents that will best enable
him/her to satisfy his/her need for information“[Robertson 1981]
– A user wants to know something– The user needs to tell the machine what he wants to know: query– Posing exact queries is difficult: room for interpretation– Machine interprets query to compute the (hopefully) best answer– Goodness of answer depends on original intention of user, not on
the query (relevance)
Ulf Leser: Maschinelle Sprachverarbeitung 23
Difference to Database Queries
• Queries: Formal language versus natural language• Exactly defined result versus loosely described relevance• Result set versus ranked list of results• DB: Posing the right query is completely left to the user• IR: Understanding the query is a problem of the software
Ulf Leser: Maschinelle Sprachverarbeitung 24
Natural Language Processing
• Making natural language text accessible to a computer– Find semantic units: words, tokens, phrases, clauses, sentences
• “Implementing the C4.5 algorithm with languages such as DOT.NET, Java etc. is not as simple as one might think …”
• “The α(3)-helicase-5’ mRNA is …”– Find grammatical role of words– Find grammatical structure of sentences– Find syntactic relationships between entities– Draw semantic inferences from a text– …
• “Understand” the text
Ulf Leser: Maschinelle Sprachverarbeitung 25
MyoD
p300KIX
PAX1
has_domain
binds_toinhibits_binding
has_transcriptional_activity_when_bound_by_MyoD
repressesreason
„The PAX1 protein represses MyoD-dependent transcription by inhibiting MyoD-binding to the KIX
domain of p300.“
Understanding Text is Difficult (even for us)
Ulf Leser: Maschinelle Sprachverarbeitung 26
Part-Of-Speech Tagging
• Part-of-Speech (POS) is the grammatical class of a word– Adverb, verb, adjective, … – Verb: Tense, number, …– Noun: Gender, case, number, …
• POS tagging: Given a text, assign each word its POS– “Does/VBZ flight/NN LH750/NNP serve/VB dinner/NN ?”– Caveat: There are different tag sets
• POS tags are very useful for many tasks– NER: names of entities should be nouns
• Methods: Maximum Entropy, Hidden Markov Models
Ulf Leser: Maschinelle Sprachverarbeitung 27
Text Mining
• Text Mining = “Data Mining on text”• Text Mining = “Statistical NLP minus parsing”
[Altmann/Schütze]
• Typical tasks– Document classification (route emails to the right operator)– Document clustering (group search results by topics)– Information extraction (find all celebrities and their partners)– Question Answering (what will be the weather tomorrow?)
Ulf Leser: Maschinelle Sprachverarbeitung 28
Clinical Entity Recognition for ICD-9 Code Prediction in Clinical
Discharge Summaries
Ulf Leser: Maschinelle Sprachverarbeitung 29
- Clinical data is often stored in textual form- Reports contain valuable information
- Diseases, symptoms, treatments, drugs, dosage, family history, lab measurements, images/radiology, progression, …
- Many of these not available in structured form- Especially important: Disease (symptoms, phenotypes)
- For accounting- For decision support
Ulf Leser: Maschinelle Sprachverarbeitung 30
• DATE OF ADMISSION: MM/DD/YYYY• DATE OF DISCHARGE: MM/DD/YYYY• DISCHARGE DIAGNOSES:• 1. Vasovagal syncope, status post fall.
2. Traumatic arthritis, right knee.3. Hypertension.6. History of chronic obstructive pulmonary disease.
• BRIEF HISTORY: The patient is an (XX)-year-old female with history of previous stroke; hypertension; COPD, stable; renal carcinoma; presenting after a fall and possible syncope. While walking, she accidentally fell to her knees and did hit her head on the ground, near her left eye. Her fall was not observed, but the patient does not profess any loss of consciousness, recalling the entire event. The patient does have a history of previous falls, one of which resulted in a hip fracture. She has had physical therapy and recovered completely from that…
• DIAGNOSTIC STUDIES: All x-rays including left foot, right knee, left shoulder and cervical spine showed no acute fractures. The left shoulder did show old healed left humeral head and neck fracture with baseline anterior dislocation. …
• HOSPITAL COURSE:• 1. Fall: The patient was admitted and ruled out for syncopal episode. Echocardiogram was normal,
and when the patient was able, …• 2. Status post fall with trauma: The patient was unable to walk normally secondary to traumatic
injury of her knee, causing significant pain and swelling. Although a scan showed no acute fractures, …
Ulf Leser: Maschinelle Sprachverarbeitung 31
Goals and Methods
• Predict discharge diagnosis based on clinical texts• Approach 1: Recognize diseases in text (NER-based
approach)
• Approach 2: Predict disease based on (entire, partial) text (classification-based approach)
Extract clinical entities
Map to ICD-9-CM
Compare to assigned
codes
Extract clinical entities
Transform into vector
space
Train classifiers per code
Determine prediction
quality
Ulf Leser: Maschinelle Sprachverarbeitung 32
ICD-9-CM Root
001-139Infectious and parasitic diseases
140-239Neoplasms
240-…
001-009Intestinal infect.
dis.010-018
Tuberculosis019-…
001 Cholera 002 Typhoid and parat. fevers 003 …
001.0 Cholera due to vibrio cholerae001.1 …
002.0 Typhoid fever002.1 Paratyphoid fever A003.2 …
Disease Names: ICD-9
Ulf Leser: Maschinelle Sprachverarbeitung 33
MetaMap
cTAKES
HITEx
NCBO
DNorm
UMLS SPECIALIST
UIMAOpenNLP
GATEBANNER
UMLS
SNOMED
Any Ontology
MEDIC
Medical NER Tools Evaluated
Ulf Leser: Maschinelle Sprachverarbeitung 34
635,
80
18,5
0
217,
20
294,
90
642,
30
191,
40
18,5
0 61,0
0 119,
70 152,
70
59,5
0
33,3
0
4,80 35
,10
32,9
0
36,7
0
10,9
0
CTAKES DNORM HITEX METAMAP NCBO MANUAL
Total Disease Mentions ICD-9 Mapped
Number of Extracted Concepts (Per Document)
Ulf Leser: Maschinelle Sprachverarbeitung 35
Issues (Typical)
• Hierarchical classification – which level of ICD-9?– Higher levels: More training data, few classes, high accuracy
But: Little value– Lower levels: Little training data, many classes, low accuracy
But: High value• Mapping between ontologies
– Concepts with different syntax & synonyms– Concepts at different granularities– Conflicting subsumption relationships– Diverging coverage– …
ICD-9-CM
SNOMED-CT(Code)
UMLS(CUI)
Other Ontologies
MetaMap HITExDNorm cTAKES
Ulf Leser: Maschinelle Sprachverarbeitung 36
0,17
3
0,17
9
0,09
2
0,18
6
0,19
0,17
4
0,71 0,
74
0,62
0,68
0,64
0,73
0,21
9
0,36
6
0,20
3
0,39
7
0,36
9
0,36
2
0,54
0,53
0,44
0,54
0,51 0,52
0,19
3 0,24
0,12
7
0,25
4
0,25
1
0,23
5
0,62
0,62
0,52
0,6
0,57 0,
61
Precision Recall
Results / Evaluation - 50 k discharge summaries- 7 k classes (diagnosis codes)
Ulf Leser: Maschinelle Sprachverarbeitung 37
0,17
3
0,17
9
0,09
2
0,18
6
0,19
0,17
4
0,71 0,
74
0,62
0,68
0,64
0,73
0,21
9
0,36
6
0,20
3
0,39
7
0,36
9
0,36
2
0,54
0,53
0,44
0,54
0,51 0,52
0,19
3 0,24
0,12
7
0,25
4
0,25
1
0,23
5
0,62
0,62
0,52
0,6
0,57 0,
61
Precision Recall
Results / Evaluation - Baseline: 10 k top concepts 7 k- Train/test split 90% / 10%
Ulf Leser: Maschinelle Sprachverarbeitung 38
Error Analysis
• False positive code assignments• Mapping errors• Contextual errors• Negation / temporal status
• False negative code assignments• Obvious codes not tagged in gold standard (hypertension) • Heavy use of abbreviations and acronyms• Missing sections• Missing mention
Ulf Leser: Maschinelle Sprachverarbeitung 39
What we will not cover
• Linguistic analysis beyond parsing– Pragmatic structure, co-reference resolution, …
• Spoken language• Machine translation• Cross-language search / analysis• User interfaces• Special classification problems: Sentiment analysis etc.• Question answering • Topic modelling• …