Text Analytics for Semantic Computing



Text Analytics for Semantic Computing - the good, the bad and the ugly. Tutorial presentation at ICSC 2008: http://icsc.eecs.uci.edu/tutorial1.html


<ul><li>1. CARTIC RAMAKRISHNAN, MEENAKSHI NAGARAJAN, AMIT SHETH</li></ul> <p>2. We have used material from several popular books, papers, course notes and presentations made by experts in this area. We have provided all references to the best of our knowledge. This list, however, serves only as a pointer to work in this area and is by no means a comprehensive resource. 3. </p> <ul><li>KNO.E.SIS </li></ul> <ul><li><ul><li>knoesis.org </li></ul></li></ul> <ul><li>Director: Amit Sheth </li></ul> <ul><li><ul><li>knoesis.wright.edu/amit/</li></ul></li></ul> <ul><li>Graduate Students: </li></ul> <ul><li><ul><li>Meena Nagarajan </li></ul></li></ul> <ul><li><ul><li><ul><li>knoesis.wright.edu/students/meena/ </li></ul></li></ul></li></ul> <ul><li><ul><li><ul><li>[email_address] </li></ul></li></ul></li></ul> <ul><li><ul><li>Cartic Ramakrishnan </li></ul></li></ul> <ul><li><ul><li><ul><li>knoesis.wright.edu/students/cartic/ </li></ul></li></ul></li></ul> <ul><li><ul><li><ul><li>[email_address] </li></ul></li></ul></li></ul> <p>4. An Overview of Empirical Natural Language Processing, Eric Brill, Raymond J. Mooney. [Pipeline: Word Sequence → Syntactic Parser → Parse Tree → Semantic Analyzer → Literal Meaning → Discourse Analyzer → Meaning] 5. </p> <ul><li>Traditional (Rationalist) Natural Language Processing </li></ul> <ul><li><ul><li>Main insight: Using rule-based representations of knowledge and grammar (hand-coded) for language study </li></ul></li></ul> <p>[Figure: KB + Text → NLP System → Analysis] 6. </p> <ul><li>Empirical Natural Language Processing </li></ul> <ul><li><ul><li>Main insight: Using the distributional environment of a word as a tool for language study </li></ul></li></ul> <p>[Figure: KB + Text → NLP System → Analysis; Corpus → Learning System → KB] 7. </p> <ul><li>The two approaches are not incompatible. Several systems use both.</li></ul> <ul><li>Many empirical systems make use of manually created domain knowledge. 
</li></ul> <ul><li>Many empirical systems use representations of rationalist methods, replacing hand-coded rules with rules acquired from data. </li></ul> <p>8. </p> <ul><li>Several algorithms and methods exist for each task, in both rationalist and empiricist approaches </li></ul> <ul><li>What does an NL processing task typically entail? </li></ul> <ul><li>How do systems, applications and tasks perform these tasks? </li></ul> <ul><li>Syntax: POS Tagging, Parsing</li></ul> <ul><li>Semantics: Meaning of words, using context/domain knowledge to enhance tasks </li></ul> <p>9. </p> <ul><li>Finding more about what we already know </li></ul> <ul><li><ul><li>Ex. patterns that characterize known information </li></ul></li></ul> <ul><li><ul><li>The search/browse, or "finding a needle in a haystack", paradigm </li></ul></li></ul> <ul><li>Discovering what we did not know </li></ul> <ul><li><ul><li>Deriving new information from data </li></ul></li></ul> <ul><li><ul><li><ul><li>Ex. previously unknown relationships between known entities </li></ul></li></ul></li></ul> <ul><li><ul><li>The "extracting ore from rock" paradigm </li></ul></li></ul> <p>10. </p> <ul><li>Information Extraction: techniques that operate directly on the text input </li></ul> <ul><li><ul><li><ul><li>this includes entity, relationship and event detection</li></ul></li></ul></li></ul> <ul><li>Inferring new links and paths between key entities </li></ul> <ul><li><ul><li><ul><li>sophisticated representations for information content, beyond the "bag-of-words" representations used by IR systems </li></ul></li></ul></li></ul> <ul><li>Scenario detection techniques </li></ul> <ul><li><ul><li><ul><li>discover patterns of relationships between entities that signify some larger event, e.g. money laundering activities. </li></ul></li></ul></li></ul> <p>11. 
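The entity-detection operation listed above can be made concrete with a deliberately naive sketch. This is not any particular system's method, just the common "named entities begin with capital letters" heuristic turned into code:

```python
import re

def naive_entities(sentence):
    """Deliberately naive entity spotter: maximal runs of
    capitalized tokens, skipping the sentence-initial word."""
    tokens = sentence.split()
    entities, run = [], []
    for i, tok in enumerate(tokens):
        word = re.sub(r"[^\w]", "", tok)  # strip punctuation
        if i != 0 and word and word[0].isupper():
            run.append(word)
        else:
            if run:
                entities.append(" ".join(run))
            run = []
    if run:
        entities.append(" ".join(run))
    return entities

print(naive_entities(
    "Yesterday Thomas Edison invented the light bulb in Menlo Park."))
# ['Thomas Edison', 'Menlo Park']
```

Real extractors combine many such cues with learned models; this single heuristic misses lowercase entities and is fooled by any mid-sentence capitalized word.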
</p> <ul><li>They all make use of knowledge of language (exploiting syntax and structure to different extents) </li></ul> <ul><li><ul><li>Named entities begin with capital letters </li></ul></li></ul> <ul><li><ul><li>Morphology and meanings of words </li></ul></li></ul> <ul><li>They all use some fundamental text analysis operations </li></ul> <ul><li><ul><li>Pre-processing, parsing, chunking, part-of-speech tagging, lemmatization, tokenization </li></ul></li></ul> <ul><li>To some extent, they all deal with some language understanding challenges </li></ul> <ul><li><ul><li>Ambiguity, co-reference resolution, entity variations etc. </li></ul></li></ul> <ul><li>Use of a core subset of theoretical models and algorithms </li></ul> <ul><li><ul><li>State machines, rule systems, probabilistic models, vector-space models, classifiers, EM etc. </li></ul></li></ul> <p>12. </p> <ul><li>Analysis for both these goals has many similarities </li></ul> <ul><li>Finding entities </li></ul> <ul><li><ul><li>What are we interested in knowing more about? (the known) </li></ul></li></ul> <ul><li><ul><li>Is what we found something of interest? (the unknown) </li></ul></li></ul> <ul><li>Text is structured (to some extent) </li></ul> <ul><li><ul><li>Syntax, Structure </li></ul></li></ul> <ul><li><ul><li>Semantics, Pragmatics, Discourse </li></ul></li></ul> <ul><li>Text is noisy </li></ul> <ul><li><ul><li>Pre-processing is not optional in many cases </li></ul></li></ul> <ul><li><ul><li>Variations are not uncommon </li></ul></li></ul> <p>13. </p> <ul><li>Wikipedia-like text (GOOD) </li></ul> <ul><li><ul><li>"Thomas Edison invented the light bulb." </li></ul></li></ul> <ul><li>Scientific literature (BAD) </li></ul> <ul><li><ul><li> This MEK dependency was observed in BRAF mutant cells regardless of tissue lineage, and correlated with both downregulation of cyclin D1 protein expression and the induction of G1 arrest. 
</li></ul></li></ul> <ul><li>Text from Social Media (UGLY) </li></ul> <ul><li><ul><li>"heylooo..ano u must hear it loadsss bu your propa faabbb!!"</li></ul></li></ul> <p>14. </p> <ul><li>Illustrate analysis of and challenges posed by these three text types throughout the tutorial </li></ul> <p>15. </p> <ul><li>WHAT CAN TM DO FOR HARRY POTTER? </li></ul> <p>A bag of words 16. Discovering connections hidden in text UNDISCOVERED PUBLIC KNOWLEDGE 17. </p> <ul><li>Undiscovered Public Knowledge [Swanson] as mentioned in [Hearst99] </li></ul> <ul><li>Search no longer enough </li></ul> <ul><li><ul><li><ul><li>Information overload: prohibitively large number of hits </li></ul></li></ul></li></ul> <ul><li><ul><li><ul><li>UPK increases with increasing corpus size </li></ul></li></ul></li></ul> <ul><li>Manual analysis very tedious </li></ul> <ul><li>Examples [Hearst99] </li></ul> <ul><li><ul><li><ul><li>Example 1: Using Text to Form Hypotheses about Disease</li></ul></li></ul></li></ul> <ul><li><ul><li><ul><li>Example 2: Using Text to Uncover Social Impact </li></ul></li></ul></li></ul> <p>18. 
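The "bag of words" representation just mentioned fits in a few lines. A minimal sketch (Python stdlib only) showing exactly what it keeps and what it throws away:

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase, split into word tokens, count occurrences.
    Word order and syntax are discarded entirely."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

print(bag_of_words("Thomas Edison invented the light bulb."))
# Order is lost: these two sentences get identical bags.
print(bag_of_words("dog bites man") == bag_of_words("man bites dog"))  # True
```

The second print is the standard objection to bag-of-words: who bit whom is unrecoverable, which is why the richer representations discussed in this tutorial matter.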
</p> <ul><li><ul><li>Swanson's discoveries </li></ul></li></ul> <ul><li><ul><li><ul><li>Associations between Migraine and Magnesium [Hearst99] </li></ul></li></ul></li></ul> <ul><li><ul><li><ul><li><ul><li>stress is associated with migraines </li></ul></li></ul></li></ul></li></ul> <ul><li><ul><li><ul><li><ul><li>stress can lead to loss of magnesium </li></ul></li></ul></li></ul></li></ul> <ul><li><ul><li><ul><li><ul><li>calcium channel blockers prevent some migraines </li></ul></li></ul></li></ul></li></ul> <ul><li><ul><li><ul><li><ul><li>magnesium is a natural calcium channel blocker </li></ul></li></ul></li></ul></li></ul> <ul><li><ul><li><ul><li><ul><li>spreading cortical depression (SCD) is implicated in some migraines </li></ul></li></ul></li></ul></li></ul> <ul><li><ul><li><ul><li><ul><li>high levels of magnesium inhibit SCD</li></ul></li></ul></li></ul></li></ul> <ul><li><ul><li><ul><li><ul><li>migraine patients have high platelet aggregability </li></ul></li></ul></li></ul></li></ul> <ul><li><ul><li><ul><li><ul><li>magnesium can suppress platelet aggregability </li></ul></li></ul></li></ul></li></ul> <p>19. </p> <ul><li>Mining popularity from Social Media </li></ul> <ul><li><ul><li>Goal: Top X artists from MySpace artist comment pages </li></ul></li></ul> <ul><li><ul><li>Traditional Top X lists are derived from radio plays and CD sales. An attempt at creating a list closer to listeners' preferences </li></ul></li></ul> <ul><li>Mining positive, negative affect / sentiment </li></ul> <ul><li><ul><li>Slang, casual text necessitates transliteration </li></ul></li></ul> <ul><li><ul><li><ul><li>"you are so bad" == "you are good" </li></ul></li></ul></li></ul> <p>20. 
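Swanson's chain of findings follows an A-B-C pattern: migraine (A) is linked in one literature to intermediate terms (B) such as stress or SCD, which a separate literature links to magnesium (C); the A-C connection is the undiscovered public knowledge. A toy sketch of this literature-based discovery idea, with hand-made co-occurrence data (not real PubMed output):

```python
# Toy term co-occurrence data (hand-made for illustration):
# term -> set of terms it co-occurs with in some literature.
links = {
    "migraine": {"stress", "spreading cortical depression",
                 "platelet aggregability"},
    "stress": {"magnesium"},
    "spreading cortical depression": {"magnesium"},
    "platelet aggregability": {"magnesium"},
}

def abc_candidates(a, c, links):
    """B-terms bridging a and c: terms that co-occur with a in one
    literature and with c in another."""
    return sorted(b for b in links.get(a, ()) if c in links.get(b, ()))

print(abc_candidates("migraine", "magnesium", links))
# ['platelet aggregability', 'spreading cortical depression', 'stress']
```

Real systems score and rank the B-terms rather than just enumerating them, and must fight the ambiguity and variation problems discussed later in the tutorial.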
</p> <ul><li>Mining text to improve existing information access mechanisms </li></ul> <ul><li><ul><li>Search [Storylines] </li></ul></li></ul> <ul><li><ul><li>IR [QA systems] </li></ul></li></ul> <ul><li><ul><li>Browsing [Flamenco] </li></ul></li></ul> <ul><li>Mining text for</li></ul> <ul><li><ul><li>Discovery &amp; insight [Relationship Extraction] </li></ul></li></ul> <ul><li><ul><li>Creation of new knowledge</li></ul></li></ul> <ul><li><ul><li><ul><li>Ontology instance-base population </li></ul></li></ul></li></ul> <ul><li><ul><li><ul><li>Ontology schema learning </li></ul></li></ul></li></ul> <p>21. </p> <ul><li>Web search aims at optimizing for the top k (~10) hits </li></ul> <ul><li>Beyond the top 10</li></ul> <ul><li><ul><li>Pages expressing related latent views on a topic </li></ul></li></ul> <ul><li><ul><li>Possible reliable sources of additional information </li></ul></li></ul> <ul><li>Storylines in search results [3] </li></ul> <p>22. 23. </p> <ul><li>TextRunner [4]</li></ul> <ul><li><ul><li>A system that uses the results of dependency parses of sentences to train a Naïve Bayes classifier for Web-scale extraction of relationships </li></ul></li></ul> <ul><li><ul><li>Does not require parsing for extraction; parsing is only required for training </li></ul></li></ul> <ul><li><ul><li>Training features: POS tag sequences, whether the object is a proper noun, number of tokens to the right or left, etc.</li></ul></li></ul> <ul><li><ul><li>This system is able to respond to queries like "What did Thomas Edison invent?"</li></ul></li></ul> <p>24. </p> <ul><li>Castanet [1]</li></ul> <ul><li><ul><li>Semi-automatically builds faceted hierarchical metadata structures from text</li></ul></li></ul> <ul><li><ul><li>This is combined with Flamenco [2] to support faceted browsing of content </li></ul></li></ul> <p>25. [Pipeline: Documents → Select terms (WordNet) → Build core tree → Augment core tree → Remove top-level categories → Compress tree → Divide into facets] 26. Domains used to prune applicable senses in WordNet (e.g. dip). [Figure: WordNet hypernym paths: entity > substance, matter > nutriment > dessert > frozen dessert > ice cream > sundae; entity > substance, matter > nutriment > dessert > frozen dessert > sherbet, sorbet > sherbet.] 27. [Figure: UMLS Semantic Network fragment: Biologically active substance and Lipid related to Disease or Syndrome via affects, causes, complicates; MeSH instances Fish Oils and Raynaud's Disease with an unknown relationship (???) between them; PubMed: 9284 documents, 4733 documents, 5 documents.] 28. [Hearst92] Finding class instances; [Ramakrishnan et al. 08], [Nguyen07] Finding attribute-like relation instances 29. </p> <ul><li>Automatic acquisition of</li></ul> <ul><li><ul><li>Class Labels </li></ul></li></ul> <ul><li><ul><li>Class hierarchies </li></ul></li></ul> <ul><li><ul><li>Attributes</li></ul></li></ul> <ul><li><ul><li>Relationships</li></ul></li></ul> <ul><li><ul><li>Constraints</li></ul></li></ul> <ul><li><ul><li>Rules </li></ul></li></ul> <p>30. 31. </p> <ul><li>SYNTAX, SEMANTICS, STATISTICAL NLP, TOOLS, RESOURCES, GETTING STARTED </li></ul> <p>32. </p> <ul><li>[Hearst 97] Abstract concepts are difficult to represent</li></ul> <ul><li>Countless combinations of subtle, abstract relationships among concepts</li></ul> <ul><li>Many ways to represent similar concepts</li></ul> <ul><li><ul><li>E.g. space ship, flying saucer, UFO </li></ul></li></ul> <ul><li>Concepts are difficult to visualize</li></ul> <ul><li><ul><li>High dimensionality</li></ul></li></ul> <ul><li><ul><li>Tens or hundreds of thousands of features </li></ul></li></ul> <p>33. </p> <ul><li>Ambiguity (sense) </li></ul> <ul><li><ul><li>Keep that smile playin' (Smile is a track) </li></ul></li></ul> <ul><li><ul><li>Keep that smile on! </li></ul></li></ul> <ul><li>Variations (spellings, synonyms, complex forms) </li></ul> <ul><li><ul><li>Ileal Neoplasm vs. 
Adenomatous lesion of the Ileal wall </li></ul></li></ul> <ul><li>Coreference resolution </li></ul> <ul><li><ul><li> John wanted a copy of Netscape to run on his PC on the desk in his den; fortunately, his ISP included it in their startup package. </li></ul></li></ul> <p>34. </p> <ul><li>[Hearst 97] Highly redundant data </li></ul> <ul><li><ul><li>most of the methods count on this property </li></ul></li></ul> <ul><li>Just about any simple algorithm can get good results for simple tasks:</li></ul> <ul><li><ul><li>Pull out important phrases</li></ul></li></ul> <ul><li><ul><li>Find meaningfully related words</li></ul></li></ul> <ul><li><ul><li>Create some sort of summary from documents </li></ul></li></ul> <p>35. </p> <ul><li>Concerned with processing documents in natural language </li></ul> <ul><li><ul><li>Computational Linguistics, Information Retrieval, Machine Learning, Statistics, Information Theory, Data Mining etc. </li></ul></li></ul> <ul><li>TM generally concerned with practical applications</li></ul> <ul><li><ul><li>As opposed to lexical acquisition (for example) in CL </li></ul></li></ul> <p>36. </p> <ul><li>Computing Resources </li></ul> <ul><li><ul><li>Faster disks, CPUs, Networked Information </li></ul></li></ul> <ul><li>Data Resources </li></ul> <ul><li><ul><li>Large corpora, tree banks, lexical data for training and testing systems </li></ul></li></ul> <ul><li>Tools for analysis </li></ul> <ul><li><ul><li>NL analysis: taggers, parsers, noun-chunkers, tokenizers; Statistical Text Analysis: classifiers, NL model generators </li></ul></li></ul> <ul><li>Emphasis on applications and evaluation </li></ul> <ul><li><ul><li>Practical systems experimentally evaluated on real data </li></ul></li></ul> <p>37. </p> <ul><li>TYPES, WHAT THEY DO AND DON'T </li></ul> <p>38. 
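The claim that "just about any simple algorithm" gets good results on redundant text is easy to demonstrate: a plain frequency count over non-stopwords already surfaces the important terms. A minimal sketch (the stopword list is illustrative, not a standard one):

```python
import re
from collections import Counter

STOP = {"the", "a", "of", "and", "in", "to", "is", "from"}  # illustrative

def top_terms(text, k):
    """'Important' words = most frequent non-stopwords; this works
    only because natural-language text is highly redundant."""
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOP]
    return [w for w, _ in Counter(words).most_common(k)]

doc = ("Text mining extracts patterns from text. "
       "Mining text is possible because text is redundant.")
print(top_terms(doc, 2))  # ['text', 'mining']
```

The same counting idea, extended to n-grams and weighted by document frequency (tf-idf), underlies many of the practical phrase-extraction and summarization tools referenced in this section.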
</p> <ul><li>Co-occurrence based </li></ul> <ul><li>Rule / knowledge based </li></ul> <ul><li>Statistical / machine-learning based </li></ul> <ul><li>Systems typically use a combination of two or more </li></ul> <p>39. </p> <ul><li>Look for terms and posit relationships based on co-occurrence </li></ul> <ul><li><ul><li>"A word is known by the company it keeps" </li></ul></li></ul> <ul><li>Non-trivial, as they deal with problems of language expression variability and ambiguity </li></ul> <ul><li>Sometimes used as a simple baseline when evaluating more sophisticated systems </li></ul> <p>40. </p> <ul><li>Exploit real-world knowledge </li></ul> <ul><li><ul><li>About language </li></ul></li></ul> <ul><li><ul><li>About terms in the domain </li></ul></li></ul> <ul><li><ul><li>About relationships between terms in the domain </li></ul></li></ul> <ul><li><ul><li>About variations of terms we know of etc. </li></ul></li></ul> <ul><li>Spectrum of work from using hard-coded patterns (text or linguistic) for TM to using linguistic + semantic analysis for TM </li></ul> <ul><li>Harder to develop and maintain rules; comprehensiveness also an issue </li></ul> <p>41. </p> <ul><li>Using classifiers operating on text </li></ul> <ul><li><ul><li>At POS level, n-grams, parse trees </li></ul></li></ul> <ul><li><ul><li>Classifying documents, words, attributes/properties </li></ul></li></ul> <ul><li>Typically requires hand-tagged/labeled data </li></ul> <ul><li><ul><li>Supervised, semi-supervised approaches </li></ul></li></ul> <p>http://compbio.uchsc.edu/Hunter_lab/Cohen/Hunter_Cohen_Molecular_Cell.pdf 42. 
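Co-occurrence systems commonly score "the company a word keeps" with pointwise mutual information (PMI). A minimal sketch over a toy corpus, using sentence-level co-occurrence (real systems use context windows and much larger corpora):

```python
import math
from itertools import combinations
from collections import Counter

def pmi(sentences):
    """PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), with probabilities
    estimated from per-sentence word occurrence."""
    n = len(sentences)
    uni, pairs = Counter(), Counter()
    for s in sentences:
        words = sorted(set(s.lower().split()))
        uni.update(words)
        pairs.update(combinations(words, 2))
    return {(x, y): math.log2((pairs[(x, y)] / n)
                              / ((uni[x] / n) * (uni[y] / n)))
            for (x, y) in pairs}

corpus = ["fish oils help raynauds",
          "fish oils reduce inflammation",
          "aspirin helps headaches"]
scores = pmi(corpus)
print(round(scores[("fish", "oils")], 3))  # 0.585 = log2(3/2)
```

A positive score means the pair co-occurs more often than independence predicts; with corpora this small the estimates are unreliable, which is one reason co-occurrence alone is often only a baseline.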
</p> <ul><li>Computational Linguistics - Syntax </li></ul> <ul><li><ul><li>Parts of speech, morphology, phrase structure, parsing, chunking </li></ul></li></ul> <ul><li>Semantics </li></ul> <ul><li><ul><li>Lexical semantics, syntax-driven semantic analysis, domain model-assisted semantic analysis (WordNet)</li></ul></li></ul> <ul><li>Getting your hands dirty </li></ul> <ul><li><ul><li>Text encoding, tokenization, sentence splitting, morphology variants, lemmatization </li></ul></li></ul> <ul><li><ul><li>Using parsers, understanding outputs </li></ul></li></ul> <ul><li><ul><li>Tools, resources, frameworks </li></ul></li></ul> <p>43. </p> <ul><li>Statistical NLP </li></ul> <ul><li><ul><li>Mathematical foundations, some information theory </li></ul></li></ul> <ul><li><ul><li>Words, statistical inference using n-grams, language models </li></ul></li></ul> <ul><li>How are these used? Some examples </li></ul> <ul><li><ul><li>Collocations, lexical acquisition, word sense disambiguation </li></ul></li></ul> <p>44. </p> <ul><li>Several algorithms, variations for each </li></ul> <ul><li>We won't go into each algorithm </li></ul> <ul><li><ul><li>That's a whole semester course </li></ul></li></ul> <ul><li>We'll define the problem, point out the general approach, show examples </li></ul> <ul><li>More importantly, we'll show how results of these components are used by applications we know of today </li></ul> <p>45. </p> <ul><li>POS Tags, Taggers, Ambiguities, Examples </li></ul> <p>[Pipeline: Word Sequence → Syntactic Parser → Parse Tree → Semantic Analyzer → Literal Meaning → Discourse Analyzer → Meaning] 46. 
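As a concrete taste of the POS-tagging material that follows, here is the standard most-frequent-tag baseline: assign each word the tag it carried most often in hand-tagged data, ignoring context entirely. The tiny training set below is made up for illustration; real baselines train on the Penn Treebank or similar:

```python
from collections import Counter, defaultdict

# Tiny hand-tagged sample, invented for illustration only.
tagged = [("cancel", "VB"), ("that", "DT"), ("ticket", "NN"),
          ("book", "VB"), ("that", "DT"), ("flight", "NN"),
          ("the", "DT"), ("book", "NN")]

def train_unigram_tagger(pairs):
    """Most-frequent-tag baseline: each word gets its commonest
    training tag; context is ignored entirely."""
    by_word = defaultdict(Counter)
    for word, tag in pairs:
        by_word[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in by_word.items()}

tagger = train_unigram_tagger(tagged)
print(tagger["that"], tagger["ticket"])  # DT NN
```

Because it ignores neighbors, this baseline cannot resolve the duck/can ambiguities discussed next; contextual taggers (HMMs, transformation-based learning, etc.) exist precisely to beat it.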
</p> <ul><li>Word classes, syntactic/grammatical categories, parts-of-speech </li></ul> <ul><li>Comprehensive lists used by taggers </li></ul> <ul><li><ul><li>87 in the Brown Corpus tagset </li></ul></li></ul> <ul><li><ul><li>45 in the Penn Treebank tagset </li></ul></li></ul> <ul><li><ul><li>146 for the C7 tagset </li></ul></li></ul> <p>POS Tag / Description / Example: CC: coordinating conjunction (and); CD: cardinal number (1, third); DT: determiner (the); EX: existential there (there is); FW: foreign word (d'oeuvre); IN: preposition/subordinating conjunction (in, of, like); JJ: adjective (green); JJR: adjective, comparative (greener); JJS: adjective, superlative (greenest); LS: list marker (1)); MD: modal (could, will); NN: noun, singular or mass (table); NNS: noun, plural (tables); NNP: proper noun, singular (John); NNPS: proper noun, plural (Vikings); PDT: predeterminer (both the boys); POS: possessive ending (friend's); PRP: personal pronoun (I, he, it); PRP$: possessive pronoun (my, his) 47. </p> <ul><li>Assigning a POS or syntactic class marker to a word in a sentence/corpus. </li></ul> <ul><li><ul><li>Word classes, syntactic/grammatical categories </li></ul></li></ul> <ul><li>Usually preceded by tokenization</li></ul> <ul><li><ul><li>delimit sentence boundaries, tag punctuation and words. </li></ul></li></ul> <ul><li>Publicly available tree banks: documents tagged for syntactic structure </li></ul> <ul><li>Typical input and output of a tagger </li></ul> <ul><li><ul><li><ul><li>Cancel that ticket. → Cancel/VB that/DT ticket/NN ./. </li></ul></li></ul></li></ul> <p>48. </p> <ul><li>Lexical ambiguity </li></ul> <ul><li><ul><li>Words have multiple usages and parts-of-speech </li></ul></li></ul> <ul><li><ul><li><ul><li>A duck in the pond; Don't duck when I bowl </li></ul></li></ul></li></ul> <ul><li><ul><li><ul><li><ul><li>Is duck a noun or a verb? </li></ul></li></ul></li></ul></li></ul> <ul><li><ul><li><ul><li>Yes, we can; Can of soup; I canned this idea </li></ul></li></ul></li></ul> <ul><li><ul><li><ul><li><ul><li>Is can an auxiliary, a noun or a verb? 
</li></ul></li></ul></li></ul></li></ul> <ul><li>The problem in tagging is resolving such ambiguities </li></ul> <p>49. </p> <ul><li>Information about a word and its neighbors </li></ul> <ul><li><ul><li>has implications for language models </li></ul></li></ul> <ul><li><ul><li><ul><li>Possessive pronouns (mine, her, its) are usually followed by a noun </li></ul></li></ul></li></ul> <ul><li><ul><li>Understand new words </li></ul></li></ul> <ul><li><ul><li><ul><li>Toves did gyre and gimble. </li></ul></li></ul></li></ul> <ul><li><ul><li>On IE </li></ul></li></ul> <ul><li><ul><li><ul><li>Nouns as cues for named entities </li></ul>