Page 1

Introduction to Natural Language Processing and Text Mining
and
The basic building blocks

Sudeshna Sarkar
Professor
Computer Science & Engineering Department
Indian Institute of Technology Kharagpur

Page 2

Ambiguity

At last, a computer that understands you like your mother.
-- 1985 McDonnell-Douglas ad

Different interpretations:
1. The computer understands you as well as your mother understands you.
2. The computer understands that you like your mother.
3. The computer understands you as well as it understands your mother.

Speech: ... a computer that understands your lie cured mother ...

Page 3

Why is NLP difficult?

Natural language is highly ambiguous.

Syntactic ambiguity
– The sentence “The president spoke to the nation about the problem of drug use in the schools from one coast to the other.” has 720 parses.
– Ex:
    “to the other” can attach to any of the previous NPs (e.g. “the problem”) or to the head verb: 6 places
    “from one coast” has 5 places to attach
    …

Page 4

Why is NLP difficult?

Word category ambiguity
– book --> verb? or noun?

Word sense ambiguity
– bank --> financial institution? building? or river side?

Words can mean more than the sum of their parts
– make up a story

Fictitious worlds
– People on Mars can fly.

Defining scope
– People like ice-cream.
– Does this mean that all (or some?) people like ice cream?

Language is changing and evolving
– I’ll email you my answer.
– This new S.U.V. has a compartment for your mobile phone.
– Googling, …

Page 5

Why is NLP hard?

Natural language
– is highly ambiguous at all levels
– is complex
– is probabilistic, fuzzy
– involves reasoning about the world
– deals with complex social interactions

Why is text tough?
– Abstract concepts are difficult to represent
– Countless combinations of subtle, abstract relationships among concepts
– Many ways to represent similar concepts
– Concepts are difficult to visualize
– High dimensionality: tens or hundreds of thousands of features

Page 6

How is NLP doable?

But in some senses NLP is quite easy
– Rough text features are good enough for many useful tasks

Why is text easy?
– Highly redundant data
– Just about any simple algorithm can get “good” results for simple tasks:
    – Pull out “important” phrases
    – Find “meaningfully” related words
    – Create some sort of summary from documents

Page 7

Levels of Text Processing

Word Level
– Word Properties
– Stop-Words
– Stemming
– Frequent N-Grams
– Thesaurus (WordNet)

Sentence Level
Document Level
Document-Collection Level
Linked-Document-Collection Level
Application Level
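
Two of the word-level steps listed on this slide, stop-word removal and frequent n-gram counting, can be sketched in a few lines of Python. This is only an illustration: the stop-word list and the function names `content_words` and `frequent_ngrams` are assumptions made for the example, not part of the lecture.

```python
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "of", "to", "in", "and", "is"}


def content_words(text):
    """Lowercase, split on whitespace, and drop stop-words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def frequent_ngrams(tokens, n, top=3):
    """Return the `top` most frequent n-grams as (ngram, count) pairs."""
    # zip over n shifted copies of the token list yields consecutive n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams).most_common(top)


tokens = content_words("The problem of drug use in the schools")
print(tokens)
print(frequent_ngrams(["a", "b", "a", "b", "c"], n=2, top=1))
```

Whitespace splitting is the crudest possible tokenizer; it is used here only to keep the sketch short.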

Page 8

Models and Algorithms

Models: formalisms used to capture the various kinds of linguistic structure.

– State machines (FSAs, transducers, Markov models)
– Formal rule systems (context-free grammars, feature systems)
– Logic (predicate calculus, inference)
– Probabilistic versions of all of these, plus others (Gaussian mixture models, probabilistic relational models, etc.)

Algorithms: used to manipulate representations to create structure.
– Search (A*, dynamic programming)
– EM
– Supervised learning, etc.

Page 9

Language Processing Pipeline

[Figure: language processing pipeline. Inputs: speech, text. Stages: Phonetic/Phonological Analysis; OCR/Tokenization; Morphological and lexical analysis; Syntactic analysis; Semantic Interpretation; Discourse Processing. Tasks shown alongside the stages: POS tagging, WSD, shallow parsing, deep parsing, anaphora resolution, integration.]

Page 10

The Big Picture

[Figure: the big picture. Labels: Source Language Speech Signal, Speech recognition, Source text, Analysis, Generation, Target text, Speech Synthesis, Target Language Speech Signal.]

Page 11

Some Building Blocks

Source Language Analysis
– Text Normalization
– Morphological Analysis
– POS Tagging
– Parsing
– Semantic Analysis
– Discourse Analysis

Target Language Generation
– Text Rendering
– Morphological Synthesis
– Phrase Generation
– Role Ordering
– Lexical Choice
– Discourse Planning

Page 12

Two Approaches

Symbolic
– Encode all the necessary knowledge
– Good when annotated data is not available
– Allows steady development
– The development can be monitored
– Fits well with logic and reasoning in AI

Statistical
– Learn language from its usage
– Supervised learning requires large collections manually annotated with meta-tags
– Development is almost blind
    – Few ways to check the correctness
    – Debugging is very frustrating

Page 13

Resolve Ambiguities

We will introduce models and algorithms to resolve ambiguities at different levels.

part-of-speech tagging -- Deciding whether “duck” is a verb or a noun.

word-sense disambiguation -- Deciding whether “make” means create or cook.

lexical disambiguation -- Resolving part-of-speech ambiguity and resolving word-sense ambiguity are two important kinds of lexical disambiguation.

syntactic ambiguity -- “her duck” is an example of syntactic ambiguity, and can be addressed by probabilistic parsing.
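
As a rough illustration of how a probabilistic model can choose between the noun and verb readings of “duck”, here is a minimal Viterbi tagger over a toy hidden Markov model. Viterbi decoding is one common dynamic-programming approach to part-of-speech tagging; the slides do not say which method the course uses. The tag set and every probability below are invented for the example.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words` under a toy HMM."""
    # best[i][t] = (probability of the best path ending in tag t at word i, previous tag)
    best = [{t: (start_p.get(t, 0.0) * emit_p.get((t, words[0]), 0.0), None)
             for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            column[t] = max(
                (best[i - 1][p][0] * trans_p.get((p, t), 0.0)
                 * emit_p.get((t, words[i]), 0.0), p)
                for p in tags
            )
        best.append(column)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))


# Made-up parameters, for illustration only.
TAGS = ["DET", "PRON", "NOUN", "VERB"]
START = {"DET": 0.4, "PRON": 0.3, "NOUN": 0.2, "VERB": 0.1}
TRANS = {("DET", "NOUN"): 0.9, ("DET", "VERB"): 0.1,
         ("PRON", "VERB"): 0.6, ("PRON", "NOUN"): 0.4}
EMIT = {("DET", "her"): 0.5, ("PRON", "her"): 0.5,
        ("NOUN", "duck"): 0.7, ("VERB", "duck"): 0.3}

print(viterbi(["her", "duck"], TAGS, START, TRANS, EMIT))  # ['DET', 'NOUN']
```

With these numbers the determiner-noun reading wins (0.4 × 0.5 × 0.9 × 0.7 = 0.126) over the pronoun-verb reading (0.3 × 0.5 × 0.6 × 0.3 = 0.027); different parameters could reverse the outcome.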

Page 14

Languages

Languages: 39,000 languages and dialects (22,000 dialects in India alone)

Top languages: Chinese/Mandarin (885M), Spanish (332M), English (322M), Bengali (189M), Hindi (182M), Portuguese (170M), Russian (170M), Japanese (125M)
Source: www.sil.org/ethnologue, www.nytimes.com

Internet: English (128M), Japanese (19.7M), German (14M), Spanish (9.4M), French (9.3M), Chinese (7.0M)
Usage: English (1999: 54%, 2001: 51%, 2003: 46%, 2005: 43%)
Source: www.computereconomics.com

Page 15

Tokenization
Segmentation
Stemming / lemmatization
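
A minimal sketch of the first of these steps, tokenization, using a single regular expression. The pattern and the function name `tokenize` are assumptions made for illustration; practical tokenizers handle many more cases, such as abbreviations like S.U.V., numbers, and URLs.

```python
import re

# A word (optionally followed by an apostrophe clitic such as 'll), or any
# single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("I'll email you my answer."))
# ["I'll", 'email', 'you', 'my', 'answer', '.']
```

Note that this pattern would break an abbreviation like S.U.V. into separate letters and periods, which is one reason real tokenizers need extra rules or lexicons.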

Page 16

Morphology

Morphology is the field of linguistics that studies the internal structure of words
– How words are built up from smaller meaningful units called morphemes (morph = shape, logos = word)

We can usefully divide morphemes into two classes
– Stems: the core meaning-bearing units
– Affixes: bits and pieces that adhere to stems to change their meanings and grammatical functions
    – Prefix: un-, anti-, etc. (a-, ati-, pra-, etc.)
    – Suffix: -ity, -ation, etc. (-taa, -ke, -ka, etc.)
    – Infix: inserted inside the stem
        Tagalog: um + hingi → humingi
    – Circumfix: precedes and follows the stem

Turkish can have words with a lot of suffixes (agglutinative language)
Many Indian languages also have agglutinative suffixes

Page 17

Examples (English)

“unladylike”
– 3 morphemes, 4 syllables
    un- ‘not’
    lady ‘(well behaved) female adult human’
    -like ‘having the characteristics of’
– Can’t break any of these down further without distorting the meaning of the units

“dogs”
– 2 morphemes, 1 syllable
    -s, a plural marker on nouns
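
The two English examples can be reproduced with a toy segmenter that tries every combination of an optional prefix and an optional suffix around a known stem. The three small inventories and the function name `segment` are assumptions made for this sketch; a real analyzer needs a full lexicon and spelling rules.

```python
# Toy inventories covering only the examples on this slide.
PREFIXES = ["un"]
SUFFIXES = ["like", "s"]
STEMS = {"lady", "dog"}


def segment(word):
    """Return [prefix, stem, suffix] with empty parts dropped, or None if no analysis."""
    for prefix in [""] + PREFIXES:
        for suffix in [""] + SUFFIXES:
            if word.startswith(prefix) and word.endswith(suffix):
                stem = word[len(prefix):len(word) - len(suffix)]
                if stem in STEMS:
                    return [m for m in (prefix, stem, suffix) if m]
    return None


print(segment("unladylike"))  # ['un', 'lady', 'like']
print(segment("dogs"))        # ['dog', 's']
```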

Page 18

Examples (Bengali)

“chhelederTaakei”
– 5 morphemes
    chhele ‘boy’
    -der ‘plural genitive’
    -Taa ‘classifier’
    -ke ‘dative’
    -i ‘emphasizer’
– Can’t break any of these down further without distorting the meaning of the units

“atipraakrritake”
– ati-praakrrita-ke

Page 19

Inflectional & Derivational Morphology

We can also divide morphology up into two broad classes
– Inflectional
– Derivational

Inflectional morphology is grammatical
– number, tense, case, gender

Derivational morphology concerns word building
– part-of-speech derivation
– words with related meaning

Page 20

Inflectional Morphology

Inflection: variation in the form of a word, typically by means of an affix, that expresses a grammatical contrast.
– Doesn’t change the word class
– Usually produces a predictable, nonidiosyncratic change of meaning
    E.g., may add tense, number, person, mood, aspect
– Serves a grammatical/semantic purpose different from the original

Highly systematic, though there may be irregularities and exceptions
– Simplifies lexicon, only exceptions need to be listed
– Unknown words may be guessable

After combination with an inflectional morpheme, the meaning and class of the actual stem usually do not change.
– eat / eats
– pencil / pencils
– helaa / khele / khelchhila
– bai / baiTAke / baiyera

Page 21

Derivational Morphology

Derivation: the formation of a new word or inflectable stem from another word or stem.

After combination with a derivational morpheme, the meaning and the class of the actual stem usually change.
– compute / computer
– do / undo
– friend / friendly
– Uygar / uygarlaş
– kapı / kapıcı
– udaara (J) / udaarataa (N)
– bhadra / abhadra
– baayu / baayabiiya

Irregular changes may happen with derivational affixes.

Fairly systematic, and predictable up to a point
– Simplifies description of lexicon: regularly derived words need not be listed
– Unknown words may be guessable

But …
– Apparent derivations have specialised meaning
– Some derivations missing
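
To make the idea of class-changing derivation concrete, the sketch below stores, for each affix, the word class it attaches to and the class it produces, and applies it to the three English pairs on this slide. The table `DERIVATIONS`, the class labels, and the single spelling rule are assumptions made for illustration, and they cover only these examples.

```python
# affix -> (class it attaches to, class of the derived word); illustrative only.
DERIVATIONS = {
    "-er": ("V", "N"),    # compute (V) -> computer (N)
    "-ly": ("N", "ADJ"),  # friend (N)  -> friendly (ADJ)
    "un-": ("V", "V"),    # do (V)      -> undo (V): here the class stays the same
}


def derive(stem, stem_class, affix):
    """Attach a derivational affix and return (new_form, new_class)."""
    source_class, target_class = DERIVATIONS[affix]
    if stem_class != source_class:
        raise ValueError(f"{affix} does not attach to class {stem_class}")
    if affix.endswith("-"):                  # prefix, written "un-"
        return affix[:-1] + stem, target_class
    suffix = affix[1:]                       # suffix, written "-er"
    # Drop a final 'e' before a vowel-initial suffix (compute + er -> computer).
    if stem.endswith("e") and suffix[0] in "aeiou":
        stem = stem[:-1]
    return stem + suffix, target_class


print(derive("compute", "V", "-er"))  # ('computer', 'N')
print(derive("friend", "N", "-ly"))   # ('friendly', 'ADJ')
print(derive("do", "V", "un-"))       # ('undo', 'V')
```

The class check is what lets such a table reject combinations like "-ly" on a verb, although real derivation is less regular than this, as the caveats on the slide point out.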

Page 22

Morphological processes

Affixes: prefix, suffix, infix, circumfix
Vowel change (umlaut, ablaut)
Gemination, (partial) reduplication
Root and pattern
Stress (or tone) change
Sandhi

Page 23

Concatenative Morphology

Morpheme+Morpheme+Morpheme+…

Stems: also called lemma, base form, root, lexeme
– hope+ing → hoping
– hop → hopping

Affixes
– Prefixes: Antidisestablishmentarianism
– Suffixes: Antidisestablishmentarianism
– Infixes: hingi (borrow) – humingi (borrower) in Tagalog
– Circumfixes: sagen (say) – gesagt (said) in German

Agglutinative Languages
– uygarlaştıramadıklarımızdanmışsınızcasına
– uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına
– Behaving as if you are among those whom we could not cause to become civilized
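
The Turkish example is purely concatenative: joining the morphemes of the segmentation gives back the surface word unchanged. The short check below makes that explicit and contrasts it with hop + ing, where plain concatenation does not yield the written form. The helper names `morphemes` and `is_concatenative` are made up for this sketch.

```python
# Surface form and '+'-delimited segmentation, copied from the slide.
WORD = "uygarlaştıramadıklarımızdanmışsınızcasına"
SEGMENTED = "uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına"


def morphemes(segmented):
    """Split a '+'-delimited segmentation into its morphemes."""
    return segmented.split("+")


def is_concatenative(word, segmented):
    """True if simply joining the morphemes reproduces the surface word."""
    return "".join(morphemes(segmented)) == word


print(len(morphemes(SEGMENTED)))              # 11
print(is_concatenative(WORD, SEGMENTED))      # True
print(is_concatenative("hopping", "hop+ing"))  # False: the consonant doubles
```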

Page 24

Morphophonemics

Morphemes and allomorphs
– e.g. {plur}: +(e)s, vowel change, y → ies, f → ves, um → a, ...

Morphophonemic variation
– Affixes and stems may have variants which are conditioned by context
    – e.g. +ing in lifting, swimming, boxing, raining, hoping, hopping
– Rules may be generalisable across morphemes
    – e.g. +(e)s in cats, boxes, tomatoes, matches, dishes, buses
    – Applies to both {plur} (nouns) and {3rd sing pres} (verbs)
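
The context-conditioned spelling of +(e)s and +ing can be approximated with a few string rules. The sketch below covers the word lists on this slide; the rules are simplified assumptions (for instance, the "o" rule gives tomatoes but would wrongly give "photoes", and the doubling rule is only reliable for one-syllable stems), and the function names are made up for the example.

```python
VOWELS = set("aeiou")


def add_es(stem):
    """Attach +(e)s: insert 'e' after sibilant-like endings and after final 'o'."""
    if stem.endswith(("s", "x", "z", "ch", "sh", "o")):
        return stem + "es"
    return stem + "s"


def add_ing(stem):
    """Attach +ing: drop a silent final 'e', or double a final consonant after CVC."""
    if stem.endswith("e") and not stem.endswith("ee"):
        return stem[:-1] + "ing"                      # hope -> hoping
    if (len(stem) >= 3
            and stem[-1] not in VOWELS and stem[-1] not in "wxy"
            and stem[-2] in VOWELS
            and stem[-3] not in VOWELS):
        return stem + stem[-1] + "ing"                # hop -> hopping
    return stem + "ing"                               # lift -> lifting


print([add_es(w) for w in ["cat", "box", "tomato", "match", "dish", "bus"]])
print([add_ing(w) for w in ["lift", "swim", "box", "rain", "hope", "hop"]])
```

Because `add_es` looks only at the spelling of the stem, the same function serves for plural nouns and third-person singular verbs, which is the generalisation the slide points to.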

Page 25

Templatic Morphology

Roots and Patterns
Example: Hebrew verbs

Root:
– Consists of 3 consonants CCC
– Carries basic meaning

Template:
– Gives the ordering of consonants and vowels
– Specifies semantic information about the verb
    Active, passive, middle voice

Example:
– lmd (to learn or study)
    CaCaC -> lamad (he studied)
    CiCeC -> limed (he taught)
    CuCaC -> lumad (he was taught)
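
Root-and-pattern combination can be sketched as slot filling: each `C` in the template is replaced by the next consonant of the root, and the vowels are copied through. The function name `apply_template` is an assumption made for this example.

```python
def apply_template(root, template):
    """Fill the C slots of `template` with the consonants of `root`, in order."""
    if template.count("C") != len(root):
        raise ValueError("template must have one C slot per root consonant")
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in template)


for template in ("CaCaC", "CiCeC", "CuCaC"):
    print(template, "->", apply_template("lmd", template))
# CaCaC -> lamad
# CiCeC -> limed
# CuCaC -> lumad
```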

Page 26

Syntax and Morphology

Phrase-level agreement
– Subject-Verb
    – John studies hard (STUDY+3SG)
– Noun-Adjective
    – Achchhi Ladki

In some languages like Sanskrit, morphology contains a lot of information about structure

Page 27

Morphology in NLP

Analysis vs synthesis
– what does dogs mean? vs what is the plural of dog?

Analysis
– Need to identify lexeme
    – Tokenization
    – To access lexical information
– Inflections (etc.) carry information that will be needed by other processes (e.g. agreement is useful in parsing; inflections can carry meaning, e.g. tense, number)
– Morphology can be ambiguous
    – May need other process to disambiguate (e.g. German –en)

Synthesis
– Need to generate appropriate inflections from underlying representation
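
The two directions can be illustrated with a toy pair of functions over a two-word lexicon: `analyze` maps a surface form to a lexeme plus a number feature, and `synthesize` maps them back. The lexicon, the feature labels "SG" and "PL", and the plain "+s" rule are all assumptions made for this sketch.

```python
# Toy lexicon of noun lexemes (illustrative only).
LEXICON = {"dog", "cat"}


def analyze(word):
    """Analysis: surface form -> (lexeme, number), or None if unknown."""
    if word in LEXICON:
        return (word, "SG")
    if word.endswith("s") and word[:-1] in LEXICON:
        return (word[:-1], "PL")
    return None


def synthesize(lexeme, number):
    """Synthesis: (lexeme, number) -> surface form."""
    if lexeme not in LEXICON:
        raise ValueError(f"unknown lexeme: {lexeme}")
    return lexeme + "s" if number == "PL" else lexeme


print(analyze("dogs"))          # ('dog', 'PL')   "what does dogs mean?"
print(synthesize("dog", "PL"))  # dogs            "what is the plural of dog?"
```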