SIMS 296a-4 Text Data Mining Marti Hearst UC Berkeley SIMS


Page 1

SIMS 296a-4: Text Data Mining

Marti Hearst, UC Berkeley SIMS

Page 2

The Textbook
• Foundations of Statistical Natural Language Processing, by Chris Manning and Hinrich Schuetze
• We’ll go through one chapter each week

Page 3

Chapters to be Covered
1. Introduction (this week)
2. Linguistic Essentials
3. Mathematical Foundations
4. Mathematical Foundations (cont.)
5. Collocations
6. Statistical Inference
7. Word Sense Disambiguation
8. Markov Models
9. Text Categorization
10. Topics in Information Retrieval
11. Clustering
12. Lexical Acquisition

Page 4

Introduction
• Scientific basis for this inquiry
• Rationalist vs. empiricist approaches to language analysis
  – Justification for the rationalist view: the poverty of the stimulus
  – This can be overcome if we assume humans can generalize concepts

Page 5

Introduction
• Competence vs. performance theory of grammar
  – Focus on whether or not sentences are well-formed
  – Syntactic vs. semantic well-formedness
  – Conventionality of expression breaks this notion

Page 6

Introduction
• Categorical perception
  – Recognizing phonemes: works pretty well
  – But not for larger phenomena like syntax
  – Language change as counter-evidence to strict categorizability of language
    » “kind of” / “sort of” changed parts of speech very gradually
    » They occupied an intermediate syntactic status during the transition
  – Better to adopt a probabilistic view (of cognition as well as of language)

Page 7

Introduction
• The ambiguity of language
  – Unlike programming languages, natural language is ambiguous if not understood in terms of all its parts
    » Sometimes it is truly ambiguous too
  – Parsing with syntax alone is harder than parsing with the underlying meaning as well

Page 8

Classifying Application Types

                    Patterns                   Non-novel nuggets       Novel nuggets
  Non-textual data  Standard data mining       Database queries        ?
  Textual data      Computational linguistics  Information retrieval   Real text data mining

Page 9

Word Token Distribution
• Word tokens are not uniformly distributed in text
  – The most common tokens account for about 50% of the occurrences
  – About 50% of the tokens occur only once
  – ~12% of the text consists of words occurring 3 times or fewer
• Thus it is hard to predict the behavior of many words in the text.
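These figures are easy to reproduce on any corpus; a minimal Python sketch (the tiny example text here is invented for illustration, and real corpora show the skew far more strongly):

```python
from collections import Counter

# A tiny invented corpus; real corpora show the effect far more strongly.
text = (
    "the cat sat on the mat and the dog sat on the rug "
    "while the cat and the dog watched the rain fall on the roof"
)
tokens = text.split()
counts = Counter(tokens)

total = len(tokens)
# Share of all token occurrences covered by the single most common word.
top_share = counts.most_common(1)[0][1] / total
# Share of distinct words that occur exactly once (hapax legomena).
hapax_share = sum(1 for c in counts.values() if c == 1) / len(counts)

print(f"tokens={total} types={len(counts)}")
print(f"most common word covers {top_share:.0%} of occurrences")
print(f"{hapax_share:.0%} of distinct words occur only once")
```

Even on 26 tokens, “the” alone covers about a third of the occurrences and over half of the distinct words are hapaxes.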

Page 10

Zipf’s “Law”

[Histogram of word frequency by rank bin: frequency on the y-axis (0–350) falls off sharply as rank increases]

• Rank (r) = order of words by frequency of occurrence
• The product of a word’s frequency (f) and its rank (r) is approximately constant:
  f · r ≈ C  (equivalently, f ≈ C / r), with C ≈ N / 10 for a text of N words
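The rank–frequency relationship can be checked directly; a minimal Python sketch on a toy frequency list (the counts are chosen to be exactly Zipfian, so the f·r products come out constant):

```python
from collections import Counter

# Toy token list with counts picked as 120/r for ranks 1..6,
# so frequency times rank is constant by construction.
tokens = (["the"] * 120 + ["of"] * 60 + ["and"] * 40
          + ["to"] * 30 + ["a"] * 24 + ["in"] * 20)
counts = Counter(tokens)

# Sort by descending frequency and assign ranks 1, 2, 3, ...
ranked = counts.most_common()
for rank, (word, freq) in enumerate(ranked, start=1):
    # Every f*r product here equals 120.
    print(f"{word:>4}  r={rank}  f={freq}  f*r={freq * rank}")
```

On real text the products are only roughly constant, but the same loop makes the pattern visible.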

Page 11

Consequences of Zipf
• There are always a few very frequent tokens that are not good discriminators.
  – Called “stop words” in Information Retrieval
  – Usually correspond to the linguistic notion of “closed-class” words
    » English examples: to, from, on, and, the, ...
    » Grammatical classes that don’t take on new members
• Typically:
  – A few very common words
  – A middling number of medium-frequency words
  – A large number of very infrequent words
• Medium-frequency words are the most descriptive
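Isolating that medium-frequency band is often done with a stop list plus a minimum count; a minimal Python sketch (the stop list and example text are invented for illustration, and real IR systems use curated closed-class lists):

```python
from collections import Counter

tokens = ("the cat sat on the mat and the cat chased the dog "
          "because the dog barked at the cat near the mat").split()
counts = Counter(tokens)

# Hypothetical stop list of closed-class words for this toy text.
stop_words = {"the", "and", "on", "at", "because", "near"}

# Keep medium-frequency content words: not stop words, not hapaxes.
descriptive = {w: c for w, c in counts.items()
               if w not in stop_words and c > 1}
print(descriptive)
```

What survives are exactly the repeated content words (cat, dog, mat), with the very frequent and the one-off tokens stripped away.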

Page 12

Word Frequency vs. Resolving Power (from van Rijsbergen 79)
The most frequent words are not (usually) the most descriptive.

Page 13

Order by Rank vs. by Alphabetical Order

Page 14

Other Zipfian “Laws”
• Conservation of speaker/hearer effort →
  – The number of meanings of a word is correlated with its frequency
  – (pure speaker economy would give one word for all meanings; pure hearer economy would give one meaning per word)
  – m is proportional to sqrt(f) (equivalently, inversely proportional to sqrt(r))
  – Important for word sense disambiguation
• Content words tend to clump together
  – Important for computing term distribution models

Page 15

Is Zipf a Red Herring?
• Power laws are common in natural systems
• Li (1992) shows that a Zipfian distribution of words can be generated randomly
  – 26 characters and a blank
  – The blank, like any other character, is equally likely to be generated
  – Key insights:
    » There are 26 times more possible words of length n+1 than of length n
    » There is a constant ratio by which words of length n are more frequent than words of length n+1
• Nevertheless, the Zipf insight is important to keep in mind when working with text corpora: language modeling is hard because most words are rare.
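Li’s construction is simple to try; a minimal Python sketch (the text length and random seed are arbitrary choices for illustration):

```python
import random
import string
from collections import Counter

# Li (1992): pick each of 26 letters or a blank uniformly at random;
# splitting on blanks already yields a Zipf-like rank-frequency curve.
random.seed(0)
alphabet = string.ascii_lowercase + " "
chars = random.choices(alphabet, k=200_000)
words = "".join(chars).split()

counts = Counter(words)
ranked = [freq for _, freq in counts.most_common()]

# Short "words" dominate: single letters are far more likely than any
# particular longer string, so frequencies fall off steeply with rank.
print("top-5 frequencies:", ranked[:5])
print("types:", len(counts), "tokens:", len(words))
```

The most frequent “words” come out as single letters, with frequency dropping sharply by rank, mimicking the Zipf curve without any linguistic structure.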

Page 16

Collocations
• Collocation: any turn of phrase or accepted usage where the whole is perceived to have an existence beyond the sum of its parts.
  – Compounds (disk drive)
  – Phrasal verbs (make up)
  – Stock phrases (bacon and eggs)
• Another definition:
  – The frequent use of a phrase as a fixed expression accompanied by certain connotations.

Page 17

Computing Collocations
• Take the most frequent adjacent pairs
  – Doesn’t yield interesting results
  – Need to normalize for the word frequency within the corpus
• Another tack: retain only pairs with interesting syntactic categories
  » adj noun
  » noun noun
• More on this later!
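Normalizing for word frequency is often done with an association score such as pointwise mutual information (PMI); a minimal Python sketch (the toy sentence and the count-of-2 filter are illustrative choices, not a prescribed method):

```python
import math
from collections import Counter

tokens = ("the new york times said the city of new york "
          "has the best bagels in the city").split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
n = len(tokens)

def pmi(w1, w2):
    # PMI compares the observed pair probability with what independent
    # occurrence of the two words would predict.
    p_xy = bigrams[(w1, w2)] / (n - 1)
    p_x = unigrams[w1] / n
    p_y = unigrams[w2] / n
    return math.log2(p_xy / (p_x * p_y))

# Rank repeated pairs by PMI instead of raw count.
scored = sorted((b for b in bigrams if bigrams[b] >= 2),
                key=lambda b: pmi(*b), reverse=True)
print("by raw count:", bigrams.most_common(2))
print("by PMI:", [(b, round(pmi(*b), 2)) for b in scored])
```

By raw count, (the, city) ties with (new, york); normalizing by unigram frequency demotes the pair built on the very common “the” and promotes the genuine collocation.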

Page 18

Next Week
• Learn about linguistics!
• Decide on project participation