
Text Mining
Research Computing Center

Sachin Joshi

8/27/16


Logistics

• Course outcomes:
  § Python programming proficiency (to the extent of writing stand-alone programs);
  § Linguistics basics;
  § Mining algorithms in real-world applications;
  § The ability to tackle more challenging NLP and text mining problems;
  § Awareness of the value locked up in text data;
  § Both computation- and data-driven thinking (good for big data and analytics jobs);
  § Some running and cool projects to brag about!

CSE398/498 7

What is text mining

Text data
• Tweets
• Reviews
• Government reports
• Scientific papers
• News
• Books
• Messages

Non-text data
• Images
• Videos
• Temperatures
• Time series
• Location
• Graph/networks

Data Mining (Data Science)

What is text mining

[Figure: real world, human beings, text data]

Mining as reverse engineering
• Infer what the real world is;
• Infer what the human beings are thinking;
• Infer the language itself.

Decision making!

What is text mining

• A more practical view

Computing Modeling Linguistics

Real world

• Example

Attractive Nakiri.(double bevel) I have been using Shun Classic Santokus and utility knives for almost everything so I felt a need to try something "new." This was it. This blade is quite thin and based on the specs it is made of a good quality steel. It isn't at the very top in terms of hardness, but I hope that also means that it will be less brittle and potentially easier to sharpen.

Attractive
Blade -> Thin
Steel -> Good quality
Blade -> Less brittle
Blade -> Easy to sharpen

Text data Knowledge

What is text mining

Prob(buying) = 95%

Decision

Text mining vs. other approaches

• Data Science: focuses on data processing, with simple mining models.
• Data Mining: focuses on general mining techniques for more general data formats.
• AI: a general area, providing some techniques for text mining.
• Database: focuses on structured data, while text data are highly unstructured.
• Traditional NLP: some techniques can be used for text mining, but it focuses more on text analysis (beat a sentence to death).

Why text mining

• Texts are everywhere!
• Texts have valuable but hidden knowledge.
• Many useful and real-world applications:
  § Stock market (deep)
  § Customer survey (Amazon.com)
  § Policy and government (opengov.com)
  § Question answering systems (Baidu’s medical QA)
  § Many more …

Stock market prediction

• Predict whether a stock will go up or down using opinions expressed in public forums and news.

Entities: companies, countries, people, etc.

Opinions: positive, negative, neutral, etc.

Customer relation management

• Know what the customers like and don’t like about a product.
• Possibly recommend alternative products.
• Sway customers’ opinions via incentives.
• Retain leaving customers.

Scientific literature management

• Categorization of publications;
• Information retrieval;
• Discovering scientific hypotheses;
• Influential paper discovery;
• Trending topics for research

Question answering

How to do text mining
• Tools:

Computers: store and process big text data. (programming and data structures)

Linguistics: human knowledge about syntax, semantics, etc. (of English)

Statistical and machine learning models: infer the hidden knowledge about the real world in the text data. (Calculus, linear algebra, probability and statistics.)

Text Mining

Introduction

Outline
• Sentence segmentation.
• Word tokenization.
• Word normalization:
  § case folding
  § lemmatization
  § stemming.
• Text representations.

A big picture
• Decide the tokens/vocabulary of your corpus.
• Further tasks:
  § word collocation (phrase level)
  § classification, clustering, topic models (document level)
  § syntax parsing (sentence level)
  § semantic analysis: entity resolution, relation detection (all levels)
  § sentiment analysis (all levels).

Text processing pipeline
• Get the tokens! (& normalization)

[Figure labels: vector space model; sequential model; classification; clustering; topic modeling; indexing and IR]

Sentence segmentation
• Why: we sometimes want to study properties of sentences.
• What are the boundaries between sentences? Punctuation:
  § Question mark (?)
  § Exclamation mark (!)
  § Semicolon (;)
  § Period (.): 500.00 dollars, Ph.D.
  § Comma (,): 10,500 dollars
  § Quotation mark (“): “Bye”, I said
  § Ampersand (&): AT&T, Barnes & Noble
• More advanced methods are based on machine learning: check the surroundings of a punctuation mark to decide whether it is a boundary or not.
• NLTK uses the Punkt sentence segmenter.
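A naive rule makes the ambiguity concrete. The sketch below is not the Punkt algorithm; it simply splits after ".", "!" or "?" whenever whitespace follows. The helper name `naive_split` is hypothetical.

```python
import re

def naive_split(text):
    """Split after '.', '!' or '?' when whitespace follows (a deliberately crude rule)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

# The period inside "500.00" is not followed by whitespace, so it survives.
print(naive_split("It costs 500.00 dollars. Really? Yes!"))

# An abbreviation such as "Dr." is wrongly treated as a sentence boundary.
print(naive_split("Dr. Smith arrived."))
```

The second call shows why the surroundings of a punctuation mark have to be inspected: the rule cannot tell an abbreviation from the end of a sentence.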

Word tokenization
• A sequence of characters -> a sequence of meaningful tokens.
• Example:

From IIR:

From FSNLP:

Notice their difference?

Tokenization
• How to define a valid token is task-dependent.
• A simple space separator is not enough: “San” and “Francisco”? “Mar 2015”?
• Do we care about phrases? “San Francisco”? “New York”?
• Special characters like “$10”? Or hashtags on Twitter such as “#LU”?
• Apostrophes (’): “doesn’t”, or “does” and “n’t”? In sentiment analysis, negation is informative. Also “rock ‘n’ roll”, “Tom’s place”.
• Commas: “100,000 dollars” or (“100” “000” “dollars”)?
• Hyphens: “soon-to-be” or (“soon” “to” “be”)? “Hewlett-Packard”?
• Email addresses, dates, URLs (usually treated separately).
• In practice, no tokenization is perfect.
• Instead, it is usually done with fast automata programmed via regular expressions.
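A regular-expression tokenizer along these lines might look as follows. This is only a sketch: the pattern `TOKEN_RE` and the choices it encodes (keep hyphenated words and contractions whole, keep "$" with the amount) are assumptions made for illustration, and a different task could reasonably choose otherwise.

```python
import re

# Hypothetical pattern; alternatives are tried left to right at each position.
TOKEN_RE = re.compile(r"""
    https?://\S+                   # URLs kept whole
  | [\w.+-]+@[\w-]+(?:\.[\w-]+)+   # email addresses
  | [#@]\w+                        # hashtags and mentions
  | \$?\d+(?:,\d{3})*(?:\.\d+)?    # numbers with optional $, thousands commas, decimals
  | \w+(?:[-'’]\w+)*               # words, keeping internal hyphens and apostrophes
  | [^\w\s]                        # any other single punctuation mark
""", re.VERBOSE)

def tokenize(text):
    """Return the list of tokens matched by TOKEN_RE, skipping whitespace."""
    return TOKEN_RE.findall(text)

print(tokenize("Hewlett-Packard doesn't pay $10,500 for #LU."))
```

Splitting “doesn’t” into “does” and “n’t” would need an extra rule; the pattern above deliberately keeps the contraction as one token.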

Stop word removal
• Stop word list
• To remove or not to remove, that is the question:
  § Some stop words carry little meaning: “the”, “a”, “for”.
  § Stop word removal can reduce data size.
  § But stop words are critical elements in syntax and semantics. NLP usually keeps the stop words to facilitate the analysis of a whole sentence.

Word normalization
• After tokenization, we may have two words that can belong to the same class.
• Turn multiple words into a single class (may be incorrect).
• Examples:
  § “is” and “was” may be considered equivalent.
  § “USA” and “U.S.A” have the same meaning.
• There are various kinds of word normalization:
  § case folding.
  § lemmatization.
  § stemming.
  § semantic links (“auto” and “car”).

Case folding
• Usually we want to lower-case the capital letters at the beginning of a sentence.
  § Example: “He went to church” -> “he”, “went”, “to”, “church”
  § Counter-examples: “Kennedy was shot”? “USA” -> “usa”?
• Case can be informative.
  § “US” -> “us”: country name vs. a pronoun, a big loss of information.
  § “C.A.T” -> “cat”: company name vs. an animal.
• Rule-based: only lower-case the first letter of a sentence and all words in titles, leaving everything else untouched.
• Machine learning: sequence model with rich features.

Lemmatization
• Lemma -> lemmatization
• A lemma is a major entry or base form in an English dictionary.
• Examples:
  § “is”, “are”, “were” share the lemma “be”.
  § “dinner” and “dinners” share the lemma “dinner”.

“He is reading detective stories.”

“He be read detective story.”

Some information is lost.

Lemmatization
• More formally, we want to break a word into a few parts to recover the most basic component in the word (morphology).
• A word consists of morphemes /ˈmɔːrfiːmz/:
  § stem morphemes: the basic meaning of the word;
  § affix morphemes: added meanings.
  § Example: “dog” -> “dog”, “cats” -> “cat” + “s”
  § “organization” -> “organize” -> “organ”
• Morphological parsing is the technical term for this word-breaking process.

Stemming (Porter Stemmer)
• A simple but crude rule-based lemmatization method.
• A word is passed through the stemmer multiple times, with the output of the previous pass as the input to the current pass.
• Can have a lot of errors:
  § “organization” -> “organize” -> “organ”
  § “noisy” -> “noise”

A small test

Example of different stemmers

Stemming vs. lemmatization
• Stemming is crude and rule-based.
• Lemmatization involves a dictionary and morphological analysis of words (requiring more linguistic knowledge).
• Example:
  § Stemming: “saw” -> “s”
  § Lemmatization: “saw” -> “see” (verb) or “saw” (noun)
• When is stemming useful and harmful?
• Examples?

Text representation
• Vector space (or bag-of-words) models:
  § Word order does not matter (or is lost).
  § Boolean.
  § Term-frequency.
  § Term-frequency inverse-document-frequency.
  § All linear algebra concepts and operations hold in this vector space.
• Sequences of tokens (after text pre-processing):
  § Word order matters and shall be modeled.
  § Requires a set of math tools beyond linear algebra.

A corpus as a Boolean matrix

Tokens \ Documents   Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony               1                      1               0             0        0         1
Brutus               1                      1               0             1        0         0
Caesar               1                      1               0             1        1         1
Calpurnia            0                      1               0             0        0         0
Cleopatra            1                      0               0             0        0         0
mercy                1                      0               1             1        1         1
worser               1                      0               1             1        1         0

Each document is represented by a binary vector ∈ {0,1}^|V|

Issues with such text representation? Think about finding relevant docs using the keyword “Antony”.

Term frequency matrix
• Consider the number of occurrences of a term in a document.
• Each document is a count vector in ℕ^|V|: a column below.

Tokens \ Documents   Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Calpurnia            0                      10              0             0        0         0
Cleopatra            57                     0               0             0        0         0
mercy                2                      0               3             5        5         1
worser               2                      0               1             1        1         0

Relevance information is now better preserved: try to find docs containing “Antony”.

Usually we take the log of the frequencies to avoid scaling problems.
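Both matrices can be built from raw text with a few lines of Python. The sketch below uses a hypothetical two-document toy corpus (not the Shakespeare plays) and whitespace tokenization, which is an assumption made to keep the example short.

```python
from collections import Counter

# Hypothetical toy corpus; document names and texts are made up for illustration.
docs = {
    "d1": "caesar praised brutus and caesar left",
    "d2": "brutus left rome",
}

# Per-document term counts and the shared vocabulary.
counts = {name: Counter(text.split()) for name, text in docs.items()}
vocab = sorted(set().union(*counts.values()))

# Rows are tokens, columns are documents (in the order of `docs`).
tf = {t: [counts[d][t] for d in docs] for t in vocab}
boolean = {t: [int(c > 0) for c in row] for t, row in tf.items()}

for t in vocab:
    print(f"{t:10s} tf={tf[t]} bool={boolean[t]}")
```

The Boolean matrix is just the term-frequency matrix with every positive count clipped to 1, which is exactly the information lost when ranking documents for a keyword.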

Bag of words model
• Vector representation doesn’t consider the ordering of words in a document.
• “John is quicker than Mary” and “Mary is quicker than John” have the same vectors.
• This is called the bag of words model.
• In a sense, this is a step back: the positional index was able to distinguish these two documents.

Document frequency
• Rare terms are more informative than frequent terms.
  § Recall stop words.
• Consider a term (e.g., “Calpurnia”) in the query that is rare in the collection.

Tokens \ Documents   Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Calpurnia            0                      10              0             0        0         0
Cleopatra            57                     0               0             0        0         0
mercy                2                      0               3             5        5         1
worser               2                      0               1             1        1         0

Think about finding docs relevant to the query “Caesar” & “Calpurnia”.

Document frequency
• Frequent terms are less informative than rare terms.
• Consider a query term that is frequent in the collection (e.g., “high”, “increase”, “line”).
• A document containing such a term is more likely to be relevant than a document that doesn’t.
• But it’s not a sure indicator of relevance.
• → For frequent terms like “high”, “increase”, and “line”, we want positive weights, but lower weights than for rare terms.
• We will use document frequency (df) to capture this.

idf weight
• df_t is the document frequency of t: the number of documents that contain t.
• df_t is an inverse measure of the informativeness of t.
• df_t ≤ N
• We define the idf (inverse document frequency) of t by

$$\mathrm{idf}_t = \log_{10}(N / \mathrm{df}_t)$$

• We use log(N/df_t) instead of N/df_t to “dampen” the effect of idf. (Think about N = 1M with df = 100 and df = 10.)

idf example, suppose N = 1 million
• There is one idf value for each term t in a collection.

term        df_t        idf_t
calpurnia   1           6
animal      100         4
sunday      1,000       3
fly         10,000      2
under       100,000     1
the         1,000,000   0

$$\mathrm{idf}_t = \log_{10}(N / \mathrm{df}_t)$$
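The idf values for this example follow directly from the definition. A minimal sketch, with the function name `idf` chosen here for illustration:

```python
import math

def idf(df, n_docs):
    """Inverse document frequency: log10(N / df_t)."""
    return math.log10(n_docs / df)

N = 1_000_000
document_frequencies = {
    "calpurnia": 1,
    "animal": 100,
    "sunday": 1_000,
    "fly": 10_000,
    "under": 100_000,
    "the": 1_000_000,
}

for term, df in document_frequencies.items():
    print(f"{term:10s} df={df:>9,d} idf={idf(df, N):.1f}")
```

A term that occurs in every document gets idf 0, so it contributes nothing to a tf-idf score.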

tf-idf weighting
• The tf-idf weight of a term is the product of its tf weight and its idf weight:

$$w_{t,d} = \log(1 + \mathrm{tf}_{t,d}) \times \log_{10}(N / \mathrm{df}_t)$$

• Best known weighting scheme in information retrieval.
• Note: the “-” in tf-idf is a hyphen, not a minus sign!
• Alternative names: tf.idf, tf x idf
• Increases with the number of occurrences within a document.
• Increases with the rarity of the term in the collection.
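The weight can be computed in a couple of lines. In this sketch the logarithm applied to the term frequency is taken to be base 10, which is an assumption since the slide does not state the base; `tf_idf` is a hypothetical function name.

```python
import math

def tf_idf(tf, df, n_docs):
    """w = log10(1 + tf) * log10(N / df); base 10 for the tf part is an assumption."""
    return math.log10(1 + tf) * math.log10(n_docs / df)

N = 1_000_000
print(tf_idf(tf=0, df=100, n_docs=N))   # a term absent from the document gets weight 0
print(tf_idf(tf=9, df=100, n_docs=N))   # log10(10) * log10(10_000)
print(tf_idf(tf=9, df=N, n_docs=N))     # a term found in every document gets weight 0
```

The weight grows with the count inside the document and with the rarity of the term across the collection, matching the two "increases with" bullets.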

Binary → count → weight matrix


Tokens \ Documents   Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony               5.25                   3.18            0             0        0         0.35
Brutus               1.21                   6.1             0             1        0         0
Caesar               8.59                   2.54            0             1.51     0.25      0
Calpurnia            0                      1.54            0             0        0         0
Cleopatra            2.85                   0               0             0        0         0
mercy                1.51                   0               1.9           0.12     5.25      0.88
worser               1.37                   0               0.11          4.15     0.25      1.95

Each document is now represented by a real-valued vector of tf-idf weights ∈ ℝ^|V|

Documents as vectors
• So we have a |V|-dimensional vector space.
• Terms are axes of the space.
• Documents are points or vectors in this space.
• Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine.
• These are very sparse vectors: most entries are zero.

Documents in vector space with two terms

Finance

Great

All concepts and operations in linear algebra apply
• Distance between two vectors.
• Angle between two vectors (cosine similarity).

Distance vs. similarity

Notation: d_i and q are all documents.
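To see how distance and similarity can disagree, the sketch below compares Euclidean distance with cosine similarity on two hypothetical document vectors that point in the same direction but differ in length. All names are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def euclidean(u, v):
    """Length of the difference vector."""
    return norm([a - b for a, b in zip(u, v)])

def cosine(u, v):
    """Cosine of the angle between u and v (0.0 if either vector is all zeros)."""
    denom = norm(u) * norm(v)
    return dot(u, v) / denom if denom else 0.0

d1 = [1.0, 1.0]   # a short document
d2 = [2.0, 2.0]   # the same document repeated twice
print(euclidean(d1, d2), cosine(d1, d2))
```

The two vectors are a positive distance apart, yet their cosine similarity is 1, which is why cosine is usually preferred when documents differ in length.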

Summary• Low level text processing.• Bag-of-words or vector space text representations.• Distance/similarity measures.

• Coming up:• Classification based on the vector space of terms.

Text Mining

Text Classification

Text classification
• Why text classification
  § Spam detection;
  § Finding relevant documents;
  § Sentiment analysis

Formulation (Supervised learning)
• Given:
  § A document d (usually in a vector space).
  § A fixed set of classes: C = {c_1, c_2, …, c_J}
  § A training set D of documents, each with a label in C
• Determine:
  § A learning method or algorithm which will enable us to learn a classifier f
  § For a test document d, we assign it the class f(d) ∈ C

Classifiers
• Supervised learning
  § Naive Bayes (simple, common).
  § k-Nearest Neighbors (simple, powerful)
  § Support-vector machines (new, generally more powerful)
  § … plus many other methods
  § No free lunch: requires hand-classified training data
  § But data can be built up (and refined) by amateurs
• Many commercial systems use a mixture of methods

Classification using bag-of-words

I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet.

f(document) = c

Classification using bag-of-words

f(bag of words) = c, where the bag of words is:

great       2
love        2
recommend   1
laugh       1
happy       1
...         ...

Features
• Features = axes in the vector space.
• Supervised learning classifiers can use any sort of feature:
  § URL, email address, punctuation, capitalization, dictionaries, network features
• In the bag of words view of documents:
  § We use only word features
  § We use all of the words (vocabulary) in the text (not a subset)

Feature Selection: Why?
• Text collections have a large number of features
  § 10,000 – 1,000,000 unique words … and more

Feature Selection: Why?
• Selection may make a particular classifier feasible
  § Some classifiers can’t deal with 1,000,000 features
• Reduces training time
  § Training time for some methods is quadratic or worse in the number of features
• Makes runtime models smaller and faster
• Can improve generalization (performance)
  § Eliminates noise features
  § Avoids overfitting

Evaluating
• Evaluation must be done on test data that are independent of the training data
  § Sometimes use cross-validation (averaging results over multiple training and test splits of the overall data)
• Easy to get good performance on a test set that was available to the learner during training (e.g., just memorize the test set)
• Measures: precision, recall, F1, classification accuracy
  § Classification accuracy: r/n, where n is the total number of test docs and r is the number of test docs correctly classified
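The measures named above can be computed from gold and predicted labels as follows. This is a minimal sketch using the standard definitions of precision, recall and F1 for a single target class; the labels and function names are made up for illustration.

```python
def accuracy(gold, pred):
    """r/n: fraction of test docs whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def precision_recall_f1(gold, pred, target):
    """Precision, recall and F1 for one class, treating it as the positive class."""
    pairs = list(zip(gold, pred))
    tp = sum(g == target and p == target for g, p in pairs)
    fp = sum(g != target and p == target for g, p in pairs)
    fn = sum(g == target and p != target for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels for five test documents.
gold = ["spam", "spam", "ham", "ham", "spam"]
pred = ["spam", "ham", "ham", "spam", "spam"]
print(accuracy(gold, pred))
print(precision_recall_f1(gold, pred, "spam"))
```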

A running example
• Classify webpages from CS departments into:
  § student, faculty, course, project
• Train on ~5,000 hand-labeled web pages
  § Cornell, Washington, U.Texas, Wisconsin
• Crawl and classify a new site (CMU) using Naïve Bayes
• Results

Classification Using Vector Spaces

• In vector space classification, the training set corresponds to a labeled set of points (equivalently, vectors).
• Premise 1: Documents in the same class form a contiguous region of space.
• Premise 2: Documents from different classes don’t overlap (much).
• Learning a classifier: build surfaces to delineate classes in the space.

Documents in a Vector Space (Sec. 14.1)

[Figure: documents in a vector space, grouped into three classes: Government, Science, Arts]

Test Document of what class?

[Figure: the same three classes with an unlabeled test document]

Test Document = Government

Is this similarity hypothesis true in general?

Our focus: how to find good separators

Rocchio Classifier
• Training:
  § Use standard tf-idf weighted vectors to represent text documents.
  § For training documents in each category, compute a prototype vector by summing the vectors of the training documents in the category.
  § Prototype = centroid of members of class:

$$\vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d)$$

  § Where D_c is the set of all documents that belong to class c and v(d) is the vector space representation of d.
• Testing: assign test documents to the category with the closest prototype vector based on cosine similarity.
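The training and testing steps can be sketched in plain Python. The vectors below are hypothetical two-dimensional stand-ins for tf-idf vectors, and the function names are illustrative. The prototype is the plain sum of the class vectors; it is not divided by the class size because cosine similarity ignores vector length.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    denom = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / denom if denom else 0.0

def train_rocchio(vectors, labels):
    """Return one prototype per class: the sum of that class's training vectors."""
    prototypes = {}
    for vec, label in zip(vectors, labels):
        current = prototypes.get(label, [0.0] * len(vec))
        prototypes[label] = [c + x for c, x in zip(current, vec)]
    return prototypes

def classify_rocchio(prototypes, vec):
    """Assign the class whose prototype has the highest cosine similarity to vec."""
    return max(prototypes, key=lambda c: cosine(prototypes[c], vec))

# Hypothetical training data.
vectors = [[1.0, 0.0], [2.0, 0.5], [0.0, 1.0], [0.2, 3.0]]
labels = ["government", "government", "arts", "arts"]
prototypes = train_rocchio(vectors, labels)
print(classify_rocchio(prototypes, [3.0, 1.0]))
```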

Illustration of Rocchio Text Categorization

9/20/16


Rocchio Properties
• Forms a simple generalization of the examples in each class (a prototype).
• Prototype vector does not need to be averaged or otherwise normalized for length since cosine similarity is insensitive to vector length.
• Classification is based on similarity to class prototypes.
• Does not guarantee classifications are consistent with the given training data.

Why not?

may be problematic

Rocchio Anomaly
• Prototype models have problems with polymorphic (disjunctive) categories.

k Nearest Neighbor Classification
• kNN = k Nearest Neighbor
• It is a supervised learning method.
• Training: storing the representations of the training examples in D.
• Testing: classify a document d into class c:
  § Define the k-neighborhood N as the k nearest neighbors of d
  § Count the number of documents i in N that belong to c
  § Estimate P(c|d) as i/k
  § Choose as class argmax_c P(c|d) [ = majority class]
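The testing procedure can be sketched as follows. Cosine similarity is used to rank neighbors, which is one reasonable choice for text; the training vectors are hypothetical and `knn_classify` is an illustrative name.

```python
import math
from collections import Counter

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    denom = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / denom if denom else 0.0

def knn_classify(train, query, k):
    """train is a list of (vector, label). Returns (majority label, estimated P(c|d) = i/k)."""
    neighbors = sorted(train, key=lambda item: cosine(item[0], query), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    label, count = votes.most_common(1)[0]
    return label, count / k

# Hypothetical training set; "training" is just storing these pairs.
train = [
    ([1.0, 0.0], "science"),
    ([0.9, 0.1], "science"),
    ([0.8, 0.3], "science"),
    ([0.0, 1.0], "arts"),
    ([0.1, 0.9], "arts"),
]
print(knn_classify(train, [1.0, 0.2], k=3))
print(knn_classify(train, [1.0, 0.2], k=5))
```

With k equal to the size of the training set, every query receives the overall majority class, whatever its position in the space.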

Example: k=6 (6NN)

Government

Science

Arts

P(science| )?

Properties of kNN
• kNN performance is sensitive to k.
• When k=1?
  § Vulnerable to noise (i.e., an error) in the category label of a single training example.
• A more robust alternative is to find the k most-similar examples and return the majority category of these k examples.
• When k = the number of all documents?
• The value of k is typically odd to avoid ties; 3 and 5 are most common.
• Time complexity when testing: 𝑂(𝑛|𝑉|), where 𝑛 is the number of training documents and |𝑉| is the vocabulary size.

Properties of kNN
• The nearest neighbor method depends on a similarity (or distance) metric.
• Simplest for a continuous m-dimensional instance space is Euclidean distance.
• Simplest for an m-dimensional binary instance space is Hamming distance (number of feature values that differ).
• For text, cosine similarity of tf.idf weighted vectors is typically most effective.


Illustration of 3 Nearest Neighbor for Text Vector Space


kNN vs. Rocchio
• Nearest Neighbor tends to handle polymorphic categories better than Rocchio.

Why can kNN handle this?

Optimization for logistic regression

Likelihood function: Log-likelihood function:

Gradient of log-likelihood
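The formulas for these three quantities did not survive in this copy of the slides. For reference, the standard textbook forms for binary logistic regression are sketched below. The notation is an assumption, not necessarily the one used in the lecture: training pairs $(x_i, y_i)$ with $y_i \in \{0, 1\}$, weight vector $w$, and $\sigma(z) = 1/(1 + e^{-z})$.

```latex
% Likelihood of the training labels under the model
L(w) = \prod_{i} \sigma(w^\top x_i)^{y_i} \, \bigl(1 - \sigma(w^\top x_i)\bigr)^{1 - y_i}

% Log-likelihood
\ell(w) = \sum_{i} \Bigl[ y_i \log \sigma(w^\top x_i) + (1 - y_i) \log\bigl(1 - \sigma(w^\top x_i)\bigr) \Bigr]

% Gradient of the log-likelihood
\nabla_w \ell(w) = \sum_{i} \bigl( y_i - \sigma(w^\top x_i) \bigr) \, x_i
```

Gradient ascent then repeatedly updates $w \leftarrow w + \eta \nabla_w \ell(w)$ for some step size $\eta$.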

Linear classification
• Many common text classifiers are linear classifiers:
  § Naïve Bayes
  § Perceptron
  § Rocchio
  § Logistic regression
  § Support vector machines (with linear kernel)
  § Linear regression with threshold
• Despite this similarity, there are noticeable performance differences.
  § For separable problems, there is an infinite number of separating hyperplanes. Which one do you choose?
  § What to do for non-separable problems?
  § Different training methods pick different hyperplanes.
• Classifiers more powerful than linear often don’t perform better on text problems.

Linear classification
• Can find a separating hyperplane by linear programming (or can iteratively fit a solution via the perceptron):
  § The separator can be expressed as ax + by = c.

Find a, b, c such that
ax + by > c for red points,
ax + by < c for blue points.

Example linear text classifier
• Class: “interest” (as in interest rate)
• Example features of a linear classifier:

  w_i     t_i            w_i      t_i
  0.70    prime          −0.71    dlrs
  0.67    rate           −0.35    world
  0.63    interest       −0.33    sees
  0.60    rates          −0.25    year
  0.46    discount       −0.24    group
  0.43    bundesbank     −0.24    dlr

• To classify, find the dot product of the feature vector and the weights.
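Using the weights listed on this slide, the dot product can be computed as below. The feature vector is taken to be raw term counts and the decision threshold is taken to be 0; both are assumptions made for illustration, since the slide gives neither.

```python
# Weights for the class "interest", copied from the slide.
weights = {
    "prime": 0.70, "rate": 0.67, "interest": 0.63,
    "rates": 0.60, "discount": 0.46, "bundesbank": 0.43,
    "dlrs": -0.71, "world": -0.35, "sees": -0.33,
    "year": -0.25, "group": -0.24, "dlr": -0.24,
}

def score(tokens):
    """Dot product of the term-count vector with the weights (unknown terms weigh 0)."""
    return sum(weights.get(t, 0.0) for t in tokens)

def is_interest(tokens, threshold=0.0):
    # The threshold value is an assumption, not taken from the slide.
    return score(tokens) > threshold

print(score("prime rate discount".split()), is_interest("prime rate discount".split()))
print(score("world group year".split()), is_interest("world group year".split()))
```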

Which hyperplane?

• Lots of possible solutions for a, b, c.
• Some methods find a separating hyperplane, but not the optimal one [according to some criterion of expected goodness]
  § E.g., perceptron
• Most methods find an optimal separating hyperplane
• Which points should influence optimality?
  § All points
    - Linear/logistic regression
    - Naïve Bayes
  § Only “difficult points” close to the decision boundary
    - Support vector machines

Properties of text classification
• High dimensional data: thousands or millions of features, some relevant, many irrelevant.
• Documents are zero along almost all axes.
• Most document pairs are very far apart (i.e., not strictly orthogonal, but they only share very common words and a few scattered others).
• In classification terms: often document sets are separable, for almost any classification.
• This is part of why linear classifiers are quite successful in this domain.

More than one class

• Multi-labeled classification
  § A document can belong to 0, 1, or >1 classes.
  § Decompose into n binary problems.
  § Quite common for documents.
• Multi-class classification
  § Classes are mutually exclusive.
  § Each document belongs to exactly one class.
  § E.g., digit recognition. Digits are mutually exclusive.

One-vs-all classification
• Build a separator between each class and its complementary set (docs from all other classes).
• Given a test doc, evaluate it for membership in each class.
• Assign the document to the class with:
  § maximum score
  § maximum confidence
  § maximum probability

Text Mining

Clustering

Topics
• Clustering.
  § Motivation
  § Quality of clustering
• Clustering methods.
  § Flat clustering (K-means)

What is clustering
• Clustering: the process of grouping a set of objects into classes of similar objects.
  § Documents within a cluster should be similar.
  § Documents from different clusters should be dissimilar.
• The commonest form of unsupervised learning.
• Unsupervised learning = learning from raw data, as opposed to supervised data where a classification of examples is given.

A data set with clear cluster structure
• Grouping the following points into 3 groups (clusters), based on similarity/distance.

Motivating example
• Document clustering

Words having multiple meanings

Multiple meanings of the word “Cluster”; each meaning is represented by a set of documents.

Helping information retrieval
• Cluster hypothesis: documents in the same cluster behave similarly with respect to relevance to information needs.
• Therefore, to improve search recall:
  § Cluster docs in the corpus a priori.
  § When a query matches a doc D, also return other docs in the cluster containing D.
• Hope if we do this: the query “car” will also return docs containing “automobile”,
  § because clustering grouped together docs containing “car” with those containing “automobile”.

Cluster for “car” and “automobile”

Motivating example
• Word clustering
  § grouping words with similar topics together
  § 5 topics: shopping, tech, tagging, rdf, firefox

Issues of clustering
• How many clusters?
  § Do you know that before clustering?
  § Too many?
  § Too few?
• Which distance measure to adopt? E.g., cosine vs. Euclidean?

Categorization of clustering algorithms
• Based on methodology.
• Flat algorithms
  § Usually start with a random (partial) partitioning
  § Refine it iteratively
    - K means clustering
    - (Model based clustering)
• Hierarchical algorithms
  § Bottom-up, agglomerative
  § (Top-down, divisive)

Categorization of clustering algorithms
• Based on results.
• Hard clustering: each document belongs to exactly one cluster.
  § More common and easier to do.
• Soft clustering: a document can belong to more than one cluster.
  § Makes more sense for applications like creating browsable hierarchies.
  § You may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoes.
  § You can only do that with a soft clustering approach.

Partitioning Algorithms
• Partitioning method: construct a partition of n documents into a set of K clusters.
• Given: a set of documents and the number K.
• Find: a partition of K clusters that optimizes the chosen partitioning criterion.
  § Globally optimal solution: intractable for many objective functions, since it would require exhaustively enumerating all partitions.
  § Effective heuristic methods: K-means and K-medoids algorithms.

K-means
• Assumes documents are real-valued vectors.
• Clusters based on centroids (aka the center of gravity or mean) of points in a cluster, c:

$$\vec{\mu}(c) = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x}$$

• Reassignment of instances to clusters is based on distance to the current cluster centroids.
• (Or one can equivalently phrase it in terms of similarities)

K-Means Algorithm
• Select K random docs {s_1, s_2, …, s_K} as seeds.
• Until clustering converges (or other stopping criterion):
  § For each doc d_i:
    - Assign d_i to the cluster c_j such that dist(d_i, s_j) is minimal.
  § (Next, update the seeds to the centroid of each cluster)
  § For each cluster c_j:
    - s_j = µ(c_j)
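The loop above can be sketched in plain Python. To keep the example deterministic, the seeds are passed in explicitly rather than drawn at random, and the stopping rule is "assignment unchanged" with a cap on iterations; both are choices made for this sketch. Points and names are hypothetical.

```python
def mean(points):
    """Centroid: coordinate-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, seeds, max_iter=100):
    centroids = list(seeds)
    assignment = None
    for _ in range(max_iter):
        # Reassignment step: each point goes to its closest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:   # partition unchanged: converged
            break
        assignment = new_assignment
        # Recomputation step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                    # keep the old centroid if a cluster is empty
                centroids[j] = mean(members)
    return assignment, centroids

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
seeds = [(0, 0), (10, 10)]
assignment, centroids = kmeans(points, seeds)
print(assignment, centroids)
```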

A running example (K=2)

[Figure: K-means iterations on a small point set: pick seeds, reassign clusters, compute centroids, reassign clusters, compute centroids, reassign clusters, converged]

When to stop?
• Several possibilities, e.g.,
  § A fixed number of iterations.
  § Doc partition unchanged.
  § Centroid positions don’t change.
• Are the last two conditions the same?

Convergence
• Convergence?
• Why should the K-means algorithm ever reach a fixed point?
  § A state in which clusters don’t change.
• K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm.
  § EM is known to converge.
  § Number of iterations could be large.
  § But in practice usually isn’t!

Convergence
• Define the goodness measure of cluster k as the sum of squared distances from the cluster centroid:
  § G_k = Σ_i (d_i − c_k)^2 (sum over all d_i in cluster k)
  § G = Σ_k G_k
• Reassignment monotonically decreases G since each vector is assigned to the closest centroid.

Convergence of K-Means
• Recomputation monotonically decreases each G_k since (m_k is the number of members in cluster k):
  § Σ (d_i − a)^2 reaches its minimum for:
  § Σ −2(d_i − a) = 0
  § Σ d_i = Σ a
  § m_k a = Σ d_i
  § a = (1/m_k) Σ d_i = c_k
• K-means typically converges quickly.

Sensitivity to seed set selection
• Results can vary based on random seed selection.
• Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings.
  § Select good seeds using a heuristic (e.g., the doc least similar to any existing mean).
  § Try out multiple starting points.
  § Initialize with the results of another method.

Example showing sensitivity to seeds

[Figure: six points labeled A to F]

In the above, if you start with B and E as centroids, you converge to {A,B,C} and {D,E,F}. If you start with D and F, you converge to {A,B,D,E} and {C,F}.

How many clusters?
• Number of clusters K is given
  § Partition n docs into a predetermined number of clusters.
• Finding the “right” number of clusters is part of the problem
  § Given docs, partition into an “appropriate” number of subsets.
  § E.g., for Google News we know the number of clusters (sports, politics, finance).

Text Mining

Word Collocation

What is word collocation
• “an expression consisting of two or more words that correspond to some conventional way of saying things.”
• “Collocations of a given word are statements of the habitual or customary places of that word”
• Examples: “stiff breeze”, “strong tea”, “powerful drug”, “broad daylight”, “weapons of mass destruction”, “make up”, “check in”

What is word collocation
• (Choueka, 1988)

[A collocation is defined as] “a sequence of two or more consecutive words, that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components."

• Criteria:
  § non-compositionality
  § non-substitutability
  § non-modifiability
  § non-translatable word for word

Word collocation
• A phrase is compositional if its meaning can be predicted from the meaning of its parts.
• Collocations have limited compositionality:
  § there is usually an element of meaning added to the combination
  § Ex: strong tea
• Idioms are the most extreme examples of non-compositionality.
  § Ex: to hear it through the grapevine

Word collocation
• We cannot substitute near-synonyms for the components of a collocation.
  § “strong” is a near-synonym of “powerful”:
    - strong tea, ?powerful tea
  § “yellow” is as good a description of the color of white wines:
    - white wine, ?yellow wine
• Many collocations cannot be freely modified with additional lexical material or through grammatical transformations:
  § weapons of mass destruction --> ?weapons of massive destruction
  § to be fed up to the back teeth --> ?to be fed up to the teeth in the back

Types of collocations
• Verb particle/phrasal verb constructions
  § to go down, to check out, …
• Proper nouns
  § John Smith
• Terminological expressions
  § concepts and objects in technical domains
  § hydraulic oil filter
• Idioms
  § to hear it through the grapevine

Why study word collocation
• In natural language generation
  § The output should be natural
  § make a decision, ?take a decision
• In lexicography
  § Identify collocations to list them in a dictionary
  § To distinguish the usage of synonyms or near-synonyms
• In parsing
  § To give preference to the most natural attachments
  § plastic (can opener), ?(plastic can) opener
• In corpus linguistics and psycholinguistics
  § Ex: To study social attitudes towards different types of substances
  § strong cigarettes/tea/coffee
  § powerful drug

(Near-)Synonyms
• To determine if 2 words are synonyms, use the principle of substitutability:
  § 2 words are synonyms if they can be substituted for one another in some?/any? sentence without changing the meaning or acceptability of the sentence.
  § How big/large is this plane?
  § Would I be flying on a big/large or small plane?
  § Miss Nelson became a kind of big / ??large sister to Tom.
  § I think I made a big / ??large mistake.

Frequency based method
• Justeson and Katz’s filter
• Hypothesis:
  § if 2 words occur together very often, they must be interesting candidates for a collocation
• Method:
  § Select the most frequently occurring bigrams (sequences of 2 adjacent words)

Example
• Except for “New York”, all bigrams are pairs of function words.
• Need some additional information to filter these out.

Tag Pattern   Example
A N           linear function
N N           regression coefficient
A A N         Gaussian random variable
A N N         cumulative distribution function
N A N         mean squared error
N N N         class probability function
N P N         degrees of freedom

Example

• Based on POS tags and frequency• Simple method that works very well.

[(’The’, ’DT’), (’portfolio’, ’NN’), (’is’, ’VBZ’), (’fine’, ’JJ’), (’except’, ’IN’), (’for’, ’IN’), (’the’, ’DT’), (’fact’, ’NN’), (’that’, ’IN’), (’the’, ’DT’), (’last’, ’JJ’), (’movement’, ’NN’), (’of’, ’IN’), (’sonata’, ’NN’), (’#6’, ’CD’), (’is’, ’VBZ’), (’missing’, ’VBG’), (’.’, ’.’)]

“The portfolio is fine except for the fact that the last movement of sonata #6 is missing .”
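The tag-pattern filter can be tried on the tagged sentence above. In this sketch, the mapping from Penn Treebank tags to the coarse classes A (adjective), N (noun) and P (preposition) is an assumption, and the function names are illustrative. On a real corpus the surviving candidates would then be ranked by frequency; a single sentence only shows the filtering step.

```python
from collections import Counter

tagged = [("The", "DT"), ("portfolio", "NN"), ("is", "VBZ"), ("fine", "JJ"),
          ("except", "IN"), ("for", "IN"), ("the", "DT"), ("fact", "NN"),
          ("that", "IN"), ("the", "DT"), ("last", "JJ"), ("movement", "NN"),
          ("of", "IN"), ("sonata", "NN"), ("#6", "CD"), ("is", "VBZ"),
          ("missing", "VBG"), (".", ".")]

# Tag patterns from the table: A = adjective, N = noun, P = preposition.
PATTERNS = {"AN", "NN", "AAN", "ANN", "NAN", "NNN", "NPN"}

def coarse(tag):
    """Map a Penn Treebank tag to A, N, P, or '-' for everything else (assumed mapping)."""
    if tag.startswith("JJ"):
        return "A"
    if tag.startswith("NN"):
        return "N"
    if tag == "IN":
        return "P"
    return "-"

def candidates(tagged_tokens):
    """Count the bigrams and trigrams whose coarse tag sequence matches a pattern."""
    words = [w for w, _ in tagged_tokens]
    tags = "".join(coarse(t) for _, t in tagged_tokens)
    found = Counter()
    for n in (2, 3):
        for i in range(len(words) - n + 1):
            if tags[i:i + n] in PATTERNS:
                found[tuple(words[i:i + n])] += 1
    return found

print(candidates(tagged))
```

Function-word pairs such as “for the” or “that the” never match a pattern, so they drop out without needing a stop word list.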

Subtle difference between “strong” and “powerful”
• On a 14 million word corpus from the New York Times (Aug.-Nov. 1990)

n-grams

• Similar to word collocation, n-grams study the relationships between words.
• n-grams are predictive: use the previous n-1 words to predict the n-th word, thinking in a probabilistic way.
• Examples:
  § Tri-gram: use the first two words to predict the third.
    “large green ___________”
    tree? mountain? frog? car?
  § 5-gram:
    “swallowed the large green ________”
    pill? broccoli?

Why n-gram?
• Compute the probability of a sentence:
  § Pr(“This is a valid one”) >> Pr(“ate top an in and”)
• Assume bi-gram

Applications
• Machine translation, OCR (optical character recognition), speech recognition, natural language generation.
• Pick the most likely sentence.

What’s the best n?
• Reliability vs. discrimination
  § “large green ___________”
    tree? mountain? frog? car?
  § “swallowed the large green ________”
    pill? broccoli?
• Larger n: more information about the context of the specific instance (greater discrimination).
• Smaller n: more instances in training data, better statistical estimates (more reliability).

9/27/16


Example


• Estimate what comes next after “monstrous”• Use the concordance function in NLTK on a corpus:

Estimate the probabilities
• Maximum likelihood estimate with a multinomial probabilistic model, based on word counts: C(w1,w2)/C(w1)
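The estimate can be sketched with two counters. One detail is a choice made here: C(w1) counts w1 only where a next word exists (every position except the last), so that the probabilities for each context sum to 1. The token list is a hypothetical toy example.

```python
from collections import Counter

def bigram_mle(tokens):
    """Maximum likelihood estimate P(w2 | w1) = C(w1, w2) / C(w1)."""
    contexts = Counter(tokens[:-1])              # C(w1), counted where a next word exists
    bigrams = Counter(zip(tokens, tokens[1:]))   # C(w1, w2)
    return {(w1, w2): c / contexts[w1] for (w1, w2), c in bigrams.items()}

tokens = "the cat sat on the mat".split()
probs = bigram_mle(tokens)
print(probs[("the", "cat")], probs[("the", "mat")], probs[("cat", "sat")])

# Any bigram never seen in training gets probability zero under MLE.
print(probs.get(("the", "dog"), 0.0))
```

The zero for unseen bigrams is the weakness of the plain maximum likelihood estimate: most word pairs never occur in a finite corpus.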


Too few non-zeros.

Too many zeros: this is not the true distribution!
