Text Mining
Research Computing Center
Sachin Joshi
8/27/16
Logistics
• Course outcomes:
  • Python programming proficiency (to the extent of writing stand-alone programs);
  • Linguistics basics;
  • Mining algorithms in real-world applications;
  • The ability to tackle more challenging NLP and text mining problems;
  • Awareness of the value locked up in text data;
  • Both computation- and data-driven thinking (good for big data and analytics jobs).
• Some running and cool projects to brag about!
CSE398/498 7
What is text mining
Text data
• Tweets
• Reviews
• Government reports
• Scientific papers
• News
• Books
• Messages
Non-text data
• Images
• Videos
• Temperatures
• Time series
• Location
• Graph/networks
Data Mining (Data Science)
What is text mining
[Figure: Real world, Human beings, Text data]
Mining as reverse engineering
• Infer what the real world is;
• Infer what the human beings are thinking;
• Infer the language itself.
Decision making!
What is text mining
• A more practical view
Computing Modeling Linguistics
Real world
• Example
Attractive Nakiri.(double bevel) I have been using Shun Classic Santokus and utility knives for almost everything so I felt a need to try something "new." This was it. This blade is quite thin and based on the specs it is made of a good quality steel. It isn't at the very top in terms of hardness, but I hope that also means that it will be less brittle and potentially easier to sharpen.
Attractive
Blade -> Thin
Steel -> Good quality
Blade -> Less brittle
Blade -> Easy to sharpen
Text data -> Knowledge
What is text mining
Prob(buying) = 95%
Decision
Text mining vs. other approaches
• Data Science: focuses on data processing, with simple mining models.
• Data Mining: focuses on general mining techniques for more general data formats.
• AI: a general area, providing some techniques for text mining.
• Database: focuses on structured data, while text data are highly unstructured.
• Traditional NLP: some techniques can be used for text mining, but it focuses more on text analysis (beat a sentence to death).
Why text mining
• Texts are everywhere!
• Texts have valuable but hidden knowledge.
• Many useful and real-world applications
  • Stock market (deep)
  • Customer survey (Amazon.com)
  • Policy and government (opengov.com)
  • Question answering systems (Baidu’s medical QA)
  • Many more …
Stock market prediction
• Predict whether a stock will go up or down using opinions expressed in public forums and news.
Entities: companies, countries, people, etc.
Opinions: positive, negative, neutral, etc.
Customer relation management
• Know what the customers like and don’t like about a product.
• Possibly recommend alternative products.
• Sway customers’ opinions via incentives.
• Retain leaving customers.
Scientific literature management
• Categorization of publications;
• Information retrieval;
• Discovering scientific hypotheses;
• Influential paper discovery;
• Trending topics for research
Question answering
How to do text mining
• Tools:
  • Computers: store and process big text data. (Programming and data structures.)
  • Linguistics: human knowledge about syntax, semantics, etc. (of English).
  • Statistical and machine learning models: infer the hidden knowledge about the real world in the text data. (Calculus, linear algebra, probability and statistics.)
Text Mining
Introduction
Outline
• Sentence segmentation.
• Word tokenization.
• Word normalization:
  § case folding
  § lemmatization
  § stemming.
• Text representations.
A big picture
• Decide the tokens/vocabulary of your corpus.
• Further tasks:
  § word collocation (phrase level)
  § classification, clustering, topic models (document level)
  § syntax parsing (sentence level)
  § semantic analysis: entity resolution, relation detection (all levels)
  § sentiment analysis (all levels).
Text processing pipeline
• Get the tokens!
[Figure: pipeline from tokenization & normalization to the vector space model and the sequential model, which feed classification, clustering, topic modeling, and indexing and IR]
Sentence segmentation
• Why: we sometimes want to study properties of sentences.
• What are the boundaries between sentences? Punctuation marks, but many of them are ambiguous:
  § Question mark (?)
  § Exclamation mark (!)
  § Semicolon (;)
  § Period (.): 500.00 dollars, Ph.D.
  § Comma (,): 10,500 dollars
  § Quotation mark (“): “Bye”, I said
  § Ampersand (&): AT&T, Barnes & Noble
• More advanced methods are based on machine learning: check the surroundings of a punctuation mark to decide whether it is a boundary or not.
• NLTK uses the Punkt sentence segmenter.
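The ambiguity of the period can be illustrated with a small rule-based splitter. This is only a sketch using the standard library, not the Punkt algorithm that NLTK uses; the abbreviation list and the function name `split_sentences` are assumptions made for illustration.

```python
import re

# Hypothetical, hand-written abbreviation list (an assumption for this sketch);
# Punkt instead learns abbreviations from a corpus.
ABBREVIATIONS = {"ph.d", "dr", "mr", "mrs", "e.g", "i.e"}

def split_sentences(text):
    """Split after ., ? or ! when followed by whitespace and an uppercase
    letter or opening quote, unless the period ends a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r'[.?!](?=\s+[A-Z"“])', text):
        end = match.end()
        last_token = text[start:end].split()[-1].rstrip(".?!").lower()
        if match.group() == "." and last_token in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, not a boundary
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

example = "It costs 500.00 dollars. Dr. Smith has a Ph.D. in NLP! Really?"
print(split_sentences(example))
```

The period in "500.00" is never a candidate because it is not followed by whitespace, while "Dr." is a candidate that the abbreviation list rejects. A learned model would replace both hand-written rules.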
Word tokenization
• A sequence of characters -> a sequence of meaningful tokens.
• Example:
[Figure: tokenization example from IIR]
[Figure: tokenization example from FSNLP]
Notice their difference?
Tokenization
• How to define a valid token is task-dependent.
• A simple space separator is not enough: “San” and “Francisco”? “Mar 2015”?
• Do we care about phrases? “San Francisco”? “New York”?
• Special characters like “$10”? Or hashtags on Twitter such as “#LU”?
• Apostrophes (’): “doesn’t” or “does” and “n’t”? In sentiment analysis, negation is informative. “rock ‘n’ roll”, “Tom’s place”
  § Commas: “100,000 dollars” or (“100” “000” “dollars”)
  § Hyphens: “soon-to-be” or (“soon” “to” “be”), “Hewlett-Packard”
  § Email addresses, dates, URLs (usually treated separately).
  § In practice, no tokenization is perfect.
  § Instead, it is usually implemented as fast automata programmed via regular expressions.
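A regular-expression tokenizer along the lines described above might look like the following sketch. The pattern is an illustrative assumption, not the rule set of IIR, FSNLP, or NLTK; it keeps prices, hashtags, hyphenated words, and words with apostrophes as single tokens.

```python
import re

# Alternatives are tried left to right, so the more specific patterns come first.
TOKEN_PATTERN = re.compile(r"""
    https?://\S+                      # URLs
  | [\w.+-]+@[\w-]+(?:\.[\w-]+)+      # email addresses
  | \$?\d+(?:[.,]\d+)*                # numbers and prices: 10,500 or $10
  | \#\w+                             # hashtags such as #LU
  | \w+(?:[-'’]\w+)*                  # words with internal hyphens or apostrophes
  | [^\w\s]                           # any other single punctuation mark
""", re.VERBOSE)

def tokenize(text):
    """Return the list of tokens matched by TOKEN_PATTERN, in order."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("Tom's soon-to-be $10 #LU doesn't cost 10,500 dollars."))
```

Whether "doesn't" should stay whole or be split into "does" and "n't" is exactly the task-dependent choice raised above; this sketch keeps it whole.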
Stop word removal
• Stop word list
• To remove or not to remove, that’s the question:
  • Some stop words carry little meaning: “the”, “a”, “for”
  • Stop word removal can reduce data size.
  • But stop words are critical elements in syntax and semantics. NLP usually keeps the stop words to facilitate the analysis of a whole sentence.
Word normalization
• After tokenization, we may have two words that can belong to the same class.
• Turn multiple words into a single class (may be incorrect).
• Examples:
  • “is” and “was” may be considered equivalent.
  • “USA” and “U.S.A” have the same meaning.
• There are various kinds of word normalization:
  § case folding.
  § lemmatization.
  § stemming.
  § semantic links (“auto” and “car”).
Case folding
• Usually we want to lower-case the capital letters at the beginning of a sentence.
• Example: “He went to church” -> “he”, “went”, “to”, “church”
• Counter-examples: “Kennedy was shot”? “USA” -> “usa”?
• Case can be informative.
  • “US” -> “us”: country name vs. a pronoun, big loss of information.
  • “C.A.T” -> “cat”: company name vs. an animal
• Rule-based: only lower-case the first letter of a sentence and all words in titles, leaving other things untouched.
• Machine learning: sequence model with rich features.
Lemmatization
• Lemma -> lemmatization
• A lemma is a major entry or base form in an English dictionary.
• Examples:
  § “is”, “are”, “were” share the lemma “be”.
  § “dinner” and “dinners” share the lemma “dinner”
“He is reading detective stories.”
“He be read detective story.”
Some information is lost.
Lemmatization
• More formally, we want to break a word into a few parts to recover the most basic component in the word (morphology).
• A word consists of morphemes /ˈmɔːrfiːmz/
  § stem morphemes: the basic meaning of the word;
  § affix morphemes: added meanings
  § Example: “dog” -> “dog”, “cats” -> “cat” + “s”
  § “organization” -> “organize” -> “organ”
• Morphological parsing is the technical term for this word-breaking process
Stemming (Porter Stemmer)
• A simple but crude rule-based alternative to lemmatization.
• A word is passed through the stemmer multiple times, with the output of the previous pass as the input to the current pass.
• Can have a lot of errors:
  • “organization” -> “organize” -> “organ”
  • “noisy” -> “noise”
[Figure: a small test]
Example of different stemmers
Stemming vs. lemmatization
• Stemming is crude and rule-based.
• Lemmatization involves a dictionary and morphological analysis of words (requiring more linguistic knowledge).
• Example:
  • Stemming: “saw” -> “s”
  • Lemmatization: “saw” -> “see” (verb) or “saw” (noun)
• When is stemming useful and harmful?
• Examples?
Text representation
• Vector space (or bag-of-words) models:
  § Word order does not matter (or is lost).
  § Boolean.
  § Term-frequency.
  § Term-frequency inverse-document-frequency.
  All linear algebra concepts and operations hold in this vector space.
• Sequences of tokens (after text pre-processing).
  • Word order matters and shall be modeled.
  Requires a set of math tools beyond linear algebra.
A corpus as a Boolean matrix
(rows: tokens; columns: documents)
Token | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth
Antony | 1 | 1 | 0 | 0 | 0 | 1
Brutus | 1 | 1 | 0 | 1 | 0 | 0
Caesar | 1 | 1 | 0 | 1 | 1 | 1
Calpurnia | 0 | 1 | 0 | 0 | 0 | 0
Cleopatra | 1 | 0 | 0 | 0 | 0 | 0
mercy | 1 | 0 | 1 | 1 | 1 | 1
worser | 1 | 0 | 1 | 1 | 1 | 0
Each document is represented by a binary vector ∈ {0,1}^|V|
Issues with such a text representation? Think about finding relevant docs using the keyword “Antony”.
Term frequency matrix
• Consider the number of occurrences of a term in a document:
• Each document is a count vector in ℕ^|V|: a column below
Token | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth
Antony | 157 | 73 | 0 | 0 | 0 | 0
Brutus | 4 | 157 | 0 | 1 | 0 | 0
Caesar | 232 | 227 | 0 | 2 | 1 | 1
Calpurnia | 0 | 10 | 0 | 0 | 0 | 0
Cleopatra | 57 | 0 | 0 | 0 | 0 | 0
mercy | 2 | 0 | 3 | 5 | 5 | 1
worser | 2 | 0 | 1 | 1 | 1 | 0
Relevance information is now better preserved: try to find docs containing “Antony”.
Usually we take the log of the frequencies to avoid scaling problems.
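As a rough illustration of the Boolean and term-frequency representations, the sketch below builds both kinds of vector for a toy corpus. The two documents and the helper names are made up for this example; they are not the Shakespeare counts from the slide.

```python
from collections import Counter

# Toy corpus (an assumption for illustration only).
docs = {
    "d1": "antony and brutus praise caesar caesar",
    "d2": "brutus and calpurnia",
}

# The vocabulary fixes the order of the vector dimensions.
vocab = sorted({word for text in docs.values() for word in text.split()})

def count_vector(text):
    """Term-frequency vector: number of occurrences of each vocabulary term."""
    counts = Counter(text.split())
    return [counts[term] for term in vocab]

def boolean_vector(text):
    """Boolean vector: 1 if the term occurs at least once, else 0."""
    return [1 if n > 0 else 0 for n in count_vector(text)]

print(vocab)
for name, text in docs.items():
    print(name, boolean_vector(text), count_vector(text))
```

The Boolean vector for d1 cannot tell that "caesar" occurs twice, which is the information the count vector preserves.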
Bag of words model
• Vector representation doesn’t consider the ordering of words in a document
• “John is quicker than Mary” and “Mary is quicker than John” have the same vectors
• This is called the bag of words model.
• In a sense, this is a step back: the positional index was able to distinguish these two documents.
Document frequency
• Rare terms are more informative than frequent terms
  • Recall stop words
• Consider a term (e.g., “Calpurnia”) in the query that is rare in the collection.
Token | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth
Antony | 157 | 73 | 0 | 0 | 0 | 0
Brutus | 4 | 157 | 0 | 1 | 0 | 0
Caesar | 232 | 227 | 0 | 2 | 1 | 1
Calpurnia | 0 | 10 | 0 | 0 | 0 | 0
Cleopatra | 57 | 0 | 0 | 0 | 0 | 0
mercy | 2 | 0 | 3 | 5 | 5 | 1
worser | 2 | 0 | 1 | 1 | 1 | 0
Think about finding docs relevant to the query “Caesar” & “Calpurnia”.
Document frequency
• Frequent terms are less informative than rare terms
• Consider a query term that is frequent in the collection (e.g., high, increase, line)
• A document containing such a term is more likely to be relevant than a document that doesn’t
• But it’s not a sure indicator of relevance.
• → For frequent terms like high, increase, and line, we still want positive weights
• But, given equal term frequencies, we want lower weights for them than for rare terms.
• We will use document frequency (df) to capture this.
idf weight
• df_t is the document frequency of t: the number of documents that contain t
  • df_t is an inverse measure of the informativeness of t
  • df_t ≤ N
• We define the idf (inverse document frequency) of t by
$\mathrm{idf}_t = \log_{10}(N / \mathrm{df}_t)$
• We use log(N/df_t) instead of N/df_t to “dampen” the effect of idf. (Think about N = 1M with df = 100 and df = 10.)
idf example, suppose N = 1 million
• There is one idf value for each term t in a collection.
$\mathrm{idf}_t = \log_{10}(N / \mathrm{df}_t)$
term | df_t | idf_t
calpurnia | 1 | 6
animal | 100 | 4
sunday | 1,000 | 3
fly | 10,000 | 2
under | 100,000 | 1
the | 1,000,000 | 0
tf-idf weighting
• The tf-idf weight of a term is the product of its tf weight and its idf weight.
$w_{t,d} = \log(1 + \mathrm{tf}_{t,d}) \times \log_{10}(N / \mathrm{df}_t)$
• Best known weighting scheme in information retrieval
  • Note: the “-” in tf-idf is a hyphen, not a minus sign!
  • Alternative names: tf.idf, tf x idf
• Increases with the number of occurrences within a document
• Increases with the rarity of the term in the collection
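The idf and tf-idf formulas can be checked numerically with a few lines of Python. This is a minimal sketch: the slide leaves the base of the tf logarithm unspecified, so base 10 is assumed here, and the function names are illustrative.

```python
import math

def idf(n_docs, df):
    """Inverse document frequency: log10(N / df_t)."""
    return math.log10(n_docs / df)

def tf_idf(tf, n_docs, df):
    """tf-idf weight: log(1 + tf) * log10(N / df_t).
    Assumption: the tf logarithm is also taken in base 10."""
    return math.log10(1 + tf) * idf(n_docs, df)

N = 1_000_000
for term, df in [("calpurnia", 1), ("animal", 100), ("sunday", 1_000),
                 ("fly", 10_000), ("under", 100_000), ("the", 1_000_000)]:
    print(f"{term:10s} df={df:>9,d} idf={idf(N, df):.1f}")

# A term that occurs 9 times in a document and appears in 1,000 of N documents.
print(tf_idf(9, N, 1_000))
```

A term that appears in every document gets idf 0, so its tf-idf weight is 0 no matter how often it occurs in a given document.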
Binary → count → weight matrix
Token | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth
Antony | 5.25 | 3.18 | 0 | 0 | 0 | 0.35
Brutus | 1.21 | 6.1 | 0 | 1 | 0 | 0
Caesar | 8.59 | 2.54 | 0 | 1.51 | 0.25 | 0
Calpurnia | 0 | 1.54 | 0 | 0 | 0 | 0
Cleopatra | 2.85 | 0 | 0 | 0 | 0 | 0
mercy | 1.51 | 0 | 1.9 | 0.12 | 5.25 | 0.88
worser | 1.37 | 0 | 0.11 | 4.15 | 0.25 | 1.95
Each document is now represented by a real-valued vector of tf-idf weights ∈ ℝ^|V|
Documents as vectors
• So we have a |V|-dimensional vector space
• Terms are axes of the space
• Documents are points or vectors in this space
• Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine
• These are very sparse vectors - most entries are zero.
Documents in vector space with two terms
[Figure: documents plotted in a two-dimensional space whose axes are the terms “Finance” and “Great”]
All concepts and operations in linear algebra apply
• Distance between two vectors.
• Angle between two vectors (cosine similarity).
Distance vs. similarity
Notation: d_i and q are both documents.
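The two measures named above can be sketched as follows, using the standard definitions of Euclidean distance and cosine similarity. The toy vectors `d` and `q` are assumptions for illustration.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# q is d repeated twice: same direction, different length.
d = [1.0, 2.0]
q = [2.0, 4.0]
print(euclidean_distance(d, q))
print(cosine_similarity(d, q))
```

The two vectors are a nonzero distance apart, yet their cosine similarity is 1 (up to rounding), which is why cosine is often preferred when document length should not matter.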
Summary
• Low-level text processing.
• Bag-of-words or vector space text representations.
• Distance/similarity measures.
• Coming up:
  • Classification based on the vector space of terms.
Text Mining
Text Classification
Text classification
• Why text classification
  § Spam detection;
  § Finding relevant documents;
  § Sentiment analysis
Formulation (Supervised learning)
§ Given:
  § A document d (usually in a vector space).
  § A fixed set of classes: C = {c_1, c_2, …, c_J}
  § A training set D of documents, each with a label in C
§ Determine:
  § A learning method or algorithm which will enable us to learn a classifier f
  § For a test document d, we assign it the class f(d) ∈ C
Classifiers
§ Supervised learning
  § Naive Bayes (simple, common).
  § k-Nearest Neighbors (simple, powerful)
  § Support-vector machines (new, generally more powerful)
  § … plus many other methods
  § No free lunch: requires hand-classified training data
  § But data can be built up (and refined) by amateurs
§ Many commercial systems use a mixture of methods
Classification using bag-of-words
f( “I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet.” ) = c
Classification using bag-of-words
f( word counts ) = c
word | count
great | 2
love | 2
recommend | 1
laugh | 1
happy | 1
... | ...
Features
§ Features = axes in the vector space.
§ Supervised learning classifiers can use any sort of feature
  § URL, email address, punctuation, capitalization, dictionaries, network features
§ In the bag of words view of documents
  § We use only word features
  § We use all of the words (vocabulary) in the text (not a subset)
Feature Selection: Why?
§ Text collections have a large number of features
  § 10,000 – 1,000,000 unique words … and more
Feature Selection: Why?
§ Selection may make a particular classifier feasible
  § Some classifiers can’t deal with 1,000,000 features
§ Reduces training time
  § Training time for some methods is quadratic or worse in the number of features
§ Makes runtime models smaller and faster
§ Can improve generalization (performance)
  § Eliminates noise features
  § Avoids overfitting
Evaluating
§ Evaluation must be done on test data that are independent of the training data
  § Sometimes use cross-validation (averaging results over multiple training and test splits of the overall data)
§ Easy to get good performance on a test set that was available to the learner during training (e.g., just memorize the test set)
§ Measures: precision, recall, F1, classification accuracy
§ Classification accuracy: r/n, where n is the total number of test docs and r is the number of test docs correctly classified
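The measures listed above can be computed from gold and predicted labels as in the sketch below. It uses the standard definitions for a single positive class; the labels and the function name `evaluate` are illustrative assumptions.

```python
def evaluate(gold, predicted, positive):
    """Return precision, recall, F1 (for the `positive` class) and accuracy r/n."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    correct = sum(1 for g, p in pairs if g == p)  # r in the slide's notation

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = correct / len(pairs)               # r / n
    return precision, recall, f1, accuracy

gold = ["spam", "spam", "ham", "ham", "spam"]
pred = ["spam", "ham", "ham", "spam", "spam"]
print(evaluate(gold, pred, positive="spam"))
```

These numbers are only meaningful when `gold` and `pred` come from test documents that the classifier did not see during training.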
A running example
§ Classify webpages from CS departments into:
  § student, faculty, course, project
§ Train on ~5,000 hand-labeled web pages
  § Cornell, Washington, U.Texas, Wisconsin
§ Crawl and classify a new site (CMU) using Naïve Bayes
§ Results
Classification Using Vector Spaces
§ In vector space classification, the training set corresponds to a labeled set of points (equivalently, vectors)
§ Premise 1: Documents in the same class form a contiguous region of space
§ Premise 2: Documents from different classes don’t overlap (much)
§ Learning a classifier: build surfaces to delineate classes in the space
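Documents in a Vector Space
[Figure: documents of three classes (Government, Science, Arts) plotted as points in a vector space] (Sec. 14.1)
Test Document of what class?
[Figure: the same space with an unlabeled test document]
Test Document = Government
[Figure: the test document falls in the Government region]
Is this similarity hypothesis true in general?
Our focus: how to find good separators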
Rocchio Classifier
• Training:
  • Use standard tf-idf weighted vectors to represent text documents
  • For training documents in each category, compute a prototype vector by summing the vectors of the training documents in the category.
  • Prototype = centroid of members of class:
$\vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d)$
  • where D_c is the set of all documents that belong to class c and v(d) is the vector space representation of d.
• Testing: assign test documents to the category with the closest prototype vector based on cosine similarity.
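The training and testing steps above can be sketched in plain Python as follows. The two-dimensional vectors and class names are toy assumptions standing in for tf-idf vectors, and the function names are illustrative rather than taken from any library.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def train_rocchio(vectors, labels):
    """Compute one prototype (centroid) per class."""
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1
    return {c: [s / counts[c] for s in total] for c, total in sums.items()}

def classify_rocchio(prototypes, vec):
    """Assign the class whose prototype is most cosine-similar to vec."""
    return max(prototypes, key=lambda c: cosine_similarity(prototypes[c], vec))

train_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8]]
train_labels = ["finance", "finance", "sports", "sports"]
prototypes = train_rocchio(train_vectors, train_labels)
print(prototypes)
print(classify_rocchio(prototypes, [0.7, 0.3]))
```

Dividing by the class size is optional when cosine similarity is used at test time, since cosine ignores vector length; it is kept here so the prototype matches the centroid definition.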
Illustration of Rocchio Text Categorization
9/20/16
Rocchio Properties
• Forms a simple generalization of the examples in each class (a prototype).
• Prototype vector does not need to be averaged or otherwise normalized for length since cosine similarity is insensitive to vector length.
• Classification is based on similarity to class prototypes.
• Does not guarantee classifications are consistent with the given training data.
Why not?
may be problematic
Rocchio Anomaly
• Prototype models have problems with polymorphic (disjunctive) categories.
k Nearest Neighbor Classification
• kNN = k Nearest Neighbor
• It is a supervised learning method.
• Training: store the representations of the training examples in D.
• Testing: classify a document d into class c:
  § Define the k-neighborhood N as the k nearest neighbors of d
  § Count the number of documents i in N that belong to c
  § Estimate P(c|d) as i/k
  § Choose as class argmax_c P(c|d) [ = majority class]
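The four testing steps can be sketched as below, using cosine similarity to find the neighbors. The training vectors, labels, and the function name `knn_classify` are toy assumptions for illustration.

```python
import math
from collections import Counter

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_classify(train, d, k):
    """train is a list of (vector, label) pairs.
    Returns the majority class and the estimates P(c|d) = i/k."""
    # Step 1: the k-neighborhood N = the k training docs most similar to d.
    neighbors = sorted(train, key=lambda item: cosine_similarity(item[0], d),
                       reverse=True)[:k]
    # Step 2: count the documents in N that belong to each class.
    votes = Counter(label for _, label in neighbors)
    # Step 3: estimate P(c|d) as i/k.
    probs = {c: n / k for c, n in votes.items()}
    # Step 4: choose argmax_c P(c|d), i.e. the majority class.
    return max(probs, key=probs.get), probs

train = [([1.0, 0.0], "science"), ([0.9, 0.2], "science"), ([0.8, 0.3], "science"),
         ([0.0, 1.0], "arts"), ([0.1, 1.0], "arts")]
print(knn_classify(train, [1.0, 0.1], k=3))
print(knn_classify(train, [1.0, 0.1], k=5))
```

Every test document is compared against all n training documents, which is where the O(n|V|) test-time cost comes from.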
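Example: k=6 (6NN)
[Figure: documents of three classes (Government, Science, Arts) with a test document and its 6 nearest neighbors]
P(science | test document)?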
Properties of kNN
• kNN performance is sensitive to k.
• When k=1?
  • The prediction is vulnerable to noise (i.e., an error) in the category label of a single training example.
• A more robust alternative is to find the k most similar examples and return the majority category of these k examples.
• When k = the number of all documents?
• The value of k is typically odd to avoid ties; 3 and 5 are most common.
• Time complexity when testing: O(n|V|), where n is the number of training documents and |V| is the vocabulary size.
Properties of kNN
• The nearest neighbor method depends on a similarity (or distance) metric.
• Simplest for a continuous m-dimensional instance space is Euclidean distance.
• Simplest for an m-dimensional binary instance space is Hamming distance (number of feature values that differ).
• For text, cosine similarity of tf.idf weighted vectors is typically most effective.
Illustration of 3 Nearest Neighbor for Text Vector Space
kNN vs. Rocchio
• Nearest Neighbor tends to handle polymorphic categories better than Rocchio.
Why can kNN handle this?
Optimization for logistic regression
Likelihood function: Log-likelihood function:
Gradient of log-likelihood
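The formulas for this slide did not survive extraction. The block below gives the standard textbook forms for binary logistic regression as a sketch, not necessarily the exact notation used in the lecture. It assumes labels $y_i \in \{0, 1\}$, feature vectors $\vec{x}_i$, a weight vector $\vec{w}$, and the sigmoid $\sigma(z) = 1/(1 + e^{-z})$; all of these symbols are assumptions.

```latex
% Likelihood of the training set under the model P(y_i = 1 | x_i) = \sigma(w \cdot x_i)
L(\vec{w}) = \prod_{i=1}^{n} \sigma(\vec{w} \cdot \vec{x}_i)^{y_i}
             \bigl(1 - \sigma(\vec{w} \cdot \vec{x}_i)\bigr)^{1 - y_i}

% Log-likelihood
\ell(\vec{w}) = \sum_{i=1}^{n} \Bigl[ y_i \log \sigma(\vec{w} \cdot \vec{x}_i)
                + (1 - y_i) \log \bigl(1 - \sigma(\vec{w} \cdot \vec{x}_i)\bigr) \Bigr]

% Gradient of the log-likelihood, using \sigma'(z) = \sigma(z)(1 - \sigma(z))
\nabla_{\vec{w}} \, \ell(\vec{w}) = \sum_{i=1}^{n}
                \bigl( y_i - \sigma(\vec{w} \cdot \vec{x}_i) \bigr) \, \vec{x}_i
```

Gradient ascent then repeatedly updates $\vec{w} \leftarrow \vec{w} + \eta \nabla_{\vec{w}} \ell(\vec{w})$ for some step size $\eta$.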
Linear classification
• Many common text classifiers are linear classifiers
  • Naïve Bayes
  • Perceptron
  • Rocchio
  • Logistic regression
  • Support vector machines (with linear kernel)
  • Linear regression with threshold
• Despite this similarity, noticeable performance differences
  • For separable problems, there is an infinite number of separating hyperplanes. Which one do you choose?
  • What to do for non-separable problems?
  • Different training methods pick different hyperplanes
• Classifiers more powerful than linear often don’t perform better on text problems.
Linear classification
• Can find a separating hyperplane by linear programming (or can iteratively fit a solution via perceptron):
  • the separator can be expressed as ax + by = c
Find a, b, c such that
ax + by > c for red points
ax + by < c for blue points.
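The iterative (perceptron) route to finding a, b, c can be sketched as follows. The encoding of red points as +1 and blue points as -1, the learning rate, and the toy points are assumptions for illustration.

```python
def train_perceptron(points, labels, epochs=100, lr=1.0):
    """Find a, b, c with a*x + b*y > c for label +1 (red)
    and a*x + b*y < c for label -1 (blue), if the data are separable."""
    a = b = c = 0.0
    for _ in range(epochs):
        errors = 0
        for (x, y), label in zip(points, labels):
            score = a * x + b * y - c
            if label * score <= 0:        # misclassified or on the boundary
                a += lr * label * x       # move the separator towards
                b += lr * label * y       # classifying this point correctly
                c -= lr * label
                errors += 1
        if errors == 0:                   # every point is on the right side
            break
    return a, b, c

red = [(2, 2), (3, 3), (2, 3)]
blue = [(0, 0), (1, 0), (0, 1)]
points = red + blue
labels = [1] * len(red) + [-1] * len(blue)

a, b, c = train_perceptron(points, labels)
print(a, b, c)
```

The perceptron stops at the first separator it finds, so it returns a separating hyperplane but not one that is optimal under any particular criterion.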
Example linear text classifier
• Class: “interest” (as in interest rate)
• Example features of a linear classifier (weight w_i, term t_i):
w_i | t_i | w_i | t_i
0.70 | prime | −0.71 | dlrs
0.67 | rate | −0.35 | world
0.63 | interest | −0.33 | sees
0.60 | rates | −0.25 | year
0.46 | discount | −0.24 | group
0.43 | bundesbank | −0.24 | dlr
• To classify, find the dot product of the feature vector and the weights
Which hyperplane?
• Lots of possible solutions for a, b, c.
• Some methods find a separating hyperplane, but not the optimal one [according to some criterion of expected goodness]
  • E.g., perceptron
• Most methods find an optimal separating hyperplane
• Which points should influence optimality?
  • All points
    • Linear/logistic regression
    • Naïve Bayes
  • Only “difficult points” close to the decision boundary
    • Support vector machines
Properties of text classification
• High dimensional data: thousands or millions of features, some relevant, many irrelevant
• Documents are zero along almost all axes
• Most document pairs are very far apart (i.e., not strictly orthogonal, but only share very common words and a few scattered others)
• In classification terms: often document sets are separable, for most any classification
• This is part of why linear classifiers are quite successful in this domain
More than one class
• Multi-labeled classification
  • A document can belong to 0, 1, or >1 classes.
  • Decompose into n binary problems
  • Quite common for documents
• Multi-class classification
  • Classes are mutually exclusive.
  • Each document belongs to exactly one class
  • E.g., digit recognition. Digits are mutually exclusive
One-vs-all classification
• Build a separator between each class and its complementary set (docs from all other classes).
• Given a test doc, evaluate it for membership in each class.
• Assign the document to the class with:
  • maximum score(s)
  • maximum confidence(s)
  • maximum probability (probabilities)
Text Mining
Clustering
Topics
• Clustering.
  • Motivation
  • Quality of clustering
• Clustering methods.
  • Flat clustering (K-means)
What is clustering
• Clustering: the process of grouping a set of objects into classes of similar objects
  • Documents within a cluster should be similar.
  • Documents from different clusters should be dissimilar.
• The commonest form of unsupervised learning
  • Unsupervised learning = learning from raw data, as opposed to supervised data where a classification of examples is given
A data set with clear cluster structure
• Group the following points into 3 groups (clusters), based on similarity/distance.
Motivating example
• Document clustering
Words having multiple meanings
Multiple meanings of the word “Cluster”; each meaning is represented by a set of documents
Helping information retrieval
• Cluster hypothesis - Documents in the same cluster behave similarly with respect to relevance to information needs
• Therefore, to improve search recall:
  • Cluster docs in the corpus a priori
  • When a query matches a doc D, also return other docs in the cluster containing D
• Hope if we do this: the query “car” will also return docs containing “automobile”
  • Because clustering grouped together docs containing “car” with those containing “automobile”.
Cluster for “car” and “automobile”
Motivating example
• Word clustering
  • grouping words with similar topics together
  • 5 topics: shopping, tech, tagging, rdf, firefox
Issues of clustering
• How many clusters?
  • Do you know that before clustering?
  • Too many?
  • Too few?
• Which distance measure to adopt? E.g., cosine vs. Euclidean?
Categorization of clustering algorithms
• Based on methodology.
• Flat algorithms
  • Usually start with a random (partial) partitioning
  • Refine it iteratively
    • K-means clustering
    • (Model-based clustering)
• Hierarchical algorithms
  • Bottom-up, agglomerative
  • (Top-down, divisive)
Categorization of clustering algorithms
• Based on results.
• Hard clustering: each document belongs to exactly one cluster
  • More common and easier to do
• Soft clustering: a document can belong to more than one cluster.
  • Makes more sense for applications like creating browsable hierarchies
  • You may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoes
  • You can only do that with a soft clustering approach.
Partitioning Algorithms
• Partitioning method: construct a partition of n documents into a set of K clusters
• Given: a set of documents and the number K
• Find: a partition of K clusters that optimizes the chosen partitioning criterion
  • Globally optimal: would require exhaustively enumerating all partitions
  • Intractable for many objective functions
  • Ergo, use effective heuristic methods: K-means and K-medoids algorithms
K-means
• Assumes documents are real-valued vectors.
• Clusters based on centroids (aka the center of gravity or mean) of points in a cluster, c:
$\vec{\mu}(c) = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x}$
• Reassignment of instances to clusters is based on distance to the current cluster centroids.
  • (Or one can equivalently phrase it in terms of similarities)
K-Means Algorithm
• Select K random docs {s_1, s_2, …, s_K} as seeds.
• Until clustering converges (or other stopping criterion):
  • For each doc d_i:
    • Assign d_i to the cluster c_j such that dist(d_i, s_j) is minimal.
  • (Next, update the seeds to the centroid of each cluster)
  • For each cluster c_j:
    • s_j = µ(c_j)
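The algorithm above can be sketched in plain Python as follows. Squared Euclidean distance, the "partition unchanged" stopping rule, the fixed random seed, and the toy points are choices made for this illustration, not requirements of K-means.

```python
import random

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def centroid(points):
    """Mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]

def kmeans(docs, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Select K random docs as seeds.
    seeds = [list(p) for p in rng.sample(docs, k)]
    assignment = None
    for _ in range(max_iter):
        # Assign each doc to the cluster whose seed is closest.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(d, seeds[j]))
            for d in docs
        ]
        if new_assignment == assignment:   # doc partition unchanged: stop
            break
        assignment = new_assignment
        # Update each seed to the centroid of its cluster.
        for j in range(k):
            members = [d for d, a in zip(docs, assignment) if a == j]
            if members:                    # keep the old seed if a cluster is empty
                seeds[j] = centroid(members)
    return assignment, seeds

docs = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
assignment, centroids = kmeans(docs, k=2)
print(assignment)
print(centroids)
```

Which cluster gets index 0 depends on the random seeds, so only the grouping is meaningful, not the cluster numbers.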
A running example (K=2)
[Figure: K-means iterations on a set of 2-D points: pick seeds, reassign clusters, compute centroids, reassign clusters, compute centroids, reassign clusters, converged!]
When to stop?
• Several possibilities, e.g.,
  • A fixed number of iterations.
  • Doc partition unchanged.
  • Centroid positions don’t change.
• Are the last two conditions the same?
Convergence
• Convergence?
• Why should the K-means algorithm ever reach a fixed point?
  • A state in which clusters don’t change.
• K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm.
  • EM is known to converge.
  • Number of iterations could be large.
    • But in practice usually isn’t!
Convergence
• Define the goodness measure of cluster k as the sum of squared distances from the cluster centroid:
  • G_k = Σ_i (d_i – c_k)^2 (sum over all d_i in cluster k)
  • G = Σ_k G_k
• Reassignment monotonically decreases G since each vector is assigned to the closest centroid.
Convergence of K-Means
• Recomputation monotonically decreases each G_k since (m_k is the number of members in cluster k):
  • Σ (d_i – a)^2 reaches its minimum for:
  • Σ –2(d_i – a) = 0
  • Σ d_i = Σ a
  • m_k a = Σ d_i
  • a = (1/m_k) Σ d_i = c_k
• K-means typically converges quickly
Sensitivity to seed set selection
• Results can vary based on random seed selection.
• Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings.
  • Select good seeds using a heuristic (e.g., the doc least similar to any existing mean)
  • Try out multiple starting points
  • Initialize with the results of another method.
Example showing sensitivity to seeds
[Figure: six points A to F]
In the above, if you start with B and E as centroids, you converge to {A,B,C} and {D,E,F}.
If you start with D and F, you converge to {A,B,D,E} and {C,F}.
How many clusters?
• Number of clusters K is given
  • Partition n docs into a predetermined number of clusters
• Finding the “right” number of clusters is part of the problem
  • Given docs, partition into an “appropriate” number of subsets.
  • E.g., for Google news - we know the number of clusters (sports, politics, finance).
Text Mining
Word Collocation
What is word collocation
• “an expression consisting of two or more words that correspond to some conventional way of saying things.”
• “Collocations of a given word are statements of the habitual or customary places of that word”
• Examples: “stiff breeze”, “strong tea”, “powerful drug”, “broad daylight”, “weapons of mass destruction”, “make up”, “check in”
What is word collocation
• (Choueka, 1988)
[A collocation is defined as] “a sequence of two or more consecutive words, that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components.”
• Criteria:
  • non-compositionality
  • non-substitutability
  • non-modifiability
  • non-translatable word for word
Word collocation
• A phrase is compositional if its meaning can be predicted from the meaning of its parts
• Collocations have limited compositionality
  • there is usually an element of meaning added to the combination
  • Ex: strong tea
• Idioms are the most extreme examples of non-compositionality
  • Ex: to hear it through the grapevine
Word collocation
• We cannot substitute near-synonyms for the components of a collocation.
  • “strong” is a near-synonym of “powerful”
    • strong tea, ?powerful tea
  • “yellow” is as good a description of the color of white wines
    • white wine, ?yellow wine
• Many collocations cannot be freely modified with additional lexical material or through grammatical transformations
  • weapons of mass destruction --> ?weapons of massive destruction
  • to be fed up to the back teeth --> ?to be fed up to the teeth in the back
Types of collocations
• Verb particle/phrasal verb constructions
  • to go down, to check out, …
• Proper nouns
  • John Smith
• Terminological expressions
  • concepts and objects in technical domains
  • hydraulic oil filter
• Idioms
  • to hear it through the grapevine.
Why study word collocation
• In natural language generation
  • The output should be natural
  • make a decision, ?take a decision
• In lexicography
  • Identify collocations to list them in a dictionary
  • To distinguish the usage of synonyms or near-synonyms
• In parsing
  • To give preference to the most natural attachments
  • plastic (can opener), ?(plastic can) opener
• In corpus linguistics and psycholinguistics
  • Ex: to study social attitudes towards different types of substances
  • strong cigarettes/tea/coffee
  • powerful drug
(Near-)Synonyms
• To determine if 2 words are synonyms: the principle of substitutability
  • 2 words are synonyms if they can be substituted for one another in some?/any? sentence without changing the meaning or acceptability of the sentence
  • How big/large is this plane?
  • Would I be flying on a big/large or small plane?
  • Miss Nelson became a kind of big / ??large sister to Tom.
  • I think I made a big / ??large mistake.
Frequency based method
• Justeson and Katz’s filter
• Hypothesis:
  § if 2 words occur together very often, they must be interesting candidates for a collocation
• Method:
  § Select the most frequently occurring bigrams (sequences of 2 adjacent words)
Example
• Except for “New York”, all bigrams are pairs of function words
• Need some additional information to filter these out
Tag Pattern | Example
A N | linear function
N N | regression coefficient
A A N | Gaussian random variable
A N N | cumulative distribution function
N A N | mean squared error
N N N | class probability function
N P N | degrees of freedom
Example
• Based on POS tags and frequency
• Simple method that works very well.
“The portfolio is fine except for the fact that the last movement of sonata #6 is missing .”
[('The', 'DT'), ('portfolio', 'NN'), ('is', 'VBZ'), ('fine', 'JJ'), ('except', 'IN'), ('for', 'IN'), ('the', 'DT'), ('fact', 'NN'), ('that', 'IN'), ('the', 'DT'), ('last', 'JJ'), ('movement', 'NN'), ('of', 'IN'), ('sonata', 'NN'), ('#6', 'CD'), ('is', 'VBZ'), ('missing', 'VBG'), ('.', '.')]
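A tag-pattern filter over bigrams, followed by frequency counting, can be sketched on the tagged sentence above. The mapping from Penn Treebank tags to the coarse classes A, N, and P is an assumption made for this sketch, and only the two bigram patterns (A N and N N) are handled.

```python
from collections import Counter

tagged = [("The", "DT"), ("portfolio", "NN"), ("is", "VBZ"), ("fine", "JJ"),
          ("except", "IN"), ("for", "IN"), ("the", "DT"), ("fact", "NN"),
          ("that", "IN"), ("the", "DT"), ("last", "JJ"), ("movement", "NN"),
          ("of", "IN"), ("sonata", "NN"), ("#6", "CD"), ("is", "VBZ"),
          ("missing", "VBG"), (".", ".")]

def coarse_tag(tag):
    """Map a Penn Treebank tag to A (adjective), N (noun), P (preposition) or None."""
    if tag.startswith("JJ"):
        return "A"
    if tag.startswith("NN"):
        return "N"
    if tag == "IN":
        return "P"
    return None

BIGRAM_PATTERNS = {("A", "N"), ("N", "N")}

def candidate_bigrams(tagged_tokens):
    """Keep adjacent word pairs whose coarse tags match an allowed pattern."""
    candidates = []
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (coarse_tag(t1), coarse_tag(t2)) in BIGRAM_PATTERNS:
            candidates.append((w1.lower(), w2.lower()))
    return candidates

# Over a whole corpus, these counts would rank the collocation candidates.
counts = Counter(candidate_bigrams(tagged))
print(counts.most_common())
```

On this one sentence only "last movement" survives the filter; pairs of function words such as "except for" are discarded regardless of how frequent they are.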
Subtle difference between “strong” and “powerful”
• On a 14 million word corpus from the New York Times (Aug.-Nov. 1990)
n-grams
• Similar to word collocation, n-grams study the relationships between words.
• n-grams are predictive: use the previous n-1 words to predict the n-th word, thinking in a probabilistic way.
• Examples:
  • Tri-gram: use the first two words to predict the third.
    • “large green ___________”: tree? mountain? frog? car?
  • 5-gram:
    • “swallowed the large green ________”: pill? broccoli?
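Why n-gram?
• Compute the probability of a sentence:
  • Pr(“This is a valid one”) >> Pr(“ate top an in and”)
• Assume bi-gram: each word depends only on the previous word,
$\Pr(w_1 w_2 \ldots w_n) \approx \Pr(w_1) \prod_{i=2}^{n} \Pr(w_i \mid w_{i-1})$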
Applications
• Machine translation, OCR (optical character recognition), speech recognition, natural language generation.
• Pick the most likely sentence.
What’s the best n?
• Reliability vs. discrimination
  • “large green ___________”: tree? mountain? frog? car?
  • “swallowed the large green ________”: pill? broccoli?
• larger n: more information about the context of the specific instance (greater discrimination)
• smaller n: more instances in training data, better statistical estimates (more reliability)
9/27/16
Example
• Estimate what comes next after “monstrous”
• Use the concordance function in NLTK on a corpus:
Estimate the probabilities
• Maximum likelihood estimate with a multinomial probabilistic model, based on word counts: P(w2 | w1) = C(w1, w2) / C(w1)
[Figure annotations: too few non-zeros; too many zeros: this is not the true distribution!]
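The count-based estimate C(w1, w2)/C(w1) can be sketched as follows. The toy sentence is an assumption for illustration; C(w1) is counted over positions that have a following word, so the estimates for each w1 sum to 1.

```python
from collections import Counter

def bigram_mle(tokens):
    """Maximum likelihood estimates P(w2 | w1) = C(w1, w2) / C(w1)."""
    first_word_counts = Counter(tokens[:-1])          # C(w1)
    bigram_counts = Counter(zip(tokens, tokens[1:]))  # C(w1, w2)
    return {(w1, w2): n / first_word_counts[w1]
            for (w1, w2), n in bigram_counts.items()}

tokens = "the large green tree and the large green frog".split()
probs = bigram_mle(tokens)

print(probs[("large", "green")])
print(probs[("green", "tree")])
# An unseen bigram has no entry, i.e. an estimated probability of zero.
print(probs.get(("green", "car"), 0.0))
```

"green car" gets probability zero simply because it never occurred in this tiny sample, which is the "too many zeros" problem: the maximum likelihood estimate is not the true distribution.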