Elements of Text Mining, Part I

Jaganadh G (http://jaganadhg.in)

CC BY-SA
Tokenization
Tokenization

Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes input for further processing such as parsing or text mining.

Tokenizing text with Python

import re

def tokenize(text):
    # Split on runs of non-word characters after lower-casing
    tokenizer = re.compile(r'\W+')
    return tokenizer.split(text.lower())

doc = "John likes to watch movies. Mary likes too."
words = tokenize(doc)
print words
Twokenization

The rise of social media introduced new orthographic patterns in digital text. A typical example is a tweet, where people use abbreviated forms of words, emoticons, hash-tags, etc. Generic text tokenization techniques won't yield good results in separating words in social media text like tweets. A good social media tokenizer has to take care of emoticons, hash-tags, shortened URLs, etc.

Social media tokenization with Python using happyfuntokenizing 1

from happyfuntokenizing import Tokenizer

def twokenize(tweet, pc=True):
    # preserve_case=False would down-case tokens (except emoticons)
    twokenizer = Tokenizer(preserve_case=pc)
    return twokenizer.tokenize(tweet)

tweet = "RT @USER Relevant 2 clinical text >> Recursive neural networks: Deep Learning Natural Language Processing #NLProc http://t.co/"
twokens = twokenize(tweet)

1 https://bitbucket.org/jaganadhg/twittertokenize
Sentence Tokenization
Heuristic sentence boundary detection algorithm 2

Place putative sentence boundaries after all occurrences of . ? ! (and maybe ; : and dashes).
Move the boundary after following quotation marks, if any.
Disqualify a period boundary in the following circumstances:
If it is preceded by a known abbreviation of a sort that does not normally occur word-finally but is commonly followed by a capitalized proper name, such as Prof. or vs.
If it is preceded by a known abbreviation and not followed by an uppercase word. This will deal correctly with most usages of abbreviations like etc. or Jr., which can occur sentence-medially or finally.
Disqualify a boundary with a ? or ! if it is followed by a lowercase letter (or a known name).
Regard other putative sentence boundaries as sentence boundaries.

2 Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, May 1999.
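A minimal sketch of this heuristic in plain Python (not from the original slides; the abbreviation list is a toy assumption, a real system needs a fuller lexicon):

import re

# Toy abbreviation list: an assumption for illustration only
ABBREVS = set(['prof.', 'vs.', 'etc.', 'jr.', 'dr.'])

def split_sentences(text):
    # Putative boundaries after . ? ! (optionally followed by a quote mark)
    sentences, start = [], 0
    for match in re.finditer(r"[.?!][\"']?\s+", text):
        boundary = match.end()
        prev_word = text[start:boundary].split()[-1].lower()
        next_char = text[boundary:boundary + 1]
        # Disqualify: known abbreviation, or boundary followed by lowercase
        if prev_word in ABBREVS or next_char.islower():
            continue
        sentences.append(text[start:boundary].strip())
        start = boundary
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences

print split_sentences("He met Prof. Smith. They talked! It was nice.")
# ['He met Prof. Smith.', 'They talked!', 'It was nice.']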
Sentence Tokenization

Sentence Tokenization with Python and NLTK

from nltk.data import load

tokenizer = load('tokenizers/punkt/english.pickle')
text = "How can this be implemented? There are a lot of subtleties, such as dot being used in abbreviations."
sents = tokenizer.tokenize(text)
for sent in sents:
    print sent
Counting Words

Word Count - Python

def word_count(text):
    # Frequency of each distinct word in the text
    words = tokenize(text)
    word_freq = dict([(word, words.count(word)) for word in set(words)])
    return word_freq

text = "How can this be implemented? There are a lot of subtleties, such as dot being used in abbreviations."
wc = word_count(text)
for word, count in wc.items():
    print word, '\t\t', count
Finding Word Length

Word Length

def word_length(text):
    # Map word length -> number of words with that length
    words = tokenize(text)
    word_length = {}
    [word_length.__setitem__(len(word), 1 + word_length.get(len(word), 0)) for word in words]
    return word_length

text = "How can this be implemented? There are a lot of subtleties, such as dot being used in abbreviations."
wl = word_length(text)
for length, count in wl.items():
    print "There are %d words of length %d" % (count, length)
Word Proportion

Word Proportion

Let C be a corpus where (w1, w2, w3, ..., wn) are the words.
SIZE(C) = length(tokens(C)), i.e. the total number of words,
so p(wi, C) = f(wi, C) / SIZE(C), where f(wi, C) is the frequency of wi in C.

Finding Word Proportion

from __future__ import division

def word_propo(text):
    # Proportion of each word relative to the total word count
    words = tokenize(text)
    wc = word_count(text)
    propo = dict([(word, wc[word] / len(words)) for word in set(words)])
    return propo

text = "How can this be implemented? There are a lot of subtleties, such as dot being used in abbreviations."
wp = word_propo(text)
for word, propo in wp.items():
    print word, '\t\t', propo
Words Types and Ratio
Words and Types

Words are valid tokens from a given corpus C. Types are the total distinct tokens in a given corpus C.
Example:
C = "I shot an elephant in my pajamas. He saw the fine fat trout in the brook."
There are 16 words and 14 types in C.
words = ('i', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas', 'he', 'saw', 'the', 'fine', 'fat', 'trout', 'in', 'the', 'brook')
types = ('shot', 'i', 'saw', 'elephant', 'brook', 'fine', 'an', 'fat', 'in', 'my', 'the', 'he', 'pajamas', 'trout')
Words Types and Ratio
Word Type Ratio

WTR(C) = WC(C) / C(T), where WC(C) is the total number of words in corpus C and C(T) is the count of types in corpus C.

Finding Word Type Ratio

def word_type_ratio(text):
    # Ratio of total words to distinct words (types)
    words = tokenize(text)
    ratio = len(words) / len(set(words))
    return ratio

text = "I shot an elephant in my pajamas. He saw the fine fat trout in the brook."
ratio = word_type_ratio(text)
print ratio
Finding top N words

Python code to find the top N words from a text

from operator import itemgetter

def top_words(text, n=50):
    # Sort the word-frequency dictionary by frequency, descending
    wordfreq = word_count(text)
    topwords = sorted(wordfreq.iteritems(), key=itemgetter(1), reverse=True)[:n]
    return topwords

text = open('gpl-2.0.txt', 'r').read()
topwords = top_words(text, n=50)
for word, count in topwords:
    print "%s \t %d" % (word, count)
Plotting top N words

Python code to plot the top 20 words from a text

import matplotlib.pyplot as plt

def plot_freq(text):
    tfw = top_words(text, n=20)
    x = range(len(tfw))
    y = [tfw[item][1] for item in range(len(tfw))]
    plt.plot(x, y, 'bo', ls='dotted')
    plt.xticks(range(0, len(tfw) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    plt.xlabel("Word Ranking")
    plt.ylabel("Word Frequency")
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq(text)
Top N Words
Plot of top 50 words from GPL v2 (without filtering stop words)
Plotting top N words

Python code to plot the top 50 words from a text. This plot will show the words in the plot.

import matplotlib.pyplot as plt

def plot_freq_tag(text):
    tfw = top_words(text, n=50)
    words = [tfw[i][0] for i in range(len(tfw))]
    x = range(len(tfw))
    y = [tfw[item][1] for item in range(len(tfw))]
    fig = plt.figure()
    ax = fig.add_subplot(111, xlabel="Word Rank", ylabel="Word Frequency")
    ax.set_title('Top 50 words')
    ax.plot(x, y, 'go-', ls='dotted')
    plt.xticks(range(0, len(words) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    for i, label in enumerate(words):
        plt.text(x[i], y[i], label, rotation=45)
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq_tag(text)
Top N Words
Plot of top 50 words from GPL v2 (without filtering stop words)
Plotting histogram of top N words

Python code to plot a histogram of the top 50 words

import numpy as np
import matplotlib.pyplot as plt

def plot_hist(text):
    tw = top_words(text)
    words = [tw[i][0] for i in range(len(tw))]
    freq = [tw[j][1] for j in range(len(tw))]
    pos = np.arange(len(words))
    width = 1.0
    ax = plt.axes(frameon=True)
    ax.set_xticks(pos)
    ax.set_yticks(range(0, max(freq), 10))
    ax.set_xticklabels(words, rotation='vertical', fontsize=9)
    plt.bar(pos, freq, width, color='b')
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_hist(text)
Histogram
Histogram of top 50 words from GPL v2 (without filtering stop words)
Lexical Dispersion Plot
A lexical dispersion plot shows the positions of a word in a given text. I have created some Python code to generate a lexical dispersion plot with reference to a given word list 3

import matplotlib.pyplot as plt

def dispersion_plot(text, words):
    wordst = tokenize(text)
    points = [(x, y) for x in range(len(wordst))
              for y in range(len(words)) if wordst[x] == words[y]]
    if points:
        x, y = zip(*points)
    else:
        x = y = ()
    plt.plot(x, y, "go", scalex=2)
    plt.yticks(range(len(words)), words, color="b")
    plt.ylim(-1, len(words))
    plt.title("Lexical Dispersion Plot")
    plt.xlabel("Word Offset")
    plt.show()

gpl = open('gpl-2.0.txt', 'r').read()
dispersion_plot(gpl, ['software', 'license', 'copy', 'gnu', 'program', 'free'])

3 Code taken from http://nltk.googlecode.com/svn/trunk/doc/api/nltk.draw.dispersion-pysrc.html and modified
Lexical Dispersion Plot
Lexical dispersion plot from GPL text
Tag Cloud
A tag cloud (word cloud, or weighted list in visual design) is a visual representation of text data, typically used to depict keyword meta-data (tags) on websites or to visualize free-form text. There is an interesting Python tool to generate tag clouds from text called pytagcloud 4. Here comes an example of creating a tag cloud from the first 100 words of the GPL text.

from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts

def create_tag_cloud(text):
    words = tokenize(text)
    doc = " ".join(d for d in words[:100])
    tags = make_tags(get_tag_counts(doc), maxsize=80)
    create_tag_image(tags, 'gpl.png', size=(900, 600), fontname='Philosopher')

gpl = open('gpl-2.0.txt', 'r').read()
create_tag_cloud(gpl)

4 https://github.com/atizo/PyTagCloud
Tag Cloud
Tag cloud from GPL text
Word co-occurrence
Word co-occurrence analysis is used to construct lexicons, ontologies, etc. In general, it aims to find similarities between word pairs.

A word co-occurrence matrix is a square N × N matrix, where N corresponds to the total number of unique words in a corpus. A cell m_ij contains the number of times word m_i co-occurs with word m_j within a specific context: a natural unit such as a sentence, or a certain window of m words. Note that the upper and lower triangles of the matrix are identical, since co-occurrence is a symmetric relation. a

a Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce, www.aclweb.org/anthology/D08-1044
Word co-occurrence

      w1    w2    w3   ...  wn
w1    m11   m12   m13  ...  m1n
w2    m21   m22   m23  ...  m2n
w3    m31   m32   m33  ...  m3n
...
wn    mn1   mn2   mn3  ...  mnn

Shape of a co-occurrence matrix
Word co-occurrence

Finding a co-occurrence matrix with Python

from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    # Symmetric co-occurrence counts over each document (list of tokens)
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = corpora[i], corpora[j]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]
ccm = cooccurrence_matrix_corpus(corpus)
Word co-occurrence
[Figure: word co-occurrence visualization from 100 positive movie reviews. The plot shows the top first word and the top four associated words; for each associated word, its associated words are plotted in turn.]
Stop Words
Stop Words
In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In linguistics these words are called function words. Words like 'a', 'an', and 'the' are examples of stop words. There is no single defined set of stop words; different applications and research groups use different sets of stop words.
Generally stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus compared to content words.
Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words
Stop Words Filter
def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
    'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
    'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
    'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
    'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
    'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
    'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
    'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
    'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
    'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
    'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
    'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless
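Using stop_filter together with the tokenize function from the earlier slide (a quick usage example, not part of the original slides):

doc = "John likes to watch movies. Mary likes too."
words = [w for w in tokenize(doc) if w]  # drop empty strings left by the regex split
print stop_filter(words)
# ['john', 'likes', 'watch', 'movies', 'mary', 'likes']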
Bag of Words
The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar and even word order. a

Analyzing text by only analyzing the frequency of words is called the bag-of-words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model
Bag of Words
Example

d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

{'football': 0, 'games': 1, 'john': 2, 'likes': 3, 'mary': 4, 'movies': 5, 'watch': 6}
(vocabulary indices, after removing stop words)

[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of the corresponding entry in the dictionary. a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model
Bag of Words
Documents

d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

Vocabulary Index

VI(t) =
  0 if t is 'football'
  1 if t is 'games'
  2 if t is 'john'
  3 if t is 'likes'
  4 if t is 'mary'
  5 if t is 'movies'
  6 if t is 'watch'
Bag of Words
        football  games  john  likes  mary  movies  watch
doc1    0         0      1     2      1     1       1
doc2    1         1      1     1      0     0       1
Bag of Words
Creating a Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word', min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
        'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org
Bag of Words
Creating a Bag of Words with plain Python (just a sample)

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    # Build the vocabulary as the union of all document tokens
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
           for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies. Mary likes too.",
     "John also likes to watch football games.")
bag_of_words(d)
TF-IDF
Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t) = tf(t, d) × idf(t)

where 't' is a term in document 'd', and tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = Σ_{x ∈ d} fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise
TF-IDF
TF-IDF

and

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the count of documents where 't' is present and tf(t, d) ≠ 0, and |D| is the cardinality of the document space.
TF
TF

tf(t, d) = Σ_{x ∈ d} fr(x, t)

fr(x, t) is a simple indicator function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1
Document Vector
To create a document vector space:

V⃗_dn = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))

To represent 'd1' and 'd2' as vectors:

V⃗_d1 = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))
V⃗_d2 = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))

which evaluates to:

V⃗_d1 = (0, 0, 1, 2, 1, 1, 1)
V⃗_d2 = (1, 1, 1, 1, 0, 0, 1)
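A small sketch (not in the original slides) that builds these term-frequency vectors from the vocabulary index defined earlier:

vocab_index = {'football': 0, 'games': 1, 'john': 2, 'likes': 3,
               'mary': 4, 'movies': 5, 'watch': 6}

def doc_vector(doc, vocab_index):
    # Term-frequency vector ordered by the vocabulary index
    tokens = doc.lower().replace('.', '').split()
    vec = [0] * len(vocab_index)
    for token in tokens:
        if token in vocab_index:
            vec[vocab_index[token]] += 1
    return vec

print doc_vector("John likes to watch movies. Mary likes too.", vocab_index)
# [0, 0, 1, 2, 1, 1, 1]
print doc_vector("John also likes to watch football games.", vocab_index)
# [1, 1, 1, 1, 0, 0, 1]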
Vector Space Matrix
The document vectors can be represented as a matrix M_{|D|×F}, where |D| is the cardinality of the document space:

M_{|D|×F} =
[ 0 0 1 2 1 1 1
  1 1 1 1 0 0 1 ]
Vector Normalization
Normalized Vector

A normalized vector is represented as v̂ = v⃗ / ||v⃗||_p, where v̂ is the unit vector (the normalized vector), v⃗ is the vector to be normalized, and ||v⃗||_p is the norm (magnitude or length) of the vector v⃗ in the L^p space (Lebesgue spaces). a

a http://en.wikipedia.org/wiki/Lp_space
Vector Normalization
The length of a vector is calculated using the Euclidean norm. 6
Non-normalized vector: v⃗ = (v1, v2, v3, ..., vn)
Length of vector: ||v⃗|| = √(v1² + v2² + v3² + ... + vn²)

With norm: ||v⃗||_p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)

It can be simplified as:

||v⃗||_p = ( Σ_{i=1}^{n} |v_i|^p )^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html
Vector Normalization
L2 Norm

The norm which we apply here is the L2 norm, which is also called the Euclidean norm. a It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm
Vector Normalization
v⃗_d1 = (0, 0, 1, 2, 1, 1, 1)

v̂_d1 = v⃗_d1 / ||v⃗_d1||_2
     = (0, 0, 1, 2, 1, 1, 1) / √(0² + 0² + 1² + 2² + 1² + 1² + 1²)
     = (0, 0, 1, 2, 1, 1, 1) / √8
     = (0/√8, 0/√8, 1/√8, 2/√8, 1/√8, 1/√8, 1/√8)
     = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Our normalized vector v̂_d1 now has an L2 norm of ||v̂_d1||_2 = 1.0.
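The same computation can be checked with NumPy (a quick sketch, using the vector from the slide above):

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_hat = v_d1 / np.linalg.norm(v_d1)  # default norm is L2, here sqrt(8)
print v_hat                # [0. 0. 0.3535 0.7071 0.3535 0.3535 0.3535]
print np.linalg.norm(v_hat)  # 1.0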
IDF
IDF

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the count of documents where 't' is present and tf(t, d) ≠ 0.
Finding IDF
idf(t_i) = log( |D| / (1 + |{d : t_i ∈ d}|) ), with |D| = 2 here (for a term appearing in no document this gives log(2/1) = 0.69314718).

idf(football) = log(2 / (1+1)) = 0.0
idf(games)    = log(2 / (1+1)) = 0.0
idf(john)     = log(2 / (1+2)) = -0.40546510810816444
idf(likes)    = log(2 / (1+2)) = -0.40546510810816444
idf(mary)     = log(2 / (1+1)) = 0.0
idf(movies)   = log(2 / (1+1)) = 0.0
idf(watch)    = log(2 / (1+1)) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
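The idf values above can be reproduced in a few lines of Python (a sketch; the stop-filtered token lists for d1 and d2 are written out by hand):

from __future__ import division
import math

docs = [['john', 'likes', 'watch', 'movies', 'mary', 'likes'],
        ['john', 'likes', 'watch', 'football', 'games']]
vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']

def idf(term, docs):
    # idf(t) = log(|D| / (1 + number of documents containing t))
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / (1 + df))

print [round(idf(t, docs), 4) for t in vocab]
# [0.0, 0.0, -0.4055, -0.4055, 0.0, 0.0, 0.0]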
TF-IDF weight
Finding TF-IDF weight

M_{|D|×F} × M_idf

[ tf(t1,d1) tf(t2,d1) tf(t3,d1) tf(t4,d1) tf(t5,d1) tf(t6,d1) tf(t7,d1)
  tf(t1,d2) tf(t2,d2) tf(t3,d2) tf(t4,d2) tf(t5,d2) tf(t6,d2) tf(t7,d2) ]
×
[ idf(t1) 0       0       0       0       0       0
  0       idf(t2) 0       0       0       0       0
  0       0       idf(t3) 0       0       0       0
  0       0       0       idf(t4) 0       0       0
  0       0       0       0       idf(t5) 0       0
  0       0       0       0       0       idf(t6) 0
  0       0       0       0       0       0       idf(t7) ]
TF-IDF weight
[ tf(t1,d1)×idf(t1)  tf(t2,d1)×idf(t2)  tf(t3,d1)×idf(t3)  tf(t4,d1)×idf(t4)  tf(t5,d1)×idf(t5)  tf(t6,d1)×idf(t6)  tf(t7,d1)×idf(t7)
  tf(t1,d2)×idf(t1)  tf(t2,d2)×idf(t2)  tf(t3,d2)×idf(t3)  tf(t4,d2)×idf(t4)  tf(t5,d2)×idf(t5)  tf(t6,d2)×idf(t6)  tf(t7,d2)×idf(t7) ]
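With NumPy this product of the term-frequency matrix and the diagonal idf matrix is a one-liner (a sketch using the values computed above):

import numpy as np

tf = np.array([[0, 0, 1, 2, 1, 1, 1],
               [1, 1, 1, 1, 0, 0, 1]], dtype=float)
idf = np.array([0.0, 0.0, -0.4055, -0.4055, 0.0, 0.0, 0.0])

tfidf = np.dot(tf, np.diag(idf))  # scale each column by its idf value
print tfidf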
TF-IDF Normalization
L2 Normalization

M_tfidf = M_tfidf / ||M_tfidf||_2
TF-IDF
Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word', min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
        'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, f in zip(vectorizer.vocabulary_, tfd.idf_):
    print '%r => %r' % (w, f)

7 http://scikit-learn.org
N-Grams
N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words, or base pairs, according to the application. a

Unigrams are single words.
Bigrams are sequences of two words.
Trigrams are sequences of three words.

a http://en.wikipedia.org/wiki/N-gram
Bigrams
Bigrams

P(w_i | w_1, w_2, ..., w_{i-1}) ≈ P(w_i | w_{i-1})

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
'movies mary', 'mary likes', 'likes too']
Trigrams
Trigrams

P(w_i | w_1, w_2, ..., w_{i-1}) ≈ P(w_i | w_{i-2}, w_{i-1})

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
'watch movies mary', 'movies mary likes', 'mary likes too']
N-Grams
Python code to generate N-Grams from a list of words

def ngrams(words, n=2):
    # Sliding window of n words joined with a space
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams
Mutual Information
Mutual Information

A statistical test to measure the strength of word association:

I(w_i, w_j) = log2( P(w_i, w_j) / (P(w_i) P(w_j)) ) ≈ log2( N C(w_i, w_j) / (C(w_i) C(w_j)) )

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus, C(w_i, w_j) is the frequency of the bigram w_i w_j, and N is the total number of words in the corpus.

I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )
Mutual Information
from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))) / math.log(2)
    return minfo
t-score
t-score

A statistical test to measure the strength of word association:

t(w_i, w_j) = ( mean(P(w_i, w_j)) - mean(P(w_i)) mean(P(w_j)) ) / √( σ²(P(w_i, w_j)) + σ²(P(w_i)) σ²(P(w_j)) )
            ≈ ( C(w_i, w_j) - (1/N) C(w_i) C(w_j) ) / √C(w_i, w_j)

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus, C(w_i, w_j) is the frequency of the bigram w_i w_j, and N is the total number of words in the corpus.

t(strong, tea) = ( C(strong, tea) - (1/N) C(strong) C(tea) ) / √C(strong, tea)
t-score
from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc
Document Classification
Document classification or document categorization is a problem in library science, information science, and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents;

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance;

unsupervised document classification is entirely executed without reference to external information.

8 http://en.wikipedia.org/wiki/Document_classification
Document Classification
Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (or classifier function) γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ

γ(c_i, d_j) = 1 if d_j belongs to c_i, 0 otherwise

The main approaches in document classification are:

Naïve Bayes (NB)
Support Vector Machines (SVM)
Document Classification
A supervised document classification pipeline 9
9 Image taken from http://www.python-course.eu/text_classification_introduction.php
Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem, or Bayes' rule:

P(H|E) = P(E|H) × P(H) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability: the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability: the probability of an event after the evidence is observed.
Naıve Bayes Classification
Naïve Bayes Classification

Let H be the event of raining and E be the evidence of dark clouds; then we have:

P(raining | dark cloud) = P(dark cloud | raining) × P(raining) / P(dark cloud)

For multiple evidences:

P(H | E1, E2, ..., En) = P(E1, E2, ..., En | H) × P(H) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as follows:

P(H | E1, E2, ..., En) = P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) / P(E1, E2, ..., En)
Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs   kill  bomb  kidnap  music  movie  tv   C
D0              2     1     3       0      0      1    Terrorism
D1              1     1     1       0      0      0    Terrorism
D2              1     1     2       0      1      0    Terrorism
D3              0     1     0       2      1      1    Entertainment
D4              0     0     1       1      1      0    Entertainment
D5              0     0     0       2      2      2    Entertainment
Building Naıve Bayes Model
Building Naïve Bayes Model

The Naïve Bayes model for the training set will look like (|V| = 6):

C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
T  0.5    15  0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
E  0.5    12  0.05555556   0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary items = 6
P(Ci) = the prior probability of each class = (number of documents in the class) / (number of all documents)
ni = the total word frequency in each class:
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of a keyword occurrence given a class. Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12
Building Naıve Bayes Model
To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555
Testing the NB model
Our test document is:

Test Docs   kill  bomb  kidnap  music  movie  tv   C
Dt          2     1     2       0      0      1    ?

To find the posterior probability:

P(c_i | W) = P(c_i) × Π_{j=1}^{|V|} P(w_j | c_i)
Testing the NB model
P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)² × P(bomb|Terrorism)¹ × P(kidnap|Terrorism)² × P(music|Terrorism)⁰ × P(movie|Terrorism)⁰ × P(tv|Terrorism)¹
= 0.5 × 0.2380² × 0.1904¹ × 0.3333² × 0.0476⁰ × 0.0952⁰ × 0.0952¹
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10⁻⁵

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)² × P(bomb|Entertainment)¹ × P(kidnap|Entertainment)² × P(music|Entertainment)⁰ × P(movie|Entertainment)⁰ × P(tv|Entertainment)¹
= 0.5 × 0.0555² × 0.1111¹ × 0.1111² × 0.3333⁰ × 0.2777⁰ × 0.1111¹
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10⁻⁷

The document is classified as "Terrorism" because that class got the highest value.
Preventing Underflow
The probability score assigned to the test document is very small. In real-world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities:

P(Terrorism|W) = log(0.5 × 0.2380² × 0.1904¹ × 0.3333² × 0.0476⁰ × 0.0952⁰ × 0.0952¹)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555² × 0.1111¹ × 0.1111² × 0.3333⁰ × 0.2777⁰ × 0.1111¹)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem, our system still classifies the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
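The log-space computation can be verified in a couple of lines (a sketch; the probability factors are copied from the slides above, with the exponents expanded):

import math

# Factors of P(Terrorism|W): prior, kill^2, bomb, kidnap^2, tv
terror = [0.5, 0.2380, 0.2380, 0.1904, 0.3333, 0.3333, 0.0952]
# Factors of P(Entertainment|W): prior, kill^2, bomb, kidnap^2, tv
entertain = [0.5, 0.0555, 0.0555, 0.1111, 0.1111, 0.1111, 0.1111]

def log_score(factors):
    # Sum of log10 values instead of a product, to avoid underflow
    return sum(math.log10(p) for p in factors)

print log_score(terror)      # ~ -4.24
print log_score(entertain)   # ~ -6.63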
Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes classifier:

Multi-variate Bernoulli Model
Multinomial Model
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word w_t from the vocabulary. Dimension t of the vector for document d_i is written B_it, and is either 0 or 1, indicating whether word w_t occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in the document. a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model

If we are adopting the multi-variate Bernoulli model for text classification, our document space representation will be like:

Training Docs   kill  bomb  kidnap  music  movie  tv   C
D0              1     1     1       0      0      1    Terrorism
D1              1     1     1       0      0      0    Terrorism
D2              1     1     1       0      1      0    Terrorism
D3              0     1     0       1      1      1    Entertainment
D4              0     0     1       1      1      0    Entertainment
D5              0     0     0       1      1      1    Entertainment

Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
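As an aside (not in the original slides), scikit-learn's BernoulliNB implements this event model; a minimal sketch on the binary table above:

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Binary presence/absence features (rows D0..D5 from the table above)
X = np.array([[1, 1, 1, 0, 0, 1],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 0, 1, 0],
              [0, 1, 0, 1, 1, 1],
              [0, 0, 1, 1, 1, 0],
              [0, 0, 0, 1, 1, 1]])
y = np.array([0, 0, 0, 1, 1, 1])  # 0 = Terrorism, 1 = Entertainment

clf = BernoulliNB()
clf.fit(X, y)
print clf.predict([[1, 1, 1, 0, 0, 1]])  # expect class 0 (Terrorism)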
Multinomial Model
Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag-of-words representation. We can also use a smoothed bag-of-words model like TF-IDF in the multinomial naive Bayes model.
Support Vector Machine
A Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set (x_i, y_i), i = 1..n, where x_i ∈ R^p (input feature vector) and y_i ∈ {1, -1} is the corresponding label, whether (y_i = +1) or (y_i = -1). To start with, we assume that our input feature vectors are linearly separable; that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that:

⟨w, x_i⟩ + b > 0 for y_i = +1
⟨w, x_i⟩ + b < 0 for y_i = -1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins. A toy sketch of the linear decision rule is shown below.
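A toy sketch of the linear decision rule (w and b are made-up values for illustration only):

import numpy as np

w = np.array([1.0, -1.0])  # weight vector (assumed)
b = -0.5                   # bias (assumed)

def predict(x):
    # The sign of <w, x> + b decides the class
    return 1 if np.dot(w, x) + b > 0 else -1

print predict(np.array([2.0, 0.5]))  # 2.0 - 0.5 - 0.5 = 1.0 > 0 -> +1
print predict(np.array([0.5, 2.0]))  # 0.5 - 2.0 - 0.5 = -2.0 < 0 -> -1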
Support Vector Machine
Suppose a bunch of red squares and blue rectangles are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, exactly equidistant from the margins of the scatter regions of the squares and of the rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors. 10 If the data is not linearly separable we have to use kernel tricks. 11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.
11 http://www.statsoft.com/textbook/support-vector-machines
Support Vector Machine

[Figure: illustration of the separating hyperplane, margins, and support vectors]
Practice Time
Let's try to build a Multinomial Naïve Bayes classifier with Python sklearn.

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories, shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Practice Time
Let's try to build an SVM classifier with Python sklearn.

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories, shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Practice Time
Let's try to build a multi-variate Naïve Bayes classifier with Python NLTK. 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adopted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.
Evaluating Performance of a Classifier
Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                        Actual
                        Positive              Negative
Predicted   Positive    True Positive (TP)    False Positive (FP)
            Negative    False Negative (FN)   True Negative (TN)
Evaluating Performance of a Classifier
Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                        Actual
                        Positive  Negative  Total
Predicted   Positive    562       77        639
            Negative    225       436       661
Total                   787       513       1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
Evaluating Performance of a Classifier
Precision and Recall

Precision indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)
Evaluating Performance of a Classifier
                        Actual
                        Positive  Negative  Total
Predicted   Positive    562       77        639
            Negative    225       436       661
Total                   787       513       1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84
Evaluating Performance of a Classifier
Error Rate

The error rate is the percentage of things done wrong:

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                        Actual
                        Positive  Negative  Total
Predicted   Positive    562       77        639
            Negative    225       436       661
Total                   787       513       1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23
Evaluating Performance of a Classifier
Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR):

Fall-out = FP / (FP + TN)

                        Actual
                        Positive  Negative  Total
Predicted   Positive    562       77        639
            Negative    225       436       661
Total                   787       513       1300

Fall-out = 77 / (77 + 436) = 0.15
Evaluating Performance of a Classifier
F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score
Evaluating Performance of a Classifier
F1 Score (Positive) = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78
F1 Score (Negative) = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
Evaluating Performance of a Classifier
Positive Predictive Value

The positive predictive value, or precision rate, is the proportion of positive test results that are true positives:

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87
Evaluating Performance of a Classifier
Negative Predictive Value

The negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure:

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65
Evaluating Performance of a Classifier
Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84
Evaluating Performance of a Classifier
False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons:

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12
Evaluating Performance of a Classifier
Matthews Correlation Coefficient

The Matthews correlation coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications:

MCC = (TP × TN - FP × FN) / √( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

MCC = (562 × 436 - 77 × 225) / √( (562 + 77)(562 + 225)(436 + 77)(436 + 225) ) = 0.55
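All of the metrics above can be computed directly from the four confusion-matrix cells (a sketch, not in the original slides, using the counts from the running example):

from __future__ import division
import math

TP, FP, FN, TN = 562, 77, 225, 436  # counts from the tables above

accuracy = (TP + TN) / (TP + FP + FN + TN)          # 0.76
precision = TP / (TP + FP)                          # 0.87
recall = TP / (TP + FN)                             # 0.71
f1 = 2 * precision * recall / (precision + recall)  # 0.78
error_rate = (FP + FN) / (TP + FP + FN + TN)        # 0.23
fallout = FP / (FP + TN)                            # 0.15
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))  # 0.55

print accuracy, precision, recall, f1, error_rate, fallout, mcc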
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to binary classification problems. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis, and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Evaluating Performance of a Classifier
Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
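scikit-learn provides roc_curve and auc for this; a minimal sketch (not in the original slides) with made-up labels and scores:

from sklearn.metrics import roc_curve, auc

y_true = [1, 1, 0, 1, 0, 0, 1, 0]                   # actual labels (toy data)
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.8, 0.2]  # classifier scores (toy data)

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print auc(fpr, tpr)  # area under the ROC curve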
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Named Entity Recognition
Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)), binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities
Extracting Terms from Text
Extracting Terms with Python topia.termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract
References
Chris Smith, Machine Learning, Text Feature Extraction (tf-idf) - Part I, http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith, Machine Learning, Text Feature Extraction (tf-idf) - Part II, http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog, Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora, acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly, Practical Text Mining with Perl, Wiley, 2008.

Text Categorization and Classification, http://www.python-course.eu/text_classification_introduction.php

Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb. 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce, www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.
Tokenization
Tokenization
Tokenization is the process of breaking a stream of text up into words phrases symbolsor other meaningful elements called tokens The list of tokens becomes input for furtherprocessing such as parsing or text mining
Tokenizig text with Python
import re
def tokenize(text)
tokenizer = recompile(rsquoWrsquo)
return tokenizersplit(textlower())
doc = John likes to watch movies Mary likes too
words = tokenize(doc)
print words
Jaganadh G Elements of Text Mining
Tokenization
Tokenization
Tokenization is the process of breaking a stream of text up into words phrases symbolsor other meaningful elements called tokens The list of tokens becomes input for furtherprocessing such as parsing or text mining
Tokenizig text with Python
import re
def tokenize(text)
tokenizer = recompile(rsquoWrsquo)
return tokenizersplit(textlower())
doc = John likes to watch movies Mary likes too
words = tokenize(doc)
print words
Jaganadh G Elements of Text Mining
Twokenization
Rise of social media introduced new orthographic patterns in digital text Typical example is a tweet wherepeople use abbreviated forms of words emoticons hash-tags etc Generic text tokenization techniques wont yieldgood result in separating words in social media text like tweets A good social media tokenizer has to take care ofemoticons hash-tags shortened urls etc
Social media tokenization with Python using happyfuntokenizing 1
from happyfuntokenizing import Tokenizer
def twokenize(tweetpc=True)
twokenizer = Tokenizer(preserve_case=pc)
return twokenizertokenize(tweet)
tweet = RT USER Relevant 2 clinical text gt Recursive neural networks
Deep Learning Natural Language Processing NLProc httptco
twokens = tokenize(tweet)
1httpsbitbucketorgjaganadhgtwittertokenize
Jaganadh G Elements of Text Mining
Sentence Tokenization
Heuristic sentence boundary detection algorithm 2
Place putative sentence boundaries after all occurrences of ( and maybe - )
Move the boundary after following quotation marks if any
Disqualify a period boundary in the following circumstances
If it is preceded by a known abbreviation of a sort that does not normally occur wordfinally but is commonly followed by a capitalized proper name such as Prof or vsIf it is preceded by a known abbreviation and not followed by an uppercase word Thiswill deal correctly with most usages of abbreviations like etc or Jr which can occursentence medially or finally
Disqualify a boundary with a or if
It is followed by a lowercase letter (or a known name)
Regard other putative sentence boundaries as sentence boundaries
2Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Jaganadh G Elements of Text Mining
Sentence Tokenization
Sentence Tokenization with Python and NLTK
from nltkdata import load
tokenizer = load(rsquotokenizerspunktenglishpicklersquo)
text = How can this be implemented There are a lot of subtleties
such as dot being used in abbreviations
sents = tokenizertokenize(text)
for sent in sents
print sent
Jaganadh G Elements of Text Mining
Counting Words
Word Count - Python
def word_count(text)
words = tokenize(text)
word_freq = dict([(word wordscount(word)) for word
in set(words)])
return word_freq
text = How can this be implemented There are a lot of
subtleties such as dot being used in abbreviations
wc = word_count(text)
for wordcount in wcitems()
print word t t count
Jaganadh G Elements of Text Mining
Finding Word Length
Word Length
def word_length(text)
words = tokenize(text)
word_length =
[word_length__setitem__(len(word)1 +
word_lengthget(len(word)0)) for word in words]
return word_length
text = How can this be implemented There are a lot of
subtleties such as dot being used in abbreviations
wl = word_length(text)
for length count in wlitems()
print There are d words of length d (count length)
Jaganadh G Elements of Text Mining
Word Proportion
Word Proportion
Let C be a corpus where (w1 w2 w3 wn) are the wordsSIZE(C) = length(tokens(C)) simply total number of words
so p(wi C) = f(wiC)SIZE(C) where f(wi C) is the frequency of wi in C
Finding Word Proportion
from __future__ import division
def word_propo(text)
words = tokenize(text)
wc = word_count(text)
propo = dict([(word wc[word]len(words)) for word
in set(words)])
return propo
text = How can this be implemented There are a lot of
subtleties such as dot being used in abbreviations
wp = word_propo(text)
for word propo in wpitems()
print word tt propo
Jaganadh G Elements of Text Mining
Word Proportion
Word Proportion
Let C be a corpus where (w1 w2 w3 wn) are the wordsSIZE(C) = length(tokens(C)) simply total number of words
so p(wi C) = f(wiC)SIZE(C) where f(wi C) is the frequency of wi in C
Finding Word Proportion
from __future__ import division
def word_propo(text)
words = tokenize(text)
wc = word_count(text)
propo = dict([(word wc[word]len(words)) for word
in set(words)])
return propo
text = How can this be implemented There are a lot of
subtleties such as dot being used in abbreviations
wp = word_propo(text)
for word propo in wpitems()
print word tt propo
Jaganadh G Elements of Text Mining
Words Types and Ratio
Words and Types
Words and valid tokens from a given corpus CTypes are total distinct tokens in a given corpus CExampleC = rdquoI shot an elephant in my pajamas He saw the fine fat trout in the brookrdquoThere are 16 words and 15 types in Cwords = (rsquoirsquo rsquoshotrsquo rsquoanrsquo rsquoelephantrsquo rsquoinrsquo rsquomyrsquo rsquopajamasrsquo rsquohersquo rsquosawrsquo rsquothersquo rsquofinersquo rsquofatrsquo rsquotroutrsquo rsquoinrsquo rsquothersquorsquobrookrsquo)types = (rsquoshotrsquo rsquoirsquo rsquosawrsquo rsquoelephantrsquo rsquobrookrsquo rsquofinersquo rsquoanrsquo rsquofatrsquo rsquoinrsquo rsquomyrsquo rsquothersquo rsquohersquo rsquopajamasrsquo rsquotroutrsquo)
Words Types and Ratio
Word Type Ratio
WTR(C) = WC(C) / C(T), where WC(C) is the total number of words in corpus C and C(T) is the
count of types in corpus C.
Finding Word Type Ratio
def word_type_ratio(text):
    words = tokenize(text)
    ratio = len(words) / len(set(words))
    return ratio

text = "I shot an elephant in my pajamas. He saw the fine fat trout in the brook."
ratio = word_type_ratio(text)
print ratio
Finding top N words
Python code to find top N words from a text
from operator import itemgetter

def top_words(text, n=50):
    wordfreq = word_count(text)
    topwords = sorted(wordfreq.iteritems(), key=itemgetter(1),
                      reverse=True)[:n]
    return topwords

text = open('gpl-2.0.txt', 'r').read()
topwords = top_words(text, n=50)
for word, count in topwords:
    print '%s \t %d' % (word, count)
Plotting top N words
Python code to plot top 20 words from a text
import numpy as np
import matplotlib.pyplot as plt

def plot_freq(text):
    tfw = top_words(text, n=20)
    x = range(len(tfw))
    y = []
    for item in range(len(tfw)):
        y = y + [tfw[item][1]]
    plt.plot(x, y, 'bo', ls='dotted')
    plt.xticks(range(0, len(tfw) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    plt.xlabel("Word Ranking")
    plt.ylabel("Word Frequency")
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq(text)
Top N Words
Plot of top 50 words from GPL v2 (without filtering stop words)
Plotting top N words
Python code to plot top 50 words from a text. This plot will show the words in the plot.
import numpy as np
import matplotlib.pyplot as plt

def plot_freq_tag(text):
    tfw = top_words(text, n=50)
    words = [tfw[i][0] for i in range(len(tfw))]
    x = range(len(tfw))
    y = []
    for item in range(len(tfw)):
        y = y + [tfw[item][1]]
    fig = plt.figure()
    ax = fig.add_subplot(111, xlabel="Word Rank", ylabel="Word Frequency")
    ax.set_title('Top 50 words')
    ax.plot(x, y, 'go-', ls='dotted')
    plt.xticks(range(0, len(words) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    for i, label in enumerate(words):
        plt.text(x[i], y[i], label, rotation=45)
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq_tag(text)
Top N Words
Plot of top 50 words from GPL v2 (without filtering stop words)
Plotting histogram of top N words
Python code to plot histogram of top 50 words
import numpy as np
import matplotlib.pyplot as plt

def plot_hist(text):
    tw = top_words(text)
    words = [tw[i][0] for i in range(len(tw))]
    freq = [tw[j][1] for j in range(len(tw))]
    pos = np.arange(len(words))
    width = 1.0
    ax = plt.axes(frameon=True)
    ax.set_xticks(pos)
    ax.set_yticks(range(0, max(freq), 10))
    ax.set_xticklabels(words, rotation='vertical', fontsize=9)
    plt.bar(pos, freq, width, color='b')
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_hist(text)
Histogram
Histogram of top 50 words from GPL v2 (without filtering stop words)
Lexical Dispersion Plot
A lexical dispersion plot shows the position of a word in a given text. I have created some Python code to generate a lexical dispersion plot with reference to a given word list.3
import matplotlib.pyplot as plt

def dispersion_plot(text, words):
    wordst = tokenize(text)
    points = [(x, y) for x in range(len(wordst))
              for y in range(len(words)) if wordst[x] == words[y]]
    if points:
        x, y = zip(*points)
    else:
        x = y = ()
    plt.plot(x, y, "go", scalex=.2)
    plt.yticks(range(len(words)), words, color="b")
    plt.ylim(-1, len(words))
    plt.title("Lexical Dispersion Plot")
    plt.xlabel("Word Offset")
    plt.show()

gpl = open('gpl-2.0.txt', 'r').read()
dispersion_plot(gpl, ['software', 'license', 'copy', 'gnu', 'program', 'free'])
3 Code taken from http://nltk.googlecode.com/svn/trunk/doc/api/nltk.draw.dispersion-pysrc.html and modified
Lexical Dispersion Plot
Lexical dispersion plot from GPL text
Tag Cloud
A tag cloud (word cloud, or weighted list in visual design) is a visual representation for text data, typically used to depict keyword meta-data (tags) on websites, or to visualize free-form text. There is an interesting Python tool to generate tag clouds from text called pytagcloud.4 Here comes an example of creating a tag cloud from the first 100 words of the GPL text.
from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts

def create_tag_cloud(text):
    words = tokenize(text)
    doc = " ".join(d for d in words[:100])
    tags = make_tags(get_tag_counts(doc), maxsize=80)
    create_tag_image(tags, 'gpl.png', size=(900, 600),
                     fontname='Philosopher')

gpl = open('gpl-2.0.txt', 'r').read()
create_tag_cloud(gpl)
4 https://github.com/atizo/PyTagCloud
Tag Cloud
Tag cloud from GPL text
Word co-occurrence
Word co-occurrence analysis is used to construct lexicons, ontologies, etc. In general, it aims to find similarities between word pairs.
A word co-occurrence matrix is a square N × N matrix, where N corresponds to the total number of unique words in a corpus. A cell m_ij contains the number of times word w_i co-occurs with word w_j within a specific context: a natural unit such as a sentence, or a certain window of m words. Note that the upper and lower triangles of the matrix are identical, since co-occurrence is a symmetric relation.a
a Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044
Word co-occurrence

      w1    w2    w3   ...   wn
w1    m11   m12   m13  ...   m1n
w2    m21   m22   m23  ...   m2n
w3    m31   m32   m33  ...   m3n
...
wn    mn1   mn2   mn3  ...   mnn

Shape of a co-occurrence matrix
Word co-occurrence
Finding co-occurrence matrix with Python
from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = [corpora[i], corpora[j]]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]
ccm = cooccurrence_matrix_corpus(corpus)
Word co-occurrence
Word co-occurrence visualization from 100 positive movie reviews. The plot shows the top first word and the top four associated words; for each associated word, its own associated words are plotted in turn.
Stop Words
Stop Words
In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In terms of linguistics these words are called function words. Words like 'a', 'an', 'the' are examples of stop words. There is no defined set of stop words available; different applications and research groups use different sets of stop words.
Generally stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus compared to content words. Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words.
Jaganadh G Elements of Text Mining
Stop Words Filter
def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
        'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
        'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
        'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
        'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
        'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
        'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
        'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
        'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
        'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
        'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
        'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless
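A hypothetical usage of stop_filter with the tokenize function from the earlier slide (the sample sentence is mine):

words = tokenize("this is a sample of the stop word filter")
print stop_filter(words)
# ['sample', 'stop', 'word', 'filter']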
Bag of Words
The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an un-ordered collection of words, disregarding grammar and even word order.a
Analyzing text by considering only the frequency of words is called the bag of words model.
a http://en.wikipedia.org/wiki/Bag_of_words_model
Bag of Words
Example
d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1,
 'likes': 3, 'john': 2, 'mary': 4}
# after removing stopwords
[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]
# each entry of the vectors refers to the count of
# the corresponding entry in the dictionary
a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model
Bag of Words
Documents
d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.
Vocabulary Index
VI(t) =
  0 if t is 'football'
  1 if t is 'games'
  2 if t is 'john'
  3 if t is 'likes'
  4 if t is 'mary'
  5 if t is 'movies'
  6 if t is 'watch'
Bag of Words
      football  games  john  likes  mary  movies  watch
doc1  0         0      1     2      1     1       1
doc2  1         1      1     1      0     0       1
Bag of Words
Creating Bag of Words with Python and sklearn 5
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()
5 http://scikit-learn.org
Bag of Words
Creating Bag of Words with plain Python. Just for a sample :-(

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
                       in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
           for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies. Mary likes too.",
     "John also likes to watch football games.")
bag_of_words(d)
TF-IDF
Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.
TF-IDF
tf-idf(t) = tf(t, d) × idf(t)
where 't' is a term in document 'd', and tf(t, d) is how many times the term 't' is present in 'd':
tf(t, d) = Σ_{x ∈ d} fr(x, t)
where
fr(x, t) = 1 if x = t, 0 otherwise
TF-IDF
TF-IDF
and idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )
where |{d : t ∈ d}| is the count of documents in which 't' is present and tf(t, d) ≠ 0, and |D| is the cardinality of the document space.
TF
TF
tf(t, d) = Σ_{x ∈ d} fr(x, t)
fr(x, t) is a simple function:
fr(x, t) = 1 if x = t, 0 otherwise
Example: tf('john', d1) = 1
Document Vector
To create a document vector space
v(d_n) = (tf(t1, d_n), tf(t2, d_n), ..., tf(tn, d_n))
To represent 'd1' and 'd2' as vectors:
v(d1) = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))
v(d2) = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))
which evaluates to:
v(d1) = (0, 0, 1, 2, 1, 1, 1)
v(d2) = (1, 1, 1, 1, 0, 0, 1)
Vector Space Matrix
The document vectors can be represented as a matrix:
M_{|D|×F}
where |D| is the cardinality of the document space:
M_{|D|×F} = [ 0 0 1 2 1 1 1
              1 1 1 1 0 0 1 ]
Vector Normalization
Normalized Vector
A normalized vector is represented as v̂ = v / ||v||_p, where v̂ is the unit vector (the normalized vector), v is the vector going to be normalized, and ||v||_p is the norm (magnitude or length) of the vector v in the L^p space (Lebesgue spaces).a
a http://en.wikipedia.org/wiki/Lp_space
Vector Normalization
The length of a vector is calculated using the Euclidean norm.6
Non-normalized vector: v = (v1, v2, v3, ..., vn)
Length of vector v: ||v|| = sqrt(v1^2 + v2^2 + v3^2 + ... + vn^2)
With norm: ||v||_p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)
It can be simplified as:
||v||_p = ( Σ_{i=1}^{n} |v_i|^p )^(1/p)
6 http://mathworld.wolfram.com/L2-Norm.html
Vector Normalization
L2 Norm
The norm which we apply here is the L2 norm, which is also called the Euclidean norm.a
It is a common norm used to measure the length of a vector, where p = 2.
a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm
Vector Normalization
v(d1) = (0, 0, 1, 2, 1, 1, 1)
v̂(d1) = v(d1) / ||v(d1)||_2
v̂(d1) = (0, 0, 1, 2, 1, 1, 1) / sqrt(0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2)
v̂(d1) = (0, 0, 1, 2, 1, 1, 1) / sqrt(8)
v̂(d1) = (0/sqrt(8), 0/sqrt(8), 1/sqrt(8), 2/sqrt(8), 1/sqrt(8), 1/sqrt(8), 1/sqrt(8))
v̂(d1) = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)
Now our normalized vector v̂(d1) has an L2 norm of ||v̂(d1)||_2 = 1.0
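A quick numeric check of this normalization (a minimal sketch with numpy):

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_hat = v_d1 / np.linalg.norm(v_d1)   # divide by the L2 norm
print v_hat                           # approx. [0. 0. 0.3536 0.7071 0.3536 0.3536 0.3536]
print np.linalg.norm(v_hat)           # 1.0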
IDF
IDF
idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )
where |{d : t ∈ d}| is the count of documents in which 't' is present and tf(t, d) ≠ 0.
Finding IDF
idf(t_i) = log( |D| / (1 + |{d : t_i ∈ d}|) ) = log(2/1) = 0.69314718
idf(football) = log(2 / (1 + 1)) = 0.0
idf(games) = log(2 / (1 + 1)) = 0.0
idf(john) = log(2 / (1 + 2)) = -0.40546510810816444
idf(likes) = log(2 / (1 + 2)) = -0.40546510810816444
idf(mary) = log(2 / (1 + 1)) = 0.0
idf(movies) = log(2 / (1 + 1)) = 0.0
idf(watch) = log(2 / (1 + 1)) = 0.0
idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
TF-IDF weight
Finding TF-IDF weight
M_{|D|×F} × M_idf

[ tf(t1, d1) tf(t2, d1) tf(t3, d1) tf(t4, d1) tf(t5, d1) tf(t6, d1) tf(t7, d1)
  tf(t1, d2) tf(t2, d2) tf(t3, d2) tf(t4, d2) tf(t5, d2) tf(t6, d2) tf(t7, d2) ]
×
[ idf(t1)  0        0        0        0        0        0
  0        idf(t2)  0        0        0        0        0
  0        0        idf(t3)  0        0        0        0
  0        0        0        idf(t4)  0        0        0
  0        0        0        0        idf(t5)  0        0
  0        0        0        0        0        idf(t6)  0
  0        0        0        0        0        0        idf(t7) ]
TF-IDF weight
[ tf(t1, d1)×idf(t1)  tf(t2, d1)×idf(t2)  tf(t3, d1)×idf(t3)  tf(t4, d1)×idf(t4)  tf(t5, d1)×idf(t5)  tf(t6, d1)×idf(t6)  tf(t7, d1)×idf(t7)
  tf(t1, d2)×idf(t1)  tf(t2, d2)×idf(t2)  tf(t3, d2)×idf(t3)  tf(t4, d2)×idf(t4)  tf(t5, d2)×idf(t5)  tf(t6, d2)×idf(t6)  tf(t7, d2)×idf(t7) ]
TF-IDF Normalization
L2 Normalization
M_tf-idf = M_tf-idf / ||M_tf-idf||_2
TF-IDF
Practice with Python and sklearn 7
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, i in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[i])
7 http://scikit-learn.org
N-Grams
N-Gram
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application.a
Unigrams are single words
Bigrams are sequences of two words
Trigrams are sequences of three words
a http://en.wikipedia.org/wiki/N-gram
Bigrams
Bigrams
P(w_i | w_1, w_2, ..., w_{i-1}) ≈ P(w_i | w_{i-1})
Practice with Python
d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']
Trigrams
Trigrams
P(w_i | w_1, w_2, ..., w_{i-1}) ≈ P(w_i | w_{i-2}, w_{i-1})
Practice with Python
d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']
N-Grams
Python code to generate N-Grams from list of words
def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(w_i, w_j) = log2( P(w_i, w_j) / (P(w_i) P(w_j)) ) ≈ log2( N C(w_i, w_j) / (C(w_i) C(w_j)) )
where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus,
C(w_i, w_j) is the frequency of the bigram w_i w_j, and
N is the total number of words in the corpus.
I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )
Mutual Information
from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev. slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) / math.log(2)
    return minfo
t-score
t-score
Statistical test to measure strength of word association
t(w_i, w_j) = ( mean(P(w_i, w_j)) - mean(P(w_i)) mean(P(w_j)) ) / sqrt( σ²(P(w_i, w_j)) + σ²(P(w_i)) σ²(P(w_j)) )
≈ ( C(w_i, w_j) - (1/N) C(w_i) C(w_j) ) / sqrt( C(w_i, w_j) )
where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus,
C(w_i, w_j) is the frequency of the bigram w_i w_j, and
N is the total number of words in the corpus.
t(strong, tea) = ( C(strong, tea) - (1/N) C(strong) C(tea) ) / sqrt( C(strong, tea) )
t-score
from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev. slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = ((gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) /
            math.sqrt(gramcount[gram]))
    return tsc
Document Classification
Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.8
Document classification tasks can be divided into three kinds:
supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents
semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance
unsupervised document classification is entirely executed without reference to external information
8 http://en.wikipedia.org/wiki/Document_classification
Document Classification
Formal Definition
Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning
algorithm we wish to learn a classifier, or a classifier function γ, that maps documents to
classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification
function γ: Γ(T) = γ
γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise
Main approaches in document classification are:
Naïve Bayes (NB)
Support Vector Machines (SVM)
Document Classification
A supervised document classification pipeline 9
9 Image taken from http://www.python-course.eu/text_classification_introduction.php
Naïve Bayes Classification
Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem, or Bayes' rule:
Bayes' rule: P(H|E) = ( P(E|H) × P(H) ) / P(E)
The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidences (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.
Naïve Bayes Classification
Let H be the event of raining and E be the evidence of dark clouds; then we have:
P(raining | dark cloud) = ( P(dark cloud | raining) × P(raining) ) / P(dark cloud)
For multiple evidences:
P(H | E1, E2, ..., En) = ( P(E1, E2, ..., En | H) × P(H) ) / P(E1, E2, ..., En)
With the independence assumption, we can rewrite Bayes' rule as follows:
P(H | E1, E2, ..., En) = ( P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) ) / P(E1, E2, ..., En)
Naïve Bayes Applied in Text Classification
Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Building the Naïve Bayes Model
The Naïve Bayes model for the training set will look like:

|V|  C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
6    T  0.5    15  0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
     E  0.5    12  0.055555556  0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabularies = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total number of word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12
Building the Naïve Bayes Model
To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:
P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555
Testing the NB model
Our test document is:
Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt         2     1     2       0      0      1   ?
To find the posterior probability:
P(ci | W) = P(ci) × Π_{j=1}^{V} P(wj | ci)
Testing the NB model
P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) × P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism)
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5
P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) × P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7
The document is classified as "Terrorism" because it got the highest value.
Preventing Underflow
The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:
P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437
P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289
After handling the underflow problem, our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.
The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
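A minimal sketch reproducing this calculation in Python (my own reimplementation, not code from the tutorial; the per-class word counts are read off the training table, so intermediate tallies can differ slightly from the slide's):

from __future__ import division
import math

# per-class word frequencies from the training table
train = {'terrorism':     {'kill': 4, 'bomb': 3, 'kidnap': 6,
                           'music': 0, 'movie': 1, 'tv': 1},
         'entertainment': {'kill': 0, 'bomb': 1, 'kidnap': 1,
                           'music': 5, 'movie': 4, 'tv': 3}}
prior = {'terrorism': 0.5, 'entertainment': 0.5}

def log_posterior(test, cls):
    counts = train[cls]
    n = sum(counts.values())               # total word frequency in the class
    v = len(counts)                        # vocabulary size |V|
    score = math.log(prior[cls])
    for word, freq in test.items():
        p = (counts[word] + 1) / (n + v)   # Laplace estimate
        score += freq * math.log(p)        # logs prevent underflow
    return score

test = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}
print max(train, key=lambda c: log_posterior(test, c))   # terrorism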
Naïve Bayes Classifier
There are two different ways to set up a Naïve Bayes Classifier:
Multi-variate Bernoulli Model
Multinomial Model
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word w_t from the vocabulary. Dimension t of the vector for document d_i is written B_it, and is either 0 or 1, indicating whether word w_t occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a
a A Comparison of Event Models for Naive Bayes Text Classification. Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will look like:
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by the presence or absence of the word.
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.
In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag of words representation. We can use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.
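A small sketch contrasting the two event models with sklearn (a minimal illustration; the two toy documents and labels are made up):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

docs = ['kill bomb kidnap tv', 'music movie tv']
labels = [0, 1]   # 0 = terrorism, 1 = entertainment

# multinomial model: raw term frequencies
tf = CountVectorizer()
mnb = MultinomialNB().fit(tf.fit_transform(docs), labels)

# multi-variate Bernoulli model: presence/absence features
binary = CountVectorizer(binary=True)
bnb = BernoulliNB().fit(binary.fit_transform(docs), labels)

print mnb.predict(tf.transform(['bomb kidnap']))       # [0]
print bnb.predict(binary.transform(['bomb kidnap']))   # [0]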
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set (x_i, y_i), i = 1, ..., n, where x_i ∈ R^p (input feature vector) and y_i ∈ {1, -1} is the corresponding label, whether (y_i = +1) or (y_i = -1). To start with, we assume that our input feature vectors are linearly separable; that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that
⟨w, x_i⟩ + b > 0 for y_i = +1
⟨w, x_i⟩ + b < 0 for y_i = -1
⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
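A toy check of the decision function ⟨w, x⟩ + b with sklearn's LinearSVC (a sketch; the four training points and the test point are made up; coef_ is w and intercept_ is b):

import numpy as np
from sklearn.svm import LinearSVC

X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 2.0]])
y = np.array([-1, -1, 1, 1])
svm = LinearSVC().fit(X, y)
w, b = svm.coef_[0], svm.intercept_[0]
x = np.array([2.5, 2.0])
print np.dot(w, x) + b > 0   # True -> predicted label +1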
Support Vector Machine
Suppose a bunch of red squares and blue rectangles are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.
If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal length from the margin of the scatter region of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edges of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11
10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and the bibliography.
11 http://www.statsoft.com/textbook/support-vector-machines
Practice Time
Let's try to build a Multinomial Naïve Bayes Classifier with Python and sklearn:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Practice Time
Let's try to build an SVM Classifier with Python and sklearn:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Practice Time
Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK:12
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)
12 This code is adopted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/. Follow the link for a more detailed discussion.
Evaluating Performance of a Classifier
Confusion Matrix
A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.
Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                      Actual
                      Positive             Negative
Predicted  Positive   True Positive (TP)   False Positive (FP)
           Negative   False Negative (FN)  True Negative (TN)
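The same table can be computed with sklearn (a minimal sketch; the two label lists are made up, and note that sklearn puts the actual classes on the rows):

from sklearn.metrics import confusion_matrix

y_actual = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
y_predicted = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0]
print confusion_matrix(y_actual, y_predicted)
# [[5 1]
#  [1 3]]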
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = (TP + TN) / (TP + FP + FN + TN)

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
Evaluating Performance of a Classifier
Precision and Recall
Precision indicates how many of the items that we identified were relevant.
Precision = TP / (TP + FP)
Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".
Recall = TP / (TP + FN)
Evaluating Performance of a Classifier
                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23
Evaluating Performance of a Classifier
Fall-out
It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).
Fall-out = FP / (FP + TN)

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

Fall-out = 77 / (77 + 436) = 0.15
Evaluating Performance of a Classifier
F1 Score
In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a
F1 Score = 2 × (precision × recall) / (precision + recall)
a http://en.wikipedia.org/wiki/F1_score
Evaluating Performance of a Classifier
F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78
F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
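The same arithmetic as a tiny Python helper (a sketch; the function name is mine):

from __future__ import division

def f1(p, r):
    # harmonic mean of precision and recall
    return 2 * p * r / (p + r)

print round(f1(0.87, 0.71), 2)   # 0.78
print round(f1(0.65, 0.84), 2)   # 0.73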
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.
Positive predictive value = TP / (TP + FP)
Positive predictive value = 562 / (562 + 77) = 0.87
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.
NPV = TN / (TN + FN)
NPV = 436 / (436 + 225) = 0.65
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TN / (FP + TN)
Specificity = 436 / (77 + 436) = 0.84
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.
FDR = FP / (FP + TP)
FDR = 77 / (77 + 562) = 0.12
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.
MCC = (TP × TN - FP × FN) / sqrt( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )
MCC = (562 × 436 - 77 × 225) / sqrt( (562 + 77)(562 + 225)(436 + 77)(436 + 225) ) = 0.55
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
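Computing an ROC curve and its AUC with sklearn (a minimal sketch; the four labels and scores are made up):

from sklearn.metrics import roc_curve, auc

y_true = [0, 0, 1, 1]             # actual labels
y_scores = [0.1, 0.4, 0.35, 0.8]  # predicted scores for the positive class
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print auc(fpr, tpr)   # 0.75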
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories, such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln."
    entities = extract_entities(sent)
    print entities
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln."
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13 http://pypi.python.org/pypi/topia.termextract/
References
Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.
Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.
Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.
Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.
Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.
Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php
References
Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.
Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.
Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.
Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044.
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.
Tokenization
Tokenization
Tokenization is the process of breaking a stream of text up into words phrases symbolsor other meaningful elements called tokens The list of tokens becomes input for furtherprocessing such as parsing or text mining
Tokenizig text with Python
import re
def tokenize(text)
tokenizer = recompile(rsquoWrsquo)
return tokenizersplit(textlower())
doc = John likes to watch movies Mary likes too
words = tokenize(doc)
print words
Jaganadh G Elements of Text Mining
Twokenization
Rise of social media introduced new orthographic patterns in digital text Typical example is a tweet wherepeople use abbreviated forms of words emoticons hash-tags etc Generic text tokenization techniques wont yieldgood result in separating words in social media text like tweets A good social media tokenizer has to take care ofemoticons hash-tags shortened urls etc
Social media tokenization with Python using happyfuntokenizing 1
from happyfuntokenizing import Tokenizer
def twokenize(tweetpc=True)
twokenizer = Tokenizer(preserve_case=pc)
return twokenizertokenize(tweet)
tweet = RT USER Relevant 2 clinical text gt Recursive neural networks
Deep Learning Natural Language Processing NLProc httptco
twokens = tokenize(tweet)
1httpsbitbucketorgjaganadhgtwittertokenize
Jaganadh G Elements of Text Mining
Sentence Tokenization
Heuristic sentence boundary detection algorithm 2
Place putative sentence boundaries after all occurrences of ( and maybe - )
Move the boundary after following quotation marks if any
Disqualify a period boundary in the following circumstances
If it is preceded by a known abbreviation of a sort that does not normally occur wordfinally but is commonly followed by a capitalized proper name such as Prof or vsIf it is preceded by a known abbreviation and not followed by an uppercase word Thiswill deal correctly with most usages of abbreviations like etc or Jr which can occursentence medially or finally
Disqualify a boundary with a or if
It is followed by a lowercase letter (or a known name)
Regard other putative sentence boundaries as sentence boundaries
2Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Jaganadh G Elements of Text Mining
Sentence Tokenization
Sentence Tokenization with Python and NLTK
from nltkdata import load
tokenizer = load(rsquotokenizerspunktenglishpicklersquo)
text = How can this be implemented There are a lot of subtleties
such as dot being used in abbreviations
sents = tokenizertokenize(text)
for sent in sents
print sent
Jaganadh G Elements of Text Mining
Counting Words
Word Count - Python
def word_count(text)
words = tokenize(text)
word_freq = dict([(word wordscount(word)) for word
in set(words)])
return word_freq
text = How can this be implemented There are a lot of
subtleties such as dot being used in abbreviations
wc = word_count(text)
for wordcount in wcitems()
print word t t count
Jaganadh G Elements of Text Mining
Finding Word Length
Word Length
def word_length(text)
words = tokenize(text)
word_length =
[word_length__setitem__(len(word)1 +
word_lengthget(len(word)0)) for word in words]
return word_length
text = How can this be implemented There are a lot of
subtleties such as dot being used in abbreviations
wl = word_length(text)
for length count in wlitems()
print There are d words of length d (count length)
Jaganadh G Elements of Text Mining
Word Proportion
Word Proportion
Let C be a corpus where (w1 w2 w3 wn) are the wordsSIZE(C) = length(tokens(C)) simply total number of words
so p(wi C) = f(wiC)SIZE(C) where f(wi C) is the frequency of wi in C
Finding Word Proportion
from __future__ import division
def word_propo(text)
words = tokenize(text)
wc = word_count(text)
propo = dict([(word wc[word]len(words)) for word
in set(words)])
return propo
text = How can this be implemented There are a lot of
subtleties such as dot being used in abbreviations
wp = word_propo(text)
for word propo in wpitems()
print word tt propo
Jaganadh G Elements of Text Mining
Word Proportion
Word Proportion
Let C be a corpus where (w1 w2 w3 wn) are the wordsSIZE(C) = length(tokens(C)) simply total number of words
so p(wi C) = f(wiC)SIZE(C) where f(wi C) is the frequency of wi in C
Finding Word Proportion
from __future__ import division
def word_propo(text)
words = tokenize(text)
wc = word_count(text)
propo = dict([(word wc[word]len(words)) for word
in set(words)])
return propo
text = How can this be implemented There are a lot of
subtleties such as dot being used in abbreviations
wp = word_propo(text)
for word propo in wpitems()
print word tt propo
Jaganadh G Elements of Text Mining
Words Types and Ratio
Words and Types
Words and valid tokens from a given corpus CTypes are total distinct tokens in a given corpus CExampleC = rdquoI shot an elephant in my pajamas He saw the fine fat trout in the brookrdquoThere are 16 words and 15 types in Cwords = (rsquoirsquo rsquoshotrsquo rsquoanrsquo rsquoelephantrsquo rsquoinrsquo rsquomyrsquo rsquopajamasrsquo rsquohersquo rsquosawrsquo rsquothersquo rsquofinersquo rsquofatrsquo rsquotroutrsquo rsquoinrsquo rsquothersquorsquobrookrsquo)types = (rsquoshotrsquo rsquoirsquo rsquosawrsquo rsquoelephantrsquo rsquobrookrsquo rsquofinersquo rsquoanrsquo rsquofatrsquo rsquoinrsquo rsquomyrsquo rsquothersquo rsquohersquo rsquopajamasrsquo rsquotroutrsquo)
Jaganadh G Elements of Text Mining
Words Types and Ratio
Word Type Ratio
WTR(C) = WC(C)C(T ) where WC(C) is total number of words in corpus C and C(T ) is the
count of types in corpus C
Finding Word Type Ratio
def word_type_ratio(text)
words = tokenize(text)
ratio = len(words) len(set(words))
return ratio
text = I shot an elephant in my pajamas He saw the fine
fat trout in the brook
ratio = word_type_ratio(text)
print ratio
Jaganadh G Elements of Text Mining
Words Types and Ratio
Word Type Ratio
WTR(C) = WC(C)C(T ) where WC(C) is total number of words in corpus C and C(T ) is the
count of types in corpus C
Finding Word Type Ratio
def word_type_ratio(text)
words = tokenize(text)
ratio = len(words) len(set(words))
return ratio
text = I shot an elephant in my pajamas He saw the fine
fat trout in the brook
ratio = word_type_ratio(text)
print ratio
Jaganadh G Elements of Text Mining
Finding top N words
Python code to find top N words from a text
from operator import itemgetter
def top_words(textn=50)
wordfreq = word_count(text)
topwords = sorted(wordfreqiteritems() key = itemgetter(1)
reverse=True)[n]
return topwords
text = open(rsquogpl-20txtrsquorsquorrsquo)read()
topwords = top_words(textn=50)
for word count in topwords
print s t d (wordcount)
Jaganadh G Elements of Text Mining
Plotting top N words
Python code to plot top 20 words from a text
import numpy as np
import matplotlibpyplot as plt
def plot_freq(text)
tfw = top_words(text n= 20)
x = range(len(tfw))
np = len(tfw)
y = []
for item in range(np)
y = y + [tfw[item][1]]
pltplot(xyrsquoborsquols=rsquodottedrsquo)
pltxticks(range(0 len(words) + 1 1))
pltyticks(range(0 max(y) + 1 10))
pltxlabel(Word Ranking)
pltylabel(Word Frequency)
pltshow()
text = open(rsquogpl-20txtrsquorsquorrsquo)read()
plot_freq(text)
Jaganadh G Elements of Text Mining
Top N Words
Plot of top 50 words from GPL v2 (without filtering stop words)
Jaganadh G Elements of Text Mining
Plotting top N words
Python code to plot top 50 words from a text This plot will show words in the plot
import numpy as np
import matplotlibpyplot as plt
def plot_freq_tag(text)
tfw = top_words(text n= 50)
words = [tfw[i][0] for i in range(len(tfw))]
x = range(len(tfw))
np = len(tfw)
y = []
for item in range(np)
y = y + [tfw[item][1]]
fig = pltfigure()
ax = figadd_subplot(111xlabel=Word Rankylabel=Word Freqquncy)
axset_title(rsquoTop 50 wordsrsquo)
axplot(x y rsquogo-rsquols=rsquodottedrsquo)
pltxticks(range(0 len(words) + 1 1))
pltyticks(range(0 max(y) + 1 10))
for i label in enumerate(words)
plttext (x[i] y[i] label rotation=45)
pltshow()
text = open(rsquogpl-20txtrsquorsquorrsquo)read()
plot_freq(text)
Jaganadh G Elements of Text Mining
Top N Words
Plot of top 50 words from GPL v2 (without filtering stop words)
Jaganadh G Elements of Text Mining
Plotting histogram of top N words
Python code to plot histogram of top 50 words
import numpy as np
import matplotlibpyplot as plt
def plot_hist(text)
tw = top_words(text)
words = [tw[i][0] for i in range(len(tw))]
freq = [tw[j][1] for j in range(len(tw))]
pos = nparange(len(words))
width = 10
ax = pltaxes(frameon=True)
axset_xticks(pos)
axset_yticks(range(0max(freq)10))
axset_xticklabels(wordsrotation=rsquoverticalrsquofontsize=9)
pltbar(posfreqwidth color=rsquobrsquo)
pltshow()
text = open(rsquogpl-20txtrsquorsquorrsquo)read()
plot_hist(text)
Jaganadh G Elements of Text Mining
Histogram
Histogram of top 50 words from GPL v2 (without filtering stop words)
Jaganadh G Elements of Text Mining
Lexical Dispersion Plot
A lexical dispersion plot shows position of a word in given text I have created somePython code to generate lexical dispersion plot with reference to a given word list3
def dispersion_plot(textwords)
wordst = tokenize(text)
points = [(xy) for x in range(len(wordst))
for y in range(len(words)) if wordst[x] == words[y]]
if points
xy = zip(points)
else
x = y = ()
pltplot(xygoscalex=2)
pltyticks(range(len(words))wordscolor=b)
pltylim(-1len(words))
plttitle(Lexical Dispersion Plot)
pltxlabel(Word Offset)
pltshow()
gpl = open(rsquogpl-20txtrsquorsquorrsquo)read()
dispersion_plot(gpl[rsquosoftwarersquorsquolicensersquorsquocopyrsquorsquognursquorsquoprogramrsquorsquofreersquo])
3Code taken fromhttpnltkgooglecodecomsvntrunkdocapinltkdrawdispersion-pysrchtml and modified
Jaganadh G Elements of Text Mining
Lexical Dispersion Plot
Lexical Dispersion Plot plot from GPL text
Jaganadh G Elements of Text Mining
Tag Cloud
A tag cloud (word cloud or weighted list in visual design) is a visual representation for text datatypically used to depict keyword meta-data (tags) on websites or to visualize free form textThere is an interesting python tool to generate tag clouds from text called pytagcloud 4 herecomes an example of creating tag cloud from first 100 words from GPL text
from pytagcloud import create_tag_image make_tags
from pytagcloudlangcounter import get_tag_counts
def create_tag_cloud(text)
words = tokenize(text)
doc = join(d for d in words[100])
tags = make_tags(get_tag_counts(doc) maxsize=80)
create_tag_image(tags rsquogplpngrsquo size=(900 600)
fontname=rsquoPhilosopherrsquo)
gpl = open(rsquogpl-20txtrsquorsquorrsquo)read()
create_tag_cloud(gpl)
4httpsgithubcomatizoPyTagCloud
Jaganadh G Elements of Text Mining
Tag Cloud
Tag cloud from GPL text
Jaganadh G Elements of Text Mining
Word co-occurrence
Word co-occurrence analysis is used to construct lexicon ontologies etc In general itaims to find similarities between word pairs
A word co-occurrence matrix is a square of N timesN matrix where N corresponds totalnumber of unique words in a corpus A cell mij contains the number of times mi
co-occur with word mj with in a specific contextmdash a natural unit such as a sentence or acertain window of m words Note that the upper and lower triangles of the matrix areidentical since co-occurrence is a symmetric relation a
aJimmy LinScalable Language Processing Algorithms for the Masses A Case Study inComputing Word Co-occurrence Matrices with MapReducewwwaclweborganthologyD08-1044
Jaganadh G Elements of Text Mining
Word co-occurrence

      w1    w2    w3   ...  wn
w1    m11   m12   m13  ...  m1n
w2    m21   m22   m23  ...  m2n
w3    m31   m32   m33  ...  m3n
...
wn    mn1   mn2   mn3  ...  mnn

Shape of a co-occurrence matrix
Jaganadh G Elements of Text Mining
Word co-occurrence
Finding co-occurrence matrix with Python
from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = [corpora[i], corpora[j]]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]
ccm = cooccurrence_matrix_corpus(corpus)
Jaganadh G Elements of Text Mining
Word co-occurrence
Word co-occurrence visualization from 100 positive movie reviews. The plot shows the top first word and the top four associated words; for each associated word, its own associated words are plotted in turn.
Jaganadh G Elements of Text Mining
Stop Words
Stop Words
In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In terms of linguistics these words are called function words. Words like 'a', 'an', 'the' are examples of stop words. There is no single defined set of stop words available; different applications and research groups use different sets of stop words.
Generally stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus compared to content words. Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words
Jaganadh G Elements of Text Mining
Stop Words Filter
def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
    'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
    'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
    'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
    'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
    'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
    'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
    'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
    'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
    'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
    'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
    'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless
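A quick check of the filter, plus an equivalent using NLTK's built-in stop word list (a sketch; it assumes the tokenize function defined earlier and that the NLTK stopwords corpus has been downloaded):

words = tokenize("How can this be implemented in the real world")
print stop_filter(words)

# Equivalent filtering with NLTK's list instead of the hand-coded one
from nltk.corpus import stopwords
stops = set(stopwords.words('english'))
print [word for word in words if word not in stops]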
Jaganadh G Elements of Text Mining
Bag of Words
The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an un-ordered collection of words, disregarding grammar and even word order. a

Analyzing text by considering only the frequency of words is called the bag-of-words model.

ahttp://en.wikipedia.org/wiki/Bag_of_words_model
Jaganadh G Elements of Text Mining
Bag of Words
Example
d1: John likes to watch movies Mary likes too
d2: John also likes to watch football games

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1,
 'likes': 3, 'john': 2, 'mary': 4}
(vocabulary index, after removing stop words)

[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of
the corresponding entry in the dictionary. a

aExample taken from http://en.wikipedia.org/wiki/Bag_of_words_model
Jaganadh G Elements of Text Mining
Bag of Words
Documents
d1: John likes to watch movies Mary likes too
d2: John also likes to watch football games

Vocabulary Index

VI(t) = 0 if t is 'football'
        1 if t is 'games'
        2 if t is 'john'
        3 if t is 'likes'
        4 if t is 'mary'
        5 if t is 'movies'
        6 if t is 'watch'
Jaganadh G Elements of Text Mining
Bag of Words
        football  games  john  likes  mary  movies  watch
doc1    0         0      1     2      1     1       1
doc2    1         1      1     1      0     0       1
Jaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python and sklearn 5
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
        'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()
5http://scikit-learn.org
Jaganadh G Elements of Text Mining
Bag of Words
Creating a Bag of Words with plain Python (just a toy sample):

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
                       in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
           for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies Mary likes too",
     "John also likes to watch football games")
bag_of_words(d)
Jaganadh G Elements of Text Mining
TF-IDF
Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t, d) = tf(t, d) × idf(t)

where 't' is a term in document 'd', and
tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = Σ_{x ∈ d} fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise
Jaganadh G Elements of Text Mining
TF-IDF
TF-IDF
and

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the number of documents in which 't' is present, i.e. where tf(t, d) ≠ 0, and |D| is the cardinality of the document space.
Jaganadh G Elements of Text Mining
TF
TF
tf(t, d) = Σ_{x ∈ d} fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1
Jaganadh G Elements of Text Mining
Document Vector
To create a document vector space:

V(dn) = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))

To represent 'd1' and 'd2' as vectors:

V(d1) = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))
V(d2) = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))

which evaluates to:

V(d1) = (0, 0, 1, 2, 1, 1, 1)
V(d2) = (1, 1, 1, 1, 0, 0, 1)
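The same vectors can be produced with a couple of lines of Python (a sketch; doc_vector is a hypothetical helper, and the token lists assume stop words were already removed):

def doc_vector(tokens, vocab):
    # tf(t, d) for every term in the vocabulary, in index order
    return [tokens.count(t) for t in vocab]

vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']
d1 = ['john', 'likes', 'watch', 'movies', 'mary', 'likes']
d2 = ['john', 'likes', 'watch', 'football', 'games']
print doc_vector(d1, vocab)   # [0, 0, 1, 2, 1, 1, 1]
print doc_vector(d2, vocab)   # [1, 1, 1, 1, 0, 0, 1]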
Jaganadh G Elements of Text Mining
Vector Space Matrix
The document vectors can be represented as a matrix M_{|D|×F}, where |D| is the cardinality of the document space:

M_{|D|×F} = [ 0 0 1 2 1 1 1
              1 1 1 1 0 0 1 ]
Jaganadh G Elements of Text Mining
Vector Normalization
Normalized Vector
A normalized vector is represented as v̂ = v⃗ / ‖v⃗‖_p, where v̂ is the unit vector (the normalized vector), v⃗ is the vector going to be normalized, and ‖v⃗‖_p is the norm (magnitude or length) of the vector v⃗ in the Lp space (Lebesgue spaces). a

ahttp://en.wikipedia.org/wiki/Lp_space
Jaganadh G Elements of Text Mining
Vector Normalization
Length of a vector is calculated using the Euclidean norm 6.
Non-normalized vector: v⃗ = (v1, v2, v3, ..., vn)
Length of the vector: ‖v⃗‖ = sqrt(v1² + v2² + v3² + ... + vn²)
With norm: ‖v⃗‖_p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)
It can be simplified as:

‖v⃗‖_p = ( Σ_{i=1}^{n} |vi|^p )^(1/p)
6http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining
Vector Normalization
L2 Norm
The norm which we apply here is the L2 norm, which is also called the Euclidean norm a.
It is a common norm used to measure the length of a vector, where p = 2.

ahttp://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm
Jaganadh G Elements of Text Mining
Vector Normalization
v⃗(d1) = (0, 0, 1, 2, 1, 1, 1)

v̂(d1) = v⃗(d1) / ‖v⃗(d1)‖_2
      = (0, 0, 1, 2, 1, 1, 1) / sqrt(0² + 0² + 1² + 2² + 1² + 1² + 1²)
      = (0, 0, 1, 2, 1, 1, 1) / √8
      = (0/√8, 0/√8, 1/√8, 2/√8, 1/√8, 1/√8, 1/√8)
      = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Now our normalized vector v̂(d1) has an L2 norm of ‖v̂(d1)‖_2 = 1.0.
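The same arithmetic can be checked with a few lines of numpy (a sketch; np.linalg.norm(v_d1) would give the same L2 norm):

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
norm = np.sqrt(np.sum(v_d1 ** 2))    # L2 norm, sqrt(8) here
v_hat = v_d1 / norm
print v_hat                          # [0. 0. 0.3535 0.7071 0.3535 0.3535 0.3535]
print np.sqrt(np.sum(v_hat ** 2))    # approx 1.0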
Jaganadh G Elements of Text Mining
IDF
IDF
idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the number of documents in which 't' is present, i.e. where tf(t, d) ≠ 0.
Jaganadh G Elements of Text Mining
Finding IDF
idf(ti) = log( |D| / (1 + |{d : ti ∈ d}|) ),  e.g. log(2/1) = 0.69314718

idf(football) = log(2 / (1+1)) = 0.0
idf(games)    = log(2 / (1+1)) = 0.0
idf(john)     = log(2 / (1+2)) = -0.40546510810816444
idf(likes)    = log(2 / (1+2)) = -0.40546510810816444
idf(mary)     = log(2 / (1+1)) = 0.0
idf(movies)   = log(2 / (1+1)) = 0.0
idf(watch)    = log(2 / (1+1)) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
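A small sketch reproducing the numbers above with the formula as given (note that this idf variant can go negative, as it does for 'john' and 'likes'; the token lists assume stop words were already removed):

from __future__ import division
import math

docs = [['john', 'likes', 'watch', 'movies', 'mary', 'likes'],
        ['john', 'likes', 'watch', 'football', 'games']]
D = len(docs)
vocab = sorted(set(w for doc in docs for w in doc))
for t in vocab:
    df = sum(1 for doc in docs if t in doc)   # |{d : t in d}|
    print t, math.log(D / (1 + df))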
Jaganadh G Elements of Text Mining
TF-IDF weight
Finding TF-IDF weight
M_{|D|×F} × M_idf =

[ tf(t1,d1)  tf(t2,d1)  tf(t3,d1)  tf(t4,d1)  tf(t5,d1)  tf(t6,d1)  tf(t7,d1) ]
[ tf(t1,d2)  tf(t2,d2)  tf(t3,d2)  tf(t4,d2)  tf(t5,d2)  tf(t6,d2)  tf(t7,d2) ]

×

[ idf(t1)  0        0        0        0        0        0       ]
[ 0        idf(t2)  0        0        0        0        0       ]
[ 0        0        idf(t3)  0        0        0        0       ]
[ 0        0        0        idf(t4)  0        0        0       ]
[ 0        0        0        0        idf(t5)  0        0       ]
[ 0        0        0        0        0        idf(t6)  0       ]
[ 0        0        0        0        0        0        idf(t7) ]
Jaganadh G Elements of Text Mining
TF-IDF weight
[ tf(t1,d1)×idf(t1)  tf(t2,d1)×idf(t2)  tf(t3,d1)×idf(t3)  tf(t4,d1)×idf(t4)  tf(t5,d1)×idf(t5)  tf(t6,d1)×idf(t6)  tf(t7,d1)×idf(t7) ]
[ tf(t1,d2)×idf(t1)  tf(t2,d2)×idf(t2)  tf(t3,d2)×idf(t3)  tf(t4,d2)×idf(t4)  tf(t5,d2)×idf(t5)  tf(t6,d2)×idf(t6)  tf(t7,d2)×idf(t7) ]
TF-IDF Normalization
L2 Normalization
M_tf-idf = M_tf-idf / ‖M_tf-idf‖_2
Jaganadh G Elements of Text Mining
TF-IDF
Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
        'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, f in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[f])
7http://scikit-learn.org
Jaganadh G Elements of Text Mining
N-Grams
N-Gram
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters; however, the items in question can be phonemes, syllables, letters, words, or base pairs, according to the application. a

Unigrams are single words.
Bigrams are sequences of two words.
Trigrams are sequences of three words.

ahttp://en.wikipedia.org/wiki/N-gram
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P(wi | w1, w2, ..., wi−1) ≈ P(wi | wi−1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P(wi | w1, w2, ..., wi−1) ≈ P(wi | wi−2, wi−1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-grams from a list of words:

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure the strength of word association:

I(wi, wj) = log2( P(wi, wj) / (P(wi) P(wj)) ) ≈ log2( N·C(wi, wj) / (C(wi) C(wj)) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

Example:
I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N·C(strong, tea) / (C(strong) C(tea)) )
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0))
     for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0))
     for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
                       (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))) \
                       / math.log(2)
    return minfo
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure the strength of word association:

t(wi, wj) = ( mean(P(wi, wj)) − mean(P(wi)) mean(P(wj)) ) / sqrt( σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)) )
          ≈ ( C(wi, wj) − (1/N) C(wi) C(wj) ) / sqrt( C(wi, wj) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

Example:
t(strong, tea) = ( C(strong, tea) − (1/N) C(strong) C(tea) ) / sqrt( C(strong, tea) )
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0))
     for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0))
     for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
                     wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
                     math.sqrt(gramcount[gram])
    return tsc
Jaganadh G Elements of Text Mining
Document Classification
Document classification or document categorization is a problem in library science, information science, and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents;
semi-supervised document classification is a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance;
unsupervised document classification is entirely executed without reference to external information.

8http://en.wikipedia.org/wiki/Document_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (or classifier function) γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

The main approaches in document classification are:

Naïve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9

9Image taken from http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem, or Bayes' rule:

P(H|E) = P(E|H) × P(H) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability: the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability: the probability of an event after the evidence is observed.
Jaganadh G Elements of Text Mining
Naïve Bayes Classification

Let H be the event of raining and E be the evidence of a dark cloud; then we have:

P(raining | dark cloud) = P(dark cloud | raining) × P(raining) / P(dark cloud)

For multiple evidences:

P(H | E1, E2, ..., En) = P(E1, E2, ..., En | H) × P(H) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as follows:

P(H | E1, E2, ..., En) = P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) / P(E1, E2, ..., En)
Jaganadh G Elements of Text Mining
Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building the Naïve Bayes Model

The Naïve Bayes model for the training set will look like:

|V|  C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
6    T  0.5    15  0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
     E  0.5    12  0.05555556   0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class:
nterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of a keyword occurrence given a class.
Examples:
P(kill|Terrorism) = (2+1+1) / 15 = 4/15
P(kill|Entertainment) = (0+0+0) / 12 = 0/12
Jaganadh G Elements of Text Mining
Building the Naïve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2+1+1+1) / (15+|V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0+0+0+1) / (12+|V|) = 1/18 = 0.0555
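The add-one estimate is a one-liner; a sketch with a hypothetical helper name reproducing the two values above:

from __future__ import division

def laplace_cond_prob(count_in_class, n_class, vocab_size):
    # (count + 1) / (n + |V|), the add-one (Laplace) estimate
    return (count_in_class + 1) / (n_class + vocab_size)

print laplace_cond_prob(4, 15, 6)   # P(kill|Terrorism)     = 5/21 = 0.2380...
print laplace_cond_prob(0, 12, 6)   # P(kill|Entertainment) = 1/18 = 0.0555...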
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt         2     1     2       0      0      1   ?

To find the posterior probability:

P(ci|W) = P(ci) × Π_{j=1}^{|V|} P(wj|ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)² × P(bomb|Terrorism)¹ × P(kidnap|Terrorism)² × P(music|Terrorism)⁰ × P(movie|Terrorism)⁰ × P(tv|Terrorism)¹
= 0.5 × 0.2380² × 0.1904¹ × 0.3333² × 0.0476⁰ × 0.0952⁰ × 0.0952¹
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10⁻⁵

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)² × P(bomb|Entertainment)¹ × P(kidnap|Entertainment)² × P(music|Entertainment)⁰ × P(movie|Entertainment)⁰ × P(tv|Entertainment)¹
= 0.5 × 0.0555² × 0.1111¹ × 0.1111² × 0.3333⁰ × 0.2777⁰ × 0.1111¹
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10⁻⁷

The document is classified as "Terrorism" because it got the highest value.
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small. In real-world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:

P(Terrorism|W) = log(0.5 × 0.2380² × 0.1904¹ × 0.3333² × 0.0476⁰ × 0.0952⁰ × 0.0952¹)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555² × 0.1111¹ × 0.1111² × 0.3333⁰ × 0.2777⁰ × 0.1111¹)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
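A minimal log-space scorer for the worked example (a sketch; log_posterior is a hypothetical helper, and the conditional probabilities are the smoothed values from the model table):

import math

def log_posterior(prior, cond_probs, counts):
    # sum of log-probabilities instead of a product of probabilities
    score = math.log10(prior)
    for p, c in zip(cond_probs, counts):
        score += c * math.log10(p)
    return score

counts = [2, 1, 2, 0, 0, 1]   # test doc: kill, bomb, kidnap, music, movie, tv
terrorism = [0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952]
entertainment = [0.0555, 0.1111, 0.1111, 0.3333, 0.2777, 0.1111]
print log_posterior(0.5, terrorism, counts)       # approx -4.24
print log_posterior(0.5, entertainment, counts)   # approx -6.63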
Jaganadh G Elements of Text Mining
Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes classifier:
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in the document. a

aA Comparison of Event Models for Naive Bayes Text Classification. Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting the multi-variate Bernoulli model for text classification, our document space representation will look like:
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out we applied the multinomial model, using a simple bag-of-words representation. We can also use a smoothed bag-of-words model, like TF-IDF, in the multinomial naive Bayes model.
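sklearn ships both event models; a sketch on the toy counts above (BernoulliNB's binarize parameter thresholds the counts to presence/absence, mirroring the multi-variate table):

import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

X = np.array([[2, 1, 3, 0, 0, 1],   # D0: kill, bomb, kidnap, music, movie, tv
              [1, 1, 1, 0, 0, 0],   # D1
              [1, 1, 2, 0, 1, 0],   # D2
              [0, 1, 0, 2, 1, 1],   # D3
              [0, 0, 1, 1, 1, 0],   # D4
              [0, 0, 0, 2, 2, 2]])  # D5
y = np.array([0, 0, 0, 1, 1, 1])    # 0 = Terrorism, 1 = Entertainment

mnb = MultinomialNB().fit(X, y)            # uses the raw counts
bnb = BernoulliNB(binarize=0.5).fit(X, y)  # thresholds counts to 0/1
test = np.array([[2, 1, 2, 0, 0, 1]])
print mnb.predict(test), bnb.predict(test)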
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}_{i=1}^{n}, where xi ∈ R^p (input feature vector) and yi ∈ {1, −1} is the corresponding label, whether (yi = +1) or (yi = −1). To start with, we assume that our input feature vectors are linearly separable; that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that:

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = −1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red squares and blue rectangles are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margin of the scatter region of squares and of rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors10. If the data is not linearly separable we have to use kernel tricks 11.

10This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and the bibliography.
11http://www.statsoft.com/textbook/support-vector-machines
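A tiny illustration of a kernel trick with sklearn (a sketch; the RBF kernel and the C/gamma values here are arbitrary choices, and the four points form an XOR-like pattern that no straight line separates):

from sklearn.svm import SVC

X = [[0, 0], [1, 1], [1, 0], [0, 1]]   # XOR-like layout
y = [0, 0, 1, 1]
clf = SVC(kernel='rbf', C=1.0, gamma=0.5)
clf.fit(X, y)
print clf.predict([[0.9, 0.1]])   # predict a point near (1, 0)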
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multinomial Naïve Bayes classifier with Python sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build an SVM classifier with Python sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a multi-variate Naïve Bayes classifier with Python NLTK 12:

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12This code is adopted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/. Follow the link for a more detailed discussion.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                      Actual
                      Positive             Negative
Predicted  Positive   True Positive (TP)   False Positive (FP)
           Negative   False Negative (FN)  True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
Total                 787       513       1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
Total                 787       513       1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate

The error rate is the percentage of things done wrong:

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
Total                 787       513       1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR):

Fall-out = FP / (FP + TN)

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
Total                 787       513       1300

Fall-out = 77 / (77 + 436) = 0.15
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

F1 Score = 2 × (precision × recall) / (precision + recall)

ahttp://en.wikipedia.org/wiki/F1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score (Positive) = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78
F1 Score (Negative) = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive Predictive Value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives:

Positive Predictive Value = TP / (TP + FP)

Positive Predictive Value = 562 / (562 + 77) = 0.87
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative Predictive Value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure:

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons:

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications:

MCC = (TP × TN − FP × FN) / sqrt( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

MCC = (562 × 436 − 77 × 225) / sqrt( (562+77)(562+225)(436+77)(436+225) ) = 0.55
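All of the scores above can be reproduced from the four confusion-matrix cells with a few lines of Python (a sketch using the example counts; compare the printed values with the truncated figures on the previous slides):

from __future__ import division
import math

TP, FP, FN, TN = 562, 77, 225, 436

accuracy  = (TP + TN) / (TP + FP + FN + TN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)
mcc       = (TP * TN - FP * FN) / math.sqrt(
            (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print accuracy, precision, recall, f1, mcc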
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
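A sketch of plotting an ROC curve and computing AUC with sklearn (the labels and scores here are made-up toy values):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

y_true = [1, 1, 0, 1, 0, 0, 1, 0]                    # actual binary labels (toy data)
y_score = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]  # classifier scores
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print auc(fpr, tpr)

plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.show()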
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories, such as the names of persons, organizations, and locations, expressions of times, quantities, monetary values, percentages, etc.
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = ("Abraham Lincoln was born February 12, 1809, the second "
            "child of Thomas Lincoln and Nancy Lincoln")
    entities = extract_entities(sent)
    print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting terms with the Python topia.termextract package 13:

from topia.termextract import extract

extractor = extract.TermExtractor()
text = ("Abraham Lincoln was born February 12, 1809, the "
        "second child of Thomas Lincoln and Nancy Lincoln")
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13http://pypi.python.org/pypi/topia.termextract
Jaganadh G Elements of Text Mining
References
Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
Jaganadh G Elements of Text Mining
Twokenization
Rise of social media introduced new orthographic patterns in digital text Typical example is a tweet wherepeople use abbreviated forms of words emoticons hash-tags etc Generic text tokenization techniques wont yieldgood result in separating words in social media text like tweets A good social media tokenizer has to take care ofemoticons hash-tags shortened urls etc
Social media tokenization with Python using happyfuntokenizing 1
from happyfuntokenizing import Tokenizer
def twokenize(tweetpc=True)
twokenizer = Tokenizer(preserve_case=pc)
return twokenizertokenize(tweet)
tweet = RT USER Relevant 2 clinical text gt Recursive neural networks
Deep Learning Natural Language Processing NLProc httptco
twokens = tokenize(tweet)
1httpsbitbucketorgjaganadhgtwittertokenize
Jaganadh G Elements of Text Mining
Sentence Tokenization
Heuristic sentence boundary detection algorithm 2
Place putative sentence boundaries after all occurrences of ( and maybe - )
Move the boundary after following quotation marks if any
Disqualify a period boundary in the following circumstances
If it is preceded by a known abbreviation of a sort that does not normally occur wordfinally but is commonly followed by a capitalized proper name such as Prof or vsIf it is preceded by a known abbreviation and not followed by an uppercase word Thiswill deal correctly with most usages of abbreviations like etc or Jr which can occursentence medially or finally
Disqualify a boundary with a or if
It is followed by a lowercase letter (or a known name)
Regard other putative sentence boundaries as sentence boundaries
2Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Jaganadh G Elements of Text Mining
Sentence Tokenization
Sentence Tokenization with Python and NLTK
from nltkdata import load
tokenizer = load(rsquotokenizerspunktenglishpicklersquo)
text = How can this be implemented There are a lot of subtleties
such as dot being used in abbreviations
sents = tokenizertokenize(text)
for sent in sents
print sent
Jaganadh G Elements of Text Mining
Counting Words
Word Count - Python
def word_count(text)
words = tokenize(text)
word_freq = dict([(word wordscount(word)) for word
in set(words)])
return word_freq
text = How can this be implemented There are a lot of
subtleties such as dot being used in abbreviations
wc = word_count(text)
for wordcount in wcitems()
print word t t count
Jaganadh G Elements of Text Mining
Finding Word Length
Word Length
def word_length(text)
words = tokenize(text)
word_length =
[word_length__setitem__(len(word)1 +
word_lengthget(len(word)0)) for word in words]
return word_length
text = How can this be implemented There are a lot of
subtleties such as dot being used in abbreviations
wl = word_length(text)
for length count in wlitems()
print There are d words of length d (count length)
Jaganadh G Elements of Text Mining
Word Proportion
Word Proportion
Let C be a corpus where (w1 w2 w3 wn) are the wordsSIZE(C) = length(tokens(C)) simply total number of words
so p(wi C) = f(wiC)SIZE(C) where f(wi C) is the frequency of wi in C
Finding Word Proportion
from __future__ import division
def word_propo(text)
words = tokenize(text)
wc = word_count(text)
propo = dict([(word wc[word]len(words)) for word
in set(words)])
return propo
text = How can this be implemented There are a lot of
subtleties such as dot being used in abbreviations
wp = word_propo(text)
for word propo in wpitems()
print word tt propo
Jaganadh G Elements of Text Mining
Word Proportion
Word Proportion
Let C be a corpus where (w1 w2 w3 wn) are the wordsSIZE(C) = length(tokens(C)) simply total number of words
so p(wi C) = f(wiC)SIZE(C) where f(wi C) is the frequency of wi in C
Finding Word Proportion
from __future__ import division
def word_propo(text)
words = tokenize(text)
wc = word_count(text)
propo = dict([(word wc[word]len(words)) for word
in set(words)])
return propo
text = How can this be implemented There are a lot of
subtleties such as dot being used in abbreviations
wp = word_propo(text)
for word propo in wpitems()
print word tt propo
Jaganadh G Elements of Text Mining
Words Types and Ratio
Words and Types
Words and valid tokens from a given corpus CTypes are total distinct tokens in a given corpus CExampleC = rdquoI shot an elephant in my pajamas He saw the fine fat trout in the brookrdquoThere are 16 words and 15 types in Cwords = (rsquoirsquo rsquoshotrsquo rsquoanrsquo rsquoelephantrsquo rsquoinrsquo rsquomyrsquo rsquopajamasrsquo rsquohersquo rsquosawrsquo rsquothersquo rsquofinersquo rsquofatrsquo rsquotroutrsquo rsquoinrsquo rsquothersquorsquobrookrsquo)types = (rsquoshotrsquo rsquoirsquo rsquosawrsquo rsquoelephantrsquo rsquobrookrsquo rsquofinersquo rsquoanrsquo rsquofatrsquo rsquoinrsquo rsquomyrsquo rsquothersquo rsquohersquo rsquopajamasrsquo rsquotroutrsquo)
Jaganadh G Elements of Text Mining
Words Types and Ratio
Word Type Ratio
WTR(C) = WC(C)C(T ) where WC(C) is total number of words in corpus C and C(T ) is the
count of types in corpus C
Finding Word Type Ratio
def word_type_ratio(text)
words = tokenize(text)
ratio = len(words) len(set(words))
return ratio
text = I shot an elephant in my pajamas He saw the fine
fat trout in the brook
ratio = word_type_ratio(text)
print ratio
Jaganadh G Elements of Text Mining
Words Types and Ratio
Word Type Ratio
WTR(C) = WC(C)C(T ) where WC(C) is total number of words in corpus C and C(T ) is the
count of types in corpus C
Finding Word Type Ratio
def word_type_ratio(text)
words = tokenize(text)
ratio = len(words) len(set(words))
return ratio
text = I shot an elephant in my pajamas He saw the fine
fat trout in the brook
ratio = word_type_ratio(text)
print ratio
Jaganadh G Elements of Text Mining
Finding top N words
Python code to find top N words from a text
from operator import itemgetter
def top_words(textn=50)
wordfreq = word_count(text)
topwords = sorted(wordfreqiteritems() key = itemgetter(1)
reverse=True)[n]
return topwords
text = open(rsquogpl-20txtrsquorsquorrsquo)read()
topwords = top_words(textn=50)
for word count in topwords
print s t d (wordcount)
Jaganadh G Elements of Text Mining
Plotting top N words
Python code to plot top 20 words from a text
import numpy as np
import matplotlibpyplot as plt
def plot_freq(text)
tfw = top_words(text n= 20)
x = range(len(tfw))
np = len(tfw)
y = []
for item in range(np)
y = y + [tfw[item][1]]
pltplot(xyrsquoborsquols=rsquodottedrsquo)
pltxticks(range(0 len(words) + 1 1))
pltyticks(range(0 max(y) + 1 10))
pltxlabel(Word Ranking)
pltylabel(Word Frequency)
pltshow()
text = open(rsquogpl-20txtrsquorsquorrsquo)read()
plot_freq(text)
Jaganadh G Elements of Text Mining
Top N Words
Plot of top 50 words from GPL v2 (without filtering stop words)
Jaganadh G Elements of Text Mining
Plotting top N words
Python code to plot top 50 words from a text This plot will show words in the plot
import numpy as np
import matplotlibpyplot as plt
def plot_freq_tag(text)
tfw = top_words(text n= 50)
words = [tfw[i][0] for i in range(len(tfw))]
x = range(len(tfw))
np = len(tfw)
y = []
for item in range(np)
y = y + [tfw[item][1]]
fig = pltfigure()
ax = figadd_subplot(111xlabel=Word Rankylabel=Word Freqquncy)
axset_title(rsquoTop 50 wordsrsquo)
axplot(x y rsquogo-rsquols=rsquodottedrsquo)
pltxticks(range(0 len(words) + 1 1))
pltyticks(range(0 max(y) + 1 10))
for i label in enumerate(words)
plttext (x[i] y[i] label rotation=45)
pltshow()
text = open(rsquogpl-20txtrsquorsquorrsquo)read()
plot_freq(text)
Jaganadh G Elements of Text Mining
Top N Words
Plot of top 50 words from GPL v2 (without filtering stop words)
Jaganadh G Elements of Text Mining
Plotting histogram of top N words
Python code to plot histogram of top 50 words
import numpy as np
import matplotlibpyplot as plt
def plot_hist(text)
tw = top_words(text)
words = [tw[i][0] for i in range(len(tw))]
freq = [tw[j][1] for j in range(len(tw))]
pos = nparange(len(words))
width = 10
ax = pltaxes(frameon=True)
axset_xticks(pos)
axset_yticks(range(0max(freq)10))
axset_xticklabels(wordsrotation=rsquoverticalrsquofontsize=9)
pltbar(posfreqwidth color=rsquobrsquo)
pltshow()
text = open(rsquogpl-20txtrsquorsquorrsquo)read()
plot_hist(text)
Jaganadh G Elements of Text Mining
Histogram
Histogram of top 50 words from GPL v2 (without filtering stop words)
Jaganadh G Elements of Text Mining
Lexical Dispersion Plot
A lexical dispersion plot shows position of a word in given text I have created somePython code to generate lexical dispersion plot with reference to a given word list3
def dispersion_plot(textwords)
wordst = tokenize(text)
points = [(xy) for x in range(len(wordst))
for y in range(len(words)) if wordst[x] == words[y]]
if points
xy = zip(points)
else
x = y = ()
pltplot(xygoscalex=2)
pltyticks(range(len(words))wordscolor=b)
pltylim(-1len(words))
plttitle(Lexical Dispersion Plot)
pltxlabel(Word Offset)
pltshow()
gpl = open(rsquogpl-20txtrsquorsquorrsquo)read()
dispersion_plot(gpl[rsquosoftwarersquorsquolicensersquorsquocopyrsquorsquognursquorsquoprogramrsquorsquofreersquo])
3Code taken fromhttpnltkgooglecodecomsvntrunkdocapinltkdrawdispersion-pysrchtml and modified
Jaganadh G Elements of Text Mining
Lexical Dispersion Plot
Lexical Dispersion Plot plot from GPL text
Jaganadh G Elements of Text Mining
Tag Cloud
A tag cloud (word cloud or weighted list in visual design) is a visual representation for text datatypically used to depict keyword meta-data (tags) on websites or to visualize free form textThere is an interesting python tool to generate tag clouds from text called pytagcloud 4 herecomes an example of creating tag cloud from first 100 words from GPL text
from pytagcloud import create_tag_image make_tags
from pytagcloudlangcounter import get_tag_counts
def create_tag_cloud(text)
words = tokenize(text)
doc = join(d for d in words[100])
tags = make_tags(get_tag_counts(doc) maxsize=80)
create_tag_image(tags rsquogplpngrsquo size=(900 600)
fontname=rsquoPhilosopherrsquo)
gpl = open(rsquogpl-20txtrsquorsquorrsquo)read()
create_tag_cloud(gpl)
4httpsgithubcomatizoPyTagCloud
Jaganadh G Elements of Text Mining
Tag Cloud
Tag cloud from GPL text
Jaganadh G Elements of Text Mining
Word co-occurrence
Word co-occurrence analysis is used to construct lexicon ontologies etc In general itaims to find similarities between word pairs
A word co-occurrence matrix is a square of N timesN matrix where N corresponds totalnumber of unique words in a corpus A cell mij contains the number of times mi
co-occur with word mj with in a specific contextmdash a natural unit such as a sentence or acertain window of m words Note that the upper and lower triangles of the matrix areidentical since co-occurrence is a symmetric relation a
aJimmy LinScalable Language Processing Algorithms for the Masses A Case Study inComputing Word Co-occurrence Matrices with MapReducewwwaclweborganthologyD08-1044
Jaganadh G Elements of Text Mining
Word co-occourance
w1 w2 w3 wnw1 m11 m12 m13 m1n
w2 m21 m22 m23 m2n
w3 m31 m32 m33 m3n
wn mn1 mn2 mn3 mnnShape of a co-occurrence matrix
Jaganadh G Elements of Text Mining
Word co-occurrence
Finding co-occurrence matrix with Python
def cooccurrence_matrix_corpus(corpus)
matrix = defaultdict(lambda defaultdict(int))
for corpora in corpus
for i in xrange(len(corpora)-1)
for j in xrange(i+1 len(corpora))
word1 word2 = [corpora[i]corpora[j]]
matrix[word1][word2] += 1
matrix[word2][word1] += 1
return matrix
corpus = [[rsquow1rsquorsquow2rsquorsquow3rsquo][rsquow4rsquorsquow5rsquorsquow6rsquo]]
ccm = cooccurrence_matrix_corpus(corpus)
Jaganadh G Elements of Text Mining
Word co-occurrence
hong
toy zdenek
czech
movie
julie
like
chan
story
tango
one
sverak
martial
woody
dating
film
first
Word co-occurrence visualization from 100 positive movie reviews The plot show top first word and top four
associated words For each associated words again associated words are plotted
Jaganadh G Elements of Text Mining
Stop Words
Stop Words
In computing stop words are words which are filtered out prior to or after processing ofnatural language data (text) In terms of linguistics these words are called as functionwords Words like rsquoarsquo rsquoanrsquo rsquothersquo are examples for stop words There is no defined set ofstop words available Different applications and research groups uses different sets o stopwordsGenerally stop words are omitted in text mining process The frequency of stop wordswill be very high in any corpus compared to content wordsPointers to some good stopword list is available at httpenwikipediaorgwikiStop_words
Jaganadh G Elements of Text Mining
Stop Words Filter
def stop_filter(words)
stops = [rsquoirsquo rsquomersquo rsquomyrsquo rsquomyselfrsquo rsquowersquo rsquoourrsquo rsquooursrsquo rsquoourselvesrsquo rsquoyoursquo
rsquoyourrsquo rsquoyoursrsquo rsquoyourselfrsquo rsquoyourselvesrsquo rsquohersquo rsquohimrsquo rsquohisrsquo rsquohimselfrsquo rsquoshersquo
rsquoherrsquo rsquohersrsquo rsquoherselfrsquo rsquoitrsquo rsquoitsrsquo rsquoitselfrsquorsquotheyrsquo rsquothemrsquo rsquotheirrsquo rsquotheirsrsquo
rsquothemselvesrsquo rsquowhatrsquo rsquowhichrsquo rsquowhorsquo rsquowhomrsquo rsquothisrsquo rsquothatrsquo rsquothesersquo rsquothosersquo
rsquoamrsquo rsquoisrsquo rsquoarersquo rsquowasrsquo rsquowerersquorsquobersquo rsquobeenrsquo rsquobeingrsquo rsquohaversquo rsquohasrsquo rsquohadrsquo
rsquohavingrsquo rsquodorsquo rsquodoesrsquo rsquodidrsquo rsquodoingrsquo rsquoarsquo rsquoanrsquo rsquothersquo rsquoandrsquo rsquobutrsquo rsquoifrsquo rsquoorrsquo rsquobecausersquo
rsquoasrsquo rsquountilrsquo rsquowhilersquo rsquoofrsquo rsquoatrsquo rsquobyrsquo rsquoforrsquorsquowithrsquo rsquoaboutrsquo rsquoagainstrsquo rsquobetweenrsquo rsquointorsquo
rsquothroughrsquo rsquoduringrsquo rsquobeforersquo rsquoafterrsquo rsquoaboversquo rsquobelowrsquo rsquotorsquo rsquofromrsquo rsquouprsquo rsquodownrsquo rsquoinrsquo rsquooutrsquo
rsquoonrsquo rsquooffrsquo rsquooverrsquo rsquounderrsquo rsquoagainrsquo rsquofurtherrsquorsquothenrsquo rsquooncersquo rsquoherersquo rsquotherersquo rsquowhenrsquo rsquowherersquo
rsquowhyrsquo rsquohowrsquo rsquoallrsquo rsquoanyrsquo rsquobothrsquo rsquoeachrsquo rsquofewrsquo rsquomorersquo rsquomostrsquo rsquootherrsquo rsquosomersquo rsquosuchrsquo rsquonorsquo
rsquonorrsquorsquonotrsquo rsquoonlyrsquo rsquoownrsquo rsquosamersquo rsquosorsquo rsquothanrsquo rsquotoorsquo rsquoveryrsquo rsquosrsquo rsquotrsquo rsquocanrsquo rsquowillrsquo rsquojustrsquo
rsquodonrsquo rsquoshouldrsquo rsquonowrsquo]
stopless = [word in words if word not in stops]
return stopless
Jaganadh G Elements of Text Mining
Bag of Words
The bag-of-words model is a simplifying representation used in natural languageprocessing and information retrieval (IR) In this model a text (such as a sentence or adocument) is represented as an un-ordered collection of words disregarding grammarand even word order a
Analyzing text by only analyzing frequency of words is called as bag of words model
ahttpenwikipediaorgwikiBag of words model
Jaganadh G Elements of Text Mining
Bag of Words
Example
d1 John likes to watch movies Mary likes too
d2 John also likes to watch football games
rsquofootballrsquo 0 rsquowatchrsquo 6 rsquomoviesrsquo 5 rsquogamesrsquo 1
rsquolikesrsquo 3 rsquojohnrsquo 2 rsquomaryrsquo 4
after removing stopwords
[0 0 1 2 1 1 1]
[1 1 1 1 0 0 1]
each entry of the vectors refers to count of
the corresponding entry in the dictionary
a
aExample taken from httpenwikipediaorgwikiBag of words model
Jaganadh G Elements of Text Mining
Bag of Words
Documents
d1 John likes to watch movies Mary likes too
d2 John also likes to watch football games
Vocabulary Index
V I(t) =
0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo
Jaganadh G Elements of Text Mining
Bag of Words
Documents
d1 John likes to watch movies Mary likes too
d2 John also likes to watch football games
Vocabulary Index
V I(t) =
0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo
Jaganadh G Elements of Text Mining
Bag of Words
football games john likes mary movies watch
doc1 0 0 1 2 1 1 1
doc2 1 1 1 1 0 0 1
Jaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python and sklearn 5
from sklearnfeature_extractiontext import
CountVectorizer
vectorizer = CountVectorizer(analyzer=rsquowordrsquo
min_n=1stop_words=rsquoenglishrsquo)
docs = (rsquoJohn likes to watch movies Mary
likes toorsquorsquoJohn also likes to watch football gamesrsquo)
bow = vectorizerfit_transform(docs)
print vectorizervocabulary_
print bowtoarray()
5httpscikit-learnorgJaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python Just for sample -(
def bag_of_words(docs)
stops = [rsquotorsquorsquotoorsquorsquoalsorsquo]
token_list = [tokenize(doc) for doc in docs]
vocab = list(set(token_list[0])union(token_list))
vocab =[v for v in vocab if v not in stops and len(v) gt 1]
vocab_idex = dict( [ ( word vocabindex(word) ) for word
in vocab] )
bow = [[tokenscount(word) for word in vocab_idexkeys()]
for tokens in token_list]
print vocab_idex
for bag in bow
print bag
d = (John likes to watch movies Mary likes too
John also likes to watch football games)
bag_of_words(d)
Jaganadh G Elements of Text Mining
TF-IDF
Tf–idf, term frequency–inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.
TF-IDF
$tf\!-\!idf(t) = tf(t, d) \times idf(t)$
where 't' is a term in document 'd';
tf(t, d): how many times the term 't' is present in 'd'
$tf(t, d) = \sum_{x \in d} fr(x, t)$
where
$fr(x, t) = \begin{cases} 1 & \text{if } x = t \\ 0 & \text{otherwise} \end{cases}$
Jaganadh G Elements of Text Mining
TF-IDF
TF-IDF
and
$idf(t) = \log \frac{|D|}{1 + |\{d : t \in d\}|}$
where $|\{d : t \in d\}|$ is the number of documents where 't' is present (i.e. $tf(t, d) \neq 0$), and $|D|$ is the cardinality of the document space.
Jaganadh G Elements of Text Mining
TF
TF
$tf(t, d) = \sum_{x \in d} fr(x, t)$
$fr(x, t)$ is a simple indicator function:
$fr(x, t) = \begin{cases} 1 & \text{if } x = t \\ 0 & \text{otherwise} \end{cases}$
Example: $tf('john', d_1) = 1$
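A small sketch of this raw term frequency in Python (the function name is ours, for illustration):

def tf(term, doc_tokens):
    # raw term frequency: how many tokens of the document equal 'term'
    return sum(1 for x in doc_tokens if x == term)

d1 = "john likes to watch movies mary likes too".split()
print tf('john', d1)   # 1
print tf('likes', d1)  # 2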
Jaganadh G Elements of Text Mining
Document Vector
To create a document vector space:
$\vec{v}_{d_n} = (tf(t_1, d_n), tf(t_2, d_n), \ldots, tf(t_n, d_n))$
To represent 'd1' and 'd2' as vectors:
$\vec{v}_{d_1} = (tf(t_1, d_1), tf(t_2, d_1), tf(t_3, d_1), tf(t_4, d_1), tf(t_5, d_1), tf(t_6, d_1), tf(t_7, d_1))$
$\vec{v}_{d_2} = (tf(t_1, d_2), tf(t_2, d_2), tf(t_3, d_2), tf(t_4, d_2), tf(t_5, d_2), tf(t_6, d_2), tf(t_7, d_2))$
which evaluates to
$\vec{v}_{d_1} = (0, 0, 1, 2, 1, 1, 1)$
$\vec{v}_{d_2} = (1, 1, 1, 1, 0, 0, 1)$
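A minimal sketch that builds these vectors over the vocabulary index above:

vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']
d1 = "john likes watch movies mary likes".split()   # stopwords already dropped
d2 = "john likes watch football games".split()
print [d1.count(t) for t in vocab]   # [0, 0, 1, 2, 1, 1, 1]
print [d2.count(t) for t in vocab]   # [1, 1, 1, 1, 0, 0, 1]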
Jaganadh G Elements of Text Mining
Vector Space Matrix
The document vectors can be represented as a matrix
$M_{|D| \times F}$
where $|D|$ is the cardinality of the document space.
$M_{|D| \times F} = \begin{bmatrix} 0 & 0 & 1 & 2 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 0 & 0 & 1 \end{bmatrix}$
Jaganadh G Elements of Text Mining
Vector Normalization
Normalized Vector
A normalized vector is represented as $\hat{v} = \frac{\vec{v}}{\|\vec{v}\|_p}$, where $\hat{v}$ is the unit vector, or the normalized vector, $\vec{v}$ is the vector going to be normalized, and $\|\vec{v}\|_p$ is the norm (magnitude or length) of the vector $\vec{v}$ in the $L^p$ space (Lebesgue spaces). a
a) http://en.wikipedia.org/wiki/Lp_space
Jaganadh G Elements of Text Mining
Vector Normalization
The length of a vector is calculated using the Euclidean norm. 6
Non-normalized vector: $\vec{v} = (v_1, v_2, v_3, \ldots, v_n)$
Length of vector: $\|\vec{v}\| = \sqrt{v_1^2 + v_2^2 + v_3^2 + \cdots + v_n^2}$
With norm: $\|\vec{v}\|_p = (|v_1|^p + |v_2|^p + |v_3|^p + \cdots + |v_n|^p)^{1/p}$
It can be simplified as
$\|\vec{v}\|_p = \left( \sum_{i=1}^{n} |v_i|^p \right)^{1/p}$
6) http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining
Vector Normalization
L2 Norm
The norm we apply here is the L2 norm, also called the Euclidean norm. a
It is a common norm used to measure the length of a vector, where p = 2.
a) http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm
Jaganadh G Elements of Text Mining
Vector Normalization
$\vec{v}_{d_1} = (0, 0, 1, 2, 1, 1, 1)$
$\hat{v}_{d_1} = \frac{\vec{v}_{d_1}}{\|\vec{v}_{d_1}\|_2}$
$\hat{v}_{d_1} = \frac{(0, 0, 1, 2, 1, 1, 1)}{\sqrt{0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2}}$
$\hat{v}_{d_1} = \frac{(0, 0, 1, 2, 1, 1, 1)}{\sqrt{8}}$
$\hat{v}_{d_1} = (\frac{0}{\sqrt{8}}, \frac{0}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{2}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}})$
$\hat{v}_{d_1} = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)$
Now our normalized vector $\hat{v}_{d_1}$ has a L2-norm $\|\hat{v}_{d_1}\|_2 = 1.0$.
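To check the arithmetic, a minimal normalization sketch (the helper name is ours):

import math

def l2_normalize(vec):
    # divide each component by the Euclidean (L2) norm of the vector
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

print l2_normalize([0, 0, 1, 2, 1, 1, 1])
# [0.0, 0.0, 0.3535..., 0.7071..., 0.3535..., 0.3535..., 0.3535...]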
Jaganadh G Elements of Text Mining
IDF
IDF
$idf(t) = \log \frac{|D|}{1 + |\{d : t \in d\}|}$
where $|\{d : t \in d\}|$ is the number of documents where 't' is present (i.e. $tf(t, d) \neq 0$).
Jaganadh G Elements of Text Mining
Finding IDF
$idf(t_i) = \log \frac{|D|}{1 + |\{d : t_i \in d\}|}$, e.g. $\log \frac{2}{1} = 0.69314718$
idf(football) = log(2 / (1 + 1)) = 0.0
idf(games)    = log(2 / (1 + 1)) = 0.0
idf(john)     = log(2 / (1 + 2)) = -0.40546510810816444
idf(likes)    = log(2 / (1 + 2)) = -0.40546510810816444
idf(mary)     = log(2 / (1 + 1)) = 0.0
idf(movies)   = log(2 / (1 + 1)) = 0.0
idf(watch)    = log(2 / (1 + 1)) = 0.0
idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
Jaganadh G Elements of Text Mining
TF-IDF weight
Finding TF-IDF weight
$M_{|D| \times F} \times M_{idf}$
$\begin{bmatrix} tf(t_1, d_1) & tf(t_2, d_1) & tf(t_3, d_1) & tf(t_4, d_1) & tf(t_5, d_1) & tf(t_6, d_1) & tf(t_7, d_1) \\ tf(t_1, d_2) & tf(t_2, d_2) & tf(t_3, d_2) & tf(t_4, d_2) & tf(t_5, d_2) & tf(t_6, d_2) & tf(t_7, d_2) \end{bmatrix} \times \begin{bmatrix} idf(t_1) & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & idf(t_2) & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & idf(t_3) & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & idf(t_4) & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & idf(t_5) & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & idf(t_6) & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & idf(t_7) \end{bmatrix}$
Jaganadh G Elements of Text Mining
TF-IDF weight
$\begin{bmatrix} tf(t_1, d_1) \times idf(t_1) & tf(t_2, d_1) \times idf(t_2) & \cdots & tf(t_7, d_1) \times idf(t_7) \\ tf(t_1, d_2) \times idf(t_1) & tf(t_2, d_2) \times idf(t_2) & \cdots & tf(t_7, d_2) \times idf(t_7) \end{bmatrix}$
Jaganadh G Elements of Text Mining
TF-IDF Normalization
L2 Normalization
$M_{tf\!-\!idf} = \frac{M_{tf\!-\!idf}}{\|M_{tf\!-\!idf}\|_2}$
Jaganadh G Elements of Text Mining
TF-IDF
Practice with Python and sklearn 7
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm='l2')
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, f in zip(vectorizer.vocabulary_, tfd.idf_):
    print '%r => %r' % (w, f)

7) http://scikit-learn.org
Jaganadh G Elements of Text Mining
N-Grams
N-Gram
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application. a
Unigrams are single words
Bigrams are sequences of two words
Trigrams are sequences of three words
a) http://en.wikipedia.org/wiki/N-gram
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
$P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-1})$
Practice with Python
d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [' '.join(bigram) for bigram in ibigrams]
print bigrams
['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
$P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1})$
Practice with Python
d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [' '.join(trigram) for trigram in itrigrams]
print trigrams
['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from list of words
def ngrams(words, n=2):
    grams = [' '.join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure the strength of word association:
$I(w_i, w_j) = \log_2 \frac{P(w_i, w_j)}{P(w_i) P(w_j)} \approx \log_2 \frac{N \, C(w_i, w_j)}{C(w_i) C(w_j)}$
where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus,
$C(w_i, w_j)$ is the frequency of the bigram $w_i w_j$, and
$N$ is the total number of words in the corpus.
$I(strong, tea) = \log_2 \frac{P(strong, tea)}{P(strong) P(tea)} \approx \log_2 \frac{N \, C(strong, tea)}{C(strong) C(tea)}$
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))
            / math.log(2))
    return minfo
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure the strength of word association:
$t(w_i, w_j) = \frac{\overline{P(w_i, w_j)} - \overline{P(w_i)} \, \overline{P(w_j)}}{\sqrt{\sigma^2(P(w_i, w_j)) + \sigma^2(P(w_i)) \, \sigma^2(P(w_j))}} \approx \frac{C(w_i, w_j) - \frac{1}{N} C(w_i) C(w_j)}{\sqrt{C(w_i, w_j)}}$
where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus,
$C(w_i, w_j)$ is the frequency of the bigram $w_i w_j$, and
$N$ is the total number of words in the corpus.
$t(strong, tea) = \frac{C(strong, tea) - \frac{1}{N} C(strong) C(tea)}{\sqrt{C(strong, tea)}}$
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = ((gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]])
            / math.sqrt(gramcount[gram]))
    return tsc
Jaganadh G Elements of Text Mining
Document Classification
Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8
Document classification tasks can be divided into three kinds:
supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents
semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance
unsupervised document classification is entirely executed without reference to external information
8) http://en.wikipedia.org/wiki/Document_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let $C = (c_1, c_2, c_3, \ldots, c_m)$ be a set of pre-defined categories.
Let $D = (d_1, d_2, d_3, \ldots, d_n)$ be a set of documents to be classified.
Given a training set $T$ of labeled documents $\langle d, c \rangle$, where $\langle d, c \rangle \in D \times C$, using a learning algorithm we wish to learn a classifier (or classifier function) $\gamma$ that maps documents to classes: $\gamma : D \rightarrow C$.
A supervised learning algorithm $\Gamma$ takes a training set $T$ and emits a learned classification function $\gamma$: $\Gamma(T) = \gamma$
$\gamma(c_i, d_j) = \begin{cases} 1 & \text{if } d_j \text{ belongs to } c_i \\ 0 & \text{otherwise} \end{cases}$
The main approaches in document classification are:
Naïve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline. 9
9) Image taken from http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
Naïve Bayes Classification
Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.
Bayes' rule: $P(H|E) = \frac{P(E|H) \times P(H)}{P(E)}$
The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidences (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.
Jaganadh G Elements of Text Mining
Naïve Bayes Classification
Let H be the event of raining and E be the evidence of dark clouds; then we have
$P(raining \mid dark\ cloud) = \frac{P(dark\ cloud \mid raining) \times P(raining)}{P(dark\ cloud)}$
For multiple evidences:
$P(H \mid E_1, E_2, \ldots, E_n) = \frac{P(E_1, E_2, \ldots, E_n \mid H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$
With the independence assumption, we can rewrite Bayes' rule as follows:
$P(H \mid E_1, E_2, \ldots, E_n) = \frac{P(E_1|H) \times P(E_2|H) \times \cdots \times P(E_n|H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$
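A tiny numeric sketch of the rule (the probabilities here are made up for illustration):

# hypothetical numbers: P(raining), P(dark cloud | raining), P(dark cloud)
p_rain = 0.3
p_cloud_given_rain = 0.9
p_cloud = 0.5

p_rain_given_cloud = p_cloud_given_rain * p_rain / p_cloud
print p_rain_given_cloud  # 0.54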
Jaganadh G Elements of Text Mining
Naïve Bayes Applied in Text Classification
Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naïve Bayes Model
The Naïve Bayes model for the training set will look like:

|V|  C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
 6   T  0.5    15  0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
     E  0.5    12  0.05555556   0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary entries = 6
P(Ci) = the prior probability of each class = number of documents in a class / number of all documents
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12
Jaganadh G Elements of Text Mining
Building Naïve Bayes Model
To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:
P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt          2     1     2       0      0     1   ?

To find the posterior probability:
$P(c_i \mid W) = P(c_i) \times \prod_{j=1}^{V} P(w_j \mid c_i)$
Jaganadh G Elements of Text Mining
Testing the NB model
$P(Terrorism \mid W) = P(Terrorism) \times P(kill|Terrorism)^2 \times P(bomb|Terrorism)^1 \times P(kidnap|Terrorism)^2 \times P(music|Terrorism)^0 \times P(movie|Terrorism)^0 \times P(tv|Terrorism)^1$
$= 0.5 \times 0.2380^2 \times 0.1904^1 \times 0.3333^2 \times 0.0476^0 \times 0.0952^0 \times 0.0952^1$
$= 0.5 \times 0.0566 \times 0.1904 \times 0.1110 \times 1 \times 1 \times 0.0952$
$= 5.7 \times 10^{-5}$
$P(Entertainment \mid W) = P(Entertainment) \times P(kill|Entertainment)^2 \times P(bomb|Entertainment)^1 \times P(kidnap|Entertainment)^2 \times P(music|Entertainment)^0 \times P(movie|Entertainment)^0 \times P(tv|Entertainment)^1$
$= 0.5 \times 0.0555^2 \times 0.1111^1 \times 0.1111^2 \times 0.3333^0 \times 0.2777^0 \times 0.1111^1$
$= 0.5 \times 0.0030 \times 0.1111 \times 0.0123 \times 1 \times 1 \times 0.1111$
$= 2.27 \times 10^{-7}$
The document is classified as "Terrorism" because that class got the highest value.
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:
$\log P(Terrorism \mid W) = \log(0.5 \times 0.2380^2 \times 0.1904^1 \times 0.3333^2 \times 0.0476^0 \times 0.0952^0 \times 0.0952^1)$
$= \log(0.5) + 2\log(0.2380) + 1\log(0.1904) + 2\log(0.3333) + 0\log(0.0476) + 0\log(0.0952) + 1\log(0.0952)$
$= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213 = -4.2437$
$\log P(Entertainment \mid W) = \log(0.5 \times 0.0555^2 \times 0.1111^1 \times 0.1111^2 \times 0.3333^0 \times 0.2777^0 \times 0.1111^1)$
$= \log(0.5) + 2\log(0.0555) + 1\log(0.1111) + 2\log(0.1111) + 0\log(0.3333) + 0\log(0.2777) + 1\log(0.1111)$
$= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542 = -6.6289$
After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.
The section on the Naïve Bayes classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
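A minimal sketch of the same computation in Python (class word counts are summed from the training table above; since the code derives its own totals, the exact scores may differ slightly from the hand-worked figures):

import math

# per-class word counts summed from the training table above
counts = {
    'Terrorism':     {'kill': 4, 'bomb': 3, 'kidnap': 6, 'music': 0, 'movie': 1, 'tv': 1},
    'Entertainment': {'kill': 0, 'bomb': 1, 'kidnap': 1, 'music': 5, 'movie': 4, 'tv': 3},
}
priors = {'Terrorism': 0.5, 'Entertainment': 0.5}
test = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}

def log_posterior(cls):
    vocab = len(counts[cls])           # |V| = 6
    total = sum(counts[cls].values())  # n_i
    score = math.log(priors[cls])
    for word, freq in test.items():
        # Laplace-smoothed conditional probability, accumulated in log space
        p = (counts[cls][word] + 1) / float(total + vocab)
        score += freq * math.log(p)
    return score

for cls in counts:
    print cls, log_posterior(cls)  # Terrorism gets the higher (less negative) score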
Jaganadh G Elements of Text Mining
Naïve Bayes Classifier
There are two different ways to set up a Naïve Bayes classifier:
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in a document. a
a) A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we adopt the multi-variate Bernoulli model for text classification, our document space representation will look like:
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by the presence or absence of the word.
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model, a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.
In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag-of-words representation. We can use a smoothed bag-of-words model, like TF-IDF, in the multinomial naive Bayes model.
Jaganadh G Elements of Text Mining
Support Vector Machine
A Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^p$ (input feature vector) and $y_i \in \{1, -1\}$ is the corresponding label, whether ($y_i = +1$) or ($y_i = -1$). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function $f(x) = \langle w, x \rangle + b$, where $w \in \mathbb{R}^p$ (weight vector) and $b \in \mathbb{R}$ (bias), such that
$\langle w, x_i \rangle + b > 0$ for $y_i = +1$
$\langle w, x_i \rangle + b < 0$ for $y_i = -1$
$\langle w, x \rangle + b = 0$ is the decision hyperplane. There can be multiple hyperplanes which separate the positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.
If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, exactly equally distant from the margins of the scatter regions of the squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors. 10 If the data is not linearly separable we have to use kernel tricks. 11
10) This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.
11) http://www.statsoft.com/textbook/support-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multinomial Naïve Bayes classifier with Python and sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a SVM classifier with Python and sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multi-variate Naïve Bayes classifier with Python and NLTK: 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12) This code is adopted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.
Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                         Actual
                  Positive              Negative
Predicted Positive  True Positive (TP)    False Positive (FP)
          Negative  False Negative (FN)   True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
$Accuracy = \frac{TP + TN}{TP + FP + FN + TN}$

                    Actual
              Positive  Negative  Total
Predicted Positive   562       77    639
          Negative   225      436    661
Total                787      513   1300

$Accuracy = \frac{562 + 436}{562 + 77 + 225 + 436} = 0.76$
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision indicates how many of the items that we identified were relevant.
$Precision = \frac{TP}{TP + FP}$
Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".
$Recall = \frac{TP}{TP + FN}$
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
                    Actual
              Positive  Negative  Total
Predicted Positive   562       77    639
          Negative   225      436    661
Total                787      513   1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong.
$ErrorRate = \frac{FP + FN}{TP + FP + FN + TN}$

                    Actual
              Positive  Negative  Total
Predicted Positive   562       77    639
          Negative   225      436    661
Total                787      513   1300

$ErrorRate = \frac{77 + 225}{562 + 77 + 225 + 436} = 0.23$
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).
$Fall\text{-}out = \frac{FP}{FP + TN}$

                    Actual
              Positive  Negative  Total
Predicted Positive   562       77    639
          Negative   225      436    661
Total                787      513   1300

$Fall\text{-}out = \frac{77}{77 + 436} = 0.15$
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of precision and recall, where an F1 score reaches its best value at 1 and its worst score at 0. a
$F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}$
a) http://en.wikipedia.org/wiki/F1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive F1 Score = $2 \cdot \frac{0.87 \cdot 0.71}{0.87 + 0.71} = 0.78$
Negative F1 Score = $2 \cdot \frac{0.65 \cdot 0.84}{0.65 + 0.84} = 0.73$
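All of the figures above can be reproduced from the four confusion-matrix cells (a small sketch):

from __future__ import division

TP, FP, FN, TN = 562, 77, 225, 436

accuracy   = (TP + TN) / (TP + FP + FN + TN)
error_rate = (FP + FN) / (TP + FP + FN + TN)
precision  = TP / (TP + FP)
recall     = TP / (TP + FN)
fall_out   = FP / (FP + TN)
f1         = 2 * precision * recall / (precision + recall)

print accuracy, error_rate    # 0.768, 0.232
print precision, recall, f1   # 0.879, 0.714, 0.788
print fall_out                # 0.150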
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.
$Positive\ predictive\ value = \frac{TP}{TP + FP} = \frac{562}{562 + 77} = 0.87$
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.
$NPV = \frac{TN}{TN + FN} = \frac{436}{436 + 225} = 0.65$
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
$Specificity = \frac{TN}{FP + TN} = \frac{436}{77 + 436} = 0.84$
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.
$FDR = \frac{FP}{FP + TP} = \frac{77}{77 + 562} = 0.12$
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.
$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$
$MCC = \frac{562 \times 436 - 77 \times 225}{\sqrt{(562 + 77)(562 + 225)(436 + 77)(436 + 225)}} = 0.55$
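The same value, computed directly:

import math

TP, FP, FN, TN = 562, 77, 225, 436
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print round(mcc, 2)  # 0.55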
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in a classification model. It applies to binary classification problems. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis, and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
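A quick sketch with sklearn's metrics module (the labels and scores below are toy values for illustration):

from sklearn.metrics import roc_curve, auc

y_true = [0, 0, 1, 1]            # toy ground-truth labels
y_score = [0.1, 0.4, 0.35, 0.8]  # toy predicted scores
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print auc(fpr, tpr)  # 0.75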
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories, such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract: 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13) http://pypi.python.org/pypi/topia.termextract
Jaganadh G Elements of Text Mining
References
Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.
Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.
Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.
Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.
Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.
Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.
Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.
Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.
Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.
Jaganadh G Elements of Text Mining
Bag of Words
The bag-of-words model is a simplifying representation used in natural languageprocessing and information retrieval (IR) In this model a text (such as a sentence or adocument) is represented as an un-ordered collection of words disregarding grammarand even word order a
Analyzing text by only analyzing frequency of words is called as bag of words model
ahttpenwikipediaorgwikiBag of words model
Jaganadh G Elements of Text Mining
Bag of Words
Example
d1 John likes to watch movies Mary likes too
d2 John also likes to watch football games
rsquofootballrsquo 0 rsquowatchrsquo 6 rsquomoviesrsquo 5 rsquogamesrsquo 1
rsquolikesrsquo 3 rsquojohnrsquo 2 rsquomaryrsquo 4
after removing stopwords
[0 0 1 2 1 1 1]
[1 1 1 1 0 0 1]
each entry of the vectors refers to count of
the corresponding entry in the dictionary
a
aExample taken from httpenwikipediaorgwikiBag of words model
Jaganadh G Elements of Text Mining
Bag of Words
Documents
d1 John likes to watch movies Mary likes too
d2 John also likes to watch football games
Vocabulary Index
V I(t) =
0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo
Jaganadh G Elements of Text Mining
Bag of Words
Documents
d1 John likes to watch movies Mary likes too
d2 John also likes to watch football games
Vocabulary Index
V I(t) =
0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo
Jaganadh G Elements of Text Mining
Bag of Words
football games john likes mary movies watch
doc1 0 0 1 2 1 1 1
doc2 1 1 1 1 0 0 1
Jaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python and sklearn 5
from sklearnfeature_extractiontext import
CountVectorizer
vectorizer = CountVectorizer(analyzer=rsquowordrsquo
min_n=1stop_words=rsquoenglishrsquo)
docs = (rsquoJohn likes to watch movies Mary
likes toorsquorsquoJohn also likes to watch football gamesrsquo)
bow = vectorizerfit_transform(docs)
print vectorizervocabulary_
print bowtoarray()
5httpscikit-learnorgJaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python Just for sample -(
def bag_of_words(docs)
stops = [rsquotorsquorsquotoorsquorsquoalsorsquo]
token_list = [tokenize(doc) for doc in docs]
vocab = list(set(token_list[0])union(token_list))
vocab =[v for v in vocab if v not in stops and len(v) gt 1]
vocab_idex = dict( [ ( word vocabindex(word) ) for word
in vocab] )
bow = [[tokenscount(word) for word in vocab_idexkeys()]
for tokens in token_list]
print vocab_idex
for bag in bow
print bag
d = (John likes to watch movies Mary likes too
John also likes to watch football games)
bag_of_words(d)
Jaganadh G Elements of Text Mining
TF-IDF
Tfndashidf term frequencyndashinverse document frequency is a numerical statistic which reflectshow important a word is to a document in a collection or corpus
TF-IDF
tf minus idf(t) = tf(t d)times idf(t)
where rsquotrsquo is a term in document rsquodrsquotf(t d) how many times the term rsquotrsquo is present in rsquodrsquo
tf(t d) =sumxisind
fr(x t)
where
fr(x t) =
1 if x = t0 otherwise
Jaganadh G Elements of Text Mining
TF-IDF
TF-IDF
andidf(t) = log |D|
1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space
Jaganadh G Elements of Text Mining
TF
TF
tf(t d) =sumxisind
fr(x t)
fr(x t) is a simple function
fr(x t) =
1 if x = t0 otherwise
Exampletf(primejohnprime d1) = 1
Jaganadh G Elements of Text Mining
Document Vector
To create a document vector space
V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))
To represent rsquod1rsquo and rsquod2rsquo as vectors
V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))
V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
which evaluates toV ~d1 = (0 0 1 2 1 1 1)
V ~d2 = (1 1 1 1 0 0 1)
Jaganadh G Elements of Text Mining
Vector Space Matrix
The document vectors can be represented as matrix
M|D|xF
where |D| is the cardinality of the document space
M|D|xF =
[0 0 1 2 1 1 11 1 1 1 0 0 1
]
Jaganadh G Elements of Text Mining
Vector Normalization
Normalized Vector
A normalized vector is represented as v = ~v~vp where v is is the unit vector or the
normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a
ahttpenwikipediaorgwikiLp space
Jaganadh G Elements of Text Mining
Vector Normalization
Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =
radicv2
1 + v22 + v2
2 + + v2n
With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p
It can be simplified as
~v||p = (
nsumi=1
|~vi|p)1p
6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining
Vector Normalization
L2 Norm
The norm which we apply here is L2 Norm which is also called as Euclidean norm a
It is a common norm used to measure the length of a vector where p = 2
ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm
Jaganadh G Elements of Text Mining
Vector Normalization
v~d1 = (0 0 1 2 1 1 1)vd1 = ~v
~vp
vd1 = v~d1
v~d12vd1 = (0012111)radic
02+02+12+22+12+12+12
vd1 = (0012111)radic8
vd1 = ( 0radic8 0radic
8 1radic
8 2radic
8 1radic
8 1radic
8 1radic
8)
vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10
Jaganadh G Elements of Text Mining
IDF
IDF
idf(t) = log |D|1+|dtisind|
where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0
Jaganadh G Elements of Text Mining
Finding IDF
idf(ti) = log|D|
1 + |d ti isin d|= log
2
1= 069314718
idf(football) = log 21+1
= 00
idf(games) = log 21+1
= 00
idf(john) = log 21+2
= minus040546510810816444
idf(likes) = log 21+2
= minus040546510810816444
idf(mary) = log 21+1
= 00
idf(movies) = log 21+1
= 00
idf(watch) = log 21+1
= 00idf(V ) = (00-040546510810816444-040546510810816444000000)
Jaganadh G Elements of Text Mining
TF-IDF weight
Finding TF-IDF weight
M|D|timesF times Midf
[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
]x
idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)
Jaganadh G Elements of Text Mining
TF-IDF weight
Finding TF-IDF weight
M|D|timesF times Midf
[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
]x
idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)
Jaganadh G Elements of Text Mining
TF-IDF weight
tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)
Jaganadh G Elements of Text Mining
TF-IDF Normalization
L2 Normalization
Mtfminusidf =MtfminusidfMtfminusidf2
Jaganadh G Elements of Text Mining
TF-IDF
Practice with Python and sklearn 7
from sklearnfeature_extractiontext import
CountVectorizer TfidfTransformer
vectorizer = CountVectorizer(analyzer=rsquowordrsquo
min_n=1stop_words=rsquoenglishrsquo)
docs = (rsquoJohn likes to watch movies Mary
likes toorsquorsquoJohn also likes to watch football gamesrsquo)
bow = vectorizerfit_transform(docs)
freq_term_matrix =vectorizertransform(docs)
tfidf = TfidfTransformer(norm=l2)
tfd = tfidffit(freq_term_matrix)
print IDF tfidfidf_
tf_idf_matrix = tfidftransform(freq_term_matrix)
print tf_idf_matrixtodense()
for wf in zip(vectorizervocabulary_tfdidf_)
print rsquor =gt rrsquo (w tfdidf_[f])
7httpscikit-learnorgJaganadh G Elements of Text Mining
N-Grams
N-Gram
In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa
Unigrams are single words
Bigrams are sequences of two words
Trigrams are sequences of three words
ahttpenwikipediaorgwikiN-gram
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from list of words
def ngrams(words, n=2):
    grams = [' '.join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure the strength of word association:

I(wi, wj) = log2( P(wi, wj) / (P(wi) P(wj)) ) ≈ log2( N C(wi, wj) / (C(wi) C(wj)) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

Example:
I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) \
            / math.log(2)
    return minfo
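A quick usage sketch on a made-up toy word list (purely illustrative; the sentence is not from the deck):

words = "strong tea is strong and strong tea is good".lower().split()
mi = mutual_info(words)
for gram in sorted(mi, key=mi.get, reverse=True):
    print gram, mi[gram]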
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure the strength of word association:

t(wi, wj) = (mean(P(wi, wj)) − mean(P(wi)) mean(P(wj))) / sqrt(σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)))
          ≈ (C(wi, wj) − (1/N) C(wi) C(wj)) / sqrt(C(wi, wj))

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

Example:
t(strong, tea) = (C(strong, tea) − (1/N) C(strong) C(tea)) / sqrt(C(strong, tea))
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc
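And a matching usage sketch on the same toy word list (again illustrative only):

words = "strong tea is strong and strong tea is good".lower().split()
ts = tscore(words)
for gram in sorted(ts, key=ts.get, reverse=True):
    print gram, ts[gram]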
Jaganadh G Elements of Text Mining
Document Classification
Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.8
Document classification tasks can be divided into three kinds
supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents
semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance
unsupervised document classification is entirely executed without reference to external information
8 http://en.wikipedia.org/wiki/Document_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning
algorithm we wish to learn a classifier (or classifier function) γ that maps documents to
classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification
function γ: Γ(T) = γ

γ(ci, dj) = { 1 if dj belongs to ci
            { 0 otherwise
Main approaches in document classification are
Naïve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9
9 Image taken from http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.

Bayes' rule: P(H|E) = P(E|H) × P(H) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.
Jaganadh G Elements of Text Mining
Naïve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have:

P(raining | dark cloud) = P(dark cloud | raining) × P(raining) / P(dark cloud)

For multiple evidences:

P(H | E1, E2, ..., En) = P(E1, E2, ..., En | H) × P(H) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as follows:

P(H | E1, E2, ..., En) = P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) / P(E1, E2, ..., En)
Jaganadh G Elements of Text Mining
Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 0 Entertainment
Jaganadh G Elements of Text Mining
Building Naïve Bayes Model

The Naïve Bayes model for the training set will be like:

|V|  C  P(Ci)  ni  P(kill|Ci)  P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
 6   T  0.5    15  0.2380952   0.1904762   0.3333333     0.0476190    0.0952381    0.0952381
     E  0.5    12  0.0555556   0.1111111   0.1111111     0.3333333    0.2777778    0.1111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class:
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class.
Example:
P(kill|Terrorism) = (2 + 1 + 1)/15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0)/12 = 0/12
Jaganadh G Elements of Text Mining
Building Naïve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1)/(15 + |V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0 + 0 + 0 + 1)/(12 + |V|) = 1/18 = 0.0555
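The smoothed model can be reproduced in a few lines of Python. This is a sketch written for this example (the function and variable names are mine, not from the original tutorial); it recomputes the class priors and the Laplace-smoothed conditional probabilities from the training table:

from __future__ import division

vocab = ['kill', 'bomb', 'kidnap', 'music', 'movie', 'tv']
train = [([2, 1, 3, 0, 0, 1], 'T'), ([1, 1, 1, 0, 0, 0], 'T'),
         ([1, 1, 2, 0, 1, 0], 'T'), ([0, 1, 0, 2, 1, 1], 'E'),
         ([0, 0, 1, 1, 1, 0], 'E'), ([0, 0, 0, 2, 2, 0], 'E')]

def train_multinomial_nb(train, vocab):
    classes = set(c for _, c in train)
    prior, cond = {}, {}
    for c in classes:
        docs = [v for v, cls in train if cls == c]
        prior[c] = len(docs) / len(train)
        n_c = sum(sum(v) for v in docs)   # ni: total word count in class c
        for j, t in enumerate(vocab):
            # Laplace smoothing: (count(t, c) + 1) / (n_c + |V|)
            cond[(t, c)] = (sum(v[j] for v in docs) + 1) / (n_c + len(vocab))
    return prior, cond

prior, cond = train_multinomial_nb(train, vocab)
print cond[('kill', 'T')]   # 5/21 = 0.2380...
print cond[('kill', 'E')]   # 1/18 = 0.0555...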
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt         2     1     2       0      0      1   ?

To find the posterior probability:

P(ci|W) = P(ci) × ∏_{j=1}^{|V|} P(wj|ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)^2 × P(bomb|Terrorism)^1 × P(kidnap|Terrorism)^2 × P(music|Terrorism)^0 × P(movie|Terrorism)^0 × P(tv|Terrorism)^1
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^−5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)^2 × P(bomb|Entertainment)^1 × P(kidnap|Entertainment)^2 × P(music|Entertainment)^0 × P(movie|Entertainment)^0 × P(tv|Entertainment)^1
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^−7

The document is classified as "Terrorism" because that class got the highest score.
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small. In real-world situations we will train the classifier with thousands of documents, and in such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve it we can take the logarithm of the probabilities:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= −0.3010 − 1.2468 − 0.7203 − 0.9543 + 0 + 0 − 1.0213
= −4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= −0.3010 − 2.5110 − 0.9542 − 1.9085 + 0 + 0 − 0.9542
= −6.6289

After handling the underflow problem, our system still classifies the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.
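The same log-space trick is easy to code. A minimal sketch (my own helper, reusing prior and cond from the training sketch above; natural logs instead of base-10 change only the scale, not the ranking):

import math

def classify(doc_counts, prior, cond, vocab):
    scores = {}
    for c in prior:
        # log P(c|W) is proportional to log P(c) + sum_j count_j * log P(wj|c)
        s = math.log(prior[c])
        for j, t in enumerate(vocab):
            s += doc_counts[j] * math.log(cond[(t, c)])
        scores[c] = s
    return max(scores, key=scores.get), scores

test = [2, 1, 2, 0, 0, 1]   # kill, bomb, kidnap, music, movie, tv
print classify(test, prior, cond, vocab)   # picks 'T' (Terrorism)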
The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
Jaganadh G Elements of Text Mining
Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in the document.a
a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 0 Entertainment
Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag-of-words representation. We can also use a smoothed bag-of-words model like TF-IDF in the multinomial naive Bayes model.
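In sklearn the two event models correspond to MultinomialNB and BernoulliNB. A small sketch I added on the toy table from the earlier slides (BernoulliNB binarizes the counts internally, so it sees the presence/absence representation):

import numpy as np
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

X = np.array([[2, 1, 3, 0, 0, 1], [1, 1, 1, 0, 0, 0], [1, 1, 2, 0, 1, 0],
              [0, 1, 0, 2, 1, 1], [0, 0, 1, 1, 1, 0], [0, 0, 0, 2, 2, 0]])
y = ['T', 'T', 'T', 'E', 'E', 'E']
test = np.array([[2, 1, 2, 0, 0, 1]])

print MultinomialNB().fit(X, y).predict(test)            # multinomial: uses counts
print BernoulliNB(binarize=0.0).fit(X, y).predict(test)  # Bernoulli: counts > 0 become 1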
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set (xi, yi), i = 1, ..., n, where xi ∈ R^p (input feature vector) and yi ∈ {1, −1} is the corresponding label, whether (yi = +1) or (yi = −1). To start with, we assume that our input feature vectors are linearly separable; that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that:

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = −1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum-margin hyperplane are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
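As an illustration (a toy sketch added here, not from the deck), sklearn's LinearSVC exposes the learned w and b, so the decision function ⟨w, x⟩ + b can be checked directly on linearly separable points:

import numpy as np
from sklearn.svm import LinearSVC

# Toy 2-D data: +1 above the diagonal, -1 below it
X = np.array([[0., 1.], [1., 2.], [2., 3.],
              [1., 0.], [2., 1.], [3., 2.]])
y = np.array([1, 1, 1, -1, -1, -1])

svm = LinearSVC().fit(X, y)
w, b = svm.coef_[0], svm.intercept_[0]
print np.sign(np.dot(X, w) + b)   # sign(<w, x> + b) recovers the labels y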
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red squares and blue rectangles are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the scatter regions of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn along the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11
10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.
11 http://www.statsoft.com/textbook/support-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build an SVM Classifier with Python Sklearn
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)
12 This code is adopted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                              Actual
                    Positive             Negative
Predicted Positive  True Positive (TP)   False Positive (FP)
          Negative  False Negative (FN)  True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = (TP + TN) / (TP + FP + FN + TN)

                    Actual
                    Positive  Negative  Total
Predicted Positive  562       77        639
          Negative  225       436       661
Total               787       513       1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
                    Actual
                    Positive  Negative  Total
Predicted Positive  562       77        639
          Negative  225       436       661
Total               787       513       1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong.

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                    Actual
                    Positive  Negative  Total
Predicted Positive  562       77        639
          Negative  225       436       661
Total               787       513       1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                    Actual
                    Positive  Negative  Total
Predicted Positive  562       77        639
          Negative  225       436       661
Total               787       513       1300

Fall-out = 77 / (77 + 436) = 0.15
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score (Positive) = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score (Negative) = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 − 77 × 225) / sqrt((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55
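All of the scores above follow directly from the four confusion-matrix cells, so they are easy to check in a few lines of Python (a verification sketch for the worked example, not original deck code):

from __future__ import division
import math

TP, FP, FN, TN = 562, 77, 225, 436   # the example confusion matrix

accuracy  = (TP + TN) / (TP + FP + FN + TN)                # 0.76
precision = TP / (TP + FP)                                 # 0.87
recall    = TP / (TP + FN)                                 # 0.71
f1        = 2 * precision * recall / (precision + recall)  # 0.78
fallout   = FP / (FP + TN)                                 # 0.15
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))         # 0.55
print accuracy, precision, recall, f1, fallout, mcc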
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis, and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
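With sklearn, the ROC curve and AUC can be computed from true labels and classifier scores. A small sketch with made-up labels and scores, purely for illustration:

import numpy as np
from sklearn.metrics import roc_curve, auc

y_true  = np.array([1, 1, 0, 1, 0, 0, 1, 0])           # hypothetical labels
y_score = np.array([.9, .8, .7, .6, .55, .4, .3, .2])  # hypothetical scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print "AUC:", auc(fpr, tpr)   # closer to 1.0 means better discrimination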
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child \
of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the \
second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract
Jaganadh G Elements of Text Mining
References
Chris Smith, Machine Learning Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept 2012.

Chris Smith, Machine Learning Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.
Jaganadh G Elements of Text Mining
[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
]x
idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)
Jaganadh G Elements of Text Mining
TF-IDF weight
Finding TF-IDF weight
M|D|timesF times Midf
[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
]x
idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)
Jaganadh G Elements of Text Mining
TF-IDF weight
tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)
Jaganadh G Elements of Text Mining
TF-IDF Normalization
L2 Normalization
Mtfminusidf =MtfminusidfMtfminusidf2
Jaganadh G Elements of Text Mining
TF-IDF
Practice with Python and sklearn 7
from sklearnfeature_extractiontext import
CountVectorizer TfidfTransformer
vectorizer = CountVectorizer(analyzer=rsquowordrsquo
min_n=1stop_words=rsquoenglishrsquo)
docs = (rsquoJohn likes to watch movies Mary
likes toorsquorsquoJohn also likes to watch football gamesrsquo)
bow = vectorizerfit_transform(docs)
freq_term_matrix =vectorizertransform(docs)
tfidf = TfidfTransformer(norm=l2)
tfd = tfidffit(freq_term_matrix)
print IDF tfidfidf_
tf_idf_matrix = tfidftransform(freq_term_matrix)
print tf_idf_matrixtodense()
for wf in zip(vectorizervocabulary_tfdidf_)
print rsquor =gt rrsquo (w tfdidf_[f])
7httpscikit-learnorgJaganadh G Elements of Text Mining
N-Grams
N-Gram
In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa
Unigrams are single words
Bigrams are sequences of two words
Trigrams are sequences of three words
ahttpenwikipediaorgwikiN-gram
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from list of words
def ngrams(wordsn=2)
grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]
return grams
words = John likes to watch movies Mary likes too
lower()split()
bigrams = ngrams(wordsn=2)
trigrams = ngrams(wordsn=3)
print bigram
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math
def mutual_info(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
minfo =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
minfo[gram] = (mathlog( len(words) gramcount[ gram ]
wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))
mathlog( 2 )
return minfo
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math
def tscore(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
tsc =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
tsc[gram] = (gramcount[gram] - (1len(words))
wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])
mathsqrt( gramcount[gram])
return tsc
Jaganadh G Elements of Text Mining
Document Classification
Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8
Document classification tasks can be divided into three kinds
supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents
semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance
unsupervised document classification is entirely executed without reference toexternal information
8httpenwikipediaorgwikiDocument_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents
langd crang
wherelangd crangisin D times C using a learning
algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ
γ(ci dj) =
1 if dj belongs to ci0 otherwise
Main approaches in document classification are
Naıve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9
9Image taken from httpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)
P (E)
The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)
P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)
P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)
P (E1E2En)
Jaganadh G Elements of Text Mining
Naıve Bayes Applied in Text Classification
Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
Naıve Bayes Model for the training set will be like [4pt]
|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095
E 05 12 005555566 011111111 011111111 033333333 027777778 011111111
|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class
number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)
15 = 415
P (kill|Entertainment) = (0+0+0)12 = 0
12
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)
(15+|V |) = 521 = 02380
P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1
18 = 00555
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document isTest Docs kill bomb kidnap music movie tv C
Dt 2 1 2 0 0 1 To find the posterior probability
P (ci|W ) = P (ci)timesVprodj=1
P (wj |ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P(Terrorism|W) = P(Terrorism) \times P(kill|Terrorism)^2 \times P(bomb|Terrorism)^1 \times P(kidnap|Terrorism)^2 \times P(music|Terrorism)^0 \times P(movie|Terrorism)^0 \times P(tv|Terrorism)^1
= 0.5 \times 0.2380^2 \times 0.1904^1 \times 0.3333^2 \times 0.0476^0 \times 0.0952^0 \times 0.0952^1
= 0.5 \times 0.0566 \times 0.1904 \times 0.1110 \times 1 \times 1 \times 0.0952
= 5.7 \times 10^{-5}

P(Entertainment|W) = P(Entertainment) \times P(kill|Entertainment)^2 \times P(bomb|Entertainment)^1 \times P(kidnap|Entertainment)^2 \times P(music|Entertainment)^0 \times P(movie|Entertainment)^0 \times P(tv|Entertainment)^1
= 0.5 \times 0.05^2 \times 0.10^1 \times 0.10^2 \times 0.30^0 \times 0.25^0 \times 0.20^1
= 0.5 \times 0.0025 \times 0.10 \times 0.01 \times 1 \times 1 \times 0.20
= 2.5 \times 10^{-7}

The document is classified as "Terrorism" because that class received the highest score.
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small. In real-world situations we will train the classifier with thousands of documents; in such cases the conditional probability values become too low for the CPU to handle. This problem is called underflow. To resolve the problem, we can take the logarithm of the probabilities:

P(Terrorism|W) = \log(0.5 \times 0.2380^2 \times 0.1904^1 \times 0.3333^2 \times 0.0476^0 \times 0.0952^0 \times 0.0952^1)
= \log(0.5) + 2\log(0.2380) + \log(0.1904) + 2\log(0.3333) + 0 + 0 + \log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213 = -4.2437

P(Entertainment|W) = \log(0.5 \times 0.05^2 \times 0.10^1 \times 0.10^2 \times 0.30^0 \times 0.25^0 \times 0.20^1)
= \log(0.5) + 2\log(0.05) + \log(0.10) + 2\log(0.10) + 0 + 0 + \log(0.20)
= -0.3010 - 2.6021 - 1.0000 - 2.0000 + 0 + 0 - 0.6990 = -6.6021

After handling the underflow problem, our system still classifies the test document as "Terrorism". From the final scores you can observe that they are scaled nicely.

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
Jaganadh G Elements of Text Mining
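Since the arithmetic above is easy to get wrong by hand, here is a minimal Python sketch (not part of the original notes; the variable names are illustrative) that recomputes the Laplace-smoothed log posteriors from the aggregated training counts:

from __future__ import division
import math

# aggregated per-class word counts, read off the training table
train = {'Terrorism':     {'kill': 4, 'bomb': 3, 'kidnap': 6, 'music': 0, 'movie': 1, 'tv': 1},
         'Entertainment': {'kill': 0, 'bomb': 1, 'kidnap': 1, 'music': 5, 'movie': 4, 'tv': 3}}
priors = {'Terrorism': 0.5, 'Entertainment': 0.5}  # 3 of the 6 training docs per class
V = 6                                              # vocabulary size
test_doc = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}

def log_posterior(cls):
    n = sum(train[cls].values())                   # n_i: total word frequency in the class
    score = math.log10(priors[cls])
    for word, freq in test_doc.items():
        cond = (train[cls][word] + 1) / (n + V)    # Laplace-smoothed P(w|c)
        score += freq * math.log10(cond)
    return score

for cls in ('Terrorism', 'Entertainment'):
    print cls, log_posterior(cls)                  # roughly -4.24 and -6.60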
Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes classifier:
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t \in \{1, ..., |V|\}, corresponds to word w_t from the vocabulary. Dimension t of the vector for document d_i is written B_{it}, and is either 0 or 1, indicating whether word w_t occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in the document. [a]

[a] A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we adopt the Multi-variate Bernoulli Model for text classification, our document space representation will look like:
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Note that the individual word frequencies have been replaced by the presence or absence of each word.
Jaganadh G Elements of Text Mining
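For comparison, here is a small sketch (illustrative only) that collapses the multinomial count vectors into the Bernoulli presence/absence vectors shown above:

# count vectors from the multinomial training table (rows D0..D5)
counts = [[2, 1, 3, 0, 0, 1],
          [1, 1, 1, 0, 0, 0],
          [1, 1, 2, 0, 1, 0],
          [0, 1, 0, 2, 1, 1],
          [0, 0, 1, 1, 1, 0],
          [0, 0, 0, 2, 2, 2]]

# Bernoulli representation: 1 if the word occurs at least once, else 0
binary = [[1 if c > 0 else 0 for c in row] for row in counts]
for row in binary:
    print row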
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. There we used a simple bag-of-words representation; we can also use a smoothed bag-of-words model, such as TF-IDF, in the multinomial naive Bayes model.
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set \{(x_i, y_i)\}_{i=1}^{n}, where x_i \in R^p (input feature vector) and y_i \in \{1, -1\} is the corresponding label, whether (y_i = +1) or (y_i = -1). To start with, we assume that our input feature vectors are linearly separable; that is, there exists a function f(x) = \langle w, x \rangle + b, where w \in R^p (weight vector) and b \in R (bias), such that

\langle w, x_i \rangle + b > 0 for y_i = +1
\langle w, x_i \rangle + b < 0 for y_i = -1

\langle w, x \rangle + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate the positive and negative examples, but they are not all equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum-margin hyperplane are called support vectors. If the data is not linearly separable, we have to use kernel tricks to find soft margins.
Jaganadh G Elements of Text Mining
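To make the decision rule concrete, here is a tiny sketch with a made-up weight vector and bias (not learned from any data): the predicted label is simply the sign of \langle w, x \rangle + b.

# hypothetical weights and bias for a two-feature linear classifier
w = [0.8, -0.4]
b = -0.2

def predict(x):
    s = sum(wi * xi for wi, xi in zip(w, x)) + b   # <w, x> + b
    return 1 if s > 0 else -1

print predict([1.0, 0.5])   # +1: falls on the positive side of the hyperplane
print predict([0.1, 1.5])   # -1: falls on the negative side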
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space on the table between the squares and rectangles, we can say that the data is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the square and rectangle scatter regions, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable, we have to use kernel tricks.11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.
11 http://www.statsoft.com/textbook/support-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine

[Figure: separating hyperplane with the maximum margin and its support vectors]

Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build an SVM Classifier with Python Sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK.12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                          Actual
                      Positive              Negative
Predicted  Positive   True Positive (TP)    False Positive (FP)
           Negative   False Negative (FN)   True Negative (TN)
Jaganadh G Elements of Text Mining
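If you use scikit-learn, such a table can be computed directly from label vectors. A small sketch with made-up labels (note that sklearn's confusion_matrix puts actual classes on rows and predicted classes on columns, the transpose of the layout above):

from sklearn.metrics import confusion_matrix

y_true = [1, 1, 0, 1, 0, 0, 1, 0]  # actual labels (1 = positive)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # predicted labels
print confusion_matrix(y_true, y_pred)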
Evaluating Performance of a Classifier
Accuracy of a Classifier

Accuracy = \frac{TP + TN}{TP + FP + FN + TN}

                      Actual
                  Positive   Negative   Total
Predicted  Positive    562        77      639
           Negative    225       436      661
Total                  787       513     1300

Accuracy = \frac{562 + 436}{562 + 77 + 225 + 436} = 0.76
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall

Precision indicates how many of the items that we identified were relevant:

Precision = \frac{TP}{TP + FP}

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = \frac{TP}{TP + FN}
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
                      Actual
                  Positive   Negative   Total
Predicted  Positive    562        77      639
           Negative    225       436      661
Total                  787       513     1300

Positive Precision = \frac{562}{562 + 77} = 0.87

Negative Precision = \frac{436}{225 + 436} = 0.65

Positive Recall = \frac{562}{562 + 225} = 0.71

Negative Recall = \frac{436}{77 + 436} = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate

Error rate is the percentage of things done wrong.

ErrorRate = \frac{FP + FN}{TP + FP + FN + TN}

                      Actual
                  Positive   Negative   Total
Predicted  Positive    562        77      639
           Negative    225       436      661
Total                  787       513     1300

ErrorRate = \frac{77 + 225}{562 + 77 + 225 + 436} = 0.23
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = \frac{FP}{FP + TN}

                      Actual
                  Positive   Negative   Total
Predicted  Positive    562        77      639
           Negative    225       436      661
Total                  787       513     1300

Fall-out = \frac{77}{77 + 436} = 0.15
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. [a]

F1 Score = 2 \times \frac{precision \times recall}{precision + recall}

[a] http://en.wikipedia.org/wiki/F1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 \times \frac{0.87 \times 0.71}{0.87 + 0.71} = 0.78

F1 Score Negative = 2 \times \frac{0.65 \times 0.84}{0.65 + 0.84} = 0.73
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = \frac{TP}{TP + FP}

Positive predictive value = \frac{562}{562 + 77} = 0.87
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = \frac{TN}{TN + FN}

NPV = \frac{436}{436 + 225} = 0.65
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate

Specificity = \frac{TN}{FP + TN}

Specificity = \frac{436}{77 + 436} = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = \frac{FP}{FP + TP}

FDR = \frac{77}{77 + 562} = 0.12
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}

MCC = \frac{562 \times 436 - 77 \times 225}{\sqrt{(562 + 77)(562 + 225)(436 + 77)(436 + 225)}} = 0.55
Jaganadh G Elements of Text Mining
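All of the measures above are simple arithmetic on the four confusion-matrix cells. A short sketch, using the counts from the running example, to verify the numbers:

from __future__ import division
import math

tp, fp, fn, tn = 562, 77, 225, 436

accuracy  = (tp + tn) / (tp + fp + fn + tn)                # 0.76
precision = tp / (tp + fp)                                 # 0.87
recall    = tp / (tp + fn)                                 # 0.71
f1        = 2 * precision * recall / (precision + recall)  # 0.78
error     = (fp + fn) / (tp + fp + fn + tn)                # 0.23
fallout   = fp / (fp + tn)                                 # 0.15
npv       = tn / (tn + fn)                                 # 0.65
fdr       = fp / (fp + tp)                                 # 0.12
mcc       = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))  # 0.55

print accuracy, precision, recall, f1
print error, fallout, npv, fdr, mcc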
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. An ROC curve is plotted on an X-Y axis: the false positive rate is placed on the X axis, and the true positive rate on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
Jaganadh G Elements of Text Mining
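scikit-learn can compute both the ROC curve and the AUC directly from classifier scores. A minimal sketch with made-up labels and scores:

from sklearn.metrics import roc_curve, auc

y_true  = [1, 1, 0, 1, 0, 0, 1, 0]                    # actual binary labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]   # classifier scores
fpr, tpr, thresholds = roc_curve(y_true, y_score)     # points of the ROC curve
print auc(fpr, tpr)                                   # area under the curve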
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER), also known as entity identification or entity extraction, is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories, such as the names of persons, organizations, and locations, expressions of times, quantities, monetary values, percentages, etc.
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":  # on newer NLTK use tree.label()
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract/
Jaganadh G Elements of Text Mining
References
Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part I, http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part II, http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog, Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora, acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly, Practical Text Mining with Perl, Wiley, 2008.

Text Categorization and Classification, http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce, www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.
Jaganadh G Elements of Text Mining
Counting Words
Word Count - Python
def word_count(text):
    words = tokenize(text)
    word_freq = dict([(word, words.count(word)) for word
                      in set(words)])
    return word_freq

text = "How can this be implemented? There are a lot of subtleties, such as dot being used in abbreviations."
wc = word_count(text)
for word, count in wc.items():
    print word, "\t\t", count
Jaganadh G Elements of Text Mining
Finding Word Length
Word Length
def word_length(text):
    words = tokenize(text)
    word_length = {}
    [word_length.__setitem__(len(word), 1 +
     word_length.get(len(word), 0)) for word in words]
    return word_length

text = "How can this be implemented? There are a lot of subtleties, such as dot being used in abbreviations."
wl = word_length(text)
for length, count in wl.items():
    print "There are %d words of length %d" % (count, length)
Jaganadh G Elements of Text Mining
Word Proportion
Word Proportion
Let C be a corpus where (w_1, w_2, w_3, ..., w_n) are the words.
SIZE(C) = length(tokens(C)), i.e. the total number of words.
So p(w_i, C) = \frac{f(w_i, C)}{SIZE(C)}, where f(w_i, C) is the frequency of w_i in C.

Finding Word Proportion

from __future__ import division

def word_propo(text):
    words = tokenize(text)
    wc = word_count(text)
    propo = dict([(word, wc[word] / len(words)) for word
                  in set(words)])
    return propo

text = "How can this be implemented? There are a lot of subtleties, such as dot being used in abbreviations."
wp = word_propo(text)
for word, propo in wp.items():
    print word, "\t\t", propo
Jaganadh G Elements of Text Mining
Words Types and Ratio
Words and Types

Words are the valid tokens from a given corpus C. Types are the distinct tokens in a given corpus C.
Example:
C = "I shot an elephant in my pajamas. He saw the fine fat trout in the brook."
There are 16 words and 14 types in C.
words = ('i', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas', 'he', 'saw', 'the', 'fine', 'fat', 'trout', 'in', 'the', 'brook')
types = ('shot', 'i', 'saw', 'elephant', 'brook', 'fine', 'an', 'fat', 'in', 'my', 'the', 'he', 'pajamas', 'trout')
Jaganadh G Elements of Text Mining
Words Types and Ratio
Word Type Ratio

WTR(C) = \frac{WC(C)}{C(T)}, where WC(C) is the total number of words in corpus C and C(T) is the count of types in corpus C.

Finding Word Type Ratio

def word_type_ratio(text):
    words = tokenize(text)
    ratio = len(words) / len(set(words))
    return ratio

text = "I shot an elephant in my pajamas. He saw the fine fat trout in the brook."
ratio = word_type_ratio(text)
print ratio
Jaganadh G Elements of Text Mining
Finding top N words
Python code to find the top N words from a text:

from operator import itemgetter

def top_words(text, n=50):
    wordfreq = word_count(text)
    topwords = sorted(wordfreq.iteritems(), key=itemgetter(1),
                      reverse=True)[:n]
    return topwords

text = open('gpl-2.0.txt', 'r').read()
topwords = top_words(text, n=50)
for word, count in topwords:
    print "%s \t %d" % (word, count)
Jaganadh G Elements of Text Mining
Plotting top N words
Python code to plot the top 20 words from a text:

import matplotlib.pyplot as plt

def plot_freq(text):
    tfw = top_words(text, n=20)
    x = range(len(tfw))
    y = []
    for item in range(len(tfw)):
        y = y + [tfw[item][1]]
    plt.plot(x, y, 'bo', ls='dotted')
    plt.xticks(range(0, len(x) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    plt.xlabel("Word Ranking")
    plt.ylabel("Word Frequency")
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq(text)
Jaganadh G Elements of Text Mining
Top N Words
Plot of top 50 words from GPL v2 (without filtering stop words)
Jaganadh G Elements of Text Mining
Plotting top N words
Python code to plot the top 50 words from a text. This plot will show the words in the plot:

import matplotlib.pyplot as plt

def plot_freq_tag(text):
    tfw = top_words(text, n=50)
    words = [tfw[i][0] for i in range(len(tfw))]
    x = range(len(tfw))
    y = []
    for item in range(len(tfw)):
        y = y + [tfw[item][1]]
    fig = plt.figure()
    ax = fig.add_subplot(111, xlabel="Word Rank", ylabel="Word Frequency")
    ax.set_title('Top 50 words')
    ax.plot(x, y, 'go-', ls='dotted')
    plt.xticks(range(0, len(words) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    for i, label in enumerate(words):
        plt.text(x[i], y[i], label, rotation=45)
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq_tag(text)
Jaganadh G Elements of Text Mining
Top N Words
Plot of top 50 words from GPL v2 (without filtering stop words)
Jaganadh G Elements of Text Mining
Plotting histogram of top N words
Python code to plot a histogram of the top 50 words:

import numpy as np
import matplotlib.pyplot as plt

def plot_hist(text):
    tw = top_words(text)
    words = [tw[i][0] for i in range(len(tw))]
    freq = [tw[j][1] for j in range(len(tw))]
    pos = np.arange(len(words))
    width = 1.0
    ax = plt.axes(frameon=True)
    ax.set_xticks(pos)
    ax.set_yticks(range(0, max(freq), 10))
    ax.set_xticklabels(words, rotation='vertical', fontsize=9)
    plt.bar(pos, freq, width, color='b')
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_hist(text)
Jaganadh G Elements of Text Mining
Histogram
Histogram of top 50 words from GPL v2 (without filtering stop words)
Jaganadh G Elements of Text Mining
Lexical Dispersion Plot
A lexical dispersion plot shows the position of a word in a given text. I have created some Python code to generate a lexical dispersion plot with reference to a given word list.3

import matplotlib.pyplot as plt

def dispersion_plot(text, words):
    wordst = tokenize(text)
    points = [(x, y) for x in range(len(wordst))
              for y in range(len(words)) if wordst[x] == words[y]]
    if points:
        x, y = zip(*points)
    else:
        x = y = ()
    plt.plot(x, y, "go", scalex=2)
    plt.yticks(range(len(words)), words, color="b")
    plt.ylim(-1, len(words))
    plt.title("Lexical Dispersion Plot")
    plt.xlabel("Word Offset")
    plt.show()

gpl = open('gpl-2.0.txt', 'r').read()
dispersion_plot(gpl, ['software', 'license', 'copy', 'gnu', 'program', 'free'])

3 Code taken from http://nltk.googlecode.com/svn/trunk/doc/api/nltk.draw.dispersion-pysrc.html and modified
Jaganadh G Elements of Text Mining
Lexical Dispersion Plot
Lexical Dispersion Plot plot from GPL text
Jaganadh G Elements of Text Mining
Tag Cloud
A tag cloud (word cloud, or weighted list in visual design) is a visual representation of text data, typically used to depict keyword meta-data (tags) on websites, or to visualize free-form text. There is an interesting Python tool to generate tag clouds from text, called pytagcloud.4 Here comes an example of creating a tag cloud from the first 100 words of the GPL text:

from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts

def create_tag_cloud(text):
    words = tokenize(text)
    doc = " ".join(d for d in words[:100])
    tags = make_tags(get_tag_counts(doc), maxsize=80)
    create_tag_image(tags, 'gpl.png', size=(900, 600),
                     fontname='Philosopher')

gpl = open('gpl-2.0.txt', 'r').read()
create_tag_cloud(gpl)

4 https://github.com/atizo/PyTagCloud
Jaganadh G Elements of Text Mining
Tag Cloud
Tag cloud from GPL text
Jaganadh G Elements of Text Mining
Word co-occurrence
Word co-occurrence analysis is used to construct lexicons, ontologies, etc. In general, it aims to find similarities between word pairs.

A word co-occurrence matrix is a square N \times N matrix, where N corresponds to the total number of unique words in a corpus. A cell m_{ij} contains the number of times word w_i co-occurs with word w_j within a specific context (a natural unit such as a sentence, or a certain window of m words). Note that the upper and lower triangles of the matrix are identical, since co-occurrence is a symmetric relation. [a]

[a] Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce, www.aclweb.org/anthology/D08-1044
Jaganadh G Elements of Text Mining
Word co-occurrence

      w1    w2    w3    ...   wn
w1    m11   m12   m13   ...   m1n
w2    m21   m22   m23   ...   m2n
w3    m31   m32   m33   ...   m3n
...   ...   ...   ...   ...   ...
wn    mn1   mn2   mn3   ...   mnn

Shape of a co-occurrence matrix
Jaganadh G Elements of Text Mining
Word co-occurrence
Finding a co-occurrence matrix with Python:

from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = [corpora[i], corpora[j]]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]
ccm = cooccurrence_matrix_corpus(corpus)
Jaganadh G Elements of Text Mining
Word co-occurrence
[Figure: word co-occurrence visualization from 100 positive movie reviews. The plot shows the top first word and its top four associated words; for each associated word, its associated words are plotted in turn. Words appearing in the plot include film, movie, story, like, one, and first.]
Jaganadh G Elements of Text Mining
Stop Words
Stop Words
In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In terms of linguistics these words are called function words. Words like 'a', 'an', and 'the' are examples of stop words. There is no single defined set of stop words; different applications and research groups use different sets of stop words.
Generally stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus, compared to content words. Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words
Jaganadh G Elements of Text Mining
Stop Words Filter
def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
             'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
             'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
             'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
             'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
             'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
             'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
             'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
             'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
             'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
             'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
             'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless
Jaganadh G Elements of Text Mining
Bag of Words
The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar and even word order. [a]

Analyzing text by only analyzing the frequency of words is called the bag-of-words model.

[a] http://en.wikipedia.org/wiki/Bag_of_words_model
Jaganadh G Elements of Text Mining
Bag of Words
Example

d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

{'football': 0, 'games': 1, 'john': 2, 'likes': 3,
 'mary': 4, 'movies': 5, 'watch': 6}
(vocabulary index, after removing stop words)

[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]
(each entry of the vectors refers to the count of
the corresponding entry in the dictionary) [a]

[a] Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model
Jaganadh G Elements of Text Mining
Bag of Words
Documents

d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

Vocabulary Index

VI(t) = 0 if t is 'football', 1 if t is 'games', 2 if t is 'john', 3 if t is 'likes',
        4 if t is 'mary', 5 if t is 'movies', 6 if t is 'watch'
Jaganadh G Elements of Text Mining
Bag of Words
football games john likes mary movies watch
doc1 0 0 1 2 1 1 1
doc2 1 1 1 1 0 0 1
Jaganadh G Elements of Text Mining
Bag of Words
Creating a Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
# min_n is from older scikit-learn releases; newer versions use ngram_range=(1, 1)
docs = ('John likes to watch movies. Mary likes too.',
        'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org
Jaganadh G Elements of Text Mining
Bag of Words
Creating a Bag of Words with plain Python. Just for a sample :-)

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
                       in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
           for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies. Mary likes too.",
     "John also likes to watch football games.")
bag_of_words(d)
Jaganadh G Elements of Text Mining
TF-IDF
Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t) = tf(t, d) \times idf(t)

where 't' is a term in document 'd' and tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = \sum_{x \in d} fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise
Jaganadh G Elements of Text Mining
TF-IDF
TF-IDF
and

idf(t) = \log \frac{|D|}{1 + |\{d : t \in d\}|}

where |\{d : t \in d\}| is the count of documents where 't' is present and tf(t, d) \neq 0, and |D| is the cardinality of the document space.
Jaganadh G Elements of Text Mining
TF
TF

tf(t, d) = \sum_{x \in d} fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = 1 if x = t, 0 otherwise

Example:
tf('john', d1) = 1
Jaganadh G Elements of Text Mining
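A direct rendering of tf as code (a small sketch; tokenization here is just a lower-case split):

def tf(term, doc):
    # sums fr(x, t) over all tokens x in the document
    return sum(1 for token in doc.lower().split() if token == term)

d1 = "John likes to watch movies Mary likes too"
print tf('john', d1)    # 1
print tf('likes', d1)   # 2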
Document Vector
To create a document vector space:

\vec{v}_{d_n} = (tf(t_1, d_n), tf(t_2, d_n), ..., tf(t_n, d_n))

To represent 'd1' and 'd2' as vectors:

\vec{v}_{d_1} = (tf(t_1, d_1), tf(t_2, d_1), tf(t_3, d_1), tf(t_4, d_1), tf(t_5, d_1), tf(t_6, d_1), tf(t_7, d_1))
\vec{v}_{d_2} = (tf(t_1, d_2), tf(t_2, d_2), tf(t_3, d_2), tf(t_4, d_2), tf(t_5, d_2), tf(t_6, d_2), tf(t_7, d_2))

which evaluates to:
\vec{v}_{d_1} = (0, 0, 1, 2, 1, 1, 1)
\vec{v}_{d_2} = (1, 1, 1, 1, 0, 0, 1)
Jaganadh G Elements of Text Mining
Vector Space Matrix
The document vectors can be represented as a matrix

M_{|D| \times F}

where |D| is the cardinality of the document space:

M_{|D| \times F} =
[ 0 0 1 2 1 1 1 ]
[ 1 1 1 1 0 0 1 ]
Jaganadh G Elements of Text Mining
Vector Normalization
Normalized Vector

A normalized vector is represented as \hat{v} = \frac{\vec{v}}{\|\vec{v}\|_p}, where \hat{v} is the unit (normalized) vector, \vec{v} is the vector to be normalized, and \|\vec{v}\|_p is the norm (magnitude or length) of the vector \vec{v} in the L^p space (Lebesgue spaces). [a]

[a] http://en.wikipedia.org/wiki/Lp_space
Jaganadh G Elements of Text Mining
Vector Normalization
The length of a vector is calculated using the Euclidean norm.6
Non-normalized vector: \vec{v} = (v_1, v_2, v_3, ..., v_n)
Length of vector: \|\vec{v}\| = \sqrt{v_1^2 + v_2^2 + v_3^2 + ... + v_n^2}

With the p-norm: \|\vec{v}\|_p = (|v_1|^p + |v_2|^p + |v_3|^p + ... + |v_n|^p)^{1/p}

It can be simplified as:

\|\vec{v}\|_p = \left( \sum_{i=1}^{n} |v_i|^p \right)^{1/p}

6 http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining
Vector Normalization
L2 Norm

The norm which we apply here is the L2 norm, also called the Euclidean norm. [a] It is a common norm used to measure the length of a vector, where p = 2.

[a] http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm
Jaganadh G Elements of Text Mining
Vector Normalization
\vec{v}_{d_1} = (0, 0, 1, 2, 1, 1, 1)

\hat{v}_{d_1} = \frac{\vec{v}_{d_1}}{\|\vec{v}_{d_1}\|_2}
= \frac{(0, 0, 1, 2, 1, 1, 1)}{\sqrt{0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2}}
= \frac{(0, 0, 1, 2, 1, 1, 1)}{\sqrt{8}}
= (\frac{0}{\sqrt{8}}, \frac{0}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{2}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}})
= (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Now our normalized vector \hat{v}_{d_1} has an L2-norm \|\hat{v}_{d_1}\|_2 = 1.0.
Jaganadh G Elements of Text Mining
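The same normalization takes only a few lines of Python (a sketch; true division is needed under Python 2):

from __future__ import division
import math

v = [0, 0, 1, 2, 1, 1, 1]
norm = math.sqrt(sum(x * x for x in v))     # L2 norm = sqrt(8)
unit = [x / norm for x in v]
print unit                                  # [0.0, 0.0, 0.3535..., 0.7071..., ...]
print math.sqrt(sum(x * x for x in unit))   # 1.0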
IDF
IDF

idf(t) = \log \frac{|D|}{1 + |\{d : t \in d\}|}

where |\{d : t \in d\}| is the count of documents where 't' is present and tf(t, d) \neq 0.
Jaganadh G Elements of Text Mining
Finding IDF
idf(t_i) = \log \frac{|D|}{1 + |\{d : t_i \in d\}|}; for a term that appears in no document this evaluates to \log \frac{2}{1} = 0.69314718

idf(football) = \log \frac{2}{1+1} = 0.0
idf(games) = \log \frac{2}{1+1} = 0.0
idf(john) = \log \frac{2}{1+2} = -0.40546510810816444
idf(likes) = \log \frac{2}{1+2} = -0.40546510810816444
idf(mary) = \log \frac{2}{1+1} = 0.0
idf(movies) = \log \frac{2}{1+1} = 0.0
idf(watch) = \log \frac{2}{1+2} = -0.40546510810816444

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, -0.40546510810816444)
Jaganadh G Elements of Text Mining
TF-IDF weight
Finding TF-IDF weight

M_{|D| \times F} \times M_{idf}

[ tf(t_1, d_1)  tf(t_2, d_1)  tf(t_3, d_1)  tf(t_4, d_1)  tf(t_5, d_1)  tf(t_6, d_1)  tf(t_7, d_1) ]
[ tf(t_1, d_2)  tf(t_2, d_2)  tf(t_3, d_2)  tf(t_4, d_2)  tf(t_5, d_2)  tf(t_6, d_2)  tf(t_7, d_2) ]

\times

[ idf(t_1)  0         0         0         0         0         0        ]
[ 0         idf(t_2)  0         0         0         0         0        ]
[ 0         0         idf(t_3)  0         0         0         0        ]
[ 0         0         0         idf(t_4)  0         0         0        ]
[ 0         0         0         0         idf(t_5)  0         0        ]
[ 0         0         0         0         0         idf(t_6)  0        ]
[ 0         0         0         0         0         0         idf(t_7) ]
TF-IDF weight
[ tf(t_1, d_1) \times idf(t_1)   tf(t_2, d_1) \times idf(t_2)   ...   tf(t_7, d_1) \times idf(t_7) ]
[ tf(t_1, d_2) \times idf(t_1)   tf(t_2, d_2) \times idf(t_2)   ...   tf(t_7, d_2) \times idf(t_7) ]
Jaganadh G Elements of Text Mining
TF-IDF Normalization
L2 Normalization

M_{tfidf} = \frac{M_{tfidf}}{\|M_{tfidf}\|_2} (applied to each document row)
Jaganadh G Elements of Text Mining
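Putting the pieces together, here is a plain-Python sketch (illustrative, not the sklearn implementation) that computes the tf-idf matrix for d1 and d2 and L2-normalizes each row, reproducing the idf values worked out above:

from __future__ import division
import math

vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']
tf_matrix = [[0, 0, 1, 2, 1, 1, 1],   # d1
             [1, 1, 1, 1, 0, 0, 1]]   # d2

num_docs = len(tf_matrix)
df = [sum(1 for row in tf_matrix if row[i] > 0) for i in range(len(vocab))]
idf = [math.log(num_docs / (1 + d)) for d in df]

tfidf = [[tf * w for tf, w in zip(row, idf)] for row in tf_matrix]
for row in tfidf:                      # L2-normalize each document row
    norm = math.sqrt(sum(x * x for x in row))
    print [x / norm for x in row] if norm else row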
TF-IDF
Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
        'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfd.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, i in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[i])

7 http://scikit-learn.org
Jaganadh G Elements of Text Mining
N-Grams
N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters; however, the items in question can be phonemes, syllables, letters, words, or base pairs, according to the application. [a]

Unigrams are single words.
Bigrams are sequences of two words.
Trigrams are sequences of three words.

[a] http://en.wikipedia.org/wiki/N-gram
Jaganadh G Elements of Text Mining
Bigrams
Bigrams

P(w_i | w_1, w_2, ..., w_{i-1}) \approx P(w_i | w_{i-1})

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [' '.join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']
Jaganadh G Elements of Text Mining
Trigrams
Trigrams

P(w_i | w_1, w_2, ..., w_{i-1}) \approx P(w_i | w_{i-2}, w_{i-1})

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [' '.join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from a list of words:

def ngrams(words, n=2):
    grams = [' '.join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information

A statistical test to measure the strength of word association:

I(w_i, w_j) = \log_2 \frac{P(w_i, w_j)}{P(w_i) P(w_j)} \approx \log_2 \frac{N \cdot C(w_i, w_j)}{C(w_i) C(w_j)}

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus,
C(w_i, w_j) is the frequency of the bigram w_i w_j,
and N is the total number of words in the corpus.

I(strong, tea) = \log_2 \frac{P(strong, tea)}{P(strong) P(tea)} \approx \log_2 \frac{N \cdot C(strong, tea)}{C(strong) C(tea)}
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
     wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
     gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
                       (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))) / \
                      math.log(2)
    return minfo
Jaganadh G Elements of Text Mining
t-score
t-score

A statistical test to measure the strength of word association:

t(w_i, w_j) = \frac{mean(P(w_i, w_j)) - mean(P(w_i)) \cdot mean(P(w_j))}{\sqrt{\sigma^2(P(w_i, w_j)) + \sigma^2(P(w_i)) \cdot \sigma^2(P(w_j))}}
\approx \frac{C(w_i, w_j) - \frac{1}{N} C(w_i) C(w_j)}{\sqrt{C(w_i, w_j)}}

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus,
C(w_i, w_j) is the frequency of the bigram w_i w_j,
and N is the total number of words in the corpus.

t(strong, tea) = \frac{C(strong, tea) - \frac{1}{N} C(strong) C(tea)}{\sqrt{C(strong, tea)}}
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
     wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
     gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
                     wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
                    math.sqrt(gramcount[gram])
    return tsc
Jaganadh G Elements of Text Mining
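A quick usage sketch for both association measures, assuming the gpl-2.0.txt file used in the earlier examples:

words = open('gpl-2.0.txt', 'r').read().lower().split()
mi = mutual_info(words)
ts = tscore(words)
# show the ten most strongly associated bigrams by t-score
for gram in sorted(ts, key=ts.get, reverse=True)[:10]:
    print "%s \t t-score: %.3f \t MI: %.3f" % (gram, ts[gram], mi[gram])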
Document Classification
Document classification or document categorization is a problem in library science, information science, and computer science. The task is to assign a document to one or more classes or categories.8

Document classification tasks can be divided into three kinds:

supervised document classification: performed with an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents;

semi-supervised document classification: a mixture between supervised and unsupervised classification, where some documents or parts of documents are labeled with external assistance;

unsupervised document classification: entirely executed without reference to external information.

8 http://en.wikipedia.org/wiki/Document_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition

Let C = (c_1, c_2, c_3, ..., c_m) be a set of pre-defined categories.
Let D = (d_1, d_2, d_3, ..., d_n) be a set of documents to be classified.
Given a training set T of labeled documents \langle d, c \rangle, where \langle d, c \rangle \in D \times C, using a learning algorithm we wish to learn a classifier (or classifier function) \gamma that maps documents to classes: \gamma : D \to C.
A supervised learning algorithm \Gamma takes a training set T and emits a learned classification function \gamma: \Gamma(T) = \gamma

\gamma(c_i, d_j) = 1 if d_j belongs to c_i, 0 otherwise

The main approaches in document classification are:

Naïve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)
P (E)
The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)
P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)
P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)
P (E1E2En)
Jaganadh G Elements of Text Mining
Naıve Bayes Applied in Text Classification
Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
Naıve Bayes Model for the training set will be like [4pt]
|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095
E 05 12 005555566 011111111 011111111 033333333 027777778 011111111
|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class
number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)
15 = 415
P (kill|Entertainment) = (0+0+0)12 = 0
12
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)
(15+|V |) = 521 = 02380
P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1
18 = 00555
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document isTest Docs kill bomb kidnap music movie tv C
Dt 2 1 2 0 0 1 To find the posterior probability
P (ci|W ) = P (ci)timesVprodj=1
P (wj |ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)
= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521
= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5
P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)
= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111
= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7
The document has classified as rdquoTerrorismrdquo because it got the highest value
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289
After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely
The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf
Jaganadh G Elements of Text Mining
Naıve Bayes Classifier
There are two different ways to setup a Naıve Bayes Classifier
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta
aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by presence or absenceof the word
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument
In the NB example which we worked out we applied the multinomial model In the model we used simple bag of
words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that
〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1
〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way
If there is a clear blank space available in the table in between the square and rectangles we can say that
it is linearly separable A line which drawn in the clear space between the figuresexactly equal length
from the margin of scatter region of square and rectangle is called as separating hyperplane Everything
on the one side of the separating hyper plane belongs to one category and everything in the other side
belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter
(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin
The figures closest to the separating hyperplane are known as support vectors10 If the data is not
linearly separable we have to use kernel tricks 11
10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography
11httpwwwstatsoftcomtextbooksupport-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Finding Word Length
Word Length
def word_length(text):
    words = tokenize(text)
    word_length = {}
    [word_length.__setitem__(len(word), 1 +
        word_length.get(len(word), 0)) for word in words]
    return word_length

text = "How can this be implemented? There are a lot of \
subtleties, such as dot being used in abbreviations."
wl = word_length(text)
for length, count in wl.items():
    print "There are %d words of length %d" % (count, length)
Jaganadh G Elements of Text Mining
Word Proportion
Word Proportion
Let C be a corpus where (w1, w2, w3, ..., wn) are the words.
SIZE(C) = length(tokens(C)), simply the total number of words.
So p(wi, C) = f(wi, C) / SIZE(C), where f(wi, C) is the frequency of wi in C.
Finding Word Proportion
from __future__ import division

def word_propo(text):
    words = tokenize(text)
    wc = word_count(text)
    propo = dict([(word, wc[word] / len(words)) for word
        in set(words)])
    return propo

text = "How can this be implemented? There are a lot of \
subtleties, such as dot being used in abbreviations."
wp = word_propo(text)
for word, propo in wp.items():
    print word, "\t\t", propo
Jaganadh G Elements of Text Mining
Words Types and Ratio
Words and Types
Words are valid tokens from a given corpus C. Types are the total distinct tokens in a given corpus C.
Example:
C = "I shot an elephant in my pajamas. He saw the fine fat trout in the brook."
There are 16 words and 14 types in C:
words = ('i', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas', 'he', 'saw', 'the', 'fine', 'fat', 'trout', 'in', 'the', 'brook')
types = ('shot', 'i', 'saw', 'elephant', 'brook', 'fine', 'an', 'fat', 'in', 'my', 'the', 'he', 'pajamas', 'trout')
Jaganadh G Elements of Text Mining
Words Types and Ratio
Word Type Ratio
WTR(C) = WC(C) / C(T), where WC(C) is the total number of words in corpus C and C(T) is the
count of types in corpus C.

Finding Word Type Ratio

from __future__ import division  # so the ratio is not truncated to an integer

def word_type_ratio(text):
    words = tokenize(text)
    ratio = len(words) / len(set(words))
    return ratio

text = "I shot an elephant in my pajamas. He saw the fine \
fat trout in the brook"
ratio = word_type_ratio(text)
print ratio
Jaganadh G Elements of Text Mining
Finding top N words
Python code to find top N words from a text
from operator import itemgetter

def top_words(text, n=50):
    wordfreq = word_count(text)
    topwords = sorted(wordfreq.iteritems(), key=itemgetter(1),
        reverse=True)[:n]
    return topwords

text = open('gpl-2.0.txt', 'r').read()
topwords = top_words(text, n=50)
for word, count in topwords:
    print "%s \t %d" % (word, count)
Jaganadh G Elements of Text Mining
Plotting top N words
Python code to plot top 20 words from a text
import matplotlib.pyplot as plt

def plot_freq(text):
    tfw = top_words(text, n=20)
    x = range(len(tfw))
    y = []
    for item in range(len(tfw)):
        y = y + [tfw[item][1]]
    plt.plot(x, y, 'bo', ls='dotted')
    plt.xticks(range(0, len(tfw) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    plt.xlabel("Word Ranking")
    plt.ylabel("Word Frequency")
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq(text)
Jaganadh G Elements of Text Mining
Top N Words
Plot of top 50 words from GPL v2 (without filtering stop words)
Jaganadh G Elements of Text Mining
Plotting top N words
Python code to plot top 50 words from a text. This plot will show the words in the plot.

import matplotlib.pyplot as plt

def plot_freq_tag(text):
    tfw = top_words(text, n=50)
    words = [tfw[i][0] for i in range(len(tfw))]
    x = range(len(tfw))
    y = []
    for item in range(len(tfw)):
        y = y + [tfw[item][1]]
    fig = plt.figure()
    ax = fig.add_subplot(111, xlabel="Word Rank",
        ylabel="Word Frequency")
    ax.set_title('Top 50 words')
    ax.plot(x, y, 'go-', ls='dotted')
    plt.xticks(range(0, len(words) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    for i, label in enumerate(words):
        plt.text(x[i], y[i], label, rotation=45)
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq_tag(text)
Jaganadh G Elements of Text Mining
Top N Words
Plot of top 50 words from GPL v2 (without filtering stop words)
Jaganadh G Elements of Text Mining
Plotting histogram of top N words
Python code to plot histogram of top 50 words
import numpy as np
import matplotlib.pyplot as plt

def plot_hist(text):
    tw = top_words(text)
    words = [tw[i][0] for i in range(len(tw))]
    freq = [tw[j][1] for j in range(len(tw))]
    pos = np.arange(len(words))
    width = 1.0
    ax = plt.axes(frameon=True)
    ax.set_xticks(pos)
    ax.set_yticks(range(0, max(freq), 10))
    ax.set_xticklabels(words, rotation='vertical', fontsize=9)
    plt.bar(pos, freq, width, color='b')
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_hist(text)
Jaganadh G Elements of Text Mining
Histogram
Histogram of top 50 words from GPL v2 (without filtering stop words)
Jaganadh G Elements of Text Mining
Lexical Dispersion Plot
A lexical dispersion plot shows the position of a word in a given text. I have created some
Python code to generate a lexical dispersion plot with reference to a given word list.3

import matplotlib.pyplot as plt

def dispersion_plot(text, words):
    wordst = tokenize(text)
    points = [(x, y) for x in range(len(wordst))
        for y in range(len(words)) if wordst[x] == words[y]]
    if points:
        x, y = zip(*points)
    else:
        x = y = ()
    plt.plot(x, y, "go", scalex=.2)
    plt.yticks(range(len(words)), words, color="b")
    plt.ylim(-1, len(words))
    plt.title("Lexical Dispersion Plot")
    plt.xlabel("Word Offset")
    plt.show()

gpl = open('gpl-2.0.txt', 'r').read()
dispersion_plot(gpl, ['software', 'license', 'copy', 'gnu', 'program', 'free'])
3 Code taken from http://nltk.googlecode.com/svn/trunk/doc/api/nltk.draw.dispersion-pysrc.html and modified
Jaganadh G Elements of Text Mining
Lexical Dispersion Plot
Lexical dispersion plot from the GPL text
Jaganadh G Elements of Text Mining
Tag Cloud
A tag cloud (word cloud, or weighted list in visual design) is a visual representation for text data,
typically used to depict keyword meta-data (tags) on websites, or to visualize free form text.
There is an interesting Python tool to generate tag clouds from text called pytagcloud 4. Here
comes an example of creating a tag cloud from the first 100 words of the GPL text.

from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts

def create_tag_cloud(text):
    words = tokenize(text)
    doc = " ".join(d for d in words[:100])
    tags = make_tags(get_tag_counts(doc), maxsize=80)
    create_tag_image(tags, 'gpl.png', size=(900, 600),
        fontname='Philosopher')

gpl = open('gpl-2.0.txt', 'r').read()
create_tag_cloud(gpl)

4 https://github.com/atizo/PyTagCloud
Jaganadh G Elements of Text Mining
Tag Cloud
Tag cloud from GPL text
Jaganadh G Elements of Text Mining
Word co-occurrence
Word co-occurrence analysis is used to construct lexicons, ontologies, etc. In general, it
aims to find similarities between word pairs.

A word co-occurrence matrix is a square N × N matrix, where N corresponds to the total
number of unique words in a corpus. A cell mij contains the number of times mi
co-occurs with word mj within a specific context: a natural unit such as a sentence, or a
certain window of m words. Note that the upper and lower triangles of the matrix are
identical, since co-occurrence is a symmetric relation.a

a Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in
Computing Word Co-occurrence Matrices with MapReduce, www.aclweb.org/anthology/D08-1044
Jaganadh G Elements of Text Mining
Word co-occourance
      w1   w2   w3   ...  wn
w1    m11  m12  m13  ...  m1n
w2    m21  m22  m23  ...  m2n
w3    m31  m32  m33  ...  m3n
...
wn    mn1  mn2  mn3  ...  mnn
Shape of a co-occurrence matrix
Jaganadh G Elements of Text Mining
Word co-occurrence
Finding co-occurrence matrix with Python
from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = [corpora[i], corpora[j]]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]
ccm = cooccurrence_matrix_corpus(corpus)
Jaganadh G Elements of Text Mining
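A quick check of the matrix built above (a small usage sketch; the counts follow from the toy corpus):

print ccm['w1']['w2']   # 1: 'w1' and 'w2' co-occur once in the first document
print ccm['w1']['w4']   # 0: they never appear in the same document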
Word co-occurrence
[Figure: word co-occurrence network, with nodes such as film, movie, story, like, chan, hong, sverak, czech, woody, tango]
Word co-occurrence visualization from 100 positive movie reviews. The plot shows the top first word and the top four
associated words; for each associated word, its associated words are plotted in turn.
Jaganadh G Elements of Text Mining
Stop Words
Stop Words
In computing, stop words are words which are filtered out prior to, or after, processing of
natural language data (text). In terms of linguistics these words are called function
words. Words like 'a', 'an', 'the' are examples of stop words. There is no single defined set of
stop words available; different applications and research groups use different sets of stop
words.
Generally stop words are omitted in the text mining process. The frequency of stop words
will be very high in any corpus compared to content words.
Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words
Jaganadh G Elements of Text Mining
Stop Words Filter
def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
    'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
    'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
    'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
    'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
    'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
    'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
    'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
    'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
    'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
    'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
    'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless
Jaganadh G Elements of Text Mining
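Instead of hard-coding the list, the same filter can be built on NLTK's bundled stop word corpus (a sketch; it assumes the 'stopwords' corpus has already been fetched with nltk.download()):

from nltk.corpus import stopwords

def stop_filter_nltk(words):
    stops = stopwords.words('english')  # NLTK's English stop word list
    return [word for word in words if word not in stops]

print stop_filter_nltk(['this', 'is', 'a', 'sample', 'text'])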
Bag of Words
The bag-of-words model is a simplifying representation used in natural language
processing and information retrieval (IR). In this model a text (such as a sentence or a
document) is represented as an un-ordered collection of words, disregarding grammar
and even word order.a

Analyzing text by only analyzing the frequency of words is called the bag of words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model
Jaganadh G Elements of Text Mining
Bag of Words
Example
d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1,
 'likes': 3, 'john': 2, 'mary': 4}
(after removing stopwords)
[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]
Each entry of the vectors refers to the count of
the corresponding entry in the dictionary.a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model
Jaganadh G Elements of Text Mining
Bag of Words
Documents
d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.
Vocabulary Index
VI(t) = 0 if t is 'football', 1 if t is 'games', 2 if t is 'john', 3 if t is 'likes',
        4 if t is 'mary', 5 if t is 'movies', 6 if t is 'watch'
Jaganadh G Elements of Text Mining
Bag of Words
       football  games  john  likes  mary  movies  watch
doc1       0       0      1     2      1     1       1
doc2       1       1      1     1      0     0       1
Jaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org/
Jaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python. Just for a sample :-(

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
        in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
        for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies. Mary likes too.",
    "John also likes to watch football games.")
bag_of_words(d)
Jaganadh G Elements of Text Mining
TF-IDF
Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects
how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t) = tf(t, d) × idf(t)

where 't' is a term in document 'd', and
tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = Σ x∈d fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise
Jaganadh G Elements of Text Mining
TF-IDF
TF-IDF
andidf(t) = log |D|
1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space
Jaganadh G Elements of Text Mining
TF
TF

tf(t, d) = Σ x∈d fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1
Jaganadh G Elements of Text Mining
Document Vector
To create a document vector space:

V(dn) = ( tf(t1, dn), tf(t2, dn), ..., tf(tn, dn) )

To represent 'd1' and 'd2' as vectors:

V(d1) = ( tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1) )
V(d2) = ( tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2) )

which evaluates to:
V(d1) = (0, 0, 1, 2, 1, 1, 1)
V(d2) = (1, 1, 1, 1, 0, 0, 1)
Jaganadh G Elements of Text Mining
Vector Space Matrix
The document vectors can be represented as a matrix M|D|×F,
where |D| is the cardinality of the document space:

M|D|×F = [ 0 0 1 2 1 1 1
           1 1 1 1 0 0 1 ]
Jaganadh G Elements of Text Mining
Vector Normalization
Normalized Vector

A normalized vector is represented as v̂ = v⃗ / ‖v⃗‖p, where v̂ is the unit vector, or the
normalized vector; v⃗ is the vector to be normalized; and ‖v⃗‖p is the norm
(magnitude or length) of the vector v⃗ in the Lp space (Lebesgue spaces).a

a http://en.wikipedia.org/wiki/Lp_space
Jaganadh G Elements of Text Mining
Vector Normalization
The length of a vector is calculated using the Euclidean norm.6
Non-normalized vector: v⃗ = (v1, v2, v3, ..., vn)
Length of vector: ‖v⃗‖ = sqrt( v1^2 + v2^2 + v3^2 + ... + vn^2 )
With norm: ‖v⃗‖p = ( |v1|^p + |v2|^p + |v3|^p + ... + |vn|^p )^(1/p)
It can be simplified as:
‖v⃗‖p = ( Σ i=1..n |vi|^p )^(1/p)
6 http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining
Vector Normalization
L2 Norm

The norm which we apply here is the L2 norm, which is also called the Euclidean norm.a
It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm
Jaganadh G Elements of Text Mining
Vector Normalization
v⃗(d1) = (0, 0, 1, 2, 1, 1, 1)
v̂(d1) = v⃗(d1) / ‖v⃗(d1)‖2
v̂(d1) = (0, 0, 1, 2, 1, 1, 1) / sqrt( 0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2 )
v̂(d1) = (0, 0, 1, 2, 1, 1, 1) / sqrt(8)
v̂(d1) = ( 0/sqrt(8), 0/sqrt(8), 1/sqrt(8), 2/sqrt(8), 1/sqrt(8), 1/sqrt(8), 1/sqrt(8) )
v̂(d1) = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)
Now our normalized vector v̂(d1) has an L2 norm of ‖v̂(d1)‖2 = 1.0
Jaganadh G Elements of Text Mining
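This arithmetic is easy to verify in Python (a small sketch reproducing the numbers above):

from __future__ import division
import math

v = [0, 0, 1, 2, 1, 1, 1]
norm = math.sqrt(sum(x ** 2 for x in v))    # sqrt(8)
unit = [x / norm for x in v]
print unit                                  # [0.0, 0.0, 0.3535..., 0.7071..., ...]
print math.sqrt(sum(x ** 2 for x in unit))  # 1.0, the L2 norm of the unit vector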
IDF
IDF

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )
where |{d : t ∈ d}| is the count of documents in which 't' is present, i.e. where tf(t, d) ≠ 0.
Jaganadh G Elements of Text Mining
Finding IDF
idf(ti) = log( |D| / (1 + |{d : ti ∈ d}|) ), e.g. log(2/1) = 0.69314718

idf(football) = log( 2 / (1+1) ) = 0.0
idf(games) = log( 2 / (1+1) ) = 0.0
idf(john) = log( 2 / (1+2) ) = -0.40546510810816444
idf(likes) = log( 2 / (1+2) ) = -0.40546510810816444
idf(mary) = log( 2 / (1+1) ) = 0.0
idf(movies) = log( 2 / (1+1) ) = 0.0
idf(watch) = log( 2 / (1+1) ) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
Jaganadh G Elements of Text Mining
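These idf values can be reproduced directly (a sketch; the document frequencies are read off the two example documents):

from __future__ import division
import math

D = 2  # number of documents
df = {'football': 1, 'games': 1, 'john': 2, 'likes': 2,
      'mary': 1, 'movies': 1, 'watch': 1}
for term in ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']:
    print term, math.log(D / (1 + df[term]))
# 'john' and 'likes' get log(2/3) = -0.405..., the rest get log(2/2) = 0.0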
TF-IDF weight
Finding TF-IDF weight

M|D|×F × Midf

[ tf(t1,d1) tf(t2,d1) tf(t3,d1) tf(t4,d1) tf(t5,d1) tf(t6,d1) tf(t7,d1) ]
[ tf(t1,d2) tf(t2,d2) tf(t3,d2) tf(t4,d2) tf(t5,d2) tf(t6,d2) tf(t7,d2) ]

x

[ idf(t1)    0        0        0        0        0        0      ]
[ 0       idf(t2)     0        0        0        0        0      ]
[ 0          0     idf(t3)     0        0        0        0      ]
[ 0          0        0     idf(t4)     0        0        0      ]
[ 0          0        0        0     idf(t5)     0        0      ]
[ 0          0        0        0        0     idf(t6)     0      ]
[ 0          0        0        0        0        0     idf(t7)   ]
Jaganadh G Elements of Text Mining
TF-IDF weight
[ tf(t1,d1)×idf(t1)  tf(t2,d1)×idf(t2)  tf(t3,d1)×idf(t3)  tf(t4,d1)×idf(t4)  tf(t5,d1)×idf(t5)  tf(t6,d1)×idf(t6)  tf(t7,d1)×idf(t7) ]
[ tf(t1,d2)×idf(t1)  tf(t2,d2)×idf(t2)  tf(t3,d2)×idf(t3)  tf(t4,d2)×idf(t4)  tf(t5,d2)×idf(t5)  tf(t6,d2)×idf(t6)  tf(t7,d2)×idf(t7) ]
Jaganadh G Elements of Text Mining
TF-IDF Normalization
L2 Normalization
Mtf-idf = Mtf-idf / ‖Mtf-idf‖2
Jaganadh G Elements of Text Mining
TF-IDF
Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfd.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, i in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[i])

7 http://scikit-learn.org/
Jaganadh G Elements of Text Mining
N-Grams
N-Gram
In the fields of computational linguistics and probability, an n-gram is a contiguous
sequence of n items from a given sequence of text or speech. An n-gram could be any
combination of letters; however, the items in question can be phonemes, syllables,
letters, words or base pairs according to the application.a

Unigrams are single words.

Bigrams are sequences of two words.

Trigrams are sequences of three words.

a http://en.wikipedia.org/wiki/N-gram
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-1)

Practice with Python

d1 = "John likes to watch movies. Mary likes too."
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [' '.join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies.',
 'movies. mary', 'mary likes', 'likes too.']
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-2, wi-1)

Practice with Python

d1 = "John likes to watch movies. Mary likes too."
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [' '.join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies.',
 'watch movies. mary', 'movies. mary likes', 'mary likes too.']
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from list of words
def ngrams(words, n=2):
    grams = [' '.join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies. Mary likes too."\
    .lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams
Jaganadh G Elements of Text Mining
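The bigram approximation P(wi | wi-1) can be estimated from such counts by maximum likelihood, C(wi-1, wi) / C(wi-1). A minimal sketch (the helper name bigram_prob is mine; it reuses the ngrams function above):

from __future__ import division

def bigram_prob(words, w_prev, w):
    # MLE estimate: count of the bigram divided by count of the history word
    bigrams = ngrams(words, n=2)
    return bigrams.count(w_prev + ' ' + w) / words.count(w_prev)

words = "john likes to watch movies mary likes too".split()
print bigram_prob(words, 'likes', 'to')  # 0.5: 'likes' occurs twice, once followed by 'to'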
Mutual Information
Mutual Information
Statistical test to measure the strength of word association:

I(wi, wj) = log2( P(wi, wj) / (P(wi) P(wj)) ) ≈ log2( N C(wi, wj) / (C(wi) C(wj)) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) / \
            math.log(2)
    return minfo
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure the strength of word association:

t(wi, wj) = ( mean(P(wi, wj)) - mean(P(wi)) mean(P(wj)) ) / sqrt( σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)) )
          ≈ ( C(wi, wj) - (1/N) C(wi) C(wj) ) / sqrt( C(wi, wj) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

t(strong, tea) = ( C(strong, tea) - (1/N) C(strong) C(tea) ) / sqrt( C(strong, tea) )
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc
Jaganadh G Elements of Text Mining
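Both association measures can be run over any tokenized text; a usage sketch (gpl-2.0.txt stands in for whatever corpus file is at hand):

words = open('gpl-2.0.txt', 'r').read().lower().split()
mi = mutual_info(words)
ts = tscore(words)
for gram in sorted(ts, key=ts.get, reverse=True)[:10]:
    print gram, ts[gram], mi[gram]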
Document Classification
Document classification or document categorization is a problem in library science,
information science and computer science. The task is to assign a document to one or
more classes or categories.8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually
human feedback, which provides the necessary information for the correct
classification of documents;

semi-supervised document classification, a mixture between supervised and
unsupervised classification: some documents or parts of documents are labeled by
external assistance;

unsupervised document classification is entirely executed without reference to
external information.

8 http://en.wikipedia.org/wiki/Document_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning
algorithm we wish to learn a classifier (or a classifier function) γ that maps documents to
classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification
function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

The main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or
Bayes' rule:

Bayes' rule: P(H|E) = ( P(E|H) × P(H) ) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be
predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability; this is the probability of an event before the
evidence is observed.
P(H|E) is called the posterior probability; this is the probability of an event after the
evidence is observed.
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Let H be the event of raining and E be the evidence of dark clouds; then we have:

P(raining | dark cloud) = ( P(dark cloud | raining) × P(raining) ) / P(dark cloud)

For multiple evidences:

P(H | E1, E2, ..., En) = ( P(E1, E2, ..., En | H) × P(H) ) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as follows:

P(H | E1, E2, ..., En) = ( P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) ) / P(E1, E2, ..., En)
Jaganadh G Elements of Text Mining
Naıve Bayes Applied in Text Classification
Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
The Naïve Bayes model for the training set will look like:

|V| = 6

C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
T  0.5    15  0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
E  0.5    12  0.05555556   0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class:
n(terrorism) = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n(entertainment) = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class.
Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a
uniform distribution over all words, as follows:
P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt          2     1      2       0      0     1   ?

To find the posterior probability:

P(ci|W) = P(ci) × Π j=1..|V| P(wj|ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)^2 × P(bomb|Terrorism)^1 × P(kidnap|Terrorism)^2 ×
P(music|Terrorism)^0 × P(movie|Terrorism)^0 × P(tv|Terrorism)^1
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)^2 × P(bomb|Entertainment)^1 ×
P(kidnap|Entertainment)^2 × P(music|Entertainment)^0 × P(movie|Entertainment)^0 × P(tv|Entertainment)^1
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document is classified as "Terrorism" because that class got the highest value.
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability scores assigned to the test document are very small. In real world situations we will
train the classifier with thousands of documents. In such cases the conditional probability values
will be too low for the CPU to handle. This problem is called underflow. To resolve the
problem we can take the logarithm of the probabilities, like:

log P(Terrorism|W) = log( 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1 )
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

log P(Entertainment|W) = log( 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1 )
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.5110 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system still classifies the test document as "Terrorism".
From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes classifier is prepared from the notes "A Tutorial on Naive Bayes
Classification" by Choochart Haruechaiyasak,
suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
Jaganadh G Elements of Text Mining
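The whole worked example fits in a few lines of Python (a sketch using the smoothed probabilities from the table above; note math.log is the natural log, while the slide uses log10, but the comparison between classes is unchanged):

from __future__ import division
import math

p_t = [0.5, 5/21, 4/21, 7/21, 1/21, 2/21, 2/21]  # P(T) and P(w|T) for kill..tv
p_e = [0.5, 1/18, 2/18, 2/18, 6/18, 5/18, 2/18]  # P(E) and P(w|E)
test = [2, 1, 2, 0, 0, 1]                        # counts of kill..tv in Dt

def log_posterior(p):
    score = math.log(p[0])
    for count, prob in zip(test, p[1:]):
        score += count * math.log(prob)
    return score

print log_posterior(p_t), log_posterior(p_e)  # Terrorism wins with the larger score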
Naıve Bayes Classifier
There are two different ways to set up a Naïve Bayes classifier:
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt
from the vocabulary. Dimension t of the vector for document di is written Bit, and is
either 0 or 1, indicating whether word wt occurs at least once in the document. With
such a document representation we make the naive Bayes assumption: that the
probability of each word occurring in a document is independent of the occurrence of
other words in the document.a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and
Kamal Nigam, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting the multi-variate Bernoulli model for text classification, our document
space representation will look like:
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequencies have been replaced by the presence or absence
of the word.
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency
information in documents. In the multinomial model, a document is an ordered sequence of word events
drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the
probability of each word event in a document is independent of the word's context and position in the
document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag of
words representation. We can use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.
Jaganadh G Elements of Text Mining
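sklearn ships both event models; a hedged sketch of the difference (binary=True in CountVectorizer produces the presence/absence features the Bernoulli model expects, while plain counts feed the multinomial model):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

docs = ['kill bomb kidnap tv', 'music movie tv']  # toy training texts
labels = [0, 1]                                   # 0 = terrorism, 1 = entertainment

counts = CountVectorizer().fit_transform(docs)             # word counts
binary = CountVectorizer(binary=True).fit_transform(docs)  # 0/1 presence features
print MultinomialNB().fit(counts, labels).predict(counts)
print BernoulliNB().fit(binary, labels).predict(binary)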
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a training
set, attempts to generalize, and makes correct predictions on new data. Consider a
training set (xi, yi), i = 1..n, where xi ∈ R^p (input feature vector) and yi ∈ {+1, -1} is the
corresponding label, whether (yi = +1) or (yi = -1). To start with, we assume that our
input feature vectors are linearly separable; that is, there exists a function
f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = -1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which can
separate positive and negative examples, but all of them are not equal. SVM tries to
find the particular hyperplane that maximizes the margin. The vectors which are closest to
the maximum margin are called support vectors. If the data is not linearly separable we
have to use kernel tricks to find soft margins.
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red squares and blue rectangles are scattered over a table. If the figures are
mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that
the data is linearly separable. A line drawn in the clear space between the figures, exactly equidistant
from the margins of the square and rectangle scatter regions, is called a separating hyperplane. Everything
on one side of the separating hyperplane belongs to one category, and everything on the other side
belongs to the other category (red squares and blue rectangles). A line drawn along the edge of each scatter
(square and rectangle), at an equal distance from the separating hyperplane, is called the maximum margin.
The figures closest to the separating hyperplane are known as support vectors10. If the data is not
linearly separable we have to use kernel tricks 11.
10 This is just a non-theoretic definition, "just to get an idea only". For more, refer
http://www.statsoft.com/textbook/support-vector-machines and the bibliography.
11 http://www.statsoft.com/textbook/support-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multinomial Naïve Bayes classifier with Python sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build an SVM classifier with Python sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a multi-variate Naïve Bayes classifier with Python NLTK: 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adopted from http://streamhacker.com/2010/05/10/
text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more
detailed discussion.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance of
an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A
confusion matrix for the classifier will look like:

                        Actual
               Positive              Negative
Predicted
  Positive  True Positive (TP)   False Positive (FP)
  Negative  False Negative (FN)  True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = (TP + TN) / (TP + FP + FN + TN)

                   Actual
          Positive  Negative  Total
Predicted
  Positive   562       77      639
  Negative   225      436      661
  Total      787      513     1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is
equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
                   Actual
          Positive  Negative  Total
Predicted
  Positive   562       77      639
  Negative   225      436      661
  Total      787      513     1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
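These figures are easy to reproduce with a few lines of Python (a small sketch; the TP/FP/FN/TN values are taken from the table above):

from __future__ import division

tp, fp, fn, tn = 562, 77, 225, 436
print (tp + tn) / (tp + fp + fn + tn)  # accuracy, 0.768...
print tp / (tp + fp), tn / (tn + fn)   # positive and negative precision
print tp / (tp + fn), tn / (tn + fp)   # positive and negative recall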
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                   Actual
          Positive  Negative  Total
Predicted
  Positive   562       77      639
  Negative   225      436      661
  Total      787      513     1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is the proportion of non-relevant items that were mistakenly selected. It is equivalent
to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                   Actual
          Positive  Negative  Total
Predicted
  Positive   562       77      639
  Negative   225      436      661
  Total      787      513     1300

Fall-out = 77 / (77 + 436) = 0.15
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy.
It considers both the precision p and the recall r of the test to compute the score: p is
the number of correct results divided by the number of all returned results, and r is the
number of correct results divided by the number of results that should have been
returned. The F1 score can be interpreted as a weighted average of the precision and
recall, where an F1 score reaches its best value at 1 and worst score at 0.a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78
F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value, or precision rate, is the proportion of positive test results that
are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performance
of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesis
testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure
of the quality of binary (two-class) classifications.

MCC = (TP × TN - FP × FN) / sqrt( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

MCC = (562 × 436 - 77 × 225) / sqrt( (562 + 77)(562 + 225)(436 + 77)(436 + 225) )
    = 0.55
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in a
classification model. It applies to the binary classification problem. ROC can be plotted
as a curve on an X-Y axis: the false positive rate is placed on the X axis, and the true
positive rate is placed on the Y axis. The top left corner is the optimal location on an
ROC graph, indicating a high true positive rate and a low false positive rate.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binary
classification model. The larger the AUC, the higher the likelihood that an actual
positive case will be assigned a higher probability of being positive than an actual
negative case. The AUC measure is especially useful for data sets with an unbalanced
target distribution (one target class dominates the other).
Jaganadh G Elements of Text Mining
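With sklearn, an ROC curve and its AUC can be computed from true labels and classifier scores (a hedged sketch; y_true and y_score are placeholder arrays, and roc_curve/auc come from sklearn.metrics):

from sklearn.metrics import roc_curve, auc

y_true = [0, 0, 1, 1]            # actual binary labels (hypothetical)
y_score = [0.1, 0.4, 0.35, 0.8]  # classifier scores for the positive class
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print auc(fpr, tpr)
# plotting fpr on the X axis and tpr on the Y axis gives the ROC curve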
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entity
extraction) is a sub-task of information extraction that seeks to locate and classify atomic
elements in text into predefined categories such as the names of persons, organizations,
locations, expressions of times, quantities, monetary values, percentages, etc.
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child \
of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python topia.termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the \
second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract/
Jaganadh G Elements of Text Mining
References
Chris Smith, Machine Learning, Text Feature Extraction (tf-idf) - Part I,
http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith, Machine Learning, Text Feature Extraction (tf-idf) - Part II,
http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python.
Accessed on 20th Sept. 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational
Linguistics Using Large Corpora, acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification,
http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT
Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification,
suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text
Classification, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on
March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing
Word Co-occurrence Matrices with MapReduce, www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information
Retrieval. Cambridge University Press, 2008.
Jaganadh G Elements of Text Mining
Word Proportion
Word Proportion
Let C be a corpus where (w1 w2 w3 wn) are the wordsSIZE(C) = length(tokens(C)) simply total number of words
so p(wi C) = f(wiC)SIZE(C) where f(wi C) is the frequency of wi in C
Finding Word Proportion
from __future__ import division
def word_propo(text)
words = tokenize(text)
wc = word_count(text)
propo = dict([(word wc[word]len(words)) for word
in set(words)])
return propo
text = How can this be implemented There are a lot of
subtleties such as dot being used in abbreviations
wp = word_propo(text)
for word propo in wpitems()
print word tt propo
Jaganadh G Elements of Text Mining
Words Types and Ratio
Words and Types
Words are valid tokens from a given corpus C. Types are the distinct tokens in a given corpus C.
Example:
C = "I shot an elephant in my pajamas. He saw the fine fat trout in the brook."
There are 16 words and 14 types in C.
words = ('i', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas', 'he', 'saw', 'the', 'fine', 'fat', 'trout', 'in', 'the', 'brook')
types = ('shot', 'i', 'saw', 'elephant', 'brook', 'fine', 'an', 'fat', 'in', 'my', 'the', 'he', 'pajamas', 'trout')
Jaganadh G Elements of Text Mining
Words Types and Ratio
Word Type Ratio
WTR(C) = WC(C) / C(T), where WC(C) is the total number of words in corpus C and C(T) is the count of types in corpus C.
Finding Word Type Ratio
from __future__ import division  # so the ratio comes out as a float

def word_type_ratio(text):
    words = tokenize(text)
    ratio = len(words) / len(set(words))
    return ratio

text = "I shot an elephant in my pajamas. He saw the fine \
fat trout in the brook."
ratio = word_type_ratio(text)
print ratio
Jaganadh G Elements of Text Mining
Finding top N words
Python code to find top N words from a text
from operator import itemgetter

def top_words(text, n=50):
    wordfreq = word_count(text)
    topwords = sorted(wordfreq.iteritems(), key=itemgetter(1),
        reverse=True)[:n]
    return topwords

text = open('gpl-2.0.txt', 'r').read()
topwords = top_words(text, n=50)
for word, count in topwords:
    print "%s \t %d" % (word, count)
Jaganadh G Elements of Text Mining
Plotting top N words
Python code to plot top 20 words from a text
import matplotlib.pyplot as plt

def plot_freq(text):
    tfw = top_words(text, n=20)
    x = range(len(tfw))
    y = [tfw[item][1] for item in range(len(tfw))]
    plt.plot(x, y, 'bo', ls='dotted')
    plt.xticks(range(0, len(tfw) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    plt.xlabel("Word Ranking")
    plt.ylabel("Word Frequency")
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq(text)
Jaganadh G Elements of Text Mining
Top N Words
Plot of top 20 words from GPL v2 (without filtering stop words)
Jaganadh G Elements of Text Mining
Plotting top N words
Python code to plot the top 50 words from a text. This plot labels each point with the word itself.
import matplotlib.pyplot as plt

def plot_freq_tag(text):
    tfw = top_words(text, n=50)
    words = [tfw[i][0] for i in range(len(tfw))]
    x = range(len(tfw))
    y = [tfw[item][1] for item in range(len(tfw))]
    fig = plt.figure()
    ax = fig.add_subplot(111, xlabel="Word Rank", ylabel="Word Frequency")
    ax.set_title('Top 50 words')
    ax.plot(x, y, 'go-', ls='dotted')
    plt.xticks(range(0, len(words) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    for i, label in enumerate(words):
        plt.text(x[i], y[i], label, rotation=45)
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq_tag(text)
Jaganadh G Elements of Text Mining
Top N Words
Plot of top 50 words from GPL v2 (without filtering stop words)
Jaganadh G Elements of Text Mining
Plotting histogram of top N words
Python code to plot histogram of top 50 words
import numpy as np
import matplotlib.pyplot as plt

def plot_hist(text):
    tw = top_words(text)
    words = [tw[i][0] for i in range(len(tw))]
    freq = [tw[j][1] for j in range(len(tw))]
    pos = np.arange(len(words))
    width = 1.0
    ax = plt.axes(frameon=True)
    ax.set_xticks(pos)
    ax.set_yticks(range(0, max(freq), 10))
    ax.set_xticklabels(words, rotation='vertical', fontsize=9)
    plt.bar(pos, freq, width, color='b')
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_hist(text)
Jaganadh G Elements of Text Mining
Histogram
Histogram of top 50 words from GPL v2 (without filtering stop words)
Jaganadh G Elements of Text Mining
Lexical Dispersion Plot
A lexical dispersion plot shows the position of a word in a given text. I have created some Python code to generate a lexical dispersion plot with reference to a given word list.3
import matplotlib.pyplot as plt

def dispersion_plot(text, words):
    wordst = tokenize(text)
    points = [(x, y) for x in range(len(wordst))
        for y in range(len(words)) if wordst[x] == words[y]]
    if points:
        x, y = zip(*points)
    else:
        x = y = ()
    plt.plot(x, y, "go", scalex=2)
    plt.yticks(range(len(words)), words, color="b")
    plt.ylim(-1, len(words))
    plt.title("Lexical Dispersion Plot")
    plt.xlabel("Word Offset")
    plt.show()

gpl = open('gpl-2.0.txt', 'r').read()
dispersion_plot(gpl, ['software', 'license', 'copy', 'gnu', 'program', 'free'])

3 Code taken from http://nltk.googlecode.com/svn/trunk/doc/api/nltk.draw.dispersion-pysrc.html and modified
Jaganadh G Elements of Text Mining
Lexical Dispersion Plot
Lexical Dispersion Plot from GPL text
Jaganadh G Elements of Text Mining
Tag Cloud
A tag cloud (word cloud, or weighted list in visual design) is a visual representation for text data, typically used to depict keyword meta-data (tags) on websites or to visualize free form text. There is an interesting Python tool to generate tag clouds from text called pytagcloud.4 Here comes an example of creating a tag cloud from the first 100 words of the GPL text.
from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts

def create_tag_cloud(text):
    words = tokenize(text)
    doc = " ".join(d for d in words[:100])
    tags = make_tags(get_tag_counts(doc), maxsize=80)
    create_tag_image(tags, 'gpl.png', size=(900, 600),
        fontname='Philosopher')

gpl = open('gpl-2.0.txt', 'r').read()
create_tag_cloud(gpl)

4 https://github.com/atizo/PyTagCloud
Jaganadh G Elements of Text Mining
Tag Cloud
Tag cloud from GPL text
Jaganadh G Elements of Text Mining
Word co-occurrence
Word co-occurrence analysis is used to construct lexicons, ontologies, etc. In general it aims to find similarities between word pairs.
A word co-occurrence matrix is a square N × N matrix, where N corresponds to the total number of unique words in a corpus. A cell mij contains the number of times mi co-occurs with word mj within a specific context: a natural unit such as a sentence, or a certain window of m words. Note that the upper and lower triangles of the matrix are identical, since co-occurrence is a symmetric relation.a
a Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044
Jaganadh G Elements of Text Mining
Word co-occurrence

      w1    w2    w3   ...  wn
w1    m11   m12   m13  ...  m1n
w2    m21   m22   m23  ...  m2n
w3    m31   m32   m33  ...  m3n
...
wn    mn1   mn2   mn3  ...  mnn

Shape of a co-occurrence matrix
Jaganadh G Elements of Text Mining
Word co-occurrence
Finding co-occurrence matrix with Python
from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = [corpora[i], corpora[j]]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]
ccm = cooccurrence_matrix_corpus(corpus)
Jaganadh G Elements of Text Mining
Word co-occurrence
[Figure] Word co-occurrence visualization from 100 positive movie reviews. The plot shows the top word and its top four associated words; for each associated word, its own associated words are plotted in turn.
Jaganadh G Elements of Text Mining
Stop Words
Stop Words
In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In terms of linguistics these words are called function words. Words like 'a', 'an', 'the' are examples of stop words. There is no defined set of stop words available; different applications and research groups use different sets of stop words.
Generally stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus compared to content words.
Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words
Jaganadh G Elements of Text Mining
Stop Words Filter
def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
    'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
    'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
    'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
    'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
    'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
    'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
    'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
    'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
    'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
    'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
    'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless
Jaganadh G Elements of Text Mining
Bag of Words
The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an un-ordered collection of words, disregarding grammar and even word order.a
Analyzing text by looking only at the frequency of words is called the bag of words model.
a http://en.wikipedia.org/wiki/Bag_of_words_model
Jaganadh G Elements of Text Mining
Bag of Words
Example
d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

{'football': 0, 'games': 1, 'john': 2, 'likes': 3, 'mary': 4, 'movies': 5, 'watch': 6}
(vocabulary index, after removing stopwords)

[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of the corresponding entry in the dictionary.a
a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model
Jaganadh G Elements of Text Mining
Bag of Words
Documents
d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.
Vocabulary Index
VI(t) =
  0 if t is 'football'
  1 if t is 'games'
  2 if t is 'john'
  3 if t is 'likes'
  4 if t is 'mary'
  5 if t is 'movies'
  6 if t is 'watch'
Jaganadh G Elements of Text Mining
Bag of Words
       football  games  john  likes  mary  movies  watch
doc1       0       0      1     2      1      1      1
doc2       1       1      1     1      0      0      1
Jaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python and sklearn 5
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org
Jaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with plain Python (just a toy sample)
def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
        in vocab])
    bow = [[tokens.count(word) for word in vocab]
        for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies. Mary likes too.",
    "John also likes to watch football games.")
bag_of_words(d)
Jaganadh G Elements of Text Mining
TF-IDF
Tf–idf, term frequency–inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.
TF-IDF
tf-idf(t, d) = tf(t, d) × idf(t)
where t is a term in document d;
tf(t, d): how many times the term t is present in d:
tf(t, d) = Σ(x ∈ d) fr(x, t)
where
fr(x, t) = 1 if x = t, 0 otherwise
Jaganadh G Elements of Text Mining
TF-IDF
TF-IDF
and
idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )
where |{d : t ∈ d}| is the count of documents in which t is present (tf(t, d) ≠ 0), and |D| is the cardinality of the document space.
Jaganadh G Elements of Text Mining
TF
TF
tf(t, d) = Σ(x ∈ d) fr(x, t)
fr(x, t) is a simple function:
fr(x, t) = 1 if x = t, 0 otherwise
Example: tf('john', d1) = 1
Jaganadh G Elements of Text Mining
Document Vector
To create a document vector space:
V(dn) = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))
To represent d1 and d2 as vectors:
V(d1) = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))
V(d2) = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))
which evaluates to:
V(d1) = (0, 0, 1, 2, 1, 1, 1)
V(d2) = (1, 1, 1, 1, 0, 0, 1)
Jaganadh G Elements of Text Mining
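A minimal sketch of building these term-frequency vectors in plain Python; the vocabulary order follows the vocabulary index from the Bag of Words slides:

# Building the document vectors V(d1) and V(d2) from the vocabulary index
vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']

def doc_vector(tokens):
    # tf(t, d) for every term in the vocabulary, in index order
    return [tokens.count(t) for t in vocab]

d1 = ['john', 'likes', 'watch', 'movies', 'mary', 'likes']
d2 = ['john', 'likes', 'watch', 'football', 'games']
print doc_vector(d1)   # [0, 0, 1, 2, 1, 1, 1]
print doc_vector(d2)   # [1, 1, 1, 1, 0, 0, 1]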
Vector Space Matrix
The document vectors can be represented as a matrix M(|D| × F), where |D| is the cardinality of the document space:
M(|D| × F) =
[ 0 0 1 2 1 1 1 ]
[ 1 1 1 1 0 0 1 ]
Jaganadh G Elements of Text Mining
Vector Normalization
Normalized Vector
A normalized vector is represented as v̂ = v / ||v||p, where v̂ is the unit vector (the normalized vector), v is the vector to be normalized, and ||v||p is the norm (magnitude or length) of the vector v in the Lp space (Lebesgue spaces).a
a http://en.wikipedia.org/wiki/Lp_space
Jaganadh G Elements of Text Mining
Vector Normalization
The length of a vector is calculated using the Euclidean norm.6
Non-normalized vector: v = (v1, v2, v3, ..., vn)
Length of vector: ||v|| = √(v1^2 + v2^2 + v3^2 + ... + vn^2)
With the p-norm: ||v||p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)
It can be simplified as:
||v||p = ( Σ(i=1..n) |vi|^p )^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining
Vector Normalization
L2 Norm
The norm we apply here is the L2 norm, which is also called the Euclidean norm.a
It is a common norm used to measure the length of a vector, where p = 2.
a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm
Jaganadh G Elements of Text Mining
Vector Normalization
v(d1) = (0, 0, 1, 2, 1, 1, 1)
v̂(d1) = v(d1) / ||v(d1)||2
v̂(d1) = (0, 0, 1, 2, 1, 1, 1) / √(0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2)
v̂(d1) = (0, 0, 1, 2, 1, 1, 1) / √8
v̂(d1) = (0/√8, 0/√8, 1/√8, 2/√8, 1/√8, 1/√8, 1/√8)
v̂(d1) = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)
Now our normalized vector v̂(d1) has an L2 norm of ||v̂(d1)||2 = 1.0
Jaganadh G Elements of Text Mining
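The same L2 normalization can be checked with NumPy; a minimal sketch:

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
norm = np.linalg.norm(v_d1)   # L2 norm: sqrt(8) ~ 2.8284
v_hat = v_d1 / norm
print v_hat                   # [0. 0. 0.3535 0.7071 0.3535 0.3535 0.3535]
print np.linalg.norm(v_hat)   # 1.0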
IDF
IDF
idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )
where |{d : t ∈ d}| is the count of documents in which t is present (tf(t, d) ≠ 0)
Jaganadh G Elements of Text Mining
Finding IDF
idf(ti) = log( |D| / (1 + |{d : ti ∈ d}|) ), e.g. log(2/1) = 0.69314718
idf(football) = log(2 / (1+1)) = 0.0
idf(games) = log(2 / (1+1)) = 0.0
idf(john) = log(2 / (1+2)) = −0.40546510810816444
idf(likes) = log(2 / (1+2)) = −0.40546510810816444
idf(mary) = log(2 / (1+1)) = 0.0
idf(movies) = log(2 / (1+1)) = 0.0
idf(watch) = log(2 / (1+1)) = 0.0
idf(V) = (0.0, 0.0, −0.40546510810816444, −0.40546510810816444, 0.0, 0.0, 0.0)
Jaganadh G Elements of Text Mining
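A minimal sketch of computing this idf vector in Python; the document frequencies are taken from the two example documents:

import math

vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']
doc_freq = {'football': 1, 'games': 1, 'john': 2, 'likes': 2,
            'mary': 1, 'movies': 1, 'watch': 1}
D = 2  # number of documents

idf = [math.log(D / float(1 + doc_freq[t])) for t in vocab]
print idf  # [0.0, 0.0, -0.4054..., -0.4054..., 0.0, 0.0, 0.0]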
TF-IDF weight
Finding TF-IDF weight
M(|D| × F) × Midf =
[ tf(t1, d1)  tf(t2, d1)  tf(t3, d1)  tf(t4, d1)  tf(t5, d1)  tf(t6, d1)  tf(t7, d1) ]
[ tf(t1, d2)  tf(t2, d2)  tf(t3, d2)  tf(t4, d2)  tf(t5, d2)  tf(t6, d2)  tf(t7, d2) ]
×
[ idf(t1)     0        0        0        0        0        0     ]
[    0     idf(t2)     0        0        0        0        0     ]
[    0        0     idf(t3)     0        0        0        0     ]
[    0        0        0     idf(t4)     0        0        0     ]
[    0        0        0        0     idf(t5)     0        0     ]
[    0        0        0        0        0     idf(t6)     0     ]
[    0        0        0        0        0        0     idf(t7)  ]
Jaganadh G Elements of Text Mining
TF-IDF weight
[ tf(t1, d1)×idf(t1)  tf(t2, d1)×idf(t2)  tf(t3, d1)×idf(t3)  tf(t4, d1)×idf(t4)  tf(t5, d1)×idf(t5)  tf(t6, d1)×idf(t6)  tf(t7, d1)×idf(t7) ]
[ tf(t1, d2)×idf(t1)  tf(t2, d2)×idf(t2)  tf(t3, d2)×idf(t3)  tf(t4, d2)×idf(t4)  tf(t5, d2)×idf(t5)  tf(t6, d2)×idf(t6)  tf(t7, d2)×idf(t7) ]
Jaganadh G Elements of Text Mining
TF-IDF Normalization
L2 Normalization
M(tf-idf) = M(tf-idf) / ||M(tf-idf)||2 (each document row is divided by its L2 norm)
Jaganadh G Elements of Text Mining
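A minimal sketch tying the steps together with NumPy: the tf matrix is scaled by the (diagonal) idf values, then each row is L2-normalized. The idf values follow the worked example above.

import numpy as np

tf = np.array([[0, 0, 1, 2, 1, 1, 1],
               [1, 1, 1, 1, 0, 0, 1]], dtype=float)
idf = np.array([0.0, 0.0, -0.40546511, -0.40546511, 0.0, 0.0, 0.0])

tfidf = tf * idf                       # same as tf.dot(np.diag(idf))
norms = np.linalg.norm(tfidf, axis=1)  # L2 norm of each row
tfidf = tfidf / norms[:, np.newaxis]
print tfidf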
TF-IDF
Practice with Python and sklearn 7
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, i in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[i])

7 http://scikit-learn.org
Jaganadh G Elements of Text Mining
N-Grams
N-Gram
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters; however, the items in question can be phonemes, syllables, letters, words or base pairs according to the application.a
Unigrams are single words
Bigrams are sequences of two words
Trigrams are sequences of three words
a http://en.wikipedia.org/wiki/N-gram
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P(wi | w1, w2, ..., wi−1) ≈ P(wi | wi−1)
Practice with Python
d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams
['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P(wi | w1, w2, ..., wi−1) ≈ P(wi | wi−2, wi−1)
Practice with Python
d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams
['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from list of words
def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too"\
    .lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure the strength of word association:
I(wi, wj) = log2( P(wi, wj) / (P(wi) P(wj)) ) ≈ log2( N C(wi, wj) / (C(wi) C(wj)) )
where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.
I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) \
            / math.log(2)
    return minfo
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure the strength of word association:
t(wi, wj) = ( mean(P(wi, wj)) − mean(P(wi)) mean(P(wj)) ) / √( σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)) )
≈ ( C(wi, wj) − (1/N) C(wi) C(wj) ) / √C(wi, wj)
where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.
t(strong, tea) = ( C(strong, tea) − (1/N) C(strong) C(tea) ) / √C(strong, tea)
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) \
            / math.sqrt(gramcount[gram])
    return tsc
Jaganadh G Elements of Text Mining
Document Classification
Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.8
Document classification tasks can be divided into three kinds:
supervised document classification: performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents
semi-supervised document classification: a mixture between supervised and unsupervised classification; some documents or parts of documents are labeled by external assistance
unsupervised document classification: entirely executed without reference to external information
8 http://en.wikipedia.org/wiki/Document_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (or classifier function) γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ
γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise
Main approaches in document classification are:
Naïve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9
9 Image taken from http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
Naïve Bayes Classification
Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem, or Bayes' rule:
P(H|E) = P(E|H) × P(H) / P(E)
The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidences (E) that can be observed.
P(H) is called the prior probability: the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability: the probability of an event after the evidence is observed.
Jaganadh G Elements of Text Mining
Naïve Bayes Classification
Let H be the event of raining and E be the evidence of dark clouds; then we have
P(raining | dark cloud) = P(dark cloud | raining) × P(raining) / P(dark cloud)
For multiple evidences:
P(H | E1, E2, ..., En) = P(E1, E2, ..., En | H) × P(H) / P(E1, E2, ..., En)
With the independence assumption we can rewrite Bayes' rule as follows:
P(H | E1, E2, ..., En) = P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) / P(E1, E2, ..., En)
Jaganadh G Elements of Text Mining
Naïve Bayes Applied in Text Classification
Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:
Training Docs  kill  bomb  kidnap  music  movie  tv  C
D0               2     1      3      0      0     1  Terrorism
D1               1     1      1      0      0     0  Terrorism
D2               1     1      2      0      1     0  Terrorism
D3               0     1      0      2      1     1  Entertainment
D4               0     0      1      1      1     0  Entertainment
D5               0     0      0      2      2     2  Entertainment
Jaganadh G Elements of Text Mining
Building Naïve Bayes Model
The Naïve Bayes model for the training set will look like:
|V| = 6
C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)   P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
T  0.5    15  0.238095238  0.190476190  0.333333333   0.047619048  0.095238095  0.095238095
E  0.5    12  0.055555556  0.111111111  0.111111111   0.333333333  0.277777778  0.111111111
|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = number of documents in a class / number of all documents
ni = the total word frequency in each class:
n(terrorism) = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n(entertainment) = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class.
Example:
P(kill|Terrorism) = (2+1+1) / 15 = 4/15
P(kill|Entertainment) = (0+0+0) / 12 = 0/12
Jaganadh G Elements of Text Mining
Building Naïve Bayes Model
To avoid the "zero frequency" problem we apply Laplace estimation by assuming a uniform distribution over all words, as follows:
P(kill|Terrorism) = (2+1+1+1) / (15 + |V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0+0+0+1) / (12 + |V|) = 1/18 = 0.0555
Jaganadh G Elements of Text Mining
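A minimal sketch of these Laplace-smoothed estimates; the vocabulary size, class totals and word counts are the ones given on the slides:

from __future__ import division

V = 6                                          # vocabulary size
n = {'Terrorism': 15, 'Entertainment': 12}     # total word count per class (from the slide)
wc = {('kill', 'Terrorism'): 4, ('kill', 'Entertainment'): 0}  # summed from the table

def cond_prob(word, cls):
    # Laplace-smoothed conditional probability P(word | class)
    return (wc[(word, cls)] + 1) / (n[cls] + V)

print cond_prob('kill', 'Terrorism')      # 5/21 = 0.2380...
print cond_prob('kill', 'Entertainment')  # 1/18 = 0.0555...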
Testing the NB model
Our test document is:
Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt           2     1      2      0      0     1  ?
To find the posterior probability:
P(ci|W) = P(ci) × Π(j=1..|V|) P(wj|ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P(Terrorism|W) = P(Terrorism) × P(kill|T)^2 × P(bomb|T)^1 × P(kidnap|T)^2 × P(music|T)^0 × P(movie|T)^0 × P(tv|T)^1
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^−5
P(Entertainment|W) = P(Entertainment) × P(kill|E)^2 × P(bomb|E)^1 × P(kidnap|E)^2 × P(music|E)^0 × P(movie|E)^0 × P(tv|E)^1
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^−7
The document is classified as "Terrorism" because that class got the highest value.
Jaganadh G Elements of Text Mining
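A minimal sketch of the same posterior computation in plain probability space, using the conditional probabilities from the model table:

prior = 0.5
cond = {'Terrorism':     [0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952],
        'Entertainment': [0.0555, 0.1111, 0.1111, 0.3333, 0.2777, 0.1111]}
test = [2, 1, 2, 0, 0, 1]  # kill, bomb, kidnap, music, movie, tv

for cls in cond:
    score = prior
    for p, f in zip(cond[cls], test):
        score *= p ** f    # p^0 = 1 for absent words
    print cls, score       # Terrorism ~5.7e-05, Entertainment ~2.3e-07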
Preventing Underflow
The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents; in such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:
log P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= −0.3010 − 1.2468 − 0.7203 − 0.9543 + 0 + 0 − 1.0213 = −4.2437
log P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= −0.3010 − 2.5110 − 0.9542 − 1.9085 + 0 + 0 − 0.9542 = −6.6289
After handling the underflow problem our system still classifies the test document as "Terrorism". From the final probability scores you can observe that the values are scaled nicely.
The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
Jaganadh G Elements of Text Mining
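A minimal sketch of the same scoring done in log space, which is how the underflow is avoided in practice:

import math

log_prior = math.log10(0.5)
cond_T = {'kill': 0.2380, 'bomb': 0.1904, 'kidnap': 0.3333,
          'music': 0.0476, 'movie': 0.0952, 'tv': 0.0952}
test_doc = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}

score_T = log_prior + sum(freq * math.log10(cond_T[w])
                          for w, freq in test_doc.items() if freq)
print score_T   # ~ -4.2437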
Naïve Bayes Classifier
There are two different ways to set up a Naïve Bayes Classifier:
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a
a Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by presence or absenceof the word
Jaganadh G Elements of Text Mining
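Assuming scikit-learn is available, the same binary representation can be fed to its BernoulliNB; a minimal sketch with the toy table above:

from sklearn.naive_bayes import BernoulliNB

X = [[1, 1, 1, 0, 0, 1],   # D0..D2: Terrorism
     [1, 1, 1, 0, 0, 0],
     [1, 1, 1, 0, 1, 0],
     [0, 1, 0, 1, 1, 1],   # D3..D5: Entertainment
     [0, 0, 1, 1, 1, 0],
     [0, 0, 0, 1, 1, 1]]
y = ['T', 'T', 'T', 'E', 'E', 'E']

clf = BernoulliNB(binarize=None)  # features are already 0/1
clf.fit(X, y)
print clf.predict([[1, 1, 1, 0, 0, 1]])  # ['T']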
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.
In the NB example which we worked out we applied the multinomial model. In that model we used a simple bag of words representation. We can use a smoothed bag of words model like TF-IDF in the multinomial naive Bayes model.
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}(i=1..n), where xi ∈ R^p (input feature vector) and yi ∈ {1, −1} is the corresponding label, (yi = +1) or (yi = −1). To start with, we assume that our input feature vectors are linearly separable; that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that
⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = −1
⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate the positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red squares and blue rectangles are scattered over a table. If the figures are mixed then it is not possible to separate them in a linear way.
If there is a clear blank space available on the table between the squares and rectangles, we can say that the data is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margin of the scatter region of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11
10 This is just a non-theoretic definition, "just to get an idea only". For more refer http://www.statsoft.com/textbook/support-vector-machines and bibliography.
11 http://www.statsoft.com/textbook/support-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build an SVM Classifier with Python Sklearn:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK:12
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adopted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.
Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                              Actual
                     Positive              Negative
Predicted Positive   True Positive (TP)    False Positive (FP)
          Negative   False Negative (FN)   True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = (TP + TN) / (TP + FP + FN + TN)

                     Actual
                     Positive  Negative  Total
Predicted Positive     562        77      639
          Negative     225       436      661
Total                  787       513     1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.77
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision, which indicates how many of the items that we identified were relevant:
Precision = TP / (TP + FP)
Recall, which indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":
Recall = TP / (TP + FN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
                     Actual
                     Positive  Negative  Total
Predicted Positive     562        77      639
          Negative     225       436      661
Total                  787       513     1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong.
Error Rate = (FP + FN) / (TP + FP + FN + TN)

                     Actual
                     Positive  Negative  Total
Predicted Positive     562        77      639
          Negative     225       436      661
Total                  787       513     1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).
Fall-out = FP / (FP + TN)

                     Actual
                     Positive  Negative  Total
Predicted Positive     562        77      639
          Negative     225       436      661
Total                  787       513     1300

Fall-out = 77 / (77 + 436) = 0.15
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a
F1 Score = 2 × (precision × recall) / (precision + recall)
a http://en.wikipedia.org/wiki/F1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score (Positive) = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78
F1 Score (Negative) = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
Jaganadh G Elements of Text Mining
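A minimal sketch computing the metrics discussed so far from the confusion matrix counts used in these slides:

from __future__ import division

TP, FP, FN, TN = 562, 77, 225, 436

accuracy   = (TP + TN) / (TP + FP + FN + TN)
precision  = TP / (TP + FP)     # positive precision
recall     = TP / (TP + FN)     # positive recall
error_rate = (FP + FN) / (TP + FP + FN + TN)
fallout    = FP / (FP + TN)
f1         = 2 * precision * recall / (precision + recall)

print accuracy, precision, recall, error_rate, fallout, f1
# 0.7676..., 0.8794..., 0.7141..., 0.2323..., 0.1501..., 0.7882...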
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.
Positive predictive value = TP / (TP + FP)
Positive predictive value = 562 / (562 + 77) = 0.87
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TN / (TN + FN)
NPV = 436 / (436 + 225) = 0.65
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TN / (FP + TN)
Specificity = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FP / (FP + TP)
FDR = 77 / (77 + 562) = 0.12
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
MCC = (562 × 436 − 77 × 225) / √((562+77)(562+225)(436+77)(436+225)) = 0.55
Jaganadh G Elements of Text Mining
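A minimal sketch of the MCC computation with the same counts:

import math

TP, FP, FN, TN = 562, 77, 225, 436
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print mcc   # ~0.55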
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
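Assuming scikit-learn is available, an ROC curve and its AUC can be computed from true labels and classifier scores; a minimal sketch with toy values:

from sklearn.metrics import roc_curve, auc

y_true  = [1, 1, 0, 1, 0, 0, 1, 0]                    # actual binary labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]    # classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print "AUC:", auc(fpr, tpr)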
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
Word Proportion
Word Proportion
Let C be a corpus where (w1 w2 w3 wn) are the wordsSIZE(C) = length(tokens(C)) simply total number of words
so p(wi C) = f(wiC)SIZE(C) where f(wi C) is the frequency of wi in C
Finding Word Proportion
from __future__ import division
def word_propo(text)
words = tokenize(text)
wc = word_count(text)
propo = dict([(word wc[word]len(words)) for word
in set(words)])
return propo
text = How can this be implemented There are a lot of
subtleties such as dot being used in abbreviations
wp = word_propo(text)
for word propo in wpitems()
print word tt propo
Jaganadh G Elements of Text Mining
Words Types and Ratio
Words and Types
Words and valid tokens from a given corpus CTypes are total distinct tokens in a given corpus CExampleC = rdquoI shot an elephant in my pajamas He saw the fine fat trout in the brookrdquoThere are 16 words and 15 types in Cwords = (rsquoirsquo rsquoshotrsquo rsquoanrsquo rsquoelephantrsquo rsquoinrsquo rsquomyrsquo rsquopajamasrsquo rsquohersquo rsquosawrsquo rsquothersquo rsquofinersquo rsquofatrsquo rsquotroutrsquo rsquoinrsquo rsquothersquorsquobrookrsquo)types = (rsquoshotrsquo rsquoirsquo rsquosawrsquo rsquoelephantrsquo rsquobrookrsquo rsquofinersquo rsquoanrsquo rsquofatrsquo rsquoinrsquo rsquomyrsquo rsquothersquo rsquohersquo rsquopajamasrsquo rsquotroutrsquo)
Jaganadh G Elements of Text Mining
Words Types and Ratio
Word Type Ratio
WTR(C) = WC(C)C(T ) where WC(C) is total number of words in corpus C and C(T ) is the
count of types in corpus C
Finding Word Type Ratio
def word_type_ratio(text)
words = tokenize(text)
ratio = len(words) len(set(words))
return ratio
text = I shot an elephant in my pajamas He saw the fine
fat trout in the brook
ratio = word_type_ratio(text)
print ratio
Jaganadh G Elements of Text Mining
Words Types and Ratio
Word Type Ratio
WTR(C) = WC(C)C(T ) where WC(C) is total number of words in corpus C and C(T ) is the
count of types in corpus C
Finding Word Type Ratio
def word_type_ratio(text)
words = tokenize(text)
ratio = len(words) len(set(words))
return ratio
text = I shot an elephant in my pajamas He saw the fine
fat trout in the brook
ratio = word_type_ratio(text)
print ratio
Jaganadh G Elements of Text Mining
Finding top N words
Python code to find top N words from a text
from operator import itemgetter
def top_words(textn=50)
wordfreq = word_count(text)
topwords = sorted(wordfreqiteritems() key = itemgetter(1)
reverse=True)[n]
return topwords
text = open(rsquogpl-20txtrsquorsquorrsquo)read()
topwords = top_words(textn=50)
for word count in topwords
print s t d (wordcount)
Jaganadh G Elements of Text Mining
Plotting top N words
Python code to plot top 20 words from a text
import numpy as np
import matplotlibpyplot as plt
def plot_freq(text)
tfw = top_words(text n= 20)
x = range(len(tfw))
np = len(tfw)
y = []
for item in range(np)
y = y + [tfw[item][1]]
pltplot(xyrsquoborsquols=rsquodottedrsquo)
pltxticks(range(0 len(words) + 1 1))
pltyticks(range(0 max(y) + 1 10))
pltxlabel(Word Ranking)
pltylabel(Word Frequency)
pltshow()
text = open(rsquogpl-20txtrsquorsquorrsquo)read()
plot_freq(text)
Jaganadh G Elements of Text Mining
Top N Words
Plot of top 50 words from GPL v2 (without filtering stop words)
Jaganadh G Elements of Text Mining
Plotting top N words
Python code to plot top 50 words from a text This plot will show words in the plot
import numpy as np
import matplotlibpyplot as plt
def plot_freq_tag(text)
tfw = top_words(text n= 50)
words = [tfw[i][0] for i in range(len(tfw))]
x = range(len(tfw))
np = len(tfw)
y = []
for item in range(np)
y = y + [tfw[item][1]]
fig = pltfigure()
ax = figadd_subplot(111xlabel=Word Rankylabel=Word Freqquncy)
axset_title(rsquoTop 50 wordsrsquo)
axplot(x y rsquogo-rsquols=rsquodottedrsquo)
pltxticks(range(0 len(words) + 1 1))
pltyticks(range(0 max(y) + 1 10))
for i label in enumerate(words)
plttext (x[i] y[i] label rotation=45)
pltshow()
text = open(rsquogpl-20txtrsquorsquorrsquo)read()
plot_freq(text)
Jaganadh G Elements of Text Mining
Top N Words
Plot of top 50 words from GPL v2 (without filtering stop words)
Jaganadh G Elements of Text Mining
Plotting histogram of top N words
Python code to plot histogram of top 50 words
import numpy as np
import matplotlibpyplot as plt
def plot_hist(text)
tw = top_words(text)
words = [tw[i][0] for i in range(len(tw))]
freq = [tw[j][1] for j in range(len(tw))]
pos = nparange(len(words))
width = 10
ax = pltaxes(frameon=True)
axset_xticks(pos)
axset_yticks(range(0max(freq)10))
axset_xticklabels(wordsrotation=rsquoverticalrsquofontsize=9)
pltbar(posfreqwidth color=rsquobrsquo)
pltshow()
text = open(rsquogpl-20txtrsquorsquorrsquo)read()
plot_hist(text)
Jaganadh G Elements of Text Mining
Histogram
Histogram of top 50 words from GPL v2 (without filtering stop words)
Jaganadh G Elements of Text Mining
Lexical Dispersion Plot
A lexical dispersion plot shows position of a word in given text I have created somePython code to generate lexical dispersion plot with reference to a given word list3
def dispersion_plot(textwords)
wordst = tokenize(text)
points = [(xy) for x in range(len(wordst))
for y in range(len(words)) if wordst[x] == words[y]]
if points
xy = zip(points)
else
x = y = ()
pltplot(xygoscalex=2)
pltyticks(range(len(words))wordscolor=b)
pltylim(-1len(words))
plttitle(Lexical Dispersion Plot)
pltxlabel(Word Offset)
pltshow()
gpl = open(rsquogpl-20txtrsquorsquorrsquo)read()
dispersion_plot(gpl[rsquosoftwarersquorsquolicensersquorsquocopyrsquorsquognursquorsquoprogramrsquorsquofreersquo])
3Code taken fromhttpnltkgooglecodecomsvntrunkdocapinltkdrawdispersion-pysrchtml and modified
Jaganadh G Elements of Text Mining
Lexical Dispersion Plot
Lexical Dispersion Plot plot from GPL text
Jaganadh G Elements of Text Mining
Tag Cloud
A tag cloud (word cloud or weighted list in visual design) is a visual representation for text datatypically used to depict keyword meta-data (tags) on websites or to visualize free form textThere is an interesting python tool to generate tag clouds from text called pytagcloud 4 herecomes an example of creating tag cloud from first 100 words from GPL text
from pytagcloud import create_tag_image make_tags
from pytagcloudlangcounter import get_tag_counts
def create_tag_cloud(text)
words = tokenize(text)
doc = join(d for d in words[100])
tags = make_tags(get_tag_counts(doc) maxsize=80)
create_tag_image(tags rsquogplpngrsquo size=(900 600)
fontname=rsquoPhilosopherrsquo)
gpl = open(rsquogpl-20txtrsquorsquorrsquo)read()
create_tag_cloud(gpl)
4httpsgithubcomatizoPyTagCloud
Jaganadh G Elements of Text Mining
Tag Cloud
Tag cloud from GPL text
Jaganadh G Elements of Text Mining
Word co-occurrence
Word co-occurrence analysis is used to construct lexicon ontologies etc In general itaims to find similarities between word pairs
A word co-occurrence matrix is a square of N timesN matrix where N corresponds totalnumber of unique words in a corpus A cell mij contains the number of times mi
co-occur with word mj with in a specific contextmdash a natural unit such as a sentence or acertain window of m words Note that the upper and lower triangles of the matrix areidentical since co-occurrence is a symmetric relation a
aJimmy LinScalable Language Processing Algorithms for the Masses A Case Study inComputing Word Co-occurrence Matrices with MapReducewwwaclweborganthologyD08-1044
Jaganadh G Elements of Text Mining
Word co-occourance
w1 w2 w3 wnw1 m11 m12 m13 m1n
w2 m21 m22 m23 m2n
w3 m31 m32 m33 m3n
wn mn1 mn2 mn3 mnnShape of a co-occurrence matrix
Jaganadh G Elements of Text Mining
Word co-occurrence
Finding co-occurrence matrix with Python
def cooccurrence_matrix_corpus(corpus)
matrix = defaultdict(lambda defaultdict(int))
for corpora in corpus
for i in xrange(len(corpora)-1)
for j in xrange(i+1 len(corpora))
word1 word2 = [corpora[i]corpora[j]]
matrix[word1][word2] += 1
matrix[word2][word1] += 1
return matrix
corpus = [[rsquow1rsquorsquow2rsquorsquow3rsquo][rsquow4rsquorsquow5rsquorsquow6rsquo]]
ccm = cooccurrence_matrix_corpus(corpus)
Jaganadh G Elements of Text Mining
Word co-occurrence
hong
toy zdenek
czech
movie
julie
like
chan
story
tango
one
sverak
martial
woody
dating
film
first
Word co-occurrence visualization from 100 positive movie reviews The plot show top first word and top four
associated words For each associated words again associated words are plotted
Jaganadh G Elements of Text Mining
Stop Words
Stop Words
In computing stop words are words which are filtered out prior to or after processing ofnatural language data (text) In terms of linguistics these words are called as functionwords Words like rsquoarsquo rsquoanrsquo rsquothersquo are examples for stop words There is no defined set ofstop words available Different applications and research groups uses different sets o stopwordsGenerally stop words are omitted in text mining process The frequency of stop wordswill be very high in any corpus compared to content wordsPointers to some good stopword list is available at httpenwikipediaorgwikiStop_words
Jaganadh G Elements of Text Mining
Stop Words Filter
def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
    'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
    'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
    'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
    'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
    'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
    'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
    'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
    'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
    'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
    'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
    'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless
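A quick usage sketch, assuming the tokenize function defined earlier:

words = tokenize("This is a simple test of the stop word filter")
print(stop_filter(words))
# ['simple', 'test', 'stop', 'word', 'filter']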
Bag of Words
The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an un-ordered collection of words, disregarding grammar and even word order.^a

Analyzing text by only analyzing the frequency of words is called the bag-of-words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model
Bag of Words
Example
d1 John likes to watch movies Mary likes too
d2 John also likes to watch football games
{'football': 0, 'watch': 6, 'movies': 5, 'games': 1,
 'likes': 3, 'john': 2, 'mary': 4}
(vocabulary index, after removing stop words)

[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of the corresponding entry in the dictionary.^a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model
Bag of Words
Documents
d1 John likes to watch movies Mary likes too
d2 John also likes to watch football games
Vocabulary Index
V I(t) =
  0 if t is 'football'
  1 if t is 'games'
  2 if t is 'john'
  3 if t is 'likes'
  4 if t is 'mary'
  5 if t is 'movies'
  6 if t is 'watch'
Bag of Words
       football  games  john  likes  mary  movies  watch
doc1       0       0      1     2     1      1       1
doc2       1       1      1     1     0      0       1
Bag of Words
Creating Bag of Words with Python and sklearn 5
from sklearn.feature_extraction.text import CountVectorizer

# min_n comes from older scikit-learn releases; newer ones use ngram_range
vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
    'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
print(vectorizer.vocabulary_)
print(bow.toarray())

5 http://scikit-learn.org
Bag of Words
Creating Bag of Words with plain Python (just a toy sample)
def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
        in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
        for tokens in token_list]
    print(vocab_idex)
    for bag in bow:
        print(bag)

d = ("John likes to watch movies Mary likes too",
    "John also likes to watch football games")
bag_of_words(d)
TF-IDF
Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.
TF-IDF

tf-idf(t) = tf(t, d) × idf(t)

where 't' is a term in document 'd', and
tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = Σ_{x ∈ d} fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise
TF-IDF
TF-IDF

and

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the number of documents where 't' is present and tf(t, d) ≠ 0, and
|D| is the cardinality of the document space.
TF
TF

tf(t, d) = Σ_{x ∈ d} fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1
Document Vector
To create a document vector space:

v⃗_dn = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))

To represent 'd1' and 'd2' as vectors:

v⃗_d1 = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))
v⃗_d2 = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))

which evaluates to:

v⃗_d1 = (0, 0, 1, 2, 1, 1, 1)
v⃗_d2 = (1, 1, 1, 1, 0, 0, 1)
Vector Space Matrix
The document vectors can be represented as a matrix M_{|D|×F}, where |D| is the cardinality of the document space:

M_{|D|×F} = [ 0 0 1 2 1 1 1 ]
            [ 1 1 1 1 0 0 1 ]
Vector Normalization
Normalized Vector
A normalized vector is represented as v̂ = v⃗ / ‖v⃗‖_p, where v̂ is the unit vector, or the normalized vector; v⃗ is the vector going to be normalized; and ‖v⃗‖_p is the norm (magnitude or length) of the vector v⃗ in the L^p space (Lebesgue spaces).^a

a http://en.wikipedia.org/wiki/Lp_space
Vector Normalization
The length of a vector is calculated using the Euclidean norm.^6

Non-normalized vector: v⃗ = (v1, v2, v3, ..., vn)
Length of vector: ‖v⃗‖ = √(v1² + v2² + v3² + ... + vn²)

With norm p: ‖v⃗‖_p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)

It can be simplified as:

‖v⃗‖_p = ( Σ_{i=1}^{n} |v_i|^p )^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html
Vector Normalization
L2 Norm
The norm which we apply here is the L2 norm, which is also called the Euclidean norm.^a
It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm
Vector Normalization
v⃗_d1 = (0, 0, 1, 2, 1, 1, 1)

v̂_d1 = v⃗_d1 / ‖v⃗_d1‖_2
v̂_d1 = (0, 0, 1, 2, 1, 1, 1) / √(0² + 0² + 1² + 2² + 1² + 1² + 1²)
v̂_d1 = (0, 0, 1, 2, 1, 1, 1) / √8
v̂_d1 = (0/√8, 0/√8, 1/√8, 2/√8, 1/√8, 1/√8, 1/√8)
v̂_d1 = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Now our normalized vector v̂_d1 has a L2-norm ‖v̂_d1‖_2 = 1.0.
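The same arithmetic can be checked with NumPy; a minimal sketch (numpy assumed to be installed):

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_hat = v_d1 / np.linalg.norm(v_d1, ord=2)   # divide by the L2 norm
print(v_hat)                  # [0. 0. 0.3536 0.7071 0.3536 0.3536 0.3536]
print(np.linalg.norm(v_hat))  # 1.0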
IDF
IDF
idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the number of documents where 't' is present and tf(t, d) ≠ 0.
Finding IDF
idf(t_i) = log( |D| / (1 + |{d : t_i ∈ d}|) ) = log(2/1) = 0.69314718 (for a term present in no document)

idf(football) = log(2 / (1+1)) = 0.0
idf(games)    = log(2 / (1+1)) = 0.0
idf(john)     = log(2 / (1+2)) = −0.40546510810816444
idf(likes)    = log(2 / (1+2)) = −0.40546510810816444
idf(mary)     = log(2 / (1+1)) = 0.0
idf(movies)   = log(2 / (1+1)) = 0.0
idf(watch)    = log(2 / (1+1)) = 0.0

idf(V) = (0.0, 0.0, −0.40546510810816444, −0.40546510810816444, 0.0, 0.0, 0.0)
TF-IDF weight
Finding the TF-IDF weight:

M_{|D|×F} × M_idf

[ tf(t1, d1) tf(t2, d1) tf(t3, d1) tf(t4, d1) tf(t5, d1) tf(t6, d1) tf(t7, d1) ]
[ tf(t1, d2) tf(t2, d2) tf(t3, d2) tf(t4, d2) tf(t5, d2) tf(t6, d2) tf(t7, d2) ]

×

[ idf(t1)  0        0        0        0        0        0       ]
[ 0        idf(t2)  0        0        0        0        0       ]
[ 0        0        idf(t3)  0        0        0        0       ]
[ 0        0        0        idf(t4)  0        0        0       ]
[ 0        0        0        0        idf(t5)  0        0       ]
[ 0        0        0        0        0        idf(t6)  0       ]
[ 0        0        0        0        0        0        idf(t7) ]
TF-IDF weight
= [ tf(t1, d1)×idf(t1)  tf(t2, d1)×idf(t2)  tf(t3, d1)×idf(t3)  tf(t4, d1)×idf(t4)  tf(t5, d1)×idf(t5)  tf(t6, d1)×idf(t6)  tf(t7, d1)×idf(t7) ]
  [ tf(t1, d2)×idf(t1)  tf(t2, d2)×idf(t2)  tf(t3, d2)×idf(t3)  tf(t4, d2)×idf(t4)  tf(t5, d2)×idf(t5)  tf(t6, d2)×idf(t6)  tf(t7, d2)×idf(t7) ]
TF-IDF Normalization
L2 Normalization
M_tf-idf = M_tf-idf / ‖M_tf-idf‖_2
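Putting the pieces together, a minimal pure-Python sketch of the same computation for the toy documents (tokenize is the function from earlier; the idf smoothing follows the formula above):

import math

def tf(term, doc):
    # raw frequency of 'term' in the tokenized document 'doc'
    return doc.count(term)

def idf(term, docs):
    # idf(t) = log(|D| / (1 + |{d : t in d}|))
    matches = len([d for d in docs if term in d])
    return math.log(len(docs) / (1.0 + matches))

docs = [tokenize("John likes to watch movies Mary likes too"),
        tokenize("John also likes to watch football games")]
vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']
tfidf_matrix = [[tf(t, d) * idf(t, docs) for t in vocab] for d in docs]
print(tfidf_matrix)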
TF-IDF
Practice with Python and sklearn 7
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
    'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print("IDF: %s" % tfidf.idf_)
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print(tf_idf_matrix.todense())
for w, f in vectorizer.vocabulary_.items():   # word -> feature index
    print('%r => %r' % (w, tfd.idf_[f]))
7 http://scikit-learn.org
N-Grams
N-Gram
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application.^a

Unigrams are single words.
Bigrams are sequences of two words.
Trigrams are sequences of three words.

a http://en.wikipedia.org/wiki/N-gram
Bigrams
Bigrams

P(w_i | w_1, w_2, ..., w_{i−1}) ≈ P(w_i | w_{i−1})

Practice with Python:

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words)-2+1)]
bigrams = [' '.join(bigram) for bigram in ibigrams]
print(bigrams)

['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']
Trigrams
Trigrams

P(w_i | w_1, w_2, ..., w_{i−1}) ≈ P(w_i | w_{i−2}, w_{i−1})

Practice with Python:

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [' '.join(trigram) for trigram in itrigrams]
print(trigrams)

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']
N-Grams
Python code to generate N-grams from a list of words:

def ngrams(words, n=2):
    grams = [' '.join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print(bigrams)
print(trigrams)
Mutual Information
Mutual Information
Statistical test to measure the strength of word association:

I(w_i, w_j) = log2( P(w_i, w_j) / (P(w_i) P(w_j)) ) ≈ log2( N C(w_i, w_j) / (C(w_i) C(w_j)) )

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus,
C(w_i, w_j) is the frequency of the bigram w_i w_j,
and N is the total number of words in the corpus.

Example:
I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )
Mutual Information
from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) \
            / math.log(2)
    return minfo
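A quick usage sketch (the toy sentence is made up; repeated words give the counts something to measure):

words = "strong tea is strong and strong tea is good".split()
mi = mutual_info(words)
print(mi['strong tea'])  # log2(9*2 / (3*2)) = log2(3) ~ 1.58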
t-score
t-score
Statistical test to measure the strength of word association:

t(w_i, w_j) = ( mean(P(w_i, w_j)) − mean(P(w_i)) mean(P(w_j)) ) / √( σ²(P(w_i, w_j)) + σ²(P(w_i)) σ²(P(w_j)) )
            ≈ ( C(w_i, w_j) − (1/N) C(w_i) C(w_j) ) / √( C(w_i, w_j) )

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus,
C(w_i, w_j) is the frequency of the bigram w_i w_j,
and N is the total number of words in the corpus.

Example:
t(strong, tea) = ( C(strong, tea) − (1/N) C(strong) C(tea) ) / √( C(strong, tea) )
t-score
from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc
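Usage mirrors mutual_info (same made-up toy sentence):

words = "strong tea is strong and strong tea is good".split()
ts = tscore(words)
print(ts['strong tea'])  # (2 - (1/9)*3*2) / sqrt(2) ~ 0.94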
Document Classification
Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.^8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification is a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification
Document Classification
Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.

Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier, or a classifier function, γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

The main approaches in document classification are:

Naïve Bayes (NB)
Support Vector Machines (SVM)
Document Classification
A supervised document classification pipeline^9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php
Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem, or Bayes' rule:

P(H|E) = P(E|H) × P(H) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.
Naïve Bayes Classification

Let H be the event of raining and E be the evidence of a dark cloud; then we have:

P(raining | dark cloud) = P(dark cloud | raining) × P(raining) / P(dark cloud)

For multiple evidences:

P(H | E1, E2, ..., En) = P(E1, E2, ..., En | H) × P(H) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as follows:

P(H | E1, E2, ..., En) = P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) / P(E1, E2, ..., En)
Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.

C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)

The pre-processed documents for training will look like:

Training Docs  kill  bomb  kidnap  music  movie  tv  C
D0             2     1     3       0      0      1   Terrorism
D1             1     1     1       0      0      0   Terrorism
D2             1     1     2       0      1      0   Terrorism
D3             0     1     0       2      1      1   Entertainment
D4             0     0     1       1      1      0   Entertainment
D5             0     0     0       2      2      2   Entertainment
Building the Naïve Bayes Model

The Naïve Bayes model for the training set will look like:

|V| = 6

C  P(Ci)  ni  P(kill|Ci)  P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
T  0.5    15  0.23809524  0.19047619  0.33333333    0.04761905   0.09523810   0.09523810
E  0.5    12  0.05555556  0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = number of documents in a class / number of all documents
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class

Example:
P(kill|Terrorism) = (2+1+1) / 15 = 4/15
P(kill|Entertainment) = (0+0+0) / 12 = 0/12
Building the Naïve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2+1+1+1) / (15+|V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0+0+0+1) / (12+|V|) = 1/18 = 0.0555
Testing the NB model
Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt         2     1     2       0      0      1   ?

To find the posterior probability:

P(ci|W) = P(ci) × ∏_{j=1}^{|V|} P(wj|ci)
Testing the NB model
P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)² × P(bomb|Terrorism)¹ × P(kidnap|Terrorism)² × P(music|Terrorism)⁰ × P(movie|Terrorism)⁰ × P(tv|Terrorism)¹
= 0.5 × 0.2380² × 0.1904¹ × 0.3333² × 0.0476⁰ × 0.0952⁰ × 0.0952¹
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10⁻⁵

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)² × P(bomb|Entertainment)¹ × P(kidnap|Entertainment)² × P(music|Entertainment)⁰ × P(movie|Entertainment)⁰ × P(tv|Entertainment)¹
= 0.5 × 0.0555² × 0.1111¹ × 0.1111² × 0.3333⁰ × 0.2777⁰ × 0.1111¹
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10⁻⁷

The document is classified as "Terrorism" because that class received the highest value.
Preventing Underflow
The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities:

P(Terrorism|W) = log(0.5 × 0.2380² × 0.1904¹ × 0.3333² × 0.0476⁰ × 0.0952⁰ × 0.0952¹)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= −0.3010 − 1.2468 − 0.7203 − 0.9543 + 0 + 0 − 1.0213
= −4.2437

P(Entertainment|W) = log(0.5 × 0.0555² × 0.1111¹ × 0.1111² × 0.3333⁰ × 0.2777⁰ × 0.1111¹)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= −0.3010 − 2.511 − 0.9542 − 1.9085 + 0 + 0 − 0.9542
= −6.6289

After handling the underflow problem our system again classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak:
suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
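A small sketch of the same computation in Python, in log space; the smoothed probabilities and the test vector are hard-coded from the toy tables above (a didactic sketch, not a general classifier):

import math

vocab = ['kill', 'bomb', 'kidnap', 'music', 'movie', 'tv']
# Laplace-smoothed P(w|c) values from the model table above
p_w_given_c = {
    'Terrorism':     [0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952],
    'Entertainment': [0.0555, 0.1111, 0.1111, 0.3333, 0.2777, 0.1111],
}
priors = {'Terrorism': 0.5, 'Entertainment': 0.5}
test_doc = [2, 1, 2, 0, 0, 1]  # word frequencies of Dt

def log_posterior(cls):
    # log10 P(c) + sum_j freq_j * log10 P(w_j | c), as on the slide
    score = math.log10(priors[cls])
    for freq, p in zip(test_doc, p_w_given_c[cls]):
        score += freq * math.log10(p)
    return score

for cls in priors:
    print("%s %.4f" % (cls, log_posterior(cls)))
# Terrorism ~ -4.24, Entertainment ~ -6.63: classified as Terrorism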
Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes classifier:

Multi-variate Bernoulli Model
Multinomial Model
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word w_t from the vocabulary. Dimension t of the vector for document d_i is written B_it, and is either 0 or 1, indicating whether word w_t occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in a document.^a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will look like:
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequencies have been replaced by the presence or absence of the word. A runnable sketch follows.
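For this event model, scikit-learn provides BernoulliNB; a minimal sketch on the toy binary matrix above (class labels shortened to 'T' and 'E'):

from sklearn.naive_bayes import BernoulliNB

X = [[1, 1, 1, 0, 0, 1],   # D0 Terrorism
     [1, 1, 1, 0, 0, 0],   # D1 Terrorism
     [1, 1, 1, 0, 1, 0],   # D2 Terrorism
     [0, 1, 0, 1, 1, 1],   # D3 Entertainment
     [0, 0, 1, 1, 1, 0],   # D4 Entertainment
     [0, 0, 0, 1, 1, 1]]   # D5 Entertainment
y = ['T', 'T', 'T', 'E', 'E', 'E']
clf = BernoulliNB().fit(X, y)
print(clf.predict([[1, 1, 1, 0, 0, 1]]))  # a Dt-like document -> ['T']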
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag-of-words representation. We can also use a smoothed bag-of-words model, like TF-IDF, in the multinomial naive Bayes model.
Support Vector Machine
A Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(x_i, y_i)}_{i=1}^{n}, where x_i ∈ R^p (input feature vector) and y_i ∈ {1, −1} is the corresponding label, whether (y_i = +1) or (y_i = −1). To start with, we assume that our input feature vectors are linearly separable; that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, x_i⟩ + b > 0 for y_i = +1
⟨w, x_i⟩ + b < 0 for y_i = −1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate the positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum-margin hyperplane are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
Support Vector Machine
Suppose a bunch of red squares and blue rectangles are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that they are linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the scatter regions of the squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles) at an equal distance from the separating hyperplane is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.^10 If the data is not linearly separable we have to use kernel tricks.^11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.
11 http://www.statsoft.com/textbook/support-vector-machines
Support Vector Machine

[Figure: the separating hyperplane, maximum margin and support vectors]
Practice Time
Let's try to build a Multinomial Naïve Bayes classifier with Python and sklearn:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print(training_data.target_names[classifier.predict(['This is a good one'])[0]])
Practice Time
Let's try to build an SVM classifier with Python and sklearn:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print(training_data.target_names[classifier.predict(['This is a good one'])[0]])
Practice Time
Let's try to build a Multi-variate Naïve Bayes classifier with Python and NLTK:^12
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool I like it"
words = word_feats(sent.lower().split())
print(classifier.classify(words))
12 This code is adopted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/. Follow the link for a more detailed discussion.
Evaluating Performance of a Classifier
Confusion Matrix
A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                     Actual
                     Positive             Negative
Predicted  Positive  True Positive (TP)   False Positive (FP)
           Negative  False Negative (FN)  True Negative (TN)
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = (TP + TN) / (TP + FP + FN + TN)

                     Actual
                     Positive  Negative  Total
Predicted  Positive  562       77        639
           Negative  225       436       661
Total                787       513       1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
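The same numbers are easy to reproduce in code; a small sketch with the four cells of the running example hard-coded (the following slides reuse the same counts):

from __future__ import division

TP, FP, FN, TN = 562, 77, 225, 436

accuracy  = (TP + TN) / (TP + FP + FN + TN)                 # 0.7677
precision = TP / (TP + FP)                                  # 0.8795
recall    = TP / (TP + FN)                                  # 0.7141
f1        = 2 * precision * recall / (precision + recall)   # 0.7883
print("%.4f %.4f %.4f %.4f" % (accuracy, precision, recall, f1))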
Evaluating Performance of a Classifier
Precision and Recall
Precision indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)
Evaluating Performance of a Classifier
                     Actual
                     Positive  Negative  Total
Predicted  Positive  562       77        639
           Negative  225       436       661
Total                787       513       1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong:

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                     Actual
                     Positive  Negative  Total
Predicted  Positive  562       77        639
           Negative  225       436       661
Total                787       513       1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23
Evaluating Performance of a Classifier
Fall-out
Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR):

Fall-out = FP / (FP + TN)

                     Actual
                     Positive  Negative  Total
Predicted  Positive  562       77        639
           Negative  225       436       661
Total                787       513       1300

Fall-out = 77 / (77 + 436) = 0.15
Evaluating Performance of a Classifier
F1 Score
In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.^a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score
Evaluating Performance of a Classifier
F1 Score (Positive) = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78
F1 Score (Negative) = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value, or precision rate, is the proportion of positive test results that are true positives:

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure:

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons:

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications:

MCC = (TP × TN − FP × FN) / √( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

MCC = (562 × 436 − 77 × 225) / √( (562+77)(562+225)(436+77)(436+225) ) = 0.55
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis, and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
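scikit-learn can compute the ROC curve and the AUC from true labels and classifier scores; a minimal sketch with made-up scores:

from sklearn.metrics import roc_curve, auc

y_true   = [1, 1, 0, 1, 0, 0, 1, 0]                    # actual binary labels
y_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]   # predicted scores
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print(auc(fpr, tpr))  # area under the ROC curve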
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER), also known as entity identification and entity extraction, is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child " \
        "of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print(entities)
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the " \
    "second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print(terms)

13 http://pypi.python.org/pypi/topia.termextract
References
Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.
Words Types and Ratio
Words and Types
Words are the valid tokens from a given corpus C. Types are the total distinct tokens in a given corpus C.
Example:
C = "I shot an elephant in my pajamas He saw the fine fat trout in the brook"
There are 16 words and 14 types in C:
words = ('i', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas', 'he', 'saw', 'the', 'fine', 'fat', 'trout', 'in', 'the', 'brook')
types = ('shot', 'i', 'saw', 'elephant', 'brook', 'fine', 'an', 'fat', 'in', 'my', 'the', 'he', 'pajamas', 'trout')
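A quick check in Python, using the tokenize function from earlier:

text = ("I shot an elephant in my pajamas "
    "He saw the fine fat trout in the brook")
words = tokenize(text)
print("%d words, %d types" % (len(words), len(set(words))))  # 16 words, 14 types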
Words Types and Ratio
Word Type Ratio

WTR(C) = WC(C) / C(T), where WC(C) is the total number of words in corpus C and C(T) is the count of types in corpus C.

Finding the Word Type Ratio:

def word_type_ratio(text):
    words = tokenize(text)
    ratio = len(words) / float(len(set(words)))
    return ratio

text = "I shot an elephant in my pajamas He saw the fine fat trout in the brook"
ratio = word_type_ratio(text)
print(ratio)
Finding top N words
Python code to find the top N words from a text:

from operator import itemgetter

def top_words(text, n=50):
    wordfreq = word_count(text)
    topwords = sorted(wordfreq.iteritems(), key=itemgetter(1),
        reverse=True)[:n]
    return topwords

text = open('gpl-2.0.txt', 'r').read()
topwords = top_words(text, n=50)
for word, count in topwords:
    print('%s \t %d' % (word, count))
Plotting top N words
Python code to plot the top 20 words from a text:

import matplotlib.pyplot as plt

def plot_freq(text):
    tfw = top_words(text, n=20)
    x = range(len(tfw))
    y = []
    for item in range(len(tfw)):
        y = y + [tfw[item][1]]
    plt.plot(x, y, 'bo', ls='dotted')
    plt.xticks(range(0, len(tfw) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    plt.xlabel("Word Ranking")
    plt.ylabel("Word Frequency")
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq(text)
Top N Words
Plot of top 50 words from GPL v2 (without filtering stop words)
Plotting top N words
Python code to plot the top 50 words from a text. This plot will show the words in the plot.

import matplotlib.pyplot as plt

def plot_freq_tag(text):
    tfw = top_words(text, n=50)
    words = [tfw[i][0] for i in range(len(tfw))]
    x = range(len(tfw))
    y = []
    for item in range(len(tfw)):
        y = y + [tfw[item][1]]
    fig = plt.figure()
    ax = fig.add_subplot(111, xlabel="Word Rank", ylabel="Word Frequency")
    ax.set_title('Top 50 words')
    ax.plot(x, y, 'go-', ls='dotted')
    plt.xticks(range(0, len(words) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    for i, label in enumerate(words):
        plt.text(x[i], y[i], label, rotation=45)
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq_tag(text)
Top N Words
Plot of top 50 words from GPL v2 (without filtering stop words)
Plotting histogram of top N words
Python code to plot a histogram of the top 50 words:

import numpy as np
import matplotlib.pyplot as plt

def plot_hist(text):
    tw = top_words(text)
    words = [tw[i][0] for i in range(len(tw))]
    freq = [tw[j][1] for j in range(len(tw))]
    pos = np.arange(len(words))
    width = 1.0
    ax = plt.axes(frameon=True)
    ax.set_xticks(pos)
    ax.set_yticks(range(0, max(freq), 10))
    ax.set_xticklabels(words, rotation='vertical', fontsize=9)
    plt.bar(pos, freq, width, color='b')
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_hist(text)
Histogram
Histogram of top 50 words from GPL v2 (without filtering stop words)
Lexical Dispersion Plot
A lexical dispersion plot shows the position of a word in a given text. I have created some Python code to generate a lexical dispersion plot with reference to a given word list:^3

import matplotlib.pyplot as plt

def dispersion_plot(text, words):
    wordst = tokenize(text)
    points = [(x, y) for x in range(len(wordst))
        for y in range(len(words)) if wordst[x] == words[y]]
    if points:
        x, y = zip(*points)
    else:
        x = y = ()
    plt.plot(x, y, "go", scalex=2)
    plt.yticks(range(len(words)), words, color="b")
    plt.ylim(-1, len(words))
    plt.title("Lexical Dispersion Plot")
    plt.xlabel("Word Offset")
    plt.show()

gpl = open('gpl-2.0.txt', 'r').read()
dispersion_plot(gpl, ['software', 'license', 'copy', 'gnu', 'program', 'free'])

3 Code taken from http://nltk.googlecode.com/svn/trunk/doc/api/nltk.draw.dispersion-pysrc.html and modified.
Lexical Dispersion Plot
Lexical dispersion plot from the GPL text
Tag Cloud
A tag cloud (word cloud, or weighted list in visual design) is a visual representation of text data, typically used to depict keyword meta-data (tags) on websites, or to visualize free-form text. There is an interesting Python tool to generate tag clouds from text, called PyTagCloud.^4 Here comes an example of creating a tag cloud from the first 100 words of the GPL text:

from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts

def create_tag_cloud(text):
    words = tokenize(text)
    doc = " ".join(d for d in words[:100])
    tags = make_tags(get_tag_counts(doc), maxsize=80)
    create_tag_image(tags, 'gpl.png', size=(900, 600),
        fontname='Philosopher')

gpl = open('gpl-2.0.txt', 'r').read()
create_tag_cloud(gpl)

4 https://github.com/atizo/PyTagCloud
Tag Cloud
Tag cloud from GPL text
Word co-occurrence
Word co-occurrence analysis is used to construct lexicon ontologies etc In general itaims to find similarities between word pairs
A word co-occurrence matrix is a square of N timesN matrix where N corresponds totalnumber of unique words in a corpus A cell mij contains the number of times mi
co-occur with word mj with in a specific contextmdash a natural unit such as a sentence or acertain window of m words Note that the upper and lower triangles of the matrix areidentical since co-occurrence is a symmetric relation a
aJimmy LinScalable Language Processing Algorithms for the Masses A Case Study inComputing Word Co-occurrence Matrices with MapReducewwwaclweborganthologyD08-1044
Jaganadh G Elements of Text Mining
Word co-occourance
w1 w2 w3 wnw1 m11 m12 m13 m1n
w2 m21 m22 m23 m2n
w3 m31 m32 m33 m3n
wn mn1 mn2 mn3 mnnShape of a co-occurrence matrix
Jaganadh G Elements of Text Mining
Word co-occurrence
Finding co-occurrence matrix with Python
def cooccurrence_matrix_corpus(corpus)
matrix = defaultdict(lambda defaultdict(int))
for corpora in corpus
for i in xrange(len(corpora)-1)
for j in xrange(i+1 len(corpora))
word1 word2 = [corpora[i]corpora[j]]
matrix[word1][word2] += 1
matrix[word2][word1] += 1
return matrix
corpus = [[rsquow1rsquorsquow2rsquorsquow3rsquo][rsquow4rsquorsquow5rsquorsquow6rsquo]]
ccm = cooccurrence_matrix_corpus(corpus)
Jaganadh G Elements of Text Mining
Word co-occurrence
hong
toy zdenek
czech
movie
julie
like
chan
story
tango
one
sverak
martial
woody
dating
film
first
Word co-occurrence visualization from 100 positive movie reviews The plot show top first word and top four
associated words For each associated words again associated words are plotted
Jaganadh G Elements of Text Mining
Stop Words
Stop Words
In computing stop words are words which are filtered out prior to or after processing ofnatural language data (text) In terms of linguistics these words are called as functionwords Words like rsquoarsquo rsquoanrsquo rsquothersquo are examples for stop words There is no defined set ofstop words available Different applications and research groups uses different sets o stopwordsGenerally stop words are omitted in text mining process The frequency of stop wordswill be very high in any corpus compared to content wordsPointers to some good stopword list is available at httpenwikipediaorgwikiStop_words
Jaganadh G Elements of Text Mining
Stop Words Filter
def stop_filter(words)
stops = [rsquoirsquo rsquomersquo rsquomyrsquo rsquomyselfrsquo rsquowersquo rsquoourrsquo rsquooursrsquo rsquoourselvesrsquo rsquoyoursquo
rsquoyourrsquo rsquoyoursrsquo rsquoyourselfrsquo rsquoyourselvesrsquo rsquohersquo rsquohimrsquo rsquohisrsquo rsquohimselfrsquo rsquoshersquo
rsquoherrsquo rsquohersrsquo rsquoherselfrsquo rsquoitrsquo rsquoitsrsquo rsquoitselfrsquorsquotheyrsquo rsquothemrsquo rsquotheirrsquo rsquotheirsrsquo
rsquothemselvesrsquo rsquowhatrsquo rsquowhichrsquo rsquowhorsquo rsquowhomrsquo rsquothisrsquo rsquothatrsquo rsquothesersquo rsquothosersquo
rsquoamrsquo rsquoisrsquo rsquoarersquo rsquowasrsquo rsquowerersquorsquobersquo rsquobeenrsquo rsquobeingrsquo rsquohaversquo rsquohasrsquo rsquohadrsquo
rsquohavingrsquo rsquodorsquo rsquodoesrsquo rsquodidrsquo rsquodoingrsquo rsquoarsquo rsquoanrsquo rsquothersquo rsquoandrsquo rsquobutrsquo rsquoifrsquo rsquoorrsquo rsquobecausersquo
rsquoasrsquo rsquountilrsquo rsquowhilersquo rsquoofrsquo rsquoatrsquo rsquobyrsquo rsquoforrsquorsquowithrsquo rsquoaboutrsquo rsquoagainstrsquo rsquobetweenrsquo rsquointorsquo
rsquothroughrsquo rsquoduringrsquo rsquobeforersquo rsquoafterrsquo rsquoaboversquo rsquobelowrsquo rsquotorsquo rsquofromrsquo rsquouprsquo rsquodownrsquo rsquoinrsquo rsquooutrsquo
rsquoonrsquo rsquooffrsquo rsquooverrsquo rsquounderrsquo rsquoagainrsquo rsquofurtherrsquorsquothenrsquo rsquooncersquo rsquoherersquo rsquotherersquo rsquowhenrsquo rsquowherersquo
rsquowhyrsquo rsquohowrsquo rsquoallrsquo rsquoanyrsquo rsquobothrsquo rsquoeachrsquo rsquofewrsquo rsquomorersquo rsquomostrsquo rsquootherrsquo rsquosomersquo rsquosuchrsquo rsquonorsquo
rsquonorrsquorsquonotrsquo rsquoonlyrsquo rsquoownrsquo rsquosamersquo rsquosorsquo rsquothanrsquo rsquotoorsquo rsquoveryrsquo rsquosrsquo rsquotrsquo rsquocanrsquo rsquowillrsquo rsquojustrsquo
rsquodonrsquo rsquoshouldrsquo rsquonowrsquo]
stopless = [word in words if word not in stops]
return stopless
Jaganadh G Elements of Text Mining
Bag of Words
The bag-of-words model is a simplifying representation used in natural languageprocessing and information retrieval (IR) In this model a text (such as a sentence or adocument) is represented as an un-ordered collection of words disregarding grammarand even word order a
Analyzing text by only analyzing frequency of words is called as bag of words model
ahttpenwikipediaorgwikiBag of words model
Jaganadh G Elements of Text Mining
Bag of Words
Example
d1 John likes to watch movies Mary likes too
d2 John also likes to watch football games
rsquofootballrsquo 0 rsquowatchrsquo 6 rsquomoviesrsquo 5 rsquogamesrsquo 1
rsquolikesrsquo 3 rsquojohnrsquo 2 rsquomaryrsquo 4
after removing stopwords
[0 0 1 2 1 1 1]
[1 1 1 1 0 0 1]
each entry of the vectors refers to count of
the corresponding entry in the dictionary
a
aExample taken from httpenwikipediaorgwikiBag of words model
Jaganadh G Elements of Text Mining
Bag of Words
Documents
d1 John likes to watch movies Mary likes too
d2 John also likes to watch football games
Vocabulary Index
V I(t) =
0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo
Jaganadh G Elements of Text Mining
Bag of Words
Documents
d1 John likes to watch movies Mary likes too
d2 John also likes to watch football games
Vocabulary Index
V I(t) =
0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo
Jaganadh G Elements of Text Mining
Bag of Words
football games john likes mary movies watch
doc1 0 0 1 2 1 1 1
doc2 1 1 1 1 0 0 1
Jaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python and sklearn 5
from sklearnfeature_extractiontext import
CountVectorizer
vectorizer = CountVectorizer(analyzer=rsquowordrsquo
min_n=1stop_words=rsquoenglishrsquo)
docs = (rsquoJohn likes to watch movies Mary
likes toorsquorsquoJohn also likes to watch football gamesrsquo)
bow = vectorizerfit_transform(docs)
print vectorizervocabulary_
print bowtoarray()
5httpscikit-learnorgJaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python Just for sample -(
def bag_of_words(docs)
stops = [rsquotorsquorsquotoorsquorsquoalsorsquo]
token_list = [tokenize(doc) for doc in docs]
vocab = list(set(token_list[0])union(token_list))
vocab =[v for v in vocab if v not in stops and len(v) gt 1]
vocab_idex = dict( [ ( word vocabindex(word) ) for word
in vocab] )
bow = [[tokenscount(word) for word in vocab_idexkeys()]
for tokens in token_list]
print vocab_idex
for bag in bow
print bag
d = (John likes to watch movies Mary likes too
John also likes to watch football games)
bag_of_words(d)
Jaganadh G Elements of Text Mining
TF-IDF
Tfndashidf term frequencyndashinverse document frequency is a numerical statistic which reflectshow important a word is to a document in a collection or corpus
TF-IDF
tf minus idf(t) = tf(t d)times idf(t)
where rsquotrsquo is a term in document rsquodrsquotf(t d) how many times the term rsquotrsquo is present in rsquodrsquo
tf(t d) =sumxisind
fr(x t)
where
fr(x t) =
1 if x = t0 otherwise
Jaganadh G Elements of Text Mining
TF-IDF
TF-IDF
andidf(t) = log |D|
1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space
Jaganadh G Elements of Text Mining
TF
TF
tf(t d) =sumxisind
fr(x t)
fr(x t) is a simple function
fr(x t) =
1 if x = t0 otherwise
Exampletf(primejohnprime d1) = 1
Jaganadh G Elements of Text Mining
Document Vector
To create a document vector space
V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))
To represent rsquod1rsquo and rsquod2rsquo as vectors
V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))
V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
which evaluates toV ~d1 = (0 0 1 2 1 1 1)
V ~d2 = (1 1 1 1 0 0 1)
Jaganadh G Elements of Text Mining
Vector Space Matrix
The document vectors can be represented as matrix
M|D|xF
where |D| is the cardinality of the document space
M|D|xF =
[0 0 1 2 1 1 11 1 1 1 0 0 1
]
Jaganadh G Elements of Text Mining
Vector Normalization
Normalized Vector
A normalized vector is represented as v = ~v~vp where v is is the unit vector or the
normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a
ahttpenwikipediaorgwikiLp space
Jaganadh G Elements of Text Mining
Vector Normalization
Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =
radicv2
1 + v22 + v2
2 + + v2n
With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p
It can be simplified as
~v||p = (
nsumi=1
|~vi|p)1p
6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining
Vector Normalization
L2 Norm
The norm which we apply here is L2 Norm which is also called as Euclidean norm a
It is a common norm used to measure the length of a vector where p = 2
ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm
Jaganadh G Elements of Text Mining
Vector Normalization
v~d1 = (0 0 1 2 1 1 1)vd1 = ~v
~vp
vd1 = v~d1
v~d12vd1 = (0012111)radic
02+02+12+22+12+12+12
vd1 = (0012111)radic8
vd1 = ( 0radic8 0radic
8 1radic
8 2radic
8 1radic
8 1radic
8 1radic
8)
vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10
Jaganadh G Elements of Text Mining
IDF
IDF
idf(t) = log |D|1+|dtisind|
where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0
Jaganadh G Elements of Text Mining
Finding IDF
idf(ti) = log|D|
1 + |d ti isin d|= log
2
1= 069314718
idf(football) = log 21+1
= 00
idf(games) = log 21+1
= 00
idf(john) = log 21+2
= minus040546510810816444
idf(likes) = log 21+2
= minus040546510810816444
idf(mary) = log 21+1
= 00
idf(movies) = log 21+1
= 00
idf(watch) = log 21+1
= 00idf(V ) = (00-040546510810816444-040546510810816444000000)
Jaganadh G Elements of Text Mining
TF-IDF weight
Finding TF-IDF weight
M|D|timesF times Midf
[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
]x
idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)
Jaganadh G Elements of Text Mining
TF-IDF weight
Finding TF-IDF weight
M|D|timesF times Midf
[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
]x
idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)
Jaganadh G Elements of Text Mining
TF-IDF weight
tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)
Jaganadh G Elements of Text Mining
TF-IDF Normalization
L2 Normalization
Mtfminusidf =MtfminusidfMtfminusidf2
Jaganadh G Elements of Text Mining
TF-IDF
Practice with Python and sklearn 7
from sklearnfeature_extractiontext import
CountVectorizer TfidfTransformer
vectorizer = CountVectorizer(analyzer=rsquowordrsquo
min_n=1stop_words=rsquoenglishrsquo)
docs = (rsquoJohn likes to watch movies Mary
likes toorsquorsquoJohn also likes to watch football gamesrsquo)
bow = vectorizerfit_transform(docs)
freq_term_matrix =vectorizertransform(docs)
tfidf = TfidfTransformer(norm=l2)
tfd = tfidffit(freq_term_matrix)
print IDF tfidfidf_
tf_idf_matrix = tfidftransform(freq_term_matrix)
print tf_idf_matrixtodense()
for wf in zip(vectorizervocabulary_tfdidf_)
print rsquor =gt rrsquo (w tfdidf_[f])
7httpscikit-learnorgJaganadh G Elements of Text Mining
N-Grams
N-Gram
In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa
Unigrams are single words
Bigrams are sequences of two words
Trigrams are sequences of three words
ahttpenwikipediaorgwikiN-gram
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams

Trigrams

$P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1})$

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [' '.join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining
N-Grams

Python code to generate N-Grams from a list of words

def ngrams(words, n=2):
    grams = [' '.join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining
Mutual Information

Mutual Information

Statistical test to measure strength of word association

$I(w_i, w_j) = \log_2 \frac{P(w_i, w_j)}{P(w_i)\,P(w_j)} \approx \log_2 \frac{N\,C(w_i, w_j)}{C(w_i)\,C(w_j)}$

where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus, $C(w_i, w_j)$ is the frequency of the bigram $w_i\,w_j$, and $N$ is the total number of words in the corpus.

$I(strong, tea) = \log_2 \frac{P(strong, tea)}{P(strong)\,P(tea)} \approx \log_2 \frac{N\,C(strong, tea)}{C(strong)\,C(tea)}$

Jaganadh G Elements of Text Mining
Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) / math.log(2)
    return minfo

Jaganadh G Elements of Text Mining
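A quick usage sketch (assuming the fixed ngrams and mutual_info functions above are defined; the toy sentence is my own):

words = "strong tea and strong coffee keep the tea drinkers awake".lower().split()
mi = mutual_info(words)
# bigrams ranked by mutual information, strongest association first
for gram in sorted(mi, key=mi.get, reverse=True):
    print("%s\t%.4f" % (gram, mi[gram]))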
t-score

t-score

Statistical test to measure strength of word association

$t(w_i, w_j) = \frac{mean(P(w_i, w_j)) - mean(P(w_i))\,mean(P(w_j))}{\sqrt{\sigma^2(P(w_i, w_j)) + \sigma^2(P(w_i))\,\sigma^2(P(w_j))}} \approx \frac{C(w_i, w_j) - \frac{1}{N}\,C(w_i)\,C(w_j)}{\sqrt{C(w_i, w_j)}}$

where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus, $C(w_i, w_j)$ is the frequency of the bigram $w_i\,w_j$, and $N$ is the total number of words in the corpus.

$t(strong, tea) = \frac{C(strong, tea) - \frac{1}{N}\,C(strong)\,C(tea)}{\sqrt{C(strong, tea)}}$

Jaganadh G Elements of Text Mining
t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining
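The same kind of usage sketch works for tscore (again with a made-up toy corpus):

words = "strong tea and strong coffee keep the tea drinkers awake".lower().split()
ts = tscore(words)
# bigrams ranked by t-score, strongest association first
for gram in sorted(ts, key=ts.get, reverse=True):
    print("%s\t%.4f" % (gram, ts[gram]))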
Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification
Jaganadh G Elements of Text Mining
Document Classification

Formal Definition

Let $C = (c_1, c_2, c_3, \ldots, c_m)$ be a set of pre-defined categories.
Let $D = (d_1, d_2, d_3, \ldots, d_n)$ be a set of documents to be classified.
Given a training set $T$ of labeled documents $\langle d, c \rangle$, where $\langle d, c \rangle \in D \times C$, using a learning algorithm we wish to learn a classifier, or classifier function, $\gamma$ that maps documents to classes: $\gamma : D \rightarrow C$.
A supervised learning algorithm $\Gamma$ takes a training set $T$ and emits a learned classification function $\gamma$: $\Gamma(T) = \gamma$

$\gamma(c_i, d_j) = \begin{cases} 1 & \text{if } d_j \text{ belongs to } c_i \\ 0 & \text{otherwise} \end{cases}$

Main approaches in document classification are

Naïve Bayes (NB)

Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9
9 Image taken from http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem, or Bayes' rule.
Bayes' rule: $P(H \mid E) = \frac{P(E \mid H) \times P(H)}{P(E)}$

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidences (E) that can be observed.
$P(H)$ is called the prior probability. This is the probability of an event before the evidence is observed.
$P(H \mid E)$ is called the posterior probability. This is the probability of an event after the evidence is observed.
Jaganadh G Elements of Text Mining
Naïve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have
$P(raining \mid dark\ cloud) = \frac{P(dark\ cloud \mid raining) \times P(raining)}{P(dark\ cloud)}$

For multiple evidences
$P(H \mid E_1, E_2, \ldots, E_n) = \frac{P(E_1, E_2, \ldots, E_n \mid H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

With the independence assumption, we can rewrite Bayes' rule as follows
$P(H \mid E_1, E_2, \ldots, E_n) = \frac{P(E_1 \mid H) \times P(E_2 \mid H) \times \cdots \times P(E_n \mid H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$
Jaganadh G Elements of Text Mining
Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naïve Bayes Model

Naïve Bayes Model for the training set will be like

|V|  C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
6    T  0.5    15  0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
     E  0.5    12  0.05555556   0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabularies = 6
P(Ci) = the prior probability of each class = number of documents in a class / number of all documents
ni = the total word frequency of each class
nterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2 + 1 + 1)/15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0)/12 = 0/12
Jaganadh G Elements of Text Mining
Building Naïve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:
P(kill|Terrorism) = (2 + 1 + 1 + 1)/(15 + |V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0 + 0 + 0 + 1)/(12 + |V|) = 1/18 = 0.0555
Jaganadh G Elements of Text Mining
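The Laplace-smoothed estimate is easy to script; a minimal sketch of my own, using the counts and class totals exactly as given on the slides:

V = 6  # vocabulary size

def laplace(word_count_in_class, n_class, v=V):
    # P(w|c) = (count(w, c) + 1) / (n_c + |V|)
    return (word_count_in_class + 1.0) / (n_class + v)

print(laplace(4, 15))   # P(kill|Terrorism)     = 5/21 ~ 0.2380
print(laplace(0, 12))   # P(kill|Entertainment) = 1/18 ~ 0.0555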
Testing the NB model

Our test document is

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt         2     1     2       0      0      1   ?

To find the posterior probability

$P(c_i \mid W) = P(c_i) \times \prod_{j=1}^{V} P(w_j \mid c_i)$
Jaganadh G Elements of Text Mining
Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) × P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism)
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) × P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document is classified as "Terrorism" because that class got the highest value.
Jaganadh G Elements of Text Mining
Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called Underflow. To resolve the problem we can take the logarithm of the probabilities, like:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.5110 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
Jaganadh G Elements of Text Mining
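A small sketch of the log-space scoring, using base-10 logs and the smoothed conditionals from the model slide (the dictionaries below are just the slide's numbers re-typed):

import math

p_terrorism = {'kill': 0.2380, 'bomb': 0.1904, 'kidnap': 0.3333,
               'music': 0.0476, 'movie': 0.0952, 'tv': 0.0952}
p_entertainment = {'kill': 0.0555, 'bomb': 0.1111, 'kidnap': 0.1111,
                   'music': 0.3333, 'movie': 0.2777, 'tv': 0.1111}
test_doc = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}

def log_score(prior, cond, doc):
    # a sum of logs replaces the underflow-prone product of probabilities
    return math.log10(prior) + sum(n * math.log10(cond[w])
                                   for w, n in doc.items())

print(log_score(0.5, p_terrorism, test_doc))      # ~ -4.24
print(log_score(0.5, p_entertainment, test_doc))  # ~ -6.63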
Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word $w_t$ from the vocabulary. Dimension t of the vector for document $d_i$ is written $B_{it}$, and is either 0 or 1, indicating whether word $w_t$ occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
Jaganadh G Elements of Text Mining
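Deriving the Bernoulli representation from the original count table is a one-liner; a small sketch of my own:

counts = [[2, 1, 3, 0, 0, 1],   # D0 ... D5; columns: kill, bomb,
          [1, 1, 1, 0, 0, 0],   # kidnap, music, movie, tv
          [1, 1, 2, 0, 1, 0],
          [0, 1, 0, 2, 1, 1],
          [0, 0, 1, 1, 1, 0],
          [0, 0, 0, 2, 2, 2]]
# presence/absence instead of frequency
binary = [[1 if c > 0 else 0 for c in row] for row in counts]
for row in binary:
    print(row)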
Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model, a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In the model we used a simple bag of words representation. We can use a smoothed bag of words model like TF-IDF in the multinomial naive Bayes model.
Jaganadh G Elements of Text Mining
Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^p$ (input feature vector) and $y_i \in \{1, -1\}$ is the corresponding label, whether ($y_i = +1$) or ($y_i = -1$). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function $f(x) = \langle w, x \rangle + b$, where $w \in \mathbb{R}^p$ (weight vector) and $b \in \mathbb{R}$ (bias), such that

$\langle w, x_i \rangle + b > 0$ for $y_i = +1$
$\langle w, x_i \rangle + b < 0$ for $y_i = -1$

$\langle w, x \rangle + b = 0$ is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
Jaganadh G Elements of Text Mining
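The decision rule itself is tiny; a sketch with a made-up weight vector and bias (finding w and b is the actual learning problem, which libraries like sklearn solve for you):

import numpy as np

def predict(w, b, x):
    # the class is the sign of <w, x> + b
    return 1 if np.dot(w, x) + b > 0 else -1

w = np.array([1.0, -1.0])  # toy weight vector, for illustration only
b = 0.0                    # toy bias
print(predict(w, b, np.array([2.0, 1.0])))  # +1
print(predict(w, b, np.array([0.5, 2.0])))  # -1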
Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal length from the margin of the scatter region of squares and of rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.
11 http://www.statsoft.com/textbook/support-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories, shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time

Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories, shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/. Follow the link for a more detailed discussion.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like

                       Actual
                       Positive             Negative
Predicted  Positive    True Positive (TP)   False Positive (FP)
           Negative    False Negative (FN)  True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier

Accuracy of a Classifier

$Accuracy = \frac{TP + TN}{TP + FP + FN + TN}$

                       Actual
                       Positive  Negative  Total
Predicted  Positive    562       77        639
           Negative    225       436       661
Total                  787       513       1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

$Precision = \frac{TP}{TP + FP}$

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

$Recall = \frac{TP}{TP + FN}$
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier

                       Actual
                       Positive  Negative  Total
Predicted  Positive    562       77        639
           Negative    225       436       661
Total                  787       513       1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong.

$ErrorRate = \frac{FP + FN}{TP + FP + FN + TN}$

                       Actual
                       Positive  Negative  Total
Predicted  Positive    562       77        639
           Negative    225       436       661
Total                  787       513       1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier

Fall-out

It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

$Fall\text{-}out = \frac{FP}{FP + TN}$

                       Actual
                       Positive  Negative  Total
Predicted  Positive    562       77        639
           Negative    225       436       661
Total                  787       513       1300

Fall-out = 77 / (77 + 436) = 0.15
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and its worst score at 0.a

$F1\ Score = 2 \times \frac{precision \times recall}{precision + recall}$

a http://en.wikipedia.org/wiki/F1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78
F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

$Positive\ predictive\ value = \frac{TP}{TP + FP}$

Positive predictive value = 562 / (562 + 77) = 0.87
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

$NPV = \frac{TN}{TN + FN}$

NPV = 436 / (436 + 225) = 0.65
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier

Specificity or True Negative Rate

$Specificity = \frac{TN}{FP + TN}$

Specificity = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

$FDR = \frac{FP}{FP + TP}$

FDR = 77 / (77 + 562) = 0.12
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$

$MCC = \frac{562 \times 436 - 77 \times 225}{\sqrt{(562 + 77)(562 + 225)(436 + 77)(436 + 225)}} = 0.55$
Jaganadh G Elements of Text Mining
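All of the measures in this section can be reproduced from the four cells of the confusion matrix; a small sketch of my own with the numbers used above:

import math

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0

accuracy = (TP + TN) / (TP + FP + FN + TN)          # 0.76
precision = TP / (TP + FP)                          # 0.87
recall = TP / (TP + FN)                             # 0.71
f1 = 2 * precision * recall / (precision + recall)  # 0.78
fall_out = FP / (FP + TN)                           # 0.15
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))  # 0.55
print("acc=%.2f p=%.2f r=%.2f f1=%.2f fpr=%.2f mcc=%.2f"
      % (accuracy, precision, recall, f1, fall_out, mcc))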
Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis. The false positive rate is placed on the X axis. The true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
Jaganadh G Elements of Text Mining
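sklearn can compute the ROC points and the AUC directly; a minimal sketch (the labels and scores here are made up purely for illustration):

from sklearn.metrics import roc_curve, auc

y_true = [1, 1, 0, 1, 0, 0, 1, 0]                    # made-up gold labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.45, 0.4, 0.3, 0.1]  # made-up classifier scores
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(auc(fpr, tpr))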
Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Jaganadh G Elements of Text Mining
Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
                       binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract/
Jaganadh G Elements of Text Mining
References

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
References

Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb. 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.
Jaganadh G Elements of Text Mining
Words Types and Ratio

Word Type Ratio

$WTR(C) = \frac{WC(C)}{C(T)}$, where $WC(C)$ is the total number of words in corpus C and $C(T)$ is the count of types in corpus C

Finding Word Type Ratio

def word_type_ratio(text):
    words = tokenize(text)
    ratio = len(words) / float(len(set(words)))
    return ratio

text = "I shot an elephant in my pajamas. He saw the fine fat trout in the brook."
ratio = word_type_ratio(text)
print ratio

Jaganadh G Elements of Text Mining
Finding top N words

Python code to find top N words from a text

from operator import itemgetter

def top_words(text, n=50):
    wordfreq = word_count(text)  # word_count function from an earlier slide
    topwords = sorted(wordfreq.iteritems(), key=itemgetter(1), reverse=True)[:n]
    return topwords

text = open('gpl-2.0.txt', 'r').read()
topwords = top_words(text, n=50)
for word, count in topwords:
    print "%s\t%d" % (word, count)
Jaganadh G Elements of Text Mining
Plotting top N words

Python code to plot top 20 words from a text

import numpy as np
import matplotlib.pyplot as plt

def plot_freq(text):
    tfw = top_words(text, n=20)
    x = range(len(tfw))
    y = [tfw[item][1] for item in range(len(tfw))]
    plt.plot(x, y, 'bo', ls='dotted')
    plt.xticks(range(0, len(tfw) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    plt.xlabel("Word Ranking")
    plt.ylabel("Word Frequency")
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq(text)
Jaganadh G Elements of Text Mining
Top N Words

Plot of top 20 words from GPL v2 (without filtering stop words)
Jaganadh G Elements of Text Mining
Plotting top N words

Python code to plot top 50 words from a text. This plot will show the words in the plot.

import numpy as np
import matplotlib.pyplot as plt

def plot_freq_tag(text):
    tfw = top_words(text, n=50)
    words = [tfw[i][0] for i in range(len(tfw))]
    x = range(len(tfw))
    y = [tfw[item][1] for item in range(len(tfw))]
    fig = plt.figure()
    ax = fig.add_subplot(111, xlabel="Word Rank", ylabel="Word Frequency")
    ax.set_title('Top 50 words')
    ax.plot(x, y, 'go-', ls='dotted')
    plt.xticks(range(0, len(words) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    for i, label in enumerate(words):
        plt.text(x[i], y[i], label, rotation=45)
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq_tag(text)
Jaganadh G Elements of Text Mining
Top N Words
Plot of top 50 words from GPL v2 (without filtering stop words)
Jaganadh G Elements of Text Mining
Plotting histogram of top N words

Python code to plot histogram of top 50 words

import numpy as np
import matplotlib.pyplot as plt

def plot_hist(text):
    tw = top_words(text)
    words = [tw[i][0] for i in range(len(tw))]
    freq = [tw[j][1] for j in range(len(tw))]
    pos = np.arange(len(words))
    width = 1.0
    ax = plt.axes(frameon=True)
    ax.set_xticks(pos)
    ax.set_yticks(range(0, max(freq), 10))
    ax.set_xticklabels(words, rotation='vertical', fontsize=9)
    plt.bar(pos, freq, width, color='b')
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_hist(text)
Jaganadh G Elements of Text Mining
Histogram
Histogram of top 50 words from GPL v2 (without filtering stop words)
Jaganadh G Elements of Text Mining
Lexical Dispersion Plot

A lexical dispersion plot shows the position of a word in a given text. I have created some Python code to generate a lexical dispersion plot with reference to a given word list.3

def dispersion_plot(text, words):
    wordst = tokenize(text)
    points = [(x, y) for x in range(len(wordst))
              for y in range(len(words)) if wordst[x] == words[y]]
    if points:
        x, y = zip(*points)
    else:
        x = y = ()
    plt.plot(x, y, "go", scalex=.2)
    plt.yticks(range(len(words)), words, color="b")
    plt.ylim(-1, len(words))
    plt.title("Lexical Dispersion Plot")
    plt.xlabel("Word Offset")
    plt.show()

gpl = open('gpl-2.0.txt', 'r').read()
dispersion_plot(gpl, ['software', 'license', 'copy', 'gnu', 'program', 'free'])

3 Code taken from http://nltk.googlecode.com/svn/trunk/doc/api/nltk.draw.dispersion-pysrc.html and modified
Jaganadh G Elements of Text Mining
Lexical Dispersion Plot
Lexical Dispersion Plot plot from GPL text
Jaganadh G Elements of Text Mining
Tag Cloud

A tag cloud (word cloud, or weighted list in visual design) is a visual representation for text data, typically used to depict keyword meta-data (tags) on websites, or to visualize free form text. There is an interesting Python tool to generate tag clouds from text, called pytagcloud.4 Here comes an example of creating a tag cloud from the first 100 words of the GPL text.

from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts

def create_tag_cloud(text):
    words = tokenize(text)
    doc = " ".join(d for d in words[:100])
    tags = make_tags(get_tag_counts(doc), maxsize=80)
    create_tag_image(tags, 'gpl.png', size=(900, 600),
                     fontname='Philosopher')

gpl = open('gpl-2.0.txt', 'r').read()
create_tag_cloud(gpl)

4 https://github.com/atizo/PyTagCloud
Jaganadh G Elements of Text Mining
Tag Cloud
Tag cloud from GPL text
Jaganadh G Elements of Text Mining
Word co-occurrence

Word co-occurrence analysis is used to construct lexicons, ontologies, etc. In general, it aims to find similarities between word pairs.

A word co-occurrence matrix is a square $N \times N$ matrix, where N corresponds to the total number of unique words in a corpus. A cell $m_{ij}$ contains the number of times $m_i$ co-occurs with word $m_j$ within a specific context, a natural unit such as a sentence or a certain window of m words. Note that the upper and lower triangles of the matrix are identical, since co-occurrence is a symmetric relation.a

a Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044
Jaganadh G Elements of Text Mining
Word co-occurrence

      w1    w2    w3   ...  wn
w1    m11   m12   m13  ...  m1n
w2    m21   m22   m23  ...  m2n
w3    m31   m32   m33  ...  m3n
...
wn    mn1   mn2   mn3  ...  mnn

Shape of a co-occurrence matrix
Jaganadh G Elements of Text Mining
Word co-occurrence

Finding a co-occurrence matrix with Python

from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = corpora[i], corpora[j]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]
ccm = cooccurrence_matrix_corpus(corpus)
Jaganadh G Elements of Text Mining
Word co-occurrence

[Figure: word co-occurrence visualization from 100 positive movie reviews. The plot shows the top first word and the top four associated words; for each associated word, its associated words are plotted again.]
Jaganadh G Elements of Text Mining
Stop Words

Stop Words

In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In terms of linguistics these words are called function words. Words like 'a', 'an', 'the' are examples of stop words. There is no defined set of stop words available; different applications and research groups use different sets of stop words.
Generally stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus, compared to content words. Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words
Jaganadh G Elements of Text Mining
Stop Words Filter

def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
        'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
        'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
        'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
        'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
        'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
        'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
        'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
        'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
        'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
        'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
        'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless
Jaganadh G Elements of Text Mining
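Instead of hard-coding the list, you can pull a ready-made stop word list, e.g. the one shipped with NLTK (a sketch; it requires the NLTK stopwords corpus to be downloaded first):

from nltk.corpus import stopwords

stops = set(stopwords.words('english'))

def stop_filter_nltk(words):
    # keep only content words
    return [word for word in words if word not in stops]

print(stop_filter_nltk("john likes to watch movies".split()))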
Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as an un-ordered collection of words, disregarding grammar and even word order.a

Analyzing text by only analyzing the frequency of words is called the bag of words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model
Jaganadh G Elements of Text Mining
Bag of Words

Example

d1: John likes to watch movies Mary likes too
d2: John also likes to watch football games

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1,
 'likes': 3, 'john': 2, 'mary': 4}
(after removing stop words)

[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of
the corresponding entry in the dictionary.a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model
Jaganadh G Elements of Text Mining
Bag of Words

Documents

d1: John likes to watch movies Mary likes too
d2: John also likes to watch football games

Vocabulary Index

$VI(t) = \begin{cases} 0 & \text{if } t \text{ is 'football'} \\ 1 & \text{if } t \text{ is 'games'} \\ 2 & \text{if } t \text{ is 'john'} \\ 3 & \text{if } t \text{ is 'likes'} \\ 4 & \text{if } t \text{ is 'mary'} \\ 5 & \text{if } t \text{ is 'movies'} \\ 6 & \text{if } t \text{ is 'watch'} \end{cases}$

Jaganadh G Elements of Text Mining
Bag of Words
football games john likes mary movies watch
doc1 0 0 1 2 1 1 1
doc2 1 1 1 1 0 0 1
Jaganadh G Elements of Text Mining
Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word', min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
        'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org
Jaganadh G Elements of Text Mining
Bag of Words

Creating Bag of Words with Python. Just for sample :-)

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
           for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies Mary likes too",
     "John also likes to watch football games")
bag_of_words(d)
Jaganadh G Elements of Text Mining
TF-IDF

Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

$tf\text{-}idf(t) = tf(t, d) \times idf(t)$

where 't' is a term in document 'd';
$tf(t, d)$: how many times the term 't' is present in 'd'

$tf(t, d) = \sum_{x \in d} fr(x, t)$

where

$fr(x, t) = \begin{cases} 1 & \text{if } x = t \\ 0 & \text{otherwise} \end{cases}$
Jaganadh G Elements of Text Mining
TF-IDF

TF-IDF

and $idf(t) = \log \frac{|D|}{1 + |\{d : t \in d\}|}$

where $|\{d : t \in d\}|$ is count(d) where 't' is present and $tf(t, d) \neq 0$; $|D|$ is the cardinality of the document space
Jaganadh G Elements of Text Mining
TF

TF

$tf(t, d) = \sum_{x \in d} fr(x, t)$

$fr(x, t)$ is a simple function:

$fr(x, t) = \begin{cases} 1 & \text{if } x = t \\ 0 & \text{otherwise} \end{cases}$

Example: $tf('john', d_1) = 1$
Jaganadh G Elements of Text Mining
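The two definitions translate directly into code; a small sketch of my own:

def fr(x, t):
    # indicator function: 1 if token x is the term t
    return 1 if x == t else 0

def tf(t, d):
    # raw term frequency of t in document d (a list of tokens)
    return sum(fr(x, t) for x in d)

d1 = "john likes to watch movies mary likes too".split()
print(tf('john', d1))   # 1
print(tf('likes', d1))  # 2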
Document Vector

To create a document vector space

$\vec{v}_{d_n} = (tf(t_1, d_n), tf(t_2, d_n), \ldots, tf(t_n, d_n))$

To represent 'd1' and 'd2' as vectors

$\vec{v}_{d_1} = (tf(t_1, d_1), tf(t_2, d_1), tf(t_3, d_1), tf(t_4, d_1), tf(t_5, d_1), tf(t_6, d_1), tf(t_7, d_1))$
$\vec{v}_{d_2} = (tf(t_1, d_2), tf(t_2, d_2), tf(t_3, d_2), tf(t_4, d_2), tf(t_5, d_2), tf(t_6, d_2), tf(t_7, d_2))$

which evaluates to
$\vec{v}_{d_1} = (0, 0, 1, 2, 1, 1, 1)$
$\vec{v}_{d_2} = (1, 1, 1, 1, 0, 0, 1)$
Jaganadh G Elements of Text Mining
Vector Space Matrix

The document vectors can be represented as a matrix

$M_{|D| \times F}$

where $|D|$ is the cardinality of the document space

$M_{|D| \times F} = \begin{bmatrix} 0 & 0 & 1 & 2 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 0 & 0 & 1 \end{bmatrix}$
Jaganadh G Elements of Text Mining
Vector Normalization

Normalized Vector

A normalized vector is represented as $\hat{v} = \frac{\vec{v}}{\|\vec{v}\|_p}$, where $\hat{v}$ is the unit vector, or the normalized vector, $\vec{v}$ is the vector going to be normalized, and $\|\vec{v}\|_p$ is the norm (magnitude or length) of the vector $\vec{v}$ in the $L^p$ space (Lebesgue spaces).a

a http://en.wikipedia.org/wiki/Lp_space
Jaganadh G Elements of Text Mining
Vector Normalization

The length of a vector is calculated using the Euclidean norm.6
A non-normalized vector: $\vec{v} = (v_1, v_2, v_3, \ldots, v_n)$
Length of vector: $\|\vec{v}\| = \sqrt{v_1^2 + v_2^2 + v_3^2 + \cdots + v_n^2}$

With the norm: $\|\vec{v}\|_p = (|v_1|^p + |v_2|^p + |v_3|^p + \cdots + |v_n|^p)^{1/p}$

It can be simplified as

$\|\vec{v}\|_p = \left( \sum_{i=1}^{n} |\vec{v}_i|^p \right)^{1/p}$

6 http://mathworld.wolfram.com/L2-Norm.html

Jaganadh G Elements of Text Mining
Vector Normalization

L2 Norm

The norm which we apply here is the L2 norm, also called the Euclidean norm.a
It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm
Jaganadh G Elements of Text Mining
Vector Normalization

$\vec{v}_{d_1} = (0, 0, 1, 2, 1, 1, 1)$
$\hat{v}_{d_1} = \frac{\vec{v}_{d_1}}{\|\vec{v}_{d_1}\|_2}$

$\hat{v}_{d_1} = \frac{(0, 0, 1, 2, 1, 1, 1)}{\sqrt{0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2}}$

$\hat{v}_{d_1} = \frac{(0, 0, 1, 2, 1, 1, 1)}{\sqrt{8}}$

$\hat{v}_{d_1} = \left( \frac{0}{\sqrt{8}}, \frac{0}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{2}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}} \right)$

$\hat{v}_{d_1} = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)$

Now our normalized vector $\hat{v}_{d_1}$ has an L2-norm $\|\hat{v}_{d_1}\|_2 = 1.0$
Jaganadh G Elements of Text Mining
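The same computation in numpy (a quick sketch of my own):

import numpy as np

v_d1 = np.array([0., 0., 1., 2., 1., 1., 1.])
v_hat = v_d1 / np.linalg.norm(v_d1)  # divide by the L2 norm, sqrt(8)
print(v_hat)                         # matches (0, 0, 0.3535..., 0.7071..., ...)
print(np.linalg.norm(v_hat))         # 1.0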
IDF

IDF

$idf(t) = \log \frac{|D|}{1 + |\{d : t \in d\}|}$

where $|\{d : t \in d\}|$ is count(d) where 't' is present and $tf(t, d) \neq 0$
Jaganadh G Elements of Text Mining
Finding IDF

$idf(t_i) = \log \frac{|D|}{1 + |\{d : t_i \in d\}|}$, e.g. $\log \frac{2}{1} = 0.69314718$

idf(football) = log(2/(1+1)) = 0.0
idf(games) = log(2/(1+1)) = 0.0
idf(john) = log(2/(1+2)) = -0.40546510810816444
idf(likes) = log(2/(1+2)) = -0.40546510810816444
idf(mary) = log(2/(1+1)) = 0.0
idf(movies) = log(2/(1+1)) = 0.0
idf(watch) = log(2/(1+1)) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
Jaganadh G Elements of Text Mining
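The idf vector can be checked in one line of numpy (a sketch; document frequencies are taken from the two example documents):

import numpy as np

D = 2.0                                      # number of documents
df = np.array([1., 1., 2., 2., 1., 1., 2.])  # df per term: football ... watch
idf = np.log(D / (1. + df))
print(idf)   # [ 0.  0. -0.40546511 -0.40546511  0.  0.  0. ]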
TF-IDF weight
Finding TF-IDF weight
M|D|timesF times Midf
[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
]x
idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)
Jaganadh G Elements of Text Mining
TF-IDF weight
Finding TF-IDF weight
M|D|timesF times Midf
[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
]x
idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)
Jaganadh G Elements of Text Mining
TF-IDF weight
tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)
Jaganadh G Elements of Text Mining
TF-IDF Normalization
L2 Normalization
Mtfminusidf =MtfminusidfMtfminusidf2
Jaganadh G Elements of Text Mining
TF-IDF
Practice with Python and sklearn 7
from sklearnfeature_extractiontext import
CountVectorizer TfidfTransformer
vectorizer = CountVectorizer(analyzer=rsquowordrsquo
min_n=1stop_words=rsquoenglishrsquo)
docs = (rsquoJohn likes to watch movies Mary
likes toorsquorsquoJohn also likes to watch football gamesrsquo)
bow = vectorizerfit_transform(docs)
freq_term_matrix =vectorizertransform(docs)
tfidf = TfidfTransformer(norm=l2)
tfd = tfidffit(freq_term_matrix)
print IDF tfidfidf_
tf_idf_matrix = tfidftransform(freq_term_matrix)
print tf_idf_matrixtodense()
for wf in zip(vectorizervocabulary_tfdidf_)
print rsquor =gt rrsquo (w tfdidf_[f])
7httpscikit-learnorgJaganadh G Elements of Text Mining
N-Grams
N-Gram
In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa
Unigrams are single words
Bigrams are sequences of two words
Trigrams are sequences of three words
ahttpenwikipediaorgwikiN-gram
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from list of words
def ngrams(wordsn=2)
grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]
return grams
words = John likes to watch movies Mary likes too
lower()split()
bigrams = ngrams(wordsn=2)
trigrams = ngrams(wordsn=3)
print bigram
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math
def mutual_info(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
minfo =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
minfo[gram] = (mathlog( len(words) gramcount[ gram ]
wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))
mathlog( 2 )
return minfo
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math
def tscore(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
tsc =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
tsc[gram] = (gramcount[gram] - (1len(words))
wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])
mathsqrt( gramcount[gram])
return tsc
Jaganadh G Elements of Text Mining
Document Classification
Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8
Document classification tasks can be divided into three kinds
supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents
semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance
unsupervised document classification is entirely executed without reference toexternal information
8httpenwikipediaorgwikiDocument_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents
langd crang
wherelangd crangisin D times C using a learning
algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ
γ(ci dj) =
1 if dj belongs to ci0 otherwise
Main approaches in document classification are
Naıve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9
9Image taken from httpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)
P (E)
The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)
P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)
P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)
P (E1E2En)
Jaganadh G Elements of Text Mining
Naıve Bayes Applied in Text Classification
Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
Naıve Bayes Model for the training set will be like [4pt]
|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095
E 05 12 005555566 011111111 011111111 033333333 027777778 011111111
|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class
number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)
15 = 415
P (kill|Entertainment) = (0+0+0)12 = 0
12
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)
(15+|V |) = 521 = 02380
P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1
18 = 00555
Jaganadh G Elements of Text Mining
Words, Types and Ratio
Word Type Ratio

WTR(C) = WC(C) / C(T), where WC(C) is the total number of words (tokens) in corpus C and C(T) is the count of types (unique words) in corpus C.
Finding Word Type Ratio
def word_type_ratio(text):
    words = tokenize(text)
    ratio = len(words) / float(len(set(words)))
    return ratio

text = "I shot an elephant in my pajamas. He saw the fine fat trout in the brook."
ratio = word_type_ratio(text)
print ratio
Jaganadh G Elements of Text Mining
Finding top N words
Python code to find top N words from a text
from operator import itemgetter

def top_words(text, n=50):
    wordfreq = word_count(text)
    topwords = sorted(wordfreq.iteritems(), key=itemgetter(1),
                      reverse=True)[:n]
    return topwords

text = open('gpl-2.0.txt', 'r').read()
topwords = top_words(text, n=50)
for word, count in topwords:
    print "%s \t %d" % (word, count)
Jaganadh G Elements of Text Mining
Plotting top N words
Python code to plot top 20 words from a text
import numpy as np
import matplotlib.pyplot as plt

def plot_freq(text):
    tfw = top_words(text, n=20)
    x = range(len(tfw))
    y = []
    for item in range(len(tfw)):
        y = y + [tfw[item][1]]
    plt.plot(x, y, 'bo', ls='dotted')
    plt.xticks(range(0, len(tfw) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    plt.xlabel("Word Ranking")
    plt.ylabel("Word Frequency")
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq(text)
Jaganadh G Elements of Text Mining
Top N Words
Plot of top 50 words from GPL v2 (without filtering stop words)
Jaganadh G Elements of Text Mining
Plotting top N words
Python code to plot the top 50 words from a text. This plot will show the words themselves in the plot.
import numpy as np
import matplotlib.pyplot as plt

def plot_freq_tag(text):
    tfw = top_words(text, n=50)
    words = [tfw[i][0] for i in range(len(tfw))]
    x = range(len(tfw))
    y = []
    for item in range(len(tfw)):
        y = y + [tfw[item][1]]
    fig = plt.figure()
    ax = fig.add_subplot(111, xlabel="Word Rank", ylabel="Word Frequency")
    ax.set_title('Top 50 words')
    ax.plot(x, y, 'go-', ls='dotted')
    plt.xticks(range(0, len(words) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    for i, label in enumerate(words):
        plt.text(x[i], y[i], label, rotation=45)
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq_tag(text)
Jaganadh G Elements of Text Mining
Top N Words
Plot of top 50 words from GPL v2 (without filtering stop words)
Jaganadh G Elements of Text Mining
Plotting histogram of top N words
Python code to plot histogram of top 50 words
import numpy as np
import matplotlib.pyplot as plt

def plot_hist(text):
    tw = top_words(text)
    words = [tw[i][0] for i in range(len(tw))]
    freq = [tw[j][1] for j in range(len(tw))]
    pos = np.arange(len(words))
    width = 1.0
    ax = plt.axes(frameon=True)
    ax.set_xticks(pos)
    ax.set_yticks(range(0, max(freq), 10))
    ax.set_xticklabels(words, rotation='vertical', fontsize=9)
    plt.bar(pos, freq, width, color='b')
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_hist(text)
Jaganadh G Elements of Text Mining
Histogram
Histogram of top 50 words from GPL v2 (without filtering stop words)
Jaganadh G Elements of Text Mining
Lexical Dispersion Plot
A lexical dispersion plot shows the positions of given words in a text. I have created some Python code to generate a lexical dispersion plot with reference to a given word list.3
def dispersion_plot(text, words):
    wordst = tokenize(text)
    points = [(x, y) for x in range(len(wordst))
              for y in range(len(words)) if wordst[x] == words[y]]
    if points:
        x, y = zip(*points)
    else:
        x = y = ()
    plt.plot(x, y, "go", scalex=.2)
    plt.yticks(range(len(words)), words, color="b")
    plt.ylim(-1, len(words))
    plt.title("Lexical Dispersion Plot")
    plt.xlabel("Word Offset")
    plt.show()

gpl = open('gpl-2.0.txt', 'r').read()
dispersion_plot(gpl, ['software', 'license', 'copy', 'gnu', 'program', 'free'])
3 Code taken from http://nltk.googlecode.com/svn/trunk/doc/api/nltk.draw.dispersion-pysrc.html and modified
Jaganadh G Elements of Text Mining
Lexical Dispersion Plot
Lexical dispersion plot from the GPL text
Jaganadh G Elements of Text Mining
Tag Cloud
A tag cloud (word cloud, or weighted list in visual design) is a visual representation of text data, typically used to depict keyword meta-data (tags) on websites or to visualize free-form text. There is an interesting Python tool to generate tag clouds from text called pytagcloud.4 Here comes an example of creating a tag cloud from the first 100 words of the GPL text.
from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts

def create_tag_cloud(text):
    words = tokenize(text)
    doc = " ".join(d for d in words[:100])
    tags = make_tags(get_tag_counts(doc), maxsize=80)
    create_tag_image(tags, 'gpl.png', size=(900, 600),
                     fontname='Philosopher')

gpl = open('gpl-2.0.txt', 'r').read()
create_tag_cloud(gpl)
4 https://github.com/atizo/PyTagCloud
Jaganadh G Elements of Text Mining
Tag Cloud
Tag cloud from GPL text
Jaganadh G Elements of Text Mining
Word co-occurrence
Word co-occurrence analysis is used to construct lexicons, ontologies, etc. In general it aims to find similarities between word pairs.

A word co-occurrence matrix is a square N × N matrix, where N corresponds to the total number of unique words in a corpus. A cell mij contains the number of times word wi co-occurs with word wj within a specific context: a natural unit such as a sentence or a certain window of m words. Note that the upper and lower triangles of the matrix are identical, since co-occurrence is a symmetric relation.a

a Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044
Jaganadh G Elements of Text Mining
Word co-occurrence

     w1   w2   w3   ...  wn
w1   m11  m12  m13  ...  m1n
w2   m21  m22  m23  ...  m2n
w3   m31  m32  m33  ...  m3n
...
wn   mn1  mn2  mn3  ...  mnn

Shape of a co-occurrence matrix
Jaganadh G Elements of Text Mining
Word co-occurrence
Finding co-occurrence matrix with Python
from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = [corpora[i], corpora[j]]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]
ccm = cooccurrence_matrix_corpus(corpus)
Jaganadh G Elements of Text Mining
Word co-occurrence
[Figure] Word co-occurrence visualization from 100 positive movie reviews. The plot shows the top word and its top four associated words; for each associated word, its own associated words are plotted in turn.
Jaganadh G Elements of Text Mining
Stop Words
Stop Words
In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In terms of linguistics these words are called function words. Words like 'a', 'an', and 'the' are examples of stop words. There is no single defined set of stop words; different applications and research groups use different sets of stop words. Generally stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus compared to content words. Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words
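As an aside, one widely used list ships with NLTK. A minimal sketch (assuming the NLTK stopwords corpus has been downloaded):

from nltk.corpus import stopwords

stops = set(stopwords.words('english'))   # NLTK's English stop word list
print len(stops)
print 'the' in stops, 'linguistics' in stops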
Jaganadh G Elements of Text Mining
Stop Words Filter
def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
        'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
        'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
        'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
        'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
        'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
        'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
        'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
        'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
        'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
        'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
        'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless
Jaganadh G Elements of Text Mining
Bag of Words
The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar and even word order.a

Analyzing text by only analyzing the frequency of words is called the bag of words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model
Jaganadh G Elements of Text Mining
Bag of Words
Example
d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1,
 'likes': 3, 'john': 2, 'mary': 4}
(after removing stop words)

[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of the corresponding entry in the dictionary.a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model
Jaganadh G Elements of Text Mining
Bag of Words
Documents
d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

Vocabulary Index

VI(t) = 0 if t is 'football', 1 if t is 'games', 2 if t is 'john', 3 if t is 'likes', 4 if t is 'mary', 5 if t is 'movies', 6 if t is 'watch'
Jaganadh G Elements of Text Mining
Bag of Words
football games john likes mary movies watch
doc1 0 0 1 2 1 1 1
doc2 1 1 1 1 0 0 1
Jaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python and sklearn 5
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
        'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()
5 http://scikit-learn.org
Jaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python. Just for a sample :-(

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
           for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies Mary likes too",
     "John also likes to watch football games")
bag_of_words(d)
Jaganadh G Elements of Text Mining
TF-IDF
Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t, d) = tf(t, d) × idf(t)

where 't' is a term in document 'd', and tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = Σ x∈d fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise
Jaganadh G Elements of Text Mining
TF-IDF
TF-IDF
and

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the count of documents 'd' in which 't' is present (i.e. tf(t, d) ≠ 0), and |D| is the cardinality of the document space.
Jaganadh G Elements of Text Mining
TF
TF
tf(t, d) = Σ x∈d fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1
Jaganadh G Elements of Text Mining
Document Vector
To create a document vector space:

V(dn) = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))

To represent 'd1' and 'd2' as vectors:

V(d1) = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))
V(d2) = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))

which evaluates to:

V(d1) = (0, 0, 1, 2, 1, 1, 1)
V(d2) = (1, 1, 1, 1, 0, 0, 1)
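A quick sketch to reproduce these raw term-frequency vectors in Python (vocabulary order as above; the stop words 'to', 'too', 'also' are not in the vocabulary):

vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']
d1 = "john likes to watch movies mary likes too".split()
d2 = "john also likes to watch football games".split()
print [d1.count(t) for t in vocab]   # [0, 0, 1, 2, 1, 1, 1]
print [d2.count(t) for t in vocab]   # [1, 1, 1, 1, 0, 0, 1]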
Jaganadh G Elements of Text Mining
Vector Space Matrix
The document vectors can be represented as a matrix M|D|×F, where |D| is the cardinality of the document space:

M|D|×F =
[ 0 0 1 2 1 1 1
  1 1 1 1 0 0 1 ]
Jaganadh G Elements of Text Mining
Vector Normalization
Normalized Vector

A normalized vector is represented as v' = v / ||v||p, where v' is the unit vector (the normalized vector), v is the vector to be normalized, and ||v||p is the norm (magnitude or length) of the vector v in the Lp space (Lebesgue spaces).a

a http://en.wikipedia.org/wiki/Lp_space
Jaganadh G Elements of Text Mining
Vector Normalization
Length of a vector is calculated using the Euclidean norm.6
Non-normalized vector: v = (v1, v2, v3, ..., vn)
Length of the vector: ||v|| = √(v1² + v2² + v3² + ... + vn²)
With the norm: ||v||p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)
It can be simplified as:

||v||p = ( Σ i=1..n |vi|^p )^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining
Vector Normalization
L2 Norm

The norm which we apply here is the L2 norm, which is also called the Euclidean norm.a It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm
Jaganadh G Elements of Text Mining
Vector Normalization
v(d1) = (0, 0, 1, 2, 1, 1, 1)
v'(d1) = v(d1) / ||v(d1)||2
v'(d1) = (0, 0, 1, 2, 1, 1, 1) / √(0² + 0² + 1² + 2² + 1² + 1² + 1²)
v'(d1) = (0, 0, 1, 2, 1, 1, 1) / √8
v'(d1) = (0/√8, 0/√8, 1/√8, 2/√8, 1/√8, 1/√8, 1/√8)
v'(d1) = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)
Now our normalized vector v'(d1) has an L2 norm ||v'(d1)||2 = 1.0
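The same normalization, as a quick numpy sketch:

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_hat = v_d1 / np.linalg.norm(v_d1, ord=2)   # divide by the L2 norm
print v_hat                    # ≈ [0, 0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535]
print np.linalg.norm(v_hat)    # 1.0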
Jaganadh G Elements of Text Mining
IDF
IDF

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the count of documents 'd' in which 't' is present (i.e. tf(t, d) ≠ 0).
Jaganadh G Elements of Text Mining
Finding IDF
idf(ti) = log( |D| / (1 + |{d : ti ∈ d}|) ) = log(2/1) = 0.69314718

idf(football) = log(2 / (1+1)) = 0.0
idf(games) = log(2 / (1+1)) = 0.0
idf(john) = log(2 / (1+2)) = -0.40546510810816444
idf(likes) = log(2 / (1+2)) = -0.40546510810816444
idf(mary) = log(2 / (1+1)) = 0.0
idf(movies) = log(2 / (1+1)) = 0.0
idf(watch) = log(2 / (1+1)) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
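These idf values can be verified with a few lines of Python (a sketch; |D| = 2 and two toy documents standing in for d1 and d2):

from __future__ import division
import math

docs = [['john', 'likes', 'watch', 'movies', 'mary'],
        ['john', 'likes', 'watch', 'football', 'games']]
vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']
for t in vocab:
    df = sum(1 for d in docs if t in d)   # number of documents containing t
    print t, math.log(2 / (1 + df))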
Jaganadh G Elements of Text Mining
TF-IDF weight
Finding TF-IDF weight

M|D|×F × Midf =

[ tf(t1, d1) tf(t2, d1) tf(t3, d1) tf(t4, d1) tf(t5, d1) tf(t6, d1) tf(t7, d1)
  tf(t1, d2) tf(t2, d2) tf(t3, d2) tf(t4, d2) tf(t5, d2) tf(t6, d2) tf(t7, d2) ]

×

[ idf(t1)  0        0        0        0        0        0
  0        idf(t2)  0        0        0        0        0
  0        0        idf(t3)  0        0        0        0
  0        0        0        idf(t4)  0        0        0
  0        0        0        0        idf(t5)  0        0
  0        0        0        0        0        idf(t6)  0
  0        0        0        0        0        0        idf(t7) ]
TF-IDF weight
= [ tf(t1, d1)×idf(t1)  tf(t2, d1)×idf(t2)  tf(t3, d1)×idf(t3)  tf(t4, d1)×idf(t4)  tf(t5, d1)×idf(t5)  tf(t6, d1)×idf(t6)  tf(t7, d1)×idf(t7)
    tf(t1, d2)×idf(t1)  tf(t2, d2)×idf(t2)  tf(t3, d2)×idf(t3)  tf(t4, d2)×idf(t4)  tf(t5, d2)×idf(t5)  tf(t6, d2)×idf(t6)  tf(t7, d2)×idf(t7) ]
Jaganadh G Elements of Text Mining
TF-IDF Normalization
L2 Normalization

Mtf-idf = Mtf-idf / ||Mtf-idf||2

i.e. each document (row) vector of the tf-idf matrix is divided by its L2 norm.
Jaganadh G Elements of Text Mining
TF-IDF
Practice with Python and sklearn 7
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
        'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfd.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, i in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[i])
7 http://scikit-learn.org
Jaganadh G Elements of Text Mining
N-Grams
N-Gram
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters; however, the items in question can be phonemes, syllables, letters, words or base pairs according to the application.a
Unigrams are single words
Bigrams are sequences of two words
Trigrams are sequences of three words
a http://en.wikipedia.org/wiki/N-gram
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words)-2+1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-2, wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-grams from a list of words:

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information

A statistical test to measure the strength of word association:

I(wi, wj) = log2( P(wi, wj) / (P(wi) P(wj)) ) ≈ log2( N C(wi, wj) / (C(wi) C(wj)) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus, C(wi, wj) is the frequency of the bigram wi wj, and N is the total number of words in the corpus.

I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0))
     for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0))
     for gram in grams]
    for gram in grams:
        minfo[gram] = math.log(len(words) * gramcount[gram] /
                      (wordcount[gram.split()[0]] *
                       wordcount[gram.split()[1]])) / math.log(2)
    return minfo
Jaganadh G Elements of Text Mining
t-score
t-score

A statistical test to measure the strength of word association:

t(wi, wj) = ( mean(P(wi, wj)) - mean(P(wi)) mean(P(wj)) ) / √( σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)) )
≈ ( C(wi, wj) - (1/N) C(wi) C(wj) ) / √C(wi, wj)

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus, C(wi, wj) is the frequency of the bigram wi wj, and N is the total number of words in the corpus.

t(strong, tea) = ( C(strong, tea) - (1/N) C(strong) C(tea) ) / √C(strong, tea)
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0))
     for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0))
     for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
                     wordcount[gram.split()[0]] *
                     wordcount[gram.split()[1]]) / \
                    math.sqrt(gramcount[gram])
    return tsc
Jaganadh G Elements of Text Mining
Document Classification
Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (or classifier function) γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

The main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9
9 Image taken from http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem, or Bayes' rule:

Bayes' rule: P(H|E) = P(E|H) × P(H) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.
Jaganadh G Elements of Text Mining
Naïve Bayes Classification

Let H be the event of raining and E be the evidence of dark clouds; then we have:

P(raining | dark cloud) = P(dark cloud | raining) × P(raining) / P(dark cloud)

For multiple evidences:

P(H | E1, E2, ..., En) = P(E1, E2, ..., En | H) × P(H) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as:

P(H | E1, E2, ..., En) = P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) / P(E1, E2, ..., En)
Jaganadh G Elements of Text Mining
Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification:
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (the vocabulary)
The pre-processed documents for training will look like:
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naïve Bayes Model

The Naïve Bayes model for the training set will be like:

|V| = 6

C              P(Ci)  ni  P(kill|Ci)  P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
Terrorism      0.5    15  0.23809524  0.19047619  0.33333333    0.04761905   0.09523810   0.09523810
Entertainment  0.5    12  0.05555556  0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in the class) / (number of all documents)
ni = the total word frequency of each class
n(terrorism) = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n(entertainment) = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill | Terrorism) = (2+1+1) / 15 = 4/15
P(kill | Entertainment) = (0+0+0) / 12 = 0/12
Jaganadh G Elements of Text Mining
Building Naïve Bayes Model

To avoid the "zero frequency" problem we apply Laplace estimation, assuming a uniform distribution over all words, as follows:

P(kill | Terrorism) = (2+1+1+1) / (15+|V|) = 5/21 = 0.2380
P(kill | Entertainment) = (0+0+0+1) / (12+|V|) = 1/18 = 0.0555
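A minimal sketch of these smoothed estimates in Python, with per-class word counts chosen to reproduce the table's values (cond_prob is a hypothetical helper name, not part of any library):

from __future__ import division

vocab = ['kill', 'bomb', 'kidnap', 'music', 'movie', 'tv']
counts = {'Terrorism':     {'kill': 4, 'bomb': 3, 'kidnap': 6,
                            'music': 0, 'movie': 1, 'tv': 1},
          'Entertainment': {'kill': 0, 'bomb': 1, 'kidnap': 1,
                            'music': 5, 'movie': 4, 'tv': 1}}
n = {'Terrorism': 15, 'Entertainment': 12}

def cond_prob(word, cls):
    # Laplace (add-one) smoothing: (count + 1) / (n_class + |V|)
    return (counts[cls][word] + 1) / (n[cls] + len(vocab))

print cond_prob('kill', 'Terrorism')       # 5/21 = 0.2380...
print cond_prob('kill', 'Entertainment')   # 1/18 = 0.0555...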
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt         2     1     2       0      0      1   ?

To find the posterior probability:

P(ci|W) = P(ci) × Π j=1..|V| P(wj|ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)^2 × P(bomb|Terrorism)^1 × P(kidnap|Terrorism)^2 × P(music|Terrorism)^0 × P(movie|Terrorism)^0 × P(tv|Terrorism)^1
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)^2 × P(bomb|Entertainment)^1 × P(kidnap|Entertainment)^2 × P(music|Entertainment)^0 × P(movie|Entertainment)^0 × P(tv|Entertainment)^1
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document has been classified as "Terrorism" because that class got the highest value.
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small. In real-world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called Underflow. To resolve the problem we can take the logarithm of the probabilities:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem, our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.
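A sketch of the same log-space scoring in Python, reusing the hypothetical cond_prob helper from the earlier sketch (base-10 logs, as in the worked example):

import math

def log_posterior(cls, doc_counts):
    score = math.log10(0.5)   # log of the class prior (0.5 for both classes here)
    for word, freq in doc_counts.items():
        score += freq * math.log10(cond_prob(word, cls))
    return score

test_doc = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'tv': 1}
print log_posterior('Terrorism', test_doc)       # about -4.24
print log_posterior('Entertainment', test_doc)   # about -6.63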
The section on Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
Jaganadh G Elements of Text Mining
Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in the document.a
a A Comparison of Event Models for Naive Bayes Text Classification. Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we adopt the Multi-variate Bernoulli Model for text classification, our document space representation will be like:
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
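For a quick experiment with this event model, scikit-learn provides a BernoulliNB estimator; a minimal sketch with made-up toy documents (binary=True gives presence/absence features):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

vectorizer = CountVectorizer(analyzer='word', binary=True)
docs = ['kill bomb kidnap tv', 'bomb kill kidnap',
        'music movie tv', 'music movie kidnap']
labels = ['Terrorism', 'Terrorism', 'Entertainment', 'Entertainment']
X = vectorizer.fit_transform(docs)
clf = BernoulliNB().fit(X, labels)
print clf.predict(vectorizer.transform(['kill bomb tv']))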
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag of words representation. We can also use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.
Jaganadh G Elements of Text Mining
Support Vector Machine
A Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set (xi, yi), i = 1, ..., n, where xi ∈ Rp (input feature vectors) and yi ∈ {1, -1} is the corresponding label, whether (yi = +1) or (yi = -1). To start with, we assume that our input feature vectors are linearly separable; that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ Rp (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = -1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate the positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum-margin hyperplane are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
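To make the decision rule concrete, a tiny numpy sketch with made-up values for w, b and x:

import numpy as np

w = np.array([0.4, -0.2, 0.7])   # weight vector (toy values)
b = -0.1                          # bias (toy value)
x = np.array([1.0, 0.0, 2.0])    # an input feature vector
f = np.dot(w, x) + b              # f(x) = <w, x> + b
print f, ('+1' if f > 0 else '-1')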
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the scatter regions of squares and rectangles, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11
10 This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.
11 http://www.statsoft.com/textbook/support-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build an SVM Classifier with Python Sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK:12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)
12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                         Actual
                    Positive             Negative
Predicted Positive  True Positive (TP)   False Positive (FP)
          Negative  False Negative (FN)  True Negative (TN)
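scikit-learn can build such a table from actual and predicted labels; a minimal sketch with made-up labels (note that sklearn orders rows and columns by label value, which differs from the layout above):

from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 1, 0, 0]
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0]
print confusion_matrix(y_actual, y_predicted)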
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = (TP + TN) / (TP + FP + FN + TN)

                    Actual
           Positive  Negative  Total
Predicted
Positive   562       77        639
Negative   225       436       661
Total      787       513       1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
                    Actual
           Positive  Negative  Total
Predicted
Positive   562       77        639
Negative   225       436       661
Total      787       513       1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong.

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                    Actual
           Positive  Negative  Total
Predicted
Positive   562       77        639
Negative   225       436       661
Total      787       513       1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                    Actual
           Positive  Negative  Total
Predicted
Positive   562       77        639
Negative   225       436       661
Total      787       513       1300

Fall-out = 77 / (77 + 436) = 0.15
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and its worst at 0.a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78
F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)
Positive predictive value = 562 / (562 + 77) = 0.87
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)
NPV = 436 / (436 + 225) = 0.65
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TN / (FP + TN)
Specificity = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)
FDR = 77 / (77 + 562) = 0.12
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN - FP × FN) / √( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

MCC = (562 × 436 - 77 × 225) / √( (562 + 77)(562 + 225)(436 + 77)(436 + 225) ) = 0.55
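All of the metrics above can be recomputed from the four confusion-matrix cells; a quick sketch (the printed values match the worked examples up to rounding):

from __future__ import division
import math

TP, FP, FN, TN = 562, 77, 225, 436

accuracy  = (TP + TN) / (TP + FP + FN + TN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)
mcc       = (TP * TN - FP * FN) / math.sqrt(
            (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print accuracy, precision, recall, f1, mcc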
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis, and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
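A minimal sketch of computing an ROC curve and its AUC with scikit-learn; the scores here are made up, and in practice would come from a classifier's decision_function() or predict_proba():

from sklearn.metrics import roc_curve, auc

y_true  = [1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print auc(fpr, tpr)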
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
Finding top N words
Python code to find top N words from a text
from operator import itemgetter

def top_words(text, n=50):
    wordfreq = word_count(text)
    topwords = sorted(wordfreq.iteritems(), key=itemgetter(1),
        reverse=True)[:n]
    return topwords

text = open('gpl-2.0.txt', 'r').read()
topwords = top_words(text, n=50)
for word, count in topwords:
    print "%s \t %d" % (word, count)
Jaganadh G Elements of Text Mining
Plotting top N words
Python code to plot top 20 words from a text
import matplotlib.pyplot as plt

def plot_freq(text):
    tfw = top_words(text, n=20)
    x = range(len(tfw))
    y = []
    for item in range(len(tfw)):
        y = y + [tfw[item][1]]
    plt.plot(x, y, 'bo', ls='dotted')
    plt.xticks(range(0, len(tfw) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    plt.xlabel("Word Ranking")
    plt.ylabel("Word Frequency")
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq(text)
Jaganadh G Elements of Text Mining
Top N Words
Plot of top 20 words from GPL v2 (without filtering stop words)
Jaganadh G Elements of Text Mining
Plotting top N words
Python code to plot the top 50 words from a text. This plot will show the words themselves in the plot.
import matplotlib.pyplot as plt

def plot_freq_tag(text):
    tfw = top_words(text, n=50)
    words = [tfw[i][0] for i in range(len(tfw))]
    x = range(len(tfw))
    y = []
    for item in range(len(tfw)):
        y = y + [tfw[item][1]]
    fig = plt.figure()
    ax = fig.add_subplot(111, xlabel="Word Rank", ylabel="Word Frequency")
    ax.set_title('Top 50 words')
    ax.plot(x, y, 'go-', ls='dotted')
    plt.xticks(range(0, len(words) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    for i, label in enumerate(words):
        plt.text(x[i], y[i], label, rotation=45)
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq_tag(text)
Jaganadh G Elements of Text Mining
Top N Words
Plot of top 50 words from GPL v2 (without filtering stop words)
Jaganadh G Elements of Text Mining
Plotting histogram of top N words
Python code to plot histogram of top 50 words
import numpy as np
import matplotlib.pyplot as plt

def plot_hist(text):
    tw = top_words(text)
    words = [tw[i][0] for i in range(len(tw))]
    freq = [tw[j][1] for j in range(len(tw))]
    pos = np.arange(len(words))
    width = 1.0
    ax = plt.axes(frameon=True)
    ax.set_xticks(pos)
    ax.set_yticks(range(0, max(freq), 10))
    ax.set_xticklabels(words, rotation='vertical', fontsize=9)
    plt.bar(pos, freq, width, color='b')
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_hist(text)
Jaganadh G Elements of Text Mining
Histogram
Histogram of top 50 words from GPL v2 (without filtering stop words)
Jaganadh G Elements of Text Mining
Lexical Dispersion Plot
A lexical dispersion plot shows the position of a word in a given text. I have created some Python code to generate a lexical dispersion plot with reference to a given word list.3
import matplotlib.pyplot as plt

def dispersion_plot(text, words):
    wordst = tokenize(text)
    points = [(x, y) for x in range(len(wordst))
        for y in range(len(words)) if wordst[x] == words[y]]
    if points:
        x, y = zip(*points)
    else:
        x = y = ()
    plt.plot(x, y, "go", scalex=2)
    plt.yticks(range(len(words)), words, color="b")
    plt.ylim(-1, len(words))
    plt.title("Lexical Dispersion Plot")
    plt.xlabel("Word Offset")
    plt.show()

gpl = open('gpl-2.0.txt', 'r').read()
dispersion_plot(gpl, ['software', 'license', 'copy', 'gnu', 'program', 'free'])
3 Code taken from http://nltk.googlecode.com/svn/trunk/doc/api/nltk.draw.dispersion-pysrc.html and modified
Jaganadh G Elements of Text Mining
Lexical Dispersion Plot
Lexical dispersion plot from GPL text
Jaganadh G Elements of Text Mining
Tag Cloud
A tag cloud (word cloud, or weighted list in visual design) is a visual representation of text data, typically used to depict keyword meta-data (tags) on websites or to visualize free-form text. There is an interesting Python tool to generate tag clouds from text, called pytagcloud.4 Here comes an example of creating a tag cloud from the first 100 words of the GPL text.
from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts

def create_tag_cloud(text):
    words = tokenize(text)
    doc = " ".join(d for d in words[:100])
    tags = make_tags(get_tag_counts(doc), maxsize=80)
    create_tag_image(tags, 'gpl.png', size=(900, 600),
        fontname='Philosopher')

gpl = open('gpl-2.0.txt', 'r').read()
create_tag_cloud(gpl)
4 https://github.com/atizo/PyTagCloud
Jaganadh G Elements of Text Mining
Tag Cloud
Tag cloud from GPL text
Jaganadh G Elements of Text Mining
Word co-occurrence
Word co-occurrence analysis is used to construct lexicons, ontologies, etc. In general it aims to find similarities between word pairs.

A word co-occurrence matrix is a square N × N matrix, where N corresponds to the total number of unique words in a corpus. A cell mij contains the number of times word wi co-occurs with word wj within a specific context (a natural unit such as a sentence, or a certain window of m words). Note that the upper and lower triangles of the matrix are identical, since co-occurrence is a symmetric relation. a

a Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044
Jaganadh G Elements of Text Mining
Word co-occurrence

      w1    w2    w3   ...   wn
w1    m11   m12   m13  ...   m1n
w2    m21   m22   m23  ...   m2n
w3    m31   m32   m33  ...   m3n
...
wn    mn1   mn2   mn3  ...   mnn

Shape of a co-occurrence matrix
Jaganadh G Elements of Text Mining
Word co-occurrence
Finding co-occurrence matrix with Python
from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = [corpora[i], corpora[j]]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]
ccm = cooccurrence_matrix_corpus(corpus)
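Reading the result back (a quick check I added; the counts follow from the toy corpus above):

print ccm['w1']['w2']   # 1: 'w1' and 'w2' co-occur once
print ccm['w2']['w1']   # 1: the matrix is symmetric
print ccm['w1']['w4']   # 0: defaultdict returns 0 for unseen pairs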
Jaganadh G Elements of Text Mining
Word co-occurrence
[Graph figure] Word co-occurrence visualization from 100 positive movie reviews. The plot shows the top first word and the top four associated words; for each associated word, its associated words are plotted in turn.
Jaganadh G Elements of Text Mining
Stop Words
Stop Words
In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In linguistics these words are called function words. Words like 'a', 'an', and 'the' are examples of stop words. There is no single defined set of stop words; different applications and research groups use different sets of stop words. Generally stop words are omitted in the text mining process, since the frequency of stop words in any corpus is very high compared to content words. Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words
Jaganadh G Elements of Text Mining
Stop Words Filter
def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
    'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
    'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
    'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
    'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
    'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
    'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
    'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
    'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
    'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
    'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
    'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless
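A quick usage sketch (my addition; it assumes the tokenize function defined earlier in this deck):

words = tokenize("How can this be implemented with all of the stop words")
print stop_filter(words)
# ['implemented', 'stop', 'words']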
Jaganadh G Elements of Text Mining
Bag of Words
The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar and even word order. a

Analyzing text by only analyzing the frequency of words is called the bag of words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model
Jaganadh G Elements of Text Mining
Bag of Words
Example
d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1,
 'likes': 3, 'john': 2, 'mary': 4}
(vocabulary index, after removing stop words)

[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of the corresponding entry in the dictionary. a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model
Jaganadh G Elements of Text Mining
Bag of Words
Documents
d1 John likes to watch movies Mary likes too
d2 John also likes to watch football games
Vocabulary Index
VI(t) =
  0 if t is 'football'
  1 if t is 'games'
  2 if t is 'john'
  3 if t is 'likes'
  4 if t is 'mary'
  5 if t is 'movies'
  6 if t is 'watch'
Jaganadh G Elements of Text Mining
Bag of Words
        football  games  john  likes  mary  movies  watch
doc1           0      0     1      2     1       1      1
doc2           1      1     1      1     0       0      1
Jaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python and sklearn 5
from sklearn.feature_extraction.text import CountVectorizer

# min_n is the old sklearn API of the deck's era; newer versions use ngram_range
vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org
Jaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with plain Python, just for a sample :-(
def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_index = dict([(word, vocab.index(word)) for word
        in vocab])
    bow = [[tokens.count(word) for word in sorted(vocab_index,
        key=vocab_index.get)] for tokens in token_list]
    print vocab_index
    for bag in bow:
        print bag

d = ("John likes to watch movies. Mary likes too.",
    "John also likes to watch football games.")
bag_of_words(d)
Jaganadh G Elements of Text Mining
TF-IDF
Tf-idf (term frequency - inverse document frequency) is a numerical statistic which reflects how important a word is to a document in a collection or corpus.
TF-IDF
tf-idf(t) = tf(t, d) × idf(t)

where 't' is a term in document 'd', and
tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = Σ_{x ∈ d} fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise
Jaganadh G Elements of Text Mining
TF-IDF
TF-IDF
and

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the count of documents d in which 't' is present (i.e. tf(t, d) ≠ 0),
and |D| is the cardinality of the document space.
Jaganadh G Elements of Text Mining
TF
TF
tf(t, d) = Σ_{x ∈ d} fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1
Jaganadh G Elements of Text Mining
Document Vector
To create a document vector space
V(dn) = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))

To represent 'd1' and 'd2' as vectors:

V(d1) = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))
V(d2) = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))

which evaluates to:
V(d1) = (0, 0, 1, 2, 1, 1, 1)
V(d2) = (1, 1, 1, 1, 0, 0, 1)
Jaganadh G Elements of Text Mining
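A small sketch of this step in Python (my own illustration, not from the original deck; the vocabulary follows the index defined earlier):

vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']

def doc_vector(tokens, vocab):
    # tf(t, d) for every vocabulary term, in index order
    return [tokens.count(t) for t in vocab]

d1 = "john likes to watch movies mary likes too".split()
d2 = "john also likes to watch football games".split()
print doc_vector(d1, vocab)   # [0, 0, 1, 2, 1, 1, 1]
print doc_vector(d2, vocab)   # [1, 1, 1, 1, 0, 0, 1]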
Vector Space Matrix
The document vectors can be represented as matrix
M(|D| × F)

where |D| is the cardinality of the document space:

M(|D| × F) = [ 0 0 1 2 1 1 1 ]
             [ 1 1 1 1 0 0 1 ]
Jaganadh G Elements of Text Mining
Vector Normalization
Normalized Vector
A normalized vector is represented as v̂ = v⃗ / ‖v⃗‖p, where v̂ is the unit vector (the normalized vector), v⃗ is the vector to be normalized, and ‖v⃗‖p is the norm (magnitude or length) of the vector v⃗ in the Lp space (Lebesgue spaces). a

a http://en.wikipedia.org/wiki/Lp_space
Jaganadh G Elements of Text Mining
Vector Normalization
Length of a vector is calculated using the Euclidean norm.6
Non-normalized vector: v⃗ = (v1, v2, v3, ..., vn)
Length of vector: ‖v⃗‖ = √(v1² + v2² + v3² + ... + vn²)
With norm: ‖v⃗‖p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)
It can be simplified as:
‖v⃗‖p = ( Σ_{i=1}^{n} |vi|^p )^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining
Vector Normalization
L2 Norm
The norm which we apply here is the L2 norm, which is also called the Euclidean norm. a
It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm
Jaganadh G Elements of Text Mining
Vector Normalization
v⃗(d1) = (0, 0, 1, 2, 1, 1, 1)
v̂(d1) = v⃗(d1) / ‖v⃗(d1)‖2
      = (0, 0, 1, 2, 1, 1, 1) / √(0² + 0² + 1² + 2² + 1² + 1² + 1²)
      = (0, 0, 1, 2, 1, 1, 1) / √8
      = (0/√8, 0/√8, 1/√8, 2/√8, 1/√8, 1/√8, 1/√8)
      = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Now our normalized vector v̂(d1) has an L2 norm ‖v̂(d1)‖2 = 1.0.
Jaganadh G Elements of Text Mining
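The same normalization as a numpy one-liner (an added illustration, not part of the original slides):

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_hat = v_d1 / np.linalg.norm(v_d1, ord=2)   # divide by the L2 norm (sqrt(8) here)
print v_hat                   # [ 0.  0.  0.3536  0.7071  0.3536  0.3536  0.3536]
print np.linalg.norm(v_hat)   # 1.0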
IDF
IDF
idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )
where |{d : t ∈ d}| is the count of documents d in which 't' is present (i.e. tf(t, d) ≠ 0).
Jaganadh G Elements of Text Mining
Finding IDF
idf(ti) = log( |D| / (1 + |{d : ti ∈ d}|) ) = log(2/1) = 0.69314718

idf(football) = log(2 / (1 + 1)) = 0.0
idf(games)    = log(2 / (1 + 1)) = 0.0
idf(john)     = log(2 / (1 + 2)) = -0.40546510810816444
idf(likes)    = log(2 / (1 + 2)) = -0.40546510810816444
idf(mary)     = log(2 / (1 + 1)) = 0.0
idf(movies)   = log(2 / (1 + 1)) = 0.0
idf(watch)    = log(2 / (1 + 1)) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
Jaganadh G Elements of Text Mining
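These idf values can be checked with a few lines of Python (a sketch I added; docs and vocab mirror the toy example):

import math

docs = [['john', 'likes', 'watch', 'movies', 'mary', 'likes'],
        ['john', 'likes', 'watch', 'football', 'games']]
vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']

def idf(term, docs):
    df = sum(1 for doc in docs if term in doc)   # |{d : t in d}|
    return math.log(len(docs) / (1.0 + df))

print [round(idf(t, docs), 4) for t in vocab]
# [0.0, 0.0, -0.4055, -0.4055, 0.0, 0.0, 0.0]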
TF-IDF weight
Finding TF-IDF weight
M(|D| × F) × M(idf)

[ tf(t1,d1) tf(t2,d1) tf(t3,d1) tf(t4,d1) tf(t5,d1) tf(t6,d1) tf(t7,d1) ]
[ tf(t1,d2) tf(t2,d2) tf(t3,d2) tf(t4,d2) tf(t5,d2) tf(t6,d2) tf(t7,d2) ]

        ×

[ idf(t1)  0        0        0        0        0        0       ]
[ 0        idf(t2)  0        0        0        0        0       ]
[ 0        0        idf(t3)  0        0        0        0       ]
[ 0        0        0        idf(t4)  0        0        0       ]
[ 0        0        0        0        idf(t5)  0        0       ]
[ 0        0        0        0        0        idf(t6)  0       ]
[ 0        0        0        0        0        0        idf(t7) ]
Jaganadh G Elements of Text Mining
TF-IDF weight
= [ tf(t1,d1)×idf(t1)  tf(t2,d1)×idf(t2)  tf(t3,d1)×idf(t3)  tf(t4,d1)×idf(t4)  tf(t5,d1)×idf(t5)  tf(t6,d1)×idf(t6)  tf(t7,d1)×idf(t7) ]
  [ tf(t1,d2)×idf(t1)  tf(t2,d2)×idf(t2)  tf(t3,d2)×idf(t3)  tf(t4,d2)×idf(t4)  tf(t5,d2)×idf(t5)  tf(t6,d2)×idf(t6)  tf(t7,d2)×idf(t7) ]
Jaganadh G Elements of Text Mining
TF-IDF Normalization
L2 Normalization
M(tf-idf) = M(tf-idf) / ‖M(tf-idf)‖2  (each row is divided by its L2 norm)
Jaganadh G Elements of Text Mining
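A numpy sketch of the weighting and L2 normalization steps (my illustration of the formulas above, not sklearn's exact smoothing):

import numpy as np

tf = np.array([[0, 0, 1, 2, 1, 1, 1],
               [1, 1, 1, 1, 0, 0, 1]], dtype=float)
df = (tf > 0).sum(axis=0)            # document frequency of each term
idf = np.log(len(tf) / (1.0 + df))   # idf(t) = log(|D| / (1 + df))
tf_idf = tf.dot(np.diag(idf))        # multiply by the diagonal idf matrix
tf_idf = np.array([row / np.linalg.norm(row) for row in tf_idf])  # L2 row norms
print tf_idf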
TF-IDF
Practice with Python and sklearn 7
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# min_n is the old sklearn API of the deck's era; newer versions use ngram_range
vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfd.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, i in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[i])

7 http://scikit-learn.org
Jaganadh G Elements of Text Mining
N-Grams
N-Gram
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs, according to the application. a
Unigrams are single words
Bigrams are sequences of two words
Trigrams are sequences of three words
a http://en.wikipedia.org/wiki/N-gram
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-1)
Practice with Python
d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-2, wi-1)
Practice with Python
d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from list of words
def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too"\
    .lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure the strength of word association:

I(wi, wj) = log2( P(wi, wj) / (P(wi) P(wj)) ) ≈ log2( N C(wi, wj) / (C(wi) C(wj)) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev. slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))) / \
            math.log(2)
    return minfo
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure the strength of word association:

t(wi, wj) = ( mean(P(wi, wj)) - mean(P(wi)) mean(P(wj)) ) / √( σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)) )
          ≈ ( C(wi, wj) - (1/N) C(wi) C(wj) ) / √( C(wi, wj) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

t(strong, tea) = ( C(strong, tea) - (1/N) C(strong) C(tea) ) / √( C(strong, tea) )
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev. slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc
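A usage sketch for both association measures defined above (my addition; the file name follows the GPL examples used earlier, and the top pairs will vary with the corpus):

from operator import itemgetter

words = open('gpl-2.0.txt', 'r').read().lower().split()
mi = mutual_info(words)
ts = tscore(words)
# five most strongly associated bigrams under each measure
print sorted(mi.iteritems(), key=itemgetter(1), reverse=True)[:5]
print sorted(ts.iteritems(), key=itemgetter(1), reverse=True)[:5]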
Jaganadh G Elements of Text Mining
Document Classification
Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents;

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance;

unsupervised document classification is entirely executed without reference to external information.

8 http://en.wikipedia.org/wiki/Document_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (or classifier function) γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

The main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9
9 Image taken from http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule:

Bayes' rule: P(H|E) = ( P(E|H) × P(H) ) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.
Jaganadh G Elements of Text Mining
Naïve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have:

P(raining | dark cloud) = ( P(dark cloud | raining) × P(raining) ) / P(dark cloud)

For multiple evidences:

P(H | E1, E2, ..., En) = ( P(E1, E2, ..., En | H) × P(H) ) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as:

P(H | E1, E2, ..., En) = ( P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) ) / P(E1, E2, ..., En)
Jaganadh G Elements of Text Mining
Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building the Naïve Bayes Model

The Naïve Bayes model for the training set will look like:

|V| = 6

C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
T  0.5    15  0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
E  0.5    12  0.05555556   0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in the class) / (number of all documents)
ni = the total word frequency in each class:
n(terrorism) = 2+1+3+1+1+1+1+1+1+2+1 = 15
n(entertainment) = 1+2+1+1+1+1+1+2+2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class.
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15
P(kill|Entertainment) = (0+0+0)/12 = 0/12
Jaganadh G Elements of Text Mining
Building the Naïve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2+1+1+1) / (15+|V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0+0+0+1) / (12+|V|) = 1/18 = 0.0555
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt            2     1       2      0      0   1  ?

To find the posterior probability:

P(ci|W) = P(ci) × Π_{j=1}^{|V|} P(wj|ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P(Terrorism|W) = P(Terrorism) × P(kill|T)² × P(bomb|T)¹ × P(kidnap|T)² × P(music|T)⁰ × P(movie|T)⁰ × P(tv|T)¹
= 0.5 × 0.2380² × 0.1904¹ × 0.3333² × 0.0476⁰ × 0.0952⁰ × 0.0952¹
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10⁻⁵

P(Entertainment|W) = P(Entertainment) × P(kill|E)² × P(bomb|E)¹ × P(kidnap|E)² × P(music|E)⁰ × P(movie|E)⁰ × P(tv|E)¹
= 0.5 × 0.0555² × 0.1111¹ × 0.1111² × 0.3333⁰ × 0.2777⁰ × 0.1111¹
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10⁻⁷

The document is classified as "Terrorism" because it got the highest value.
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities:

log P(Terrorism|W) = log(0.5 × 0.2380² × 0.1904¹ × 0.3333² × 0.0476⁰ × 0.0952⁰ × 0.0952¹)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

log P(Entertainment|W) = log(0.5 × 0.0555² × 0.1111¹ × 0.1111² × 0.3333⁰ × 0.2777⁰ × 0.1111¹)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
Jaganadh G Elements of Text Mining
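The worked example above, condensed into a short Python sketch (my own code, not from the original deck; it computes class totals directly from the training table, so the Entertainment numbers differ slightly from the slides' n = 12, but the ranking is the same):

from __future__ import division
import math

vocab = ['kill', 'bomb', 'kidnap', 'music', 'movie', 'tv']
train = {'Terrorism':     [[2, 1, 3, 0, 0, 1], [1, 1, 1, 0, 0, 0], [1, 1, 2, 0, 1, 0]],
         'Entertainment': [[0, 1, 0, 2, 1, 1], [0, 0, 1, 1, 1, 0], [0, 0, 0, 2, 2, 2]]}
test = [2, 1, 2, 0, 0, 1]

total_docs = sum(len(docs) for docs in train.values())
for label, docs in train.items():
    prior = len(docs) / total_docs
    counts = [sum(col) for col in zip(*docs)]  # per-word counts in the class
    n = sum(counts)                            # total tokens in the class
    # log P(c) + sum_j f_j * log P(w_j|c), with Laplace smoothing
    score = math.log(prior) + sum(f * math.log((c + 1) / (n + len(vocab)))
                                  for f, c in zip(test, counts))
    print label, score   # Terrorism wins (higher log probability)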
Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes classifier:
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in a document. a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting the multi-variate Bernoulli model for text classification, our document space representation will look like:
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by the presence or absence of the word.
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In the model we used a simple bag of words representation. We can use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}_{i=1}^{n}, where xi ∈ R^p (input feature vector) and yi ∈ {1, -1} is the corresponding label, whether (yi = +1) or (yi = -1). To start with, we assume that our input feature vectors are linearly separable; that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that:

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = -1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate the positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the scatter regions of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called a maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and the bibliography.
11 http://www.statsoft.com/textbook/support-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multinomial Naïve Bayes classifier with Python Sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build an SVM classifier with Python Sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multi-variate Naïve Bayes classifier with Python NLTK: 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adopted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                               Actual
                      Positive              Negative
Predicted  Positive   True Positive (TP)    False Positive (FP)
           Negative   False Negative (FN)   True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = (TP + TN) / (TP + FP + FN + TN)

                      Actual
                      Positive  Negative  Total
Predicted  Positive        562        77    639
           Negative        225       436    661
Total                      787       513   1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision, which indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall, which indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
                      Actual
                      Positive  Negative  Total
Predicted  Positive        562        77    639
           Negative        225       436    661
Total                      787       513   1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
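All the numbers in this section can be reproduced from the four confusion matrix cells; a small sketch I added for convenience (Python prints more decimals than the slides show):

from __future__ import division
import math

TP, FP, FN, TN = 562, 77, 225, 436

accuracy = (TP + TN) / (TP + FP + FN + TN)            # ~0.77
precision = TP / (TP + FP)                            # ~0.88
recall = TP / (TP + FN)                               # ~0.71
f1 = 2 * precision * recall / (precision + recall)    # ~0.79
error_rate = (FP + FN) / (TP + FP + FN + TN)          # ~0.23
fallout = FP / (FP + TN)                              # ~0.15
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))    # ~0.55
print accuracy, precision, recall, f1, error_rate, fallout, mcc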
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
Error Rate = (FP + FN) / (TP + FP + FN + TN)

                      Actual
                      Positive  Negative  Total
Predicted  Positive        562        77    639
           Negative        225       436    661
Total                      787       513   1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                      Actual
                      Positive  Negative  Total
Predicted  Positive        562        77    639
           Negative        225       436    661
Total                      787       513   1300

Fall-out = 77 / (77 + 436) = 0.15
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score (Positive) = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78
F1 Score (Negative) = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN - FP × FN) / √( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

MCC = (562 × 436 - 77 × 225) / √( (562 + 77)(562 + 225)(436 + 77)(436 + 225) ) = 0.55
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis, and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
Jaganadh G Elements of Text Mining
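A minimal sketch of plotting ROC and computing AUC with sklearn (my illustration; y_true and y_score are toy values, not from the slides):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

y_true = [0, 0, 1, 1, 0, 1, 1, 0]                     # actual labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.5]   # classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print "AUC:", auc(fpr, tpr)

plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.show()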
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child \
of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the \
second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract
Jaganadh G Elements of Text Mining
References
Chris Smith, Machine Learning, Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith, Machine Learning, Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb. 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.
Jaganadh G Elements of Text Mining
Plotting top N words
Python code to plot top 20 words from a text
import numpy as np
import matplotlibpyplot as plt
def plot_freq(text)
tfw = top_words(text n= 20)
x = range(len(tfw))
np = len(tfw)
y = []
for item in range(np)
y = y + [tfw[item][1]]
pltplot(xyrsquoborsquols=rsquodottedrsquo)
pltxticks(range(0 len(words) + 1 1))
pltyticks(range(0 max(y) + 1 10))
pltxlabel(Word Ranking)
pltylabel(Word Frequency)
pltshow()
text = open(rsquogpl-20txtrsquorsquorrsquo)read()
plot_freq(text)
Jaganadh G Elements of Text Mining
Top N Words
Plot of top 50 words from GPL v2 (without filtering stop words)
Jaganadh G Elements of Text Mining
Plotting top N words
Python code to plot top 50 words from a text This plot will show words in the plot
import numpy as np
import matplotlibpyplot as plt
def plot_freq_tag(text)
tfw = top_words(text n= 50)
words = [tfw[i][0] for i in range(len(tfw))]
x = range(len(tfw))
np = len(tfw)
y = []
for item in range(np)
y = y + [tfw[item][1]]
fig = pltfigure()
ax = figadd_subplot(111xlabel=Word Rankylabel=Word Freqquncy)
axset_title(rsquoTop 50 wordsrsquo)
axplot(x y rsquogo-rsquols=rsquodottedrsquo)
pltxticks(range(0 len(words) + 1 1))
pltyticks(range(0 max(y) + 1 10))
for i label in enumerate(words)
plttext (x[i] y[i] label rotation=45)
pltshow()
text = open(rsquogpl-20txtrsquorsquorrsquo)read()
plot_freq(text)
Jaganadh G Elements of Text Mining
Top N Words
Plot of top 50 words from GPL v2 (without filtering stop words)
Jaganadh G Elements of Text Mining
Plotting histogram of top N words
Python code to plot histogram of top 50 words
import numpy as np
import matplotlibpyplot as plt
def plot_hist(text)
tw = top_words(text)
words = [tw[i][0] for i in range(len(tw))]
freq = [tw[j][1] for j in range(len(tw))]
pos = nparange(len(words))
width = 10
ax = pltaxes(frameon=True)
axset_xticks(pos)
axset_yticks(range(0max(freq)10))
axset_xticklabels(wordsrotation=rsquoverticalrsquofontsize=9)
pltbar(posfreqwidth color=rsquobrsquo)
pltshow()
text = open(rsquogpl-20txtrsquorsquorrsquo)read()
plot_hist(text)
Jaganadh G Elements of Text Mining
Histogram
Histogram of top 50 words from GPL v2 (without filtering stop words)
Jaganadh G Elements of Text Mining
Lexical Dispersion Plot
A lexical dispersion plot shows position of a word in given text I have created somePython code to generate lexical dispersion plot with reference to a given word list3
def dispersion_plot(textwords)
wordst = tokenize(text)
points = [(xy) for x in range(len(wordst))
for y in range(len(words)) if wordst[x] == words[y]]
if points
xy = zip(points)
else
x = y = ()
pltplot(xygoscalex=2)
pltyticks(range(len(words))wordscolor=b)
pltylim(-1len(words))
plttitle(Lexical Dispersion Plot)
pltxlabel(Word Offset)
pltshow()
gpl = open(rsquogpl-20txtrsquorsquorrsquo)read()
dispersion_plot(gpl[rsquosoftwarersquorsquolicensersquorsquocopyrsquorsquognursquorsquoprogramrsquorsquofreersquo])
3Code taken fromhttpnltkgooglecodecomsvntrunkdocapinltkdrawdispersion-pysrchtml and modified
Jaganadh G Elements of Text Mining
Lexical Dispersion Plot
Lexical Dispersion Plot plot from GPL text
Jaganadh G Elements of Text Mining
Tag Cloud
A tag cloud (word cloud or weighted list in visual design) is a visual representation for text datatypically used to depict keyword meta-data (tags) on websites or to visualize free form textThere is an interesting python tool to generate tag clouds from text called pytagcloud 4 herecomes an example of creating tag cloud from first 100 words from GPL text
from pytagcloud import create_tag_image make_tags
from pytagcloudlangcounter import get_tag_counts
def create_tag_cloud(text)
words = tokenize(text)
doc = join(d for d in words[100])
tags = make_tags(get_tag_counts(doc) maxsize=80)
create_tag_image(tags rsquogplpngrsquo size=(900 600)
fontname=rsquoPhilosopherrsquo)
gpl = open(rsquogpl-20txtrsquorsquorrsquo)read()
create_tag_cloud(gpl)
4httpsgithubcomatizoPyTagCloud
Jaganadh G Elements of Text Mining
Tag Cloud
Tag cloud from GPL text
Jaganadh G Elements of Text Mining
Word co-occurrence
Word co-occurrence analysis is used to construct lexicon ontologies etc In general itaims to find similarities between word pairs
A word co-occurrence matrix is a square of N timesN matrix where N corresponds totalnumber of unique words in a corpus A cell mij contains the number of times mi
co-occur with word mj with in a specific contextmdash a natural unit such as a sentence or acertain window of m words Note that the upper and lower triangles of the matrix areidentical since co-occurrence is a symmetric relation a
aJimmy LinScalable Language Processing Algorithms for the Masses A Case Study inComputing Word Co-occurrence Matrices with MapReducewwwaclweborganthologyD08-1044
Jaganadh G Elements of Text Mining
Word co-occourance
w1 w2 w3 wnw1 m11 m12 m13 m1n
w2 m21 m22 m23 m2n
w3 m31 m32 m33 m3n
wn mn1 mn2 mn3 mnnShape of a co-occurrence matrix
Jaganadh G Elements of Text Mining
Word co-occurrence
Finding co-occurrence matrix with Python
def cooccurrence_matrix_corpus(corpus)
matrix = defaultdict(lambda defaultdict(int))
for corpora in corpus
for i in xrange(len(corpora)-1)
for j in xrange(i+1 len(corpora))
word1 word2 = [corpora[i]corpora[j]]
matrix[word1][word2] += 1
matrix[word2][word1] += 1
return matrix
corpus = [[rsquow1rsquorsquow2rsquorsquow3rsquo][rsquow4rsquorsquow5rsquorsquow6rsquo]]
ccm = cooccurrence_matrix_corpus(corpus)
Jaganadh G Elements of Text Mining
Word co-occurrence
hong
toy zdenek
czech
movie
julie
like
chan
story
tango
one
sverak
martial
woody
dating
film
first
Word co-occurrence visualization from 100 positive movie reviews The plot show top first word and top four
associated words For each associated words again associated words are plotted
Jaganadh G Elements of Text Mining
Stop Words
Stop Words
In computing stop words are words which are filtered out prior to or after processing ofnatural language data (text) In terms of linguistics these words are called as functionwords Words like rsquoarsquo rsquoanrsquo rsquothersquo are examples for stop words There is no defined set ofstop words available Different applications and research groups uses different sets o stopwordsGenerally stop words are omitted in text mining process The frequency of stop wordswill be very high in any corpus compared to content wordsPointers to some good stopword list is available at httpenwikipediaorgwikiStop_words
Jaganadh G Elements of Text Mining
Stop Words Filter
def stop_filter(words)
stops = [rsquoirsquo rsquomersquo rsquomyrsquo rsquomyselfrsquo rsquowersquo rsquoourrsquo rsquooursrsquo rsquoourselvesrsquo rsquoyoursquo
rsquoyourrsquo rsquoyoursrsquo rsquoyourselfrsquo rsquoyourselvesrsquo rsquohersquo rsquohimrsquo rsquohisrsquo rsquohimselfrsquo rsquoshersquo
rsquoherrsquo rsquohersrsquo rsquoherselfrsquo rsquoitrsquo rsquoitsrsquo rsquoitselfrsquorsquotheyrsquo rsquothemrsquo rsquotheirrsquo rsquotheirsrsquo
rsquothemselvesrsquo rsquowhatrsquo rsquowhichrsquo rsquowhorsquo rsquowhomrsquo rsquothisrsquo rsquothatrsquo rsquothesersquo rsquothosersquo
rsquoamrsquo rsquoisrsquo rsquoarersquo rsquowasrsquo rsquowerersquorsquobersquo rsquobeenrsquo rsquobeingrsquo rsquohaversquo rsquohasrsquo rsquohadrsquo
rsquohavingrsquo rsquodorsquo rsquodoesrsquo rsquodidrsquo rsquodoingrsquo rsquoarsquo rsquoanrsquo rsquothersquo rsquoandrsquo rsquobutrsquo rsquoifrsquo rsquoorrsquo rsquobecausersquo
rsquoasrsquo rsquountilrsquo rsquowhilersquo rsquoofrsquo rsquoatrsquo rsquobyrsquo rsquoforrsquorsquowithrsquo rsquoaboutrsquo rsquoagainstrsquo rsquobetweenrsquo rsquointorsquo
rsquothroughrsquo rsquoduringrsquo rsquobeforersquo rsquoafterrsquo rsquoaboversquo rsquobelowrsquo rsquotorsquo rsquofromrsquo rsquouprsquo rsquodownrsquo rsquoinrsquo rsquooutrsquo
rsquoonrsquo rsquooffrsquo rsquooverrsquo rsquounderrsquo rsquoagainrsquo rsquofurtherrsquorsquothenrsquo rsquooncersquo rsquoherersquo rsquotherersquo rsquowhenrsquo rsquowherersquo
rsquowhyrsquo rsquohowrsquo rsquoallrsquo rsquoanyrsquo rsquobothrsquo rsquoeachrsquo rsquofewrsquo rsquomorersquo rsquomostrsquo rsquootherrsquo rsquosomersquo rsquosuchrsquo rsquonorsquo
rsquonorrsquorsquonotrsquo rsquoonlyrsquo rsquoownrsquo rsquosamersquo rsquosorsquo rsquothanrsquo rsquotoorsquo rsquoveryrsquo rsquosrsquo rsquotrsquo rsquocanrsquo rsquowillrsquo rsquojustrsquo
rsquodonrsquo rsquoshouldrsquo rsquonowrsquo]
stopless = [word in words if word not in stops]
return stopless
Jaganadh G Elements of Text Mining
Bag of Words
The bag-of-words model is a simplifying representation used in natural languageprocessing and information retrieval (IR) In this model a text (such as a sentence or adocument) is represented as an un-ordered collection of words disregarding grammarand even word order a
Analyzing text by only analyzing frequency of words is called as bag of words model
ahttpenwikipediaorgwikiBag of words model
Jaganadh G Elements of Text Mining
Bag of Words
Example
d1 John likes to watch movies Mary likes too
d2 John also likes to watch football games
rsquofootballrsquo 0 rsquowatchrsquo 6 rsquomoviesrsquo 5 rsquogamesrsquo 1
rsquolikesrsquo 3 rsquojohnrsquo 2 rsquomaryrsquo 4
after removing stopwords
[0 0 1 2 1 1 1]
[1 1 1 1 0 0 1]
each entry of the vectors refers to count of
the corresponding entry in the dictionary
a
aExample taken from httpenwikipediaorgwikiBag of words model
Jaganadh G Elements of Text Mining
Bag of Words
Documents
d1 John likes to watch movies Mary likes too
d2 John also likes to watch football games
Vocabulary Index
V I(t) =
0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo
Jaganadh G Elements of Text Mining
Bag of Words
        football  games  john  likes  mary  movies  watch
doc1    0         0      1     2      1     1       1
doc2    1         1      1     1      0     0       1
Jaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python and sklearn 5
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
    ngram_range=(1, 1), stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org
Jaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with plain Python (just a toy sample):

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set().union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
        in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
        for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies Mary likes too",
    "John also likes to watch football games")
bag_of_words(d)
Jaganadh G Elements of Text Mining
TF-IDF
Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.
TF-IDF
tf\text{-}idf(t) = tf(t, d) \times idf(t)

where 't' is a term in document 'd', and tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = \sum_{x \in d} fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise
Jaganadh G Elements of Text Mining
TF-IDF
TF-IDF
and idf(t) = \log \frac{|D|}{1 + |\{d : t \in d\}|}
where |\{d : t \in d\}| is the number of documents in which 't' is present, i.e. tf(t, d) \neq 0, and |D| is the cardinality of the document space.
Jaganadh G Elements of Text Mining
TF
TF
tf(t, d) = \sum_{x \in d} fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1
Jaganadh G Elements of Text Mining
Document Vector
To create a document vector space:

\vec{v}(d_n) = (tf(t_1, d_n), tf(t_2, d_n), \ldots, tf(t_n, d_n))

To represent 'd1' and 'd2' as vectors:

\vec{v}(d_1) = (tf(t_1, d_1), tf(t_2, d_1), tf(t_3, d_1), tf(t_4, d_1), tf(t_5, d_1), tf(t_6, d_1), tf(t_7, d_1))
\vec{v}(d_2) = (tf(t_1, d_2), tf(t_2, d_2), tf(t_3, d_2), tf(t_4, d_2), tf(t_5, d_2), tf(t_6, d_2), tf(t_7, d_2))

which evaluates to:
\vec{v}(d_1) = (0, 0, 1, 2, 1, 1, 1)
\vec{v}(d_2) = (1, 1, 1, 1, 0, 0, 1)
Jaganadh G Elements of Text Mining
Vector Space Matrix
The document vectors can be represented as a matrix M_{|D| \times F}, where |D| is the cardinality of the document space:

M_{|D| \times F} =
[ 0 0 1 2 1 1 1
  1 1 1 1 0 0 1 ]
Jaganadh G Elements of Text Mining
Vector Normalization
Normalized Vector
A normalized vector is represented as \hat{v} = \frac{\vec{v}}{\|\vec{v}\|_p}, where \hat{v} is the unit vector (the normalized vector), \vec{v} is the vector to be normalized, and \|\vec{v}\|_p is the norm (magnitude or length) of the vector \vec{v} in the L^p space (Lebesgue spaces). a
a http://en.wikipedia.org/wiki/Lp_space
Jaganadh G Elements of Text Mining
Vector Normalization
Length of a vector is calculated using the Euclidean norm. 6
Non-normalized vector: \vec{v} = (v_1, v_2, v_3, \ldots, v_n)
Length of vector: \|\vec{v}\| = \sqrt{v_1^2 + v_2^2 + v_3^2 + \cdots + v_n^2}
With norm: \|\vec{v}\|_p = (|v_1|^p + |v_2|^p + |v_3|^p + \cdots + |v_n|^p)^{1/p}
It can be simplified as:
\|\vec{v}\|_p = (\sum_{i=1}^{n} |v_i|^p)^{1/p}

6 http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining
Vector Normalization
L2 Norm
The norm which we apply here is the L2 norm, also called the Euclidean norm. a
It is a common norm used to measure the length of a vector, where p = 2.
a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm
Jaganadh G Elements of Text Mining
Vector Normalization
\vec{v}(d_1) = (0, 0, 1, 2, 1, 1, 1)
\hat{v}(d_1) = \frac{\vec{v}(d_1)}{\|\vec{v}(d_1)\|_2}
= \frac{(0, 0, 1, 2, 1, 1, 1)}{\sqrt{0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2}}
= \frac{(0, 0, 1, 2, 1, 1, 1)}{\sqrt{8}}
= (\frac{0}{\sqrt{8}}, \frac{0}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{2}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}})
= (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)
Our normalized vector \hat{v}(d_1) now has L2-norm \|\hat{v}(d_1)\|_2 = 1.0.
Jaganadh G Elements of Text Mining
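The computation above can be checked with NumPy; a minimal sketch (not part of the original deck):

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_hat = v_d1 / np.linalg.norm(v_d1)  # np.linalg.norm defaults to the L2 (Euclidean) norm
print(v_hat)                  # [ 0. 0. 0.3536 0.7071 0.3536 0.3536 0.3536]
print(np.linalg.norm(v_hat))  # 1.0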
IDF
IDF
idf(t) = \log \frac{|D|}{1 + |\{d : t \in d\}|}
where |\{d : t \in d\}| is the number of documents in which 't' is present, i.e. tf(t, d) \neq 0.
Jaganadh G Elements of Text Mining
Finding IDF
idf(t_i) = \log \frac{|D|}{1 + |\{d : t_i \in d\}|} = \log \frac{2}{1} = 0.69314718

idf(football) = \log \frac{2}{1+1} = 0.0
idf(games) = \log \frac{2}{1+1} = 0.0
idf(john) = \log \frac{2}{1+2} = -0.40546510810816444
idf(likes) = \log \frac{2}{1+2} = -0.40546510810816444
idf(mary) = \log \frac{2}{1+1} = 0.0
idf(movies) = \log \frac{2}{1+1} = 0.0
idf(watch) = \log \frac{2}{1+1} = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
Jaganadh G Elements of Text Mining
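The idf values above can be verified in a few lines of Python; a sketch assuming the vocabulary order (football, games, john, likes, mary, movies, watch):

import math

docs = [['john', 'likes', 'watch', 'movies', 'mary', 'likes'],
        ['john', 'also', 'likes', 'watch', 'football', 'games']]
for term in ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']:
    df = sum(1 for doc in docs if term in doc)  # number of documents containing the term
    print('idf(%s) = %f' % (term, math.log(len(docs) / (1.0 + df))))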
TF-IDF weight
Finding TF-IDF weight
M_{|D| \times F} \times M_{idf} =

[ tf(t_1, d_1) tf(t_2, d_1) tf(t_3, d_1) tf(t_4, d_1) tf(t_5, d_1) tf(t_6, d_1) tf(t_7, d_1)
  tf(t_1, d_2) tf(t_2, d_2) tf(t_3, d_2) tf(t_4, d_2) tf(t_5, d_2) tf(t_6, d_2) tf(t_7, d_2) ]
\times
[ idf(t_1) 0 0 0 0 0 0
  0 idf(t_2) 0 0 0 0 0
  0 0 idf(t_3) 0 0 0 0
  0 0 0 idf(t_4) 0 0 0
  0 0 0 0 idf(t_5) 0 0
  0 0 0 0 0 idf(t_6) 0
  0 0 0 0 0 0 idf(t_7) ]
Jaganadh G Elements of Text Mining
TF-IDF weight
= [ tf(t_1, d_1) \times idf(t_1)  tf(t_2, d_1) \times idf(t_2)  tf(t_3, d_1) \times idf(t_3)  tf(t_4, d_1) \times idf(t_4)  tf(t_5, d_1) \times idf(t_5)  tf(t_6, d_1) \times idf(t_6)  tf(t_7, d_1) \times idf(t_7)
    tf(t_1, d_2) \times idf(t_1)  tf(t_2, d_2) \times idf(t_2)  tf(t_3, d_2) \times idf(t_3)  tf(t_4, d_2) \times idf(t_4)  tf(t_5, d_2) \times idf(t_5)  tf(t_6, d_2) \times idf(t_6)  tf(t_7, d_2) \times idf(t_7) ]
Jaganadh G Elements of Text Mining
TF-IDF Normalization
L2 Normalization
M_{tf\text{-}idf} = \frac{M_{tf\text{-}idf}}{\|M_{tf\text{-}idf}\|_2}
Jaganadh G Elements of Text Mining
TF-IDF
Practice with Python and sklearn 7
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    ngram_range=(1, 1), stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm='l2')
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, f in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[f])

7 http://scikit-learn.org
Jaganadh G Elements of Text Mining
N-Grams
N-Gram
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words, or base pairs according to the application. a
Unigrams are single words
Bigrams are sequences of two words
Trigrams are sequences of three words
a http://en.wikipedia.org/wiki/N-gram
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P(w_i | w_1, w_2, \ldots, w_{i-1}) \approx P(w_i | w_{i-1})

Practice with Python:

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [' '.join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
'movies mary', 'mary likes', 'likes too']
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P(w_i | w_1, w_2, \ldots, w_{i-1}) \approx P(w_i | w_{i-2}, w_{i-1})

Practice with Python:

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [' '.join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
'watch movies mary', 'movies mary likes', 'mary likes too']
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from list of words
def ngrams(words, n=2):
    grams = [' '.join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure the strength of word association:

I(w_i, w_j) = \log_2 \frac{P(w_i, w_j)}{P(w_i) P(w_j)} \approx \log_2 \frac{N C(w_i, w_j)}{C(w_i) C(w_j)}

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus, C(w_i, w_j) is the frequency of the bigram w_i w_j, and N is the total number of words in the corpus.

I(strong, tea) = \log_2 \frac{P(strong, tea)}{P(strong) P(tea)} \approx \log_2 \frac{N C(strong, tea)}{C(strong) C(tea)}
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    for word in words:
        wordcount[word] = 1 + wordcount.get(word, 0)
    for gram in grams:
        gramcount[gram] = 1 + gramcount.get(gram, 0)
    for gram in grams:
        minfo[gram] = math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) / math.log(2)
    return minfo
Jaganadh G Elements of Text Mining
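A usage sketch on a toy word list (the bigram 'strong tea' occurs twice in nine words, 'strong' three times, 'tea' twice):

words = 'strong tea is strong and strong tea is good'.split()
mi = mutual_info(words)
print(mi['strong tea'])  # log2(9 * 2 / (3 * 2)) = 1.5849...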
t-score
t-score
Statistical test to measure the strength of word association:

t(w_i, w_j) = \frac{mean(P(w_i, w_j)) - mean(P(w_i)) \, mean(P(w_j))}{\sqrt{\sigma^2(P(w_i, w_j)) + \sigma^2(P(w_i)) \, \sigma^2(P(w_j))}} \approx \frac{C(w_i, w_j) - \frac{1}{N} C(w_i) C(w_j)}{\sqrt{C(w_i, w_j)}}

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus, C(w_i, w_j) is the frequency of the bigram w_i w_j, and N is the total number of words in the corpus.

t(strong, tea) = \frac{C(strong, tea) - \frac{1}{N} C(strong) C(tea)}{\sqrt{C(strong, tea)}}
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    for word in words:
        wordcount[word] = 1 + wordcount.get(word, 0)
    for gram in grams:
        gramcount[gram] = 1 + gramcount.get(gram, 0)
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc
Jaganadh G Elements of Text Mining
Document Classification
Document classification or document categorization is a problem in library science, information science, and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds:

supervised document classification: performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents
semi-supervised document classification: a mixture between supervised and unsupervised classification; some documents or parts of documents are labeled by external assistance
unsupervised document classification: entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c_1, c_2, c_3, \ldots, c_m) be a set of pre-defined categories.
Let D = (d_1, d_2, d_3, \ldots, d_n) be a set of documents to be classified.
Given a training set T of labeled documents \langle d, c \rangle, where \langle d, c \rangle \in D \times C, using a learning algorithm we wish to learn a classifier (or classifier function) \gamma that maps documents to classes: \gamma : D \rightarrow C.
A supervised learning algorithm \Gamma takes a training set T and emits a learned classification function \gamma: \Gamma(T) = \gamma

\gamma(c_i, d_j) = 1 if d_j belongs to c_i, 0 otherwise

The main approaches in document classification are:
Naïve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline (figure). 9
9 Image taken from http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.
Bayes' rule: P(H|E) = \frac{P(E|H) \times P(H)}{P(E)}
The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.
Jaganadh G Elements of Text Mining
Naïve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have
P(raining | dark cloud) = \frac{P(dark cloud | raining) \times P(raining)}{P(dark cloud)}
For multiple evidences:
P(H | E_1, E_2, \ldots, E_n) = \frac{P(E_1, E_2, \ldots, E_n | H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}
With the independence assumption, we can rewrite Bayes' rule as:
P(H | E_1, E_2, \ldots, E_n) = \frac{P(E_1|H) \times P(E_2|H) \times \cdots \times P(E_n|H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}
Jaganadh G Elements of Text Mining
Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naïve Bayes Model

The Naïve Bayes model for the training set will look like:

|V|  C  P(C_i)  n_i  P(kill|C_i)  P(bomb|C_i)  P(kidnap|C_i)  P(music|C_i)  P(movie|C_i)  P(tv|C_i)
6    T  0.5     15   0.238095238  0.19047619   0.33333333     0.047619048   0.095238095   0.095238095
6    E  0.5     12   0.05555556   0.11111111   0.11111111     0.33333333    0.27777778    0.11111111

|V| = the number of vocabulary terms = 6
P(C_i) = the prior probability of each class = number of documents in a class / number of all documents
n_i = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(w_i|c_i) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15
P(kill|Entertainment) = (0+0+0)/12 = 0/12
Jaganadh G Elements of Text Mining
Building Naïve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:
P(kill|Terrorism) = (2+1+1+1)/(15+|V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0+0+0+1)/(12+|V|) = 1/18 = 0.0555
Jaganadh G Elements of Text Mining
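A sketch of the Laplace estimate in code, assuming the Terrorism word totals from the training table (kill=4, bomb=3, kidnap=6, music=0, movie=1, tv=1):

V = 6   # vocabulary size
n = 15  # total word frequency in the Terrorism class
counts = {'kill': 4, 'bomb': 3, 'kidnap': 6, 'music': 0, 'movie': 1, 'tv': 1}
for word in counts:
    p = (counts[word] + 1.0) / (n + V)  # add-one (Laplace) smoothing
    print('P(%s|Terrorism) = %f' % (word, p))
# P(kill|Terrorism) = 5/21 = 0.238095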
Testing the NB model
Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt         2     1     2       0      0      1   ?

To find the posterior probability:

P(c_i | W) = P(c_i) \times \prod_{j=1}^{V} P(w_j | c_i)
Jaganadh G Elements of Text Mining
Testing the NB model
P(Terrorism|W) = P(Terrorism) \times P(kill|Terrorism)^2 \times P(bomb|Terrorism) \times P(kidnap|Terrorism)^2 \times P(music|Terrorism)^0 \times P(movie|Terrorism)^0 \times P(tv|Terrorism)
= 0.5 \times 0.2380^2 \times 0.1904^1 \times 0.3333^2 \times 0.0476^0 \times 0.0952^0 \times 0.0952^1
= 0.5 \times 0.0566 \times 0.1904 \times 0.1110 \times 1 \times 1 \times 0.0952
= 5.7 \times 10^{-5}

P(Entertainment|W) = P(Entertainment) \times P(kill|Entertainment)^2 \times P(bomb|Entertainment) \times P(kidnap|Entertainment)^2 \times P(music|Entertainment)^0 \times P(movie|Entertainment)^0 \times P(tv|Entertainment)
= 0.5 \times 0.0555^2 \times 0.1111^1 \times 0.1111^2 \times 0.3333^0 \times 0.2777^0 \times 0.1111^1
= 0.5 \times 0.0030 \times 0.1111 \times 0.0123 \times 1 \times 1 \times 0.1111
= 2.27 \times 10^{-7}

The document has been classified as "Terrorism" because that class got the highest value.
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:
P(Terrorism|W) = \log(0.5 \times 0.2380^2 \times 0.1904^1 \times 0.3333^2 \times 0.0476^0 \times 0.0952^0 \times 0.0952^1)
= \log(0.5) + 2\log(0.2380) + 1\log(0.1904) + 2\log(0.3333) + 0\log(0.0476) + 0\log(0.0952) + 1\log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437
P(Entertainment|W) = \log(0.5 \times 0.0555^2 \times 0.1111^1 \times 0.1111^2 \times 0.3333^0 \times 0.2777^0 \times 0.1111^1)
= \log(0.5) + 2\log(0.0555) + 1\log(0.1111) + 2\log(0.1111) + 0\log(0.3333) + 0\log(0.2777) + 1\log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system still classifies the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
Jaganadh G Elements of Text Mining
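A sketch of the same comparison done in log space (base-10 logs, matching the slide; smoothed probabilities taken from the model table):

import math

# test document: kill x2, bomb x1, kidnap x2, tv x1
terror = [0.2380, 0.2380, 0.1904, 0.3333, 0.3333, 0.0952]
entertain = [0.0555, 0.0555, 0.1111, 0.1111, 0.1111, 0.1111]
log_t = math.log10(0.5) + sum(math.log10(p) for p in terror)
log_e = math.log10(0.5) + sum(math.log10(p) for p in entertain)
print(log_t)  # about -4.24 -> Terrorism wins
print(log_e)  # about -6.63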
Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes classifier:
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V, each dimension of the space t, t \in \{1, \ldots, |V|\}, corresponds to word w_t from the vocabulary. Dimension t of the vector for document d_i is written B_{it}, and is either 0 or 1, indicating whether word w_t occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in the document. a
a A Comparison of Event Models for Naive Bayes Text Classification. Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we adopt the multi-variate Bernoulli model for text classification, our document space representation will look like:
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by the presence or absence of the word.
Jaganadh G Elements of Text Mining
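scikit-learn also ships a Bernoulli NB implementation; a minimal sketch on the presence/absence table above (class labels abbreviated; not the deck's original code):

from sklearn.naive_bayes import BernoulliNB

# rows D0..D5; columns: kill, bomb, kidnap, music, movie, tv
X = [[1, 1, 1, 0, 0, 1],
     [1, 1, 1, 0, 0, 0],
     [1, 1, 1, 0, 1, 0],
     [0, 1, 0, 1, 1, 1],
     [0, 0, 1, 1, 1, 0],
     [0, 0, 0, 1, 1, 1]]
y = ['T', 'T', 'T', 'E', 'E', 'E']
clf = BernoulliNB().fit(X, y)
print(clf.predict([[1, 1, 1, 0, 0, 1]]))  # ['T']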
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.
In the NB example which we worked out, we applied the multinomial model with a simple bag of words representation. We can also use a smoothed bag of words model like TF-IDF in the multinomial naive Bayes model.
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set \{(x_i, y_i)\}_{i=1}^{n}, where x_i \in R^p (input feature vector) and y_i \in \{1, -1\} is the corresponding label, whether (y_i = +1) or (y_i = -1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = \langle w, x \rangle + b, where w \in R^p (weight vector) and b \in R (bias), such that
\langle w, x_i \rangle + b > 0 for y_i = +1
\langle w, x_i \rangle + b < 0 for y_i = -1
\langle w, x \rangle + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate the positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.
If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the scatter regions of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors. 10 If the data is not linearly separable we have to use kernel tricks. 11

10 This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and the bibliography.
11 http://www.statsoft.com/textbook/support-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multinomial Naïve Bayes classifier with Python sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build an SVM classifier with Python sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a multi-variate Naïve Bayes classifier with Python NLTK: 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adopted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.
Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                        Actual
                        Positive             Negative
Predicted  Positive     True Positive (TP)   False Positive (FP)
           Negative     False Negative (FN)  True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier

Accuracy = \frac{TP + TN}{TP + FP + FN + TN}

                        Actual
                        Positive  Negative  Total
Predicted  Positive     562       77        639
           Negative     225       436       661
           Total        787       513       1300

Accuracy = \frac{562 + 436}{562 + 77 + 225 + 436} = 0.76
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision indicates how many of the items that we identified were relevant:
Precision = \frac{TP}{TP + FP}
Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":
Recall = \frac{TP}{TP + FN}
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
                        Actual
                        Positive  Negative  Total
Predicted  Positive     562       77        639
           Negative     225       436       661
           Total        787       513       1300

Positive Precision = \frac{562}{562 + 77} = 0.87
Negative Precision = \frac{436}{225 + 436} = 0.65
Positive Recall = \frac{562}{562 + 225} = 0.71
Negative Recall = \frac{436}{77 + 436} = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong.
ErrorRate = \frac{FP + FN}{TP + FP + FN + TN}

                        Actual
                        Positive  Negative  Total
Predicted  Positive     562       77        639
           Negative     225       436       661
           Total        787       513       1300

ErrorRate = \frac{77 + 225}{562 + 77 + 225 + 436} = 0.23
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).
Fall-out = \frac{FP}{FP + TN}

                        Actual
                        Positive  Negative  Total
Predicted  Positive     562       77        639
           Negative     225       436       661
           Total        787       513       1300

Fall-out = \frac{77}{77 + 436} = 0.15
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a
F1 Score = 2 \times \frac{precision \times recall}{precision + recall}
a http://en.wikipedia.org/wiki/F1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 \times \frac{0.87 \times 0.71}{0.87 + 0.71} = 0.78
F1 Score Negative = 2 \times \frac{0.65 \times 0.84}{0.65 + 0.84} = 0.73
Jaganadh G Elements of Text Mining
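These numbers are easy to verify from the confusion matrix counts; a small sketch:

tp, fp, fn, tn = 562.0, 77.0, 225.0, 436.0
precision = tp / (tp + fp)                          # 0.87
recall = tp / (tp + fn)                             # 0.71
f1 = 2 * precision * recall / (precision + recall)  # 0.78
accuracy = (tp + tn) / (tp + fp + fn + tn)          # 0.76
print(precision, recall, f1, accuracy)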
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.
Positive predictive value = \frac{TP}{TP + FP}
Positive predictive value = \frac{562}{562 + 77} = 0.87
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.
NPV = \frac{TN}{TN + FN}
NPV = \frac{436}{436 + 225} = 0.65
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = \frac{TN}{FP + TN}
Specificity = \frac{436}{77 + 436} = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.
FDR = \frac{FP}{FP + TP}
FDR = \frac{77}{77 + 562} = 0.12
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.
MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
MCC = \frac{562 \times 436 - 77 \times 225}{\sqrt{(562 + 77)(562 + 225)(436 + 77)(436 + 225)}} = 0.55
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis, and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
Jaganadh G Elements of Text Mining
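With scikit-learn, the ROC curve and AUC can be computed from true labels and classifier scores; a minimal sketch with toy values (not from the deck):

from sklearn.metrics import roc_curve, auc

y_true = [0, 0, 1, 1]           # actual labels
scores = [0.1, 0.4, 0.35, 0.8]  # predicted probability of the positive class
fpr, tpr, thresholds = roc_curve(y_true, scores)
print(auc(fpr, tpr))            # 0.75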
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child \
of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract: 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the \
second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract/
Jaganadh G Elements of Text Mining
References
Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.
Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.
Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.
Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.
Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.
Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.
Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.
Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.
Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
Jaganadh G Elements of Text Mining
Top N Words
Plot of top 50 words from GPL v2 (without filtering stop words)
Jaganadh G Elements of Text Mining
Plotting top N words
Python code to plot the top 50 words from a text. This plot will show the words in the plot itself:

import matplotlib.pyplot as plt

def plot_freq_tag(text):
    tfw = top_words(text, n=50)  # top_words from an earlier slide
    words = [tfw[i][0] for i in range(len(tfw))]
    x = range(len(tfw))
    y = [tfw[item][1] for item in range(len(tfw))]
    fig = plt.figure()
    ax = fig.add_subplot(111, xlabel="Word Rank", ylabel="Word Frequency")
    ax.set_title('Top 50 words')
    ax.plot(x, y, 'go-', ls='dotted')
    plt.xticks(range(0, len(words) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    for i, label in enumerate(words):
        plt.text(x[i], y[i], label, rotation=45)
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq_tag(text)
Jaganadh G Elements of Text Mining
Plotting histogram of top N words
Python code to plot a histogram of the top 50 words:

import numpy as np
import matplotlib.pyplot as plt

def plot_hist(text):
    tw = top_words(text)
    words = [tw[i][0] for i in range(len(tw))]
    freq = [tw[j][1] for j in range(len(tw))]
    pos = np.arange(len(words))
    width = 1.0
    ax = plt.axes(frameon=True)
    ax.set_xticks(pos)
    ax.set_yticks(range(0, max(freq), 10))
    ax.set_xticklabels(words, rotation='vertical', fontsize=9)
    plt.bar(pos, freq, width, color='b')
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_hist(text)
Jaganadh G Elements of Text Mining
Histogram
Histogram of top 50 words from GPL v2 (without filtering stop words)
Jaganadh G Elements of Text Mining
Lexical Dispersion Plot
A lexical dispersion plot shows the position of a word in a given text. I have created some Python code to generate a lexical dispersion plot with reference to a given word list: 3

import matplotlib.pyplot as plt

def dispersion_plot(text, words):
    wordst = tokenize(text)
    points = [(x, y) for x in range(len(wordst))
        for y in range(len(words)) if wordst[x] == words[y]]
    if points:
        x, y = zip(*points)
    else:
        x = y = ()
    plt.plot(x, y, "go", scalex=.2)
    plt.yticks(range(len(words)), words, color="b")
    plt.ylim(-1, len(words))
    plt.title("Lexical Dispersion Plot")
    plt.xlabel("Word Offset")
    plt.show()

gpl = open('gpl-2.0.txt', 'r').read()
dispersion_plot(gpl, ['software', 'license', 'copy', 'gnu', 'program', 'free'])

3 Code taken from http://nltk.googlecode.com/svn/trunk/doc/api/nltk.draw.dispersion-pysrc.html and modified
Jaganadh G Elements of Text Mining
Lexical Dispersion Plot
Lexical dispersion plot from the GPL text
Jaganadh G Elements of Text Mining
Tag Cloud
A tag cloud (word cloud, or weighted list in visual design) is a visual representation of text data, typically used to depict keyword meta-data (tags) on websites, or to visualize free form text. There is an interesting Python tool to generate tag clouds from text called pytagcloud. 4 Here comes an example of creating a tag cloud from the first 100 words of the GPL text:

from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts

def create_tag_cloud(text):
    words = tokenize(text)
    doc = " ".join(d for d in words[:100])
    tags = make_tags(get_tag_counts(doc), maxsize=80)
    create_tag_image(tags, 'gpl.png', size=(900, 600),
        fontname='Philosopher')

gpl = open('gpl-2.0.txt', 'r').read()
create_tag_cloud(gpl)

4 https://github.com/atizo/PyTagCloud
Jaganadh G Elements of Text Mining
Tag Cloud
Tag cloud from GPL text
Jaganadh G Elements of Text Mining
Word co-occurrence
Word co-occurrence analysis is used to construct lexicons, ontologies, etc. In general, it aims to find similarities between word pairs.
A word co-occurrence matrix is a square N \times N matrix, where N corresponds to the total number of unique words in a corpus. A cell m_{ij} contains the number of times m_i co-occurs with word m_j within a specific context: a natural unit such as a sentence, or a certain window of m words. Note that the upper and lower triangles of the matrix are identical, since co-occurrence is a symmetric relation. a
a Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044
Jaganadh G Elements of Text Mining
Word co-occurrence

     w1   w2   w3   ...  wn
w1   m11  m12  m13  ...  m1n
w2   m21  m22  m23  ...  m2n
w3   m31  m32  m33  ...  m3n
...
wn   mn1  mn2  mn3  ...  mnn

Shape of a co-occurrence matrix
Jaganadh G Elements of Text Mining
Word co-occurrence
Finding co-occurrence matrix with Python
from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = [corpora[i], corpora[j]]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]
ccm = cooccurrence_matrix_corpus(corpus)
Jaganadh G Elements of Text Mining
Word co-occurrence
(Figure) Word co-occurrence visualization from 100 positive movie reviews. The plot shows the top first word and the top four associated words; for each associated word, its associated words are plotted in turn.
Jaganadh G Elements of Text Mining
Stop Words
Stop Words
In computing stop words are words which are filtered out prior to or after processing ofnatural language data (text) In terms of linguistics these words are called as functionwords Words like rsquoarsquo rsquoanrsquo rsquothersquo are examples for stop words There is no defined set ofstop words available Different applications and research groups uses different sets o stopwordsGenerally stop words are omitted in text mining process The frequency of stop wordswill be very high in any corpus compared to content wordsPointers to some good stopword list is available at httpenwikipediaorgwikiStop_words
Jaganadh G Elements of Text Mining
Stop Words Filter
def stop_filter(words)
stops = [rsquoirsquo rsquomersquo rsquomyrsquo rsquomyselfrsquo rsquowersquo rsquoourrsquo rsquooursrsquo rsquoourselvesrsquo rsquoyoursquo
rsquoyourrsquo rsquoyoursrsquo rsquoyourselfrsquo rsquoyourselvesrsquo rsquohersquo rsquohimrsquo rsquohisrsquo rsquohimselfrsquo rsquoshersquo
rsquoherrsquo rsquohersrsquo rsquoherselfrsquo rsquoitrsquo rsquoitsrsquo rsquoitselfrsquorsquotheyrsquo rsquothemrsquo rsquotheirrsquo rsquotheirsrsquo
rsquothemselvesrsquo rsquowhatrsquo rsquowhichrsquo rsquowhorsquo rsquowhomrsquo rsquothisrsquo rsquothatrsquo rsquothesersquo rsquothosersquo
rsquoamrsquo rsquoisrsquo rsquoarersquo rsquowasrsquo rsquowerersquorsquobersquo rsquobeenrsquo rsquobeingrsquo rsquohaversquo rsquohasrsquo rsquohadrsquo
rsquohavingrsquo rsquodorsquo rsquodoesrsquo rsquodidrsquo rsquodoingrsquo rsquoarsquo rsquoanrsquo rsquothersquo rsquoandrsquo rsquobutrsquo rsquoifrsquo rsquoorrsquo rsquobecausersquo
rsquoasrsquo rsquountilrsquo rsquowhilersquo rsquoofrsquo rsquoatrsquo rsquobyrsquo rsquoforrsquorsquowithrsquo rsquoaboutrsquo rsquoagainstrsquo rsquobetweenrsquo rsquointorsquo
rsquothroughrsquo rsquoduringrsquo rsquobeforersquo rsquoafterrsquo rsquoaboversquo rsquobelowrsquo rsquotorsquo rsquofromrsquo rsquouprsquo rsquodownrsquo rsquoinrsquo rsquooutrsquo
rsquoonrsquo rsquooffrsquo rsquooverrsquo rsquounderrsquo rsquoagainrsquo rsquofurtherrsquorsquothenrsquo rsquooncersquo rsquoherersquo rsquotherersquo rsquowhenrsquo rsquowherersquo
rsquowhyrsquo rsquohowrsquo rsquoallrsquo rsquoanyrsquo rsquobothrsquo rsquoeachrsquo rsquofewrsquo rsquomorersquo rsquomostrsquo rsquootherrsquo rsquosomersquo rsquosuchrsquo rsquonorsquo
rsquonorrsquorsquonotrsquo rsquoonlyrsquo rsquoownrsquo rsquosamersquo rsquosorsquo rsquothanrsquo rsquotoorsquo rsquoveryrsquo rsquosrsquo rsquotrsquo rsquocanrsquo rsquowillrsquo rsquojustrsquo
rsquodonrsquo rsquoshouldrsquo rsquonowrsquo]
stopless = [word in words if word not in stops]
return stopless
Jaganadh G Elements of Text Mining
Bag of Words
The bag-of-words model is a simplifying representation used in natural languageprocessing and information retrieval (IR) In this model a text (such as a sentence or adocument) is represented as an un-ordered collection of words disregarding grammarand even word order a
Analyzing text by only analyzing frequency of words is called as bag of words model
ahttpenwikipediaorgwikiBag of words model
Jaganadh G Elements of Text Mining
Bag of Words
Example
d1 John likes to watch movies Mary likes too
d2 John also likes to watch football games
rsquofootballrsquo 0 rsquowatchrsquo 6 rsquomoviesrsquo 5 rsquogamesrsquo 1
rsquolikesrsquo 3 rsquojohnrsquo 2 rsquomaryrsquo 4
after removing stopwords
[0 0 1 2 1 1 1]
[1 1 1 1 0 0 1]
each entry of the vectors refers to count of
the corresponding entry in the dictionary
a
aExample taken from httpenwikipediaorgwikiBag of words model
Jaganadh G Elements of Text Mining
Bag of Words
Documents
d1 John likes to watch movies Mary likes too
d2 John also likes to watch football games
Vocabulary Index
V I(t) =
0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo
Jaganadh G Elements of Text Mining
Bag of Words
Documents
d1 John likes to watch movies Mary likes too
d2 John also likes to watch football games
Vocabulary Index
V I(t) =
0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo
Jaganadh G Elements of Text Mining
Bag of Words
football games john likes mary movies watch
doc1 0 0 1 2 1 1 1
doc2 1 1 1 1 0 0 1
Jaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python and sklearn 5
from sklearnfeature_extractiontext import
CountVectorizer
vectorizer = CountVectorizer(analyzer=rsquowordrsquo
min_n=1stop_words=rsquoenglishrsquo)
docs = (rsquoJohn likes to watch movies Mary
likes toorsquorsquoJohn also likes to watch football gamesrsquo)
bow = vectorizerfit_transform(docs)
print vectorizervocabulary_
print bowtoarray()
5httpscikit-learnorgJaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python Just for sample -(
def bag_of_words(docs)
stops = [rsquotorsquorsquotoorsquorsquoalsorsquo]
token_list = [tokenize(doc) for doc in docs]
vocab = list(set(token_list[0])union(token_list))
vocab =[v for v in vocab if v not in stops and len(v) gt 1]
vocab_idex = dict( [ ( word vocabindex(word) ) for word
in vocab] )
bow = [[tokenscount(word) for word in vocab_idexkeys()]
for tokens in token_list]
print vocab_idex
for bag in bow
print bag
d = (John likes to watch movies Mary likes too
John also likes to watch football games)
bag_of_words(d)
Jaganadh G Elements of Text Mining
TF-IDF
Tfndashidf term frequencyndashinverse document frequency is a numerical statistic which reflectshow important a word is to a document in a collection or corpus
TF-IDF
tf minus idf(t) = tf(t d)times idf(t)
where rsquotrsquo is a term in document rsquodrsquotf(t d) how many times the term rsquotrsquo is present in rsquodrsquo
tf(t d) =sumxisind
fr(x t)
where
fr(x t) =
1 if x = t0 otherwise
Jaganadh G Elements of Text Mining
TF-IDF
TF-IDF
andidf(t) = log |D|
1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space
Jaganadh G Elements of Text Mining
TF
TF
tf(t d) =sumxisind
fr(x t)
fr(x t) is a simple function
fr(x t) =
1 if x = t0 otherwise
Exampletf(primejohnprime d1) = 1
Jaganadh G Elements of Text Mining
Document Vector
To create a document vector space
V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))
To represent rsquod1rsquo and rsquod2rsquo as vectors
V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))
V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
which evaluates toV ~d1 = (0 0 1 2 1 1 1)
V ~d2 = (1 1 1 1 0 0 1)
Jaganadh G Elements of Text Mining
Vector Space Matrix
The document vectors can be represented as matrix
M|D|xF
where |D| is the cardinality of the document space
M|D|xF =
[0 0 1 2 1 1 11 1 1 1 0 0 1
]
Jaganadh G Elements of Text Mining
Vector Normalization
Normalized Vector
A normalized vector is represented as v = ~v~vp where v is is the unit vector or the
normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a
ahttpenwikipediaorgwikiLp space
Jaganadh G Elements of Text Mining
Vector Normalization
Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =
radicv2
1 + v22 + v2
2 + + v2n
With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p
It can be simplified as
~v||p = (
nsumi=1
|~vi|p)1p
6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining
Vector Normalization
L2 Norm
The norm which we apply here is L2 Norm which is also called as Euclidean norm a
It is a common norm used to measure the length of a vector where p = 2
ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm
Jaganadh G Elements of Text Mining
Vector Normalization
v~d1 = (0 0 1 2 1 1 1)vd1 = ~v
~vp
vd1 = v~d1
v~d12vd1 = (0012111)radic
02+02+12+22+12+12+12
vd1 = (0012111)radic8
vd1 = ( 0radic8 0radic
8 1radic
8 2radic
8 1radic
8 1radic
8 1radic
8)
vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10
Jaganadh G Elements of Text Mining
IDF
IDF
idf(t) = log |D|1+|dtisind|
where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0
Jaganadh G Elements of Text Mining
Finding IDF
idf(ti) = log|D|
1 + |d ti isin d|= log
2
1= 069314718
idf(football) = log 21+1
= 00
idf(games) = log 21+1
= 00
idf(john) = log 21+2
= minus040546510810816444
idf(likes) = log 21+2
= minus040546510810816444
idf(mary) = log 21+1
= 00
idf(movies) = log 21+1
= 00
idf(watch) = log 21+1
= 00idf(V ) = (00-040546510810816444-040546510810816444000000)
Jaganadh G Elements of Text Mining
TF-IDF weight
Finding TF-IDF weight
M|D|timesF times Midf
[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
]x
idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)
Jaganadh G Elements of Text Mining
TF-IDF weight
Finding TF-IDF weight
M|D|timesF times Midf
[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
]x
idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)
Jaganadh G Elements of Text Mining
TF-IDF weight
tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)
Jaganadh G Elements of Text Mining
TF-IDF Normalization
L2 Normalization
Mtfminusidf =MtfminusidfMtfminusidf2
Jaganadh G Elements of Text Mining
TF-IDF
Practice with Python and sklearn 7
from sklearnfeature_extractiontext import
CountVectorizer TfidfTransformer
vectorizer = CountVectorizer(analyzer=rsquowordrsquo
min_n=1stop_words=rsquoenglishrsquo)
docs = (rsquoJohn likes to watch movies Mary
likes toorsquorsquoJohn also likes to watch football gamesrsquo)
bow = vectorizerfit_transform(docs)
freq_term_matrix =vectorizertransform(docs)
tfidf = TfidfTransformer(norm=l2)
tfd = tfidffit(freq_term_matrix)
print IDF tfidfidf_
tf_idf_matrix = tfidftransform(freq_term_matrix)
print tf_idf_matrixtodense()
for wf in zip(vectorizervocabulary_tfdidf_)
print rsquor =gt rrsquo (w tfdidf_[f])
7httpscikit-learnorgJaganadh G Elements of Text Mining
N-Grams
N-Gram
In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa
Unigrams are single words
Bigrams are sequences of two words
Trigrams are sequences of three words
ahttpenwikipediaorgwikiN-gram
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from list of words
def ngrams(wordsn=2)
grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]
return grams
words = John likes to watch movies Mary likes too
lower()split()
bigrams = ngrams(wordsn=2)
trigrams = ngrams(wordsn=3)
print bigram
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math
def mutual_info(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
minfo =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
minfo[gram] = (mathlog( len(words) gramcount[ gram ]
wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))
mathlog( 2 )
return minfo
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math
def tscore(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
tsc =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
tsc[gram] = (gramcount[gram] - (1len(words))
wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])
mathsqrt( gramcount[gram])
return tsc
Jaganadh G Elements of Text Mining
Document Classification
Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8
Document classification tasks can be divided into three kinds
supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents
semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance
unsupervised document classification is entirely executed without reference toexternal information
8httpenwikipediaorgwikiDocument_classification
Jaganadh G Elements of Text Mining
Document Classification

Formal Definition

Let C = \{c_1, c_2, c_3, ..., c_m\} be a set of pre-defined categories.
Let D = \{d_1, d_2, d_3, ..., d_n\} be a set of documents to be classified.
Given a training set T of labeled documents \langle d, c \rangle, where \langle d, c \rangle \in D \times C, using a learning
algorithm we wish to learn a classifier (or classifier function) \gamma that maps documents to
classes: \gamma : D \to C.
A supervised learning algorithm \Gamma takes the training set T and emits the learned classification
function \gamma: \Gamma(T) = \gamma

\gamma(c_i, d_j) = \begin{cases} 1 & \text{if } d_j \text{ belongs to } c_i \\ 0 & \text{otherwise} \end{cases}

The main approaches in document classification are:

Naive Bayes (NB)
Support Vector Machines (SVM)
Document Classification

[Figure: a supervised document classification pipeline] [9]

9 Image taken from http://www.python-course.eu/text_classification_introduction.php
Naive Bayes Classification

Naive Bayes is a simple probabilistic classifier based on applying Bayes' theorem, or
Bayes' rule:

P(H|E) = \frac{P(E|H) \times P(H)}{P(E)}

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be
predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability: the probability of an event before the
evidence is observed.
P(H|E) is called the posterior probability: the probability of an event after the
evidence is observed.
Naive Bayes Classification

Let H be the event of raining and E be the evidence of a dark cloud; then we have:

P(raining | dark cloud) = \frac{P(dark cloud | raining) \times P(raining)}{P(dark cloud)}

For multiple evidences:

P(H | E_1, E_2, ..., E_n) = \frac{P(E_1, E_2, ..., E_n | H) \times P(H)}{P(E_1, E_2, ..., E_n)}

With the independence assumption, we can rewrite Bayes' rule as follows:

P(H | E_1, E_2, ..., E_n) = \frac{P(E_1|H) \times P(E_2|H) \times \cdots \times P(E_n|H) \times P(H)}{P(E_1, E_2, ..., E_n)}
Naive Bayes Applied in Text Classification

Now let's try to train a Naive Bayes classifier to perform text classification.
C = {terrorism, entertainment}
D = {D0, D1, D2, D3, D4, D5}
BoW = {kill, bomb, kidnap, music, movie, tv} (vocabulary)
The pre-processed documents for training will look like:

Training Docs   kill  bomb  kidnap  music  movie  tv   C
D0              2     1     3       0      0      1    Terrorism
D1              1     1     1       0      0      0    Terrorism
D2              1     1     2       0      1      0    Terrorism
D3              0     1     0       2      1      1    Entertainment
D4              0     0     1       1      1      0    Entertainment
D5              0     0     0       2      2      0    Entertainment
Building Naive Bayes Model

The Naive Bayes model for the training set will be (the probabilities in the table already
include the Laplace smoothing described on the next slide):

|V|  C  P(Ci)  ni   P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
6    T  0.5    15   0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
     E  0.5    12   0.055555556  0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency of each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Examples (before smoothing):
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12
Building Naive Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a
uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555
Testing the NB model

Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv   C
Dt         2     1     2       0      0      1    ?

To find the posterior probability:

P(c_i | W) = P(c_i) \times \prod_{j=1}^{|V|} P(w_j | c_i)
Testing the NB model

P(Terrorism|W) = P(Terrorism) \times P(kill|T)^2 \times P(bomb|T)^1 \times P(kidnap|T)^2 \times P(music|T)^0 \times P(movie|T)^0 \times P(tv|T)^1
= 0.5 \times 0.2380^2 \times 0.1904^1 \times 0.3333^2 \times 0.0476^0 \times 0.0952^0 \times 0.0952^1
= 0.5 \times 0.0566 \times 0.1904 \times 0.1110 \times 1 \times 1 \times 0.0952
= 5.7 \times 10^{-5}

P(Entertainment|W) = P(Entertainment) \times P(kill|E)^2 \times P(bomb|E)^1 \times P(kidnap|E)^2 \times P(music|E)^0 \times P(movie|E)^0 \times P(tv|E)^1
= 0.5 \times 0.0555^2 \times 0.1111^1 \times 0.1111^2 \times 0.3333^0 \times 0.2777^0 \times 0.1111^1
= 0.5 \times 0.0030 \times 0.1111 \times 0.0123 \times 1 \times 1 \times 0.1111
= 2.27 \times 10^{-7}

The document is classified as "Terrorism" because that class got the highest value.
Preventing Underflow

The probability score assigned to the test document is very small. In real-world situations we will
train the classifier with thousands of documents. In such cases the conditional probability values
will be too low for the CPU to handle. This problem is called underflow. To resolve the
problem we can take the logarithm of the probabilities:

P(Terrorism|W) = log(0.5 \times 0.2380^2 \times 0.1904^1 \times 0.3333^2 \times 0.0476^0 \times 0.0952^0 \times 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 \times 0.0555^2 \times 0.1111^1 \times 0.1111^2 \times 0.3333^0 \times 0.2777^0 \times 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.5110 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem, our system still classifies the test document as "Terrorism".
From the final probability scores you can observe that the values are scaled nicely.

The section on the Naive Bayes classifier is prepared from the notes "A Tutorial on Naive Bayes
Classification" by Choochart Haruechaiyasak:
suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
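The same arithmetic is easy to check in code. Below is a minimal sketch (mine, not from the original notes; the helper name log_posterior is illustrative) that reproduces the two log10 scores for the toy training set:

import math

# Toy counts taken from the training table above; Laplace smoothing is applied inside.
vocab = ['kill', 'bomb', 'kidnap', 'music', 'movie', 'tv']
counts = {
    'Terrorism':     {'kill': 4, 'bomb': 3, 'kidnap': 6, 'music': 0, 'movie': 1, 'tv': 1},
    'Entertainment': {'kill': 0, 'bomb': 1, 'kidnap': 1, 'music': 5, 'movie': 4, 'tv': 1},
}
prior = {'Terrorism': 0.5, 'Entertainment': 0.5}
test_doc = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}

def log_posterior(cls):
    n = sum(counts[cls].values())  # 15 for Terrorism, 12 for Entertainment
    score = math.log10(prior[cls])
    for w in vocab:
        p = (counts[cls][w] + 1.0) / (n + len(vocab))  # Laplace estimation
        score += test_doc[w] * math.log10(p)
    return score

for cls in ('Terrorism', 'Entertainment'):
    print("%s: %.4f" % (cls, log_posterior(cls)))
# Terrorism: -4.2431, Entertainment: -6.6285
# (matches the hand computation above up to rounding)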
Naive Bayes Classifier

There are two different ways to set up a Naive Bayes classifier:

Multi-variate Bernoulli Model
Multinomial Model
Multi-variate Bernoulli Model

"Given a vocabulary V, each dimension of the space t, t \in \{1, ..., |V|\}, corresponds to word
w_t from the vocabulary. Dimension t of the vector for document d_i is written B_{it}, and is
either 0 or 1, indicating whether word w_t occurs at least once in the document. With
such a document representation we make the naive Bayes assumption: that the
probability of each word occurring in a document is independent of the occurrence of
other words in a document." [a]

a: A Comparison of Event Models for Naive Bayes Text Classification. Andrew McCallum and
Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Multi-variate Bernoulli Model

If we adopt the multi-variate Bernoulli model for text classification, our document
space representation will be:

Training Docs   kill  bomb  kidnap  music  movie  tv   C
D0              1     1     1       0      0      1    Terrorism
D1              1     1     1       0      0      0    Terrorism
D2              1     1     1       0      1      0    Terrorism
D3              0     1     0       1      1      1    Entertainment
D4              0     0     1       1      1      0    Entertainment
D5              0     0     0       1      1      0    Entertainment

Here you can note that the individual word frequencies have been replaced by the presence
or absence of the word.
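scikit-learn also ships a Bernoulli event model. A minimal sketch (my illustration, not from the slides) fitting BernoulliNB on the presence/absence matrix above:

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Columns: kill, bomb, kidnap, music, movie, tv
X = np.array([[1, 1, 1, 0, 0, 1],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 0, 1, 0],
              [0, 1, 0, 1, 1, 1],
              [0, 0, 1, 1, 1, 0],
              [0, 0, 0, 1, 1, 0]])
y = np.array(['T', 'T', 'T', 'E', 'E', 'E'])

clf = BernoulliNB().fit(X, y)
print(clf.predict(np.array([[1, 1, 1, 0, 0, 1]])))  # expected: ['T']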
Multinomial Model

"In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency
information in documents. In the multinomial model, a document is an ordered sequence of word events,
drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the
probability of each word event in a document is independent of the word's context and position in the
document."

In the NB example which we worked out, we applied the multinomial model with a simple bag-of-words
representation. We can also use a smoothed bag-of-words model, like TF-IDF, in the multinomial naive Bayes model.
Support Vector Machine

A Support Vector Machine is an abstract learning machine which learns from a training
set, attempts to generalize, and makes correct predictions on new data. Consider a
training set \{(x_i, y_i)\}_{i=1}^{n}, where x_i \in R^p (input feature vector) and y_i \in \{1, -1\} is the
corresponding label, (y_i = +1) or (y_i = -1). To start with, we assume that our
input feature vectors are linearly separable; that is, there exists a function
f(x) = \langle w, x \rangle + b, where w \in R^p (weight vector) and b \in R (bias), such that

\langle w, x_i \rangle + b > 0 for y_i = +1
\langle w, x_i \rangle + b < 0 for y_i = -1

\langle w, x \rangle + b = 0 is the decision hyperplane. There can be multiple hyperplanes which
separate the positive and negative examples, but all of them are not equal. SVM tries to
find the particular hyperplane that maximizes the margin. The vectors which are closest to
the maximum-margin hyperplane are called support vectors. If the data is not linearly
separable, we have to use kernel tricks to find soft margins.
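To make the decision rule concrete, here is a toy sketch with a made-up (not learned) weight vector and bias:

import numpy as np

w = np.array([1.0, -1.0])  # hypothetical weight vector
b = -0.5                   # hypothetical bias

def classify(x):
    # The sign of <w, x> + b decides which side of the hyperplane x falls on.
    return 1 if np.dot(w, x) + b > 0 else -1

print(classify(np.array([2.0, 0.5])))  # +1
print(classify(np.array([0.0, 1.0])))  # -1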
Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are
mixed, then it is not possible to separate them in a linear way.
If there is a clear blank space available on the table between the squares and rectangles, we can say that
they are linearly separable. A line drawn in the clear space between the figures, at exactly equal distance
from the margins of the square scatter and the rectangle scatter, is called a separating hyperplane. Everything
on one side of the separating hyperplane belongs to one category, and everything on the other side
belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter
(squares and rectangles) at an equal distance from the separating hyperplane is called the maximum margin.
The figures closest to the separating hyperplane are known as support vectors. [10] If the data is not
linearly separable, we have to use kernel tricks. [11]

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to
http://www.statsoft.com/textbook/support-vector-machines and its bibliography.
11 http://www.statsoft.com/textbook/support-vector-machines
Support Vector Machine

[Figure: SVM illustration]
Practice Time

Let's try to build a Multinomial Naive Bayes classifier with Python scikit-learn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print(training_data.target_names[classifier.predict(['This is a good one'])[0]])
Practice Time

Let's try to build an SVM classifier with Python scikit-learn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print(training_data.target_names[classifier.predict(['This is a good one'])[0]])
Practice Time

Let's try to build a multi-variate Naive Bayes classifier with Python NLTK: [12]

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print(classifier.classify(words))

12 This code is adapted from http://streamhacker.com/2010/05/10/
text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more
detailed discussion.
Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of
an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A
confusion matrix for the classifier will look like:

                              Actual
                     Positive              Negative
Predicted  Positive  True Positive (TP)    False Positive (FP)
           Negative  False Negative (FN)   True Negative (TN)
Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = \frac{TP + TN}{TP + FP + FN + TN}

                     Actual
                     Positive  Negative  Total
Predicted  Positive  562       77        639
           Negative  225       436       661
           Total     787       513       1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) ≈ 0.76
Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant:

Precision = \frac{TP}{TP + FP}

Recall indicates how many of the relevant items we identified. It is equivalent
to "hit rate" and "sensitivity":

Recall = \frac{TP}{TP + FN}
Evaluating Performance of a Classifier

Using the same confusion matrix as above:

Positive Precision = 562 / (562 + 77) ≈ 0.87
Negative Precision = 436 / (225 + 436) ≈ 0.65
Positive Recall = 562 / (562 + 225) ≈ 0.71
Negative Recall = 436 / (77 + 436) ≈ 0.84
Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong:

Error Rate = \frac{FP + FN}{TP + FP + FN + TN}

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) ≈ 0.23
Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is
equivalent to the false positive rate (FPR):

Fall-out = \frac{FP}{FP + TN}

Fall-out = 77 / (77 + 436) ≈ 0.15
Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy.
It considers both the precision p and the recall r of the test to compute the score: p is
the number of correct results divided by the number of all returned results, and r is the
number of correct results divided by the number of results that should have been
returned. The F1 score can be interpreted as a weighted average of the precision and
recall, where an F1 score reaches its best value at 1 and its worst score at 0. [a]

F1 Score = 2 \times \frac{precision \times recall}{precision + recall}

a: http://en.wikipedia.org/wiki/F1_score
Evaluating Performance of a Classifier

F1 Score (Positive) = 2 \times (0.87 \times 0.71) / (0.87 + 0.71) ≈ 0.78
F1 Score (Negative) = 2 \times (0.65 \times 0.84) / (0.65 + 0.84) ≈ 0.73
Evaluating Performance of a Classifier

Positive Predictive Value

Positive predictive value, or precision rate, is the proportion of positive test results that
are true positives:

Positive Predictive Value = \frac{TP}{TP + FP}

Positive Predictive Value = 562 / (562 + 77) ≈ 0.87
Evaluating Performance of a Classifier

Negative Predictive Value

Negative predictive value (NPV) is a summary statistic used to describe the performance
of a diagnostic testing procedure:

NPV = \frac{TN}{TN + FN}

NPV = 436 / (436 + 225) ≈ 0.65
Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = \frac{TN}{FP + TN}

Specificity = 436 / (77 + 436) ≈ 0.84
Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis
testing to correct for multiple comparisons:

FDR = \frac{FP}{FP + TP}

FDR = 77 / (77 + 562) ≈ 0.12
Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure
of the quality of binary (two-class) classifications:

MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}

MCC = \frac{562 \times 436 - 77 \times 225}{\sqrt{(562 + 77)(562 + 225)(436 + 77)(436 + 225)}} ≈ 0.55
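All of the measures above fall out of the four cells of the confusion matrix. A small sketch (mine) recomputing them from the example counts:

import math

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0

accuracy    = (TP + TN) / (TP + FP + FN + TN)
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)
f1          = 2 * precision * recall / (precision + recall)
error_rate  = (FP + FN) / (TP + FP + FN + TN)
fallout     = FP / (FP + TN)
npv         = TN / (TN + FN)
specificity = TN / (FP + TN)
fdr         = FP / (FP + TP)
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

print("accuracy=%.2f precision=%.2f recall=%.2f f1=%.2f" %
    (accuracy, precision, recall, f1))
print("error=%.2f fallout=%.2f npv=%.2f specificity=%.2f fdr=%.2f mcc=%.2f" %
    (error_rate, fallout, npv, specificity, fdr, mcc))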
Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a
classification model. It applies to the binary classification problem. ROC can be plotted
as a curve on an X-Y axis: the false positive rate is placed on the X axis, and the true
positive rate is placed on the Y axis. The top left corner is the optimal location on an
ROC graph, indicating a high true positive rate and a low false positive rate.
Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary
classification model. The larger the AUC, the higher the likelihood that an actual
positive case will be assigned a higher probability of being positive than an actual
negative case. The AUC measure is especially useful for data sets with an unbalanced
target distribution (one target class dominates the other).
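scikit-learn can compute the ROC points and the AUC directly. A minimal sketch with made-up labels and scores (illustrative data, not from the slides):

from sklearn.metrics import roc_curve, auc

y_true = [1, 1, 0, 1, 0, 0, 1, 0]                     # actual labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.75, 0.1]   # scores for the positive class

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC = %.2f" % auc(fpr, tpr))  # -> AUC = 0.94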
Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity
extraction) is a sub-task of information extraction that seeks to locate and classify atomic
elements in text into predefined categories, such as the names of persons, organizations,
locations, expressions of times, quantities, monetary values, percentages, etc.
Named Entity Recognition

Entity recognition with Python NLTK:

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":  # on NLTK 3, use tree.label() instead of tree.node
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child " \
        "of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print(entities)
Extracting Terms from Text

Extracting terms with Python topia.termextract: [13]

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the " \
    "second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print(terms)

13 http://pypi.python.org/pypi/topia.termextract/
References

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part I.
http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part II.
http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational
Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification.
http://www.python-course.eu/text_classification_introduction.php
References

Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT
Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification.
suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text
Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on
March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing
Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information
Retrieval. Cambridge University Press, 2008.
Plotting top N words

Python code to plot the top 50 words from a text. The plot labels each point with its word:

import matplotlib.pyplot as plt

def plot_freq_tag(text):
    tfw = top_words(text, n=50)  # top_words function from a previous slide
    words = [tfw[i][0] for i in range(len(tfw))]
    x = range(len(tfw))
    nwords = len(tfw)
    y = []
    for item in range(nwords):
        y = y + [tfw[item][1]]
    fig = plt.figure()
    ax = fig.add_subplot(111, xlabel="Word Rank", ylabel="Word Frequency")
    ax.set_title('Top 50 words')
    ax.plot(x, y, 'go-', ls='dotted')
    plt.xticks(range(0, len(words) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    for i, label in enumerate(words):
        plt.text(x[i], y[i], label, rotation=45)
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq_tag(text)
Top N Words

[Plot of the top 50 words from GPL v2 (without filtering stop words)]
Plotting a histogram of top N words

Python code to plot a histogram of the top 50 words:

import numpy as np
import matplotlib.pyplot as plt

def plot_hist(text):
    tw = top_words(text)  # top_words function from a previous slide
    words = [tw[i][0] for i in range(len(tw))]
    freq = [tw[j][1] for j in range(len(tw))]
    pos = np.arange(len(words))
    width = 1.0
    ax = plt.axes(frameon=True)
    ax.set_xticks(pos)
    ax.set_yticks(range(0, max(freq), 10))
    ax.set_xticklabels(words, rotation='vertical', fontsize=9)
    plt.bar(pos, freq, width, color='b')
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_hist(text)
Histogram

[Histogram of the top 50 words from GPL v2 (without filtering stop words)]
Lexical Dispersion Plot

A lexical dispersion plot shows the position of a word in a given text. I have created some
Python code to generate a lexical dispersion plot with reference to a given word list: [3]

import matplotlib.pyplot as plt

def dispersion_plot(text, words):
    wordst = tokenize(text)  # tokenize function from an earlier slide
    points = [(x, y) for x in range(len(wordst))
        for y in range(len(words)) if wordst[x] == words[y]]
    if points:
        x, y = zip(*points)
    else:
        x = y = ()
    plt.plot(x, y, "go", scalex=.2)
    plt.yticks(range(len(words)), words, color="b")
    plt.ylim(-1, len(words))
    plt.title("Lexical Dispersion Plot")
    plt.xlabel("Word Offset")
    plt.show()

gpl = open('gpl-2.0.txt', 'r').read()
dispersion_plot(gpl, ['software', 'license', 'copy', 'gnu', 'program', 'free'])

3 Code taken from http://nltk.googlecode.com/svn/trunk/doc/api/nltk.draw.dispersion-pysrc.html and modified.
Lexical Dispersion Plot

[Lexical dispersion plot from the GPL text]
Tag Cloud

A tag cloud (word cloud, or weighted list in visual design) is a visual representation of text data,
typically used to depict keyword meta-data (tags) on websites, or to visualize free-form text.
There is an interesting Python tool to generate tag clouds from text called pytagcloud. [4] Here
comes an example of creating a tag cloud from the first 100 words of the GPL text:

from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts

def create_tag_cloud(text):
    words = tokenize(text)  # tokenize function from an earlier slide
    doc = " ".join(d for d in words[:100])
    tags = make_tags(get_tag_counts(doc), maxsize=80)
    create_tag_image(tags, 'gpl.png', size=(900, 600),
        fontname='Philosopher')

gpl = open('gpl-2.0.txt', 'r').read()
create_tag_cloud(gpl)

4 https://github.com/atizo/PyTagCloud
Tag Cloud

[Tag cloud from the GPL text]
Word Co-occurrence

Word co-occurrence analysis is used to construct lexicons, ontologies, etc. In general it
aims to find similarities between word pairs.

A word co-occurrence matrix is a square N \times N matrix, where N corresponds to the total
number of unique words in a corpus. A cell m_{ij} contains the number of times w_i
co-occurs with word w_j within a specific context (a natural unit such as a sentence, or a
certain window of m words). Note that the upper and lower triangles of the matrix are
identical, since co-occurrence is a symmetric relation. [a]

a: Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in
Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044
Word Co-occurrence

      w1    w2    w3    ...   wn
w1    m11   m12   m13   ...   m1n
w2    m21   m22   m23   ...   m2n
w3    m31   m32   m33   ...   m3n
...
wn    mn1   mn2   mn3   ...   mnn

Shape of a co-occurrence matrix
Word Co-occurrence

Finding a co-occurrence matrix with Python:

from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = [corpora[i], corpora[j]]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]
ccm = cooccurrence_matrix_corpus(corpus)
Word Co-occurrence

[Word co-occurrence visualization from 100 positive movie reviews. The plot shows the top first word and the top four
associated words; for each associated word, its associated words are plotted in turn.]
Stop Words

In computing, stop words are words which are filtered out prior to, or after, processing of
natural language data (text). In terms of linguistics these words are called function
words. Words like 'a', 'an', and 'the' are examples of stop words. There is no single defined set
of stop words; different applications and research groups use different sets of stop words.
Generally stop words are omitted in the text mining process. The frequency of stop words
will be very high in any corpus, compared to content words. Pointers to some good stop
word lists are available at http://en.wikipedia.org/wiki/Stop_words
Stop Words Filter

def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
        'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
        'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
        'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
        'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
        'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
        'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
        'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
        'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
        'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
        'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
        'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless
Bag of Words

The bag-of-words model is a simplifying representation used in natural language
processing and information retrieval (IR). In this model a text (such as a sentence or a
document) is represented as an unordered collection of words, disregarding grammar
and even word order. [a]

Analyzing text by the frequency of its words alone is called the bag-of-words model.

a: http://en.wikipedia.org/wiki/Bag_of_words_model
Bag of Words

Example: [a]

d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1,
 'likes': 3, 'john': 2, 'mary': 4}
(vocabulary index, after removing stop words)

[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of
the corresponding entry in the dictionary.

a: Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model
Bag of Words

Documents:

d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

Vocabulary Index:

VI(t) = 0 if t is 'football'
        1 if t is 'games'
        2 if t is 'john'
        3 if t is 'likes'
        4 if t is 'mary'
        5 if t is 'movies'
        6 if t is 'watch'
Bag of Words

        football  games  john  likes  mary  movies  watch
doc1    0         0      1     2      1     1       1
doc2    1         1      1     1      0     0       1
Bag of Words

Creating a bag of words with Python and sklearn: [5]

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary '
    'likes too.', 'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
print(vectorizer.vocabulary_)
print(bow.toarray())

5 http://scikit-learn.org
Bag of Words

Creating a bag of words with plain Python. Just for a sample :-(

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
        in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
        for tokens in token_list]
    print(vocab_idex)
    for bag in bow:
        print(bag)

d = ("John likes to watch movies. Mary likes too.",
    "John also likes to watch football games.")
bag_of_words(d)
TF-IDF

Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects
how important a word is to a document in a collection or corpus.

tf\text{-}idf(t) = tf(t, d) \times idf(t)

where t is a term in document d, and tf(t, d) is how many times the term t is present in d:

tf(t, d) = \sum_{x \in d} fr(x, t)

where

fr(x, t) = \begin{cases} 1 & \text{if } x = t \\ 0 & \text{otherwise} \end{cases}
TF-IDF

and

idf(t) = \log \frac{|D|}{1 + |\{d : t \in d\}|}

where |\{d : t \in d\}| is the number of documents in which t is present and tf(t, d) \neq 0,
and |D| is the cardinality of the document space.
TF

tf(t, d) = \sum_{x \in d} fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = \begin{cases} 1 & \text{if } x = t \\ 0 & \text{otherwise} \end{cases}

Example: tf('john', d1) = 1
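A direct Python transcription of fr and tf (a small sketch of mine, using the tokens of d1 from the example):

def fr(x, t):
    return 1 if x == t else 0

def tf(t, d):
    return sum(fr(x, t) for x in d)

d1 = "john likes to watch movies mary likes too".split()
print(tf('john', d1))   # 1
print(tf('likes', d1))  # 2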
Document Vector

To create a document vector space:

\vec{v}_{d_n} = (tf(t_1, d_n), tf(t_2, d_n), ..., tf(t_n, d_n))

To represent d1 and d2 as vectors:

\vec{v}_{d_1} = (tf(t_1, d_1), tf(t_2, d_1), tf(t_3, d_1), tf(t_4, d_1), tf(t_5, d_1), tf(t_6, d_1), tf(t_7, d_1))
\vec{v}_{d_2} = (tf(t_1, d_2), tf(t_2, d_2), tf(t_3, d_2), tf(t_4, d_2), tf(t_5, d_2), tf(t_6, d_2), tf(t_7, d_2))

which evaluates to:

\vec{v}_{d_1} = (0, 0, 1, 2, 1, 1, 1)
\vec{v}_{d_2} = (1, 1, 1, 1, 0, 0, 1)
Vector Space Matrix

The document vectors can be represented as a matrix

M_{|D| \times F}

where |D| is the cardinality of the document space:

M_{|D| \times F} = \begin{bmatrix} 0 & 0 & 1 & 2 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 0 & 0 & 1 \end{bmatrix}
Vector Normalization

Normalized Vector

A normalized vector is represented as \hat{v} = \frac{\vec{v}}{\|\vec{v}\|_p}, where \hat{v} is the unit vector (the
normalized vector), \vec{v} is the vector to be normalized, and \|\vec{v}\|_p is the norm
(magnitude or length) of the vector \vec{v} in the L^p space (Lebesgue spaces). [a]

a: http://en.wikipedia.org/wiki/Lp_space
Vector Normalization

The length of a vector is calculated using the Euclidean norm. [6]
Non-normalized vector: \vec{v} = (v_1, v_2, v_3, ..., v_n)
Length of vector: \|\vec{v}\| = \sqrt{v_1^2 + v_2^2 + v_3^2 + ... + v_n^2}

With the p-norm: \|\vec{v}\|_p = (|v_1|^p + |v_2|^p + |v_3|^p + ... + |v_n|^p)^{1/p}

It can be simplified as:

\|\vec{v}\|_p = \left( \sum_{i=1}^{n} |\vec{v}_i|^p \right)^{1/p}

6 http://mathworld.wolfram.com/L2-Norm.html
Vector Normalization

L2 Norm

The norm which we apply here is the L2 norm, also called the Euclidean norm. [a]
It is a common norm used to measure the length of a vector, where p = 2.

a: http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm
Vector Normalization

\vec{v}_{d_1} = (0, 0, 1, 2, 1, 1, 1)

\hat{v}_{d_1} = \frac{\vec{v}_{d_1}}{\|\vec{v}_{d_1}\|_2}
= \frac{(0, 0, 1, 2, 1, 1, 1)}{\sqrt{0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2}}
= \frac{(0, 0, 1, 2, 1, 1, 1)}{\sqrt{8}}
= (\frac{0}{\sqrt{8}}, \frac{0}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{2}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}})
= (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Our normalized vector \hat{v}_{d_1} now has an L2 norm \|\hat{v}_{d_1}\|_2 = 1.0.
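The same normalization is a one-liner with numpy (a quick check of the hand computation):

import numpy as np

v = np.array([0., 0., 1., 2., 1., 1., 1.])
v_hat = v / np.linalg.norm(v, ord=2)  # divide by sqrt(8)
print(v_hat)  # [0. 0. 0.35355339 0.70710678 0.35355339 0.35355339 0.35355339]
print(np.linalg.norm(v_hat))  # 1.0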
IDF

idf(t) = \log \frac{|D|}{1 + |\{d : t \in d\}|}

where |\{d : t \in d\}| is the number of documents in which t is present and tf(t, d) \neq 0.
Finding IDF

idf(t_i) = \log \frac{|D|}{1 + |\{d : t_i \in d\}|}, with |D| = 2

idf(football) = \log \frac{2}{1 + 1} = 0.0
idf(games) = \log \frac{2}{1 + 1} = 0.0
idf(john) = \log \frac{2}{1 + 2} = -0.40546510810816444
idf(likes) = \log \frac{2}{1 + 2} = -0.40546510810816444
idf(mary) = \log \frac{2}{1 + 1} = 0.0
idf(movies) = \log \frac{2}{1 + 1} = 0.0
idf(watch) = \log \frac{2}{1 + 2} = -0.40546510810816444

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, -0.40546510810816444)

(Note: 'watch' occurs in both documents, so its idf is \log \frac{2}{3}, just like 'john' and 'likes'.)
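A quick check of these idf values in Python (a sketch assuming the two documents are already tokenized, lowercased, and stop-filtered):

import math

docs = [['john', 'likes', 'watch', 'movies', 'mary', 'likes'],
        ['john', 'likes', 'watch', 'football', 'games']]
vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']

def idf(term):
    df = sum(1 for d in docs if term in d)  # document frequency
    return math.log(len(docs) / (1.0 + df))

for t in vocab:
    print("%s: %f" % (t, idf(t)))
# football, games, mary, movies -> log(2/2) = 0.0
# john, likes, watch -> log(2/3) = -0.405465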
TF-IDF Weight

Finding the TF-IDF weight:

M_{|D| \times F} \times M_{idf}

\begin{bmatrix} tf(t_1, d_1) & tf(t_2, d_1) & \cdots & tf(t_7, d_1) \\ tf(t_1, d_2) & tf(t_2, d_2) & \cdots & tf(t_7, d_2) \end{bmatrix} \times \begin{bmatrix} idf(t_1) & 0 & \cdots & 0 \\ 0 & idf(t_2) & \cdots & 0 \\ \vdots & & \ddots & \\ 0 & 0 & \cdots & idf(t_7) \end{bmatrix}

= \begin{bmatrix} tf(t_1, d_1) \times idf(t_1) & tf(t_2, d_1) \times idf(t_2) & \cdots & tf(t_7, d_1) \times idf(t_7) \\ tf(t_1, d_2) \times idf(t_1) & tf(t_2, d_2) \times idf(t_2) & \cdots & tf(t_7, d_2) \times idf(t_7) \end{bmatrix}
TF-IDF Normalization

L2 normalization:

M_{tf\text{-}idf} = \frac{M_{tf\text{-}idf}}{\|M_{tf\text{-}idf}\|_2}
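A small numpy sketch (mine, not from the slides) of the same weighting and L2 normalization, using the idf vector computed earlier:

import numpy as np

tf_matrix = np.array([[0., 0., 1., 2., 1., 1., 1.],
                      [1., 1., 1., 1., 0., 0., 1.]])
idf = np.array([0., 0., -0.405465, -0.405465, 0., 0., -0.405465])

tfidf = tf_matrix * idf  # equivalent to M_tf x diag(idf)
norms = np.sqrt((tfidf ** 2).sum(axis=1, keepdims=True))
norms[norms == 0] = 1.0  # guard against all-zero rows
print(tfidf / norms)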
TF-IDF

Practice with Python and sklearn: [7]

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary '
    'likes too.', 'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print("IDF: %s" % tfidf.idf_)
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print(tf_idf_matrix.todense())
for w, f in vectorizer.vocabulary_.items():  # term -> column index
    print('%r => %r' % (w, tfd.idf_[f]))

7 http://scikit-learn.org
N-Grams

In the fields of computational linguistics and probability, an n-gram is a contiguous
sequence of n items from a given sequence of text or speech. An n-gram could be any
combination of letters; however, the items in question can be phonemes, syllables,
letters, words, or base pairs, according to the application. [a]

Unigrams are single words.
Bigrams are sequences of two words.
Trigrams are sequences of three words.

a: http://en.wikipedia.org/wiki/N-gram
Bigrams

P(w_i | w_1, w_2, ..., w_{i-1}) \approx P(w_i | w_{i-1})

Practice with Python:

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print(bigrams)

['john likes', 'likes to', 'to watch', 'watch movies',
'movies mary', 'mary likes', 'likes too']
Trigrams

P(w_i | w_1, w_2, ..., w_{i-1}) \approx P(w_i | w_{i-2}, w_{i-1})

Practice with Python:

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print(trigrams)

['john likes to', 'likes to watch', 'to watch movies',
'watch movies mary', 'movies mary likes', 'mary likes too']
N-Grams

Python code to generate n-grams from a list of words:

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print(bigrams)
print(trigrams)
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math
def mutual_info(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
minfo =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
minfo[gram] = (mathlog( len(words) gramcount[ gram ]
wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))
mathlog( 2 )
return minfo
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math
def tscore(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
tsc =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
tsc[gram] = (gramcount[gram] - (1len(words))
wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])
mathsqrt( gramcount[gram])
return tsc
Jaganadh G Elements of Text Mining
Document Classification
Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8
Document classification tasks can be divided into three kinds
supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents
semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance
unsupervised document classification is entirely executed without reference toexternal information
8httpenwikipediaorgwikiDocument_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents
langd crang
wherelangd crangisin D times C using a learning
algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ
γ(ci dj) =
1 if dj belongs to ci0 otherwise
Main approaches in document classification are
Naıve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9
9Image taken from httpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)
P (E)
The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)
P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)
P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)
P (E1E2En)
Jaganadh G Elements of Text Mining
Naıve Bayes Applied in Text Classification
Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
Naıve Bayes Model for the training set will be like [4pt]
|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095
E 05 12 005555566 011111111 011111111 033333333 027777778 011111111
|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class
number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)
15 = 415
P (kill|Entertainment) = (0+0+0)12 = 0
12
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)
(15+|V |) = 521 = 02380
P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1
18 = 00555
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document isTest Docs kill bomb kidnap music movie tv C
Dt 2 1 2 0 0 1 To find the posterior probability
P (ci|W ) = P (ci)timesVprodj=1
P (wj |ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)
= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521
= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5
P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)
= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111
= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7
The document has classified as rdquoTerrorismrdquo because it got the highest value
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289
After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely
The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf
Jaganadh G Elements of Text Mining
Naıve Bayes Classifier
There are two different ways to setup a Naıve Bayes Classifier
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta
aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by presence or absenceof the word
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument
In the NB example which we worked out we applied the multinomial model In the model we used simple bag of
words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that
〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1
〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way
If there is a clear blank space available in the table in between the square and rectangles we can say that
it is linearly separable A line which drawn in the clear space between the figuresexactly equal length
from the margin of scatter region of square and rectangle is called as separating hyperplane Everything
on the one side of the separating hyper plane belongs to one category and everything in the other side
belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter
(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin
The figures closest to the separating hyperplane are known as support vectors10 If the data is not
linearly separable we have to use kernel tricks 11
10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography
11httpwwwstatsoftcomtextbooksupport-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Top N Words
Plot of top 50 words from GPL v2 (without filtering stop words)
Jaganadh G Elements of Text Mining
Plotting histogram of top N words
Python code to plot histogram of top 50 words
import numpy as np
import matplotlib.pyplot as plt

def plot_hist(text):
    tw = top_words(text)  # top_words() as defined in the earlier slide
    words = [tw[i][0] for i in range(len(tw))]
    freq = [tw[j][1] for j in range(len(tw))]
    pos = np.arange(len(words))
    width = 1.0
    ax = plt.axes(frameon=True)
    ax.set_xticks(pos)
    ax.set_yticks(range(0, max(freq), 10))
    ax.set_xticklabels(words, rotation='vertical', fontsize=9)
    plt.bar(pos, freq, width, color='b')
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_hist(text)
Jaganadh G Elements of Text Mining
Histogram
Histogram of top 50 words from GPL v2 (without filtering stop words)
Jaganadh G Elements of Text Mining
Lexical Dispersion Plot
A lexical dispersion plot shows the position of a word in a given text. I have created some Python code to generate a lexical dispersion plot with reference to a given word list.3
import matplotlib.pyplot as plt

def dispersion_plot(text, words):
    wordst = tokenize(text)  # tokenize() as defined in the earlier slide
    points = [(x, y) for x in range(len(wordst))
              for y in range(len(words)) if wordst[x] == words[y]]
    if points:
        x, y = zip(*points)
    else:
        x = y = ()
    plt.plot(x, y, "go", scalex=2)
    plt.yticks(range(len(words)), words, color="b")
    plt.ylim(-1, len(words))
    plt.title("Lexical Dispersion Plot")
    plt.xlabel("Word Offset")
    plt.show()

gpl = open('gpl-2.0.txt', 'r').read()
dispersion_plot(gpl, ['software', 'license', 'copy', 'gnu', 'program', 'free'])
3 Code taken from http://nltk.googlecode.com/svn/trunk/doc/api/nltk.draw.dispersion-pysrc.html and modified
Jaganadh G Elements of Text Mining
Lexical Dispersion Plot
Lexical Dispersion Plot plot from GPL text
Jaganadh G Elements of Text Mining
Tag Cloud
A tag cloud (word cloud, or weighted list in visual design) is a visual representation of text data, typically used to depict keyword meta-data (tags) on websites, or to visualize free-form text. There is an interesting Python tool to generate tag clouds from text called pytagcloud.4 Here comes an example of creating a tag cloud from the first 100 words of the GPL text.
from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts

def create_tag_cloud(text):
    words = tokenize(text)  # tokenize() as defined in the earlier slide
    doc = " ".join(d for d in words[:100])
    tags = make_tags(get_tag_counts(doc), maxsize=80)
    create_tag_image(tags, 'gpl.png', size=(900, 600),
                     fontname='Philosopher')

gpl = open('gpl-2.0.txt', 'r').read()
create_tag_cloud(gpl)
4 https://github.com/atizo/PyTagCloud
Jaganadh G Elements of Text Mining
Tag Cloud
Tag cloud from GPL text
Jaganadh G Elements of Text Mining
Word co-occurrence
Word co-occurrence analysis is used to construct lexicons, ontologies, etc. In general, it aims to find similarities between word pairs.
A word co-occurrence matrix is a square N x N matrix, where N corresponds to the total number of unique words in a corpus. A cell m_ij contains the number of times m_i co-occurs with word m_j within a specific context, i.e. a natural unit such as a sentence or a certain window of m words. Note that the upper and lower triangles of the matrix are identical, since co-occurrence is a symmetric relation.a
a Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044
Jaganadh G Elements of Text Mining
Word co-occurrence

      w1    w2    w3   ...  wn
w1    m11   m12   m13  ...  m1n
w2    m21   m22   m23  ...  m2n
w3    m31   m32   m33  ...  m3n
...
wn    mn1   mn2   mn3  ...  mnn

Shape of a co-occurrence matrix
Jaganadh G Elements of Text Mining
Word co-occurrence
Finding co-occurrence matrix with Python
from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = [corpora[i], corpora[j]]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]
ccm = cooccurrence_matrix_corpus(corpus)
Jaganadh G Elements of Text Mining
Word co-occurrence
[Figure: co-occurrence graph with nodes such as film, movie, story, like, one, chan, hong, martial, woody, dating, toy, zdenek, czech, julie, tango, sverak, first]
Word co-occurrence visualization from 100 positive movie reviews. The plot shows the top word and its top four associated words; for each associated word, its own associated words are plotted in turn.
Jaganadh G Elements of Text Mining
Stop Words
Stop Words
In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In terms of linguistics these words are called function words. Words like 'a', 'an', and 'the' are examples of stop words. There is no single defined set of stop words; different applications and research groups use different sets of stop words.
Generally stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus compared to content words. Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words
Jaganadh G Elements of Text Mining
Stop Words Filter
def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
    'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
    'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
    'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
    'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
    'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
    'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
    'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
    'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
    'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
    'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
    'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless
Jaganadh G Elements of Text Mining
Bag of Words
The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar and even word order.a
Analyzing text by only analyzing the frequency of words is called the bag of words model.
a http://en.wikipedia.org/wiki/Bag_of_words_model
Jaganadh G Elements of Text Mining
Bag of Words
Example
d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1,
 'likes': 3, 'john': 2, 'mary': 4}
(after removing stopwords)

[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of the corresponding entry in the dictionary.a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model
Jaganadh G Elements of Text Mining
Bag of Words
Documents
d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.
Vocabulary Index

VI(t) =
  0 if t is 'football'
  1 if t is 'games'
  2 if t is 'john'
  3 if t is 'likes'
  4 if t is 'mary'
  5 if t is 'movies'
  6 if t is 'watch'
Jaganadh G Elements of Text Mining
Bag of Words
football games john likes mary movies watch
doc1 0 0 1 2 1 1 1
doc2 1 1 1 1 0 0 1
Jaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python and sklearn 5
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
        'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()
5 http://scikit-learn.org
Jaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with plain Python, just for a sample :-(
def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
                       in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
           for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies Mary likes too",
     "John also likes to watch football games")
bag_of_words(d)
Jaganadh G Elements of Text Mining
TF-IDF
Tf-idf, short for term frequency - inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.
TF-IDF
tf-idf(t) = tf(t, d) x idf(t)
where 't' is a term in document 'd', and tf(t, d) is how many times the term 't' is present in 'd':
tf(t, d) = Σ_{x ∈ d} fr(x, t)
where
fr(x, t) = 1 if x = t, 0 otherwise
Jaganadh G Elements of Text Mining
TF-IDF
TF-IDF
and idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )
where |{d : t ∈ d}| is the count of documents in which 't' is present (i.e. tf(t, d) ≠ 0), and |D| is the cardinality of the document space.
Jaganadh G Elements of Text Mining
TF
TF
tf(t, d) = Σ_{x ∈ d} fr(x, t)
fr(x, t) is a simple function:
fr(x, t) = 1 if x = t, 0 otherwise
Example: tf('john', d1) = 1
Jaganadh G Elements of Text Mining
Document Vector
To create a document vector space
v(dn) = ( tf(t1, dn), tf(t2, dn), ..., tf(tn, dn) )
To represent 'd1' and 'd2' as vectors:
v(d1) = ( tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1) )
v(d2) = ( tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2) )
which evaluates to:
v(d1) = (0, 0, 1, 2, 1, 1, 1)
v(d2) = (1, 1, 1, 1, 0, 0, 1)
Jaganadh G Elements of Text Mining
Vector Space Matrix
The document vectors can be represented as a matrix M_{|D| x F}, where |D| is the cardinality of the document space:

M_{|D| x F} =
[ 0 0 1 2 1 1 1 ]
[ 1 1 1 1 0 0 1 ]
Jaganadh G Elements of Text Mining
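The same matrix can be written down directly with NumPy (a minimal sketch; numpy is an assumed dependency here):

import numpy as np

# term-frequency matrix: one row per document, one column per vocabulary term
M = np.array([[0, 0, 1, 2, 1, 1, 1],
              [1, 1, 1, 1, 0, 0, 1]])
print M.shape  # (2, 7), i.e. |D| x F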
Vector Normalization
Normalized Vector
A normalized vector is represented as v̂ = v / ‖v‖_p, where v̂ is the unit vector (the normalized vector), v is the vector to be normalized, and ‖v‖_p is the norm (magnitude or length) of the vector v in the Lp space (Lebesgue spaces).a
a http://en.wikipedia.org/wiki/Lp_space
Jaganadh G Elements of Text Mining
Vector Normalization
Length of a vector is calculated using the Euclidean norm.6
Non-normalized vector: v = (v1, v2, v3, ..., vn)
Length of vector: ‖v‖ = sqrt(v1^2 + v2^2 + v3^2 + ... + vn^2)
With norm: ‖v‖_p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)
It can be simplified as:
‖v‖_p = ( Σ_{i=1}^{n} |vi|^p )^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining
Vector Normalization
L2 Norm
The norm we apply here is the L2 norm, also called the Euclidean norm.a It is a common norm used to measure the length of a vector, with p = 2.
a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm
Jaganadh G Elements of Text Mining
Vector Normalization
v(d1) = (0, 0, 1, 2, 1, 1, 1)
v̂(d1) = v(d1) / ‖v(d1)‖_2
v̂(d1) = (0, 0, 1, 2, 1, 1, 1) / sqrt(0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2)
v̂(d1) = (0, 0, 1, 2, 1, 1, 1) / sqrt(8)
v̂(d1) = ( 0/sqrt(8), 0/sqrt(8), 1/sqrt(8), 2/sqrt(8), 1/sqrt(8), 1/sqrt(8), 1/sqrt(8) )
v̂(d1) = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)
Now our normalized vector v̂(d1) has an L2-norm of ‖v̂(d1)‖_2 = 1.0
Jaganadh G Elements of Text Mining
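The normalization above can be checked with a few lines of NumPy (a minimal sketch using the document vector v(d1) from the previous slide):

import numpy as np

v = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_hat = v / np.linalg.norm(v, ord=2)  # divide by the L2 norm, sqrt(8)
print v_hat                  # [0. 0. 0.3536 0.7071 0.3536 0.3536 0.3536]
print np.linalg.norm(v_hat)  # 1.0, as expected for a unit vector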
IDF
IDF
idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )
where |{d : t ∈ d}| is the count of documents in which 't' is present (i.e. tf(t, d) ≠ 0).
Jaganadh G Elements of Text Mining
Finding IDF
idf(ti) = log( |D| / (1 + |{d : ti ∈ d}|) ) = log(2/1) = 0.69314718

idf(football) = log(2 / (1+1)) = 0.0
idf(games) = log(2 / (1+1)) = 0.0
idf(john) = log(2 / (1+2)) = -0.40546510810816444
idf(likes) = log(2 / (1+2)) = -0.40546510810816444
idf(mary) = log(2 / (1+1)) = 0.0
idf(movies) = log(2 / (1+1)) = 0.0
idf(watch) = log(2 / (1+1)) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
Jaganadh G Elements of Text Mining
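A minimal sketch to reproduce these idf values with Python's math module, assuming the stop-filtered token lists of d1 and d2:

from __future__ import division
import math

docs = [['john', 'likes', 'watch', 'movies', 'mary', 'likes'],
        ['john', 'likes', 'watch', 'football', 'games']]
vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']
idf = [math.log(len(docs) / (1 + sum(1 for d in docs if t in d)))
       for t in vocab]
print zip(vocab, idf)
# football/games/mary/movies/watch -> 0.0, john/likes -> -0.405465...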
TF-IDF weight
Finding TF-IDF weight
M_{|D| x F} x M_idf =

[ tf(t1,d1) tf(t2,d1) tf(t3,d1) tf(t4,d1) tf(t5,d1) tf(t6,d1) tf(t7,d1) ]
[ tf(t1,d2) tf(t2,d2) tf(t3,d2) tf(t4,d2) tf(t5,d2) tf(t6,d2) tf(t7,d2) ]

x

[ idf(t1)  0        0        0        0        0        0       ]
[ 0        idf(t2)  0        0        0        0        0       ]
[ 0        0        idf(t3)  0        0        0        0       ]
[ 0        0        0        idf(t4)  0        0        0       ]
[ 0        0        0        0        idf(t5)  0        0       ]
[ 0        0        0        0        0        idf(t6)  0       ]
[ 0        0        0        0        0        0        idf(t7) ]
Jaganadh G Elements of Text Mining
TF-IDF weight
[ tf(t1,d1) x idf(t1)  tf(t2,d1) x idf(t2)  tf(t3,d1) x idf(t3)  tf(t4,d1) x idf(t4)  tf(t5,d1) x idf(t5)  tf(t6,d1) x idf(t6)  tf(t7,d1) x idf(t7) ]
[ tf(t1,d2) x idf(t1)  tf(t2,d2) x idf(t2)  tf(t3,d2) x idf(t3)  tf(t4,d2) x idf(t4)  tf(t5,d2) x idf(t5)  tf(t6,d2) x idf(t6)  tf(t7,d2) x idf(t7) ]
Jaganadh G Elements of Text Mining
TF-IDF Normalization
L2 Normalization
M_tf-idf = M_tf-idf / ‖M_tf-idf‖_2
Jaganadh G Elements of Text Mining
TF-IDF
Practice with Python and sklearn 7
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
        'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, i in vectorizer.vocabulary_.items():  # word -> column index
    print '%r => %r' % (w, tfd.idf_[i])
7 http://scikit-learn.org
Jaganadh G Elements of Text Mining
N-Grams
N-Gram
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters; however, the items in question can be phonemes, syllables, letters, words or base pairs, according to the application.a
Unigrams are single words
Bigrams are sequences of two words
Trigrams are sequences of three words
a http://en.wikipedia.org/wiki/N-gram
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-1)
Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
'movies mary', 'mary likes', 'likes too']
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-2, wi-1)
Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
'watch movies mary', 'movies mary likes', 'mary likes too']
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from list of words
def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too" \
        .lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure the strength of word association:
I(wi, wj) = log2( P(wi, wj) / (P(wi) P(wj)) ) ≈ log2( N C(wi, wj) / (C(wi) C(wj)) )
where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram (wi, wj),
and N is the total number of words in the corpus.
I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) \
            / math.log(2)
    return minfo
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association:
t(wi, wj) = ( mean(P(wi, wj)) - mean(P(wi)) mean(P(wj)) ) / sqrt( σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)) )
≈ ( C(wi, wj) - (1/N) C(wi) C(wj) ) / sqrt( C(wi, wj) )
where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram (wi, wj),
and N is the total number of words in the corpus.
t(strong, tea) = ( C(strong, tea) - (1/N) C(strong) C(tea) ) / sqrt( C(strong, tea) )
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc
Jaganadh G Elements of Text Mining
Document Classification
Document classification, or document categorization, is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.8
Document classification tasks can be divided into three kinds:
supervised document classification: performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents;
semi-supervised document classification: a mixture between supervised and unsupervised classification, where some documents or parts of documents are labeled by external assistance;
unsupervised document classification: entirely executed without reference to external information.
8 http://en.wikipedia.org/wiki/Document_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D x C, using a learning algorithm we wish to learn a classifier (or a classifier function) γ that maps documents to classes:
γ : D → C
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ:
Γ(T) = γ
γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise
Main approaches in document classification are:
Naïve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9
9 Image taken from http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
Naïve Bayes Classification
Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem, or Bayes' rule:
Bayes' rule: P(H|E) = ( P(E|H) x P(H) ) / P(E)
The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability: the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability: the probability of an event after the evidence is observed.
Jaganadh G Elements of Text Mining
Naïve Bayes Classification
Let H be the event of raining and E be the evidence of a dark cloud; then we have:
P(raining | dark cloud) = ( P(dark cloud | raining) x P(raining) ) / P(dark cloud)
For multiple evidences:
P(H | E1, E2, ..., En) = ( P(E1, E2, ..., En | H) x P(H) ) / P(E1, E2, ..., En)
With the independence assumption, we can rewrite Bayes' rule as:
P(H | E1, E2, ..., En) = ( P(E1|H) x P(E2|H) x ... x P(En|H) x P(H) ) / P(E1, E2, ..., En)
Jaganadh G Elements of Text Mining
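A quick numeric illustration of Bayes' rule in Python; the probabilities here are made-up values, just to show the arithmetic:

p_rain = 0.3              # prior P(raining), assumed value
p_cloud_given_rain = 0.9  # likelihood P(dark cloud | raining), assumed value
p_cloud = 0.5             # evidence P(dark cloud), assumed value

p_rain_given_cloud = p_cloud_given_rain * p_rain / p_cloud
print p_rain_given_cloud  # 0.54, the posterior P(raining | dark cloud)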
Naïve Bayes Applied in Text Classification
Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 0 Entertainment
Jaganadh G Elements of Text Mining
Building Naïve Bayes Model
The Naïve Bayes model for the training set will look like:

|V| C P(Ci) ni  P(kill|Ci)  P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
6   T 0.5   15  0.23809524  0.19047619  0.33333333    0.04761905   0.09523810   0.09523810
    E 0.5   12  0.05555556  0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class:
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class.
Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12
Jaganadh G Elements of Text Mining
Building Naïve Bayes Model
To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:
P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document is:
Test Docs kill bomb kidnap music movie tv C
Dt        2    1    2      0     0     1  ?
To find the posterior probability:
P(ci|W) = P(ci) x Π_{j=1}^{|V|} P(wj|ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P(Terrorism|W) = P(Terrorism) x P(kill|Terrorism)^2 x P(bomb|Terrorism)^1 x P(kidnap|Terrorism)^2 x P(music|Terrorism)^0 x P(movie|Terrorism)^0 x P(tv|Terrorism)^1
= 0.5 x 0.2380^2 x 0.1904^1 x 0.3333^2 x 0.0476^0 x 0.0952^0 x 0.0952^1
= 0.5 x 0.0566 x 0.1904 x 0.1110 x 1 x 1 x 0.0952
= 5.7 x 10^-5
P(Entertainment|W) = P(Entertainment) x P(kill|Entertainment)^2 x P(bomb|Entertainment)^1 x P(kidnap|Entertainment)^2 x P(music|Entertainment)^0 x P(movie|Entertainment)^0 x P(tv|Entertainment)^1
= 0.5 x 0.0555^2 x 0.1111^1 x 0.1111^2 x 0.3333^0 x 0.2777^0 x 0.1111^1
= 0.5 x 0.0030 x 0.1111 x 0.0123 x 1 x 1 x 0.1111
= 2.27 x 10^-7
The document is classified as "Terrorism" because that class got the highest value.
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents, and in such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve it we can take the logarithm of the probabilities:
P(Terrorism|W) = log(0.5 x 0.2380^2 x 0.1904^1 x 0.3333^2 x 0.0476^0 x 0.0952^0 x 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213 = -4.2437
P(Entertainment|W) = log(0.5 x 0.0555^2 x 0.1111^1 x 0.1111^2 x 0.3333^0 x 0.2777^0 x 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542 = -6.6289
After handling the underflow problem our system still classifies the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.
The section on the Naïve Bayes classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
Jaganadh G Elements of Text Mining
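A minimal sketch of the same computation in Python; the word counts come from the toy training table above, and natural log replaces log10 (the ranking of the classes is unaffected):

import math

V = 6  # vocabulary size, for Laplace estimation
counts = {'terrorism':     {'kill': 4, 'bomb': 3, 'kidnap': 6, 'music': 0, 'movie': 1, 'tv': 1},
          'entertainment': {'kill': 0, 'bomb': 1, 'kidnap': 1, 'music': 5, 'movie': 4, 'tv': 1}}
prior = {'terrorism': 0.5, 'entertainment': 0.5}
test = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}

def log_posterior(cls):
    n = sum(counts[cls].values())  # 15 for terrorism, 12 for entertainment
    score = math.log(prior[cls])
    for word, freq in test.items():
        p = (counts[cls][word] + 1.0) / (n + V)  # Laplace-smoothed P(word|class)
        score += freq * math.log(p)
    return score

for cls in ('terrorism', 'entertainment'):
    print cls, log_posterior(cls)  # terrorism gets the higher (less negative) score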
Naïve Bayes Classifier
There are two different ways to set up a Naïve Bayes classifier:
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in the document.a
a Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting the multi-variate Bernoulli model for text classification, our document space representation will look like:
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.
In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag of words representation. We can also use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.
Jaganadh G Elements of Text Mining
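A minimal sklearn sketch contrasting the two event models; the two toy documents and labels are hypothetical, and BernoulliNB binarizes the counts while MultinomialNB uses the raw frequencies:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

docs = ['kill bomb kidnap kidnap kidnap tv', 'music movie tv music music']
labels = [0, 1]  # 0 = terrorism, 1 = entertainment (toy data)
cv = CountVectorizer()
counts = cv.fit_transform(docs)
mnb = MultinomialNB().fit(counts, labels)            # uses word frequencies
bnb = BernoulliNB(binarize=0.0).fit(counts, labels)  # uses word presence/absence
test = cv.transform(['kidnap bomb tv'])
print mnb.predict(test), bnb.predict(test)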
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}, i = 1, ..., n, where xi ∈ R^p (input feature vector) and yi ∈ {1, -1} is the corresponding label, whether (yi = +1) or (yi = -1). To start with, we assume that our input feature vectors are linearly separable; that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that
⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = -1
⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
Jaganadh G Elements of Text Mining
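A minimal numeric sketch of the linear decision function f(x) = ⟨w, x⟩ + b; the weight vector, bias and input here are made-up values, just to illustrate classification by sign:

import numpy as np

w = np.array([0.4, -0.3, 0.8])  # hypothetical weight vector
b = -0.2                        # hypothetical bias
x = np.array([1.0, 0.5, 1.2])   # a new input feature vector
f = np.dot(w, x) + b            # signed score
print +1 if f > 0 else -1       # predicted label is the sign of f(x)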
Support Vector Machine
Suppose a bunch of red squares and blue rectangles are scattered over a table. If the figures are mixed then it is not possible to separate them in a linear way.
If there is a clear blank space available on the table between the squares and rectangles, we can say that they are linearly separable. A line drawn in the clear space between the figures, exactly equidistant from the margins of the scatter regions of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11
10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and the bibliography.
11 http://www.statsoft.com/textbook/support-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multinomial Naïve Bayes classifier with Python sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build an SVM classifier with Python sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a multi-variate Naïve Bayes classifier with Python NLTK:12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)
12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.
Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                     Actual
                     Positive             Negative
Predicted  Positive  True Positive (TP)   False Positive (FP)
           Negative  False Negative (FN)  True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = (TP + TN) / (TP + FP + FN + TN)

                     Actual
                     Positive  Negative  Total
Predicted  Positive  562       77        639
           Negative  225       436       661
Total                787       513       1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision indicates how many of the items that we identified were relevant:
Precision = TP / (TP + FP)
Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":
Recall = TP / (TP + FN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
                     Actual
                     Positive  Negative  Total
Predicted  Positive  562       77        639
           Negative  225       436       661
Total                787       513       1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong.
ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                     Actual
                     Positive  Negative  Total
Predicted  Positive  562       77        639
           Negative  225       436       661
Total                787       513       1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR):
Fall-out = FP / (FP + TN)

                     Actual
                     Positive  Negative  Total
Predicted  Positive  562       77        639
           Negative  225       436       661
Total                787       513       1300

Fall-out = 77 / (77 + 436) = 0.15
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and its worst score at 0.a
F1 Score = 2 x (precision x recall) / (precision + recall)
a http://en.wikipedia.org/wiki/F1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 x (0.87 x 0.71) / (0.87 + 0.71) = 0.78
F1 Score Negative = 2 x (0.65 x 0.84) / (0.65 + 0.84) = 0.73
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value, or precision rate, is the proportion of positive test results that are true positives:
Positive predictive value = TP / (TP + FP)
Positive predictive value = 562 / (562 + 77) = 0.87
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure:
NPV = TN / (TN + FN)
NPV = 436 / (436 + 225) = 0.65
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TN / (FP + TN)
Specificity = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.
FDR = FP / (FP + TP)
FDR = 77 / (77 + 562) = 0.12
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.
MCC = (TP x TN - FP x FN) / sqrt( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )
MCC = (562 x 436 - 77 x 225) / sqrt( (562 + 77)(562 + 225)(436 + 77)(436 + 225) ) = 0.55
Jaganadh G Elements of Text Mining
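All of the metrics above can be reproduced from the four confusion-matrix cells; a minimal sketch, using the values from the worked example:

from __future__ import division
import math

TP, FP, FN, TN = 562, 77, 225, 436
accuracy = (TP + TN) / (TP + FP + FN + TN)    # 0.76
precision = TP / (TP + FP)                    # 0.87 (positive class)
recall = TP / (TP + FN)                       # 0.71 (positive class)
error_rate = (FP + FN) / (TP + FP + FN + TN)  # 0.23
fallout = FP / (FP + TN)                      # 0.15
f1 = 2 * precision * recall / (precision + recall)  # 0.78
fdr = FP / (FP + TP)                          # 0.12
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))  # 0.55
print accuracy, precision, recall, f1, mcc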
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis, and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
Jaganadh G Elements of Text Mining
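A minimal sketch of computing ROC points and the AUC with sklearn; the true labels and classifier scores below are made-up values:

from sklearn.metrics import roc_curve, auc

y_true = [1, 1, 0, 1, 0, 0, 1, 0]                   # hypothetical actual labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]  # hypothetical predicted scores
fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the ROC curve
print auc(fpr, tpr)                                 # area under that curve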
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child \
of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the \
second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13 http://pypi.python.org/pypi/topia.termextract
Jaganadh G Elements of Text Mining
References
Chris Smith, Machine Learning Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.
Chris Smith, Machine Learning Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.
Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.
Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.
Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.
Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.
Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb. 10, 2012.
Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.
Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.
Jaganadh G Elements of Text Mining
Plotting histogram of top N words
Python code to plot histogram of top 50 words
import numpy as np
import matplotlibpyplot as plt
def plot_hist(text)
tw = top_words(text)
words = [tw[i][0] for i in range(len(tw))]
freq = [tw[j][1] for j in range(len(tw))]
pos = nparange(len(words))
width = 10
ax = pltaxes(frameon=True)
axset_xticks(pos)
axset_yticks(range(0max(freq)10))
axset_xticklabels(wordsrotation=rsquoverticalrsquofontsize=9)
pltbar(posfreqwidth color=rsquobrsquo)
pltshow()
text = open(rsquogpl-20txtrsquorsquorrsquo)read()
plot_hist(text)
Jaganadh G Elements of Text Mining
Histogram
Histogram of top 50 words from GPL v2 (without filtering stop words)
Jaganadh G Elements of Text Mining
Lexical Dispersion Plot
A lexical dispersion plot shows position of a word in given text I have created somePython code to generate lexical dispersion plot with reference to a given word list3
def dispersion_plot(textwords)
wordst = tokenize(text)
points = [(xy) for x in range(len(wordst))
for y in range(len(words)) if wordst[x] == words[y]]
if points
xy = zip(points)
else
x = y = ()
pltplot(xygoscalex=2)
pltyticks(range(len(words))wordscolor=b)
pltylim(-1len(words))
plttitle(Lexical Dispersion Plot)
pltxlabel(Word Offset)
pltshow()
gpl = open(rsquogpl-20txtrsquorsquorrsquo)read()
dispersion_plot(gpl[rsquosoftwarersquorsquolicensersquorsquocopyrsquorsquognursquorsquoprogramrsquorsquofreersquo])
3Code taken fromhttpnltkgooglecodecomsvntrunkdocapinltkdrawdispersion-pysrchtml and modified
Jaganadh G Elements of Text Mining
Lexical Dispersion Plot
Lexical Dispersion Plot plot from GPL text
Jaganadh G Elements of Text Mining
Tag Cloud
A tag cloud (word cloud or weighted list in visual design) is a visual representation for text datatypically used to depict keyword meta-data (tags) on websites or to visualize free form textThere is an interesting python tool to generate tag clouds from text called pytagcloud 4 herecomes an example of creating tag cloud from first 100 words from GPL text
from pytagcloud import create_tag_image make_tags
from pytagcloudlangcounter import get_tag_counts
def create_tag_cloud(text)
words = tokenize(text)
doc = join(d for d in words[100])
tags = make_tags(get_tag_counts(doc) maxsize=80)
create_tag_image(tags rsquogplpngrsquo size=(900 600)
fontname=rsquoPhilosopherrsquo)
gpl = open(rsquogpl-20txtrsquorsquorrsquo)read()
create_tag_cloud(gpl)
4httpsgithubcomatizoPyTagCloud
Jaganadh G Elements of Text Mining
Tag Cloud
Tag cloud from GPL text
Jaganadh G Elements of Text Mining
Word co-occurrence
Word co-occurrence analysis is used to construct lexicon ontologies etc In general itaims to find similarities between word pairs
A word co-occurrence matrix is a square of N timesN matrix where N corresponds totalnumber of unique words in a corpus A cell mij contains the number of times mi
co-occur with word mj with in a specific contextmdash a natural unit such as a sentence or acertain window of m words Note that the upper and lower triangles of the matrix areidentical since co-occurrence is a symmetric relation a
aJimmy LinScalable Language Processing Algorithms for the Masses A Case Study inComputing Word Co-occurrence Matrices with MapReducewwwaclweborganthologyD08-1044
Jaganadh G Elements of Text Mining
Word co-occourance
w1 w2 w3 wnw1 m11 m12 m13 m1n
w2 m21 m22 m23 m2n
w3 m31 m32 m33 m3n
wn mn1 mn2 mn3 mnnShape of a co-occurrence matrix
Jaganadh G Elements of Text Mining
Word co-occurrence
Finding co-occurrence matrix with Python
def cooccurrence_matrix_corpus(corpus)
matrix = defaultdict(lambda defaultdict(int))
for corpora in corpus
for i in xrange(len(corpora)-1)
for j in xrange(i+1 len(corpora))
word1 word2 = [corpora[i]corpora[j]]
matrix[word1][word2] += 1
matrix[word2][word1] += 1
return matrix
corpus = [[rsquow1rsquorsquow2rsquorsquow3rsquo][rsquow4rsquorsquow5rsquorsquow6rsquo]]
ccm = cooccurrence_matrix_corpus(corpus)
Jaganadh G Elements of Text Mining
Word co-occurrence
hong
toy zdenek
czech
movie
julie
like
chan
story
tango
one
sverak
martial
woody
dating
film
first
Word co-occurrence visualization from 100 positive movie reviews The plot show top first word and top four
associated words For each associated words again associated words are plotted
Jaganadh G Elements of Text Mining
Stop Words
Stop Words
In computing stop words are words which are filtered out prior to or after processing ofnatural language data (text) In terms of linguistics these words are called as functionwords Words like rsquoarsquo rsquoanrsquo rsquothersquo are examples for stop words There is no defined set ofstop words available Different applications and research groups uses different sets o stopwordsGenerally stop words are omitted in text mining process The frequency of stop wordswill be very high in any corpus compared to content wordsPointers to some good stopword list is available at httpenwikipediaorgwikiStop_words
Jaganadh G Elements of Text Mining
Stop Words Filter
def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
    'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
    'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
    'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
    'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
    'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
    'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
    'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
    'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
    'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
    'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
    'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless
Jaganadh G Elements of Text Mining
Bag of Words
The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an un-ordered collection of words, disregarding grammar and even word order. [a]

Analyzing text by considering only the frequency of words is called the bag of words model.

a: http://en.wikipedia.org/wiki/Bag_of_words_model
Jaganadh G Elements of Text Mining
Bag of Words
Example
d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

Dictionary (after removing stop words):
{'football': 0, 'watch': 6, 'movies': 5, 'games': 1,
 'likes': 3, 'john': 2, 'mary': 4}

Document vectors:
[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors is the count of the corresponding entry in the dictionary. [a]
a: Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model
Jaganadh G Elements of Text Mining
Bag of Words
Documents
d1 John likes to watch movies Mary likes too
d2 John also likes to watch football games
Vocabulary Index
VI(t) = 0 if t is 'football', 1 if t is 'games', 2 if t is 'john', 3 if t is 'likes',
        4 if t is 'mary', 5 if t is 'movies', 6 if t is 'watch'
Jaganadh G Elements of Text Mining
Bag of Words
football games john likes mary movies watch
doc1 0 0 1 2 1 1 1
doc2 1 1 1 1 0 0 1
Jaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python and sklearn 5
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
        'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()
5: http://scikit-learn.org
Jaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with plain Python, just for a sample :-(

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
        in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
        for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies Mary likes too",
     "John also likes to watch football games")
bag_of_words(d)
Jaganadh G Elements of Text Mining
TF-IDF
Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

$$tf\text{-}idf(t, d) = tf(t, d) \times idf(t)$$

where 't' is a term in document 'd' and tf(t, d) is how many times the term 't' is present in 'd':

$$tf(t, d) = \sum_{x \in d} fr(x, t)$$

where

$$fr(x, t) = \begin{cases} 1 & \text{if } x = t \\ 0 & \text{otherwise} \end{cases}$$
Jaganadh G Elements of Text Mining
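To make the definition concrete, here is a minimal sketch of the raw term frequency in Python (the simple lowercase-and-split tokenization is an assumption; the tokenize function from the earlier slides would work equally well):

def tf(term, doc):
    # tf(t, d): how many tokens x in d satisfy x = t
    return doc.lower().split().count(term)

d1 = "John likes to watch movies Mary likes too"
print tf('likes', d1)   # 2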
TF-IDF
TF-IDF
and

$$idf(t) = \log \frac{|D|}{1 + |\{d : t \in d\}|}$$

where $|\{d : t \in d\}|$ is the number of documents in which 't' is present, i.e. where $tf(t, d) \neq 0$, and $|D|$ is the cardinality of the document space.
Jaganadh G Elements of Text Mining
TF
TF
$$tf(t, d) = \sum_{x \in d} fr(x, t)$$

$fr(x, t)$ is a simple function:

$$fr(x, t) = \begin{cases} 1 & \text{if } x = t \\ 0 & \text{otherwise} \end{cases}$$

Example: $tf('john', d_1) = 1$
Jaganadh G Elements of Text Mining
Document Vector
To create a document vector space:

$$\vec{v}_{d_n} = (tf(t_1, d_n), tf(t_2, d_n), \ldots, tf(t_n, d_n))$$

To represent 'd1' and 'd2' as vectors:

$$\vec{v}_{d_1} = (tf(t_1, d_1), tf(t_2, d_1), tf(t_3, d_1), tf(t_4, d_1), tf(t_5, d_1), tf(t_6, d_1), tf(t_7, d_1))$$
$$\vec{v}_{d_2} = (tf(t_1, d_2), tf(t_2, d_2), tf(t_3, d_2), tf(t_4, d_2), tf(t_5, d_2), tf(t_6, d_2), tf(t_7, d_2))$$

which evaluates to:

$$\vec{v}_{d_1} = (0, 0, 1, 2, 1, 1, 1), \qquad \vec{v}_{d_2} = (1, 1, 1, 1, 0, 0, 1)$$
Jaganadh G Elements of Text Mining
Vector Space Matrix
The document vectors can be represented as a matrix $M_{|D| \times F}$, where $|D|$ is the cardinality of the document space:

$$M_{|D| \times F} = \begin{bmatrix} 0 & 0 & 1 & 2 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 0 & 0 & 1 \end{bmatrix}$$
Jaganadh G Elements of Text Mining
Vector Normalization
Normalized Vector
A normalized vector is represented as $\hat{v} = \frac{\vec{v}}{\|\vec{v}\|_p}$, where $\hat{v}$ is the unit (normalized) vector, $\vec{v}$ is the vector to be normalized, and $\|\vec{v}\|_p$ is the norm (magnitude or length) of the vector $\vec{v}$ in the $L^p$ space (Lebesgue spaces). [a]
a: http://en.wikipedia.org/wiki/Lp_space
Jaganadh G Elements of Text Mining
Vector Normalization
The length of a vector is calculated using the Euclidean norm. [6]
For a non-normalized vector $\vec{v} = (v_1, v_2, v_3, \ldots, v_n)$, the length is

$$\|\vec{v}\| = \sqrt{v_1^2 + v_2^2 + v_3^2 + \cdots + v_n^2}$$

With the p-norm: $\|\vec{v}\|_p = (|v_1|^p + |v_2|^p + |v_3|^p + \cdots + |v_n|^p)^{1/p}$

It can be simplified as

$$\|\vec{v}\|_p = \Big(\sum_{i=1}^{n} |\vec{v}_i|^p\Big)^{1/p}$$

6: http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining
Vector Normalization
L2 Norm
The norm which we apply here is the L2 Norm, which is also called the Euclidean norm. [a]
It is a common norm used to measure the length of a vector, where p = 2.
a: http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm
Jaganadh G Elements of Text Mining
Vector Normalization
$$\vec{v}_{d_1} = (0, 0, 1, 2, 1, 1, 1)$$
$$\hat{v}_{d_1} = \frac{\vec{v}_{d_1}}{\|\vec{v}_{d_1}\|_2} = \frac{(0, 0, 1, 2, 1, 1, 1)}{\sqrt{0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2}} = \frac{(0, 0, 1, 2, 1, 1, 1)}{\sqrt{8}}$$
$$\hat{v}_{d_1} = \left(\tfrac{0}{\sqrt{8}}, \tfrac{0}{\sqrt{8}}, \tfrac{1}{\sqrt{8}}, \tfrac{2}{\sqrt{8}}, \tfrac{1}{\sqrt{8}}, \tfrac{1}{\sqrt{8}}, \tfrac{1}{\sqrt{8}}\right)$$
$$\hat{v}_{d_1} = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)$$

Our normalized vector $\hat{v}_{d_1}$ now has an L2-norm $\|\hat{v}_{d_1}\|_2 = 1.0$.
Jaganadh G Elements of Text Mining
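A quick numerical check of the worked example above, sketched with numpy (the use of numpy here is an assumption; plain Python arithmetic would work as well):

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_hat = v_d1 / np.linalg.norm(v_d1, ord=2)   # divide by the L2 norm, sqrt(8)
print v_hat                   # [ 0.  0.  0.3535  0.7071  0.3535  0.3535  0.3535 ]
print np.linalg.norm(v_hat)   # 1.0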
IDF
IDF
$$idf(t) = \log \frac{|D|}{1 + |\{d : t \in d\}|}$$

where $|\{d : t \in d\}|$ is the number of documents in which 't' is present, i.e. where $tf(t, d) \neq 0$.
Jaganadh G Elements of Text Mining
Finding IDF
With $|D| = 2$:

$$idf(t_i) = \log \frac{|D|}{1 + |\{d : t_i \in d\}|} = \log \frac{2}{1 + |\{d : t_i \in d\}|}$$

(for reference, $\log \frac{2}{1} = 0.69314718$)

idf(football) = log(2 / (1 + 1)) = 0.0
idf(games) = log(2 / (1 + 1)) = 0.0
idf(john) = log(2 / (1 + 2)) = -0.40546510810816444
idf(likes) = log(2 / (1 + 2)) = -0.40546510810816444
idf(mary) = log(2 / (1 + 1)) = 0.0
idf(movies) = log(2 / (1 + 1)) = 0.0
idf(watch) = log(2 / (1 + 1)) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
Jaganadh G Elements of Text Mining
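The idf values above can be reproduced with a few lines of Python; the hand-tokenized documents below are an assumption written out for illustration:

from __future__ import division
import math

docs = [['john', 'likes', 'watch', 'movies', 'mary', 'likes'],
        ['john', 'also', 'likes', 'watch', 'football', 'games']]
vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']

def idf(term, docs):
    df = sum(1 for doc in docs if term in doc)   # |{d : t in d}|
    return math.log(len(docs) / (1 + df))

print [round(idf(t, docs), 4) for t in vocab]
# [0.0, 0.0, -0.4055, -0.4055, 0.0, 0.0, 0.0]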
TF-IDF weight
Finding TF-IDF weight
$$M_{|D| \times F} \times M_{idf}$$

$$\begin{bmatrix} tf(t_1,d_1) & tf(t_2,d_1) & tf(t_3,d_1) & tf(t_4,d_1) & tf(t_5,d_1) & tf(t_6,d_1) & tf(t_7,d_1) \\ tf(t_1,d_2) & tf(t_2,d_2) & tf(t_3,d_2) & tf(t_4,d_2) & tf(t_5,d_2) & tf(t_6,d_2) & tf(t_7,d_2) \end{bmatrix} \times \begin{bmatrix} idf(t_1) & 0 & \cdots & 0 \\ 0 & idf(t_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & idf(t_7) \end{bmatrix}$$
Jaganadh G Elements of Text Mining
TF-IDF weight
$$\begin{bmatrix} tf(t_1,d_1) \times idf(t_1) & tf(t_2,d_1) \times idf(t_2) & \cdots & tf(t_7,d_1) \times idf(t_7) \\ tf(t_1,d_2) \times idf(t_1) & tf(t_2,d_2) \times idf(t_2) & \cdots & tf(t_7,d_2) \times idf(t_7) \end{bmatrix}$$
Jaganadh G Elements of Text Mining
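The same weighting, sketched as an explicit matrix product with numpy (an assumption for illustration; the idf values are the rounded ones computed above, and sklearn performs this step internally):

import numpy as np

M_tf = np.array([[0, 0, 1, 2, 1, 1, 1],
                 [1, 1, 1, 1, 0, 0, 1]], dtype=float)
idf_v = np.array([0.0, 0.0, -0.4055, -0.4055, 0.0, 0.0, 0.0])
M_tfidf = M_tf.dot(np.diag(idf_v))   # multiply by the diagonal idf matrix
print M_tfidf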
TF-IDF Normalization
L2 Normalization
$$M_{tf\text{-}idf} = \frac{M_{tf\text{-}idf}}{\|M_{tf\text{-}idf}\|_2}$$
Jaganadh G Elements of Text Mining
TF-IDF
Practice with Python and sklearn 7
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
        'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, f in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[f])
7: http://scikit-learn.org
Jaganadh G Elements of Text Mining
N-Grams
N-Gram
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words, or base pairs according to the application. [a]

Unigrams are single words
Bigrams are sequences of two words
Trigrams are sequences of three words

a: http://en.wikipedia.org/wiki/N-gram
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
$$P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-1})$$

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [' '.join(bigram) for bigram in ibigrams]
print bigrams
['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
$$P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1})$$

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [' '.join(trigram) for trigram in itrigrams]
print trigrams
['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from list of words
def ngrams(words, n=2):
    grams = [' '.join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure the strength of word association:

$$I(w_i, w_j) = \log_2 \frac{P(w_i, w_j)}{P(w_i)\,P(w_j)} \approx \log_2 \frac{N\,C(w_i, w_j)}{C(w_i)\,C(w_j)}$$

where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus, $C(w_i, w_j)$ is the frequency of the bigram $w_i w_j$, and $N$ is the total number of words in the corpus.

$$I(strong, tea) = \log_2 \frac{P(strong, tea)}{P(strong)\,P(tea)} \approx \log_2 \frac{N\,C(strong, tea)}{C(strong)\,C(tea)}$$
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) \
            / math.log(2)
    return minfo
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure the strength of word association:

$$t(w_i, w_j) = \frac{mean(P(w_i, w_j)) - mean(P(w_i))\,mean(P(w_j))}{\sqrt{\sigma^2(P(w_i, w_j)) + \sigma^2(P(w_i))\,\sigma^2(P(w_j))}} \approx \frac{C(w_i, w_j) - \frac{1}{N}\,C(w_i)\,C(w_j)}{\sqrt{C(w_i, w_j)}}$$

where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus, $C(w_i, w_j)$ is the frequency of the bigram $w_i w_j$, and $N$ is the total number of words in the corpus.

$$t(strong, tea) = \frac{C(strong, tea) - \frac{1}{N}\,C(strong)\,C(tea)}{\sqrt{C(strong, tea)}}$$
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc
Jaganadh G Elements of Text Mining
Document Classification
Document classification or document categorization is a problem in library science, information science, and computer science. The task is to assign a document to one or more classes or categories. [8]

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents
semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance
unsupervised document classification is entirely executed without reference to external information

8: http://en.wikipedia.org/wiki/Document_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c_1, c_2, c_3, ..., c_m) be a set of pre-defined categories and let D = (d_1, d_2, d_3, ..., d_n) be a set of documents to be classified. Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (or classifier function) γ that maps documents to classes, γ : D → C. A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ.

$$\gamma(c_i, d_j) = \begin{cases} 1 & \text{if } d_j \text{ belongs to } c_i \\ 0 & \text{otherwise} \end{cases}$$

The main approaches in document classification are:

Naïve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline [9]

9: Image taken from http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem, or Bayes' rule:

$$P(H \mid E) = \frac{P(E \mid H) \times P(H)}{P(E)}$$

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.
Jaganadh G Elements of Text Mining
Naïve Bayes Classification

Let H be the event of raining and E be the evidence of a dark cloud; then we have

$$P(raining \mid dark\ cloud) = \frac{P(dark\ cloud \mid raining) \times P(raining)}{P(dark\ cloud)}$$

For multiple evidences:

$$P(H \mid E_1, E_2, \ldots, E_n) = \frac{P(E_1, E_2, \ldots, E_n \mid H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$$

With the independence assumption, we can rewrite Bayes' rule as follows:

$$P(H \mid E_1, E_2, \ldots, E_n) = \frac{P(E_1 \mid H) \times P(E_2 \mid H) \times \cdots \times P(E_n \mid H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$$
Jaganadh G Elements of Text Mining
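A toy numeric illustration of Bayes' rule with the rain example; all three input probabilities below are made-up numbers, assumed only for this sketch:

p_rain = 0.3              # P(H), prior probability of raining (assumed)
p_cloud_given_rain = 0.9  # P(E|H) (assumed)
p_cloud = 0.4             # P(E), probability of observing a dark cloud (assumed)
print p_cloud_given_rain * p_rain / p_cloud   # P(H|E) = 0.675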
Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naïve Bayes Model

The Naïve Bayes model for the training set will be like:

|V|  C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
 6   T  0.5    15  0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
     E  0.5    12  0.05555556   0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary items = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2+1+3+1+1+1+1+1+1+2+1 = 15
n_entertainment = 1+2+1+1+1+1+1+2+2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15
P(kill|Entertainment) = (0+0+0)/12 = 0/12
Jaganadh G Elements of Text Mining
Building Naïve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words as follows:
P(kill|Terrorism) = (2+1+1+1) / (15+|V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0+0+0+1) / (12+|V|) = 1/18 = 0.0555
Jaganadh G Elements of Text Mining
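The Laplace-smoothed estimates can be checked with a short Python sketch; the counts and totals are the ones from the slides above:

from __future__ import division

V = 6                           # vocabulary size
n_t, n_e = 15, 12               # total word frequency per class
kill_t, kill_e = 4, 0           # frequency of 'kill' in each class
print (kill_t + 1) / (n_t + V)  # P(kill|Terrorism) = 5/21 = 0.2380...
print (kill_e + 1) / (n_e + V)  # P(kill|Entertainment) = 1/18 = 0.0555...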
Testing the NB model
Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt         2     1     2       0      0      1   ?

To find the posterior probability:

$$P(c_i \mid W) = P(c_i) \times \prod_{j=1}^{V} P(w_j \mid c_i)$$
Jaganadh G Elements of Text Mining
Testing the NB model
P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) × P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism)
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) × P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document is classified as "Terrorism" because that class got the highest value.
Jaganadh G Elements of Text Mining
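The posterior computation can be scripted directly; the conditional probabilities below are the (rounded) values from the model slide:

from __future__ import division

p_t = [0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952]  # P(w|Terrorism)
p_e = [0.0555, 0.1111, 0.1111, 0.3333, 0.2777, 0.1111]  # P(w|Entertainment)
w = [2, 1, 2, 0, 0, 1]   # test doc counts for kill, bomb, kidnap, music, movie, tv

post_t, post_e = 0.5, 0.5   # class priors P(Ci)
for pt, pe, c in zip(p_t, p_e, w):
    post_t *= pt ** c
    post_e *= pe ** c
print post_t   # ~5.7e-05 -> Terrorism wins
print post_e   # ~2.3e-07 (matches the slide's 2.27e-07 up to rounding)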
Preventing Underflow
The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called Underflow. To resolve the problem we can take the logarithm of the probabilities:
P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213 = -4.2437
P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542 = -6.6289

After handling the underflow problem our system still classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
Jaganadh G Elements of Text Mining
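The same comparison in log space, as a minimal sketch (base-10 logs, matching the slide's arithmetic):

import math

log_post_t = math.log10(0.5) + 2 * math.log10(0.2380) + math.log10(0.1904) \
    + 2 * math.log10(0.3333) + math.log10(0.0952)
log_post_e = math.log10(0.5) + 2 * math.log10(0.0555) + math.log10(0.1111) \
    + 2 * math.log10(0.1111) + math.log10(0.1111)
print log_post_t   # ~ -4.24 -> still the highest score
print log_post_e   # ~ -6.63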
Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word w_t from the vocabulary. Dimension t of the vector for document d_i is written B_it, and is either 0 or 1, indicating whether word w_t occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in a document. [a]
a: A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will look like:
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
Jaganadh G Elements of Text Mining
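sklearn can produce this binarized representation too; a small sketch using CountVectorizer's binary option (exact parameter availability depends on your sklearn version):

from sklearn.feature_extraction.text import CountVectorizer

docs = ('John likes to watch movies Mary likes too',
        'John also likes to watch football games')
vectorizer = CountVectorizer(analyzer='word', binary=True,
                             stop_words='english')
print vectorizer.fit_transform(docs).toarray()  # cells are 1 (present) or 0 (absent)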
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag of words representation. We can also use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^p$ (input feature vector) and $y_i \in \{1, -1\}$ is the corresponding label, whether $(y_i = +1)$ or $(y_i = -1)$. To start with, we assume that our input feature vectors are linearly separable; that is, there exists a function $f(x) = \langle w, x \rangle + b$, where $w \in \mathbb{R}^p$ (weight vector) and $b \in \mathbb{R}$ (bias), such that

$$\langle w, x_i \rangle + b > 0 \text{ for } y_i = +1, \qquad \langle w, x_i \rangle + b < 0 \text{ for } y_i = -1$$

$\langle w, x \rangle + b = 0$ is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red squares and blue rectangles are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at an exactly equal distance from the margins of the scatter regions of squares and rectangles, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn along the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors. [10] If the data is not linearly separable we have to use kernel tricks. [11]

10: This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography
11: http://www.statsoft.com/textbook/support-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multinomial Naïve Bayes Classifier with Python sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build an SVM Classifier with Python sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK [12]:

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12: This code is adopted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                        Actual
                        Positive              Negative
Predicted  Positive     True Positive (TP)    False Positive (FP)
           Negative     False Negative (FN)   True Negative (TN)
Jaganadh G Elements of Text Mining
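sklearn can build this matrix from predicted and actual labels; the toy label lists here are assumptions for illustration only:

from sklearn.metrics import confusion_matrix

y_actual = [1, 0, 1, 1, 0, 1]      # toy actual labels (assumed)
y_predicted = [1, 0, 0, 1, 1, 1]   # toy predicted labels (assumed)
print confusion_matrix(y_actual, y_predicted)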
Evaluating Performance of a Classifier
Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision, which indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall, which indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong.

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

Fall-out = 77 / (77 + 436) = 0.15
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. [a]

F1 Score = 2 × (precision × recall) / (precision + recall)

a: http://en.wikipedia.org/wiki/F1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78
F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
Jaganadh G Elements of Text Mining
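All of the metrics above for the 1300-document example can be verified with a few lines of Python:

from __future__ import division

tp, fp, fn, tn = 562, 77, 225, 436
accuracy = (tp + tn) / (tp + fp + fn + tn)          # 0.7677
precision = tp / (tp + fp)                          # 0.8795
recall = tp / (tp + fn)                             # 0.7141
f1 = 2 * precision * recall / (precision + recall)  # 0.7883
print accuracy, precision, recall, f1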
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$

$$MCC = \frac{562 \times 436 - 77 \times 225}{\sqrt{(562+77)(562+225)(436+77)(436+225)}} = 0.55$$
Jaganadh G Elements of Text Mining
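A quick numerical check of the MCC value:

from __future__ import division
import math

tp, fp, fn, tn = 562, 77, 225, 436
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
print mcc   # ~0.55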
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis, and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
Jaganadh G Elements of Text Mining
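A minimal ROC/AUC sketch with sklearn; the four toy labels and scores are assumptions for illustration only:

from sklearn.metrics import roc_curve, auc

y_true = [0, 0, 1, 1]              # toy actual labels (assumed)
y_score = [0.1, 0.4, 0.35, 0.8]    # toy classifier scores (assumed)
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print auc(fpr, tpr)                # 0.75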
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories, such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child \
of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the \
second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13: http://pypi.python.org/pypi/topia.termextract/
Jaganadh G Elements of Text Mining
References
Chris Smith, Machine Learning, Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith, Machine Learning, Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.
Jaganadh G Elements of Text Mining
Histogram
Histogram of top 50 words from GPL v2 (without filtering stop words)
Jaganadh G Elements of Text Mining
Lexical Dispersion Plot
A lexical dispersion plot shows the position of a word in a given text. I have created some Python code to generate a lexical dispersion plot with reference to a given word list. [3]

import matplotlib.pyplot as plt

def dispersion_plot(text, words):
    wordst = tokenize(text)  # tokenize function from the earlier slides
    points = [(x, y) for x in range(len(wordst))
        for y in range(len(words)) if wordst[x] == words[y]]
    if points:
        x, y = zip(*points)
    else:
        x = y = ()
    plt.plot(x, y, "go", scalex=2)
    plt.yticks(range(len(words)), words, color="b")
    plt.ylim(-1, len(words))
    plt.title("Lexical Dispersion Plot")
    plt.xlabel("Word Offset")
    plt.show()

gpl = open('gpl-2.0.txt', 'r').read()
dispersion_plot(gpl, ['software', 'license', 'copy', 'gnu', 'program', 'free'])

3: Code taken from http://nltk.googlecode.com/svn/trunk/doc/api/nltk.draw.dispersion-pysrc.html and modified
Jaganadh G Elements of Text Mining
Lexical Dispersion Plot
Lexical dispersion plot from the GPL text
Jaganadh G Elements of Text Mining
Tag Cloud
A tag cloud (word cloud, or weighted list in visual design) is a visual representation of text data, typically used to depict keyword meta-data (tags) on websites, or to visualize free form text. There is an interesting Python tool to generate tag clouds from text, called pytagcloud. [4] Here comes an example of creating a tag cloud from the first 100 words of the GPL text:

from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts

def create_tag_cloud(text):
    words = tokenize(text)  # tokenize function from the earlier slides
    doc = " ".join(d for d in words[:100])
    tags = make_tags(get_tag_counts(doc), maxsize=80)
    create_tag_image(tags, 'gpl.png', size=(900, 600),
        fontname='Philosopher')

gpl = open('gpl-2.0.txt', 'r').read()
create_tag_cloud(gpl)

4: https://github.com/atizo/PyTagCloud
Jaganadh G Elements of Text Mining
Tag Cloud
Tag cloud from GPL text
Jaganadh G Elements of Text Mining
Word co-occurrence
Word co-occurrence analysis is used to construct lexicon ontologies etc In general itaims to find similarities between word pairs
A word co-occurrence matrix is a square of N timesN matrix where N corresponds totalnumber of unique words in a corpus A cell mij contains the number of times mi
co-occur with word mj with in a specific contextmdash a natural unit such as a sentence or acertain window of m words Note that the upper and lower triangles of the matrix areidentical since co-occurrence is a symmetric relation a
aJimmy LinScalable Language Processing Algorithms for the Masses A Case Study inComputing Word Co-occurrence Matrices with MapReducewwwaclweborganthologyD08-1044
Jaganadh G Elements of Text Mining
Word co-occourance
w1 w2 w3 wnw1 m11 m12 m13 m1n
w2 m21 m22 m23 m2n
w3 m31 m32 m33 m3n
wn mn1 mn2 mn3 mnnShape of a co-occurrence matrix
Jaganadh G Elements of Text Mining
Word co-occurrence
Finding co-occurrence matrix with Python
def cooccurrence_matrix_corpus(corpus)
matrix = defaultdict(lambda defaultdict(int))
for corpora in corpus
for i in xrange(len(corpora)-1)
for j in xrange(i+1 len(corpora))
word1 word2 = [corpora[i]corpora[j]]
matrix[word1][word2] += 1
matrix[word2][word1] += 1
return matrix
corpus = [[rsquow1rsquorsquow2rsquorsquow3rsquo][rsquow4rsquorsquow5rsquorsquow6rsquo]]
ccm = cooccurrence_matrix_corpus(corpus)
Jaganadh G Elements of Text Mining
Word co-occurrence
hong
toy zdenek
czech
movie
julie
like
chan
story
tango
one
sverak
martial
woody
dating
film
first
Word co-occurrence visualization from 100 positive movie reviews The plot show top first word and top four
associated words For each associated words again associated words are plotted
Jaganadh G Elements of Text Mining
Stop Words
Stop Words
In computing stop words are words which are filtered out prior to or after processing ofnatural language data (text) In terms of linguistics these words are called as functionwords Words like rsquoarsquo rsquoanrsquo rsquothersquo are examples for stop words There is no defined set ofstop words available Different applications and research groups uses different sets o stopwordsGenerally stop words are omitted in text mining process The frequency of stop wordswill be very high in any corpus compared to content wordsPointers to some good stopword list is available at httpenwikipediaorgwikiStop_words
Jaganadh G Elements of Text Mining
Stop Words Filter
def stop_filter(words)
stops = [rsquoirsquo rsquomersquo rsquomyrsquo rsquomyselfrsquo rsquowersquo rsquoourrsquo rsquooursrsquo rsquoourselvesrsquo rsquoyoursquo
rsquoyourrsquo rsquoyoursrsquo rsquoyourselfrsquo rsquoyourselvesrsquo rsquohersquo rsquohimrsquo rsquohisrsquo rsquohimselfrsquo rsquoshersquo
rsquoherrsquo rsquohersrsquo rsquoherselfrsquo rsquoitrsquo rsquoitsrsquo rsquoitselfrsquorsquotheyrsquo rsquothemrsquo rsquotheirrsquo rsquotheirsrsquo
rsquothemselvesrsquo rsquowhatrsquo rsquowhichrsquo rsquowhorsquo rsquowhomrsquo rsquothisrsquo rsquothatrsquo rsquothesersquo rsquothosersquo
rsquoamrsquo rsquoisrsquo rsquoarersquo rsquowasrsquo rsquowerersquorsquobersquo rsquobeenrsquo rsquobeingrsquo rsquohaversquo rsquohasrsquo rsquohadrsquo
rsquohavingrsquo rsquodorsquo rsquodoesrsquo rsquodidrsquo rsquodoingrsquo rsquoarsquo rsquoanrsquo rsquothersquo rsquoandrsquo rsquobutrsquo rsquoifrsquo rsquoorrsquo rsquobecausersquo
rsquoasrsquo rsquountilrsquo rsquowhilersquo rsquoofrsquo rsquoatrsquo rsquobyrsquo rsquoforrsquorsquowithrsquo rsquoaboutrsquo rsquoagainstrsquo rsquobetweenrsquo rsquointorsquo
rsquothroughrsquo rsquoduringrsquo rsquobeforersquo rsquoafterrsquo rsquoaboversquo rsquobelowrsquo rsquotorsquo rsquofromrsquo rsquouprsquo rsquodownrsquo rsquoinrsquo rsquooutrsquo
rsquoonrsquo rsquooffrsquo rsquooverrsquo rsquounderrsquo rsquoagainrsquo rsquofurtherrsquorsquothenrsquo rsquooncersquo rsquoherersquo rsquotherersquo rsquowhenrsquo rsquowherersquo
rsquowhyrsquo rsquohowrsquo rsquoallrsquo rsquoanyrsquo rsquobothrsquo rsquoeachrsquo rsquofewrsquo rsquomorersquo rsquomostrsquo rsquootherrsquo rsquosomersquo rsquosuchrsquo rsquonorsquo
rsquonorrsquorsquonotrsquo rsquoonlyrsquo rsquoownrsquo rsquosamersquo rsquosorsquo rsquothanrsquo rsquotoorsquo rsquoveryrsquo rsquosrsquo rsquotrsquo rsquocanrsquo rsquowillrsquo rsquojustrsquo
rsquodonrsquo rsquoshouldrsquo rsquonowrsquo]
stopless = [word in words if word not in stops]
return stopless
Jaganadh G Elements of Text Mining
Bag of Words
The bag-of-words model is a simplifying representation used in natural languageprocessing and information retrieval (IR) In this model a text (such as a sentence or adocument) is represented as an un-ordered collection of words disregarding grammarand even word order a
Analyzing text by only analyzing frequency of words is called as bag of words model
ahttpenwikipediaorgwikiBag of words model
Jaganadh G Elements of Text Mining
Bag of Words
Example
d1 John likes to watch movies Mary likes too
d2 John also likes to watch football games
rsquofootballrsquo 0 rsquowatchrsquo 6 rsquomoviesrsquo 5 rsquogamesrsquo 1
rsquolikesrsquo 3 rsquojohnrsquo 2 rsquomaryrsquo 4
after removing stopwords
[0 0 1 2 1 1 1]
[1 1 1 1 0 0 1]
each entry of the vectors refers to count of
the corresponding entry in the dictionary
a
aExample taken from httpenwikipediaorgwikiBag of words model
Jaganadh G Elements of Text Mining
Bag of Words
Documents
d1 John likes to watch movies Mary likes too
d2 John also likes to watch football games
Vocabulary Index
V I(t) =
0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo
Jaganadh G Elements of Text Mining
Bag of Words
Documents
d1 John likes to watch movies Mary likes too
d2 John also likes to watch football games
Vocabulary Index
V I(t) =
0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo
Jaganadh G Elements of Text Mining
Bag of Words
football games john likes mary movies watch
doc1 0 0 1 2 1 1 1
doc2 1 1 1 1 0 0 1
Jaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python and sklearn 5
from sklearnfeature_extractiontext import
CountVectorizer
vectorizer = CountVectorizer(analyzer=rsquowordrsquo
min_n=1stop_words=rsquoenglishrsquo)
docs = (rsquoJohn likes to watch movies Mary
likes toorsquorsquoJohn also likes to watch football gamesrsquo)
bow = vectorizerfit_transform(docs)
print vectorizervocabulary_
print bowtoarray()
5httpscikit-learnorgJaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python Just for sample -(
def bag_of_words(docs)
stops = [rsquotorsquorsquotoorsquorsquoalsorsquo]
token_list = [tokenize(doc) for doc in docs]
vocab = list(set(token_list[0])union(token_list))
vocab =[v for v in vocab if v not in stops and len(v) gt 1]
vocab_idex = dict( [ ( word vocabindex(word) ) for word
in vocab] )
bow = [[tokenscount(word) for word in vocab_idexkeys()]
for tokens in token_list]
print vocab_idex
for bag in bow
print bag
d = (John likes to watch movies Mary likes too
John also likes to watch football games)
bag_of_words(d)
Jaganadh G Elements of Text Mining
TF-IDF
Tfndashidf term frequencyndashinverse document frequency is a numerical statistic which reflectshow important a word is to a document in a collection or corpus
TF-IDF
tf minus idf(t) = tf(t d)times idf(t)
where rsquotrsquo is a term in document rsquodrsquotf(t d) how many times the term rsquotrsquo is present in rsquodrsquo
tf(t d) =sumxisind
fr(x t)
where
fr(x t) =
1 if x = t0 otherwise
Jaganadh G Elements of Text Mining
TF-IDF
TF-IDF
andidf(t) = log |D|
1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space
Jaganadh G Elements of Text Mining
TF
TF
tf(t d) =sumxisind
fr(x t)
fr(x t) is a simple function
fr(x t) =
1 if x = t0 otherwise
Exampletf(primejohnprime d1) = 1
Jaganadh G Elements of Text Mining
Document Vector
To create a document vector space
V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))
To represent rsquod1rsquo and rsquod2rsquo as vectors
V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))
V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
which evaluates toV ~d1 = (0 0 1 2 1 1 1)
V ~d2 = (1 1 1 1 0 0 1)
Jaganadh G Elements of Text Mining
Vector Space Matrix
The document vectors can be represented as matrix
M|D|xF
where |D| is the cardinality of the document space
M|D|xF =
[0 0 1 2 1 1 11 1 1 1 0 0 1
]
Jaganadh G Elements of Text Mining
Vector Normalization
Normalized Vector
A normalized vector is represented as v = ~v~vp where v is is the unit vector or the
normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a
ahttpenwikipediaorgwikiLp space
Jaganadh G Elements of Text Mining
Vector Normalization
Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =
radicv2
1 + v22 + v2
2 + + v2n
With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p
It can be simplified as
~v||p = (
nsumi=1
|~vi|p)1p
6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining
Vector Normalization
L2 Norm
The norm which we apply here is L2 Norm which is also called as Euclidean norm a
It is a common norm used to measure the length of a vector where p = 2
ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm
Jaganadh G Elements of Text Mining
Vector Normalization
v~d1 = (0 0 1 2 1 1 1)vd1 = ~v
~vp
vd1 = v~d1
v~d12vd1 = (0012111)radic
02+02+12+22+12+12+12
vd1 = (0012111)radic8
vd1 = ( 0radic8 0radic
8 1radic
8 2radic
8 1radic
8 1radic
8 1radic
8)
vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10
Jaganadh G Elements of Text Mining
IDF
IDF
idf(t) = log |D|1+|dtisind|
where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0
Jaganadh G Elements of Text Mining
Finding IDF
idf(ti) = log|D|
1 + |d ti isin d|= log
2
1= 069314718
idf(football) = log 21+1
= 00
idf(games) = log 21+1
= 00
idf(john) = log 21+2
= minus040546510810816444
idf(likes) = log 21+2
= minus040546510810816444
idf(mary) = log 21+1
= 00
idf(movies) = log 21+1
= 00
idf(watch) = log 21+1
= 00idf(V ) = (00-040546510810816444-040546510810816444000000)
Jaganadh G Elements of Text Mining
TF-IDF weight
Finding TF-IDF weight
M|D|timesF times Midf
[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
]x
idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)
Jaganadh G Elements of Text Mining
TF-IDF weight
Finding TF-IDF weight
M|D|timesF times Midf
[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
]x
idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)
Jaganadh G Elements of Text Mining
TF-IDF weight
tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)
Jaganadh G Elements of Text Mining
TF-IDF Normalization
L2 Normalization
Mtfminusidf =MtfminusidfMtfminusidf2
Jaganadh G Elements of Text Mining
TF-IDF
Practice with Python and sklearn 7
from sklearnfeature_extractiontext import
CountVectorizer TfidfTransformer
vectorizer = CountVectorizer(analyzer=rsquowordrsquo
min_n=1stop_words=rsquoenglishrsquo)
docs = (rsquoJohn likes to watch movies Mary
likes toorsquorsquoJohn also likes to watch football gamesrsquo)
bow = vectorizerfit_transform(docs)
freq_term_matrix =vectorizertransform(docs)
tfidf = TfidfTransformer(norm=l2)
tfd = tfidffit(freq_term_matrix)
print IDF tfidfidf_
tf_idf_matrix = tfidftransform(freq_term_matrix)
print tf_idf_matrixtodense()
for wf in zip(vectorizervocabulary_tfdidf_)
print rsquor =gt rrsquo (w tfdidf_[f])
7httpscikit-learnorgJaganadh G Elements of Text Mining
N-Grams
N-Gram
In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa
Unigrams are single words
Bigrams are sequences of two words
Trigrams are sequences of three words
ahttpenwikipediaorgwikiN-gram
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from list of words
def ngrams(wordsn=2)
grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]
return grams
words = John likes to watch movies Mary likes too
lower()split()
bigrams = ngrams(wordsn=2)
trigrams = ngrams(wordsn=3)
print bigram
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math
def mutual_info(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
minfo =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
minfo[gram] = (mathlog( len(words) gramcount[ gram ]
wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))
mathlog( 2 )
return minfo
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math
def tscore(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
tsc =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
tsc[gram] = (gramcount[gram] - (1len(words))
wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])
mathsqrt( gramcount[gram])
return tsc
Jaganadh G Elements of Text Mining
Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents;

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance;

unsupervised document classification is entirely executed without reference to external information.

8 http://en.wikipedia.org/wiki/Document_classification
Jaganadh G Elements of Text Mining
Document Classification

Formal Definition

Let C = (c_1, c_2, c_3, ..., c_m) be a set of pre-defined categories.
Let D = (d_1, d_2, d_3, ..., d_n) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, we wish to learn, using a learning algorithm, a classifier (a classifier function) γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes the training set T and emits the learned classification function γ: Γ(T) = γ

γ(c_i, d_j) = 1 if d_j belongs to c_i, 0 otherwise

The main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9

[Figure: supervised document classification pipeline]

9 Image taken from http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem, or Bayes' rule:

Bayes' rule: P(H|E) = P(E|H) × P(H) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability: the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability: the probability of an event after the evidence is observed.
Jaganadh G Elements of Text Mining
Naïve Bayes Classification

Let H be the event of raining and E be the evidence of a dark cloud; then we have:

P(raining | dark cloud) = P(dark cloud | raining) × P(raining) / P(dark cloud)

For multiple evidences:

P(H | E_1, E_2, ..., E_n) = P(E_1, E_2, ..., E_n | H) × P(H) / P(E_1, E_2, ..., E_n)

With the independence assumption, we can rewrite Bayes' rule as follows:

P(H | E_1, E_2, ..., E_n) = P(E_1|H) × P(E_2|H) × ... × P(E_n|H) × P(H) / P(E_1, E_2, ..., E_n)
Jaganadh G Elements of Text Mining
Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naïve Bayes Model

The Naïve Bayes model for the training set will look like:

|V|  C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
 6   T  0.5    15  0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
     E  0.5    12  0.055555556  0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency of each class:
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of a keyword occurring, given a class. For example:
P(kill|Terrorism) = (2 + 1 + 1)/15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0)/12 = 0/12
(The probabilities shown in the table already include the Laplace smoothing introduced on the next slide.)
Jaganadh G Elements of Text Mining
Building Naïve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1)/(15 + |V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0 + 0 + 0 + 1)/(12 + |V|) = 1/18 = 0.0555
Jaganadh G Elements of Text Mining
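The smoothing step is mechanical enough to script; a minimal sketch that reproduces the two values above (the vocabulary size and class totals are taken from the slides, the helper name laplace is ours):

V = 6                                                    # vocabulary size
class_totals = {'Terrorism': 15, 'Entertainment': 12}    # ni from the model table

def laplace(count, class_total):
    # add-one (Laplace) smoothing: (count + 1) / (class total + |V|)
    return (count + 1.0) / (class_total + V)

print laplace(4, class_totals['Terrorism'])        # P(kill|Terrorism)     -> 0.2380...
print laplace(0, class_totals['Entertainment'])    # P(kill|Entertainment) -> 0.0555...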
Testing the NB model

Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt         2     1     2       0      0      1   ?

To find the posterior probability:

P(ci|W) = P(ci) × Π_{j=1..|V|} P(wj|ci)
Jaganadh G Elements of Text Mining
Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) × P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism)
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) × P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document is classified as "Terrorism" because that class got the highest posterior value.
Jaganadh G Elements of Text Mining
Preventing Underflow

The probability score assigned to the test document is very small. In real-world situations we will train the classifier with thousands of documents; in such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve it, we can take the logarithm of the probabilities:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system still classifies the test document as "Terrorism"; from the final probability scores you can observe that the values are scaled nicely.

The section on the Naïve Bayes classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak:
http://suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
Jaganadh G Elements of Text Mining
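A minimal sketch of the same log-space computation, plugging in the smoothed probabilities from the model table (base-10 logs, matching the slide's arithmetic; the dictionaries just restate the table):

import math

p_t = {'kill': 0.2380, 'bomb': 0.1904, 'kidnap': 0.3333,
       'music': 0.0476, 'movie': 0.0952, 'tv': 0.0952}
p_e = {'kill': 0.0555, 'bomb': 0.1111, 'kidnap': 0.1111,
       'music': 0.3333, 'movie': 0.2777, 'tv': 0.1111}
test = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}

def log_posterior(prior, cond, doc):
    # a sum of logs replaces the product of probabilities, avoiding underflow
    return math.log10(prior) + sum(freq * math.log10(cond[w])
                                   for w, freq in doc.items() if freq)

print log_posterior(0.5, p_t, test)   # ~ -4.24 -> Terrorism
print log_posterior(0.5, p_e, test)   # ~ -6.63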
Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes classifier:
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word w_t from the vocabulary. Dimension t of the vector for document d_i is written B_it, and is either 0 or 1, indicating whether word w_t occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in the document. [a]

a A Comparison of Event Models for Naive Bayes Text Classification. Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we adopt the multi-variate Bernoulli model for text classification, our document space representation will look like:
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
Jaganadh G Elements of Text Mining
Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag-of-words representation. We can also use a smoothed bag-of-words model like TF-IDF in the multinomial naive Bayes model.
Jaganadh G Elements of Text Mining
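To see the two event models side by side, here is a small sketch with scikit-learn on the toy matrix from the earlier slides (class labels abbreviated to T/E; BernoulliNB binarizes the counts internally):

import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# term-frequency matrix of the six training documents
# (columns: kill, bomb, kidnap, music, movie, tv)
X = np.array([[2, 1, 3, 0, 0, 1], [1, 1, 1, 0, 0, 0], [1, 1, 2, 0, 1, 0],
              [0, 1, 0, 2, 1, 1], [0, 0, 1, 1, 1, 0], [0, 0, 0, 2, 2, 2]])
y = ['T', 'T', 'T', 'E', 'E', 'E']

mnb = MultinomialNB().fit(X, y)            # works on the raw counts
bnb = BernoulliNB(binarize=0.5).fit(X, y)  # reduces counts to presence/absence
test = np.array([[2, 1, 2, 0, 0, 1]])
print mnb.predict(test), bnb.predict(test)  # both should say ['T']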
Support Vector Machine

A Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(x_i, y_i)}_{i=1..n}, where x_i ∈ R^p (input feature vector) and y_i ∈ {1, -1} is the corresponding label, (y_i = +1) or (y_i = -1). To start with, we assume that our input feature vectors are linearly separable; that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, x_i⟩ + b > 0 for y_i = +1
⟨w, x_i⟩ + b < 0 for y_i = -1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate the positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
Jaganadh G Elements of Text Mining
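The decision rule itself is one line; a sketch with made-up weights and bias (w and b here are illustrative values, not learned ones):

import numpy as np

w = np.array([1.0, -1.0])   # weight vector (illustrative)
b = -0.5                    # bias (illustrative)

def predict(x):
    # the sign of <w, x> + b picks the side of the decision hyperplane
    return 1 if np.dot(w, x) + b > 0 else -1

print predict(np.array([2.0, 0.5]))   # +1
print predict(np.array([0.0, 1.0]))   # -1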
Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is clear blank space on the table between the squares and the rectangles, we can say that the data is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the scatter regions of the squares and the rectangles, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn along the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, marks the maximum margin. The figures closest to the separating hyperplane are known as support vectors. 10 If the data is not linearly separable we have to use kernel tricks. 11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines/ and its bibliography.
11 http://www.statsoft.com/textbook/support-vector-machines/
Jaganadh G Elements of Text Mining
Support Vector Machine

[Figure: maximum-margin separating hyperplane with support vectors]
Jaganadh G Elements of Text Mining
Practice Time

Let's try to build a multinomial Naïve Bayes classifier with Python sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time

Let's try to build an SVM classifier with Python sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time

Let's try to build a multi-variate Naïve Bayes classifier with Python NLTK: 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                             Actual
                    Positive              Negative
Predicted Positive  True Positive (TP)    False Positive (FP)
          Negative  False Negative (FN)   True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                    Actual
                    Positive  Negative  Total
Predicted Positive  562       77        639
          Negative  225       436       661
          Total     787       513       1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified; it is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier

                    Actual
                    Positive  Negative  Total
Predicted Positive  562       77        639
          Negative  225       436       661
          Total     787       513       1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong:

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                    Actual
                    Positive  Negative  Total
Predicted Positive  562       77        639
          Negative  225       436       661
          Total     787       513       1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR):

Fall-out = FP / (FP + TN)

                    Actual
                    Positive  Negative  Total
Predicted Positive  562       77        639
          Negative  225       436       661
          Total     787       513       1300

Fall-out = 77 / (77 + 436) = 0.15
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. [a]

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier

Positive Predictive Value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives:

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier

Negative Predictive Value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure:

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons:

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications:

MCC = (TP × TN − FP × FN) / sqrt( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

MCC = (562 × 436 − 77 × 225) / sqrt( (562 + 77)(562 + 225)(436 + 77)(436 + 225) ) = 0.55
Jaganadh G Elements of Text Mining
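All of these scores fall out of the four confusion-matrix cells, so they are easy to check in a few lines; a sketch that recomputes the values above (matching the slides up to rounding):

from __future__ import division
import math

def classifier_metrics(tp, fp, fn, tn):
    # every metric from the preceding slides, from the four confusion-matrix cells
    total = tp + fp + fn + tn
    p, r = tp / (tp + fp), tp / (tp + fn)
    return {'accuracy': (tp + tn) / total,
            'error rate': (fp + fn) / total,
            'precision (pos)': p,
            'recall (pos)': r,
            'f1 (pos)': 2 * p * r / (p + r),
            'fall-out': fp / (fp + tn),
            'specificity': tn / (fp + tn),
            'npv': tn / (tn + fn),
            'fdr': fp / (fp + tp),
            'mcc': (tp * tn - fp * fn) /
                   math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))}

for name, value in classifier_metrics(562, 77, 225, 436).items():
    print '%s: %0.2f' % (name, value)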
Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis, and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
Jaganadh G Elements of Text Mining
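scikit-learn can compute both the curve and the area directly from labels and scores; a minimal sketch with toy values:

from sklearn.metrics import roc_curve, auc

# true labels and the classifier's scores for a handful of test items (toy values)
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print 'AUC:', auc(fpr, tpr)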
Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories, such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Jaganadh G Elements of Text Mining
Named Entity Recognition

Entity recognition with Python NLTK:

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
                       binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text

Extracting terms with Python topia.termextract: 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract/
Jaganadh G Elements of Text Mining
References

Chris Smith. Machine Learning, Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith. Machine Learning, Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
References

Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. http://suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
Jaganadh G Elements of Text Mining
Lexical Dispersion Plot

A lexical dispersion plot shows the position of a word in a given text. I have created some Python code to generate a lexical dispersion plot with reference to a given word list: 3

import matplotlib.pyplot as plt

def dispersion_plot(text, words):
    wordst = tokenize(text)  # tokenize function from an earlier slide
    points = [(x, y) for x in range(len(wordst))
              for y in range(len(words)) if wordst[x] == words[y]]
    if points:
        x, y = zip(*points)
    else:
        x = y = ()
    plt.plot(x, y, "go", scalex=2)
    plt.yticks(range(len(words)), words, color="b")
    plt.ylim(-1, len(words))
    plt.title("Lexical Dispersion Plot")
    plt.xlabel("Word Offset")
    plt.show()

gpl = open('gpl-2.0.txt', 'r').read()
dispersion_plot(gpl, ['software', 'license', 'copy', 'gnu', 'program', 'free'])

3 Code taken from http://nltk.googlecode.com/svn/trunk/doc/api/nltk.draw.dispersion-pysrc.html and modified
Jaganadh G Elements of Text Mining
Lexical Dispersion Plot
[Figure: lexical dispersion plot generated from the GPL text]
Jaganadh G Elements of Text Mining
Tag Cloud

A tag cloud (word cloud, or weighted list in visual design) is a visual representation for text data, typically used to depict keyword meta-data (tags) on websites, or to visualize free form text. There is an interesting Python tool to generate tag clouds from text called pytagcloud. 4 Here comes an example of creating a tag cloud from the first 100 words of the GPL text:

from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts

def create_tag_cloud(text):
    words = tokenize(text)
    doc = " ".join(d for d in words[:100])
    tags = make_tags(get_tag_counts(doc), maxsize=80)
    create_tag_image(tags, 'gpl.png', size=(900, 600),
                     fontname='Philosopher')

gpl = open('gpl-2.0.txt', 'r').read()
create_tag_cloud(gpl)

4 https://github.com/atizo/PyTagCloud
Jaganadh G Elements of Text Mining
Tag Cloud
Tag cloud from GPL text
Jaganadh G Elements of Text Mining
Word co-occurrence

Word co-occurrence analysis is used to construct lexicons, ontologies, etc. In general, it aims to find similarities between word pairs.

A word co-occurrence matrix is a square N × N matrix, where N corresponds to the total number of unique words in a corpus. A cell m_ij contains the number of times m_i co-occurs with word m_j within a specific context: a natural unit such as a sentence, or a certain window of m words. Note that the upper and lower triangles of the matrix are identical, since co-occurrence is a symmetric relation. [a]

a Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044
Jaganadh G Elements of Text Mining
Word co-occurrence

      w1    w2    w3   ...  wn
w1    m11   m12   m13  ...  m1n
w2    m21   m22   m23  ...  m2n
w3    m31   m32   m33  ...  m3n
...
wn    mn1   mn2   mn3  ...  mnn

Shape of a co-occurrence matrix
Jaganadh G Elements of Text Mining
Word co-occurrence

Finding a co-occurrence matrix with Python:

from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = [corpora[i], corpora[j]]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]
ccm = cooccurrence_matrix_corpus(corpus)
Jaganadh G Elements of Text Mining
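Each inner list acts as the co-occurrence window; a quick usage example on the running example sentence:

words = "john likes to watch movies mary likes too".split()
ccm = cooccurrence_matrix_corpus([words])
print ccm['john']['likes']   # 2: 'john' pairs with both occurrences of 'likes'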
Word co-occurrence

[Figure: word co-occurrence visualization from 100 positive movie reviews. The plot shows the top word and its top four associated words; for each associated word, its own associated words are plotted in turn.]
Jaganadh G Elements of Text Mining
Stop Words

Stop Words

In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In terms of linguistics these words are called function words. Words like 'a', 'an', 'the' are examples of stop words. There is no defined set of stop words available; different applications and research groups use different sets of stop words.
Generally stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus, compared to content words. Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words
Jaganadh G Elements of Text Mining
Stop Words Filter

def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
             'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
             'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
             'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
             'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
             'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
             'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
             'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
             'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
             'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
             'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
             'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless
Jaganadh G Elements of Text Mining
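Instead of hard-coding the list, NLTK ships a ready-made English stop word list; a small sketch of the same filter built on top of it (requires the NLTK stopwords corpus to be downloaded):

from nltk.corpus import stopwords

def stop_filter_nltk(words):
    stops = set(stopwords.words('english'))  # NLTK's built-in English list
    return [word for word in words if word not in stops]

print stop_filter_nltk(['this', 'is', 'a', 'sample', 'text'])
# ['sample', 'text']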
Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an un-ordered collection of words, disregarding grammar and even word order. [a]

Analyzing text by only analyzing the frequency of words is called the bag-of-words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model
Jaganadh G Elements of Text Mining
Bag of Words

Example:

d1: John likes to watch movies Mary likes too
d2: John also likes to watch football games

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1,
 'likes': 3, 'john': 2, 'mary': 4}
(after removing stopwords)

[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]
Each entry of the vectors refers to the count of
the corresponding entry in the dictionary. [a]

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model
Jaganadh G Elements of Text Mining
Bag of Words

Documents:

d1: John likes to watch movies Mary likes too
d2: John also likes to watch football games

Vocabulary Index:

VI(t) = 0 if t is 'football'
        1 if t is 'games'
        2 if t is 'john'
        3 if t is 'likes'
        4 if t is 'mary'
        5 if t is 'movies'
        6 if t is 'watch'
Jaganadh G Elements of Text Mining
Bag of Words
football games john likes mary movies watch
doc1 0 0 1 2 1 1 1
doc2 1 1 1 1 0 0 1
Jaganadh G Elements of Text Mining
Bag of Words

Creating a bag of words with Python and sklearn: 5

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
        'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org/

Jaganadh G Elements of Text Mining
Bag of Words

Creating a bag of words with plain Python (just for a sample :-):

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
                       in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
           for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies Mary likes too",
     "John also likes to watch football games")
bag_of_words(d)
Jaganadh G Elements of Text Mining
TF-IDF

Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t) = tf(t, d) × idf(t)

where 't' is a term in document 'd', and
tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = Σ_{x ∈ d} fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise
Jaganadh G Elements of Text Mining
TF-IDF

TF-IDF

and

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the number of documents in which 't' is present (and tf(t, d) ≠ 0), and |D| is the cardinality of the document space.
Jaganadh G Elements of Text Mining
TF

TF

tf(t, d) = Σ_{x ∈ d} fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1
Jaganadh G Elements of Text Mining
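In code, the sum of fr(x, t) over a tokenized document is just a count; a tiny sketch:

def tf(t, d):
    # sum of fr(x, t) over the tokens x of document d
    return sum(1 for x in d if x == t)

d1 = "john likes to watch movies mary likes too".split()
print tf('john', d1)    # 1
print tf('likes', d1)   # 2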
Document Vector

To create a document vector space:

V(d_n) = ( tf(t_1, d_n), tf(t_2, d_n), ..., tf(t_n, d_n) )

To represent 'd1' and 'd2' as vectors:

V(d1) = ( tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1) )
V(d2) = ( tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2) )

which evaluates to:
V(d1) = (0, 0, 1, 2, 1, 1, 1)
V(d2) = (1, 1, 1, 1, 0, 0, 1)
Jaganadh G Elements of Text Mining
Vector Space Matrix

The document vectors can be represented as a matrix M_{|D| × F},
where |D| is the cardinality of the document space:

M_{|D| × F} = [ 0 0 1 2 1 1 1
                1 1 1 1 0 0 1 ]
Jaganadh G Elements of Text Mining
Vector Normalization

Normalized Vector

A normalized vector is represented as v̂ = v / ||v||_p, where v̂ is the unit vector (the normalized vector), v is the vector to be normalized, and ||v||_p is the norm (magnitude or length) of the vector v in the L^p space (Lebesgue spaces). [a]

a http://en.wikipedia.org/wiki/Lp_space
Jaganadh G Elements of Text Mining
Vector Normalization

The length of a vector is calculated using the Euclidean norm. 6
Non-normalized vector: v = (v_1, v_2, v_3, ..., v_n)
Length of the vector: ||v|| = sqrt( v_1^2 + v_2^2 + v_3^2 + ... + v_n^2 )

With norm: ||v||_p = ( |v_1|^p + |v_2|^p + |v_3|^p + ... + |v_n|^p )^(1/p)

It can be simplified as:

||v||_p = ( Σ_{i=1..n} |v_i|^p )^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html

Jaganadh G Elements of Text Mining
Vector Normalization

L2 Norm

The norm which we apply here is the L2 norm, which is also called the Euclidean norm. [a]
It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm
Jaganadh G Elements of Text Mining
Vector Normalization

v(d1) = (0, 0, 1, 2, 1, 1, 1)

v̂(d1) = v(d1) / ||v(d1)||_2

v̂(d1) = (0, 0, 1, 2, 1, 1, 1) / sqrt( 0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2 )

v̂(d1) = (0, 0, 1, 2, 1, 1, 1) / sqrt(8)

v̂(d1) = ( 0/sqrt(8), 0/sqrt(8), 1/sqrt(8), 2/sqrt(8), 1/sqrt(8), 1/sqrt(8), 1/sqrt(8) )

v̂(d1) = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Now our normalized vector v̂(d1) has an L2 norm of ||v̂(d1)||_2 = 1.0
Jaganadh G Elements of Text Mining
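The same normalization in a couple of lines of numpy, reproducing the values above:

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_hat = v_d1 / np.linalg.norm(v_d1)   # np.linalg.norm defaults to the L2 norm
print v_hat                            # [0. 0. 0.3535... 0.7071... 0.3535... ...]
print np.linalg.norm(v_hat)            # 1.0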
IDF

IDF

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the number of documents in which 't' is present (and tf(t, d) ≠ 0).
Jaganadh G Elements of Text Mining
Finding IDF

idf(t_i) = log( |D| / (1 + |{d : t_i ∈ d}|) ) = log(2/1) = 0.69314718

idf(football) = log( 2 / (1 + 1) ) = 0.0
idf(games) = log( 2 / (1 + 1) ) = 0.0
idf(john) = log( 2 / (1 + 2) ) = -0.40546510810816444
idf(likes) = log( 2 / (1 + 2) ) = -0.40546510810816444
idf(mary) = log( 2 / (1 + 1) ) = 0.0
idf(movies) = log( 2 / (1 + 1) ) = 0.0
idf(watch) = log( 2 / (1 + 1) ) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
Jaganadh G Elements of Text Mining
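These idf values are easy to reproduce in a few lines; a minimal sketch over the two tokenized example documents:

import math

docs = [['john', 'likes', 'watch', 'movies', 'mary'],
        ['john', 'likes', 'watch', 'football', 'games']]
vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']

def idf(t, docs):
    df = sum(1 for d in docs if t in d)      # number of documents containing t
    return math.log(len(docs) / (1.0 + df))  # log(|D| / (1 + df))

print [round(idf(t, docs), 4) for t in vocab]
# [0.0, 0.0, -0.4055, -0.4055, 0.0, 0.0, 0.0]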
TF-IDF weight

Finding the TF-IDF weight:

M_{|D| × F} × M_idf

[ tf(t1, d1)  tf(t2, d1)  tf(t3, d1)  tf(t4, d1)  tf(t5, d1)  tf(t6, d1)  tf(t7, d1)
  tf(t1, d2)  tf(t2, d2)  tf(t3, d2)  tf(t4, d2)  tf(t5, d2)  tf(t6, d2)  tf(t7, d2) ]
×
[ idf(t1)  0        0        0        0        0        0
  0        idf(t2)  0        0        0        0        0
  0        0        idf(t3)  0        0        0        0
  0        0        0        idf(t4)  0        0        0
  0        0        0        0        idf(t5)  0        0
  0        0        0        0        0        idf(t6)  0
  0        0        0        0        0        0        idf(t7) ]
Jaganadh G Elements of Text Mining
TF-IDF weight
tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)
Jaganadh G Elements of Text Mining
TF-IDF Normalization

L2 Normalization

M_tf-idf = M_tf-idf / ||M_tf-idf||_2
Jaganadh G Elements of Text Mining
TF-IDF

Practice with Python and sklearn: 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
        'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, i in sorted(vectorizer.vocabulary_.items(), key=lambda wi: wi[1]):
    print '%r => %r' % (w, tfd.idf_[i])

7 http://scikit-learn.org/

Jaganadh G Elements of Text Mining
N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters; however, the items in question can be phonemes, syllables, letters, words or base pairs, according to the application. [a]

Unigrams are single words.
Bigrams are sequences of two words.
Trigrams are sequences of three words.

a http://en.wikipedia.org/wiki/N-gram
Jaganadh G Elements of Text Mining
Bigrams

Bigrams

P(w_i | w_1, w_2, ..., w_{i-1}) ≈ P(w_i | w_{i-1})

Practice with Python:

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']
Jaganadh G Elements of Text Mining
Trigrams

Trigrams

P(w_i | w_1, w_2, ..., w_{i-1}) ≈ P(w_i | w_{i-2}, w_{i-1})

Practice with Python:

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']
Jaganadh G Elements of Text Mining
N-Grams

Python code to generate n-grams from a list of words:

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math
def mutual_info(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
minfo =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
minfo[gram] = (mathlog( len(words) gramcount[ gram ]
wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))
mathlog( 2 )
return minfo
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math
def tscore(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
tsc =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
tsc[gram] = (gramcount[gram] - (1len(words))
wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])
mathsqrt( gramcount[gram])
return tsc
Jaganadh G Elements of Text Mining
Document Classification
Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8
Document classification tasks can be divided into three kinds
supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents
semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance
unsupervised document classification is entirely executed without reference toexternal information
8httpenwikipediaorgwikiDocument_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents
langd crang
wherelangd crangisin D times C using a learning
algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ
γ(ci dj) =
1 if dj belongs to ci0 otherwise
Main approaches in document classification are
Naıve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9
9Image taken from httpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)
P (E)
The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)
P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)
P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)
P (E1E2En)
Jaganadh G Elements of Text Mining
Naıve Bayes Applied in Text Classification
Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
Naıve Bayes Model for the training set will be like [4pt]
|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095
E 05 12 005555566 011111111 011111111 033333333 027777778 011111111
|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class
number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)
15 = 415
P (kill|Entertainment) = (0+0+0)12 = 0
12
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)
(15+|V |) = 521 = 02380
P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1
18 = 00555
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document isTest Docs kill bomb kidnap music movie tv C
Dt 2 1 2 0 0 1 To find the posterior probability
P (ci|W ) = P (ci)timesVprodj=1
P (wj |ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)
= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521
= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5
P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)
= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111
= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7
The document has classified as rdquoTerrorismrdquo because it got the highest value
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289
After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely
The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf
Jaganadh G Elements of Text Mining
Naıve Bayes Classifier
There are two different ways to setup a Naıve Bayes Classifier
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta
aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by presence or absenceof the word
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument
In the NB example which we worked out we applied the multinomial model In the model we used simple bag of
words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that
〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1
〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way
If there is a clear blank space available in the table in between the square and rectangles we can say that
it is linearly separable A line which drawn in the clear space between the figuresexactly equal length
from the margin of scatter region of square and rectangle is called as separating hyperplane Everything
on the one side of the separating hyper plane belongs to one category and everything in the other side
belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter
(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin
The figures closest to the separating hyperplane are known as support vectors10 If the data is not
linearly separable we have to use kernel tricks 11
10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography
11httpwwwstatsoftcomtextbooksupport-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnnaive_bayes import MultinomialNB
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoMultinomialNB())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a SVM Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnsvm import LinearSVC
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoLinearSVC())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12
import nltkclassifyutil
from nltkclassify import NaiveBayesClassifier
from nltkcorpus import movie_reviews
def word_feats(words)
return dict([(word True) for word in words])
negids = movie_reviewsfileids(rsquonegrsquo)
posids = movie_reviewsfileids(rsquoposrsquo)
negfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquonegrsquo) for f in negids]
posfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquoposrsquo) for f in posids]
negcutoff = len(negfeats)34
poscutoff = len(posfeats)34
trainfeats = negfeats[negcutoff] + posfeats[poscutoff]
classifier = NaiveBayesClassifiertrain(trainfeats)
sent = This is really cool I like it
words = word_feats(sentlower()split())
print classifierclassify(words)
12This code is adopted from httpstreamhackercom20100510
text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm
Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).
Fall-out = FP / (FP + TN)

                        Actual
                        Positive   Negative   Total
Predicted   Positive    562        77         639
            Negative    225        436        661
Total                   787        513        1300

Fall-out = 77 / (77 + 436) = 0.15
Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score
Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78
F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
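The accuracy, precision, recall, and F1 numbers above are easy to reproduce; a small sketch using the same confusion-matrix counts (the helper prf_metrics is ours, not part of any library):

def prf_metrics(tp, fp, fn, tn):
    # derive the standard metrics from confusion-matrix counts
    accuracy = (tp + tn) / float(tp + fp + fn + tn)
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

print prf_metrics(562, 77, 225, 436)
# approximately (0.768, 0.879, 0.714, 0.788)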
Evaluating Performance of a Classifier

Positive Predictive Value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.
Positive predictive value = TP / (TP + FP)
Positive predictive value = 562 / (562 + 77) = 0.87
Evaluating Performance of a Classifier

Negative Predictive Value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.
NPV = TN / (TN + FN)
NPV = 436 / (436 + 225) = 0.65
Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)
Specificity = 436 / (77 + 436) = 0.84
Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.
FDR = FP / (FP + TP)
FDR = 77 / (77 + 562) = 0.12
Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 − 77 × 225) / √((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55
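A quick check of the MCC value with Python's math module:

import math

tp, fp, fn, tn = 562, 77, 225, 436
mcc = (tp * tn - fp * fn) / math.sqrt(
    float(tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
print mcc   # ~0.55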
Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis, and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, and locations, expressions of times, quantities, monetary values, percentages, etc.
Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities
Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract/
References

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php
References

Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
Lexical Dispersion Plot

(Figure) Lexical dispersion plot from the GPL text.
Tag Cloud

A tag cloud (word cloud, or weighted list in visual design) is a visual representation of text data, typically used to depict keyword meta-data (tags) on websites, or to visualize free-form text. There is an interesting Python tool to generate tag clouds from text, called pytagcloud 4. Here comes an example of creating a tag cloud from the first 100 words of the GPL text.

from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts

def create_tag_cloud(text):
    words = tokenize(text)
    doc = " ".join(d for d in words[:100])
    tags = make_tags(get_tag_counts(doc), maxsize=80)
    create_tag_image(tags, 'gpl.png', size=(900, 600),
                     fontname='Philosopher')

gpl = open('gpl-2.0.txt', 'r').read()
create_tag_cloud(gpl)

4 https://github.com/atizo/PyTagCloud
Tag Cloud

(Figure) Tag cloud from the GPL text.
Word co-occurrence

Word co-occurrence analysis is used to construct lexicons, ontologies, etc. In general, it aims to find similarities between word pairs.

A word co-occurrence matrix is a square N × N matrix, where N corresponds to the total number of unique words in a corpus. A cell m_ij contains the number of times m_i co-occurs with word m_j within a specific context: a natural unit such as a sentence, or a certain window of m words. Note that the upper and lower triangles of the matrix are identical, since co-occurrence is a symmetric relation. a

a Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044
Word co-occurrence

      w1    w2    w3    ...   wn
w1    m11   m12   m13   ...   m1n
w2    m21   m22   m23   ...   m2n
w3    m31   m32   m33   ...   m3n
...
wn    mn1   mn2   mn3   ...   mnn

Shape of a co-occurrence matrix
Word co-occurrence

Finding a co-occurrence matrix with Python

from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = [corpora[i], corpora[j]]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]
ccm = cooccurrence_matrix_corpus(corpus)
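With this toy corpus each pair inside a sub-list co-occurs exactly once, so ccm['w1']['w2'] and ccm['w2']['w1'] are both 1, while words from different sub-lists (for example 'w1' and 'w4') never co-occur and stay at 0.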
Word co-occurrence

(Figure) Word co-occurrence visualization from 100 positive movie reviews. The plot shows the top first word and the top four associated words; for each associated word, its associated words are plotted in turn.
Stop Words

Stop Words

In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In terms of linguistics these words are called function words. Words like 'a', 'an', and 'the' are examples of stop words. There is no defined set of stop words available; different applications and research groups use different sets of stop words.
Generally stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus compared to content words. Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words
Stop Words Filter

def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
    'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
    'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
    'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
    'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
    'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
    'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
    'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
    'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
    'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
    'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
    'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless
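Using it together with the tokenize function defined at the beginning of the deck:

words = tokenize("John likes to watch movies and Mary likes movies too")
print stop_filter(words)
# ['john', 'likes', 'watch', 'movies', 'mary', 'likes', 'movies']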
Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as an un-ordered collection of words, disregarding grammar and even word order. a
Analyzing text by only analyzing the frequency of words is called the bag of words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model
Bag of Words

Example a

d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1,
 'likes': 3, 'john': 2, 'mary': 4}
# after removing stopwords

[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]
# each entry of the vectors refers to the count of
# the corresponding entry in the dictionary

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model
Bag of Words

Documents

d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

Vocabulary Index

VI(t) =
  0 if t is 'football'
  1 if t is 'games'
  2 if t is 'john'
  3 if t is 'likes'
  4 if t is 'mary'
  5 if t is 'movies'
  6 if t is 'watch'
Bag of Words

        football   games   john   likes   mary   movies   watch
doc1    0          0       1      2       1      1        1
doc2    1          1       1      1       0      0        1
Bag of Words

Creating a Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
        'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org/
Bag of Words

Creating a Bag of Words with plain Python (just for a sample :-()

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
                       in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
           for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies. Mary likes too.",
     "John also likes to watch football games.")
bag_of_words(d)
TF-IDF

Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t) = tf(t, d) × idf(t)

where 't' is a term in document 'd';
tf(t, d): how many times the term 't' is present in 'd':

tf(t, d) = Σ_{x ∈ d} fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise
TF-IDF

TF-IDF

and idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the count of documents 'd' in which 't' is present (and tf(t, d) ≠ 0);
|D| is the cardinality of the document space.
TF

TF

tf(t, d) = Σ_{x ∈ d} fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1
Document Vector

To create a document vector space:

V_dn = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))

To represent 'd1' and 'd2' as vectors:

V_d1 = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))
V_d2 = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))

which evaluates to:
V_d1 = (0, 0, 1, 2, 1, 1, 1)
V_d2 = (1, 1, 1, 1, 0, 0, 1)
Vector Space Matrix

The document vectors can be represented as a matrix:

M_|D|×F

where |D| is the cardinality of the document space.

M_|D|×F = [ 0 0 1 2 1 1 1
            1 1 1 1 0 0 1 ]
Vector Normalization

Normalized Vector

A normalized vector is represented as v̂ = v / ‖v‖_p, where v̂ is the unit vector, or the normalized vector, v is the vector going to be normalized, and ‖v‖_p is the norm (magnitude or length) of the vector v in the L^p space (Lebesgue spaces). a

a http://en.wikipedia.org/wiki/Lp_space
Vector Normalization

The length of a vector is calculated using the Euclidean norm. 6
Non-normalized vector: v = (v1, v2, v3, ..., vn)
Length of the vector: ‖v‖ = √(v1² + v2² + v3² + ... + vn²)

With the norm: ‖v‖_p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)

It can be simplified as:

‖v‖_p = ( Σ_{i=1}^{n} |vi|^p )^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html
Vector Normalization

L2 Norm

The norm which we apply here is the L2 norm, which is also called the Euclidean norm. a
It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm
Vector Normalization

v_d1 = (0, 0, 1, 2, 1, 1, 1)
v̂_d1 = v_d1 / ‖v_d1‖_2
v̂_d1 = (0, 0, 1, 2, 1, 1, 1) / √(0² + 0² + 1² + 2² + 1² + 1² + 1²)
v̂_d1 = (0, 0, 1, 2, 1, 1, 1) / √8
v̂_d1 = (0/√8, 0/√8, 1/√8, 2/√8, 1/√8, 1/√8, 1/√8)
v̂_d1 = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Our normalized vector v̂_d1 now has an L2 norm ‖v̂_d1‖_2 = 1.0.
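The same computation in a few lines of Python:

import math

v = [0, 0, 1, 2, 1, 1, 1]
norm = math.sqrt(sum(x ** 2 for x in v))   # = sqrt(8)
v_hat = [x / norm for x in v]
print v_hat
# [0.0, 0.0, 0.3535..., 0.7071..., 0.3535..., 0.3535..., 0.3535...]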
IDF

IDF

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the count of documents 'd' in which 't' is present (and tf(t, d) ≠ 0).
Finding IDF

idf(ti) = log( |D| / (1 + |{d : ti ∈ d}|) ), e.g. log(2/1) = 0.69314718

idf(football) = log(2 / (1 + 1)) = 0.0
idf(games) = log(2 / (1 + 1)) = 0.0
idf(john) = log(2 / (1 + 2)) = -0.40546510810816444
idf(likes) = log(2 / (1 + 2)) = -0.40546510810816444
idf(mary) = log(2 / (1 + 1)) = 0.0
idf(movies) = log(2 / (1 + 1)) = 0.0
idf(watch) = log(2 / (1 + 1)) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
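These idf values are easy to verify; a small sketch using the document frequencies of the two example documents:

import math

D = 2  # number of documents in the corpus
df = {'football': 1, 'games': 1, 'john': 2, 'likes': 2,
      'mary': 1, 'movies': 1, 'watch': 1}
idf = dict((t, math.log(float(D) / (1 + n))) for t, n in df.items())
print idf['john']    # -0.40546510810816444
print idf['mary']    # 0.0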
TF-IDF weight

Finding the TF-IDF weight

M_|D|×F × M_idf

[ tf(t1, d1) tf(t2, d1) tf(t3, d1) tf(t4, d1) tf(t5, d1) tf(t6, d1) tf(t7, d1)
  tf(t1, d2) tf(t2, d2) tf(t3, d2) tf(t4, d2) tf(t5, d2) tf(t6, d2) tf(t7, d2) ]
×
[ idf(t1)  0        0        0        0        0        0
  0        idf(t2)  0        0        0        0        0
  0        0        idf(t3)  0        0        0        0
  0        0        0        idf(t4)  0        0        0
  0        0        0        0        idf(t5)  0        0
  0        0        0        0        0        idf(t6)  0
  0        0        0        0        0        0        idf(t7) ]
TF-IDF weight

[ tf(t1, d1)×idf(t1)  tf(t2, d1)×idf(t2)  tf(t3, d1)×idf(t3)  tf(t4, d1)×idf(t4)  tf(t5, d1)×idf(t5)  tf(t6, d1)×idf(t6)  tf(t7, d1)×idf(t7)
  tf(t1, d2)×idf(t1)  tf(t2, d2)×idf(t2)  tf(t3, d2)×idf(t3)  tf(t4, d2)×idf(t4)  tf(t5, d2)×idf(t5)  tf(t6, d2)×idf(t6)  tf(t7, d2)×idf(t7) ]
TF-IDF Normalization

L2 Normalization

M_tf-idf = M_tf-idf / ‖M_tf-idf‖_2
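Putting the last few slides together: a minimal numpy sketch that multiplies the term-frequency matrix by the diagonal idf matrix and then L2-normalizes each row (note that sklearn's TfidfTransformer uses a slightly different, non-negative idf formula, so its output will not match these numbers exactly):

import numpy as np

tf = np.array([[0, 0, 1, 2, 1, 1, 1],
               [1, 1, 1, 1, 0, 0, 1]], dtype=float)
idf = np.array([0.0, 0.0, -0.40546511, -0.40546511, 0.0, 0.0, 0.0])
tfidf = tf * idf                      # same result as tf.dot(np.diag(idf))
norms = np.sqrt((tfidf ** 2).sum(axis=1))
tfidf_normed = tfidf / norms[:, np.newaxis]  # each row now has L2 norm 1.0
print tfidf_normed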
TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
        'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, i in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[i])

7 http://scikit-learn.org/
N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application. a

Unigrams are single words.
Bigrams are sequences of two words.
Trigrams are sequences of three words.

a http://en.wikipedia.org/wiki/N-gram
Bigrams

Bigrams

P(wi | w1, w2, ..., wi−1) ≈ P(wi | wi−1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']
Trigrams

Trigrams

P(wi | w1, w2, ..., wi−1) ≈ P(wi | wi−2, wi−1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']
N-Grams

Python code to generate N-Grams from a list of words

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams
Mutual Information

Mutual Information

A statistical test to measure the strength of word association:

I(wi, wj) = log2( P(wi, wj) / (P(wi) P(wj)) ) ≈ log2( N C(wi, wj) / (C(wi) C(wj)) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )
Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))) / \
            math.log(2)
    return minfo
t-score

t-score

A statistical test to measure the strength of word association:

t(wi, wj) = (mean(P(wi, wj)) − mean(P(wi)) mean(P(wj))) / √(σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)))
          ≈ (C(wi, wj) − (1/N) C(wi) C(wj)) / √C(wi, wj)

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

t(strong, tea) = (C(strong, tea) − (1/N) C(strong) C(tea)) / √C(strong, tea)
t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc
Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents;
semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance;
unsupervised document classification is entirely executed without reference to external information.

8 http://en.wikipedia.org/wiki/Document_classification
Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier, or a classifier function γ, that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ.

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

The main approaches in document classification are:
Naïve Bayes (NB)
Support Vector Machines (SVM)
Document Classification

(Figure) A supervised document classification pipeline. 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php
Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.
Bayes' rule: P(H|E) = (P(E|H) × P(H)) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.
Naïve Bayes Classification

Let H be the event of raining and E be the evidence of a dark cloud; then we have:
P(raining | dark cloud) = (P(dark cloud | raining) × P(raining)) / P(dark cloud)

For multiple evidences:
P(H | E1, E2, ..., En) = (P(E1, E2, ..., En | H) × P(H)) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as follows:
P(H | E1, E2, ..., En) = (P(E1|H) × P(E2|H) × ... × P(En|H) × P(H)) / P(E1, E2, ..., En)
Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs   kill   bomb   kidnap   music   movie   tv   C
D0              2      1      3        0       0       1    Terrorism
D1              1      1      1        0       0       0    Terrorism
D2              1      1      2        0       1       0    Terrorism
D3              0      1      0        2       1       1    Entertainment
D4              0      0      1        1       1       0    Entertainment
D5              0      0      0        2       2       2    Entertainment
Building the Naïve Bayes Model

The Naïve Bayes model for the training set will be like (|V| = 6):

C   P(Ci)   ni   P(kill|Ci)    P(bomb|Ci)   P(kidnap|Ci)   P(music|Ci)   P(movie|Ci)   P(tv|Ci)
T   0.5     15   0.238095238   0.19047619   0.33333333     0.047619048   0.095238095   0.095238095
E   0.5     12   0.055555556   0.11111111   0.11111111     0.33333333    0.27777778    0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of a keyword occurrence given a class
Examples:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12
Building the Naïve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:
P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555
Testing the NB model

Our test document is:

Test Docs   kill   bomb   kidnap   music   movie   tv   C
Dt          2      1      2        0       0       1    ?

To find the posterior probability:

P(ci|W) = P(ci) × ∏_{j=1}^{|V|} P(wj|ci)
Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)² × P(bomb|Terrorism)¹ × P(kidnap|Terrorism)² × P(music|Terrorism)⁰ × P(movie|Terrorism)⁰ × P(tv|Terrorism)¹
= 0.5 × 0.2380² × 0.1904¹ × 0.3333² × 0.0476⁰ × 0.0952⁰ × 0.0952¹
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10⁻⁵

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)² × P(bomb|Entertainment)¹ × P(kidnap|Entertainment)² × P(music|Entertainment)⁰ × P(movie|Entertainment)⁰ × P(tv|Entertainment)¹
= 0.5 × 0.0555² × 0.1111¹ × 0.1111² × 0.3333⁰ × 0.2777⁰ × 0.1111¹
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10⁻⁷

The document is classified as "Terrorism" because that class got the highest value.
Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:

P(Terrorism|W) = log(0.5 × 0.2380² × 0.1904¹ × 0.3333² × 0.0476⁰ × 0.0952⁰ × 0.0952¹)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= −0.3010 − 1.2468 − 0.7203 − 0.9543 + 0 + 0 − 1.0213
= −4.2437

P(Entertainment|W) = log(0.5 × 0.0555² × 0.1111¹ × 0.1111² × 0.3333⁰ × 0.2777⁰ × 0.1111¹)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= −0.3010 − 2.5114 − 0.9542 − 1.9085 + 0 + 0 − 0.9542
= −6.6293

After handling the underflow problem, our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
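Both computations are easy to reproduce; a small sketch with Python's math module, using the smoothed probabilities from the worked example:

import math

# raw product form (underflow-prone for long documents)
p_t = 0.5 * 0.2380**2 * 0.1904 * 0.3333**2 * 0.0952
p_e = 0.5 * 0.0555**2 * 0.1111 * 0.1111**2 * 0.1111
print p_t, p_e            # ~5.7e-05 vs ~2.3e-07

# log-space form avoids underflow
log_t = math.log10(0.5) + 2 * math.log10(0.2380) + math.log10(0.1904) \
        + 2 * math.log10(0.3333) + math.log10(0.0952)
log_e = math.log10(0.5) + 2 * math.log10(0.0555) + math.log10(0.1111) \
        + 2 * math.log10(0.1111) + math.log10(0.1111)
print log_t, log_e        # ~-4.24 vs ~-6.63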
Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes classifier:

Multi-variate Bernoulli Model
Multinomial Model
Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in a document. a

a A Comparison of Event Models for Naive Bayes Text Classification. Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we adopt the multi-variate Bernoulli model for text classification, our document space representation will be like:

Training Docs   kill   bomb   kidnap   music   movie   tv   C
D0              1      1      1        0       0       1    Terrorism
D1              1      1      1        0       0       0    Terrorism
D2              1      1      1        0       1       0    Terrorism
D3              0      1      0        1       1       1    Entertainment
D4              0      0      1        1       1       0    Entertainment
D5              0      0      0        1       1       1    Entertainment

Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
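The Bernoulli representation is just the binarized multinomial table; a minimal numpy sketch:

import numpy as np

# term-frequency table from the multinomial example (rows = D0..D5)
counts = np.array([[2, 1, 3, 0, 0, 1],
                   [1, 1, 1, 0, 0, 0],
                   [1, 1, 2, 0, 1, 0],
                   [0, 1, 0, 2, 1, 1],
                   [0, 0, 1, 1, 1, 0],
                   [0, 0, 0, 2, 2, 2]])
bernoulli = (counts > 0).astype(int)  # 1 if the word occurs at least once
print bernoulli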
Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model, a document is an ordered sequence of word events drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag of words representation. We can use a smoothed bag of words model like TF-IDF in the multinomial naive Bayes model.
Support Vector Machine

A Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}, i = 1, ..., n, where xi ∈ R^p (input feature vector) and yi ∈ {1, −1} is the corresponding label, whether (yi = +1) or (yi = −1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that:

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = −1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable, we have to use kernel tricks to find soft margins.
Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.
If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal length from the margin of the scatter region of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors. 10 If the data is not linearly separable we have to use kernel tricks. 11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and the bibliography.
11 http://www.statsoft.com/textbook/support-vector-machines
Support Vector Machine

(Figure) SVM margin illustration.
Practice Time

Let's try to build a Multinomial Naïve Bayes classifier with Python sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories, shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Practice Time
Letrsquos try to build a SVM Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnsvm import LinearSVC
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoLinearSVC())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12
import nltkclassifyutil
from nltkclassify import NaiveBayesClassifier
from nltkcorpus import movie_reviews
def word_feats(words)
return dict([(word True) for word in words])
negids = movie_reviewsfileids(rsquonegrsquo)
posids = movie_reviewsfileids(rsquoposrsquo)
negfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquonegrsquo) for f in negids]
posfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquoposrsquo) for f in posids]
negcutoff = len(negfeats)34
poscutoff = len(posfeats)34
trainfeats = negfeats[negcutoff] + posfeats[poscutoff]
classifier = NaiveBayesClassifiertrain(trainfeats)
sent = This is really cool I like it
words = word_feats(sentlower()split())
print classifierclassify(words)
12This code is adopted from httpstreamhackercom20100510
text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm
Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)
MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)
= 055
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
Tag Cloud
A tag cloud (word cloud or weighted list in visual design) is a visual representation for text datatypically used to depict keyword meta-data (tags) on websites or to visualize free form textThere is an interesting python tool to generate tag clouds from text called pytagcloud 4 herecomes an example of creating tag cloud from first 100 words from GPL text
from pytagcloud import create_tag_image make_tags
from pytagcloudlangcounter import get_tag_counts
def create_tag_cloud(text)
words = tokenize(text)
doc = join(d for d in words[100])
tags = make_tags(get_tag_counts(doc) maxsize=80)
create_tag_image(tags rsquogplpngrsquo size=(900 600)
fontname=rsquoPhilosopherrsquo)
gpl = open(rsquogpl-20txtrsquorsquorrsquo)read()
create_tag_cloud(gpl)
4httpsgithubcomatizoPyTagCloud
Jaganadh G Elements of Text Mining
Tag Cloud
Tag cloud from GPL text
Jaganadh G Elements of Text Mining
Word co-occurrence
Word co-occurrence analysis is used to construct lexicon ontologies etc In general itaims to find similarities between word pairs
A word co-occurrence matrix is a square of N timesN matrix where N corresponds totalnumber of unique words in a corpus A cell mij contains the number of times mi
co-occur with word mj with in a specific contextmdash a natural unit such as a sentence or acertain window of m words Note that the upper and lower triangles of the matrix areidentical since co-occurrence is a symmetric relation a
aJimmy LinScalable Language Processing Algorithms for the Masses A Case Study inComputing Word Co-occurrence Matrices with MapReducewwwaclweborganthologyD08-1044
Jaganadh G Elements of Text Mining
Word co-occourance
w1 w2 w3 wnw1 m11 m12 m13 m1n
w2 m21 m22 m23 m2n
w3 m31 m32 m33 m3n
wn mn1 mn2 mn3 mnnShape of a co-occurrence matrix
Jaganadh G Elements of Text Mining
Word co-occurrence
Finding co-occurrence matrix with Python
def cooccurrence_matrix_corpus(corpus)
matrix = defaultdict(lambda defaultdict(int))
for corpora in corpus
for i in xrange(len(corpora)-1)
for j in xrange(i+1 len(corpora))
word1 word2 = [corpora[i]corpora[j]]
matrix[word1][word2] += 1
matrix[word2][word1] += 1
return matrix
corpus = [[rsquow1rsquorsquow2rsquorsquow3rsquo][rsquow4rsquorsquow5rsquorsquow6rsquo]]
ccm = cooccurrence_matrix_corpus(corpus)
Jaganadh G Elements of Text Mining
Word co-occurrence
hong
toy zdenek
czech
movie
julie
like
chan
story
tango
one
sverak
martial
woody
dating
film
first
Word co-occurrence visualization from 100 positive movie reviews The plot show top first word and top four
associated words For each associated words again associated words are plotted
Jaganadh G Elements of Text Mining
Stop Words
Stop Words
In computing stop words are words which are filtered out prior to or after processing ofnatural language data (text) In terms of linguistics these words are called as functionwords Words like rsquoarsquo rsquoanrsquo rsquothersquo are examples for stop words There is no defined set ofstop words available Different applications and research groups uses different sets o stopwordsGenerally stop words are omitted in text mining process The frequency of stop wordswill be very high in any corpus compared to content wordsPointers to some good stopword list is available at httpenwikipediaorgwikiStop_words
Jaganadh G Elements of Text Mining
Stop Words Filter
def stop_filter(words)
stops = [rsquoirsquo rsquomersquo rsquomyrsquo rsquomyselfrsquo rsquowersquo rsquoourrsquo rsquooursrsquo rsquoourselvesrsquo rsquoyoursquo
rsquoyourrsquo rsquoyoursrsquo rsquoyourselfrsquo rsquoyourselvesrsquo rsquohersquo rsquohimrsquo rsquohisrsquo rsquohimselfrsquo rsquoshersquo
rsquoherrsquo rsquohersrsquo rsquoherselfrsquo rsquoitrsquo rsquoitsrsquo rsquoitselfrsquorsquotheyrsquo rsquothemrsquo rsquotheirrsquo rsquotheirsrsquo
rsquothemselvesrsquo rsquowhatrsquo rsquowhichrsquo rsquowhorsquo rsquowhomrsquo rsquothisrsquo rsquothatrsquo rsquothesersquo rsquothosersquo
rsquoamrsquo rsquoisrsquo rsquoarersquo rsquowasrsquo rsquowerersquorsquobersquo rsquobeenrsquo rsquobeingrsquo rsquohaversquo rsquohasrsquo rsquohadrsquo
rsquohavingrsquo rsquodorsquo rsquodoesrsquo rsquodidrsquo rsquodoingrsquo rsquoarsquo rsquoanrsquo rsquothersquo rsquoandrsquo rsquobutrsquo rsquoifrsquo rsquoorrsquo rsquobecausersquo
rsquoasrsquo rsquountilrsquo rsquowhilersquo rsquoofrsquo rsquoatrsquo rsquobyrsquo rsquoforrsquorsquowithrsquo rsquoaboutrsquo rsquoagainstrsquo rsquobetweenrsquo rsquointorsquo
rsquothroughrsquo rsquoduringrsquo rsquobeforersquo rsquoafterrsquo rsquoaboversquo rsquobelowrsquo rsquotorsquo rsquofromrsquo rsquouprsquo rsquodownrsquo rsquoinrsquo rsquooutrsquo
rsquoonrsquo rsquooffrsquo rsquooverrsquo rsquounderrsquo rsquoagainrsquo rsquofurtherrsquorsquothenrsquo rsquooncersquo rsquoherersquo rsquotherersquo rsquowhenrsquo rsquowherersquo
rsquowhyrsquo rsquohowrsquo rsquoallrsquo rsquoanyrsquo rsquobothrsquo rsquoeachrsquo rsquofewrsquo rsquomorersquo rsquomostrsquo rsquootherrsquo rsquosomersquo rsquosuchrsquo rsquonorsquo
rsquonorrsquorsquonotrsquo rsquoonlyrsquo rsquoownrsquo rsquosamersquo rsquosorsquo rsquothanrsquo rsquotoorsquo rsquoveryrsquo rsquosrsquo rsquotrsquo rsquocanrsquo rsquowillrsquo rsquojustrsquo
rsquodonrsquo rsquoshouldrsquo rsquonowrsquo]
stopless = [word in words if word not in stops]
return stopless
Jaganadh G Elements of Text Mining
Bag of Words
The bag-of-words model is a simplifying representation used in natural languageprocessing and information retrieval (IR) In this model a text (such as a sentence or adocument) is represented as an un-ordered collection of words disregarding grammarand even word order a
Analyzing text by only analyzing frequency of words is called as bag of words model
ahttpenwikipediaorgwikiBag of words model
Jaganadh G Elements of Text Mining
Bag of Words
Example
d1 John likes to watch movies Mary likes too
d2 John also likes to watch football games
rsquofootballrsquo 0 rsquowatchrsquo 6 rsquomoviesrsquo 5 rsquogamesrsquo 1
rsquolikesrsquo 3 rsquojohnrsquo 2 rsquomaryrsquo 4
after removing stopwords
[0 0 1 2 1 1 1]
[1 1 1 1 0 0 1]
each entry of the vectors refers to count of
the corresponding entry in the dictionary
a
aExample taken from httpenwikipediaorgwikiBag of words model
Jaganadh G Elements of Text Mining
Bag of Words
Documents
d1 John likes to watch movies Mary likes too
d2 John also likes to watch football games
Vocabulary Index
V I(t) =
0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo
Jaganadh G Elements of Text Mining
Bag of Words
Documents
d1 John likes to watch movies Mary likes too
d2 John also likes to watch football games
Vocabulary Index
V I(t) =
0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo
Jaganadh G Elements of Text Mining
Bag of Words
football games john likes mary movies watch
doc1 0 0 1 2 1 1 1
doc2 1 1 1 1 0 0 1
Jaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python and sklearn 5
from sklearnfeature_extractiontext import
CountVectorizer
vectorizer = CountVectorizer(analyzer=rsquowordrsquo
min_n=1stop_words=rsquoenglishrsquo)
docs = (rsquoJohn likes to watch movies Mary
likes toorsquorsquoJohn also likes to watch football gamesrsquo)
bow = vectorizerfit_transform(docs)
print vectorizervocabulary_
print bowtoarray()
5httpscikit-learnorgJaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python Just for sample -(
def bag_of_words(docs)
stops = [rsquotorsquorsquotoorsquorsquoalsorsquo]
token_list = [tokenize(doc) for doc in docs]
vocab = list(set(token_list[0])union(token_list))
vocab =[v for v in vocab if v not in stops and len(v) gt 1]
vocab_idex = dict( [ ( word vocabindex(word) ) for word
in vocab] )
bow = [[tokenscount(word) for word in vocab_idexkeys()]
for tokens in token_list]
print vocab_idex
for bag in bow
print bag
d = (John likes to watch movies Mary likes too
John also likes to watch football games)
bag_of_words(d)
Jaganadh G Elements of Text Mining
TF-IDF
Tfndashidf term frequencyndashinverse document frequency is a numerical statistic which reflectshow important a word is to a document in a collection or corpus
TF-IDF
tf minus idf(t) = tf(t d)times idf(t)
where rsquotrsquo is a term in document rsquodrsquotf(t d) how many times the term rsquotrsquo is present in rsquodrsquo
tf(t d) =sumxisind
fr(x t)
where
fr(x t) =
1 if x = t0 otherwise
Jaganadh G Elements of Text Mining
TF-IDF
TF-IDF
andidf(t) = log |D|
1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space
Jaganadh G Elements of Text Mining
TF
TF
tf(t d) =sumxisind
fr(x t)
fr(x t) is a simple function
fr(x t) =
1 if x = t0 otherwise
Exampletf(primejohnprime d1) = 1
Jaganadh G Elements of Text Mining
Document Vector
To create a document vector space:
V(d_n) = (tf(t1, d_n), tf(t2, d_n), ..., tf(t_n, d_n))
To represent 'd1' and 'd2' as vectors:
V(d1) = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))
V(d2) = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))
which evaluates to:
V(d1) = (0, 0, 1, 2, 1, 1, 1)
V(d2) = (1, 1, 1, 1, 0, 0, 1)
Jaganadh G Elements of Text Mining
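A minimal sketch of how these term-frequency vectors can be computed in plain Python; the tf_vector helper and the hard-coded vocabulary are illustrative assumptions, not part of the original slides:

def tf_vector(doc, vocab):
    # count how often each vocabulary term occurs in the document
    words = doc.lower().replace('.', '').split()
    return [words.count(term) for term in vocab]

vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']
d1 = "John likes to watch movies. Mary likes too."
d2 = "John also likes to watch football games."
print tf_vector(d1, vocab)  # [0, 0, 1, 2, 1, 1, 1]
print tf_vector(d2, vocab)  # [1, 1, 1, 1, 0, 0, 1]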
Vector Space Matrix
The document vectors can be represented as a matrix
M_{|D| × F}
where |D| is the cardinality of the document space:
M_{|D| × F} =
[ 0 0 1 2 1 1 1 ]
[ 1 1 1 1 0 0 1 ]
Jaganadh G Elements of Text Mining
Vector Normalization
Normalized Vector
A normalized vector is represented as v̂ = v⃗ / ‖v⃗‖_p, where v̂ is the unit vector (the normalized vector), v⃗ is the vector to be normalized, and ‖v⃗‖_p is the norm (magnitude or length) of the vector v⃗ in the Lp space (Lebesgue spaces). a
a http://en.wikipedia.org/wiki/Lp_space
Jaganadh G Elements of Text Mining
Vector Normalization
The length of a vector is calculated using the Euclidean norm. 6
Non-normalized vector: v⃗ = (v1, v2, v3, ..., vn)
Length of vector v⃗: ‖v⃗‖ = √(v1² + v2² + v3² + ... + vn²)
With norm p: ‖v⃗‖_p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)
It can be simplified as:
‖v⃗‖_p = (Σ_{i=1}^{n} |v_i|^p)^(1/p)
6 http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining
Vector Normalization
L2 Norm
The norm which we apply here is the L2 norm, also called the Euclidean norm. a
It is a common norm used to measure the length of a vector, where p = 2.
a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm
Jaganadh G Elements of Text Mining
Vector Normalization
v⃗(d1) = (0, 0, 1, 2, 1, 1, 1)
v̂(d1) = v⃗(d1) / ‖v⃗(d1)‖_2
v̂(d1) = (0, 0, 1, 2, 1, 1, 1) / √(0² + 0² + 1² + 2² + 1² + 1² + 1²)
v̂(d1) = (0, 0, 1, 2, 1, 1, 1) / √8
v̂(d1) = (0/√8, 0/√8, 1/√8, 2/√8, 1/√8, 1/√8, 1/√8)
v̂(d1) = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)
Our normalized vector v̂(d1) now has L2-norm ‖v̂(d1)‖_2 = 1.0.
Jaganadh G Elements of Text Mining
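A short sketch of this L2 normalization in plain Python; the l2_normalize helper is an illustrative assumption:

import math

def l2_normalize(vec):
    # divide every component by the Euclidean (L2) norm of the vector
    norm = math.sqrt(sum(v ** 2 for v in vec))
    return [v / norm for v in vec]

vd1 = [0, 0, 1, 2, 1, 1, 1]
print l2_normalize(vd1)
# [0.0, 0.0, 0.3535..., 0.7071..., 0.3535..., 0.3535..., 0.3535...]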
IDF
IDF
idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )
where |{d : t ∈ d}| is the number of documents where 't' is present (i.e. tf(t, d) ≠ 0).
Jaganadh G Elements of Text Mining
Finding IDF
idf(t_i) = log( |D| / (1 + |{d : t_i ∈ d}|) ) = log(2/1) = 0.69314718
idf(football) = log(2/(1+1)) = 0.0
idf(games) = log(2/(1+1)) = 0.0
idf(john) = log(2/(1+2)) = -0.40546510810816444
idf(likes) = log(2/(1+2)) = -0.40546510810816444
idf(mary) = log(2/(1+1)) = 0.0
idf(movies) = log(2/(1+1)) = 0.0
idf(watch) = log(2/(1+1)) = 0.0
idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
Jaganadh G Elements of Text Mining
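These idf values can be reproduced with a few lines of Python; the idf helper and the tokenized docs list below are illustrative assumptions:

import math

def idf(term, docs):
    # df = number of documents in which the term occurs
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / float(1 + df))

docs = [['john', 'likes', 'watch', 'movies', 'mary', 'likes'],
        ['john', 'also', 'likes', 'watch', 'football', 'games']]
for term in ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']:
    print term, idf(term, docs)
# football 0.0, games 0.0, john -0.405..., likes -0.405..., mary 0.0, ...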
TF-IDF weight
Finding TF-IDF weight
M_{|D| × F} × M_idf =
[ tf(t1, d1) tf(t2, d1) tf(t3, d1) tf(t4, d1) tf(t5, d1) tf(t6, d1) tf(t7, d1) ]
[ tf(t1, d2) tf(t2, d2) tf(t3, d2) tf(t4, d2) tf(t5, d2) tf(t6, d2) tf(t7, d2) ]
×
[ idf(t1)    0       0       0       0       0       0    ]
[    0    idf(t2)    0       0       0       0       0    ]
[    0       0    idf(t3)    0       0       0       0    ]
[    0       0       0    idf(t4)    0       0       0    ]
[    0       0       0       0    idf(t5)    0       0    ]
[    0       0       0       0       0    idf(t6)    0    ]
[    0       0       0       0       0       0    idf(t7) ]
Jaganadh G Elements of Text Mining
TF-IDF weight
[ tf(t1, d1)×idf(t1)  tf(t2, d1)×idf(t2)  tf(t3, d1)×idf(t3)  tf(t4, d1)×idf(t4)  tf(t5, d1)×idf(t5)  tf(t6, d1)×idf(t6)  tf(t7, d1)×idf(t7) ]
[ tf(t1, d2)×idf(t1)  tf(t2, d2)×idf(t2)  tf(t3, d2)×idf(t3)  tf(t4, d2)×idf(t4)  tf(t5, d2)×idf(t5)  tf(t6, d2)×idf(t6)  tf(t7, d2)×idf(t7) ]
Jaganadh G Elements of Text Mining
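A quick numpy sketch of this product: multiplying the tf matrix by the diagonal idf matrix scales each column by the corresponding idf value. The arrays below simply reuse the example's numbers:

import numpy as np

tf = np.array([[0, 0, 1, 2, 1, 1, 1],
               [1, 1, 1, 1, 0, 0, 1]], dtype=float)
idf = np.array([0.0, 0.0, -0.4054651, -0.4054651, 0.0, 0.0, 0.0])

tf_idf = tf.dot(np.diag(idf))  # equivalent to element-wise tf * idf per column
print tf_idf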
TF-IDF Normalization
L2 Normalization
M_tf-idf = M_tf-idf / ‖M_tf-idf‖_2
Jaganadh G Elements of Text Mining
TF-IDF
Practice with Python and sklearn 7
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, f in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[f])

7 http://scikit-learn.org
Jaganadh G Elements of Text Mining
N-Grams
N-Gram
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application. a
Unigrams are single words
Bigrams are sequences of two words
Trigrams are sequences of three words
a http://en.wikipedia.org/wiki/N-gram
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P(w_i | w_1, w_2, ..., w_{i-1}) ≈ P(w_i | w_{i-1})
Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words)-2+1)]
bigrams = [' '.join(bigram) for bigram in ibigrams]
print bigrams
['john likes', 'likes to', 'to watch', 'watch movies', 'movies mary', 'mary likes', 'likes too']
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P(w_i | w_1, w_2, ..., w_{i-1}) ≈ P(w_i | w_{i-2}, w_{i-1})
Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [' '.join(trigram) for trigram in itrigrams]
print trigrams
['john likes to', 'likes to watch', 'to watch movies', 'watch movies mary', 'movies mary likes', 'mary likes too']
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from a list of words:

def ngrams(words, n=2):
    grams = [' '.join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
A statistical test to measure the strength of word association:
I(w_i, w_j) = log2( P(w_i, w_j) / (P(w_i) P(w_j)) ) ≈ log2( N C(w_i, w_j) / (C(w_i) C(w_j)) )
where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus, C(w_i, w_j) is the frequency of the bigram w_i w_j, and N is the total number of words in the corpus.
Example:
I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev. slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) / math.log(2)
    return minfo
Jaganadh G Elements of Text Mining
t-score
t-score
A statistical test to measure the strength of word association:
t(w_i, w_j) = (mean(P(w_i, w_j)) - mean(P(w_i)) mean(P(w_j))) / √(σ²(P(w_i, w_j)) + σ²(P(w_i)) σ²(P(w_j)))
≈ (C(w_i, w_j) - (1/N) C(w_i) C(w_j)) / √(C(w_i, w_j))
where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus, C(w_i, w_j) is the frequency of the bigram w_i w_j, and N is the total number of words in the corpus.
Example:
t(strong, tea) = (C(strong, tea) - (1/N) C(strong) C(tea)) / √(C(strong, tea))
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev. slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc
Jaganadh G Elements of Text Mining
Document Classification
Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8
Document classification tasks can be divided into three kinds:
supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents
semi-supervised document classification is a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance
unsupervised document classification is entirely executed without reference to external information
8 http://en.wikipedia.org/wiki/Document_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (or classifier function) γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ
γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise
The main approaches in document classification are:
Naïve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
(Figure) A supervised document classification pipeline. 9
9 Image taken from http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule:
Bayes' rule: P(H|E) = (P(E|H) × P(H)) / P(E)
The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Let H be the event of raining and E be the evidence of a dark cloud; then we have:
P(raining | dark cloud) = (P(dark cloud | raining) × P(raining)) / P(dark cloud)
For multiple evidences:
P(H | E1, E2, ..., En) = (P(E1, E2, ..., En | H) × P(H)) / P(E1, E2, ..., En)
With the independence assumption, we can rewrite Bayes' rule as follows:
P(H | E1, E2, ..., En) = (P(E1|H) × P(E2|H) × ... × P(En|H) × P(H)) / P(E1, E2, ..., En)
Jaganadh G Elements of Text Mining
Naıve Bayes Applied in Text Classification
Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
The Naïve Bayes model for the training set will be like:

|V|  C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
 6   T  0.5    15  0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
     E  0.5    12  0.05555556   0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = number of documents in a class / number of all documents
ni = the total word frequency in each class:
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class.
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15
P(kill|Entertainment) = (0+0+0)/12 = 0/12
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:
P(kill|Terrorism) = (2+1+1+1)/(15+|V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0+0+0+1)/(12+|V|) = 1/18 = 0.0555
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document is:
Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt         2     1     2       0      0      1   ?
To find the posterior probability:
P(ci|W) = P(ci) × Π_{j=1}^{|V|} P(wj|ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) × P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism)
= 0.5 × 0.2380² × 0.1904¹ × 0.3333² × 0.0476⁰ × 0.0952⁰ × 0.0952¹
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10⁻⁵
P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) × P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)
= 0.5 × 0.0555² × 0.1111¹ × 0.1111² × 0.3333⁰ × 0.2777⁰ × 0.1111¹
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10⁻⁷
The document is classified as "Terrorism" because that class got the highest value.
Jaganadh G Elements of Text Mining
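A minimal sketch (not from the original slides) that reproduces this scoring; the posterior helper is an illustrative assumption, and the probability lists are copied from the model table above:

p_t = [0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952]  # P(w|Terrorism)
p_e = [0.0555, 0.1111, 0.1111, 0.3333, 0.2777, 0.1111]  # P(w|Entertainment)
doc = [2, 1, 2, 0, 0, 1]  # test-document word frequencies (kill .. tv)

def posterior(prior, probs, freqs):
    p = prior
    for prob, freq in zip(probs, freqs):
        p *= prob ** freq   # P(w|c) raised to the word's frequency
    return p

print posterior(0.5, p_t, doc)  # ~5.7e-05 -> Terrorism
print posterior(0.5, p_e, doc)  # ~2.3e-07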
Preventing Underflow
The probability score assigned to the test document is very small. In real-world situations we will train the classifier with thousands of documents, and in such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities:
P(Terrorism|W) = log(0.5 × 0.2380² × 0.1904¹ × 0.3333² × 0.0476⁰ × 0.0952⁰ × 0.0952¹)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213 = -4.2437
P(Entertainment|W) = log(0.5 × 0.0555² × 0.1111¹ × 0.1111² × 0.3333⁰ × 0.2777⁰ × 0.1111¹)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542 = -6.6289
After handling the underflow problem, our system still classifies the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.
The section on the Naïve Bayes classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
Jaganadh G Elements of Text Mining
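A log-space version of the previous sketch, using base-10 logs to match the slide's arithmetic; the log_posterior helper is an illustrative assumption:

import math

def log_posterior(prior, probs, freqs):
    # sum of logs instead of a product of probabilities
    logp = math.log10(prior)
    for prob, freq in zip(probs, freqs):
        logp += freq * math.log10(prob)
    return logp

p_t = [0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952]
p_e = [0.0555, 0.1111, 0.1111, 0.3333, 0.2777, 0.1111]
doc = [2, 1, 2, 0, 0, 1]
print log_posterior(0.5, p_t, doc)  # ~ -4.2437
print log_posterior(0.5, p_e, doc)  # ~ -6.6289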
Naıve Bayes Classifier
There are two different ways to setup a Naıve Bayes Classifier
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word w_t from the vocabulary. Dimension t of the vector for document d_i is written B_it and is either 0 or 1, indicating whether word w_t occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in the document. a
a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we adopt the multi-variate Bernoulli model for text classification, our document space representation will be like:
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
Jaganadh G Elements of Text Mining
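A hedged sketch of training a Bernoulli NB on this table with sklearn's BernoulliNB (the slides do not name a library for this model; the arrays below mirror the presence/absence table above):

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# binary presence/absence features (kill, bomb, kidnap, music, movie, tv)
X = np.array([[1, 1, 1, 0, 0, 1],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 0, 1, 0],
              [0, 1, 0, 1, 1, 1],
              [0, 0, 1, 1, 1, 0],
              [0, 0, 0, 1, 1, 1]])
y = np.array([0, 0, 0, 1, 1, 1])  # 0 = Terrorism, 1 = Entertainment

clf = BernoulliNB()
clf.fit(X, y)
print clf.predict(np.array([[1, 1, 1, 0, 0, 1]]))  # -> [0], i.e. Terrorism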
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.
In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag of words representation. We can also use a smoothed bag of words model, such as TF-IDF, in the multinomial naive Bayes model.
Jaganadh G Elements of Text Mining
Support Vector Machine
A Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(x_i, y_i)}_{i=1}^{n}, where x_i ∈ R^p (input feature vector) and y_i ∈ {1, -1} is the corresponding label, (y_i = +1) or (y_i = -1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that:
⟨w, x_i⟩ + b > 0 for y_i = +1
⟨w, x_i⟩ + b < 0 for y_i = -1
⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate the positive and negative examples, but they are not all equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
Jaganadh G Elements of Text Mining
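A tiny sketch of the decision rule above: the predicted class is the sign of ⟨w, x⟩ + b. The weight vector and bias here are hypothetical values for illustration, not learned from data:

import numpy as np

def predict(w, b, x):
    # class is the sign of the decision function <w, x> + b
    return 1 if np.dot(w, x) + b > 0 else -1

w = np.array([1.0, -1.0])  # hypothetical weight vector
b = 0.5                    # hypothetical bias
print predict(w, b, np.array([2.0, 0.5]))   # -> 1
print predict(w, b, np.array([-1.0, 2.0]))  # -> -1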
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.
If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the scatter regions of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors. 10 If the data is not linearly separable we have to use kernel tricks. 11
10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.
11 http://www.statsoft.com/textbook/support-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multinomial Naïve Bayes classifier with Python and sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build an SVM classifier with Python and sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a multi-variate Naïve Bayes classifier with Python and NLTK: 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adopted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.
Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                     Actual
                     Positive             Negative
Predicted  Positive  True Positive (TP)   False Positive (FP)
           Negative  False Negative (FN)  True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = (TP + TN) / (TP + FP + FN + TN)

                     Actual
                     Positive  Negative  Total
Predicted  Positive  562       77        639
           Negative  225       436       661
           Total     787       513       1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision indicates how many of the items that we identified were relevant:
Precision = TP / (TP + FP)
Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":
Recall = TP / (TP + FN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
                     Actual
                     Positive  Negative  Total
Predicted  Positive  562       77        639
           Negative  225       436       661
           Total     787       513       1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
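A short sketch computing these metrics from the confusion-matrix counts; the variable names are illustrative:

from __future__ import division

tp, fp, fn, tn = 562, 77, 225, 436  # counts from the confusion matrix above

accuracy = (tp + tn) / (tp + fp + fn + tn)
pos_precision = tp / (tp + fp)
pos_recall = tp / (tp + fn)
neg_precision = tn / (tn + fn)
neg_recall = tn / (tn + fp)
print accuracy, pos_precision, pos_recall, neg_precision, neg_recall
# 0.76..., 0.87..., 0.71..., 0.65..., 0.84... (as on the slides)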
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong:
Error Rate = (FP + FN) / (TP + FP + FN + TN)

                     Actual
                     Positive  Negative  Total
Predicted  Positive  562       77        639
           Negative  225       436       661
           Total     787       513       1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR):
Fall-out = FP / (FP + TN)

                     Actual
                     Positive  Negative  Total
Predicted  Positive  562       77        639
           Negative  225       436       661
           Total     787       513       1300

Fall-out = 77 / (77 + 436) = 0.15
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a
F1 Score = 2 × (precision × recall) / (precision + recall)
a http://en.wikipedia.org/wiki/F1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score (Positive) = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78
F1 Score (Negative) = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value, or precision rate, is the proportion of positive test results that are true positives:
Positive predictive value = TP / (TP + FP)
Positive predictive value = 562 / (562 + 77) = 0.87
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure:
NPV = TN / (TN + FN)
NPV = 436 / (436 + 225) = 0.65
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TN / (FP + TN)
Specificity = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons:
FDR = FP / (FP + TP)
FDR = 77 / (77 + 562) = 0.12
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications:
MCC = (TP × TN - FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
MCC = (562 × 436 - 77 × 225) / √((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55
Jaganadh G Elements of Text Mining
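The same MCC value in a few lines of Python, using the confusion-matrix counts above:

from __future__ import division
import math

tp, fp, fn, tn = 562, 77, 225, 436
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
print mcc  # ~0.55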
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
Jaganadh G Elements of Text Mining
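A hedged sketch of computing an ROC curve and its AUC with sklearn's roc_curve and auc; the labels and scores below are made-up illustrative values, not from the slides:

from sklearn.metrics import roc_curve, auc

y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.8, 0.2]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print "AUC:", auc(fpr, tpr)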
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories, such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract
Jaganadh G Elements of Text Mining
References
Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part I, http://css.dzone.com/articles/machine-learning-text-feature, accessed on 20th Sept. 2012.
Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part II, http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python, accessed on 20th Sept. 2012.
Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog, Springer, 2006.
Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora, acl.ldc.upenn.edu/J/J93/J93-1001.pdf, accessed on 23rd Sept. 2012.
Roger Bilisoly, Practical Text Mining with Perl, Wiley, 2008.
Text Categorization and Classification, http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, May 1999.
Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf, accessed on Feb 10, 2012.
Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf, accessed on March 2, 2012.
Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce, www.aclweb.org/anthology/D08-1044.
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.
Jaganadh G Elements of Text Mining
Tag Cloud
(Figure) Tag cloud from GPL text
Jaganadh G Elements of Text Mining
Word co-occurrence
Word co-occurrence analysis is used to construct lexicons, ontologies, etc. In general, it aims to find similarities between word pairs.
A word co-occurrence matrix is a square N × N matrix, where N corresponds to the total number of unique words in a corpus. A cell m_ij contains the number of times m_i co-occurs with word m_j within a specific context (a natural unit such as a sentence, or a certain window of m words). Note that the upper and lower triangles of the matrix are identical, since co-occurrence is a symmetric relation. a
a Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce, www.aclweb.org/anthology/D08-1044
Jaganadh G Elements of Text Mining
Word co-occurrence

      w1   w2   w3   ...  wn
w1    m11  m12  m13  ...  m1n
w2    m21  m22  m23  ...  m2n
w3    m31  m32  m33  ...  m3n
...
wn    mn1  mn2  mn3  ...  mnn

Shape of a co-occurrence matrix
Jaganadh G Elements of Text Mining
Word co-occurrence
Finding a co-occurrence matrix with Python

from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = [corpora[i], corpora[j]]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]
ccm = cooccurrence_matrix_corpus(corpus)
Jaganadh G Elements of Text Mining
Word co-occurrence
(Figure) Word co-occurrence visualization from 100 positive movie reviews. The plot shows the top first word and the top four associated words; for each associated word, its associated words are plotted in turn.
Jaganadh G Elements of Text Mining
Stop Words
Stop Words
In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In terms of linguistics these words are called function words. Words like 'a', 'an', 'the' are examples of stop words. There is no single defined set of stop words; different applications and research groups use different sets of stop words.
Generally stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus, compared to content words. Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words
Jaganadh G Elements of Text Mining
Stop Words Filter
def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
        'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
        'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
        'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
        'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
        'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
        'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
        'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
        'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
        'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
        'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
        'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless
Jaganadh G Elements of Text Mining
Bag of Words
The bag-of-words model is a simplifying representation used in natural languageprocessing and information retrieval (IR) In this model a text (such as a sentence or adocument) is represented as an un-ordered collection of words disregarding grammarand even word order a
Analyzing text by only analyzing frequency of words is called as bag of words model
ahttpenwikipediaorgwikiBag of words model
Jaganadh G Elements of Text Mining
Bag of Words
Example
d1 John likes to watch movies Mary likes too
d2 John also likes to watch football games
rsquofootballrsquo 0 rsquowatchrsquo 6 rsquomoviesrsquo 5 rsquogamesrsquo 1
rsquolikesrsquo 3 rsquojohnrsquo 2 rsquomaryrsquo 4
after removing stopwords
[0 0 1 2 1 1 1]
[1 1 1 1 0 0 1]
each entry of the vectors refers to count of
the corresponding entry in the dictionary
a
aExample taken from httpenwikipediaorgwikiBag of words model
Jaganadh G Elements of Text Mining
Bag of Words
Documents
d1 John likes to watch movies Mary likes too
d2 John also likes to watch football games
Vocabulary Index
V I(t) =
0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo
Jaganadh G Elements of Text Mining
Bag of Words
Documents
d1 John likes to watch movies Mary likes too
d2 John also likes to watch football games
Vocabulary Index
V I(t) =
0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo
Jaganadh G Elements of Text Mining
Bag of Words
football games john likes mary movies watch
doc1 0 0 1 2 1 1 1
doc2 1 1 1 1 0 0 1
Jaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python and sklearn 5
from sklearnfeature_extractiontext import
CountVectorizer
vectorizer = CountVectorizer(analyzer=rsquowordrsquo
min_n=1stop_words=rsquoenglishrsquo)
docs = (rsquoJohn likes to watch movies Mary
likes toorsquorsquoJohn also likes to watch football gamesrsquo)
bow = vectorizerfit_transform(docs)
print vectorizervocabulary_
print bowtoarray()
5httpscikit-learnorgJaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python Just for sample -(
def bag_of_words(docs)
stops = [rsquotorsquorsquotoorsquorsquoalsorsquo]
token_list = [tokenize(doc) for doc in docs]
vocab = list(set(token_list[0])union(token_list))
vocab =[v for v in vocab if v not in stops and len(v) gt 1]
vocab_idex = dict( [ ( word vocabindex(word) ) for word
in vocab] )
bow = [[tokenscount(word) for word in vocab_idexkeys()]
for tokens in token_list]
print vocab_idex
for bag in bow
print bag
d = (John likes to watch movies Mary likes too
John also likes to watch football games)
bag_of_words(d)
Jaganadh G Elements of Text Mining
TF-IDF
Tfndashidf term frequencyndashinverse document frequency is a numerical statistic which reflectshow important a word is to a document in a collection or corpus
TF-IDF
tf minus idf(t) = tf(t d)times idf(t)
where rsquotrsquo is a term in document rsquodrsquotf(t d) how many times the term rsquotrsquo is present in rsquodrsquo
tf(t d) =sumxisind
fr(x t)
where
fr(x t) =
1 if x = t0 otherwise
Jaganadh G Elements of Text Mining
TF-IDF
TF-IDF
andidf(t) = log |D|
1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space
Jaganadh G Elements of Text Mining
TF
TF
tf(t d) =sumxisind
fr(x t)
fr(x t) is a simple function
fr(x t) =
1 if x = t0 otherwise
Exampletf(primejohnprime d1) = 1
Jaganadh G Elements of Text Mining
Document Vector
To create a document vector space
V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))
To represent rsquod1rsquo and rsquod2rsquo as vectors
V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))
V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
which evaluates toV ~d1 = (0 0 1 2 1 1 1)
V ~d2 = (1 1 1 1 0 0 1)
Jaganadh G Elements of Text Mining
Vector Space Matrix
The document vectors can be represented as matrix
M|D|xF
where |D| is the cardinality of the document space
M|D|xF =
[0 0 1 2 1 1 11 1 1 1 0 0 1
]
Jaganadh G Elements of Text Mining
Vector Normalization
Normalized Vector
A normalized vector is represented as v = ~v~vp where v is is the unit vector or the
normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a
ahttpenwikipediaorgwikiLp space
Jaganadh G Elements of Text Mining
Vector Normalization
Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =
radicv2
1 + v22 + v2
2 + + v2n
With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p
It can be simplified as
~v||p = (
nsumi=1
|~vi|p)1p
6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining
Vector Normalization
L2 Norm
The norm which we apply here is L2 Norm which is also called as Euclidean norm a
It is a common norm used to measure the length of a vector where p = 2
ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm
Jaganadh G Elements of Text Mining
Vector Normalization
v~d1 = (0 0 1 2 1 1 1)vd1 = ~v
~vp
vd1 = v~d1
v~d12vd1 = (0012111)radic
02+02+12+22+12+12+12
vd1 = (0012111)radic8
vd1 = ( 0radic8 0radic
8 1radic
8 2radic
8 1radic
8 1radic
8 1radic
8)
vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10
Jaganadh G Elements of Text Mining
IDF
IDF
idf(t) = log |D|1+|dtisind|
where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0
Jaganadh G Elements of Text Mining
Finding IDF
idf(ti) = log|D|
1 + |d ti isin d|= log
2
1= 069314718
idf(football) = log 21+1
= 00
idf(games) = log 21+1
= 00
idf(john) = log 21+2
= minus040546510810816444
idf(likes) = log 21+2
= minus040546510810816444
idf(mary) = log 21+1
= 00
idf(movies) = log 21+1
= 00
idf(watch) = log 21+1
= 00idf(V ) = (00-040546510810816444-040546510810816444000000)
Jaganadh G Elements of Text Mining
TF-IDF weight
Finding TF-IDF weight
M|D|timesF times Midf
[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
]x
idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)
Jaganadh G Elements of Text Mining
TF-IDF weight
Finding TF-IDF weight
M|D|timesF times Midf
[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
]x
idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)
Jaganadh G Elements of Text Mining
TF-IDF weight
tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)
Jaganadh G Elements of Text Mining
TF-IDF Normalization
L2 Normalization
Mtfminusidf =MtfminusidfMtfminusidf2
Jaganadh G Elements of Text Mining
TF-IDF
Practice with Python and sklearn 7
from sklearnfeature_extractiontext import
CountVectorizer TfidfTransformer
vectorizer = CountVectorizer(analyzer=rsquowordrsquo
min_n=1stop_words=rsquoenglishrsquo)
docs = (rsquoJohn likes to watch movies Mary
likes toorsquorsquoJohn also likes to watch football gamesrsquo)
bow = vectorizerfit_transform(docs)
freq_term_matrix =vectorizertransform(docs)
tfidf = TfidfTransformer(norm=l2)
tfd = tfidffit(freq_term_matrix)
print IDF tfidfidf_
tf_idf_matrix = tfidftransform(freq_term_matrix)
print tf_idf_matrixtodense()
for wf in zip(vectorizervocabulary_tfdidf_)
print rsquor =gt rrsquo (w tfdidf_[f])
7httpscikit-learnorgJaganadh G Elements of Text Mining
N-Grams
N-Gram
In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa
Unigrams are single words
Bigrams are sequences of two words
Trigrams are sequences of three words
ahttpenwikipediaorgwikiN-gram
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from list of words
def ngrams(wordsn=2)
grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]
return grams
words = John likes to watch movies Mary likes too
lower()split()
bigrams = ngrams(wordsn=2)
trigrams = ngrams(wordsn=3)
print bigram
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math
def mutual_info(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
minfo =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
minfo[gram] = (mathlog( len(words) gramcount[ gram ]
wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))
mathlog( 2 )
return minfo
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math
def tscore(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
tsc =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
tsc[gram] = (gramcount[gram] - (1len(words))
wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])
mathsqrt( gramcount[gram])
return tsc
Jaganadh G Elements of Text Mining
Document Classification
Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8
Document classification tasks can be divided into three kinds
supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents
semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance
unsupervised document classification is entirely executed without reference toexternal information
8httpenwikipediaorgwikiDocument_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents
langd crang
wherelangd crangisin D times C using a learning
algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ
γ(ci dj) =
1 if dj belongs to ci0 otherwise
Main approaches in document classification are
Naıve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9
9Image taken from httpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)
P (E)
The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)
P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)
P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)
P (E1E2En)
Jaganadh G Elements of Text Mining
Naıve Bayes Applied in Text Classification
Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
Naıve Bayes Model for the training set will be like [4pt]
|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095
E 05 12 005555566 011111111 011111111 033333333 027777778 011111111
|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class
number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)
15 = 415
P (kill|Entertainment) = (0+0+0)12 = 0
12
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)
(15+|V |) = 521 = 02380
P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1
18 = 00555
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document isTest Docs kill bomb kidnap music movie tv C
Dt 2 1 2 0 0 1 To find the posterior probability
P (ci|W ) = P (ci)timesVprodj=1
P (wj |ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)
= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521
= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5
P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)
= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111
= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7
The document has classified as rdquoTerrorismrdquo because it got the highest value
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289
After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely
The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf
Jaganadh G Elements of Text Mining
Naıve Bayes Classifier
There are two different ways to setup a Naıve Bayes Classifier
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word w_t from the vocabulary. Dimension t of the vector for document d_i is written B_it, and is either 0 or 1, indicating whether word w_t occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in a document.[a]
a. A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will look like:
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
Jaganadh G Elements of Text Mining
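As a small sketch of the idea (not from the slides), converting the multinomial count vectors into the Bernoulli presence/absence features is a one-liner:

def binarize(freq_vector):
    # 1 if the word occurs at least once in the document, else 0
    return [1 if f > 0 else 0 for f in freq_vector]

print binarize([2, 1, 3, 0, 0, 1])  # D0 -> [1, 1, 1, 0, 0, 1]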
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.
In the NB example which we worked out we applied the multinomial model. In that model we used a simple bag of
words representation. We can use a smoothed bag of words model like TF-IDF in the multinomial naive Bayes model.
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(x_i, y_i)}_{i=1}^n, where x_i ∈ R^p (input feature vector) and y_i ∈ {1, -1} is the corresponding label, whether (y_i = +1) or (y_i = -1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that
⟨w, x_i⟩ + b > 0 for y_i = +1
⟨w, x_i⟩ + b < 0 for y_i = -1
⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate the positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
Jaganadh G Elements of Text Mining
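To make the decision rule concrete, here is a tiny sketch (with made-up, untrained weights w and bias b, purely for illustration) of f(x) = ⟨w, x⟩ + b:

import numpy as np

w = np.array([0.4, -0.2, 0.7])  # hypothetical weight vector
b = -0.1                        # hypothetical bias

def predict(x):
    # the sign of <w, x> + b decides the class
    return 1 if np.dot(w, x) + b > 0 else -1

print predict(np.array([1.0, 0.5, 1.2]))  # +1: the positive side of the hyperplane

In practice w and b are learned from the training set by maximizing the margin.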
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.
If there is a clear blank space available on the table between the squares and rectangles, we can say that they are linearly separable. A line drawn through the clear space between the figures, at exactly equal distance from the margins of the scatter regions of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11
10. This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.
11. http://www.statsoft.com/textbook/support-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build an SVM Classifier with Python Sklearn
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)
12. This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.
Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:
                          Actual
                    Positive             Negative
Predicted Positive  True Positive (TP)   False Positive (FP)
          Negative  False Negative (FN)  True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = (TP + TN) / (TP + FP + FN + TN)

                     Actual
                 Positive  Negative  Total
Predicted Positive    562        77    639
          Negative    225       436    661
Total                 787       513   1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
Jaganadh G Elements of Text Mining
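All of the measures discussed in the following slides can be computed directly from the four cells of the confusion matrix. A small sketch (not part of the deck) using the numbers above:

import math

tp, fp, fn, tn = 562.0, 77.0, 225.0, 436.0

accuracy  = (tp + tn) / (tp + fp + fn + tn)                 # 0.76
precision = tp / (tp + fp)                                  # 0.87 (positive class)
recall    = tp / (tp + fn)                                  # 0.71 (positive class)
f1        = 2 * precision * recall / (precision + recall)   # 0.78
mcc       = (tp * tn - fp * fn) / math.sqrt(
            (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))  # 0.55
print accuracy, precision, recall, f1, mcc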
Evaluating Performance of a Classifier
Precision and Recall
Precision indicates how many of the items that we identified were relevant.
Precision = TP / (TP + FP)
Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".
Recall = TP / (TP + FN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
                     Actual
                 Positive  Negative  Total
Predicted Positive    562        77    639
          Negative    225       436    661
Total                 787       513   1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong.
ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                     Actual
                 Positive  Negative  Total
Predicted Positive    562        77    639
          Negative    225       436    661
Total                 787       513   1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).
Fall-out = FP / (FP + TN)

                     Actual
                 Positive  Negative  Total
Predicted Positive    562        77    639
          Negative    225       436    661
Total                 787       513   1300

Fall-out = 77 / (77 + 436) = 0.15
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.[a]

F1 Score = 2 × (precision × recall) / (precision + recall)

a. http://en.wikipedia.org/wiki/F1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78
F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.
Positive predictive value = TP / (TP + FP)
Positive predictive value = 562 / (562 + 77) = 0.87
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.
NPV = TN / (TN + FN)
NPV = 436 / (436 + 225) = 0.65
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TN / (FP + TN)
Specificity = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.
FDR = FP / (FP + TP)
FDR = 77 / (77 + 562) = 0.12
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.
MCC = (TP × TN - FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
MCC = (562 × 436 - 77 × 225) / √((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
Jaganadh G Elements of Text Mining
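A short sketch of computing ROC/AUC with scikit-learn; y_true and y_score below are made-up illustrative values, not data from the slides:

from sklearn.metrics import roc_curve, auc

y_true  = [1, 1, 0, 1, 0, 0, 1, 0]                   # actual labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.75, 0.2]  # classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # false/true positive rates
print auc(fpr, tpr)  # area under the ROC curve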
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child \
of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the \
second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13. http://pypi.python.org/pypi/topia.termextract
Jaganadh G Elements of Text Mining
References
Chris Smith. Machine Learning, Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.
Chris Smith. Machine Learning, Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.
Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.
Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.
Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.
Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.
Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.
Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.
Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
Jaganadh G Elements of Text Mining
Word co-occurrence
Word co-occurrence analysis is used to construct lexicons, ontologies, etc. In general it aims to find similarities between word pairs.
A word co-occurrence matrix is a square N × N matrix, where N corresponds to the total number of unique words in a corpus. A cell m_ij contains the number of times m_i co-occurs with word m_j within a specific context: a natural unit such as a sentence, or a certain window of m words. Note that the upper and lower triangles of the matrix are identical, since co-occurrence is a symmetric relation.[a]
a. Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044
Jaganadh G Elements of Text Mining
Word co-occurrence

      w1    w2    w3    ...   wn
w1    m11   m12   m13   ...   m1n
w2    m21   m22   m23   ...   m2n
w3    m31   m32   m33   ...   m3n
...
wn    mn1   mn2   mn3   ...   mnn

Shape of a co-occurrence matrix
Jaganadh G Elements of Text Mining
Word co-occurrence
Finding co-occurrence matrix with Python
from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = [corpora[i], corpora[j]]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]
ccm = cooccurrence_matrix_corpus(corpus)
Jaganadh G Elements of Text Mining
Word co-occurrence
[Figure: word co-occurrence visualization from 100 positive movie reviews. The plot shows the top first word and its top four associated words; for each associated word, its associated words are plotted in turn.]
Jaganadh G Elements of Text Mining
Stop Words
Stop Words
In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In linguistics these words are called function words. Words like 'a', 'an' and 'the' are examples of stop words. There is no single defined set of stop words; different applications and research groups use different sets of stop words. Generally stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus compared to content words. Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words
Jaganadh G Elements of Text Mining
Stop Words Filter
def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
    'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
    'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
    'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
    'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
    'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
    'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
    'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
    'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
    'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
    'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
    'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless
Jaganadh G Elements of Text Mining
Bag of Words
The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar and even word order.[a]
Analyzing text by only analyzing the frequency of words is called the bag of words model.
a. http://en.wikipedia.org/wiki/Bag_of_words_model
Jaganadh G Elements of Text Mining
Bag of Words
Example

d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1,
 'likes': 3, 'john': 2, 'mary': 4}
(vocabulary index, after removing stop words)

[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of the corresponding entry in the dictionary.[a]

a. Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model
Jaganadh G Elements of Text Mining
Bag of Words
Documents
d1 John likes to watch movies Mary likes too
d2 John also likes to watch football games
Vocabulary Index
VI(t) =  0 if t is 'football'
         1 if t is 'games'
         2 if t is 'john'
         3 if t is 'likes'
         4 if t is 'mary'
         5 if t is 'movies'
         6 if t is 'watch'
Jaganadh G Elements of Text Mining
Bag of Words
football games john likes mary movies watch
doc1 0 0 1 2 1 1 1
doc2 1 1 1 1 0 0 1
Jaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python and sklearn 5
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
        'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()
5. http://scikit-learn.org
Jaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with plain Python. Just a sample :-(
def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
                       in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
           for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies Mary likes too",
     "John also likes to watch football games")
bag_of_words(d)
Jaganadh G Elements of Text Mining
TF-IDF
Tf-idf (term frequency-inverse document frequency) is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t) = tf(t, d) × idf(t)

where 't' is a term in document 'd', and tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = Σ_{x∈d} fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise
Jaganadh G Elements of Text Mining
TF-IDF
TF-IDF
and
idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )
where |{d : t ∈ d}| is count(d) where 't' is present and tf(t, d) ≠ 0,
and |D| is the cardinality of the document space.
Jaganadh G Elements of Text Mining
TF
TF
tf(t, d) = Σ_{x∈d} fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1
Jaganadh G Elements of Text Mining
Document Vector
To create a document vector space:

v_dn = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))

To represent 'd1' and 'd2' as vectors:

v_d1 = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))
v_d2 = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))

which evaluates to:
v_d1 = (0, 0, 1, 2, 1, 1, 1)
v_d2 = (1, 1, 1, 1, 0, 0, 1)
Jaganadh G Elements of Text Mining
Vector Space Matrix
The document vectors can be represented as a matrix M_{|D|×F},
where |D| is the cardinality of the document space:

M_{|D|×F} = [ 0 0 1 2 1 1 1 ]
            [ 1 1 1 1 0 0 1 ]
Jaganadh G Elements of Text Mining
Vector Normalization
Normalized Vector
A normalized vector is represented as v̂ = v / ||v||_p, where v̂ is the unit vector (the normalized vector), v is the vector to be normalized, and ||v||_p is the norm (magnitude or length) of the vector v in the L^p space (Lebesgue spaces).[a]

a. http://en.wikipedia.org/wiki/Lp_space
Jaganadh G Elements of Text Mining
Vector Normalization
The length of a vector is calculated using the Euclidean norm.6
Non-normalized vector: v = (v1, v2, v3, ..., vn)
Length of the vector: ||v|| = √(v1² + v2² + v3² + ... + vn²)
With the norm: ||v||_p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)
It can be simplified as:
||v||_p = ( Σ_{i=1}^{n} |v_i|^p )^(1/p)
6. http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining
Vector Normalization
L2 Norm
The norm which we apply here is the L2 norm, which is also called the Euclidean norm.[a]
It is a common norm used to measure the length of a vector, where p = 2.

a. http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm
Jaganadh G Elements of Text Mining
Vector Normalization
v_d1 = (0, 0, 1, 2, 1, 1, 1)
v̂_d1 = v_d1 / ||v_d1||_2
v̂_d1 = (0, 0, 1, 2, 1, 1, 1) / √(0² + 0² + 1² + 2² + 1² + 1² + 1²)
v̂_d1 = (0, 0, 1, 2, 1, 1, 1) / √8
v̂_d1 = (0/√8, 0/√8, 1/√8, 2/√8, 1/√8, 1/√8, 1/√8)
v̂_d1 = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)
Our normalized vector v̂_d1 now has L2 norm ||v̂_d1||_2 = 1.0
Jaganadh G Elements of Text Mining
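The same normalization in a few lines of Python (a sketch, not from the slides):

import math

v = [0, 0, 1, 2, 1, 1, 1]                # document vector for d1
norm = math.sqrt(sum(x * x for x in v))  # L2 norm = sqrt(8)
unit = [x / norm for x in v]
print unit  # [0.0, 0.0, 0.3535..., 0.7071..., 0.3535..., 0.3535..., 0.3535...]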
IDF
IDF
idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )
where |{d : t ∈ d}| is count(d) where 't' is present and tf(t, d) ≠ 0
Jaganadh G Elements of Text Mining
Finding IDF
idf(ti) = log( |D| / (1 + |{d : ti ∈ d}|) ) = log(2/1) = 0.69314718

idf(football) = log( 2 / (1+1) ) = 0.0
idf(games) = log( 2 / (1+1) ) = 0.0
idf(john) = log( 2 / (1+2) ) = -0.40546510810816444
idf(likes) = log( 2 / (1+2) ) = -0.40546510810816444
idf(mary) = log( 2 / (1+1) ) = 0.0
idf(movies) = log( 2 / (1+1) ) = 0.0
idf(watch) = log( 2 / (1+1) ) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
Jaganadh G Elements of Text Mining
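The idf values above can be checked with a few lines of Python; a sketch, assuming the two toy documents (with stop words removed):

import math

docs = [['john', 'likes', 'watch', 'movies', 'mary', 'likes'],
        ['john', 'likes', 'watch', 'football', 'games']]

def idf(term, docs):
    df = sum(1 for d in docs if term in d)  # number of docs containing the term
    return math.log(len(docs) / (1.0 + df))

for t in ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']:
    print t, idf(t, docs)  # 'john' and 'likes' -> -0.4054...; the rest -> 0.0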
TF-IDF weight
Finding TF-IDF weight
M_{|D|×F} × M_idf

[ tf(t1,d1) tf(t2,d1) tf(t3,d1) tf(t4,d1) tf(t5,d1) tf(t6,d1) tf(t7,d1) ]
[ tf(t1,d2) tf(t2,d2) tf(t3,d2) tf(t4,d2) tf(t5,d2) tf(t6,d2) tf(t7,d2) ]

×

[ idf(t1)    0        0        0        0        0        0     ]
[    0    idf(t2)     0        0        0        0        0     ]
[    0       0     idf(t3)     0        0        0        0     ]
[    0       0        0     idf(t4)     0        0        0     ]
[    0       0        0        0     idf(t5)     0        0     ]
[    0       0        0        0        0     idf(t6)     0     ]
[    0       0        0        0        0        0     idf(t7)  ]
Jaganadh G Elements of Text Mining
TF-IDF weight
[ tf(t1,d1)×idf(t1)  tf(t2,d1)×idf(t2)  tf(t3,d1)×idf(t3)  tf(t4,d1)×idf(t4)  tf(t5,d1)×idf(t5)  tf(t6,d1)×idf(t6)  tf(t7,d1)×idf(t7) ]
[ tf(t1,d2)×idf(t1)  tf(t2,d2)×idf(t2)  tf(t3,d2)×idf(t3)  tf(t4,d2)×idf(t4)  tf(t5,d2)×idf(t5)  tf(t6,d2)×idf(t6)  tf(t7,d2)×idf(t7) ]
Jaganadh G Elements of Text Mining
TF-IDF Normalization
L2 Normalization
M_tf-idf = M_tf-idf / ||M_tf-idf||_2
Jaganadh G Elements of Text Mining
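Putting the last few slides together, a small numpy sketch (mine, not from the deck) of the whole computation: the term-frequency matrix times the diagonal idf matrix, followed by L2 row normalization. The idf values are the ones computed earlier for the toy documents.

import numpy as np

M = np.array([[0, 0, 1, 2, 1, 1, 1],
              [1, 1, 1, 1, 0, 0, 1]], dtype=float)  # tf matrix
idf = np.array([0.0, 0.0, -0.4054, -0.4054, 0.0, 0.0, 0.0])

tfidf = M.dot(np.diag(idf))                          # tf x diagonal idf matrix
norms = np.sqrt((tfidf ** 2).sum(axis=1))[:, np.newaxis]
tfidf_normalized = tfidf / norms                     # each row now has L2 norm 1
print tfidf_normalized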
TF-IDF
Practice with Python and sklearn 7
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
        'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, f in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[f])
7. http://scikit-learn.org
Jaganadh G Elements of Text Mining
N-Grams
N-Gram
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters; however, the items in question can be phonemes, syllables, letters, words or base pairs according to the application.[a]
Unigrams are single words
Bigrams are sequences of two words
Trigrams are sequences of three words
a. http://en.wikipedia.org/wiki/N-gram
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P(wi|w1, w2, ..., wi-1) ≈ P(wi|wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words)-2+1)]
bigrams = [' '.join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P(wi|w1, w2, ..., wi-1) ≈ P(wi|wi-2, wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [' '.join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from list of words
def ngrams(words, n=2):
    grams = [' '.join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure the strength of word association:

I(wi, wj) = log2( P(wi, wj) / (P(wi) P(wj)) ) ≈ log2( N C(wi, wj) / (C(wi) C(wj)) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math
def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = math.log((len(words) * gramcount[gram]) /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) \
            / math.log(2)
    return minfo
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure the strength of word association:

t(wi, wj) = ( mean(P(wi, wj)) - mean(P(wi)) mean(P(wj)) ) / √( σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)) )
          ≈ ( C(wi, wj) - (1/N) C(wi) C(wj) ) / √C(wi, wj)

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

t(strong, tea) = ( C(strong, tea) - (1/N) C(strong) C(tea) ) / √C(strong, tea)
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math
def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc
Jaganadh G Elements of Text Mining
Document Classification
Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8
Document classification tasks can be divided into three kinds
supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents
semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance
unsupervised document classification is entirely executed without reference toexternal information
8httpenwikipediaorgwikiDocument_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents
langd crang
wherelangd crangisin D times C using a learning
algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ
γ(ci dj) =
1 if dj belongs to ci0 otherwise
Main approaches in document classification are
Naıve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9
9Image taken from httpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)
P (E)
The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)
P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)
P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)
P (E1E2En)
Jaganadh G Elements of Text Mining
Naıve Bayes Applied in Text Classification
Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
Naıve Bayes Model for the training set will be like [4pt]
|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095
E 05 12 005555566 011111111 011111111 033333333 027777778 011111111
|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class
number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)
15 = 415
P (kill|Entertainment) = (0+0+0)12 = 0
12
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)
(15+|V |) = 521 = 02380
P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1
18 = 00555
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document isTest Docs kill bomb kidnap music movie tv C
Dt 2 1 2 0 0 1 To find the posterior probability
P (ci|W ) = P (ci)timesVprodj=1
P (wj |ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)
= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521
= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5
P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)
= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111
= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7
The document has classified as rdquoTerrorismrdquo because it got the highest value
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289
After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely
The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf
Jaganadh G Elements of Text Mining
Naıve Bayes Classifier
There are two different ways to setup a Naıve Bayes Classifier
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta
aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by presence or absenceof the word
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument
In the NB example which we worked out we applied the multinomial model In the model we used simple bag of
words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that
〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1
〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way
If there is a clear blank space available in the table in between the square and rectangles we can say that
it is linearly separable A line which drawn in the clear space between the figuresexactly equal length
from the margin of scatter region of square and rectangle is called as separating hyperplane Everything
on the one side of the separating hyper plane belongs to one category and everything in the other side
belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter
(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin
The figures closest to the separating hyperplane are known as support vectors10 If the data is not
linearly separable we have to use kernel tricks 11
10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography
11httpwwwstatsoftcomtextbooksupport-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnnaive_bayes import MultinomialNB
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoMultinomialNB())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a SVM Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnsvm import LinearSVC
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoLinearSVC())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12
import nltkclassifyutil
from nltkclassify import NaiveBayesClassifier
from nltkcorpus import movie_reviews
def word_feats(words)
return dict([(word True) for word in words])
negids = movie_reviewsfileids(rsquonegrsquo)
posids = movie_reviewsfileids(rsquoposrsquo)
negfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquonegrsquo) for f in negids]
posfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquoposrsquo) for f in posids]
negcutoff = len(negfeats)34
poscutoff = len(posfeats)34
trainfeats = negfeats[negcutoff] + posfeats[poscutoff]
classifier = NaiveBayesClassifiertrain(trainfeats)
sent = This is really cool I like it
words = word_feats(sentlower()split())
print classifierclassify(words)
12This code is adopted from httpstreamhackercom20100510
text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm
Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)
MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)
= 055
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
Word co-occourance
w1 w2 w3 wnw1 m11 m12 m13 m1n
w2 m21 m22 m23 m2n
w3 m31 m32 m33 m3n
wn mn1 mn2 mn3 mnnShape of a co-occurrence matrix
Jaganadh G Elements of Text Mining
Word co-occurrence
Finding co-occurrence matrix with Python
def cooccurrence_matrix_corpus(corpus)
matrix = defaultdict(lambda defaultdict(int))
for corpora in corpus
for i in xrange(len(corpora)-1)
for j in xrange(i+1 len(corpora))
word1 word2 = [corpora[i]corpora[j]]
matrix[word1][word2] += 1
matrix[word2][word1] += 1
return matrix
corpus = [[rsquow1rsquorsquow2rsquorsquow3rsquo][rsquow4rsquorsquow5rsquorsquow6rsquo]]
ccm = cooccurrence_matrix_corpus(corpus)
Jaganadh G Elements of Text Mining
Word co-occurrence
hong
toy zdenek
czech
movie
julie
like
chan
story
tango
one
sverak
martial
woody
dating
film
first
Word co-occurrence visualization from 100 positive movie reviews The plot show top first word and top four
associated words For each associated words again associated words are plotted
Jaganadh G Elements of Text Mining
Stop Words
Stop Words
In computing stop words are words which are filtered out prior to or after processing ofnatural language data (text) In terms of linguistics these words are called as functionwords Words like rsquoarsquo rsquoanrsquo rsquothersquo are examples for stop words There is no defined set ofstop words available Different applications and research groups uses different sets o stopwordsGenerally stop words are omitted in text mining process The frequency of stop wordswill be very high in any corpus compared to content wordsPointers to some good stopword list is available at httpenwikipediaorgwikiStop_words
Jaganadh G Elements of Text Mining
Stop Words Filter
def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
    'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
    'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
    'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
    'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
    'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
    'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
    'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
    'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
    'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
    'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
    'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless
Jaganadh G Elements of Text Mining
Bag of Words
The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar and even word order.

Analyzing text by only analyzing the frequency of words is called the bag of words model.

http://en.wikipedia.org/wiki/Bag_of_words_model
Jaganadh G Elements of Text Mining
Bag of Words
Example

d1: John likes to watch movies Mary likes too
d2: John also likes to watch football games

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1,
'likes': 3, 'john': 2, 'mary': 4}
(after removing stopwords)

[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of
the corresponding entry in the dictionary.

Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model
Jaganadh G Elements of Text Mining
Bag of Words
Documents
d1 John likes to watch movies Mary likes too
d2 John also likes to watch football games
Vocabulary Index
VI(t) =

0 if t is 'football'
1 if t is 'games'
2 if t is 'john'
3 if t is 'likes'
4 if t is 'mary'
5 if t is 'movies'
6 if t is 'watch'
Jaganadh G Elements of Text Mining
Bag of Words
football games john likes mary movies watch
doc1 0 0 1 2 1 1 1
doc2 1 1 1 1 0 0 1
Jaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python and sklearn 5
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
    'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org
Jaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python (just a sample :-( )

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
        in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
        for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies Mary likes too",
    "John also likes to watch football games")
bag_of_words(d)
Jaganadh G Elements of Text Mining
TF-IDF
Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t) = tf(t, d) × idf(t)

where 't' is a term in document 'd';
tf(t, d): how many times the term 't' is present in 'd':

tf(t, d) = ∑_{x ∈ d} fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise
Jaganadh G Elements of Text Mining
TF-IDF
TF-IDF

and

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the number of documents d in which 't' is present, i.e. where tf(t, d) ≠ 0;
|D| is the cardinality of the document space.
Jaganadh G Elements of Text Mining
TF
TF

tf(t, d) = ∑_{x ∈ d} fr(x, t)

where fr(x, t) is a simple function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1
Jaganadh G Elements of Text Mining
Document Vector
To create a document vector space:

v_dn = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))

To represent 'd1' and 'd2' as vectors:

v_d1 = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))
v_d2 = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))

which evaluates to:
v_d1 = (0, 0, 1, 2, 1, 1, 1)
v_d2 = (1, 1, 1, 1, 0, 0, 1)
Jaganadh G Elements of Text Mining
Vector Space Matrix
The document vectors can be represented as a matrix M_{|D|×F},
where |D| is the cardinality of the document space:

M_{|D|×F} = [ 0 0 1 2 1 1 1
              1 1 1 1 0 0 1 ]
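The same matrix as a quick Python sketch, assuming numpy is available (not part of the original slides):

import numpy as np

# document-term matrix: rows are documents d1 and d2, columns the vocabulary
M = np.array([[0, 0, 1, 2, 1, 1, 1],
              [1, 1, 1, 1, 0, 0, 1]])
print M.shape  # (2, 7), i.e. |D| x F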
Jaganadh G Elements of Text Mining
Vector Normalization
Normalized Vector

A normalized vector is represented as v̂ = v / ‖v‖_p, where v̂ is the unit vector (the normalized vector), v is the vector to be normalized, and ‖v‖_p is the norm (magnitude or length) of the vector v in the Lp space (Lebesgue spaces).

http://en.wikipedia.org/wiki/Lp_space
Jaganadh G Elements of Text Mining
Vector Normalization
The length of a vector is calculated using the Euclidean norm.6
Non-normalized vector: v = (v1, v2, v3, ..., vn)
Length of vector: ‖v‖ = √(v1² + v2² + v3² + ... + vn²)

With norm: ‖v‖_p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)

It can be simplified as:

‖v‖_p = ( ∑_{i=1}^{n} |v_i|^p )^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining
Vector Normalization
L2 Norm

The norm we apply here is the L2 norm, also called the Euclidean norm.
It is a common norm used to measure the length of a vector, where p = 2.

http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm
Jaganadh G Elements of Text Mining
Vector Normalization
v_d1 = (0, 0, 1, 2, 1, 1, 1)
v̂_d1 = v_d1 / ‖v_d1‖_2
v̂_d1 = (0, 0, 1, 2, 1, 1, 1) / √(0² + 0² + 1² + 2² + 1² + 1² + 1²)
v̂_d1 = (0, 0, 1, 2, 1, 1, 1) / √8
v̂_d1 = (0/√8, 0/√8, 1/√8, 2/√8, 1/√8, 1/√8, 1/√8)
v̂_d1 = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Now our normalized vector v̂_d1 has an L2-norm ‖v̂_d1‖_2 = 1.0
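A minimal Python sketch of this normalization, assuming plain lists rather than numpy arrays (not part of the original slides):

import math

def l2_normalize(vec):
    # Euclidean (L2) length of the vector
    norm = math.sqrt(sum(v ** 2 for v in vec))
    return [v / norm for v in vec]

print l2_normalize([0, 0, 1, 2, 1, 1, 1])
# [0.0, 0.0, 0.3535..., 0.7071..., 0.3535..., 0.3535..., 0.3535...]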
Jaganadh G Elements of Text Mining
IDF
IDF

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the number of documents d in which 't' is present, i.e. where tf(t, d) ≠ 0
Jaganadh G Elements of Text Mining
Finding IDF
idf(t_i) = log( |D| / (1 + |{d : t_i ∈ d}|) ), e.g. log(2/1) = 0.69314718

idf(football) = log(2 / (1+1)) = 0.0
idf(games)    = log(2 / (1+1)) = 0.0
idf(john)     = log(2 / (1+2)) = −0.40546510810816444
idf(likes)    = log(2 / (1+2)) = −0.40546510810816444
idf(mary)     = log(2 / (1+1)) = 0.0
idf(movies)   = log(2 / (1+1)) = 0.0
idf(watch)    = log(2 / (1+1)) = 0.0

idf(V) = (0.0, 0.0, −0.40546510810816444, −0.40546510810816444, 0.0, 0.0, 0.0)
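A short Python sketch of the same computation, assuming the tokenize function from the earlier slides and documents given as token lists (not part of the original slides):

import math

def idf(term, docs):
    # docs: list of token lists; 1 + document frequency, as in the formula above
    df = sum(1 for doc in docs if term in doc)
    return math.log(float(len(docs)) / (1 + df))

docs = [tokenize("John likes to watch movies Mary likes too"),
        tokenize("John also likes to watch football games")]
print idf('john', docs)   # -0.405465...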
Jaganadh G Elements of Text Mining
TF-IDF weight
Finding TF-IDF weight:

M_{|D|×F} × M_idf

[ tf(t1,d1) tf(t2,d1) tf(t3,d1) tf(t4,d1) tf(t5,d1) tf(t6,d1) tf(t7,d1)
  tf(t1,d2) tf(t2,d2) tf(t3,d2) tf(t4,d2) tf(t5,d2) tf(t6,d2) tf(t7,d2) ]
×
[ idf(t1)  0        0        0        0        0        0
  0        idf(t2)  0        0        0        0        0
  0        0        idf(t3)  0        0        0        0
  0        0        0        idf(t4)  0        0        0
  0        0        0        0        idf(t5)  0        0
  0        0        0        0        0        idf(t6)  0
  0        0        0        0        0        0        idf(t7) ]
Jaganadh G Elements of Text Mining
TF-IDF weight
[ tf(t1,d1)×idf(t1)  tf(t2,d1)×idf(t2)  tf(t3,d1)×idf(t3)  tf(t4,d1)×idf(t4)  tf(t5,d1)×idf(t5)  tf(t6,d1)×idf(t6)  tf(t7,d1)×idf(t7)
  tf(t1,d2)×idf(t1)  tf(t2,d2)×idf(t2)  tf(t3,d2)×idf(t3)  tf(t4,d2)×idf(t4)  tf(t5,d2)×idf(t5)  tf(t6,d2)×idf(t6)  tf(t7,d2)×idf(t7) ]
Jaganadh G Elements of Text Mining
TF-IDF Normalization
L2 Normalization

M_tf-idf = M_tf-idf / ‖M_tf-idf‖_2
Jaganadh G Elements of Text Mining
TF-IDF
Practice with Python and sklearn 7
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
    'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, f in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[f])

7 http://scikit-learn.org
Jaganadh G Elements of Text Mining
N-Grams
N-Gram
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters; the items in question can be phonemes, syllables, letters, words, or base pairs according to the application.

Unigrams are single words
Bigrams are sequences of two words
Trigrams are sequences of three words

http://en.wikipedia.org/wiki/N-gram
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P(w_i | w_1, w_2, ..., w_{i−1}) ≈ P(w_i | w_{i−1})

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [' '.join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
'movies mary', 'mary likes', 'likes too']
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P(w_i | w_1, w_2, ..., w_{i−1}) ≈ P(w_i | w_{i−2}, w_{i−1})

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [' '.join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
'watch movies mary', 'movies mary likes', 'mary likes too']
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from list of words
def ngrams(words, n=2):
    grams = [' '.join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure the strength of word association:

I(w_i, w_j) = log2( P(w_i, w_j) / (P(w_i) P(w_j)) ) ≈ log2( N C(w_i, w_j) / (C(w_i) C(w_j)) )

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus,
C(w_i, w_j) is the frequency of the bigram w_i w_j,
and N is the total number of words in the corpus.

I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) \
            / math.log(2)
    return minfo
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure the strength of word association:

t(w_i, w_j) = ( mean(P(w_i, w_j)) − mean(P(w_i)) mean(P(w_j)) ) / √( σ²(P(w_i, w_j)) + σ²(P(w_i)) σ²(P(w_j)) )
            ≈ ( C(w_i, w_j) − (1/N) C(w_i) C(w_j) ) / √C(w_i, w_j)

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus,
C(w_i, w_j) is the frequency of the bigram w_i w_j,
and N is the total number of words in the corpus.

t(strong, tea) = ( C(strong, tea) − (1/N) C(strong) C(tea) ) / √C(strong, tea)
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc
Jaganadh G Elements of Text Mining
Document Classification
Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents;

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance;

unsupervised document classification is entirely executed without reference to external information.

8 http://en.wikipedia.org/wiki/Document_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (or a classifier function) γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise
Main approaches in document classification are:

Naïve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline9

[Figure: pipeline diagram.]

9 Image taken from http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.

Bayes' rule: P(H|E) = P(E|H) × P(H) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability: the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability: the probability of an event after the evidence is observed.
Jaganadh G Elements of Text Mining
Naïve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have:
P(raining | dark cloud) = P(dark cloud | raining) × P(raining) / P(dark cloud)

For multiple evidences:
P(H | E1, E2, ..., En) = P(E1, E2, ..., En | H) × P(H) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as follows:
P(H | E1, E2, ..., En) = P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) / P(E1, E2, ..., En)
Jaganadh G Elements of Text Mining
Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naïve Bayes Model

Naïve Bayes Model for the training set will be like:

|V|  C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
 6   T  0.5    15  0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
     E  0.5    12  0.055555556  0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class:
n_terrorism = 2+1+3+1+1+1+1+1+1+2+1 = 15
n_entertainment = 1+2+1+1+1+1+1+2+2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class.
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15
P(kill|Entertainment) = (0+0+0)/12 = 0/12
Jaganadh G Elements of Text Mining
Building Naïve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:
P(kill|Terrorism) = (2+1+1+1) / (15+|V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0+0+0+1) / (12+|V|) = 1/18 = 0.0555
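A small sketch, not from the original slides, that reproduces the Laplace-smoothed conditional probabilities from the training counts (the document vectors are copied from the training table above):

from __future__ import division

# word-count vectors follow the training table: kill, bomb, kidnap, music, movie, tv
training = {'Terrorism':     [[2,1,3,0,0,1], [1,1,1,0,0,0], [1,1,2,0,1,0]],
            'Entertainment': [[0,1,0,2,1,1], [0,0,1,1,1,0], [0,0,0,2,2,2]]}
V = 6  # vocabulary size

def cond_probs(docs):
    totals = [sum(col) for col in zip(*docs)]   # frequency of each word in the class
    n = sum(totals)                             # total word count in the class
    return [(c + 1) / (n + V) for c in totals]  # Laplace smoothing

for label, docs in training.items():
    print label, cond_probs(docs)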
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt         2     1     2       0      0      1   ?

To find the posterior probability:

P(ci|W) = P(ci) × ∏_{j=1}^{|V|} P(wj|ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)² × P(bomb|Terrorism)¹ × P(kidnap|Terrorism)² × P(music|Terrorism)⁰ × P(movie|Terrorism)⁰ × P(tv|Terrorism)¹
= 0.5 × 0.2380² × 0.1904¹ × 0.3333² × 0.0476⁰ × 0.0952⁰ × 0.0952¹
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10⁻⁵

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)² × P(bomb|Entertainment)¹ × P(kidnap|Entertainment)² × P(music|Entertainment)⁰ × P(movie|Entertainment)⁰ × P(tv|Entertainment)¹
= 0.5 × 0.0555² × 0.1111¹ × 0.1111² × 0.3333⁰ × 0.2777⁰ × 0.1111¹
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10⁻⁷

The document has been classified as "Terrorism" because that class got the highest value.
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called Underflow. To resolve the problem we can take the logarithm of the probabilities:

P(Terrorism|W) = log(0.5 × 0.2380² × 0.1904¹ × 0.3333² × 0.0476⁰ × 0.0952⁰ × 0.0952¹)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= −0.3010 − 1.2468 − 0.7203 − 0.9543 + 0 + 0 − 1.0213 = −4.2437

P(Entertainment|W) = log(0.5 × 0.0555² × 0.1111¹ × 0.1111² × 0.3333⁰ × 0.2777⁰ × 0.1111¹)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= −0.3010 − 2.511 − 0.9542 − 1.9085 + 0 + 0 − 0.9542 = −6.6289

After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.
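A sketch of the same log-space scoring in Python, using base-10 logarithms to match the numbers above (not part of the original slides):

import math

def log_posterior(prior, cond_probs, doc_counts):
    # sum of log probabilities instead of a product of raw probabilities
    score = math.log10(prior)
    for p, count in zip(cond_probs, doc_counts):
        if count:                    # terms with a zero count contribute nothing
            score += count * math.log10(p)
    return score

terrorism = [0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952]
test_doc  = [2, 1, 2, 0, 0, 1]       # kill, bomb, kidnap, music, movie, tv
print log_posterior(0.5, terrorism, test_doc)   # about -4.24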
The section on Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
Jaganadh G Elements of Text Mining
Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word w_t from the vocabulary. Dimension t of the vector for document d_i is written B_it, and is either 0 or 1, indicating whether word w_t occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in a document.

A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs  kill  bomb  kidnap  music  movie  tv  C
D0             1     1     1       0      0      1   Terrorism
D1             1     1     1       0      0      0   Terrorism
D2             1     1     1       0      1      0   Terrorism
D3             0     1     0       1      1      1   Entertainment
D4             0     0     1       1      1      0   Entertainment
D5             0     0     0       1      1      1   Entertainment

Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
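For practice, a minimal sketch with scikit-learn's BernoulliNB on made-up toy documents (not part of the original slides); CountVectorizer(binary=True) produces the presence/absence features this model expects:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

docs   = ('kill bomb kidnap tv', 'music movie tv')   # toy training documents
labels = [0, 1]                                      # 0 = Terrorism, 1 = Entertainment
vectorizer = CountVectorizer(binary=True)            # presence/absence, not counts
X = vectorizer.fit_transform(docs)
model = BernoulliNB().fit(X, labels)
print model.predict(vectorizer.transform(['bomb kill']))   # should print [0]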
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out we applied the multinomial model. In the model we used a simple bag of words representation. We can use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(x_i, y_i)}_{i=1}^{n}, where x_i ∈ R^p (input feature vector) and y_i ∈ {1, −1} is the corresponding label, whether (y_i = +1) or (y_i = −1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that:

⟨w, x_i⟩ + b > 0 for y_i = +1
⟨w, x_i⟩ + b < 0 for y_i = −1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the scatter regions of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.
11 http://www.statsoft.com/textbook/support-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine

[Figure: illustration of the separating hyperplane, maximum margin and support vectors.]

Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1,3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1,3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adopted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/ Follow the link for a more detailed discussion.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories "Positive" and "Negative". A confusion matrix for the classifier will look like:

                       Actual
                       Positive             Negative
Predicted  Positive    True Positive (TP)   False Positive (FP)
           Negative    False Negative (FN)  True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                       Actual
                       Positive  Negative  Total
Predicted  Positive    562       77        639
           Negative    225       436       661
           Total       787       513       1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
                       Actual
                       Positive  Negative  Total
Predicted  Positive    562       77        639
           Negative    225       436       661
           Total       787       513       1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong.

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                       Actual
                       Positive  Negative  Total
Predicted  Positive    562       77        639
           Negative    225       436       661
           Total       787       513       1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                       Actual
                       Positive  Negative  Total
Predicted  Positive    562       77        639
           Negative    225       436       661
           Total       787       513       1300

Fall-out = 77 / (77 + 436) = 0.15
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.

F1 Score = 2 × (precision × recall) / (precision + recall)

http://en.wikipedia.org/wiki/F1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78
F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
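A quick Python sketch that reproduces the numbers above from the confusion matrix (not part of the original slides):

from __future__ import division

tp, fp, fn, tn = 562, 77, 225, 436

accuracy  = (tp + tn) / (tp + fp + fn + tn)                # 0.76
precision = tp / (tp + fp)                                 # 0.87
recall    = tp / (tp + fn)                                 # 0.71
f1        = 2 * precision * recall / (precision + recall)  # 0.78
print accuracy, precision, recall, f1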
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / √( (TP+FP)(TP+FN)(TN+FP)(TN+FN) )

MCC = (562 × 436 − 77 × 225) / √( (562+77)(562+225)(436+77)(436+225) ) = 0.55
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis, and the true positive rate on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
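A short sketch with scikit-learn's roc_curve and auc, using made-up labels and scores purely for illustration (not part of the original slides):

from sklearn.metrics import roc_curve, auc

# illustrative values only: actual binary labels and classifier scores
y_true  = [1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print auc(fpr, tpr)   # area under the ROC curve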
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER), also known as entity identification and entity extraction, is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child \
of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the \
second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract/
Jaganadh G Elements of Text Mining
References
Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept. 2012.

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept. 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept. 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze, Introduction to Information Retrieval. Cambridge University Press, 2008.
Jaganadh G Elements of Text Mining
Word co-occurrence
Finding co-occurrence matrix with Python
def cooccurrence_matrix_corpus(corpus)
matrix = defaultdict(lambda defaultdict(int))
for corpora in corpus
for i in xrange(len(corpora)-1)
for j in xrange(i+1 len(corpora))
word1 word2 = [corpora[i]corpora[j]]
matrix[word1][word2] += 1
matrix[word2][word1] += 1
return matrix
corpus = [[rsquow1rsquorsquow2rsquorsquow3rsquo][rsquow4rsquorsquow5rsquorsquow6rsquo]]
ccm = cooccurrence_matrix_corpus(corpus)
Jaganadh G Elements of Text Mining
Word co-occurrence
hong
toy zdenek
czech
movie
julie
like
chan
story
tango
one
sverak
martial
woody
dating
film
first
Word co-occurrence visualization from 100 positive movie reviews The plot show top first word and top four
associated words For each associated words again associated words are plotted
Jaganadh G Elements of Text Mining
Stop Words
Stop Words
In computing stop words are words which are filtered out prior to or after processing ofnatural language data (text) In terms of linguistics these words are called as functionwords Words like rsquoarsquo rsquoanrsquo rsquothersquo are examples for stop words There is no defined set ofstop words available Different applications and research groups uses different sets o stopwordsGenerally stop words are omitted in text mining process The frequency of stop wordswill be very high in any corpus compared to content wordsPointers to some good stopword list is available at httpenwikipediaorgwikiStop_words
Jaganadh G Elements of Text Mining
Stop Words Filter
def stop_filter(words)
stops = [rsquoirsquo rsquomersquo rsquomyrsquo rsquomyselfrsquo rsquowersquo rsquoourrsquo rsquooursrsquo rsquoourselvesrsquo rsquoyoursquo
rsquoyourrsquo rsquoyoursrsquo rsquoyourselfrsquo rsquoyourselvesrsquo rsquohersquo rsquohimrsquo rsquohisrsquo rsquohimselfrsquo rsquoshersquo
rsquoherrsquo rsquohersrsquo rsquoherselfrsquo rsquoitrsquo rsquoitsrsquo rsquoitselfrsquorsquotheyrsquo rsquothemrsquo rsquotheirrsquo rsquotheirsrsquo
rsquothemselvesrsquo rsquowhatrsquo rsquowhichrsquo rsquowhorsquo rsquowhomrsquo rsquothisrsquo rsquothatrsquo rsquothesersquo rsquothosersquo
rsquoamrsquo rsquoisrsquo rsquoarersquo rsquowasrsquo rsquowerersquorsquobersquo rsquobeenrsquo rsquobeingrsquo rsquohaversquo rsquohasrsquo rsquohadrsquo
rsquohavingrsquo rsquodorsquo rsquodoesrsquo rsquodidrsquo rsquodoingrsquo rsquoarsquo rsquoanrsquo rsquothersquo rsquoandrsquo rsquobutrsquo rsquoifrsquo rsquoorrsquo rsquobecausersquo
rsquoasrsquo rsquountilrsquo rsquowhilersquo rsquoofrsquo rsquoatrsquo rsquobyrsquo rsquoforrsquorsquowithrsquo rsquoaboutrsquo rsquoagainstrsquo rsquobetweenrsquo rsquointorsquo
rsquothroughrsquo rsquoduringrsquo rsquobeforersquo rsquoafterrsquo rsquoaboversquo rsquobelowrsquo rsquotorsquo rsquofromrsquo rsquouprsquo rsquodownrsquo rsquoinrsquo rsquooutrsquo
rsquoonrsquo rsquooffrsquo rsquooverrsquo rsquounderrsquo rsquoagainrsquo rsquofurtherrsquorsquothenrsquo rsquooncersquo rsquoherersquo rsquotherersquo rsquowhenrsquo rsquowherersquo
rsquowhyrsquo rsquohowrsquo rsquoallrsquo rsquoanyrsquo rsquobothrsquo rsquoeachrsquo rsquofewrsquo rsquomorersquo rsquomostrsquo rsquootherrsquo rsquosomersquo rsquosuchrsquo rsquonorsquo
rsquonorrsquorsquonotrsquo rsquoonlyrsquo rsquoownrsquo rsquosamersquo rsquosorsquo rsquothanrsquo rsquotoorsquo rsquoveryrsquo rsquosrsquo rsquotrsquo rsquocanrsquo rsquowillrsquo rsquojustrsquo
rsquodonrsquo rsquoshouldrsquo rsquonowrsquo]
stopless = [word in words if word not in stops]
return stopless
Jaganadh G Elements of Text Mining
Bag of Words
The bag-of-words model is a simplifying representation used in natural languageprocessing and information retrieval (IR) In this model a text (such as a sentence or adocument) is represented as an un-ordered collection of words disregarding grammarand even word order a
Analyzing text by only analyzing frequency of words is called as bag of words model
ahttpenwikipediaorgwikiBag of words model
Jaganadh G Elements of Text Mining
Bag of Words
Example
d1 John likes to watch movies Mary likes too
d2 John also likes to watch football games
rsquofootballrsquo 0 rsquowatchrsquo 6 rsquomoviesrsquo 5 rsquogamesrsquo 1
rsquolikesrsquo 3 rsquojohnrsquo 2 rsquomaryrsquo 4
after removing stopwords
[0 0 1 2 1 1 1]
[1 1 1 1 0 0 1]
each entry of the vectors refers to count of
the corresponding entry in the dictionary
a
aExample taken from httpenwikipediaorgwikiBag of words model
Jaganadh G Elements of Text Mining
Bag of Words
Documents
d1 John likes to watch movies Mary likes too
d2 John also likes to watch football games
Vocabulary Index
V I(t) =
0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo
Jaganadh G Elements of Text Mining
Bag of Words
Documents
d1 John likes to watch movies Mary likes too
d2 John also likes to watch football games
Vocabulary Index
V I(t) =
0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo
Jaganadh G Elements of Text Mining
Bag of Words
football games john likes mary movies watch
doc1 0 0 1 2 1 1 1
doc2 1 1 1 1 0 0 1
Jaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python and sklearn 5
from sklearnfeature_extractiontext import
CountVectorizer
vectorizer = CountVectorizer(analyzer=rsquowordrsquo
min_n=1stop_words=rsquoenglishrsquo)
docs = (rsquoJohn likes to watch movies Mary
likes toorsquorsquoJohn also likes to watch football gamesrsquo)
bow = vectorizerfit_transform(docs)
print vectorizervocabulary_
print bowtoarray()
5httpscikit-learnorgJaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python Just for sample -(
def bag_of_words(docs)
stops = [rsquotorsquorsquotoorsquorsquoalsorsquo]
token_list = [tokenize(doc) for doc in docs]
vocab = list(set(token_list[0])union(token_list))
vocab =[v for v in vocab if v not in stops and len(v) gt 1]
vocab_idex = dict( [ ( word vocabindex(word) ) for word
in vocab] )
bow = [[tokenscount(word) for word in vocab_idexkeys()]
for tokens in token_list]
print vocab_idex
for bag in bow
print bag
d = (John likes to watch movies Mary likes too
John also likes to watch football games)
bag_of_words(d)
Jaganadh G Elements of Text Mining
TF-IDF
Tfndashidf term frequencyndashinverse document frequency is a numerical statistic which reflectshow important a word is to a document in a collection or corpus
TF-IDF
tf minus idf(t) = tf(t d)times idf(t)
where rsquotrsquo is a term in document rsquodrsquotf(t d) how many times the term rsquotrsquo is present in rsquodrsquo
tf(t d) =sumxisind
fr(x t)
where
fr(x t) =
1 if x = t0 otherwise
Jaganadh G Elements of Text Mining
TF-IDF
TF-IDF
andidf(t) = log |D|
1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space
Jaganadh G Elements of Text Mining
TF
TF
tf(t d) =sumxisind
fr(x t)
fr(x t) is a simple function
fr(x t) =
1 if x = t0 otherwise
Exampletf(primejohnprime d1) = 1
Jaganadh G Elements of Text Mining
Document Vector
To create a document vector space
V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))
To represent rsquod1rsquo and rsquod2rsquo as vectors
V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))
V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
which evaluates toV ~d1 = (0 0 1 2 1 1 1)
V ~d2 = (1 1 1 1 0 0 1)
Jaganadh G Elements of Text Mining
Vector Space Matrix
The document vectors can be represented as matrix
M|D|xF
where |D| is the cardinality of the document space
M|D|xF =
[0 0 1 2 1 1 11 1 1 1 0 0 1
]
Jaganadh G Elements of Text Mining
Vector Normalization
Normalized Vector
A normalized vector is represented as v = ~v~vp where v is is the unit vector or the
normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a
ahttpenwikipediaorgwikiLp space
Jaganadh G Elements of Text Mining
Vector Normalization
Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =
radicv2
1 + v22 + v2
2 + + v2n
With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p
It can be simplified as
~v||p = (
nsumi=1
|~vi|p)1p
6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining
Vector Normalization
L2 Norm
The norm which we apply here is L2 Norm which is also called as Euclidean norm a
It is a common norm used to measure the length of a vector where p = 2
ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm
Jaganadh G Elements of Text Mining
Vector Normalization
v~d1 = (0 0 1 2 1 1 1)vd1 = ~v
~vp
vd1 = v~d1
v~d12vd1 = (0012111)radic
02+02+12+22+12+12+12
vd1 = (0012111)radic8
vd1 = ( 0radic8 0radic
8 1radic
8 2radic
8 1radic
8 1radic
8 1radic
8)
vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10
Jaganadh G Elements of Text Mining
IDF
IDF
idf(t) = log |D|1+|dtisind|
where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0
Jaganadh G Elements of Text Mining
Finding IDF
idf(ti) = log|D|
1 + |d ti isin d|= log
2
1= 069314718
idf(football) = log 21+1
= 00
idf(games) = log 21+1
= 00
idf(john) = log 21+2
= minus040546510810816444
idf(likes) = log 21+2
= minus040546510810816444
idf(mary) = log 21+1
= 00
idf(movies) = log 21+1
= 00
idf(watch) = log 21+1
= 00idf(V ) = (00-040546510810816444-040546510810816444000000)
Jaganadh G Elements of Text Mining
TF-IDF weight
Finding TF-IDF weight
M|D|timesF times Midf
[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
]x
idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)
Jaganadh G Elements of Text Mining
TF-IDF weight
Finding TF-IDF weight
M|D|timesF times Midf
[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
]x
idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)
Jaganadh G Elements of Text Mining
TF-IDF weight
tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)
Jaganadh G Elements of Text Mining
TF-IDF Normalization
L2 Normalization
Mtfminusidf =MtfminusidfMtfminusidf2
Jaganadh G Elements of Text Mining
TF-IDF
Practice with Python and sklearn 7
from sklearnfeature_extractiontext import
CountVectorizer TfidfTransformer
vectorizer = CountVectorizer(analyzer=rsquowordrsquo
min_n=1stop_words=rsquoenglishrsquo)
docs = (rsquoJohn likes to watch movies Mary
likes toorsquorsquoJohn also likes to watch football gamesrsquo)
bow = vectorizerfit_transform(docs)
freq_term_matrix =vectorizertransform(docs)
tfidf = TfidfTransformer(norm=l2)
tfd = tfidffit(freq_term_matrix)
print IDF tfidfidf_
tf_idf_matrix = tfidftransform(freq_term_matrix)
print tf_idf_matrixtodense()
for wf in zip(vectorizervocabulary_tfdidf_)
print rsquor =gt rrsquo (w tfdidf_[f])
7httpscikit-learnorgJaganadh G Elements of Text Mining
N-Grams
N-Gram
In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa
Unigrams are single words
Bigrams are sequences of two words
Trigrams are sequences of three words
ahttpenwikipediaorgwikiN-gram
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from list of words
def ngrams(wordsn=2)
grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]
return grams
words = John likes to watch movies Mary likes too
lower()split()
bigrams = ngrams(wordsn=2)
trigrams = ngrams(wordsn=3)
print bigram
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math
def mutual_info(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
minfo =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
minfo[gram] = (mathlog( len(words) gramcount[ gram ]
wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))
mathlog( 2 )
return minfo
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math
def tscore(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
tsc =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
tsc[gram] = (gramcount[gram] - (1len(words))
wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])
mathsqrt( gramcount[gram])
return tsc
Jaganadh G Elements of Text Mining
Document Classification
Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8
Document classification tasks can be divided into three kinds
supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents
semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance
unsupervised document classification is entirely executed without reference toexternal information
8httpenwikipediaorgwikiDocument_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents
langd crang
wherelangd crangisin D times C using a learning
algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ
γ(ci dj) =
1 if dj belongs to ci0 otherwise
Main approaches in document classification are
Naıve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9
9Image taken from httpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)
P (E)
The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed
Jaganadh G Elements of Text Mining
Naïve Bayes Classification

Let $H$ be the event of raining and $E$ be the evidence of dark cloud; then we have
$P(raining|dark\ cloud) = \frac{P(dark\ cloud|raining) \times P(raining)}{P(dark\ cloud)}$
For multiple evidences:
$P(H|E_1, E_2, ..., E_n) = \frac{P(E_1, E_2, ..., E_n|H) \times P(H)}{P(E_1, E_2, ..., E_n)}$
With the independence assumption we can rewrite Bayes' rule as:
$P(H|E_1, E_2, ..., E_n) = \frac{P(E_1|H) \times P(E_2|H) \times ... \times P(E_n|H) \times P(H)}{P(E_1, E_2, ..., E_n)}$
Jaganadh G Elements of Text Mining
Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naïve Bayes Model

The Naïve Bayes model for the training set will be like:

|V|  C  P(Ci)  ni   P(kill|Ci)    P(bomb|Ci)   P(kidnap|Ci)  P(music|Ci)   P(movie|Ci)   P(tv|Ci)
 6   T  0.5    15   0.238095238   0.19047619   0.33333333    0.047619048   0.095238095   0.095238095
     E  0.5    12   0.055555556   0.11111111   0.11111111    0.33333333    0.27777778    0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class:
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class. Example:
P(kill|Terrorism) = (2 + 1 + 1)/15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0)/12 = 0/12
Jaganadh G Elements of Text Mining
Building Naïve Bayes Model

To avoid the "zero frequency" problem we apply Laplace estimation by assuming a uniform distribution over all words, as follows:
P(kill|Terrorism) = (2 + 1 + 1 + 1)/(15 + |V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0 + 0 + 0 + 1)/(12 + |V|) = 1/18 = 0.0555
Jaganadh G Elements of Text Mining
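To make the arithmetic concrete, here is a minimal sketch in Python (my own addition, not from the original tutorial); the per-class word counts follow the worked example above (n_terrorism = 15, n_entertainment = 12):

# Sketch only: class-wise word counts matching the worked example.
counts = {
    'Terrorism':     {'kill': 4, 'bomb': 3, 'kidnap': 6, 'music': 0, 'movie': 1, 'tv': 1},
    'Entertainment': {'kill': 0, 'bomb': 1, 'kidnap': 1, 'music': 5, 'movie': 4, 'tv': 1},
}

def cond_prob(word, cls):
    vocab_size = len(counts[cls])   # |V| = 6
    n = sum(counts[cls].values())   # n_c: total word frequency in the class
    return (counts[cls][word] + 1.0) / (n + vocab_size)  # Laplace (add-one) smoothing

print(cond_prob('kill', 'Terrorism'))      # 5/21 = 0.2380...
print(cond_prob('kill', 'Entertainment'))  # 1/18 = 0.0555...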
Testing the NB model
Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt         2     1     2       0      0      1   ?

To find the posterior probability:

$P(c_i|W) = P(c_i) \times \prod_{j=1}^{|V|} P(w_j|c_i)$
Jaganadh G Elements of Text Mining
Testing the NB model
P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)^2 × P(bomb|Terrorism)^1 × P(kidnap|Terrorism)^2 × P(music|Terrorism)^0 × P(movie|Terrorism)^0 × P(tv|Terrorism)^1
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)^2 × P(bomb|Entertainment)^1 × P(kidnap|Entertainment)^2 × P(music|Entertainment)^0 × P(movie|Entertainment)^0 × P(tv|Entertainment)^1
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document is classified as "Terrorism" because it got the highest value.
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called Underflow. To resolve the problem we can take the logarithm of the probabilities:
P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437
P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.5110 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
Jaganadh G Elements of Text Mining
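The same log-space trick in Python (my own sketch; cond_prob is the hypothetical helper from the earlier sketch, so the values differ only by rounding):

import math

def log_posterior(test_counts, cls, prior=0.5):
    score = math.log10(prior)
    for word, freq in test_counts.items():
        score += freq * math.log10(cond_prob(word, cls))  # words with freq 0 add nothing
    return score

test = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}
print(log_posterior(test, 'Terrorism'))      # approx -4.24
print(log_posterior(test, 'Entertainment'))  # approx -6.63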
Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary $V$, each dimension of the space $t$, $t \in \{1, ..., |V|\}$, corresponds to word $w_t$ from the vocabulary. Dimension $t$ of the vector for document $d_i$ is written $B_{it}$, and is either 0 or 1, indicating whether word $w_t$ occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in the document.a

a. A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we adopt the Multi-variate Bernoulli Model for text classification, our document space representation will be like:
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
Jaganadh G Elements of Text Mining
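As a quick illustration (my own sketch, not part of the original deck), scikit-learn's BernoulliNB consumes exactly this kind of presence/absence table; the rows below are D0-D5 from the table above:

from sklearn.naive_bayes import BernoulliNB

X = [[1, 1, 1, 0, 0, 1],   # D0, Terrorism
     [1, 1, 1, 0, 0, 0],   # D1, Terrorism
     [1, 1, 1, 0, 1, 0],   # D2, Terrorism
     [0, 1, 0, 1, 1, 1],   # D3, Entertainment
     [0, 0, 1, 1, 1, 0],   # D4, Entertainment
     [0, 0, 0, 1, 1, 1]]   # D5, Entertainment
y = ['Terrorism', 'Terrorism', 'Terrorism',
     'Entertainment', 'Entertainment', 'Entertainment']
clf = BernoulliNB(alpha=1.0)  # add-one (Laplace) smoothing
clf.fit(X, y)
print(clf.predict([[1, 1, 1, 0, 0, 1]]))  # expected: ['Terrorism']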
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary $V$. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out we applied the multinomial model. In that model we used a simple bag of words representation. We can also use a smoothed bag of words model like TF-IDF in the multinomial naive Bayes model.
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set $\{(x_i, y_i)\}_{i=1}^n$, where $x_i \in R^p$ (input feature vector) and $y_i \in \{1, -1\}$ is the corresponding label, $(y_i = +1)$ or $(y_i = -1)$. To start with we assume that our input feature vectors are linearly separable, that is, there exists a function $f(x) = \langle w, x \rangle + b$, where $w \in R^p$ (weight vector) and $b \in R$ (bias), such that

$\langle w, x_i \rangle + b > 0$ for $y_i = +1$
$\langle w, x_i \rangle + b < 0$ for $y_i = -1$

$\langle w, x \rangle + b = 0$ is the decision hyperplane. There can be multiple hyperplanes which separate the positive and negative examples, but they are not all equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that they are linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the scatter regions of the squares and the rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10This is just a non-theoretic definition, "just to get an idea only". For more refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.
11http://www.statsoft.com/textbook/support-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multinomial Naïve Bayes Classifier with Python sklearn:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build an SVM Classifier with Python sklearn:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK: 12
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                     Actual
                     Positive              Negative
Predicted  Positive  True Positive (TP)    False Positive (FP)
           Negative  False Negative (FN)   True Negative (TN)
Jaganadh G Elements of Text Mining
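The metrics on the following slides all derive from these four cells. A small sketch (my own addition) computing them for the worked counts used below (TP = 562, FP = 77, FN = 225, TN = 436):

import math

TP, FP, FN, TN = 562, 77, 225, 436

accuracy = (TP + TN) / float(TP + FP + FN + TN)       # ~0.76
precision = TP / float(TP + FP)                       # ~0.87 (positive class)
recall = TP / float(TP + FN)                          # ~0.71 (positive class)
f1 = 2 * precision * recall / (precision + recall)    # ~0.78
mcc = (TP * TN - FP * FN) / math.sqrt(
    float((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)))  # ~0.55
print(accuracy)
print(precision)
print(recall)
print(f1)
print(mcc)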
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = (TP + TN) / (TP + FP + FN + TN)

                     Actual
                     Positive  Negative  Total
Predicted  Positive  562       77        639
           Negative  225       436       661
Total                787       513       1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
                     Actual
                     Positive  Negative  Total
Predicted  Positive  562       77        639
           Negative  225       436       661
Total                787       513       1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong.

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                     Actual
                     Positive  Negative  Total
Predicted  Positive  562       77        639
           Negative  225       436       661
Total                787       513       1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                     Actual
                     Positive  Negative  Total
Predicted  Positive  562       77        639
           Negative  225       436       661
Total                787       513       1300

Fall-out = 77 / (77 + 436) = 0.15
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

F1 Score = 2 × (precision × recall) / (precision + recall)

a. http://en.wikipedia.org/wiki/F1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78
F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$

$MCC = \frac{562 \times 436 - 77 \times 225}{\sqrt{(562+77)(562+225)(436+77)(436+225)}} = 0.55$
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
Jaganadh G Elements of Text Mining
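For completeness, a short sketch (my own, with made-up labels and scores) of how an ROC curve and its AUC are usually computed with scikit-learn:

from sklearn.metrics import roc_curve, auc

y_true = [1, 1, 0, 1, 0, 0, 1, 0]                    # hypothetical gold labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]  # hypothetical classifier scores
fpr, tpr, thresholds = roc_curve(y_true, y_score)    # FPR on X axis, TPR on Y axis
print(auc(fpr, tpr))                                 # area under the ROC curve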
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13http://pypi.python.org/pypi/topia.termextract/
Jaganadh G Elements of Text Mining
References
Chris Smith, Machine Learning, Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith, Machine Learning, Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze, Introduction to Information Retrieval. Cambridge University Press, 2008.
Jaganadh G Elements of Text Mining
Word co-occurrence
[Figure omitted: word co-occurrence graph.] Word co-occurrence visualization from 100 positive movie reviews. The plot shows the top first word and the top four associated words; for each associated word, its associated words are plotted in turn.
Jaganadh G Elements of Text Mining
Stop Words
Stop Words
In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In terms of linguistics these words are called function words. Words like 'a', 'an', and 'the' are examples of stop words. There is no single defined set of stop words; different applications and research groups use different sets of stop words. Generally stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus compared to content words. Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words
Jaganadh G Elements of Text Mining
Stop Words Filter
def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
    'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
    'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
    'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
    'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
    'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
    'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
    'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
    'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
    'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
    'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
    'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless
Jaganadh G Elements of Text Mining
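A quick usage example (my own addition; tokenize is the function from the Tokenization slide):

words = tokenize("John likes to watch movies Mary likes too")
print(stop_filter(words))  # ['john', 'likes', 'watch', 'movies', 'mary', 'likes']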
Bag of Words
The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar and even word order.a

Analyzing text by only analyzing the frequency of words is called the bag of words model.

a. http://en.wikipedia.org/wiki/Bag_of_words_model
Jaganadh G Elements of Text Mining
Bag of Words
Example
d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1,
 'likes': 3, 'john': 2, 'mary': 4}
# after removing stopwords

[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]
# each entry of the vectors refers to the count of
# the corresponding entry in the dictionary a

a. Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model
Jaganadh G Elements of Text Mining
Bag of Words
Documents
d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

Vocabulary Index

VI(t) = 0 if t is 'football'; 1 if t is 'games'; 2 if t is 'john'; 3 if t is 'likes'; 4 if t is 'mary'; 5 if t is 'movies'; 6 if t is 'watch'
Jaganadh G Elements of Text Mining
Bag of Words
football games john likes mary movies watch
doc1 0 0 1 2 1 1 1
doc2 1 1 1 1 0 0 1
Jaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python and sklearn 5
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
    'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5http://scikit-learn.org
Jaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python. Just for a sample :-)

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]  # tokenize from the Tokenization slide
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
        for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies Mary likes too",
    "John also likes to watch football games")
bag_of_words(d)
Jaganadh G Elements of Text Mining
TF-IDF
Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.
TF-IDF
$tf\text{-}idf(t) = tf(t, d) \times idf(t)$

where $t$ is a term in document $d$;
$tf(t, d)$: how many times the term $t$ is present in $d$:

$tf(t, d) = \sum_{x \in d} fr(x, t)$

where

$fr(x, t) = \begin{cases} 1 & \text{if } x = t \\ 0 & \text{otherwise} \end{cases}$
Jaganadh G Elements of Text Mining
TF-IDF
TF-IDF
and $idf(t) = \log \frac{|D|}{1 + |\{d : t \in d\}|}$
where $|\{d : t \in d\}|$ is the number of documents in which $t$ is present (i.e. $tf(t, d) \neq 0$), and $|D|$ is the cardinality of the document space.
Jaganadh G Elements of Text Mining
TF
TF
$tf(t, d) = \sum_{x \in d} fr(x, t)$

$fr(x, t)$ is a simple function:

$fr(x, t) = \begin{cases} 1 & \text{if } x = t \\ 0 & \text{otherwise} \end{cases}$

Example: $tf('john', d_1) = 1$
Jaganadh G Elements of Text Mining
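A direct transcription of fr and tf in Python (my own sketch, not from the deck):

def fr(x, t):
    return 1 if x == t else 0

def tf(t, d):
    return sum(fr(x, t) for x in d)

d1 = "john likes to watch movies mary likes too".split()
print(tf('john', d1))   # 1
print(tf('likes', d1))  # 2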
Document Vector
To create a document vector space:

$\vec{v}(d_n) = (tf(t_1, d_n), tf(t_2, d_n), ..., tf(t_n, d_n))$

To represent $d_1$ and $d_2$ as vectors:

$\vec{v}(d_1) = (tf(t_1, d_1), tf(t_2, d_1), tf(t_3, d_1), tf(t_4, d_1), tf(t_5, d_1), tf(t_6, d_1), tf(t_7, d_1))$
$\vec{v}(d_2) = (tf(t_1, d_2), tf(t_2, d_2), tf(t_3, d_2), tf(t_4, d_2), tf(t_5, d_2), tf(t_6, d_2), tf(t_7, d_2))$

which evaluates to:
$\vec{v}(d_1) = (0, 0, 1, 2, 1, 1, 1)$
$\vec{v}(d_2) = (1, 1, 1, 1, 0, 0, 1)$
Jaganadh G Elements of Text Mining
Vector Space Matrix
The document vectors can be represented as a matrix $M_{|D| \times F}$, where $|D|$ is the cardinality of the document space:

$M_{|D| \times F} = \begin{bmatrix} 0 & 0 & 1 & 2 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 0 & 0 & 1 \end{bmatrix}$
Jaganadh G Elements of Text Mining
Vector Normalization
Normalized Vector
A normalized vector is represented as $\hat{v} = \frac{\vec{v}}{\|\vec{v}\|_p}$, where $\hat{v}$ is the unit (normalized) vector, $\vec{v}$ is the vector to be normalized, and $\|\vec{v}\|_p$ is the norm (magnitude or length) of the vector $\vec{v}$ in the Lp space (Lebesgue spaces).a

a. http://en.wikipedia.org/wiki/Lp_space
Jaganadh G Elements of Text Mining
Vector Normalization
The length of a vector is calculated using the Euclidean norm.6
Non-normalized vector: $\vec{v} = (v_1, v_2, v_3, ..., v_n)$
Length of vector: $\|\vec{v}\| = \sqrt{v_1^2 + v_2^2 + v_3^2 + ... + v_n^2}$
With the p-norm: $\|\vec{v}\|_p = (|v_1|^p + |v_2|^p + |v_3|^p + ... + |v_n|^p)^{1/p}$
It can be simplified as:

$\|\vec{v}\|_p = \left( \sum_{i=1}^{n} |v_i|^p \right)^{1/p}$

6http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining
Vector Normalization
L2 Norm
The norm which we apply here is the L2 norm, also called the Euclidean norm.a
It is a common norm used to measure the length of a vector, where p = 2.

a. http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm
Jaganadh G Elements of Text Mining
Vector Normalization
$\vec{v}(d_1) = (0, 0, 1, 2, 1, 1, 1)$
$\hat{v}(d_1) = \frac{\vec{v}(d_1)}{\|\vec{v}(d_1)\|_2}$
$= \frac{(0, 0, 1, 2, 1, 1, 1)}{\sqrt{0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2}}$
$= \frac{(0, 0, 1, 2, 1, 1, 1)}{\sqrt{8}}$
$= (\frac{0}{\sqrt{8}}, \frac{0}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{2}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}})$
$= (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)$

Now our normalized vector $\hat{v}(d_1)$ has an L2 norm of $\|\hat{v}(d_1)\|_2 = 1.0$
Jaganadh G Elements of Text Mining
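The arithmetic above can be checked in a couple of lines (my own sketch; assumes numpy):

import numpy as np

v = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_hat = v / np.linalg.norm(v, ord=2)   # divide by the L2 norm, sqrt(8)
print(v_hat)                  # [0. 0. 0.3535 0.7071 0.3535 0.3535 0.3535]
print(np.linalg.norm(v_hat))  # 1.0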
IDF
IDF
$idf(t) = \log \frac{|D|}{1 + |\{d : t \in d\}|}$
where $|\{d : t \in d\}|$ is the number of documents in which $t$ is present (i.e. $tf(t, d) \neq 0$).
Jaganadh G Elements of Text Mining
Finding IDF
$idf(t_i) = \log \frac{|D|}{1 + |\{d : t_i \in d\}|}$, e.g. $\log \frac{2}{1} = 0.69314718$

idf(football) = log(2/(1+1)) = 0.0
idf(games) = log(2/(1+1)) = 0.0
idf(john) = log(2/(1+2)) = -0.40546510810816444
idf(likes) = log(2/(1+2)) = -0.40546510810816444
idf(mary) = log(2/(1+1)) = 0.0
idf(movies) = log(2/(1+1)) = 0.0
idf(watch) = log(2/(1+1)) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
Jaganadh G Elements of Text Mining
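The same idf values can be reproduced in plain Python (my own sketch, not from the deck):

import math

docs = [['john', 'likes', 'watch', 'movies', 'mary', 'likes'],
        ['john', 'also', 'likes', 'watch', 'football', 'games']]

def idf(term):
    df = sum(1 for doc in docs if term in doc)  # document frequency of the term
    return math.log(len(docs) / (1.0 + df))

print(idf('john'))   # log(2/3) = -0.4054...
print(idf('games'))  # log(2/2) = 0.0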
TF-IDF weight
Finding the TF-IDF weight: $M_{|D| \times F} \times M_{idf}$

$\begin{bmatrix} tf(t_1, d_1) & tf(t_2, d_1) & \cdots & tf(t_7, d_1) \\ tf(t_1, d_2) & tf(t_2, d_2) & \cdots & tf(t_7, d_2) \end{bmatrix} \times \begin{bmatrix} idf(t_1) & 0 & \cdots & 0 \\ 0 & idf(t_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & idf(t_7) \end{bmatrix}$
Jaganadh G Elements of Text Mining
TF-IDF weight
$= \begin{bmatrix} tf(t_1, d_1) \times idf(t_1) & tf(t_2, d_1) \times idf(t_2) & \cdots & tf(t_7, d_1) \times idf(t_7) \\ tf(t_1, d_2) \times idf(t_1) & tf(t_2, d_2) \times idf(t_2) & \cdots & tf(t_7, d_2) \times idf(t_7) \end{bmatrix}$
Jaganadh G Elements of Text Mining
TF-IDF Normalization
L2 Normalization
$M_{tf\text{-}idf} = \frac{M_{tf\text{-}idf}}{\|M_{tf\text{-}idf}\|_2}$
Jaganadh G Elements of Text Mining
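Putting the last two slides together (my own sketch; assumes numpy): multiply the term-frequency matrix by the diagonal idf matrix, then L2-normalize each row:

import numpy as np

M = np.array([[0, 0, 1, 2, 1, 1, 1],
              [1, 1, 1, 1, 0, 0, 1]], dtype=float)   # term frequencies
idf = np.array([0.0, 0.0, -0.40546511, -0.40546511, 0.0, 0.0, 0.0])
tfidf = M.dot(np.diag(idf))                          # M_{|D|xF} x M_idf
norms = np.linalg.norm(tfidf, axis=1, keepdims=True)
norms[norms == 0] = 1.0                              # avoid division by zero
print(tfidf / norms)                                 # L2-normalized tf-idf weights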
TF-IDF
Practice with Python and sklearn 7
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
    'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for word, idx in vectorizer.vocabulary_.items():
    print '%r => %r' % (word, tfd.idf_[idx])

7http://scikit-learn.org
Jaganadh G Elements of Text Mining
N-Grams
N-Gram
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters; however, the items in question can be phonemes, syllables, letters, words or base pairs according to the application.a
Unigrams are single words
Bigrams are sequences of two words
Trigrams are sequences of three words
a. http://en.wikipedia.org/wiki/N-gram
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
$P(w_i | w_1, w_2, ..., w_{i-1}) \approx P(w_i | w_{i-1})$

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words)-2+1)]
bigrams = [' '.join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
'movies mary', 'mary likes', 'likes too']
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
$P(w_i | w_1, w_2, ..., w_{i-1}) \approx P(w_i | w_{i-2}, w_{i-1})$

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [' '.join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
'watch movies mary', 'movies mary likes', 'mary likes too']
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from list of words
def ngrams(words, n=2):
    grams = [' '.join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
$I(w_i, w_j) = \log_2 \frac{P(w_i, w_j)}{P(w_i) P(w_j)} \approx \log_2 \frac{N \, C(w_i, w_j)}{C(w_i) \, C(w_j)}$

where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus, $C(w_i, w_j)$ is the frequency of the bigram $w_i w_j$, and $N$ is the total number of words in the corpus.

$I(strong, tea) = \log_2 \frac{P(strong, tea)}{P(strong) P(tea)} \approx \log_2 \frac{N \, C(strong, tea)}{C(strong) \, C(tea)}$
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) /
            math.log(2))
    return minfo
Jaganadh G Elements of Text Mining
Document Classification
Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8
Document classification tasks can be divided into three kinds
supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents
semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance
unsupervised document classification is entirely executed without reference toexternal information
8httpenwikipediaorgwikiDocument_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents
langd crang
wherelangd crangisin D times C using a learning
algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ
γ(ci dj) =
1 if dj belongs to ci0 otherwise
Main approaches in document classification are
Naıve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9
9Image taken from httpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)
P (E)
The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)
P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)
P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)
P (E1E2En)
Jaganadh G Elements of Text Mining
Naıve Bayes Applied in Text Classification
Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
Naıve Bayes Model for the training set will be like [4pt]
|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095
E 05 12 005555566 011111111 011111111 033333333 027777778 011111111
|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class
number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)
15 = 415
P (kill|Entertainment) = (0+0+0)12 = 0
12
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)
(15+|V |) = 521 = 02380
P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1
18 = 00555
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document isTest Docs kill bomb kidnap music movie tv C
Dt 2 1 2 0 0 1 To find the posterior probability
P (ci|W ) = P (ci)timesVprodj=1
P (wj |ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)
= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521
= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5
P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)
= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111
= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7
The document has classified as rdquoTerrorismrdquo because it got the highest value
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289
After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely
The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf
Jaganadh G Elements of Text Mining
Naıve Bayes Classifier
There are two different ways to setup a Naıve Bayes Classifier
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta
aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by presence or absenceof the word
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument
In the NB example which we worked out we applied the multinomial model In the model we used simple bag of
words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that
〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1
〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way
If there is a clear blank space available in the table in between the square and rectangles we can say that
it is linearly separable A line which drawn in the clear space between the figuresexactly equal length
from the margin of scatter region of square and rectangle is called as separating hyperplane Everything
on the one side of the separating hyper plane belongs to one category and everything in the other side
belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter
(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin
The figures closest to the separating hyperplane are known as support vectors10 If the data is not
linearly separable we have to use kernel tricks 11
10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography
11httpwwwstatsoftcomtextbooksupport-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnnaive_bayes import MultinomialNB
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoMultinomialNB())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a SVM Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnsvm import LinearSVC
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoLinearSVC())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12
import nltkclassifyutil
from nltkclassify import NaiveBayesClassifier
from nltkcorpus import movie_reviews
def word_feats(words)
return dict([(word True) for word in words])
negids = movie_reviewsfileids(rsquonegrsquo)
posids = movie_reviewsfileids(rsquoposrsquo)
negfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquonegrsquo) for f in negids]
posfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquoposrsquo) for f in posids]
negcutoff = len(negfeats)34
poscutoff = len(posfeats)34
trainfeats = negfeats[negcutoff] + posfeats[poscutoff]
classifier = NaiveBayesClassifiertrain(trainfeats)
sent = This is really cool I like it
words = word_feats(sentlower()split())
print classifierclassify(words)
12This code is adopted from httpstreamhackercom20100510
text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm
Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)
MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)
= 055
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the \
second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13: http://pypi.python.org/pypi/topia.termextract
References
Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept 2012.

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php
References
Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.
Stop Words
In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In terms of linguistics these words are called function words. Words like 'a', 'an', and 'the' are examples of stop words. There is no single defined set of stop words; different applications and research groups use different sets of stop words.
Generally stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus compared to content words. Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words
Stop Words Filter
def stop_filter(words):
    # A typical English stop word list; filter it out of the token stream
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
    'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
    'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
    'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
    'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
    'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
    'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
    'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
    'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
    'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
    'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
    'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless
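Using the tokenize function defined earlier, a quick check of the filter:

words = tokenize("this is a test of the stop word filter")
print stop_filter(words)
# ['test', 'stop', 'word', 'filter']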
Bag of Words
The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an un-ordered collection of words, disregarding grammar and even word order.[a]
Analyzing text by only analyzing the frequency of words is called the bag of words model.

a: http://en.wikipedia.org/wiki/Bag_of_words_model
Bag of Words
Example

d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

{'football': 0, 'games': 1, 'john': 2, 'likes': 3,
 'mary': 4, 'movies': 5, 'watch': 6}

(after removing stop words)

[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of the corresponding entry in the dictionary.[a]

a: Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model
Bag of Words
Documents

d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

Vocabulary Index

VI(t) = 0 if t is 'football'
        1 if t is 'games'
        2 if t is 'john'
        3 if t is 'likes'
        4 if t is 'mary'
        5 if t is 'movies'
        6 if t is 'watch'
Bag of Words
       football  games  john  likes  mary  movies  watch
doc1   0         0      1     2      1     1       1
doc2   1         1      1     1      0     0       1
Bag of Words
Creating a Bag of Words with Python and sklearn [5]

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5: http://scikit-learn.org
Bag of Words
Creating a Bag of Words with plain Python (just a toy sample)

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    # Vocabulary = union of all document tokens, minus stop words
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
        in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
        for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies Mary likes too",
    "John also likes to watch football games")
bag_of_words(d)
TF-IDF
Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t, d) = tf(t, d) × idf(t)

where 't' is a term in document 'd', and tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = Σ_{x ∈ d} fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise
TF-IDF
TF-IDF

idf(t) = log(|D| / (1 + |{d : t ∈ d}|))

where |{d : t ∈ d}| is the number of documents in which 't' is present, i.e. tf(t, d) ≠ 0, and |D| is the cardinality of the document space.
TF
TF

tf(t, d) = Σ_{x ∈ d} fr(x, t)

fr(x, t) is a simple indicator function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1
Document Vector
To create a document vector space:

v(dn) = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))

To represent 'd1' and 'd2' as vectors:

v(d1) = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))
v(d2) = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))

which evaluates to:

v(d1) = (0, 0, 1, 2, 1, 1, 1)
v(d2) = (1, 1, 1, 1, 0, 0, 1)
Vector Space Matrix
The document vectors can be represented as a matrix M_{|D|×F}, where |D| is the cardinality of the document space:

M_{|D|×F} = [ 0 0 1 2 1 1 1 ]
            [ 1 1 1 1 0 0 1 ]
Vector Normalization
Normalized Vector

A normalized vector is represented as v̂ = v / ||v||_p, where v̂ is the unit vector (the normalized vector), v is the vector to be normalized, and ||v||_p is the norm (magnitude or length) of the vector v in the Lp space (Lebesgue spaces).[a]

a: http://en.wikipedia.org/wiki/Lp_space
Vector Normalization
The length of a vector is calculated using the Euclidean norm.[6]
Non-normalized vector: v = (v1, v2, v3, ..., vn)
Length of vector: ||v|| = √(v1^2 + v2^2 + v3^2 + ... + vn^2)

With the p-norm:

||v||_p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)

It can be simplified as:

||v||_p = (Σ_{i=1}^{n} |vi|^p)^(1/p)

6: http://mathworld.wolfram.com/L2-Norm.html
Vector Normalization
L2 Norm

The norm which we apply here is the L2 norm, also called the Euclidean norm.[a] It is a common norm used to measure the length of a vector, with p = 2.

a: http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm
Vector Normalization
v(d1) = (0, 0, 1, 2, 1, 1, 1)

v̂(d1) = v(d1) / ||v(d1)||_2

v̂(d1) = (0, 0, 1, 2, 1, 1, 1) / √(0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2)

v̂(d1) = (0, 0, 1, 2, 1, 1, 1) / √8

v̂(d1) = (0/√8, 0/√8, 1/√8, 2/√8, 1/√8, 1/√8, 1/√8)

v̂(d1) = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Our normalized vector v̂(d1) now has an L2 norm of ||v̂(d1)||_2 = 1.0.
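The same arithmetic can be checked with a few lines of Python (a minimal sketch, standard library only):

from __future__ import division
import math

v = [0, 0, 1, 2, 1, 1, 1]
l2 = math.sqrt(sum(x * x for x in v))   # Euclidean (L2) norm = sqrt(8)
normalized = [x / l2 for x in v]
print normalized
print math.sqrt(sum(x * x for x in normalized))   # 1.0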
IDF
IDF

idf(t) = log(|D| / (1 + |{d : t ∈ d}|))

where |{d : t ∈ d}| is the number of documents in which 't' is present, i.e. tf(t, d) ≠ 0.
Finding IDF
idf(ti) = log(|D| / (1 + |{d : ti ∈ d}|)); for example, log(2/1) = 0.69314718 for a term present in no document.

idf(football) = log(2 / (1 + 1)) = 0.0
idf(games)    = log(2 / (1 + 1)) = 0.0
idf(john)     = log(2 / (1 + 2)) = -0.40546510810816444
idf(likes)    = log(2 / (1 + 2)) = -0.40546510810816444
idf(mary)     = log(2 / (1 + 1)) = 0.0
idf(movies)   = log(2 / (1 + 1)) = 0.0
idf(watch)    = log(2 / (1 + 1)) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
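The idf values above can be verified with a short sketch over the two tokenized, stop-word-free toy documents:

from __future__ import division
import math

docs = [['john', 'likes', 'watch', 'movies', 'mary', 'likes'],
        ['john', 'likes', 'watch', 'football', 'games']]
vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']

for term in vocab:
    df = sum(1 for doc in docs if term in doc)   # document frequency
    idf = math.log(len(docs) / (1 + df))
    print '%s => %f' % (term, idf)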
TF-IDF weight
Finding TF-IDF weight

M_{|D|×F} × M_idf =

[ tf(t1,d1)  tf(t2,d1)  tf(t3,d1)  tf(t4,d1)  tf(t5,d1)  tf(t6,d1)  tf(t7,d1) ]
[ tf(t1,d2)  tf(t2,d2)  tf(t3,d2)  tf(t4,d2)  tf(t5,d2)  tf(t6,d2)  tf(t7,d2) ]

×

[ idf(t1)  0        0        0        0        0        0       ]
[ 0        idf(t2)  0        0        0        0        0       ]
[ 0        0        idf(t3)  0        0        0        0       ]
[ 0        0        0        idf(t4)  0        0        0       ]
[ 0        0        0        0        idf(t5)  0        0       ]
[ 0        0        0        0        0        idf(t6)  0       ]
[ 0        0        0        0        0        0        idf(t7) ]
TF-IDF weight
= [ tf(t1,d1)×idf(t1)  tf(t2,d1)×idf(t2)  tf(t3,d1)×idf(t3)  tf(t4,d1)×idf(t4)  tf(t5,d1)×idf(t5)  tf(t6,d1)×idf(t6)  tf(t7,d1)×idf(t7) ]
  [ tf(t1,d2)×idf(t1)  tf(t2,d2)×idf(t2)  tf(t3,d2)×idf(t3)  tf(t4,d2)×idf(t4)  tf(t5,d2)×idf(t5)  tf(t6,d2)×idf(t6)  tf(t7,d2)×idf(t7) ]
TF-IDF Normalization
L2 Normalization

M_tf-idf = M_tf-idf / ||M_tf-idf||_2
TF-IDF
Practice with Python and sklearn [7]

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for word, index in vectorizer.vocabulary_.items():
    print '%r => %r' % (word, tfd.idf_[index])

7: http://scikit-learn.org
N-Grams
N-Gram
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application.[a]

Unigrams are single words.
Bigrams are sequences of two words.
Trigrams are sequences of three words.

a: http://en.wikipedia.org/wiki/N-gram
Bigrams
Bigrams

P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-1)

Practice with Python:

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [' '.join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']
Trigrams
Trigrams

P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-2, wi-1)

Practice with Python:

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [' '.join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']
N-Grams
Python code to generate N-Grams from a list of words:

def ngrams(words, n=2):
    grams = [' '.join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams
Mutual Information
Mutual Information

A statistical test to measure the strength of word association:

I(wi, wj) = log2(P(wi, wj) / (P(wi) P(wj))) ≈ log2(N × C(wi, wj) / (C(wi) × C(wj)))

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

I(strong, tea) = log2(P(strong, tea) / (P(strong) P(tea))) ≈ log2(N × C(strong, tea) / (C(strong) × C(tea)))
Mutual Information
from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    # Count unigram and bigram frequencies
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) \
            / math.log(2)
    return minfo
t-score
t-score

A statistical test to measure the strength of word association:

t(wi, wj) = (mean(P(wi, wj)) - mean(P(wi)) mean(P(wj))) / √(σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)))
          ≈ (C(wi, wj) - (1/N) C(wi) C(wj)) / √(C(wi, wj))

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

t(strong, tea) = (C(strong, tea) - (1/N) C(strong) C(tea)) / √(C(strong, tea))
t-score
from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    # Count unigram and bigram frequencies
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc
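A quick run over the toy sentence ties the two association measures together (reusing the ngrams, mutual_info and tscore functions from the slides above):

words = "John likes to watch movies Mary likes too".lower().split()
print mutual_info(words)
print tscore(words)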
Document Classification
Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.[8]

Document classification tasks can be divided into three kinds:

supervised document classification: performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents
semi-supervised document classification: a mixture between supervised and unsupervised classification; some documents or parts of documents are labeled by external assistance
unsupervised document classification: entirely executed without reference to external information

8: http://en.wikipedia.org/wiki/Document_classification
Document Classification
Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (or classifier function) γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

The main approaches in document classification are:

Naïve Bayes (NB)
Support Vector Machines (SVM)
Document Classification
A supervised document classification pipeline [9] (figure omitted in this text version)

9: Image taken from http://www.python-course.eu/text_classification_introduction.php
Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule:

P(H|E) = P(E|H) × P(H) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.
Naïve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have:

P(raining | dark cloud) = P(dark cloud | raining) × P(raining) / P(dark cloud)

For multiple evidences:

P(H | E1, E2, ..., En) = P(E1, E2, ..., En | H) × P(H) / P(E1, E2, ..., En)

With the independence assumption we can rewrite Bayes' rule as follows:

P(H | E1, E2, ..., En) = P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) / P(E1, E2, ..., En)
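To make the rule concrete, a toy calculation; all three input probabilities below are assumed values for illustration only:

from __future__ import division

p_raining = 0.2               # P(H): prior probability of rain (assumed)
p_cloud_given_rain = 0.9      # P(E|H) (assumed)
p_cloud = 0.4                 # P(E): probability of dark cloud (assumed)

p_rain_given_cloud = p_cloud_given_rain * p_raining / p_cloud
print p_rain_given_cloud      # 0.45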
Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.

C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv)  (vocabulary)

The pre-processed documents for training will look like:

Training Docs   kill  bomb  kidnap  music  movie  tv  C
D0              2     1     3       0      0      1   Terrorism
D1              1     1     1       0      0      0   Terrorism
D2              1     1     2       0      1      0   Terrorism
D3              0     1     0       2      1      1   Entertainment
D4              0     0     1       1      1      0   Entertainment
D5              0     0     0       2      2      0   Entertainment
Building the Naïve Bayes Model

The Naïve Bayes model for the training set will be:

|V|  C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
6    T  0.5    15  0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
     E  0.5    12  0.055555556  0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class:
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class.
Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12
Building the Naïve Bayes Model

To avoid the "zero frequency" problem we apply Laplace estimation, by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555
Testing the NB model
Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt         2     1     2       0      0      1   ?

To find the posterior probability:

P(ci|W) = P(ci) × Π_{j=1..V} P(wj|ci)
Testing the NB model
P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)^2 × P(bomb|Terrorism)^1 × P(kidnap|Terrorism)^2 × P(music|Terrorism)^0 × P(movie|Terrorism)^0 × P(tv|Terrorism)^1
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)^2 × P(bomb|Entertainment)^1 × P(kidnap|Entertainment)^2 × P(music|Entertainment)^0 × P(movie|Entertainment)^0 × P(tv|Entertainment)^1
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document is classified as "Terrorism" because that class got the highest posterior value.
Preventing Underflow
The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.5110 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
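The whole worked example fits in a short script; the following is a sketch (not from the original slides) that rebuilds the model from the training counts and scores the test document in log space:

from __future__ import division
import math

# Word-count vectors in the order (kill, bomb, kidnap, music, movie, tv)
train = {'Terrorism': [[2, 1, 3, 0, 0, 1], [1, 1, 1, 0, 0, 0], [1, 1, 2, 0, 1, 0]],
         'Entertainment': [[0, 1, 0, 2, 1, 1], [0, 0, 1, 1, 1, 0], [0, 0, 0, 2, 2, 0]]}
test = [2, 1, 2, 0, 0, 1]
V = 6
total_docs = sum(len(docs) for docs in train.values())

for label, docs in train.items():
    prior = len(docs) / total_docs
    n = sum(sum(doc) for doc in docs)        # total word frequency in the class
    score = math.log10(prior)
    for j, count in enumerate(test):
        word_freq = sum(doc[j] for doc in docs)
        p = (word_freq + 1) / (n + V)        # Laplace smoothing
        score += count * math.log10(p)
    print '%s => %f' % (label, score)        # approx. -4.24 vs approx. -6.63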
Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes classifier:

Multi-variate Bernoulli Model
Multinomial Model
Multi-variate Bernoulli Model
Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in a document.[a]

a: Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Multi-variate Bernoulli Model
If we adopt the multi-variate Bernoulli model for text classification, our document space representation will be:

Training Docs   kill  bomb  kidnap  music  movie  tv  C
D0              1     1     1       0      0      1   Terrorism
D1              1     1     1       0      0      0   Terrorism
D2              1     1     1       0      1      0   Terrorism
D3              0     1     0       1      1      1   Entertainment
D4              0     0     1       1      1      0   Entertainment
D5              0     0     0       1      1      0   Entertainment

Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
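scikit-learn ships this event model as BernoulliNB; a minimal sketch on the binarized toy matrix above:

from sklearn.naive_bayes import BernoulliNB

X = [[1, 1, 1, 0, 0, 1],   # D0  (kill, bomb, kidnap, music, movie, tv)
     [1, 1, 1, 0, 0, 0],   # D1
     [1, 1, 1, 0, 1, 0],   # D2
     [0, 1, 0, 1, 1, 1],   # D3
     [0, 0, 1, 1, 1, 0],   # D4
     [0, 0, 0, 1, 1, 0]]   # D5
y = ['Terrorism'] * 3 + ['Entertainment'] * 3

model = BernoulliNB().fit(X, y)
print model.predict([[1, 1, 1, 0, 0, 1]])   # a 'terrorism-like' test document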
Multinomial Model
In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag of words representation. We can also use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}_{i=1}^{n}, where xi ∈ R^p (input feature vector) and yi ∈ {1, -1} is the corresponding label, whether (yi = +1) or (yi = -1). To start with, we assume that our input feature vectors are linearly separable; that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that:

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = -1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
Support Vector Machine
Suppose a bunch of red squares and blue rectangles are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.
If there is a clear blank space available on the table between the squares and rectangles, we can say that they are linearly separable. A line drawn in the clear space between the figures, at exactly equal length from the margin of the scatter region of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.[10] If the data is not linearly separable we have to use kernel tricks.[11]

10: This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and the bibliography.
11: http://www.statsoft.com/textbook/support-vector-machines
Support Vector Machine

(Figure: separating hyperplane, maximum margin and support vectors; image omitted in this text version.)
Practice Time
Let's try to build a Multinomial Naïve Bayes classifier with Python and sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Practice Time
Let's try to build an SVM classifier with Python and sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Practice Time
Let's try to build a multi-variate Naïve Bayes classifier with Python and NLTK.[12]

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12: This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/. Follow the link for a more detailed discussion.
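The held-out portion of the corpus can then score the model with nltk.classify.util.accuracy; a small continuation of the script above:

# Evaluate on the held-out 25% of each class
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)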
Evaluating Performance of a Classifier
Confusion Matrix
A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                      Actual
                      Positive             Negative
Predicted  Positive   True Positive (TP)   False Positive (FP)
           Negative   False Negative (FN)  True Negative (TN)
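scikit-learn can compute this table directly from label vectors; a small sketch with made-up labels:

from sklearn.metrics import confusion_matrix

y_actual = [1, 1, 0, 1, 0, 0, 1, 0]      # hypothetical gold labels
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0]   # hypothetical classifier output
# Note: sklearn puts actual classes on rows and predicted on columns,
# i.e. transposed relative to the table above
print confusion_matrix(y_actual, y_predicted)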
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = (TP + TN) / (TP + FP + FN + TN)

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
Evaluating Performance of a Classifier
Precision and Recall
Precision indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)
Evaluating Performance of a Classifier
                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong:

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23
Evaluating Performance of a Classifier
Fall-out
Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR):

Fall-out = FP / (FP + TN)

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

Fall-out = 77 / (77 + 436) = 0.15
Evaluating Performance of a Classifier
F1 Score
In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.[a]

F1 Score = 2 × (precision × recall) / (precision + recall)

a: http://en.wikipedia.org/wiki/F1_score
Evaluating Performance of a Classifier
F1 Score (Positive) = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78
F1 Score (Negative) = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)
MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)
= 055
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
Stop Words Filter
def stop_filter(words)
stops = [rsquoirsquo rsquomersquo rsquomyrsquo rsquomyselfrsquo rsquowersquo rsquoourrsquo rsquooursrsquo rsquoourselvesrsquo rsquoyoursquo
rsquoyourrsquo rsquoyoursrsquo rsquoyourselfrsquo rsquoyourselvesrsquo rsquohersquo rsquohimrsquo rsquohisrsquo rsquohimselfrsquo rsquoshersquo
rsquoherrsquo rsquohersrsquo rsquoherselfrsquo rsquoitrsquo rsquoitsrsquo rsquoitselfrsquorsquotheyrsquo rsquothemrsquo rsquotheirrsquo rsquotheirsrsquo
rsquothemselvesrsquo rsquowhatrsquo rsquowhichrsquo rsquowhorsquo rsquowhomrsquo rsquothisrsquo rsquothatrsquo rsquothesersquo rsquothosersquo
rsquoamrsquo rsquoisrsquo rsquoarersquo rsquowasrsquo rsquowerersquorsquobersquo rsquobeenrsquo rsquobeingrsquo rsquohaversquo rsquohasrsquo rsquohadrsquo
rsquohavingrsquo rsquodorsquo rsquodoesrsquo rsquodidrsquo rsquodoingrsquo rsquoarsquo rsquoanrsquo rsquothersquo rsquoandrsquo rsquobutrsquo rsquoifrsquo rsquoorrsquo rsquobecausersquo
rsquoasrsquo rsquountilrsquo rsquowhilersquo rsquoofrsquo rsquoatrsquo rsquobyrsquo rsquoforrsquorsquowithrsquo rsquoaboutrsquo rsquoagainstrsquo rsquobetweenrsquo rsquointorsquo
rsquothroughrsquo rsquoduringrsquo rsquobeforersquo rsquoafterrsquo rsquoaboversquo rsquobelowrsquo rsquotorsquo rsquofromrsquo rsquouprsquo rsquodownrsquo rsquoinrsquo rsquooutrsquo
rsquoonrsquo rsquooffrsquo rsquooverrsquo rsquounderrsquo rsquoagainrsquo rsquofurtherrsquorsquothenrsquo rsquooncersquo rsquoherersquo rsquotherersquo rsquowhenrsquo rsquowherersquo
rsquowhyrsquo rsquohowrsquo rsquoallrsquo rsquoanyrsquo rsquobothrsquo rsquoeachrsquo rsquofewrsquo rsquomorersquo rsquomostrsquo rsquootherrsquo rsquosomersquo rsquosuchrsquo rsquonorsquo
rsquonorrsquorsquonotrsquo rsquoonlyrsquo rsquoownrsquo rsquosamersquo rsquosorsquo rsquothanrsquo rsquotoorsquo rsquoveryrsquo rsquosrsquo rsquotrsquo rsquocanrsquo rsquowillrsquo rsquojustrsquo
rsquodonrsquo rsquoshouldrsquo rsquonowrsquo]
stopless = [word in words if word not in stops]
return stopless
Jaganadh G Elements of Text Mining
Bag of Words
The bag-of-words model is a simplifying representation used in natural languageprocessing and information retrieval (IR) In this model a text (such as a sentence or adocument) is represented as an un-ordered collection of words disregarding grammarand even word order a
Analyzing text by only analyzing frequency of words is called as bag of words model
ahttpenwikipediaorgwikiBag of words model
Jaganadh G Elements of Text Mining
Bag of Words
Example
d1 John likes to watch movies Mary likes too
d2 John also likes to watch football games
rsquofootballrsquo 0 rsquowatchrsquo 6 rsquomoviesrsquo 5 rsquogamesrsquo 1
rsquolikesrsquo 3 rsquojohnrsquo 2 rsquomaryrsquo 4
after removing stopwords
[0 0 1 2 1 1 1]
[1 1 1 1 0 0 1]
each entry of the vectors refers to count of
the corresponding entry in the dictionary
a
aExample taken from httpenwikipediaorgwikiBag of words model
Jaganadh G Elements of Text Mining
Bag of Words
Documents
d1 John likes to watch movies Mary likes too
d2 John also likes to watch football games
Vocabulary Index
V I(t) =
0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo
Jaganadh G Elements of Text Mining
Bag of Words
Documents
d1 John likes to watch movies Mary likes too
d2 John also likes to watch football games
Vocabulary Index
V I(t) =
0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo
Jaganadh G Elements of Text Mining
Bag of Words
football games john likes mary movies watch
doc1 0 0 1 2 1 1 1
doc2 1 1 1 1 0 0 1
Jaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python and sklearn 5
from sklearnfeature_extractiontext import
CountVectorizer
vectorizer = CountVectorizer(analyzer=rsquowordrsquo
min_n=1stop_words=rsquoenglishrsquo)
docs = (rsquoJohn likes to watch movies Mary
likes toorsquorsquoJohn also likes to watch football gamesrsquo)
bow = vectorizerfit_transform(docs)
print vectorizervocabulary_
print bowtoarray()
5httpscikit-learnorgJaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python Just for sample -(
def bag_of_words(docs)
stops = [rsquotorsquorsquotoorsquorsquoalsorsquo]
token_list = [tokenize(doc) for doc in docs]
vocab = list(set(token_list[0])union(token_list))
vocab =[v for v in vocab if v not in stops and len(v) gt 1]
vocab_idex = dict( [ ( word vocabindex(word) ) for word
in vocab] )
bow = [[tokenscount(word) for word in vocab_idexkeys()]
for tokens in token_list]
print vocab_idex
for bag in bow
print bag
d = (John likes to watch movies Mary likes too
John also likes to watch football games)
bag_of_words(d)
Jaganadh G Elements of Text Mining
TF-IDF
Tfndashidf term frequencyndashinverse document frequency is a numerical statistic which reflectshow important a word is to a document in a collection or corpus
TF-IDF
tf minus idf(t) = tf(t d)times idf(t)
where rsquotrsquo is a term in document rsquodrsquotf(t d) how many times the term rsquotrsquo is present in rsquodrsquo
tf(t d) =sumxisind
fr(x t)
where
fr(x t) =
1 if x = t0 otherwise
Jaganadh G Elements of Text Mining
TF-IDF
TF-IDF
andidf(t) = log |D|
1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space
Jaganadh G Elements of Text Mining
TF
TF
tf(t d) =sumxisind
fr(x t)
fr(x t) is a simple function
fr(x t) =
1 if x = t0 otherwise
Exampletf(primejohnprime d1) = 1
Jaganadh G Elements of Text Mining
Document Vector
To create a document vector space
V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))
To represent rsquod1rsquo and rsquod2rsquo as vectors
V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))
V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
which evaluates toV ~d1 = (0 0 1 2 1 1 1)
V ~d2 = (1 1 1 1 0 0 1)
Jaganadh G Elements of Text Mining
Vector Space Matrix
The document vectors can be represented as matrix
M|D|xF
where |D| is the cardinality of the document space
M|D|xF =
[0 0 1 2 1 1 11 1 1 1 0 0 1
]
Jaganadh G Elements of Text Mining
Vector Normalization
Normalized Vector
A normalized vector is represented as v = ~v~vp where v is is the unit vector or the
normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a
ahttpenwikipediaorgwikiLp space
Jaganadh G Elements of Text Mining
Vector Normalization
Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =
radicv2
1 + v22 + v2
2 + + v2n
With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p
It can be simplified as
~v||p = (
nsumi=1
|~vi|p)1p
6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining
Vector Normalization
L2 Norm
The norm which we apply here is L2 Norm which is also called as Euclidean norm a
It is a common norm used to measure the length of a vector where p = 2
ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm
Jaganadh G Elements of Text Mining
Vector Normalization
v~d1 = (0 0 1 2 1 1 1)vd1 = ~v
~vp
vd1 = v~d1
v~d12vd1 = (0012111)radic
02+02+12+22+12+12+12
vd1 = (0012111)radic8
vd1 = ( 0radic8 0radic
8 1radic
8 2radic
8 1radic
8 1radic
8 1radic
8)
vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10
Jaganadh G Elements of Text Mining
IDF
IDF
idf(t) = log |D|1+|dtisind|
where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0
Jaganadh G Elements of Text Mining
Finding IDF
idf(ti) = log|D|
1 + |d ti isin d|= log
2
1= 069314718
idf(football) = log 21+1
= 00
idf(games) = log 21+1
= 00
idf(john) = log 21+2
= minus040546510810816444
idf(likes) = log 21+2
= minus040546510810816444
idf(mary) = log 21+1
= 00
idf(movies) = log 21+1
= 00
idf(watch) = log 21+1
= 00idf(V ) = (00-040546510810816444-040546510810816444000000)
Jaganadh G Elements of Text Mining
TF-IDF weight
Finding TF-IDF weight
M|D|timesF times Midf
[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
]x
idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)
Jaganadh G Elements of Text Mining
TF-IDF weight
Finding TF-IDF weight
M|D|timesF times Midf
[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
]x
idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)
Jaganadh G Elements of Text Mining
TF-IDF weight
tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)
Jaganadh G Elements of Text Mining
TF-IDF Normalization
L2 Normalization
Mtfminusidf =MtfminusidfMtfminusidf2
Jaganadh G Elements of Text Mining
TF-IDF
Practice with Python and sklearn 7
from sklearnfeature_extractiontext import
CountVectorizer TfidfTransformer
vectorizer = CountVectorizer(analyzer=rsquowordrsquo
min_n=1stop_words=rsquoenglishrsquo)
docs = (rsquoJohn likes to watch movies Mary
likes toorsquorsquoJohn also likes to watch football gamesrsquo)
bow = vectorizerfit_transform(docs)
freq_term_matrix =vectorizertransform(docs)
tfidf = TfidfTransformer(norm=l2)
tfd = tfidffit(freq_term_matrix)
print IDF tfidfidf_
tf_idf_matrix = tfidftransform(freq_term_matrix)
print tf_idf_matrixtodense()
for wf in zip(vectorizervocabulary_tfdidf_)
print rsquor =gt rrsquo (w tfdidf_[f])
7httpscikit-learnorgJaganadh G Elements of Text Mining
N-Grams
N-Gram
In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa
Unigrams are single words
Bigrams are sequences of two words
Trigrams are sequences of three words
ahttpenwikipediaorgwikiN-gram
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from list of words
def ngrams(wordsn=2)
grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]
return grams
words = John likes to watch movies Mary likes too
lower()split()
bigrams = ngrams(wordsn=2)
trigrams = ngrams(wordsn=3)
print bigram
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math
def mutual_info(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
minfo =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
minfo[gram] = (mathlog( len(words) gramcount[ gram ]
wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))
mathlog( 2 )
return minfo
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math
def tscore(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
tsc =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
tsc[gram] = (gramcount[gram] - (1len(words))
wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])
mathsqrt( gramcount[gram])
return tsc
Jaganadh G Elements of Text Mining
Document Classification
Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8
Document classification tasks can be divided into three kinds
supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents
semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance
unsupervised document classification is entirely executed without reference toexternal information
8httpenwikipediaorgwikiDocument_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (or classifier function) γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ.

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

Main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)
Document Classification
A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php
Naïve Bayes Classification
Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.

Bayes' rule: P(H|E) = P(E|H) × P(H) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.
Naïve Bayes Classification
Let H be the event of raining and E be the evidence of dark cloud; then we have:

P(raining | dark cloud) = P(dark cloud | raining) × P(raining) / P(dark cloud)

For multiple evidences:

P(H | E1, E2, ..., En) = P(E1, E2, ..., En | H) × P(H) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as follows:

P(H | E1, E2, ..., En) = P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) / P(E1, E2, ..., En)
Naïve Bayes Applied in Text Classification
Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs   kill  bomb  kidnap  music  movie  tv   C
D0              2     1     3       0      0      1    Terrorism
D1              1     1     1       0      0      0    Terrorism
D2              1     1     2       0      1      0    Terrorism
D3              0     1     0       2      1      1    Entertainment
D4              0     0     1       1      1      0    Entertainment
D5              0     0     0       2      2      0    Entertainment
Building Naïve Bayes Model
The Naïve Bayes model for the training set will look like:

|V|  C  P(Ci)  ni   P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
6    T  0.5    15   0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
     E  0.5    12   0.05555556   0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency of each class:
nterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class.
Example:
P(kill|Terrorism) = (2 + 1 + 1)/15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0)/12 = 0/12
Building Naïve Bayes Model
To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1)/(15 + |V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0 + 0 + 0 + 1)/(12 + |V|) = 1/18 = 0.0555
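The priors and Laplace-smoothed conditional probabilities above can be reproduced in a few lines of Python. This is a minimal sketch with the toy training table hard-coded; the variable names are illustrative, not part of the original tutorial:

from __future__ import division

vocab = ['kill', 'bomb', 'kidnap', 'music', 'movie', 'tv']
# per-class word frequencies, summed over the training docs
freq = {'Terrorism':     [4, 3, 6, 0, 1, 1],   # n = 15
        'Entertainment': [0, 1, 1, 5, 4, 1]}   # n = 12
docs = {'Terrorism': 3, 'Entertainment': 3}

for c in freq:
    prior = docs[c] / sum(docs.values())        # P(Ci)
    n = sum(freq[c])                            # ni
    print("P(%s) = %.2f, n = %d" % (c, prior, n))
    for w, f in zip(vocab, freq[c]):
        p = (f + 1) / (n + len(vocab))          # Laplace estimation
        print("  P(%s|%s) = %.4f" % (w, c, p))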
Testing the NB model
Our test document is:

Test Docs   kill  bomb  kidnap  music  movie  tv   C
Dt          2     1     2       0      0      1    ?

To find the posterior probability:

P(ci|W) = P(ci) × ∏(j=1..|V|) P(wj|ci)
Testing the NB model
P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)^2 × P(bomb|Terrorism)^1 × P(kidnap|Terrorism)^2 × P(music|Terrorism)^0 × P(movie|Terrorism)^0 × P(tv|Terrorism)^1
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)^2 × P(bomb|Entertainment)^1 × P(kidnap|Entertainment)^2 × P(music|Entertainment)^0 × P(movie|Entertainment)^0 × P(tv|Entertainment)^1
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document is classified as "Terrorism" because that class got the highest value.
Preventing Underflow
The probability score assigned to the test document is very small. In real-world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called Underflow. To resolve the problem we can take the logarithm of the probabilities, like:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.5110 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem, our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
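A minimal sketch of the log-space scoring, reusing the smoothed probabilities and the toy test document from the slides (base-10 logarithms, to match the numbers above; the names are illustrative):

from __future__ import division
import math

vocab = ['kill', 'bomb', 'kidnap', 'music', 'movie', 'tv']
cond = {'Terrorism':     [0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952],
        'Entertainment': [0.0555, 0.1111, 0.1111, 0.3333, 0.2777, 0.1111]}
prior = {'Terrorism': 0.5, 'Entertainment': 0.5}
test_doc = [2, 1, 2, 0, 0, 1]   # word counts of the test document Dt

for c in cond:
    score = math.log10(prior[c])
    for f, p in zip(test_doc, cond[c]):
        score += f * math.log10(p)      # log(p^f) = f * log(p)
    print("log P(%s|W) = %.4f" % (c, score))
# the highest (least negative) score wins: Terrorism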
Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes classifier:
Multi-variate Bernoulli Model
Multinomial Model
Multi-variate Bernoulli Model
Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in a document. a

a A Comparison of Event Models for Naive Bayes Text Classification. Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Multi-variate Bernoulli Model
If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs   kill  bomb  kidnap  music  movie  tv   C
D0              1     1     1       0      0      1    Terrorism
D1              1     1     1       0      0      0    Terrorism
D2              1     1     1       0      1      0    Terrorism
D3              0     1     0       1      1      1    Entertainment
D4              0     0     1       1      1      0    Entertainment
D5              0     0     0       1      1      0    Entertainment

Here you can note that the individual word frequencies have been replaced by presence or absence of the word.
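If we want this event model in code, scikit-learn provides a BernoulliNB estimator that works on presence/absence vectors; a minimal sketch in which the arrays simply mirror the table on this slide:

from sklearn.naive_bayes import BernoulliNB

# rows: D0..D5; columns: kill, bomb, kidnap, music, movie, tv
X = [[1, 1, 1, 0, 0, 1],
     [1, 1, 1, 0, 0, 0],
     [1, 1, 1, 0, 1, 0],
     [0, 1, 0, 1, 1, 1],
     [0, 0, 1, 1, 1, 0],
     [0, 0, 0, 1, 1, 0]]
y = ['Terrorism'] * 3 + ['Entertainment'] * 3

model = BernoulliNB()
model.fit(X, y)
print(model.predict([[1, 1, 1, 0, 0, 1]]))   # a terrorism-like document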
Multinomial Model
In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model, a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag-of-words representation. We can use a smoothed bag-of-words model like TF-IDF in the multinomial naive Bayes model.
Support Vector Machine
A Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}, i = 1..n, where xi ∈ R^p (input feature vector) and yi ∈ {1, −1} is the corresponding label, whether (yi = +1) or (yi = −1). To start with, we assume that our input feature vectors are linearly separable; that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = −1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
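To make the notation concrete, here is a tiny sketch evaluating the decision function ⟨w, x⟩ + b and reading the predicted label off its sign; the weight vector and bias are hand-picked for illustration, not learned:

import numpy as np

w = np.array([0.4, -0.2])   # weight vector (illustrative)
b = -0.1                    # bias (illustrative)

for x in (np.array([1.0, 0.5]), np.array([-0.5, 1.0])):
    score = np.dot(w, x) + b            # <w, x> + b
    print(x, '->', score, '-> y =', 1 if score > 0 else -1)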
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margin of the scatter region of squares and of rectangles, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors. 10 If the data is not linearly separable we have to use kernel tricks. 11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.
11 http://www.statsoft.com/textbook/support-vector-machines
Practice Time
Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories, shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)

print(training_data.target_names[classifier.predict(['This is a good one'])[0]])
Practice Time
Let's try to build an SVM Classifier with Python Sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories, shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)

print(training_data.target_names[classifier.predict(['This is a good one'])[0]])
Practice Time
Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK. 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg')
            for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos')
            for f in posids]

negcutoff = len(negfeats) * 3 // 4
poscutoff = len(posfeats) * 3 // 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print(classifier.classify(words))

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/. Follow the link for a more detailed discussion.
Evaluating Performance of a Classifier
Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                     Actual
                     Positive             Negative
Predicted  Positive  True Positive (TP)   False Positive (FP)
           Negative  False Negative (FN)  True Negative (TN)
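In practice the table is produced directly from actual and predicted labels; a small sketch with scikit-learn's confusion_matrix (the label vectors are made up for illustration):

from sklearn.metrics import confusion_matrix

actual    = ['pos', 'pos', 'neg', 'neg', 'pos', 'neg']
predicted = ['pos', 'neg', 'neg', 'pos', 'pos', 'neg']

# rows are actual classes, columns are predicted classes
print(confusion_matrix(actual, predicted, labels=['pos', 'neg']))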
Evaluating Performance of a Classifier
Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                     Actual
                     Positive  Negative  Total
Predicted  Positive  562       77        639
           Negative  225       436       661
           Total     787       513       1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
Evaluating Performance of a Classifier
Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)
Evaluating Performance of a Classifier
Using the same confusion matrix as above:

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84
Evaluating Performance of a Classifier
Error Rate

Error rate is the percentage of things done wrong.

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23
Evaluating Performance of a Classifier
Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

Fall-out = 77 / (77 + 436) = 0.15
Evaluating Performance of a Classifier
F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score
Evaluating Performance of a Classifier
F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
Evaluating Performance of a Classifier
Positive Predictive Value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87
Evaluating Performance of a Classifier
Negative Predictive Value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65
Evaluating Performance of a Classifier
Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84
Evaluating Performance of a Classifier
False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12
Evaluating Performance of a Classifier
Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / sqrt( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

MCC = (562 × 436 − 77 × 225) / sqrt( (562 + 77)(562 + 225)(436 + 77)(436 + 225) ) = 0.55
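All of the measures above are simple arithmetic over the four cells of the confusion matrix; this sketch reproduces the worked numbers from the slides:

from __future__ import division
import math

TP, FP, FN, TN = 562, 77, 225, 436

metrics = {
    'accuracy':    (TP + TN) / (TP + FP + FN + TN),
    'precision':   TP / (TP + FP),
    'recall':      TP / (TP + FN),
    'error rate':  (FP + FN) / (TP + FP + FN + TN),
    'fall-out':    FP / (FP + TN),
    'NPV':         TN / (TN + FN),
    'specificity': TN / (FP + TN),
    'FDR':         FP / (FP + TP),
    'MCC': (TP * TN - FP * FN) / math.sqrt((TP + FP) * (TP + FN) *
                                           (TN + FP) * (TN + FN)),
}
metrics['F1'] = (2 * metrics['precision'] * metrics['recall'] /
                 (metrics['precision'] + metrics['recall']))

for name in sorted(metrics):
    print("%-12s %.2f" % (name, metrics[name]))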
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Evaluating Performance of a Classifier
Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
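Given true binary labels and classifier scores, the ROC points and AUC come straight from scikit-learn; a sketch with made-up arrays:

from sklearn.metrics import roc_curve, auc

y_true  = [1, 1, 0, 1, 0, 0, 1, 0]                    # actual labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]   # classifier confidences

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("FPR:", fpr)
print("TPR:", tpr)
print("AUC:", auc(fpr, tpr))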
Named Entity Recognition
Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Named Entity Recognition
Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)), binary=True)
              for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print(entities)
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print(terms)

13 http://pypi.python.org/pypi/topia.termextract/
References
Chris Smith, Machine Learning, Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith, Machine Learning, Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php
References
Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.
Bag of Words
The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar and even word order. a

Analyzing text by the frequency of words alone is called the bag-of-words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model
Bag of Words
Example

d1: John likes to watch movies Mary likes too
d2: John also likes to watch football games

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1, 'likes': 3, 'john': 2, 'mary': 4}

after removing stopwords:
[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of the corresponding entry in the dictionary. a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model
Bag of Words
Documents

d1: John likes to watch movies Mary likes too
d2: John also likes to watch football games

Vocabulary Index

VI(t) = 0 if t is 'football'
        1 if t is 'games'
        2 if t is 'john'
        3 if t is 'likes'
        4 if t is 'mary'
        5 if t is 'movies'
        6 if t is 'watch'
Bag of Words
football games john likes mary movies watch
doc1 0 0 1 2 1 1 1
doc2 1 1 1 1 0 0 1
Bag of Words
Creating a Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
        'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
print(vectorizer.vocabulary_)
print(bow.toarray())
5 http://scikit-learn.org
Bag of Words
Creating a Bag of Words with plain Python. Just a sample :-(

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]  # tokenize() as defined earlier
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
           for tokens in token_list]
    print(vocab_idex)
    for bag in bow:
        print(bag)

d = ("John likes to watch movies Mary likes too",
     "John also likes to watch football games")
bag_of_words(d)
TF-IDF
Tf–idf, term frequency–inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t) = tf(t, d) × idf(t)

where 't' is a term in document 'd', and tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = Σ(x ∈ d) fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise
TF-IDF
and

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the count of documents in which 't' is present (tf(t, d) ≠ 0), and |D| is the cardinality of the document space.
TF
tf(t, d) = Σ(x ∈ d) fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1
Document Vector
To create a document vector space:

V(dn) = ( tf(t1, dn), tf(t2, dn), ..., tf(tn, dn) )

To represent 'd1' and 'd2' as vectors:

V(d1) = ( tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1) )
V(d2) = ( tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2) )

which evaluates to:

V(d1) = (0, 0, 1, 2, 1, 1, 1)
V(d2) = (1, 1, 1, 1, 0, 0, 1)
Vector Space Matrix
The document vectors can be represented as a matrix M(|D| × F), where |D| is the cardinality of the document space:

M(|D| × F) = [ 0 0 1 2 1 1 1
               1 1 1 1 0 0 1 ]
Vector Normalization
Normalized Vector

A normalized vector is represented as v̂ = v / ||v||p, where v̂ is the unit vector (the normalized vector), v is the vector to be normalized, and ||v||p is the norm (magnitude or length) of the vector v in the Lp space (Lebesgue spaces). a

a http://en.wikipedia.org/wiki/Lp_space
Vector Normalization
The length of a vector is calculated using the Euclidean norm. 6
A non-normalized vector: v = (v1, v2, v3, ..., vn)
Length of vector v: ||v|| = sqrt( v1^2 + v2^2 + v3^2 + ... + vn^2 )

With norm: ||v||p = ( |v1|^p + |v2|^p + |v3|^p + ... + |vn|^p )^(1/p)

It can be simplified as:

||v||p = ( Σ(i=1..n) |vi|^p )^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html
Vector Normalization
L2 Norm

The norm which we apply here is the L2 norm, which is also called the Euclidean norm. a
It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm
Vector Normalization
v(d1) = (0, 0, 1, 2, 1, 1, 1)

v̂(d1) = v(d1) / ||v(d1)||2
      = (0, 0, 1, 2, 1, 1, 1) / sqrt(0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2)
      = (0, 0, 1, 2, 1, 1, 1) / sqrt(8)
      = (0/sqrt(8), 0/sqrt(8), 1/sqrt(8), 2/sqrt(8), 1/sqrt(8), 1/sqrt(8), 1/sqrt(8))
      = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Our normalized vector v̂(d1) now has L2 norm ||v̂(d1)||2 = 1.0.
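The same normalization in a couple of lines of numpy (a sketch; numpy.linalg.norm computes the L2 norm by default):

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_hat = v_d1 / np.linalg.norm(v_d1)     # divide by sqrt(8)
print(v_hat)                            # [0. 0. 0.3535... 0.7071... ...]
print(np.linalg.norm(v_hat))            # 1.0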
IDF
idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the count of documents in which 't' is present (tf(t, d) ≠ 0).
Finding IDF
idf(ti) = log( |D| / (1 + |{d : ti ∈ d}|) ); here |D| = 2, so the largest possible idf is log(2/1) = 0.69314718, for a term that appears in no document.

idf(football) = log( 2 / (1 + 1) ) = 0.0
idf(games) = log( 2 / (1 + 1) ) = 0.0
idf(john) = log( 2 / (1 + 2) ) = -0.40546510810816444
idf(likes) = log( 2 / (1 + 2) ) = -0.40546510810816444
idf(mary) = log( 2 / (1 + 1) ) = 0.0
idf(movies) = log( 2 / (1 + 1) ) = 0.0
idf(watch) = log( 2 / (1 + 2) ) = -0.40546510810816444

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, -0.40546510810816444)
TF-IDF weight
Finding the TF-IDF weight: M(|D| × F) × M(idf)

[ tf(t1, d1) tf(t2, d1) tf(t3, d1) tf(t4, d1) tf(t5, d1) tf(t6, d1) tf(t7, d1) ]
[ tf(t1, d2) tf(t2, d2) tf(t3, d2) tf(t4, d2) tf(t5, d2) tf(t6, d2) tf(t7, d2) ]

multiplied by the diagonal idf matrix:

[ idf(t1)  0        0        0        0        0        0       ]
[ 0        idf(t2)  0        0        0        0        0       ]
[ 0        0        idf(t3)  0        0        0        0       ]
[ 0        0        0        idf(t4)  0        0        0       ]
[ 0        0        0        0        idf(t5)  0        0       ]
[ 0        0        0        0        0        idf(t6)  0       ]
[ 0        0        0        0        0        0        idf(t7) ]
TF-IDF weight
= [ tf(t1, d1)×idf(t1)  tf(t2, d1)×idf(t2)  tf(t3, d1)×idf(t3)  tf(t4, d1)×idf(t4)  tf(t5, d1)×idf(t5)  tf(t6, d1)×idf(t6)  tf(t7, d1)×idf(t7) ]
  [ tf(t1, d2)×idf(t1)  tf(t2, d2)×idf(t2)  tf(t3, d2)×idf(t3)  tf(t4, d2)×idf(t4)  tf(t5, d2)×idf(t5)  tf(t6, d2)×idf(t6)  tf(t7, d2)×idf(t7) ]
TF-IDF Normalization
L2 Normalization

M(tf-idf) = M(tf-idf) / ||M(tf-idf)||2
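The whole manual pipeline — term-frequency matrix, idf vector under this slide set's log(|D| / (1 + df)) definition, elementwise weighting, then L2 row normalization — fits in a short numpy sketch. Note that scikit-learn's TfidfTransformer uses a slightly different idf formula, so its numbers will not match exactly:

from __future__ import division
import numpy as np

# rows d1, d2; columns: football, games, john, likes, mary, movies, watch
tf = np.array([[0, 0, 1, 2, 1, 1, 1],
               [1, 1, 1, 1, 0, 0, 1]], dtype=float)

D = tf.shape[0]
df = (tf > 0).sum(axis=0)          # document frequency of each term
idf = np.log(D / (1 + df))         # idf(t) = log(|D| / (1 + df(t)))
tfidf = tf * idf                   # elementwise tf x idf weighting

norms = np.linalg.norm(tfidf, axis=1, keepdims=True)
norms[norms == 0] = 1.0            # avoid division by zero for empty rows
print(tfidf / norms)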
TF-IDF
Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
        'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)

tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print("IDF:", tfidf.idf_)

tf_idf_matrix = tfidf.transform(freq_term_matrix)
print(tf_idf_matrix.todense())

for word, idx in vectorizer.vocabulary_.items():
    print('%r => %r' % (word, tfd.idf_[idx]))
7 http://scikit-learn.org
N-Grams
N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters; however, the items in question can be phonemes, syllables, letters, words or base pairs, according to the application. a

Unigrams are single words.

Bigrams are sequences of two words.

Trigrams are sequences of three words.

a http://en.wikipedia.org/wiki/N-gram
Bigrams
P(wi | w1, w2, ..., wi−1) ≈ P(wi | wi−1)

Practice with Python:

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print(bigrams)

['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']
Trigrams
P(wi | w1, w2, ..., wi−1) ≈ P(wi | wi−2, wi−1)

Practice with Python:

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print(trigrams)

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']
N-Grams
Python code to generate N-Grams from a list of words:

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print(bigrams)
print(trigrams)
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math
def mutual_info(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
minfo =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
minfo[gram] = (mathlog( len(words) gramcount[ gram ]
wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))
mathlog( 2 )
return minfo
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math
def tscore(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
tsc =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
tsc[gram] = (gramcount[gram] - (1len(words))
wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])
mathsqrt( gramcount[gram])
return tsc
Jaganadh G Elements of Text Mining
Document Classification
Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8
Document classification tasks can be divided into three kinds
supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents
semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance
unsupervised document classification is entirely executed without reference toexternal information
8httpenwikipediaorgwikiDocument_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents
langd crang
wherelangd crangisin D times C using a learning
algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ
γ(ci dj) =
1 if dj belongs to ci0 otherwise
Main approaches in document classification are
Naıve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9
9Image taken from httpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)
P (E)
The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)
P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)
P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)
P (E1E2En)
Jaganadh G Elements of Text Mining
Naıve Bayes Applied in Text Classification
Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
Naıve Bayes Model for the training set will be like [4pt]
|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095
E 05 12 005555566 011111111 011111111 033333333 027777778 011111111
|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class
number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)
15 = 415
P (kill|Entertainment) = (0+0+0)12 = 0
12
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)
(15+|V |) = 521 = 02380
P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1
18 = 00555
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document isTest Docs kill bomb kidnap music movie tv C
Dt 2 1 2 0 0 1 To find the posterior probability
P (ci|W ) = P (ci)timesVprodj=1
P (wj |ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)
= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521
= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5
P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)
= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111
= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7
The document has classified as rdquoTerrorismrdquo because it got the highest value
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289
After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely
The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf
Jaganadh G Elements of Text Mining
Naıve Bayes Classifier
There are two different ways to setup a Naıve Bayes Classifier
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta
aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by presence or absenceof the word
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument
In the NB example which we worked out we applied the multinomial model In the model we used simple bag of
words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that
〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1
〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way
If there is a clear blank space available in the table in between the square and rectangles we can say that
it is linearly separable A line which drawn in the clear space between the figuresexactly equal length
from the margin of scatter region of square and rectangle is called as separating hyperplane Everything
on the one side of the separating hyper plane belongs to one category and everything in the other side
belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter
(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin
The figures closest to the separating hyperplane are known as support vectors10 If the data is not
linearly separable we have to use kernel tricks 11
10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography
11httpwwwstatsoftcomtextbooksupport-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnnaive_bayes import MultinomialNB
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoMultinomialNB())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a SVM Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnsvm import LinearSVC
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoLinearSVC())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12
import nltkclassifyutil
from nltkclassify import NaiveBayesClassifier
from nltkcorpus import movie_reviews
def word_feats(words)
return dict([(word True) for word in words])
negids = movie_reviewsfileids(rsquonegrsquo)
posids = movie_reviewsfileids(rsquoposrsquo)
negfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquonegrsquo) for f in negids]
posfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquoposrsquo) for f in posids]
negcutoff = len(negfeats)34
poscutoff = len(posfeats)34
trainfeats = negfeats[negcutoff] + posfeats[poscutoff]
classifier = NaiveBayesClassifiertrain(trainfeats)
sent = This is really cool I like it
words = word_feats(sentlower()split())
print classifierclassify(words)
12This code is adopted from httpstreamhackercom20100510
text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm
Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)
MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)
= 055
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
Bag of Words
Example
d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1,
'likes': 3, 'john': 2, 'mary': 4}

after removing stopwords:
[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of the corresponding entry in the dictionary. a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model
Jaganadh G Elements of Text Mining
Bag of Words
Documents
d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

Vocabulary Index

V I(t) =
  0 if t is 'football'
  1 if t is 'games'
  2 if t is 'john'
  3 if t is 'likes'
  4 if t is 'mary'
  5 if t is 'movies'
  6 if t is 'watch'
Jaganadh G Elements of Text Mining
Bag of Words
football games john likes mary movies watch
doc1 0 0 1 2 1 1 1
doc2 1 1 1 1 0 0 1
Jaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python and sklearn 5
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org

Jaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python. Just for a sample :-(

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]  # tokenize() from the earlier slide
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
        in vocab])
    bow = [[tokens.count(word) for word in vocab]  # vector order = index order
        for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies. Mary likes too.",
    "John also likes to watch football games.")
bag_of_words(d)
Jaganadh G Elements of Text Mining
TF-IDF
Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t) = tf(t, d) × idf(t)

where 't' is a term in document 'd';
tf(t, d): how many times the term 't' is present in 'd'.

tf(t, d) = Σ_{x ∈ d} fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise
Jaganadh G Elements of Text Mining
TF-IDF
TF-IDF
and
idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )
where |{d : t ∈ d}| is the count of documents in which 't' is present, i.e. where tf(t, d) ≠ 0;
|D| is the cardinality of the document space.
Jaganadh G Elements of Text Mining
TF
TF
tf(t, d) = Σ_{x ∈ d} fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = 1 if x = t, 0 otherwise

Example:
tf('john', d1) = 1
Jaganadh G Elements of Text Mining
Document Vector
To create a document vector space:

V⃗dn = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))

To represent 'd1' and 'd2' as vectors:

V⃗d1 = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))
V⃗d2 = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))

which evaluates to:
V⃗d1 = (0, 0, 1, 2, 1, 1, 1)
V⃗d2 = (1, 1, 1, 1, 0, 0, 1)
Jaganadh G Elements of Text Mining
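These term-frequency vectors are easy to reproduce by counting each vocabulary term in each document. A minimal sketch, assuming the vocabulary ordering from the earlier Bag of Words slide; tf_vector is our own helper, not part of any library:

def tf_vector(doc, vocab):
    # raw term frequency of each vocabulary term, in vocabulary order
    tokens = doc.lower().replace('.', '').split()
    return [tokens.count(term) for term in vocab]

vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']
d1 = "John likes to watch movies. Mary likes too."
d2 = "John also likes to watch football games."
print tf_vector(d1, vocab)  # [0, 0, 1, 2, 1, 1, 1]
print tf_vector(d2, vocab)  # [1, 1, 1, 1, 0, 0, 1]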
Vector Space Matrix
The document vectors can be represented as a matrix:

M|D|×F

where |D| is the cardinality of the document space.

M|D|×F =
[ 0 0 1 2 1 1 1
  1 1 1 1 0 0 1 ]
Jaganadh G Elements of Text Mining
Vector Normalization
Normalized Vector
A normalized vector is represented as v̂ = v⃗ / ‖v⃗‖p, where v̂ is the unit vector (the normalized vector), v⃗ is the vector to be normalized, and ‖v⃗‖p is the norm (magnitude or length) of the vector v⃗ in the Lp space (Lebesgue spaces). a

a http://en.wikipedia.org/wiki/Lp_space
Jaganadh G Elements of Text Mining
Vector Normalization
Length of a vector is calculated using the Euclidean norm. 6
Non-normalized vector: v⃗ = (v1, v2, v3, ..., vn)
Length of vector: ‖v⃗‖ = √(v1² + v2² + v3² + ... + vn²)
With norm p: ‖v⃗‖p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)
It can be simplified as:
‖v⃗‖p = ( Σ_{i=1}^{n} |vi|^p )^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html

Jaganadh G Elements of Text Mining
Vector Normalization
L2 Norm
The norm we apply here is the L2 norm, also called the Euclidean norm. a
It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm
Jaganadh G Elements of Text Mining
Vector Normalization
v⃗d1 = (0, 0, 1, 2, 1, 1, 1)
v̂d1 = v⃗d1 / ‖v⃗d1‖2
v̂d1 = (0, 0, 1, 2, 1, 1, 1) / √(0² + 0² + 1² + 2² + 1² + 1² + 1²)
v̂d1 = (0, 0, 1, 2, 1, 1, 1) / √8
v̂d1 = (0/√8, 0/√8, 1/√8, 2/√8, 1/√8, 1/√8, 1/√8)
v̂d1 = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)
Now our normalized vector v̂d1 has an L2-norm ‖v̂d1‖2 = 1.0.
Jaganadh G Elements of Text Mining
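The same computation takes a few lines of Python. A minimal sketch of L2 normalization; l2_normalize is our own helper name (note that round() rounds 1/√8 to 0.3536 while the slide truncates to 0.3535):

from __future__ import division
import math

def l2_normalize(vec):
    # divide every component by the Euclidean (L2) length of the vector
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

vd1 = [0, 0, 1, 2, 1, 1, 1]
nvd1 = l2_normalize(vd1)
print [round(v, 4) for v in nvd1]
# [0.0, 0.0, 0.3536, 0.7071, 0.3536, 0.3536, 0.3536]
print math.sqrt(sum(v * v for v in nvd1))  # 1.0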
IDF
IDF
idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )
where |{d : t ∈ d}| is the count of documents in which 't' is present, i.e. where tf(t, d) ≠ 0.
Jaganadh G Elements of Text Mining
Finding IDF
idf(ti) = log( |D| / (1 + |{d : ti ∈ d}|) ) = log(2/1) = 0.69314718

idf(football) = log(2/(1+1)) = 0.0
idf(games) = log(2/(1+1)) = 0.0
idf(john) = log(2/(1+2)) = -0.40546510810816444
idf(likes) = log(2/(1+2)) = -0.40546510810816444
idf(mary) = log(2/(1+1)) = 0.0
idf(movies) = log(2/(1+1)) = 0.0
idf(watch) = log(2/(1+1)) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
Jaganadh G Elements of Text Mining
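These values can be checked in a few lines of Python. A minimal sketch, assuming the two-document corpus above with stopwords already removed; idf and the variable names are our own:

from __future__ import division
import math

docs = [['john', 'likes', 'watch', 'movies', 'mary', 'likes'],
        ['john', 'likes', 'watch', 'football', 'games']]

def idf(term, docs):
    # log(|D| / (1 + number of documents containing the term))
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / (1 + df))

vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']
print [round(idf(t, docs), 4) for t in vocab]
# [0.0, 0.0, -0.4055, -0.4055, 0.0, 0.0, 0.0]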
TF-IDF weight
Finding TF-IDF weight
M|D|×F × Midf

[ tf(t1, d1) tf(t2, d1) tf(t3, d1) tf(t4, d1) tf(t5, d1) tf(t6, d1) tf(t7, d1)
  tf(t1, d2) tf(t2, d2) tf(t3, d2) tf(t4, d2) tf(t5, d2) tf(t6, d2) tf(t7, d2) ]
×
[ idf(t1)  0        0        0        0        0        0
  0        idf(t2)  0        0        0        0        0
  0        0        idf(t3)  0        0        0        0
  0        0        0        idf(t4)  0        0        0
  0        0        0        0        idf(t5)  0        0
  0        0        0        0        0        idf(t6)  0
  0        0        0        0        0        0        idf(t7) ]
Jaganadh G Elements of Text Mining
TF-IDF weight
[ tf(t1, d1)×idf(t1)  tf(t2, d1)×idf(t2)  tf(t3, d1)×idf(t3)  tf(t4, d1)×idf(t4)  tf(t5, d1)×idf(t5)  tf(t6, d1)×idf(t6)  tf(t7, d1)×idf(t7)
  tf(t1, d2)×idf(t1)  tf(t2, d2)×idf(t2)  tf(t3, d2)×idf(t3)  tf(t4, d2)×idf(t4)  tf(t5, d2)×idf(t5)  tf(t6, d2)×idf(t6)  tf(t7, d2)×idf(t7) ]
Jaganadh G Elements of Text Mining
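The matrix product above is easy to verify with NumPy. A quick sketch of our own, assuming the tf matrix and idf vector computed on the previous slides:

import numpy as np

tf = np.array([[0, 0, 1, 2, 1, 1, 1],
               [1, 1, 1, 1, 0, 0, 1]], dtype=float)
idf = np.array([0.0, 0.0, -0.40546511, -0.40546511, 0.0, 0.0, 0.0])

# right-multiplying by diag(idf) scales column j of tf by idf(t_j)
weighted = tf.dot(np.diag(idf))
print weighted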
TF-IDF Normalization
L2 Normalization
Mtf-idf = Mtf-idf / ‖Mtf-idf‖2
Jaganadh G Elements of Text Mining
TF-IDF
Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfd.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, i in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[i])

7 http://scikit-learn.org

Jaganadh G Elements of Text Mining
N-Grams
N-Gram
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application. a
Unigrams are single words
Bigrams are sequences of two words
Trigrams are sequences of three words
a http://en.wikipedia.org/wiki/N-gram
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words)-2+1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
'movies mary', 'mary likes', 'likes too']
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-2, wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
'watch movies mary', 'movies mary likes', 'mary likes too']
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from list of words
def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure the strength of word association:

I(wi, wj) = log2( P(wi, wj) / (P(wi) P(wj)) ) ≈ log2( N C(wi, wj) / (C(wi) C(wj)) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev. slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))) / \
            math.log(2)
    return minfo
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure the strength of word association:

t(wi, wj) = ( mean(P(wi, wj)) - mean(P(wi)) mean(P(wj)) ) / √( σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)) )
≈ ( C(wi, wj) - (1/N) C(wi) C(wj) ) / √( C(wi, wj) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

t(strong, tea) = ( C(strong, tea) - (1/N) C(strong) C(tea) ) / √( C(strong, tea) )
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev. slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc
Jaganadh G Elements of Text Mining
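A quick usage sketch for the two association measures, assuming the ngrams, mutual_info and tscore functions from the previous slides are defined; the toy sentence is our own:

words = "strong tea is better than weak tea she said".lower().split()
mi = mutual_info(words)
ts = tscore(words)
print mi['strong tea']  # log2(9 * 1 / (1 * 2)) ~ 2.17
print ts['strong tea']  # (1 - (1/9) * 1 * 2) / sqrt(1) ~ 0.78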
Document Classification
Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds:

supervised document classification: performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification: a mixture between supervised and unsupervised classification; some documents or parts of documents are labeled by external assistance

unsupervised document classification: entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (or classifier function) γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

Main approaches in document classification are:
Naïve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9
9 Image taken from http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
Naïve Bayes Classification
Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem, or Bayes' rule:

Bayes' rule: P(H|E) = ( P(E|H) × P(H) ) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.
Jaganadh G Elements of Text Mining
Naïve Bayes Classification
Let H be the event of raining and E be the evidence of dark cloud; then we have:

P(raining | dark cloud) = ( P(dark cloud | raining) × P(raining) ) / P(dark cloud)

For multiple evidences:

P(H | E1, E2, ..., En) = ( P(E1, E2, ..., En | H) × P(H) ) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as follows:

P(H | E1, E2, ..., En) = ( P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) ) / P(E1, E2, ..., En)
Jaganadh G Elements of Text Mining
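A tiny numeric check of Bayes' rule for the rain example; all three probabilities below are made-up numbers for illustration only:

# hypothetical values: P(dark cloud|raining), P(raining), P(dark cloud)
p_e_given_h = 0.9
p_h = 0.3
p_e = 0.4
p_h_given_e = p_e_given_h * p_h / p_e
print p_h_given_e  # 0.675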
Naïve Bayes Applied in Text Classification
Now let's try to train a Naïve Bayes classifier to perform text classification:
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naïve Bayes Model
The Naïve Bayes model for the training set will look like:

|V|  C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
6    T  0.5    15  0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
     E  0.5    12  0.05555556   0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12
Jaganadh G Elements of Text Mining
Building Naïve Bayes Model
To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555
Jaganadh G Elements of Text Mining
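A minimal sketch of Laplace (add-one) smoothing over the training counts above; laplace_prob and the variable names are our own:

from __future__ import division

def laplace_prob(word_count_in_class, class_total, vocab_size):
    # add-one smoothing: every word gets a pseudo-count of 1
    return (word_count_in_class + 1) / (class_total + vocab_size)

V = 6
print laplace_prob(4, 15, V)  # P(kill|Terrorism)     = 5/21 = 0.2380...
print laplace_prob(0, 12, V)  # P(kill|Entertainment) = 1/18 = 0.0555...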
Testing the NB model
Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt         2     1     2       0      0      1   ?

To find the posterior probability:

P(ci|W) = P(ci) × Π_{j=1}^{|V|} P(wj|ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)² × P(bomb|Terrorism)¹ × P(kidnap|Terrorism)² × P(music|Terrorism)⁰ × P(movie|Terrorism)⁰ × P(tv|Terrorism)¹
= 0.5 × 0.2380² × 0.1904¹ × 0.3333² × 0.0476⁰ × 0.0952⁰ × 0.0952¹
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10⁻⁵

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)² × P(bomb|Entertainment)¹ × P(kidnap|Entertainment)² × P(music|Entertainment)⁰ × P(movie|Entertainment)⁰ × P(tv|Entertainment)¹
= 0.5 × 0.0555² × 0.1111¹ × 0.1111² × 0.3333⁰ × 0.2777⁰ × 0.1111¹
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10⁻⁷

The document is classified as "Terrorism" because that class got the highest value.
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small. In real-world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem, we can take the logarithm of the probabilities, like:

P(Terrorism|W) = log(0.5 × 0.2380² × 0.1904¹ × 0.3333² × 0.0476⁰ × 0.0952⁰ × 0.0952¹)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555² × 0.1111¹ × 0.1111² × 0.3333⁰ × 0.2777⁰ × 0.1111¹)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem, our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
Jaganadh G Elements of Text Mining
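The log-space trick is just as short in code. A minimal sketch of our own, using the smoothed probabilities from the worked example (the dictionaries and log_posterior are our own names):

import math

priors = {'Terrorism': 0.5, 'Entertainment': 0.5}
cond = {  # Laplace-smoothed P(word|class) from the slides
    'Terrorism':     {'kill': 0.2380, 'bomb': 0.1904, 'kidnap': 0.3333,
                      'music': 0.0476, 'movie': 0.0952, 'tv': 0.0952},
    'Entertainment': {'kill': 0.0555, 'bomb': 0.1111, 'kidnap': 0.1111,
                      'music': 0.3333, 'movie': 0.2777, 'tv': 0.1111},
}
test = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}

def log_posterior(cls):
    # sum of log probabilities instead of a product: no underflow
    score = math.log10(priors[cls])
    for word, freq in test.items():
        score += freq * math.log10(cond[cls][word])
    return score

for cls in priors:
    print cls, log_posterior(cls)
# Terrorism ~ -4.24, Entertainment ~ -6.63: Terrorism wins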
Naïve Bayes Classifier
There are two different ways to set up a Naïve Bayes classifier:
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in the document. a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will look like:
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
Jaganadh G Elements of Text Mining
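Converting the count representation to the Bernoulli (presence/absence) representation is a one-liner per document. A minimal sketch with our own variable names:

count_docs = [
    [2, 1, 3, 0, 0, 1],  # D0
    [1, 1, 1, 0, 0, 0],  # D1
    [0, 0, 0, 2, 2, 2],  # D5
]
# clip every count to 0/1: does the word occur at least once?
bernoulli_docs = [[1 if c > 0 else 0 for c in doc] for doc in count_docs]
print bernoulli_docs
# [[1, 1, 1, 0, 0, 1], [1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]]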
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag of words representation. We can use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.
Jaganadh G Elements of Text Mining
Support Vector Machine
A Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}_{i=1}^{n}, where xi ∈ R^p (input feature vector) and yi ∈ {1, -1} is the corresponding label, whether (yi = +1) or (yi = -1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = -1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate the positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable, we have to use kernel tricks to find soft margins.
Jaganadh G Elements of Text Mining
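The decision function itself is just a sign test on ⟨w, x⟩ + b. A minimal sketch of our own, with a made-up weight vector and bias (not a trained model):

def svm_predict(w, b, x):
    # sign of <w, x> + b decides the class (+1 or -1)
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else -1

w = [0.4, -0.7]   # hypothetical learned weight vector
b = 0.1           # hypothetical learned bias
print svm_predict(w, b, [2.0, 0.5])   # 1
print svm_predict(w, b, [0.5, 2.0])   # -1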
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that they are linearly separable. A line drawn through the clear space between the figures, at exactly equal distance from the margins of the scatter regions of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors. 10 If the data is not linearly separable, we have to use kernel tricks. 11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.
11 http://www.statsoft.com/textbook/support-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multinomial Naïve Bayes classifier with Python sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build an SVM classifier with Python sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a multi-variate Naïve Bayes classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                        Actual
                        Positive              Negative
Predicted   Positive    True Positive (TP)    False Positive (FP)
            Negative    False Negative (FN)   True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = (TP + TN) / (TP + FP + FN + TN)

                        Actual
                        Positive   Negative   Total
Predicted   Positive    562        77         639
            Negative    225        436        661
            Total       787        513        1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
                        Actual
                        Positive   Negative   Total
Predicted   Positive    562        77         639
            Negative    225        436        661
            Total       787        513        1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong.

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                        Actual
                        Positive   Negative   Total
Predicted   Positive    562        77         639
            Negative    225        436        661
            Total       787        513        1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                        Actual
                        Positive   Negative   Total
Predicted   Positive    562        77         639
            Negative    225        436        661
            Total       787        513        1300

Fall-out = 77 / (77 + 436) = 0.15
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78
F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)
Positive predictive value = 562 / (562 + 77) = 0.87
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)
NPV = 436 / (436 + 225) = 0.65
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TN / (FP + TN)
Specificity = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)
FDR = 77 / (77 + 562) = 0.12
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN - FP × FN) / √( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

MCC = (562 × 436 - 77 × 225) / √( (562 + 77)(562 + 225)(436 + 77)(436 + 225) ) = 0.55
Jaganadh G Elements of Text Mining
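All of these metrics fall out of the four confusion-matrix cells. A minimal sketch that reproduces the numbers from the slides (function and variable names are our own; note the slides truncate values such as 0.76, 0.87 and 0.78 where round() gives 0.77, 0.88 and 0.79):

from __future__ import division
import math

TP, FP, FN, TN = 562, 77, 225, 436

accuracy  = (TP + TN) / (TP + FP + FN + TN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
error     = (FP + FN) / (TP + FP + FN + TN)
fallout   = FP / (FP + TN)
f1        = 2 * precision * recall / (precision + recall)
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

print round(accuracy, 2), round(precision, 2), round(recall, 2)
print round(error, 2), round(fallout, 2), round(f1, 2), round(mcc, 2)
# 0.77 0.88 0.71
# 0.23 0.15 0.79 0.55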
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis, and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
Bag of Words
Documents
d1 John likes to watch movies Mary likes too
d2 John also likes to watch football games
Vocabulary Index
V I(t) =
0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo
Jaganadh G Elements of Text Mining
Bag of Words
Documents
d1 John likes to watch movies Mary likes too
d2 John also likes to watch football games
Vocabulary Index
V I(t) =
0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo
Jaganadh G Elements of Text Mining
Bag of Words
football games john likes mary movies watch
doc1 0 0 1 2 1 1 1
doc2 1 1 1 1 0 0 1
Jaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python and sklearn 5
from sklearnfeature_extractiontext import
CountVectorizer
vectorizer = CountVectorizer(analyzer=rsquowordrsquo
min_n=1stop_words=rsquoenglishrsquo)
docs = (rsquoJohn likes to watch movies Mary
likes toorsquorsquoJohn also likes to watch football gamesrsquo)
bow = vectorizerfit_transform(docs)
print vectorizervocabulary_
print bowtoarray()
5httpscikit-learnorgJaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python Just for sample -(
def bag_of_words(docs)
stops = [rsquotorsquorsquotoorsquorsquoalsorsquo]
token_list = [tokenize(doc) for doc in docs]
vocab = list(set(token_list[0])union(token_list))
vocab =[v for v in vocab if v not in stops and len(v) gt 1]
vocab_idex = dict( [ ( word vocabindex(word) ) for word
in vocab] )
bow = [[tokenscount(word) for word in vocab_idexkeys()]
for tokens in token_list]
print vocab_idex
for bag in bow
print bag
d = (John likes to watch movies Mary likes too
John also likes to watch football games)
bag_of_words(d)
Jaganadh G Elements of Text Mining
TF-IDF
Tfndashidf term frequencyndashinverse document frequency is a numerical statistic which reflectshow important a word is to a document in a collection or corpus
TF-IDF
tf minus idf(t) = tf(t d)times idf(t)
where rsquotrsquo is a term in document rsquodrsquotf(t d) how many times the term rsquotrsquo is present in rsquodrsquo
tf(t d) =sumxisind
fr(x t)
where
fr(x t) =
1 if x = t0 otherwise
Jaganadh G Elements of Text Mining
TF-IDF
TF-IDF
andidf(t) = log |D|
1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space
Jaganadh G Elements of Text Mining
TF
TF
tf(t d) =sumxisind
fr(x t)
fr(x t) is a simple function
fr(x t) =
1 if x = t0 otherwise
Exampletf(primejohnprime d1) = 1
Jaganadh G Elements of Text Mining
Document Vector
To create a document vector space
V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))
To represent rsquod1rsquo and rsquod2rsquo as vectors
V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))
V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
which evaluates toV ~d1 = (0 0 1 2 1 1 1)
V ~d2 = (1 1 1 1 0 0 1)
Jaganadh G Elements of Text Mining
Vector Space Matrix
The document vectors can be represented as matrix
M|D|xF
where |D| is the cardinality of the document space
M|D|xF =
[0 0 1 2 1 1 11 1 1 1 0 0 1
]
Jaganadh G Elements of Text Mining
Vector Normalization
Normalized Vector
A normalized vector is represented as v = ~v~vp where v is is the unit vector or the
normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a
ahttpenwikipediaorgwikiLp space
Jaganadh G Elements of Text Mining
Vector Normalization
Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =
radicv2
1 + v22 + v2
2 + + v2n
With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p
It can be simplified as
~v||p = (
nsumi=1
|~vi|p)1p
6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining
Vector Normalization
L2 Norm
The norm which we apply here is L2 Norm which is also called as Euclidean norm a
It is a common norm used to measure the length of a vector where p = 2
ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm
Jaganadh G Elements of Text Mining
Vector Normalization
v~d1 = (0 0 1 2 1 1 1)vd1 = ~v
~vp
vd1 = v~d1
v~d12vd1 = (0012111)radic
02+02+12+22+12+12+12
vd1 = (0012111)radic8
vd1 = ( 0radic8 0radic
8 1radic
8 2radic
8 1radic
8 1radic
8 1radic
8)
vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10
Jaganadh G Elements of Text Mining
IDF
IDF
idf(t) = log |D|1+|dtisind|
where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0
Jaganadh G Elements of Text Mining
Finding IDF
idf(ti) = log|D|
1 + |d ti isin d|= log
2
1= 069314718
idf(football) = log 21+1
= 00
idf(games) = log 21+1
= 00
idf(john) = log 21+2
= minus040546510810816444
idf(likes) = log 21+2
= minus040546510810816444
idf(mary) = log 21+1
= 00
idf(movies) = log 21+1
= 00
idf(watch) = log 21+1
= 00idf(V ) = (00-040546510810816444-040546510810816444000000)
Jaganadh G Elements of Text Mining
TF-IDF weight
Finding TF-IDF weight
M|D|timesF times Midf
[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
]x
idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)
Jaganadh G Elements of Text Mining
TF-IDF weight
Finding TF-IDF weight
M|D|timesF times Midf
[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
]x
idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)
Jaganadh G Elements of Text Mining
TF-IDF weight
tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)
Jaganadh G Elements of Text Mining
TF-IDF Normalization
L2 Normalization
Mtfminusidf =MtfminusidfMtfminusidf2
Jaganadh G Elements of Text Mining
TF-IDF
Practice with Python and sklearn 7
from sklearnfeature_extractiontext import
CountVectorizer TfidfTransformer
vectorizer = CountVectorizer(analyzer=rsquowordrsquo
min_n=1stop_words=rsquoenglishrsquo)
docs = (rsquoJohn likes to watch movies Mary
likes toorsquorsquoJohn also likes to watch football gamesrsquo)
bow = vectorizerfit_transform(docs)
freq_term_matrix =vectorizertransform(docs)
tfidf = TfidfTransformer(norm=l2)
tfd = tfidffit(freq_term_matrix)
print IDF tfidfidf_
tf_idf_matrix = tfidftransform(freq_term_matrix)
print tf_idf_matrixtodense()
for wf in zip(vectorizervocabulary_tfdidf_)
print rsquor =gt rrsquo (w tfdidf_[f])
7httpscikit-learnorgJaganadh G Elements of Text Mining
N-Grams
N-Gram
In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa
Unigrams are single words
Bigrams are sequences of two words
Trigrams are sequences of three words
ahttpenwikipediaorgwikiN-gram
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from list of words
def ngrams(wordsn=2)
grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]
return grams
words = John likes to watch movies Mary likes too
lower()split()
bigrams = ngrams(wordsn=2)
trigrams = ngrams(wordsn=3)
print bigram
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math
def mutual_info(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
minfo =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
minfo[gram] = (mathlog( len(words) gramcount[ gram ]
wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))
mathlog( 2 )
return minfo
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math
def tscore(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
tsc =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
tsc[gram] = (gramcount[gram] - (1len(words))
wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])
mathsqrt( gramcount[gram])
return tsc
Jaganadh G Elements of Text Mining
Document Classification
Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8
Document classification tasks can be divided into three kinds
supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents
semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance
unsupervised document classification is entirely executed without reference toexternal information
8httpenwikipediaorgwikiDocument_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents
langd crang
wherelangd crangisin D times C using a learning
algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ
γ(ci dj) =
1 if dj belongs to ci0 otherwise
Main approaches in document classification are
Naıve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9
9Image taken from httpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)
P (E)
The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)
P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)
P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)
P (E1E2En)
Jaganadh G Elements of Text Mining
Naıve Bayes Applied in Text Classification
Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
Naıve Bayes Model for the training set will be like [4pt]
|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095
E 05 12 005555566 011111111 011111111 033333333 027777778 011111111
|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class
number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)
15 = 415
P (kill|Entertainment) = (0+0+0)12 = 0
12
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)
(15+|V |) = 521 = 02380
P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1
18 = 00555
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document isTest Docs kill bomb kidnap music movie tv C
Dt 2 1 2 0 0 1 To find the posterior probability
P (ci|W ) = P (ci)timesVprodj=1
P (wj |ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)
= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521
= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5
P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)
= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111
= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7
The document has classified as rdquoTerrorismrdquo because it got the highest value
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289
After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely
The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf
Jaganadh G Elements of Text Mining
Naıve Bayes Classifier
There are two different ways to setup a Naıve Bayes Classifier
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta
aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by presence or absenceof the word
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument
In the NB example which we worked out we applied the multinomial model In the model we used simple bag of
words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi), i = 1, ..., n}, where xi ∈ R^p (input feature vector) and yi ∈ {+1, −1} is the corresponding label, whether (yi = +1) or (yi = −1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that
⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = −1
⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but not all of them are equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable, we have to use kernel tricks to find soft margins. A minimal sketch of the decision rule follows.
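A minimal sketch of the resulting decision rule in plain Python; the weight vector and bias below are made-up values for illustration, not learned ones:

def predict(w, b, x):
    # the sign of <w, x> + b decides the class
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return +1 if score > 0 else -1

w = [0.4, -0.2]  # hypothetical weight vector
b = -0.1         # hypothetical bias
print(predict(w, b, [1.0, 0.5]))  # +1
print(predict(w, b, [0.1, 2.0]))  # -1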
Support Vector Machine
Suppose a bunch of red squares and blue rectangles are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.
If there is a clear blank space available on the table between the squares and rectangles, we can say that they are linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the square scatter and the rectangle scatter, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn along the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable, we have to use kernel tricks.11
10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.
11 http://www.statsoft.com/textbook/support-vector-machines
Practice Time
Let's try to build a Multinomial Naïve Bayes classifier with Python and sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Practice Time
Let's try to build an SVM classifier with Python and sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Practice Time
Let's try to build a multi-variate Naïve Bayes classifier with Python NLTK:12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.
Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                        Actual
                        Positive             Negative
Predicted  Positive     True Positive (TP)   False Positive (FP)
           Negative     False Negative (FN)  True Negative (TN)
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = (TP + TN) / (TP + FP + FN + TN)

                        Actual
                        Positive  Negative  Total
Predicted  Positive          562        77    639
           Negative          225       436    661
Total                        787       513   1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
Evaluating Performance of a Classifier
Precision and Recall
Precision, which indicates how many of the items that we identified were relevant:
Precision = TP / (TP + FP)
Recall, which indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":
Recall = TP / (TP + FN)
Evaluating Performance of a Classifier
                        Actual
                        Positive  Negative  Total
Predicted  Positive          562        77    639
           Negative          225       436    661
Total                        787       513   1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong.
ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                        Actual
                        Positive  Negative  Total
Predicted  Positive          562        77    639
           Negative          225       436    661
Total                        787       513   1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23
Evaluating Performance of a Classifier
Fall-out
It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).
Fall-out = FP / (FP + TN)

                        Actual
                        Positive  Negative  Total
Predicted  Positive          562        77    639
           Negative          225       436    661
Total                        787       513   1300

Fall-out = 77 / (77 + 436) = 0.15
Evaluating Performance of a Classifier
F1 Score
In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.(a)
F1 Score = 2 × (precision × recall) / (precision + recall)
(a) http://en.wikipedia.org/wiki/F1_score
Evaluating Performance of a Classifier
F1 Score (Positive) = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78
F1 Score (Negative) = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
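A minimal sketch that recomputes these measures from the confusion-matrix counts in plain Python (the printed values can differ from the slides in the last digit, because the slides truncate intermediate results):

TP, FP, FN, TN = 562, 77, 225, 436

accuracy = (TP + TN) / float(TP + FP + FN + TN)
pos_precision = TP / float(TP + FP)
pos_recall = TP / float(TP + FN)
error_rate = (FP + FN) / float(TP + FP + FN + TN)
fallout = FP / float(FP + TN)
f1_pos = 2 * pos_precision * pos_recall / (pos_precision + pos_recall)

print('accuracy   %.2f' % accuracy)       # 0.77
print('precision  %.2f' % pos_precision)  # 0.88
print('recall     %.2f' % pos_recall)     # 0.71
print('error rate %.2f' % error_rate)     # 0.23
print('fall-out   %.2f' % fallout)        # 0.15
print('F1 (pos)   %.2f' % f1_pos)         # 0.79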
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.
Positive predictive value = TP / (TP + FP)
Positive predictive value = 562 / (562 + 77) = 0.87
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.
NPV = TN / (TN + FN)
NPV = 436 / (436 + 225) = 0.65
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TN / (FP + TN)
Specificity = 436 / (77 + 436) = 0.84
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.
FDR = FP / (FP + TP)
FDR = 77 / (77 + 562) = 0.12
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.
MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
MCC = (562 × 436 − 77 × 225) / √((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55
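A minimal sketch of the MCC computation (same counts as above):

import math

TP, FP, FN, TN = 562, 77, 225, 436
mcc = (TP * TN - FP * FN) / math.sqrt(
    float(TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print('%.2f' % mcc)  # 0.55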
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis, and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
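A minimal sketch of computing ROC points and AUC with sklearn (the labels and classifier scores below are made-up values for illustration):

from sklearn.metrics import roc_curve, auc

y_true = [1, 1, 0, 1, 0, 0, 1, 0]                    # actual labels (hypothetical)
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]  # predicted scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print('AUC %.2f' % auc(fpr, tpr))  # 0.75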
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child \
of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the \
second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract
References
Chris Smith, Machine Learning, Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept. 2012
Chris Smith, Machine Learning, Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept. 2012
Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006
Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept. 2012
Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008
Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php
Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999
Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012
Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012
Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008
Bag of Words
Documents
d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.
Vocabulary Index
V I(t) =
0 if t is 'football'
1 if t is 'games'
2 if t is 'john'
3 if t is 'likes'
4 if t is 'mary'
5 if t is 'movies'
6 if t is 'watch'
Bag of Words
football games john likes mary movies watch
doc1 0 0 1 2 1 1 1
doc2 1 1 1 1 0 0 1
Bag of Words
Creating Bag of Words with Python and sklearn 5
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org
Bag of Words
Creating Bag of Words with plain Python (just a sample):

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]  # tokenize() from the Tokenization slide
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
        in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
        for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies. Mary likes too.",
    "John also likes to watch football games.")
bag_of_words(d)
TF-IDF
Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.
TF-IDF
tf-idf(t) = tf(t, d) × idf(t)
where 't' is a term in document 'd';
tf(t, d): how many times the term 't' is present in 'd':
tf(t, d) = Σ x∈d fr(x, t)
where
fr(x, t) = 1 if x = t, 0 otherwise
TF-IDF
idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )
where |{d : t ∈ d}| is the number of documents in which 't' is present, i.e. tf(t, d) ≠ 0;
|D| is the cardinality of the document space.
TF
tf(t, d) = Σ x∈d fr(x, t)
fr(x, t) is a simple function:
fr(x, t) = 1 if x = t, 0 otherwise
Example: tf('john', d1) = 1
Document Vector
To create a document vector space:
V(dn) = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))
To represent 'd1' and 'd2' as vectors:
V(d1) = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))
V(d2) = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))
which evaluates to:
V(d1) = (0, 0, 1, 2, 1, 1, 1)
V(d2) = (1, 1, 1, 1, 0, 0, 1)
Vector Space Matrix
The document vectors can be represented as a matrix M(|D|×F), where |D| is the cardinality of the document space:
M(|D|×F) =
[0 0 1 2 1 1 1]
[1 1 1 1 0 0 1]
Vector Normalization
Normalized Vector
A normalized vector is represented as v̂ = ~v / ||~v||p, where v̂ is the unit vector (the normalized vector), ~v is the vector going to be normalized, and ||~v||p is the norm (magnitude or length) of the vector ~v in the Lp space (Lebesgue spaces).(a)
(a) http://en.wikipedia.org/wiki/Lp_space
Vector Normalization
Length of a vector is calculated using the Euclidean norm.6
Non-normalized vector: ~v = (v1, v2, v3, ..., vn)
Length of vector: ||~v|| = √(v1² + v2² + v3² + ... + vn²)
With norm: ||~v||p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)
It can be simplified as:
||~v||p = ( Σ i=1..n |vi|^p )^(1/p)
6 http://mathworld.wolfram.com/L2-Norm.html
Vector Normalization
L2 Norm
The norm which we apply here is the L2 norm, which is also called the Euclidean norm.(a) It is a common norm used to measure the length of a vector, where p = 2.
(a) http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm
Vector Normalization
v(d1) = (0, 0, 1, 2, 1, 1, 1)
v̂(d1) = v(d1) / ||v(d1)||2
v̂(d1) = (0, 0, 1, 2, 1, 1, 1) / √(0² + 0² + 1² + 2² + 1² + 1² + 1²)
v̂(d1) = (0, 0, 1, 2, 1, 1, 1) / √8
v̂(d1) = (0/√8, 0/√8, 1/√8, 2/√8, 1/√8, 1/√8, 1/√8)
v̂(d1) = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)
Now our normalized vector v̂(d1) has an L2 norm ||v̂(d1)||2 = 1.0. A NumPy sketch follows.
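A minimal sketch of the same normalization with NumPy (numpy is an extra dependency, not used elsewhere in the deck):

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
norm = np.linalg.norm(v_d1)   # L2 norm: sqrt(8) ~ 2.8284
v_hat = v_d1 / norm
print(v_hat)                  # [0. 0. 0.3536 0.7071 0.3536 0.3536 0.3536]
print(np.linalg.norm(v_hat))  # 1.0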
IDF
IDF
idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )
where |{d : t ∈ d}| is the number of documents in which 't' is present, i.e. tf(t, d) ≠ 0.
Finding IDF
idf(ti) = log( |D| / (1 + |{d : ti ∈ d}|) ) = log(2/1) = 0.69314718
idf(football) = log(2 / (1 + 1)) = 0.0
idf(games) = log(2 / (1 + 1)) = 0.0
idf(john) = log(2 / (1 + 2)) = −0.40546510810816444
idf(likes) = log(2 / (1 + 2)) = −0.40546510810816444
idf(mary) = log(2 / (1 + 1)) = 0.0
idf(movies) = log(2 / (1 + 1)) = 0.0
idf(watch) = log(2 / (1 + 1)) = 0.0
idf(V) = (0.0, 0.0, −0.40546510810816444, −0.40546510810816444, 0.0, 0.0, 0.0)
A short verification sketch follows.
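A minimal sketch that reproduces these idf values in plain Python (natural log, as the numbers above indicate):

import math

vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']
doc1 = ['john', 'likes', 'watch', 'movies', 'mary', 'likes']
doc2 = ['john', 'likes', 'watch', 'football', 'games']
docs = [doc1, doc2]

for t in vocab:
    df = sum(1 for d in docs if t in d)     # number of documents containing t
    idf = math.log(len(docs) / (1.0 + df))  # natural logarithm
    print('%-8s %f' % (t, idf))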
TF-IDF weight
Finding TF-IDF weight
M(|D|×F) × M(idf)
[tf(t1, d1) tf(t2, d1) tf(t3, d1) tf(t4, d1) tf(t5, d1) tf(t6, d1) tf(t7, d1)]
[tf(t1, d2) tf(t2, d2) tf(t3, d2) tf(t4, d2) tf(t5, d2) tf(t6, d2) tf(t7, d2)]
×
the diagonal matrix diag(idf(t1), idf(t2), idf(t3), idf(t4), idf(t5), idf(t6), idf(t7))
TF-IDF weight
[tf(t1, d1)×idf(t1)  tf(t2, d1)×idf(t2)  tf(t3, d1)×idf(t3)  tf(t4, d1)×idf(t4)  tf(t5, d1)×idf(t5)  tf(t6, d1)×idf(t6)  tf(t7, d1)×idf(t7)]
[tf(t1, d2)×idf(t1)  tf(t2, d2)×idf(t2)  tf(t3, d2)×idf(t3)  tf(t4, d2)×idf(t4)  tf(t5, d2)×idf(t5)  tf(t6, d2)×idf(t6)  tf(t7, d2)×idf(t7)]
TF-IDF Normalization
L2 Normalization
M(tf-idf) = M(tf-idf) / ||M(tf-idf)||2
TF-IDF
Practice with Python and sklearn 7
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, f in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[f])

7 http://scikit-learn.org
N-Grams
N-Gram
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters; however, the items in question can be phonemes, syllables, letters, words or base pairs according to the application.(a)
Unigrams are single words
Bigrams are sequences of two words
Trigrams are sequences of three words
(a) http://en.wikipedia.org/wiki/N-gram
Bigrams
Bigrams
P(wi | w1, w2, ..., wi−1) ≈ P(wi | wi−1)
Practice with Python
d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words)-2+1)]
bigrams = [' '.join(bigram) for bigram in ibigrams]
print bigrams
['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']
Trigrams
Trigrams
P(wi | w1, w2, ..., wi−1) ≈ P(wi | wi−2, wi−1)
Practice with Python
d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [' '.join(trigram) for trigram in itrigrams]
print trigrams
['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']
N-Grams
Python code to generate N-Grams from list of words
def ngrams(words, n=2):
    grams = [' '.join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams
Mutual Information
Mutual Information
Statistical test to measure the strength of word association:
I(wi, wj) = log2( P(wi, wj) / (P(wi) P(wj)) ) ≈ log2( N C(wi, wj) / (C(wi) C(wj)) )
where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.
I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )
Mutual Information
from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) / \
            math.log(2)
    return minfo
t-score
t-score
Statistical test to measure the strength of word association:
t(wi, wj) = ( mean(P(wi, wj)) − mean(P(wi)) mean(P(wj)) ) / √( σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)) )
≈ ( C(wi, wj) − (1/N) C(wi) C(wj) ) / √C(wi, wj)
where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.
t(strong, tea) = ( C(strong, tea) − (1/N) C(strong) C(tea) ) / √C(strong, tea)
t-score
from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc
Document Classification
Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.8
Document classification tasks can be divided into three kinds:
supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents;
semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance;
unsupervised document classification is entirely executed without reference to external information.
8 http://en.wikipedia.org/wiki/Document_classification
Document Classification
Formal Definition
Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (or a classifier function) γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ
γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise
Main approaches in document classification are:
Naïve Bayes (NB)
Support Vector Machines (SVM)
Document Classification
A supervised document classification pipeline9
9 Image taken from http://www.python-course.eu/text_classification_introduction.php
Naïve Bayes Classification
Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem, or Bayes' rule:
Bayes' rule: P(H|E) = P(E|H) × P(H) / P(E)
The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidences (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.
Naïve Bayes Classification
Let H be the event of raining and E be the evidence of dark clouds. Then we have:
P(raining | dark cloud) = P(dark cloud | raining) × P(raining) / P(dark cloud)
For multiple evidences:
P(H | E1, E2, ..., En) = P(E1, E2, ..., En | H) × P(H) / P(E1, E2, ..., En)
With the independence assumption, we can rewrite Bayes' rule as follows:
P(H | E1, E2, ..., En) = P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) / P(E1, E2, ..., En)
Naïve Bayes Applied in Text Classification
Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Building Naïve Bayes Model
The Naïve Bayes model for the training set will be like:

|V|  C  P(Ci)  ni  P(kill|Ci)  P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
 6   T  0.5    15  0.23809523  0.19047619  0.33333333    0.04761904   0.09523809   0.09523809
     E  0.5    12  0.05555556  0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabularies = 6
P(Ci) = the prior probability of each class = number of documents in a class / number of all documents
ni = the total number of word frequencies in each class
n(terrorism) = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n(entertainment) = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12
Building Naıve Bayes Model
To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:
P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555
Testing the NB model
Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt           2     1       2      0      0    1  ?

To find the posterior probability:
P(ci|W) = P(ci) × ∏ j=1..|V| P(wj|ci)
Testing the NB model
P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) × P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism)
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5
P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) × P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7
The document is classified as "Terrorism" because that class got the highest value. A minimal sketch reproducing this computation follows.
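A minimal sketch that reproduces the whole computation in plain Python; the per-class word counts below are the ones underlying the model table above, and the smoothing is the Laplace estimation already described:

from __future__ import division

vocab = ['kill', 'bomb', 'kidnap', 'music', 'movie', 'tv']
counts = {'Terrorism':     [4, 3, 6, 0, 1, 1],   # n = 15
          'Entertainment': [0, 1, 1, 5, 4, 1]}   # n = 12
priors = {'Terrorism': 0.5, 'Entertainment': 0.5}
test_doc = [2, 1, 2, 0, 0, 1]

def posterior(c):
    n = sum(counts[c])
    score = priors[c]
    for i, freq in enumerate(test_doc):
        p = (counts[c][i] + 1) / (n + len(vocab))  # Laplace estimation
        score *= p ** freq
    return score

for c in ('Terrorism', 'Entertainment'):
    print('%s %e' % (c, posterior(c)))
# Terrorism ~ 5.7e-05, Entertainment ~ 2.3e-07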
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289
After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely
The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf
Jaganadh G Elements of Text Mining
Naıve Bayes Classifier
There are two different ways to setup a Naıve Bayes Classifier
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta
aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by presence or absenceof the word
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument
In the NB example which we worked out we applied the multinomial model In the model we used simple bag of
words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that
〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1
〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way
If there is a clear blank space available in the table in between the square and rectangles we can say that
it is linearly separable A line which drawn in the clear space between the figuresexactly equal length
from the margin of scatter region of square and rectangle is called as separating hyperplane Everything
on the one side of the separating hyper plane belongs to one category and everything in the other side
belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter
(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin
The figures closest to the separating hyperplane are known as support vectors10 If the data is not
linearly separable we have to use kernel tricks 11
10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography
11httpwwwstatsoftcomtextbooksupport-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnnaive_bayes import MultinomialNB
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoMultinomialNB())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a SVM Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnsvm import LinearSVC
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoLinearSVC())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12
import nltkclassifyutil
from nltkclassify import NaiveBayesClassifier
from nltkcorpus import movie_reviews
def word_feats(words)
return dict([(word True) for word in words])
negids = movie_reviewsfileids(rsquonegrsquo)
posids = movie_reviewsfileids(rsquoposrsquo)
negfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquonegrsquo) for f in negids]
posfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquoposrsquo) for f in posids]
negcutoff = len(negfeats)34
poscutoff = len(posfeats)34
trainfeats = negfeats[negcutoff] + posfeats[poscutoff]
classifier = NaiveBayesClassifiertrain(trainfeats)
sent = This is really cool I like it
words = word_feats(sentlower()split())
print classifierclassify(words)
12This code is adopted from httpstreamhackercom20100510
text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm
Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)
MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)
= 055
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
Bag of Words
football games john likes mary movies watch
doc1 0 0 1 2 1 1 1
doc2 1 1 1 1 0 0 1
Jaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python and sklearn 5
from sklearnfeature_extractiontext import
CountVectorizer
vectorizer = CountVectorizer(analyzer=rsquowordrsquo
min_n=1stop_words=rsquoenglishrsquo)
docs = (rsquoJohn likes to watch movies Mary
likes toorsquorsquoJohn also likes to watch football gamesrsquo)
bow = vectorizerfit_transform(docs)
print vectorizervocabulary_
print bowtoarray()
5httpscikit-learnorgJaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python Just for sample -(
def bag_of_words(docs)
stops = [rsquotorsquorsquotoorsquorsquoalsorsquo]
token_list = [tokenize(doc) for doc in docs]
vocab = list(set(token_list[0])union(token_list))
vocab =[v for v in vocab if v not in stops and len(v) gt 1]
vocab_idex = dict( [ ( word vocabindex(word) ) for word
in vocab] )
bow = [[tokenscount(word) for word in vocab_idexkeys()]
for tokens in token_list]
print vocab_idex
for bag in bow
print bag
d = (John likes to watch movies Mary likes too
John also likes to watch football games)
bag_of_words(d)
Jaganadh G Elements of Text Mining
TF-IDF
Tfndashidf term frequencyndashinverse document frequency is a numerical statistic which reflectshow important a word is to a document in a collection or corpus
TF-IDF
tf minus idf(t) = tf(t d)times idf(t)
where rsquotrsquo is a term in document rsquodrsquotf(t d) how many times the term rsquotrsquo is present in rsquodrsquo
tf(t d) =sumxisind
fr(x t)
where
fr(x t) =
1 if x = t0 otherwise
Jaganadh G Elements of Text Mining
TF-IDF
TF-IDF
andidf(t) = log |D|
1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space
Jaganadh G Elements of Text Mining
TF
TF
tf(t d) =sumxisind
fr(x t)
fr(x t) is a simple function
fr(x t) =
1 if x = t0 otherwise
Exampletf(primejohnprime d1) = 1
Jaganadh G Elements of Text Mining
Document Vector
To create a document vector space
V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))
To represent rsquod1rsquo and rsquod2rsquo as vectors
V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))
V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
which evaluates toV ~d1 = (0 0 1 2 1 1 1)
V ~d2 = (1 1 1 1 0 0 1)
Jaganadh G Elements of Text Mining
Vector Space Matrix
The document vectors can be represented as matrix
M|D|xF
where |D| is the cardinality of the document space
M|D|xF =
[0 0 1 2 1 1 11 1 1 1 0 0 1
]
Jaganadh G Elements of Text Mining
Vector Normalization
Normalized Vector
A normalized vector is represented as v = ~v~vp where v is is the unit vector or the
normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a
ahttpenwikipediaorgwikiLp space
Jaganadh G Elements of Text Mining
Vector Normalization
Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =
radicv2
1 + v22 + v2
2 + + v2n
With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p
It can be simplified as
~v||p = (
nsumi=1
|~vi|p)1p
6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining
Vector Normalization
L2 Norm
The norm which we apply here is L2 Norm which is also called as Euclidean norm a
It is a common norm used to measure the length of a vector where p = 2
ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm
Jaganadh G Elements of Text Mining
Vector Normalization
v~d1 = (0 0 1 2 1 1 1)vd1 = ~v
~vp
vd1 = v~d1
v~d12vd1 = (0012111)radic
02+02+12+22+12+12+12
vd1 = (0012111)radic8
vd1 = ( 0radic8 0radic
8 1radic
8 2radic
8 1radic
8 1radic
8 1radic
8)
vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10
Jaganadh G Elements of Text Mining
IDF
IDF
idf(t) = log |D|1+|dtisind|
where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0
Jaganadh G Elements of Text Mining
Finding IDF
idf(ti) = log|D|
1 + |d ti isin d|= log
2
1= 069314718
idf(football) = log 21+1
= 00
idf(games) = log 21+1
= 00
idf(john) = log 21+2
= minus040546510810816444
idf(likes) = log 21+2
= minus040546510810816444
idf(mary) = log 21+1
= 00
idf(movies) = log 21+1
= 00
idf(watch) = log 21+1
= 00idf(V ) = (00-040546510810816444-040546510810816444000000)
Jaganadh G Elements of Text Mining
TF-IDF weight
Finding TF-IDF weight
M|D|timesF times Midf
[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
]x
idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)
Jaganadh G Elements of Text Mining
TF-IDF weight
Finding TF-IDF weight
M|D|timesF times Midf
[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
]x
idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)
Jaganadh G Elements of Text Mining
TF-IDF weight
tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)
Jaganadh G Elements of Text Mining
TF-IDF Normalization
L2 Normalization
Mtfminusidf =MtfminusidfMtfminusidf2
Jaganadh G Elements of Text Mining
TF-IDF
Practice with Python and sklearn 7
from sklearnfeature_extractiontext import
CountVectorizer TfidfTransformer
vectorizer = CountVectorizer(analyzer=rsquowordrsquo
min_n=1stop_words=rsquoenglishrsquo)
docs = (rsquoJohn likes to watch movies Mary
likes toorsquorsquoJohn also likes to watch football gamesrsquo)
bow = vectorizerfit_transform(docs)
freq_term_matrix =vectorizertransform(docs)
tfidf = TfidfTransformer(norm=l2)
tfd = tfidffit(freq_term_matrix)
print IDF tfidfidf_
tf_idf_matrix = tfidftransform(freq_term_matrix)
print tf_idf_matrixtodense()
for wf in zip(vectorizervocabulary_tfdidf_)
print rsquor =gt rrsquo (w tfdidf_[f])
7httpscikit-learnorgJaganadh G Elements of Text Mining
N-Grams
N-Gram
In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa
Unigrams are single words
Bigrams are sequences of two words
Trigrams are sequences of three words
ahttpenwikipediaorgwikiN-gram
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from list of words
def ngrams(wordsn=2)
grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]
return grams
words = John likes to watch movies Mary likes too
lower()split()
bigrams = ngrams(wordsn=2)
trigrams = ngrams(wordsn=3)
print bigram
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) / math.log(2)
    return minfo
t-score

t-score

Statistical test to measure the strength of word association:

t(wi, wj) = [ mean(P(wi, wj)) − mean(P(wi)) mean(P(wj)) ] / sqrt( σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)) )
≈ [ C(wi, wj) − (1/N) C(wi) C(wj) ] / sqrt( C(wi, wj) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

Example:
t(strong, tea) = [ C(strong, tea) − (1/N) C(strong) C(tea) ] / sqrt( C(strong, tea) )
t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc
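Both association measures can be exercised on the toy sentence from the n-gram slides; a minimal check (assumes the ngrams, mutual_info, and tscore functions from the previous slides are in scope):

words = "John likes to watch movies Mary likes too".lower().split()
print mutual_info(words)['likes to']  # log2(8 * 1 / (2 * 1)) = 2.0
print tscore(words)['likes to']       # (1 - (1/8) * 2 * 1) / sqrt(1) = 0.75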
Document Classification
Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.8

Document classification tasks can be divided into three kinds:

supervised document classification: performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents
semi-supervised document classification: a mixture between supervised and unsupervised classification; some documents or parts of documents are labeled by external assistance
unsupervised document classification: entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification
Document Classification
Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, we wish to learn, using a learning algorithm, a classifier or classifier function γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes the training set T and emits the learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

The main approaches in document classification are:

Naïve Bayes (NB)
Support Vector Machines (SVM)
Document Classification
A supervised document classification pipeline (figure).9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php
Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.

Bayes' rule: P(H|E) = P(E|H) × P(H) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.

P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.
Naïve Bayes Classification

Let H be the event of raining and E be the evidence of dark clouds; then we have:

P(raining | dark cloud) = P(dark cloud | raining) × P(raining) / P(dark cloud)

For multiple evidences:

P(H | E1, E2, ..., En) = P(E1, E2, ..., En | H) × P(H) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as follows:

P(H | E1, E2, ..., En) = P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) / P(E1, E2, ..., En)
Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification:

C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)

The pre-processed documents for training will look like:

Training Docs   kill  bomb  kidnap  music  movie  tv   C
D0              2     1     3       0      0      1    Terrorism
D1              1     1     1       0      0      0    Terrorism
D2              1     1     2       0      1      0    Terrorism
D3              0     1     0       2      1      1    Entertainment
D4              0     0     1       1      1      0    Entertainment
D5              0     0     0       2      2      2    Entertainment
Building Naïve Bayes Model

The Naïve Bayes model for the training set will look like this (|V| = 6):

C  P(Ci)  ni   P(kill|Ci)    P(bomb|Ci)   P(kidnap|Ci)  P(music|Ci)   P(movie|Ci)   P(tv|Ci)
T  0.5    15   0.238095238   0.19047619   0.33333333    0.047619048   0.095238095   0.095238095
E  0.5    12   0.05555556    0.11111111   0.11111111    0.33333333    0.27777778    0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class:
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class.
Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12
Building Naïve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555
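The whole model-building step fits in a few lines of Python; here is a minimal sketch that recomputes the priors and the Laplace-smoothed conditional probabilities from the toy table above (the frequency vectors are just the table rows written out by hand):

from __future__ import division

# word-frequency vectors over (kill, bomb, kidnap, music, movie, tv)
train = {
    "Terrorism":     [[2,1,3,0,0,1], [1,1,1,0,0,0], [1,1,2,0,1,0]],
    "Entertainment": [[0,1,0,2,1,1], [0,0,1,1,1,0], [0,0,0,2,2,2]],
}
V = 6  # vocabulary size
ndocs = sum(len(docs) for docs in train.values())

priors, cond = {}, {}
for c, docs in train.items():
    priors[c] = len(docs) / ndocs
    n = sum(sum(d) for d in docs)  # total word frequency in class c
    # Laplace-smoothed P(w_j|c) for every vocabulary position j
    cond[c] = [(sum(d[j] for d in docs) + 1) / (n + V) for j in range(V)]

print priors["Terrorism"], cond["Terrorism"][0]  # 0.5 and 5/21 = 0.2380...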
Testing the NB model
Our test document is:

Test Docs   kill  bomb  kidnap  music  movie  tv   C
Dt          2     1     2       0      0      1    ?

To find the posterior probability:

P(ci|W) = P(ci) × ∏_{j=1}^{V} P(wj|ci)
Testing the NB model
P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)² × P(bomb|Terrorism)¹ × P(kidnap|Terrorism)² × P(music|Terrorism)⁰ × P(movie|Terrorism)⁰ × P(tv|Terrorism)¹
= 0.5 × 0.2380² × 0.1904¹ × 0.3333² × 0.0476⁰ × 0.0952⁰ × 0.0952¹
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10⁻⁵

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)² × P(bomb|Entertainment)¹ × P(kidnap|Entertainment)² × P(music|Entertainment)⁰ × P(movie|Entertainment)⁰ × P(tv|Entertainment)¹
= 0.5 × 0.0555² × 0.1111¹ × 0.1111² × 0.3333⁰ × 0.2777⁰ × 0.1111¹
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10⁻⁷

The document is classified as "Terrorism" because that class got the highest value.
Preventing Underflow
The probability score assigned to the test document is very small. In real-world situations we will train the classifier with thousands of documents; in such cases the conditional probability values become too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities:

P(Terrorism|W) = log(0.5 × 0.2380² × 0.1904¹ × 0.3333² × 0.0476⁰ × 0.0952⁰ × 0.0952¹)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= −0.3010 − 1.2468 − 0.7203 − 0.9543 + 0 + 0 − 1.0213 = −4.2437

P(Entertainment|W) = log(0.5 × 0.0555² × 0.1111¹ × 0.1111² × 0.3333⁰ × 0.2777⁰ × 0.1111¹)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= −0.3010 − 2.511 − 0.9542 − 1.9085 + 0 + 0 − 0.9542 = −6.6289

After handling the underflow problem, our system still classifies the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
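Scoring a test document in log space is equally short; a sketch of the arg-max rule, reusing priors and cond from the sketch above:

import math

def classify(doc, priors, cond):
    # doc is a word-frequency vector aligned with the vocabulary;
    # log probabilities are summed instead of multiplying raw probabilities
    best, best_score = None, float("-inf")
    for c in priors:
        score = math.log(priors[c]) + sum(f * math.log(p)
                    for f, p in zip(doc, cond[c]) if f)
        if score > best_score:
            best, best_score = c, score
    return best

print classify([2, 1, 2, 0, 0, 1], priors, cond)  # 'Terrorism'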
Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes classifier:

Multi-variate Bernoulli Model
Multinomial Model
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a A Comparison of Event Models for Naive Bayes Text Classification. Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model

If we adopt the Multi-variate Bernoulli Model for text classification, our document space representation will look like:

Training Docs   kill  bomb  kidnap  music  movie  tv   C
D0              1     1     1       0      0      1    Terrorism
D1              1     1     1       0      0      0    Terrorism
D2              1     1     1       0      1      0    Terrorism
D3              0     1     0       1      1      1    Entertainment
D4              0     0     1       1      1      0    Entertainment
D5              0     0     0       1      1      1    Entertainment

Here you can note that the individual word frequencies have been replaced by the presence (1) or absence (0) of the word.
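scikit-learn implements this event model as BernoulliNB; a minimal sketch (assuming the same movie_reviews layout as the pipeline examples later in this deck), where CountVectorizer(binary=True) produces the presence/absence vectors:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
# binary=True records only the presence or absence of each word
classifier = Pipeline([('vect', CountVectorizer(binary=True, stop_words='english')),
                       ('clf', BernoulliNB())])
training_data = load_files(dir_data, categories=['pos', 'neg'], shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)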
Multinomial Model
Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag-of-words representation. We can also use a smoothed bag-of-words model, such as TF-IDF, in the multinomial naive Bayes model.
Support Vector Machine
A Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set (xi, yi), i = 1..n, where xi ∈ R^p (input feature vector) and yi ∈ {1, −1} is the corresponding label, either (yi = +1) or (yi = −1). To start with, we assume that our input feature vectors are linearly separable; that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = −1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable, we have to use kernel tricks to find soft margins.
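The decision rule itself is tiny; a toy sketch of f(x) = ⟨w, x⟩ + b with made-up numbers for w and b (a trained SVM would learn these from data):

import numpy as np

w, b = np.array([0.4, -0.7]), 0.1  # hypothetical learned weight vector and bias

def predict(x):
    # sign of the decision function picks the side of the hyperplane
    return 1 if np.dot(w, x) + b > 0 else -1

print predict(np.array([1.0, 0.2]))  # 0.4 - 0.14 + 0.1 = 0.36 > 0, so +1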
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margin of the scatter region of the squares and of the rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called a maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable, we have to use kernel tricks.11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.
11 http://www.statsoft.com/textbook/support-vector-machines
Practice Time
Let's try to build a Multinomial Naïve Bayes classifier with Python sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1,3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories, shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Practice Time
Let's try to build an SVM classifier with Python sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1,3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories, shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Practice Time
Let's try to build a Multi-variate Naïve Bayes classifier with Python NLTK:12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.
Evaluating Performance of a Classifier
Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                        Actual
                        Positive               Negative
Predicted  Positive     True Positive (TP)     False Positive (FP)
           Negative     False Negative (FN)    True Negative (TN)
Evaluating Performance of a Classifier
Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                        Actual
                        Positive   Negative   Total
Predicted  Positive     562        77         639
           Negative     225        436        661
           Total        787        513        1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
Evaluating Performance of a Classifier
Precision and Recall

Precision indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)
Evaluating Performance of a Classifier
                        Actual
                        Positive   Negative   Total
Predicted  Positive     562        77         639
           Negative     225        436        661
           Total        787        513        1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84
Evaluating Performance of a Classifier
Error Rate

Error rate is the percentage of things done wrong.

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                        Actual
                        Positive   Negative   Total
Predicted  Positive     562        77         639
           Negative     225        436        661
           Total        787        513        1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23
Evaluating Performance of a Classifier
Fall-out

It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                        Actual
                        Positive   Negative   Total
Predicted  Positive     562        77         639
           Negative     225        436        661
           Total        787        513        1300

Fall-out = 77 / (77 + 436) = 0.15
Evaluating Performance of a Classifier
F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score
Evaluating Performance of a Classifier
F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78
F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
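All of the scores worked out on the last few slides can be recomputed directly from the four confusion-matrix cells; a minimal sketch with the numbers above:

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0

accuracy  = (TP + TN) / (TP + FP + FN + TN)
precision = TP / (TP + FP)   # positive precision
recall    = TP / (TP + FN)   # positive recall
f1        = 2 * precision * recall / (precision + recall)
print accuracy, precision, recall, f1  # 0.7677 0.8795 0.7141 0.7882 (rounded)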
Evaluating Performance of a Classifier
Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87
Evaluating Performance of a Classifier
Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65
Evaluating Performance of a Classifier
Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84
Evaluating Performance of a Classifier
False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12
Evaluating Performance of a Classifier
Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / sqrt( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

MCC = (562 × 436 − 77 × 225) / sqrt( (562 + 77)(562 + 225)(436 + 77)(436 + 225) ) = 0.55
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis, and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Evaluating Performance of a Classifier
Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
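A minimal sketch of computing ROC points and the AUC with scikit-learn's metrics module (y_true and y_score are hypothetical labels and classifier scores, just for illustration):

from sklearn.metrics import roc_curve, auc

y_true  = [0, 0, 1, 1]           # actual labels
y_score = [0.1, 0.4, 0.35, 0.8]  # classifier scores for the positive class
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print auc(fpr, tpr)  # area under the ROC curve: 0.75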
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories, such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Named Entity Recognition
Entity Recognition with Python NLTK:

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities
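On the sample sentence this typically prints the person names, e.g. ['Abraham Lincoln', 'Thomas Lincoln', 'Nancy Lincoln']; the exact output depends on the NLTK version and the chunker models installed.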
Extracting Terms from Text
Extracting Terms with Python Topia Termextract:13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract
References

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb. 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
Bag of Words
Creating Bag of Words with Python and sklearn 5
from sklearnfeature_extractiontext import
CountVectorizer
vectorizer = CountVectorizer(analyzer=rsquowordrsquo
min_n=1stop_words=rsquoenglishrsquo)
docs = (rsquoJohn likes to watch movies Mary
likes toorsquorsquoJohn also likes to watch football gamesrsquo)
bow = vectorizerfit_transform(docs)
print vectorizervocabulary_
print bowtoarray()
5httpscikit-learnorgJaganadh G Elements of Text Mining
Bag of Words
Creating Bag of Words with Python Just for sample -(
def bag_of_words(docs)
stops = [rsquotorsquorsquotoorsquorsquoalsorsquo]
token_list = [tokenize(doc) for doc in docs]
vocab = list(set(token_list[0])union(token_list))
vocab =[v for v in vocab if v not in stops and len(v) gt 1]
vocab_idex = dict( [ ( word vocabindex(word) ) for word
in vocab] )
bow = [[tokenscount(word) for word in vocab_idexkeys()]
for tokens in token_list]
print vocab_idex
for bag in bow
print bag
d = (John likes to watch movies Mary likes too
John also likes to watch football games)
bag_of_words(d)
Jaganadh G Elements of Text Mining
TF-IDF
Tfndashidf term frequencyndashinverse document frequency is a numerical statistic which reflectshow important a word is to a document in a collection or corpus
TF-IDF
tf minus idf(t) = tf(t d)times idf(t)
where rsquotrsquo is a term in document rsquodrsquotf(t d) how many times the term rsquotrsquo is present in rsquodrsquo
tf(t d) =sumxisind
fr(x t)
where
fr(x t) =
1 if x = t0 otherwise
Jaganadh G Elements of Text Mining
TF-IDF
TF-IDF
andidf(t) = log |D|
1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space
Jaganadh G Elements of Text Mining
TF
TF
tf(t d) =sumxisind
fr(x t)
fr(x t) is a simple function
fr(x t) =
1 if x = t0 otherwise
Exampletf(primejohnprime d1) = 1
Jaganadh G Elements of Text Mining
Document Vector
To create a document vector space
V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))
To represent rsquod1rsquo and rsquod2rsquo as vectors
V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))
V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
which evaluates toV ~d1 = (0 0 1 2 1 1 1)
V ~d2 = (1 1 1 1 0 0 1)
Jaganadh G Elements of Text Mining
Vector Space Matrix
The document vectors can be represented as matrix
M|D|xF
where |D| is the cardinality of the document space
M|D|xF =
[0 0 1 2 1 1 11 1 1 1 0 0 1
]
Jaganadh G Elements of Text Mining
Vector Normalization
Normalized Vector
A normalized vector is represented as v = ~v~vp where v is is the unit vector or the
normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a
ahttpenwikipediaorgwikiLp space
Jaganadh G Elements of Text Mining
Vector Normalization
Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =
radicv2
1 + v22 + v2
2 + + v2n
With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p
It can be simplified as
~v||p = (
nsumi=1
|~vi|p)1p
6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining
Vector Normalization
L2 Norm
The norm which we apply here is L2 Norm which is also called as Euclidean norm a
It is a common norm used to measure the length of a vector where p = 2
ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm
Jaganadh G Elements of Text Mining
Vector Normalization
v~d1 = (0 0 1 2 1 1 1)vd1 = ~v
~vp
vd1 = v~d1
v~d12vd1 = (0012111)radic
02+02+12+22+12+12+12
vd1 = (0012111)radic8
vd1 = ( 0radic8 0radic
8 1radic
8 2radic
8 1radic
8 1radic
8 1radic
8)
vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10
Jaganadh G Elements of Text Mining
IDF
IDF
idf(t) = log |D|1+|dtisind|
where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0
Jaganadh G Elements of Text Mining
Finding IDF
idf(ti) = log|D|
1 + |d ti isin d|= log
2
1= 069314718
idf(football) = log 21+1
= 00
idf(games) = log 21+1
= 00
idf(john) = log 21+2
= minus040546510810816444
idf(likes) = log 21+2
= minus040546510810816444
idf(mary) = log 21+1
= 00
idf(movies) = log 21+1
= 00
idf(watch) = log 21+1
= 00idf(V ) = (00-040546510810816444-040546510810816444000000)
Jaganadh G Elements of Text Mining
TF-IDF weight
Finding TF-IDF weight
M|D|timesF times Midf
[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
]x
idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)
Jaganadh G Elements of Text Mining
TF-IDF weight
Finding TF-IDF weight
M|D|timesF times Midf
[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
]x
idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)
Jaganadh G Elements of Text Mining
TF-IDF weight
tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)
Jaganadh G Elements of Text Mining
TF-IDF Normalization
L2 Normalization
Mtfminusidf =MtfminusidfMtfminusidf2
Jaganadh G Elements of Text Mining
TF-IDF
Practice with Python and sklearn 7
from sklearnfeature_extractiontext import
CountVectorizer TfidfTransformer
vectorizer = CountVectorizer(analyzer=rsquowordrsquo
min_n=1stop_words=rsquoenglishrsquo)
docs = (rsquoJohn likes to watch movies Mary
likes toorsquorsquoJohn also likes to watch football gamesrsquo)
bow = vectorizerfit_transform(docs)
freq_term_matrix =vectorizertransform(docs)
tfidf = TfidfTransformer(norm=l2)
tfd = tfidffit(freq_term_matrix)
print IDF tfidfidf_
tf_idf_matrix = tfidftransform(freq_term_matrix)
print tf_idf_matrixtodense()
for wf in zip(vectorizervocabulary_tfdidf_)
print rsquor =gt rrsquo (w tfdidf_[f])
7httpscikit-learnorgJaganadh G Elements of Text Mining
N-Grams
N-Gram
In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa
Unigrams are single words
Bigrams are sequences of two words
Trigrams are sequences of three words
ahttpenwikipediaorgwikiN-gram
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from list of words
def ngrams(wordsn=2)
grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]
return grams
words = John likes to watch movies Mary likes too
lower()split()
bigrams = ngrams(wordsn=2)
trigrams = ngrams(wordsn=3)
print bigram
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math
def mutual_info(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
minfo =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
minfo[gram] = (mathlog( len(words) gramcount[ gram ]
wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))
mathlog( 2 )
return minfo
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math
def tscore(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
tsc =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
tsc[gram] = (gramcount[gram] - (1len(words))
wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])
mathsqrt( gramcount[gram])
return tsc
Jaganadh G Elements of Text Mining
Document Classification
Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8
Document classification tasks can be divided into three kinds
supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents
semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance
unsupervised document classification is entirely executed without reference toexternal information
8httpenwikipediaorgwikiDocument_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents
langd crang
wherelangd crangisin D times C using a learning
algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ
γ(ci dj) =
1 if dj belongs to ci0 otherwise
Main approaches in document classification are
Naıve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9
9Image taken from httpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)
P (E)
The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)
P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)
P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)
P (E1E2En)
Jaganadh G Elements of Text Mining
Naıve Bayes Applied in Text Classification
Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
Naıve Bayes Model for the training set will be like [4pt]
|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095
E 05 12 005555566 011111111 011111111 033333333 027777778 011111111
|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class
number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)
15 = 415
P (kill|Entertainment) = (0+0+0)12 = 0
12
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)
(15+|V |) = 521 = 02380
P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1
18 = 00555
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document isTest Docs kill bomb kidnap music movie tv C
Dt 2 1 2 0 0 1 To find the posterior probability
P (ci|W ) = P (ci)timesVprodj=1
P (wj |ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)
= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521
= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5
P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)
= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111
= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7
The document has classified as rdquoTerrorismrdquo because it got the highest value
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289
After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely
The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf
Jaganadh G Elements of Text Mining
Naıve Bayes Classifier
There are two different ways to setup a Naıve Bayes Classifier
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta
aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by presence or absenceof the word
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument
In the NB example which we worked out we applied the multinomial model In the model we used simple bag of
words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that
〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1
〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way
If there is a clear blank space available in the table in between the square and rectangles we can say that
it is linearly separable A line which drawn in the clear space between the figuresexactly equal length
from the margin of scatter region of square and rectangle is called as separating hyperplane Everything
on the one side of the separating hyper plane belongs to one category and everything in the other side
belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter
(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin
The figures closest to the separating hyperplane are known as support vectors10 If the data is not
linearly separable we have to use kernel tricks 11
10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography
11httpwwwstatsoftcomtextbooksupport-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnnaive_bayes import MultinomialNB
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoMultinomialNB())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a SVM Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnsvm import LinearSVC
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoLinearSVC())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12
import nltkclassifyutil
from nltkclassify import NaiveBayesClassifier
from nltkcorpus import movie_reviews
def word_feats(words)
return dict([(word True) for word in words])
negids = movie_reviewsfileids(rsquonegrsquo)
posids = movie_reviewsfileids(rsquoposrsquo)
negfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquonegrsquo) for f in negids]
posfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquoposrsquo) for f in posids]
negcutoff = len(negfeats)34
poscutoff = len(posfeats)34
trainfeats = negfeats[negcutoff] + posfeats[poscutoff]
classifier = NaiveBayesClassifiertrain(trainfeats)
sent = This is really cool I like it
words = word_feats(sentlower()split())
print classifierclassify(words)
12This code is adopted from httpstreamhackercom20100510
text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm
Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)
MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)
= 055
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris Smith. Machine Learning Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept 2012.
Chris Smith. Machine Learning Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept 2012.
Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.
Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept 2012.
Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.
Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.
Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.
Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.
Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044.
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
Jaganadh G Elements of Text Mining
Bag of Words
Creating a Bag of Words with Python (just a sample :-()

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]   # tokenize() from the earlier slide
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
        in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
        for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies. Mary likes too.",
    "John also likes to watch football games.")
bag_of_words(d)
Jaganadh G Elements of Text Mining
TF-IDF
Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.
TF-IDF
tf-idf(t) = tf(t, d) × idf(t)
where 't' is a term in document 'd', and tf(t, d) is how many times the term 't' is present in 'd':
tf(t, d) = Σ_{x ∈ d} fr(x, t)
where
fr(x, t) = 1 if x = t, 0 otherwise
Jaganadh G Elements of Text Mining
TF-IDF
TF-IDF
and
idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )
where |{d : t ∈ d}| is count(d) where 't' is present and tf(t, d) ≠ 0, and |D| is the cardinality of the document space.
Jaganadh G Elements of Text Mining
TF
TF
tf(t, d) = Σ_{x ∈ d} fr(x, t)
fr(x, t) is a simple function:
fr(x, t) = 1 if x = t, 0 otherwise
Example: tf('john', d1) = 1
Jaganadh G Elements of Text Mining
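The counting in tf(t, d) is a one-liner in Python; this sketch assumes the tokenize() helper defined earlier in the deck:

def tf(term, doc):
    # raw term frequency: number of occurrences of 'term' in the tokenized document
    return tokenize(doc).count(term)

d1 = "John likes to watch movies. Mary likes too."
print tf('john', d1)   # 1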
Document Vector
To create a document vector space
Vdn = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))
To represent 'd1' and 'd2' as vectors:
Vd1 = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))
Vd2 = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))
which evaluates to:
Vd1 = (0, 0, 1, 2, 1, 1, 1)
Vd2 = (1, 1, 1, 1, 0, 0, 1)
Jaganadh G Elements of Text Mining
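In code a document vector is just the per-term counts over a fixed vocabulary; the vocabulary ordering below is an assumption chosen to reproduce the slide's vectors:

vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']

def doc_vector(tokens, vocab):
    # term-frequency vector over a fixed vocabulary ordering
    return [tokens.count(term) for term in vocab]

print doc_vector(tokenize("John likes to watch movies. Mary likes too."), vocab)
# [0, 0, 1, 2, 1, 1, 1]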
Vector Space Matrix
The document vectors can be represented as matrix
M_{|D|×F}
where |D| is the cardinality of the document space:
M_{|D|×F} = [ 0 0 1 2 1 1 1 ]
            [ 1 1 1 1 0 0 1 ]
Jaganadh G Elements of Text Mining
Vector Normalization
Normalized Vector
A normalized vector is represented as û = v / ||v||p, where û is the unit vector (the normalized vector), v is the vector to be normalized, and ||v||p is the norm (magnitude or length) of the vector v in the Lp space (Lebesgue spaces). a
a http://en.wikipedia.org/wiki/Lp_space
Jaganadh G Elements of Text Mining
Vector Normalization
Length of a vector is calculated using the Euclidean norm. 6
Non-normalized vector: v = (v1, v2, v3, ..., vn)
Length of vector: ||v|| = √(v1² + v2² + v3² + ... + vn²)
With norm: ||v||p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)
It can be simplified as:
||v||p = ( Σ_{i=1}^{n} |vi|^p )^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining
Vector Normalization
L2 Norm
The norm we apply here is the L2 norm, also called the Euclidean norm. a
It is a common norm used to measure the length of a vector, where p = 2.
a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm
Jaganadh G Elements of Text Mining
Vector Normalization
vd1 = (0, 0, 1, 2, 1, 1, 1)
ûd1 = vd1 / ||vd1||2
ûd1 = (0, 0, 1, 2, 1, 1, 1) / √(0² + 0² + 1² + 2² + 1² + 1² + 1²)
ûd1 = (0, 0, 1, 2, 1, 1, 1) / √8
ûd1 = (0/√8, 0/√8, 1/√8, 2/√8, 1/√8, 1/√8, 1/√8)
ûd1 = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)
Our normalized vector ûd1 now has L2 norm ||ûd1||2 = 1.0.
Jaganadh G Elements of Text Mining
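The same arithmetic as a small Python sketch:

from __future__ import division
import math

def l2_normalize(vec):
    # divide each component by the Euclidean (L2) length of the vector
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

print l2_normalize([0, 0, 1, 2, 1, 1, 1])
# [0.0, 0.0, 0.3535..., 0.7071..., 0.3535..., 0.3535..., 0.3535...]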
IDF
IDF
idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )
where |{d : t ∈ d}| is count(d) where 't' is present and tf(t, d) ≠ 0
Jaganadh G Elements of Text Mining
Finding IDF
idf(ti) = log( |D| / (1 + |{d : ti ∈ d}|) ), e.g. log(2/1) = 0.69314718
idf(football) = log(2 / (1 + 1)) = 0.0
idf(games) = log(2 / (1 + 1)) = 0.0
idf(john) = log(2 / (1 + 2)) = −0.40546510810816444
idf(likes) = log(2 / (1 + 2)) = −0.40546510810816444
idf(mary) = log(2 / (1 + 1)) = 0.0
idf(movies) = log(2 / (1 + 1)) = 0.0
idf(watch) = log(2 / (1 + 1)) = 0.0
idf(V) = (0.0, 0.0, −0.40546510810816444, −0.40546510810816444, 0.0, 0.0, 0.0)
Jaganadh G Elements of Text Mining
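A direct translation of the idf formula, as a sketch (natural logarithm, add-one in the denominator as above; tokenize() is the helper from the earlier slide):

from __future__ import division
import math

def idf(term, docs):
    # docs is a list of token lists
    matches = len([doc for doc in docs if term in doc])
    return math.log(len(docs) / (1 + matches))

docs = [tokenize("John likes to watch movies. Mary likes too."),
        tokenize("John also likes to watch football games.")]
print idf('john', docs)   # -0.4054651081081644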
TF-IDF weight
Finding TF-IDF weight
M_{|D|×F} × M_idf

M_{|D|×F} = [ tf(t1, d1) tf(t2, d1) tf(t3, d1) tf(t4, d1) tf(t5, d1) tf(t6, d1) tf(t7, d1) ]
            [ tf(t1, d2) tf(t2, d2) tf(t3, d2) tf(t4, d2) tf(t5, d2) tf(t6, d2) tf(t7, d2) ]

M_idf = diag( idf(t1), idf(t2), idf(t3), idf(t4), idf(t5), idf(t6), idf(t7) )
(a diagonal matrix with the idf values on the diagonal and zeros elsewhere)
Jaganadh G Elements of Text Mining
TF-IDF weight
M_{|D|×F} × M_idf =
[ tf(t1, d1)×idf(t1) tf(t2, d1)×idf(t2) tf(t3, d1)×idf(t3) tf(t4, d1)×idf(t4) tf(t5, d1)×idf(t5) tf(t6, d1)×idf(t6) tf(t7, d1)×idf(t7) ]
[ tf(t1, d2)×idf(t1) tf(t2, d2)×idf(t2) tf(t3, d2)×idf(t3) tf(t4, d2)×idf(t4) tf(t5, d2)×idf(t5) tf(t6, d2)×idf(t6) tf(t7, d2)×idf(t7) ]
Jaganadh G Elements of Text Mining
TF-IDF Normalization
L2 Normalization
M_tf-idf = M_tf-idf / ||M_tf-idf||2
Jaganadh G Elements of Text Mining
TF-IDF
Practice with Python and sklearn 7
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')   # min_n/max_n are the pre-0.13 scikit-learn n-gram parameters
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfd.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, i in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[i])

7 http://scikit-learn.org
Jaganadh G Elements of Text Mining
N-Grams
N-Gram
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application. a
Unigrams are single words
Bigrams are sequences of two words
Trigrams are sequences of three words
ahttpenwikipediaorgwikiN-gram
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P(wi | w1, w2, ..., wi−1) ≈ P(wi | wi−1)
Practice with Python

d1 = "John likes to watch movies. Mary likes too."
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words)-2+1)]
bigrams = [' '.join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies.',
'movies. mary', 'mary likes', 'likes too.']
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P(wi | w1, w2, ..., wi−1) ≈ P(wi | wi−2, wi−1)
Practice with Python

d1 = "John likes to watch movies. Mary likes too."
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [' '.join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies.',
'watch movies. mary', 'movies. mary likes', 'mary likes too.']
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from list of words
def ngrams(words, n=2):
    grams = [' '.join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies. Mary likes too.".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure the strength of word association:
I(wi, wj) = log2( P(wi, wj) / (P(wi) P(wj)) ) ≈ log2( N C(wi, wj) / (C(wi) C(wj)) )
where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.
I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)   # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) \
            / math.log(2)
    return minfo
Jaganadh G Elements of Text Mining
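A quick usage sketch on a made-up corpus:

words = "strong tea is strong and strong tea is good".lower().split()
mi = mutual_info(words)
print mi['strong tea']   # log2(9 * 2 / (3 * 2)) = 1.5849...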
t-score
t-score
Statistical test to measure the strength of word association:
t(wi, wj) = (mean(P(wi, wj)) − mean(P(wi)) mean(P(wj))) / √(σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)))
≈ (C(wi, wj) − (1/N) C(wi) C(wj)) / √C(wi, wj)
where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.
t(strong, tea) = (C(strong, tea) − (1/N) C(strong) C(tea)) / √C(strong, tea)
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)   # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) \
            / math.sqrt(gramcount[gram])
    return tsc
Jaganadh G Elements of Text Mining
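And a matching usage sketch:

words = "strong tea is strong and strong tea is good".lower().split()
print tscore(words)['strong tea']   # (2 - (1/9) * 3 * 2) / sqrt(2) = 0.9428...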
Document Classification
Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8
Document classification tasks can be divided into three kinds:
supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents
semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance
unsupervised document classification is entirely executed without reference to external information
8 http://en.wikipedia.org/wiki/Document_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (a classifier function) γ that maps documents to classes: γ : D → C
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ
γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise
Main approaches in document classification are:
Naïve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9
9 Image taken from http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
Naïve Bayes Classification
Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule:
Bayes' rule: P(H|E) = P(E|H) × P(H) / P(E)
The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.
Jaganadh G Elements of Text Mining
Naïve Bayes Classification
Let H be the event of raining and E be the evidence of dark cloud; then we have:
P(raining | dark cloud) = P(dark cloud | raining) × P(raining) / P(dark cloud)
For multiple evidences:
P(H | E1, E2, ..., En) = P(E1, E2, ..., En | H) × P(H) / P(E1, E2, ..., En)
With the independence assumption, we can rewrite Bayes' rule as follows:
P(H | E1, E2, ..., En) = P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) / P(E1, E2, ..., En)
Jaganadh G Elements of Text Mining
Naïve Bayes Applied in Text Classification
Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naïve Bayes Model
Naïve Bayes Model for the training set will be like:

|V|  C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
6    T  0.5    15  0.238095238  0.19047619  0.333333333   0.047619048  0.095238095  0.095238095
     E  0.5    14  0.05         0.10        0.10          0.30         0.25         0.20

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word count of each class
nterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 + 2 = 14
P(wi|ci) = the conditional probability of keyword occurrence given a class; the table shows the Laplace-smoothed values derived on the next slide.
Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 14 = 0/14
Jaganadh G Elements of Text Mining
Building Naïve Bayes Model
To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:
P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0 + 0 + 0 + 1) / (14 + |V|) = 1/20 = 0.05
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document is:
Test Docs kill bomb kidnap music movie tv C
Dt        2    1    2      0     0     1  ?
To find the posterior probability:
P(ci|W) = P(ci) × Π_{j=1}^{|V|} P(wj|ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)^2 × P(bomb|Terrorism)^1 × P(kidnap|Terrorism)^2 × P(music|Terrorism)^0 × P(movie|Terrorism)^0 × P(tv|Terrorism)^1
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5
P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)^2 × P(bomb|Entertainment)^1 × P(kidnap|Entertainment)^2 × P(music|Entertainment)^0 × P(movie|Entertainment)^0 × P(tv|Entertainment)^1
= 0.5 × 0.05^2 × 0.10^1 × 0.10^2 × 0.30^0 × 0.25^0 × 0.20^1
= 0.5 × 0.0025 × 0.10 × 0.01 × 1 × 1 × 0.20
= 2.5 × 10^-7
The document is classified as "Terrorism" because that class got the highest value.
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents; in such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities:
P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= −0.3010 − 1.2468 − 0.7203 − 0.9543 + 0 + 0 − 1.0213
= −4.2437
P(Entertainment|W) = log(0.5 × 0.05^2 × 0.10^1 × 0.10^2 × 0.30^0 × 0.25^0 × 0.20^1)
= log(0.5) + 2 log(0.05) + 1 log(0.10) + 2 log(0.10) + 0 log(0.30) + 0 log(0.25) + 1 log(0.20)
= −0.3010 − 2.6021 − 1.0000 − 2.0000 + 0 + 0 − 0.6990
= −6.6021
After handling the underflow problem our system still classifies the test document as "Terrorism". From the final scores you can observe that the values are scaled nicely.
The section on Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
Jaganadh G Elements of Text Mining
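The same computation as a small sketch, with the Laplace-smoothed class-conditional probabilities hard-coded from the worked example:

from __future__ import division
import math

p_t = {'kill': 5/21, 'bomb': 4/21, 'kidnap': 7/21,
       'music': 1/21, 'movie': 2/21, 'tv': 2/21}
p_e = {'kill': 1/20, 'bomb': 2/20, 'kidnap': 2/20,
       'music': 6/20, 'movie': 5/20, 'tv': 4/20}
test = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}

def log_posterior(prior, cond, doc):
    # sum of logs instead of a product of probabilities, to prevent underflow
    return math.log10(prior) + sum(n * math.log10(cond[w])
                                   for w, n in doc.items())

print log_posterior(0.5, p_t, test)   # ~ -4.24
print log_posterior(0.5, p_e, test)   # ~ -6.60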
Naïve Bayes Classifier
There are two different ways to set up a Naïve Bayes Classifier:
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in a document. a
a A Comparison of Event Models for Naive Bayes Text Classification. Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by the presence or absence of the word.
Jaganadh G Elements of Text Mining
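A tiny sketch of the representation change, turning count vectors into presence/absence vectors:

def binarize(counts):
    # Bernoulli representation: 1 if the word occurs at all, else 0
    return [1 if c > 0 else 0 for c in counts]

print binarize([2, 1, 3, 0, 0, 1])   # [1, 1, 1, 0, 0, 1]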
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.
In the NB example which we worked out we applied the multinomial model. In that model we used a simple bag of words representation. We can also use a smoothed bag of words model like TF-IDF in the multinomial naive Bayes model.
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}_{i=1}^{n}, where xi ∈ R^p (input feature vector) and yi ∈ {1, −1} is the corresponding label, whether (yi = +1) or (yi = −1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that
⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = −1
⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but they are not all equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
Jaganadh G Elements of Text Mining
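A minimal sketch of the linear decision rule; the weight vector and bias here are made-up values, not a trained model:

def svm_predict(w, b, x):
    # sign of the decision function <w, x> + b
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else -1

print svm_predict([0.4, -0.2, 0.1], -0.3, [2.0, 1.0, 3.0])   # 1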
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed then it is not possible to separate them in a linear way.
If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn through the clear space between the figures, exactly equally distant from the margins of the square and rectangle scatter regions, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors. 10 If the data is not linearly separable we have to use kernel tricks. 11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.
11 http://www.statsoft.com/textbook/support-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn.

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"   # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build an SVM Classifier with Python Sklearn.

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"   # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK. 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/. Follow the link for a more detailed discussion.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.
Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                      Actual
                      Positive             Negative
Predicted  Positive   True Positive (TP)   False Positive (FP)
           Negative   False Negative (FN)  True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = (TP + TN) / (TP + FP + FN + TN)

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision indicates how many of the items that we identified were relevant:
Precision = TP / (TP + FP)
Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":
Recall = TP / (TP + FN)
Jaganadh G Elements of Text Mining
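As a sketch, both measures (and the F1 score defined a few slides later) from the raw confusion matrix counts:

from __future__ import division

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print precision_recall_f1(562, 77, 225)   # (0.8795..., 0.7141..., 0.7883...)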
Evaluating Performance of a Classifier
                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).
Fall-out = FP / (FP + TN)

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

Fall-out = 77 / (77 + 436) = 0.15
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a
F1 Score = 2 × (precision × recall) / (precision + recall)
a http://en.wikipedia.org/wiki/F1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78
F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.
Positive predictive value = TP / (TP + FP)
Positive predictive value = 562 / (562 + 77) = 0.87
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)
MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)
= 055
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
TF-IDF
Tfndashidf term frequencyndashinverse document frequency is a numerical statistic which reflectshow important a word is to a document in a collection or corpus
TF-IDF
tf minus idf(t) = tf(t d)times idf(t)
where rsquotrsquo is a term in document rsquodrsquotf(t d) how many times the term rsquotrsquo is present in rsquodrsquo
tf(t d) =sumxisind
fr(x t)
where
fr(x t) =
1 if x = t0 otherwise
Jaganadh G Elements of Text Mining
TF-IDF
TF-IDF
andidf(t) = log |D|
1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space
Jaganadh G Elements of Text Mining
TF
TF
tf(t d) =sumxisind
fr(x t)
fr(x t) is a simple function
fr(x t) =
1 if x = t0 otherwise
Exampletf(primejohnprime d1) = 1
Jaganadh G Elements of Text Mining
Document Vector
To create a document vector space
V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))
To represent rsquod1rsquo and rsquod2rsquo as vectors
V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))
V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
which evaluates toV ~d1 = (0 0 1 2 1 1 1)
V ~d2 = (1 1 1 1 0 0 1)
Jaganadh G Elements of Text Mining
Vector Space Matrix
The document vectors can be represented as matrix
M|D|xF
where |D| is the cardinality of the document space
M|D|xF =
[0 0 1 2 1 1 11 1 1 1 0 0 1
]
Jaganadh G Elements of Text Mining
Vector Normalization
Normalized Vector
A normalized vector is represented as v = ~v~vp where v is is the unit vector or the
normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a
ahttpenwikipediaorgwikiLp space
Jaganadh G Elements of Text Mining
Vector Normalization
Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =
radicv2
1 + v22 + v2
2 + + v2n
With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p
It can be simplified as
~v||p = (
nsumi=1
|~vi|p)1p
6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining
Vector Normalization
L2 Norm
The norm which we apply here is L2 Norm which is also called as Euclidean norm a
It is a common norm used to measure the length of a vector where p = 2
ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm
Jaganadh G Elements of Text Mining
Vector Normalization
v~d1 = (0 0 1 2 1 1 1)vd1 = ~v
~vp
vd1 = v~d1
v~d12vd1 = (0012111)radic
02+02+12+22+12+12+12
vd1 = (0012111)radic8
vd1 = ( 0radic8 0radic
8 1radic
8 2radic
8 1radic
8 1radic
8 1radic
8)
vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10
Jaganadh G Elements of Text Mining
IDF
IDF
idf(t) = log |D|1+|dtisind|
where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0
Jaganadh G Elements of Text Mining
Finding IDF
idf(ti) = log|D|
1 + |d ti isin d|= log
2
1= 069314718
idf(football) = log 21+1
= 00
idf(games) = log 21+1
= 00
idf(john) = log 21+2
= minus040546510810816444
idf(likes) = log 21+2
= minus040546510810816444
idf(mary) = log 21+1
= 00
idf(movies) = log 21+1
= 00
idf(watch) = log 21+1
= 00idf(V ) = (00-040546510810816444-040546510810816444000000)
Jaganadh G Elements of Text Mining
TF-IDF weight
Finding TF-IDF weight
M|D|timesF times Midf
[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
]x
idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)
Jaganadh G Elements of Text Mining
TF-IDF weight
Finding TF-IDF weight
M|D|timesF times Midf
[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
]x
idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)
Jaganadh G Elements of Text Mining
TF-IDF weight
tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)
Jaganadh G Elements of Text Mining
TF-IDF Normalization
L2 Normalization
Mtfminusidf =MtfminusidfMtfminusidf2
Jaganadh G Elements of Text Mining
TF-IDF
Practice with Python and sklearn 7
from sklearnfeature_extractiontext import
CountVectorizer TfidfTransformer
vectorizer = CountVectorizer(analyzer=rsquowordrsquo
min_n=1stop_words=rsquoenglishrsquo)
docs = (rsquoJohn likes to watch movies Mary
likes toorsquorsquoJohn also likes to watch football gamesrsquo)
bow = vectorizerfit_transform(docs)
freq_term_matrix =vectorizertransform(docs)
tfidf = TfidfTransformer(norm=l2)
tfd = tfidffit(freq_term_matrix)
print IDF tfidfidf_
tf_idf_matrix = tfidftransform(freq_term_matrix)
print tf_idf_matrixtodense()
for wf in zip(vectorizervocabulary_tfdidf_)
print rsquor =gt rrsquo (w tfdidf_[f])
7httpscikit-learnorgJaganadh G Elements of Text Mining
N-Grams
N-Gram
In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa
Unigrams are single words
Bigrams are sequences of two words
Trigrams are sequences of three words
ahttpenwikipediaorgwikiN-gram
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from list of words
def ngrams(wordsn=2)
grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]
return grams
words = John likes to watch movies Mary likes too
lower()split()
bigrams = ngrams(wordsn=2)
trigrams = ngrams(wordsn=3)
print bigram
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math
def mutual_info(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
minfo =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
minfo[gram] = (mathlog( len(words) gramcount[ gram ]
wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))
mathlog( 2 )
return minfo
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math
def tscore(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
tsc =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
tsc[gram] = (gramcount[gram] - (1len(words))
wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])
mathsqrt( gramcount[gram])
return tsc
Jaganadh G Elements of Text Mining
Document Classification
Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8
Document classification tasks can be divided into three kinds
supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents
semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance
unsupervised document classification is entirely executed without reference toexternal information
8httpenwikipediaorgwikiDocument_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents
langd crang
wherelangd crangisin D times C using a learning
algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ
γ(ci dj) =
1 if dj belongs to ci0 otherwise
Main approaches in document classification are
Naıve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9
9Image taken from httpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)
P (E)
The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)
P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)
P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)
P (E1E2En)
Jaganadh G Elements of Text Mining
Naıve Bayes Applied in Text Classification
Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
Naıve Bayes Model for the training set will be like [4pt]
|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095
E 05 12 005555566 011111111 011111111 033333333 027777778 011111111
|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class
number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)
15 = 415
P (kill|Entertainment) = (0+0+0)12 = 0
12
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)
(15+|V |) = 521 = 02380
P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1
18 = 00555
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document isTest Docs kill bomb kidnap music movie tv C
Dt 2 1 2 0 0 1 To find the posterior probability
P (ci|W ) = P (ci)timesVprodj=1
P (wj |ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)
= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521
= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5
P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)
= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111
= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7
The document has classified as rdquoTerrorismrdquo because it got the highest value
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289
After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely
The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf
Jaganadh G Elements of Text Mining
Naıve Bayes Classifier
There are two different ways to setup a Naıve Bayes Classifier
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta
aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by presence or absenceof the word
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument
In the NB example which we worked out we applied the multinomial model In the model we used simple bag of
words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that
〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1
〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way
If there is a clear blank space available in the table in between the square and rectangles we can say that
it is linearly separable A line which drawn in the clear space between the figuresexactly equal length
from the margin of scatter region of square and rectangle is called as separating hyperplane Everything
on the one side of the separating hyper plane belongs to one category and everything in the other side
belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter
(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin
The figures closest to the separating hyperplane are known as support vectors10 If the data is not
linearly separable we have to use kernel tricks 11
10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography
11httpwwwstatsoftcomtextbooksupport-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a SVM Classifier with Python Sklearn
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)
12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/. Follow the link for a more detailed discussion.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.
Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:
                        Actual
                Positive                 Negative
Predicted
  Positive      True Positive (TP)       False Positive (FP)
  Negative      False Negative (FN)      True Negative (TN)
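If you already have predicted and actual labels, scikit-learn can build this table; a small sketch with toy labels (assumed here for illustration). Note that sklearn's layout puts actual classes on rows and predictions on columns, the transpose of the table above:

from sklearn.metrics import confusion_matrix

y_actual    = [1, 1, 0, 1, 0, 0]
y_predicted = [1, 0, 0, 1, 0, 1]
print confusion_matrix(y_actual, y_predicted)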
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = (TP + TN) / (TP + FP + FN + TN)

                    Actual
            Positive   Negative   Total
Predicted
  Positive     562         77      639
  Negative     225        436      661
  Total        787        513     1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision indicates how many of the items that we identified were relevant.
Precision = TP / (TP + FP)
Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".
Recall = TP / (TP + FN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
                    Actual
            Positive   Negative   Total
Predicted
  Positive     562         77      639
  Negative     225        436      661
  Total        787        513     1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                    Actual
            Positive   Negative   Total
Predicted
  Positive     562         77      639
  Negative     225        436      661
  Total        787        513     1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).
Fall-out = FP / (FP + TN)

                    Actual
            Positive   Negative   Total
Predicted
  Positive     562         77      639
  Negative     225        436      661
  Total        787        513     1300

Fall-out = 77 / (77 + 436) = 0.15
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a
F1 Score = 2 × (precision × recall) / (precision + recall)
a http://en.wikipedia.org/wiki/F1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78
F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results that are true positives.
Positive predictive value = TP / (TP + FP)
Positive predictive value = 562 / (562 + 77) = 0.87
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.
NPV = TN / (TN + FN)
NPV = 436 / (436 + 225) = 0.65
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TN / (FP + TN)
Specificity = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.
FDR = FP / (FP + TP)
FDR = 77 / (77 + 562) = 0.12
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.
$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$
$MCC = \frac{562 \times 436 - 77 \times 225}{\sqrt{(562+77)(562+225)(436+77)(436+225)}} = 0.55$
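The whole family of metrics in this section can be recomputed from the four confusion-matrix cells; a sketch reproducing the worked numbers above:

from __future__ import division
import math

TP, FP, FN, TN = 562, 77, 225, 436
accuracy  = (TP + TN) / (TP + FP + FN + TN)                # ~0.76
precision = TP / (TP + FP)                                 # ~0.87
recall    = TP / (TP + FN)                                 # ~0.71
f1        = 2 * precision * recall / (precision + recall)  # ~0.78
fallout   = FP / (FP + TN)                                 # ~0.15
fdr       = FP / (FP + TP)                                 # ~0.12
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))         # ~0.55
print accuracy, precision, recall, f1, fallout, fdr, mcc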
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis. The false positive rate is placed on the X axis. The true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
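A minimal sketch with scikit-learn (toy labels and scores, assumed for illustration): roc_curve gives the (FPR, TPR) points and auc integrates them:

from sklearn.metrics import roc_curve, auc

y_true  = [1, 1, 0, 1, 0, 0, 1, 0]                   # actual labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]  # classifier scores
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print auc(fpr, tpr)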
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == 'NE':
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == '__main__':
    sent = ("Abraham Lincoln was born February 12, 1809, the second child "
            "of Thomas Lincoln and Nancy Lincoln")
    entities = extract_entities(sent)
    print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topia.termextract import extract

extractor = extract.TermExtractor()
text = ("Abraham Lincoln was born February 12, 1809, the "
        "second child of Thomas Lincoln and Nancy Lincoln")
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13 http://pypi.python.org/pypi/topia.termextract
Jaganadh G Elements of Text Mining
References
Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept 2012.
Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept 2012.
Pierre M Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.
Kenneth W Church and Robert L Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept 2012.
Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.
Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.
Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.
Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.
Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044
Christopher D Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.
Jaganadh G Elements of Text Mining
TF-IDF

TF-IDF

$tf\mbox{-}idf(t, d) = tf(t, d) \times idf(t)$
and $idf(t) = \log \frac{|D|}{1 + |\{d : t \in d\}|}$
where $|\{d : t \in d\}|$ is count(d) where 't' is present and $tf(t, d) \neq 0$; $|D|$ is the cardinality of the document space
TF
TF
$tf(t, d) = \sum_{x \in d} fr(x, t)$
$fr(x, t)$ is a simple function:
$fr(x, t) = \begin{cases} 1 & \text{if } x = t \\ 0 & \text{otherwise} \end{cases}$
Example: $tf('john', d_1) = 1$
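A direct translation of this definition into Python (a small sketch, not from the slides):

def tf(t, d):
    # term frequency: how many tokens x in d satisfy x == t
    return d.lower().split().count(t)

d1 = "John likes to watch movies Mary likes too"
print tf('john', d1)   # 1
print tf('likes', d1)  # 2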
Jaganadh G Elements of Text Mining
Document Vector
To create a document vector space:
$\vec{V}_{d_n} = (tf(t_1, d_n), tf(t_2, d_n), \ldots, tf(t_n, d_n))$
To represent 'd1' and 'd2' as vectors:
$\vec{V}_{d_1} = (tf(t_1, d_1), tf(t_2, d_1), tf(t_3, d_1), tf(t_4, d_1), tf(t_5, d_1), tf(t_6, d_1), tf(t_7, d_1))$
$\vec{V}_{d_2} = (tf(t_1, d_2), tf(t_2, d_2), tf(t_3, d_2), tf(t_4, d_2), tf(t_5, d_2), tf(t_6, d_2), tf(t_7, d_2))$
which evaluates to:
$\vec{V}_{d_1} = (0, 0, 1, 2, 1, 1, 1)$
$\vec{V}_{d_2} = (1, 1, 1, 1, 0, 0, 1)$
Jaganadh G Elements of Text Mining
Vector Space Matrix
The document vectors can be represented as a matrix $M_{|D| \times F}$, where $|D|$ is the cardinality of the document space:

$M_{|D| \times F} = \begin{bmatrix} 0 & 0 & 1 & 2 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 0 & 0 & 1 \end{bmatrix}$
Jaganadh G Elements of Text Mining
Vector Normalization
Normalized Vector
A normalized vector is represented as $\hat{v} = \frac{\vec{v}}{\|\vec{v}\|_p}$, where $\hat{v}$ is the unit vector, or the normalized vector, $\vec{v}$ is the vector going to be normalized, and $\|\vec{v}\|_p$ is the norm (magnitude or length) of the vector $\vec{v}$ in the Lp space (Lebesgue spaces).a
a http://en.wikipedia.org/wiki/Lp_space
Jaganadh G Elements of Text Mining
Vector Normalization
Length of a vector is calculated using the Euclidean norm.6
Non-normalized vector: $\vec{v} = (v_1, v_2, v_3, \ldots, v_n)$
Length of vector: $\|\vec{v}\| = \sqrt{v_1^2 + v_2^2 + v_3^2 + \cdots + v_n^2}$
With norm: $\|\vec{v}\|_p = (|v_1|^p + |v_2|^p + |v_3|^p + \cdots + |v_n|^p)^{1/p}$
It can be simplified as:
$\|\vec{v}\|_p = \left( \sum_{i=1}^{n} |\vec{v}_i|^p \right)^{1/p}$
6 http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining
Vector Normalization
L2 Norm
The norm which we apply here is the L2 Norm, also called the Euclidean norm.a
It is a common norm used to measure the length of a vector, where p = 2.
a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm
Jaganadh G Elements of Text Mining
Vector Normalization
$\vec{v}_{d_1} = (0, 0, 1, 2, 1, 1, 1)$
$\hat{v}_{d_1} = \frac{\vec{v}}{\|\vec{v}\|_p}$
$\hat{v}_{d_1} = \frac{\vec{v}_{d_1}}{\|\vec{v}_{d_1}\|_2}$
$\hat{v}_{d_1} = \frac{(0, 0, 1, 2, 1, 1, 1)}{\sqrt{0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2}}$
$\hat{v}_{d_1} = \frac{(0, 0, 1, 2, 1, 1, 1)}{\sqrt{8}}$
$\hat{v}_{d_1} = (\frac{0}{\sqrt{8}}, \frac{0}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{2}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}})$
$\hat{v}_{d_1} = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)$
Now our normalized vector $\hat{v}_{d_1}$ has an L2-norm $\|\hat{v}_{d_1}\|_2 = 1.0$.
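The same computation in Python (a small sketch) reproduces the normalized vector:

from __future__ import division
import math

v = [0, 0, 1, 2, 1, 1, 1]
norm = math.sqrt(sum(x * x for x in v))  # L2 norm = sqrt(8)
print [x / norm for x in v]
# [0.0, 0.0, 0.3535..., 0.7071..., 0.3535..., 0.3535..., 0.3535...]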
Jaganadh G Elements of Text Mining
IDF
IDF
$idf(t) = \log \frac{|D|}{1 + |\{d : t \in d\}|}$
where $|\{d : t \in d\}|$ is count(d) where 't' is present and $tf(t, d) \neq 0$
Jaganadh G Elements of Text Mining
Finding IDF
$idf(t_i) = \log \frac{|D|}{1 + |\{d : t_i \in d\}|} = \log \frac{2}{1} = 0.69314718$
idf(football) = log(2 / (1 + 1)) = 0.0
idf(games) = log(2 / (1 + 1)) = 0.0
idf(john) = log(2 / (1 + 2)) = -0.40546510810816444
idf(likes) = log(2 / (1 + 2)) = -0.40546510810816444
idf(mary) = log(2 / (1 + 1)) = 0.0
idf(movies) = log(2 / (1 + 1)) = 0.0
idf(watch) = log(2 / (1 + 1)) = 0.0
idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
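A sketch implementing this idf over the two example documents reproduces the values above:

from __future__ import division
import math

docs = ("john likes to watch movies mary likes too".split(),
        "john also likes to watch football games".split())

def idf(term):
    df = sum(1 for d in docs if term in d)  # document frequency
    return math.log(len(docs) / (1 + df))

print idf('football'), idf('john')  # 0.0 -0.40546510810816444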
Jaganadh G Elements of Text Mining
TF-IDF weight
Finding TF-IDF weight
$M_{|D| \times F} \times M_{idf}$

$\begin{bmatrix}
tf(t_1, d_1) & tf(t_2, d_1) & tf(t_3, d_1) & tf(t_4, d_1) & tf(t_5, d_1) & tf(t_6, d_1) & tf(t_7, d_1) \\
tf(t_1, d_2) & tf(t_2, d_2) & tf(t_3, d_2) & tf(t_4, d_2) & tf(t_5, d_2) & tf(t_6, d_2) & tf(t_7, d_2)
\end{bmatrix}
\times
\begin{bmatrix}
idf(t_1) & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & idf(t_2) & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & idf(t_3) & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & idf(t_4) & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & idf(t_5) & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & idf(t_6) & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & idf(t_7)
\end{bmatrix}$
Jaganadh G Elements of Text Mining
TF-IDF weight
$\begin{bmatrix}
tf(t_1, d_1) \times idf(t_1) & tf(t_2, d_1) \times idf(t_2) & \cdots & tf(t_7, d_1) \times idf(t_7) \\
tf(t_1, d_2) \times idf(t_1) & tf(t_2, d_2) \times idf(t_2) & \cdots & tf(t_7, d_2) \times idf(t_7)
\end{bmatrix}$
Jaganadh G Elements of Text Mining
TF-IDF Normalization
L2 Normalization
$M_{tf\mbox{-}idf} = \frac{M_{tf\mbox{-}idf}}{\|M_{tf\mbox{-}idf}\|_2}$
Jaganadh G Elements of Text Mining
TF-IDF
Practice with Python and sklearn 7
from sklearn.feature_extraction.text import \
    CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
        'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, i in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[i])
7 http://scikit-learn.org
Jaganadh G Elements of Text Mining
N-Grams
N-Gram
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application.a
Unigrams are single words
Bigrams are sequences of two words
Trigrams are sequences of three words
a http://en.wikipedia.org/wiki/N-gram
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
$P(w_i | w_1, w_2, \ldots, w_{i-1}) \approx P(w_i | w_{i-1})$
Practice with Python
d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
'movies mary', 'mary likes', 'likes too']
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
$P(w_i | w_1, w_2, \ldots, w_{i-1}) \approx P(w_i | w_{i-2}, w_{i-1})$
Practice with Python
d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
'watch movies mary', 'movies mary likes', 'mary likes too']
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from list of words
def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too"\
        .lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure the strength of word association:

$I(w_i, w_j) = \log_2 \frac{P(w_i, w_j)}{P(w_i)P(w_j)} \approx \log_2 \frac{N \, C(w_i, w_j)}{C(w_i)C(w_j)}$

where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus, $C(w_i, w_j)$ is the frequency of the bigram $w_i w_j$, and $N$ is the total number of words in the corpus.

$I(strong, tea) = \log_2 \frac{P(strong, tea)}{P(strong)P(tea)} \approx \log_2 \frac{N \, C(strong, tea)}{C(strong)C(tea)}$
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))) \
            / math.log(2)
    return minfo
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure the strength of word association:

$t(w_i, w_j) = \frac{mean(P(w_i, w_j)) - mean(P(w_i)) \, mean(P(w_j))}{\sqrt{\sigma^2(P(w_i, w_j)) + \sigma^2(P(w_i)) \, \sigma^2(P(w_j))}} \approx \frac{C(w_i, w_j) - \frac{1}{N} C(w_i) C(w_j)}{\sqrt{C(w_i, w_j)}}$

where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus, $C(w_i, w_j)$ is the frequency of the bigram $w_i w_j$, and $N$ is the total number of words in the corpus.

$t(strong, tea) = \frac{C(strong, tea) - \frac{1}{N} C(strong) C(tea)}{\sqrt{C(strong, tea)}}$
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc
Jaganadh G Elements of Text Mining
Document Classification
Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.8
Document classification tasks can be divided into three kinds:
supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents
semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance
unsupervised document classification is entirely executed without reference to external information
8 http://en.wikipedia.org/wiki/Document_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let $C = (c_1, c_2, c_3, \ldots, c_m)$ be a set of pre-defined categories.
Let $D = (d_1, d_2, d_3, \ldots, d_n)$ be a set of documents to be classified.
Given a training set $T$ of labeled documents $\langle d, c \rangle$, where $\langle d, c \rangle \in D \times C$, using a learning algorithm we wish to learn a classifier, or a classifier function, $\gamma$ that maps documents to classes: $\gamma : D \rightarrow C$.
A supervised learning algorithm $\Gamma$ takes a training set $T$ and emits a learned classification function $\gamma$: $\Gamma(T) = \gamma$.

$\gamma(c_i, d_j) = \begin{cases} 1 & \text{if } d_j \text{ belongs to } c_i \\ 0 & \text{otherwise} \end{cases}$
Main approaches in document classification are
Naïve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9
9 Image taken from http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
Naïve Bayes Classification
Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.
Bayes' rule: $P(H|E) = \frac{P(E|H) \times P(H)}{P(E)}$
The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the a priori probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.
Jaganadh G Elements of Text Mining
Naïve Bayes Classification
Let H be the event of raining and E be the evidence of dark cloud; then we have:
$P(raining | dark\ cloud) = \frac{P(dark\ cloud | raining) \times P(raining)}{P(dark\ cloud)}$
For multiple evidences:
$P(H | E_1, E_2, \ldots, E_n) = \frac{P(E_1, E_2, \ldots, E_n | H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$
With the independence assumption, we can rewrite Bayes' rule as follows:
$P(H | E_1, E_2, \ldots, E_n) = \frac{P(E_1|H) \times P(E_2|H) \times \cdots \times P(E_n|H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$
Jaganadh G Elements of Text Mining
Naïve Bayes Applied in Text Classification
Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naïve Bayes Model
Naïve Bayes Model for the training set will look like this:
|V|  C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
 6   T  0.5    15  0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
     E  0.5    12  0.05555556   0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabularies = 6
P(Ci) = the a priori probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
nterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12
Jaganadh G Elements of Text Mining
Building Naïve Bayes Model
To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:
P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555
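A sketch of the Laplace-smoothed estimate as a small function (counts taken from the table above):

from __future__ import division

V = 6                              # vocabulary size
n = {'T': 15, 'E': 12}             # total word frequency per class
count = {('kill', 'T'): 4, ('kill', 'E'): 0}  # raw counts from the table

def p(word, c):
    # Laplace (add-one) smoothed conditional probability
    return (count[(word, c)] + 1) / (n[c] + V)

print p('kill', 'T'), p('kill', 'E')  # 0.2380... 0.0555...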
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt          2     1      2       0      0    1   ?

To find the posterior probability:

$P(c_i|W) = P(c_i) \times \prod_{j=1}^{|V|} P(w_j|c_i)$
Jaganadh G Elements of Text Mining
Testing the NB model
P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) × P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism)
$= 0.5 \times 0.2380^2 \times 0.1904^1 \times 0.3333^2 \times 0.0476^0 \times 0.0952^0 \times 0.0952^1$
$= 0.5 \times 0.0566 \times 0.1904 \times 0.1110 \times 1 \times 1 \times 0.0952 = 5.7 \times 10^{-5}$
P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) × P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)
$= 0.5 \times 0.0555^2 \times 0.1111^1 \times 0.1111^2 \times 0.3333^0 \times 0.2777^0 \times 0.1111^1$
$= 0.5 \times 0.0030 \times 0.1111 \times 0.0123 \times 1 \times 1 \times 0.1111 = 2.27 \times 10^{-7}$
The document has been classified as "Terrorism" because it got the highest value.
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called Underflow. To resolve the problem we can take the logarithm of the probabilities, like:
P(Terrorism|W) = $\log(0.5 \times 0.2380^2 \times 0.1904^1 \times 0.3333^2 \times 0.0476^0 \times 0.0952^0 \times 0.0952^1)$
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213 = -4.2437
P(Entertainment|W) = $\log(0.5 \times 0.0555^2 \times 0.1111^1 \times 0.1111^2 \times 0.3333^0 \times 0.2777^0 \times 0.1111^1)$
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542 = -6.6289
After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.
The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
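A sketch of the log-space scoring (base-10 logs, matching the arithmetic above; probabilities and exponents copied from the worked example):

import math

# smoothed P(ci), P(kill|ci), P(bomb|ci), P(kidnap|ci), P(tv|ci)
probs = {'Terrorism':     [0.5, 0.2380, 0.1904, 0.3333, 0.0952],
         'Entertainment': [0.5, 0.0555, 0.1111, 0.1111, 0.1111]}
counts = [1, 2, 1, 2, 1]  # prior once; kill x2, bomb x1, kidnap x2, tv x1

for c in probs:
    score = sum(k * math.log10(p) for k, p in zip(counts, probs[c]))
    print c, score  # approx -4.24 vs -6.63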
Jaganadh G Elements of Text Mining
Naïve Bayes Classifier
There are two different ways to set up a Naïve Bayes Classifier:
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V, each dimension of the space t, $t \in \{1, \ldots, |V|\}$, corresponds to word $w_t$ from the vocabulary. Dimension t of the vector for document $d_i$ is written $B_{it}$, and is either 0 or 1, indicating whether word $w_t$ occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a
a Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by presence or absenceof the word
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument
In the NB example which we worked out we applied the multinomial model In the model we used simple bag of
words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that
〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1
〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way
If there is a clear blank space available in the table in between the square and rectangles we can say that
it is linearly separable A line which drawn in the clear space between the figuresexactly equal length
from the margin of scatter region of square and rectangle is called as separating hyperplane Everything
on the one side of the separating hyper plane belongs to one category and everything in the other side
belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter
(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin
The figures closest to the separating hyperplane are known as support vectors10 If the data is not
linearly separable we have to use kernel tricks 11
10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography
11httpwwwstatsoftcomtextbooksupport-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnnaive_bayes import MultinomialNB
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoMultinomialNB())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a SVM Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnsvm import LinearSVC
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoLinearSVC())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12
import nltkclassifyutil
from nltkclassify import NaiveBayesClassifier
from nltkcorpus import movie_reviews
def word_feats(words)
return dict([(word True) for word in words])
negids = movie_reviewsfileids(rsquonegrsquo)
posids = movie_reviewsfileids(rsquoposrsquo)
negfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquonegrsquo) for f in negids]
posfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquoposrsquo) for f in posids]
negcutoff = len(negfeats)34
poscutoff = len(posfeats)34
trainfeats = negfeats[negcutoff] + posfeats[poscutoff]
classifier = NaiveBayesClassifiertrain(trainfeats)
sent = This is really cool I like it
words = word_feats(sentlower()split())
print classifierclassify(words)
12This code is adopted from httpstreamhackercom20100510
text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm
Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)
MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)
= 055
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
TF
TF
tf(t d) =sumxisind
fr(x t)
fr(x t) is a simple function
fr(x t) =
1 if x = t0 otherwise
Exampletf(primejohnprime d1) = 1
Jaganadh G Elements of Text Mining
Document Vector
To create a document vector space
V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))
To represent rsquod1rsquo and rsquod2rsquo as vectors
V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))
V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
which evaluates toV ~d1 = (0 0 1 2 1 1 1)
V ~d2 = (1 1 1 1 0 0 1)
Jaganadh G Elements of Text Mining
Vector Space Matrix
The document vectors can be represented as matrix
M|D|xF
where |D| is the cardinality of the document space
M|D|xF =
[0 0 1 2 1 1 11 1 1 1 0 0 1
]
Jaganadh G Elements of Text Mining
Vector Normalization
Normalized Vector
A normalized vector is represented as v = ~v~vp where v is is the unit vector or the
normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a
ahttpenwikipediaorgwikiLp space
Jaganadh G Elements of Text Mining
Vector Normalization
Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =
radicv2
1 + v22 + v2
2 + + v2n
With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p
It can be simplified as
~v||p = (
nsumi=1
|~vi|p)1p
6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining
Vector Normalization
L2 Norm
The norm which we apply here is L2 Norm which is also called as Euclidean norm a
It is a common norm used to measure the length of a vector where p = 2
ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm
Jaganadh G Elements of Text Mining
Vector Normalization
v~d1 = (0 0 1 2 1 1 1)vd1 = ~v
~vp
vd1 = v~d1
v~d12vd1 = (0012111)radic
02+02+12+22+12+12+12
vd1 = (0012111)radic8
vd1 = ( 0radic8 0radic
8 1radic
8 2radic
8 1radic
8 1radic
8 1radic
8)
vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10
Jaganadh G Elements of Text Mining
IDF
IDF
idf(t) = log |D|1+|dtisind|
where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0
Jaganadh G Elements of Text Mining
Finding IDF
idf(ti) = log|D|
1 + |d ti isin d|= log
2
1= 069314718
idf(football) = log 21+1
= 00
idf(games) = log 21+1
= 00
idf(john) = log 21+2
= minus040546510810816444
idf(likes) = log 21+2
= minus040546510810816444
idf(mary) = log 21+1
= 00
idf(movies) = log 21+1
= 00
idf(watch) = log 21+1
= 00idf(V ) = (00-040546510810816444-040546510810816444000000)
Jaganadh G Elements of Text Mining
TF-IDF weight
Finding TF-IDF weight
M|D|timesF times Midf
[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
]x
idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)
Jaganadh G Elements of Text Mining
TF-IDF weight
Finding TF-IDF weight
M|D|timesF times Midf
[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
]x
idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)
Jaganadh G Elements of Text Mining
TF-IDF weight
tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)
Jaganadh G Elements of Text Mining
TF-IDF Normalization
L2 Normalization
Mtfminusidf =MtfminusidfMtfminusidf2
Jaganadh G Elements of Text Mining
TF-IDF
Practice with Python and sklearn 7
from sklearnfeature_extractiontext import
CountVectorizer TfidfTransformer
vectorizer = CountVectorizer(analyzer=rsquowordrsquo
min_n=1stop_words=rsquoenglishrsquo)
docs = (rsquoJohn likes to watch movies Mary
likes toorsquorsquoJohn also likes to watch football gamesrsquo)
bow = vectorizerfit_transform(docs)
freq_term_matrix =vectorizertransform(docs)
tfidf = TfidfTransformer(norm=l2)
tfd = tfidffit(freq_term_matrix)
print IDF tfidfidf_
tf_idf_matrix = tfidftransform(freq_term_matrix)
print tf_idf_matrixtodense()
for wf in zip(vectorizervocabulary_tfdidf_)
print rsquor =gt rrsquo (w tfdidf_[f])
7httpscikit-learnorgJaganadh G Elements of Text Mining
N-Grams
N-Gram
In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa
Unigrams are single words
Bigrams are sequences of two words
Trigrams are sequences of three words
ahttpenwikipediaorgwikiN-gram
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from list of words
def ngrams(wordsn=2)
grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]
return grams
words = John likes to watch movies Mary likes too
lower()split()
bigrams = ngrams(wordsn=2)
trigrams = ngrams(wordsn=3)
print bigram
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math
def mutual_info(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
minfo =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
minfo[gram] = (mathlog( len(words) gramcount[ gram ]
wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))
mathlog( 2 )
return minfo
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math
def tscore(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
tsc =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
tsc[gram] = (gramcount[gram] - (1len(words))
wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])
mathsqrt( gramcount[gram])
return tsc
Jaganadh G Elements of Text Mining
Document Classification
Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8
Document classification tasks can be divided into three kinds
supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents
semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance
unsupervised document classification is entirely executed without reference toexternal information
8httpenwikipediaorgwikiDocument_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents
langd crang
wherelangd crangisin D times C using a learning
algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ
γ(ci dj) =
1 if dj belongs to ci0 otherwise
Main approaches in document classification are
Naıve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9
9Image taken from httpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)
P (E)
The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)
P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)
P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)
P (E1E2En)
Jaganadh G Elements of Text Mining
Naıve Bayes Applied in Text Classification
Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naïve Bayes Model
Naïve Bayes model for the training set will be like:
|V|  C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
 6   T  0.5    15  0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
     E  0.5    12  0.055555556  0.11111111  0.11111111    0.33333333   0.27777778   0.11111111
|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class:
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class. Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12
Jaganadh G Elements of Text Mining
Building Naïve Bayes Model
To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:
P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555
Jaganadh G Elements of Text Mining
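The smoothed estimates can be sanity-checked in a few lines of Python. A minimal sketch (variable names are illustrative); the per-class word totals are read off the training table:

from __future__ import division

V = 6  # vocabulary size
counts = {
    'Terrorism':     {'kill': 4, 'bomb': 3, 'kidnap': 6, 'music': 0, 'movie': 1, 'tv': 1},
    'Entertainment': {'kill': 0, 'bomb': 1, 'kidnap': 1, 'music': 5, 'movie': 4, 'tv': 1},
}
for c in counts:
    n = sum(counts[c].values())  # 15 for Terrorism, 12 for Entertainment
    for w in counts[c]:
        p = (counts[c][w] + 1) / (n + V)  # Laplace: add 1 to the count, |V| to the denominator
        print 'P(%s|%s) = %.4f' % (w, c, p)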
Testing the NB model
Our test document is:
Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt         2     1     2       0      0      1
To find the posterior probability:
$P(c_i|W) = P(c_i) \times \prod_{j=1}^{|V|} P(w_j|c_i)$
Jaganadh G Elements of Text Mining
Testing the NB model
P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) × P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism)
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5
P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) × P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7
The document is classified as "Terrorism" because that class got the highest value.
Jaganadh G Elements of Text Mining
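A quick arithmetic check of the two posteriors (a minimal sketch; the conditionals are the Laplace-smoothed values from the model table):

from __future__ import division

p_t = [5/21, 4/21, 7/21, 1/21, 2/21, 2/21]  # P(w|Terrorism), w = kill .. tv
p_e = [1/18, 2/18, 2/18, 6/18, 5/18, 2/18]  # P(w|Entertainment)
dt = [2, 1, 2, 0, 0, 1]                     # test document word frequencies

def posterior(prior, probs, doc):
    score = prior
    for p, f in zip(probs, doc):
        score *= p ** f  # P(ci) * prod_j P(wj|ci)^freq
    return score

print posterior(0.5, p_t, dt)  # ~5.7e-05, so Terrorism wins
print posterior(0.5, p_e, dt)  # ~2.4e-07 (2.27e-07 on the slide, which rounds intermediate values)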
Preventing Underflow
The probability score assigned to the test document is very small. In real-world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called Underflow. To resolve the problem we can take the logarithm of the probabilities, like:
P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437
P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.5110 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289
After handling the underflow problem, our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.
The section on Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
Jaganadh G Elements of Text Mining
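The same comparison in log space, a minimal sketch using base-10 logarithms as in the slide:

import math

p_t = [5/21., 4/21., 7/21., 1/21., 2/21., 2/21.]  # P(w|Terrorism)
p_e = [1/18., 2/18., 2/18., 6/18., 5/18., 2/18.]  # P(w|Entertainment)
dt = [2, 1, 2, 0, 0, 1]                           # test document word frequencies

def log_posterior(prior, probs, doc):
    # a sum of logs cannot underflow the way a product of tiny probabilities can
    return math.log10(prior) + sum(f * math.log10(p)
                                   for f, p in zip(dt, probs))

print log_posterior(0.5, p_t, dt)  # ~ -4.24
print log_posterior(0.5, p_e, dt)  # ~ -6.63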
Naïve Bayes Classifier
There are two different ways to set up a Naïve Bayes Classifier:
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary $V$, each dimension of the space $t$, $t \in \{1, \ldots, |V|\}$, corresponds to word $w_t$ from the vocabulary. Dimension $t$ of the vector for document $d_i$ is written $B_{it}$, and is either 0 or 1, indicating whether word $w_t$ occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in a document. a
a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 0 Entertainment
Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
Jaganadh G Elements of Text Mining
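Turning a multinomial (frequency) vector into the Bernoulli (presence/absence) form is a one-line transformation; a small sketch using document D0:

d0 = [2, 1, 3, 0, 0, 1]               # frequencies of kill .. tv in D0
b0 = [1 if f > 0 else 0 for f in d0]  # Bernoulli representation
print b0                              # [1, 1, 1, 0, 0, 1]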
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model, a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.
In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag-of-words representation. We can use a smoothed bag-of-words model like TF-IDF in the multinomial naive Bayes model.
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^p$ (input feature vector) and $y_i \in \{1, -1\}$ is the corresponding label, whether $(y_i = +1)$ or $(y_i = -1)$. To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function $f(x) = \langle w, x \rangle + b$, where $w \in \mathbb{R}^p$ (weight vector) and $b \in \mathbb{R}$ (bias), such that
$\langle w, x_i \rangle + b > 0$ for $y_i = +1$, and $\langle w, x_i \rangle + b < 0$ for $y_i = -1$
$\langle w, x \rangle + b = 0$ is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
Jaganadh G Elements of Text Mining
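The decision rule can be written down directly. A minimal sketch with an illustrative (not learned) weight vector and bias, just to show the sign test on <w, x> + b:

def predict(w, b, x):
    score = sum(wi * xi for wi, xi in zip(w, x)) + b  # <w, x> + b
    return 1 if score > 0 else -1

w, b = [0.4, -0.2, 0.1], -0.05        # illustrative values
print predict(w, b, [1.0, 0.5, 2.0])  # 1: the point lies on the positive side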
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.
If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal length from the margin of the scatter region of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors. 10 If the data is not linearly separable we have to use kernel tricks. 11
10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.
11 http://www.statsoft.com/textbook/support-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build an SVM Classifier with Python Sklearn:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK: 12
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)
12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.
Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:
                       Actual
                       Positive              Negative
Predicted  Positive    True Positive (TP)    False Positive (FP)
           Negative    False Negative (FN)   True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = (TP + TN) / (TP + FP + FN + TN)
                       Actual
                       Positive  Negative  Total
Predicted  Positive    562       77        639
           Negative    225       436       661
Total                  787       513       1300
Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
Jaganadh G Elements of Text Mining
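The same computation as a quick Python check (the four counts are taken from the table above):

TP, FP, FN, TN = 562, 77, 225, 436
accuracy = (TP + TN) / float(TP + FP + FN + TN)
print accuracy  # 0.7676..., reported as 0.76 on the slide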
Evaluating Performance of a Classifier
Precision and Recall
Precision indicates how many of the items that we identified were relevant.
Precision = TP / (TP + FP)
Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".
Recall = TP / (TP + FN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
                       Actual
                       Positive  Negative  Total
Predicted  Positive    562       77        639
           Negative    225       436       661
Total                  787       513       1300
Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
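The per-class figures can be verified the same way (a small sketch; note that the slides truncate rather than round the decimals):

TP, FP, FN, TN = 562, 77, 225, 436
print TP / float(TP + FP)  # positive precision, 0.87
print TN / float(TN + FN)  # negative precision, 0.65
print TP / float(TP + FN)  # positive recall, 0.71
print TN / float(TN + FP)  # negative recall, 0.84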
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong.
ErrorRate = (FP + FN) / (TP + FP + FN + TN)
                       Actual
                       Positive  Negative  Total
Predicted  Positive    562       77        639
           Negative    225       436       661
Total                  787       513       1300
ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).
Fall-out = FP / (FP + TN)
                       Actual
                       Positive  Negative  Total
Predicted  Positive    562       77        639
           Negative    225       436       661
Total                  787       513       1300
Fall-out = 77 / (77 + 436) = 0.15
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a
F1 Score = 2 · (precision · recall) / (precision + recall)
a http://en.wikipedia.org/wiki/F1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 · (0.87 · 0.71) / (0.87 + 0.71) = 0.78
F1 Score Negative = 2 · (0.65 · 0.84) / (0.65 + 0.84) = 0.73
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.
Positive predictive value = TP / (TP + FP)
Positive predictive value = 562 / (562 + 77) = 0.87
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.
NPV = TN / (TN + FN)
NPV = 436 / (436 + 225) = 0.65
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TN / (FP + TN)
Specificity = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.
FDR = FP / (FP + TP)
FDR = 77 / (77 + 562) = 0.12
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.
$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$
$MCC = \frac{562 \times 436 - 77 \times 225}{\sqrt{(562+77)(562+225)(436+77)(436+225)}} = 0.55$
Jaganadh G Elements of Text Mining
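All of the rates from the preceding slides follow mechanically from the four cells of the confusion matrix; a consolidated sketch:

import math

TP, FP, FN, TN = 562, 77, 225, 436
error_rate = (FP + FN) / float(TP + FP + FN + TN)  # 0.23
fall_out = FP / float(FP + TN)                     # 0.15
npv = TN / float(TN + FN)                          # 0.65
specificity = TN / float(FP + TN)                  # 0.84
fdr = FP / float(FP + TP)                          # 0.12
mcc = (TP * TN - FP * FN) / math.sqrt(
    float((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)))
print mcc  # 0.55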
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis, and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
Jaganadh G Elements of Text Mining
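With sklearn, both the ROC curve and the AUC can be computed from true labels and classifier scores. A minimal sketch with made-up labels and scores, using roc_curve and auc from sklearn.metrics:

from sklearn.metrics import roc_curve, auc

y_true = [1, 1, 1, 1, 0, 0, 0, 0]                   # actual labels (illustrative)
y_score = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]  # predicted scores (illustrative)
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print auc(fpr, tpr)  # area under the ROC curve; 0.9375 here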
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == 'NE':
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == '__main__':
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13 http://pypi.python.org/pypi/topia.termextract/
Jaganadh G Elements of Text Mining
References
Chris Smith. Machine Learning, Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.
Chris Smith. Machine Learning, Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.
Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.
Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.
Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.
Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.
Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb. 10, 2012.
Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.
Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
Jaganadh G Elements of Text Mining
Document Vector
To create a document vector space:
$V_{\vec{d_n}} = (tf(t_1, d_n), tf(t_2, d_n), \ldots, tf(t_n, d_n))$
To represent 'd1' and 'd2' as vectors:
$V_{\vec{d_1}} = (tf(t_1, d_1), tf(t_2, d_1), tf(t_3, d_1), tf(t_4, d_1), tf(t_5, d_1), tf(t_6, d_1), tf(t_7, d_1))$
$V_{\vec{d_2}} = (tf(t_1, d_2), tf(t_2, d_2), tf(t_3, d_2), tf(t_4, d_2), tf(t_5, d_2), tf(t_6, d_2), tf(t_7, d_2))$
which evaluates to:
$V_{\vec{d_1}} = (0, 0, 1, 2, 1, 1, 1)$
$V_{\vec{d_2}} = (1, 1, 1, 1, 0, 0, 1)$
Jaganadh G Elements of Text Mining
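These vectors can be reproduced with plain Python. A sketch assuming the seven-term vocabulary used in these slides (stop words removed, terms in alphabetical order):

vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']

def tf_vector(doc):
    words = doc.lower().split()
    return [words.count(t) for t in vocab]

print tf_vector("John likes to watch movies Mary likes too")  # [0, 0, 1, 2, 1, 1, 1]
print tf_vector("John also likes to watch football games")    # [1, 1, 1, 1, 0, 0, 1]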
Vector Space Matrix
The document vectors can be represented as a matrix $M_{|D| \times F}$, where $|D|$ is the cardinality of the document space.
$M_{|D| \times F} = \begin{bmatrix} 0 & 0 & 1 & 2 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 0 & 0 & 1 \end{bmatrix}$
Jaganadh G Elements of Text Mining
Vector Normalization
Normalized Vector
A normalized vector is represented as $\hat{v} = \frac{\vec{v}}{\|\vec{v}\|_p}$, where $\hat{v}$ is the unit vector (the normalized vector), $\vec{v}$ is the vector to be normalized, and $\|\vec{v}\|_p$ is the norm (magnitude or length) of the vector $\vec{v}$ in the $L^p$ space (Lebesgue spaces). a
a http://en.wikipedia.org/wiki/Lp_space
Jaganadh G Elements of Text Mining
Vector Normalization
Length of a vector is calculated using the Euclidean norm. 6
Non-normalized vector: $\vec{v} = (v_1, v_2, v_3, \ldots, v_n)$
Length of vector: $\|\vec{v}\| = \sqrt{v_1^2 + v_2^2 + v_3^2 + \cdots + v_n^2}$
With norm: $\|\vec{v}\|_p = (|v_1|^p + |v_2|^p + |v_3|^p + \cdots + |v_n|^p)^{1/p}$
It can be simplified as:
$\|\vec{v}\|_p = \left( \sum_{i=1}^{n} |\vec{v}_i|^p \right)^{1/p}$
6 http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining
Vector Normalization
L2 Norm
The norm which we apply here is the L2 norm, also called the Euclidean norm. a
It is a common norm used to measure the length of a vector, where p = 2.
a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm
Jaganadh G Elements of Text Mining
Vector Normalization
$\vec{v}_{d_1} = (0, 0, 1, 2, 1, 1, 1)$
$\hat{v}_{d_1} = \frac{\vec{v}_{d_1}}{\|\vec{v}_{d_1}\|_2}$
$\hat{v}_{d_1} = \frac{(0, 0, 1, 2, 1, 1, 1)}{\sqrt{0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2}}$
$\hat{v}_{d_1} = \frac{(0, 0, 1, 2, 1, 1, 1)}{\sqrt{8}}$
$\hat{v}_{d_1} = (\frac{0}{\sqrt{8}}, \frac{0}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{2}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}})$
$\hat{v}_{d_1} = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)$
Now our normalized vector $\hat{v}_{d_1}$ has an L2-norm $\|\hat{v}_{d_1}\|_2 = 1.0$.
Jaganadh G Elements of Text Mining
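The same normalization in Python (a minimal sketch; the slide truncates 0.35355... to 0.3535):

import math

v = [0, 0, 1, 2, 1, 1, 1]
norm = math.sqrt(sum(x * x for x in v))      # sqrt(8)
v_hat = [x / norm for x in v]
print v_hat
print math.sqrt(sum(x * x for x in v_hat))   # 1.0, the L2-norm of the normalized vector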
IDF
IDF
$idf(t) = \log \frac{|D|}{1 + |\{d : t \in d\}|}$
where $|\{d : t \in d\}|$ is the number of documents in which 't' is present, i.e. where $tf(t, d) \neq 0$.
Jaganadh G Elements of Text Mining
Finding IDF
$idf(t_i) = \log \frac{|D|}{1 + |\{d : t_i \in d\}|}$, e.g. $\log \frac{2}{1} = 0.69314718$
idf(football) = log(2 / (1 + 1)) = 0.0
idf(games) = log(2 / (1 + 1)) = 0.0
idf(john) = log(2 / (1 + 2)) = -0.40546510810816444
idf(likes) = log(2 / (1 + 2)) = -0.40546510810816444
idf(mary) = log(2 / (1 + 1)) = 0.0
idf(movies) = log(2 / (1 + 1)) = 0.0
idf(watch) = log(2 / (1 + 1)) = 0.0
idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
Jaganadh G Elements of Text Mining
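The idf values above can be verified with a few lines of Python. A sketch that represents each document by its content terms, with stop words already removed:

import math

docs = [['john', 'likes', 'watch', 'movies', 'mary'],
        ['john', 'likes', 'watch', 'football', 'games']]
vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']
for t in vocab:
    df = sum(1 for d in docs if t in d)        # number of documents containing t
    print t, math.log(len(docs) / (1.0 + df))  # idf(t) = log(|D| / (1 + df))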
TF-IDF weight
Finding TF-IDF weight
$M_{|D| \times F} \times M_{idf}$
$\begin{bmatrix} tf(t_1, d_1) & tf(t_2, d_1) & tf(t_3, d_1) & tf(t_4, d_1) & tf(t_5, d_1) & tf(t_6, d_1) & tf(t_7, d_1) \\ tf(t_1, d_2) & tf(t_2, d_2) & tf(t_3, d_2) & tf(t_4, d_2) & tf(t_5, d_2) & tf(t_6, d_2) & tf(t_7, d_2) \end{bmatrix} \times \begin{bmatrix} idf(t_1) & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & idf(t_2) & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & idf(t_3) & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & idf(t_4) & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & idf(t_5) & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & idf(t_6) & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & idf(t_7) \end{bmatrix}$
Jaganadh G Elements of Text Mining
TF-IDF weight
$\begin{bmatrix} tf(t_1, d_1) \times idf(t_1) & tf(t_2, d_1) \times idf(t_2) & \cdots & tf(t_7, d_1) \times idf(t_7) \\ tf(t_1, d_2) \times idf(t_1) & tf(t_2, d_2) \times idf(t_2) & \cdots & tf(t_7, d_2) \times idf(t_7) \end{bmatrix}$
Jaganadh G Elements of Text Mining
TF-IDF Normalization
L2 Normalization
$M_{tf\text{-}idf} = \frac{M_{tf\text{-}idf}}{\|M_{tf\text{-}idf}\|_2}$
Jaganadh G Elements of Text Mining
TF-IDF
Practice with Python and sklearn 7
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
    'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, f in vectorizer.vocabulary_.items():  # vocabulary_ maps term -> column index
    print '%r => %r' % (w, tfd.idf_[f])
7 http://scikit-learn.org/
Jaganadh G Elements of Text Mining
N-Grams
N-Gram
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application. a
Unigrams are single words
Bigrams are sequences of two words
Trigrams are sequences of three words
a http://en.wikipedia.org/wiki/N-gram
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
$P(w_i | w_1, w_2, \ldots, w_{i-1}) \approx P(w_i | w_{i-1})$
Practice with Python
d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words)-2+1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams
['john likes', 'likes to', 'to watch', 'watch movies',
'movies mary', 'mary likes', 'likes too']
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
$P(w_i | w_1, w_2, \ldots, w_{i-1}) \approx P(w_i | w_{i-2}, w_{i-1})$
Practice with Python
d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams
['john likes to', 'likes to watch', 'to watch movies',
'watch movies mary', 'movies mary likes', 'mary likes too']
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from a list of words:
def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure the strength of word association:
$I(w_i, w_j) = \log_2 \frac{P(w_i, w_j)}{P(w_i) P(w_j)} \approx \log_2 \frac{N \, C(w_i, w_j)}{C(w_i) \, C(w_j)}$
where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus, $C(w_i, w_j)$ is the frequency of the bigram $w_i w_j$, and $N$ is the total number of words in the corpus.
$I(strong, tea) = \log_2 \frac{P(strong, tea)}{P(strong) P(tea)} \approx \log_2 \frac{N \, C(strong, tea)}{C(strong) \, C(tea)}$
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev. slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) \
            / math.log(2)
    return minfo
Jaganadh G Elements of Text Mining
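Usage, as a sketch; it assumes the ngrams and mutual_info functions from the slides above are already defined:

words = "to be or not to be that is the question".lower().split()
scores = mutual_info(words)
for gram in sorted(scores, key=scores.get, reverse=True):
    print gram, scores[gram]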
t-score
t-score
Statistical test to measure the strength of word association:
$t(w_i, w_j) = \frac{\mathrm{mean}(P(w_i, w_j)) - \mathrm{mean}(P(w_i)) \, \mathrm{mean}(P(w_j))}{\sqrt{\sigma^2(P(w_i, w_j)) + \sigma^2(P(w_i)) \, \sigma^2(P(w_j))}} \approx \frac{C(w_i, w_j) - \frac{1}{N} C(w_i) \, C(w_j)}{\sqrt{C(w_i, w_j)}}$
where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus, $C(w_i, w_j)$ is the frequency of the bigram $w_i w_j$, and $N$ is the total number of words in the corpus.
$t(strong, tea) = \frac{C(strong, tea) - \frac{1}{N} C(strong) \, C(tea)}{\sqrt{C(strong, tea)}}$
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev. slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc
Jaganadh G Elements of Text Mining
Document Classification
Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8
Document classification tasks can be divided into three kinds
supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents
semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance
unsupervised document classification is entirely executed without reference toexternal information
8httpenwikipediaorgwikiDocument_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents
langd crang
wherelangd crangisin D times C using a learning
algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ
γ(ci dj) =
1 if dj belongs to ci0 otherwise
Main approaches in document classification are
Naıve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9
9Image taken from httpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)
P (E)
The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)
P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)
P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)
P (E1E2En)
Jaganadh G Elements of Text Mining
Naıve Bayes Applied in Text Classification
Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
Naıve Bayes Model for the training set will be like [4pt]
|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095
E 05 12 005555566 011111111 011111111 033333333 027777778 011111111
|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class
number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)
15 = 415
P (kill|Entertainment) = (0+0+0)12 = 0
12
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)
(15+|V |) = 521 = 02380
P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1
18 = 00555
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document isTest Docs kill bomb kidnap music movie tv C
Dt 2 1 2 0 0 1 To find the posterior probability
P (ci|W ) = P (ci)timesVprodj=1
P (wj |ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)
= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521
= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5
P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)
= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111
= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7
The document has classified as rdquoTerrorismrdquo because it got the highest value
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289
After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely
The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf
Jaganadh G Elements of Text Mining
Naıve Bayes Classifier
There are two different ways to setup a Naıve Bayes Classifier
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta
aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by presence or absenceof the word
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument
In the NB example which we worked out we applied the multinomial model In the model we used simple bag of
words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that
〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1
〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way
If there is a clear blank space available in the table in between the square and rectangles we can say that
it is linearly separable A line which drawn in the clear space between the figuresexactly equal length
from the margin of scatter region of square and rectangle is called as separating hyperplane Everything
on the one side of the separating hyper plane belongs to one category and everything in the other side
belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter
(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin
The figures closest to the separating hyperplane are known as support vectors10 If the data is not
linearly separable we have to use kernel tricks 11
10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography
11httpwwwstatsoftcomtextbooksupport-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnnaive_bayes import MultinomialNB
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoMultinomialNB())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a SVM Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnsvm import LinearSVC
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoLinearSVC())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12
import nltkclassifyutil
from nltkclassify import NaiveBayesClassifier
from nltkcorpus import movie_reviews
def word_feats(words)
return dict([(word True) for word in words])
negids = movie_reviewsfileids(rsquonegrsquo)
posids = movie_reviewsfileids(rsquoposrsquo)
negfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquonegrsquo) for f in negids]
posfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquoposrsquo) for f in posids]
negcutoff = len(negfeats)34
poscutoff = len(posfeats)34
trainfeats = negfeats[negcutoff] + posfeats[poscutoff]
classifier = NaiveBayesClassifiertrain(trainfeats)
sent = This is really cool I like it
words = word_feats(sentlower()split())
print classifierclassify(words)
12This code is adopted from httpstreamhackercom20100510
text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm
Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)
MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)
= 055
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
Vector Space Matrix
The document vectors can be represented as matrix
M|D|xF
where |D| is the cardinality of the document space
M|D|xF =
[0 0 1 2 1 1 11 1 1 1 0 0 1
]
Jaganadh G Elements of Text Mining
Vector Normalization
Normalized Vector
A normalized vector is represented as v = ~v~vp where v is is the unit vector or the
normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a
ahttpenwikipediaorgwikiLp space
Jaganadh G Elements of Text Mining
Vector Normalization
Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =
radicv2
1 + v22 + v2
2 + + v2n
With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p
It can be simplified as
~v||p = (
nsumi=1
|~vi|p)1p
6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining
Vector Normalization
L2 Norm
The norm which we apply here is L2 Norm which is also called as Euclidean norm a
It is a common norm used to measure the length of a vector where p = 2
ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm
Jaganadh G Elements of Text Mining
Vector Normalization
v~d1 = (0 0 1 2 1 1 1)vd1 = ~v
~vp
vd1 = v~d1
v~d12vd1 = (0012111)radic
02+02+12+22+12+12+12
vd1 = (0012111)radic8
vd1 = ( 0radic8 0radic
8 1radic
8 2radic
8 1radic
8 1radic
8 1radic
8)
vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10
Jaganadh G Elements of Text Mining
IDF
IDF
idf(t) = log |D|1+|dtisind|
where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0
Jaganadh G Elements of Text Mining
Finding IDF
idf(ti) = log|D|
1 + |d ti isin d|= log
2
1= 069314718
idf(football) = log 21+1
= 00
idf(games) = log 21+1
= 00
idf(john) = log 21+2
= minus040546510810816444
idf(likes) = log 21+2
= minus040546510810816444
idf(mary) = log 21+1
= 00
idf(movies) = log 21+1
= 00
idf(watch) = log 21+1
= 00idf(V ) = (00-040546510810816444-040546510810816444000000)
Jaganadh G Elements of Text Mining
TF-IDF weight
Finding TF-IDF weight
M|D|timesF times Midf
[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
]x
idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)
Jaganadh G Elements of Text Mining
TF-IDF weight
Finding TF-IDF weight
M|D|timesF times Midf
[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
]x
idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)
Jaganadh G Elements of Text Mining
TF-IDF weight
tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)
Jaganadh G Elements of Text Mining
TF-IDF Normalization
L2 Normalization
Mtfminusidf =MtfminusidfMtfminusidf2
Jaganadh G Elements of Text Mining
TF-IDF
Practice with Python and sklearn 7
from sklearnfeature_extractiontext import
CountVectorizer TfidfTransformer
vectorizer = CountVectorizer(analyzer=rsquowordrsquo
min_n=1stop_words=rsquoenglishrsquo)
docs = (rsquoJohn likes to watch movies Mary
likes toorsquorsquoJohn also likes to watch football gamesrsquo)
bow = vectorizerfit_transform(docs)
freq_term_matrix =vectorizertransform(docs)
tfidf = TfidfTransformer(norm=l2)
tfd = tfidffit(freq_term_matrix)
print IDF tfidfidf_
tf_idf_matrix = tfidftransform(freq_term_matrix)
print tf_idf_matrixtodense()
for wf in zip(vectorizervocabulary_tfdidf_)
print rsquor =gt rrsquo (w tfdidf_[f])
7httpscikit-learnorgJaganadh G Elements of Text Mining
N-Grams
N-Gram
In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa
Unigrams are single words
Bigrams are sequences of two words
Trigrams are sequences of three words
ahttpenwikipediaorgwikiN-gram
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
N-Grams

Python code to generate N-Grams from a list of words:

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams
Mutual Information

Statistical test to measure the strength of word association:

$I(w_i, w_j) = \log_2 \frac{P(w_i, w_j)}{P(w_i)\,P(w_j)} \approx \log_2 \frac{N \cdot C(w_i, w_j)}{C(w_i)\,C(w_j)}$

where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus, $C(w_i, w_j)$ is the frequency of the bigram $w_i\,w_j$, and $N$ is the total number of words in the corpus.

$I(strong, tea) = \log_2 \frac{P(strong, tea)}{P(strong)\,P(tea)} \approx \log_2 \frac{N \cdot C(strong, tea)}{C(strong)\,C(tea)}$
Mutual Information
from __future__ import division
import math
def mutual_info(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
minfo =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
minfo[gram] = (mathlog( len(words) gramcount[ gram ]
wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))
mathlog( 2 )
return minfo
Jaganadh G Elements of Text Mining
t-score

Statistical test to measure the strength of word association:

$t(w_i, w_j) = \frac{\mathrm{mean}(P(w_i, w_j)) - \mathrm{mean}(P(w_i))\,\mathrm{mean}(P(w_j))}{\sqrt{\sigma^2(P(w_i, w_j)) + \sigma^2(P(w_i))\,\sigma^2(P(w_j))}} \approx \frac{C(w_i, w_j) - \frac{1}{N}\,C(w_i)\,C(w_j)}{\sqrt{C(w_i, w_j)}}$

where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus, $C(w_i, w_j)$ is the frequency of the bigram $w_i\,w_j$, and $N$ is the total number of words in the corpus.

$t(strong, tea) = \frac{C(strong, tea) - \frac{1}{N}\,C(strong)\,C(tea)}{\sqrt{C(strong, tea)}}$
t-score
from __future__ import division
import math
def tscore(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
tsc =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
tsc[gram] = (gramcount[gram] - (1len(words))
wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])
mathsqrt( gramcount[gram])
return tsc
Jaganadh G Elements of Text Mining
Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.[8]

Document classification tasks can be divided into three kinds:

supervised document classification: performed with an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents
semi-supervised document classification: a mixture between supervised and unsupervised classification; some documents or parts of documents are labeled with external assistance
unsupervised document classification: entirely executed without reference to external information

8. http://en.wikipedia.org/wiki/Document_classification
Document Classification

Formal Definition

Let $C = (c_1, c_2, c_3, \ldots, c_m)$ be a set of pre-defined categories.
Let $D = (d_1, d_2, d_3, \ldots, d_n)$ be a set of documents to be classified.
Given a training set $T$ of labeled documents $\langle d, c \rangle$, where $\langle d, c \rangle \in D \times C$, using a learning algorithm we wish to learn a classifier (or classifier function) $\gamma$ that maps documents to classes: $\gamma : D \rightarrow C$.
A supervised learning algorithm $\Gamma$ takes a training set $T$ and emits a learned classification function $\gamma$: $\Gamma(T) = \gamma$

$\gamma(c_i, d_j) = \begin{cases} 1 & \text{if } d_j \text{ belongs to } c_i \\ 0 & \text{otherwise} \end{cases}$

Main approaches in document classification are:

Naïve Bayes (NB)
Support Vector Machines (SVM)
Document Classification

A supervised document classification pipeline [9]
(figure omitted: pipeline diagram)

9. Image taken from http://www.python-course.eu/text_classification_introduction.php
Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule:

$P(H \mid E) = \frac{P(E \mid H) \times P(H)}{P(E)}$

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.
Naïve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have:

$P(raining \mid dark\ cloud) = \frac{P(dark\ cloud \mid raining) \times P(raining)}{P(dark\ cloud)}$

For multiple pieces of evidence:

$P(H \mid E_1, E_2, \ldots, E_n) = \frac{P(E_1, E_2, \ldots, E_n \mid H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

With the independence assumption, we can rewrite Bayes' rule as follows:

$P(H \mid E_1, E_2, \ldots, E_n) = \frac{P(E_1 \mid H) \times P(E_2 \mid H) \times \cdots \times P(E_n \mid H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$
Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (the vocabulary)
The pre-processed documents for training will look like:

Training Docs  kill  bomb  kidnap  music  movie  tv  C
D0             2     1     3       0      0      1   Terrorism
D1             1     1     1       0      0      0   Terrorism
D2             1     1     2       0      1      0   Terrorism
D3             0     1     0       2      1      1   Entertainment
D4             0     0     1       1      1      0   Entertainment
D5             0     0     0       2      2      0   Entertainment
Building Naïve Bayes Model

The Naïve Bayes model for the training set will look like:

|V|  C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
6    T  0.5    15  0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
     E  0.5    12  0.055555556  0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in the class) / (number of all documents)
ni = the total word frequency of each class:
n(terrorism) = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n(entertainment) = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class. For example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15
P(kill|Entertainment) = (0+0+0)/12 = 0/12
Building Naïve Bayes Model

To avoid the "zero frequency" problem we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2+1+1+1)/(15+|V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0+0+0+1)/(12+|V|) = 1/18 = 0.0555
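As a cross-check, here is a short sketch (not from the deck) that reproduces the smoothed model table from the raw per-class term counts of the training documents:

from __future__ import division

# raw per-class term counts summed over the training docs,
# vocabulary order: kill, bomb, kidnap, music, movie, tv
counts = {
    'Terrorism':     [4, 3, 6, 0, 1, 1],
    'Entertainment': [0, 1, 1, 5, 4, 1],
}
V = 6  # vocabulary size

for c, term_counts in counts.items():
    n = sum(term_counts)  # n_i, total word frequency in the class
    # Laplace smoothing: add 1 to each count and |V| to the denominator
    probs = [(tc + 1) / (n + V) for tc in term_counts]
    print c, n, [round(p, 4) for p in probs]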
Testing the NB model

Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt         2     1     2       0      0      1   ?

To find the posterior probability:

$P(c_i \mid W) = P(c_i) \times \prod_{j=1}^{V} P(w_j \mid c_i)$
Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) × P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism)
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) × P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document has been classified as "Terrorism" because that class got the highest value.
Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213 = -4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.5110 - 0.9542 - 1.9085 + 0 + 0 - 0.9542 = -6.6289

After handling the underflow problem our system still classifies the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
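The same scoring in code: a small sketch (not in the deck) that classifies the test document in log space, using the smoothed probabilities from the model table above:

from __future__ import division
import math

# smoothed P(w|c), vocabulary order: kill, bomb, kidnap, music, movie, tv
cond_probs = {
    'Terrorism':     [0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952],
    'Entertainment': [0.0555, 0.1111, 0.1111, 0.3333, 0.2777, 0.1111],
}
prior = 0.5
test_doc = [2, 1, 2, 0, 0, 1]  # term frequencies of the test document

for c, probs in cond_probs.items():
    # summing logs instead of multiplying probabilities avoids underflow
    score = math.log10(prior) + sum(f * math.log10(p)
                                    for f, p in zip(test_doc, probs))
    print c, round(score, 4)  # Terrorism wins with approx. -4.24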
Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model
Multinomial Model
Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, $t \in \{1, \ldots, |V|\}$, corresponds to word $w_t$ from the vocabulary. Dimension t of the vector for document $d_i$ is written $B_{it}$, and is either 0 or 1, indicating whether word $w_t$ occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in the document.[a]

a. A Comparison of Event Models for Naive Bayes Text Classification. Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Multi-variate Bernoulli Model

If we adopt the Multi-variate Bernoulli Model for text classification, our document space representation will look like:

Training Docs  kill  bomb  kidnap  music  movie  tv  C
D0             1     1     1       0      0      1   Terrorism
D1             1     1     1       0      0      0   Terrorism
D2             1     1     1       0      1      0   Terrorism
D3             0     1     0       1      1      1   Entertainment
D4             0     0     1       1      1      0   Entertainment
D5             0     0     0       1      1      0   Entertainment

Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag-of-words representation. We can also use a smoothed bag-of-words model like TF-IDF in the multinomial naive Bayes model.
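The two event models differ only in how a document is represented; here is a tiny sketch (built on the toy tables above) of the conversion from term frequencies to presence/absence:

# multinomial representation: raw term frequencies
# columns: kill, bomb, kidnap, music, movie, tv (Terrorism docs D0-D2)
multinomial = [[2, 1, 3, 0, 0, 1],
               [1, 1, 1, 0, 0, 0],
               [1, 1, 2, 0, 1, 0]]

# multi-variate Bernoulli representation: 1 if the term occurs, else 0
bernoulli = [[1 if freq > 0 else 0 for freq in doc] for doc in multinomial]
print bernoulli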
Support Vector Machine

A Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^p$ (input feature vector) and $y_i \in \{1, -1\}$ is the corresponding label, either $(y_i = +1)$ or $(y_i = -1)$. To start with, we assume that our input feature vectors are linearly separable; that is, there exists a function $f(x) = \langle w, x \rangle + b$, where $w \in \mathbb{R}^p$ (weight vector) and $b \in \mathbb{R}$ (bias), such that

$\langle w, x_i \rangle + b > 0$ for $y_i = +1$
$\langle w, x_i \rangle + b < 0$ for $y_i = -1$

$\langle w, x \rangle + b = 0$ is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but not all of them are equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
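To see w and b concretely, here is a minimal sketch (not from the deck) that fits a linear SVM on a toy linearly separable set with sklearn's LinearSVC (the same estimator used in the practice slides) and reads off the learned weight vector and bias:

from sklearn.svm import LinearSVC

# toy linearly separable data: two features, labels +1 / -1
X = [[2.0, 2.0], [3.0, 2.5], [-2.0, -2.0], [-3.0, -2.5]]
y = [1, 1, -1, -1]

clf = LinearSVC()
clf.fit(X, y)

w = clf.coef_[0]       # learned weight vector w
b = clf.intercept_[0]  # learned bias b
# the sign of <w, x> + b decides the class of a new point x
print w, b
print clf.predict([[1.5, 2.5], [-1.0, -2.0]])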
Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space on the table between the squares and rectangles, we can say that the data is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the scatter regions of the squares and the rectangles, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares vs. blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.[10] If the data is not linearly separable we have to use kernel tricks.[11]

10. This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.
11. http://www.statsoft.com/textbook/support-vector-machines
Support Vector Machine

(figure omitted: separating hyperplane, maximum margin and support vectors)
Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Practice Time

Let's try to build an SVM Classifier with Python sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12. This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.
Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                     Actual
                     Positive             Negative
Predicted  Positive  True Positive (TP)   False Positive (FP)
           Negative  False Negative (FN)  True Negative (TN)
Evaluating Performance of a Classifier

Accuracy of a Classifier

$Accuracy = \frac{TP + TN}{TP + FP + FN + TN}$

                     Actual
                     Positive  Negative  Total
Predicted  Positive  562       77        639
           Negative  225       436       661
Total                787       513       1300

Accuracy = (562 + 436)/(562 + 77 + 225 + 436) = 0.76
Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant:

$Precision = \frac{TP}{TP + FP}$

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

$Recall = \frac{TP}{TP + FN}$
Evaluating Performance of a Classifier

                     Actual
                     Positive  Negative  Total
Predicted  Positive  562       77        639
           Negative  225       436       661
Total                787       513       1300

Positive Precision = 562/(562 + 77) = 0.87
Negative Precision = 436/(225 + 436) = 0.65
Positive Recall = 562/(562 + 225) = 0.71
Negative Recall = 436/(77 + 436) = 0.84
Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong:

$ErrorRate = \frac{FP + FN}{TP + FP + FN + TN}$

                     Actual
                     Positive  Negative  Total
Predicted  Positive  562       77        639
           Negative  225       436       661
Total                787       513       1300

ErrorRate = (77 + 225)/(562 + 77 + 225 + 436) = 0.23
Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR):

$Fall\text{-}out = \frac{FP}{FP + TN}$

                     Actual
                     Positive  Negative  Total
Predicted  Positive  562       77        639
           Negative  225       436       661
Total                787       513       1300

Fall-out = 77/(77 + 436) = 0.15
Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.[a]

$F_1\ Score = 2 \cdot \frac{precision \cdot recall}{precision + recall}$

a. http://en.wikipedia.org/wiki/F1_score
Evaluating Performance of a Classifier

F1 Score (Positive) = 2 × (0.87 × 0.71)/(0.87 + 0.71) = 0.78
F1 Score (Negative) = 2 × (0.65 × 0.84)/(0.65 + 0.84) = 0.73
Evaluating Performance of a Classifier

Positive Predictive Value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives:

$Positive\ predictive\ value = \frac{TP}{TP + FP}$

Positive predictive value = 562/(562 + 77) = 0.87
Evaluating Performance of a Classifier

Negative Predictive Value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure:

$NPV = \frac{TN}{TN + FN}$

NPV = 436/(436 + 225) = 0.65
Evaluating Performance of a Classifier

Specificity or True Negative Rate

$Specificity = \frac{TN}{FP + TN}$

Specificity = 436/(77 + 436) = 0.84
Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons:

$FDR = \frac{FP}{FP + TP}$

FDR = 77/(77 + 562) = 0.12
Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications:

$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$

$MCC = \frac{562 \times 436 - 77 \times 225}{\sqrt{(562 + 77)(562 + 225)(436 + 77)(436 + 225)}} = 0.55$
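All of the metrics above are simple functions of the four confusion-matrix cells; here is a compact sketch (not in the deck) computing them for the example matrix:

from __future__ import division
import math

TP, FP, FN, TN = 562, 77, 225, 436

accuracy    = (TP + TN) / (TP + FP + FN + TN)
precision   = TP / (TP + FP)  # positive predictive value
recall      = TP / (TP + FN)  # sensitivity / hit rate
f1          = 2 * precision * recall / (precision + recall)
error_rate  = (FP + FN) / (TP + FP + FN + TN)
fallout     = FP / (FP + TN)  # false positive rate
npv         = TN / (TN + FN)
specificity = TN / (FP + TN)
fdr         = FP / (FP + TP)
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

# matches the slides up to rounding (the slides truncate to two decimals)
print accuracy, precision, recall, f1, error_rate
print fallout, npv, specificity, fdr, mcc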
Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
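A minimal sketch of computing ROC points and the AUC with scikit-learn's metrics module (the labels and scores below are toy values, not from the deck):

from sklearn.metrics import roc_curve, auc

y_true  = [1, 1, 0, 1, 0, 0, 1, 0]                    # actual binary labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]   # classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print "FPR:", fpr  # x axis of the ROC curve
print "TPR:", tpr  # y axis of the ROC curve
print "AUC:", auc(fpr, tpr)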
Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities
Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13. http://pypi.python.org/pypi/topia.termextract
References

Chris Smith. Machine Learning, Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith. Machine Learning, Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

References

Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
Vector Normalization

Normalized Vector

A normalized vector is represented as $\hat{v} = \frac{\vec{v}}{\|\vec{v}\|_p}$, where $\hat{v}$ is the unit vector (the normalized vector), $\vec{v}$ is the vector to be normalized, and $\|\vec{v}\|_p$ is the norm (magnitude or length) of the vector $\vec{v}$ in the $L^p$ space (Lebesgue spaces).[a]

a. http://en.wikipedia.org/wiki/Lp_space
Vector Normalization

The length of a vector is calculated using the Euclidean norm.[6]
Non-normalized vector: $\vec{v} = (v_1, v_2, v_3, \ldots, v_n)$
Length of the vector: $\|\vec{v}\| = \sqrt{v_1^2 + v_2^2 + v_3^2 + \cdots + v_n^2}$
With the norm: $\|\vec{v}\|_p = (|v_1|^p + |v_2|^p + |v_3|^p + \cdots + |v_n|^p)^{1/p}$
It can be simplified as:

$\|\vec{v}\|_p = \left( \sum_{i=1}^{n} |v_i|^p \right)^{1/p}$

6. http://mathworld.wolfram.com/L2-Norm.html
Vector Normalization

L2 Norm

The norm which we apply here is the L2 norm, also called the Euclidean norm.[a] It is a common norm used to measure the length of a vector, where p = 2.

a. http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm
Vector Normalization

$\vec{v}_{d_1} = (0, 0, 1, 2, 1, 1, 1)$

$\hat{v}_{d_1} = \frac{\vec{v}_{d_1}}{\|\vec{v}_{d_1}\|_2} = \frac{(0, 0, 1, 2, 1, 1, 1)}{\sqrt{0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2}} = \frac{(0, 0, 1, 2, 1, 1, 1)}{\sqrt{8}}$

$\hat{v}_{d_1} = \left( 0, 0, \frac{1}{\sqrt{8}}, \frac{2}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}} \right)$

$\hat{v}_{d_1} = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)$

Now our normalized vector $\hat{v}_{d_1}$ has an L2 norm of $\|\hat{v}_{d_1}\|_2 = 1.0$.
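The same normalization with NumPy (a sketch, not from the deck):

import numpy as np

v = np.array([0., 0., 1., 2., 1., 1., 1.])
norm = np.sqrt((v ** 2).sum())  # L2 norm: sqrt(8)
v_hat = v / norm
print v_hat                        # [0. 0. 0.3535... 0.7071... ...]
print np.sqrt((v_hat ** 2).sum())  # 1.0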
IDF

$idf(t) = \log \frac{|D|}{1 + |\{d : t \in d\}|}$

where $|\{d : t \in d\}|$ is the number of documents d in which the term t is present, i.e. where $tf(t, d) \neq 0$.
Finding IDF

$idf(t_i) = \log \frac{|D|}{1 + |\{d : t_i \in d\}|}$, with $|D| = 2$; e.g. $\log \frac{2}{1} = 0.69314718$

idf(football) = log(2/(1+1)) = 0.0
idf(games)    = log(2/(1+1)) = 0.0
idf(john)     = log(2/(1+2)) = -0.40546510810816444
idf(likes)    = log(2/(1+2)) = -0.40546510810816444
idf(mary)     = log(2/(1+1)) = 0.0
idf(movies)   = log(2/(1+1)) = 0.0
idf(watch)    = log(2/(1+1)) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
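A quick sketch (not in the deck) that reproduces these idf values, assuming the two example documents and the natural logarithm:

from __future__ import division
import math

docs = ["john likes to watch movies mary likes too",
        "john also likes to watch football games"]
docs = [set(d.split()) for d in docs]

vocabulary = ["football", "games", "john", "likes",
              "mary", "movies", "watch"]  # stop words removed

for term in vocabulary:
    df = sum(1 for d in docs if term in d)  # document frequency
    idf = math.log(len(docs) / (1 + df))
    print term, idf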
N-Grams
N-Gram
In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa
Unigrams are single words
Bigrams are sequences of two words
Trigrams are sequences of three words
ahttpenwikipediaorgwikiN-gram
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from list of words
def ngrams(wordsn=2)
grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]
return grams
words = John likes to watch movies Mary likes too
lower()split()
bigrams = ngrams(wordsn=2)
trigrams = ngrams(wordsn=3)
print bigram
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math
def mutual_info(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
minfo =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
minfo[gram] = (mathlog( len(words) gramcount[ gram ]
wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))
mathlog( 2 )
return minfo
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math
def tscore(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
tsc =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
tsc[gram] = (gramcount[gram] - (1len(words))
wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])
mathsqrt( gramcount[gram])
return tsc
Jaganadh G Elements of Text Mining
Document Classification
Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8
Document classification tasks can be divided into three kinds
supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents
semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance
unsupervised document classification is entirely executed without reference toexternal information
8httpenwikipediaorgwikiDocument_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents
langd crang
wherelangd crangisin D times C using a learning
algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ
γ(ci dj) =
1 if dj belongs to ci0 otherwise
Main approaches in document classification are
Naıve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9
9Image taken from httpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)
P (E)
The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)
P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)
P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)
P (E1E2En)
Jaganadh G Elements of Text Mining
Naıve Bayes Applied in Text Classification
Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
Naıve Bayes Model for the training set will be like [4pt]
|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095
E 05 12 005555566 011111111 011111111 033333333 027777778 011111111
|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class
number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)
15 = 415
P (kill|Entertainment) = (0+0+0)12 = 0
12
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)
(15+|V |) = 521 = 02380
P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1
18 = 00555
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document isTest Docs kill bomb kidnap music movie tv C
Dt 2 1 2 0 0 1 To find the posterior probability
P (ci|W ) = P (ci)timesVprodj=1
P (wj |ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)
= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521
= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5
P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)
= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111
= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7
The document has classified as rdquoTerrorismrdquo because it got the highest value
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289
After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely
The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf
Jaganadh G Elements of Text Mining
Naıve Bayes Classifier
There are two different ways to setup a Naıve Bayes Classifier
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta
aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by presence or absenceof the word
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument
In the NB example which we worked out we applied the multinomial model In the model we used simple bag of
words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that
〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1
〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way
If there is a clear blank space available in the table in between the square and rectangles we can say that
it is linearly separable A line which drawn in the clear space between the figuresexactly equal length
from the margin of scatter region of square and rectangle is called as separating hyperplane Everything
on the one side of the separating hyper plane belongs to one category and everything in the other side
belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter
(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin
The figures closest to the separating hyperplane are known as support vectors10 If the data is not
linearly separable we have to use kernel tricks 11
10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography
11httpwwwstatsoftcomtextbooksupport-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnnaive_bayes import MultinomialNB
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoMultinomialNB())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a SVM Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnsvm import LinearSVC
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoLinearSVC())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12
import nltkclassifyutil
from nltkclassify import NaiveBayesClassifier
from nltkcorpus import movie_reviews
def word_feats(words)
return dict([(word True) for word in words])
negids = movie_reviewsfileids(rsquonegrsquo)
posids = movie_reviewsfileids(rsquoposrsquo)
negfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquonegrsquo) for f in negids]
posfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquoposrsquo) for f in posids]
negcutoff = len(negfeats)34
poscutoff = len(posfeats)34
trainfeats = negfeats[negcutoff] + posfeats[poscutoff]
classifier = NaiveBayesClassifiertrain(trainfeats)
sent = This is really cool I like it
words = word_feats(sentlower()split())
print classifierclassify(words)
12This code is adopted from httpstreamhackercom20100510
text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm
Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)
MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)
= 055
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References

Chris Smith. Machine Learning, Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept 2012.
Chris Smith. Machine Learning, Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept 2012.
Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.
Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept 2012.
Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.
Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

References

Chris Manning and Hinrich Schutze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.
Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012.
Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012.
Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008.
Vector Normalization

The length of a vector is calculated using the Euclidean norm. 6
Non-normalized vector: $\vec{v} = (v_1, v_2, v_3, \ldots, v_n)$
Length of vector: $\|\vec{v}\| = \sqrt{v_1^2 + v_2^2 + v_3^2 + \cdots + v_n^2}$
With norm: $\|\vec{v}\|_p = (|v_1|^p + |v_2|^p + |v_3|^p + \cdots + |v_n|^p)^{1/p}$
It can be simplified as
$\|\vec{v}\|_p = (\sum_{i=1}^{n} |\vec{v}_i|^p)^{1/p}$

6 http://mathworld.wolfram.com/L2-Norm.html
Vector Normalization

L2 Norm

The norm which we apply here is the L2 Norm, which is also called the Euclidean norm. a
It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm
Vector Normalization

$\vec{v}_{d1} = (0, 0, 1, 2, 1, 1, 1)$
$v_{d1} = \frac{\vec{v}}{\|\vec{v}\|_p}$
$v_{d1} = \frac{\vec{v}_{d1}}{\|\vec{v}_{d1}\|_2}$
$v_{d1} = \frac{(0, 0, 1, 2, 1, 1, 1)}{\sqrt{0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2}}$
$v_{d1} = \frac{(0, 0, 1, 2, 1, 1, 1)}{\sqrt{8}}$
$v_{d1} = (\frac{0}{\sqrt{8}}, \frac{0}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{2}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}})$
$v_{d1} = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)$
Now our normalized vector $v_{d1}$ has a L2-norm $\|v_{d1}\|_2 = 1.0$.
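The same normalization can be checked with a few lines of NumPy (a sketch; NumPy is an assumption here, the deck itself does not use it):

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_norm = v_d1 / np.linalg.norm(v_d1, ord=2)   # divide by the L2 norm, sqrt(8)
print(v_norm)                  # [0. 0. 0.3535... 0.7071... 0.3535... 0.3535... 0.3535...]
print(np.linalg.norm(v_norm))  # 1.0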
IDF

IDF

$idf(t) = \log \frac{|D|}{1 + |\{d : t \in d\}|}$
where $|\{d : t \in d\}|$ is $count(d)$ where 't' is present and $tf(t, d) \neq 0$.
Finding IDF

$idf(t_i) = \log \frac{|D|}{1 + |\{d : t_i \in d\}|} = \log \frac{2}{1} = 0.69314718$

$idf(football) = \log \frac{2}{1+1} = 0.0$
$idf(games) = \log \frac{2}{1+1} = 0.0$
$idf(john) = \log \frac{2}{1+2} = -0.40546510810816444$
$idf(likes) = \log \frac{2}{1+2} = -0.40546510810816444$
$idf(mary) = \log \frac{2}{1+1} = 0.0$
$idf(movies) = \log \frac{2}{1+1} = 0.0$
$idf(watch) = \log \frac{2}{1+1} = 0.0$
$idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)$
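A small standard-library sketch that reproduces the idf values above (the two tokenized documents are the running example from earlier slides):

from __future__ import division
import math

docs = ["john likes to watch movies mary likes too".split(),
        "john also likes to watch football games".split()]
vocabulary = ["football", "games", "john", "likes", "mary", "movies", "watch"]

def idf(term, docs):
    df = sum(1 for doc in docs if term in doc)   # document frequency of the term
    return math.log(len(docs) / (1 + df))

for term in vocabulary:
    print("%s => %s" % (term, idf(term, docs)))
# john and likes give log(2/3) = -0.4054...; every other term gives log(2/2) = 0.0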
TF-IDF weight

Finding TF-IDF weight

$M_{|D| \times F} \times M_{idf}$

$\begin{bmatrix} tf(t_1, d_1) & tf(t_2, d_1) & tf(t_3, d_1) & tf(t_4, d_1) & tf(t_5, d_1) & tf(t_6, d_1) & tf(t_7, d_1) \\ tf(t_1, d_2) & tf(t_2, d_2) & tf(t_3, d_2) & tf(t_4, d_2) & tf(t_5, d_2) & tf(t_6, d_2) & tf(t_7, d_2) \end{bmatrix} \times \begin{bmatrix} idf(t_1) & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & idf(t_2) & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & idf(t_3) & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & idf(t_4) & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & idf(t_5) & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & idf(t_6) & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & idf(t_7) \end{bmatrix}$
TF-IDF weight

$\begin{bmatrix} tf(t_1, d_1) \times idf(t_1) & tf(t_2, d_1) \times idf(t_2) & \cdots & tf(t_7, d_1) \times idf(t_7) \\ tf(t_1, d_2) \times idf(t_1) & tf(t_2, d_2) \times idf(t_2) & \cdots & tf(t_7, d_2) \times idf(t_7) \end{bmatrix}$
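The matrix product above, written out with NumPy for the running example (a sketch; the tf rows and idf values come from the earlier slides, NumPy itself is an assumption):

import numpy as np

# rows: d1, d2; columns: football, games, john, likes, mary, movies, watch
tf = np.array([[0, 0, 1, 2, 1, 1, 1],
               [1, 1, 1, 1, 0, 0, 1]], dtype=float)
idf = np.array([0.0, 0.0, -0.40546511, -0.40546511, 0.0, 0.0, 0.0])
tf_idf = tf.dot(np.diag(idf))    # M_{|D| x F} x M_idf
print(tf_idf)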
TF-IDF Normalization

L2 Normalization

$M_{tf\text{-}idf} = \frac{M_{tf\text{-}idf}}{\|M_{tf\text{-}idf}\|_2}$
TF-IDF
Practice with Python and sklearn 7

from sklearn.feature_extraction.text import \
    CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary '
    'likes too.', 'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, i in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[i])

7 http://scikit-learn.org
N-Grams
N-Gram
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application. a
Unigrams are single words
Bigrams are sequences of two words
Trigrams are sequences of three words
a http://en.wikipedia.org/wiki/N-gram
Bigrams
Bigrams
$P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-1})$
Practice with Python
d1 = "John likes to watch movies. Mary likes too."
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words)-2+1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
'movies mary', 'mary likes', 'likes too']
Trigrams
Trigrams
$P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1})$
Practice with Python
d1 = "John likes to watch movies. Mary likes too."
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
'watch movies mary', 'movies mary likes', 'mary likes too']
N-Grams
Python code to generate N-Grams from list of words
def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies. Mary likes too." \
    .lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams
Mutual Information
Mutual Information
A statistical test to measure the strength of word association.

$I(w_i, w_j) = \log_2 \frac{P(w_i, w_j)}{P(w_i) P(w_j)} \approx \log_2 \frac{N \cdot C(w_i, w_j)}{C(w_i) C(w_j)}$

where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus,
$C(w_i, w_j)$ is the frequency of the bigram $w_i w_j$,
and $N$ is the total number of words in the corpus.

$I(strong, tea) = \log_2 \frac{P(strong, tea)}{P(strong) P(tea)} \approx \log_2 \frac{N \cdot C(strong, tea)}{C(strong) C(tea)}$
Mutual Information
from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) \
            / math.log(2)
    return minfo
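Hypothetical usage of mutual_info() on a toy word list (the corpus string is an assumption, not from the deck; it relies on the ngrams() function defined earlier):

words = "strong tea is strong and strong tea is good".lower().split()
scores = mutual_info(words)
for gram in sorted(scores, key=scores.get, reverse=True):
    print("%s => %s" % (gram, scores[gram]))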
t-score
t-score
A statistical test to measure the strength of word association.

$t(w_i, w_j) = \frac{mean(P(w_i, w_j)) - mean(P(w_i)) \, mean(P(w_j))}{\sqrt{\sigma^2(P(w_i, w_j)) + \sigma^2(P(w_i)) \, \sigma^2(P(w_j))}} \approx \frac{C(w_i, w_j) - \frac{1}{N} C(w_i) C(w_j)}{\sqrt{C(w_i, w_j)}}$

where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus,
$C(w_i, w_j)$ is the frequency of the bigram $w_i w_j$,
and $N$ is the total number of words in the corpus.

$t(strong, tea) = \frac{C(strong, tea) - \frac{1}{N} C(strong) C(tea)}{\sqrt{C(strong, tea)}}$
t-score
from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc
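And a parallel, hypothetical usage of tscore() on the same toy word list (again an assumption, for illustration only):

words = "strong tea is strong and strong tea is good".lower().split()
scores = tscore(words)
for gram in sorted(scores, key=scores.get, reverse=True):
    print("%s => %s" % (gram, scores[gram]))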
Document Classification
Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents
semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance
unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification
Document Classification
Formal Definition

Let $C = (c_1, c_2, c_3, \ldots, c_m)$ be a set of pre-defined categories.
Let $D = (d_1, d_2, d_3, \ldots, d_n)$ be a set of documents to be classified.
Given a training set $T$ of labeled documents $\langle d, c \rangle$, where $\langle d, c \rangle \in D \times C$, using a learning algorithm we wish to learn a classifier (or classifier function) $\gamma$ that maps documents to classes: $\gamma : D \rightarrow C$.
A supervised learning algorithm $\Gamma$ takes a training set $T$ and emits a learned classification function $\gamma$: $\Gamma(T) = \gamma$

$\gamma(c_i, d_j) = \begin{cases} 1 & \text{if } d_j \text{ belongs to } c_i \\ 0 & \text{otherwise} \end{cases}$

The main approaches in document classification are:

Naïve Bayes (NB)
Support Vector Machines (SVM)
Document Classification
A supervised document classification pipeline 9
9 Image taken from http://www.python-course.eu/text_classification_introduction.php
Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.
Bayes' rule: $P(H|E) = \frac{P(E|H) \times P(H)}{P(E)}$
The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
$P(H)$ is called the prior probability. This is the probability of an event before the evidence is observed.
$P(H|E)$ is called the posterior probability. This is the probability of an event after the evidence is observed.

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of a dark cloud; then we have
$P(raining \mid dark\ cloud) = \frac{P(dark\ cloud \mid raining) \times P(raining)}{P(dark\ cloud)}$
For multiple evidences:
$P(H|E_1, E_2, \ldots, E_n) = \frac{P(E_1, E_2, \ldots, E_n|H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$
With the independence assumption, we can rewrite Bayes' rule as follows:
$P(H|E_1, E_2, \ldots, E_n) = \frac{P(E_1|H) \times P(E_2|H) \times \cdots \times P(E_n|H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$
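Plugging assumed numbers into the rain example makes the rule concrete (all three probabilities below are made up for illustration):

p_e_given_h = 0.9   # P(dark cloud | raining), assumed
p_h = 0.3           # P(raining), assumed
p_e = 0.4           # P(dark cloud), assumed
p_h_given_e = p_e_given_h * p_h / p_e   # Bayes' rule
print(p_h_given_e)                      # 0.675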
Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs   kill  bomb  kidnap  music  movie  tv   C
D0              2     1     3       0      0      1    Terrorism
D1              1     1     1       0      0      0    Terrorism
D2              1     1     2       0      1      0    Terrorism
D3              0     1     0       2      1      1    Entertainment
D4              0     0     1       1      1      0    Entertainment
D5              0     0     0       2      2      2    Entertainment
Building Naïve Bayes Model

The Naïve Bayes model for the training set will be like:

|V|  C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
6    T  0.5    15  0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
     E  0.5    12  0.05555556   0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabularies = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
nterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15
P(kill|Entertainment) = (0+0+0)/12 = 0/12
Building Naïve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:
P(kill|Terrorism) = (2+1+1+1)/(15+|V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0+0+0+1)/(12+|V|) = 1/18 = 0.0555
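A sketch of the Laplace-smoothed conditional probability in code; the counts dictionary is transcribed from the training table above (this helper is an illustration, not part of the deck):

from __future__ import division

counts = {
    'Terrorism':     {'kill': 4, 'bomb': 3, 'kidnap': 6, 'music': 0, 'movie': 1, 'tv': 1},
    'Entertainment': {'kill': 0, 'bomb': 1, 'kidnap': 1, 'music': 5, 'movie': 4, 'tv': 3},
}

def cond_prob(word, cls):
    n = sum(counts[cls].values())    # total word count in the class
    V = len(counts[cls])             # vocabulary size, 6
    return (counts[cls][word] + 1) / (n + V)   # Laplace (add-one) estimate

print(cond_prob('kill', 'Terrorism'))   # 5/21 = 0.2380...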
Testing the NB model
Our test document is:

Test Docs   kill  bomb  kidnap  music  movie  tv   C
Dt          2     1     2       0      0      1    ?

To find the posterior probability:
$P(c_i|W) = P(c_i) \times \prod_{j=1}^{V} P(w_j|c_i)$
Testing the NB model
P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) × P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism)
$= 0.5 \times 0.2380^2 \times 0.1904^1 \times 0.3333^2 \times 0.0476^0 \times 0.0952^0 \times 0.0952^1$
$= 0.5 \times 0.0566 \times 0.1904 \times 0.1110 \times 1 \times 1 \times 0.0952$
$= 5.7 \times 10^{-5}$

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) × P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)
$= 0.5 \times 0.0555^2 \times 0.1111^1 \times 0.1111^2 \times 0.3333^0 \times 0.2777^0 \times 0.1111^1$
$= 0.5 \times 0.0030 \times 0.1111 \times 0.0123 \times 1 \times 1 \times 0.1111$
$= 2.27 \times 10^{-7}$

The document has been classified as "Terrorism" because it got the highest value.
Preventing Underflow
The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called Underflow. To resolve the problem we can take the logarithm of the probabilities, like:
P(Terrorism|W) = log(0.5 × 0.2380² × 0.1904¹ × 0.3333² × 0.0476⁰ × 0.0952⁰ × 0.0952¹)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437
P(Entertainment|W) = log(0.5 × 0.0555² × 0.1111¹ × 0.1111² × 0.3333⁰ × 0.2777⁰ × 0.1111¹)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem, our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
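The same log-space trick in code (a sketch; the conditional probabilities are the rounded values from the slides above, and base-10 logs are used to match the worked numbers):

import math

def log_posterior(prior, cond_probs, word_counts):
    score = math.log10(prior)            # log P(c_i)
    for p, n in zip(cond_probs, word_counts):
        if n:                            # words absent from the test doc contribute 0
            score += n * math.log10(p)
    return score

# Terrorism score for the test document Dt = (2, 1, 2, 0, 0, 1)
print(log_posterior(0.5, [0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952],
                    [2, 1, 2, 0, 0, 1]))    # about -4.24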
Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:
Multi-variate Bernoulli Model
Multinomial Model
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word w_t from the vocabulary. Dimension t of the vector for document d_i is written B_it, and is either 0 or 1, indicating whether word w_t occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in a document. a

a A Comparison of Event Models for Naive Bayes Text Classification. Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs   kill  bomb  kidnap  music  movie  tv   C
D0              1     1     1       0      0      1    Terrorism
D1              1     1     1       0      0      0    Terrorism
D2              1     1     1       0      1      0    Terrorism
D3              0     1     0       1      1      1    Entertainment
D4              0     0     1       1      1      0    Entertainment
D5              0     0     0       1      1      1    Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.
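The conversion from the earlier frequency table to this 0/1 representation is a one-liner; a sketch for row D0:

tf_row = [2, 1, 3, 0, 0, 1]                    # D0: kill, bomb, kidnap, music, movie, tv
bernoulli_row = [1 if f > 0 else 0 for f in tf_row]
print(bernoulli_row)                           # [1, 1, 1, 0, 0, 1]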
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag of words representation. We can also use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set $\{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^p$ (input feature vector) and $y_i \in \{1, -1\}$ is the corresponding label, whether $(y_i = +1)$ or $(y_i = -1)$. To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function $f(x) = \langle w, x \rangle + b$, where $w \in \mathbb{R}^p$ (weight vector) and $b \in \mathbb{R}$ (bias), such that

$\langle w, x_i \rangle + b > 0$ for $y_i = +1$
$\langle w, x_i \rangle + b < 0$ for $y_i = -1$

$\langle w, x \rangle + b = 0$ is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but they are not all equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
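A tiny sketch of that linear decision rule, with made-up weights and bias (illustration only, not a trained model):

w = [0.4, -0.2, 0.1]                     # weight vector, assumed
b = -0.05                                # bias, assumed

def predict(x):
    score = sum(wi * xi for wi, xi in zip(w, x)) + b   # <w, x> + b
    return 1 if score > 0 else -1

print(predict([1.0, 0.5, 2.0]))          # +1: the point falls on the positive side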
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.
If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the scatter regions of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors. 10 If the data is not linearly separable we have to use kernel tricks. 11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.
11 http://www.statsoft.com/textbook/support-vector-machines
Practice Time
Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Practice Time
Let's try to build a SVM Classifier with Python Sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Practice Time
Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK: 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for a more detailed discussion.
Evaluating Performance of a Classifier
Confusion Matrix
A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                        Actual
                        Positive               Negative
Predicted   Positive    True Positive (TP)     False Positive (FP)
            Negative    False Negative (FN)    True Negative (TN)
Evaluating Performance of a Classifier
Accuracy of a Classifier

$Accuracy = \frac{TP + TN}{TP + FP + FN + TN}$

                        Actual
                        Positive   Negative   Total
Predicted   Positive    562        77         639
            Negative    225        436        661
            Total       787        513        1300

$Accuracy = \frac{562 + 436}{562 + 77 + 225 + 436} = 0.76$
Evaluating Performance of a Classifier
Precision and Recall

Precision indicates how many of the items that we identified were relevant.
$Precision = \frac{TP}{TP + FP}$
Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".
$Recall = \frac{TP}{TP + FN}$
Evaluating Performance of a Classifier
                        Actual
                        Positive   Negative   Total
Predicted   Positive    562        77         639
            Negative    225        436        661
            Total       787        513        1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84
Evaluating Performance of a Classifier
Error Rate

Error rate is the percentage of things done wrong.
$ErrorRate = \frac{FP + FN}{TP + FP + FN + TN}$

                        Actual
                        Positive   Negative   Total
Predicted   Positive    562        77         639
            Negative    225        436        661
            Total       787        513        1300

$ErrorRate = \frac{77 + 225}{562 + 77 + 225 + 436} = 0.23$
Evaluating Performance of a Classifier
Fall-out

It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).
$Fall\text{-}out = \frac{FP}{FP + TN}$

                        Actual
                        Positive   Negative   Total
Predicted   Positive    562        77         639
            Negative    225        436        661
            Total       787        513        1300

$Fall\text{-}out = \frac{77}{77 + 436} = 0.15$
Evaluating Performance of a Classifier
F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

$F1\ Score = 2 \times \frac{precision \times recall}{precision + recall}$

a http://en.wikipedia.org/wiki/F1_score
Evaluating Performance of a Classifier
$F1\ Score\ Positive = 2 \times \frac{0.87 \times 0.71}{0.87 + 0.71} = 0.78$

$F1\ Score\ Negative = 2 \times \frac{0.65 \times 0.84}{0.65 + 0.84} = 0.73$
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results that are true positives.
$Positive\ predictive\ value = \frac{TP}{TP + FP}$
$Positive\ predictive\ value = \frac{562}{562 + 77} = 0.87$
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.
$NPV = \frac{TN}{TN + FN}$
$NPV = \frac{436}{436 + 225} = 0.65$
Evaluating Performance of a Classifier
Specificity or True Negative Rate
$Specificity = \frac{TN}{FP + TN}$
$Specificity = \frac{436}{77 + 436} = 0.84$
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.
$FDR = \frac{FP}{FP + TP}$
$FDR = \frac{77}{77 + 562} = 0.12$
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.
$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$
$MCC = \frac{562 \times 436 - 77 \times 225}{\sqrt{(562+77)(562+225)(436+77)(436+225)}} = 0.55$
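All of the scores above can be recomputed from the four confusion-matrix counts; a compact sketch with the numbers used throughout this section:

from __future__ import division
import math

TP, FP, FN, TN = 562, 77, 225, 436

accuracy = (TP + TN) / (TP + FP + FN + TN)              # 0.76
precision = TP / (TP + FP)                              # 0.87
recall = TP / (TP + FN)                                 # 0.71
f1 = 2 * precision * recall / (precision + recall)      # 0.78
fallout = FP / (FP + TN)                                # 0.15
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))      # 0.55
print("accuracy = %.2f" % accuracy)
print("precision = %.2f, recall = %.2f, f1 = %.2f" % (precision, recall, f1))
print("fall-out = %.2f, mcc = %.2f" % (fallout, mcc))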
Vector Normalization
L2 Norm
The norm which we apply here is L2 Norm which is also called as Euclidean norm a
It is a common norm used to measure the length of a vector where p = 2
ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm
Jaganadh G Elements of Text Mining
Vector Normalization
v~d1 = (0 0 1 2 1 1 1)vd1 = ~v
~vp
vd1 = v~d1
v~d12vd1 = (0012111)radic
02+02+12+22+12+12+12
vd1 = (0012111)radic8
vd1 = ( 0radic8 0radic
8 1radic
8 2radic
8 1radic
8 1radic
8 1radic
8)
vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10
Jaganadh G Elements of Text Mining
IDF
IDF
idf(t) = log |D|1+|dtisind|
where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0
Jaganadh G Elements of Text Mining
Finding IDF
idf(ti) = log|D|
1 + |d ti isin d|= log
2
1= 069314718
idf(football) = log 21+1
= 00
idf(games) = log 21+1
= 00
idf(john) = log 21+2
= minus040546510810816444
idf(likes) = log 21+2
= minus040546510810816444
idf(mary) = log 21+1
= 00
idf(movies) = log 21+1
= 00
idf(watch) = log 21+1
= 00idf(V ) = (00-040546510810816444-040546510810816444000000)
Jaganadh G Elements of Text Mining
TF-IDF weight
Finding TF-IDF weight
M|D|timesF times Midf
[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
]x
idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)
Jaganadh G Elements of Text Mining
TF-IDF weight
Finding TF-IDF weight
M|D|timesF times Midf
[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
]x
idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)
Jaganadh G Elements of Text Mining
TF-IDF weight
tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)
Jaganadh G Elements of Text Mining
TF-IDF Normalization
L2 Normalization
Mtfminusidf =MtfminusidfMtfminusidf2
Jaganadh G Elements of Text Mining
TF-IDF
Practice with Python and sklearn 7
from sklearnfeature_extractiontext import
CountVectorizer TfidfTransformer
vectorizer = CountVectorizer(analyzer=rsquowordrsquo
min_n=1stop_words=rsquoenglishrsquo)
docs = (rsquoJohn likes to watch movies Mary
likes toorsquorsquoJohn also likes to watch football gamesrsquo)
bow = vectorizerfit_transform(docs)
freq_term_matrix =vectorizertransform(docs)
tfidf = TfidfTransformer(norm=l2)
tfd = tfidffit(freq_term_matrix)
print IDF tfidfidf_
tf_idf_matrix = tfidftransform(freq_term_matrix)
print tf_idf_matrixtodense()
for wf in zip(vectorizervocabulary_tfdidf_)
print rsquor =gt rrsquo (w tfdidf_[f])
7httpscikit-learnorgJaganadh G Elements of Text Mining
N-Grams
N-Gram
In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa
Unigrams are single words
Bigrams are sequences of two words
Trigrams are sequences of three words
ahttpenwikipediaorgwikiN-gram
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from list of words
def ngrams(wordsn=2)
grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]
return grams
words = John likes to watch movies Mary likes too
lower()split()
bigrams = ngrams(wordsn=2)
trigrams = ngrams(wordsn=3)
print bigram
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math
def mutual_info(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
minfo =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
minfo[gram] = (mathlog( len(words) gramcount[ gram ]
wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))
mathlog( 2 )
return minfo
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math
def tscore(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
tsc =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
tsc[gram] = (gramcount[gram] - (1len(words))
wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])
mathsqrt( gramcount[gram])
return tsc
Jaganadh G Elements of Text Mining
Document Classification
Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8
Document classification tasks can be divided into three kinds
supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents
semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance
unsupervised document classification is entirely executed without reference toexternal information
8httpenwikipediaorgwikiDocument_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents
langd crang
wherelangd crangisin D times C using a learning
algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ
γ(ci dj) =
1 if dj belongs to ci0 otherwise
Main approaches in document classification are
Naıve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9
9Image taken from httpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)
P (E)
The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)
P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)
P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)
P (E1E2En)
Jaganadh G Elements of Text Mining
Naıve Bayes Applied in Text Classification
Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
Naıve Bayes Model for the training set will be like [4pt]
|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095
E 05 12 005555566 011111111 011111111 033333333 027777778 011111111
|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class
number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)
15 = 415
P (kill|Entertainment) = (0+0+0)12 = 0
12
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)
(15+|V |) = 521 = 02380
P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1
18 = 00555
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document isTest Docs kill bomb kidnap music movie tv C
Dt 2 1 2 0 0 1 To find the posterior probability
P (ci|W ) = P (ci)timesVprodj=1
P (wj |ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)
= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521
= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5
P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)
= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111
= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7
The document has classified as rdquoTerrorismrdquo because it got the highest value
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289
After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely
The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf
Jaganadh G Elements of Text Mining
Naıve Bayes Classifier
There are two different ways to setup a Naıve Bayes Classifier
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta
aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by presence or absenceof the word
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument
In the NB example which we worked out we applied the multinomial model In the model we used simple bag of
words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that
〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1
〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way
If there is a clear blank space available in the table in between the square and rectangles we can say that
it is linearly separable A line which drawn in the clear space between the figuresexactly equal length
from the margin of scatter region of square and rectangle is called as separating hyperplane Everything
on the one side of the separating hyper plane belongs to one category and everything in the other side
belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter
(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin
The figures closest to the separating hyperplane are known as support vectors10 If the data is not
linearly separable we have to use kernel tricks 11
10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography
11httpwwwstatsoftcomtextbooksupport-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnnaive_bayes import MultinomialNB
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoMultinomialNB())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a SVM Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnsvm import LinearSVC
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoLinearSVC())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12
import nltkclassifyutil
from nltkclassify import NaiveBayesClassifier
from nltkcorpus import movie_reviews
def word_feats(words)
return dict([(word True) for word in words])
negids = movie_reviewsfileids(rsquonegrsquo)
posids = movie_reviewsfileids(rsquoposrsquo)
negfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquonegrsquo) for f in negids]
posfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquoposrsquo) for f in posids]
negcutoff = len(negfeats)34
poscutoff = len(posfeats)34
trainfeats = negfeats[negcutoff] + posfeats[poscutoff]
classifier = NaiveBayesClassifiertrain(trainfeats)
sent = This is really cool I like it
words = word_feats(sentlower()split())
print classifierclassify(words)
12This code is adopted from httpstreamhackercom20100510
text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm
Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons. For a classifier, the FDR is the proportion of positive predictions that are actually false positives.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 − 77 × 225) / √((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55
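All of these metrics are simple functions of the four confusion-matrix counts, so the worked examples above can be verified in a few lines of Python (a sketch; the counts are the ones from the table):

from __future__ import division
import math

TP, FP, FN, TN = 562, 77, 225, 436

accuracy    = (TP + TN) / (TP + FP + FN + TN)        # 0.76
precision   = TP / (TP + FP)                         # 0.87 (positive class)
recall      = TP / (TP + FN)                         # 0.71 (positive class)
error_rate  = (FP + FN) / (TP + FP + FN + TN)        # 0.23
fallout     = FP / (FP + TN)                         # 0.15
f1          = 2 * precision * recall / (precision + recall)   # 0.78
npv         = TN / (TN + FN)                         # 0.65
specificity = TN / (FP + TN)                         # 0.84
fdr         = FP / (FP + TP)                         # 0.12
mcc         = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))   # 0.55

print("Accuracy: %.2f  F1: %.2f  MCC: %.2f" % (accuracy, f1, mcc))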
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
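As a quick illustration, scikit-learn can compute the ROC points and the AUC directly. The labels and scores below are made-up toy values, purely for demonstration:

from sklearn.metrics import roc_curve, auc

y_true  = [1, 1, 0, 1, 0, 0, 1, 0]                   # assumed true labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]  # assumed classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)    # FPR on X, TPR on Y
print("AUC: %.3f" % auc(fpr, tpr))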
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child " \
        "of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the " \
    "second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract/
References
Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept 2012.

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept 2012.

Pierre M Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W Church and Robert L Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php
References
Chris Manning and Hinrich Schutze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044.

Christopher D Manning, Prabhakar Raghavan and Hinrich Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008.
Vector Normalization
v⃗_d1 = (0, 0, 1, 2, 1, 1, 1)

v_d1 = v⃗_d1 / ||v⃗_d1||_2

v_d1 = (0, 0, 1, 2, 1, 1, 1) / √(0² + 0² + 1² + 2² + 1² + 1² + 1²)

v_d1 = (0, 0, 1, 2, 1, 1, 1) / √8

v_d1 = (0/√8, 0/√8, 1/√8, 2/√8, 1/√8, 1/√8, 1/√8)

v_d1 = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Our normalized vector v_d1 now has an L2-norm of ||v_d1||_2 = 1.0.
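The same arithmetic can be checked with NumPy (a sketch; numpy is an assumption here, it is not used elsewhere in these slides):

import numpy as np

v_d1 = np.array([0., 0., 1., 2., 1., 1., 1.])
norm = np.sqrt((v_d1 ** 2).sum())             # L2 norm = sqrt(8)
v_d1_normalized = v_d1 / norm
print(v_d1_normalized)                        # [0. 0. 0.3535 0.7071 0.3535 0.3535 0.3535]
print(np.sqrt((v_d1_normalized ** 2).sum()))  # 1.0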
IDF
IDF
idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the number of documents in which 't' is present, i.e. tf(t, d) ≠ 0.
Finding IDF
idf(ti) = log( |D| / (1 + |{d : ti ∈ d}|) ),  e.g. log(2/1) = 0.69314718

idf(football) = log(2/(1+1)) = 0.0
idf(games)    = log(2/(1+1)) = 0.0
idf(john)     = log(2/(1+2)) = −0.40546510810816444
idf(likes)    = log(2/(1+2)) = −0.40546510810816444
idf(mary)     = log(2/(1+1)) = 0.0
idf(movies)   = log(2/(1+1)) = 0.0
idf(watch)    = log(2/(1+1)) = 0.0

idf(V) = (0.0, 0.0, −0.40546510810816444, −0.40546510810816444, 0.0, 0.0, 0.0)
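A short sketch of the same computation in Python, over the two toy documents used throughout (math.log is the natural logarithm):

from __future__ import division
import math

docs = ["john likes to watch movies mary likes too",
        "john also likes to watch football games"]

def idf(term, docs):
    df = sum(1 for doc in docs if term in doc.split())  # document frequency
    return math.log(len(docs) / (1 + df))

print(idf("john", docs))   # log(2/(1+2)) = -0.40546510810816444
print(idf("mary", docs))   # log(2/(1+1)) = 0.0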
TF-IDF weight
Finding TF-IDF weight
M_tf-idf = M_|D|×F × M_idf

M_|D|×F = | tf(t1,d1) tf(t2,d1) tf(t3,d1) tf(t4,d1) tf(t5,d1) tf(t6,d1) tf(t7,d1) |
          | tf(t1,d2) tf(t2,d2) tf(t3,d2) tf(t4,d2) tf(t5,d2) tf(t6,d2) tf(t7,d2) |

M_idf = | idf(t1)    0        0        0        0        0        0     |
        |    0    idf(t2)     0        0        0        0        0     |
        |    0       0     idf(t3)     0        0        0        0     |
        |    0       0        0     idf(t4)     0        0        0     |
        |    0       0        0        0     idf(t5)     0        0     |
        |    0       0        0        0        0     idf(t6)     0     |
        |    0       0        0        0        0        0     idf(t7)  |
TF-IDF weight
M_tf-idf = | tf(t1,d1)×idf(t1)  tf(t2,d1)×idf(t2)  tf(t3,d1)×idf(t3)  tf(t4,d1)×idf(t4)  tf(t5,d1)×idf(t5)  tf(t6,d1)×idf(t6)  tf(t7,d1)×idf(t7) |
           | tf(t1,d2)×idf(t1)  tf(t2,d2)×idf(t2)  tf(t3,d2)×idf(t3)  tf(t4,d2)×idf(t4)  tf(t5,d2)×idf(t5)  tf(t6,d2)×idf(t6)  tf(t7,d2)×idf(t7) |
TF-IDF Normalization
L2 Normalization
M_tf-idf = M_tf-idf / ||M_tf-idf||_2
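The matrix product and the normalization can be sketched together with NumPy (the tf rows and idf values are taken from the toy example above; numpy itself is an assumption, it is not used elsewhere in these slides):

import numpy as np

tf = np.array([[0., 0., 1., 2., 1., 1., 1.],    # tf(t, d1)
               [1., 1., 1., 1., 0., 0., 1.]])   # tf(t, d2)
idf = np.array([0., 0., -0.4054651, -0.4054651, 0., 0., 0.])

tf_idf = tf.dot(np.diag(idf))                   # M_|D|xF x M_idf
norms = np.linalg.norm(tf_idf, axis=1, keepdims=True)
norms[norms == 0.0] = 1.0                       # guard against all-zero rows
tf_idf_l2 = tf_idf / norms                      # row-wise L2 normalization
print(tf_idf_l2)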
TF-IDF
Practice with Python and sklearn 7
from sklearn.feature_extraction.text import \
    CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, f in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[f])

7 http://scikit-learn.org/
N-Grams
N-Gram
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application.a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram
Bigrams
Bigrams
P(wi | w1, w2, ..., wi−1) ≈ P(wi | wi−1)
Practice with Python
d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words)-2+1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
'movies mary', 'mary likes', 'likes too']
Trigrams
Trigrams
P(wi | w1, w2, ..., wi−1) ≈ P(wi | wi−2, wi−1)
Practice with Python
d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
'watch movies mary', 'movies mary likes', 'mary likes too']
N-Grams
Python code to generate N-Grams from list of words
def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too" \
    .lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi, wj) = log2( P(wi, wj) / (P(wi) × P(wj)) ) ≈ log2( (N × C(wi, wj)) / (C(wi) × C(wj)) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram (wi, wj),
and N is the total number of words in the corpus.

I(strong, tea) = log2( P(strong, tea) / (P(strong) × P(tea)) ) ≈ log2( (N × C(strong, tea)) / (C(strong) × C(tea)) )
Mutual Information
from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) \
            / math.log(2)
    return minfo
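A usage sketch on the toy sentence (reusing the ngrams() helper from the earlier slide):

words = "john likes to watch movies mary likes too".split()
mi = mutual_info(words)
for gram in sorted(mi, key=mi.get, reverse=True):
    print("%s: %.4f" % (gram, mi[gram]))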
t-score
t-score
Statistical test to measure strength of word association
t(wi, wj) = ( mean(P(wi, wj)) − mean(P(wi)) × mean(P(wj)) ) / √( σ²(P(wi, wj)) + σ²(P(wi)) × σ²(P(wj)) )

          ≈ ( C(wi, wj) − (1/N) × C(wi) × C(wj) ) / √C(wi, wj)

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram (wi, wj),
and N is the total number of words in the corpus.

t(strong, tea) = ( C(strong, tea) − (1/N) × C(strong) × C(tea) ) / √C(strong, tea)
t-score
from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1/len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc
Document Classification
Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification
Document Classification
Formal Definition
Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (or classifier function) γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

Main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)
Document Classification
A supervised document classification pipeline 9
9 Image taken from http://www.python-course.eu/text_classification_introduction.php
Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem (or Bayes' rule):

P(H|E) = ( P(E|H) × P(H) ) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.
Naïve Bayes Classification

Let H be the event of raining and E be the evidence of a dark cloud; then we have:

P(raining | dark cloud) = ( P(dark cloud | raining) × P(raining) ) / P(dark cloud)

For multiple evidences:

P(H | E1, E2, ..., En) = ( P(E1, E2, ..., En | H) × P(H) ) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as:

P(H | E1, E2, ..., En) = ( P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) ) / P(E1, E2, ..., En)
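To make the rule concrete, here is a tiny numeric sketch; the three probabilities are made-up values, purely for illustration:

# assumed values, for illustration only
p_cloud_given_rain = 0.80   # P(dark cloud | raining)
p_rain = 0.20               # P(raining)
p_cloud = 0.40              # P(dark cloud)

p_rain_given_cloud = p_cloud_given_rain * p_rain / p_cloud
print(p_rain_given_cloud)   # 0.4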
Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Building Naïve Bayes Model

The Naïve Bayes model for the training set will be like:

|V| = 6

C  P(Ci)  ni  P(kill|Ci)    P(bomb|Ci)   P(kidnap|Ci)  P(music|Ci)   P(movie|Ci)   P(tv|Ci)
T  0.5    15  0.238095238   0.19047619   0.33333333    0.047619048   0.095238095   0.095238095
E  0.5    12  0.05555556    0.11111111   0.11111111    0.33333333    0.27777778    0.11111111

|V| = the number of vocabularies = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12
Building Naïve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555
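The smoothed estimates are easy to reproduce (a sketch; the counts are the ones from the slide):

from __future__ import division

def laplace(count, n, V):
    # P(w|c) with add-one smoothing: (count(w, c) + 1) / (n_c + |V|)
    return (count + 1) / (n + V)

V = 6                                  # vocabulary size
print(laplace(2 + 1 + 1, 15, V))       # P(kill|Terrorism)     = 5/21 = 0.2380...
print(laplace(0 + 0 + 0, 12, V))       # P(kill|Entertainment) = 1/18 = 0.0555...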
Testing the NB model
Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt         2     1     2       0      0      1   ?

To find the posterior probability:

P(ci|W) = P(ci) × ∏(j=1..|V|) P(wj|ci)
Testing the NB model
P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)² × P(bomb|Terrorism)¹ × P(kidnap|Terrorism)² × P(music|Terrorism)⁰ × P(movie|Terrorism)⁰ × P(tv|Terrorism)¹

= 0.5 × 0.2380² × 0.1904¹ × 0.3333² × 0.0476⁰ × 0.0952⁰ × 0.0952¹
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10⁻⁵

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)² × P(bomb|Entertainment)¹ × P(kidnap|Entertainment)² × P(music|Entertainment)⁰ × P(movie|Entertainment)⁰ × P(tv|Entertainment)¹

= 0.5 × 0.0555² × 0.1111¹ × 0.1111² × 0.3333⁰ × 0.2777⁰ × 0.1111¹
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10⁻⁷

The document has been classified as "Terrorism" because that class got the highest value.
Preventing Underflow
The probability score assigned to the test document is very small. In real-world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities:

log P(Terrorism|W) = log(0.5 × 0.2380² × 0.1904¹ × 0.3333² × 0.0476⁰ × 0.0952⁰ × 0.0952¹)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= –0.3010 – 1.2468 – 0.7203 – 0.9543 + 0 + 0 – 1.0213
= –4.2437

log P(Entertainment|W) = log(0.5 × 0.0555² × 0.1111¹ × 0.1111² × 0.3333⁰ × 0.2777⁰ × 0.1111¹)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= –0.3010 – 2.5110 – 0.9542 – 1.9085 + 0 + 0 – 0.9542
= –6.6289

After handling the underflow problem, our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.
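A sketch of the same computation in log space (the priors, conditional probabilities and exponents are copied from the worked example above):

import math

def log_score(prior, cond_probs, counts):
    # log10 P(c) + sum_j count_j * log10 P(w_j|c)
    s = math.log10(prior)
    for p, n in zip(cond_probs, counts):
        s += n * math.log10(p)
    return s

counts = [2, 1, 2, 0, 0, 1]  # test document: kill, bomb, kidnap, music, movie, tv
terror = log_score(0.5, [0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952], counts)
entert = log_score(0.5, [0.0555, 0.1111, 0.1111, 0.3333, 0.2777, 0.1111], counts)
print("%.4f %.4f" % (terror, entert))  # about -4.24 and -6.63: Terrorism wins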
The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a A Comparison of Event Models for Naive Bayes Text Classification. Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
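scikit-learn ships both event models, so the contrast can be sketched directly on the toy training table (the sklearn usage here is an illustration, not part of the original example):

import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

X = np.array([[2, 1, 3, 0, 0, 1], [1, 1, 1, 0, 0, 0], [1, 1, 2, 0, 1, 0],
              [0, 1, 0, 2, 1, 1], [0, 0, 1, 1, 1, 0], [0, 0, 0, 2, 2, 2]])
y = ["Terrorism", "Terrorism", "Terrorism",
     "Entertainment", "Entertainment", "Entertainment"]

multi = MultinomialNB().fit(X, y)           # multinomial model: uses the raw counts
bern = BernoulliNB(binarize=0.0).fit(X, y)  # Bernoulli model: counts > 0 become 1
test = np.array([[2, 1, 2, 0, 0, 1]])
print(multi.predict(test))
print(bern.predict(test))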
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag-of-words representation. We can also use a smoothed bag-of-words model, like TF-IDF, in the multinomial naive Bayes model.
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}, i = 1..n, where xi ∈ Rp (input feature vector) and yi ∈ {1, −1} is the corresponding label, whether (yi = +1) or (yi = −1). To start with, we assume that our input feature vectors are linearly separable; that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ Rp (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = −1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
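The decision rule itself is just the sign of f(x). A tiny sketch (w and b are made-up numbers for a two-dimensional example):

import numpy as np

w = np.array([1.0, -1.0])   # assumed weight vector
b = -0.5                    # assumed bias

def predict(x):
    return 1 if np.dot(w, x) + b > 0 else -1

print(predict(np.array([2.0, 0.5])))   # <w, x> + b = 1.0  -> +1
print(predict(np.array([0.0, 1.0])))   # <w, x> + b = -1.5 -> -1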
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the scatter regions of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn along the edge of each scatter (square and rectangle), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable, we have to use kernel tricks.11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and the bibliography.
11 http://www.statsoft.com/textbook/support-vector-machines
Practice Time
Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Practice Time
Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Practice Time
Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/. Follow the link for a more detailed discussion.
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm
Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)
MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)
= 055
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
IDF
IDF
idf(t) = log |D|1+|dtisind|
where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0
Jaganadh G Elements of Text Mining
Finding IDF
idf(ti) = log|D|
1 + |d ti isin d|= log
2
1= 069314718
idf(football) = log 21+1
= 00
idf(games) = log 21+1
= 00
idf(john) = log 21+2
= minus040546510810816444
idf(likes) = log 21+2
= minus040546510810816444
idf(mary) = log 21+1
= 00
idf(movies) = log 21+1
= 00
idf(watch) = log 21+1
= 00idf(V ) = (00-040546510810816444-040546510810816444000000)
Jaganadh G Elements of Text Mining
TF-IDF weight
Finding TF-IDF weight
M|D|timesF times Midf
[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
]x
idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)
Jaganadh G Elements of Text Mining
TF-IDF weight
Finding TF-IDF weight
M|D|timesF times Midf
[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))
]x
idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)
Jaganadh G Elements of Text Mining
TF-IDF weight
tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)
Jaganadh G Elements of Text Mining
TF-IDF Normalization
L2 Normalization
Mtfminusidf =MtfminusidfMtfminusidf2
Jaganadh G Elements of Text Mining
TF-IDF
Practice with Python and sklearn 7
from sklearnfeature_extractiontext import
CountVectorizer TfidfTransformer
vectorizer = CountVectorizer(analyzer=rsquowordrsquo
min_n=1stop_words=rsquoenglishrsquo)
docs = (rsquoJohn likes to watch movies Mary
likes toorsquorsquoJohn also likes to watch football gamesrsquo)
bow = vectorizerfit_transform(docs)
freq_term_matrix =vectorizertransform(docs)
tfidf = TfidfTransformer(norm=l2)
tfd = tfidffit(freq_term_matrix)
print IDF tfidfidf_
tf_idf_matrix = tfidftransform(freq_term_matrix)
print tf_idf_matrixtodense()
for wf in zip(vectorizervocabulary_tfdidf_)
print rsquor =gt rrsquo (w tfdidf_[f])
7httpscikit-learnorgJaganadh G Elements of Text Mining
N-Grams
N-Gram
In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa
Unigrams are single words
Bigrams are sequences of two words
Trigrams are sequences of three words
ahttpenwikipediaorgwikiN-gram
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from list of words
def ngrams(wordsn=2)
grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]
return grams
words = John likes to watch movies Mary likes too
lower()split()
bigrams = ngrams(wordsn=2)
trigrams = ngrams(wordsn=3)
print bigram
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math
def mutual_info(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
minfo =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
minfo[gram] = (mathlog( len(words) gramcount[ gram ]
wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))
mathlog( 2 )
return minfo
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math
def tscore(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
tsc =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
tsc[gram] = (gramcount[gram] - (1len(words))
wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])
mathsqrt( gramcount[gram])
return tsc
Jaganadh G Elements of Text Mining
Document Classification
Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8
Document classification tasks can be divided into three kinds
supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents
semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance
unsupervised document classification is entirely executed without reference toexternal information
8httpenwikipediaorgwikiDocument_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents
langd crang
wherelangd crangisin D times C using a learning
algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ
γ(ci dj) =
1 if dj belongs to ci0 otherwise
Main approaches in document classification are
Naıve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9
9Image taken from httpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)
P (E)
The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)
P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)
P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)
P (E1E2En)
Jaganadh G Elements of Text Mining
Naıve Bayes Applied in Text Classification
Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
Naıve Bayes Model for the training set will be like [4pt]
|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095
E 05 12 005555566 011111111 011111111 033333333 027777778 011111111
|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class
number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)
15 = 415
P (kill|Entertainment) = (0+0+0)12 = 0
12
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)
(15+|V |) = 521 = 02380
P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1
18 = 00555
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document isTest Docs kill bomb kidnap music movie tv C
Dt 2 1 2 0 0 1 To find the posterior probability
P (ci|W ) = P (ci)timesVprodj=1
P (wj |ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)
= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521
= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5
P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)
= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111
= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7
The document has classified as rdquoTerrorismrdquo because it got the highest value
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289
After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely
The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf
Jaganadh G Elements of Text Mining
Naıve Bayes Classifier
There are two different ways to setup a Naıve Bayes Classifier
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta
aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by presence or absenceof the word
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument
In the NB example which we worked out we applied the multinomial model In the model we used simple bag of
words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that
〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1
〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way
If there is a clear blank space available in the table in between the square and rectangles we can say that
it is linearly separable A line which drawn in the clear space between the figuresexactly equal length
from the margin of scatter region of square and rectangle is called as separating hyperplane Everything
on the one side of the separating hyper plane belongs to one category and everything in the other side
belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter
(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin
The figures closest to the separating hyperplane are known as support vectors10 If the data is not
linearly separable we have to use kernel tricks 11
10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography
11httpwwwstatsoftcomtextbooksupport-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnnaive_bayes import MultinomialNB
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoMultinomialNB())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a SVM Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnsvm import LinearSVC
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoLinearSVC())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a multi-variate Naïve Bayes classifier with Python and NLTK:12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                                Actual
                                Positive               Negative
Predicted   Positive    True Positive (TP)     False Positive (FP)
            Negative    False Negative (FN)    True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = (TP + TN) / (TP + FP + FN + TN)

                        Actual
                        Positive   Negative   Total
Predicted   Positive    562        77         639
            Negative    225        436        661
            Total       787        513        1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
                        Actual
                        Positive   Negative   Total
Predicted   Positive    562        77         639
            Negative    225        436        661
            Total       787        513        1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong.

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                        Actual
                        Positive   Negative   Total
Predicted   Positive    562        77         639
            Negative    225        436        661
            Total       787        513        1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                        Actual
                        Positive   Negative   Total
Predicted   Positive    562        77         639
            Negative    225        436        661
            Total       787        513        1300

Fall-out = 77 / (77 + 436) = 0.15
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score (Positive) = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78
F1 Score (Negative) = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
Jaganadh G Elements of Text Mining
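As a quick sanity check, here is a minimal Python sketch (assuming the 2x2 counts from the confusion matrix above) that reproduces the accuracy, precision, recall and F1 numbers on these slides:

from __future__ import division

TP, FP, FN, TN = 562, 77, 225, 436
accuracy = (TP + TN) / (TP + FP + FN + TN)
precision = TP / (TP + FP)          # positive precision
recall = TP / (TP + FN)             # positive recall
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)  # matches the slide values up to rounding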
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)
Positive predictive value = 562 / (562 + 77) = 0.87
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)
NPV = 436 / (436 + 225) = 0.65
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TN / (FP + TN)
Specificity = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)
FDR = 77 / (77 + 562) = 0.12
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / sqrt( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

MCC = (562 × 436 − 77 × 225) / sqrt( (562 + 77)(562 + 225)(436 + 77)(436 + 225) ) = 0.55
Jaganadh G Elements of Text Mining
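A one-line check of the MCC value, assuming the same counts as on the slides:

import math

TP, FP, FN, TN = 562, 77, 225, 436
mcc = (TP * TN - FP * FN) / math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print(mcc)  # ~0.55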
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in a classification model. It applies to binary classification problems. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis, and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with unbalanced target distribution (one target class dominates the other).
Jaganadh G Elements of Text Mining
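sklearn ships ready-made helpers for ROC and AUC; a minimal sketch with hypothetical labels and scores (both lists below are made up for illustration):

from sklearn.metrics import roc_curve, auc

y_true = [1, 1, 0, 1, 0, 0]                # hypothetical actual labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]   # hypothetical classifier scores
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(auc(fpr, tpr))  # area under the ROC curve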
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
                       binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = ("Abraham Lincoln was born February 12, 1809, the second child "
            "of Thomas Lincoln and Nancy Lincoln")
    entities = extract_entities(sent)
    print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topia.termextract import extract

extractor = extract.TermExtractor()
text = ("Abraham Lincoln was born February 12, 1809, the "
        "second child of Thomas Lincoln and Nancy Lincoln")
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract/
Jaganadh G Elements of Text Mining
References
Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
Jaganadh G Elements of Text Mining
Finding IDF
idf(ti) = log( |D| / (1 + |{d : ti ∈ d}|) ),   with |D| = 2 here (note log(2/1) = 0.69314718)

idf(football) = log(2 / (1 + 1)) = 0.0
idf(games)    = log(2 / (1 + 1)) = 0.0
idf(john)     = log(2 / (1 + 2)) = -0.40546510810816444
idf(likes)    = log(2 / (1 + 2)) = -0.40546510810816444
idf(mary)     = log(2 / (1 + 1)) = 0.0
idf(movies)   = log(2 / (1 + 1)) = 0.0
idf(watch)    = log(2 / (1 + 1)) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
Jaganadh G Elements of Text Mining
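A short sketch that recomputes these idf values from the two example documents (d1 and d2 from the earlier slides):

from __future__ import division
import math

docs = ["john likes to watch movies mary likes too",
        "john also likes to watch football games"]
vocab = ["football", "games", "john", "likes", "mary", "movies", "watch"]
D = len(docs)
for term in vocab:
    df = sum(1 for doc in docs if term in doc.split())  # document frequency
    print(term, math.log(D / (1 + df)))  # log(2/2) = 0.0, log(2/3) = -0.4054...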
TF-IDF weight

Finding TF-IDF weight

M_{|D|×F} × M_idf

[ tf(t1,d1)  tf(t2,d1)  tf(t3,d1)  tf(t4,d1)  tf(t5,d1)  tf(t6,d1)  tf(t7,d1) ]
[ tf(t1,d2)  tf(t2,d2)  tf(t3,d2)  tf(t4,d2)  tf(t5,d2)  tf(t6,d2)  tf(t7,d2) ]

    ×

[ idf(t1)  0        0        0        0        0        0       ]
[ 0        idf(t2)  0        0        0        0        0       ]
[ 0        0        idf(t3)  0        0        0        0       ]
[ 0        0        0        idf(t4)  0        0        0       ]
[ 0        0        0        0        idf(t5)  0        0       ]
[ 0        0        0        0        0        idf(t6)  0       ]
[ 0        0        0        0        0        0        idf(t7) ]

Jaganadh G Elements of Text Mining
TF-IDF weight
[ tf(t1,d1)×idf(t1)  tf(t2,d1)×idf(t2)  tf(t3,d1)×idf(t3)  tf(t4,d1)×idf(t4)  tf(t5,d1)×idf(t5)  tf(t6,d1)×idf(t6)  tf(t7,d1)×idf(t7) ]
[ tf(t1,d2)×idf(t1)  tf(t2,d2)×idf(t2)  tf(t3,d2)×idf(t3)  tf(t4,d2)×idf(t4)  tf(t5,d2)×idf(t5)  tf(t6,d2)×idf(t6)  tf(t7,d2)×idf(t7) ]
Jaganadh G Elements of Text Mining
TF-IDF Normalization
L2 Normalization
M_tf-idf = M_tf-idf / ||M_tf-idf||_2
Jaganadh G Elements of Text Mining
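A minimal numpy sketch of row-wise L2 normalization (the tf-idf values here are hypothetical, and a recent numpy is assumed):

import numpy as np

m = np.array([[0.0, 0.69, 1.38],
              [0.69, 0.0, 0.0]])            # hypothetical tf-idf matrix
norms = np.linalg.norm(m, axis=1, keepdims=True)
norms[norms == 0] = 1.0                     # guard against all-zero rows
print(m / norms)                            # each row now has unit L2 norm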
TF-IDF
Practice with Python and sklearn 7
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# min_n is the old sklearn spelling; newer versions use ngram_range=(1, 1)
vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
        'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfd.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
# iterate word -> column index so each word is paired with its own idf
for w, f in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[f])
7 http://scikit-learn.org
Jaganadh G Elements of Text Mining
N-Grams
N-Gram
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application.a

Unigrams are single words.
Bigrams are sequences of two words.
Trigrams are sequences of three words.

a http://en.wikipedia.org/wiki/N-gram
Jaganadh G Elements of Text Mining
Bigrams

Bigrams

P(wi | w1, w2, ..., wi−1) ≈ P(wi | wi−1)

Practice with Python:

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining
Trigrams

Trigrams

P(wi | w1, w2, ..., wi−1) ≈ P(wi | wi−2, wi−1)

Practice with Python:

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from list of words
def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams
Jaganadh G Elements of Text Mining
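NLTK also provides a ready-made helper, nltk.util.ngrams, which yields tuples instead of joined strings; a sketch equivalent to the function above:

from nltk.util import ngrams

words = "John likes to watch movies Mary likes too".lower().split()
print([" ".join(g) for g in ngrams(words, 2)])   # bigrams
print([" ".join(g) for g in ngrams(words, 3)])   # trigrams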
Mutual Information

Mutual Information

Statistical test to measure the strength of word association:

I(wi, wj) = log2( P(wi, wj) / (P(wi) P(wj)) ) ≈ log2( N C(wi, wj) / (C(wi) C(wj)) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )

Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev. slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
                       (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))
                       / math.log(2))
    return minfo
Jaganadh G Elements of Text Mining
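A toy usage example (this assumes the ngrams and mutual_info functions from the slides above; the corpus is made up):

words = "strong tea is strong and strong tea is good".split()
scores = mutual_info(words)
for gram, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(gram, score)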
t-score

t-score

Statistical test to measure the strength of word association:

t(wi, wj) = ( mean(P(wi, wj)) − mean(P(wi)) mean(P(wj)) ) / sqrt( σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)) )
          ≈ ( C(wi, wj) − (1/N) C(wi) C(wj) ) / sqrt( C(wi, wj) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

t(strong, tea) = ( C(strong, tea) − (1/N) C(strong) C(tea) ) / sqrt( C(strong, tea) )

Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev. slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = ((gramcount[gram] - (1 / len(words)) *
                      wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) /
                     math.sqrt(gramcount[gram]))
    return tsc
Jaganadh G Elements of Text Mining
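And the corresponding usage for the t-score (same assumptions and toy corpus as the mutual information example):

words = "strong tea is strong and strong tea is good".split()
scores = tscore(words)
for gram, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(gram, score)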
Document Classification
Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents;

semi-supervised document classification is a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance;

unsupervised document classification is entirely executed without reference to external information.

8 http://en.wikipedia.org/wiki/Document_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (or classifier function) γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes the training set T and emits the learned classification function γ: Γ(T) = γ.

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

The main approaches in document classification are:

Naïve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline.9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
Naïve Bayes Classification
Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem, or Bayes' rule:

Bayes' rule: P(H|E) = P(E|H) × P(H) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.

P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.
Jaganadh G Elements of Text Mining
Naïve Bayes Classification
Let H be the event of raining and E be the evidence of dark cloud. Then we have:

P(raining | dark cloud) = P(dark cloud | raining) × P(raining) / P(dark cloud)

For multiple evidences:

P(H | E1, E2, ..., En) = P(E1, E2, ..., En | H) × P(H) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as follows:

P(H | E1, E2, ..., En) = P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) / P(E1, E2, ..., En)
Jaganadh G Elements of Text Mining
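Plugging hypothetical numbers into the rain example makes the rule concrete (all three probabilities below are assumptions for illustration, not data from the slides):

p_cloud_given_rain = 0.9   # assumed P(dark cloud | raining)
p_rain = 0.3               # assumed prior P(raining)
p_cloud = 0.5              # assumed P(dark cloud)
p_rain_given_cloud = p_cloud_given_rain * p_rain / p_cloud
print(p_rain_given_cloud)  # 0.54, the posterior P(raining | dark cloud)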
Naïve Bayes Applied in Text Classification
Now let's try to train a Naïve Bayes classifier to perform text classification.

C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (the vocabulary)

The pre-processed documents for training will look like:
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building the Naïve Bayes Model
The Naïve Bayes model for the training set will look like:

|V|   C   P(Ci)   ni   P(kill|Ci)    P(bomb|Ci)   P(kidnap|Ci)   P(music|Ci)    P(movie|Ci)    P(tv|Ci)
6     T   0.5     15   0.238095238   0.19047619   0.33333333     0.047619048    0.095238095    0.095238095
      E   0.5     12   0.05555556    0.11111111   0.11111111     0.33333333     0.27777778     0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class

Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12
Jaganadh G Elements of Text Mining
Building the Naïve Bayes Model
To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555
Jaganadh G Elements of Text Mining
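A quick sketch that recomputes the Laplace-smoothed conditional probabilities for the Terrorism class, using the class word counts from the training table above:

from __future__ import division

counts = {'kill': 4, 'bomb': 3, 'kidnap': 6, 'music': 0, 'movie': 1, 'tv': 1}
n_terrorism, V = 15, 6
for word in sorted(counts):
    # e.g. kill -> (4 + 1) / (15 + 6) = 5/21 = 0.2380...
    print(word, (counts[word] + 1) / (n_terrorism + V))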
Testing the NB model
Our test document is:

Test Docs   kill   bomb   kidnap   music   movie   tv   C
Dt          2      1      2        0       0       1    ?

To find the posterior probability:

P(ci|W) = P(ci) × ∏_{j=1}^{|V|} P(wj|ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) × P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism)
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) × P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document is classified as "Terrorism" because it got the highest posterior value.
Jaganadh G Elements of Text Mining
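The same posterior computation in a few lines of Python, using the rounded probabilities from the slide:

p_terrorism = 0.5 * 0.2380**2 * 0.1904 * 0.3333**2 * 0.0952
p_entertainment = 0.5 * 0.0555**2 * 0.1111 * 0.1111**2 * 0.1111
print(p_terrorism)      # ~5.7e-05
print(p_entertainment)  # ~2.3e-07
# the class with the highest posterior wins: Terrorism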
Preventing Underflow
The probability score assigned to the test document is very small. In real-world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities:

log P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

log P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem, our system still classifies the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
Jaganadh G Elements of Text Mining
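The log-space version of the same computation, which is how underflow is avoided in practice (base-10 logs, as on the slide):

import math

logp_terrorism = (math.log10(0.5) + 2 * math.log10(0.2380) + math.log10(0.1904)
                  + 2 * math.log10(0.3333) + math.log10(0.0952))
logp_entertainment = (math.log10(0.5) + 2 * math.log10(0.0555) + math.log10(0.1111)
                      + 2 * math.log10(0.1111) + math.log10(0.1111))
print(logp_terrorism)      # ~ -4.24
print(logp_entertainment)  # ~ -6.63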
Naïve Bayes Classifier
There are two different ways to set up a Naïve Bayes classifier:
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in the document.a

a A Comparison of Event Models for Naive Bayes Text Classification. Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we adopt the multi-variate Bernoulli model for text classification, our document space representation will look like:
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequencies have been replaced by the presence (1) or absence (0) of the word.
Jaganadh G Elements of Text Mining
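In sklearn, the multi-variate Bernoulli model corresponds to BernoulliNB with binary features; a minimal sketch on made-up toy documents:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

docs = ["kill bomb kidnap tv", "music movie tv"]   # toy training documents
labels = [0, 1]                                    # 0 = Terrorism, 1 = Entertainment
vec = CountVectorizer(binary=True)                 # presence/absence, not counts
X = vec.fit_transform(docs)
clf = BernoulliNB().fit(X, labels)
print(clf.predict(vec.transform(["kill bomb tv"])))  # -> [0]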
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model, a document is an ordered sequence of word events drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag-of-words representation. We can also use a smoothed bag-of-words model, like TF-IDF, in the multinomial naive Bayes model.
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}_{i=1}^{n}, where xi ∈ R^p (input feature vector) and yi ∈ {1, −1} is the corresponding label, whether (yi = +1) or (yi = −1). To start with, we assume that our input feature vectors are linearly separable; that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = −1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate the positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable, we have to use kernel tricks to find soft margins.
Jaganadh G Elements of Text Mining
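A tiny sketch of the decision function f(x) = ⟨w, x⟩ + b (the weight vector and bias below are hypothetical, not learned from data):

import numpy as np

w = np.array([1.0, -1.0])          # hypothetical weight vector
b = 0.5                            # hypothetical bias
x = np.array([2.0, 1.0])           # a new example to classify
print(np.sign(np.dot(w, x) + b))   # +1.0 -> positive class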
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)
MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)
= 055
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
TF-IDF weight
Finding TF-IDF weight
M_{tf-idf} = M_{|D|×F} × M_idf

[ tf(t1,d1)  tf(t2,d1)  tf(t3,d1)  tf(t4,d1)  tf(t5,d1)  tf(t6,d1)  tf(t7,d1) ]
[ tf(t1,d2)  tf(t2,d2)  tf(t3,d2)  tf(t4,d2)  tf(t5,d2)  tf(t6,d2)  tf(t7,d2) ]
×
[ idf(t1)     0        0        0        0        0        0      ]
[   0      idf(t2)     0        0        0        0        0      ]
[   0         0     idf(t3)     0        0        0        0      ]
[   0         0        0     idf(t4)     0        0        0      ]
[   0         0        0        0     idf(t5)     0        0      ]
[   0         0        0        0        0     idf(t6)     0      ]
[   0         0        0        0        0        0     idf(t7)   ]
Jaganadh G Elements of Text Mining
TF-IDF weight
[ tf(t1,d1)×idf(t1)  tf(t2,d1)×idf(t2)  tf(t3,d1)×idf(t3)  tf(t4,d1)×idf(t4)  tf(t5,d1)×idf(t5)  tf(t6,d1)×idf(t6)  tf(t7,d1)×idf(t7) ]
[ tf(t1,d2)×idf(t1)  tf(t2,d2)×idf(t2)  tf(t3,d2)×idf(t3)  tf(t4,d2)×idf(t4)  tf(t5,d2)×idf(t5)  tf(t6,d2)×idf(t6)  tf(t7,d2)×idf(t7) ]
Jaganadh G Elements of Text Mining
TF-IDF Normalization
L2 Normalization
M_{tf-idf} = M_{tf-idf} / ||M_{tf-idf}||_2
Jaganadh G Elements of Text Mining
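To see the matrix product and the L2 normalization numerically, here is a minimal NumPy sketch for a toy 2-document, 3-term matrix (NumPy and the numbers below are assumptions for illustration; the slides themselves use plain Python and sklearn):

import numpy as np

tf = np.array([[2.0, 1.0, 0.0],   # term frequencies for d1
               [1.0, 0.0, 3.0]])  # term frequencies for d2
idf = np.array([1.0, 0.5, 2.0])   # one assumed idf value per term

tf_idf = tf.dot(np.diag(idf))     # M_{|D|xF} x M_idf
norms = np.sqrt((tf_idf ** 2).sum(axis=1))[:, np.newaxis]
print tf_idf / norms              # each row now has unit L2 norm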
TF-IDF
Practice with Python and sklearn 7
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# min_n is the old (pre-0.14) sklearn n-gram parameter;
# newer versions use ngram_range=(1, 1) instead
vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
        'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, i in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[i])
7 http://scikit-learn.org
Jaganadh G Elements of Text Mining
N-Grams
N-Gram
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application.a
Unigrams are single words
Bigrams are sequences of two words
Trigrams are sequences of three words
a http://en.wikipedia.org/wiki/N-gram
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P(wi | w1, w2, ..., wi−1) ≈ P(wi | wi−1)
Practice with Python
d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words)-2+1)]
bigrams = [' '.join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
'movies mary', 'mary likes', 'likes too']
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P(wi | w1, w2, ..., wi−1) ≈ P(wi | wi−2, wi−1)
Practice with Python
d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [' '.join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
'watch movies mary', 'movies mary likes', 'mary likes too']
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from list of words
def ngrams(words, n=2):
    grams = [' '.join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi, wj) = log2 [ P(wi, wj) / (P(wi) P(wj)) ] ≈ log2 [ N C(wi, wj) / (C(wi) C(wj)) ]

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

I(strong, tea) = log2 [ P(strong, tea) / (P(strong) P(tea)) ] ≈ log2 [ N C(strong, tea) / (C(strong) C(tea)) ]
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev. slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0))
     for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0))
     for gram in grams]
    for gram in grams:
        # I(w1, w2) = log2( N * C(w1, w2) / (C(w1) * C(w2)) )
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
                       (wordcount[gram.split()[0]] *
                        wordcount[gram.split()[1]])) / math.log(2))
    return minfo
Jaganadh G Elements of Text Mining
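A quick usage sketch, assuming the ngrams and mutual_info definitions from the previous slides; it ranks the bigrams of the toy sentence by mutual information:

words = "John likes to watch movies Mary likes too".lower().split()
scores = mutual_info(words)
for gram, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print '%s => %.3f' % (gram, score)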
t-score
t-score
Statistical test to measure strength of word association
t(wi, wj) = [ mean(P(wi, wj)) − mean(P(wi)) mean(P(wj)) ] / sqrt[ σ^2(P(wi, wj)) + σ^2(P(wi)) σ^2(P(wj)) ]
≈ [ C(wi, wj) − (1/N) C(wi) C(wj) ] / sqrt[ C(wi, wj) ]

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

t(strong, tea) = [ C(strong, tea) − (1/N) C(strong) C(tea) ] / sqrt[ C(strong, tea) ]
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev. slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0))
     for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0))
     for gram in grams]
    for gram in grams:
        # t = (C(w1, w2) - (1/N) * C(w1) * C(w2)) / sqrt(C(w1, w2))
        tsc[gram] = ((gramcount[gram] - (1 / len(words)) *
                      wordcount[gram.split()[0]] *
                      wordcount[gram.split()[1]]) /
                     math.sqrt(gramcount[gram]))
    return tsc
Jaganadh G Elements of Text Mining
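The same toy sentence can be scored with t-scores as well (again assuming the ngrams and tscore definitions above); bigrams whose observed count exceeds the chance expectation get positive scores:

words = "John likes to watch movies Mary likes too".lower().split()
for gram, score in sorted(tscore(words).items(), key=lambda kv: -kv[1]):
    print '%s => %.3f' % (gram, score)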
Document Classification
Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.8
Document classification tasks can be divided into three kinds:
supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents
semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance
unsupervised document classification is entirely executed without reference to external information
8 http://en.wikipedia.org/wiki/Document_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (or classifier function) γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits the learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise
Main approaches in document classification are
Naıve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9
9 Image taken from http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
Naïve Bayes Classification
Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem, or Bayes' rule:
Bayes' rule: P(H|E) = [ P(E|H) × P(H) ] / P(E)
The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the a priori (prior) probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.
Jaganadh G Elements of Text Mining
Naïve Bayes Classification
Let H be the event of raining and E be the evidence of a dark cloud; then we have
P(raining | dark cloud) = [ P(dark cloud | raining) × P(raining) ] / P(dark cloud)
For multiple evidences:
P(H | E1, E2, ..., En) = [ P(E1, E2, ..., En | H) × P(H) ] / P(E1, E2, ..., En)
With the independence assumption we can rewrite Bayes' rule as follows:
P(H | E1, E2, ..., En) = [ P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) ] / P(E1, E2, ..., En)
Jaganadh G Elements of Text Mining
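As a sanity check, here is a tiny numeric sketch of Bayes' rule in plain Python; all the probabilities below are invented purely for illustration:

# Assumed illustrative numbers, not from any dataset
p_rain = 0.3               # prior P(raining)
p_cloud_given_rain = 0.9   # likelihood P(dark cloud | raining)
p_cloud = 0.4              # evidence P(dark cloud)

p_rain_given_cloud = p_cloud_given_rain * p_rain / p_cloud
print p_rain_given_cloud   # 0.675, the posterior probability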
Naïve Bayes Applied in Text Classification
Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naïve Bayes Model
Naïve Bayes model for the training set will look like:

|V|  C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
 6   T  0.5    15  0.23809524   0.19047619  0.33333333    0.04761905   0.09523810   0.09523810
     E  0.5    14  0.05000000   0.10000000  0.10000000    0.30000000   0.25000000   0.20000000
|V| = the size of the vocabulary = 6
P(Ci) = the prior probability of each class = (number of documents in the class) / (number of all documents)
ni = the total number of word occurrences in each class
nterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 + 2 = 14
P(wi|ci) = the conditional probability of a keyword occurrence given a class. Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 14 = 0/14
Jaganadh G Elements of Text Mining
Building Naïve Bayes Model
To avoid the "zero frequency" problem we apply Laplace estimation by assuming a uniform distribution over all words, as follows:
P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0 + 0 + 0 + 1) / (14 + |V|) = 1/20 = 0.05
Jaganadh G Elements of Text Mining
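The smoothed conditional probabilities can be derived mechanically from the training table; a minimal plain-Python sketch (the dictionary literals simply restate the training matrix above):

from __future__ import division

vocab = ['kill', 'bomb', 'kidnap', 'music', 'movie', 'tv']
train = {'Terrorism':     [[2,1,3,0,0,1], [1,1,1,0,0,0], [1,1,2,0,1,0]],
         'Entertainment': [[0,1,0,2,1,1], [0,0,1,1,1,0], [0,0,0,2,2,2]]}

for c, docs in train.items():
    counts = [sum(col) for col in zip(*docs)]  # per-word counts in class c
    n = sum(counts)                            # total word count in class c
    for w, cnt in zip(vocab, counts):
        # Laplace smoothing: (count + 1) / (n + |V|)
        print 'P(%s|%s) = %.4f' % (w, c, (cnt + 1) / (n + len(vocab)))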
Testing the NB model
Our test document is:
Test Doc  kill  bomb  kidnap  music  movie  tv  C
Dt         2     1     2       0      0     1   ?
To find the posterior probability:
P(ci|W) = P(ci) × ∏_{j=1..|V|} P(wj|ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)^2 × P(bomb|Terrorism)^1 × P(kidnap|Terrorism)^2 × P(music|Terrorism)^0 × P(movie|Terrorism)^0 × P(tv|Terrorism)^1
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5
P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)^2 × P(bomb|Entertainment)^1 × P(kidnap|Entertainment)^2 × P(music|Entertainment)^0 × P(movie|Entertainment)^0 × P(tv|Entertainment)^1
= 0.5 × 0.05^2 × 0.10^1 × 0.10^2 × 0.30^0 × 0.25^0 × 0.20^1
= 0.5 × 0.0025 × 0.10 × 0.01 × 1 × 1 × 0.20
= 2.5 × 10^-7
The document is classified as "Terrorism" because that class got the highest posterior score.
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small. In real-world situations we will train the classifier with thousands of documents, and in such cases the conditional probability values become too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take logarithms of the probabilities, like:
P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= −0.3010 − 1.2468 − 0.7203 − 0.9543 + 0 + 0 − 1.0213
= −4.2437
P(Entertainment|W) = log(0.5 × 0.05^2 × 0.10^1 × 0.10^2 × 0.30^0 × 0.25^0 × 0.20^1)
= log(0.5) + 2 log(0.05) + 1 log(0.10) + 2 log(0.10) + 0 log(0.30) + 0 log(0.25) + 1 log(0.20)
= −0.3010 − 2.6021 − 1.0000 − 2.0000 + 0 + 0 − 0.6990
= −6.6021
After handling the underflow problem our system still classifies the test document as "Terrorism". From the final probability scores you can observe that the values are scaled to a much more manageable range.
The section on Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
Jaganadh G Elements of Text Mining
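The same scoring can be done compactly in log space; a plain-Python sketch using the smoothed probabilities derived above:

import math

p = {'Terrorism':     {'kill': 5/21.0, 'bomb': 4/21.0, 'kidnap': 7/21.0,
                       'music': 1/21.0, 'movie': 2/21.0, 'tv': 2/21.0},
     'Entertainment': {'kill': 1/20.0, 'bomb': 2/20.0, 'kidnap': 2/20.0,
                       'music': 6/20.0, 'movie': 5/20.0, 'tv': 4/20.0}}
prior = {'Terrorism': 0.5, 'Entertainment': 0.5}
test = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}

for c in p:
    score = math.log10(prior[c]) + sum(n * math.log10(p[c][w])
                                       for w, n in test.items())
    print '%s => %.4f' % (c, score)  # Terrorism wins with the higher score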
Naïve Bayes Classifier
There are two different ways to set up a Naïve Bayes Classifier:
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a
a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will look like:
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
Jaganadh G Elements of Text Mining
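The count-to-Boolean conversion itself is a one-liner; a plain-Python sketch over the multinomial training matrix shown earlier:

counts = [[2,1,3,0,0,1], [1,1,1,0,0,0], [1,1,2,0,1,0],
          [0,1,0,2,1,1], [0,0,1,1,1,0], [0,0,0,2,2,2]]
binary = [[1 if c > 0 else 0 for c in row] for row in counts]
for row in binary:
    print row  # Bernoulli representation: presence/absence only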
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.
In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag-of-words representation. We can also use a smoothed bag-of-words model like TF-IDF in the multinomial naive Bayes model.
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set (xi, yi), i = 1..n, where xi ∈ R^p (input feature vector) and yi ∈ {1, −1} is the corresponding label, (yi = +1) or (yi = −1). To start with, we assume that our input feature vectors are linearly separable; that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that
⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = −1
⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate the positive and negative examples, but they are not all equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
Jaganadh G Elements of Text Mining
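A minimal sketch of the decision rule f(x) = ⟨w, x⟩ + b in plain Python; the weight vector and bias below are invented for illustration, not learned from data:

def decision(w, b, x):
    # the sign of <w, x> + b decides the class
    f = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if f > 0 else -1

w, b = [0.4, -0.7], 0.1            # assumed values, as if produced by training
print decision(w, b, [1.0, 0.2])   # falls on the +1 side of the hyperplane
print decision(w, b, [0.1, 0.9])   # falls on the -1 side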
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.
If there is a clear blank space on the table between the squares and the rectangles, we can say that they are linearly separable. A line drawn through the clear space between the figures, at exactly equal distance from the margins of the square and rectangle scatter regions, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn along the edge of each scatter (square and rectangle), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11
10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.
11 http://www.statsoft.com/textbook/support-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build an SVM Classifier with Python Sklearn
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    # boolean (presence/absence) features, i.e. the Bernoulli view
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)
12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.
Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                         Actual
                    Positive              Negative
Predicted Positive  True Positive (TP)    False Positive (FP)
          Negative  False Negative (FN)   True Negative (TN)
Jaganadh G Elements of Text Mining
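The four cells can be tallied from parallel lists of actual and predicted labels; a small plain-Python sketch with toy labels invented for illustration:

actual    = ['pos', 'pos', 'neg', 'pos', 'neg', 'neg']
predicted = ['pos', 'neg', 'neg', 'pos', 'pos', 'neg']

tp = fp = fn = tn = 0
for a, p in zip(actual, predicted):
    if   a == 'pos' and p == 'pos': tp += 1
    elif a == 'neg' and p == 'pos': fp += 1
    elif a == 'pos' and p == 'neg': fn += 1
    else:                           tn += 1
print tp, fp, fn, tn   # 2 1 1 2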
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = (TP + TN) / (TP + FP + FN + TN)

                    Actual
                    Positive  Negative  Total
Predicted Positive     562       77      639
          Negative     225      436      661
Total                  787      513     1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436)
= 0.76
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision indicates how many of the items that we identified were relevant.
Precision = TP / (TP + FP)
Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".
Recall = TP / (TP + FN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
                    Actual
                    Positive  Negative  Total
Predicted Positive     562       77      639
          Negative     225      436      661
Total                  787      513     1300

Positive Precision = 562 / (562 + 77)
= 0.87
Negative Precision = 436 / (225 + 436)
= 0.65
Positive Recall = 562 / (562 + 225)
= 0.71
Negative Recall = 436 / (77 + 436)
= 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                    Actual
                    Positive  Negative  Total
Predicted Positive     562       77      639
          Negative     225      436      661
Total                  787      513     1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).
Fall-out = FP / (FP + TN)

                    Actual
                    Positive  Negative  Total
Predicted Positive     562       77      639
          Negative     225      436      661
Total                  787      513     1300

Fall-out = 77 / (77 + 436) = 0.15
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and its worst score at 0.a
F1 Score = 2 × (precision × recall) / (precision + recall)
a http://en.wikipedia.org/wiki/F1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78
F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.
Positive predictive value = TP / (TP + FP)
Positive predictive value = 562 / (562 + 77) = 0.87
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.
NPV = TN / (TN + FN)
NPV = 436 / (436 + 225) = 0.65
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TN / (FP + TN)
Specificity = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.
FDR = FP / (FP + TP)
FDR = 77 / (77 + 562) = 0.12
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.
MCC = (TP × TN − FP × FN) / sqrt[ (TP + FP)(TP + FN)(TN + FP)(TN + FN) ]
MCC = (562 × 436 − 77 × 225) / sqrt[ (562 + 77)(562 + 225)(436 + 77)(436 + 225) ]
= 0.55
Jaganadh G Elements of Text Mining
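All of the metrics on the preceding slides can be reproduced from the four confusion-matrix cells; a plain-Python sketch using the example counts (562, 77, 225, 436):

from __future__ import division
import math

tp, fp, fn, tn = 562, 77, 225, 436

accuracy  = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)
fallout   = fp / (fp + tn)
npv       = tn / (tn + fn)
fdr       = fp / (fp + tp)
mcc       = (tp * tn - fp * fn) / math.sqrt(
            (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
# matches the values worked out above, up to rounding
print accuracy, precision, recall, f1, fallout, npv, fdr, mcc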
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis, and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
Jaganadh G Elements of Text Mining
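sklearn's metrics module can compute the ROC points and the AUC directly; a short sketch where the labels and scores below are invented for illustration:

from sklearn.metrics import roc_curve, auc

y_true  = [0, 0, 1, 1, 1, 0, 1]                  # assumed true labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.6]   # assumed classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print "AUC:", auc(fpr, tpr)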
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":  # tree.node in NLTK 2.x; tree.label() in NLTK 3
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child \
of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the \
second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract/
Jaganadh G Elements of Text Mining
References
Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.
Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.
Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.
Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.
Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.
Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.
Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.
Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.
Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.
Jaganadh G Elements of Text Mining
TF-IDF weight
tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)
Jaganadh G Elements of Text Mining
TF-IDF Normalization
L2 Normalization
Mtfminusidf =MtfminusidfMtfminusidf2
Jaganadh G Elements of Text Mining
TF-IDF
Practice with Python and sklearn 7
from sklearnfeature_extractiontext import
CountVectorizer TfidfTransformer
vectorizer = CountVectorizer(analyzer=rsquowordrsquo
min_n=1stop_words=rsquoenglishrsquo)
docs = (rsquoJohn likes to watch movies Mary
likes toorsquorsquoJohn also likes to watch football gamesrsquo)
bow = vectorizerfit_transform(docs)
freq_term_matrix =vectorizertransform(docs)
tfidf = TfidfTransformer(norm=l2)
tfd = tfidffit(freq_term_matrix)
print IDF tfidfidf_
tf_idf_matrix = tfidftransform(freq_term_matrix)
print tf_idf_matrixtodense()
for wf in zip(vectorizervocabulary_tfdidf_)
print rsquor =gt rrsquo (w tfdidf_[f])
7httpscikit-learnorgJaganadh G Elements of Text Mining
N-Grams
N-Gram
In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa
Unigrams are single words
Bigrams are sequences of two words
Trigrams are sequences of three words
ahttpenwikipediaorgwikiN-gram
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from list of words
def ngrams(wordsn=2)
grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]
return grams
words = John likes to watch movies Mary likes too
lower()split()
bigrams = ngrams(wordsn=2)
trigrams = ngrams(wordsn=3)
print bigram
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math
def mutual_info(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
minfo =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
minfo[gram] = (mathlog( len(words) gramcount[ gram ]
wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))
mathlog( 2 )
return minfo
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math
def tscore(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
tsc =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
tsc[gram] = (gramcount[gram] - (1len(words))
wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])
mathsqrt( gramcount[gram])
return tsc
Jaganadh G Elements of Text Mining
Document Classification
Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8
Document classification tasks can be divided into three kinds
supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents
semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance
unsupervised document classification is entirely executed without reference toexternal information
8httpenwikipediaorgwikiDocument_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents
langd crang
wherelangd crangisin D times C using a learning
algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ
γ(ci dj) =
1 if dj belongs to ci0 otherwise
Main approaches in document classification are
Naıve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9
9Image taken from httpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)
P (E)
The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)
P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)
P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)
P (E1E2En)
Jaganadh G Elements of Text Mining
Naıve Bayes Applied in Text Classification
Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
Naıve Bayes Model for the training set will be like [4pt]
|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095
E 05 12 005555566 011111111 011111111 033333333 027777778 011111111
|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class
number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)
15 = 415
P (kill|Entertainment) = (0+0+0)12 = 0
12
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)
(15+|V |) = 521 = 02380
P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1
18 = 00555
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document isTest Docs kill bomb kidnap music movie tv C
Dt 2 1 2 0 0 1 To find the posterior probability
P (ci|W ) = P (ci)timesVprodj=1
P (wj |ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)
= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521
= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5
P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)
= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111
= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7
The document has classified as rdquoTerrorismrdquo because it got the highest value
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289
After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely
The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf
Jaganadh G Elements of Text Mining
Naıve Bayes Classifier
There are two different ways to setup a Naıve Bayes Classifier
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta
aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by presence or absenceof the word
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument
In the NB example which we worked out we applied the multinomial model In the model we used simple bag of
words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set (x_i, y_i), i = 1, ..., n, where x_i ∈ R^p (input feature vector) and y_i ∈ {1, -1} is the corresponding label (y_i = +1 or y_i = -1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, x_i⟩ + b > 0 for y_i = +1
⟨w, x_i⟩ + b < 0 for y_i = -1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but they are not all equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum-margin hyperplane are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
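A toy sketch of the decision rule; here w and b are made-up values for a 2-dimensional problem, not weights learned from data:

import numpy as np

# Hypothetical weight vector and bias
w = np.array([0.4, -0.7])
b = 0.1

def predict(x):
    # The sign of <w, x> + b decides the class (+1 / -1)
    return 1 if np.dot(w, x) + b > 0 else -1

print(predict(np.array([1.0, 0.2])))   # falls on the +1 side
print(predict(np.array([-0.5, 0.9])))  # falls on the -1 side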
Suppose a bunch of red squares and blue rectangles are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.
If there is a clear blank space on the table between the squares and the rectangles, we can say that they are linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the square and rectangle scatter regions, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.
11 http://www.statsoft.com/textbook/support-vector-machines
Practice Time
Let's try to build a Multinomial Naïve Bayes classifier with Python sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Practice Time
Let's try to build an SVM classifier with Python sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Practice Time
Let's try to build a Multi-variate Naïve Bayes classifier with Python NLTK:12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.
Evaluating Performance of a Classifier
Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                      Actual Positive      Actual Negative
Predicted Positive    True Positive (TP)   False Positive (FP)
Predicted Negative    False Negative (FN)  True Negative (TN)
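A quick sketch with sklearn (the toy labels below are invented for illustration):

from sklearn.metrics import confusion_matrix

y_true = ['pos', 'pos', 'neg', 'neg', 'pos', 'neg']
y_pred = ['pos', 'neg', 'neg', 'pos', 'pos', 'neg']
# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred, labels=['pos', 'neg']))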
Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                      Actual Positive  Actual Negative  Total
Predicted Positive    562              77               639
Predicted Negative    225              436              661
Total                 787              513              1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
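To double-check the arithmetic, a two-line Python version (counts taken from the table above):

from __future__ import division

TP, FP, FN, TN = 562, 77, 225, 436
accuracy = (TP + TN) / (TP + FP + FN + TN)
print(accuracy)  # ~0.768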
Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)
                      Actual Positive  Actual Negative  Total
Predicted Positive    562              77               639
Predicted Negative    225              436              661
Total                 787              513              1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84
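The same numbers computed in Python:

from __future__ import division

TP, FP, FN, TN = 562, 77, 225, 436
print(TP / (TP + FP))  # positive precision, ~0.87
print(TN / (TN + FN))  # negative precision, ~0.65
print(TP / (TP + FN))  # positive recall, ~0.71
print(TN / (TN + FP))  # negative recall, ~0.84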
Error Rate

Error rate is the proportion of things done wrong.

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                      Actual Positive  Actual Negative  Total
Predicted Positive    562              77               639
Predicted Negative    225              436              661
Total                 787              513              1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23
Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                      Actual Positive  Actual Negative  Total
Predicted Positive    562              77               639
Predicted Negative    225              436              661
Total                 787              513              1300

Fall-out = 77 / (77 + 436) = 0.15
F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score
F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78
F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
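The same computation as a small Python helper:

def f1(precision, recall):
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

print(f1(0.87, 0.71))  # positive F1, ~0.78
print(f1(0.65, 0.84))  # negative F1, ~0.73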
Positive Predictive Value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive Predictive Value = TP / (TP + FP)

Positive Predictive Value = 562 / (562 + 77) = 0.87
Negative Predictive Value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65
Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84
False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12
Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 − 77 × 225) / sqrt((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55
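And in Python, using the counts from the confusion matrix above:

import math

TP, FP, FN, TN = 562, 77, 225, 436
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print(round(mcc, 2))  # 0.55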
Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
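A minimal ROC/AUC sketch with sklearn; the labels and scores below are invented for illustration:

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(fpr, tpr)                        # points of the ROC curve
print(roc_auc_score(y_true, y_score))  # area under that curve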
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = ("Abraham Lincoln was born February 12, 1809, the "
    "second child of Thomas Lincoln and Nancy Lincoln")
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract
References
Chris Smith. Machine Learning Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith. Machine Learning Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php
Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
TF-IDF Normalization
L2 Normalization

M_tf-idf = M_tf-idf / ||M_tf-idf||_2
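A tiny sketch of what L2 normalization does to one (hypothetical) tf-idf row vector:

import math

vec = [0.0, 1.2, 0.4, 0.0, 0.9]
norm = math.sqrt(sum(v * v for v in vec))  # Euclidean (L2) norm
unit = [v / norm for v in vec]
print(unit)
print(sum(v * v for v in unit))  # ~1.0: the row now has unit length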
TF-IDF
Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, f in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[f])

7 http://scikit-learn.org
N-Grams
N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words, or base pairs, according to the application.a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram
Bigrams
Bigrams

P(w_i | w_1, w_2, ..., w_{i-1}) ≈ P(w_i | w_{i-1})

Practice with Python:

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words)-2+1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
'movies mary', 'mary likes', 'likes too']
Trigrams
Trigrams

P(w_i | w_1, w_2, ..., w_{i-1}) ≈ P(w_i | w_{i-2}, w_{i-1})

Practice with Python:

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
'watch movies mary', 'movies mary likes', 'mary likes too']
Python code to generate N-Grams from a list of words:

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams
Mutual Information
Mutual Information

A statistical test to measure the strength of word association:

I(w_i, w_j) = log2( P(w_i, w_j) / (P(w_i) P(w_j)) ) ≈ log2( N C(w_i, w_j) / (C(w_i) C(w_j)) )

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus, C(w_i, w_j) is the frequency of the bigram w_i w_j, and N is the total number of words in the corpus.

I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )
from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) \
            / math.log(2)
    return minfo
t-score
t-score

A statistical test to measure the strength of word association:

t(w_i, w_j) = (mean(P(w_i, w_j)) − mean(P(w_i)) mean(P(w_j))) / sqrt(σ²(P(w_i, w_j)) + σ²(P(w_i)) σ²(P(w_j)))
≈ (C(w_i, w_j) − (1/N) C(w_i) C(w_j)) / sqrt(C(w_i, w_j))

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus, C(w_i, w_j) is the frequency of the bigram w_i w_j, and N is the total number of words in the corpus.

t(strong, tea) = (C(strong, tea) − (1/N) C(strong) C(tea)) / sqrt(C(strong, tea))
t-score
from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = ((gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) /
            math.sqrt(gramcount[gram]))
    return tsc
Document Classification
Document classification or document categorization is a problem in library science, information science, and computer science. The task is to assign a document to one or more classes or categories.8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification
Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier function γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ.

γ(c_i, d_j) = 1 if d_j belongs to c_i, 0 otherwise

The main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)
A supervised document classification pipeline:9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php
Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem, or Bayes' rule:

Bayes' rule: P(H|E) = (P(E|H) × P(H)) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.
Let H be the event of raining and E be the evidence of a dark cloud; then we have:

P(raining | dark cloud) = (P(dark cloud | raining) × P(raining)) / P(dark cloud)

For multiple evidences:

P(H | E1, E2, ..., En) = (P(E1, E2, ..., En | H) × P(H)) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as:

P(H | E1, E2, ..., En) = (P(E1|H) × P(E2|H) × ... × P(En|H) × P(H)) / P(E1, E2, ..., En)
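A numeric sketch of the rule; the probability values are invented for illustration:

# Assumed toy values: P(raining) = 0.3,
# P(dark cloud | raining) = 0.9, P(dark cloud) = 0.4
p_rain = 0.3
p_cloud_given_rain = 0.9
p_cloud = 0.4
posterior = p_cloud_given_rain * p_rain / p_cloud
print(posterior)  # P(raining | dark cloud) = 0.675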
Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.

C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)

The pre-processed documents for training will look like:
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 0 Entertainment
Building Naïve Bayes Model

The Naïve Bayes model for the training set will look like:

|V| C P(Ci) ni P(kill|Ci) P(bomb|Ci) P(kidnap|Ci) P(music|Ci) P(movie|Ci) P(tv|Ci)
6  T 0.5   15 0.238095238 0.19047619  0.333333333  0.047619048 0.095238095 0.095238095
   E 0.5   12 0.055555556 0.111111111 0.111111111  0.333333333 0.277777778 0.111111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12
To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555
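A minimal sketch of the Laplace (add-one) smoothing, with the raw word counts read off the training table above:

from __future__ import division

# Raw word counts per class from the training table
counts = {
    'Terrorism':     {'kill': 4, 'bomb': 3, 'kidnap': 6,
                      'music': 0, 'movie': 1, 'tv': 1},
    'Entertainment': {'kill': 0, 'bomb': 1, 'kidnap': 1,
                      'music': 5, 'movie': 4, 'tv': 1},
}
V = 6  # vocabulary size

def cond_prob(word, label):
    # Laplace (add-one) smoothed P(word | class)
    n = sum(counts[label].values())
    return (counts[label][word] + 1) / (n + V)

print(cond_prob('kill', 'Terrorism'))      # 5/21 ~ 0.2380
print(cond_prob('kill', 'Entertainment'))  # 1/18 ~ 0.0555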
Testing the NB model
Our test document is:

Test Docs kill bomb kidnap music movie tv C
Dt 2 1 2 0 0 1 ?

To find the posterior probability:

P(ci|W) = P(ci) × Π_{j=1..|V|} P(wj|ci)
P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)^2 × P(bomb|Terrorism)^1 × P(kidnap|Terrorism)^2 × P(music|Terrorism)^0 × P(movie|Terrorism)^0 × P(tv|Terrorism)^1
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)^2 × P(bomb|Entertainment)^1 × P(kidnap|Entertainment)^2 × P(music|Entertainment)^0 × P(movie|Entertainment)^0 × P(tv|Entertainment)^1
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document is classified as "Terrorism" because that class received the highest score.
Preventing Underflow
The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289
After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely
The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf
Jaganadh G Elements of Text Mining
Naıve Bayes Classifier
There are two different ways to setup a Naıve Bayes Classifier
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta
aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by presence or absenceof the word
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument
In the NB example which we worked out we applied the multinomial model In the model we used simple bag of
words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that
〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1
〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way
If there is a clear blank space available in the table in between the square and rectangles we can say that
it is linearly separable A line which drawn in the clear space between the figuresexactly equal length
from the margin of scatter region of square and rectangle is called as separating hyperplane Everything
on the one side of the separating hyper plane belongs to one category and everything in the other side
belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter
(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin
The figures closest to the separating hyperplane are known as support vectors10 If the data is not
linearly separable we have to use kernel tricks 11
10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography
11httpwwwstatsoftcomtextbooksupport-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnnaive_bayes import MultinomialNB
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoMultinomialNB())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a SVM Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnsvm import LinearSVC
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoLinearSVC())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12
import nltkclassifyutil
from nltkclassify import NaiveBayesClassifier
from nltkcorpus import movie_reviews
def word_feats(words)
return dict([(word True) for word in words])
negids = movie_reviewsfileids(rsquonegrsquo)
posids = movie_reviewsfileids(rsquoposrsquo)
negfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquonegrsquo) for f in negids]
posfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquoposrsquo) for f in posids]
negcutoff = len(negfeats)34
poscutoff = len(posfeats)34
trainfeats = negfeats[negcutoff] + posfeats[poscutoff]
classifier = NaiveBayesClassifiertrain(trainfeats)
sent = This is really cool I like it
words = word_feats(sentlower()split())
print classifierclassify(words)
12This code is adopted from httpstreamhackercom20100510
text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm
Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)
MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)
= 055
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
TF-IDF
Practice with Python and sklearn 7
from sklearnfeature_extractiontext import
CountVectorizer TfidfTransformer
vectorizer = CountVectorizer(analyzer=rsquowordrsquo
min_n=1stop_words=rsquoenglishrsquo)
docs = (rsquoJohn likes to watch movies Mary
likes toorsquorsquoJohn also likes to watch football gamesrsquo)
bow = vectorizerfit_transform(docs)
freq_term_matrix =vectorizertransform(docs)
tfidf = TfidfTransformer(norm=l2)
tfd = tfidffit(freq_term_matrix)
print IDF tfidfidf_
tf_idf_matrix = tfidftransform(freq_term_matrix)
print tf_idf_matrixtodense()
for wf in zip(vectorizervocabulary_tfdidf_)
print rsquor =gt rrsquo (w tfdidf_[f])
7httpscikit-learnorgJaganadh G Elements of Text Mining
N-Grams
N-Gram
In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa
Unigrams are single words
Bigrams are sequences of two words
Trigrams are sequences of three words
ahttpenwikipediaorgwikiN-gram
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from list of words
def ngrams(wordsn=2)
grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]
return grams
words = John likes to watch movies Mary likes too
lower()split()
bigrams = ngrams(wordsn=2)
trigrams = ngrams(wordsn=3)
print bigram
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math
def mutual_info(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
minfo =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
minfo[gram] = (mathlog( len(words) gramcount[ gram ]
wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))
mathlog( 2 )
return minfo
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math
def tscore(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
tsc =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
tsc[gram] = (gramcount[gram] - (1len(words))
wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])
mathsqrt( gramcount[gram])
return tsc
Jaganadh G Elements of Text Mining
Document Classification
Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8
Document classification tasks can be divided into three kinds
supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents
semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance
unsupervised document classification is entirely executed without reference toexternal information
8httpenwikipediaorgwikiDocument_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents
langd crang
wherelangd crangisin D times C using a learning
algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ
γ(ci dj) =
1 if dj belongs to ci0 otherwise
Main approaches in document classification are
Naıve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9
9Image taken from httpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)
P (E)
The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)
P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)
P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)
P (E1E2En)
Jaganadh G Elements of Text Mining
Naıve Bayes Applied in Text Classification
Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
Naıve Bayes Model for the training set will be like [4pt]
|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095
E 05 12 005555566 011111111 011111111 033333333 027777778 011111111
|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class
number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)
15 = 415
P (kill|Entertainment) = (0+0+0)12 = 0
12
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)
(15+|V |) = 521 = 02380
P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1
18 = 00555
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document isTest Docs kill bomb kidnap music movie tv C
Dt 2 1 2 0 0 1 To find the posterior probability
P (ci|W ) = P (ci)timesVprodj=1
P (wj |ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)
= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521
= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5
P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)
= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111
= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7
The document has classified as rdquoTerrorismrdquo because it got the highest value
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289
After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely
The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf
Jaganadh G Elements of Text Mining
Naıve Bayes Classifier
There are two different ways to setup a Naıve Bayes Classifier
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta
aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by presence or absenceof the word
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument
In the NB example which we worked out we applied the multinomial model In the model we used simple bag of
words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model
Jaganadh G Elements of Text Mining
Support Vector Machine

A Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set (x_i, y_i), i = 1, ..., n, where x_i ∈ R^p (input feature vectors) and y_i ∈ {1, -1} is the corresponding label, either (y_i = +1) or (y_i = -1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, x_i⟩ + b > 0 for y_i = +1
⟨w, x_i⟩ + b < 0 for y_i = -1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate the positive and negative examples, but they are not all equal: SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
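The resulting decision rule is tiny in code; a sketch with made-up weights (illustrative values only, not learned from data):

def svm_predict(w, b, x):
    # the sign of <w, x> + b decides the class
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else -1

w, b = [0.4, -0.2, 0.7], -0.1  # hypothetical learned weight vector and bias
print(svm_predict(w, b, [1.0, 0.0, 1.0]))  # -> 1
print(svm_predict(w, b, [0.0, 2.0, 0.0]))  # -> -1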
Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed then it is not possible to separate them in a linear way. If there is a clear blank space on the table between the squares and the rectangles, we can say that they are linearly separable. A line drawn through the clear space between the figures, at exactly equal distance from the margins of the square and rectangle scatter regions, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.
11 http://www.statsoft.com/textbook/support-vector-machines
Support Vector Machine

[Figure: a separating hyperplane with its maximum margin and the support vectors lying on the margin]
Practice Time

Let's try to build a multinomial Naïve Bayes classifier with Python and sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Practice Time

Let's try to build an SVM classifier with Python and sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Practice Time

Let's try to build a multi-variate Naïve Bayes classifier with Python and NLTK.12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/. Follow the link for a more detailed discussion.
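The nltk.classify.util import above is only exercised if you also score the classifier on held-out data; a short continuation using the variables defined in the snippet (the 3/4 split leaves the last quarter of each class for testing):

testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print('accuracy: %s' % nltk.classify.util.accuracy(classifier, testfeats))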
Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                                Actual
                      Positive               Negative
Predicted  Positive   True Positive (TP)     False Positive (FP)
           Negative   False Negative (FN)    True Negative (TN)
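Before computing any of the scores below you need the four cells; a minimal sketch (my own helper, not from the slides) that tallies them from parallel lists of actual and predicted labels:

def confusion(actual, predicted, positive='pos'):
    # count (TP, FP, FN, TN) for a two-class problem
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive: tp += 1
        elif p == positive: fp += 1
        elif a == positive: fn += 1
        else: tn += 1
    return tp, fp, fn, tn

print(confusion(['pos', 'pos', 'neg', 'neg'], ['pos', 'neg', 'pos', 'neg']))
# -> (1, 1, 1, 1)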
Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                     Actual
            Positive   Negative   Total
Predicted
  Positive     562         77       639
  Negative     225        436       661
  Total        787        513      1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we actually identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)
Evaluating Performance of a Classifier

                     Actual
            Positive   Negative   Total
Predicted
  Positive     562         77       639
  Negative     225        436       661
  Total        787        513      1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84
Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong.

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                     Actual
            Positive   Negative   Total
Predicted
  Positive     562         77       639
  Negative     225        436       661
  Total        787        513      1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23
Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                     Actual
            Positive   Negative   Total
Predicted
  Positive     562         77       639
  Negative     225        436       661
  Total        787        513      1300

Fall-out = 77 / (77 + 436) = 0.15
Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and its worst score at 0. [a]

F1 Score = 2 · (precision · recall) / (precision + recall)

a. http://en.wikipedia.org/wiki/F1_score
Evaluating Performance of a Classifier

F1 Score (Positive) = 2 · (0.87 · 0.71) / (0.87 + 0.71) = 0.78
F1 Score (Negative) = 2 · (0.65 · 0.84) / (0.65 + 0.84) = 0.73
Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP) = 562 / (562 + 77) = 0.87
Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN) = 436 / (436 + 225) = 0.65
Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN) = 436 / (77 + 436) = 0.84
Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP) = 77 / (77 + 562) = 0.12
Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 − 77 × 225) / √((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55
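Each of these metrics is a one-liner once the four confusion-matrix cells are known. A small sketch (my own helper code, not from the slides) that reproduces the numbers used above:

from __future__ import division
import math

TP, FP, FN, TN = 562, 77, 225, 436

accuracy    = (TP + TN) / (TP + FP + FN + TN)                # 0.76
precision   = TP / (TP + FP)                                 # 0.87
recall      = TP / (TP + FN)                                 # 0.71
error_rate  = (FP + FN) / (TP + FP + FN + TN)                # 0.23
fall_out    = FP / (FP + TN)                                 # 0.15
f1          = 2 * precision * recall / (precision + recall)  # 0.78
npv         = TN / (TN + FN)                                 # 0.65
specificity = TN / (FP + TN)                                 # 0.84
fdr         = FP / (FP + TP)                                 # 0.12
mcc         = ((TP * TN - FP * FN) /
               math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)))  # 0.55

print('accuracy=%.2f precision=%.2f recall=%.2f f1=%.2f mcc=%.2f'
      % (accuracy, precision, recall, f1, mcc))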
Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
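With sklearn available, the curve and the area can be computed directly from sklearn.metrics; a small sketch with made-up labels and scores (illustrative data only, not from the slides):

from sklearn.metrics import roc_curve, auc

y_true  = [1, 1, 1, 1, 0, 0, 0, 0]                  # gold labels
y_score = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]  # classifier confidence for the positive class
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(auc(fpr, tpr))  # area under the ROC curve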
Named Entity Recognition

Named-entity recognition (NER), also known as entity identification and entity extraction, is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Named Entity Recognition

Entity recognition with Python and NLTK:

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":  # on NLTK 3 use tree.label() instead of tree.node
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = ("Abraham Lincoln was born February 12, 1809, the second child "
            "of Thomas Lincoln and Nancy Lincoln")
    entities = extract_entities(sent)
    print entities
Extracting Terms from Text

Extracting terms with Python and topia.termextract.13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = ("Abraham Lincoln was born February 12, 1809, the "
        "second child of Thomas Lincoln and Nancy Lincoln")
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract
References

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php
References

Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb. 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
Preventing Underflow
The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289
After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely
The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf
Jaganadh G Elements of Text Mining
Naıve Bayes Classifier
There are two different ways to setup a Naıve Bayes Classifier
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta
aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by presence or absenceof the word
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument
In the NB example which we worked out we applied the multinomial model In the model we used simple bag of
words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that
〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1
〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way
If there is a clear blank space available in the table in between the square and rectangles we can say that
it is linearly separable A line which drawn in the clear space between the figuresexactly equal length
from the margin of scatter region of square and rectangle is called as separating hyperplane Everything
on the one side of the separating hyper plane belongs to one category and everything in the other side
belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter
(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin
The figures closest to the separating hyperplane are known as support vectors10 If the data is not
linearly separable we have to use kernel tricks 11
10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography
11httpwwwstatsoftcomtextbooksupport-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnnaive_bayes import MultinomialNB
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoMultinomialNB())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a SVM Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnsvm import LinearSVC
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoLinearSVC())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12
import nltkclassifyutil
from nltkclassify import NaiveBayesClassifier
from nltkcorpus import movie_reviews
def word_feats(words)
return dict([(word True) for word in words])
negids = movie_reviewsfileids(rsquonegrsquo)
posids = movie_reviewsfileids(rsquoposrsquo)
negfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquonegrsquo) for f in negids]
posfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquoposrsquo) for f in posids]
negcutoff = len(negfeats)34
poscutoff = len(posfeats)34
trainfeats = negfeats[negcutoff] + posfeats[poscutoff]
classifier = NaiveBayesClassifiertrain(trainfeats)
sent = This is really cool I like it
words = word_feats(sentlower()split())
print classifierclassify(words)
12This code is adopted from httpstreamhackercom20100510
text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm
Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)
MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)
= 055
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]
bigrams = [ join(bigram) for bigram in ibigrams]
print bigrams
[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo
rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from list of words
def ngrams(wordsn=2)
grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]
return grams
words = John likes to watch movies Mary likes too
lower()split()
bigrams = ngrams(wordsn=2)
trigrams = ngrams(wordsn=3)
print bigram
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math
def mutual_info(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
minfo =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
minfo[gram] = (mathlog( len(words) gramcount[ gram ]
wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))
mathlog( 2 )
return minfo
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math
def tscore(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
tsc =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
tsc[gram] = (gramcount[gram] - (1len(words))
wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])
mathsqrt( gramcount[gram])
return tsc
Jaganadh G Elements of Text Mining
Document Classification
Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8
Document classification tasks can be divided into three kinds
supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents
semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance
unsupervised document classification is entirely executed without reference toexternal information
8httpenwikipediaorgwikiDocument_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents
langd crang
wherelangd crangisin D times C using a learning
algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ
γ(ci dj) =
1 if dj belongs to ci0 otherwise
Main approaches in document classification are
Naıve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9
9Image taken from httpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)
P (E)
The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)
P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)
P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)
P (E1E2En)
Jaganadh G Elements of Text Mining
Naıve Bayes Applied in Text Classification
Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
Naıve Bayes Model for the training set will be like [4pt]
|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095
E 05 12 005555566 011111111 011111111 033333333 027777778 011111111
|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class
number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)
15 = 415
P (kill|Entertainment) = (0+0+0)12 = 0
12
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)
(15+|V |) = 521 = 02380
P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1
18 = 00555
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document isTest Docs kill bomb kidnap music movie tv C
Dt 2 1 2 0 0 1 To find the posterior probability
P (ci|W ) = P (ci)timesVprodj=1
P (wj |ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)
= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521
= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5
P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)
= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111
= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7
The document has classified as rdquoTerrorismrdquo because it got the highest value
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289
After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely
The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf
Jaganadh G Elements of Text Mining
Naıve Bayes Classifier
There are two different ways to setup a Naıve Bayes Classifier
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta
aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by presence or absenceof the word
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument
In the NB example which we worked out we applied the multinomial model In the model we used simple bag of
words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that
〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1
〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way
If there is a clear blank space available in the table in between the square and rectangles we can say that
it is linearly separable A line which drawn in the clear space between the figuresexactly equal length
from the margin of scatter region of square and rectangle is called as separating hyperplane Everything
on the one side of the separating hyper plane belongs to one category and everything in the other side
belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter
(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin
The figures closest to the separating hyperplane are known as support vectors10 If the data is not
linearly separable we have to use kernel tricks 11
10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography
11httpwwwstatsoftcomtextbooksupport-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories, shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
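The pipeline above is fit and queried on the full corpus; a quick sanity check is to hold out part of the data and score on it. A minimal sketch (assuming the era's scikit-learn; newer releases import train_test_split from sklearn.model_selection):

from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer releases
train_X, test_X, train_y, test_y = train_test_split(
    training_data.data, training_data.target, test_size=0.25)
_ = classifier.fit(train_X, train_y)
print classifier.score(test_X, test_y)  # mean accuracy on the held-out quarter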
Practice Time
Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories, shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/. Follow the link for a more detailed discussion.
Jaganadh G Elements of Text Mining
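Only three quarters of each class went into training, so the remainder can serve as a test set; a short sketch scoring it with nltk.classify.util.accuracy and listing the most informative features:

testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print nltk.classify.util.accuracy(classifier, testfeats)
classifier.show_most_informative_features(10)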
Evaluating Performance of a Classifier
Confusion Matrix
A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                        Actual
                Positive             Negative
Predicted
  Positive   True Positive (TP)   False Positive (FP)
  Negative   False Negative (FN)  True Negative (TN)
Jaganadh G Elements of Text Mining
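scikit-learn can build this table directly; a minimal sketch with toy label lists (note that sklearn.metrics.confusion_matrix puts actual classes in rows and predictions in columns, the transpose of the layout above):

from sklearn.metrics import confusion_matrix
y_actual    = [1, 1, 1, 0, 0, 0, 1, 0]
y_predicted = [1, 0, 1, 0, 0, 1, 1, 0]
print confusion_matrix(y_actual, y_predicted)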
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = (TP + TN) / (TP + FP + FN + TN)

                      Actual
             Positive   Negative   Total
Predicted
  Positive      562         77       639
  Negative      225        436       661
  Total         787        513      1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision indicates how many of the items that we identified were relevant.
Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".
Recall = TP / (TP + FN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
                      Actual
             Positive   Negative   Total
Predicted
  Positive      562         77       639
  Negative      225        436       661
  Total         787        513      1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
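All of these metrics are simple functions of the four cells of the confusion matrix; a minimal plain-Python sketch using the slide's counts (the commented values match this and the following slides):

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0
print (TP + TN) / (TP + FP + FN + TN)  # accuracy, 0.76
print TP / (TP + FP)                   # positive precision, 0.87
print TP / (TP + FN)                   # positive recall, 0.71
print (FP + FN) / (TP + FP + FN + TN)  # error rate, 0.23
print FP / (FP + TN)                   # fall-out (FPR), 0.15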
Evaluating Performance of a Classifier
Error Rate
Error rate is the proportion of predictions the classifier got wrong.

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                      Actual
             Positive   Negative   Total
Predicted
  Positive      562         77       639
  Negative      225        436       661
  Total         787        513      1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                      Actual
             Positive   Negative   Total
Predicted
  Positive      562         77       639
  Negative      225        436       661
  Total         787        513      1300

Fall-out = 77 / (77 + 436) = 0.15
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and its worst score at 0.^a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score (Positive) = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score (Negative) = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
The false discovery rate (FDR) is the proportion of positive predictions that are false positives; FDR control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 − 77 × 225) / √((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55
Jaganadh G Elements of Text Mining
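A quick check of the slide's arithmetic (same counts as above):

import math
TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0
mcc = (TP * TN - FP * FN) / math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print mcc  # ~0.55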
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis, and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
Jaganadh G Elements of Text Mining
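Both the curve and the area are available in scikit-learn; a minimal sketch with toy scores (the arrays are illustrative only):

from sklearn.metrics import roc_curve, auc
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # classifier confidence for the positive class
fpr, tpr, thresholds = roc_curve(y_true, scores)
print auc(fpr, tpr)  # 0.75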
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":  # tree.label() in newer NLTK versions
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child \
of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the \
second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract
Jaganadh G Elements of Text Mining
References
Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb. 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.
Jaganadh G Elements of Text Mining
Bigrams
Bigrams
P(w_i | w_1, w_2, ..., w_{i−1}) ≈ P(w_i | w_{i−1})
Practice with Python
d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [' '.join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']
Jaganadh G Elements of Text Mining
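The approximation is estimated from counts as P(w_i | w_{i−1}) = C(w_{i−1} w_i) / C(w_{i−1}); a small sketch on the same toy sentence:

from collections import defaultdict

unigram = defaultdict(int)
bigram = defaultdict(int)
for w in words:
    unigram[w] += 1
for w1, w2 in zip(words, words[1:]):
    bigram[(w1, w2)] += 1
# P(to | likes) = C(likes to) / C(likes) = 1/2
print bigram[('likes', 'to')] / float(unigram['likes'])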
Trigrams
Trigrams
P(w_i | w_1, w_2, ..., w_{i−1}) ≈ P(w_i | w_{i−2}, w_{i−1})
Practice with Python
d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [' '.join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from list of words
def ngrams(words, n=2):
    grams = [' '.join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(w_i, w_j) = log2[ P(w_i, w_j) / (P(w_i) P(w_j)) ] ≈ log2[ N × C(w_i, w_j) / (C(w_i) × C(w_j)) ]

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus, C(w_i, w_j) is the frequency of the bigram w_i w_j, and N is the total number of words in the corpus.

I(strong, tea) = log2[ P(strong, tea) / (P(strong) P(tea)) ] ≈ log2[ N × C(strong, tea) / (C(strong) × C(tea)) ]
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev. slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        # I = log2( N * C(wi, wj) / (C(wi) * C(wj)) )
        minfo[gram] = math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) \
            / math.log(2)
    return minfo
Jaganadh G Elements of Text Mining
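A usage sketch ('corpus.txt' is a hypothetical plain-text file): rank the bigrams by mutual information.

words = open('corpus.txt').read().lower().split()  # hypothetical corpus file
mi = mutual_info(words)
for gram in sorted(mi, key=mi.get, reverse=True)[:10]:
    print gram, mi[gram]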
t-score
t-score
Statistical test to measure strength of word association
t(w_i, w_j) = [ mean(P(w_i, w_j)) − mean(P(w_i)) × mean(P(w_j)) ] / √[ σ²(P(w_i, w_j)) + σ²(P(w_i)) × σ²(P(w_j)) ]
            ≈ [ C(w_i, w_j) − (1/N) × C(w_i) × C(w_j) ] / √C(w_i, w_j)

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus, C(w_i, w_j) is the frequency of the bigram w_i w_j, and N is the total number of words in the corpus.

t(strong, tea) = [ C(strong, tea) − (1/N) × C(strong) × C(tea) ] / √C(strong, tea)
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev. slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        # t = ( C(wi, wj) - (1/N) * C(wi) * C(wj) ) / sqrt( C(wi, wj) )
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) \
            / math.sqrt(gramcount[gram])
    return tsc
Jaganadh G Elements of Text Mining
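NLTK ships ready-made implementations of both association measures; a minimal sketch, reusing the words list from the earlier slides:

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)
print finder.nbest(measures.pmi, 10)        # pointwise mutual information
print finder.nbest(measures.student_t, 10)  # t-score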
Document Classification
Document classification or document categorization is a problem in library science, information science, and computer science. The task is to assign a document to one or more classes or categories.^8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents;

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance;

unsupervised document classification is entirely executed without reference to external information.

8 http://en.wikipedia.org/wiki/Document_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c_1, c_2, c_3, ..., c_m) be a set of pre-defined categories.
Let D = (d_1, d_2, d_3, ..., d_n) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (or classifier function) γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ

γ(c_i, d_j) = 1 if d_j belongs to c_i, 0 otherwise

The main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9

[Figure: supervised document classification pipeline]

9 Image taken from http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem, or Bayes' rule:

Bayes' rule: P(H|E) = P(E|H) × P(H) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability: the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability: the probability of an event after the evidence is observed.
Jaganadh G Elements of Text Mining
Naïve Bayes Classification

Let H be the event of raining and E be the evidence of a dark cloud; then we have
P(raining | dark cloud) = P(dark cloud | raining) × P(raining) / P(dark cloud)

For multiple evidences:
P(H | E_1, E_2, ..., E_n) = P(E_1, E_2, ..., E_n | H) × P(H) / P(E_1, E_2, ..., E_n)

With the independence assumption, we can rewrite Bayes' rule as:
P(H | E_1, E_2, ..., E_n) = P(E_1|H) × P(E_2|H) × ... × P(E_n|H) × P(H) / P(E_1, E_2, ..., E_n)
Jaganadh G Elements of Text Mining
Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs   kill  bomb  kidnap  music  movie  tv   C
D0                2     1     3       0      0     1   Terrorism
D1                1     1     1       0      0     0   Terrorism
D2                1     1     2       0      1     0   Terrorism
D3                0     1     0       2      1     1   Entertainment
D4                0     0     1       1      1     0   Entertainment
D5                0     0     0       2      2     2   Entertainment
Jaganadh G Elements of Text Mining
Building Naïve Bayes Model

The Naïve Bayes model for the training set will look like:

|V|  C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)   P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
 6   T   0.5   15  0.238095238  0.19047619   0.333333333   0.047619048  0.095238095  0.095238095
     E   0.5   12  0.055555556  0.111111111  0.111111111   0.333333333  0.277777778  0.111111111

|V| = the size of the vocabulary = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12
Jaganadh G Elements of Text Mining
Building Naïve Bayes Model

To avoid the "zero frequency" problem we apply Laplace estimation, assuming a uniform distribution over all words, as follows:
P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555
Jaganadh G Elements of Text Mining
Testing the NB model

Our test document is:

Test Docs   kill  bomb  kidnap  music  movie  tv   C
Dt            2     1     2       0      0     1   ?

To find the posterior probability:

P(c_i|W) = P(c_i) × ∏_{j=1..|V|} P(w_j|c_i)
Jaganadh G Elements of Text Mining
Testing the NB model
P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) × P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism), each factor raised to the word's frequency in the test document
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^−5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) × P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^−7

The document is classified as "Terrorism" because that class got the highest score.
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small. In real-world situations we will train the classifier with thousands of documents; in such cases the conditional probability values will be too small for the CPU to handle. This problem is called underflow. To resolve it we can take the logarithm of the probabilities:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= −0.3010 − 1.2468 − 0.7203 − 0.9543 + 0 + 0 − 1.0213
= −4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= −0.3010 − 2.5110 − 0.9542 − 1.9085 + 0 + 0 − 0.9542
= −6.6289

After handling the underflow problem our system still classifies the test document as "Terrorism". From the final probability scores you can observe that the values are scaled nicely.

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
Jaganadh G Elements of Text Mining
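The worked example fits in a few lines of Python; a sketch that reproduces the log-space scores from the smoothed probabilities in the model table:

import math

test = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}
cond = {'terrorism':     {'kill': 0.2380, 'bomb': 0.1904, 'kidnap': 0.3333,
                          'music': 0.0476, 'movie': 0.0952, 'tv': 0.0952},
        'entertainment': {'kill': 0.0555, 'bomb': 0.1111, 'kidnap': 0.1111,
                          'music': 0.3333, 'movie': 0.2777, 'tv': 0.1111}}
for label in cond:
    score = math.log10(0.5)  # log of the class prior
    for word, freq in test.items():
        score += freq * math.log10(cond[label][word])
    print label, score  # terrorism ~ -4.24, entertainment ~ -6.63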
Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word w_t from the vocabulary. Dimension t of the vector for document d_i is written B_it, and is either 0 or 1, indicating whether word w_t occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in the document.^a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we adopt the Multi-variate Bernoulli Model for text classification, our document space representation will look like:
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument
In the NB example which we worked out we applied the multinomial model In the model we used simple bag of
words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that
〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1
〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way
If there is a clear blank space available in the table in between the square and rectangles we can say that
it is linearly separable A line which drawn in the clear space between the figuresexactly equal length
from the margin of scatter region of square and rectangle is called as separating hyperplane Everything
on the one side of the separating hyper plane belongs to one category and everything in the other side
belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter
(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin
The figures closest to the separating hyperplane are known as support vectors10 If the data is not
linearly separable we have to use kernel tricks 11
10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography
11httpwwwstatsoftcomtextbooksupport-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnnaive_bayes import MultinomialNB
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoMultinomialNB())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a SVM Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnsvm import LinearSVC
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoLinearSVC())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12
import nltkclassifyutil
from nltkclassify import NaiveBayesClassifier
from nltkcorpus import movie_reviews
def word_feats(words)
return dict([(word True) for word in words])
negids = movie_reviewsfileids(rsquonegrsquo)
posids = movie_reviewsfileids(rsquoposrsquo)
negfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquonegrsquo) for f in negids]
posfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquoposrsquo) for f in posids]
negcutoff = len(negfeats)34
poscutoff = len(posfeats)34
trainfeats = negfeats[negcutoff] + posfeats[poscutoff]
classifier = NaiveBayesClassifiertrain(trainfeats)
sent = This is really cool I like it
words = word_feats(sentlower()split())
print classifierclassify(words)
12This code is adopted from httpstreamhackercom20100510
text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm
Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)
MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)
= 055
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
Trigrams
Trigrams
P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)
Practice with Python
d1 = John likes to watch movies Mary likes too
words = d1lower()split()
itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]
trigrams = [ join(trigram) for trigram in itrigrams]
print trigrams
[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo
rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from list of words
def ngrams(wordsn=2)
grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]
return grams
words = John likes to watch movies Mary likes too
lower()split()
bigrams = ngrams(wordsn=2)
trigrams = ngrams(wordsn=3)
print bigram
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math
def mutual_info(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
minfo =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
minfo[gram] = (mathlog( len(words) gramcount[ gram ]
wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))
mathlog( 2 )
return minfo
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math
def tscore(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
tsc =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
tsc[gram] = (gramcount[gram] - (1len(words))
wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])
mathsqrt( gramcount[gram])
return tsc
Jaganadh G Elements of Text Mining
Document Classification
Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8
Document classification tasks can be divided into three kinds
supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents
semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance
unsupervised document classification is entirely executed without reference toexternal information
8httpenwikipediaorgwikiDocument_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents
langd crang
wherelangd crangisin D times C using a learning
algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ
γ(ci dj) =
1 if dj belongs to ci0 otherwise
Main approaches in document classification are
Naıve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9
9Image taken from httpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)
P (E)
The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)
P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)
P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)
P (E1E2En)
Jaganadh G Elements of Text Mining
Naıve Bayes Applied in Text Classification
Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
Naıve Bayes Model for the training set will be like [4pt]
|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095
E 05 12 005555566 011111111 011111111 033333333 027777778 011111111
|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class
number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)
15 = 415
P (kill|Entertainment) = (0+0+0)12 = 0
12
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)
(15+|V |) = 521 = 02380
P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1
18 = 00555
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document isTest Docs kill bomb kidnap music movie tv C
Dt 2 1 2 0 0 1 To find the posterior probability
P (ci|W ) = P (ci)timesVprodj=1
P (wj |ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)
= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521
= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5
P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)
= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111
= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7
The document has classified as rdquoTerrorismrdquo because it got the highest value
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289
After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely
The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf
Jaganadh G Elements of Text Mining
Naıve Bayes Classifier
There are two different ways to setup a Naıve Bayes Classifier
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta
aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by presence or absenceof the word
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument
In the NB example which we worked out we applied the multinomial model In the model we used simple bag of
words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that
〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1
〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way
If there is a clear blank space available in the table in between the square and rectangles we can say that
it is linearly separable A line which drawn in the clear space between the figuresexactly equal length
from the margin of scatter region of square and rectangle is called as separating hyperplane Everything
on the one side of the separating hyper plane belongs to one category and everything in the other side
belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter
(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin
The figures closest to the separating hyperplane are known as support vectors10 If the data is not
linearly separable we have to use kernel tricks 11
10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography
11httpwwwstatsoftcomtextbooksupport-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnnaive_bayes import MultinomialNB
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoMultinomialNB())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a SVM Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnsvm import LinearSVC
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoLinearSVC())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12
import nltkclassifyutil
from nltkclassify import NaiveBayesClassifier
from nltkcorpus import movie_reviews
def word_feats(words)
return dict([(word True) for word in words])
negids = movie_reviewsfileids(rsquonegrsquo)
posids = movie_reviewsfileids(rsquoposrsquo)
negfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquonegrsquo) for f in negids]
posfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquoposrsquo) for f in posids]
negcutoff = len(negfeats)34
poscutoff = len(posfeats)34
trainfeats = negfeats[negcutoff] + posfeats[poscutoff]
classifier = NaiveBayesClassifiertrain(trainfeats)
sent = This is really cool I like it
words = word_feats(sentlower()split())
print classifierclassify(words)
12This code is adopted from httpstreamhackercom20100510
text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm
Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the proportion of instances the classifier got wrong.

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                        Actual
                        Positive   Negative   Total
Predicted  Positive     562        77         639
           Negative     225        436        661
           Total        787        513        1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23
Evaluating Performance of a Classifier
Fall-out
Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                        Actual
                        Positive   Negative   Total
Predicted  Positive     562        77         639
           Negative     225        436        661
           Total        787        513        1300

Fall-out = 77 / (77 + 436) = 0.15
Evaluating Performance of a Classifier
F1 Score
In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. [a]

F1 Score = 2 × (precision × recall) / (precision + recall)

[a] http://en.wikipedia.org/wiki/F1_score
Evaluating Performance of a Classifier
F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78
F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive Predictive Value = TP / (TP + FP)

Positive Predictive Value = 562 / (562 + 77) = 0.87
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 − 77 × 225) / √((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55
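All of the metrics above follow directly from the four confusion matrix cells, so they are easy to verify in code. A minimal sketch in plain Python, using the worked table above (the variable names are mine):

from __future__ import division
import math

TP, FP, FN, TN = 562, 77, 225, 436

accuracy    = (TP + TN) / (TP + FP + FN + TN)               # 0.76
precision   = TP / (TP + FP)                                # 0.87
recall      = TP / (TP + FN)                                # 0.71
error_rate  = (FP + FN) / (TP + FP + FN + TN)               # 0.23
fall_out    = FP / (FP + TN)                                # 0.15
f1          = 2 * precision * recall / (precision + recall) # 0.78
npv         = TN / (TN + FN)                                # 0.65
specificity = TN / (FP + TN)                                # 0.84
fdr         = FP / (FP + TP)                                # 0.12
mcc         = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))          # 0.55

print accuracy, precision, recall, f1, mcc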
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
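A quick way to compute the curve and its area is scikit-learn's roc_curve and auc. A minimal sketch, with hypothetical gold labels and classifier scores made up for illustration:

import numpy as np
from sklearn.metrics import roc_curve, auc

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])      # hypothetical gold labels
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4,
                   0.35, 0.3, 0.1])               # hypothetical scores

fpr, tpr, thresholds = roc_curve(y_true, scores)  # FPR on X, TPR on Y
print auc(fpr, tpr)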
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child " \
        "of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities
Extracting Terms from Text
Extracting Terms with Python Topia Termextract [13]

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the " \
    "second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

[13] http://pypi.python.org/pypi/topia.termextract/
References
Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept 2012.

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php
References
Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.
Trigrams
Trigrams
P(wi | w1, w2, ..., wi−1) ≈ P(wi | wi−2, wi−1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [' '.join(trigram) for trigram in itrigrams]
print trigrams
['john likes to', 'likes to watch', 'to watch movies',
'watch movies mary', 'movies mary likes', 'mary likes too']
N-Grams
Python code to generate N-Grams from a list of words:

def ngrams(words, n=2):
    grams = [' '.join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams
Mutual Information
Mutual Information
A statistical test to measure the strength of word association:

I(wi, wj) = log2 [ P(wi, wj) / (P(wi) P(wj)) ] ≈ log2 [ N C(wi, wj) / (C(wi) C(wj)) ]

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus, C(wi, wj) is the frequency of the bigram wi wj, and N is the total number of words in the corpus.

I(strong, tea) = log2 [ P(strong, tea) / (P(strong) P(tea)) ] ≈ log2 [ N C(strong, tea) / (C(strong) C(tea)) ]
Mutual Information
from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))) / \
            math.log(2)
    return minfo
t-score
t-score
A statistical test to measure the strength of word association:

t(wi, wj) = [ mean(P(wi, wj)) − mean(P(wi)) mean(P(wj)) ] / √[ σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)) ]
          ≈ [ C(wi, wj) − (1/N) C(wi) C(wj) ] / √C(wi, wj)

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus, C(wi, wj) is the frequency of the bigram wi wj, and N is the total number of words in the corpus.

t(strong, tea) = [ C(strong, tea) − (1/N) C(strong) C(tea) ] / √C(strong, tea)
t-score
from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc
Document Classification
Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. [8]

Document classification tasks can be divided into three kinds:

supervised document classification: performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification: a mixture between supervised and unsupervised classification; some documents or parts of documents are labeled by external assistance

unsupervised document classification: entirely executed without reference to external information

[8] http://en.wikipedia.org/wiki/Document_classification
Document Classification
Formal Definition
Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (or a classifier function) γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ.

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

Main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)
Document Classification
A supervised document classification pipeline [9]
[Figure: supervised document classification pipeline]

[9] Image taken from http://www.python-course.eu/text_classification_introduction.php
Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem, or Bayes' rule:

P(H|E) = [ P(E|H) × P(H) ] / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.
Naıve Bayes Classification
Let H be the event of raining and E be the evidence of a dark cloud; then we have:

P(raining | dark cloud) = [ P(dark cloud | raining) × P(raining) ] / P(dark cloud)

For multiple evidences:

P(H | E1, E2, ..., En) = [ P(E1, E2, ..., En | H) × P(H) ] / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as follows:

P(H | E1, E2, ..., En) = [ P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) ] / P(E1, E2, ..., En)
Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs   kill  bomb  kidnap  music  movie  tv   C
D0              2     1     3       0      0      1    Terrorism
D1              1     1     1       0      0      0    Terrorism
D2              1     1     2       0      1      0    Terrorism
D3              0     1     0       2      1      1    Entertainment
D4              0     0     1       1      1      0    Entertainment
D5              0     0     0       2      2      2    Entertainment
Building Naïve Bayes Model

The Naïve Bayes model for the training set will look like:

|V|  C  P(Ci)  ni   P(kill|Ci)  P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
6    T  0.5    15   0.23809524  0.19047619  0.33333333    0.04761905   0.09523810   0.09523810
     E  0.5    12   0.05555556  0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the size of the vocabulary = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
nterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12
Building Naïve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555
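The smoothed estimate is a one-liner in code. A minimal sketch (the helper name is mine; the counts come from the table above):

def laplace(word_count_in_class, words_in_class, vocab_size):
    # add-one (Laplace) smoothing: (count + 1) / (n_class + |V|)
    return (word_count_in_class + 1.0) / (words_in_class + vocab_size)

print laplace(4, 15, 6)  # P(kill|Terrorism)     = 5/21 ~ 0.2380
print laplace(0, 12, 6)  # P(kill|Entertainment) = 1/18 ~ 0.0555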
Testing the NB model
Our test document is:

Test Docs   kill  bomb  kidnap  music  movie  tv   C
Dt          2     1     2       0      0      1    ?

To find the posterior probability:

P(ci|W) = P(ci) × ∏j=1..|V| P(wj|ci)
Testing the NB model
P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)² × P(bomb|Terrorism)¹ × P(kidnap|Terrorism)² × P(music|Terrorism)⁰ × P(movie|Terrorism)⁰ × P(tv|Terrorism)¹
= 0.5 × 0.2380² × 0.1904¹ × 0.3333² × 0.0476⁰ × 0.0952⁰ × 0.0952¹
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10⁻⁵

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)² × P(bomb|Entertainment)¹ × P(kidnap|Entertainment)² × P(music|Entertainment)⁰ × P(movie|Entertainment)⁰ × P(tv|Entertainment)¹
= 0.5 × 0.0555² × 0.1111¹ × 0.1111² × 0.3333⁰ × 0.2777⁰ × 0.1111¹
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10⁻⁷

The document is classified as "Terrorism" because it got the highest value.
Preventing Underflow
The probability score assigned to the test document is very small. In real-world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called Underflow. To resolve the problem we can take the logarithm of the probabilities, like:

P(Terrorism|W) = log(0.5 × 0.2380² × 0.1904¹ × 0.3333² × 0.0476⁰ × 0.0952⁰ × 0.0952¹)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= −0.3010 − 1.2468 − 0.7203 − 0.9543 + 0 + 0 − 1.0213
= −4.2437

P(Entertainment|W) = log(0.5 × 0.0555² × 0.1111¹ × 0.1111² × 0.3333⁰ × 0.2777⁰ × 0.1111¹)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= −0.3010 − 2.5114 − 0.9542 − 1.9085 + 0 + 0 − 0.9542
= −6.6289

After handling the underflow problem, our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
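The same log-space scoring fits in a few lines of code. A minimal sketch, hard-coding the smoothed probabilities from the model table above and using base-10 logs as in the worked example:

import math

# smoothed P(w|c), in the order kill, bomb, kidnap, music, movie, tv
p_terrorism = [0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952]
p_entertainment = [0.0555, 0.1111, 0.1111, 0.3333, 0.2777, 0.1111]
test_doc = [2, 1, 2, 0, 0, 1]  # word counts in the test document
prior = 0.5

def log_score(probs, counts):
    # log P(c) + sum over words of count(w) * log P(w|c)
    return math.log10(prior) + sum(n * math.log10(p)
        for n, p in zip(counts, probs))

print log_score(p_terrorism, test_doc)      # ~ -4.2437
print log_score(p_entertainment, test_doc)  # ~ -6.6289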
Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in a document. [a]

[a] A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs   kill  bomb  kidnap  music  movie  tv   C
D0              1     1     1       0      0      1    Terrorism
D1              1     1     1       0      0      0    Terrorism
D2              1     1     1       0      1      0    Terrorism
D3              0     1     0       1      1      1    Entertainment
D4              0     0     1       1      1      0    Entertainment
D5              0     0     0       1      1      1    Entertainment

Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
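scikit-learn ships this event model as BernoulliNB. A minimal sketch on the binary table above (the 0/1 class encoding is mine):

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# presence/absence vectors, columns: kill, bomb, kidnap, music, movie, tv
X = np.array([[1, 1, 1, 0, 0, 1], [1, 1, 1, 0, 0, 0], [1, 1, 1, 0, 1, 0],
              [0, 1, 0, 1, 1, 1], [0, 0, 1, 1, 1, 0], [0, 0, 0, 1, 1, 1]])
y = np.array([0, 0, 0, 1, 1, 1])  # 0 = Terrorism, 1 = Entertainment

model = BernoulliNB().fit(X, y)
print model.predict(np.array([[1, 1, 1, 0, 0, 1]]))  # -> [0] (Terrorism)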
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model, a document is an ordered sequence of word events drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag of words representation. We can use a smoothed bag of words model like TF-IDF in the multinomial naive bayes model.
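The worked example maps directly onto scikit-learn's MultinomialNB, whose default alpha=1.0 is exactly the Laplace smoothing used above. A minimal sketch on the toy counts:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# word-count vectors from the training table: kill, bomb, kidnap, music, movie, tv
X = np.array([[2, 1, 3, 0, 0, 1], [1, 1, 1, 0, 0, 0], [1, 1, 2, 0, 1, 0],
              [0, 1, 0, 2, 1, 1], [0, 0, 1, 1, 1, 0], [0, 0, 0, 2, 2, 2]])
y = np.array([0, 0, 0, 1, 1, 1])  # 0 = Terrorism, 1 = Entertainment

model = MultinomialNB(alpha=1.0).fit(X, y)
print model.predict(np.array([[2, 1, 2, 0, 0, 1]]))  # -> [0] (Terrorism)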
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}, i = 1..n, where xi ∈ Rp (input feature vector) and yi ∈ {1, −1} is the corresponding label, whether (yi = +1) or (yi = −1). To start with, we assume that our input feature vectors are linearly separable; that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ Rp (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = −1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable, we have to use kernel tricks to find soft margins.
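Once w and b are learned, prediction is just the sign of the decision function. A minimal sketch with a hypothetical weight vector and bias (not taken from any trained model):

import numpy as np

w = np.array([0.8, -0.4])  # hypothetical weight vector
b = -0.2                   # hypothetical bias

def predict(x):
    # sign of the decision function <w, x> + b
    return 1 if np.dot(w, x) + b > 0 else -1

print predict(np.array([1.0, 0.5]))  # 0.8 - 0.2 - 0.2 = 0.4 > 0 -> +1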
Support Vector Machine
Suppose a bunch of red squares and blue rectangles are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.
If there is a clear blank space available on the table between the squares and rectangles, we can say that they are linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margin of the scatter regions of squares and rectangles, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors. [10] If the data is not linearly separable, we have to use kernel tricks. [11]

[10] This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.
[11] http://www.statsoft.com/textbook/support-vector-machines
Practice Time
Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews" # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Practice Time
Letrsquos try to build a SVM Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnsvm import LinearSVC
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoLinearSVC())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12
import nltkclassifyutil
from nltkclassify import NaiveBayesClassifier
from nltkcorpus import movie_reviews
def word_feats(words)
return dict([(word True) for word in words])
negids = movie_reviewsfileids(rsquonegrsquo)
posids = movie_reviewsfileids(rsquoposrsquo)
negfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquonegrsquo) for f in negids]
posfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquoposrsquo) for f in posids]
negcutoff = len(negfeats)34
poscutoff = len(posfeats)34
trainfeats = negfeats[negcutoff] + posfeats[poscutoff]
classifier = NaiveBayesClassifiertrain(trainfeats)
sent = This is really cool I like it
words = word_feats(sentlower()split())
print classifierclassify(words)
12This code is adopted from httpstreamhackercom20100510
text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm
Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)
MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)
= 055
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
N-Grams
Python code to generate N-Grams from list of words
def ngrams(wordsn=2)
grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]
return grams
words = John likes to watch movies Mary likes too
lower()split()
bigrams = ngrams(wordsn=2)
trigrams = ngrams(wordsn=3)
print bigram
print trigrams
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math
def mutual_info(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
minfo =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
minfo[gram] = (mathlog( len(words) gramcount[ gram ]
wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))
mathlog( 2 )
return minfo
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math
def tscore(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
tsc =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
tsc[gram] = (gramcount[gram] - (1len(words))
wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])
mathsqrt( gramcount[gram])
return tsc
Jaganadh G Elements of Text Mining
Document Classification
Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8
Document classification tasks can be divided into three kinds
supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents
semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance
unsupervised document classification is entirely executed without reference toexternal information
8httpenwikipediaorgwikiDocument_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents
langd crang
wherelangd crangisin D times C using a learning
algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ
γ(ci dj) =
1 if dj belongs to ci0 otherwise
Main approaches in document classification are
Naıve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9
9Image taken from httpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)
P (E)
The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)
P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)
P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)
P (E1E2En)
Jaganadh G Elements of Text Mining
Naıve Bayes Applied in Text Classification
Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
Naıve Bayes Model for the training set will be like [4pt]
|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095
E 05 12 005555566 011111111 011111111 033333333 027777778 011111111
|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class
number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)
15 = 415
P (kill|Entertainment) = (0+0+0)12 = 0
12
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)
(15+|V |) = 521 = 02380
P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1
18 = 00555
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document isTest Docs kill bomb kidnap music movie tv C
Dt 2 1 2 0 0 1 To find the posterior probability
P (ci|W ) = P (ci)timesVprodj=1
P (wj |ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)
= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521
= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5
P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)
= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111
= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7
The document has classified as rdquoTerrorismrdquo because it got the highest value
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289
After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely
The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf
Jaganadh G Elements of Text Mining
Naıve Bayes Classifier
There are two different ways to setup a Naıve Bayes Classifier
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta
aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by presence or absenceof the word
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument
In the NB example which we worked out we applied the multinomial model In the model we used simple bag of
words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that
〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1
〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way
If there is a clear blank space available in the table in between the square and rectangles we can say that
it is linearly separable A line which drawn in the clear space between the figuresexactly equal length
from the margin of scatter region of square and rectangle is called as separating hyperplane Everything
on the one side of the separating hyper plane belongs to one category and everything in the other side
belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter
(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin
The figures closest to the separating hyperplane are known as support vectors10 If the data is not
linearly separable we have to use kernel tricks 11
10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography
11httpwwwstatsoftcomtextbooksupport-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnnaive_bayes import MultinomialNB
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoMultinomialNB())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a SVM Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnsvm import LinearSVC
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoLinearSVC())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12
import nltkclassifyutil
from nltkclassify import NaiveBayesClassifier
from nltkcorpus import movie_reviews
def word_feats(words)
return dict([(word True) for word in words])
negids = movie_reviewsfileids(rsquonegrsquo)
posids = movie_reviewsfileids(rsquoposrsquo)
negfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquonegrsquo) for f in negids]
posfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquoposrsquo) for f in posids]
negcutoff = len(negfeats)34
poscutoff = len(posfeats)34
trainfeats = negfeats[negcutoff] + posfeats[poscutoff]
classifier = NaiveBayesClassifiertrain(trainfeats)
sent = This is really cool I like it
words = word_feats(sentlower()split())
print classifierclassify(words)
12This code is adopted from httpstreamhackercom20100510
text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm
Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
A statistical test to measure the strength of word association:

I(w_i, w_j) = \log_2 \frac{P(w_i, w_j)}{P(w_i) P(w_j)} \approx \log_2 \frac{N \, C(w_i, w_j)}{C(w_i) \, C(w_j)}

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus, C(w_i, w_j) is the frequency of the bigram w_i w_j, and N is the total number of words in the corpus.

I(strong, tea) = \log_2 \frac{P(strong, tea)}{P(strong) P(tea)} \approx \log_2 \frac{N \, C(strong, tea)}{C(strong) \, C(tea)}
Jaganadh G Elements of Text Mining
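As a toy illustration (the counts here are assumed, not taken from a real corpus): with N = 1000, C(strong) = 50, C(tea) = 25 and C(strong, tea) = 8,

I(strong, tea) = \log_2 \frac{1000 \times 8}{50 \times 25} = \log_2 6.4 \approx 2.68

a clearly positive score, indicating that the two words co-occur far more often than chance alone would predict.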
Mutual Information
from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    # Count unigram and bigram frequencies
    for word in words:
        wordcount[word] = 1 + wordcount.get(word, 0)
    for gram in grams:
        gramcount[gram] = 1 + gramcount.get(gram, 0)
    # I(w1, w2) = log2( N * C(w1, w2) / (C(w1) * C(w2)) )
    for gram in grams:
        w1, w2 = gram.split()
        minfo[gram] = math.log(len(words) * gramcount[gram] /
                               (wordcount[w1] * wordcount[w2])) / math.log(2)
    return minfo
Jaganadh G Elements of Text Mining
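The ngrams helper used above comes from an earlier slide and is not repeated here; a minimal bigram-compatible sketch (space-joined grams, so that gram.split() works as used above), together with a usage example, could look like this:

def ngrams(words, n=2):
    # Space-joined n-grams, e.g. ["strong tea", "tea is", ...]
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "strong tea is what I want not weak tea".lower().split()
print(mutual_info(words))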
t-score
t-score
A statistical test to measure the strength of word association:

t(w_i, w_j) = \frac{mean(P(w_i, w_j)) - mean(P(w_i)) \, mean(P(w_j))}{\sqrt{\sigma^2(P(w_i, w_j)) + \sigma^2(P(w_i)) \, \sigma^2(P(w_j))}} \approx \frac{C(w_i, w_j) - \frac{1}{N} C(w_i) C(w_j)}{\sqrt{C(w_i, w_j)}}

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus, C(w_i, w_j) is the frequency of the bigram w_i w_j, and N is the total number of words in the corpus.

t(strong, tea) = \frac{C(strong, tea) - \frac{1}{N} C(strong) C(tea)}{\sqrt{C(strong, tea)}}
Jaganadh G Elements of Text Mining
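Continuing the assumed toy counts used for mutual information above (N = 1000, C(strong) = 50, C(tea) = 25, C(strong, tea) = 8):

t(strong, tea) = \frac{8 - \frac{1}{1000} \times 50 \times 25}{\sqrt{8}} = \frac{6.75}{2.83} \approx 2.39

The larger the t-score, the stronger the evidence that the bigram is a genuine collocation rather than a chance co-occurrence.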
t-score
from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    # Count unigram and bigram frequencies
    for word in words:
        wordcount[word] = 1 + wordcount.get(word, 0)
    for gram in grams:
        gramcount[gram] = 1 + gramcount.get(gram, 0)
    # t = (C(w1, w2) - C(w1) * C(w2) / N) / sqrt(C(w1, w2))
    for gram in grams:
        w1, w2 = gram.split()
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
                     wordcount[w1] * wordcount[w2]) / math.sqrt(gramcount[gram])
    return tsc
Jaganadh G Elements of Text Mining
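A quick usage sketch, reusing the bigram ngrams helper sketched after the mutual information code:

words = "strong tea is better than weak tea".lower().split()
for gram, score in tscore(words).items():
    print('%s\t%.3f' % (gram, score))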
Document Classification
Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8http://en.wikipedia.org/wiki/Document_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = \{c_1, c_2, c_3, \ldots, c_m\} be a set of pre-defined categories.
Let D = \{d_1, d_2, d_3, \ldots, d_n\} be a set of documents to be classified.
Given a training set T of labeled documents \langle d, c \rangle, where \langle d, c \rangle \in D \times C, using a learning algorithm we wish to learn a classifier (or classifier function) \gamma that maps documents to classes: \gamma : D \to C.
A supervised learning algorithm \Gamma takes a training set T and emits the learned classification function \gamma: \Gamma(T) = \gamma

\gamma(c_i, d_j) = 1 if d_j belongs to c_i, 0 otherwise

The main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9
9Image taken from http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem, or Bayes' rule:

Bayes' rule: P(H|E) = \frac{P(E|H) \times P(H)}{P(E)}

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability: the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability: the probability of an event after the evidence is observed.
Jaganadh G Elements of Text Mining
Naïve Bayes Classification

Let H be the event of raining and E be the evidence of a dark cloud; then we have

P(raining | dark cloud) = \frac{P(dark cloud | raining) \times P(raining)}{P(dark cloud)}

For multiple evidences:

P(H|E_1, E_2, \ldots, E_n) = \frac{P(E_1, E_2, \ldots, E_n | H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}

With the independence assumption, we can rewrite Bayes' rule as follows:

P(H|E_1, E_2, \ldots, E_n) = \frac{P(E_1|H) \times P(E_2|H) \times \cdots \times P(E_n|H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}
Jaganadh G Elements of Text Mining
Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building the Naïve Bayes Model

The Naïve Bayes model for the training set will look like:

C  P(C_i)  n_i  P(kill|C_i)  P(bomb|C_i)  P(kidnap|C_i)  P(music|C_i)  P(movie|C_i)  P(tv|C_i)
T  0.5     15   0.2381       0.1905       0.3333         0.0476        0.0952        0.0952
E  0.5     12   0.0556       0.1111       0.1111         0.3333        0.2778        0.1111

|V| = the size of the vocabulary = 6
P(C_i) = the prior probability of each class = (number of documents in the class) / (number of all documents)
n_i = the total word count in each class:
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(w_i|c_i) = the conditional probability of a keyword occurring given a class (the table above already shows the Laplace-smoothed values derived on the next slide). With raw counts, for example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12
Jaganadh G Elements of Text Mining
Building the Naïve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = \frac{2 + 1 + 1 + 1}{15 + |V|} = \frac{5}{21} = 0.2380

P(kill|Entertainment) = \frac{0 + 0 + 0 + 1}{12 + |V|} = \frac{1}{18} = 0.0555
Jaganadh G Elements of Text Mining
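A minimal sketch of the same Laplace-smoothed estimate in Python, using only the counts taken from the training table above:

from __future__ import division

V = 6                                        # vocabulary size
n = {'Terrorism': 15, 'Entertainment': 12}   # total word count per class
kill = {'Terrorism': 2 + 1 + 1, 'Entertainment': 0 + 0 + 0}

for c in ('Terrorism', 'Entertainment'):
    # P(w|c) = (count(w, c) + 1) / (n_c + |V|)
    print('P(kill|%s) = %.4f' % (c, (kill[c] + 1) / (n[c] + V)))
# prints 0.2381 and 0.0556, matching the values above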
Testing the NB model
Our test document is:

Test Docs kill bomb kidnap music movie tv C
Dt        2    1    2      0     0     1  ?

To find the posterior probability:

P(c_i|W) = P(c_i) \times \prod_{j=1}^{|V|} P(w_j|c_i)
Jaganadh G Elements of Text Mining
Testing the NB model
P(Terrorism|W) = P(Terrorism) \times P(kill|Terrorism) \times P(bomb|Terrorism) \times P(kidnap|Terrorism) \times P(music|Terrorism) \times P(movie|Terrorism) \times P(tv|Terrorism)
= 0.5 \times 0.2380^2 \times 0.1904^1 \times 0.3333^2 \times 0.0476^0 \times 0.0952^0 \times 0.0952^1
= 0.5 \times 0.0566 \times 0.1904 \times 0.1110 \times 1 \times 1 \times 0.0952
= 5.7 \times 10^{-5}

P(Entertainment|W) = P(Entertainment) \times P(kill|Entertainment) \times P(bomb|Entertainment) \times P(kidnap|Entertainment) \times P(music|Entertainment) \times P(movie|Entertainment) \times P(tv|Entertainment)
= 0.5 \times 0.0555^2 \times 0.1111^1 \times 0.1111^2 \times 0.3333^0 \times 0.2777^0 \times 0.1111^1
= 0.5 \times 0.0030 \times 0.1111 \times 0.0123 \times 1 \times 1 \times 0.1111
= 2.27 \times 10^{-7}

The document is classified as "Terrorism" because that class received the highest score.
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small. In real-world situations we will train the classifier with thousands of documents; in such cases the conditional probability values become too small for floating-point arithmetic to handle. This problem is called underflow. To resolve it, we can take logarithms of the probabilities:

P(Terrorism|W) = log(0.5 \times 0.2380^2 \times 0.1904^1 \times 0.3333^2 \times 0.0476^0 \times 0.0952^0 \times 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213 = -4.2437

P(Entertainment|W) = log(0.5 \times 0.0555^2 \times 0.1111^1 \times 0.1111^2 \times 0.3333^0 \times 0.2777^0 \times 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.5110 - 0.9542 - 1.9085 + 0 + 0 - 0.9542 = -6.6289

After handling the underflow problem our system still classifies the test document as "Terrorism", and the final probability scores are now on a much more manageable scale.

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
Jaganadh G Elements of Text Mining
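A small sketch of the same log-space computation (the probabilities and exponents are taken from the worked example above):

import math

# (probability, frequency of the word in the test document)
terrorism     = [(0.5, 1), (0.2380, 2), (0.1904, 1), (0.3333, 2), (0.0952, 1)]
entertainment = [(0.5, 1), (0.0555, 2), (0.1111, 1), (0.1111, 2), (0.1111, 1)]

def log_score(factors):
    # Summing logs replaces multiplying probabilities, so scores never underflow
    return sum(f * math.log10(p) for p, f in factors)

print(log_score(terrorism))      # ~ -4.24
print(log_score(entertainment))  # ~ -6.63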
Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V, each dimension of the space t, t \in \{1, \ldots, |V|\}, corresponds to word w_t from the vocabulary. Dimension t of the vector for document d_i is written B_{it}, and is either 0 or 1, indicating whether word w_t occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

aA Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we adopt the Multi-variate Bernoulli Model for text classification, our document space representation will look like:
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model with a simple bag-of-words representation. We can also use a smoothed bag-of-words model, such as TF-IDF, in the multinomial Naïve Bayes model.
Jaganadh G Elements of Text Mining
Support Vector Machine
A Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set \{(x_i, y_i)\}_{i=1}^n, where x_i \in R^p (input feature vector) and y_i \in \{+1, -1\} is the corresponding label. To start with, we assume that our input feature vectors are linearly separable; that is, there exists a function f(x) = \langle w, x \rangle + b, where w \in R^p (weight vector) and b \in R (bias), such that

\langle w, x_i \rangle + b > 0 for y_i = +1
\langle w, x_i \rangle + b < 0 for y_i = -1

\langle w, x \rangle + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate the positive and negative examples, but they are not all equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum-margin hyperplane are called the support vectors. If the data is not linearly separable, we have to use kernel tricks to find soft margins.
Jaganadh G Elements of Text Mining
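A toy sketch of the linear decision function f(x) = \langle w, x \rangle + b (the weights and bias here are assumed, purely for illustration; in practice they are learned from the training set):

import numpy as np

w = np.array([0.8, -0.5])   # assumed weight vector
b = -0.2                    # assumed bias

def predict(x):
    # The sign of <w, x> + b decides the class
    return 1 if np.dot(w, x) + b > 0 else -1

print(predict(np.array([1.0, 0.3])))   # ->  1
print(predict(np.array([0.1, 0.9])))   # -> -1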
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, it is not possible to separate them in a linear way. If there is a clear blank space on the table between the squares and the rectangles, we can say that they are linearly separable. A line drawn through the clear space between the figures, at exactly equal distance from the margins of the square and rectangle scatter regions, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn along the edge of each scatter (square and rectangle), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable, we have to use kernel tricks.11
10This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography
11http://www.statsoft.com/textbook/support-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print(training_data.target_names[classifier.predict(['This is a good one'])[0]])
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build an SVM Classifier with Python Sklearn
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print(training_data.target_names[classifier.predict(['This is a good one'])[0]])
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print(classifier.classify(words))
12This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/. Follow the link for a more detailed discussion.
Jaganadh G Elements of Text Mining
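A natural follow-up (a sketch reusing the variables from the slide above; nltk.classify.util is already imported there) is to evaluate the classifier on the held-out quarter of each class:

testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print('accuracy: %.4f' % nltk.classify.util.accuracy(classifier, testfeats))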
Evaluating Performance of a Classifier
Confusion Matrix
A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                          Actual
                          Positive             Negative
Predicted  Positive       True Positive (TP)   False Positive (FP)
           Negative       False Negative (FN)  True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = \frac{TP + TN}{TP + FP + FN + TN}

                          Actual
                          Positive   Negative   Total
Predicted  Positive       562        77         639
           Negative       225        436        661
           Total          787        513        1300

Accuracy = \frac{562 + 436}{562 + 77 + 225 + 436} = 0.76
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision indicates how many of the items that we identified were relevant:

Precision = \frac{TP}{TP + FP}

Recall indicates how many of the relevant items we identified; it is equivalent to "hit rate" and "sensitivity":

Recall = \frac{TP}{TP + FN}
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
                          Actual
                          Positive   Negative   Total
Predicted  Positive       562        77         639
           Negative       225        436        661
           Total          787        513        1300

Positive Precision = \frac{562}{562 + 77} = 0.87

Negative Precision = \frac{436}{225 + 436} = 0.65

Positive Recall = \frac{562}{562 + 225} = 0.71

Negative Recall = \frac{436}{77 + 436} = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the proportion of predictions that were wrong:

ErrorRate = \frac{FP + FN}{TP + FP + FN + TN}

                          Actual
                          Positive   Negative   Total
Predicted  Positive       562        77         639
           Negative       225        436        661
           Total          787        513        1300

ErrorRate = \frac{77 + 225}{562 + 77 + 225 + 436} = 0.23
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR):

Fall-out = \frac{FP}{FP + TN}

                          Actual
                          Positive   Negative   Total
Predicted  Positive       562        77         639
           Negative       225        436        661
           Total          787        513        1300

Fall-out = \frac{77}{77 + 436} = 0.15
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of precision and recall, where an F1 score reaches its best value at 1 and its worst at 0.a

F1 Score = 2 \times \frac{precision \times recall}{precision + recall}

ahttp://en.wikipedia.org/wiki/F1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score (Positive) = 2 \times \frac{0.87 \times 0.71}{0.87 + 0.71} = 0.78

F1 Score (Negative) = 2 \times \frac{0.65 \times 0.84}{0.65 + 0.84} = 0.73
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value, or precision rate, is the proportion of positive test results that are true positives:

Positive predictive value = \frac{TP}{TP + FP}

Positive predictive value = \frac{562}{562 + 77} = 0.87
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure:

NPV = \frac{TN}{TN + FN}

NPV = \frac{436}{436 + 225} = 0.65
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = \frac{TN}{FP + TN}

Specificity = \frac{436}{77 + 436} = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
The false discovery rate (FDR) is the proportion of predicted positives that are actually negative. (FDR control is also a statistical method used in multiple hypothesis testing to correct for multiple comparisons.)

FDR = \frac{FP}{FP + TP}

FDR = \frac{77}{77 + 562} = 0.12
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}

MCC = \frac{562 \times 436 - 77 \times 225}{\sqrt{(562 + 77)(562 + 225)(436 + 77)(436 + 225)}} = 0.55
Jaganadh G Elements of Text Mining
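Most of the metrics in this section do not have to be computed by hand; a minimal sketch with scikit-learn's metrics module (the gold labels and predictions here are a tiny assumed example, not the 1300-document table above):

from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score,
                             f1_score, matthews_corrcoef)

y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]   # assumed gold labels
y_pred = [1, 1, 0, 0, 0, 1, 1, 0, 1, 1]   # assumed predictions

print(confusion_matrix(y_true, y_pred))
print('accuracy : %.2f' % accuracy_score(y_true, y_pred))
print('precision: %.2f' % precision_score(y_true, y_pred))
print('recall   : %.2f' % recall_score(y_true, y_pred))
print('F1       : %.2f' % f1_score(y_true, y_pred))
print('MCC      : %.2f' % matthews_corrcoef(y_true, y_pred))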
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis, and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
Jaganadh G Elements of Text Mining
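A small sketch of the ROC/AUC computation with scikit-learn (the scores here are assumed toy values standing in for a classifier's predicted probabilities of the positive class):

from sklearn.metrics import roc_curve, auc

y_true  = [1, 1, 0, 1, 0, 0, 1, 0]                    # assumed gold labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]   # assumed scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print('AUC = %.2f' % auc(fpr, tpr))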
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations and locations, expressions of times, quantities, monetary values, percentages, etc.
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
                       binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == 'NE':
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == '__main__':
    sent = "Abraham Lincoln was born February 12, 1809, the second child " \
           "of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print(entities)
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the " \
       "second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print(terms)

13http://pypi.python.org/pypi/topia.termextract
Jaganadh G Elements of Text Mining
References
Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part I, http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part II, http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog, Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora, acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly, Practical Text Mining with Perl, Wiley, 2008.

Text Categorization and Classification, http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb. 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce, www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.
Jaganadh G Elements of Text Mining
Mutual Information
Mutual Information
Statistical test to measure strength of word association
I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2
NC(wiwj)C(wi)C(wj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
I(strong tea) = log2P (strongtea)
P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math
def mutual_info(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
minfo =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
minfo[gram] = (mathlog( len(words) gramcount[ gram ]
wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))
mathlog( 2 )
return minfo
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math
def tscore(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
tsc =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
tsc[gram] = (gramcount[gram] - (1len(words))
wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])
mathsqrt( gramcount[gram])
return tsc
Jaganadh G Elements of Text Mining
Document Classification
Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8
Document classification tasks can be divided into three kinds
supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents
semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance
unsupervised document classification is entirely executed without reference toexternal information
8httpenwikipediaorgwikiDocument_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents
langd crang
wherelangd crangisin D times C using a learning
algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ
γ(ci dj) =
1 if dj belongs to ci0 otherwise
Main approaches in document classification are
Naıve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9
9Image taken from httpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)
P (E)
The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)
P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)
P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)
P (E1E2En)
Jaganadh G Elements of Text Mining
Naıve Bayes Applied in Text Classification
Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
Naıve Bayes Model for the training set will be like [4pt]
|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095
E 05 12 005555566 011111111 011111111 033333333 027777778 011111111
|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class
number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)
15 = 415
P (kill|Entertainment) = (0+0+0)12 = 0
12
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)
(15+|V |) = 521 = 02380
P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1
18 = 00555
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document isTest Docs kill bomb kidnap music movie tv C
Dt 2 1 2 0 0 1 To find the posterior probability
P (ci|W ) = P (ci)timesVprodj=1
P (wj |ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)
= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521
= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5
P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)
= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111
= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7
The document has classified as rdquoTerrorismrdquo because it got the highest value
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289
After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely
The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf
Jaganadh G Elements of Text Mining
Naıve Bayes Classifier
There are two different ways to setup a Naıve Bayes Classifier
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta
aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by presence or absenceof the word
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument
In the NB example which we worked out we applied the multinomial model In the model we used simple bag of
words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that
〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1
〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way
If there is a clear blank space available in the table in between the square and rectangles we can say that
it is linearly separable A line which drawn in the clear space between the figuresexactly equal length
from the margin of scatter region of square and rectangle is called as separating hyperplane Everything
on the one side of the separating hyper plane belongs to one category and everything in the other side
belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter
(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin
The figures closest to the separating hyperplane are known as support vectors10 If the data is not
linearly separable we have to use kernel tricks 11
10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography
11httpwwwstatsoftcomtextbooksupport-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnnaive_bayes import MultinomialNB
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoMultinomialNB())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a SVM Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnsvm import LinearSVC
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoLinearSVC())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12
import nltkclassifyutil
from nltkclassify import NaiveBayesClassifier
from nltkcorpus import movie_reviews
def word_feats(words)
return dict([(word True) for word in words])
negids = movie_reviewsfileids(rsquonegrsquo)
posids = movie_reviewsfileids(rsquoposrsquo)
negfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquonegrsquo) for f in negids]
posfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquoposrsquo) for f in posids]
negcutoff = len(negfeats)34
poscutoff = len(posfeats)34
trainfeats = negfeats[negcutoff] + posfeats[poscutoff]
classifier = NaiveBayesClassifiertrain(trainfeats)
sent = This is really cool I like it
words = word_feats(sentlower()split())
print classifierclassify(words)
12This code is adopted from httpstreamhackercom20100510
text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm
Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)
MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)
= 055
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
Mutual Information
from __future__ import division
import math
def mutual_info(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
minfo =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
minfo[gram] = (mathlog( len(words) gramcount[ gram ]
wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))
mathlog( 2 )
return minfo
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math
def tscore(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
tsc =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
tsc[gram] = (gramcount[gram] - (1len(words))
wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])
mathsqrt( gramcount[gram])
return tsc
Jaganadh G Elements of Text Mining
Document Classification
Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8
Document classification tasks can be divided into three kinds
supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents
semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance
unsupervised document classification is entirely executed without reference toexternal information
8httpenwikipediaorgwikiDocument_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents
langd crang
wherelangd crangisin D times C using a learning
algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ
γ(ci dj) =
1 if dj belongs to ci0 otherwise
Main approaches in document classification are
Naıve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9
9Image taken from httpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)
P (E)
The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)
P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)
P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)
P (E1E2En)
Jaganadh G Elements of Text Mining
Naıve Bayes Applied in Text Classification
Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
Naıve Bayes Model for the training set will be like [4pt]
|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095
E 05 12 005555566 011111111 011111111 033333333 027777778 011111111
|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class
number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)
15 = 415
P (kill|Entertainment) = (0+0+0)12 = 0
12
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)
(15+|V |) = 521 = 02380
P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1
18 = 00555
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document isTest Docs kill bomb kidnap music movie tv C
Dt 2 1 2 0 0 1 To find the posterior probability
P (ci|W ) = P (ci)timesVprodj=1
P (wj |ci)
Jaganadh G Elements of Text Mining
Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)^2 × P(bomb|Terrorism)^1 × P(kidnap|Terrorism)^2 × P(music|Terrorism)^0 × P(movie|Terrorism)^0 × P(tv|Terrorism)^1
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)^2 × P(bomb|Entertainment)^1 × P(kidnap|Entertainment)^2 × P(music|Entertainment)^0 × P(movie|Entertainment)^0 × P(tv|Entertainment)^1
= 0.5 × 0.05^2 × 0.10^1 × 0.10^2 × 0.30^0 × 0.25^0 × 0.20^1
= 0.5 × 0.0025 × 0.10 × 0.01 × 1 × 1 × 0.20
= 2.5 × 10^-7

The document is classified as "Terrorism" because it received the highest value.
Preventing Underflow

The probability score assigned to the test document is very small. In real-world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.05^2 × 0.10^1 × 0.10^2 × 0.30^0 × 0.25^0 × 0.20^1)
= log(0.5) + 2 log(0.05) + 1 log(0.10) + 2 log(0.10) + 0 log(0.30) + 0 log(0.25) + 1 log(0.20)
= -0.3010 - 2.6021 - 1.0000 - 2.0000 + 0 + 0 - 0.6990
= -6.6021

After handling the underflow problem, our system still classifies the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.
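A minimal sketch of the log-space scoring, reusing cond_prob from the sketch above; with exact fractions it gives roughly -4.243 and -6.602, matching the truncated figures up to rounding:

import math

priors = {'Terrorism': 0.5, 'Entertainment': 0.5}
test_doc = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}

def log_score(doc, cls):
    # sum of log-probabilities instead of a product of probabilities
    score = math.log10(priors[cls])
    for word, freq in doc.items():
        score += freq * math.log10(cond_prob(word, cls))
    return score

print log_score(test_doc, 'Terrorism')       # approx -4.243
print log_score(test_doc, 'Entertainment')   # approx -6.602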
The section on the Naïve Bayes classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes classifier:

Multi-variate Bernoulli Model

Multinomial Model
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in the document.a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we adopt the Multi-variate Bernoulli Model for text classification, our document space representation will look like:
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequencies have been replaced by the presence or absence of the word, as the sketch below illustrates.
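A minimal sketch of that thresholding step, using the D0 and D3 rows from the multinomial table:

# count vectors for D0 and D3 from the multinomial training table
count_vectors = [[2, 1, 3, 0, 0, 1],
                 [0, 1, 0, 2, 1, 1]]
# presence/absence vectors for the multi-variate Bernoulli model
binary_vectors = [[1 if c > 0 else 0 for c in row] for row in count_vectors]
print binary_vectors  # [[1, 1, 1, 0, 0, 1], [0, 1, 0, 1, 1, 1]]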
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag-of-words representation. We can also use a smoothed bag-of-words model like TF-IDF in the multinomial naive Bayes model.
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}, i = 1, ..., n, where xi ∈ R^p (input feature vector) and yi ∈ {+1, -1} is the corresponding label (yi = +1 or yi = -1). To start with, we assume that our input feature vectors are linearly separable; that is, there exists a function f(x) = 〈w, x〉 + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

〈w, xi〉 + b > 0 for yi = +1
〈w, xi〉 + b < 0 for yi = -1

〈w, x〉 + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate the positive and negative examples, but they are not all equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum-margin hyperplane are called support vectors. If the data is not linearly separable, we have to use kernel tricks to find soft margins.
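A minimal sketch of the linear decision rule; the weight vector and bias below are arbitrary assumptions, only to show the form of f(x) = 〈w, x〉 + b:

def predict(w, b, x):
    # sign of the decision function <w, x> + b
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else -1

w = [0.4, -0.2]   # hypothetical learned weight vector
b = -0.1          # hypothetical bias
print predict(w, b, [1.0, 0.5])   # 1, since 0.4 - 0.1 - 0.1 = 0.2 > 0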
Support Vector Machine
Suppose a bunch of red squares and blue rectangles are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space on the table between the squares and the rectangles, we can say that they are linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the square and rectangle scatter regions, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn along the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable, we have to use kernel tricks.11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.
11 http://www.statsoft.com/textbook/support-vector-machines
Support Vector Machine

[Figure: a separating hyperplane with its maximum margin and support vectors]
Practice Time
Let's try to build a Multinomial Naïve Bayes classifier with Python sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)

prediction = classifier.predict(['This is a good one'])
print training_data.target_names[prediction[0]]
Practice Time
Let's try to build an SVM classifier with Python sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)

prediction = classifier.predict(['This is a good one'])
print training_data.target_names[prediction[0]]
Practice Time
Let's try to build a multi-variate Naïve Bayes classifier with Python NLTK: 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]

classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/. Follow the link for a more detailed discussion.
Evaluating Performance of a Classifier
Confusion Matrix
A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                            Actual
                   Positive              Negative
Predicted
  Positive    True Positive (TP)    False Positive (FP)
  Negative    False Negative (FN)   True Negative (TN)
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = (TP + TN) / (TP + FP + FN + TN)

                 Actual
            Positive  Negative  Total
Predicted
  Positive       562        77    639
  Negative       225       436    661
  Total          787       513   1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
Evaluating Performance of a Classifier
Precision and Recall
Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)
Evaluating Performance of a Classifier
                 Actual
            Positive  Negative  Total
Predicted
  Positive       562        77    639
  Negative       225       436    661
  Total          787       513   1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84
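These figures can be verified with a short Python sketch (the variable names are mine; values match the slides up to truncation to two decimals):

tp, fp, fn, tn = 562.0, 77.0, 225.0, 436.0

accuracy = (tp + tn) / (tp + fp + fn + tn)   # 0.7677
pos_precision = tp / (tp + fp)               # 0.8795
neg_precision = tn / (tn + fn)               # 0.6596
pos_recall = tp / (tp + fn)                  # 0.7141
neg_recall = tn / (tn + fp)                  # 0.8499

print pos_precision, pos_recall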
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong.

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                 Actual
            Positive  Negative  Total
Predicted
  Positive       562        77    639
  Negative       225       436    661
  Total          787       513   1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23
Evaluating Performance of a Classifier
Fall-out
Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                 Actual
            Positive  Negative  Total
Predicted
  Positive       562        77    639
  Negative       225       436    661
  Total          787       513   1300

Fall-out = 77 / (77 + 436) = 0.15
Evaluating Performance of a Classifier
F1 Score
In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of precision and recall, where an F1 score reaches its best value at 1 and its worst at 0.a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score
Evaluating Performance of a Classifier
F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN - FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 - 77 × 225) / sqrt((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55
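The remaining rates from the same confusion matrix, as a minimal sketch:

import math

tp, fp, fn, tn = 562.0, 77.0, 225.0, 436.0

error_rate = (fp + fn) / (tp + fp + fn + tn)   # 0.2323
fallout = fp / (fp + tn)                       # 0.1501
npv = tn / (tn + fn)                           # 0.6596
specificity = tn / (fp + tn)                   # 0.8499
fdr = fp / (fp + tp)                           # 0.1205
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))   # 0.5514

print mcc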
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
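A minimal sketch with scikit-learn's metrics module; the labels and scores below are made-up toy data:

from sklearn.metrics import roc_curve, auc

y_true = [0, 0, 1, 1, 0, 1, 1, 0]                    # actual classes (toy data)
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.3]  # predicted scores (toy data)

fpr, tpr, thresholds = roc_curve(y_true, y_score)    # points of the ROC curve
print auc(fpr, tpr)                                  # area under the curve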
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER), also known as entity identification and entity extraction, is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, and locations, expressions of times, quantities, monetary values, percentages, etc.
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child \
of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities
Extracting Terms from Text
Extracting terms with Python topia.termextract: 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the \
second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract/
References
Chris Smith. Machine Learning, Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith. Machine Learning, Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php
References
Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb. 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
t-score
t-score
A statistical test to measure the strength of word association:

t(wi, wj) = (mean(P(wi, wj)) - mean(P(wi)) mean(P(wj))) / sqrt(σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)))
≈ (C(wi, wj) - (1/N) C(wi) C(wj)) / sqrt(C(wi, wj))

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

t(strong, tea) = (C(strong, tea) - (1/N) C(strong) C(tea)) / sqrt(C(strong, tea))
t-score
from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    for word in words:
        wordcount[word] = wordcount.get(word, 0) + 1
    for gram in grams:
        gramcount[gram] = gramcount.get(gram, 0) + 1
    for gram in grams:
        w1, w2 = gram.split()
        tsc[gram] = ((gramcount[gram] - (1 / len(words)) *
                      wordcount[w1] * wordcount[w2]) /
                     math.sqrt(gramcount[gram]))
    return tsc
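A usage sketch; the ngrams helper here is an assumption standing in for the function from the earlier slide (it must return space-joined bigrams, since tscore splits each gram on whitespace):

def ngrams(words, n=2):
    # space-joined n-grams, e.g. ['strong tea', 'tea is', ...]
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "strong tea is strong and hot".split()
scores = tscore(words)
print scores['strong tea']   # 1 - (2 * 1)/6 = 0.666...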
Document Classification
Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8
Document classification tasks can be divided into three kinds
supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents
semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance
unsupervised document classification is entirely executed without reference toexternal information
8httpenwikipediaorgwikiDocument_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents
langd crang
wherelangd crangisin D times C using a learning
algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ
γ(ci dj) =
1 if dj belongs to ci0 otherwise
Main approaches in document classification are
Naıve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9
9Image taken from httpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)
P (E)
The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)
P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)
P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)
P (E1E2En)
Jaganadh G Elements of Text Mining
Naıve Bayes Applied in Text Classification
Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
Naıve Bayes Model for the training set will be like [4pt]
|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095
E 05 12 005555566 011111111 011111111 033333333 027777778 011111111
|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class
number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)
15 = 415
P (kill|Entertainment) = (0+0+0)12 = 0
12
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)
(15+|V |) = 521 = 02380
P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1
18 = 00555
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document isTest Docs kill bomb kidnap music movie tv C
Dt 2 1 2 0 0 1 To find the posterior probability
P (ci|W ) = P (ci)timesVprodj=1
P (wj |ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)
= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521
= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5
P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)
= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111
= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7
The document has classified as rdquoTerrorismrdquo because it got the highest value
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289
After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely
The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf
Jaganadh G Elements of Text Mining
Naıve Bayes Classifier
There are two different ways to setup a Naıve Bayes Classifier
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta
aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by presence or absenceof the word
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument
In the NB example which we worked out we applied the multinomial model In the model we used simple bag of
words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that
〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1
〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way
If there is a clear blank space available in the table in between the square and rectangles we can say that
it is linearly separable A line which drawn in the clear space between the figuresexactly equal length
from the margin of scatter region of square and rectangle is called as separating hyperplane Everything
on the one side of the separating hyper plane belongs to one category and everything in the other side
belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter
(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin
The figures closest to the separating hyperplane are known as support vectors10 If the data is not
linearly separable we have to use kernel tricks 11
10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography
11httpwwwstatsoftcomtextbooksupport-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnnaive_bayes import MultinomialNB
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoMultinomialNB())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a SVM Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnsvm import LinearSVC
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoLinearSVC())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12
import nltkclassifyutil
from nltkclassify import NaiveBayesClassifier
from nltkcorpus import movie_reviews
def word_feats(words)
return dict([(word True) for word in words])
negids = movie_reviewsfileids(rsquonegrsquo)
posids = movie_reviewsfileids(rsquoposrsquo)
negfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquonegrsquo) for f in negids]
posfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquoposrsquo) for f in posids]
negcutoff = len(negfeats)34
poscutoff = len(posfeats)34
trainfeats = negfeats[negcutoff] + posfeats[poscutoff]
classifier = NaiveBayesClassifiertrain(trainfeats)
sent = This is really cool I like it
words = word_feats(sentlower()split())
print classifierclassify(words)
12This code is adopted from httpstreamhackercom20100510
text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm
Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)
MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)
= 055
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
t-score
t-score
Statistical test to measure strength of word association
t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic
σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))
asymp C(wiwj)minus 1NC(wi)C(wj)radic
C(wiwj)
where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus
t(strong tea) =C(strongtea)minus 1
NC(strong)C(tea)radic
C(strongtea)
Jaganadh G Elements of Text Mining
t-score
from __future__ import division
import math
def tscore(words)
grams = ngrams(wordsn=2) ngrams function from prev slide
wordcount =
gramcount =
tsc =
[ wordcount__setitem__(word 1 +
wordcountget( word0 )) for word in words ]
[ gramcount__setitem__(gram 1 +
gramcountget( gram0 )) for gram in grams ]
for gram in grams
tsc[gram] = (gramcount[gram] - (1len(words))
wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])
mathsqrt( gramcount[gram])
return tsc
Jaganadh G Elements of Text Mining
Document Classification
Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8
Document classification tasks can be divided into three kinds
supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents
semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance
unsupervised document classification is entirely executed without reference toexternal information
8httpenwikipediaorgwikiDocument_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents
langd crang
wherelangd crangisin D times C using a learning
algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ
γ(ci dj) =
1 if dj belongs to ci0 otherwise
Main approaches in document classification are
Naıve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9
9Image taken from httpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)
P (E)
The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)
P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)
P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)
P (E1E2En)
Jaganadh G Elements of Text Mining
Naıve Bayes Applied in Text Classification
Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
Naıve Bayes Model for the training set will be like [4pt]
|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095
E 05 12 005555566 011111111 011111111 033333333 027777778 011111111
|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class
number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)
15 = 415
P (kill|Entertainment) = (0+0+0)12 = 0
12
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)
(15+|V |) = 521 = 02380
P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1
18 = 00555
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document isTest Docs kill bomb kidnap music movie tv C
Dt 2 1 2 0 0 1 To find the posterior probability
P (ci|W ) = P (ci)timesVprodj=1
P (wj |ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)
= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521
= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5
P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)
= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111
= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7
The document has classified as rdquoTerrorismrdquo because it got the highest value
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289
After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely
The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf
Jaganadh G Elements of Text Mining
Naıve Bayes Classifier
There are two different ways to setup a Naıve Bayes Classifier
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta
aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by presence or absenceof the word
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument
In the NB example which we worked out we applied the multinomial model In the model we used simple bag of
words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that
〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1
〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way
If there is a clear blank space available in the table in between the square and rectangles we can say that
it is linearly separable A line which drawn in the clear space between the figuresexactly equal length
from the margin of scatter region of square and rectangle is called as separating hyperplane Everything
on the one side of the separating hyper plane belongs to one category and everything in the other side
belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter
(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin
The figures closest to the separating hyperplane are known as support vectors10 If the data is not
linearly separable we have to use kernel tricks 11
10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography
11httpwwwstatsoftcomtextbooksupport-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnnaive_bayes import MultinomialNB
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoMultinomialNB())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a SVM Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnsvm import LinearSVC
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoLinearSVC())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12
import nltkclassifyutil
from nltkclassify import NaiveBayesClassifier
from nltkcorpus import movie_reviews
def word_feats(words)
return dict([(word True) for word in words])
negids = movie_reviewsfileids(rsquonegrsquo)
posids = movie_reviewsfileids(rsquoposrsquo)
negfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquonegrsquo) for f in negids]
posfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquoposrsquo) for f in posids]
negcutoff = len(negfeats)34
poscutoff = len(posfeats)34
trainfeats = negfeats[negcutoff] + posfeats[poscutoff]
classifier = NaiveBayesClassifiertrain(trainfeats)
sent = This is really cool I like it
words = word_feats(sentlower()split())
print classifierclassify(words)
12This code is adopted from httpstreamhackercom20100510
text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm
Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Precision and Recall

Precision indicates how many of the items that we identified were relevant.

$Precision = \frac{TP}{TP + FP}$

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

$Recall = \frac{TP}{TP + FN}$
                     Actual
                     Positive  Negative  Total
Predicted  Positive  562       77        639
           Negative  225       436       661
           Total     787       513       1300

Positive Precision $= \frac{562}{562 + 77} = 0.87$

Negative Precision $= \frac{436}{225 + 436} = 0.65$

Positive Recall $= \frac{562}{562 + 225} = 0.71$

Negative Recall $= \frac{436}{77 + 436} = 0.84$
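These per-class figures are easy to verify in a few lines. A small sketch that reproduces the worked example from the raw confusion-matrix cells (the variable names are mine, not the deck's):

from __future__ import division

TP, FP, FN, TN = 562, 77, 225, 436

print((TP + TN) / (TP + FP + FN + TN))  # accuracy ~ 0.76
print(TP / (TP + FP))  # positive precision ~ 0.87
print(TP / (TP + FN))  # positive recall ~ 0.71
print(TN / (TN + FN))  # negative precision ~ 0.65
print(TN / (TN + FP))  # negative recall ~ 0.84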
Error Rate

Error rate is the percentage of things done wrong.

$ErrorRate = \frac{FP + FN}{TP + FP + FN + TN}$

                     Actual
                     Positive  Negative  Total
Predicted  Positive  562       77        639
           Negative  225       436       661
           Total     787       513       1300

$ErrorRate = \frac{77 + 225}{562 + 77 + 225 + 436} = 0.23$
Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out $= \frac{FP}{FP + TN}$

                     Actual
                     Positive  Negative  Total
Predicted  Positive  562       77        639
           Negative  225       436       661
           Total     787       513       1300

Fall-out $= \frac{77}{77 + 436} = 0.15$
F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

$F1\ Score = \frac{2 \times precision \times recall}{precision + recall}$

a http://en.wikipedia.org/wiki/F1_score
F1 Score (Positive) $= \frac{2 \times 0.87 \times 0.71}{0.87 + 0.71} = 0.78$

F1 Score (Negative) $= \frac{2 \times 0.65 \times 0.84}{0.65 + 0.84} = 0.73$
Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value $= \frac{TP}{TP + FP}$

Positive predictive value $= \frac{562}{562 + 77} = 0.87$
Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

$NPV = \frac{TN}{TN + FN}$

$NPV = \frac{436}{436 + 225} = 0.65$
Specificity or True Negative Rate

$Specificity = \frac{TN}{FP + TN}$

$Specificity = \frac{436}{77 + 436} = 0.84$
False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

$FDR = \frac{FP}{FP + TP}$

$FDR = \frac{77}{77 + 562} = 0.12$
Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$

$MCC = \frac{562 \times 436 - 77 \times 225}{\sqrt{(562 + 77)(562 + 225)(436 + 77)(436 + 225)}} = 0.55$
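The MCC formula is a one-liner to check; a minimal sketch for the confusion matrix above:

import math

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print(mcc)  # ~0.55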
Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
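A ROC curve and its AUC can be computed from real-valued classifier scores. Here is a minimal sketch with made-up labels and scores, assuming scikit-learn is available:

from sklearn.metrics import roc_curve, auc

# Hypothetical binary labels and classifier scores, for illustration only
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(auc(fpr, tpr))  # area under the ROC curve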
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Entity Recognition with Python NLTK
from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
                       binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == 'NE':  # in NLTK 3.x this attribute is tree.label()
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == '__main__':
    sent = "Abraham Lincoln was born February 12, 1809, the second child " \
           "of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print(entities)
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the " \
       "second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print(terms)

13 http://pypi.python.org/pypi/topia.termextract
References
Chris Smith, Machine Learning Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept. 2012.

Chris Smith, Machine Learning Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept. 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept. 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.
t-score
from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    for word in words:
        wordcount[word] = 1 + wordcount.get(word, 0)
    for gram in grams:
        gramcount[gram] = 1 + gramcount.get(gram, 0)
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
                     wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) \
                    / math.sqrt(gramcount[gram])
    return tsc
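The ngrams function referenced above comes from an earlier slide. A minimal stand-in compatible with this code (bigrams joined into space-separated strings, so that gram.split() works) might look like the following; the deck's own definition may differ.

def ngrams(words, n=2):
    # Join each window of n tokens into one space-separated string
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "the quick brown fox jumps over the lazy dog".split()
print(tscore(words))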
Document Classification
Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds:

supervised document classification: performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents.

semi-supervised document classification: a mixture between supervised and unsupervised classification; some documents or parts of documents are labeled by external assistance.

unsupervised document classification: entirely executed without reference to external information.

8 http://en.wikipedia.org/wiki/Document_classification
Formal Definition

Let $C = (c_1, c_2, c_3, ..., c_m)$ be a set of pre-defined categories.
Let $D = (d_1, d_2, d_3, ..., d_n)$ be a set of documents to be classified.
Given a training set $T$ of labeled documents $\langle d, c \rangle$, where $\langle d, c \rangle \in D \times C$, using a learning algorithm we wish to learn a classifier (a classifier function) $\gamma$ that maps documents to classes: $\gamma : D \rightarrow C$.
A supervised learning algorithm $\Gamma$ takes a training set $T$ and emits a learned classification function $\gamma$: $\Gamma(T) = \gamma$.

$\gamma(c_i, d_j) = \begin{cases} 1 & \text{if } d_j \text{ belongs to } c_i \\ 0 & \text{otherwise} \end{cases}$

The main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)
[Figure: a supervised document classification pipeline 9]

9 Image taken from http://www.python-course.eu/text_classification_introduction.php
Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.

Bayes' rule: $P(H|E) = \frac{P(E|H) \times P(H)}{P(E)}$

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
$P(H)$ is called the prior probability. This is the probability of an event before the evidence is observed.
$P(H|E)$ is called the posterior probability. This is the probability of an event after the evidence is observed.
Let H be the event of raining and E be the evidence of dark cloud; then we have

$P(raining|dark\ cloud) = \frac{P(dark\ cloud|raining) \times P(raining)}{P(dark\ cloud)}$

For multiple evidences:

$P(H|E_1, E_2, ..., E_n) = \frac{P(E_1, E_2, ..., E_n|H) \times P(H)}{P(E_1, E_2, ..., E_n)}$

With the independence assumption, we can rewrite Bayes' rule as follows:

$P(H|E_1, E_2, ..., E_n) = \frac{P(E_1|H) \times P(E_2|H) \times ... \times P(E_n|H) \times P(H)}{P(E_1, E_2, ..., E_n)}$
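As a quick numeric illustration of Bayes' rule, here is a tiny sketch for the rain example; the probabilities below are made-up assumptions for demonstration, not values from the deck:

# Assumed numbers: P(dark cloud | raining), P(raining), P(dark cloud)
p_e_given_h = 0.9
p_h = 0.2
p_e = 0.4

p_h_given_e = (p_e_given_h * p_h) / p_e
print(p_h_given_e)  # 0.45: seeing dark clouds raises P(raining) from 0.2 to 0.45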
Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs  kill  bomb  kidnap  music  movie  tv  C
D0             2     1     3       0      0      1   Terrorism
D1             1     1     1       0      0      0   Terrorism
D2             1     1     2       0      1      0   Terrorism
D3             0     1     0       2      1      1   Entertainment
D4             0     0     1       1      1      0   Entertainment
D5             0     0     0       2      2      2   Entertainment
Building Naïve Bayes Model

The Naïve Bayes model for the training set will be like:

|V|  C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)   P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
6    T  0.5    15  0.238095238  0.190476190  0.333333333   0.047619048  0.095238095  0.095238095
     E  0.5    12  0.055555556  0.111111111  0.111111111   0.333333333  0.277777778  0.111111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12
To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555
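A minimal sketch of this add-one estimate as a function (the function name and arguments are mine), reproducing the two examples above:

from __future__ import division

def laplace_prob(word_count_in_class, total_words_in_class, vocab_size=6):
    # Laplace (add-one) smoothed estimate of P(word | class)
    return (word_count_in_class + 1) / (total_words_in_class + vocab_size)

print(laplace_prob(4, 15))  # P(kill | Terrorism) = 5/21 ~ 0.2380
print(laplace_prob(0, 12))  # P(kill | Entertainment) = 1/18 ~ 0.0555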
Testing the NB model
Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt         2     1     2       0      0      1   ?

To find the posterior probability:

$P(c_i|W) = P(c_i) \times \prod_{j=1}^{|V|} P(w_j|c_i)$
$P(Terrorism|W) = P(Terrorism) \times P(kill|Terrorism) \times P(bomb|Terrorism) \times P(kidnap|Terrorism) \times P(music|Terrorism) \times P(movie|Terrorism) \times P(tv|Terrorism)$
$= 0.5 \times 0.2380^2 \times 0.1904^1 \times 0.3333^2 \times 0.0476^0 \times 0.0952^0 \times 0.0952^1$
$= 0.5 \times 0.0566 \times 0.1904 \times 0.1110 \times 1 \times 1 \times 0.0952 = 5.7 \times 10^{-5}$

$P(Entertainment|W) = P(Entertainment) \times P(kill|Entertainment) \times P(bomb|Entertainment) \times P(kidnap|Entertainment) \times P(music|Entertainment) \times P(movie|Entertainment) \times P(tv|Entertainment)$
$= 0.5 \times 0.0555^2 \times 0.1111^1 \times 0.1111^2 \times 0.3333^0 \times 0.2777^0 \times 0.1111^1$
$= 0.5 \times 0.0030 \times 0.1111 \times 0.0123 \times 1 \times 1 \times 0.1111 = 2.27 \times 10^{-7}$

The document is classified as "Terrorism" because that class got the highest value.
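The same computation in code, a sketch using the smoothed probabilities and test-document frequencies from the tables above (the dictionaries are my own encoding of those tables):

from __future__ import division

p_w_given_c = {
    'Terrorism':     {'kill': 0.2380, 'bomb': 0.1904, 'kidnap': 0.3333,
                      'music': 0.0476, 'movie': 0.0952, 'tv': 0.0952},
    'Entertainment': {'kill': 0.0555, 'bomb': 0.1111, 'kidnap': 0.1111,
                      'music': 0.3333, 'movie': 0.2777, 'tv': 0.1111},
}
prior = {'Terrorism': 0.5, 'Entertainment': 0.5}
test_doc = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}

for c in prior:
    score = prior[c]
    for word, freq in test_doc.items():
        score *= p_w_given_c[c][word] ** freq  # word probability raised to its frequency
    print(c, score)  # Terrorism ~ 5.7e-05, Entertainment ~ 2.3e-07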
Preventing Underflow
The probability score assigned to the test document is very small. In real-world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:

$P(Terrorism|W) = \log(0.5 \times 0.2380^2 \times 0.1904^1 \times 0.3333^2 \times 0.0476^0 \times 0.0952^0 \times 0.0952^1)$
$= \log(0.5) + 2\log(0.2380) + 1\log(0.1904) + 2\log(0.3333) + 0\log(0.0476) + 0\log(0.0952) + 1\log(0.0952)$
$= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213 = -4.2437$

$P(Entertainment|W) = \log(0.5 \times 0.0555^2 \times 0.1111^1 \times 0.1111^2 \times 0.3333^0 \times 0.2777^0 \times 0.1111^1)$
$= \log(0.5) + 2\log(0.0555) + 1\log(0.1111) + 2\log(0.1111) + 0\log(0.3333) + 0\log(0.2777) + 1\log(0.1111)$
$= -0.3010 - 2.5110 - 0.9542 - 1.9085 + 0 + 0 - 0.9542 = -6.6289$

After handling the underflow problem, our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
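In code, the log-space version is a small change to the scoring loop; a sketch using base-10 logs to match the worked numbers, and reusing prior, p_w_given_c and test_doc from the previous sketch:

import math

for c in prior:
    log_score = math.log10(prior[c])
    for word, freq in test_doc.items():
        if freq > 0:  # zero-frequency words contribute 0 to the sum
            log_score += freq * math.log10(p_w_given_c[c][word])
    print(c, log_score)  # Terrorism ~ -4.24, Entertainment ~ -6.63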
Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model
Multi-variate Bernoulli Model

Given a vocabulary $V$, each dimension of the space $t$, $t \in \{1, ..., |V|\}$, corresponds to word $w_t$ from the vocabulary. Dimension $t$ of the vector for document $d_i$ is written $B_{it}$, and is either 0 or 1, indicating whether word $w_t$ occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in a document. a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs  kill  bomb  kidnap  music  movie  tv  C
D0             1     1     1       0      0      1   Terrorism
D1             1     1     1       0      0      0   Terrorism
D2             1     1     1       0      1      0   Terrorism
D3             0     1     0       1      1      1   Entertainment
D4             0     0     1       1      1      0   Entertainment
D5             0     0     0       1      1      1   Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.
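Deriving the Bernoulli representation from the term-frequency one is a simple thresholding step; a sketch for one document (the dictionary encoding is mine):

freq_doc = {'kill': 2, 'bomb': 1, 'kidnap': 3, 'music': 0, 'movie': 0, 'tv': 1}  # D0
bernoulli_doc = dict((word, 1 if freq > 0 else 0)
                     for word, freq in freq_doc.items())
print(bernoulli_doc)  # kill, bomb, kidnap, tv -> 1; music, movie -> 0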
Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary $V$. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag-of-words representation. We can use a smoothed bag-of-words model like TF-IDF in the multinomial naive Bayes model.
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set $\{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^p$ (input feature vector) and $y_i \in \{1, -1\}$ is the corresponding label, whether ($y_i = +1$) or ($y_i = -1$). To start with, we assume that our input feature vectors are linearly separable; that is, there exists a function $f(x) = \langle w, x \rangle + b$, where $w \in \mathbb{R}^p$ (weight vector) and $b \in \mathbb{R}$ (bias), such that

$\langle w, x_i \rangle + b > 0$ for $y_i = +1$
$\langle w, x_i \rangle + b < 0$ for $y_i = -1$

$\langle w, x \rangle + b = 0$ is the decision hyperplane. There can be multiple hyperplanes which separate the positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable, we have to use kernel tricks to find soft margins.
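A tiny sketch of evaluating the sign of the decision function $\langle w, x \rangle + b$ for a hypothetical 2-dimensional weight vector and bias (the numbers are made up for illustration):

import numpy as np

w = np.array([0.8, -0.5])  # assumed weight vector
b = 0.2                    # assumed bias

def predict(x):
    # Classify by the side of the hyperplane <w, x> + b = 0
    return 1 if np.dot(w, x) + b > 0 else -1

print(predict(np.array([1.0, 0.5])))   # +1 side
print(predict(np.array([-1.0, 1.0])))  # -1 side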
Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space on the table between the squares and the rectangles, we can say that they are linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the square and rectangle scatter regions, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors. 10 If the data is not linearly separable, we have to use kernel tricks. 11

10 This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines
[Figure: Support Vector Machine]
Practice Time
Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnnaive_bayes import MultinomialNB
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoMultinomialNB())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a SVM Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnsvm import LinearSVC
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoLinearSVC())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12
import nltkclassifyutil
from nltkclassify import NaiveBayesClassifier
from nltkcorpus import movie_reviews
def word_feats(words)
return dict([(word True) for word in words])
negids = movie_reviewsfileids(rsquonegrsquo)
posids = movie_reviewsfileids(rsquoposrsquo)
negfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquonegrsquo) for f in negids]
posfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquoposrsquo) for f in posids]
negcutoff = len(negfeats)34
poscutoff = len(posfeats)34
trainfeats = negfeats[negcutoff] + posfeats[poscutoff]
classifier = NaiveBayesClassifiertrain(trainfeats)
sent = This is really cool I like it
words = word_feats(sentlower()split())
print classifierclassify(words)
12This code is adopted from httpstreamhackercom20100510
text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm
Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)
MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)
= 055
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
Document Classification
Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8
Document classification tasks can be divided into three kinds
supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents
semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance
unsupervised document classification is entirely executed without reference toexternal information
8httpenwikipediaorgwikiDocument_classification
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents
langd crang
wherelangd crangisin D times C using a learning
algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ
γ(ci dj) =
1 if dj belongs to ci0 otherwise
Main approaches in document classification are
Naıve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9
9Image taken from httpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)
P (E)
The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)
P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)
P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)
P (E1E2En)
Jaganadh G Elements of Text Mining
Naıve Bayes Applied in Text Classification
Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
Naıve Bayes Model for the training set will be like [4pt]
|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095
E 05 12 005555566 011111111 011111111 033333333 027777778 011111111
|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class
number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)
15 = 415
P (kill|Entertainment) = (0+0+0)12 = 0
12
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)
(15+|V |) = 521 = 02380
P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1
18 = 00555
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document isTest Docs kill bomb kidnap music movie tv C
Dt 2 1 2 0 0 1 To find the posterior probability
P (ci|W ) = P (ci)timesVprodj=1
P (wj |ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)
= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521
= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5
P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)
= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111
= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7
The document has classified as rdquoTerrorismrdquo because it got the highest value
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289
After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely
The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf
Jaganadh G Elements of Text Mining
Naıve Bayes Classifier
There are two different ways to setup a Naıve Bayes Classifier
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta
aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by presence or absenceof the word
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument
In the NB example which we worked out we applied the multinomial model In the model we used simple bag of
words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that
〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1
〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way
If there is a clear blank space available in the table in between the square and rectangles we can say that
it is linearly separable A line which drawn in the clear space between the figuresexactly equal length
from the margin of scatter region of square and rectangle is called as separating hyperplane Everything
on the one side of the separating hyper plane belongs to one category and everything in the other side
belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter
(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin
The figures closest to the separating hyperplane are known as support vectors10 If the data is not
linearly separable we have to use kernel tricks 11
10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography
11httpwwwstatsoftcomtextbooksupport-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnnaive_bayes import MultinomialNB
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoMultinomialNB())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a SVM Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnsvm import LinearSVC
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoLinearSVC())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12
import nltkclassifyutil
from nltkclassify import NaiveBayesClassifier
from nltkcorpus import movie_reviews
def word_feats(words)
return dict([(word True) for word in words])
negids = movie_reviewsfileids(rsquonegrsquo)
posids = movie_reviewsfileids(rsquoposrsquo)
negfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquonegrsquo) for f in negids]
posfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquoposrsquo) for f in posids]
negcutoff = len(negfeats)34
poscutoff = len(posfeats)34
trainfeats = negfeats[negcutoff] + posfeats[poscutoff]
classifier = NaiveBayesClassifiertrain(trainfeats)
sent = This is really cool I like it
words = word_feats(sentlower()split())
print classifierclassify(words)
12This code is adopted from httpstreamhackercom20100510
text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm
Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)
MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)
= 055
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
Document Classification
Formal Definition
Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents
langd crang
wherelangd crangisin D times C using a learning
algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ
γ(ci dj) =
1 if dj belongs to ci0 otherwise
Main approaches in document classification are
Naıve Bayes (NB)
Support Vector Machines (SVM)
Jaganadh G Elements of Text Mining
Document Classification
A supervised document classification pipeline 9
9Image taken from httpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)
P (E)
The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)
P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)
P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)
P (E1E2En)
Jaganadh G Elements of Text Mining
Naıve Bayes Applied in Text Classification
Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
Naıve Bayes Model for the training set will be like [4pt]
|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095
E 05 12 005555566 011111111 011111111 033333333 027777778 011111111
|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class
number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)
15 = 415
P (kill|Entertainment) = (0+0+0)12 = 0
12
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)
(15+|V |) = 521 = 02380
P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1
18 = 00555
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document isTest Docs kill bomb kidnap music movie tv C
Dt 2 1 2 0 0 1 To find the posterior probability
P (ci|W ) = P (ci)timesVprodj=1
P (wj |ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)
= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521
= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5
P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)
= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111
= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7
The document has classified as rdquoTerrorismrdquo because it got the highest value
Jaganadh G Elements of Text Mining
Preventing Underflow

The probability score assigned to the test document is very small. In real-world situations we will train the classifier with thousands of documents, and then the conditional probability values become too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.5114 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6293

After handling the underflow problem, our system still classifies the test document as "Terrorism". From the final scores you can observe that the values are scaled nicely.
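A short sketch of the same scoring done in log space; it reuses the hypothetical cond_prob helper and counts from the earlier sketch:

from math import log10

# Log-space scoring of the test document Dt = (2, 1, 2, 0, 0, 1).
test_doc = {"kill": 2, "bomb": 1, "kidnap": 2, "music": 0, "movie": 0, "tv": 1}
priors = {"Terrorism": 0.5, "Entertainment": 0.5}

def log_score(cls):
    score = log10(priors[cls])
    for word, freq in test_doc.items():
        score += freq * log10(cond_prob(word, cls))
    return score

for cls in priors:
    print cls, log_score(cls)   # Terrorism ~ -4.24, Entertainment ~ -6.63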
The section on the Naïve Bayes classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model
Multinomial Model
Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document.[a]

[a] A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will look like:

Training Docs  kill  bomb  kidnap  music  movie  tv   C
D0             1     1     1       0      0      1    Terrorism
D1             1     1     1       0      0      0    Terrorism
D2             1     1     1       0      1      0    Terrorism
D3             0     1     0       1      1      1    Entertainment
D4             0     0     1       1      1      0    Entertainment
D5             0     0     0       1      1      0    Entertainment

Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
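A small sketch of this binarization step, assuming the toy term-frequency rows from the multinomial example:

# Convert term-frequency rows into the 0/1 presence vectors
# that the multi-variate Bernoulli model works with.
tf_rows = {
    "D0": [2, 1, 3, 0, 0, 1],
    "D1": [1, 1, 1, 0, 0, 0],
    "D2": [1, 1, 2, 0, 1, 0],
    "D3": [0, 1, 0, 2, 1, 1],
    "D4": [0, 0, 1, 1, 1, 0],
    "D5": [0, 0, 0, 2, 2, 0],
}
bernoulli_rows = dict((doc, [1 if freq > 0 else 0 for freq in freqs])
                      for doc, freqs in tf_rows.items())
print bernoulli_rows["D0"]   # [1, 1, 1, 0, 0, 1]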
Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag-of-words representation. We can also use a smoothed bag-of-words model, like TF-IDF, in the multinomial naive Bayes model.
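The two event models map directly onto scikit-learn's MultinomialNB and BernoulliNB classes; a small sketch on the toy counts (assuming scikit-learn is installed):

from sklearn.naive_bayes import MultinomialNB, BernoulliNB

# Toy document-term matrix from the worked example and its labels.
X = [[2, 1, 3, 0, 0, 1],
     [1, 1, 1, 0, 0, 0],
     [1, 1, 2, 0, 1, 0],
     [0, 1, 0, 2, 1, 1],
     [0, 0, 1, 1, 1, 0],
     [0, 0, 0, 2, 2, 0]]
y = ["Terrorism", "Terrorism", "Terrorism",
     "Entertainment", "Entertainment", "Entertainment"]

mnb = MultinomialNB().fit(X, y)            # raw frequencies, alpha=1 Laplace smoothing
bnb = BernoulliNB(binarize=0.5).fit(X, y)  # thresholds counts to presence/absence
test = [[2, 1, 2, 0, 0, 1]]
print mnb.predict(test)   # ['Terrorism'], as in the worked example
print bnb.predict(test)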
Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}, i = 1..n, where xi ∈ R^p (input feature vector) and yi ∈ {+1, -1} is the corresponding label (yi = +1 or yi = -1). To start with, we assume that our input feature vectors are linearly separable; that is, there exists a function f(x) = 〈w, x〉 + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

〈w, xi〉 + b > 0 for yi = +1
〈w, xi〉 + b < 0 for yi = -1

〈w, x〉 + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate the positive and negative examples, but they are not all equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum-margin hyperplane are called support vectors. If the data is not linearly separable, we have to use kernel tricks to find soft margins.
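As a toy illustration of the decision rule, a sketch with made-up weight and bias values (w and b here are arbitrary, purely for illustration):

# Linear decision function f(x) = <w, x> + b with hypothetical parameters.
w = [0.4, -0.2, 0.7]   # made-up weight vector, p = 3
b = -0.1               # made-up bias

def predict(x):
    f = sum(wi * xi for wi, xi in zip(w, x)) + b
    return +1 if f > 0 else -1

print predict([1.0, 0.5, 0.2])   # f = 0.34 > 0, so the label is +1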
Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way. If there is a clear blank space on the table between the squares and rectangles, we can say that they are linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the square and rectangle scatter regions, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn along the edges of each scatter (square and rectangle) at an equal distance from the separating hyperplane is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.[10] If the data is not linearly separable, we have to use kernel tricks.[11]

[10] This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.
[11] http://www.statsoft.com/textbook/support-vector-machines
Support Vector Machine

[Figure: illustration of the separating hyperplane, maximum margin, and support vectors]
Practice Time

Let's try to build a Multinomial Naïve Bayes classifier with Python Sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

# Path to the movie_reviews corpus; replace with your path
dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"

# Unigram to trigram counts, English stop words removed
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories, shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Practice Time

Let's try to build an SVM classifier with Python Sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Path to the movie_reviews corpus; replace with your path
dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"

# Unigram to trigram counts, English stop words removed
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories, shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Practice Time

Let's try to build a Multi-variate Naïve Bayes classifier with Python NLTK:[12]

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    # Mark every word in the document as a present (True) feature
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg')
            for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos')
            for f in posids]

# Train on the first three quarters of each class
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

[12] This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.
Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                     Actual Positive       Actual Negative
Predicted Positive   True Positive (TP)    False Positive (FP)
Predicted Negative   False Negative (FN)   True Negative (TN)
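A quick sketch of building such a matrix with scikit-learn's confusion_matrix on made-up label lists:

from sklearn.metrics import confusion_matrix

# Toy actual and predicted labels for a two-class problem.
actual    = ['pos', 'pos', 'neg', 'neg', 'pos', 'neg']
predicted = ['pos', 'neg', 'neg', 'pos', 'pos', 'neg']

# Rows are actual classes, columns are predicted classes.
print confusion_matrix(actual, predicted, labels=['pos', 'neg'])
# [[2 1]
#  [1 2]]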
Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                     Actual Positive   Actual Negative   Total
Predicted Positive   562               77                639
Predicted Negative   225               436               661
Total                787               513               1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.77
Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant:
Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":
Recall = TP / (TP + FN)
Evaluating Performance of a Classifier

                     Actual Positive   Actual Negative   Total
Predicted Positive   562               77                639
Predicted Negative   225               436               661
Total                787               513               1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84
Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong:
ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                     Actual Positive   Actual Negative   Total
Predicted Positive   562               77                639
Predicted Negative   225               436               661
Total                787               513               1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23
Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR):
Fall-out = FP / (FP + TN)

                     Actual Positive   Actual Negative   Total
Predicted Positive   562               77                639
Predicted Negative   225               436               661
Total                787               513               1300

Fall-out = 77 / (77 + 436) = 0.15
Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.[a]

F1 Score = 2 × (precision × recall) / (precision + recall)

[a] http://en.wikipedia.org/wiki/F1_score
Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78
F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
Evaluating Performance of a Classifier

Positive Predictive Value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives:
Positive predictive value = TP / (TP + FP)
Positive predictive value = 562 / (562 + 77) = 0.87
Evaluating Performance of a Classifier

Negative Predictive Value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure:
NPV = TN / (TN + FN)
NPV = 436 / (436 + 225) = 0.65
Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)
Specificity = 436 / (77 + 436) = 0.84
Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons:
FDR = FP / (FP + TP)
FDR = 77 / (77 + 562) = 0.12
Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications:

MCC = (TP × TN - FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 - 77 × 225) / sqrt((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55
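All of these measures can be verified with a few lines of plain Python; a sketch using the counts from the example matrix:

from math import sqrt

TP, FP, FN, TN = 562, 77, 225, 436

accuracy    = float(TP + TN) / (TP + FP + FN + TN)           # ~0.77
precision   = float(TP) / (TP + FP)                          # ~0.87
recall      = float(TP) / (TP + FN)                          # ~0.71
f1          = 2 * precision * recall / (precision + recall)  # ~0.78
error_rate  = float(FP + FN) / (TP + FP + FN + TN)           # ~0.23
fall_out    = float(FP) / (FP + TN)                          # ~0.15
npv         = float(TN) / (TN + FN)                          # ~0.65
specificity = float(TN) / (FP + TN)                          # ~0.84
fdr         = float(FP) / (FP + TP)                          # ~0.12
mcc = (TP * TN - FP * FN) / sqrt(float(TP + FP) * (TP + FN)
                                 * (TN + FP) * (TN + FN))    # ~0.55
print accuracy, precision, recall, f1, mcc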
Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to binary classification problems. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis, and the true positive rate on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
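A brief sketch computing ROC points and the AUC with scikit-learn; y_true and y_score here are invented purely for illustration:

from sklearn.metrics import roc_curve, auc

y_true  = [1, 1, 1, 0, 1, 0, 0, 0]                    # actual labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]   # predicted P(positive)

# fpr goes on the X axis and tpr on the Y axis of the ROC plot.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print auc(fpr, tpr)   # closer to 1.0 means better discrimination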
Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Named Entity Recognition

Entity Recognition with Python NLTK:

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    # Chunk each sentence and collect the subtrees labelled NE
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)), binary=True)
              for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities
Extracting Terms from Text

Extracting Terms with Python Topia Termextract:[13]

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

[13] http://pypi.python.org/pypi/topia.termextract/
References

Chris Smith, Machine Learning, Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.
Chris Smith, Machine Learning, Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.
Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.
Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.
Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.
Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php
References

Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.
Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.
Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.
Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044.
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.
Naıve Bayes Applied in Text Classification
Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
Naıve Bayes Model for the training set will be like [4pt]
|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095
E 05 12 005555566 011111111 011111111 033333333 027777778 011111111
|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class
number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)
15 = 415
P (kill|Entertainment) = (0+0+0)12 = 0
12
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)
(15+|V |) = 521 = 02380
P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1
18 = 00555
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document isTest Docs kill bomb kidnap music movie tv C
Dt 2 1 2 0 0 1 To find the posterior probability
P (ci|W ) = P (ci)timesVprodj=1
P (wj |ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)
= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521
= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5
P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)
= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111
= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7
The document has classified as rdquoTerrorismrdquo because it got the highest value
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289
After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely
The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf
Jaganadh G Elements of Text Mining
Naıve Bayes Classifier
There are two different ways to setup a Naıve Bayes Classifier
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta
aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by presence or absenceof the word
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument
In the NB example which we worked out we applied the multinomial model In the model we used simple bag of
words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that
〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1
〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way
If there is a clear blank space available in the table in between the square and rectangles we can say that
it is linearly separable A line which drawn in the clear space between the figuresexactly equal length
from the margin of scatter region of square and rectangle is called as separating hyperplane Everything
on the one side of the separating hyper plane belongs to one category and everything in the other side
belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter
(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin
The figures closest to the separating hyperplane are known as support vectors10 If the data is not
linearly separable we have to use kernel tricks 11
10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography
11httpwwwstatsoftcomtextbooksupport-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnnaive_bayes import MultinomialNB
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoMultinomialNB())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a SVM Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnsvm import LinearSVC
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoLinearSVC())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12
import nltkclassifyutil
from nltkclassify import NaiveBayesClassifier
from nltkcorpus import movie_reviews
def word_feats(words)
return dict([(word True) for word in words])
negids = movie_reviewsfileids(rsquonegrsquo)
posids = movie_reviewsfileids(rsquoposrsquo)
negfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquonegrsquo) for f in negids]
posfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquoposrsquo) for f in posids]
negcutoff = len(negfeats)34
poscutoff = len(posfeats)34
trainfeats = negfeats[negcutoff] + posfeats[poscutoff]
classifier = NaiveBayesClassifiertrain(trainfeats)
sent = This is really cool I like it
words = word_feats(sentlower()split())
print classifierclassify(words)
12This code is adopted from httpstreamhackercom20100510
text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm
Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)
MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)
= 055
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)
P (E)
The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed
Jaganadh G Elements of Text Mining
Naıve Bayes Classification
Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)
P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)
P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)
P (E1E2En)
Jaganadh G Elements of Text Mining
Naıve Bayes Applied in Text Classification
Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
Naıve Bayes Model for the training set will be like [4pt]
|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095
E 05 12 005555566 011111111 011111111 033333333 027777778 011111111
|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class
number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)
15 = 415
P (kill|Entertainment) = (0+0+0)12 = 0
12
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)
(15+|V |) = 521 = 02380
P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1
18 = 00555
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document isTest Docs kill bomb kidnap music movie tv C
Dt 2 1 2 0 0 1 To find the posterior probability
P (ci|W ) = P (ci)timesVprodj=1
P (wj |ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)
= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521
= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5
P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)
= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111
= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7
The document has classified as rdquoTerrorismrdquo because it got the highest value
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289
After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely
The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf
Jaganadh G Elements of Text Mining
Naıve Bayes Classifier
There are two different ways to setup a Naıve Bayes Classifier
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta
aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by presence or absenceof the word
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument
In the NB example which we worked out we applied the multinomial model In the model we used simple bag of
words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that
〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1
〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way
If there is a clear blank space available in the table in between the square and rectangles we can say that
it is linearly separable A line which drawn in the clear space between the figuresexactly equal length
from the margin of scatter region of square and rectangle is called as separating hyperplane Everything
on the one side of the separating hyper plane belongs to one category and everything in the other side
belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter
(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin
The figures closest to the separating hyperplane are known as support vectors10 If the data is not
linearly separable we have to use kernel tricks 11
10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography
11httpwwwstatsoftcomtextbooksupport-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnnaive_bayes import MultinomialNB
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoMultinomialNB())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a SVM Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnsvm import LinearSVC
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoLinearSVC())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12
import nltkclassifyutil
from nltkclassify import NaiveBayesClassifier
from nltkcorpus import movie_reviews
def word_feats(words)
return dict([(word True) for word in words])
negids = movie_reviewsfileids(rsquonegrsquo)
posids = movie_reviewsfileids(rsquoposrsquo)
negfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquonegrsquo) for f in negids]
posfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquoposrsquo) for f in posids]
negcutoff = len(negfeats)34
poscutoff = len(posfeats)34
trainfeats = negfeats[negcutoff] + posfeats[poscutoff]
classifier = NaiveBayesClassifiertrain(trainfeats)
sent = This is really cool I like it
words = word_feats(sentlower()split())
print classifierclassify(words)
12This code is adopted from httpstreamhackercom20100510
text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm
Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)
MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)
= 055
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce, www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.
Jaganadh G Elements of Text Mining
Naïve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have
P(raining|dark cloud) = P(dark cloud|raining) × P(raining) / P(dark cloud)
For multiple evidences:
P(H|E1, E2, ..., En) = P(E1, E2, ..., En|H) × P(H) / P(E1, E2, ..., En)
With the independence assumption, we can rewrite Bayes's rule as follows:
P(H|E1, E2, ..., En) = P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) / P(E1, E2, ..., En)
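As a quick worked example with assumed values: if P(raining) = 0.3, P(dark cloud|raining) = 0.9 and P(dark cloud) = 0.5, then P(raining|dark cloud) = (0.9 × 0.3)/0.5 = 0.54.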
Jaganadh G Elements of Text Mining
Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naïve Bayes Model

Naïve Bayes Model for the training set will be like:

|V|  C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
 6   T  0.5    15  0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
     E  0.5    12  0.055555556  0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class:
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class. Example:
P(kill|Terrorism) = (2 + 1 + 1)/15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0)/12 = 0/12
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)
(15+|V |) = 521 = 02380
P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1
18 = 00555
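The smoothed estimates above are easy to verify in Python; this is just the slide's arithmetic, with the counts and totals taken from the training table.

V = 6                 # vocabulary size
n_terrorism = 15      # total word frequency in the Terrorism class
n_entertainment = 12  # total word frequency in the Entertainment class

# counts of 'kill' in each class, from the training table
kill_t = 2 + 1 + 1
kill_e = 0 + 0 + 0

print (kill_t + 1.0) / (n_terrorism + V)      # 5/21 = 0.2380...
print (kill_e + 1.0) / (n_entertainment + V)  # 1/18 = 0.0555...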
Jaganadh G Elements of Text Mining
Testing the NB model

Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt         2     1     2       0      0      1   ?

To find the posterior probability:

P(ci|W) = P(ci) × ∏j=1..|V| P(wj|ci)
Jaganadh G Elements of Text Mining
Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) × P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism)
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952 = 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) × P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111 = 2.27 × 10^-7

The document is classified as "Terrorism" because it got the highest value.
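The same posterior computation can be sketched in Python, using the smoothed conditional probabilities copied from the model table above:

prior = {'Terrorism': 0.5, 'Entertainment': 0.5}
# smoothed conditional probabilities from the model table
p_w = {
    'Terrorism':     {'kill': 0.2380, 'bomb': 0.1904, 'kidnap': 0.3333,
                      'music': 0.0476, 'movie': 0.0952, 'tv': 0.0952},
    'Entertainment': {'kill': 0.0555, 'bomb': 0.1111, 'kidnap': 0.1111,
                      'music': 0.3333, 'movie': 0.2777, 'tv': 0.1111},
}
test_doc = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}

def posterior(cls):
    p = prior[cls]
    for word, freq in test_doc.items():
        p *= p_w[cls][word] ** freq
    return p

print posterior('Terrorism')      # ~5.7e-05
print posterior('Entertainment')  # ~2.3e-07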
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small. In real-world situations we will train the classifier with thousands of documents; in such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:
P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213 = -4.2437
P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.5110 - 0.9542 - 1.9085 + 0 + 0 - 0.9542 = -6.6289

After handling the underflow problem, our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.
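A minimal sketch of the same computation in log space, mirroring the slide's arithmetic:

from math import log10

log_p_t = log10(0.5) + 2 * log10(0.2380) + log10(0.1904) \
    + 2 * log10(0.3333) + log10(0.0952)
log_p_e = log10(0.5) + 2 * log10(0.0555) + log10(0.1111) \
    + 2 * log10(0.1111) + log10(0.1111)

print log_p_t  # ~ -4.244
print log_p_e  # ~ -6.630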
The section on Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
Jaganadh G Elements of Text Mining
Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes classifier:
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document.(a)

(a) A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
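For reference, scikit-learn's BernoulliNB estimator performs this binarization itself; a minimal sketch on the toy corpus (the labels and threshold below are chosen here for illustration):

from sklearn.naive_bayes import BernoulliNB

# raw word counts for D0..D5: (kill, bomb, kidnap, music, movie, tv)
X = [[2, 1, 3, 0, 0, 1],
     [1, 1, 1, 0, 0, 0],
     [1, 1, 2, 0, 1, 0],
     [0, 1, 0, 2, 1, 1],
     [0, 0, 1, 1, 1, 0],
     [0, 0, 0, 2, 2, 2]]
y = ['Terrorism'] * 3 + ['Entertainment'] * 3

# binarize=0.0 maps every non-zero count to 1, as in the table above
model = BernoulliNB(binarize=0.0).fit(X, y)
print model.predict([[2, 1, 2, 0, 0, 1]])  # the test document Dt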
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model, a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag-of-words representation. We can also use a smoothed bag-of-words model such as TF-IDF in the multinomial naive Bayes model.
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}, i = 1..n, where xi ∈ R^p (input feature vector) and yi ∈ {1, −1} is the corresponding label, whether (yi = +1) or (yi = −1). To start with, we assume that our input feature vectors are linearly separable; that is, there exists a function f(x) = 〈w, x〉 + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

〈w, xi〉 + b > 0 for yi = +1
〈w, xi〉 + b < 0 for yi = −1

〈w, x〉 + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable, we have to use kernel tricks to find soft margins.
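In code, the decision rule is just the sign of 〈w, x〉 + b; the weight vector and bias below are made-up values for a two-dimensional toy problem:

import numpy as np

w = np.array([1.0, -1.0])  # hypothetical weight vector
b = -0.5                   # hypothetical bias

def predict(x):
    # the sign of <w, x> + b decides the class
    return 1 if np.dot(w, x) + b > 0 else -1

print predict(np.array([2.0, 0.5]))  # +1
print predict(np.array([0.0, 1.0]))  # -1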
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way
If there is a clear blank space available in the table in between the square and rectangles we can say that
it is linearly separable A line which drawn in the clear space between the figuresexactly equal length
from the margin of scatter region of square and rectangle is called as separating hyperplane Everything
on the one side of the separating hyper plane belongs to one category and everything in the other side
belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter
(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin
The figures closest to the separating hyperplane are known as support vectors10 If the data is not
linearly separable we have to use kernel tricks 11
10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography
11httpwwwstatsoftcomtextbooksupport-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine

(Figure: separating hyperplane, maximum margin, and support vectors)
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build an SVM Classifier with Python Sklearn
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    # every word becomes a boolean "present" feature
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/. Follow the link for a more detailed discussion.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                      Actual
                      Positive             Negative
Predicted  Positive   True Positive (TP)   False Positive (FP)
           Negative   False Negative (FN)  True Negative (TN)
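scikit-learn can build such a table directly; a small sketch with made-up labels (note that sklearn's convention puts the actual classes on the rows):

from sklearn.metrics import confusion_matrix

y_true = [1, 1, 0, 1, 0, 0]  # hypothetical actual labels
y_pred = [1, 0, 0, 1, 0, 1]  # hypothetical predictions

print confusion_matrix(y_true, y_pred)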
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = (TP + TN)/(TP + FP + FN + TN)

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
Total                 787       513       1300

Accuracy = (562 + 436)/(562 + 77 + 225 + 436) = 0.76
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision indicates how many of the items that we identified were relevant.

Precision = TP/(TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP/(TP + FN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
Total                 787       513       1300

Positive Precision = 562/(562 + 77) = 0.87
Negative Precision = 436/(225 + 436) = 0.65
Positive Recall = 562/(562 + 225) = 0.71
Negative Recall = 436/(77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong.

ErrorRate = (FP + FN)/(TP + FP + FN + TN)

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
Total                 787       513       1300

ErrorRate = (77 + 225)/(562 + 77 + 225 + 436) = 0.23
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP/(FP + TN)

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
Total                 787       513       1300

Fall-out = 77/(77 + 436) = 0.15
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.(a)

F1 Score = 2 × (precision × recall)/(precision + recall)

(a) http://en.wikipedia.org/wiki/F1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 × (0.87 × 0.71)/(0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84)/(0.65 + 0.84) = 0.73
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP/(TP + FP)

Positive predictive value = 562/(562 + 77) = 0.87
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN/(TN + FN)

NPV = 436/(436 + 225) = 0.65
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TN/(FP + TN)

Specificity = 436/(77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP/(FP + TP)

FDR = 77/(77 + 562) = 0.12
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 − 77 × 225) / √((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55
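Every metric in this section can be reproduced from the four confusion-matrix counts; a compact sketch using the numbers from the running example:

from math import sqrt

TP, FP, FN, TN = 562, 77, 225, 436

accuracy = float(TP + TN) / (TP + FP + FN + TN)      # ~0.768
precision = float(TP) / (TP + FP)                    # ~0.879
recall = float(TP) / (TP + FN)                       # ~0.714
f1 = 2 * precision * recall / (precision + recall)   # ~0.788
fallout = float(FP) / (FP + TN)                      # ~0.150
npv = float(TN) / (TN + FN)                          # ~0.660
specificity = float(TN) / (FP + TN)                  # ~0.850
fdr = float(FP) / (FP + TP)                          # ~0.120
mcc = (TP * TN - FP * FN) / sqrt(
    float(TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))  # ~0.551

print accuracy, precision, recall, f1, mcc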
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
Naıve Bayes Applied in Text Classification
Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like
Training Docs kill bomb kidnap music movie tv C
D0 2 1 3 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 2 0 1 0 Terrorism
D3 0 1 0 2 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 2 2 2 Entertainment
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
Naıve Bayes Model for the training set will be like [4pt]
|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095
E 05 12 005555566 011111111 011111111 033333333 027777778 011111111
|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class
number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)
15 = 415
P (kill|Entertainment) = (0+0+0)12 = 0
12
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)
(15+|V |) = 521 = 02380
P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1
18 = 00555
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document isTest Docs kill bomb kidnap music movie tv C
Dt 2 1 2 0 0 1 To find the posterior probability
P (ci|W ) = P (ci)timesVprodj=1
P (wj |ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)
= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521
= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5
P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)
= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111
= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7
The document has classified as rdquoTerrorismrdquo because it got the highest value
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289
After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely
The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf
Jaganadh G Elements of Text Mining
Naıve Bayes Classifier
There are two different ways to setup a Naıve Bayes Classifier
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta
aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by presence or absenceof the word
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument
In the NB example which we worked out we applied the multinomial model In the model we used simple bag of
words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that
〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1
〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way
If there is a clear blank space available in the table in between the square and rectangles we can say that
it is linearly separable A line which drawn in the clear space between the figuresexactly equal length
from the margin of scatter region of square and rectangle is called as separating hyperplane Everything
on the one side of the separating hyper plane belongs to one category and everything in the other side
belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter
(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin
The figures closest to the separating hyperplane are known as support vectors10 If the data is not
linearly separable we have to use kernel tricks 11
10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography
11httpwwwstatsoftcomtextbooksupport-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnnaive_bayes import MultinomialNB
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoMultinomialNB())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a SVM Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnsvm import LinearSVC
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoLinearSVC())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12
import nltkclassifyutil
from nltkclassify import NaiveBayesClassifier
from nltkcorpus import movie_reviews
def word_feats(words)
return dict([(word True) for word in words])
negids = movie_reviewsfileids(rsquonegrsquo)
posids = movie_reviewsfileids(rsquoposrsquo)
negfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquonegrsquo) for f in negids]
posfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquoposrsquo) for f in posids]
negcutoff = len(negfeats)34
poscutoff = len(posfeats)34
trainfeats = negfeats[negcutoff] + posfeats[poscutoff]
classifier = NaiveBayesClassifiertrain(trainfeats)
sent = This is really cool I like it
words = word_feats(sentlower()split())
print classifierclassify(words)
12This code is adopted from httpstreamhackercom20100510
text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm
Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)
MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)
= 055
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
Naıve Bayes Model for the training set will be like [4pt]
|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095
E 05 12 005555566 011111111 011111111 033333333 027777778 011111111
|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class
number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)
15 = 415
P (kill|Entertainment) = (0+0+0)12 = 0
12
Jaganadh G Elements of Text Mining
Building Naıve Bayes Model
To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)
(15+|V |) = 521 = 02380
P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1
18 = 00555
Jaganadh G Elements of Text Mining
Testing the NB model
Our test document isTest Docs kill bomb kidnap music movie tv C
Dt 2 1 2 0 0 1 To find the posterior probability
P (ci|W ) = P (ci)timesVprodj=1
P (wj |ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)
= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521
= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5
P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)
= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111
= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7
The document has classified as rdquoTerrorismrdquo because it got the highest value
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289
After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely
The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf
Jaganadh G Elements of Text Mining
Naıve Bayes Classifier
There are two different ways to setup a Naıve Bayes Classifier
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta
aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by presence or absenceof the word
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument
In the NB example which we worked out we applied the multinomial model In the model we used simple bag of
words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that
〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1
〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way
If there is a clear blank space available in the table in between the square and rectangles we can say that
it is linearly separable A line which drawn in the clear space between the figuresexactly equal length
from the margin of scatter region of square and rectangle is called as separating hyperplane Everything
on the one side of the separating hyper plane belongs to one category and everything in the other side
belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter
(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin
The figures closest to the separating hyperplane are known as support vectors10 If the data is not
linearly separable we have to use kernel tricks 11
10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography
11httpwwwstatsoftcomtextbooksupport-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnnaive_bayes import MultinomialNB
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoMultinomialNB())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a SVM Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnsvm import LinearSVC
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoLinearSVC())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12
import nltkclassifyutil
from nltkclassify import NaiveBayesClassifier
from nltkcorpus import movie_reviews
def word_feats(words)
return dict([(word True) for word in words])
negids = movie_reviewsfileids(rsquonegrsquo)
posids = movie_reviewsfileids(rsquoposrsquo)
negfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquonegrsquo) for f in negids]
posfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquoposrsquo) for f in posids]
negcutoff = len(negfeats)34
poscutoff = len(posfeats)34
trainfeats = negfeats[negcutoff] + posfeats[poscutoff]
classifier = NaiveBayesClassifiertrain(trainfeats)
sent = This is really cool I like it
words = word_feats(sentlower()split())
print classifierclassify(words)
12This code is adopted from httpstreamhackercom20100510
text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm
Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Building Naïve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555
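The same estimates in code, as a minimal sketch (the helper name laplace is ours; the raw counts are the ones from the worked example above):

def laplace(word_count, class_word_total, vocab_size):
    # add-one smoothing: (count + 1) / (total words in class + |V|)
    return (word_count + 1.0) / (class_word_total + vocab_size)

V = 6  # vocabulary: kill, bomb, kidnap, music, movie, tv
print laplace(4, 15, V)  # P(kill|Terrorism) = 5/21 = 0.2380...
print laplace(0, 12, V)  # P(kill|Entertainment) = 1/18 = 0.0555...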
Testing the NB model

Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt          2     1     2       0      0     1   ?

To find the posterior probability:

P(ci|W) = P(ci) × ∏ P(wj|ci), j = 1 ... |V|
Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) × P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism)
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952 = 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) × P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111 = 2.27 × 10^-7

The document is classified as "Terrorism" because that class received the highest posterior value.
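The same arithmetic can be scripted; a small sketch using the smoothed conditional probabilities from the slides above:

terror = {'kill': 0.2380, 'bomb': 0.1904, 'kidnap': 0.3333,
          'music': 0.0476, 'movie': 0.0952, 'tv': 0.0952}
entertain = {'kill': 0.0555, 'bomb': 0.1111, 'kidnap': 0.1111,
             'music': 0.3333, 'movie': 0.2777, 'tv': 0.1111}
test_doc = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}

def posterior(prior, cond_probs, doc):
    # multiply the prior by P(w|c) once per occurrence of w
    score = prior
    for word, freq in doc.items():
        score *= cond_probs[word] ** freq
    return score

print posterior(0.5, terror, test_doc)     # ~ 5.7e-05
print posterior(0.5, entertain, test_doc)  # ~ 2.3e-07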
Preventing Underflow

The probability score assigned to the test document is very small. In real-world situations we will train the classifier with thousands of documents, and the conditional probability values can become too small for the machine's floating-point arithmetic to handle. This problem is called underflow. To resolve it we can take the logarithm of the probabilities:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213 = -4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.5110 - 0.9542 - 1.9085 + 0 + 0 - 0.9542 = -6.6289

After handling the underflow problem, our system still classifies the test document as "Terrorism". From the final probability scores you can observe that the values are scaled nicely.

The section on Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
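A sketch of the log-space version (base-10 logarithms, matching the numbers above; the dictionaries are the same ones used in the previous sketch):

from math import log10

terror = {'kill': 0.2380, 'bomb': 0.1904, 'kidnap': 0.3333,
          'music': 0.0476, 'movie': 0.0952, 'tv': 0.0952}
entertain = {'kill': 0.0555, 'bomb': 0.1111, 'kidnap': 0.1111,
             'music': 0.3333, 'movie': 0.2777, 'tv': 0.1111}
test_doc = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}

def log_posterior(prior, cond_probs, doc):
    # a sum of logs replaces the product of tiny probabilities
    score = log10(prior)
    for word, freq in doc.items():
        score += freq * log10(cond_probs[word])
    return score

print log_posterior(0.5, terror, test_doc)     # ~ -4.24
print log_posterior(0.5, entertain, test_doc)  # ~ -6.63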
Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model
Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document.[a]

[a] A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Multi-variate Bernoulli Model

If we adopt the Multi-variate Bernoulli Model for text classification, our document space representation will look like:
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
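The conversion from counts to this representation is a one-liner; a sketch (the counts below are illustrative values chosen so they binarize to the D0 row above; scikit-learn's CountVectorizer(binary=True) performs the same conversion):

def binarize(term_counts):
    # presence/absence (1/0) instead of raw frequency
    return [1 if count > 0 else 0 for count in term_counts]

d0_counts = [2, 1, 1, 0, 0, 1]  # kill bomb kidnap music movie tv (illustrative)
print binarize(d0_counts)       # [1, 1, 1, 0, 0, 1] -- the D0 row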
Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag-of-words representation. We can also use a smoothed bag-of-words model like TF-IDF in the multinomial naive Bayes model.
Support Vector Machine

A Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}, i = 1..n, where xi ∈ R^p (input feature vector) and yi ∈ {+1, -1} is the corresponding label (yi = +1 or yi = -1). To start with, we assume that our input feature vectors are linearly separable; that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = -1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but they are not all equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable, we have to use kernel tricks to find soft margins.
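A toy numeric illustration of the decision rule (the weight vector and bias here are made-up values, not learned ones):

import numpy as np

w = np.array([2.0, -1.0])  # hypothetical weight vector
b = -0.5                   # hypothetical bias

def classify(x):
    # the sign of <w, x> + b picks the class
    return 1 if np.dot(w, x) + b > 0 else -1

print classify(np.array([1.0, 0.5]))  # <w,x>+b = 1.0  -> +1
print classify(np.array([0.0, 1.0]))  # <w,x>+b = -1.5 -> -1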
Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, it is not possible to separate them in a linear way. If there is a clear blank space on the table between the squares and the rectangles, we can say that the data is linearly separable. A line drawn through the clear space between the figures, at exactly equal distance from the edge of the square region and the edge of the rectangle region, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn along the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, marks the maximum margin. The figures closest to the separating hyperplane are known as support vectors.[10] If the data is not linearly separable, we have to use kernel tricks.[11]

[10] This is a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.
[11] http://www.statsoft.com/textbook/support-vector-machines
Support Vector Machine

[Figure: separating hyperplane, maximum margin, and support vectors]
Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python scikit-learn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
# predict returns an array of label indices; take the first element
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Practice Time

Let's try to build an SVM Classifier with Python scikit-learn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK:[12]

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    # binary (presence) features, matching the multi-variate Bernoulli view
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

[12] This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/. Follow the link for a more detailed discussion.
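Note that nltk.classify.util is imported above but never used. Its accuracy function is the natural next step: hold out the remaining quarter of each class as a test set (a sketch continuing the listing above, with negfeats, posfeats, the cutoffs, and classifier still in scope):

# continuing the listing above
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)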
Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                          Actual
                    Positive              Negative
Predicted Positive  True Positive (TP)    False Positive (FP)
          Negative  False Negative (FN)   True Negative (TN)
Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                    Actual
                    Positive  Negative  Total
Predicted Positive     562        77      639
          Negative     225       436      661
Total                  787       513     1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we actually identified. It is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)
Evaluating Performance of a Classifier

                    Actual
                    Positive  Negative  Total
Predicted Positive     562        77      639
          Negative     225       436      661
Total                  787       513     1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84
Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong.

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                    Actual
                    Positive  Negative  Total
Predicted Positive     562        77      639
          Negative     225       436      661
Total                  787       513     1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23
Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                    Actual
                    Positive  Negative  Total
Predicted Positive     562        77      639
          Negative     225       436      661
Total                  787       513     1300

Fall-out = 77 / (77 + 436) = 0.15
Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of precision and recall, where an F1 score reaches its best value at 1 and its worst at 0.[a]

F1 Score = 2 × (precision × recall) / (precision + recall)

[a] http://en.wikipedia.org/wiki/F1_score
Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78
F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP) = 562 / (562 + 77) = 0.87
Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN) = 436 / (436 + 225) = 0.65
Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN) = 436 / (77 + 436) = 0.84
Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP) = 77 / (77 + 562) = 0.12
Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN - FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 - 77 × 225) / √((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55
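All of the scores above can be re-derived from the four cells of the confusion matrix; a quick check in Python:

from math import sqrt

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0

print (TP + TN) / (TP + FP + FN + TN)  # accuracy  ~ 0.768
print TP / (TP + FP)                   # precision ~ 0.879
print TP / (TP + FN)                   # recall    ~ 0.714
print FP / (FP + TN)                   # fall-out  ~ 0.150
print TN / (TN + FN)                   # NPV       ~ 0.660
print FP / (FP + TP)                   # FDR       ~ 0.120
mcc = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print mcc                              # MCC       ~ 0.551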
Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis, and the true positive rate on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
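Both the curve and the area can be computed with scikit-learn (a sketch; the labels and scores below are made up for illustration):

from sklearn.metrics import roc_curve, auc

y_true = [1, 1, 0, 1, 0, 0, 1, 0]                    # actual labels (made up)
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.75, 0.1]  # classifier scores (made up)

fpr, tpr, thresholds = roc_curve(y_true, y_score)    # X axis: FPR, Y axis: TPR
print 'AUC:', auc(fpr, tpr)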
Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Named Entity Recognition

Entity Recognition with Python NLTK:

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
                       binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":  # in NLTK 3.x use tree.label() instead of tree.node
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = ("Abraham Lincoln was born February 12, 1809, the second child "
            "of Thomas Lincoln and Nancy Lincoln")
    entities = extract_entities(sent)
    print entities
Extracting Terms from Text

Extracting Terms with Python Topia Termextract:[13]

from topia.termextract import extract

extractor = extract.TermExtractor()
text = ("Abraham Lincoln was born February 12, 1809, the "
        "second child of Thomas Lincoln and Nancy Lincoln")
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

[13] http://pypi.python.org/pypi/topia.termextract/
References

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php
References

Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
Testing the NB model
Our test document isTest Docs kill bomb kidnap music movie tv C
Dt 2 1 2 0 0 1 To find the posterior probability
P (ci|W ) = P (ci)timesVprodj=1
P (wj |ci)
Jaganadh G Elements of Text Mining
Testing the NB model
P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)
= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521
= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5
P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)
= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111
= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7
The document has classified as rdquoTerrorismrdquo because it got the highest value
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289
After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely
The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf
Jaganadh G Elements of Text Mining
Naıve Bayes Classifier
There are two different ways to setup a Naıve Bayes Classifier
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta
aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by presence or absenceof the word
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument
In the NB example which we worked out we applied the multinomial model In the model we used simple bag of
words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that
〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1
〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way
If there is a clear blank space available in the table in between the square and rectangles we can say that
it is linearly separable A line which drawn in the clear space between the figuresexactly equal length
from the margin of scatter region of square and rectangle is called as separating hyperplane Everything
on the one side of the separating hyper plane belongs to one category and everything in the other side
belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter
(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin
The figures closest to the separating hyperplane are known as support vectors10 If the data is not
linearly separable we have to use kernel tricks 11
10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography
11httpwwwstatsoftcomtextbooksupport-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnnaive_bayes import MultinomialNB
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoMultinomialNB())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a SVM Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnsvm import LinearSVC
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoLinearSVC())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12
import nltkclassifyutil
from nltkclassify import NaiveBayesClassifier
from nltkcorpus import movie_reviews
def word_feats(words)
return dict([(word True) for word in words])
negids = movie_reviewsfileids(rsquonegrsquo)
posids = movie_reviewsfileids(rsquoposrsquo)
negfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquonegrsquo) for f in negids]
posfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquoposrsquo) for f in posids]
negcutoff = len(negfeats)34
poscutoff = len(posfeats)34
trainfeats = negfeats[negcutoff] + posfeats[poscutoff]
classifier = NaiveBayesClassifiertrain(trainfeats)
sent = This is really cool I like it
words = word_feats(sentlower()split())
print classifierclassify(words)
12This code is adopted from httpstreamhackercom20100510
text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm
Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)
MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)
= 055
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
Testing the NB model
P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)
= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521
= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5
P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)
= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111
= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7
The document has classified as rdquoTerrorismrdquo because it got the highest value
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289
After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely
The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf
Jaganadh G Elements of Text Mining
Naıve Bayes Classifier
There are two different ways to setup a Naıve Bayes Classifier
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta
aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by presence or absenceof the word
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument
In the NB example which we worked out we applied the multinomial model In the model we used simple bag of
words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that
〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1
〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way
If there is a clear blank space available in the table in between the square and rectangles we can say that
it is linearly separable A line which drawn in the clear space between the figuresexactly equal length
from the margin of scatter region of square and rectangle is called as separating hyperplane Everything
on the one side of the separating hyper plane belongs to one category and everything in the other side
belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter
(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin
The figures closest to the separating hyperplane are known as support vectors10 If the data is not
linearly separable we have to use kernel tricks 11
10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography
11httpwwwstatsoftcomtextbooksupport-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnnaive_bayes import MultinomialNB
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoMultinomialNB())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a SVM Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnsvm import LinearSVC
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoLinearSVC())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12
import nltkclassifyutil
from nltkclassify import NaiveBayesClassifier
from nltkcorpus import movie_reviews
def word_feats(words)
return dict([(word True) for word in words])
negids = movie_reviewsfileids(rsquonegrsquo)
posids = movie_reviewsfileids(rsquoposrsquo)
negfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquonegrsquo) for f in negids]
posfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquoposrsquo) for f in posids]
negcutoff = len(negfeats)34
poscutoff = len(posfeats)34
trainfeats = negfeats[negcutoff] + posfeats[poscutoff]
classifier = NaiveBayesClassifiertrain(trainfeats)
sent = This is really cool I like it
words = word_feats(sentlower()split())
print classifierclassify(words)
12This code is adopted from httpstreamhackercom20100510
text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm
Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)
MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)
= 055
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
Preventing Underflow
The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289
After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely
The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf
Jaganadh G Elements of Text Mining
Naıve Bayes Classifier
There are two different ways to setup a Naıve Bayes Classifier
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word w_t from the vocabulary. Dimension t of the vector for document d_i is written B_it, and is either 0 or 1, indicating whether word w_t occurs at least once in the document. With such a document representation we make the naive Bayes assumption: the probability of each word occurring in a document is independent of the occurrence of other words in that document. [a]

[a] A Comparison of Event Models for Naive Bayes Text Classification. Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we adopt the Multi-variate Bernoulli Model for text classification, our document space representation will look like:
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequencies have been replaced by the presence (1) or absence (0) of the word.
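As a minimal sketch (the count vector below is made up for illustration), converting a frequency vector to the Bernoulli representation is a one-liner in Python:

# hypothetical word counts for one document
counts = {'kill': 3, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}

# the Bernoulli representation keeps only presence (1) or absence (0)
binary = dict((word, int(freq > 0)) for word, freq in counts.items())
print binary  # e.g. {'tv': 1, 'kill': 1, 'movie': 0, 'bomb': 1, 'kidnap': 1, 'music': 0}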
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out we applied the multinomial model. There we used a simple bag-of-words representation. We can also use a smoothed bag-of-words model like TF-IDF in the multinomial naive Bayes model.
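A small sketch with scikit-learn's CountVectorizer, which builds exactly this kind of per-document frequency vector (the two toy documents are made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["kill bomb kidnap kill kill", "music movie tv movie"]
vec = CountVectorizer()
X = vec.fit_transform(docs)
print vec.get_feature_names()  # ['bomb', 'kidnap', 'kill', 'movie', 'music', 'tv']
print X.toarray()              # word frequencies, not just presence:
                               # [[1 1 3 0 0 0]
                               #  [0 0 0 2 1 1]]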
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(x_i, y_i)}, i = 1, ..., n, where x_i ∈ R^p (input feature vector) and y_i ∈ {1, -1} is the corresponding label (y_i = +1 or y_i = -1). To start with, we assume that our input feature vectors are linearly separable; that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, x_i⟩ + b > 0 for y_i = +1
⟨w, x_i⟩ + b < 0 for y_i = -1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate the positive and negative examples, but they are not all equal: SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
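A toy sketch of the decision function f(x) = ⟨w, x⟩ + b in Python; the weight vector w and bias b below are assumed values for illustration, not learned by an SVM:

# toy 2-D example; w and b are assumed, not learned
w = [1.0, -1.0]
b = -0.5

def f(x):
    # decision function <w, x> + b
    return sum(wi * xi for wi, xi in zip(w, x)) + b

print f([2.0, 0.5])  #  1.0 -> positive side of the hyperplane, label +1
print f([0.5, 2.0])  # -2.0 -> negative side of the hyperplane, label -1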
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red squares and blue rectangles are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space on the table between the squares and the rectangles, we can say that they are linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the square and rectangle scatter regions, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn along the edge of each scatter (square and rectangle), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11
10. This is a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.
11. http://www.statsoft.com/textbook/support-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
[Figure: separating hyperplane with maximum margin and support vectors]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
Jaganadh G Elements of Text Mining
Practice Time
Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12. This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                          Actual
                          Positive               Negative
Predicted   Positive      True Positive (TP)     False Positive (FP)
            Negative      False Negative (FN)    True Negative (TN)
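As a cross-check, scikit-learn's confusion_matrix computes the same table from label vectors (note its convention: rows are actual classes and columns are predicted classes, the transpose of the layout above); the labels below are made up for illustration:

from sklearn.metrics import confusion_matrix

y_actual    = [1, 1, 0, 1, 0, 0]
y_predicted = [1, 0, 0, 1, 0, 1]
print confusion_matrix(y_actual, y_predicted)
# [[2 1]
#  [1 2]]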
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = (TP + TN) / (TP + FP + FN + TN)

                          Actual
                          Positive   Negative   Total
Predicted   Positive      562        77         639
            Negative      225        436        661
            Total         787        513        1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
                          Actual
                          Positive   Negative   Total
Predicted   Positive      562        77         639
            Negative      225        436        661
            Total         787        513        1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84
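The same figures in Python; the printed values carry more decimals than the slide, which truncates to two:

TP, FP, FN, TN = 562, 77, 225, 436

print float(TP) / (TP + FP)  # positive precision: 0.8794...
print float(TN) / (TN + FN)  # negative precision: 0.6596...
print float(TP) / (TP + FN)  # positive recall: 0.7141...
print float(TN) / (TN + FP)  # negative recall: 0.8499...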
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the proportion of instances that the classifier got wrong.

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                          Actual
                          Positive   Negative   Total
Predicted   Positive      562        77         639
            Negative      225        436        661
            Total         787        513        1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                          Actual
                          Positive   Negative   Total
Predicted   Positive      562        77         639
            Negative      225        436        661
            Total         787        513        1300

Fall-out = 77 / (77 + 436) = 0.15
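The same two rates in Python, from the confusion matrix counts above:

TP, FP, FN, TN = 562, 77, 225, 436

print float(FP + FN) / (TP + FP + FN + TN)  # error rate: 0.2323...
print float(FP) / (FP + TN)                 # fall-out (FPR): 0.1501...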
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. [a]

F1 Score = 2 × (precision × recall) / (precision + recall)

[a] http://en.wikipedia.org/wiki/F1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78
F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
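A small helper makes the computation explicit, using the rounded precision and recall values from the earlier slide:

def f1(p, r):
    # harmonic mean of precision and recall
    return 2 * p * r / (p + r)

print f1(0.87, 0.71)  # 0.7819... ~ 0.78
print f1(0.65, 0.84)  # 0.7328... ~ 0.73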
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive Predictive Value = TP / (TP + FP)

Positive Predictive Value = 562 / (562 + 77) = 0.87
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN - FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 - 77 × 225) / sqrt((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55
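The same computation in Python:

from math import sqrt

TP, FP, FN, TN = 562, 77, 225, 436
mcc = (TP * TN - FP * FN) / sqrt(float((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)))
print mcc  # 0.5514...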
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis, and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
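A minimal sketch with scikit-learn's roc_curve and auc; the ground-truth labels and classifier scores below are made up for illustration:

from sklearn.metrics import roc_curve, auc

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr, thresholds = roc_curve(y_true, scores)
print auc(fpr, tpr)  # 0.75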
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, and locations, expressions of times, quantities, monetary values, percentages, etc.
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13. http://pypi.python.org/pypi/topia.termextract
Jaganadh G Elements of Text Mining
References
Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept 2012.

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
Jaganadh G Elements of Text Mining
Naıve Bayes Classifier
There are two different ways to setup a Naıve Bayes Classifier
Multi-variate Bernoulli Model
Multinomial Model
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta
aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by presence or absenceof the word
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument
In the NB example which we worked out we applied the multinomial model In the model we used simple bag of
words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that
〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1
〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way
If there is a clear blank space available in the table in between the square and rectangles we can say that
it is linearly separable A line which drawn in the clear space between the figuresexactly equal length
from the margin of scatter region of square and rectangle is called as separating hyperplane Everything
on the one side of the separating hyper plane belongs to one category and everything in the other side
belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter
(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin
The figures closest to the separating hyperplane are known as support vectors10 If the data is not
linearly separable we have to use kernel tricks 11
10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography
11httpwwwstatsoftcomtextbooksupport-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnnaive_bayes import MultinomialNB
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoMultinomialNB())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a SVM Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnsvm import LinearSVC
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoLinearSVC())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12
import nltkclassifyutil
from nltkclassify import NaiveBayesClassifier
from nltkcorpus import movie_reviews
def word_feats(words)
return dict([(word True) for word in words])
negids = movie_reviewsfileids(rsquonegrsquo)
posids = movie_reviewsfileids(rsquoposrsquo)
negfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquonegrsquo) for f in negids]
posfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquoposrsquo) for f in posids]
negcutoff = len(negfeats)34
poscutoff = len(posfeats)34
trainfeats = negfeats[negcutoff] + posfeats[poscutoff]
classifier = NaiveBayesClassifiertrain(trainfeats)
sent = This is really cool I like it
words = word_feats(sentlower()split())
print classifierclassify(words)
12This code is adopted from httpstreamhackercom20100510
text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm
Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)
MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)
= 055
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta
aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by presence or absenceof the word
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument
In the NB example which we worked out we applied the multinomial model In the model we used simple bag of
words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that
〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1
〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way
If there is a clear blank space available in the table in between the square and rectangles we can say that
it is linearly separable A line which drawn in the clear space between the figuresexactly equal length
from the margin of scatter region of square and rectangle is called as separating hyperplane Everything
on the one side of the separating hyper plane belongs to one category and everything in the other side
belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter
(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin
The figures closest to the separating hyperplane are known as support vectors10 If the data is not
linearly separable we have to use kernel tricks 11
10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography
11httpwwwstatsoftcomtextbooksupport-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnnaive_bayes import MultinomialNB
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoMultinomialNB())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a SVM Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnsvm import LinearSVC
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoLinearSVC())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12
import nltkclassifyutil
from nltkclassify import NaiveBayesClassifier
from nltkcorpus import movie_reviews
def word_feats(words)
return dict([(word True) for word in words])
negids = movie_reviewsfileids(rsquonegrsquo)
posids = movie_reviewsfileids(rsquoposrsquo)
negfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquonegrsquo) for f in negids]
posfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquoposrsquo) for f in posids]
negcutoff = len(negfeats)34
poscutoff = len(posfeats)34
trainfeats = negfeats[negcutoff] + posfeats[poscutoff]
classifier = NaiveBayesClassifiertrain(trainfeats)
sent = This is really cool I like it
words = word_feats(sentlower()split())
print classifierclassify(words)
12This code is adopted from httpstreamhackercom20100510
text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm
Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)
MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)
= 055
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
Multi-variate Bernoulli Model
Multi-variate Bernoulli Model
If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like
Training Docs kill bomb kidnap music movie tv C
D0 1 1 1 0 0 1 Terrorism
D1 1 1 1 0 0 0 Terrorism
D2 1 1 1 0 1 0 Terrorism
D3 0 1 0 1 1 1 Entertainment
D4 0 0 1 1 1 0 Entertainment
D5 0 0 0 1 1 1 Entertainment
Here you can note that the individual word frequency has been replaced by presence or absenceof the word
Jaganadh G Elements of Text Mining
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument
In the NB example which we worked out we applied the multinomial model In the model we used simple bag of
words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that
〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1
〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way
If there is a clear blank space available in the table in between the square and rectangles we can say that
it is linearly separable A line which drawn in the clear space between the figuresexactly equal length
from the margin of scatter region of square and rectangle is called as separating hyperplane Everything
on the one side of the separating hyper plane belongs to one category and everything in the other side
belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter
(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin
The figures closest to the separating hyperplane are known as support vectors10 If the data is not
linearly separable we have to use kernel tricks 11
10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography
11httpwwwstatsoftcomtextbooksupport-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnnaive_bayes import MultinomialNB
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoMultinomialNB())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a SVM Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnsvm import LinearSVC
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoLinearSVC())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12
import nltkclassifyutil
from nltkclassify import NaiveBayesClassifier
from nltkcorpus import movie_reviews
def word_feats(words)
return dict([(word True) for word in words])
negids = movie_reviewsfileids(rsquonegrsquo)
posids = movie_reviewsfileids(rsquoposrsquo)
negfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquonegrsquo) for f in negids]
posfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquoposrsquo) for f in posids]
negcutoff = len(negfeats)34
poscutoff = len(posfeats)34
trainfeats = negfeats[negcutoff] + posfeats[poscutoff]
classifier = NaiveBayesClassifiertrain(trainfeats)
sent = This is really cool I like it
words = word_feats(sentlower()split())
print classifierclassify(words)
12This code is adopted from httpstreamhackercom20100510
text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm
Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)
MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)
= 055
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
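A minimal scikit-learn sketch of both ideas (not part of the original slides); roc_curve and roc_auc_score are standard sklearn.metrics functions, and the labels and scores below are invented purely to show the API:

from sklearn.metrics import roc_curve, roc_auc_score

# Made-up gold labels and classifier scores, only to illustrate the calls.
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.55]

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the ROC curve
print(roc_auc_score(y_true, y_score))               # 0.875 for these toy scores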
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.label() == "NE":   # tree.node on older NLTK versions
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = ("Abraham Lincoln was born February 12, 1809, the second child "
            "of Thomas Lincoln and Nancy Lincoln")
    entities = extract_entities(sent)
    print(entities)
Extracting Terms from Text
Extracting Terms with Python topia.termextract [13]

from topia.termextract import extract

extractor = extract.TermExtractor()
text = ("Abraham Lincoln was born February 12, 1809, the "
        "second child of Thomas Lincoln and Nancy Lincoln")
terms = extractor(text)
terms = [term[0] for term in terms]
print(terms)

[13] http://pypi.python.org/pypi/topia.termextract
References
Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part I, http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept 2012.

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part II, http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog, Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora, acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept 2012.

Roger Bilisoly, Practical Text Mining with Perl, Wiley, 2008.

Text Categorization and Classification, http://www.python-course.eu/text_classification_introduction.php

Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce, www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.
Multinomial Model
Multinomial Model
In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag-of-words representation. We can also use a smoothed bag-of-words model like TF-IDF in the multinomial naive Bayes model.
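To make the event model concrete, here is a tiny sketch of how a multinomial NB scores a document: log P(c) plus one log P(w|c) term per word occurrence, so repeated words count repeatedly. The priors and word probabilities below are made-up illustration values, not estimates from any corpus:

import math

# Made-up class priors and per-class word probabilities (illustration only).
priors = {'pos': 0.5, 'neg': 0.5}
word_probs = {
    'pos': {'good': 0.4, 'boring': 0.1, 'movie': 0.5},
    'neg': {'good': 0.1, 'boring': 0.4, 'movie': 0.5},
}

def log_score(doc_words, label):
    # log P(c) + sum over word *occurrences* of log P(w | c)
    score = math.log(priors[label])
    for w in doc_words:
        score += math.log(word_probs[label][w])
    return score

doc = ['good', 'good', 'movie']   # the repeated word contributes twice
print(max(priors, key=lambda c: log_score(doc, c)))   # -> pos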
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(x_i, y_i)}_{i=1}^{n}, where x_i ∈ R^p (input feature vector) and y_i ∈ {1, −1} is the corresponding label, (y_i = +1) or (y_i = −1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, x_i⟩ + b > 0 for y_i = +1
⟨w, x_i⟩ + b < 0 for y_i = −1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate the positive and negative examples, but they are not all equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
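A minimal scikit-learn sketch of this decision rule, on a made-up linearly separable 2-D toy set; LinearSVC exposes the learned w as coef_ and b as intercept_, so we can evaluate sign(⟨w, x⟩ + b) by hand and compare it with the library's prediction:

import numpy as np
from sklearn.svm import LinearSVC

# Made-up, linearly separable 2-D points: +1 above the diagonal, -1 below.
X = np.array([[0, 2], [1, 3], [2, 4], [2, 0], [3, 1], [4, 2]])
y = np.array([1, 1, 1, -1, -1, -1])

svm = LinearSVC(C=1.0).fit(X, y)
w, b = svm.coef_[0], svm.intercept_[0]

# sign(<w, x> + b) gives the predicted side of the hyperplane.
x_new = np.array([1, 4])
print(np.sign(np.dot(w, x_new) + b), svm.predict([x_new])[0])   # both +1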
Support Vector Machine
Suppose a bunch of red squares and blue rectangles are scattered over a table. If the figures are mixed, it is not possible to separate them in a linear way.

If there is a clear blank space on the table between the squares and rectangles, we can say that the data is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the square and rectangle scatter regions, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares vs. blue rectangles). A line drawn at the edge of each scatter (square and rectangle) at an equal distance from the separating hyperplane is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors. [10] If the data is not linearly separable we have to use kernel tricks. [11]

[10] This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.
[11] http://www.statsoft.com/textbook/support-vector-machines
Support Vector Machine

[Figure: separating hyperplane, maximum margin, and support vectors]
Practice Time
Let's try to build a Multinomial Naive Bayes Classifier with Python Sklearn.

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"   # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories, shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)

# predict() returns an array of label indices; take the first element.
print(training_data.target_names[classifier.predict(['This is a good one'])[0]])
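To tie this back to the evaluation section, a held-out split can be scored in a few extra lines. A sketch continuing the snippet above; train_test_split lives in sklearn.model_selection on current scikit-learn versions:

from sklearn.model_selection import train_test_split

# Hold out 25% of the documents and report accuracy on them.
X_train, X_test, y_train, y_test = train_test_split(
    training_data.data, training_data.target, test_size=0.25, random_state=0)
classifier.fit(X_train, y_train)
print(classifier.score(X_test, y_test))   # mean accuracy on the held-out set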
Practice Time
Let's try to build an SVM Classifier with Python Sklearn.

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"   # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories, shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)

print(training_data.target_names[classifier.predict(['This is a good one'])[0]])
Practice Time
Let's try to build a multi-variate Naive Bayes Classifier with Python NLTK. [12]

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    # Multi-variate Bernoulli features: word presence only, no counts.
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

# Train on the first 3/4 of each class.
negcutoff = len(negfeats) * 3 // 4
poscutoff = len(posfeats) * 3 // 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print(classifier.classify(words))

[12] This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/. Follow the link for a more detailed discussion.
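The remaining quarter of each class makes a natural test set; nltk.classify.util.accuracy (already imported above) scores the classifier on it, connecting the practice code to the evaluation metrics discussed earlier:

# Evaluate on the held-out 1/4 of each class.
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print(nltk.classify.util.accuracy(classifier, testfeats))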
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm
Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)
MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)
= 055
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
Support Vector Machine
Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that
〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1
〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way
If there is a clear blank space available in the table in between the square and rectangles we can say that
it is linearly separable A line which drawn in the clear space between the figuresexactly equal length
from the margin of scatter region of square and rectangle is called as separating hyperplane Everything
on the one side of the separating hyper plane belongs to one category and everything in the other side
belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter
(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin
The figures closest to the separating hyperplane are known as support vectors10 If the data is not
linearly separable we have to use kernel tricks 11
10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography
11httpwwwstatsoftcomtextbooksupport-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnnaive_bayes import MultinomialNB
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoMultinomialNB())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a SVM Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnsvm import LinearSVC
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoLinearSVC())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12
import nltkclassifyutil
from nltkclassify import NaiveBayesClassifier
from nltkcorpus import movie_reviews
def word_feats(words)
return dict([(word True) for word in words])
negids = movie_reviewsfileids(rsquonegrsquo)
posids = movie_reviewsfileids(rsquoposrsquo)
negfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquonegrsquo) for f in negids]
posfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquoposrsquo) for f in posids]
negcutoff = len(negfeats)34
poscutoff = len(posfeats)34
trainfeats = negfeats[negcutoff] + posfeats[poscutoff]
classifier = NaiveBayesClassifiertrain(trainfeats)
sent = This is really cool I like it
words = word_feats(sentlower()split())
print classifierclassify(words)
12This code is adopted from httpstreamhackercom20100510
text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm
Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)
MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)
= 055
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
Support Vector Machine
Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way
If there is a clear blank space available in the table in between the square and rectangles we can say that
it is linearly separable A line which drawn in the clear space between the figuresexactly equal length
from the margin of scatter region of square and rectangle is called as separating hyperplane Everything
on the one side of the separating hyper plane belongs to one category and everything in the other side
belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter
(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin
The figures closest to the separating hyperplane are known as support vectors10 If the data is not
linearly separable we have to use kernel tricks 11
10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography
11httpwwwstatsoftcomtextbooksupport-vector-machines
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnnaive_bayes import MultinomialNB
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoMultinomialNB())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a SVM Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnsvm import LinearSVC
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoLinearSVC())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12
import nltkclassifyutil
from nltkclassify import NaiveBayesClassifier
from nltkcorpus import movie_reviews
def word_feats(words)
return dict([(word True) for word in words])
negids = movie_reviewsfileids(rsquonegrsquo)
posids = movie_reviewsfileids(rsquoposrsquo)
negfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquonegrsquo) for f in negids]
posfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquoposrsquo) for f in posids]
negcutoff = len(negfeats)34
poscutoff = len(posfeats)34
trainfeats = negfeats[negcutoff] + posfeats[poscutoff]
classifier = NaiveBayesClassifiertrain(trainfeats)
sent = This is really cool I like it
words = word_feats(sentlower()split())
print classifierclassify(words)
12This code is adopted from httpstreamhackercom20100510
text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm
Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)
MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)
= 055
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
Support Vector Machine
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnnaive_bayes import MultinomialNB
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoMultinomialNB())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a SVM Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnsvm import LinearSVC
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoLinearSVC())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12
import nltkclassifyutil
from nltkclassify import NaiveBayesClassifier
from nltkcorpus import movie_reviews
def word_feats(words)
return dict([(word True) for word in words])
negids = movie_reviewsfileids(rsquonegrsquo)
posids = movie_reviewsfileids(rsquoposrsquo)
negfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquonegrsquo) for f in negids]
posfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquoposrsquo) for f in posids]
negcutoff = len(negfeats)34
poscutoff = len(posfeats)34
trainfeats = negfeats[negcutoff] + posfeats[poscutoff]
classifier = NaiveBayesClassifiertrain(trainfeats)
sent = This is really cool I like it
words = word_feats(sentlower()split())
print classifierclassify(words)
12This code is adopted from httpstreamhackercom20100510
text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm
Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)
MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)
= 055
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnnaive_bayes import MultinomialNB
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoMultinomialNB())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a SVM Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnsvm import LinearSVC
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoLinearSVC())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)
12. This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/. Follow the link for a more detailed discussion.
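The nltk.classify.util module imported above is not actually used on this slide; it becomes useful once you evaluate on the held-out quarter of each class. A minimal sketch, reusing negfeats, posfeats, the cutoffs, and classifier from above:

testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)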
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.
Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                                Actual
                      Positive               Negative
Predicted  Positive   True Positive (TP)     False Positive (FP)
           Negative   False Negative (FN)    True Negative (TN)
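In code, a confusion matrix can be produced directly from paired actual and predicted labels. A minimal sketch with sklearn and hypothetical labels; note that sklearn puts actual classes on the rows and predicted classes on the columns, the transpose of the table above:

from sklearn.metrics import confusion_matrix

actual    = ['pos', 'pos', 'neg', 'neg', 'pos', 'neg']
predicted = ['pos', 'neg', 'neg', 'pos', 'pos', 'neg']
print confusion_matrix(actual, predicted, labels=['pos', 'neg'])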
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = (TP + TN) / (TP + FP + FN + TN)

                        Actual
                 Positive   Negative   Total
Predicted  Positive   562         77     639
           Negative   225        436     661
Total                 787        513    1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision indicates how many of the items that we identified were relevant.
Precision = TP / (TP + FP)
Recall indicates how many of the relevant items we identified. It is equivalent to the "hit rate" and "sensitivity".
Recall = TP / (TP + FN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
                        Actual
                 Positive   Negative   Total
Predicted  Positive   562         77     639
           Negative   225        436     661
Total                 787        513    1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84
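A quick way to verify these figures is to compute them directly from the confusion matrix counts; the slides truncate to two decimal places:

tp, fp, fn, tn = 562.0, 77.0, 225.0, 436.0
print 'positive precision:', tp / (tp + fp)   # 0.879...
print 'negative precision:', tn / (tn + fn)   # 0.659...
print 'positive recall   :', tp / (tp + fn)   # 0.714...
print 'negative recall   :', tn / (tn + fp)   # 0.849...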
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the proportion of predictions the classifier got wrong.
ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                        Actual
                 Positive   Negative   Total
Predicted  Positive   562         77     639
           Negative   225        436     661
Total                 787        513    1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).
Fall-out = FP / (FP + TN)

                        Actual
                 Positive   Negative   Total
Predicted  Positive   562         77     639
           Negative   225        436     661
Total                 787        513    1300

Fall-out = 77 / (77 + 436) = 0.15
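Both rates are complements of figures seen elsewhere: error rate is 1 - accuracy, and fall-out is 1 - specificity (the true negative rate, defined a few slides below). A quick check from the same counts:

tp, fp, fn, tn = 562.0, 77.0, 225.0, 436.0
print 'error rate:', (fp + fn) / (tp + fp + fn + tn)   # 0.232...
print 'fall-out  :', fp / (fp + tn)                    # 0.150...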
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. [a]

F1 Score = 2 × (precision × recall) / (precision + recall)

a. http://en.wikipedia.org/wiki/F1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78
F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
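The same check in code, using the full-precision precision and recall values; the small differences from the slide come from its two-decimal rounding:

p_pos, r_pos = 562 / 639.0, 562 / 787.0
p_neg, r_neg = 436 / 661.0, 436 / 513.0
print 'F1 positive:', 2 * p_pos * r_pos / (p_pos + r_pos)   # 0.788...
print 'F1 negative:', 2 * p_neg * r_neg / (p_neg + r_neg)   # 0.742...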
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.
Positive predictive value = TP / (TP + FP)
Positive predictive value = 562 / (562 + 77) = 0.87
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure: the proportion of negative test results that are true negatives.
NPV = TN / (TN + FN)
NPV = 436 / (436 + 225) = 0.65
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity measures the proportion of actual negatives that are correctly identified as such.
Specificity = TN / (FP + TN)
Specificity = 436 / (77 + 436) = 0.84
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
The false discovery rate (FDR) is the proportion of predicted positives that are actually negative; FDR control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.
FDR = FP / (FP + TP)
FDR = 77 / (77 + 562) = 0.12
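PPV and NPV coincide with the positive and negative precision computed earlier, so only the last two rates are new. A quick check from the same counts:

tp, fp, fn, tn = 562.0, 77.0, 225.0, 436.0
print 'specificity:', tn / (tn + fp)   # 0.849...
print 'FDR        :', fp / (fp + tp)   # 0.120...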
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.
MCC = (TP × TN - FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
MCC = (562 × 436 - 77 × 225) / sqrt((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55
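The same computation in code:

from math import sqrt

tp, fp, fn, tn = 562.0, 77.0, 225.0, 436.0
mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
print mcc   # 0.551...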
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in a classification model. It applies to binary classification problems. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis, and the true positive rate on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
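A minimal sketch of both ideas with sklearn; y_true and y_score are hypothetical actual labels and positive-class scores from a classifier:

from sklearn.metrics import roc_curve, auc

y_true  = [1, 1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.35, 0.6, 0.4, 0.3, 0.7, 0.2]
fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the ROC curve
print auc(fpr, tpr)                                 # area under that curve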
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
                       binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities
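Running this should print the detected names, something like ['Abraham Lincoln', 'Thomas Lincoln', 'Nancy Lincoln']. Note that tree.node reflects the NLTK 2.x API; in NLTK 3 it was replaced by the tree.label() method.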
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract [13]
from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
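The extractor returns a list of (term, occurrences, strength) tuples, which is why only term[0], the term string itself, is kept in the list comprehension above.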
13. http://pypi.python.org/pypi/topia.termextract/
Jaganadh G Elements of Text Mining
References
Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.
Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.
Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.
Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.
Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.
Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.
Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.
Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.
Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044.
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a SVM Classifier with Python Sklearn
from sklearndatasets import load_files
from sklearnfeature_extractiontext import CountVectorizer
from sklearnfeature_extractiontext import TfidfTransformer
from sklearnpipeline import Pipeline
from sklearnsvm import LinearSVC
dir_data = usrsharenltk_datacorporamovie_reviews replace with your path
vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)
stop_words=rsquoenglishrsquolowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)
(rsquoclfrsquoLinearSVC())])
categories = [rsquoposrsquorsquonegrsquo]
training_data = load_files(dir_datacategories=categories
shuffle = True)
_ = classifierfit(training_datadata training_datatarget)
print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12
import nltkclassifyutil
from nltkclassify import NaiveBayesClassifier
from nltkcorpus import movie_reviews
def word_feats(words)
return dict([(word True) for word in words])
negids = movie_reviewsfileids(rsquonegrsquo)
posids = movie_reviewsfileids(rsquoposrsquo)
negfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquonegrsquo) for f in negids]
posfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquoposrsquo) for f in posids]
negcutoff = len(negfeats)34
poscutoff = len(posfeats)34
trainfeats = negfeats[negcutoff] + posfeats[poscutoff]
classifier = NaiveBayesClassifiertrain(trainfeats)
sent = This is really cool I like it
words = word_feats(sentlower()split())
print classifierclassify(words)
12This code is adopted from httpstreamhackercom20100510
text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm
Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)
MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)
= 055
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
Practice Time
Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12
import nltkclassifyutil
from nltkclassify import NaiveBayesClassifier
from nltkcorpus import movie_reviews
def word_feats(words)
return dict([(word True) for word in words])
negids = movie_reviewsfileids(rsquonegrsquo)
posids = movie_reviewsfileids(rsquoposrsquo)
negfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquonegrsquo) for f in negids]
posfeats = [(word_feats(movie_reviewswords(fileids=[f]))
rsquoposrsquo) for f in posids]
negcutoff = len(negfeats)34
poscutoff = len(posfeats)34
trainfeats = negfeats[negcutoff] + posfeats[poscutoff]
classifier = NaiveBayesClassifiertrain(trainfeats)
sent = This is really cool I like it
words = word_feats(sentlower()split())
print classifierclassify(words)
12This code is adopted from httpstreamhackercom20100510
text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm
Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)
MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)
= 055
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Confusion Matrix
Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm
Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like
ActualPositive Negative
PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)
MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)
= 055
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Accuracy of a Classifier
Accuracy = TP+TNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Accuracy = 562+436562+77+225+436
= 076
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)
MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)
= 055
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Precision and Recall
Precision which indicates how many of the items that we identified were relevant
Precision = TPTP+FP
Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo
Recall = TPTP+FN
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)
MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)
= 055
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Positive Precision = 562562+77
= 087
Negative Precision = 436225+436
= 065
Positive Recall = 562562+225
= 071
Negative Recall = 43677+436
= 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Error Rate
Error rate is the percentage of things done wrong
ErrorRate = FP+FNTP+FP+FN+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
ErrorRate = 77+225562+77+225+436 = 023
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Fall-out
It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)
Fall minus out = FPFP+TN
ActualPositive Negative Total
PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300
Fall minus out = 7777+436 = 015
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score
In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a
F1 Score = 2 precisionrecallprecision+recall
ahttpenwikipediaorgwikiF1_score
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
F1 Score Positive = 2 087071087+071 = 078
F1 Score Positive = 2 065084065+084 = 073
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Positive predictive value
Positive predictive value or precision rate is the proportion of positive test results thatare true positives
Positive predictive value = TPTP+FP
Positive predictive value = 562562+77 = 087
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Negative predictive value
Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure
NPV = TNTN+FN
NPV = 436436+225 = 065
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Specificity or True Negative Rate
Specificity = TNFP+TN
Specificity = = 43677+436 = 084
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
False Discovery Rate
False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons
FDR = FPFP+TP
FDR = 7777+562 = 012
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications
MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)
MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)
= 055
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Receiver Operating Characteristic (ROC)
ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate
Jaganadh G Elements of Text Mining
Evaluating Performance of a Classifier
Area Under the Curve
The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)
Jaganadh G Elements of Text Mining
Named Entity Recognition
Named Entity Recognition
Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc
Jaganadh G Elements of Text Mining
Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    # POS-tag each sentence and chunk named entities;
    # binary=True labels every entity simply as 'NE'
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == 'NE':
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == '__main__':
    sent = ("Abraham Lincoln was born February 12, 1809, the second child "
            "of Thomas Lincoln and Nancy Lincoln")
    entities = extract_entities(sent)
    print entities
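With the NLTK models current at the time of writing, this should print something like ['Abraham Lincoln', 'Thomas Lincoln', 'Nancy Lincoln'], though the exact output depends on the installed chunker. Note that NLTK 3 and later renamed tree.node to tree.label().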
Extracting Terms from Text

Extracting Terms with Python Topia Termextract

from topia.termextract import extract

extractor = extract.TermExtractor()
text = ("Abraham Lincoln was born February 12, 1809, the "
        "second child of Thomas Lincoln and Nancy Lincoln")
terms = extractor(text)
# the extractor returns (term, occurrences, strength) tuples; keep the term strings
terms = [term[0] for term in terms]
print terms

http://pypi.python.org/pypi/topia.termextract
References

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept 2012.

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Chris Manning and Hinrich Schutze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008.
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
Named Entity Recognition
Entity Recognition with Python NLTK
from nltk import sent_tokenize ne_chunk pos_tag word_tokenize
def extract_entities(text)
entities = []
sents = sent_tokenize(text)
chunks = [ ne_chunk(pos_tag(word_tokenize(sent))
binary=True) for sent in sents]
for chunk in chunks
for tree in chunksubtrees()
if treenode == NE
entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())
entitiesappend(entity)
return entities
if __name__ == __main__
sent = Abraham Lincoln was born February 12 1809 the second child
of Thomas Lincoln and Nancy Lincoln
entities = extract_entities(sent)
print entities
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
Extracting Terms from Text
Extracting Terms with Python Topia Termextract 13
from topiatermextract import extract
extractor = extractTermExtractor()
text = Abraham Lincoln was born February 12 1809 the
second child of Thomas Lincoln and Nancy Lincoln
terms = extractor(text)
terms = [term[0] for term in terms]
print terms
13httppypipythonorgpypitopiatermextract
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
References
Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012
Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012
Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006
Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012
Roger BilisolyPractical Text Mining with Perl Wiley 2008
Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining
References
Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999
Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012
Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012
Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044
Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008
Jaganadh G Elements of Text Mining