Text mining and analytics v6 - p2

Tutorial: Text Data Mining and Analytics: Part 2
HICSS 44 – January 2011

Dave King

Copyright 2011 JDA Software Group, Inc.

2

Text Mining: Payoff from Simple Approaches

Many of the applications of data mining to text “have proved remarkably successful without understanding specific properties of text such as the concepts of grammar or the meaning of words. Strictly low-level frequency information is used, such as the number of times a word appears in a document, and then well-known methods of machine learning are applied.”

Source: S. Weiss et al., Text Mining: Predictive Methods for Analyzing Unstructured Information, 2005


3

Text Mining: Here’s a fun job!

[Diagram: News Articles → ??]


4

Text Mining: Here’s a fun job!

Google News is a computer-generated news site that aggregates headlines from news sources worldwide, groups similar stories together and displays them according to each reader's personalized interests … Google News has no human editors …


5

Text Mining:Text Categorization (Classification)

Probably the most frequently used TM technique. Often employed in applications where there is a flow of dynamic information (emails, news articles, blogs, scientific articles, patents, medical claims, legal data …), requiring automated handling and routing.

[Diagram: News Articles → ? Category]


6

Text Mining:Text Categorization (Classification)

An inductive, supervised machine learning process that classifies or categorizes a given document instance (of unknown classification) into one of a set of predetermined categories.

[Diagram: Docs w/ known classification (training corpora, split into Train / Validate / Test) → Feature Extraction / Learning → Classification Algorithm. Documents w/ unknown classification → Feature Extraction → Classification Algorithm → Predetermined Categories 1, 2, 3, … n]


7

Text Mining:Classification Algorithm

• Naïve Bayes
• Decision Trees
• Nearest Neighbor (k-NN)
• Support Vector Machine
• Neural Nets (e.g. SOM)


8

Text Categorization:An Example

Who is Gary Thuerk?


9


“We invite you to come see the 2020 and hear about the DECSystem-20 family.”

Gary Thuerk, DEC Marketing, 1978

Source: http://www.newyorker.com/reporting/2007/08/06/070806fa_fact_specter#ixzz16zE3E2zO

DECSYSTEM-2020: a bit-slice processor with up to 512 kilowords of solid state RAM



10


Answer: He’s the father of Spam – not the Hormel type but the Email type


11

Spam Detection: Size of the Problem

• 90 trillion – The number of emails sent on the Internet in 2009.
• 247 billion – Average number of email messages per day.
• 1.4 billion – The number of email users worldwide.
• 100 million – New email users since the year before.
• 81% – The percentage of emails that were spam.
• 92% – Peak spam levels late in the year.
• 24% – Increase in spam since last year.
• 200 billion – The number of spam emails per day (assuming 81% are spam).


12

Spam Detection:Size of the Problem

Estimated Annual Costs of Spam in the US (in $billions)

Aspect         Business   Consumers   Total    %
Productivity   25.5       66.8        92.3     85%
IT Costs       5.8        -           5.8      5%
Help Desks     10.8       -           10.8     10%
Total          42.1       66.8        108.9    100%

Source: blog.epostmarks.com/team-blog/2009/3/21/the-true-corporate-and-consumer-costs-of-spam.html


13

Spam Detection:Size of the Problem (Yale Univ.)

FY 2010 (message counts measured in millions)

Quarter   Month    Spam Monthly Total   Spam Daily Avg   Delivered Monthly Total   Delivered Daily Avg   % Spam
1st       Jul-09   61.70                2.01             6.20                      0.20                  90%
1st       Aug-09   61.80                2.00             6.30                      0.20                  90%
1st       Sep-09   62.30                2.08             11.80                     0.39                  81%
2nd       Oct-09   67.10                2.16             11.40                     0.37                  83%
2nd       Nov-09   62.90                2.09             11.00                     0.37                  83%
2nd       Dec-09   44.50                1.40             8.80                      0.28                  80%
3rd       Jan-10   47.90                1.50             10.30                     0.33                  78%
3rd       Feb-10   41.40                1.40             11.50                     0.37                  72%
3rd       Mar-10   43.40                1.40             10.20                     0.33                  76%
4th       Apr-10   43.30                1.40             12.50                     0.42                  71%
4th       May-10   37.80                1.20             8.40                      0.27                  78%
4th       Jun-10   36.30                1.20             6.50                      0.22                  82%

http://www.yale.edu/its/metrics/email/index.html

14

Spam Detection:General Approaches

[Figure: SMTP transfer model with spam detection/filter points marked 1 and 2]

http://upload.wikimedia.org/wikipedia/commons/6/69/SMTP-transfer-model.svg


15

Spam Detection:General Approaches

• Rules
  – Is this email from [email protected]?
• Blacklists & Whitelists
  – Check the subject and body of the message for particular words or phrases
• Problem: Need new rules to handle dynamic data
  – Ways to alter the data (add spaces at random, non-alpha characters, misspellings, composite words, …)
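A minimal sketch of why static rules struggle, assuming a hypothetical two-phrase blacklist (the phrases, function names and sample messages are illustrative and not from the slides). Stripping filler characters recovers one obfuscation, but a misspelling still slips through until someone writes another rule.

```python
import re

# Hypothetical blacklist of phrases; a real filter would carry far more rules.
BLOCKED_PHRASES = ["online casino", "buy pharmaceuticals"]


def is_spam_by_rules(subject, body):
    """Flag a message if any blacklisted phrase appears in subject or body."""
    text = f"{subject} {body}".lower()
    return any(phrase in text for phrase in BLOCKED_PHRASES)


def normalize(text):
    """Undo two simple obfuscations: non-alpha filler and repeated spaces."""
    letters_only = re.sub(r"[^a-z ]", "", text.lower())
    return re.sub(r"\s+", " ", letters_only).strip()


plain = "Make quick money at the online casino"
obfuscated = "Make quick money at the o.n.l.i.n.e c-a-s-i-n-o"
misspelled = "Make quick money at the onlline casinno"

caught_plain = is_spam_by_rules("", plain)                       # rule fires
caught_obfuscated = is_spam_by_rules("", obfuscated)             # rule is evaded
caught_after_cleanup = is_spam_by_rules("", normalize(obfuscated))
caught_misspelled = is_spam_by_rules("", normalize(misspelled))  # still evaded
```

Each new evasion trick needs its own normalization step or rule, which is the maintenance burden the slide points to.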



16

Spam Detection:Problem with Rules


17

Beginning Example:Yale University Spam Management

Blocks messages from known spammers using a service called SpamHaus, a real-time database of IP addresses of verified spam sources.

Content-based, central spam detection using SpamAssassin. Messages scored as spam are moved away from a user’s inbox to the Tagged-Spam folder on the server.

Rules used for tagging spam are conservative. For that reason some spam gets through the first two levels of filtering. End users should train email clients to recognize and manage spam. Mail clients like Eudora or Outlook have built-in spam filters that you can train to filter messages you consider spam.

http://wiki.apache.org/spamassassin/FrontPage


18

Spam Detection: Yale University Spam Management

Microsoft Outlook uses its SmartScreen technology, a machine-learning Bayesian approach that employs a probability-based algorithm to determine whether an email is legitimate or spam.

SpamAssassin is a set of Perl programs that uses the combined score from multiple types of checks, including Bayesian filtering, to determine whether a given message is spam.

http://wiki.apache.org/spamassassin/FrontPage


19

Spam Detection:Genesis of Content-Based Control

“I think it’s possible to stop spam, and that content-based filters are the way to do it. The Achilles’ heel of the spammers is their message. They can circumvent any other barrier you set up. But they have to deliver their message, whatever it is. There is no way they can get around that…

I think we will be able to solve the problem with fairly simple algorithms. In fact, I've found that you can filter present-day spam acceptably well using nothing more than a Bayesian combination of the spam probabilities of individual words. Using a slightly tweaked (as described below) Bayesian filter, we now miss less than 5 per 1000 spams, with 0 false positives.”

Paul Graham, A Plan for Spam, 2002


20

Spam Detection:The Goal

Confusion Matrix

                 Predicted Class
Actual Class     Spam    Not Spam
Spam             TP      FN
Not Spam         FP      TN

Goal: Minimize false positives

FPR = FP / (FP + TN)

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Accuracy = (TP + TN) / N
Error = (FP + FN) / N
F1 = 2 * Recall * Precision / (Recall + Precision)

Where N = TP + FP + FN + TN
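The formulas above can be checked with a few lines of Python. This is a sketch only: the function name and the counts (80, 20, 5, 95) are made up for illustration, and the code does not guard against zero denominators.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute the slide's metrics from the four confusion-matrix cells."""
    n = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "fpr": fp / (fp + tn),
        "precision": precision,
        "recall": recall,
        "accuracy": (tp + tn) / n,
        "error": (fp + fn) / n,
        "f1": 2 * recall * precision / (recall + precision),
    }


# Hypothetical filter: 100 spam and 100 legitimate messages.
metrics = classification_metrics(tp=80, fn=20, fp=5, tn=95)
for name, value in metrics.items():
    print(f"{name:9s} {value:.3f}")
```

With these counts the filter misses a fifth of the spam yet has a false positive rate of only 5%, which is the trade-off a spam filter usually prefers: losing legitimate mail costs more than letting some spam through.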


21

Spam Detection:Naïve Bayesian Classifier

P(H|D) = P(D|H) * P(H) / P(D)

H is the hypothesis and D is the data.

P(H) is the prior probability of H: the probability that H is correct before the data D are seen.

P(D|H) is the conditional probability of seeing the data D given that the hypothesis H is true. This conditional probability is called the likelihood.

P(D) is the marginal probability of D.

P(H|D) is the posterior probability: the probability that the hypothesis is true, given the data and the previous state of belief about the hypothesis.

Thomas Bayes


22


? P(Spam | Message) compared to P(Not Spam | Message)

Training Set

Message                                   Category
Nobody owns the water.                    Not Spam
The quick rabbit jumps fences.            Not Spam
Buy pharmaceuticals now.                  Spam
Make quick money at the online casino.    Spam
The quick brown fox jumps.                Not Spam

P(Spam | Word) = P(S) * P(W1|S) / P(M)
P(Spam | quick) ∝ P(Spam) * P(quick|Spam)
P(Spam | quick) ∝ .4 * .5 = .2

P(Not Spam | Word) = P(NS) * P(W1|NS) / P(M)
P(Not Spam | quick) ∝ P(Not Spam) * P(quick|Not Spam)
P(Not Spam | quick) ∝ .6 * .67 ~ .4

P(M) is the same for both categories, so it is dropped when the two scores are compared. Here .4 > .2, so a message containing only “quick” is scored as Not Spam.


23


? P(Spam | Message) compared to P(Not Spam | Message)

Training Set

Message                                   Category
Nobody owns the water.                    Not Spam
The quick rabbit jumps fences.            Not Spam
Buy pharmaceuticals now.                  Spam
Make quick money at the online casino.    Spam
The quick brown fox jumps.                Not Spam

P(Spam | Words) ∝ P(S) * P(W1|S) * P(W2|S) * ...
P(Spam | quick & money) ∝ P(Spam) * P(quick|Spam) * P(money|Spam)
P(Spam | quick & money) ∝ .4 * .5 * .5 = .1

P(Not Spam | Words) ∝ P(NS) * P(W1|NS) * P(W2|NS) * ...
P(Not Spam | quick & money) ∝ P(Not Spam) * P(quick|Not Spam) * P(money|Not Spam)
P(Not Spam | quick & money) ∝ .6 * .67 * 0 = 0

Here .1 > 0, so a message containing “quick” and “money” is scored as Spam.
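The two worked examples can be reproduced with a short script. This is a sketch under stated assumptions: the helper names are invented here, each word is counted once per message (presence, not frequency), and the scores are the unnormalized products from the slides rather than true posteriors.

```python
TRAINING_SET = [
    ("Nobody owns the water.", "Not Spam"),
    ("The quick rabbit jumps fences.", "Not Spam"),
    ("Buy pharmaceuticals now.", "Spam"),
    ("Make quick money at the online casino.", "Spam"),
    ("The quick brown fox jumps.", "Not Spam"),
]


def words(message):
    """Lower-case the message, drop the final period, return distinct words."""
    return set(message.lower().rstrip(".").split())


def prior(category):
    """P(category): share of training messages with this label."""
    return sum(1 for _, c in TRAINING_SET if c == category) / len(TRAINING_SET)


def likelihood(word, category):
    """P(word | category): share of the category's messages containing word."""
    docs = [words(m) for m, c in TRAINING_SET if c == category]
    return sum(1 for d in docs if word in d) / len(docs)


def score(category, message_words):
    """Unnormalized naive Bayes score: prior times per-word likelihoods."""
    result = prior(category)
    for w in message_words:
        result *= likelihood(w, category)
    return result


for query in (["quick"], ["quick", "money"]):
    spam, ham = score("Spam", query), score("Not Spam", query)
    print(query, f"Spam={spam:.2f}", f"Not Spam={ham:.2f}")
```

Note how a single unseen word ("money" never appears in a Not Spam message) drives the whole Not Spam score to zero. A common remedy, not covered in these slides, is to add a small constant to every count before dividing.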


24

Sentiment Analysis:The Issues and Payoffs

Every hour of every day they share their opinions, issues, thoughts and sentiments about products, brands, services and companies.


25

Sentiment Analysis: Some Survey Data

• Activity
  – 81% of Internet users (or 60% of Americans) have done online research on a product at least once
  – 20% (15% of all Americans) do so on a typical day
  – 32% have provided a rating on a product, service, or person via an online ratings system, and 30% (including 18% of online senior citizens) have posted an online comment or review regarding a product or service.
• Impact
  – Among readers of online reviews of restaurants, hotels, and various services (e.g., travel agencies or doctors), between 73% and 87% report that reviews had a significant influence on their purchase
  – Consumers report being willing to pay from 20% to 99% more for a 5-star-rated item than a 4-star-rated item (the variance stems from what type of item or service is considered)

Pew Internet & American Life Project Report, 2008.


26

Sentiment Analysis: The Issues and Payoff

• This evaluative text data is extremely valuable to customer-facing organizations
  – Marketing -- Inform targeted marketing and help determine which marketing messages resonate with customers
  – Service -- Provide more rapid response to perceived customer issues and determine the steps to take to satisfy customers
  – Products -- Quickly determine whether there are emerging product issues, how to position products and where development dollars should be focused.
• It is also very voluminous – beyond addressing with armies of staff manually sifting through the data


27

Sentiment Analysis:What is it?

• Also called opinion mining or voice of the customer (VOC)
• Involves using text mining to classify subjective opinions in text into categories like “positive” or “negative” and to extract various forms of attitudinal information: sentiment, opinion, mood, and emotion.
• Text analytics techniques are helpful in analyzing sentiment at the entity, concept, or topic level and in distinguishing opinion holder and opinion object.


28

Sentiment Analysis: How do you know if the review is “-” or “+”

plot : two teen couples go to a church party , drink and then drive . they get into an accident . one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . what's the deal ? watch the movie and " sorta " find out . . . critique : a mind-xxx movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn't snag this one correctly . they seem to have taken this pretty neat concept , but executed it terribly . so what are the problems with the movie ? well , its main problem is that it's simply too jumbled .

having not seen , " who framed roger rabbit " in over 10 years , and not remembering much besides that i liked it then , i decided to rent it recently . watching it i was struck by just how brilliant a film it is . aside from the fact that it's a milestone in animation in movies ( it's the first film to combine real actors and cartoon characters , have them interact , and make it convincingly real ) and a great entertainment it's also quite an effective comedy/mystery . while the plot may be somewhat familiar the characters are original , especially baby herman , and watching them together is a lot of fun . …`who framed roger rabbit' is a rare film . one that not only presented a great challenge to the filmmakers but one that can be enjoyed by the whole family ( although some very young viewers may be a little scared by judge doom ) . do yourself a favor and rent it , `p-p-p-p-please . "


29

Sentiment Analysis: Underlying Assumption

• There are opinion words (aka polar words, opinion-bearing words, and sentiment words) used to express state.
  – Positive opinion words are used to express desired states (e.g. beautiful, wonderful, good, and amazing)
  – Negative opinion words are used to express undesired states (e.g. bad, poor, and terrible)
• There are also opinion phrases and idioms (e.g. cost someone an arm and a leg)
• Collectively, they are called the Opinion Lexicon.
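One way to picture the lexicon assumption is a word-counting scorer. The sketch below uses only the seven example words from the slide as its lexicon; the function names and the "+", "-", "neutral" labels are choices made here, and a usable lexicon would hold thousands of entries plus phrases and idioms.

```python
import re

# Opinion lexicon limited to the example words given on the slide.
POSITIVE = {"beautiful", "wonderful", "good", "amazing"}
NEGATIVE = {"bad", "poor", "terrible"}


def lexicon_score(text):
    """Count positive minus negative opinion words in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    return positives - negatives


def label(text):
    """Map the score to '+', '-' or 'neutral'."""
    score = lexicon_score(text)
    if score > 0:
        return "+"
    if score < 0:
        return "-"
    return "neutral"


print(label("a wonderful film with an amazing cast"))
print(label("a very cool idea in a very bad package"))
print(label("the plot is familiar"))
```

The second call shows the limits of a tiny lexicon: "cool" carries positive sentiment but is not in the word list, so only "bad" is counted.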


30

Sentiment Analysis:Types

• Sentiment Classification – document level, classified as positive or negative

• Feature-based opinion – sentence level, determines which aspects of an object people like or dislike

• Comparative sentence and relationship mining – sentence level comparisons of one object against another (to determine which is better than the other)


31

Sentiment Analysis:Which type is best?

• From one type to the next (classification, features, comparisons), it becomes more complex to extract the information needed to perform the analysis.

• However, once extracted, standard text mining techniques can be used to classify and compare the opinions expressed in the documents, statements, sentences, and phrases.

• Simple techniques (like naïve Bayesian) often produce excellent results (e.g. 80+% accuracy)


32

Text Mining and Analytics: Applications

• JetBlue Airways
  – Uses Attensity to analyze the large volume of e-mail messages it receives from customers.
  – By matching specific comments and comment patterns with structured data, airline personnel can solve problems rapidly, before they jeopardize the carrier's satisfaction rating.
• Rosetta Stone
  – Uses IBM SPSS text analytics software to analyze answers to open-ended questions in surveys of current and potential customers.
  – Combines text analysis with other identification information (e.g. products purchased, demographics) to drive decisions on advertising, marketing and product development as well as strategic planning.
• Gaylord Hotels
  – Uses Clarabridge software to make sense of thousands of customer satisfaction surveys gathered each day
  – Spots positive and negative comments that help track trends in customer satisfaction and spot problems -- as well as best practices -- tied to particular properties, departments or employees.


33

Text Mining:Clustering (Setting the Stage)

A common problem: Establishing categories or topic structures for
• Free-form survey data
• Customer complaints/comments, incident reports and warranty claims
• Blogs and discussion forums
• Search results

Common answer: Clustering


34

Text Mining:Clustering (Defined)

The unsupervised, automated grouping of records, observations, or cases into classes of similar objects called clusters.

[Diagram: Document Collection → Clustering Algorithm → Clusters 1, 2, 3, … n. Scatter plot with axes Freq W1 and Freq W2 showing clusters C1, C2 and C3.]

Similarities are stronger within clusters than between them (i.e. distances are shorter).


35

Text Mining:Clustering (Measuring Distance)

In a term-doc matrix, treat the docs as vectors and the terms as variables, and measure the distance/similarity between them.

       D1   D2   D3
T1     2    2    1
T2     1    2    3

[Plot: D1, D2 and D3 as points on T1 and T2 axes]

Euclidean Distance: SQRT(Sum((Xi - Yi)^2))

Squared differences per term (DSij compares Di with Dj):

Term   DS12   DS13   DS23
T1     0      1      1
T2     1      4      1
Sum    1      5      2
SQRT   1.00   2.24   1.41
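The three distances in the table can be recomputed directly. A small sketch, assuming each document is written as a (T1, T2) tuple; the variable names are illustrative.

```python
import math

# Each document as a vector of term counts (T1, T2), taken from the slide.
DOCS = {"D1": (2, 1), "D2": (2, 2), "D3": (1, 3)}


def euclidean(x, y):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


d12 = euclidean(DOCS["D1"], DOCS["D2"])
d13 = euclidean(DOCS["D1"], DOCS["D3"])
d23 = euclidean(DOCS["D2"], DOCS["D3"])
print(f"D1-D2 {d12:.2f}  D1-D3 {d13:.2f}  D2-D3 {d23:.2f}")
```

D1 and D2 come out closest and D1 and D3 farthest apart, matching the SQRT row of the table.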


36

Text Mining:Clustering (Measuring Distance)

• Squared Euclidean: Sum of squared differences
• City Block or Manhattan: Sum of absolute differences
• Minkowski: hth root of the sum of absolute differences raised to the hth power
• Matching Distance: For binary data – number of (mis)matches divided by number of comparisons (like Jaccard Similarity)
• Correlation: 1 – r where r is the correlation coefficient
• Cosine: based on the cosine of the angle between the vectors
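Several of these measures fit in a line or two each. The sketch below is illustrative only: the function names are chosen here, the matching distance counts mismatches, and the sample vectors reuse D1 = (2, 1) and D3 = (1, 3) from the earlier term-doc example.

```python
import math


def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def minkowski(x, y, h):
    """h = 1 gives Manhattan distance, h = 2 gives Euclidean distance."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)


def matching_distance(x, y):
    """For binary vectors: share of positions where the two vectors differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def cosine_similarity(x, y):
    """Cosine of the angle between the vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


d1, d3 = (2, 1), (1, 3)
print(squared_euclidean(d1, d3), manhattan(d1, d3))
print(round(minkowski(d1, d3, 2), 2), round(cosine_similarity(d1, d3), 2))
print(matching_distance((1, 0, 1, 1), (1, 1, 0, 1)))
```

Cosine is often preferred for text because it ignores document length: a title and a full article that use terms in the same proportions point in the same direction.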


37

Text Mining:Clustering Methods

• Hierarchical: Produces a Tree-Like Structure of Clusters (Divisive and Agglomerative)

• Partitioning: Organizes objects into k partitions (k<=n) where each partition is a cluster


38

Text Mining:Clustering Methods

[Diagram: Hierarchical methods (Divisive and Agglomerative, each with its own starting point) contrasted with Partitioning into clusters 1, 2, 3, … K]


39

Text Mining:Clustering (Simple Example)

T1 - The Neatest Guide to Stock Market Investing
T2 - Investing For Dummies, 4th Edition
T3 - The Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns
T4 - The Book of Value Investing
T5 - Value Investing: From Graham to Buffett and Beyond
T6 - Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!
T7 - Investing in Real Estate, 5th Edition
T8 - Stock Investing For Dummies
T9 - Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss

            T1  T2  T3  T4  T5  T6  T7  T8  T9
book         0   0   1   1   0   0   0   0   0
dads         0   0   0   0   0   1   0   0   1
dummies      0   1   0   0   0   0   0   1   0
estate       0   0   0   0   0   0   1   0   1
guide        1   0   0   0   0   1   0   0   0
investing    1   1   1   1   1   1   1   1   1
market       1   0   1   0   0   0   0   0   0
real         0   0   0   0   0   0   1   0   1
rich         0   0   0   0   0   2   0   0   1
stock        1   0   1   0   0   0   0   1   0
value        0   0   0   1   1   0   0   0   0

Focused on (exact) indexed words: each appears in at least 2 titles and is not a stop word.
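A term-doc matrix like this one can be built from the titles in a few steps. The sketch below makes several assumptions that the slide does not state: apostrophes are dropped before tokenizing (so "Dad's" becomes "dads"), the stop list is the small hypothetical set shown, and "edition" is treated as a stop word because the slide's index leaves it out even though it occurs in two titles.

```python
import re
from collections import Counter

TITLES = {
    "T1": "The Neatest Guide to Stock Market Investing",
    "T2": "Investing For Dummies, 4th Edition",
    "T3": "The Book of Common Sense Investing: The Only Way to Guarantee "
          "Your Fair Share of Stock Market Returns",
    "T4": "The Book of Value Investing",
    "T5": "Value Investing: From Graham to Buffett and Beyond",
    "T6": "Rich Dad's Guide to Investing: What the Rich Invest in, "
          "That the Poor and the Middle Class Do Not!",
    "T7": "Investing in Real Estate, 5th Edition",
    "T8": "Stock Investing For Dummies",
    "T9": "Rich Dad's Advisors: The ABC's of Real Estate Investing: "
          "The Secrets of Finding Hidden Profits Most Investors Miss",
}

# Assumed stop list; "edition" is included only to match the slide's index.
STOP_WORDS = {"the", "to", "for", "of", "and", "in", "edition"}


def tokenize(title):
    """Lower-case, drop apostrophes, keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", title.lower().replace("'", ""))


counts = {doc: Counter(tokenize(title)) for doc, title in TITLES.items()}

# Document frequency: in how many titles does each term occur?
doc_freq = Counter(term for c in counts.values() for term in c)

vocab = sorted(t for t, df in doc_freq.items()
               if df >= 2 and t not in STOP_WORDS)
matrix = {term: [counts[doc][term] for doc in TITLES] for term in vocab}

for term in vocab:
    print(f"{term:10s}", *matrix[term])
```

Matching is exact, so "invest" and "investors" do not count toward "investing"; a stemmer would merge them.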


40

Text Mining:Clustering Method - Hierarchical

A. Calculate distances between docs

B. Select 2 closest docs and put them into a cluster

C. Now determine closest doc among the remaining individual docs and existing clusters [utilizing either single (nearest), complete (farthest) or average linkage]

D. Repeat process until a single cluster is formed

      T1    T2    T3    T4    T5    T6    T7    T8    T9
T1   0.00
T2   2.45  0.00
T3   3.31  3.60  0.00
T4   2.45  3.32  3.31  0.00
T5   3.00  2.65  4.00  2.24  0.00
T6   3.61  3.61  4.69  3.60  4.00  0.00
T7   2.45  2.00  3.61  2.00  2.65  3.60  0.00
T8   2.00  1.41  3.31  2.00  2.64  3.61  2.00  0.00
T9   3.87  3.61  4.69  3.61  4.00  4.24  3.00  3.61  0.00

[Figure: Level Plot]
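Steps A through D can be sketched in plain Python using the distance matrix from the slide as input. This is an illustration, not the tool used to produce the slide's plots: the function names are invented, ties are broken by cluster position, and the merge order may differ from the dendrogram on the next slide.

```python
LABELS = ["T1", "T2", "T3", "T4", "T5", "T6", "T7", "T8", "T9"]

# Lower triangle of the slide's distance matrix (row i holds d(i, 0..i)).
LOWER = [
    [0.00],
    [2.45, 0.00],
    [3.31, 3.60, 0.00],
    [2.45, 3.32, 3.31, 0.00],
    [3.00, 2.65, 4.00, 2.24, 0.00],
    [3.61, 3.61, 4.69, 3.60, 4.00, 0.00],
    [2.45, 2.00, 3.61, 2.00, 2.65, 3.60, 0.00],
    [2.00, 1.41, 3.31, 2.00, 2.64, 3.61, 2.00, 0.00],
    [3.87, 3.61, 4.69, 3.61, 4.00, 4.24, 3.00, 3.61, 0.00],
]


def dist(i, j):
    """Step A: look up the distance between documents i and j."""
    return LOWER[max(i, j)][min(i, j)]


def linkage_distance(a, b, method):
    """Distance between clusters: single (nearest), complete (farthest), average."""
    pairwise = [dist(i, j) for i in a for j in b]
    if method == "single":
        return min(pairwise)
    if method == "complete":
        return max(pairwise)
    return sum(pairwise) / len(pairwise)


def agglomerate(method="single"):
    """Steps B-D: repeatedly merge the two closest clusters until one remains."""
    clusters = [(i,) for i in range(len(LABELS))]
    merges = []
    while len(clusters) > 1:
        d, ia, ib = min(
            (linkage_distance(a, b, method), ia, ib)
            for ia, a in enumerate(clusters)
            for ib, b in enumerate(clusters)
            if ia < ib
        )
        merged = clusters[ia] + clusters[ib]
        merges.append(([LABELS[i] for i in merged], d))
        clusters = [c for k, c in enumerate(clusters) if k not in (ia, ib)]
        clusters.append(merged)
    return merges


merges = agglomerate("single")
for members, d in merges:
    print(f"{d:.2f}", members)
```

The first merge joins T2 and T8, the two "For Dummies" titles, at distance 1.41, the smallest entry in the matrix.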


Text Mining:Clustering Method - Hierarchical

41


42

Text Mining: Clustering Method – K-Means

• Determine the number of clusters “k” <= n
• Randomly assign k docs to be the initial cluster center locations (centroids)
• Repeat until termination
  – For each doc, calculate the (Euclidean) distance from the center locations and assign it to the cluster with the nearest center.
  – For every cluster, recompute the centroid based on current members
  – Check for termination – minimal or no changes in doc assignments
• Return the list of clusters
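The loop above can be written out in plain Python. A sketch under assumptions: the document vectors are term counts for the nine book titles over the index words book, dads, dummies, estate, guide, investing, market, real, rich, stock, value; the function name, the seed and the iteration cap are choices made here. Because the starting centroids are random, the clusters found depend on the seed and need not match any particular published result.

```python
import math
import random

# Term counts per title, in the order:
# book, dads, dummies, estate, guide, investing, market, real, rich, stock, value
DOCS = {
    "T1": [0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0],
    "T2": [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0],
    "T3": [1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0],
    "T4": [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1],
    "T5": [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1],
    "T6": [0, 1, 0, 0, 1, 1, 0, 0, 2, 0, 0],
    "T7": [0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0],
    "T8": [0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0],
    "T9": [0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0],
}


def kmeans(vectors, k, seed=0, max_iter=100):
    """Return {doc name: cluster index} after running k-means."""
    rng = random.Random(seed)
    names = list(vectors)
    # Randomly pick k docs as the initial centroids.
    centroids = [list(vectors[n]) for n in rng.sample(names, k)]
    assignment = {}
    for _ in range(max_iter):
        # Assign every doc to the cluster with the nearest centroid.
        new_assignment = {
            n: min(range(k), key=lambda c: math.dist(vectors[n], centroids[c]))
            for n in names
        }
        # Terminate when no doc changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Recompute each centroid as the mean of its current members.
        for c in range(k):
            members = [vectors[n] for n in names if assignment[n] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


assignment = kmeans(DOCS, k=3)
for cluster in sorted(set(assignment.values())):
    print(cluster, [n for n in DOCS if assignment[n] == cluster])
```

Running with several seeds and keeping the solution with the tightest clusters is a common way to reduce the effect of the random start.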


43

Text Mining:Clustering (K-Means Example)

Cluster 1: T1, T3
T1 - The Neatest Guide to Stock Market Investing
T3 - The Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns

Cluster 2: T6, T7, T9
T6 - Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!
T7 - Investing in Real Estate, 5th Edition
T9 - Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss

Cluster 3: T2, T4, T5, T8
T2 - Investing For Dummies, 4th Edition
T4 - The Book of Value Investing
T5 - Value Investing: From Graham to Buffett and Beyond
T8 - Stock Investing For Dummies


44

Text Mining:Clustering (RSS Feeds Example)


45



46



47


• http://feeds.reuters.com/reuters/entertainment
• http://feeds.reuters.com/reuters/technologyNews
• http://feeds.foxnews.com/foxnews/scitech
• http://feeds.foxnews.com/foxnews/entertainment
• http://rss.cnn.com/rss/cnn_showbiz.rss
• http://rss.cnn.com/rss/cnn_tech.rss


48

Text Mining:Clustering Example (RSS Newsfeeds)

RSS Feed-Stem Matrix

Columns (stems), in order: show, govern, list, video, new, box, mobil, univers, call, compani, three, …, rule, time

Stories                                                           Counts
James Franco surprised to be Oscar co-host                        0 0 1 0 0 0 0 0 0 0 0 0 0
Oscar contenders jockey for position                              0 0 0 0 0 0 0 0 0 0 0 0 0
Tron leads weak pack of newcomers at box office                   0 0 0 0 1 2 0 0 0 0 1 0 0
Rockers hit right note with MTV for 2011                          0 0 0 0 0 0 0 0 0 0 0 0 0
Underground actors seek last act for Belarus leader               0 0 0 0 0 0 0 0 0 0 0 0 0
Iran sentences director Jafar Panahi to 6 years in jail: report   0 0 0 0 0 0 0 0 0 0 0 0 0
Inception, Johnny Depp hits with IMDB.com users                   0 0 0 0 0 0 0 0 0 0 0 0 0
Silent Night most recorded UK Christmas song                      0 0 1 0 0 0 0 0 0 1 0 0 1
Michael Bolton sings for the saints in Assisi                     0 0 0 0 0 0 0 0 0 0 0 0 0
Pete Doherty takes dance lessons for French film                  0 0 0 0 1 0 0 0 0 0 0 0 0
Google to delay launch of TV sets: report                         0 0 0 0 0 0 0 0 0 2 0 0 0
EBay to buy German online club in European push                   0 0 0 0 0 0 0 0 0 1 0 0 0
ATT to buy Qualcomm's spectrum licenses for $1.93 billion         0 0 0 0 0 0 3 0 0 0 0 0 0
Wealth managers use technology to fight rising costs              0 0 0 0 0 0 0 0 0 1 0 0 0
…
Larry King ends his record-setting run                            1 0 0 0 0 0 0 0 0 0 0 0 1
2010 obsessed with TV gore                                        0 0 0 0 1 0 0 0 0 0 0 0 1
Report: Google TV devices delayed                                 0 0 0 0 1 0 0 0 0 0 0 0 2
5 Web titans that withered under Yahoo                            0 0 0 0 0 0 0 0 0 0 0 0 0
'Tron' shines light on arcade classic                             0 0 0 0 0 0 0 0 0 0 0 0 0
Tech-toy gifts for kids (young and old)                           0 0 0 0 0 0 0 0 0 0 0 0 0
Tech-toy gifts for kids (young and old)                           0 0 0 0 0 0 0 0 0 0 0 0 0


49

Text Mining:RSS Newsfeeds Dendrogram


50

Text Mining:RSS Newsfeeds K-Means Clusters

Cluster1 Cluster2 Cluster3 Cluster4 Cluster5 Cluster6 Cluster7 Cluster8 Cluster9 Cluster101 Ent2 Ent27 Ent30 Ent11 Ent16 Ent19 Ent14 Ent17 Ent12 Ent12 Ent25 Ent28 Tech1 Ent23 Ent21 Ent35 Ent15 Ent33 Ent13 Ent103 Ent29 Ent34 Tech15 Ent3 Tech20 Tech25 Ent22 Ent41 Tech17 Ent184 Ent37 Ent36 Tech2 Ent32 Tech22 Tech31 Ent26 Tech10 Ent205 Ent39 Ent6 Tech21 Ent38 Tech30 Ent5 Tech11 Ent246 Tech5 Tech16 Tech3 Ent42 Tech9 Tech12 Tech14 Ent317 Tech6 Tech26 Tech7 Tech18 Tech19 Ent48 Tech32 Tech23 Tech28 Ent79 Tech4 Tech24 Ent9

10 Tech27 Tech1311 Tech2912 Tech8


51

Text Mining:Clustering Process

Many people imagine that it will produce neatly separated clusters like those that (appear in relatively simple examples), but it almost never does. Such ideal clusters are rarely encountered in real data, so we often need to modify our objective from “find the natural clusters in the data” to “organize the cases into groups that are similar in some way.”

Cook and Swayne, Interactive and Dynamic Graphics for Data Analysis


52

Text Mining: Real World Clustering Example

• “Text Mining Warranty and Call Center Data: Early Warning for Product Quality Awareness” (Wallace & Cermack, SUGI29, 2004)
• Goal: Develop a system that would enable an early warning, alerting system for product quality problems (for American Honda Motors)
• Problem – most of the information is in text documents
  – Warranty: when dealers complete warranty service claims, a comment field is available to further describe the problem.
  – Customer Relations: the call center logs parts of conversations and written communications with customers.
  – Techline: calls from dealer service technicians to specialized mechanics create more text data.


53

Text Mining:Real World Clustering Example


54

Text Mining:Real World Clustering Example

Alerts
• Changes in cluster size
• Appearance of new words
• Changes in shape


55

Text Mining: Real World Clustering Example

• Integrated warranty business rules.
• Emerging issues.
• Drill-to from emerging issues.
• Drill on multiple points.
• Analyze by alert.
• Ad hoc analysis.
• Advanced warranty analysis.

SAS Warranty Analysis 4.2


56

Text Mining:Another Clustering Example


57


http://search.carrot2.org/stable/search?query=salsa


58



59

Text Mining: Information Extraction (Goals)

• Type of IR
  – Goal is to automatically extract structured information (e.g. entities, concepts and topics) from unstructured text that is contextually and semantically well-defined, usually from a well-defined domain (sometimes called content analysis)
• Named-Entity Recognition
  – Subtask of IE, also known as entity identification and entity extraction
  – Seeks to locate and classify atomic elements in text into predefined categories (e.g. names of persons, organizations, locations, dates, quantities, monetary values, percentages and so on)
  – The end goal is usually to fill in templates codifying the extracted information (e.g. entity relationship structures <entity><rel><entity>)


60

Information Extraction:Common Uses

• Competitive Intelligence
• Counter-Terrorism & Criminal Intelligence
• Resume Harvesting
• Patent Search
• Scientific Literature Search (biology & medicine)
• Email Scanning


Information Extraction:Named Entity Recognition

61


Text Mining: Information Extraction (Process)

[Diagram: (1) Linguistic Processing → (2) Information Extraction]

62


63

Information Extraction: Process (Part-of-Speech Tagging)

• Part-of-speech tagging is the process of converting a sentence, in the form of a list of words, into a list of tuples, where each tuple is of the form (word, tag).
• The tag is a part-of-speech tag and signifies whether the word is a noun, adjective, verb, and so on.
• Variety of tagging strategies, most of which are “trainable.”

Number  Tag    Count   Description
1       CC     2265    Coordinating conjunction
2       CD     3546    Cardinal number
3       DT     8165    Determiner
4       EX     88      Existential there
5       FW     4       Foreign word
6       IN     9857    Preposition or subordinating conjunction
7       JJ     5834    Adjective
8       JJR    381     Adjective, comparative
9       JJS    182     Adjective, superlative
10      LS     13      List item marker
11      MD     927     Modal
12      NN     13166   Noun, singular or mass
13      NNS    244     Noun, plural
14      NNP    9410    Proper noun, singular
15      NNPS   6047    Proper noun, plural
16      PDT    27      Predeterminer
17      POS    824     Possessive ending
18      PRP    1716    Personal pronoun
19      PRP$   766     Possessive pronoun
20      RB     2822    Adverb
21      RBR    136     Adverb, comparative
22      RBS    35      Adverb, superlative
23      RP     216     Particle
24      SYM    1       Symbol
25      TO     2179    to
26      UH     3       Interjection
27      VB     2554    Verb, base form
28      VBD    3043    Verb, past tense
29      VBG    1460    Verb, gerund or present participle
30      VBN    2134    Verb, past participle
31      VBP    1321    Verb, non-3rd person singular present
32      VBZ    2125    Verb, 3rd person singular present
33      WDT    445     Wh-determiner
34      WP     241     Wh-pronoun
35      WP$    14      Possessive wh-pronoun
36      WRB    178     Wh-adverb


64

Information Extraction:Process (Part-of-Speech Tagging)

• The pilot had to bank the plane because it was headed right for the downtown branch bank which was located next to the river bank.
• Taggers (examples)
  – Training for N-Gram Taggers (sequences of N words): Trigram, Bigram, Unigram
  – Employs training and test sets like other classification systems
  – Utilizes various classification algorithms for training then actual classification
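The “bank” sentence shows why context matters. The toy sketch below trains a unigram and a bigram tagger on three hand-tagged sentences invented for this illustration; it is not a toolkit's tagger. The bigram model here conditions on the previous tag plus the current word and backs off to the unigram model, which is one common formulation and an assumption on my part.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged corpus (hypothetical, for illustration only).
TRAIN = [
    [("the", "DT"), ("pilot", "NN"), ("had", "VBD"), ("to", "TO"),
     ("bank", "VB"), ("the", "DT"), ("plane", "NN")],
    [("the", "DT"), ("river", "NN"), ("bank", "NN"), ("flooded", "VBD")],
    [("the", "DT"), ("branch", "NN"), ("bank", "NN"), ("closed", "VBD")],
]


def train(sentences):
    """Count tags per word (unigram) and per (previous tag, word) pair (bigram)."""
    unigram = defaultdict(Counter)
    bigram = defaultdict(Counter)
    for sentence in sentences:
        prev = "<s>"  # sentence-start marker
        for word, tag in sentence:
            unigram[word][tag] += 1
            bigram[(prev, word)][tag] += 1
            prev = tag
    return unigram, bigram


def tag(words, unigram, bigram=None, default="NN"):
    """Tag each word; use the bigram context if seen, else back off."""
    tagged, prev = [], "<s>"
    for word in words:
        if bigram is not None and (prev, word) in bigram:
            best = bigram[(prev, word)].most_common(1)[0][0]
        elif word in unigram:
            best = unigram[word].most_common(1)[0][0]
        else:
            best = default  # unseen word: guess the default tag
        tagged.append((word, best))
        prev = best
    return tagged


unigram, bigram = train(TRAIN)
sentence = "the pilot had to bank the plane".split()
unigram_result = tag(sentence, unigram)
bigram_result = tag(sentence, unigram, bigram)
print(unigram_result[4], bigram_result[4])
```

The unigram tagger labels "bank" as a noun because that is its most frequent tag in the training data; the bigram tagger sees that it follows "to" and picks the verb reading.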


65

Information Extraction:Process (Part-of-Speech Tagging)

• Sample sentence:
  – CVS Caremark Corporation agreed to buy the Medicare Part D unit of Universal American Financial Corporation for about $1.25 billion.
• Tagged sentence:
  – [('CVS', 'NNP'), ('Caremark', 'NNP'), ('Corporation', 'NNP'), ('agreed', 'VBD'), ('to', 'TO'), ('buy', 'VB'), ('the', 'DT'), ('Medicare', 'NNP'), ('Part', 'NNP'), ('D', 'NNP'), ('unit', 'NN'), ('of', 'IN'), ('Universal', 'NNP'), ('American', 'NNP'), ('Financial', 'NNP'), ('Corporation', 'NNP'), ('for', 'IN'), ('about', 'IN'), ('$', '$'), ('1.25', 'CD'), ('billion', 'CD')]


66

Information Extraction:Process (Entity Recognition)

• Chunking
  – Basic technique which segments and labels multi-token sequences
  – Sequences are non-overlapping
  – Usually employs a combination of a “templated” grammar couched as regular expressions along with tagger & classification processes to do the segmenting
• Simple Example – NP Chunker
  – grammar = "NP:{<DT>?<JJ.*>*<NN.*>+}"


67

Information Extraction: Process (Entity Recognition)

(S
  (NP CVS/NNP Caremark/NNP Corporation/NNP)
  agreed/VBD
  to/TO
  buy/VB
  (NP the/DT Medicare/NNP Part/NNP D/NNP unit/NN)
  of/IN
  (NP Universal/NNP American/NNP Financial/NNP Corporation/NNP)
  for/IN
  about/IN
  $/$
  1.25/CD
  billion/CD)
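To show what the NP pattern (an optional determiner, any adjectives, then one or more nouns) does to the tagged sentence, here is a hand-rolled sketch in plain Python. It imitates the pattern with a greedy left-to-right scan; it is not the toolkit's chunk parser, and the function name is invented.

```python
TAGGED = [
    ("CVS", "NNP"), ("Caremark", "NNP"), ("Corporation", "NNP"),
    ("agreed", "VBD"), ("to", "TO"), ("buy", "VB"), ("the", "DT"),
    ("Medicare", "NNP"), ("Part", "NNP"), ("D", "NNP"), ("unit", "NN"),
    ("of", "IN"), ("Universal", "NNP"), ("American", "NNP"),
    ("Financial", "NNP"), ("Corporation", "NNP"), ("for", "IN"),
    ("about", "IN"), ("$", "$"), ("1.25", "CD"), ("billion", "CD"),
]


def np_chunk(tagged):
    """Return noun-phrase chunks matching <DT>? <JJ.*>* <NN.*>+ as word lists."""
    chunks, i, n = [], 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":                          # optional determiner
            j += 1
        while j < n and tagged[j][1].startswith("JJ"):    # any adjectives
            j += 1
        k = j
        while k < n and tagged[k][1].startswith("NN"):    # one or more nouns
            k += 1
        if k > j:                                         # at least one noun matched
            chunks.append([word for word, _ in tagged[i:k]])
            i = k                                         # chunks do not overlap
        else:
            i += 1
    return chunks


chunks = np_chunk(TAGGED)
for chunk in chunks:
    print("NP:", " ".join(chunk))
```

On this sentence the scan finds the same three noun phrases as the bracketed tree above. Chunking only groups tokens; deciding that a chunk names an organization or a person is the separate entity-recognition step.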


68


• Named Entity Recognition – Identify all textual mentions of the named entities

• Hard to rely on precompiled lists of names, locations, … especially in dynamically changing domains

• A starting point is provided by the “named” entity chunkers found in toolkits like NLTK


69


• Example of Entity Recognition
  – Tree('S', [Tree('ORGANIZATION', [('CVS', 'NNP')]), Tree('PERSON', [('Caremark', 'NNP'), ('Corporation', 'NNP')]), ('agreed', 'VBD'), ('to', 'TO'), ('buy', 'VB'), ('the', 'DT'), Tree('ORGANIZATION', [('Medicare', 'NNP'), ('Part', 'NNP')]), ('D', 'NNP'), ('unit', 'NN'), ('of', 'IN'), Tree('ORGANIZATION', [('Universal', 'NNP'), ('American', 'NNP')]), ('Financial', 'NNP'), ('Corporation', 'NNP'), ('for', 'IN'), ('about', 'IN'), ('$', '$'), ('1.25', 'CD'), ('billion', 'CD')])


70

Information Extraction:Sample System (Xanalys)


71

Text Mining & Analysis:Tools

General Purpose Info Extract ClusteringBus. Objects InXight Attensity MegaputerConnexor Tech. Basis Tech.GATE Eaagle FTM CRMIBM Infosphere Warehouse Expert System ClarabridgeIBM SPSS PASW Text Analytics IBM Content Analytics EnkataNLTK (Python) Intellexer SysomosPerl Lexalytics OvertoneSAS Text Miner LeximancerWordStat Linguamatics SearchClassification & Clustering Megaputer ISYS SearchAlceste NsteinAiaioo Labs QuenzaIxReveal TemisLextek VisualTextRecommind XANALYSTexifter

Orange (Python Lib)Packages cran.R-project.orgRattleWEKA

Italics - Open Source or Free Text Tools

digitalresearchtools.pbworks.com/

kdnuggets.com/software/text.html


72

Text Mining and Analysis: Lessons Learned

• There are practical applications in business, scientific and government arenas with substantial payback
• Text can be analyzed with many of the same analytical (data mining) techniques applied to structured data, although the text must first be transformed into structured data for this to occur.
• Many practical applications of text analysis and mining rest on treating documents as a “bag of words” and on using simpler rather than more complex mining techniques. These simpler techniques often have the same payoffs as more complex techniques.
