PRACTICAL TEXT MINING WITH SQL USING RELATIONAL DATABASES Ralph Winters Data Architect, Actuarial Business Intelligence EmblemHealth June 5th, 2013 11th Annual Text and Social Analytics Summit Cambridge, MA


DESCRIPTION

Presentation at the 11th Annual Text and Social Analytics Summit, Cambridge, MA. Integrate unstructured data within a relational database: learn the feasibility, prototyping, added value, and goals of text analytics. Understand how much data you have and the architecture needed to leverage existing technology alongside your existing relational structure (Oracle, SAS, SQL Server, DB2, PostgreSQL, MySQL, and others). Learn how to use sentiment analysis to determine propensity to churn. A best-practice discussion of statistical techniques, clustering, and association.


Page 1: Practical Text Mining with SQL using Relational Databases

PRACTICAL TEXT MINING WITH SQL USING RELATIONAL DATABASES

Ralph Winters

Data Architect, Actuarial Business Intelligence

EmblemHealth

June 5th, 2013

11th Annual Text and Social Analytics Summit, Cambridge, MA

Page 2: Practical Text Mining with SQL using Relational Databases

RDBMS TODAY

Gartner: clients tell us that combining scored, processed ‘outside data’ with data inside our relational databases is where all the added value is.

IDC: Relational database management systems are expected to nearly double in market growth by 2016, driven by business intelligence demands and expanded adoption to tackle big data and unstructured information streams.

The relational database management systems (RDBMS) market continues to confound the skeptics by maintaining strong growth characteristics despite the belief by some that the market has become ‘saturated’ or that it will be weakened by newer Big Data technologies.

Inmon: Listen carefully to the “big data” vendors and this is what you hear: “Let’s get rid of relational.” It is like courtiers in the castle whispering, “The king must die.” What’s going on here?

Page 3: Practical Text Mining with SQL using Relational Databases

Why a relational Database?

Marry Structured + Unstructured Data

More suitable for statistical analytics (matrices)

Leverage existing familiar widespread technology

Improving of predictive Models

Referential Integrity

Integrated Text/Data Mining

Page 4: Practical Text Mining with SQL using Relational Databases

Feasibility

What do I need to know?

Costs

Benefits/Risks

Industries

Adding Value?

Page 5: Practical Text Mining with SQL using Relational Databases

RDBMS Internal/External Connections

RDBMS

File Interfaces (XML, CSV)

ODBC/JDBC/DBI

Text Vendor supplied Connector

Hadoop Connectors (SAS, Oracle)

Open Source Text Mining Tools (R, Java, Perl, LingPipe)

In-Database Text Mining Algorithms (Oracle Text, SAS Text Miner, SQL Server Text Mining)

Page 6: Practical Text Mining with SQL using Relational Databases

ANGRY Customer Comments

Short Tailed

Sampling

Not for Long Tailed Data

Comment - KardCo Premier Credit Card Promo Scam . I recently received an KardCo promo promising 25,000 bonus points if you sign up for the KardCo Premier Card and spend $2000 in the 1st three months. and so i call in and apply ...got APPROVED...two weeks later .. Posting on your site DEFINITELY HELPED (it was pointed out by retailer), and sped up response after 6 weeks of mulling around BEFORE we posted our complaint. $100 restaurant certificates 15 days ago I opened a cc w/ KardCo. I thought I did my research on which company is the best, boy was I wrong. I go to use my card for the 1st time lastnight & its declined. Ok.... I call KardCo from the store and I'm placed on hold for 20 mins. Finally I speak to an awful women who tells me my debt to income ratio is too high and I have too many inquires. I pull my credit report once I get home I pull the one from when I opened the card and the most recent one. My revolving debt $100, my credit score increased from 738 to 740 and 96% of my credit is currently available.... 1-800 Customer Service NOT LOCATED IN US! 2 years in a row they don't send me my rewards check

Page 7: Practical Text Mining with SQL using Relational Databases

Full Text Search

Built into many RDBMSs

Needs Indexing

Can be Slow

Necessary in some Applications

Complements Categorization

Oracle:

SELECT SCORE(1), comment, issue_date
FROM custdb
WHERE CONTAINS(text, 'APR', 1) > 0
  AND issue_date >= TO_DATE('01-OCT-97', 'DD-MON-RR')
ORDER BY SCORE(1) DESC;

Operators: Like, Contains, Regex, Sounds Like, Distance Measures
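To make the difference between two of these operators concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than Oracle Text. The table contents are made up for illustration, and the `regexp` helper is an assumption of this sketch: SQLite only evaluates `X REGEXP Y` when the application registers a function with that name.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")

def regexp(pattern, text):
    """Return 1 when the pattern matches anywhere in text (case-insensitive)."""
    if text is None:
        return 0
    return int(re.search(pattern, text, re.IGNORECASE) is not None)

# SQLite rewrites "text REGEXP pattern" into a call to regexp(pattern, text).
conn.create_function("regexp", 2, regexp)

conn.execute("CREATE TABLE custdb (id INTEGER PRIMARY KEY, text TEXT)")
conn.executemany(
    "INSERT INTO custdb (id, text) VALUES (?, ?)",
    [
        (1, "The APR on my card went up without notice"),
        (2, "I applied in April and was approved"),
        (3, "Great customer service"),
    ],
)

# LIKE is a substring match, so it also picks up "April".
like_ids = [row[0] for row in conn.execute(
    "SELECT id FROM custdb WHERE text LIKE '%apr%' ORDER BY id")]

# A word-boundary regex keeps only the standalone term "APR".
regex_ids = [row[0] for row in conn.execute(
    "SELECT id FROM custdb WHERE text REGEXP ? ORDER BY id", (r"\bapr\b",))]

print(like_ids, regex_ids)
```

Neither operator uses a text index here, which is one reason such searches can be slow on large comment tables.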

Page 8: Practical Text Mining with SQL using Relational Databases

Map Unstructured-to-Structured

Parse Terms from Each Row

Remove StopWords

Cross Reference Document ID & Term Numbers

Output New “Structured” Table

Wide layout (“Wasted Space”):

Doc  Term1  Term2   Term3        Term4
1    The    Best    Customer     Service
2    Is     Highly  Recommended

New “Structured” table:

Term         Doc
Best         1
Customer     1
Service      1
Highly       2
Recommended  2
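The mapping on this slide can be sketched as follows, using Python's sqlite3 module. The stop-word list, the tokenizing regex, and the column name `pos` (the term index) are assumptions made for this illustration; the two documents are the ones shown on the slide.

```python
import re
import sqlite3

# Illustrative stop-word list (an assumption, not the presenter's list).
STOPWORDS = {"the", "is", "a", "an", "and", "of", "to"}

docs = [(1, "The Best Customer Service"), (2, "Is Highly Recommended")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (id INTEGER, pos INTEGER, word TEXT)")

for doc_id, text in docs:
    # Parse terms from each row and drop stop words.
    tokens = [t for t in re.findall(r"[a-z0-9']+", text.lower())
              if t not in STOPWORDS]
    # One row per (document ID, term index, term).
    conn.executemany(
        "INSERT INTO words (id, pos, word) VALUES (?, ?, ?)",
        [(doc_id, pos, tok) for pos, tok in enumerate(tokens, start=1)],
    )

rows = conn.execute("SELECT word, id FROM words ORDER BY id, pos").fetchall()
for word, doc_id in rows:
    print(word, doc_id)
```

The long layout stores only terms that actually occur, which avoids the empty cells of the wide layout.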

Page 9: Practical Text Mining with SQL using Relational Databases

Extended SQL

User Defined Functions

Stored Process

Many Methods to Pivot Data

select regexp_split_to_table(lower(line), '\s+') as word
from customer_comments;

Page 10: Practical Text Mining with SQL using Relational Databases

“Words” Table

One Row for each term in Doc.

Columns: “Document ID”, Verbatim Term, Term Index Number, Term Index +1, Term Index -1

Must handle Negation!
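One possible way to handle negation with such a table is to look at the term in the previous position and flag the current term when that neighbor is a negator. The sketch below uses Python's sqlite3 module; the column names, the negator list, and the sample comment are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (id INTEGER, pos INTEGER, word TEXT)")

comment = "service was not good"  # made-up example comment
conn.executemany(
    "INSERT INTO words (id, pos, word) VALUES (1, ?, ?)",
    list(enumerate(comment.split(), start=1)),
)

# Self-join each term to the term one position earlier in the same document.
# A term is flagged as negated when that previous term is a negator.
rows = conn.execute("""
    SELECT w.word,
           w.pos - 1 AS pos_prev,
           w.pos + 1 AS pos_next,
           CASE WHEN p.word IN ('not', 'never', 'no') THEN 1 ELSE 0 END AS negated
    FROM words w
    LEFT JOIN words p ON p.id = w.id AND p.pos = w.pos - 1
    ORDER BY w.id, w.pos
""").fetchall()

negated = {word: flag for word, _prev, _next, flag in rows}
print(negated)
```

A one-word window is the simplest choice; longer windows or scope rules would catch phrases such as "not very good".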

Page 11: Practical Text Mining with SQL using Relational Databases

Term document matrix

Harder to do analysis in SQL

Wasted Space

Weight Terms

Discard Terms

Page 12: Practical Text Mining with SQL using Relational Databases

Term Weighting in SQL

Calculate IDF

• Log(Number of Docs / Number of Docs which contain term)

Calculate Term Freq

• Number of times Term occurs in document

Calculate tfidf

• Multiply IDF * TF

• Sort by High values

• Select Top N features

create table num_docs as
select count(distinct id) as value
from WORDS;

create table doc_freq as
select word, count(distinct id) as value
from WORDS
group by word
order by value;

create table idf as
select word,
       num_docs.value as numdocs,
       doc_freq.value as docfreq,
       log10(num_docs.value / doc_freq.value) as idf
from doc_freq, WORK.num_docs
order by idf;

Words Table -> Top N Words -> Pivot on Rows
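The weighting steps on this slide can be sketched end to end with Python's sqlite3 module. This is an illustration under stated assumptions, not the presenter's SAS code: the sample rows are made up, the `tfidf` table is an addition of this sketch, and `log10` is registered from Python because not every SQLite build ships math functions.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.create_function("log10", 1, math.log10)

conn.execute("CREATE TABLE words (id INTEGER, word TEXT)")
conn.executemany(
    "INSERT INTO words (id, word) VALUES (?, ?)",
    [(1, "gift"), (1, "card"), (1, "card"), (2, "card"), (2, "fee")],
)

conn.executescript("""
    -- Number of documents
    CREATE TABLE num_docs AS
        SELECT COUNT(DISTINCT id) AS value FROM words;

    -- Number of documents that contain each term
    CREATE TABLE doc_freq AS
        SELECT word, COUNT(DISTINCT id) AS value FROM words GROUP BY word;

    -- IDF = log10(N / df); 1.0 forces real division in SQLite
    CREATE TABLE idf AS
        SELECT d.word, log10(1.0 * n.value / d.value) AS idf
        FROM doc_freq d CROSS JOIN num_docs n;

    -- TF = occurrences of the term in the document; TF-IDF = TF * IDF
    CREATE TABLE tfidf AS
        SELECT w.id, w.word, COUNT(*) AS tf, COUNT(*) * i.idf AS tfidf
        FROM words w JOIN idf i ON i.word = w.word
        GROUP BY w.id, w.word, i.idf;
""")

# Sort by high values and keep the top N features (N = 2 here).
top = conn.execute(
    "SELECT id, word, tfidf FROM tfidf ORDER BY tfidf DESC, id, word LIMIT 2"
).fetchall()
print(top)
```

In this toy data "card" appears in every document, so its IDF is zero and it drops out of the top features.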

Page 13: Practical Text Mining with SQL using Relational Databases

Top N Weighted Words Matrix – Ranked by Highest TF-IDF

Page 14: Practical Text Mining with SQL using Relational Databases

Generating Bigrams

select a.ID,
       (compress(a.word) || ' ' || compress(b.word)) as pair
from words a, words b
where a.ID = b.ID and (a.no = b.no_prev)
order by pair;
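The same self-join idea can be tried in Python's sqlite3 module. This is a sketch under assumptions: the column `pos` stands in for the term index, the join uses `b.pos = a.pos + 1` instead of a stored previous-index column, and the two comments are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (id INTEGER, pos INTEGER, word TEXT)")

docs = {1: "my gift card never arrived", 2: "gift card declined"}
for doc_id, text in docs.items():
    conn.executemany(
        "INSERT INTO words (id, pos, word) VALUES (?, ?, ?)",
        [(doc_id, pos, w) for pos, w in enumerate(text.split(), start=1)],
    )

# Pair every term with the term that follows it in the same document,
# then count how often each pair occurs.
bigram_counts = conn.execute("""
    SELECT a.word || ' ' || b.word AS pair, COUNT(*) AS n
    FROM words a
    JOIN words b ON a.id = b.id AND b.pos = a.pos + 1
    GROUP BY pair
    ORDER BY n DESC, pair
""").fetchall()

for pair, n in bigram_counts:
    print(pair, n)
```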

Page 15: Practical Text Mining with SQL using Relational Databases

Bigrams Output

Run Frequencies on Terms

Gift Card occurs more frequently than expected

Consider incorporating into Taxonomy

SAMPLE BIGRAM      COUNT  EXPECTED COUNT
Have Been          23     26
Gift Card          29     10
Called Kardco      21     19
Kardco Card        31     25
Customer Service   36     30
Credit Card        24     29
Member Since       10     13
Credit Limit       13     10
Starlight Card     11     5
Kardco Customer    8      6
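The slides do not say how the expected counts were derived. One common choice, assumed here purely for illustration, is the count two words would produce if they occurred independently: count(w1) * count(w2) / N, with N the total number of tokens. A sketch in Python's sqlite3 module with invented comments:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (id INTEGER, pos INTEGER, word TEXT)")

docs = {1: "gift card lost", 2: "gift card declined"}
for doc_id, text in docs.items():
    conn.executemany(
        "INSERT INTO words (id, pos, word) VALUES (?, ?, ?)",
        [(doc_id, pos, w) for pos, w in enumerate(text.split(), start=1)],
    )

rows = conn.execute("""
    WITH uni AS (SELECT word, COUNT(*) AS c FROM words GROUP BY word),
         tot AS (SELECT COUNT(*) AS n FROM words),
         bi  AS (SELECT a.word AS w1, b.word AS w2, COUNT(*) AS observed
                 FROM words a
                 JOIN words b ON a.id = b.id AND b.pos = a.pos + 1
                 GROUP BY a.word, b.word)
    SELECT bi.w1 || ' ' || bi.w2 AS pair,
           bi.observed,
           1.0 * u1.c * u2.c / tot.n AS expected  -- independence assumption
    FROM bi
    JOIN uni u1 ON u1.word = bi.w1
    JOIN uni u2 ON u2.word = bi.w2
    CROSS JOIN tot
    ORDER BY pair
""").fetchall()

stats = {pair: (observed, expected) for pair, observed, expected in rows}
print(stats)
```

Pairs whose observed count is well above the expected count are candidates for the taxonomy.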


Page 16: Practical Text Mining with SQL using Relational Databases

Do repeat callers signal Churn?

Research shows improved predictive model performance

Correlate with Satisfaction Scores

Relevant Keywords

First Call Responders

pair                Status  Count  satisfaction
CUSTOMER SERVICE    A       27     8.47
GIFT CARD           A       25     8.34
KARDCO CARD         A       24     8.79
CREDIT CARD         A       15     8.62
WITH KARDCO         A       13     8.28
TRANSFERRED AGAIN   I       12     8.30
CREDIT LIMIT        A       11     8.35
FROM KARDCO         A       10     8.50
PREMIER CARD        A       9      8.42
WITH KARDCO         I       9      8.48
THREE MONTHS        A       9      8.37
CUSTOMER SERVICE    I       9      8.36

Page 17: Practical Text Mining with SQL using Relational Databases

Looking for the Repeat Callers

select distinct comm1 from Customer_Comments
where prxmatch("m/2nd|3rd|again|resolve/oi", comm1) > 0

Customer comment | Sat
Hotel cant resolve my dispute. I'm going to cancel | 4
Never resolved. Still waiting for a call back | 3
So Completely Unhappy with KardCo. It took 3 calls to the service center to finally resolve my billing problem | 5
They gave me a 2nd chance to pay my bill | 9
This complaint was never resolved to begin with | 5
This is the 2nd year in a row that KardCo said they mailed my rewards refund that I have yet to recieve. Same Pattern every year, I stop getting paper statements in December even though I am signed up for them and I never get my Check. Then I mysteriously start getting paper statements again after the period they say they will cut the checks and tell me i am no longer eligable. | 6
This is the 3rd time I have complained about this and I may have to take my business elsewhere! | 4
Transferred again for the 2nd time. I can't believe it. What happened to Cindy? | 1
When ever I compare customer service between companies KardCo is the PREMIER standard. They are on call 24 hours a day. Their operators are friendly and easy to speak with. They are always on the customers side and they always work at a situation until they resolve the issue. | 10

Some false positives

Terms “resolve” and “2nd” can be positive
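A rough way to screen out such false positives is to combine the keyword match with the satisfaction score. The sketch below uses Python's sqlite3 module with four of the comments from the slide; the `regexp` helper, the column names, and the cutoff of 5 are assumptions made for illustration.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")

def regexp(pattern, text):
    """Return 1 when the pattern matches anywhere in text (case-insensitive)."""
    if text is None:
        return 0
    return int(re.search(pattern, text, re.IGNORECASE) is not None)

conn.create_function("regexp", 2, regexp)
conn.execute(
    "CREATE TABLE customer_comments (id INTEGER PRIMARY KEY, comm1 TEXT, sat INTEGER)"
)
conn.executemany(
    "INSERT INTO customer_comments (id, comm1, sat) VALUES (?, ?, ?)",
    [
        (1, "Never resolved. Still waiting for a call back", 3),
        (2, "They gave me a 2nd chance to pay my bill", 9),
        (3, "Transferred again for the 2nd time. I can't believe it. "
            "What happened to Cindy?", 1),
        (4, "This is the 3rd time I have complained about this and I may "
            "have to take my business elsewhere!", 4),
    ],
)

pattern = "2nd|3rd|again|resolve"

# Every comment that mentions a repeat-contact keyword.
flagged = [r[0] for r in conn.execute(
    "SELECT id FROM customer_comments WHERE comm1 REGEXP ? ORDER BY id",
    (pattern,))]

# Keep only flagged comments with a low satisfaction score (cutoff assumed).
likely_repeat = [r[0] for r in conn.execute(
    "SELECT id FROM customer_comments WHERE comm1 REGEXP ? AND sat <= 5 ORDER BY id",
    (pattern,))]

print(flagged, likely_repeat)
```

Comment 2 matches on "2nd" but has a score of 9, so the score filter drops it.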

Page 18: Practical Text Mining with SQL using Relational Databases

How Text Analytics can improve Predictive Model

Satisfaction Score

Outstanding Balance

Number of Times Called

Predict Churn

Churn Improves

Implement New Scripts for call center

Select all comments with “Gift Card”

Insert Keys into Model Table

Join new Model with existing model tables
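The three steps (select the “Gift Card” comments, insert their keys into a model table, join to the existing model tables) might look like the following in Python's sqlite3 module. All table names, column names, and values are hypothetical and exist only to make the sketch runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_comments (cust_id INTEGER, comm1 TEXT);
    CREATE TABLE churn_model (cust_id INTEGER PRIMARY KEY,
                              satisfaction REAL,
                              balance REAL,
                              times_called INTEGER);
    CREATE TABLE text_flags (cust_id INTEGER PRIMARY KEY, gift_card INTEGER);
""")
conn.executemany("INSERT INTO customer_comments VALUES (?, ?)", [
    (101, "Never received my Gift Card"),
    (101, "Still waiting on the gift card"),
    (102, "Credit limit was lowered"),
])
conn.executemany("INSERT INTO churn_model VALUES (?, ?, ?, ?)", [
    (101, 4.0, 1200.0, 3),
    (102, 8.0, 300.0, 1),
])

# Steps 1 and 2: select comments mentioning "gift card", insert their keys.
conn.execute("""
    INSERT INTO text_flags (cust_id, gift_card)
    SELECT DISTINCT cust_id, 1
    FROM customer_comments
    WHERE lower(comm1) LIKE '%gift card%'
""")

# Step 3: join the text-derived flag to the existing model inputs.
features = conn.execute("""
    SELECT m.cust_id, m.satisfaction, m.balance, m.times_called,
           COALESCE(t.gift_card, 0) AS gift_card
    FROM churn_model m
    LEFT JOIN text_flags t ON t.cust_id = m.cust_id
    ORDER BY m.cust_id
""").fetchall()
print(features)
```

The resulting rows carry the text flag as one more predictor alongside the structured inputs.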

Page 19: Practical Text Mining with SQL using Relational Databases

STANDARD CLASSIFICATIONS

Advertising and marketing
Application processing delay
APR or interest rate
Arbitration
Balance transfer
Balance transfer fee
Bankruptcy
Billing disputes
Billing statement
Cash advance
Cash advance fee
Closing/Cancelling account
Collection debt dispute
Collection practices
Convenience checks
Credit card protection / Debt protection
Credit determination
Credit line Increase/decrease
Credit reporting
Customer service / Customer relations
Delinquent account
Forbearance / Workout plans
Identity theft / Fraud / Embezzlement
Late fee
Other
Other fee
Overlimit fee
Payoff process
Privacy
Rewards
Sale of account
Transaction issue
Unsolicited issuance of credit card

Add “Gift Card” as a Classification

Page 20: Practical Text Mining with SQL using Relational Databases

“Tweak” Taxonomy

Apply Auto Classification

Evaluate according to GOLD Standard

Apply CRISP or SEMMA Methodology and Repeat

Validation

CAT        Count  Customer Service Baseline  Average Spend
ADV        15     15                         15,483
APR        12     12                         13,308
BANKRUPT   1      1                          13,108
BILLDISP   6      6                          12,682
BILLSTAT   6      6                          10,617
COLL       1      1                          17,720
CUSTSERV   25     25                         14,725
DELAY      1      1                          13,334
FRAUD      13     13                         15,162
GIFTCARD   18     18                         16,107
LATEFEE    3      3                          18,989
LINEADJ    4      4                          13,762
OTHER      125    125                        18,482
OTHERFEE   15     15                         10,153
PROT       1      1                          17,808
REFUND     2      2                          16,473
REWARDS    10     10                         10,918
TRANS      1      1                          14,224
TRAVEL     8      8                          10,355

“There is no globally best method for (automated) text analysis”

Page 21: Practical Text Mining with SQL using Relational Databases

Other Types of Classification

Select id, comm1,
Case
  When compged('High Interest Rate APR', comm1) < 300 then 'APR'
  When compged('Best Customer Service', comm1) < 300 then 'DELIGHT'
  Else 'OTHER'
end as CAT
from CUSTOMER_COMMENTS

Classify by

Keyword

Pairs

Regular Expressions

Boolean

Distance Functions

Fuzzy Matching

Regex

Bayesian algorithms
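As an illustration of the distance-function approach, the sketch below classifies short comments with a CASE expression in Python's sqlite3 module. SAS `compged` is not available there, so a plain Levenshtein edit distance is swapped in; that substitution, the cutoff of 3, and the sample comments are all assumptions of this sketch.

```python
import sqlite3

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

conn = sqlite3.connect(":memory:")
conn.create_function("levenshtein", 2, levenshtein)

conn.execute("CREATE TABLE customer_comments (id INTEGER PRIMARY KEY, comm1 TEXT)")
conn.executemany(
    "INSERT INTO customer_comments (id, comm1) VALUES (?, ?)",
    [
        (1, "High intrest rate APR"),   # misspelled on purpose
        (2, "best custmer service"),    # misspelled on purpose
        (3, "Card was declined"),
    ],
)

# Assign the first category whose reference phrase is within 3 edits.
categories = conn.execute("""
    SELECT id,
           CASE
               WHEN levenshtein('high interest rate apr', lower(comm1)) <= 3 THEN 'APR'
               WHEN levenshtein('best customer service', lower(comm1)) <= 3 THEN 'DELIGHT'
               ELSE 'OTHER'
           END AS cat
    FROM customer_comments
    ORDER BY id
""").fetchall()
print(categories)
```

Whole-string edit distance only suits short comments; longer text would need matching against windows or individual terms.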

Page 22: Practical Text Mining with SQL using Relational Databases

Sentiment – Can be easy, can be hard!

Words Table

Join to Polarity Dictionary

Assign +1 to Positive / -1 to Negative

Sentiment Score

Use Top N Weighted Terms

Use First and Last Sentences

Vector Size CPU? Complexity Normalized

Use In-Memory Lookups

Customized Dictionary

Bayesian Classifier in SQL
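The dictionary approach (join the words table to a polarity dictionary, then sum the +1/-1 values per document) can be sketched with Python's sqlite3 module. The dictionary entries, table names, and comments below are made up for illustration, and negation is deliberately ignored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE words (id INTEGER, pos INTEGER, word TEXT);
    CREATE TABLE polarity (word TEXT PRIMARY KEY, score INTEGER);
""")

# Tiny illustrative polarity dictionary: +1 positive, -1 negative.
conn.executemany("INSERT INTO polarity VALUES (?, ?)", [
    ("friendly", 1), ("easy", 1),
    ("awful", -1), ("declined", -1), ("unhappy", -1),
])

docs = {
    1: "operators are friendly and easy to speak with",
    2: "awful service and my card was declined",
    3: "i opened a card",
}
for doc_id, text in docs.items():
    conn.executemany(
        "INSERT INTO words (id, pos, word) VALUES (?, ?, ?)",
        [(doc_id, pos, w) for pos, w in enumerate(text.split(), start=1)],
    )

# LEFT JOIN keeps documents with no dictionary hits; they score 0.
sentiment = conn.execute("""
    SELECT w.id, COALESCE(SUM(p.score), 0) AS sentiment
    FROM words w
    LEFT JOIN polarity p ON p.word = w.word
    GROUP BY w.id
    ORDER BY w.id
""").fetchall()
print(sentiment)
```

Restricting the join to the top N weighted terms, or to the first and last sentences, shrinks the work per document.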

Page 23: Practical Text Mining with SQL using Relational Databases

Sentiment – Correlation

CAT        Count  Average Spend  Satisfaction  Neg Pct  Not Neg Pct
ADV        15     15,483         7.5           49       51
APR        12     13,308         7.2           72       28
FRAUD      13     15,162         5.2           61       39
GIFTCARD   18     16,107         8.9           24       76
LATEFEE    3      18,989         7.0           12       88

Correlating sentiment scores with other database metrics can support a hypothesis

Page 24: Practical Text Mining with SQL using Relational Databases

THANK YOU!

Contact:

[email protected]

www.linkedin.com/in/ralphwinters