Presentation at the 11th Annual Text and Social Analytics Summit - Cambridge, MA. Integrate unstructured data within a relational database: learn the feasibility, prototyping, value added, and goals of text analytics. Understand how much data you have and the architecture necessary to leverage existing technology that goes along with your existing relational structure (Oracle, SAS, SQL Server, DB2, PostgreSQL, MySQL and others). Learn how to use sentiment analysis to determine propensity to churn. A best-practice discussion of statistical techniques, clustering, and association.
PRACTICAL TEXT
MINING WITH SQL
USING RELATIONAL
DATABASES
Ralph Winters
Data Architect,
Actuarial Business Intelligence
EmblemHealth
June 5th, 2013
11th Annual Text and Social Analytics Summit, Cambridge, MA
RDBMS TODAY
Gartner - clients tell us that combining scored, processed
‘outside data’ with data inside our relational databases is where all
the added value is.
IDC - relational database management systems are
expected to nearly double in market growth by 2016, driven by
business intelligence demands and expanded adoption to
tackle big data and unstructured information streams.
The relational database management systems (RDBMS) market continues to confound the skeptics by maintaining strong growth characteristics despite the belief by some that the market has become 'saturated' or that it will be weakened by newer Big Data technologies.
Inmon:
Listen carefully to the “big data” vendors and this is what you hear: “Let’s get rid of relational.” It is like courtiers in the castle whispering, “The king must die.” What’s going on here?
Why a relational database?
Marry Structured + Unstructured Data
More suitable for statistical analytics (matrices)
Leverage existing, familiar, widespread technology
Improving predictive models
Referential Integrity
Integrated Text/Data Mining
Feasibility
What do I need to know?
Costs
Benefits/Risks Industries
Adding Value?
RDBMS
File Interfaces (XML, CSV)
ODBC/JDBC/DBI
Text Vendor supplied Connector
Hadoop Connectors (SAS, Oracle)
Open Source Text Mining Tools (R, Java, Perl, LingPipe)
In-Database Text Mining Algorithms (Oracle Text, SAS Text Miner, SQL Server Text Mining)
RDBMS Internal/External Connections
ANGRY Customer Comments
Short Tailed
Sampling
Not for Long Tailed Data
Comment - KardCo Premier Credit Card Promo Scam . I recently received an KardCo promo promising 25,000 bonus points if you sign up for the KardCo Premier Card and spend $2000 in the 1st three months. and so i call in and apply ...got APPROVED...two weeks later .. Posting on your site DEFINITELY HELPED (it was pointed out by retailer), and sped up response after 6 weeks of mulling around BEFORE we posted our complaint. $100 restaurant certificates 15 days ago I opened a cc w/ KardCo. I thought I did my research on which company is the best, boy was I wrong. I go to use my card for the 1st time lastnight & its declined. Ok.... I call KardCo from the store and I'm placed on hold for 20 mins. Finally I speak to an awful women who tells me my debt to income ratio is too high and I have too many inquires. I pull my credit report once I get home I pull the one from when I opened the card and the most recent one. My revolving debt $100, my credit score increased from 738 to 740 and 96% of my credit is currently available.... 1-800 Customer Service NOT LOCATED IN US! 2 years in a row they don't send me my rewards check
Full Text Search
Built in to many RDBMSs
Needs Indexing
Can be Slow
Necessary in some Applications
Complements Categorization
Oracle:
SELECT SCORE(1), comment, issue_date
FROM custdb
WHERE CONTAINS(text, 'APR', 1) > 0
  AND issue_date >= ('01-OCT-97')
ORDER BY SCORE(1) DESC;
Operators: Like, Contains, Regex,
Sounds Like, Distance Measures
Term Doc
Best 1
Customer 1
Service 1
Highly 2
Recommended 2
Parse Terms from Each Row
Remove StopWords
Cross Reference Document ID & Term Numbers
Output New “Structured” Table
Map Unstructured-to-Structured
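The parse / remove stop words / output steps can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module rather than any of the databases named in the talk; the table name term_doc, the tokenizing regex and the tiny stop-word list are assumptions made for the example.

```python
import re
import sqlite3

# Hypothetical stop-word list; a production list would be much longer.
STOPWORDS = {"the", "is", "a", "an", "and", "of", "to"}


def build_term_table(comments):
    """Return an in-memory SQLite connection holding a term_doc(term, doc) table.

    comments is an iterable of (document id, free text) pairs.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE term_doc (term TEXT, doc INTEGER)")
    for doc_id, text in comments:
        # Parse terms from each row: lower-case, then keep alphanumeric runs.
        for term in re.findall(r"[a-z0-9']+", text.lower()):
            if term not in STOPWORDS:  # remove stop words
                conn.execute("INSERT INTO term_doc VALUES (?, ?)", (term, doc_id))
    return conn


conn = build_term_table([(1, "The Best Customer Service"),
                         (2, "Is Highly Recommended")])
for term, doc in conn.execute("SELECT term, doc FROM term_doc ORDER BY doc, rowid"):
    print(term, doc)
```

With the two sample documents from the slide, the output table has one row per remaining term, keyed by document ID.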
Doc Term1 Term2 Term3 Term4
1 The Best Customer Service
2 Is Highly Recommended
“Wasted Space”
Extended SQL
User Defined Functions
Stored Process
Many Methods to Pivot Data
select regexp_split_to_table(lower(line), '\s+') as word
from customer_comments;
“Words” Table
One row for each term in a document.
Columns: “Document ID”, Verbatim Term, Term Index Number, Term Index +1, Term Index -1
Must handle Negation!
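One possible way to use the term index columns for negation is a self-join that flags any term directly preceded by a negating word. The sketch below is an assumption about how that could be done, not the method used in the talk; the column names id, no and no_prev follow the bigram query later in the deck, and the NEGATORS list is hypothetical.

```python
import sqlite3

# Hypothetical list of negating words.
NEGATORS = ("not", "never", "no")


def load_words(conn, comments):
    """Create words(id, no, no_prev, word): one row per term, with its position."""
    conn.execute(
        "CREATE TABLE words (id INTEGER, no INTEGER, no_prev INTEGER, word TEXT)")
    for doc_id, text in comments:
        for pos, word in enumerate(text.lower().split(), start=1):
            conn.execute("INSERT INTO words VALUES (?, ?, ?, ?)",
                         (doc_id, pos, pos - 1, word))


def negated_terms(conn):
    """Return (document id, term) for every term that directly follows a negator."""
    placeholders = ",".join("?" * len(NEGATORS))
    sql = f"""
        SELECT b.id, b.word
        FROM words a
        JOIN words b ON a.id = b.id AND a.no = b.no_prev
        WHERE a.word IN ({placeholders})
        ORDER BY b.id, b.no
    """
    return conn.execute(sql, NEGATORS).fetchall()


conn = sqlite3.connect(":memory:")
load_words(conn, [(1, "not happy with the service"),
                  (2, "never resolved my dispute"),
                  (3, "very happy")])
print(negated_terms(conn))
```

A downstream sentiment step could flip the polarity of the flagged terms. Negation that spans more than one word ("not at all happy") would need a wider window than this one-step join.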
Term document matrix
Harder to do analysis in SQL
Wasted Space
Weight Terms Discard Terms
Term Weighting in SQL
Calculate IDF
• Log(Number of Docs / Number of Docs which contain term)
Calculate Term Freq
• Number of times Term occurs in document
Calculate TF-IDF
• Multiply IDF * TF
• Sort by High values
• Select Top N features
/* Total number of documents */
create table num_docs as
select count(distinct id) as value
from WORDS;
/* Number of documents containing each word */
create table doc_freq as
select word, count(distinct id) as value
from WORDS
group by word
order by value;
/* IDF per word; num_docs has a single row, so the join adds it to every word */
create table idf as
select word, num_docs.value as numdocs, doc_freq.value as docfreq,
       log10(num_docs.value / doc_freq.value) as idf
from doc_freq, WORK.num_docs
order by idf;
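The remaining steps (term frequency, TF * IDF, top N) can be sketched in one query. This is a minimal sketch using Python's sqlite3 module, not the SAS session shown on the slide; it assumes a words(id, word) table and registers log10 from Python because not every SQLite build ships math functions. The function name tfidf_top_n is made up for the example.

```python
import math
import sqlite3


def tfidf_top_n(rows, n):
    """rows: (document id, word) pairs. Returns the n highest (id, word, tfidf) rows."""
    conn = sqlite3.connect(":memory:")
    # Register log10 explicitly so the query does not depend on the SQLite build.
    conn.create_function("log10", 1, math.log10)
    conn.execute("CREATE TABLE words (id INTEGER, word TEXT)")
    conn.executemany("INSERT INTO words VALUES (?, ?)", rows)
    sql = """
        WITH num_docs AS (SELECT COUNT(DISTINCT id) AS value FROM words),
             doc_freq AS (SELECT word, COUNT(DISTINCT id) AS value
                          FROM words GROUP BY word),
             tf       AS (SELECT id, word, COUNT(*) AS value
                          FROM words GROUP BY id, word)
        SELECT tf.id, tf.word,
               -- 1.0 forces floating-point division before taking the log
               tf.value * log10(1.0 * num_docs.value / doc_freq.value) AS tfidf
        FROM tf
        JOIN doc_freq ON doc_freq.word = tf.word
        CROSS JOIN num_docs
        ORDER BY tfidf DESC, tf.id, tf.word
        LIMIT ?
    """
    return conn.execute(sql, (n,)).fetchall()


sample = [(1, "gift"), (1, "gift"), (1, "card"), (2, "card"), (2, "late")]
for doc_id, word, weight in tfidf_top_n(sample, 2):
    print(doc_id, word, round(weight, 3))
```

A word that appears in every document ("card" in the sample) gets an IDF of zero and drops to the bottom of the ranking, which is the "discard terms" effect.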
Words Table -> Top N Words -> Pivot on Rows
Top N Weighted Words Matrix – Ranked by Highest TF-IDF
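One of the many ways to pivot the long words table into a document-by-term matrix is conditional aggregation: one SUM(CASE ...) column per selected term. The sketch below uses Python's sqlite3 module; the words(id, word) layout and the function name pivot_terms are assumptions, and top_terms is assumed to be non-empty.

```python
import sqlite3


def pivot_terms(rows, top_terms):
    """Pivot (id, word) rows into one row per document with a count per term.

    Returns tuples of the form (id, count of top_terms[0], count of top_terms[1], ...).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE words (id INTEGER, word TEXT)")
    conn.executemany("INSERT INTO words VALUES (?, ?)", rows)
    # One SUM(CASE ...) column per term; the terms are bound as parameters
    # rather than pasted into the SQL text.
    columns = ", ".join(
        "SUM(CASE WHEN word = ? THEN 1 ELSE 0 END)" for _ in top_terms)
    sql = f"SELECT id, {columns} FROM words GROUP BY id ORDER BY id"
    return conn.execute(sql, list(top_terms)).fetchall()


rows = [(1, "gift"), (1, "card"), (1, "gift"), (2, "card"), (3, "fee")]
for row in pivot_terms(rows, ["gift", "card"]):
    print(row)
```

Restricting the pivot to the top N weighted terms keeps the matrix narrow, which limits the wasted space of a full term-document matrix.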
select a.ID,
       (compress(a.word) || ' ' || compress(b.word)) as pair
from words a, words b
where a.ID = b.ID and (a.no = b.no_prev)
order by pair;
Generating Bigrams
Bigrams Output
Run Frequencies on Terms
Gift Card occurs more
frequently than expected
Consider incorporating into
Taxonomy
SAMPLE BIGRAM       COUNT   EXPECTED COUNT
Have Been           23      26
Gift Card           29      10
Called Kardo        21      19
Kardco Card         31      25
Customer Service    36      30
Credit Card         24      29
Member Since        10      13
Credit Limit        13      10
Starlight Card      11      5
Kardco Customer     8       6
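Running frequencies on the bigrams is a GROUP BY on top of the self-join. The slides do not say how the expected count was derived, so the sketch below makes an assumption and says so: it uses a simple independence baseline, count(first word) * count(second word) / total words. It runs on Python's sqlite3 module with a words(id, no, no_prev, word) table.

```python
import sqlite3


def bigram_counts(comments):
    """Return (pair, observed count, expected count) rows, most frequent first.

    The expected count is an independence baseline chosen for this example;
    it is not taken from the presentation.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE words (id INTEGER, no INTEGER, no_prev INTEGER, word TEXT)")
    for doc_id, text in comments:
        for pos, word in enumerate(text.lower().split(), start=1):
            conn.execute("INSERT INTO words VALUES (?, ?, ?, ?)",
                         (doc_id, pos, pos - 1, word))
    sql = """
        WITH unigram AS (SELECT word, COUNT(*) AS n FROM words GROUP BY word),
             total   AS (SELECT COUNT(*) AS n FROM words)
        SELECT a.word || ' ' || b.word AS pair,
               COUNT(*) AS observed,
               1.0 * ua.n * ub.n / total.n AS expected
        FROM words a
        JOIN words b    ON a.id = b.id AND a.no = b.no_prev
        JOIN unigram ua ON ua.word = a.word
        JOIN unigram ub ON ub.word = b.word
        CROSS JOIN total
        GROUP BY a.word, b.word, ua.n, ub.n, total.n
        ORDER BY observed DESC, pair
    """
    return conn.execute(sql).fetchall()


for row in bigram_counts([(1, "gift card declined"),
                          (2, "gift card lost"),
                          (3, "card declined")]):
    print(row)
```

Pairs whose observed count is well above the baseline are candidates for the taxonomy, in the way "Gift Card" stands out in the table above.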
Big Ram
Do repeat callers signal Churn?
Research shows improved predictive model performance
Correlate with Satisfaction Scores
Relevant Keywords
First Call Responders
pair Status Count satisfaction
CUSTOMER SERVICE A 27 8.47
GIFT CARD A 25 8.34
KARDCO CARD A 24 8.79
CREDIT CARD A 15 8.62
WITH KARDCO A 13 8.28
TRANSFERRED AGAIN I 12 8.30
CREDIT LIMIT A 11 8.35
FROM KARDCO A 10 8.50
PREMIER CARD A 9 8.42
WITH KARDCO I 9 8.48
THREE MONTHS A 9 8.37
CUSTOMER SERVICE I 9 8.36
select distinct comm1 from Customer_Comments
where prxmatch("m/2nd|3rd|again|resolve/oi", comm1) > 0;
Customer comment Sat
Hotel cant resolve my dispute. I'm going to cancel 4
Never resolved. Still waiting for a call back 3
So Completely Unhappy with KardCo. It took 3 calls to the service center to finally resolve my billing problem 5
They gave me a 2nd chance to pay my bill 9
This complaint was never resolved to begin with 5
This is the 2nd year in a row that KardCo said they mailed my rewards refund that I have yet to recieve. Same Pattern every year, I stop getting paper statements in December even though I am signed up for them and I never get my Check. Then I mysteriously start getting paper statements again after the period they say they will cut the checks and tell me i am no longer eligable. 6
This is the 3rd time I have complained about this and I may have to take my business elsewhere! 4
Transferred again for the 2nd time. I can't believe it. What happened to Cindy? 1
When ever I compare customer service between companies KardCo is the PREMIER standard. They are on call 24 hours a day. Their operators are friendly and easy to speak with. They are always on the customers side and they always work at a situation until they resolve the issue. 10
Looking for the Repeat Callers
Some false positives
Terms “resolve” and “2nd” can be positive
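The repeat-caller search can be reproduced outside SAS with any regex-capable database. The sketch below uses Python's sqlite3 module, which has no regex matcher built in, so a regexp function is registered from Python (SQLite evaluates X REGEXP Y as regexp(Y, X)). The table layout customer_comments(comm1, sat) is an assumption, and the third sample row is made-up filler.

```python
import re
import sqlite3


def _regexp(pattern, value):
    """Case-insensitive regex search, exposed to SQLite as the REGEXP operator."""
    return int(re.search(pattern, value or "", re.IGNORECASE) is not None)


def find_repeat_callers(comments, pattern=r"2nd|3rd|again|resolve"):
    """comments: (text, satisfaction) pairs. Returns matching rows, lowest score first."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("regexp", 2, _regexp)
    conn.execute("CREATE TABLE customer_comments (comm1 TEXT, sat INTEGER)")
    conn.executemany("INSERT INTO customer_comments VALUES (?, ?)", comments)
    sql = """
        SELECT DISTINCT comm1, sat
        FROM customer_comments
        WHERE comm1 REGEXP ?
        ORDER BY sat
    """
    return conn.execute(sql, (pattern,)).fetchall()


sample = [("Never resolved. Still waiting for a call back", 3),
          ("They gave me a 2nd chance to pay my bill", 9),
          ("Card arrived on time", 8)]  # last row is invented filler
for comment, sat in find_repeat_callers(sample):
    print(sat, comment)
```

Returning the satisfaction score next to each match makes the false positives easy to spot: the "2nd chance" comment matches the pattern but carries a score of 9.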
Satisfaction Score
Outstanding Balance
Predict Churn
Churn Improves
Implement New Scripts for
call center
Number of Times Called
Select all comments with “Gift Card”
Insert Keys into Model Table
Join new Model with existing model tables
How Text Analytics can improve Predictive Model
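The three steps (select comments, insert keys, join to the model tables) might look like the sketch below. It runs on Python's sqlite3 module, and every table and column name in it (customer_comments, model_input, text_flags, cust_id, satisfaction, balance) is hypothetical, as are the sample rows.

```python
import sqlite3


def demo_connection():
    """Build a tiny in-memory database with invented sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer_comments (cust_id INTEGER, comm1 TEXT)")
    conn.execute(
        "CREATE TABLE model_input (cust_id INTEGER, satisfaction INTEGER, balance REAL)")
    conn.executemany("INSERT INTO customer_comments VALUES (?, ?)",
                     [(1, "My Gift Card never arrived"), (2, "Late fee was unfair")])
    conn.executemany("INSERT INTO model_input VALUES (?, ?, ?)",
                     [(1, 8, 250.0), (2, 4, 900.0)])
    return conn


def add_gift_card_flag(conn):
    """Add a 0/1 gift_card feature to the existing model inputs."""
    # Steps 1 and 2: select comments mentioning "gift card", keep only their keys.
    conn.execute("CREATE TABLE text_flags (cust_id INTEGER PRIMARY KEY)")
    conn.execute("""
        INSERT INTO text_flags
        SELECT DISTINCT cust_id
        FROM customer_comments
        WHERE lower(comm1) LIKE '%gift card%'
    """)
    # Step 3: join the new keys to the existing model table.
    return conn.execute("""
        SELECT m.cust_id, m.satisfaction, m.balance,
               CASE WHEN f.cust_id IS NULL THEN 0 ELSE 1 END AS gift_card
        FROM model_input m
        LEFT JOIN text_flags f ON f.cust_id = m.cust_id
        ORDER BY m.cust_id
    """).fetchall()


for row in add_gift_card_flag(demo_connection()):
    print(row)
```

The LEFT JOIN keeps customers who never mentioned the keyword, so the text-derived flag becomes one more column alongside satisfaction score and outstanding balance.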
STANDARD CLASSIFICATIONS
Advertising and marketing
Application processing delay
APR or interest rate
Arbitration
Balance transfer
Balance transfer fee
Bankruptcy
Billing disputes
Billing statement
Cash advance
Cash advance fee
Closing/Cancelling account
Collection debt dispute
Collection practices
Convenience checks
Credit card protection / Debt protection
Credit determination
Credit line Increase/decrease
Credit reporting
Customer service / Customer relations
Delinquent account
Forbearance / Workout plans
Identity theft / Fraud / Embezzlement
Late fee
Other
Other fee
Overlimit fee
Payoff process
Privacy
Rewards
Sale of account
Transaction issue
Unsolicited issuance of credit card
Add “Gift Card” as a Classification
“Tweak” Taxonomy
Apply Auto Classification
Evaluate according to GOLD Standard
Apply CRISP-DM or SEMMA Methodology and Repeat
Validation
CAT        Count   Customer Service Baseline   Average Spend
ADV        15      15                          15,483
APR        12      12                          13,308
BANKRUPT   1       1                           13,108
BILLDISP   6       6                           12,682
BILLSTAT   6       6                           10,617
COLL       1       1                           17,720
CUSTSERV   25      25                          14,725
DELAY      1       1                           13,334
FRAUD      13      13                          15,162
GIFTCARD   18      18                          16,107
LATEFEE    3       3                           18,989
LINEADJ    4       4                           13,762
OTHER      125     125                         18,482
OTHERFEE   15      15                          10,153
PROT       1       1                           17,808
REFUND     2       2                           16,473
REWARDS    10      10                          10,918
TRANS      1       1                           14,224
TRAVEL     8       8                           10,355
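Evaluating the automatic classification against a gold standard comes down to comparing two label columns. The sketch below, on Python's sqlite3 module, computes overall accuracy and a per-category hit count; the labels(gold, predicted) table and the sample labels are hypothetical.

```python
import sqlite3


def evaluate(pairs):
    """pairs: (gold category, predicted category) for each hand-labelled comment.

    Returns (accuracy, [(category, gold count, correctly predicted count), ...]).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE labels (gold TEXT, predicted TEXT)")
    conn.executemany("INSERT INTO labels VALUES (?, ?)", pairs)
    accuracy = conn.execute("""
        SELECT AVG(CASE WHEN gold = predicted THEN 1.0 ELSE 0.0 END)
        FROM labels
    """).fetchone()[0]
    per_category = conn.execute("""
        SELECT gold,
               COUNT(*) AS n,
               SUM(CASE WHEN gold = predicted THEN 1 ELSE 0 END) AS correct
        FROM labels
        GROUP BY gold
        ORDER BY gold
    """).fetchall()
    return accuracy, per_category


accuracy, per_category = evaluate([("APR", "APR"), ("APR", "OTHER"),
                                   ("GIFTCARD", "GIFTCARD"), ("FRAUD", "FRAUD")])
print(accuracy)
print(per_category)
```

Re-running this after each taxonomy tweak gives the repeat loop a number to track per category.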
“There is no globally best method for
(automated) text analysis”
Other Types of Classification
select id, comm1,
  case
    when compged('High Interest Rate APR', comm1) < 300 then 'APR'
    when compged('Best Customer Service', comm1) < 300 then 'DELIGHT'
    else 'OTHER'
  end as CAT
from CUSTOMER_COMMENTS;
Classify by
Keyword
Pairs
Regular Expressions
Boolean
Distance Functions
Fuzzy Matching
Regex
Bayesian algorithms
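Where the database has no built-in distance function, fuzzy matching can be supplied as a user defined function. The sketch below registers a plain Levenshtein distance with SQLite from Python and uses it in a CASE expression. It is not SAS COMPGED (which uses weighted edit costs); the keywords, the threshold of 1 and the category codes are assumptions chosen for the example.

```python
import re
import sqlite3


def levenshtein(a, b):
    """Classic edit distance: insertions, deletions and substitutions cost 1 each."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


def token_distance(keyword, text):
    """Smallest edit distance between keyword and any word in text."""
    tokens = re.findall(r"[a-z]+", (text or "").lower())
    return min((levenshtein(keyword, t) for t in tokens), default=len(keyword))


def classify(comments):
    """comments: (id, text) pairs. Returns (id, category) rows."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("token_distance", 2, token_distance)
    conn.execute("CREATE TABLE customer_comments (id INTEGER, comm1 TEXT)")
    conn.executemany("INSERT INTO customer_comments VALUES (?, ?)", comments)
    return conn.execute("""
        SELECT id,
               CASE WHEN token_distance('interest', comm1) <= 1 THEN 'APR'
                    WHEN token_distance('service', comm1) <= 1 THEN 'CUSTSERV'
                    ELSE 'OTHER'
               END AS cat
        FROM customer_comments
        ORDER BY id
    """).fetchall()


print(classify([(1, "My intrest rate went up"),
                (2, "Terrible servce on the phone"),
                (3, "Lost my card")]))
```

Matching per token tolerates single-character misspellings such as "intrest" without the whole comment having to resemble the keyword.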
Sentiment – Can be easy, can be hard!
Words Table
Join to Polarity Dictionary
Assign +1 to Positive / -1 to Negative
Sentiment Score
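The easy version of this pipeline is a join and a sum. The sketch below uses Python's sqlite3 module; the words(id, word) and polarity(word, score) tables and the four-word dictionary are assumptions, and negation is deliberately ignored.

```python
import sqlite3


def sentiment_scores(comments, polarity):
    """comments: (id, text) pairs; polarity: dict of word -> +1 or -1.

    Returns (id, sentiment score) rows; words missing from the dictionary count as 0.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE words (id INTEGER, word TEXT)")
    conn.execute("CREATE TABLE polarity (word TEXT PRIMARY KEY, score INTEGER)")
    for doc_id, text in comments:
        conn.executemany("INSERT INTO words VALUES (?, ?)",
                         [(doc_id, w) for w in text.lower().split()])
    conn.executemany("INSERT INTO polarity VALUES (?, ?)", polarity.items())
    return conn.execute("""
        SELECT w.id, COALESCE(SUM(p.score), 0) AS sentiment
        FROM words w
        LEFT JOIN polarity p ON p.word = w.word
        GROUP BY w.id
        ORDER BY w.id
    """).fetchall()


# Hypothetical polarity dictionary.
POLARITY = {"friendly": 1, "easy": 1, "awful": -1, "declined": -1}
print(sentiment_scores([(1, "friendly and easy operators"),
                        (2, "awful service declined"),
                        (3, "card arrived")], POLARITY))
```

The LEFT JOIN with COALESCE keeps documents that contain no dictionary words, scoring them as neutral instead of dropping them.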
Use Top N Weighted Terms
Use First and Last Sentences
Vector Size CPU? Complexity Normalized
Use In-Memory Lookups
Customized Dictionary
Bayesian Classifier in SQL
CAT        Count   Average Spend   Satisfaction   Neg Pct   Not Neg Pct
ADV        15      15,483          7.5            49        51
APR        12      13,308          7.2            72        28
FRAUD      13      15,162          5.2            61        39
GIFTCARD   18      16,107          8.9            24        76
LATEFEE    3       18,989          7.0            12        88
Sentiment – Correlation
Correlating sentiment scores with other database metrics can support hypotheses
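A correlation between a sentiment score and another metric needs only six aggregates, all of which SQL can compute in one pass. The sketch below uses Python's sqlite3 module for the sums and finishes the Pearson formula in Python; the scores(sentiment, metric) table and the sample values are invented for illustration.

```python
import math
import sqlite3


def pearson(pairs):
    """pairs: (sentiment score, other metric) per customer. Returns Pearson's r.

    Returns None when either column has no variance.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE scores (sentiment REAL, metric REAL)")
    conn.executemany("INSERT INTO scores VALUES (?, ?)", pairs)
    n, sx, sy, sxy, sxx, syy = conn.execute("""
        SELECT COUNT(*), SUM(sentiment), SUM(metric),
               SUM(sentiment * metric),
               SUM(sentiment * sentiment),
               SUM(metric * metric)
        FROM scores
    """).fetchone()
    denominator = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    if denominator == 0:
        return None
    return (n * sxy - sx * sy) / denominator


# Invented sample: sentiment score vs. satisfaction score.
print(pearson([(-2, 2), (-1, 4), (0, 5), (1, 7), (2, 9)]))
```

A strong correlation on real data would only support a hypothesis, not prove it; categories with few comments (LATEFEE has 3 in the table above) deserve particular caution.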