From Data to Decisions: Learnings from Real-World Data Mining Dr. Shailesh Kumar Google, Inc. InfoVision 2011


Infovision 2011 Data to Decisions, Shailesh Kumar, Google
http://informationexcellence.wordpress.com/category/knowledge-share-sessions/
http://informationexcellence.wordpress.com/2011/10/28/infovision2011-presentations/


Page 1: Infovision 2011 Data to Decisions Shailesh Kumar, Google

From Data to Decisions: Learnings from Real-World Data Mining

Dr. Shailesh Kumar Google, Inc.

InfoVision 2011

Page 2

Welcome to the Information Age … drowning in data and starving for knowledge

ATATTAGGTTTTTACCTACCCAGGAAAAGCCAACCAACCTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTAGCTGTCGCTCGGCTGCATGCCTAGTGCACCTACGCAGTATAAACAATAATAAATTTTACTGTCGTTGACAAGAAACGAGTAACTCGTCCCTCTTCTGCAGACTGCTTATTACGCGACCGTAAGCTAC…

Page 3

This data explosion is enabled by…

  Better “Sensors” – Higher Resolution, More Spectral Bands, Quick Experimental Turnaround, Crowd Sourcing…

  Higher Bandwidth Communication – Faster Networks and Routers, Better Compression technologies…

  Larger Warehouses – Cheaper Storage, Multi-Level Caching, Scalable Database/Data warehousing technologies…

  Massive Crunching Power – Faster Multi-core processors, Parallel Distributed Computing, MapReduce paradigms…

  Advances in Machine Learning and Data Mining – Sophisticated Learning frameworks, Distributed Data Mining…

Page 4

From “Data” to “Decision”

[Diagram labels: Data, Insights, Features, Models, Predictions, Decision; Domain Knowledge, Business Objectives, Business Constraints; Feedback]

Page 5

Observation → Prediction → Decision

Credit Card Fraud. Input: Past card usage behavior. Predict: Fraudulent transaction? Decision: Approve Transaction?

Credit Scoring. Input: Past payment behavior. Predict: Probability of Default. Decision: Approve Loan?

Retail Cross Sell. Input: Past purchase behavior. Predict: Response to a coupon. Decision: Send Coupon?

Page 6

Building Machine Learning Models: The Process, the Art, and the Science

  Collect Raw (Input) Data

  Collect Target (Output) Labels (“ground truth”). Can be costly!

  Choose “Model Type” & “Model Complexity”. Too simple: under-learn. Too complex: over-learn. (Bias-Variance Tradeoff)

  Engineer and select “predictive” features: use domain knowledge, keep variability that matters, remove redundancy

  “Train” a model using the feature-label training data set

  “Evaluate” the trained model on “validation” data and iterate until satisfied

  “Deploy” the model: predict the class label of all the “un-labeled” data
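The train, evaluate, and iterate steps on this slide can be sketched with a toy example. The sketch below is an illustration only, not the speaker's method: the data generator, the choice of k-nearest-neighbours as the model type, and the candidate values of k are all assumptions. Here k plays the role of model complexity: a small k memorizes noise, a large k under-learns.

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (1-D inputs)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0

def accuracy(train, data, k):
    """Fraction of points in `data` that a k-NN model built on `train` gets right."""
    hits = sum(knn_predict(train, x, k) == y for x, y in data)
    return hits / len(data)

def make_data(n, rng, noise=0.2):
    """Hypothetical toy data: label is 1 when x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data

rng = random.Random(0)
train = make_data(200, rng)        # feature-label training set
validation = make_data(200, rng)   # held-out data, never used for fitting

# Evaluate each candidate complexity on validation data and keep the best one.
results = {k: accuracy(train, validation, k) for k in (1, 5, 15, 51, 199)}
best_k = max(results, key=results.get)
```

With k = 1 the model scores perfectly on its own training data, which is exactly why the choice is made on the validation set instead.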

Page 7

Lessons from Real-world Data Mining: Insights, Features, Labels, Models, Decisions

Page 8

Looking for a Needle in a Haystack?

  What is the nature of my haystack (data)?
    What process generated the data?
    What assumptions am I making about the data?

  Is it the right needle (insight) to look for?
    Is it “actionable”? Is it “useful”? Is it “novel”?
    Does it tell me something I didn’t know?

Insight Discovery ≠ Hypothesis Testing

Page 9

The Traditional Market Basket Analysis
Wrong needle in a mysterious haystack!

[Diagram: Frequent item-sets (size 1) → Candidate item-sets (size 2) → Frequent item-sets (size 2) → Candidate item-sets (size 3) → Frequent item-sets (size 3)]
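The level-wise frequent/candidate loop on this slide can be sketched as follows. This is a minimal Apriori-style illustration under stated assumptions: the function name, the baskets, and the support threshold are hypothetical and not taken from the talk.

```python
from itertools import combinations

def frequent_itemsets(baskets, min_support):
    """Level-wise search: frequent sets of size k seed the candidates of size k+1."""
    baskets = [frozenset(b) for b in baskets]

    def support(itemset):
        # Number of baskets that contain every item of the itemset.
        return sum(itemset <= b for b in baskets)

    items = {item for b in baskets for item in b}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent = {}
    size = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        size += 1
        # Candidate generation: unions of two frequent sets that have the next size.
        candidates = {a | b for a in current for b in current if len(a | b) == size}
        # Prune: every (size - 1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, size - 1))}
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

# Hypothetical baskets for illustration.
baskets = [
    {"tent", "airbed", "lantern"},
    {"tent", "airbed"},
    {"tent", "lantern"},
    {"airbed", "grill"},
]
result = frequent_itemsets(baskets, min_support=2)
```

In this toy run the pair {airbed, lantern} falls below the threshold, so the size-3 candidate {tent, airbed, lantern} is pruned even though the three items plausibly belong together, which is the weakness the slide title points at.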

Page 10

Lesson: Know your data (Haystack)
What process generated the data?

A basket is a mixture of projections of latent intentions. It shows only part of what the customer intends to buy, because customers may:

  already have other products

  buy them from another retailer

  buy them at a different time

  have got them as gifts

  …

Few buy a complete “logical” product group in the same basket

Page 11

Lesson: Extract the essence, let go of data
Pair-wise Co-occurrence Statistics
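As a rough sketch of pair-wise co-occurrence statistics, the code below reduces a set of baskets to item counts and pair counts and then scores a pair. The slide does not say which statistic was used; pointwise mutual information (PMI) is an assumption chosen here because it is a common option, and the baskets are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def cooccurrence_stats(baskets):
    """Summarize baskets as (number of baskets, item counts, unordered pair counts)."""
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))          # sorted so each pair has one canonical order
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    return len(baskets), item_counts, pair_counts

def pmi(pair, n, item_counts, pair_counts):
    """log( P(a, b) / (P(a) * P(b)) ); -inf when the pair never co-occurs."""
    a, b = pair
    if pair_counts[pair] == 0:
        return float("-inf")
    p_ab = pair_counts[pair] / n
    return math.log(p_ab / ((item_counts[a] / n) * (item_counts[b] / n)))

# Hypothetical baskets for illustration.
baskets = [
    {"tent", "airbed", "lantern"},
    {"tent", "airbed"},
    {"tent", "lantern"},
    {"airbed", "grill"},
]
n, item_counts, pair_counts = cooccurrence_stats(baskets)
score = pmi(("lantern", "tent"), n, item_counts, pair_counts)
```

Once these counts are stored, the raw baskets are no longer needed to score any pair, which is one reading of "let go of data".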

Page 12

Lesson: Look for the right Insight
“Frequent” vs. “Logical” Itemset

  Novel – Not obvious from the data (support = 0)
  Useful – product bundling, recommendations, layout
  Exhaustive – “No insight left behind!” – however “rare”

[Figure: example logical itemsets, with product-category labels Airbeds, Lighting, Folding Furniture, Camping Accessories, Grill Accessories, Inflatables, Water Sports, Lighting, Patio Accessories, Furniture; and Projection TV, Flat Panel TV, Home Theatre Services, Digital Cable TV, Home Components, Speakers]

Page 13

Lessons from Real-world Data Mining: Insights, Features, Labels, Models, Decisions

Page 14

Two Mindsets to Modeling

Model-centric (Simple Features, Complex Model)
  Throw all features in!
  Have enough data
  Build Complex models

Feature-centric (Complex Features, Simple Model)
  Carefully craft features
  Use Domain Knowledge
  Build Simpler Models

The Law of Conservation of Complexity

Page 15

Lesson: Distribute Complexity well
Simplify Models with complex features

[Figure: Simple Features with a Complex Model vs. Complex Features with a Simple Model]

Page 16

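Lesson: Overcome model limitations

[Figure: a decision tree with the splits Age < 60, Income < Rs. 32, and Education < 20, drawn over Income vs. Age axes, next to a tree with the splits Education < 20 and log (Income) - B x Age < 12, drawn over log (Income) vs. Age axes]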
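The second tree on this slide tests a single engineered feature, log(Income) - B x Age, instead of several separate thresholds on raw inputs. A minimal sketch of such a rule follows. The value of B is a hypothetical placeholder (the slide leaves it unspecified), the function names are made up for illustration, and the rule only reports whether a record passes both tests; the slide does not say what the leaves predict.

```python
import math

B = 0.05        # hypothetical coefficient; the slide does not give a value for B
THRESHOLD = 12  # threshold shown on the slide

def engineered_feature(income, age):
    """Combine two raw inputs into one feature that a single split can test."""
    return math.log(income) - B * age

def passes_both_tests(income, age, education):
    """True when a record satisfies Education < 20 and the engineered-feature test."""
    return education < 20 and engineered_feature(income, age) < THRESHOLD

example = passes_both_tests(income=1000, age=40, education=12)
```

A tree restricted to one raw variable per split would need many axis-parallel cuts to approximate the same boundary, which is the limitation the feature works around.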

Page 17

Lessons from Real-world Data Mining (Text): Insights, Features, Labels, Models, Decisions

Page 18

Lesson: Things are not what they appear
What is a word in “Bag-of-Words”?

  Segmentation: What is a word?
    New York Stock Exchange: 4 words?
    “New York” “Stock Exchange”: 2 phrases?
    “New York Stock Exchange”: 1 phrase?

  Disambiguation: What does a word mean?
    ‘rock band’, ‘rock climbing’, ‘rocking chair’, ‘the rock’

  Equivalencing: How “similar” are two terms?
    Comparing Apples to Oranges…
    Orange Juice, Orange Flag, Orange Blog,
    Apple store, Apple pie, The Big Apple

Page 19

Equivalencing (SIMILARITY = 0.995)
  we filed a suit charging dell of illegal behavior
  they submitted a case accusing apple of unauthorized conduct

Disambiguation (SIMILARITY = 0.171)
  i was right to avoid a suit against apple
  on my right was a man in a suit drinking apple juice

You shall know a word by the company it keeps -- Firth, J. R. 1957:11
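Firth's idea can be sketched by describing each word by the words around it and comparing those context vectors. The code below is a toy illustration, not the method behind the similarity scores on the slide: the window size, the cosine measure, and the three-sentence corpus (adapted from the slide's examples) are assumptions, and it will not reproduce 0.995 or 0.171.

```python
import math
from collections import Counter, defaultdict

def context_vectors(sentences, window=2):
    """Map each word to counts of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical mini-corpus, adapted from the slide's sentences.
corpus = [
    "we filed a suit charging dell of illegal behavior",
    "they filed a case charging apple of illegal behavior",
    "a man in a suit drinking apple juice",
]
vecs = context_vectors(corpus)
sim_legal = cosine(vecs["suit"], vecs["case"])   # shared legal context
sim_other = cosine(vecs["suit"], vecs["juice"])  # little shared context
```

Even on three sentences, "suit" ends up closer to "case" than to "juice" because the two legal sentences share neighbours such as "filed" and "charging".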

Page 20

Lessons from Real-world Data Mining: Insights, Features, Labels, Models, Decisions

Page 21

Labels are precious – use them well

  Labeled data vs. Unlabeled data
    Lots of input data! (e.g. web pages)
    Small fraction is labeled! (e.g. spam/not)

  Labels can be
    Costly – human judgments, costly experiments, rare events
    Noisy – web clicks, crowd sourced,…

  How do we use unlabeled data with labeled data?
    Semi-supervised Learning

  Which unlabeled data point to get labeled next?
    Active Learning
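One common answer to "which unlabeled data point to get labeled next?" is uncertainty sampling: ask for the label of the point the current model is least sure about. The sketch below assumes that strategy; the slide does not name one. The `predict_proba` function is a hypothetical stand-in for a trained spam model, and the pool values are made up.

```python
import math

def most_uncertain(unlabeled, predict_proba):
    """Return the unlabeled point whose predicted probability is closest to 0.5."""
    return min(unlabeled, key=lambda x: abs(predict_proba(x) - 0.5))

def predict_proba(x):
    """Hypothetical stand-in for a trained model: a logistic curve over one score."""
    return 1.0 / (1.0 + math.exp(-(x - 3.0)))

pool = [0.5, 2.8, 4.0, 7.5]                 # unlabeled points (made-up scores)
next_to_label = most_uncertain(pool, predict_proba)
```

Here the point 2.8 is chosen because its predicted probability is nearest to 0.5, so a human label for it is likely to be more informative than one for a point the model already scores confidently.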

Page 22

Lessons from Real-world Data Mining: Insights, Features, Labels, Models, Decisions

Page 23

Lesson: Don’t beat data into submission
Model Complexity no more than necessary

  How many hidden units in a neural network?
  How deep a decision tree?
  How much cost for “misclassification elasticity” in SVM?
  How many clusters, or modes in a mixture density?

Model is too simple: under-learn

Model is too complex: memorize

Model is just right: generalize

Page 24

Lesson: Divide and Conquer
Many simple models > Single complex model

[Figure: the letters A to Z arranged in groups, including “M W N U F P”, “V Y S Z B E I J”, “K R”, “H Q O G”, “L D”, and “X C”]

  Better “localized features”
  Simpler “local models”
  More interpretable features and models
  Higher Accuracy
  Faster Modeling Time
  Lower Resource Requirements

Page 25

Lessons from Real-world Data Mining: Insights, Features, Labels, Models, Decisions

Page 26

Lesson: Interpret Predictions
What is the score? Why is the score that way?

[Figure panels: Concept Space; Prediction Score Overlay]

*This is not what we mean by the “art of data mining”

Page 27

Lesson: Learn Globally, Decide Locally

“The Ford-Firestone dispute blew up in August 2000 and is still going strong. In response to claims that their 15-inch Wilderness AT, radial ATX and ATX II tire treads were separating from the tire core leading to grisly, spectacular crashes. Bridgestone/Firestone recalled 6.5 million tires….” -- Forbes

[Figure panels: Accidents description; Density Overlay]

Page 28

Lesson: Prediction is not enough!
Different Reasons, Different Decisions

[Figure panels: Probability of defaulting; Collection Notes]

Page 29

Summary

  Decisions driven more by data than by “gut feeling”

  Converting data to decisions is Art + Science + Engineering

  Insights: Right needles in a well understood Haystack

  Features: Garbage In, Garbage Out

  Models: Generalize, don’t Memorize

  Labels: Explore thoroughly, Exploit efficiently

  Decisions: Right decision for the right reason

  Feedback: Adapt features, models, scores, decisions

Page 30

In theory, theory and practice are same.

In practice, they are not.

-- Lawrence Peter Berra

Questions?