49
© Copyright 1998-2001, Blue Martini Software. San Mateo California, USA 1 1 Mining E-commerce Data The Good, the Bad, and the Ugly Ronny Kohavi, Ph.D. Senior Director of Data Mining Blue Martini Software [email protected] http://www.bluemartini.com http://www.Kohavi.com KDD 2001 8/28/2001

Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Blue Martini Software. San Mateo California, USA 1

1

Mining E-commerce DataThe Good, the Bad, and the Ugly

Ronny Kohavi, Ph.D.Senior Director of Data MiningBlue Martini [email protected]://www.bluemartini.comhttp://www.Kohavi.com

KDD 20018/28/2001

Page 2: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 2

2OverviewOverview

ò The GoodE-commerce is the killer domain for data mining

ò The BadYou need more than web logs and you must conflate many data sources

ò The UglyPre-processing and post-processingare hard

ò E-metricsò Stories from mining real data

“Peeling the onion”ò Summary

Page 3: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 3

3The Killer DomainThe Killer Domain

Successful data mining benefits from:ò Large amount of data (many records)

If you sell five items an hour on average, you’ll have about 1.8M clicks in the first month. Yahoo! has 1B clicks, or 10GB/hour

ò Rich data with many attributes (wide records)With proper design, many attributes are available

ò Clean data collection (avoid GIGO)Electronically collected, no legacy systems

ò Actionable domain (have real-world impactEasy to change sites, ads, targeted campaigns, cross-sells

ò Measurable return-on-investment

E-commerce has all theright ingredients

Page 4: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 4

4Many RecordsMany Records

ò Clickstreams generate huge amounts of dataò Yahoo! has 1 billion page views a day.

Web log data for page views is 10GB per hour!ò New e-commerce sites, even if small, generate

sufficient data for effective mining quicklyIf you sell five items an hour on average, that’s

5 items * 24 hours * 30 days / 2% conversion * 9 clicks-in-session >

1.6 million page views

Page 5: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 5

5Rich RecordsRich Records

Effective site design can log many attributes about what was shown or purchased:

ò Product and product attributesò Assortment attributes

(when multiple products are shown)ò Promotions shownò Visit attributes (e.g., visit count)ò Customer attributes

(when known through login/registration)

Page 6: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 6

6Clean DataClean Data

ò Collect data directly at webstoreNo legacy systems

ò Collect what is needed by designNot as an afterthought

ò Collect electronically - reliable dataNo humans typing survey data from forms

ò Sample at the right granularity levelArchitecture design principle: sample at thecustomer or session level, never at page view level

Page 7: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 7

7Teaser - Birth DatesTeaser - Birth Dates

A bank discovered that almost 5% of their customers were born on the exact same date

Can you explain?

Page 8: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 8

8Actionable DomainActionable Domain

ò Few data mining discoveries had a realimpact on businesses.Taking action requires changing complex systems, procedures, and human habits - HARD in generalEasier in the electronics world

ò In e-commerce, many discoveries can be made actionable byò Changing web sites (e.g., personalization)ò Targeted campaignsò Changing advertising strategies based on ROI

ò Easy to offer cross-sells or up-sellsContrast with changing actual store layouts

Page 9: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 9

9Measure ROIMeasure ROI

ò In e-commerce, it is easy to evaluate metrics, unlike in brick-and-mortarstores.See Why We Buy: the Science of Shopping byPaco Underhill

ò In e-commerce it is easy to measurethe effect of changes.One can easily set control groups on a web site

ò Response to e-mails and surveys is days, not weeks and months

ò The web is an experimental laboratoryIt is easy to change and measure the effect

Page 10: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 10

10E-mail campaigns: Immediate ROIE-mail campaigns: Immediate ROI

One of of our customers, Gymboree, sent e-mail campaign based on analysis of website data of registered users: 7 email designs to 4 segments

Results:l Very high clickthrough rate

of 22% (normal is 10%)l Average order size was 36%

higher than normall Email with two age groups of

the same gender outperformed that with single age(medium targeting)

l Lifestyle imagesbetter than products

Segment 1 Segment 2 Segment 3

Page 11: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 11

11The BadThe Bad

Web logs provide little data, even in the Extended Common Log Format (ECLF)

ò Hostò Timeò Request, e.g., an html pageò Referrerò User agent (browser identifier)ò IP addressò Cookieò Bytes, status, ...

Firms need web intelligence, not log analysis-- Forrester Report, Nov 1999

Page 12: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 12

12What is on the Web Page?What is on the Web Page?

ò Weblogs designed for analyzing web servers, not for mining e-commerce transactions and clickstreams

ò Given a URL, what was displayed?ò Reverse URL mapping. Very brittle.

http://www.amazon.com/exec/obidos/ASIN/0471376809/qid%3D982141580/105-9856660-9155942

is The Data WebHouse Toolkit

ò Hard to derive attributes of the product, such as soft cover, author, edition, year?

Page 13: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 13

13Dynamic Content is HarderDynamic Content is Harder

Dynamic content, which is becoming more common makes web log analysis harder

ò The same URL will display different itemsò URLs are amazingly long in dynamic sites and

information is in the application server session:http://www.im.aa.com/American?BV_EngineID=dealikcjfekgbfdmcflmcfkhdgfh.7

&BV_Operation=Dyn_RawSmartLink&BV_SessionID=%40%40%40%400822617159.0968100982%40%40%40%40&form%25destination=index-member.tmpl&BV_ServiceName=American

ò Personalized content (e.g., recommended cross-sell) is practically impossible to reconstruct from web logs

Page 14: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 14

14Sessionizing is Heuristic-BasedSessionizing is Heuristic-Based

ò HTTP is statelessò Sessionizing is still a research topic

See Measuring the Accuracy of Sessionizers for Web Usage Analysis Berent, Mobasher, Spiliopoulou, and Wiltshire, in Proceedings of the Web Mining Workshop at the First SIAM International Conference on Data Mining, 2001

ò Recreating user sessions is heuristic based:ò IP addressesò Cookiesò Browser type

Page 15: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 15

15Business EventsBusiness Events

Some events cannot be determined from weblogs:ò Add to shopping cart - needed to compute

value of abandoned shopping cartsò Change quantity of item in cartò Promotion offered on pageò Out of stock shown on the pageò Form informationò Search - common keywords or

keywords that were not found(an important warning to an e-commerce site)

Page 16: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 16

16Example - Search KeywordsExample - Search Keywords

ò On one of our sport-related sites, the top searched keywords were:ò Baseballò Videoò Softballò Volleyballò Pinsò Equestrianò Videosò Postersò Musicò Poster

What is common to thewords in red?

Page 17: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 17

17Example - Search KeywordsExample - Search Keywords

ò On one of our sports-related sites, the top searched keywords (in order) were:ò Baseballò Videoò Softballò Volleyballò Pinsò Equestrianò Videosò Postersò Musicò Poster

Searches for red words yielded zero results!

• Some words just need a synonym

• Some words should send a strongmessage about items the store should carry!

Weblogs do not typically contain sufficient information to extract failed searches.

This isn’t fancy analytics, but it’s crucial.

About 11% of searches fail

Page 18: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 18

18Matching Web Logs to DBMatching Web Logs to DB

ò Given a request, how do you ò Match it to the customer in your database that filled

a registration form?ò Determine if this is the customer’s second visit or

the 100th visit?ò Determine if the customer previously purchased?

ò These common requests are very hard to implement as an afterthought

ò They are even harder when you try to find “scenarios” that match multiple events

Page 19: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 19

19Conversion RatesConversion Rates

ò Most often-requested measures relate to conversion rates (buyers to browsers)

ò Especially useful by referrer (e.g., ad)ò Given an HTTP request that has one of your

ads as the referrer field, how can you tell if it resulted in a sale?

Using hits and page views to judge site successis like evaluating a musical performance by its volume

-- Forrester Report, 1999

Page 20: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 20

20A Real-World Referrer ExampleA Real-World Referrer Example

ò The KDD Cup 2000 data shows the following in their initial rampup period

ò Conversion rates differ by a factor of over 200!ò Knowing the likelihood of purchase

dramatically changes the message to present

Referrer # Sessions % of traffic # Sales Conv rateShopNow 16,178 6.9% 6 0.04%FashionMall 19,685 8.4% 17 0.09%MyCoupons 2,052 0.9% 170 8.28%

Page 21: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 21

21“Bad” Is Not So Bad“Bad” Is Not So Bad

ò Ignore web logsThey are at the wrong granularity level to be useful

ò Log the information yourself at the application layer

ò The application knows what is on the pageò The app controls sessionsò The app can log business eventsò The app can tie a visitor to their customer

information upon loginò Also see Structure and Content Preprocessing

by Rob Cooley for more information

Page 22: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 22

22The “Ugly”The “Ugly”

ò There are several hard problems:ò Crawlersò Handling large amounts of data ò Data transformations for analysisò Marketing-level insight

ò These are excellent research topics

Page 23: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 23

23CrawlersCrawlers

ò Crawlers are programs that visit your siteò Search crawlersò Shopping botsò IE5 offline viewerò E-mail harvesters - Evilò Students learning Perl scripts

ò For understanding your customers, it is very important to filter out crawlers

ò 30% of sessions come from bots/crawlers (most are measure of service bots such as Keynote)

ò Fairly hard problemSome try to hide themselves

Good

Page 24: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 24

24

ò Browsers identify themselves on every request in a field called Useragent

ò Which useragents are bots?Mozilla/4.75 [en] (WinNT; U)Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0)Mozilla/3.01 [en] (Win95; I)Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt; MSIECrawler)Googlebot/2.1 (+http://www.googlebot.com/bot.html)Mozilla/4.0 (compatible; MSIE 5.0; Win32)

ò All but the top twoò Criteria

ò Known robot user agentsò User agents which never log in/never purchaseò User agents with only 1-click or more than 100-click sessionsò User agents with no referrer on all requests

Identifying Bots/CrawlersIdentifying Bots/Crawlers

Page 25: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 25

25Data TransformationsData Transformations

ò 80% of the time spent in data analysis is typically spent transforming data

ò What can be done today:ò Automate transfer of data from webstore

environment to data warehouseò Provide data transformation UIò Provided “canned” transformations for common

business problems

ò Business users without “data” or “analyst” in their title cannot spend the time to learn how to transform data

Page 26: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 26

26Business Level InsightBusiness Level Insight

ò Everything is a GO!ò You collected data correctlyò You built a data warehouseò You transformed the dataò You ran a simple Perceptron

(1-layer) neural network thatpredicts the target well

ò The business user asks:What does the 237-dimensional hyperplane represent?

ò Insight must be comprehensible to biz usersSometimes required for legal reasons(e.g., no discrimination)

Page 27: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 27

27E-MetricsE-Metrics

ò Goal: Identify commonalities and differences across sites

ò Data collected over the Christmas 2000 periodò Five e-commerce web sites running Blue

Martini’s softwareò Both US (three sites) and Europe (two sites)ò Removed bots/crawlersò Study available at developer.bluemartini.com

under Data Mining and Analytics articles

Page 28: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 28

28Visit StatisticsVisit Statistics

ò An average visitorò Views 8-10 pagesò Spends 5 minutes on the siteò Spends 35 seconds between pages

ò An average purchasing visitor

ò Views 50 pagesò Spends 30 minutes on the site

ò These numbers are Extremely consistent across all of the sites in the study.

Page 29: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 29

29

Average Visit Duration (Time)

0%

10%

20%

30%

40%

50%

60%

70%

< 1m

1m - 2

m

2m - 3

m

3m - 4

m

4m - 5

m

5m - 6

m

6m - 7

m

7m - 8

m

8m - 9

m

9m - 1

0m

10m - 1

1m

11m - 1

2m

12m - 1

3m

13m - 1

4m

14m - 1

5m

15m - 1

6m

16m - 1

7m

17m - 1

8m

18m - 1

9m

19m - 2

0m> 2

0m

Time

Vis

its (%

)

Visit Duration - TimeVisit Duration - Time

Lesson:Like TV advertising, a web site needs to capture a visitor in under a minute.

Make the home page fast, useful and compelling.

Note the low std-dev for data 2 minutes or higher

Page 30: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 30

30Purchase Visit Duration - TimePurchase Visit Duration - Time

Average Visit Duration (Time) - Purchase Visits

0%

2%

4%

6%

8%

10%

12%

14%

16%

18%

20%

< 5m

5m - 1

0m

10m - 1

5m

15m - 2

0m

20m - 2

5m

25m - 3

0m

30m - 3

5m

35m - 4

0m

40m - 4

5m

45m - 5

0m

50m - 5

5m

55m - 6

0m > 60m

Time

Pur

chas

e V

isits

(%)

Page 31: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 31

31Think TimeThink Time

Average Think Time

0%

5%

10%

15%

20%

25%

30%

35%

40%

45%

< 5s

10s -1

5s

20s -

25s

30s -

35s

40s -

45s

50s -

55s

60s -

65s

70s -

75s

80s -

85s

90s -

95s

100s

- 105

s

110s

- 115

s> 1

20s

Time

Pag

es (%

)

Page 32: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 32

32

Average Browser Distribution

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

Internet Explorer Netscape AOL Browser OtherBrowser

Vis

its

Browser DistributionBrowser Distribution

What browser war?

(AOL/IE hard to tell apart)

Page 33: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 33

33Conversion RatesConversion Rates

Visitor

RegisteredVisitor

Buyer

5% (1%-10%)

33% (15%-50%)

Lesson:Few visitors bother to register.Make it easy, fast and give incentives.

Lesson:Many people who bother to register don’t purchase.You know who they are, so you can target them.

1.7%(0.2% - 4.5%)

Page 34: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 34

34Business EventsBusiness Events

ò Shopping cart abandonment rate34% (15% - 50%)

Lesson:You know what they want, and sometimes know who they are.You can target them.

ò Search6% of visits use search (1%-10%)10% of searches fail to find any results

Lesson:You can find out what they were looking for.You can correct the problem.

Page 35: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 35

35Teaser - PrivacyTeaser - Privacy

ò 92% of Americans are concerned (67% very concerned) about the misuse of their personal information on the Internet.

- FTC Report, May 2000

ò 86% of executives don’t know how many customers view their privacy policies.

- Forrester Report, November 2000

ò Q: What percentage of visitors read theprivacy statement?

ò A: Less than 0.3%

Page 36: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 36

36Teaser - Mysterious Birth YearsTeaser - Mysterious Birth Years

The KDD CUP 98 datacontained anomaliesfor date of birth[Georges and Milley, SIGKDD Explorations 2000]

ò Spikes on years ending in zero (white dots on blue)ò Few individuals born prior to 1910ò Many more individuals who were born on even years (blue)

as on odd years (red)Why?

0

200

400

600

800

1000

1200

1400

1600

1800

2000

1900

1905

1910

1915

1920

1925

1930

1935

1940

1945

1950

1955

1960

1965

1970

1975

1980

1985

1990

1995

Year

Page 37: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 37

37Teaser - Gender MysteryTeaser - Gender Mystery

ò A site has gender on the registration formò Acxiom, a syndicated data provider, also

provides genderò A very large discrepancy found between

ò Males according to registration form andò Acxiom provided data

Why?

Hint: Acxiom only conflicted with females,claiming some females are males.Never in the other direction

Page 38: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 38

38Teaser - Low Conversion RatesTeaser - Low Conversion Rates

ò Recall that Conversion Rate is the ratio of buyers to browsers.

ò High conversion rates are desiredò Reports showed some products

have really low conversion rates?

Why?

Page 39: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 39

39Teaser - High Conversion RatesTeaser - High Conversion Rates

ò Product Conversion Rate is the ratio of product purchases to product views

ò High can conversion rates be over 100%

Page 40: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 40

40Teaser - Who is Winnie?Teaser - Who is Winnie?

Referring site traffic for Gazelle.com, a leg-wear and leg-care web retailer. From KDD Cup 2000Who is Winnie Cooper? What can you do about it?

Note spikein traffic

Top Referrers

0%

20%

40%

60%

80%

100%

2/2/00

2/4/00

2/6/00

2/8/00

2/10/0

02/1

2/00

2/14/0

02/1

6/002/1

8/002/2

0/002/2

2/002/2

4/002/2

6/00

2/28/0

03/1

/003/3

/003/5

/003/7

/003/9

/00

3/11/0

03/1

3/003/1

5/00

3/17/0

03/1

9/00

3/21/0

03/2

3/003/2

5/00

3/27/0

03/2

9/00

3/31/0

0

Session date

Per

cent

of

top

refe

rrer

s

0

1000

2000

3000

4000

5000

6000

Fashion Mall Yahoo ShopNow MyCoupons Winnie-cooper Total from top referrers

Yahoo searches for THONGS and Companies/Apparel/Lingerie

FashionMall.com

ShopNow.com

Winnie-Cooper

MyCoupons.com

Page 41: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 41

41Answer to TeaserAnswer to Teaser

ò Winnie-cooper is a 31 year old guy who wears pantyhose

ò He has a pantyhose siteò 8,700 visitors came from his site

in a few days (!)

ò Actions:

ò Make him a celebrity and interview him about how hard it is for a men to buy pantyhose in stores

ò Personalize for XL sizes

Page 42: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 42

42Resources (I)Resources (I)

ò KDNuggets, Software for Web Mininghttp://www.kdnuggets.com/software/web.html

ò WEBKDD - Workshops in Web Mininghttp://robotics.Stanford.EDU/~ronnyk/WEBKDD2000/index.htmlhttp://robotics.Stanford.EDU/~ronnyk/WEBKDD2001/index.html

ò WEB Mining Tutorialsò E-commerce and Clickstream Mining, Jon Becher and Ron Kohavi,

First SIAM International Conference on Data Mining, 2001http://robotics.Stanford.EDU/~ronnyk/miningTutorialSlides.pdf

ò Web Mining for E-Commerce, Jaideep Srivastava,The Fifth Pacific Asia Conference on Knowledge Discovery and Data Mining, 2001

Page 43: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 43

43Resources (II)Resources (II)

ò The Data Webhouse Toolkit: Building the Web-Enabled Data Warehouse by Ralph Kimball,Richard Merz. ISBN: 0471376809 (Jan 2000)

ò Mastering Data Mining: The Art and Science of Customer Relationship Management by Michael J. A. Berry, Gordon Linoff. ISBN: 0471331236

ò The Data Mining and Knowledge Discovery special issue on Application of Data Mining to Electronic Commerce (volume 5, 1/2) January/April 2001. Special issue:

http://www.wkap.nl/issuetoc.htm/1384-5810+5+1/2+2001

Book ISBN: 0792373030http://www.amazon.com/exec/obidos/ASIN/0792373030

Page 44: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 44

44Resources (III)Resources (III)

ò Web Mining Research: A Surveyhttp://www.acm.org/sigs/sigkdd/explorations/issue2-1/contents.htm#Kosala

ò Web Data Mining course at DePaul University byBamshad Mobasherhttp://maya.cs.depaul.edu/~classes/cs589/lecture.html

ò Integrating E-commerce and Data Mining: Architecture and Challenges, WEBKDD'2000http://robotics.Stanford.EDU/~ronnyk/ronnyk-bib.html

ò Drinking from the Firehose: Converting Raw Web Traffic and E-Commerce Data Streams for Data Mining and Marketing Analysis by Rob Cooleyhttp://www.webusagemining.com/sys-tmpl/webdataminingworkshop/

Page 45: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 45

45Resources (IV)Resources (IV)

ò An Ideal E-Commerce Architecture for Building Web Sites Supporting Analysis and Personalizationhttp://robotics.Stanford.EDU/~ronnyk/ronnyk-bib.html

ò Analyzing Web Site Traffic, Sane Solutionshttp://www.sane.com/products/NetTracker/whitepaper.pdf

ò Web Mining, Accrue Softwarehttp://www.accrue.com/forms/webmining.html

Page 46: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 46

46

ò Direct effect of web on established retailers may not be large, but lessons learned will affect other channels, so additional ROI comes from improvements to other channels

ò The webstore provides an experimental laboratory and a trend-discovery system

ò Which cross-sells work?ò Which ads are effective?ò What are people looking for

(failed searches for pokédex)

Take Home Messages (I of III)Take Home Messages (I of III)

Page 47: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 47

47Take Home Messages (II of III)Take Home Messages (II of III)

ò Good: E-commerce is the killer-domain for data mining with all the right ingredients

ò Bad: Good data collection is hardò Web logs are information poorò New sites should log clickstream and events in the app

ò Ugly:ò Data transformations take longer than you expect.ò You must “peel the onion” for interesting insight

See KDD CUP 2000 at http://www.ecn.purdue.edu/KDDCUP

Page 48: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 48

48Take Home Messages (III)Take Home Messages (III)

ò Always involve the business userMany “interesting” discoveries turn out to be a result of some intentional activity. “Peel the onion.”

ò Business users want simple, comprehensible results

ò Reports are not glamorousbut most often needed

ò Simple algorithms are most usefulespecially if coupled with good visualizations

ò The web is a measurement and experiments labò Half the discoveries will carry over to the “real world”

Some images used herein where obtained from IMSI's MasterClips/Master Photo(C) Collection,1895 Francisco Blvd East, San Rafael 94901-5506, USA

Page 49: Mining E-commerce Data The Good, the Bad, and the Uglyai.stanford.edu/people/ronnyk/kddITrackTalk.pdf · campaign based on analysis of website data of registered users: 7 email designs

© Copyright 1998-2001, Ron Kohavi, Blue Martini Software, San Mateo California, USA 49

49AcknowledgementAcknowledgement

ò Many thanks to the data mining team members at Blue Martini Software

ò Special thanks to Llew, Zijian, Brian, and Eric, who worked on the e-metrics project

ò A copy of this talk is available at www.kohavi.com

ò The reference for the talk/paper isRon Kohavi. Mining e-commerce data: The good, the bad, and the ugly. In Foster Provost and Ramakrishnan Srikant, editors, Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2001.http://robotics.Stanford.EDU/users/ronnyk/goodBadUglyKDDItrack.pdf