Data Mining Lectures Lecture 17: Web Log Mining Padhraic Smyth, UC Irvine

ICS 278: Data Mining

Lecture 17: Web Log Mining

Padhraic Smyth

Department of Information and Computer Science

University of California, Irvine



• Basic concepts in Web log data analysis

• Predictive modeling of Web navigation behavior
  – Markov modeling methods

• Analyzing search engine data

• Ecommerce aspects of Web log mining



• Useful to study human digital behavior, e.g. search engine data can be used for
  – Exploration, e.g. number of queries per session?
  – Modeling, e.g. any time-of-day dependence?
  – Prediction, e.g. which pages are relevant?

• Applications
  – Understand social implications of Web usage
  – Design of better tools for information access
  – E-commerce applications


How our Web navigation is recorded…

• Web logs
  – Record activity between client browser and a specific Web server
  – Easily available
  – Can be augmented with cookies (provide notion of “state”)

• Search engine records
  – Text in queries, which responses were clicked on, etc.

• Client-side browsing records
  – Produced for research purposes as part of a study
  – Automatically recorded by client-side software
  – Harder to obtain, but much more accurate than server-side logs

• Other sources
  – Web site registration, purchases, email, etc.
  – ISP recording of Web browsing


Web Server Log Files

• Server Transfer Log:
  – Transactions between a browser and the server are logged
  – IP address and time of the request
  – Method of the request (GET, HEAD, POST, …)
  – Status code (the server's response)
  – Size of the transaction in bytes

• Referrer Log:
  – Where the request originated

• Agent Log:
  – Browser software making the request (this is also where spiders show up)

• Error Log:
  – Requests that resulted in errors (e.g. 404)


W3C Extended Log File Format

Field              Identifier       Description
Date               date             The date on which the activity occurred
Time               time             The time at which the activity occurred
Client IP Address  c-ip             The IP address of the client that accessed your server
User Name          cs-username      The name of the authenticated user who accessed your server; anonymous users are represented by -
Service Name       s-sitename       The Internet service and instance number that was accessed by a client
Server Name        s-computername   The name of the server on which the log entry was generated
Server IP Address  s-ip             The IP address of the server on which the log entry was generated
Server Port        s-port           The port number the client is connected to
Method             cs-method        The action the client was trying to perform
URI Stem           cs-uri-stem      The resource accessed
URI Query          cs-uri-query     The query, if any, the client was trying to perform
Protocol Status    sc-status        The status of the action, in HTTP or FTP terms
Win32 Status       sc-win32-status  The status of the action, in terms used by Microsoft Windows
Bytes Sent         sc-bytes         The number of bytes sent by the server
Bytes Received     cs-bytes         The number of bytes received by the server
Time Taken         time-taken       The duration of time, in milliseconds, that the action consumed
Protocol Version   cs-version       The protocol (HTTP, FTP) version used by the client
Host               cs-host          The content of the host header
User Agent         cs(User-Agent)   The browser used on the client
Cookie             cs(Cookie)       The content of the cookie sent or received, if any
Referrer           cs(Referer)      The previous site visited by the user; this site provided a link to the current site

Prefixes: cs = client-to-server actions; sc = server-to-client actions; s = server actions; c = client actions
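As a rough illustration of how such a log might be read, the sketch below assumes the usual layout of this format: directive lines starting with `#`, a `#Fields:` directive naming the columns, and whitespace-separated values. The function name `parse_w3c_extended` and the sample lines (including the address 192.0.2.1) are made up for the example and are not taken from the lecture.

```python
def parse_w3c_extended(lines):
    """Yield one dict per log entry, keyed by the names in the #Fields directive."""
    fields = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Directive line, e.g. "#Version: 1.0" or "#Fields: date time c-ip"
            if line.lower().startswith("#fields:"):
                fields = line.split(":", 1)[1].split()
            continue
        values = line.split()
        # Skip entries that do not match the declared field list
        if fields and len(values) == len(fields):
            yield dict(zip(fields, values))


# Hypothetical sample data, for illustration only
sample = [
    "#Version: 1.0",
    "#Fields: date time c-ip cs-method cs-uri-stem sc-status sc-bytes",
    "2002-03-29 03:58:06 192.0.2.1 GET /top.html 200 9609",
    "2002-03-29 03:58:09 192.0.2.1 GET /missing.gif 404 0",
]
entries = list(parse_w3c_extended(sample))
errors = [e["cs-uri-stem"] for e in entries if e["sc-status"] == "404"]
```

Keying each entry by field identifier means the same code works whichever subset of fields a server has been configured to log.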


Example of Web Log entries

Apache web log:

- - [29/Mar/2002:03:58:06 -0800] "GET /~sophal/whole5.gif HTTP/1.0" 200 9609 "http://www.csua.berkeley.edu/~sophal/whole.html" "Mozilla/4.0 (compatible; MSIE 5.0; AOL 6.0; Windows 98; DigExt)"
- - [29/Mar/2002:03:59:40 -0800] "GET /~alexlam/resume.html HTTP/1.0" 200 2674 "-" "Mozilla/5.0 (Slurp/cat; [email protected]; http://www.inktomi.com/slurp.html)"
- - [29/Mar/2002:03:00:14 -0800] "GET /~tahir/indextop.html HTTP/1.1" 200 3510 "http://www.csua.berkeley.edu/~tahir/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
- - [29/Mar/2002:03:00:14 -0800] "GET /~tahir/animate.js HTTP/1.1" 200 14261 "http://www.csua.berkeley.edu/~tahir/indextop.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
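A minimal sketch of parsing such entries, assuming the Apache "combined" layout (host, identity, user, timestamp, request line, status, size, referrer, agent). The entries shown on the slide lack the leading host field, so the sample line below adds a placeholder address (192.0.2.1); that address and the names `COMBINED` and `parse_combined` are assumptions for illustration.

```python
import re
from datetime import datetime

# Regular expression for one line in Apache combined log format
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_combined(line):
    """Return a dict of fields, or None if the line does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


# First slide entry, with a placeholder host prepended
line = ('192.0.2.1 - - [29/Mar/2002:03:58:06 -0800] '
        '"GET /~sophal/whole5.gif HTTP/1.0" 200 9609 '
        '"http://www.csua.berkeley.edu/~sophal/whole.html" '
        '"Mozilla/4.0 (compatible; MSIE 5.0; AOL 6.0; Windows 98; DigExt)"')
entry = parse_combined(line)
```

Returning `None` for non-matching lines lets a caller count malformed entries rather than fail on them.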


Routine Server Log Analysis

• Most and least visited web pages
• Entry and exit pages
• Referrals from other sites or search engines
• What are the searched keywords
• How many clicks/page views a page received
• Error reports, like broken links


Visualization of Web Log Data over Time


Server Log Analysis


Descriptive Summary Statistics

• Histograms, scatter plots, time-series plots
  – Very important!
  – Helps to understand the big picture
  – Provides “marginal” context for any model-building
    • models aggregate behavior, not individuals
  – Challenging for Web log data

• Examples
  – Session lengths (e.g., power laws)
  – Click rates as a function of time, content

[Figure: distribution of session length; x-axis: Session Length L]

L = number of page requests in a single session from visitors to www.ics.uci.edu over 1 week in November 2002 (robots removed)

[Figure: distribution of session length with fitted line; x-axis: Session Length L]

Best fit of simple power law model

$$\log P(L) = -a \log L + b$$

or, equivalently,

$$P(L) = C\,L^{-a}, \qquad \log C = b$$
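The straight-line form log P(L) = -a log L + b suggests one simple way to estimate a and b: ordinary least squares on the log-log points. The sketch below is only an illustration of that idea under stated assumptions; the slides do not say how the fit was computed, the helper name `fit_power_law` is hypothetical, and the data are synthetic values generated from a known power law (not normalized, not real session counts).

```python
import math


def fit_power_law(values_by_length):
    """Least-squares fit of log P(L) = -a log L + b; returns (a, b)."""
    xs = [math.log(length) for length in values_by_length]
    ys = [math.log(p) for p in values_by_length.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, intercept


# Synthetic values following 0.5 * L^-2 exactly, for L = 1..50
synthetic = {length: 0.5 * length ** -2.0 for length in range(1, 51)}
a, b = fit_power_law(synthetic)
```

On real histograms the sparse, noisy tail can dominate a log-log regression, so binning or a maximum-likelihood estimate is often preferred; this sketch does neither.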

[Figure: distribution of session length; x-axis: Session Length L]


Web data measurement issues

• Important to understand how data is collected

• Web data is collected automatically via software logging tools
  – Advantage:
    • No manual supervision required
  – Disadvantage:
    • Data can be skewed (e.g. due to the presence of robot traffic)

• Important to identify robots (also known as crawlers, spiders)


A time-series plot of ICS Website data

Number of page requests per hour as a function of time from page requests in the www.ics.uci.edu Web server logs during the first week of April 2002.


Robot / human identification

• Robot requests are identified by classifying page requests using a variety of heuristics
  – e.g. some robots identify themselves in the server logs (robots.txt)
  – Robots explore the entire website in breadth-first fashion
  – Humans access web pages in depth-first fashion

• Tan and Kumar (2002) discuss more techniques
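A small sketch of the self-identification heuristic, under two assumptions that go beyond the slide: that a request for /robots.txt marks a client as a robot, and that a short keyword list applied to the user-agent string catches self-declared crawlers. The keyword list and the function name `looks_like_robot` are illustrative only; the breadth-first versus depth-first heuristic is not implemented here.

```python
# Assumed keyword hints; real robot lists are much longer
ROBOT_AGENT_HINTS = ("bot", "crawler", "spider", "slurp")


def looks_like_robot(requests):
    """requests: list of (path, user_agent) pairs issued by one client."""
    for path, agent in requests:
        if path == "/robots.txt":
            return True
        agent_lower = agent.lower()
        if any(hint in agent_lower for hint in ROBOT_AGENT_HINTS):
            return True
    return False


# Agent strings adapted (shortened) from the Apache log example
crawler = [("/~alexlam/resume.html",
            "Mozilla/5.0 (Slurp/cat; http://www.inktomi.com/slurp.html)")]
human = [("/~tahir/indextop.html",
          "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)")]
```

Robots that disguise their agent string and never fetch robots.txt pass this check, which is why behavioral heuristics are needed as well.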


Page requests, caching, and proxy servers

• In theory, the requesting browser asks a Web server for a page and the request is processed

• In practice, there are
  – Other users
  – Browser caching
  – Dynamic addressing in the local network
  – Proxy server caching


Page requests, caching, and proxy servers

A graphical summary of how page requests from an individual user can be masked at various stages between the user’s local computer and the Web server.


Page requests, caching, and proxy servers

• Web server logs are therefore not a complete and faithful representation of individual page views

• There are heuristics to try to infer the true actions of the user:
  – Path completion (Cooley et al. 1999)
    • e.g. if a link B -> F is known to exist and C -> F is not, then session ABCF can be interpreted as ABCBF
  – See Anderson et al. 2001 for more heuristics

• In the general case, it is hard to know what the user viewed
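The ABCF example can be sketched in code. This is a simplified reading of path completion, not the procedure of Cooley et al.: when a logged page is not linked from the previous one, the sketch walks back through earlier pages (as if the user pressed "back" and the pages came from cache) until it finds one that links to it. The function name and the link graph are assumptions for illustration.

```python
def complete_path(session, links):
    """session: sequence of pages; links: dict mapping page -> set of linked pages."""
    completed = [session[0]]
    for page in session[1:]:
        if page not in links.get(completed[-1], set()):
            # Simulate 'back' clicks through earlier pages until one links to `page`
            back = []
            for i in range(len(completed) - 2, -1, -1):
                back.append(completed[i])
                if page in links.get(completed[i], set()):
                    completed.extend(back)
                    break
        completed.append(page)
    return completed


# Hypothetical site structure: B links to F, C does not
links = {"A": {"B"}, "B": {"C", "F"}, "C": set()}
result = "".join(complete_path("ABCF", links))
```

If no earlier page links to the requested one (for example, the user typed the URL), the sketch leaves the session unchanged at that step.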


Identifying individual users from Web server logs

• Useful to associate specific page requests with specific individual users

• IP address most frequently used

• Disadvantages
  – One IP address can belong to several users
  – Dynamic allocation of IP addresses

• Better to use cookies
  – Information in the cookie can be accessed by the Web server to identify an individual user over time
  – Actions by the same user during different sessions can be linked together


Identifying individual users from Web server logs

• Commercial websites use cookies extensively

• 97% of users have cookies enabled permanently on their browsers (source: Amazon.com, 2003)

• However …
  – There are privacy issues: need implicit user cooperation
  – Cookies can be deleted / disabled

• Another option is to enforce user registration
  – High reliability
  – Can discourage potential visitors



• Time oriented (robust)
  – e.g., sessions delimited by gaps between requests
    • not more than 25 minutes between successive requests

• Navigation oriented (good for short sessions and when timestamps are unreliable)
  – Referrer is the previous page in the session, or
  – Referrer is undefined but the request comes within 10 secs, or
  – There is a link from the previous page to the current page in the web site
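The time-oriented rule can be sketched as follows: a client's requests are sorted by time and a new session starts whenever the gap to the previous request exceeds 25 minutes. The function name `sessionize` is hypothetical, and the timestamps are borrowed from the Apache log example purely for illustration.

```python
from datetime import datetime, timedelta


def sessionize(timestamps, max_gap=timedelta(minutes=25)):
    """Group one client's request times into sessions using a maximum gap."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= max_gap:
            sessions[-1].append(ts)   # continues the current session
        else:
            sessions.append([ts])     # gap too large: start a new session
    return sessions


requests = [
    datetime(2002, 3, 29, 3, 0, 14),
    datetime(2002, 3, 29, 3, 0, 20),
    datetime(2002, 3, 29, 3, 58, 6),
    datetime(2002, 3, 29, 3, 59, 40),
]
sessions = sessionize(requests)
```

Here the 58-minute gap splits the four requests into two sessions of two requests each. In practice the input would first be grouped by user (IP address or cookie).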


Client-side data

• Advantages of collecting data at the client side:
  – Direct recording of page requests (eliminates ‘masking’ due to caching)
  – Recording of all browser-related actions by a user (including visits to multiple websites)
  – More-reliable identification of individual users (e.g. by login ID for multiple users on a single computer)

• Preferred mode of data collection for studies of navigation behavior on the Web

• Companies like comScore and Nielsen use client-side software to track home computer users


Client-side data

• Statistics like ‘Time per session’ and ‘Page-view duration’ are more reliable in client-side data

• Some limitations
  – Some statistics like ‘Page-view duration’ still cannot be totally reliable, e.g. the user might go to fetch coffee
  – Need explicit user cooperation
  – Typically recorded on home computers, so may not reflect a complete picture of Web browsing behavior

• Web surfing data can be collected at intermediate points like ISPs, proxy servers
  – Can be used to create user profiles and target advertising


Early studies from 1995 to 1997

• Earliest studies on client-side data are Catledge and Pitkow (1995) and Tauscher and Greenberg (1997)

• In both studies, data was collected by logging Web browser commands

• Population consisted of faculty, staff and students

• Both studies found that
  – clicking on hypertext anchors was the most common action
  – using the ‘back button’ was the second most common action


Early studies from 1995 to 1997

• High probability of page revisitation (~0.58-0.61)
  – Lower bound because the page requests prior to the start of the studies are not accounted for
  – Humans are creatures of habit?
  – Content of the pages changed over time?

• Strong recency effect (the page that is revisited is usually a page that was visited in the recent past)
  – Correlates with the ‘back button’ usage

• Similar repetitive actions are found in telephone number dialing etc.


The Cockburn and McKenzie study from 2002

• Previous studies are relatively old

• Web has changed dramatically in the past few years

• Cockburn and McKenzie (2002) provide a more up-to-date analysis
  – Analyzed the daily history.dat files produced by the Netscape browser for 17 users for about 4 months
  – Population studied consisted of faculty, staff and graduate students

• Study found revisitation rates (~0.81) higher than the earlier 94 and 95 studies
  – Time-window is three times that of past studies


The Cockburn and McKenzie study from 2002

• Revisitation rate less biased than the previous studies?

• Human behavior changed from an exploratory mode to a utilitarian mode?
  – The more pages a user visits, the more requests there are for new pages
  – The most frequently requested page for each user can account for a relatively large fraction of his/her page requests

• Useful to see the scatter plot of the distinct number of pages requested per user versus the total pages requested
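The quantities discussed here are easy to compute from a per-user request list. The sketch below assumes the common definition of revisitation rate as (total requests minus distinct pages) divided by total requests; the slides do not spell out the formula, so treat that as an assumption. The function names and the toy history are illustrative.

```python
from collections import Counter


def revisitation_rate(requests):
    """Fraction of requests that go to a page already seen in this history."""
    total = len(requests)
    distinct = len(set(requests))
    return (total - distinct) / total


def top_page_share(requests):
    """Most frequently requested page and its share of all requests."""
    page, count = Counter(requests).most_common(1)[0]
    return page, count / len(requests)


# Toy browsing history for one hypothetical user
history = ["home", "news", "home", "mail", "home", "news", "sports", "home"]
rate = revisitation_rate(history)
top = top_page_share(history)
```

In this toy history there are 8 requests to 4 distinct pages, so the revisitation rate is 0.5, and "home" accounts for half of all requests.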


The Cockburn and McKenzie study from 2002

The number of distinct pages visited versus page vocabulary size of each of the 17 users in the Cockburn and McKenzie (2002) study (log-log plot)


The Cockburn and McKenzie study from 2002

Bar chart of the ratio of the number of page requests for the most frequent page to the total number of page requests, for the 17 users in the Cockburn and McKenzie (2002) study


Markov models for page prediction

• General approach is to use a finite-state Markov chain
  – Each state can be a specific Web page or a category of Web pages
  – If only interested in the order of visits (and not in time), each new request can be modeled as a transition of states

• Issues
  – Self-transition
  – Time-independence


Markov models for page prediction

• For simplicity, consider an order-dependent, time-independent finite-state Markov chain with M states

• Let s be a sequence of observed states of length L, e.g. s = ABBCAABBCCBBAA with three states A, B and C. $s_t$ is the state at position t ($1 \le t \le L$). In general,

$$P(s) = P(s_1) \prod_{t=2}^{L} P(s_t \mid s_{t-1}, \ldots, s_1)$$

• Under a first-order Markov assumption,

$$P(s_t \mid s_{t-1}, \ldots, s_1) = P(s_t \mid s_{t-1})$$

so that

$$P(s) = P(s_1) \prod_{t=2}^{L} P(s_t \mid s_{t-1})$$

• This provides a simple generative model to produce sequential data
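To make the first-order factorization concrete, the sketch below computes the log-probability of the example sequence ABBCAABBCCBBAA as log P(s_1) plus the sum of log P(s_t | s_{t-1}). The initial and transition probabilities are invented for illustration (they are not from the lecture), and `sequence_log_prob` is a hypothetical helper name.

```python
import math

# Illustrative parameters only; each row of `transition` sums to 1
initial = {"A": 0.5, "B": 0.3, "C": 0.2}
transition = {
    "A": {"A": 0.5, "B": 0.4, "C": 0.1},
    "B": {"A": 0.2, "B": 0.5, "C": 0.3},
    "C": {"A": 0.3, "B": 0.3, "C": 0.4},
}


def sequence_log_prob(seq, initial, transition):
    """log P(s) = log P(s_1) + sum over t >= 2 of log P(s_t | s_{t-1})."""
    log_p = math.log(initial[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        log_p += math.log(transition[prev][cur])
    return log_p


log_p = sequence_log_prob("ABBCAABBCCBBAA", initial, transition)
```

Working with log-probabilities avoids numerical underflow on long sequences, since the raw product shrinks geometrically with the sequence length.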


Markov models for page prediction

• If we denote $T_{ij} = P(s_t = j \mid s_{t-1} = i)$, we can define an M x M transition matrix

• Properties
  – Strong first-order assumption
  – Simple way to capture sequential dependence

• If each page is a state and there are W pages, the transition matrix has $O(W^2)$ entries; W can be of the order $10^5$ to $10^6$ for a CS dept. of a university

• To alleviate this, we can cluster the W pages into M clusters, each assigned a state in the Markov model

• Clustering can be done manually, based on the directory structure on the Web server, or automatically using clustering techniques


Markov models for page prediction

• $T_{ij} = P(s_t = j \mid s_{t-1} = i)$ represents the probability that an individual user’s next request will be from category j, given they were in category i

• We can add E, an end-state, to the model

• E.g. for three categories with end state:

[Figure: transition matrix for three categories plus the end state E]

• E denotes the end of a sequence, and start of a new sequence










Markov models for page prediction

• First-order Markov model assumes that the next state is based only on the current state

• Limitations
  – Doesn’t consider ‘long-term memory’

• We can try to capture more memory with a kth-order Markov chain

$$P(s_t \mid s_{t-1}, \ldots, s_1) = P(s_t \mid s_{t-1}, \ldots, s_{t-k})$$

• Limitations
  – Inordinate amount of training data needed, $O(M^{k+1})$


Parameter estimation for Markov model transitions

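• Smoothed parameter estimates of transition probabilities are

$$\hat{T}_{ij} = \frac{n_{ij} + \alpha\, q_{ij}}{n_i + \alpha}$$

where $n_{ij}$ is the number of observed transitions from state i to state j, $n_i = \sum_j n_{ij}$, $q_{ij}$ is the prior probability of the transition, and $\alpha$ is the weight given to the prior

• If $n_{ij} = 0$ for some transition (i, j) then instead of having a parameter estimate of 0 (ML), we will have $\alpha\, q_{ij} / (n_i + \alpha)$, allowing prior knowledge to be incorporated

• If $n_{ij} > 0$, we get a smooth combination of the data-driven information ($n_{ij}$) and the prior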
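A sketch of smoothed estimation, assuming the estimate has the form (n_ij + alpha * q_ij) / (n_i + alpha), where n_ij counts observed i-to-j transitions, n_i is the row total, q_ij is a prior transition probability and alpha is the prior weight. That form is an assumption reconstructed from the surrounding text. The uniform prior, alpha = 3, and the function name are illustrative choices.

```python
from collections import Counter


def smoothed_transitions(sequences, states, prior, alpha):
    """Estimate T[i][j] = (n_ij + alpha * q_ij) / (n_i + alpha)."""
    counts = {i: Counter() for i in states}
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    estimates = {}
    for i in states:
        n_i = sum(counts[i].values())
        estimates[i] = {
            j: (counts[i][j] + alpha * prior[i][j]) / (n_i + alpha)
            for j in states
        }
    return estimates


states = "ABC"
# Assumed uniform prior over next states
uniform = {i: {j: 1.0 / len(states) for j in states} for i in states}
T = smoothed_transitions(["ABBCAABBCCBBAA"], states, uniform, alpha=3.0)
```

In the example sequence the transition A to C never occurs (A is followed by A twice and by B twice), so the maximum-likelihood estimate would be 0, while the smoothed estimate is (0 + 1) / (4 + 3) = 1/7.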


Parameter estimation for Markov models

• One simple way to set the prior parameters is
  – Consider alpha as the effective sample size
  – Partition the states into two sets, set 1 containing all states directly linked to state i and the remaining in set 2
  – Assign uniform probability r/K to all states in set 2 (all set 2 states are equally likely)
  – The remaining (1-r) can be either uniformly assigned among set 1 elements or weighted by some measure
  – Prior probabilities in and out of E can be set based on our prior knowledge of how likely we think a user is to exit the site from a particular state


Predicting page requests with Markov models

• Deshpande and Karypis (2001) propose schemes to prune kth-order Markov state space
  – Provide systematic but modest improvements

• Another way is to use empirical smoothing techniques that combine different models from order 1 to order k (Chen and Goodman 1996)


Mixtures of Markov Chains

• Cadez et al. (2003) and Sen and Hansen (2003) replace the first-order Markov chain

$$P(s_t \mid s_{t-1}, \ldots, s_1) = P(s_t \mid s_{t-1})$$

with a mixture of first-order Markov chains

$$P(s_t \mid s_{t-1}, \ldots, s_1) = \sum_{k=1}^{K} P(s_t \mid s_{t-1}, c = k)\, P(c = k)$$

where c is a discrete-valued hidden variable taking K values, $\sum_k P(c = k) = 1$, and $P(s_t \mid s_{t-1}, c = k)$ is the transition matrix for the kth mixture component

• One interpretation of this is that user behavior consists of K different navigation behaviors described by the K Markov chains


Modeling Web Page Requests with Markov chain mixtures

• MSNBC Web logs
  – 2 million individuals per day
  – different session lengths per individual
  – difficult visualization and clustering problem

• WebCanvas
  – uses mixtures of Markov chains to cluster individuals based on their observed sequences
  – software tool: EM mixture modeling + visualization


From Web logs to sequences

, -, 3/22/00, 10:35:11, W3SVC, SRVR1,, 781, 363, 875, 200, 0, GET, /top.html, -,
, -, 3/22/00, 10:35:16, W3SVC, SRVR1,, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,
, -, 3/22/00, 10:35:17, W3SVC, SRVR1,, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
, -, 3/22/00, 16:18:50, W3SVC, SRVR1,, 60, 425, 72, 304, 0, GET, /top.html, -,
, -, 3/22/00, 16:18:58, W3SVC, SRVR1,, 8322, 527, 414, 200, 0, POST, /spt/main.html, -,
, -, 3/22/00, 16:18:59, W3SVC, SRVR1,, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
, -, 3/22/00, 20:54:37, W3SVC, SRVR1,, 140, 199, 875, 200, 0, GET, /top.html, -,
, -, 3/22/00, 20:54:55, W3SVC, SRVR1,, 17766, 365, 414, 200, 0, POST, /spt/main.html, -,
, -, 3/22/00, 20:54:55, W3SVC, SRVR1,, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
, -, 3/22/00, 20:55:07, W3SVC, SRVR1,, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
, -, 3/22/00, 20:55:36, W3SVC, SRVR1,, 1061, 382, 414, 200, 0, POST, /spt/main.html, -,
, -, 3/22/00, 20:55:36, W3SVC, SRVR1,, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
, -, 3/22/00, 20:55:39, W3SVC, SRVR1,, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
, -, 3/22/00, 20:56:03, W3SVC, SRVR1,, 1081, 382, 414, 200, 0, POST, /spt/main.html, -,
, -, 3/22/00, 20:56:04, W3SVC, SRVR1,, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
, -, 3/22/00, 20:56:33, W3SVC, SRVR1,, 0, 262, 72, 304, 0, GET, /top.html, -,
, -, 3/22/00, 20:56:52, W3SVC, SRVR1,, 19598, 382, 414, 200, 0, POST, /spt/main.html, -,

[Figure: sequences for User 1 to User 5]

Clusters of Finite State Machines

[Figure: finite state machines for Cluster 1, Cluster 2 and Cluster 3]


Learning Problem

• Assumptions
  – data is being generated by K different groups
  – each group is described by a stochastic finite state machine (SFSM)
    • aka, a Markov model with an end-state

• Given
  – A set of sequences from different users of different lengths

• Learn
  – A “mixture” of K different stochastic finite state machines

• Solution
  – EM is very easy: fractional counts of transitions
  – efficient and accurate, scales as O(KN)


Experimental Methodology

• Model Training:
  – fit 2 types of models
    • mixtures of histograms
    • mixtures of finite state machines
  – Train on a full day’s worth of MSNBC Web data

• Model Evaluation:
  – “one-step-ahead” prediction on unseen test data
    • Test sequences from a different day of Web logs
  – negative log probability = predictive entropy

[Figure: Predictive Entropy Out-of-Sample versus Number of mixture components [K], for Mixtures of Multinomials and Mixtures of SFSMs]

[Figure: panels plotted against R = Run Length, three categories per cluster. Cluster 1: Categories 13, 14, 8. Cluster 2: Categories 1, 7, 8. Cluster 3: Categories 12, 1, 13. Cluster 4: Categories 2, 1, 3. Cluster 5: Categories 9, 12, 6]

Timing Results

[Figure: timing results as a function of the Number of mixture components [K], N = 70,000]



• Software tool for Web log visualization
  – uses Markov mixtures to cluster data for display
  – in use by msnbc.com administrators at Microsoft
  – also being applied to non-Web data

• Model-based visualization
  – random sample of actual sequences
  – interactive tiled windows displayed for visualization
  – more effective than
    • planar graphs
    • traffic-flow movie in Microsoft Site Server v3.0


WebCanvas: Cadez, Heckerman, et al, 2003


Insights from WebCanvas

• From msnbc.com site administrators:
  – significant heterogeneity of behavior
  – relatively focused activity of many users
    • typically only 1 or 2 categories of pages
  – many individuals not entering via main page
  – detected problems with the weather page
  – missing transitions (e.g., tech <=> business)



• Adding time-dependence
  – adding time-between clicks, time of day effects

• Uncategorized Web pages
  – coupling page content with sequence models

• Modeling “switching” behaviors
  – allowing users to switch between models

• Individualized weights (hierarchical Bayes)

• Update: WebCanvas tool will be part of 2004 SQLServer release


Prediction with Markov mixtures

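$$\begin{aligned}
P(s_{t+1} \mid s_{[1,t]}) &= \sum_{k=1}^{K} P(s_{t+1}, k \mid s_{[1,t]}) \\
&= \sum_{k=1}^{K} P(s_{t+1} \mid k, s_{[1,t]})\, P(k \mid s_{[1,t]}) \\
&= \sum_{k=1}^{K} P(s_{t+1} \mid k, s_t)\, P(k \mid s_{[1,t]})
\end{aligned}$$

• $P(s_{t+1} \mid k, s_t)$: prediction of the kth component

• $P(k \mid s_{[1,t]})$: membership, based on sequence history

=> Predictions are a convex combination of K different component transition matrices, with weights based on sequence history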
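The convex-combination view can be sketched directly: compute each component's membership weight P(k | history) by Bayes' rule, then average the components' next-state distributions with those weights. The two-component model below ("sticky" versus "switchy" users over two page categories) is invented for illustration, and `mixture_predict` is a hypothetical helper, not code from WebCanvas.

```python
def mixture_predict(history, weights, initials, transitions):
    """Return (membership P(k | history), predictive distribution over next state)."""
    joint = []
    for w, init, trans in zip(weights, initials, transitions):
        # P(k) * P(history | k) under component k's first-order chain
        p = w * init[history[0]]
        for prev, cur in zip(history, history[1:]):
            p *= trans[prev][cur]
        joint.append(p)
    total = sum(joint)
    membership = [p / total for p in joint]
    last = history[-1]
    prediction = {
        state: sum(m * trans[last][state] for m, trans in zip(membership, transitions))
        for state in transitions[0][last]
    }
    return membership, prediction


# Illustrative two-component model over categories A and B
sticky = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.1, "B": 0.9}}
switchy = {"A": {"A": 0.1, "B": 0.9}, "B": {"A": 0.9, "B": 0.1}}
uniform = {"A": 0.5, "B": 0.5}
membership, prediction = mixture_predict(
    "AAAA", [0.5, 0.5], [uniform, uniform], [sticky, switchy]
)
```

After seeing AAAA, nearly all membership weight falls on the sticky component, so the prediction for the next request is close to that component's row for A. For long histories the products should be accumulated in log space to avoid underflow; this sketch omits that.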


Related Work

• Mixtures of Markov chains
  – special case: Poulsen (1990)
  – general case: Ridgeway (1997), Smyth (1997)

• Clustering of Web page sequences
  – non-probabilistic approaches (Fu et al, 1999)

• Markov models for prediction
  – Anderson et al (IJCAI, 2001):
    • mixtures of Markov chains outperform other sequential models for page-request prediction


Predicting page requests with Markov models

• K can be chosen by evaluating the out-of-sample predictive performance based on
  – Accuracy of prediction
  – Log probability score
  – Entropy

• Other variations:
  – Sen and Hansen 2003
  – Position-dependent Markov models (Anderson et al. 2001, 2002)
  – Zukerman et al. 1999


Modeling Clickrate Data

• Data
  – 200k Alexa users, client-side, over 24 hours
  – ignore URLs requested
  – goal is to build a time-series model that characterizes user click rates


Markov-Poisson Model

• Doubly stochastic process
  – Locally constant Poisson rate
  – indexed by M Markov states

• Fit a model with M = 3 states
  – absence of a Web session
  – Web session with slow click rate: 1 minute rate
  – Web session with rapid click rate: 10 second rate
  – Used hierarchical Bayes on individuals


Analysis of Search Engine Query Logs

Study                            # of Sample Queries             Source SE   Time Period
Lau & Horvitz                    4690 of 1 Million               Excite      Sep 1997
Silverstein et al                1 Billion                       AltaVista   6 weeks in Aug & Sep 1998
Spink et al (series of studies)  1 Million for each time period  Excite      Sep 1997, Dec 1999, May 2001
Xie & O’Hallaron                 110,000                         Vivisimo    35 days, Jan & Feb 2001
                                 1.9 Million                     Excite      8 hrs in a day, Dec 1999


Main Results

• Average number of terms in a query ranges from a low of 2.2 to a high of 2.6

• The most common number of terms in a query is 2

• The majority of users don’t refine their query
  – The share of users who viewed only a single page increased from 29% (1997) to 51% (2001) (Excite)
  – 85% of users viewed only the first page of search results (AltaVista)

• 45% (2001) of queries are about Commerce, Travel, Economy, People (was 20% in 1997)
  – The queries about adult or entertainment decreased from 20% (1997) to around 7% (2001)


Xie and O’Hallaron Study (2002)

• All four studies produced a generally consistent set of findings about user behavior in a search engine context
  – most users view relatively few pages per query
  – most users don’t use advanced search features

[Figure: query length distributions (bars) with Poisson model (dots and lines)]


Power-law Characteristics of Common Queries

• Frequency f(r) of Queries with Rank r
  – 110000 queries from Vivisimo
  – 1.9 Million queries from Excite

• There are strong regularities in terms of patterns of behavior in how we search the Web

Power-Law in log-log space
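The rank-frequency curve f(r) can be built by counting how often each distinct query occurs and sorting the counts in decreasing order; plotting log f(r) against log r then shows whether the points fall near a straight line. The sketch below covers only the counting step, with a made-up query list; `rank_frequency` is a hypothetical helper name.

```python
from collections import Counter


def rank_frequency(queries):
    """Return (rank, frequency) pairs, with rank 1 for the most common query."""
    counts = Counter(queries)
    return [
        (rank, freq)
        for rank, (_, freq) in enumerate(counts.most_common(), start=1)
    ]


# Toy query log, for illustration only
queries = ["weather", "news", "weather", "maps", "weather", "news"]
table = rank_frequency(queries)
```

For this toy log the table is [(1, 3), (2, 2), (3, 1)]. With real logs, ties in frequency get arbitrary relative ranks, which matters little on a log-log plot.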

Page 71: lecture17_Weblogmining.ppt

Data Mining Lectures Lecture 17: Web Log Mining Padhraic Smyth, UC Irvine


• Basic concepts in Web log data analysis

• Predictive modeling of Web navigation behavior– Markov modeling methods

• Analyzing search engine data

• Ecommerce aspects of Web log mining


The next few slides are from Ronny Kohavi, director of data mining and personalization at Amazon.com. His full set of slides is available online: see the PPT slides and related papers on ecommerce and data mining at http://robotics.stanford.edu/~ronnyk/ronnyk-




• Page request Web logs combined with
– Purchase (market-basket) information
– User address information (if they make a purchase)
– Demographics information (can be purchased)
– Emails to/from the customer

• Main focus here is to increase revenue
– Data mining is widely used at online commerce companies like

• This is a very rich source of problems for data mining
– What products should we advertise to this person?
– Can we do dynamic pricing?
– If a person buys X should we also suggest Y?
– Who are our best customers?
– etc


Combining Data Sources

• Comprehensive collection of US consumer and telephone data available via the internet
– Multi-sourced database
– Demographic, socioeconomic, and lifestyle information
– Information on most U.S. households
– Contributors’ files refreshed a minimum of 3-12 times per year
– Data sources include: County Real Estate Property Records, U.S. Telephone Directories, Public Information, Motor Vehicle Registrations, Census Directories, Credit Grantors, Public Records and Consumer Data, Driver’s Licenses, Voter Registrations, Product Registration Questionnaires, Catalogers, Magazines, Specialty Retailers, Packaged Goods Manufacturers, Accounts Receivable Files, Warranty Cards

• Much of this data can be accessed in real-time once a customer self-identifies


Map of Worldwide Revenue

UK – 98.8%

US – 0.6%

Australia – 0.1%

NOTE: About 50% of the non-UK orders are wedding list purchases

[Map legend: revenue shaded Low, Medium, High]

Although the Debenhams online site only ships within the UK, we see some revenue from the rest of the world.


Results from Blue-Martini

People who have a Travel and Entertainment credit card are 48% more likely to be online shoppers (27% for people with a premium credit card)

People whose home was built after 1990 are 45% more likely to be online shoppers

Households with income over $100K are 31% more likely to be online shoppers

People under the age of 45 are 17% more likely to be online shoppers

Online Consumer Demographics


A higher household income means you are more likely to be an online shopper

Demographics - Income


Demographics – Credit Cards

• The more credit cards, the more likely you are to be an online shopper


Example: Web Traffic


[Figure: Web traffic over time. Annotations: Sept-11 (note the significant drop in human traffic, but not bot traffic); registration at search engine sites; internal performance bot]


Product Affinities at MEC

• Minimum support for the associations is 80 customers
• Confidence: 37% of people who purchased the Orbit Sleeping Pad also purchased the Orbit Stuff Sack
• Lift: People who purchased the Orbit Sleeping Pad were 222 times more likely to purchase the Orbit Stuff Sack compared to the general population
Product: Orbit Sleeping Pad
  Association: Orbit Stuff Sack (lift 222, confidence 37%)
  Website recommended products: Cygnet Sleeping Bag, Aladdin 2 Backpack, Primus Stove

Product: Bambini Tights Children’s
  Association: Bambini Crewneck Sweater Children’s (lift 195, confidence 52%)
  Website recommended products: Yeti Crew Neck Pullover Children’s, Beneficial T’s Organic Long Sleeve T-Shirt Kids’

Product: Silk Crew Women’s
  Association: Silk Long Johns Women’s (lift 304, confidence 73%)
  Website recommended products: Micro Check Vee Sweater, Composite Jacket

Product: Cascade Entrant Overmitts
  Association: Polartec 300 Double Mitts (lift 51, confidence 48%)
  Website recommended products: Windstopper Alpine Hat, Tremblant 575 Vest Women’s
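The support, confidence, and lift figures quoted for the product affinities can be computed directly from market-basket data. The following is a minimal sketch using the standard definitions (support as the number of customers buying both items, confidence as P(B | A), lift as P(B | A) / P(B)); the baskets and item names are hypothetical and far smaller than the MEC data.

```python
def association_stats(baskets, a, b):
    """Return (support, confidence, lift) for the association a -> b.

    support    = number of baskets containing both a and b
    confidence = P(b | a)
    lift       = P(b | a) / P(b)
    """
    n = len(baskets)
    n_a = sum(1 for basket in baskets if a in basket)
    n_b = sum(1 for basket in baskets if b in basket)
    n_ab = sum(1 for basket in baskets if a in basket and b in basket)
    if n_a == 0 or n_b == 0:
        raise ValueError("both items must appear in at least one basket")
    confidence = n_ab / n_a
    lift = confidence / (n_b / n)
    return n_ab, confidence, lift

# Hypothetical purchases, one set of items per customer (illustration only).
baskets = [
    {"pad", "sack"},
    {"pad", "sack", "stove"},
    {"pad"},
    {"stove"},
    {"tent"},
    {"tent", "stove"},
    {"sack"},
    {"tent"},
]

support, confidence, lift = association_stats(baskets, "pad", "sack")
print(support, round(confidence, 3), round(lift, 3))
```

In practice, a minimum-support threshold (80 customers on the slide) would be applied before reporting an association, since lift estimated from a handful of baskets is unreliable.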


Customer Locations Relative to Retail Stores

Map of Canada with store locations.

Black dots show store locations.

Heavy purchasing areas away from retail stores can suggest new retail store locations.

There are no stores in several hot areas: MEC is building a store in Montreal right now.


Building The Customer Signature

• Building a customer signature takes significant effort, but is well worth it

• A signature summarizes customer or visitor behavior across hundreds of attributes, many of which are specific to the site

• Once a signature is built, it can be used to answer many questions.

• The mining algorithms will pick the most important attributes for each question

• Example attributes computed:
– Total Visits and Sales
– Revenue by Product Family
– Revenue by Month
– Customer State and Country
– Recency, Frequency, Monetary
– Latitude/Longitude from the Customer’s Postal Code
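To make the signature idea concrete, here is a small sketch that computes just the Recency, Frequency, and Monetary attributes per customer from a transaction list. The record layout, field names, and data are assumptions for illustration; a real signature would combine hundreds of such attributes.

```python
from datetime import date

def rfm_signature(transactions, as_of):
    """Build a per-customer Recency/Frequency/Monetary summary.

    transactions: iterable of (customer_id, purchase_date, amount)
    as_of: reference date used to measure recency in days
    """
    totals = {}
    for customer, day, amount in transactions:
        entry = totals.setdefault(
            customer, {"last": day, "frequency": 0, "monetary": 0.0}
        )
        entry["last"] = max(entry["last"], day)  # most recent purchase so far
        entry["frequency"] += 1                  # number of purchases
        entry["monetary"] += amount              # total amount spent
    return {
        customer: {
            "recency_days": (as_of - entry["last"]).days,
            "frequency": entry["frequency"],
            "monetary": entry["monetary"],
        }
        for customer, entry in totals.items()
    }

# Hypothetical transactions (illustration only).
transactions = [
    ("c1", date(2003, 1, 5), 40.0),
    ("c1", date(2003, 2, 20), 25.0),
    ("c2", date(2003, 1, 15), 120.0),
]

signatures = rfm_signature(transactions, as_of=date(2003, 3, 1))
for customer, attrs in signatures.items():
    print(customer, attrs)
```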