Insights into Customer Behavior from Clickstream Data

Ronald J. Nowling
Red Hat, Inc.
[email protected]
http://rnowling.github.io/

Slides: rnowling.github.io/static/rnowling_spark_summit_east_2016.pdf


Who Am I?

•  Software Engineer at Red Hat
•  Data Science Team, Emerging Technologies
   – Evaluate solutions in open-source Big Data space
   – Ensure software works for Red Hat customers
   – Promote data science internally through consulting projects
•  Apache Bigtop PMC


Clickstream Data

•  61 million page views
•  125,000 registered users
•  500,000 pages
•  125,000 knowledgebase articles


Potential Applications

•  Build customer profiles to aid sales teams
•  Recommendation system for knowledgebase
•  Improve customer portal search
•  Guide selection of new knowledgebase topics by content writers


StripFormatting -> CleanWords -> Vectorize -> Cluster

What are the different types of kernel packages in Red Hat Enterprise Linux?
=============================================================

Issue
------
What are the different types of kernel packages in Red Hat Enterprise Linux?

Environment
---------------
Red Hat Enterprise Linux

Resolution
------------
Red Hat Enterprise Linux contains the following kernel packages:

StripFormatting (markup and punctuation removed):

What are the different types of kernel packages in Red Hat Enterprise Linux

Issue
What are the different types of kernel packages in Red Hat Enterprise Linux

Environment
Red Hat Enterprise Linux

Resolution
Red Hat Enterprise Linux contains the following kernel packages some may not apply to your architecture and not all are available in all major releases kernel contains the kernel and following key features
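A formatting-stripping step of the kind shown above could be sketched as follows. This is a minimal sketch and not the code used in the talk: the function name `strip_formatting` and both regular expressions are assumptions, chosen to drop heading underline rows and punctuation while keeping words and line breaks.

```python
import re

def strip_formatting(text):
    """Drop heading underline rows and punctuation; keep words and line breaks."""
    lines = []
    for line in text.splitlines():
        # Skip rows made only of '=' or '-' characters (heading underlines).
        if re.fullmatch(r"\s*[=\-]{3,}\s*", line):
            continue
        # Replace anything that is not a letter, digit, or whitespace with a space.
        cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", line)
        # Collapse runs of whitespace left behind by the substitution.
        cleaned = " ".join(cleaned.split())
        if cleaned:
            lines.append(cleaned)
    return "\n".join(lines)

raw = (
    "Issue\n"
    "------\n"
    "What are the different types of kernel packages in Red Hat Enterprise Linux?\n"
)
print(strip_formatting(raw))
```

The sample input reuses the article excerpt from the slide; real knowledgebase markup may need additional rules.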


CleanWords (words reduced to stems):

What are the different type of kernel package in Red Hat Enterprise Linux

Issue
What are the different type of kernel package in Red Hat Enterprise Linux

Environment
Red Hat Enterprise Linux

Resolution
Red Hat Enterprise Linux contain the follow kernel package some may not apply to your architecture and not all are available in all major release kernel contain the kernel and follow key feature


CleanWords (stop words removed):

different type kernel package Red Hat Enterprise Linux

Issue
different type kernel package Red Hat Enterprise Linux

Environment
Red Hat Enterprise Linux

Resolution
Red Hat Enterprise Linux contain kernel package apply architecture available major release kernel contain kernel follow key feature
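The two word-cleaning effects visible in these examples, stemming followed by stop-word removal, can be sketched in plain Python. Everything here is an assumption for illustration: `naive_stem` is a crude suffix stripper standing in for a real stemmer, `STOP_WORDS` is a tiny hand-picked list, and lowercasing is added even though the slide text keeps capital letters.

```python
# Hypothetical, deliberately small stop-word list for the example sentence.
STOP_WORDS = {"what", "are", "the", "of", "in", "some", "may", "not",
              "to", "your", "and", "all"}

def naive_stem(word):
    """Crude stand-in for a real stemmer: strips 'ing' and a plural 's'."""
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def clean_words(text):
    """Lowercase, split on whitespace, stem, then drop stop words."""
    tokens = text.lower().split()
    stemmed = [naive_stem(token) for token in tokens]
    return [token for token in stemmed if token not in STOP_WORDS]

sentence = "What are the different types of kernel packages in Red Hat Enterprise Linux"
print(" ".join(clean_words(sentence)))
```

On the example sentence this yields the same words as the slide, in lowercase. A production pipeline would more likely use an established stemmer and stop-word list, and the deck itself cautions that stemming should be verified on your own data.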

Vectorize (word counts):

kernel: 5, red: 4, hat: 4, enterprise: 4, linux: 4, package: 3, contain: 3, different: 2, type: 2, intel: 2, environment: 1, resolution: 1, follow: 1, system: 1


kernel: 5, red: 4, hat: 4, enterprise: 4, linux: 4, package: 3, contain: 3
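Turning cleaned words into count vectors, as in the word counts above, can be sketched with the standard library. The helper names `build_vocabulary` and `vectorize` are hypothetical, and the two toy documents are made up for illustration; the talk does not say how its vectors were built.

```python
from collections import Counter

def build_vocabulary(documents):
    """Sorted list of every distinct word across all tokenized documents."""
    return sorted({token for doc in documents for token in doc})

def vectorize(tokens, vocabulary):
    """Count vector with one position per vocabulary word."""
    counts = Counter(tokens)
    # Counter returns 0 for words that do not occur in this document.
    return [counts[word] for word in vocabulary]

documents = [
    ["kernel", "package", "kernel"],
    ["red", "hat", "kernel"],
]
vocabulary = build_vocabulary(documents)
vectors = [vectorize(doc, vocabulary) for doc in documents]
print(vocabulary)
print(vectors)
```

Every article ends up with a vector of the same length, which is what a clustering step needs as input.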


Topics

•  openshift gear cartridge online node broker
•  vm rhev virtualization disk
•  glusterfs storage volume brick rhs glusterd node client mount geo
•  rhel support driver hp hardware version firmware card intel


Topic Article Counts


Clickstream Processing

[Diagram: three parallel branches, each Raw Daily Page Views -> Parse -> Clean & Filter; remaining boxes: Accounts, Project onto Topics, Aggregate Topic View Counts]
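The last two stages of the diagram, projecting page views onto topics and aggregating topic view counts, can be sketched as below. This is a plain-Python illustration under stated assumptions, not the Spark job from the talk: it assumes each cleaned page view is already an `(account, page)` pair and that a hypothetical `article_topics` mapping gives the topic of each clustered article. The function name and sample identifiers are made up.

```python
from collections import Counter, defaultdict

def aggregate_topic_views(page_views, article_topics):
    """Map each viewed page to its topic and count topic views per account."""
    profiles = defaultdict(Counter)
    for account, page in page_views:
        topic = article_topics.get(page)
        if topic is None:
            # Page is not a clustered knowledgebase article; ignore it.
            continue
        profiles[account][topic] += 1
    return dict(profiles)

# Hypothetical sample data.
page_views = [
    ("acct-1", "kb-100"),
    ("acct-1", "kb-200"),
    ("acct-1", "kb-100"),
    ("acct-2", "kb-300"),
    ("acct-2", "/downloads"),
]
article_topics = {"kb-100": "gluster", "kb-200": "networking", "kb-300": "openshift"}

profiles = aggregate_topic_views(page_views, article_topics)
for account, counts in sorted(profiles.items()):
    print(account, counts.most_common())
```

The per-account counters are one simple form a customer profile could take: the most viewed topic is a candidate dominant topic, and the rest are supporting ones.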


Customer Profiles

•  Dominant topics
   – JBoss
   – Red Hat Enterprise Virtualization
   – Hardware support
   – Gluster
   – Booting into rescue mode
   – Packages


•  Supporting topics
   – Logging
   – LDAP
   – Samba
   – High resource usage
   – File systems / LVM / block devices
   – Networking


•  JBoss and RHEV appear in combination with a number of other products
•  Some products only appear by themselves with supporting topics (logging, networking, filesystems)
   – OpenShift
   – Gluster


Topic Enrichments


Malformed TSV Files

•  Gzip files need to be read sequentially
•  Tab-separated, no quoting (in theory!)
•  Escaped tabs and newlines within records
   – E.g., \\n or \\t
•  Improperly escaped tabs and newlines
   – E.g., \\\t vs \\\\t
•  Extraneous unmatched quote marks
   – E.g., ‘some_user
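A defensive reader for files like these might look like the sketch below. It is written in plain Python rather than as a Hadoop input format, and its escaping convention is an assumption: a backslash followed by `t`, `n`, or another backslash stands for a tab, newline, or backslash. Fields are split on real tab characters only, quote marks are treated as ordinary characters so an unmatched quote cannot swallow later fields, and lines with the wrong field count are counted and skipped. The names `unescape` and `read_records` and the sample data are hypothetical.

```python
import gzip
import io

# Assumed escape convention: backslash + t / n / backslash.
ESCAPES = {"t": "\t", "n": "\n", "\\": "\\"}

def unescape(field):
    """Decode the assumed backslash escapes; leave anything else untouched."""
    out = []
    i = 0
    while i < len(field):
        if field[i] == "\\" and i + 1 < len(field) and field[i + 1] in ESCAPES:
            out.append(ESCAPES[field[i + 1]])
            i += 2
        else:
            out.append(field[i])
            i += 1
    return "".join(out)

def read_records(source, n_fields):
    """Stream a gzip-compressed TSV file; return (records, rejected line count)."""
    records, rejected = [], 0
    # Gzip streams are read front to back, one line at a time.
    with gzip.open(source, "rt", encoding="utf-8") as lines:
        for line in lines:
            fields = line.rstrip("\n").split("\t")
            if len(fields) == n_fields:
                records.append([unescape(f) for f in fields])
            else:
                rejected += 1
    return records, rejected

sample = "u1\t/articles/1\ta\\tb\n'some_user\t/articles/2\tok\nbroken line\n"
compressed = io.BytesIO(gzip.compress(sample.encode("utf-8")))
records, rejected = read_records(compressed, 3)
print(records, rejected)
```

Counting rejected lines instead of failing makes it easy to check how much data a given set of parsing rules loses before trusting the downstream results.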


Lessons Learned

•  Consider custom Hadoop input formats for tricky file formats
•  Verify everything: what works in general may not work for you
   – Stemming
   – Filtering most frequent words
   – K-Means vs LDA


•  K-Means
   – Improve accuracy: multiple runs, more iterations
•  Watch out for memory leaks
   – Un-persist cached RDDs
   – Un-persist broadcasted variables
•  Parquet for performance
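The K-Means advice above (multiple runs, more iterations) can be illustrated with a small plain-Python stand-in; it is not Spark code and not the implementation used in the talk. The sketch assumes points are tuples of floats, picks initial centers at random from the data with a per-run seed, and keeps the run with the lowest total squared distance. The function names, default values, and sample points are hypothetical.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations, seed):
    """One K-Means run; returns (centers, total squared distance to nearest center)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    cost = sum(min(squared_distance(p, c) for c in centers) for p in points)
    return centers, cost

def best_of_runs(points, k, runs=10, iterations=20):
    """Repeat K-Means with different seeds and keep the lowest-cost result."""
    results = (kmeans(points, k, iterations, seed) for seed in range(runs))
    return min(results, key=lambda result: result[1])

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, cost = best_of_runs(points, k=2)
print(sorted(centers), cost)
```

Because K-Means only finds a local optimum that depends on the starting centers, comparing the cost across several seeded runs is a cheap way to avoid a poor clustering.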


Potential Applications

•  Build customer profiles to aid sales teams
•  Recommendation system for knowledgebase
•  Improve customer portal search
•  Guide selection of new knowledgebase topics for content writers


Resources

http://rnowling.github.io/


QUESTIONS
