
Splunk for DataScience (.conf2014)


Copyright © 2014 Splunk Inc.

Tom LaGatta Data Scientist, Splunk

Olivier de Garrigues Sr Prof Services Consultant, Splunk

Splunk for Data Science


Disclaimer
During the course of this presentation, we may make forward-looking statements regarding future events or the expected performance of the company. We caution you that such statements reflect our current expectations and estimates based on factors currently known to us and that actual events or results could differ materially. For important factors that may cause actual results to differ from those contained in our forward-looking statements, please review our filings with the SEC. The forward-looking statements made in this presentation are being made as of the time and date of its live presentation. If reviewed after its live presentation, this presentation may not contain current or accurate information. We do not assume any obligation to update any forward-looking statements we may make. In addition, any information about our roadmap outlines our general product direction and is subject to change at any time without notice. It is for informational purposes only and shall not be incorporated into any contract or other commitment. Splunk undertakes no obligation either to develop the features or functionality described or to include any such feature or functionality in a future release.


3 Key Takeaways
1. Splunk is great for doing Data Science!
2. Splunk complements other tools in the Data Science toolkit.
3. Data Science is about extracting actionable insights from data.


About Us
• Tom LaGatta, Data Scientist
– Tom joined Splunk in Spring 2014 as a Data Scientist specializing in Probability and Statistics. Tom is an expert on the mathematics of inference, and he enjoys functional programming in languages like Clojure, Haskell & R. At Splunk, Tom is helping to develop our internal and external Data Science program and curriculum. Tom has a PhD in Mathematics from the University of Arizona, and until recently was a Courant Instructor at the Courant Institute at New York University. Tom is based in New York City.
• Olivier de Garrigues, Senior Professional Services Consultant
– Olivier is based in London on the EMEA Professional Services team and has helped more than 40 customers in 10 countries on various Splunk projects in the past year and a half. Prior to this, he worked as a quantitative analyst, making extensive use of MATLAB and R. He developed a keen interest in machine learning, helped develop the R Project App, and enjoys dreaming about how to make Splunk better for data scientists. Olivier holds an MS in Mathematics of Finance from Columbia University.


Splunk for Data Science


What is Data Science?
Data Science is about extracting actionable insights from data.
• Helps people make better decisions.
• Can be used for automated decision-making.
• Data Science is cross-functional, and blends techniques & theories from:
– CS / Programming
– Math and Statistics
– Machine Learning
– Data Mining / Databases
– Data Visualization
– Substantive / Domain Expertise
– Social Science
– Communication and Presentation
– Accounting, Finance and KPIs
– Business Analytics
• Don’t be afraid of Data Science!


Data Science & Analytics Teams
There is no “one size fits all” data scientist. Data Science & Analytics teams are made up of people with complementary skill sets.

Source: Schutt & O’Neil. Doing Data Science. 2013


Splunk for Data Science
Splunk is great for doing Data Science!
• Integrate, query & visualize all the data:
– Platform for machine data
– Connects with any other data source
• Easy-to-use Analytics capabilities.
• Powerful algorithms out-of-the-box.
• Sharp visualizations and dashboards.
• Deliver results to both IT & Business users.
• Complements other Data Science tools (next slide).


Splunk and Data Science Tools
Splunk complements other tools in the Data Science toolkit:
• Hadoop: the workhorse of the Data Science world. Using Hunk, you can integrate Hadoop & HDFS seamlessly into Splunk.
• R & Python: the preferred languages of Data Science. Execute R & Python scripts in your Splunk queries using the R Project App & SDK for Python.
• SQL & other RDBMS: valuable stores for customer & product data. Use Splunk’s DB Connect App to mash relational data up with machine data.
• External tools: export finalized data from Splunk using the ODBC Driver.
– Tip: do all your data processing in Splunk/Hunk, and export only the final results.
• D3 Custom Visualizations: sharp dashboards & reports using Splunk.


Splunk and Data Science Use Cases
Splunk is a powerful tool for lots of Data Science use cases:

Green Use Cases (easy out of the box) Yellow Use Cases (needs tinkering) Trend Forecasting D3 Custom Visualizations A/B Testing Predictive Modeling Root Cause Analysis Sentiment Analysis Anomaly Detection Conversion Funnel/Pathing Market Segmentation More Algorithms via R & Python Topic Modeling Capacity Planning Correlate Data from 2+ Sources Data Munging & Normalization KPIs & Executive Dashboards


Data Science Use Cases


Use Case: Trend Forecasting
Trend Forecasting: Given past & realtime data, predict future values & events.
• Common applications:
– Forecast revenue & other KPIs
– Web server traffic & product downloads
– Customer conversion rates
– Estimate MTTR & server outages
– Resource & capacity planning (AWS App)
– Security threats (Enterprise Security App)
• The “true” course of events can (and will) take only one of many divergent paths. But which one…?
• Be mindful of rare events & black swans!


Splunk Solution: predict
predict command: forecast future trajectories of time series.
• Implements a Kalman filter to identify seasonal trends.
• Gives an “uncertainty envelope” as a buffer around the trend.
• Tip: Always run the predict command on LOTS of past data. Capture low-frequency and high-frequency trends.
• Remember: the future is always uncertain…
• Remember: all forecasts are probabilistic. The predict command qualifies its estimate with an “uncertainty envelope”: this also accounts for past measurement error.
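
To make the Kalman filter and the “uncertainty envelope” concrete, here is a minimal Python sketch of a one-dimensional local-level Kalman filter. It is an illustration only, not Splunk’s implementation of predict: the function name, the `process_var` and `obs_var` parameters and their defaults are assumptions, and this simple model carries no seasonal component.

```python
import math

def local_level_forecast(series, horizon, process_var=1.0, obs_var=4.0):
    """Filter `series` with a 1-D local-level Kalman filter, then forecast.

    Hypothetical helper for illustration; parameter values are assumptions.
    Returns a list of (point, lower, upper) tuples, one per future step.
    """
    level, var = series[0], obs_var      # initial state estimate and variance
    for y in series[1:]:
        var += process_var               # predict step: uncertainty grows
        gain = var / (var + obs_var)     # Kalman gain
        level += gain * (y - level)      # update step: move toward observation
        var *= 1.0 - gain                # uncertainty shrinks after observing
    forecasts = []
    for _ in range(horizon):
        var += process_var               # no new data, so variance keeps growing
        half_width = 1.96 * math.sqrt(var + obs_var)  # approx. 95% envelope
        forecasts.append((level, level - half_width, level + half_width))
    return forecasts

for point, lower, upper in local_level_forecast([10, 12, 11, 13, 12], horizon=3):
    print(f"{point:.2f}  [{lower:.2f}, {upper:.2f}]")
```

The envelope widens with every step into the future, which is the behavior the two “Remember” bullets describe: the further out the forecast, the less certain it is.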


Splunk Solution: Predict App
David Carasso’s Predict App: forecast future values of individual events.
– 8-minute walkthrough: https://www.youtube.com/watch?v=ROvaqJigNFg
• Implements a Naïve Bayes classifier.
• You have to train models!
• Train a model to predict any target field using any reference field(s):
    ... | fields ref1, ref2, ..., target | train my_model from target
• Guess target field for incoming events:
    ... | guess my_model into target
• Temporal or non-temporal prediction (include _time among reference fields).
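
The train / guess workflow above can be sketched outside Splunk with a small categorical Naïve Bayes classifier in Python. This is a hedged illustration of the general technique, not the Predict App’s code: the `train` and `guess` function names only mirror the commands on the slide, and the smoothing scheme and sample events are assumptions.

```python
import math
from collections import Counter, defaultdict

def train(events, target):
    """Count labels and (field, value) pairs per label from training events."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)  # label -> Counter of (field, value)
    vocab = defaultdict(set)             # field -> values seen in training
    for event in events:
        label = event[target]
        label_counts[label] += 1
        for field, value in event.items():
            if field != target:
                value_counts[label][(field, value)] += 1
                vocab[field].add(value)
    return label_counts, value_counts, vocab

def guess(model, event):
    """Return the most probable label for an event lacking the target field."""
    label_counts, value_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)      # log prior
        for field, value in event.items():
            if field not in vocab:
                continue                 # ignore fields never seen in training
            count = value_counts[label][(field, value)]
            # Add-one smoothing, with one extra slot for unseen values.
            score += math.log((count + 1) / (n + len(vocab[field]) + 1))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up sample events, for illustration only.
events = [
    {"Genre": "Comedy", "Tag": "funny", "rating": "Like"},
    {"Genre": "Comedy", "Tag": "funny", "rating": "Like"},
    {"Genre": "Horror", "Tag": "gory", "rating": "Dislike"},
    {"Genre": "Horror", "Tag": "slow", "rating": "Dislike"},
]
model = train(events, target="rating")
print(guess(model, {"Genre": "Comedy", "Tag": "funny"}))
```

Working in log space avoids numeric underflow when many reference fields are multiplied together.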


Concept: Supervised Learning & Classification
Supervised learning: use observed training data to classify values of unknown testing data.
• predict command (Kalman filter):
– Training data = timechart of past & realtime values.
– Testing data = time range for future values.
• Predict App (Naïve Bayes classifier):
– Training data = events with reference & target fields.
– Testing data = events with reference fields but not target field.
• Tip: only deploy models & algorithms after extensive testing & evaluation.
• More powerful learning algorithms using R Project App or SDK for Python.


Demo: Predict App
• Train a model to predict movie Rating based on MovieID, UserID, Genre, Tag:
    index=movielens Timestamp < 1199188800 UserID=593*
    | eval original_rating = case(Rating<3,"Dislike", Rating=3,"Neutral", Rating>3,"Like")
    | fields original_rating MovieID UserID Genre Tag
    | train rating_model from original_rating
• Guess Rating for test data based on trained model:
    index=movielens Timestamp > 1199188800 UserID=593*
    | guess rating_model into guessed_rating
    | eval original_rating = case(Rating<3,"Dislike", Rating=3,"Neutral", Rating>3,"Like")
    | top original_rating guessed_rating
• Accuracy of model: correct on 97.6% of values.
• Tip: always train on LOTS of training data.
• Evaluate before deploying.
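
The accuracy figure in the demo is the share of test events whose guessed label matches the original label. A minimal Python sketch of that evaluation follows; the `evaluate` helper and the sample pairs are hypothetical, and the table it returns plays a role similar to the `top original_rating guessed_rating` output.

```python
from collections import Counter

def evaluate(pairs):
    """Compute accuracy and a frequency table from (original, guessed) pairs."""
    table = Counter(pairs)
    total = sum(table.values())
    if total == 0:
        raise ValueError("no test events to evaluate")
    correct = sum(n for (original, guessed), n in table.items()
                  if original == guessed)
    return correct / total, table.most_common()

# Made-up labels, for illustration only.
pairs = [("Like", "Like")] * 3 + [("Dislike", "Like")]
accuracy, table = evaluate(pairs)
print(f"accuracy = {accuracy:.1%}")
for (original, guessed), n in table:
    print(original, guessed, n)
```

Looking at the off-diagonal rows of the table (original differs from guessed) shows which labels the model confuses, which a single accuracy number hides.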


Use Case: Sentiment Analysis
Sentiment Analysis: the assignment of “emotional” labels to textual data.
• Can be simple +1 vs. -1, or more sophisticated: “happy”, “angry”, “sad”, etc.
• Analyze tweets, emails, news articles, logs or any other textual data!
– Social data correlates with other factors.
• Typically done via supervised learning:
– Train a model on labeled corpus of text.
– Test the model on incoming text data.
• Read more about Sentiment Analysis:
– Chapter 14 of Big Data Analytics Using Splunk (pp. 255-282).
– Michael Wilde & David Carasso. Social Media & Sentiment Analysis. .conf2012

[Figure: 2011 Irish General Election. Rank / share pairs: 3rd 17%, 8th 1.8%, 4th 10%, 1st 36%, 2nd 19%. r = .79]


Splunk Solution: Sentiment Analysis App
David Carasso’s Sentiment Analysis App assigns binary sentiment values to textual data (logs, tweets, email, etc.).
• Naïve Bayes classifier under the hood.
• Twitter & IMDB models out of the box.
• Can guess language of authorship, and “heat”, a measure of emotional charge.
• Tip: compare relative sentiment changes across time & groups.
• How to train your own models: http://answers.splunk.com/answers/59743


Demo: Sentiment Analysis App


Use Case: Anomaly Detection
• An anomaly (or outlier) is an event which is vastly dissimilar to other events.
• Anomaly Detection is one of Splunk’s most common use cases. Examples:
– Transactions which occur faster than humanly possible.
– DDoS attacks from IP address ranges.
– High-value customer purchase patterns.
• Quick techniques for finding statistical outliers:
– Non-average outliers: more than 2*stdev from the avg.
– Non-typical outliers: more than 1.5*IQR above perc75 or below perc25.
• Tip: save these as eventtypes for automated outlier detection.
• Once anomalies have been found, dig deeper to discover root causes.
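
The two quick outlier rules above can be sketched in a few lines of Python. This is an illustration of the rules as stated on the slide, with hypothetical function names and made-up sample values; Python’s quartile calculation may differ slightly from how Splunk computes perc25 and perc75.

```python
import statistics

def stdev_outliers(values, k=2.0):
    """Non-average outliers: more than k standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * stdev]

def iqr_outliers(values, k=1.5):
    """Non-typical outliers: more than k*IQR above Q3 or below Q1."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]

# Made-up response times with one obvious outlier.
response_times = [10, 11, 12, 11, 10, 12, 11, 10, 12, 11, 100]
print(stdev_outliers(response_times))
print(iqr_outliers(response_times))
```

The IQR rule is usually the more robust of the two, because a single extreme value inflates the mean and standard deviation but barely moves the quartiles.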


Splunk Solution: cluster
• Anomalies are dissimilar to other events (by definition).
• We can use clustering algorithms to help us detect anomalies:
– Non-anomalous events typically form a few large clusters.
– Anomalous events typically form lots of small clusters.
• Cluster your data, sort ascending:
    ... | cluster showcount=true labelonly=true | sort cluster_count cluster_label
• Remember: there is no “right way” to find all anomalies. Explore your data!
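
The idea that anomalies end up in small clusters can be sketched with a simple greedy text clustering in Python. This is not how the cluster command is implemented; the Jaccard token similarity, the `threshold` parameter and the sample log lines are all assumptions chosen for illustration.

```python
def tokens(text):
    """Split an event into a set of lowercase whitespace-separated tokens."""
    return set(text.lower().split())

def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def cluster_events(events, threshold=0.5):
    """Greedy one-pass clustering; returns clusters sorted smallest first."""
    clusters = []  # list of (representative token set, member events)
    for event in events:
        toks = tokens(event)
        for representative, members in clusters:
            if jaccard(toks, representative) >= threshold:
                members.append(event)
                break
        else:
            clusters.append((toks, [event]))  # start a new cluster
    return sorted((members for _, members in clusters), key=len)

# Made-up log lines, for illustration only.
logs = [
    "user alice login ok",
    "user bob login ok",
    "user carol login ok",
    "kernel panic on disk sda",
]
for members in cluster_events(logs):
    print(len(members), members[0])
```

Sorting ascending by cluster size puts the rarest kinds of events at the top, which is the same reason the slide sorts by cluster_count.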


Concept: Unsupervised Learning & Clustering
• A clustering algorithm is any process which groups together similar things (events, people, etc), and separates dissimilar things (events, people, etc).
• Clustering is unsupervised: choose labels based on patterns in the data.
• Clustering is in the eye of the beholder:
– Lots of different clustering algorithms.
– Lots of different similarity functions.
• Do not confuse with:
– Computer cluster: a group of computers working together as a single system.
– Splunk cluster: a group of Splunk indexers replicating indexes & external data.


Demo: cluster


Splunk Solution: Other Commands
• anomalies:
– Assigns an “unexpectedness” score to each event.
• anomalousvalue:
– Assigns an “anomaly score” to events with anomalous values.
• outlier:
– Removes or truncates outliers.
• kmeans:
– Powerful clustering algorithm. You choose k = # of clusters.
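
For readers unfamiliar with k-means, here is a minimal pure-Python sketch of the standard algorithm (Lloyd’s iteration) on numeric tuples. It illustrates the technique in general rather than the internals of the kmeans command; the function signature, the random initialization, the fixed iteration count and the sample points are assumptions.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Group numeric tuples into k clusters; returns (centroids, groups)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)    # start from k distinct data points
    groups = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its group.
        centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids, groups

# Made-up 2-D points forming two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
centroids, groups = kmeans(points, k=2)
for centroid, group in zip(centroids, groups):
    print(centroid, group)
```

Because you choose k up front, it is worth trying several values and comparing how tight the resulting clusters are before settling on one.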


Splunk Solution: Prelert (Partner App)
• Manages Anomaly Detection directly.
– Pre-built dashboards, alerts, API.
– Use cases: Security, IT Ops / APM, DevOps
– Godfrey Sullivan: “beautifully adjacent and complementary to what Splunk does”
• Can download from Splunk Apps.
– May save you time with Anomaly Detection.
– Can also be good source of inspiration for your own Anomaly Detection dashboards.
• Keep in mind Prelert is a paid app:
– Cost: $225/month @ 5GB


Use Case: Market Segmentation
• Market Segmentation: group customers according to common needs and priorities, and develop strategies to target them.
– Market segments are internally homogeneous, and externally heterogeneous, i.e., market segments are clusters of customers.
• Many reasons for Market Segmentation:
– Different market segments require different strategies.
– Customers in same segment have similar product preferences. Different segments, different preferences.
– Segments should be reasonably stable, to allow for historical analysis (good for Data Science).
• Use Splunk’s clustering algorithms to identify and label market segments!


Data Visualizations


Intro to Data Visualization
• Data Visualization is the creation and study of the visual representation of data, and is a vital part of Data Science.
• The goal of data visualization is to communicate information:
– Visualizations communicate complex ideas with clarity, precision, and efficiency.
– Transmission speed of the optic nerve is about 9Mb/sec – fast image processing.
– Pattern matching, edge detection.
– Visualizations pack lots of information into small spaces. More than text alone!


Telling Stories with Data Visualizations
• We process data in linear narratives: even dashboards go top-to-bottom.
• Visualizations help pierce the monotony of text, number & data streams.
• Think about the story you’re telling:
– Empathize with the viewer.
– What’s their takeaway?
• A good visualization tells its own story: “Island Nation Obtains Favourable Balance of Trade; Goes On To Rule The World.”
• Weave multiple visualizations together to tell more effective stories.

[Figure: William Playfair (1786)]


[Figure labeled “Splunk”. Source: New York Times. May 17, 2012]


[Figure labeled “Splunk”. Source: New York Times. May 17, 2012]


Tips for Effective Data Visualizations
• #1 tip: Plot the most important keys on x & y axes.
– You choose “most important.”
– You might need >1 visualization.
• Manipulate size, color and shape to convey additional information.
• Annotate, label and add icons ✔︎
• Use chart overlay to correlate data sources. Mix histograms & line charts ↑↑↑
• Manipulate numerical scale: linear vs. log scales (previous 2 slides).
• Read more about Data Visualization:
– Tableau’s whitepaper, Visual Analysis Best Practices (2013).
– Edward Tufte’s The Visual Display of Quantitative Information (2001).


D3 Custom Visualizations in Splunk
• Splunk now supports D3 visualizations with some minor customization.
• Satoshi’s talk: “I want that cool viz in Splunk!”
• Resources for Custom Visualizations:
– Splunk Web Framework Toolkit: https://apps.splunk.com/app/1613/
– Splunk 6.x Dashboard Examples: https://apps.splunk.com/app/1603/
– Custom SimpleXML Extensions: http://apps.splunk.com/app/1772/
– Lots more D3 visualizations for use: https://github.com/mbostock/d3/wiki/Gallery


Demo: Sankey Chart


How-to for Sankey Charts
• Install the Custom SimpleXML Extensions app: http://apps.splunk.com/app/1772/
• Create your own app, and install Sankey chart components:
– Drop autodiscover.js in $SPLUNK_HOME/etc/apps/<YOURAPP>/appserver/static
– Copy & paste /sankeychart/ subfolder into $SPLUNK_HOME/etc/apps/<YOURAPP>/appserver/static/components
– Restart Splunk.
• In your dashboard:
– Include script="autodiscover.js" in <form> or <dashboard> opening tag
– Insert XML snippet from 2- or 3-node Sankey dashboard example
– Change 2 instances of “custom_simplexml_extensions” to <YOURAPP>.
– Update search and “data-options” parameters (nodes) in XML to reflect your data.


Know Your Audience
• Finally, keep in mind your audience: who are they, what questions do they care about, and how do they want to consume the data?
– Executive: KPIs, charts, tables with icons ✔︎
– Marketing Analyst: KPIs & metrics. Sharp images for their own reports & decks. Tableau.
– Data Scientist: output clean data to organized data stores (Hunk, HDFS, SQL, NoSQL).
– Sysadmin: sparklines, gauges for activity & MTTR, tables with highlighted anomalies.
– Security Ops: maps with detailed overlays, drill down on anomalous events.
• Bring it back to the business problem & use case!


3 Key Takeaways
1. Splunk is great for doing Data Science!
2. Splunk complements other tools in the Data Science toolkit.
3. Data Science is about extracting actionable insights from data.


List of References

Good books on Data Science:
• Schutt & O’Neil. Doing Data Science. O’Reilly 2013
• Provost & Fawcett. Data Science for Business. O’Reilly 2013
• Max Shron. Thinking With Data. O’Reilly 2014
• Edward Tufte. The Visual Display of Quantitative Information. Graphics Press 2001
• Zumel & Mount. Practical Data Science with R. Manning 2014
• Hastie et al. Elements of Statistical Learning. Springer-Verlag 2009 (free PDF!)

Using Splunk for Data Science:
• Zadrozny, Kodali (and Stout). Big Data Analytics Using Splunk. Apress 2013
• David Carasso. Exploring Splunk. CITO Research 2012
• David Carasso. Data Mining with Splunk. .conf2012
• Michael Wilde & David Carasso. Social Media & Sentiment Analysis. .conf2012

Good free references:
• Tableau. Visual Analysis Best Practices. Tableau 2013
• King & Magoulas. 2013 Data Science Salary Survey. O’Reilly 2013
• DJ Patil. Building Data Science Teams. O’Reilly 2013
• Cathy O’Neil. On Being A Data Skeptic. O’Reilly 2013