Copyright © 2014, SAS Institute Inc. All rights reserved. DATA SCIENCE: HYPE AND REALITY PATRICK HALL


About me

SAS Enterprise Miner, 2012

Cloudera Data Scientist, 2014

Presenter
Presentation Notes
Hard science and math background; not a statistician
Twitter and Quora: don’t believe the internet, but you can use it
Coolness

Statistician: Do you use Kolmogorov–Smirnov often?

Data Scientist: No, I mix my martinis with gin.

Presenter
Presentation Notes
Some data science jokes

Statistician: So, you have no SQL experience?

Data Scientist: That’s right, I have NoSQL experience.


Intro to data science


Historical roots

J. W. Tukey, The Future of Data Analysis, 1962

International Federation of Classification Societies, 1996

William Cleveland, Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics, 2001

Presenter
Presentation Notes
“Unlike other articles in the AMS, this paper presents no derivations, proves no theorems, espouses no optimality conditions. Instead it argues that alternatives to optimal procedures often are required for the statistical analysis of real-world data and it is the business of statisticians to attend to such solutions.” Published before other papers that had been previously accepted.
---
In computer science, the term had been in use since at least 1960. Used by Peter Naur in his 1974 survey of computer methods. Used intentionally in the IFCS conference proceedings title.
---
In his report, Cleveland establishes six technical areas that he believed encompass the field of data science: multidisciplinary investigations, models and methods for data, computing with data, pedagogy, tool evaluation, and theory.
---
2008: DJ Patil and Jeff Hammerbacher call themselves data scientists at LinkedIn and Facebook, respectively: scalable, production systems

Data science Venn diagram 1.0

Source: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Drew Conway, 2010

Presenter
Presentation Notes
Best definition of a data scientist I know of.

Data science Venn diagram 2.0

http://joelgrus.com/wp-content/uploads/2013/06/VennDiagram2.png


Source: "Of the unicorn" by Special Collections, University of Houston Libraries - http://digital.lib.uh.edu/u?/p15195coll18,33. Licensed under CC0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Oftheunicorn.jpg#/media/File:Oftheunicorn.jpg

Presenter
Presentation Notes
“Unicorn” cannot suffice as a job description or as the level of skill one must attain to be a data scientist. Are medical doctors unicorns? Maybe some are, but most are people who are very dedicated and hardworking; the same standard should apply to data scientists.
Unicorns are great, and if you find one, good for you. The best thing about them is that they can communicate between different specialties.
A team of specialists might be a more sustainable and realistic approach.

Intro to machine learning



A closer look at machine learning

SUPERVISED LEARNING (know Y)
• Regression, LASSO regression, logistic regression, ridge regression
• Decision tree, gradient boosting, random forests
• Neural networks, SVM, naïve Bayes, neighbors, Gaussian processes

UNSUPERVISED LEARNING (don’t know Y)
• A priori rules, clustering
• k-means clustering, mean shift clustering, spectral clustering
• Kernel density estimation, nonnegative matrix factorization, PCA
• Kernel PCA, sparse PCA
• Singular value decomposition, SOM

SEMI-SUPERVISED LEARNING (sometimes know Y)
• Prediction and classification*, clustering*, EM, TSVM, manifold regularization, autoencoders
• Multilayer perceptron, restricted Boltzmann machines

Presenter
Presentation Notes
“Field of study that gives computers the ability to learn without being explicitly programmed”
Fewer assumptions about data models
Leo Breiman, Statistical Modeling: The Two Cultures, 2001
It doesn’t solve all your problems; it is just another tool in the toolkit
Why is it used today?
- Nonlinear phenomena captured in images, text, and semi-structured data
- Wide data
- Sparse data
- Highly correlated data
- High cardinality categorical variables

Sacrificing interpretability for accuracy

[Figure panels: hill and plateau sample data; traditional regression; decision tree; neural network]


The shocking truth revealed!

http://www.kdnuggets.com/2015/10/deep-learning-vapnik-einstein-devil-yandex-conference.html

Presenter
Presentation Notes
“God does not play dice”: Einstein insisted on elegant solutions. Most of ML is messy and brute force. … Professor Vapnik: “God is clever, the devil uses brute force.”

Most time is spent cleaning and preprocessing the data!


Small data tools


• Multicore CPU
• GPU
• Solid state drive (SSD)
• 64+ GB of RAM
• Scalable algorithms

[Diagram labels: Workstation, Data, Data scientist; MPI Based, Software client, Software server, Data, Data scientist]

Presenter
Presentation Notes
The typical cutoff for needing Hadoop is 1 TB, but not if you are doing sophisticated analytics
SGD: scalable, single-threaded
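
The note about SGD being scalable on a single thread can be illustrated with a minimal sketch. It assumes a one-variable linear model and a hypothetical data source, stream_rows, that yields one row at a time; neither comes from the talk. The point is only that stochastic gradient descent touches one row per update, so the full data set never has to sit in memory.

```python
import random

def stream_rows(n, seed=0):
    """Hypothetical data source: yields (x, y) pairs one at a time.

    The rows follow y = 2x + 1 plus a little Gaussian noise, so the
    true slope and intercept are known for checking the result.
    """
    rng = random.Random(seed)
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        yield x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.1)

def sgd_linear(rows, lr=0.05):
    """Single-threaded SGD for y ~ w*x + b using squared-error loss."""
    w, b = 0.0, 0.0
    for x, y in rows:
        err = (w * x + b) - y   # prediction error on this one row
        w -= lr * err * x       # gradient step for the slope
        b -= lr * err           # gradient step for the intercept
    return w, b

# One pass over 20,000 streamed rows; only one row is held at a time.
w, b = sgd_linear(stream_rows(20_000))
print(f"slope ~ {w:.2f}, intercept ~ {b:.2f}")
```

The estimates should land near the true slope of 2 and intercept of 1. The learning rate and row count are illustrative choices, not recommendations.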

How do we turn our insights into a production system?


ESTIMATION VS. PREDICTION: DIFFERENT MINDSETS

• Identify/Formulate Problem
• Data Preparation/Exploration
• Model Building
• Deploy Model
• Evaluate/Monitor Model

Estimation
• Regression, Discriminant Analysis
• Assumptions, Parsimony, Interpretation
• What happened? Why?

Prediction
• Machine Learning
• Production Deployment, Predictive Accuracy
• What will happen?


The ‘Analytics’ folks: I just built 850 new models. When can you put them into production?

The ‘IT’ folks: !!!???!!!

Presenter
Presentation Notes
Also many human barriers to production implementations

Big data tools


• Distributed storage platform
  Hadoop Distributed File System (HDFS)
  Massively parallel (MPP) databases

• Distributed analytics platform
  Disk-enabled: Hadoop MapReduce
  In-memory: H2O.ai, SAS® High-Performance Analytics, SAS® LASR Analytic Server, Spark ML/MLlib

[Diagram labels: MPI Based, Data scientist, Software client, Distributed data and software on multiple servers]

Presenter
Presentation Notes
If something was going to take 1,000,000 seconds, and with your 4-node cluster it is going to take 300,000 seconds, is that really worth it?
These distributed systems are difficult to build and manage. Would your money be better spent on a TB of RAM on a single server? Maybe.
General issues of distributed computing
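
The arithmetic behind that question can be sketched as follows. The run times and node count are the ones in the note; "speedup" and "parallel efficiency" are the standard ratios, and the function names are illustrative only.

```python
def speedup(serial_seconds, parallel_seconds):
    """How many times faster the cluster run is than the single-machine run."""
    return serial_seconds / parallel_seconds

def efficiency(serial_seconds, parallel_seconds, nodes):
    """Speedup per node; 1.0 would mean perfect linear scaling."""
    return speedup(serial_seconds, parallel_seconds) / nodes

serial = 1_000_000    # seconds on one machine (from the note)
parallel = 300_000    # seconds on the 4-node cluster (from the note)
nodes = 4

s = speedup(serial, parallel)
e = efficiency(serial, parallel, nodes)
days_saved = (serial - parallel) / 86_400   # 86,400 seconds per day

print(f"speedup: {s:.2f}x, efficiency: {e:.0%}, time saved: {days_saved:.1f} days")
```

With these numbers the cluster is about 3.3 times faster at roughly 83% efficiency, which saves about 8 days on an 11.6-day job. Whether that justifies building and running the cluster is the judgment call the note raises.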

Data growth

[Figure: world’s data in zettabytes, 1991 to 2016; vertical axis from 0 to 50]

SOURCE: Oracle 2012

Presenter
Presentation Notes
Been accumulating massive amounts of data
New types of data

Data growth

(1 zettabyte = 1 billion terabytes)

Presenter
Presentation Notes
Just in case you were not familiar: a zettabyte is a billion terabytes

In 2008
Typical server hard drive was 500 GB with a transfer rate of 98 MB/sec
An entire disk could be transferred in 85 minutes

In 2013
Typical server hard drive was 4 TB with a transfer rate of 150 MB/sec
An entire disk could be transferred in 440 minutes
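
The transfer times on this slide can be checked with a short calculation. This is a sketch that assumes decimal units (1 GB = 1,000 MB, 1 TB = 1,000 GB) and a sustained sequential read at the quoted rate; the function name is illustrative.

```python
def full_read_minutes(capacity_gb, rate_mb_per_s):
    """Minutes needed to read an entire disk sequentially (1 GB = 1,000 MB)."""
    seconds = capacity_gb * 1000 / rate_mb_per_s
    return seconds / 60

t2008 = full_read_minutes(500, 98)     # 500 GB at 98 MB/sec
t2013 = full_read_minutes(4000, 150)   # 4 TB at 150 MB/sec

print(f"2008: {t2008:.0f} min, 2013: {t2013:.0f} min")
print(f"capacity grew {4000 / 500:.0f}x, transfer rate grew {150 / 98:.2f}x")
```

Under these assumptions the results are about 85 minutes for 2008 and about 444 minutes for 2013, consistent with the slide's rounded figures. Capacity grew 8 times while the transfer rate grew only about 1.5 times, which is why reading a full disk takes roughly five times longer.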

Presenter
Presentation Notes
Relative disk speeds slowing down

[Figure: average price of 1 MB of RAM, 2000 to 2010; vertical axis from $0.00 to $1.20]

Presenter
Presentation Notes
RAM is much cheaper

[Figure: CPU speed in MHz, 1978 to 2008; vertical axis from 0 to 4000]

Presenter
Presentation Notes
Not cost effective to make single processors any faster
http://www.maximumpc.com/article/features/a_brief_history_cpus_31_awesome_years_x86?page=0,6

• Disk capacities are getting bigger, but disks are not spinning faster

• Processors are not running much faster, but they have more cores

• RAM is becoming affordable


So …

• To handle all of this new data we distribute it on clusters of computers

• Most modern analytical architectures take advantage of in-memory, distributed processing


Presenter
Presentation Notes
“commodity hardware”

Hadoop and Spark

• Bulk ETL

• Batch processing

• Deployment

• Online transactions

• Advanced Analytics

MapReduce is a difficult framework for iterative, sophisticated algorithms
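
One way to see why iterative algorithms fit MapReduce awkwardly is to write one as map and reduce steps. The sketch below is a toy, in-process illustration using one-dimensional k-means; it does not use Hadoop, and the data, starting centroids, and function names are made up for the example. Each trip through the loop corresponds to one full map-and-reduce pass, which in Hadoop MapReduce would typically be a separate job that rereads its input from disk.

```python
from collections import defaultdict

def map_step(points, centroids):
    """Map: emit (index of nearest centroid, point) for every point."""
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        yield nearest, p

def reduce_step(pairs, centroids):
    """Reduce: replace each centroid with the mean of its assigned points."""
    groups = defaultdict(list)
    for key, p in pairs:
        groups[key].append(p)
    # A centroid with no assigned points keeps its previous position.
    return [sum(groups[i]) / len(groups[i]) if groups[i] else c
            for i, c in enumerate(centroids)]

points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.5]   # toy data, two obvious clusters
centroids = [0.0, 5.0]                      # arbitrary starting guesses

passes = 0
while True:
    new_centroids = reduce_step(map_step(points, centroids), centroids)
    passes += 1                             # one full pass over all the data
    if new_centroids == centroids:          # assignments stopped changing
        break
    centroids = new_centroids

print(f"converged after {passes} passes: {centroids}")
```

Even this tiny example needs more than one pass to confirm convergence. Real problems may need many, and keeping the data in memory between passes is the motivation for the in-memory platforms mentioned elsewhere in the deck.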


https://github.com/szilard/benchm-ml

Presenter
Presentation Notes
2013: Mahout mostly deprecated
https://issues.apache.org/jira/browse/MAHOUT-1250
2014: “Apache Mahout … has reached the end of its road” – and Oryx is not popular either
http://gigaom.com/2014/02/28/cloudera-is-rebuilding-machine-learning-for-hadoop-with-oryx/
2014: Significant parts of MLlib API deprecated
“To comment on the versioning stuff here, ‘deprecated’ doesn't mean unsupported, it just means we encourage using something else. So the old MLlib API will remain in 1.x, and will continue getting tested and bug-fixed, but it will not get new features.” -- Matei Zaharia
https://issues.apache.org/jira/browse/SPARK-3530
2015: Spark underperforms in comparison to other open source tools
https://github.com/szilard/benchm-ml
2015: “Sorry, but Spark and Hadoop are really not that good.”
https://www.reddit.com/r/bigdata/comments/3t2sjr/sorry_but_spark_and_hadoop_are_really_not_that/
2015: “5 Things We Hate About Spark”
http://www.infoworld.com/article/3004460/application-development/5-things-we-hate-about-spark.html

• “Hadoop Corporate Adoption Remains Low”

• Death of RDBMS exaggerated

• Big data adoption will require time

Presenter
Presentation Notes
“In a poll of 284 global IT and business leaders, 54% said they had no plans to invest in Hadoop at this time. About 18% said they planned to invest within the next two years.”
Sampling
Redundancy

Parting shot


Use the scientific method.
http://www.sas.com/en_us/insights/articles/analytics/keeping-the-science-in-data-science.html


Where you can find me …

Keep the Science in Data Science
http://www.sas.com/en_us/insights/articles/analytics/keeping-the-science-in-data-science.html

An Introduction to Machine Learning
http://blogs.sas.com/content/sascom/2015/08/11/an-introduction-to-machine-learning/

SAS Data Mining Community
https://communities.sas.com/

Quora: www.quora.com
GitHub: github.com/jphall663, github.com/sassoftware
Twitter: @jpatrickhall