Introduction to Data Science: A Practical Approach to Big Data Analytics

Ivan Khvostishkov, EMC²

3 March 2016, Deutsche Bank Development Center, Moscow




FOUR “V”S OF BIG DATA

• Volume
• Velocity
• Variety
• Variability


DATA SCIENCE VS. BUSINESS INTELLIGENCE

[Chart: Business Intelligence and Data Science plotted by time (past to future) against business value (low to high)]


DATA SCIENCE AND INNOVATION

[Chart: DS and EDW placed by business value (low to high), by mode (exploratory, agile vs. operational, stable), and by latency (real-time, non-real-time, very long time)]


INDUSTRY VERTICALS: EXAMPLES

• Health Care
• Public Services
• Life Sciences
• IT Infrastructure
• Online Services


MACHINE LEARNING ALGORITHMS: BASIC OVERVIEW

Unsupervised (learning structure from unlabeled data)
• K-means clustering
• Association rules

Supervised
• Linear regression
• Logistic regression
• Naïve Bayesian classifier
• Decision trees
• Time series analysis
• Text analytics


K-MEANS CLUSTERING: CLUSTERING SIMILAR DOCUMENTS, EVENTS

• Choose centroids, then assign each data point to the cluster of its nearest centroid
• See also: k-nearest neighbors (regression, classification)
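
The centroid and assignment steps can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the sample points, the choice of the first k points as initial centroids, and the fixed iteration count are all assumptions made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Cluster points into k groups; returns (centroids, labels)."""
    # Assumption for this sketch: the first k points are the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [
            min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                centroids[i] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, labels


# Hypothetical data: two well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

Real projects usually pick initial centroids more carefully, because a poor start can leave k-means in a bad local optimum.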


ASSOCIATION RULES: APRIORI, AN EARLY ALGORITHM

• {bread, eggs} -> {milk}
• Frequent itemset, support
  – How often the items occur together
  – e.g. in 50% of transactions
• Confidence
  – Ratio of the support of {X, Y} to the support of X
  – e.g. 80% = interesting
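
Support and confidence for the {bread, eggs} -> {milk} rule can be computed directly from a list of transactions. The sketch below uses a made-up basket list and only shows the two measures; it is not the Apriori candidate-generation algorithm itself.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """For the rule X -> Y: support of X and Y together, divided by support of X."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical shopping baskets, invented for illustration.
transactions = [
    {"bread", "eggs", "milk"},
    {"bread", "eggs", "milk"},
    {"bread", "eggs"},
    {"milk"},
    {"bread"},
]

rule_support = support(transactions, {"bread", "eggs", "milk"})
rule_confidence = confidence(transactions, {"bread", "eggs"}, {"milk"})
print(rule_support, rule_confidence)
```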


LINEAR REGRESSION

fdq_rate = -0.9 + 0.66 CurrentUnem + 1.06 ChgInUnemp1yr + 0.22 HiCostMortRate

* What-if scenario
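
A what-if scenario amounts to evaluating the fitted equation on changed inputs. The sketch below takes the coefficients from the regression equation above; the input values are hypothetical and chosen only to show the mechanics.

```python
# Coefficients copied from the fitted equation on this slide.
INTERCEPT = -0.9
COEFFICIENTS = {
    "CurrentUnem": 0.66,
    "ChgInUnemp1yr": 1.06,
    "HiCostMortRate": 0.22,
}


def predict_fdq_rate(features):
    """Evaluate the linear model for one set of feature values."""
    return INTERCEPT + sum(
        weight * features[name] for name, weight in COEFFICIENTS.items()
    )


# Hypothetical inputs, not taken from the talk.
baseline = {"CurrentUnem": 5.0, "ChgInUnemp1yr": 0.0, "HiCostMortRate": 10.0}
# What if ChgInUnemp1yr rises by one unit, everything else unchanged?
scenario = dict(baseline, ChgInUnemp1yr=1.0)

baseline_rate = predict_fdq_rate(baseline)
scenario_rate = predict_fdq_rate(scenario)
print(baseline_rate, scenario_rate)
```

Because the model is linear, the difference between the two predictions equals the coefficient of the changed variable times the size of the change.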


LOGISTIC REGRESSION

Receiver Operating Characteristic (ROC) curve
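
A ROC curve is traced by sweeping the decision threshold over a classifier's scores and recording the false positive rate against the true positive rate. The sketch below assumes the scores come from a logistic (sigmoid) function applied to made-up linear outputs; the labels and values are illustrative only.

```python
import math


def sigmoid(z):
    """Logistic function: maps a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def roc_points(labels, scores):
    """(false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for y, p in zip(labels, predicted) if p and y == 1)
        fp = sum(1 for y, p in zip(labels, predicted) if p and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC curve."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


# Hypothetical labels (1 = positive) and linear model outputs.
labels = [1, 1, 0, 1, 0, 0]
scores = [sigmoid(z) for z in [2.0, 1.0, 0.5, 0.0, -1.0, -2.0]]
auc = area_under_curve(roc_points(labels, scores))
print(auc)
```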


NAÏVE BAYESIAN CLASSIFIER


DECISION TREES

• Entropy-based approach
• Conditional entropy
• See also: SVM
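
Entropy-based tree building scores a candidate split by how much it reduces label entropy, i.e. the entropy minus the conditional entropy given the feature. The sketch below computes both quantities on a tiny invented dataset; the feature and label names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def conditional_entropy(feature_values, labels):
    """Entropy of the labels after splitting on a feature."""
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    total = len(labels)
    return sum(len(g) / total * entropy(g) for g in groups.values())


# Hypothetical toy data: one feature separates the classes, one does not.
labels = ["yes", "yes", "no", "no"]
useful_feature = ["a", "a", "b", "b"]
useless_feature = ["a", "b", "a", "b"]

gain_useful = entropy(labels) - conditional_entropy(useful_feature, labels)
gain_useless = entropy(labels) - conditional_entropy(useless_feature, labels)
print(gain_useful, gain_useless)
```

A tree learner of this kind would split on the feature with the larger gain.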


TIME SERIES ANALYSIS

• ARMA model: Autoregressive Moving Average
• ARIMA model: Autoregressive Integrated Moving Average
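
An ARMA(1,1) process combines one autoregressive term with one moving-average term: x_t = phi * x_(t-1) + e_t + theta * e_(t-1). The sketch below simulates such a series and shows first-order differencing, the step that the "I" in ARIMA adds. The parameter values are arbitrary, and fitting a model to real data is out of scope here.

```python
import random


def simulate_arma11(n, phi, theta, sigma=1.0, seed=0):
    """Generate n values of x_t = phi*x_(t-1) + e_t + theta*e_(t-1)."""
    rng = random.Random(seed)
    previous_value, previous_noise = 0.0, 0.0
    series = []
    for _ in range(n):
        noise = rng.gauss(0.0, sigma)
        value = phi * previous_value + noise + theta * previous_noise
        series.append(value)
        previous_value, previous_noise = value, noise
    return series


def difference(series):
    """First-order differencing: successive changes of the series."""
    return [b - a for a, b in zip(series, series[1:])]


# Arbitrary illustrative parameters.
series = simulate_arma11(n=100, phi=0.5, theta=0.3)
changes = difference(series)
print(len(series), len(changes))
```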


TEXT ANALYSIS: CONCEPTS

• Bag of words
• Reverse (inverted) index
• Relevance (precision / recall)
• Term frequency (TF)
• Inverse document frequency (IDF)
• TF-IDF (improved relevance)
• PageRank, …
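
TF-IDF weights a term by how often it appears in a document (TF) and discounts it by how many documents contain it (IDF). The sketch below uses one common variant, relative term frequency times the natural log of N / document frequency; other variants exist, and the sample documents are invented.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document (bag-of-words model)."""
    tokenized = [doc.lower().split() for doc in documents]
    total_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(total_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical mini-corpus: "data" occurs in every document.
documents = ["data science", "data tools", "data data mining"]
weights = tf_idf(documents)
print(weights[0])
```

A term that appears in every document gets weight zero, which is why TF-IDF ranks distinctive words above common ones.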


USE THE RIGHT TOOLS: WHEN ALL YOU HAVE IS A HAMMER, EVERYTHING LOOKS LIKE A NAIL


BIG DATA LANDSCAPE IS BIG


SQL, NOSQL, HADOOP

• SQL databases were not designed to scale easily
  – Cost, > 10 TB?
  – OLTP vs. OLAP
• NoSQL databases: Big Data approach
  – Native format, tight integration
  – Compute is still the bottleneck
• Hadoop: put early, transform later
  – ETL vs. ELT
  – Sandboxing, loose integration patterns


HADOOP ECOSYSTEM


HAWQ: EX-GREENPLUM

* See also: Hive, Impala


SPARK


IN-MEMORY DATA GRID: APACHE GEODE AKA GEMFIRE


INDUSTRIAL PROJECT EXAMPLE: E-GOV.KZ

[Map: Saint Petersburg, Moscow, Astana, Almaty]

Data Size
• Public data: 1 TB
• Articles: 5 000 000
• Comments: 100 000 000
• Private data: 70 TB


QUALITY ANALYSIS SYSTEM: PROBLEM STATEMENT

[Diagram: Kazakhstan Government Services and Information Online; World Wide Web; Relevance; Sentiment]


DATA WORKFLOW

[Diagram: Resource 1 to Resource 5; EMC² parsers; NIT parsers; Hive import; Model execution; Results dump; Solr import; BI Dashboard]


NUTCH

[Diagram: crawl cycle. Components: seed URLs, CrawlDB, IndexDB, fetch list, WWW, fetched content, parsed text and data]

• Inject seed URLs
• Generate new segment
• Fetch URLs from the list
• Parse the content
• Update CrawlDB
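
The inject, generate, fetch, parse, update cycle can be illustrated with a toy in-memory crawler. This sketch does not use or imitate the Nutch API: the `crawl` function, the `fetch` callback, and the dictionary standing in for the web are all hypothetical names invented for the example.

```python
def crawl(seed_urls, fetch, rounds=2):
    """Run a fixed number of crawl rounds; returns (crawl_db, parsed)."""
    # Inject: seed URLs enter the crawl database as unfetched.
    crawl_db = {url: "unfetched" for url in seed_urls}
    parsed = {}
    for _ in range(rounds):
        # Generate: build the fetch list for this round.
        fetch_list = [u for u, status in crawl_db.items() if status == "unfetched"]
        for url in fetch_list:
            # Fetch and parse: the callback returns page text and outgoing links.
            text, links = fetch(url)
            parsed[url] = text
            crawl_db[url] = "fetched"
            # Update: newly discovered links join the crawl database.
            for link in links:
                crawl_db.setdefault(link, "unfetched")
    return crawl_db, parsed


# A tiny fake web: URL -> (page text, outgoing links).
fake_web = {
    "a": ("page A", ["b"]),
    "b": ("page B", ["c"]),
    "c": ("page C", []),
}

crawl_db, parsed = crawl(["a"], lambda url: fake_web[url], rounds=2)
print(crawl_db)
```

After two rounds the third page has been discovered but not yet fetched, which mirrors how each new segment only covers URLs known at generation time.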


CRAWLING VS. SCRAPING

Crawling
• Returns traffic back to the site

Scraping
• Doesn’t return traffic
• Extracts value


MACHINE LEARNING INSTRUMENTS

• TreeTagger
• Vowpal Wabbit
• Word2vec / Paragraph2vec


R


CLASSIFICATION METHOD

• Logistic regression
• Multiclass classification
• One-vs-all
• Accuracy

      Positive  Negative  Neutral
x0    0         0         1
x1    0         0         1
…     …         …         …
xn    1         0         0
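
In a one-vs-all setup, each class gets its own binary target column (as in the table above) and its own binary model; the final label is the class whose model scores highest. The sketch below shows the encoding, the combination step, and accuracy. The scores are made-up numbers standing in for the outputs of trained binary classifiers.

```python
CLASSES = ["positive", "negative", "neutral"]


def one_vs_all_targets(labels):
    """One binary target column per class, one row per example."""
    return [[1 if label == c else 0 for c in CLASSES] for label in labels]


def combine(scores):
    """Pick the class whose binary model gives the highest score."""
    best = max(range(len(CLASSES)), key=lambda i: scores[i])
    return CLASSES[best]


def accuracy(predicted, actual):
    """Share of examples where the predicted class matches the true class."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


targets = one_vs_all_targets(["neutral", "neutral", "positive"])

# Hypothetical per-class scores for two documents.
predictions = [combine([0.2, 0.1, 0.7]), combine([0.6, 0.3, 0.1])]
score = accuracy(predictions, ["neutral", "negative"])
print(targets, predictions, score)
```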


MODEL WORKFLOW

Step 1
• Cleaning
• Lemmatisation
• Preparing

Step 2
• One-vs-all models
• Combination
• Accuracy

Step 3
• Application
• Re-training if necessary


PITFALLS

• Private data access
• Data growth: 10-100x
• Hadoop cluster planning
• Nutch scraping integration is not easy
• Oozie is cumbersome
• Hive is not for BI, use HAWQ


DATA SCIENTIST

• Quantitative
• Curious & Creative
• Communicative & Collaborative
• Skeptical
• Technical


DATA ANALYTICS LIFECYCLE

• Discovery
• Data Preparation
• Model Planning
• Model Building
• Communicate Results
• Operationalize

70-80% of time


RESOURCES

• Deep Learning
• Visualization
• Machine Learning Course: https://www.coursera.org/learn/machine-learning
• Data Science and Big Data Analytics: http://eu.wiley.com/WileyCDA/WileyTitle/productCd-111887613X.html
• Online Twitter Sentiment Analysis: http://sentiment140.com/
• Amazon MTurk
• Meet-ups!