Introduction to Data Science: A Practical Approach to Big Data Analytics

INTRODUCTION TO DATA SCIENCE: A PRACTICAL APPROACH TO BIG DATA ANALYTICS

IVAN KHVOSTISHKOV, EMC2

MARCH 3, 2016, DEUTSCHE BANK DEVELOPMENT CENTER, MOSCOW

THE FOUR “V”S OF BIG DATA

• Volume
• Velocity
• Variety
• Variability

DATA SCIENCE VS. BUSINESS INTELLIGENCE

[Chart: Data Science and Business Intelligence plotted by time (past to future) and business value (low to high)]

DATA SCIENCE AND INNOVATION

[Chart: business value (low to high) against working mode, from exploratory and agile to operational and stable. Labels: DS, EDW; real-time, non real-time, very long time]

INDUSTRY VERTICALS: EXAMPLES

Health Care

Public Services

Life Sciences

IT Infrastructure

Online Services

…

MACHINE LEARNING ALGORITHMS: BASIC OVERVIEW

Unsupervised (learning structure from unlabeled data)
• K-means clustering
• Association rules

Supervised
• Linear regression
• Logistic regression
• Naïve Bayesian classifier
• Decision trees
• Time series analysis
• Text analytics

K-MEANS CLUSTERING

• Choose centroids, assign each data point to the cluster of its nearest centroid, then recompute the centroids and repeat
• See also: k-nearest neighbors (regression, classification)

Use case: clustering similar documents, events
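The assign-and-update loop behind k-means can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in the talk: the function names are made up for this example, and the initial centroids are simply the first k points (random or k-means++ initialisation is more common in practice).

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centroids[i])),
    )


def assign(points, centroids):
    """Assignment step: put every point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        clusters[nearest(p, centroids)].append(p)
    return clusters


def kmeans(points, k, max_iterations=100):
    # Assumption for this sketch: the first k points serve as initial centroids.
    centroids = list(points[:k])
    for _ in range(max_iterations):
        clusters = assign(points, centroids)
        # Update step: move each centroid to the mean of its members;
        # an empty cluster keeps its previous centroid.
        updated = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
        if updated == centroids:  # nothing moved, so the loop has converged
            break
        centroids = updated
    return centroids, assign(points, centroids)


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
```

On this toy data the two initial centroids start in the same group, and the update step pulls one of them over to the second group within a couple of iterations.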

ASSOCIATION RULES

• {bread, eggs} -> {milk}
• Frequent itemset, support
  – How often the items occur together
  – e.g. 50% of transactions
• Confidence
  – Share of transactions containing X that also contain Y: support({X, Y}) / support(X)
  – e.g. 80% = interesting

APRIORI: AN EARLY ALGORITHM
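Support and confidence for the rule {bread, eggs} -> {milk} can be computed directly from a list of transactions. The sketch below uses ten invented baskets, chosen so that the numbers land on the 50% and 80% figures mentioned on the slide; it illustrates the two measures only and is not an Apriori implementation.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)


def confidence(transactions, lhs, rhs):
    """Of the transactions containing lhs, the fraction that also contain rhs."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)


# Hypothetical baskets: 5 of 10 contain {bread, eggs}; 4 of those 5 also contain milk.
transactions = (
    [{"bread", "eggs", "milk"}] * 4
    + [{"bread", "eggs"}]
    + [{"milk"}, {"bread"}, {"eggs", "milk"}, {"milk"}, {"bread", "milk"}]
)

rule_support = support(transactions, {"bread", "eggs"})
rule_confidence = confidence(transactions, {"bread", "eggs"}, {"milk"})
```

Apriori speeds up the search for frequent itemsets by pruning: an itemset can only be frequent if all of its subsets are frequent.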

LINEAR REGRESSION

fdq_rate = -0.9 + 0.66 CurrentUnem + 1.06 ChgInUnemp1yr + 0.22 HiCostMortRate

* What-if scenario
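A what-if scenario with a fitted linear model amounts to changing one input and re-evaluating the equation. The sketch below reuses the coefficients from the slide; the input values are invented for illustration, and the variable names are copied from the slide exactly as written.

```python
INTERCEPT = -0.9
COEFFICIENTS = {
    "CurrentUnem": 0.66,
    "ChgInUnemp1yr": 1.06,
    "HiCostMortRate": 0.22,
}


def predict_fdq_rate(features):
    """Evaluate the fitted equation for one set of feature values."""
    return INTERCEPT + sum(
        weight * features[name] for name, weight in COEFFICIENTS.items()
    )


# Hypothetical inputs, not taken from the talk.
baseline = {"CurrentUnem": 5.0, "ChgInUnemp1yr": 0.0, "HiCostMortRate": 10.0}
# What if the one-year change in unemployment rises by one point?
what_if = dict(baseline, ChgInUnemp1yr=1.0)

delta = predict_fdq_rate(what_if) - predict_fdq_rate(baseline)
```

Because the model is linear, the difference between the two scenarios equals the coefficient of the changed variable times the size of the change.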

LOGISTIC REGRESSION

Receiver Operating Characteristic (ROC) curve
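Logistic regression turns a linear score into a probability with the sigmoid function, and an ROC curve shows how the true and false positive rates trade off as the decision threshold moves. The sketch below is a minimal illustration with invented scores and labels; it assumes all scores are distinct and does not fit a model.

```python
import math


def sigmoid(z):
    """Map a linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def roc_points(labels, probabilities):
    """(false positive rate, true positive rate) pairs as the threshold is lowered."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    # Walk through the examples from the highest probability to the lowest.
    for _, label in sorted(zip(probabilities, labels), reverse=True):
        if label == 1:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC curve."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical linear scores and true labels (1 = positive class).
linear_scores = [2.0, 1.0, 0.5, 0.0, -1.0, -2.0]
labels = [1, 1, 0, 1, 0, 0]
probabilities = [sigmoid(z) for z in linear_scores]
auc = area_under_curve(roc_points(labels, probabilities))
```

An area of 1.0 means the model ranks every positive above every negative; 0.5 is what random ranking would give.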

NAÏVE BAYESIAN CLASSIFIER

DECISION TREES

• Entropy-based approach
• Conditional entropy
• See also: SVM
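In the entropy-based approach, a tree picks the split that most reduces the entropy of the class labels. That reduction (information gain) is the entropy of the labels minus their conditional entropy given the feature. The sketch below computes both on a tiny invented data set; the function and variable names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def conditional_entropy(feature_values, labels):
    """Entropy of the labels within each feature value, weighted by group size."""
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    total = len(labels)
    return sum(len(group) / total * entropy(group) for group in groups.values())


def information_gain(feature_values, labels):
    return entropy(labels) - conditional_entropy(feature_values, labels)


# Hypothetical data: one feature separates the classes perfectly, the other not at all.
labels = ["yes", "yes", "no", "no"]
useful_feature = ["a", "a", "b", "b"]
useless_feature = ["a", "b", "a", "b"]

gain_useful = information_gain(useful_feature, labels)
gain_useless = information_gain(useless_feature, labels)
```

A tree builder would evaluate every candidate feature this way and split on the one with the largest gain.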

TIME SERIES ANALYSIS

• ARMA model: Autoregressive Moving Average
• ARIMA model: Autoregressive Integrated Moving Average
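For reference, the two models can be written out as follows. This is the standard textbook form rather than notation from the talk; the symbols (c for the constant, φ and θ for the coefficients, ε for white noise, B for the backshift operator) are the usual conventions.

```latex
% ARMA(p, q): autoregressive part of order p plus moving-average part of order q
X_t = c + \sum_{i=1}^{p} \varphi_i X_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}

% ARIMA(p, d, q): fit ARMA(p, q) to the series after differencing it d times
Y_t = (1 - B)^d X_t, \qquad B X_t = X_{t-1}
```

With d = 1 the differenced series is simply Y_t = X_t - X_{t-1}, which removes a linear trend before the ARMA part is fitted.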

TEXT ANALYSIS: CONCEPTS

• Bag of words
• Inverted index
• Relevance (precision / recall): term frequency (TF)
• Inverse document frequency (IDF)
• TF-IDF (improved relevance)
• PageRank, …
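Several of these concepts fit in a few lines of code. The sketch below builds an inverted index and TF-IDF weights over three invented documents, each already reduced to a bag of words. It uses the plain tf × log(N / df) variant; libraries often add smoothing or normalisation, so their numbers will differ.

```python
import math
from collections import Counter


def inverted_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, tokens in enumerate(documents):
        for term in tokens:
            index.setdefault(term, set()).add(doc_id)
    return index


def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in documents for term in set(tokens))
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        weights.append(
            {
                term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return weights


# Hypothetical documents as bags of words.
documents = [
    ["big", "data", "analytics"],
    ["big", "data", "tools"],
    ["big", "text"],
]
index = inverted_index(documents)
weights = tf_idf(documents)
```

A term that appears in every document ("big" here) gets an IDF of zero, which is how TF-IDF improves on raw term frequency as a relevance signal.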

USE THE RIGHT TOOLS

WHEN ALL YOU HAVE IS A HAMMER, EVERYTHING LOOKS LIKE A NAIL


BIG DATA LANDSCAPE IS BIG

SQL, NOSQL, HADOOP

• SQL databases were not designed to scale easily
  – Cost, > 10 TB?
  – OLTP vs OLAP
• NoSQL databases: Big Data approach
  – Native format, tight integration
  – Compute is still the bottleneck
• Hadoop: put early, transform later
  – ETL vs. ELT
  – Sandboxing, loose integration patterns


HADOOP ECOSYSTEM

HAWQ: EX-GREENPLUM

* See also: Hive, Impala


SPARK

IN-MEMORY DATA GRID: APACHE GEODE AKA GEMFIRE

INDUSTRIAL PROJECT EXAMPLE: E-GOV.KZ

Saint Petersburg, Moscow, Astana, Almaty

Data size
• Public data: 1 TB
• Articles: 5 000 000
• Comments: 100 000 000
• Private data: 70 TB

QUALITY ANALYSIS SYSTEM: PROBLEM STATEMENT

Kazakhstan Government Services and Information Online

World Wide Web

Relevance

Sentiment

DATA WORKFLOW

[Diagram labels: Resource 1 to Resource 5; EMC2 parsers; NIT parsers; Hive import; Model execution; Results dump; Solr import; BI Dashboard]

NUTCH

Crawl cycle:
• Inject seed urls
• Generate new segment (fetch list)
• Fetch urls from the list (fetched content from the WWW)
• Parse the content (parsed text and data)
• Update CrawlDB

Stores: CrawlDB, IndexDB
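The inject, generate, fetch, parse, update cycle can be imitated with a toy crawler over an in-memory "web". Everything in the sketch below is hypothetical: the URLs, the WEB dictionary and the function name are invented, and none of it is Nutch's API. It only shows how the CrawlDB drives each round.

```python
# Hypothetical in-memory "web": URL -> (page text, outgoing links).
WEB = {
    "http://a.example/": ("home page", ["http://a.example/news", "http://a.example/about"]),
    "http://a.example/news": ("news page", ["http://a.example/"]),
    "http://a.example/about": ("about page", []),
}


def crawl(seed_urls, max_rounds=10):
    # Inject seed urls into the crawl database.
    crawl_db = {url: "unfetched" for url in seed_urls}
    parsed_text = {}
    for _ in range(max_rounds):
        # Generate a new segment: the list of urls still to be fetched.
        fetch_list = [url for url, status in crawl_db.items() if status == "unfetched"]
        if not fetch_list:
            break
        # Fetch urls from the list (unknown urls yield an empty page).
        fetched = {url: WEB.get(url, ("", [])) for url in fetch_list}
        # Parse the content: keep the text, collect the outgoing links.
        discovered = []
        for url, (text, links) in fetched.items():
            parsed_text[url] = text
            discovered.extend(links)
        # Update the crawl database with fetch results and newly found urls.
        for url in fetch_list:
            crawl_db[url] = "fetched"
        for url in discovered:
            crawl_db.setdefault(url, "unfetched")
    return crawl_db, parsed_text


crawl_db, parsed_text = crawl(["http://a.example/"])
```

Each pass through the loop corresponds to one segment; the crawl stops when a round discovers no new urls.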

CRAWLING VS. SCRAPING

Crawling
• Returns traffic back to the site

Scraping
• Doesn’t return traffic
• Extracts value

MACHINE LEARNING INSTRUMENTS

TreeTagger

Vowpal Wabbit

Word2vec / Paragraph2vec


R

CLASSIFICATION METHOD

• Logistic Regression
• Multiclass classification
• One-vs-All
• Accuracy

      Positive  Negative  Neutral
x0    0         0         1
x1    0         0         1
…     …         …         …
xn    1         0         0
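One-vs-all reduces a three-class sentiment problem to three binary problems, one per class, and then picks the class whose binary model is most confident. The sketch below shows the target encoding from the slide's table, the combination step and the accuracy measure. The per-class scores are invented stand-ins for the outputs of three trained logistic regression models, which are not shown.

```python
CLASSES = ["Positive", "Negative", "Neutral"]


def one_vs_all_targets(labels):
    """One 0/1 target column per class: 1 where the example belongs to that class."""
    return [[1 if label == cls else 0 for cls in CLASSES] for label in labels]


def combine(scores):
    """Pick the class whose binary model produced the highest score."""
    best = max(range(len(CLASSES)), key=lambda i: scores[i])
    return CLASSES[best]


def accuracy(predicted, actual):
    """Fraction of examples whose predicted class matches the true class."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


actual = ["Neutral", "Neutral", "Positive"]
targets = one_vs_all_targets(actual)

# Hypothetical scores from the three binary models, one row per example.
model_scores = [
    [0.1, 0.2, 0.7],
    [0.3, 0.4, 0.3],
    [0.8, 0.1, 0.1],
]
predicted = [combine(row) for row in model_scores]
score = accuracy(predicted, actual)
```

Each column of `targets` is the training target for one binary model; the second example is misclassified here because the "Negative" model happens to score highest.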

MODEL WORKFLOW

Step 1
• Cleaning
• Lemmatisation
• Preparing

Step 2
• One-vs-all models
• Combination
• Accuracy

Step 3
• Application
• Re-training if necessary
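Step 1 of the workflow can be pictured as a small text pipeline. The sketch below is an assumption about what cleaning and lemmatisation might involve, not the project's code: the lemma table is a hypothetical stand-in for a real lemmatiser (the deck lists TreeTagger among its tools), and the cleaning rules are deliberately simple.

```python
import re

# Hypothetical lemma table; a real pipeline would call a lemmatiser instead.
LEMMAS = {"services": "service", "were": "be", "was": "be", "pages": "page"}


def clean(text):
    """Strip HTML tags and punctuation, lower-case, and split into tokens."""
    text = re.sub(r"<[^>]+>", " ", text)          # drop HTML tags
    text = re.sub(r"[^\w\s]", " ", text.lower())  # drop punctuation
    return text.split()


def lemmatise(tokens):
    """Replace each token with its lemma when one is known."""
    return [LEMMAS.get(token, token) for token in tokens]


def prepare(text):
    """Cleaning followed by lemmatisation, ready for feature extraction."""
    return lemmatise(clean(text))


tokens = prepare("<p>The services were slow!</p>")
```

The resulting token lists are what a bag-of-words or TF-IDF step would consume before the one-vs-all models in Step 2 are trained.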


PITFALLS

• Private data access
• Data growth: 10-100x
• Hadoop cluster planning
• Nutch scraping integration is not easy
• Oozie is cumbersome
• Hive is not for BI, use HAWQ


DATA SCIENTIST

• Quantitative
• Curious & Creative
• Communicative & Collaborative
• Skeptical
• Technical

DATA ANALYTICS LIFECYCLE

• Discovery
• Data Preparation
• Model Planning
• Model Building
• Communicate Results
• Operationalize

70-80% of time

RESOURCES

• Deep Learning
• Visualization
• Machine Learning Course: https://www.coursera.org/learn/machine-learning
• Data Science and Big Data Analytics: http://eu.wiley.com/WileyCDA/WileyTitle/productCd-111887613X.html
• Online Twitter Sentiment Analysis: http://sentiment140.com/
• Amazon MTurk
• Meet-ups!





[email protected]
