1
INTRODUCTION TO DATA SCIENCE: A PRACTICAL APPROACH TO BIG DATA ANALYTICS
IVAN KHVOSTISHKOV, EMC2
MARCH 3, 2016 – DEUTSCHE BANK DEVELOPMENT CENTER, MOSCOW
2
THE FOUR “V”S OF BIG DATA
• Volume
• Velocity
• Variety
• Variability
3
DATA SCIENCE VS. BUSINESS INTELLIGENCE
[Chart: Business Intelligence and Data Science positioned by time (past to future) and business value (low to high)]
4
DATA SCIENCE AND INNOVATION
[Chart: DS and EDW positioned by business value (low to high) and time (real-time, non real-time, very long time); region labels: Exploratory / Agile and Operational / Stable]
5
INDUSTRY VERTICALS
EXAMPLES
• Health Care
• Public Services
• Life Sciences
• IT Infrastructure
• Online Services
• …
6
MACHINE LEARNING ALGORITHMS
BASIC OVERVIEW
Unsupervised (learning structure from unlabeled data)
• K-means clustering
• Association Rules
Supervised
• Linear regression
• Logistic regression
• Naïve Bayesian Classifier
• Decision Trees
• Time series analysis
• Text analytics
7
K-MEANS CLUSTERING
CLUSTERING SIMILAR DOCUMENTS, EVENTS
• Choose centroids, assign each data point to the cluster of its nearest centroid
• See also: k-nearest neighbors (regression, classification)
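The two k-means steps named on this slide (choose centroids, assign points) can be sketched in plain Python. This is a minimal illustration, not the speaker's code: the function name `kmeans`, the toy points, and the choice of initial centroids are all assumptions made for the example.

```python
def kmeans(points, initial_centroids, iterations=10):
    """Plain k-means on equal-length numeric tuples (illustrative sketch)."""
    centroids = list(initial_centroids)
    k = len(centroids)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid
        # (squared Euclidean distance).
        assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment

# Hypothetical 2-D data: two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, assignment = kmeans(points, initial_centroids=[points[0], points[3]])
print(assignment)
```

In practice the result depends on the initial centroids, which is why libraries usually run several random restarts.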
8
ASSOCIATION RULES
APRIORI – EARLY ALGORITHM
• {bread, eggs} -> {milk}
• Frequent itemset, Support
– How often the items occur together
– e.g. 50% of transactions
• Confidence
– Ratio of the support of {X, Y} to the support of X
– e.g. 80% = interesting
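Support and confidence for the rule {bread, eggs} -> {milk} can be computed directly from their definitions. The sketch below assumes a tiny made-up transaction list; the function names are illustrative and this is not the Apriori algorithm itself, only the two measures it relies on.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(X and Y together) / support(X) for the rule X -> Y."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical shopping baskets.
transactions = [
    {"bread", "eggs", "milk"},
    {"bread", "eggs", "milk"},
    {"bread", "eggs"},
    {"milk"},
]
rule_support = support(transactions, {"bread", "eggs", "milk"})
rule_confidence = confidence(transactions, {"bread", "eggs"}, {"milk"})
print(rule_support, rule_confidence)
```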
9
LINEAR REGRESSION
fdq_rate = -0.9 + 0.66 CurrentUnem + 1.06 ChgInUnemp1yr + 0.22 HiCostMortRate
* What-if scenario
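A what-if scenario with this model amounts to changing one input and comparing predictions. The sketch below copies the coefficients from the slide; the input values and their units are hypothetical, chosen only to show the mechanics.

```python
# Intercept and coefficients as printed on the slide.
INTERCEPT = -0.9
COEFFICIENTS = {
    "CurrentUnem": 0.66,
    "ChgInUnemp1yr": 1.06,
    "HiCostMortRate": 0.22,
}

def predict_fdq_rate(inputs):
    """Evaluate the fitted linear model for one set of predictor values."""
    return INTERCEPT + sum(COEFFICIENTS[name] * inputs[name] for name in COEFFICIENTS)

# Hypothetical baseline values (not from the slide).
baseline = {"CurrentUnem": 5.0, "ChgInUnemp1yr": 0.0, "HiCostMortRate": 10.0}
# What if the one-year change in unemployment rises by one unit?
scenario = dict(baseline, ChgInUnemp1yr=1.0)

delta = predict_fdq_rate(scenario) - predict_fdq_rate(baseline)
print(predict_fdq_rate(baseline), predict_fdq_rate(scenario), delta)
```

Because the model is linear, the difference equals the coefficient of the changed input times the size of the change.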
10
LOGISTIC REGRESSION
Receiver Operating Characteristic (ROC) curve
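A logistic regression model outputs a probability through the sigmoid function, and the ROC curve shows how true and false positive rates trade off as the decision threshold moves. The sketch below assumes made-up model outputs and labels; the helper names are illustrative.

```python
import math

def sigmoid(z):
    """Map a linear score w·x + b to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def roc_points(scores, labels):
    """(false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
        fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))

logits = [2.0, 1.0, 0.5, -0.5, -1.0, -2.0]  # hypothetical linear scores
labels = [1, 1, 0, 1, 0, 0]                 # hypothetical true classes
scores = [sigmoid(z) for z in logits]
curve = roc_points(scores, labels)
area = auc(curve)
print(curve, area)
```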
11
NAÏVE BAYESIAN CLASSIFIER
12
DECISION TREES
• Entropy-based approach
• Conditional Entropy
• See also: SVM
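In the entropy-based approach, a tree picks the split that most reduces label entropy, i.e. the feature with the lowest conditional entropy. A minimal sketch, assuming toy categorical data; the function names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def conditional_entropy(feature, labels):
    """H(labels | feature): entropy within each feature value, weighted by group size."""
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    return sum(len(g) / len(labels) * entropy(g) for g in groups.values())

def information_gain(feature, labels):
    """Entropy reduction obtained by splitting on the feature."""
    return entropy(labels) - conditional_entropy(feature, labels)

# Hypothetical data: one feature separates the classes, the other does not.
labels = ["yes", "yes", "no", "no"]
perfect = ["a", "a", "b", "b"]
useless = ["a", "b", "a", "b"]
print(information_gain(perfect, labels), information_gain(useless, labels))
```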
13
TIME SERIES ANALYSIS
• ARMA model – Autoregressive Moving Average
• ARIMA – Autoregressive Integrated Moving Average
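An ARMA(p, q) forecast combines a weighted sum of the last p observations (the autoregressive part) with a weighted sum of the last q forecast errors (the moving-average part). The sketch below shows only the one-step prediction formula; the coefficients and data are hypothetical, and fitting them is out of scope here.

```python
def arma_one_step(series, residuals, phi, theta, c=0.0):
    """One-step ARMA forecast: c + sum(phi_i * x_{t-i}) + sum(theta_j * e_{t-j}).

    phi[0] and theta[0] apply to the most recent observation and residual.
    """
    ar_part = sum(p * x for p, x in zip(phi, reversed(series)))
    ma_part = sum(t * e for t, e in zip(theta, reversed(residuals)))
    return c + ar_part + ma_part

series = [1.0, 2.0, 3.0]       # hypothetical observations, oldest first
residuals = [0.1, -0.2, 0.5]   # hypothetical past forecast errors, oldest first
forecast = arma_one_step(series, residuals, phi=[0.5, 0.2], theta=[0.4])
print(forecast)
```

ARIMA applies the same model after differencing the series to remove trend.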
14
TEXT ANALYSIS
CONCEPTS
• Bag of words
• Inverted index (reverse index)
• Relevance (precision / recall) - term frequency (TF)
• Inverse document frequency (IDF)
• TF-IDF (improved relevance)
• PageRank, …
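TF-IDF weights a term by how often it appears in a document and discounts terms that appear in many documents. A minimal sketch, assuming whitespace tokenization and the common tf * log(N / df) variant; the sample documents are made up.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]  # bag of words
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(documents)
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append(
            {t: (c / len(tokens)) * math.log(n / df[t]) for t, c in counts.items()}
        )
    return weights

docs = ["the tax form", "the passport form", "the tax refund"]  # hypothetical
weights = tf_idf(docs)
print(weights[1])
```

A term that occurs in every document ("the" here) gets weight zero, which is the improvement over raw term frequency.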
15
USE THE RIGHT TOOLS
WHEN ALL YOU HAVE IS A HAMMER, EVERYTHING LOOKS LIKE A NAIL
16
BIG DATA LANDSCAPE IS BIG
17
SQL, NOSQL, HADOOP
• SQL databases were not designed to scale easily
– Cost, > 10 TB?
– OLTP vs OLAP
• NoSQL databases – Big Data approach
– Native format, tight integration
– Compute is still the bottleneck
• Hadoop – put early, transform later
– ETL vs. ELT
– Sandboxing, loose integration patterns
18
HADOOP ECOSYSTEM
19
HAWQ
EX-GREENPLUM
* See also: Hive, Impala
20
SPARK
21
IN-MEMORY DATA GRID
APACHE GEODE AKA GEMFIRE
22
INDUSTRIAL PROJECT EXAMPLE
E-GOV.KZ
Saint Petersburg, Moscow, Astana, Almaty
Data Size
• Public data: 1 TB
• Articles: 5 000 000
• Comments: 100 000 000
• Private data: 70 TB
23
QUALITY ANALYSIS SYSTEM
PROBLEM STATEMENT
Kazakhstan Government Services and Information Online
World Wide Web
Relevance
Sentiment
24
DATA WORKFLOW
[Diagram labels: Resource 1 to Resource 5; EMC2 parsers; NIT parsers; Hive import; Model execution; Results dump; Solr import; BI Dashboard]
25
NUTCH
Crawl cycle
• Inject seed URLs
• Generate new segment (fetch list)
• Fetch URLs from the list
• Parse the content
• Update CrawlDB
[Diagram labels: Seed URLs, CrawlDB, IndexDB, WWW, Fetch list, Fetched content, Parsed text and data]
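The crawl cycle in this diagram (inject, generate, fetch, parse, update) can be sketched as a loop over an in-memory link database. This is not Nutch's API: the `crawl` function, the status strings, and the `web` dictionary standing in for the WWW are all hypothetical, and the CrawlDB update is done per page for brevity.

```python
def crawl(seed_urls, web, rounds=3):
    """Toy crawl loop; web maps url -> (text, outlinks) and stands in for the WWW."""
    crawl_db = {url: "unfetched" for url in seed_urls}  # inject seed URLs
    parsed = {}
    for _ in range(rounds):
        # Generate: the fetch list is every URL not fetched yet.
        fetch_list = [u for u, status in crawl_db.items() if status == "unfetched"]
        if not fetch_list:
            break
        for url in fetch_list:
            page = web.get(url)  # fetch
            if page is None:
                crawl_db[url] = "gone"
                continue
            text, outlinks = page  # parse
            parsed[url] = text
            crawl_db[url] = "fetched"  # update CrawlDB
            for link in outlinks:
                crawl_db.setdefault(link, "unfetched")  # newly discovered URLs
    return crawl_db, parsed

web = {
    "a": ("page A", ["b", "missing"]),
    "b": ("page B", ["a", "c"]),
    "c": ("page C", []),
}
crawl_db, parsed = crawl(["a"], web)
print(crawl_db)
```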
26
CRAWLING VS. SCRAPING
Crawling
• Returns traffic back to the site
Scraping
• Doesn’t return traffic
• Extracts value
27
MACHINE LEARNING INSTRUMENTS
• TreeTagger
• Vowpal Wabbit
• Word2vec / Paragraph2vec
28
R
29
CLASSIFICATION METHOD
• Logistic Regression
• Multiclass classification
• One-vs-All
• Accuracy
      Positive  Negative  Neutral
x0    0         0         1
x1    0         0         1
…     …         …         …
xn    1         0         0
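The table above is the one-vs-all encoding: each class gets its own binary target column, one binary model is trained per column, and the final label is the class whose model scores highest. The sketch below assumes the per-class scores are already available; the score values and function names are hypothetical.

```python
CLASSES = ["positive", "negative", "neutral"]

def one_vs_all_targets(labels):
    """One binary target column per class, matching the table's layout."""
    return [[1 if label == cls else 0 for cls in CLASSES] for label in labels]

def combine(scores_per_class):
    """Pick, for each sample, the class whose binary model scored highest."""
    return [CLASSES[max(range(len(CLASSES)), key=row.__getitem__)]
            for row in scores_per_class]

def accuracy(predicted, actual):
    """Share of samples whose predicted label equals the true label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

labels = ["neutral", "neutral", "positive"]
targets = one_vs_all_targets(labels)

# Hypothetical outputs of the three binary models (positive, negative, neutral).
scores = [[0.1, 0.2, 0.7], [0.5, 0.1, 0.4], [0.8, 0.1, 0.1]]
predicted = combine(scores)
acc = accuracy(predicted, labels)
print(targets, predicted, acc)
```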
30
MODEL WORKFLOW
Step 1
• Cleaning
• Lemmatisation
• Preparing
Step 2
• One-vs-all models
• Combination
• Accuracy
Step 3
• Application
• Re-training if necessary
31
32
33
34
PITFALLS
• Private data access
• Data growth – 10-100x
• Hadoop cluster planning
• Nutch scraping integration is not easy
• Oozie is cumbersome
• Hive is not for BI, use HAWQ
35
36
DATA SCIENTIST
• Quantitative
• Curious & Creative
• Communicative & Collaborative
• Skeptical
• Technical
37
DATA ANALYTICS LIFECYCLE
• Discovery
• Data Preparation
• Model Planning
• Model Building
• Communicate Results
• Operationalize
70-80% of time
38
RESOURCES
• Deep Learning
• Visualization
• Machine Learning Course: https://www.coursera.org/learn/machine-learning
• Data Science and Big Data Analytics: http://eu.wiley.com/WileyCDA/WileyTitle/productCd-111887613X.html
• Online Twitter Sentiment Analysis: http://sentiment140.com/
• Amazon MTurk
• Meet-ups!