Unhidden Agenda
● Big Data Big Picture
● Big Data Dead Valley Dilemma
● Elastic MapReduce (EMR) numbers
● Scaling Learning (MPI & Hadoop)
Big Data =
Lots of data (evidence)
+
CPU-bound (forgotten)

Big Data =
Lots of data (evidence)
-
IO-bound (reality)
IO-bound =
CPU < 100%, waiting on data
● HD/bus speed
● Network
● File server
Big Data scalability (ex: Hadoop)
= Cluster
+
Locality + node-failure handling
(data moves close to the CPU)
The Big Data Dilemma
Big Data Dead Valley
[Chart: Techno Maturity / Risk (y axis) vs Enterprise size (x axis), positioning Start-ups, SMB and Enterprise]
Big Data =
SMALL MARKET
(B2B vs B2C)
Small market... hum?
WHY?
Maturity: data, process, QA, infra, talent, $, long-term vision
Data -> Analytics -> BI -> Big Data -> Data Mining -> ML
Data Access & Quality
User data privacy, IT outsourcing protection, Data Quality
Enterprise Slowness
1. Boston CXO Forum, October 24: Best Practices on Global Innovation (IBM, EMC, P&G, Intuit): exploit vs explore, M&A
2. Brad Feld (Managing Director at Foundry Group): hierarchy vs network
Big Data Dead Valley
[Chart: Techno Maturity / Risk (y axis) vs Enterprise Maturity (x axis), positioning Start-ups, SMB and Enterprise]
QMarketing example: leveraging Hadoop
● map = hits to sessions
● reduce = sessions to ROI
Online Marketing Management
Channel           % Budget   ROI
----------------------------------
PPC               50%        ?
Organic           20%        ?
Email Campaign    20%        ?
Social Media      10%        ?
ROI Dashboard
All abstractions leak
Abstract -> Procrastinate!
http://www.aleax.it/pycon_abst.pdf (Alex Martelli: "Abstraction as a Leverage")
Minimize the Tower of Abstraction
Simplify & lower the layer of abstraction
Examples:
● Work on files, not a DB, if possible
● Connect HDs directly to the server
● Low-level Linux command-line tools (cut, grep, sed, etc.)
● High-level languages: Python
Abstraction = 20X benefits
EMR vs AWS & S3 1.0
(no data-locality optimization + network = ~IO-bound)
EMR = 45 min, AWS = 4 min

EMR vs AWS & S3 2.0
EMR = 5+10 min*, AWS = ~4 min
*after 30 min of preprocessing ;) EMR = 5+4 min with big, compressed files
Scaling Machine Learning
● Scaling data preprocessing = Hadoop
● Small dataset = GPU
● Training with a big dataset = ??
Communication infrastructures = MPI & MapReduce
(John Langford, http://hunch.net/?p=2094)
MPI AllReduce
Hadoop vs MPI

MPI
● No fault tolerance by default
● Poor understanding of where the data is (manual split across nodes + communication & programming complexity)
● Limited to ~100 nodes in practice (sharing unavoidable)
● On a shared cluster, slow-node issues hit before disk/node failures

MapReduce
● Setup and teardown costs are significant (interaction with the scheduler + shipping the program + large number of nodes)
● Worse: each MapReduce iteration waits for free nodes, and many iterations are needed to reach high-quality predictions
● Flaw: requires refactoring code into map/reduce
Hadoop-compatible AllReduce: Vowpal Wabbit (Hadoop + MPI)
● MPI: AllReduce (all nodes end in the same state)
● MapReduce: conceptual simplicity
● MPI: no need to refactor code
● MapReduce: data locality (map only)
● MPI: ability to use local storage (or RAM): temp files on local disk can be cached in RAM by the OS
● MapReduce: automatic cleanup of local resources (tmp files)
● MPI: fast optimization approaches stay within the conceptual scope: AllReduce = a function call
● MapReduce: robustness (speculative execution to deal with slow nodes)
Summary
● Big Data Big Picture
○ Big Data: cluster + IO-bound (locality)
● Big Data Dead Valley Dilemma (MMID)
○ Small market / maturity / data access & quality / slowness
● EMR (AWS) = slow
● Minimize the tower of abstraction
● Scaling ML: bottleneck = communication
○ MPI: no fault tolerance + where is the data?
○ Hadoop: slow setup & teardown + requires refactoring
○ Hadoop-compatible AllReduce
References: MPI & Hadoop
Blog: http://bickson.blogspot.ca/2011/12/mpi-vs-hadoop.html
http://hunch.net/?p=2094 (video & slides of John Langford's presentation "Learning From Lots Of Data")
Speaker: John Langford, Senior Research Scientist, Microsoft Research
Slides: http://lisaweb.iro.umontrea...
Implementation: vowpal_wabbit