Unhidden Agenda
● Big Data Big Picture
● Big Data Dead Valley Dilemma
● Elastic MapReduce (EMR) numbers
● Scaling Learning (MPI & Hadoop)
Big Data =
Lots of data (evidence)
+
CPU-bound (forgotten)

Big Data =
Lots of data (evidence)
-
IO-bound (reality)
IO-bound =
CPU < 100%, waiting on data
● HD/bus speed
● Network
● File server
Big Data scalability (ex: Hadoop)
= Cluster
+
Locality + node-failure handling
(data moves close to the CPU)
The Big Data Dilemma
Big Data Dead Valley
[Chart: Techno Maturity / Risk (y axis) vs Enterprise size (x axis), positioning Start-ups, SMB and Enterprise]
Big Data =
SMALL MARKET
(B2B vs B2C)
Small market... hum?
WHY?
Maturity: data, process, QA, infra, talent, $, long-term vision
Data -> Analytics -> BI -> Big Data -> Data Mining -> ML
Data Access & Quality
User data privacy, IT outsourcing protection, Data Quality
Enterprise Slowness
1. Boston CXO Forum, October 24: Best Practices on Global Innovation (IBM, EMC, P&G, Intuit): exploit vs explore, M&A
2. Brad Feld (Managing Director at Foundry Group): hierarchy vs network
Big Data Dead Valley
[Chart: Techno Maturity / Risk (y axis) vs Enterprise Maturity (x axis), positioning Start-ups, SMB and Enterprise]
QMarketing example: leveraging Hadoop
● map = hits to sessions
● reduce = sessions to ROI
Online Marketing Management
Channel           % Budget   ROI
----------------------------------
PPC               50%        ?
Organic           20%        ?
Email Campaign    20%        ?
Social Media      10%        ?
ROI Dashboard
All abstractions leak
Abstract -> Procrastinate!
http://www.aleax.it/pycon_abst.pdf (Alex Martelli: "Abstraction as a Leverage")
Minimize the Tower of Abstraction
Simplify & lower the layer of abstraction
Examples:
● Work on files, not a DB, if possible
● Connect HDs directly to the server
● Low-level Linux command-line tools (cut, grep, sed, etc.)
● High-level languages: Python
Abstraction = 20X benefits
EMR vs AWS & S3 1.0
(no data-locality optimization + network = ~IO-bound)
EMR = 45 min, AWS = 4 min

EMR vs AWS & S3 2.0
EMR = 5+10 min*, AWS = ~4 min
*after 30 min of preprocessing ;) EMR = 5+4 min with big, compressed files
Scaling Machine Learning
● Scaling data preprocessing = Hadoop
● Small dataset = GPU
● Training with a big dataset = ??
Communication infrastructures = MPI & MapReduce
(John Langford, http://hunch.net/?p=2094)
MPI AllReduce
Hadoop vs MPI

MPI
● No fault tolerance by default
● Poor understanding of where the data is (manual split across nodes + communication & programming complexity)
● Limited to ~100 nodes in practice (sharing unavoidable)
● On a shared cluster, slow-node issues hit before disk/node failures

MapReduce
● Setup and teardown costs are significant (interaction with the scheduler + shipping the program + large number of nodes)
● Worse: each MapReduce iteration waits for free nodes, and many iterations are needed to reach high-quality predictions
● Flaw: requires refactoring code into map/reduce
Hadoop-compatible AllReduce: Vowpal Wabbit (Hadoop + MPI)
● MPI: AllReduce (all nodes end in the same state)
● MapReduce: conceptual simplicity
● MPI: no need to refactor code
● MapReduce: data locality (map only)
● MPI: ability to use local storage (or RAM): temp files on local disk can be cached in RAM by the OS
● MapReduce: automatic cleanup of local resources (tmp files)
● MPI: fast optimization approaches stay within the conceptual scope: AllReduce = a function call
● MapReduce: robustness (speculative execution to deal with slow nodes)
Summary
● Big Data Big Picture
○ Big Data: cluster + IO-bound (locality)
● Big Data Dead Valley Dilemma (MMID)
○ Small market / maturity / data access & quality / slowness
● EMR (AWS) = slow
● Minimize the tower of abstraction
● Scaling ML: bottleneck = communication
○ MPI: no fault tolerance + where is the data?
○ Hadoop: slow setup & teardown + requires refactoring
○ Hadoop-compatible AllReduce
References: MPI & Hadoop
Blog: http://bickson.blogspot.ca/2011/12/mpi-vs-hadoop.html
http://hunch.net/?p=2094 (video & slides of John Langford's presentation "Learning From Lots Of Data")
Speaker: John Langford, Senior Research Scientist, Microsoft Research
Slides: http://lisaweb.iro.umontrea...
Implementation: vowpal_wabbit