Big Data Platforms, Mihai Budiu, Oct 6 2014




My work

• Ph.D. from Carnegie Mellon, 2003
  • Hardware synthesis
  • Reconfigurable hardware
  • Compilers and computer architecture
• Researcher at Microsoft Research Silicon Valley, 2004-2014
  • Computer security
  • Cloud computing infrastructure:
    • distributed computation platforms
    • monitoring and debugging
    • performance analysis
  • Big data analysis and visualization
  • Large-scale machine learning


500 Years Ago

Tycho Brahe (1546-1601)

Johannes Kepler (1571-1630)


The Laws of Planetary Motion

Tycho’s measurements

Kepler’s laws


The Large Hadron Collider

• 25 PB/year
• WLHC Grid: 200K computing cores


Genetic Code


Astronomy


Weather


The Webs

Internet

Facebook friends graph


Big Data


Big Computers


Talk Outline

• Motivation
• Dryad: A distributed runtime
• DryadLINQ: A compiler for Dryad
• Tools and applications
• Sketch: A billion-row spreadsheet


Design Space

[Figure: design-space diagram. Axis labels: Throughput (batch), Latency (interactive), Internet, Datacenter, Data-parallel, Shared memory. Systems placed in the space: Dryad, Search, HPC, Grid, Transaction, Sketch.]


Dryad

• EuroSys 2007
• Continuously deployed in Microsoft since 2006
• Execution engine of Bing analytics
• > 10^5 machines
• Many PB of data analyzed daily

[Image: Dryad, painting by Evelyn de Morgan]


Dryad = Execution Layer

Job (application) ≈ Pipeline

Dryad ≈ Shell

Cluster ≈ Machine


2-D Piping

• Unix Pipes: 1-D
  grep | sed | sort | awk | perl
• Dryad: 2-D
  grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50


Virtualized 2-D Pipelines

• 2D DAG
• multi-machine
• virtualized


Dryad Job Structure

[Figure: a Dryad job graph. Labels: Input files, Vertices (processes), Channels, Stage, Output files. The vertices run grep, sed, sort, awk, and perl, with several instances of each program.]
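
The job structure on this slide (stages of identical vertices joined by channels) can be modeled as a small graph. The sketch below is an illustration only: the stage names and replica counts are taken from the 2-D piping slide, while the tuple layout and the round-robin wiring between adjacent stages are assumptions made for the example, not Dryad's API or its real connection patterns.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stage names and replica counts from the 2-D piping slide.
var stages = new (string Program, int Replicas)[]
{
    ("grep", 1000), ("sed", 500), ("sort", 1000), ("awk", 500), ("perl", 50),
};

// A vertex is one process: instance number Index of a stage's Program.
var vertices = new List<(string Program, int Index)>();
// A channel carries data from one vertex to another.
var channels = new List<((string Program, int Index) From, (string Program, int Index) To)>();

var previous = new List<(string Program, int Index)>();
foreach (var (program, replicas) in stages)
{
    var current = Enumerable.Range(0, replicas)
                            .Select(i => (Program: program, Index: i))
                            .ToList();
    vertices.AddRange(current);

    // ASSUMPTION: adjacent stages are wired round-robin, with just enough
    // channels that every vertex on both sides has at least one.
    if (previous.Count > 0)
    {
        int n = Math.Max(previous.Count, current.Count);
        for (int i = 0; i < n; i++)
            channels.Add((previous[i % previous.Count], current[i % current.Count]));
    }
    previous = current;
}

Console.WriteLine($"{stages.Length} stages, {vertices.Count} vertices, {channels.Count} channels");
// prints "5 stages, 3050 vertices, 3500 channels"
```

A stage such as sort would more plausibly need an all-to-all connection from the previous stage; the round-robin rule is used here only to keep the example small.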


Dryad System Architecture

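[Figure: the job manager holds the job schedule and uses the control plane to reach the cluster services: name server (NS), scheduler (Sched), and remote execution (RE) services. The vertices (V) exchange data over the data plane: files, TCP, FIFO, network.]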


Staging

1. Build
2. Send .exe
3. Start manager
4. Query cluster resources
5. Generate graph
6. Initialize vertices
7. Serialize vertices
8. Monitor vertex execution

[Figure labels: GM code, vertex code, name server, remote execution service.]


Talk Outline

• Motivation
• Dryad: A distributed runtime
• DryadLINQ: A compiler for Dryad
• Tools and applications
• Sketch: A billion-row spreadsheet


Distributed Collections

Partition

Collection

.Net objects


LINQ

Dryad

=> DryadLINQ


LINQ = .NET + Queries

Collection<T> collection;
bool IsLegal(Key k);
string Hash(Key k);

var results = from c in collection
              where IsLegal(c.key)
              select new { hash = Hash(c.key), c.value };
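
The query on this slide can be tried locally with ordinary LINQ to Objects. This is a minimal sketch: the sample data, the rule that a key is legal when it is non-empty, and the upper-casing stand-in for Hash are all invented for illustration. The anonymous type names its first member (hash) explicitly, which C# requires when a member is initialized from a method call.

```csharp
using System;
using System.Linq;

// Hypothetical sample data: (key, value) pairs standing in for the collection.
var collection = new[]
{
    (key: "alpha", value: 1),
    (key: "",      value: 2),
    (key: "gamma", value: 3),
};

// ASSUMPTION: a key is legal when it is non-empty.
bool IsLegal(string key) => key.Length > 0;

// ASSUMPTION: placeholder "hash" that just upper-cases the key.
string Hash(string key) => key.ToUpperInvariant();

var results = from c in collection
              where IsLegal(c.key)
              select new { hash = Hash(c.key), c.value };

foreach (var r in results)
    Console.WriteLine($"{r.hash}: {r.value}");
// prints "ALPHA: 1" then "GAMMA: 3"
```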


DryadLINQ = LINQ + Dryad

Collection<T> collection;
bool IsLegal(Key k);
string Hash(Key k);

var results = from c in collection
              where IsLegal(c.key)
              select new { hash = Hash(c.key), c.value };

[Figure: the C# query over collection becomes a query plan (a Dryad job). Each vertex of the plan runs C# vertex code over the data, and the job's output is results.]


Language Summary

• Where
• Select
• GroupBy
• OrderBy
• Aggregate
• Join


Very expressive

• Map-Reduce

  var result = input.SelectMany(r => Mapper(r))
                    .GroupBy(r => Key(r))
                    .Select(g => Reducer(g));

• Distributed sorting
• Iterative machine-learning (EM)
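
To make the Map-Reduce query concrete, here is a word count written with the same three operators, run locally with LINQ to Objects. The input lines and the bodies of Mapper, Key, and Reducer are hypothetical; only the SelectMany / GroupBy / Select shape comes from the slide.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical input: a few lines of text.
var input = new[] { "big data", "big computers", "data platforms" };

// Mapper: one record (a line) produces many records (its words).
IEnumerable<string> Mapper(string line) =>
    line.Split(' ', StringSplitOptions.RemoveEmptyEntries);

// Key: records are grouped by the word itself.
string Key(string word) => word;

// Reducer: a group of equal words collapses to "word: count".
string Reducer(IGrouping<string, string> g) => $"{g.Key}: {g.Count()}";

var result = input.SelectMany(r => Mapper(r))
                  .GroupBy(r => Key(r))
                  .Select(g => Reducer(g))
                  .ToList();

foreach (var line in result)
    Console.WriteLine(line);
// prints "big: 2", "data: 2", "computers: 1", "platforms: 1"
```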


Talk Outline

• Motivation
• Dryad: A distributed runtime
• DryadLINQ: A compiler for Dryad
• Tools and applications
• Sketch: A billion-row spreadsheet


Debugging DryadLINQ jobs


Distributed performance counters


Training Kinect

[Figure: a classifier maps a depth map to body parts. Labels: Depth map, Classifier, Body parts, Xbox GPU.]


Learn from Many Examples

Decision Tree

Classifier

Machine learning


Talk Outline

• Motivation
• Dryad: A distributed runtime
• DryadLINQ: A compiler for Dryad
• Tools and applications
• Sketch: A billion-row spreadsheet


Bandwidth hierarchy


Principles

• Visualizations are bounded data displays
• All computations are sketches
• Sketch is a runtime for
  (1) running streaming (sketching) algorithms
  (2) implementing visualizations with bounded data renderings


Streaming algorithms

• Sketches = randomized streaming algorithms
  • Input = set of size n
  • Result same independent of the order
  • Memory = O(log(n))
  • Multi-pass
• Linear input transformations
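
As one concrete instance of these properties, the sketch below counts distinct values with a K-minimum-values summary: it reads the input once, keeps only K hash values, gives the same summary for any input order, and merges partial summaries from different partitions. The choice of this particular algorithm, the constant K, the SplitMix64-style hash, and the test data are all assumptions for illustration; the slides do not say which sketches the system uses.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

const int K = 64; // summary size: number of hash values kept (assumed)

// SplitMix64 finalizer, used as a fixed pseudo-random hash of an item.
ulong Hash(ulong x)
{
    unchecked
    {
        x += 0x9E3779B97F4A7C15UL;
        x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9UL;
        x = (x ^ (x >> 27)) * 0x94D049BB133111EBUL;
        return x ^ (x >> 31);
    }
}

// One pass over a partition, keeping only the K smallest hash values seen.
SortedSet<ulong> Create(IEnumerable<ulong> items)
{
    var smallest = new SortedSet<ulong>();
    foreach (ulong item in items)
    {
        smallest.Add(Hash(item));
        if (smallest.Count > K) smallest.Remove(smallest.Max);
    }
    return smallest;
}

// Merging partial summaries: the K smallest values of their union.
SortedSet<ulong> Combine(IEnumerable<SortedSet<ulong>> parts) =>
    new SortedSet<ulong>(parts.SelectMany(p => p).Distinct().OrderBy(h => h).Take(K));

// If the K-th smallest hash sits at fraction f of the hash range,
// roughly (K - 1) / f distinct items were hashed.
double Estimate(SortedSet<ulong> sketch) =>
    sketch.Count < K ? sketch.Count : (K - 1) / (sketch.Max / (double)ulong.MaxValue);

// Hypothetical data: 100,000 items with 10,000 distinct values, in two partitions.
var items = Enumerable.Range(0, 100_000).Select(i => (ulong)(i % 10_000)).ToList();
var merged = Combine(new[] { Create(items.Take(50_000)), Create(items.Skip(50_000)) });

Console.WriteLine(merged.SetEquals(Create(items)));                      // True: merge equals a single pass
Console.WriteLine(merged.SetEquals(Create(Enumerable.Reverse(items))));  // True: input order is irrelevant
Console.WriteLine($"estimated distinct values: {Estimate(merged):F0}");
```

The estimate is approximate by design; a larger K trades memory for accuracy.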


4 billion rows on 155 machines


Spreadsheet operations

• Browsing/scrolling
• Filtering
• Using predicates
• Heavy hitters
• Sampling
• Searching
• Sorting
• Computing new columns
• Set operations (intersection, union, etc.)
• Charting


Histograms


Heat Maps


Sketch distributed service

[Figure: four servers, each holding its own data and running a Sketch service.]


DataSets = distributed objects

[Figure: on the client, the application holds a DataSet<T>. Across the network, the servers hold the T objects that the DataSet<T> stands for.]


Sketch Spreadsheet architecture

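[Figure: layered architecture. Layer labels: GUI, Spreadsheet logic, Table operations, Distributed objects, Storage layer. Component labels: Spreadsheet display, DataSet<Table>, and the storage back ends SQL Server, CSV files, column store, Cosmos.]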


DataSet API

interface IDataSet<T> {
    IDataSet<S> Map<S>(Func<T,S> f);
    IDataSet<Pair<T,S>> Zip<S>(IDataSet<S> other);
    R Sketch<R>(ISketch<T,R> sketch);
}

interface ISketch<T,R> {
    R Create(T data);
    R Combine(List<R> parts);
}


DataSet Implementations

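[Figure: the three implementations of the dataset interface (Local, Parallel, Proxy) arranged as a tree. On the client, the application and GUI hold a Parallel dataset of Proxy objects. Each Proxy holds a ref, through the RMI layer and across the network, to a dataset in another address space. Parallel datasets appear at three levels, labeled cluster parallelism, rack aggregation (racks 0..r), and core parallelism (servers 0..n). At the leaves, each Local dataset holds a T.]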
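
The Local and Parallel implementations named on this slide can be sketched in a single process. The code below is an assumption-laden illustration, not the system's source: the class names LocalDataSet, ParallelDataSet, and SumSketch are hypothetical, Proxy (the remote reference) and Zip are left out, and PLINQ stands in for whatever parallelism the real runtime uses.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Two leaves, each holding one partition (an int[]), under one Parallel node.
IDataSet<int[]> data = new ParallelDataSet<int[]>(new List<IDataSet<int[]>>
{
    new LocalDataSet<int[]>(new[] { 1, 2, 3 }),
    new LocalDataSet<int[]>(new[] { 4, 5 }),
});

// Map runs on every leaf; Sketch summarizes each leaf and combines the parts.
long total = data.Map(part => part.Select(x => x * 10).ToArray())
                 .Sketch(new SumSketch());
Console.WriteLine(total); // prints 150

interface ISketch<T, R>
{
    R Create(T data);
    R Combine(List<R> parts);
}

interface IDataSet<T>
{
    IDataSet<S> Map<S>(Func<T, S> f);
    R Sketch<R>(ISketch<T, R> sketch);
}

// Leaf: holds a single T in this address space.
class LocalDataSet<T> : IDataSet<T>
{
    private readonly T data;
    public LocalDataSet(T data) { this.data = data; }

    public IDataSet<S> Map<S>(Func<T, S> f) => new LocalDataSet<S>(f(data));
    public R Sketch<R>(ISketch<T, R> sketch) => sketch.Create(data);
}

// Inner node: forwards each operation to its children in parallel.
class ParallelDataSet<T> : IDataSet<T>
{
    private readonly List<IDataSet<T>> children;
    public ParallelDataSet(List<IDataSet<T>> children) { this.children = children; }

    public IDataSet<S> Map<S>(Func<T, S> f) =>
        new ParallelDataSet<S>(
            children.AsParallel().AsOrdered().Select(c => c.Map(f)).ToList());

    public R Sketch<R>(ISketch<T, R> sketch) =>
        sketch.Combine(
            children.AsParallel().AsOrdered().Select(c => c.Sketch(sketch)).ToList());
}

// Hypothetical sketch: the sum of all values in the dataset.
class SumSketch : ISketch<int[], long>
{
    public long Create(int[] data) => data.Sum(x => (long)x);
    public long Combine(List<long> parts) => parts.Sum();
}
```

Because Parallel nodes only need the IDataSet interface of their children, the same class can sit at the core, rack, and cluster levels of the tree.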


Map(f)

[Figure: Map(f) applies f to the T held by each Local dataset and produces a new dataset tree of the same shape (Proxy, Parallel, Local) that holds S values.]


Sketch(s)

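[Figure: each Local dataset calls s.Create on its T to produce an R. The Parallel dataset calls s.Combine on the R values from its children, and the resulting R is returned through the Proxy.]

interface ISketch<T,R> {
    R Create(T data);
    R Combine(List<R> parts);
}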
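
A histogram is a natural fit for this interface: Create counts one partition into fixed buckets, and Combine adds the per-partition counts. The class below is a hypothetical example (HistogramSketch, its constructor parameters, and the choice of double[] as the partition type are assumptions, and the bucket range is taken as given rather than discovered).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var sketch = new HistogramSketch(min: 0, max: 100, buckets: 4);

// Two partitions summarized independently, then combined.
long[] a = sketch.Create(new double[] { 5, 30, 60 });
long[] b = sketch.Create(new double[] { 10, 99, 100 });
long[] total = sketch.Combine(new List<long[]> { a, b });

Console.WriteLine(string.Join(", ", total)); // prints "2, 1, 1, 2"

interface ISketch<T, R>
{
    R Create(T data);
    R Combine(List<R> parts);
}

// Equal-width histogram over [min, max]; values outside the range are skipped.
class HistogramSketch : ISketch<double[], long[]>
{
    private readonly double min, max;
    private readonly int buckets;

    public HistogramSketch(double min, double max, int buckets)
    {
        this.min = min;
        this.max = max;
        this.buckets = buckets;
    }

    // Summarize one partition: a fixed-size array of counts.
    public long[] Create(double[] data)
    {
        var counts = new long[buckets];
        foreach (double x in data)
        {
            if (x < min || x > max) continue;
            int i = (int)((x - min) / (max - min) * buckets);
            counts[Math.Min(i, buckets - 1)]++; // x == max goes in the last bucket
        }
        return counts;
    }

    // Merge partial summaries by adding counts bucket by bucket.
    public long[] Combine(List<long[]> parts)
    {
        var sum = new long[buckets];
        foreach (long[] part in parts)
            for (int i = 0; i < buckets; i++)
                sum[i] += part[i];
        return sum;
    }
}
```

The result has the same size no matter how many rows were read, which is what lets a bounded display be computed from an unbounded table.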


Zip

[Figure: Zip takes two dataset trees of the same shape (Proxy, Parallel, Local), one holding T values and one holding S values, and produces a third tree of that shape holding (T, S) pairs.]


Histograms

CDF

2D histogram


Computing a histogram

[Figure: timeline between the client and servers 1..n, starting from a user click. Labels: Data range sketch, Histogram 1D + 2D composite sketch, Compute, Render, Display histogram.]


Some numbers

• Windows Server 2012 R2
• 8-core 2.1GHz AMD Opteron 2373 EE
• > 16GB RAM
• 3 x 1TB disks using RAID-0
• 155 machines
• 5 racks
• 1Gbps Ethernet


Null Sketch

[Figure: time (ms, axis from 0 to 600) versus number of machines (1 to 155), with two series: no aggregation network and with aggregation network.]


Histogram computation

• 26M rows/machine
• Scale-out

[Figure: time (ms, axis from 0 to 1600) versus number of machines (1 to 155).]
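
As a quick consistency check on the dataset size, assuming every one of the 155 machines holds the stated 26M rows (the slide does not spell this out):

```latex
\[
  26 \times 10^{6}\ \frac{\text{rows}}{\text{machine}} \times 155\ \text{machines}
  = 4.03 \times 10^{9}\ \text{rows}
\]
```

This agrees with the "4 billion rows on 155 machines" figure quoted earlier in the talk.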


Conclusions

• Big data is here to stay
• Better tools are needed
• Quest for high-level abstractions for building distributed systems
  • Execution graphs
  • Distributed collections
  • Higher-order transformations
  • Distributed stateful objects
  • Sketching algorithms


Data-Parallel Computation

Layer        Stack 1               Stack 2                     Stack 3
Application
Language     Sawzall, FlumeJava    DryadLINQ, Scope            Pig, Hive
             (≈ Sawzall, Java)     (≈ LINQ, SQL)               (≈ SQL)
Execution    Map-Reduce            Dryad                       Hadoop
Storage      GFS, BigTable         Cosmos, Azure, SQL Server   HDFS, S3