These slides are from a talk Ted Dunning gave at Lawrence Livermore Labs in 2011. The talk gives an architectural outline of the MapR system and then discusses how this architecture facilitates large scale machine learning algorithms.
MapR Architecture and Machine Learning
Outline
• MapR system overview
• Map-reduce review
• MapR architecture
• Performance results
• Map-reduce on MapR
• Machine learning on MapR
Map-Reduce
(Diagram: input flows through the map phase, is shuffled, and is reduced to produce the output.)
Bottlenecks and Issues
• Read-only files
• Many copies in the I/O path
• Shuffle based on HTTP
  • Can't use new technologies
  • Eats file descriptors
• Spills go to local file space
  • Bad for skewed distribution of sizes
MapR Improvements
• Faster file system
  • Fewer copies
  • Multiple NICs
  • No file descriptor or page-buf competition
• Faster map-reduce
  • Uses the distributed file system
  • Direct RPC to receiver
  • Very wide merges
MapR Innovations
• Volumes
  • Distributed management
  • Data placement
• Read/write random-access file system
  • Allows distributed meta-data
  • Improved scaling
  • Enables NFS access
• Application-level NIC bonding
• Transactionally correct snapshots and mirrors
MapR's Containers
• Files and directories are sharded into blocks, which are placed into mini NNs (containers) on disks
• Containers are 16-32 GB segments of disk, placed on nodes
• Each container holds directories & files and data blocks
• Containers are replicated across servers; there is no need to manage replication directly
Container locations and replication
(Diagram: nodes N1, N2, and N3 each host several containers, with each container replicated on two of the three nodes.)
The container location database (CLDB) keeps track of the nodes hosting each container.
MapR Scaling
• Containers represent 16-32 GB of data
  • Each can hold up to 1 billion files and directories
  • 100 M containers ≈ 2 exabytes (a very large cluster)
• 250 bytes of DRAM caches one container, so about 25 GB caches all containers of a 2 EB cluster
  • Even that is not necessary: container locations can page to disk, and a typical large 10 PB cluster needs about 2 GB
• Container reports are 100x-1000x smaller than HDFS block reports
  • Can serve 100x more data nodes
  • Increasing the container size to 64 GB would serve a 4 EB cluster
• Map-reduce is not affected
MapR's Streaming Performance
(Bar charts: read and write throughput in MB per second for raw hardware, MapR, and Hadoop. Tests: (i) 16 streams x 120 GB on 11 x 7200 rpm SATA disks; (ii) 2000 streams x 1 GB on 11 x 15K rpm SAS disks. Higher is better.)
Terasort on MapR
(Bar charts: elapsed time in minutes for 1.0 TB and 3.5 TB terasort runs, MapR vs. Hadoop. 10+1 nodes: 8 cores, 24 GB DRAM, 11 x 1 TB SATA 7200 rpm disks each. Lower is better.)
MUCH faster for some operations
(Chart: file-create rate as the number of files grows into the millions, on the same 10 nodes; the competing test was stopped partway through.)
NFS mounting models
• Export to the world
  • NFS gateway runs on selected gateway hosts
• Local server
  • NFS gateway runs on the local host
  • Enables local compression and checksumming
• Export to self
  • NFS gateway runs on all data nodes, mounted from localhost
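
Because the cluster is just an NFS mount, ordinary programs need no Hadoop API at all. A minimal Python sketch, assuming the cluster is mounted at /mapr/my.cluster (the mount point used by the R example at the end of this deck):

    # Plain file I/O against the NFS-mounted cluster; no Hadoop client needed.
    with open("/mapr/my.cluster/home/ted/data/foo.out") as f:
        header = f.readline()
        rows = sum(1 for _ in f)
    print(rows, "data rows; columns:", header.strip())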
Export to the world
(Diagram: an external NFS client mounts the cluster through a bank of NFS servers running on gateway hosts.)
Local server
(Diagram: the application and an NFS server run together on the client host; the NFS server talks to the cluster nodes.)
Universal export to self
(Diagram: an application on a cluster node reads and writes through an NFS server running on that same node.)
(Diagram: every cluster node runs its own application and NFS server; the nodes are identical.)
Sharded text indexing
• Mapper assigns each document to a shard
  • Shard is usually a hash of the document id
• Reducer indexes all documents for a shard
  • Indexes are created on local disk
  • On success, the index is copied to the DFS
  • On failure, the local files are deleted
• Must avoid directory collisions
  • Can't use the shard id alone (see the sketch below)
• Must manage local disk space
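
A minimal sketch of the reducer-side discipline just described; build_index is a hypothetical stand-in for the real indexer, and the shard count is illustrative:

    import hashlib
    import os
    import shutil
    import tempfile

    N_SHARDS = 16  # illustrative shard count

    def shard_of(doc_id):
        # Mapper side: the shard is a hash of the document id.
        return int(hashlib.md5(doc_id.encode()).hexdigest(), 16) % N_SHARDS

    def index_shard(shard_id, docs, dfs_dir):
        # A retried attempt for the same shard must not collide with a failed
        # one, so the local directory cannot be named by shard id alone.
        local = tempfile.mkdtemp(prefix="index-%d-" % shard_id)
        try:
            build_index(docs, local)  # hypothetical indexing call
            # On success, promote the finished index to the distributed store.
            shutil.copytree(local, os.path.join(dfs_dir, "shard-%d" % shard_id))
        finally:
            # Success or failure, reclaim the local disk space.
            shutil.rmtree(local)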
Conventional data flows
(Diagram: input documents flow through map and reduce; the reducer builds the index on local disk, copies it to clustered index storage, and the search engine downloads it back to local disk.)
Failure of a reducer causes garbage to accumulate on the local disk; failure of a search engine requires another download of the index from clustered storage.
Simplified NFS data flows
(Diagram: input documents flow through map and reduce tasks straight into clustered index storage, which the search engine reads in place.)
Failure of a reducer is cleaned up by the map-reduce framework; the search engine reads the mirrored index directly.
Application to machine learning
• So now we have the hammer
• Let’s see some nails!
K-means
• Classic E-M based algorithm
• Given cluster centroids:
  • Assign each data point to the nearest centroid
  • Accumulate new centroids
  • Rinse, lather, repeat
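
For reference, a minimal single-machine sketch of that loop in Python/NumPy (the fixed iteration count stands in for a real convergence test):

    import numpy as np

    def kmeans(points, centroids, iters=10):
        centroids = centroids.copy()
        for _ in range(iters):
            # E step: assign each point to its nearest centroid.
            dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            nearest = dists.argmin(axis=1)
            # M step: accumulate new centroids as means of the assigned points.
            for k in range(len(centroids)):
                members = points[nearest == k]
                if len(members) > 0:
                    centroids[k] = members.mean(axis=0)
        return centroids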
K-means, the movie
(Diagram: the input and the current centroids feed an assign-to-nearest-centroid step; the assignments are aggregated into new centroids, which feed back in.)
But …
Parallel Stochastic Gradient Descent
(Diagram: the input is split; each split trains a sub-model, and the sub-models are averaged into the final model.)
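
A sketch of that train-then-average pattern, assuming a linear model with squared loss (the deck does not fix a model; both choices are illustrative):

    import numpy as np

    def train_submodel(shard, w0, lr=0.01):
        # SGD over one input split, starting from shared initial weights.
        w = w0.copy()
        for x, y in shard:
            w -= lr * (w @ x - y) * x  # squared-loss gradient, linear model
        return w

    def parallel_sgd(shards, w0):
        # Train one sub-model per split, then average the sub-models.
        return np.mean([train_submodel(s, w0) for s in shards], axis=0)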
Variational Dirichlet Assignment
(Diagram: the input feeds a gather-sufficient-statistics step; the statistics update the model, which feeds back into the gathering step.)
04/10/2023 © MapR Confidential 28
Old tricks, new dogs
• Mapper
  • Assign point to a cluster
  • Emit cluster id, (1, point)
• Combiner and reducer
  • Sum counts, weighted sum of points
  • Emit cluster id, (n, sum/n)
• Output to HDFS
(Diagram annotations: centroids are read from HDFS to local disk by the distributed cache and then read from local disk; results are written by map-reduce.)
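
A sketch of that emit discipline; the generator-style emits are illustrative, not a real Hadoop API. Emitting (n, sum/n) keeps the output composable, because a downstream pass can re-weight each incoming mean by its count:

    def kmeans_map(point, centroids):
        # Assign the point to the nearest centroid; emit (cluster id, (1, point)).
        cid = min(range(len(centroids)),
                  key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])))
        yield cid, (1, list(point))

    def kmeans_reduce(cid, values):
        # Shared by combiner and reducer: sum counts and the count-weighted
        # sum of incoming means, then emit (n, sum/n).
        n, total = 0, None
        for count, mean in values:
            n += count
            weighted = [count * x for x in mean]
            total = weighted if total is None else [a + b for a, b in zip(total, weighted)]
        yield cid, (n, [x / n for x in total])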
Old tricks, new dogs
• Mapper
  • Assign point to a cluster
  • Emit cluster id, (1, point)
• Combiner and reducer
  • Sum counts, weighted sum of points
  • Emit cluster id, (n, sum/n)
• Output to MapR FS (in place of HDFS)
(Diagram annotations: centroids are read directly via NFS; results are written by map-reduce.)
Click modeling architecture
(Diagram: input and side-data feed a map-reduce stage for feature extraction, down-sampling, and a data join; the joined data drives sequential SGD learning, which now runs via NFS.)
Poor man’s Pregel
• Mapper:

    while not done:
        read and accumulate input models
        for each input:
            accumulate model
        write model
        synchronize
        reset input format
    emit summary

• Lines in bold can use conventional I/O via NFS
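
A slightly fleshed-out rendering of the same loop; every helper here (read_peer_models, write_model, barrier, emit_summary, and the model and input objects) is hypothetical, and model_dir is an illustrative path on the cluster's NFS mount. The point is that the I/O steps become ordinary file reads and writes:

    def poor_mans_pregel(inputs, model, model_dir="/mapr/my.cluster/tmp/models"):
        while not model.done():
            for peer in read_peer_models(model_dir):  # plain file reads via NFS
                model.accumulate(peer)
            for record in inputs:
                model.accumulate(record)
            write_model(model, model_dir)  # plain file write via NFS
            barrier()                      # synchronize with the other mappers
            inputs.reset()                 # reset the input format
        emit_summary(model)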
Trivial visualization interface
• Map-reduce output is visible via NFS
• Legacy visualization just works
    $ R
    > x <- read.csv("/mapr/my.cluster/home/ted/data/foo.out")
    > plot(error ~ t, x)
    > q(save='n')
Conclusions
• We used to know all this
  • Tab completion used to work
  • 5 years of work-arounds have clouded our memories
• We just have to remember the future