Overview of Spark project Presented by Yin Zhu (yinz@ust.hk)yinz@ust.hk Materials from Hadoop in...

Overview of Spark projectPresented by Yin Zhu (yinz@ust.hk)

Materials from

• http://spark-project.org/documentation/

• Hadoop in Practice by A. Holmes

• My demo code: https://bitbucket.org/blackswift/spark-example

25 March 2013

OutlineReview of MapReduce (15’)

Going Through Spark (NSDI’12) Slides (30’)

Demo (15’)

Review of MapReducePagerank implemented in Scala and Hadoop

Why Pagerank?

>More complicated than the “hello world” WordCount

>Widely used in search engines, very large scale input (the whole web graph!)

>Iterative algorithm (typical style for most numerical algorithms for data mining)

New score: 0.15 + 0.5*0.85 = 0.575

A functional implementationdef pagerank(links: Array[(UrlId, Array[UrlId])], numIters: Int):Map[UrlId, Double] = { val n = links.size var ranks = (Array.fromFunction(i => (i, 1.0)) (n)).toMap // init: each node has rank 1.0 for (iter <- 1 to numIters) { // Take some interactions val contrib = links .flatMap(node => {

val out_url = node._1 val in_urls = node._2 val score = ranks(out_url) / in_urls.size // the score each outer link

recieves in_urls.map(in_url => (in_url, score) )

} ) .groupBy(url_score => url_score._1) // group the (url, score) pairs by url .map(url_scoregroup => // sum the score for each unique url (url_scoregroup._1, url_scoregroup._2.foldLeft (0.0) ((sum,url_score) => sum+url_score._2))) ranks = ranks.map(url_score => (url_score._1, if (contrib.contains(url_score._1)) 0.85 * contrib(url_score._1) + 0.15 else url_score._2)) } ranks }

Reduce

Hadoop implementationhttps://github.com/alexholmes/hadoop-book/tree/master/src/main/java/com/manning/hip/ch7/pagerank

Hadoop/MapReduce implementation

Fault tolerance

the result after each iteration is saved to Disk

Speed/Disk IO

disk IO is proportional to the # of iterations

ideally the link graph should be loaded for only once

Resilient Distributed DatasetsA Fault-Tolerant Abstraction forIn-Memory Cluster Computing

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley,Michael Franklin, Scott Shenker, Ion Stoica

UC Berkeley UC BERKELEY

MotivationMapReduce greatly simplified “big data” analysis on large, unreliable clusters

But as soon as it got popular, users wanted more:

»More complex, multi-stage applications(e.g. iterative machine learning & graph processing)

»More interactive ad-hoc queries

Response: specialized frameworks for some of these apps (e.g. Pregel for graph

processing)

MotivationComplex apps and interactive queries both need one thing that MapReduce lacks:

Efficient primitives for data sharing

In MapReduce, the only way to share data across jobs is stable storage

Examples

iter. 1 iter. 2 . . .

HDFSread

HDFSwrite

HDFSread

HDFSwrite

query 1

query 2

query 3

result 1

result 2

result 3

HDFSread

Slow due to replication and disk I/O,but necessary for fault tolerance

iter. 1 iter. 2 . . .

Goal: In-Memory Data Sharing

query 1

query 2

query 3

one-timeprocessing

10-100× faster than network/disk, but how to get FT?

Challenge

How to design a distributed memory abstraction that is both fault-tolerant

and efficient?

Solution: Resilient Distributed Datasets (RDDs)Restricted form of distributed shared memory

»Immutable, partitioned collections of records

»Can only be built through coarse-grained deterministic transformations (map, filter, join, …)

Efficient fault recovery using lineage»Log one operation to apply to many

elements»Recompute lost partitions on failure»No cost if nothing fails

Three core concepts of RDDTransformations

define a new RDD from an existing RDD

Cache and Partitioner

Put the dataset into memory/other persist media, and specify the locations of the sub datasets

Actions

carry out the actual computation

RDD and its lazy transformationsRDD[T]: A sequence of objects of type T

Transformations are lazy:

https://github.com/mesos/spark/blob/master/core/src/main/scala/spark/PairRDDFunctions.scala

/*** Return an RDD of grouped items.*/

def groupBy[K: ClassManifest](f: T => K, p: Partitioner): RDD[(K, Seq[T])] = { val cleanF = sc.clean(f) this.map(t => (cleanF(t), t)).groupByKey(p) }

/*** Group the values for each key in the RDD into a single sequence. Allows controlling the* partitioning of the resulting key-value pair RDD by passing a Partitioner.*/

def groupByKey(partitioner: Partitioner): RDD[(K, Seq[V])] = { def createCombiner(v: V) = ArrayBuffer(v)

def mergeValue(buf: ArrayBuffer[V], v: V) = buf += v

def mergeCombiners(b1: ArrayBuffer[V], b2: ArrayBuffer[V]) = b1 ++= b2

val bufs = combineByKey[ArrayBuffer[V]]( createCombiner _, mergeValue _, mergeCombiners _, partitioner) bufs.asInstanceOf[RDD[(K, Seq[V])]] }

Actions: carry out the actual computation

https://github.com/mesos/spark/blob/master/core/src/main/scala/spark/SparkContext.scala

runJob

Example

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(_.startsWith(“ERROR”))

messages = errors.map(_.split(‘\t’)(2))

messages.persist() or .cache()

messages.filter(_.contains(“foo”)).count

messages.filter(_.contains(“bar”)).count

Task Scheduler for actionsDryad-like DAGs

Pipelines functionswithin a stage

Locality & data reuse aware

Partitioning-awareto avoid shuffles

groupBy

Stage 3

Stage 1

Stage 2

= cached data partition

query 1

query 2

query 3

RDD Recovery

one-timeprocessing

iter. 1 iter. 2 . . .

RDDs track the graph of transformations that built them (their lineage) to rebuild lost data

E.g.:messages = textFile(...).filter(_.contains(“error”)) .map(_.split(‘\t’)(2))

HadoopRDDpath = hdfs://…

FilteredRDDfunc =

_.contains(...)

MappedRDDfunc = _.split(…)

Fault Recovery

HadoopRDD FilteredRDD MappedRDD

Fault Recovery Results

1 2 3 4 5 6 7 8 9 100

20406080

100120140

57 56 58 5881

57 59 57 59

Iteration

s) Failure happens

Generality of RDDsDespite their restrictions, RDDs can express surprisingly many parallel algorithms

»These naturally apply the same operation to many items

Unify many current programming models»Data flow models: MapReduce, Dryad, SQL, …»Specialized models for iterative apps: BSP

(Pregel), iterative MapReduce (Haloop), bulk incremental, …

Support new apps that these models don’t

Memorybandwidth

Networkbandwidth

Tradeoff Space

Granularityof Updates

Write Throughput

Coarse

Low High

K-V stores,databases,RAMCloud

Best for batchworkloads

Best fortransactional

workloads

HDFS RDDs

OutlineSpark programming interface

Implementation

How people are using Spark

Spark Programming InterfaceDryadLINQ-like API in the Scala language

Usable interactively from Scala interpreter

Provides:»Resilient distributed datasets (RDDs)»Operations on RDDs: transformations (build

new RDDs), actions (compute and output results)

»Control of each RDD’s partitioning (layout across nodes) and persistence (storage in RAM, on disk, etc)

Example: Log MiningLoad error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(_.startsWith(“ERROR”))

messages = errors.map(_.split(‘\t’)(2))

messages.persist()Block 1

Block 2

Block 3

Worker

Master

messages.filter(_.contains(“foo”)).count

messages.filter(_.contains(“bar”)).count

results

Msgs. 1

Msgs. 2

Msgs. 3

Base RDD

Transformed RDD

Action

Result: full-text search of Wikipedia in <1 sec (vs 20

sec for on-disk data)

Result: scaled to 1 TB data in 5-7 sec

(vs 170 sec for on-disk data)

Example: PageRank1. Start each page with a rank of 12. On each iteration, update each page’s rank to

Σi∈neighbors ranki / |neighborsi|

links = // RDD of (url, neighbors) pairsranks = // RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) { ranks = links.join(ranks).flatMap { (url, (links, rank)) => links.map(dest => (dest, rank/links.size)) }.reduceByKey(_ + _)}

Optimizing Placement

links & ranks repeatedly joined

Can co-partition them (e.g. hash both on URL) to avoid shuffles

Can also use app knowledge, e.g., hash on DNS name

links = links.partitionBy( new URLPartitioner())

reduce

Contribs0

Contribs2

Ranks0

(url, rank)

Links(url,

neighbors)

Ranks2

reduce

Ranks1

PageRank Performance

01020304050607080 72

Hadoop

Basic Spark

Spark + Con-trolled Partition-ing

ImplementationRuns on Mesos [NSDI 11]to share clusters w/ Hadoop

Can read from any Hadoop input source (HDFS, S3, …)

SparkHadoo

Node Node Node Node

No changes to Scala language or compiler»Reflection + bytecode analysis to correctly ship

www.spark-project.org

Behavior with Insufficient RAM

0% 25% 50% 75% 100%0

10068.8

Percent of working set in memory

Scalability

25 50 1000

HadoopHadoopBinMemSpark

Number of machines

25 50 1000

300 27

Hadoop HadoopBinMemSpark

Number of machines

Logistic Regression K-Means

Breaking Down the Speedup

In-mem HDFS

In-mem local file

Spark RDD0

Text InputBinary Input

Spark Operations

Transformations

(define a new RDD)

mapfilter

samplegroupByKeyreduceByKey

sortByKey

flatMapunionjoin

cogroupcross

mapValues

Actions(return a result

to driver program)

collectreducecountsave

lookupKey

Open Source Community15 contributors, 5+ companies using Spark,3+ applications projects at Berkeley

User applications:»Data mining 40x faster than Hadoop

(Conviva)»Exploratory log analysis (Foursquare)»Traffic prediction via EM (Mobile Millennium)»Twitter spam classification (Monarch)»DNA sequence analysis (SNAP)». . .

Related WorkRAMCloud, Piccolo, GraphLab, parallel DBs

»Fine-grained writes requiring replication for resilience

Pregel, iterative MapReduce»Specialized models; can’t run arbitrary / ad-hoc queries

DryadLINQ, FlumeJava»Language-integrated “distributed dataset” API, but

cannot share datasets efficiently across queries

Nectar [OSDI 10]»Automatic expression caching, but over distributed FS

PacMan [NSDI 12]»Memory cache for HDFS, but writes still go to

network/disk

ConclusionRDDs offer a simple and efficient programming model for a broad range of applications

Leverage the coarse-grained nature of many parallel algorithms for low-overhead recovery

Try it out at www.spark-project.org

Overview of Spark project Presented by Yin Zhu (yinz@ust.hk)yinz@ust.hk Materials from Hadoop in...

Documents

COMP 4332 Tutorial 3 Feb 2 Yin Zhu yinz@cset.hk

Zhu GenomeRes2009 89yeastTFs

An Efficient Brush Model for Physically-Based 3D Painting Nelson S.-H. CHU (cpegnel@ust.hk) Chiew-Lan TAI (taicl@ust.hk) The Hong Kong University of Science

Alex YAU alexyau@ust.hk. Important Notes Website for Assignment #2 You

Contact Information · 2016. 7. 20. · Alexis Lau: alau@ust.hk. Robert Gibson: rgibson@ust.hk . Big History as a Pedagogical Framework in Teaching for Sustainability at The Hong

Lingyan Zhu

Craneopuntura Dr. Zhu

Powder Technology, 2013 - ust.hk

Small Summaries for Big Data Ke Yi HKUST yike@ust.hk

NGA Project Strong-Motion Database - ust.hk

Zhu Thesis

China's 2004 Economic Census and 2006 Benchmark - ust.hk

Hua YE (yehua@ust.hk ) Xiaogang WU (sowu@ust.hk ) Division

Julia q zhu

Zhu Zhu Pets (Lorena Gladić)

PRE-DEPARTURE BRIEFING SESSIONbmvh52.ust.hk/files/exchange/Spring 2020 Ex-Out Pre... · Room 2001, 2/F, Lift 4, Academic Building study.abroad@ust.hk / 2358-8178. ... NOT the availability

Inverters Zhu

An Easily Accessible Ionic Aggregation ... - ias2.ust.hk

James Kai-sing KUNG (sojk@ust.hk) Hong Kong University of

Shining in every facet · 2012. 2. 28. · Tel: (852) 2358-6109 Fax: (852) 2705-9119 Email: donation@ust.hk Media Relations Tel: (852) 2358-8555 Fax: (852) 2358-0537 Email: media@ust.hk