In-Memory Cluster Computing for Iterative and Interactive Applications. Presented by Kelly Technologies, www.kellytechno.com



  • Commodity clusters have become an important computing platform for a variety of applications:
    In industry: search, machine translation, ad targeting, ...
    In research: bioinformatics, NLP, climate simulation, ...
    High-level cluster programming models like MapReduce power many of these apps.
    Theme of this work: provide similarly powerful abstractions for a broader class of applications.

  • Current popular programming models for clusters transform data flowing from stable storage to stable storage. E.g., MapReduce.


  • Acyclic data flow is a powerful abstraction, but it is not efficient for applications that repeatedly reuse a working set of data:
    Iterative algorithms (many in machine learning)
    Interactive data mining tools (R, Excel, Python)
    Spark makes working sets a first-class concept to efficiently support these apps.

  • Provide distributed memory abstractions for clusters to support apps with working sets.
    Retain the attractive properties of MapReduce:
    Fault tolerance (for crashes & stragglers)
    Data locality
    Scalability
    Solution: augment the data flow model with resilient distributed datasets (RDDs).

  • We conjecture that Spark's combination of data flow with RDDs unifies many proposed cluster programming models:
    General data flow models: MapReduce, Dryad, SQL
    Specialized models for stateful apps: Pregel (BSP), HaLoop (iterative MapReduce), Continuous Bulk Processing
    Instead of specialized APIs for one type of app, give the user first-class control of distributed datasets.

  • Spark programming model
    Example applications
    Implementation
    Demo
    Future work

  • Resilient distributed datasets (RDDs):
    Immutable collections partitioned across a cluster that can be rebuilt if a partition is lost
    Created by transforming data in stable storage using data flow operators (map, filter, group-by, ...)
    Can be cached across parallel operations
    Parallel operations on RDDs:
    Reduce, collect, count, save, ...
    Restricted shared variables:
    Accumulators, broadcast variables
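To make the model concrete, here is a minimal Python sketch of an immutable, lazily evaluated collection with transformations, caching, and parallel operations. This is an illustration only, not Spark's implementation; `MiniRDD` and its method names are invented for this sketch.

```python
# Minimal sketch (NOT Spark's code) of the RDD programming model:
# immutable, lazily evaluated collections built by transformations,
# evaluated only when a parallel operation (count/collect) is invoked.
class MiniRDD:
    def __init__(self, compute):
        self._compute = compute   # thunk producing this RDD's records
        self._cache = None        # filled on first evaluation if cached
        self._cached = False

    @staticmethod
    def from_list(data):
        return MiniRDD(lambda: list(data))

    def map(self, f):             # transformation: defines a new RDD
        return MiniRDD(lambda: [f(x) for x in self._evaluate()])

    def filter(self, p):          # transformation: defines a new RDD
        return MiniRDD(lambda: [x for x in self._evaluate() if p(x)])

    def cache(self):              # mark for reuse across operations
        self._cached = True
        return self

    def _evaluate(self):
        if self._cached and self._cache is not None:
            return self._cache
        result = self._compute()
        if self._cached:
            self._cache = result
        return result

    def count(self):              # parallel operation: returns a result
        return len(self._evaluate())

    def collect(self):            # parallel operation: returns a result
        return self._evaluate()

nums = MiniRDD.from_list([1, 2, 3, 4]).map(lambda x: x * 2).filter(lambda x: x > 4)
print(nums.collect())   # [6, 8]
print(nums.count())     # 2
```

Note that no work happens when `map` and `filter` are called; each transformation only records how to compute its records, mirroring the lazy construction of RDDs from stable storage.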

  • Load error messages from a log into memory, then interactively search for various patterns:

    lines = spark.textFile("hdfs://...")           // Base RDD
    errors = lines.filter(_.startsWith("ERROR"))   // Transformed RDD
    messages = errors.map(_.split("\t")(2))
    cachedMsgs = messages.cache()                  // Cached RDD

    cachedMsgs.filter(_.contains("foo")).count     // Parallel operation
    cachedMsgs.filter(_.contains("bar")).count
    . . .

    (Diagram: the driver sends tasks to workers holding Blocks 1-3; results come back, and the cached message partitions stay in Caches 1-3 for reuse.)
    Result: full-text search of Wikipedia in ...
  • An RDD is an immutable, partitioned, logical collection of records:
    Need not be materialized, but rather contains information to rebuild a dataset from stable storage
    Partitioning can be based on a key in each record (using hash or range partitioning)
    Built using bulk transformations on other RDDs
    Can be cached for future reuse

  • Transformations (define a new RDD): map, filter, sample, union, groupByKey, reduceByKey, join, cache, ...

    Parallel operations (return a result to driver): reduce, collect, count, save, lookupKey, ...

  • RDDs maintain lineage information that can be used to reconstruct lost partitions. Ex:

    cachedMsgs = textFile(...).filter(_.contains("error"))
                              .map(_.split("\t")(2))
                              .cache()
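The recovery idea can be sketched in a few lines of Python: each partition logs the chain of transformations that produced it, so a lost partition is recomputed from the base data rather than restored from a replica. This is an illustration of the concept, not Spark's code; `rebuild_partition` and the sample log lines are invented here.

```python
# Sketch of lineage-based recovery: reapply the logged chain of
# transformations to the base records to rebuild a lost partition.
def rebuild_partition(base_records, lineage):
    records = base_records
    for op, fn in lineage:
        if op == "filter":
            records = [r for r in records if fn(r)]
        elif op == "map":
            records = [fn(r) for r in records]
    return records

# Base data in "stable storage" plus the lineage of the cached RDD above.
log_lines = ["INFO ok", "ERROR disk\tfull\tsda1", "ERROR net\tdown\teth0"]
lineage = [
    ("filter", lambda line: "error" in line.lower()),
    ("map", lambda line: line.split("\t")[2]),
]

# Suppose the cached partition is lost: recompute it from lineage.
print(rebuild_partition(log_lines, lineage))  # ['sda1', 'eth0']
```

Because the lineage is small (a list of operator descriptions), logging it is far cheaper than replicating or checkpointing the data itself.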

  • Consistency is easy due to immutability
    Inexpensive fault tolerance (log lineage rather than replicating/checkpointing data)
    Locality-aware scheduling of tasks on partitions
    Despite being restricted, the model seems applicable to a broad variety of applications

  • RDDs vs. distributed shared memory:

    Concern               RDDs                                           Distr. Shared Mem.
    Reads                 Fine-grained                                   Fine-grained
    Writes                Bulk transformations                           Fine-grained
    Consistency           Trivial (immutable)                            Up to app / runtime
    Fault recovery        Fine-grained and low-overhead using lineage    Requires checkpoints and program rollback
    Straggler mitigation  Possible using speculative execution           Difficult
    Work placement        Automatic based on data locality               Up to app (but runtime aims for transparency)

  • DryadLINQ: language-integrated API with SQL-like operations on lazy datasets; cannot have a dataset persist across queries
    Relational databases: lineage/provenance, logical logging, materialized views
    Piccolo: parallel programs with shared distributed tables; similar to distributed shared memory
    Iterative MapReduce (Twister and HaLoop): cannot define multiple distributed datasets, run different map/reduce pairs on them, or query data interactively
    RAMCloud: allows random read/write to all cells, requiring logging much like distributed shared memory systems

  • Spark programming model
    Example applications
    Implementation
    Demo
    Future work

  • Goal: find the best line separating two sets of points. (Figure: two point classes, a random initial line, and the target separating line.)

  • val data = spark.textFile(...).map(readPoint).cache()

    var w = Vector.random(D)

    for (i <- 1 to ITERATIONS) {
      val gradient = data.map(p =>
        (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
      ).reduce(_ + _)
      w -= gradient
    }

    println("Final w: " + w)
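The same gradient loop can be sketched in pure Python (no cluster, invented sample data) to show what each iteration computes over the cached working set. The update rule below is the one from the Scala snippet: `w -= sum((1/(1+exp(-y*(w·x))) - 1) * y * x)`.

```python
import math
import random

# Pure-Python sketch of the logistic regression loop above: every
# iteration recomputes the gradient over the same (cached) point set.
def train(points, dims, iterations, seed=42):
    rng = random.Random(seed)
    w = [rng.uniform(-1, 1) for _ in range(dims)]      # random initial line
    for _ in range(iterations):
        grad = [0.0] * dims
        for x, y in points:                            # the reused working set
            dot = sum(wi * xi for wi, xi in zip(w, x))
            scale = (1.0 / (1.0 + math.exp(-y * dot)) - 1.0) * y
            for d in range(dims):
                grad[d] += scale * x[d]
        w = [wi - gi for wi, gi in zip(w, grad)]       # w -= gradient
    return w

# Tiny separable dataset: x > 0 labeled +1, x < 0 labeled -1.
points = [([1.0], 1), ([2.0], 1), ([-1.0], -1), ([-2.0], -1)]
w = train(points, dims=1, iterations=50)
predict = lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
print(all(predict(x) == y for x, y in points))  # True
```

In Spark the `data.map(...).reduce(_ + _)` runs in parallel over cached partitions; only the small weight vector `w` travels between the driver and the workers each iteration.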

  • Running time vs. number of iterations, Hadoop vs. Spark:

    Number of Iterations    Hadoop Running Time (s)    Spark Running Time (s)
    1                       128                        174
    5                       637                        214
    10                      1245                       242
    20                      2559                       283
    30                      3818                       354

  • MapReduce data flow can be expressed using RDD transformations:

    res = data.flatMap(rec => myMapFunc(rec))
              .groupByKey()
              .map((key, vals) => myReduceFunc(key, vals))

    Or with combiners:

    res = data.flatMap(rec => myMapFunc(rec))
              .reduceByKey(myCombiner)
              .map((key, val) => myReduceFunc(key, val))
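A plain-Python sketch of the flatMap / groupByKey / map pipeline makes the data flow explicit. The record format (year/temperature pairs) and the function names mirroring the slide are made up for illustration; no Spark is involved.

```python
from collections import defaultdict

# Sketch of MapReduce expressed as flatMap -> groupByKey -> map.
def flat_map(records, f):
    return [item for rec in records for item in f(rec)]

def group_by_key(pairs):
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return sorted(groups.items())

# myMapFunc emits (key, value) pairs; myReduceFunc folds each group.
my_map_func = lambda rec: [(rec.split()[0], int(rec.split()[1]))]
my_reduce_func = lambda key, vals: (key, max(vals))   # e.g. max temp per year

data = ["1999 34", "2000 28", "1999 31"]
res = [my_reduce_func(k, vs) for k, vs in group_by_key(flat_map(data, my_map_func))]
print(res)  # [('1999', 34), ('2000', 28)]
```

The combiner variant in the slide fuses the grouping and folding steps so partial results are merged before shuffling, which is what `reduceByKey` does in Spark.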

  • val lines = spark.textFile("hdfs://...")

    val counts = lines.flatMap(_.split("\\s"))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    counts.save("hdfs://...")
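The same word-count pipeline can be sketched in plain Python (an illustration, not Spark code): splitting lines on whitespace plays the role of flatMap, and the per-word counter plays the role of `reduceByKey(_ + _)`.

```python
import re
from collections import defaultdict

# Sketch of word count: flatMap (split on whitespace), then
# reduceByKey(_ + _) applied eagerly as a running per-word counter.
def word_count(lines):
    counts = defaultdict(int)
    for line in lines:
        for word in re.split(r"\s+", line.strip()):
            if word:
                counts[word] += 1
    return dict(counts)

print(word_count(["to be or", "not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```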

  • Pregel: a graph processing framework from Google that implements the Bulk Synchronous Parallel model:
    Vertices in the graph have state
    At each superstep, each vertex can update its state and send messages to vertices in the next superstep
    Good fit for PageRank, shortest paths, ...

  • (Data flow diagram: the input graph, vertex state 1, and messages 1 are grouped by vertex ID to run superstep 1, producing vertex state 2 and messages 2; these are grouped by vertex ID again for superstep 2, and so on.)

  • (PageRank as data flow: the input graph, vertex ranks 1, and contributions 1 are grouped and added by vertex in superstep 1, producing vertex ranks 2 and contributions 2; superstep 2 repeats the group-and-add, and so on.)

  • Separate RDDs for the immutable graph state and for the vertex states and messages at each iteration
    Use groupByKey to perform each step
    Cache the resulting vertex and message RDDs
    Optimization: co-partition the input graph and vertex state RDDs to reduce communication
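The group-and-add superstep can be sketched in a few lines of Python on a tiny made-up graph (illustrative names, not Spark code): each PageRank iteration groups rank contributions by target vertex and sums them, which is exactly a group-by-key step.

```python
from collections import defaultdict

# Sketch of Pregel-style PageRank as repeated group-by-vertex steps.
def pagerank(links, iterations=20, damping=0.85):
    ranks = {v: 1.0 for v in links}
    for _ in range(iterations):
        contribs = defaultdict(float)          # "messages" keyed by vertex
        for v, outs in links.items():
            for out in outs:
                contribs[out] += ranks[v] / len(outs)
        # group & add by vertex, then apply the damping update
        ranks = {v: (1 - damping) + damping * contribs[v] for v in links}
    return ranks

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # 'c' ends up with the highest rank here
```

In the Spark version, `contribs` and `ranks` are RDDs, the grouping is a `groupByKey`, and co-partitioning `links` and `ranks` by vertex lets each superstep avoid shuffling the graph structure.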

  • Twitter spam classification (Justin Ma)
    EM algorithm for traffic prediction (Mobile Millennium)
    K-means clustering
    Alternating Least Squares matrix factorization
    In-memory OLAP aggregation on Hive data
    SQL on Spark (future work)

  • Spark programming model
    Example applications
    Implementation
    Demo
    Future work

  • Spark runs on the Mesos cluster manager [NSDI '11], letting it share resources with Hadoop and other apps
    Can read from any Hadoop input source (e.g., HDFS)
    ~6,000 lines of Scala code, thanks to building on Mesos

  • Scala closures are Serializable Java objects: serialize on the driver, load and run on workers
    Not quite enough:
    Nested closures may reference the entire outer scope
    May pull in non-Serializable variables not used inside
    Solution: bytecode analysis + reflection
    Shared variables are implemented using a custom serialized form (e.g., a broadcast variable contains a pointer to a BitTorrent tracker)

  • Modified the Scala interpreter to allow Spark to be used interactively from the command line. Required two changes:
    Modified wrapper code generation so that each line typed has references to objects for its dependencies
    Place generated classes in a distributed filesystem
    Enables in-memory exploration of big data

  • Spark programming model
    Example applications
    Implementation
    Demo
    Future work

  • Further extend RDD capabilities:
    Control over storage layout (e.g., column-oriented)
    Additional caching options (e.g., on disk, replicated)
    Leverage lineage for debugging:
    Replay any task, rebuild any intermediate RDD
    Adaptive checkpointing of RDDs
    Higher-level analytics tools built on top of Spark

  • By making distributed datasets a first-class primitive, Spark provides a simple, efficient programming model for stateful data analytics. RDDs provide:
    Lineage info for fault recovery and debugging
    Adjustable in-memory caching
    Locality-aware parallel operations
    We plan to make Spark the basis of a suite of batch and interactive data analysis tools.

  • Each RDD is characterized by:
    A set of partitions
    Preferred locations for each partition
    An optional partitioning scheme (hash or range)
    A storage strategy (lazy or cached)
    Parent RDDs (forming a lineage DAG)
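The five pieces of per-RDD state listed above can be sketched as a small Python data type. The class and field names are invented for illustration; this is the shape of the interface, not Spark's actual internals.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Sketch of the internal per-RDD interface: partitions, preferred
# locations, optional partitioner, storage strategy, and parent RDDs
# (the parents link RDDs into a lineage DAG).
@dataclass
class RDDInterface:
    partitions: List[int]                                         # partition ids
    parents: List["RDDInterface"] = field(default_factory=list)   # lineage DAG
    partitioner: Optional[str] = None                             # "hash" or "range"
    storage: str = "lazy"                                         # "lazy" or "cached"

    def preferred_locations(self, partition: int) -> List[str]:
        # e.g. the hosts holding the HDFS block backing this partition
        return []

base = RDDInterface(partitions=[0, 1, 2])
mapped = RDDInterface(partitions=[0, 1, 2], parents=[base], storage="cached")
print(len(mapped.parents), mapped.storage)  # 1 cached
```

Walking the `parents` links from any RDD back to stable storage yields exactly the lineage used for fault recovery earlier in the deck.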

  • Presented by Kelly Technologies

    Speaker notes (cleaned from slide residue):
    Explain that trends in data collection rates vs. processor and I/O speeds are driving this.
    Also applies to Dryad, SQL, etc. Benefits: easy to do fault tolerance and ...
    RDDs = first-class way to manipulate and persist intermediate datasets.
    Key idea: you write a single program, similar to DryadLINQ. Distributed datasets with parallel operations on them are pretty standard; the new thing is that they can be reused across operations.
    Variables in the driver program can be used in parallel ops; accumulators are useful for sending information back; cached vars are an optimization.
    Mention that cached vars are useful for some workloads that won't be shown here.
    Mention it's all designed to be easy to distribute in a fault-tolerant fashion.
    Note that the dataset is reused on each gradient computation.
    This is for a 29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each).
    Mention it's designed to be fault-tolerant.