In-Memory Cluster Computing for Iterative and Interactive Applications
Presented by Kelly Technologies
www.kellytechno.com
Commodity clusters have become an important computing platform for a variety of applications:
- In industry: search, machine translation, ad targeting, ...
- In research: bioinformatics, NLP, climate simulation, ...
High-level cluster programming models like MapReduce power many of these apps.
Theme of this work: provide similarly powerful abstractions for a broader class of applications.
Current popular programming models for clusters transform data flowing from stable storage to stable storage, e.g., MapReduce.
Acyclic data flow is a powerful abstraction, but it is not efficient for applications that repeatedly reuse a working set of data:
- Iterative algorithms (many in machine learning)
- Interactive data mining tools (R, Excel, Python)
Spark makes working sets a first-class concept to efficiently support these apps.
Goal: provide distributed memory abstractions for clusters to support apps with working sets, while retaining the attractive properties of MapReduce:
- Fault tolerance (for crashes & stragglers)
- Data locality
- Scalability
Solution: augment the data flow model with resilient distributed datasets (RDDs).
We conjecture that Spark's combination of data flow with RDDs unifies many proposed cluster programming models:
- General data flow models: MapReduce, Dryad, SQL
- Specialized models for stateful apps: Pregel (BSP), HaLoop (iterative MapReduce), Continuous Bulk Processing
Instead of specialized APIs for one type of app, give the user first-class control of distributed datasets.
Outline:
- Spark programming model
- Example applications
- Implementation
- Demo
- Future work
Spark programming model:
- Resilient distributed datasets (RDDs)
  - Immutable collections partitioned across the cluster that can be rebuilt if a partition is lost
  - Created by transforming data in stable storage using data flow operators (map, filter, group-by, ...)
  - Can be cached across parallel operations
- Parallel operations on RDDs: reduce, collect, count, save, ...
- Restricted shared variables: accumulators, broadcast variables
An RDD is an immutable, partitioned, logical collection of records:
- Need not be materialized; instead, it contains the information to rebuild the dataset from stable storage
- Partitioning can be based on a key in each record (using hash or range partitioning)
- Built using bulk transformations on other RDDs
- Can be cached for future reuse
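The bullets above can be sketched in a few lines of plain Scala. This is a toy model invented for illustration, not the Spark API: a dataset stores the function needed to recompute it rather than the data itself, transformations are lazy, and cache() pins the materialized result.

```scala
// Toy sketch of the RDD idea (NOT the Spark API): the dataset records how to
// recompute itself; transformations are lazy, actions force computation.
class ToyRDD[T](compute: () => Seq[T]) {
  private var cached: Option[Seq[T]] = None

  // Transformations define a new dataset lazily; nothing runs yet.
  def map[U](f: T => U): ToyRDD[U] = new ToyRDD(() => materialize().map(f))
  def filter(p: T => Boolean): ToyRDD[T] = new ToyRDD(() => materialize().filter(p))

  // Parallel operations (actions) force computation along the lineage chain.
  def collect(): Seq[T] = materialize()
  def count(): Int = materialize().size

  // Caching pins the materialized result for future reuse.
  def cache(): this.type = { cached = Some(compute()); this }

  private def materialize(): Seq[T] = cached.getOrElse(compute())
}

object ToyRDD {
  def fromSeq[T](data: Seq[T]): ToyRDD[T] = new ToyRDD(() => data)
}

val evens = ToyRDD.fromSeq(1 to 10).filter(_ % 2 == 0).map(_ * 10)
// evens.collect() == Seq(20, 40, 60, 80, 100)
```

Note that building `evens` computes nothing; only collect() or count() walks the chain, which is what makes lineage-based recovery cheap.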
Transformations (define a new RDD): map, filter, sample, union, groupByKey, reduceByKey, join, cache
Parallel operations (return a result to the driver): reduce, collect, count, save, lookupKey
RDDs maintain lineage information that can be used to reconstruct lost partitions. Ex:

cachedMsgs = textFile(...).filter(_.contains("error"))
                          .map(_.split("\t")(2))
                          .cache()
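The recovery story can be illustrated with plain Scala collections (toy log lines invented here; not the Spark API). Because the lineage is a chain of deterministic operators, losing the cached copy is harmless: re-running the chain against stable storage rebuilds exactly the same data.

```scala
// Plain-Scala illustration of lineage-based recovery for a cachedMsgs-style
// dataset (toy data). Lineage = deterministic operators over stable storage.
val sourceLines = Seq("INFO ok", "ERROR\tdisk\tfull", "ERROR\tnet\tdown")

// Lineage: filter for error lines, then project the third tab-separated field.
val lineage: Seq[String] => Seq[String] =
  lines => lines.filter(_.contains("ERROR")).map(_.split("\t")(2))

var cachedMsgs = lineage(sourceLines) // materialized and cached in memory
cachedMsgs = Seq.empty[String]        // simulate losing the cached partition
cachedMsgs = lineage(sourceLines)     // recovery: recompute from lineage
// cachedMsgs == Seq("full", "down")
```

This is why Spark can log only the lineage instead of replicating or checkpointing the data itself.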
- Consistency is easy due to immutability
- Inexpensive fault tolerance (log lineage rather than replicating/checkpointing data)
- Locality-aware scheduling of tasks on partitions
- Despite being restricted, the model seems applicable to a broad variety of applications
RDDs vs. distributed shared memory:

Concern              | RDDs                                        | Distr. Shared Mem.
---------------------|---------------------------------------------|----------------------------------------------
Reads                | Fine-grained                                | Fine-grained
Writes               | Bulk transformations                        | Fine-grained
Consistency          | Trivial (immutable)                         | Up to app / runtime
Fault recovery       | Fine-grained and low-overhead using lineage | Requires checkpoints and program rollback
Straggler mitigation | Possible using speculative execution        | Difficult
Work placement       | Automatic based on data locality            | Up to app (but runtime aims for transparency)
Related work:
- DryadLINQ: language-integrated API with SQL-like operations on lazy datasets; cannot have a dataset persist across queries
- Relational databases: lineage/provenance, logical logging, materialized views
- Piccolo: parallel programs with shared distributed tables; similar to distributed shared memory
- Iterative MapReduce (Twister and HaLoop): cannot define multiple distributed datasets, run different map/reduce pairs on them, or query data interactively
- RAMCloud: allows random read/write to all cells, requiring logging much like distributed shared memory systems
Example: logistic regression. Goal: find the best line separating two sets of points. [Figure: '+' and '-' points in the plane; a random initial line converges toward the target separating line]
val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
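The same loop can be run locally, with plain Scala collections standing in for the RDD. This is a self-contained sketch: the Point case class, the vector helpers, the two-point dataset, and the zero initial weight are all made up here (the slide uses a random initial w and an unspecified dataset).

```scala
import scala.math.exp

// Local sketch of the logistic regression loop above (toy data, no RDD).
case class Point(x: Vector[Double], y: Double)

def dot(a: Vector[Double], b: Vector[Double]): Double =
  a.zip(b).map { case (u, v) => u * v }.sum
def add(a: Vector[Double], b: Vector[Double]): Vector[Double] =
  a.zip(b).map { case (u, v) => u + v }
def sub(a: Vector[Double], b: Vector[Double]): Vector[Double] =
  a.zip(b).map { case (u, v) => u - v }
def scale(v: Vector[Double], c: Double): Vector[Double] = v.map(_ * c)

val data = Seq(Point(Vector(1.0, 2.0), 1.0), Point(Vector(-1.0, -1.5), -1.0))
var w = Vector(0.0, 0.0) // random in the slide; zero here for reproducibility

for (_ <- 1 to 10) {
  val gradient = data
    .map(p => scale(p.x, (1.0 / (1.0 + exp(-p.y * dot(w, p.x))) - 1.0) * p.y))
    .reduce(add)
  w = sub(w, gradient) // the slide's `w -= gradient`
}
// w now separates the two points: dot(w, p.x) has the same sign as p.y
```

The point the slide makes is that `data` is reused on every gradient step, which is exactly where caching the RDD pays off.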
[Chart: logistic regression performance, running time (s) vs. number of iterations]

Iterations | Hadoop (s) | Spark (s)
-----------|------------|----------
1          | 128        | 174
5          | 637        | 214
10         | 1245       | 242
20         | 2559       | 283
30         | 3818       | 354
MapReduce data flow can be expressed using RDD transformations:

res = data.flatMap(rec => myMapFunc(rec))
          .groupByKey()
          .map((key, vals) => myReduceFunc(key, vals))

Or with combiners:

res = data.flatMap(rec => myMapFunc(rec))
          .reduceByKey(myCombiner)
          .map((key, val) => myReduceFunc(key, val))
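The equivalence of the two encodings can be checked locally with Scala collections standing in for RDDs (`myMapFunc` here is a toy word-count mapper invented for this example). groupByKey shuffles all values for a key; reduceByKey lets combiners pre-aggregate before the shuffle, but the final result is identical.

```scala
// Local sketch: groupByKey-then-reduce vs. reduceByKey give the same answer.
def myMapFunc(rec: String): Seq[(String, Int)] =
  rec.split(" ").toSeq.map(w => (w, 1))

val data = Seq("a b", "b c", "c c")

// groupByKey form: collect every value for a key, then reduce.
val grouped: Map[String, Int] = data.flatMap(myMapFunc)
  .groupBy(_._1)
  .map { case (key, pairs) => (key, pairs.map(_._2).sum) }

// reduceByKey form: pairwise-combine values per key.
val combined: Map[String, Int] = data.flatMap(myMapFunc)
  .groupBy(_._1)
  .map { case (key, pairs) => (key, pairs.map(_._2).reduce(_ + _)) }
// grouped == combined == Map("a" -> 1, "b" -> 2, "c" -> 3)
```

The difference in a real cluster is where the addition happens: before the shuffle (combiners) or after it.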
val lines = spark.textFile("hdfs://...")

val counts = lines.flatMap(_.split("\\s"))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)

counts.save("hdfs://...")
Pregel: a graph processing framework from Google that implements the Bulk Synchronous Parallel model:
- Vertices in the graph have state
- At each superstep, each vertex can update its state and send messages to vertices in the next superstep
A good fit for PageRank, shortest paths, ...
[Diagram: Pregel data flow. The input graph, vertex state 1, and messages 1 feed superstep 1 via a group-by on vertex ID, producing vertex state 2 and messages 2; superstep 2 repeats the pattern, and so on.]
[Diagram: PageRank in this model. The input graph, vertex ranks 1, and contributions 1 are grouped and added by vertex in superstep 1, producing vertex ranks 2 and contributions 2; superstep 2 repeats, and so on.]
Implementing Pregel in Spark:
- Separate RDDs for the immutable graph state and for the vertex states and messages at each iteration
- Use groupByKey to perform each superstep
- Cache the resulting vertex and message RDDs
- Optimization: co-partition the input graph and vertex state RDDs to reduce communication
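One "group & add by vertex" superstep can be sketched with plain Scala collections. The three-vertex graph, the uniform initial ranks, and the 0.15/0.85 damping constants are toy assumptions made for this example, not values from the slides.

```scala
// Sketch of one PageRank superstep: group contributions by vertex ID, add.
val links = Map("A" -> Seq("B", "C"), "B" -> Seq("C"), "C" -> Seq("A"))
val ranks = Map("A" -> 1.0, "B" -> 1.0, "C" -> 1.0)

// Messages: each vertex sends rank / outDegree to its neighbors.
val contribs: Seq[(String, Double)] = links.toSeq.flatMap {
  case (v, outs) => outs.map(dst => (dst, ranks(v) / outs.size))
}

// Superstep: "group & add by vertex", with standard PageRank damping.
val newRanks: Map[String, Double] = contribs
  .groupBy(_._1)
  .map { case (v, cs) => (v, 0.15 + 0.85 * cs.map(_._2).sum) }
// newRanks: A -> 1.0, B -> 0.575, C -> 1.425
```

In Spark, `contribs` and `newRanks` would each be an RDD, and caching them is what avoids re-reading the graph every superstep.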
Other applications:
- Twitter spam classification (Justin Ma)
- EM algorithm for traffic prediction (Mobile Millennium)
- K-means clustering
- Alternating Least Squares matrix factorization
- In-memory OLAP aggregation on Hive data
- SQL on Spark (future work)
Implementation:
- Spark runs on the Mesos cluster manager [NSDI '11], letting it share resources with Hadoop and other apps
- Can read from any Hadoop input source (e.g., HDFS)
- Only ~6,000 lines of Scala code, thanks to building on Mesos
Language integration:
- Scala closures are Serializable Java objects: serialize on the driver, load & run on workers
- Not quite enough, though:
  - Nested closures may reference the entire outer scope
  - This may pull in non-Serializable variables not used inside
  - Solution: bytecode analysis + reflection
- Shared variables are implemented using a custom serialized form (e.g., a broadcast variable contains a pointer to a BitTorrent tracker)
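The closure-capture problem can be shown in a few lines of plain Scala (the Driver class and its field names are invented for this sketch). A closure that reads a field does so through `this`, so serializing it would drag the whole enclosing object along, including data the task never uses.

```scala
// Illustration of over-capture in nested closures (toy example).
class Driver {
  val factor = 3
  val huge = Array.fill(1 << 20)(0.toByte) // not needed by the task

  // Captures `this` (and therefore `huge`) just to read `factor`.
  def badTask: Int => Int = (x: Int) => x * factor

  // Manual fix mirroring what Spark automates with bytecode analysis +
  // reflection: copy the needed field into a local val so the closure
  // captures only that value.
  def goodTask: Int => Int = { val f = factor; (x: Int) => x * f }
}

val d = new Driver
// Both compute the same result; only goodTask is cheap to ship to workers.
```

Spark's cleaner performs the `goodTask` rewrite automatically, nulling out references the closure does not actually need.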
Interactive use:
- Modified the Scala interpreter to allow Spark to be used interactively from the command line
- Required two changes:
  - Modified wrapper code generation so that each line typed has references to objects for its dependencies
  - Place generated classes in a distributed filesystem
- Enables in-memory exploration of big data
Future work:
- Further extend RDD capabilities
  - Control over storage layout (e.g., column-oriented)
  - Additional caching options (e.g., on disk, replicated)
- Leverage lineage for debugging: replay any task, rebuild any intermediate RDD
- Adaptive checkpointing of RDDs
- Higher-level analytics tools built on top of Spark
By making distributed datasets a first-class primitive, Spark provides a simple, efficient programming model for stateful data analytics. RDDs provide:
- Lineage info for fault recovery and debugging
- Adjustable in-memory caching
- Locality-aware parallel operations
We plan to make Spark the basis of a suite of batch and interactive data analysis tools.
An RDD is characterized by:
- A set of partitions
- Preferred locations for each partition
- An optional partitioning scheme (hash or range)
- A storage strategy (lazy or cached)
- Parent RDDs (forming a lineage DAG)
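This interface can be sketched as a Scala trait. All names here are made up for illustration; this is not Spark's actual internal API, and the storage strategy is omitted for brevity.

```scala
// Hypothetical sketch of the RDD interface listed above (not Spark's API).
trait SimpleRDD[T] {
  def partitions: Seq[Int]                       // set of partition IDs
  def preferredLocations(part: Int): Seq[String] // hosts to run the task on
  def parents: Seq[SimpleRDD[_]]                 // lineage DAG edges
  def compute(part: Int): Iterator[T]            // how to (re)build a partition
}

// A source RDD over "stable storage" (here just an in-memory Seq of chunks).
class SeqRDD[T](chunks: Seq[Seq[T]]) extends SimpleRDD[T] {
  def partitions: Seq[Int] = chunks.indices
  def preferredLocations(part: Int): Seq[String] = Seq("localhost")
  def parents: Seq[SimpleRDD[_]] = Nil
  def compute(part: Int): Iterator[T] = chunks(part).iterator
}

// A mapped RDD records its parent and the function, not the data: exactly
// the lineage needed to rebuild a lost partition on demand.
class MappedRDD[T, U](parent: SimpleRDD[T], f: T => U) extends SimpleRDD[U] {
  def partitions: Seq[Int] = parent.partitions
  def preferredLocations(part: Int): Seq[String] = parent.preferredLocations(part)
  def parents: Seq[SimpleRDD[_]] = Seq(parent)
  def compute(part: Int): Iterator[U] = parent.compute(part).map(f)
}
```

Note how MappedRDD inherits its parent's partitioning and locations, which is what enables the locality-aware scheduling described earlier.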
Speaker notes:
- Trends in data collection rates vs. processor and I/O speeds are driving this.
- Acyclic data flow also applies to Dryad, SQL, etc.; its benefit is that fault tolerance is easy.
- RDDs are a first-class way to manipulate and persist intermediate datasets.
- You write a single program, similar to DryadLINQ. Distributed datasets with parallel operations on them are fairly standard; the new thing is that they can be reused across operations.
- Variables in the driver program can be used in parallel operations; accumulators are useful for sending information back, and cached (broadcast) variables are an optimization.
- Everything is designed to be easy to distribute in a fault-tolerant fashion.
- Note that the dataset is reused on each gradient computation.
- The performance chart is for a 29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each).