In-Memory Cluster Computing for Iterative and Interactive Applications. Presented by Kelly Technologies, www.kellytechno.com



  • Commodity clusters have become an important computing platform for a variety of applications:
    In industry: search, machine translation, ad targeting, ...
    In research: bioinformatics, NLP, climate simulation, ...
    High-level cluster programming models like MapReduce power many of these apps.
    Theme of this work: provide similarly powerful abstractions for a broader class of applications.

  • Current popular programming models for clusters transform data flowing from stable storage to stable storage. E.g., MapReduce.


  • Acyclic data flow is a powerful abstraction, but it is not efficient for applications that repeatedly reuse a working set of data:
    Iterative algorithms (many in machine learning)
    Interactive data mining tools (R, Excel, Python)
    Spark makes working sets a first-class concept to efficiently support these apps.

  • Provide distributed memory abstractions for clusters to support apps with working sets.
    Retain the attractive properties of MapReduce:
    Fault tolerance (for crashes & stragglers)
    Data locality
    Scalability
    Solution: augment the data flow model with resilient distributed datasets (RDDs).

  • We conjecture that Spark's combination of data flow with RDDs unifies many proposed cluster programming models:
    General data flow models: MapReduce, Dryad, SQL
    Specialized models for stateful apps: Pregel (BSP), HaLoop (iterative MapReduce), Continuous Bulk Processing
    Instead of specialized APIs for one type of app, give the user first-class control of distributed datasets.

  • Spark programming model
    Example applications
    Implementation
    Demo
    Future work

  • Resilient distributed datasets (RDDs):
    Immutable collections partitioned across a cluster that can be rebuilt if a partition is lost
    Created by transforming data in stable storage using data flow operators (map, filter, group-by, ...)
    Can be cached across parallel operations
    Parallel operations on RDDs:
    Reduce, collect, count, save, ...
    Restricted shared variables:
    Accumulators, broadcast variables
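To make the model concrete, here is a minimal Python sketch of an immutable, lazily evaluated collection with transformations, caching, and parallel operations. This is an illustration only, not Spark's implementation; `MiniRDD` and its method names are invented for this sketch.

```python
# Minimal sketch (NOT Spark's code) of the RDD programming model:
# immutable, lazily evaluated collections built by transformations,
# evaluated only when a parallel operation (count/collect) is invoked.
class MiniRDD:
    def __init__(self, compute):
        self._compute = compute   # thunk producing this RDD's records
        self._cache = None        # filled on first evaluation if cached
        self._cached = False

    @staticmethod
    def from_list(data):
        return MiniRDD(lambda: list(data))

    def map(self, f):             # transformation: defines a new RDD
        return MiniRDD(lambda: [f(x) for x in self._evaluate()])

    def filter(self, p):          # transformation: defines a new RDD
        return MiniRDD(lambda: [x for x in self._evaluate() if p(x)])

    def cache(self):              # mark for reuse across operations
        self._cached = True
        return self

    def _evaluate(self):
        if self._cached and self._cache is not None:
            return self._cache
        result = self._compute()
        if self._cached:
            self._cache = result
        return result

    def count(self):              # parallel operation: returns a result
        return len(self._evaluate())

    def collect(self):            # parallel operation: returns a result
        return self._evaluate()

nums = MiniRDD.from_list([1, 2, 3, 4]).map(lambda x: x * 2).filter(lambda x: x > 4)
print(nums.collect())   # [6, 8]
print(nums.count())     # 2
```

Note that no work happens when `map` and `filter` are called; each transformation only records how to compute its records, mirroring the lazy construction of RDDs from stable storage.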

  • Load error messages from a log into memory, then interactively search for various patterns:

    lines = spark.textFile("hdfs://...")           // Base RDD
    errors = lines.filter(_.startsWith("ERROR"))   // Transformed RDD
    messages = errors.map(_.split("\t")(2))
    cachedMsgs = messages.cache()                  // Cached RDD

    cachedMsgs.filter(_.contains("foo")).count     // Parallel operation
    cachedMsgs.filter(_.contains("bar")).count
    . . .

    (Diagram: the driver sends tasks to workers holding Blocks 1-3; results come back, and the cached message partitions stay in Caches 1-3 for reuse.)
    Result: full-text search of Wikipedia in ...
  • An RDD is an immutable, partitioned, logical collection of records:
    Need not be materialized, but rather contains information to rebuild a dataset from stable storage
    Partitioning can be based on a key in each record (using hash or range partitioning)
    Built using bulk transformations on other RDDs
    Can be cached for future reuse

  • Transformations (define a new RDD): map, filter, sample, union, groupByKey, reduceByKey, join, cache, ...

    Parallel operations (return a result to driver): reduce, collect, count, save, lookupKey, ...

  • RDDs maintain lineage information that can be used to reconstruct lost partitions. Ex:

    cachedMsgs = textFile(...).filter(_.contains("error"))
                              .map(_.split("\t")(2))
                              .cache()
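The recovery idea can be sketched in a few lines of Python: each partition logs the chain of transformations that produced it, so a lost partition is recomputed from the base data rather than restored from a replica. This is an illustration of the concept, not Spark's code; `rebuild_partition` and the sample log lines are invented here.

```python
# Sketch of lineage-based recovery: reapply the logged chain of
# transformations to the base records to rebuild a lost partition.
def rebuild_partition(base_records, lineage):
    records = base_records
    for op, fn in lineage:
        if op == "filter":
            records = [r for r in records if fn(r)]
        elif op == "map":
            records = [fn(r) for r in records]
    return records

# Base data in "stable storage" plus the lineage of the cached RDD above.
log_lines = ["INFO ok", "ERROR disk\tfull\tsda1", "ERROR net\tdown\teth0"]
lineage = [
    ("filter", lambda line: "error" in line.lower()),
    ("map", lambda line: line.split("\t")[2]),
]

# Suppose the cached partition is lost: recompute it from lineage.
print(rebuild_partition(log_lines, lineage))  # ['sda1', 'eth0']
```

Because the lineage is small (a list of operator descriptions), logging it is far cheaper than replicating or checkpointing the data itself.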

  • Consistency is easy due to immutability
    Inexpensive fault tolerance (log lineage rather than replicating/checkpointing data)
    Locality-aware scheduling of tasks on partitions
    Despite being restricted, the model seems applicable to a broad variety of applications

  • RDDs vs. distributed shared memory:

    Concern               RDDs                                           Distr. Shared Mem.
    Reads                 Fine-grained                                   Fine-grained
    Writes                Bulk transformations                           Fine-grained
    Consistency           Trivial (immutable)                            Up to app / runtime
    Fault recovery        Fine-grained and low-overhead using lineage    Requires checkpoints and program rollback
    Straggler mitigation  Possible using speculative execution           Difficult
    Work placement        Automatic based on data locality               Up to app (but runtime aims for transparency)

  • DryadLINQ: language-integrated API with SQL-like operations on lazy datasets; cannot have a dataset persist across queries
    Relational databases: lineage/provenance, logical logging, materialized views
    Piccolo: parallel programs with shared distributed tables; similar to distributed shared memory
    Iterative MapReduce (Twister and HaLoop): cannot define multiple distributed datasets, run different map/reduce pairs on them, or query data interactively
    RAMCloud: allows random read/write to all cells, requiring logging much like distributed shared memory systems

  • Spark programming model
    Example applications
    Implementation
    Demo
    Future work

  • Goal: find the best line separating two sets of points. (Figure: two point classes, a random initial line, and the target separating line.)

  • val data = spark.textFile(...).map(readPoint).cache()

    var w = Vector.random(D)

    for (i <- 1 to ITERATIONS) {
      val gradient = data.map(p =>
        (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
      ).reduce(_ + _)
      w -= gradient
    }

    println("Final w: " + w)
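The same gradient loop can be sketched in pure Python (no cluster, invented sample data) to show what each iteration computes over the cached working set. The update rule below is the one from the Scala snippet: `w -= sum((1/(1+exp(-y*(w·x))) - 1) * y * x)`.

```python
import math
import random

# Pure-Python sketch of the logistic regression loop above: every
# iteration recomputes the gradient over the same (cached) point set.
def train(points, dims, iterations, seed=42):
    rng = random.Random(seed)
    w = [rng.uniform(-1, 1) for _ in range(dims)]      # random initial line
    for _ in range(iterations):
        grad = [0.0] * dims
        for x, y in points:                            # the reused working set
            dot = sum(wi * xi for wi, xi in zip(w, x))
            scale = (1.0 / (1.0 + math.exp(-y * dot)) - 1.0) * y
            for d in range(dims):
                grad[d] += scale * x[d]
        w = [wi - gi for wi, gi in zip(w, grad)]       # w -= gradient
    return w

# Tiny separable dataset: x > 0 labeled +1, x < 0 labeled -1.
points = [([1.0], 1), ([2.0], 1), ([-1.0], -1), ([-2.0], -1)]
w = train(points, dims=1, iterations=50)
predict = lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
print(all(predict(x) == y for x, y in points))  # True
```

In Spark the `data.map(...).reduce(_ + _)` runs in parallel over cached partitions; only the small weight vector `w` travels between the driver and the workers each iteration.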

  • Running time vs. number of iterations, Hadoop vs. Spark:

    Number of Iterations    Hadoop Running Time (s)    Spark Running Time (s)
    1                       128                        174
    5                       637                        214
    10                      1245                       242
    20                      2559                       283
    30                      3818                       354

  • MapReduce data flow can be expressed using RDD transformations:

    res = data.flatMap(rec => myMapFunc(rec))
              .groupByKey()
              .map((key, vals) => myReduceFunc(key, vals))

    Or with combiners:

    res = data.flatMap(rec => myMapFunc(rec))
              .reduceByKey(myCombiner)
              .map((key, val) => myReduceFunc(key, val))
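A plain-Python sketch of the flatMap / groupByKey / map pipeline makes the data flow explicit. The record format (year/temperature pairs) and the function names mirroring the slide are made up for illustration; no Spark is involved.

```python
from collections import defaultdict

# Sketch of MapReduce expressed as flatMap -> groupByKey -> map.
def flat_map(records, f):
    return [item for rec in records for item in f(rec)]

def group_by_key(pairs):
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return sorted(groups.items())

# myMapFunc emits (key, value) pairs; myReduceFunc folds each group.
my_map_func = lambda rec: [(rec.split()[0], int(rec.split()[1]))]
my_reduce_func = lambda key, vals: (key, max(vals))   # e.g. max temp per year

data = ["1999 34", "2000 28", "1999 31"]
res = [my_reduce_func(k, vs) for k, vs in group_by_key(flat_map(data, my_map_func))]
print(res)  # [('1999', 34), ('2000', 28)]
```

The combiner variant in the slide fuses the grouping and folding steps so partial results are merged before shuffling, which is what `reduceByKey` does in Spark.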

  • val lines = spark.textFile("hdfs://...")

    val counts = lines.flatMap(_.split("\\s"))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    counts.save("hdfs://...")
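The same word-count pipeline can be sketched in plain Python (an illustration, not Spark code): splitting lines on whitespace plays the role of flatMap, and the per-word counter plays the role of `reduceByKey(_ + _)`.

```python
import re
from collections import defaultdict

# Sketch of word count: flatMap (split on whitespace), then
# reduceByKey(_ + _) applied eagerly as a running per-word counter.
def word_count(lines):
    counts = defaultdict(int)
    for line in lines:
        for word in re.split(r"\s+", line.strip()):
            if word:
                counts[word] += 1
    return dict(counts)

print(word_count(["to be or", "not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```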

  • Pregel: a graph processing framework from Google that implements the Bulk Synchronous Parallel model:
    Vertices in the graph have state
    At each superstep, each vertex can update its state and send messages to vertices in the next superstep
    Good fit for PageRank, shortest paths, ...

  • (Data flow diagram: the input graph, vertex state 1, and messages 1 are grouped by vertex ID to run superstep 1, producing vertex state 2 and messages 2; these are grouped by vertex ID again for superstep 2, and so on.)

  • (PageRank as data flow: the input graph, vertex ranks 1, and contributions 1 are grouped and added by vertex in superstep 1, producing vertex ranks 2 and contributions 2; superstep 2 repeats the group-and-add, and so on.)

  • Separate RDDs for the immutable graph state and for the vertex states and messages at each iteration
    Use groupByKey to perform each step
    Cache the resulting vertex and message RDDs
    Optimization: co-partition the input graph and vertex state RDDs to reduce communication
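The group-and-add superstep can be sketched in a few lines of Python on a tiny made-up graph (illustrative names, not Spark code): each PageRank iteration groups rank contributions by target vertex and sums them, which is exactly a group-by-key step.

```python
from collections import defaultdict

# Sketch of Pregel-style PageRank as repeated group-by-vertex steps.
def pagerank(links, iterations=20, damping=0.85):
    ranks = {v: 1.0 for v in links}
    for _ in range(iterations):
        contribs = defaultdict(float)          # "messages" keyed by vertex
        for v, outs in links.items():
            for out in outs:
                contribs[out] += ranks[v] / len(outs)
        # group & add by vertex, then apply the damping update
        ranks = {v: (1 - damping) + damping * contribs[v] for v in links}
    return ranks

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # 'c' ends up with the highest rank here
```

In the Spark version, `contribs` and `ranks` are RDDs, the grouping is a `groupByKey`, and co-partitioning `links` and `ranks` by vertex lets each superstep avoid shuffling the graph structure.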

  • Twitter spam classification (Justin Ma)
    EM algorithm for traffic prediction (Mobile Millennium)
    K-means clustering
    Alternating Least Squares matrix factorization
    In-memory OLAP aggregation on Hive data
    SQL on Spark (future work)

  • Spark programming model
    Example applications
    Implementation
    Demo
    Future work

  • Spark runs on the Mesos cluster manager [NSDI '11], letting it share resources with Hadoop and other apps
    Can read from any Hadoop input source (e.g., HDFS)
    ~6,000 lines of Scala code, thanks to building on Mesos

  • Scala closures are Serializable Java objects: serialize on the driver, load and run on workers
    Not quite enough:
    Nested closures may reference the entire outer scope
    May pull in non-Serializable variables not used inside
    Solution: bytecode analysis + reflection
    Shared variables are implemented using a custom serialized form (e.g., a broadcast variable contains a pointer to a BitTorrent tracker)

  • Modified the Scala interpreter to allow Spark to be used interactively from the command line. Required two changes:
    Modified wrapper code generation so that each line typed has references to objects for its dependencies
    Place generated classes in a distributed filesystem
    Enables in-memory exploration of big data

  • Spark programming model
    Example applications
    Implementation
    Demo
    Future work

  • Further extend RDD capabilities:
    Control over storage layout (e.g., column-oriented)
    Additional caching options (e.g., on disk, replicated)
    Leverage lineage for debugging:
    Replay any task, rebuild any intermediate RDD
    Adaptive checkpointing of RDDs
    Higher-level analytics tools built on top of Spark

  • By making distributed datasets a first-class primitive, Spark provides a simple, efficient programming model for stateful data analytics. RDDs provide:
    Lineage info for fault recovery and debugging
    Adjustable in-memory caching
    Locality-aware parallel operations
    We plan to make Spark the basis of a suite of batch and interactive data analysis tools.

  • Each RDD is characterized by:
    A set of partitions
    Preferred locations for each partition
    An optional partitioning scheme (hash or range)
    A storage strategy (lazy or cached)
    Parent RDDs (forming a lineage DAG)
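The five pieces of per-RDD state listed above can be sketched as a small Python data type. The class and field names are invented for illustration; this is the shape of the interface, not Spark's actual internals.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Sketch of the internal per-RDD interface: partitions, preferred
# locations, optional partitioner, storage strategy, and parent RDDs
# (the parents link RDDs into a lineage DAG).
@dataclass
class RDDInterface:
    partitions: List[int]                                         # partition ids
    parents: List["RDDInterface"] = field(default_factory=list)   # lineage DAG
    partitioner: Optional[str] = None                             # "hash" or "range"
    storage: str = "lazy"                                         # "lazy" or "cached"

    def preferred_locations(self, partition: int) -> List[str]:
        # e.g. the hosts holding the HDFS block backing this partition
        return []

base = RDDInterface(partitions=[0, 1, 2])
mapped = RDDInterface(partitions=[0, 1, 2], parents=[base], storage="cached")
print(len(mapped.parents), mapped.storage)  # 1 cached
```

Walking the `parents` links from any RDD back to stable storage yields exactly the lineage used for fault recovery earlier in the deck.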

  • Presented by Kelly Technologies

    Speaker notes (cleaned from slide residue):
    Explain that trends in data collection rates vs. processor and I/O speeds are driving this.
    Also applies to Dryad, SQL, etc. Benefits: easy to do fault tolerance and ...
    RDDs = first-class way to manipulate and persist intermediate datasets.
    Key idea: you write a single program, similar to DryadLINQ. Distributed datasets with parallel operations on them are pretty standard; the new thing is that they can be reused across operations.
    Variables in the driver program can be used in parallel ops; accumulators are useful for sending information back; cached vars are an optimization.
    Mention that cached vars are useful for some workloads that won't be shown here.
    Mention it's all designed to be easy to distribute in a fault-tolerant fashion.
    Note that the dataset is reused on each gradient computation.
    This is for a 29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each).
    Mention it's designed to be fault-tolerant.