
Spark: Lightning Fast Cluster Computing Framework!

Nitish Upreti
[email protected]

 

Agenda for the Day

• How do we mine useful information from massive datasets?
• Overview of Problems in the Big Data Space.
• Apache Hadoop and its Limitations.
• Introduce Apache Spark.
• Switching Gears: Exploring Spark Internals.
• Discuss an active research area, "BlinkDB": Queries with Bounded Response Time on very large data.
• Conclusion

Data is KEY for Organizations.

Data Mining is challenging. Mining is even more challenging with Massive Datasets!

Big Data Problem for Amazon: "Grouping books on the same topic"

Solution: k-Means Algorithm
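To make the algorithm concrete, here is a minimal, hedged sketch of k-means in plain Scala: repeatedly assign each point to its nearest centroid, then recompute each centroid as its cluster's mean. The one-dimensional toy data, starting centroids, and fixed iteration count are illustrative assumptions only.

```scala
// Minimal single-machine k-means sketch: 1-D points, k = 2, fixed iterations.
object KMeansSketch {
  // Nearest centroid to a point (the "assignment" step).
  def closest(p: Double, centroids: Seq[Double]): Double =
    centroids.minBy(c => math.abs(p - c))

  def main(args: Array[String]): Unit = {
    val points    = Seq(1.0, 1.2, 0.8, 9.0, 9.5, 10.1) // toy "book" features
    var centroids = Seq(0.0, 5.0)                      // arbitrary initial guesses

    for (_ <- 1 to 10) {
      val clusters = points.groupBy(p => closest(p, centroids))  // assignment
      centroids = clusters.values.map(c => c.sum / c.size).toSeq // update
    }
    println(centroids) // roughly Seq(1.0, 9.53...); order may vary
  }
}
```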

Not so simple with large datasets… How do we solve this problem?

Back in 2000, Google had a problem…

• Processing massive amounts of raw data (crawled documents, web request logs).

• Specialized Hardware (Vertical Scaling) was super expensive.

• The computation thus needed to be distributed.

Solution: Google's MapReduce paradigm, by Jeffrey Dean and Sanjay Ghemawat. Essentially a Distributed Cluster Computing Framework.

What are the core ideas behind MapReduce, and why does it scale to Massive Datasets?

Why is MapReduce so important?

• Using commodity hardware for computation.
• Overcoming commodity hardware limitations: provides a solution for an environment where failures are very frequent.
• Pushing computation to the data (rather than the other way around).
• Provides abstraction to focus on domain logic: a programming / cluster computing model for distributed computing using the functional primitives map and reduce.

Simplest Big Data Problem: Counting Words.

Text Mining: Word Count

MapReduce Pseudo Code
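The slide's actual pseudo code is not preserved in this transcript; below is a hedged reconstruction of the classic word-count logic in plain Scala (no Hadoop dependency), with the map and reduce phases written out as separate functions. The names `mapPhase` and `reducePhase` are illustrative, not a real Hadoop API.

```scala
// A plain-Scala reconstruction of the word-count map/reduce logic.
object WordCount {
  // Map phase: emit a (word, 1) pair for every word in a document.
  def mapPhase(document: String): Seq[(String, Int)] =
    document.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // Reduce phase: sum all counts emitted for one word.
  def reducePhase(word: String, counts: Seq[Int]): (String, Int) =
    (word, counts.sum)

  def main(args: Array[String]): Unit = {
    val docs = Seq("to be or not to be", "to see or not to see")
    val counts = docs
      .flatMap(mapPhase)               // map: emit (word, 1) pairs
      .groupBy(_._1)                   // shuffle: group pairs by word
      .map { case (w, ps) => reducePhase(w, ps.map(_._2)) } // reduce: sum
    counts.foreach(println)
  }
}
```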

Does the MapReduce paradigm completely solve the Big Data problem? Where do we go from here?

The BIG Picture! What do we need to process Big Data?

Three major Big Data Scenarios

• Interactive Queries: enable faster decisions.
Example: query website logs to diagnose why the website is slow. (Apache Pig, Apache Hive…)
• Sophisticated Batch Data Processing: enable better decisions.
Example: trend analysis, analytics.
• Streaming Data Processing: real-time decision making.
Example: fraud detection, detecting DDoS attacks.

After Google's initial work… the Open Source Community soon caught up with the Hadoop Ecosystem!

Tools for every Data Analysis scenario…

There are many inherent limitations with the MapReduce ecosystem…

MapReduce Limitations

• Iterative Jobs: common algorithms apply a function repeatedly to the same dataset. While each iteration can be expressed as a MapReduce job, each job must store and then reload data from disk.

• Interactive Analysis: there is a need to run ad-hoc queries on datasets. We want to be able to load a dataset into memory across machines and query it repeatedly.

Example 1: Iterative Breadth-First Search in Hadoop.

Example 2: Ad-hoc SQL-like queries, anyone? Using Hive and Pig?

Spark to the Rescue!

Spark easily outperforms Hadoop in the discussed scenarios by 10x–100x.

It's not simply about the performance gain…

Introducing Spark

• Open Source data analytics cluster computing framework.
• Provides primitives for in-memory cluster computing that let the user load data into the cluster's memory and query it repeatedly, spilling to HDFS only when needed (see the sketch below).
• Allows interactive ad-hoc data exploration (supports pipelining & lazy initialization).
• Unifies batch, streaming, and interactive computation.
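A minimal, hedged sketch of that load-once, query-repeatedly style using Spark's RDD API; the log path and the query strings are hypothetical examples.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: cache a dataset in cluster memory once, then query it repeatedly.
object LogMining {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LogMining"))

    val lines  = sc.textFile("hdfs://logs/website.log")    // hypothetical path
    val errors = lines.filter(_.contains("ERROR")).cache() // keep in memory

    // Repeated ad-hoc queries now run against the cached dataset.
    println(errors.count())
    println(errors.filter(_.contains("timeout")).count())

    sc.stop()
  }
}
```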

A  rich  Spark  Ecosystem    

The Scala Programming Language
- Object-oriented / Functional
- Runs on the JVM
- Concise syntax (illustrated below)
- Aims to support interactive scripting + development
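A tiny illustration of that conciseness in plain Scala (toy data): functions are values, so a small data pipeline reads as a single expression.

```scala
// Group-and-count in two lines; map ordering in the output may vary.
val words  = List("spark", "hadoop", "spark")
val counts = words.groupBy(identity).map { case (w, ws) => (w, ws.size) }
println(counts) // e.g. Map(spark -> 2, hadoop -> 1)
```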

 

Working with Spark…

Why is it so Powerful?

Exploring Spark Internals…

• Spark: Cluster Computing with Working Sets. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. HotCloud 2010, June 2010.

• Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. NSDI 2012, April 2012. Best Paper Award and Honorable Mention for Community Award.

Spark Essentials: Driver Program

Core of Spark: RDDs

• RDD: Resilient Distributed Datasets are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.

• An RDD is a read-only collection of objects partitioned across a set of machines.

• RDDs provide an interface based on coarse-grained transformations that apply the same operation to many data items. This enables fault tolerance: by logging transformations, Spark builds a dataset lineage that can be used to rebuild a partition if it is lost.

Manipulating RDDs…
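A hedged sketch of the two kinds of RDD operations, assuming the SparkContext `sc` from the earlier sketch: transformations are lazy and only record lineage, while actions trigger the actual computation.

```scala
// Transformations are lazy; only the final action runs the pipeline.
val nums    = sc.parallelize(1 to 1000000)
val squares = nums.map(x => x.toLong * x) // transformation: nothing runs yet
val evens   = squares.filter(_ % 2 == 0)  // transformation: lineage grows
println(evens.count())                    // action: the pipeline executes now
```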

Periodic checkpointing of data for long-running lineage chains.
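A minimal sketch of that idea (again assuming `sc`; the checkpoint directory and the toy update are hypothetical): without the periodic `checkpoint()` call, the lineage graph grows with every iteration and recovery becomes expensive.

```scala
// Without the periodic checkpoint, lineage grows by one map per iteration.
sc.setCheckpointDir("hdfs://tmp/checkpoints") // hypothetical directory
var ranks = sc.parallelize(Seq(1.0, 2.0, 3.0))
for (i <- 1 to 100) {
  ranks = ranks.map(_ * 0.85 + 0.15)          // lineage grows each iteration
  if (i % 10 == 0) {
    ranks.checkpoint()                        // mark for saving to stable storage
    ranks.count()                             // an action materializes the checkpoint
  }
}
println(ranks.sum())                          // recovery now replays a short lineage
```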

Shared Variables in SPARK:
- Broadcast Variables
- Accumulators

Programmers can create two restricted types of shared variables to support two common, simple usage patterns…
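A hedged sketch of both types (assuming `sc`; the lookup table and dataset are illustrative assumptions): a broadcast variable ships one read-only copy of a value to each node, while an accumulator can only be added to from tasks and read back on the driver.

```scala
// Broadcast: one read-only copy per node. Accumulator: add-only from tasks.
val table  = sc.broadcast(Map("a" -> 1, "b" -> 2)) // shipped to workers once
val misses = sc.longAccumulator("lookupMisses")    // tasks may only add to it

val keys = sc.parallelize(Seq("a", "b", "c"))
val resolved = keys.map { k =>
  if (!table.value.contains(k)) misses.add(1)      // counted on the workers
  table.value.getOrElse(k, -1)                     // -1 marks a missing key
}
resolved.collect().foreach(println)
println(s"misses: ${misses.value}")                // read back on the driver
```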

Initial Experiment Results

Logistic Regression, 29 GB dataset.
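This benchmark runs the canonical iterative logistic-regression loop, where caching pays off because every iteration re-reads the same training set. A much-simplified, hedged version of that loop (one-dimensional features, toy data, assuming `sc`):

```scala
// Cache the training set once; every gradient pass then reads from memory.
case class Point(x: Double, y: Double)             // y is the +1 / -1 label
val points = sc.parallelize(Seq(
  Point(1.0, 1), Point(-2.0, -1), Point(3.0, 1), Point(-1.5, -1)
)).cache()

var w = 0.0                                        // single model weight
for (_ <- 1 to 10) {
  val gradient = points
    .map(p => (1.0 / (1.0 + math.exp(-p.y * w * p.x)) - 1.0) * p.y * p.x)
    .reduce(_ + _)
  w -= gradient                                    // plain gradient step
}
println(s"learned w = $w")
```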

Interactive Data Mining on Wikipedia: 1 TB from disk took 170 s.

Back to our Question…

Why is SPARK so Powerful?

RDDs' Expressivity and Generality

• RDDs are able to express a diverse set of programming models because their restrictions have little impact on parallel applications.

• A lot of these programs naturally apply the same operation to many records, making them a good fit.

• Previous systems addressed specific problems on top of MapReduce. However, at the core of the problem is the need for a common data-sharing abstraction.

• RDDs capture all the major optimizations: keeping specific data in memory, custom partitioning to minimize communication (sketched below), and recovering from failures effectively.
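As a concrete example of the custom-partitioning point (a hedged sketch with toy data, assuming `sc`): hash-partitioning a keyed dataset once up front lets later joins reuse that partitioning instead of reshuffling it.

```scala
import org.apache.spark.HashPartitioner

// Partition the users once; the later join reuses that partitioning.
val users = sc.parallelize(Seq((1, "alice"), (2, "bob")))
  .partitionBy(new HashPartitioner(8))  // choose the partitioning up front
  .cache()                              // keep the partitioned data in memory

val visits = sc.parallelize(Seq((1, "home"), (1, "cart"), (2, "search")))
val joined = users.join(visits)         // the cached users side is not reshuffled
joined.collect().foreach(println)
```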

Current Research…

Queries with Bounded Error and Bounded Response Time on Very Large Data.

Approximate Queries…

Explore More of BlinkDB

• Visit: http://blinkdb.org/
• Blink and It's Done: Interactive Queries on Very Large Data. Sameer Agarwal, Aurojit Panda, Barzan Mozafari, Anand P. Iyer, Samuel Madden, Ion Stoica. PVLDB 5(12): 1902–1905, 2012, Istanbul, Turkey.
• BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, Ion Stoica. ACM EuroSys 2013, Prague, Czech Republic. Best Paper Award.

Questions?

Thank You!