
Spark: Lightning Fast Cluster Computing Framework!

Nitish Upreti
[email protected]

 

Agenda for the Day

• How do we mine useful information from massive datasets?
• Overview of Problems in the Big Data Space.
• Apache Hadoop and its Limitations.
• Introduce Apache Spark.
• Switching Gears: Exploring Spark Internals.
• Discuss an active research area, "BlinkDB": Queries with Bounded Response Time on very large data.
• Conclusion

Data is KEY for Organizations.

Data Mining is challenging. Mining is even more challenging with Massive Datasets!

Big Data Problem for Amazon: "Grouping books on the same topic"

Solution: k-Means Algorithm
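To make the algorithm concrete, here is a minimal, hedged sketch of k-means in plain Scala: repeatedly assign each point to its nearest centroid, then recompute each centroid as its cluster's mean. The one-dimensional toy data, starting centroids, and fixed iteration count are illustrative assumptions only.

```scala
// Minimal single-machine k-means sketch: 1-D points, k = 2, fixed iterations.
object KMeansSketch {
  // Nearest centroid to a point (the "assignment" step).
  def closest(p: Double, centroids: Seq[Double]): Double =
    centroids.minBy(c => math.abs(p - c))

  def main(args: Array[String]): Unit = {
    val points    = Seq(1.0, 1.2, 0.8, 9.0, 9.5, 10.1) // toy "book" features
    var centroids = Seq(0.0, 5.0)                      // arbitrary initial guesses

    for (_ <- 1 to 10) {
      val clusters = points.groupBy(p => closest(p, centroids))  // assignment
      centroids = clusters.values.map(c => c.sum / c.size).toSeq // update
    }
    println(centroids) // roughly Seq(1.0, 9.53...); order may vary
  }
}
```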

Not so simple with large datasets… How do we solve this problem?

Back in 2000, Google had a problem…

• Processing massive amounts of raw data (crawled documents, web request logs).

• Specialized Hardware (Vertical Scaling) was super expensive.

• The computation thus needed to be distributed.

Solution: Google's MapReduce paradigm, by Jeffrey Dean and Sanjay Ghemawat. Essentially a Distributed Cluster Computing Framework.

What are the core ideas behind MapReduce, and why does it scale to Massive Datasets?

Why is MapReduce so important?

• Using commodity hardware for computation.
• Overcoming commodity hardware limitations: provides a solution for an environment where failures are very frequent.
• Pushing computation to the data (rather than the other way around).
• Provides abstraction to focus on domain logic: a programming / cluster computing model for distributed computing using the functional primitives map and reduce.

Simplest Big Data Problem: Counting Words.

Text Mining: Word Count

MapReduce Pseudo Code
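The slide's actual pseudo code is not preserved in this transcript; below is a hedged reconstruction of the classic word-count logic in plain Scala (no Hadoop dependency), with the map and reduce phases written out as separate functions. The names `mapPhase` and `reducePhase` are illustrative, not a real Hadoop API.

```scala
// A plain-Scala reconstruction of the word-count map/reduce logic.
object WordCount {
  // Map phase: emit a (word, 1) pair for every word in a document.
  def mapPhase(document: String): Seq[(String, Int)] =
    document.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // Reduce phase: sum all counts emitted for one word.
  def reducePhase(word: String, counts: Seq[Int]): (String, Int) =
    (word, counts.sum)

  def main(args: Array[String]): Unit = {
    val docs = Seq("to be or not to be", "to see or not to see")
    val counts = docs
      .flatMap(mapPhase)               // map: emit (word, 1) pairs
      .groupBy(_._1)                   // shuffle: group pairs by word
      .map { case (w, ps) => reducePhase(w, ps.map(_._2)) } // reduce: sum
    counts.foreach(println)
  }
}
```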

Does the MapReduce paradigm completely solve the Big Data problem? Where do we go from here?

The BIG Picture! What do we need to process Big Data?

Three major Big Data Scenarios

• Interactive Queries: enable faster decisions.
Example: query website logs to diagnose why the website is slow. (Apache Pig, Apache Hive…)
• Sophisticated Batch Data Processing: enable better decisions.
Example: trend analysis, analytics.
• Streaming Data Processing: real-time decision making.
Example: fraud detection, detecting DDoS attacks.

After Google's initial work… the Open Source Community soon caught up with the Hadoop Ecosystem!

Tools for every Data Analysis scenario…

There are many inherent limitations with the MapReduce ecosystem…

MapReduce Limitations

• Iterative Jobs: common algorithms apply a function repeatedly to the same dataset. While each iteration can be expressed as a MapReduce job, each job must store and then reload data from disk.

• Interactive Analysis: there is a need to run ad-hoc queries on datasets. We want to be able to load a dataset into memory across machines and query it repeatedly.

Example 1: Iterative Breadth-First Search in Hadoop.

Example 2: Ad-hoc SQL-like queries, anyone? Using Hive and Pig?

Spark to the Rescue!

Spark easily outperforms Hadoop in the discussed scenarios by 10x–100x.

It's not simply about the performance gain…

Introducing Spark

• Open Source data analytics cluster computing framework.
• Provides primitives for in-memory cluster computing that let the user load data into the cluster's memory and query it repeatedly, spilling to HDFS only when needed (see the sketch below).
• Allows interactive ad-hoc data exploration (supports pipelining & lazy initialization).
• Unifies batch, streaming, and interactive computation.
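A minimal, hedged sketch of that load-once, query-repeatedly style using Spark's RDD API; the log path and the query strings are hypothetical examples.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: cache a dataset in cluster memory once, then query it repeatedly.
object LogMining {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LogMining"))

    val lines  = sc.textFile("hdfs://logs/website.log")    // hypothetical path
    val errors = lines.filter(_.contains("ERROR")).cache() // keep in memory

    // Repeated ad-hoc queries now run against the cached dataset.
    println(errors.count())
    println(errors.filter(_.contains("timeout")).count())

    sc.stop()
  }
}
```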

A  rich  Spark  Ecosystem    

The Scala Programming Language
- Object-oriented / Functional
- Runs on the JVM
- Concise syntax (illustrated below)
- Aims to support interactive scripting + development
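A tiny illustration of that conciseness in plain Scala (toy data): functions are values, so a small data pipeline reads as a single expression.

```scala
// Group-and-count in two lines; map ordering in the output may vary.
val words  = List("spark", "hadoop", "spark")
val counts = words.groupBy(identity).map { case (w, ws) => (w, ws.size) }
println(counts) // e.g. Map(spark -> 2, hadoop -> 1)
```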

 

Working with Spark…

Why is it so Powerful?

Exploring Spark Internals…

• Spark: Cluster Computing with Working Sets. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. HotCloud 2010, June 2010.

• Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. NSDI 2012, April 2012. Best Paper Award and Honorable Mention for Community Award.

Spark Essentials: Driver Program

Core of Spark: RDDs

• RDD: Resilient Distributed Datasets are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.

• An RDD is a read-only collection of objects partitioned across a set of machines.

• RDDs provide an interface based on coarse-grained transformations that apply the same operation to many data items. This enables fault tolerance: by logging transformations, Spark builds a dataset lineage that can be used to rebuild a partition if it is lost.

Manipulating RDDs…
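A hedged sketch of the two kinds of RDD operations, assuming the SparkContext `sc` from the earlier sketch: transformations are lazy and only record lineage, while actions trigger the actual computation.

```scala
// Transformations are lazy; only the final action runs the pipeline.
val nums    = sc.parallelize(1 to 1000000)
val squares = nums.map(x => x.toLong * x) // transformation: nothing runs yet
val evens   = squares.filter(_ % 2 == 0)  // transformation: lineage grows
println(evens.count())                    // action: the pipeline executes now
```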

Periodic checkpointing of data for long-running lineage chains.
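A minimal sketch of that idea (again assuming `sc`; the checkpoint directory and the toy update are hypothetical): without the periodic `checkpoint()` call, the lineage graph grows with every iteration and recovery becomes expensive.

```scala
// Without the periodic checkpoint, lineage grows by one map per iteration.
sc.setCheckpointDir("hdfs://tmp/checkpoints") // hypothetical directory
var ranks = sc.parallelize(Seq(1.0, 2.0, 3.0))
for (i <- 1 to 100) {
  ranks = ranks.map(_ * 0.85 + 0.15)          // lineage grows each iteration
  if (i % 10 == 0) {
    ranks.checkpoint()                        // mark for saving to stable storage
    ranks.count()                             // an action materializes the checkpoint
  }
}
println(ranks.sum())                          // recovery now replays a short lineage
```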

Shared Variables in SPARK:
- Broadcast Variables
- Accumulators

Programmers can create two restricted types of shared variables to support two common, simple usage patterns…
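A hedged sketch of both types (assuming `sc`; the lookup table and dataset are illustrative assumptions): a broadcast variable ships one read-only copy of a value to each node, while an accumulator can only be added to from tasks and read back on the driver.

```scala
// Broadcast: one read-only copy per node. Accumulator: add-only from tasks.
val table  = sc.broadcast(Map("a" -> 1, "b" -> 2)) // shipped to workers once
val misses = sc.longAccumulator("lookupMisses")    // tasks may only add to it

val keys = sc.parallelize(Seq("a", "b", "c"))
val resolved = keys.map { k =>
  if (!table.value.contains(k)) misses.add(1)      // counted on the workers
  table.value.getOrElse(k, -1)                     // -1 marks a missing key
}
resolved.collect().foreach(println)
println(s"misses: ${misses.value}")                // read back on the driver
```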

Initial Experiment Results

Logistic Regression, 29 GB dataset.
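This benchmark runs the canonical iterative logistic-regression loop, where caching pays off because every iteration re-reads the same training set. A much-simplified, hedged version of that loop (one-dimensional features, toy data, assuming `sc`):

```scala
// Cache the training set once; every gradient pass then reads from memory.
case class Point(x: Double, y: Double)             // y is the +1 / -1 label
val points = sc.parallelize(Seq(
  Point(1.0, 1), Point(-2.0, -1), Point(3.0, 1), Point(-1.5, -1)
)).cache()

var w = 0.0                                        // single model weight
for (_ <- 1 to 10) {
  val gradient = points
    .map(p => (1.0 / (1.0 + math.exp(-p.y * w * p.x)) - 1.0) * p.y * p.x)
    .reduce(_ + _)
  w -= gradient                                    // plain gradient step
}
println(s"learned w = $w")
```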

Interactive Data Mining on Wikipedia: 1 TB from disk took 170 s.

Back to our Question…

Why is SPARK so Powerful?

RDDs' Expressivity and Generality

• RDDs are able to express a diverse set of programming models because their restrictions have little impact on parallel applications.

• A lot of these programs naturally apply the same operation to many records, making them a good fit.

• Previous systems addressed specific problems on top of MapReduce. However, at the core of the problem is the need for a common data-sharing abstraction.

• RDDs capture all the major optimizations: keeping specific data in memory, custom partitioning to minimize communication (sketched below), and recovering from failures effectively.
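As a concrete example of the custom-partitioning point (a hedged sketch with toy data, assuming `sc`): hash-partitioning a keyed dataset once up front lets later joins reuse that partitioning instead of reshuffling it.

```scala
import org.apache.spark.HashPartitioner

// Partition the users once; the later join reuses that partitioning.
val users = sc.parallelize(Seq((1, "alice"), (2, "bob")))
  .partitionBy(new HashPartitioner(8))  // choose the partitioning up front
  .cache()                              // keep the partitioned data in memory

val visits = sc.parallelize(Seq((1, "home"), (1, "cart"), (2, "search")))
val joined = users.join(visits)         // the cached users side is not reshuffled
joined.collect().foreach(println)
```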

Current Research…

Queries with Bounded Error and Bounded Response Time on Very Large Data.

Approximate Queries…

Explore More of BlinkDB

• Visit: http://blinkdb.org/
• Blink and It's Done: Interactive Queries on Very Large Data. Sameer Agarwal, Aurojit Panda, Barzan Mozafari, Anand P. Iyer, Samuel Madden, Ion Stoica. PVLDB 5(12): 1902–1905, 2012, Istanbul, Turkey.
• BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, Ion Stoica. ACM EuroSys 2013, Prague, Czech Republic. Best Paper Award.

Questions?

Thank You!