FROM ORACLE TO CASSANDRA WITH SPARK

Silicon Valley Data Science: From Oracle to Cassandra with Spark

@TheAllantGroup | @SVDataScience 2 © 2015. ALL RIGHTS RESERVED.

WHO ARE WE?

Shambho Krishnasamy Fausto Inestroza

CUSTOMER RECOGNITION

Challenges in the Digital age:
– Scalability
– Throughput
– Cost

CUSTOMER RECOGNITION

Key Management – Tailoring – Key Assignment – Hygiene

Functional Buckets


LEGACY APPLICATION

[Diagram: Recognition Bus Service mediating Input and Output across the functional buckets — Party, Address, HHLD, Indv, Digital, Keying Lookup, Reference — backed by the Address, Household, Individual, DigitalKey, Digi-Asso, and Reference tables]

LEGACY SOLUTION

[Architecture diagram: JMS-based message bus in front of the Oracle data store]

NEED FOR CHANGE? RE-PLATFORM! RE-ARCHITECT!

LIMITATIONS TO SCALE – MESSAGE PROCESSING ARCHITECTURE

• Message processing engine
• Common API to handle real-time and batch
• Batch is converted into messages
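The batch-to-messages design above can be sketched in a few lines. This is a minimal illustration under assumed names (`Message`, `process_message`, etc. are hypothetical, not Allant's actual code); the point is that both traffic types funnel through one per-message API.

```python
from dataclasses import dataclass
from typing import Iterable, List

@dataclass
class Message:
    """A single recognition request: the unit of work for the engine."""
    record_id: int
    payload: dict

def process_message(msg: Message) -> dict:
    # Placeholder for the real recognition logic.
    return {"record_id": msg.record_id, "status": "processed"}

def process_realtime(msg: Message) -> dict:
    """Real-time path: one message in, one result out."""
    return process_message(msg)

def process_batch(records: Iterable[dict]) -> List[dict]:
    """Batch path: each input record is converted into a message,
    then pushed through the same per-message API as real-time traffic."""
    return [process_realtime(Message(i, rec)) for i, rec in enumerate(records)]
```

Because both paths end in `process_message`, the engine's logic stays identical for real-time and batch workloads — the property the legacy design preserved, and also the reason batch throughput was bounded by per-message overhead.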

LIMITATIONS TO SCALE – DATA THROUGHPUT

4–8 MM records/hour

Volume and Performance: scale to meet Allant's Audience Interconnect® customer recognition needs

LIMITATIONS TO SCALE – SCALING HORIZONTALLY

Locking!  

LIMITATIONS TO SCALE – SCALING VERTICALLY


WHAT DO WE WANT?

Increase throughput (but don't compromise on real-time API capability!)

Improve scalability (but contain cost!)

Elastic infrastructure (well… so we went Cloud)

WHAT TO RE-PLATFORM?

[Diagram: the legacy JMS architecture, with the data store marked "?"]

CASSANDRA

SWITCH DATA STORE

Consistent Reads! Consistent Writes!

[Diagram: the JMS application layer now backed by Cassandra instead of Oracle]
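The "consistent reads/writes" callout maps to Cassandra's tunable consistency: if the replicas contacted on read (R) plus the replicas contacted on write (W) exceed the replication factor (RF), every read overlaps the latest write — e.g. QUORUM reads and QUORUM writes with RF = 3. A small sketch of that rule (illustrative only, not from the deck):

```python
def quorum(rf: int) -> int:
    """Replica count for a QUORUM operation: a strict majority of RF."""
    return rf // 2 + 1

def is_strongly_consistent(r: int, w: int, rf: int) -> bool:
    """Reads and writes overlap on at least one replica when R + W > RF."""
    return r + w > rf

# QUORUM + QUORUM with RF=3: 2 + 2 > 3 -> strongly consistent.
# ONE + ONE with RF=3:       1 + 1 <= 3 -> eventually consistent only.
```

Choosing QUORUM on both sides is what lets Cassandra replace Oracle here without giving up read-your-writes behavior, at the cost of touching a majority of replicas per operation.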

WE’RE DONE!

BUT… APPLICATION LAYER IS STILL A BOTTLENECK

MUST MAINTAIN EXISTING LOGIC!

RECAP…  

RECAP – BASELINE

[Architecture diagram: the original JMS + Oracle solution]

RECAP – LEGACY APPLICATION

[Diagram: Recognition Bus Service mediating Input and Output across the functional buckets — Party, Address, HHLD, Indv, Digital, Keying Lookup, Reference — backed by the Address, Household, Individual, DigitalKey, Digi-Asso, and Reference tables]

HOW DID WE DO IT?

INTRODUCED CASSANDRA…

[Diagram: the JMS application layer now writing to Cassandra]

But Cassandra is very bored…

HOW TO RE-ARCHITECT?

[Diagram: the JMS application layer highlighted — "What about this part?"]

Cassandra is very bored…

NOW INTRODUCE HADOOP

We employed Distributed Data Management Technology end-to-end…

Cassandra is very happy!

PERFORMANCE BENCHMARK RESULTS - ENVIRONMENT

• 12 Cassandra Nodes – 4 CPU, 15 GB RAM, 80 GB SSD
• 6 Hadoop Nodes – 32 CPU, 60 GB RAM, 640 GB SSD

PERFORMANCE BENCHMARK RESULTS - MAPREDUCE

Benchmark 1: Smaller Input (~15 Million Profiles)

Environment           | Results
JMS – Oracle          | 4.5 Million / Hour
MapReduce – Cassandra | 44 Million / Hour   (~10x)

Benchmark 2: Larger Input (~400 Million Profiles)

Environment           | Results
JMS – Oracle          | 2.5 Million / Hour
MapReduce – Cassandra | 45 Million / Hour   (~20x)

From 6–7 days down to ~8 hours!
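The wall-clock claim follows directly from the throughput figures: ~400 million profiles at 2.5 MM/hour is ~160 hours (6–7 days), while 45 MM/hour brings the same run to under 9 hours.

```python
profiles = 400e6                             # ~400 million input profiles

legacy_hours = profiles / 2.5e6              # JMS–Oracle at 2.5 MM/hour
mapreduce_hours = profiles / 45e6            # MapReduce–Cassandra at 45 MM/hour

print(legacy_hours / 24)                     # -> ~6.7 days
print(mapreduce_hours)                       # -> ~8.9 hours
print(legacy_hours / mapreduce_hours)        # -> 18x, i.e. the deck's "~20x"
```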

COULD WE DO BETTER?

INTRODUCE SPARK

Cassandra is ecstatic!

EMPLOY DATASTAX LIGHTNING FAST CONNECTOR
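The connector referred to here is the open-source DataStax Spark Cassandra Connector. A typical way to wire it in looks like the fragment below; the artifact version, host address, and job file are illustrative placeholders (check the connector's compatibility matrix for your Spark and Cassandra versions).

```shell
# Submit a Spark job with the DataStax Spark Cassandra Connector on the classpath.
# Coordinates, version, host, and job name below are placeholders.
spark-submit \
  --packages com.datastax.spark:spark-cassandra-connector_2.11:1.5.0 \
  --conf spark.cassandra.connection.host=10.0.0.1 \
  recognition_job.py
```

The connector is "lightning fast" largely because it is locality-aware: it maps Spark partitions onto Cassandra token ranges, so each Spark task can read from a replica on the same node instead of shuffling data across the network.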

PERFORMANCE BENCHMARK RESULTS - SPARK

Benchmark 2: Larger Input (~400 Million Profiles)

Environment           | Results
JMS – Oracle          | 2.5 Million / Hour
MapReduce – Cassandra | 45 Million / Hour
Spark – Cassandra     | 125 Million / Hour (185 Million / Hour for "match only")   (~50x)

From 6–7 days down to ~3 hours!

TAKEAWAYS

• We did contain cost! – with better throughput & scalability
• Putting Cassandra to work by employing MapReduce and Spark
• Unimpeded throughput regardless of the data-store volume
• Unique Key Generation under distributed data technology
• Resolving Latency vs. Throughput – the traditional conflict
• In our use-case, the data store:
  – Is encapsulated!
  – Has only controlled access!
  – Does only Reads and Writes!
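One takeaway above — unique key generation under distributed data technology — is the piece that a central Oracle sequence used to provide. Without a central sequence, a common approach is to give each worker a disjoint key space derived from its worker id, so no coordination is needed and no two workers can collide. A minimal sketch (this scheme is an assumption for illustration, not necessarily the deck's actual one):

```python
class DistributedKeyGenerator:
    """Collision-free keys without a central sequence: each worker owns
    an interleaved, disjoint key space derived from its worker id."""

    def __init__(self, worker_id: int, num_workers: int):
        if not 0 <= worker_id < num_workers:
            raise ValueError("worker_id must be in [0, num_workers)")
        self.worker_id = worker_id
        self.num_workers = num_workers
        self.counter = 0

    def next_key(self) -> int:
        # Worker w of N emits w, w+N, w+2N, ... — disjoint from every other worker.
        key = self.counter * self.num_workers + self.worker_id
        self.counter += 1
        return key
```

With two workers, worker 0 emits 0, 2, 4, … and worker 1 emits 1, 3, 5, … — uniqueness holds with zero cross-worker communication, which is what keeps key assignment from becoming the new locking bottleneck.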

THANK YOU