Use Splunk With Big Data Repositories Like Spark, Solr, Hadoop And NoSQL Storage

Raanan Dagan, May Long
Big Data Architect, Splunk

Copyright © 2016 Splunk Inc.




Disclaimer  


During the course of this presentation, we may make forward-looking statements regarding future events or the expected performance of the company. We caution you that such statements reflect our current expectations and estimates based on factors currently known to us and that actual events or results could differ materially. For important factors that may cause actual results to differ from those contained in our forward-looking statements, please review our filings with the SEC. The forward-looking statements made in this presentation are being made as of the time and date of its live presentation. If reviewed after its live presentation, this presentation may not contain current or accurate information. We do not assume any obligation to update any forward-looking statements we may make. In addition, any information about our roadmap outlines our general product direction and is subject to change at any time without notice. It is for informational purposes only and shall not be incorporated into any contract or other commitment. Splunk undertakes no obligation either to develop the features or functionality described or to include any such feature or functionality in a future release.


Agenda  

Use Cases:
• Fraud With Solr, Splunk, And Splunk Analytics For Hadoop
• Business Analytics With Cassandra, Splunk Cloud, And Splunk Analytics For Hadoop
• Document Classification With Spark And Splunk
• Network IT With Kafka And Splunk Kafka Add On

Demo



Fraud With Solr, Splunk, And Splunk Analytics For Hadoop


Use Case: Fraud – Why Apache Solr

Apache Solr is an open source enterprise search platform from the Apache Lucene project. Its major features include full-text search, hit highlighting, faceted search, and real-time indexing.

1. Problem: Scanning a whole month of data or longer for keywords in Splunk Analytics for Hadoop takes a long time, since there are many log files to search through.

2. Goal: Limit the number of files to search and reduce the number of MapReduce jobs to run.

3. Solution: To achieve this effect, we created a summary of keywords in Solr. This Solr summary data contains the keywords and the HDFS files in which each keyword is found.
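The keyword summary described in the solution is, in effect, an inverted index from keyword to file paths. A minimal in-memory sketch of that idea follows; it is plain Python rather than Solr, and the file paths and log lines are hypothetical examples that do not come from the presentation.

```python
import re
from collections import defaultdict


def build_keyword_index(files):
    """Map each lowercase keyword to the sorted list of file paths containing it."""
    index = defaultdict(set)
    for path, text in files.items():
        # Tokenize on letters, digits and underscores; a real analyzer would do more.
        for word in set(re.findall(r"[a-z0-9_]+", text.lower())):
            index[word].add(path)
    return {word: sorted(paths) for word, paths in index.items()}


# Hypothetical HDFS paths and log contents, for illustration only.
sample_files = {
    "/logs/2016/05/01.log": "login failed for account 1234",
    "/logs/2016/05/02.log": "transfer approved for account 9876",
    "/logs/2016/05/03.log": "login failed twice, card blocked",
}

index = build_keyword_index(sample_files)
print(index["failed"])
```

Looking up a keyword returns only the files worth scanning, which is what lets the later MapReduce step skip every other file.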



Architecture

Diagram components: Splunk Indexers (Real-Time Data), Splunk / Splunk Analytics for Hadoop, Hadoop Raw Data, Hadoop Indexed Data. The flow is marked with steps 1 to 3.


Fraud – Technical Details

Diagram components: Splunk Search Head, Hadoop Raw data, Hadoop Indexed data.

Hadoop - Solr
• Solr monitors any changes to the Hadoop directory
• Indexes keywords based on the Hadoop source files

Splunk – Solr
• Splunk form dashboard
• User enters keywords
• Python script calls Solr
• Solr returns to Splunk all Hadoop source files with the keywords

Cassandra - Splunk Analytics for Hadoop
• Splunk Analytics for Hadoop runs Hadoop MapReduce jobs with the specific source files
• Eliminates a massive Hadoop scan
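The lookup step on this slide (a Python script calls Solr, then Splunk searches only the returned source files) could be sketched roughly as below. The Solr host, the collection name `keywords`, the field names `keyword` and `hdfs_path`, and the virtual index name `fraud_vix` are all assumptions made for illustration; the presentation does not give them. The sample response is hand-written in the shape of Solr's JSON output, so the sketch runs without a Solr server.

```python
import json
from urllib.parse import urlencode

# Assumption: hypothetical Solr endpoint and collection name.
SOLR_SELECT_URL = "http://localhost:8983/solr/keywords/select"


def build_solr_query_url(keyword, rows=1000):
    """Build a Solr select URL asking for the files that contain the keyword."""
    params = {"q": f'keyword:"{keyword}"', "fl": "hdfs_path", "rows": rows, "wt": "json"}
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


def extract_source_files(solr_response_text):
    """Pull the distinct file paths out of a Solr JSON response."""
    docs = json.loads(solr_response_text)["response"]["docs"]
    return sorted({doc["hdfs_path"] for doc in docs})


def build_splunk_search(virtual_index, keyword, source_files):
    """Restrict the Splunk search to the files Solr reported."""
    if not source_files:
        raise ValueError("no source files matched the keyword")
    sources = " OR ".join(f'source="{path}"' for path in source_files)
    return f'index={virtual_index} ({sources}) "{keyword}"'


# Hand-written sample in the shape of a Solr JSON response (hypothetical data).
sample_response = json.dumps({
    "response": {
        "numFound": 2,
        "docs": [{"hdfs_path": "/logs/2016/05/03.log"}, {"hdfs_path": "/logs/2016/05/01.log"}],
    }
})

url = build_solr_query_url("failed")
search = build_splunk_search("fraud_vix", "failed", extract_source_files(sample_response))
print(url)
print(search)
```

In a live setup the URL would be fetched over HTTP and the resulting search string handed to the dashboard; both steps are omitted here to keep the sketch self-contained.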


Business Analytics With Cassandra, Splunk Cloud, And Splunk Analytics For Hadoop


Business Analytics – Why Cassandra

Apache Cassandra is an open-source distributed NoSQL database system designed to handle large amounts of data across a cluster of commodity servers.

1. Problem: Lack of visibility into customer behavior from mobile applications.

2. Goal: Visualize and analyze all data that is stored in Cassandra.

3. Solution: To achieve this effect, we stored all mobile activity in Cassandra and use the Splunk Analytics for Hadoop add-on for Cassandra to query that data.



Architecture

Diagram components: iPhone Apps, RabbitMQ, Cassandra, Splunk Heavy Forwarder / Splunk Analytics for Hadoop, Splunk Cloud UF, Splunk Cloud. The flow is marked with steps 1 to 4.


Business Analytics – Technical Details

Diagram components: Cassandra, Splunk Search Head, Summary Index, Splunk Cloud.

Cassandra - Splunk Analytics for Hadoop
• Splunk Analytics for Hadoop Add On for Cassandra
• [cassandra_weathercql]
  vix.provider = cassandra_erp
  vix.cassandra.cql.cmd = SELECT * FROM weathercql.monthly
• index = cassandra_weathercql | table *

Splunk Analytics for Hadoop – Summary Index
• Schedule Search
• index = cassandra_weathercql | collect index=SummaryIndex

Summary Index – Splunk Cloud
• outputs.conf
  [tcpout]
  forwardedindex.0.whitelist = SummaryIndex
• SummaryIndex for 5 Min
• Use the normal Splunk Cloud UF
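To make the relationship between the configuration fragments on this slide easier to follow, here is a small sketch that parses them with Python's standard `configparser` and derives the two searches from the stanza name. The stanza text is copied from the slide; the helper names `parse_conf` and `build_searches` are hypothetical, and nothing here talks to Splunk or Cassandra.

```python
import configparser

# Stanzas as shown on the slide.
VIRTUAL_INDEX_STANZA = """
[cassandra_weathercql]
vix.provider = cassandra_erp
vix.cassandra.cql.cmd = SELECT * FROM weathercql.monthly
"""

FORWARDING_STANZA = """
[tcpout]
forwardedindex.0.whitelist = SummaryIndex
"""


def parse_conf(text):
    """Parse .conf-style text into {stanza: {key: value}}."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key case exactly as written
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


def build_searches(virtual_index, summary_index):
    """Return the ad-hoc search and the scheduled summary-building search."""
    adhoc = f"index={virtual_index} | table *"
    scheduled = f"index={virtual_index} | collect index={summary_index}"
    return adhoc, scheduled


vix = parse_conf(VIRTUAL_INDEX_STANZA)
fwd = parse_conf(FORWARDING_STANZA)
virtual_index = next(iter(vix))
summary_index = fwd["tcpout"]["forwardedindex.0.whitelist"]
adhoc, scheduled = build_searches(virtual_index, summary_index)
print(adhoc)
print(scheduled)
```

The point of the sketch is the chain of names: the stanza name becomes the virtual index to search, and the index written by `collect` is the same one the forwarder whitelist lets through to Splunk Cloud.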


Document Classification With Spark And Splunk


Document Classification – Why Spark

Apache Spark provides APIs for fast processing. It was developed in response to limitations in the Hadoop MapReduce cluster computing paradigm. The main components of Spark are the Core, SQL, Machine Learning, Streaming, and Graph APIs.

1. Problem: Spark processing does not provide easy analytics or any visualization.

2. Goal: Give analysts and regulators the ability to know exactly where each file exists in the system.

3. Solution: Apache NiFi collects all new files from NFS and stores them on Hadoop. Spark Core, Spark Machine Learning, and Apache Tika create the metadata classification. Splunk Analytics for Hadoop exposes the metadata classification files to end users.
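As a rough picture of what a metadata classification record might look like, the sketch below uses a simple keyword-overlap rule in plain Python as a stand-in for the Spark Machine Learning and Apache Tika step. It is not Spark code, and the categories, trigger words, record fields, and file path are all hypothetical; the presentation does not describe the actual model or schema.

```python
import json

# Assumption: hypothetical categories and trigger words, for illustration only.
CATEGORY_KEYWORDS = {
    "contract": {"agreement", "party", "hereby"},
    "invoice": {"invoice", "amount", "due"},
}


def classify(text):
    """Pick the category whose trigger words overlap the text the most."""
    words = set(text.lower().split())
    scores = {cat: len(words & kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def metadata_record(hdfs_path, text):
    """Emit one JSON line per document, the kind of file a search tool can read."""
    record = {
        "path": hdfs_path,
        "category": classify(text),
        "size_bytes": len(text.encode("utf-8")),
    }
    return json.dumps(record, sort_keys=True)


print(metadata_record("/docs/incoming/a.txt", "Invoice amount due in 30 days"))
```

Writing one JSON record per document keeps the file location next to its classification, which is what the stated goal (knowing exactly where each file exists) requires.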



Architecture

Diagram components: Documents Transport, Hadoop Raw data, Machine Learning Extracts Metadata, Hadoop Metadata, Splunk / Splunk Analytics for Hadoop Search Head. The flow is marked with steps 1 to 5.


Network IT With Kafka And Splunk Kafka Add On


Network IT – Why Kafka

Apache Kafka is a fast publish-subscribe messaging system. A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.

1. Problem: No unified collection framework.

2. Goal: Real-time visualization and analytics using Splunk; batch visualization and analytics using Hadoop and RDBMS.

3. Solution: To achieve this effect, we used the Splunk Add-on for Kafka.
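The reason a publish-subscribe log can act as a unified collection layer is that each consumer group reads the same stream at its own pace. The toy sketch below shows that property with an in-memory class; it is not a Kafka client, and the class name `Topic`, the group names, and the sample events are hypothetical.

```python
class Topic:
    """Append-only message log with one read offset per consumer group."""

    def __init__(self):
        self._log = []
        self._offsets = {}

    def publish(self, message):
        """Producers append; nothing is removed when a consumer reads."""
        self._log.append(message)

    def poll(self, group, max_messages=100):
        """Return unread messages for this group and advance its offset."""
        start = self._offsets.get(group, 0)  # a new group starts at the earliest message
        batch = self._log[start:start + max_messages]
        self._offsets[group] = start + len(batch)
        return batch


topic = Topic()
for event in ["cisco: link down", "windows: logon", "mobile: app start"]:
    topic.publish(event)

realtime = topic.poll("splunk_realtime")   # real-time pipeline reads everything so far
batch = topic.poll("hadoop_batch")         # batch pipeline sees the same events independently
topic.publish("security: port scan")
realtime_next = topic.poll("splunk_realtime")  # only the new event is returned
print(realtime, batch, realtime_next)
```

Because reads do not consume the log, the real-time and batch pipelines can share one set of producers instead of each needing its own collectors.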



Architecture

Diagram components, with the flow marked in steps 1 to 3:
• Data Producers: Network Data, Windows Logs, Cisco Logs, Mobile Data, Security Logs, Streaming Hop Server data
• Kafka
• Real-Time Pipeline and Batch Pipeline: Splunk Indexers with the Splunk Add-on for Kafka, Oracle, CEP, NoSQL
• Data Consumers: Real-time, fast analytics; Alerts; Debugging; Dashboards; Ad-hoc exploration; Data Science


Network IT – Technical Details

Kafka - Splunk Add-on for Kafka

• Splunk Add-on for Kafka
  • index = networkdata
  • kafka_brokers = sandbox:6667
  • kafka_partition_offset = earliest
  • kafka_topic_whitelist = truckevent

• Search
  • index=networkdata sourcetype="kafka:topicEvent"


Demo  


Additional Resources

Use Cases:
• Fraud with Solr: https://lucidworks.com/solutions/lucidworks-splunk-connector/
• Business Analytics with Cassandra: https://splunkbase.splunk.com/app/2668/
• Document classification with Spark: https://splunkbase.splunk.com/app/2686/ (Spark SQL)
• Network IT with Kafka: https://splunkbase.splunk.com/app/2935/



THANK YOU