Leveraging the Power of SOLR with SPARK
Johannes Weigend | QAware GmbH Germany | Apache Big Data Europe
September 2015
Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany
Welcome
• Johannes Weigend
- CTO QAware GmbH
- Software architect / developer
- 25 years of experience
- Custom enterprise solutions (Java, JS, …)
- Lecturer for UI development at the University of Applied Science in Rosenheim
- Focus on performance and scalability
- SOLR user since 2011
Brute Force Data Analysis
[Diagram: dataflow over data that is not indexed. Three parallel pipelines each Read, Filter and Map their part of the data; a single Reduce step combines the results.]
foreach() -> minutes / hours
Search Based Data Analysis
[Diagram: dataflow over indexed data. Three parallel pipelines each Search, Filter and Map their part of the data; a single Reduce step combines the results.]
Indexed data (there’s no free lunch)
foreach() -> seconds / minutes
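The difference between the two dataflows can be sketched in a few lines of plain Python. This is only a toy illustration, not code from the talk: the records, the field names (`level`, `duration_ms`) and all function names are hypothetical, and a dictionary stands in for a real search index.

```python
from collections import defaultdict

# Hypothetical log records; the field names are illustrative only.
RECORDS = [
    {"id": 1, "level": "ERROR", "duration_ms": 120},
    {"id": 2, "level": "INFO", "duration_ms": 15},
    {"id": 3, "level": "ERROR", "duration_ms": 80},
    {"id": 4, "level": "WARN", "duration_ms": 40},
]


def brute_force_total(records, level):
    """Brute force: read every record, filter, map to a value, reduce to a sum."""
    return sum(r["duration_ms"] for r in records if r["level"] == level)


def build_index(records):
    """Pay the indexing cost once: map each level to the positions of its records."""
    index = defaultdict(list)
    for pos, record in enumerate(records):
        index[record["level"]].append(pos)
    return index


def indexed_total(records, index, level):
    """Search based: look up the matching positions, then map and reduce only those."""
    return sum(records[pos]["duration_ms"] for pos in index.get(level, []))


INDEX = build_index(RECORDS)
print(brute_force_total(RECORDS, "ERROR"), indexed_total(RECORDS, INDEX, "ERROR"))
```

Both functions return the same result; the indexed variant touches only the matching records, at the price of building and storing the index first.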
Agenda
1. SOLR Cloud (demo)
2. SPARK cluster (demo)
3. Importing data into SOLR with SPARK (demo)
4. Analysis with SOLR and SPARK (demo)
Apache SOLR
• Horizontally scalable, distributed NoSQL (index) database
• Document oriented
• A document is a collection of fields (string, number, date, …)
• Single-valued and multi-valued fields (the latter similar to arrays)
• Schema-based or schemaless
• Powerful query language (Lucene)
• Distributed data in shards
• Replication
• Powerful full-text search capabilities
• Aggregation functions (aka facets)
• Stable: V 5.3
SOLR@QAware
• AIR
- Aftersales Information Research
• ZEBRA
- Part explosion for complex products
• EKG
- Software Electro Cardiogram
• QAsearch
- Enterprise search across all repositories including history
Apache SOLR for BigData Analysis?
• Text Search Engine?
• Aggregations?
• Slice and Dice?
• Pivots?
Demo: SOLR Cloud
• Installing and configuring SOLR Cloud
• Searching, sorting and filtering
• Facets
- Terms (count by term)
- Ranges (count in range)
- Functions (avg, sum, …)
- Sub-Facets (pivot)
Counting as Term Facet
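The original slide content is not preserved here, so the following is only a minimal sketch of what a term-facet request can look like. It builds a SOLR select URL with the standard `facet=true` and `facet.field` parameters; the host, the collection name `logs` and the field name `level` are assumptions made for illustration.

```python
from urllib.parse import urlencode


def term_facet_url(base_url, collection, field):
    """Build a select URL that counts documents per distinct value of `field`."""
    params = [
        ("q", "*:*"),            # match all documents
        ("rows", "0"),           # only the facet counts are of interest
        ("facet", "true"),       # switch faceting on
        ("facet.field", field),  # count by term of this field
        ("wt", "json"),
    ]
    return f"{base_url}/{collection}/select?{urlencode(params)}"


# Hypothetical host, collection and field.
url = term_facet_url("http://localhost:8983/solr", "logs", "level")
print(url)
```

The response would then carry one count per distinct term of the faceted field.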
Statistics as Function Facet
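The slide content is not preserved here. As a hedged sketch, statistics can be requested through SOLR's JSON Facet API by naming aggregation functions such as `avg()` and `sum()` in the `facet` section of a JSON request. The field name `duration_ms` and the result keys are assumptions chosen for illustration.

```python
import json


def stats_request(field, query="*:*"):
    """Build a JSON request body that asks for basic statistics on one numeric field."""
    return {
        "query": query,
        "limit": 0,  # no documents, only the aggregated values
        "facet": {
            f"avg_{field}": f"avg({field})",
            f"sum_{field}": f"sum({field})",
            f"min_{field}": f"min({field})",
            f"max_{field}": f"max({field})",
        },
    }


# Hypothetical numeric field.
body = json.dumps(stats_request("duration_ms"), indent=2)
print(body)
```

Such a body would typically be sent as a POST to the collection's query handler.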
Pivots as Sub Facets
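The slide content is not preserved here. A pivot can be sketched with the JSON Facet API by nesting one facet inside another: each bucket of the outer terms facet gets its own inner facet. The field names `host`, `level` and `duration_ms` and the facet labels are hypothetical.

```python
import json


def pivot_request(outer_field, inner_field, metric_field):
    """Build a JSON request body with a terms facet nested in another terms facet."""
    return {
        "query": "*:*",
        "limit": 0,
        "facet": {
            "by_outer": {
                "type": "terms",
                "field": outer_field,
                "facet": {  # evaluated once per bucket of the outer facet
                    "by_inner": {
                        "type": "terms",
                        "field": inner_field,
                        "facet": {"avg_metric": f"avg({metric_field})"},
                    }
                },
            }
        },
    }


# Hypothetical fields: average duration per log level per host.
body = json.dumps(pivot_request("host", "level", "duration_ms"), indent=2)
print(body)
```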
careerbuilder.com
Banana
What’s Missing?
• Client-side processing of SOLR results does not scale
• No built-in M/R support
• Where to store really big data?
- Images
- Videos
- Binaries / large text documents
• No interfaces to R / ML
Apache SPARK
• Distributed job execution engine
• Map/Reduce framework
• Scala based (runs on the JVM)
• Java/Scala/Python APIs
• Processes data from various data sources
- Text files (accessible from all nodes)
- Hadoop File System (HDFS)
- Databases (JDBC)
- SOLR!
Must Read: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
Combining Spark with SOLR
• Use cases
• Distributed ETL: importing data into SOLR Cloud
- Our use case: importing N logfiles into SOLR
• Distributed processing: data analysis
- Statistics on binary data
- Map/Reduce
Four Ways to Import Data into SOLR
1. Using built-in functions
- post script
- Data Import Handler
- Admin UI
2. Writing custom parallel code using the SOLRJ API
3. Using and customizing Apache Nutch (Hadoop!)
4. Using and customizing Apache Spark
Demo: Import Logfiles with Spark
• Writing a Spark job which imports a bunch of logfiles in one directory
• Using Lucidworks' spark-solr library
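The demo code itself is not part of this text. The sketch below only illustrates the shape of such an import in plain Python: each logfile is parsed in parallel into documents, and the documents are grouped into JSON batches. It is not the spark-solr API; a thread pool stands in for the Spark workers, no data is sent anywhere, and the log format, file names and field names are all hypothetical.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def parse_line(source, line_no, line):
    """Turn one 'timestamp level message' line into a document (a dict of fields)."""
    timestamp, level, message = line.split(" ", 2)
    return {
        "id": f"{source}:{line_no}",  # unique key built from file name and line number
        "timestamp": timestamp,
        "level": level,
        "message": message,
    }


def parse_logfile(source, text):
    """Parse all non-empty lines of one logfile."""
    lines = enumerate(text.splitlines(), start=1)
    return [parse_line(source, n, line) for n, line in lines if line.strip()]


def batches(docs, size):
    """Yield the documents in chunks of at most `size`."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


# Hypothetical in-memory logfiles instead of a directory on disk.
LOGFILES = {
    "app-1.log": "2015-09-28T10:00:00Z INFO order created\n"
                 "2015-09-28T10:00:02Z ERROR payment failed\n",
    "app-2.log": "2015-09-28T10:00:01Z INFO user logged in\n",
}

# One task per file; the thread pool is a stand-in for distributed workers.
with ThreadPoolExecutor(max_workers=2) as pool:
    parsed = list(pool.map(lambda item: parse_logfile(*item), LOGFILES.items()))

docs = [doc for file_docs in parsed for doc in file_docs]
payloads = [json.dumps(batch) for batch in batches(docs, size=2)]
print(len(docs), len(payloads))
```

Each payload is a JSON array of documents, which is a form SOLR's update handler accepts. In a real Spark job the parsing would run per partition and the library would take care of sending the batches.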
Demo: Distributed Analysis with Spark
• Write a Spark job which calculates the duration of business actions
• Use Spark to access SOLR via SQL / JDBC
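The demo job is not reproduced in this text. The following plain-Python sketch only shows the map/reduce idea behind such a calculation: every log event is mapped to a pair of action id and time span, and the pairs are reduced per action id to the earliest and latest timestamp. The event layout, the action ids and the function names are hypothetical; in Spark the same reduction would typically be expressed with a per-key operation such as `reduceByKey`.

```python
from datetime import datetime
from functools import reduce

# Hypothetical events: (timestamp, business action id, marker).
EVENTS = [
    ("2015-09-28T10:00:00", "checkout-1", "start"),
    ("2015-09-28T10:00:03", "search-7", "start"),
    ("2015-09-28T10:00:04", "checkout-1", "end"),
    ("2015-09-28T10:00:09", "search-7", "end"),
]


def to_pair(event):
    """Map step: emit (action_id, (first_seen, last_seen)) for a single event."""
    timestamp, action_id, _marker = event
    moment = datetime.fromisoformat(timestamp)
    return action_id, (moment, moment)


def merge(spans, pair):
    """Reduce step: widen the time span kept for the pair's action id."""
    action_id, (first, last) = pair
    if action_id in spans:
        old_first, old_last = spans[action_id]
        first, last = min(first, old_first), max(last, old_last)
    return {**spans, action_id: (first, last)}


spans = reduce(merge, map(to_pair, EVENTS), {})
durations = {
    action: (last - first).total_seconds()
    for action, (first, last) in spans.items()
}
print(durations)
```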
SolrRDD - The Spark Abstraction to Process SOLR Results
https://github.com/LucidWorks/spark-solr
SPARK Supports Parallel SQL
DataFrame API
[Diagram: demo cluster of five Odroid XU4 boards]
• Each board: Odroid XU4, 2 GB RAM, 64 GB eMMC disk, Ubuntu Linux, 70$
• One board: SPARK Master, SPARK Worker, SOLR 5.3 shard #0, ZOOKEEPER, NFS
• Four boards: SPARK Worker and SOLR 5.3, shards #1 to #4
• Total: 40 cores, 10 GB RAM, 320 GB eMMC disk
Summary
Any Questions?