Building a Data Warehouse for Business Analytics using Spark SQL

Copyright Edmunds.com, Inc. (the “Company”). Edmunds and the Edmunds.com logo are registered trademarks of the Company. This document contains proprietary and/or confidential information of the Company. No part of this document or the information it contains may be used, or disclosed to any person or entity, for any purpose other than advancing the best interests of the Company, and any such disclosure requires the express approval of the Company.

Blagoy Kaloferov, Software Engineer

06/15/2015


About me: Blagoy Kaloferov, Big Data Software Engineer

About my company: Edmunds.com is a car buying platform with 18M+ unique visitors each month

Today’s talk: “It’s about time!”

Agenda

1.  Introduction and Architecture

2.  Building Next Gen DWH

3.  Automating Ad Revenue using Spark SQL

4.  Conclusion

Business Analytics at Edmunds.com

Divisions:
o  DWH Engineers
o  Business Analysts
o  Statistics team

Two major groups:
o  MapReduce / Spark developers
o  Analysts with advanced SQL skills

Proposition

Spark SQL:

•  Simplified ETL and enhanced visualization tools
•  Allows anyone in BA to quickly build new data marts
•  Enabled a scalable POC-to-production process for our projects

Architecture for Analytics

[Diagram, built up over several slides]
o  Raw data: Clickstream, Inventory, Dealer, Lead, Transaction
o  Data ingestion / ETL (DWH Developers): MapReduce jobs on the Hadoop cluster produce HDFS aggregates
o  Reporting, ad hoc / dashboards (Business Analysts): Redshift database and Business Intelligence tools (Platfora, Tableau)
o  Added to the picture: Spark and Spark SQL (Databricks), also used for ETL

Exposing S3 data via Spark SQL

Our approach:
o  Spark SQL tables similar to our existing Redshift tables
o  Best fit for us: Hive tables pointing to delimited data on S3
o  Exposed hundreds of Spark SQL tables
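As a rough illustration of the approach above, a Hive-style external table over delimited S3 data could be declared with DDL along these lines. The helper only builds the statement text. The table name, columns, delimiter and bucket path are hypothetical and not taken from the talk; the partition columns assume the year/month/day/hour directory layout described in the slides.

```python
def external_table_ddl(table, columns, location, delimiter="\\t"):
    """Build HiveQL DDL for an external table over delimited files.

    Partition columns mirror a year/month/day/hour directory layout.
    """
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        "PARTITIONED BY (year STRING, month STRING, day STRING, hour STRING)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"LOCATION '{location}'"
    )


# Hypothetical table and bucket, for illustration only.
ddl = external_table_ddl(
    "dealer_leads",
    [("lead_id", "STRING"), ("dealer_id", "STRING"), ("visitor_id", "STRING")],
    "s3://example-bucket/dealer_leads/",
)
print(ddl)
```

Because the table is external, dropping it would leave the underlying S3 files untouched, which suits data that other systems also read.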

Adding Latest Table Partitions

•  S3 datasets have thousands of directories (location/year/month/day/hour)
•  Every new S3 directory for each dataset has to be registered

[Diagram: each S3 dataset backs a Spark SQL table; directories such as 2015/05/31/01 and 2015/05/31/02 become Spark SQL table partitions]
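The registration step can be sketched as turning each new directory into an `ALTER TABLE ... ADD PARTITION` statement. This is a minimal sketch, assuming partition columns named `year`, `month`, `day` and `hour`; the table name and bucket are hypothetical, and the talk does not show how the actual utility is written.

```python
import re

# Matches a trailing year/month/day/hour directory, e.g. ".../2015/05/31/01".
PARTITION_DIR = re.compile(r"(\d{4})/(\d{2})/(\d{2})/(\d{2})/?$")


def add_partition_sql(table, s3_dir):
    """Return an ALTER TABLE statement for one directory.

    Returns None when the path does not end in year/month/day/hour.
    """
    match = PARTITION_DIR.search(s3_dir)
    if match is None:
        return None
    year, month, day, hour = match.groups()
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (year='{year}', month='{month}', day='{day}', hour='{hour}') "
        f"LOCATION '{s3_dir}'"
    )


# Hypothetical directory listing; the last entry is not a valid partition.
dirs = [
    "s3://example-bucket/dealer_leads/2015/05/31/01",
    "s3://example-bucket/dealer_leads/2015/05/31/02",
    "s3://example-bucket/dealer_leads/_tmp",
]
statements = [
    sql for sql in (add_partition_sql("dealer_leads", d) for d in dirs) if sql
]
for sql in statements:
    print(sql)
```

`IF NOT EXISTS` keeps the statement safe to re-run, so a scheduled job can replay the full directory listing without tracking what it registered last time.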

Adding Latest Table Partitions

Utilities for Spark SQL tables and S3:

•  Register valid partitions of Spark SQL tables with the Spark Hive Metastore
•  Create a Last_X_Days copy of any Spark SQL table in memory

Scheduled jobs:

•  Register the latest available directories for all Spark SQL tables programmatically
•  Update Last_3_Days of core datasets in memory
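One way an in-memory Last_X_Days copy might be built is with Spark SQL's `CACHE TABLE ... AS SELECT`, filtered on the partition columns. The sketch below only generates the statement; the `year`/`month`/`day` column names, the naming of the cached table, and the choice to count the current day as day one are all assumptions.

```python
from datetime import date, timedelta


def last_n_days_sql(table, n, today):
    """Build a CACHE TABLE statement holding the last n days of `table`."""
    days = [today - timedelta(days=i) for i in range(n)]
    predicate = " OR ".join(
        f"(year='{d:%Y}' AND month='{d:%m}' AND day='{d:%d}')" for d in days
    )
    return (
        f"CACHE TABLE last_{n}_days_{table} AS "
        f"SELECT * FROM {table} WHERE {predicate}"
    )


# Hypothetical table name; the date is fixed so the output is reproducible.
sql = last_n_days_sql("dealer_leads", 3, date(2015, 6, 1))
print(sql)
```

Filtering on partition columns lets the engine read only the matching directories instead of scanning the whole dataset.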

S3 and Spark SQL potential

o  Now that all S3 data is easily accessible, there are a lot of opportunities!
o  Anyone can ETL on prefixed aggregates and create new data marts

[Diagram: Spark cluster (Spark SQL tables, Last_3_days tables, utilities and UDFs) feeding Business Intelligence tools; faster pipeline, better insights]

Platfora Dashboards Pipeline Optimization

o  Platfora is a visualization analytics tool
o  Provides more than 200 dashboards for BA
o  Uses MapReduce to load aggregates

[Diagram: Source Dataset (HDFS, S3; joined datasets) -> Build / Update Lens (MapReduce jobs) -> Dashboards]

Limitations:
o  We cannot optimize the Platfora MapReduce jobs
o  Defined data marts are not available elsewhere

[Diagram: same pipeline; datasets are joined on lead_id, inventory_id, visitor_id, dealer_id …]

Platfora Dealer Leads dataset Use Case

o  Dealer Leads: Lead Submitter insights dataset
o  More than 40 visual dashboards use Dealer Leads

[Diagram: Lead Submitter Data, Region Info, Transaction Data, Vehicle Info, Dealer Info and Lead Categorization are joined into the Dealer Leads joined dataset, which provides Lead Submitter Insights]

Optimizing Dealer Leads Dataset

Dealer Leads Platfora dataset stats:
o  300+ attributes
o  Usually takes 2-3 hours to build the lens
o  Scheduled to build daily

How do we optimize it?

1. Have Spark SQL do the work!
o  All required datasets are exposed as Spark SQL tables
o  Add new useful attributes

2. Make the ETL easy enough for anyone in Business Analytics to do themselves
o  Provide utilities and UDFs so that aggregated data can be exposed to visualization tools

Dealer Leads Data Mart using Spark SQL Demo

o  Expose all original 300+ attributes
o  Enhance: join with site_traffic

[Diagram: the Dealer Leads dataset is joined with traffic data describing the lead submitter journey (entry page, page views, device …); label: aggregate_traffic_spark_sql]
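The demo itself is not reproduced in the slides. To show the shape of such an enrichment query, here is a minimal sketch that uses Python's built-in `sqlite3` as a stand-in for Spark SQL. All table layouts, column names and sample rows are hypothetical; only the idea of joining leads to aggregated traffic attributes comes from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dealer_leads (lead_id TEXT, visitor_id TEXT, dealer_id TEXT);
CREATE TABLE site_traffic (visitor_id TEXT, page TEXT, device TEXT, seq INTEGER);
INSERT INTO dealer_leads VALUES ('L1', 'V1', 'D9'), ('L2', 'V2', 'D9');
INSERT INTO site_traffic VALUES
  ('V1', '/home', 'mobile', 1), ('V1', '/inventory', 'mobile', 2),
  ('V2', '/reviews', 'desktop', 1);
""")

# Aggregate traffic per visitor first, then attach it to each lead.
query = """
WITH traffic AS (
  SELECT visitor_id,
         COUNT(*) AS page_views,
         MIN(device) AS device,
         MIN(CASE WHEN seq = 1 THEN page END) AS entry_page
  FROM site_traffic
  GROUP BY visitor_id
)
SELECT l.lead_id, l.dealer_id, t.entry_page, t.page_views, t.device
FROM dealer_leads l
LEFT JOIN traffic t ON t.visitor_id = l.visitor_id
ORDER BY l.lead_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Aggregating traffic before the join keeps one output row per lead, so the visualization tool receives a flat table and does not have to join anything itself.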

Dealer Leads Using Spark SQL results

o  Spark SQL aggregation takes 10 minutes
o  Adds dimension attributes that were not available before
o  Platfora does not need to join aggregates
o  Significantly reduced latency
o  Dashboards are refreshed every 2 hours instead of once per day

[Diagram: Spark SQL Dealer Leads (10 minutes) -> Lens (10 minutes) -> Dashboards]

ETL and Visualization takeaway

o  Now anyone in BA can perform and support ETL on their own
o  New data marts can be exported to an RDBMS

[Diagram: S3 -> Spark cluster (Spark SQL tables, Last N days tables, utilities) -> ETL -> new data marts using Spark SQL -> load into Redshift; Platfora and Tableau connect through the Spark SQL connector]

POC with Spark SQL vision

Usual POC process:

o  Business Analyst prototypes the project in SQL
   o  Not scalable; takes ad hoc resources from the RDBMS
o  SQL to MapReduce
   o  Transition between two very different frameworks
   o  MapReduce does not always fit complicated business logic
   o  Supported only by developers

New POC process using Spark:

o  A developer and a BA can work together on the same platform and collaborate using Spark
o  It's scalable
o  No need to switch frameworks when moving to production

Ad Revenue Billing Use Case

Introduction

o  OEM advertising on the website
o  Definitions: Impression, CPM, Line Item, Order
o  Ad revenue is computed at the end of the month using OEM-provided impression data

Ad Revenue End of Month billing

Impressions served * CPM != actual revenue

o  There are billing adjustment rules!
o  Each OEM has a set of unique rules that determine the actual revenue
o  Adjusting revenue numbers requires manual user inputs from the OEM's Account Manager

Line Item adjustments’ examples

•  Line Item groupings

ORIGINAL Line Item | CPM | impressions | attributes
SUPPORT Line Item | CPM | impressions

Combine data:

MERGED | CPM | NEW_impressions | attributes

•  Capping / Adjustments

Line Item | impressions | Contract
impressions_served > Contract ? Cap impressions!
Line Item | CAPPED_impressions | Contract

Line Item | impressions | Contract
impressions_served > (X% * Contract) ? Adjust impressions!
Line Item | ADJUSTED_impressions | Contract
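The three rule types above can be sketched as small functions over a line item record. This is an illustration under stated assumptions, not the billing logic from the talk: merging is assumed to sum impressions and keep the ORIGINAL item's CPM and attributes, and the adjustment is assumed to clamp impressions to X% of the contract, since the slides do not say what the adjustment actually does. All figures are made up.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class LineItem:
    name: str
    cpm: float        # price per 1,000 impressions
    impressions: int  # impressions served
    contract: int     # contracted impressions


def merge(original, support):
    """Grouping: fold a SUPPORT item into the ORIGINAL (assumed to be a sum)."""
    return replace(original, impressions=original.impressions + support.impressions)


def cap(item):
    """Capping: bill no more than the contracted impressions."""
    return replace(item, impressions=min(item.impressions, item.contract))


def adjust(item, x_percent):
    """Adjustment: clamp to X% of the contract (hypothetical rule)."""
    limit = item.contract * x_percent // 100
    return replace(item, impressions=min(item.impressions, limit))


# Made-up figures for illustration.
original = LineItem("ORIGINAL", cpm=12.0, impressions=90_000, contract=100_000)
support = LineItem("SUPPORT", cpm=12.0, impressions=25_000, contract=0)

merged = merge(original, support)
capped = cap(merged)
adjusted = adjust(merged, 110)
print(merged.impressions, capped.impressions, adjusted.impressions)
```

Keeping each rule as a pure function that returns a new record makes it straightforward to apply a different rule set per OEM.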

Can we automate ad revenue calculation?

Process Vision:

Each Line Item: 1_day_impr | CPM | attributes
-> Billing Engine ->
1d, 7d, MTD, QTD, YTD: adjusted_impr | CPM

Impressions served * CPM = actual revenue
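A small sketch of the arithmetic behind this vision follows. CPM is a price per thousand impressions, so the revenue helper divides by 1,000, which the slide shorthand leaves implicit. Only the 1d, 7d and MTD windows are shown; QTD and YTD would follow the same pattern. The daily figures and the inclusive window boundaries are assumptions.

```python
from datetime import date


def revenue(adjusted_impressions, cpm):
    """Revenue for a CPM-priced line item (CPM is a price per 1,000 impressions)."""
    return adjusted_impressions / 1000 * cpm


def window_totals(daily, as_of):
    """Sum daily adjusted impressions over 1d, 7d and month-to-date windows."""
    def total(keep):
        return sum(n for day, n in daily.items() if day <= as_of and keep(day))

    return {
        "1d": total(lambda d: d == as_of),
        "7d": total(lambda d: (as_of - d).days < 7),
        "MTD": total(lambda d: (d.year, d.month) == (as_of.year, as_of.month)),
    }


# Made-up daily adjusted impressions: May 25 to May 31, then June 1.
daily = {date(2015, 5, 25 + i): 1_000 * (i + 1) for i in range(7)}
daily[date(2015, 6, 1)] = 8_000

totals = window_totals(daily, date(2015, 6, 1))
seven_day_revenue = revenue(totals["7d"], cpm=10.0)
print(totals, seven_day_revenue)
```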

Can we automate ad revenue calculation?

Automation Challenges

o  Many rules and user-defined inputs; the logic changes
o  Need for a scalable, unified platform
o  Need for tight collaboration between the OEM team, Business Analysts and DWH developers

Billing Rules Modeling Project

How do we develop it?

o  Spark + Spark SQL approach
o  BA + Developers + OEM Account Team collaboration
o  Goal is an Ad Performance Dashboard

[Dashboard mock-up: adjusted billing ad revenue per OEM]

Project Architecture in Spark

o  Each Line Item = one Spark SQL Row: 1_day_impr | CPM | attributes
o  Sum / Transform / Join Rows
o  Processing is separated into phases whose inputs and outputs are Spark SQL tables

Project Architecture in Spark

[Diagram, built up over several slides]
o  Phase 1 (Business Analysts, Spark SQL): DFP / other impressions are joined with Line Item dimensions and aggregated into the Base Line Items Spark SQL table
o  Phase 2 (BA + Developer, Line Item Merging Engine): Base Line Items are combined with manual groupings and other inputs from OEM Account Managers into the Merged Line Items Spark SQL table
o  Phase 3 (Billing Rules Engine, Spark SQL): produces the Adjusted Line Items Spark SQL table
o  Dashboard: Ad Performance Tableau dashboard
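The phased design, where every phase reads one table and writes the next, can be sketched with plain Python lists of rows standing in for Spark SQL tables. The function names, column names, sample data and the single capping rule in phase 3 are all hypothetical; the talk only states that phases exchange Spark SQL tables.

```python
from collections import defaultdict


def phase1_base(impressions, dimensions):
    """Join impression rows to line item dimensions and sum per line item."""
    totals = defaultdict(int)
    for row in impressions:
        totals[row["line_item"]] += row["impressions"]
    return [
        dict(dimensions[li], line_item=li, impressions=n)
        for li, n in sorted(totals.items())
    ]


def phase2_merge(base, groupings):
    """Fold each line item into its manually assigned parent, if it has one."""
    by_id = {row["line_item"]: row for row in base}
    totals = defaultdict(int)
    for row in base:
        parent = groupings.get(row["line_item"], row["line_item"])
        totals[parent] += row["impressions"]
    return [dict(by_id[p], impressions=n) for p, n in sorted(totals.items())]


def phase3_adjust(merged):
    """Apply one example rule: bill at most the contracted impressions."""
    return [
        dict(row, billable=min(row["impressions"], row["contract"]))
        for row in merged
    ]


# Made-up inputs for illustration.
impressions = [
    {"line_item": "A", "impressions": 60_000},
    {"line_item": "A", "impressions": 30_000},
    {"line_item": "A-support", "impressions": 25_000},
]
dimensions = {
    "A": {"cpm": 12.0, "contract": 100_000},
    "A-support": {"cpm": 12.0, "contract": 0},
}
groupings = {"A-support": "A"}  # manual grouping input

base = phase1_base(impressions, dimensions)
merged = phase2_merge(base, groupings)
adjusted = phase3_adjust(merged)
print(adjusted)
```

Because each phase's output is a complete table, an analyst can inspect or query any intermediate result without rerunning the whole pipeline.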

Billing Rules Modeling Achievements

o  Increased accuracy of revenue forecasts for BA
o  Cost savings by not having a dedicated team doing manual adjustments
o  Monitor ad delivery rate for orders
o  Allows us to detect abnormalities in ad serving
o  Collaboration between BA and DWH Developers

Thank you!

Blagoy Kaloferov

[email protected]

Questions?
