LEGO: Data Driven Growth Hacking Powered by Big Data

Salesforce Confidential


June 2016

Kamal Duggireddy, Prashant Gokhale


About Us

Kamal Duggireddy

Kamal Duggireddy currently leads the Data Engineering, Product Data Science team at Salesforce.com. Prior to this, he served as Director - Big Data Architecture at American Express. Combining deep technical skills with business knowledge and strong execution experience, Kamal developed reference architectures and new enterprise-level capabilities with the Hadoop stack.

Prashant Gokhale

Prashant is currently working on solving big data problems at Salesforce.com using Hadoop and its ecosystem components. Prior to this, he held several critical engineering positions at Yahoo, Cloudera, and Lookout.


The Use Case | Overview

Executives, Analysts, Product Managers


The Use Case | Flow

Ad-Hoc Requests

Predictive Data Apps

Data Engineering & Curation

Smart Data Dashboards (Salesforce Wave)

Advanced Analysis

Instrumentation

150+ Log Lines

Hadoop Data Processing

Traditional Data Warehouses Dimensions


The Journey | How it all started


Milestones | Along the way

Reusability, Declarative, Data Lake, Data Dictionary

Self Service, Automation

Security, Visualization, Governance


The Framework | Finally!

[Architecture diagram. Labels: Log Sources, Kafka, Splunk, Files, Warehouse, Objects, Cloud Metrics, Hadoop, Data Lake, Log Processing, Metadata, Flow Engine, Data Profiler, Web App, Self Service, Data Science, Datasets (Various grain), Cubes (Custom grain)]


Goals

Scalable: Process hundreds of billions of log lines.

Flexible: Handle thousands of log schemas. Support variable grain and transformations using custom code.

Data Quality: Automated data profiling, monitoring, and alerting.

Self Service: Enable ad-hoc analysis.


Key Building Blocks

Log Processing Engine
• Declaratively define features and flows.
• Normalize data across multiple log lines.
• Custom code injection for data transformation.

Data Profiler
• Profile data at scale to detect anomalies.

Web App
• Interface to manage features and flows.

Job Automation Engine
• End-to-end automation from features/flows to curated datasets in Wave.
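The slides do not say how the Data Profiler detects anomalies. One simple possibility, shown here only as an illustration, is to compare the current profile of a field against a baseline profile and flag any metric that moves by more than a relative tolerance. The function name `find_anomalies`, the `tolerance` parameter, and the metric keys are all hypothetical and do not come from the deck.

```python
from typing import Dict, List


def find_anomalies(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.5,
) -> List[str]:
    """Return one alert message per metric that drifted beyond the tolerance.

    Hypothetical sketch: the relative-change rule is an assumption,
    not the method used by the framework described in the slides.
    """
    alerts = []
    for metric, expected in baseline.items():
        observed = current.get(metric)
        if observed is None:
            alerts.append(f"{metric}: missing from current profile")
            continue
        if expected == 0:
            # A relative change is undefined for a zero baseline,
            # so any non-zero observation counts as a change.
            changed = observed != 0
        else:
            changed = abs(observed - expected) / abs(expected) > tolerance
        if changed:
            alerts.append(f"{metric}: expected about {expected}, observed {observed}")
    return alerts


if __name__ == "__main__":
    baseline = {"total": 1000, "nulls": 0, "distinct": 50}
    current = {"total": 1100, "nulls": 40, "distinct": 50}
    for alert in find_anomalies(baseline, current):
        print(alert)
```

A fixed relative tolerance is the simplest rule to reason about; a production system would more likely tune thresholds per metric or use historical variance.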


Log Processing Engine

logType=='X' and event=='Create Event' and page=='Home Landing',"Feat 1","eval_code(event.toUpperCase())",page,…..

logType=='ABC' and event=='Create Event' and page=='Home',"Feat 2","eval_code(event.substring(5))",event,…..

Inputs: usage log files + feature definitions

Processing: data normalization, data cleansing, data transformation

Output: Hive tables
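The two feature definitions on this slide each pair a filter condition with a feature name, a custom transformation, and the fields to keep. A minimal Python sketch of that idea follows. It is not the framework's actual engine: `FeatureDef`, `extract_features`, and the output keys `feature` and `value` are hypothetical names, and Python lambdas stand in for the `eval_code(...)` expressions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List

LogRecord = Dict[str, str]


@dataclass(frozen=True)
class FeatureDef:
    """Hypothetical in-memory form of one feature definition line."""
    name: str                              # e.g. "Feat 1"
    matches: Callable[[LogRecord], bool]   # the filter condition
    transform: Callable[[LogRecord], str]  # stand-in for eval_code(...)
    keep_fields: List[str]                 # fields copied to the output row


FEATURES = [
    FeatureDef(
        name="Feat 1",
        matches=lambda r: (r.get("logType") == "X"
                           and r.get("event") == "Create Event"
                           and r.get("page") == "Home Landing"),
        transform=lambda r: r["event"].upper(),   # event.toUpperCase()
        keep_fields=["page"],
    ),
    FeatureDef(
        name="Feat 2",
        matches=lambda r: (r.get("logType") == "ABC"
                           and r.get("event") == "Create Event"
                           and r.get("page") == "Home"),
        transform=lambda r: r["event"][5:],       # event.substring(5)
        keep_fields=["event"],
    ),
]


def extract_features(
    records: Iterable[LogRecord],
    features: List[FeatureDef] = FEATURES,
) -> Iterator[Dict[str, str]]:
    """Yield one output row for every (record, feature) pair that matches."""
    for record in records:
        for feat in features:
            if feat.matches(record):
                row = {"feature": feat.name, "value": feat.transform(record)}
                for field in feat.keep_fields:
                    row[field] = record.get(field, "")
                yield row


if __name__ == "__main__":
    logs = [
        {"logType": "X", "event": "Create Event", "page": "Home Landing"},
        {"logType": "ABC", "event": "Create Event", "page": "Home"},
    ]
    for row in extract_features(logs):
        print(row)
```

Keeping each definition as data (condition, name, transform, fields) is what makes the approach declarative: adding a feature means adding a definition, not writing a new job.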


Data Profiler

Dataset    Field    Type  Total  Min  Max  Avg  # Nulls  # Distinct  Median  99th %tile  Top N
lego_feat  browser  STR   2.3B   7    63   25   1M       50          34      38          [.....]
lego_feat  url      STR   2.3B   20   223  50   0        5M          70      90          [.....]
...

Datasets across platform

HCatalog

MapReduce

Datasets

Dataset Profile: An Example

Monitoring & alerting
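The profile columns in the example above (Total, Min, Max, Avg, # Nulls, # Distinct, Median, 99th %tile, Top N) can be computed for a single field as in the following small-scale sketch. It runs in memory rather than on MapReduce, and it assumes that Min, Max, Avg, Median, and the 99th percentile of a STR field refer to string lengths, which the slides do not state. The function name `profile_field` and the nearest-rank percentile rule are likewise assumptions.

```python
import math
from collections import Counter
from statistics import mean, median
from typing import Dict, List, Optional


def profile_field(values: List[Optional[str]], top_n: int = 3) -> Dict[str, object]:
    """Profile one string field; None represents a null value.

    Assumption: numeric statistics are taken over string lengths.
    """
    non_null = [v for v in values if v is not None]
    lengths = sorted(len(v) for v in non_null)

    profile: Dict[str, object] = {
        "total": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "top_n": Counter(non_null).most_common(top_n),
    }
    if lengths:
        # Nearest-rank 99th percentile over the sorted lengths.
        rank = max(1, math.ceil(0.99 * len(lengths)))
        profile.update(
            min=lengths[0],
            max=lengths[-1],
            avg=mean(lengths),
            median=median(lengths),
            p99=lengths[rank - 1],
        )
    return profile


if __name__ == "__main__":
    browsers = ["chrome", "chrome", "firefox", None, "ie"]
    print(profile_field(browsers, top_n=1))
```

At the volumes quoted in this deck, exact distinct counts and percentiles would normally be replaced by approximate algorithms; this sketch only illustrates what each column means.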


Everything put together

[Architecture diagram. Labels: Log Sources, Kafka, Splunk, Files, Warehouse, Objects, Cloud Metrics, Hadoop, Data Lake, Log Processing, Metadata, Flow Engine, Data Profiler, Web App, Self Service, Data Science, Datasets (Various grain), Cubes (Custom grain)]


Data Volumetrics

Metric                                           Total
Avg. volume of app logs processed (compressed)   100s of TB/month
Avg. number of jobs                              6,000+/month
Avg. log size volume growth rate                 A lot!
Number of log record types                       1,000s
Number of fields                                 10s of 1,000s

200+ B events / day

500+ features


Thank you

We are hiring! www.salesforce.com/company/careers
