Big Data Simplified "Is all about abˈstrakSH(ə)n" HEMAL GANDHI DIRECTOR OF DATA ENGINEERING

Background

Analyze Current State
•  Challenges
•  Facts

New Platform Design
•  Define Goals
•  Feature List
•  Implementation Approach

Compare
•  Feature List
•  Trade Offs
•  Cost Structure

Decision: Fix vs. Build?

Analyze Current State

Platform is very complex

Struggling to keep up with business needs

Huge backlog

Code base is increasing rapidly

We are slow to respond to market needs

Outdated technology stack

Missing best practices

High cost of data
•  Storage
•  Finding Insights
•  Integration
•  Maintenance

Lack of understanding business impact of data
•  Strategic Value
•  Data Identity
•  Time Value
•  Dependencies

Agile – mini waterfall

Process and Organization

High Investment Costs

Adoption Issues

Complex Framework

Lots of Challenges

Platform is NOT scalable

Can impact revenue negatively!!!

New Platform Design

Keep it simple

Keep up with business needs

Move fast

Keep technology stack current over time

Low cost of data
•  Storage
•  Finding Insights
•  Integration
•  Maintenance

Understand business impact of data
•  Strategic Value
•  Data Identity
•  Time Value
•  Dependencies

Measure data

Be Agile – Do Less

Improve data ROI

Compare

Investment needs
•  Current Platform: High
•  New Platform: High

Scalability
•  Current Platform: Not Scalable
•  New Platform: Initially Scalable

Maintenance cost
•  Current Platform: High
•  New Platform: Initially low, grows over time

Technology
•  Current Platform: Outdated
•  New Platform: Big Data tools provide technology, not solutions to design problems

Technology choices

Decision: Fix vs. Build?

Next Steps

Goal

Build a feature-based, scalable big data platform in 6 months with limited resources while supporting the legacy system.

Design Patterns

Take Platform Approach
•  Project Requirements
•  Data Platform Features
•  Reusable Components

Technology Abstraction
•  Business Logic
•  Declarative Configuration
•  Pick Technology at Runtime
•  Execution Engine
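As a rough illustration of the technology abstraction above, the sketch below states the business logic once as declarative configuration and picks the technology at runtime. The `ENGINES` registry, the `JOB` layout, and `run_job` are hypothetical names made up for this example, not Dredge's actual interfaces, and the "engines" only render a command string instead of submitting work.

```python
from typing import Callable, Dict

# Hypothetical engine registry: each entry turns a declarative job into an
# engine-specific command string. A real engine would submit work instead.
ENGINES: Dict[str, Callable[[dict], str]] = {
    "pig": lambda job: f"pig: FILTER {job['source']} BY {job['where']}",
    "sql": lambda job: f"sql: SELECT * FROM {job['source']} WHERE {job['where']}",
}

# Business logic is declared once, with no engine-specific code in it.
JOB = {"source": "orders", "where": "amount > 100"}


def run_job(job: dict, engine: str) -> str:
    """Look up the engine chosen at runtime and render the job for it."""
    try:
        render = ENGINES[engine]
    except KeyError:
        raise ValueError(f"unknown engine: {engine}") from None
    return render(job)


print(run_job(JOB, "sql"))
print(run_job(JOB, "pig"))
```

Under these assumptions, swapping the technology is a configuration change (the `engine` argument); the job definition itself stays untouched.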

Data Access & Ingestion Abstraction
•  Data Producers
•  Data Ingestion Framework
•  Data Storage
•  Data Access API
•  Data Consumers
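One way to picture the data access and ingestion abstraction is a storage contract that producers and consumers never touch directly. The sketch below is an assumption-laden illustration: the class names (`Storage`, `InMemoryStorage`, `IngestionFramework`, `DataAccessAPI`) and their methods are hypothetical, and the in-memory backend merely stands in for a real store.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List


class Storage(ABC):
    """Backend contract; producers and consumers never call a backend directly."""

    @abstractmethod
    def write(self, table: str, rows: Iterable[dict]) -> None:
        ...

    @abstractmethod
    def read(self, table: str) -> List[dict]:
        ...


class InMemoryStorage(Storage):
    """Stand-in backend for this sketch; a real one might wrap Hive or HBase."""

    def __init__(self) -> None:
        self._tables: Dict[str, List[dict]] = {}

    def write(self, table: str, rows: Iterable[dict]) -> None:
        self._tables.setdefault(table, []).extend(rows)

    def read(self, table: str) -> List[dict]:
        return list(self._tables.get(table, []))


class IngestionFramework:
    """Single entry point for data producers."""

    def __init__(self, storage: Storage) -> None:
        self._storage = storage

    def ingest(self, table: str, rows: Iterable[dict]) -> int:
        rows = list(rows)
        self._storage.write(table, rows)
        return len(rows)


class DataAccessAPI:
    """Single entry point for data consumers."""

    def __init__(self, storage: Storage) -> None:
        self._storage = storage

    def query(self, table: str, **equals) -> List[dict]:
        # Return rows whose fields equal every keyword filter.
        return [
            row for row in self._storage.read(table)
            if all(row.get(k) == v for k, v in equals.items())
        ]


store = InMemoryStorage()
ingested = IngestionFramework(store).ingest(
    "events", [{"user": "a", "kind": "click"}, {"user": "b", "kind": "view"}]
)
clicks = DataAccessAPI(store).query("events", kind="click")
print(ingested, clicks)
```

Because both entry points depend only on the `Storage` contract, replacing the backend would not require changes on the producer or consumer side.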

Stream Data to Storage Layer
•  Data Integration Jobs
•  Stream
•  Data Storage
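Streaming records from integration jobs into the storage layer could look roughly like the following micro-batching sketch. The helper names (`batches`, `stream_to_storage`) and the list used as a storage sink are assumptions for illustration only; the slides do not say how the stream is implemented.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batches(stream: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Group a possibly unbounded record stream into micro-batches of `size`."""
    it = iter(stream)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def stream_to_storage(stream: Iterable[dict], sink: List[dict],
                      batch_size: int = 2) -> int:
    """Append each micro-batch to the sink and return the number of writes."""
    writes = 0
    for batch in batches(stream, batch_size):
        sink.extend(batch)  # a real sink would be a storage-layer writer
        writes += 1
    return writes


storage: List[dict] = []
events = ({"id": i} for i in range(5))  # generator standing in for a live stream
writes = stream_to_storage(events, storage)
print(writes, len(storage))
```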

Hot/Cold Data Management
•  Hot Data
•  Cold Data
•  Configuration
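Configuration-driven hot/cold management might be sketched as below, assuming tiering is decided by partition age. The `TIER_CONFIG` thresholds, the default, the dataset names, and the dates are all hypothetical values chosen for the example.

```python
from datetime import date, timedelta

# Hypothetical configuration: days a dataset's partitions stay in the hot tier.
TIER_CONFIG = {"clickstream": 7, "orders": 90}
DEFAULT_HOT_DAYS = 30


def tier_for(dataset: str, partition_date: date, today: date) -> str:
    """Return "hot" or "cold" for a partition based on its age and the config."""
    hot_days = TIER_CONFIG.get(dataset, DEFAULT_HOT_DAYS)
    age_days = (today - partition_date).days
    return "hot" if age_days <= hot_days else "cold"


today = date(2015, 6, 30)  # arbitrary reference date for the example
print(tier_for("clickstream", today - timedelta(days=3), today))
print(tier_for("clickstream", today - timedelta(days=30), today))
print(tier_for("orders", today - timedelta(days=30), today))
```

With this shape, changing how long data stays hot is an edit to the configuration rather than to job code.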

abˈstrakSH(ə)n

High Level Architecture

Architecture

Applications & Visualization Tools

Data Access Abstraction

Data Platform

Dredge
•  Collection: Apache Flume, Sqoop
•  Flow: Kafka, Spark
•  Processing: Pig, Spark, MapReduce
•  Storage: Hive, HBase, Vertica
•  Delivery: Looker, Tableau, Visualization (d3.js), Email/FTP

Data Quality Service (Data Lineage & Profiling)

Security

Scheduling & Cluster Monitoring

WHAT IS DREDGE

A declarative abstraction layer for integrating big data tools, enabling a loosely coupled big data platform.

Dredge Logical View
•  Source End Points
•  Source Readers
•  Tasks
•  Hadoop Cluster
•  Target Writer (Streams/Direct)
•  Target End Points
•  Events Management
•  Log Streaming
•  Dredge Repository – HBase
•  Configuration Abstraction
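The reader/writer split in the logical view can be illustrated with a small plugin registry, where a task is pure configuration naming a source end point and a target end point. Everything here is a hypothetical sketch: the `READERS`/`WRITERS` registries, the `reader`/`writer` decorators, the `csv_text` and `list` kinds, and `run_task` are invented for the example and are not Dredge's API.

```python
import csv
import io
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical plugin registries keyed by end point kind.
READERS: Dict[str, Callable[[dict], Iterator[dict]]] = {}
WRITERS: Dict[str, Callable[[dict, Iterable[dict]], int]] = {}


def reader(kind: str):
    """Decorator that registers a source reader for an end point kind."""
    def register(fn):
        READERS[kind] = fn
        return fn
    return register


def writer(kind: str):
    """Decorator that registers a target writer for an end point kind."""
    def register(fn):
        WRITERS[kind] = fn
        return fn
    return register


@reader("csv_text")
def read_csv_text(endpoint: dict) -> Iterator[dict]:
    # Parse CSV held in the configuration; a real reader would open a source.
    yield from csv.DictReader(io.StringIO(endpoint["text"]))


@writer("list")
def write_list(endpoint: dict, rows: Iterable[dict]) -> int:
    rows = list(rows)
    endpoint["target"].extend(rows)
    return len(rows)


def run_task(task: dict) -> int:
    """Resolve reader and writer from configuration and move the rows."""
    rows = READERS[task["source"]["kind"]](task["source"])
    return WRITERS[task["target"]["kind"]](task["target"], rows)


out: List[dict] = []
TASK = {
    "source": {"kind": "csv_text", "text": "id,name\n1,a\n2,b\n"},
    "target": {"kind": "list", "target": out},
}
count = run_task(TASK)
print(count, out)
```

Adding a new source or target in this sketch means registering one more function; existing task definitions are unaffected.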

Dredge Architecture

Dredge UI
•  Declarative configuration
•  Logical Flows
•  Data Lineage
•  Runtime Logs
•  Admin

Dredge Data Services
•  Aggregator
•  UDFs
•  Combiners, Routers..
•  Plugin (Java/Shell, Pig, SQL)
•  Rank, Sorter
•  Set Operations
•  Filters/Patterns Analysis

Abstraction builder (Kafka, Flume, Pig, Custom)

Source Readers (Logs, RDBMS, unstructured data, Custom), Direct/Stream

Target Writers (Hive, HBase, RDBMS, Custom), Direct/Stream

Dredge Runtime
•  Temp Store - HDFS
•  Event Management
•  Temp Cache - HDFS
•  Logger Stream

Dredge Repository – HBase

Lambda Architecture: HDFS, Hive, HBase, Pig, Flume, Kafka, Oozie

DREDGE BENEFITS
•  From 1000+ scripts to 50-100 scripts
•  From 1000+ configuration files to <5 files
•  Logical view of workflow, abstract physical implementation
•  Quickly integrate new tools, declarative configuration implementation for big data tools
•  Improved SLA, time to market, better cluster utilization, higher performance
•  Simplified integration
•  Minimal migration costs
•  Low maintenance, configurable archiving of data

Summarizing

It is all about abˈstrakSH(ə)n

✓  Abstraction layer
    ✓  Technology
    ✓  Data access
    ✓  Data ingestion
    ✓  Dependencies…
✓  Reusable data components
✓  Event driven dependencies
✓  Plug & Play integration, loosely coupled (Cluster resources, Data)
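Event driven dependencies, as listed in the summary, can be sketched as tasks that start when the events they wait on have been published, instead of at a fixed schedule position. The `EventBus` class, its methods, and the event and task names are hypothetical and only meant to show the idea.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set


class EventBus:
    """Runs a task once every event it waits on has been published."""

    def __init__(self) -> None:
        self._waiting: Dict[str, Set[str]] = {}  # task -> events still missing
        self._actions: Dict[str, Callable[[], None]] = {}
        self._by_event: Dict[str, List[str]] = defaultdict(list)

    def register(self, task: str, needs: List[str],
                 action: Callable[[], None]) -> None:
        self._waiting[task] = set(needs)
        self._actions[task] = action
        for event in needs:
            self._by_event[event].append(task)

    def publish(self, event: str) -> None:
        # Iterate over a copy so actions may safely publish further events.
        for task in list(self._by_event.get(event, [])):
            missing = self._waiting.get(task)
            if missing is None:
                continue  # task has already run
            missing.discard(event)
            if not missing:
                del self._waiting[task]
                self._actions[task]()


log: List[str] = []
bus = EventBus()
bus.register("build_report", ["orders_loaded", "users_loaded"],
             lambda: log.append("build_report"))
bus.publish("orders_loaded")  # one dependency still missing, nothing runs
bus.publish("users_loaded")   # all dependencies met, the task runs
print(log)
```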

Big data requires a different mindset: innovate, iterate often and keep it simple.

Thank you.

engineering.onekingslane.com