Planning and Optimizing Data Lake Architecture

Milos Milovanovic, Co-Founder & Data Engineer @ Things Solver




Agenda

Introduction - Business Data Requirements

What is A Data Lake?

A Common Data Lake Architecture

When Problems Start To Show Up - Optimizing Data Lake

Expanding a Data Lake

How To Plan Data Lake - Success Factors


Introduction - Business Data Requirements

The main goal for organizations is to adapt and put all of their data to use.

It’s not an easy task: it might require changes in both mindset and structure.

Flexibility and agility are required for success.

Various trends and buzzwords are making it hard to stay on track.

Challenge of Transforming Enterprise Data Management - (“The data lake is a foundational component and common denominator of the modern data architecture enabling, and complementing specialized components, such as enterprise data warehouses, discovery-oriented environments, and highly-specialized analytic or operational data technologies…” - John O’Brien, CEO @ Radiant Advisors)


Data Lake - The Very First Definition

“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

- James Dixon, CTO @ Pentaho


A More Formal Definition

“A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed.”


Data Warehouse & Data Lake


Data Warehouse & Data Lake by Example

Social Media Streaming can be implemented using a traditional Data Warehouse

… but such an application will be too restricted and inflexible (for example, when extending the number of columns analyzed).

Using a Data Lake for this purpose gives us the flexibility to adapt and test new metrics

… and we can easily add new applications on top.
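The flexibility described above can be sketched as "schema on read": raw events are kept as they arrive, and a new metric is just a new read over the same data. This is a minimal illustration in plain Python, not the setup used in the talk; the sample events and the `read_events` and `total` helpers are hypothetical.

```python
import json

# Hypothetical raw social media events, stored as-is (one JSON document per line).
RAW_EVENTS = """\
{"user": "ana", "text": "Loving the new release!", "likes": 12}
{"user": "marko", "text": "Shipping delays again...", "likes": 3, "shares": 7}
{"user": "ana", "text": "Great support team", "likes": 5, "lang": "en"}
"""


def read_events(raw):
    """Parse raw lines lazily; no schema is imposed at load time."""
    for line in raw.splitlines():
        if line.strip():
            yield json.loads(line)


def total(events, field):
    """Sum a numeric field, treating missing values as 0."""
    return sum(event.get(field, 0) for event in events)


likes = total(read_events(RAW_EVENTS), "likes")
# A metric added later: no table migration, only a new read of the raw data.
shares = total(read_events(RAW_EVENTS), "shares")
print(likes, shares)
```

In a warehouse, the `shares` column would have to be added to the table and the load process before it could be analyzed; here the field is simply picked up when it is first needed.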


Data Lake Architecture Overview


A Common Data Lake Implementation Architecture

❏ In general, the architecture of a data lake is simple: a Hadoop Distributed File System (HDFS) with lots of directories and files on it.

❏ Hadoop is usually at the center of a Data Lake architecture, although the concept is broader than Hadoop.

❏ Hadoop’s scalable, low-cost persistence layer and its ability to perform big data processing and analytics make it a great toolset for delivering measurable business value quickly and at low cost.

❏ Hive and Spark provide rich analytics on top of the data that is persisted at low cost.


This Architecture:

❏ Acts like SQL

❏ Efficient and Scalable

❏ Connects to Basically Anything

❏ Different Processing Modes (Realtime, Batch, Pipelines, Machine Learning, Ad Hoc Analysis …)

[Diagram labels: Hadoop Distributed File System, Hive and Spark, Data Sources]


When Problems Show Up

Hadoop + Spark/Hive != Database

- Searching for a row within TBs of Data

select * from my_table where some_column like '%123asd%';

- No updates and deletes

- Too many concurrent requests from BI Tools

...

Spark Best Practice: http://go.databricks.com/not-your-fathers-database
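To illustrate why the `like '%123asd%'` query is a problem: a pattern with a leading wildcard cannot use a sorted index, so every row has to be inspected, while a database answers a point lookup on a key with a single probe. The sketch below is a plain Python analogy, not Spark or Hive code; the in-memory `rows` table and both helper functions are hypothetical.

```python
# Hypothetical in-memory "table": row id -> value of some_column.
rows = {i: f"value-{i:06d}" for i in range(100_000)}
rows[73_421] = "xx123asdyy"


def like_scan(table, needle):
    """Analogue of LIKE '%needle%': every row must be inspected."""
    inspected = 0
    matches = []
    for row_id, value in table.items():
        inspected += 1
        if needle in value:
            matches.append(row_id)
    return matches, inspected


def key_lookup(table, row_id):
    """Analogue of an indexed point lookup: one probe, no scan."""
    return table.get(row_id)


matches, inspected = like_scan(rows, "123asd")
print(matches, inspected)          # one match, but all rows were read
print(key_lookup(rows, 73_421))    # direct access by key
```

On a data lake the same scan runs over terabytes of files, which is why search-heavy workloads are usually better served by a tool built for them.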


How Do We Optimize Such a Solution?

❏ Use ORC File Format

❏ File Compaction (small files, deduplication)

❏ Run Spark on YARN

❏ Use Spark Dataframes

❏ Data Caching

❏ Use Traditional Databases

❏ Extend the Toolset (Solr, ES, Kafka, Redis, …)
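As a rough illustration of the file compaction item, the sketch below merges many small part files into one larger file and drops duplicate records along the way. It uses local text files as a stand-in; on a real cluster this would more likely be a Spark or Hive job writing ORC to HDFS. The `compact` function, the `part-*.txt` naming, and the sample records are assumptions made for this example.

```python
import tempfile
from pathlib import Path


def compact(src_dir: Path, dst_file: Path) -> int:
    """Merge all part-*.txt files into one file, dropping duplicate lines.

    Returns the number of unique records written.
    """
    seen = set()
    with dst_file.open("w", encoding="utf-8") as out:
        for part in sorted(src_dir.glob("part-*.txt")):
            for line in part.read_text(encoding="utf-8").splitlines():
                if line and line not in seen:
                    seen.add(line)
                    out.write(line + "\n")
    return len(seen)


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "landing"
    src.mkdir()
    # Hypothetical small files, as produced by frequent ingestion batches.
    (src / "part-0001.txt").write_text("a,1\nb,2\n", encoding="utf-8")
    (src / "part-0002.txt").write_text("b,2\nc,3\n", encoding="utf-8")
    (src / "part-0003.txt").write_text("c,3\nd,4\n", encoding="utf-8")

    target = Path(tmp) / "compacted.txt"
    written = compact(src, target)
    compacted_lines = target.read_text(encoding="utf-8").splitlines()

print(written, compacted_lines)
```

Fewer, larger files matter on HDFS because the NameNode keeps metadata for every file in memory, and each small file tends to become its own task at processing time.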


Data Lake - Extended Toolset

[Diagram labels: HDFS, and many more...]


How To Start With The Data Lake?

❏ Think of the Use Cases (don’t plan all the use cases - have some in mind)

❏ Master the Technology

❏ Go agile and flexible

❏ Do not forget about Data Governance, Data Quality, and Security (but do not drown in this)

❏ Integrate with BI and DWH


Make data accessible and let Data Scientists go fishing in the Lake.
