
Data Warehouse Design on Cloud, A Big Data Approach: Part One


DATA WAREHOUSE DESIGN ON THE CLOUD: A BIG DATA APPROACH

PANCHALESWAR NAYAK, SR ARCHITECT (CLOUD & BIGDATA)

Public Clouds

Private Cloud


AGENDA
• WHAT IS BUSINESS INTELLIGENCE (BI)
• WHAT IS A DATA WAREHOUSE (DW)
• WHAT IS A DATA MART
• WHAT IS DATA MINING
• LOGICAL ARCHITECTURE OF ETL (EXTRACT, TRANSFORM AND LOAD)
• DATA WAREHOUSE (DW) DESIGN METHODOLOGIES
  • BILL INMON'S TOP-DOWN APPROACH
  • RALPH KIMBALL'S BOTTOM-UP APPROACH
• THE NEW 3V DATA PROBLEM (VOLUME, VELOCITY, VARIETY)
• THE ARCHITECTURE FOR THE NEXT GENERATION OF DATA WAREHOUSING


BUSINESS INTELLIGENCE (BI)
• Business intelligence, or BI, is an umbrella term that refers to a variety of software applications used to analyze an organization's raw data.
• BI as a discipline is made up of several related activities, including data mining, online analytical processing, querying and reporting.


WHAT IS A DATA WAREHOUSE (DW)
A data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis, and is considered a core component of a business intelligence environment. DWs are central repositories of integrated data from one or more disparate sources.

It is a relational database schema that stores historical data and metadata from one or more operational systems in a way that facilitates reporting and analysis of the data, aggregated to various levels.


WHAT IS A DATA MART
• A data mart is a subset of the data warehouse that is usually oriented to a specific business line or team.
• Data marts are small slices of the data warehouse.
• Whereas data warehouses have enterprise-wide scope, the information in data marts pertains to a single department.
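A minimal sketch of the "slice of the warehouse" idea, assuming the warehouse is a single SQLite table and the data mart is a view over it. The table, view and column names (`warehouse_facts`, `sales_mart`, `department`) are hypothetical and chosen only for illustration; real data marts are often separate physical schemas.

```python
import sqlite3


def build_sales_mart(conn):
    """Create a tiny warehouse table and a sales-only data mart view over it."""
    conn.executescript("""
        CREATE TABLE warehouse_facts (
            department TEXT,   -- business line that owns the row
            metric     TEXT,
            amount     REAL
        );
        INSERT INTO warehouse_facts VALUES
            ('sales',   'revenue',  1200.0),
            ('sales',   'refunds',   -50.0),
            ('finance', 'payroll',  -900.0),
            ('hr',      'training', -100.0);
        -- The data mart exposes only the slice that one department needs.
        CREATE VIEW sales_mart AS
            SELECT metric, amount
            FROM warehouse_facts
            WHERE department = 'sales';
    """)
    return conn.execute(
        "SELECT metric, amount FROM sales_mart ORDER BY metric"
    ).fetchall()


conn = sqlite3.connect(":memory:")
sales_rows = build_sales_mart(conn)
print(sales_rows)
```

The warehouse keeps rows for every department, while the view returns only the two sales rows.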


DATA WAREHOUSE VS DATA MART
• The main difference is the scope of the information they store.
• DATA WAREHOUSE
  • Stores all kinds of data related to the whole system.
  • Is usually much bigger than a data mart, because it keeps far more data.
  • Usually integrates a large number of data sources to feed its database.
  • Holds multiple subject areas.
  • Holds very detailed information.
  • Does not necessarily use a dimensional model, but feeds dimensional models.
• DATA MART
  • Stores information on a specific subject and is therefore much more focused on a single function, for example finance or sales.
  • Has much less integration to do, since its data is very specific.
  • May hold more summarized data.
  • Concentrates on integrating information from a given subject area or set of data sources.
  • Is built around a dimensional model using a star schema.
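To make the star schema concrete, here is a small sketch in SQLite: one fact table in the middle, joined to two dimension tables. All table and column names (`fact_sales`, `dim_product`, `dim_date`) and the sample rows are assumptions made for this example, not part of the original slides.

```python
import sqlite3


def star_schema_demo():
    """Build a tiny star schema and aggregate the facts by two dimensions."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Dimension tables describe the "who/what/when" of each fact.
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
        CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, year INTEGER);
        -- The fact table holds measures plus one key per dimension.
        CREATE TABLE fact_sales (
            product_id INTEGER REFERENCES dim_product(product_id),
            date_id    INTEGER REFERENCES dim_date(date_id),
            amount     REAL
        );
        INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
        INSERT INTO dim_date    VALUES (10, 2014), (11, 2015);
        INSERT INTO fact_sales  VALUES (1, 10, 20.0), (1, 11, 30.0), (2, 11, 50.0);
    """)
    return conn.execute("""
        SELECT p.category, d.year, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_product p ON p.product_id = f.product_id
        JOIN dim_date    d ON d.date_id    = f.date_id
        GROUP BY p.category, d.year
        ORDER BY p.category, d.year
    """).fetchall()


print(star_schema_demo())
```

Every analytical query follows the same shape: join the fact table to the dimensions of interest, then group by dimension attributes.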


WHAT IS DATA MINING
• Data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both.

It is the data-driven discovery and modeling of hidden patterns in large volumes of data.

Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified.
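As a toy illustration of "finding correlations among fields", the sketch below computes the Pearson correlation for every pair of numeric fields and reports the strongest one. The field names and values are invented for the example, and real data mining tools use far richer techniques than pairwise correlation.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def strongest_pair(table):
    """Return the pair of fields with the largest absolute correlation."""
    pairs = combinations(sorted(table), 2)
    best = max(pairs, key=lambda p: abs(pearson(table[p[0]], table[p[1]])))
    return best, pearson(table[best[0]], table[best[1]])


# Hypothetical fields; each list is one column of the same five records.
fields = {
    "ad_spend": [1, 2, 3, 4, 5],
    "revenue": [2, 4, 6, 8, 10],
    "returns": [5, 3, 4, 1, 2],
}
print(strongest_pair(fields))
```

With this sample, `ad_spend` and `revenue` move together perfectly, so they are reported as the strongest pair.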


LOGICAL ARCHITECTURE OF TRADITIONAL ETL SYSTEM

[Diagram: Original Data, ETL Tools (Transform, Load), Transformed Data, Online Data Store, Data Mart, Query, BI Tools]
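The stages of a traditional ETL system can be sketched as a few small functions. This is only an illustration under stated assumptions: the source is an in-memory list standing in for an operational system, the target is an in-memory SQLite table standing in for the data mart, and every name (`ORIGINAL_DATA`, `extract`, `transform`, `load`, `query`) is hypothetical.

```python
import sqlite3

# Stand-in for the original data held by an operational (online) system.
ORIGINAL_DATA = [
    {"customer": " Alice ", "amount": "10.50"},
    {"customer": "bob", "amount": "4.00"},
    {"customer": "ALICE", "amount": "2.25"},
]


def extract():
    """Read the raw records from the source."""
    return list(ORIGINAL_DATA)


def transform(rows):
    """Clean names and convert amounts from text to numbers."""
    return [(r["customer"].strip().lower(), float(r["amount"])) for r in rows]


def load(conn, rows):
    """Write the transformed rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


def query(conn):
    """The kind of aggregate a BI tool would run against the loaded data."""
    return conn.execute(
        "SELECT customer, SUM(amount) FROM sales GROUP BY customer ORDER BY customer"
    ).fetchall()


def run_pipeline():
    conn = sqlite3.connect(":memory:")
    load(conn, transform(extract()))
    return query(conn)


print(run_pipeline())
```

Because the transform step normalizes the customer names, the two spellings of "Alice" end up in a single group.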


DATA WAREHOUSE DESIGN METHODOLOGIES

• Bill Inmon is sometimes referred to as the "father of data warehousing". His design methodology is based on a top-down approach and defines a data warehouse in these terms:
  • Subject oriented - Data in a data warehouse is categorized by subject area.
  • Integrated - Data is integrated from disparate data sources, so universal naming conventions, measurements, classifications and so on are used in the data warehouse. The data warehouse provides a consolidated, enterprise-wide view of the data and is therefore an integrated solution.
  • Non-volatile - Once data is integrated or loaded into the data warehouse, it can only be read. Users cannot change the data, which makes it non-volatile.
  • Time variant - Data is stored for long periods, measured in years, and carries a date and timestamp.
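The non-volatile and time-variant properties can be sketched together: rows are only ever appended with a load timestamp, and updates or deletes are rejected. The example uses SQLite triggers to enforce the read-only rule; the table name `customer_history`, its columns and the helper functions are assumptions for illustration, and a production warehouse would more likely rely on permissions than on triggers.

```python
import sqlite3


def make_history_table():
    """Create an append-only table whose rows carry a load timestamp."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer_history (
            customer_id INTEGER,
            city        TEXT,
            loaded_at   TEXT    -- ISO date, so text comparison follows time order
        );
        -- Non-volatile: reject any attempt to change or remove loaded rows.
        CREATE TRIGGER no_update BEFORE UPDATE ON customer_history
        BEGIN SELECT RAISE(ABORT, 'warehouse rows are read-only'); END;
        CREATE TRIGGER no_delete BEFORE DELETE ON customer_history
        BEGIN SELECT RAISE(ABORT, 'warehouse rows are read-only'); END;
    """)
    return conn


def load_snapshot(conn, customer_id, city, loaded_at):
    """Append a new version of the customer instead of overwriting the old one."""
    conn.execute(
        "INSERT INTO customer_history VALUES (?, ?, ?)",
        (customer_id, city, loaded_at),
    )


def city_as_of(conn, customer_id, when):
    """Time variant: return the city that was current on the given date."""
    row = conn.execute(
        "SELECT city FROM customer_history "
        "WHERE customer_id = ? AND loaded_at <= ? "
        "ORDER BY loaded_at DESC LIMIT 1",
        (customer_id, when),
    ).fetchone()
    return row[0] if row else None


conn = make_history_table()
load_snapshot(conn, 1, "Pune", "2014-01-01")
load_snapshot(conn, 1, "Delhi", "2015-06-01")
print(city_as_of(conn, 1, "2014-12-31"), city_as_of(conn, 1, "2016-01-01"))
```

Keeping both versions of the row is what allows a report to be reproduced as of any past date.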


BILL INMON'S TOP-DOWN APPROACH

[Diagram: Bill Inmon top-down approach. Labels: sources SALES, FINANCE, HR, MARKETING, OTHER SOURCES; EXTRACT; STAGING AREA with CLEANING, SCRUBBING, DE-DUPLICATION, TRANSFORMATION; LOAD; DATA WAREHOUSE; DATA MART 1 to DATA MART 5]

• Bill Inmon saw a need to integrate data from different OLTP systems into a centralized repository (called a data warehouse) using a so-called top-down approach.

• He envisions the DW at the center of the "Corporate Information Factory" (CIF), which provides a logical framework for delivering business intelligence (BI), business analytics and business management capabilities.


PROS AND CONS OF TOP-DOWN APPROACH

• PROS
  • Highly consistent dimensional view of data across data marts, as all data marts are loaded from the centralized repository (data warehouse).
  • Proven to be flexible in supporting business changes, as it looks at the organization as a whole, not at each function or business process of the organization.
  • Generating new dimensional data marts from the data stored in the data warehouse is a relatively simple task.
• CONS
  • It is a very large project with a very broad scope, so the up-front cost of implementing a data warehouse using the top-down methodology is significant.
  • The time from the start of the project to the point at which end users start to experience the initial benefits of the solution can be substantial.
  • The top-down methodology can be inflexible and unresponsive to changing departmental or business process needs in today's dynamically changing environment.


RALPH KIMBALL'S BOTTOM-UP APPROACH

[Diagram: Ralph Kimball's bottom-up approach. Labels: sources SALES, FINANCE, HR, MARKETING, OTHER SOURCES; EXTRACT; STAGING AREA with CLEANING, SCRUBBING, DE-DUPLICATION, TRANSFORMATION; five LOAD steps; DATA MART 1 to DATA MART 5; DATA WAREHOUSE]

• Ralph Kimball's bottom-up approach proposes creating a business matrix (the "bus matrix" in Kimball's terminology) that contains all the common elements used by the data marts, such as conformed (shared) dimensions and measures, defined for the enterprise as a whole.

• With these shared elements in place, users can design and develop solutions that support analysis across business processes, for example for cross-selling.
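A conformed dimension is what makes analysis across business processes possible: two data marts that share the same customer dimension can be combined in one query. The sketch below assumes two hypothetical fact tables, `fact_sales` and `fact_campaign`, that both reference a shared `dim_customer`; all names and rows are invented for illustration.

```python
import sqlite3


def drill_across():
    """Combine measures from two marts through their shared customer dimension."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Conformed dimension: one definition of "customer" for every mart.
        CREATE TABLE dim_customer  (customer_id INTEGER PRIMARY KEY, name TEXT);
        -- Sales mart and marketing mart each keep their own facts.
        CREATE TABLE fact_sales    (customer_id INTEGER, amount REAL);
        CREATE TABLE fact_campaign (customer_id INTEGER, contacts INTEGER);
        INSERT INTO dim_customer  VALUES (1, 'Asha'), (2, 'Ravi');
        INSERT INTO fact_sales    VALUES (1, 100.0), (1, 40.0), (2, 15.0);
        INSERT INTO fact_campaign VALUES (1, 3), (2, 5), (2, 1);
    """)
    # Aggregate each fact table separately, then join on the shared key,
    # so rows from one mart do not multiply rows from the other.
    return conn.execute("""
        SELECT c.name, s.total_sales, m.total_contacts
        FROM dim_customer c
        JOIN (SELECT customer_id, SUM(amount) AS total_sales
              FROM fact_sales GROUP BY customer_id) s
          ON s.customer_id = c.customer_id
        JOIN (SELECT customer_id, SUM(contacts) AS total_contacts
              FROM fact_campaign GROUP BY customer_id) m
          ON m.customer_id = c.customer_id
        ORDER BY c.name
    """).fetchall()


print(drill_across())
```

If each mart had defined its own customer keys, this join would not be possible without an extra mapping step.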


BOTTOM-UP DATA WAREHOUSE DESIGN APPROACH

• Ralph Kimball is a renowned author on the subject of data warehousing. His design methodology is called dimensional modeling or the Kimball methodology.

• The methodology emphasizes delivering the value of the data warehouse to users as quickly as possible.

• A data warehouse is a copy of transactional data specifically structured for analytical querying and reporting, in order to support decision support systems.

• Data marts are created first to provide reporting and analytical capabilities for specific business or functional processes; later, these data marts can be combined to create a comprehensive enterprise data warehouse.

• The bottom-up approach focuses on one business process at a time, so a return on investment (ROI) can come as soon as the first data mart is created.

• However, if the work is not carefully planned and you focus too narrowly on an individual business process, you may lose the big picture of the enterprise data warehouse, for example by missing some dimensions or creating redundant ones.


PROBLEMS WITH THE OLD DATA WAREHOUSE
• CAN HANDLE ONLY A VERY LIMITED NUMBER OF DATA SOURCES (MAYBE AROUND 25-30)
• CANNOT HANDLE A LARGE NUMBER OF DATA SOURCES
• GLOBAL SCHEMA REQUIRED
• A PROGRAMMER OR DATA ENGINEER WAS REQUIRED FOR EACH DATA SOURCE
  • To understand the data schema
  • To write the local-to-global mapping (in a scripting language)
  • To clean the data
  • To run the ETL
• SUBSTANTIAL HUMAN INVOLVEMENT WAS REQUIRED TO ADD A NEW DATA SOURCE
• SCALABILITY ISSUES
• AGILITY ISSUES
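The local-to-global mapping step is the part that had to be written by hand for every source. A minimal sketch, assuming two hypothetical sources (`crm`, `webshop`) whose field names differ from the global schema; the mapping table and field names are invented for illustration.

```python
# Fields that every record must have once it is in the global schema.
GLOBAL_FIELDS = ("customer_name", "order_total")

# One hand-written mapping per data source: local field -> global field.
LOCAL_TO_GLOBAL = {
    "crm": {"cust_nm": "customer_name", "total": "order_total"},
    "webshop": {"buyer": "customer_name", "order_value": "order_total"},
}


def to_global(source, record):
    """Rename a source record's fields to the global schema, dropping extras."""
    mapping = LOCAL_TO_GLOBAL[source]  # KeyError means nobody mapped this source yet
    out = {mapping[key]: value for key, value in record.items() if key in mapping}
    missing = [field for field in GLOBAL_FIELDS if field not in out]
    if missing:
        raise ValueError(f"{source} record is missing {missing}")
    return out


print(to_global("crm", {"cust_nm": "Asha", "total": 10, "internal_flag": 1}))
print(to_global("webshop", {"buyer": "Ravi", "order_value": 7}))
```

Each new source needs a new entry in `LOCAL_TO_GLOBAL`, which is why adding sources required so much manual effort.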


THE NEW 3V DATA PROBLEM

• VOLUME
  • TOO MUCH DATA TO BE HANDLED OR PROCESSED BY A SINGLE SERVER
• VELOCITY
  • A CONTINUOUS, HIGH-SPEED FLOW OF INCOMING DATA THAT A STATIC DATA WAREHOUSE CANNOT INGEST
• VARIETY
  • TOO UNSTRUCTURED TO FIT INTO A ROW-AND-COLUMN DATABASE


THE NEW DATA ARCHITECTURE WITH HADOOP

[Diagram: Original Data, ETL Tools (Extract, Transform, Load), Hadoop, Transformed Data, Online Data Store, Data Mart, Query, BI Tools]


THE ARCHITECTURE FOR THE NEXT GENERATION OF DATA WAREHOUSING