36
October 15-18, 2013 Charlotte, NC Building an Effective Data Warehouse Architecture James Serra, Business Intelligence Consultant SolidQ

Building an Effective Data Warehouse Architecture

Embed Size (px)

DESCRIPTION

Why use a data warehouse? What is the best methodology to use when creating a data warehouse? Should I use a normalized or dimensional approach? What is the difference between the Kimball and Inmon methodologies? Does the new Tabular model in SQL Server 2012 change things? What is the difference between a data warehouse and a data mart? Is there hardware that is optimized for a data warehouse? What if I have a ton of data? During this session James will help you to answer these questions.

Citation preview

Page 1: Building an Effective Data Warehouse Architecture

October 15-18, 2013Charlotte, NC

Building an Effective Data Warehouse Architecture

James Serra, Business Intelligence ConsultantSolidQ

Page 2: Building an Effective Data Warehouse Architecture

October 15-18, 2013 | Charlotte, NC

Please silence cell phones

Page 4: Building an Effective Data Warehouse Architecture

4

Session Evaluations

ways to access

Go to passsummit/evals

Download the GuideBook App and search: PASS Summit 2013

Follow the QR code link displayed on session signage throughout the conference venue and in the program guide

Submit by 5pmFriday Oct. 18 toWIN prizes

Your feedback is important and valuable.

Page 5: Building an Effective Data Warehouse Architecture

5

About Me

Business Intelligence Consultant, in IT for 28 years Owner of Serra Consulting Services, specializing in end-to-

end Business Intelligence and Data Warehouse solutions using the Microsoft BI stack

Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW developer

Been perm, contractor, consultant, business owner Presenter at PASS Business Analytics Conference and PASS

Summit MCSE for SQL Server 2012: Data Platform and BI SME for SQL Server 2012 certs Contributing writer for SQL Server Pro magazine Blog at JamesSerra.com SQL Server MVP

Page 6: Building an Effective Data Warehouse Architecture

6

Agenda

What a Data Warehouse is not What is a Data Warehouse and why use one? Fast Track Data Warehouse (FTDW) Appliances Data Warehouse vs Data Mart Kimball and Inmon Methodologies Populating a Data Warehouse ETL vs ELT Surrogate Keys SSAS Cubes SQL Server 2012 Tabular Model End-User Microsoft BI Tools

Page 7: Building an Effective Data Warehouse Architecture

7

What a Data Warehouse is not

• A data warehouse is not a copy of a source database with the name prefixed with “DW”

• It is not a copy of multiple tables (i.e. customer) from various sources systems unioned together in a view

• It is not a dumping ground for tables from various sources with not much design put into it

Page 8: Building an Effective Data Warehouse Architecture

8

Data Warehouse Maturity Model

Courtesy of Wayne Eckerson

Page 9: Building an Effective Data Warehouse Architecture

9

What is a Data Warehouse and why use one?All these reasons are for data warehouses only (not OLTP): Reduce stress on production system Optimized for read access, sequential disk scans Integrate many sources of data Keep historical records (no need to save hardcopy reports) Restructure/rename tables and fields, model data Protect against source system upgrades Use Master Data Management, including hierarchies No IT involvement needed for users to create reports Improve data quality and plugs holes in source systems One version of the truth Easy to create BI solutions on top of it (i.e. SSAS Cubes)

Page 10: Building an Effective Data Warehouse Architecture

10

Why use a Data Warehouse?

Legacy applications + databases = chaos

Production Control

MRP

InventoryControl

Parts Management

Logistics

Shipping

Raw Goods

Order Control

Purchasing

Marketing

Finance

Sales

Accounting

Management Reporting

Engineering

Actuarial

Human Resources

ContinuityConsolidationControlComplianceCollaboration

Enterprise data warehouse = order

Single version of the truth

Enterprise DataWarehouse

Every question = decision

Two purposes of data warehouse: 1) save time building reports; 2) slice in dice in ways you could not do before

Page 11: Building an Effective Data Warehouse Architecture

11

Hardware Solutions

Fast Track Data Warehouse - A reference configuration optimized for data warehousing.  This saves an organization from having to commit resources to configure and build the server hardware.  Fast Track Data Warehouse hardware is tested for data warehousing which eliminates guesswork and is designed to save you months of configuration, setup, testing and tuning.  You just need to install the OS and SQL Server

Appliances - Microsoft has made available SQL Server appliances that allow customers to deploy data warehouse (DW), business intelligence (BI) and database consolidation solutions in a very short time, with all the components pre-configured and pre-optimized.  These appliances include all the hardware, software and services for a complete, ready-to-run, out-of-the-box, high performance, energy-efficient solutions

Page 12: Building an Effective Data Warehouse Architecture

12

Fast Track Data Warehouse

FT Version 4.0

Benefits:- Pre-Balanced

Architectures- Choice of HW platforms- Lower TCO- High Scale- Reduced Risk- Taking the guess work

out of building a server

Page 13: Building an Effective Data Warehouse Architecture

13

Appliances

Dell Quickstart Data Warehouse Appliance 1000 (FT 4.0, 5TB) Dell Quickstart Data Warehouse Appliance 2000 (FT 4.0,

12TB) Dell Parallel Data Warehouse Appliance (v2, SQL 2012, 15TB-

6PB) IBM Fast Track Data Warehouse (FT 4.0, 3 versions: 24TB,

60TB, 112TB) HP Enterprise Data Warehouse Appliance (v2, SQL 2012,

15TB-6PB) HP Business Data Warehouse Appliance (FT 3.0, 5TB) HP Business Decision Appliance (BI, SharePoint 2010, SQL

Server 2008R2, PP) HP Database Consolidation Appliance (virtual environment,

Windows2008R2)

Page 14: Building an Effective Data Warehouse Architecture

14

Data Warehouse vs Data Mart

Data Warehouse: A single organizational repository of enterprise wide data across many or all subject areas Holds multiple subject areas Holds very detailed information Works to integrate all data sources Feeds data mart

Data Mart: Subset of the data warehouse that is usually oriented to specific subject (finance, marketing, sales)• The logical combination of all the data marts is a data

warehouse

In short, a data warehouse as contains many subject areas, and a data mart contains just one of those subject areas

Page 15: Building an Effective Data Warehouse Architecture

15

Kimball and Inmon Methodologies

Two approaches for building data warehouses

Page 16: Building an Effective Data Warehouse Architecture

16

Kimball and Inmon Myths

Myth: Kimball is a bottom-up approach without enterprise focus Really top-down: BUS matrix (choose business

processes/data sources), conformed dimensions, MDM Myth: Inmon requires a ton of up-front design that takes a

long time Inmon says to build DW iteratively, not big bang

approach (p. 91 BDW, p. 21 Imhoff) Myth: Star schema data marts are not allowed in Inmon’s

model Inmon says they are good for direct end-user access of

data (p. 365 BDW), good for data marts (p. 12 TTA) Myth: Very few companies use the Inmon method

Survey had 39% Inmon vs 26% Kimball. Many have a EDW

Myth: The Kimball and Inmon architectures are incompatible They can work together to provide a better solution

Page 17: Building an Effective Data Warehouse Architecture

17

Kimball and Inmon Methodologies

Relational (Inmon) vs Dimensional (Kimball) Relational Modeling:

Entity-Relationship (ER) model Normalization rules Many tables using joins History tables, natural keys Good for indirect end-user access of data

Dimensional Modeling: Facts and dimensions, star schema Less tables but have duplicate data (de-normalized) Easier for user to understand (but strange for IT people used to

relational) Slowly changing dimensions, surrogate keys Good for direct end-user access of data

Page 18: Building an Effective Data Warehouse Architecture

18

Relational Model vs Dimensional Model

Relational Model

Dimensional Model

If you are a business user, which model is easier to use?

Page 19: Building an Effective Data Warehouse Architecture

19

Kimball and Inmon Methodologies

• Kimball:• Logical data warehouse (BUS), made up of subject areas (data

marts)• Business driven, users have active participation• Decentralized data marts (not required to be a separate physical

data store)• Independent dimensional data marts optimized for

reporting/analytics• Integrated via Conformed Dimensions (provides consistency

across data sources)• 2-tier (data mart, cube), less ETL, no data duplication

• Inmon:• Enterprise data model (CIF) that is a enterprise data warehouse

(EDW)• IT Driven, users have passive participation• Centralized atomic normalized tables (off limit to end users)• Later create dependent data marts that are separate physical

subsets of data and can be used for multiple purposes• Integration via enterprise data model• 3-tier (data warehouse, data mart, cube), duplication of data

Page 20: Building an Effective Data Warehouse Architecture

20

Kimball Model

Why staging: Limit source contention (ELT), Recoverability, Backup, Auditing

Data Warehouse (star schema subject areas)

StagingArea 1

StagingArea 2

StagingArea 3

Multi-dimensional

Reporting Layer

OLTP Data Sources

SSIS

SSIS

SSIS

SSIS

SSIS

Dimensionalized View

Staging

Multi-dimensional

Cube Process

DW Bus Architecture

Data Mart 1

Data Mart 2

Atomic Data

Page 21: Building an Effective Data Warehouse Architecture

21

Inmon Model

StagingArea 1

StagingArea 2

StagingArea 3

Multi-dimensional

Reporting Layer

OLTP Data Sources

SSIS

SSIS

SSIS

SSIS

SSIS

SSIS

Staging

Tabular

DimensionalizedView

Cube Process

Data Warehouse(Normalized)

Data Mart 1(Normalized)SSIS

Corporate Information Factory (CIF)

Data Mart 2(Normalized)SSIS

Atomic Data

Page 22: Building an Effective Data Warehouse Architecture

22

Reasons to add a Enterprise Data Warehouse Single version of the truth May make building dimensions easier using lightly

denormalized tables in EDW instead of going directly from the OLTP source

Normalized EDW results in enterprise-wide consistency which makes it easier to spawn-off the data marts at the expense of duplicated data

Less daily ETL refresh and reconciliation if have many sources and many data marts in multiple databases

One place to control data (no duplication of effort and data) Reason not to: If have a few sources that need reporting

quickly

Page 23: Building an Effective Data Warehouse Architecture

23

Which model to use?

Models are not that different, having become similar over the years, and can compliment each other

Boils down to Inmon creates a normalized DW before creating a dimensional data mart and Kimball skips the normalized DW

With tweaks to each model, they look very similar (adding a normalized EDW to Kimball, dimensionally structured data marts to Inmon)

Bottom line: Understand both approaches and pick parts from both for your situation – no need to just choose just one approach

BUT, no solution will be effective unless you possess soft skills (leadership, communication, planning, and interpersonal relationships)

Page 24: Building an Effective Data Warehouse Architecture

24

Hybrid Model

Advice: Use SQL Server Views to interface between each level in the model

In the DW Bus Architecture each data mart could be a schema (broken out by business process subject areas), all in one database. Some companies will have one database and then build separate data mart tables from it

Data Warehouse (star schema subject areas)

StagingArea 1

StagingArea 2

StagingArea 3

Multi-dimensional

Reporting Layer

Mirror OLTP

(subset)

SSIS

SSIS

SSIS

SSIS

SSIS

SSIS

Staging

Data Mart 1

Tabular

Cube Process

Cube Process

DW Bus Architecture

Data Warehouse(Normalized)

SSIS

OLTP Data Sources SSIS

Data Mart 2

Atomic Data

Atomic Data

Corporate Information Factory (CIF)

EDW

Page 25: Building an Effective Data Warehouse Architecture

25

Kimball Methodology

From: Kimball’s The Microsoft Data Warehouse Toolkit

Kimball defines a development lifecycle, where Inmon is just about the data warehouse (not “how” used)

Page 26: Building an Effective Data Warehouse Architecture

26

Populating a Data Warehouse

Determine frequency of data pull (daily, weekly, etc) Full Extraction – All data (usually dimension tables) Incremental Extraction – Only data changed from last run

(fact tables) How to determine data that has changed

Timestamp - Last Updated Change Data Capture (CDC) Partitioning by date Triggers on tables MERGE SQL Statement Column DEFAULT value populated with date

Online Extraction – Data from source. First create copy of source:

Replication Database Snapshot Availability Groups

Offline Extraction – Data from flat file

Page 27: Building an Effective Data Warehouse Architecture

27

ETL vs ELT

• Extract, Transform, and Load (ETL)• Transform while hitting source system• No staging tables• Processing done by ETL tools (SSIS)

• Extract, Load, Transform (ELT)• Uses staging tables• Processing done by target database engine (SSIS: Execute T-SQL

Statement task instead of Data Flow Transform tasks)• Use for big volumes of data• Use when source and target databases are the same• Use with Parallel Data Warehouse (PDW)

ELT is better since database engine is more efficient than SSIS• Best use of database engine: Transformations• Best use of SSIS: Data pipeline and workflow management

Page 28: Building an Effective Data Warehouse Architecture

28

Surrogate Keys

Surrogate Keys – Unique identifier not derived from source system• Embedded in fact tables as foreign keys to dimension

tables• Allows integrating data from multiple source systems• Protect from source key changes in the source system• Allows for slowly changing dimensions• Allows you to create rows in the dimension that don’t exist

in the source (-1 in fact table for unassigned)• Improves performance (joins) and database size by using

integer type instead of text• Implemented via identity column on dimension tables

Page 29: Building an Effective Data Warehouse Architecture

29

SSAS Cubes

Reasons to use instead of data warehouse: Aggregating (Summarizing) the data for performance Multidimensional analysis – slice, dice, drilldown Hierarchies Advanced time-calculations – i.e. 12-month rolling average Easily use Excel to view data via Pivot Tables Slowly Changing Dimensions (SCD)

Page 30: Building an Effective Data Warehouse Architecture

30

Data Warehouse Architecture

Page 31: Building an Effective Data Warehouse Architecture

31

SQL Server 2012 Tabular Model

New xVelocity in-memory database in SSAS Build model in Power Pivot or SQL Server Data Tools (SSDT) Uses existing relational model No star schema, no extra SSIS Uses DAX (Excel-like formula language to create custom

calculations) Faster and easier to use than multidimensional model In Hybrid Model, DW Bus Architecture goes away, but only if

you don’t need surrogate keys, SCD, conformed dimensions, and all reporting will be off of the cubes

Page 32: Building an Effective Data Warehouse Architecture

32

End-User Microsoft Tools

Many options on Microsoft tools to use to go against the DW/BI: Excel PivotTables (SSAS Cube and Relational) SQL Server Reporting Services (SSRS) (SSAS Cube and

Relational) Report Builder (SSAS Cube and Relational) PowerPivot (Relational) PerformancePoint Services (PPS) (SSAS Cube) Power View (SSAS Cube) Data Mining (SSAS Cube)

Page 33: Building an Effective Data Warehouse Architecture

33

Resources Data Warehouse Architecture – Kimball and Inmon methodologies:

http://bit.ly/SrzNHy SQL Server 2012: Multidimensional vs tabular: http://bit.ly/SrzX1x Data Warehouse vs Data Mart: http://bit.ly/SrAi4p Fast Track Data Warehouse Reference Guide for SQL Server 2012:

http://bit.ly/SrAwsj Complex reporting off a SSAS cube: http://bit.ly/SrAEYw Surrogate Keys: http://bit.ly/SrAIrp Normalizing Your Database: http://bit.ly/SrAHnc Difference between ETL and ELT: http://bit.ly/SrAKQa Microsoft’s Data Warehouse offerings: http://bit.ly/xAZy9h Microsoft SQL Server Reference Architecture and Appliances:

http://bit.ly/y7bXY5 Methods for populating a data warehouse: http://bit.ly/SrARuZ Great white paper: Microsoft EDW Architecture, Guidance and Deployment

Best Practices: http://bit.ly/SrAZug End-User Microsoft BI Tools – Clearing up the confusion: http://bit.ly/SrBMLT Microsoft Appliances: http://bit.ly/YQIXzM Why You Need a Data Warehouse: http://bit.ly/1fwEq0j Data Warehouse Maturity Model: http://bit.ly/xl4mGM

Page 34: Building an Effective Data Warehouse Architecture

October 15-18, 2013 | Charlotte, NC

Questions?James Serra, Business Intelligence ConsultantEmail me at: [email protected] me at: @JamesSerra Link to me at: www.linkedin.com/in/JamesSerra Visit my blog at: JamesSerra.com (where this slide deck will be posted)

34

Page 35: Building an Effective Data Warehouse Architecture

JOIN US for our second annual event to get the best learning for analyzing, managing, and sharing business information and insights through the Microsoft Data Platform of technologies.

Page 36: Building an Effective Data Warehouse Architecture

October 15-18, 2013 | Charlotte, NC

Thank youfor attending this session and the 2013 PASS Summit in Charlotte, NC

36