Building Data Warehouse in SQL Server

25th SQLNIGHT

Antonios Chatzipavlis

SQLschool.gr Founder, Principal Consultant

SQL Server Evangelist, MVP on SQL Server

May 30, 2015

Antonios Chatzipavlis

Database Architect, SQL Server Evangelist

MCT, MCSE, MCITP, MCPD, MCSD, MCDBA, MCSA, MCTS, MCAD, MCP, OCA, ITIL-F

1982: I started working with computers.

1988: I started my professional career in the computer industry.

1996: I started working with SQL Server version 6.0.

1998: I earned my first Microsoft certification as Microsoft Certified Solution Developer (3rd in Greece) and started my career as a Microsoft Certified Trainer (MCT), with more than 20,000 hours of training to date.

2010: I became a Microsoft MVP on SQL Server for the first time. I created SQL School Greece (www.sqlschool.gr).

2012: I became MCT Regional Lead in the Microsoft Learning program.

2013: I was certified as MCSE: Data Platform and MCSE: Business Intelligence.


Follow us in social media

Twitter @antoniosch / @sqlschool

Facebook fb/sqlschoolgr

YouTube yt/user/achatzipavlis

LinkedIn SQL School Greece group

Pinterest pi/SQLschool/

[email protected]

Stay Involved!

• Sign up for a free membership today at sqlpass.org

• Linked In: http://www.sqlpass.org/linkedin

• Facebook: http://www.sqlpass.org/facebook

• Twitter: @SQLPASS

• PASS: http://www.sqlpass.org


Whatever your data passion – there’s a Virtual Chapter for you!

www.sqlpass.org/vc

Planning on attending PASS Summit 2015? Start saving today!

• The world’s largest gathering of SQL Server & BI professionals

• Take your SQL Server skills to the next level by learning from the world’s top SQL Server experts, in over 190 technical sessions

• Over 5000 registrations, representing 2000 companies, from 52 countries, ready to network & learn

Save $150 right now using discount code LC15CPJ8

$1795 until July 12th, 2015

Don’t miss your chance to vote in the 2015 PASS elections: update your myPASS profile by June 1!

In order to vote for the 2015 PASS Nomination Committee & the Board of Directors, you need to complete all mandatory fields in your myPASS profile by 11:59 PM PDT June 1, 2015.

• PASS members will be reminded to review & complete their profiles

• Members will receive instructions for updating profiles and deleting duplicate profiles

• Eligible voters will receive information about key election dates and the voting process after June 1

Head to sqlpass.org/myPASS today!

For more info on elections, visit sqlpass.org/elections

• Overview of Data Warehousing

• Data Warehouse Solution

• Data Warehouse Infrastructure

• Data Warehouse Hardware

• Data Warehouse Design Overview

• Designing Dimension Tables

• Designing Fact Tables

• Data Warehouse Physical Design

Agenda

Overview of Data Warehousing

• There are many definitions for the term “data warehouse,” and disagreements over specific implementation details.

• It is generally agreed that a data warehouse is a centralized store of business data that can be used for reporting and analysis to inform key decisions.

• A data warehouse provides a solution to the problem of distributed data that prevents effective business decision-making.

What is a Data Warehouse?

The single organizational repository of enterprise-wide data across many or all lines of business and subject areas.

Contains massive and integrated data.

Represents the complete organizational view of information needed to run and understand the business.

Definition of Data Warehouse

• Contains a large volume of data that relates to historical business transactions.

• Is optimized for read operations that support querying the data.

• Is loaded with new or updated data at regular intervals.

• Provides the basis for enterprise BI applications.

Data Warehouse characteristics

• Finding the information required for business decisions

  • This is time-consuming and error-prone.

• Key business data is distributed across multiple systems.

  • This makes it hard to collate all the information necessary for a particular business decision.

• Fundamental business questions are hard to answer.

  • Most business decisions require a knowledge of fundamental facts.

  • The distribution of data throughout multiple systems in a typical organization can make these questions difficult, or even impossible, to answer.

What makes a Data Warehouse useful?

• The specific, subject-oriented, or departmental view of information from the organization.

• Generally these are built to satisfy user requirements for information

What is a Data Mart?

Data Warehouse Vs Data Mart

         | Data Warehouse | Data Mart
Scope    | Application independent; centralized or enterprise; planned | Specific application; decentralized by group; organic but may be planned
Data     | Historical, detailed, summary; some denormalization | Some history, detailed, summary; high denormalization
Subjects | Multiple subjects | Single central subject area
Sources  | Many internal and external sources | Few internal and external sources
Other    | Flexible; data oriented; long life; single complex structure | Restrictive; project oriented; short life; multiple simple structures

• Centralized Data Warehouse

• Departmental Data Mart

• Hub and Spoke

Data Warehouse Architectures





Components of a Data Warehousing Solution

[Diagram: Data Sources, ETL, Data Cleansing, Master Data Management, Data Warehouse, Data Models, Reporting and Analysis]

1. Start by identifying the business questions that the data warehousing solution must answer

2. Determine the data that is required to answer these questions

3. Identify data sources for the required data

4. Assess the value of each question to key business objectives versus the feasibility of answering it from the available data

For large enterprise-level projects, an incremental approach can be effective:

• Break the project down into multiple sub-projects

• Each sub-project deals with a particular subject area in the data warehouse

Starting a Data Warehouse Project

Core Data Warehousing

• SQL Server Database Engine

• SQL Server Integration Services

• SQL Server Master Data Services

• SQL Server Data Quality Services

Enterprise BI

• SQL Server Analysis Services

• SQL Server Reporting Services

• Microsoft SharePoint Server

• Microsoft Office

Self-Service BI

• Excel Add-ins (PowerPivot, Power Query, Power View, Power Map)

• Microsoft Office 365 Power BI

Big Data Analysis

• Windows Azure HDInsight

SQL Server As a Data Warehousing Platform

Data Warehouse Solution

A data warehouse is a relational database that is optimized for reading data for analysis and reporting.

Keep in mind

• Logical:

  • Is typically designed to denormalize data into a structure that minimizes the number of join operations required in the queries used to retrieve and aggregate data.

  • A common approach is to design a star schema

• Physical:

  • Affects the performance and manageability of the data warehouse

Logical and Physical Database schema

• Query processing requirements, including anticipated peak memory and CPU utilization.

• Storage volume and disk input/output requirements.

• Network connectivity and bandwidth.

• Component redundancy for high availability.

Hardware selection

• Failover time requirements.

• Configuration and management complexity.

• The volume of data in the data warehouse.

• The frequency of changes to data in the data warehouse.

• The effect of the backup process on data warehouse performance.

• The time to recover the database in the event of a failure.

High availability and Disaster Recovery

• The authentication mechanisms that you must support to provide access to the data warehouse.

• The permissions that the various users who access the data warehouse will require.

• The connections over which data is accessed.

• The physical security of the database and backup media.

Security

• Data Source Connection Types

• Credentials and Permissions

• Data Formats

• Data Acquisition Windows

Data sources

• Staging:

  • What data must be staged?

  • Staging data format

• Required transformations:

  • Transformations during extraction versus data flow transformations

• Incremental ETL:

  • Identifying data changes for extraction

  • Inserting or updating when loading

ETL Processes

• Data quality:

• Cleansing data:

• Validating data values

• Ensuring data consistency

• Identifying missing values

• Deduplicating data

• Master data management:

• Ensuring consistent business entity definitions across multiple systems

• Applying business rules to ensure data validity

Data Quality and Master Data Management

Data Warehouse Infrastructure

Data volume

• The amount of data that the data warehouse must store

• The size and frequency of incremental loads of new data.

• The primary consideration is the number of rows in fact tables

• But don’t forget dimension data, indexes, and data models that are stored on disk.

System Sizing Factors

Analysis and Reporting Complexity

• This includes the number, complexity, and predictability of the queries that will be used to analyze the data or produce reports.

• Typically, BI solutions must support a mix of the following query types:

  • Simple. Relatively straightforward SELECT statements.

  • Medium. Repeatedly executed queries that include aggregations or many joins.

  • Complex. Unpredictable queries with complex aggregations, joins, and calculations.

Number of Users

• This is the total number of information workers who will access the system, and how many of them will do so concurrently.

Availability Requirements

• These include when the system will need to be used, and what planned or unplanned downtime the business can tolerate.


Typical System Categorization

                                  | Small | Medium | Large
Data Volume                       | 100s of GBs to 1 TB | 1 to 10 TB | 10 TB to 100s of TBs
Analysis and Reporting Complexity | Over 50% simple, 30% medium, less than 10% complex | 50% simple, 30-35% medium, 10-15% complex | 30-35% simple, 40% medium, 20-25% complex
Number of Users                   | 100 total, 10 to 20 concurrent | 1,000 total, 100 to 200 concurrent | 1,000s of concurrent users
Availability Requirements         | Business hours | 1 hour of downtime per night | 24/7 operations

Data Warehouse Workloads

[Diagram label: DW]

ETL

• Control flow tasks

• Data query and insert

• Network data transfer

• In-memory data pipeline

• SSIS Catalog or MSDB I/O

Reporting

• Client requests

• Data source queries

• Report rendering

• Caching

• Snapshot execution

• Subscription processing

• Report Server Catalog I/O

Operations and Maintenance

• OS activity

• Logging

• SQL Server Agent Jobs

• SSIS packages

• Indexes

• Backups

Cubes

• Processing

• Aggregation storage

  • Multidimensional on disk

  • Tabular in memory

• Query execution

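Typical Server Topologies for a BI Solution

[Diagram: a spectrum from Single Server Architecture to Distributed Architecture for the DW. Servers range from few to many; the properties compared along the spectrum are hardware costs, software license costs, configuration complexity, scalability & performance, and flexibility.]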

Scaling-out a BI Solution

Data Warehouse

• Partitioning the data across multiple database servers

• SQL Server Parallel Data Warehouse edition

Reporting Services

• Install the Reporting Services database on a single database server,

• Then install the Reporting Services report server service on multiple servers that all connect to the same Reporting Services database.

Analysis Services

• Create a read-only copy of a multidimensional database and connect to it from multiple Analysis Services query servers.

Integration Services

• Use multiple SSIS servers, each performing a subset of the ETL processes in parallel

Planning for High Availability

• AlwaysOn Failover Cluster
• RAID Storage

• AlwaysOn Failover Cluster
• AlwaysOn Availability Group

• NLB Report Servers
• AlwaysOn Availability Group

• AlwaysOn Failover Cluster

[Diagram labels: Data Warehouse, Analysis Services, Integration Services, Reporting Services]

Data Warehouse Hardware

• A DW usually has longer-running queries

• A DW has higher read activity than write activity

• The data in DW is usually more static

• In a DW it is much more important to be able to process a large amount of data quickly than it is to support a high number of I/O operations per second

Keep in mind

• Determine initial data volume

  • Number of fact table rows x row size

  • Use 100 bytes per row as an estimate if unknown

  • Add 30-40% for dimensions and indexes

• Project data growth

  • Number of new fact rows per month

• Factor in compression

  • Typically 3:1

Determining Storage Requirements

Other storage requirements

• Configuration databases

• Log files

• TempDB

• Staging tables

• Backups

• Analysis Services models
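The storage rules of thumb above can be turned into a small calculation. The following is a minimal Python sketch, not a sizing tool: the function name `estimate_storage_gb`, the 35% overhead (the midpoint of the 30-40% range), and the row counts are illustrative assumptions. It covers only the fact, dimension, and index data; the other storage requirements listed above (logs, TempDB, staging, backups, Analysis Services models) would need to be added separately.

```python
def estimate_storage_gb(fact_rows, row_bytes=100, overhead=0.35,
                        new_rows_per_month=0, months=0,
                        compression_ratio=3.0):
    """Rough data warehouse size in GB (1 GB taken as 10**9 bytes)."""
    # Initial volume plus projected growth, in fact rows.
    total_rows = fact_rows + new_rows_per_month * months
    raw_bytes = total_rows * row_bytes
    # Add the allowance for dimensions and indexes.
    with_overhead = raw_bytes * (1 + overhead)
    # Factor in compression (3:1 is the typical figure quoted above).
    return with_overhead / compression_ratio / 10**9


# Hypothetical workload: 2 billion fact rows today,
# growing by 50 million rows per month for three years.
initial_gb = estimate_storage_gb(2_000_000_000)
three_year_gb = estimate_storage_gb(2_000_000_000,
                                    new_rows_per_month=50_000_000,
                                    months=36)
print(round(initial_gb), round(three_year_gb))
```

With these assumed inputs the estimate grows from roughly 90 GB to roughly 171 GB over three years, which shows how strongly the monthly growth term can dominate the initial load.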

• Use more, smaller disks instead of fewer, larger disks

• Use the fastest disks you can afford

  • Consider solid state disks, especially for random I/O

• Use RAID 10, or minimally RAID 5

• Consider a dedicated storage area network for manageability and extensibility

  • Balance I/O across enclosures, storage processors, and disk groups

Considerations for Storage Hardware

Server size Minimum memory Maximum memory

1 socket 64 GB 128 GB

2 sockets 128 GB 256 GB



Server Memory

• Determine core MCR

• Apply formula to estimate required number of cores:

Estimating CPU Requirements

$$\text{CPUs} = \frac{\left(\dfrac{\text{Average query size in MB}}{\text{MCR}}\right) \times \text{Concurrent users}}{\text{Target response time}}$$
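As a worked example of the core-count formula, here is a short Python sketch. The input numbers (an 18,000 MB average query, an MCR of 200 MB/s per core, 10 concurrent users, a 60-second target) are purely illustrative assumptions, and rounding up to a whole core is a choice made here rather than something the formula itself states.

```python
import math


def estimate_cores(avg_query_mb, mcr_mb_per_s, concurrent_users,
                   target_seconds):
    """Apply: (avg query size / MCR) x concurrent users / target time."""
    # Seconds one core needs to scan an average query's data.
    seconds_per_query = avg_query_mb / mcr_mb_per_s
    raw_cores = seconds_per_query * concurrent_users / target_seconds
    # Cores come in whole units, so round up (an assumption of this sketch).
    return math.ceil(raw_cores)


cores = estimate_cores(avg_query_mb=18_000, mcr_mb_per_s=200,
                       concurrent_users=10, target_seconds=60)
print(cores)
```

Here one core needs 90 seconds per query, so ten concurrent users with a 60-second target work out to 15 cores.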

• This metric measures the maximum SQL Server data processing rate for a standard query and data set for a specific server and CPU combination.

• This is provided as a per-core rate, and it is measured as a query-based scan from memory cache.

• MCR is the initial starting point for Fast Track system design.

• It represents an estimated maximum required I/O bandwidth for the server, CPU, and workload.

• MCR is useful as an initial design guide because it requires only minimal local storage and database schema to estimate potential throughput for a given CPU.

• It is not a measure of system performance.

Maximum Consumption Rate (MCR)

• Create a reference dataset based on the TPC-H line item table or similar data set.

  • The table should be of a size that it can be entirely cached in the SQL Server buffer pool yet still maintain a minimum one-second execution time for the query provided here.

• For FTDW the following query is used:

  SELECT sum([integer field])
  FROM [table]
  WHERE [restrict to appropriate data volume]
  GROUP BY [col]

• Ensure that Resource Governor settings are at default values.

• Ensure that the query is executing from the buffer cache.

  • Executing the query once should put the pages into the buffer, and subsequent executions should read fully from buffer. Validate that there are no physical reads in the query statistics output.

Calculate MCR

• Set STATISTICS IO and STATISTICS TIME to ON to output results.

• Run the query multiple times, at MAXDOP = 1.

• Record the number of logical reads and CPU time from the statistics output for each query execution.

• Calculate the MCR in MB/s using the formula:

( [Logical reads] / [CPU time in seconds] ) * 8KB / 1024

• A consistent range of values (+/- 5%) should appear over a minimum of five query executions.

  • Significant outliers (+/- 20% or more) may indicate configuration issues. The average of at least 5 calculated results is the FTDW MCR.

Calculate MCR
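The MCR arithmetic can be sketched in a few lines of Python. The five (logical reads, CPU seconds) pairs below are invented sample values, not measurements, and the helper name `mcr_mb_per_s` is hypothetical. Note that STATISTICS TIME reports CPU time in milliseconds, so real output would need dividing by 1000 first.

```python
from statistics import mean


def mcr_mb_per_s(logical_reads, cpu_seconds):
    """(logical reads / CPU seconds) * 8 KB per page / 1024 KB per MB."""
    return logical_reads / cpu_seconds * 8 / 1024


# Invented results from five executions at MAXDOP = 1:
# (logical reads, CPU time in seconds).
runs = [(320_000, 12.5), (320_000, 12.4), (320_000, 12.6),
        (320_000, 12.5), (320_000, 12.3)]

rates = [mcr_mb_per_s(reads, secs) for reads, secs in runs]
ftdw_mcr = mean(rates)  # average of at least five calculated results

# Check that every run is within +/- 5% of the average.
consistent = all(abs(rate - ftdw_mcr) / ftdw_mcr <= 0.05 for rate in rates)
print(round(ftdw_mcr, 1), consistent)
```

If `consistent` came out false, or any run deviated by 20% or more, the guidance above suggests looking for configuration issues before trusting the average.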

• Special SQL Server Edition only available in hardware appliances

• Massively parallel processing

• Shared-nothing architecture

• Dedicated control nodes, compute nodes, and storage nodes

SQL Server Parallel Data Warehouse

[Diagram: PDW appliance components: Control Node Cluster, Management Servers, Landing Zone (ETL Interface), Backup Nodes, Database servers (compute nodes), and Storage Arrays, connected by InfiniBand and Dual Fiber Channel.]

Data Warehouse Design Overview

The Dimensional Model

[Diagram: a central Fact table (Measures) surrounded by Dimension tables (Attributes), shown as a star schema and as a snowflake schema.]

• Identify the grain

• Select the required dimensions

• Identify the facts

Dimensional Modeling

• The grain of a dimensional model is the lowest level of detail at which you can aggregate the measures.

• It is important to choose the level of grain that will support the most granular of reporting and analytical requirements

• Typically the lowest level possible from the source data is the best option.

Identify the grain

• Determine which of the dimensions related to the business process should be included in the model

• The selection of dimensions depends on the reporting and analytical requirements, specifically on the business entities by which users need to aggregate the measures

• Almost all dimensional models include a time-based dimension

Select the required dimensions

• Identify the facts that you want to include as measures.

• Measures are numeric values that can be expressed at the level of the grain chosen earlier and aggregated across the selected dimensions.

• Depending on the grain you choose for the dimensional model and the grain of the source data, you might need to allocate measures from a higher level of grain across multiple fact rows.

Identify the facts
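Allocating a measure from a higher grain can be illustrated with a small Python sketch. It assumes a hypothetical order-level shipping cost of 10.00 that must be spread across two order-line fact rows in proportion to their sales amounts. The proportional rule, the `allocate` helper, and the shipping figure are assumptions chosen for illustration; the right allocation rule is a business decision.

```python
from decimal import Decimal, ROUND_HALF_UP


def allocate(total, weights):
    """Split `total` across rows in proportion to `weights`, to the cent."""
    total = Decimal(total)
    weights = [Decimal(w) for w in weights]
    weight_sum = sum(weights)
    shares = []
    for weight in weights[:-1]:
        share = (total * weight / weight_sum).quantize(
            Decimal("0.01"), rounding=ROUND_HALF_UP)
        shares.append(share)
    # The last row takes the remainder so the shares add up exactly.
    shares.append(total - sum(shares))
    return shares


# Two line items of one order (sales amounts 350.99 and 6.98).
shares = allocate("10.00", ["350.99", "6.98"])
print(shares)
```

Giving the remainder to the last row keeps the allocated values summing back to the original order-level figure, which matters once the measure is aggregated again.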

Documenting Dimensional Models

[Diagram: Sales Order dimensional model]

Measures: Item Quantity, Unit Cost, Total Cost, Unit Price, Sales Amount, Shipping Cost

Dimensions and attributes:

• Time (Order Date and Ship Date): Calendar Year, Month, Date; Fiscal Year, Fiscal Quarter, Month, Date

• Salesperson: Region, Country, Territory, Manager, Name

• Customer: Name, Country, State or Province, City, Age, Marital Status, Gender

• Product: Category, Subcategory, Product Name, Color, Size

Designing Dimension Tables

Each row in a dimension table represents an instance of a business entity by which the measures in the fact table can be aggregated

Keep in mind

• A key column uniquely identifies each row in the dimension table.

• Usually the dimension data is obtained from a source system in which a key is already assigned. This is the “business key”.

• It is standard practice to define a new “surrogate key” that uses an integer value to identify each row.

• A surrogate key is recommended for the following reasons:

  • The data warehouse might use dimension data from multiple source systems, so it is possible that business keys are not unique.

  • Some source systems use non-numeric keys, such as a globally unique identifier (GUID), or natural keys, such as an email address, to uniquely identify data entities. Integer keys are smaller and more efficient to use in joins from fact tables.

  • If the dimension table supports “Type 2” slowly-changing dimensions, the same business entity can appear in several rows, so the business key alone no longer identifies a single row.

Dimension keys

Dimension keys

ProductKey | ProductAltKey | ProductName | Color | Size
1 | MB1-B-32 | MB1 Mountain Bike | Blue | 32
2 | MB1-R-32 | MB1 Mountain Bike | Red | 32

CustomerKey | CustomerAltKey | Name
1 | 1002 | Amy Alberts
2 | 1005 | Neil Black

Surrogate key: ProductKey, CustomerKey. Business (alternate) key: ProductAltKey, CustomerAltKey.
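To make the surrogate-key idea concrete, here is a minimal Python sketch of a key map that hands out integer surrogate keys per (source system, business key) pair. The class `SurrogateKeyMap` and the source-system names `erp` and `webshop` are hypothetical; in SQL Server the integer would more commonly come from an IDENTITY column or a sequence rather than application code.

```python
class SurrogateKeyMap:
    """Assigns a stable integer surrogate key to each business key."""

    def __init__(self):
        self._keys = {}
        self._next_key = 1

    def key_for(self, source_system, business_key):
        # The pair is the identity: business keys from different
        # source systems may collide, so the system is part of it.
        natural = (source_system, business_key)
        if natural not in self._keys:
            self._keys[natural] = self._next_key
            self._next_key += 1
        return self._keys[natural]


products = SurrogateKeyMap()
k_blue = products.key_for("erp", "MB1-B-32")
k_red = products.key_for("erp", "MB1-R-32")
# Same business key arriving from a second (hypothetical) system.
k_web = products.key_for("webshop", "MB1-B-32")
print(k_blue, k_red, k_web)
```

Looking up an already-seen pair returns the same key, so fact rows loaded later join to the same dimension row.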

• Hierarchies

  • Multiple attributes can be combined to form hierarchies that enable users to drill down into deeper levels of detail.

  • Business users can view aggregated fact data at each level

• Slicers

  • Attributes do not need to form hierarchies to be useful in analysis and reporting.

  • Business users can group or filter data based on single-level hierarchies to create analytical sub-groupings of data.

• Drill-through detail

  • Some attributes have little value as slicers or members of a hierarchy.

  • It can be useful to include entity-specific attributes to facilitate drill-through functionality in reports or analytical applications.

Dimension Attributes and Hierarchies

Dimension Attributes and Hierarchies

CustKey | CustAltKey | Name | Country | State | City | Phone | Gender
1 | 1002 | Amy Alberts | Canada | BC | Vancouver | 555 123 | F
2 | 1005 | Neil Black | USA | CA | Irvine | 555 321 | M
3 | 1006 | Ye Xu | USA | NY | New York | 555 222 | M

Hierarchy: Country, State, City. Slicer: Gender. Drill-through detail: Phone.

• Identify the semantic meaning of NULL

  • Unknown or None?

• Do not assume NULL equality

  • Use ISNULL( )

Unknown and None

Source

OrderNo | Discount | DiscountType
1000 | 1.20 | Bulk Discount
1001 | 0.00 | N/A
1002 | 2.00 | NULL
1003 | 0.50 | Promotion
1004 | 2.50 | Other
1005 | 0.00 | N/A
1006 | 1.50 | NULL

Dimension Table

DiscKey | DiscAltKey | DiscountType
-1 | Unknown | Unknown
0 | N/A | None
1 | Bulk Discount | Bulk Discount
2 | Promotion | Promotion
3 | Other | Other
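The mapping from source values to the Unknown and None dimension members can be sketched as a lookup. This Python sketch assumes a missing source value (Python `None`, standing in for SQL NULL) means Unknown and the literal `N/A` means None, following the tables above; the function name `discount_key` is hypothetical.

```python
UNKNOWN_KEY = -1   # value missing in the source: we do not know
NONE_KEY = 0       # source explicitly says no discount applies

DISCOUNT_KEYS = {"Bulk Discount": 1, "Promotion": 2, "Other": 3}


def discount_key(discount_type):
    """Return the dimension key for a source DiscountType value."""
    if discount_type is None:
        return UNKNOWN_KEY
    if discount_type == "N/A":
        return NONE_KEY
    # Unrecognized text is also treated as Unknown (an assumption).
    return DISCOUNT_KEYS.get(discount_type, UNKNOWN_KEY)


source = [(1000, "Bulk Discount"), (1001, "N/A"), (1002, None),
          (1003, "Promotion"), (1004, "Other"), (1005, "N/A"),
          (1006, None)]
keys = [discount_key(discount_type) for _, discount_type in source]
print(keys)
```

Because every fact row ends up with a real key, queries can join to the dimension without NULL-handling logic and still tell "no discount" apart from "discount type not recorded".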

• The simplest type of SCD to implement.

• Attribute values are updated directly in the existing dimension table row and no history is maintained.

• Suitable for attributes that are used to provide drill-through details

• Unsuitable for analytical slicers or hierarchy members where historic comparisons must reflect the attribute values as they were at the time of the fact event.

Slowly Changing Dimensions – Type 1

Before:

CustKey | CustAltKey | Name | Phone
1 | 1002 | Amy Alberts | 555 123

After:

CustKey | CustAltKey | Name | Phone
1 | 1002 | Amy Alberts | 555 222

• Type 2 changes involve the creation of a fresh version of the dimension entity in the form of a new row.

• Typically, a bit column in the dimension table is used as a flag to indicate which version of the dimension row is the current one.

• Additionally, datetime columns are often used to indicate the start and end of the period for which a version of the row was (or is) current.

  • Maintaining start and end dates makes it easier to assign the appropriate foreign key value to fact rows as they are loaded so they are related to the version of the dimension entity that was current at the time the fact occurred.

Slowly Changing Dimensions – Type 2

Before:

CustKey | CustAltKey | Name | City | Current | Start | End
1 | 1002 | Amy Alberts | Vancouver | Yes | 1/1/2000 |

After:

CustKey | CustAltKey | Name | City | Current | Start | End
1 | 1002 | Amy Alberts | Vancouver | No | 1/1/2000 | 1/1/2012
4 | 1002 | Amy Alberts | Toronto | Yes | 1/1/2012 |
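The Type 2 mechanics (expire the current row, add a new version, then pick the right version when loading facts) can be sketched in Python. This is an in-memory illustration under stated assumptions, not an ETL implementation: the helper names are hypothetical, the start dates for the second and third customers are invented, and the end date is treated as exclusive.

```python
from datetime import date


def apply_type2_change(rows, alt_key, new_city, change_date):
    """Expire the current version and append a new one; return it."""
    current = next((r for r in rows
                    if r["CustAltKey"] == alt_key and r["Current"]), None)
    if current is None or current["City"] == new_city:
        return None  # unknown customer or nothing changed
    current["Current"] = False
    current["End"] = change_date
    new_row = {"CustKey": max(r["CustKey"] for r in rows) + 1,
               "CustAltKey": alt_key, "Name": current["Name"],
               "City": new_city, "Current": True,
               "Start": change_date, "End": None}
    rows.append(new_row)
    return new_row


def key_at(rows, alt_key, on_date):
    """Surrogate key of the version that was current on `on_date`."""
    for r in rows:
        if (r["CustAltKey"] == alt_key and r["Start"] <= on_date
                and (r["End"] is None or on_date < r["End"])):
            return r["CustKey"]
    return -1  # fall back to an Unknown member


def member(key, alt_key, name, city):
    return {"CustKey": key, "CustAltKey": alt_key, "Name": name,
            "City": city, "Current": True,
            "Start": date(2000, 1, 1), "End": None}


dim = [member(1, 1002, "Amy Alberts", "Vancouver"),
       member(2, 1005, "Neil Black", "Irvine"),
       member(3, 1006, "Ye Xu", "New York")]
apply_type2_change(dim, 1002, "Toronto", date(2012, 1, 1))
print(key_at(dim, 1002, date(2011, 6, 1)), key_at(dim, 1002, date(2012, 6, 1)))
```

A fact dated mid-2011 resolves to the Vancouver version (key 1) and a fact dated mid-2012 to the Toronto version (key 4), which is the point of keeping start and end dates.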

• Rarely used

• The previous value (or a complete history of previous values) is maintained in the dimension table row.

• This requires modifying the dimension table schema to accommodate new values for each tracked attribute, and can result in a complex dimension table that is difficult to manage.

Slowly Changing Dimensions – Type 3

Before:

CustKey | CustAltKey | Name | Cars
1 | 1002 | Amy Alberts | 0

After:

CustKey | CustAltKey | Name | Prior Cars | Current Cars
1 | 1002 | Amy Alberts | 0 | 1

• Surrogate key

• Granularity

• Range

Time Dimension

DateKey DateAltKey MonthDay WeekDay Day MonthNo Month Year

00000000 01-01-1753 NULL NULL NULL NULL NULL NULL

20130101 01-01-2013 1 3 Tue 01 Jan 2013

20130102 01-02-2013 2 4 Wed 01 Jan 2013

20130103 01-03-2013 3 5 Thu 01 Jan 2013

20130104 01-04-2013 4 6 Fri 01 Jan 2013

• Attributes and hierarchies

• Multiple calendars

• Unknown values

• Create a Transact-SQL script

• Use Microsoft Excel

• Use a BI tool to autogenerate a time dimension table

Populating a Time Dimension Table
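As an illustration of the script-based option, the following Python sketch generates rows shaped like the Time Dimension sample (a `yyyymmdd` integer key, Sunday counted as weekday 1). It is an assumption-laden stand-in for a Transact-SQL script: the function name is hypothetical, only the Gregorian calendar attributes are produced, and the Unknown row (key 00000000) would be inserted separately.

```python
from datetime import date, timedelta

DAY_NAMES = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
MONTH_NAMES = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def time_dimension_rows(start, end):
    """Yield one row per calendar day from start to end, inclusive."""
    day = start
    while day <= end:
        yield {
            "DateKey": day.year * 10000 + day.month * 100 + day.day,
            "DateAltKey": f"{day.month:02d}-{day.day:02d}-{day.year}",
            "MonthDay": day.day,
            "WeekDay": day.isoweekday() % 7 + 1,  # Sunday = 1
            "Day": DAY_NAMES[day.weekday()],
            "MonthNo": f"{day.month:02d}",
            "Month": MONTH_NAMES[day.month - 1],
            "Year": day.year,
        }
        day += timedelta(days=1)


rows = list(time_dimension_rows(date(2013, 1, 1), date(2013, 1, 4)))
for row in rows:
    print(row)
```

Fiscal calendars or holiday flags would be extra columns computed in the same loop.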

• A common requirement in a data warehouse is to support dimensions with parent-child hierarchies

• Typically, parent-child hierarchies are implemented as self-referencing tables, in which a column in each row is used as a foreign-key reference to a primary-key value in the same table

Self-Referencing Dimension

EmployeeKey | EmployeeAltKey | EmployeeName | ManagerKey
1 | 1000 | Kim Abercrombie | NULL
2 | 1001 | Kamil Amireh | 1
3 | 1002 | Cesar Garcia | 1
4 | 1003 | Jeff Hay | 2
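Walking a self-referencing dimension can be sketched as following the manager key until it is empty. The Python below uses the sample rows above; the dictionary layout and the function name `management_chain` are illustrative assumptions (in a relational engine the same traversal is commonly written as a recursive query).

```python
# EmployeeKey -> (EmployeeName, ManagerKey); None marks the top of the tree.
employees = {
    1: ("Kim Abercrombie", None),
    2: ("Kamil Amireh", 1),
    3: ("Cesar Garcia", 1),
    4: ("Jeff Hay", 2),
}


def management_chain(employee_key):
    """Names from the employee up to the root of the hierarchy."""
    chain = []
    key = employee_key
    while key is not None:
        name, manager_key = employees[key]
        chain.append(name)
        key = manager_key  # follow the self-reference upward
    return chain


chain = management_chain(4)
level = len(chain) - 1  # 0 for the root, 1 for direct reports, and so on
print(chain, level)
```

The derived level can be stored or computed at query time so reports can aggregate measures at any depth of the hierarchy.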

• Combine low-cardinality attributes that don’t belong in existing dimensions into a junk dimension

• Avoids creating many small dimension tables

Junk Dimensions

JunkKey OutOfStockFlag FreeShippingFlag CreditOrDebit

1 1 1 Credit

2 1 1 Debit

3 1 0 Credit

4 1 0 Debit

5 0 1 Credit

6 0 1 Debit

7 0 0 Credit

8 0 0 Debit
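A junk dimension like the one above is simply the cross product of its attribute values, so it can be generated rather than typed. A minimal Python sketch follows; the `lookup` dictionary is an added convenience for resolving keys during a fact load, not something shown on the slide.

```python
from itertools import product

# Every combination of the three low-cardinality attributes.
combinations = product((1, 0), (1, 0), ("Credit", "Debit"))

junk_rows = [
    {"JunkKey": key, "OutOfStockFlag": out_of_stock,
     "FreeShippingFlag": free_shipping, "CreditOrDebit": credit_or_debit}
    for key, (out_of_stock, free_shipping, credit_or_debit)
    in enumerate(combinations, start=1)
]

# Resolve a fact row's flag combination to its JunkKey.
lookup = {(r["OutOfStockFlag"], r["FreeShippingFlag"], r["CreditOrDebit"]):
          r["JunkKey"] for r in junk_rows}

for row in junk_rows:
    print(row)
```

Three attributes with two values each give eight rows; adding a fourth two-valued flag would double the table to sixteen, which is still far smaller than four separate dimension tables and four foreign keys per fact row.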

Designing Fact Tables

Fact Table Columns

OrderDateKey ProductKey CustomerKey OrderNo Qty SalesAmount

20120101 25 120 1000 1 350.99

20120101 99 120 1000 2 6.98

20120101 25 178 1001 2 701.98

Dimension keys: OrderDateKey, ProductKey, CustomerKey. Degenerate dimension: OrderNo. Measures: Qty, SalesAmount.

Types of Measure

Additive measures

OrderDateKey ProductKey CustomerKey SalesAmount

20120101 25 120 350.99

20120101 99 120 6.98

20120102 25 178 701.98

Semi-additive measures

DateKey ProductKey StockCount

20120101 25 23

20120101 99 118

20120102 25 22

Non-additive measures

OrderDateKey ProductKey CustomerKey ProfitMargin

20120101 25 120 25

20120101 99 120 22

20120102 25 178 27
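The difference between additive and semi-additive measures shows up as soon as the sample rows are aggregated. In this Python sketch, SalesAmount is summed across everything, while StockCount is summed across products for one date but never across dates; taking the most recent value per product is one common treatment and is an assumption here.

```python
from decimal import Decimal

# (DateKey, ProductKey, value) rows taken from the sample tables.
sales = [(20120101, 25, Decimal("350.99")),
         (20120101, 99, Decimal("6.98")),
         (20120102, 25, Decimal("701.98"))]
stock = [(20120101, 25, 23), (20120101, 99, 118), (20120102, 25, 22)]

# Additive: summing across every dimension is meaningful.
total_sales = sum(amount for _, _, amount in sales)


def stock_on(date_key):
    """Semi-additive: summing across products for one date is fine."""
    return sum(count for d, _, count in stock if d == date_key)


def latest_stock(product_key):
    """Across dates, take the most recent count instead of a sum."""
    per_date = [(d, count) for d, p, count in stock if p == product_key]
    return max(per_date)[1]


print(total_sales, stock_on(20120101), latest_stock(25))
```

Summing product 25's stock over both dates would give 45, a number that never existed in the warehouse; the latest value of 22 is the meaningful answer. A non-additive measure such as ProfitMargin cannot be summed along any dimension and is usually recomputed from its additive components.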

Types of Fact Table

Transaction fact tables

OrderDateKey ProductKey CustomerKey OrderNo Qty Cost SalesAmount

20120101 25 120 1000 1 125.00 350.99

20120101 99 120 1000 2 2.50 6.98

20120101 25 178 1001 2 250.00 701.98

Periodic snapshot tables

DateKey ProductKey OpeningStock UnitsIn UnitsOut ClosingStock

20120101 25 25 1 3 23

20120101 99 120 0 2 118

Accumulating snapshot fact tables

OrderNo OrderDateKey ShipDateKey DeliveryDateKey

1000 20120101 20120102 20120105

1001 20120101 20120102 00000000

1002 20120102 00000000 00000000

Data Warehouse Physical Design

Understanding DW Components Activity

[Diagram: ETL, Data Models, Reports, and User Queries acting on the data warehouse]

• Large fact tables

• Star joins to dimension tables

ETL Loads

• Bulk inserts

• Some lookups and updates

Data Model Processing

• Mostly table/index scans

Report Processing

• Predictable queries

• Many rows with range-based query filters

Self-Service BI

• Potentially unpredictable queries

• Create files with an initial size

  • Based on the eventual size of the objects that will be stored on them

  • This pre-allocates sequential disk blocks and helps avoid fragmentation.

• Disable autogrowth

  • If you begin to run out of space in a data file, it is more efficient to explicitly increase the file size by a large amount rather than rely on incremental autogrowth.

Data files guidelines

• Create at least one filegroup in addition to the primary one, and then set it as the default filegroup so you can separate data tables from system tables.

• Create dedicated filegroups for extremely large fact tables and use them to place those fact tables on their own logical disks.

• If some tables in the data warehouse are loaded on a different schedule from others, consider using filegroups to separate the tables into groups that can be backed up independently.

• If you intend to partition a large fact table, create a filegroup for each partition so that older, stable rows can be backed up, and then set as read-only.

Filegroups guidelines

• Separate staging database

  • Create it on a logical disk distinct from the data warehouse files.

• Staging tables in the data warehouse database

  • Create a file and filegroup for them on a logical disk separate from the fact and dimension tables.

  • An exception to the previous guideline is made for staging tables that will be switched with partitions to perform fast loads.

    • These must be created on the same filegroup as the partition with which they will be switched.

Staging tables

• To avoid fragmentation of data files, place TempDB on a dedicated logical disk

• Set its initial size based on how much it is likely to be used.

• Set the growth increment to be quite large to ensure that performance is not interrupted by frequent growth of TempDB.

• Create multiple files for TempDB to help minimize contention during page free space (PFS) scans as temporary objects are created and dropped.

TempDB

• Set the recovery model of the data warehouse and staging databases to Simple (TempDB always uses the Simple recovery model)

  • This avoids having to back up transaction logs in order to truncate them

  • Additionally, most of the inserts in a data warehouse are typically performed as bulk load operations, which can be minimally logged.

• To avoid disk resource conflicts between data warehouse I/O and logging, place the transaction log files for all databases on a dedicated logical disk.

Transaction logs

• SQL Server Enterprise edition supports data compression at

both page and row level.

• Data compression benefits in a data warehouse

  • Reduced storage requirements.

  • Improved query performance

• Best practices for data compression in a data warehouse

  • Use page compression on all dimension tables and fact table partitions.

  • If performance is CPU-bound, revert to row compression on frequently-accessed partitions.

Data Compression

• Improved query performance

• More granular manageability

• Improved data load performance

• Best practices for partitioning in a DW

  • Partition large fact tables

  • Partition on an incrementing date key

  • Design the partition scheme for ETL and manageability.

  • Maintain an empty partition at the start and end of the table

Table Partitioning
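How an incrementing date key maps to partitions, and why the first and last partitions stay empty, can be sketched with a boundary list. This Python sketch mimics the behavior of a range partition function in which each boundary value belongs to the partition on its right (RANGE RIGHT in SQL Server); the yearly boundaries are illustrative assumptions.

```python
from bisect import bisect_right

# Boundary values on the date key; each boundary starts a new partition.
boundaries = [20120101, 20130101, 20140101, 20150101]


def partition_number(date_key):
    """1-based partition for a date key, boundary value going right."""
    return bisect_right(boundaries, date_key) + 1


# With data for 2012 through 2014, partition 1 (before 2012) and
# partition 5 (2015 onward) hold no rows.
print(partition_number(20111231),   # before the first boundary
      partition_number(20120101),   # first day of 2012
      partition_number(20131231),   # last day of 2013
      partition_number(20150101))   # first day of 2015
```

Keeping the end partitions empty means a new boundary can be added for the next period, or the oldest period merged away, without moving existing rows.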

• Indexes maximize query performance

• Planning indexes is one of the most important parts of the database design process

• Some inexperienced BI professionals are tempted to create many indexes on all tables to support queries, but every extra index slows data loads and consumes storage.

Indexes in DW

• Create a clustered index on the surrogate key column.

  • This column is used to join the dimension table to fact tables, and a clustered index will help the query optimizer minimize the number of reads required to filter fact rows.

• Create a non-clustered index on the alternate key column and include the SCD current flag, start date, and end date columns.

  • This index will improve the performance of lookup operations during ETL data loads that need to handle slowly-changing dimensions.

• Create non-clustered indexes on frequently searched attributes, and consider including all members of a hierarchy in a single index.

Dimension table indexes

• Create a clustered index on the most commonly-searched date key.

  • Date ranges are the most common filtering criteria in most data warehouse workloads, so a clustered index on this key should be particularly effective in improving overall query performance.

• Create non-clustered indexes on other, frequently-searched dimension keys.

• Create a columnstore index on all columns

Fact table indexes

• Create a view for each dimension and fact table with the NOLOCK query hint in the view definition

• Create views with user-friendly view and column names

• Do not include metadata columns in views

• Create views to combine snowflake dimension tables

• Partition-align indexed views

• Use the SCHEMABINDING option

• Security

Using Views in a DW

• Overview of Data Warehousing

• Data Warehouse Solution

• Data Warehouse Infrastructure

• Data Warehouse Hardware

• Data Warehouse Design Overview

• Designing Dimension Tables

• Designing Fact Tables

• Data Warehouse Physical Design

Summary

Thank you

SELECT KNOWLEDGE FROM SQL SERVER

http://www.sqlschool.gr

Copyright © 2015 SQL School Greece
