
Data Warehousing Notes


Dimensional Modeling Concept

Dimensional modeling is a logical design technique that seeks to present data in a standard, intuitive framework that allows for high-performance access. It is inherently dimensional, and it adheres to a discipline that uses the relational model with some important restrictions. Every dimensional model is composed of one table with a multi-part key, called the fact table, and a set of smaller tables called dimension tables. Each dimension table has a single-part primary key that corresponds exactly to one of the components of the multi-part key in the fact table. (See Figure) This characteristic 'star-like' structure is often called a star join.

A fact table, because it has a multi-part primary key made up of two or more foreign keys, always expresses a many-to-many relationship. The most useful fact tables also contain one or more numerical measures, or 'facts,' that occur for the combination of keys that define each record. In Figure, the facts are Units_Sold, Dollars_Sold, and Avg_sales. The most useful facts in a fact table are numeric and additive.

Additivity is crucial because data warehouse applications almost never retrieve a single fact table record; rather, they fetch back hundreds, thousands, or even millions of these records at a time, and the only useful thing to do with so many records is to add them up.

Dimension tables, by contrast, most often contain descriptive textual information, and the attributes (also called classification attributes), which are used for analysis. Dimension attributes are used as the source of most of the interesting constraints in data warehouse queries, and they are virtually always the source of the row headers in the SQL answer set.

Fact Table and Dimension Tables in a Dimensional Model Schema

Let's consider a data warehouse cube with four dimensions and two measures. For every combination of values of the four dimensions, the cube holds two coordinate values, one per measure. For example:

Co-ordinate [City (X), Product (Y), Channel (Z), Month] = [Sales (Quantity), Sales (Value)]

or

[NY, Standard Desk-top, Mail, September 2005] = [2000 units, $15000]

In the dimensional modeling schema, the fact table contains the coordinate values (the measures) at the lowest granularity of all the possible combinations of dimensions. The dimension tables contain the details of the dimensions, including their attributes and all the higher-level hierarchies. The link between the fact table and each associated dimension table is a dimension key, which is the primary key of the dimension table at its lowest level of granularity.
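As a minimal sketch of the cube-to-fact-table mapping described above, each fact row can be pictured as a tuple of dimension values pointing at the two measures. Only the first row comes from the example in the text; the other rows, the month encoding and the helper name `roll_up` are assumptions made for illustration.

```python
from collections import defaultdict

# Key: (city, product, channel, month) at the lowest grain.
# Value: (sales quantity, sales value).
fact_rows = {
    ("NY", "Standard Desk-top", "Mail", "2005-09"): (2000, 15000.0),  # from the text
    ("NY", "Standard Desk-top", "Web", "2005-09"): (500, 4000.0),     # invented
    ("LA", "Standard Desk-top", "Mail", "2005-09"): (300, 2400.0),    # invented
}

def roll_up(rows, keep):
    """Sum both measures over every dimension whose position is not in `keep`.

    Positions in the key tuple: 0=city, 1=product, 2=channel, 3=month.
    """
    totals = defaultdict(lambda: [0, 0.0])
    for key, (units, value) in rows.items():
        group = tuple(key[i] for i in keep)
        totals[group][0] += units
        totals[group][1] += value
    return {group: tuple(sums) for group, sums in totals.items()}

by_city = roll_up(fact_rows, keep=(0,))
print(by_city)
```

Because both measures are additive, the same function can aggregate to any higher level simply by changing which key positions are kept.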


Fact Table- The central linkage in Dimensional Modeling

A fact table contains the values of all the measures for the set of dimensions linked to it. It holds the measure values for each combination of the dimensions at their lowest level of granularity. The measures are typically numeric, so they can undergo mathematical aggregation and analysis.

Families of Fact Tables

● Chains and circles.

● Heterogeneous products.

● Transactions and snapshots.

● Aggregates.

Dimension Table- What does and should it contain

The dimension table contains all the information on the dimension. This includes:

a. The primary key (the equivalent foreign key in the fact table).

b. All attributes of the dimension. These include:

● The hierarchy attributes - Consider a business hierarchy for the location dimension: pin-code to city to district to state to country. Each hierarchy element becomes an attribute.

● Textual as well as code attributes - The location code as well as the name of the location. Both are required because different users use them for different reasons. A power user could be looking for the location code (NY01), whereas an end user could be looking for a more explicit header (New York).

● All parallel hierarchies - A product could have different hierarchies, depending upon whether the CFO or the Head of Sales is looking at it. Including them all enables analysis on all hierarchies as well as across hierarchies.

● Surrogate primary key, in place of the production primary key, as the link to the fact table - Surrogate keys are used because the production keys could change or could be reused. For example, a bill number could be reused after 5 years, or a part number (especially in FMCG) could be reused after a few years.

● Production or source system key - This is required for auditability and for linking back to the extraction data and source systems.
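The reason for surrogate keys given in the list above can be sketched as follows. This is an illustrative assumption rather than a prescribed design: the part number `P-100`, the product names and the helper `add_product` are all hypothetical.

```python
import itertools

_next_key = itertools.count(1)   # warehouse-owned key generator
dim_product = {}                 # surrogate key -> dimension row
current_key = {}                 # production key -> surrogate key now in use

def add_product(production_key, name):
    """Create a dimension row under a fresh surrogate key."""
    surrogate = next(_next_key)
    dim_product[surrogate] = {"production_key": production_key, "name": name}
    current_key[production_key] = surrogate
    return surrogate

first = add_product("P-100", "Mint toothpaste")
# Years later the source system reuses the same part number.
second = add_product("P-100", "Herbal shampoo")

# Fact rows reference the surrogate key, so old sales keep pointing
# at the old product even though the production key was reused.
fact_sales = [
    {"product_key": first, "units_sold": 10},
    {"product_key": second, "units_sold": 4},
]
```

The production key is still stored on each dimension row, which is what keeps the link back to the source system for audit purposes.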

Dimensional Model Schemas - Star, Snow-Flake and Constellation

A dimensional model can be organized as a star schema or a snowflake schema.

Dimensional Model Star Schema using Star Query


The star schema is perhaps the simplest data warehouse schema. It is called a star schema because the entity-relationship diagram of this schema resembles a star, with points radiating from a central table. The center of the star consists of a large fact table and the points of the star are the dimension tables.

A star schema is characterized by one or more very large fact tables that contain the primary information in the data warehouse, and a number of much smaller dimension tables (or lookup tables), each of which contains information about the entries for a particular attribute in the fact table.

A star query is a join between a fact table and a number of dimension tables. Each dimension table is joined to the fact table using a primary key to foreign key join, but the dimension tables are not joined to each other. The cost-based optimizer recognizes star queries and generates efficient execution plans for them.

A typical fact table contains keys and measures. For example, in the sample schema, the fact table, sales, contains the measures quantity_sold, amount, and average, and the keys time_key, item_key, branch_key, and location_key. The dimension tables include time, branch, and item.

A star join is a primary key to foreign key join of the dimension tables to a fact table.
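A star query of the kind described above can be tried out with Python's built-in sqlite3 module. The sketch below is a reduced, hypothetical version of the sample schema: it keeps the sales fact table with quantity_sold and amount, uses only three dimensions, and the `_dim` table suffixes, column names and data are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim   (time_key   INTEGER PRIMARY KEY, month TEXT);
CREATE TABLE item_dim   (item_key   INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE branch_dim (branch_key INTEGER PRIMARY KEY, branch_name TEXT);
CREATE TABLE sales (
    time_key      INTEGER REFERENCES time_dim(time_key),
    item_key      INTEGER REFERENCES item_dim(item_key),
    branch_key    INTEGER REFERENCES branch_dim(branch_key),
    quantity_sold INTEGER,
    amount        REAL,
    PRIMARY KEY (time_key, item_key, branch_key)  -- multi-part key
);
INSERT INTO time_dim   VALUES (1, '2005-09'), (2, '2005-10');
INSERT INTO item_dim   VALUES (1, 'Desk'), (2, 'Chair');
INSERT INTO branch_dim VALUES (1, 'North'), (2, 'South');
INSERT INTO sales VALUES
    (1, 1, 1, 10, 1000.0),
    (1, 2, 1,  5,  250.0),
    (2, 1, 2,  3,  300.0),
    (2, 2, 2,  8,  400.0);
""")

# Each dimension joins to the fact table only; dimensions never join
# to each other. Constraints and row headers come from the dimensions.
star_query = """
SELECT t.month, i.item_name, SUM(s.quantity_sold), SUM(s.amount)
FROM sales AS s
JOIN time_dim   AS t ON t.time_key   = s.time_key
JOIN item_dim   AS i ON i.item_key   = s.item_key
JOIN branch_dim AS b ON b.branch_key = s.branch_key
WHERE b.branch_name = 'North'
GROUP BY t.month, i.item_name
ORDER BY i.item_name
"""
rows = conn.execute(star_query).fetchall()
print(rows)
```

SQLite is used here only because it ships with Python; the optimizer behavior mentioned in the text belongs to the database the original notes describe and is not reproduced by this sketch.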

The main advantages of star schemas are that they:

● Provide a direct and intuitive mapping between the business entities being analyzed by end users and the schema design.

● Provide highly optimized performance for typical star queries.

● Are widely supported by a large number of business intelligence tools, which may anticipate or even require that the data warehouse schema contains dimension tables.

Snow-Flake Schema in Dimensional Modeling


The snowflake schema is a more complex data warehouse model than a star schema, and is a type of star schema. It is called a snowflake schema because the diagram of the schema resembles a snowflake.

Snowflake schemas normalize dimensions to eliminate redundancy. That is, the dimension data has been grouped into multiple tables instead of one large table. For example, a location dimension table in a star schema might be normalized into a location table and city table in a snowflake schema. While this saves space, it increases the number of dimension tables and requires more foreign key joins. The result is more complex queries and reduced query performance. Figure above presents a graphical representation of a snowflake schema.
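The extra join that snowflaking introduces can be seen in a small sqlite3 sketch. It follows the location and city example from the paragraph above, but the table names (`location_star`, `location_snow`), columns and rows are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star form: city and state repeat on every location row.
CREATE TABLE location_star (
    location_key INTEGER PRIMARY KEY, street TEXT, city TEXT, state TEXT);

-- Snowflake form: city attributes move to their own table.
CREATE TABLE city (city_key INTEGER PRIMARY KEY, city TEXT, state TEXT);
CREATE TABLE location_snow (
    location_key INTEGER PRIMARY KEY, street TEXT,
    city_key INTEGER REFERENCES city(city_key));

CREATE TABLE sales (location_key INTEGER, amount REAL);

INSERT INTO location_star VALUES
    (1, '1 Main St', 'Albany', 'NY'), (2, '9 Oak Ave', 'Albany', 'NY');
INSERT INTO city VALUES (1, 'Albany', 'NY');
INSERT INTO location_snow VALUES (1, '1 Main St', 1), (2, '9 Oak Ave', 1);
INSERT INTO sales VALUES (1, 100.0), (2, 50.0);
""")

# Star: one join from the fact table to the dimension.
star_total = conn.execute("""
    SELECT l.city, SUM(s.amount)
    FROM sales AS s
    JOIN location_star AS l ON l.location_key = s.location_key
    GROUP BY l.city""").fetchall()

# Snowflake: the same answer needs a second join to reach the city name.
snow_total = conn.execute("""
    SELECT c.city, SUM(s.amount)
    FROM sales AS s
    JOIN location_snow AS l ON l.location_key = s.location_key
    JOIN city AS c ON c.city_key = l.city_key
    GROUP BY c.city""").fetchall()

print(star_total, snow_total)
```

Both queries return the same total; the snowflake version stores 'Albany' once but pays for it with one more join per normalized level.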

Fact Constellation Schema


This schema is used mainly for aggregate fact tables, or where we want to split a fact table for better comprehension. A fact table is split only when we want to focus on aggregation over a few facts and dimensions.

 

Dimensional Modeling vs. Relational Modeling

Dimensional modeling differs from normalized OLTP modeling in order to enable analysis through massive and unpredictable queries, something a normalized relational model is ill-equipped to handle.

How is a dimensional model different from an E-R diagram?

● An E-R diagram (used in an OLTP or transactional system) is a highly normalized model (even at a logical level), whereas a dimensional model aggregates most of the attributes and hierarchies of a dimension into a single entity.

● An E-R diagram is a complex maze of hundreds of entities linked with each other, whereas a dimensional model is a logically grouped set of star schemas.

● An E-R diagram is split as per the entities. A dimensional model is split as per the dimensions and facts.

● In an E-R diagram all attributes of an entity, textual as well as numeric, belong to the entity table, whereas a 'dimension' entity in a dimensional model has mostly textual attributes and the 'fact' entity has mostly numeric attributes.

Dimensional modeling is a better approach for a data warehouse than the standard data model. The dimensional model has a number of important data warehouse advantages that the E-R model lacks.

The first advantage of the dimensional model is that it has standard types of joins and a standard framework. All dimensions can be thought of as symmetrically equal entry points into the fact table. The logical design can be done independent of expected query patterns. The user interfaces are symmetrical, the query strategies are symmetrical, and the SQL generated against the dimensional model is symmetrical. In other words:

● You will never find attributes in fact tables or facts in dimension tables.

● If you see a non-fact field in the fact table, you can assume that it is a key to a dimension table.

The second advantage of the dimensional model is that it is smoothly extensible to accommodate unexpected new data elements and new design decisions. All existing tables (both fact and dimension) can be changed in place by simply adding new data rows to the table. Data should not have to be reloaded. Typically, no query tool or reporting tool needs to be reprogrammed to accommodate the change, and all old applications continue to run without yielding different results. You can make the following graceful changes to the design after the data warehouse is up and running:

● Adding new unanticipated facts (that is, new additive numeric fields in the fact table), as long as they are consistent with the fundamental grain of the existing fact table.

● Adding completely new dimensions, as long as there is a single value of that dimension defined for each existing fact record.

● Adding new, unanticipated dimensional attributes.

● Breaking existing dimension records down to a lower level of granularity from a certain point in time forward.

Third advantage of the dimensional model is that there is a body of standard approaches for handling common modeling situations in the business world. Each of these situations has a well-understood set of alternatives that can be specifically programmed in report writers, query tools, and other user interfaces. These modeling situations include:

● Slowly changing dimensions, where a 'constant' dimension such as Product or Customer actually evolves slowly and asynchronously. Dimensional modeling provides specific techniques for handling slowly changing dimensions, depending on the business environment.

● Heterogeneous products, where a business such as a bank needs to:

○ track a number of different lines of business together within a single common set of attributes and facts, but at the same time

○ describe and measure the individual lines of business in highly idiosyncratic ways using incompatible measures.

Data Warehousing - Dimensions & Measures and Related Concepts

Each data warehouse consists of dimensions and measures. Dimensions allow data analysis from various perspectives. For example, time dimension could show you the breakdown of sales by year, quarter, month, day and hour. Product dimension could help you see which products bring in the most revenue. Supplier dimension could help you choose those business partners who always deliver their goods on time. Customer dimension could help you pick the strategic set of consumers to whom you'd like to extend your very special offers.

Measures are numeric representations of a set of facts that have occurred. Examples of measures include dollars of sales, number of credit hours, store profit percentage, dollars of operating expenses, number of past-due accounts and so forth.

Additivity of Measures - Facts

Applying additivity and the correct aggregation methods is fundamental to the success of Business Intelligence. The most common mistakes modelers and designers make are in setting the right hierarchies and in establishing the right additivity and aggregation rules. You need to go through the chapter on business dimensional hierarchies before you go through this chapter. A measure is additive when you are able to apply the sum operator across all the dimensions. Other aggregations on measures use operators like Average, Maximum and Minimum.

Non-Additive Measures - Facts

A measure is non-additive when you cannot use a sum operator to generate the needed aggregation.

Semi-Additive Measures - Facts

A measure is semi-additive when it can be aggregated on certain dimensions, but not on all of them. Another way to describe semi-additivity is summarization with an index of inaccuracy.
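A bank account balance is a common example of a semi-additive measure, and the sketch below illustrates the idea under stated assumptions: the data is invented, and averaging the monthly totals is only one possible way to aggregate over time (a maximum, minimum or last value would be alternatives).

```python
# (account, customer, month) -> month-end balance; all values are invented.
balances = {
    ("A1", "alice", "2024-01"): 100.0,
    ("A2", "alice", "2024-01"): 50.0,
    ("A1", "alice", "2024-02"): 120.0,
    ("A2", "alice", "2024-02"): 70.0,
}

def customer_balance(customer, month):
    """Valid sum: add balances across accounts within one snapshot month."""
    return sum(
        value
        for (_, cust, mon), value in balances.items()
        if cust == customer and mon == month
    )

def average_balance(customer, months):
    """Over time, average the monthly totals instead of summing them."""
    return sum(customer_balance(customer, m) for m in months) / len(months)

jan_total = customer_balance("alice", "2024-01")
avg_total = average_balance("alice", ["2024-01", "2024-02"])
naive_sum = sum(balances.values())  # mixes months, so it has no business meaning
print(jan_total, avg_total, naive_sum)
```

Summing across the account dimension gives the customer's real position for a month, while the naive sum across months counts the same money twice.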


Additive measures are measures that can be added across all dimensions. For example dollars of sales can be added across all dimensions within a retail store warehouse.

Semi-additive measures are measures that can be added across some, but not all dimensions. For example, a bank account balance is simply a snapshot in time and cannot be summed over time. However, you could add multiple accounts of the same customer to get the total balance for that customer.

Non-additive measures are measures that cannot be added across any dimensions. For example, the inventory is simply a snapshot in time and cannot be summed over time. Nor can you combine inventory for various products.

Hierarchy defines parent-child relationships among various levels within a single dimension. For instance, in a time dimension, the year level is the parent of four quarters, each of which is the parent of three months, which are parents of 28 to 31 days, which are parents of 24 hours. Similarly, in a geography dimension a continent is a parent of countries, a country could be a parent of states, and a state could be a parent of cities.

Level is a column within a dimension table that could be used for aggregating data. For example, a product dimension could have levels of product type (beverage), product category (alcoholic beverage), product class (beer), and product name (Miller Lite, Bud Light, Corona, etc.).

Member is a value within a dimension level that can be used for aggregating and reporting data. For example, each product category such as beverage, non-consumable, food, clothing, etc. is a member. Each product class such as beer, wine, coke, bottled water would represent a member.

Data Mart is a subset of the data warehouse typically serving a functional area such as marketing or finance, or particular location of the business (for instance mid-Western division).

Data Warehousing - Fact and Dimension Tables

Data warehouses are built using dimensional data models which consist of fact and dimension tables. Dimension tables are used to describe dimensions; they contain dimension keys, values and attributes. For example, the time dimension would contain every hour, day, week, month, quarter and year that has occurred since you started your business operations. Product dimension could contain a name and description of products you sell, their unit price, color, weight and other attributes as applicable.

Dimension tables are typically small, ranging from a few to several thousand rows. Occasionally dimensions can grow fairly large, however. For example, a large credit card company could have a customer dimension with millions of rows. Dimension table structure is typically very lean; for example, a customer dimension could look like the following:

Customer_key
Customer_full_name
Customer_city
Customer_state
Customer_country

Although there might be other attributes that you store in the relational database, data warehouses might not need all of those attributes. For example, customer telephone numbers, email addresses and other contact information would not be necessary for the warehouse. Keep in mind that data warehouses are used to make strategic decisions by analyzing trends. They are not meant to be a tool for daily business operations. On the other hand, you might have some reports that do include data elements that aren't necessary for data analysis.


Most data warehouses will have one or multiple time dimensions. Since the warehouse will be used for finding and examining trends, data analysts will need to know when each fact has occurred. The most common time dimension is calendar time. However, your business might also need a fiscal time dimension in case your fiscal year does not start on January 1st as the calendar year. Most data warehouses will also contain product or service dimensions since each business typically operates by offering either products or services to others. Geographically dispersed businesses are likely to have a location dimension.

Fact tables contain keys to dimension tables as well as measurable facts that data analysts would want to examine. For example, a store selling automotive parts might have a fact table recording a sale of each item. The fact table of an educational entity could track credit hours awarded to students. A bakery could have a fact table that records manufacturing of various baked goods.

Fact tables can grow very large, with millions or even billions of rows. It is important to identify the lowest level of facts that makes sense to analyze for your business; this is often referred to as the fact table "grain". For instance, for a healthcare billing company it might be sufficient to track revenues by month; daily and hourly data might not exist or might not be relevant. On the other hand, assembly line warehouse analysts might be very concerned with the number of defective goods that were manufactured each hour. Similarly, a marketing data warehouse might be concerned with the activity of a consumer group with a specific income level rather than purchases made by each individual.

Data Warehousing - Star and Snowflake Schemas

The foundation of each data warehouse is a relational database built using a dimensional model. A dimensional model consists of dimension and fact tables and is typically described as star or snowflake schema.

Star schema resembles a star; one or more fact tables are surrounded by the dimension tables. Dimension tables aren't normalized - that means even if you have repeating fields such as name or category no extra table is added to remove the redundancy. For example, in a car dealership scenario you might have a product dimension that might look like this:

Product_key
Product_category
Product_subcategory
Product_brand
Product_make
Product_model
Product_year

In a relational system such design would be clearly unacceptable because product category (car, van, truck) can be repeated for multiple vehicles and so could product brand (Toyota, Ford, Nissan), product make (Camry, Corolla, Maxima) and model (LE, XLE, SE and so forth). So a vehicle table in a relational system is likely to have foreign keys relating to vehicle category, vehicle brand, vehicle make and vehicle model. However in the dimensional star schema model you simply list out the names of each vehicle attribute.

Star schema also contains the entire dimension hierarchy within a single table. Dimension hierarchy provides a way of aggregating data from the lowest to highest levels within a dimension. For example, Camry LE and Camry XLE sales roll up to Camry make, Toyota brand and cars category. Here is what a star schema diagram could look like:

[Figure: star schema diagram]
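Because a star-schema dimension row carries its whole hierarchy, rolling sales up to any level is a single grouping step. The following is a minimal sketch of the Camry roll-up described above; the sales quantities and the third product row are invented, and the shortened attribute names are assumptions.

```python
from collections import defaultdict

# Denormalized product dimension: every hierarchy level sits on the row.
dim_product = {
    1: {"category": "car", "brand": "Toyota", "make": "Camry", "model": "LE"},
    2: {"category": "car", "brand": "Toyota", "make": "Camry", "model": "XLE"},
    3: {"category": "car", "brand": "Nissan", "make": "Maxima", "model": "SE"},
}

# Fact rows as (product_key, units_sold); quantities are invented.
fact_sales = [(1, 4), (2, 6), (3, 5)]

def roll_up(level):
    """Total units sold per member of the given hierarchy level."""
    totals = defaultdict(int)
    for product_key, units in fact_sales:
        member = dim_product[product_key][level]
        totals[member] += units
    return dict(totals)

print(roll_up("make"))
print(roll_up("brand"))
print(roll_up("category"))
```

No extra lookup table is needed to move from model to make to brand, which is the practical payoff of keeping the dimension denormalized.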

Notice that each dimension table has a primary key. The fact table has foreign keys to each dimension table. Although a data warehouse does not require creating primary and foreign keys, it is highly recommended to do so for two reasons:

1. Dimensional models that have primary and foreign keys provide superior performance, especially for processing Analysis Services cubes.  

2. Analysis Services requires creating either physical or logical relationships between fact and dimension tables. Physical relationships are implemented through primary and foreign keys. Therefore if the keys exist you save a step when building cubes.

Snowflake schema resembles a snowflake because dimension tables are further normalized or have parent tables. For example we could extend the product dimension in the dealership warehouse to have a product_category and product_subcategory tables. Product categories could include trucks, vans, sport utility vehicles, etc. Product subcategory tables could contain subcategories such as leisure vehicles, recreational vehicles, luxury vehicles, industrial trucks and so forth. Here is what the snowflake schema would look like with extended product dimension:

[Figure: snowflake schema diagram with the extended product dimension]

Snowflake schema generates more joins than a star schema during cube processing, which translates into longer queries. Therefore it is normally recommended to choose the star schema design over the snowflake schema for optimal performance. Snowflake schema does have an advantage of providing more flexibility, however. For example, if you were working for an auto parts store chain you might wish to report on car parts (car doors, hoods, engines) as well as subparts (door knobs, hood covers, timing belts and so forth). In such cases you could have both part and subpart dimensions; however, some attributes of subparts might not apply to parts and vice versa. For example, the thread size attribute would apply to a tire but not to the nuts and bolts that go on the tire. If you wish to aggregate your sales by part you will need to know which subparts should roll up to each part, as in the following:

Dim_subpart
subpart_key
subpart_name
subpart_SKU
subpart_size
subpart_weight
subpart_color
part_key

Dim_part
part_key
part_name
part_SKU

With such a design you could create reports that show you a breakdown of your sales by each type of engine, as well as each part that makes up the engine.

Data Warehousing - Extraction, Transformation and Loading


A data warehouse does not generate any data; instead it is populated from various transactional or operational data stores. The process of importing and manipulating transactional data into the warehouse is referred to as Extraction, Transformation and Loading (ETL). SQL Server supplies an excellent ETL tool known as Data Transformation Services (DTS) in version 2000 and SQL Server Integration Services (SSIS) in version 2005.

ETL resolves the inconsistencies in entity and attribute naming across multiple data sources. For example, the same entity could be called customers, clients, prospects or consumers in various data stores. Furthermore, an attribute such as address might be stored as three or more different columns (address line1, address line2, city, state, county, postal code, country and so forth). Each column can also be abbreviated or spelled out completely, depending on the data source. Similarly, there might be differences in data types, such as storing date and time as a string, number or date. During the ETL process data is imported from various sources and is given a common shape.
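Giving differently named and differently typed source data a common shape can be sketched as below. The alias table, the column names and the accepted date formats are all assumptions made for illustration; a real ETL job would derive them from the actual sources.

```python
from datetime import date, datetime

# Hypothetical mapping from source column names to warehouse names.
COLUMN_ALIASES = {
    "client_name": "customer_name",
    "consumer": "customer_name",
    "customer": "customer_name",
    "signup": "signup_date",
}

def to_date(value):
    """Accept a date, a datetime, an ISO string, or a YYYYMMDD integer."""
    if isinstance(value, datetime):   # check first: datetime is a date subclass
        return value.date()
    if isinstance(value, date):
        return value
    if isinstance(value, int):
        return datetime.strptime(str(value), "%Y%m%d").date()
    return datetime.strptime(value, "%Y-%m-%d").date()

def conform(record):
    """Rename columns and normalize the date type of one source record."""
    out = {}
    for name, value in record.items():
        target = COLUMN_ALIASES.get(name, name)
        out[target] = to_date(value) if target == "signup_date" else value
    return out

from_crm = conform({"client_name": "Acme", "signup": 20050901})
from_web = conform({"consumer": "Acme", "signup": "2005-09-01"})
print(from_crm == from_web)
```

After conforming, records from the two hypothetical sources are indistinguishable, which is the precondition for loading them into one dimension.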

In addition to the changes that you can manage in ETL relatively easily, there are some data inconsistencies that you might have to fix manually. For example, examine the following data values:

Dr. Jimmy Smith
James L. Smith, Jr.
Jim L Smith, M.D.
James Smith MD
Jim Smith, JR - M.D.

A human eye can easily suspect that all of these values could represent the same person. However, unless you work with James Smith or his accounts you cannot be certain. Should you show each of these values as a separate person on your reports? Writing a program that can fix such data inconsistencies could be a challenge, whereas a data entry clerk who created these values might be able to change them to a single, correct value with minimal effort.

Data inconsistencies are commonplace in operational data sources that allow free-form data entry. A data warehouse cannot fix problems with poorly designed operational systems, but it is likely to make such issues known to data analysts and business managers. Even if you design smart ETL logic to correct the existing issues, predicting all future variations of "Doctor Jim L Smith Junior" is a daunting task. Instead you should attempt to fix the data entry applications to limit human error.

In addition to importing data from various sources, ETL is also responsible for transforming data into a dimensional model. Depending on your data sources the import process can be relatively simple or very complicated. For example, some organizations keep all of their data in a single relational engine, such as SQL Server. Others could have numerous systems that might not be easily accessible. In some cases you might have to rely on scanned documents or scrape report screens to get the data for your warehouse. In such situations you should bring all data into a common staging area first and then transform it into a dimensional model.

The need for a staging database isn't limited to those warehouses that have inaccessible data sources. A staging area also provides a good place for assuring that your ETL is working correctly before data is loaded into dimension and fact tables. So your ETL could be made up of multiple stages:

1. Import data from various data sources into the staging area. 

2. Cleanse data from inconsistencies (could be either automated or manual effort). 

3. Ensure that row counts of imported data in the staging area match the counts in the original data source. 

4. Load data from the staging area into the dimensional model.

Retrieved from "http://sqlserverpedia.com/wiki/Data_Warehousing_-_Extraction,_Transformation_and_Loading"
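The four ETL stages listed above can be sketched as one small function. This is an illustrative assumption, not how any particular ETL tool works: the function `run_etl`, its `cleanse` and `load` callbacks and the sample rows are all hypothetical.

```python
def run_etl(source_rows, cleanse, load):
    """Run the four stages in order and return the number of rows loaded."""
    source_rows = list(source_rows)

    # 1. Import data from the source into the staging area.
    staging = [dict(row) for row in source_rows]

    # 2. Cleanse inconsistencies (automated here; could be manual).
    staging = [cleanse(row) for row in staging]

    # 3. Ensure staging row counts match the original data source.
    if len(staging) != len(source_rows):
        raise ValueError(
            f"row count mismatch: staged {len(staging)}, source {len(source_rows)}"
        )

    # 4. Load from the staging area into the dimensional model.
    for row in staging:
        load(row)
    return len(staging)

warehouse = []  # stand-in for a dimension or fact table
loaded = run_etl(
    [{"name": " Jim Smith "}, {"name": "Ann Lee"}],
    cleanse=lambda row: {"name": row["name"].strip()},
    load=warehouse.append,
)
print(loaded, warehouse)
```

Keeping the count check between cleansing and loading means a faulty import is caught before it reaches the dimension and fact tables.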

Definitions: Fact table, Dimension table

Fact table

A fact table consists of the measurements, metrics or facts of a business process. It is often located at the centre of a star schema, surrounded by dimension tables.

Fact tables provide the (usually) additive values that act as independent variables by which dimensional attributes are analyzed. Fact tables are often defined by their grain. The grain of a fact table represents the most atomic level by which the facts may be defined.

● Additive - Measures that can be added across all dimensions.

● Non Additive - Measures that cannot be added across any dimension.

● Semi Additive - Measures that can be added across some dimensions but not others.

A fact table might contain either detail level facts or facts that have been aggregated (fact tables that contain aggregated facts are often instead called summary tables).

In the real world, it is possible to have a fact table that contains no measures or facts. These tables are called "Factless Fact tables".

Dimension table

A dimension table is one of the set of companion tables to a fact table. The fact table contains business facts or measures and foreign keys which refer to candidate keys (normally primary keys) in the dimension tables. The dimension tables contain attributes (or fields) used to constrain and group data when performing data warehousing queries.

Over time, the attributes of a given row in a dimension table may change. For example, the shipping address for a company may change. Kimball refers to this phenomenon as Slowly Changing Dimensions. Strategies for dealing with this kind of change are divided into three categories:

● Type One - Simply overwrite the old value(s).

● Type Two - Add a new row containing the new value(s), and distinguish between the rows using tuple-versioning techniques.

● Type Three - Add a new attribute to the existing row.
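The three strategies can be sketched on a single dimension row. The column names (`surrogate_key`, `natural_key`, `valid_from`, `valid_to`, `previous_address`) are assumptions, and the validity-date pair is only one possible way to implement the tuple versioning mentioned for Type Two.

```python
from datetime import date

def scd_type1(row, attr, new_value):
    """Type One: overwrite the old value; history is lost."""
    row[attr] = new_value
    return row

def scd_type2(rows, natural_key, attr, new_value, today, next_key):
    """Type Two: close the current row and append a new version."""
    current = next(
        r for r in rows
        if r["natural_key"] == natural_key and r["valid_to"] is None
    )
    current["valid_to"] = today
    new_row = dict(current, surrogate_key=next_key, valid_from=today, valid_to=None)
    new_row[attr] = new_value
    rows.append(new_row)
    return new_row

def scd_type3(row, attr, new_value):
    """Type Three: keep the prior value in an extra attribute."""
    row["previous_" + attr] = row[attr]
    row[attr] = new_value
    return row

company = {
    "surrogate_key": 1, "natural_key": "C-7", "address": "1 Old Rd",
    "valid_from": date(2020, 1, 1), "valid_to": None,
}

overwritten = scd_type1(dict(company), "address", "2 New Rd")
history = [dict(company)]
scd_type2(history, "C-7", "address", "2 New Rd", date(2024, 6, 1), next_key=2)
widened = scd_type3(dict(company), "address", "2 New Rd")
print(len(history), widened["previous_address"])
```

Type Two is the only variant that lets old facts keep pointing at the old address, because each version of the row gets its own surrogate key.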