DWBI Essential Guide

Data Warehouse/Mart

Chapter 1 Data Warehouse definition- What is Data Warehouse?

A Data Warehouse is a repository of data picked from transaction systems, then filtered and transformed to make it available for data analysis and reporting. It can provide most or all of the data and information requirements of an enterprise, which means that it pulls data from all the production and other sources. Once the data is pulled into an offline staging area, it is cleansed, transformed and loaded in a sanitized, uniform and well-organized manner so that you can run queries, reports and all kinds of analysis on it.

What is not Data Warehouse?

A Data Warehouse typically does not provide online information, because data extraction and consolidation mostly happen in end-of-day batch processing. For online, transaction-based queries, the OLTP (Online Transaction Processing) system is used. A Data Warehouse is also not business intelligence or OLAP. It is a repository of sanitized and consolidated data, which can be used for any purpose, including Business Intelligence: transaction reports, data mining, data analysis, statistical forecasting, valuation systems and so on. Technically speaking, we can have a good data warehouse without good business intelligence (which is a combination of Data Analysis + Data Mining + Performance Management Reporting).

Data Warehouse vs. Operational Data Store (ODS) - They are different

A Data Warehouse is an 'offline' integration of data, whereas an Operational Data Store is an 'online' integration of data. An ODS is used when the data at a transaction level (processing as well as querying) is dispersed across various systems and needs to be brought together on an online basis. For example, let us say that you want a single view of the customer to be used by customer service, who can also update the data in that single view online. However, the data on the customer (OPD records, hospitalization records, diagnostic records, pharmaceutical purchase records ...) lies in different databases. An ODS could be a good choice.

The concept of ODS described above is the ideal one. Another option is for the Operational Data Store to serve online queries where the information it provides is not real-time but pertains to the last end of day. For example, you have a single customer view, but it does not include the transactions the customer has made today. For this kind of need, sometimes even the Data Warehouse repository can be used.

Data Warehouse vs. Data Mart in Business Intelligence

A Data Warehouse is an enterprise-level repository, which combines various data marts. It carries all the dimensions and measures required, and it ensures the integrity of the same dimensions and measures across different data marts. A Data Mart is a limited set of dimensions and measures used for a specific business theme. Data marts are populated out of the Data Warehouse data sets. Typically an organization's business intelligence agenda starts with a few data marts before maturing to a full-blown data warehouse. However, most of the design and development concepts apply equally to a Data Mart. Please refer to De-Normalized Data Warehouse/Data Mart for a detailed comparison between a Data Warehouse and Data Marts.

Chapter 2 Data Warehouse Components and Framework

The Data Warehouse framework starts with extracting data from source systems, then transforming and cleansing it before loading it into the repository. It ends with the data being accessed, analyzed, mined and dashboarded using end-user tools.

This page presents a high-level listing of the components linked to a Data Warehouse. Please refer to Business Intelligence Architecture for the complete big picture on the relevance and positioning of the Data Warehouse.

Source systems and Databases

Source Systems are all those 'transaction/production' raw data providers from which the details are pulled out to make them suitable for Data Warehousing. The sources can be quite diverse:

- Production databases like Oracle, Sybase and SQL.
- Excel sheets.
- Databases of small-time applications, such as those in MS Access.
- ASCII/data flat files.

Data Staging 'Area'

The data staging area is the place where all the 'grooming' is done on data after it is pulled from the source systems. The end point of grooming is for the data to be loaded into the 'Analysis or Presentation Server'. Data staging covers most of the 'back-bone' activities of a Data Warehouse, which are typically also the biggest analytical and technical challenge of a project. These activities are 'Extraction' and 'Transformation'.

ETL-Data Extraction

Data Extraction is the activity that pulls the data from various data sources. Most of these sources are production systems or are used for transaction-level work.

ETL-Data Transformation

If Data Extraction is mining the iron ore, Transformation is creating the steel billets. Transformation makes sure that the transaction-level raw data is converted into a form (while still being detailed) that can be loaded into the 'Presentation/Loaded' area.

ETL-Presentation/Loaded 'Area'

This is the repository where the data is finally loaded after going through all the work of Extraction and Transformation. It becomes the ultimate source of information for uses ranging from queries to advanced data modeling.
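The three activities above (Extraction, Transformation and loading into the presentation area) can be sketched in a few lines of Python. This is only a minimal illustration, not a real ETL tool: the source rows, the column names (cust_id, amount, source) and the cleansing rules are all hypothetical, and an in-memory SQLite database stands in for the presentation area.

```python
import sqlite3

# Hypothetical raw rows as they might arrive from two source systems.
SOURCE_ROWS = [
    {"cust_id": " c-001 ", "amount": "120.50", "source": "loans"},
    {"cust_id": "C001", "amount": "79.50", "source": "cards"},
    {"cust_id": "c-002", "amount": "n/a", "source": "cards"},  # unusable amount
]


def extract():
    """Extraction: pull raw rows from the sources (here, an in-memory list)."""
    return list(SOURCE_ROWS)


def transform(rows):
    """Transformation: cleanse IDs, convert types, drop rows that cannot be fixed."""
    clean = []
    for row in rows:
        cust_id = row["cust_id"].strip().upper().replace("-", "")
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # a real project would log the rejected row
        clean.append((cust_id, amount, row["source"]))
    return clean


def load(conn, rows):
    """Load: write the cleansed rows into the presentation area."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (cust_id TEXT, amount REAL, source TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract()))
totals = conn.execute(
    "SELECT cust_id, SUM(amount) FROM sales GROUP BY cust_id"
).fetchall()
print(totals)
```

After cleansing, the two differently formatted IDs for the same customer collapse into one, so the query in the presentation area sees a single consolidated customer.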

Dimensional Model

The presentation area has a data model that is different from that of the production system. This is called the Dimensional Model. It is the way data is organized in a data warehouse. This concept is dealt with in a fair degree of detail, as it is the engine of the Data Warehouse.

Meta Data

The Meta Data subject is covered in a separate section. Meta Data contains all the business and technical designs, rules, locations etc. of all the data, from Extraction to final data usage.

End User Tools and Applications

Data is cooked for consumption. There is a long list of applications to which the data can be put and of tools that can make it happen. This includes reporting, publishing, analysis, modeling and mining tools.

Data-Warehouse Administration and Tools

A Data Warehouse is a large platform with a large number of users, data sources and data targets. Just like production systems, it has to be administered in terms of performance, timelines and availability. This also includes activity logging, data security, backing-up and archiving.

Data-Marts

The entire Data Warehouse section is equally applicable to a Data Mart. A Data Mart is a data repository with a more restricted and short-term perspective. Please refer to De-Normalized Data Warehouse/Data Mart for similarities and differences between a Data Warehouse and a Data Mart.

OLAP Servers & Data Marts

While the Data Warehouse can be accessed by any end-user tool or application, it also feeds the downstream OLAP layer. For example, HR may want its own data mart on separate servers for confidentiality reasons. Similarly, people who are traveling may need their own offline data mart.

Chapter 3 Data Warehouse Challenges and Issues

A Data Warehouse initiative is more challenging than a transactional system. You had better read this chapter to understand what you are in for. In spite of the proven ROI of well-implemented projects, the proportion of Data Warehouse project failures is fairly high. Failures can take various forms:

- Functional- The project is not able to deliver the functionality and analysis capabilities.
- Technical- The technology platform and services don't work.
- Publishing- The availability of data is not as per expectations.
- Usage- Even if the project is well delivered, the capabilities and information are not used.
- Achievement of business goals- Even if the information is used, it is not able to drive the expected business goals.

The driver behind these failures is that the hype and glamour of Data Warehousing has overtaken the diligence (especially when Data Warehouse and Data Management initiatives and knowledge base are yet to become a matter of mass awareness and expertise). The diligence is required because Data Warehouse projects are quite different from business systems projects. While typical business systems projects may suffer from similar challenges, their intensity is much higher in Data Warehouse projects. Here are the challenges that are unique to Data Warehouse projects.

Data Warehouse vs. OLTP Transaction Systems

Project Domain: Business Benefits
Transaction/Production Business System: Tangible benefits in terms of functional capabilities, business processes that will be automated, headcount reduction etc. Typically, the process getting automated is being done manually, and there is enough visible pain at the ground level and for customers.
Data Warehouse System: There is a lesser proportion of initiatives where 'heaven will fall' if the project is not done. The benefits can be appreciated by fewer people, and by much fewer at the ground level.

Project Domain: Usage
Transaction/Production Business System: A business system, once implemented, drives its own usage, as it typically automates a business process.
Data Warehouse System: A Data Warehouse platform has a lesser compulsion for usage, unless there are critical operational reports required.

Project Domain: Measure of Usage
Transaction/Production Business System: One can specify the measure of usage for a business system in terms of processed units and number of users.
Data Warehouse System: While the number of users and the number of queries do represent the level of usage, they by no means suggest that the usage is resulting in delivery of the final outcome.

Project Domain: Skills and Expertise Requirements (Business)
Transaction/Production Business System: A business system requires expertise in business process knowledge.
Data Warehouse System: More knowledge is required horizontally and vertically. One needs much higher domain experience as well as cross-functional knowledge to fulfill the business role effectively in a Data Warehouse project. The domain expertise also includes all three levels (strategic, managerial and operational).

Project Domain: Business Requirements
Transaction/Production Business System: Defining and prioritizing the business requirements is easier, as a business system automates an existing process and/or severely needed business functionality.
Data Warehouse System: What analysis one needs, why, and what one will do once it is available are questions that demand and challenge the management and strategic thought process. Unlike a business process, the analysis of any problem can be done in a hundred different ways. Therefore, business requirements tend to change throughout a Data Warehouse project.

Project Domain: Business users availability and engagement
Transaction/Production Business System: Business users are more available and engaged.
Data Warehouse System: It is easier to provide and confirm the requirements of a business process automation, and difficult to define the information and analysis needs. Business users are too busy doing day-to-day work to dwell upon these questions.

Project Domain: The demands on the Database
Transaction/Production Business System: The queries and data access are predictable, as they are driven by the mapping of the type of transaction, instances etc. A typical transaction touches only certain tables and certain records. The large, all-encompassing processing mostly happens in end-of-day processing.
Data Warehouse System: A Data Warehouse cannot predict the kind and incidence of queries on the system. A query can access all the tables and records.

Project Domain: Variety of front-end applications
Transaction/Production Business System: A business system has a pre-defined back-end and front-end applications accessing the back-end database.
Data Warehouse System: A Data Warehouse could have new front-end applications added on an ongoing basis. This includes OLAP tools, data mining applications, business performance management applications, and online user query and reporting applications.

Project Domain: Expectations of flexibility to enhancements
Transaction/Production Business System: A typical business system has an ever-increasing list of enhancements. However, it is expected that the enhancements will take time and that the system will go through well-spaced-out releases.
Data Warehouse System: A Data Warehouse is expected to provide granular enhancements in most cases. Its design has to be flexible enough to incorporate new dimensions, measures and source systems without unsettling the foundations.

Chapter 4 Data Warehouse Purpose and Objective- Why is Data Warehouse Needed?

A Data Warehouse is designed to meet some business needs that a transaction system cannot (and vice versa). The big question is why we have to create a Data Warehouse. The reasons are as follows. As a starter, let's say you want to know the monthly variations in the 3-month running average of your customer balances over the last twelve months, grouped by product + channel + customer segment. Let's see why you need a data warehouse for this purpose.
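To make the example query concrete, here is a minimal Python sketch of what it computes. The row layout (product, channel, segment, month, balance), the sample figures and the function name running_average_variation are assumptions made for illustration, and the sketch assumes that every group has a balance for each consecutive month.

```python
from collections import defaultdict

# Hypothetical month-end balances: (product, channel, segment, month, balance)
rows = [
    ("loan", "branch", "retail", 1, 100.0),
    ("loan", "branch", "retail", 2, 200.0),
    ("loan", "branch", "retail", 3, 300.0),
    ("loan", "branch", "retail", 4, 400.0),
    ("loan", "branch", "retail", 5, 600.0),
]


def running_average_variation(rows, window=3):
    """Return {group: [(month, running_average, change_vs_previous_month)]}."""
    # Step 1: total balance per group (product + channel + segment) and month.
    monthly = defaultdict(lambda: defaultdict(float))
    for product, channel, segment, month, balance in rows:
        monthly[(product, channel, segment)][month] += balance

    # Step 2: running average over `window` months, then month-on-month change.
    result = {}
    for group, by_month in monthly.items():
        months = sorted(by_month)
        series = []
        previous = None
        for i in range(window - 1, len(months)):
            span = months[i - window + 1 : i + 1]
            average = sum(by_month[m] for m in span) / window
            change = None if previous is None else average - previous
            series.append((months[i], average, change))
            previous = average
        result[group] = series
    return result


report = running_average_variation(rows)
for group, series in report.items():
    print(group, series)
```

Even in this toy form, the calculation has to read every balance row for every group, which is the access pattern the following sections contrast with production work.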

Keeping Analysis/Reporting and Production Separate

If you run the above-said query on your production systems, it may lock many of your tables and eat up most of your resources, as it accesses a lot of data and does a lot of calculations. This can bring the production work to a virtual halt. Imagine hundreds of such queries running at the same time on your production systems. Reporting and analysis work typically accesses data across the database tables, whereas production work typically accesses a specific customer, product or channel record at a point in time. That is why it is important for the information generation work to be done on an offline platform (i.e. the Data Warehouse). A purpose of the Data Warehouse is to keep analysis/reporting (non-production use of data) separate from production data.

Information Integration from multiple systems- Single point source for information

As an example, let's say you have different systems for a loan product and a credit card product. The above-said query, if run on production, will need to pick the data from these systems in real time. This will make the query extremely slow, and it will need to join the data in intermediate tables or in run-time memory. Moreover, the result will not be reliable, as at a particular point in time the databases may not be in sync: much of the synchronization happens in the end-of-day batch runs.

DW purpose for Data Consistency and Quality

Organizations are riddled with tens of important systems from which their information comes. Each of these systems may carry the information in a different format and may also hold out-of-sync information (different customer ID formats, mismatches in supplier statuses). By bringing the data from these disparate sources to a common place, one can effectively bring uniformity and consistency to the data (refer to Cleansing and Data Transformation).
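As a small illustration of bringing such formats into line, the sketch below standardizes customer IDs and status codes coming from different systems. The ID formats, the target format (the letter C followed by six digits) and the status vocabulary are all hypothetical; a real warehouse would take these rules from its own cleansing specifications.

```python
import re

# Hypothetical status codes used by different source systems, mapped to one
# warehouse-wide vocabulary.
STATUS_MAP = {
    "a": "ACTIVE", "active": "ACTIVE", "1": "ACTIVE",
    "i": "INACTIVE", "inactive": "INACTIVE", "0": "INACTIVE",
}


def standardize_customer_id(raw_id):
    """Reduce any source format to the assumed standard form 'C' + six digits."""
    digits = re.sub(r"\D", "", raw_id)  # keep only the numeric part
    if not digits:
        raise ValueError(f"no digits in customer ID: {raw_id!r}")
    return f"C{int(digits):06d}"


def standardize_status(raw_status):
    """Map a source-specific status code to the common vocabulary."""
    return STATUS_MAP.get(raw_status.strip().lower(), "UNKNOWN")


# The same customer as three systems might record it, plus an unmapped status.
records = [("cust-42", "A"), ("0042", "active"), ("C000042", "1"), ("77", "closed")]
cleaned = [(standardize_customer_id(i), standardize_status(s)) for i, s in records]
print(cleaned)
```

Unmapped values are flagged as UNKNOWN rather than guessed, so that mismatches surface during cleansing instead of silently entering the warehouse.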

Quick Response Time- Production Databases are tuned to expected transaction load

Even if you run the above-said query on an offline database, it will take a lot of time if the database design is the same as that of production. This is because production databases are created to cater to production work. In production systems, there is some expected level of intensity for the different kinds of actions. Therefore, the indexing, normalization and other design considerations are made for given transaction loads. The Data Warehouse, however, has to be ready for fairly unexpected loads and types of queries, which demands a high degree of flexibility and a quick response time.

Quick Response Time- Normalized Data vs. Dimensional Modeling

Production/source system databases are typically normalized to ensure integrity and non-redundancy of data. This kind of design is fine for transactions, which involve a few records at a time. However, for large analysis and mining queries, the response time in normalized databases will be slow, given the number of joins that have to be performed.

Data Warehouse objective of providing an adaptive and flexible source of information

It is easier for users to define the production work and functionalities they want, but difficult to define the analysis they need. The analysis needs keep changing, and the Data Warehouse has the capability to adapt quickly to the changing requirements. Please refer to 'Dimension Modeling'.

Establish the foundation for Decision Support

The decision-making process of an organization involves analysis, data mining, forecasting, decision modeling etc. A common point that provides consistent, quality data with a quick response time is the core enabler for making fast and informed decisions.

Chapter 5 Data Warehouse Dimensional Model- Concept and Components

The dimensional model is the equivalent of the logical data design of a Data Warehouse, and much more. It is simpler in design and suits the purpose of a data warehouse. The Dimensional Model is a logical design technique that seeks to present the data in a standard, intuitive framework that allows for high-performance access. It is inherently dimensional, and it adheres to a discipline that uses the relational model with some important restrictions. Every dimensional model is composed of one table with a multi-part key, called the fact table, and a set of smaller tables called dimension tables. Each dimension table has a single-part primary key that corresponds exactly to one of the components of the multi-part key in the fact table. (See Figure) This characteristic 'star-like' structure is often called a star join.

A fact table, because it has a multi-part primary key made up of two or more foreign keys, always expresses a many-to-many relationship. The most useful fact tables also contain one or more numerical measures, or 'facts', that occur for the combination of keys that define each record. In Figure, the facts are Units_Sold, Dollars_Sold, and Avg_sales. The most useful facts in a fact table are numeric and additive. Additivity is crucial because data warehouse applications almost never retrieve a single fact table record; rather, they fetch back hundreds, thousands, or even millions of these records at a time, and the only useful thing to do with so many records is to add them up.

Dimension tables, by contrast, most often contain descriptive textual information and the attributes (also called classification attributes) that are used for analysis. Dimension attributes are the source of most of the interesting constraints in data warehouse queries, and they are virtually always the source of the row headers in the SQL answer set.

Fact Table and Dimension Tables in a Dimensional Model Schema

Let's consider a Data Warehouse cube. This cube has 4 dimensions and two measures. This means that for every combination of values of these 4 dimensions there will be two measure values at that coordinate. For example:

Co-ordinate [City (X), Product (Y), Channel (Z), Month] = [Sales (Quantity), Sales (Value)]
or
[NY, Standard Desk-top, Mail, September 2005] = [2000 units, $15000]

In the dimensional modeling schema, the FACT table contains the values at the coordinates, at the lowest granularity, for all the possible combinations of dimensions. The dimension tables contain the details of the dimensions, which include the attributes of the dimensions and all the higher-level hierarchies. The link between the fact table and all the associated dimension tables is through a dimension key, which is the lowest-granularity primary key of each dimension table.
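The coordinate idea can be sketched with a plain Python dictionary: each key is one combination of dimension values at the lowest grain, and each value holds the additive measures. Only the NY / Standard Desk-top / Mail / September 2005 cell comes from the example above; the other two cells and the roll_up helper are hypothetical additions for illustration.

```python
from collections import defaultdict

DIMENSIONS = ("city", "product", "channel", "month")

# Coordinate (city, product, channel, month) -> measures (quantity, value)
facts = {
    ("NY", "Standard Desk-top", "Mail", "2005-09"): (2000, 15000),
    ("NY", "Standard Desk-top", "Web", "2005-09"): (500, 4000),   # hypothetical
    ("LA", "Standard Desk-top", "Mail", "2005-09"): (300, 2400),  # hypothetical
}


def roll_up(facts, keep):
    """Add up the measures over every dimension that is not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(lambda: [0, 0])
    for coordinate, (quantity, value) in facts.items():
        group = tuple(coordinate[p] for p in positions)
        totals[group][0] += quantity
        totals[group][1] += value
    return {group: tuple(measures) for group, measures in totals.items()}


by_city = roll_up(facts, keep=("city",))
by_city_channel = roll_up(facts, keep=("city", "channel"))
print(by_city)
```

Because both measures are additive, the same function answers a question at any level of detail simply by choosing which dimensions to keep.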

Fact Table- The central linkage in Dimensional Modeling

A fact table contains the value of all the measures linked to the set of dimensions linked to the FACT table. It contains the measure values for the combination of lowest level of granularity of dimensions. The measures are typically numeric, which can undergo mathematical aggregation and analysis.

Families of FACT Tables

- Chains and Circles.
- Heterogeneous products.
- Transactions and snapshots.
- Aggregates.

Dimension Table- What does and should it contain

The dimension table contains all the information on the dimension. This includes:

a. The primary key (equivalent to the foreign key in the Fact Table).
b. All attributes of the dimension. These include:

- The hierarchy attributes- Consider a business hierarchy for the location dimension: pin-code to city to district to state to country. Each hierarchy element will be an attribute.
- Textual as well as code attributes- The location code as well as the name of the location. This is required because both could be used for different reasons by different users. A power user could be looking for the location code (NY01), whereas an end user could be looking for a more explicit header (New York).
- All parallel hierarchies- A product could have different hierarchies, depending upon whether the CFO or the Head of Sales is looking at it. Including them enables the analysis to be done on all hierarchies as well as across hierarchies.
- Primary key- Refer to the surrogate primary key, which links to the FACT table. Surrogate keys are used because the production keys could change or could be reused. For example, a bill number could be reused after 5 years, or a part number (especially in FMCG) could be reused after a few years.
- Production or source system key- This is required for auditability and as a link to the Extraction data and source systems.

Chapter 6 Dimensional Model Schemas- Star, Snow-Flake and Constellation

A dimensional model can be organized as a star schema or a snowflake schema.

Dimensional Model Star Schema using Star Query

The star schema is perhaps the simplest data warehouse schema. It is called a star schema because the entity-relationship diagram of this schema resembles a star, with points radiating from a central table. The center of the star consists of a large fact table, and the points of the star are the dimension tables. A star schema is characterized by one or more very large fact tables that contain the primary information in the data warehouse, and a number of much smaller dimension tables (or lookup tables), each of which contains information about the entries for a particular attribute in the fact table.

A star query is a join between a fact table and a number of dimension tables. Each dimension table is joined to the fact table using a primary key to foreign key join, but the dimension tables are not joined to each other. A cost-based query optimizer can recognize star queries and generate efficient execution plans for them.

A typical fact table contains keys and measures. For example, in the sample schema, the fact table, sales, contains the measures quantity_sold, amount, and average, and the keys time_key, item_key, branch_key, and location_key. The dimension tables are time, branch, item and location. A star join is a primary key to foreign key join of the dimension tables to a fact table. The main advantages of star schemas are that they:

- Provide a direct and intuitive mapping between the business entities being analyzed by end users and the schema design.
- Provide highly optimized performance for typical star queries.
- Are widely supported by a large number of business intelligence tools, which may anticipate or even require that the data warehouse schema contains dimension tables.
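A star query against the sample schema described above might look like the following sketch, which uses Python's built-in sqlite3 module with an in-memory database. The `_dim` table suffixes, the attribute columns (month, item_name, branch_name, city) and the sample rows are assumptions made for illustration; the 'average' measure is left out because it is not additive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, month TEXT, year INTEGER);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT);

-- Fact table: a multi-part key made of the dimension keys, plus measures.
CREATE TABLE sales (
    time_key      INTEGER REFERENCES time_dim(time_key),
    item_key      INTEGER REFERENCES item_dim(item_key),
    branch_key    INTEGER REFERENCES branch_dim(branch_key),
    location_key  INTEGER REFERENCES location_dim(location_key),
    quantity_sold INTEGER,
    amount        REAL,
    PRIMARY KEY (time_key, item_key, branch_key, location_key)
);

INSERT INTO time_dim     VALUES (1, 'September', 2005), (2, 'October', 2005);
INSERT INTO item_dim     VALUES (1, 'Standard Desk-top');
INSERT INTO branch_dim   VALUES (1, 'Downtown');
INSERT INTO location_dim VALUES (1, 'NY'), (2, 'LA');
INSERT INTO sales VALUES (1, 1, 1, 1, 2000, 15000.0),
                         (2, 1, 1, 1, 1000, 7500.0),
                         (1, 1, 1, 2, 300, 2400.0);
""")

# Star query: every dimension joins to the fact table, none joins to another.
star_query = """
SELECT l.city, t.month, SUM(s.quantity_sold), SUM(s.amount)
FROM sales AS s
JOIN time_dim     AS t ON s.time_key = t.time_key
JOIN item_dim     AS i ON s.item_key = i.item_key
JOIN branch_dim   AS b ON s.branch_key = b.branch_key
JOIN location_dim AS l ON s.location_key = l.location_key
WHERE i.item_name = 'Standard Desk-top'
GROUP BY l.city, t.month
ORDER BY l.city, t.month
"""
rows = conn.execute(star_query).fetchall()
for row in rows:
    print(row)
```

The constraint (item_name) and the row headers (city, month) come from dimension attributes, while the numbers being added up come from the fact table.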

Snow-Flake Schema in Dimensional Modeling

The snowflake schema is a more complex data warehouse model than a star schema, and is a type of star schema. It is called a snowflake schema because the diagram of the schema resembles a snowflake. Snowflake schemas normalize dimensions to eliminate redundancy. That is, the dimension data is grouped into multiple tables instead of one large table. For example, a location dimension table in a star schema might be normalized into a location table and a city table in a snowflake schema. While this saves space, it increases the number of dimension tables and requires more foreign key joins. The result is more complex queries and reduced query performance. Figure above presents a graphical representation of a snowflake schema.
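The location example can be sketched with sqlite3 to show the extra join a snowflake schema introduces. All table names, column names and rows below are hypothetical; the point is only that both designs return the same answer, with the snowflake version needing one more join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star version: one denormalized location dimension (city and state repeated).
CREATE TABLE location_star (
    location_key INTEGER PRIMARY KEY, street TEXT, city TEXT, state TEXT);

-- Snowflake version: city attributes are moved into their own table.
CREATE TABLE city (city_key INTEGER PRIMARY KEY, city TEXT, state TEXT);
CREATE TABLE location_snow (
    location_key INTEGER PRIMARY KEY, street TEXT,
    city_key INTEGER REFERENCES city(city_key));

CREATE TABLE sales (location_key INTEGER, amount REAL);

INSERT INTO location_star VALUES (1, '1 Main St', 'New York', 'NY'),
                                 (2, '9 Oak Ave', 'New York', 'NY'),
                                 (3, '5 Pine Rd', 'Boston', 'MA');
INSERT INTO city VALUES (1, 'New York', 'NY'), (2, 'Boston', 'MA');
INSERT INTO location_snow VALUES (1, '1 Main St', 1),
                                 (2, '9 Oak Ave', 1),
                                 (3, '5 Pine Rd', 2);
INSERT INTO sales VALUES (1, 100.0), (2, 50.0), (3, 70.0);
""")

# Star schema: one join from the fact table to the dimension.
star_sql = """
SELECT d.city, SUM(s.amount)
FROM sales AS s
JOIN location_star AS d ON s.location_key = d.location_key
GROUP BY d.city ORDER BY d.city
"""

# Snowflake schema: a second join is needed to reach the city name.
snowflake_sql = """
SELECT c.city, SUM(s.amount)
FROM sales AS s
JOIN location_snow AS l ON s.location_key = l.location_key
JOIN city          AS c ON l.city_key = c.city_key
GROUP BY c.city ORDER BY c.city
"""

star_result = conn.execute(star_sql).fetchall()
snowflake_result = conn.execute(snowflake_sql).fetchall()
print(star_result, snowflake_result)
```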

Fact Constellation Schema

This schema is used mainly for aggregate fact tables, or where we want to split a fact table for better comprehension. The fact table is split only when we want to focus on aggregation over a few facts and dimensions.

Chapter 7 Dimensional Modeling vs. Relational Modeling

Dimensional modeling differs from OLTP normalized modeling in order to enable analysis and querying through massive and unpredicted queries, something a relational model is ill-equipped to handle.

How Dimensional model is different from an E-R diagram?

- An E-R diagram (used in OLTP or transactional systems) has a highly normalized model (even at a logical level), whereas a dimensional model aggregates most of the attributes and hierarchies of a dimension into a single entity.
- An E-R diagram is a complex maze of hundreds of entities linked with each other, whereas the dimensional model has a logically grouped set of star schemas.
- The E-R diagram is split as per the entities. A dimensional model is split as per the dimensions and facts.
- In an E-R diagram, all attributes of an entity, textual as well as numeric, belong to the entity table. In a dimensional model, a 'dimension' entity has mostly textual attributes, and the 'fact' entity has mostly numeric attributes.

Dimensional modeling is a better approach for Data warehouse compared to standard Data Model.

The dimensional model has a number of important data warehouse advantages that the ER model lacks. The first advantage of the dimensional model is that there is a standard type of joins and framework. All dimensions can be thought of as symmetrically equal entry points into the fact table. The logical design can be done independently of expected query patterns. The user interfaces are symmetrical, the query strategies are symmetrical, and the SQL generated against the dimensional model is symmetrical. In other words:

- You will never find attributes in fact tables and facts in dimension tables.
- If you see a non-fact field in the fact table, you can assume that it is a key to a dimension table.

The second advantage of the dimensional model is that it is smoothly extensible to accommodate unexpected new data elements and new design decisions. All existing tables (both fact and dimension) can be changed in place by simply adding new data rows to the table. Data should not have to be reloaded. Typically, no query tool or reporting tool needs to be reprogrammed to accommodate the change, and all old applications continue to run without yielding different results. You can make the following graceful changes to the design after the data warehouse is up and running:

- Adding new unanticipated facts (that is, new additive numeric fields in the fact table), as long as they are consistent with the fundamental grain of the existing fact table.
- Adding completely new dimensions, as long as there is a single value of that dimension defined for each existing fact record.
- Adding new, unanticipated dimensional attributes.
- Breaking existing dimension records down to a lower level of granularity from a certain point in time forward.
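The first and third of these changes can be sketched with sqlite3. The example adds a new fact column and a new dimension attribute with SQLite's ALTER TABLE ... ADD COLUMN and checks that an existing query returns the same result afterwards. The table names, the discount_amount fact and the colour attribute are hypothetical, and the use of ALTER TABLE is an assumption about how such a change would be made in practice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sales (product_key INTEGER, units_sold INTEGER);
INSERT INTO product_dim VALUES (1, 'Desk-top'), (2, 'Laptop');
INSERT INTO sales VALUES (1, 10), (1, 5), (2, 7);
""")

# A query written before the design change.
old_query = """
SELECT p.name, SUM(s.units_sold)
FROM sales AS s
JOIN product_dim AS p ON s.product_key = p.product_key
GROUP BY p.name ORDER BY p.name
"""
before = conn.execute(old_query).fetchall()

# New unanticipated fact: an additive numeric column at the same grain.
conn.execute("ALTER TABLE sales ADD COLUMN discount_amount REAL DEFAULT 0")
# New unanticipated dimensional attribute.
conn.execute("ALTER TABLE product_dim ADD COLUMN colour TEXT")

# The old query runs unchanged and yields the same result.
after = conn.execute(old_query).fetchall()
print(before == after, after)

# New analysis can use the added columns straight away.
discounts = conn.execute(
    "SELECT SUM(discount_amount) FROM sales"
).fetchone()[0]
```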

The third advantage of the dimensional model is that there is a body of standard approaches for handling common modeling situations in the business world. Each of these situations has a well-understood set of alternatives that can be specifically programmed in report writers, query tools, and other user interfaces. These modeling situations include:

- Slowly changing dimensions, where a 'constant' dimension such as Product or Customer actually evolves slowly and asynchronously. Dimensional modeling provides specific techniques for handling slowly changing dimensions, depending on the business environment.
- Heterogeneous products, where a business such as a bank needs to:
  o Track a number of different lines of business together within a single common set of attributes and facts, but at the same time
  o Describe and measure the individual lines of business in highly idiosyncratic ways using incompatible measures.

Chapter 8 Foundation & Conformed Dimensions and Facts in Data Warehouse Dimensional Model

A Data Warehouse is a repository that feeds data marts and other downstream systems. It has to be designed with a global, re-usable set of dimensions and measures.

Data Warehouse modeling has two components:

- A foundation to support medium to long-term capabilities, without the need to unsettle the structure time and again.
- The individual phases of Data Mart development, which eventually merge into the enterprise-wide Data Warehouse.

A project has to address both the foundation and the phase elements. Every stage in the Data Warehouse project will address these two elements in a distinct and overt manner. For dimensional modeling, the following foundation-setting elements work like reusable components. They will be the same across the Data Marts/Data Warehouse for the current and future phases of development:

Standard set of foundation or conformed dimensions. This means that:

- Dimensions are super-sets of all possible attributes for that dimension. For example, the customer 'age' attribute may not be required for sales analysis but may be required for credit analysis. Therefore, when creating the standard dimensions, one makes the superset of attributes.
- Dimensions include all possible levels of the business hierarchy. For example, a portfolio analysis of a channel may not require the branch-level location, but the agent productivity analysis could.
- Dimensions include not only categories but descriptive textual attributes as well, wherever needed. For example, a textual detail for a location code could be needed for distribution analysis but may not be needed for portfolio analysis.
- Make the dimension most granular- Many a time the analysis does not need to go down to the most granular level of customer ID. However, if a customer moves out of his existing customer segment, the whole dimensional model could run into issues if the dimension starts from customer group upwards.

Examples of foundation dimensions are- Customer, Location, Channel, Sales Lead etc.

Standard set of foundation or conformed facts. This means that:

- A fact table will include all possible units of measure for a given set of dimensions. For example, sales by numbers could need only the number of 'Crates' in one data mart and 'Pieces' in the other. However, both units for the given measure should be included even if there is a standard conversion rate, because these standard conversion rates keep changing with time.
- A fact table logically groups a business instance. For example, you could require the distribution of a 'product' to retail outlets for distribution analysis. However, you will require the fact on the final sale to the end customer for sales analysis. As a guideline, highly linked business processes should be combined in a single fact.

Standard set of foundation measures. This means that:

- All the measures and their possible units are to be listed out.
- Measures are most susceptible to having confusing definitions OR to being misnamed.
- Detailed formulas behind measures are a must. Refer to the Sales Revenue Fact-Measure as an example.

Examples of foundation measures are- Sales Measures, Customer Measures, etc.


Chapter 9 Slowly Changing Dimensions- SCD in Dimensional Modeling

A dimensional model has to address some complex situations like slowly changing dimensions.

Slowly Changing Dimensions

Entities change over time. Customer demographics, product characteristics, classification rules, status of customers etc. lead to changes in the attributes of dimensions. In a transaction system, many a time the change is overwritten and the track of the change is lost. For example, a source system may have only the latest customer PIN Code, as it is needed to send the marketing and billing statements. However, a data warehouse needs to maintain all the previous PIN Codes as well, because we need to track how many customers move to new locations and how frequently. A key benefit of a Data Warehouse is to provide historical information, which is typically over-written (and thus lost) in the transaction systems. How slowly changing dimensions are handled in a dimensional model is a key determinant of that benefit.


There are three ways to handle the same:

Slowly Changing Dimension Method 1 (in short SCD 1)

This is the way most of the source systems will handle it: overwrite the attribute value. For example, if a customer's marital status has moved from 'Unmarried' to 'Married', we over-write 'Unmarried' with 'Married'. Similarly, if an insurance policy status has moved from 'Lapsed' to 'Re-instated', the new status is written over the old status. This is done when we are not analyzing the historical information.

Slowly Changing Dimension Method 2 (in short SCD 2)

This is the true-blue technique to deliver precise historical analysis. It is used when there is more than one change in the attributes of an entity, and we need to track the date of change of the attribute. In this method, a new record is added, and the new record is given a separate identifier as the primary key. We cannot use the production key as the primary key here, as it has not changed (Customer ID has remained the same, while the value of its attribute 'marital status' has changed). This new identifier is called the surrogate key. Apart from adding a new record and providing a new primary (surrogate) key, the validity period for this new record is also added. For example, you have a dimensional table with customer_ID '110002' with marital status as 'single'. Over time, the customer gets married and also moves to a new location. The customer dimension records will be:

Surrogate Key   Customer ID   Marital Status   Date Valid      Date of Birth   City
1100021         110002        Single           Sept 23, 2004   Jan 8, 1982     Palo Alto
1100022         110002        Married          Oct 25, 2005    Jan 8, 1982     Palo Alto
1100023         110002        Married          Nov 23, 2005    Jan 8, 1982     San Francisco
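The SCD 2 steps above (add a record, give it a new surrogate key, stamp its validity date) can be sketched in a few lines of Python. This is only an illustration under assumptions: the column names, the "highest key plus one" rule for generating surrogate keys, and the helper names `apply_scd2` and `version_as_of` are hypothetical and not part of any particular tool.

```python
from datetime import date

# Customer dimension rows; column names are illustrative assumptions.
customer_dim = [
    {"surrogate_key": 1100021, "customer_id": 110002,
     "marital_status": "Single", "city": "Palo Alto",
     "date_valid": date(2004, 9, 23)},
]

def apply_scd2(dim, customer_id, changes, valid_from):
    """Append a new version of the customer row instead of overwriting it."""
    versions = [r for r in dim if r["customer_id"] == customer_id]
    latest = max(versions, key=lambda r: r["date_valid"])
    new_row = {**latest, **changes,
               # Assumed rule: next surrogate key is the highest key plus one.
               "surrogate_key": max(r["surrogate_key"] for r in dim) + 1,
               "date_valid": valid_from}
    dim.append(new_row)
    return new_row

def version_as_of(dim, customer_id, as_of):
    """Return the row that was valid for the customer on a given date."""
    candidates = [r for r in dim
                  if r["customer_id"] == customer_id and r["date_valid"] <= as_of]
    return max(candidates, key=lambda r: r["date_valid"])

apply_scd2(customer_dim, 110002, {"marital_status": "Married"}, date(2005, 10, 25))
apply_scd2(customer_dim, 110002, {"city": "San Francisco"}, date(2005, 11, 23))
```

Because every version is kept, a query such as `version_as_of(customer_dim, 110002, date(2005, 11, 1))` can still answer where the customer lived on that date.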

Slowly changing dimension method 3 (SCD 3)

This is midway between method 1 and method 2. Here we don't add an additional record, but add a new field 'old attribute value'. However, this has limitations. This method has to know from the beginning which attributes will change, because a new field/attribute has to be added in the design for every attribute that can change. Secondly, an attribute can change at most once in the lifetime of the entity OR at least in the lifetime of the data warehouse.

Surrogate Key   Customer ID   Marital Status   Date of Birth   City            Marital Status Old   City Old
1100021         110002        Married          Jan 8, 1982     San Francisco   Single               Palo Alto
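A minimal Python sketch of the SCD 3 idea follows. The `_old` column naming and the `apply_scd3` helper are assumptions made for illustration only; they mirror the 'old attribute value' fields in the table above.

```python
# One row per customer; the "*_old" columns are fixed in the design up front.
customer = {"surrogate_key": 1100021, "customer_id": 110002,
            "marital_status": "Single", "marital_status_old": None,
            "city": "Palo Alto", "city_old": None}

def apply_scd3(row, attribute, new_value):
    """Move the current value into the '<attribute>_old' column, then overwrite."""
    old_column = attribute + "_old"
    if old_column not in row:
        # The design did not foresee a change in this attribute.
        raise KeyError(f"{attribute} was not designed as a tracked attribute")
    row[old_column] = row[attribute]
    row[attribute] = new_value

apply_scd3(customer, "marital_status", "Married")
apply_scd3(customer, "city", "San Francisco")
```

The sketch also shows both limitations: an attribute without an `_old` column cannot be tracked, and a second change to `city` would overwrite 'Palo Alto' in `city_old`.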

NOTE: The term 'slowly changing dimension' is used because it is a universally acknowledged term. However, the same methods apply to fast changing dimensions as well.

Data Analysis/OLAP

Data Analysis/OLAP is the most fundamental way to make sense of your data. It involves looking at the data from all possible angles: slicing & dicing on various dimensions, drilling up/down, applying filters, exception highlighting, graphs and other presentation tools, and time trending analysis. Whether you are doing a pivot in Excel or creating advanced views in an upmarket OLAP tool, most of the usage of data in today's world falls within the realm of Data Analysis/OLAP. It is essentially a post graduate course before you go for a fellowship in Data Mining.

Chapter 1 Online Analytic Processing (OLAP)-Overview

OLAP in Business Intelligence- What is OLAP?

This topic provides a high level concept of OLAP and how it fits into the BI framework. It describes the different OLAP options and shows that OLAP is a layer between the Data Warehouse and the BI end-user tools.

Online Analytic Processing is the capability to store and manage the data in such a way that it can be effectively used to generate actionable information. As you are aware from the Business Intelligence Architecture, OLAP sits between the Data Warehouse and the end-user tools. OLAP is explained in detail in OLAP vs. Data Warehouse.

OLAP makes Business Intelligence happen, broadly by enabling the following:

- Transforming the data into multi-dimensional cubes
- Summarized, pre-aggregated and derived data
- Strong query management
- Multitude of calculation and modeling functions

A data-warehouse could have data in various formats, like dimensional (with a high degree of de-normalization) OR highly relational (like 3rd normal form). As a separate note, we have covered the entire data-warehouse chapter on the basis of dimensional modeling based storage. Most of the concepts in the data-warehouse chapter remain the same irrespective of the kind of storage and data-modeling one needs to do. The detailed difference between OLAP and Data Warehouse is given in OLAP Layer.

OLAP provides the building blocks to enable analysis (like rich functions, multi-dimensional models, analysis types..). It is mostly the end-user tools (like business modeling tools, data mining tools, performance reporting tools..) which sit on top of OLAP to provide a rich Business Intelligence user interface. OLAP and Data Warehouse work in conjunction to provide overall data-access for the end-user tools. You may like to refer to BI Architecture Scenarios to get a better background.

There are different ways to store the data in the OLAP + Data-warehouse combination. While you can refer to OLAP Architectures in BI Architecture, here is the brief:

MOLAP: OLAP storing the data in the multi-dimensional mode. To put it simplistically, there is one array for each combination of dimensions and associated measures. In this storage method there is no connection between the MOLAP database and the data-warehouse database for query purposes. It means that a user cannot drill down from MOLAP summary data to the transaction level data of the data-warehouse.


ROLAP: OLAP storing the data in relational form in a dimensional model. This is a de-normalized form in a relational table structure. The ROLAP database of the OLAP server can be linked to the data-warehouse database.

HOLAP: The aggregate data is stored in the multi-dimensional model in the OLAP database and the transaction level data is stored in relational form in the data-warehouse database. There is a linkage between the summary MOLAP database of OLAP and the relational transactional database of the data-warehouse. This gives you the best of both worlds.

Chapter 2 Basic Data Analysis Types- Building Blocks

Drill (vertical) and Cross (horizontal) Navigation and Analysis

These are the methods of moving vertically and horizontally within the dimensional structure of Data-warehouse and OLAP. The term is used more in the context of OLAP, because typically the various end-user Business Intelligence tools sit on top of OLAP, which in turn sits over the Data-Warehouse.

Drill-down Navigation

It is a method of exploring for more detailed data. It is done by revealing lower-level data than was previously displayed. For instance, you can drill down from State to City to Office. Available levels depend on the granularity of the data in OLAP and the data warehouse.

Drill-link

A URL hyperlink to a destination, defining the parameters for the drill, such as the document name and prompt answers. When the document is viewed on the Web, a user can click the link to navigate to the link's destination.

Roll-up Navigation

A method of exploring for more widely summarized data. It is the opposite of drill-down. Typically you move up a dimension hierarchy. For example, you have the office level break-up of sales revenue, and you can roll it up to city, zone, region and country level figures.
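Drill-down and roll-up can be illustrated with a small Python sketch. The office data, the level names and the helper functions `roll_up` and `drill_down` are hypothetical; an OLAP server would do the same thing against its own cube structures.

```python
from collections import defaultdict

# Hypothetical office-level sales, with each office's place in the location hierarchy.
office_sales = [
    {"office": "SF-1", "city": "San Francisco", "state": "California", "revenue": 120},
    {"office": "SF-2", "city": "San Francisco", "state": "California", "revenue": 80},
    {"office": "LA-1", "city": "Los Angeles", "state": "California", "revenue": 150},
    {"office": "NY-1", "city": "New York", "state": "New York", "revenue": 200},
]

def roll_up(rows, level):
    """Summarize revenue at a higher level of the hierarchy."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[level]] += row["revenue"]
    return dict(totals)

def drill_down(rows, parent_level, parent_value, child_level):
    """Reveal the lower-level figures behind one summarized value."""
    children = [r for r in rows if r[parent_level] == parent_value]
    return roll_up(children, child_level)

by_state = roll_up(office_sales, "state")
california_by_city = drill_down(office_sales, "state", "California", "city")
```

Both operations read the same office-level rows, which is why the available levels depend on the granularity of the stored data.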

Cross-Dimensional (horizontal) Analysis and Navigation

Cross-dimensional analysis is an analysis across multiple dimensions, the key reason why OLAP and its multi-dimensional structure exist. Most business reporting and analysis goes across dimensions.

A single-dimension analysis is when you get measures for a single dimension. For example, one looks for the measures sales, headcount of employees, operating expenses etc. for the 'location' dimension (office, city, state, region, country..).

A cross-dimension example will be to look for the measures sales, gross profit etc. for the 'location' dimension (office, city, state, region, country..) for a given set of products, for a given number of quarters. If you top this kind of example with other analysis types (max-min, exception, filtration), you come close to the real-life complexity of a business analysis query. One example can be: identifying the top ten offices where the sales for the 'washing and cleaning' product range are more than the average sales for this product range across all offices, among those offices which have been open for more than 3 years and have an average growth of 5% per quarter over the last 4 quarters.

Cross-dimensional analysis capability with an OLAP server is also manifested in cross-dimensional navigation. For example, you are seeing a pie-chart of revenue share for different product-lines. By clicking on the pie slice of a given product (general insurance- Vehicles), you may like to go for the state-wise split of the revenue of that product. Going further, you may like to click on a given state (New York, California..) and look for the split across the channels (telemarketing, sales employees, tied agents, corporate agents, 3rd party brokers..). In the above examples, you are able to seamlessly navigate and drill across due to the cross-dimensional linkages.

Here is the list of cross-dimensional analyses you can perform:

Drill-across Dimensions

You drill across dimensions when you move from one dimension to another. For example, you are looking at the revenue break-up for the cities. However, now you want to have the break-up of revenue for various products (say fax machine, telephone and copier) within one city (say New York). Within the fax machine product in New York City, you want to find the break-up as per the channels of telemarketing, mailers and direct sales. In the above example you have drilled across the dimensions 'Location'->'Products'->'Channel'. This is one of the most important features, and is a fundamental capability expected of an OLAP tool.

Drill across Measures

It is similar to drill-across dimensions. For example, you are doing the sales revenue analysis and have been able to find out the best and least performing offices. However, to understand the picture further, you now move across measures to find out the sales transactions of these offices (low revenue but a high number of sales transactions points to a certain level of activity), the number of sales staff (the low performing offices could have fewer staff) and the number of months since the office was set up (new offices in their gestation period could be performing lower).

Drill across Attributes

This is by all means the same as 'drill-across dimensions'. For example, you have the data for revenue in the US as per the customer relationship value bands (say USD 10K to USD 20K / USD 20K to USD 50K / USD 50K and above). For the above-USD-50K band, you want to have the break-up as per the age bands (18 years to 25 years / 25 years to 40 years / >40 years), and within >40 years, you want to have the break-up by occupation (self employed, practicing professional, employed..). In this example we drill across the attributes relationship value band->age band->occupation, which all belong to the customer dimension.
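The drill-across sequence 'Location'->'Products'->'Channel' described earlier can be sketched as repeated break-ups, each one holding the previously chosen values fixed. The fact rows and the `break_up` helper below are hypothetical and only illustrate the navigation pattern.

```python
from collections import defaultdict

# Hypothetical fact rows: (city, product, channel, revenue).
facts = [
    ("New York", "Fax machine", "Telemarketing", 40),
    ("New York", "Fax machine", "Direct sales", 60),
    ("New York", "Telephone", "Mailers", 30),
    ("Boston", "Fax machine", "Direct sales", 25),
]
COLUMNS = {"city": 0, "product": 1, "channel": 2}

def break_up(rows, by, **fixed):
    """Revenue by one dimension, holding the already-drilled dimensions fixed."""
    totals = defaultdict(int)
    for row in rows:
        if all(row[COLUMNS[dim]] == value for dim, value in fixed.items()):
            totals[row[COLUMNS[by]]] += row[3]
    return dict(totals)

by_city = break_up(facts, "city")
by_product = break_up(facts, "product", city="New York")
by_channel = break_up(facts, "channel", city="New York", product="Fax machine")
```

Drilling across attributes works the same way, except that the fixed values and the break-up column all come from a single dimension.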

Time Trending Data Analysis

Time trending analysis includes period-to-period comparisons, analysis across periods and analysis within a period. Apart from being an important analysis lever, time trending is also core to performance management. One always wants to see how much the needle is moving over time.

Period Analysis - Beginning OR End of Period

This is a 'balance-sheet' kind of analysis, where you find out the status of various measures (for example the account balances, the number of offices, number of customers, number of defaults, number of patients in admission) at the end OR beginning of a period (say end of month, beginning of quarter..).

During the period activity

This is a 'profit & loss' kind of analysis, where you find out the extent of activity done within a period. For example, the sales revenue measure, the number of patients admitted, or the number of festival package flat screen TVs sold in a given month, quarter or week.

Time Trending through Period to Period Comparison

This is a typical business performance parameter. For example, the comparison of sales in the first quarter with the first quarter of last year, OR the sales in the New Year season this year vs. last year.

Time-Trending across a fixed vs. rolling period range

A fixed period range is used when you see the time-trend across the periods in a fixed time range. For example, the revenue figures across twelve months in the calendar year (OR business performance year), OR YTD (year-to-date) and MTD (month-to-date) analysis, where the starting point for your reference is the beginning of the period. Investment analysis typically uses the rolling period range, where the stock movement across the last twelve months is tracked on a rolling basis. You can refer to Time Dimension to understand more facets of time related analysis.
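The difference between a fixed range (such as year-to-date), a rolling range and a period-to-period comparison can be seen in a short Python sketch. The monthly figures and the function names are hypothetical, chosen only to make the three calculations concrete.

```python
# Hypothetical monthly revenue keyed by (year, month).
monthly = {(2023, m): 100 for m in range(1, 13)}
monthly.update({(2024, 1): 110, (2024, 2): 120, (2024, 3): 130})

def year_to_date(series, year, month):
    """Fixed range: always starts at January of the given year."""
    return sum(series.get((year, m), 0) for m in range(1, month + 1))

def rolling_total(series, year, month, window=12):
    """Rolling range: the last `window` months ending at (year, month)."""
    total = 0
    for _ in range(window):
        total += series.get((year, month), 0)
        # Step back one month, wrapping from January to the previous December.
        year, month = (year, month - 1) if month > 1 else (year - 1, 12)
    return total

def period_to_period(series, year, month):
    """Same month this year versus last year, as a ratio."""
    return series[(year, month)] / series[(year - 1, month)]
```

With this data, the year-to-date figure for March 2024 covers three months, while the rolling figure for the same month still covers twelve.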

Exception Analysis

Exception analysis throws up the areas not meeting the expectations.

No one wants to know what is going as expected. People are interested in finding out what is going worse and what is going better than expected.

Range Exception

This is the simplest of all. If the value of a certain measure goes beyond a range of values, the analysis should highlight it. The range can be a 'defined range' (the temperature of a patient should be between 98 degrees F and 102 degrees F) OR an 'undefined range' (the stock index moving more than 2% in either direction).

Value Exception

This analysis identifies whether an attribute OR measure value belongs OR does not belong to a specific list of values. For example, a high 'credit-risk' customer spending on a high value 'product'.

Conditional Exception

This is used when you want to identify the occurrence of certain conditions. For example, let us say you want to highlight the exception instances when a high customer value relationship has not used your credit card for three months and has been paying only the minimum due amount.
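The three exception types can each be expressed as a simple test. The sketch below is illustrative only: the function names and the customer field names (`value_band`, `months_since_card_use`, `pays_minimum_only`) are assumptions, not fields of any real system.

```python
def range_exception(value, low, high):
    """True when a measure falls outside a defined range."""
    return not (low <= value <= high)

def value_exception(value, watch_list):
    """True when an attribute value belongs to a specific list of values."""
    return value in watch_list

def conditional_exception(customer):
    """High-value customer, card unused for 3+ months, paying only the minimum due."""
    return (customer["value_band"] == "high"
            and customer["months_since_card_use"] >= 3
            and customer["pays_minimum_only"])

patient_flagged = range_exception(103.5, 98, 102)
risk_flagged = value_exception("high credit-risk", {"high credit-risk"})
dormant_flagged = conditional_exception(
    {"value_band": "high", "months_since_card_use": 4, "pays_minimum_only": True})
```

In practice such tests would be applied to every row of a report, and only the rows for which they return true would be highlighted.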

Data Min-Max Analysis

Min-Max for Range Analysis

This is to see the maximum and minimum values a measure takes. For example, you want to find out the maximum value of geological disturbances in a given seismic location, OR you want to find out the maximum load generated when you switch on an electrical device.

Outliers Analysis

Outliers are similar to the range analysis, but with a different purpose. Outliers are exception values that occur rarely, but have an impact on aggregations like sums and averages. For example, you want to find out the average delivery TAT (turnaround time) for a product. The results could show that taking out the top 2% of instances brings down the overall average delivery TAT by 10%.

Standard Deviation Analysis

Any performance is a combination of how one is performing on an average and how much is the consistency of performance. 'Standard deviation' is a measure of consistency.
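The effect of outliers on an average, and standard deviation as a measure of consistency, can be checked with Python's standard `statistics` module. The turnaround times and the `trimmed` helper are hypothetical; the population standard deviation (`pstdev`) is used here as one reasonable choice.

```python
from statistics import mean, pstdev

# Hypothetical delivery turnaround times (TAT) in days; the last value is an outlier.
tat_days = [2, 3, 2, 3, 2, 3, 2, 3, 2, 28]

def trimmed(values, top_fraction):
    """Drop the largest `top_fraction` of the values (at least one value)."""
    drop = max(1, int(len(values) * top_fraction))
    return sorted(values)[:-drop]

overall_average = mean(tat_days)
average_without_top = mean(trimmed(tat_days, 0.10))

# Lower standard deviation means more consistent performance.
consistency_all = pstdev(tat_days)
consistency_trimmed = pstdev(trimmed(tat_days, 0.10))
```

Comparing the two averages shows how much a single extreme delivery pulls the overall figure, and comparing the two standard deviations shows how much it hurts consistency.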

Data Filtration Analysis

Like the other analysis types covered in the chapter, filtration analysis allows you to place filters on your queries. Applying a filter can serve either 'exclusion' OR 'selecting specific values for inclusion'. In a simplistic way, filtering is equivalent to the 'where' clause of an SQL query.

Here are the different ways you can do Data Filtration analysis:

Data Filter on specific values of a dimension by direct specification:

Calculating sales for a select set of offices in a city OR calculating the operating expense across a given set of expense lines. The filtration can be on a combination of dimensions, for example a select set of office locations which are selling a given line of products.

Data Filter on specific values of a dimension given certain conditions related to a dimension:

Calculating average sales revenue for only those office locations which have been operating for less than 6 months. There can be more complex conditions.

Filter on specific values of a dimension linked to measure values:

Calculating average sales value productivity for only those offices where the sales are less than the average sales per office across all the offices.

Filtration analysis on tolerances and outliers:

When you are calculating averages (say), you may like to leave out the values which are outside certain tolerances (outliers). For example, calculation of average write-off values from cancelled credit cards, where the write-off is more than USD 10 and less than USD 20,000.

Top and bottom filters:

Example-Filtering out the 'top 10', 'bottom 10', 'top 10%', 'bottom 10%'.
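The filter types listed above map naturally onto list filters in Python, much like 'where' clauses. The office data and the helper names below are hypothetical and serve only to show one filter of each kind.

```python
# Hypothetical office-level data.
offices = [
    {"office": "A", "months_open": 4, "sales": 50},
    {"office": "B", "months_open": 30, "sales": 200},
    {"office": "C", "months_open": 5, "sales": 90},
    {"office": "D", "months_open": 48, "sales": 60},
]

def average(rows):
    return sum(r["sales"] for r in rows) / len(rows)

# Filter by direct specification of dimension values.
selected = [r for r in offices if r["office"] in {"A", "B"}]

# Filter by a condition on a dimension attribute.
new_offices = [r for r in offices if r["months_open"] < 6]

# Filter linked to a measure value: offices below the all-office average.
below_average = [r for r in offices if r["sales"] < average(offices)]

# Filter on tolerances: leave out values outside the accepted band.
within_tolerance = [r for r in offices if 55 <= r["sales"] <= 150]

def top_n(rows, n):
    """Top filter: the n rows with the highest sales."""
    return sorted(rows, key=lambda r: r["sales"], reverse=True)[:n]
```

Each filtered list can then be passed to `average` or any other aggregation, which is how the filter and the measure calculation combine in a single query.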

Pivoting, and Slicing & Dicing Analysis

Slicing means taking out a slice of a cube, given a certain set of select dimensions (customer segment) and values (home furnishings..), and measures (sales revenue, sales units..) or KPIs (sales productivity). Dicing means viewing the slices from different angles. For example, revenue for different products within a given state OR revenue for different states for a given product.

Slicing and dicing leads to what you can call a pivot. The pivot is known from the Excel context. A pivot is the standard and basic look and feel of the views you create on the OLAP cubes. A pivot gives you the ability to create width and depth in your view of the data. A pivot is a two dimensional lay-out of the summary data. The X and Y axes are the dimensions, and the intersection cells for any two dimension values contain the values of the measures.


Here is an example of how you can slice and dice through a pivot:

Step 1: Starting layout- You can have the product list on the Y axis (say 10 products) and the quarters (say four quarters) on the X axis. You can have sales value as the measure shown in the table against the intersection of a given product and a quarter. You will have a 10 X 4 matrix.

Step 2: Adding depth cross-dimensionally- Taking a step further, you can add a dimension of locations under the product to give it more depth. Therefore, now you can have different locations (say 3 locations) for each row of product. You will now have a 30 (3 locations for each of the 10 products) X 4 (quarters) matrix.

Step 3: Adding depth within a single dimension- You can also add another level like months under quarters. Now you will have 30 X 12 (3 months for each quarter). You can also specify if you want to have sub-totals for every dimension. For example, you can have the sub-totals for locations, products, months and quarters.

Step 4: Pivoting on an axis- You can also pivot your view and transpose the product + location combination to the X axis and the quarter + month combination to the Y axis.

Step 5: Adding width- Referring to the starting layout, you can also add dimensions in 'width' instead of 'depth'. For example, instead of having the location dimension under the product, you can add the location dimension adjacent to the product dimension. Therefore, you will have a matrix which on the Y axis will have 10 rows (for 10 products) and 3 rows (for 3 locations), giving a 13 X 4 matrix.
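The pivot steps can be mimicked with a small Python function that sums a measure into cells keyed by the chosen row and column dimensions. The fact rows, field names and the `pivot` helper are hypothetical; a real OLAP tool builds these views from its cube, but the idea of choosing which dimensions go on which axis is the same.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, location, quarter, sales value).
facts = [
    ("Sofa", "North", "Q1", 10), ("Sofa", "South", "Q1", 5),
    ("Sofa", "North", "Q2", 7), ("Lamp", "North", "Q1", 3),
    ("Lamp", "South", "Q2", 4),
]
FIELDS = ("product", "location", "quarter", "sales")

def pivot(rows, row_dims, col_dims):
    """Sum sales into cells keyed by (row key, column key)."""
    cells = defaultdict(int)
    for row in rows:
        record = dict(zip(FIELDS, row))
        row_key = tuple(record[d] for d in row_dims)
        col_key = tuple(record[d] for d in col_dims)
        cells[(row_key, col_key)] += record["sales"]
    return dict(cells)

step1 = pivot(facts, ["product"], ["quarter"])              # products x quarters
step2 = pivot(facts, ["product", "location"], ["quarter"])  # depth added under product
step4 = pivot(facts, ["quarter"], ["product", "location"])  # axes transposed
```

Adding a dimension to `row_dims` corresponds to adding depth in Step 2, and swapping the two arguments corresponds to pivoting on an axis in Step 4.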


Chapter 3 Advanced Data Analysis Types- Building Blocks

OLAP what-if Analysis

What-if analysis is essentially the scenario building capability of OLAP. You can draw a straight parallel with what-if analysis in MS Excel, with more sophisticated capabilities in OLAP.

This is what you do in what-if analysis:

Create a what-if calculation model:

This is the calculation model on which you are going to apply different scenarios. A profit and loss projection for the next 5 years is one example of a calculation model. A calculation model takes a set of input values to give the set of output values. Each different set of these 'input and output value combinations' is called a what-if analysis scenario.

Creating different sets of what-if scenarios:

Depending upon your needs you can build different scenarios of input values, and you can apply those scenarios on the calculation model, and generate the output values. Here is a simplistic example:

Let's say that you have Profit and Loss projections for the next 5 years as the calculation model. The following can be some of the input values to the model:

- Revenue in year 0
- Expected rate of revenue growth, CAGR (compound annual growth rate)
- Gross profit margin %
- Non-operating expenses
- Income tax rate
- Rate of dividend
- Etc.

The output values can include:

- Revenue
- Gross margin
- Operating margin
- Profit before tax
- Profit after tax
- Allocation to reserves
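A calculation model and its scenarios can be sketched in Python as a function plus named sets of inputs. The model below is a deliberately simplified assumption (it uses only some of the inputs listed above and treats non-operating expenses as a flat yearly amount), and all figures and scenario names are hypothetical.

```python
def project_pnl(revenue_year0, growth_rate, gross_margin,
                non_operating_expenses, tax_rate, years=5):
    """Simplified P&L projection; returns one dict of output values per year."""
    projection = []
    for year in range(1, years + 1):
        revenue = revenue_year0 * (1 + growth_rate) ** year
        gross_profit = revenue * gross_margin
        profit_before_tax = gross_profit - non_operating_expenses
        tax = max(profit_before_tax, 0) * tax_rate  # no tax on a loss
        projection.append({"year": year, "revenue": revenue,
                           "gross_profit": gross_profit,
                           "profit_before_tax": profit_before_tax,
                           "profit_after_tax": profit_before_tax - tax})
    return projection

# Each named set of input values is one what-if scenario.
scenarios = {
    "most probable": dict(revenue_year0=1000, growth_rate=0.10, gross_margin=0.40,
                          non_operating_expenses=100, tax_rate=0.30),
    "least probable": dict(revenue_year0=1000, growth_rate=0.25, gross_margin=0.45,
                           non_operating_expenses=100, tax_rate=0.30),
}
results = {name: project_pnl(**inputs) for name, inputs in scenarios.items()}
```

Keeping the inputs and outputs of each scenario together under a name is the part that an OLAP tool would persist, so that scenarios can be compared side by side later.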

How what-if analysis works in OLAP

An OLAP solution will have an end-user tool sitting on top of the OLAP server (like MS Excel OR a business modeling tool). You will create a calculation model in the modeling tool, and the input and output data for each scenario are stored in OLAP. This is what you call 'write-back', and it is a key value add from OLAP tools. This is one key area where OLAP differs from the Data Warehouse (which is read-only for good reasons).

Using the combination of a good end-user tool and an OLAP server, you can do the following:

- All input as well as output data can lie entirely in OLAP.
- Input and output data can lie partially in OLAP and partially in the end-user tool. The reason for partial OLAP storage is that you may like to keep only certain select scenarios in OLAP, while the rest of the scenarios could be more temporary OR non-serious scenarios, which you may not like to store in OLAP (but keep in the end-user tool).
- There is another cut for partial storage. You may like to keep only some part of the input value-set in OLAP and the rest in the analysis tool. This is because some input values may not be aligned to the multi-dimensional design of OLAP. For example, in the P&L calculation model, you may not have tax rate as a measure in the OLAP dimensional model. To write back the tax rate, you may have to change the dimensional model, which may OR may not be worth the effort.
- Tagging of scenarios in terms of how you want to present them. For example, 'most probable' to 'least probable' scenarios.


OLAP Data Allocation Analysis

The allocation engine is another differentiator for OLAP vs. Data Warehouse. Simplistically, it allows the users to automatically allocate, or in other words 'split', a value into multiple values.

OLAP Allocation Engine enables users to:

- Take a source data
- Define the basis of allocation
- Execute the allocation operation
- Store the allocated values to the target data

A Simple example:

You want to allocate the enterprise IT expense (source) to the IT expense for each line of business/department (target). The 'basis' of allocation is: 70% of the IT expense is to be allocated on the basis of the LAN IDs and 30% of the expense on the basis of the business revenue generated by the line of business (non-earning departments will not be included in this allocation basis). The operation will be 'proportional' allocation. This means that the expense will be allocated proportionally on the basis of the number of LAN IDs and the business revenue (for 70% and 30% of the enterprise IT expense respectively).

From the OLAP perspective, OLAP will pick the source data from within OLAP OR from the Data Warehouse (if it is not stored in OLAP). It will apply the allocation basis, and the output values (IT expense for the period calculated for each line of business) will be stored in OLAP (as a write-back). As a side note, the source and target may be outside of OLAP while using the allocation engine of the OLAP server.
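The 70/30 proportional allocation in this example can be worked through in Python. The expense amount, the department names, the LAN ID counts and the revenue figures are all hypothetical, and `allocate` is an illustrative helper rather than the interface of any OLAP allocation engine.

```python
def allocate(source_amount, bases, weights):
    """Proportional allocation of `source_amount` across targets.

    `weights` gives the share of the source driven by each basis;
    `bases` gives the basis value per target for each basis.
    """
    allocation = {}
    for basis_name, weight in weights.items():
        basis = bases[basis_name]
        total = sum(basis.values())
        for target, value in basis.items():
            share = source_amount * weight * value / total
            allocation[target] = allocation.get(target, 0.0) + share
    return allocation

it_expense = 100_000
bases = {
    "lan_ids": {"Retail": 60, "Corporate": 30, "HR": 10},
    "revenue": {"Retail": 300, "Corporate": 700},  # HR is non-earning, so excluded
}
weights = {"lan_ids": 0.70, "revenue": 0.30}
allocated = allocate(it_expense, bases, weights)
```

Because each basis distributes exactly its weighted share, the allocated values add back up to the source amount, which is a useful check on any allocation run.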

Here are the typical Allocation Analysis capabilities linked to OLAP:

- The source and basis can be formulas, so you can perform computations on existing data and use the result as the source OR basis of the allocation. For example, you can have IT expense as a formula summing the various IT expense lines stored in OLAP (like license fee, data centre operations expenses, network expenses). The basis of allocation is typically a formula.
- You can specify the method of operation of the allocation for a dimension. The operations range from simple to very complex. You can have:
    o Proportional allocation
    o Even allocation
    o Combination of proportional and even calculation etc.
- You can specify whether the allocated value is added to OR replaces the existing value of the target cell. Taking the same example of enterprise IT expense, the whole IT expense is lying in the IT account. When you run the allocation engine, the IT department will also be allocated a certain expense, as it also has some LAN IDs for its own employees OR on-site vendors. You can specify if you want to replace this value, add to this value (not a correct option) OR store it separately.

- You can specify an amount to add to OR multiply by the allocated value before the result is assigned to the target cell. This is an extension of what you call the 'allocation operation'. Taking the same IT example, let's say that you have calculated the IT expense figure for each line of business/department. You may like to add a 2% additional expense overhead as you allocate. This expense is to take care of any special IT initiatives which may not be linked to business case driven IT projects.
- You can exclude certain values within a dimension hierarchy so that they are left out of both the source data and the target data in the whole allocation process. Taking the same IT expense example, you may like a certain expense (like the license fee for the ERP system) not to be included, as that might be part of the overall licensing agreement between your parent company and the ERP vendor. This will ensure that this expense is neither considered as source nor applied to the target data.
- Within the allocation operation, you can define limits and tolerances. For example, not to allocate IT expense to departments which have fewer than 20 LAN IDs, OR not to allocate IT expense where it is less than 10,000 USD. These kinds of allocation rules necessitate iterative allocation calculations.
- You can store different versions of allocations, for example if the same department has two different allocations for IT expense. Say there is an enterprise IT expense allocated to that department (the example used throughout this page), and there is a direct allocation of the license fee for software which is bought exclusively by this department. You should be able to store both of these allocations in separate cells.
- You are able to handle allocations for special situations, for example if the basis is NULL. Say the LAN IDs are null in certain departments. This may not be a real-life scenario, but you are able to define that the allocation should consider this as non-applicable and should not allocate any expense.


OLAP Goal-Seek Data Analysis

Goal-seek is what-if analysis in OLAP, but in the reverse order. In a typical what-if analysis, a calculation model takes a set of input values to give the set of output values. Each different set of these 'input and output value combinations' is called a what-if analysis scenario. Depending upon your needs you can build different scenarios of input values, apply those scenarios to the calculation model, and generate the output values. Here we are traveling from input values to output values.

In goal-seeking, the direction is reversed. You have the output values for a scenario, and you want to find the input values which correspond to the given output values.

The OLAP goal-seeking capabilities cover the following scenarios:

- Single input value and single output value.
- Multiple input values and single output value. For example, you can have the same net profit margin (output value) with different combinations of operating and gross margins as input values.
- Multiple input values and multiple output values. For example, you can have the same P&L projections (output values of gross profit, operating profit, net profit..) with different combinations of input values (like revenue growth, gross margins, non-operational expense..).

An OLAP analysis solution can have the following goal-seek capabilities:

- Allows you to define which input values you want to change through goal-seek to achieve the given output values.
- Allows you to define the min-max limits for each input value, as goal-seek generates various options of input-value combinations.
- Allows you to define any tolerances which are acceptable for the output values.
- Apart from min-max limits, you can define various other constraints on the input values.
- You can accept or reject an option created by goal-seek.
- You can tag an option generated by goal-seek. For example, 'more probable' and 'less probable'.
- You can store the options generated by goal-seek in OLAP OR in the analysis tool sitting on top of OLAP.
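For the simplest case (single input value, single output value), goal-seek can be sketched as a search over the input within its min-max limits until the output is within tolerance. The sketch below uses bisection, which is only one possible search technique and assumes the output rises as the input rises; the `net_profit` model and its figures are hypothetical.

```python
def net_profit(growth_rate, revenue_year0=1000, gross_margin=0.40,
               expenses=100, tax_rate=0.30):
    """Hypothetical one-year calculation model: one input value, one output value."""
    revenue = revenue_year0 * (1 + growth_rate)
    return (revenue * gross_margin - expenses) * (1 - tax_rate)

def goal_seek(model, target, low, high, tolerance=1e-6, max_iterations=100):
    """Bisection search for the input in [low, high] that yields `target`.

    Assumes the model output increases with the input over that range.
    """
    if not model(low) <= target <= model(high):
        return None  # target is not reachable within the min-max limits
    for _ in range(max_iterations):
        mid = (low + high) / 2
        output = model(mid)
        if abs(output - target) <= tolerance:
            return mid
        if output < target:
            low = mid
        else:
            high = mid
    return (low + high) / 2

required_growth = goal_seek(net_profit, target=245, low=0.0, high=1.0)
```

The `low` and `high` arguments play the role of the min-max limits on the input value, and `tolerance` is the acceptable deviation of the output value from the goal.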


Chapter 4 Business Hierarchies in OLAP and Data Warehouse


OLAP and Data Warehouse Dimensional Model Hierarchy

The subject of hierarchies is relevant to both OLAP and BIPM Delivery - Data Warehousing/Marting. Modeling of data is done, both in DW and OLAP, keeping the hierarchies in mind. However, OLAP is the platform where the hierarchies are manifested in their final shape for the purpose of analysis.

Definition of DW/OLAP Hierarchy

Hierarchies are the paths over which any data (OR measure) is summarized. As you perform various vertical and horizontal navigation operations, you move along these paths of hierarchies. 'Office> City> State> Country> Continent> Globe' is one such example of a hierarchy for a location dimension. In this hierarchy, office is at the lowest level of the ladder and globe at the highest. So you can roll up sales revenue measure figures from the office level all the way up to the global level.

In terms of basic definitions linked to a DW/OLAP hierarchyA dimension level, which is participating in the hierarchy (OR a step in the ladder of hierarchy) is called a Level. For example 'city' in location dimension hierarchy will be a level. The sequence of these levels is called the Path. For example- the 'Office> City> State> Country> Continent > Globe' is the hierarchy path. The first OR the lowest level of hierarchy is called Leaf (office in the example) and highest OR last level is called Root (Globe in the example). Within the two consecutive levels, the higher level is called the Parent level and lower is called Child (for example 'City' is parent for 'office' and child for 'state' level). Business hierarchies are not limited to Business Intelligence. Business hierarchies exist since the data model was invented. If you look at your typical Entity Relationship diagram in your transaction system data models, you have child and parent entities. Child and parent entities are nothing, but representation of a hierarchy. One has to take a note, that Business intelligence dimensional modeling in most cases, does not invent hierarchies. These hierarchies exist in the data models of transaction OR source systems, and organizational data models & business processes. For example, if you haven't got a linkage between a Sales unit to a Business unit defined in your transaction system, don't expect your OLAP to have that hierarchy defined. In other words, just like data, the input on hierarchies 'mostly' comes from the source OR Source Systems Mapping. 39

With reference to entity-relationship diagrams in transaction systems: child and parent entities are reflected in your database design as referential integrity. For example, the 'office master table' has a 'city-code field', which is linked to the 'city master table'. The 'city master table' has a 'state-code field', which is linked to the 'state master table'. This is an example of the Office > City > State hierarchy of location. A transaction system is able to navigate the information from the lowest level of the hierarchy to the highest. That is why a data warehouse can store data in dimensional (de-normalized) form or relational (normalized) form without impacting the concept of hierarchy.

As you will see in the Additivity of Measures chapter, hierarchies drive the additivity and aggregation rules in a big way. You should read the hierarchy chapter before you go to the measures chapter.

There are different kinds of hierarchies, and each has a different role and context. Before we go into this classification, let us list the three main factors on which the different kinds of hierarchy structure are based.

The level-cardinality: whether a child level in the hierarchy can belong to one or to more than one parent dimension level.
The instance-cardinality: whether a child instance in the hierarchy can belong to one or to more than one parent instance.
The analysis criteria: whether the levels within the hierarchy path are used for one or for more than one analysis criterion. For example, you can use an office for the geography criterion as well as the sales organization criterion.

Type of Data Entity Hierarchies

Strict OR Simple Hierarchies

These are the hierarchies, which can be represented by a tree structure, whereby:

Each level in the tree has only one possible parent level, AND
Each instance can belong to only one defined level, AND
The criteria for analysis are the same.

Therefore, in a simple hierarchy, a child has only one parent, and a parent has only one child level. Simple hierarchies can be further categorized into symmetric, asymmetric, generalized and non-covering hierarchies.


Non-Strict Hierarchies

In a non-strict hierarchy:

Each level in the tree has only one possible parent level, AND
The criteria for analysis are the same, BUT
Each instance can belong to more than one instance in the parent dimension level.

Multiple and Alternate Path Hierarchies

In a multiple and alternate path hierarchy:

Each level in the tree can have more than one possible parent level, AND
Each instance can belong to more than one instance in the parent dimension level, AND
The criteria for analysis are the same.

Parallel Path Hierarchies

In this hierarchy structure there is flexibility on all factors:

Each level in the tree can have more than one possible parent level, AND
Each instance can belong to more than one instance in the parent dimension level, AND
The criteria for analysis are different.
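
The four hierarchy types listed above differ only in how they answer the three factors (level-cardinality, instance-cardinality, analysis criteria). A minimal sketch of that lookup follows; the function name, the boolean encoding and the "unclassified" label for combinations that the text does not describe are all assumptions.

```python
# Keys: (more than one parent level?, more than one parent instance?, different criteria?)
# Only the four combinations described in the text are mapped; anything else is
# reported as "unclassified" (an assumed label, not a term from the guide).
HIERARCHY_TYPES = {
    (False, False, False): "strict (simple)",
    (False, True, False): "non-strict",
    (True, True, False): "multiple or alternate path",
    (True, True, True): "parallel path",
}


def classify_hierarchy(multi_parent_levels, multi_parent_instances, different_criteria):
    """Return the hierarchy type implied by the three factors."""
    key = (multi_parent_levels, multi_parent_instances, different_criteria)
    return HIERARCHY_TYPES.get(key, "unclassified")


print(classify_hierarchy(False, True, False))
print(classify_hierarchy(True, True, True))
```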

Dimensional Model Simple Hierarchy

The simple or strict hierarchy is the simplest form of business hierarchy. The symmetric hierarchy is the simplest form of a simple hierarchy. It has the following properties:

All levels in the hierarchy must exist.
There is only one path from the bottom-most level to the top. In other words, a level cannot exist in any other hierarchy.

For example, you will not have City in any other hierarchy path. A simple example is the same 'Office > City > State > Country > Continent > Globe' path.

Asymmetric Hierarchy

An asymmetric hierarchy is the same as a symmetric hierarchy, except that the lower levels may not exist in some instances. Take the example of a sales channel, where you may have:

3rd Party Sales Agent > Sales Executive > Sales Manager > Sales Area > Sales Zone > Sales Region

For certain instances, a 3rd party sales agent may not exist and the sale is made directly by the sales executive. In those cases there will be no instances at the lowest level of this hierarchy.

Generalized Hierarchy

In a generalized hierarchy, there may be shared levels within two different hierarchy paths, but an instance at a lower level cannot belong to two instances in the parent level. For example:

3rd Party Sales Agent > Sales Executive > Indirect Channel Sales Manager > Sales Area Manager > Sales Zone Head > Sales Region Head
Sales Executive > Direct Channel Sales Manager > Direct Channel Sales Area Manager > Sales Zone Head > Sales Region Head

In the above two paths, the levels 'Sales Executive', 'Sales Zone' and 'Sales Region' are shared. However, any instance of a sales executive belongs to only one parent: either an Indirect Channel Sales Manager or a Direct Channel Sales Manager. In other words, if a sales executive works for two managers (one direct and one indirect), it is not a simple hierarchy.

Non-covering Hierarchy

In this hierarchy, an instance of an intermediate level may be missing. For example, take the hierarchy below:

Sales Executive > Direct Channel Sales Manager > Direct Channel Sales Area Manager > Sales Zone Head > Sales Region Head

For some instances, a direct channel sales manager may report directly to the sales zone head, because the zone is smaller and the sales area manager level does not exist in that zone.


Dimensional non Strict Hierarchy

A non-strict hierarchy has one similarity with and one difference from strict hierarchies. As in a strict hierarchy, each level in the hierarchy path has only one parent level. However, an instance (or member or value) in a level can belong to multiple instances in the parent level. For example, taking the path from the previous topic on simple hierarchies:

Sales Executive > Direct Channel Sales Manager > Direct Channel Sales Area Manager > Sales Zone Head > Sales Region Head

This can be made more complex with an example of a non-strict and non-covering hierarchy, whereby a sales executive reports to a direct channel sales manager (say, for sales of a certain set of products) and also reports directly to a direct channel sales area manager (for sales of a special set of products). If a sales executive works for only a single direct channel sales manager, it is a strict hierarchy. If a sales executive works for more than one manager, it is a non-strict hierarchy.

As you will see in the Additivity of Measures chapter within OLAP, unlike strict and simple hierarchies, you cannot have simple summarization of measures here. For example, you cannot take the sales revenue achieved by a sales executive and roll it up through both sales managers he reports to. If you do this, you will be double counting.
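
The double counting described above is easy to reproduce. In the sketch below, all names and figures are hypothetical. The weighted roll-up is one possible remedy and is an assumption that the text does not prescribe: it supposes that each executive's revenue can be split between managers using weights that sum to 1.

```python
# Hypothetical data: exec_1 reports to two managers, exec_2 to one.
sales = {"exec_1": 100.0, "exec_2": 50.0}
reports_to = {"exec_1": ["mgr_A", "mgr_B"], "exec_2": ["mgr_A"]}
# Assumed allocation weights per executive (each executive's weights sum to 1).
weights = {"exec_1": {"mgr_A": 0.75, "mgr_B": 0.25}, "exec_2": {"mgr_A": 1.0}}


def naive_rollup(sales, reports_to):
    """Credit each manager with the full revenue of every executive reporting to them."""
    totals = {}
    for executive, amount in sales.items():
        for manager in reports_to[executive]:
            totals[manager] = totals.get(manager, 0.0) + amount
    return totals


def weighted_rollup(sales, weights):
    """Split each executive's revenue across managers using the allocation weights."""
    totals = {}
    for executive, amount in sales.items():
        for manager, weight in weights[executive].items():
            totals[manager] = totals.get(manager, 0.0) + amount * weight
    return totals


naive = naive_rollup(sales, reports_to)
weighted = weighted_rollup(sales, weights)
print(sum(sales.values()), sum(naive.values()), sum(weighted.values()))
```

The naive manager totals add up to more than the revenue actually earned, because exec_1 is counted under both managers. The weighted totals add back to the true figure.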

Multiple Path Hierarchy

In a multiple path hierarchy, a dimensional level belongs to two different parent dimensional levels. However, the criterion of analysis is the same across the multiple paths.

In a multiple and alternate path hierarchy:

Each level in the tree can have more than one possible parent level, AND
Each instance can belong to more than one instance in the parent dimension level, AND
The criteria for analysis are the same.


An alternate path hierarchy is one where the hierarchy paths merge at certain points (generally once, at the higher levels), whereas it is called a multiple path hierarchy when the paths do not merge.

As an example, the following is an alternate path hierarchy, whereby the sales office level has two different parent levels (direct sales channel area and indirect sales channel area), but the paths merge at the level of sales region. Essentially, the hierarchy takes alternate paths to reach the same level in the end.

Sales Office > Direct Sales Channel Area > Direct Sales Channel Sector > Direct Sales Channel Zone > Sales Region
Sales Office > Indirect Sales Channel Area > Indirect Sales Channel Sector > Indirect Sales Channel Zone > Sales Region

The following is a multiple path hierarchy. In this example the paths do not merge.

Sales Office > Direct Sales Channel Area > Direct Sales Channel Sector > Direct Sales Channel Zone > Sales Region
Sales Office > Indirect Sales Channel Area > Indirect Sales Channel Sector > Indirect Sales Channel Zone > Indirect Sales Region

As you can see in the above examples, although the paths are either alternate or multiple, the criterion for analysis is the same: the sales channel and related measures. The next topic is parallel hierarchies, which combine multiple or alternate path hierarchies with different criteria for analysis.
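
The alternate versus multiple distinction comes down to whether two paths that start from the same leaf share any level further up. The sketch below encodes the example paths as lists; the function name and the 'shares a level above the leaf' test are assumptions made for illustration.

```python
def path_kind(path_a, path_b):
    """Return 'alternate' if the two paths merge above their common leaf, else 'multiple'."""
    if path_a[0] != path_b[0]:
        raise ValueError("both paths must start from the same leaf level")
    # Levels above the leaf that appear in both paths are the merge points.
    shared = set(path_a[1:]) & set(path_b[1:])
    return "alternate" if shared else "multiple"


direct = ["Sales Office", "Direct Sales Channel Area", "Direct Sales Channel Sector",
          "Direct Sales Channel Zone", "Sales Region"]
indirect_merging = ["Sales Office", "Indirect Sales Channel Area", "Indirect Sales Channel Sector",
                    "Indirect Sales Channel Zone", "Sales Region"]
indirect_separate = indirect_merging[:-1] + ["Indirect Sales Region"]

print(path_kind(direct, indirect_merging))
print(path_kind(direct, indirect_separate))
```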

Parallel Dimensional Hierarchy

Parallel hierarchies are the most flexible hierarchy paths.

In parallel path dimensional hierarchy system:

Each level in the tree can have more than one possible parent level, AND
Each instance can belong to more than one instance in the parent dimension level, AND
The criteria for analysis are different.


Sales Office > Direct Sales Channel Area > Direct Sales Channel Sector > Direct Sales Channel Zone > Sales Region (Sales Organization Dimension)
Sales Office > City > District > State > Country (Location Dimension)

In this example, the sales office level belongs to two different parent levels (direct sales channel area and city), an instance of an office (say, the Sydney harbor office) belongs to two different parent instances (Sydney west sales area and Sydney city), and the criteria for analysis are different (sales organization and location). Essentially, the difference between parallel and multiple hierarchies lies in the criteria for analysis.

A parallel hierarchy can be a dependent hierarchy, whereby the paths share the same levels, as in the following:

Sales Office > Direct Sales Channel Area > Direct Sales Channel Sector > Direct Sales Channel Zone > Sales Region > Country (Sales Organization Dimension)
Sales Office > City > District > State > Country (Location Dimension)

In the above example, the 'Country' level is shared. Country (like Sales Office) appears in two different dimensions.
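
A short sketch of what the parallel paths mean for roll-up: the same office-level sales can be summarized once by sales organization and once by location. The Sydney names come from the example above; the second office, its city and all figures are hypothetical.

```python
from collections import defaultdict

# Office-level sales (figures are hypothetical).
office_sales = {"Sydney Harbor Office": 120.0, "Office B (hypothetical)": 80.0}

# One parent mapping per analysis criterion.
sales_area_of = {
    "Sydney Harbor Office": "Sydney West Sales Area",
    "Office B (hypothetical)": "Sydney West Sales Area",
}
city_of = {
    "Sydney Harbor Office": "Sydney City",
    "Office B (hypothetical)": "City B (hypothetical)",
}


def roll_up(fact, parent_of):
    """Sum a measure from the child level to the parent level of one hierarchy path."""
    totals = defaultdict(float)
    for child, amount in fact.items():
        totals[parent_of[child]] += amount
    return dict(totals)


by_sales_area = roll_up(office_sales, sales_area_of)
by_city = roll_up(office_sales, city_of)
print(by_sales_area)
print(by_city)
```

Each office has exactly one parent within each criterion, so both roll-ups add back to the same grand total even though the intermediate groupings differ.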


Chapter 5 Additivity and Aggregation of Measures-Facts in OLAP Analysis

Additivity and the correct application of aggregation methods are fundamental to the success of Business Intelligence. The most common mistakes that modelers and designers make are in setting the right hierarchies and in establishing the right additivity and aggregation rules. You should go through the chapter on business dimensional hierarchies before you go through this chapter.

Additivity of Measures-Facts

A measure is additive when you are able to apply the sum operator across all the dimensions. Other aggregations on measures-facts use operators like average, maximum and minimum. OLAP tools nowadays have some capability to automatically enforce the correct additivity and averaging rules, given the hierarchy and the type of measure. However, the burden finally rests on the modelers and designers.

Before we move further, let's take a look at some more aspects, which will be useful:

Completeness of Hierarchy:

This means that all the possible instances of a hierarchy path must be available for it to be complete. There should be no missing data in the tables. For example, if you have country and continent levels in the location dimension, all the countries in the Europe continent should exist in the tables. Otherwise your summarization for the continent may not work.

Classification vs. descriptive attributes:

A classification attribute of a dimension is an attribute on which the aggregation takes place. A descriptive attribute is one that plays the role of a descriptor and is not the basis of aggregation. You will see that OLAP includes all classification attributes and some descriptive attributes.

Non-Additive Measures-Facts

A measure is non-additive when you cannot use the sum operator to generate the needed aggregation. Here are the non-additive measures:

Ratios and Percentages:

Some examples of ratios and percentages are the profit margin, the revenue to asset ratio, the default rate and so on. If you add the profit-margin % of all the products for a retail company, you may get a figure of much more than 100%. Therefore you need to first take the sum of the numerator (profits) and the sum of the denominator (revenue) for all the products and then calculate the ratio. When you apply aggregation on a ratio, you need to take the 'ratio of the sums' instead of the 'sum of the ratios'. The same rule applies to percentages. The same constraint also applies to averaging: just like the sum, the average operator will fail here.

Solution: Store the numerator and denominator in separate fields, and the ratio or percentage in a separate field as a derived measure-fact.

Measures of Intensity

These are mostly clinical and scientific measurements, for example blood pressure, temperature, gauge pressure, wind speed and so on. These kinds of measures can be handled with a simple average (for example, the average blood pressure of a sample of patients with the same medical history and between the ages of 40 and 50 comes out to 140/110). However, the designers could apply very specific rules to calculate the summations (like placing weights on different instances). This is primarily due to the scientific nature of the measures.

Solution: Use alternate aggregation functions like average, minimum and maximum. Track the constraints in the meta-data.

Grades and scales

These are the same as measures of intensity, but belong more to the business domain. Some examples are the risk grade of a customer and the level of risk scale associated with a loan.

Solution: Use alternate aggregation functions like average, minimum and maximum. Track the constraints in the meta-data.

Averages/Maximum/Minimum and similar measures

You may have derived measures in current data or historical snapshots. If you have averages or max-min figures, these will not be additive. In other words, attributes that do not contain the 'activity' measure but a 'characterization' measure do not follow the additive path. A 'characterization' measure is a measure that characterizes the activity. For example, while the 'turn-around time' (TAT) is an activity measure, the average TAT, maximum TAT and minimum TAT for a period characterize the TAT activity.

Solution: Use alternate aggregation functions like minimum (minimum of minimums) and maximum (maximum of maximums). Track the constraints in the meta-data. However, this solution does not apply to averages.
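
The rules in this section can be checked with a few lines of arithmetic. The sketch below uses hypothetical figures: two products for the 'ratio of the sums' rule from the ratios and percentages topic, and two days of turn-around times for the minimum, maximum and average rules.

```python
# (product, profit, revenue) rows; figures are hypothetical.
rows = [("Product A", 30.0, 100.0), ("Product B", 10.0, 200.0)]


def sum_of_ratios(rows):
    """Wrong aggregation: adds the per-product margins."""
    return sum(profit / revenue for _, profit, revenue in rows)


def ratio_of_sums(rows):
    """Correct aggregation: total profit divided by total revenue."""
    total_profit = sum(profit for _, profit, _ in rows)
    total_revenue = sum(revenue for _, _, revenue in rows)
    return total_profit / total_revenue


# Turn-around times per day (hypothetical), stored only as derived measures.
daily_tat = {"Mon": [4, 2, 7], "Tue": [5, 3]}
daily_min = {day: min(values) for day, values in daily_tat.items()}
daily_max = {day: max(values) for day, values in daily_tat.items()}
daily_avg = {day: sum(values) / len(values) for day, values in daily_tat.items()}

period_min = min(daily_min.values())  # minimum of minimums is safe
period_max = max(daily_max.values())  # maximum of maximums is safe
avg_of_avgs = sum(daily_avg.values()) / len(daily_avg)  # not the true period average
all_values = [v for values in daily_tat.values() for v in values]
true_avg = sum(all_values) / len(all_values)

print(sum_of_ratios(rows), ratio_of_sums(rows))
print(period_min, period_max, avg_of_avgs, true_avg)
```

The average of the daily averages differs from the true period average because the days hold different numbers of observations, which is why the minimum and maximum shortcut does not carry over to averages.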

Semi-Additive Measures-Facts

A measure is semi-additive when it can be aggregated on certain dimensions, but not on all the dimensions. Semi-additivity also describes the case where a summarization carries an index of inaccuracy. Semi-additivity happens primarily in four scenarios:

Semi-additive Missing OR dirty data

There are many reasons for, and manifestations of, missing and dirty data. One can refer to Customer Data Quality and reasons for bad data quality. Missing or dirty data gives a wrong picture of the overall summations. For example, if you have empty records for the sales of some offices, you will end up under-reporting the sales figures. Similarly, if you have wrong data in the same case, you may end up under-reporting or over-reporting.

The solution to this issue (you can refer to Data Correction techniques in customer data quality for a more detailed listing):

Don't include those offices in the summarization, and specify that the report does not include the specific instances.
Fill in an average figure for the instance. For example, if the sales figure for an office is not available for this month, you can (temporarily) assign a value that is close to the past patterns. One option is to use the average of the last 12 months of sales. This is generally the preferred solution. Apart from using the average for past periods, you can also use various extrapolation and forecasting techniques to calculate the stop-gap figure. This, however, should be done only if the number of cases of missing or dirty data is within certain limits (for example, 5% of offices not having data).
If a high proportion of instances have missing or dirty data, you need to apply the above-mentioned tricks and also mention the caveat that the data could have an inaccuracy index of some percentage.
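
The stop-gap approach above can be sketched as follows. The function name, the use of `None` to mark a missing figure and the return shape are assumptions. The 12-month average and the 5% limit are the examples given in the text, and the sketch assumes at least one office is present.

```python
from statistics import mean


def fill_missing_sales(current, history, max_missing_share=0.05):
    """Fill missing monthly sales with the average of the last 12 months.

    current: office -> sales figure for this month, or None if missing.
    history: office -> list of past monthly sales, oldest first.
    Returns the filled figures and a flag saying whether the report
    should carry an inaccuracy caveat (too many offices were missing).
    """
    missing = [office for office, value in current.items() if value is None]
    filled = dict(current)
    for office in missing:
        filled[office] = mean(history[office][-12:])
    missing_share = len(missing) / len(current)
    return filled, missing_share > max_missing_share


# Hypothetical offices and figures.
current = {"Office A": 100.0, "Office B": None, "Office C": 90.0, "Office D": 110.0}
history = {"Office B": [100.0] * 6 + [120.0] * 6}

filled, needs_caveat = fill_missing_sales(current, history)
print(filled["Office B"], needs_caveat)
```

Here one office out of four is missing, which is well above the 5% limit, so the figures are still filled in but the report is flagged as needing the inaccuracy caveat.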

Historical Data

Historical data falls into two categories, and each has a different treatment:

Historical snap-shots:

With historical snapshots, you can have situations where the instances of your dimensions have changed over time. For example, you may have changed your product categorization (the 'Home' and 'Small Business' product segments are now combined and re-categorized into 'Hand-Held' and 'Table-Top') or your sales locations (the 'New York' and 'New Jersey' sales areas are now combined and split into 5 sales areas, as the company business has grown). In this case, when there is an incompatibility across the instances, it is not possible to add the measures across those dimensions over time. Referring to the above example, it might still be possible to add the sales figures for an office instance over time, as that level in the location dimension has not changed. However, adding sales on a 'sales area' basis over the time dimension will not be possible.

Solution: You may like to apply some smart transformation rules to translate the historical categories into the new categories. For example, you may know that 'Hand-Held' typically formed 30% of the sales in the home and small business segments. You may apply this percentage to the historical snapshots and align them to the current categories.
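
A minimal sketch of such a transformation rule, assuming a historical snapshot that holds sales under the old 'Home' and 'Small Business' segments. The 30% share is the example from the text. Giving the remainder to 'Table-Top' is an assumption that follows from there being only two new categories, and the figures are hypothetical.

```python
HANDHELD_SHARE = 0.30  # share quoted in the text for 'Hand-Held'


def restate_snapshot(snapshot):
    """Translate old segment sales into the new 'Hand-Held' / 'Table-Top' categories."""
    combined = snapshot["Home"] + snapshot["Small Business"]
    handheld = combined * HANDHELD_SHARE
    # Assumption: everything that is not 'Hand-Held' falls under 'Table-Top'.
    return {"Hand-Held": handheld, "Table-Top": combined - handheld}


old_snapshot = {"Home": 600.0, "Small Business": 400.0}  # hypothetical figures
restated = restate_snapshot(old_snapshot)
print(restated)
```

Because the restated figures are estimates, they remain subject to the inaccuracy caveat discussed under missing or dirty data.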

Slowly Changing dimensions (SCD):

Please refer to Special Situations in Dimensional Modeling to understand what we mean by slowly changing dimensions. In short, a Business Intelligence system stores the various instances of changing values within a dimension as a new record or a new field, whereas the transaction system will typically overwrite them. For example, the ZIP code of a customer may be overwritten by a transaction system as the customer moves to a new address. The Business Intelligence system may append a new record for this change without erasing the previous ZIP code. This may be needed to do sales analysis on the basis of ZIP codes in the previous months.

In the case of SCD, you are not able to summarize the data on the time dimension. For example, if the customer has changed ZIP code, you will not be able to summarize the sales related to the customer, as there are two records related to the customer (with two different ZIP codes). This statement has to be taken with a little care. It is 'possible' to summarize, but you have to place an extra filter so that you do not double count.

Solution: Track the constraints in the meta-data and apply the right filters.

Snap-shot data

There are many measures that are not the 'activity' for a period, but the state of the measure at a given point in time. An example of 'at the moment' measures is the line items in a balance sheet (which gives the assets and liabilities at a given point in time). An example of 'activity for a period' is the line items in a profit and loss account (which gives the expenses and revenues for a given period). 'Snapshot' measures cannot be added over the time dimension. If you want to find out the account balance for the year, you will not add the balances at the end of each of the 12 months. As you will see in the averaging of measures topic, these are best handled through averaging.

Solution: Apply other operators like average, maximum and minimum.

Category data

When you have measures that provide the 'type of magnitude' and not the 'magnitude', it is not possible to add them across some dimensions. As an example of this difference, 'number of sales units' and 'units of inventory' are magnitude measures, whereas 'number of product categories sold' and 'number of inventory part types' are type-of-magnitude measures.

As an example of this semi-additivity, if you have sold 2000 different product models across the US this month, and 1500 models in the previous month, you cannot sum them up to give the total number of models sold in the US over the past two months. You will need a way to identify the distinct models sold across this period.

Solution: Apply other operators like average, maximum and minimum.
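
The snapshot and category cases above can both be illustrated in a few lines. The month-end balances are hypothetical. The model counts (2000 and 1500) come from the example, while the overlap of 1200 models between the two months is an assumption chosen only to make the point.

```python
from statistics import mean

# Snapshot data: month-end balances (hypothetical) are averaged, not summed.
month_end_balances = [100.0, 120.0, 110.0]
average_balance = mean(month_end_balances)

# Category data: 2000 models sold this month and 1500 in the previous month.
# The overlap between the two months (1200 models) is assumed for illustration.
this_month = {f"model_{i}" for i in range(2000)}
previous_month = {f"model_{i}" for i in range(800, 2300)}

naive_total = len(this_month) + len(previous_month)  # counts repeated models twice
distinct_total = len(this_month | previous_month)  # distinct models over both months

print(average_balance)
print(naive_total, distinct_total)
```

The distinct count needs the underlying model identifiers for each period, so the warehouse has to keep them available and not only the monthly counts.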

Business Intelligence End to End

Chapter 1 BI Architecture Components

These are the building blocks of the BI architecture. All possible architecture scenarios will have some or all of these components. This chapter endeavors to demystify the definitions of often misused terms.

BI Data Warehouse Source System

These are the feeder systems and the starting point of the data flow in the overall BI architecture.

Source systems are all the data pipes feeding the Staging Area. Typically, any transformation on the data is done after the data has been picked from the source systems in its raw or unchanged form. Further details of the source systems can be found in Data Warehouse Design & Architecture in Data Warehousing.

Core source Systems

These are the core systems, which mostly have a well-organized database and set schedules, with the data being updated online. Typical examples are core product manufacturing systems, accounting systems, money management systems, commission/sales compensation systems and tightly coupled job/workflow systems.

Field OR Front-end Source Systems

These are the systems used primarily by the customer acquisition and retention staff. They include customer service systems, sales automation systems, leads management systems and campaign management systems.

Modeling and Analysis systems

This family of systems includes budgeting, planning and forecasting, pricing and valuation systems.

External Data

This is the data that you receive from regulators, credit bureaus, medical bureaus, industry associations, market research firms, database marketing companies and other sources. This data (unlike data provided by your suppliers and customers, which goes into your core or field systems) follows a standard format generally governed by the supplier.

Non-Data Base/Desk top Sources

A wealth of information and critical operational data resides in spreadsheets and MS Access tables on the desktops or local servers of the organization. If you want to shock yourself, just make a study of the number of Excel-based applications that have become a critical part of operational delivery and reporting. The numbers could run into the hundreds, if not thousands.

Off-line Databases

As no organization's data management strategy goes through systematic and planned growth, it is possible that you have some offline databases used by the users to generate their reports. For the sake of speed, you may be tempted to use such a database as a source system. However, in the medium to long run this may become counterproductive, because you will make that offline database redundant.

DWBI Essen