
Business Analytics Notes



Metadata – Different Types of Metadata

Metadata is "data about data".

Metadata are traditionally found in the card catalogs of libraries. As information has become increasingly digital, metadata are also used to describe digital data, using metadata standards specific to a particular discipline. By describing the contents and context of data files, metadata greatly increase the usefulness of the original data/files. For example, a webpage may include metadata specifying what language it is written in, what tools were used to create it, and where to go for more on the subject, allowing browsers to automatically improve the experience of users.

There are three main types of metadata:

• Descriptive metadata describes a resource for purposes such as discovery and identification. It can include elements such as title, abstract, author, and keywords.

• Structural metadata indicates how compound objects are put together, for example, how pages are ordered to form chapters.

• Administrative metadata provides information to help manage a resource, such as when and how it was created, file type and other technical information, and who can access it. There are several subsets of administrative data; two that are sometimes listed as separate metadata types are:

− Rights management metadata, which deals with intellectual property rights, and
− Preservation metadata, which contains information needed to archive and preserve a resource.

Metadata serves several important functions:

• Resource discovery
o Allowing resources to be found by relevant criteria;
o Identifying resources;
o Bringing similar resources together;
o Distinguishing dissimilar resources;
o Giving location information.

• Organizing e-resources
o Organizing links to resources based on audience or topic.
o Building these pages dynamically from metadata stored in databases.

• Facilitating interoperability
o Using defined metadata schemes, shared transfer protocols, and crosswalks between schemes, resources across the network can be searched more seamlessly.
o Cross-system search, e.g., using the Z39.50 protocol.
o Metadata harvesting, e.g., the OAI protocol.

• Digital identification
o Elements for standard numbers, e.g., ISBN.
o The location of a digital object may also be given using a file name, a URL, or a persistent identifier, e.g., PURL (Persistent URL) or DOI (Digital Object Identifier).
o Combined metadata can act as a set of identifying data, differentiating one object from another for validation purposes.

• Archiving and preservation
o Challenges: digital information is fragile and can be corrupted or altered, and it may become unusable as storage technologies change.
o Metadata is key to ensuring that resources will survive and continue to be accessible into the future. Archiving and preservation require special elements: to track the lineage of a digital object, to detail its physical characteristics, and to document its behavior in order to emulate it in future technologies.

Source: NISO (2004). Understanding Metadata. Bethesda, MD: NISO Press, pp. 1-2.

3.3 Getty's definitions of types of metadata

| Type | Definition | Examples |
|------|------------|----------|
| Administrative | Metadata used in managing and administering information resources | Acquisition information; rights and reproduction tracking; documentation of legal access requirements; location information; selection criteria for digitization; version control and differentiation between similar information objects; audit trails created by record-keeping systems |
| Descriptive | Metadata used to describe or identify information resources | Cataloging records; finding aids; specialized indexes; hyperlinked relationships between resources; annotations by users; metadata for record-keeping systems generated by records creators |
| Preservation | Metadata related to the preservation management of information resources | Documentation of physical condition of resources; documentation of actions taken to preserve physical and digital versions of resources, e.g., data refreshing and migration |
| Technical | Metadata related to how a system functions or metadata behave | Hardware and software documentation; digitization information, e.g., formats, compression ratios, scaling routines; tracking of system response times; authentication and security data, e.g., encryption keys, passwords |
| Use | Metadata related to the level and type of use of information resources | Exhibit records; use and user tracking; content re-use and multi-versioning information |

Why Data Warehouses Fail

User Adoption.

This is the single measure of success for any BI project: are the users using it? If not, it has failed.

Users Don’t Know What They Don’t Know.

It is utterly pointless paying a Business Analyst to spend weeks asking users what they want from a BI project. Users don't know - and will NEVER know for absolute certain - UNTIL they see something. What does that mean? It means that waterfall/SDLC as a methodology will never be appropriate for developing BI solutions. If you are utilising a Gantt chart for managing your BI project right now, you are heading for failure! It has become more widely known that Agile or Scrum methodologies work best for BI. Incremental, iterative steps are the way to go.

All BI Solutions Will Require Change.

Whether change comes from external or internal influences does not matter: it is inevitable. If your toolset/method/skills cannot embrace change, you are going to fail. If your ETL processes are like plates of spaghetti, then change is not going to be easy for you. A data warehouse is a journey, not a destination, and often you will need to change direction.

Everybody Loves Kimball.

And why not? A star-schema or dimensional model, after all, is the single goal of any BI developer. A single fact table with some nice dimensions around it is Nirvana for an end user. They're easy to understand. It's the goal for self-serve reporting. What can go wrong? Everything! The point is you can rarely go from source to star-schema without having to do SOMETHING to the data along the way. Especially if your fact table requires information from more than one table or data source, you face a lot of hard work to get that data into your star-schema. In a poll I conducted on LinkedIn a while back, I asked BI experts how much of a BI project was spent just getting the data right. The answer came back as around 75-80%. Over three-quarters of any BI project is spent just getting the data into a shape fit for BI! So when (not if) you need to make changes, you will have a tougher job on your hands (especially if you built it using traditional ETL tools).

Everybody Ignores Inmon and Linstedt.

Bill Inmon, the Father of Data Warehousing, has written numerous books on the subject. His 'Corporate Information Factory' philosophy makes a lot of sense: take your data from all your sources and compose it into Third Normal Form (3NF) in your Enterprise Data Warehouse (EDW). Why? It makes your data 'generic' and keeps it at its lowest level of granularity. Once you have your data in this state, it makes the perfect source for your Kimball star-schemas. Dan Linstedt extends the 3NF model by introducing Data Vault, which provides a means of maintaining historical information about your data and where it came from. A unique feature of Data Vault is that you can understand what state your data warehouse was in at any point in time. So why does everybody ignore Inmon and Linstedt? Most likely because their models are too complex to build and maintain using traditional ETL tools. Instead, most developers will manage all the staging and data transformation in ancillary tables, in a way only they understand, using their favourite ETL tool. Good luck for when they finally leave your organisation!

ETL Tools Do Not Build Data Warehouses.

ETL tools were designed to move data from one place to another. Over the years, extra bits may have been cobbled on to ease the job of a data warehouse developer, but they still rely on too many other things: a decent target data warehouse model, for example. As discussed in the points above, ETL tools offer no help in providing a fast and effective means of delivering a design like a Third Normal Form or Data Vault Enterprise Data Warehouse. This means you need a data modelling tool, plus the skills to design such architectures. Thankfully, we live in the 21st century, where true data warehouse automation tools are emerging. These will help lead data warehousing out of the dark ages - especially with the advent of Big Data. Inmon and Linstedt have written the rules; now let the data warehouse automation tools take over!

Summary

To succeed in your data warehouse project, take an approach that embraces rapid change, and surround yourself with the tools, methods and people that are willing and able to support that.

While you need a star-schema for reporting and analytics, don't try to take shortcuts to get there. You cannot go from source to star-schema without doing something in between. Bill Inmon and Dan Linstedt have described how that 'in-between' bit should look. Ignore them at your peril! If you have multiple data sources, then DO look at building a 3NF or Data Vault EDW. To help you do that, look at getting a true Data Warehouse Automation tool.

If we are to succeed with Big Data, we need to be truly successful in data warehousing.

MOLAP and ROLAP

In the OLAP world, there are mainly two different types: Multidimensional OLAP (MOLAP) and Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to technologies that combine MOLAP and ROLAP.

MOLAP

This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats.

Advantages:

Excellent performance: MOLAP cubes are built for fast data retrieval, and are optimal for slicing and dicing operations.

Can perform complex calculations: All calculations have been pre-generated when the cube is created. Hence, complex calculations are not only doable, but they return quickly.

Disadvantages:

Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data. Indeed, this is possible. But in this case, only summary-level information will be included in the cube itself.

Requires additional investment: Cube technology is often proprietary and may not already exist in the organization. Therefore, to adopt MOLAP technology, chances are that additional investments in human and capital resources will be needed.
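As a rough relational analogy only (actual MOLAP cubes use proprietary storage, and every table and column name here is hypothetical), pre-generating calculations is like materializing a summary table once at load time, so queries read pre-computed figures instead of aggregating detail rows:

```sql
-- Hypothetical sketch: a MOLAP cube pre-computes its aggregates when built,
-- much like materializing this summary table once instead of per query.
CREATE TABLE sales_cube AS
SELECT product_id,
       region_id,
       sale_month,
       SUM(amount) AS total_amount,
       COUNT(*)    AS sale_count
FROM   sales_fact                 -- assumed detail-level fact table
GROUP  BY product_id, region_id, sale_month;

-- Slicing the "cube" then reads pre-computed values directly (fast retrieval):
SELECT total_amount
FROM   sales_cube
WHERE  product_id = 42 AND region_id = 7 AND sale_month = '2012-05';
```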

ROLAP

This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.
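As a minimal illustration (the fact table and columns are hypothetical, reusing the names from the sketch above), slicing to one region and dicing on one year is nothing more than predicates added to the generated query:

```sql
-- Each slice/dice action the user takes in a ROLAP tool becomes another
-- predicate in the SQL the tool generates against the relational store.
SELECT product_id,
       SUM(amount) AS total_amount
FROM   sales_fact
WHERE  region_id = 7                      -- the "slice" chosen by the user
  AND  sale_date >= DATE '2012-01-01'     -- the "dice" on the time dimension
  AND  sale_date <  DATE '2013-01-01'
GROUP  BY product_id;
```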

Advantages:

Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation on data size of the underlying relational database. In other words, ROLAP itself places no limitation on data amount.

Can leverage functionalities inherent in the relational database: Often, the relational database already comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.

Disadvantages:

Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries) in the relational database, the query time can be long if the underlying data size is large.

Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building complex out-of-the-box functions into the tool, as well as the ability for users to define their own functions.

HOLAP

HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. When detail information is needed, HOLAP can "drill through" from the cube into the underlying relational data.
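In relational terms the pattern looks roughly like this, reusing the hypothetical tables from the earlier sketches: summary questions are answered from the pre-aggregated store, and drill-through falls back to the detail rows.

```sql
-- Summary request: served from the cube-like, pre-aggregated table.
SELECT total_amount
FROM   sales_cube
WHERE  product_id = 42 AND region_id = 7 AND sale_month = '2012-05';

-- Drill-through: the same slice re-queried against the underlying detail rows.
SELECT sale_date, amount
FROM   sales_fact
WHERE  product_id = 42
  AND  region_id  = 7
  AND  sale_date >= DATE '2012-05-01'
  AND  sale_date <  DATE '2012-06-01';
```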

Star Schema & Multi-Dimensional Databases

Introduction to Multi-Dimensional Databases

Evolving from econometric research conducted at MIT in the 1960s, the multi-dimensional database has matured into the database engine of choice for data analysis applications. This application category is commonly referred to as OLAP (On-Line Analytical Processing). The multi-dimensional database has become popular with industry because it allows high-performance access to, and analysis of, large amounts of related data across several applications operating in different parts of the organization. Given that business applications typically operate in a multi-tier environment, often using different technologies on different platforms, it is important that such widely dispersed data can be accessed and analysed in a meaningful way.

The multi-dimensional database may also offer a better concept for visualising the way we already think of data in the real world. For example, most business managers already think of data in a multi-dimensional way, such as when they think of specific products in specific markets over certain periods of time. The multi-dimensional database attempts to present such data to the end user in a useful way.



1.1 Overview of a Multi-Dimensional Database System

Relational databases store data in a two-dimensional format, where tables of data are presented as rows and columns. Multi-dimensional database systems extend this to provide a multi-dimensional view of the data (Rand). For example, in multi-dimensional analysis, data entities such as products, regions, customers and dates may all represent different dimensions. This intrinsic feature of the database structure will be covered in depth in subsequent sections of this paper.

Some further advantages to this database model are:

• The ability to analyse large amounts of data with very fast response times.
• To "slice and dice" through data, and "drill down or roll up" through various dimensions of the defined data structure.
• To quickly identify trends or problem areas that would otherwise have been overlooked in an industry environment.

Multi-dimensional data structures can be implemented with multi-dimensional databases, or they can be implemented in a relational database management system using techniques such as the "Star Schema" and the "Snowflake Schema" (Weldon 1995).

The Star Schema is a means of aggregating data based on a set of known database dimensions, attempting to store a multi-dimensional data structure in a two-dimensional relational database management system (RDBMS). The Snowflake Schema is an extension of the Star Schema built on the principle of applying additional dimension tables to the Star Schema in an RDBMS.
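As a minimal sketch of a star schema in an RDBMS (all table and column names are hypothetical, not taken from the source text), a single fact table references each dimension through a foreign key:

```sql
-- A minimal, hypothetical star schema: one fact table, two dimensions.
CREATE TABLE product_dim (
    product_key  INTEGER PRIMARY KEY,
    product_name VARCHAR(100),
    category     VARCHAR(50)     -- denormalized into the dimension (star style)
);

CREATE TABLE date_dim (
    date_key     INTEGER PRIMARY KEY,
    full_date    DATE,
    month_name   VARCHAR(20),
    year_number  INTEGER
);

CREATE TABLE sales_fact (
    product_key  INTEGER REFERENCES product_dim (product_key),
    date_key     INTEGER REFERENCES date_dim (date_key),
    amount       DECIMAL(12, 2),
    quantity     INTEGER
);
```

Snowflaking this design would move category out of product_dim into its own table, linked back by a foreign key.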

Slowly Changing Dimensions

Slowly Changing Dimensions (SCD) are dimensions that change slowly over time, rather than on a regular, time-based schedule. In a data warehouse there is a need to track changes in dimension attributes in order to report historical data. In other words, implementing one of the SCD types should enable users to assign the proper dimension attribute value for a given date. Examples of such dimensions are customer, geography, and employee.

There are many approaches to dealing with SCDs. The most popular are:

Type 0 - The passive method
Type 1 - Overwriting the old value
Type 2 - Creating a new additional record
Type 3 - Adding a new column
Type 4 - Using a historical table
Type 6 - Combining the approaches of types 1, 2 and 3 (1+2+3=6)

Type 0 - The passive method. In this method no special action is performed upon dimensional changes. Some dimension data can remain the same as when it was first inserted; other data may be overwritten.

Type 1 - Overwriting the old value. In this method no history of dimension changes is kept in the database. The old dimension value is simply overwritten by the new one. This type is easy to maintain and is often used for data whose changes are caused by processing corrections (e.g. removing special characters, correcting spelling errors).

Before the change:

| Customer_ID | Customer_Name | Customer_Type |
|-------------|---------------|---------------|
| 1           | Cust_1        | Corporate     |

After the change:

| Customer_ID | Customer_Name | Customer_Type |
|-------------|---------------|---------------|
| 1           | Cust_1        | Retail        |
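As SQL against a hypothetical customer_dim table matching the rows above, a Type 1 change is a single in-place update:

```sql
-- SCD Type 1: overwrite in place; the old value ('Corporate') is lost.
UPDATE customer_dim
SET    customer_type = 'Retail'
WHERE  customer_id = 1;
```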

Type 2 - Creating a new additional record. In this methodology, all history of dimension changes is kept in the database. You capture an attribute change by adding a new row with a new surrogate key to the dimension table. Both the prior and new rows contain the natural key (or another durable identifier) as an attribute. 'Effective date' and 'current indicator' columns are also used in this method. There can be only one record with the current indicator set to 'Y'. For the 'effective date' columns, i.e. start_date and end_date, the end_date for the current record is usually set to the value 9999-12-31. Introducing changes to the dimensional model in type 2 can be a very expensive database operation, so it is not recommended for dimensions where a new attribute could be added in the future.

Before the change:

| Customer_ID | Customer_Name | Customer_Type | Start_Date | End_Date   | Current_Flag |
|-------------|---------------|---------------|------------|------------|--------------|
| 1           | Cust_1        | Corporate     | 22-07-2010 | 31-12-9999 | Y            |

After the change:

| Customer_ID | Customer_Name | Customer_Type | Start_Date | End_Date   | Current_Flag |
|-------------|---------------|---------------|------------|------------|--------------|
| 1           | Cust_1        | Corporate     | 22-07-2010 | 17-05-2012 | N            |
| 2           | Cust_1        | Retail        | 18-05-2012 | 31-12-9999 | Y            |
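Expressed as SQL against the same hypothetical customer_dim table, a Type 2 change is two statements: expire the active row, then insert the new version under a fresh surrogate key:

```sql
-- SCD Type 2, step 1: close off the currently active row.
UPDATE customer_dim
SET    end_date     = DATE '2012-05-17',
       current_flag = 'N'
WHERE  customer_name = 'Cust_1'
  AND  current_flag  = 'Y';

-- Step 2: insert the new version with a new surrogate key and open-ended end_date.
INSERT INTO customer_dim
       (customer_id, customer_name, customer_type, start_date, end_date, current_flag)
VALUES (2, 'Cust_1', 'Retail', DATE '2012-05-18', DATE '9999-12-31', 'Y');
```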


Type 3 - Adding a new column. In this type, usually only the current and previous values of the dimension are kept in the database. The new value is loaded into the 'current/new' column and the old one into the 'old/previous' column. Generally speaking, the history is limited to the number of columns created for storing historical data. This is the least commonly needed technique.

Before the change:

| Customer_ID | Customer_Name | Current_Type | Previous_Type |
|-------------|---------------|--------------|---------------|
| 1           | Cust_1        | Corporate    | Corporate     |

After the change:

| Customer_ID | Customer_Name | Current_Type | Previous_Type |
|-------------|---------------|--------------|---------------|
| 1           | Cust_1        | Retail       | Corporate     |
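In SQL a Type 3 change is a single update that shifts the outgoing value into the previous-value column (hypothetical names as above):

```sql
-- SCD Type 3: keep one generation of history in a dedicated column.
-- previous_type is assigned before current_type changes, so it captures the
-- old value (standard SQL reads the pre-update row; MySQL evaluates left to
-- right, and this ordering works either way).
UPDATE customer_dim
SET    previous_type = current_type,
       current_type  = 'Retail'
WHERE  customer_id = 1;
```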

Type 4 - Using a historical table. In this method a separate historical table is used to track all of a dimension's attribute changes, one history table per dimension. The 'main' dimension table keeps only the current data, e.g. customer and customer_history tables.

Current table:

| Customer_ID | Customer_Name | Customer_Type |
|-------------|---------------|---------------|
| 1           | Cust_1        | Corporate     |

Historical table:

| Customer_ID | Customer_Name | Customer_Type | Start_Date | End_Date   |
|-------------|---------------|---------------|------------|------------|
| 1           | Cust_1        | Retail        | 01-01-2010 | 21-07-2010 |
| 1           | Cust_1        | Other         | 22-07-2010 | 17-05-2012 |
| 1           | Cust_1        | Corporate     | 18-05-2012 | 31-12-9999 |
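A hedged SQL sketch of the same change, assuming customer and customer_history tables shaped like those above:

```sql
-- SCD Type 4, step 1: close the previous period in the history table.
UPDATE customer_history
SET    end_date = DATE '2012-05-17'
WHERE  customer_id = 1
  AND  end_date = DATE '9999-12-31';

-- Step 2: record the new version's period alongside the earlier ones.
INSERT INTO customer_history
       (customer_id, customer_name, customer_type, start_date, end_date)
VALUES (1, 'Cust_1', 'Corporate', DATE '2012-05-18', DATE '9999-12-31');

-- Step 3: the 'main' dimension table is simply overwritten (as in type 1).
UPDATE customer
SET    customer_type = 'Corporate'
WHERE  customer_id = 1;
```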

Type 6 - Combining the approaches of types 1, 2 and 3 (1+2+3=6). In this type the dimension table has additional columns such as:

• current_type - keeps the current value of the attribute. All history records for a given item have the same current value.
• historical_type - keeps the historical value of the attribute. All history records for a given item can have different values.
• start_date - keeps the start date of the attribute's 'effective date' period.
• end_date - keeps the end date of the attribute's 'effective date' period.
• current_flag - keeps information about the most recent record.

In this method, to capture an attribute change we add a new record, as in type 2. The current_type information is overwritten with the new value, as in type 1. We store the history in the historical_type column, as in type 3.

| Customer_ID | Customer_Name | Current_Type | Historical_Type | Start_Date | End_Date   | Current_Flag |
|-------------|---------------|--------------|-----------------|------------|------------|--------------|
| 1           | Cust_1        | Corporate    | Retail          | 01-01-2010 | 21-07-2010 | N            |
| 2           | Cust_1        | Corporate    | Other           | 22-07-2010 | 17-05-2012 | N            |
| 3           | Cust_1        | Corporate    | Corporate       | 18-05-2012 | 31-12-9999 | Y            |
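As a sketch in SQL (same hypothetical customer_dim table), one Type 6 change touches the table in three ways:

```sql
-- Type 2 part: expire the active row.
UPDATE customer_dim
SET    end_date = DATE '2012-05-17', current_flag = 'N'
WHERE  customer_name = 'Cust_1' AND current_flag = 'Y';

-- Type 3 part: the new row stores the value valid for its period in historical_type.
INSERT INTO customer_dim
       (customer_id, customer_name, current_type, historical_type,
        start_date, end_date, current_flag)
VALUES (3, 'Cust_1', 'Corporate', 'Corporate',
        DATE '2012-05-18', DATE '9999-12-31', 'Y');

-- Type 1 part: overwrite current_type on every version of this customer.
UPDATE customer_dim
SET    current_type = 'Corporate'
WHERE  customer_name = 'Cust_1';
```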

Snowflake Schema Advantages & Limitations

Advantages of the Snowflake Schema

• The main advantage of the Snowflake Schema is the improvement in query performance due to minimized disk storage requirements and joins against smaller lookup tables.
• It is easier to maintain.
• It increases flexibility.

Disadvantages of the Snowflake Schema

• The main disadvantage of the Snowflake Schema is the additional maintenance effort needed due to the increased number of lookup tables.
• It makes queries much more difficult to create, because more tables need to be joined.

The Star schema vs Snowflake schema comparison brings four fundamental differences to the fore:

1. Data optimization: 


The Snowflake model uses normalized data, i.e. the data is organized inside the database in order to eliminate redundancy, and thus helps to reduce the amount of data. The hierarchy of the business and its dimensions is preserved in the data model through referential integrity.

Figure 1 – Snowflake model

The Star model, on the other hand, uses de-normalized data. In the star model, dimensions refer directly to the fact table, and the business hierarchy is not implemented via referential integrity between dimensions.

Figure 2 – Star model

2. Business model:

A primary key is a single unique key (data attribute) that is selected for a particular piece of data. In the previous 'advertiser' example, Advertiser_ID will be the primary key (business key) of a dimension table. A foreign key (referential attribute) is a field in one table that matches the primary key of another dimension table. In our example, Advertiser_ID could be a foreign key in Account_dimension.

In the snowflake model, the business hierarchy of the data model is represented by a primary key - foreign key relationship between the various dimension tables.

In the star model, all required dimension tables connect directly through foreign keys held in the fact table.
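A hypothetical DDL sketch of the advertiser example makes the difference concrete; none of these table definitions come from the original article:

```sql
-- Snowflake style: the business hierarchy lives in a primary key - foreign key
-- chain between dimension tables (account rolls up to advertiser).
CREATE TABLE advertiser_dim (
    advertiser_id   INTEGER PRIMARY KEY,
    advertiser_name VARCHAR(100),
    address         VARCHAR(200)
);

CREATE TABLE account_dim (
    account_id      INTEGER PRIMARY KEY,
    advertiser_id   INTEGER REFERENCES advertiser_dim (advertiser_id),
    account_name    VARCHAR(100)
);

-- The fact table itself carries only foreign keys to its dimensions;
-- in a pure star model the dimensions would be denormalized rather than chained.
CREATE TABLE campaign_fact (
    account_id      INTEGER REFERENCES account_dim (account_id),
    advertiser_id   INTEGER REFERENCES advertiser_dim (advertiser_id),
    revenue         DECIMAL(12, 2)
);
```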

3. Performance:

The third differentiator in this Star schema vs Snowflake schema face-off is the performance of these models. The Snowflake model has a higher number of joins between the dimension tables and the fact table, and hence performance is slower. For instance, if you want to know the Advertiser details, this model will ask for a lot of information, such as the Advertiser name, ID and address, for which the advertiser and account tables need to be joined with each other and then joined with the fact table.

The Star model, on the other hand, has fewer joins between the dimension tables and the fact table. In this model, if you need information on the advertiser, you just have to join the Advertiser dimension table with the fact table.
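The join difference shows up directly in the SQL, using the hypothetical tables sketched above:

```sql
-- Snowflake: reaching advertiser attributes takes an extra hop
-- through account_dim before touching the fact table.
SELECT a.advertiser_name, SUM(f.revenue) AS total_revenue
FROM   campaign_fact  f
JOIN   account_dim    ac ON ac.account_id   = f.account_id
JOIN   advertiser_dim a  ON a.advertiser_id = ac.advertiser_id
GROUP  BY a.advertiser_name;

-- Star: the denormalized advertiser dimension joins to the fact table directly.
SELECT a.advertiser_name, SUM(f.revenue) AS total_revenue
FROM   campaign_fact  f
JOIN   advertiser_dim a ON a.advertiser_id = f.advertiser_id
GROUP  BY a.advertiser_name;
```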


4. ETL:

The Snowflake model loads the data marts and hence the ETL job is more complex in design and cannot be parallelized, as the dependency model restricts it.

The Star model loads the dimension tables without dependencies between dimensions, and hence the ETL job is simpler and can achieve higher parallelism.

This brings us to the end of the Star schema vs Snowflake schema debate. But where exactly do these approaches make sense?

Where do the two methods fit in?

With the snowflake model, dimension analysis is easier. For example, ‘how many accounts or campaigns are online for a given Advertiser?’

The star schema model is useful for Metrics analysis, such as – ‘What is the revenue for a given customer?’
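Against the hypothetical advertiser tables sketched earlier, the two questions translate roughly as follows (the status column is also an assumption):

```sql
-- Dimension analysis (snowflake-friendly): how many accounts are online
-- for a given advertiser? No fact table is needed at all.
SELECT COUNT(*) AS online_accounts
FROM   account_dim    ac
JOIN   advertiser_dim a ON a.advertiser_id = ac.advertiser_id
WHERE  a.advertiser_name = 'Acme'
  AND  ac.status = 'online';          -- hypothetical status column

-- Metrics analysis (star-friendly): total revenue for a given customer,
-- answered with a single join to the fact table.
SELECT SUM(f.revenue) AS total_revenue
FROM   campaign_fact  f
JOIN   advertiser_dim a ON a.advertiser_id = f.advertiser_id
WHERE  a.advertiser_name = 'Acme';
```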