DW 2.0 The Next Generation Data warehousing Lakshminarasu Chenduri Data Warehousing Practice

DW 2.0

DESCRIPTION

This presentation discusses next-generation data warehousing. Its sources are Bill Inmon's book DW 2.0 and various other web and text resources.

Page 1: DW 2.0

DW 2.0 The Next Generation Data Warehousing

Lakshminarasu Chenduri, Data Warehousing Practice

Page 2: DW 2.0

AGENDA

• A quick look around Data warehousing

• Evolution of DW2.0

• Lifecycle of data

• Masterdata and Metadata

• Method of Accessing Data

• Structured/Unstructured Data

• Flow of data in DW2.0

• Master Data Management (MDM)

• Future Roadmap

• Conclusion

Page 3: DW 2.0

BACKGROUND

• A Data Warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data.

• A data mart is a data structure that is optimized for access. It is designed to facilitate end-user analysis of data. It typically supports a single analytic application used by a distinct set of workers.

• Facts are the metrics that business users would use for making business decisions.

• Dimensions are those attributes that qualify facts. They give structure to the facts.
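The relationship between facts and dimensions can be illustrated with a small star schema. The sketch below is only an illustration: the table names (`fact_sales`, `dim_product`, `dim_region`), the sample rows, and the helper `sales_by_region` are assumptions made for this example and do not come from the presentation.

```python
import sqlite3

# Hypothetical star schema: one fact table qualified by two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_region  (region_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    region_id  INTEGER REFERENCES dim_region(region_id),
    amount     REAL  -- the fact: a metric used for business decisions
);
INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO dim_region  VALUES (1, 'North'), (2, 'South');
INSERT INTO fact_sales  VALUES (1, 1, 100.0), (1, 2, 50.0), (2, 1, 75.0), (1, 1, 25.0);
""")


def sales_by_region(connection):
    """Aggregate the fact (amount) along one dimension (region)."""
    rows = connection.execute("""
        SELECT r.name, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_region AS r ON r.region_id = f.region_id
        GROUP BY r.name
        ORDER BY r.name
    """).fetchall()
    return dict(rows)


print(sales_by_region(conn))
```

Here `amount` plays the role of the fact, while the region and product tables supply the attributes that give it structure.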

Page 4: DW 2.0

EXISTING DATA WAREHOUSES

• Data warehouses consist of:
  • Operational database layer
  • Data access layer
  • Metadata layer
  • Informational access layer

• This approach does not define limits for data retention and aging.

• Metadata is unorganized.

• Query optimization is a must for such unorganized databases.

Page 5: DW 2.0

EVOLUTION TO THE DW 2.0 ENVIRONMENT

• The demand for more and different uses of technology.

• Online processing of data, which calls for faster processing and for dealing with real-time data.

• The need for integrated, corporate data.

• The need to include unstructured, textual data in the mix.

• Capacity of data storage and volume of data are proportional.

• The economics of technology: investing in new technologies and paying higher prices.

Page 6: DW 2.0

BUSINESS IMPACT

• Credit card fraud analysis

• Inventory management

• Customer profiles

• Frequent Flier programs in Airlines

• Health analysis: analysis of epidemic and endemic diseases by geographical location.

• Climate and weather reporting based on historical analysis

• SOA Governance and active-exploratory warehousing

Page 7: DW 2.0

WHAT MADE DW 2.0 SO IMPORTANT?

• DW 2.0 is the definition of data warehouse architecture for the next generation of data warehousing.

• The urge to keep pace with a dynamic, fast-growing business in terms of capacity, ROI, geography, organization, etc.

• This also implies the need to upgrade existing technology to something more capable, which can handle petabytes of data.

• To understand how DW 2.0 came about, consider the following shaping factors:

Page 8: DW 2.0

FACTORS THAT INFLUENCED DW 2.0

1. Business value
2. Volume of data to be integrated
3. Cost of setting up a data warehouse
4. Metadata and master data management
5. Medium of storage
6. Need for data warehouses
7. Changing business requirements

Page 9: DW 2.0
Page 10: DW 2.0

LIFECYCLE OF DATA

• As data enters the data warehouse, it starts a lifecycle.

• The phases of this lifecycle are termed sectors of data.

• During its lifecycle, a piece of data can be placed in any of the four sectors.

• The following picture depicts the lifecycle of data in the DW 2.0 environment.

Page 11: DW 2.0
Page 12: DW 2.0

SECTORS OF DATA

Sectors of Data:

• Interactive Sector – Very current data from an application; may be as old as a second or a minute.

• Integrated Sector – Nearly current data; may be as old as an hour, a day, or a week.

• Near Line Sector – Less than current data; may be as old as a month or two.

• Archival Sector – Very old data; may be anywhere from a year to 30 or 40 years old.
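The age ranges above can be turned into a simple classification rule. The following is a minimal sketch only: the slides give rough ages rather than exact boundaries, so the cut-offs in `SECTOR_LIMITS` (one hour, one week, one year) and the function name `sector_for_age` are assumptions chosen for illustration.

```python
from datetime import timedelta

# Assumed cut-offs; the presentation only gives approximate ages per sector.
SECTOR_LIMITS = [
    (timedelta(hours=1), "Interactive"),   # seconds to minutes old
    (timedelta(weeks=1), "Integrated"),    # an hour to about a week old
    (timedelta(days=365), "Near Line"),    # roughly a month or two, up to a year here
]


def sector_for_age(age: timedelta) -> str:
    """Return the DW 2.0 sector that data of the given age would sit in."""
    for limit, sector in SECTOR_LIMITS:
        if age < limit:
            return sector
    return "Archival"  # a year or older


print(sector_for_age(timedelta(seconds=30)))
print(sector_for_age(timedelta(days=3650)))
```

In practice the boundaries would be set per organization and per subject area; the point is only that age is the main driver of where data lives.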

Page 13: DW 2.0

STATES OF DATA

States of data:

• Application Data – Data that enters a table from an application (Interactive Sector)

• Corporate Data – Data that enters the warehouse through ETL (Integrated Sector)

• Archived Data – Data in the Archival Sector is called archived data

Page 14: DW 2.0

WHY DO WE HAVE DIFFERENT SECTORS?

• Query Access pattern differs for each sector

• Distinct volumes of data in each sector, depending on the time frame it covers

• Each sector is optimally served with different technologies

• Demarcation between frequently accessed data and rarely accessed data

Page 15: DW 2.0

MASTER DATA MANAGEMENT(MDM)

• Most valuable information that a business owns, representing core business components – pillars of business

• Captures the key entities that are common to all parts of the organization.

• Data management teams must be able to visualize and segregate the metadata and the master data of a data warehouse/organization

• A later part of this presentation is dedicated to MDM.

Page 16: DW 2.0

CONSIDERATION OF METADATA

• Metadata is a rapidly growing concern for any organization; it relates the business and technical facets.

• Physical embodiment of metadata.

• Metadata should be placed with the actual data. This coupling of Metadata with actual data enhances the visibility of data for a longer time range.

• This will help while examining archival data; it will be clear what the data is.

Page 17: DW 2.0

METADATA INFRASTRUCTURE IN DW 2.0

• In first-generation data warehouses, there was only a single kind of metadata, which exposed the business and technical views.

• In DW 2.0, the kinds of metadata are clearly demarcated by purpose.

• We classify them as:
  • Enterprise Metadata
  • Local Metadata
  • Business Metadata
  • Technical Metadata

Page 18: DW 2.0

METADATA EXPLAINED…

• Enterprise metadata is stored in a locale that is central to all of the tools and all of the processes that exist within the DW 2.0 environment.

• Local metadata is stored in a tool or technology that is central to the usage of that local metadata. E.g., ETL source and target objects, DBMS directory metadata about tables, repository, attributes, indexes, etc.

• Business intelligence (BI) universe metadata is about data used in analytical processing. It includes data quality screen specifications: the code for data quality tests, the severity score of the potential error, and the action to be taken when an error occurs.

• Technical metadata holds source descriptions of all data sources, including record layouts and column definitions. It also covers business names for all tables and columns mapped to appropriate presentation server objects, join paths, computed columns, and business groupings. It may include aggregate navigation and drill-across functionality, and it extends to the design logic and functionality of data transformations.

Page 19: DW 2.0

METADATA IN DIFFERENT SECTORS

[Figure: metadata (MD) in each sector and a Metadata Repository. Sectors shown: Very Current Data (Interactive Sector), Near to Current Data (Integrated Sector), Less than Current Data (Near Line Sector), Old and Very Old Data (Archival Sector).]

Page 20: DW 2.0

METADATA IN ARCHIVAL SECTORS

• Metadata in Archival data is stored with the data itself.

• This is because it is assumed that metadata could be lost over time if it is not collocated with its associated archival content data.
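One way to picture metadata travelling with archival data is a self-describing archive document. The sketch below is an assumption-laden illustration, not a format prescribed by DW 2.0: the JSON layout and the function names `archive_record` and `read_archive` are hypothetical.

```python
import json


def archive_record(rows, metadata):
    """Bundle rows and the metadata describing them into one document."""
    return json.dumps({"metadata": metadata, "data": rows}, sort_keys=True)


def read_archive(document):
    """Interpret archived rows using only the metadata stored with them."""
    bundle = json.loads(document)
    columns = [column["name"] for column in bundle["metadata"]["columns"]]
    return [dict(zip(columns, row)) for row in bundle["data"]]


# Hypothetical invoice table being archived.
invoice_metadata = {
    "table": "invoice",
    "columns": [
        {"name": "invoice_id", "type": "int"},
        {"name": "issued_on", "type": "date"},
        {"name": "amount", "type": "float"},
    ],
}
document = archive_record([[1, "2009-01-31", 120.5]], invoice_metadata)
print(read_archive(document))
```

Because the column definitions are stored inside the archive itself, a reader decades later does not depend on a separate repository that may no longer exist.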

Page 21: DW 2.0

ACTIVE AND PASSIVE REPOSITORIES

• A passive repository is one in which the metadata does not interact in any direct manner with the development and/or the query activities of the end user.

• An active repository is one in which the metadata interacts in an ongoing manner with the development and query activities of the system.

Page 22: DW 2.0

ACCESS OF DATA

• Differences in patterns and frequency of data access.

• In the case of interactive data, changes occur on the order of seconds, even milliseconds.

• In the case of archival data, changes occur over quarters, years, and decades.

• For faster query performance, interactive data is served by technology optimized for speed, which is more expensive.

• Archival data is the least used, so it is served by a different set of technologies, which is less expensive.

Page 23: DW 2.0

The pattern and frequency of data access change dramatically over time

Page 24: DW 2.0

STRUCTURED/UNSTRUCTURED DATA

• Two basic types of data:
  • Structured Data
  • Unstructured Data

• Structured data comes in a repetitive format.

• Best examples of structured data would be data generated by bank transactions, invoice bills, various ticketing systems, retail transactions, etc.

• Data from these types of transactions is easily recorded in databases with suitable entities, attributes, keys, and indexes.

• In short, these are well served by the standard database technology.

Page 25: DW 2.0

UNSTRUCTURED DATA

• Unstructured data, on the other hand, is classified into two types:

• Textual Unstructured Data (TUD) like Emails, Text messages, PowerPoint presentations, Text documents, telephonic conversations, etc.

• Non-textual Unstructured Data (NTUD) like photographic images, X-Rays, MRI Images, diagrammatic illustrations, etc.

• Current technology is not able to handle non-textual data.

• Unstructured textual data can be handled; it can be captured and analyzed, but not easily with current database technology.

• Since TUD is not repetitive, a specialized effort is needed to handle it. This does not mean that TUD is of no value; it is of good value indeed.

Page 26: DW 2.0

BLATHER

• Of the many challenges in DW 2.0, screening unstructured data is a significant one.

• When we consider unstructured data, we are interested only in data that adds value to the business or that becomes a key business metric in itself.

• Consider a text message saying “Hi honey, Bit busy here, will come late tonight”.

• Such personal emails and SMS messages become part of unstructured data. This type of data, which is not useful to the business by any means, is called “blather”.

• Screening and eliminating this type of data consumes a lot of effort.
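A very naive form of blather screening can be sketched as a keyword check. This is only an illustration under strong assumptions: the `BUSINESS_TERMS` vocabulary and the function `is_blather` are hypothetical, and real screening would need far more than word matching, which is part of why it consumes so much effort.

```python
# Hypothetical business vocabulary; a real one would be built per organization.
BUSINESS_TERMS = {"invoice", "order", "contract", "shipment", "refund"}


def is_blather(message: str, terms=BUSINESS_TERMS) -> bool:
    """Treat a message as blather when it mentions no business term at all."""
    words = {word.strip(".,!?;:\"'").lower() for word in message.split()}
    return words.isdisjoint(terms)


print(is_blather("Hi honey, Bit busy here, will come late tonight"))
print(is_blather("Please resend the invoice for order 1182."))
```

The first message contains no business term and is screened out; the second mentions an invoice and an order and is kept.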

Page 27: DW 2.0

THE FLOW OF DATA IN DW2.0 ENVIRONMENT

• Data flows throughout the DW 2.0 environment.

• Data enters the Interactive Sector either directly or through ETL from an external application. Data flows to the Integrated Sector through the ETL process, coming from the Interactive Sector.

• Data flows from the Integrated Sector to the Near Line Sector or the Archival Sector as it ages.

• On a limited basis, data may flow from the Archival Sector back to the Integrated Sector, and occasionally data flows from the Near Line Sector to the Integrated Sector.
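The permitted movements between sectors can be summarized as a small transition table. The sketch below is an illustration only: the `ALLOWED_FLOWS` mapping and the `can_flow` helper are hypothetical names, the "External" entry stands in for any outside application, and aging data is modeled as leaving the Integrated Sector, consistent with the slide titles on data flow that follow.

```python
# Assumed reading of the flows described on this slide and the next three.
ALLOWED_FLOWS = {
    "External": {"Interactive"},              # direct entry or via ETL
    "Interactive": {"Integrated"},            # through the ETL process
    "Integrated": {"Near Line", "Archival"},  # as data ages
    "Near Line": {"Integrated"},              # occasional flow back
    "Archival": {"Integrated"},               # limited flow back
}


def can_flow(source: str, target: str) -> bool:
    """Return True when data may move directly from source to target."""
    return target in ALLOWED_FLOWS.get(source, set())


print(can_flow("Interactive", "Integrated"))
print(can_flow("Interactive", "Archival"))
```

Any flow not listed, such as a direct move from the Interactive Sector to the Archival Sector, is rejected by this table.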

Page 28: DW 2.0

The flow of data in the DW 2.0 environment: Interactive to Integrated Sector

Page 29: DW 2.0

Data flow from Integrated to near line sector

Page 30: DW 2.0

DATA FLOW FROM INTEGRATED TO ARCHIVAL SECTOR (Data ageing)

Page 31: DW 2.0

MASTER DATA MANAGEMENT

• Master data can be described by the way that it interacts with other data.

• Master Data Management (MDM) is the technology, tools, and processes required to create and maintain consistent and accurate lists of master data.

• This type of data is used repeatedly across the organization by several business processes.

• This provides a business context through underlying data models.
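As a rough illustration of maintaining one consistent list of master data, the sketch below merges customer records coming from two applications. Everything specific here is an assumption: the function name `build_master_list`, the use of a normalized e-mail address as the matching key, and the rule that the first non-empty value for a field wins.

```python
def build_master_list(*sources):
    """Merge records from several applications into one master list."""
    master = {}
    for records in sources:
        for record in records:
            key = record["email"].strip().lower()  # assumed matching key
            merged = master.setdefault(key, {})
            for field, value in record.items():
                # Keep the first non-empty value seen for each field.
                if value and not merged.get(field):
                    merged[field] = value
            merged["email"] = key  # store the normalized key
    return sorted(master.values(), key=lambda r: r["email"])


# Hypothetical extracts from a CRM system and a billing system.
crm = [{"email": "Ann@Example.com", "name": "Ann Lee", "phone": ""}]
billing = [
    {"email": "ann@example.com", "name": "", "phone": "555-0100"},
    {"email": "bob@example.com", "name": "Bob Roy", "phone": ""},
]
print(build_master_list(crm, billing))
```

Real MDM tooling adds fuzzy matching, survivorship rules, and stewardship workflows on top of this basic idea.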

Page 32: DW 2.0

IBM’s DEFINITION OF MASTER DATA MANAGEMENT

• According to IBM's definition, Master Data Management:

• Decouples master information from individual applications

• Becomes a central, application independent resource

• Ensures consistent master information across transactional and analytical systems

• Simplifies ongoing integration tasks and new application development

• Addresses key issues such as data quality and consistency proactively rather than “after the fact” in the data warehouse

Page 33: DW 2.0

GROUPING OF DATA AND MASTER DATA

• Data in the corporate world is classified as:
  • Unstructured
  • Transactional
  • Metadata
  • Hierarchical
  • Master

• Master data falls into four groupings:
  • People
  • Things
  • Places
  • Concepts

Page 34: DW 2.0

DIMENSIONS OF MASTER DATA MANAGEMENT

Page 35: DW 2.0

PHASES IN DEVELOPING AN MDM PROJECT

• Identify sources of master data.

• Identify the producers and consumers of the master data.

• Collect and analyze metadata about your master data.

• Appoint data stewards.

• Implement a data-governance program and data-governance council.

• Develop the master-data model.

• Choose a toolset.

• Design the infrastructure.

• Generate and test the master data.

• Modify the producing and consuming systems.

• Implement the maintenance processes.

Page 36: DW 2.0

FUTURE ROADMAP

• Cloud Computing – Building data warehouses in public clouds, which enables data to be location-independent and transparent.

• Private and public clouds offer a sensible way of handling data.

• But this raises a few concerns, such as:
  • Data volumes
  • Data privacy and governance

• Open Source Data Integration Solutions – These will have a greater impact on cost/ROI.

• Many companies have come up with Java-based ETL engines with comparable performance; still, we need some concrete standards.

Page 37: DW 2.0

CONCLUSION

• DW2.0 addresses various business and technical needs of Data warehousing.

• This approach also facilitates further exploration of the warehouse in the future and gives a clear way of managing data from various perspectives.

• The evolution of the industry has driven growth in the business and its perspectives; at the same time, we see major growth in technology as well.

• Greater involvement in these has led to the emergence of regulatory compliance, SOA, and mergers and acquisitions.

• This has made creating and maintaining accurate and complete master data, and DW 2.0, a business imperative.

Page 38: DW 2.0

REFERENCES

• Book References:

• DW2.0 by William Inmon, Derek Strauss, Genia Neushloss

• New Trends in Data Warehousing and Data Analysis by Stanislaw Kozielski and Robert Wrembel

• Enterprise Master Data Management: An SOA Approach to Managing Core Information, by Allen Dreibelbis, Eberhard Hechler, Ivan Milman, Martin Oberhofer, Paul van Run and Dan Wolfson.

• Web References:

• http://en.wikipedia.org/

• http://msdn.microsoft.com/en-us/library/bb190163.aspx

• http://www.sqlmag.com/

• http://www.ibm.com/