Building next generation data warehouses


Building Next Generation Data Warehouses
All Things Open 2016
Alex Meadows

Principal Consultant (Data and Analytics), CSpring Inc.

Business Analytics Adjunct Professor, Wake Tech

MS in Business Intelligence

Passionate about developing BI solutions that give end users easy access to the data they need to find the answers they demand (even the ones they don't know yet!)

Twitter: @OpenDataAlex
LinkedIn: alexmeadows

GitHub: OpenDataAlex
Email: [email protected]

So here's a bit about me. There are three things I'm going to ask of you, the first being: please feel free to reach out! I love talking and learning about what folks are using out in the wild, and sharing. If you want to know more or chat more about any topic within data science/business intelligence, just message me via one of the methods above.

The second thing I'll ask is to be aware that some of these solutions may fix your particular problems; you'll iterate on them, find them super awesome, and maybe you'll be able to give back and talk about your experiences at a conference or in a trade paper. Note that the business side might not realize the undertaking or the super awesome things being done, because they are designed to be seamless and make users' lives easier.

The final ask before we get fully started: please don't be the pointy-haired boss! We're covering a lot of topics at a very high level, and a lot of nuances aren't being discussed (it's only a 40-minute presentation, after all). Please dig further and ask plenty of questions.

Agenda
(Brief) history of why data warehousing exists

The challenges

Three paths

Traditional

NoSQL

Hybrid

Q&A

Please feel free to ask questions throughout the presentation!

By the end of this presentation, you will know where traditional data warehousing is failing and have a basic understanding of the technologies and methodologies that are helping to address the needs of more data-savvy customer bases.

Why Data Warehouses?
Started being discussed in 1970

While databases existed, they were not relational/normalized

Network/hierarchical in nature

Designed for the query, not for the data model

Reporting was hard
System/application queries were not the same as management reporting queries

The concept of data warehouses started in the 1970s and fully came into its own during the late '80s and well into the '90s. Before relational databases, data was stored based on query usage and not necessarily based on the data itself. As a result, reporting was hard: data would either have to be merged together piecemeal or stored again based on the specific query requirements.

Bill Inmon

Data warehouses: "a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process"

Into that mess, a gentleman named Bill Inmon created the initial concept of separating reporting and analysis needs away from the OLTP layer.

Bill Inmon

Top-down design

Integration of source systems

Third Normal Form

His approach, now considered a top-down design, integrates data from the various OLTP systems, ties it together in a 3NF data model, and makes those data sets available for reporting. This ties in with the other approach that came along a bit after Inmon: the star schema.

Ralph Kimball
Make data accessible

Bottom-up approach

Dimensional models (star schema)

Another gentleman named Ralph Kimball took the data warehousing concept a step further. In his approach, now considered bottom-up, the data from the data warehouse is transformed, or conformed, to match the reporting and analysis needs of the business. While many arguments were had and many organizations went with a pure conformed dimensional model for their data warehouse, the correct way to model is with both a 3NF back end and a star schema on top.

Traditional Model

With that said, here is a typical model/workflow. From OLTP systems, Excel files, etc., the data is moved into a 3NF model. From the 3NF model, star schemas are built on top to handle all the reporting/analytics requirements. This model has worked very well, but several problems have come out of it. While I don't have an exact number, a high number of data warehouse projects are considered failures due to these issues. What are they? Glad you asked!
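As a rough sketch of the star schema end of that workflow (the table and column names below are illustrative assumptions, not taken from the slides), facts reference conformed dimensions built from the 3NF layer:

    -- Hypothetical conformed dimension sourced from the 3NF layer
    CREATE TABLE dim_student (
        student_key   INTEGER PRIMARY KEY,   -- surrogate key
        student_id    INTEGER,               -- natural key from the source OLTP system
        first_name    VARCHAR(50),
        last_name     VARCHAR(50)
    );

    -- Hypothetical fact table at the grain of one enrollment per student per class
    CREATE TABLE fact_enrollment (
        student_key   INTEGER REFERENCES dim_student (student_key),
        class_key     INTEGER,               -- would reference a dim_class dimension
        enrolled_date DATE,
        credit_hours  INTEGER                -- additive measure
    );

Reporting queries then join the fact to its dimensions and aggregate the measures, which is the access pattern the star schema is tuned for.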

Traditional Model Challenges

How can I get my data integrated faster?

Speed of integrating data is a huge problem. Working to cleanse, conform, and process all those different sources into a single warehouse is one thing. Getting the business to agree on the logic and formulas that populate the star schema is another challenge, triggering many iterations on the integration layer.

Traditional Model Challenges

How long to get new data sources online?
How to handle business logic changes?
How can I get my data integrated faster?

It's also a challenge to bring new sources online. Because of the nature of an Inmon data warehouse, it's typical to bring the entire source over so that history is tracked across the whole source. In addition, how will logic changes be managed, both from source to warehouse and from warehouse to star schema? Without the 3NF layer, the star schema can't be reloaded without losing all the history that was collected.

Traditional Model Challenges

What about all that unstructured data?
How long to get new data sources online?
How to handle business logic changes?
How can I get my data integrated faster?

Then, of course, the other question is how to handle the big yellow elephant.

Traditional Model Challenges

What about all that unstructured data?
How long to get new data sources online?
How to handle business logic changes?
How can I get my data integrated faster?
What about data scientists?

On top of all those other problems, we also have to address a whole new customer base: data scientists! They need access to data faster and more broadly than any customer base before. Yet they can't just pull data from the data warehouse, because the data is too clean to be of real use.

A New Use Case
Traditional DW doesn't meet the demands of the data science workforce

Only gets to the "what happened" and "why"

There are distinct groups of requirements that business intelligence tries to answer. Traditional data warehousing can answer the first two: what happened and why it happened. Where it starts to fail is in the predictive analytics space, where, again, data scientists want data that is not cleansed and conformed but is still easy to access. Then there is prescriptive analytics: applying the predictions found and making automated decisions based on them.

Graph Source: http://www.odoscope.com/technology/prescriptive-analysis/

Traditional

Iterations On Existing Architecture

Data Vault
Hybrid between 3NF and star schema

Created by Dan Linstedt

Persistent data layer: keep everything

Bring data over as needed

Once an object is touched, bring all of it over

Can be hybrid between relational databases and Hadoop

Massive parallel loading, eventual consistency (with Hadoop)

Of the newer architectures, Data Vault is one of the easier ones to implement because it is a combination of both the Kimball and Inmon methods. Data is only brought over from source systems as needed, as opposed to bringing everything from the source all at once. The other really cool thing about Data Vault is that data can be offloaded into Hadoop as it ages and becomes non-volatile.

Image Source: https://pixabay.com/en/vault-strongbox-security-container-154023/

Here is our basic example that we'll be using through the rest of this presentation. It's a simple student/teacher/class model that, while not modeled 100% correctly, will provide a good example going forward.

Here is that same model in Data Vault form. Business entities become hub tables. Relationships between hubs get stored in many-to-many relationship tables called links. Hanging off both hubs and links are dimension-like tables called satellites, which store all the descriptive information for their related hub or link. Satellites version data as changes occur.
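As a minimal sketch (the hub, link, and satellite names, the hash-key columns, and the load metadata below are illustrative assumptions rather than content from the slides), the student/class portion of that model might look like this:

    -- Hub: one row per business key (the student), plus load metadata
    CREATE TABLE hub_student (
        student_hkey   CHAR(32) PRIMARY KEY,  -- hash of the business key
        student_id     INTEGER,               -- business key from the source
        load_date      TIMESTAMP,
        record_source  VARCHAR(50)
    );

    CREATE TABLE hub_class (
        class_hkey     CHAR(32) PRIMARY KEY,
        class_code     VARCHAR(20),
        load_date      TIMESTAMP,
        record_source  VARCHAR(50)
    );

    -- Link: many-to-many relationship between hubs (an enrollment)
    CREATE TABLE link_enrollment (
        enrollment_hkey CHAR(32) PRIMARY KEY,
        student_hkey    CHAR(32) REFERENCES hub_student (student_hkey),
        class_hkey      CHAR(32) REFERENCES hub_class (class_hkey),
        load_date       TIMESTAMP,
        record_source   VARCHAR(50)
    );

    -- Satellite: descriptive attributes of the student, versioned by load_date
    CREATE TABLE sat_student (
        student_hkey   CHAR(32) REFERENCES hub_student (student_hkey),
        load_date      TIMESTAMP,
        first_name     VARCHAR(50),
        last_name      VARCHAR(50),
        PRIMARY KEY (student_hkey, load_date)
    );

New versions of a student's details become new rows in sat_student rather than updates, which is how the persistent "keep everything" layer is realized.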

Pros and Cons

PRO

Easily leverage existing infrastructure

Faster iterations between source and solution

Especially as objects are brought over

Can offload historical data into Hadoop

Learning curve: simple to pick up

CON

Table joins

Inter-dependencies between objects

Documentation not widely available (outside of commercial website and book)

1.0 documentation found at:

TDAN Article

2.0 documentation ->

Certification/training:

http://learndatavault.com/

There's not a large amount of information publicly available outside the book shown above. The original series of articles can be found on TDAN. There is also certification through the Learn Data Vault website.

Anchor Modeling
Store data how it is and how it was

Structural changes and content changes

Created by Lars Rönnbäck

Persistent data layer: keep everything, including how the data was structured

Highly normalized (6NF)

Another of the new modeling techniques is anchor modeling. In this model, data is stored in a highly normalized format that captures both the actual data and the context and model over time. As the data model changes, new structures can be added to the anchor model.

There is only one open modeling tool that supports anchor modeling, and that's on the Anchor Modeling website directly. Some commercial tools do provide support, but there aren't many. That said, this is one of the many examples from the website. Each entity becomes an anchor, and data about the entity is tied to it. This model also removes duplication of data: for instance, if a teacher and a student both had the name Mary, it would only be stored once and referenced from both anchors.
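As a loose sketch in the spirit of anchor modeling's naming style (the specific table and column names are assumptions, not the website's generated model), an anchor holds only identity while each attribute lives in its own historized table:

    -- Anchor: one row per student identity, nothing else
    CREATE TABLE st_student (
        st_id INTEGER PRIMARY KEY
    );

    -- Historized attribute: the student's first name over time (6NF-style)
    CREATE TABLE st_nam_student_name (
        st_id        INTEGER REFERENCES st_student (st_id),
        st_nam_name  VARCHAR(50),
        st_nam_from  DATE,                   -- when this value became true
        PRIMARY KEY (st_id, st_nam_from)
    );

Adding a new attribute later means adding a new table beside the anchor rather than altering existing ones, which is where the "designed to be agile" claim comes from; views are typically generated on top so users never have to query the 6NF tables directly.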

Documentation:

Anchor Modeling Website

Quite a few presentations, no formal texts outside academic papers

There's not a lot of documentation out in the wild, with the exception of the website and the many presentations and white papers.

Pros and Cons

PRO

Stores data and data structure temporally

Designed to be agile

Reduces storage

CON

Joins

High normalization makes for difficult usage

Views mask this complexity

Some data stores aren't able to handle this normalization level

BI tools aren't designed for this type of modeling

NoSQL
Volume, Velocity, Variety, Veracity

Linked Data Stores (Triple Stores)
Store data with semantic information

Created by Tim Berners-Lee

Removes/eliminates ambiguity in data

Standardizes data querying (SPARQL)

Can interface with all other linked data sources

Public sources referenced and integrated by calling them

Private sources work the same way, provided permissions allow

Graph data stores are a specialized type of triple store
Store data on edges

Linked data (also known as triple stores) was created by Tim Berners-Lee around the same time as the web itself. Linked data removes the ambiguity of typical data stores by translating the data model into a clear vocabulary. The other bonus is that there is a single, unified querying language. When it comes to other linked data sources, it's easy to join data sets together by adding a new prefix to a query. Graph data stores are a subtype of triple store in which data is stored as a network graph; think six degrees of Kevin Bacon.

Again, using the example from before.

[Diagram: an ontology graph of the school example, with Student and Teacher as subclasses of Person (isSubClassOf), a student who is enrolledIn and a teacher who teaches the Third Grade Class, and hasFirstName properties giving the names Arnold and Valerie.]

Using that model, here we have an example of triples. A triple is made up of three parts: subject, predicate, and object. For instance, a student has a first name of Arnold. Another would be that Arnold is a Student, and that a Student is a subclass of Person.

RDF/XML

RDF/XML is another way of formatting the data in triples. There are other formats, but RDF/XML is one of the more common transport mechanisms since most tools can read XML. The same kinds of triples mentioned in the previous slide can be seen here.

SPARQL

    PREFIX school: <…>
    SELECT ?s ?name
    WHERE {
      ?s school:isEnrolledIn ?class .
      ?s school:hasFirstName ?name .
      ?class school:hasCourseName "Third Grade" .
    }

    ?s                   ?name
    school:Student#493   Arnold
    school:Student#494   Carlos
    school:Student#495   Phoebe
    school:Student#496   Ralphie
    school:Student#497   Wanda

SPARQL is similar enough to SQL to be familiar but different enough to require some tutorials ;) Here we are looking at our school data (as noted by the school prefix) and retrieving the first names of all students who are in Third Grade. The WHERE clause has three triple patterns to bring the result set back. Each triple is terminated by a period.

There are a few books on linked data, but these are two of the better of the bunch. The Manning publication is a great overview of linked data, while the Semantic Web book focuses on building web ontologies (the vocabularies we discussed earlier).

Pros and Cons

PRO

Clearly defined business logic

Fast iterations on ontology

Single, unified querying language

Can join datasets via PREFIX with no additional work

CON

BI tools still playing catch-up

Tool ecosystem is small

But Awesome!

Few organizations have adopted (but this is changing)

Other NoSQL
Columnar

Designed with queries in mind

Some are tuned for star schema performance

Document Stores
Designed with data/queries in mind

Key-value stores

Object Stores

Data stored as objects

Merger of database and programming

Others
New types are still being created

Watch out for flavors of the month

There are many other types of NoSQL databases, but not enough time to cover them here. They can still be useful in augmenting traditional data warehouses.

Hybrid

Data Virtualization
Integration is logical, not physical

Doesn't matter what type of data is being integrated*

NoSQL

Relational

Allows for more traditionally designed tools to access more modern data stores

Allows for easier, more iterative work flows

Business logic lives in the integration layer

Data virtualization is a great way to bridge the gap between NoSQL and SQL-based tools. It allows traditional business intelligence tools to access data stores that they wouldn't normally be able to. The cool thing about virtualization tools is that business logic lives in this integration layer, allowing for faster changes to the process that builds the data endpoint.

Logical Layering

With all the various sources, the virtualization tool will have one or many translation layers. These translation layers interpret the data between the source system and SQL. Between the initial translation layer and the final virtual data marts are any number of rules layers. These rules layers act in a similar manner to ETL (data integration), but they live inside the virtualization tool. From there, data marts can be created virtually as well. At any of those layers, changes can be made quickly and will immediately impact the layers above the one where the changes are made, as the sketch below illustrates.
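A rough sketch of those layers expressed as SQL views, assuming the virtualization tool already exposes each source as a queryable object (the names src_sis_students, src_mongo_activity, and the view names are hypothetical):

    -- Rules layer: apply business logic on top of the translated sources
    CREATE VIEW rules_active_students AS
    SELECT s.student_id,
           s.first_name,
           s.last_name
    FROM   src_sis_students s              -- translated relational source
    WHERE  s.enrollment_status = 'ACTIVE';

    -- Virtual data mart: join relational and NoSQL-sourced data for BI tools
    CREATE VIEW vmart_student_activity AS
    SELECT r.student_id,
           r.first_name,
           a.event_type,
           a.event_date
    FROM   rules_active_students r
    JOIN   src_mongo_activity a            -- translated document-store source
           ON a.student_id = r.student_id;

Because the rules and mart layers are just definitions inside the tool, changing the logic in rules_active_students immediately changes what every layer above it returns, with no reload of physical data.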

With data virtualization, traditional tools can continue to access data marts, both virtual and real. In addition, tools that can access the source systems can go either into the virtualized layers or access the systems directly, depending on the use case/need.

Going through my slide deck, I realized that I forgot to mention Mr. van der Lans' book, really the only book in the space that's tool-agnostic and discusses the concept of a logical data warehouse. Definitely check it out!

Pros and Cons

PRO

Easily leverage existing infrastructure

Faster iterations between source and solution

Integration between NoSQL and RDBMS simplified

Can keep data warehouse and augment as needed

Uses SQL

Self-documenting

CON

Joining can be intensive

Large memory, compute requirements

Heavy loads on source systems

Can offload to virtualization shards

Textual Disambiguation
Take unstructured data and interpret context

Store disambiguated data in RDBMS (9th normal form)

Augment traditional data warehouse with new unstructured data.

The final method we'll be discussing is some of Inmon's latest work: textual disambiguation. At its core, the methodology takes unstructured data, breaks it into its language components, and defines its textual context. From there, the data can be stored in an even more highly normalized form than we discussed with anchor modeling, augmenting the traditional warehouse with a veritable cornucopia of new information that can be queried using SQL.
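As a loose illustration of the end state only (the table layout below is an assumption for this example, not Inmon's published schema), the disambiguated text might land in a narrow, highly normalized table that plain SQL can query alongside the warehouse:

    -- Hypothetical landing table: one row per disambiguated term occurrence
    CREATE TABLE textual_fact (
        document_id   INTEGER,
        byte_offset   INTEGER,        -- where the term appears in the source text
        term          VARCHAR(100),   -- the raw word or phrase
        resolved_term VARCHAR(100),   -- e.g. an acronym expanded via a taxonomy
        context       VARCHAR(100)    -- e.g. 'course feedback', 'complaint'
    );

    -- Counting what the unstructured sources are actually talking about
    SELECT context,
           resolved_term,
           COUNT(*) AS mentions
    FROM   textual_fact
    GROUP  BY context, resolved_term
    ORDER  BY mentions DESC;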

Image Source: http://www.datatransformed.com.au/textual%20etl.htm#.WArA3RIrLRZ

Pros and Cons

PRO

Easily leverage existing infrastructure

Closes the gap between unstructured data and traditional data

Clear understanding and interpretation of unstructured data

CON

Full language context required

Slang, acronyms, etc. can be a problem

Time to delivery varies

Multiple language barrier

Defining context

Non-agile: hard to break data down into smaller components

Conclusion
Business Intelligence has to move forward

Remove legacy tools that haven't evolved past reporting

Tweak platform to support agile, incremental change

Businesses are already demanding more
Faster turnaround

More access

Deeper insights

Is your team ready to make the move?

Image Source: https://pixabay.com/p-1014060/?no_redirect