Building next generation data warehouses

  • Published on
    17-Feb-2017


Transcript


Building Next Generation Data Warehouses

All Things Open 2016

Alex Meadows

Principal Consultant (Data and Analytics), CSpring Inc.

Business Analytics Adjunct Professor, Wake Tech

MS in Business Intelligence

Passion for developing BI solutions that give end users easy access to the data they need to find the answers they demand (even the ones they don't know yet!)

Twitter: @OpenDataAlex

LinkedIn: alexmeadows

GitHub: OpenDataAlex

Email: ameadows@cspring.com

About Alex

So here's a bit about me. There are three things I'm going to ask of you, the first being: please feel free to reach out! I love talking and learning about what folks are using out in the wild and sharing. If you want to know more or chat about any topic within data science/business intelligence, just message me via one of the methods above.

The second thing I'll ask is to be aware that some of these solutions may fix your particular problems, that you'll iterate on them, find them super awesome, and maybe be able to give back and talk about your experiences at a conference or in a trade paper. Note that the business side might not realize the undertaking or the super awesome things being done; these solutions are designed to be seamless and make users' lives easier.

The final ask before we get fully started: please don't be the pointy-haired boss! We're covering a lot of topics at a very high level, and a lot of nuances aren't being discussed (it's only a 40-minute presentation, after all). Please dig further and ask plenty of questions.

Agenda

(Brief) history of why data warehousing

The challenges

Three paths

Traditional

NoSQL

Hybrid

Q&A

Please feel free to ask questions throughout the presentation!

By the end of this presentation, you will know where traditional data warehousing is failing and have a basic understanding of what technologies and methodologies are helping to address the needs of more data savvy customer bases.

Why Data Warehouses?

Started being discussed in the 1970s

While databases existed, they were not relational/normalized

Network/hierarchical in nature

Design for query, not for data model

Reporting was hard

System/application queries were not the same as management reporting queries

The concept of data warehouses started in the 1970s and fully came into its own during the late '80s and well into the '90s. Before relational databases, data was stored based on query usage and not necessarily based on the data itself. As a result, reporting was hard. Data would either have to be merged together piecemeal or stored again based on the specific query requirements.

Bill Inmon

Data warehouses: a "subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process"

Into that mess, a gentleman named Bill Inmon introduced the initial concept of separating reporting and analysis needs away from the OLTP layer.

Bill Inmon

Top-down design

Integration of source systems

Third Normal Form

His approach, now considered a top-down design, integrates data from various OLTP systems, ties them together in a 3NF data model, and makes those data sets available for reporting. This ties in with the other process that came along a bit after Inmon: the star schema.

Ralph Kimball

Make data accessible

Bottom-up approach

Dimensional models (star schema)

Another gentleman named Ralph Kimball took the data warehousing concept a step further. In his bottom-up approach, the data from the data warehouse is transformed, or conformed, to match the reporting and analysis needs of the business. While many arguments were had and many organizations went pure conformed dimensional model for their data warehouse, the correct way to model is with both a 3NF back end and a star schema on top.

Traditional Model

With that said, here is a typical model/workflow. From OLTP systems, Excel files, etc., the data is moved into a 3NF model. From the 3NF model, star schemas are built on top to handle all the reporting/analytics requirements. This model has worked very well, but several problems have come out of it. While I don't have an exact number, a high number of data warehouse projects are considered failures due to these issues. What are they? Glad you asked!
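As a rough sketch of the star schema side of that workflow, here is a toy example in SQLite using the school data from later in this talk. All table and column names are mine for illustration, not from the talk.

```python
import sqlite3

# A minimal star schema sketch: dimension tables describe context,
# the fact table holds the measures keyed by those dimensions.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_student (student_key INTEGER PRIMARY KEY, first_name TEXT);
    CREATE TABLE dim_class   (class_key   INTEGER PRIMARY KEY, course_name TEXT);
    CREATE TABLE fact_enrollment (
        student_key INTEGER REFERENCES dim_student(student_key),
        class_key   INTEGER REFERENCES dim_class(class_key),
        grade_pct   REAL
    );
""")
con.execute("INSERT INTO dim_student VALUES (1, 'Arnold'), (2, 'Valerie')")
con.execute("INSERT INTO dim_class VALUES (10, 'Third Grade')")
con.execute("INSERT INTO fact_enrollment VALUES (1, 10, 91.5), (2, 10, 88.0)")

# Reporting query: average grade per class, one join per dimension used.
rows = con.execute("""
    SELECT c.course_name, AVG(f.grade_pct)
    FROM fact_enrollment f JOIN dim_class c USING (class_key)
    GROUP BY c.course_name
""").fetchall()
print(rows)  # [('Third Grade', 89.75)]
```

The point of the shape: reporting tools only ever join the fact table to the dimensions around it, which is what makes star schemas easy for the business to query.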

Traditional Model Challenges

How can I get my data integrated faster?

Speed of integrating data is a huge problem. Working to cleanse, conform, and process all those different sources into a single warehouse is one thing. Getting the business to agree on the logic and formulas that populate the star schema is another challenge, triggering many iterations on the integration layer.

Traditional Model Challenges

How long to get new data sources online?

How to handle business logic changes?

How can I get my data integrated faster?

It's also a challenge to bring new sources online. Because of the nature of an Inmon data warehouse, it's typical to bring the entire source over so that history is tracked across the entire source. In addition, how will logical changes be managed, both from the source to the warehouse and from the warehouse to the star schema? Without the 3NF layer, the star schema can't be reloaded without losing all the history that was collected.

Traditional Model Challenges

What about all that unstructured data?

How long to get new data sources online?

How to handle business logic changes?

How can I get my data integrated faster?

Then of course the other question is: how do we handle the big yellow elephant?

Traditional Model Challenges

What about data scientists?

On top of all those other problems, we also have to address a whole new customer base: data scientists! They need access to data faster and more broadly than any customer base before. Yet they can't just pull data from the data warehouse, because that data is too clean to be of real use.

A New Use Case

Traditional DW doesn't meet the demands of the data science workforce

Only gets to the "what happened" and "why"

There are distinct groups of requirements that business intelligence tries to answer. Traditional data warehousing can answer the first two: what happened and why it happened. Where it starts to fail is in the predictive analytics space, where again, data scientists want data that is not cleansed and conformed but still easy to access. Then there is prescriptive analytics: applying the predictions found and making automated decisions based on them.

Graph Source: http://www.odoscope.com/technology/prescriptive-analysis/

Traditional

Iterations On Existing Architecture

Data Vault

Hybrid between 3NF and star schema

Created by Dan Linstedt

Persistent data layer: keep everything

Bring data over as needed

Once touching an object, bring it all over

Can be a hybrid between relational databases and Hadoop

Massive parallel loading, eventual consistency (with Hadoop)

Of the newer architectures, Data Vault is one of the easier ones to implement because it combines the Kimball and Inmon methods. Data is only brought over from source systems as needed, as opposed to bringing everything from the source all at once. The other really cool thing about Data Vault is that data can be offloaded into Hadoop as it ages and becomes non-volatile.

Image Source: https://pixabay.com/en/vault-strongbox-security-container-154023/

Here is our basic example that we'll be using through the rest of this presentation. It's a simple student/teacher/class model that, while not modeled 100% correctly, will provide a good example going forward.

Here is that same model in Data Vault form. Business entities become hub tables. Relationships between hubs are stored in many-to-many relationship tables called links. Hanging off both hubs and links are dimension-like tables called satellites that store all the relevant information about their related hub or link. Satellites version data as changes occur.
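A minimal sketch of those hub/link/satellite structures for the student/class example might look like this in SQLite. All table names, keys, and dates here are illustrative, not from the talk.

```python
import sqlite3

# Data Vault sketch: hubs hold business keys, links hold relationships,
# satellites hold versioned descriptive attributes.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE hub_student (student_hk INTEGER PRIMARY KEY,
                              student_bk TEXT, load_dts TEXT);
    CREATE TABLE hub_class   (class_hk INTEGER PRIMARY KEY,
                              class_bk TEXT, load_dts TEXT);
    -- Link: many-to-many relationship between the two hubs.
    CREATE TABLE link_enrollment (student_hk INTEGER, class_hk INTEGER,
                                  load_dts TEXT);
    -- Satellite: attributes of the student hub, versioned by load date.
    CREATE TABLE sat_student (student_hk INTEGER, first_name TEXT,
                              load_dts TEXT);
""")
con.execute("INSERT INTO hub_student VALUES (1, 'S-493', '2016-10-26')")
con.execute("INSERT INTO hub_class VALUES (10, 'THIRD-GRADE', '2016-10-26')")
con.execute("INSERT INTO link_enrollment VALUES (1, 10, '2016-10-26')")
# Two satellite rows: the name was corrected later; the old row is kept,
# which is how satellites version data as changes occur.
con.execute("INSERT INTO sat_student VALUES (1, 'Arnie',  '2016-10-26')")
con.execute("INSERT INTO sat_student VALUES (1, 'Arnold', '2016-11-02')")

# Current view of a student: latest satellite row for that hub key.
current = con.execute("""
    SELECT first_name FROM sat_student
    WHERE student_hk = 1
    ORDER BY load_dts DESC LIMIT 1
""").fetchone()
print(current)  # ('Arnold',)
```

Nothing is ever updated in place; history accumulates in the satellites, which is what makes the persistent "keep everything" layer possible.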

Pros and Cons

PRO

Easily leverage existing infrastructure

Faster iterations between source and solution

Especially as objects are brought over

Can offload historical data into Hadoop

Learning curve: simple to pick up

CON

Table joins

Inter-dependencies between objects

Documentation not widely available (outside of commercial website and book)

1.0 documentation found at:

TDAN Article

2.0 documentation ->

Certification/training:

http://learndatavault.com/

There's not a large amount of information publicly available outside the book shown above. The original series of articles can be found on TDAN. There is also certification through the Learn Data Vault website.

Anchor Modeling

Store data how it is and how it was

Structural changes and content changes

Created by Lars Rönnbäck

Persistent data layer: keep everything, including how the data was structured

Highly normalized (6NF)

Another of the new modeling techniques is anchor modeling. In this model, data is stored in a highly normalized format that focuses not only on the actual data but also on how the context and model change over time. As the data model changes, new structures can be made in the anchor model.

There is only one open modeling tool that supports anchor modeling, and that's on the Anchor Modeling website directly. Some commercial tools do provide support, but there aren't many. That said, this is one of the many examples from the website. Each entity becomes an anchor, and data about the entity is tied to it. This model also removes duplication of data. For instance, if a teacher and a student both had the name Mary, it would only be stored once and be referenced by both anchors.
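To make the 6NF idea concrete, here is a toy SQLite sketch loosely following anchor modeling's anchor/attribute/knot terminology: identities live in an anchor table, each attribute gets its own time-stamped table, and shared values (the single "Mary") live in a knot referenced from both anchors. All names are invented for illustration.

```python
import sqlite3

# Anchor-modeling sketch: an anchor is just an identity; every attribute
# lives in its own table with a valid-from timestamp, so both content
# and structure can evolve over time without rewriting history.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE anchor_person (person_id INTEGER PRIMARY KEY);
    -- Knot: each distinct shared value is stored exactly once.
    CREATE TABLE knot_name (name_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    -- Attribute: anchor -> knot, versioned by valid_from.
    CREATE TABLE attr_person_name (person_id INTEGER, name_id INTEGER,
                                   valid_from TEXT);
""")
con.execute("INSERT INTO anchor_person VALUES (1), (2)")   # a teacher and a student
con.execute("INSERT INTO knot_name VALUES (100, 'Mary')")  # 'Mary' stored once...
con.execute("INSERT INTO attr_person_name VALUES (1, 100, '2016-01-01')")
con.execute("INSERT INTO attr_person_name VALUES (2, 100, '2016-01-01')")  # ...referenced twice

names = con.execute("""
    SELECT a.person_id, k.name
    FROM attr_person_name a JOIN knot_name k USING (name_id)
    ORDER BY a.person_id
""").fetchall()
print(names)  # [(1, 'Mary'), (2, 'Mary')]
```

This also shows the trade-off mentioned in the cons below: even fetching a name requires a join, which is why views are usually layered on top.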

Documentation:

Anchor Modeling Website

Quite a few presentations, no formal texts outside academic papers

There's not a lot of documentation out in the wild beyond the website, a number of presentations, and white papers.

Pros and Cons

PRO

Stores data and data structure temporally

Designed to be agile

Reduces storage

CON

Joins

High normalization makes for difficult usage

Views mask this complexity

Some data stores aren't able to handle this normalization level

BI tools arent designed for this type of modeling

NoSQL

Volume, Velocity, Variety, Veracity

Linked Data Stores (Triple Stores)

Store data with semantic information

Created by Tim Berners-Lee

Removes/eliminates ambiguity in data

Standardizes data querying (SPARQL)

Can interface with all other linked data sources

Public sources referenced and integrated by calling them

Private sources work the same way, provided permissions allow

Graph data stores are a specialized type of triple store

Store data on edges

Linked data (also known as triple stores) was created by Tim Berners-Lee as part of his vision for the Semantic Web. Linked Data removes the ambiguity of typical data stores by translating the data model into a clear vocabulary. The other bonus is that there is a single, unified querying language. When it comes to other linked data sources, it's easy to join data sets together by adding a new prefix to a query. Graph data stores are a subtype of triple store in which data is stored in a network graph; think six degrees of Kevin Bacon.

Again, using the example from before.

[Diagram: Arnold "is a" Student and Valerie "is a" Teacher; both Student and Teacher are subclasses of Person; each hasFirstName; Arnold is enrolledIn the Third Grade class, which Valerie teaches.]

Using that model, here we have an example of triples. A triple is made up of three parts: subject, predicate, and object. For instance, a student has a first name of "Arnold". Another would be that Arnold is a Student, and that Student is a subclass of Person.

RDF/XML

RDF/XML is another way of formatting the data in triples. There are other formats, but RDF/XML is one of the more common transport mechanisms, since most tools can read XML. The same kinds of triples mentioned in the previous slide can be seen here.
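For instance, the Arnold triples might serialize like this in RDF/XML. The school namespace IRI here is a made-up example.org placeholder, not from the talk.

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:school="http://example.org/school#">
  <!-- Typed node: this resource "is a" school:Student -->
  <school:Student rdf:about="http://example.org/school#Student493">
    <school:hasFirstName>Arnold</school:hasFirstName>
    <school:isEnrolledIn rdf:resource="http://example.org/school#ThirdGrade"/>
  </school:Student>
</rdf:RDF>
```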

SPARQL

PREFIX school: <...>

SELECT ?s ?name
WHERE {
  ?s school:isEnrolledIn ?class .
  ?s school:hasFirstName ?name .
  ?class school:hasCourseName "Third Grade" .
}

?s                  ?name
school:Student#493  Arnold
school:Student#494  Carlos
school:Student#495  Phoebe
school:Student#496  Ralphie
school:Student#497  Wanda

SPARQL is similar enough to SQL to be familiar, but different enough to require some tutorials ;) Here we are looking at our school data (as noted by the school prefix) and retrieving the first names of all students in Third Grade. The WHERE clause has three triple statements to bring the result set back. Each triple is terminated by a period.
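To see what an engine does with those three WHERE patterns, here is a toy, pure-Python sketch of triple pattern matching. The triples and identifiers are illustrative, mirroring the query above; this is not how any real SPARQL engine is implemented.

```python
# A tiny in-memory triple set: (subject, predicate, object).
triples = {
    ("school:Student#493", "school:hasFirstName", "Arnold"),
    ("school:Student#493", "school:isEnrolledIn", "school:Class#3"),
    ("school:Class#3", "school:hasCourseName", "Third Grade"),
    ("school:Teacher#7", "school:hasFirstName", "Valerie"),
}

def match(pattern, binding):
    """Yield extended variable bindings for one (s, p, o) pattern.

    Terms starting with '?' are variables; others must match exactly.
    """
    for triple in triples:
        new = dict(binding)
        ok = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if new.setdefault(term, value) != value:
                    ok = False  # variable already bound to something else
                    break
            elif term != value:
                ok = False      # constant term does not match
                break
        if ok:
            yield new

# Equivalent of the WHERE clause: chain the three patterns,
# joining on the shared ?s and ?class variables.
results = [b for b0 in match(("?s", "school:isEnrolledIn", "?class"), {})
             for b1 in match(("?s", "school:hasFirstName", "?name"), b0)
             for b  in match(("?class", "school:hasCourseName", "Third Grade"), b1)]
print([(b["?s"], b["?name"]) for b in results])
# [('school:Student#493', 'Arnold')]
```

Each pattern narrows the bindings, which is exactly the join-on-variables behavior that makes the period-separated triple statements work together.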

There are a few books on linked data, but these are two of the better of the bunch. The Manning publication is a great overview of Linked Data, while the Semantic Web book focuses on building web ontologies (the vocabularies we discussed earlier).

Pros and Cons

PRO

Clearly defined business logic

Fast iterations on ontology

Single, unified querying language

Can join datasets via PREFIX with no additional work

CON

BI tools still playing catch-up

Tool ecosystem is small

But Awesome!

Few organizations have adopted (but this is changing)

Other NoSQL

Columnar

Designed with queries in mind

Some are tuned for star schema performance

Document Stores

Designed with data/queries in mind

Key-value stores

Object Stores

Data stored as objects

Merger of database and programming

Others

New types are still being created

Watch out for flavors of the month

There are many other types of NoSQL databases, but not enough time to cover them here. They can still be useful in augmenting traditional data warehouses.

Hybrid

Data Virtualization

Integration is logical, not physical

Doesn't matter what type of data is being integrated*

NoSQL

Relational

Allows for more traditionally designed tools to access more modern data stores

Allows for easier, more iterative work flows

Business logic lives in the integration layer

Data virtualization is a great way to bridge the gap between NoSQL and SQL-based tools. It allows traditional business intelligence tools to access data stores that they wouldn't normally be able to. The cool thing about virtualization tools is that business logic lives in this integration layer, allowing for faster changes to the process that builds the data endpoint.

Logical Layering

With all the various sources, the virtualization tool will have one or many translation layers. These translation layers interpret the data between the source system and SQL. Between the initial translation layer and the final virtual data marts are any number of rules layers. These rules layers act in...
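A toy sketch of that layering, with invented sources and rules: one translation layer per source maps each physical store into a common record shape, and a rules layer applies business logic on top, standing in for the final virtual data mart.

```python
import sqlite3

# Source 1: a relational store.
rel = sqlite3.connect(":memory:")
rel.execute("CREATE TABLE students (id INTEGER, first_name TEXT)")
rel.execute("INSERT INTO students VALUES (493, 'arnold')")

# Source 2: a document-style store (a stand-in for a NoSQL source).
docs = [{"id": 494, "first_name": "CARLOS"}]

def translate_relational():
    # Translation layer: relational rows -> common record shape.
    return [{"id": i, "first_name": n} for i, n in
            rel.execute("SELECT id, first_name FROM students")]

def translate_documents():
    # Translation layer: documents -> the same common record shape.
    return [{"id": d["id"], "first_name": d["first_name"]} for d in docs]

def virtual_students():
    # Rules layer: business logic (here, name casing) lives in the
    # virtualization layer, not in either source, so it can change
    # without reloading any physical store.
    rows = translate_relational() + translate_documents()
    return [{**r, "first_name": r["first_name"].title()} for r in rows]

print(virtual_students())
# [{'id': 493, 'first_name': 'Arnold'}, {'id': 494, 'first_name': 'Carlos'}]
```

Consumers only ever see `virtual_students()`; the physical differences between the two sources are hidden behind the translation layers, which is the whole appeal of virtualization.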
