3
So What is it?
● Misnomer and marketing speak
● “Unstructured” data
– Text heavy
– Without obvious/clear structure
● Comes from many places, in many styles
11
Hadoop to the Rescue
● Cross system analytics?
● Data quality confidence?
● Source of truth?
● Tool chain support?
● Giant yellow elephants?
If any are ignored...
16
Reservoirs...
● Contain data that is...
– Managed
– Transformed
– Filtered
– Secured
– Portable
– Fit for purpose
Source: Gartner
18
Data Warehouse Models
● Traditional models don't cover semi-structured data
● Modern models are hybrids that cross the structured/semi-structured boundary
20
Data Vault
● Developed by Dan Linstedt
● Tie technical keys across structured and semi-structured data sources
● Semi-structured data can be made more structured and loaded into a relational data vault
● Tools have to support crossing sources
● More details: http://www.tdan.com/view-articles/5054/
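As a rough illustration of tying technical keys across sources, the sketch below derives the same hub key from a relational row and from a JSON document by hashing a normalized business key. The hashing approach, the helper `hub_key`, and the field names (`customer_no`, `custNo`) are assumptions made for this example; the slides do not prescribe them.

```python
import hashlib
import json


def hub_key(business_key: str) -> str:
    """Derive a deterministic technical key from a business key (assumed approach)."""
    normalized = business_key.strip().upper()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


# Structured source: a row as it might come from a relational table (hypothetical).
crm_row = {"customer_no": "c-1001 ", "name": "Acme Corp"}

# Semi-structured source: a JSON event with a differently named key field (hypothetical).
event = json.loads('{"custNo": "C-1001", "payload": {"text": "Unit is hot to the touch"}}')

hub_customer = {}  # hub: one entry per distinct business key
sat_crm = []       # satellite: descriptive attributes from the structured source
sat_events = []    # satellite: text payloads from the semi-structured source

key = hub_key(crm_row["customer_no"])
hub_customer[key] = crm_row["customer_no"].strip().upper()
sat_crm.append({"hub_key": key, "name": crm_row["name"]})

event_key = hub_key(event["custNo"])
hub_customer.setdefault(event_key, event["custNo"].strip().upper())
sat_events.append({"hub_key": event_key, "text": event["payload"]["text"]})
```

Because both sources normalize to the same business key, the two satellites end up attached to a single hub entry, which is what lets a query cross the structured/semi-structured boundary.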
22
Anchor
● Developed by Lars Rönnbäck
● 6th normal form data warehouse
● Have to transform semi-structured data to match the anchor model
● Provides flexible model that should be able to have marts built upon it
● More details: http://www.anchormodeling.com/
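To make the transformation step more concrete, here is a minimal sketch of splitting one semi-structured record into single-attribute rows, in the spirit of a 6th normal form layout. The function `to_anchor_rows`, the record fields, and the load timestamp are hypothetical; real anchor models are typically generated with the modeling tool rather than written by hand.

```python
from datetime import datetime, timezone


def to_anchor_rows(anchor_id, record, loaded_at):
    """Split one record into single-attribute rows, one table per attribute."""
    tables = {}
    for attribute, value in record.items():
        if value is None:
            continue  # no nulls are stored: a missing value is simply an absent row
        tables.setdefault(attribute, []).append(
            {"anchor_id": anchor_id, "value": value, "changed_at": loaded_at}
        )
    return tables


# Hypothetical record parsed from a semi-structured source.
record = {"model": "X200", "color": None, "complaint": "It's hot to the touch"}
loaded_at = datetime(2015, 1, 1, tzinfo=timezone.utc)  # illustrative timestamp

tables = to_anchor_rows(1, record, loaded_at)
```

Each attribute lives in its own table keyed by the anchor, so a new attribute in the source adds a table instead of altering an existing one.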
23
Textual Disambiguation
● Developed by Bill Inmon
● Breaking semi-structured data down by context
● Converts the data into structured format, consumable by tools
● Store data within the data warehouse – 8th/9th normal form
● White papers and more details are on Bill's website: http://www.forestrimtech.com/
25
Working With “Unstructured” Data
● Most data tools require structure (Database schema, clear-cut data formatting)
● Business and technical knowledge required
– Business to provide the pattern “the grammar or syntax”
– Technical to provide the “how”
27
Identifying Context
● It's a really nice car.
● Its internal temperature requires adjustment
● It's hot to the touch
● It's on fire
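The four sentences above could be told apart with a small set of business-supplied rules. The sketch below is one possible way to do that with regular expressions; the rule patterns, the context labels, and the function `identify_context` are all assumptions chosen for illustration.

```python
import re

# Hypothetical business-supplied rules: (pattern, context label), checked in order.
RULES = [
    (re.compile(r"\bon fire\b", re.IGNORECASE), "safety_incident"),
    (re.compile(r"\bhot to the touch\b", re.IGNORECASE), "overheating"),
    (re.compile(r"\btemperature\b.*\badjust", re.IGNORECASE), "maintenance"),
    (re.compile(r"\bnice\b", re.IGNORECASE), "positive_opinion"),
]


def identify_context(sentence: str) -> str:
    """Return the label of the first rule that matches, or 'unknown'."""
    for pattern, label in RULES:
        if pattern.search(sentence):
            return label
    return "unknown"


sentences = [
    "It's a really nice car.",
    "Its internal temperature requires adjustment",
    "It's hot to the touch",
    "It's on fire",
]
labels = [identify_context(s) for s in sentences]
```

The business side owns the patterns and labels (the "grammar"); the technical side owns the matching code (the "how").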
29
How to Implement
● Map/Reduce code, Hive queries, data integration tools (Pentaho, Talend)
● Have to create the grammar/syntax rules for the particular business
● MDM is _not_ the solution
● Best to have a data warehouse based on subject/relationships
– Data Vault
– Anchor
– Textual Disambiguation
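For the Map/Reduce option, the shape of the job can be sketched in plain Python without a cluster. The example below only imitates the map and reduce phases in memory; the record layout, the watched terms, and the function names are hypothetical, and a real job would run on Hadoop or be expressed as a Hive query.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical terms the business wants counted per product.
WATCHED_TERMS = ("hot", "fire")


def map_record(record):
    """Map step: emit ((product, term), 1) for each watched term found in the text."""
    text = record["text"].lower()
    for term in WATCHED_TERMS:
        if term in text:  # naive substring match, good enough for a sketch
            yield (record["product"], term), 1


def reduce_counts(pairs):
    """Reduce step: sum the counts for each (product, term) key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


records = [
    {"product": "X200", "text": "It's hot to the touch"},
    {"product": "X200", "text": "It's on fire"},
    {"product": "X300", "text": "It's a really nice car."},
    {"product": "X200", "text": "Too hot again"},
]
counts = reduce_counts(chain.from_iterable(map_record(r) for r in records))
```

The product identifier is what ties the counted text back to the subject-oriented warehouse model.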
30
Data Symbiosis
● Data in a data lake can't stand on its own
– Ties back to rest of the structured data
– Requires firm understanding of business rules/logic
● Provides richer data sets
● Difficult to do before data lakes; after adding a data lake, the problems magnify
– But so do the rewards!
31
Data Quality
● Not just a problem for Data Warehouses!
● Measuring “fit for purpose”
● Same rules used for data warehouses apply to big data
32
Principles of Data Quality
● Consistency
● Correctness
● Timeliness
● Precision
● Unambiguous
● Completeness
● Reliability
● Accuracy
● Objectivity
● Conciseness
● Usefulness
● Usability
● Relevance
● Quantity
Source: Data Quality Fundamentals, The Data Warehouse Institute
33
Why Data Quality?
● Main way to control/tame your data problems
● Carries the most hidden costs because it is the hardest to fix
● Target upstream for problem solutions
34
How to Implement
● Data integration tools
● Custom coding (Map/Reduce, etc.)
● Data Profiling
● MDM (as central “dictionary”/“grammar” handler)
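As one small example of data profiling, the sketch below measures completeness, one of the data quality principles, as the share of records with a non-empty value per field. The function `profile` and the sample records are hypothetical; dedicated data integration tools offer far richer profiling.

```python
def profile(records, required_fields):
    """Return completeness per field: the share of records with a non-empty value."""
    total = len(records)
    report = {}
    for field in required_fields:
        filled = sum(1 for r in records if r.get(field) not in (None, ""))
        report[field] = filled / total if total else 0.0
    return report


# Hypothetical records pulled from a data lake.
records = [
    {"id": 1, "model": "X200", "complaint": "It's on fire"},
    {"id": 2, "model": "", "complaint": "It's hot to the touch"},
    {"id": 3, "model": "X200"},
    {"id": 4, "model": "X300", "complaint": None},
]
report = profile(records, ["id", "model", "complaint"])
```

A threshold on such a score is one way to make "fit for purpose" measurable, and the same check applies whether the records come from a warehouse table or a data lake.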
36
Does Your Tool Chain...
● Support Hadoop?
● Interface with non-traditional database solutions (i.e. not an RDBMS)?
● Allow for integration across disparate sources?
● Support data quality?
38
Hadoop Ecosystem
● Bridges some of the gaps
– Hive – SQL to Hadoop interface (JDBC support)
● Provides even more power
https://hadoopecosystemtable.github.io/
Plus dozens of others... and growing