Data Warehousing 101: Everything you never wanted to know about big databases but were forced to find out anyway
Josh Berkus, Open Source Bridge 2011

Data Warehousing 101(and a video)



contents

covering
● concepts of DW
● some DW techniques
● databases

not covering
● hardware
● analytics/reporting tools

BIG DATA

1970

What is a “data warehouse”?


Big Data?

OLTP vs DW

OLTP
● many single-row writes
● current data
● queries generated by user activity
● < 1s response times
● 1000's of users

DW
● few large batch imports
● years of data
● queries generated by large reports
● queries can run for minutes/hours
● 10's of users

OLTP vs DW

OLTP: big data for many concurrent requests to small amounts of data each

DW: big data for low-concurrency requests to very large amounts of data each

synonyms & subclasses

archiving

WORN data: “write once, read never”
● grows indefinitely
● usually a result of regulatory compliance
● main concern: storage efficiency

data mining

the database where you don't know what's in there, but you want to find out
● lots of data (TB to PB)
● mostly “semi-structured”
● data produced as a side effect of other business processes
● needs CPU-intensive processing

BI: Business Intelligence
DSS: Decision Support
OLAP: Online Analytical Processing
Analytics

BI/DSS/OLAP/Analytics

databases which support visualization of large amounts of data
● data is fairly well understood
● most data can be reduced to categories, geography, and taxonomy
● primarily about indexing

What is a “dimension”?

dimensions vs. facts

[Diagram: a central fact table linked to dimensions such as customers / accounts and category, subcategory, sub-subcategory]

dimension examples
● location/region/country/quadrant
● product categorization
● URL
● transaction type
● account hierarchy
● IP address
● OS/version/build
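
One way to picture the fact/dimension split is a fact table that holds only keys and measures, with the descriptive hierarchy kept in a separate dimension table. The sketch below is illustrative only: the table contents, `categoryDim`, `salesFacts`, and `totalBy` are made-up names, not anything from the talk.

```javascript
// Hypothetical dimension table: a category hierarchy keyed by id.
const categoryDim = new Map([
  [1, { category: "hardware", subcategory: "storage", subSubcategory: "ssd" }],
  [2, { category: "hardware", subcategory: "storage", subSubcategory: "tape" }],
  [3, { category: "software", subcategory: "database", subSubcategory: "relational" }],
]);

// Hypothetical fact table: one row per sale, holding only a key and a measure.
const salesFacts = [
  { categoryId: 1, amount: 120 },
  { categoryId: 2, amount: 80 },
  { categoryId: 3, amount: 200 },
  { categoryId: 1, amount: 30 },
];

// Total the measure at any level of the dimension hierarchy.
function totalBy(level) {
  const totals = {};
  for (const fact of salesFacts) {
    const label = categoryDim.get(fact.categoryId)[level];
    totals[label] = (totals[label] || 0) + fact.amount;
  }
  return totals;
}

const byCategory = totalBy("category");
const byLeaf = totalBy("subSubcategory");
```

The same facts can be summarized at any level of the hierarchy just by changing which dimension attribute is used as the grouping label.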

dimension synonyms
● facet
● taxonomy
● secondary index
● view


What is ETL?

Extract, Transform, Load
● how you turn external raw data into useful database data
● Apache logs → web analytics DB
● CSV POS files → financial reporting DB
● OLTP server → 10-year data warehouse
● also called ELT when the transformation is done inside the database

Purpose of ETL/ELT

getting data into the data warehouse
● clean up garbage data
● split out attributes
● “normalize” dimensional data
● deduplication
● calculate materialized views / indexes
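
As a rough illustration of the transform step for the "Apache logs → web analytics DB" case, the sketch below drops garbage lines, splits out attributes, and deduplicates. It assumes input in Apache common log format; the `LOG_LINE` pattern, `transform`, and the output field names are hypothetical choices for this example.

```javascript
// Assumed input: Apache common log format, one request per line.
const LOG_LINE = /^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\d+|-)$/;

function transform(lines) {
  const seen = new Set();
  const rows = [];
  for (const line of lines) {
    const m = LOG_LINE.exec(line.trim());
    if (!m) continue;               // clean up garbage data
    if (seen.has(m[0])) continue;   // deduplication of identical lines
    seen.add(m[0]);
    const [path, query = ""] = m[4].split("?");  // split out attributes
    rows.push({
      ip: m[1],
      time: m[2],
      method: m[3],
      path,
      query,
      status: Number(m[5]),
      bytes: m[6] === "-" ? 0 : Number(m[6]),
    });
  }
  return rows;
}

const sampleLines = [
  '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html?ref=home HTTP/1.0" 200 2326',
  '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html?ref=home HTTP/1.0" 200 2326',
  "not a log line at all",
  '10.0.0.5 - frank [10/Oct/2000:13:56:01 -0700] "POST /login HTTP/1.1" 302 -',
];
const cleanRows = transform(sampleLines);
```

Here the four sample lines yield two rows: the duplicate and the garbage line are discarded before anything would reach permanent storage.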


ETL Tools

K.E.T.T.L.E.


Ad-hoc scripting

ELT Tips

think volume
● bulk processing or parallel processing
● no row-at-a-time, document-at-a-time
● insert into permanent storage should be the last step
● no updates


Queues not Extract

What kind of database should I use for DW?

5 Types
1. Standard Relational
2. MPP
3. Column Store
4. Map/Reduce
5. Enterprise Search

standard relational

the all-purpose solution for not-that-big data
● adequate for all tasks
● but not excellent at any of them
● easy to use
● low resource requirements
● well-supported by all software
● familiar
● not suitable for really big data

Sweet Spots

[Chart: MySQL, PostgreSQL, and DW Database compared on a 0 to 30 axis; axis units not recoverable from the slide]

What's MPP?

Massively Parallel Processing


appliance software

MPP

CPU-intensive data warehousing
● data mining, some analytics
● supporting complex query logic
● moderately big data (1-200TB)
● drawbacks: proprietary, expensive
● now hybridizes with other types

What's a column store?

column store

inversion of a row store:
indexes become data
data becomes indexes

column stores

for aggregations and transformations of highly structured data
● good for BI, analytics, some archiving
● moderately big data (0.5-100TB)
● bad for data mining
● slow to add new data / purge data
● usually support compression
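
A toy sketch of why the column layout suits aggregation and compression: each column becomes its own array, so a sum touches one array instead of every row. The data, `toColumns`, and `runLengthEncode` are hypothetical, and run-length encoding is used here only as one familiar compression technique, not as what any particular product does.

```javascript
// Row layout: each record stored together.
const rows = [
  { id: 1, region: "east", amount: 5 },
  { id: 2, region: "west", amount: 10 },
  { id: 3, region: "west", amount: 7 },
];

// Column layout: one array per column.
function toColumns(records) {
  const columns = {};
  for (const record of records) {
    for (const [name, value] of Object.entries(record)) {
      (columns[name] = columns[name] || []).push(value);
    }
  }
  return columns;
}

const columns = toColumns(rows);

// An aggregate reads only the one column it needs.
const totalAmount = columns.amount.reduce((sum, v) => sum + v, 0);

// Repetitive columns collapse into (value, count) runs.
function runLengthEncode(values) {
  const runs = [];
  for (const v of values) {
    const last = runs[runs.length - 1];
    if (last && last.value === v) last.count++;
    else runs.push({ value: v, count: 1 });
  }
  return runs;
}

const regionRuns = runLengthEncode(columns.region);
```

The flip side is visible too: adding one new row means appending to every column array, which hints at why loading and purging are slower in this layout.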


What's map/reduce?

map/reduce

// map
function(doc) {
  for (var i in doc.links)
    emit([doc.parent, i], null);
}

// reduce
function(keys, values) {
  return null;
}

map/reduce

// Map
function (doc) {
  emit(doc.val, doc.val)
}

// Reduce
function (keys, values, rereduce) {
  // This computes the standard deviation of the mapped results
  var stdDeviation = 0.0;
  var count = 0;
  var total = 0.0;
  var sqrTotal = 0.0;

  if (!rereduce) {
    // This is the reduce phase, we are reducing over emitted values from
    // the map functions.
    for (var i in values) {
      total = total + values[i];
      sqrTotal = sqrTotal + (values[i] * values[i]);
    }
    count = values.length;
  } else {
    // This is the rereduce phase, we are re-reducing previously
    // reduced values.
    for (var i in values) {
      count = count + values[i].count;
      total = total + values[i].total;
      sqrTotal = sqrTotal + values[i].sqrTotal;
    }
  }

  var variance = (sqrTotal - ((total * total) / count)) / count;
  stdDeviation = Math.sqrt(variance);

  // the reduce result. It contains enough information to be rereduced
  // with other reduce results.
  return {"stdDeviation": stdDeviation, "count": count,
          "total": total, "sqrTotal": sqrTotal};
};
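
To see how the reduce and rereduce phases fit together, the standard-deviation view can be exercised outside the database with a small stand-in driver. This is only a sketch: `runView` and its chunking are hypothetical, and `emit` is passed to `map` explicitly here, whereas a real document store supplies it and decides how values are grouped.

```javascript
// Map and reduce follow the slide's logic; emit is passed in for this harness.
function map(doc, emit) {
  emit(doc.val, doc.val);
}

function reduce(keys, values, rereduce) {
  let count = 0;
  let total = 0.0;
  let sqrTotal = 0.0;
  if (!rereduce) {
    // Reduce phase: values are the raw numbers emitted by map.
    for (const v of values) {
      total += v;
      sqrTotal += v * v;
    }
    count = values.length;
  } else {
    // Rereduce phase: values are earlier reduce results.
    for (const partial of values) {
      count += partial.count;
      total += partial.total;
      sqrTotal += partial.sqrTotal;
    }
  }
  const variance = (sqrTotal - (total * total) / count) / count;
  return { stdDeviation: Math.sqrt(variance), count, total, sqrTotal };
}

// Hypothetical driver: map every doc, reduce in chunks, then rereduce the partials.
function runView(docs, chunkSize) {
  const emitted = [];
  docs.forEach((doc) => map(doc, (key, value) => emitted.push({ key, value })));
  const partials = [];
  for (let i = 0; i < emitted.length; i += chunkSize) {
    const chunk = emitted.slice(i, i + chunkSize);
    partials.push(reduce(chunk.map((e) => e.key), chunk.map((e) => e.value), false));
  }
  return reduce(null, partials, true);
}

const docs = [2, 4, 4, 4, 5, 5, 7, 9].map((val) => ({ val }));
const result = runView(docs, 3);
```

For this sample the combined result has a count of 8 and a standard deviation of 2, the same as computing it over all eight values at once, which is the point of carrying `count`, `total`, and `sqrTotal` in each partial result.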

map/reduce vs. MPP

map/reduce
● open source
● petabytes
● write routines by hand
● inefficient
● generic
● cheap HW / cloud
● DIY tools

MPP
● proprietary
● terabytes
● advanced query support
● efficient
● specific
● needs good HW
● integrated tools


What's enterprise search?

enterprise search

ElasticSearch

when you need to do DW with a huge pile of partly processed “documents”
● does: light data mining, light BI/analytics
● best “full text” and keyword search
● supports “approximate results”
● lots of special features for web data

E.S. vs. C-Store

enterprise search
● batch load
● semi-structured data
● uncompressed
● star schema
● sharding
● approximate results

column store
● batch load
● fully normalized data
● compressed
● snowflake schema
● parallel query
● exact results

What's a windowing query?


regular aggregate


windowing function

CREATE TABLE events (
    event_id INT,
    event_type TEXT,
    start TIMESTAMPTZ,
    duration INTERVAL,
    event_desc TEXT
);

SELECT MAX(concurrent)
FROM (
    SELECT SUM(tally) OVER (ORDER BY start) AS concurrent
    FROM (
        SELECT start, 1::INT AS tally
        FROM events
        UNION ALL
        SELECT (start + duration), -1
        FROM events
    ) AS event_vert
) AS ec;
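
For comparison, the logic of the windowing query can be spelled out as application code: turn each event into a +1 at its start and a -1 at its end, sort by time, and keep a running sum. The sketch assumes numeric timestamps and durations; `maxConcurrent` and the field names are illustrative. Tallies that share a timestamp are summed together before the maximum is checked, mirroring how the default window frame treats rows that tie in the ORDER BY.

```javascript
// events: [{ start: number, duration: number }], times in any consistent unit.
function maxConcurrent(events) {
  const tallies = [];
  for (const e of events) {
    tallies.push({ at: e.start, tally: 1 });                // event begins
    tallies.push({ at: e.start + e.duration, tally: -1 });  // event ends
  }
  tallies.sort((a, b) => a.at - b.at);

  let running = 0;
  let max = 0;
  for (let i = 0; i < tallies.length; i++) {
    running += tallies[i].tally;
    // Only compare once every tally at this timestamp has been applied.
    const lastAtThisTime =
      i === tallies.length - 1 || tallies[i + 1].at !== tallies[i].at;
    if (lastAtThisTime) max = Math.max(max, running);
  }
  return max;
}

const peak = maxConcurrent([
  { start: 0, duration: 10 },
  { start: 5, duration: 10 },
  { start: 8, duration: 1 },
  { start: 10, duration: 5 },
]);
```

With these four sample events the peak is 3, reached while the third event overlaps the first two. The SQL version does the same work in one pass inside the database, without shipping the rows to the application.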

stream processing SQL
● replace multiple queries with a single query
● avoid scanning large tables multiple times
● replace pages of application code
● and MB of data transmission
● SQL alternative to map/reduce
● (for some data mining tasks)


What's a materialized view?

query results as table
● calculate once, read many times
● complex/expensive queries
● frequently referenced
● not necessarily a whole query
● often part of a query
● might be manually or automatically updated
● depends on product

non-relational matviews
● CouchDB Views
● cache results of map/reduce jobs
● updated on data read
● Solr / Elastic Search “Faceted Search”
● cached indexed results of complex searches
● updated on data change

maintaining matviews

BEST: update matviews at batch load time
GOOD: update matview according to clock/calendar
FAIR: update matview on data request
BAD for DW: update matviews using a trigger

matview tips
● matviews should be small
● 1/10 to ¼ of RAM on each node
● each matview should support several queries
● or one really really important one
● truncate + append, don't update
● index matviews like crazy
● if they are not indexes themselves
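
The "truncate + append" and "update at batch load time" advice can be sketched with an in-memory stand-in for a warehouse. Everything here is hypothetical (`warehouse`, `refreshSalesByRegion`, `batchLoad`); in a real database the same pattern would be a truncate followed by a bulk insert of freshly computed rows.

```javascript
// Hypothetical in-memory warehouse: a fact array plus one small matview.
const warehouse = { sales: [], salesByRegion: [] };

// Rebuild the matview from scratch rather than updating rows in place.
function refreshSalesByRegion(wh) {
  const totals = new Map();
  for (const row of wh.sales) {
    totals.set(row.region, (totals.get(row.region) || 0) + row.amount);
  }
  wh.salesByRegion.length = 0;                 // truncate
  for (const [region, total] of totals) {
    wh.salesByRegion.push({ region, total });  // append
  }
}

// Refresh as the final step of each batch load, not per row via a trigger.
function batchLoad(wh, rows) {
  wh.sales.push(...rows);
  refreshSalesByRegion(wh);
}

batchLoad(warehouse, [
  { region: "west", amount: 10 },
  { region: "east", amount: 5 },
  { region: "west", amount: 7 },
]);
batchLoad(warehouse, [{ region: "east", amount: 20 }]);
```

Readers then query `salesByRegion` directly, paying the aggregation cost once per load instead of once per report.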


What's OLAP?

cubes

[Cube diagram with axes: Site, Repeat Visitors, Browser]


drill-down

OLAP
● OnLine Analytical Processing
● Visualization technique
● all data as a multi-dimensional space
● great for decision support
● CPU & RAM intensive
● hard to do on really big data
● Works well with column stores
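
A cube can be approximated in a few lines by summing a measure over whichever dimensions are of interest; drilling down is then just adding a dimension to the grouping. The sketch is illustrative only: `rollup`, the sample `visits`, and the `"|"` key separator are assumptions for this example, with dimensions loosely following the site / browser / repeat-visitor axes of the cube slide.

```javascript
// Sum a measure over any subset of dimensions; each distinct combination is a cell.
function rollup(facts, dims, measure) {
  const cells = new Map();
  for (const fact of facts) {
    const key = dims.map((d) => fact[d]).join("|");
    cells.set(key, (cells.get(key) || 0) + fact[measure]);
  }
  return Object.fromEntries(cells);
}

// Hypothetical fact rows.
const visits = [
  { site: "blog", browser: "Firefox", repeat: true, hits: 3 },
  { site: "blog", browser: "Chrome", repeat: false, hits: 1 },
  { site: "shop", browser: "Firefox", repeat: false, hits: 2 },
  { site: "blog", browser: "Firefox", repeat: false, hits: 4 },
];

const bySite = rollup(visits, ["site"], "hits");
// Drill-down: same measure, one more dimension.
const bySiteAndBrowser = rollup(visits, ["site", "browser"], "hits");
```

Every extra dimension multiplies the number of possible cells, which gives a feel for why OLAP is CPU and RAM intensive and hard on really big data.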

Contact
● Josh Berkus: [email protected]
● blog: blogs.ittoolbox.com/database/soup
● twitter: @fuzzychef
● PostgreSQL: www.postgresql.org
● pgexperts: www.pgexperts.com

This talk is copyright 2011 Josh Berkus and is licensed under the Creative Commons Attribution license. Many images were taken from google images and are copyright their original creators, whom I don't actually know. Logos are trademark their respective owners, and are used here under fair use.