CS 345: Topics in Data Warehousing Tuesday, September 28, 2004




Outline of Today’s Class

• What is data warehousing?

• Transaction processing vs. data analysis

• Course logistics

• Data integration


A Brief History of Information Technology

• The “dark ages”: paper forms in file cabinets
• Computerized systems emerge
  – Initially for big projects like Social Security
  – Same functionality as old paper-based systems
• The “golden age”: databases are everywhere
  – Most activities tracked electronically
  – Stored data provides detailed history of activity
• The next step: use data for decision-making
  – The focus of this course!
  – Made possible by omnipresence of IT
  – Identify inefficiencies in current processes
  – Quantify likely impact of decisions


Databases for Decision Support

• 1st phase: Automating existing processes makes them more efficient.
  – Automation → lots of well-organized, easily accessed data
• 2nd phase: Data analysis allows for better decision-making.
  – Analyze data → better understanding
  – Better understanding → better decisions
• “Data Entry” vs. “Thinking”
  – Data analysts are decision-makers: managers, executives, etc.


OLTP vs. OLAP

• OLTP: On-Line Transaction Processing
  – Many short transactions (queries + updates)
  – Examples:
    • Update account balance
    • Enroll in course
    • Add book to shopping cart
  – Queries touch small amounts of data (one record or a few records)
  – Updates are frequent
  – Concurrency is biggest performance concern
• OLAP: On-Line Analytical Processing
  – Long transactions, complex queries
  – Examples:
    • Report total sales for each department in each month
    • Identify top-selling books
    • Count classes with fewer than 10 students
  – Queries touch large amounts of data
  – Updates are infrequent
  – Individual queries can require lots of resources
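
To make the contrast concrete, here is a minimal sketch of one OLTP-style operation and one OLAP-style query, using Python's built-in sqlite3 module. The `account` and `sale` tables, their columns, and the sample rows are assumptions invented for illustration; they do not come from the course materials.

```python
import sqlite3

# Hypothetical toy schema; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL);
    CREATE TABLE sale (id INTEGER PRIMARY KEY, department TEXT,
                       month TEXT, amount REAL);
    INSERT INTO account VALUES (1, 100.0), (2, 250.0);
    INSERT INTO sale (department, month, amount) VALUES
        ('Books', '2004-08', 20.0),
        ('Books', '2004-09', 35.0),
        ('Music', '2004-09', 15.0),
        ('Books', '2004-09', 10.0);
""")

# OLTP-style: a short transaction that touches a single record.
with conn:
    conn.execute("UPDATE account SET balance = balance - 30.0 WHERE id = 1")
balance = conn.execute("SELECT balance FROM account WHERE id = 1").fetchone()[0]

# OLAP-style: scans the whole sale table and aggregates it
# (total sales for each department in each month).
report = conn.execute("""
    SELECT department, month, SUM(amount)
    FROM sale
    GROUP BY department, month
    ORDER BY department, month
""").fetchall()

print(balance)
print(report)
```

The update reads and writes one row by primary key, while the report has to read every row of `sale`; on a realistically sized table that difference is what drives the separate performance concerns listed above.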


Why OLAP & OLTP don’t mix (1)

• Transaction processing (OLTP):
  – Fast response time important (< 1 second)
  – Data must be up-to-date, consistent at all times
• Data analysis (OLAP):
  – Queries can consume lots of resources
  – Can saturate CPUs and disk bandwidth
  – Operating on static “snapshot” of data usually OK
• OLAP can “crowd out” OLTP transactions
  – Transactions are slow → unhappy users
• Example:
  – Analysis query asks for sum of all sales
  – Acquires lock on sales table for consistency
  – New sales transaction is blocked

Different performance requirements
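
The "static snapshot" idea can be sketched with Python's sqlite3 module, whose `Connection.backup` method copies one database into another. This is only an illustration under assumed names (a hypothetical `sale` table in two in-memory databases); it does not reproduce the locking behavior of a production DBMS.

```python
import sqlite3

# "Live" transactional database (hypothetical toy data).
live = sqlite3.connect(":memory:")
live.execute("CREATE TABLE sale (id INTEGER PRIMARY KEY, amount REAL)")
live.executemany("INSERT INTO sale (amount) VALUES (?)", [(10.0,), (20.0,)])
live.commit()

# Take a static snapshot for analysis by copying the database contents.
snapshot = sqlite3.connect(":memory:")
live.backup(snapshot)

# A new sale arrives after the snapshot was taken; it only touches `live`.
live.execute("INSERT INTO sale (amount) VALUES (?)", (5.0,))
live.commit()

# The analysis query runs against the snapshot, not the live table.
snapshot_total = snapshot.execute("SELECT SUM(amount) FROM sale").fetchone()[0]
live_total = live.execute("SELECT SUM(amount) FROM sale").fetchone()[0]

print(snapshot_total, live_total)
```

The analysis result is slightly out of date, but the sum over all sales never competes with new sales transactions for the same table.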


Why OLAP & OLTP don’t mix (2)

• Transaction processing (OLTP):
  – Normalized schema for consistency
  – Complex data models, many tables
  – Limited number of standardized queries and updates
• Data analysis (OLAP):
  – Simplicity of data model is important
    • Allow semi-technical users to formulate ad hoc queries
  – De-normalized schemas are common
    • Fewer joins → improved query performance
    • Fewer tables → schema is easier to understand

Different data modeling requirements
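
A small sketch of the modeling difference, again using sqlite3. The three normalized tables and the wide `sale_flat` table are hypothetical examples made up for this illustration; the point is only that the same question needs two joins against the normalized schema and none against the de-normalized one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized (OLTP-style): each fact is stored exactly once.
    CREATE TABLE department (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, product_name TEXT,
                          dept_id INTEGER REFERENCES department(dept_id));
    CREATE TABLE sale (sale_id INTEGER PRIMARY KEY,
                       product_id INTEGER REFERENCES product(product_id),
                       amount REAL);
    INSERT INTO department VALUES (1, 'Books'), (2, 'Music');
    INSERT INTO product VALUES (10, 'SQL Primer', 1), (11, 'Jazz CD', 2);
    INSERT INTO sale VALUES (100, 10, 25.0), (101, 11, 12.0), (102, 10, 25.0);

    -- De-normalized (OLAP-style): one wide table, names repeated on every row.
    CREATE TABLE sale_flat AS
        SELECT s.sale_id, p.product_name, d.dept_name, s.amount
        FROM sale s
        JOIN product p ON p.product_id = s.product_id
        JOIN department d ON d.dept_id = p.dept_id;
""")

# Sales per department against the normalized schema: two joins.
normalized = conn.execute("""
    SELECT d.dept_name, SUM(s.amount)
    FROM sale s
    JOIN product p ON p.product_id = s.product_id
    JOIN department d ON d.dept_id = p.dept_id
    GROUP BY d.dept_name
    ORDER BY d.dept_name
""").fetchall()

# The same question against the de-normalized table: no joins.
flat = conn.execute("""
    SELECT dept_name, SUM(amount)
    FROM sale_flat
    GROUP BY dept_name
    ORDER BY dept_name
""").fetchall()

print(normalized)
print(flat)
```

Both queries return the same totals. The flat table trades redundancy (department names repeated on every row) for a query that a semi-technical user can write without knowing the join paths.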


Why OLAP & OLTP don’t mix (3)

• An OLTP system targets one specific process
  – For example: ordering from an online store
• OLAP integrates data from different processes
  – Combine sales, inventory, and purchasing data
  – Analyze experiments conducted by different labs
• OLAP often makes use of historical data
  – Identify long-term patterns
  – Notice changes in behavior over time
• Terminology, schemas vary across data sources
  – Integrating data from disparate sources is a major challenge

Analysis requires data from many sources


Data Warehouses

• Doing OLTP and OLAP in the same database system is often impractical
  – Different performance requirements
  – Different data modeling requirements
  – Analysis queries require data from many sources
• Solution: Build a “data warehouse”
  – Copy data from various OLTP systems
  – Optimize data organization, system tuning for OLAP
  – Transactions aren’t slowed by big analysis queries
  – Periodically refresh the data in the warehouse


Course Logistics

• Course web site: http://cs345.stanford.edu
• Course format will be lecture-based
  – As opposed to a paper-reading course
• Prerequisite:
  – Knowledge of SQL


Assigned Work

• Five homework assignments
  – One problem set
  – Four programming assignments
    • Not a lot of code to write
    • Emphasis will be on interacting with Oracle
• Course project
  – Open-ended
  – Focus on a topic of your choosing
  – Any of these types:
    • Research project, or…
    • Programming project, or…
    • Survey of research literature
  – May be done individually or in groups of two
• Final Exam


High-Level Course Outline

• Logical Database Design
  – How should the data be modeled?
  – Designing the data warehouse schema
• Query Processing
  – Analysis queries are hard to answer efficiently
  – What techniques are available to the DBMS?
• Physical Database Design
  – How should the data be organized on disk?
  – What data structures should be used?
• Data Mining
  – What use is all this data?
  – Which questions should we ask our data warehouse?


Additional Topics

• Related topics to be touched on briefly:
  – Data integration
  – Data cleaning
  – Approximate query answering
  – Data lineage
  – Data visualization
  – Incremental maintenance of materialized views
  – Answering queries using views
  – Indexing special data types (spatial, text, geographic)
  – Metadata management

• Projects can be done in these areas


The Textbook

• “The Data Warehouse Toolkit” by Ralph Kimball and Margy Ross
• Written by well-known data warehouse designer
• Clearly written and readable
• Lots of generic but realistic examples
• Semi-technical (no math!)
• Business-focused
• We’ll use it for the first one-third of the course
• Get the second edition!


Course Objectives

• Gain practical understanding of how data warehouses are built and used
• Gain exposure to data modeling “best practices”
• Learn techniques used to process complex queries over very large data sets
• Understand the performance trade-offs that come from alternative data structures
• Learn commonly-used methods for mining and analysis of large data sets
• Become familiar with current research directions in data warehousing and related areas


Loading the Data Warehouse

[Diagram: Source Systems (OLTP) → Data Staging Area → Data Warehouse]
• Data is periodically extracted
• Data is cleansed and transformed
• Users query the data warehouse
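
The extract, cleanse/transform, and load steps can be sketched as three small Python functions over sqlite3 databases. Everything named here (the `orders` and `order_fact` tables, the cleaning rules, the full-refresh strategy) is an assumption chosen for illustration, not the procedure prescribed by the course or by any particular ETL tool.

```python
import sqlite3

# Hypothetical source (OLTP) system containing some messy rows.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "  alice smith", 19.999), (2, "BOB JONES ", None), (3, "carol wu", 5.0)],
)
source.commit()

# Hypothetical warehouse with a single table to load into.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE order_fact (order_id INTEGER, customer TEXT, amount REAL)"
)


def extract(src):
    """Pull all rows out of the source system."""
    return src.execute("SELECT id, customer, amount FROM orders").fetchall()


def transform(rows):
    """Staging step: drop rows with a missing amount, tidy names, round amounts.

    These rules are assumptions; real cleansing rules depend on the data.
    """
    cleaned = []
    for order_id, customer, amount in rows:
        if amount is None:
            continue
        cleaned.append((order_id, customer.strip().title(), round(amount, 2)))
    return cleaned


def load(dst, rows):
    """Full refresh: replace the warehouse table contents in one transaction."""
    with dst:
        dst.execute("DELETE FROM order_fact")
        dst.executemany("INSERT INTO order_fact VALUES (?, ?, ?)", rows)


load(warehouse, transform(extract(source)))
warehouse_rows = warehouse.execute(
    "SELECT order_id, customer, amount FROM order_fact ORDER BY order_id"
).fetchall()
print(warehouse_rows)
```

Re-running the last `load(...)` line on a schedule is one simple way to refresh the warehouse periodically; an incremental load that only copies changed rows is a common alternative.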


Data Integration is Hard

• Data warehouses combine data from multiple sources
• Data must be translated into a consistent format
• Data integration represents ~80% of effort for a typical data warehouse project!
• Some reasons why it’s hard:
  – Metadata is poor or non-existent
  – Data quality is often bad
    • Missing or default values
    • Multiple spellings of the same thing (Cal vs. UC Berkeley vs. University of California)
  – Inconsistent semantics
    • What is an airline passenger?
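
Two of the data quality problems above, default values and multiple spellings, can be illustrated with a small cleaning function. The lookup table, the list of placeholder strings treated as missing, and the choice of "UC Berkeley" as the canonical form are all assumptions for this sketch; in practice such mappings have to be built and maintained per source.

```python
# Hypothetical mapping from known spellings (lower-cased) to one canonical name.
CANONICAL = {
    "cal": "UC Berkeley",
    "uc berkeley": "UC Berkeley",
    "university of california": "UC Berkeley",
}

# Hypothetical placeholder strings that really mean "no value".
MISSING = {"", "n/a", "unknown"}


def clean_school(raw):
    """Return a canonical school name, or None if the value is missing."""
    if raw is None:
        return None
    key = " ".join(raw.split()).lower()  # collapse whitespace, ignore case
    if key in MISSING:
        return None
    # Unknown spellings pass through unchanged (trimmed) rather than being guessed.
    return CANONICAL.get(key, raw.strip())


raw_values = ["Cal", "UC  Berkeley", "university of california", "N/A", None, "Stanford"]
cleaned = [clean_school(value) for value in raw_values]
print(cleaned)
```

A lookup table handles spelling variants, but it cannot resolve inconsistent semantics: deciding what counts as an "airline passenger" requires agreeing on a definition, not on a string.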


Federated Databases

• An alternative to data warehouses
• Data warehouse
  – Create a copy of all the data
  – Execute queries against the copy
• Federated database
  – Pull data from source systems as needed to answer queries
• “lazy” vs. “eager” data integration

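[Diagram: two architectures side by side. Data Warehouse: data flows from the Source Systems through Extraction into the Warehouse; a Query goes to the Warehouse, which returns the Answer. Federated Database: a Query goes to the Mediator, which sends Rewritten Queries to the Source Systems and returns the Answer.]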
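
The difference between "eager" and "lazy" integration can be sketched with two small Python classes. Plain lists stand in for the source systems, and the class and method names (`Warehouse`, `Mediator`, `total_sales`, `refresh`) are hypothetical; a real mediator would also rewrite queries into each source's own schema, which this sketch omits.

```python
class Warehouse:
    """Eager integration: copy the source data up front, then query the copy."""

    def __init__(self, sources):
        self.sources = sources
        self.copy = []
        self.refresh()

    def refresh(self):
        # Extraction step: re-copy every row from every source.
        self.copy = [row for source in self.sources for row in source]

    def total_sales(self):
        return sum(amount for _, amount in self.copy)


class Mediator:
    """Lazy integration: pull from the sources only when a query arrives."""

    def __init__(self, sources):
        self.sources = sources

    def total_sales(self):
        return sum(amount for source in self.sources for _, amount in source)


# Two hypothetical source systems holding (item, amount) rows.
store_a = [("book", 10.0)]
store_b = [("cd", 5.0)]

warehouse = Warehouse([store_a, store_b])
mediator = Mediator([store_a, store_b])

store_a.append(("dvd", 20.0))  # a new sale lands in a source system

stale = warehouse.total_sales()   # still reflects the last extraction
fresh = mediator.total_sales()    # reads the sources at query time
warehouse.refresh()
after_refresh = warehouse.total_sales()

print(stale, fresh, after_refresh)
```

The warehouse answers from its copy without touching the sources, at the cost of being slightly out of date until the next refresh; the mediator is always current but sends every analysis query to the source systems.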


Warehouses vs. Federation

• Advantages of federated databases:
  – No redundant copying of data
  – Queries see “real-time” view of evolving data
  – More flexible security policy
• Disadvantages of federated databases:
  – Analysis queries place extra load on transactional systems
  – Query optimization is hard to do well
  – Historical data may not be available
  – Complex “wrappers” needed to mediate between analysis server and source systems
• Data warehouses are much more common in practice
  – Better performance
  – Lower complexity
  – Slightly out-of-date data is acceptable