Upload
pearl-gilmore
View
214
Download
0
Tags:
Embed Size (px)
Citation preview
© 2009 IBM Corporation
Scalable Data Managementwith DB2
Matthias NicolaIBM Silicon Valley [email protected]
Information Management – DB2
© 2009 IBM Corporation2
Information Management Software
Agenda
Introduction DB2 Scalability for OLTP and Data Warehousing DB2's Database Partitioning Feature (DPF)
– Overview– Data partitioning, clustering, placement– Join Methods
TPoX Scalability in a DPF database– Scalability vs. Performance– Benchmark configuration & results
pureScale Overview Summary
Information Management – DB2
© 2009 IBM Corporation3
Information Management Software
DB2 Workgroup Edition
DB2 Everyplace
DB2 Enterprise Edition /IBM InfoSphere Warehouse
DB2 Data Server Editions
DB2 for z/OS
DB2
DB2 Express-C (free!)
Information Management – DB2
© 2009 IBM Corporation4
Information Management Software
Business Value of Scalability
More historical data = more precise forecasts– Data mining needs a lot of data for pattern accuracy– OLAP needs a lot of data for forecast accuracy
Predictable costs when growth occurs– Often the budget is the controlling factor, not technology– Low maintenance cost is important
No forced migrations from technology limitations– Enabling very large databases
Information Management – DB2
© 2009 IBM Corporation5
Information Management Software
DB2 Scalability for OLTP and Data Warehousing
Database Partitioning Feature (DPF) DB2 pureScale Range partitioning Multi-Dimensional Clustering (MDC) Compression Self-Tuning Memory Management (STMM) Automatic Storage Workload Management High Availability, Disaster Recovery Recovery Security and Compliance Utilities: Load, Backup & Restore, Redistribute Archiving etc.
Information Management – DB2
© 2009 IBM Corporation6
Information Management Software
Agenda
Introduction DB2 Scalability for OLTP and Data Warehousing DB2's Database Partitioning Feature (DPF)
– Overview– Data partitioning, clustering, placement– Join Methods
TPoX Scalability in a DPF database– Scalability vs. Performance– Benchmark configuration & results
pureScale Overview Summary
Information Management – DB2
© 2009 IBM Corporation7
Information Management Software
Database is divided into multiple database partitions Database partitions run on same or separate servers (shared-nothing) Each partition has its own table spaces, log, configuration, etc. Data is spread over N database partitions Queries are executed in parallel on all database partitions
select … from table
Tables
FCM network
Database
…Partition n
data+log
Engine
Partition 3
data+log
Engine
Partition 2
data+log
Engine
Partition 1
data+log
Engine
DB2's Database Partitioning Feature (DPF)
Information Management – DB2
© 2009 IBM Corporation8
Information Management Software
Flexible configuration options
Storage server
I/O Channels
SMP server SMP server
DB2 Partition
DB2 Partition
DB2 Partition
DB2 Partition
DB2 Partition
DB2 Partition
DB2 Partition
DB2 Partition
Storage server
I/O Channels
SMP server SMP server
DB2 Partition
DB2 Partition
DB2 Partition
DB2 Partition
DB2 Partition
DB2 Partition
DB2 Partition
DB2 Partition
FCM (Fast Communication Manager)
Example: 4 physical machines, 2 database partitions per machine
Possible hardware configurations
All database partitions on a single machine (logical partitions) easy exploitation of multi-core systems
All database partitions on separate machines (physical partitions)
Hybrid: multiple machines with several logical partitions on each
http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/topic/com.ibm.db2.luw.admin.partition.doc/doc/c0004569.html
Information Management – DB2
© 2009 IBM Corporation9
Information Management Software
DB2's Database Partitioning Feature (DPF)
….
….
Information Management – DB2
© 2009 IBM Corporation10
Information Management Software
The Distribution Map
0 1 2 3 4 5 6 7 … … 32k
1 2 3 4 1 2 3 4 … …
5DB2 hash algorithm
Distribution map
Partition1 Partition2 Partition3 Partition4
Distribution key
C1000120
• Distribution key can consist of one or multiple columns.
• Avoid low cardinality columns, such as "gender", "state", etc.
• Unique indexes must contain all columns of the distribution key
i
p(i)
column name
column value
Information Management – DB2
© 2009 IBM Corporation11
Information Management Software
Agenda
Introduction DB2 Scalability for OLTP and Data Warehousing DB2's Database Partitioning Feature (DPF)
– Overview– Data partitioning, clustering, placement– Join Methods
TPoX Scalability in a DPF database– Scalability vs. Performance– Benchmark configuration & results
pureScale Overview Summary
Information Management – DB2
© 2009 IBM Corporation13
Information Management Software
Database Partition 1 Database Partition 2 Database Partition 3
DB2 Database Partitioning Feature = Divide Work
Information Management – DB2
© 2009 IBM Corporation14
Information Management Software
January
February
March
Database Partition 1 Database Partition 2 Database Partition 3
Range Partitioning Further Reduces I/O
CREATE TABLE sales (recordID INT, salesdate DATE, ... details XML)DISTRIBUTE BY HASH (recordID) PARTITION BY RANGE (salesdate) EVERY 1 MONTHS ;
Information Management – DB2
© 2009 IBM Corporation15
Information Management Software
February
March
January
Database Partition 1 Database Partition 2 Database Partition 3
Multi-Dimensional Clustering to Further Reduce I/O
CREATE TABLE sales (recordID INT, salesdate DATE, productID INTEGER, storeID INTEGER, ... details XML)DISTRIBUTE BY HASH (recordID) PARTITION BY RANGE (salesdate) EVERY 1 MONTHS ORGANIZE BY (productID, storeID) ;
Information Management – DB2
© 2009 IBM Corporation16
Information Management Software
February
March
January
Database Partition 1 Database Partition 2 Database Partition 3
Compression Reduces I/O by a Factor of 3x to 4x
Information Management – DB2
© 2009 IBM Corporation19
Information Management Software
Data Partitioning and Placement Options
Part. 1
Part. 2
Part. 3
Part. 5
Part. 6
Part. 7
Part. 4
Part. 8
Database Partitions
Table 1: Sales
Can distribute a table across some or all database partitions.Can replicate a table to have an identical copy on each partition.
Table 2: Customer
Table 3:Product
Table 3:Product
(copy)
Table 3:Product
(copy)
Table 3:Product
(copy)
Table 3:Product
(copy)
Table 3:Product
(copy)
Table 3:Product
(copy)
Table 3:Product
(copy)
Information Management – DB2
© 2009 IBM Corporation20
Information Management Software
Agenda
Introduction DB2 Scalability for OLTP and Data Warehousing DB2's Database Partitioning Feature (DPF)
– Overview– Data partitioning, clustering, placement– Join Methods
TPoX Scalability in a DPF database– Scalability vs. Performance– Benchmark configuration & results
pureScale Overview Summary
Information Management – DB2
© 2009 IBM Corporation21
Information Management Software
Join Processing - Example
create table tab1(pk1 int, c1 int,...) distribute by hash (pk1);
create table tab2(pk2 int, c2 int,...) distribute by hash (pk2);
tab1
databasepartition 1
databasepartition 2
tab2 tab1 tab2
pk1 c1 2 3 8 1212 15
pk1 c1 1 3 3 4 7 711 10
pk2 c2 3 2 5 3 7 415 7
pk2 c2 4 8 8 1510 1012 12
tab1 tab2
pk1 c1 1 3 2 3 3 4 7 7 8 1211 1012 15
pk2 c2 3 2 4 8 5 3 7 4 8 1510 1012 1215 7
Logical data in the tables: Physical data distribution:
distribute by hash*
*For simplicity, this example hashes odd key values to partition 1 and even key values to partition 2
Information Management – DB2Information Management Software
© 2009 IBM Corporation22
create table tab1(pk1 int, c1 int,...) distribute by hash (pk1);create table tab2(pk2 int, c2 int,...) distribute by hash (pk2);
select * from tab1, tab2 where tab1.pk1 = tab2.pk2;
tab1
pk1 1 3 711
partition 1
pk2 3 5 715
pk1 2 812
partition 2
pk2 4 81012
tab2 tab1 tab2
• Both tables are partitioned by the join key
• Any join matches are guaranteed to be within any given partition ("co-located")
• No join matches across partitions
• Allows local joins within each partition, no data movement
• Best case, best performance
Collocated Join
Information Management – DB2Information Management Software
© 2009 IBM Corporation23
select * from tab1, tab2 where tab1.c1 = tab2.pk2;
tab1
partition 1
pk2 3 5 715
partition 2
pk2 4 81012
tab2 tab1 tab2
pk1 c1 3 411 10 8 12
tab1'
partition 1
pk2 3 5 715
partition 2
pk2 4 81012
tab2 tab1' tab2
pk1 c1 1 3 2 3 7 712 15
DTQ
Send rows from tab1 to those partitions where they can find join matches in tab2,i.e. redistribution of tab1, based on hashing of the join key c1.
pk1 c1 2 3 8 1212 15
pk1 c1 1 3 3 4 7 711 10
permanent storage on the fly / in memory
Directed Join
Information Management – DB2Information Management Software
© 2009 IBM Corporation24
tab1
partition 1
pk2 3 5 715
partition 2
pk2 4 81012
tab2 tab1 tab2 tab1'
partition 1
pk2 3
partition 2
tab2'
pk1 c1 1 3 2 3
DTQ
Value predicates are used to optimize (reduce) the data flowand eliminate irrelevant partitions from the join processing.
pk1 c1 2 3 8 1212 15
pk1 c1 1 3 3 4 7 711 10
select * from tab1, tab2
where tab1.c1 = 3 and tab1.c1 = tab2.pk2;
Single Partition Directed Join
Information Management – DB2Information Management Software
© 2009 IBM Corporation25
select * from tab1, tab2 where tab1.c1 = tab2.c2;
pk1 c1 3 411 10 8 12
tab1'
partition 1 partition 2
tab2' tab1' tab2'
pk1 c1 1 3 2 3 7 712 15
tab1
partition 1 partition 2
tab2 tab1 tab2
DTQ
Redistribute both tables by hashing on their join keys so that matching rows end up on the same partition.
pk1 c1 2 3 8 1212 15
pk1 c1 1 3 3 4 7 711 10
pk2 c2 3 2 5 3 7 415 7
pk2 c2 4 8 8 1510 1012 12
DTQ
pk2 c2 5 315 7 4 11 8 15
pk2 c2 3 2 7 410 1012 12
Repartitioned Join
Information Management – DB2Information Management Software
© 2009 IBM Corporation26
select * from tab1, tab2
tab1'
partition 1
pk2 3 5 715
partition 2
pk2 4 81012
tab2 tab1' tab2tab1
partition 1
pk2 3 5 715
partition 2
pk2 4 81012
tab2 tab1 tab2
BTQ
Broadcast a copy of one table to all database partitions.
pk1 1 3 711 2 812
pk1 1 3 711 2 812
pk1 2 8 12
pk1 1 3 7 11
Broadcast Join
Information Management – DB2Information Management Software
© 2009 IBM Corporation27
tab1
partition 1
pk2 3 5 715
partition 2
pk2 4 81012
tab2 tab1(copy) tab2
Good choice for small tables with infrequent insert/update/delete activity, such as dimension tables in a star schema.
pk1 1 3 711 2 812
pk1 1 3 711 2 812
permanent storage
Data Placement Option: Replicated Table
Information Management – DB2
© 2009 IBM Corporation28
Information Management Software
Agenda
Introduction DB2 Scalability for OLTP and Data Warehousing DB2's Database Partitioning Feature (DPF)
– Overview– Data partitioning, clustering, placement– Join Methods
TPoX Scalability in a DPF database– Scalability vs. Performance– Benchmark configuration & results
pureScale Overview Summary
Information Management – DB2
© 2009 IBM Corporation29
Information Management Software
Scalability vs. Performance
Performance: Time to complete a given task with given resources Scalability: Ability to add resources to
– complete the same task more quickly– handle a bigger task in about the same time
Example: Mowing the lawn…– Peter does it alone in 8 hours– Peter and Bob work together and take 4 hours
Scalability is perfect, performance is poor!
– Jim does it alone in 1 hour– Jim and John together do it in 1hrs20min
Performance is great, scalability is awful !
– Mary mows the lawn in 30 minutes– Mary and Susan together need 15 minutes
Performance is great, scalability is also great !
30
Scalability Metrics
# of partitions
Qu
ery
elap
sed
tim
e Fixed Database Size
Database size & # of partitions
Qu
ery
elap
sed
tim
e Increasing Database Size
Basic assumption: Queries executed against a bigger database examine more data
Mathematically, these two approaches are
equivalent….
Make queries against a DB of a fixed size faster by adding partitions (“speedup”). Amount of data per partition shrinks.
Hold response time constant for a growing database by adding partitions in proportion (“scaleup”/"scale-out"). Amount of data per partition remains constant.
31
Our Test Design
n partitions250GB
Qu
ery
elap
sed
tim
e
n*2 partitions500GB
n*4 partitions1 TB
Increasing database size: 250GB / 500GB / 1TB Increasing number of database partitions
Fixed ratio of data volume to number of partitions
Show constant query elapsed times to prove scalability
32
TPoX Benchmark TPoX = Transaction Processing over XML Data Open Source Benchmark: http://tpox.sourceforge.net/ Financial transaction processing scenario: “online brokerage” Realistic test for XML databases
DB
Customers
BrokerageHouse DB
Customers
BrokerageHouse
1
n
n n
n n 1 1
1
1
1 Customer Holding Account
Order Security
FIXML (41 XSD files) Security.xsd
4 – 20 kb 1 – 2 kb 2 – 9 kb
FIXML: Standardized Financial XML Schema for Securities Trading !
Custacc
CustAcc.xsd
33
ID Name DateOfBirth Address Phone … Account
Account
ID Currency OpeningDate Balance … Holding
Holding
Holding…
Symbol Name Type Quantity
Symbol Name Type Quantity
ID Currency OpeningDate Balance … Holding Symbol
Name Type Quantity
CustAcc ID Symbol Name SecurityType SecurityInformation StockInformation
Sector Industry Category OutstShares
FundInformation FundFamily Sector Industry AssetGroup FixedIncome ExpenseRatio TotalAssets MinInitialInvestment MinSubsequentInvest.
Price/LastTrade Ask/Bid 50DayAvg 200DayAvg …
ID OrignDt TrdDt Acct Side Qty Sym …
SecurityOrder
Document structures and join relationships
34
TPoX Data & Schema
Scale Factor “M”, 1 TB raw data 500M Order documents, 50M CustAcc documents 20,833 Securities, independent of scale factor 3 Simple Tables + XML Indexes
create table custacc ( cadoc XML ) create table security ( sdoc XML ) create table order ( odoc XML )
1
n
n n
n n 1 1
1
1
1 Customer Holding Account
Order Security
CustAcc.xsd FIXML
(41 XSD files) Security.xsd
FIXML: financial industry XML Schema
CustAcc: modeled after a real banking systemthat uses XML
Security: informationsimilar to investmentweb sites
Database schema for a non-DPF DB2 database:
35
custidinteger
cdocXML
custidinteger
secsymvarchar
odocXML
TPoX Database Schema for DPF
order table (500M rows) custacc table (50M rows)
!
- Extract certain XML element values into relational cols as distribution keys- Goal: enable partitioning of both tables by a common key
ID Name DateOfBirth Address Phone … Account
Account
ID Currency OpeningDate Balance … Holding
Holding
Holding…
Symbol Name Type Quantity
Symbol Name Type Quantity
ID Currency OpeningDate Balance … Holding Symbol
Name Type Quantity
CustAcc
ID OrignDt TrdDt Acct Side Qty Sym …
Order
36
What is TPoX-DSS*?
• Decision Support workload on top of the regular XML data of the TPoX benchmark
• A set of complex SQL/XML queries– Includes massive table scans, aggregation,
grouping, OLAP functions, etc.
• Focus on single-user query response time
* we might come up with a better name in the near future
37
Business Questions Complex SQL/XML Queries
Q1: Popular SecuritiesFind securities that have more shares bought than sold across all orders. List their order quantities grouped by year.
Q2: Top 10 Most Popular Trading Weeks, Ranked by Order VolumeFor each year, find the ten most active weeks and return the buy, sell, and total order volumes for each week.
Q3: Average Account Balance of Premiun CustomersCalculate the average account balance of all premium customers, grouped by their number of accounts.
Q4: Average Balance per Number Of AccountsCalculate the average account balance of all customers, grouped by their number of accounts.
Q5: Percentage of buy orders per sector and genderFor each stock in a given sector of securities, find the percentage of buy orders placed by male vs. female clients.
38
Business Questions Complex SQL/XML Queries
Q6: Max Stock Orders for an IndustryList the 20% (or: x%) most expensive orders for customer in a given state and for a given industry (subset of securities).
Q7: Order Amounts for Two Major CurrenciesCalculate the min, max and avg order amount for all orders in a given timeframe grouped by buy/sell for two major currencies.
Q8: Order Amounts for All CurrenciesCalculate the min, max and avg order amount for all orders in a given timeframe grouped by buy/sell and the order’s currency.
Q9: Balance per CurrencyEach account is in a specific currency. Calculate the average account balance for each currency.
Q10: Sleeping CustomersFind all customers having less than x orders in a given timeframe.
39
TPoX DSS: Query Characteristics
Query Tables Characteristics
Q1 Popular Securities O, S 2 x XMLTABLE, Group By, Order By
Q2 Top 10 Most Popular Trading Weeks O Full scan of all orders,OLAP Function rank()
Q3 Average Account Balance of Premiun Customers
C Indexed access to premium customers, Group By, Order By
Q4 Average Balance per Number Of Accounts
C Full scan of all customers
Q5 Percentage of buy orders per sector and gender
C, O, S Aggregation, SQL OLAP Functions, 3 x XMLTABLE, 2 x XMLEXISTS
Q6 Max Stock Orders for an Industry C, O, S 2 x XMLTABLE, 2 x XMLEXISTS
Q7 Order Amounts for Two Major Currencies O Several predicates, CASE expression
Q8 Order Amounts for All Currencies O 4 aggregation functions,Group By two XML attributes
Q9 Balance per Currency C Full scan of all accounts, aggregation and grouping
Q10 Sleeping Customers C, O Common table expression
All queries available upon request, in SQL/XML notation.
40
Q5: Percentage of buy orders per sector and genderSELECT DISTINCT secsector, gender, SUM(ordqty) OVER (PARTITION BY secsector, gender) AS orderqty, SUM(ordqty) OVER (PARTITION BY secsector, gender) * 100 / SUM(ordqty) OVER (PARTITION BY secsector) AS percentageFROM security, order, custacc, XMLTABLE(' declare namespace s="http://tpox-benchmark.com/security";
$SDOC/s:Security' COLUMNS secsector VARCHAR(30) PATH '*:SecurityInformation//*:Sector', secname VARCHAR(50) PATH '*:Name') AS T1,
XMLTABLE(' declare default element namespace "http://www.fixprotocol.org/FIXML-4-4"; $ODOC/FIXML/Order'
COLUMNS ordqty BIGINT PATH '*:OrdQty/@Qty') AS T2, XMLTABLE(' declare namespace c="http://tpox-benchmark.com/custacc";
$CADOC/c:Customer' COLUMNS gender VARCHAR(10) PATH '*:Gender') AS T3
WHERE order.secsym = security.secsym AND order.custid = custacc.custid AND XMLEXISTS(' declare namespace s="http://tpox-benchmark.com/security";
$SDOC/s:Security/s:SecurityInformation/*[s:Industry="OfficeSupplies" and s:MinInitialInvestment=5000]') AND XMLEXISTS(' declare default element namespace "http://www.fixprotocol.org/FIXML-4-4";
$ODOC/FIXML/Order[@Side = "2"]') ORDER BY secsector, gender;
Information Management – DB2
© 2009 IBM Corporation41
Information Management Software
Agenda
Introduction DB2 Scalability for OLTP and Data Warehousing DB2's Database Partitioning Feature (DPF)
– Overview– Data partitioning, clustering, placement– Join Methods
TPoX Scalability in a DPF database– Scalability vs. Performance– Benchmark configuration & results
pureScale Overview Summary
45
Data Partitioning in a Cluster
Node 1
Node 2
Node 3
Node 5
Node 6
Node 7
Node 4
Node 8
8 processing nodes
8 database partitions, 250GB
16 database partitions, 500 GB
32 database partitions, 1TB
Each node has 2 Intel Xeon 5169 dual-core CPUs, and 32GB RAM.4 cores per node we use 4 database partitions per node.
47
Query response times for 500GB and 1TB are close to the 250GB results!
Scalability Results: ClusterTPoX/DSS Query Response Times (Cluster)
Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
Query
Ela
pse
d t
ime
(sec
on
ds)
250GB / 8 partitions
500GB / 16 partitions
1TB / 32 partitions
Source: IBM internally measured results, September 2009
Information Management – DB2
© 2009 IBM Corporation48
Information Management Software
Agenda
Introduction DB2 Scalability for OLTP and Data Warehousing DB2's Database Partitioning Feature (DPF)
– Overview– Data partitioning, clustering, placement– Join Methods
TPoX Scalability in a DPF database– Scalability vs. Performance– Benchmark configuration & results
pureScale Overview Summary
Information Management – DB2Information Management Software
© 2009 IBM Corporation49
DB2 pureScaleGoals
Unlimited CapacityAny transaction processing or ERP workloadStart smallGrow easily, with your business
Application TransparencyAvoid the risk and cost of tuning your applications to the database topology
Continuous AvailabilityMaintain service across planned and unplanned events
Webcast: http://www.channeldb2.com/video/db2-purescale-a-technologyWeb site: http://www.ibm.com/software/data/db2/linux-unix-windows/editions-features-purescale.html
Information Management – DB2Information Management Software
© 2009 IBM Corporation50
Cluster Interconnect
DB2 pureScale : Technology Overview
Single Database View
Clients
Database
Log Log Log Log
Shared Storage Access
CS CS CSCS
CS CS
CS
Member Member Member Member
Primary 2nd-ary
DB2 engine runs on several host computers Co-operate with each other to provide coherent access to the
database from any member
Data sharing architecture Shared access to database Members write to their own logs Logs accessible from another host (used during recovery)
PowerHA pureScale technology Efficient global locking and buffer management Synchronous duplexing to secondary ensures availability
Low latency, high speed interconnect Special optimizations provide significant advantages on RDMA-
capable interconnects (eg. Infiniband)
Clients connect anywhere,…… see single database
Clients connect into any member Automatic load balancing and client reroute may change
underlying physical member to which client is connected
Integrated cluster services Failure detection, recovery automation, cluster file system In partnership with STG (GPFS,RSCT) and Tivoli (SA MP)
Information Management – DB2Information Management Software
© 2009 IBM Corporation51
Scale with Ease
Log
LogLogLog
Without changing applications Efficient coherency protocols
designed to scale without application change
Applications automatically and transparently workload balanced across members
Without administrative complexity No data redistribution required
To 128 members in initial release Limited by testing resources
Single Database View
DB2 DB2 DB2 DB2
Log
DB2
Information Management – DB2Information Management Software
© 2009 IBM Corporation54
What is a PowerHA pureScale ?
Software technology that assists in global buffer coherency management and global locking
Derived from System z Parallel Sysplex & Coupling Facility technology
Software based
Services provided include Group Bufferpool (GBP) Global Lock Management (GLM) Shared Communication Area (SCA)
Members duplex GBP, GLM, SCA state to both a primary and secondary
Done synchronously Duplexing is optional (but recommended) Set up automatically, by default
Shared database(Single database partition)
Log Log
GBP GLM SCA
Primary
Secondary
db2 agents & other threads
log buffer, dbheap, &other heaps
bufferpool(s)
db2 agents & other threads
bufferpool(s)
log buffer, dbheap, &other heaps
Information Management – DB2Information Management Software
© 2009 IBM Corporation55
The Role of the GBP
GBP acts as fast disk cache Dirty pages stored in GBP, then later, written
to disk Provides fast retrieval of such pages when
needed by other members
GBP includes a “Page Registry” Keeps track of what pages are buffered in
each member and at what memory address Used for fast invalidation of such pages when
they are written to the GBP
Force-at-Commit (FAC) protocol ensures coherent access to data across members
DB2 “forces” (writes) updated pages to GBP at COMMIT (or before)
GBP synchronously invalidates any copies of such pages on other members
New references to the page on other members will retrieve new copy from GBP
In-progress references to page can continue
bufferpool(s)
Member 2
GBP GLM SCA
bufferpool(s)
Member 1
Client B :Update T1 set C1=X where C2=YCommit
PageRegistry
M1
Write P
age
Read
Page
M2M2
“Sile
nt” I
nval
idat
e
Client A :Select from T1 where C2=YClient C :
Select from T1 where C2=Y
Information Management – DB2Information Management Software
© 2009 IBM Corporation57
Stealth System Maintenance
Log
LogLogLog
Goal: allow DBAs to apply system maintenance without negotiating an outage window
Procedure:1. Drain (aka Quiesce)
2. Remove & Maintain 3. Re-integrate
4. Repeat until done
Enables continuous availability
Single Database View
DB2 DB2 DB2 DB2
Information Management – DB2Information Management Software
© 2009 IBM Corporation58
Achieving Efficient Scaling : Key Design Points Deep RDMA exploitation over low
latency fabric Enables round-trip response time
~10-15 microseconds
Silent Invalidation Informs members of page updates requires
no CPU cycles on those members
No interrupt or other message processing required
Increasingly important as cluster grows
Hot pages available without disk I/O from GBP memory RDMA and dedicated threads enable read
page operations in ~10s of microseconds
GBP GLM SCA
Buffer Mgr
Lock Mgr Lock Mgr Lock Mgr Lock Mgr
Can I have this lock ?
Yup, here you are.
New
page image
Rea
d P
age
Information Management – DB2Information Management Software
© 2009 IBM Corporation61
of Transaction Throughput
Information Management – DB2
© 2009 IBM Corporation63
Information Management Software
Questions / Discussion
Information Management – DB2
© 2009 IBM Corporation65
Information Management Software
Features to Minimize Planned Outages
Backup: Fast, scalable, granular– Online or offline– Fully parallel and scalable– Can be throttled– Partition-level backup– Table space-level backup– Full, Incremental, or Delta– Volume snapshot support
Load: Fast, scalable and granular– Fully parallel and scalable– Partition-level– Online load– Online index rebuild
Automatic log management
Other utilities– Online statistics collection– Online index create and
reorganization– Online reorganization– Online inspect
Dynamic operations– Configuration parameters– Buffer pool operations– Container operations
Space management– Online container management– Automatic storage– Online index reorganization
Information Management – DB2
© 2009 IBM Corporation66
Information Management Software
Features to Minimize Unplanned Outages
Hardware failures– Integration with TSA cluster
manager– Built-in redundancy can't be turned
off– Consistency bits– Log mirroring– Automatic mirroring of critical data
files– Support for RAID
Fast recovery– Continuous check pointing– Parallel recovery– Automatic recovery tuning– Filtered recovery– Dynamic debugging capability
High availability – Clustering / failover support– Integrated with TSM– Automatic client reroute
Human and Application Errors– Point-in-Time (POT) recovery– Drop table recovery
Miscellaneous– Infinite active logging– Online container operations
Information Management – DB2
© 2009 IBM Corporation67
Information Management Software
OLAP Optimization Advisor
InfoSphere Warehouse will design the aggregates to support dimensional analysis for you using:– Hybrid line– Statistics– Meta-data that describes the
cubes • Hierarchies, dimensions,
measures, etc.– Optimizes to understand
impact to load times and performance trade-off
Information Management – DB2
© 2009 IBM Corporation68
Information Management Software
InfoSphere Warehouse
Universal Cube Access(ODBO, XMLA)
IBM Cognos 8 BI IBM DataQuant& DB2 QMF
Microsoft Excel Cubeware Cockpit
Portals, Web Applications, Dashboards, Interactive Reports,Ad Hoc Analysis, Common Desktop Tools
Universal Cubing Services Access
Information Management – DB2
© 2009 IBM Corporation69
Information Management Software
InfoSphere Warehouse Data Mining
DB2 InfoSphere Warehouse
SQL
ScoringFunctions
Modeling ModelResults
Structured &Unstructured
Data
Data Mining Embedded into Applications and Processes
Mining VisualizerWeb Analytical AppsBI Analytical ToolsSOA Processes
SQL Interface
• Enterprise-Level Data Mining• High-Speed, In-Database Scoring
In-Database Data Mining
Information Management – DB2
© 2009 IBM Corporation70
Information Management Software
InfoSphere Warehouse Text Analytics
Analyze and extract structured data from text– Makes data available to normal reporting and analysis tools– From customer call center records, claim forms, etc.
Benefits– Target specific information hidden within text– Competitive edge by driving further business insight– Drives a greater ROI for your applications
Business value examples– Better product categorization– Early warning on customer attrition– Fraud detection– Product defect analysis– Better customer profiling
Simple text analysis capabilities for text columns stored in warehouse tables
Pattern matching rules and simple linguistics
Enhance existing reports and data mining with insights gleaned from text
Simple rules and dictionary editor
Information Management – DB2
© 2009 IBM Corporation71
Information Management Software
Key Features:
Database design, or reverse engineer an existing database or DDL (RDA)
– View/Modify the schema
– Compare/Sync DB objects
– Analyze design (best practices and dependencies), Validation
– DB2 Storage Modeling: Table Space, Buffer Pool, Partition
Generate script & Deploy: on data models, and flow models
Impact Analysis: on data models and flow models
InfoSphere Warehouse Design Studio
Leverage and extend InfoSphere Data Architect: – Design and modify database physical models (schema & storage design, etc)– Design and model OLAP objects– Design and model warehouse transformation and mining flows
71
72
What’s new in TPoX 2.0 TPoX 2.0 includes pervasive change to the benchmark TPoX 2.0 test results not comparable to previous versions of TPoX
TPoX V1.3 and Earlier TPoX 2.0
Based on Toxgene A single java based program
3rd party tool, lack of support Complete rewrite
Slow (> 5 days for 1TB data) Fast (6 hours for 1TB data)
Can’t generate dense account IDs for CUSTACC
Account IDs are now dense
Large amount of small XML files Small amount of larger files, each contains 50K XML documents
TPoX V1.3 and Earlier
TPoX 2.0
# of CUSTACC vs # of ORDER 1:5 1:10
XML document size range 1-20KB 1-23KB
ACCOUNT IDs of customer Not dense Dense
Total XML document size of “100GB” scale
Slightly less than 100GB
Slight larger than 100GB
avg # of accounts per customer 1.5 2.0
Data Generator
Data Distribution
TPoX V1.3 and Earlier TPoX 2.0
Workload description file in proprietary format, hard to read
Workload description file in XML format, easy to read and create
WorkloadDriver reads input documents from large amount of small files
WorkloadDriver reads input documents from smaller amount of larger files, improved performance for reading XML input documents
Update transaction U1, U5 and U6 select account for update based on customer ID
Update transaction U1, U5 and U6 select account for update based on account ID
Workload and WorkloadDriver
NOTE: please refer to TPoX V2.0 Release Note at http://sourceforge.net/projects/tpox for more detail
Changes have improved performance of generating and consuming TPoX XML data in large scale TPoX benchmarks !