IBM® DB2® for Linux®, UNIX®, and Windows®
Best Practices Data Life Cycle Management
Christopher Tsounis
Executive IT Specialist
Information Management Technical Sales
Enzo Cialini
Senior Technical Staff Member
DB2 Data Server Development
Last updated: 2009-10-23
Contents

Executive Summary
Introduction
Partitioning techniques
   What is database partitioning?
   What is table partitioning?
Multi-dimensional clustering
   Features of MDC that benefit roll-in and roll-out of data
Using database partitioning, table partitioning, and multi-dimensional clustering in the same database design
Additional techniques to support life cycle management
   Large table spaces
   SET INTEGRITY operation
   Asynchronous index cleanup
Designing and implementing your table partitioning strategy
   Design best practices
   Maximizing the benefits of partition elimination
   Operational considerations
Rolling in data: Which solution to use?
Best practices for roll-in of compressed table partitions
Best practice for roll-in and roll-out with continuous updates
After roll-out: How to manage data growth and retention?
   Using UNION ALL views
   Using IBM Optim Data Growth Solution
Best Practices
Conclusion
Further reading
   Contributors
Notices
   Trademarks
Executive Summary

Today's database applications frequently require scalability and rapid roll-in and roll-out of data with minimal disruption to data access by applications. Roll-in of data refers to the addition of new data as it becomes available, while roll-out of data refers to the moving out (usually archiving) of historical data. Many applications today are accessed 24x7, which eliminates the batch window previously available for data updates. Also, many applications require a continuous feed of data updates while applications concurrently access the data.
The DB2 database system provides a variety of facilities that enable scalability and
facilitate the continuous feed or roll-in and roll-out of data, with minimal interruption of
data access. This document recommends best practices to design and implement these
DB2 facilities to achieve these goals.
Introduction

This paper describes the best DB2 design practices to facilitate the life-cycle management of DB2 data. Life-cycle management is the efficient addition (roll-in) of new data and the archival (roll-out) of data no longer required in the main database. The DB2 database system provides the following features that you can use in combination to facilitate life-cycle management:
• Database partitioning
• Table partitioning
• Multi-dimensional clustering
• UNION ALL views
In addition to these DB2 features, the IBM Optim Data Growth solution facilitates
archiving for data life cycle management.
An important benefit of DB2 database system partitioning facilities is the ability to
deploy and modify these facilities without impacting existing application code.
This paper is part of a family of related best practices papers; you would also benefit from reading the following papers:
• Physical Database Design
• Minimizing Planned Outages
• Row Compression
The target audience for this paper is personnel responsible for database design for DB2
applications. Database personnel who want to achieve scalability and efficient life cycle
management of data should also find it valuable. This paper assumes you have moderate
experience in designing DB2 databases.
This paper is based on the facilities available in DB2 Version 9.5 and DB2 Version 9.7.
Subsequent releases of the DB2 database system might provide enhancements that alter
the best practices recommendations in this document.
Partitioning techniques
What is database partitioning?

Database partitioning (the Database Partitioning Feature, or DPF) distributes data across the logical nodes of a database by using a key-hashing algorithm. The goal of database partitioning is to maximize scalability by evenly distributing data across clusters of computers. Database partitioning further enhances scalability by reducing the granularity of DB2 utility operations, and it parallelizes query and update operations on the database.
The following example demonstrates how to specify database partitioning:
CREATE TABLE Test
(Account_Number INTEGER,
Trade_date DATE )
DISTRIBUTE BY HASH (Account_Number)
Note: In DB2 Version 9.1, the PARTITIONING KEY clause was renamed to DISTRIBUTE BY HASH.
Database partitioning is completely transparent, so it does not impact existing
application code. Also, you can modify partitioning online using the redistribution
utility, without affecting application code.
When you design your database partitioning strategy, use a partitioning key column
with high cardinality to help ensure even distribution of data across logical nodes. A
column with high cardinality has many unique values (rather than most values being the
same). Also, unique indexes must be a superset of the partitioning key.
Try to use the same partitioning key on tables that are frequently joined; this increases the collocation of joins, as shown in the following sketch.
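The following is a minimal sketch (the trades and accounts tables are illustrative, not from this paper) of distributing two frequently joined tables on the same key, so that matching rows hash to the same database partition and the join can be resolved locally:

CREATE TABLE accounts
   (Account_Number INTEGER NOT NULL,
    Branch_Id INTEGER)
   DISTRIBUTE BY HASH (Account_Number)

CREATE TABLE trades
   (Account_Number INTEGER NOT NULL,
    Trade_date DATE)
   DISTRIBUTE BY HASH (Account_Number)

-- A join on the distribution key is collocated; each database
-- partition joins only its local rows:
SELECT t.Trade_date, a.Branch_Id
FROM trades t JOIN accounts a
   ON t.Account_Number = a.Account_Number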
What is table partitioning?

Table partitioning (frequently called range partitioning) splits data by specific ranges of key values over one or more physical objects within a logical database partition. The goal of table partitioning is to organize data to facilitate optimal data access and the roll-out of data. Table partitioning might also facilitate the roll-in of data for certain applications; however, multi-dimensional clustering (discussed in the next section, "Multi-dimensional clustering") is often a better choice for enhancing roll-in. Database partitioning remains the best practice for reducing the granularity of utility operations and for the scalability of very large databases.
Table partitioning has the following benefits:
• Improved query performance by eliminating irrelevant partitions. The optimizer
can limit SQL access to only relevant partitions in the WHERE clause.
• Optimized roll-in and roll-out processing of ranges. Table partitioning allows
easy addition and removal of table partitions with no data movement.
Applications can perform read and write operations against older data when
partitions are added (queries are drained for a brief period).
• Maintained compression ratios across data that changes over time. Each table
partition has its own compression dictionary. Thus, compressed data in older
partitions is not affected by the changing characteristics of newly inserted data.
• Optimized management of very large tables. Table-partitioned tables can be
virtually unlimited in size, because the limits are per partition (not per table).
You can place ranges of data across multiple table spaces to facilitate the backup
and restore of this data.
• Greater flexibility of index placement in SMS table spaces. You can store indexes
in separate SMS large table spaces (supported for table-partitioned tables only).
Separate index placement into DMS table spaces is available for all tables.
DB2 Version 9.7 enhances table partitioning with the ability to create partitioned indexes, which are stored locally with the table partitions. The benefits of partitioned indexes are:
• Avoiding the overhead of maintaining global indexes during SET INTEGRITY processing when attaching table partitions
• Avoiding asynchronous index cleanup when detaching table partitions
• Improving the performance of reorganize-by-partition operations
• Possibly improving query performance, because more compact indexes reduce the cost of index processing
The following example demonstrates specifying table partitioning:
CREATE TABLE Test
(Account_Number INTEGER,
Trade_date DATE)
IN ts1, ts2, ts3
PARTITION BY RANGE (Trade_date)
( STARTING '1/1/2000' ENDING '3/31/2000',
STARTING '4/1/2000' ENDING '6/30/2000',
STARTING '7/1/2000' ENDING '9/30/2000')
The following example demonstrates the creation of a partitioned table with a partitioned
index. Partitioned indexes are created by default in DB2 Version 9.7 whenever possible:
CREATE TABLE T1 (I1 INTEGER, I2 INTEGER)
   PARTITION BY RANGE (I1)
   (STARTING (1) ENDING (10) EVERY (5));
CREATE INDEX IND1 ON T1(I1) PARTITIONED;
The following example demonstrates the creation of a non-partitioned index:
CREATE INDEX IND2 ON T1(I2) NOT PARTITIONED;
Many additional options for specifying how a table is partitioned are described in the DB2 documentation.
Multi-dimensional clustering

Multi-dimensional clustering (MDC) is a unique capability available only with the DB2 database system. MDC organizes the data in a table by multiple key values, grouping rows that share the same dimension values into cells. The goal of MDC is to facilitate access to data through multiple dimensions, limiting data access to only the relevant cells. MDC helps to ensure that the data is always clustered by its dimensions, avoiding the need to reorganize data (MDC is designed to keep data in order).
MDC also utilizes block indexes on each dimension (and the combined dimensions)
versus row ID (RID) indexes. This can result in a substantial reduction in the index size
and index levels. For example, if 100 rows can fit into a DB2 cell, the block index will
only point to the cell rather than each of the 100 rows. This results in a reduction in I/O
for reading and updating data (the index is only updated when the block is full).
MDC facilitates the roll-in and roll-out of data and is completely transparent to applications.
The following example demonstrates how to specify multi-dimensional clustering:
CREATE TABLE orders
   (Account_Number INTEGER,
    Trade_Date DATE,
    Region CHAR(10),
    order_month INTEGER GENERATED ALWAYS AS (MONTH(Trade_Date)))
   IN ts1
   ORGANIZE BY DIMENSIONS (Region, order_month)
When designing your MDC strategy, specify low-cardinality columns to avoid sparsely
populated cells. Sparsely populated cells can significantly increase disk space usage. A
column with low cardinality is likely to have many values that are the same (rather than
many unique values). You can also use a generated column to produce a highly clustered
dimension. For example, a generated column or built-in-function can convert a date into
a month. This reduces the cardinality significantly (for a year of data, the cardinality is
reduced from 365 to 12).
Features of MDC that benefit roll-in and roll-out of data

MDC is designed to maintain clustering in all dimensions, avoiding the need to reorganize data. This can greatly reduce I/O during the roll-in process (which uses sequential big-block I/O). Also, because indexes on MDC dimensions are block
indexes, this allows MDC to avoid excessive index I/O during the roll-in process. Block
indexes are smaller and shallower than a normal RID-based index, because the index
entries point to a block rather than an entire row.
Also, during the roll-in process, MDC reduces index maintenance because the block
index is only updated once when the block is full (not for each row inserted as with other
indexes). This also helps to reduce I/O.
INSERT statements run faster when you use MDC, because MDC reuses existing empty
blocks without the need for page splitting. Locking is also reduced for inserts because
they occur at a block level rather than a row level.
MDC improves the roll-out of data, because entire pages are deleted rather than each
row. Logging is also reduced with MDC deletes (just a few bytes per page).
Use a single-column MDC design to facilitate roll-in and roll-out and minimize an
increase in disk space usage.
See the section called “Best practice for roll-in and roll-out with continuous updates” for
a hypothetical application with characteristics that benefit from using MDC for rolling in
data.
Using database partitioning, table partitioning, and multi-dimensional clustering in the same database design

The best practice approach for deploying large-scale applications is to implement
database partitioning, table partitioning, and MDC simultaneously in the same database
design. Database partitioning provides scalability and helps ensure the even distribution
of data across logical partitions; table partitioning facilitates query partition elimination
and rollout of data; and MDC improves query performance and facilitates the roll-in of
data.
For example:
CREATE TABLE Test
   (A INT, B INT, C INT, D INT …)
   IN tbsp1, tbsp2, tbsp3 …
   INDEX IN tbsp2
   DISTRIBUTE BY HASH (A)
   PARTITION BY RANGE (B) (STARTING FROM (100) ENDING (300) EVERY (100))
   ORGANIZE BY DIMENSIONS (C, D)
Table partitioning alone may not fully solve scaling issues in DB2; continue to use database partitioning to solve scalability issues for large-scale data warehouses. DB2 database partitioning and its shared-nothing architecture are the best way to provide linear scaling of your application while minimizing software bottlenecks.
Additional techniques to support life cycle
management
Large table spaces

Using large table spaces (the default in DB2 Version 9.1) better accommodates larger tables and indexes, and it allows more rows per page within the DB2 server.

Use large table spaces for tables that use deep compression (many rows per page) and for table-partition global indexes that are expected to exceed 64 GB with a 4 KB page size. If you are not affected by these issues, large table spaces are not required. You can also avoid the need for large table spaces by placing each table-partition global index into a separate table space (highly recommended). Local partitioned indexes further reduce the need to deploy large table spaces.
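The following minimal sketch (table space and table names are illustrative) creates a large table space for data and a separate table space for a global index:

CREATE LARGE TABLESPACE ts_data
   MANAGED BY AUTOMATIC STORAGE

CREATE LARGE TABLESPACE ts_gidx
   MANAGED BY AUTOMATIC STORAGE

CREATE TABLE sales
   (Account_Number INTEGER,
    Trade_date DATE)
   IN ts_data
   INDEX IN ts_gidx      -- global indexes kept in their own table space
   PARTITION BY RANGE (Trade_date)
   (STARTING '1/1/2009' ENDING '12/31/2009' EVERY 3 MONTHS)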
The following table compares the space available, in terms of the number of records that can be stored on various-sized pages, for a regular table space and a large table space. For each table space type it shows the maximum table space size and the maximum records per page / minimum record length.

Page size   Regular table space (4-byte RID)      Large table space (6-byte RID)
            Max size    Max recs / min rec len    Max size    Max recs / min rec len
4 KB        64 GB       251 / 14                  2 TB        287 / 12
8 KB        128 GB      253 / 30                  4 TB        580 / 12
16 KB       256 GB      254 / 62                  8 TB        1165 / 12
32 KB       512 GB      253 / 127                 16 TB       2335 / 12
Note: If you alter a table space to large, the change does not take effect for a table until all of its indexes have been reorganized.
SET INTEGRITY operation

Running SET INTEGRITY is required when you attach a new partition to a table and when you detach a table partition from a table with a materialized query table (MQT). (Note that data in the new partition is not visible until the SET INTEGRITY process completes.) SET INTEGRITY is a potentially long-running operation that validates data and maintains global indexes. This maintenance activity is logged and might produce a large volume of log entries. DB2 Version 9.7 supports partitioned indexes that can be created prior to attaching a new partition; this greatly reduces the time required for the SET INTEGRITY operation.
The key benefit of SET INTEGRITY is that existing data remains available for read and write access during its operation. You can minimize the impact of SET INTEGRITY on large volumes of data by using MDC, implementing partitioned indexes, and minimizing your use of global indexes and MQTs. User-maintained MQTs are an alternative that you can specify to speed up SET INTEGRITY.
The section “Designing and implementing your table partitioning strategy” contains
recommendations on the use of SET INTEGRITY.
The section “Best practices for roll-in of compressed table partitions” describes how to
attach a table partition without requiring the execution of SET INTEGRITY.
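The following minimal sketch (table, partition, and source-table names are illustrative) shows the attach-and-validate sequence; existing data remains available for read and write access while SET INTEGRITY runs:

ALTER TABLE sales
   ATTACH PARTITION q4_2009
   STARTING '10/1/2009' ENDING '12/31/2009'
   FROM TABLE sales_q4_2009;
COMMIT;

-- New rows are invisible until integrity processing completes
SET INTEGRITY FOR sales ALLOW WRITE ACCESS IMMEDIATE CHECKED;
COMMIT;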
Asynchronous index cleanup

Asynchronous index cleanup (AIC) is a DB2 feature that reclaims space in an index after a table partition is detached; the cleanup runs at low priority as a background process that the DB2 database system invokes automatically. Because of AIC, detaching a table partition is nearly instantaneous: the DETACH does not have to wait for index cleanup to complete. AIC is not performed for partitioned (local) indexes.
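A minimal DETACH sketch (table and partition names are illustrative); the statement returns quickly, and AIC cleans up any non-partitioned indexes in the background:

ALTER TABLE sales
   DETACH PARTITION q1_2005 INTO TABLE sales_archive_q1_2005;
COMMIT;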
Designing and implementing your table partitioning strategy

Applications that benefit from table partitioning use the following kinds of tables:
• Very large tables
• Tables with queries accessing range-subsets of a table
• Tables with roll-out requirements
• Tables with roll-in requirements. As an alternative, consider MDC for roll-in of
data.
Design best practices:

When designing your table partitioning strategy, consider the following table partitioning design best practices:
• Partition on a date column (or columns) to facilitate roll-out.
• Partition on a column (or columns) that assists partition elimination, as discussed later in this section.
• Match the granularity of ranges with roll-in and roll-out criteria. This avoids the
need for reorganization to reclaim space when you run DETACH.
• Consider placing different ranges in separate table spaces to facilitate back up
and recovery. The DB2 database system can backup and restore an entire range
partition when it is placed in a separate table space.
• Consider separating active from historical data.
• To position for enhancements in a future release of the DB2 database system,
consider making unique indexes a superset of the table partitioning key. Non-
unique indexes can be on any columns.
• Specify the placement of each of your global indexes into their own table space
(use large table spaces, if required). It is a good practice to minimize the size of
the table spaces containing global indexes, in order to improve backup time.
Also, use database partitioning to reduce the granularity of global indexes.
• Use partitioned indexes instead of global (non-partitioned) indexes where possible.
• Split up the global index into multiple table spaces to ensure that a single table
space does not grow too large.
• Consider strategies to minimize the impact of SET INTEGRITY. Consider the
logging impact and elapsed time of SET INTEGRITY when attaching large
ranges. SET INTEGRITY can also impact restart time if there is a failure.
Prototype in your environment to see if the elapsed time is acceptable.
Otherwise, consider alternative design strategies, discussed in the sections
“Rolling in data: Which solution to use?” and “Best practices for roll-in of
compressed table partitions”.
• For deep compression, DB2 Version 9.5 is strongly recommended, because of its
ability to automatically build compression dictionaries during LOAD, IMPORT
or INSERT operations. DB2 Version 9.1 requires table reorganization to compress
data in a table partition if a compression dictionary is not present.
With DB2 Version 9.7, consider the following partitioned index design best practices:
• Partitioned indexes improve ATTACH and DETACH processing time and may
improve query performance in a large database environment. The design
guidelines for the creation of partitioned local indexes are:
o Non-unique indexes are partitioned by default; there are no design restrictions on creating non-unique indexes as partitioned.
o Unique indexes can be partitioned only if the key is a superset of the table partitioning key (for DPF configurations, the key must also be a superset of the database partitioning key). For example:
Database partition key: Account_Num
Table partition key: Sales_Month
Potential unique index that can be partitioned:
Account_Num, Sales_Month, Store_Num
• To gain the benefits of partitioned indexes, verify whether uniqueness is actually required by the application:
o A downstream data source may enforce uniqueness
o Non-unique indexes may increase sorting time for DISTINCT, ORDER
BY, and GROUP BY predicates
o Uniqueness may not be required.
• Unique partitioned indexes must be created prior to the ATTACH to avoid index maintenance overhead.
• Placing index partitions in a separate table space is a best practice.
A major benefit of placing partitioned indexes in their own table space is that there is no data movement when attaching a partition, provided the separate index was built prior to the ATTACH.
Partitioned indexes are placed in the same table space as the table by default, but they may be placed in a separate table space. To place a partitioned index in a separate table space, use the partition-level INDEX IN clause of the CREATE TABLE statement or of ALTER TABLE ... ADD PARTITION (DMS storage only). The table-level INDEX IN clause of CREATE TABLE applies to non-partitioned indexes only.
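A minimal sketch of the partition-level INDEX IN clause in DB2 Version 9.7 (table and table space names are illustrative; DMS table spaces assumed):

CREATE TABLE sales
   (Trade_date DATE, Amount DECIMAL(12,2))
   PARTITION BY RANGE (Trade_date)
   (PARTITION q1 STARTING '1/1/2009' ENDING '3/31/2009'
       IN ts_dat1 INDEX IN ts_idx1,
    PARTITION q2 STARTING '4/1/2009' ENDING '6/30/2009'
       IN ts_dat2 INDEX IN ts_idx2)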
• Partitioned index migration considerations and best practices
Indexes created in DB2 Version 9.5 and migrated to DB2 Version 9.7 are placed in the same table space as the table. Data movement is required to put an index into a separate table space after a migration from DB2 Version 9.5.
To migrate to partitioned indexes in separate table spaces in DB2 Version 9.7:
1. Create a new partitioned index in a separate table space.
create index date_part on sales(date, status) partitioned;
2. Drop the original partitioned index that resides in the same table space as the data.
drop index dateidx;
3. Rename the new partitioned index to the same name as the original index.
rename index date_part to dateidx;
An alternative method is to create a new table with partitioned indexes in the desired table spaces and move the data using an online table move (for example, the ADMIN_MOVE_TABLE procedure).
• Other partitioned index design considerations
o Indexes on the source table to be attached need to match the indexes on the target table:
- Some corrections are possible before a failure occurs
- Indexes that do not match are dropped
o Although several catalog statistics are carried over during an ATTACH, the best practice is to run RUNSTATS after an ATTACH.
Maximizing the benefits of partition elimination:

Table partitioning provides a powerful facility to limit data access to only those partitions required to satisfy the SQL WHERE clause. To benefit from partition elimination, do the following tasks (a query sketch follows this list):
• Prefix the cluster key with the partition key
• Ensure the range partitioning column is frequently used in the WHERE clause
• Ensure the leading columns of the composite partition key are in the WHERE
clause
• If you are using generated columns, use them where appropriate to assist in
partition elimination. Generated columns can be partition keys.
• Use generated columns as MDC dimensions, where appropriate to reduce the
granularity of the dimension.
• Use multiple, separate ranges to eliminate unnecessary searches, if possible. For
example, partition elimination could access only the months of January and
December instead of the whole year.
• If you are using joins, note that partition elimination is applied only to the inner table of a nested-loop join.
• For parameter markers, partition elimination is pushed down at run time, when the values are bound at execution time.
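For example, with the Test table defined earlier (partitioned by quarter on Trade_date), a query such as the following allows the optimizer to limit access to a single range; the access plan (for example, from db2expln) shows which partitions are touched:

SELECT COUNT(*)
FROM Test
WHERE Trade_date BETWEEN '2/1/2000' AND '2/28/2000'
-- Only the partition covering 1/1/2000 through 3/31/2000 is accessed.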
Operational considerations:

Use the following best practices to enhance the operational characteristics of table partitioning (a short sketch follows this list):
• Issue a COMMIT statement after each step of your roll-in or roll-out procedure to
release locks (for example, after ATTACH, DETACH, or SET INTEGRITY, and so
on)
• Explicitly name each table partition. These names are easier to manage than the
system generated names.
• Always terminate a failed LOAD utility run. Subsequent operations (for
example, DROP TABLE) cannot proceed until LOAD is terminated.
• If you are appending data to a partition, specify LOAD INSERT. Performing
LOAD REPLACE of a partition replaces an entire table (all partitions).
• Avoid attaching a partition with the same name as a detached partition. This
results in a duplicate name until asynchronous index cleanup (AIC) completes.
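A minimal sketch of these practices (partition and file names are illustrative): commit after each step, name partitions explicitly, and terminate a failed LOAD before other operations:

ALTER TABLE sales ADD PARTITION jan2010
   STARTING '1/1/2010' ENDING '1/31/2010';
COMMIT;

LOAD FROM jan2010.del OF DEL INSERT INTO sales;

-- If the LOAD fails, terminate it before any further operations:
LOAD FROM jan2010.del OF DEL TERMINATE INTO sales;
COMMIT;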
Rolling in data: Which solution to use?

There are several factors that affect how you choose the best roll-in solution for your installation:
• Minimizing the time it takes to bring new data into the system and make it
available
• Minimizing the amount of logging activity that occurs as part of the SET
INTEGRITY operation during roll in
• Whether you have a requirement for continuous updates rather than a daily batch process
• Maximizing compression for new ranges to effectively manage data skew
There are two different techniques for the roll-in of data with table partitions:
1. ALTER/ATTACH
With the ALTER/ATTACH method, you first populate a standalone table offline and then attach it as a partition. You must run SET INTEGRITY (a potentially long-running operation for large data volumes). The impact of running SET INTEGRITY can be reduced by using partitioned indexes in DB2 Version 9.7.
Advantages:
• Concurrent access
• All previous partitions are available for updates
• No partial data view (new data cannot be seen until SET INTEGRITY completes)
Disadvantages:
• Additional log space is required
• Long elapsed times
• Draining of queries is required
2. ALTER/Add
With the ALTER/Add method, you attach an empty table partition, and then
populate it using the LOAD utility or INSERT statements.
You do not need to run SET INTEGRITY.
Advantages:
• Faster elapsed times
• SET INTEGRITY is not required
• Less log space for global index maintenance
Disadvantages:
• A partial data view occurs when you use INSERT statements (not with the LOAD utility)
• The LOAD utility allows only read-only access to older partitions
Recommendation:
For larger data volumes, use the ALTER/Add method for roll-in of a table partition, or use MDC for roll-in if many non-partitioned indexes are deployed.
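A minimal sketch of the ALTER/Add method (partition and table names are illustrative): add an empty range, then populate it, with no SET INTEGRITY required:

ALTER TABLE sales ADD PARTITION feb2010
   STARTING '2/1/2010' ENDING '2/28/2010';
COMMIT;

-- Populate with LOAD (read-only access to older partitions) or with
-- an insert from a staging table (full concurrent access):
INSERT INTO sales SELECT * FROM sales_stage;
COMMIT;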
Best practices for roll-in of compressed table partitions:

These best practices use the methods of attaching a table partition that are described in the preceding section.
For Version 9.1, rapidly attach a table partition with compressed data (large data volumes) by using the following technique (a sketch follows the list):
1. Load a subset of data (a true random sample) into a separate DB2 table
2. Alter the standalone table to enable compression
3. Reorganize the subset of data to build a compression dictionary
4. Empty the table or retain minimal data (so the dictionary is retained)
5. ALTER/ATTACH the table as a new table partition (the dictionary is retained)
6. Execute SET INTEGRITY (this is rapid, due to minimal data)
7. Populate data by using the LOAD utility or INSERT statements (compression
will occur). For applications with continuous updates, load data into a staging
table using the LOAD utility. Then, use an insert with a sub-select from the
staging table or run an ETL (extract, transform, and load) job to update the
primary tables (compression will occur). The roll-in of data can be improved
further if you exploit the benefits of MDC within the table partition.
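A sketch of this Version 9.1 technique (table, partition, and file names are illustrative):

-- Steps 1-4: build a compression dictionary from a small random sample
LOAD FROM sample.del OF DEL REPLACE INTO sales_new;
ALTER TABLE sales_new COMPRESS YES;
REORG TABLE sales_new RESETDICTIONARY;   -- builds the dictionary
DELETE FROM sales_new;                   -- the dictionary is retained

-- Steps 5-6: attach and validate (fast, because the table is nearly empty)
ALTER TABLE sales ATTACH PARTITION q1_2010
   STARTING '1/1/2010' ENDING '3/31/2010'
   FROM TABLE sales_new;
COMMIT;
SET INTEGRITY FOR sales ALLOW WRITE ACCESS IMMEDIATE CHECKED;
COMMIT;

-- Step 7: populate; rows are compressed using the retained dictionary
LOAD FROM q1_2010.del OF DEL INSERT INTO sales;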
For Version 9.5, the technique to rapidly attach a table partition is simplified by
automatic dictionary creation:
1. ALTER/Add the empty table.
2. Populate the table with data, by using the LOAD utility or an INSERT/SELECT
statement (data is compressed with automatic dictionary creation).
Note that a full offline reorganization of a fully-loaded partition is likely to achieve better
compression than can be achieved with this method. DB2 Version 9.7 fix pack 1 supports
rapid reorganization by partition when using partitioned indexes to improve
compression results.
Best practice for roll-in and roll-out with continuous updates:

This database design combines various features of the DB2 database system to facilitate roll-in and roll-out of data with continuous update requirements.
This design is for applications with the following characteristics:
• Continuous updates occur all day long (which prevents using the ALTER/Add method to add a partition).
• Data is added daily.
• Queries frequently access a certain day.
• Table partitioning on day results in too many partitions (for example, 365 days
times 3 years).
• Roll-out occurs weekly or monthly (typically on a reporting boundary).
Recommended database design:
To facilitate the roll-in of data, specify a single-dimension MDC on day (see the section
“Features of MDC that benefit roll-in and roll-out of data”).
To facilitate the roll-out of data, specify a table partition range per week or month. This
provides the same time dimension as MDC but at a coarser scale.
Applications with long-running reports might not be able to drain queries for the execution of the DB2 LOAD utility. The best practice in this case is to use the LOAD utility to rapidly load data into staging tables, and then populate the primary tables using an insert with a sub-select.
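A sketch of this design (table and column names are illustrative): monthly table partitions for roll-out, with a single-dimension MDC on day for roll-in:

CREATE TABLE trades
   (Account_Number INTEGER,
    Trade_date DATE,
    Amount DECIMAL(12,2))
   PARTITION BY RANGE (Trade_date)
   (STARTING '1/1/2009' ENDING '12/31/2009' EVERY 1 MONTH)
   ORGANIZE BY DIMENSIONS (Trade_date)

-- Continuous feed: LOAD into a staging table, then populate the
-- primary table with an insert sub-select so queries are not drained
INSERT INTO trades SELECT * FROM trades_stage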
After roll-out: How to manage data growth and retention?

To satisfy corporate policy, government regulations, or audit requirements, you might
need to retain your data and keep it accessible for long durations of time. For example,
the Health Insurance Portability and Accountability Act (HIPAA) contains medical
record retention requirements for health-care organizations. The Sarbanes-Oxley Act sets
out certain record retention requirements for corporate accountants. Additionally, some
enterprises are also finding value in performing analytics on historical data and are
therefore retaining data for longer durations.
Therefore, in addition to implementing a suitable roll-in and roll-out strategy and an
appropriate database design, you need to consider the complete lifespan of your data
and include a policy for data retention and retrieval. You could do nothing and
continually add hardware capacity and resources to maintain the additional data growth
for retention purposes, however there are better practices for data retention, as described
in this paper.
Using UNION ALL views

One practice is to keep all the data in the database but roll out certain ranges for retention and create UNION ALL views over the ranges that require easy accessibility.
The following example demonstrates how to create a UNION ALL view:
CREATE VIEW all_sales AS
(
SELECT * FROM sales_0105
WHERE sales_date BETWEEN '01-01-2005' AND '01-31-2005'
UNION ALL
SELECT * FROM sales_0205
WHERE sales_date BETWEEN '02-01-2005' AND '02-28-2005'
UNION ALL
...
UNION ALL
SELECT * FROM sales_1207
WHERE sales_date BETWEEN '12-01-2007' AND '12-31-2007'
);
Using UNION ALL views addresses data retention and real-time accessibility while keeping all the data maintained online in the database on primary storage. One drawback of this method is that you might be unnecessarily maintaining this data in the associated backup images. Also, historical data typically does not require high performance, so it does not need the indexing and other high-cost resources required by your primary data.
There are a variety of ways you could use UNION ALL views:
• Access active data using UNION ALL views and keep your historical data
compressed in a range-partitioned table.
• Keep active data in a range-partitioned table and use a UNION ALL view to access historical data in another range-partitioned table.
Using UNION ALL views has some limitations. When you have a large number of ranges, use range-partitioned tables instead, because some complex predicates and joins are not pushed down into UNION ALL views.
However, in some situations UNION ALL views are advantageous. For example, a
UNION ALL view may work in a federated environment, whereas a range-partitioned
table does not.
Although UNION ALL views may be useful in some environments, DB2 Version 9.7 users should strongly consider migrating to table partitioning.
Using IBM Optim Data Growth Solution

Depending on your service level agreement (SLA) objectives for your historical data, usually the best practice to address both data growth and retention is to implement data archiving with IBM Optim™ Data Growth Solution.
IBM Optim Data Growth Solution is a leading solution for addressing growth,
compliance and management of data. It preserves application integrity by archiving
complete business objects, rather than single tables. For example, it retains foreign keys
and preserves metadata within the archive. These features enable you to have:
• Flexible access to data.
• The ability to selectively or fully restore archived data into the original database
table, or into a new table, or even into an alternate database.
The following steps guide you through the process of determining how best to
implement your archiving strategy.
STEP 1: Classify your applications
First, you need to classify your applications according to their archival requirements.
By understanding which transactions you need to retain from your application data,
you can group applications with similar data requirements for archive accessibility
and performance. Some applications require only current transactions be retained;
some require access to only historical transactions; and others require access to a mix
of current and historical transactions (with a varying current-to-historical ratio).
Also, consider the service level agreement (SLA) objectives for your archived data.
An SLA is a formal agreement between groups that defines the expectations between
them and includes objectives for items such as services, priorities, and
responsibilities. SLA objectives are often formulated using response time goals. For
example, a specific human resources report might need to run, on average, within 5
minutes.
STEP 2: Assess the temperature of your data:
Data derives its “temperature” from the following criteria:
• How frequently the data is accessed
• How long it takes to access the data
• How rapidly the data changes (volatility)
• User and application requirements
The temperature varies from enterprise to enterprise, but typically the data
temperatures fall into common classifications across industries. The following table
provides guidelines for data temperatures.
• Hot (tactical data): The bulk of the queries are for current data, which is accessed frequently and heavily and requires a quick response time turnaround.
• Warm (traditional decision support data): Queries access this data less frequently, and data retrieval does not require the urgency of a quick turnaround in response time.
• Cold (deep historical data): Queries rarely access this data, but it must be available for periodic access.
• Dormant (regulatory data): Data that needs to be available on an exception basis.
There are various means of assessing the temperature of data. Consider business and
application definitions and requirements, roll-out criteria, and workload and query
tracking statistics as potential methods for determining how to classify your data
according to temperature. Gather the following potential workload and query
information to assess the data temperature:
• Which objects are (and are not) being accessed
• The frequency with which each object is accessed
• The common time intervals at which objects are accessed (for example, THIS_WEEK, LAST_WEEK, THIS_QUARTER, LAST_QUARTER)
• Which data within an object is being accessed
You can use DB2 Version 9.5 workload management (WLM) to assist in discovering
data temperatures. The WLM historical analysis tool provides statistics on which
tables, indexes and columns have, or have not, been accessed, along with the
associated frequency.
The WLM historical analysis tool consists of two scripts:
• wlmhist.pl, which generates historical data
• wlmhistrep.pl, which produces reports from the historical data
To discover which data within an object is being accessed, analyze the SQL statement
using an ACTIVITIES event monitor to collect data on workload activities, including
the SQL statement text. You might want to collect information about workload
management objects such as workloads, service classes, and work classes (through
work actions). Enable activity collection using the COLLECT ACTIVITY DATA …
WITH DETAILS clause of the CREATE or ALTER statements for the workload
management objects for which you want to collect information, as shown in the
following example:
ALTER SERVICE CLASS sysdefaultsubclass
UNDER sysdefaultuserclass
COLLECT ACTIVITY DATA ON ALL WITH DETAILS
The WITH DETAILS clause enables collection of the statement text for both static and
dynamic SQL.
If applications make use of parameter markers within the statement text, you should
also include the AND VALUES clause, (so that you have COLLECT ACTIVITY
DATA … WITH DETAILS AND VALUES). The AND VALUES clause collects the
data values associated with the parameter markers in addition to the detailed
statement information.
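A minimal sketch of collecting that data (the monitor name is illustrative, and the default event monitor table names are assumed); an ACTIVITIES event monitor writes the captured statements, and parameter-marker values if requested, to tables:

CREATE EVENT MONITOR act_mon FOR ACTIVITIES WRITE TO TABLE;
SET EVENT MONITOR act_mon STATE 1;

-- After a collection interval, inspect the captured statement text:
SELECT appl_id, stmt_text FROM activitystmt_act_mon;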
STEP 3: Discover and classify your business objects
Business objects, such as insurance claims, invoices, or purchase orders, represent
business transactions. By classifying your business objects, you can begin to define
rules and associated business drivers for managing these objects at different stages in
the data life cycle.
From a database perspective, a business object represents a group of related rows
from related tables.
Simplified example of a business object:
Given the following three tables:
PROJECT
PROJNO   PROJNAME               DEPTNO   PRJENDATE
IF2000   USER EDUCATION         C01      2/1/2006
MA2100   WELD LINE AUTOMATION   D01      2/1/2006
MA2110   W L PROGRAMMING        D11      2/1/2006
MA2111   W L PROGRAM DESIGN     D11      12/1/2002
MA2112   W L ROBOT DESIGN       D11      12/1/2002
MA2113   W L PROD CONT PROGS    D11      12/1/2002
OP1000   OPERATION SUPPORT      E01      2/1/2006
OP1010   OPERATION              E11      2/1/2006

EMPLOYEE
EMPNO   LASTNAME    WORKDEPT
60      TSOUNIS     D11
130     VINCENT     C01
140     TYRRELL     C01
150     GOODMAN     D11
160     CASSELLS    D11
170     CIALINI     D11
310     O’CONNELL   E11

DEPARTMENT
DEPTNO   DEPTNAME
C01      INFORMATION CENTER
D11      MANUFACTURING SYSTEMS
D21      ADMINISTRATION SYSTEMS
E11      OPERATIONS

The business object is the PROJECT transaction together with its related DEPARTMENT and EMPLOYEE rows.
For data retention and archiving purposes, you want the complete business object to
be represented such that you have a historical “point-in-time” snapshot of a business
transaction. Creating a historical snapshot requires both transactional detail and
related master information, which involves multiple tables in the database.
Archiving complete business objects allows the archives to be intact and accurate and
to provide a standalone repository of transaction history. To respond to inquiries or
discovery requests, you can query this repository without the need to access “hot”
data.
In this example, to ensure that the complete object is available, the archived business object must include the associated data from the DEPARTMENT and EMPLOYEE tables. After archiving, you would delete only the data in the production PROJECT table, not the associated EMPLOYEE and DEPARTMENT data.
You can discover business objects based on data relationships within the schema, as
demonstrated in this example. However, you might also want to include other
related tables that do not have any schema relationship, but, for example, might be
related through use of an application. In addition, you might elect to remove certain
discovered relationships from the business object.
STEP 4: Produce your comprehensive data classification:
After you have classified your applications and business objects and determined
their associated data temperatures, you can produce a data classification table to
summarize this information. This table articulates the aging of the data.
The following table provides a sample data classification:
Application   Business Object   Production   Online Archive   Offline Archive   Delete
AppA          Claims            0-2 yrs      3-5 yrs          6-10 yrs          >10 yrs
STEP 5: Determine the post-archive storage type
To determine what storage type is most appropriate for your aged data, consider the
following questions:
• Who needs to access the archive data, and for what purpose?
• What are the response time expectations?
• How will the archive data age?
• How many storage tiers and what type of storage should be deployed, for
example, SAN, WORM, or tape?
For example, for online archive you could use ATA disks or large capacity slower
drives. For offline archive, you could use tape or WORM (IBM DR550, EMC
Centera).
[Figure: tiered storage for archived data. Current data (0-2 years) stays in the production database; the online archive (3-5 years) resides on a non-DBMS retention platform such as an ATA file server, IBM DR550, or EMC Centera; the offline archive (6+ years) resides on a retention platform such as CD, tape, or optical media. Data can be restored from either archive tier. Universal, application-independent access to the archived application data is provided through a report writer, XML, ODBC/JDBC, or a native application, optionally with IBM federation.]
STEP 6: Access to archived data
The Optim Data Growth Solution access layer uses SQL92 capability and various protocols (as shown in the above figure) to provide access to the archived data. This access path is separate from the production database, and so it does not use any resources of the production database system.
Alternatively, you can use a federated system (using IBM DB2 Federated Server) to
provide transparent access to the archive from the production database.
Both methods allow for direct access to archived data, without the need to retrieve or
restore the archived data.
The following example demonstrates how to use a UNION ALL view to access both
active and archived data. The example renames the database table called project to a
different name, and then creates a UNION ALL view that is also named project.
RENAME TABLE project TO project_active
CREATE VIEW project AS
SELECT * FROM project_active
WHERE prjendate >= (CURRENT_DATE - 5 YEARS)
UNION ALL
SELECT * FROM project_arch
WHERE prjendate < (CURRENT_DATE - 5 YEARS)
As an alternative, the following example avoids the need to rename the table in the
database. Instead, the example creates a UNION ALL view called project_all that
the application can query from to get the complete project data set:
CREATE VIEW project_all AS
SELECT * FROM project
WHERE prjendate >= (CURRENT_DATE - 5 YEARS)
UNION ALL
SELECT * FROM project_arch
WHERE prjendate < (CURRENT_DATE - 5 YEARS)
Best Practices
• For database partitioning, use a partitioning key column with
high cardinality and frequently used by a join predicate.
• Use database partitioning to improve scalability for large scale
data warehouses.
• Use table partitioning for very large tables, tables with queries
that access range-subsets of data, and for roll-out requirements.
• For MDC, specify low-cardinality columns or use generated
columns to reduce cardinality.
• Use a single-column MDC design to facilitate roll-in and roll-out
to minimize increased disk space usage.
• For large scale applications, implement database partitioning,
table partitioning, and MDC simultaneously.
• Use large table spaces for tables with deep compression if you
believe you will have very small row sizes. For table partitioning,
place each table partition global index in a separate table space
(this might avoid the need for large table spaces) or use
partitioned local indexes.
• For larger data volumes, use the ALTER/Add method to roll-in a
table partition, or use MDC.
• For Version 9.1, to attach a table partition with compressed data, build a dictionary with minimal data prior to ALTER/ATTACH to avoid table reorganization.
• For Version 9.5, to attach a table partition, use the ALTER/Add
method.
• For continuous updates, facilitate roll-in of data by specifying a single-dimension MDC on day.
• Use federation to facilitate access to archived data from
production databases.
• Use UNION ALL views for transparent access to archived data.
• IBM Optim Data Growth Solution is the recommended tool for
data retention and retrieval.
Conclusion
Careful selection of the most appropriate partitioning method for your DB2 database,
and using the most efficient roll-in and roll-out technique for your system can maximize
your system’s overall performance and efficiency.
Devote sufficient time to analyzing and understanding your data so that you can make
the best use of the guidelines in this paper and take advantage of the features the DB2
database system provides to help make your system as efficient as possible.
You can use database partitioning to provide scalability and to help ensure even
distribution of data across partitions. Follow the guidelines in the section “Designing and
implementing your table partitioning strategy” to devise the most effective table
partitioning strategy. Use MDC to help improve the performance of queries and to
facilitate the roll-in of data.
If you need to roll in large volumes of data into compressed table partitions, upgrade to Version 9.5 of the DB2 database system and use the ALTER/Add method to attach a table partition.
If you need to accommodate continuous updates, your best strategy is to use MDC to
facilitate the roll-in process.
To determine how to handle the needs of your historical data, follow the guidelines in
the section “After roll-out: How to manage data growth and retention?”.
Before you are ready to roll out your data and archive it, you need to determine a policy for data retention and for retrieval of data from the archive that suits your organization.
You can better understand your organization’s technical requirements for retention and
retrieval by analyzing the following factors:
• The kind of transactions you need to retain
• The “temperature” of your data
• How your business objects are composed
Your policy should include what kind of post-archive storage is most appropriate, and
how best to access the archived data. The guidelines in the section “After roll-out: How
to manage data growth and retention?” can assist you in producing your policy.
Further reading

• DB2 Best Practices - http://www.ibm.com/developerworks/db2/bestpractices/
• Leveraging DB2 Data Warehouse Edition for Business Intelligence -
http://www.redbooks.ibm.com/redbooks/SG247274/wwhelp/wwhimpl/java/html
/wwhelp.htm
• Database Partitioning, Table Partitioning, and MDC for DB2 9 -
http://www.redbooks.ibm.com/redbooks/SG247467/wwhelp/wwhimpl/java/html
/wwhelp.htm
• DB2 V9.5 Information Center -
http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp
• DB2 V9.7 Information Center -
http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/index.jsp
• Optim Data Growth Management -
http://www.optimsolution.com/solutions/DataGrowth.asp
Contributors
Tim Vincent
Chief Architect DB2 LUW
Bill O’Connell
Data Warehousing CTO
Miriam Goodwin
Technical Sales Specialist
Tim Smith
Optim Product Manager
Phrederick Tyrrell
Data Warehousing Competitive Specialist
Aamer Sachedina
Senior Technical Staff Member
DB2 Technology Development
Matthew Huras
DB2 LUW Kernel, Chief Architect
Joyce Simmonds
DB2 Information Management
Notices

This information was developed for products and services offered in the U.S.A.
IBM may not offer the products, services, or features discussed in this document in other
countries. Consult your local IBM representative for information on the products and services
currently available in your area. Any reference to an IBM product, program, or service is not
intended to state or imply that only that IBM product, program, or service may be used. Any
functionally equivalent product, program, or service that does not infringe any IBM
intellectual property right may be used instead. However, it is the user's responsibility to
evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in
this document. The furnishing of this document does not grant you any license to these
patents. You can send license inquiries, in writing, to:
IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10504-1785
U.S.A.
The following paragraph does not apply to the United Kingdom or any other country where
such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES
CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER
EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-
INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do
not allow disclaimer of express or implied warranties in certain transactions, therefore, this
statement may not apply to you.
Without limiting the above disclaimers, IBM provides no representations or warranties
regarding the accuracy, reliability or serviceability of any information or recommendations
provided in this publication, or with respect to any results that may be obtained by the use of
the information or observance of any recommendations provided herein. The information
contained in this document has not been submitted to any formal IBM test and is distributed
AS IS. The use of this information or the implementation of any recommendations or
techniques herein is a customer responsibility and depends on the customer’s ability to
evaluate and integrate them into the customer’s operational environment. While each item
may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee
that the same or similar results will be obtained elsewhere. Anyone attempting to adapt
these techniques to their own environment do so at their own risk.
This document and the information contained herein may be used solely in connection with
the IBM products discussed in this document.
This information could include technical inaccuracies or typographical errors. Changes are
periodically made to the information herein; these changes will be incorporated in new
editions of the publication. IBM may make improvements and/or changes in the product(s)
and/or the program(s) described in this publication at any time without notice.
Any references in this information to non-IBM Web sites are provided for convenience only
and do not in any manner serve as an endorsement of those Web sites. The materials at
those Web sites are not part of the materials for this IBM product and use of those Web sites is
at your own risk.
IBM may use or distribute any of the information you supply in any way it believes
appropriate without incurring any obligation to you.
Any performance data contained herein was determined in a controlled environment.
Therefore, the results obtained in other operating environments may vary significantly. Some
measurements may have been made on development-level systems and there is no
guarantee that these measurements will be the same on generally available systems.
Furthermore, some measurements may have been estimated through extrapolation. Actual
results may vary. Users of this document should verify the applicable data for their specific
environment.
Information concerning non-IBM products was obtained from the suppliers of those products,
their published announcements or other publicly available sources. IBM has not tested those
products and cannot confirm the accuracy of performance, compatibility or any other
claims related to non-IBM products. Questions on the capabilities of non-IBM products should
be addressed to the suppliers of those products.
All statements regarding IBM's future direction or intent are subject to change or withdrawal
without notice, and represent goals and objectives only.
This information contains examples of data and reports used in daily business operations. To
illustrate them as completely as possible, the examples include the names of individuals,
companies, brands, and products. All of these names are fictitious and any similarity to the
names and addresses used by an actual business enterprise is entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate
programming techniques on various operating platforms. You may copy, modify, and
distribute these sample programs in any form without payment to IBM, for the purposes of
developing, using, marketing or distributing application programs conforming to the
application programming interface for the operating platform for which the sample
programs are written. These examples have not been thoroughly tested under all conditions.
IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these
programs. The sample programs are provided "AS IS", without warranty of any kind. IBM shall
not be liable for any damages arising out of your use of the sample programs.
Trademarks

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International
Business Machines Corporation in the United States, other countries, or both. If these and
other IBM trademarked terms are marked on their first occurrence in this information with a
trademark symbol (® or ™), these symbols indicate U.S. registered or common law
trademarks owned by IBM at the time this information was published. Such trademarks may
also be registered or common law trademarks in other countries. A current list of IBM
trademarks is available on the Web at “Copyright and trademark information” at
www.ibm.com/legal/copytrade.shtml
Windows is a trademark of Microsoft Corporation in the United States, other countries, or
both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
Other company, product, or service names may be trademarks or service marks of others.