Guaranteed delivery with InfoSphere DataStage

Paul Stanley ([email protected]), Senior Connectivity Architect, IBM

17 October 2013

© Copyright IBM Corporation 2013. Trademarks.

This article describes how the InfoSphere® DataStage Distributed Transaction Stage can be used to guarantee delivery of data. It also addresses the use of local transactions within DataStage database stages. Finally, it describes how the Change Data Capture Transaction stage works with InfoSphere Data Replication to guarantee delivery of changes to a target database.

Introduction

InfoSphere DataStage provides powerful capabilities to extract data from a source system, transform it, and load it into a target system. Many DataStage users require solutions that guarantee that data is moved from the source system to the target system without any possibility of data loss. This article describes in detail various approaches that you can use with InfoSphere DataStage to guarantee the delivery of data from a source to a target. It covers the approach of using the Distributed Transaction stage (DTS) with a transaction manager for distributed XA transactions, and without a transaction manager using local database transactions. It describes using multiple input links with a database connector to perform multiple database operations within a local transaction. It discusses how guaranteed delivery is accomplished when moving data from InfoSphere Data Replication to DataStage using the Change Data Capture Transaction stage. Finally, it describes the best practice of combining messaging stages with database stages to provide reliable data processing and transformation services.

Global transactions with the DTS

This section explains the XA 2-phase commit architecture at a high level, and it shows how the DTS interacts with WebSphere® MQ and database resources to accomplish global transactions.

The X/Open Group standard eXtended Architecture (XA) defines a protocol for updating multiple resources within a single transaction. A resource can be a database or a messaging system, such as WebSphere MQ. Achieving atomicity, consistency, isolation, and durability (ACID) across multiple resources connected via unreliable network connections is a difficult problem to solve. XA achieves this by using a two-phase commit (2PC) protocol. A transaction manager manages the protocol, communicating with multiple resource managers that manage the database or messaging resource transactions. The two phases of the 2PC protocol are:




• Commit-request: In this phase, the transaction manager sends request messages to each resource. The resources prepare the transaction to the point of commit, without actually committing the transaction. For example, a resource could write the data to the database, but flag each record as uncommitted. The resource then sends a status message back to the transaction manager to indicate success or failure.

• Commit: If all of the resources reported success from the first phase, then the transaction manager sends a commit message to each resource. The resources complete the transaction and then report their final status back to the transaction manager. If the transaction manager receives any failure messages from any of the resources, it then sends an abort message to all of the resources, which roll back their uncommitted transactions.

2PC works because if a resource reports success from the first phase, then it is issuing a guarantee that it will commit the data. Underlying the protocol is a set of handshaking messages that handle a variety of cleanup situations, such as network failures or the failure of any components involved in the transaction.
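The two phases can be sketched in a few lines of Python. The `Resource` class here is a hypothetical stand-in for a database or messaging resource manager, not an IBM API; it only models the vote-then-commit behavior described above:

```python
# Minimal sketch of the two-phase commit protocol. A Resource votes in
# phase 1 (prepare) and honors that vote in phase 2 (commit/rollback).

class Resource:
    def __init__(self, name, prepare_ok=True):
        self.name = name
        self.prepare_ok = prepare_ok
        self.state = "active"    # active -> prepared -> committed / rolled_back

    def prepare(self):
        # Phase 1: make the work durable but uncommitted, then vote.
        if self.prepare_ok:
            self.state = "prepared"
            return True
        self.state = "rolled_back"
        return False

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "rolled_back"


def two_phase_commit(resources):
    # Phase 1: commit-request. Every resource must vote "yes".
    votes = [r.prepare() for r in resources]
    if all(votes):
        # Phase 2: all voted yes, so the coordinator orders a commit.
        for r in resources:
            r.commit()
        return "committed"
    # Any "no" vote aborts the whole transaction.
    for r in resources:
        r.rollback()
    return "rolled_back"
```

Because a "yes" vote in phase 1 is a binding promise, the coordinator can safely order the commit in phase 2, which is exactly the guarantee described above.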

The DTS takes advantage of 2PC by using the WebSphere MQ transaction manager, which is a standard feature of WebSphere MQ Server, to coordinate the transaction with the resource managers as shown in Figure 1:

Figure 1. The relationship between DTS, MQ, and resource managers

DTS first connects to WebSphere MQ and instructs the transaction manager to start the XA transaction. The transaction manager communicates with the resource managers to start the transactions. DTS then utilizes the DB2, Oracle, or WebSphere MQ connectors to perform write operations on the resource. Finally, DTS instructs the transaction manager to either commit or roll back the XA transaction.

DTS jobs typically use WebSphere MQ as the source of data, and databases as the target. By using DTS, delivery from the source queue to the target database is guaranteed: if the transaction is committed successfully, the source message is deleted from the source queue, and the data is written to the target database. If the job fails, nothing is written to the target database, and the message is retained on the source queue. This is guaranteed because the deletion of the source message is made within the same XA transaction as the updates to the target database, and so either both operations occur, or neither does.

DTS jobs often use work queues to partition the work into parallel pipelines. The use of work queues is best explained with the aid of a diagram, shown in Figure 2:


Figure 2. The use of work queues by the DTS

This job works with WebSphere MQ to guarantee delivery in the following way:

1. Under local sync-point control, the MQ Connector reads a message from the source queue, performing a destructive read (that is, it removes the message from the queue).

2. The MQ Connector writes the message to the work queue. If running in parallel, there are multiple work queues, one per parallel pipeline. The MQ Connector then commits the local transaction. Note that the use of a local transaction here means that the movement of the message from source queue to work queue is guaranteed. If it should fail, the deletion is rolled back, the message is restored on the source queue, and the job is aborted.

3. The DTS stage writes updates to the target database.

4. The DTS stage deletes the message from the work queue.

By using work queues, the MQ source connector is able to run in parallel. Each instance of this stage reads different messages from the source queue because it is reading in a destructive fashion, and WebSphere MQ ensures that only one of multiple readers reads any particular source message.

When the transaction is committed, if all goes well, the message is deleted from the work queue and the database updates are committed, all as part of a single XA transaction. Should any failure occur, such as a database write error, the XA transaction is rolled back, which undoes any changes to the target database and restores the MQ message to the work queue. If the job is subsequently restarted, the MQ Connector first reads messages from the work queue and provides them to the job, and then continues reading from the source queue. This means that in the event of a failure, simply restarting the job allows it to continue to process messages from the point where it left off.
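The four steps and the restart behavior can be modeled with a small simulation. Queues here are plain Python lists and each "transaction" is a group of list operations treated as atomic; real jobs rely on WebSphere MQ sync points and XA transactions, so this sketch only mirrors the message flow:

```python
# Toy model of the work-queue pattern: messages move from source queue to
# work queue under a local transaction, then from work queue to target
# database under an XA transaction. A failed wave leaves its message on
# the work queue, and a restart drains the work queue first.

def stage_message(source_queue, work_queue):
    # Steps 1-2: destructive read from the source, landed on the work queue.
    if source_queue:
        work_queue.append(source_queue.pop(0))

def process_wave(work_queue, target_db, fail=False):
    # Steps 3-4: write to the target and delete from the work queue as one
    # XA transaction; on failure, neither change takes effect.
    if fail:
        return False
    target_db.extend(work_queue)
    work_queue.clear()
    return True

def run_job(source_queue, work_queue, target_db, fail_on_wave=None):
    # On (re)start, drain the work queue first, then the source queue.
    wave = 0
    while work_queue or source_queue:
        if not work_queue:
            stage_message(source_queue, work_queue)
        wave += 1
        if not process_wave(work_queue, target_db, fail=(wave == fail_on_wave)):
            return False   # rollback: the message stays on the work queue
    return True
```

Running the job with an injected failure, then simply running it again, delivers every message exactly once — the property the work-queue design provides.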

Local transactions with the DTS

As well as working with XA-compliant resources such as DB2 and Oracle, DTS also works with non-XA-compliant resources, including ODBC and Teradata targets. This section explains how resources that do not support the XA architecture can still be utilized within DataStage jobs, and how guaranteed delivery is still possible. It also explains the pitfalls and benefits of this design pattern.

DTS works with non-XA resources by creating one or more "local" transactions in addition to the XA transaction. In this context, a resource is uniquely defined by the connection properties, such as the database and username used to access the database. If there are multiple links that access the same resource (that is, the same database, with the same user credentials), then a single transaction is used across those links. At commit time, the local transaction is committed before the XA transaction. This local commit either succeeds, which is the most likely case, or fails,


which results in the transaction being rolled back by the resource. After the local transaction has been committed or rolled back, the XA transaction is committed or rolled back. The possible commit or rollback scenarios when combining a local transaction with the XA transaction include the following:

• The local commit succeeds, then the XA commit succeeds (the normal case).
• The local commit succeeds, but the XA commit fails (very unlikely).
• The local commit fails. When this occurs, DTS rolls back the XA transaction. If DTS should crash or some other system failure occurs before rolling back the XA transaction, then the MQ transaction manager automatically rolls it back anyway.

And for rollback:

• The local rollback succeeds, then the XA rollback occurs.
• The local rollback fails. This scenario is not actually possible; resources assume a rollback position unless they are asked to commit and can successfully follow through on that commit request.

Of these scenarios, the only one of concern is the second one, where the local transaction commit succeeds but the subsequent XA commit fails. In this case, the worst that happens is that the data is written to the target database under the local transaction, but the message remains on the source queue. If this should occur, as long as the job is idempotent, then the job can simply be run again. In none of the possible scenarios previously described is data ever lost. The following section explains how to create an idempotent job that can be restarted upon job failure.
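The commit ordering for a non-XA resource, and the one scenario of concern, can be illustrated as follows. The function name and the in-memory lists are illustrative, not DataStage APIs:

```python
# Sketch of the commit ordering: the local database commit happens first,
# then the XA commit removes the consumed message from the source queue.

def commit_sequence(source_queue, target_db, staged_rows, xa_commit_ok=True):
    # Local commit: the staged database writes become durable.
    target_db.extend(staged_rows)
    if xa_commit_ok:
        # XA commit: the consumed message is removed from the source queue.
        source_queue.pop(0)
        return "committed"
    # The unlikely case: the data is committed AND the message stays on
    # the queue. Nothing is lost; an idempotent job is simply rerun.
    return "xa_failed"
```

Note that even in the failure case the data and the message both survive, which is why idempotency, rather than data recovery, is the only requirement placed on the job.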

Idempotency and DataStage jobs

This section explains how to write jobs so that they offer idempotent behavior; that is, they can be restarted upon failure without compromising data. Such jobs need to be able to accept data that they may have already processed, and either process it again or ignore it. To help explain this, first consider a job that is not idempotent: a simple file to Oracle connector job, shown in Figure 3:

Figure 3. A simple file to Oracle connector job

with the properties of the Oracle connector specified as shown in Figure 4:

Figure 4. Oracle connector properties


Assume that table TABLE_WITH_PK has a unique constraint, meaning that it cannot have two rows with the same key value. Clearly, if this job runs a second time, it will fail because the same data will be 'replayed' to the Oracle connector, which will lead to duplicate key row errors. There are a number of possible ways to overcome this:

• Change the Write mode property to one of Insert new rows only, Insert then update, or Update then insert. The first option ignores rows that already exist in the database; the second executes an insert on the row and, if this returns row exists, then executes an update statement with the same data. The latter option is similar but executes the statements in the reverse order.

• Add a reject link to the target stage and configure it so that records that cause row errors are sent to the reject link.

• Use a sparse lookup stage to see whether a record already exists in the target. The results of the lookup can be used to determine whether to send records to the target stage, to ignore them, or to divert existing records to a log file or other target.

Which of these approaches to use depends on the particular use case and whether processing is required on the source data. If transformations or other functions are required on the source data, it may be more efficient to determine that a record already exists by using a sparse lookup stage early in the job, such that the transformation can be skipped. If the source data is largely just passed straight through to the target, through a small number of intermediate stages, it may be more efficient to allow the data to reach the target connector and have that connector reject or ignore the data.
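The write modes can be mimicked against a dict acting as a table with a unique primary key. This is an illustrative model of the behavior, not connector code:

```python
# Model of the write modes: a plain "insert" fails on a replayed row,
# while the other two modes make the job idempotent.

def write(table, rows, mode):
    for key, value in rows:
        if mode == "insert":
            if key in table:
                raise ValueError("duplicate key: %s" % key)   # job aborts
            table[key] = value
        elif mode == "insert_new_rows_only":
            if key not in table:
                table[key] = value     # existing rows are silently ignored
        elif mode == "insert_then_update":
            table[key] = value         # insert, or update on duplicate key
        else:
            raise ValueError("unknown mode: %s" % mode)
```

Replaying the same rows under "insert" raises the duplicate key error described above, while either of the other modes accepts the replay, which is what makes a restarted job safe.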

Another way to achieve idempotency is to use additional target tables to store the transactional state. To do so requires that a target stage can write to multiple target tables as part of a single transaction. The database connectors have such a capability, which is described in the following section.

Local transactions with database connectors

Since Information Server version 8.5, database connectors have provided the ability to support multiple input links, where each link targets a particular database table and has its own set of properties, such as the write mode. These multiple links are all executed within the same database transaction, so the ACID capabilities required for guaranteed delivery are maintained: either all of the table updates are written, or none of them are.

There are many use cases for such functionality, including:

• Writing related records, such as maintaining parent-child relationships, within a single transaction.
• Using different write modes for each link, where each link is configured to update the same table. For example, link 1 may perform deletions from a table, while link 2 performs updates to the same table.
• Storing a transactional marker to hold the state of the transaction.
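The all-or-nothing behavior of multiple input links sharing one transaction can be sketched like this. The `Transaction` class is a hypothetical stand-in for the connector's unit of work, not a DataStage or database API:

```python
# Two input links writing to two tables inside one transaction: either
# both tables receive their rows, or neither does.

class Transaction:
    def __init__(self, tables):
        self.tables = tables                          # name -> committed rows
        self.staged = {name: [] for name in tables}

    def write(self, table, row):
        self.staged[table].append(row)                # buffered until commit

    def commit(self):
        for name, rows in self.staged.items():
            self.tables[name].extend(rows)
        self.rollback()                               # clear the staging area

    def rollback(self):
        self.staged = {name: [] for name in self.tables}


def load_order(txn, order, items):
    # Link 1 writes the parent row, link 2 the child rows; a bad child
    # rolls back the parent too, so the tables never disagree.
    txn.write("orders", order)
    for item in items:
        if item["order_id"] != order["id"]:
            txn.rollback()
            return False
    for item in items:
        txn.write("order_items", item)
    txn.commit()
    return True
```

A rejected child row undoes the parent write as well, which is the parent-child guarantee the first use case above calls for.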


This latter approach is often used in checkpoint-restart scenarios to avoid resending the same data to the target database in the event that a job has to be restarted due to some failure condition. One way to accomplish this is to utilize a target table to store the number of the last processed row. Because this is committed in the same transaction as the target updates, it is guaranteed to precisely reflect the last processed row count. Early in the same job, a lookup stage reads this last processed value and passes it to a transformer stage that contains a constraint expression to compare the row number to the count. This stage discards any rows whose row number is less than or equal to the last processed value, because these have already been processed by the target stage. The result is that only those records that have not yet been processed by the target stage are delivered to that stage.

One more stage is required, which is a Wave Generator stage. When configured with multiple input links, database connectors only commit the transaction at the end of each wave. The Wave Generator offers a number of different ways of determining when to issue a wave marker. For this case, an absolute row count would suffice, such that the target connector commits the transaction every N rows.

The complete job looks like Figure 5:

Figure 5. Checkpoint-restart job

A job such as the one in Figure 5 can be the solution to the idempotency problem, and guarantees the delivery of data from source to target.
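The checkpoint-restart mechanism can be simulated as follows. The dict keys and function name are illustrative, not DataStage objects: the "database" stores a last-processed row number that is updated in the same atomic step as the data, committing every N rows as the Wave Generator dictates, and a restarted run discards rows at or below the checkpoint:

```python
# Simulation of the checkpoint-restart job. db["rows"] and db["checkpoint"]
# stand in for the target table and the last-processed-row table; updating
# both in one step models committing them in a single transaction at each
# wave boundary.

def run_checkpointed(rows, db, wave_size=2, crash_after_waves=None):
    staged, waves = [], 0
    for row_num, value in rows:
        if row_num <= db["checkpoint"]:
            continue                     # transformer constraint: already done
        staged.append((row_num, value))
        if len(staged) == wave_size:     # wave marker: commit data + checkpoint
            db["rows"].extend(staged)
            db["checkpoint"] = staged[-1][0]
            staged, waves = [], waves + 1
            if waves == crash_after_waves:
                return False             # job fails; later rows never arrive
    if staged:                           # final partial wave
        db["rows"].extend(staged)
        db["checkpoint"] = staged[-1][0]
    return True
```

Crashing mid-run and then simply rerunning the same input delivers every row exactly once, because the checkpoint and the data can never disagree.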

Other design patterns

This section explains some other patterns that can achieve guaranteed delivery with other resources.

Guaranteed delivery with InfoSphere Data Replication

The Change Data Capture Transaction stage (CDCTS) works with InfoSphere Data Replication (DR) to replicate and process mission-critical data events in real time. DR uses a log-based capture method to detect updates to various database systems, and it can then either replicate these updates to another database system, or convey them to files, WebSphere MQ messages, or the CDCTS. The CDCTS communicates with DR over TCP/IP to receive these updates to the source database, and also to pass status messages.


Figure 6 shows a simple DR DataStage job:

Figure 6. Data replication job

The CDC stage is configured with the name of the DR subscription and must have at least two output links: one or more of these links carry the database updates, and exactly one link passes the bookmark. The bookmark is similar to the last processed count described in the previous section. It is used to identify the last committed point of the current transaction. A CDCTS job uses a target database connector, which has at least two input links. One of these links carries the bookmark value, which targets a simple table, named the bookmark table, that is created to store the bookmark value. As with the prior example, the target database connector commits the transaction at the end of transactional waves. The boundaries of these transactional waves are determined by DR. DR sends commit messages to the CDCTS that align with the transaction boundaries of the updates to the source database. When the CDCTS receives these commit messages, it sends out transactional wave markers to its output links. When the target database stage receives the wave markers on each link, it commits the transaction. Because a single transaction encompasses both the bookmark table and the database updates, the stored bookmark value is guaranteed to be in sync with the database updates.

As the job runs, DR periodically asks the CDCTS to report the last-committed bookmark value. The CDCTS queries the bookmark table via an ODBC connection and reports the bookmark back to DR. DR uses this to clean up logs, because it can be certain that a particular transaction has been successfully written to the target table.

In the event of a job, network, or system failure, the job is left in an incomplete state. When the job is restarted, the CDCTS reports to DR the last-committed bookmark value, and DR then uses that information to determine which record to start from.

Once more, delivery from DR to the target database is guaranteed, and by the use of the bookmark mechanism, there are never duplicate records sent to the target database.

Guaranteed delivery with pipelined stages

There are other creative ways that jobs can be written to ensure guaranteed delivery. Most of these rely on the use of transactional wave markers, as seen in the previous examples. Another such example is a case where data must be moved from a source queue to a target database, but the messaging system is not WebSphere MQ. DTS only supports WebSphere MQ, so achieving similar functionality with other messaging systems, such as the Java Message Service (JMS), requires a job such as the one in Figure 7:


Figure 7. Pipeline job

The SourceMessage and DeleteMessage stages use a JMS solution coded on top of the Java Integration Stage. This JMS stage also outputs wave markers periodically. The solution works because the Teradata connector provides the ability to send successful records to its reject link (shown in Figure 8), if configured to do so from the reject link properties:

Figure 8. Teradata connector reject properties

The sequence of events within the job execution is:

1. The JMS stage reads the source message, but leaves it on the source queue. This stage also emits end-of-wave markers at prescribed intervals.

2. The data from JMS is written to the Teradata database by the Teradata connector. If the write is successful, then the connector forwards the data to its reject link, which is configured to forward only successful records.

3. When the Teradata connector receives the end-of-wave marker from the JMS source stage, it commits the transaction and forwards the end-of-wave marker to its output link.

4. When the Remove Duplicates stage receives the end-of-wave marker, it removes duplicates and then sends its results, including an end-of-wave marker, to its output.

5. The final JMS stage deletes the source message when it receives an end-of-wave marker.

Notice that the previous solution would not work correctly if there were no end-of-wave markers, because the Teradata connector sends records to its output link after the data is written, not after it is committed. But by using wave markers and including the Remove Duplicates stage, the sequence is guaranteed.

There is a possibility that after the database transaction has been committed, the job fails. In this event, the final deletion of the source message does not occur, and the message is left on the source queue. No data is ever lost, but such a failure would mean that if the job is restarted, the


Teradata connector sees the same data that it has already processed. For this reason, as with the use of local transactions with DTS, the job design needs to be idempotent. With a correctly constructed idempotent job, if the job is aborted for any reason, simply restarting it allows it to continue from where it left off.
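A toy model of this pipeline shows why the final delete must wait for the end-of-wave marker, and why the target write must be idempotent. The function and its lists are illustrative stand-ins, not JMS or Teradata APIs:

```python
# Toy model of the pipelined job: messages remain on the source queue while
# they flow through the job; the target commits at each end-of-wave marker,
# and only then does the final JMS stage delete that wave's messages.
# A crash after the commit but before the delete leaves messages on the
# queue, and the idempotent write skips them on the rerun.

def run_pipeline(source_queue, db, wave_size=2, crash_after_wave=None):
    wave, waves_done = [], 0
    for msg in list(source_queue):       # non-destructive read: queue unchanged
        wave.append(msg)
        if len(wave) == wave_size:       # end-of-wave marker arrives
            for m in wave:
                if m not in db:          # idempotent write: ignore replays
                    db.append(m)
            waves_done += 1              # transaction committed here
            if waves_done == crash_after_wave:
                return False             # job dies before the delete stage
            for m in wave:               # final stage deletes the messages
                source_queue.remove(m)
            wave = []
    return True
```

After a crash, the committed messages are still on the queue; the rerun skips re-inserting them and completes the deferred delete, so no data is lost and no duplicates reach the target.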

Conclusion

This article describes a number of approaches and methodologies that can be used to guarantee delivery of data from a source system to a target system via DataStage jobs. There is not a single solution that works best in all circumstances, due to the different nature of the resources being accessed, so a combination of techniques is required. Factors that determine which approach to use include whether or not the resource supports XA transactions, the size of the transactional data, latency requirements, and restartability requirements.

Acknowledgements

The author sincerely thanks his colleagues Tony Curcio and Ernie Ostic for reviewing and providing their valuable feedback that helped to refine this article.




About the author

Paul Stanley

Paul Stanley is a senior software architect in Information Integration Solutions, Information Management Group. He has been architecting and managing the development of connectivity components for InfoSphere Information Server and WebSphere Transformation Extender for over 15 years.

© Copyright IBM Corporation 2013 (www.ibm.com/legal/copytrade.shtml). Trademarks (www.ibm.com/developerworks/ibm/trademarks/).