
Experience of Using Pro2Oracle + OE CDC


The Big Data in OpenEdge

Experience of Using

Pro2Oracle + OE CDC

Valeriy G. Bashkatov

Progress Technologies

OpenEdge Principal Consultant

Good day everyone and welcome to my session!

My name is Valeriy Bashkatov. I’m from Progress Technologies, the official distributor of Progress Software in Russia, the CIS countries and Latvia.

Today I will tell you about my experience of implementing Pro2Oracle at one of our largest clients, for whom Big Data in OpenEdge is not the future but the present.

But first, a few words about who I am.


Who Am I?

▪ With Progress since 2003:

I have been working with Progress for sixteen years as a Technical Support Engineer for Russian users.

I met Progress in Kazakhstan at TexaKaBank, where I worked as a database administrator. Then I continued to work with Progress at Sberbank as the head of technical support. In two thousand eight I moved to Russia, where I worked for the CSBI company. There I provided Progress support to the developers of the bank information system Bankier, as well as end-user support.

For the last six years I have been working at Progress Technologies, where I am responsible for hundreds of clients, from small to large businesses, including the largest banks in Russia. I will tell you about one of them today.


Who Am I?

Trainings (78 new DBAs since 2008):

• “OpenEdge DBA for beginners”.

• “Advanced DBA for OpenEdge”.

I also have two of my own OpenEdge administration trainings that I conduct in Moscow and St. Petersburg. Seventy-eight new administrators have already completed these trainings.


Who Am I?

▪ Books (in Russian):

1. “Fundamentals of OpenEdge database administration”

2. “OpenEdge platform for beginners”

3. “OpenEdge Table Partitioning for DBA”

4. “OpenEdge Table Partitioning for programmers”

5. “Database data encryption (Transparent Data Encryption)”

6. “After-Imaging in OpenEdge”

7. “Learning OpenEdge Replication”

I also have several books about OpenEdge in Russian. All these books are available free of charge in the community.


Who Am I?

▪ https://rupug.pro

RuPUG Education Center: Own your future learning new skills online

In addition, I have my own website where I support the Russian Progress Users Group.

There I post technical articles about OpenEdge and provide some online trainings for RuPUG members.


https://www.facebook.com/groups/ProgressOEZone/

And, of course, I manage the largest OpenEdge group on social networks, "Progress - OpenEdge Zone".

Today, this group has more than two thousand members.


What is Pro2?

▪ Pro2™ is a solution for live, near real-time replication of OpenEdge data to an MS-SQL Server, Oracle or OpenEdge target.

▪ Delivers quick and easy access to mission-critical data from your OpenEdge system

• Without disrupting normal business operations or risking transactional database performance and stability

▪ Data replication: not disaster recovery

And now, for those who are new to Pro2, let’s spend a few minutes going over what Pro2 is.

Progress OpenEdge Pro2 is a solution that provides easy, fast data replication from OpenEdge into a separate OpenEdge, SQL Server or Oracle database.

It is important to note that, unlike the OpenEdge Replication product, which is used for disaster recovery, Pro2 can replicate a subset of tables and fields to the target database. Also, the structure of the target database does not need to match the structure of the source database. Most customers have additional tables, fields, and indexes in the target database to better support their business intelligence system.

Now I'm going to give you some information about the biggest client I've ever worked with.


The VTB Bank & OpenEdge

The name of this client is VTB Bank


▪ 15 000 - active users in the database (UTC+3 … UTC+10)

The VTB Bank & OpenEdge

First, the Bank has fifteen thousand users who work every day around the clock in five time zones. The maximum time difference between branch offices is 7 hours.


▪ 15 000 - active users in the database (UTC+3 … UTC+10)

▪ 17.5 TB - size of the largest database

The VTB Bank & OpenEdge

Secondly, the largest database has a size of seventeen and a half Terabytes.


▪ 15 000 - active users in the database (UTC+3 … UTC+10)

▪ 17.5 TB - size of the largest database

▪ 55 TB - total size of 15 databases of the largest branch

The VTB Bank & OpenEdge

Third, the total size of the fifteen databases of the largest branch office is fifty-five Terabytes.


▪ 15 000 - active users in the database (UTC+3 … UTC+10)

▪ 17.5 TB - size of the largest database

▪ 55 TB - total size of 15 databases of the largest branch

▪ 600 TB - total size of all OpenEdge databases

The VTB Bank & OpenEdge

And the total size of all production databases is six hundred Terabytes.


▪ 15 000 - active users in the database (UTC+3 … UTC+10)

▪ 17.5 TB - size of the largest database

▪ 55 TB - total size of 15 databases of the largest branch

▪ 600 TB - total size of all production databases

▪ ~120 TB - growth of data for the last year

The VTB Bank & OpenEdge

At the same time, over the last year the size of the OpenEdge data grew by about one hundred twenty Terabytes.


▪ 15 000 - active users in the database (UTC+3 … UTC+10)

▪ 17.5 TB - size of the largest database

▪ 55 TB - total size of 15 databases of the largest branch

▪ 600 TB - total size of all databases (including standby)

▪ ~120 TB - growth of data for the last year

▪ 1 million CRUD operations per second, without performance degradation

The VTB Bank & OpenEdge

And finally, just think about these numbers! Every second, about one million read, create, update and delete operations are performed in these databases.

I am sure you will agree with me that this is a huge number of operations!

And yes, all this is done in OpenEdge platform!


The Need For Pro2

Poor performance of an existing solution for uploading data to a data Warehouse

Reconciliation with other systems takes too long

So why was Pro2 chosen in the Bank?

The Bank had its own solution, which was based on file exchange between OpenEdge and the Data Warehouse.

But this solution didn't satisfy the needs of the business because uploading data into Oracle was very slow. As I recall, it took up to three days.

And as you understand, this is unacceptable in the modern world!


The VTB Bank & Pro2Oracle

And that is why, when Pro2 was released by Progress Software, we immediately began working on its implementation for the Bank.

Next, I'll tell you about what we encountered while working with Pro2Oracle. As a result of that work, many things have been improved in Pro2, which allows us to handle this many changes.


Source tables - 2018

Some of the largest tables for Pro2Oracle replication

Table        Recs (Db Analyze)    Size (GB)

Table #1     4 223 377 285        499.4
Table #2     3 161 770 239        217.3
Table #3     2 441 435 657        169.9
Table #4     2 399 959 772        178.9
Table #5     2 384 362 565        165
Table #6     2 242 546 551        68.3
Table #7     2 095 401 605        128.8
Table #8     2 052 205 505        98.4
Table #9     1 989 082 913        78.9
Table #10    1 906 007 478        78.9
Table #11    1 753 559 790        42.3
Table #12    1 709 223 562        83.8
Table #13    1 671 115 196        78.7
Table #14    1 661 420 151        78
Table #15    1 573 281 277        43.4
Table #16    1 500 882 369        58.2
Table #17    1 347 514 073        32.1

This slide shows the largest tables with which we had to work.

Look at the first table: it has more than four billion records. And across all tables we had to process almost forty billion records in the shortest possible time.

Does anyone work with such large tables? (Yes? Great! I am happy that we are not alone in this universe.)

Believe me, Big Data in OpenEdge is an amazing adventure and an incredible experience!


Pro2Oracle - what did we encounter?

▪ CHARACTER -> VARCHAR2

▪ INITIAL SYNC

▪ 64BIT ROWID

▪ REPLICATION TRIGGERS & COMPRESSION

▪ SPLITTING REPLICATION THREADS

▪ OE CHANGE DATA CAPTURE (CDC)

But Big Data is a Big Problem!

On the screen is a list of the main things we had to face. Some of them were solved as a result of our work with Big Data.

Let's look at each of them.


CHARACTER (OE) -> VARCHAR2 (Oracle)

OpenEdge: CHARACTER = 31 KB  =>  Oracle: VARCHAR2 = 4 KB

=> Error: catch: 4212 A column in this row being inserted or updated is too large (4212)

First, in Oracle the maximum allowed size of a field with the VARCHAR2 data type is four kilobytes, but, as you know, in OpenEdge a field with the CHARACTER data type can contain up to thirty-one kilobytes of data. And yes, we had such fields.

As a result, data from such fields did not always fit into Oracle fields.


CHARACTER (OE) -> VARCHAR2 (Oracle)

OpenEdge: CHARACTER = 31 KB  =>  Oracle: VARCHAR2 = 4 KB

SUBSTRING(bfrSrcCustomer.Address,1,4000)

OR

Enable Extended Data Types in Oracle 12c: VARCHAR2 = 31 KB

So we had to trim those fields.

But this is not a good idea, because some of the data was lost.

We found another solution. Oracle twelve makes it possible to enable support for Extended Data Types, so that the size of VARCHAR2 can reach thirty-one kilobytes. But this option has not been tested. Unfortunately, we didn't have time for this. It will be a task for future implementations.
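Before choosing between trimming and Extended Data Types, it can help to measure how many values actually exceed the limit. The following is only a sketch in Python, not part of Pro2. It assumes the standard 4000-byte VARCHAR2 limit and UTF-8 data, and the function names are hypothetical. Note that it counts bytes, so a string of 4000 multi-byte characters would still not fit.

```python
VARCHAR2_MAX_BYTES = 4000  # assumed: standard VARCHAR2 limit, Extended Data Types not enabled


def fits_varchar2(value: str, limit: int = VARCHAR2_MAX_BYTES) -> bool:
    """Return True if the UTF-8 encoding of value fits into the byte limit."""
    return len(value.encode("utf-8")) <= limit


def truncate_to_bytes(value: str, limit: int = VARCHAR2_MAX_BYTES) -> str:
    """Trim value so its UTF-8 encoding fits, never cutting a character in half."""
    encoded = value.encode("utf-8")
    if len(encoded) <= limit:
        return value
    # errors="ignore" drops an incomplete multi-byte sequence at the cut point
    return encoded[:limit].decode("utf-8", errors="ignore")
```

Running `fits_varchar2` over a sample of the source field shows how much data a plain trim would lose before any decision is made.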


Initial Synchronization

Old Pro2 Template: tmpl_mreplproc_oracle.p

Processed 100,000 records so far. Time is 26/11/2017 13:37:26.311+03:00. Total Elapsed Time is: 0h:5m:9s.845ms

4 billion records = 20 weeks!!!

Pro2 requires initial synchronization of OpenEdge with Oracle.

At that time, we used Pro2 version four. The template used for synchronization in that version was too slow.

On this slide you can see that it took five minutes to process a hundred thousand records. A simple calculation shows that synchronization of a table with four billion records would take about twenty weeks.

This is unacceptable!
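The twenty-week figure follows from simple arithmetic on the log line above. Here is a small sketch of that calculation in Python. The function name is made up for illustration, and the rate is assumed to stay constant over the whole run.

```python
def estimate_sync_seconds(total_records: int, batch_records: int, seconds_per_batch: float) -> float:
    """Extrapolate total run time from the time measured for one batch."""
    return total_records / batch_records * seconds_per_batch


# 100,000 records took 0h:5m:9s.845ms with the old template
old_template_seconds = estimate_sync_seconds(4_000_000_000, 100_000, 5 * 60 + 9.845)
old_template_weeks = old_template_seconds / (7 * 24 * 3600)
print(f"{old_template_weeks:.1f} weeks")
```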


Initial Synchronization

Third party solution: Pentaho Data Integration

Fortunately, in one of the Pro2 updates the developers suggested an alternative solution: Pentaho Data Integration.

It worked many times faster than standard Bulk-Copy in Pro2.


Initial Synchronization

Third party solution: Pentaho Data Integration

2018/07/06 13:25:34 - URT - replControl Status.0 – db1.table1: Row 100000 Status Update, Rows Completed: 1000000 …

2018/07/06 13:25:37 - URT - replControl Status.0 - db1.table1 : Row 100000 Status Update, Rows Completed: 1100000…

2018/07/06 13:25:42 - URT - replControl Status.0 - db1.table1 : Row 100000 Status Update, Rows Completed: 1200000…

2018/07/06 13:25:45 - URT - replControl Status.0 - db1.table1 : Row 100000 Status Update, Rows Completed: 1300000…

2018/07/06 13:25:49 - URT - replControl Status.0 - db1.table1 : Row 100000 Status Update, Rows Completed: 1400000…

4 billion records = 4 - 5 days

And now, depending on the load on the server, it took from three to ten seconds to process one hundred thousand records.

Our calculations showed that it would now take only four or five days to process four billion records in the initial synchronization.

We thought the problem was solved. But, unfortunately, no...


Initial Synchronization

Third party solution: Pentaho Data Integration

2018/07/06 13:25:34 - URT - replControl Status.0 – db1.table1: Row 100000 Status Update, Rows Completed: 1000000 …

2018/07/06 13:25:37 - URT - replControl Status.0 - db1.table1 : Row 100000 Status Update, Rows Completed: 1100000…

2018/07/06 13:25:42 - URT - replControl Status.0 - db1.table1 : Row 100000 Status Update, Rows Completed: 1200000…

2018/07/06 13:25:45 - URT - replControl Status.0 - db1.table1 : Row 100000 Status Update, Rows Completed: 1300000…

2018/07/06 13:25:49 - URT - replControl Status.0 - db1.table1 : Row 100000 Status Update, Rows Completed: 1400000…

2018/07/07 05:57:34 - URT - replControl Status.0 - db1.table1: Row 100000 Status Update, Rows Completed: 2 138 900 000 ...

2018/07/07 05:57:34 - URT - Insert / Update replControl.0 - Values set for lookup: [BULKCOPY], [db1.table1]

2018/07/07 05:57:35 - T - Source Table.0 - Couldn't get row from result set

2018/07/07 05:57:35 - T - Source Table.0 - [DataDirect][OpenEdge JDBC Driver][OpenEdge] Overflow error (7485)

Big Data is a big problem.

When Pentaho had processed a little more than two billion records, it failed with an Overflow error.

Unfortunately, even together with Progress Technical Support we could not determine the cause of this error. So we had to partially abandon this solution.

But we successfully continued to use Pentaho to synchronize tables with less than two billion records.


Initial Synchronization

NEW Pro2 Template: tmpl_mreplproc_oracle.p Oracle_Bulk_Transaction_Count

Using Oracle_Bulk_Transaction_Count of 10000

Processed 100,000 records so far. Time is 25/11/2018 07:14:53.302+03:00. Total Elapsed Time is: 0h:0m:23s.837ms

Processed 200,000 records so far. Time is 25/11/2018 07:15:14.150+03:00. Total Elapsed Time is: 0h:0m:44s.685ms

Processed 300,000 records so far. Time is 25/11/2018 07:15:35.997+03:00. Total Elapsed Time is: 0h:1m:6s.532ms

...

...

Processed 2,147,200,000 records so far. Time is 29/11/2018 03:35:27.443+03:00. Total Elapsed Time is: 3 days, 20h:20m:57s.978ms

Processed 2,147,300,000 records so far. Time is 29/11/2018 03:35:38.957+03:00. Total Elapsed Time is: 3 days, 20h:21m:9s.492ms

Processed 2,147,400,000 records so far. Time is 29/11/2018 03:35:49.904+03:00. Total Elapsed Time is: 3 days, 20h:21m:20s.439ms

Finished Mass Replication at 29/11/2018 05:00:40.764+03:00. Total Elapsed Time was: 3 days, 21h:46m:11s.298ms

But what do we do with large tables?

We traced the queries in Oracle and found that this template sends data to Oracle one record per transaction. This behavior greatly affects performance.

As a result, a new template was created. Now the size of the transaction can be controlled by the Pro2 property “Oracle_Bulk_Transaction_Count”, whose default value is ten thousand records per transaction.

Thanks to this, we reduced the synchronization time from five minutes to fifteen seconds on average for every hundred thousand records. This is very close to the speed of Pentaho.
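The effect of grouping records into larger transactions can be illustrated outside Pro2. The sketch below is not the Pro2 template. It uses Python's built-in `sqlite3` module as a stand-in for Oracle, and `bulk_load` and `batch_size` are hypothetical names that only mirror the idea behind Oracle_Bulk_Transaction_Count: commit once per batch instead of once per record.

```python
import sqlite3


def bulk_load(rows, batch_size=10_000):
    """Insert rows, committing once per batch_size rows.

    Returns (rows stored, number of commits issued).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, payload TEXT)")
    commits = 0
    pending = 0
    for row in rows:
        conn.execute("INSERT INTO target (id, payload) VALUES (?, ?)", row)
        pending += 1
        if pending == batch_size:
            conn.commit()  # one transaction per batch
            commits += 1
            pending = 0
    if pending:  # commit the final, partially filled batch
        conn.commit()
        commits += 1
    stored = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
    conn.close()
    return stored, commits
```

With `batch_size=1` the function issues one commit per record, which corresponds to the behavior found in the old template. With ten thousand it issues one commit per ten thousand records.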


Initial Synchronization – New challenge 2019

And now I have a new challenge. A week before this conference, I received a request from the Bank. They said, “Hi Valera, we want to add just two more tables. Can you help us with that?” “Yes, of course,” I said. “What are these tables? Send me the tabanalys.”


Initial Synchronization – New challenge 2019

After that, I realized that I had a problem. This is an analysis of the first table. It has nearly seven billion records, and its size is eight hundred gigabytes.


Initial Synchronization – New challenge 2019

But the second table hit me even more! It has seventeen billion records! And its size is one terabyte!


Initial Synchronization – New challenge 2019

I must do this in no more than 5 days!

At the same time, the Bank said that the initial synchronization of these tables should be completed in no more than five days! But at the current speed, it would take seven hundred and forty hours, or about thirty-one days!
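As a rough cross-check, the rate measured with the new template can be extrapolated to the two new tables. This sketch takes the rate from the log shown earlier (2,147,400,000 records in 3 days, 20h:21m:20s.439ms). It uses the rounded record counts from the talk and assumes the rate stays constant, so the results are estimates only.

```python
# Measured with the new template: 2,147,400,000 records in 3 days 20h:21m:20.439s
measured_records = 2_147_400_000
measured_seconds = 3 * 86_400 + 20 * 3_600 + 21 * 60 + 20.439
seconds_per_record = measured_seconds / measured_records


def hours_for(records: float) -> float:
    """Estimated single-thread run time in hours at the measured rate."""
    return records * seconds_per_record / 3_600


first_table_hours = hours_for(7e9)    # "nearly seven billion records"
second_table_hours = hours_for(17e9)  # "seventeen billion records"
print(round(first_table_hours), round(second_table_hours))
```

For the larger table this lands in the region of the seven hundred and forty hours quoted above, far beyond the five-day limit.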


Initial Synchronization – New challenge 2019

There are no fields to divide into groups!

I have one idea how to do this, but I'm not sure about it. The only way is to split the initial synchronization into sub-threads. But the problem is that these tables have no fields that could be used to divide the records into groups. Therefore, I decided to split the table using ROWID. The good news is that these tables are stored in their own Type II areas. So after the conference I will try to do this, and I’ll tell you about the result next year in Vienna. Ok?
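The talk does not show how such a split would be done, so the following is only an illustration of the general idea, written in Python. It assumes the record identifiers of a table can be treated as one contiguous integer range. `split_range` is a hypothetical helper that cuts the range into near-equal parts, one per sub-thread.

```python
def split_range(first_id: int, last_id: int, parts: int) -> list:
    """Split the inclusive range [first_id, last_id] into up to `parts` contiguous chunks."""
    total = last_id - first_id + 1
    base, extra = divmod(total, parts)
    ranges = []
    start = first_id
    for i in range(parts):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first chunks
        if size == 0:
            continue
        ranges.append((start, start + size - 1))
        start += size
    return ranges
```

Each sub-thread would then synchronize only the records whose identifier falls inside its own chunk. In a real table the identifiers are not densely populated, so the chunks would not carry exactly equal numbers of records.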


Pro2Oracle - what did we encounter?

▪ 64bit ROWID

Invalid ROWID in replqueue.srcrecord.

Aborting replication of this record. Sequence: 578,402,107 DB=db1 Table=table1

Rowid=0x10f13fb72 TnsID: 1,420,550,509

Must be: 0x000000010f13fb72

Recid: 4 547 935 090

So, let's move on. Big Data is a big problem. Laws that worked for small databases eventually stop working for Big Data.

What we didn't expect was a problem with the sixty-four-bit ROWID. The CDC process programs used an incorrect ROWID format after converting from a sixty-four-bit RECID.

But after Pro2 developers fixed it, everything worked well.
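The two ROWID strings on the slide differ only in zero padding. The short Python check below uses the RECID value from the slide. The helper name `rowid_hex` is hypothetical. It shows that the short form is the plain hexadecimal value of the RECID, while the expected form is the same value padded to sixteen hex digits.

```python
def rowid_hex(recid: int, digits: int = 16) -> str:
    """Format a RECID as a zero-padded hexadecimal string with a 0x prefix."""
    return "0x" + format(recid, f"0{digits}x")


recid = 4_547_935_090          # "Recid" from the slide; larger than 2**32
short_form = hex(recid)        # unpadded form, as in the error message
padded_form = rowid_hex(recid) # sixteen hex digits, as in "Must be"
print(short_form, padded_form)
```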


Pro2Oracle - what did we encounter?

▪ Replication Triggers, No Compression

The next problem we had was with replication triggers.

The slide shows an example graph of the replication queue. This line is produced by the standard replication trigger, without compression and without multithreading. Note that we are constantly in the red zone. This means that the size of the replication queue was always more than ten thousand change notes, and on the slide we see several million notes in the queue.

Why is this happening?


Pro2: How Repl Trigger works

[Diagram: OE Source Table -> CUD -> OE Repl Trigger (create change note) -> Pro2: ReplQueue holding Rec#1 Note #1, Rec#2 Note #1, Rec#3 Note #1]

The replication trigger fires every time a change is made to a record and creates a change note in the Pro2 replication queue.


Pro2: How Repl Trigger works

[Diagram: CUD -> OE Repl Trigger (create change note) -> Pro2: ReplQueue holding Rec#1 Note #1, Rec#2 Note #1, Rec#3 Note #1. Pro2: Repl Batch #1 (ETL): #1 Read Change Note from the ReplQueue, #2 Read Source Record, #3 Create/Write/Update Target Record in the Ora Target Table]

After the change information has been queued, Pro2 must process it to send the change to Oracle.

However, Big Data is a big problem. We found that in some tables the same record can change very often in a short time, as much as ten times a minute.


Pro2: How Repl Trigger works

[Diagram: CUD -> OE Repl Trigger (create change note) -> Pro2: ReplQueue holding Rec#1 Note #1, Rec#1 Note #2, Rec#1 Note #3, Rec#1 Note #4, Rec#1 Note #5, ... Rec#1 Note #N. Pro2: Repl Batch #1 (ETL): #1 Read Change Note from the ReplQueue, #2 Read Source Record, #3 Create/Write/Update Target Record in the Ora Target Table]

As a result, the replication trigger created a new note in the replication queue for every change to the same record. The replication queue therefore grew very quickly, and the replication processes could not keep up with this amount of changes. We saw gaps of hours between Oracle and OpenEdge.


Pro2: How Repl Trigger with Compression works

[Diagram: CUD -> OE Repl Trigger (Check && Update || Create) -> Pro2: ReplQueue holding Rec#1 Note #1, Rec#2 Note #1, Rec#3 Note #1. Pro2: Repl Batch #1 (ETL): #1 Read Change Note from the ReplQueue, #2 Read Source Record, #3 Create/Write/Update Target Record in the Ora Target Table]

To solve this problem, we started using compression triggers.

The idea is that when a replication trigger fires, it first checks whether the replication queue already holds a change note for that record that has not yet been applied.

If such a note still exists, the trigger just updates it; otherwise it creates a new one.
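The difference between the two trigger behaviors can be modelled in a few lines. This is a toy model in Python, not Pro2 code. The class and field names are invented, and the queue is just an in-memory list. With `compress=False` every change appends a note. With `compress=True` a pending note for the same record is updated instead.

```python
class ReplQueue:
    """Toy replication queue keyed by (table, record id)."""

    def __init__(self, compress: bool):
        self.compress = compress
        self.notes = []      # pending change notes in arrival order
        self._pending = {}   # (table, rec_id) -> latest pending note

    def record_change(self, table: str, rec_id: int) -> None:
        """Called for every create/update/delete of a source record."""
        key = (table, rec_id)
        note = self._pending.get(key) if self.compress else None
        if note is not None:
            note["changes"] += 1  # check succeeded: update the existing note
            return
        note = {"table": table, "rec_id": rec_id, "changes": 1}
        self.notes.append(note)   # no pending note: create a new one
        self._pending[key] = note

    def pop_next(self) -> dict:
        """Take the oldest note for processing; later changes get a fresh note."""
        note = self.notes.pop(0)
        key = (note["table"], note["rec_id"])
        if self._pending.get(key) is note:
            del self._pending[key]
        return note


plain = ReplQueue(compress=False)
packed = ReplQueue(compress=True)
for _ in range(10):  # the same record changes ten times
    plain.record_change("table1", 42)
    packed.record_change("table1", 42)
print(len(plain.notes), len(packed.notes))
```

Compression works in this model because the replication process reads the current source record when it handles a note, so one pending note per record is enough.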


Pro2Oracle - what did we encounter?

▪ Replication Triggers, Compression

As a result, the size of the replication queue was reduced tenfold!

But! We're still in the yellow zone. And we still have the gap between Oracle and OpenEdge.

And now I want to say Thanks to Pro2 developers for the third time!


Pro2: Repl Trigger + Splitting Replication Threads

[Diagram: CUD -> OE Repl Trigger (Check && Update || Create, plus Modulo of RECID) -> Pro2: ReplQueue holding Rec#1 Thr #01, Rec#2 Thr #11, Rec#3 Thr #12, Rec#6 Thr #19. Pro2: Repl Batch #1, #11, #12 and #19 (ETL), each doing: #1 Read Change Note, #2 Read Source Record, #3 Create/Write/Update Target Record in the Ora Target Table]

They proposed dividing the main Repl Batch thread into sub-threads.

For example, we have the replication thread with number one. Now we can divide this thread into up to ten sub-threads. How does it work?

The trigger determines the number of sub-threads from a special property that we specify.

Then the trigger calculates the modulo of the RECID and uses the result to decide whether to send the note to an alternate thread.

As a result, we have achieved our goal!
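A minimal sketch of the modulo idea, in Python. The exact rule Pro2 uses is not shown in the talk. The numbering below (thread 1 plus sub-threads 11 to 19) is only an assumption modelled on the thread labels in the diagram, and `sub_thread` is a hypothetical function.

```python
from collections import Counter


def sub_thread(recid: int, base_thread: int = 1, sub_threads: int = 10) -> int:
    """Pick a replication thread for a record from the modulo of its RECID."""
    offset = recid % sub_threads
    if offset == 0:
        return base_thread            # stays on the main thread, e.g. #1
    return base_thread * 10 + offset  # alternate thread, e.g. #11 .. #19


# Consecutive RECIDs spread evenly over the ten threads
load = Counter(sub_thread(recid) for recid in range(1, 10_001))
print(sorted(load.items()))
```

Because the thread depends only on the RECID, all change notes for one record always land on the same thread, which keeps the changes to that record in order.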


Pro2Oracle - what did we encounter?

▪ Splitting Replication Threads

The slide shows a graph where we can see that after implementing Splitting Replication Threads with up to ten sub-threads, we moved to the green zone. Now all change notes are evenly distributed between these sub-threads, and the size of the replication queue for each thread has decreased significantly.

As a result, the gap between Oracle and OpenEdge was eliminated.


Pro2Oracle - what did we encounter?

▪ Splitting Replication Threads

This example graph clearly shows how the performance of Pro2 increased after the implementation of Compression triggers together with Split Replication Threads.


Pro2Oracle + OE CDC

1. OE CDC embedded into RDBMS engine.

2. Can be enabled online.

3. CDC policies can be activated online. (replication triggers can not)

4. Faster than standard replication triggers.

5. Captures ABL and SQL changes.

6. Component of the Advanced Enterprise Edition RDBMS

And finally about OpenEdge Change Data Capture.

The main advantage of OpenEdge CDC is that it is embedded into the RDBMS engine. And unlike replication triggers, its CDC policies can be activated online. Its internal database triggers work much faster than standard replication triggers.

And now we can capture both ABL and SQL changes. Standard replication triggers cannot do this.

CDC is also available as a component of the Advanced Enterprise Edition. This means that when the Advanced Enterprise RDBMS is installed, CDC is installed too.

Let's see how Pro2 works with OpenEdge CDC.


Overview of Data Capture and ETL Processes

[Diagram: (1) Data capture: record operations (create, update, delete, ...) on the Source Table (Customer, user data) fire the CDC Internal Database Triggers, which use the CDC Cache (CDC Policy Tables: _Cdc-Table-Policy, _Cdc-Field-Policy) and write to the Change Tracking Table (_Cdc-Change-Tracking: _Policy-Id, _Tran-id, _Time-Stamp, _Change-Seq, _Op) and the Change Table (CDC_Customer: meta-data columns, captured user data). (2) ETL: the Pro2 ETL Process moves the changes to the Data Warehouse]

This diagram provides an abstract look at the change data capture process.

The brackets, labeled 1 and 2 on the left, represent the span of the two high-level processes.

Notice that there is some overlap in the resources used by these two processes. This overlap is in the CDC Change Tracking Table and the CDC Change tables.

11/26/2019


Overview of Data Capture and ETL Processes

[Diagram, process 1 only: record operations (create, update, delete, ...) on the Source Table (Customer, user data) fire the CDC Internal Database Triggers, which use the CDC Cache (CDC Policy Tables: _Cdc-Table-Policy, _Cdc-Field-Policy) and write to the Change Tracking Table (_Cdc-Change-Tracking: _Policy-Id, _Tran-id, _Time-Stamp, _Change-Seq, _Op) and the Change Table (CDC_Customer: meta-data columns, captured user data)]

Let’s get started by focusing on the first process, Data Capture. This is the process that replaced replication triggers for us.

The diagram identifies several database tables.

In the next few slides I’ll describe the function of each of them.


OpenEdge Tables

Source tables: OpenEdge database user tables

▪ Identified in a CDC policy as the source of captured data

▪ Source tables exist prior to implementing a CDC policy

▪ Source tables are not changed when the CDC policy is implemented

Source Table (Customer)

User Data

Source tables are user tables of the source OpenEdge database. A user table becomes a source table as soon as a CDC policy is created for it.


CDC Policy Tables

▪ Policy tables are created when CDC is enabled on an OpenEdge database

CDC Table Policy Table – _Cdc-Table-Policy

▪ Contains one record for each policy defined on a source table

▪ There may be multiple policies defined for a source table (pending, current, previous)

▪ Relationship: A table policy record is parent of zero or more field policy records

CDC Field Policy Table – _Cdc-Field-Policy

▪ Contains one record for each tracked field of a CDC table policy

▪ Note if a table policy is set to level 0 there are no field policy records

CDC Policy Tables: _Cdc-Table-Policy, _Cdc-Field-Policy

There are two policy tables, the CDC Table Policy and the CDC Field Policy.

These tables are created when CDC is enabled on an OpenEdge database.

There is a table policy record for each policy created for a source table.

And there is a field policy record for each field that will be captured for a given policy.

Each table policy has a tracking depth, the CDC Level. When this level is set to 0, the data from the source table is not captured directly and no records are created in the Field Policy table at all.

The CDC level zero is what Pro2 uses in its work.


CDC Change Tracking Table

CDC Change Tracking Table : _Cdc-Change-Tracking

▪ Table created when CDC is enabled on the database

▪ Contains a record for each CDC tracked operation on a source table

▪ Operations are captured for the current policy on the source table when that policy is active

Change Tracking Table (_Cdc-Change-Tracking)

_Policy-Id _Tran-id _Time-Stamp _Change-Seq _Op

The next table is CDC Change Tracking. Like the CDC policy tables, this table is also created when CDC is enabled. A note about each tracked change is recorded in this table.

Pro2 works only with this table.


CDC Change Tables

CDC Change Tables :

▪ Contains change data as prescribed in the CDC policy tables

▪ Includes both meta data and user data (which are identified in the CDC policy table)

▪ A CDC change table is created the first time a policy with level 1 or higher is defined

▪ There is at most one CDC Change Table for a CDC source table

▪ May be associated with different CDC policies over the course of its lifetime

Change Table (CDC_Customer)

Meta-data Columns User Data (captured data)

And the last ones are the CDC change tables. These tables are created the first time a policy with a level greater than zero is defined for a source table.

But Pro2 does not use these tables, since its CDC level is zero.


Overview of Data Capture

[Diagram, process 1 for Pro2: record operations (create, update, delete, ...) on the Source Table (Customer, user data) fire the CDC Internal Database Triggers, which use the CDC Cache (CDC Policy Tables: _Cdc-Table-Policy, _Cdc-Field-Policy) and write to the Change Tracking Table (_Cdc-Change-Tracking: _Policy-Id, _Tran-id, _Time-Stamp, _Change-Seq, _Op)]

We’re now ready to look at the data capture process for Pro2.

This example is for one source table, Customer.

The Customer table has an active policy defined with a Level of zero.

When a change is made to the table, a CDC internal database trigger fires and looks at the policy tables that are held in memory in the CDC cache.

Based on the policy, a single record is written to the Change Tracking Table, much like the change note created by a replication trigger.


Data Capture and Pro2 ETL Processes

[Diagram, process 2: Change Tracking Table (_Cdc-Change-Tracking: _Policy-Id, _Tran-id, _Time-Stamp, _Change-Seq, _Op) -> CDCBatch Process #1, #2, ... #N -> Pro2 Repl Queue -> Pro2 ETL Process -> Data Warehouse]

Now we will look at the second process – ETL, which is implemented at the Pro2 side.

To process records from the Change Tracking Table, additional processes called CDCBatch were added to Pro2. These processes were designed to send change notes from OE CDC to the Pro2 replication queue.

After a note has been copied to the replication queue, it is deleted from the Change Tracking Table. This allows us to keep the Change Tracking Table small.
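The copy-then-delete step can be pictured with a small in-memory model. This Python sketch is not the CDCBatch program. The function name and the dictionary keys are invented for illustration and only loosely follow the _Change-Seq and _Op columns of the Change Tracking Table.

```python
from collections import deque


def drain_change_tracking(change_tracking: list, repl_queue: deque, batch_size: int = 1000) -> int:
    """Copy the oldest notes to the replication queue, then delete them at the source.

    Returns the number of notes moved in this pass.
    """
    batch = sorted(change_tracking, key=lambda note: note["change_seq"])[:batch_size]
    for note in batch:
        repl_queue.append(dict(note))  # copy to the Pro2 side first
        change_tracking.remove(note)   # then delete from the tracking table
    return len(batch)


tracking = [
    {"change_seq": 3, "table": "Customer", "recid": 101, "op": "update"},
    {"change_seq": 1, "table": "Customer", "recid": 100, "op": "create"},
    {"change_seq": 2, "table": "Customer", "recid": 100, "op": "update"},
]
queue = deque()
moved = drain_change_tracking(tracking, queue, batch_size=2)
print(moved, len(tracking), len(queue))
```

Processing in `change_seq` order keeps the notes in the order in which the changes happened, and deleting after copying is what keeps the tracking table small.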


Pro2Oracle + OE CDC - what did we encounter?

There is no internal compression for CDC Level 0!

[Diagram: OE CDC _Cdc-Change-Tracking holds Rec#1 Note #1, Note #2, Note #3, Note #4, Note #5, ... Note #N. The CDC Batches copy every note unchanged into Pro2: ReplQueue, which then holds the same Rec#1 Note #1 ... Note #N]

But the problem we faced here is that OpenEdge CDC has no internal compression for change tracking at CDC Level zero.

And, as with replication triggers without compression, if the same source record changes very often, the size of the internal _Cdc-Change-Tracking table will grow.

And the larger this table is, the longer the CDCBatch process will run.

Also, in the first implementation of the CDCBatch process, it was not possible to copy data to the replication queue in compressed mode. As a result, the size of the replication queue also grew significantly. There was also no Split Replication Threads.

As a result, we returned to the same problems we had with replication triggers in the beginning.


Pro2Oracle + OE CDC

Solution: CDCBatch with Compression & Split Replication Threads

[Diagram: OE CDC _Cdc-Change-Tracking holds Rec#1 Note #1, Note #2, Note #3, Note #4, Note #5, ... Note #N. The CDC Batches (Check && Update || Create, plus Modulo of RECID) write to Pro2: ReplQueue, which holds Rec#1 Thr #01, Rec#2 Thr #11, Rec#3 Thr #12, Rec#6 Thr #19]

And this was considered in Pro2.

The CDCBatch process programs have been redesigned, and now change notes are sent to the replication queue in compression mode, with the option of Split Replication Threads.

But these processes are still not fast enough. I hope that in future Pro2 versions this will be improved.

We also hope that Progress developers will add to the OpenEdge CDC the ability to compress records in the Change Tracking Table when the policy level is zero.


You need to implement Pro2?

Just ask me:

[email protected]

Find me:

Progress - OpenEdge Zone

https://www.facebook.com/groups/ProgressOEZone/

This concludes my story about my experience of implementing Pro2 for Big Data in OpenEdge. Unfortunately, it is impossible to cover much in a short time. I even think that a separate training is needed.

Pro2 is constantly evolving, and we hope that in the new versions we will get even more advantages in performance and management of our Big Data.

If you liked my story, if you have any questions, and, of course, if you want to implement Pro2 on your system, then you can email me. I am sure I can help you.

Thank you for your attention and see you soon!

CDC Level Description

Minimal (0) - This level indicates a change occurred. No record values will be recorded.

Minimal with Field map (1) - Similar to the Minimal (0) level, this level indicates a change occurred, but also includes a field map value indicating which fields changed. No record values will be recorded.


Medium (2) - This level records the current (after) value of all CUD operations.

Maximum (3) - This level records both the previous (before) and current (after) values of all CUD operations.
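The four levels can be summarized as a small enumeration. This is only a reading aid written in Python. The names are taken from the descriptions above, while the two helper functions are hypothetical and encode just what the slides state: record values are captured from level 2 upward, and a change table exists for level 1 or higher.

```python
from enum import IntEnum


class CdcLevel(IntEnum):
    MINIMAL = 0                 # a change occurred; no record values
    MINIMAL_WITH_FIELD_MAP = 1  # plus a field map of which fields changed
    MEDIUM = 2                  # current (after) values
    MAXIMUM = 3                 # previous (before) and current (after) values


def records_values(level: CdcLevel) -> bool:
    """True if the level captures record values, not just the fact of a change."""
    return level >= CdcLevel.MEDIUM


def uses_change_table(level: CdcLevel) -> bool:
    """True if a CDC change table is created for a policy at this level."""
    return level >= CdcLevel.MINIMAL_WITH_FIELD_MAP


pro2_level = CdcLevel.MINIMAL  # the level Pro2 works with, per the talk
print(records_values(pro2_level), uses_change_table(pro2_level))
```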
