
Keeping Data in Sync with Syncsort


Page 1: Keeping Data in Sync with Syncsort

What’s New and Performance Tips

Paige Roberts, Big Data Product Marketing Manager

Ashwin Ramachandran, Big Data Product Manager

Page 2: Keeping Data in Sync with Syncsort

Agenda

What’s New and Coming Soon in Big Data

• What’s New in DMX/DMX-h version 9.5

• New Product: DMX Change Data Capture – Now GA in version 9.5!

• DataFunnel GUI – Now in beta!

• Lineage

• Big Data Quality

• DMX CDC and MIMIX Share

Strategies for Change Data Capture

• Advantages and Disadvantages of Various Strategies

– Versions, Dates

– Triggers

– Snapshot

– Log

How to Do Change Data Capture with Syncsort Software

• Snapshot-Based CDC with DMX/DMX-h

• Log-Based CDC with DMX Change Data Capture

Where to Find More Info on CDC

Syncsort Confidential and Proprietary - do not copy or distribute

Page 3: Keeping Data in Sync with Syncsort

WHAT’S NEW IN DMX/DMX-H


Page 4: Keeping Data in Sync with Syncsort

Combine batch and streaming data sources

Single Interface for Streaming & Batch

Spark 2!

Easy development in GUI. No need to write Scala, C or Java code.

Now supports cluster mode!


Simplify Streaming Data Integration


Page 5: Keeping Data in Sync with Syncsort

Progress Monitoring

Track the progress of DMX/DMX-h jobs as they’re running!

Settable time intervals

See exactly how fast jobs are running

Know how much memory and CPU jobs use at any point

Know when there’s a problem, even in the middle of long-running jobs


C:\PROGRAM FILES\DMEXPRESS\PROGRAMS\dmsmonitor.exe /jobid J_readVSAM_20171006_001743_13572 /task T_readVSAM /interactive 2 /logdir .

Timestamp: 2017-10-06 00:19:09

Status: RUNNING for 00:01:28

User: aramachandran

Data directory: C:\Users\aramachandran\Documents\Projects\CompanyName\VSAM_test

Memory: 32MB

CPU: 12%

/MVS/WWCDMX/AZR.VSM (Source): 7689557 records [1689372 records/sec], 246065824 bytes [5405992 bytes/sec]

Vsam_out.dat (Target): 7685704 records [1687590 records/sec], 245942528 bytes [54002880 bytes/sec]

C:\PROGRAM FILES\DMEXPRESS\PROGRAMS\dmsmonitor.exe /jobid J_readVSAM_20171006_001743_13572 /task T_readVSAM /interactive 2 /logdir .

Timestamp: 2017-10-06 00:19:11

Status: RUNNING for 00:01:30

User: aramachandran

Data directory: C:\Users\aramachandran\Documents\Projects\CompanyName\VSAM_test

Memory: 32MB

CPU: 12%

/MVS/WWCDMX/AZR.VSM (Source): 10718776 records [1514609 records/sec], 343000832 bytes [48467504 bytes/sec]

Vsam_out.dat (Target): 10716748 records [1515522 records/sec], 342935936 bytes [48496704 bytes/sec]
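The per-file lines in the monitor output above follow a regular pattern, so they can be parsed for alerting or charting. The following is a minimal Python sketch, assuming the line layout stays exactly as in the sample. The regular expression and the `parse_monitor_line` helper are illustrative names, not part of DMX, and the layout is not a documented format.

```python
import re

# Pattern for one per-file line of the sample monitor output shown above.
# Assumption: the layout is stable; this is not a documented DMX format.
LINE_RE = re.compile(
    r"^(?P<name>.+) \((?P<role>Source|Target)\): "
    r"(?P<records>\d+) records \[(?P<rps>\d+) records/sec\], "
    r"(?P<bytes>\d+) bytes \[(?P<bps>\d+) bytes/sec\]$"
)

def parse_monitor_line(line):
    """Return a dict of counters for one Source/Target line, or None."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    out = {"name": m["name"], "role": m["role"]}
    for key in ("records", "rps", "bytes", "bps"):
        out[key] = int(m[key])
    return out

sample = ("/MVS/WWCDMX/AZR.VSM (Source): 10718776 records "
          "[1514609 records/sec], 343000832 bytes [48467504 bytes/sec]")
stats = parse_monitor_line(sample)
# Average record size in bytes for this file.
print(stats["role"], stats["bytes"] / stats["records"])
```

Lines that do not describe a source or target, such as the CPU and memory lines, return `None` and can be handled separately.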

Page 6: Keeping Data in Sync with Syncsort

Access and Integration of Mainframe Data … We’re Simply the Best


Save MIPS by processing mainframe data on Hadoop

Read and write Mainframe record formats

– Fixed record length, variable record length, & variable record length with block descriptor

– Handle complex array structures such as OCCURS DEPENDING ON (ODO) clauses, even nested ones

– Interpret complex copybooks automatically
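To make the fixed record length case concrete, here is a minimal Python sketch of splitting and decoding EBCDIC records. It assumes code page cp037 and a made-up 10-byte layout (a 6-character name and a 4-digit unsigned amount). Real layouts come from the copybook, and `read_fixed_records` is a hypothetical helper, not how DMX reads the data.

```python
# Assumptions: code page cp037 (US EBCDIC) and a made-up 10-byte layout:
# a 6-character name followed by a 4-digit amount with no sign overpunch.
RECORD_LENGTH = 10

def read_fixed_records(raw, record_length=RECORD_LENGTH):
    """Split a byte buffer into fixed-length records and decode each one."""
    if len(raw) % record_length != 0:
        raise ValueError("buffer is not a whole number of records")
    records = []
    for offset in range(0, len(raw), record_length):
        text = raw[offset:offset + record_length].decode("cp037")
        records.append({"name": text[:6].rstrip(), "amount": int(text[6:])})
    return records

# Build sample EBCDIC bytes by encoding text, then read them back.
raw = "SMITH 0042JONES 0007".encode("cp037")
print(read_fixed_records(raw))
```

Variable-length records and packed or signed numeric fields need extra handling that this sketch leaves out.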

Write files to local or remote open systems via FTP, SFTP, Connect:Direct or HDFS

– Connect to external mainframe metadata like copybooks right on the mainframe with Connect:Direct

Store an unmodified archive copy for compliance and lineage tracking

Page 7: Keeping Data in Sync with Syncsort

Hive Enhancements

Improvements to Hive support

JDBC connectivity

Support for partitioned tables: ORC, Parquet, Avro, HDFS

Support for Truncate and Insert

Automatic creation of Hive and other HCat-supported tables

Direct distributed processing of Hive

Update of Hive statistics

Use Hive tables for lookups


Page 8: Keeping Data in Sync with Syncsort

Keybreak Processing Made Easy


• Running Totals

• Counters

• Group Numbering
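For readers unfamiliar with keybreak processing, the three outputs listed above can be sketched in plain Python. This is only an illustration of the technique on sorted input, with hypothetical field names (`region`, `sales`); it does not show how DMX implements or configures it.

```python
from itertools import groupby
from operator import itemgetter

def keybreak(rows, key="region", value="sales"):
    """Yield each row with group number, in-group counter, and running total.

    Rows must already be sorted by the key, as in classic keybreak processing.
    """
    grouped = groupby(rows, key=itemgetter(key))
    for group_no, (_, group) in enumerate(grouped, 1):
        running = 0  # reset at every key break
        for counter, row in enumerate(group, 1):
            running += row[value]
            yield {**row, "group_no": group_no, "counter": counter,
                   "running_total": running}

rows = [
    {"region": "East", "sales": 10},
    {"region": "East", "sales": 5},
    {"region": "West", "sales": 7},
]
for out in keybreak(rows):
    print(out)
```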

Page 9: Keeping Data in Sync with Syncsort

DATAFUNNEL


Page 10: Keeping Data in Sync with Syncsort

Get Your Database data into Hadoop, At the Press of a Button

• Funnel hundreds of tables at once into your data lake

‒ Extract, map and move whole DB schemas in one invocation

‒ Extract from Oracle, DB2/z, MS SQL Server, Teradata, Netezza and Redshift

‒ To SQL Server, Postgres, Hive, HDFS, Redshift and Amazon S3

‒ Automatically create target Hive and HCat tables

• Process multiple funnels in parallel on edge node or data nodes

‒ Order data flows by dependencies

‒ Leverage DMX-h high performance data processing engine

• Extract only the data you want

‒ Data type filtering

‒ Table, record or column exclusion / inclusion

• In-flight transformations and cleansing

• User specified access methods: Native, ODBC or JDBC


DMX DataFunnel™

Move thousands of tables in days, not weeks!

Page 11: Keeping Data in Sync with Syncsort

New User Experience for DataFunnel


Page 12: Keeping Data in Sync with Syncsort

New UI Wizard Flow Creation


Page 13: Keeping Data in Sync with Syncsort

LINEAGE


Page 14: Keeping Data in Sync with Syncsort

Integration with Cloudera Navigator from Source to Cluster


Page 15: Keeping Data in Sync with Syncsort

BIG DATA QUALITY


Page 16: Keeping Data in Sync with Syncsort

First, DMX is configured to access and ingest data from a JSON source.

Second, DMX ingests data from a mainframe in EBCDIC format.

Finally, DMX ingests data from an XML source.

All of these source files have different field structures.

DMX then merges these files into one consistent format.

At the same stage, DMX produces two exports:

• one simple text/CSV output

• a first write to a Hive database

DMX then invokes TSS to perform the Data Quality processing.
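The idea of merging differently structured sources into one consistent format can be sketched in Python. Everything here is an assumption made for illustration: the field names, the EBCDIC layout (a 4-character id and a 12-character name in code page cp037), and the helper functions. None of it reflects how DMX is configured.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# All field names and layouts below are hypothetical sample data.
def from_json(text):
    rec = json.loads(text)
    return {"id": str(rec["customerId"]), "name": rec["fullName"].strip()}

def from_xml(text):
    node = ET.fromstring(text)
    return {"id": node.findtext("id"), "name": node.findtext("name").strip()}

def from_ebcdic(raw):
    # Assumed layout: 4-character id, then a 12-character name, cp037.
    text = raw.decode("cp037")
    return {"id": text[:4].lstrip("0"), "name": text[4:16].strip()}

merged = [
    from_json('{"customerId": 1, "fullName": "Ada Lovelace"}'),
    from_ebcdic(("0002" + "Alan Turing".ljust(12)).encode("cp037")),
    from_xml("<cust><id>3</id><name>Grace Hopper</name></cust>"),
]

# Write the unified records as simple CSV text.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name"])
writer.writeheader()
writer.writerows(merged)
print(buf.getvalue())
```

Each source gets its own small adapter that maps to the shared `id` and `name` fields, so every downstream step sees a single record shape.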

Page 17: Keeping Data in Sync with Syncsort

Trillium Quality for Big Data

Easily Create Data Quality Workflows Without MapReduce or Spark Coding

Intelligent Execution enables deployment to Hadoop MapReduce and Spark

Verify and enrich global postal addresses using global postal reference sources

Enrich data from external, third-party sources to create comprehensive, unified records, enabling 360-degree views of the customer and other key business entities

Identify records that belong to the same domain (i.e., household or business)

Parse data values to their correct fields and standardize for better matching

Match like records and eliminate duplicates
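The standardize-then-match step can be illustrated with a small Python sketch. It uses exact matching on a standardized key, which is a deliberate simplification and an assumption; data quality products typically use richer parsing and fuzzy matching, and nothing here describes Trillium's actual algorithms. The field names are hypothetical.

```python
import re

def standardize(record):
    """Trim, collapse whitespace, and normalize case so like values compare equal."""
    name = re.sub(r"\s+", " ", record["name"]).strip().lower()
    postcode = record["postcode"].replace(" ", "").upper()
    return (name, postcode)

def deduplicate(records):
    """Keep the first record seen for each standardized match key."""
    seen = {}
    for record in records:
        seen.setdefault(standardize(record), record)
    return list(seen.values())

records = [
    {"name": "  Jane   DOE", "postcode": "ab1 2cd"},
    {"name": "jane doe", "postcode": "AB12CD"},
    {"name": "John Roe", "postcode": "ZZ9 9ZZ"},
]
print(deduplicate(records))
```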

Page 18: Keeping Data in Sync with Syncsort

DMX CHANGE DATA CAPTURE


Page 19: Keeping Data in Sync with Syncsort

Keep Mainframe and Hadoop Data in Sync in Real-Time

Keeps Hadoop data in sync with mainframe changes in real-time

• without overloading networks

• without incurring a high MIPS cost

• without affecting source database performance

• without coding or tuning

Dependable – Reliable transfer of data even during loss of mainframe connection or Hadoop cluster failure. Processing continues from the failure point.

Fast – Both Hive data and table statistics are updated in real-time. Performs fast updates and inserts, even on Hive tables that don’t natively support them.

Flexible – Works with all Hive tables, including those backed by text, ORC, Parquet or Avro.

Diagram labels: DB2, DMX Change Data Capture.

Page 20: Keeping Data in Sync with Syncsort

MIMIX Share Replicates Data in Real Time

Transforms and enhances data during replication

Minimizes bandwidth usage with LAN/WAN friendly replication

Ensures data integrity with conflict resolution and collision monitoring

Enables tracking and auditing of transactions for compliance

Diagram: Source Database to Target Database, via Change Data Capture (CDC), Real-Time Replication with Transformation, and Conflict Resolution, Collision Monitoring, Tracking and Auditing.

Page 21: Keeping Data in Sync with Syncsort

STRATEGIES FOR CHANGE DATA CAPTURE


Page 22: Keeping Data in Sync with Syncsort

Why Do Change Data Capture?

Change Data Capture (CDC) is the process of ensuring that changes made over time in one dataset are automatically transferred to another dataset.

Common data management scenarios where CDC is important:

Enterprise Data Warehouse (EDW)

Business Intelligence (BI)

EDW and/or Mainframe Optimization

Master Data Management

Data Quality


Page 23: Keeping Data in Sync with Syncsort

Different CDC Strategies

Timestamps or Version Numbers

Table Triggers

Snapshot or Table Comparison

Log Scraping


Page 24: Keeping Data in Sync with Syncsort

Advantages and Disadvantages of Timestamp or Version-Based CDC

Advantages

Simple

Nearly every database can filter on a timestamp or version column with a WHERE clause.

Disadvantages

Must be built into database

Bloats database size

Query requires considerable compute resources in source database

Not always reliable
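A timestamp-based extract comes down to a WHERE clause against a stored watermark. The sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table, the `last_modified` column, and the `changed_since` helper are hypothetical, chosen only to illustrate the strategy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, last_modified TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 10.0, "2017-10-01 09:00:00"),
    (2, 25.0, "2017-10-05 14:30:00"),
    (3, 40.0, "2017-10-06 00:19:09"),
])

def changed_since(conn, watermark):
    """Return rows modified after the last successful extract."""
    cur = conn.execute(
        "SELECT id, amount, last_modified FROM orders "
        "WHERE last_modified > ? ORDER BY last_modified", (watermark,))
    return cur.fetchall()

previous_watermark = "2017-10-02 00:00:00"
changes = changed_since(conn, previous_watermark)
print(changes)
# The new watermark is the latest timestamp seen in this batch.
watermark = changes[-1][2] if changes else previous_watermark
```

The sketch also shows why this strategy is not always reliable: a hard-deleted row leaves nothing for the query to find, and any update that does not touch `last_modified` is missed.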

Page 25: Keeping Data in Sync with Syncsort

Advantages and Disadvantages of Trigger-Based CDC

Advantages

Very reliable and detailed

Changes can be captured almost as fast as they are made: real-time CDC.

Disadvantages

Significant drag on database resources, both compute and storage.

Requires that the database have the capability.

Negative impact on performance of applications that depend on the source database.
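To show what trigger-based capture looks like, here is a minimal sketch using SQLite triggers through Python's `sqlite3` module. The table names and the one-letter operation flags are assumptions for illustration. Trigger syntax differs between databases, so treat this as an example of the pattern and not as portable SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL);
CREATE TABLE accounts_changes (
    seq INTEGER PRIMARY KEY AUTOINCREMENT, op TEXT, id INTEGER, balance REAL);
CREATE TRIGGER accounts_ins AFTER INSERT ON accounts BEGIN
    INSERT INTO accounts_changes (op, id, balance)
    VALUES ('I', NEW.id, NEW.balance);
END;
CREATE TRIGGER accounts_upd AFTER UPDATE ON accounts BEGIN
    INSERT INTO accounts_changes (op, id, balance)
    VALUES ('U', NEW.id, NEW.balance);
END;
CREATE TRIGGER accounts_del AFTER DELETE ON accounts BEGIN
    INSERT INTO accounts_changes (op, id, balance)
    VALUES ('D', OLD.id, OLD.balance);
END;
""")

# Every write to the source table now also writes a change record.
conn.execute("INSERT INTO accounts VALUES (1, 100.0)")
conn.execute("UPDATE accounts SET balance = 80.0 WHERE id = 1")
conn.execute("DELETE FROM accounts WHERE id = 1")
log = conn.execute(
    "SELECT op, id, balance FROM accounts_changes ORDER BY seq").fetchall()
print(log)
```

Each source write triggers a second write into the change table, which is where the extra compute and storage cost listed above comes from.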

Page 26: Keeping Data in Sync with Syncsort

Advantages and Disadvantages of Snapshot-Based CDC

Advantages

Relatively easy to implement with good ETL software.

Requires no specialized knowledge of the source database.

Very dependable and accurate.


Disadvantages

Requires repeatedly moving all data in monitored tables. May impact target or staging system resources and network bandwidth.

Moving lots of data can be slow and may not meet SLAs.

Joining, comparing, and finding changes takes additional time on top of the data movement.

Not a complete record of intermediate changes between snapshot captures.
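The comparison at the heart of snapshot-based CDC can be sketched as a diff of two keyed snapshots. This is a minimal in-memory Python illustration with hypothetical data and flag letters. A production tool would use a sorted or hash join over far larger data.

```python
def diff_snapshots(previous, current):
    """Compare two snapshots keyed by primary key and return flagged changes."""
    changes = []
    for key, row in current.items():
        if key not in previous:
            changes.append(("I", key, row))      # new key: insert
        elif previous[key] != row:
            changes.append(("U", key, row))      # same key, new values: update
    for key, row in previous.items():
        if key not in current:
            changes.append(("D", key, row))      # key disappeared: delete
    return changes

previous = {1: {"city": "Boston"}, 2: {"city": "Austin"}, 3: {"city": "Denver"}}
current = {1: {"city": "Boston"}, 2: {"city": "Dallas"}, 4: {"city": "Miami"}}
print(diff_snapshots(previous, current))
```

If row 2 had changed twice between the two snapshots, only the final value would appear, which is the missing record of intermediate changes noted above.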

Page 27: Keeping Data in Sync with Syncsort

Advantages and Disadvantages of Log-Based CDC

Advantages

Very reliable and detailed.

Virtually no impact on database or application performance.

Changes captured in real-time.

No database bloat.


Disadvantages

Every RDBMS has a different log format, often not documented.

Log formats often change between RDBMS versions.

Log files are frequently archived by the database. CDC software must read them before they’re archived, or be able to go read the archived logs.

Requires specialized CDC software. Cannot be easily accomplished with ETL software.

Page 28: Keeping Data in Sync with Syncsort

TWO WAYS SYNCSORT DOES CDC


Page 29: Keeping Data in Sync with Syncsort

How Change Data Capture in DMX/DMX-h Works – Snapshot-based CDC

1. Capture: DMX or DMX-h pulls all data from the tables that are being monitored for change. The Syncsort high-performance engine joins the new data with the previous snapshot and finds the changes.

2. Process: On an edge node in DMX-h, a CDC Reader consumes a single raw data stream of the delta data, and splits it into parallel load streams for the cluster.

3. Apply: DMX-h applies the changes to Hive tables, and updates Hive statistics to facilitate queries on the new data.

Diagram labels: Source Database, Edge Node or Server, Staged Data, Snapshot.
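The Process step splits one delta stream into parallel load streams. The slides do not say how the CDC Reader does this, so the sketch below assumes one common approach, hash partitioning on the record key, purely as an illustration. `split_into_streams` is a hypothetical helper.

```python
import zlib

def split_into_streams(changes, stream_count):
    """Route each change to a stream by hashing its key.

    All changes for one key land in the same stream, so their order is kept.
    """
    streams = [[] for _ in range(stream_count)]
    for op, key, row in changes:
        # crc32 gives a hash that is stable across runs, unlike hash() on str.
        index = zlib.crc32(str(key).encode("utf-8")) % stream_count
        streams[index].append((op, key, row))
    return streams

changes = [
    ("I", 1, {"city": "Boston"}),
    ("U", 1, {"city": "Dallas"}),
    ("I", 2, {"city": "Miami"}),
    ("D", 3, {"city": "Denver"}),
]
for number, stream in enumerate(split_into_streams(changes, 2)):
    print(number, stream)
```

Routing by key means an insert and a later update for the same row are never applied out of order by different loaders.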

Page 30: Keeping Data in Sync with Syncsort

How DMX Change Data Capture Works – Log-based CDC

1. Capture: DMX CDC engine scrapes the DB2 logs and stores only the delta, the data that has changed, and flags it as Updated, Deleted or Inserted. Virtually no MIPS usage.

2. Process: On an edge node in DMX-h, a CDC Reader consumes a single raw data stream of the delta data, and splits it into parallel load streams for the cluster.

3. Apply: DMX-h applies the changes to Hive tables, and updates Hive statistics to facilitate queries on the new data.
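The Apply step consumes records flagged as Inserted, Updated or Deleted. As a rough illustration of that logic, the Python sketch below applies such a delta to an in-memory target keyed by primary key. The one-letter flags and the `apply_changes` helper are assumptions, and a dictionary stands in for a Hive table.

```python
def apply_changes(target, changes):
    """Apply Inserted/Updated/Deleted records to a target keyed by primary key."""
    for op, key, row in changes:
        if op in ("I", "U"):
            target[key] = row          # insert or overwrite (upsert)
        elif op == "D":
            target.pop(key, None)      # ignore deletes for missing keys
        else:
            raise ValueError(f"unknown change flag: {op!r}")
    return target

target = {1: {"balance": 100}, 2: {"balance": 50}}
delta = [
    ("U", 1, {"balance": 80}),
    ("D", 2, {"balance": 50}),
    ("I", 3, {"balance": 10}),
]
print(apply_changes(target, delta))
```

Treating inserts and updates as the same upsert makes the apply step safe to repeat after a restart from a failure point.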

Page 31: Keeping Data in Sync with Syncsort

What Next?


Find out more about DMX Change Data Capture

http://www.syncsort.com/en/Products/BigData/DMX-Change-Data-Capture

Contact Syncsort sales to get the latest info: http://www.syncsort.com/en/ContactSales

Page 32: Keeping Data in Sync with Syncsort

Questions
