Best Practices For Loading Data To Distributed Systems With Change Data Capture

Alexey Goncharuk

2020 © GridGain Systems

Change Data Capture In The Wild

Agenda

● What is CDC?

● What can I do with CDC?

● What is available in Ignite / GridGain?

What is Change Data Capture?

What is CDC?

● Have a data set of arbitrary size

● Determine what records changed since a given moment

● Many ways to achieve this...

Record Change Markers

● Timestamps

● Versions

● Statuses

● Attached to application data model

ID    …    UPDATE_TS
1     …    2020-01-10 00:01:02.000
2     …    2020-01-09 11:01:02.000
3     …    2019-10-09 18:36:13.000
4     …    2020-02-01 01:02:03.000
10    …    2020-02-04 11:12:04.000

After records 4 and 10 are updated:

ID    …    UPDATE_TS
1     …    2020-01-10 00:01:02.000
2     …    2020-01-09 11:01:02.000
3     …    2019-10-09 18:36:13.000
4     …    2020-02-12 23:59:59.000
10    …    2020-02-12 14:00:00.000

SELECT * FROM Table WHERE UPDATE_TS > '2020-02-12 00:00:00.000'

ID    …    UPDATE_TS
1     …    2020-01-10 00:01:02.000
2     …    2020-01-09 11:01:02.000
3     …    2019-10-09 18:36:13.000
4     …    2020-02-12 23:59:59.000    ← returned by the query
10    …    2020-02-12 14:00:00.000    ← returned by the query
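The timestamp query above can be tried end to end with a small sketch. It assumes an in-memory SQLite database and a hypothetical table named `items` (the slide only calls it `Table`); the `changed_since` helper and the watermark handling are illustrative, not part of the talk.

```python
import sqlite3

# Hypothetical table mirroring the slide: an ID plus an UPDATE_TS change marker.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, update_ts TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO items (id, update_ts) VALUES (?, ?)",
    [
        (1, "2020-01-10 00:01:02.000"),
        (2, "2020-01-09 11:01:02.000"),
        (3, "2019-10-09 18:36:13.000"),
        (4, "2020-02-12 23:59:59.000"),
        (10, "2020-02-12 14:00:00.000"),
    ],
)


def changed_since(connection, watermark):
    """Return (id, update_ts) rows modified after the watermark, oldest first.

    Fixed-width timestamps compare correctly as text. Without an index on
    update_ts this query is a full scan of the table.
    """
    cursor = connection.execute(
        "SELECT id, update_ts FROM items WHERE update_ts > ? ORDER BY update_ts",
        (watermark,),
    )
    return cursor.fetchall()


watermark = "2020-02-12 00:00:00.000"
changes = changed_since(conn, watermark)
if changes:
    # Advance the watermark so the next poll only sees newer changes.
    watermark = changes[-1][1]
print(changes)
```

Each poll sees only the latest state of a row, so the previous value and any intermediate updates between two polls are not observable.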

Cons

● Detecting changes is tricky

○ Full scan

○ Additional index for change markers

● No previous value (change coalescing)

Pros

● May be implemented in the application layer

● Delayed change consumption

● Negligible storage overhead (when no index is added)

Callbacks

● Triggers, interceptors, etc.

● User code is supplied to the storage system

[Diagram: Update → Callback → User-defined Action]

Cons

● Invoked synchronously

● Tricky failover in distributed systems

Pros

● No system storage/insert overhead

● Previous value is usually available

● May have an ability to modify updated value
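The callback approach can be illustrated with a toy in-process store. This is only a sketch: `CallbackStore`, `register`, and `audit_and_clamp` are hypothetical names, and a real storage system would expose its own trigger or interceptor interface.

```python
from typing import Callable, Dict, List, Optional

# A callback receives (key, previous value, new value) and returns the value to store.
Interceptor = Callable[[str, Optional[int], int], int]


class CallbackStore:
    """Hypothetical key-value store that runs user code on every update."""

    def __init__(self) -> None:
        self._data: Dict[str, int] = {}
        self._interceptors: List[Interceptor] = []

    def register(self, interceptor: Interceptor) -> None:
        self._interceptors.append(interceptor)

    def put(self, key: str, value: int) -> None:
        old = self._data.get(key)
        # Callbacks run synchronously, before the write completes,
        # so slow user code slows down every update.
        for interceptor in self._interceptors:
            value = interceptor(key, old, value)
        self._data[key] = value

    def get(self, key: str) -> Optional[int]:
        return self._data.get(key)


audit: List[tuple] = []


def audit_and_clamp(key: str, old: Optional[int], new: int) -> int:
    """Record the change (with its previous value) and forbid negative values."""
    audit.append((key, old, new))
    return max(new, 0)


store = CallbackStore()
store.register(audit_and_clamp)
store.put("a", 5)
store.put("a", -3)
print(audit, store.get("a"))
```

The sketch shows both listed advantages (the previous value is passed in, and the callback may alter the stored value) as well as the main drawback: the caller of `put` waits for the user code to finish.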

Change Feed

● Changes are stored as events (Event Sourcing)

● Or changes produce events

● Consumers subscribe to a change feed

● The database write-ahead log (WAL) is an event source!

[Diagram: Update → Subscriber / Consumer]

Cons

● Need additional storage to keep changes

Pros

● Previous values are usually available

● Full change history is preserved

● Possibly an ability to re-read the history
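A change feed can be sketched as an append-only log that consumers read from an offset. The classes below (`ChangeFeed`, `FeedBackedStore`, `ReplicaConsumer`) are hypothetical and live in a single process; they only illustrate the idea of subscribing to, and re-reading, a stored sequence of change events.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class ChangeEvent:
    offset: int
    key: str
    old: Optional[Any]  # previous value, None if the key did not exist
    new: Optional[Any]  # new value, None means the key was deleted


class ChangeFeed:
    """Append-only log of changes; this is the extra storage the feed requires."""

    def __init__(self) -> None:
        self._log: List[ChangeEvent] = []

    def append(self, key: str, old: Optional[Any], new: Optional[Any]) -> None:
        self._log.append(ChangeEvent(len(self._log), key, old, new))

    def read_from(self, offset: int) -> List[ChangeEvent]:
        return self._log[offset:]


class FeedBackedStore:
    """Primary store: every write also produces an event in the feed."""

    def __init__(self, feed: ChangeFeed) -> None:
        self._data: Dict[str, Any] = {}
        self._feed = feed

    def put(self, key: str, value: Any) -> None:
        self._feed.append(key, self._data.get(key), value)
        self._data[key] = value

    def delete(self, key: str) -> None:
        self._feed.append(key, self._data.pop(key, None), None)


class ReplicaConsumer:
    """Consumer that tracks its own offset and rebuilds state from the feed."""

    def __init__(self, feed: ChangeFeed) -> None:
        self._feed = feed
        self.offset = 0
        self.replica: Dict[str, Any] = {}

    def poll(self) -> int:
        events = self._feed.read_from(self.offset)
        for event in events:
            if event.new is None:
                self.replica.pop(event.key, None)
            else:
                self.replica[event.key] = event.new
            self.offset = event.offset + 1
        return len(events)


feed = ChangeFeed()
store = FeedBackedStore(feed)
consumer = ReplicaConsumer(feed)
store.put("k1", 1)
store.put("k2", 2)
store.put("k1", 10)
store.delete("k2")
processed = consumer.poll()
print(processed, consumer.replica)
```

Because the log keeps every event, a consumer that starts later can read from offset 0 and arrive at the same state, which is the "re-read the history" property listed above.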

CDC Applications

Continuous Data Integration

● “Active” database produces changes

● The changes are applied to a secondary system

[Diagram: Captured Changes]

Continuous Data Integration

● Reads offload

● Audit Changelog

● Cross-system Replication

● High Availability

Running Function Calculation

● Computationally expensive function over a large set of items?

● Calculate once, then apply deltas

● AVG (ITEMS) = SUM (ITEMS) / COUNT (ITEMS)

○ O(N) complexity

● On insert => SUM += New Value, COUNT += 1

● On delete => SUM -= Deleted Value, COUNT -= 1

● On update => SUM = SUM - Old Value + New Value

● With SUM and COUNT maintained incrementally, the average is an O(1) operation
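The delta rules above translate directly into code. The following is a minimal sketch; the `RunningAverage` class and its method names are illustrative, and it assumes the change source delivers the old value on update and the deleted value on delete.

```python
from typing import Iterable


class RunningAverage:
    """Keeps SUM and COUNT up to date so the average never needs a full rescan."""

    def __init__(self, items: Iterable[float] = ()) -> None:
        self.total = 0.0
        self.count = 0
        # Initial O(N) calculation, done once.
        for item in items:
            self.on_insert(item)

    def on_insert(self, new_value: float) -> None:
        self.total += new_value
        self.count += 1

    def on_delete(self, deleted_value: float) -> None:
        self.total -= deleted_value
        self.count -= 1

    def on_update(self, old_value: float, new_value: float) -> None:
        # COUNT is unchanged; only the difference is applied to SUM.
        self.total += new_value - old_value

    @property
    def average(self) -> float:
        if self.count == 0:
            raise ValueError("average of an empty set is undefined")
        return self.total / self.count


avg = RunningAverage([10, 20, 30])
initial = avg.average
avg.on_insert(40)
after_insert = avg.average
avg.on_update(10, 30)
after_update = avg.average
avg.on_delete(20)
print(initial, after_insert, after_update, avg.average)
```

Note that the update and delete rules need the previous value, so this pattern fits callbacks and change feeds better than plain record change markers.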

Cross-System Active-Active Replication

● Update feeds go both ways

● Conflicts need to be resolved

● Conflict-free Replicated Data Types (CRDTs) can help

Basic CRDTs

● Grow-only counter

● Positive-negative counter

● Grow-only set

● Two-phase set

● Last-write-wins

● …
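Three of the listed CRDTs can be sketched in a few lines. These are textbook state-based variants written for illustration; the class names are hypothetical, and the last-write-wins register assumes the caller supplies a logical timestamp, with the node id used only to break ties.

```python
from typing import Any, Dict, Tuple


class GCounter:
    """Grow-only counter: one slot per node, merged with element-wise max."""

    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        self.counts: Dict[str, int] = {}

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a grow-only counter cannot decrease")
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)


class PNCounter:
    """Positive-negative counter: two grow-only counters, value is the difference."""

    def __init__(self, node_id: str) -> None:
        self.positive = GCounter(node_id)
        self.negative = GCounter(node_id)

    def increment(self, amount: int = 1) -> None:
        self.positive.increment(amount)

    def decrement(self, amount: int = 1) -> None:
        self.negative.increment(amount)

    def value(self) -> int:
        return self.positive.value() - self.negative.value()

    def merge(self, other: "PNCounter") -> None:
        self.positive.merge(other.positive)
        self.negative.merge(other.negative)


class LWWRegister:
    """Last-write-wins register: the write with the highest (timestamp, node) wins."""

    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        self.value: Any = None
        self.stamp: Tuple[int, str] = (0, "")

    def set(self, value: Any, timestamp: int) -> None:
        self.value = value
        self.stamp = (timestamp, self.node_id)

    def merge(self, other: "LWWRegister") -> None:
        if other.stamp > self.stamp:
            self.value, self.stamp = other.value, other.stamp


# Two sites update independently, then exchange state in both directions.
a, b = PNCounter("a"), PNCounter("b")
a.increment(5)
b.increment(3)
b.decrement(2)
a.merge(b)
b.merge(a)

r1, r2 = LWWRegister("a"), LWWRegister("b")
r1.set("x", timestamp=1)
r2.set("y", timestamp=2)
r1.merge(r2)
r2.merge(r1)
print(a.value(), b.value(), r1.value, r2.value)
```

Merging is commutative and idempotent in each case, so both sites converge regardless of the order or number of times state is exchanged.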

CDC In Apache Ignite

Applying Changes To Ignite

● IgniteDataStreamer to optimally deliver changes to data nodes

● A user can use a custom stream receiver

● Out-of-the-box integrations

○ Kafka

○ MQTT

○ …

Callbacks

● CacheInterceptor

○ Guarantees update order

○ May alter inserted value

○ Synchronous, may affect performance

● Cache Events

○ Guarantee update order

○ Asynchronous

Callbacks And Change Feed Combined

● ContinuousQuery

○ Client-server subscription

○ Remote filter acts as a synchronous callback

○ Local listener acts as a sink

[Diagram: Update (K1) → Remote Filter → Consumer; Update (K2) → Remote Filter → Consumer]

● Automatic failover in case of primary node crash

● Single-key ordering guarantees
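The remote filter / local listener split can be modelled with a toy, single-process simulation. This is not the Ignite API: `Node`, `Cluster`, and `continuous_query` are hypothetical stand-ins, the key-to-node mapping is a plain CRC32 modulo, and failover is not modelled at all.

```python
import zlib
from typing import Any, Callable, Dict, List, Optional, Tuple

# A change event: (key, previous value, new value).
Event = Tuple[str, Optional[Any], Any]


class Node:
    """Hypothetical data node that owns a subset of the keys."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.data: Dict[str, Any] = {}
        self.subscriptions: List[Tuple[Callable[[Event], bool], Callable[[Event], None]]] = []

    def put(self, key: str, value: Any) -> None:
        event: Event = (key, self.data.get(key), value)
        self.data[key] = value
        for remote_filter, local_listener in self.subscriptions:
            # The filter runs synchronously on the node that owns the key.
            if remote_filter(event):
                # In a real cluster this would be a network hop to the subscriber.
                local_listener(event)


class Cluster:
    """Routes each key to one owning node, so updates to a key stay ordered."""

    def __init__(self, names: List[str]) -> None:
        self.nodes = [Node(name) for name in names]

    def owner(self, key: str) -> Node:
        return self.nodes[zlib.crc32(key.encode()) % len(self.nodes)]

    def put(self, key: str, value: Any) -> None:
        self.owner(key).put(key, value)

    def continuous_query(
        self,
        remote_filter: Callable[[Event], bool],
        local_listener: Callable[[Event], None],
    ) -> None:
        # The subscription is deployed to every node.
        for node in self.nodes:
            node.subscriptions.append((remote_filter, local_listener))


received: List[Event] = []
cluster = Cluster(["node-1", "node-2"])
cluster.continuous_query(lambda event: event[2] >= 100, received.append)

cluster.put("K1", 50)   # filtered out on the owning node
cluster.put("K1", 150)
cluster.put("K2", 200)
cluster.put("K1", 175)
k1_events = [event for event in received if event[0] == "K1"]
print(k1_events)
```

Filtering on the owning node keeps uninteresting updates off the network, while the listener on the subscriber side acts as the sink. The simulation only demonstrates per-key ordering; it makes no claim about ordering across different keys.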

● Ingestion

○ IgniteDataStreamer

● Capturing Changes

○ CacheInterceptor

○ Events

○ ContinuousQuery

Apache Ignite

Want To Contribute?

[email protected]

[email protected]

Q&A

Thank you for your attention!