IMC Summit 2016 Breakout - Henning Andersen - Using Lock-free and Wait-free In-memory Algorithms to Turbo-charge High Volume Data Management

  • Published on
    09-Jan-2017


Transcript

Slide 1

Using lock-free and wait-free in-memory algorithms to turbo-charge high volume data management
Henning Andersen, Stibo Systems A/S
See all the presentations from the In-Memory Computing Summit at http://imcsummit.org

001

Bio
  • 20 years of professional career at Stibo Systems A/S
  • Developed software for the last 30+ years
  • Technical lead on many projects, including:
      • Migrating from C++ to Java platform (performance & scalability)
      • Establishing a component platform
      • In-Memory component

012

Get to Know Stibo Systems

023

Travel/Hospitality

Distribution

Retail

Manufacturing

Our Growing Family

402

2015 MQ MDM of Product Domain

035

Complete, Seamless Multidomain MDM Solution

03-056

Integrating In-Memory into STEP
[Diagram labels: STEP, STEP Server (J2EE), DB Server, In-Memory DB (off-heap)]

05-077

Benchmark Results
  • Large Retailer Data
  • Large Distributor Data
  • Scalability Test Data

07-088

Requirements
  • Great performance
  • Compact memory layout
  • Data
  • Per Entry Overhead
  • Lookup by Key
  • Complex Querying
  • Indexing
  • Friendly to our existing architecture
  • Fast Startup/Initialization

08-109

Performance by Simplicity
  • MVCC/Immutability
  • Wait-free index scans
  • Code Generation
  • Custom API/Direct Access

mov (%rdi,%r11,1),%r11

10-1110

Basic Hash Table - Closed Addressing
Bucket Table: index = hash(key) % tablesize
Entry: Next, Key=K1, Value=10

1111

Basic Hash Table - Closed Addressing
Bucket Table: index = hash(key) % tablesize
Entries: (Next, Key=K1, Value=10), (Next, Key=K3, Value=20)

1212

Basic Hash Table - Collision
Bucket Table: index = hash(key) % tablesize
Entries: (Next, Key=K4, Value=30), (Next, Key=K3, Value=20), (Next, Key=K1, Value=10)
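The three hash table slides above can be illustrated with a minimal sketch of a chained ("closed addressing") hash table. The class and method names are hypothetical and not taken from the talk; it only shows the bucket table, the `hash(key) % tablesize` lookup, and the `Next` pointer that resolves collisions.

```java
import java.util.Objects;

/** Minimal chained hash table; all names are illustrative, not from the talk. */
final class ChainedHashTable<K, V> {
    /** One chain entry: key, value, and the Next pointer to the following entry in the bucket. */
    private static final class Entry<K, V> {
        final K key;
        V value;
        Entry<K, V> next;

        Entry(K key, V value, Entry<K, V> next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    private final Entry<K, V>[] buckets;

    @SuppressWarnings("unchecked")
    ChainedHashTable(int tableSize) {
        buckets = (Entry<K, V>[]) new Entry[tableSize];
    }

    /** hash(key) % tablesize, kept non-negative. */
    private int bucketIndex(K key) {
        return Math.floorMod(Objects.hashCode(key), buckets.length);
    }

    void put(K key, V value) {
        int i = bucketIndex(key);
        for (Entry<K, V> e = buckets[i]; e != null; e = e.next) {
            if (Objects.equals(e.key, key)) {
                e.value = value;          // overwrite in place (no versioning here)
                return;
            }
        }
        // Collision or empty bucket: the new entry becomes the head of the chain.
        buckets[i] = new Entry<>(key, value, buckets[i]);
    }

    V get(K key) {
        for (Entry<K, V> e = buckets[bucketIndex(key)]; e != null; e = e.next) {
            if (Objects.equals(e.key, key)) {
                return e.value;
            }
        }
        return null;
    }
}
```

With a table size of 1 every key lands in the same bucket, so `put("K1", 10)`, `put("K3", 20)` and `put("K4", 30)` all end up on one chain and are still found by `get`.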

12-1313

MVCC Hash Table

Entry: Next, Prev, TSN, Key=K1, Value=10
Bucket Table: index = hash(key) % tablesize
Transaction Table: Tx ID, TSN
Published TSN: 2
TSN = Transaction Sequence Number

13-15
Two additional structures. TSN is global time, like the Oracle system change number (SCN). It is monotonically increasing over time.

14

Transaction/Commit Phases
Phases: Prepare, Commit, Finish, Publish, Vacuum, Abort
[Diagram: four STEP nodes, each with an In-Memory DB; a Leader drives the commit phases]

15-1615

Transaction/Commit Phases
Phases: Prepare, Commit, Finish, Publish, Vacuum, Abort
[Diagram: four STEP nodes, each with an In-Memory DB; a Leader drives the commit phases]

1516

Transaction/Commit Phases
Phases: Prepare, Commit, Finish, Publish, Vacuum, Abort
[Diagram: four STEP nodes, each with an In-Memory DB; a Leader drives the commit phases; three nodes are labelled TSN=3]

1617

Transaction/Commit Phases
Phases: Prepare, Commit, Finish, Publish, Vacuum, Abort
[Diagram: four STEP nodes, each with an In-Memory DB; a Leader drives the commit phases; three nodes are labelled TSN=3]

16-1718

MVCC Hash Table Update - put(K1,15)
Phase: Prepare (phases shown: Prepare, Finish, Publish)
Bucket Table: index = hash(key) % tablesize
Existing entry: Next, Prev, TSN=2, Key=K1, Value=10
New entry: Next, Prev, TSN=Infinite, Key=K1, Value=15
Transaction Table (Tx ID, TSN): UUID=1234, no TSN yet
Published TSN: 2

17-1819

MVCC Hash Table Update - put(K1,15)
Phase: Finish, step 1. Pull new TSN
Bucket Table: index = hash(key) % tablesize
Existing entry: Next, Prev, TSN=2, Key=K1, Value=10
New entry: Next, Prev, TSN=Infinite, Key=K1, Value=15
Transaction Table (Tx ID, TSN): UUID=1234, TSN=3
Published TSN: 2

1820

MVCC Hash Table Update - put(K1,15)
Phase: Finish, step 2. Apply TSN
Bucket Table: index = hash(key) % tablesize
Existing entry: Next, Prev, TSN=2, Key=K1, Value=10
New entry: Next, Prev, TSN=3 (was Infinite), Key=K1, Value=15
Transaction Table (Tx ID, TSN): UUID=1234, TSN=3
Published TSN: 2

18-1921

MVCC Hash Table Update - put(K1,15)
Phase: Finish, step 3. Link to Prev
Bucket Table: index = hash(key) % tablesize
Existing entry: Next, Prev, TSN=2, Key=K1, Value=10
New entry: Next, Prev, TSN=3, Key=K1, Value=15
Transaction Table (Tx ID, TSN): UUID=1234, TSN=3
Published TSN: 2

1922

MVCC Hash Table Update - put(K1,15)
Phase: Finish, step 4. Update Bucket Table
Bucket Table: index = hash(key) % tablesize
Existing entry: Next, Prev, TSN=2, Key=K1, Value=10
New entry: Next, Prev, TSN=3, Key=K1, Value=15
Transaction Table (Tx ID, TSN): UUID=1234, TSN=3
Published TSN: 2

19-2023

MVCC Hash Table Reader
Reader TSN=2: Lookup K1
Bucket Table: index = hash(key) % tablesize
Entries: (Next, Prev, TSN=2, Key=K1, Value=10), (Next, Prev, TSN=3, Key=K1, Value=15)
Transaction Table (Tx ID, TSN): UUID=1234, TSN=3
Published TSN: 2

20-2124

MVCC Hash Table Update (Publish)
Phase: Publish: update Published TSN
Bucket Table: index = hash(key) % tablesize
Entries: (Next, Prev, TSN=2, Key=K1, Value=10), (Next, Prev, TSN=3, Key=K1, Value=15)
Transaction Table (Tx ID, TSN): UUID=1234, TSN=3
Published TSN: 2 -> 3

2125

MVCC Hash Table Reader
Reader TSN=3: Lookup K1
Bucket Table: index = hash(key) % tablesize
Entries: (Next, Prev, TSN=2, Key=K1, Value=10), (Next, Prev, TSN=3, Key=K1, Value=15)
Transaction Table (Tx ID, TSN): UUID=1234, TSN=3
Published TSN: 3
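The put(K1,15) walk-through above (Prepare, the four Finish steps, Publish, and the two reader lookups) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the STEP implementation: it assumes a single writer, keeps values as plain `int`s on the Java heap, and every class, field, and method name is hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;

/** Single-writer MVCC hash map sketch; names are illustrative, not from the talk. */
final class MvccMap {
    static final long INFINITE = Long.MAX_VALUE;

    static final class Version {
        final String key;
        final int value;
        volatile long tsn = INFINITE; // invisible to every reader until Finish applies a TSN
        volatile Version prev;        // older version of the same key
        volatile Version next;        // next entry in the same bucket

        Version(String key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    private final AtomicReferenceArray<Version> buckets;
    private final AtomicLong lastTsn = new AtomicLong();
    private final AtomicLong publishedTsn = new AtomicLong();

    MvccMap(int tableSize) {
        buckets = new AtomicReferenceArray<>(tableSize);
    }

    private int index(String key) {
        return Math.floorMod(key.hashCode(), buckets.length());
    }

    /** Prepare: build the new version off to the side with TSN = Infinite. */
    Version prepare(String key, int value) {
        return new Version(key, value);
    }

    /** Finish: the four steps from the slides. Assumes one writer at a time. */
    long finish(Version v) {
        long tsn = lastTsn.incrementAndGet();   // 1. pull new TSN
        v.tsn = tsn;                            // 2. apply TSN
        int i = index(v.key);
        Version before = null;
        Version cur = buckets.get(i);
        while (cur != null && !cur.key.equals(v.key)) {
            before = cur;
            cur = cur.next;
        }
        v.prev = cur;                           // 3. link to prev (null for a new key)
        v.next = (cur == null) ? null : cur.next;
        if (before == null) {
            buckets.set(i, v);                  // 4. update bucket table
        } else {
            before.next = v;
        }
        return tsn;
    }

    /** Publish: from now on new readers take their snapshot at this TSN. */
    void publish(long tsn) {
        publishedTsn.set(tsn);
    }

    /** A reader's snapshot is simply the published TSN at the time it starts. */
    long snapshot() {
        return publishedTsn.get();
    }

    /** Returns the newest version of key with TSN <= readerTsn, or null if none exists. */
    Integer get(String key, long readerTsn) {
        for (Version e = buckets.get(index(key)); e != null; e = e.next) {
            if (e.key.equals(key)) {
                for (Version v = e; v != null; v = v.prev) {
                    if (v.tsn <= readerTsn) {
                        return v.value;
                    }
                }
                return null;
            }
        }
        return null;
    }
}
```

A reader that took its snapshot before the publish keeps seeing the old value, while a reader that starts afterwards sees the new one; neither takes a lock.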

21-2226

MVCC Hash Table

2227

Indexing using Skip Lists

22-2328

Skip Lists
List: H -> 10 -> 20 -> 30 -> 40 -> 50
  • 50% have height >= 2
  • 25% have height >= 3
  • Head height ~= log2(n)
Find value = 30: at each node, test 30 >= next.value?

23-2529

Skip Lists - Insertion
List: H -> 10 -> 20 -> 30 -> 40 -> 50
Insert 15: pick random height

25-2630

Skip Lists - Insertion
List: H -> 10 -> 20 -> 30 -> 40 -> 50
Insert 15: pick random height

2631

Skip Lists - Insertion
List: H -> 10 -> 20 -> 30 -> 40 -> 50
Insert 15: pick random height

26-2732

Skip Lists - Insertion Result
List: H -> 10 -> 15 -> 20 -> 30 -> 40 -> 50
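The skip list slides above (random height, search from the top level down, and insertion of 15) can be sketched as a plain single-threaded skip list. This is only an illustration of the data structure; the class name, the fixed maximum height of 16, and the seeded `Random` are assumptions, and the lock-free aspects of the talk are deliberately left out here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Single-threaded skip list sketch; names and MAX_HEIGHT are illustrative. */
final class SkipList {
    static final int MAX_HEIGHT = 16;

    private static final class Node {
        final int value;
        final Node[] next; // one forward pointer per level

        Node(int value, int height) {
            this.value = value;
            this.next = new Node[height];
        }
    }

    private final Node head = new Node(Integer.MIN_VALUE, MAX_HEIGHT);
    private final Random random;

    SkipList(long seed) {
        random = new Random(seed);
    }

    /** Coin flips: about 50% of nodes reach height >= 2, 25% reach height >= 3, and so on. */
    int randomHeight() {
        int h = 1;
        while (h < MAX_HEIGHT && random.nextBoolean()) {
            h++;
        }
        return h;
    }

    /** For each level, finds the last node whose value is smaller than target. */
    private Node[] findPredecessors(int target) {
        Node[] preds = new Node[MAX_HEIGHT];
        Node cur = head;
        for (int level = MAX_HEIGHT - 1; level >= 0; level--) {
            while (cur.next[level] != null && cur.next[level].value < target) {
                cur = cur.next[level];
            }
            preds[level] = cur;
        }
        return preds;
    }

    boolean contains(int value) {
        Node candidate = findPredecessors(value)[0].next[0];
        return candidate != null && candidate.value == value;
    }

    void insert(int value) {
        Node[] preds = findPredecessors(value);
        Node node = new Node(value, randomHeight());
        // Link bottom-up, so level 0 always contains every element.
        for (int level = 0; level < node.next.length; level++) {
            node.next[level] = preds[level].next[level];
            preds[level].next[level] = node;
        }
    }

    /** Walks level 0, which holds all values in sorted order. */
    List<Integer> values() {
        List<Integer> out = new ArrayList<>();
        for (Node n = head.next[0]; n != null; n = n.next[0]) {
            out.add(n.value);
        }
        return out;
    }
}
```

Inserting 10, 20, 30, 40, 50 and then 15 leaves level 0 in the order shown on the result slide, regardless of which heights the coin flips pick.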

2733

MVCC Indexing Using Skip Lists
Phase: Finish, step 5. Update Index
Bucket Table: index = hash(key) % tablesize
Old entry: Next, Prev, TSN=2, Key=K1, Value=10, Index
New entry: Next, Prev, TSN=3, Key=K1, Value=15, Index
Transaction Table (Tx ID, TSN): UUID=1234, TSN=3
Published TSN: 2

27-2834

MVCC Indexing Using Skip Lists
Phase: Finish, step 5. Update Index
Bucket Table: index = hash(key) % tablesize
Old entry: Next, Prev, TSN=2, Key=K1, Value=10, Index
New entry: Next, Prev, TSN=3, Key=K1, Value=15, Index
Transaction Table (Tx ID, TSN): UUID=1234, TSN=3
Published TSN: 2

2835

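Skip Lists - 5. Update Index
Entries (value, key, TSN): 10 (K1, TSN=2), 15 (K1, TSN=3), 20 (K3, TSN=2), 30 (K4, TSN=2), 40 (K5, TSN=2), 50 (K6, TSN=2)
Entry fields: Next, Prev, TSN, Key, Value, Index L0, Index L1, Index L2

28-29
Explain fields on the left
36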

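Skip Lists - 5. Update Index
Entries (value, key, TSN): 10 (K1, TSN=2), 15 (K1, TSN=3), 20 (K3, TSN=2), 30 (K4, TSN=2), 40 (K5, TSN=2), 50 (K6, TSN=2)
Entry fields: Next, Prev, TSN, Key, Value, Index L0, Index L1, Index L2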

2937

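Skip Lists - Insertion Result
Entries (value, key, TSN): 10 (K1, TSN=2), 15 (K1, TSN=3), 20 (K3, TSN=2), 30 (K4, TSN=2), 40 (K5, TSN=2), 50 (K6, TSN=2)
Entry fields: Next, Prev, TSN, Key, Value, Index L0, Index L1, Index L2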

2938

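Skip Lists - FIND [12-25], TSN=2
Entries (value, key, TSN): 10 (K1, TSN=2), 15 (K1, TSN=3), 20 (K3, TSN=2), 30 (K4, TSN=2), 40 (K5, TSN=2), 50 (K6, TSN=2)
Entry fields: Next, Prev, TSN, Key, Value, Index L0, Index L1, Index L2

29-30
Let us find values in the range 12-25 for TSN=2.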

39

Skip Lists - FIND [12-25], TSN=3
Entries (value, key, TSN): 10 (K1, TSN=2), 15 (K1, TSN=3), 20 (K3, TSN=2), 30 (K4, TSN=2), 40 (K5, TSN=2), 50 (K6, TSN=2)
Entry fields: Next, Prev, TSN, Key, Value, Index L0, Index L1, Index L2

30-31
However, had we used TSN=3, we would have seen both 15 and 20, since both are OK to use for TSN=3.
40
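The two FIND [12-25] slides can be reproduced with a small sketch of a TSN-filtered range scan. Two things here are assumptions rather than the talk's design: the JDK's `ConcurrentSkipListMap` stands in for the custom off-heap skip list, and each index entry carries an explicit `supersededTsn` field instead of the Prev/Next version pointers shown on the slides. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

/** Range scan over a versioned index; a stand-in sketch, not the talk's structure. */
final class VersionedIndex {
    static final long INFINITE = Long.MAX_VALUE;

    static final class IndexEntry {
        final String key;
        final long createdTsn;     // TSN at which this version became valid
        final long supersededTsn;  // TSN of the replacing version, or INFINITE

        IndexEntry(String key, long createdTsn, long supersededTsn) {
            this.key = key;
            this.createdTsn = createdTsn;
            this.supersededTsn = supersededTsn;
        }
    }

    // Keyed by the indexed value (10, 15, 20, ...).
    private final ConcurrentSkipListMap<Integer, IndexEntry> index = new ConcurrentSkipListMap<>();

    void add(int indexedValue, String key, long createdTsn, long supersededTsn) {
        index.put(indexedValue, new IndexEntry(key, createdTsn, supersededTsn));
    }

    /** Returns the indexed values in [from, to] that are visible to a reader at readerTsn. */
    List<Integer> find(int from, int to, long readerTsn) {
        List<Integer> hits = new ArrayList<>();
        for (Map.Entry<Integer, IndexEntry> e : index.subMap(from, true, to, true).entrySet()) {
            IndexEntry entry = e.getValue();
            boolean visible = entry.createdTsn <= readerTsn && readerTsn < entry.supersededTsn;
            if (visible) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }
}
```

Loaded with the six entries from the slides (10 superseded at TSN=3, 15 created at TSN=3, the rest created at TSN=2), a scan of 12-25 yields only 20 for a TSN=2 reader and both 15 and 20 for a TSN=3 reader.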

Lock-free Insertions Summary
  • CAS (compare-and-swap) on previous entity: one winner
  • Bottom-up insertion preserves the skip list at every level, allowing wait-free readers
  • Help vacuum ensures lock-freedom
List: H -> 10 -> 20 -> 30 -> 40 -> 50; inserting 15 and 17
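The "CAS on previous entity, one winner" point can be sketched on a single level of the list. The following is a minimal insert-only sorted linked list using `AtomicReference.compareAndSet`; it assumes no deletions (so it sidesteps the vacuum/helping part entirely) and uses hypothetical names throughout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

/** Insert-only lock-free sorted list (one skip-list level); names are illustrative. */
final class LockFreeSortedList {
    private static final class Node {
        final int value;
        final AtomicReference<Node> next;

        Node(int value, Node next) {
            this.value = value;
            this.next = new AtomicReference<>(next);
        }
    }

    private final Node head = new Node(Integer.MIN_VALUE, null);

    void insert(int value) {
        while (true) {
            // Find the insertion point: pred.value < value <= succ.value.
            Node pred = head;
            Node succ = pred.next.get();
            while (succ != null && succ.value < value) {
                pred = succ;
                succ = succ.next.get();
            }
            Node node = new Node(value, succ);
            // CAS on the previous node's next pointer. If two threads race for the
            // same gap, exactly one CAS succeeds; the loser re-scans and retries.
            if (pred.next.compareAndSet(succ, node)) {
                return;
            }
        }
    }

    /** Readers only follow next pointers: no locks and no retries. */
    boolean contains(int value) {
        Node cur = head.next.get();
        while (cur != null && cur.value < value) {
            cur = cur.next.get();
        }
        return cur != null && cur.value == value;
    }

    List<Integer> values() {
        List<Integer> out = new ArrayList<>();
        for (Node n = head.next.get(); n != null; n = n.next.get()) {
            out.add(n.value);
        }
        return out;
    }
}
```

Because a new node is fully initialised before the CAS makes it reachable, a concurrent reader sees the list either without the node or with it, never in a half-linked state.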

31-3241


Vacuum, epoch-based deferred reclamation
Status: Vacuum wait
Bucket Table: index = hash(key) % tablesize
Transaction Table (Tx ID, TSN): UUID=1234, TSN=3
Published TSN: 3 (was 2)
Entries between H and T (value, key, TSN): 10 (K1, TSN=2), 15 (K1, TSN=3), 20 (K3, TSN=2), 30 (K4, TSN=2), 40 (K5, TSN=2), 50 (K6, TSN=2)
Snapshot Registry (Reader, TSN, Epoch): (Thread=1234, 2, 17), (Thread=1235, 3, 17)

32-33
Snapshot registry, readers and writers (finish phase). We cannot remove the old version yet since a TSN=2 reader is still reading.
43

Vacuum, epoch-based deferred reclamation
Status: Vacuum wait
Bucket Table: index = hash(key) % tablesize
Transaction Table (Tx ID, TSN): UUID=1234, TSN=3
Published TSN: 3 (was 2)
Entries between H and T (value, key, TSN): 10 (K1, TSN=2), 15 (K1, TSN=3), 20 (K3, TSN=2), 30 (K4, TSN=2), 40 (K5, TSN=2), 50 (K6, TSN=2)
Snapshot Registry (Reader, TSN, Epoch): (Thread=1234, 2, 17), (Thread=1235, 3, 17)

33
Last TSN=2 reader completes.
44

Vacuum, epoch-based deferred reclamation
Bucket Table: index = hash(key) % tablesize
Transaction Table (Tx ID, TSN): UUID=1234, TSN=3
Published TSN: 3 (was 2)
Entries between H and T (value, key, TSN): 10 (K1, TSN=2), 15 (K1, TSN=3), 20 (K3, TSN=2), 30 (K4, TSN=2), 40 (K5, TSN=2), 50 (K6, TSN=2)
Snapshot Registry (Reader, TSN, Epoch): (Thread=1235, 3, 17)

3345

Vacuum, epoch-based deferred reclamation
Status: Vacuum phase 1
Bucket Table: index = hash(key) % tablesize
Transaction Table (Tx ID, TSN): UUID=1234, TSN=3
Published TSN: 3 (was 2)
Entries between H and T (value, key, TSN): 10 (K1, TSN=2), 15 (K1, TSN=3), 20 (K3, TSN=2), 30 (K4, TSN=2), 40 (K5, TSN=2), 50 (K6, TSN=2)
Snapshot Registry (Reader, TSN, Epoch): (Thread=1235, 3, 17)

33-34
No readers using TSN=2 exist any longer, and since the published TSN is 3, there can never be a new reader at TSN=2. So vacuum can begin cleaning up. Phase 1 is to unlink the entry from the index.
46

Vacuum, epoch-based deferred reclamation
Status: Vacuum epoch wait
Bucket Table: index = hash(key) % tablesize
Transaction Table (Tx ID, TSN): UUID=1234, TSN=3
Published TSN: 3 (was 2)
Entries between H and T (value, key, TSN): 10 (K1, TSN=2), 15 (K1, TSN=3), 20 (K3, TSN=2), 30 (K4, TSN=2), 40 (K5, TSN=2), 50 (K6, TSN=2)
Snapshot Registry (Reader, TSN, Epoch): (Thread=1235, 3, 17)

34-35
We still cannot deallocate the entry since we do not know if the reader from epoch 17 is looking at it. We will do an epoch change and wait for all current readers to complete.
47

Vacuum, epoch-based deferred reclamation
Status: Vacuum epoch wait
Bucket Table: index = hash(key) % tablesize
Transaction Table (Tx ID, TSN): UUID=1234, TSN=3
Published TSN: 3 (was 2)
Entries between H and T (value, key, TSN): 10 (K1, TSN=2), 15 (K1, TSN=3), 20 (K3, TSN=2), 30 (K4, TSN=2), 40 (K5, TSN=2), 50 (K6, TSN=2)
Snapshot Registry (Reader, TSN, Epoch): (Thread=2345, 3, 18), (Thread=1235, 3, 17)

35-36
New readers may appear while waiting, but they will use the new epoch value.
48

Vacuum, epoch-based deferred reclamation
Status: Vacuum epoch wait
Bucket Table: index = hash(key) % tablesize
Transaction Table (Tx ID, TSN): UUID=1234, TSN=3
Published TSN: 3 (was 2)
Entries between H and T (value, key, TSN): 10 (K1, TSN=2), 15 (K1, TSN=3), 20 (K3, TSN=2), 30 (K4, TSN=2), 40 (K5, TSN=2), 50 (K6, TSN=2)
Snapshot Registry (Reader, TSN, Epoch): (Thread=2345, 3, 18), (Thread=1235, 3, 17)

36
Eventually the reader from epoch 17 completes.
49

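Vacuum, epoch-based deferred reclamation
Status: Vacuum epoch wait
Bucket Table: index = hash(key) % tablesize
Transaction Table (Tx ID, TSN): UUID=1234, TSN=3
Published TSN: 3 (was 2)
Entries between H and T (value, key, TSN): 10 (K1, TSN=2), 15 (K1, TSN=3), 20 (K3, TSN=2), 30 (K4, TSN=2), 40 (K5, TSN=2), 50 (K6, TSN=2)
Snapshot Registry (Reader, TSN, Epoch): (Thread=2345, 3, 18)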

3650

Vacuum, epoch-based deferred reclamation
Status: Vacuum phase 2
Bucket Table: index = hash(key) % tablesize
Transaction Table (Tx ID, TSN): UUID=1234, TSN=3
Published TSN: 3 (was 2)
Entries between H and T (value, key, TSN): 10 (K1, TSN=2), 15 (K1, TSN=3), 20 (K3, TSN=2), 30 (K4, TSN=2), 40 (K5, TSN=2), 50 (K6, TSN=2)
Snapshot Registry (Reader, TSN, Epoch): (Thread=2345, 3, 18)

36-37
And now we can deallocate the entry. The one transaction that we have been looking at is now completely handled and can be removed from the transaction table.
51
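The vacuum sequence above has two gates: phase 1 (unlink) waits until no reader holds a snapshot older than the superseding TSN, and phase 2 (deallocate) waits until every reader registered before the epoch change has left. A minimal sketch of those two checks follows. It is an assumption-laden illustration: the registry is a `ConcurrentHashMap` keyed by a thread id, the race between reading the epoch and registering is ignored, and all names are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Snapshot registry sketch for epoch-based deferred reclamation; names are illustrative. */
final class SnapshotRegistry {
    private static final class Reader {
        final long tsn;
        final long epoch;

        Reader(long tsn, long epoch) {
            this.tsn = tsn;
            this.epoch = epoch;
        }
    }

    private final Map<Long, Reader> readers = new ConcurrentHashMap<>();
    private final AtomicLong epoch;

    SnapshotRegistry(long initialEpoch) {
        epoch = new AtomicLong(initialEpoch);
    }

    /** A reader registers its snapshot TSN together with the current epoch. */
    void enter(long threadId, long tsn) {
        readers.put(threadId, new Reader(tsn, epoch.get()));
    }

    void exit(long threadId) {
        readers.remove(threadId);
    }

    /**
     * Phase 1 gate: a version replaced at supersededTsn may be unlinked once that TSN is
     * published (no new reader can start below it) and no registered reader is below it.
     */
    boolean canUnlink(long supersededTsn, long publishedTsn) {
        if (publishedTsn < supersededTsn) {
            return false;
        }
        for (Reader r : readers.values()) {
            if (r.tsn < supersededTsn) {
                return false;
            }
        }
        return true;
    }

    /** Epoch change after unlinking; returns the new epoch. */
    long advanceEpoch() {
        return epoch.incrementAndGet();
    }

    /** Phase 2 gate: memory may be freed once every reader from before newEpoch has left. */
    boolean canFree(long newEpoch) {
        for (Reader r : readers.values()) {
            if (r.epoch < newEpoch) {
                return false;
            }
        }
        return true;
    }
}
```

Replaying the slides: with readers (1234, TSN=2) and (1235, TSN=3) in epoch 17, `canUnlink(3, 3)` is false until reader 1234 exits; after `advanceEpoch()` returns 18 and reader 2345 enters, `canFree(18)` stays false until reader 1235 exits.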

MVCC Summary
  • Map and indexes both under MVCC
  • Index scans are wait-free (and simple/fast)
  • Insert/update/delete are lock-free
  • Automated reclamation of storage

Efficient and safe API

transactionManager.read((snapshot) -> {
    QueryIterator products = snapshot.query(ProductCO._ID.range(IMC, Stibo));
    while (products.next()) {
        CacheEntry entry = products.entry();
        long typeId = entry.longValue(ProductCO::getObjectType);
        // Gets, queries etc. can be done safely on the same snapshot for all kinds of objects.
        CacheEntry type = snapshot.get(typeId);
    }
});

public class ProductCO {
    long getObjectType(ValuePointer ptr) { }
}

  • No object copies, no GC, efficient access
  • Often the JVM can inline an entire query into one native method

37-38
Example code. We access the RAM directly, not through a deserialized object.
53
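To illustrate what "access the RAM directly, not through a deserialized object" can look like in plain Java, here is a small sketch that reads a single field of a fixed-layout entry straight out of off-heap memory via a direct `ByteBuffer`. The entry layout (two `long` fields, 16 bytes per entry) and all names are assumptions made up for this example; the talk's `ValuePointer` and generated accessors are not shown in the slides.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Off-heap fixed-layout store sketch; layout and names are hypothetical. */
final class OffHeapProductStore {
    static final int ENTRY_SIZE = 16;         // assumed layout: id (8 bytes) + objectType (8 bytes)
    static final int ID_OFFSET = 0;
    static final int OBJECT_TYPE_OFFSET = 8;

    private final ByteBuffer memory;

    OffHeapProductStore(int capacity) {
        // allocateDirect places the bytes outside the Java heap, so the GC never scans or moves them.
        memory = ByteBuffer.allocateDirect(capacity * ENTRY_SIZE).order(ByteOrder.nativeOrder());
    }

    void write(int slot, long id, long objectType) {
        int base = slot * ENTRY_SIZE;
        memory.putLong(base + ID_OFFSET, id);
        memory.putLong(base + OBJECT_TYPE_OFFSET, objectType);
    }

    /** Reads one field at its offset; no object is created and nothing is deserialized. */
    long objectType(int slot) {
        return memory.getLong(slot * ENTRY_SIZE + OBJECT_TYPE_OFFSET);
    }

    long id(int slot) {
        return memory.getLong(slot * ENTRY_SIZE + ID_OFFSET);
    }
}
```

An accessor like `objectType(slot)` boils down to an address computation plus a load, which is the kind of code a JIT compiler can reduce to a single `mov` instruction.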

DIY - Useful Learning
  • Memory model (Java is different from C++) and CAS operations
  • Assembly
  • CPU memory architecture
  • Wait-free and lock-free algorithms
  • Enumerate all states
  • Think about state transitions
  • Try to formally prove it right
  • Deletions are often the trickiest part
  • Do not even think "this will never happen", because it will

38-3954

In-memory Vendor Questions
  • Direct access to data or only access to copies of data?
  • And direct access to individual fields in an entry?
  • Index/Query engine MVCC consistent with map gets and/or additional queries?
  • Will index scans/queries acquire locks?
  • Will index inserts acquire locks?
  • Will map get/put operations acquire locks?
  • Memory overhead per entry?
  • Memory overhead per index (per entry)?
  • How do you avoid memory fragmentation?
  • Do you lock pages in memory and use huge/large pages?

39-4055

4056
