SLICC: Self-Assembly of Instruction Cache Collectives for OLTP Workloads

Pınar Tözün, Anastasia Ailamaki, Islam Atta, Andreas Moshovos


Page 1: SLICC: Self-Assembly of Instruction Cache Collectives for OLTP Workloads

Pınar Tözün, Anastasia Ailamaki
Islam Atta, Andreas Moshovos

Page 2: Online Transaction Processing (OLTP)

• $100 billion/yr, +10% annually
  • E.g., banking, online purchases, stock market…
• Benchmarking: Transaction Processing Performance Council (TPC)
  • TPC-C: wholesale retailer
  • TPC-E: brokerage market
• OLTP drives innovation for HW and DB vendors

© Islam Atta

Page 3: Transactions Suffer from Instruction Misses

• Many concurrent transactions
• Instruction stalls due to L1 instruction cache thrashing

[Figure: transaction footprint over time compared with the L1-I size]

Page 4: Even on a CMP All Transactions Suffer

• All caches thrashed with similar code blocks

[Figure: transactions over time on cores and their L1-I caches]

Page 5: Opportunity

• Technology: a CMP's aggregate L1 instruction cache capacity is large enough
• Application behavior: instruction overlap within and across transactions
• Spreading the footprint over multiple cores reduces instruction misses

[Figure: multiple threads over time across multiple L1-I caches]

Page 6: SLICC Overview

• Dynamic hardware solution
  • How to divide a transaction
  • When to move
  • Where to go
• Performance
  • Reduces instruction misses by 44% (TPC-C), 68% (TPC-E)
  • Performance improves by 60% (TPC-C), 79% (TPC-E)
• Robust: non-OLTP workloads remain unaffected

Page 7: Talk Roadmap

• Intra/Inter-thread instruction locality is high
• SLICC Concept
• SLICC Ingredients
• Results
• Summary

Page 8: OLTP Facts

• Many concurrent transactions
• Few DB operations
  • 28–65KB
• Few transaction types
  • TPC-C: 5, TPC-E: 12
• Transactions fit in 128–512KB
• Overlap within and across different transactions
• CMPs' aggregate L1-I cache is large enough

[Figure: Payment and New Order transactions composed of the DB operations R(), U(), I(), D(), IT(), ITP()]
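
The capacity claim on this slide can be checked with quick arithmetic. The sketch below assumes the 16-core, 32KB L1-I configuration listed on the Methodology slide; the function name is made up for illustration.

```python
KB = 1024

def aggregate_l1i_bytes(num_cores: int, l1i_bytes: int) -> int:
    """Total L1-I capacity when every core contributes its private cache."""
    return num_cores * l1i_bytes

per_core = 32 * KB                          # L1-I size from the Methodology slide
total = aggregate_l1i_bytes(16, per_core)   # 16 cores from the Methodology slide

# Transaction footprint range quoted on this slide.
footprint_low, footprint_high = 128 * KB, 512 * KB

print("aggregate L1-I (KB):", total // KB)
print("smallest footprint exceeds one L1-I:", footprint_low > per_core)
print("largest footprint fits in aggregate:", footprint_high <= total)
```

Under these numbers a single L1-I cannot hold even the smallest transaction footprint, while the aggregate capacity covers the largest one.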

Page 9: Instruction Commonality Across Transactions

• Lots of code reuse
• Even higher across same-type transactions

[Figure: instruction blocks grouped by how many threads share them (Most, Few, Single) for TPC-C and TPC-E, shown for all threads and per transaction type; more yellow means more reuse]
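
A minimal sketch of how such a breakdown could be computed from per-thread traces of instruction block addresses. The function name, the trace format, and the 60% cut-off between "few" and "most" are assumptions for illustration; the slide does not state the thresholds used.

```python
from collections import Counter

def classify_blocks(blocks_per_thread, few_max=0.6):
    """Label each instruction block by the share of threads that touch it.

    few_max is a hypothetical cut-off: blocks touched by more than this
    fraction of threads are labeled "most".
    """
    num_threads = len(blocks_per_thread)
    touch_counts = Counter()
    for blocks in blocks_per_thread.values():
        touch_counts.update(set(blocks))  # count each thread at most once

    labels = {}
    for block, count in touch_counts.items():
        if count == 1:
            labels[block] = "single"
        elif count / num_threads <= few_max:
            labels[block] = "few"
        else:
            labels[block] = "most"
    return labels

# Invented traces for four threads.
traces = {
    "t1": [0x100, 0x140, 0x180],
    "t2": [0x100, 0x140, 0x1C0],
    "t3": [0x100, 0x180],
    "t4": [0x100],
}
labels = classify_blocks(traces)
```

Running the same classification once over all threads and once per transaction type would give the two views named in the figure.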

Page 10: Requirements

• Enable usage of aggregate L1-I capacity
  • Large cache size without increased latency
• Exploit instruction commonality
  • Localizes common transaction instructions
• Dynamic
  • Independent of footprint size or cache configuration

Page 11: Talk Roadmap

• Intra/Inter-thread instruction locality is high
• SLICC Concept
• SLICC Ingredients
• Results
• Summary

Page 12: Example for Concurrent Transactions

[Figure: control flow graph of transactions T1, T2, and T3, divided into code segments that can fit into L1-I]

Page 13: Scheduling Threads

[Figure: threads T1, T2, and T3 scheduled over time on cores 0–3 and their L1-I caches, under conventional scheduling and under SLICC]

• Conventional: cache filled 10 times
• SLICC: cache filled 4 times
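
The difference in fill counts can be illustrated with a toy model. This is a sketch under strong assumptions, not the slide's figure: each L1-I holds exactly one code segment, threads are processed one after another rather than concurrently, and the segment sequences are invented.

```python
def conventional_fills(threads):
    """Each thread stays on its own core; every segment change refills the L1-I."""
    fills = 0
    for segments in threads.values():
        cached = None
        for seg in segments:
            if seg != cached:
                fills += 1
                cached = seg
    return fills

def migrating_fills(threads, num_cores):
    """Threads move to a core whose L1-I already holds the next segment."""
    caches = [None] * num_cores   # one cached segment per core
    fills = 0
    next_victim = 0               # round-robin eviction once all cores are used
    for segments in threads.values():
        for seg in segments:
            if seg in caches:
                continue          # migrate to that core, no fill needed
            if None in caches:
                core = caches.index(None)
            else:
                core = next_victim
                next_victim = (next_victim + 1) % num_cores
            caches[core] = seg
            fills += 1
    return fills

# Hypothetical segment sequences for three transactions.
threads = {
    "T1": ["A", "B", "C", "A"],
    "T2": ["A", "B", "C"],
    "T3": ["A", "B", "D"],
}
conventional = conventional_fills(threads)
migrating = migrating_fills(threads, num_cores=4)
```

With these invented sequences the conventional schedule fills a cache on every segment change, while the migrating schedule fills each distinct segment once and reuses it across threads.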

Page 14: Talk Roadmap

• Intra/Inter-thread instruction locality is high
• SLICC Concept
• SLICC Ingredients
• Results
• Summary

Page 15: Migration Ingredients

• When to migrate?
  • Step 1: Detect: cache full
  • Step 2: Detect: new code segment
• Where to go?
  • Step 3: Predict where the next code segment is

Page 16: Migration Ingredients

• When to migrate?
  • Step 1: Detect: cache full
  • Step 2: Detect: new segment
• Where to go?
  • Step 3: Where is the next segment?

[Figure: timeline of thread T1 across cores; labels: idle cores, loops, return back]

Page 17: Migration Ingredients

• When to migrate?
  • Step 1: Detect: cache full
  • Step 2: Detect: new segment
• Where to go?
  • Step 3: Where is the next segment?

[Figure: timeline of thread T2 across cores]

Page 18: Implementation

• When to migrate?
  • Step 1: Detect: cache full (Miss Counter)
  • Step 2: Detect: new segment (Miss Dilution)
• Where to go?
  • Step 3: Where is the next segment? (Find signature blocks on remote cores)
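
Steps 1 and 2 can be sketched in software as follows. This is a hedged illustration, not the hardware design: the class name, the window length, and the default thresholds are assumptions, with the threshold values picked from the ranges shown on the Parameter Space slides.

```python
from collections import deque

class MigrationTrigger:
    """Toy model of the 'when to migrate' decision for one core."""

    def __init__(self, fill_up_t=256, dilution_t=10, window=100):
        self.fill_up_t = fill_up_t          # misses needed to call the cache full
        self.dilution_t = dilution_t        # recent misses that signal a new segment
        self.miss_counter = 0
        self.recent = deque(maxlen=window)  # 1 = miss, 0 = hit (window is assumed)

    def cache_full(self) -> bool:
        return self.miss_counter >= self.fill_up_t

    def access(self, hit: bool) -> bool:
        """Record one L1-I access; return True when migration should be considered."""
        if not self.cache_full():
            # Step 1: count misses until the cache is assumed to be full.
            if not hit:
                self.miss_counter += 1
            return False
        # Step 2: once full, track recent hits and misses to spot a new segment.
        self.recent.append(0 if hit else 1)
        return sum(self.recent) >= self.dilution_t

    def reset(self) -> None:
        """Start over, for example after the thread has migrated."""
        self.miss_counter = 0
        self.recent.clear()

# Small thresholds to keep the example short.
trigger = MigrationTrigger(fill_up_t=4, dilution_t=2, window=8)
warmup = [trigger.access(hit=False) for _ in range(4)]   # fills the cache
after_hit = trigger.access(hit=True)
first_miss = trigger.access(hit=False)
second_miss = trigger.access(hit=False)
```

Keeping the two detectors separate mirrors the slide: the miss counter answers "is the cache full?", and only then does the recent hit/miss history decide whether execution has moved to a new segment.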

Page 19: Boosting Effectiveness

• More overlap across transactions of the same type
• SLICC: transaction type-oblivious
• Transaction type-aware
  • SLICC-Pp: pre-processing to detect similar transactions
  • SLICC-SW: software provides information

Page 20: Talk Roadmap

• Intra/Inter-thread instruction locality is high
• SLICC Concept
• SLICC Ingredients
• Results
• Summary

Page 21: Experimental Evaluation

• How does SLICC affect instruction misses? This is our primary goal.
• How does it affect data misses? They are expected to increase, but by how much?
• Performance impact: are the data misses and migration overheads amortized?

Page 22: Methodology

• Simulation
  • Zesto (x86)
  • 16 OoO cores, 32KB L1-I, 32KB L1-D, 1MB L2 per core
  • QEMU extension
  • User and kernel space
• Workloads
  • Shore-MT

Page 23: Effect on Misses

• Baseline: no effort to reduce instruction misses
• Reduces I-MPKI by 58%; increases D-MPKI by 7%

[Figure: I-MPKI and D-MPKI (y-axis: MPKI, 0–45; lower is better) for Base, SLICC, and SLICC-SW on TPC-C-10, TPC-E, and MapReduce]
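
For reference, MPKI is misses per kilo-instruction, and the percentages on this slide are relative changes in that metric. The sketch below uses invented miss and instruction counts chosen only to show the calculation; they are not measurements from the talk.

```python
def mpki(misses: int, instructions: int) -> float:
    """Misses per thousand executed instructions."""
    return misses * 1000 / instructions

def percent_change(before: float, after: float) -> float:
    """Relative change in percent; negative means a reduction."""
    return (after - before) / before * 100

# Hypothetical counts over one million instructions.
base_i_mpki = mpki(40_000, 1_000_000)
slicc_i_mpki = mpki(16_800, 1_000_000)

change = percent_change(base_i_mpki, slicc_i_mpki)
print(f"I-MPKI change: {change:.1f}%")
```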

Page 24: Performance

• Next-line: always prefetch the next line
• PIF-No Overhead: upper bound for Proactive Instruction Fetch [Ferdman, MICRO'11]
• Speedup: TPC-C: +60%, TPC-E: +79%
• Storage per core
  • PIF: ~40KB
  • SLICC: <1KB

[Figure: speedup (y-axis: 1–2; higher is better) of Next-Line, PIF-No Overhead, SLICC, and SLICC-SW on TPC-C-1, TPC-C-10, TPC-E, and MapReduce]
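
The next-line comparison point can be sketched as below. This is a deliberately simplified model, assuming an unbounded cache and a trace of block numbers; it only shows why sequential code benefits from next-line prefetching and says nothing about the evaluated configuration.

```python
def count_misses(block_trace, next_line=False):
    """Count demand misses, optionally prefetching block + 1 on every access.

    The cache is unbounded here, which is an assumption made to keep the
    example short; capacity and replacement are ignored.
    """
    present = set()
    misses = 0
    for block in block_trace:
        if block not in present:
            misses += 1
            present.add(block)
        if next_line:
            present.add(block + 1)  # always prefetch the next line
    return misses

# Invented trace: a sequential run, a jump, then a return to the start.
trace = [10, 11, 12, 13, 40, 41, 10]
without_prefetch = count_misses(trace)
with_prefetch = count_misses(trace, next_line=True)
```

Only the first block of each sequential run misses when the next line is always prefetched; jumps to new code still miss.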

Page 25: Summary

• OLTP's performance suffers due to instruction stalls.
• Technology and application opportunities:
  • Instruction footprint fits in the aggregate L1-I capacity of CMPs.
  • Inter- and intra-thread locality.
• SLICC:
  • Thread migration spreads the instruction footprint over multiple cores.
  • Reduces I-MPKI by 58%.
  • Improves performance over:
    • Baseline: +70%
    • Next-line: +44%
    • PIF: ±2% to +21%

Page 26: Thanks!

Email: [email protected]
http://islamatta.com

Page 27: Why do data misses increase?

Example: a thread migrates from core A to core B.

• It reads data on core B that was fetched on core A.
• It writes data on core B, which invalidates the copy on core A.
• When it returns to core A, its cache blocks might have been evicted by other threads.
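
The first two effects can be reproduced with a toy model of private data caches. This is a sketch, assuming a simple write-invalidate scheme with unbounded caches; the class and method names are hypothetical, and the third effect (eviction by other threads) is not modeled.

```python
class PrivateDataCaches:
    """Per-core data caches with write-invalidate sharing (toy model)."""

    def __init__(self, cores):
        self.lines = {core: set() for core in cores}
        self.misses = 0

    def read(self, core, block):
        if block not in self.lines[core]:
            self.misses += 1
            self.lines[core].add(block)

    def write(self, core, block):
        self.read(core, block)              # bring the block in before writing
        for other, lines in self.lines.items():
            if other != core:
                lines.discard(block)        # invalidate remote copies

def run(schedule):
    """schedule is a list of (core, operation) pairs on a single block."""
    caches = PrivateDataCaches(["A", "B"])
    for core, op in schedule:
        getattr(caches, op)(core, 0x40)
    return caches.misses

stay_on_a = run([("A", "read"), ("A", "read"), ("A", "write"), ("A", "read")])
migrate = run([("A", "read"), ("B", "read"), ("B", "write"), ("A", "read")])
```

The same four accesses cost one miss when the thread stays on core A and three when it migrates to core B and back.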

Page 28: SLICC Agent per Core

[Figure: block diagram of the per-core SLICC agent]

• Cache full detection: Miss Counter (MC), compared against Fill-up_t; enables shifting
• Miss dilution tracking: Miss Shift-Vector (MSV), shifts in miss (1) or hit (0); the count of 1s is compared against Dilution_t; enables searching
• Locating missed blocks on remote cores: Miss Tag-Queue (MTQ) and remote cache segment search; Matched_t entries enable migration and select the matching core
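
The "where to go" search can be sketched as a lookup of recently missed tags in the remote caches. This is an illustration under assumptions: remote cache contents are modeled as plain sets, the function name is made up, and preferring the core with the most matches is a tie-breaking choice not stated on the slide.

```python
def select_target_core(missed_tags, remote_caches, matched_t=4):
    """Return a remote core holding at least matched_t of the missed tags.

    missed_tags plays the role of the Miss Tag-Queue contents;
    remote_caches maps a core id to the set of tags it currently holds.
    Returns None when no remote core reaches the threshold.
    """
    best_core, best_matches = None, 0
    for core, cached_tags in remote_caches.items():
        matches = sum(1 for tag in missed_tags if tag in cached_tags)
        if matches >= matched_t and matches > best_matches:
            best_core, best_matches = core, matches
    return best_core

# Invented tags and cache contents.
recent_misses = [1, 2, 3, 4, 5]
remote = {1: {1, 2}, 2: {1, 2, 3, 4, 9}, 3: {5}}

target = select_target_core(recent_misses, remote, matched_t=4)
no_target = select_target_core(recent_misses, remote, matched_t=5)
```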

Page 29: Detailed Methodology

• Zesto (x86)
• Qtrace (QEMU extension)
• Shore-MT

Page 30: Hardware Cost

Page 31: Larger I-caches?

[Figure: instruction and data MPKI (left axis, 0–60), broken down into conflict, capacity, and compulsory misses, with speedup (right axis, 0–1.4), for cache sizes of 16, 32, 64, 128, 256, and 512 (K) on TPC-C-10, TPC-E, and MapReduce]

Page 32: Different Replacement Policies?

[Figure: L1 instruction MPKI (y-axis, 0–40; lower is better) under LRU, LIP, BIP, DIP, SRRIP, BRRIP, and DRRIP for TPC-C, TPC-E, and MapReduce]

Page 33: Parameter Space (1)

[Figure: I-MPKI and D-MPKI (left axis, 0–70) and speedup (right axis, 0–1.6) for TPC-C and TPC-E, comparing Base with Fill-up_t values of 128, 256, 384, and 512 (top labels) and Matched_t values of 2, 4, 6, 8, and 10 (bottom labels)]

Page 34: Parameter Space (2)

[Figure: I-MPKI and D-MPKI (left axis, 0–60) and speedup (right axis, 0–2) for TPC-C and TPC-E as Dilution_t varies from 2 to 30 in steps of 2]

Page 35: Cache Signature Accuracy

• Partial Bloom filter

[Figure: Bloom filter (BF) accuracy in % (y-axis, 96–101; higher is better) for 512, 1K, 2K, 4K, and 8K on TPC-C and TPC-E]
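
A minimal sketch of a cache signature and of how its accuracy could be measured. One plausible reading of "partial Bloom filter" is a bit vector indexed by part of the block address; that reading, the class name, and the accuracy definition are all assumptions for illustration.

```python
class CacheSignature:
    """Bit vector indexed by the low-order bits of a block address."""

    def __init__(self, num_bits=2048):
        if num_bits <= 0 or num_bits & (num_bits - 1):
            raise ValueError("num_bits must be a power of two")
        self.mask = num_bits - 1
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def insert(self, block_addr: int) -> None:
        self.bits[block_addr & self.mask] = 1

    def may_contain(self, block_addr: int) -> bool:
        # False positives are possible when two blocks share an index.
        return self.bits[block_addr & self.mask] == 1

def accuracy(signature, cached, queries):
    """Percentage of queries where the signature agrees with the real cache."""
    correct = sum(signature.may_contain(q) == (q in cached) for q in queries)
    return 100.0 * correct / len(queries)

# Invented block addresses; 0x1234 + 512 aliases with 0x1234 in a 512-bit vector.
cached = {0x1234}
sig = CacheSignature(num_bits=512)
for block in cached:
    sig.insert(block)

queries = [0x1234, 0x1235, 0x1234 + 512, 0x1236]
acc = accuracy(sig, cached, queries)
```

A larger bit vector leaves fewer aliasing addresses, which is consistent with accuracy being plotted against size in the figure.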