SLICC: Self-Assembly of Instruction Cache Collectives for OLTP Workloads
Islam Atta, Andreas Moshovos
Pınar Tözün, Anastasia Ailamaki
Online Transaction Processing (OLTP)
• $100 Billion/yr, +10% annually
  • E.g., banking, online purchases, stock market…
• Benchmarking: Transaction Processing Council
  • TPC-C: wholesale retailer
  • TPC-E: brokerage market
OLTP drives innovation for HW and DB vendors
© Islam Atta 2
Transactions Suffer from Instruction Misses
Many concurrent transactions; each transaction's instruction footprint exceeds the L1-I size.
[Figure: per-transaction instruction footprints vs. the L1-I size over time]
Instruction stalls arise from L1 instruction cache thrashing.
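The thrashing claim can be illustrated with a toy LRU instruction-cache model. The block counts and iteration counts below are made up for illustration and are not taken from the talk:

```python
from collections import OrderedDict

def simulate_l1i(cache_blocks, footprint_blocks, iterations):
    """Run a loop over a code footprint through an LRU cache.

    Returns the total number of instruction misses.
    """
    cache = OrderedDict()
    misses = 0
    for _ in range(iterations):
        for block in range(footprint_blocks):   # sequential code blocks
            if block in cache:
                cache.move_to_end(block)        # refresh LRU position
            else:
                misses += 1
                if len(cache) >= cache_blocks:
                    cache.popitem(last=False)   # evict least-recently-used
                cache[block] = True
    return misses

# Footprint fits: only compulsory misses on the first pass.
# Footprint exceeds capacity: LRU thrashes and every access misses.
```

With a 512-block cache, a 256-block loop misses only 256 times over four passes, while a 640-block loop misses on every one of its 2,560 accesses, which is the cliff the slide describes.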
Even on a CMP, All Transactions Suffer
[Figure: transactions spread across cores and their L1-I caches over time]
All caches are thrashed with similar code blocks.
Opportunity
Spreading the footprint over multiple cores reduces instruction misses.
• Technology: a CMP's aggregate L1 instruction cache capacity is large enough
• Application behavior: instruction overlap within and across transactions
[Figure: multiple threads spreading code segments across multiple L1-I caches over time]
SLICC Overview
Dynamic hardware solution:
• How to divide a transaction
• When to move
• Where to go
Performance:
• Reduces instruction misses by 44% (TPC-C), 68% (TPC-E)
• Improves performance by 60% (TPC-C), 79% (TPC-E)
Robust: non-OLTP workloads remain unaffected
Talk Roadmap
• Intra/inter-thread instruction locality is high
• SLICC concept
• SLICC ingredients
• Results
• Summary
OLTP Facts
• Many concurrent transactions
• Few DB operations: 28–65KB each
• Few transaction types: 5 in TPC-C, 12 in TPC-E
• Transactions fit in 128–512KB
• Instructions overlap within and across different transactions
[Figure: New Order and Payment transactions built from shared DB operations: R(), U(), I(), D(), IT(), ITP()]
CMPs' aggregate L1-I cache is large enough.
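The capacity claim can be sanity-checked against the evaluation configuration used later in the talk (16 cores with a 32KB L1-I each):

```python
cores, l1i_kb = 16, 32              # evaluation setup: 16 cores, 32KB L1-I each
aggregate_l1i_kb = cores * l1i_kb   # 512KB of aggregate instruction cache
txn_footprint_kb = (128, 512)       # transaction footprint range from this slide

# No single L1-I can hold even the smallest transaction footprint,
# but the aggregate capacity covers the largest one.
assert txn_footprint_kb[0] > l1i_kb
assert txn_footprint_kb[1] <= aggregate_l1i_kb
```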
Instruction Commonality Across Transactions
[Figure: instruction-reuse heat maps for TPC-C and TPC-E, across all threads and per transaction type; more yellow = more reuse]
• Lots of code reuse
• Even higher across same-type transactions
Requirements
• Enable use of the aggregate L1-I capacity: large cache size without increased latency
• Exploit instruction commonality: localize common transaction instructions
• Dynamic: independent of footprint size or cache configuration
Example for Concurrent Transactions
[Figure: control-flow graphs of transactions T1, T2, and T3, divided into code segments that can each fit into the L1-I]
Scheduling Threads
[Figure: threads T1–T3 on cores 0–3 over time. Conventional: each thread stays on one core and refills its L1-I for every new code segment. SLICC: threads migrate to the core whose cache already holds the next segment.]
Conventional: cache filled 10 times. SLICC: cache filled 4 times.
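The fill counts in the figure come from this scheduling difference. A toy model makes the effect concrete; the segment names and schedules below are invented, and the real mechanism is distributed hardware, not the idealized oracle sketched here:

```python
def conventional_fills(threads):
    """Each thread is pinned to its own core; the L1-I holds one
    code segment at a time, so every segment switch refills it."""
    fills = 0
    for segments in threads:
        current = None
        for seg in segments:
            if seg != current:
                fills += 1
                current = seg
    return fills

def slicc_fills(threads):
    """Toy SLICC: a thread migrates to any core whose cache already
    holds its next segment; a fill happens only when no core does."""
    cores = [None] * len(threads)   # segment resident in each L1-I
    fills = 0
    steps = max(len(s) for s in threads)
    for i in range(steps):          # interleave threads round-robin
        for t, segments in enumerate(threads):
            if i >= len(segments):
                continue
            seg = segments[i]
            if seg in cores:
                continue            # migrate to the core holding seg
            fills += 1              # otherwise load seg locally
            cores[t] = seg
    return fills

# Three threads walking three shared code segments A, B, C:
threads = [["A", "B", "C"], ["A", "B", "C"], ["B", "C", "A"]]
```

Here the conventional schedule fills a cache 9 times, while the migrating schedule fills only 3: once per distinct segment, after which threads chase the already-warm caches.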
Migration Ingredients
When to migrate?
• Step 1: detect that the cache is full
• Step 2: detect a new code segment
Where to go?
• Step 3: predict where the next code segment is
Migration Ingredients (example: thread T1)
When to migrate? Step 1: detect cache full. Step 2: detect new segment.
Where to go? Step 3: where is the next segment?
[Figure: T1 over time: it stays put while looping, migrates to an idle core when it starts a new segment, and can return to a core whose cache still holds an earlier segment]
Migration Ingredients (example: thread T2)
[Figure: T2 over time, applying the same three steps]
Implementation
• Step 1 (detect cache full): Miss Counter
• Step 2 (detect new segment): Miss Dilution
• Step 3 (where is the next segment?): find signature blocks on remote cores
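The two detection ingredients can be sketched in software. This is a loose model, not the hardware design: the threshold values and the sliding-window formulation of miss dilution below are assumptions, standing in for the paper's Fill-up_t and Dilution_t mechanisms:

```python
class MigrationDetector:
    """Toy per-core detector for the two 'when to migrate' steps.

    - cache full: enough misses have occurred to have filled the L1-I
    - new segment: recent accesses are miss-dense again, i.e. misses
      are no longer 'diluted' by the hits of a resident loop
    """
    def __init__(self, fillup_t=512, window=32, dilution_t=8):
        self.miss_counter = 0        # MC: total misses on this core
        self.window = []             # recent accesses: 1 = miss, 0 = hit
        self.window_size = window
        self.fillup_t = fillup_t     # assumed 'cache full' threshold
        self.dilution_t = dilution_t # assumed 'new segment' threshold

    def access(self, is_miss):
        """Record one L1-I access; return True to trigger migration."""
        if is_miss:
            self.miss_counter += 1
        self.window.append(1 if is_miss else 0)
        if len(self.window) > self.window_size:
            self.window.pop(0)
        cache_full = self.miss_counter >= self.fillup_t
        new_segment = sum(self.window) >= self.dilution_t
        return cache_full and new_segment
```

While a thread loops inside a resident segment, hits dominate the window and no migration fires; once the cache has filled and misses cluster again, the detector signals a move.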
Boosting Effectiveness
There is more overlap across transactions of the same type.
• SLICC: transaction-type-oblivious
• Transaction-type-aware variants:
  • SLICC-Pp: pre-processing to detect similar transactions
  • SLICC-SW: software provides the type information
Experimental Evaluation
• How does SLICC affect instruction misses? (our primary goal)
• How does it affect data misses? Expected to increase, but by how much?
• Performance impact: are the data misses and migration overheads amortized?
Methodology
Simulation: Zesto (x86)
• 16 OoO cores, 32KB L1-I, 32KB L1-D, 1MB per-core L2
• QEMU extension (Qtrace)
• User and kernel space
Workloads: Shore-MT (TPC-C, TPC-E), MapReduce
Effect on Misses
Baseline: no effort to reduce instruction misses.
[Figure: I-MPKI and D-MPKI for Base, SLICC, and SLICC-SW on TPC-C-10, TPC-E, and MapReduce; lower is better]
SLICC reduces I-MPKI by 58% and increases D-MPKI by 7%.
Performance
Next-line: always prefetch the next line. PIF-No Overhead: an upper bound for Proactive Instruction Fetch [Ferdman, MICRO'11].
[Figure: speedup of Next-Line, PIF-No Overhead, SLICC, and SLICC-SW on TPC-C-1, TPC-C-10, TPC-E, and MapReduce; higher is better]
TPC-C: +60%, TPC-E: +79%
Storage per core: PIF ~40KB, SLICC <1KB
Summary
OLTP performance suffers due to instruction stalls.
Technology & application opportunities:
• The instruction footprint fits in the aggregate L1-I capacity of CMPs.
• Inter- and intra-thread instruction locality.
SLICC:
• Thread migration spreads the instruction footprint over multiple cores.
• Reduces I-MPKI by 58%.
• Improves performance: vs. baseline +70%; vs. next-line +44%; vs. PIF ±2% to +21%.
Thanks!
Email: [email protected]
Web: http://islamatta.com
Why Do Data Misses Increase?
Example: a thread migrates from core A to core B.
• It reads data on core B that was fetched on core A.
• It writes data on core B, invalidating copies on core A.
• When it returns to core A, its cache blocks may have been evicted by other threads.
SLICC Agent per Core
[Diagram of the per-core hardware:]
• Cache-full detection: a Miss Counter (MC), incremented on each miss; MC ≥ Fill-up_t enables searching.
• Miss-dilution tracking: a Miss Shift-Vector (MSV) shifts in 1 on a miss and 0 on a hit; when the count of 1s ≥ Dilution_t, a new segment is detected.
• Locating missed blocks on remote cores: a Miss Tag-Queue (MTQ) drives a remote cache-segment search; when matched entries ≥ Matched_t, migration is enabled and the matching core is selected.
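The "where to go" half of the agent can be sketched as follows. This is an idealized software model with made-up thresholds: exact per-core sets stand in for the hardware cache signatures, and the tag values are hypothetical:

```python
def select_migration_target(mtq, core_signatures, matched_t=4):
    """Toy model of the MTQ-driven remote search.

    mtq             -- recently missed block tags on the local core
    core_signatures -- {core_id: set of tags resident in that L1-I}
                       (an exact set standing in for a signature)
    matched_t       -- minimum matches needed to enable migration

    Returns the best-matching core id, or None if no core matches
    at least matched_t of the queued tags.
    """
    best_core, best_matches = None, 0
    for core, signature in core_signatures.items():
        matches = sum(1 for tag in mtq if tag in signature)
        if matches > best_matches:
            best_core, best_matches = core, matches
    return best_core if best_matches >= matched_t else None
```

A core whose cache already holds most of the tags the local thread just missed is exactly the core that holds the next code segment, so migrating there converts those misses into hits.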
Detailed Methodology
• Zesto (x86)
• Qtrace (QEMU extension)
• Shore-MT
Hardware Cost
Larger I-Caches?
[Figure: I-MPKI and D-MPKI broken into conflict, capacity, and compulsory misses, plus speedup, as the L1 cache size sweeps 16–512KB for TPC-C-10, TPC-E, and MapReduce; lower MPKI and higher speedup are better]
Different Replacement Policies?
[Figure: L1 instruction MPKI under LRU, LIP, BIP, DIP, SRRIP, BRRIP, and DRRIP for TPC-C, TPC-E, and MapReduce; lower is better]
Parameter Space (1)
[Figure: I-MPKI, D-MPKI, and speedup for TPC-C and TPC-E as Fill-up_t sweeps 128–512 (top) and Matched_t sweeps 2–10 (bottom); lower MPKI and higher speedup are better]
Parameter Space (2)
[Figure: I-MPKI, D-MPKI, and speedup for TPC-C and TPC-E as Dilution_t sweeps 2–30; lower MPKI and higher speedup are better]
Cache Signature Accuracy
Partial Bloom filter
[Figure: Bloom-filter accuracy (96–100%) for TPC-C and TPC-E as the signature size sweeps 512–8K; higher is better]
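A cache signature of this kind can be sketched as a small Bloom filter: a bit vector summarizing which block tags are resident, so remote caches can be checked without probing their tag arrays. The vector size and hash functions below are illustrative choices, not the design evaluated in the chart:

```python
class BloomSignature:
    """Toy Bloom-filter cache signature over block tags."""

    def __init__(self, bits=1024):
        self.bits = bits
        self.vector = 0            # bit vector held as a Python int

    def _positions(self, tag):
        # two cheap hypothetical hash functions over the block tag
        return (tag % self.bits,
                (tag * 2654435761 >> 8) % self.bits)

    def insert(self, tag):
        """Record a block as resident (called on a cache fill)."""
        for p in self._positions(tag):
            self.vector |= 1 << p

    def may_contain(self, tag):
        """No false negatives; a small false-positive rate remains."""
        return all(self.vector >> p & 1 for p in self._positions(tag))
```

The one-sided error is what makes the structure usable here: a false positive only costs a wasted migration, while a resident block is never missed by the search, which is why accuracy in the chart stays within a few percent of 100.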