Universityof Illinois http://iacoma.cs.uiuc.edu/
BulkSMT: Designing SMT Processors for
Atomic-Block Execution
Xuehai Qian, Benjamin Sahelices and Josep TorrellasUniversity of Illinois
To appear at HPCA 2012
Thursday, December 8, 2011
http://iacoma.cs.uiuc.eduhttp://iacoma.cs.uiuc.edu
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Motivation
2Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Motivation
• Architectures that continuously execute Atomic Blocks
• Performance and programmability advantages [Hammond 04], [Ahn 10]
• All proposals use single context cores
• What if we used Simultaneous Multithreading (SMT) cores?
• Enable a better utilization of the hardware
• Fast/local communication between contexts
• Enable higher-concurrency forms of atomic block execution
2Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Contributions
3Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Contributions• BulkSMT: first SMT design that supports atomic-block (transactional)
execution
• Enables concurrency between dependent blocks
• Analysis of design space
• SQUASH on conflict
• STALL on conflict
• ORDER on conflict
• Design of a multicore of BulkSMTs
• BulkSMT is cost Effective
• Higher performance for the same core count
• Competitive performance for 1/4 of the cores
3Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Outline
• Motivation
• Designing BulkSMT
• Design Space
• Hardware Mechanisms
• Multicore of BulkSMTs
• Evaluation
4Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Designing SMT For Blocked Execution
5
L1/L2
Contexts
• Version Management:
• Conflict Detection:
• Conflict Resolution:
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Designing SMT For Blocked Execution
5
L1/L2
Contexts
spec wr
No disp
• Version Management:
• Conflict Detection:
• Conflict Resolution:
Eager
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Designing SMT For Blocked Execution
5
L1/L2
Contexts
spec wr
spec rd
No disp
• Version Management:
• Conflict Detection:
• Conflict Resolution:
Eager
Eager
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Designing SMT For Blocked Execution
5
L1/L2
Contexts
spec wr
spec rd
No disp
• SQUASH
• STALL
• ORDER
• Version Management:
• Conflict Detection:
• Conflict Resolution:
Eager
Eager
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
SQUASH Design
6
P0L1$
L2$
wr x
rd x
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
SQUASH Design
6
P0L1$
L2$
wr x
rd x
wr x
rd x
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
SQUASH Design
6
P0L1$
L2$
wr x
rd x
wr x
rd xsquash
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
SQUASH Design
6
P0L1$
L2$
wr x
rd x
wr x
rd x
rd x
squash
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
STALL Design
7
wr x
rd x
P0L1$
L2$
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
STALL Design
7
wr x
rd x
wr x
rd x
P0L1$
L2$
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
STALL Design
7
wr x
rd x
wr x
rd x
...... stall
P0L1$
L2$
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
STALL Design
7
wr x
rd x
wr x
rd x
commit
...... stall
P0L1$
L2$
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
STALL Design
7
wr x
rd x
wr x
rd x
rd xcommit
...... stall
P0L1$
L2$
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
ORDER Design
8
wr x
rd x
P0L1$
L2$
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
ORDER Design
8
wr x
rd x
wr x
rd x
P0L1$
L2$
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
ORDER Design
8
wr x
rd x
wr x
rd x
P0L1$
L2$
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
ORDER Design
8
wr x
rd x
wr x
rd x
commit
P0L1$
L2$
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
ORDER Design
8
wr x
rd x
wr x
rd x
commitcommit
P0L1$
L2$
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
STALL Design Issues
9
wr x
rd x
T0 T1
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
STALL Design Issues
• On a dependence:
• Stall the destination block
• HW records source and destination blocks
• On commit/squash of source block
• Wake up stalled destination block
9
wr x
rd x
T0 T1
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
STALL Design Issues
10Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
STALL Design Issues• Transitive stall OK
• Block allowed to stall on an already stalled block
10Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
STALL Design Issues• Transitive stall OK
• Block allowed to stall on an already stalled block
10
wr y
wr x
rd x
wr y
T0 T1 T2
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
STALL Design Issues• Transitive stall OK
• Block allowed to stall on an already stalled block
10
wr y
wr x
rd x
wr y
T0 T1 T2
T1 wakes up T0T0 wakes up T2
Total order of blocks: T1 ➝ T0 ➝ T2
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
STALL Design Issues• Transitive stall OK
• Block allowed to stall on an already stalled block
10
wr y
wr x
rd x
wr y
T0 T1 T2
T1 wakes up T0T0 wakes up T2
Total order of blocks: T1 ➝ T0 ➝ T2
• Cycle not OK
• When a stall cycle is formed, block that closes the cycle is squashed
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
STALL Design Issues• Transitive stall OK
• Block allowed to stall on an already stalled block
10
wr y
wr x
rd x
wr y
T0 T1 T2
wr y
wr x
rd x
T0 T1
wr y
T1 wakes up T0T0 wakes up T2
Total order of blocks: T1 ➝ T0 ➝ T2
• Cycle not OK
• When a stall cycle is formed, block that closes the cycle is squashed
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
STALL Design Issues• Transitive stall OK
• Block allowed to stall on an already stalled block
10
wr y
wr x
rd x
wr y
T0 T1 T2
wr y
wr x
rd x
T0 T1
wr y
T1 wakes up T0T0 wakes up T2
Total order of blocks: T1 ➝ T0 ➝ T2
Squash T1Total order of blocks:
T0 ➝ T1
• Cycle not OK
• When a stall cycle is formed, block that closes the cycle is squashed
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
ORDER Design Issues (I)
11Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
ORDER Design Issues (I)• On a dependence:
• HW records source and destination block; Both blocks proceed
• HW will enforce the same order in commit
11Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
ORDER Design Issues (I)• On a dependence:
• HW records source and destination block; Both blocks proceed
• HW will enforce the same order in commit
11
• Transitive order is OK
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
ORDER Design Issues (I)• On a dependence:
• HW records source and destination block; Both blocks proceed
• HW will enforce the same order in commit
11
wr y
wr x
rd x
wr y
T0 T1 T2
Total order of blocks: T1 ➝ T0 ➝ T2
• Transitive order is OK
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
ORDER Design Issues (I)• On a dependence:
• HW records source and destination block; Both blocks proceed
• HW will enforce the same order in commit
11
wr y
wr x
rd x
wr y
T0 T1 T2
Total order of blocks: T1 ➝ T0 ➝ T2
• Cycle is not OK
• On cycle, HW breaks it by squashing one or more blocks involved in the cycle
• Transitive order is OK
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
ORDER Design Issues (I)• On a dependence:
• HW records source and destination block; Both blocks proceed
• HW will enforce the same order in commit
11
wr y
wr x
rd x
wr y
T0 T1 T2
wr y
wr x
rd x
T0 T1
wr y
Total order of blocks: T1 ➝ T0 ➝ T2
• Cycle is not OK
• On cycle, HW breaks it by squashing one or more blocks involved in the cycle
• Transitive order is OK
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
ORDER Design Issues (II)
12Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
ORDER Design Issues (II)
• On a cycle: different dependences have different squash requirement
12Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
ORDER Design Issues (II)
• On a cycle: different dependences have different squash requirement
12
wr
T0 T1
rd
RAW
Squash T0 ⇒ Squash T1Squash T1 ⇏ Squash T0
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
ORDER Design Issues (II)
• On a cycle: different dependences have different squash requirement
12
wr
T0 T1
rd
RAW
Squash T0 ⇒ Squash T1Squash T1 ⇏ Squash T0
wr
T0 T1
wr
WAW
Squash T0 ⇒ Squash T1Squash T1 ⇒ Squash T0
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
ORDER Design Issues (II)
• On a cycle: different dependences have different squash requirement
12
wr
T0 T1
rd
RAW
Squash T0 ⇒ Squash T1Squash T1 ⇏ Squash T0
wr
T0 T1
wr
WAW
Squash T0 ⇒ Squash T1Squash T1 ⇒ Squash T0
T0 T1
wr
WAR
Squash T0 ⇏ Squash T1Squash T1 ⇏ Squash T0
rd
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Outline
• Motivation
• Designing BulkSMT
• Design Space
• Hardware Mechanisms
• Multicore of BulkSMTs
• Evaluation
13Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Hardware Mechanisms
14
Mechanism Function Impl. Designs
Access Recording and Conflict
Detection
Cycle Detection
Advanced Conflict Recording
Squash Set Generation
Record the addresses accessed by each thread and detect when two threads
have a data conflict
Access Bits in cache and related
logicAll
Record data conflicts and their ordering, and detect conflict cycles
Dependence Table and Cycle Table
STALLORDER
Represent the type of conflict between different blocks compactly
Enhanced Dependence Table ORDER
On a cycle of conflicts, decide the set of blocks to squash
Logic that operates on the
Dependence TableORDER
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Hardware Mechanisms
14
Mechanism Function Impl. Designs
Access Recording and Conflict
Detection
Cycle Detection
Advanced Conflict Recording
Squash Set Generation
Record the addresses accessed by each thread and detect when two threads
have a data conflict
Access Bits in cache and related
logicAll
Record data conflicts and their ordering, and detect conflict cycles
Dependence Table and Cycle Table
STALLORDER
Represent the type of conflict between different blocks compactly
Enhanced Dependence Table ORDER
On a cycle of conflicts, decide the set of blocks to squash
Logic that operates on the
Dependence TableORDER
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Hardware Mechanisms
14
Mechanism Function Impl. Designs
Access Recording and Conflict
Detection
Cycle Detection
Advanced Conflict Recording
Squash Set Generation
Record the addresses accessed by each thread and detect when two threads
have a data conflict
Access Bits in cache and related
logicAll
Record data conflicts and their ordering, and detect conflict cycles
Dependence Table and Cycle Table
STALLORDER
Represent the type of conflict between different blocks compactly
Enhanced Dependence Table ORDER
On a cycle of conflicts, decide the set of blocks to squash
Logic that operates on the
Dependence TableORDER
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Hardware Mechanisms
14
Mechanism Function Impl. Designs
Access Recording and Conflict
Detection
Cycle Detection
Advanced Conflict Recording
Squash Set Generation
Record the addresses accessed by each thread and detect when two threads
have a data conflict
Access Bits in cache and related
logicAll
Record data conflicts and their ordering, and detect conflict cycles
Dependence Table and Cycle Table
STALLORDER
Represent the type of conflict between different blocks compactly
Enhanced Dependence Table ORDER
On a cycle of conflicts, decide the set of blocks to squash
Logic that operates on the
Dependence TableORDER
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Access Recording• LW: Last Writer context-ID
• R[i]: Read bit-mask
• Sp: Speculative bit
15Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Access Recording• LW: Last Writer context-ID
• R[i]: Read bit-mask
• Sp: Speculative bit
15
• Read from context k: set R[k]
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Access Recording• LW: Last Writer context-ID
• R[i]: Read bit-mask
• Sp: Speculative bit
15
• Read from context k: set R[k]
• Write from context k: Sp ← 1, LW ← k
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Conflict Detection
16Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Conflict Detection
• Read After Write (RAW)
• Load from context k reads cache line and (Sp == 1) && (LW != k)
• Write After Read (WAR)
• ...
• Write After Write (WAW)
• ...
16Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Commit and Squash
17Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Commit and Squash
• On commit of context k:
• Flash clear of the R[k] bit of all the cache lines
• Flash conditional-clear the Sp bits of any line whose (LW == k)
• On Squash (context k):
• ...
17Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Hardware Mechanisms
18
Mechanism Function Impl. Designs
Access Recording and Conflict
Detection
Cycle Detection
Advanced Conflict Recording
Squash Set Generation
Record the addresses accessed by each thread and detect when two threads
have a data conflict
Access Bits in cache and related
logicAll
Record data conflicts and their ordering, and detect conflict cycles
Dependence Table and Cycle Table
STALLORDER
Represent the type of conflict between different blocks compactly
Enhanced Dependence Table ORDER
On a cycle of conflicts, decide the set of blocks to squash
Logic that operates on the
Dependence TableORDER
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Hardware Mechanisms
18
Mechanism Function Impl. Designs
Access Recording and Conflict
Detection
Cycle Detection
Advanced Conflict Recording
Squash Set Generation
Record the addresses accessed by each thread and detect when two threads
have a data conflict
Access Bits in cache and related
logicAll
Record data conflicts and their ordering, and detect conflict cycles
Dependence Table and Cycle Table
STALLORDER
Represent the type of conflict between different blocks compactly
Enhanced Dependence Table ORDER
On a cycle of conflicts, decide the set of blocks to squash
Logic that operates on the
Dependence TableORDER
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Hardware Mechanisms
18
Mechanism Function Impl. Designs
Access Recording and Conflict
Detection
Cycle Detection
Advanced Conflict Recording
Squash Set Generation
Record the addresses accessed by each thread and detect when two threads
have a data conflict
Access Bits in cache and related
logicAll
Record data conflicts and their ordering, and detect conflict cycles
Dependence Table and Cycle Table
STALLORDER
Represent the type of conflict between different blocks compactly
Enhanced Dependence Table ORDER
On a cycle of conflicts, decide the set of blocks to squash
Logic that operates on the
Dependence TableORDER
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Hardware Mechanisms
18
Mechanism Function Impl. Designs
Access Recording and Conflict
Detection
Cycle Detection
Advanced Conflict Recording
Squash Set Generation
Record the addresses accessed by each thread and detect when two threads
have a data conflict
Access Bits in cache and related
logicAll
Record data conflicts and their ordering, and detect conflict cycles
Dependence Table and Cycle Table
STALLORDER
Represent the type of conflict between different blocks compactly
Enhanced Dependence Table ORDER
On a cycle of conflicts, decide the set of blocks to squash
Logic that operates on the
Dependence TableORDER
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Cycle Detection: HW Structures
19
Ti Tj
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Cycle Detection: HW Structures
19
Ti Tj
Dependence Table (DT)
Records ordered dependencesTi ➙ Tj
i
j
n
n (=4)
1
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Cycle Detection: HW Structures
19
Ti Tj
Dependence Table (DT)
Records ordered dependencesTi ➙ Tj
i
j
Cycle Table (CT)
In background: Finds cycles of dependences Cycle = bit in diagonal
i
j
n n
n (=4) n
1 1
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Cycle Detection
20
T0 T1 T2
DT
CT
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Cycle Detection
20
T0 T1 T2
DT
CT
d1
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Cycle Detection
20
T0 T1 T2
DT
CT
d1
d1
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Cycle Detection
20
T0 T1 T2
DT
CT
d1
d1
1
1
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Cycle Detection
20
T0 T1 T2
DT
CT
d1
d1
1
1
d2
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Cycle Detection
20
T0 T1 T2
DT
CT
d1
d1
1
1
d2
DT
CT
1
1
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Cycle Detection
20
T0 T1 T2
DT
CT
d1
d1
1
1
d2
DT
CT
d2
1
1
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Cycle Detection
20
T0 T1 T2
DT
CT
d1
d1
1
1
d2
DT
CT
d2
1
1
1
1
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Cycle Detection
20
T0 T1 T2
DT
CT
d1
d1
1
1
d2
DT
CT
d2
1
1
1
1
Coldst |= ColsrcRowsrc |= Rowdst
Propagation of d2 dependence:
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Cycle Detection
20
T0 T1 T2
DT
CT
d1
d1
1
1
d2
DT
CT
d2
1
1
1
1
CT
11
Col2 |= Col1
Coldst |= ColsrcRowsrc |= Rowdst
Propagation of d2 dependence:
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Cycle Detection
20
T0 T1 T2
DT
CT
d1
d1
1
1
d2
DT
CT
d2
1
1
1
1
CT
11
Col2 |= Col1
Coldst |= ColsrcRowsrc |= Rowdst
Propagation of d2 dependence:
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Cycle Detection
20
T0 T1 T2
DT
CT
d1
d1
1
1
d2
DT
CT
d2
1
1
1
1
CT
11
Col2 |= Col1
1
Coldst |= ColsrcRowsrc |= Rowdst
Propagation of d2 dependence:
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Cycle Detection
20
T0 T1 T2
DT
CT
d1
d1
1
1
d2
DT
CT
d2
1
1
1
1
CT
dA
11
Col2 |= Col1
1
dA
Coldst |= ColsrcRowsrc |= Rowdst
Propagation of d2 dependence:
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Cycle Detection
20
T0 T1 T2
DT
CT
d1
d1
1
1
d2
DT
CT
d2
1
1
1
1
CT
dA
11
Col2 |= Col1
1
dA
CT
11
Row1 |= Row2
1
Coldst |= ColsrcRowsrc |= Rowdst
Propagation of d2 dependence:
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Cycle Detection
21
T0 T1
CT
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Cycle Detection
21
T0 T1
CT
d1
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Cycle Detection
21
T0 T1
CT
d1
d1
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Cycle Detection
21
T0 T1
CT
d1
d1
1
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Cycle Detection
21
T0 T1
CT
d1
d1
1
d2
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Cycle Detection
21
T0 T1
CT
d1
d1
1
CT
1
d2
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Cycle Detection
21
T0 T1
CT
d1
d1
1
CT
d2
1
d2
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Cycle Detection
21
T0 T1
CT
d1
d1
1
CT
d2
11
d2
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Cycle Detection
21
T0 T1
CT
d1
d1
1
CT
d2
11
CT
11
d2
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Cycle Detection
21
T0 T1
CT
d1
d1
1
CT
d2
11
CT
11
d2
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Cycle Detection
21
T0 T1
CT
d1
d1
1
CT
d2
11
CT
111
d2
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Cycle Detection
21
T0 T1
CT
d1
d1
1
CT
d2
11
CT
dA
111
d2
dA
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Issues Related to CT and DT
22Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Issues Related to CT and DT
• CT is not time-critical structure: is updated in the background
• OK to find a cycle some time later (DT is always up-to-date)
• Update of CT is recursive
• Terminates when any bit in diagonal is set or CT is no longer changed
• Always detect precise cycle, no false positive
22Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Hardware Mechanisms
23
Mechanism Function Impl. Designs
Access Recording and Conflict
Detection
Cycle Detection
Advanced Conflict Recording
Squash Set Generation
Record the addresses accessed by each thread and detect when two threads
have a data conflict
Access Bits in cache and related
logicAll
Record data conflicts and their ordering, and detect conflict cycles
Dependence Table and Cycle Table
STALLORDER
Represent the type of conflict between different blocks compactly
Enhanced Dependence Table ORDER
On a cycle of conflicts, decide the set of blocks to squash
Logic that operates on the
Dependence TableORDER
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Hardware Mechanisms
23
Mechanism Function Impl. Designs
Access Recording and Conflict
Detection
Cycle Detection
Advanced Conflict Recording
Squash Set Generation
Record the addresses accessed by each thread and detect when two threads
have a data conflict
Access Bits in cache and related
logicAll
Record data conflicts and their ordering, and detect conflict cycles
Dependence Table and Cycle Table
STALLORDER
Represent the type of conflict between different blocks compactly
Enhanced Dependence Table ORDER
On a cycle of conflicts, decide the set of blocks to squash
Logic that operates on the
Dependence TableORDER
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Hardware Mechanisms
23
Mechanism Function Impl. Designs
Access Recording and Conflict
Detection
Cycle Detection
Advanced Conflict Recording
Squash Set Generation
Record the addresses accessed by each thread and detect when two threads
have a data conflict
Access Bits in cache and related
logicAll
Record data conflicts and their ordering, and detect conflict cycles
Dependence Table and Cycle Table
STALLORDER
Represent the type of conflict between different blocks compactly
Enhanced Dependence Table ORDER
On a cycle of conflicts, decide the set of blocks to squash
Logic that operates on the
Dependence TableORDER
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Hardware Mechanisms
23
Mechanism Function Impl. Designs
Access Recording and Conflict
Detection
Cycle Detection
Advanced Conflict Recording
Squash Set Generation
Record the addresses accessed by each thread and detect when two threads
have a data conflict
Access Bits in cache and related
logicAll
Record data conflicts and their ordering, and detect conflict cycles
Dependence Table and Cycle Table
STALLORDER
Represent the type of conflict between different blocks compactly
Enhanced Dependence Table ORDER
On a cycle of conflicts, decide the set of blocks to squash
Logic that operates on the
Dependence TableORDER
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Advanced Conflict Recording
24
# Context
# Co
ntex
t
RAWWARWAW
Enhanced Dependence Table
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Hardware Mechanisms
25
Mechanism Function Impl. Designs
Access Recording and Conflict
Detection
Cycle Detection
Advanced Conflict Recording
Squash Set Generation
Record the addresses accessed by each thread and detect when two threads
have a data conflict
Access Bits in cache and related
logicAll
Record data conflicts and their ordering, and detect conflict cycles
Dependence Table and Cycle Table
STALLORDER
Represent the type of conflict between different blocks compactly
Enhanced Dependence Table ORDER
On a cycle of conflicts, decide the set of blocks to squash
Logic that operates on the
Dependence TableORDER
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Hardware Mechanisms
25
Mechanism Function Impl. Designs
Access Recording and Conflict
Detection
Cycle Detection
Advanced Conflict Recording
Squash Set Generation
Record the addresses accessed by each thread and detect when two threads
have a data conflict
Access Bits in cache and related
logicAll
Record data conflicts and their ordering, and detect conflict cycles
Dependence Table and Cycle Table
STALLORDER
Represent the type of conflict between different blocks compactly
Enhanced Dependence Table ORDER
On a cycle of conflicts, decide the set of blocks to squash
Logic that operates on the
Dependence TableORDER
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Hardware Mechanisms
25
Mechanism Function Impl. Designs
Access Recording and Conflict
Detection
Cycle Detection
Advanced Conflict Recording
Squash Set Generation
Record the addresses accessed by each thread and detect when two threads
have a data conflict
Access Bits in cache and related
logicAll
Record data conflicts and their ordering, and detect conflict cycles
Dependence Table and Cycle Table
STALLORDER
Represent the type of conflict between different blocks compactly
Enhanced Dependence Table ORDER
On a cycle of conflicts, decide the set of blocks to squash
Logic that operates on the
Dependence TableORDER
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Hardware Mechanisms
25
Mechanism Function Impl. Designs
Access Recording and Conflict
Detection
Cycle Detection
Advanced Conflict Recording
Squash Set Generation
Record the addresses accessed by each thread and detect when two threads
have a data conflict
Access Bits in cache and related
logicAll
Record data conflicts and their ordering, and detect conflict cycles
Dependence Table and Cycle Table
STALLORDER
Represent the type of conflict between different blocks compactly
Enhanced Dependence Table ORDER
On a cycle of conflicts, decide the set of blocks to squash
Logic that operates on the
Dependence TableORDER
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Set of Blocks to Squash on a Cycle
26
T0 T1 T2
wr
wr
wr
rd
rd
RAW WAR
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Set of Blocks to Squash on a Cycle
26
T0 T1 T2
wr
wr
wr
rd
rd
rd
RAW
RAW
WAR
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Set of Blocks to Squash on a Cycle
26
T0 T1 T2
wr
wr
wr
rd
rd
rd
RAW
RAW
WAR
• Squash the thread that closes the cycle (T0)
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Set of Blocks to Squash on a Cycle
26
T0 T1 T2
wr
wr
wr
rd
rd
rd
RAW
RAW
WAR
• Squash the thread that closes the cycle (T0)
• Forward propagate from T0
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Set of Blocks to Squash on a Cycle
26
T0 T1 T2
wr
wr
wr
rd
rd
rd
RAW
RAW
WAR
• Squash the thread that closes the cycle (T0)
• Forward propagate from T0
• T0 ➙ T1: squash T1
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Set of Blocks to Squash on a Cycle
26
T0 T1 T2
wr
wr
wr
rd
rd
rd
RAW
RAW
WAR
• Squash the thread that closes the cycle (T0)
• Forward propagate from T0
• T0 ➙ T1: squash T1
• T1 ➙ T2: WAR, no propagate
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Set of Blocks to Squash on a Cycle
26
T0 T1 T2
wr
wr
wr
rd
rd
rd
RAW
RAW
WAR
• Squash the thread that closes the cycle (T0)
• Forward propagate from T0
• T0 ➙ T1: squash T1
• T1 ➙ T2: WAR, no propagate
• Backward propagate from T0
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Set of Blocks to Squash on a Cycle
26
T0 T1 T2
wr
wr
wr
rd
rd
rd
RAW
RAW
WAR
• Squash the thread that closes the cycle (T0)
• Forward propagate from T0
• T0 ➙ T1: squash T1
• T1 ➙ T2: WAR, no propagate
• Backward propagate from T0
• T2 ➙ T0: RAW, no propagate
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Outline
• Motivation
• Designing BulkSMT
• Design Space
• Hardware Mechanisms
• Multicore of BulkSMTs
• Evaluation
27Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Multicore of BulkSMTs
28Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Multicore of BulkSMTs
28
• Eager Scheme (E)
• Block wants to commit
• Commit globally first, then commit locally
• Reception of cache invalidation
• Check the cache access bits; may squash multiple blocks
• Lazy Scheme (L)
• ......
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Outline
• Motivation
• Designing BulkSMT
• Design Space
• Hardware Mechanisms
• Multicore of BulkSMTs
• Evaluation
29Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Evaluation
• Cycle-accurate execution-driven simulator based on SESC
• 6 SPLASH-2 and 2 PARSEC applications
• Applications run with 1, 4, or 16 threads
• Compare performance:
• Using same number of cores, diff number of threads
• Using same number of threads, diff number of cores
30Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution 31
BK
Barnes Cholesky Fluidanimate Ocean Radiosity Radix Raytrace Streamcluster GeoMean0
1.0
2.0
Exec
utio
n Ti
me
BK
BK BK
BK BK
BK
BK
BK
BK
SQ SQ SQ SQ SQ SQ SQ SQ SQST ST
ST ST
ST
ST ST
ST
ST
OR
OR
OR
OR O
R OR
OR O
R OR
UsefulProcPipe
ProcMemBlockStall
SquashedGeoMeanExecution Time (Same HW, 1 Core)
BKSQ,ST,OR
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution 31
BK
Barnes Cholesky Fluidanimate Ocean Radiosity Radix Raytrace Streamcluster GeoMean0
1.0
2.0
Exec
utio
n Ti
me
BK
BK BK
BK BK
BK
BK
BK
BK
SQ SQ SQ SQ SQ SQ SQ SQ SQST ST
ST ST
ST
ST ST
ST
ST
OR
OR
OR
OR O
R OR
OR O
R OR
UsefulProcPipe
ProcMemBlockStall
SquashedGeoMean
• BulkSMT more cost effective: faster for the same HW
• BulkSMT ORDER is the best
Execution Time (Same HW, 1 Core)
BKSQ,ST,OR
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
SQ-EE
Barnes Cholesky Fluidanimate Ocean Radiosity Radix Raytrace Streamcluster GeoMean0
1.0
2.0
3.0Ex
ecut
ion
Tim
e
BK-L
L
BK-L
L
BK-L
L
BK-L
L
BK-L
L
BK-L
L
BK-L
L
BK-L
L
BK-L
LSQ
-LL
SQ-L
L
SQ-L
L
SQ-L
L
SQ-L
L
SQ-L
L
SQ-L
L
SQ-L
L
SQ-L
L
ST-L
L ST-L
L
ST-L
L
ST-L
L
ST-L
L
ST-L
L
ST-L
L
ST-L
L ST-L
L
OR-
LL
OR-
LL
OR-
LL
OR-
LL OR-
LL
OR-
LL
OR-
LL
OR-
LL
OR-
LL
BK-E
E
BK-E
E
BK-E
E
BK-E
E
BK-E
E BK-E
E
BK-E
E BK-E
E BK-E
ESQ-E
E
SQ-E
E
SQ-E
E SQ-E
E
SQ-E
E
SQ-E
E
SQ-E
E
SQ-E
E
SQ-E
E
ST-E
E ST-E
E
ST-E
E
ST-E
E
ST-E
E
ST-E
E
ST-E
E ST-
EE ST-E
E
OR-
EE
OR-
EE
OR-
EE
OR-
EE OR-
EE
OR-
EE
OR-
EE OR-
EE
OR-
EE
UsefulProcPipe
ProcMemBlockStall
SquashedGeoMean
32
Execution Time (Same HW, 4 Cores)
BK-{LL,EE}{SQ,ST,OR}-{LL,EE}
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
• BulkSMT ORDER is best
• BulkSMT limited by application scalability
SQ-EE
Barnes Cholesky Fluidanimate Ocean Radiosity Radix Raytrace Streamcluster GeoMean0
1.0
2.0
3.0Ex
ecut
ion
Tim
e
BK-L
L
BK-L
L
BK-L
L
BK-L
L
BK-L
L
BK-L
L
BK-L
L
BK-L
L
BK-L
LSQ
-LL
SQ-L
L
SQ-L
L
SQ-L
L
SQ-L
L
SQ-L
L
SQ-L
L
SQ-L
L
SQ-L
L
ST-L
L ST-L
L
ST-L
L
ST-L
L
ST-L
L
ST-L
L
ST-L
L
ST-L
L ST-L
L
OR-
LL
OR-
LL
OR-
LL
OR-
LL OR-
LL
OR-
LL
OR-
LL
OR-
LL
OR-
LL
BK-E
E
BK-E
E
BK-E
E
BK-E
E
BK-E
E BK-E
E
BK-E
E BK-E
E BK-E
ESQ-E
E
SQ-E
E
SQ-E
E SQ-E
E
SQ-E
E
SQ-E
E
SQ-E
E
SQ-E
E
SQ-E
E
ST-E
E ST-E
E
ST-E
E
ST-E
E
ST-E
E
ST-E
E
ST-E
E ST-
EE ST-E
E
OR-
EE
OR-
EE
OR-
EE
OR-
EE OR-
EE
OR-
EE
OR-
EE OR-
EE
OR-
EE
UsefulProcPipe
ProcMemBlockStall
SquashedGeoMean
32
Execution Time (Same HW, 4 Cores)
BK-{LL,EE}{SQ,ST,OR}-{LL,EE}
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution 33
SQ-EE
Barnes Cholesky Fluidanimate Ocean Radiosity Radix Raytrace Streamcluster GeoMean0
1.0
2.0
3.0
Exec
utio
n Ti
me
BK-L
L
BK-L
L
BK-L
L
BK-L
L BK-L
L
BK-L
L
BK-L
L BK-L
L
BK-L
L
SQ-L
L
SQ-L
L
SQ-L
L
SQ-L
L
SQ-L
L
SQ-L
L
SQ-L
L
SQ-L
L
SQ-L
L
ST-L
L
ST-L
L
ST-L
L
ST-L
L
ST-L
L
ST-L
L
ST-L
L
ST-L
L ST-L
L
OR-
LL
OR-
LL
OR-
LL
OR-
LL OR-
LL
OR-
LL
OR-
LL
OR-
LL
OR-
LL
BK-E
E BK-E
E
BK-E
E
BK-E
E
BK-E
E
BK-E
E
BK-E
E
BK-E
E
BK-E
ESQ
-EE
SQ-E
E
SQ-E
E SQ-E
E
SQ-E
E
SQ-E
E
SQ-E
E
SQ-E
E
SQ-E
E
ST-E
E ST-E
E
ST-E
E
ST-E
E
ST-E
E
ST-E
E
ST-E
E ST-
EE ST-E
E
OR-
EE
OR-
EE
OR-
EE
OR-
EE OR-
EE
OR-
EE
OR-
EE OR-
EE
OR-
EE
UsefulProcPipe
ProcMemBlockStall
SquashedGeoMean
Execution Time (Same Threads, 16)
BK-{LL,EE}{SQ,ST,OR}-{LL,EE}
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution 33
SQ-EE
Barnes Cholesky Fluidanimate Ocean Radiosity Radix Raytrace Streamcluster GeoMean0
1.0
2.0
3.0
Exec
utio
n Ti
me
BK-L
L
BK-L
L
BK-L
L
BK-L
L BK-L
L
BK-L
L
BK-L
L BK-L
L
BK-L
L
SQ-L
L
SQ-L
L
SQ-L
L
SQ-L
L
SQ-L
L
SQ-L
L
SQ-L
L
SQ-L
L
SQ-L
L
ST-L
L
ST-L
L
ST-L
L
ST-L
L
ST-L
L
ST-L
L
ST-L
L
ST-L
L ST-L
L
OR-
LL
OR-
LL
OR-
LL
OR-
LL OR-
LL
OR-
LL
OR-
LL
OR-
LL
OR-
LL
BK-E
E BK-E
E
BK-E
E
BK-E
E
BK-E
E
BK-E
E
BK-E
E
BK-E
E
BK-E
ESQ
-EE
SQ-E
E
SQ-E
E SQ-E
E
SQ-E
E
SQ-E
E
SQ-E
E
SQ-E
E
SQ-E
E
ST-E
E ST-E
E
ST-E
E
ST-E
E
ST-E
E
ST-E
E
ST-E
E ST-
EE ST-E
E
OR-
EE
OR-
EE
OR-
EE
OR-
EE OR-
EE
OR-
EE
OR-
EE OR-
EE
OR-
EE
UsefulProcPipe
ProcMemBlockStall
SquashedGeoMean
• BulkSMT multicore: using 1/4 hardware and achieves better performance
Execution Time (Same Threads, 16)
BK-{LL,EE}{SQ,ST,OR}-{LL,EE}
Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Also in the paper
34Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Also in the paper
• Implementation issues
• Handling high-contention synchornization
• Other characteristics of execution behavior
• Related work
34Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Conclusion
35Thursday, December 8, 2011
Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution
Conclusion
• Proposed BulkSMT: first SMT design that supports atomic-block (transactional) execution
• Enables concurrency between dependent blocks
• Proposed designs of different concurrency vs cost
• SQUASH on conflict
• STALL on conflict
• ORDER on conflict
• Designed a multicore of BulkSMTs
• BulkSMT is cost Effective
• Higher performance for the same core count
• Competitive performance for 1/4 of the cores
35Thursday, December 8, 2011
Universityof Illinois http://iacoma.cs.uiuc.edu/
BulkSMT: Designing SMT Processors for
Atomic-Block Execution
Xuehai Qian, Benjamin Sahelices and Josep TorrellasUniversity of Illinois
To appear at HPCA 2012
Thursday, December 8, 2011
http://iacoma.cs.uiuc.eduhttp://iacoma.cs.uiuc.edu