Scheduling Reusable Instructions for Power Reduction. J.S. Hu, N. Vijaykrishnan, S. Kim, M. Kandemir, and M.J. Irwin. Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, Volume 1, Pages 148-153, Feb. 2004.



Scheduling Reusable Instructions for Power Reduction

J.S. Hu, N. Vijaykrishnan, S. Kim, M. Kandemir, and M.J. Irwin

Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, Volume 1, Pages 148-153, Feb. 2004


Abstract

In this paper, we propose a new issue queue design that is capable of scheduling reusable instructions. Once the issue queue is reusing instructions, no instruction cache access is needed, since the instructions are supplied by the issue queue itself. Furthermore, dynamic branch prediction and instruction decoding can also be avoided, permitting the gating of the front-end stages of the pipeline (the stages before register renaming). Results using array-intensive codes show that the pipeline front-end can be gated for up to 82% of the total execution cycles, providing a power reduction of 72% in the instruction cache, 33% in the branch predictor, and 21% in the issue queue, at a small performance cost. Our analysis of compiler optimizations indicates that the power savings can be further improved by using optimized code.


Outline

What's the problem
Introduction
Implementation foundation and analysis
Proposed method and architecture
Experimental results and evaluation
Conclusions


What’s the Problem

Power is not only a design limiter but also a major constraint in embedded systems design

The front-end of the pipeline (the stages before register renaming) is a power-hungry component of an embedded microprocessor
For the StrongARM, about 27% of the power dissipation comes from instruction cache accesses
Furthermore, sophisticated branch predictors are also very power-consuming

Therefore, optimizing the power consumption of the pipeline front-end becomes one of the most challenging issues in embedded processor design


Introduction

In recent years, several techniques have been proposed to reduce the power consumption in the pipeline front-end:

Stage-skip pipeline
Utilizes a decoded instruction buffer (DIB) to temporarily store decoded loop instructions for later reuse
Disadvantages:

1) Requires ISA modification

2) Needs an additional instruction buffer

3) Buffers only one iteration of the loop (performance loss)

Loop caches
Dynamic/preloaded loop caches

Disadvantages:

1) Need an additional loop cache

2) Buffer only one iteration of the loop (performance loss)


Introduction (cont.)

Filter cache
Uses a smaller level-zero cache to capture tight spatial/temporal locality in cache accesses

The proposed approach
New issue queue design based on a superscalar architecture
Schedules reusable loop instructions within the issue queue
No need for an additional instruction buffer
Utilizes the existing issue queue resources
Automatically unrolls loops in the issue queue
No ISA modification
Able to gate the front-end of the pipeline
Addresses the power problem in the front-end of the pipeline


The Baseline Datapath Based on a MIPS Core

(a) The Baseline Datapath Model of the MIPS R10000

(b) The Pipeline Stages of the Baseline Superscalar Microprocessor


Implementation Analysis

Reusable instructions are mainly those belonging to loop structures that are repeatedly executed

The new issue queue design consists of the following four parts:
A loop structure detector
A mechanism to buffer the reusable instructions within the issue queue
A scheduling mechanism to reuse those buffered instructions in their program order
A recovery scheme from the reuse state back to the normal state


Issue Queue State Transition

Unsuccessful buffering is revoked
A change of control flow in a loop being buffered causes buffering to be revoked
The front-end of the pipeline is gated during the Code_Reuse state
A misprediction due to normally exiting a loop restores the issue queue to the Normal state
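The transitions above can be sketched as a small three-state controller. This is a minimal Python sketch; the event names are hypothetical labels for the conditions listed on this slide, not names from the paper:

```python
from enum import Enum, auto

class IQState(Enum):
    NORMAL = auto()          # regular fetch/decode/issue
    LOOP_BUFFERING = auto()  # a capturable loop is being buffered
    CODE_REUSE = auto()      # front-end gated; instructions replayed from the queue

def next_state(state, event):
    """Transition table for the issue queue controller (hypothetical event names)."""
    table = {
        (IQState.NORMAL, "loop_detected"): IQState.LOOP_BUFFERING,
        (IQState.LOOP_BUFFERING, "buffering_complete"): IQState.CODE_REUSE,
        # buffering is revoked if control flow changes inside the loop
        (IQState.LOOP_BUFFERING, "control_flow_change"): IQState.NORMAL,
        # a misprediction (e.g. the loop finally exits) restores Normal state
        (IQState.CODE_REUSE, "misprediction"): IQState.NORMAL,
    }
    return table.get((state, event), state)
```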


Detecting Reusable Loop Structures

Conditional branch and direct jump instructions may form the last instruction of a loop iteration

Logic is added to check:

(a) Backward branch/jump (b) Loop size no larger than the issue queue size

This check is performed at the decode stage using the predicted target address

After a capturable loop is detected, buffering starts as the second iteration begins
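The two checks can be sketched as follows. This is a minimal sketch, assuming a fixed-width 4-byte encoding (MIPS-like, consistent with the baseline core) so that loop size in instructions can be derived from the PC distance; the function name and interface are illustrative, not from the paper:

```python
def is_capturable_loop(branch_pc, target_pc, issue_queue_size, inst_bytes=4):
    """Decode-stage check for a capturable loop, using the predicted target.

    (a) Backward branch/jump: the predicted target precedes the branch.
    (b) The loop body (target through branch, inclusive) fits in the issue queue.
    inst_bytes=4 assumes a fixed-width MIPS-like instruction encoding.
    """
    if target_pc >= branch_pc:
        return False  # not a backward branch/jump
    loop_size = (branch_pc - target_pc) // inst_bytes + 1  # instructions in the loop
    return loop_size <= issue_queue_size
```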


Buffering Reusable Instructions

Extends the issue queue micro-architecture

Buffering a reusable instruction requires several operations:
The Classification Bit is set
With the Classification Bit set, the instruction is not removed from the issue queue even after it has been issued
The Issue State Bit is reset to zero
The logical register numbers are recorded in the Logic Register List (LRL)
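The three buffering operations can be sketched on a simplified queue entry. This is a minimal sketch; the entry fields mirror the bits named above, but the class and function names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class IQEntry:
    opcode: str
    logical_regs: tuple            # logical source/destination register numbers
    classification_bit: bool = False  # True: entry is kept after issue (reusable)
    issue_state_bit: int = 0          # set to 1 once the instruction has issued

def buffer_instruction(entry, lrl):
    """Mark one decoded instruction as reusable (the three operations above)."""
    entry.classification_bit = True  # keep the entry in the queue after issue
    entry.issue_state_bit = 0        # not yet issued in this iteration
    lrl.append(entry.logical_regs)   # record logical regs in the Logic Register List
```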


Strategy of When to Stop Buffering

Buffering only one iteration of the loop
Advantage:
Enters the Code_Reuse state and gates the pipeline front-end much earlier (gaining much more power reduction)

Buffering multiple iterations of the loop
Advantages:
Automatically unrolls the loop to exploit more ILP
Uses the issue queue resources more effectively

Although the second strategy does not gate the pipeline front-end as fast as the first, we choose the second one for the sake of performance


Optimizing Loop Buffering Strategy

During the Loop_Buffering state, if
An inner loop is detected, or
The execution exits the current loop, or
A procedure call within the loop uses up the issue queue before the loop end is met,
then the current loop is identified as a non-bufferable loop

Example of a non-bufferable loop

A Non-Bufferable Loop Table (NBLT) is used to store non-bufferable loops

If a detected loop appears in the NBLT, no buffering is attempted for this loop

With this optimization, the issue queue can avoid most of the buffering of non-bufferable loops
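The NBLT lookup can be sketched as a simple learned filter. This is a minimal sketch; `hits_non_bufferable_case` is a hypothetical stand-in for the three runtime conditions listed above (inner loop, early exit, or a call exhausting the queue), and the NBLT is modeled as a set of loop start PCs:

```python
def try_buffer_loop(loop_pc, nblt, hits_non_bufferable_case):
    """Consult the Non-Bufferable Loop Table (NBLT) before buffering a loop.

    loop_pc: start PC identifying the detected loop.
    nblt: set of PCs of loops previously found non-bufferable.
    hits_non_bufferable_case(pc): hypothetical predicate for the runtime checks.
    Returns True if buffering succeeds, False if it is skipped or revoked.
    """
    if loop_pc in nblt:
        return False                  # known non-bufferable: skip buffering entirely
    if hits_non_bufferable_case(loop_pc):
        nblt.add(loop_pc)             # remember it, so buffering is not retried
        return False                  # buffering revoked this time
    return True                       # buffering succeeds
```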


Reusing Buffered Instructions

During the Code_Reuse state, instruction cache access and instruction decoding are disabled

A mechanism is needed to reuse the buffered instructions already in the issue queue
Thus, the instructions are supplied by the issue queue itself


Reusing Buffered Instructions (cont.)

Utilizes a reuse pointer to scan for instructions to be reused
The issue state bits of the first n instructions starting from the reuse pointer are checked; if they are set, the logical register numbers are sent for register renaming to reuse those instructions, and the reuse pointer is then advanced by n to scan instructions for the next cycle

Renamed instructions update their corresponding entries (e.g., register information) in the issue queue
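One cycle of the reuse scan can be sketched as below. This is a minimal sketch over a list of simplified entries; the dict fields and the stall-on-unissued behavior are illustrative modeling choices, not details confirmed by the paper:

```python
def reuse_scan(entries, reuse_ptr, n):
    """One cycle of the reuse scan over the buffered loop body.

    entries: list of dicts {'issued': bool, 'regs': logical register numbers}.
    If the next n entries from reuse_ptr have all issued, their logical registers
    are re-sent for renaming, their issue state is cleared so they can issue
    again, and the pointer advances by n, wrapping to the first buffered entry.
    """
    window = [entries[(reuse_ptr + i) % len(entries)] for i in range(n)]
    if not all(e["issued"] for e in window):
        return reuse_ptr, []          # stall: some instruction has not issued yet
    for e in window:
        e["issued"] = False           # cleared so the entry can issue again
    regs = [e["regs"] for e in window]
    return (reuse_ptr + n) % len(entries), regs
```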



Reusing Buffered Instructions (cont.)

The reuse pointer is automatically reset to the position of the first buffered instruction after the last buffered instruction is reused

This process repeats until a branch misprediction is detected
A misprediction due to normally exiting a loop restores the issue queue to the Normal state

During the Code_Reuse state, dynamic branch prediction is disabled
Branch instructions are statically predicted
Static prediction works well since branches within loops are normally highly biased toward one direction

The static prediction is still verified after the branch completes
If the static prediction is found to be incorrect (a misprediction) during this verification, the issue queue exits the Code_Reuse state and restores to the Normal state


Restoring Normal State

Recovery process for revoking the current buffering state
The classification bit and issue state bit of each instruction are checked; if both are set, the instruction is immediately removed from the issue queue
The remaining instructions' classification bits are then cleared

Recovery process due to a misprediction in the Loop_Buffering state
Besides the above process, instructions newer than the branch are removed from the issue queue and the ROB

Recovery process due to a misprediction in the Code_Reuse state
The classification bit and issue state bit of each instruction are checked; if both are set, the instruction is immediately removed from the issue queue
The remaining instructions' classification bits are then cleared
Instructions newer than the branch are removed from the issue queue and the ROB
The gating signal is reset
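The shared core of these recovery processes can be sketched as one filtering pass. This is a minimal sketch; 'seq' is a hypothetical age tag (larger = younger) standing in for the hardware's ordering of instructions relative to the mispredicted branch:

```python
def restore_normal(entries, mispredicted_seq=None):
    """Recovery sketch over simplified issue queue entries.

    entries: list of dicts {'class_bit', 'issued', 'seq'}.
    Entries kept only for reuse (both bits set) are dropped; on a misprediction,
    entries younger than the branch are also squashed; the classification bits
    of all surviving entries are cleared, returning the queue to Normal state.
    """
    kept = []
    for e in entries:
        if e["class_bit"] and e["issued"]:
            continue                  # reusable and already issued: remove
        if mispredicted_seq is not None and e["seq"] > mispredicted_seq:
            continue                  # younger than the mispredicted branch: squash
        e["class_bit"] = False        # back to a normal, removable entry
        kept.append(e)
    return kept
```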


Results - Rate of Gated Front End

On average, the pipeline front-end gated rate increases from 42% to 82% as the issue queue size increases

However, increasing the issue queue size doesn't always improve the gated rate (e.g., tsf and wss)
A larger issue queue buffers more iterations and delays the pipeline gating


Results - Power Savings in Front End

On average, as the issue queue size increases, power reductions of
35% - 72% in the ICache
19% - 33% in the branch predictor
12% - 21% in the issue queue (due to partial update)
with < 2% overhead (due to the supporting logic)


Results - Overall Power Reduction

On average, the overall power reduction improves from 8% to 12% as the issue queue size increases

But for some configurations the overall power increases (e.g., a large loop structure with a small issue queue size)
The front-end is not gated, yet power is consumed in the supporting logic


Results - Performance Loss

The average performance loss ranges from 0.2% to 4% as the issue queue size increases
This is due to the not-fully-utilized issue queue, since we buffer an integer number of iterations of the loops


Results - Impact of Compiler Optimizations

A larger loop structure can hardly be captured with a small issue queue

Loop distribution is performed to reduce the size of the loop body
Breaks a loop into two or more smaller loops
Gears the loop code towards a given issue queue size

Optimized code increases the power savings from 8% to 13% with an issue queue size of 64 entries


Conclusions

Proposed a new issue queue architecture that can
Detect capturable loop code
Buffer loop code in the issue queue
Reuse those buffered loop instructions

Significant power reduction in the pipeline front-end components (e.g., ICache, Bpred, and IDecoder) while gated

Compiler optimizations can further improve the power savings