
1

Applying Perceptrons to Speculation in Computer Architecture

Michael Black
Dissertation Defense
April 2, 2007

2

Presentation Outline

Background and Objectives
Perceptron behavior
Local value prediction
Global value prediction
Criticality prediction
Conclusions

3

Motivation: Jimenez’s Perceptron Branch Predictor

27% reduction in mispredictions over gshare
15.8% increase in performance over gshare¹

Why better? It can consider longer histories.

¹Jimenez and Lin, "Dynamic Branch Prediction with Perceptrons," 2002.

4

Problem of Lookup Tables

Table size grows exponentially with history length

Result: only a small subset of the available data can be considered

5

Global vs. Local

Local history: past iterations of same instruction

Global history: all past dynamic instructions

6

Perceptron

Prediction:
1. Compute the dot product of the binary inputs and the integer weights
2. Apply a threshold: if positive, predict 1; if negative, predict 0

Learning objective: weight values should reflect each input's correlation with the outcome
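A minimal sketch of this prediction step (Python is used for illustration only; the bipolar mapping of history bits to +1/-1 is the usual convention and an assumption here):

    def predict(weights, history):
        # Dot product of encoded history bits and integer weights
        total = sum(w * (1 if bit else -1) for w, bit in zip(weights, history))
        # Threshold: non-negative sum predicts 1, negative predicts 0
        return 1 if total >= 0 else 0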

7

Training strategies

Training by correlation:
    if actual == input_k: w_k++
    else: w_k--

Training by error:
    error = actual - predicted
    w_k = w_k + input_k * error
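Hedged sketches of both update rules (the 0/1 outcome encoding and its mapping to +1/-1 inputs are assumptions):

    def train_by_correlation(weights, history, actual):
        # Strengthen each weight whose input bit agreed with the outcome
        for k, bit in enumerate(history):
            weights[k] += 1 if bit == actual else -1

    def train_by_error(weights, history, actual, predicted):
        # error is zero on a correct prediction, so weights move only on mistakes
        error = actual - predicted
        for k, bit in enumerate(history):
            weights[k] += (1 if bit else -1) * error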

8

Linear Separability

A weight can learn only one correlation: direct (positive) or inverse (negative). Functions that mix both for the same input, such as XOR, are not linearly separable and cannot be learned.

9

Dissertation Objectives

Analyze behavior of perceptrons when used to replace tables
Coping with limitations of perceptrons and their implementations
Applying perceptrons to value prediction
Applying perceptrons to criticality prediction

10

Dissertation Contributions

Perceptron Local Value Predictor can consider longer local histories

Perceptron Global-based Local Value Predictor can use global information to choose local values

Two Perceptron Global Value Predictors

Perceptron Global Criticality Predictor

Comparison and analysis of:
    perceptron training approaches
    multiple-bit topologies
    interference reduction strategies

11

Analyses

How perceptrons behave when replacing tables

What effect the training approach has

Design and behavior of different multiple-bit perceptrons

Dealing with history interference

12

Context-based Learning

Concatenated history pattern (“context”) indexes table
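For contrast with the perceptron approach, a minimal sketch of context-based lookup; the last-outcome update policy shown is an assumed simplification:

    table = {}  # context pattern -> prediction; 2^n entries in hardware

    def table_predict(history):
        pattern = tuple(history)        # concatenated history: the "context"
        return table.get(pattern, 0)    # default prediction for unseen contexts

    def table_train(history, actual):
        table[tuple(history)] = actual  # remember the outcome for this context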

13

Pattern Compatibility

14

What affects perceptron learning?

Noise from uncorrelated inputs
Imbalance between pattern occurrences
False correlations

Effects:
Perceptron takes longer to learn
Perceptron never learns

15

Noise

Training by correlation: weights grow large rapidly, so it is less susceptible

Training by error: weights don't grow until a misprediction occurs, so it is susceptible

Solution? Exponential weight growth

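The slide names the scheme without giving its rule, so the following is only one plausible reading of exponential weight growth (an assumption, not the dissertation's exact update): agreement doubles a weight's magnitude, so genuinely correlated inputs quickly outgrow noise:

    def train_exponential(weights, history, actual):
        # ASSUMED rule: double the weight on agreement, halve it on disagreement
        for k, bit in enumerate(history):
            agree = (bit == actual)
            if weights[k] == 0:
                weights[k] = 1 if agree else -1
            elif (weights[k] > 0) == agree:
                weights[k] *= 2                   # reinforce exponentially
            else:
                weights[k] = int(weights[k] / 2)  # decay back toward zero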

16

Studying Noise

Perceptron modeled independently of any application; p random patterns chosen for each level of correlation:

At n bits correlated, a random correlation direction (direct/inverse) is chosen for each of the n bits

Target randomly chosen for each pattern; the correlation directions determine the first n bits of each pattern

Remaining bits chosen randomly for each pattern

Perceptron is trained on each pattern set; the average training time over 1000 random pattern sets is plotted

Pattern set generation for n=4, p=2:

directions: ddid

1101xxxx – 1
0010xxxx – 0

11010101 – 1
00101110 – 0
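A sketch of this pattern-set generation procedure (the history-width parameter and the function name are assumptions):

    import random

    def make_pattern_set(n, p, width):
        # One direct/inverse correlation direction per correlated bit
        directions = [random.choice(('d', 'i')) for _ in range(n)]
        patterns = []
        for _ in range(p):
            target = random.randint(0, 1)
            # First n bits determined by the target through the directions
            bits = [target if d == 'd' else 1 - target for d in directions]
            # Remaining bits are uncorrelated noise
            bits += [random.randint(0, 1) for _ in range(width - n)]
            patterns.append((bits, target))
        return directions, patterns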

17

How does noise affect training time?

18

How does imbalance affect training time?

19

How does imbalance affect learning?

20

Why can’t training-by-correlation handle imbalance?

21

Findings

Increasing history size hurts if the percentage of correlated inputs decreases

Training-by-error must be used when correlation is poor and pattern occurrences are imbalanced

22

Multibit Perceptron

Predicts values, not single bits

What is a value correlation? An input value implies a particular output value, e.g., 5 --> 4

Approaches:
    Disjoint
    Fully Coupled
    Weight-per-Value

23

Disjoint Perceptron

Tradeoff:
+ small size
- can only learn from its respective bits

[Diagram: disjoint perceptron example, with direct and inverse correlations between corresponding bits of past and predicted values]

24

Fully Coupled Perceptron

Tradeoff:
+ can learn from any past bit
- more weights

[Diagram: fully coupled perceptron example, with direct correlations crossing between different bit positions]

25

Learning abilities compared

26

Weight-per-Value Perceptron

Tradeoff:
+ can always learn
- tons of weights

[Diagram: one weight per (input value, output value) pair over values v1, v2, v3]
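Hedged sketches of the two bit-level topologies just described (the weight-array shapes and the 8-bit value width are assumptions, not the dissertation's exact structures):

    VBITS = 8  # assumed value width for illustration

    def bits(v):
        return [(v >> i) & 1 for i in range(VBITS)]

    def predict_disjoint(W, history):
        # W[i][k]: weight of output bit i for bit i of the k-th past value
        out = 0
        for i in range(VBITS):
            s = sum(W[i][k] * (1 if bits(v)[i] else -1)
                    for k, v in enumerate(history))
            out |= (1 if s >= 0 else 0) << i
        return out

    def predict_fully_coupled(W, history):
        # W[i][k][j]: output bit i sees every bit j of every past value k
        out = 0
        for i in range(VBITS):
            s = sum(W[i][k][j] * (1 if bits(v)[j] else -1)
                    for k, v in enumerate(history) for j in range(VBITS))
            out |= (1 if s >= 0 else 0) << i
        return out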

27

History Interference

28

How common is interference?

[Charts: average number of interfering branches (0-3.5) vs. history size (1-16); percentage of time the most common branch appears at the input vs. quantity of interfering branches (1-49)]

29

How does interference affect perceptrons?

constructive
destructive
neutral
weight-destructive
value-destructive

30

Interference in Perceptron Branch Prediction

[Chart: breakdown of interference into constructive, neutral, weight-destructive, value-destructive, and completely destructive for gzip, gcc, perlbmk, bzip2, twolf, vortex, vpr, and mcf]

31

Coping: Assigned Seats

Tradeoff:
+ no additional size
- can't consider multiple iterations of an instruction

[Diagram: branch history bits for iteration 1 (branches A B C D E) and iteration 2 (branches A B F C D) feeding the prediction]

32

Weight for each interfering branch (“Piecewise Linear”)

Tradeoff:
+ interference is completely removed
- massive size

[Diagram: branch history and branch addresses (A B C D E / A B F C D) across iterations 1 and 2, with each interfering branch selecting its own weight for the prediction]
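A sketch in the spirit of piecewise linear prediction, assuming the usual formulation in which a separate weight exists for every (predicted branch, history position, interfering branch) triple; the unbounded dictionary stands in for a finite hashed table in hardware:

    import collections

    # (branch, position, other_branch) -> weight
    W = collections.defaultdict(int)

    def piecewise_predict(branch, history):
        # history: list of (other_branch_address, taken) pairs, newest first;
        # two different branches at the same position never share a weight
        s = sum(W[(branch, i, other)] * (1 if taken else -1)
                for i, (other, taken) in enumerate(history))
        return s >= 0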

33

Simulator

New superscalar, cycle-accurate, execution-driven simulator

Can accurately model value prediction and criticality

34

Value Prediction

What is it? Predicting instructions' data values to overcome data dependencies

Why consider it? It requires a multiple-bit prediction, not a single-bit one

35

Table-based Predictor

Limitations: exponential growth in past values and value history; can only consider local history

Storage: 70kB for 4 values, 34MB for 8 values, 74×10^18 B for 16 values

[Diagram: instruction address selects data values; the value history pattern (2^v entries) indexes a pattern history table that yields the predicted data value]

36

Perceptron in Pattern Table (PPT)

Tradeoff:
+ few perceptrons needed (for 4 past values); can consider longer histories
- exponential growth with the number of past values

[Diagram: the value history pattern (2^v entries) selects among perceptrons 0-n; instruction address selects data values; log v output bits form the predicted data value]

37

Perceptron in Value Table (PVT)

Tradeoff:
+ linear growth in both value history and number of past values
- more perceptrons needed

[Diagram: instruction address selects one of perceptrons 0-n; log v output bits select among the instruction's data values to form the prediction]
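A hedged sketch of the PVT idea as described: one small perceptron per static instruction maps the local value-history bits to a log2(v)-bit index selecting one of the instruction's v stored values (function and parameter names are assumptions; v is assumed to be a power of two):

    import math

    def pvt_predict(weights, history_bits, recent_values):
        # weights[b][k]: weight of output bit b for value-history bit k
        idx = 0
        for b in range(int(math.log2(len(recent_values)))):
            s = sum(weights[b][k] * (1 if bit else -1)
                    for k, bit in enumerate(history_bits))
            idx |= (1 if s >= 0 else 0) << b
        return recent_values[idx]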

38

Results: PVT

2.4-5.6% accuracy increase, 0.5-1.2% performance increase
102kB-1.3MB storage needed

39

Results: PPT

1.4-2.8% accuracy decrease: not a good approach
72kB-115kB storage needed

40

Global-Local Value Prediction

Uses global correlation to predict locally available values

41

Global-Local Predictor

[Diagram: the correct data value index trains perceptrons 0-n, selected by instruction address; log v output bits choose among locally stored data values to form the prediction]

42

Global-Global Prediction

Tradeoff:
+ less value storage
- more bits needed per perceptron input

[Diagram: the correct data value index trains perceptrons 0-n, selected by instruction address; the output indexes a global value cache to form the prediction]

43

Global Bitwise

Tradeoff:
+ no value storage; not limited to past values only
- many more bits needed per perceptron input

[Diagram: perceptrons 0-n, selected by instruction address, predict the data value bits directly]
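A hedged sketch of the bitwise idea (shapes assumed): each bit of the predicted value is its own perceptron output computed over the raw bits of recent global values, so the predictor can produce values it has never stored:

    VBITS = 8  # assumed value width for illustration

    def bitwise_predict(W, global_values):
        # W[i][k][j]: weight of output bit i for bit j of the k-th recent global value
        out = 0
        for i in range(VBITS):
            s = sum(W[i][k][j] * (1 if (v >> j) & 1 else -1)
                    for k, v in enumerate(global_values) for j in range(VBITS))
            out |= (1 if s >= 0 else 0) << i
        return out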

44

Global Predictors Compared

Global-Local: 3.1% accuracy increase, 1.6% performance increase; 1.2MB storage needed

Global-Global: 7.6% accuracy increase, 6.7% performance increase; 1.3MB storage needed

Bitwise: 12.7% accuracy increase, 5.3% performance increase; 4.2MB storage needed

45

Can Bitwise Predict New Values?

5.0% of all predictions are correct values never seen before

A further 9.8% are correct values not seen in the local history

46

Multibit Topologies Compared

Disjoint: 3.1% accuracy increase, 1.6% performance increase; 1.2MB storage needed

Fully Coupled: 6.8% accuracy decrease, 1.5% performance decrease; 3.8MB storage needed

Weight-per-Value: 10.7% accuracy increase, 4.4% performance increase; 21.5MB storage needed

47

Training Approaches Compared: Global-Local

48

Training Approaches Compared: PVT Local

49

Final Weight Values: Distribution and Accuracy

[Chart: percentage of weights at each final weight value (-128 to 125, log-scale y-axis) for error, correlation, and exponential-growth training on the disjoint topology, and error and correlation training on the fully coupled topology]

50

Anti-Interference Compared

51

Criticality Prediction

What is it? Predicting whether each instruction is on the critical path

Why consider it? Lack of good training information; multiple input factors

52

Counter-based Criticality

Predicts four "criteria" that indicate criticality:
    QOLD: oldest waiting instruction
    QOLDDEP: parent of a QOLD instruction
    ALOLD: oldest instruction in the machine
    QCONS: instruction with the most dependencies

[Diagram: instruction address indexes QOLD, QOLDDEP, ALOLD, and QCONS counters, which combine to form the prediction]

53

Perceptron-per-Criteria (PEC)

[Diagram: separate banks of perceptrons 0-n for QOLD, QOLDDEP, ALOLD, and QCONS, each indexed by instruction address and trained by its correct criterion; their outputs combine into the prediction]

Tradeoff:
+ one input per history entry
- can't learn relationships between criteria

54

Single Perceptron (SP)

[Diagram: a single bank of perceptrons 0-n, indexed by instruction address, trained by the combined correct QOLD, QOLDDEP, ALOLD, and QCONS outcomes]

Tradeoff:
+ one input per history entry and one perceptron
- can't learn effects of individual criteria

55

Single Perceptron with Input for Each Criterion (SPC)

[Diagram: perceptrons 0-n, indexed by instruction address, each taking the correct QOLD, QOLDDEP, ALOLD, and QCONS outcomes as separate inputs]

Tradeoff:
+ can learn relative relationships of each criterion
- four inputs per perceptron
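A hedged sketch of the SPC structure (encodings and shapes are assumptions): each history entry contributes four separately weighted inputs, one per criterion, so the perceptron can learn their relative importance:

    def spc_predict(weights, history):
        # history: list of (qold, qolddep, alold, qcons) booleans, newest first
        # weights[i][j]: weight of criterion j for the i-th history entry
        s = 0
        for i, criteria in enumerate(history):
            for j, flag in enumerate(criteria):
                s += weights[i][j] * (1 if flag else -1)
        return s >= 0  # True: predict the instruction is critical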

56

Accuracy Compared

PEC: 2.9% accuracy increase, 4.2MB storage needed

SP: 4.1% accuracy increase, 1.0MB storage needed

SPC: 6.6% accuracy increase, 4.2MB storage needed

57

Performance with Value Prediction

58

Training Approaches Compared

59

Final SPC Weight Distribution

[Chart: distribution of final SPC weight values (-128 to 128, log-scale y-axis) for training by error and training by correlation]

60

Conclusions

Perceptron Local Value Predictor: 5.6% accuracy increase with 1.3MB storage

Perceptron Global-based Local Value Predictor: 3.1% accuracy increase with 1.2MB storage; 10.7% increase for 21.5MB storage

Two Perceptron Global Value Predictors: 7.6% accuracy increase with 1.3MB storage; 12.7% increase for 4.2MB storage

Perceptron Global Criticality Predictor: 6.6% accuracy increase with 4.2MB storage

61

Conclusions (continued)

Perceptron training approaches:
    Training-by-error must be used for poorly correlated applications

Multiple-bit topologies:
    Disjoint: best approach if hardware is a concern
    Fully coupled: performs poorly with low correlation
    Weight-per-value: performs very well but requires high hardware costs

Interference reduction:
    Assigned Seats: modest improvement but no additional hardware
    Piecewise: substantially more hardware, significant improvement

62

Questions