Pipeline Depth Tradeoffs and the Intel® Pentium® 4 Processor

Copyright © 2001 Intel Corporation.

Doug Carmean

Principal Architect

Intel Architecture Group

August 21, 2001


Agenda

Review

Pipeline Depth

Execution Trace Cache

L1 Data Cache

Summary


Information in this document is provided in connection with Intel® products. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted by this document. Except as provided in Intel’s Terms and Conditions of Sale for such products, Intel assumes no liability whatsoever, and Intel disclaims any express or implied warranty, relating to sale and/or use of Intel® products including liability or warranties relating to fitness for a particular purpose, merchantability, or infringement of any patent, copyright or other intellectual property right. Intel products are not intended for use in medical, life saving, or life sustaining applications.

Intel may make changes to specifications and product descriptions at any time, without notice.

This document contains information on products in the design phase of development. Do not finalize a design with this information. Revised information will be published when the product is available. Verify with your local sales office that you have the latest datasheet before finalizing a design.

Intel processors may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel, Pentium, and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and foreign countries.



Intel® NetBurst™ Micro-architecture vs P6

Basic P6 Pipeline (intro at 733MHz, .18µ)

 1      2      3       4       5       6       7       8        9         10
 Fetch  Fetch  Decode  Decode  Decode  Rename  ROB Rd  Rdy/Sch  Dispatch  Exec

Basic Pentium® 4 Processor Pipeline (intro at 1.5GHz, .18µ)

 1-2        3-4       5      6      7-8     9    10-12  13-14  15-16  17  18    19     20
 TC Nxt IP  TC Fetch  Drive  Alloc  Rename  Que  Sch    Disp   RF     Ex  Flgs  Br Ck  Drive

Hyper pipelined Technology enables industry leading performance and clock rate
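The relationship between the two pipelines can be checked with a little arithmetic. The sketch below only uses the stage counts and introduction frequencies quoted on the slide; the function names are hypothetical, and treating "stages times clock period" as a traversal time is a simplification that ignores stalls and replays.

```python
def clock_period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


def pipeline_traversal_ns(stages: int, freq_mhz: float) -> float:
    """Time for one instruction to pass through every stage, with no stalls."""
    return stages * clock_period_ns(freq_mhz)


# Figures from the slide: 10 stages at 733MHz (P6), 20 stages at 1.5GHz (Pentium 4).
p6_ns = pipeline_traversal_ns(10, 733.0)
p4_ns = pipeline_traversal_ns(20, 1500.0)

print(f"P6:        {clock_period_ns(733.0):.3f} ns/clock, {p6_ns:.2f} ns end to end")
print(f"Pentium 4: {clock_period_ns(1500.0):.3f} ns/clock, {p4_ns:.2f} ns end to end")
```

Under these assumptions the end-to-end time is nearly the same for both designs, while the deeper pipeline starts a new clock about twice as often.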


Hyper Pipelined Technology

[Figure: frequency over time since introduction, one curve per micro-architecture]

P5 Micro-Architecture: 5 pipeline stages, 60MHz rising to 233MHz

P6 Micro-Architecture: 10 pipeline stages, 166MHz rising to 1.2GHz

Intel® NetBurst™ Micro-Architecture: 20 pipeline stages, intro 1.5GHz, today 1.8GHz, 2.0GHz


Deeper Pipelines are Better

[Figure: performance improvement (0% to 120%) vs. pipeline depth (10 to 30), one curve per cache size: 2 MB, 1 MB, 512 KB, 256 KB]

Source: Average of 2000 application segments from performance simulations
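The shape of such a chart can be illustrated with a toy analytic model. This is not the simulation behind the slide: every parameter below (logic delay, latch overhead, misprediction and miss rates, memory latency) is an assumed value chosen only to show the trend, and the function names are hypothetical. The model assumes a branch misprediction costs one cycle per pipeline stage and that memory latency in nanoseconds does not shrink with the clock.

```python
def time_per_instruction_ns(depth, logic_ns=13.0, latch_ns=0.05,
                            base_cpi=1.0, mispredicts_per_instr=0.005,
                            misses_per_instr=0.001, memory_ns=100.0):
    """Average time per instruction for a pipeline of the given depth."""
    # Splitting the same logic over more stages shortens the clock period.
    cycle_ns = logic_ns / depth + latch_ns
    # A mispredicted branch refills the pipeline: the penalty grows with depth.
    core_cycles = base_cpi + mispredicts_per_instr * depth
    # Cache misses cost a fixed time, independent of the core clock.
    return cycle_ns * core_cycles + misses_per_instr * memory_ns


def improvement_over(depth, baseline_depth=10, **params):
    """Fractional speedup of `depth` relative to `baseline_depth`."""
    base = time_per_instruction_ns(baseline_depth, **params)
    return base / time_per_instruction_ns(depth, **params) - 1.0


for misses in (0.004, 0.002, 0.001):  # fewer misses stands in for a larger cache
    gains = [improvement_over(d, misses_per_instr=misses) for d in (15, 20, 25, 30)]
    print(misses, [f"{g:.0%}" for g in gains])
```

In this sketch a lower miss rate, standing in for a larger cache, leaves less fixed memory time per instruction, so the deeper pipeline's faster clock shows up as a larger gain.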


Why not deeper pipelines?

Increases complexity

––Harder to balance

––More challenges to architect around

––More algorithms

––Greater validation effort

––Need to pipeline the wires

Overall Engineering Effort Increases Quickly as Pipeline depth increases


Performance

High bandwidth front end

Low latency core

Lower memory latency

High Bandwidth Front End


Higher Frequency increases requirements of front end

Branch prediction is more important

––So we improved it

Need greater uop bandwidth

––Branches constantly change the flow

––Need to decode more instructions in parallel


Block Diagram

[Figure: Pentium® 4 processor block diagram, with the Execution Trace Cache (12K µops) highlighted. Labeled blocks:]

System Bus: Quad Pumped, 64-bit wide, 3.2 GB/s, attached through the Bus Interface Unit

L2 Cache: 64GB/s

Front end: Instruction TLB, Dynamic BTB (4K Entries), Instruction Decoder, Execution Trace Cache (12K µops), Trace Cache BTB (512 Entries), Micro Instruction Sequencer

Allocator / Register Renamer

Memory µop Queue, Integer/Floating Point µop Queue

Integer Schedulers and Floating Point Schedulers: Memory, Slow, Fast Int, Fast Int, FP Move, FP Gen

Integer Register File / Bypass Network: Ld/St Address unit (2xAGU), Slow ALU (Complex Instr.), two 2xALU (Simple Instr.)

FP Register / Bypass: FP Move, Fmul, FAdd

L1 Data Cache: 8Kbyte, 4-way, 256-bit wide
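The bandwidth figures on the diagram follow from width times transfer rate. The sketch below is a consistency check under stated assumptions: a 100 MHz base bus clock for the quad-pumped system bus, and one 256-bit L2 transfer per core clock at 2 GHz. Neither clock rate is given on the slide; they are assumptions that reproduce the quoted 3.2 GB/s and 64GB/s.

```python
def bandwidth_gb_per_s(width_bits: int, transfers_per_second: float) -> float:
    """Peak bandwidth in GB/s (10^9 bytes per second)."""
    return (width_bits / 8) * transfers_per_second / 1e9


# Assumption: 100 MHz bus clock, four transfers per clock (quad pumped).
system_bus = bandwidth_gb_per_s(64, 4 * 100e6)

# Assumption: one 256-bit transfer per core clock at 2 GHz.
l2_cache = bandwidth_gb_per_s(256, 2e9)

print(f"System bus: {system_bus:.1f} GB/s")
print(f"L2 cache:   {l2_cache:.1f} GB/s")
```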



Trace Cache Delivery

Program order (taken branches skip the unused code between blocks):

      1 cmp
      2 br -> T1
      ... (unused code)
T1:   3 sub
      4 br -> T2
      ... (unused code)
T2:   5 mov
      6 sub
      7 br -> T3
      ... (unused code)
T3:   8 add
      9 sub
     10 mul
     11 cmp
     12 br -> T4

P6 Microarchitecture delivery (BW = 1.5 uops/ns):

 1 cmp       2 br T1
 3 T1: sub   4 br T2
 5 mov       6 sub      7 br T3
 8 T3: add   9 sub     10 mul
11 cmp      12 br T4

Trace Cache Delivery (BW = 6 uops/ns):

 1 cmp       2 br T1    3 T1: sub
 4 br T2     5 mov      6 sub
 7 br T3     8 T3: add  9 sub
10 mul      11 cmp     12 br T4
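The two delivery patterns above can be reproduced with a small grouping routine. This is a minimal sketch, assuming a delivery width of three per clock and assuming that a conventional front end ends its group at every taken branch while the trace cache does not. Both rules are inferred from the groupings on the slide, and the function name is hypothetical. The sketch does not attempt to derive the quoted uops/ns figures.

```python
# The 12 uops of the example, each flagged True if it is a taken branch.
UOPS = [("cmp", False), ("br T1", True), ("sub", False), ("br T2", True),
        ("mov", False), ("sub", False), ("br T3", True), ("add", False),
        ("sub", False), ("mul", False), ("cmp", False), ("br T4", True)]


def delivery_groups(uops, width=3, stop_at_taken_branch=True):
    """Split a uop stream into per-clock delivery groups."""
    groups, current = [], []
    for name, taken in uops:
        current.append(name)
        # A group closes when it is full, or (optionally) at a taken branch.
        if len(current) == width or (stop_at_taken_branch and taken):
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


p6_groups = delivery_groups(UOPS, stop_at_taken_branch=True)
trace_groups = delivery_groups(UOPS, stop_at_taken_branch=False)

print(len(p6_groups), p6_groups)
print(len(trace_groups), trace_groups)
```

With these rules the branch-limited front end needs five groups for the twelve uops, while the trace cache fills every group and needs four.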


Inside the Execution Trace Cache

[Figure: the Instruction Pointer (0x0900) selects a trace whose lines (head, body 1, body 2, tail) are placed in a cache array of Way 0 to Way 3 by Set 0 to Set 4]

Line     µops
head     cmp, br T1, T1:sub, br T2, mov, sub
body 1   br T3, T3:add, sub, mul, cmp, br T4
body 2   T4:add, sub, mov, add, add, mov
tail     add, sub, mov, add, add, mov
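The head/body/tail layout above can be modeled as a list of fixed-size lines. This is a minimal sketch, assuming six µops per line as in the example; `TraceLine` and `build_trace` are hypothetical names, and the real hardware's line linking and set/way placement are not modeled.

```python
from dataclasses import dataclass
from typing import List

UOPS_PER_LINE = 6  # assumption: taken from the six-uop lines in the example


@dataclass
class TraceLine:
    role: str        # "head", "body N" or "tail"
    uops: List[str]


def build_trace(uops: List[str]) -> List[TraceLine]:
    """Cut a uop stream into consecutive lines and label their roles."""
    chunks = [uops[i:i + UOPS_PER_LINE] for i in range(0, len(uops), UOPS_PER_LINE)]
    lines = []
    for i, chunk in enumerate(chunks):
        if i == 0:
            role = "head"
        elif i == len(chunks) - 1:
            role = "tail"
        else:
            role = f"body {i}"
        lines.append(TraceLine(role, chunk))
    return lines


EXAMPLE = ["cmp", "br T1", "T1:sub", "br T2", "mov", "sub",
           "br T3", "T3:add", "sub", "mul", "cmp", "br T4",
           "T4:add", "sub", "mov", "add", "add", "mov",
           "add", "sub", "mov", "add", "add", "mov"]

trace = build_trace(EXAMPLE)
for line in trace:
    print(f"{line.role:7s} {', '.join(line.uops)}")
```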


Self Modifying Code

Programs that modify the instruction stream that is being executed

Very common in Java* code from JITs

Requires hardware mechanisms to maintain consistency

*Other names and brands may be claimed as the property of others.


Self Modifying Code

The hardware needs to handle two basic cases:

––Stores that write to instructions in the Trace Cache

––Instruction fetches that hit pending stores

    –– Speculative

    –– Committed


Case 1: Stores to cached instructions

[Figure: the Execution Core sends a store through the Data TLB; the Store’s Physical Address is compared against the addresses held in the Instruction TLB (128 entries), which carries “in use” bits and sits alongside the Trace Cache]
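The check in Case 1 can be sketched as a lookup of each store's physical page against the pages that currently have instructions cached. This is a toy model, not the hardware mechanism: the 4 KB page size, the page-granular invalidation, and the names `SmcMonitor`, `record_trace` and `on_store` are all assumptions made for illustration.

```python
PAGE_SHIFT = 12  # assumption: 4 KB pages; the slide does not state a page size


class SmcMonitor:
    """Toy model: stores are checked against pages with cached instructions."""

    def __init__(self):
        # Physical page number -> labels of traces built from that page.
        # A page having an entry here stands in for its "in use" bit being set.
        self.traces_by_page = {}

    def record_trace(self, fetch_phys_addr, trace_label):
        """Note that a trace was built from code at this physical address."""
        page = fetch_phys_addr >> PAGE_SHIFT
        self.traces_by_page.setdefault(page, []).append(trace_label)

    def on_store(self, store_phys_addr):
        """Return the traces invalidated by this store (empty list if none)."""
        page = store_phys_addr >> PAGE_SHIFT
        return self.traces_by_page.pop(page, [])


monitor = SmcMonitor()
monitor.record_trace(0x0900, "trace@0x0900")
print(monitor.on_store(0x0904))  # store into the same page as cached code
print(monitor.on_store(0x0904))  # page no longer marked in use
print(monitor.on_store(0x5000))  # store to an unrelated page
```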


Case 2: Fetches to pending stores

[Figure: the Instruction Pointer address passes through the Instruction TLB (128 entries) and is compared against addresses in the Store Buffer and the Write Combining Buffer of the Execution Core. Labeled responses: “Please Re-Fetch”, “Please Flush Pipeline”, “Please Re-Fetch”. Labeled store states: Speculative, Committed]



Provides higher bandwidth for higher frequency core

Reduces fetch latency

Requires fundamentally new algorithms





Lower memory latency

Low Latency Core



[Figure: block diagram repeated, with the L1 Data Cache (8Kbyte, 4-way) highlighted]


L1 Cache is 3x Faster

P6:

––3 clocks @ 1GHz = 3ns

Pentium® 4 Processor:

––2 clocks @ 2GHz = 1ns

Lower Latency is Higher Performance
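The 3x figure is simply load latency in clocks divided by clock frequency. A short check using only the numbers on the slide (the function name is hypothetical):

```python
def latency_ns(clocks: int, freq_ghz: float) -> float:
    """Latency in nanoseconds for a given clock count and frequency in GHz."""
    return clocks / freq_ghz


p6_l1_ns = latency_ns(3, 1.0)   # P6: 3 clocks @ 1GHz
p4_l1_ns = latency_ns(2, 2.0)   # Pentium 4: 2 clocks @ 2GHz

print(f"P6: {p6_l1_ns} ns, Pentium 4: {p4_l1_ns} ns, ratio: {p6_l1_ns / p4_l1_ns}x")
```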



[Figure: L1 data cache access across pipeline stages at 2x Clock. VA 15:0 is available first and feeds the Fast SB, tag and data lookups; VA 31:16 arrives later, giving VA 31:0 for the TLB, full TAG and Slow SB checks. A mismatch found by the later checks triggers a replay]

[Figure: Way Predictor (Tag array) with Mini Tag. Labeled address bits: 10:6, 15:11 and 19:16. The way predictor drives Way select on the Data array; the later check produces Hit (Replay)]
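The way-predictor idea can be sketched as a partial-tag lookup followed by a full check. This is a minimal model under assumptions the figure does not confirm: address bits 10:6 are taken as the set index (32 sets of 64-byte lines, matching 8 Kbyte, 4-way), bits 15:11 as the mini tag, and a mini-tag match that fails the full address check is reported as a replay. The class and method names are hypothetical.

```python
LINE_BITS = 6   # assumption: 64-byte lines
SETS = 32       # 8 Kbyte / 4 ways / 64 bytes
WAYS = 4


def set_index(va: int) -> int:
    return (va >> 6) & 0x1F      # assumed index field: bits 10:6


def mini_tag(va: int) -> int:
    return (va >> 11) & 0x1F     # assumed mini tag field: bits 15:11


class WayPredictedCache:
    """Toy cache that selects a way from a mini tag and verifies it afterwards."""

    def __init__(self):
        # Each entry holds the full line address (va >> LINE_BITS) or None.
        self.lines = [[None] * WAYS for _ in range(SETS)]

    def fill(self, va: int, way: int) -> None:
        self.lines[set_index(va)][way] = va >> LINE_BITS

    def load(self, va: int) -> str:
        for line in self.lines[set_index(va)]:
            # Fast path: compare only the mini tag (bits 15:11 of the address).
            if line is not None and (line >> 5) & 0x1F == mini_tag(va):
                # Slow path: the full address check confirms or rejects the guess.
                return "hit" if line == va >> LINE_BITS else "replay"
        return "miss"


cache = WayPredictedCache()
cache.fill(0x00010840, way=0)
print(cache.load(0x00010840))  # same line
print(cache.load(0x00020840))  # same low 16 bits, different upper bits
print(cache.load(0x00010040))  # same set, different mini tag
```

The point of the design is that the early guess needs only the low address bits, so data can be selected before the full address is available, at the cost of an occasional replay.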






Lower Memory Latency


Reducing Latency

As frequency increases, it is important to improve the performance of the memory subsystem

Data Prefetch Logic

––Watches processor memory traffic

––Looks for patterns

––Initiates accesses



[Figure: block diagram repeated, with the Data Prefetch Logic highlighted]



[Figure: memory hierarchy showing Instruction Fetch and Instruction Buffers, L1 Data Cache, L2 Advanced Transfer Cache, Data Prefetch Logic and Bus Queue, with paths labeled 64 and 256]

Prefetch logic first checks L2 cache and then fetches lines from memory that miss L2 cache.



Watches for streaming memory access patterns

–– Can track 8 independent streams

–– Loads, Stores or Instruction fetches

–– Forward or Backward

Analysis on 32 byte cache line granularity

Looks for “mostly” complete streams:

–– Access to cache lines 1,2,3,4,5,6 will prefetch

–– Access to cache lines 1,2, 4,5,6 will prefetch

–– 1, ,3, , ,6, , ,9 will not prefetch
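One way to express "mostly complete" is as a density test: the fraction of cache lines touched within the span the accesses cover. This is a hedged sketch, not the prefetcher's actual rule. The 0.75 density threshold and the minimum of four accesses are assumptions, picked only so that the three examples above come out as the slide states, and the function name is hypothetical.

```python
def is_mostly_complete(lines, min_density=0.75, min_accesses=4):
    """Decide whether a set of accessed cache-line numbers looks like a stream."""
    touched = sorted(set(lines))          # order-independent: forward or backward
    if len(touched) < min_accesses:
        return False
    span = touched[-1] - touched[0] + 1   # lines covered from first to last
    return len(touched) / span >= min_density


print(is_mostly_complete([1, 2, 3, 4, 5, 6]))   # every line touched
print(is_mostly_complete([1, 2, 4, 5, 6]))      # one gap in six lines
print(is_mostly_complete([1, 3, 6, 9]))         # four lines out of nine
print(is_mostly_complete([6, 5, 4, 2, 1]))      # same stream walked backward
```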







Summary

The Pentium® 4 Processor’s deep pipelines provide high performance by enabling high frequency

Deep pipelines are more difficult to engineer

Larger caches further improve benefits of Pentium® 4 Processor’s deep pipeline

Compilers further increase benefits of deep pipelines by removing hazards

