Copyright © 2001 Intel Corporation.
Hyper-Threadin g Technolo gyHyper-Threadin g Technolo gyin the in the NetburstNetburst ™™MicroarchitectureMicroarchitecture
Debbie MarrDebbie MarrHyper-Threadin g Technolo gy ArchitectHyper-Threadin g Technolo gy Architect
Intel CorporationIntel Corporation
August 19, 2002August 19, 2002
Copyright © 2002 Intel Corporation. Page 2
AgendaAgenda
zz Hyper-Threadin g Technolo gy in theHyper-Threadin g Technolo gy in theNetburstNetburst ™ ™ MicroarchitectureMicroarchitecture
zz Microarchitecture Microarchitecture Choices & TradeoffsChoices & Tradeoffs
zz Performance ResultsPerformance Results
Copyright © 2002 Intel Corporation. Page 3
AgendaAgenda
zz Hyper-Threadin g Technolo gy in theHyper-Threadin g Technolo gy in theNetburstNetburst ™ ™ MicroarchitectureMicroarchitecture
zz Microarchitecture Microarchitecture Choices & TradeoffsChoices & Tradeoffs
zz Performance ResultsPerformance Results
Copyright © 2002 Intel Corporation. Page 4
Hyper-Threadin g Technolo gyHyper-Threadin g Technolo gy
zz Simultaneous Multi-threadin gSimultaneous Multi-threadin g–– 2 logical processors (LP) simultaneously share one2 logical processors (LP) simultaneously share one
physical processor’s execution resourcesphysical processor’s execution resources
zz Appears to software as 2 processors (2-wayAppears to software as 2 processors (2-wayshared memory multiprocessor)shared memory multiprocessor)–– Shrink-wrapped operatin g system schedules softwareShrink-wrapped operatin g system schedules software
threads/processes to both lo gical processorsthreads/processes to both lo gical processors
–– Fully compatible to existin g multi-processor systemFully compatible to existin g multi-processor systemsoftware and hardware.software and hardware.
zz Integral part of Intel Integral part of Intel NetburstNetburst ™ ™ MicroarchitectureMicroarchitecture
Copyright © 2002 Intel Corporation. Page 5
Intel® Processors withIntel® Processors withNetburstNetburst ™ ™ MicroarchitectureMicroarchitecture
Intel XeonIntel XeonProcessorProcessor
256KB 2nd-Level 256KB 2nd-Level CacheCache
.18u process.18u process
Intel® Xeon™ MP Intel® Xeon™ MP ProcessorProcessor
256KB 2256KB 2 ndnd -Level Cache-Level Cache1MB 3rd-Level Cache1MB 3rd-Level Cache
.18u process.18u process
Intel Xeon Intel Xeon ProcessorProcessor
512KB 2512KB 2 ndnd -Level -Level CacheCache
.13u process.13u process
Copyright © 2002 Intel Corporation. Page 6
Die Size Increase is SmallDie Size Increase is Smallzz Total die area added is smallTotal die area added is small
––A few small structures duplicatedA few small structures duplicated
––Some additional control lo gic andSome additional control lo gic andpointerspointers
Copyright © 2002 Intel Corporation. Page 7
What was addedWhat was added
Instruction TLB
Next IPInstruction Streamin g Buffers
Trace Cache Fill Buffers
Register Alias Tables
Trace Cache Next IP
Copyright © 2002 Intel Corporation. Page 8
Complexit y is Lar geComplexit y is Lar gezz Challen ged man y basic assum ptionsChallen ged man y basic assum ptions
zz New New microarchitecture microarchitecture algorithmsalgorithms––To address new To address new uopuop (micro-operation) (micro-operation)
prioritization issuesprioritization issues
––To solve potential newTo solve potential new livelock livelock scenarios scenarios
zz High lo gic desi gn com plexit yHigh lo gic desi gn com plexit y
zz Validation EffortValidation Effort––Explosion of validation spaceExplosion of validation space
Copyright © 2002 Intel Corporation. Page 9
AgendaAgenda
zz Hyper-Threadin g Technolo gy in theHyper-Threadin g Technolo gy in theNetburstNetburst ™ ™ MicroarchitectureMicroarchitecture
zz Microarchitecture Microarchitecture Choices & TradeoffsChoices & Tradeoffs
zz Performance ResultsPerformance Results
Copyright © 2002 Intel Corporation. Page 10
Managing ResourcesManaging Resourceszz ChoicesChoices
–– PartitionPartition–– Half of resource dedicated to each lo gical processorHalf of resource dedicated to each lo gical processor
–– ThresholdThreshold–– Flexible resource sharin g with limit on maximumFlexible resource sharin g with limit on maximum
resource usa geresource usa ge
–– Full Sharin gFull Sharin g–– Flexible resource sharin g with Flexible resource sharin g with nono limit on maximum limit on maximum
resource usa geresource usa ge
–– Others (not discussed in this talk)Others (not discussed in this talk)
zz ConsiderationsConsiderations–– Throu ghput and fairnessThrou ghput and fairness–– Die size and ComplexityDie size and Complexity
Copyright © 2002 Intel Corporation. Page 11
Partitionin gPartitionin gzz Half of resource dedicated to each lo gicalHalf of resource dedicated to each lo gical
processorprocessor–– Simple, low complexitySimple, low complexity
zz Good for structures whereGood for structures where–– Occupancy time can be hi gh and unpredictableOccupancy time can be hi gh and unpredictable
–– High avera ge utilizationHigh avera ge utilization
zz Major pipeline queues are a good exampleMajor pipeline queues are a good example–– Provide bufferin g to avoid pipeline stallsProvide bufferin g to avoid pipeline stalls
–– Allow slip between lo gical processorsAllow slip between lo gical processors
Copyright © 2002 Intel Corporation. Page 12
Registers
RegisterRename
RegistersROB
StoreBuffer
L1 D-Cache
Allocate
RenameUop
QueueRegister
ReadExecute D-Cache
RegisterWrite
RetireQueue
SchedFetchQueue
I-Fetch
TraceCache
IP
Execution Pi pelineExecution Pi peline
Copyright © 2002 Intel Corporation. Page 13
Registers
RegisterRename
RegistersROB
StoreBuffer
L1 D-Cache
Allocate
RenameUop
QueueRegister
ReadExecute D-Cache
RegisterWrite
RetireQueue
SchedFetchQueue
I-Fetch
TraceCache
IP
Execution Pi pelineExecution Pi peline
In-Order Pipeline Out-of-Order Pipeline In-Order
Copyright © 2002 Intel Corporation. Page 14
Registers
RegisterRename
RegistersROB
StoreBuffer
L1 D-Cache
Allocate
RenameUop
QueueRegister
ReadExecute D-Cache
RegisterWrite
RetireQueue
SchedFetchQueue
I-Fetch
TraceCache
IP
Execution Pi pelineExecution Pi peline
Partition queues between majorpipesta ges of pipeline
Copyright © 2002 Intel Corporation. Page 15
Partitioned Queue Exam plePartitioned Queue Exam plezz With full sharin g, a slow thread can getWith full sharin g, a slow thread can get
unfair share of resourcesunfair share of resourcesÆÆCan prevent a faster thread from makin gCan prevent a faster thread from makin g
rapid pro gress.rapid pro gress.
Copyright © 2002 Intel Corporation. Page 16
Partitioned Queue Exam plePartitioned Queue Exam plezz Green thread stalledGreen thread stalled
zz Yellow thread not stalledYellow thread not stalled
1010
Shared Queue Partitioned Queue
1010Cycle 0
Copyright © 2002 Intel Corporation. Page 17
1 10
Shared Queue Partitioned Queue(Max entries/LP = 2)
1 10Cycle 1
00
Partitioned Queue Exam plePartitioned Queue Exam plezz Green thread stalledGreen thread stalled
zz Yellow thread not stalledYellow thread not stalled
Copyright © 2002 Intel Corporation. Page 18
1 10
Shared Queue Partitioned Queue(Max entries/LP = 2)
1 10Cycle 1
2 2
Partitioned Queue Exam plePartitioned Queue Exam plezz Green thread stalledGreen thread stalled
zz Yellow thread not stalledYellow thread not stalled
Copyright © 2002 Intel Corporation. Page 19
1210
Shared Queue Partitioned Queue(Max entries/LP = 2)
1210Cycle 1
Partitioned Queue Exam plePartitioned Queue Exam plezz Green thread stalledGreen thread stalled
zz Yellow thread not stalledYellow thread not stalled
Copyright © 2002 Intel Corporation. Page 20
210
Shared Queue Partitioned Queue(Max entries/LP = 2)
210Cycle 2
11
Partitioned Queue Exam plePartitioned Queue Exam plezz Green thread stalledGreen thread stalled
zz Yellow thread not stalledYellow thread not stalled
Copyright © 2002 Intel Corporation. Page 21
210
Shared Queue Partitioned Queue(Max entries/LP = 2)
210Cycle 2
2 3
Partitioned Queue Exam plePartitioned Queue Exam plezz Green thread stalledGreen thread stalled
zz Yellow thread not stalledYellow thread not stalled
Copyright © 2002 Intel Corporation. Page 22
2210
Shared Queue Partitioned Queue(Max entries/LP = 2)
3210Cycle 2
Partitioned Queue Exam plePartitioned Queue Exam plezz Green thread stalledGreen thread stalled
zz Yellow thread not stalledYellow thread not stalled
Copyright © 2002 Intel Corporation. Page 23
210
Shared Queue Partitioned Queue(Max entries/LP = 2)
3 10Cycle 3
22
Partitioned Queue Exam plePartitioned Queue Exam plezz Green thread stalledGreen thread stalled
zz Yellow thread not stalledYellow thread not stalled
Copyright © 2002 Intel Corporation. Page 24
210
Shared Queue Partitioned Queue(Max entries/LP = 2)
3 10Cycle 3
3 4
Partitioned Queue Exam plePartitioned Queue Exam plezz Green thread stalledGreen thread stalled
zz Yellow thread not stalledYellow thread not stalled
Copyright © 2002 Intel Corporation. Page 25
3210
Shared Queue Partitioned Queue(Max entries/LP = 2)
3410Cycle 3
Partitioned Queue Exam plePartitioned Queue Exam plezz Green thread stalledGreen thread stalled
zz Yellow thread not stalledYellow thread not stalled
Copyright © 2002 Intel Corporation. Page 26
3210
Shared Queue Partitioned Queue(Max entries/LP = 2)
410Cycle 4
3Yellow isYellow isBlocked!Blocked!
Partitioned Queue Exam plePartitioned Queue Exam plezz Green thread stalledGreen thread stalled
zz Yellow thread not stalledYellow thread not stalled
Partitionin g resource ensures fairness andPartitionin g resource ensures fairness andensures pro gress for both lo gical processorsensures pro gress for both lo gical processors
Copyright © 2002 Intel Corporation. Page 27
ThresholdsThresholdszz Flexible resource sharin g with limit onFlexible resource sharin g with limit on
maximum resource usa gemaximum resource usa gezz Good for small structures whereGood for small structures where
–– Occupancy time is low and predictableOccupancy time is low and predictable–– Low avera ge utilization with occasional hi gh peaksLow avera ge utilization with occasional hi gh peaks
zz Schedulers are a good exampleSchedulers are a good example–– Throu ghput is hi gh because of data speculationThrou ghput is hi gh because of data speculation
(get data re gardless of cache hit)(get data re gardless of cache hit)–– uOpsuOps pass throu gh scheduler very quickly pass throu gh scheduler very quickly–– Schedulers are small for speedSchedulers are small for speed
Copyright © 2002 Intel Corporation. Page 28
MemoryuOp
Scheduler
ALU0Scheduler
Schedulers, QueuesSchedulers, Queueszz 5 Schedulers5 Schedulers
–– MEMMEM
–– ALU0ALU0
–– ALU1ALU1
–– FP MoveFP Move
–– FP/MMX/SSEFP/MMX/SSE
zz Threshold prevents oneThreshold prevents onelogical processor fromlogical processor fromconsumin g all entriesconsumin g all entries
–– Round robin until reachRound robin until reachthresholdthreshold
Mem Queue LP0
Mem Queue LP1
+ Counters -
Similarly.. ALU1Scheduler
FP MoveScheduler
FP/MMX/SSEScheduler
Copyright © 2002 Intel Corporation. Page 29
Measurement of image composition workload on an Intel® Xeon™ Processor
Scheduler Occupancy HistogramImage Composition Workload
0%
5%
10%
15%
20%
25%
30%
35%
40%
0% 20% 40% 60% 80% 100%
% of Entries Occupied
% o
f Tim
e
alu0alu1memoryfp movefp/mmx/sse
Copyright © 2002 Intel Corporation. Page 30
Measurement of transaction processing workload on a 4P Intel® Xeon™ MP Processor System
Scheduler Occupancy Histo gramTransaction Processin g Workload
0%10%
20%30%
40%50%
60%70%
80%90%
100%
0% 20% 40% 60% 80% 100%
% of Entries Occupied
% o
f Tim
e
alu0alu1memoryfp movefp/mmx/sse
Copyright © 2002 Intel Corporation. Page 31
Measurement of transaction processing workload on a 4P Intel® Xeon™ MP Processor System
Variable partitionin g allows a lo gical processor to use mostVariable partitionin g allows a lo gical processor to use mostresources when the other doesn’t need themresources when the other doesn’t need them
Memory Scheduler Occupancy Over Time
0
1
2
3
4
5
6
7
8
<--------- Time -------->
Occ
upan
cy
Logical Processor 0Logical Processor 1
Copyright © 2002 Intel Corporation. Page 32
Full Sharin gFull Sharin gzz Flexible resource sharin g with no limit onFlexible resource sharin g with no limit on
maximum resource usa gemaximum resource usa gezz Good for lar ge structures whereGood for lar ge structures where
–– Workin g set sizes are variableWorkin g set sizes are variable–– Sharin g between lo gical processors possibleSharin g between lo gical processors possible–– Not possible for one lo gical processor to starveNot possible for one lo gical processor to starve
zz Caches are a good exampleCaches are a good example–– All caches are sharedAll caches are shared
–– Better overall performance vs. partitioned cachesBetter overall performance vs. partitioned caches–– Some applications share code and/or dataSome applications share code and/or data
–– High set High set associativityassociativity minimizes conflict misses. minimizes conflict misses.–– Level 2 and Level 3 caches are 8-wa y set associativeLevel 2 and Level 3 caches are 8-wa y set associative
Copyright © 2002 Intel Corporation. Page 33
Shared Cache vs. Partitioned Cache
0.80
1.00
1.20
1.40
1.60
1.80
2.00
164.
gzip
175.
vpr
176.
gcc
181.
mcf
186.
craf
ty
197.
pars
er
252.
eon
253.
perlb
mk
254.
gap
255.
vort
ex
256.
bzip
2
300.
twol
f
168.
wup
wis
e
171.
swim
172.
mgr
id
173.
appl
u
177.
mes
a
178.
galg
el
179.
art
183.
equa
ke
187.
face
rec
188.
amm
p
189.
luca
s
191.
fma3
d
200.
sixt
rack
301.
apsi
Ave
rage
Sha
red
Cac
he im
prov
emen
t ove
r S
plit
Cac
he
Cache Hit RatePerformance
On avera ge, a shared cache has 40% better hit rateOn avera ge, a shared cache has 40% better hit rateand 12% better performance for these applicationsand 12% better performance for these applications
Results for 2 copies of application run simultaneously. Measured on Intel® Xeon™ Processor C0 step.Results for 2 copies of application run simultaneously. Measured on Intel® Xeon™ Processor C0 step.Cache miss statistics using EMON event: MN Cache miss statistics using EMON event: MN MEM:2nd Level Cache Load Misses RetiredMEM:2nd Level Cache Load Misses Retired
3.12.2
Copyright © 2002 Intel Corporation. Page 34
AgendaAgenda
zz Hyper-Threadin g Technolo gy in theHyper-Threadin g Technolo gy in theNetburstNetburst ™ ™ MicroarchitectureMicroarchitecture
zz Microarchitecture Microarchitecture Choices & TradeoffsChoices & Tradeoffs
zz Performance ResultsPerformance Results
Copyright © 2002 Intel Corporation. Page 35
E-Commerce Workload
0.00
0.50
1.00
1.50
2.00
2.50
2 4
Number of Processors
Per
form
ance
HT OffHT On
Transaction Processing Workload
0.00
0.50
1.00
1.50
2.00
2.50
3.00
1 2 4
Number of Processors
Per
form
ance
HT OffHT On
Server PerformanceServer Performance
20%
20%
10%
17%
14%
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. AnyPerformance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Anydifference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems ordifference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems orcomponents they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.www.intelintel.com/.com/procsprocs//perfperf/limits./limits.htmhtm or call (U.S.) or call (U.S.)1-800-628-8686 or 1-916-356-31041-800-628-8686 or 1-916-356-3104
Good performance benefit fromGood performance benefit fromsmall die area investmentsmall die area investment
Copyright © 2002 Intel Corporation. Page 36
Multi-taskin gMulti-taskin g
Larger gains can be realized by runnin g dissimilarLarger gains can be realized by runnin g dissimilarapplications due to different resource requirementsapplications due to different resource requirements
Intel® Xeon™ Processor platform is a prototype systemIntel® Xeon™ Processor platform is a prototype system
1
1.02
1.04
1.06
1.08
1.1
1.12
1.14
1.16
1.18
1.2
Integer(same)
Integer(different)
FP (same) FP(different)
Integer vs.FP
Hyp
er-T
hrea
ding
Tec
hnol
ogy
Spe
edup
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. AnyPerformance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Anydifference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems ordifference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems orcomponents they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.www.intelintel.com/.com/procsprocs//perfperf/limits./limits.htmhtm or call (U.S.) or call (U.S.)1-800-628-8686 or 1-916-356-31041-800-628-8686 or 1-916-356-3104
Copyright © 2002 Intel Corporation. Page 37
ConclusionsConclusionszz Hyper-Threadin g Technolo gy is an inte gralHyper-Threadin g Technolo gy is an inte gral
part of the part of the NetburstNetburst ™ ™ MicroarchitectureMicroarchitecture–– Very little additional die area neededVery little additional die area needed–– Compellin g performanceCompellin g performance–– Currently enabled for server processorsCurrently enabled for server processors
zz MicroarchitectureMicroarchitecture desi gn choices desi gn choices–– Resource sharin g policy matched to traffic andResource sharin g policy matched to traffic and
performance requirementsperformance requirements
zz New challen ging New challen ging microarchitecture microarchitecture directiondirection–– Continuous improvements in future processors forContinuous improvements in future processors for
years to comeyears to come