Intel Nehalem Processor

8/11/2019 Intel Nehalem Processor

1/65

Ronak Singhal

Nehalem Architect

NGMS001

Inside Intel Next GenerationNehalem Microarchitecture


2/65

2

Nehalem Design Philosophy Enhanced Processor Core

Performance Features

Simultaneous Multi-Threading Feeding the Engine

New Cache Hierarchy

New Platform Architecture

Performance AccelerationVirtualizationNew Instructions

Agenda

Nehalem: Next Generation Intel Microarchitecture


3/65

3

NEHALEM

TO K

SANDY

BRIDGE

TO K

WESTMERE

2009-10

32nm

T K

2005-06

Intel

Core2,Xeon

processors

Intel

Pentium

D,Xeon,

Core

processors

T K TO K

2007-08

PENRYN

45nm

T K

65nm

Tick Tock Development Model

NehalemNehalem --

The IntelThe Intel

45 nm Tock Processor45 nm Tock Processor


4/654

Nehalem Design Goals

Existing AppsExisting Apps

Emerging AppsEmerging Apps

All UsagesAll Usages

Single ThreadSingle Thread

MultiMulti--threadsthreads

Workstation / ServerDesktop / Mobile

World class performance combined with superior energy efficiencyWorld class performance combined with superior energy efficiency

Optimized for:Optimized for:

A single, scalable, foundation optimized across each segment andA single, scalable, foundation optimized across each segment and

power envelopepower envelope

Dynamically scaledperformance when

needed to maximizeenergy efficiency


A Dynamic and Design Scalable Microarchitecture

Dynamic and Design Scalable Microarchitecture


5/655

Scalable Nehalem Core

Common feature setCommon feature setSame core forSame core for

all segmentsall segments

Common softwareCommon software

optimizationoptimization

45nm45nm

Servers/Workstations

Energy Efficiency,Performance,

Virtualization,Reliability, Capacity, Scalability

Nehalem

ehalem

DesktopPerformance,

Graphics, EnergyEfficiency, Idle Power, Security

Mobile

Battery Life, Performance,

Energy Efficiency, Graphics,Security

Optimized Nehalem core to meet all market segments

ptimized Nehalem core to meet all market segments



6/656

Intel Core Microarchitecture Recap

Wide Dynamic Execution

- 4-wide decode/rename/retire

Advanced Digital Media Boost

- 128-bit wide SSE execution units Intel HD Boost

- New SSE4.1 Instructions

Smart Memory Access

- Memory Disambiguation- Hardware Prefetching

Advanced Smart Cache

- Low latency, high BW shared L2 cache

N e h a le m b u i l d s o n t h e g r e a t I n t e l Co r e m i c r o a r ch i t e c t u r e

Nehalem: Next Generation Intel

Microarchitecture


7/657



Simultaneous Multi-Threading Feeding the Engine

New Cache HierarchyNew Platform Architecture


Agenda


Microarchitecture


8/658

Designed for Performance

Execution

Units

Out-of-Order

Scheduling &

Retirement

L2 Cache

& Interrupt

Servicing

Instruction Fetch

& L1 Cache

Branch PredictionInstruction

Decode &

Microcode

Paging

L1 Data Cache

Memory Ordering

& Execution

Additional Caching

Hierarchy

New SSE4.2

Instructions

DeeperBuffers

Faster

Virtualization

Simultaneous

Multi-ThreadingBetter Branch

Prediction

Improved Lock

Support

ImprovedLoop

Streaming


9/659

Enhanced Processor Core

Instruction Fetch andPre Decode

Instruction Queue

Decode

ITLB

Rename/Allocate

Retirement Unit(ReOrder Buffer)

Reservation Station

Execution Units

DTLB

2nd Level TLB4

4

6

32kBInstruction Cache

32kB

Data Cache

256kB

2nd Level Cache

L3 and beyond

Fr o n t En d E x e c u t i o n

E n g i n e

M e m o r y


10/6510

Front-end

Responsible for feeding the compute engine- Decode instructions- Branch Prediction

Key Intel Core2 processor features- 4-wide decode- Macrofusion- Loop Stream Detector

Instruction Fetch and

Pre Decode

Instruction Queue

Decode

ITLB32kB

Instruction Cache


11/6511

Macrofusion Recap

Introduced in the Intel Core2 Processor Family

TEST/CMP instruction followed by a conditionalbranch treated as a single instruction

- Decode as one instruction- Execute as one instruction- Retire as one instruction

Higher p e r f o r m a n ce - Improves throughput- Reduces execution latency

Improved p o w e r e f f i c i e n c y - Less processing required to accomplish the same work


12/65

12

Nehalem Macrofusion

Goal: Identify more macrofusion opportunities for increasedp e r f o r m a n c e and p o w e r e f f i c i e n c y

Support all the cases in Intel Core2 processor PLUS- CMP+Jcc macrofusion added for the following branch conditions

JL/JNGE

JGE/JNL

JLE/JNG

JG/JNLE

Intel Core 2 processor only supports macrofusion in 32-bit mode

- Nehalem supports macrofusion in both 32-bit and 64-bit modes

I n c r e a s e d m a c r o f u s i o n b e n e f i t o n N e h a l e m


Microarchitecture


13/65

13

Loop Stream Detector Reminder

Loops are very common in most software Take advantage of knowledge of loops in HW

- Decoding the same instructions over and over- Making the same branch predictions over and over

Loop Stream Detector identifies software loops- Stream from Loop Stream Detector instead of normal path- Disable unneeded blocks of logic for p o w e r sa v in g s - H i g h e r p e r f o r m a n ce by removing instruction fetch limitations

I n t e l Co r e 2 p r o c e ss o r L o o p St r e a m D e t e c t o r

Branch

Prediction Fetch Decode

Loop

Stream

Detector

1 8

I n s t r u c t i o n s


Microarchitecture


14/65

14

Nehalem Loop Stream Detector

Same concept as in prior implementations

H i g h e r p e r f o r m a n ce : Expand the size of theloops detected

I m p r o v e d p o w e r e f f i c i e n c y : Disable even morelogic

N e h a l e m Lo o p S t r e a m D e t e c t o r

Branch

Prediction Fetch Decode

Loop

Stream

Detector

2 8

M i c r o - O p s



15/65

15

Branch Prediction Reminder

Goal: Keep powerful compute engine fed

Options:- Stall pipeline while determining branch direction/target- Predict branch direction/target and correct if wrong

High branch prediction accuracy leads to:- Pe r f o r m a n ce : Minimize amount of time wasted correcting

from incorrect branch predictions Through higher branch prediction accuracy

Through faster correction when prediction is wrong

- Po w e r e f f i c i e n c y : Minimize number ofspeculative/incorrect micro-ops that are executedCo n t i n u e d f o c u s o n b r a n ch

p r e d i c t i o n i m p r o v e m e n t s


16/65

16

L2 Branch Predictor

Problem: Software with a large code footprint notable to fit well in existing branch predictors

- Example: Database applications

Solution: Use multi-level branch prediction scheme

Benefits:- Higher p e r f o r m a n ce through improved branch prediction

accuracy

- Greater p o w e r e f f i c i e n c y through less mis-speculation

d S k ff ( S )


17/65

17

Renamed Return Stack Buffer (RSB)

Instruction Reminder- CALL: Entry into functions- RET: Return from functions

Classical Solution- Return Stack Buffer (RSB) used to predict RET- Problems:

RSB can be corrupted by speculative path

RSB can overflow, overwriting valid entries

The Re n am e d RSB - No RET mispredicts in the common case

- No stack overflow issue

E i E i


18/65

18

Execution Engine

Start with powerful Intel Core2 processorexecution engine- Dynamic 4-wide Execution

- Advanced Digital Media Boost 128-bit wide SSE- HD Boost (Penryn)

SSE4.1 instructions

- Super Shuffler (Penryn)- Add Nehalem enhancements- Additional parallelism for higher performance


Microarchitecture

E ti U it O i


19/65

19

Execute 6 operations/cycle

3 Memory Operations

1 Load

1 Store Address

1 Store Data

3 Computational Operations

Execution Unit Overview

Unified Reservation Station

Port0

Port1

Port2

Po

rt3

Port4

Po

rt5

LoadStore

Address

Store

Data

Integer ALU &

Shift

Integer ALU &

LEA

Integer ALU &

Shift

BranchFP AddFP Multiply

Complex

IntegerDivide

SSE Integer ALU

Integer Shuffles

SSE Integer

Multiply

FP Shuffle

SSE Integer ALU

Integer Shuffles

Unified Reservation Station Schedules operations to Execution units

Single Scheduler for all Execution Units

Can be used by all integer, all FP, etc.

I d P ll li


20/65

20

Increased Parallelism

Goal: Keep powerfulexecution engine fed

Nehalem increases size of

out of order window by 33% Must also increase other

corresponding structures 016

3248

64

80

96

112

128

Dothan Merom Nehalem

Concurrent uOps Possible

I n c r e a s e d R eso u r c e s f o r H i g h e r Pe r f o r m a n c e

Structure Merom Nehalem CommentReservation Station 32 36 Dispatches operations

to execution units

Load Buffers 32 48 Tracks all loadoperations allocated

Store Buffers 20 32 Tracks all storeoperations allocated


Microarchitecture

Source: Microarchitecture specifications

E h d M S b t


21/65

21

Enhanced Memory Subsystem

Start with great Intel Core 2 Processor Features- Memory Disambiguation- Hardware Prefetchers

- Advanced Smart Cache

New Nehalem Features- New TLB Hierarchy

- Fast 16-Byte unaligned accesses- Faster Synchronization Primitives

New TLB Hierarchy


22/65

22

New TLB Hierarchy

Problem: Applications continue to grow in data size

Need to increase TLB size to keep the pace for performance

Nehalem adds new low-latency unified 2nd level TLB

# of Entries

1st Level Instruction TLBs

Small Page (4k) 128

Large Page (2M/4M) 7 per thread

1st Level Data TLBs

Small Page (4k) 64

Large Page (2M/4M) 32

New 2nd Level Unified TLB

Small Page Only 512

Fast Unaligned Cache Accesses


23/65

23

Fast Unaligned Cache Accesses

Two flavors of 16-byte SSE loads/stores exist- Aligned (MOVAPS/D, MOVDQA) -- Must be aligned on a 16-byte boundary- Unaligned (MOVUPS/D, MOVDQU) -- No alignment requirement

Prior to Nehalem- Optimized for Aligned instructions

- Unaligned instructions slower, lower throughput -- Even for aligned accesses! Required multiple uops (not energy efficient)- Compilers would largely avoid unaligned load

2-instruction sequence (MOVSD+MOVHPD) was faster

Nehalem optimizes Unaligned instructions- Same speed/throughput as Aligned instructions on aligned accesses

- Optimizations for making accesses that cross 64-byte boundaries fast Lower latency/higher throughput than Intel Core 2 processor- Aligned instructions remain fast

No reason to use aligned instructions on Nehalem! Benefits:

- Compiler can now use unaligned instructions without fear- H ig h e r p e r f o r m a n c e on key media algorithms- More e n e r g y e f f i c i e n t than prior implementations


Faster Synchronization Primitives


24/65

24

Faster Synchronization Primitives

Multi-threaded softwarebecoming more prevalent

S c a l a b i l i t y of multi-thread

applications can be limitedby synchronization

Synchronization primitives:LOCK prefix, XCHG

Reduce synchronizationlatency for legacy software

Greater thread s c a l a b i l i t y with Nehalem

LOCK CMPXCHG Per fo rm ance

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Pentium 4 Core 2 Nehalem

RelativeLaten

cy


Simultaneous Multi-Threading (SMT)


25/65

25

Simultaneous Multi-Threading (SMT)

SMT

- Run 2 threads at the same time per core Take advantage of 4-wide execution

engine

- Keep it fed with multiple threads

- Hide latency of a single thread Most p o w e r e f f i c i e n t performance

feature

- Very low die area cost

- Can provide significant performancebenefit depending on application

- Much more efficient than adding anentire core

Nehalem advantages- Larger caches- Massive memory BW

Sim u l t a n e o u s m u l t i - t h r e a d i n g en h a n ce sp e r f o r m a n c e a n d e n e r g y e f f i c i e n c y

Time(proc.cy

cles)

w/o SMT SMT

Note: Each box

represents a

processorexecution unit

SMT Implementation Details


26/65

26

SMT Implementation Details

Multiple policies possible for implementation of SMT Replicated Duplicate state for SMT

- Register state- Renamed RSB- Large page ITLB

Partitioned Statically allocated between threads- Key buffers: Load, store, Reorder- Small page ITLB

Competitively shared Depends on threads dynamic behavior

- Reservation station- Caches- Data TLBs, 2nd level TLB

Unaware- Execution units

Agenda


27/65

27



Simultaneous Multi-Threading Feeding the EngineNew Cache HierarchyNew Platform Architecture


Agenda

Feeding the Execution Engine


28/65

28

Feeding the Execution Engine

Powerful 4-wide dynamic execution engine

Need to keep providing fuel to the execution engine

Nehalem Goals

- Lo w l a t e n cy to retrieve data Keep execution engine fed w/o stalling

- High data b a n d w i d t h Handle requests from multiple cores/threads seamlessly

- S c a l a b i l i t y Design for increasing core counts

Combination of great ca ch e h i e r a r c h y and n e w p l a t f o r m

N e h a l e m d e s i g n e d t o f e e d t h e e x e cu t i o n e n g i n e


Designed For Modularity


29/65

29

Designed For Modularity

Optimal price / performance / energy efficiency

ptimal price / performance / energy efficiency

for server desktop and mobile products

or server desktop and mobile products

DRAM

QPI

PI

Core

Uncore

C

O

R

E

C

O

R

E

C

O

R

E

IMC QPIPowerPower

&&

ClockClock

#QPI#QPILinksLinks

## memmemchannelschannels

Size ofSize ofcachecache# cores# cores

PowerPowerManageManage--mentment

Type ofType ofMemoryMemory

IntegratedIntegratedgraphicsgraphics

Differentiation in the Uncore:

2008

2009 Servers & Desktops

QPI

L3 Cache

QPI:

Intel

QuickPathInterconnect

Intel Smart Cache Core Caches


30/65

30

Intel Smart Cache Core Caches

New 3-level Cache Hierarchy 1st level caches

- 32kB Instruction cache

- 32kB Data Cache Support more L1 misses in parallel

than Core 2

2nd level Cache

- New cache introduced in Nehalem- Unified (holds code and data)- 256 kB per core

- Pe r f o r m a n ce : Very low latency- S c a l a b i l i t y : As core countincreases, reduce pressure onshared cache

Co r e

2 5 6 k B

L 2 Ca c h e

3 2 k B L 1

Da t a Ca c h e

3 2 k B L 1

I n s t . Ca c h e


Intel Smart Cache -- 3rd Level Cache


31/65

31

Intel Smart Cache 3 Level Cache

New 3rd level cache

Shared across all cores

Size depends on # of cores- Quad-core: Up to 8MB- S c a l a b i l i t y :

Built to vary size with varied

core counts Built to easily increase L3 size

in future parts

Inclusive cache policy forbest p e r f o r m a n ce

- Data residing in L1/L2 m u s tbe present in 3rd level cache

L3 Cache

Co r e

L 2 Ca c h e

L 1 Ca c h e s

Co r e

L 2 Ca c h e

L 1 Ca c h e s

Co r e

L 2 Ca c h e

L 1 Ca c h e s

Inclusive vs. Exclusive Caches


32/65

32

c us e s c us e Cac esCache Miss

E x c l u s i v e I n c l u s i v e

Core

0

Core

1

Core

2

Core

3

L3 Cache

Core

0

Core

1

Core

2

Core

3

L3 Cache

Data request from Core 0 misses Core 0s L1 and L2

Request sent to the L3 cache



33/65

33

Cache Miss


Core

0

Core

1

Core

2

Core

3

L3 Cache

Core

0

Core

1

Core

2

Core

3

L3 Cache

Core 0 looks up the L3 Cache

Data not in the L3 Cache

MISS! MISS!



34/65

34

Cache Miss


Core

0

Core

1

Core

2

Core

3

L3 Cache

Core

0

Core

1

Core

2

Core

3

L3 CacheMISS! MISS!

Must check other cores Guaranteed data is not on-die

Greater s c a l a b i l i t y from inclusive approach



35/65

35

Cache Hit


Core

0

Core

1

Core

2

Core

3

L3 Cache

Core

0

Core

1

Core

2

Core

3

L3 CacheHIT! HIT!

No need to check other cores Data could be in another core

BUT Nehalem is smart




36/65

36

Cache Hit

I n c l u s i v e

Core

0

Core

1

Core

2

Core

3

L3 CacheHIT!

Core valid bits limitunnecessary snoops

Maintain a set of corevalid bits per cache line inthe L3 cache

Each bit represents a core

If the L1/L2 of a core maycontain the cache line, thencore valid bit is set to 1

No snoops of cores areneeded if no bits are set

If more than 1 bit is set,line cannot be in Modified

state in any core

0 0 0 0



37/65

37

Read from other core


Core

0

Core

1

Core

2

Core

3

L3 Cache

Core

0

Core

1

Core

2

Core

3

L3 CacheMISS! HIT!

Must check all other cores Only need to check the core

whose core valid bit is set

0 0 1 0

Todays Platform Architecture


38/65

38

y

Front-Side Bus Evolution

ICHICH

CPUCPU

CPUCPU

CPUCPU

CPUCPU

MCHMCH

memory

ICHICH

CPUCPU

CPUCPU

CPUCPU

CPUCPU

MCHMCH

memory

ICHICH

CPUCPU

CPUCPU

CPUCPU

CPUCPU

MCHMCH

memory

Nehalem-EP Platform Architecture


39/65

39

Integrated Memory Controller- 3 DDR3 channels per socket- Massive memory b a n d w i d t h

- Memory Bandwidth scales with# of processors

- Very l o w m e m o r y la t e n c y Intel QuickPath Interconnect

(QPI)- New point-to-point interconnect- Socket to socket connections- Socket to chipset connections- Build s c a l a b l e solutions

Nehalem

EP

Nehalem

EP

Tylersburg

EP

Si g n i f i c a n t p e r f o r m a n ce l e a p f r o m n e w p l a t f o r m


Intel QuickPath Interconnect


40/65

40

Nehalem introduces newIntel QuickPathInterconnect (QPI)

H ig h b a n d w i d t h , l o wl a t e n c y point to pointinterconnect

Up to 6.4 GT/sec initially

- 6.4 GT/sec -> 12.8 GB/sec- Bi-directional link -> 25.6

GB/sec per link

- Future implementations at

even higher speeds Highly s c a l a b l e for systems

with varying # of sockets

Nehalem

EPNehalem

EP

IOH

memoryCPU CPU

CPU CPU

IOH

memory

memory

memory


Integrated Memory Controller (IMC)


41/65

41

Memory controller optimized per marketsegment Initial Nehalem products

- Native DDR3 IMC- Up to 3 channels per socket- Speeds up to DDR3-1333

Massive m e m o r y b a n d w i d t h

- Designed for l o w l a t e n c y - Support RDIMM and UDIMM- RAS Features

Future products

- S c a l a b i l i t y Vary # of memory channels Increase memory speeds Buffered and Non-Buffered solutions

- Market specific needs Higher memory capacity Integrated graphics

Nehalem

EPNehalem

EP

Tylersburg

EP

DDR3 DDR3

Si g n i f i c a n t p e r f o r m a n ce t h r o u g h n e w I M C


IMC Memory Bandwidth (BW)


42/65

42

3 memory channels per socket

Up to DDR3-1333 at launch- Massive m em o r y BW - HEDT: 32 GB/sec peak- 2S server: 64 GB/sec peak

S c a l a b i l i t y - Design IMC and core to takeadvantage of BW

- Allow performance to scale withcores

Core enhancements Support more cache misses per

core Aggressive hardware prefetching

w/ throttling enhancements

Example IMC Features Independent memory channels Aggressive Request Reordering

HarpertownHarpertown

1600 FSB1600 FSB0

1

2

3

4

NehalemNehalem

EPEP

3x DDR33x DDR3

--13331333

2 socket Streams Triad

>4Xbandwidth

Ma ss iv e m em o r y BW p r o v i d e s p e r f o r m a n ce a n d s ca la b i l i t y


Source: Internal measurements

Non-Uniform Memory Access (NUMA)


43/65

43

FSB architecture- All memory in one location Starting with Nehalem

- Memory located in multiple

places Latency to memorydependent on location

Local memory- Highest BW- Lowest latency

Remote Memory- Higher latency

Nehalem

EPNehalem

EP

Tylersburg

EP

En s u r e s o f t w a r e i s N UMA - o p t i m i z e d f o r b e s t p e r f o r m a n c e


Local Memory Access


44/65

44

CPU0 requests cache line X, not present in any CPU0 cache

- CPU0 requests data from its DRAM- CPU0 snoops CPU1 to check if data is present Step 2:

- DRAM returns data- CPU1 returns snoop response

Local memory latency is the maximum latency of the two responses

Nehalem optimized to keep key latencies close to each other

CPU0 CPU1QPI

DRAMDRAM


QPI:

Intel


Remote Memory Access


45/65

45

CPU0 requests cache line X, not present in any CPU0 cache

- CPU0 requests data from CPU1- Request sent over QPI to CPU1- CPU1s IMC makes request to its DRAM- CPU1 snoops internal caches

- Data returned to CPU0 over QPI Remote memory latency a function of having a low latencyinterconnect

CPU0 CPU1QPI

DRAMDRAM

QPI:

Intel


Memory Latency Comparison


46/65

46

Lo w m em o r y l a t en c y critical to high performance Design integrated memory controller for low latency Need to optimize both local and remote memory latency Nehalem delivers

- Huge reduction in local memory latency- Even remote memory latency is fast

Effective memory latency depends per application/OS- Percentage of local vs. remote accesses- NHM has lower latency regardless of mix

Relativ e Memory Latency Comparison

0.00

0.20

0.40

0.60

0.80

1.00

Harpertown (FSB 1600) Nehalem (DDR3-1333) Local Nehalem (DDR3-1333) Remote

RelativeMemoryLatency


Microarchitecture

Agenda


47/65

47

Nehalem Design Philosophy Enhanced Processor CorePerformance FeaturesSimultaneous Multi-Threading

Feeding the EngineNew Cache HierarchyNew Platform Architecture


Virtualization


48/65

48

To get best virtualized p e r f o r m a n ce - Have best native performance- Reduce:

# of transitions into/out of virtual machine

Latency of transitions Nehalem virtualization features

- Reduced latency for transitions- Virtual Processor ID (VPID) to reduce effective cost of transitions

- Extended Page Table (EPT) to reduce # of transitions

Great virtualization performance w/ Nehalem


Virtualization Enhancements


49/65

49

0%

20%

40%

60%

80%

100%

Relative

Latenc

y

Merom Penryn Nehalem

>40%>40%FasterFaster

Greatly improved Virtualization performance

reatly improved Virtualization performance

Virtualization Latency

irtualization Latency

Time to enter and exit virtual machine

Virtual

ProcessorID (VPID)

ExtendedPage

Tables (EPT)

Reduce frequency of

invalidation of TLB entries

Hardware to translateguest to host physicaladdress

Previously done in VMMsoftware

Removes #1 cause ofexits from virtual machine

Allows guests full controlover page tables

Architecture:rchitecture:

Round Trip Virtualization Latency

Source: pre-silicon/post-silicon measurements


Extended Page Tables (EPT) Motivation


50/65

50

Guest

OS

VM1

VMM

CR3

Guest Page Table

CR3

Active Page Table

A VMM needs to protectphysical memory Multiple Guest OSs share

the same physicalmemory

Protections areimplemented throughpage-table virtualization

Page table virtualizationaccounts for asignificant portion ofvirtualization overheads VM Exits / Entries

The goal of EPT is to

reduce these overheads

Guest page table changescause exits into the VMM

VMM maintains the activepage table, which is used

by the CPU

EPT Solution


51/65

51

Intel 64 Page Tables

- Map Guest Linear Address to Guest Physical Address- Can be read and written by the guest OS

New EPT Page Tables under VMM Control

- Map Guest Physical Address to Host Physical Address

- Referenced by new EPT base pointer No VM Exits due to Page Faults, INVLPG or CR3

accesses

Intel

64

PageTables

Guest

Linear

Address

EPT

PageTables

CR3

Guest

Physical

Address

EPT

Base Pointer

Host

Physical

Address

Extending Performance and Energy Efficiency- SSE4 . 2 I n s t r u c t i o n Se t A r c h i t e c t u r e ( I SA ) Le a d e r s h i p i n 2 0 0 8


52/65

52

SSE4.2SSE4.2(Nehalem Core)(Nehalem Core)

STTNISTTNIe.g. XMLe.g. XML

accelerationacceleration

POPCNTPOPCNTe.g. Genomee.g. Genome

MiningMining

ATAATA(Application(Application

TargetedTargetedAccelerators)Accelerators)

SSE4.1SSE4.1((PenrynPenryn Core)Core)

SSE4SSE4(45nm CPUs)(45nm CPUs)

CRC32CRC32e.g.e.g. iSCSIiSCSIApplicationApplication

New CommunicationsNew Communications

CapabilitiesCapabilities

Hardware based CRCinstructionAccelerated Network

attached storageImproved power efficiencyfor Software I-SCSI, RDMA,and SCTP

Accelerated SearchingAccelerated Searching

& Pattern Recognition& Pattern Recognitionof Large Data Setsof Large Data Sets

Improved performance forGenome Mining,

Handwriting recognition.Fast Hamming distance /Population count

AcceleratedAcceleratedString and TextString and Text

ProcessingProcessing

Faster XML parsing

Faster search and patternmatchingNovel parallel data matchingand comparison operations

STTNI

ATA

W h a t s h o u l d t h e a p p l i c a t i o n s , O S a n d VM M v e n d o r s d o ?:

U n d e r s t a n d t h e b e n e f i t s & t a k e ad v a n t a g e o f n e w i n s t r u c t i o n s in 2 0 0 8 .

P r o v i d e u s f e e d b a c k o n i n s t r u c t i o n s I SV w o u l d l i k e t o s ee f o r

n e x t g e n e r a t i o n o f a p p l ic a t i o n s

STTNI - STring & Text New InstructionsOp e r a t e s o n s t r i n g s o f b y t e s o r w o r d s ( 1 6 b )


53/65

53

Equal Each Instruction

True for each character in Src2 ifsame position in Src1 is equalSr c1: Test \ t daySr c2: t ad t seTMask: 01101111

Equal Ordered Instruction

Finds the start of a substring (Src1)within another string (Src2)Src1: ABCA0XYZSrc2: S0BACBABMask: 00000010

Equal Any Instruction

True for each character in Src2if any character in Src1 matchesSr c1: Exampl e\ n

Sr c2: at ad t sTMask: 10100000

Ranges Instruction

True if a character in Src2 is inat least one of up to 8 rangesin Src1Sr c1: AZ 0 9zzzSr c2: t aD t seTMask: 00100001

Projected 3.8x kernel speedup on XML parsing &2.7x savings on instruction cycles

STTNI MODEL

x x xx xxx Tx x xx Txx x

x T xx xxx x

x x xx xTx x

x x Fx xxx x

x x xx xxT x

x x xT xxx xF x xx xxx x

t da st Te

T

st

e

ad\t

y

Check

each bit

in the

diagonal

Source1

(XMM)

Source2 (XMM / M128)

IntRes1

0 1 01 111 1

Bit 0

Bit 0

STTNI ModelSource2 (XMM / M128)Source2 (XMM / M128)

EQUAL ANY EQUAL EACH


54/65

54

AND the results

along each

diagonal

x x xx xxx Tx x xx Txx x

x T xx xxx x

x x xx xTx x

x x Fx xxx xx x xx xxT x

x x xT xxx xF x xx xxx x

t da st Te

T

st

e

ad\t

y

Checkeach bit inthediagonal

S

ource1

(XMM)


IntRes10 1 01 111 1

Bit 0

Bit 0

F F FF FF FFF F FF FFF F

F F FF FFF F

T T FF FFF F

F F FF FFF FF F FF FFF F

F F FF FFF FF F FF FFF F

a a dt t Ts

E

am

x

elp

\n

OR results downeach column

Source1

(XMM)


IntRes11 1 00 000 0

Bit 0

Bit 0

Source1(XM

M)

Bit 0fF F TfF TFF FfF T FfF FTF x

fF F FfF xFT xfF F TfF xxF xfT fTfTfT xxx xfT fT xfT xxx xfT x xfT xxx xfT x xx xxx x


S BA0 ABC B

A

CA

B

YX0

Z

IntRes10 0 00 100 0

Bit 0

F TFF FF TTF TTF FFF T

T TTT TTT T

T TFT TTT TF F FF FFF FF F TF FFF F

F F FF FFF FT TTT TTT T

t Da st Te

A

09

Z

zzz

z

Source1

(X

MM)


IntRes10 100 000 1

Bit 0

First Compare

does GE, next

does LEAND GE/LE pairs of results

OR those results

Bit 0

EQUAL ORDEREDRANGES

AND the results

along each

diagonal

Bit 0

Source1(X

MM)

Example Code For strlen()

STTNI Version


55/65

55

int sttni_strlen(const char * src)

{

char eom_vals[32] = {1, 255, 0};

__asm{

mov eax, src

movdqu xmm2, eom_vals

xor ecx, ecx

topofloop:

add eax, ecx

movdqu xmm1, OWORD PTR[eax]

pcmpistri xmm2, xmm1, imm8

jnz topofloop

endofstring:

add eax, ecx

sub eax, srcret

}

}

string equ [esp + 4]mov ecx,string ; ecx -> string

test ecx,3 ; test if string is aligned on 32 bitsje short main_loopstr_misaligned:

; simple byte loop until string is alignedmov al,byte ptr [ecx]add ecx,1test al,al

je short byte_3

test ecx,3jne short str_misalignedadd eax,dword ptr 0 ; 5 byte nop to align label belowalign 16 ; should be redundant

main_loop:mov eax,dword ptr [ecx] ; read 4 bytesmov edx,7efefeffhadd edx,eax

xor eax,-1xor eax,edxadd ecx,4test eax,81010100h

je short main_loop; found zero byte in the loopmov eax,[ecx - 4]test al,al ; is it byte 0

je short byte_0test ah,ah ; is it byte 1

je short byte_1test eax,00ff0000h ; is it byte 2

je short byte_2test eax,0ff000000h

; is it byte 3je short byte_3jmp short main_loop

; taken if bits 24-30 are clear and bit; 31 is setbyte_3:

lea eax,[ecx - 1]mov ecx,stringsub eax,ecxret

byte_2:lea eax,[ecx - 2]mov ecx,stringsub eax,ecx

retbyte_1:lea eax,[ecx - 3]mov ecx,stringsub eax,ecxret

byte_0:lea eax,[ecx - 4]

mov ecx,stringsub eax,ecxret

strlen endpend

STTNI Version

Cu r r e n t Co d e : M i n im um o f 1 1 i n s t r u c t i o n s ; I n n e r l o o p p r o ce ss es 4 b y t e s w i t h 8 i n s t r u c t io n s

STTN I Co d e : M i n im um o f 1 0 i n s t r u c t i o n s ; A s in g l e i n n e r l o op p r o ce ss es 1 6 b y t e s w i t h o n l y 4i n s t r u c t i o n s

ATA - Application Targeted Accelerators

CRC32 POPCNTPOPCNT d t i th b f


56/65

56

One register maintains the running CRC value as asoftware loop iterates over data.Fixed CRC polynomial = 11EDC6F41h

Replaces complex instruction sequences for CRC in

Upper layer data protocols: iSCSI, RDMA, SCTP

SRC Data 8/16/32/64 bit

Old CRC

63 3132 0

0 New CRC

63 3132 0

DST

DST

0

X

Accumulates a CRC32 value using the iSCSI polynomial

En a b l e s e n t e r p r i s e c l a s s d a t a a ss u r a n ce w i t h h i g h d a t a r a t e s

i n n e t w o r k e d s t o r a g e i n a n y u s e r e n v i r o n m e n t .

0 1 0 . . . 0 0 1 1

63 1 0 Bit

0x3

RAX

RBX

0ZF=? 0

POPCNT determines the number of nonzero

bits in the source.

POPCNT is useful for speeding up fast matching indata mining workloads including: DNA/Genome Matching Voice Recognition

ZFlag set if result is zero. All other flags (C,S,O,A,P)reset

CRC32 Preliminary Performance


57/65

57

crc32c_sse42_optimized_version(uint32 crc, unsigned

char const *p, size_t len)

{ // Assuming len is a multiple of 0x10

asm("pusha");

asm("mov %0, %%eax" :: "m" (crc));

asm("mov %0, %%ebx" :: "m" (p));

asm("mov %0, %%ecx" :: "m" (len));

asm("1:");

// Processing four byte at a time: Unrolled four times:

asm("crc32 %eax, 0x0(%ebx)");



asm("crc32 %eax, 0xc(%ebx)");

asm("add $0x10, %ebx")2;

asm("sub $0x10, %ecx");

asm("jecxz 2f");

asm("jmp 1b");

asm("2:");

asm("mov %%eax, %0" : "=m" (crc));

asm("popa");

return crc;

}}

Preliminary tests involved Kernel code implementing

CRC algorithms commonly used by iSCSI drivers.

32-bit and 64-bit versions of the Kernel under test

32-bit version processes 4 bytes of data using

1 CRC32 instruction

64-bit version processes 8 bytes of data using1 CRC32 instruction

Input strings of sizes 48 bytes and 4KB used for the

test

32 - bit 64 - bit

InputDataSize =48 bytes

6.53 X 9.85 X

InputDataSize = 4KB

9.3 X 18.63 X

CRC32 optimized Code

P r e l im i n a r y Re su l t s sh o w CRC3 2 i n s t r u c t i o n o u t p e r f o r m i n g t h e

f a s t e s t CRC3 2 C so f t w a r e a lg o r i t h m b y a b i g m a r g i n

Tools Support of New Instructions

- Intel C++ Compiler 10 x supports the new instructions


58/65

58

- Intel C++ Compiler 10.x supports the new instructions SSE4.2 supported via intrinsics

Inline assembly supported on both IA-32 and Intel64 targets

Necessary to include required header files in order to access intrinsics for Supplemental SSE3

for SSE4.1

for SSE4.2

- Intel Library Support XML Parser Library using string instructions will beta Spring 08 and release product

in Fall 08

IPP is investigating possible usages of new instructions

- Microsoft* Visual Studio 2008 VC++ SSE4.2 supported via intrinsics

Inline assembly supported on IA-32 only

Necessary to include required header files in order to access intrinsics

for Supplemental SSE3 for SSE4.1

for SSE4.2

VC++ 2008 tools masm, msdis, and debuggers recognize the new instructions

Software Optimization Guidelines


59/65

59

Most optimizations for Intel

Core

microarchitecturestill hold

Examples of new optimization guidelines:- 16-byte unaligned loads/stores- Enhanced macrofusion rules- NUMA optimizations

Nehalem SW Optimization Guide will be published

Intel C++ Compiler will support settings forNehalem optimizations


Microarchitecture

Summary


60/65

60

Nehalem - The Intel 45 nm Tock Processor Designed for

- Pow e r Ef f i c i e n c y

- S c a l a b i l i t y - Pe r f o r m a n ce

Enhanced Processor Core

Brand New Platform Architecture Extending ISA Leadership

Additional sources ofinformation on this topic:


61/65

61

information on this topic:

DPTS003-High End Desktop Platform Overview for the NextGeneration Intel Microarchitecture (Nehalem) Processor

-April 3 11:10-12:00; Room 3C+3D, 3rd floor

NGMS002-Upcoming Intel 64 Instruction Set ArchitectureExtensions- April 3 16:00 17:50; Auditorium, 3rd floor

NGMC001-Next Generation Intel Microarchitecture (Nehalem) andNew Instruction Extensions - Chalk Talk- April 3 17:50 18:30; Auditorium, 3rd floor

Demos in the Advance Technology Zone on the 3rd floor

More web based info: http://www.intel.com/technology/architecture-silicon/next-gen/index.htm

Session Presentations - PDFs
http://www.intel.com/technology/architecture-silicon/next-gen/index.htmhttp://www.intel.com/technology/architecture-silicon/next-gen/index.htmhttp://www.intel.com/technology/architecture-silicon/next-gen/index.htmhttp://www.intel.com/technology/architecture-silicon/next-gen/index.htmhttp://www.intel.com/technology/architecture-silicon/next-gen/index.htm


62/65

62

The PDF of this Session presentation is available from ourIDF Content Catalog:

https://intel.wingateweb.com/SHchina/catalog/controller/catalog

These can also be found from links on www.intel.com/idf

Please Fill out theSession Evaluation Form
https://intel.wingateweb.com/SHchina/catalog/controller/cataloghttp://www.intel.com/idfhttp://www.intel.com/idfhttps://intel.wingateweb.com/SHchina/catalog/controller/catalog


63/65

63

Session Evaluation Form

Thank You for your input, we use it to improve futureIntel Developer Forum events

Put in your lucky draw coupon to win the prize at theend of the track!

You must be present to win!

Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE,

EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTEDBY THIS DOCUMENT EXCEPT AS PROVIDED IN INTELS TERMS AND CONDITIONS OF SALE FOR SUCH


64/65

64

BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTELS TERMS AND CONDITIONS OF SALE FOR SUCH

PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIEDWARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIESRELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANYPATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FORUSE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS.

Intel may make changes to specifications and product descriptions at any time, without notice.

All products, dates, and figures specified are preliminary based on current expectations, and are subject tochange without notice.

Intel, processors, chipsets, and desktop boards may contain design defects or errors known as errata, whichmay cause the product to deviate from published specifications. Current characterized errata are availableon request.

Nehalem, Harpertown, Tylersburg, Merom, Dothan, Penryn, Bloomfield and other code names featured are

used internally within Intel to identify products that are in development and not yet publicly announced forrelease. Customers, licensees and other third parties are not authorized by Intel to use code names inadvertising, promotion or marketing of any product or services and any such use of Intel's internal codenames is at the sole risk of the user

Performance tests and ratings are measured using specific computer systems and/or components andreflect the approximate performance of Intel products as measured by those tests. Any difference insystem hardware or software design or configuration may affect actual performance.

Intel, Intel Inside, Xeon, Pentium, Core and the Intel logo are trademarks of Intel Corporation in the UnitedStates and other countries.

*Other names and brands may be claimed as the property of others.

Copyright 2008 Intel Corporation.


65/65

Documents

Intel Nehalem Processor