Christopher Foster Scott Thibaudeau Brian Cleary


Christopher Foster

Scott Thibaudeau

Brian Cleary


Itanium – IA-64: Overview

• Development of the Parallel Processor
  – Success and Failure (Problems and Solutions)
  – Multiple Parallel Pipelines on a Single Die
  – Itanium is born!

• Execution of Parallel Processing in IA-64
  – 10-deep pipeline execution; 9 parallel distribution sites
  – Current and future IA-64 code development

• The Memory Requirements and Specifications
  – Hierarchy: Registers, L1/L2/L3 Cache, Main Memory, HD
  – L1 = Data; L2 = Unified; L3 = Off-Chip: Fully Associative
  – Latency Times
  – Full Memory Block Diagram Overview

• System Management Bus (SM Bus)
  – Thermal System
  – EEPROM, PIROM

• System Bus (IA-64 Bus Architecture)
  – Bandwidth
  – Parallel Processors in Parallel
  – SAC, SDC (control access to the bus)


History of Microprocessors: A Very Abridged Tour

• Beginning of time: Circa 1980 and before…
  – CISC and RISC computers are all that exist.
  – The MOS Technology 6502 lives in every house (Nintendo).
  – Ronald Reagan in office.

• Middle Ages: Circa 1990
  – Parallel processing exists in white papers.
  – IA-32 is in almost every desktop.
  – Vanilla Ice hits it big.

• Current Day: Circa 2000
  – Beowulf Clusters (Distributed Parallel Processing Networks)
  – Pentium breaks the GHz mark with IA-32.
  – Intel develops the IA-64 Architecture to support parallelism on die.


So what’s so good about Parallelism?

At the most efficient, each doubling of the parallel paths divides the execution time IN HALF!

This leads to incredible gains:

• Productivity (Reduced Latency)
  – Wait times for compile/execute
  – Increased functionality in real-time processes

• Reliability (Redundancy)
  – Multiple modules for each functional unit

• Security (Locality)
  – All processors in one place (physically)
  – Encryption power increased

• Scalability (Modular reuse)
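
The "in half" figure is a best case that assumes all of the work can be spread across the parallel paths. A minimal sketch of the usual way to estimate this is Amdahl's law, which is not named on the slide; the function name and the 90% figure below are illustrative assumptions.

```python
def speedup(parallel_fraction: float, paths: int) -> float:
    """Amdahl's law: overall speedup when only `parallel_fraction`
    of the work can be spread across `paths` parallel paths."""
    if not 0.0 <= parallel_fraction <= 1.0 or paths < 1:
        raise ValueError("need 0 <= parallel_fraction <= 1 and paths >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / paths)


if __name__ == "__main__":
    # Compare fully parallel work with work that is 90% parallel (assumed figure).
    for paths in (1, 2, 4, 8):
        print(paths, round(speedup(1.0, paths), 2), round(speedup(0.9, paths), 2))
```

With fully parallel work, two paths give exactly the halving described above; any serial portion reduces the gain, which is why the independent-instruction problem on the next slide matters.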


But are there any disadvantages?

• YES:
  – Memory Size/Latency
  – Branch Prediction
  – Independent Instructions


IA-64 Solves All of these problems:

• Memory Size: 64-bit addressing | Huge Register File
• Memory Latency: Multiple Layers of Cache
• Branch Prediction: Hardware Solution
• Independent Instructions: *New code classes*

And with these problems out of the way…


The way is prepared for multiple parallel processes on a single die:

Explicitly Parallel Instruction Computing (EPIC)

With resources made available, the Itanium is able to use multiple functional units for each process required.

This results in an incredible number of separate pipelined execution paths:

• Integer Function Units (2)
• Memory Units (2)
• Branch Units (3)
• Floating Point Units (2)

Total: 9 separate execution paths!

Note: Though the focus is not on pipelining here, there are 10-deep pipelines for each unit.


Overall Architecture


The Full Pipeline Procedure


Fetch/Distribution Procedures

• M = Memory Units
• F = Floating Point Units
• I = Integer Units
• B = Branch

3 instructions per bundle × 2 bundles per clock = fully 6 instructions per clock.

M0, M1, I0, I1, F0, F1, B0, B1, B2: these are all execution pipelines.
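
The fetch/distribution step can be sketched as a toy model: two bundles of three slots each are offered per clock, and each slot is handed to the first free pipeline of its type. This is a simplified illustration only; the function name `disperse` and the first-free assignment rule are assumptions, and the real dispersal rules are more involved.

```python
# Execution pipelines named on the slide, grouped by unit type.
PORTS = {
    "M": ["M0", "M1"],
    "I": ["I0", "I1"],
    "F": ["F0", "F1"],
    "B": ["B0", "B1", "B2"],
}


def disperse(bundles):
    """Assign each slot of up to two 3-slot bundles to a free pipeline.

    `bundles` is a list of 3-letter strings such as "MII" or "MFB".
    Returns (assigned, stalled): `assigned` maps (bundle, slot) to a pipeline
    name, and `stalled` lists the (bundle, slot) pairs that found no free pipeline.
    """
    if len(bundles) > 2:
        raise ValueError("at most 2 bundles per clock")
    free = {unit: list(names) for unit, names in PORTS.items()}
    assigned, stalled = {}, []
    for b_idx, bundle in enumerate(bundles):
        if len(bundle) != 3 or any(unit not in PORTS for unit in bundle):
            raise ValueError(f"bad bundle: {bundle!r}")
        for s_idx, unit in enumerate(bundle):
            if free[unit]:
                assigned[(b_idx, s_idx)] = free[unit].pop(0)
            else:
                stalled.append((b_idx, s_idx))
    return assigned, stalled


if __name__ == "__main__":
    print(disperse(["MII", "MFB"]))  # all six slots find a pipeline
    print(disperse(["MII", "MII"]))  # four integer slots compete for I0 and I1
```

In this model a pair such as MII + MFB fills six pipelines in one clock, while MII + MII leaves two integer slots waiting because only I0 and I1 exist.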


How do we write code for The Itanium?

• *NEW Code Classes*
  – Allow the programmer to specify specific function units for loads, arithmetic, branch ops, and logic operations
  – Enable users to specify INDEPENDENT INSTRUCTIONS

• Interpretation at OS level:
  – Windows 64 (to be released as Windows XP64)
  – Linux-64, HP-UX, Modesto

• PAL-level interpretation

• Possibility of Virtual Machine interface


And what does this code look like?


Is Itanium Fully Developed? No.

• Some registers yet to be named and used.
• Windows 64 not yet available.
• Cost of processor/memory production still too high.

Moore’s Law: If we keep doubling, then we can expect IA-64 to be around half as long as IA-32. That’s about 5-7 years. That gives us at least 3 more.

And they haven’t written any books on the subject yet either.


Register File

• 256 general and floating-point registers (128 of each)

• 64 bits wide (general registers; the floating-point registers are 82 bits wide)

• Rotating registers
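
The slide only names rotating registers, so the following is a toy sketch of the idea rather than the IA-64 mechanism in full: a rename base is added to each logical register number, and rotating decrements that base, so a value written to logical register N is seen at N+1 after one rotation. The class name, the 8-entry window, and the method names are illustrative assumptions.

```python
class RotatingRegisters:
    """Toy model of a rotating register window (size is an assumption)."""

    def __init__(self, size: int = 8):
        self.size = size
        self.physical = [0] * size
        self.rename_base = 0  # added to every logical register number

    def _index(self, logical: int) -> int:
        return (logical + self.rename_base) % self.size

    def write(self, logical: int, value: int) -> None:
        self.physical[self._index(logical)] = value

    def read(self, logical: int) -> int:
        return self.physical[self._index(logical)]

    def rotate(self) -> None:
        # Decrementing the base shifts every value up by one logical name.
        self.rename_base = (self.rename_base - 1) % self.size


if __name__ == "__main__":
    regs = RotatingRegisters()
    regs.write(0, 42)
    regs.rotate()
    print(regs.read(1))  # the value written to register 0 now appears in register 1
```

The point of the renaming is that a loop body can keep using the same register names on every iteration while each iteration's values move along, which suits software-pipelined loops.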


Memory Hierarchy

• Level 1 Data Cache (L1-D)
• Level 1 Instruction Cache (L1-I)
  – 16 KB, 4-way set associative with 32-byte lines

• Level 2 Unified Cache (L2)
• Level 3 Cache (L3)
• Main Memory (FSB) Bus
  – Maximum bandwidth of 2.1 GB/s

• Level 1 & Level 2 Data Translation Lookaside Buffers (L1/L2-DTLB)

• Instruction Translation Cache (ITLB)


Level 1 Data Cache (L1-D)

• 16 KB, 4-way set associative, write through, no write allocate, with 32-byte lines

• Integer loads have 2-cycle latency

• Floating Point loads bypass L1 Data cache


Level 2 Unified Cache (L2)

• 96 KB, 6-way set associative, write back and write allocate, with 64-byte lines

• Integer loads have 6-cycle latency

• Floating-point loads have 9-cycle latency
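
As a worked example, the size, associativity, and line length quoted for L1-D and L2 determine how many sets each cache has and how an address splits into offset and index bits. The sketch below uses the standard set-associative arithmetic; the function name is an illustrative assumption.

```python
def cache_geometry(size_bytes: int, ways: int, line_bytes: int):
    """Return (sets, offset_bits, index_bits) for a set-associative cache."""
    sets, remainder = divmod(size_bytes, ways * line_bytes)
    if remainder or sets & (sets - 1) or line_bytes & (line_bytes - 1):
        raise ValueError("sets and line size must be whole powers of two")
    offset_bits = line_bytes.bit_length() - 1  # bits selecting a byte in the line
    index_bits = sets.bit_length() - 1         # bits selecting the set
    return sets, offset_bits, index_bits


if __name__ == "__main__":
    print("L1-D:", cache_geometry(16 * 1024, ways=4, line_bytes=32))
    print("L2:  ", cache_geometry(96 * 1024, ways=6, line_bytes=64))
```

The 6-way organization is what lets a 96 KB cache, which is not a power of two, still have a power-of-two number of sets.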


L3 Cache (L3)

• Off-chip

• 2 MB or 4 MB package

• Maximum bandwidth from L3 to L2 is 16 bytes times the core frequency

• Integer loads have 21-cycle latency

• Floating-point loads have 24-cycle latency

So what?
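
One way to see why these numbers matter is to weight the integer load latencies from the cache slides (2, 6, and 21 cycles) by how often a load is satisfied at each level. The sketch below does that, and also evaluates the "16 bytes times the core frequency" bandwidth rule. The hit fractions, the 100-cycle main-memory latency, and the 800 MHz core clock are illustrative assumptions, not figures from the slides.

```python
# Integer load latencies in cycles, as quoted on the cache slides.
INT_LOAD_LATENCY = {"L1": 2, "L2": 6, "L3": 21}


def average_load_latency(hit_fraction, memory_latency_cycles):
    """Average integer load latency in cycles.

    `hit_fraction` maps "L1"/"L2"/"L3" to the fraction of all loads satisfied
    at that level; whatever remains is assumed to go to main memory.
    """
    served = sum(hit_fraction.values())
    if served < 0.0 or served > 1.0 + 1e-9:
        raise ValueError("hit fractions must sum to at most 1")
    cycles = sum(frac * INT_LOAD_LATENCY[level] for level, frac in hit_fraction.items())
    return cycles + max(0.0, 1.0 - served) * memory_latency_cycles


def l3_to_l2_bandwidth(core_hz: float) -> float:
    """Peak L3-to-L2 bandwidth in bytes per second: 16 bytes per core clock."""
    return 16 * core_hz


if __name__ == "__main__":
    fractions = {"L1": 0.90, "L2": 0.07, "L3": 0.02}  # assumed workload
    print(average_load_latency(fractions, memory_latency_cycles=100))
    print(l3_to_l2_bandwidth(800e6) / 1e9, "GB/s at an assumed 800 MHz")
```

Under these assumed fractions, the 1% of loads that reach main memory contributes about as much to the average as all the cache hits from L2 and L3 combined, which is the usual argument for a deep cache hierarchy.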


L1 & L2 Data Translation Lookaside Buffer

• 32 & 96 entries, respectively

• Both fully associative

• Both support page sizes of 4k, 8k, 16k, 64k, 256k, 1M, 4M, 16M, 64M, and 256M

• Purges supported include all page sizes and 4G
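
To put the entry counts and page sizes in perspective, the amount of memory a TLB can map at once (its "reach") is simply entries times page size. A minimal sketch, assuming every entry uses the same page size; the function names are illustrative.

```python
KIB, MIB = 1024, 1024 * 1024

# Page sizes listed on the slide, in bytes.
PAGE_SIZES = [4 * KIB, 8 * KIB, 16 * KIB, 64 * KIB, 256 * KIB,
              1 * MIB, 4 * MIB, 16 * MIB, 64 * MIB, 256 * MIB]


def tlb_reach(entries: int, page_bytes: int) -> int:
    """Bytes of address space covered when every entry maps one page."""
    return entries * page_bytes


def page_offset_bits(page_bytes: int) -> int:
    """Number of low address bits that pass through translation unchanged."""
    return page_bytes.bit_length() - 1


if __name__ == "__main__":
    for entries, name in ((32, "L1-DTLB"), (96, "L2-DTLB")):
        smallest = tlb_reach(entries, PAGE_SIZES[0])
        largest = tlb_reach(entries, PAGE_SIZES[-1])
        print(name, smallest // KIB, "KiB to", largest // MIB, "MiB")
```

The wide range of page sizes is what allows a small, fully associative buffer to cover a large working set.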


Instruction Translation Cache

• Single-level instruction

• 64 entries

• Fully associative


Overall Architecture


IA-64 Thermal Specifications

• What are the components? How does it work?
  - Internal thermal circuit w/ thermal sensing diode

• How does it protect itself from overheating?
  - Comparison to THIGH

• What happens when overheating occurs?
  - Thermal Alert Register tripped
  - To restore…

• What exactly are the heat tolerances? What should be calculated? Any equations?
  - According to Intel…
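
The comparison against THIGH can be pictured with a very small model: each temperature reading is compared to the threshold, and crossing it trips an alert flag. This is a hedged sketch only. The class name, the strict greater-than comparison, the latching behaviour, and the 85 °C threshold are all assumptions, and the restore procedure is not modelled because the slide leaves it unstated.

```python
class ThermalMonitor:
    """Toy model of a sensor reading compared against a THIGH threshold."""

    def __init__(self, t_high_c: float):
        self.t_high_c = t_high_c  # threshold in degrees Celsius
        self.alert = False        # stands in for the Thermal Alert Register

    def sample(self, reading_c: float) -> bool:
        # Assumption: the alert trips above the threshold and stays tripped.
        if reading_c > self.t_high_c:
            self.alert = True
        return self.alert


if __name__ == "__main__":
    monitor = ThermalMonitor(t_high_c=85.0)  # hypothetical threshold
    for reading in (60.0, 84.9, 90.0, 70.0):
        print(reading, monitor.sample(reading))
```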


IA-64 Thermal Specifications


IA-64 Thermal Specifications: Dimensions of Thermal Sensor


IA-64 Thermal Specifications: The Processors

• What about the AMD/P4/P3?

- P4: Application Slows Down (Itanium inherits fundamental heat protection)

- P3: Application Freezes

- As for the AMD…

- Video displaying above characteristics at end of presentation


IA-64 Thermal Specifications: Location of Thermal Sensor


IA-64 System Management Bus (w/Thermal Sensory)

•Why do we care about the PIROM and EEPROM?

- EEPROM is a read/write memory block that lets vendors specify the methods/standards for how data is transferred on the data bus.

- PIROM contains write-protected information about certain characteristics of the processor (e.g., its frequency).

- The thermal sensor works in conjunction with the above components to achieve accurate temperature checking/regulation.


IA-64 System Management Bus: Data/Addressing Management

•Packet Types (Read/Write)

-Memory Units: current address read, random access read, sequential read, byte write, page write

-Thermal Unit: write byte, read byte, send byte, receive byte, ARA

• Addressing

- Memory Units: “1010XXY2b”

- Thermal Unit: “0011XXXZb” “1001XXXZb” “0101XXXZb”


IA-64 System Management Bus: Memory Unit Packet Types


IA-64 System Management Bus: Thermal Unit Packet Types


IA-64 Bus Architecture: SMBus Timing Diagrams


IA-64 Main Bus Architecture: Overview


IA-64 Main Bus Architecture: Specifications

• 64-bit bus running at 2.1 GB/s
• Up to [4] Itaniums can be connected in parallel to the same bus (running at 266 MHz)
• SAC: System Address Controller
• SDC: System Data Controller
• The above controllers assign address or data information from the Itanium(s) to the memory unit (from multiple processors to a single bus line and vice versa)
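
The 2.1 GB/s figure follows from the bus width and clock quoted on this slide. A minimal sketch of the arithmetic, assuming one 64-bit transfer per 266 MHz cycle and an even split when four processors share the bus (both are assumptions; the function name is illustrative):

```python
def bus_bandwidth_gb_per_s(width_bits: int, transfers_per_second: float) -> float:
    """Peak bandwidth in GB/s (10^9 bytes per second)."""
    bytes_per_transfer = width_bits / 8
    return bytes_per_transfer * transfers_per_second / 1e9


if __name__ == "__main__":
    peak = bus_bandwidth_gb_per_s(64, 266e6)
    print(round(peak, 3), "GB/s peak")
    # With four processors on the same bus, the peak is shared between them.
    print(round(peak / 4, 3), "GB/s each if split evenly across 4 processors")
```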


IA-64 Customer Feedback

• What are journalists, customers saying?

- “The heat generated from the Itanium can be compared to an EZ-Bake Oven…Intel is losing its foothold in the processor industry by relying on the archaic x86 architecture.”

- “Upgrading a mission critical system is a daunting task, especially since there exists reliable 64-bit Unix Machines. Then there’s the code conversion problem…”