
  • 15-740/18-740 Computer Architecture

    Lecture 3: SIMD, MIMD, and ISA Principles

    Prof. Onur Mutlu, Carnegie Mellon University

    Fall 2011, 9/14/2011

  • Review of Last Lecture

    - Vector processors
      - Advantages and disadvantages
      - Amortization of control overhead over multiple data elements
      - Vector vs. scalar code execution time
    - SIMD vs. MIMD
    - Concept of on-chip networks

    2

  • Today

    - Wrap up intro to on-chip networks
    - ISA-level tradeoffs

    3

  • Review: SIMD vs. MIMD

    - SIMD: Concurrency arises from performing the same operations on different pieces of data
    - MIMD: Concurrency arises from performing different operations on different pieces of data
      - Control/thread parallelism: execute different threads of control in parallel → multithreading, multiprocessing
      - Idea: Use multiple processors to solve a problem

    4
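The SIMD/MIMD contrast above can be sketched in a few lines of Python (a toy illustration, not from the slides; the arrays, thread tasks, and names are hypothetical):

```python
import threading

# SIMD-style: one operation applied element-wise across many data elements.
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
simd_result = [x + y for x, y in zip(a, b)]  # same op, different data

# MIMD-style: different operations on different data, via separate threads.
results = {}

def task_sum(data):      # one thread of control runs a reduction...
    results["sum"] = sum(data)

def task_scale(data):    # ...while another does element-wise scaling
    results["scaled"] = [2 * x for x in data]

t1 = threading.Thread(target=task_sum, args=(a,))
t2 = threading.Thread(target=task_scale, args=(b,))
t1.start(); t2.start()
t1.join(); t2.join()
```

In the SIMD case there is a single instruction stream whose control overhead is amortized over all elements; in the MIMD case each thread carries its own instruction stream.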

  • Review: Flynn’s Taxonomy of Computers

    - Mike Flynn, “Very High-Speed Computing Systems,” Proc. of the IEEE, 1966
    - SISD: Single instruction operates on a single data element
    - SIMD: Single instruction operates on multiple data elements
      - Array processor
      - Vector processor
    - MISD: Multiple instructions operate on a single data element
      - Closest forms: systolic array processor, streaming processor
    - MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams)
      - Multiprocessor
      - Multithreaded processor

    5

  • Review: On-Chip Network Based Multi-Core Systems

    - A scalable multi-core is a distributed system on a chip

    6

    [Figure: 4x4 mesh of routers (R), each attached to a processing element (PE: cores, L2 banks, memory controllers, accelerators, etc.). Router detail: buffered input ports (from East, West, North, South, and the local PE), each with virtual channels (VC 0, VC 1, VC 2); control logic containing the routing unit (RC), VC allocator (VA), and switch allocator (SA); and a crossbar driving the output ports (to East, West, North, South, and the local PE).]

  • Review: Idea of On-Chip Networks

    - Problem: Connecting many cores with a single bus is not scalable
      - A single point of connection limits communication bandwidth
        - What if multiple core pairs want to communicate with each other at the same time?
      - Electrical loading on the single bus limits bus frequency
    - Idea: Use a network to connect cores
      - Connect neighboring cores via short links
      - Communicate between cores by routing packets over the network
    - Dally and Towles, “Route Packets, Not Wires: On-Chip Interconnection Networks,” DAC 2001.

    7

  • Review: Bus

    + Simple
    + Cost-effective for a small number of nodes
    + Easy to implement coherence (snooping)
    - Not scalable to a large number of nodes (limited bandwidth, electrical loading → reduced frequency)
    - High contention

    8

    [Figure: four processors, each with a private cache, attached to a single shared bus to memory.]

  • Review: Crossbar

    - Every node connected to every other node
    - Good for a small number of nodes
    + Least contention in the network: high bandwidth
    - Expensive
    - Not scalable due to quadratic cost

    Used in core-to-cache-bank networks in:
      - IBM POWER5
      - Sun Niagara I/II

    9

    [Figure: 8x8 crossbar connecting input ports 0-7 to output ports 0-7.]
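The quadratic-cost point can be made concrete with a back-of-the-envelope count (an illustration, not a real area model; it simply counts one crosspoint per input/output pair):

```python
# Simplified wiring-cost model: a bus needs roughly one tap per node,
# while an n x n crossbar needs a crosspoint for every (input, output) pair.
def bus_cost(n):
    return n          # one connection per node onto the shared bus

def crossbar_cost(n):
    return n * n      # quadratic: fine for 8 ports, painful for 1024

for n in (8, 64, 1024):
    print(f"n={n}: bus={bus_cost(n)}, crossbar={crossbar_cost(n)}")
```

This is why crossbars show up in small core-to-cache-bank networks but not as the global interconnect of a many-core chip.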

  • Review: Mesh

    - O(N) cost
    - Average latency: O(sqrt(N))
    - Easy to lay out on-chip: regular and equal-length links
    - Path diversity: many ways to get from one node to another
    - Used in the Tilera 100-core processor and many on-chip network prototypes

    10
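The O(sqrt(N)) average-latency claim can be checked numerically: the average hop distance over all node pairs of a k x k mesh grows linearly in k = sqrt(N). A small sketch (illustrative, not from the lecture):

```python
from itertools import product

def mesh_avg_hops(k):
    """Average hop count between distinct node pairs of a k x k mesh,
    assuming minimal (Manhattan-distance) routing."""
    nodes = list(product(range(k), range(k)))
    total = pairs = 0
    for (x1, y1), (x2, y2) in product(nodes, repeat=2):
        if (x1, y1) != (x2, y2):
            total += abs(x1 - x2) + abs(y1 - y2)
            pairs += 1
    return total / pairs

# Quadrupling N (doubling k) doubles the average distance: O(sqrt(N)).
for k in (2, 4, 8):
    print(k * k, mesh_avg_hops(k))
```

The brute-force average comes out to exactly 2k/3 hops, confirming the square-root scaling in N.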

  • Review: Torus

    - Mesh is not symmetric on edges: performance is very sensitive to placement of a task on an edge vs. in the middle
    - Torus avoids this problem
    + Higher path diversity than mesh
    - Higher cost
    - Harder to lay out on-chip
    - Unequal link lengths

    11

  • Review: Torus, continued

    - Weave nodes to make inter-node latencies ~constant

    12

    [Figure: folded torus layout of eight S-M-P nodes.]

  • Review: Example NoC: 100-Core Tilera Processor

    - Wentzlaff et al., “On-Chip Interconnection Architecture of the Tile Processor,” IEEE Micro 2007.

    13

  • The Need for QoS in the On-Chip Network

    - One can create malicious applications that continuously access the same resource → denial of service to less aggressive applications

    14

  • n  Need to provide packet scheduling mechanisms that ensure applications’ service requirements (bandwidth/latency) are satisfied

    n  Grot et al., “Preemptive Virtual Clock: A Flexible, Efficient, and Cost- effective QOS Scheme for Networks-on-Chip,” MICRO 2009.

    The Need for QoS in the On-Chip Network

    15
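To make "packet scheduling mechanisms" concrete, here is a minimal virtual-clock-style sketch. This is not the Preemptive Virtual Clock design of Grot et al.; it only illustrates the underlying idea of stamping packets with virtual finish times so a bandwidth hog cannot starve lighter flows. Flow names and weights are hypothetical.

```python
def fair_order(queued, weights):
    """queued: flow -> number of waiting packets.
    weights: flow -> provisioned bandwidth share.
    Serve packets in order of virtual finish time: each packet advances
    its flow's virtual clock by 1/weight, so a heavy flow is interleaved
    with light ones instead of monopolizing the link."""
    stamps = []
    vclock = {f: 0.0 for f in queued}
    for flow, count in queued.items():
        for _ in range(count):
            vclock[flow] += 1.0 / weights[flow]   # virtual finish time
            stamps.append((vclock[flow], flow))
    return [flow for _, flow in sorted(stamps)]

# An aggressive flow with 4 queued packets cannot starve a 2-packet flow
# of equal weight: service alternates until the light flow drains.
print(fair_order({"hog": 4, "app": 2}, {"hog": 1, "app": 1}))
```

With a plain FIFO, all four "hog" packets would be served before "app" ever got through; the virtual-clock stamps restore the provisioned shares.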

  • On-Chip Networks: Some Questions

    - Is mesh/torus the best topology?
    - How do you design the router?
      - Energy-efficient, low latency
      - What is the routing algorithm? Is it adaptive or deterministic?
    - How does the router prioritize between different threads’/applications’ packets?
      - How does the OS/application communicate the importance of applications to the routers?
      - How does the router provide bandwidth/latency guarantees to applications that need them?
    - Where do you place different resources (e.g., memory controllers)?
    - How do you maintain cache coherence?
    - How does the OS scheduler place tasks?
    - How is data placed in distributed caches?

    16
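For the routing-algorithm question, the standard deterministic baseline on a mesh is XY (dimension-order) routing: correct the X coordinate fully, then the Y coordinate, which avoids deadlock on a mesh. A minimal sketch (illustrative coordinates, not any specific router's implementation):

```python
def xy_route(src, dst):
    """Return the sequence of router coordinates visited after src,
    using deterministic XY (dimension-order) routing on a mesh."""
    x, y = src
    dx, dy = dst
    hops = []
    while x != dx:                    # travel along X first...
        x += 1 if dx > x else -1
        hops.append((x, y))
    while y != dy:                    # ...then along Y
        y += 1 if dy > y else -1
        hops.append((x, y))
    return hops

# From (0, 0) to (2, 1): two X hops, then one Y hop.
print(xy_route((0, 0), (2, 1)))
```

An adaptive algorithm would instead choose among the minimal paths based on congestion, trading the simplicity and in-order delivery of XY for better load balance.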

  • What is Computer Architecture?

    n  The science and art of designing, selecting, and interconnecting hardware components and designing the hardware/software interface to create a computing system that meets functional, performance, energy consumption, cost, and other specific goals.

    n  We will soon distinguish between the terms architecture, microarchitecture, and implementation.

    17

  • Why Study Computer Architecture?

    18

  • Moore’s Law

    19

    Moore, “Cramming more components onto integrated circuits,” Electronics Magazine, 1965.

  • Why Study Computer Architecture?

    - Make computers faster, cheaper, smaller, and more reliable
      - By exploiting advances and changes in underlying technology/circuits
    - Enable new applications
      - Life-like 3D visualization 20 years ago?
      - Virtual reality?
      - Personal genomics?
    - Adapt the computing stack to technology trends
      - Innovation in software is built on trends and changes in computer architecture
        - > 50% performance improvement per year
    - Understand why computers work the way they do

    20

  • An Example: Multi-Core Systems

    21

    [Die photo: a multi-core chip (AMD Barcelona) with Cores 0-3, private L2 caches 0-3, a shared L3 cache, the DRAM memory controller, and a DRAM interface to the DRAM banks. Die photo credit: AMD.]

  • Unexpected Slowdowns in Multi-Core

    22

    [Figure: a low-priority application on Core 0 acts as a memory performance hog, slowing down a high-priority application on Core 1.]

    Moscibroda and Mutlu, “Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems,” USENIX Security 2007.

  • 23

    Why the Disparity in Slowdowns?

    CORE 1 CORE 2

    L2 CACHE

    L2 CACHE

    DRAM MEMORY CONTROLLER

    DRAM Bank 0

    DRAM Bank 1

    DRAM Bank 2

    Shared DRAM Memory System

    Multi-Core Chip

    unfairness INTERCONNECT

    matlab gcc

    DRAM Bank 3

  • DRAM Bank Operation

    24

    [Figure: a DRAM bank with a row decoder, rows, a row buffer, and a column mux. Access sequence: (Row 0, Column 0) finds the bank empty and activates Row 0 into the row buffer; (Row 0, Column 1) and (Row 0, Column 85) are row-buffer HITs; (Row 1, Column 0) is a row-buffer CONFLICT, requiring Row 0 to be closed and Row 1 to be activated.]
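The hit/conflict behavior in the bank-operation figure can be reproduced with a toy model of one bank's row buffer (a simplification that ignores all timing; the class and method names are made up):

```python
class Bank:
    """Toy model of a single DRAM bank's row buffer (no timing)."""
    def __init__(self):
        self.open_row = None            # no row latched: bank is empty

    def access(self, row, col):
        # col selects data through the column mux but does not affect
        # hit/miss status in this model.
        if self.open_row is None:       # empty bank: just activate the row
            self.open_row = row
            return "EMPTY"
        if self.open_row == row:        # row already open: column access only
            return "HIT"
        self.open_row = row             # close the old row, activate the new
        return "CONFLICT"

# The access sequence from the figure:
bank = Bank()
sequence = [(0, 0), (0, 1), (0, 85), (1, 0)]
print([bank.access(r, c) for r, c in sequence])
```

The final access is the expensive case: the controller must precharge the open row before activating the new one, which is why controllers try to schedule row-buffer hits together.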

  • DRAM Controllers

    - A row-conflict memory ac

    25