
Page 1: Big Data Ops vs Flops.pdf

Big Data - Ops vs Flops

Steve Wallach

swallach "at" conveycomputer "dot" com

Convey <= Convex++

Page 2: Big Data Ops vs Flops.pdf

Why is this interesting?

• Big Data is on everyone’s mind. It is in the NEWS.
  – HPC Classic (Flops) & Big Data (Ops)
• Power efficiency is on everyone’s mind. It is in the NEWS.
• Are today’s processor architectures a match for the Exascale applications that are in the NEWS?
  – Ops vs Flops
• Deep dive into some micro-architecture discussions
• A smarter computer for Exascale. It is in the NEWS.


Page 3: Big Data Ops vs Flops.pdf

From a NEWS Perspective

• World Economic Forum

– declared data a new class of

economic asset, like currency or

gold.

• growing at 50 percent a year, or

more than doubling every two

years, estimates IDC

• “data-driven decision making” achieved

productivity gains that were 5 percent

to 6 percent higher than other factors

could explain.

• “false discoveries.” “many bits

of straw look like needles.”


February 11, 2012, NY Times, “The Age of Big Data”, Steve Lohr

Page 4: Big Data Ops vs Flops.pdf

Support from Washington DC

OBAMA ADMINISTRATION UNVEILS “BIG DATA” INITIATIVE: ANNOUNCES $200 MILLION IN NEW R&D INVESTMENTS

Office of Science and Technology Policy, Executive Office of the President

New Executive Office Building

Washington, DC 20502

March 29, 2012


Page 5: Big Data Ops vs Flops.pdf

Perhaps another reason


Venture Capitalist

Page 6: Big Data Ops vs Flops.pdf

Seriously

• “Computing may be on the cusp of another such wave.”
• “The wave will be based on smarter machines and software that will automate more tasks and help people make better decisions.”
• “The technological building blocks, both hardware and software, are falling into place, stirring optimism.”


September 8, 2012, NY Times, “Tech’s New Wave, Driven by Data”, Steve Lohr

Page 7: Big Data Ops vs Flops.pdf

Thinking on Technology

• Flops and Ops are applicable to Exascale
• More items in common than initially obvious
• Examine in more detail
• The role of Time Sharing


• The role of cloud computing

Page 8: Big Data Ops vs Flops.pdf

Compare Flops with Ops

• ExaFlop (FP Intensive)
  – Historical metrics/features
    • Memory capacity: Byte/Flop
    • Interconnect bandwidth: 0.1 Bytes/Flop
    • Memory bandwidth: 4 Bytes/sec/Flop
  – Linpack
  – Relatively static data set size
    • 3D … X, Y, Z, with multiple points per cell
  – Wavefront algorithms
    • Node-level synchronization
    • Domain decomposition
  – 64-bit memory references
    • 32- and 64-bit float
  – Cache friendly (in general?)

• ExaOp (Data Intensive)
  – Transactions per sec
  – Transaction latency
  – Graph edges per sec
  – YES/NO
  – Data sets are dynamic
    • New data always being added
  – Never enough physical memory
    • Memory/compute algorithms
    • Limited disk access
    • Flash memory / Memristor / Phase Change?
    • Memcached
  – Graphs, trees, bit and byte strings
  – Fine-grain synchronization
    • Threads/transactions
  – Granularity of memory references
  – User-defined strings
  – NORA (Non-Obvious Relationship Analysis)
  – Cache unfriendly


Page 9: Big Data Ops vs Flops.pdf

Solving Flops – HPC Classic

• Multi (Many) Core

• Multiple FMACs – Floating Point

• Vector Accumulators

• Structured Data

• Attached accelerators – GPU, FPGA, …

• High Speed Interconnects – MPI Pings

• Open source numerical libraries – (e.g., LAPACK)

• Memory system
  – Classic: highly interleaved
  – Micro: multiple cache levels


Page 10: Big Data Ops vs Flops.pdf

Solving Ops – An Architectural Perspective

• Very Large Physical Memory
  – ExaOps imply ExaBytes
  – 64 bits of address is not enough
    • Run out by 2020
    • 1 to 1.5 bits of increase per 1 to 1.5 years (a rough calculation follows below)
• Single-Level, Global Physical Memory
  – Simplifies the programming model
  – Extensible over time
  – Updates in real time
  – Runtime binding
  – PGAS languages
  – MapReduce / Hadoop
• Multiple machine state models can exist
• Heterogeneous Processing
• Fine-grain synchronization
  – Threads
  – Bits and bytes
  – Adjacency matrices

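A back-of-the-envelope reading of the address-bit bullet above (my arithmetic, not from the slide): each doubling of data volume consumes one more address bit, so

$$\text{address bits needed}(t) \;=\; \text{bits}(t_0) \;+\; \frac{t - t_0}{T_{2\times}}, \qquad T_{2\times} \approx 1\text{ to }1.5\ \text{years}.$$

At the IDC growth rate quoted earlier (more than doubling every two years), that is roughly 0.5 to 1 extra address bit per year, so a data set already within a handful of doublings of $2^{64}$ bytes (16 ExaBytes) exhausts a 64-bit address space within the decade.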

Page 11: Big Data Ops vs Flops.pdf


Now What

• Uniprocessor performance has to be increased
  – Heterogeneous is here to stay
  – The easiest to program will be the correct technology
• Smarter memory systems (PIM)
  – Synchronization
  – Complex data structures
  – DRAM, NVRAM, disk
• New HPC/Big Data software must be developed
  – SMARTER COMPILERS
  – ARCHITECTURALLY TRANSPARENT
  – Domain-specific languages
• New algorithms
• New ways of referencing DATA at the processor level


Page 12: Big Data Ops vs Flops.pdf

Ops and Flops – Processor Memory System

• Caches create in-order memory references for ALUs from out-of-order main memory references.
  – A cache block is in order
• Direct memory references tend to be out of order
  – Better for random references (see the sketch below)
  – Hundreds (thousands) of outstanding loads and stores
• Which memory reference approach is preferred?
• How does HMC (Hybrid Memory Cube) change the approach?
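A small illustration (mine, not from the slides) of the access pattern that favors direct, out-of-order references: a gather loop whose index stream is effectively random. Caches help little here; performance depends on how many loads the memory system can keep in flight.

/* gather loop with a random index stream: cache-unfriendly,
 * latency hidden only by having many outstanding loads */
#include <stddef.h>

double gather_sum(const double *table, const size_t *idx, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += table[idx[i]];   /* random reference into a large table */
    return sum;
}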


Page 13: Big Data Ops vs Flops.pdf

Ops and Flops in one system?

• For in order memory, a vector

ISA is very efficient. For out of

order, a thread model is more

efficient (also hides latency

better)

– Threads execute independently

but synchronized.

– Supported by OpenMP


DO I = 1, N
   A(I) = B(I) + C(I)
ENDDO
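A minimal sketch (mine, not from the slides) of the same loop in the thread style: an OpenMP worksharing loop in C, the counterpart of the vector form above.

/* each thread handles a chunk of iterations; no vector ISA assumed */
#include <omp.h>

void vadd(double *a, const double *b, const double *c, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}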

Page 14: Big Data Ops vs Flops.pdf

(Figure: NVIDIA)

Page 15: Big Data Ops vs Flops.pdf

Exascale Workshop Dec 2009, San Diego


Page 16: Big Data Ops vs Flops.pdf

In Beta Test… Fox Coprocessor

(Figure: four huge Virtex-6 Application Engines, 32 high-density SG-DIMMs, DDR3 memory controllers)

• Shared, Virtual Memory System

• Scatter/Gather

– Up to 4 TB (1 TB with current DRAMs)

– 128 GB/s bandwidth

– TB/s of internal register bandwidth

• Atomic Memory Operators

– Native to memory controllers

– 8-Byte Accessible

• Tag Bit Semantics

– lock or “tag” bit for each 8-byte word

• Thousands of simultaneous threads

• Scalable coprocessor memory interconnect


Page 17: Big Data Ops vs Flops.pdf

Atomic Operations & FE Memory Bits

• Atomic operations avoid round trips to memory to acquire a lock, update data, and release the lock
  − Accessible from coprocessor instruction space
  − Accessible from host via lock engine and compiler intrinsics
• Full/Empty Bits
  − Extra bit stored with each word
  − Can be used to signify when data is valid/ready
  − Accessible from host via lock engine & compiler intrinsics

Atomic operations: Add, Sub, Min, Max, Exch, Inc, Dec, CAS, And, Or, Xor
Tag bit semantic operations (see the sketch below): WriteEF, WriteFF, WriteXF, WriteXE, ReadFE, ReadEF, ReadFF, ReadXX, IncFF
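The WriteEF/ReadFE mnemonics read naturally as write-when-Empty-leave-Full and read-when-Full-leave-Empty, which gives fine-grained producer/consumer synchronization on a single word. The toy emulation below is mine (plain C11 atomics, single producer and single consumer), not Convey's API; on the hardware the same effect comes from the tag-bit operations listed above, without spinning in software.

/* software stand-in for one tagged word: "full" plays the full/empty bit */
#include <stdint.h>
#include <stdatomic.h>

typedef struct {
    int64_t     value;
    atomic_bool full;              /* false = Empty, true = Full */
} fe_word_t;

/* WriteEF-style: wait until Empty, write the value, mark Full */
void write_ef(fe_word_t *w, int64_t v)
{
    while (atomic_load(&w->full))  /* spin while Full */
        ;
    w->value = v;
    atomic_store(&w->full, true);
}

/* ReadFE-style: wait until Full, read the value, mark Empty */
int64_t read_fe(fe_word_t *w)
{
    while (!atomic_load(&w->full)) /* spin while Empty */
        ;
    int64_t v = w->value;
    atomic_store(&w->full, false);
    return v;
}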


Page 18: Big Data Ops vs Flops.pdf

• Convey Coprocessor Interconnect (CCI) enables shared memory between coprocessors
  – Direct connection of up to 8 coprocessors
  – Architected for 32 coprocessors in a 32 TB address space
  – Load/store and atomic memory operations between coprocessors
• Extended Memory Allocators
  – Mimic the memory allocation and layout of SHMEM or UPC

(Figure: a host and Wolfpack nodes, each with an AEH and AE0–AE3, linked by the Convey Coprocessor Interconnect)


Page 19: Big Data Ops vs Flops.pdf

CHOMP: Convey Hybrid OpenMP

• Scalable, MIMD Personality Framework
  – Simple, RISC-like instruction set
  – Atomic memory operations
  – Fine-grained synchronization using tag-bit semantics
  – Software-driven, hardware-based low-latency thread/task scheduler
  – Every instruction explicitly determines whether to continue or pend
• Programming model is OpenMP 3.1 (see the sketch below)
  – Support for OpenMP code directives
  – Simple, portable parallelism
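A minimal sketch (mine, not Convey's code) of the OpenMP 3.1 style of parallelism named above: explicit tasks, here one per node of a binary tree, the kind of fine-grained work a hardware thread/task scheduler is meant to keep fed.

#include <omp.h>
#include <stddef.h>

typedef struct node {
    struct node *left, *right;
    long value;
} node_t;

long tree_sum(node_t *n)
{
    if (n == NULL) return 0;
    long l = 0, r = 0;
    #pragma omp task shared(l)   /* spawn a fine-grained task */
    l = tree_sum(n->left);
    #pragma omp task shared(r)
    r = tree_sum(n->right);
    #pragma omp taskwait         /* wait for the two child tasks */
    return l + r + n->value;
}

long sum_tree(node_t *root)      /* call site: one thread starts the recursion */
{
    long s = 0;
    #pragma omp parallel
    #pragma omp single
    s = tree_sum(root);
    return s;
}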


Page 20: Big Data Ops vs Flops.pdf

CHOMP Software Stack


Page 21: Big Data Ops vs Flops.pdf

Development Frameworks for Multiple Architectural Models

• CHOMP for Arrays of Customized Cores

− MIMD parallelism with many simple cores

− OMP based shared memory programming model

− Compile and run user model

• Convey Vectorizer for Custom SIMD

− vector personalities optimized for specific data types

− automatic vectorization of C/C++ and Fortran

− support for customized instructions via intrinsics

• PDK for Algorithmic Personalities

− specific routines in hardware

− pipelining and replication for very high performance

− PDK supports implementations in Verilog or with high-level design tools

Example personalities:
• Stencils [FD/FV/FE]:
  X(I,J,K) = S0*Y(I,J,K) + S1*Y(I-1,J,K) + S2*Y(I+1,J,K) + S3*Y(I,J-1,K) + S4*Y(I,J+1,K) + S5*Y(I,J,K-1) + S6*Y(I,J,K+1)
• Genomics/Proteomics (e.g., ACTGTGACATGCTGACATGCTAG)
• Graph Theory: edge vector, vertex vector, etc.
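The stencil above, written out as the C loop nest an automatic vectorizer would target; the array extents nx, ny, nz and the boundary handling are assumptions of this sketch.

#include <stddef.h>

/* 7-point stencil: x = weighted sum of y and its six axis neighbors */
void stencil7(double *x, const double *y, const double s[7],
              int nx, int ny, int nz)
{
    #define IDX(i, j, k) ((size_t)(i) + (size_t)nx * ((size_t)(j) + (size_t)ny * (size_t)(k)))
    for (int k = 1; k < nz - 1; k++)
        for (int j = 1; j < ny - 1; j++)
            for (int i = 1; i < nx - 1; i++)   /* unit-stride inner loop */
                x[IDX(i, j, k)] =
                    s[0] * y[IDX(i,     j,     k    )] +
                    s[1] * y[IDX(i - 1, j,     k    )] +
                    s[2] * y[IDX(i + 1, j,     k    )] +
                    s[3] * y[IDX(i,     j - 1, k    )] +
                    s[4] * y[IDX(i,     j + 1, k    )] +
                    s[5] * y[IDX(i,     j,     k - 1)] +
                    s[6] * y[IDX(i,     j,     k + 1)];
    #undef IDX
}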



Page 25: Big Data Ops vs Flops.pdf

System Level - Trends

• Global Flat, Virtual Address Space

– Common to Flops and Ops

– Program using a PGAS Language

• Single Program Multiple Data (SPMD)

– 64 bits growing to 128 bits in 2020 and beyond


(Figure: extended page-table entry with PTE data-semantic fields; a struct sketch follows)
• Physical page address: 40 bits
• Page size: 4 KB to 1 MB, giving 4 PetaBytes to 1 ExaByte of physical address
• Control field: physical memory level (DRAM, Flash, Disk)
• Node field: PGAS model
• Data semantics: PIM control, private vs. shared
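Illustrative only: one way such an extended PTE could be laid out. Only the 40-bit physical page address and the field names come from the slide; the remaining widths are invented for this sketch.

#include <stdint.h>

typedef struct {
    uint64_t phys_page : 40;  /* physical page address (slide: 40 bits) */
    uint64_t mem_level : 2;   /* control: DRAM / Flash / Disk           */
    uint64_t node      : 12;  /* node field for the PGAS model          */
    uint64_t pim_ctl   : 4;   /* data-semantic / PIM control bits       */
    uint64_t shared    : 1;   /* private vs. shared                     */
    uint64_t page_size : 2;   /* encodes page sizes from 4 KB to 1 MB   */
    uint64_t reserved  : 3;   /* pads the entry to 64 bits              */
} ext_pte_t;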

Page 26: Big Data Ops vs Flops.pdf


Exascale Programming Model

#include <upc_relaxed.h>
#define N 200*THREADS            /* NOTE: THREADS is 16000 */

shared [N] double A[N][N];
shared double b[N], x[N];

void main()
{
    int i, j;
    /* read the elements of matrix A and the vector x,
       and initialize the vector b to zeros */
    upc_forall(i = 0; i < N; i++; i)
        for(j = 0; j < N; j++)
            b[i] += A[i][j] * x[j];
}

Example 4.1-1: Matrix by Vector Multiply


Page 27: Big Data Ops vs Flops.pdf

Optical Technology

IBM HP


March 15, 2012, “HP Scientists Envision 10-Teraflop Manycore Chip”, HPC Wire

http://domino.research.ibm.com

Page 28: Big Data Ops vs Flops.pdf

Why OPTICS?

• Bandwidth per channel exceeds electronics

• One connector per node

– NIC does parallel to serial conversion(s) – outgoing

– NIC does serial to parallel - incoming

• Scalable bandwidth per node, using WDM

– Constant mechanics

– No distance considerations within a computer room

– No noise/cross talk

– More colors more bandwidth

• Same network/switch for Flops and Ops
  – Flops has more channels than Ops (perhaps)

– Facilitates a REAL Cloud Computer


Page 29: Big Data Ops vs Flops.pdf


Cloud/Big Data System

(Figure: single-node Ops/Flops systems, numbered 1 to 16,000, connected through an all-optical switch to I/O and the LAN/WAN; 10 meters = 50 ns of delay; one color, one channel)

On demand:
• compute
• storage
• link bandwidth

Page 30: Big Data Ops vs Flops.pdf


Finally

The system which is the simplest to program will win. USER cycles are more important than CPU cycles.


FLOPs, OPs, and Flops/Ops per Watt