Workload Optimized Systems: The Wheel of Reincarnation

Michael Sporer, Netezza Appliance Hardware Architect

21 April 2013

Outline

• Definition

• Technology

• Minicomputers – Prime

• Workstations – Apollo

• Graphics – Stellar/Stardent

• SMP Database Machines – Data General

• MPP Database Machines – Netezza/IBM

• What’s Next?

• Conclusion

What Is Workload Optimized?

• Technology is advancing

• Groups of technologies play together

• Step functions lead to qualitative opportunities

• Opportunity is defined by a specific market

• Market is identified by workload

• Hardware/software co-design optimizes for the workload

• Startups are often the leading indicator

• Established vendors follow or perish

Technology - Hardware

• Transistor Technology – Moore’s Law

  - ICs -> CPUs, ASICs, FPGAs

• DRAM

• Busses and their protocols

• Magnetic Recording Technology - bpi

  - Disks

• Networks – Ethernet, primarily

• Graphic displays -> GPUs

• Cache and chip interconnect

• SMP and NUMA

Technology - Software

• Time-share OS

• Virtual Memory

• Languages: ASM, Fortran, C, C++, Java

• Object-Oriented Methods

• Open Source

Minicomputer – Prime (1)

• Early 1970s

• Engineers need computer access

• Mainframes too expensive

• New IC technology, DRAM, PROM and cost-effective, removable 14” Winchester disk drives

• Firmware technology for flexibility and ISA range

• Virtual memory and large address spaces

• OS technology pioneered by MULTICS

Minicomputer – Prime (2)

• Designed a multi-user time-shared minicomputer for CAD and SW development

• Architecture defined by software engineers

• Wrote our own optimizing compilers

• Process switch and VM table walk in firmware

• 32-bit ISA as extension to prior 16-bit version

• All 16-bit applications ran at GA

• New and recompiled apps ran in 32-bit mode

• Floating Point in firmware in PROM

Workstation – Apollo (1)

• Early 1980s

• Engineers needed dedicated performance

• Workgroups cooperating

• Early microprocessors (68010)

• Ethernet proves networks can be built

• High resolution graphic displays available

• Bit-map technology proven by PARC and MIT

Workstation – Apollo (2)

• Design a personal engineering workstation

• High resolution bit-map display

• Network OS with seamless file sharing

• 2D graphics SW system for cut and paste

• VM implemented using two 68010s

• Application development environment

Graphics Supercomputer – Stellar/Stardent (1)

• Engineers need 3D displays for CAD, Biotech

• Heavy compute load using multiple threads

• Sea-of-gates IC technology

• YACC

• Single-clock scan path design proven (by IBM)

• Parallelizing compilers beginning to be understood

Graphics Supercomputer – Stellar/Stardent (2)

• Design a Graphics Supercomputer

  - 20 MIPS, 80 MFLOPS, 120K Gouraud-shaded triangles

• Home-brew C-like simulation language using YACC

  - 11 designs, 49 sea-of-gates chips per system

• First-pass operational

• Scan path SW

• Unix OS

• Home-brew parallelizing Fortran compiler

SMP Database Machine – Data General (1)

• Early 1990s

• Database usage surging – mostly OLTP

• SMP reaching limits

• Dense ASICs available

• Intel CPUs becoming good

SMP Database Machine – Data General (2)

• Need lots of CPUs and lots of I/O to process the high OLTP tps demands and database size

• Design a 32-way system with distributed I/O

  - SMP: 8 groups of 4-way M88K CPUs on a packet bus

  - NUMA: 8 groups of 4-way Pentium CPUs on an SCI bus

• DG/UX to handle NUMA

• I/O subsystem aware of NUMA

MPP Database Machine – Netezza/IBM (1)

• Data accumulating – could it be used?

• OLTP systems groaning under OLAP workloads

• Cyber Bricks at Microsoft – computation is low cost

• Active Disks at CMU – parallel algorithms near disk

• Inexpensive low-power CPU

• Inexpensive FPGA

• Inexpensive consumer disks

• Cheap 100Mb -> 1Gb Ethernet

MPP Database Machine – Netezza/IBM (2)

• Design a simple SPU – disk + FPGA + CPU + NIC

• Put lots into a rack

• Develop MPP SW

  - Postgres front end

  - Tuned optimizer

  - Parallelizer

  - Query compiler – code for CPU and FPGA

  - Distributor of snippets of work

• FPGA – Decompress, Restrict, Project – reduce CPU load

• Appliance – easy to install, use, service, support
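
A minimal sketch of the restrict/project idea named on this slide, in Python for illustration only. The row layout, the column names, and the functions `restrict`, `project`, and `fpga_stage` are hypothetical; the actual FPGA logic and the decompress step are not shown. The sketch only illustrates why filtering rows and dropping columns before the CPU sees them reduces CPU load.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]  # hypothetical row layout: column name -> value


def restrict(rows: Iterable[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Pass through only the rows that satisfy the query predicate."""
    for row in rows:
        if predicate(row):
            yield row


def project(rows: Iterable[Row], columns: List[str]) -> Iterator[Row]:
    """Keep only the columns the query actually references."""
    for row in rows:
        yield {c: row[c] for c in columns}


def fpga_stage(rows: Iterable[Row],
               predicate: Callable[[Row], bool],
               columns: List[str]) -> Iterator[Row]:
    """Stand-in for the filter stage that sits between disk and CPU."""
    return project(restrict(rows, predicate), columns)


def field_count(rows: Iterable[Row]) -> int:
    """Count the individual field values in a collection of rows."""
    return sum(len(r) for r in rows)


# Made-up table contents, as they would stream off the disk.
table = [
    {"date": "2013-04-01", "zip": "01752", "amount": 10, "note": "a"},
    {"date": "2013-04-02", "zip": "02139", "amount": 20, "note": "b"},
    {"date": "2013-04-03", "zip": "01752", "amount": 30, "note": "c"},
    {"date": "2013-04-04", "zip": "10001", "amount": 40, "note": "d"},
]

# Roughly: SELECT date, amount FROM table WHERE zip = '01752'
to_cpu = list(fpga_stage(table, lambda r: r["zip"] == "01752", ["date", "amount"]))

print(field_count(table), "fields on disk,", field_count(to_cpu), "fields reach the CPU")
```

In this toy table only a quarter of the field values survive the filter stage, which is the effect the slide describes as reducing CPU load.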

Netezza Snippet Processor

• CPU

• FPGA

• Disk

• Network

[Figure: SPU hardware. Labels: 1M gate FPGA; 1 GB RAM, socketed DIMM; 440GX PowerPC; GigE to each SPU; enterprise SATA disk drive, 400 GB.]

Netezza Streaming Processing

• Identify work that does not need to be done

• Do as much work as we can in parallel on SPU

• Move restricts into the FPGA

• Increase effective disk performance

• Execute operational streaming analytics

[Figure: zone maps versus CBT on base table data blocks with Col 1: Date and Col 2: Zip. Zone maps: 18 out of 48 extents read. CBT maps (Dimension 1: Date, Dim 2: Zip): 2 out of 48 extents read.]

[Figure: SPU streaming data path. Labels: Analytic Functions “On Stream”; Project, Restrict; PowerPC Query Engine; Joining, Sorting, Grouping; Snippet Queue; Fault Recovery; Main Memory; Streaming Record Processor; Transaction/Lock Manager; DMA; SPU Swap; Mirror; Primary; Compiled Tables.]
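
The "extents read" counts on this slide suggest the general zone-map technique: keep a small min/max summary per extent and skip any extent whose range cannot overlap the query range. The following is a minimal sketch of that general idea, not Netezza's implementation; the names `build_zone_map`, `extents_to_read`, and `scan`, and the sample data, are assumptions made for illustration. It also assumes every extent is non-empty.

```python
from typing import List, Tuple

Extent = List[int]               # values of one column stored in one extent
ZoneMap = List[Tuple[int, int]]  # (min, max) summary per extent


def build_zone_map(extents: List[Extent]) -> ZoneMap:
    """Record the smallest and largest value held in each (non-empty) extent."""
    return [(min(e), max(e)) for e in extents]


def extents_to_read(zone_map: ZoneMap, lo: int, hi: int) -> List[int]:
    """Return indexes of extents whose [min, max] range overlaps [lo, hi]."""
    return [i for i, (zmin, zmax) in enumerate(zone_map)
            if zmax >= lo and zmin <= hi]


def scan(extents: List[Extent], zone_map: ZoneMap, lo: int, hi: int) -> List[int]:
    """Read only the candidate extents, then apply the exact restriction."""
    hits: List[int] = []
    for i in extents_to_read(zone_map, lo, hi):
        hits.extend(v for v in extents[i] if lo <= v <= hi)
    return hits


# Made-up data: three well-ordered extents and one with a wide value range.
extents = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [2, 8, 5]]
zone_map = build_zone_map(extents)

needed = extents_to_read(zone_map, 4, 5)
print(f"{len(needed)} out of {len(extents)} extents read")
print(scan(extents, zone_map, 4, 5))
```

The last extent in the sample has a wide min/max range, so it must be read even though it holds only one matching value. Data that is physically organized on the queried columns yields tighter ranges and fewer reads, which is consistent with the lower CBT count on the slide.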

Netezza Evolution

• First product – 2003

  - 112 80MHz CPU

  - 112 small FPGAs – 8-bit datapath

  - 64MB DRAM

  - 33MB/s 40GB disk

• Current product – 2013

  - 112 2.2GHz CPU cores

  - 112 FPGA cores – 32-bit datapath

  - 8GB/core DRAM + 512MB/FPGA core for disk cache

  - 160MB/s 600GB disk
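
A rough worked comparison of the two configurations on this slide, as a Python sketch. It assumes the figures are comparable per processing unit (the slide gives 64MB DRAM for 2003 without a unit and 8GB per core for 2013), takes 1 GB of disk as 1000 MB and 1 GB of DRAM as 1024 MB, and ignores compression, zone maps, and caching. The dictionary keys and helper names are made up for the example.

```python
# Figures taken from the slide; key names are illustrative only.
first_2003 = {
    "cpu_mhz": 80,
    "fpga_datapath_bits": 8,
    "dram_mb": 64,
    "disk_mb_per_s": 33,
    "disk_gb": 40,
}
current_2013 = {
    "cpu_mhz": 2200,
    "fpga_datapath_bits": 32,
    "dram_mb": 8 * 1024,   # 8 GB per core, assuming 1 GB = 1024 MB
    "disk_mb_per_s": 160,
    "disk_gb": 600,
}


def growth(old: dict, new: dict) -> dict:
    """Ratio of new to old for every metric present in the old configuration."""
    return {k: new[k] / old[k] for k in old}


def full_scan_seconds(product: dict) -> float:
    """Time to stream one whole disk at its rated bandwidth (1 GB = 1000 MB)."""
    return product["disk_gb"] * 1000 / product["disk_mb_per_s"]


ratios = growth(first_2003, current_2013)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")

print(f"raw full-disk scan, 2003: {full_scan_seconds(first_2003):.0f} s")
print(f"raw full-disk scan, 2013: {full_scan_seconds(current_2013):.0f} s")
```

Under these assumptions, CPU clock and DRAM grew far faster than disk bandwidth, and a raw scan of a full disk takes longer in 2013 than in 2003, since capacity grew about 15x while bandwidth grew under 5x.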

What Netezza Got Right

• Streaming balanced architecture

  - Disk, CPU, network overlap – query time is max of all 3

• FPGA technology

  - Improve scan performance of HDD

  - Decrease data to be processed by CPU -> cheap CPU

• MPP – move processing toward disk

• Focus on GB/sec/$

• New products when step function in technology
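
The claim that query time is the max of the three stages, rather than their sum, can be checked with a small pipeline model. This is a sketch under simplifying assumptions: every block takes a fixed time in each stage, each stage handles one block at a time, and buffering between stages is unlimited. The function names and the per-block timings are invented for illustration.

```python
from typing import List


def serial_time(stage_times: List[int], n_blocks: int) -> int:
    """No overlap: each block passes through every stage before the next starts."""
    return n_blocks * sum(stage_times)


def pipelined_time(stage_times: List[int], n_blocks: int) -> int:
    """Overlapped stages: a stage starts a block once the block has left the
    previous stage and the stage itself has finished the previous block."""
    finish = [0] * len(stage_times)  # completion time of the latest block per stage
    for _ in range(n_blocks):
        ready = 0  # time at which the current block leaves the previous stage
        for s, t in enumerate(stage_times):
            start = max(ready, finish[s])
            finish[s] = start + t
            ready = finish[s]
    return finish[-1]


# Hypothetical per-block costs in milliseconds: disk read, CPU work, network send.
stages = [3, 2, 1]
blocks = 100

print("serial:   ", serial_time(stages, blocks), "ms")
print("pipelined:", pipelined_time(stages, blocks), "ms")
print("bound:    ", blocks * max(stages), "ms")
```

For a long stream the pipelined total approaches the block count times the slowest stage, so speeding up any stage other than the bottleneck does not shorten the query. That is one way to read the emphasis on a balanced design.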

What’s Next From Market?

• Competition in markets

• Urgency to get ahead or at least stay even

• Need more sophisticated use of data

• More complex analytics

• On more data

• Better ability to do what-if analysis

• More uses and users of DB systems

What’s Next From Technology?

• At limit of number of disks per rack

• Cache often not big enough

• Flash provides very high IOPS

  - Crossover IOPS/$

• Lots of cores per CPU

• Lots of DRAM bandwidth

• Faster, lower overhead networks using RDMA

• GPUs – can they be harnessed?

• FPGAs – how do we get SW developers to use them?

Conclusions

• We are ready for a big step in technology (>10x)

• Old SW stacks will no longer work

  - OLTP systems continue to not scale

  - OLAP systems centered on HDDs

• What are SW limits?

  - MPP – can it scale to 1000 nodes? How about 10,000?

  - How do we minimize overheads hidden by HDDs?

• It is imperative for HW/SW to be co-designed.

• How are you handling it?
