Tandem Daytona TeraByte Sort: Tsort
1 TB in 47.5 Minutes
David Cossock, Sam Fineberg, Pankaj Mehra, John Peck
Trophy presentation by Jim Gray


Page 2: Benchmark History

[Figure: timeline of benchmarks, 1970 to 2000]

• Wisconsin: Bitton, Boral, DeWitt, Turbyfill
• IBM TP 1-7: CA and Tony Lukes
• Debit Credit: Gray
• Datamation: Anon et al
• MCC: Boral &...
• Teradata: Bollinger &...
• TPC-A, TPC-B, TPC-C, TPC-D, TPC-W ?
• Sort, PennySort, MinuteSort

Page 3: A Short History of Sort

• April Fools 1985: Datamation Sort
– Sort 1M 100-byte records
– An IO benchmark: 15 min to 1 hr!
• 1993: {Minute | Penny} x {Daytona | Indy}
• 1998: TeraByte Sort
• Web site: http://research.Microsoft.com/barc/SortBenchmark/

Page 4: Ground Rules

• How much can you sort for a penny (in a minute).
– Hardware and software cost
– Depreciated over 3 years
– 1M$ system gets about 1 second,
– 1K$ system gets about 1,000 seconds.
– Time (seconds) = 946,080 / SystemPrice ($)
• Input and output are disk resident
• Input is
– 100-byte records (random data)
– key is first 10 bytes.
• Must create output file and fill with sorted version of input file.
• Daytona (product) and Indy (special) categories
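The penny budget in the ground rules follows from the depreciation rule: three years is 94,608,000 seconds, so one cent of a system buys 94,608,000 seconds divided by the price in cents. A minimal sketch of that arithmetic, assuming 365-day years; the function and constant names are illustrative, not part of the benchmark definition:

```python
# Sketch of the penny time budget. Names are hypothetical, chosen for this example.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # 31,536,000, ignoring leap days (assumption)
DEPRECIATION_YEARS = 3                  # from the ground rules


def penny_budget_seconds(system_price_dollars: float) -> float:
    """Seconds of system time that one cent pays for."""
    lifetime_seconds = DEPRECIATION_YEARS * SECONDS_PER_YEAR  # 94,608,000
    price_in_cents = system_price_dollars * 100
    return lifetime_seconds / price_in_cents


for price in (1_000_000, 1_000):
    print(f"{price:>9,}$ system: {penny_budget_seconds(price):,.2f} s per penny")
```

The two prices are the ones quoted in the ground rules; the results land close to the "about 1 second" and "about 1,000 seconds" figures.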

Page 5: Bottleneck Analysis

• Drawn to linear scale
• Theoretical bus bandwidth: 422 MBps = 66 MHz x 64 bits
• Memory read/write: ~150 MBps
• MemCopy: ~50 MBps
• Disk R/W: ~15 MBps

Page 6: Bottleneck Analysis

• NTFS Read/Write
• 18 Ultra 3 SCSI on 4 strings (2x4 and 2x5), 3 PCI 64
– ~155 MBps unbuffered read (175 raw)
– ~95 MBps unbuffered write
• Recently: SQL Server on Xeon: 190 MBps scan. Good, but 10x down from S390/SGI/UE10k

[Figure: data path labeled Memory Read/Write ~250 MBps, PCI ~110 MBps, Adapter ~70 MBps (four adapters shown), 155 MBps]

Page 7: PennySort

• Hardware
– 266 MHz Intel PPro
– 64 MB SDRAM (10ns)
– Dual Fujitsu DMA 3.2GB EIDE disks
• Software
– NT workstation 4.3
– NT 5 sort
• Performance
– sort 15 M 100-byte records (~1.5 GB)
– Disk to disk
– elapsed time 820 sec
– cpu time = 404 sec

[Figure: pie chart, PennySort Machine (1107$): cpu 32%; Disk 25%; Other 22%; board 13%; Network, Video, floppy 9%; Memory 8%; Cabinet + Assembly 7%; Software 6%]
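As a rough check on the PennySort figures above, the sketch below derives the effective sort rate and CPU share from the quoted run, and compares the elapsed time with the time one penny buys on a 1107$ machine depreciated over 3 years. It assumes 365-day years and decimal megabytes; the variable names are illustrative only.

```python
# Figures quoted on the PennySort slide.
records = 15_000_000
record_bytes = 100
elapsed_sec = 820
cpu_sec = 404
machine_price_dollars = 1107

bytes_sorted = records * record_bytes                     # 1.5e9 bytes
throughput_mb_per_sec = bytes_sorted / elapsed_sec / 1e6  # decimal MB (assumption)
cpu_fraction = cpu_sec / elapsed_sec

# One penny's worth of a machine depreciated over 3 years (365-day years assumed).
lifetime_sec = 3 * 365 * 24 * 3600
penny_budget_sec = lifetime_sec / (machine_price_dollars * 100)

print(f"sorted {bytes_sorted / 1e9:.1f} GB at {throughput_mb_per_sec:.2f} MB/s")
print(f"cpu busy for {cpu_fraction:.0%} of the elapsed time")
print(f"penny budget {penny_budget_sec:.1f} s, run took {elapsed_sec} s")
```

Under these assumptions the run fits inside the penny budget with a few tens of seconds to spare, and the CPU is busy for roughly half of the elapsed time.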

Page 8: Recent Results

• NCSAsort: 10.3 GB in .9 minute, 60 Intel/NT/Myrinet nodes
• MillenniumSort: 16x Dell NT cluster: 100 MB in 1.08 sec (Datamation)

Page 9: 1999 PennySort

• Daytona & Indy: 2.58 GB in 917 sec
• HMsort: Brad Helmkamp, Keith McCready, Stenograph LLC
• Intel 400 MHz, 2 IDE disks

Page 10: 1998 TB Sort

• Chris Nyberg, Nsort, SGI 32x Origin2000, 151 minutes

Page 11: 1999 Terabyte Sort

• Daytona: David Cossock, Sam Fineberg, Pankaj Mehra, John Peck
– Tandem/Sandia TSort: 68 CPU ServerNet
– 47 minutes
• Indy: IBM SPsort
– 488 nodes, 1952 cpu, 2168 disks
– 17.6 minutes = 1057 sec
– (all for 1/3 of 94M$; slice price is 64k$ for 4 cpu, 2 GB ram, 6 9GB disks + interconnect)

Page 12: Sandia/Compaq/ServerNet/NT Sort

• Sort 1.1 Terabyte (13 Billion records) in 47 minutes
• 68 nodes (dual 450 MHz processors), 543 disks, 1.5 M$
• 1.2 GBps network rap (2.8 GBps pap)
• 5.2 GBps of disk rap (same as pap)
• (rap = real application performance, pap = peak advertised performance)

[Figure: The 72-Node 48-Switch ServerNet-I Topology Deployed at Sandia National Labs. Each node is a Compaq Proliant 1850R Server with 2 400 MHz CPUs, 512 MB SDRAM, a PCI bus, 4 SCSI busses (each with 2 data disks), and a ServerNet I dual-ported PCI NIC attached to 6-port ServerNet I crossbar switches on the X fabric (10 bidirectional bisection links) and the Y fabric (14 bidirectional bisection links). Each switch on the bisection line adds 3 links to bisection width.]
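To put the rap and pap figures in perspective, the sketch below turns the quoted totals (1.1 TB in 47 minutes on 68 nodes) into an end-to-end sort rate and computes the fraction of advertised network bandwidth the application actually saw. It assumes decimal units (1 TB = 10^12 bytes) and even load across nodes; variable names are illustrative.

```python
# Figures quoted on the Sandia/Compaq/ServerNet/NT Sort slide.
data_bytes = 1.1e12          # 1.1 TB, decimal units assumed
elapsed_sec = 47 * 60        # 47 minutes
nodes = 68
network_rap_gbps = 1.2       # real application performance, GBps
network_pap_gbps = 2.8       # peak advertised performance, GBps

# End-to-end rate: bytes of input sorted per second of wall-clock time.
aggregate_mb_per_sec = data_bytes / elapsed_sec / 1e6
per_node_mb_per_sec = aggregate_mb_per_sec / nodes   # assumes even load

# Share of the advertised network bandwidth delivered to the application.
network_efficiency = network_rap_gbps / network_pap_gbps

print(f"aggregate sort rate: {aggregate_mb_per_sec:.0f} MB/s")
print(f"per-node sort rate:  {per_node_mb_per_sec:.1f} MB/s")
print(f"network rap/pap:     {network_efficiency:.0%}")
```

The end-to-end rate is far below the 5.2 GBps disk figure because it counts each input byte once, while a sort reads and writes the data more than once and spends time on phases other than streaming I/O.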

Page 13: SP sort

• 2 – 4 GBps!
• 488 nodes, 55 racks: 1952 processors, 732 GB RAM, 2168 disks
– Compute: 432 nodes, 37 racks
– Storage: 56 nodes, 18 racks
• Compute rack: 16 nodes, each has
– 4x332 MHz PowerPC604e
– 1.5 GB RAM
– 1 32x33 PCI bus
– 9 GB scsi disk
– 150 MBps full duplex SP switch
• Storage rack: 8 nodes, each has
– 4x332 MHz PowerPC604e
– 1.5 GB RAM
– 3 32x33 PCI bus
– 30x4 GB scsi disk (4+1 RAID5)
– 150 MBps full duplex SP switch
• 56 storage nodes manage 1680 4GB disks: 336 4+P twin tail RAID5 arrays (30/node)

[Figure: GB/s (0.0 to 4.0) vs. elapsed time (0 to 900 seconds); series: GPFS read, GPFS write, Local read, Local write]
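The SP configuration totals can be cross-checked from the per-node figures. The sketch below recomputes nodes, processors, RAM, and RAID disks from the numbers on the slide; variable names are illustrative. The slide does not say where the last 56 disks of the 2168 total come from, so the code only reports the remainder (one extra local disk per storage node would account for it, but that is an assumption).

```python
# Per-node figures quoted on the SP sort slide.
compute_nodes = 432
storage_nodes = 56
cpus_per_node = 4
ram_gb_per_node = 1.5
raid_disks_per_storage_node = 30
raid_arrays = 336
disks_per_array = 5            # "4+P": four data disks plus one parity disk
quoted_total_disks = 2168

total_nodes = compute_nodes + storage_nodes
total_cpus = total_nodes * cpus_per_node
total_ram_gb = total_nodes * ram_gb_per_node

raid_disks_by_node = storage_nodes * raid_disks_per_storage_node
raid_disks_by_array = raid_arrays * disks_per_array

compute_disks = compute_nodes * 1      # one 9 GB disk per compute node
unexplained_disks = quoted_total_disks - compute_disks - raid_disks_by_node

print(total_nodes, total_cpus, total_ram_gb)
print(raid_disks_by_node, raid_disks_by_array, unexplained_disks)
```

Both ways of counting the RAID disks agree with the 1680 quoted on the slide, and the node, processor, and RAM totals match the 488 / 1952 / 732 GB summary line.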

Page 14: 1999 Sort Records

Penny
– Daytona: 2.58 GB in 917 sec. HMsort: Brad Helmkamp, Keith McCready, Stenograph LLC
– Indy: 2.58 GB in 917 sec. HMsort: Brad Helmkamp, Keith McCready, Stenograph LLC

Minute
– Daytona: 7.6 GB in 60 seconds. Ordinal Nsort, SGI 32 cpu Origin, IRIX
– Indy: 10.3 GB in 56.51 sec. NOW+MPI HPVMsort, Luis Rivera (UIUC) & Andrew Chien (UCSD)

TeraByte
– Daytona: 49 minutes. David Cossock, Sam Fineberg, Pankaj Mehra, John Peck, 68x2 Compaq & Sandia Labs
– Indy: 1057 seconds. SPsort, 1952 SP cluster, 2168 disks, Jim Wyllie. PDF: SPsort.pdf (80KB)

Datamation
– 1.18 seconds. Phillip Buonadonna, Spencer Low, Josh Coates, UC Berkeley Millennium Sort, 16x2 Dell NT Myrinet

Page 15:

• Partly hardware
• Partly software
• Partly economics

[Figure: Records Sorted per Second (doubles every year) and GB Sorted per Dollar (doubles every year), 1985 to 2000, log scale from 1.E-03 to 1.E+06; annotated "2x/year!"]

Page 16: Progress on Sorting

• Speedup comes from Moore’s law: 40%/year
• Processor/Disk/Network arrays: 60%/year (this is a software speedup).

[Figure: Sort Records/second vs Time, 1985 to 2000, log scale from 1.E+02 to 1.E+08. Labeled points: Bitton M68000, Cray YMP, IBM 3090, Tandem, Kitsuregawa Hardware Sorter, Sequent Intel HyperCube, IBM RS6000, NOW, Alpha, Ordinal+SGI, PennyNTsort, Sandia/Compaq/NT, SPsort/IB]

[Figure: Records Sorted per Second (doubles every year) and GB Sorted per Dollar (doubles every year), 1985 to 2000, log scale from 1.E-03 to 1.E+06. Labeled points: Compaq/NT, NT/PennySort, SPsort]
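The two growth rates quoted on this slide compound rather than add. A small sketch, assuming the 40%/year and 60%/year figures are independent multiplicative factors (variable names are illustrative), shows how they combine into roughly the "doubles every year" trend on the charts:

```python
import math

moore_rate = 0.40    # hardware speedup per year, from the slide
array_rate = 0.60    # processor/disk/network array speedup per year, from the slide

# Independent yearly factors multiply (assumption): 1.4 * 1.6.
combined_factor = (1 + moore_rate) * (1 + array_rate)

# Time for sort performance to double at the combined rate.
doubling_time_years = math.log(2) / math.log(combined_factor)

print(f"combined yearly factor: {combined_factor:.2f}x")
print(f"doubling time: {doubling_time_years:.2f} years")
```

The combined factor comes out slightly above 2x per year, so the doubling time is a little under one year.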

Page 17: Musings: PennySort=TBsort

• Sorts 1TB in 1Minute
• 2 pass so 3TB of disk
• = 10 disks if 330GB/disk
• = 5Gps (if each disk is 50Mbps)
• So, 600 seconds (3TB/5GBps)
• So, node costs 1.5k$
• Costs 100x that today
• maybe in 10 years?
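The chain of estimates above can be followed step by step. The units on the slide are ambiguous ("5Gps", "50Mbps"), so this sketch takes the 5 GBps rate from the "3TB/5GBps" bullet and separately shows what 10 disks would deliver if each streamed 50 megabytes per second. It assumes decimal units and a 3-year depreciation with 365-day years; variable names are illustrative.

```python
import math

TB, GB, MB = 1e12, 1e9, 1e6

data_on_disk = 3 * TB                            # "2 pass so 3TB of disk"
disks = math.ceil(data_on_disk / (330 * GB))     # disks needed at 330 GB each

# Reading 1: the aggregate rate used in the "3TB/5GBps" bullet.
slide_rate = 5 * GB
seconds_at_slide_rate = data_on_disk / slide_rate

# Reading 2: aggregate rate if each disk streams 50 megabytes/s (assumption).
rate_from_disks = disks * 50 * MB
seconds_at_disk_rate = data_on_disk / rate_from_disks

# System price for which one penny buys seconds_at_slide_rate of machine time.
lifetime_sec = 3 * 365 * 24 * 3600
node_price_dollars = lifetime_sec / seconds_at_slide_rate / 100

print(f"disks needed: {disks}")
print(f"time at 5 GBps: {seconds_at_slide_rate:.0f} s")
print(f"time at {rate_from_disks / GB:.1f} GBps: {seconds_at_disk_rate:.0f} s")
print(f"price for a {seconds_at_slide_rate:.0f} s penny budget: {node_price_dollars:,.0f}$")
```

With the 600-second figure, the penny budget implies a node price close to the 1.5k$ on the slide. Under the second reading the same 10 disks would need ten times as long, so the 600-second estimate depends on the 5 GBps aggregate rate.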

Page 18: Data Gravity: Processing Moves to Transducers

• Move processing to data sources
• Move to where the power (and sheet metal) is
• Processor in
– Modem
– Display
– Microphones (speech recognition) & cameras (vision)
– Storage: data storage and analysis
• System is “distributed” (a cluster/mob)

Page 19: Disk = Node

• has magnetic storage (100 GB?)
• has processor & DRAM
• has SAN attachment
• has execution environment

[Figure: software stack labeled OS Kernel, SAN driver, Disk driver, File System, RPC, ..., Services, DBMS, Applications]

Page 20: SAN: Standard Interconnect

• LAN faster than memory bus?
• 1 GBps links in lab.
• 100$ port cost soon
• Port is computer
• Winsock: 110 MBps (10% cpu utilization at each end)

[Figure: bandwidth comparison: Gbps SAN: 110 MBps; PCI: 70 MBps; UW Scsi: 40 MBps; FW scsi: 20 MBps; scsi: 5 MBps]

RIP FDDI, RIP ATM, RIP SCI, RIP SCSI, RIP FC, RIP ?