1
Tandem Daytona
TeraByte Sort: Tsort
1 TB in 47.5 Minutes
David Cossock, Sam Fineberg, Pankaj Mehra, John Peck
Trophy presentation by Jim Gray
2
Benchmark History
[Figure: timeline of benchmarks, 1970 to 2000]
• Wisconsin: Bitton, Boral, DeWitt, Turbyfill
• IBM TP 1-7: CA and Tony Lukes
• Debit Credit: Gray
• Datamation: Anon et al
• MCC: Boral &...
• Teradata: Bollinger &...
• TPC-A, TPC-B, TPC-C, TPC-D, TPC-W ?
• Sort, PennySort, MinuteSort
3
A Short History of Sort
• April Fools 1985: Datamation Sort
– Sort 1M 100 B records
– An IO benchmark: 15-min to 1 hr!
• 1993: {Minute | Penny} x {Daytona | Indy}
• 1998: TeraByte Sort
• Web site: http://research.Microsoft.com/barc/SortBenchmark/
4
Ground Rules
• How much can you sort for a penny (in a minute).
– Hardware and Software cost
– Depreciated over 3 years
– 1M$ system gets about 1 second,
– 1K$ system gets about 1,000 seconds.
– Time (seconds) = 946,080 / SystemPrice ($)
• Input and output are disk resident
• Input is
– 100-byte records (random data)
– key is first 10 bytes.
• Must create output file and fill with sorted version of input file.
• Daytona (product) and Indy (special) categories
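The penny budget in the ground rules (system price depreciated over 3 years, so a 1M$ system gets about 1 second) can be sketched as a small calculation. This is an illustrative sketch only: the function name `penny_budget_seconds` is hypothetical, and it assumes 365-day years and no other costs.

```python
# Sketch of the PennySort time budget. Assumptions (not from the slides):
# 365-day years, price covers everything, function name is hypothetical.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60
DEPRECIATION_YEARS = 3
CENTS_PER_DOLLAR = 100


def penny_budget_seconds(system_price_dollars: float) -> float:
    """Return how many seconds of machine time one cent buys."""
    if system_price_dollars <= 0:
        raise ValueError("system price must be positive")
    lifetime_seconds = DEPRECIATION_YEARS * SECONDS_PER_YEAR  # 94,608,000
    # One cent is 1/100 of a dollar, so divide the lifetime by price in cents.
    return lifetime_seconds / (CENTS_PER_DOLLAR * system_price_dollars)


if __name__ == "__main__":
    for price in (1_000_000, 1_000, 1_107):
        print(f"{price:>9}$ system: {penny_budget_seconds(price):8.2f} s per penny")
```

With the 1107$ machine from the PennySort slide, the budget works out to a little over 850 seconds, which is why an 820-second run fits inside one penny.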
5
Bottleneck Analysis
• Drawn to linear scale
– Theoretical Bus Bandwidth: 422 MBps = 66 MHz x 64 bits
– Memory Read/Write: ~150 MBps
– MemCopy: ~50 MBps
– Disk R/W: ~15 MBps
6
Bottleneck Analysis
• NTFS Read/Write
• 18 Ultra 3 SCSI on 4 strings (2x4 and 2x5), 3 PCI 64
– ~155 MBps unbuffered read (175 raw)
– ~95 MBps unbuffered write
• Recently: SQL Server on Xeon: 190 MBps scan. Good, but 10x down from S390/SGI/UE10k
[Figure: Memory Read/Write ~250 MBps; PCI ~110 MBps; Adapter ~70 MBps (three adapters shown); 155 MBps]
7
PennySort
• Hardware
– 266 MHz Intel PPro
– 64 MB SDRAM (10ns)
– Dual Fujitsu DMA 3.2GB EIDE disks
• Software
– NT workstation 4.3
– NT 5 sort
• Performance
– sort 15 M 100-byte records (~1.5 GB)
– Disk to disk
– elapsed time 820 sec
– cpu time = 404 sec
[Figure: PennySort Machine (1107$) cost breakdown]
• cpu 32%
• Disk 25%
• board 13%
• Memory 8%
• Other 22%
– Cabinet + Assembly 7%
– Network, Video, floppy 9%
– Software 6%
8
Recent Results
• NCSAsort: 10.3 GB in 0.9 minute on 60 Intel/NT/Myrinet nodes
• MillenniumSort: 16x Dell NT cluster: 100 MB in 1.08 sec (Datamation)
9
1999 PennySort
• Daytona & Indy: 2.58 GB in 917 sec
• HMsort: Brad Helmkamp, Keith McCready, Stenograph LLC
• Intel 400 MHz, 2 IDE disks
10
1998 TB Sort
• Chris Nyberg, Nsort
• SGI 32x Origin2000
• 151 minutes
11
1999 Terabyte Sort
• Daytona: David Cossock, Sam Fineberg, Pankaj Mehra, John Peck
– Tandem/Sandia TSort: 68 CPU ServerNet
– 47 minutes
• Indy: IBM SPsort
– 488 nodes, 1952 cpu, 2168 disks
– 17.6 minutes = 1057 sec
– (all for 1/3 of 94M$; slice price is 64k$ for 4 cpu, 2 GB RAM, 6 9GB disks + interconnect)
12
Sandia/Compaq/ServerNet/NT Sort
• Sort 1.1 Terabyte (13 Billion records) in 47 minutes
• 68 nodes (dual 450 MHz processors), 543 disks, 1.5 M$
• 1.2 GBps network rap (2.8 GBps pap)
• 5.2 GBps of disk rap (same as pap)
• (rap = real application performance, pap = peak advertised performance)
[Figure: The 72-Node 48-Switch ServerNet-I Topology Deployed at Sandia National Labs]
• Bisection Line (each switch on this line adds 3 links to bisection width)
• X Fabric (10 bidirectional bisection links)
• Y Fabric (14 bidirectional bisection links)
• Node: Compaq Proliant 1850R Server
– 2 400 MHz CPUs, 512 MB SDRAM
– PCI Bus with ServerNet I dual-ported PCI NIC
– 4 SCSI busses, each with 2 data disks
– 6-port ServerNet I crossbar switches (one to X fabric, one to Y fabric)
13
SP sort
• 2 – 4 GBps!
• 488 nodes, 55 racks: 1952 processors, 732 GB RAM, 2168 disks
– Compute: 432 nodes, 37 racks
– Storage: 56 nodes, 18 racks
• Compute rack: 16 nodes, each has
– 4x332 MHz PowerPC 604e
– 1.5 GB RAM
– 1 32x33 PCI bus
– 9 GB scsi disk
– 150 MBps full duplex SP switch
• Storage rack: 8 nodes, each has
– 4x332 MHz PowerPC 604e
– 1.5 GB RAM
– 3 32x33 PCI bus
– 30x4 GB scsi disk (4+1 RAID5)
– 150 MBps full duplex SP switch
• 56 storage nodes manage 1680 4GB disks: 336 4+P twin tail RAID5 arrays (30/node)
[Figure: GPFS read, GPFS write, Local read, and Local write throughput (GB/s, 0.0 to 4.0) vs. elapsed time (seconds, 0 to 900)]
14
1999 Sort Records
• Penny
– Daytona: 2.58 GB in 917 sec. HMsort: Brad Helmkamp, Keith McCready, Stenograph LLC
– Indy: 2.58 GB in 917 sec. HMsort: Brad Helmkamp, Keith McCready, Stenograph LLC
• Minute
– Daytona: 7.6 GB in 60 seconds. Ordinal Nsort, SGI 32 cpu Origin IRIX
– Indy: 10.3 GB in 56.51 sec. NOW+MPI HPVMsort, Luis Rivera UIUC & Andrew Chien UCSD
• TeraByte
– Daytona: 49 minutes. David Cossock, Sam Fineberg, Pankaj Mehra, John Peck, 68x2 Compaq & Sandia Labs
– Indy: 1057 seconds. SPsort, 1952 SP cluster, 2168 disks, Jim Wyllie (PDF: SPsort.pdf, 80KB)
• Datamation
– 1.18 seconds. Phillip Buonadonna, Spencer Low, Josh Coates, UC Berkeley Millennium Sort, 16x2 Dell NT Myrinet
15
• Partly hardware
• Partly software
• Partly economics
[Figure: Records Sorted per Second (doubles every year) and GB Sorted per Dollar (doubles every year), 1985 to 2000, log scale 1.E-03 to 1.E+06. 2x/year!]
16
Progress on Sorting
• Speedup comes from Moore’s law: 40%/year
• Processor/Disk/Network arrays: 60%/year (this is a software speedup).
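A rough check of how the two rates above relate to "doubles every year": if the 40%/year and 60%/year gains are assumed to multiply (an assumption made here, not stated on the slide), the combined factor comes out a little above 2x per year. The function names below are hypothetical.

```python
# Sketch: compounding two independent yearly improvement rates.
# Assumption (not from the slides): the hardware and software gains multiply.

def combined_annual_factor(*annual_rates: float) -> float:
    """Multiply yearly improvement rates (0.40 means +40%) into one factor."""
    factor = 1.0
    for rate in annual_rates:
        factor *= 1.0 + rate
    return factor


def cumulative_growth(annual_factor: float, years: int) -> float:
    """Total growth after compounding an annual factor for a number of years."""
    return annual_factor ** years


if __name__ == "__main__":
    yearly = combined_annual_factor(0.40, 0.60)
    print(f"combined yearly factor: {yearly:.2f}x")
    print(f"15 years at 2x/year: {cumulative_growth(2.0, 15):,.0f}x")
```

Under that assumption the yearly factor is 1.4 x 1.6 = 2.24, in line with the 2x/year trend over the 1985 to 2000 span.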
[Figure: Sort Records/second vs Time, 1985 to 2000, log scale 1.E+02 to 1.E+08. Systems plotted: Bitton M68000, Cray YMP, IBM 3090, Tandem, Kitsuregawa Hardware Sorter, Sequent, Intel HyperCube, IBM RS6000, NOW, Alpha, Ordinal+SGI, PennyNTsort, Sandia/Compaq/NT, SPsort/IB]
[Figure: Records Sorted per Second (doubles every year) and GB Sorted per Dollar (doubles every year), 1985 to 2000, log scale 1.E-03 to 1.E+06. Labeled points: Compaq/NT, NT/PennySort, SPsort]
17
Musings: PennySort=TBsort
• Sorts 1TB in 1Minute
• 2 pass so 3TB of disk
• = 10 disks if 330GB/disk
• = 5 GBps (if each disk is 50Mbps)
• So, 600 seconds (3TB/5GBps)
• So, node costs 1.5k$
• Costs 100x that today
• maybe in 10 years?
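The last steps of the musing (3TB moved at 5GBps takes 600 seconds, and a 600-second penny budget implies a node of roughly 1.5k$) can be checked with a short sketch. It takes the 946,080 constant from the Ground Rules slide and uses decimal units (1 TB = 10^12 bytes); the function names are hypothetical.

```python
# Sketch of the "PennySort = TBsort" arithmetic. Assumptions (not from the
# slides): decimal units, and the 946,080 constant from the Ground Rules slide.
PENNY_DOLLAR_SECONDS = 946_080


def transfer_seconds(bytes_moved: float, bytes_per_second: float) -> float:
    """Time to move a given number of bytes at a given aggregate bandwidth."""
    return bytes_moved / bytes_per_second


def max_price_for_penny(run_seconds: float) -> float:
    """Highest system price ($) at which a run of this length costs one penny."""
    return PENNY_DOLLAR_SECONDS / run_seconds


if __name__ == "__main__":
    seconds = transfer_seconds(3e12, 5e9)  # 3 TB of disk traffic at 5 GBps
    print(f"run time: {seconds:.0f} s")
    print(f"price ceiling: {max_price_for_penny(seconds):,.0f} $")
```

The price ceiling lands near 1.6k$, matching the slide's "node costs 1.5k$" to within rounding.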
18
Data Gravity: Processing Moves to Transducers
• Move Processing to data sources
• Move to where the power (and sheet metal) is
• Processor in
– Modem
– Display
– Microphones (speech recognition) & cameras (vision)
– Storage: Data storage and analysis
• System is “distributed” (a cluster/mob)
19
Disk = Node
• has magnetic storage (100 GB?)
• has processor & DRAM
• has SAN attachment
• has execution environment
[Figure: software stack: Applications; Services, DBMS; File System, RPC, ...; SAN driver, Disk driver; OS Kernel]
20
SAN: Standard Interconnect
• LAN faster than memory bus?
• 1 GBps links in lab.
• 100$ port cost soon
• Port is computer
• Winsock: 110 MBps (10% cpu utilization at each end)
[Figure: Gbps SAN: 110 MBps; PCI: 70 MBps; UW Scsi: 40 MBps; FW scsi: 20 MBps; scsi: 5 MBps]
[Figure labels: RIP FDDI, RIP ATM, RIP SCI, RIP SCSI, RIP FC, RIP ?]