1

Scaleable Systems Research at Microsoft
(really: what we do at BARC)

Jim Gray
Microsoft Research
Gray@Microsoft.com
http://research.Microsoft.com/~Gray

Presented to the DARPA WindowsNT workshop, 5 Aug 1998, Seattle WA.


2

Outline

• PowerCast, FileCast & Reliable Multicast

• RAGS: SQL Testing

• TerraServer (a big DB)

• Sloan Sky Survey (CyberBricks)

• Billion Transactions per day

• WolfPack Failover

• NTFS IO measurements

• NT-Cluster-Sort

• AlwaysUp


3

Telepresence

• The next killer app

• Space shifting:

»Reduce travel

• Time shifting:

»Retrospective

»Offer condensations

»Just in time meetings.

• Example: ACM 97

»NetShow and Web site.

»More web visitors than attendees

• People-to-People communication


4

Telepresence Prototypes

• PowerCast: multicast PowerPoint
  » Streaming: pre-sends the next anticipated slide
  » Sends slides and voice rather than talking head and voice
  » Uses ECSRM for reliable multicast
  » 1000's of receivers can join and leave at any time
  » No server needed; no pre-load of slides
  » Cooperating with NetShow

• FileCast: multicast file transfer (sketched below)
  » Erasure-encodes all packets
  » Receivers only need to receive as many bytes as the length of the file
  » Multicast IE to solve the Midnight-Madness problem

• NT SRM: reliable IP multicast library for NT

• Spatialized Teleconference Station
  » Texture-map faces onto spheres
  » Space-map voices
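FileCast's erasure coding means an n-block file is encoded into more packets than n, and any n received packets reconstruct the file, so no receiver ever needs a retransmission of one specific packet. Below is a toy C++ sketch of the idea with a single XOR parity block (any k of k+1 blocks suffice); it is illustrative only — FileCast would need a stronger code that tolerates many losses (e.g., Reed-Solomon style).

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Toy erasure code: k data blocks plus one XOR parity block.
    // Any k of the k+1 blocks reconstruct the file, so losing any
    // single packet costs nothing.
    std::vector<uint8_t> MakeParity(const std::vector<std::vector<uint8_t>>& blocks) {
        std::vector<uint8_t> parity(blocks[0].size(), 0);
        for (const auto& b : blocks)
            for (size_t i = 0; i < b.size(); ++i)
                parity[i] ^= b[i];
        return parity;
    }

    // Rebuild the one missing data block from the survivors + parity.
    std::vector<uint8_t> Recover(const std::vector<std::vector<uint8_t>>& survivors,
                                 std::vector<uint8_t> parity) {
        for (const auto& b : survivors)
            for (size_t i = 0; i < b.size(); ++i)
                parity[i] ^= b[i];   // XOR out every block we did receive
        return parity;               // what remains is the missing block
    }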


5

RAGS: RAndom SQL test Generator

• Microsoft spends a LOT of money on testing. (60% of development according to one source).

• Idea: test SQL by
  » generating random correct queries (sketched below)
  » executing the queries against the database
  » comparing results with SQL 6.5, DB2, Oracle, Sybase

• Being used in SQL 7.0 testing
  » 375 unique bugs found (since 2/97)
  » A very productive test tool
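The core of a RAGS-style tester is a recursive generator that walks an SQL grammar making random choices, so every emitted statement is syntactically legal even when it is monstrous (see the next slide). A minimal C++ sketch of that generation step — the column names and the tiny expression grammar here are illustrative; the real RAGS drives itself from schema metadata and covers far more of SQL:

    #include <cstdlib>
    #include <iostream>
    #include <string>
    #include <vector>

    // Miniature RAGS-style generator: random scalar expressions over
    // known columns, wrapped in a SELECT.
    const std::vector<std::string> kColumns = {"royalty", "price", "advance"};

    std::string RandomExpr(int depth) {
        if (depth == 0 || std::rand() % 3 == 0)       // leaf: column or constant
            return std::rand() % 2 ? kColumns[std::rand() % kColumns.size()]
                                   : std::to_string(std::rand() % 100);
        const char* ops[] = {"+", "-", "*"};
        return "(" + RandomExpr(depth - 1) + " " + ops[std::rand() % 3] + " " +
               RandomExpr(depth - 1) + ")";
    }

    int main() {
        // RAGS proper executes each statement against SQL Server, DB2,
        // Oracle, and Sybase, and flags any statement whose results or
        // error codes differ across vendors.
        for (int i = 0; i < 5; ++i)
            std::cout << "SELECT TOP 3 " << RandomExpr(3) << " FROM titles\n";
    }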


6

Sample RAGS-Generated Statement

SELECT TOP 3 T1.royalty , T0.price , "Apr 15 1996 10:23AM" , T0.notesFROM titles T0, roysched T1WHERE EXISTS ( SELECT DISTINCT TOP 9 $3.11 , "Apr 15 1996 10:23AM" , T0.advance , ( "<v3``VF;" +(( UPPER(((T2.ord_num +"22\}0G3" )+T2.ord_num ))+("{1FL6t15m" + RTRIM( UPPER((T1.title_id +((("MlV=Cf1kA" +"GS?" )+T2.payterms )+T2.payterms ))))))+(T2.ord_num +RTRIM((LTRIM((T2.title_id +T2.stor_id ))+"2" ))))), T0.advance , (((-(T2.qty ))/(1.0 ))+(((-(-(-1 )))+( DEGREES(T2.qty )))-(-(( -4 )-(-(T2.qty ))))))+(-(-1 )) FROM sales T2 WHERE EXISTS ( SELECT "fQDs" , T2.ord_date , AVG ((-(7 ))/(1 )), MAX (DISTINCT -1 ), LTRIM("0I=L601]H" ), ("jQ\" +((( MAX(T3.phone )+ MAX((RTRIM( UPPER( T5.stor_name ))+((("<" +"9n0yN" )+ UPPER("c" ))+T3.zip ))))+T2.payterms )+ MAX("\?" ))) FROM authors T3, roysched T4, stores T5 WHERE EXISTS ( SELECT DISTINCT TOP 5 LTRIM(T6.state ) FROM stores T6 WHERE ( (-(-(5 )))>= T4.royalty ) AND (( ( ( LOWER( UPPER((("9W8W>kOa" + T6.stor_address )+"{P~" ))))!= ANY (

SELECT TOP 2 LOWER(( UPPER("B9{WIX" )+"J" )) FROM roysched T7 WHERE ( EXISTS (

SELECT (T8.city +(T9.pub_id +((">" +T10.country )+ UPPER( LOWER(T10.city))))), T7.lorange , ((T7.lorange )*((T7.lorange )%(-2 )))/((-5 )-(-2.0 )) FROM publishers T8, pub_info T9, publishers T10 WHERE ( (-10 )<= POWER((T7.royalty )/(T7.lorange ),1)) AND (-1.0 BETWEEN (-9.0 ) AND (POWER(-9.0 ,0.0)) ) ) --EOQ ) AND (NOT (EXISTS ( SELECT MIN (T9.i3 ) FROM roysched T8, d2 T9, stores T10 WHERE ( (T10.city + LOWER(T10.stor_id )) BETWEEN (("QNu@WI" +T10.stor_id )) AND ("DT" ) ) AND ("R|J|" BETWEEN ( LOWER(T10.zip )) AND (LTRIM( UPPER(LTRIM( LOWER(("_\tk`d" +T8.title_id )))))) ) GROUP BY T9.i3, T8.royalty, T9.i3 HAVING -1.0 BETWEEN (SUM (-( SIGN(-(T8.royalty ))))) AND (COUNT(*)) ) --EOQ ) ) ) --EOQ ) AND (((("i|Uv=" +T6.stor_id )+T6.state )+T6.city ) BETWEEN ((((T6.zip +( UPPER(("ec4L}rP^<" +((LTRIM(T6.stor_name )+"fax<" )+("5adWhS" +T6.zip )))) +T6.city ))+"" )+"?>_0:Wi" )) AND (T6.zip ) ) ) AND (T4.lorange BETWEEN ( 3 ) AND (-(8 )) ) ) ) --EOQ GROUP BY ( LOWER(((T3.address +T5.stor_address )+REVERSE((T5.stor_id +LTRIM( T5.stor_address )))))+ LOWER((((";z^~tO5I" +"" )+("X3FN=" +(REVERSE((RTRIM( LTRIM((("kwU" +"wyn_S@y" )+(REVERSE(( UPPER(LTRIM("u2C[" ))+T4.title_id ))+( RTRIM(("s" +"1X" ))+ UPPER((REVERSE(T3.address )+T5.stor_name )))))))+ "6CRtdD" ))+"j?]=k" )))+T3.phone ))), T5.city, T5.stor_address ) --EOQ ORDER BY 1, 6, 5 )

This statement yields an error: SQLState=37000, Error=8623, Internal Query Processor Error: Query processor could not produce a query plan.


7

Automation

• Simpler statement with the same error:

SELECT roysched.royalty FROM titles, roysched
WHERE EXISTS (
    SELECT DISTINCT TOP 1 titles.advance FROM sales ORDER BY 1)

• Control statement attributes
  » complexity, kind, depth, ...

• Multi-user stress tests
  » tests concurrency, allocation, recovery


8

One 4-Vendor RAGS Test (3 of them vs. us)

• 60 K SELECTs on MSS, DB2, Oracle, Sybase

• 17 SQL Server Beta 2 suspects: 1 suspect per 3,350 statements

• Examined 10 suspects, filed 4 bugs! One duplicate; assume 3/10 are new

• Note: this is the SQL Server Beta 2 product. Quality is rising fast (and RAGS sees that)


9

Outline

• FileCast & Reliable Multicast

• RAGS: SQL Testing

• TerraServer (a big DB)

• Sloan Sky Survey (CyberBricks)

• Billion Transactions per day

• Wolfpack Failover

• NTFS IO measurements

• NT-Cluster-Sort


Billions Of Clients

• Every device will be “intelligent”

• Doors, rooms, cars…

• Computing will be ubiquitous


Billions Of Clients Need Millions Of Servers

[Diagram: mobile and fixed clients networked to servers and superservers]

• All clients networked to servers
  » May be nomadic or on-demand
• Fast clients want faster servers
• Servers provide
  » Shared Data
  » Control
  » Coordination
  » Communication


Thesis: Many little beat few big

• A smoking, hairy golf ball
• How to connect the many little parts?
• How to program the many little parts?
• Fault tolerance?

[Diagram: $1 million mainframe, $100 K mini, $10 K micro, nano; disk form factors 14", 9", 5.25", 3.5", 2.5", 1.8"; 1 M SPECmarks, 1 TFLOP; 10^6 clocks to bulk RAM; event-horizon on chip; VM reincarnated; multiprogram cache, on-chip SMP]

[Storage pyramid: pico processor (1 mm³) — 10 pico-second RAM (1 MB), 10 nano-second RAM (100 MB), 10 microsecond RAM (10 GB), 10 millisecond disc (1 TB), 10 second tape archive (100 TB)]


Performance = Storage Accesses, not Instructions Executed

• In the "old days" we counted instructions and IOs
• Now we count memory references
• Processors wait most of the time

[Chart: where the clock ticks go for AlphaSort components — sort, disc wait, OS, memory wait, D-cache miss, I-cache miss, B-cache data miss]

70 MIPS; "real" apps have worse I-cache misses, so they run at 60 MIPS if well tuned, 20 MIPS if not.


Scale Up and Scale Out

• Grow up with SMP: 4xP6 is now standard
• Grow out with cluster
• Cluster has inexpensive parts

[Diagram: personal system → departmental server → SMP super server; cluster of PCs]


15

Microsoft TerraServer: Scaleup to Big Databases

• Build a 1 TB SQL Server database
• Data must be
  » 1 TB
  » Unencumbered
  » Interesting to everyone everywhere
  » And not offensive to anyone anywhere
• Loaded
  » 1.5 M place names from Encarta World Atlas
  » 3 M sq km from USGS (1-meter resolution)
  » 1 M sq km from the Russian Space Agency (2 m)
• On the web (world's largest atlas)
• Sell images with a commerce server


16

Microsoft TerraServer Background

• Earth is 500 tera-square-meters (tm²)
  » USA is 10 tm²
• 100 tm² of land between 70°N and 70°S
• We have pictures of 6% of it
  » 3 tm² from USGS
  » 2 tm² from the Russian Space Agency
• Compress 5:1 (JPEG) to 1.5 TB
• Slice into 10 KB chunks
• Store chunks in the DB
• Navigate with
  » Encarta™ Atlas (globe, gazetteer)
  » StreetsPlus™ in the USA
• Image pyramid: 40x60 km² jump image, 20x30 km² browse image, 10x15 km² thumbnail, 1.8x1.2 km² tile
• Someday
  » multi-spectral images
  » of everywhere
  » once a day / hour


17

USGS Digital Ortho Quads (DOQ)

• US Geological Survey
• 4 terabytes
• Most data not yet published
• Based on a CRADA
  » Microsoft TerraServer makes the data available

[Image: USGS "DOQ" — 1x1 meter, 4 TB, continental US, new data coming]


18

Russian Space Agency (SovInformSputnik)
SPIN-2 (Aerial Images is the worldwide distributor)

• 1.5 meter geo-rectified imagery of (almost) anywhere
• Almost equal-area projection
• De-classified satellite photos (from 200 km)
• More data coming (1 m)
• Selling imagery on the Internet
• Putting 2 tm² onto Microsoft TerraServer


19

Demo

• Navigate by coverage map to the White House
• Download image
• Buy imagery from USGS
• Navigate by name to Venice
• Buy SPIN-2 image & Kodak photo
• Pop out to Expedia street map of Venice
• Mention that the DB will double in the next 18 months (2x USGS, 2x SPIN-2)


20

Hardware

1 TB Database Server
• AlphaServer 8400 4x400
• 10 GB RAM
• 324 StorageWorks disks
• 10-drive tape library (STC TimberWolf DLT7000)

[Diagram: Internet → DS3 → 100 Mbps Ethernet switch → site servers, SPIN-2 map server, web servers → AlphaServer 8400 (8 x 440 MHz Alpha CPUs, 10 GB DRAM) with an Enterprise Storage Array (7 shelves of 489 GB drives) and an STK 9710 DLT tape library]


21

The Microsoft TerraServer Hardware

• Compaq AlphaServer 8400

• 8 x 400 MHz Alpha CPUs
• 10 GB DRAM
• 324 9.2 GB StorageWorks disks
  » 3 TB raw, 2.4 TB of RAID5
• STK 9710 tape robot (4 TB)
• WindowsNT 4 EE, SQL Server 7.0


22

Software

[Architecture diagram: web client (browser with HTML / Java viewer) → the Internet → TerraServer Web Site (Internet Information Server 4.0, Active Server Pages, MTS, Image Server, Microsoft Site Server EE) → TerraServer DB (SQL Server 7, TerraServer stored procedures, Image Delivery Application); Automap Server (Microsoft Automap ActiveX Server, Internet Information Server 4.0); Image Provider Site(s) (Internet Information Server 4.0, SQL Server 7)]


23

System Management & Maintenance

• Backup and Recovery
  » STK 9710 tape robot
  » Legato NetWorker™
  » SQL Server 7 Backup & Restore
  » Clocked at 80 MBps (peak) (~200 GB/hr)
• SQL Server Enterprise Mgr
  » DBA maintenance
  » SQL Performance Monitor


24

Microsoft TerraServer File Group Layout

• Convert 324 disks to 28 RAID5 sets plus 28 spare drives
• Make 4 WinNT volumes (RAID 50), 595 GB per volume
• Build 30 20 GB files on each volume
• DB is a File Group of 120 files

[Diagram: 7 pairs of HSZ70 A/B RAID controllers feeding volumes E:, F:, G:, H:]


25

Image Delivery and Load

Incremental load of 4 more TB in the next 18 months

[Diagram: DLT tape ("tar") → LoadMgr (AlphaServer 4100 + ESA) → ImgCutter → \Drop'N'\Images → 100 Mbit Ether switch → AlphaServer 8400 with Enterprise Storage Array (3 shelves of 108 9.1 GB drives, 604.3 GB drives) and STK DLT tape library (NTBackup); the LoadMgr DB tracks \Drop'N' DoJob/Wait4Load steps: 10: ImgCutter, 20: Partition, 30: ThumbImg, 40: BrowseImg, 45: JumpImg, 50: TileImg, 55: Meta Data, 60: Tile Meta, 70: Img Meta, 80: Update Place]


26

Technical Challenge: Key Idea

• Problem: geo-spatial search without geo-spatial access methods (just standard SQL Server)
• Solution: a geo-spatial search key (sketched below):
  » Divide the earth into rectangles of 1/48th degree longitude (X) by 1/96th degree latitude (Y)
  » Z-transform X & Y into a single Z value; build a B-tree on Z
  » Adjacent images are stored next to each other
• Search method:
  » Latitude and longitude => X, Y, then Z
  » Select on matching Z value
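The Z-transform above is a Morton (Z-order) key: interleave the bits of the X and Y cell numbers so that cells adjacent on the earth usually land next to each other in the B-tree. A C++ sketch using the slide's cell sizes (function names are illustrative, not TerraServer's actual code):

    #include <cstdint>

    // Cell grid from the slide: 1/48th degree of longitude (X) by
    // 1/96th degree of latitude (Y).
    uint32_t CellX(double lon) { return static_cast<uint32_t>((lon + 180.0) * 48.0); }
    uint32_t CellY(double lat) { return static_cast<uint32_t>((lat +  90.0) * 96.0); }

    // Morton (Z-order) key: interleave the bits of X and Y so nearby
    // cells usually get nearby keys; a B-tree on Z then stores
    // adjacent images next to each other.
    uint64_t ZValue(uint32_t x, uint32_t y) {
        uint64_t z = 0;
        for (int i = 0; i < 32; ++i) {
            z |= static_cast<uint64_t>((x >> i) & 1) << (2 * i);
            z |= static_cast<uint64_t>((y >> i) & 1) << (2 * i + 1);
        }
        return z;
    }

    // Search: latitude and longitude => X, Y, then Z;
    // then SELECT ... WHERE zkey = Z against the B-tree.
    uint64_t KeyForPoint(double lat, double lon) {
        return ZValue(CellX(lon), CellY(lat));
    }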


27

Sloan Digital Sky Survey

• Digital Sky
  » 30 TB raw
  » 3 TB cooked (1 billion 3 KB objects)
  » Want to scan it frequently

• Using CyberBricks

• Current status:
  » 175 MBps per node
  » 24 nodes => 4 GBps
  » 5 minutes to scan the whole archive


28

Some Tera-Byte Databases

[Scale ladder: kilo, mega, giga, tera, peta, exa, zetta, yotta]

• The Web: 1 TB of HTML

• TerraServer 1 TB of images

• Several other 1 TB (file) servers

• Hotmail: 7 TB of email

• Sloan Digital Sky Survey: 40 TB raw, 2 TB cooked

• EOS/DIS (picture of the planet each week)
  » 15 PB by 2007

• Federal Clearing House: images of checks
  » 15 PB by 2006 (7-year history)

• Nuclear Stockpile Stewardship Program
  » 10 exabytes (???!!)


29

Info Capture

• You can record everything you see or hear or read
• What would you do with it?
• How would you organize & analyze it?

Video: 8 PB per lifetime (10 GB/h)
Audio: 30 TB (10 KBps)
Read or write: 8 GB (words)

[Scale ladder from kilo to yotta: a letter, a novel, Library of Congress (text), a movie, LoC (image), all disks, all tapes]

See: http://www.lesk.com/mlesk/ksg97/ksg.html


30

Michael Lesk’s Points www.lesk.com/mlesk/ksg97/ksg.html

• Soon everything can be recorded and kept

• Most data will never be seen by humans

• Precious resource: human attention

• Auto-summarization and auto-search will be key enabling technologies


31

[Scale ladder from kilo to yotta: a letter, a novel, Library of Congress (text), a movie, LoC (image), all photos, LoC (sound + cinema), all disks, all tapes, all information!]


32

Outline

• FileCast & Reliable Multicast

• RAGS: SQL Testing

• TerraServer (a big DB)

• Sloan Sky Survey (CyberBricks)

• Billion Transactions per day

• Wolfpack Failover

• NTFS IO measurements

• NT-Cluster-Sort


33

Scalability

• 1 billion transactions
• 1.8 million mail messages
• 4 terabytes of data
• 100 million web hits

• Scale up: to large SMP nodes
• Scale out: to clusters of SMP nodes


Billion Transactions per Day Project

• Built a 45-node Windows NT cluster (with help from Intel & Compaq); > 900 disks

• All off-the-shelf parts

• Using SQL Server & DTC distributed transactions

• DebitCredit Transaction

• Each node has 1/20th of the DB

• Each node does 1/20th of the work

• 15% of the transactions are "distributed" (see the sketch below)
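A transaction is "distributed" exactly when the rows it touches live on different nodes, which forces DTC to run two-phase commit across them; otherwise it commits locally. A toy C++ sketch of that routing decision — partitioning by simple modulus is an assumption for illustration, and the benchmark's inputs were chosen so about 15% of transactions crossed nodes:

    #include <cstdint>

    const int kSqlNodes = 20;   // 20 SQL Server nodes (see the next slide)

    // Illustrative partitioning: which node owns a given account row.
    int NodeFor(uint64_t accountId) {
        return static_cast<int>(accountId % kSqlNodes);
    }

    // A DebitCredit touches an account row and a branch row. If they
    // live on different nodes, DTC must coordinate a two-phase-commit
    // distributed transaction; otherwise it is a cheap local one.
    bool IsDistributed(uint64_t accountId, uint64_t branchId) {
        return NodeFor(accountId) != NodeFor(branchId);
    }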


35

Billion Transactions Per Day Hardware

• 45 nodes (Compaq Proliant)
• Clustered with 100 Mbps Switched Ethernet
• 140 CPUs, 13 GB DRAM, 3 TB disk (RAID)

Type          Nodes                       CPUs   DRAM        Ctlrs  Disks                       RAID space
Workflow/MTS  20x Compaq Proliant 2500    20x2   20x128 MB   20x1   20x1                        20x2 GB
SQL Server    20x Compaq Proliant 5000    20x4   20x512 MB   20x4   20x(36x4.2 GB + 7x9.1 GB)   20x130 GB
DTC           5x Compaq Proliant 5000     5x4    5x256 MB    5x1    5x3                         5x8 GB
TOTAL         45                          140    13 GB       105    895                         3 TB


36

1.2 B tpd

• 1 B tpd ran for 24 hrs
• Out-of-the-box software
• Off-the-shelf hardware
• AMAZING!
• Sized for 30 days
• Linear growth
• 5 micro-dollars per transaction


37

How Much Is 1 Billion Tpd?

[Bar charts: millions of transactions per day, linear and log scale (0.1 to 1,000 Mtpd) — 1 Btpd vs. Visa, ATT, BofA, NYSE]

• 1 billion tpd = 11,574 tps ~ 700,000 tpm (transactions/minute)
• ATT: 185 million calls per peak day (worldwide)
• Visa: ~20 million tpd
  » 400 million customers
  » 250 K ATMs worldwide
  » 7 billion transactions (card + cheque) in 1994
• New York Stock Exchange: 600,000 tpd
• Bank of America
  » 20 million tpd checks cleared (more than any other bank)
  » 1.4 million tpd ATM transactions
• Worldwide airline reservations: 250 Mtpd


38

NCSA Super Cluster

• National Center for Supercomputing Applications, University of Illinois @ Urbana

• 512 Pentium II CPUs, 2,096 disks, SAN
• Compaq + HP + Myricom + WindowsNT
• A supercomputer for 3 M$
• Classic Fortran/MPI programming
• DCOM programming model

http://access.ncsa.uiuc.edu/CoverStories/SuperCluster/super.html


39

Outline

• FileCast & Reliable Multicast

• RAGS: SQL Testing

• TerraServer (a big DB)

• Sloan Sky Survey (CyberBricks)

• Billion Transactions per day

• Wolfpack Failover

• NTFS IO measurements

• NT-Cluster-Sort


40

NT Clusters (Wolfpack)

• Scale DOWN to PDA: WindowsCE

• Scale UP an SMP: TerraServer

• Scale OUT with a cluster of machines

• Single-system image

»Naming

»Protection/security

»Management/load balance

• Fault tolerance

»“Wolfpack”

• Hot pluggable hardware & software


41

Symmetric Virtual Server Failover Example

[Diagram: a browser talks to two virtual servers — a web site and a database — hosted on Server 1 and Server 2, which share the web site files and database files; when Server 1 fails, both virtual servers continue on Server 2]


42

Clusters & BackOffice

• Research: instant & transparent failover

• Making BackOffice PlugNPlay on Wolfpack

»Automatic install & configure

• Virtual Server concept makes it easy

»simpler management concept

»simpler context/state migration

»transparent to applications

• SQL 6.5E & 7.0 Failover

• MSMQ (queues), MTS (transactions).


43

Next Steps in Availability

• Study the causes of outages

• Build AlwaysUp system:

»Two geographically remote sites

»Users have instant and transparent failover to 2nd site.

»Working with WindowsNT and SQL Server groups on this.


44

Outline

• FileCast & Reliable Multicast

• RAGS: SQL Testing

• TerraServer (a big DB)

• Sloan Sky Survey (CyberBricks)

• Billion Transactions per day

• Wolfpack Failover

• NTFS IO measurements

• NT-Cluster-Sort


45

Storage Latency: How Far Away is the Data?

[Chart, in clock ticks: registers 1, on-chip cache 2, on-board cache 10, memory 100, disk 10^6, tape/optical robot 10^9 — in human terms: my head (1 min), this room, this campus (10 min), Sacramento (1.5 hr), Pluto (2 years), Andromeda (2,000 years)]


46

The Memory Hierarchy

• Measuring & modeling sequential IO
• Where is the bottleneck?
• How does it scale with
  » SMP, RAID, new interconnects

Goals: balanced bottlenecks, low overhead, scale to many processors (10s), scale to many disks (100s)

[Diagram: app address space → file cache → memory bus → PCI → adapter → SCSI → controller → disk]


47

PAP (Peak Advertised Performance) vs RAP (Real Application Performance)

• Goal: RAP = PAP / 2 (the half-power point)

[Diagram: application data → file system buffers → PCI → SCSI → disk; rated vs. delivered at each stage — system bus 422 MBps, PCI 133 MBps, SCSI 40 MBps, disk 10-15 MBps, each carrying 7.2 MB/s on this path]


48

The Best Case: Temp File, NO IO

• Temp file read/write in the file system cache
• Program uses a small (in-CPU-cache) buffer
• So write/read time is bus move time (3x better than copy)
• Paradox: the fastest way to move data is to write it, then read it
• This hardware is limited to 150 MBps per processor

[Bar chart, MBps: temp read 148, temp write 136, memcopy() 54]


49

Bottleneck Analysis

• Drawn to linear scale

[Diagram, linear scale: theoretical bus bandwidth 422 MBps (= 66 MHz x 64 bits), memory read/write ~150 MBps, memcopy ~50 MBps, disk R/W ~9 MBps]


50

3 Stripes and You're Out!

• 3 disks can saturate the adapter
• Similar story with UltraWide
• CPU time goes down with request size
• Ftdisk (striping is cheap)

[Charts: read and write throughput vs. request size (2 KB to 192 KB) for 1, 2, 3, and 4 disk stripes, 3-deep, Fast SCSI — throughput levels off by 3 disks; CPU milliseconds per MB falls with request size. The 3-deep unbuffered access pattern is sketched below.]
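The "3-deep" in these measurements is the asynchronous request depth: the file is opened unbuffered and three reads are kept outstanding so the disks never idle between requests. A minimal Win32 C++ sketch of that access pattern (file name is illustrative; error handling trimmed; with FILE_FLAG_NO_BUFFERING, buffers and request sizes must be sector-aligned, which VirtualAlloc's page alignment satisfies):

    #include <windows.h>

    const int   DEPTH = 3;           // requests kept in flight ("3 deep")
    const DWORD REQ   = 64 * 1024;   // request size

    int main() {
        HANDLE h = CreateFileA("testfile.dat", GENERIC_READ, 0, NULL,
                               OPEN_EXISTING,
                               FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED,
                               NULL);
        void*      buf[DEPTH];
        OVERLAPPED ov[DEPTH] = {};
        LONGLONG   offset = 0;

        for (int i = 0; i < DEPTH; ++i) {            // prime the pipeline
            buf[i] = VirtualAlloc(NULL, REQ, MEM_COMMIT, PAGE_READWRITE);
            ov[i].hEvent     = CreateEvent(NULL, FALSE, FALSE, NULL);
            ov[i].Offset     = (DWORD)offset;
            ov[i].OffsetHigh = (DWORD)(offset >> 32);
            ReadFile(h, buf[i], REQ, NULL, &ov[i]);
            offset += REQ;
        }
        for (int done = 0; ; ++done) {               // steady state
            int i = done % DEPTH;
            DWORD got;
            if (!GetOverlappedResult(h, &ov[i], &got, TRUE) || got < REQ)
                break;                               // EOF or error
            ov[i].Offset     = (DWORD)offset;        // reissue at next offset
            ov[i].OffsetHigh = (DWORD)(offset >> 32);
            ReadFile(h, buf[i], REQ, NULL, &ov[i]);
            offset += REQ;
        }
        CloseHandle(h);
        return 0;
    }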


51

Parallel SCSI Busses Help

• A second SCSI bus nearly doubles read and WCE throughput
• Write needs deeper buffers
• Experiment is unbuffered (3-deep + WCE)

[Chart: throughput vs. request size (2 KB to 192 KB) for read, write, and WCE on one vs. two SCSI busses — two busses give about 2x]


52

File System Buffering & Stripes (UltraWide Drives)

• FS buffering helps small reads
• FS buffered writes peak at 12 MBps
• 3-deep async helps
• Write peaks at 20 MBps
• Read peaks at 30 MBps

[Charts: throughput vs. request size (2 KB to 192 KB) for three disks, 1-deep and 3-deep — FS read, unbuffered read, FS write WCE, unbuffered write WCE]


53

PAP vs RAP

• Reads are easy, writes are hard
• Async write can match WCE

[Diagram: rated vs. measured — system bus 422 MBps vs. 142 MBps; PCI 133 MBps vs. 72 MBps; SCSI 40 MBps vs. 31 MBps; disks 10-15 MBps vs. 9 MBps]


54

Bottleneck Analysis

• NTFS read/write: 9 disks, 2 SCSI busses, 1 PCI
  » ~65 MBps unbuffered read
  » ~43 MBps unbuffered write
  » ~40 MBps buffered read
  » ~35 MBps buffered write

[Diagram: memory read/write ~150 MBps; one PCI bus ~70 MBps; each adapter ~30 MBps]


55

Hypothetical Bottleneck Analysis

• NTFS read/write: 12 disks, 4 SCSI busses, 2 PCI (not measured — we had only one PCI bus available; the 2nd one was "internal")
  » ~120 MBps unbuffered read
  » ~80 MBps unbuffered write
  » ~40 MBps buffered read
  » ~35 MBps buffered write

[Diagram: memory read/write ~150 MBps; two PCI busses ~70 MBps each; four adapters ~30 MBps each → ~120 MBps]


56

Year 2002 Disks

• Big disk (10 $/GB)
  » 3"
  » 100 GB
  » 150 kaps (k accesses per second)
  » 20 MBps sequential
• Small disk (20 $/GB)
  » 3"
  » 4 GB
  » 100 kaps
  » 10 MBps sequential
• Both running Windows NT™ 7.0? (see below for why)


57

How Do They Talk to Each Other?

• Each node has an OS

• Each node has local resources: A federation.

• Each node does not completely trust the others.

• Nodes use RPC to talk to each other
  » CORBA? DCOM? IIOP? RMI?
  » One or all of the above

• Huge leverage in high-level interfaces.

• Same old distributed system story.

[Diagram: two nodes, each running applications over RPC?, streams, and datagrams on VIAL/VIPL, connected by wire(s)]


58

Outline

• FileCast & Reliable Multicast

• RAGS: SQL Testing

• TerraServer (a big DB)

• Sloan Sky Survey (CyberBricks)

• Billion Transactions per day

• Wolfpack Failover

• NTFS IO measurements

• NT-Cluster-Sort


59

Penny Sort Ground Rules
http://research.microsoft.com/barc/SortBenchmark

• How much can you sort for a penny?
  » Hardware and software cost, depreciated over 3 years
  » 3 years ≈ 94,608,000 seconds and a penny is $0.01, so a penny buys: Time (seconds) = 946,080 / SystemPrice ($)
  » A 1 M$ system gets about 1 second
  » A 1 K$ system gets about 1,000 seconds

• Input and output are disk resident

• Input is
  » 100-byte records (random data)
  » key is the first 10 bytes

• Must create output file and fill with sorted version of input file.

• Daytona (product) and Indy (special) categories


60

PennySort

• Hardware
  » 266 MHz Intel PPro
  » 64 MB SDRAM (10 ns)
  » Dual Fujitsu DMA 3.2 GB EIDE disks
• Software
  » NT workstation 4.3
  » NT 5 sort
• Performance
  » sort 15 M 100-byte records (~1.5 GB)
  » disk to disk
  » elapsed time 820 sec (cpu time = 404 sec)

[Pie chart: PennySort machine (1107 $) cost breakdown — cpu 32%, disk 25%, board 13%, network/video/floppy 9%, memory 8%, cabinet + assembly 7%, software 6%, other 22%]


61

Cluster Sort Conceptual Model

• Multiple data sources
• Multiple data destinations
• Multiple nodes
• Disks -> sockets -> disk -> disk (partitioning sketched below)

[Diagram: three nodes each start with mixed runs (AAABBBCCC); records flow through sockets so each node ends up with its own partition — all A's, all B's, all C's]
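The socket exchange is driven by a partition function: every node scans its local input and routes each record to the node that owns that record's key range, so the A's, B's, and C's converge. A C++ sketch — range-partitioning on the first key byte is an assumption for illustration; real sorts pick splitters by sampling the keys:

    #include <cstddef>
    #include <cstdint>

    const int kRecordSize = 100;   // 100-byte records, key = first 10 bytes
    const int kNodes      = 3;

    // Illustrative partition function: split the key space evenly by
    // the first key byte. Production sorts sample to choose splitters.
    int DestinationNode(const uint8_t* record) {
        return record[0] * kNodes / 256;
    }

    // Each node scans its local input and ships every record to its
    // destination over a socket (the send itself is omitted here).
    void Exchange(const uint8_t* input, size_t nRecords,
                  void (*send)(int node, const uint8_t* rec)) {
        for (size_t i = 0; i < nRecords; ++i) {
            const uint8_t* rec = input + i * kRecordSize;
            send(DestinationNode(rec), rec);
        }
    }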


62

Cluster Install & Execute

• If this is to be used by others, it must be:
  » Easy to install
  » Easy to execute

• Installations of distributed systems take time and can be tedious. (AM2, GluGuard)

• Parallel remote execution is non-trivial. (GLUnix, LSF)

How do we keep this "simple" and "built-in" to NTClusterSort?


63

Remote Install

• Add a Registry entry to each remote node (sketched below):
  » RegConnectRegistry()
  » RegCreateKeyEx()
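A sketch of that remote-install step in Win32 C++ — the key path and value written here are hypothetical, not NTClusterSort's actual registry layout:

    #include <windows.h>
    #include <string.h>

    // Install the sort on a remote node by writing a key into its
    // registry over the network.
    int InstallOnNode(const char* machine /* e.g. "\\\\node17" */) {
        HKEY hRemote, hKey;
        // Connect to HKEY_LOCAL_MACHINE on the remote machine.
        if (RegConnectRegistryA(machine, HKEY_LOCAL_MACHINE, &hRemote) != ERROR_SUCCESS)
            return -1;
        // Hypothetical key; the real installer's layout may differ.
        DWORD disp;
        if (RegCreateKeyExA(hRemote, "SOFTWARE\\BARC\\NTClusterSort", 0, NULL,
                            REG_OPTION_NON_VOLATILE, KEY_WRITE, NULL,
                            &hKey, &disp) != ERROR_SUCCESS) {
            RegCloseKey(hRemote);
            return -1;
        }
        const char* path = "C:\\NTClusterSort\\sort.exe";
        RegSetValueExA(hKey, "ImagePath", 0, REG_SZ,
                       (const BYTE*)path, (DWORD)strlen(path) + 1);
        RegCloseKey(hKey);
        RegCloseKey(hRemote);
        return 0;
    }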


64

Cluster Execution

• Setup: MULTI_QI struct, COSERVERINFO struct
• CoCreateInstanceEx()
• Retrieve the remote object handle from the MULTI_QI struct
• Invoke methods as usual (a sketch follows below)

[Diagram: the driver holds a HANDLE per node and invokes Sort() on each remote object]
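A sketch of that execution step in C++, assuming a hypothetical ISort interface (its CLSID and IID would come from the sort server's IDL-generated header, so they are only declared here):

    #include <objbase.h>

    // Hypothetical interface; in reality these declarations come from
    // the sort server's IDL-generated header.
    extern const CLSID CLSID_Sort;
    extern const IID   IID_ISort;
    struct ISort : public IUnknown {
        virtual HRESULT STDMETHODCALLTYPE Sort() = 0;
    };

    // Caller must have called CoInitializeEx on this thread.
    HRESULT RunRemoteSort(const wchar_t* node) {
        // Name the remote node; pAuthInfo = NULL means default security.
        COSERVERINFO server = {};
        server.pwszName = const_cast<wchar_t*>(node);

        // Ask for the ISort interface on the new remote instance.
        MULTI_QI qi = {};
        qi.pIID = &IID_ISort;

        HRESULT hr = CoCreateInstanceEx(CLSID_Sort, NULL, CLSCTX_REMOTE_SERVER,
                                        &server, 1, &qi);
        if (FAILED(hr) || FAILED(qi.hr))
            return FAILED(hr) ? hr : qi.hr;

        ISort* sort = static_cast<ISort*>(qi.pItf);  // remote object handle
        hr = sort->Sort();                           // invoke method as usual
        sort->Release();
        return hr;
    }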


65

SAN: Standard Interconnect

[Chart of link speeds: Gbps Ethernet 110 MBps; PCI-32 70 MBps; UW SCSI 40 MBps; FW SCSI 20 MBps; SCSI 5 MBps]

• LAN faster than the memory bus?
• 1 GBps links in the lab
• 300 $ port cost soon
• Port is computer
• RIP: FDDI, ATM, SCI, SCSI, FC, ...?