Computer Technology Forecast Jim Gray Microsoft Research [email protected] http://research.microsoft.com/~Gray




Page 1: Computer Technology Forecast Jim Gray Microsoft Research Gray@Microsoft.com http://research.microsoft.com/~Gray

Computer Technology Forecast

Jim Gray

Microsoft Research

[email protected]

http://research.microsoft.com/~Gray

Page 2:

Reality Check

• Good news
  – In the limit, processing & storage & network are free
  – Processing & network are infinitely fast

• Bad news
  – Most of us live in the present.
  – People are getting more expensive: management/programming cost exceeds hardware cost.
  – Speed of light is not improving.
  – WAN prices have not changed much in the last 8 years.

Page 3:

Interesting Topics

• I’ll talk about server-side hardware

• What about client hardware?
  – Displays, cameras, speech, …

• What about software?
  – Databases, data mining, PDB, OODB

– Objects / class libraries …

– Visualization

– Open Source movement

Page 4:

How Much Information Is There?

• Soon everything can be recorded and indexed

• Most data will never be seen by humans

• Precious Resource: Human attention

• Auto-summarization and auto-search are the key technology.
  www.lesk.com/mlesk/ksg97/ksg.html

[Chart: how much information there is, on a log scale from kilo up through mega, giga, tera, peta, exa, zetta, yotta — a photo, a book, a movie, all LoC books (words), all books multimedia, everything recorded, everything! Prefixes going down: 3 milli, 6 micro, 9 nano, 12 pico, 15 femto, 18 atto, 21 zepto, 24 yocto.]

Page 5:

Moore’s Law
• Performance/price doubles every 18 months
• 100x per decade
• Progress in the next 18 months = ALL previous progress
  – New storage = sum of all old storage (ever)
  – New processing = sum of all old processing
• E. coli doubles every 20 minutes!
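A quick back-of-the-envelope check of these two claims (a sketch; the 18-month doubling period is taken from the slide itself):

```python
# Moore's law: performance/price doubles every 18 months (1.5 years).
def growth(years, doubling_period=1.5):
    """Multiplier after `years` of doubling every 18 months."""
    return 2 ** (years / doubling_period)

decade = growth(10)                      # ~101.6x -- the "100x per decade"

# If production doubles each period, the next period's output equals the
# sum of everything that came before it (plus the very first unit):
history = [2 ** i for i in range(10)]    # units produced in periods 0..9
next_period = 2 ** 10
print(round(decade), next_period - sum(history))
```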

Page 6:

Trends: ops/s/$ Had Three Growth Phases

• 1890–1945: mechanical, relay – 7-year doubling
• 1945–1985: tube, transistor, … – 2.3-year doubling
• 1985–2000: microprocessor – 1.0-year doubling
[Chart: ops per second per $, 1880–2000, log scale from 1.E-06 to 1.E+09, showing the three doubling slopes.]

Page 7:

What’s a Balanced System?
[Diagram: a system bus and two PCI buses.]

Page 8:

Storage capacity beating Moore’s law

• 5 k$/TB today (raw disk)
[Chart: disk TB shipped per year, 1988–2000, log scale 1E+3 to 1E+7, heading toward an exabyte/year; disk TB growth at 112%/y vs. Moore's law at 58.7%/y.]
Source: 1998 Disk Trend (Jim Porter), http://www.disktrend.com/pdf/portrpkg.pdf
• Moore's law: 58.7%/year
• Revenue growth: 7.47%/year
• TB growth: 112.3%/year (since 1993)
• Price decline: 50.7%/year (since 1993)

Page 9:

Cheap Storage
• Disks are getting cheap:
• 7 k$/TB disks (25 × 40 GB disks @ 230$ each)
[Charts: price ($, 0–900) vs. raw disk unit size (0–60 GB), with linear fits y = 5.7156x + 47.857 (IDE) and y = 15.895x + 13.446 (SCSI); a second chart shows raw k$/TB (0–40) falling with disk unit size for both IDE and SCSI.]

Page 10:

Cheap Storage or Balanced System

• Low-cost storage (2 × 1.5 k$ servers): 7 k$/TB
  2 × (1 k$ system + 8 × 60 GB disks + 100 Mb Ethernet)

• Balanced server (7 k$ / 0.5 TB):
  – 2 × 800 MHz CPUs (2 k$)
  – 256 MB (400$)
  – 8 × 60 GB drives (3 k$)
  – Gbps Ethernet + switch (1.5 k$)
  – 14 k$/TB, 28 k$/RAIDed TB

Page 11:

The “Absurd” Disk

[Diagram: a 1 TB drive, 100 MB/s, 200 aps]

• 2.5 hr scan time (poor sequential access)

• 1 aps / 5 GB (VERY cold data)

• It’s a tape!
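The slide's two numbers follow directly from the drive's capacity, bandwidth, and access rate:

```python
# Scan time and access density for the "absurd" 1 TB disk (slide numbers).
TB = 10**12
capacity = 1 * TB               # bytes
bandwidth = 100 * 10**6         # 100 MB/s sequential
accesses_per_sec = 200          # random accesses per second

scan_hours = capacity / bandwidth / 3600
gb_per_access = capacity / accesses_per_sec / 10**9
print(f"{scan_hours:.1f} hr scan, 1 aps per {gb_per_access:.0f} GB")
```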

Page 12:

Hot Swap Drives for Archive or Data Interchange

• 25 MBps write (so can write N × 60 GB in 40 minutes)
• 60 GB overnight ≈ N × 2 MB/second @ 19.95$/night
[Diagram labels: 17$, 260$]

Page 13:

240 GB, 2 k$ (now); 300 GB by year end

• 4 × 60 GB IDE (2 hot-pluggable) (1,100$)

• SCSI-IDE bridge (200$)

• Box: 500 MHz CPU, 256 MB SRAM, fan, power, Enet (700$)

• Or 8 disks/box: 600 GB for ~3 k$ (or 300 GB RAID)
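Adding up the parts list confirms the headline price (the SCSI-IDE bridge is read here as 200$, which is what makes the slide's 2 k$ total come out):

```python
# Parts list for the 240 GB box, prices from the slide.
disks = 1100      # 4 x 60 GB IDE (2 hot-pluggable)
bridge = 200      # SCSI-IDE bridge
box = 700         # 500 MHz CPU, 256 MB SRAM, fan, power, Enet
total = disks + bridge + box
print(total)      # "240 GB, 2k$"
```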

Page 14:

Hot Swap Drives for Archive or Data Interchange

• 25 MBps write (so can write N × 74 GB in 3 hours)
• 74 GB overnight ≈ N × 2 MB/second @ 19.95$/night

Page 15:

It’s Hard to Archive a Petabyte — it takes a LONG time to restore it.

• At 1 GBps it takes 12 days!
• Store it in two (or more) places online (on disk?): a geo-plex.
• Scrub it continuously (look for errors).
• On failure:
  – use the other copy until the failure is repaired,
  – refresh the lost copy from the safe copy.
• Can organize the two copies differently (e.g., one by time, one by space).
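The "12 days" is simple division, and it is the whole argument for keeping both copies online:

```python
# How long to restore a petabyte at 1 GB/s.
PB = 10**15
restore_seconds = PB / (1 * 10**9)     # 1 GB/s
restore_days = restore_seconds / 86400
print(f"{restore_days:.1f} days")      # ~11.6, i.e. the slide's "12 days"
```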

Page 16:

Disk vs Tape

            Disk                        Tape
Capacity    60 GB                       40 GB
Bandwidth   30 MBps                     10 MBps
Latency     5 ms seek + 3 ms rotate     10 s pick + 30–120 s seek
Price       7$/GB drive,                2$/GB media,
            3$/GB ctlrs/cabinet         8$/GB drive + library
Density     4 TB/rack                   10 TB/rack
Scan time   1 hour                      1 week

The price advantage of tape is narrowing, and the performance advantage of disk is growing. At 10 k$/TB, disk is competitive with nearline tape.

Guesstimates – CERN: 200 TB of 3480 tapes; 2 columns = 50 GB; 1 rack = 1 TB = 20 drives.

Page 17:

Trends: Gilder’s Law – 3x bandwidth/year for 25 more years

• Today:
  – 10 Gbps per channel
  – 4 channels per fiber: 40 Gbps
  – 32 fibers/bundle = 1.2 Tbps/bundle
• In the lab: 3 Tbps/fiber (400 × WDM)
• In theory: 25 Tbps per fiber (1 fiber = 25 Tbps)
• 1 Tbps = USA 1996 WAN bisection bandwidth
• Aggregate bandwidth doubles every 8 months!

Page 18:

Sense of scale

• How fat is your pipe?
• Fattest pipe on the MS campus is the WAN!
  – 300 MBps: OC48 = G2, or memcpy()
  – 94 MBps: coast to coast
  – 90 MBps: PCI
  – 20 MBps: disk / ATM / OC3

Page 19:

[Map: the route from Arlington, VA through New York and San Francisco, CA to Redmond/Seattle, WA – 5626 km, 10 hops.]

Participants: Information Sciences Institute, Microsoft, Qwest, University of Washington, Pacific Northwest Gigapop, HSCC (High Speed Connectivity Consortium), DARPA.

Page 20:

The Path: DC -> SEA

C:\> tracert -d 131.107.151.194
Tracing route to 131.107.151.194 over a maximum of 30 hops
 0  -------                 DELL 4400 Win2K WKS, Arlington Virginia, ISI, Alteon GbE
 1  16 ms  <10 ms <10 ms    140.173.170.65   Juniper M40 GbE, Arlington Virginia, ISI interface ISIe
 2  <10 ms <10 ms <10 ms    205.171.40.61    Cisco GSR OC48, Arlington Virginia, Qwest DC Edge
 3  <10 ms <10 ms <10 ms    205.171.24.85    Cisco GSR OC48, Arlington Virginia, Qwest DC Core
 4  <10 ms <10 ms  16 ms    205.171.5.233    Cisco GSR OC48, New York, New York, Qwest NYC Core
 5  62 ms  63 ms  62 ms     205.171.5.115    Cisco GSR OC48, San Francisco, CA, Qwest SF Core
 6  78 ms  78 ms  78 ms     205.171.5.108    Cisco GSR OC48, Seattle, Washington, Qwest Sea Core
 7  78 ms  78 ms  94 ms     205.171.26.42    Juniper M40 OC48, Seattle, Washington, Qwest Sea Edge
 8  78 ms  79 ms  78 ms     208.46.239.90    Juniper M40 OC48, Seattle, Washington, PNW Gigapop
 9  78 ms  78 ms  94 ms     198.48.91.30     Cisco GSR OC48, Redmond Washington, Microsoft
10  78 ms  78 ms  94 ms     131.107.151.194  Compaq SP750 Win2K WKS, Redmond Washington, Microsoft, SysKonnect GbE

Page 21:

“PetaBumps”

• 751 Mbps for 300 seconds = ~28 GB
  (single-thread, single-stream TCP/IP, desktop-to-desktop, out-of-the-box performance)

• 5626 km × 751 Mbps = ~4.2e15 bit-meters/second ≈ 4.2 peta-bmps

• Multi-stream is 952 Mbps ≈ 5.2 peta-bmps

• 4470-byte MTUs were enabled on all routers; 20 MB window size.
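The bit-meter-per-second figure of merit, checked in a few lines:

```python
# The record run's figure of merit: bit-meters per second ("bmps").
distance_m = 5626 * 1000             # Arlington, VA -> Redmond, WA
single_stream_bps = 751 * 10**6      # single-stream TCP/IP throughput
bmps = distance_m * single_stream_bps
data_gb = single_stream_bps * 300 / 8 / 10**9   # bytes moved in the 300 s run
print(f"{bmps:.2e} bit-meters/s, {data_gb:.0f} GB moved")
```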

Page 22:
Page 23:

The Promise of SAN/VIA: 10x in 2 years – http://www.ViArch.org/

[Chart: time in µs to send 1 KB (sender CPU, receiver CPU, transmit), y-axis 0–250 µs, for 100 Mbps Ethernet, Gbps Ethernet, and SAN.]

• Yesterday:
  – 10 MBps (100 Mbps Ethernet)
  – ~20 MBps TCP/IP saturates 2 CPUs
  – round-trip latency ~250 µs

• Now:
  – Wires are 10x faster: Myrinet, Gbps Ethernet, ServerNet, …
  – Fast user-level communication:
    • TCP/IP ~100 MBps at 10% CPU
    • round-trip latency is 15 µs
    • 1.6 Gbps demoed on a WAN

Page 24:

Pointers

• The single-stream submission: http://research.microsoft.com/~gray/papers/Windows2000_I2_land_Speed_Contest_Entry_(Single_Stream_mail).htm

• The multi-stream submission: http://research.microsoft.com/~gray/papers/Windows2000_I2_land_Speed_Contest_Entry_(Multi_Stream_mail).htm

• The code: http://research.microsoft.com/~gray/papers/speedy.htm (speedy.h, speedy.c)

• And a PowerPoint presentation about it: http://research.microsoft.com/~gray/papers/Windows2000_WAN_Speed_Record.ppt

Page 25:

Networking

• WANs are getting faster than LANs: G8 = OC192 = 8 Gbps is “standard”
• Link bandwidth improves 4x per 3 years
• Speed of light (60 ms round trip in US)
• Software stacks have always been the problem:

  Time = SenderCPU + ReceiverCPU + bytes/bandwidth

  This has been the problem.
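The cost model above can be sketched directly; the per-side CPU overheads below are illustrative assumptions, not measurements from the talk:

```python
# Time = SenderCPU + ReceiverCPU + bytes/bandwidth (the slide's model).
def send_time(nbytes, bandwidth_bytes_per_s, sender_cpu_s, receiver_cpu_s):
    return sender_cpu_s + receiver_cpu_s + nbytes / bandwidth_bytes_per_s

# 1 KB over a 100 MB/s link with an assumed 50 µs of CPU work per side:
t = send_time(1024, 100e6, 50e-6, 50e-6)
print(f"{t*1e6:.0f} µs")   # software overhead dwarfs the ~10 µs wire time
```

This is why faster wires alone did not help: for small messages the two CPU terms dominate, which is exactly the case SAN/VIA user-level communication attacks.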

Page 26:

Rules of Thumb in Data Engineering
• Moore’s law -> an address bit per 18 months.
• Storage grows 100x/decade (except 1000x last decade!)
• Disk data of 10 years ago now fits in RAM (iso-price).
• Device bandwidth grows 10x/decade – so need parallelism.
• RAM:disk:tape price is 1:10:30, going to 1:10:10.
• Amdahl’s speedup law: S/(S+P).
• Amdahl’s IO law: a bit of IO per instruction/second
  (a TBps per 10 teraops: 50,000 disks per 10 teraOP: 100 M$)
• Amdahl’s memory law: a byte per instruction/second (going to 10)
  (1 TB RAM per TOP: 1 TeraDollar)
• PetaOps anyone?
• Gilder’s law: aggregate bandwidth doubles every 8 months.
• 5-minute rule: cache disk data that is reused within 5 minutes.
• Web rule: cache everything!
http://research.microsoft.com/~gray/papers/MS_TR_99_100_Rules_of_Thumb_in_Data_Engineering.doc
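The 5-minute rule falls out of a break-even calculation between the cost of RAM to cache a page and the cost of the disk arm that would otherwise re-fetch it. A sketch using the classic break-even formula; the prices and rates below are illustrative year-2000-ish assumptions, not numbers from the slide:

```python
# Break-even reuse interval for caching a disk page in RAM:
#   interval = (pages_per_MB * $/disk) / (accesses_per_sec_per_disk * $/MB RAM)
pages_per_mb = 128        # 8 KB pages
disk_price = 230.0        # $ per drive (assumed)
disk_aps = 80.0           # random accesses/sec per drive (assumed)
ram_price_per_mb = 1.5    # $ per MB of RAM (assumed)

interval_s = (pages_per_mb * disk_price) / (disk_aps * ram_price_per_mb)
print(f"break-even reuse interval: {interval_s/60:.0f} minutes")
```

With these constants the break-even comes out in the neighborhood of five minutes; as RAM and disk prices shift, the interval shifts with them.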

Page 27:

Dealing With TeraBytes (Petabytes) Requires Parallelism

• parallelism: use many little devices in parallel
• At 10 MB/s, scanning 1 terabyte takes 1.2 days.
• 1,000-way parallel: 100-second scan.
• Use 100 processors & 1,000 disks.
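The slide's scan arithmetic, serial versus parallel:

```python
# Scan time for 1 TB at 10 MB/s per disk, serial vs. 1,000 disks in parallel.
TB = 10**12
per_disk_bandwidth = 10 * 10**6      # 10 MB/s

serial_days = TB / per_disk_bandwidth / 86400
parallel_s = TB / (1000 * per_disk_bandwidth)
print(f"serial: {serial_days:.1f} days, parallel: {parallel_s:.0f} s")
```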

Page 28:

Parallelism Must Be Automatic

• There are thousands of MPI programmers.

• There are hundreds-of-millions of people using parallel database search.

• Parallel programming is HARD!

• Find design patterns and automate them.

• Data search/mining has parallel design patterns.

Page 29:

Scalability: Up and Out

• “Scale Out”: clones & partitions
  – Use commodity servers
  – Add clones & partitions as needed

• “Scale Up”: use “big iron” (SMP)
  – Cluster into packs for availability

Page 30:

Everyone scales out. What’s the brick?

• 1 M$/slice – IBM S390? Sun E10000?
• 100 k$/slice – HPUX/AIX/Solaris/IRIX/EMC
• 10 k$/slice – Utel / Wintel 4x
• 1 k$/slice – Beowulf / Wintel 1x

Page 31:

Terminology for scaleability

• Farms of servers:
  – Clones: identical (scaleability + availability)
  – Partitions: scaleability
  – Packs: partition availability via fail-over
• GeoPlex for disaster tolerance.

[Taxonomy diagram: Farm -> Clone (shared-nothing | shared-disk); Partition -> Pack (shared-nothing; active-active | active-passive).]

Page 32:

Shared-Nothing Clones · Shared-Disk Clones · Partitions · Packed Partitions

[Taxonomy diagram: Farm -> Clone (shared-nothing | shared-disk); Partition -> Pack (shared-nothing; active-active | active-passive).]

Page 33:

Unpredictable Growth

• The TerraServer story:
  – We expected 5 M hits per day
  – We got 50 M hits on day 1
  – We peak at 15–20 M hits/day on a “hot” day
  – Average 5 M hits/day after 1 year

• Most of us cannot predict demand:
  – Must be able to deal with NO demand
  – Must be able to deal with HUGE demand

Page 34:

An Architecture for Internet Services?

• Need to be able to add capacity:
  – New processing
  – New storage
  – New networking

• Need continuous service:
  – Online change of all components (hardware and software)
  – Multiple service sites
  – Multiple network providers

• Need great development tools:
  – Change the application several times per year
  – Add new services several times per year

Page 35:

Premise: Each Site is a Farm

• Buy computing by the slice (brick): rack of servers + disks.
• Grow by adding slices; spread data and computation to new slices.
• Two styles:
  – Clones: anonymous servers
  – Parts + Packs: partitions fail over within a pack
• In both cases, a remote farm for disaster recovery.

[Diagram: the Microsoft.Com site, circa 1998 – switched Ethernet and FDDI rings (MIS1–MIS4) connecting clone farms for www.microsoft.com, home.microsoft.com, search, premium, activex, register, support, cdm, msid.msn.com, and FTP/HTTP download servers; staging servers (DMZ and IDC), live SQL Servers, SQL consolidators, and log processing; European and Japan data centers; MOSWest admin LAN; Internet feeds of 13 DS3 (45 Mb/s each), 2 OC3 (100 Mb/s each), and 2 × 100 Mb/s Ethernet. Typical slice: 4-CPU P5/P6, 256 MB–1 GB RAM, 12–180 GB disk, at roughly 24 k$–128 k$ per server.]

Page 36:

Clones: Availability + Scalability

• Some applications are:
  – Read-mostly
  – Low consistency requirements
  – Modest storage requirement (less than 1 TB)

• Examples:
  – HTML web servers (IP sprayer/sieve + replication)
  – LDAP servers (replication via gossip)

• Replicate the app at all nodes (clones).
• Spray requests across nodes.
• Fault tolerance: stop sending to a failed clone.
• Growth: add a clone.
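The clone pattern above (spray, stop-sending-on-failure, grow-by-adding) can be sketched in a few lines; the clone names and health bookkeeping here are hypothetical stand-ins, not Microsoft's actual IP sprayer:

```python
# Minimal round-robin request sprayer over a set of clones.
from itertools import cycle

class Sprayer:
    def __init__(self, clones):
        self.clones = list(clones)
        self.up = set(self.clones)          # which clones are healthy
        self._ring = cycle(self.clones)

    def mark_down(self, clone):             # fault tolerance: stop sending
        self.up.discard(clone)

    def add_clone(self, clone):             # growth: add a clone
        self.clones.append(clone)
        self.up.add(clone)
        self._ring = cycle(self.clones)

    def route(self):
        for _ in range(len(self.clones)):   # skip clones that are down
            c = next(self._ring)
            if c in self.up:
                return c
        raise RuntimeError("no clones available")

s = Sprayer(["web1", "web2", "web3"])
s.mark_down("web2")
picks = [s.route() for _ in range(4)]       # web2 never receives traffic
```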

Page 37:

Two Clone Geometries

• Shared-nothing: exact replicas
• Shared-disk: state stored in the server

[Diagrams: shared-nothing clones; shared-disk clones.]

Page 38:

Facilities Clones Need

• Automatic replication:
  – Applications (and system software)
  – Data

• Automatic request routing: spray or sieve

• Management:
  – Who is up?
  – Update management & propagation
  – Application monitoring

• Clones are very easy to manage:
  – Rule of thumb: 100s of clones per admin

Page 39:

Partitions for Scalability

• Clones are not appropriate for some apps:
  – Stateful apps do not replicate well
  – High update rates do not replicate well

• Examples:
  – Email / chat / …
  – Databases

• Partition state among servers.
• Scalability (online):
  – Partition split/merge
  – Partitioning must be transparent to the client
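A sketch of the partition pattern: state is spread by key across servers, with an indirection table so an online split stays transparent to clients. The server names and bucket count are hypothetical:

```python
# Key -> bucket -> server indirection, so buckets can migrate online.
import hashlib

BUCKETS = 16
bucket_owner = {b: f"mail{b % 4}" for b in range(BUCKETS)}   # 4 partitions

def bucket_of(key: str) -> int:
    return hashlib.md5(key.encode()).digest()[0] % BUCKETS

def server_for(key: str) -> str:
    return bucket_owner[bucket_of(key)]     # clients only see this lookup

def split(bucket: int, new_server: str):
    bucket_owner[bucket] = new_server       # migrate one bucket's state

before = server_for("alice")                # wherever alice's bucket lives
split(bucket_of("alice"), "mail9")          # online split/migration
after = server_for("alice")
```

Because clients resolve keys through the table rather than a fixed formula, a split changes one table entry instead of every client.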

Page 40:

Partitioned/Clustered Apps

• Mail servers: perfectly partitionable
• Business object servers: partition by set of objects
• Parallel databases:
  – Transparent access to partitioned tables
  – Parallel query

Page 41:

Packs for Availability

• Each partition may fail (independently of the others).
• Partitions migrate to a new node via fail-over (fail-over in seconds).
• Pack: the nodes supporting a partition
  – VMS Cluster
  – Tandem Process Pair
  – SP2 HACMP
  – Sysplex™
  – WinNT MSCS (Wolfpack)
• Cluster-in-a-box is now a commodity.
• Partitions typically grow in packs.

Page 42:

What Parts + Packs Need

• Automatic partitioning (in DBMS, mail, files, …):
  – Location transparent
  – Partition split/merge
  – Grow without limits (100 × 10 TB)

• Simple failover model:
  – Partition migration is transparent
  – MSCS-like model for services

• Application-centric request routing

• Management:
  – Who is up?
  – Automatic partition management (split/merge)
  – Application monitoring

Page 43:

Partitions and Packs
• Packs for availability
[Diagrams: partitions; packed partitions.]

Page 44:

GeoPlex: Farm Pairs

• Two farms.
• Changes from one are sent to the other.
• When one farm fails, the other provides service.
• Masks:
  – Hardware/software faults
  – Operations tasks (reorganize, upgrade, move)
  – Environmental faults (power failure)

Page 45:

Services on Clones & Partitions

• An application provides a set of services.

• If cloned: services are on a subset of the clones.

• If partitioned: services run at each partition.

• System load balancing routes each request to:
  – Any clone
  – The correct partition
  – And routes around failures.

Page 46:

Cluster Scenarios: 3-tier systems

A simple web site:
[Diagram: web clients -> load balance -> cloned front ends (clones for availability) -> web file store, SQL temp state, and SQL database (packs for availability).]

Page 47:

Cluster Scale-Out Scenarios
The FARM: Clones and Packs of Partitions

[Diagram: web clients -> load balance -> cloned front ends (firewall, sprayer, web server) -> cloned/packed file servers (web file stores A and B, with replication), SQL temp state, and the SQL database as packed partitions (SQL partitions 1–3): database transparency.]

Page 48:

Terminology

• Farms of servers:
  – Clones: identical (scaleability + availability)
  – Partitions: scaleability
  – Packs: partition availability via fail-over
• GeoPlex for disaster tolerance.

[Taxonomy diagram: Farm -> Clone (shared-nothing | shared-disk); Partition -> Pack (shared-nothing; active-active | active-passive).]

Page 49:

What we have been doing with SDSS

• Helping move the data to SQL:
  – Database design
  – Data loading

• Experimenting with queries on a 4 M object DB:
  – 20 questions like “find gravitational lens candidates”
  – Queries use parallelism; most run in a few seconds (auto-parallel)
  – Some run in hours (neighbors within 1 arcsec)
  – EASY to ask questions

• Helping with an “outreach” website: SkyServer

• Personal goal: try data-mining techniques to “re-discover” astronomy

[Chart: color magnitude diff/ratio distribution – counts (1.E+0 to 1.E+7, log scale) vs. magnitude diff/ratio (−30 to +30) for u−g, g−r, r−i, i−z.]

Page 50:

References (.doc or .pdf)

• Technology forecast: http://research.microsoft.com/~gray/papers/MS_TR_99_100_Rules_of_Thumb_in_Data_Engineering.doc

• Gbps experiments: http://research.microsoft.com/~gray/

• Disk experiments (10 k$/TB): http://research.microsoft.com/~gray/papers/Win2K_IO_MSTR_2000_55.doc

• Scaleability terminology: http://research.microsoft.com/~gray/papers/MS_TR_99_85_Scalability_Terminology.doc