Slide 1: Parallel Database Systems: A SNAP Application

Gordon Bell
450 Old Oak Court, Los Altos, CA 94022
GBell@Microsoft.com

Jim Gray
310 Filbert, SF CA 94133
Gray@Microsoft.com

[Title figure: Network and Platform]
Slide 2: Outline

• Cyberspace pep talk:
  • Databases are the dirt of Cyberspace
  • Billions of clients mean millions of servers
• Parallel imperative:
  • Hardware trend: many little devices
  • Consequence: servers are arrays of commodity components
  • PCs are the bricks of Cyberspace
  • Must automate parallel {design / operation / use}
  • Software parallelism via dataflow & data partitioning
• Parallel database techniques:
  • Parallel execution of many little jobs (OLTP)
  • Data partitioning
  • Pipeline execution
  • Automation techniques
• Summary
Slide 3: Kinds of Information Processing

                  Point-to-Point          Broadcast
  Immediate       conversation, money     lecture, concert      (Network)
  Time-Shifted    mail                    book, newspaper       (Database)

• It's ALL going electronic.
• Immediate traffic is being stored for analysis (so ALL of it ends up in a database).
• Analysis & automatic processing are being added.
Slide 4: Why Put Everything in Cyberspace?

• Low rent: min $/byte
• Shrinks time: now or later
• Shrinks space: here or there
• Automate processing: knowbots

[Figure: quadrants of Point-to-Point OR Broadcast vs. Immediate OR Time-Delayed, with Network and DataBase in the middle; knowbots Locate, Process, Analyze, Summarize]
Slide 6: Databases Store ALL Data Types

• The new world:
  • billions of objects
  • big objects (1 MB)
  • objects have behavior (methods)
• The old world:
  – millions of objects
  – 100-byte objects
• Paperless office, Library of Congress online, all information online: entertainment, publishing, business. Information Network, Knowledge Navigator, Information At Your Fingertips.

[Figure: the old-world People table (Name, Address) next to the new-world People table (Name, Address, Papers, Picture, Voice), with rows Mike, Won, David and addresses NY, Berk, Austin]
Slide 7: Magnetic Storage Cheaper than Paper

• File cabinet:  cabinet (4 drawer)        $250
                 paper (24,000 sheets)     $250
                 space (2x3 ft @ $10/ft2)  $180
                 total                     $700   ->  3 ¢/sheet
• Disk:          disk (8 GB)               $4,000
                 ASCII: 4 million pages    ->  0.1 ¢/sheet (30x cheaper than paper)
• Image:         200,000 pages             ->  2 ¢/sheet (similar to paper)
• Conclusion: store everything on disk.
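A back-of-envelope check of this arithmetic (a sketch only; the per-page byte sizes are implied by the slide's 8 GB / 4 million page / 200,000 page figures, not stated):

```python
# Back-of-envelope check of the storage-cost arithmetic above (1995 numbers).
cabinet_total = 250 + 250 + 180              # cabinet + paper + floor space ($)
sheets_per_cabinet = 24_000
print(cabinet_total / sheets_per_cabinet * 100)  # ~2.8 cents/sheet -> "3 ¢/sheet"

disk_cost = 4_000                            # 8 GB disk ($)
ascii_pages = 4_000_000                      # implies ~2 KB of ASCII per page
image_pages = 200_000                        # implies ~40 KB per page image
print(disk_cost / ascii_pages * 100)         # ~0.1 cents/sheet
print(disk_cost / image_pages * 100)         # ~2 cents/sheet
```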
Slide 8: Cyberspace Demographics

• Computer history:
  1950 National computer
  1960 Corporate computer
  1970 Site computer
  1980 Departmental computer
  1990 Personal computer
  2000 ?
• Most computers are small. NEXT: 1 billion X, for some X (phone?).
• Most of the money is in clients and wiring:
  1990: 50% desktop
  1995: 75% desktop

[Figure: installed population (1K, 1M, 100M units) and revenue (1B$, 10B$, 100B$) by class: SUPER, MAINFRAME, MINI, WS, PC]
Slide 9: Billions of Clients

• Every device will be "intelligent": doors, rooms, cars, ...
• Computing will be ubiquitous.
Slide 10: Billions of Clients Need Millions of Servers

• All clients are networked to servers; they may be nomadic or on-demand.
• Fast clients want faster servers.
• Servers provide data, control, coordination, and communication.
• Super servers: large databases, high-traffic shared data.

[Figure: mobile and fixed clients connected to servers, which connect to a super server]
Slide 17: Outline

• Cyberspace pep talk:
  • Databases are the dirt of Cyberspace
  • Billions of clients mean millions of servers
• Parallel imperative:
  • Hardware trend: many little devices
  • Consequence: servers are arrays of commodity parts
  • PCs are the bricks of Cyberspace
  • Must automate parallel {design / operation / use}
  • Software parallelism via dataflow & data partitioning
• Parallel database techniques:
  • Parallel execution of many little jobs (OLTP)
  • Data partitioning
  • Pipeline execution
  • Automation techniques
• Summary
Slide 18: Moore's Law Restated: Many Little Won over Few Big

• Hardware trend: a few generic parts: CPU, RAM, disk & tape arrays.
  • ATM for LAN/WAN; ?? for CAN; ?? for OS.
• These parts will be inexpensive (commodity components).
• Systems will be arrays of these parts.
• Software challenge: how to program arrays.

[Figure: price classes 1 M$, 100 K$, 10 K$ (Mainframe, Mini, Micro, Nano) and disk form factors 9", 5.25", 3.5", 2.5", 1.8"]
Slide 19: Future SuperServer

• An array of processors, disks, tapes, and comm lines.
• Challenge: how to program it; parallelism is a must:
  • pipeline to hide latency,
  • partition for bandwidth and scaleup.

[Figure: 100 nodes (1 Tips), 1,000 discs (10 Terrorbytes), and 100 tape transports (1,000 tapes = 1 PetaByte) on a high-speed network (10 Gb/s)]
Slide 21: The Hardware Is in Place ... and Then a Miracle Occurs

• SNAP: Scaleable Networks And Platforms.
• A commodity distributed OS built on commodity platforms and a commodity network interconnect.
Slide 22: Why Parallel Access to Data?

• At 10 MB/s, scanning a 1-terabyte table takes about 1.2 days.
• With 1,000-way parallelism, the same scan takes about 100 seconds.
• Parallelism: divide a big problem into many smaller ones to be solved in parallel.
• What parallelism buys here is BANDWIDTH.
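The arithmetic behind these numbers, assuming 1 TB = 10^12 bytes and 10 MB/s = 10^7 bytes/s:

```latex
\[
\frac{10^{12}\ \text{bytes}}{10^{7}\ \text{bytes/s}} = 10^{5}\ \text{s} \approx 28\ \text{hours} \approx 1.2\ \text{days},
\qquad
\frac{10^{5}\ \text{s}}{1{,}000} = 100\ \text{s} \approx 1.7\ \text{minutes}.
\]
```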
Slide 23: DataFlow Programming: Prefetch & Postwrite Hide Latency

• We can't afford to wait for each piece of data to arrive.
• We need a memory pipeline that gets the data in advance (~100 MB/s).
• Solution:
  • pipeline from the source (tape, disc, RAM, ...) to the CPU cache,
  • pipeline results to the destination.
• The problem being attacked here is LATENCY.
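A minimal sketch of the prefetch idea: a bounded queue plus a background reader thread, so I/O overlaps computation. The function names (read_block, process), the queue depth, and the end-of-stream convention are illustrative assumptions, not anything from the deck:

```python
import queue, threading

def prefetching_scan(read_block, process, n_blocks, depth=8):
    """Overlap I/O with computation: a reader thread keeps up to `depth`
    blocks in flight while the consumer processes earlier ones."""
    buf = queue.Queue(maxsize=depth)          # bounded queue = flow control

    def reader():
        for i in range(n_blocks):
            buf.put(read_block(i))            # blocks when the pipeline is full
        buf.put(None)                         # end-of-stream marker

    threading.Thread(target=reader, daemon=True).start()
    while (block := buf.get()) is not None:
        process(block)                        # CPU work overlaps the next reads
```

The same idea applies on the output side ("postwrite"): buffer results and let a writer drain them while the producer keeps going.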
Slide 24: The New Law of Computing

• Grosch's Law (old): 2x the money buys 4x the performance; performance grows as the square of the price (1 MIPS for $1, 1,000 MIPS for $32, i.e. $0.03/MIPS).
• Parallel Law (new): 2x the money buys 2x the performance; performance grows linearly with price (1 MIPS for $1, 1,000 MIPS for $1,000).
• The parallel law needs linear speedup and linear scaleup, which are not always possible.
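The two laws restated as proportionalities (a restatement of the slide's "2x $" claims):

```latex
\[
\text{Grosch's Law:}\quad \text{Performance} \propto \text{Price}^{2}
\;\Longrightarrow\; 2\times\text{Price} \to 4\times\text{Performance}
\]
\[
\text{Parallel Law:}\quad \text{Performance} \propto \text{Price}
\;\Longrightarrow\; 2\times\text{Price} \to 2\times\text{Performance}
\]
```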
Slide 25: Parallelism: Performance Is the Goal

• The goal is to get 'good' performance.
• Law 1: a parallel system should be faster than the serial system.
• Law 2: a parallel system should give near-linear speedup, near-linear scaleup, or both.
• Parallelism is faster, not cheaper: it trades money for time.
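For reference, the standard definitions of these two terms (assumed, not spelled out, on the slide):

```latex
\[
\text{Speedup} \;=\; \frac{\text{elapsed time on a small system}}{\text{elapsed time on a big system (same job)}},
\qquad
\text{Scaleup} \;=\; \frac{\text{elapsed time for 1 job on 1 unit of hardware}}{\text{elapsed time for }N\text{ jobs on }N\text{ units}}.
\]
```

Linear speedup means an N-times bigger system runs the same job N times faster; linear scaleup means the scaleup ratio stays at 1 as N grows.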
Slide 27: The Perils of Parallelism

• Startup: creating processes, opening files, optimization.
• Interference: device contention (CPU, disc, bus) and logical contention (locks, hotspots, server, log, ...).
• Skew: if tasks get very small, variance exceeds the service time.

[Figure: speedup vs. processors & discs: the good (near-linear) curve; a curve flattened by the three perils of startup, interference, and skew; and a bad curve with no parallelism benefit, with linearity shown for reference]
Slide 29: Kinds of Parallel Execution

• Pipeline: any sequential program feeds its output stream directly into the next sequential program.
• Partition: many copies of the same sequential program each work on one piece of the data; inputs merge M ways and outputs split N ways.
Slide 30: Data Rivers: Split + Merge Streams

• N producers add records to the river; M consumers take records from it (N x M data streams).
• Each producer and consumer is purely sequential programming.
• The river does flow control and buffering, and does the partition and merge of data records.
• A river is the Exchange operator of Volcano. (A minimal sketch follows.)
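A minimal sketch of a river, assuming hash partitioning on a record key and bounded queues for flow control; the class name, queue depth, and end-of-stream convention are illustrative, not the deck's design:

```python
import queue, threading

class River:
    """N producers -> M consumers. The river partitions records by key and
    buffers them, so producers and consumers stay purely sequential."""
    def __init__(self, n_producers, m_consumers, depth=128):
        self.queues = [queue.Queue(maxsize=depth) for _ in range(m_consumers)]
        self.live_producers = n_producers
        self.lock = threading.Lock()

    def put(self, key, record):                # called by any producer
        self.queues[hash(key) % len(self.queues)].put(record)

    def close(self):                           # each producer calls once when done
        with self.lock:
            self.live_producers -= 1
            if self.live_producers == 0:
                for q in self.queues:
                    q.put(None)                # end-of-stream for every consumer

    def records(self, consumer_id):            # iterated by one consumer
        q = self.queues[consumer_id]
        while (rec := q.get()) is not None:
            yield rec
```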
Slide 31: Partitioned Data and Execution

• A table is partitioned (here by key range: A...E, F...J, K...N, O...S, T...Z).
• One Count operator runs on each partition; a final Count combines the partial results.
• This spreads computation and I/O among the processors.
• Partitioned data gives NATURAL execution parallelism.
Slide 32: Partitioned + Merge + Pipeline Execution

• Each partition (A...E through T...Z) runs its own Sort, which pipelines into a Join; the Join outputs feed a single Merge node.
• Pure dataflow programming gives linear speedup & scaleup.
• But the top (merge) node may become a bottleneck, so...
Slide 33: N x M Way Parallelism

• The same plan, but with several Merge nodes: N inputs, M outputs, and no bottleneck node.
Slide 34: Why Are Relational Operators Successful for Parallelism?

• The relational data model gives uniform operators on uniform data streams, closed under composition.
• Each operator consumes 1 or 2 input streams; each stream is a uniform collection of data; sequential data in and out: pure dataflow.
• Partitioning some operators (e.g., aggregates, non-equi-joins, sort, ...) requires innovation.
• The payoff: AUTOMATIC PARALLELISM.
Slide 35: SQL: a Non-Procedural Programming Language

• SQL is a functional programming language: it describes the answer set, not how to compute it.
• The optimizer picks the best execution plan:
  • the dataflow web (pipelines),
  • the degree of parallelism (partitioning),
  • other execution parameters (process placement, memory, ...).

[Figure: GUI, Schema, Optimizer, Execution Planning, Plan, Monitor, Rivers, Executors]
Slide 36: Database Systems "Hide" Parallelism

• Automate system management via tools:
  • data placement
  • data organization (indexing)
  • periodic tasks (dump / recover / reorganize)
• Automatic fault tolerance:
  • duplexing & failover
  • transactions
• Automatic parallelism:
  • among transactions (locking)
  • within a transaction (parallel execution)
Slide 37: Success Stories

• Online transaction processing (many little jobs):
  • SQL systems support 3,700 tps-A (24 CPUs, 240 disks)
  • SQL systems support 21,000 tpm-C (110 CPUs, 800 disks)
• Batch (decision support and utility): few big jobs, parallelism inside:
  • scan data at 100 MB/s
  • linear scaleup to 50 processors

[Figure: transactions/sec and records/sec both grow linearly with hardware]
Slide 38: Kinds of Partitioned Data

• Split a SQL table across a subset of the nodes & disks.
• Partition within that set by:
  • Range: good for equijoins, range queries, group-by
  • Hash: good for equijoins
  • Round robin: good for spreading load
• Shared-disk and shared-memory systems are less sensitive to partitioning; shared-nothing systems benefit from "good" partitioning. (A sketch of the three schemes follows.)
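A minimal sketch of the three partitioning rules; the five alphabetic ranges echo the figures in this deck, while the function names and key handling are illustrative:

```python
RANGES = ["E", "J", "N", "S", "Z"]            # A...E, F...J, K...N, O...S, T...Z

def range_partition(key: str) -> int:
    """Send a key to the first range whose upper bound covers it."""
    for i, upper in enumerate(RANGES):
        if key[:1].upper() <= upper:
            return i
    return len(RANGES) - 1

def hash_partition(key: str, n: int = 5) -> int:
    """Hash spreads keys uniformly: good for equijoins, useless for ranges."""
    return hash(key) % n

def round_robin_partition(row_number: int, n: int = 5) -> int:
    """Round robin balances load but gives no help to any particular query."""
    return row_number % n
```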
Slide 41: Picking Data Ranges

• Disk partitioning:
  • For range partitioning, sample the load on the disks; cool hot disks by making their ranges smaller.
  • For hash partitioning, cool hot disks by remapping some buckets to other disks.
• River partitioning:
  • Use hashing and assume the data is uniform.
  • If range partitioning, sample the data and use a histogram to level the bulk (see the sketch below).
• Teradata, Tandem, and Oracle use these tricks.
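A minimal sketch of the sample-and-histogram idea for range partitioning: pick split keys from a sorted sample so each range holds about the same number of rows (an equi-depth histogram). The function name and the example data are made up for illustration:

```python
def pick_range_boundaries(sample, n_partitions):
    """Return n_partitions - 1 split keys chosen so each range holds
    roughly the same share of the sampled keys (equi-depth histogram)."""
    keys = sorted(sample)
    step = len(keys) / n_partitions
    return [keys[int(i * step)] for i in range(1, n_partitions)]

# Example: a skewed sample still yields balanced ranges.
sample = ["Adams", "Adams", "Adams", "Baker", "Chen", "Chen", "Diaz",
          "Evans", "Moore", "Singh", "Smith", "Smith", "Wong", "Young"]
print(pick_range_boundaries(sample, 4))       # ['Baker', 'Evans', 'Smith']
```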
Slide 42: Parallel Data Scan

  select image
  from   landsat
  where  date between 1970 and 1990
    and  overlaps(location, :Rockies)
    and  snow_cover(image) > .7;

• The landsat table has temporal (date), spatial (location), and image columns; e.g., dates 1/2/72 ... 4/8/95, locations 33N 120W ... 34N 120W.
• Assign one process per processor/disk:
  • find the images with the right date & location,
  • analyze each image and return it if it is 70% snow.
• Each process applies the date, location, & image tests to its own partition and contributes its rows to the answer.
Slide 44: Parallel Aggregates

• Each aggregate function needs a decomposition strategy:
  • count(S) = sum over partitions i of count(s(i)); ditto for sum()
  • avg(S) = (sum over partitions of sum(s(i))) / (sum over partitions of count(s(i)))
  • and so on...
• For groups, sub-aggregate the groups close to the source, then drop the sub-aggregates into a hash river. (A sketch follows the figure.)

[Figure: as on slide 31, one Count per partition (A...E through T...Z) feeding a final Count]
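A minimal sketch of the two-phase strategy for a grouped average: compute per-partition (sum, count) pairs close to the source, route them by group (as the hash river would), and combine. The helper names and example data are illustrative:

```python
from collections import defaultdict

def local_subaggregate(rows):
    """Phase 1 (runs next to each partition): per-group (sum, count)."""
    partial = defaultdict(lambda: [0.0, 0])
    for group, value in rows:
        partial[group][0] += value
        partial[group][1] += 1
    return partial

def combine_averages(partials):
    """Phase 2 (after the hash river routes each group to one node):
    avg = (sum of sums) / (sum of counts), never an average of averages."""
    total = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for group, (s, c) in partial.items():
            total[group][0] += s
            total[group][1] += c
    return {group: s / c for group, (s, c) in total.items()}

# Example: two partitions of (state, temperature) rows.
p1 = local_subaggregate([("CA", 70), ("CA", 80), ("TX", 90)])
p2 = local_subaggregate([("CA", 60), ("TX", 100), ("TX", 110)])
print(combine_averages([p1, p2]))             # {'CA': 70.0, 'TX': 100.0}
```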
Slide 46: Parallel Sort

• M-input, N-output sort design:
  • a scan (or other source) feeds a range- or hash-partitioned river,
  • each output stream's sub-sort generates runs, then merges its runs,
  • the disk runs and the merge are not needed if the sort fits in memory.
• Scales nearly linearly: going from 10^6 to 10^12 records only doubles the per-record comparison work, since log(10^12) / log(10^6) = 12/6 = 2; with a million times the hardware, the sort is only about 2x slower.
• Sort is the benchmark from hell for shared-nothing machines: net traffic equals disk bandwidth, and there is no data filtering at the source. (A sketch of the design follows.)
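A minimal sketch of the range-partitioned sort, collapsed into one process for clarity: in a real system each partition would be sorted on its own node (with runs spilled to disk when it doesn't fit in memory), and the boundaries would come from sampling as on slide 41:

```python
def parallel_sort(records, boundaries):
    """Range-partition the input, sort each partition independently, then
    concatenate: the partitions are disjoint key ranges, so the
    concatenation is globally sorted with no final merge step."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for rec in records:                        # the "river": route by range
        i = sum(rec > b for b in boundaries)   # index of the range that fits
        partitions[i].append(rec)
    for part in partitions:                    # each node sorts its own range
        part.sort()
    return [rec for part in partitions for rec in part]

print(parallel_sort([9, 3, 7, 1, 8, 5, 2], boundaries=[3, 6]))  # [1, 2, 3, 5, 7, 8, 9]
```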
Slide 47: Blocking Operators = Short Pipelines

• An operator is blocking if it produces no output until it has consumed all of its input.
• Examples: sort, aggregates, hash join (which reads all of one operand before producing output).
• Blocking operators kill pipeline parallelism, which makes partition parallelism all the more important.

[Figure: a database-load template with three blocked phases: scan the tape file and sort runs; merge the runs and insert into the SQL table; then build indexes 1, 2, and 3]
Slide 50: Hash Join

• Hash the smaller table into N buckets (hope N = 1).
• If N = 1, read the larger table and probe the in-memory hash of the smaller one; else, hash the outer table to disk, then do a bucket-by-bucket hash join.
• Purely sequential data behavior; hashing also reduces skew.
• Always beats sort-merge and nested-loops joins unless the data is already clustered.
• Good for equi-, outer-, and exclusion joins.
• Lots of papers; products are just appearing (what went wrong?).
• (A sketch of the in-memory case follows the figure.)

[Figure: the left table hashed into buckets, probed by the right table]
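A minimal sketch of the in-memory case (N = 1): build a hash table on the smaller input, then stream the larger input past it. The spill-to-disk, bucket-by-bucket variant is not shown; names and example data are illustrative:

```python
from collections import defaultdict

def hash_join(small, large, small_key, large_key):
    """Equijoin: build on the smaller input, probe with the larger one.
    The probe side streams through sequentially, which is what makes
    hash join such a good fit for dataflow execution."""
    buckets = defaultdict(list)                # build phase
    for row in small:
        buckets[row[small_key]].append(row)
    for row in large:                          # probe phase
        for match in buckets.get(row[large_key], ()):
            yield {**match, **row}

# Example: join departments (small) with employees (large) on dept id.
depts = [{"dept": 1, "dname": "Sales"}, {"dept": 2, "dname": "Eng"}]
emps  = [{"emp": "Ann", "dept": 2}, {"emp": "Bob", "dept": 1}, {"emp": "Cal", "dept": 2}]
for row in hash_join(depts, emps, "dept", "dept"):
    print(row)
```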
Slide 51: Observation: Execution Is "Easy", Automation Is "Hard"

• It is "easy" to build a fast parallel execution environment (no one has done it, but it is just programming).
• It is hard to write a robust, world-class query optimizer: there are many tricks, and one quickly hits the complexity barrier.
• Common approach:
  • pick the best sequential plan,
  • pick the degree of parallelism based on bottleneck analysis,
  • bind operators to processes,
  • place processes at nodes,
  • place scratch files near the processes,
  • use memory as a constraint.
Slide 52: Systems That Work This Way

• Shared nothing: Teradata (400 nodes), Tandem (110 nodes), IBM SP2 / DB2 (48 nodes), AT&T & Sybase (112 nodes), Informix/SP2 (48 nodes)
• Shared disk: Oracle (170 nodes), Rdb (24 nodes)
• Shared memory: Informix (9 nodes), RedBrick (? nodes)

[Figure: clients connected to shared-nothing, shared-disk, and shared-memory configurations of processors, memory, and disks]
Slide 53: Research Problems

• Automatic data placement (partitioning: random or organized)
• Automatic parallel programming (process placement)
• Parallel concepts, algorithms & tools
• Parallel query optimization
• Execution techniques: load balancing, checkpoint/restart, pacing, ...

[Figure: the slide-19 superserver again: 100 nodes (1 Tips), 1,000 discs (10 Terrorbytes), 1,000 tapes (1 PetaByte), high-speed network (10 Gb/s)]
Slide 54: Summary

• Cyberspace is growing:
  • databases are the dirt of Cyberspace, PCs are the bricks, networks are the mortar
  • many little devices: performance via arrays of {CPU, disk, tape}
• Then a miracle occurs: a scaleable distributed OS and network
  • SNAP: Scaleable Networks and Platforms
• Then parallel database systems give software parallelism:
  • OLTP: lots of little jobs run in parallel
  • Batch TP: dataflow & data partitioning
  • automate processor & storage array administration
  • automate processor & storage array programming
• 2000 platforms as easy as 1 platform.