High Performance Communication for Oracle using InfiniBand
Ross Schibler, CTO
Topspin Communications, Inc.
Session id: #36568
Peter Ogilvie, Principal Member of Technical Staff
Oracle Corporation
Session Topics
– Why the Interest in InfiniBand Clusters
– InfiniBand Technical Primer
– Performance
– Oracle 10g InfiniBand Support
– Implementation Details
Why the Interest in InfiniBand
– InfiniBand is a key new feature in Oracle 10g: enhances price/performance and scalability; simplifies systems
– InfiniBand fits the broad movement toward lower costs: horizontal scalability, converged networks, system virtualization...grid
– Initial database performance and scalability data is superb: network tests done; application-level benchmarks now in progress
– InfiniBand is a widely supported standard, available today: Oracle, Dell, HP, IBM, Network Appliance, Sun and ~100 others involved
– Tight alliance between Oracle and Topspin enables IB for 10g: integrated and tested; delivers the complete Oracle “wish list” for high speed interconnects
Server Revenue Mix
[Chart: share of server revenues by price band, from $0-2.9K through $3M+, for 1996, 2001, and 2002; the entry, mid-range, and high-end segments are called out at 23%, 39%, and 43%]
Source: IDC Server Tracker, 12/2002
System Transition Presents Opportunity
– Major shift to standard systems – blade impact not even factored in yet
– Customers benefit from scaling horizontally across standard systems: lower up-front costs, granular scalability, high availability
The Near Future
[Chart: projected server revenue mix by price band – the market splits around scale-up vs. scale-out, with legacy and big iron apps scaling up and database clusters and grids scaling out]
Market splits around Scale-Up vs. Scale-Out
– Database grids provide the foundation for scale out
– InfiniBand switched computing interconnects are a critical enabler
[Diagram: enterprise apps and web services run on application servers in front of Oracle RAC and shared storage]
Traditional RAC Cluster
[Diagram: application servers connect to Oracle RAC over Gigabit Ethernet; the RAC nodes connect to shared storage over Fibre Channel]

Three Pain Points
– Scalability within the database tier is limited by interconnect latency, bandwidth, and overhead
– Throughput between the application tier and database tier is limited by interconnect bandwidth and overhead
– I/O requirements are driven by the number of servers instead of application performance requirements
Clustering with Topspin InfiniBand
[Diagram: application servers, Oracle RAC, and shared storage all connected through the InfiniBand fabric]
Removes all three bottlenecks:
– InfiniBand provides a 10 Gigabit, low latency interconnect for the cluster
– The application tier can run over InfiniBand, benefiting from the same high throughput and low latency as the cluster
– Central server-to-storage I/O scalability through the InfiniBand switch removes I/O bottlenecks to storage and provides smoother scalability
Example Cluster with Converged I/O
– Ethernet to InfiniBand gateway for LAN access: four Gigabit Ethernet ports per gateway; creates a virtual Ethernet pipe to each server
– Fibre Channel to InfiniBand gateway for storage access: two 2Gbps Fibre Channel ports per gateway; creates a 10Gbps virtual storage pipe to each server
– InfiniBand switches for the cluster interconnect: twelve 10Gbps InfiniBand ports per switch card, up to 72 ports total with optional modules; a single fat pipe to each server for all network traffic
[Diagram: industry standard servers connect through the InfiniBand fabric to industry standard networks and industry standard storage]
Topspin InfiniBand Cluster Solution
Cluster interconnect with gateways for I/O virtualization:
– Ethernet or Fibre Channel gateway modules
– Integrated system and subnet management
– Family of switches
– Host Channel Adapter with upper layer protocols
Protocols: uDAPL, SDP, SRP, IPoIB
Platform support:
– Linux: Red Hat, Red Hat AS, SuSE
– Solaris: S10
– Windows: Win2k & 2003
– Processors: Xeon, Itanium, Opteron
InfiniBand Primer
– InfiniBand is a new technology used to interconnect servers, storage and networks within the datacenter
– Runs over copper cables (<17m) or fiber optics (<10km)
– Scalable interconnect: 1X = 2.5Gb/s, 4X = 10Gb/s, 12X = 30Gb/s
InfiniBand Nomenclature
[Diagram: servers with HCAs connect over IB links to a Topspin 360/90 switch running the subnet manager (SM); TCAs on the switch bridge to the Ethernet network over an Ethernet link and to the storage network over an FC link; within each server, the CPUs, memory controller, and system memory sit on the host interconnect, with the HCA attached to it]
InfiniBand Nomenclature
[Same diagram as above, with the components labeled]
– HCA – Host Channel Adaptor
– SM – Subnet Manager
– TCA – Target Channel Adaptor
Kernel Bypass Model
[Diagram: in the traditional path, the application calls the sockets layer, then the TCP/IP transport and driver in the kernel, before reaching the NIC; with kernel bypass, the application uses uDAPL or async sockets over SDP and reaches the hardware directly from user space]

Copy on Receive
[Diagram: with a conventional NIC, received data is DMAed into an OS buffer in system memory and then copied into the application buffer]
Data traverses the bus 3 times (one DMA into the OS buffer, then a read and a write to copy it into the application buffer)
With RDMA and OS Bypass
[Diagram: the HCA DMAs incoming data directly into the application buffer in system memory, bypassing the OS buffer]
Data traverses the bus once, saving CPU and memory cycles
APIs and Performance
[Chart: an application can use BSD Sockets or the Async I/O extension over TCP/IP on 1GE, or run over 10G IB via IPoIB, SDP, or uDAPL (RDMA); throughputs shown range from 0.8Gb/s for sockets over 1GE, through 1.2Gb/s and 3.2Gb/s, up to 6.4Gb/s for the InfiniBand RDMA paths]
Why SDP for OracleNet & uDAPL for RAC?
– RAC IPC: message based, latency sensitive, a mixture of previous APIs → use of uDAPL
– OracleNet: streams based, bandwidth intensive, previously written to sockets → use of the Sockets Direct Protocol API (a minimal SDP sketch follows below)
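Because OracleNet was already written to sockets, moving it onto SDP mostly means opening the socket differently. A minimal sketch, assuming the historical OFED-style SDP support that exposed SDP through its own socket address family; the AF_INET_SDP value of 27, the peer address 192.168.0.10, and the port 1521 are illustrative assumptions, not details from this session.

/* sdp_client.c – minimal sketch of a sockets client running over SDP */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

#ifndef AF_INET_SDP
#define AF_INET_SDP 27            /* assumption: OFED SDP address family */
#endif

int main(void)
{
    /* Identical to a TCP client except for the address family. */
    int fd = socket(AF_INET_SDP, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in peer;
    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;     /* the sockaddr keeps the IPv4 layout */
    peer.sin_port   = htons(1521); /* hypothetical listener port */
    inet_pton(AF_INET, "192.168.0.10", &peer.sin_addr);

    if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
        perror("connect"); close(fd); return 1;
    }
    /* From here on, ordinary send()/recv() calls flow over SDP/RDMA. */
    close(fd);
    return 0;
}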
InfiniBand Cluster Performance Benefits
Network level cluster performance for Oracle RAC
[Chart: 16KB block transfers/sec, from 0 to 30,000, for 2-node and 4-node clusters, InfiniBand vs. GigE]
InfiniBand delivers 2-3X higher block transfers/sec as compared to GigE
Source: Oracle Corporation and Topspin, on dual Xeon processor nodes
InfiniBand Application to Database Performance Benefits
[Chart: relative CPU utilization and throughput (percent), InfiniBand vs. GigE]
InfiniBand delivers 30-40% lower CPU utilization and 100% higher throughput as compared to Gigabit Ethernet
Source: Oracle Corporation and Topspin
Broad Scope of InfiniBand Benefits
[Diagram: the network, application servers, Oracle RAC database, and shared storage (SAN and NAS) connected over InfiniBand through an Ethernet gateway and an FC gateway with host/LUN mapping, plus sniffer servers for monitoring/analysis]
– OracleNet: over SDP over IB
– Intra-RAC: IPC over uDAPL over IB
– DAFS over IB
Callouts in the original diagram: 20% improvement in throughput; 2x improvement in throughput and 45% less CPU; 3-4x improvement in block updates/sec; 30% improvement in DB performance
uDAPL Optimization Timeline
[Timeline diagram layers: IB HW/FW, uDAPL, CM, skgxp, LM, Cache Fusion, Workload]
– Sept 2002: uDAPL functional with 6Gb/s throughput
– Dec 2002: Oracle interconnect performance released, showing improvements in bandwidth (3x), latency (10x) and CPU reduction (3x)
– Jan 2003: added Topspin CM for improved scaling of the number of connections and reduced setup times
– Feb 2003: cache block updates show a fourfold performance improvement in a 4-node RAC
– April-August 2003: gathering OAST and industry standard workload performance metrics; fine tuning and optimization at the skgxp, uDAPL and IB layers
RAC Cluster Communication
– High speed communication is key: it must be faster to fetch a block from a remote cache than to read the block from disk; scalability is a function of communication CPU overhead
– Two primary Oracle consumers: the lock manager / Oracle buffer cache, and inter-instance parallel query communication
– SKGXP is Oracle’s IPC driver interface: Oracle is coded to skgxp, and skgxp is coded to vendor high performance interfaces; IB support is delivered as a shared library, libskgxp10.so (see the plug-in sketch below)
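As a rough illustration of that plug-in model, the sketch below loads a vendor IPC library at runtime with dlopen. Only the library name libskgxp10.so comes from the slide; the entry point skgxp_send_v1 and its signature are hypothetical, invented purely to show the pattern of coding the database to one interface and letting the vendor library implement it.

/* ipc_plugin.c – hypothetical sketch of loading a vendor IPC driver
 * as a shared library (compile with -ldl on Linux). */
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>

typedef int (*ipc_send_fn)(const void *buf, size_t len);

int main(void)
{
    /* The library name comes from the presentation; the symbol does not. */
    void *lib = dlopen("libskgxp10.so", RTLD_NOW);
    if (!lib) { fprintf(stderr, "dlopen: %s\n", dlerror()); return 1; }

    ipc_send_fn ipc_send = (ipc_send_fn)dlsym(lib, "skgxp_send_v1");
    if (!ipc_send) { fprintf(stderr, "dlsym: %s\n", dlerror()); return 1; }

    const char msg[] = "cache fusion block request";  /* illustrative payload */
    ipc_send(msg, sizeof(msg));

    dlclose(lib);
    return 0;
}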
Cache Fusion Communication
[Diagram: a shadow process sends a lock request to the LMS process on the remote instance; the block moves between the two buffer caches via RDMA and the result is returned to the client]
Parallel Query Communication
[Diagram: PX servers on each instance exchange messages and stream data to one another and back to the client]
Cluster Interconnect Wish List
– OS bypass (user mode communication)
– Protocol offload
– Efficient asynchronous communication model
– RDMA with high bandwidth and low latency
– Huge memory registrations for Oracle buffer caches
– Support for a large number of processes in an instance
– Commodity hardware
– Software interfaces based on open standards
– Cross platform availability
InfiniBand is first interconnect to meet all of these requirements
Asynchronous Communication
Benefits
– Reduces the impact of latency
– Improves robustness by avoiding communication deadlock
– Increases bandwidth utilization
Drawback
– Historically costly, as synchronous operations are broken into separate submit and reap operations
Protocol Offload & OS Bypass
Bypass makes submit cheap
– Requests are queued directly to the hardware from Oracle
Offload
– Completions move from the hardware to Oracle’s memory
– Oracle can overlap communication and computation without a trap to the OS or a context switch
(A sketch of this submit/reap pattern follows below.)
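A minimal sketch of the submit/reap pattern described above. It is illustrated with the modern libibverbs API rather than the uDAPL interface used in this work (an assumption made for readability); the queue pair, completion queue, memory region, and the peer's remote address and rkey are assumed to have been exchanged at connection setup, which is omitted.

/* submit_reap.c – submit work directly to the HCA, reap completions by polling */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Submit: queue an RDMA write straight to the HCA from user space;
 * no system call is needed on this fast path. */
static int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                           void *buf, uint32_t len,
                           uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;
    return ibv_post_send(qp, &wr, &bad);
}

/* Reap: the HCA writes completions into a queue in user memory, so
 * checking for them needs no trap or context switch and can be
 * interleaved with computation. */
static int reap_completion(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    int n = ibv_poll_cq(cq, 1, &wc);
    if (n > 0 && wc.status != IBV_WC_SUCCESS)
        return -1;   /* completed with error */
    return n;        /* 0 = nothing finished yet, keep computing */
}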
InfiniBand Benefits by Stress Area
Stress area / benefit:
– Cluster network: extremely low latency; 10 Gig throughput
– Compute: CPU & kernel offload removes TCP overhead; frees CPU cycles
– Server I/O: single converged 10 Gig network for cluster, storage, and LAN; central I/O scalability
Stress level varies over time with each query; InfiniBand provides substantial benefits in all three areas
Benefits for Different Workloads
– High bandwidth and low latency benefits for Decision Support (DSS): should enable serious DSS workloads on RAC clusters
– Low latency benefits for scaling Online Transaction Processing (OLTP)
– Our estimate: one IB link replaces 6-8 Gigabit Ethernet links
Commodity Hardware
– Higher capabilities and lower cost than proprietary interconnects
– InfiniBand’s large bandwidth capability means that a single link can replace multiple GigE and FC interconnects
Memory Requirements
– The Oracle buffer cache can consume 80% of a host’s physical memory
– 64-bit addressing and decreasing memory prices mean ever larger buffer caches
– InfiniBand provides zero-copy RDMA between very large buffer caches
– Large shared registrations move memory registration out of the performance path (see the registration sketch below)
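The sketch below shows the idea of registering a large buffer once, up front, so that no registration work remains on the RDMA data path. It is illustrated with libibverbs (an assumption; the stack discussed here used uDAPL), and the protection domain, the 4 GB size, and the access flags are placeholders rather than Oracle's actual configuration.

/* register_cache.c – one-time registration of a large buffer */
#include <infiniband/verbs.h>
#include <stdlib.h>

#define CACHE_BYTES (4ULL << 30)   /* placeholder: a 4 GB slice of buffer cache */

struct ibv_mr *register_cache(struct ibv_pd *pd, void **cache_out)
{
    void *cache = malloc(CACHE_BYTES);    /* stand-in for the real buffer cache */
    if (!cache)
        return NULL;

    /* Registered once at startup: the pages are pinned and translated here,
     * and every later RDMA to or from this region reuses the same lkey/rkey
     * with no per-I/O registration cost. */
    struct ibv_mr *mr = ibv_reg_mr(pd, cache, CACHE_BYTES,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        free(cache);
        return NULL;
    }
    *cache_out = cache;
    return mr;
}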
Two Efforts Coming Together: RAC/Cache Fusion and Oracle Net
– Two Oracle engineering teams working at the cluster and application tiers; 10g incorporates both efforts
– Oracle Net benefits from many of the same capabilities as Cache Fusion: OS kernel bypass, CPU offload, new transport protocol (SDP) support, an efficient asynchronous communication model, RDMA with high bandwidth and low latency, commodity hardware
– Working on external and internal deployments
Open Standard Software APIs: uDAPL and Async Sockets/SDP
– Each new communication driver is a large investment for Oracle
– One stack which works across multiple platforms means improved robustness
– Oracle grows closer to the interfaces over time
– Ready today for emerging technologies
– Ubiquity and robustness of IP for high speed communication
Summary
– Oracle and major system & storage vendors are supporting InfiniBand
– InfiniBand presents a superb opportunity for enhanced horizontal scalability and lower cost
– Oracle Net’s InfiniBand support significantly improves performance for both the app server and the database in Oracle 10g
– InfiniBand provides the performance to move applications to low cost Linux RAC databases
Q U E S T I O N S  &  A N S W E R S
Next Steps….
– See InfiniBand demos first hand on the show floor: Dell, Intel, NetApp, Sun, Topspin (booth #620); includes clustering, app tier and storage over InfiniBand
– InfiniBand whitepapers on both Oracle and Topspin websites: www.topspin.com, www.oracle.com