Architectural Software Support for Processing Clusters
Johannes Gutleber, Luciano Orsini
European Organization for Nuclear Research, Div. EP/CMD, The CMS Collaboration
CERN, 1211 Geneva 23, Switzerland
The Issue
1988: The biggest problem with creating distributed computing systems is devising a method of intercomputer communication that is reliable, fast and simple.
J.E. Tomayko, NASA CR-182505, p.228, Mar 1988
2000: High-speed networks […] can obtain communication speeds close to those of supercomputers, but realizing this potential is a challenging problem.
H. Bal, ACM Op Sys Rev, p. 79, Oct 2000
The Approach
Do not…
• invest in alternative communication paradigms
• optimise communication libraries
Provide architectural software support
• Lightweight framework for homogeneous communication
• Configure with low-level communication libraries
• Plug-in application components
• Homogeneous subsystem interface design support
Architectural Software Support
• Architecture support comprises
– a processing model
– subsystem addressing
– configuration and control
– Application Programmer Interface requirements
Everything that is needed to build and operate a distributed application
Motivation
• In large-scale data acquisition systems we have to cope with
– Long operational lifetimes (10-15 yrs)
– Modifications due to generation jumps (networking, processing)
– Deployment of one application in various different environments
– Bridging of hardware/software performance gaps
• From the special case we can extrapolate to general cluster-based systems
– Search engines, document retrieval systems
– Plant control systems
– Medical imaging networks in hospitals
Available tools don't match the requirements
Architecture Basis: I2O
• A specification for a hardware- and operating-system-independent device driver framework
• Targeted at collaboration between
– Host and intelligent devices
– Intelligent device intercommunication
[Figure labels: Messaging Layer, PCI bus, UNIX - OSM, Windows - OSM, HDM/IOP, HDM/FPGA]
I2O IOP Environment
• Inbound/outbound queue (pass frame pointers, zero copy)
• Homogeneous frame format
• Event-driven processing
• Uniform hardware access API
[Figure labels: Network, IRQ, inbound and outbound queues, HDM, framework, foo( ), bar( )]
I2O Message Frame
Used to implement an active message model
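Frame layout, one 32-bit word per row (bit ruler on the slide: bytes 3 2 1 0, bits 31-24, 23-16, 15-8, 7-0; fields listed in the order they appear on the slide)
Standard Frame
  MessageSize | MessageFlags | VersionOffset
  TargetAddress | InitiatorAddress | Function (= FFh)
  InitiatorContext
  TransactionContext
Private Frame Extension
  XFunctionCode | OrganizationID
  PrivatePayload = function parameters
  PrivatePayload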
Assigned by application and returned in reply (cookie)
Assigned by message layer. Used for routing back reply
I2O Messaging
• A message frame contains two addresses
– initiatorTid = where the message comes from
– destinationTid = to which DDM/ISM it shall go
• Message is associated with a handler function
– Predefined functions for I2O messages
– Private frame extension for application-specific messages
• Message length limited to 256 KB. Frame should only contain control information
– Message data should go into Scatter-Gather Lists
• I2O frame byte order is little endian
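To make the addressing and byte-order points concrete, here is a minimal C++ sketch that packs the four standard header words into little-endian bytes. The struct name `FrameHeader`, the helpers `putWordLE` and `packHeader`, and the exact bit positions (12-bit Tids in the low bits of the second word, function code in its top byte, message size counted in 32-bit words) are assumptions made for illustration; they are not taken from the slides.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical in-memory view of the standard frame header fields.
struct FrameHeader {
    std::uint8_t  versionOffset;
    std::uint8_t  messageFlags;
    std::uint16_t messageSize;        // assumed unit: 32-bit words
    std::uint16_t targetTid;          // destination Tid, assumed 12 bits wide
    std::uint16_t initiatorTid;       // initiator Tid, assumed 12 bits wide
    std::uint8_t  function;           // FFh marks a private frame
    std::uint32_t initiatorContext;
    std::uint32_t transactionContext;
};

// Store one 32-bit word least-significant byte first (little endian),
// independent of the byte order of the machine running this code.
inline void putWordLE(std::uint8_t* out, std::uint32_t word) {
    for (int i = 0; i < 4; ++i) {
        out[i] = static_cast<std::uint8_t>((word >> (8 * i)) & 0xFFu);
    }
}

// Pack the four header words into 16 bytes.
inline std::array<std::uint8_t, 16> packHeader(const FrameHeader& h) {
    // Both Tids must fit into the assumed 12-bit fields.
    assert(h.targetTid < 0x1000 && h.initiatorTid < 0x1000);

    const std::uint32_t w0 = static_cast<std::uint32_t>(h.versionOffset)
        | (static_cast<std::uint32_t>(h.messageFlags) << 8)
        | (static_cast<std::uint32_t>(h.messageSize) << 16);
    const std::uint32_t w1 = static_cast<std::uint32_t>(h.targetTid)
        | (static_cast<std::uint32_t>(h.initiatorTid) << 12)
        | (static_cast<std::uint32_t>(h.function) << 24);

    std::array<std::uint8_t, 16> buf{};
    putWordLE(&buf[0], w0);
    putWordLE(&buf[4], w1);
    putWordLE(&buf[8], h.initiatorContext);
    putWordLE(&buf[12], h.transactionContext);
    return buf;
}
```

Packing byte by byte rather than copying the struct keeps the wire format little endian on any host, which matters for the heterogeneous nodes (PowerPC, Intel, Sparc) the slides mention.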
Peer and Peer2Peer Operations
• Peer Operation uses the queue pair on one PCI segment
• Peer-to-Peer commands for network communication
[Figure labels: Executive, Messaging Layer, Peer Transport Agent, Peer Transport, DDM (Device Driver Module), non-I2O messages, I2O message frames]
I2O Peer Operation for Clusters
• Application component = device
• Processing node = IOP
• Controller node = host
• Homogeneous communication
– frameSend for local, remote, host
– single addressing scheme (Tid)
• Application framework
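The idea of one send call with one addressing scheme can be sketched as follows. This is an illustrative C++ model only: the class `Executive`, the methods `attachLocal` and `addRoute`, the `Route` enum and the signature of `frameSend` are hypothetical and do not reproduce the XDAQ API. The caller passes a Tid and never needs to know whether the target lives on the same node or behind a peer transport.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <utility>
#include <vector>

using Tid = std::uint16_t;
using Frame = std::vector<std::uint8_t>;

// Where a frame ended up; returned only to make the sketch observable.
enum class Route { Local, Remote, Unknown };

// Hypothetical executive: keeps one table for local targets and one for
// targets that are reached through a peer transport.
class Executive {
public:
    using LocalHandler = std::function<void(const Frame&)>;
    using Transport = std::function<void(Tid, const Frame&)>;

    void attachLocal(Tid tid, LocalHandler handler) {
        local_[tid] = std::move(handler);
    }
    void addRoute(Tid tid, Transport transport) {
        remote_[tid] = std::move(transport);
    }

    // Same call for every destination: look the Tid up locally first,
    // then fall back to the transport registered for that Tid.
    Route frameSend(Tid target, const Frame& frame) const {
        auto l = local_.find(target);
        if (l != local_.end()) {
            l->second(frame);
            return Route::Local;
        }
        auto r = remote_.find(target);
        if (r != remote_.end()) {
            r->second(target, frame);
            return Route::Remote;
        }
        return Route::Unknown;
    }

private:
    std::map<Tid, LocalHandler> local_;
    std::map<Tid, Transport> remote_;
};
```

Because the routing decision sits in the executive, moving an application component to another node only changes the routing tables, not the sender's code.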
[Figure labels: Application, Executive, Messaging Layer, Peer Transport Agent, Peer Transport, I2O Message Frames]
Applications are I2O Classes
In XDAQ they are equivalent to C++ classes
Each class exposes an interface that is implemented by the application
[Figure labels: TargetAddr, ClassId, Instance, Dispatcher, Listener, DDmAdapter, UtilAdapter, UserAdapter, Application]
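As a rough illustration of a class that exposes an interface implemented by the application, the C++ sketch below pairs a listener interface with an adapter that turns a function code into a method call. All names (`CounterListener`, `CounterAdapter`, `CounterApp`, the codes `kIncrement` and `kReset`) are invented for this example; the real XDAQ listener and adapter classes are not shown in the slides.

```cpp
#include <cstdint>

// Hypothetical function codes for one application class.
enum : std::uint16_t { kIncrement = 1, kReset = 2 };

// Interface exposed by the class; the application implements it.
struct CounterListener {
    virtual ~CounterListener() = default;
    virtual void onIncrement(std::uint32_t amount) = 0;
    virtual void onReset() = 0;
};

// Adapter: maps the function code carried in a private frame to the
// matching listener method. Returns false for an unknown code.
class CounterAdapter {
public:
    explicit CounterAdapter(CounterListener& listener) : listener_(listener) {}

    bool dispatch(std::uint16_t xFunctionCode, std::uint32_t parameter) {
        switch (xFunctionCode) {
            case kIncrement:
                listener_.onIncrement(parameter);
                return true;
            case kReset:
                listener_.onReset();
                return true;
            default:
                return false;
        }
    }

private:
    CounterListener& listener_;
};

// The application only implements the interface; it never parses frames.
class CounterApp : public CounterListener {
public:
    void onIncrement(std::uint32_t amount) override { total_ += amount; }
    void onReset() override { total_ = 0; }
    std::uint32_t total() const { return total_; }

private:
    std::uint32_t total_ = 0;
};
```

The split keeps message decoding in the adapter, so the application code stays an ordinary C++ class.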
Peer Transport Configurations
Polling Peer Transport Agent
+ low OS service overhead
- executive uses CPU continuously
- no blocking PTs
Thread per Peer Transport
- higher OS service overhead
+ no CPU monopolisation
+ allows integration with other software
[Figure labels, shown once per configuration: PTA, TCP, Myri, DLPI, FIFO]
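The polling configuration can be sketched in C++ as a loop that asks every peer transport for a frame without ever blocking. The type `PeerTransport`, its `poll` method and the function `pollOnce` are hypothetical stand-ins, and frames are reduced to plain integers to keep the example short.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Hypothetical peer transport with a queue of already received frames.
struct PeerTransport {
    std::string name;
    std::deque<int> pending;   // stand-in for received frames

    // Non-blocking receive: returns false at once if nothing is pending.
    bool poll(int& frame) {
        if (pending.empty()) {
            return false;
        }
        frame = pending.front();
        pending.pop_front();
        return true;
    }
};

// One pass of a polling agent: take at most one frame from each
// transport and hand it on. Returns the number of frames delivered.
std::size_t pollOnce(std::vector<PeerTransport>& transports,
                     std::vector<int>& delivered) {
    std::size_t count = 0;
    for (PeerTransport& pt : transports) {
        int frame = 0;
        if (pt.poll(frame)) {
            delivered.push_back(frame);
            ++count;
        }
    }
    return count;
}
```

An executive that calls `pollOnce` in a tight loop avoids system calls but keeps the CPU busy, and a transport whose receive call blocks would stall every other transport. In the thread-per-transport configuration each transport would instead run its own blocking receive loop in a separate thread.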
I2O for Cluster Configuration
Executive tasks
• RUIO (IOP480): VxWorks
• PPC (MVME2306): VxWorks, Linux
• Workstation: Intel Linux, Sparc Solaris
Boot
• Executives on each node in the cluster wait for I2O configuration messages
• Configuration and control can be done through
– Native I2O messages
– XML/HTTP mapping
Parameter set/get is also done through I2O/XML
I2O Configuration Commands
• Where (e.g. IOP 34) ExecSysTabSet
• How (e.g. TCP, DLPI, Myrinet)
• Who (e.g. RU1 – Tid 10, RU2 – Tid 20) ExecDeviceAssign
[Figure labels: Detector Frontend, Level 1 Trigger, Readout Systems, Event Manager, Builder Networks, Filter Systems, Computing Services, Run Control]
Ready
• What ExecSwDownload (e.g. libRU.so, libEVM.so)
[Figure labels: LocalApp2, LocalApp3, RemoteApp1, RemoteApp2, RemoteApp3]
Operational
[Figure labels: App1, App2, App3, frameSend (...), DdmSystemChange]
Efficiency Evolution
• Roundtrip test, reporting half-roundtrip time
• Calculate difference to the bare-bones use of Myrinet GM library
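The measurement procedure in the two bullets above can be sketched as follows. This is a hedged C++ illustration, not the benchmark used for the slides: `halfRoundTripMicros` and `overheadMicros` are hypothetical helpers, and the `roundTrip` callable stands for whatever sends one message and waits for its reply.

```cpp
#include <chrono>
#include <functional>
#include <ratio>

// Run the round trip `iterations` times and report the average
// half-round-trip time in microseconds.
double halfRoundTripMicros(const std::function<void()>& roundTrip,
                           int iterations) {
    using Clock = std::chrono::steady_clock;
    const Clock::time_point start = Clock::now();
    for (int i = 0; i < iterations; ++i) {
        roundTrip();
    }
    const std::chrono::duration<double, std::micro> elapsed =
        Clock::now() - start;
    // One round trip covers two one-way transfers, hence the factor 2.
    return elapsed.count() / (2.0 * iterations);
}

// Framework overhead: half-round-trip time measured through the
// framework minus the time measured with the bare transport library.
double overheadMicros(double frameworkHalfRtt, double bareHalfRtt) {
    return frameworkHalfRtt - bareHalfRtt;
}
```

Averaging over many iterations smooths out timer resolution, and subtracting the bare-library figure isolates the cost added by the framework from the cost of the network itself.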
[Figure: efficiency over time, June to November; y axis in µsecs (ticks 1 to 10), x axis time. Series: original efficiency (paper), 450 MHz, PCI 32/33; on-demand buffer-pool allocation, 450 MHz, PCI 32/33; 750 MHz, PCI 32/33]
Point To Point Efficiency
[Figure: GM/XDAQ Latencies. x axis: bytes transferred (0 to 4096); y axis: microseconds (0 to 120). Series: XDAQ, GM 1.2.3. Fit line: y = -0.0000x + 2.1289]
CMS Data Acquisition System
• Custom readout
• 100 kHz input @ 2 KB per node
• O(500) real-time systems
• O(500) builder units
• O(2000) physics analysis nodes
• Gigabit Ethernet, Myrinet, Infiniband
• Prototype cluster 2000: 32 x 32 PCs, 2.5 Gbps Myrinet 2000, Gigabit Ethernet
[Figure labels: SOAP, XML, Java, I2O]
Summary
• Lightweight middleware is feasible
– 2.1 µs per remote function invocation (50 000 calls/s on GM)
– Abstraction from hardware
– Ease of adaptability and extensibility
• Need architectural support
– to efficiently integrate layers
– to be able to keep pace with technology evolution w/o a need for change
– to construct homogeneous applications for heterogeneous processing clusters
[Figure labels: Processing, Sensor readout, XDAQ, Util/DDM, HTTP, TCP, Ethernet, Myrinet, PCI, OS and Device Drivers]