Fast Communication

Firefly RPC
Lightweight RPC
CS 614
Tuesday, March 13, 2001
Jeff Hoy

Why Remote Procedure Call?
  Simplify building distributed systems and applications
  Looks like local procedure call
  Transparent to user
  Balance between semantics and efficiency
  Universal programming tool
  Secure inter-process communication

RPC Model
  Client side: Client Application, Client Stub, Client Runtime
  Server side: Server Application, Server Stub, Server Runtime
  Call and Return travel between the two runtimes over the Network

RPC In Modern Computing
  CORBA and Internet Inter-ORB Protocol (IIOP)
    Each CORBA server object exposes a set of methods
  DCOM and Object RPC
    Built on top of RPC
  Java and Java Remote Method Protocol (JRMP)
    Interface exposes a set of methods
  XML-RPC, SOAP
    RPC over HTTP and XML

Goals
  Firefly RPC
    Inter-machine communication
    Maintain security and functionality
    Speed
  Lightweight RPC
    Intra-machine communication
    Maintain security and functionality
    Speed

Firefly RPC
  Hardware
    DEC Firefly multiprocessor
    1 to 5 MicroVAX CPUs per node
      Concurrency considerations
    10 megabit Ethernet
  Takes advantage of 5 CPUs

Fast Path in an RPC
  Transport mechanisms
    IP / UDP
    DECNet byte stream
    Shared memory (intra-machine only)
  Mechanism is determined at bind time
  Implemented inside the transport procedures: Starter, Transporter, and Ender for the caller, and Receiver for the server

Caller Stub
  Gets control from the calling program
  Calls Starter for a packet buffer
  Copies arguments into the buffer
  Calls Transporter and waits for the reply
  Copies result data into the caller's result variables
  Calls Ender and frees the result packet
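The caller stub steps above can be sketched as follows. This is only an illustration under stated assumptions: `LoopbackTransport`, `add_caller_stub`, and `add_server` are hypothetical names, the "network" is an in-process function call, and the remote procedure `Add(a, b)` is invented for the example. Only the Starter, Transporter, and Ender roles come from the slides.

```python
import struct


class LoopbackTransport:
    """Toy stand-in for a transport (assumption: no real network involved)."""

    def __init__(self, handler):
        self.handler = handler   # function(bytes) -> bytes, plays the server
        self.outstanding = 0     # packet buffers currently held by stubs

    def starter(self):
        # Starter: hand the stub an empty call packet buffer.
        self.outstanding += 1
        return bytearray()

    def transporter(self, packet):
        # Transporter: send the call packet and wait for the result packet.
        return self.handler(bytes(packet))

    def ender(self, result_packet):
        # Ender: release the buffer once the results have been copied out.
        self.outstanding -= 1


def add_server(packet):
    # Hypothetical server side: unmarshal two ints, return their sum.
    a, b = struct.unpack("!ii", packet)
    return struct.pack("!i", a + b)


def add_caller_stub(transport, a, b):
    """Caller stub for the hypothetical Add(a, b) procedure."""
    packet = transport.starter()                    # get a packet buffer
    packet += struct.pack("!ii", a, b)              # copy arguments into it
    result_packet = transport.transporter(packet)   # send and wait for reply
    (result,) = struct.unpack("!i", result_packet)  # copy result to the caller
    transport.ender(result_packet)                  # free the result packet
    return result
```

Calling `add_caller_stub(LoopbackTransport(add_server), 2, 3)` walks through all five stub steps in order. The stub does its own marshalling inline, which mirrors the design choice of keeping argument handling in the stub rather than in library routines.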

Server Stub
  Receives the incoming packet
  Copies arguments onto the stack or into a new data block, or leaves them in the packet
  Calls the server procedure
  Copies the result into the call packet and transmits it

Transport Mechanism
  Transporter procedure
    Completes the RPC header
    Calls Sender to complete the UDP, IP, and Ethernet headers (Ethernet is the chosen means of communication)
    Invokes the Ethernet driver via a kernel trap and queues the packet

Transport Mechanism
  Receiver procedure
    Server thread awakens in Receiver
    Receiver calls the interface stub identified in the received packet, and the interface stub calls the procedure stub
    Reply is similar

Threading
  Client application creates an RPC thread
  Server application creates a call thread
  Threads operate in the server application's address space
  No need to spawn an entire process
  Threads need to consider locking resources

Threading

Performance Enhancements
  Over traditional RPC
    Stubs marshal arguments rather than library functions handling arguments
    RPC procedures called through procedure variables rather than by lookup table
    Server retains call packet for results
    Buffers reside in shared memory
  Sacrifices abstract structure

Performance Analysis
  Null() procedure
    No arguments or return value
    Measures base latency of the RPC mechanism
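A minimal sketch of the measurement idea, assuming a plain local function in place of a real RPC: time many calls to a procedure that does nothing and divide by the call count. The names `null` and `mean_latency_seconds` are hypothetical, and the numbers this produces on any modern machine say nothing about Firefly.

```python
import time


def null():
    """Stand-in for Null(): no arguments, no result, no work."""
    return None


def mean_latency_seconds(call, n=10_000):
    """Average wall-clock time per call over n back-to-back calls."""
    start = time.perf_counter()
    for _ in range(n):
        call()
    return (time.perf_counter() - start) / n
```

Because the procedure body is empty, whatever time is measured belongs to the calling mechanism itself, which is why Null() isolates base latency.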

Multi-threaded caller and server

Time for 10,000 RPCs
  Base latency: 2.66 ms
  MaxResult latency (1500 bytes): 6.35 ms

Send and Receive Latency

Send and Receive Latency
  With larger packets, transmission time dominates
  Overhead becomes less of an issue
  Good for Firefly RPC, assuming large transmissions over the network
  Is the overhead acceptable for intra-machine communication?
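A rough worked example of why packet size matters, using only figures already in these slides (10 megabit Ethernet, a 1500-byte result, 2.66 ms and 6.35 ms latencies). The helper `wire_time_ms` is a hypothetical name, and the calculation is a simplification: it ignores headers, the Ethernet preamble, and controller copying.

```python
def wire_time_ms(payload_bytes, link_bits_per_second):
    """Time to clock the payload onto the wire, in milliseconds."""
    return payload_bytes * 8 / link_bits_per_second * 1000.0


# Figures taken from the slides.
ETHERNET_BPS = 10_000_000
MAX_RESULT_BYTES = 1500
NULL_MS = 2.66
MAX_RESULT_MS = 6.35

# Raw transmission time for the large result packet.
max_result_wire_ms = wire_time_ms(MAX_RESULT_BYTES, ETHERNET_BPS)

# Latency that the large result adds on top of an empty call.
extra_ms = MAX_RESULT_MS - NULL_MS
```

Under these assumptions the 1500-byte payload needs about 1.2 ms on the wire alone, out of roughly 3.69 ms of added latency, so costs that grow with packet size quickly outweigh the fixed per-call overhead.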

Stub Latency
  Significant overhead for small packets

Fewer Processors
  Seconds for 1,000 Null() calls

Fewer Processors
  Why the slowdown with one processor?
    Fast path can be followed only in a multiprocessor environment
    Lock conflicts, scheduling problems
  Why little speedup past two processors?

Future Improvements
  Hardware
    Faster network will help larger packets
    Tripling CPU speed will reduce Null() time by 52% and MaxResult by 36%
  Software
    Omit IP and UDP headers for Ethernet datagrams: 2 to 4% gain
    Redesign RPC protocol: about 5% gain
    Busy thread wait: 10 to 15% gain
    Write more in assembler: 5 to 10% gain
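One way to read the CPU estimates above is through a simple two-part model, which is an assumption of this sketch and not something the slides state: a fraction f of the latency scales with CPU speed and the rest does not. Tripling CPU speed then removes f * (1 - 1/3) of the latency, and f can be solved for. `cpu_bound_fraction` is a hypothetical helper name.

```python
def cpu_bound_fraction(latency_reduction, cpu_speedup):
    """Fraction of latency that must scale with CPU speed to give the
    stated reduction, under the assumed two-part model."""
    return latency_reduction / (1.0 - 1.0 / cpu_speedup)


# Reductions quoted on the slide for a 3x faster CPU.
null_cpu_fraction = cpu_bound_fraction(0.52, 3.0)        # Null()
max_result_cpu_fraction = cpu_bound_fraction(0.36, 3.0)  # MaxResult
```

Under this model roughly 78% of Null() latency and 54% of MaxResult latency would be CPU-bound, which fits the earlier observation that transmission time matters more for larger packets.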

Other Improvements
  Firefly RPC handles intra-machine communication through the same mechanisms as inter-machine communication
  Firefly RPC also has very high overhead for small packets
  Does this matter?

RPC Size Distribution
  Majority of RPC transfers are under 200 bytes

Frequency of Remote Activity
  Most calls are to the same machine

Traditional RPC
  Most calls are small messages that take place between domains of the same machine
  Traditional RPC contains unnecessary overhead, such as
    Scheduling
    Copying
    Access validation

Lightweight RPC (LRPC)
  Also written for the DEC Firefly system
  Mechanism for communication between different protection domains on the same system
  Significant performance improvements over traditional RPC

Overhead Analysis
  Theoretical minimum to invoke Null() across domains: a kernel trap plus a context change to call, and a trap plus a context change to return
  Theoretical minimum on Firefly RPC: 109 us
  Actual cost: 464 us

Sources of Overhead
  355 us added
    Stub overhead
    Message buffer overhead
      Not so much in Firefly RPC
    Message transfer and flow control
    Scheduling and abstract threads
    Context switch

Implementation of LRPC
  Similar to RPC
  Call to server is done through a kernel trap
  Kernel validates the caller
  Servers export interfaces
  Clients bind to server interfaces before making a call

Binding
  Servers export interfaces through a clerk
  The clerk registers the interface
  Clients bind to the interface through a call to the kernel
  Server replies with an entry address and the size of its A-stack
  Client gets a Binding Object from the kernel

Calling
  Each procedure is represented by a stub
  Client makes a call through the stub
    Manages A-stacks
    Traps to the kernel
  Kernel switches context to the server
  Server returns by its own stub
    No verification needed
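The binding and calling steps can be sketched together as a toy model. Everything here is an assumption made for illustration: `ToyKernel`, `server_add`, `add_client_stub`, the `Add` procedure, and the 12-byte A-stack layout are hypothetical, and a Python method call stands in for the trap and context switch. Only the roles (export, bind, Binding Object, A-stack, kernel validation) come from the slides.

```python
import struct


class ToyKernel:
    """Toy model of the kernel's part in LRPC (not the real interface)."""

    def __init__(self):
        self._exports = {}   # interface name -> (procedures, A-stack size)
        self._bindings = {}  # Binding Object -> interface name

    def export(self, name, procedures, astack_size):
        # The server's clerk registers the interface with the kernel.
        self._exports[name] = (procedures, astack_size)

    def bind(self, name):
        # Client binds: it gets a Binding Object and a shared A-stack.
        _, astack_size = self._exports[name]
        binding = object()   # stands in for an unforgeable token
        self._bindings[binding] = name
        return binding, bytearray(astack_size)

    def trap_call(self, binding, proc_name, astack):
        # Kernel validates the caller, then runs the server procedure
        # directly on the shared A-stack.
        if binding not in self._bindings:
            raise PermissionError("invalid Binding Object")
        procedures, _ = self._exports[self._bindings[binding]]
        procedures[proc_name](astack)


def server_add(astack):
    # Server entry stub: arguments at offset 0, result written at offset 8.
    a, b = struct.unpack_from("!ii", astack, 0)
    struct.pack_into("!i", astack, 8, a + b)


def add_client_stub(kernel, binding, astack, a, b):
    struct.pack_into("!ii", astack, 0, a, b)    # the single argument copy
    kernel.trap_call(binding, "Add", astack)    # trap to the kernel
    return struct.unpack_from("!i", astack, 8)[0]
```

In this model the arguments are written once, into the A-stack that client and server share, and the server reads them in place. A caller that presents a token the kernel never issued is rejected before any server code runs.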

Stub Generation
  Procedure representation
    Call stub for client
    Entry stub for server
  LRPC merges protocol layers
  Stub generator creates run-time stubs in assembly language
    Portability sacrificed for performance
    Falls back on Modula-2+ for complex calls

Multiple Processors
  LRPC caches domains on idle processors
  Kernel checks for an idling processor in the server domain
  If a processor is found, the caller thread can execute on the idle processor without switching context
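The idle-processor check can be pictured with a small sketch. The data layout and the name `pick_processor` are assumptions for illustration; the real kernel bookkeeping is not described in these slides.

```python
def pick_processor(processors, caller_cpu, server_domain):
    """Return (cpu, needs_context_switch) for an LRPC call.

    processors maps a CPU id to (domain currently loaded, is_idle).
    """
    for cpu, (domain, idle) in processors.items():
        if idle and domain == server_domain and cpu != caller_cpu:
            # Server context is already loaded there: run the call on
            # that processor and skip the context switch.
            return cpu, False
    # No idle processor holds the server domain: switch context in place.
    return caller_cpu, True
```

With `{0: ("client", False), 1: ("server", True)}` and a caller on CPU 0, the call moves to CPU 1 without a context switch. If CPU 1 is busy, the call stays on CPU 0 and pays for the switch.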

Argument Copying
  Traditional RPC copies arguments four times for intra-machine calls
    Client stub to RPC message, to kernel's message, to server's message, to server's stack
  In many cases, LRPC needs to copy the arguments only once
    Client stub to A-stack

Performance Analysis
  LRPC is roughly three times faster than traditional RPC
  Null() LRPC cost: 157 us, close to the 109 us theoretical minimum
  Additional overhead from stub generation and kernel execution
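The figures quoted in these slides can be checked with a few lines of arithmetic. The variable names are illustrative, and the inputs are the 464 us, 157 us, and 109 us values given for Null().

```python
# Null() costs quoted in the slides, in microseconds.
rpc_null_us = 464          # traditional RPC, actual
lrpc_null_us = 157         # LRPC, actual
theoretical_min_us = 109   # trap and context change, each way

# How much faster LRPC is for the empty call.
speedup = rpc_null_us / lrpc_null_us

# Overhead above the theoretical minimum for each mechanism.
rpc_overhead_us = rpc_null_us - theoretical_min_us
lrpc_overhead_us = lrpc_null_us - theoretical_min_us
```

The speedup comes out just under 3, matching "roughly three times faster", and the overhead above the minimum drops from 355 us to 48 us.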

Single-Processor Null() LRPC

Performance Comparison
  LRPC versus traditional RPC (in us)

Multiprocessor Speedup

Inter-machine Communication
  LRPC is best for messages between domains on the same machine
  The first instruction of the LRPC stub checks whether the call is cross-machine
    If so, the stub branches to conventional RPC
  Larger messages are handled well: LRPC cost scales linearly with packet size, like traditional RPC
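The branch at the top of the stub can be sketched as below. `make_stub`, `lrpc_path`, and `rpc_path` are hypothetical names, and a boolean fixed at bind time stands in for whatever the real stub tests in its first instruction.

```python
def make_stub(is_remote, lrpc_path, rpc_path):
    """Build a call stub whose first action is the local/remote check."""

    def stub(*args):
        if is_remote:
            # Cross-machine call: hand off to conventional RPC.
            return rpc_path(*args)
        # Same machine: stay on the lightweight path.
        return lrpc_path(*args)

    return stub
```

Because the check is made once at the very start of the stub, local calls pay almost nothing for the possibility that a binding could have been remote.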

Cost
  LRPC avoids needless scheduling, copying, and locking by integrating the client, kernel, server, and message protocols
  Abstraction is sacrificed for functionality
  RPC is built into operating systems (Linux DCE RPC, MS RPC)

Conclusion
  Firefly RPC is fast compared to most RPC implementations. LRPC is even faster. Are they fast enough?
  "The performance of Firefly RPC is now good enough that programmers accept it as the standard way to communicate" (1990)
  Is speed still an issue?
