Faster! Vidhyashankar Venkataraman CS614 Presentation


  • Slide 1
  • Faster! Vidhyashankar Venkataraman CS614 Presentation
  • Slide 2
  • U-Net: A User-Level Network Interface for Parallel and Distributed Computing
  • Slide 3
  • Background: fast computing. Emergence of MPPs (Massively Parallel Processors) in the early 90s: repackage hardware components to form a dense configuration of very large parallel computing systems, but these require custom software. Alternative: NOW (Berkeley), a Network Of Workstations, formed from inexpensive, low-latency, high-bandwidth, scalable interconnect networks of workstations connected through fast switches. Challenge: to build a scalable system that is able to use the aggregate resources in the network to execute parallel programs efficiently.
  • Slide 4
  • Issues: problems with traditional networking architectures. The software path through the kernel involves several copies, which adds processing overhead, so in faster networks applications may not get speed-up commensurate with network performance. Observations: for small messages, processing overhead dominates network latency, and most applications use small messages. E.g., a UCB NFS trace: 50% of the bits sent were in messages of 200 bytes or less.
  • Slide 5
  • Issues (contd.): flexibility concerns. Protocol processing happens in the kernel; there is greater flexibility if application-specific information is integrated into protocol processing, since the protocol can then be tuned to the application's needs. E.g., customized retransmission of video frames.
  • Slide 6
  • U-Net philosophy: achieve flexibility and performance by removing the kernel from the critical path, placing the entire protocol stack at user level, allowing protected user-level access to the network, supplying full bandwidth to small messages, and supporting both novel and legacy protocols.
  • Slide 7
  • Do MPPs do this? Parallel machines like the Meiko CS-2 and Thinking Machines CM-5 have tried to solve the problem of providing user-level access to the network, but they use a custom network and network interface, so there is no flexibility. U-Net targets applications on standard workstations using off-the-shelf components.
  • Slide 8
  • Basic U-Net architecture: virtualize the network device so that each process has the illusion of owning the NI. A mux/demux device virtualizes the NI and offers protection. The kernel is removed from the critical path and is involved only in setup.
  • Slide 9
  • The U-Net architecture, building blocks: application end-points, a Communication Segment (CS), and message queues; an application endpoint consists of a CS (a region of memory) plus send, receive, and free queues. Sending: assemble the message in the CS and enqueue a message descriptor. Receiving: poll-driven or event-driven; dequeue a message descriptor, consume the message, and enqueue the buffer in the free queue (see the sketch below).
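A minimal C sketch of how these building blocks might fit together. The struct layout, sizes, and names (unet_endpoint, unet_desc, unet_send, unet_recv) are assumptions for illustration, not the actual U-Net API; queue index management and buffer allocation are elided.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CS_SIZE   (64 * 1024)   /* communication segment size (assumed) */
#define QUEUE_LEN 128

typedef struct {
    uint32_t offset;   /* offset of the message buffer inside the CS */
    uint32_t length;   /* message length in bytes */
    uint32_t tag;      /* destination tag (e.g. an ATM VCI) */
} unet_desc;

typedef struct {
    uint8_t   cs[CS_SIZE];        /* communication segment, pinned memory */
    unet_desc send_q[QUEUE_LEN];  /* descriptors of messages to transmit */
    unet_desc recv_q[QUEUE_LEN];  /* descriptors of received messages */
    uint32_t  free_q[QUEUE_LEN];  /* offsets of free receive buffers */
    /* head/tail indices omitted for brevity */
} unet_endpoint;

/* Sending: assemble the message in the CS, then enqueue a descriptor. */
void unet_send(unet_endpoint *ep, uint32_t tag, const void *msg, size_t len)
{
    uint32_t off = 0;                      /* CS buffer allocation elided */
    memcpy(ep->cs + off, msg, len);        /* copy the payload into the CS */
    unet_desc d = { off, (uint32_t)len, tag };
    ep->send_q[0] = d;                     /* enqueue; the NI picks it up */
}

/* Receiving (polling variant): dequeue a descriptor, consume the message,
 * then return the buffer to the free queue. */
size_t unet_recv(unet_endpoint *ep, void *buf, size_t maxlen)
{
    unet_desc d = ep->recv_q[0];           /* dequeue; empty-check elided */
    size_t n = d.length < maxlen ? d.length : maxlen;
    memcpy(buf, ep->cs + d.offset, n);     /* consume the message from the CS */
    ep->free_q[0] = d.offset;              /* recycle the receive buffer */
    return n;
}
```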
  • Slide 10
  • U-Net architecture (contd.): more on event handling (upcalls). An upcall can be a UNIX signal handler or a user-level interrupt handler, and the cost of upcalls is amortized by batching receptions. Mux/demux: each endpoint is uniquely identified by a tag (e.g., a VCI in ATM). The OS performs the initial route setup and security tests and registers a tag in U-Net for that application; the message tag is mapped to a communication channel (see the demux sketch below).
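Continuing the sketch above, the fragment below illustrates the mux/demux step: the tag carried by an incoming message is looked up in a table that the OS filled in at channel-setup time, and the message is deposited into the owning endpoint's queues. The table layout and the names (channel_table, unet_demux) are hypothetical.

```c
#define MAX_CHANNELS 1024

/* tag -> endpoint mapping, registered by the OS during route setup */
static unet_endpoint *channel_table[MAX_CHANNELS];

/* Called on the receive path for each arriving message. */
void unet_demux(uint32_t tag, const void *data, size_t len)
{
    if (tag >= MAX_CHANNELS || channel_table[tag] == NULL)
        return;                          /* unknown tag: drop, protection holds */
    unet_endpoint *ep = channel_table[tag];
    uint32_t off = ep->free_q[0];        /* take a free buffer (check elided) */
    memcpy(ep->cs + off, data, len);     /* DMA in the real implementation */
    unet_desc d = { off, (uint32_t)len, tag };
    ep->recv_q[0] = d;                   /* make the message visible to the app */
}
```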
  • Slide 11
  • Observations: buffers have to be preallocated, which is a memory overhead. Protected user-level access to the NI is ensured by demarcating protection boundaries, defined by endpoints and communication channels. Applications cannot interfere with each other because endpoints, CSes, and message queues are user-owned, outgoing messages are tagged with the originating endpoint address, and incoming messages are demuxed by U-Net and sent to the correct endpoint.
  • Slide 12
  • Zero-copy and true zero-copy: two levels of sophistication, depending on whether a copy is made in the CS. Base-level architecture, zero-copy: data is copied into an intermediate buffer in the CS; CSes are allocated, aligned, and pinned to physical memory, with an optimization for small messages. Direct-access architecture, true zero-copy: data is sent directly out of the data structure, and the sender also specifies the offset where the data has to be deposited; the CS spans the entire process address space. Limitations in I/O addressing force one to resort to zero-copy (the sketch below contrasts the two send paths).
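A sketch contrasting the two send paths, reusing the types from the earlier sketch. The descriptor fields and function names are assumptions, not the actual U-Net layout.

```c
/* Base-level ("zero-copy"): the payload is staged once in the pinned CS. */
void send_base_level(unet_endpoint *ep, uint32_t tag, const void *msg, size_t len)
{
    uint32_t off = 0;                        /* CS buffer allocation elided */
    memcpy(ep->cs + off, msg, len);          /* exactly one copy, into the CS */
    unet_desc d = { off, (uint32_t)len, tag };
    ep->send_q[0] = d;
}

/* Direct-access ("true zero-copy"): the CS conceptually spans the whole
 * address space, so the descriptor can reference the application data
 * structure itself and name the offset where the receiver should deposit it. */
typedef struct {
    const void *src;        /* data sent directly out of the data structure */
    uint32_t    dst_offset; /* where the data is to be deposited at the receiver */
    uint32_t    length;
    uint32_t    tag;
} unet_direct_desc;

void send_direct_access(unet_direct_desc *send_q, const void *obj, size_t len,
                        uint32_t dst_offset, uint32_t tag)
{
    unet_direct_desc d = { obj, dst_offset, (uint32_t)len, tag };
    send_q[0] = d;          /* no copy: the NI transfers straight from obj */
}
```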
  • Slide 13
  • Kernel-emulated end-point: communication segments and message queues are scarce resources. Optimization: provide a single kernel-emulated endpoint. Cost: performance overhead.
  • Slide 14
  • U-Net implementation: the U-Net architecture was implemented on two systems, using Fore Systems SBA-100 and SBA-200 ATM network interfaces. But why ATM? Setup: SPARCstations 10 and 20 running SunOS 4.1.3, with an ASX-200 ATM switch and 140 Mbps fiber links. SBA-200 firmware: 25 MHz on-board i960 processor, 256 KB RAM, DMA capabilities; the firmware was completely redesigned. Device driver: protection is offered through the VM system (CSes) and also through mappings.
  • Slide 15
  • U-Net performance: RTT and bandwidth measurements. Small messages: 65 µs RTT (with an optimization for single cells); the fiber saturates at 800 B.
  • Slide 16
  • U-Net Active Messages layer: an RPC that can be implemented efficiently on a wide range of hardware, and a basic communication primitive in NOW. Active messages allow overlapping of communication with computation: a message contains data and a pointer to a handler, message delivery is reliable, and the handler moves the data into data structures for some (ongoing) operation (see the sketch below).
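A minimal sketch of the active-message idea: each message names its handler, and the receiving side invokes that handler as soon as the message is pulled off the network, folding the data into an ongoing computation. The types and names (am_message, am_poll, sum_handler) are illustrative, not the UAM API.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void (*am_handler)(void *arg, size_t len);

typedef struct {
    am_handler handler;       /* pointer to the handler to run on receipt */
    size_t     len;
    uint8_t    payload[32];   /* small-message payload (fits a single cell) */
} am_message;

/* Example handler: folds incoming data into an ongoing reduction. */
static double running_sum;

static void sum_handler(void *arg, size_t len)
{
    double x;
    if (len >= sizeof x) {
        memcpy(&x, arg, sizeof x);   /* move data into the computation's state */
        running_sum += x;
    }
}

/* Receive path: invoke the handler named by the message, then return
 * immediately so the main loop can keep computing (communication and
 * computation overlap). */
void am_poll(am_message *msg)
{
    msg->handler(msg->payload, msg->len);
}
```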
  • Slide 17
  • AM micro-benchmarks: single-cell RTT is ~71 µs for a 0-32 B message, an overhead of 6 µs over raw U-Net. Why? Block store bandwidth reaches 80% of the maximum limit with blocks of 2 KB and is almost saturated at 4 KB. Good performance!
  • Slide 18
  • Split-C application benchmarks: Split-C is a parallel extension to C, implemented on top of UAM. Tested on 8 processors; the ATM cluster performs close to the CS-2.
  • Slide 19
  • TCP/IP and UDP/IP over U-Net: good performance is necessary to show flexibility. Traditional IP-over-ATM shows very poor performance; e.g., TCP reaches only 55% of the maximum bandwidth. TCP and UDP over U-Net show improved performance, primarily because of tighter application-network coupling. IP-over-U-Net: IP-over-ATM does not exactly correspond to IP-over-U-Net, since demultiplexing for the same VCI is not possible.
  • Slide 20
  • Performance graphs: UDP performance (saw-tooth behavior for Fore UDP) and TCP performance.
  • Slide 21
  • Conclusion: U-Net provides a virtual view of the network interface to enable user-level access to high-speed communication devices. The two main goals, performance and flexibility, were pursued by avoiding the kernel in the critical path. Achieved? Look at the table below.
  • Slide 22
  • Lightweight Remote Procedure Calls
  • Slide 23
  • Motivation: small-kernel OSes have most services implemented as separate, communicating user-level processes. This improves modular structure, gives more protection, and eases system design and maintenance. Cross-domain and cross-machine communication are treated equally - problems? This fails to isolate the common case, which matters for both performance and simplicity.
  • Slide 24
  • Measurements show cross-domain predominance: V System 97%, Taos Firefly 94%, Sun UNIX+NFS diskless 99.4%. But how about RPCs these days? Taos takes 109 µs for a Null() local call and 464 µs for the RPC, a 3.5x overhead. Most i