  • Published on
    09-Jun-2018


  • DESIGNING HIGH-PERFORMANCE AND SCALABLE CLUSTERED NETWORK ATTACHED STORAGE WITH

    INFINIBAND

    DISSERTATION

    Presented in Partial Fulfillment of the Requirements for

    the Degree Doctor of Philosophy in the

    Graduate School of The Ohio State University

    By

    Ranjit Noronha, MS

    * * * * *

    The Ohio State University

    2008

    Dissertation Committee:

    Dhabaleswar K. Panda, Adviser

    Ponnuswammy Sadayappan

    Feng Qin

    Approved by

    Adviser

    Graduate Program in Computer Science and Engineering

  • © Copyright by

    Ranjit Noronha

    2008

  • ABSTRACT

    The Internet age has exponentially increased the volume of digital media that is being shared and distributed. Broadband Internet has made technologies such as high-quality streaming video on demand possible. Large-scale supercomputers also consume and create huge quantities of data. This media and data must be stored, cataloged, and retrieved with high performance. Researching high-performance storage subsystems that meet the I/O demands of modern applications is therefore crucial.

    Advances in microprocessor technology have given rise to relatively cheap off-the-shelf hardware that may be assembled into personal computers as well as servers. The servers may be connected by networking technology to create farms or clusters of workstations (COWs). The evolution of COWs has significantly reduced the cost of ownership of high-performance clusters and has allowed users to build fairly large-scale machines based on commodity server hardware.

    As COWs have evolved, networking technologies like InfiniBand and 10 Gigabit Ethernet have also evolved. These networking technologies not only give lower end-to-end latencies, but also allow for better messaging throughput between the nodes. This allows us to connect the clusters with high-performance interconnects at a relatively low cost.

  • With the deployment of low-cost, high-performance hardware and networking technology, it is becoming increasingly important to design a storage system that can be shared across all the nodes in the cluster. Traditionally, the different components of the file system have been strung together over the network. The protocol generally used over the network is TCP/IP. The TCP/IP protocol stack has been shown to perform poorly, especially on high-performance networks like 10 Gigabit Ethernet or InfiniBand. This is largely due to the fragmentation and reassembly cost of TCP/IP. The cost of multiple copies also severely degrades the performance of the stack. In addition, TCP/IP has been shown to reduce the capacity of network attached storage systems because of problems like incast.

    In this dissertation, we research the problem of designing high-performance communication subsystems for network attached storage (NAS) systems. Specifically, we delve into the issues and potential solutions in designing communication protocols for high-end single-server and clustered-server NAS systems. Orthogonally, we also investigate how a caching architecture may enhance the performance of a NAS system. Finally, we look at the performance implications of using some of these designs in two scenarios: over a long-haul network, and as a basis for checkpointing parallel applications.

  • TABLE OF CONTENTS

    Page

    Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii

    List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

    List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x

    Chapters:

    1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

    1.1 Overview of Storage Terminology . . . 4
        1.1.1 File System Concepts . . . 4
        1.1.2 UNIX Notion of Files . . . 5
        1.1.3 Mapping files to blocks on storage devices . . . 6
        1.1.4 Techniques to improve file system performance . . . 8

    1.2 Overview of Storage Architectures . . . 9
        1.2.1 Direct Attached Storage (DAS) . . . 9
        1.2.2 Network Storage Architectures: In-Band and Out-Of-Band Access . . . 11
        1.2.3 Network Attached Storage (NAS) . . . 12
        1.2.4 System Area Networks (SAN) . . . 13
        1.2.5 Clustered Network Attached Storage (CNAS) . . . 14
        1.2.6 Object Based Storage Systems (OBSS) . . . 16

    1.3 Representative File Systems . . . 17
        1.3.1 Network File Systems (NFS) . . . 18
        1.3.2 Lustre File System . . . 19

    1.4 Overview of Networking Technologies . . . 20
        1.4.1 Fibre Channel (FC) . . . 21

  •     1.4.2 10 Gigabit Ethernet . . . 22
        1.4.3 InfiniBand . . . 23

    2. Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

    2.1 Research Approach . . . 29
    2.2 Dissertation Overview . . . 31

    3. Single Server NAS: NFS on InfiniBand . . . . . . . . . . . . . . . . . . . . . 34

    3.1 Why is NFS over RDMA important? . . . 34

    3.2 Overview of the InfiniBand Communication Model . . . 39
        3.2.1 Communication Primitives . . . 40

    3.3 Overview of NFS/RDMA Architecture . . . 41
        3.3.1 Inline Protocol for RPC Call and RPC Reply . . . 43
        3.3.2 RDMA Protocol for bulk data transfer . . . 44

    3.4 Proposed Read-Write Design and Comparison to the Read-Read Design . . . 46
        3.4.1 Limitations in the Read-Read Design . . . 49
        3.4.2 Potential Advantages of the Read-Write Design . . . 50
        3.4.3 Proposed Registration Strategies For the Read-Write Protocol . . . 51

    3.5 Experimental Evaluation of NFSv3 over InfiniBand . . . 57
        3.5.1 Comparison of the Read-Read and Read-Write Design . . . 57
        3.5.2 Impact of Registration Strategies . . . 60
        3.5.3 Multiple Clients and Real Disks . . . 63

    3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

    4. Enhancing the Performance of NFSv4 with RDMA . . . . . . . . . . . . . . . 67

    4.1 Design of NFSv4 with RDMA . . . 68
        4.1.1 Compound Procedures . . . 68
        4.1.2 Read and Write Operations . . . 70
        4.1.3 Readdir/Readlink Operations . . . 71

    4.2 Evaluation of NFSv4 over RDMA . . . 71
        4.2.1 Experimental Setup . . . 71
        4.2.2 Impact of RDMA on NFSv4 . . . 72
        4.2.3 Comparison between NFSv4/TCP and NFSv4/RDMA . . . 72

    4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

  • 5. Performance in a WAN environment . . . . . . . . . . . . . . . . . . . . . . . 74

    5.1 InfiniBand WAN: Range Extension . . . 74
    5.2 WAN Experimental Setup . . . 76
    5.3 Performance of NFS/RDMA with increasing delay . . . 76
    5.4 NFS WAN performance characteristics with RDMA and TCP/IP . . . 76
    5.5 Summary . . . 77

    6. Clustered NAS: pNFS on InfiniBand . . . . . . . . . . . . . . . . . . . . . . . 79

    6.1 NFSv4.1: Parallel NFS (pNFS) and Sessions . . . 82

    6.2 Design Considerations for pNFS over RDMA . . . 85
        6.2.1 Design of pNFS using a file layout . . . 85
        6.2.2 RPC Connections from clients to MDS (Control Path) . . . 87
        6.2.3 RPC Connections from MDS to DS (MDS-DS control path) . . . 88
        6.2.4 RPC Connections from clients to data servers (Data paths) . . . 89
        6.2.5 Sessions Design with RDMA . . . 90

    6.3 Performance Evaluation . . . 90
        6.3.1 Experimental Setup . . . 91
        6.3.2 Impact of RPC/RDMA on Performance from the client to the Metadata Server . . . 91
        6.3.3 RPC/RDMA versus RPC/TCP on metadata server to Data Server . . . 94
        6.3.4 RPC/RDMA versus RPC/TCP from clients to DS . . . 97

    6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

    7. Caching in a clustered NAS environment . . . . . . . . . . . . . . . . . . . . . 104

    7.1 Background . . . 107
        7.1.1 Introduction to GlusterFS . . . 107
        7.1.2 Introduction to MemCached . . . 107

    7.2 Motivation . . . 108

    7.3 Design of a Cache for File Systems . . . 110
        7.3.1 Overall Architecture of Intermediate Memory Caching (IMCa) Layer . . . 111
        7.3.2 Design for Management File System Operations in IMCa . . . 113
        7.3.3 Data Transfer Operations . . . 114
        7.3.4 Potential Advantages/Disadvantages of IMCa . . . 118

    7.4 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 119

  •     7.4.1 Experimental Setup . . . 120
        7.4.2 Performance of Stat With the Cache . . . 120
        7.4.3 Latency: Single Client . . . 122
        7.4.4 Latency: Multiple Clients . . . 124
        7.4.5 IOzone Throughput . . . 127
        7.4.6 Read/Write Sharing Experiments . . . 129

    7.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

    8. Evaluation of Check-pointing With High-Performance I/O . . . . . . . . . . . 131

    8.1 Overview of Checkpoint Approaches and Issues . . . 133
        8.1.1 Checkpoint Initiation . . . 133
        8.1.2 Blocking versus Non-Blocking Checkpointing . . . 134
        8.1.3 Application versus System Level Checkpointing . . . 134

    8.2 Storage Systems and Checkpointing in MVAPICH2: Issues, Performance Evaluation and Impact . . . 136
        8.2.1 Storage Subsystem Issues . . . 136
        8.2.2 Impact of Different File Systems . . . 137
        8.2.3 Impact of Checkpointing Interval . . . 139
        8.2.4 Impact of System Size . . . 140

    8.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142

    9. Open Source Software Release and Its Impact . . . . . . . . . . . . . . . . . . 143

    10. Conclusions and Future Research Directions . . . . . . . . . . . . . . . . . . . 144

    10.1 Summary of contributions . . . 144
        10.1.1 A high-performance single-server network file system (NFSv3) over RDMA . . . 144
        10.1.2 A high-performance single-server network file system (NFSv4) over RDMA . . . 145
        10.1.3 NFS over RDMA in a WAN environment . . . 145
        10.1.4 High-Performance parallel network file-system (pNFS) over InfiniBand . . . 146
        10.1.5 Intermediate Caching Architecture . . . 147
        10.1.6 System-Level Checkpointing With MVAPICH2 and Lustre . . . 147

    10.2 Future Research Directions . . . 148
        10.2.1 Investigation of registration modes . . . 148

  •     10.2.2 Scalability to very large clusters . . . 148
        10.2.3 Metadata Parallelization . . . 148
        10.2.4 InfiniBand based Fault Tolerance for storage subsystems . . . 149
        10.2.5 Cache Structure and Lookup for Parallel and Distributed File Systems . . . 149
        10.2.6 Helper Core to Reduce Checkpoint Overhead . . . 150

    Appendices:

    A. Los Alamos National Laboratory Copyright Notice for Figure relating to IBM RoadRunner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151

    Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152

  • LIST OF TABLES

    Table Page

    3.1 Communication Primitive Properties . . . . . . . . . . . . . . . . . . . . . 41

  • LIST OF FIGURES

    Figure Page

    1.1 IBM RoadRunner: the first petaflop cluster. Courtesy, Leroy N. Sanchez, LANL [26], copyright, Appendix A . . . . . . . . . . . . . . . . . . . . . 3

    1.2 Local File System Protocol Stack. Courtesy, Welch, et al. [40] . . . . . . . 11

    1.3 In-Band Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

    1.4 Out-of-Band Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

    1.5 Network Attached Storage Protocols. Based on illustration by Welch, et al. [40] . . . 14

    1.6 System Area Networks. Based on illustration by Welch, et al. [40] . . . . . 15

    1.7 Clustered NAS: High-Level Architecture . . . . . . . . . . . . . . . . . . . 17

    1.8 Clustered NAS Forwarding Model: Protocol Stack. Courtesy, Welch, et al. [40] . . . 17

    1.9 Lustre Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

    1.10 Fibre Channel Connectors [3] . . . . . . . . . . . . . . . . . . . . . . . . . 22

    1.11 Fibre Channel Topologies [3] . . . . . . . . . . . . . . . . . . . . . . . . . 22

    1.12 InfiniBand Fibre Cables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

    1.13 InfiniBand Switch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

  • 2.1 Proposed Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

    3.1 Architecture of the NFS/RDMA stack in OpenSolaris . . . . . . . . . . . . 42

    3.2 RPC/RDMA header . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

    3.3 Read-Read Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

    3.4 Read-Write Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

    3.5 Registration points (Read-Write) . . . . . . . . . . . . . . . . . . . . . . . 45

    3.6 Latency and Registration costs in InfiniBand on OpenSola...
