  • Redpaper

    Front cover

    Deployment and Usage Guide for Running AI Workloads on Red Hat OpenShift and NVIDIA DGX Systems with IBM Spectrum Scale

    Simon Lorenz

    Gero Schmidt

    Thomas Schoenemeyer

  • IBM Redbooks

    Deployment and Usage Guide for Running AI Workloads on Red Hat OpenShift and NVIDIA DGX Systems with IBM Spectrum Scale

    November 2020

    REDP-5610-00

  • © Copyright International Business Machines Corporation 2020. All rights reserved. Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

    First Edition (November 2020)

    This edition applies to Version 4, Release 4, Modification 3 of Red Hat OpenShift and to Version 5, Release 0, Modification 4.3 of IBM Spectrum Scale.

    Note: Before using this information and the product it supports, read the information in “Notices” on page v.

  • Contents

    Notices
      Trademarks

    Preface
      Authors
      Now you can become a published author, too!
      Comments welcome
      Stay connected to IBM Redbooks

    Chapter 1. Overview
      1.1 Proof of concept background

    Chapter 2. Proof of concept environment
      2.1 Overview
      2.2 Prerequisites

    Chapter 3. Installation
      3.1 Configuring the NVIDIA Mellanox EDR InfiniBand network
      3.2 Integrating DGX-1 systems as worker nodes into a Red Hat OpenShift 4.4.3 cluster
        3.2.1 Installing the Red Hat Enterprise Linux 7.6 and DGX software
        3.2.2 Installing NVIDIA Mellanox InfiniBand drivers (MLNX_OFED)
        3.2.3 Installing the GDRDMA kernel module
        3.2.4 Installing the NVIDIA Mellanox SELinux module
        3.2.5 Adding DGX-1 systems as worker nodes to the Red Hat OpenShift cluster
      3.3 Adding DGX-1 systems as client nodes to the IBM Spectrum Scale cluster
      3.4 Installing and configuring more components in the Red Hat OpenShift 4.4.3 stack
        3.4.1 Special Resource Operator
        3.4.2 NVIDIA Mellanox RDMA Shared Device plug-in
        3.4.3 Enabling the IPC_LOCK capability in the user namespace for the RDMA Shared Device plug-in
        3.4.4 MPI Operator
        3.4.5 IBM Spectrum Scale CSI

    Chapter 4. Preparation and functional testing
      4.1 Testing remote direct memory access through an InfiniBand network
      4.2 Preparing persistent volumes with IBM Spectrum Scale Container Storage Interface
      4.3 MPIJob definition
      4.4 Connectivity tests with the NVIDIA Collective Communications Library
      4.5 Multi-GPU and multi-Node GPU scaling with TensorFlow ResNet-50 benchmark

    Chapter 5. Deep neural network training on the Audi Autonomous Driving Dataset semantic segmentation data set
      5.1 Description of the A2D2
      5.2 Multi-node GPU scaling results for deep neural network training jobs
      5.3 Application
      5.4 Integrating IBM Spectrum Discover and IBM Spectrum LSF to find the correct data based on labels

    Related publications
      Online resources
      Help from IBM

  • Notices

    This information was developed for products and services offered in the US. This material might be available from IBM in other languages. However, you may be required to own a copy of the product or product version in that language in order to access it.

    IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user’s responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

    IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing, IBM Corporation, North Castle Drive, MD-NC119, Armonk, NY 10504-1785, US.

    INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some jurisdictions do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

    This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

    Any references in this information to non-IBM websites are provided for convenience only and do not in any manner serve as an endorsement of those websites. The materials at those websites are not part of the materials for this IBM product and use of those websites is at your own risk.

    IBM may use or distribute any of the information you provide in any way it believes appropriate without incurring any obligation to you.

    The performance data and client examples cited are presented for illustrative purposes only. Actual performance results may vary depending on specific configurations and operating conditions.

    Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

    Statements regarding IBM’s future direction or intent are subject to change or withdrawal without notice, and represent goals and objectives only.

    This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious a