DA-08260-49911_v02 | August 2020

DA-08260-49911 v02 | August 2020 - Nvidia · 2020. 8. 19. · NVIDIA DGX OS Server Release 4.99.11 Release Notes



Ubuntu Wiki

Upgrades

https://usn.ubuntu.com/

1 See the NVIDIA Deep Learning Frameworks documentation website (http://docs.nvidia.com/deeplearning/dgx/index.htm) for information on the latest container releases, and see https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html for instructions on how to access them.


DGX OS Server Software Content

nvidia-peer-memory


Component                                Version
GPU Driver                               450.51.06
NVIDIA Container Toolkit
    libnvidia-container1                 1.1.0-1
    libnvidia-container-tools            1.1.0-1
    nvidia-container-runtime             3.1.4-1
    nvidia-container-toolkit             1.0.6-1
    nvidia-docker2                       2.2.2-1
Ubuntu                                   18.04.4 LTS
Ubuntu kernel                            5.4.0-422
Docker Engine                            19.03.8
NVIDIA System Health Monitor (NVSM)      NVSM 20.05.17
Data Center GPU Management (DCGM)        2.0.10
Mellanox OFED                            MLNX 5.0-2.1.8.0

Component                                Version
GPU Driver                               450.51.06
NVIDIA Container Toolkit
    libnvidia-container1                 1.1.0-1
    libnvidia-container-tools            1.1.0-1
    nvidia-container-runtime             3.1.4-1
    nvidia-container-toolkit             1.0.6-1
    nvidia-docker2                       2.2.2-1
Ubuntu                                   18.04.4 LTS
Ubuntu kernel                            5.3.0-593
Docker Engine                            19.03.8
NVIDIA System Health Monitor (NVSM)      NVSM 20.05.9
Data Center GPU Management (DCGM)        2.0.10
Mellanox OFED                            MLNX 5.0-2.1.8.0

Component                                Version
GPU Driver                               450.51.05
NVIDIA Container Toolkit
    libnvidia-container1                 1.1.0-1
    libnvidia-container-tools            1.1.0-1
    nvidia-container-runtime             3.1.4-1
    nvidia-container-toolkit             1.0.6-1
    nvidia-docker2                       2.2.2-1
Ubuntu                                   18.04.4 LTS
Ubuntu kernel                            5.3.0-594
Docker Engine                            19.03.8
NVIDIA System Health Monitor (NVSM)      NVSM 20.05.9
Data Center GPU Management (DCGM)        2.0.8
Mellanox OFED                            MLNX 5.0-2.1.8.0

Component                                Version
GPU Driver                               450.36.06
NVIDIA Container Toolkit
    libnvidia-container1                 1.1.0-1
    libnvidia-container-tools            1.1.0-1
    nvidia-container-runtime             3.1.4-1
    nvidia-container-toolkit             1.0.6-1
    nvidia-docker2                       2.2.2-1
Ubuntu                                   18.04.4 LTS
Ubuntu kernel                            5.3.0-535
Docker Engine                            19.03.8
NVIDIA System Health Monitor (NVSM)      NVSM 20.05.3
Data Center GPU Management (DCGM)        2.0.4
Mellanox OFED                            MLNX 5.0-2.1.8.0

DGX A100 System Firmware Update Container Version 20.05.12.3


NOTE: SSH can be used to perform the update. However, if the Ethernet port is configured for DHCP, the IP address may change after the DGX server is rebooted during the update, resulting in loss of connection. If this happens, connect using either a direct connection or through the BMC to continue the update process.

WARNING: Connect directly to the DGX server console if the DGX is connected to a 172.17.xx.xx subnet.

DGX OS Server software installs Docker CE, which uses the 172.17.xx.xx subnet by default for Docker containers. If the DGX server is on the same subnet, you will not be able to establish a network connection to the DGX server.

Refer to the appropriate DGX-Server User Guide for instructions on how to change the default Docker network settings after performing the update.
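As a rough illustration of the warning above, the following sketch checks whether an IPv4 address falls inside 172.17.0.0/16, the range Docker's default bridge normally uses. The function name `in_docker_default_subnet` is hypothetical and not part of DGX OS; the check is a plain prefix match and assumes the default Docker bridge configuration has not been changed.

```shell
# Hypothetical helper (not part of DGX OS): succeeds if the given IPv4
# address lies in 172.17.0.0/16, Docker's usual default bridge range.
in_docker_default_subnet() {
  case "$1" in
    172.17.*.*) return 0 ;;   # first two octets are 172.17
    *)          return 1 ;;   # anything else is outside the range
  esac
}
```

One possible use is `in_docker_default_subnet "$addr" && echo "use the console or BMC"`, where `addr` holds the address assigned to the DGX server.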


Updating the Software


$ wget -O f1-changelogs http://changelogs.ubuntu.com/meta-release-lts
$ wget -O f2-archive http://archive.ubuntu.com/ubuntu/dists/bionic/Release
$ wget -O f3-usarchive http://us.archive.ubuntu.com/ubuntu/dists/bionic/Release
$ wget -O f4-security http://security.ubuntu.com/ubuntu/dists/bionic/Release
$ wget -O f5-international http://international.download.nvidia.com/dgx/repos/bionic/dists/bionic/Release
$ wget -O f6-international http://international.download.nvidia.com/dgx/repos/bionic/dists/bionic-4.99/Release
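The individual wget checks above could be wrapped in a small loop. The following is a minimal sketch, assuming only that `wget` is installed; the function name `check_repos` is hypothetical. It discards the downloaded file and reports whether each URL could be fetched.

```shell
# Hypothetical helper: try to fetch each URL given as an argument and
# report whether it was reachable. Returns non-zero if any fetch failed.
check_repos() {
  failed=0
  for url in "$@"; do
    # -q silences progress output; -O /dev/null discards the download.
    if wget -q -O /dev/null "$url"; then
      echo "OK   $url"
    else
      echo "FAIL $url"
      failed=1
    fi
  done
  return "$failed"
}
```

It can be called with any of the repository URLs listed above, for example `check_repos http://archive.ubuntu.com/ubuntu/dists/bionic/Release http://security.ubuntu.com/ubuntu/dists/bionic/Release`.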

wget


Connecting to the DGX Console

CAUTION: These instructions update all software for which updates are available from your configured software sources, including applications that you installed yourself. If you want to prevent an application from being updated, you can instruct the Ubuntu package manager to keep the current version. For more information, see Introduction to Holding Packages on the Ubuntu Community Help Wiki.
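One way to keep a package at its current version is `apt-mark hold`. The sketch below is illustrative only: the function name `hold_and_preview_upgrade`, the `RUN` variable, and the package name in the usage note are assumptions, not part of the release notes. By default it prints the commands instead of running them.

```shell
# Dry run by default: commands are printed, not executed.
# Setting RUN="sudo" runs them for real (a convention of this sketch only).
RUN="${RUN:-echo sudo}"

# Hold each named package, then refresh the package lists and
# simulate the upgrade (-s) so the held packages can be reviewed.
hold_and_preview_upgrade() {
  for pkg in "$@"; do
    $RUN apt-mark hold "$pkg"
  done
  $RUN apt update
  $RUN apt full-upgrade -s
}
```

For example, `hold_and_preview_upgrade my-app` prints the three commands for a hypothetical package named `my-app`. A hold can later be released with `sudo apt-mark unhold my-app`.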

Verifying the DGX Server Connection to the Repositories

$ sudo apt update

$ sudo apt full-upgrade -s

Introduction to Holding Packages.

$ sudo apt full-upgrade

• nvidia-docker.service ,


DGX A100 User Guide


$ sudo mdadm --manage /dev/md1 --add /dev/<device-name>


Known Issues


nvsm start rebuild

nvml nvidia-smi


nvsm stress-test

nvsm stress-test

ccp initialization failed


/var/log/syslog

SM LID is 0, maybe no SM is running

srp_daemon

srp_daemon


$ sudo systemctl disable srp_daemon.service

$ sudo systemctl disable srptools.service

nvsm show nvswitches

nvsm show /systems/localhost/nvswitches

/systems/localhost/nvswitches

Targets:

NVSwitch10

NVSwitch11

NVSwitch12

NVSwitch13

NVSwitch8

NVSwitch9


$ mdadm --run /dev/md?*

$ exit

detected NVSwitch non-fatal error 10003 on NVSwitch pci


https://www.mellanox.com/support/firmware/connectx6ib

:~/temp-mlnx-download$ ls
fw-ConnectX6-rel-20_27_2008-MCX653105A-HDA_Ax-UEFI-14.20.22-FlexBoot-3.5.901.bin
fw-ConnectX6-rel-20_27_2008-MCX653106A-HDA_Ax-UEFI-14.20.22-FlexBoot-3.5.901.bin

sudo mlxfwmanager -u -y

u y

:~/temp-mlnx-download$ sudo mlxfwmanager -u -y
Querying Mellanox devices firmware ...

Device #1:
----------
  Device Type:      ConnectX6
  Part Number:      MCX653106A-HDA_Ax
  Description:      ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; dual-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
  PSID:             MT_0000000225
  PCI Device Name:  /dev/mst/mt4123_pciconf9
  Base MAC:         1c34da4d72f6
  Versions:         Current        Available
     FW             20.27.1016     20.27.2008
     PXE            3.5.0901       3.5.0901
     UEFI           14.20.0019     14.20.0022
  Status:           Update required
...
...
...
---------
Found 10 device(s) requiring firmware update...

Device #1: Updating FW ...
Initializing image partition - OK
Writing Boot image component - 26%
...
...
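The Versions table in the output above lists a current and an available version for each firmware component. As a small sketch, the following filter reads such output on standard input and prints the components whose two versions differ. The function name `components_needing_update` is hypothetical, and the filter assumes the three-column "name, current, available" layout shown above.

```shell
# Reads mlxfwmanager-style output on stdin and prints the name of each
# component (FW, PXE, UEFI) whose current and available versions differ.
# With several devices in the input, a name is printed once per device.
components_needing_update() {
  awk '($1 == "FW" || $1 == "PXE" || $1 == "UEFI") && NF >= 3 && $2 != $3 { print $1 }'
}
```

For the device shown above this would print FW and UEFI, since the PXE versions already match. One way to use it is `components_needing_update < saved-output.txt`, where `saved-output.txt` is a hypothetical file holding the captured output.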

$ sudo systemctl restart nvidia-mlnx-config

nvsm show health

Number of logical CPU cores [None]............................ Unknown


Serial Over LAN Does not Work After Cold Resetting the BMC

System May Slow Down When Using mpirun

ipmitool mc reset cold

a) ps -ef | grep "/sbin/agetty -o -p -- \u --keep-baud 115200,38400,9600 ttyS0 vt220"

b) kill <PID>

c) /sbin/agetty -o -p -- \u --keep-baud 115200,38400,9600 ttyS0 vt220
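Step a) above requires picking the PID out of the `ps -ef` listing by eye. The following sketch shows one way to extract it, assuming the usual `ps -ef` column order (UID, PID, PPID, C, STIME, TTY, TIME, CMD); the function name `agetty_ttyS0_pids` is hypothetical.

```shell
# Reads `ps -ef` output on stdin and prints the PID (column 2) of every
# process whose command (column 8) is /sbin/agetty and whose line mentions
# ttyS0. Matching on column 8 also skips the grep process from step a).
agetty_ttyS0_pids() {
  awk '$8 == "/sbin/agetty" && /ttyS0/ { print $2 }'
}
```

Running `ps -ef | agetty_ttyS0_pids` prints the PID to pass to `kill` in step b).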


mpirun

kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!

get_user_pages

cudaHostRegister

/tmp

NOTE: If you performed this workaround on a previous DGX OS software version, you do not need to do it again after updating to the latest DGX OS version.

/dev/shm

mpirun

DGX System Slows Down When Using mpirun

NVIDIA Enterprise Support


Known Limitations


Appendix A. Third Party License Notice

https://www.micron.com/products/solid-state-storage/storage-executive-software

Micron Technology, Inc. Software License Agreement

PLEASE READ THIS LICENSE AGREEMENT ("AGREEMENT") FROM MICRON TECHNOLOGY, INC. ("MTI") CAREFULLY: BY INSTALLING, COPYING OR OTHERWISE USING THIS SOFTWARE AND ANY RELATED PRINTED MATERIALS ("SOFTWARE"), YOU ARE ACCEPTING AND AGREEING TO THE TERMS OF THIS AGREEMENT. IF YOU DO NOT AGREE WITH THE TERMS OF THIS AGREEMENT, DO NOT INSTALL THE SOFTWARE.

LICENSE: MTI hereby grants to you the following rights: You may use and make one (1) backup copy the Software subject to the terms of this Agreement. You must maintain all copyright notices on all copies of the Software. You agree not to modify, adapt, decompile, reverse engineer, disassemble, or otherwise translate the Software. MTI may make changes to the Software at any time without notice to you. In addition MTI is under no obligation whatsoever to update, maintain, or provide new versions or other support for the Software.

OWNERSHIP OF MATERIALS: You acknowledge and agree that the Software is proprietary property of MTI (and/or its licensors) and is protected by United States copyright law and international treaty provisions. Except as expressly provided herein, MTI does not grant any express or implied right to you under any patents, copyrights, trademarks, or trade secret information. You further acknowledge and agree that all right, title, and interest in and to the Software, including associated proprietary rights, are and shall remain with MTI (and/or its licensors). This Agreement does not convey to you an interest in or to the Software, but only a limited right to use and copy the Software in accordance with the terms of this Agreement. The Software is licensed to you and not sold.

DISCLAIMER OF WARRANTY: THE SOFTWARE IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND. MTI EXPRESSLY DISCLAIMS ALL WARRANTIES EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO, NONINFRINGEMENT OF THIRD PARTY RIGHTS, AND ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. MTI DOES NOT WARRANT THAT THE SOFTWARE WILL MEET YOUR REQUIREMENTS, OR THAT THE OPERATION OF THE SOFTWARE WILL BE UNINTERRUPTED OR ERROR-FREE. FURTHERMORE, MTI DOES NOT MAKE ANY REPRESENTATIONS REGARDING THE USE OR THE RESULTS OF THE USE OF THE SOFTWARE IN TERMS OF ITS CORRECTNESS, ACCURACY, RELIABILITY, OR OTHERWISE. THE ENTIRE RISK ARISING OUT OF USE OR PERFORMANCE OF THE SOFTWARE REMAINS WITH YOU. IN NO EVENT SHALL MTI, ITS AFFILIATED COMPANIES OR THEIR SUPPLIERS BE LIABLE FOR ANY DIRECT, INDIRECT, CONSEQUENTIAL, INCIDENTAL, OR SPECIAL DAMAGES (INCLUDING, WITHOUT LIMITATION, DAMAGES FOR LOSS OF PROFITS, BUSINESS INTERRUPTION, OR LOSS OF INFORMATION) ARISING OUT OF YOUR USE OF OR INABILITY TO USE THE SOFTWARE, EVEN IF MTI HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Because some jurisdictions prohibit the exclusion or limitation of liability for consequential or incidental damages, the above limitation may not apply to you.

TERMINATION OF THIS LICENSE: MTI may terminate this license at any time if you are in breach of any of the terms of this Agreement. Upon termination, you will immediately destroy all copies the Software.

GENERAL: This Agreement constitutes the entire agreement between MTI and you regarding the subject matter hereof and supersedes all previous oral or written communications between the parties. This Agreement shall be governed by the laws of the State of Idaho without regard to its conflict of laws rules.

CONTACT: If you have any questions about the terms of this Agreement, please contact MTI's legal department at (208) 368-4500.

By proceeding with the installation of the Software, you agree to the terms of this Agreement. You must agree to the terms in order to install and use the Software.

http://www.mellanox.com/

Copyright (c) 2006 Mellanox Technologies.

All rights reserved.


Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.

IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


www.nvidia.com

Notice