SC’16 Technical Training
All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at http://www.intel.com/content/www/us/en/software/intel-solutions-for-lustre-software.html.
You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.
3D XPoint, Intel, the Intel logo, Intel Core, Intel Xeon Phi, Optane and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.
* Other names and brands may be claimed as the property of others.
© 2016 Intel Corporation
Introductions
Lustre Overview
Roadmap Deep Dive
10:30 Break
Lustre on ZFS Update
Lunch (provided)
Intel® Omni-Path with Lustre
Knights Landing with Lustre
Introduction to Intel® HPC Orchestrator
Lustre Performance Tuning Review
Intel® Scalable System Framework: A Holistic Solution for All HPC Needs
Small Clusters Through Supercomputers
Compute and Data-Centric Computing
Standards-Based Programmability
On-Premise and Cloud-Based
Intel® Xeon® Processors
Intel® Xeon Phi™ Processors
Intel® FPGAs and Server Solutions
Intel® Solutions for Lustre*
Intel® Optane™ Technology
3D XPoint™ Technology
Intel® SSDs
Intel® Omni-Path Architecture
Intel® Silicon Photonics
Intel® Ethernet
Intel® HPC Orchestrator
Intel® Software Tools
Intel® Cluster Ready Program
Intel Supported SDVis
Compute
Fabric
Memory / Storage
Software
Let's go around the room!
December 2015: Intel’s Analysis of Top 100 Systems (top100.org)
Lustre 71%, GPFS 18%, NFS 4%, Other 7%
9 of Top 10 Sites
71% of Top 100
Most Adopted PFS
Most Scalable PFS
Open Source GPL v2
Commercial Packaging
Vibrant Community
1 Source: Chris Morrone, Lead of OpenSFS Lustre Working Group, April 2016
Commits per organization: Intel 65%, ORNL* 8%, Seagate* 6%, Cray* 6%, DDN* 3%, Atos* 3%, LLNL* 2%, CEA* 2%, IU 1%, Other 2%
Lines of code per organization: Intel 65%, ORNL 18%, Cray 4%, Atos 2%, Seagate 2%, DDN 2%, IU 1%, CEA 1%, Other 1%
Bioscience: genomic data analysis, modeling and simulations
Government research and defense: government-funded research; surveillance, signal processing, encryption, etc.
Large-scale manufacturing: mechanical, computer-aided design & computer-aided engineering systems
Weather and climate
Energy: seismic processing, reservoir modeling/characterization, sensor data analysis
Finance: fraud detection, Monte Carlo simulations, risk management analysis
Highly complex CGI rendering
Intel® Scalable System Framework for HPC
Intel® FOUNDATION Edition for Lustre* software
Delivers the latest functions and features, fully supported by Intel. Ideal for organizations that prefer to design and deploy their own open source configurations.
Intel® ENTERPRISE Edition for Lustre* software
Maximum performance with minimal complexity and cost for multi-petabyte file systems. Management with Intel® Manager for Lustre* software.
Intel® CLOUD Edition for Lustre* software
Cost-effective access to parallel storage on Amazon Web Services* (AWS) and Microsoft Azure* to boost cloud computing.
Read/Write Heat Map
OST Balance
Metadata Operations
Read/Write Bandwidth
Intel Manager for Lustre
Management Network
High Performance Data Network (InfiniBand*, 10GbE)
Metadata Servers (1–10s)
Object Storage Servers (10s–1000s)
Lustre Clients (1–100,000+)
Object Storage Targets (OSTs)
Metadata Target (MDT)
Management Target (MGT)
Native Lustre* Client for Intel® Xeon Phi™ processor
Intel® Omni-Path Support
Robin Hood
OpenZFS, RAIDz
Hadoop* Adapters
HSM
Lustre w/ZFS – Unique Features
ZFS System Design
Software Installation
Lustre ZFS HA Overview
Raidz2: data + 2-parity data-protection scheme
Raidz3: data + 3-parity data-protection scheme
Vdev: a collection of devices (e.g., a raidz2 9+2 vdev)
Zpool: a collection of vdevs
Zpools become Lustre OSTs
You can have many vdevs in a zpool
L2ARC cache: the ZFS read cache
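A minimal sketch tying these pieces together (pool and device names are illustrative, not from the original deck):
# one 9+2 raidz2 vdev (11 disks) in a pool that will become an OST
zpool create ost1 raidz2 sda sdb sdc sdd sde sdf sdg sdh sdi sdj sdk
# grow the pool by adding a second raidz2 vdev
zpool add ost1 raidz2 sdl sdm sdn sdo sdp sdq sdr sds sdt sdu sdv
# optionally attach a local SSD/NVMe device as L2ARC read cache
zpool add ost1 cache nvme0n1
# inspect the resulting layout
zpool status ost1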
Incredible reliability
– Data is always consistent on disk; silent data corruption is detected and corrected; smart rebuild strategy
Compression
– Maximize usable capacity for increased ROI
Snapshot – support built into Lustre
– Consistent snapshot across all the storage targets without stopping the file system.
Hybrid Storage Pool
– Data is tiered automatically across DRAM, SSD/NVMe and HDD accelerating random & small file read performance
Manageability
– Powerful storage pool management makes it easy to assemble and maintain Lustre storage targets from individual devices
Silent data corruption is a real-world issue: “Data ~= Dada”
Causes:
Interface Design
Manufacturing Defects
Cable Defects
Heat/Power/Vibrations
Software defects
NetApp study*: 1.5 million drives, 41 months, 400,000 errors
* https://atg.netapp.com/wp-content/uploads/2008/03/corruption-fast08.pdf
On write:
– Write data + checksum
On read:
– Read the data, recompute the checksum, and compare to the original
On error:
– If running raidz, discard the bad read and reconstruct the data from the vdev
– Notify the user and continue
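For illustration (pool name assumed), a scrub exercises exactly this read-verify-repair path across the whole pool:
zpool scrub ost1          # re-read every block, verify checksums, repair from parity
zpool status -v ost1      # CKSUM column counts detected errors; -v lists any unrecoverable files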
Enable more space allocation to users
– Minimizes hardware costs; more data in the same footprint
Increase the file transfer rate
– Increase throughput by up to 25%; see Laval University's presentation from HP-CAST 2015: http://www.hp-cast.org/
Compression effects on genomics files
– Text-based output of genomic sequencing systems; a human genome can generate a 600GB file
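As a small sketch (dataset name is illustrative): compression is enabled per dataset, and the realized ratio can be inspected afterwards:
zfs set compression=on ost1/ost0              # enable compression for the dataset
zfs get compression,compressratio ost1/ost0   # show the setting and the achieved ratio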
How Can Lustre* Snapshots Be Used?
Undo/undelete/recover file(s) from the snapshot
Removed a file by mistake, or an application failure left data invalid
Quickly backup the filesystem before system upgrade
A Lustre/kernel upgrade may hit trouble and need to be rolled back
Prepare a consistent frozen data view for backup tools
Ensure system is consistent for the whole backup
ZFS-based Lustre* Snapshot Overview
[Diagram: lctl snapshot commands drive the MGS through the lctl API (userspace Lustre control); the MGS coordinates ZFS control on the MDSs and OSSs, each running the Lustre kernel plus the ZFS tool set]
ZFS snapshot created on each target with a new fsname
Mount as separate read-only Lustre filesystem on client(s)
Architecture details: http://wiki.lustre.org/Lustre_Snapshots
Global Write Barrier
“Freeze” the system while creating snapshot pieces on every target.
Write barrier on MDTs only
No orphans, no dangling references
New lctl commands for the global write barrier
lctl barrier_freeze <fsname> [timeout (seconds)]
lctl barrier_thaw <fsname>
lctl barrier_stat <fsname>
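A minimal usage sketch (fsname and timeout are illustrative):
lctl barrier_freeze myfs 30   # freeze modifications on myfs for up to 30 seconds
lctl barrier_stat myfs        # check the barrier state
lctl barrier_thaw myfs        # release the barrier early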
Two Phase Global Write Barrier Setup
The freeze sequence spans the user, the MGS, and all MDTs (the diagram shows MGS, MDT_0 … MDT_n, and OST_0 … OST_x):
0. User action: lctl barrier_freeze
1. MGS: start FREEZE1
2. MDTs: get the barrier lock from the MGS, block client modifications, flush RPCs
3. MDTs: notify the MGS that FREEZE1 is done
4. MGS: wait for all FREEZE1 completions, then start FREEZE2
5. MDTs: get the barrier lock from the MGS, sync/commit local transactions
6. MDTs: notify the MGS that FREEZE2 is done
7. MGS: wait for all FREEZE2 completions; barrier done
Fork/Erase Configuration Logs
Snapshot is independent from the original filesystem
New filesystem name (fsname) is assigned to the snapshot
The fsname is part of the configuration log names
The fsname appears in configuration log entries
New lctl commands for fork/erase configuration logs
lctl fork_lcfg <fsname> <new_fsname>
lctl erase_lcfg <fsname>
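An illustrative invocation (both fsnames are placeholders):
lctl fork_lcfg myfs snapfs1   # duplicate myfs's configuration logs under the snapshot's fsname
lctl erase_lcfg snapfs1       # remove the forked logs when the snapshot is destroyed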
Mount Snapshot Read-only
Any modification of a ZFS snapshot can trigger a backend failure/assertion
The ZFS dataset is opened in read-only mode:
– Does not start the cross-server sync thread, pre-create thread, or quota thread
– Skips sequence-file initialization, orphan cleanup, and recovery
– Ignores last_rcvd modification
– Denies transaction creation
– Forbids LFSCK
…
Userspace Interfaces – lctl snapshot_xxx
Create snapshot:
lctl snapshot_create [-b | --barrier] [-c | --comment comment] <-F | --fsname fsname> [-h | --help] <-n | --name ssname> [-r | --rsh remote_shell] [-t | --timeout timeout]
Destroy snapshot:
lctl snapshot_destroy [-f | --force] <-F | --fsname fsname> [-h | --help] <-n | --name ssname> [-r | --rsh remote_shell]
Modify snapshot attributes:
lctl snapshot_modify [-c | --comment comment] <-F | --fsname fsname> [-h | --help] <-n | --name ssname> [-N | --new new_ssname] [-r | --rsh remote_shell]
List the snapshots:
lctl snapshot_list [-d | --detail] <-F | --fsname fsname> [-h | --help] [-n | --name ssname] [-r | --rsh remote_shell]
Mount snapshot:
lctl snapshot_mount <-F | --fsname fsname> [-h | --help] <-n | --name ssname> [-r | --rsh remote_shell]
Umount snapshot:
lctl snapshot_umount <-F | --fsname fsname> [-h | --help] <-n | --name ssname> [-r | --rsh remote_shell]
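A minimal end-to-end sketch (fsname, snapshot name, and the client mount details are illustrative):
lctl snapshot_create -F myfs -n snap1 -b   # create a snapshot under a global write barrier
lctl snapshot_list -F myfs -d              # verify the snapshot exists
lctl snapshot_mount -F myfs -n snap1       # mount the snapshot targets on the servers
# on a client, mount the snapshot's auto-assigned fsname read-only, e.g.:
# mount -t lustre -o ro mgsnode@tcp:/<snapshot_fsname> /mnt/snap1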
Write Barrier Scalability
CPU: Intel® Xeon® E5620 @ 2.40GHz
– 4 cores × 2, HT
RAM: 64GB DDR3
Network: InfiniBand QDR
Storage: SATA disk arrays
2 MDTs per MDS
4 OSTs per OSS
[Chart: barrier_freeze time (seconds) vs. MDT count (1, 2, 4, 8) — idle: 14.4, 17.2, 18.4, 17.2; busy: 16.8, 18.2, 20.2, 18.2]
Snapshot I/O Scalability
[Chart: Snapshot Scalability with MDTs — snapshot_create time (seconds) vs. MDT count (1, 2, 4, 8). With barrier — idle: 22.8, 21.4, 28.6, 26; busy: 27.2, 27.6, 30.8, 26.6. Without barrier — idle: 1, 1.4, 1.2, 1.2; busy: 1.8, 1.4, 1.6, 1.6]
[Chart: Snapshot Scalability with OSTs — snapshot_create time (seconds) vs. OST count (2, 4, 8, 16). With barrier — idle: 20.8, 21, 25.3, 21.2; busy: 25.4, 29.2, 30, 32.6. Without barrier — idle: 1.8, 1.4, 1.6, 1.6; busy: 1.4, 1.6, 2, 2]
I/O Performance With Snapshots
Limited impact on metadata performance
– Measured via mds-survey on single MDT
– Slight benefit as changed blocks not freed
No significant impact on I/O performance
– Measured via obdfilter-survey on one OST
Not Lustre* specific; ZFS is COW-based
[Chart: Metadata performance impact (objects/second) — destroy: 15,926 with snapshot vs. 14,903 without; create: 14,817 with snapshot vs. 13,500 without]
Next Steps for Snapshot Feature
Phase I: scheduled to land in the Community Lustre 2.10/EE 3.0 release
Phase II: Lustre* integrated snapshot
– Depends on users’ requirements vs. other Lustre features, performance, etc.
– More controllable and relatively independent solution
– Reuse Phase I global write barrier
– Integrate snapshot creation/mount/unmount into OSD
– Identify files/objects in each snapshot as part of File Identifier (FID)
L2ARC Cache is supported
Read Cache
Local NVMe/SSD
1 L2ARC per zpool
Read test:
– 3.8 million 64K files
– 16 clients
– 16-HDD raidz2 zpool
– 1 DC P3700 NVMe
ZFS manages:
– Disks: raidz2, raidz3, mirror…
ZFS allows:
– OST and disk management in one place
Large OST example:
– 4 × (9+2 raidz2 vdevs)
– 1 OST per OSS possible
– Single storage management interface
Changes for using ZFS more efficiently
– Improved file create performance
– Snapshots of whole file system
Changes to core ZFS code
– Inode quota accounting
– Multi-mount protection for safety
– System and fault monitoring improvements
– Large dnodes for improved extended attribute performance
– Reduce CPU usage with hardware-assisted checksums, compression
– Declustered parity & distributed hot spares to improve re-silvering
– Metadata allocation class to store all metadata on SSD/NVRAM
Path to Exascale
CORAL and future follow-on architectures are scoped with ZFS.
LLNL Sequoia1 (55PB File System)
Cheaper, less complex, higher performance file system for Sequoia
With Intel, Lustre and ZFS continue to advance
Collaborate with OpenZFS community on new features.
Improve metadata performance: LAD’16 Talk
1 http://computation.llnl.gov/projects/zfs-lustre
Native Encryption: built-in encryption for data at rest to provide enhanced storage security.
Persistent Read Cache: update of the existing L2ARC read cache to persist data across reboots.
Performance Enhancements: ZFS improvements for increased metadata performance.
Fault Management: enhanced fault monitoring and management architecture for ZFS.
D-RAID: de-clustered RAIDZ provides massively improved rebuild performance after a drive failure.
Parity acceleration: using AVX instructions to accelerate parity calculation.
Intel IPCC / OpenZFS
The CDDL (the license of OpenZFS) and GPLv2 (the license of Linux) are considered incompatible by the FSF (the authors of the GPL; see https://www.gnu.org/licenses/license-list.html#CDDL), but this does not prohibit end users from using OpenZFS and Linux together in ways that don’t invoke that incompatibility. Intel does not distribute compiled binaries of OpenZFS kernel modules for Linux. Intel provides DKMS packages which help our customers automatically build OpenZFS modules from source for use on their own systems. Consider seeking legal advice for any activities that might be considered “distribution” under GPLv2.
1 × 12Gb SAS port ~ 4GB/s block level
1 × 2 12Gb SAS ports ~ 6GB/s block level (x8 PCIe limitation)
2 × 2 12Gb SAS ports ~ 12GB/s block level
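For rough sizing (assuming each port is a 4-lane SAS wide port, which the figures above imply): raw line rate per port is 4 × 12 Gb/s = 48 Gb/s ≈ 6 GB/s, of which roughly 4 GB/s survives as block-level throughput after protocol overhead; two ports behind a single HBA are then capped near 6 GB/s by its x8 PCIe slot rather than by the SAS links themselves.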
Decide on how many spare drives
Understand internal JBOD SAS layout
Develop a specific strategy for alignment
– Which SAS port will control which group of drives
Raidz2 9+2 or larger data-drive counts for best write performance
– 60-drive JBOD ~ 13+2 × 4
– 90-drive JBOD ~ 9+2 × 8 (plus 2 hot spares)
– 84-drive JBOD ~ 12+2 × 7 (imbalanced)
Consider spares
Use a 1M record size on OSTs
– Important for performance
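As a worked example of the capacity trade-off: a 9+2 raidz2 vdev devotes 2 of every 11 drives to parity, so parity overhead is 2/11 ≈ 18% of raw capacity, while a 13+2 layout lowers that to 2/15 ≈ 13% at the cost of wider stripes and longer rebuilds.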
How to connect enough SAS Cables?
Multi-path:
– Configure priority-path failover groups
– Round-robin kills performance
– Align vdevs to specific paths
– Not documented by Red Hat
– Partners own their multipath configuration
Zoning:
– Zone the JBOD on vdev alignments
– A cable pull requires HA failover
ECC memory is mandatory
CPU has more duties than with LDISKFS:
– Parity
– Compression
– CRC
The ZFS Adaptive Replacement Cache (ARC) needs memory
– 128GB+ recommended
obdfilter-survey will run a few GB/s less than IOR
– Very helpful for development
Install Process FE/Community
yum -y install kernel-devel dkms-2.2.0.3-30.git.7c3e7c5.el7.noarch.rpm
# install and configure the fabric (MOFED/IFS)
cd archive/artifacts/RPMS/x86_64/
yum -y install lib*.rpm lustre-osd-zfs-mount*.x86_64.rpm
yum -y install spl-dkms-*.noarch.rpm zfs-dkms-*.rpm lustre-dkms-*.noarch.rpm
yum -y install lustre-2.7.*.rpm zfs-0.6.5*.rpm spl-0.6.5*.rpm lustre-osd-zfs*.rpm
Intel EE 3.1: add servers in the IML GUI
Lustre master: build ZFS and Lustre from source (not covered)
zpool create ost1 raidz2 /dev/mapper/mpath1 …….
zpool add ost1 raidz2 /dev/mapper/mpath12 …….
zpool status
– Shows all zpools and their current status
zpool export ost1
– Exports the zpool (makes it available for HA import)
zpool import ost1
– Imports the named zpool
Zpools become Lustre targets
Drive Naming Considerations
ZFS Services and HA
Corosync and Pacemaker
ZFS Lustre resource type
Lustre* and OpenZFS* Installation and Configuration Guide
Complete instructions
/dev/sdX won’t work……
– Changes on reboot / other HA server
– Single SAS path
/dev/disk/by-id/
– Defaults to the first path found
/dev/disk/by-path
– Maybe, but zoning is better
/dev/mapper/mpath
– Sync multipath settings between servers
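A minimal sketch of pinning stable names in /etc/multipath.conf (the WWID below is a placeholder; list real ones with multipath -ll), keeping the file identical on both HA nodes:
multipaths {
    multipath {
        wwid  3600508b400105e210000900000490000   # placeholder WWID
        alias mpath1                              # appears as /dev/mapper/mpath1
    }
}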
A zpool visible on 2 nodes (via zpool status) will quickly corrupt the file system
No multi-mount protection yet (MMP feature in development)
– Extreme care is required
Partial MMP protection via “hostid”:
# genhostid
# reboot
Requires “zpool import -f” to override
ZFS caches pool configuration so pools are re-imported on reboot
Disable this behavior for HA:
zpool create -f -o ashift=12 -o cachefile=none ostN driveA….
rm /etc/zfs/zpool.cache
systemctl disable zfs.target
Test:
– Create and export the zpool
– Import it on the other HA node
– Reboot
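A sketch of that test (node and pool names are illustrative):
# on oss1: release the pool
zpool export ost1
# on oss2: take it over and verify
zpool import ost1
zpool status ost1
# reboot either node: with cachefile=none and zfs.target disabled,
# the pool must NOT be auto-imported anywhere after boot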
Intel EE 3.1 (some details still TBD)
– In the IML GUI, create the file system with zpools as targets
Intel FE/Community
– Create the Lustre FS as normal, with ZFS syntax:
mkfs.lustre --ost --backfstype=zfs --fsname=$FSNAME \
--servicenode=$OSS1@tcp --servicenode=$OSS2@tcp \
--index=$index --mgsnode=$MGS_NID <Zpool_Name>/OSTX
zfs set recordsize=1M <Zpool_Name>/OSTX
zfs set compression=on <Zpool_Name>/OSTX
For non-HA testing, mount and use targets as normal:
mount -t lustre <pool_name>/ostX /mnt/ostX
Pacemaker Lustre ZFS resource setup
Install the LustreZFS file (get it from LU-8455):
# cp LustreZFS /usr/lib/ocf/resource.d/heartbeat/
# chmod 755 /usr/lib/ocf/resource.d/heartbeat/LustreZFS
This creates a ZFS Lustre target resource type for the HA framework
Configure a 2nd direct interface between HA pairs
– Ring0 = management network
– Ring1 = direct connect between HA nodes
# pcs cluster auth $OSS1 $OSS2 -u hacluster -p $HA_PW --force
# pcs cluster setup --start --name lustre_ha $OSS1,$OSS1RING2 $OSS2,$OSS2RING2 --token 17000 --join 100 --force
# pcs cluster enable --all
Shoot The Other Node In The Head (STONITH)
– On an automated action, one node shuts the other down
– Via IPMI or a Power Distribution Unit (PDU)
IPMI example:
# pcs stonith create a-ipmi fence_ipmilan ipaddr="$IPMI1" lanplus=true \
    passwd="$IPMI1PW" login="root" pcmk_host_list="$OSS1HN"
# pcs stonith create b-ipmi fence_ipmilan ipaddr="$IPMI2" lanplus=true \
    passwd="$IPMI2PW" login="root" pcmk_host_list="$OSS2HN"
# pcs cluster sync
Add Lustre ZFS resources to the HA service
– Export all zpools from all servers
– Create mount points on both servers
# pcs resource create OSTX ocf:heartbeat:LustreZFS pool=<zpool_name> \
    volume=OSTX mountpoint="/mnt/OSTX"
# pcs constraint location OSTX prefers $OSS1=10
# pcs constraint location OSTX prefers $OSS2=20
# pcs cluster sync
Repeat for all Lustre targets
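Once the resources exist, a failover can be exercised with pcs (a sketch; resource and node names as above):
# pcs status                      # confirm OSTX is started on its preferred node
# pcs cluster standby $OSS2       # push resources off $OSS2
# pcs status                      # OSTX should migrate to $OSS1
# pcs cluster unstandby $OSS2     # allow resources back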
Create zpools that perform well
– Raidz2 9+2 or larger
– Use identifiers for drives (zoning / multi-path)
– Reproducible placement of OSTs
Intel EE 3.1+
– Add servers in IML -> create pools on the command line -> use IML to create the FS
Community / FE builds
– Add Lustre targets to Pacemaker as LustreZFS resource types
– Set constraints
– Sync the cluster
Recommended