Agenda
• Technical Infrastructure
  – Cluster layout
  – Compute
    • Sandy Bridge
  – Management
    • xCAT (provisioning tool)
  – Interconnect
    • FDR Mellanox
  – Storage
    • GPFS
  – Software stack
• Intel Cluster Studio
  – Compiler
    • Optimization methodology
  – MPI/OpenMP
    • Features and optimizations
  – Math Library (MKL)
  – Debugging
    • Parallel application debugging
  – Profiling/Tracing
    • VTune trace analyzer
• Job Scheduler / Cluster Manager
  – LSF
    • Basic architecture
    • Current configuration
    • Scheduling policies
    • Troubleshooting
    • Profiling
    • Queues and priorities
    • Fault tolerance
    • Submission and management
  – Hands On

© 2013 IBM Corporation
High Performance Computing Segments
• Energy: Seismic Analysis, Reservoir Analysis
• Finance: Derivative Analysis, Actuarial Analysis, Asset Liability Management, Portfolio Risk Analysis, Statistical Analysis
• Mfg: Mechanical/Electric Design, Process Simulation, Finite Element Analysis, Failure Analysis
• Life Sciences: Drug Discovery, Protein Folding, Medical Imaging
• Media: Digital Rendering, Gaming, Bandwidth Consumption
• Gov't: Collaborative Research, Weather Analysis, High Energy Physics
High performance computing stack (top to bottom):
• Applications
• Tools, Compilers
• Scientific Libraries, Message Passing Interface
• Job Scheduler, Cluster Administration
• Parallel File System
• Operating System
• Hardware
Cluster Overview
• 800 TeraFlops IBM iDataPlex high performance computing cluster, featuring 38,144 Intel Sandy Bridge cores and 149 TB of memory
• The login and compute nodes are each populated with two Intel Sandy Bridge 8-core processors.
• FDR-14 InfiniBand interconnect in a fat-tree configuration serves as the high-speed network for MPI messages and I/O traffic.
• For the high performance parallel file system we used GPFS, one of the most stable and highly reliable parallel file systems for HPC clusters.
• Each compute node has two 8-core processors (16 cores) running its own Red Hat Enterprise Linux OS and sharing 64 GB of memory.
• The cluster is intended to be used for batch-scheduled jobs.
• All executions that require large amounts of system resources must be sent to the compute nodes by batch job submission through the job scheduler.
IBM System x iDataPlex Compute Building Block
• 72 x IBM System x iDataPlex dx360 M4 servers
  – 2 x E5-2670 SandyBridge-EP 2.6 GHz/1600 MHz, 20 MB cache, 8-core
  – 8 x 8 GB DDR3-1600 DIMMs (4 GB/core), total 64 GB/node
  – Dual-port InfiniBand FDR-14 mezzanine card
• 4 x Mellanox 36-port managed FDR-14 IB switches
  – 4 leaf IB switches
  – 18 compute nodes connected to each leaf switch
  – 18 uplinks from every leaf switch connect to the IB main switches
• Management network
  – 2 x BNT RackSwitch G8052F
  – 4 x 1 Gb connections from each switch act as uplinks for reliable flow of management traffic
• IBM System x iDataPlex rack with RDHX (water cooling)
• Performance
  – 2.60 GHz x 8 FLOPs/cycle (AVX) = 20.8 GFlops/core
  – 16 cores x 20.8 GFlops/core = 332.8 GFlops/node
  – 72 nodes x 332.8 GFlops/node = 23.96 TFlops/rack
IBM System x iDataPlex dx360 M4 Compute Node
• iDataPlex rack server
• 1U node density: 84 nodes / 84U rack
• Supports SSI planars (EP & EN)
• Shared power: Common Form Factor (CFF)
• Shared cooling: 80 mm fans
• HPC nodes include 2x 1GbE down and 10GbE
• 40G/QSFP IB mezzanine card option
Intel SandyBridge microprocessor
• Architecture ("Tock" cycle) features:
  – Up to 8 cores per socket
  – AVX vector units (double the peak FP performance)
  – Larger and faster caches
  – Improved TLB (Translation Lookaside Buffer)
  – Higher memory bandwidth per core
  – Enhanced Turbo Mode
  – Enhanced Hyper-Threading mode
  – …
SandyBridge-EP model
Intel Processors Evolution
• Tock = new microarchitecture
  – Westmere → Sandy Bridge
• Tick = new manufacturing process
  – (shrink of lithography)

Intel Tick-Tock Development Model: Sustained Microprocessor Leadership
• Intel Core Microarchitecture: Merom (65 nm, Tock) → Penryn (45 nm, Tick)
• Intel Microarchitecture Codename Nehalem: Nehalem (45 nm, Tock) → Westmere (32 nm, Tick)
• Intel Microarchitecture Codename Sandy Bridge: Sandy Bridge (32 nm, Tock) → Ivy Bridge (22 nm, Tick)
• Intel Microarchitecture Codename Haswell: Haswell (22 nm, Tock) → Broadwell (14 nm, Tick)
SandyBridge-EP microprocessor
• In addition, Sandy Bridge introduces support for AVX (Advanced Vector Extensions) within an updated execution stack, enabling 256-bit floating point (FP) operations to be decoded and executed as a single micro-operation (uOp).
• The effect of this is a doubling of peak FP capability, sustaining 8 double precision FLOPs/cycle.
• The Sandy Bridge processor integrates a high performance, bidirectional ring architecture interconnecting
  – the CPU cores, Last Level Cache (LLC, or L3), PCIe, QPI, and memory controller
  – able to return 32 bytes of data each cycle
• Each physical LLC segment is loosely associated with a corresponding core
  – but the cache is also shared among all cores as a logical unit
• The ring and LLC are clocked with the CPU core, so compared to the previous generation architecture
  – cache and memory latencies have dropped
  – bandwidths are significantly improved
Turbo Boost
• Turbo Boost allows dynamically increasing the CPU clock speed on demand
  – "Dynamic overclocking"
  – Frequency increases in increments of 100 MHz
    • when the processor has not reached its thermal and electrical limits
    • when the user's workload demands additional performance
    • until a thermal or power limit is reached, or until the maximum speed for the number of active cores is reached
• Important note:
  – On 4-socket systems (like the x3750 M4), the 2.4 GHz CPU will only achieve a 2.8 GHz turbo upside on a 4S-EP (this is intentionally limited by Intel)
  – This is lower than the turbo upside for an equivalent 2-socket EP processor (which would achieve 3.1 GHz).
Turbo Boost: Benefit
["Next Gen Turbo Boost" power-vs-time chart: during sleep or low-power idle periods the system builds up a thermal "energy budget"; it can then run in C0 (Turbo) above TDP for a few seconds, using the accumulated energy budget to enhance user experience; in steady-state conditions the power stabilizes at TDP.]
FLOPS
• FLOPS (or flops, or flop/s) is an acronym meaning FLoating point OPerations per Second.
• FLOPS is a measure of a computer's performance, especially in fields of scientific computing that make heavy use of floating point calculations.
• teraFLOPS => 10^12
• petaFLOPS => 10^15
• exaFLOPS => 10^18
• zettaFLOPS => 10^21
Theoretical Peak (Rpeak) vs. Sustained Peak (Rmax)
Theoretical Peak:
• Theoretical peak is the total FLOPS which a system can perform.
  – Take a system: iDataPlex 2.3 GHz, E5-2670 v3, 2 sockets, 12 cores per socket. How do we calculate the theoretical total FLOPS of this box?
• FLOPS = 24 (cores) * 2.3 (CPU clock speed, GHz) * 16 (FLOPs/cycle) = 883.2 GFlops (1 gigaflops = 10^9 FLOPS)
• Note: 2.3 GHz is 2.3 billion clock cycles per second. Haswell does 16 floating point operations per clock cycle, whereas the previous generation Sandy Bridge / Ivy Bridge does 8 floating point operations per clock cycle.
Sustained Peak:
• This is the maximal performance actually achieved by the system (i.e., Rmax).
InfiniBand Story: from SDR to QDR
• The base data rate for InfiniBand technology has been the "Single Data Rate" (SDR):
  – SDR = 2.5 Gbps per lane (bit) in a link.
    • The standard width of the interface is 4X, i.e., 4 lanes wide.
    • Therefore, the standard SDR bandwidth of a link is 2.5 Gbps times 4 lanes, or 10 Gbps.
  – SDR bandwidth = 10 Gbps
• DDR, or "Double Data Rate", is 20 Gbps per 4X link.
• QDR, or "Quad Data Rate", is 40 Gbps per 4X link.
  – QDR runs 8b/10b encoding, which yields 80% efficiency for every bit of data payload sent across the link.
  – QDR effective rate = 32 Gbps
MPI Performance

                          InfiniBand 4x QDR                 InfiniBand 4x DDR   Ethernet 10G
Latency                   1.5 – 3.0 µs (depending on # hops) Same as QDR         Around 30 µs
Bandwidth (theoretical)   40 Gb/s                            20 Gb/s             10 Gb/s
Bandwidth (actual)        4 GB/s (per node, bidirectional)   2 GB/s              1 GB/s

Notes:
• InfiniBand advertised performance is bidirectional
• QDR: 10 bytes sent for each 8-byte message → actual peak = 80% of theory
GPFS Storage Server
IBM System x GPFS Storage Server: Bringing HPC Technology to the Mainstream
• Better, sustained performance
  – Industry-leading throughput using efficient de-clustered RAID techniques
• Better value
  – Leverages System x servers and commercial JBODs
• Better data security
  – From the disk platter to the client
  – Enhanced RAID protection technology
• Affordably scalable
  – Start small and affordably
  – Scale via incremental additions
  – Add capacity AND bandwidth
• 3-year warranty
  – Manage and budget costs
• IT-facility friendly
  – Industry-standard 42U 19-inch rack mounts
  – No special height requirements
  – Client racks are OK!
• And all the data management / life cycle capabilities of GPFS, built in!
Architecture: Diagram
[NSD clients on a LAN in front of NSD servers attached to SAN storage running GPFS: massive namespace support; seamless capacity and performance scaling; centrally deployed, managed, backed up and grown.]

NSD Servers
• Optimizes storage utilization by centralizing management
• Provides a flexible, scalable alternative to a growing number of NAS appliances
• Highly available grid computing infrastructure
• Scalable information lifecycle tools to manage growing data volumes
Parallel Filesystem
GPFS: a file system for high performance computing; a shared-disk, parallel file system for AIX and Linux clusters.
Number of files:
• 2 billion per file system
• 256 file systems
• Max file system size: 2^99 bytes
• Max file size = file system size
Disk I/O:
• AIX: 134 GB/sec
• Linux: 66 GB/sec
Number of nodes:
• 1 to 8192
Software features: snapshots, replication and multi-site connectivity are included in the GPFS license. There are no license keys besides client and server to add on; you get all of the features up front.

Architecture Stat
• GPFS 2.3, or later, architectural file system size limit
  – 2^99 bytes
  – Current tested limit: ~2 PB
• Total number of files per file system
  – 4,000,000,000 (four billion on a GPFS 3.4-created file system, two billion on 3.2 or earlier GPFS versions)
• Total number of nodes: 8,192
  – A node is in a cluster if:
    • the node shows up in mmlscluster, or
    • the node is in a remote cluster and is mounting a file system in the local cluster
• Maximum number of mounted file systems
  – 256
  – Before GPFS 3.2: 64 file systems
• Maximum disk size
  – Limited by the disk device driver and OS
What GPFS provides
GPFS provides a highly scalable file management infrastructure:
• Optimizes storage utilization by centralizing management
• Provides a flexible, scalable alternative to a growing number of NAS appliances
• Highly available grid computing infrastructure
• Scalable information lifecycle tools to manage growing data volumes
Kernel Extension
• The GPFS kernel extension provides:
  – interfaces to the operating system vnode and VFS.
• Flow:
  – The application makes file system calls to the OS.
  – The OS presents the calls to the GPFS kernel extension.
    • GPFS appears to the application as just another file system.
  – The GPFS kernel extension will either satisfy requests using information already available or send a message to the GPFS daemon to complete the request.
  – The GPFS daemon:
    • performs all I/O and buffer management, including read-ahead for sequential reads and write-behind operations.
    • All I/O is protected by token management to ensure file system consistency.
    • Multi-threaded, with some threads dedicated to specific functions.
      – Examples include space allocation, directory management (insert and removal), and quotas.
    • Disk I/O is initiated on threads of the daemon.
Node Roles
Manager nodes
• Global lock manager
• File system configuration: recovery, adding disks, …
• Disk space allocation manager
• Quota manager
• File metadata manager: maintains file metadata integrity
File system nodes
• Run user programs, read/write data to/from storage nodes
• Implement the virtual file system interface
• Cooperate with manager nodes to perform metadata operations
Storage nodes
• Implement the block I/O interface
• Shared access to file system and manager nodes
• Interact with manager nodes for recovery
ILM tools
• Storage pools
  – A collection of disks or arrays with similar properties that are managed together as a group.
• File placement policies
  – Determine where the file data is placed on creation.
• File management policies
  – Migrate or delete files based on business rules.
• Filesets
  – Logical subtrees within a file system that act as metadata containers for files.

© Copyright IBM Corporation 2010, 2013. All Rights Reserved.
What is a storage pool?
• Two types of storage pools:
  – Internal
  – External
• Internal: a collection of disks or arrays with similar properties that are managed together as a group.
  – Group storage devices and create classes of storage within a file system
  – Match the cost of storage to the value of the data
  – Improved performance
  – Improved reliability
• External:
  – An interface to an external application
What are filesets?
• A fileset is a sub-tree of a file system namespace that provides a means of partitioning the file system to allow administrative operations.
  – In many ways behaves like an independent file system
  – Used to define quotas on data blocks and inodes
• A fileset has a root directory.
  – All files belonging to the fileset are accessible only through this root directory
  – No hard links between filesets are allowed
  – Renames are not allowed to cross fileset boundaries
Policy-based management
• Two types of policies:
  – File placement
  – File management
• File placement policies
  – Determine the initial storage pool for each file's data
    • The data will be striped across all disks in the selected pool
  – Also determine the file's replication factor
• File management policies
  – Determine when a file's data should be migrated
  – Determine where the data should go
Policy rules
• Similar syntax to the SQL-92 standard
• You can have 1 MB of rule text
• Rule order matters
  – Rules are evaluated top to bottom
  – Once a rule matches, processing ends for that file
• You can use built-in functions. Examples:
  – Date: Current_Timestamp, DayOfWeek, DAY(), HOUR()
  – String: LOWER(), UPPER(), LENGTH()
  – Numeric: INT(), MOD()
Rule syntax: Placement policy
• Syntax:
  RULE ['RuleName']
  SET POOL 'PoolName'
  [LIMIT (OccupancyPercentage)]
  [REPLICATE (DataReplication)]
  [FOR FILESET (FilesetName[,FilesetName]...)]
  [WHERE SqlExpression]
• Can be set on attributes you know about a file when it is created
  – Name, location, user
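As a sketch of the syntax above, a pair of placement rules might look like the following. The rule, pool, and fileset names ('logToSsd', 'ssd_pool', 'scratch', 'system') are hypothetical examples, not names from this cluster; remember that rules are evaluated top to bottom, so the catch-all rule goes last.

```
/* Place .log files created in the 'scratch' fileset on the SSD pool,
   as long as that pool is below 90% occupancy */
RULE 'logToSsd'
  SET POOL 'ssd_pool'
  LIMIT (90)
  FOR FILESET ('scratch')
  WHERE LOWER(NAME) LIKE '%.log'

/* Default: everything else goes to the system pool */
RULE 'default' SET POOL 'system'
```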
File management policy processing
• Batch process
• Very efficient metadata scans
• When a batch is executed there are three steps:
  1. Directory scan ("Scan Files")
  2. Rule evaluation ("Apply Rules")
  3. File operations ("Perform File Operations")
• Can operate in parallel over multiple machines
High availability
• Infrastructure
  – Storage
  – SAN
  – Networking
  – Server HW
• GPFS
  – Node location
  – Quorum
  – Replication
  – Multi-site
Replication examples
• Full replication
  – Two failure groups
  – Data and metadata
  – On failure, the file system is all OK
• Metadata replication
  – Replicate only metadata
  – On failure, data is missing but the file system stays mounted
[Diagram: full replication of inodes and data across failure groups 1 and 2 (metadata OK on failure); metadata-only replication across failure groups 3 and 4 (missing data on failure), with failure group 3 holding a descriptor-only disk.]
Reliability: Multiple-site high availability
• Multi-site quorum configuration
• Replicate across sites
• Bandwidth requirements based on application
• Often called: "two sites and a laptop"
• Distributed data
  – Data is distributed across two sites; a third site contains a quorum node for availability
• Sites A and B
  – Contain the core GPFS nodes and storage
  – Multiple quorum nodes in each site
• Site C
  – Contains a single quorum node
  – Serves as tie breaker if one of the other sites becomes inaccessible
  – File system descriptor NSD
[Diagram: Sites A, B and C connected over a WAN, forming a single GPFS system.]
Administration: Node Deletion
• Use mmdelnode to remove a node from a cluster:
  mmdelnode {-a | -N Node[,Node...] | NodeFile | NodeClass}
  – Cannot be the primary or secondary GPFS cluster configuration node (unless removing the entire cluster)
  – Cannot be an NSD server (unless removing the entire cluster)
  – Can be run from any node remaining in the GPFS cluster
  – The GPFS daemon must be stopped on the node being deleted
• Deleting some nodes:
  – Avoid unexpected consequences due to quorum loss
• Deleting a cluster using the mmdelnode command: mmdelnode -a
Adding disks
• Disks are added to a file system using the mmadddisk command:
  mmadddisk Device {"DiskDesc[;DiskDesc...]" | -F DescFile}
            [-a] [-r] [-v {yes|no}]
            [-N {Node[,Node...] | NodeFile | NodeClass}]
• Optionally, rebalance the data (-r) (recommended, but can cause a performance impact while rebalancing).
• The file system can be mounted or unmounted.
• The NSD must be created before it can be added using mmadddisk.
  – Create a new disk (mmcrnsd)
  – Reuse an available disk (mmlsnsd -F)

# mmlsnsd -F

File system   Disk name   Primary node          Backup node
-----------------------------------------------------------
(free disk)   gpfs3nsd    (directly attached)
Changing disk attributes
• Managing disks within a file system
  – Disk errors
  – Performance evaluation
  – Planning for migration
• Modify disk state using the mmchdisk command:
# mmchdisk
Usage:
  mmchdisk Device {resume | start} -a
           [-N {Node[,Node...] | NodeFile | NodeClass}]
or
  mmchdisk Device {suspend | resume | stop | start | change}
           {-d "DiskDesc[;DiskDesc...]" | -F DescFile}
           [-N {Node[,Node...] | NodeFile | NodeClass}]
• Example
  – Restart a disk after fixing a storage failure
Replacing Disks
• A disk can be replaced by a new disk.
  – Need a free NSD as large as or larger than the original
  – Cannot replace a stopped disk
  – Cannot replace a disk if it is the only disk in the file system
  – Do not need to unmount the file system
  – No need to re-stripe
  – File system can be mounted or unmounted
• It is replaced using the mmrpldisk command.
• Usage:
  mmrpldisk Device DiskName {DiskDesc | -F DescFile}
            [-v {yes | no}]
            [-N {Node[,Node...] | NodeFile | NodeClass}]
Deleting a Disk
• Disks are removed from a file system using the mmdeldisk command.
  – Migrates data to the remaining disks in the file system
  – Removes the disk from the file system descriptor
  – Can be run from any node in the cluster
• The mmdeldisk command:
  – Usage:
    mmdeldisk Device {"DiskName[;DiskName...]" | -F DiskFile}
              [-a] [-c] [-r]
              [-N {Node[,Node...] | NodeFile | NodeClass}]
• Usage scenarios:
  – If the disk is not failing and is still readable by GPFS:
    • Suspend the disk (mmchdisk disk_name suspend).
    • Re-stripe to rebalance all data onto other disks (mmrestripefs -b).
    • Delete the disk (mmdeldisk).
  – If the disk is permanently damaged and the file system is replicated:
    • Suspend and stop the disk (mmchdisk disk_name suspend; mmchdisk disk_name stop)
    • Re-stripe and restore replication for the file system, if possible (mmrestripefs -r)
    • Delete the disk from the file system (mmdeldisk)
File system
• mmchfs command
  – Usage:
    mmchfs Device [-A {yes | no | automount}] [-D {posix | nfs4}] [-E {yes | no}]
           [-F MaxNumInodes[:NumInodesToPreallocate]]
           [-k {posix | nfs4 | all}] [-K {no | whenpossible | always}]
           [-m DefaultMetadataReplicas] [-o MountOptions]
           [-Q {yes | no}] [-r DefaultDataReplicas] [-S {yes | no}]
           [-T Mountpoint] [-t DriveLetter] [-V {full | compat}] [-z {yes | no}]
    or
    mmchfs Device -W NewDeviceName
• Cannot modify:
  – Blocksize
  – Logfile (-L LogFileSize in mmcrfs)
  – MaxDataReplicas and MaxMetadataReplicas
  – numnodes
Setting up user quota
• Quotas are set using the mmedquota command.
• Issue mmedquota to explicitly set quotas for users, groups, or filesets:
  mmedquota {-u [-p ProtoUser] User... |
             -g [-p ProtoGroup] Group... |
             -j [-p ProtoFileset] Fileset... |
             -d {-u User... | -g Group... | -j Fileset} |
             -t {-u | -g | -j}}
  – Confirm using the mmrepquota command.
• Example: edit the quota for user user1
  # mmedquota -u user1
  *** Edit quota limits for USR tests
  NOTE: block limits will be rounded up to the next multiple of the block size.
  block units may be: K, M, or G.
  fs1: blocks in use: 0K, limits (soft = 0K, hard = 0K)
       inodes in use: 0, limits (soft = 0, hard = 0)
What is xCAT?
• Extreme Cluster (Cloud) Administration Toolkit
  – Open source Linux/AIX/Windows scale-out cluster management solution
• Design principles
  – Build upon the work of others
    • Leverage best practices
  – Scripts only (no compiled code)
    • Portable
    • Source
  – Vox Populi (voice of the people)
    • Community-requirements driven
    • Do not assume anything
What does xCAT do?
• Remote hardware control
  – Power, reset, vitals, inventory, event logs, SNMP alert processing
  – xCAT can even tell you which light path LEDs are lit up, remotely
• Remote console management
  – Serial console, SOL, logging / video console (no logging)
• Remote destiny control
  – Local/SAN boot, network boot, iSCSI boot
• Remote automated unattended network installation
  – Auto-discovery
    • MAC address collection
    • Service processor programming
    • Remote flashing
  – Kickstart, AutoYaST, imaging, stateless/diskless, iSCSI
• Scales! Think 100,000 nodes.
• xCAT will make you lazy: no need to walk to the datacenter again.
Functionality
• Remote hardware control
  – Power, reset, vitals, inventory, event logs, SNMP alert processing
• Remote console management
  – Serial console, SOL, logging
• Remote destiny control
  – Local boot, network boot, iSCSI boot
• Parallel cluster control
  – Parallel shell, parallel rsync, parallel secure copy, parallel ping
• Remote automated unattended network installation
  – Auto-discovery
    • MAC address collection
    • Service processor programming
  – Remote flashing
  – Kickstart, AutoYaST, imaging, stateless/diskless
• Easy to use and it scales! Think 100,000 nodes.
  – xCAT will make you lazy: no need to walk to the datacenter again
Architecture
• A single xCAT Management Node (MN) for N nodes
  – A single-node DHCP/TFTP/HTTP/NFS server
  – Scales to ~128 nodes
    • If staggered boot is used, this can scale to 1024 nodes (tested)

Scale Infrastructure
• A single xCAT management node with multiple service nodes providing boot services to increase scaling.
• Can scale to 1000s and 10000s of nodes.
• xCAT already provides this support for large diskful clusters, and it can be applied to stateless as well.
• The number of nodes and the network infrastructure determine the number of DHCP/TFTP/HTTP servers required for a parallel reboot with no DHCP/TFTP/HTTP timeouts.
• The number of DHCP servers does not need to equal the number of TFTP or HTTP servers. TFTP servers NFS-mount read-only the /tftpboot and image directories from the management node to provide a consistent set of kernel, initrd, and file system images.
[Diagram: a management node above a row of service nodes (service node01 … service nodeNN, each running DHCP/TFTP/HTTP/NFS hybrid services), each serving a group of compute nodes (node001 … nodennn, nodennn+1 … nodennn+m).]
Tables and Database
• xCAT stores all information about the nodes and subsystems it manages in a database.
  – The xCAT default database is located in /etc/xcat in SQLite tables. xCAT can be instructed to store the tables in MySQL, PostgreSQL or DB2 as well.
• For most installations you won't need to fill even half of the tables!
  – And for the tables that you do need, in most cases you'll only need to put one line in the table!
• There are a lot of tables, but only some tables are common to Linux and AIX; some are only for AIX, some just for monitoring, some for advanced functions (virtual machines, iSCSI settings), …
• xCAT comes with a rich set of functions for manipulating tables.
Provisioning methods
• Stateful, diskful: the OS installer provisions the OS to local disk (HD or flash).
• Stateful, disk-elsewhere: the OS installer provisions the OS to SAN or iSCSI storage.
• Stateless, disk optional: an image is pushed to memory (RAM, CRAM, or NFS root); local disk (HD, flash, RAM, CRAM) is optional ("statelite").
Byte Addressability
• Bytes are always 8 bits.
• Word length typically ranges from 16 to 64 bits.
• Memory location assignments refer to successive byte locations in memory.
  – Memory is byte-addressable.
• For the 8086, a word is 16 bits (2 bytes).
Data Type - C

Type            Storage size   Value range
char            1 byte         -128 to 127 or 0 to 255
unsigned char   1 byte         0 to 255
signed char     1 byte         -128 to 127
int             2 or 4 bytes   -32,768 to 32,767 or -2,147,483,648 to 2,147,483,647
unsigned int    2 or 4 bytes   0 to 65,535 or 0 to 4,294,967,295
short           2 bytes        -32,768 to 32,767
unsigned short  2 bytes        0 to 65,535
long            4 bytes        -2,147,483,648 to 2,147,483,647
unsigned long   4 bytes        0 to 4,294,967,295
Fortran 90 Type               Description                       Range
integer(1)                    8-bit signed integer              -128 to +127
integer(2)                    16-bit signed integer             -32,768 to +32,767
integer [integer(4)]          32-bit signed integer             -2,147,483,648 to +2,147,483,647
integer(8)                    64-bit signed integer             +/- 9.22 x 10^18
real [real(4)]                32-bit floating point             +/- (1.1754 x 10^-38 to 3.4028 x 10^38)
double precision [real(8)]    64-bit floating point             +/- (2.2250 x 10^-308 to 1.7976 x 10^308)
character                     8-bit ASCII/ISO                   ASCII 0 (NUL) to 255 (y-umlaut)
logical                       .true. or .false.                 false to true
complex [complex(4)]          two 32-bit floating point values  same as real
double complex [complex(8)]   two 64-bit floating point values  same as double precision
Word Length
• For the 8086, a word is 16 bits (2 bytes).

A signed integer:
  b15 b14 … … … … b1 b0
  <------------------------ 16 bits ------------------------>

Two characters:
  <-- ASCII character (8 bits) --> <-- ASCII character (8 bits) -->
Little-Endian Formatting
• There are two ways that byte addresses can be assigned across words:
  – Big-endian: higher byte addresses are used for the less significant bytes of a word.
  – Little-endian: lower byte addresses are used for the less significant bytes of a word.
  – The Intel architecture uses little-endian.
  – The Power architecture uses big-endian.
Little-Endian Assignments
.data
List BYTE 10, 20, 30, 40, 50, 60

Offset  Value
0005    60
0004    50
0003    40
0002    30
0001    20
0000    10
*Note that with byte-size data there is no difference between big-endian and little-endian: the big-endian assignments are identical.

Little-Endian Assignments
.data
List WORD 1020h, 3040h, 5060h

Offset  Value
0005    50
0004    60
0003    30
0002    40
0001    10
0000    20
*Note that with word-size data, the low-order byte is stored at a lower address.

Big-Endian Assignments
.data
List WORD 1020h, 3040h, 5060h

Offset  Value
0005    60
0004    50
0003    40
0002    30
0001    20
0000    10
*Note that with word-size data, the low-order byte is stored at a higher address.

Little-Endian Assignments
.data
List DWORD 10203040h, 50607080h

Offset  Value
0005    70
0004    80
0003    10
0002    20
0001    30
0000    40
*Note that with doubleword-size data, the low-order byte is stored at a lower address.

Big-Endian Assignments
.data
List DWORD 10203040h, 50607080h

Offset  Value
0005    60
0004    50
0003    40
0002    30
0001    20
0000    10
*Note that with doubleword-size data, the low-order byte is stored at a higher address.
Environment variable - F_UFMTENDIAN
Examples:
• F_UFMTENDIAN=big
  – All input/output operations perform conversion from big-endian to little-endian on READ and from little-endian to big-endian on WRITE.
• F_UFMTENDIAN="little;big:10,20"
• F_UFMTENDIAN=big:10,20
• F_UFMTENDIAN=10,20
  – The input/output operations perform big-endian to little-endian conversion only on unit numbers 10 and 20.
• F_UFMTENDIAN="big;little:8"
  – No conversion occurs on unit number 8. On all other units, the input/output operations perform big-endian to little-endian conversion.
• F_UFMTENDIAN=10-20
  – The input/output operations perform big-endian to little-endian conversion on units 10, 11, 12, …, 19, 20.
Sample code

      character*30 file1
      file1 = '/home/gpfs1/p6.data'
! units 10 and 20 are the unit numbers used in the F_UFMTENDIAN examples
      open (unit=10, file=file1, status='old', readonly)
      open (unit=20, file='Lab4.output', status='new')
Compiler flag
• ifort -convert big_endian convert_data.f90
  – The Intel compiler has a "-convert big_endian" flag to read and write big-endian files.