Life Sciences · Risk Analysis · Simulation · Big Data Analytics · High Performance Computing
Technology Compass
High Performance Computing 2015/16
1980 marked the beginning of a decade where numerous startups
were created, some of which later transformed into big players in
the IT market. Technical innovations brought dramatic changes
to the nascent computer market. In Tübingen, close to one of Germany’s oldest and most renowned universities, transtec was founded.
In the early days, transtec focused on reselling DEC computers
and peripherals, delivering high-performance workstations to
university institutes and research facilities. In 1987, SUN/Sparc
and storage solutions broadened the portfolio, enhanced by
IBM/RS6000 products in 1991. These were the typical worksta-
tions and server systems for high performance computing then,
used by the majority of researchers worldwide.
In the late 90s, transtec was one of the first companies to offer
highly customized HPC cluster solutions based on standard Intel
architecture servers, some of which entered the TOP 500 list of
the world’s fastest computing systems.
Given this background and history, it is fair to say that transtec looks back on more than 30 years of experience in scientific computing; our track record shows more than 750 HPC installations. With this experience, we know exactly what customers’ demands are and how to meet them. High performance and ease of management – this is what customers require today.
HPC systems are, as their name indicates, expected to deliver peak performance, but that is not enough: they must also be easy to handle. Unwieldy design and operational complexity must be avoided, or at least hidden from administrators and especially from users of HPC systems.
This brochure focuses on where transtec HPC solutions excel.
transtec HPC solutions use the latest and most innovative technology: Bright Cluster Manager as the technology leader for unified HPC cluster management, the leading-edge Moab HPC Suite for job and workload management, Intel Cluster Ready certification as an independent quality standard for our systems, and Panasas HPC storage systems for the high performance and ease of management required of a reliable HPC storage system. With these components, usability, reliability and ease of management are central issues that are addressed, even in a highly heterogeneous environment. transtec is able to provide customers with well-designed, extremely powerful solutions for Tesla GPU computing, as well as thoroughly engineered Intel Xeon Phi systems. Intel’s InfiniBand Fabric Suite makes managing a large InfiniBand fabric easier than ever before, and Numascale provides excellent technology for AMD-based large-SMP systems. transtec masterfully combines excellent, well-chosen, readily available components into a fine-tuned, customer-specific, and thoroughly designed HPC solution.
Your decision for a transtec HPC solution means you opt for the most intensive customer care and the best service in HPC. Our experts will be glad to bring in their expertise and support to assist you at any stage, from HPC design to daily cluster operations to HPC Cloud Services.
Last but not least, transtec HPC Cloud Services provide customers with the possibility to have their jobs run on dynamically provisioned nodes in a dedicated datacenter, professionally managed and individually customizable. Numerous standard applications
like ANSYS, LS-Dyna, OpenFOAM, as well as lots of codes like
Gromacs, NAMD, VMD, and others are pre-installed, integrated
into an enterprise-ready cloud management environment, and
ready to run.
Have fun reading the transtec HPC Compass 2015/16!
High Performance Computing for Everyone
Interview with Dr. Oliver Tennert, Director of HPC Solutions at transtec AG, discussing the wide area of use of High Performance Computing (HPC), the benefits of turnkey solutions and the increasing popularity of Big Data Analytics.
Dr. Tennert, in which industries have you seen an increase in
demand for high performance computing?
From our perspective, it is difficult to differentiate between specific
segments. In general, the high performance computing market is a
globally expanding sector. The US IT analyst firm IDC predicts average annual growth of approximately 7% over the next five years across all sectors, including traditional HPC-dominated industries such as CAE (Computer-Aided Engineering) and Life Sciences.
For which type of applications is HPC usually deployed? Could
you give us a concrete example of its application?
High performance computing is basically divided into two areas
of application: simulation on the one hand and data analysis on
the other. Take the automotive industry for example.
During the vehicle design and development stage, a great deal of
effort is invested in crash simulations, while the number of material-intensive crash tests, which used to be performed regularly, has been minimized; they are now only conducted for verification purposes.
The analysis of sequenced genome data for hereditary diseases is a prime example of how data analysis is used effectively – as is, from the field of molecular genetics, the measurement of evolutionary relationships between different species or the effects of viruses on the germline.
High performance computing is often criticized on the grounds that an HPC installation is far too expensive and complex for “ordinary” organisations. How would you counter these accusations?
[Grinning:] In the past we have sold HPC solutions to many cus-
tomers who would describe themselves as “ordinary” and who
have been extremely satisfied with the results. But you are in
fact right: operating a complex HPC system involves hiding this complexity from users and administrators. When you
take a look under the hood of a car, you will see that it is also a
complex machine. However, drivers do not have to concern themselves with features such as transmission technology and ABS electronics when driving the vehicle.
It is therefore important to ensure you implement a straightfor-
ward and enterprise-ready management and service concept.
transtec offers an intelligent portfolio of service and manage-
ment concepts which are compatible with the paradigms of
“easy to manage” and “easy to use”.
What are the challenges when launching, installing and operat-
ing an HPC system? What should administrators bear in mind to
ensure that the HPC system remains manageable?
When looking for an HPC systems provider, customers should
make sure that the provider has precisely understood their needs
and requirements. This is what differentiates a solutions provider
such as transtec from a product-centric hardware supplier or
manufacturer. The optimum sizing of the entire solution and the above-mentioned easy manageability have a direct impact on the customer’s or user’s productivity.
Once the system has been delivered, customers want to start pro-
ductive operation as soon as possible. transtec provides custom-
ers with turnkey HPC solutions, which means: depending on the
customer’s requirements, our specialists can manage the complete
installation and integration into the customer’s environment.
What is the role of HPC when deploying cloud computing and
big data analyses in large organisations?
When deploying cloud computing, it is important to make a distinction: what we today call a “private cloud” is essentially – leaving technological details to one side – a tried-and-tested concept in which companies consolidate their computing capacities and provide access to them from one central location.
The use of “public cloud” providers such as Amazon has signifi-
cantly dropped in light of previous NSA reports. Security aspects
have always been a serious obstacle for cloud computing and, in
terms of HPC, the situation becomes all the more complicated
for an industrial organisation as high performance computing is
an essential element in the organisation’s value chain. The data
the organisation uses for work is a vital asset and critical to the
company.
Big Data Analytics, in other words the ability to process and analyse
large data volumes, is essentially nothing new. Recently however,
the amount of data has increased at a significantly faster pace
than the computing capacities available.
For example, with the use of biochemical methods in “next generation sequencing”, complete genome sequences can today be generated and prepared for further analysis at a far faster rate than ever before. The need for HPC solutions has recently become increasingly evident.
While we are on the topic of big data: To
what extent are high-performance, modern
in-memory computing technologies competing
with HPC systems?
This depends on whether you see these technologies as being in direct competition with each other! Shared in-memory databases and the
associated computing technologies are innovative concepts for
the particular application, for example real-time analytics or data
mining. We believe that both are contemporary technologies like
many others that enable the analysis of large data volumes in real-
time. From transtec’s point of view, these are HPC technologies
and some of our customers are already using them on HPC
solutions we have provided.
What is expected of today’s supercomputers? What does the
future hold for the HPC segment – and what follows on from
Petaflop and Petabyte?
When we talk of supercomputers, we are referring to the TOP 500 list of the fastest computers in the world. “Flops” is used as a measure of performance, in other words the number of floating point operations per second. “Tianhe-2” from China currently holds the number one position with an amazing 34 Petaflops, in other words 34 quadrillion floating point operations per second. However, even rank 500, scoring 133 Teraflops (133 trillion Flops), is still an impressive figure. That is roughly the equivalent of the consolidated performance of 2,000 well-configured PCs.
Exabyte follows Petabyte. Over the last few years, the HPC market has widely discussed what kind of system would be needed to manage these data volumes, particularly in terms of capacity, power supply and cooling. Significant technological advances are expected to be required before this type of supercomputer can be produced. Sheer mass alone will not be able to manage these volumes.
Keyword - security: What is the best way to protect HPC systems
in an age of the NSA scandal and ever increasing cybercrime?
Basically, the same applies to HPC systems as to company notebooks and workstations where IT security is concerned: access by unauthorised parties must be prevented by deploying secure IT structures. As described above, enormous volumes of critical organisation data are stored on an HPC system, which is why efficient data management with effective access control is essential in HPC environments.
The network connections between the data centre and user
workstations must also be secured on a physical level. This
especially applies when the above-mentioned “private cloud”
scenarios are set up across various locations to allow them to
work together regardless of the particular country or time zone.
„Petabyte is followed by Exabyte, although before that, several technological jumps need to take place.“
Technology Compass
Table of Contents and Introduction

High Performance Computing
  Performance Turns Into Productivity
  Flexible Deployment With xCAT
  Service and Customer Care From A to Z

Vector Supercomputing
  The NEC SX Architecture
  History of the SX Series

Advanced Cluster Management Made Easy
  Easy-to-use, Complete and Scalable
  Cloud Bursting With Bright

Intelligent HPC Workload Management
  Moab HPC Suite – Enterprise Edition
  Moab HPC Suite – Grid Option
  Optimizing Accelerators with Moab HPC Suite
  Moab HPC Suite – Application Portal Edition
  Moab HPC Suite – Remote Visualization Edition

Remote Visualization and Workflow Optimization
  NICE EnginFrame: A Technical Computing Portal
  Desktop Cloud Virtualization
  Cloud Computing
  NVIDIA GRID
  Virtual GPU Technology

Intel Cluster Ready
  A Quality Standard for HPC Clusters
  Intel Cluster Ready builds HPC Momentum
  The transtec Benchmarking Center

Parallel NFS
  The New Standard for HPC Storage
  What’s New in NFS 4.1?
  Panasas HPC Storage

BeeGFS – A Parallel File System

NVIDIA GPU Computing
  What is GPU Computing?
  Kepler GK110/210 GPU Computing Architecture
  A Quick Refresher on CUDA

Intel Xeon Phi Coprocessor
  The Architecture
  An Outlook on Knights Landing and Omni Path

InfiniBand High-Speed Interconnect
  The Architecture
  Intel MPI Library 4.0 Performance
  Intel Fabric Suite 7

Numascale
  NumaConnect Background
  Technology
  Redefining Scalable OpenMP and MPI
  Big Data
  Cache Coherence
  NUMA and ccNUMA

IBM Platform HPC
  Enterprise-Ready Cluster & Workload Management
  What’s New in IBM Platform LSF 8
  IBM Platform MPI 8.1

General Parallel File System (GPFS)
  What is GPFS?
  What’s New in GPFS Version 3.5

Glossary
High Performance Computing (HPC) has been with us from the very beginning of the
computer era. High-performance computers were built to solve numerous problems
which the “human computers” could not handle.
The term HPC just hadn’t been coined yet. More importantly, some of the early principles have changed fundamentally.
High Performance Computing – Performance Turns Into Productivity
Performance Turns Into Productivity

HPC systems in the early days were much different from those
we see today. First, we saw enormous mainframes from large
computer manufacturers, including a proprietary operating
system and job management system. Second, at universities
and research institutes, workstations made inroads and scien-
tists carried out calculations on their dedicated Unix or VMS
workstations. In either case, if you needed more computing
power, you scaled up, i.e. you bought a bigger machine. Today
the term High-Performance Computing has gained a fundamen-
tally new meaning. HPC is now perceived as a way to tackle
complex mathematical, scientific or engineering problems. The
integration of industry-standard, “off-the-shelf” server hardware into HPC clusters facilitates the construction of computer networks of a power that no single system could ever achieve. The new paradigm for parallelization is scaling out.
Computer-supported simulation of realistic processes (so-called Computer-Aided Engineering – CAE) has established itself as a third key pillar in the field of science and research alongside theory
and experimentation. It is nowadays inconceivable that an air-
craft manufacturer or a Formula One racing team would operate
without using simulation software. And scientific calculations,
such as in the fields of astrophysics, medicine, pharmaceuticals
and bio-informatics, will to a large extent be dependent on super-
computers in the future. Software manufacturers long ago rec-
ognized the benefit of high-performance computers based on
powerful standard servers and ported their programs to them
accordingly.
The main advantage of scale-out supercomputers is just that: they are, at least in principle, infinitely scalable. Since they are based on standard hardware components, such a supercomputer can be given more power whenever the computational capacity of the system is no longer sufficient, simply by adding further nodes of the same kind. A cumbersome switch to a different technology can be avoided in most cases.
“transtec HPC solutions are meant to provide customers with unparalleled ease-of-management and ease-of-use. Apart from that, deciding for a transtec HPC solution means deciding for the most intensive customer care and the best service imaginable.”
Dr. Oliver Tennert, Director Technology Management & HPC Solutions
Variations on the theme: MPP and SMP

Parallel computations exist in two major variants today. Applications running in parallel on multiple compute nodes are
frequently so-called Massively Parallel Processing (MPP) appli-
cations. MPP indicates that the individual processes can each
utilize exclusive memory areas. This means that such jobs are
predestined to be computed in parallel, distributed across the
nodes in a cluster. The individual processes can thus utilize the
separate units of the respective node – especially the RAM, the
CPU power and the disk I/O.
Communication between the individual processes is implement-
ed in a standardized way through the MPI software interface
(Message Passing Interface), which abstracts the underlying
network connections between the nodes from the processes.
However, the MPI standard merely requires source code compatibility, not binary compatibility, so an off-the-shelf application usually needs specific versions of MPI libraries
in order to run. Examples of MPI implementations are OpenMPI,
MPICH2, MVAPICH2, Intel MPI or – for Windows clusters – MS-MPI.
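The message-passing idea can be sketched, as an analogy only, with Python’s standard multiprocessing module: each worker is a separate OS process with its own private memory, and results travel back to the parent only as explicit messages, much like MPI ranks exchanging data. A real MPP code would of course use one of the MPI implementations named above through its C or Fortran bindings; this is merely a minimal stand-alone illustration.

```python
# Sketch of message passing between processes with private memory.
# Each worker operates only on its own chunk and sends one explicit
# message back over a Pipe, analogous to an MPI scatter/reduce.
import multiprocessing as mp

def partial_sum(conn, chunk):
    # Worker: compute on private data, then communicate the result.
    conn.send(sum(chunk))
    conn.close()

def scatter_reduce(chunks):
    # Parent: one process per chunk, then reduce the partial results.
    ctx = mp.get_context("fork")  # assumes a Unix-like system
    pipes, procs = [], []
    for chunk in chunks:
        parent_conn, child_conn = ctx.Pipe()
        p = ctx.Process(target=partial_sum, args=(child_conn, chunk))
        p.start()
        pipes.append(parent_conn)
        procs.append(p)
    total = sum(conn.recv() for conn in pipes)  # gather + reduce
    for p in procs:
        p.join()
    return total

if __name__ == "__main__":
    print(scatter_reduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))  # 45
```

The essential point is that no worker ever reads another worker’s memory; all coordination happens through messages, which is exactly what lets MPP jobs spread across the separate nodes of a cluster.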
If the individual processes engage in a large amount of communication, the response time of the network (latency) becomes important. Latency in a Gigabit Ethernet or a 10GE network is typically around 10 µs. High-speed interconnects such as InfiniBand reduce latency by a factor of 10, down to as low as 1 µs. Therefore, high-speed interconnects can greatly speed up total processing.
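The effect of latency on a communication-heavy job can be estimated with back-of-the-envelope arithmetic, reusing the figures quoted above (10 µs for Ethernet, 1 µs for InfiniBand). The message count below is purely illustrative, and bandwidth and computation/communication overlap are ignored:

```python
# Rough time spent purely on message latency for a communication-heavy
# job. Illustrative arithmetic only: ignores bandwidth and overlap.
def latency_cost(n_messages, latency_seconds):
    return n_messages * latency_seconds

N = 10_000_000  # assumed number of small messages during a run
gige = latency_cost(N, 10e-6)       # Gigabit Ethernet / 10GE: ~100 s
infiniband = latency_cost(N, 1e-6)  # InfiniBand: ~10 s
print(f"GigE: {gige:.0f} s, InfiniBand: {infiniband:.0f} s")
```

The factor-of-10 latency advantage translates directly into a factor of 10 less time lost to communication, which is why tightly coupled MPP jobs benefit so strongly from high-speed interconnects.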
The other frequently used variant is called SMP applications.
SMP, in this HPC context, stands for Shared Memory Processing.
It involves the use of shared memory areas, the specific imple-
mentation of which is dependent on the choice of the underlying
operating system. Consequently, SMP jobs generally only run on
a single node, where they can in turn be multi-threaded and thus
be parallelized across the number of CPUs per node. For many
HPC applications, both the MPP and SMP variant can be chosen.
Many applications are not inherently suitable for parallel execu-
tion. In such a case, there is no communication between the in-
dividual compute nodes, and therefore no need for a high-speed
network between them; nevertheless, multiple computing jobs
can be run simultaneously and sequentially on each individual
node, depending on the number of CPUs.
In order to ensure optimum computing performance for these applications, it must be examined how many CPUs and cores deliver the best performance. Applications of this sequential type are typically found in the fields of data analysis or Monte Carlo simulations.
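Such an embarrassingly parallel workload can be sketched with a classic Monte Carlo estimate of π: every batch of random trials is fully independent, so separate batches could run on separate nodes with no communication at all and simply be averaged afterwards. This is a minimal stand-alone sketch, not taken from any particular production code:

```python
# Monte Carlo estimate of pi: sample random points in the unit square
# and count those inside the quarter circle. Batches are independent,
# so they could run on different nodes and be combined afterwards.
import random

def estimate_pi(n_samples, seed=None):
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_samples)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * hits / n_samples

def combine(estimates):
    # "Reduce" step: independent batch results are simply averaged.
    return sum(estimates) / len(estimates)

if __name__ == "__main__":
    batches = [estimate_pi(100_000, seed=s) for s in range(4)]
    print(combine(batches))  # close to 3.14159
```

Because no batch depends on any other, throughput scales with the number of nodes and cores available, with no need for a low-latency interconnect.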
High Performance Computing
Flexible Deployment With xCAT
xCAT as a Powerful and Flexible Deployment Tool
xCAT (Extreme Cluster Administration Tool) is an open source
toolkit for the deployment and low-level administration of HPC
cluster environments, small as well as large ones.
xCAT provides simple commands for hardware control, node discovery, the collection of MAC addresses, and node deployment with local disk installation (diskful) or without (diskless). The cluster configuration is stored in a relational database. Node groups for different operating system images can be defined. Also, user-specific scripts can be executed automatically at installation time.
xCAT Provides the Following Low-Level Administrative Features
- Remote console support
- Parallel remote shell and remote copy commands
- Plugins for various monitoring tools like Ganglia or Nagios
- Hardware control commands for node discovery, collecting MAC addresses, remote power switching and resetting of nodes
- Automatic configuration of syslog, remote shell, DNS, DHCP, and NTP within the cluster
- Extensive documentation and man pages
For cluster monitoring, we install and configure the open source
tool Ganglia or the even more powerful open source solution Na-
gios, according to the customer’s preferences and requirements.
Local Installation or Diskless Installation
We offer a diskful or a diskless installation of the cluster nodes.
A diskless installation means the operating system is hosted partially within the main memory; larger parts may or may not be included via NFS or other means. This approach allows for deploying large numbers of nodes very efficiently, and the cluster is up and running within a very short timescale. Also, updating
the cluster can be done in a very efficient way. For this, only the
boot image has to be updated, and the nodes have to be reboot-
ed. After this, the nodes run either a new kernel or even a new
operating system. Moreover, with this approach, partitioning the cluster can also be done very efficiently, either for testing purposes or for allocating different cluster partitions to different users or applications.
Development Tools, Middleware, and Applications
Depending on the application, optimization strategy, or underlying architecture, different compilers lead to code of very different performance. Moreover, different, mainly commercial, applications require different MPI implementations. And even when the code is self-developed, developers often prefer one MPI implementation over another.
According to the customer’s wishes, we install various compilers, MPI middleware, as well as job management systems like ParaStation, Grid Engine, Torque/Maui, or the very powerful Moab HPC Suite for high-level cluster management.
[Service workflow: individual presales consulting · application-, customer-, and site-specific sizing of the HPC solution · burn-in tests of systems · benchmarking of different systems · continual improvement · software & OS installation · application installation · onsite hardware assembly · integration into customer’s environment · customer training · maintenance, support & managed services]
High Performance Computing
Services and Customer Care from A to Z
transtec HPC as a Service

You will get a range of applications like LS-Dyna, ANSYS, Gromacs, NAMD etc. from all kinds of areas pre-installed, integrated into an enterprise-ready cloud and workload management system, and ready to run. Do you miss your application?
Ask us: [email protected]
transtec Platform as a Service

You will be provided with dynamically provisioned compute nodes for running your individual code. The operating system will be pre-installed according to your requirements. Common Linux distributions like RedHat, CentOS, or SLES are the standard. Do you need another distribution?
Ask us: [email protected]
transtec Hosting as a Service

You will be provided with hosting space inside a professionally managed and secured datacenter where you can have your machines hosted, managed, and maintained according to your requirements. Thus, you can build up your own private cloud. What range of hosting and maintenance services do you need?
Tell us: [email protected]
HPC @ transtec: Services and Customer Care from A to Z

transtec AG has over 30 years of experience in scientific computing and is one of the earliest manufacturers of HPC clusters. For nearly a decade, transtec has delivered highly customized HPC clusters based on standard components to academic and industry customers across Europe, with all the high quality standards and the customer-centric approach that transtec is well known for. Every transtec HPC solution is more than just a rack full of hardware – it is a comprehensive solution with everything the HPC user, owner, and operator need. In the early stages of any customer’s HPC project, transtec experts provide extensive and detailed consulting; customers benefit from our expertise and experience. Consulting is followed by benchmarking of different systems with either specifically crafted customer code or generally accepted benchmarking routines; this aids customers in sizing and devising the optimal and detailed HPC configuration. Each and every piece of HPC hardware that leaves our factory undergoes a burn-in procedure of 24 hours or
more if necessary. We make sure that any hardware shipped meets
our and our customers’ quality requirements. transtec HPC solutions
are turnkey solutions. By default, a transtec HPC cluster has every-
thing installed and configured – from hardware and operating sys-
tem to important middleware components like cluster management
or developer tools and the customer’s production applications.
Onsite delivery means onsite integration into the customer’s pro-
duction environment, be it establishing network connectivity to the
corporate network, or setting up software and configuration parts.
transtec HPC clusters are ready-to-run systems – we deliver, you
turn the key, the system delivers high performance. Every HPC project entails transfer to production: IT operation processes and policies apply to the new HPC system. IT personnel are effectively trained hands-on, introduced to hardware components and software, and to all operational aspects of configuration management.
transtec services do not stop when the implementation project ends. Beyond transfer to production, transtec takes care. transtec
offers a variety of support and service options, tailored to the cus-
tomer’s needs. When you are in need of a new installation, a major
reconfiguration or an update of your solution – transtec is able to
support your staff and, if you lack the resources for maintaining the
cluster yourself, maintain the HPC solution for you.
From Professional Services to Managed Services for daily opera-
tions and required service levels, transtec will be your complete HPC
service and solution provider. transtec’s high standards of perfor-
mance, reliability and dependability assure your productivity and
complete satisfaction. transtec’s offerings of HPC Managed Services
offer customers the possibility of having the complete management
and administration of the HPC cluster managed by transtec service
specialists, in an ITIL-compliant way. Moreover, transtec’s HPC on Demand services help customers access HPC resources whenever they need them, for example because they do not have the possibility of owning and running an HPC cluster themselves, due to lacking infrastructure, know-how, or admin staff.
transtec HPC Cloud Services

Last but not least, transtec’s services portfolio evolves as customers‘ demands change. Starting this year, transtec is able to provide HPC Cloud Services. transtec uses a dedicated datacenter to provide computing power to customers who are in need of more capacity than they own, which is why this workflow model is sometimes called computing-on-demand. With these dynamically provided resources, customers have the possibility to have their jobs run on HPC nodes in a dedicated datacenter, professionally managed and secured, and individually customizable. Numerous standard applications like ANSYS, LS-Dyna, OpenFOAM, as well as lots of codes like Gromacs, NAMD, VMD, and others are pre-installed, integrated into an enterprise-ready cloud and workload management environment, and ready to run.
Alternatively, whenever customers are in need of space for host-
ing their own HPC equipment because they do not have the space
capacity or cooling and power infrastructure themselves, transtec
is also able to provide Hosting Services to those customers who’d
like to have their equipment professionally hosted, maintained, and
managed. Customers can thus build up their own private cloud!
Are you interested in any of transtec’s broad range of HPC related
services? Write us an email to [email protected]. We’ll be happy to
hear from you!
Initially, in the 80s and 90s, the notion of supercomputing was almost a synonym for vector computing. This concept is now back in different flavors: almost every architecture today uses SIMD, the acronym for “single instruction, multiple data”.
Vector Supercomputing – The NEC SX Architecture
History of HPC, and what we learn
During the last years, systems based on the x86 architecture have domi-
nated the HPC market, with GPUs and many-core systems recently
making inroads. But this period is coming to an end, and it is again
time for a differentiated, complementary HPC-targeted system. NEC
will develop such a system.
The Early Days
When did HPC really start? Probably one would think about the
time when Seymour Cray developed high-end machines at CDC
and then started his own company Cray Research, with the first
product being the Cray-1 in 1976. At that time supercomputing was
very special, systems were expensive, and access was extremely
limited for researchers and students. But the business case
was obvious in many fields, scientifically as well as commercially.
These Cray machines were vector computers, initially featuring
only one CPU, as parallelization was far from mainstream, hardly
even a matter of research. And code development was easier than
today: a code of 20,000 lines was already considered "complex" at
that time, and there were not many standard codes and algorithms
available anyway. For this reason code development could adapt to
any kind of architecture to achieve optimal performance.
In 1981 NEC started marketing its first vector supercomputer, the
SX-2, a single-CPU vector architecture like the Cray-1. The SX-2
was a milestone: it was the first CPU to exceed one GFlops of peak
performance.
The Attack of the “Killer-Micros”
In the 1990s the situation changed drastically, and what we see
today is a direct consequence. The “killer-micros” took over, new
parallel, later “massively parallel” systems (MPP) were built based
on micro-processor architectures which were predominantly
targeted at commercial sectors or, in the case of Intel, AMD or
Motorola, at PCs, products which successfully invaded daily life,
partially more than even experts had anticipated. Clearly one could
save a lot of cost by using, or at worst slightly adapting, these
CPUs for high-end systems, and by focusing investments on other
ingredients, like interconnect technology. At the same time Linux
was increasingly adopted and offered a cheap, generally available
OS platform that allowed everybody to use HPC for a reasonable price.
These systems were cheaper for two reasons: the development of
standard CPUs was financed by other businesses, and another
expensive part, the memory architecture with many CPUs addressing
the same flat memory space, was replaced by systems comprised of
many distinct so-called nodes, complete systems in themselves, plus
some kind of interconnect technology, parts of which could also be
taken from standard equipment.
The impact on the user was the necessity to code for these "dis-
tributed memory systems". Standards for coding evolved, PVM
and later MPI, plus some proprietary paradigms. MPI still dominates
today.
At the high end the specialized systems still dominated, because
they had architectural advantages and codes ran with very good
efficiency on them. Most codes were not parallel at all, or at least
not well parallelized, let alone for systems with distributed memory.
The skills and experience needed to create parallel versions of
codes were lacking both in the companies selling high-end machines
and on the customers' side.
So it took some time for the MPPs to take over. But it was inevi-
table, because this situation, sometimes called the "democratization
of HPC", was positive for both commercial and scientific
developments. From that time on, researchers and students in
academia and industry got sufficient access to resources they
could not have dreamt of before.
The x86 Dominance
One could dispute whether the x86 architecture is the most
advanced or elegant, undoubtedly it became the dominant
one. It was developed for mass-markets, this was the strate-
gy, and HPC was a side-effect, sponsored by the mass-market.
This worked successfully and the proportion of x86 Linux clus-
ters in the Top500 list increased year by year. Because of the
mass-markets it made sense economically to invest a lot into
the development of LSI technology (large-scale integration).
The related progress, which led to more and more transistors
on the chips and increasing frequencies, helped HPC applica-
tions, which got more, faster and cheaper systems with each
generation. Software did not have to adapt a lot between gen-
erations and the software developments rather focused initial-
ly on the parallelization for distributed memory, later on new
functionalities and features. This worked for quite some years.
Systems of sufficient size to produce relevant results for typical
applications are cheap, and most codes do not really scale to a
level of parallelism that uses thousands of processes. Some light-
house projects do, after a lot of development effort that is not
easily affordable, but the average researcher or student is often
completely satisfied with something between a few and perhaps a
few hundred nodes of a standard Linux cluster.
In the meantime a typical Intel-based dual-socket node of a standard
cluster already has 24 cores, soon a few more. So parallelization
is increasingly important. There is "Moore's law", formulated by
Gordon E. Moore, a co-founder of the Intel Corporation, in 1965.
It states that the number of transistors in an integrated circuit
doubles roughly every two years (often quoted as every 18 months),
leading to an exponential increase. This statement is often
misinterpreted to mean that the performance of a CPU doubles in the
same period, which is misleading. We observe ever-increasing core
counts; the transistors are just there, so why not? Whether
applications can actually use that many cores is another question;
in principle the potential performance is there.
The "standard benchmark" in the HPC scene is the so-called HPL,
the High Performance LINPACK, on which the famous Top500 list is
based. This HPL benchmark is only one of three parts of the
original LINPACK benchmark, which is often forgotten! But the
other parts did not show such good progress in performance, and
their scalability is limited, so they could not serve to spawn com-
petition, even between nations, which brings a lot of prestige.
Moreover there are at most a few real codes in the world with
characteristics similar to the HPL. In other words: it does not tell
a lot. This is a highly political situation, and probably quite some
money is wasted that would better be invested in brain-ware.
Anyway, it is increasingly doubtful that LSI technology will ad-
vance at the same pace in the future. The lattice constant of silicon
is 5.43 Å. LSI technology is now using 14 nm processes, the Intel
Broadwell generation, which is about 25 times the lattice constant,
with 7 nm LSIs on the long-term horizon. Some scientists as-
sume there is still some room to improve even more, but once
the features on the LSI are only a few atoms thick one
cannot get another factor of ten. Sometimes the impact of some
future "disruptive technology" is claimed …
Another obstacle is the cost of building a chip fab for a new genera-
tion of LSI technology. Certainly there are ways to optimize both
production and cost, and there will be bright ideas, but with cur-
rent generations a fab costs at least many billion dollars. There have
been ideas of slowing down the innovation cycle on that side merely
to be able to recover the investment.
Memory Bandwidth
For many HPC applications it is not the compute power of the
CPU that counts, but the speed at which the CPU can exchange
data with memory, the so-called memory bandwidth. If it is
neglected, the CPUs will wait for data and performance will
decrease. To describe the balance between CPU performance and
memory bandwidth one often uses the ratio between operations and
the number of bytes that can be transferred while they are running,
the "byte-per-flop ratio". Traditionally vector machines have
always been great in this regard, and this will continue as far as
current plans are known. And this is the reason why a vector
machine can realize a higher fraction of its theoretical performance
on a real application than a scalar machine.
Power Consumption
One other, but really overwhelming, problem is the increasing
power consumption of systems. Looking at different genera-
tions of Linux clusters one can assume that a dual-socket node,
depending on how it is equipped with additional features, will
consume between 350 and 400 Watt.
This varies only slightly between generations; perhaps it
will slightly increase. Even in Europe sites are talking about elec-
tricity budgets in excess of 1 MW, some even beyond 10 MW. One
also has to cool the system, which adds something like 10-30%
to the electricity cost! The cost of running such a system for 5 years
will be of the same order of magnitude as the purchase price!
Power consumption grows with the frequency of the CPU; this
is the dominant reason why frequencies are stagnating, if not de-
creasing, between product generations. There are more cores, but
the individual cores of an architecture are getting slower, not faster!
This has to be compensated on the application side by increased
scalability, which is not always easy to achieve. In the past us-
ers easily got higher performance for almost every application
with each new generation; this is no longer the case.
Architecture Development
So the basic facts clearly indicate limitations of LSI technology,
and they have been known for quite some time. Consequently this
leads to attempts to gain additional performance by other means.
And the developments in the market clearly indicate that high-
end users are willing to consider adapting their codes.
To utilize the available transistors, CPUs with more complex func-
tionality are designed, providing SIMD instructions or more
operations per cycle than just an add and a multiply. SIMD stands
for "Single Instruction Multiple Data". Initially at Intel the SIMD in-
structions provided two results in 64-bit precision per cycle rath-
er than one; in the meantime, with AVX, it is four, and on Xeon Phi
already eight.
This makes a lot of sense. The electricity consumed by the "ad-
ministration" of the operations, instruction fetch and decode, configu-
ration of paths on the CPU and so on, is a significant portion of the
power consumption. If that energy is spent not on one operation only
but on two or even four, the energy per operation is obviously
reduced.
In essence, and in order to overcome the performance bottleneck,
the architectures developed in the direction that was already
realized to a larger extent in the vector supercomputers dec-
ades ago!
There is almost no architecture left without such features. But
SIMD operations require application code to be vectorized. Pre-
viously, code developments and programming paradigms did not
care for fine-grain data parallelism, which is the basis of the appli-
cability of SIMD. Thus vector machines could not show their
strength; on scalar code they cannot beat a scalar CPU. But strictly
speaking, there is no scalar CPU left: standard x86 systems do not
achieve their optimal performance on purely scalar code either!
Alternative Architectures?
During recent years GPUs and many-core architectures have been
increasingly used in the HPC market. Although these systems are
somewhat diverse in detail, in any case the degree of parallelism
needed is much higher than on standard CPUs. This points in
the same direction: low-level parallelism is used to address the
bottlenecks. In the case of Nvidia GPUs the individual so-called
CUDA cores, relatively simple entities, are steered by a more
complex central processor, realizing a high level of SIMD paral-
lelism. In the case of the Xeon Phi the individual cores are also less
powerful, but they are used in parallel through coding paradigms
that operate on the level of loop nests, an approach which was
already used on vector supercomputers during the 1980s!
So these alternative architectures not only go in the same
direction, the utilization of fine-grain parallelism; they also in-
dicate that coding styles which suit real vector machines will
dominate future software developments. Note that at least
some of these systems have economic backing from mass mar-
kets: gaming!
History of the SX Series
As already stated, NEC started marketing the first commercially
available generation of this series, the SX-2, in 1981. By now the
latest generation is the SX-ACE.
In 1998 Prof. Tadashi Watanabe, the developer of the SX series
and the major driver of NEC's HPC business, received the Eckert-
Mauchly Award, a renowned US award, for these achieve-
ments, and in 2006 also the "Seymour Cray Award". Prof. Watanabe
was a very pragmatic developer, a knowledgeable scientist and a
visionary manager, all at the same time, and a great person to
talk to. From the beginning the SX architecture was different
from the competing Cray vector machines: it already integrated some
of the features that RISC CPUs of that time were using, but com-
bined them with vector registers and pipelines.
For example, the SX could reorder instructions in hardware on
the fly, which Cray machines could not do. In effect this made
the system much easier to use. The single-processor SX-2 was the
fastest CPU of its time, the first to break the 1 GFlops bar-
rier. Naturally the compiler technology for automatic vectoriza-
tion and optimization was important.
With the SX-3 NEC developed a high-end shared-memory system.
It had up to four very strong CPUs. The resolution of memory ac-
cess conflicts between CPUs working in parallel, and the consistency
of data throughout the whole shared-memory system, are non-triv-
ial problems, and the development of such a memory architecture
was a major breakthrough.
With the SX-4 NEC implemented a high-end system based on CMOS.
It was the direct competitor to the Cray T90, with a huge advantage
with regard to price and power consumption. The T90 used
ECL technology and achieved 1.8 GFlops peak performance. The
SX-4 had a lower frequency because of CMOS, but due to the
higher density of the LSIs it featured a higher level of parallelism
on the CPU, leading to 2 GFlops per CPU with up to 32 CPUs.
The memory architecture to support those CPUs was a major leap
ahead. On this system one could typically reach an efficiency
of 20-40% of peak performance measured across complete
applications, far from what is achieved on standard components
today. With this system NEC also implemented the first genera-
tion of its own proprietary interconnect, the so-called IXS, which
allowed connecting several of these strong nodes into one highly
powerful system. The IXS featured advanced functions like a
global barrier and some hardware fault-tolerance, things which
other companies tackled years later.
With the SX-5 this direction was continued; it was a very suc-
cessful system pushing the limits of CMOS-based CPUs and mul-
ti-node configurations. At that time the efforts on the us-
ers' side to parallelize their codes using "message passing", PVM
and MPI back then, slowly started to surface, so the need for a huge
shared-memory system slowly disappeared. Still, there were quite
some important codes which were only parallel using a shared-
memory paradigm: initially "microtasking", which was vendor-
specific, then "autotasking", with which NEC and some other vendors
could automatically parallelize code, and later the standard
OpenMP. So there was still sufficient need for a highly ca-
pable shared-memory system.
But later it became clear that more and more codes could utilize
distributed-memory systems, and moreover the memory archi-
tecture of huge shared-memory systems became increasingly
complex and thus costly. A memory is made of "banks", in-
dividual pieces of memory which are used in parallel. In order
to provide data at the necessary speed for a CPU, a certain
minimum number of banks per CPU is required, because a bank
needs some cycles to recover after every access. Consequently
the number of banks had to scale with the number of CPUs, so
that the complexity of the network connecting CPUs and banks
grew with the square of the number of CPUs. And if the
CPUs became stronger they needed more banks, which made it
even worse. So this concept became too expensive.
At the same time NEC had a lot of experience in mixing MPI and
OpenMP parallelization in applications, and it turned out that for
the very fast CPUs the shared-memory parallelization was typi-
cally not really efficient for more than four or eight threads. So
the concept was adapted accordingly, which led to the SX-6 and
the famous Earth Simulator, which kept the #1 spot of the Top500
for some years.
The Earth Simulator was a different machine, but based
on the same technology level and architecture as the SX-6.
These systems were the first implementation ever of a vector CPU
on one single chip. The CPU by itself was not the challenge, but
rather the interface to the memory to support the bandwidth needed
for this chip, basically the "pins". This kind of packaging tech-
nology has always been a strong point in NEC's technology portfolio.
The SX-7 was a somewhat special machine: it was based on the SX-6
technology level, but featured 32 CPUs on one shared memory.
The SX-8 was then a new implementation, again with some ad-
vanced features which directly addressed the necessities of cer-
tain code structures frequently encountered in applications. The
communication between the worldwide organization of appli-
cation analysts and the hardware and software developers in Japan
had proven very fruitful, and the SX-8 was very successful in the market.
With the SX-9 the number of CPUs per node was pushed up again,
to 16, because of certain demands in the market. Just as an in-
dication of the complexity of the memory architecture: the
system had 16384 banks! The peak performance of one node was
1.6 TFlops.
Keep in mind that the efficiency achieved on real applications
was still in the range of 10% in bad cases, up to 20% or even more,
depending on the code. One should not compare this peak perfor-
mance with some standard chip today; that would be misleading.
The Current Product: SX-ACE
In November 2014 NEC announced the next generation of the
SX vector product, the SX-ACE. Some systems have already been in-
stalled and accepted, also in Europe.
The main design target was the reduction of power consump-
tion per sustained performance, and measurements have prov-
en that this has been achieved. In line with the observation that
shared-memory parallelization with vector CPUs is useful on four
cores, perhaps eight, but normally not beyond that, the individual
node consists of a single-chip four-core processor, which integrates
the memory controllers and the interface to the communication
interconnect, a next generation of the proprietary IXS.
That way a node is a small board, leading to a very compact de-
sign. The complicated network between the memory and the
CPUs, which accounted for about 70% of the power consumption of
an SX-9, was eliminated. And naturally a new generation of LSI and
PCB technology also brings reductions.
With a memory bandwidth of 256 GB/s this machine realizes
one "byte per flop", a measure of the balance between the pure
compute power and the memory bandwidth of a chip. In compar-
ison, today's typical standard processors offer in the order of
one eighth of this!
Memory bandwidth is increasingly a problem; this is known in
the market as the "memory wall" and has been a hot topic for more
than a decade. Standard components are built for a different
market, where memory bandwidth is not as important as it is for
HPC applications.
Two of these node boards are put into a mod-
ule, eight such modules are combined in one cage, and four of
the cages, i.e. 64 nodes, are contained in one rack. The standard
configuration of 512 nodes consists of eight racks.
NEC has already installed several SX-ACE systems in Europe, and
more installations are ongoing. The advantages of the system,
the superior memory bandwidth and the strong CPU with good effi-
ciency, will provide scientists with performance levels on cer-
tain codes which otherwise could not be achieved. Moreover the
system has proven to be superior in terms of power consumption
per sustained application performance, and this is a real hot top-
ic nowadays.
LSI technologies have improved drastically during the last
two decades, but the ever increasing power consumption of
systems now poses serious limitations. Tasks like fetching in-
structions, decoding them or configuring paths on the chip
consume energy, so it makes sense to execute these activities
not for one operation only, but for a set of such operations.
This is what is meant by SIMD.

Data Parallelism
The underlying structure that enables the use of SIMD instruc-
tions is called "data parallelism". For example the following loop
is data-parallel: the individual loop iterations could be executed
in any arbitrary order, and the results would stay the same:

do i = 1, n
  a(i) = b(i) + c(i)
end do

The complete independence of the loop iterations allows executing
them in parallel, and this is mapped to hardware with SIMD support.
Naturally one could also use other means to parallelize such a
loop, e.g. for a very long loop one could execute chunks of loop
iterations on different processors of a shared-memory machine
using OpenMP. But this will cause overhead, so it will not be effi-
cient for small trip counts, while a SIMD feature in hardware can
deal with such a fine granularity.
As opposed to the example above the following loop, a recursion,
is not data parallel:
do i = 1, n-1
a(i+1) = a(i) + c(i)
end do
Obviously a change of the order of the loop iterations, for ex-
ample the special case of starting with a loop index of n-1 and
decreasing it afterwards, would lead to different results. Therefore
these iterations cannot be executed in parallel.

Pipelining
The following description is very much simplified, but it explains
the basic principles: operations like add and multiply consist of
so-called segments, which are executed one after the other in a
so-called pipeline. While each segment acts on the data for some
kind of intermediate step, only after the execution of the whole
pipeline is the final result available.

It is similar to an assembly line in the automotive industry. Now,
the segments do not need to deal with the same operation at a
given point in time; they can be kept active if sufficient inputs
are fed into the pipeline. Again, in the automotive industry the
different stages of the assembly line work on different cars, and
nobody would wait for a car to be completed before starting the
next one in the first stage of production, thereby keeping almost
all stages inactive. Such a situation would look as follows: with
time, or the number of cycles, extending to the right and five
stages in the vertical direction, red squares mark which stage is
busy at a given cycle. Obviously this is not efficient, because the
white squares represent transistors which consume power without
actively producing results. And the compute performance is not
optimal: it takes five cycles to compute one result.

Instead, just as with the assembly line, this is what we want
to achieve: after the "latency", five cycles, just the length of the
pipeline, we get a result every cycle. Obviously it is easy to feed
the pipeline with sufficient independent input data in the case of
the different iterations of a data-parallel loop.

If this is not possible one could also use different operations
within the execution of a complicated loop body within one
given loop iteration. But typically operations in such a loop body
need results from previous operations of the same iteration: some
operations can be calculated almost in parallel, while others have
to wait for previous calculations to finish because they need their
input. In total the utilization of the pipeline is better, but
not optimal.
This was the situation before the trend to SIMD computing was
revitalized. Before that the advances of LSI technology always
provided an increased performance with every new generation
of product. But then the race for increasing frequency stopped,
and now architectural enhancements are important again.
Single Instruction Multiple Data (SIMD)
Hardware realizations of SIMD have been present since the first
Cray-1, although that was a vector architecture. Nowadays the instruc-
tion sets SSE and AVX in their different generations and flavors
use it, and similarly, for example, CUDA coding is also a way to uti-
lize data parallelism.
The "width" of the SIMD instruction, the number of loop itera-
tions executed in parallel by one instruction, is kept relatively
small in the case of SSE (two for double precision) and AVX (four
for double precision). The implementations of SIMD in stand-
ard "scalar" processors typically involve add-and-multiply units
which can produce several results per clock cycle. They also fea-
ture registers of the same "width" to keep several input parame-
ters or several results.
In a way the pipelines are "doubled" for SSE, but both are activat-
ed by the same single instruction.

Vector Computing
The difference between those SIMD architectures and SIMD with
a real vector architecture like the NEC SX-ACE, the only remaining
vector system, lies in the existence of vector registers.

In order to provide sufficient input to the functional units, the
vector registers have proven to be very efficient. In a way they are
a very natural match to the idea of a pipeline – or assembly line.
Again thinking about the automotive industry, there one would
have a stock of components to constantly feed the assembly line.
When an instruction is issued it takes some cycles for the first
result to appear, but after that a new result arrives every
cycle, until the vector register, which has a certain length, is
completely processed. Naturally this can only work if the loop
iterations are independent, i.e. if the loop is data parallel. It
would not work if one computation needed the result of anoth-
er computation as input while that is still being processed.

In fact the SX vector machines have always combined both
ideas, vector registers and parallel functional units, so-called
"pipe-sets".

There is another advantage of the vector architecture in the SX,
and it proves to be very important. The scalar portion of the
CPU acts independently: while the pipelines are kept
busy with input from the vector registers, the inevitable scalar
instructions which come with every loop can be executed. This
involves checking loop counters, updating address pointers
and the like. That way these instructions essentially do not con-
sume time. The number of cycles available to "hide" those is the
length of the vector registers divided by the number of pipe sets.

In fact idle pipeline slots can partially be hidden as well, because
new vector instructions can be issued in advance to immediately
take over once the vector register with the input data for the
previous instruction is completely processed. The instruction-issue
unit has the complete status information of all available resources,
specifically pipes and vector registers.
High Performance Computing (HPC) clusters serve a crucial role at research-intensive
organizations in industries such as aerospace, meteorology, pharmaceuticals, and oil
and gas exploration.
What they have in common is a requirement for large-scale computations. Bright's
solution for HPC puts the power of supercomputing within the reach of just about any
organization.
Advanced Cluster Management Made Easy
The Bright Advantage
Bright Cluster Manager for HPC makes it easy to build and oper-
ate HPC clusters using the server and networking equipment of
your choice, tying them together into a comprehensive, easy-to-
manage solution.
Bright Cluster Manager for HPC lets customers deploy complete
clusters over bare metal and manage them effectively. It pro-
vides single-pane-of-glass management for the hardware, the
operating system, HPC software, and users. With Bright Cluster
Manager for HPC, system administrators can get clusters up and
running quickly and keep them running reliably throughout their
life cycle – all with the ease and elegance of a fully featured, en-
terprise-grade cluster manager.
Bright Cluster Manager was designed by a group of highly expe-
rienced cluster specialists who perceived the need for a funda-
mental approach to the technical challenges posed by cluster
management. The result is a scalable and flexible architecture,
forming the basis for a cluster management solution that is ex-
tremely easy to install and use, yet is suitable for the largest and
most complex clusters.
Scale clusters to thousands of nodes
Bright Cluster Manager was designed to scale to thousands of
nodes. It does not depend on third-party (open source) software
that was not designed for this level of scalability. Many advanced
features for handling scalability and complexity are included:
Management Daemon with Low CPU Overhead
The cluster management daemon (CMDaemon) runs on every
node in the cluster, yet has a very low CPU load – it will not cause
any noticeable slow-down of the applications running on your
cluster.
Because the CMDaemon provides all cluster management func-
tionality, the only additional daemon required is the workload
manager (or queuing system) daemon. For example, no daemons
for monitoring tools such as Ganglia or Nagios are required. This
minimizes the overall load of cluster management software on
your cluster.

The cluster installer takes you through the installation process and offers advanced options such as "Express" and "Remote".
Multiple Load-Balancing Provisioning Nodes
Node provisioning – the distribution of software from the head
node to the regular nodes – can be a significant bottleneck in
large clusters. With Bright Cluster Manager, provisioning capabil-
ity can be off-loaded to regular nodes, ensuring scalability to a
virtually unlimited number of nodes.
When a node requests its software image from the head node,
the head node checks which of the nodes with provisioning ca-
pability has the lowest load and instructs the requesting node to
download or update its image from that provisioning node.
Synchronized Cluster Management Daemons
The CMDaemons are synchronized in time to execute tasks in
exact unison. This minimizes the effect of operating system jit-
ter (OS jitter) on parallel applications. OS jitter is the “noise” or
“jitter” caused by daemon processes and asynchronous events
such as interrupts. OS jitter cannot be totally prevented, but for
parallel applications it is important that any OS jitter occurs at
the same moment in time on every compute node.
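The underlying idea can be sketched as aligning periodic daemon work to shared wall-clock boundaries, so that on time-synchronized nodes the "noise" occurs everywhere at once (illustrative Python; Bright's actual synchronization mechanism is internal to the product):

```python
import time

# Illustrative sketch: schedule a daemon's periodic housekeeping on
# wall-clock boundaries so that NTP-synchronized nodes all wake at the
# same instant, making their OS jitter coincide cluster-wide.
def seconds_until_next_boundary(interval, now=None):
    """Seconds to sleep until the next multiple of `interval` (epoch time)."""
    now = time.time() if now is None else now
    return interval - (now % interval)

# Two nodes started at slightly different times both wake at t = 240:
print(seconds_until_next_boundary(120, now=130.0))  # 110.0
print(seconds_until_next_boundary(120, now=131.5))  # 108.5
```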
Built-In Redundancy
Bright Cluster Manager supports redundant head nodes and re-
dundant provisioning nodes that can take over from each other
in case of failure. Both automatic and manual failover configura-
tions are supported.
Other redundancy features include support for hardware and
software RAID on all types of nodes in the cluster.
Diskless Nodes
Bright Cluster Manager supports diskless nodes, which can be
useful to increase overall Mean Time Between Failure (MTBF),
particularly on large clusters. Nodes with only InfiniBand and no
Ethernet connection are possible too.
Bright Cluster Manager 7 for HPC
For organizations that need to easily deploy and manage HPC
clusters, Bright Cluster Manager for HPC lets customers deploy
complete clusters over bare metal and manage them effectively.
It provides single-pane-of-glass management for the hardware,
the operating system, the HPC software, and users.
With Bright Cluster Manager for HPC, system administrators can
quickly get clusters up and running and keep them running relia-
bly throughout their life cycle – all with the ease and elegance of
a fully featured, enterprise-grade cluster manager.
Features
� Simple Deployment Process – Just answer a few questions
about your cluster and Bright takes care of the rest. It installs
all of the software you need and creates the necessary config-
uration files for you, so you can relax and get your new cluster
up and running right – first time, every time.
Head Nodes & Regular Nodes
A cluster can have different types of nodes, but it will always
have at least one head node and one or more “regular” nodes.
A regular node is a node which is controlled by the head node
and which receives its software image from the head node or
from a dedicated provisioning node.
Regular nodes come in different flavors.
Some examples include:
� Failover Node – A node that can take over all functionalities
of the head node when the head node becomes unavailable.
� Compute Node – A node which is primarily used for compu-
tations.
� Login Node – A node which provides login access to users of
the cluster. It is often also used for compilation and job sub-
mission.
� I/O Node – A node which provides access to disk storage.
� Provisioning Node – A node from which other regular nodes
can download their software image.
� Workload Management Node – A node that runs the central
workload manager, also known as queuing system.
� Subnet Management Node – A node that runs an InfiniBand
subnet manager.
More types of nodes can easily be defined and configured with
Bright Cluster Manager. In simple clusters there is only one type
of regular node – the compute node – and the head node fulfills
all cluster management roles that are described in the regular
node types mentioned above.
All types of nodes (head nodes and regular nodes) run the same
CMDaemon. However, depending on the role that has been as-
signed to the node, the CMDaemon fulfills different tasks. A node
might, for example, have a “Login Role”, “Storage Role” and “PBS
Client Role”.
The architecture of Bright Cluster Manager is implemented by
the following key elements:
� The Cluster Management Daemon (CMDaemon)
� The Cluster Management Shell (CMSH)
� The Cluster Management GUI (CMGUI)
Connection Diagram
A Bright cluster is managed by the CMDaemon on the head node,
which communicates with the CMDaemons on the other nodes
over encrypted connections.
The CMDaemon on the head node exposes an API which can also
be accessed from outside the cluster. This is used by the cluster
management Shell and the cluster management GUI, but it can
also be used by other applications if they are programmed to ac-
cess the API. The API documentation is available on request.
The diagram below shows a cluster with one head node, three
regular nodes and the CMDaemon running on each node.
Elements of the Architecture
� Can Install on Bare Metal – With Bright Cluster Manager there
is nothing to pre-install. You can start with bare metal serv-
ers, and we will install and configure everything you need. Go
from pallet to production in less time than ever.
� Comprehensive Metrics and Monitoring – Bright Cluster Man-
ager really shines when it comes to showing you what’s going
on in your cluster. Its beautiful graphical user interface lets
you monitor resources and services across the entire cluster
in real time.
� Powerful User Interfaces – With Bright Cluster Manager you
don’t just get one great user interface, we give you two. That
way, you can choose to provision, monitor, and manage your
HPC clusters with a traditional command line interface or our
beautiful, intuitive graphical user interface.
� Optimize the Use of IT Resources by HPC Applications – Bright
ensures HPC applications get the resources they require
according to the policies of your organization. Your cluster
will prioritize workloads according to your business priorities.
� HPC Tools and Libraries Included – Every copy of Bright Cluster
Manager comes with a complete set of HPC tools and libraries
so you will be ready to develop, debug, and deploy your HPC
code right away.
“The building blocks for transtec HPC solutions must be chosen
according to our goals ease-of-management and ease-of-use. With
Bright Cluster Manager, we are happy to have the technology leader
at hand, meeting these requirements, and our customers value that.”
Jörg Walter, HPC Solution Engineer
Monitor your entire cluster at a glance with Bright Cluster Manager
Quickly build and deploy an HPC cluster
When you need to get an HPC cluster up and running, don’t waste
time cobbling together a solution from various open source
tools. Bright’s solution comes with everything you need to set up
a complete cluster from bare metal and manage it with a beauti-
ful, powerful user interface.
Whether your cluster is on-premise or in the cloud, Bright is a
virtual one-stop-shop for your HPC project.
Build a cluster in the cloud
Bright’s dynamic cloud provisioning capability lets you build an
entire cluster in the cloud or expand your physical cluster into
the cloud for extra capacity. Bright allocates the compute re-
sources in Amazon Web Services automatically, and on demand.
Build a hybrid HPC/Hadoop cluster
Does your organization need HPC clusters for technical comput-
ing and Hadoop clusters for Big Data? The Bright Cluster Man-
ager for Apache Hadoop add-on enables you to easily build and
manage both types of clusters from a single pane of glass. Your
system administrators can monitor all cluster operations, man-
age users, and repurpose servers with ease.
Driven by research
Bright has been building enterprise-grade cluster management
software for more than a decade. Our solutions are deployed in
thousands of locations around the globe. Why reinvent the
wheel when you can rely on our experts to help you implement
the ideal cluster management solution for your organization?
Clusters in the cloud
Bright Cluster Manager can provision and manage clusters that
are running on virtual servers inside the AWS cloud as if they
were local machines. Use this feature to build an entire cluster
in AWS from scratch or extend a physical cluster into the cloud
when you need extra capacity.
Use native services for storage in AWS – Bright Cluster Manager
lets you take advantage of inexpensive, secure, durable, flexible
and simple storage services for data use, archiving, and backup
in the AWS cloud.
Make smart use of AWS for HPC – Bright Cluster Manager can
save you money by instantiating compute resources in AWS only
when they are needed. It uses built-in intelligence to create in-
stances only after the data is ready for processing and the back-
log in on-site workloads requires it.
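A sketch of such a burst decision (hypothetical names and threshold, not Bright's built-in logic):

```python
# Illustrative sketch: create cloud instances only when the input data is
# staged and the on-site backlog actually warrants the extra capacity.
# (All names and the threshold value are hypothetical.)
def should_burst(data_ready, queued_jobs, onsite_free_slots, threshold=10):
    backlog = queued_jobs - onsite_free_slots
    return data_ready and backlog > threshold

print(should_burst(True, queued_jobs=50, onsite_free_slots=8))   # True
print(should_burst(True, queued_jobs=12, onsite_free_slots=8))   # False
print(should_burst(False, queued_jobs=50, onsite_free_slots=8))  # False
```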
You decide which metrics to monitor. Just drag a component into the display area and Bright Cluster Manager instantly creates a graph of the data you need.
High performance meets efficiency
Initially, massively parallel systems constitute a challenge to
both administrators and users. They are complex beasts. Anyone
building HPC clusters will need to tame the beast, master the
complexity and present users and administrators with an easy-
to-use, easy-to-manage system landscape.
Leading HPC solution providers such as transtec achieve this
goal. They hide the complexity of HPC under the hood and match
high performance with efficiency and ease-of-use for both users
and administrators. The “P” in “HPC” gains a double meaning:
“Performance” plus “Productivity”.
Cluster and workload management software like Moab HPC
Suite, Bright Cluster Manager or Intel Fabric Suite provide the
means to master and hide the inherent complexity of HPC sys-
tems. For administrators and users, HPC clusters are presented
as single, large machines with many different tuning parameters.
The software can also provide a unified view of existing clusters
whenever the customer adds unified management as a require-
ment, at any point in time after the first installation. Thus,
daily routine tasks such as job management, user management,
queue partitioning and management, can be performed easily
with either graphical or web-based tools, without any advanced
scripting skills or technical expertise required from the adminis-
trator or user.
Bright Cluster Manager unleashes and manages the unlimited power of the cloud
Create a complete cluster in Amazon EC2, or easily extend your
onsite cluster into the cloud. Bright Cluster Manager provides a
“single pane of glass” to both your on-premise and cloud resourc-
es, enabling you to dynamically add capacity and manage cloud
nodes as part of your onsite cluster.
Cloud utilization can be achieved in just a few mouse clicks, with-
out the need for expert knowledge of Linux or cloud computing.
With Bright Cluster Manager, every cluster is cloud-ready, at no
extra cost. The same powerful cluster provisioning, monitoring,
scheduling and management capabilities that Bright Cluster
Manager provides to onsite clusters extend into the cloud.
The Bright advantage for cloud utilization
� Ease of use: Intuitive GUI virtually eliminates user learning
curve; no need to understand Linux or EC2 to manage the system.
Alternatively, cluster management shell provides powerful
scripting capabilities to automate tasks.
� Complete management solution: Installation/initialization,
provisioning, monitoring, scheduling and management in
one integrated environment.
� Integrated workload management: Wide selection of work-
load managers included and automatically configured with
local, cloud and mixed queues.
� Single pane of glass; complete visibility and control: Cloud
compute nodes managed as elements of the on-site cluster; vi-
sible from a single console with drill-downs and historic data.
� Efficient data management via data-aware scheduling: Auto-
matically ensures data is in position at start of computation;
delivers results back when complete.
� Secure, automatic gateway: Mirrored LDAP and DNS services
inside Amazon VPC, connecting local and cloud-based nodes
for secure communication over VPN.
� Cost savings: More efficient use of cloud resources; support
for spot instances, minimal user intervention.
Two cloud scenarios
Bright Cluster Manager supports two cloud utilization scenarios:
“Cluster-on-Demand” and “Cluster Extension”.
Scenario 1: Cluster-on-Demand
The Cluster-on-Demand scenario is ideal if you do not have a clus-
ter onsite, or need to set up a totally separate cluster. With just a
few mouse clicks you can instantly create a complete cluster in
the public cloud, for any duration of time.
Cloud Bursting With Bright
Scenario 2: Cluster Extension
The Cluster Extension scenario is ideal if you have a cluster onsite
but you need more compute power, including GPUs. With Bright
Cluster Manager, you can instantly add EC2-based resources to
your onsite cluster, for any duration of time. Extending into the
cloud is as easy as adding nodes to an onsite cluster – there are
only a few additional, one-time steps after providing the public
cloud account information to Bright Cluster Manager.
The Bright approach to managing and monitoring a cluster in
the cloud provides complete uniformity, as cloud nodes are
managed and monitored the same way as local nodes:
� Load-balanced provisioning
� Software image management
� Integrated workload management
� Interfaces – GUI, shell, user portal
� Monitoring and health checking
� Compilers, debuggers, MPI libraries, mathematical libraries
and environment modules
Bright Cluster Manager also provides additional features that are
unique to the cloud.
Amazon spot instance support – Bright Cluster Manager enables
users to take advantage of the cost savings offered by Amazon’s
spot instances. Users can specify the use of spot instances, and
Bright will automatically schedule as available, reducing the cost
to compute without the need to monitor spot prices and sched-
ule manually.
Hardware Virtual Machine (HVM) – Bright Cluster Manager auto-
matically initializes all Amazon instance types, including Cluster
Compute and Cluster GPU instances that rely on HVM virtualization.
Amazon VPC – Bright supports Amazon VPC setups, which allow
compute nodes in EC2 to be placed in an isolated network, there-
by separating them from the outside world. It is even possible to
route part of a local corporate IP network to a VPC subnet in EC2,
so that local nodes and nodes in EC2 can communicate easily.
Data-aware scheduling – Data-aware scheduling ensures that
input data is automatically transferred to the cloud and made
accessible just prior to the job starting, and that the output data
is automatically transferred back. There is no need to wait for the
data to load prior to submitting jobs (delaying entry into the job
queue), nor any risk of starting the job before the data transfer is
complete (crashing the job). Users submit their jobs, and Bright’s
data-aware scheduling does the rest.
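The ordering guarantee can be pictured with a small wrapper (illustrative Python; `run_data_aware` and the staging callbacks are hypothetical names, not Bright's API):

```python
# Illustrative sketch of data-aware scheduling: the job only starts after
# stage-in completes, and stage-out runs automatically on completion.
def run_data_aware(job, stage_in, compute, stage_out):
    stage_in(job)          # input data is transferred to the cloud first
    result = compute(job)  # ...so the job can never start before its data
    stage_out(result)      # results are transferred back when complete
    return result

events = []
outcome = run_data_aware(
    "job42",
    stage_in=lambda j: events.append("staged-in " + j),
    compute=lambda j: j + "-done",
    stage_out=lambda r: events.append("staged-out " + r),
)
print(events)  # ['staged-in job42', 'staged-out job42-done']
```

Because staging is part of the job's life cycle rather than a manual step, neither of the two failure modes described above (waiting on data, or crashing on incomplete transfers) can occur.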
Bright Cluster Manager provides the choice:
“To Cloud, or Not to Cloud”
Not all workloads are suitable for the cloud. The ideal situation
for most organizations is to have the ability to choose between
onsite clusters for jobs that require low latency communication,
complex I/O or sensitive data; and cloud clusters for many other
types of jobs.
Bright Cluster Manager delivers the best of both worlds: a pow-
erful management solution for local and cloud clusters, with
the ability to easily extend local clusters into the cloud without
compromising provisioning, monitoring or managing the cloud
resources.
One or more cloud nodes can be configured under the Cloud Settings tab.
While all HPC systems face challenges in workload demand, workflow constraints,
resource complexity, and system scale, enterprise HPC systems face more stringent
challenges and expectations.
Enterprise HPC systems must meet mission-critical and priority HPC workload
demands for commercial businesses, business-oriented research, and academic
organizations.
Intelligent HPC Workload Management
Solutions to Enterprise High-Performance Computing Challenges
The Enterprise has requirements for dynamic scheduling, provi-
sioning and management of multi-step/multi-application servic-
es across HPC, cloud, and big data environments. Their workloads
directly impact revenue, product delivery, and organizational ob-
jectives.
Enterprise HPC systems must process intensive simulation and
data analysis more rapidly, accurately and cost-effectively to
accelerate insights. By improving workflow and eliminating job
delays and failures, the business can more efficiently leverage
data to make data-driven decisions and achieve a competitive
advantage. The Enterprise is also seeking to improve resource
utilization and manage efficiency across multiple heterogene-
ous systems. User productivity must increase to speed the time
to discovery, making it easier to access and use HPC resources
and expand to other resources as workloads demand.
Enterprise-Ready HPC Workload Management
Moab HPC Suite – Enterprise Edition accelerates insights by uni-
fying data center resources, optimizing the analysis process,
and guaranteeing services to the business. Moab 8.1 for HPC
systems and HPC cloud continually meets enterprise priorities
through increased productivity, automated workload uptime,
and consistent SLAs. It uses the battle-tested and patented Moab
intelligence engine to automatically balance the complex, mis-
sion-critical workload priorities of enterprise HPC systems. En-
terprise customers benefit from a single integrated product that
brings together key Enterprise HPC capabilities, implementation
services, and 24/7 support.
It is imperative to automate workload workflows to speed time
to discovery. The following new use cases play a key role in im-
proving overall system performance:
Intelligent HPC Workload Management
Moab HPC Suite – Enterprise Edition
Moab is the “brain” of an HPC system, intelligently optimizing workload throughput while balancing service levels and priorities.
Intelligent HPC Workload Management Across Infrastructure and Organizational Complexity
Moab HPC Suite – Enterprise Edition acts as the “brain” of a high
performance computing (HPC) system to accelerate and auto-
mate complex decision making processes. The patented Moab in-
telligence engine provides multi-dimensional policies that mimic
real-world decision making.
These multi-dimensional, real-world policies are essential in
scheduling workload to optimize job speed, job success and re-
source utilization. Moab HPC Suite – Enterprise Edition integrates
decision-making data from and automates actions through your
system’s existing mix of resource managers.
This enables all the dimensions of real-time granular resource
attributes, resource state, as well as current and future resource
commitments to be modeled against workload/application re-
quirements. All of these dimensions are then factored into the
most efficient scheduling decisions. Moab also dramatically sim-
plifies the management tasks and processes across these com-
plex, heterogeneous environments. Moab works with many of
the major resource management and industry standard resource
monitoring tools and covers mixed hardware, network, storage
and licenses.
Moab HPC Suite – Enterprise Edition multi-dimensional policies
are also able to factor in organizational priorities and complex-
ities when scheduling workload. Moab ensures workload is
processed according to organizational priorities and commit-
ments and that resources are shared fairly across users, groups
and even multiple organizations. This enables organizations to
automatically enforce service guarantees and effectively manage
organizational complexities with simple policy-based settings.
Moab HPC Suite – Enterprise Edition is architected to integrate
on top of your existing resource managers and other types of
management tools in your environment to provide policy-based
scheduling and management of workloads. It makes the optimal
complex decisions based on all of the data it integrates from the
various resource managers and then orchestrates the job and
management actions through those resource managers.
Moab’s flexible management integration makes it the ideal
choice to integrate with existing or new systems as well as to
manage your high performance computing system as it grows
and expands in the future.
� Elastic Computing – unifies data center resources by assisting
the business in managing resource expansion through burst-
ing to private/public clouds and other data center resources,
utilizing OpenStack as a common platform
� Performance and Scale – optimizes the analysis process by
dramatically increasing TORQUE and Moab throughput and
scalability across the board
� Tighter cooperation between Moab and TORQUE – harmoniz-
ing these key structures to reduce overhead and improve
communication between the HPC scheduler and the re-
source manager(s)
� More Parallelization – increasing the decoupling of Moab
and TORQUE’s network communication so Moab is less
dependent on TORQUE’s responsiveness, resulting in a 2x
speed improvement that cuts the duration of the average
scheduling iteration in half
� Accounting Improvements – allowing toggling between
multiple modes of accounting with varying levels of en-
forcement, providing greater flexibility beyond strict allo-
cation options. Additionally, new up-to-date accounting
balances provide real-time insights into usage tracking.
� Viewpoint Admin Portal – guarantees services to the busi-
ness by allowing admins to simplify administrative reporting,
workload status tracking, and job resource viewing
Accelerate Insights
Moab HPC Suite – Enterprise Edition accelerates insights by in-
creasing overall system, user and administrator productivity,
achieving more accurate results that are delivered faster and at
a lower cost from HPC resources. Moab provides the scalability,
90-99 percent utilization, and simple job submission required to
maximize productivity. This ultimately speeds the time to discov-
ery, aiding the enterprise in achieving a competitive advantage
due to its HPC system. Enterprise use cases and capabilities in-
clude the following:
“With Moab HPC Suite, we can meet very demanding customers’
requirements as regards unified management of heterogeneous
cluster environments, grid management, and provide them with flex-
ible and powerful configuration and reporting options. Our custom-
ers value that highly.”
Thomas Gebert, HPC Solution Architect
� OpenStack integration to offer virtual and physical resource
provisioning for IaaS and PaaS
� Performance boost to achieve 3x improvement in overall opti-
mization performance
� Advanced workflow data staging to enable improved cluster
utilization, multiple transfer methods, and new transfer types
� Advanced power management with clock frequency control
and additional power state options, reducing energy costs by
15-30 percent
� Workload-optimized allocation policies and provisioning to
get more results out of existing heterogeneous resources and
reduce costs, including topology-based allocation
� Unify workload management across heterogeneous clusters
by managing them as one cluster, maximizing resource availa-
bility and administration efficiency
� Optimized, intelligent scheduling packs workloads and back-
fills around priority jobs and reservations while balancing
SLAs to efficiently use all available resources
� Optimized scheduling and management of accelerators, both
Intel Xeon Phi and GPGPUs, for jobs to maximize their utiliza-
tion and effectiveness
� Simplified job submission and management with advanced
job arrays and templates
� Showback or chargeback for pay-for-use so actual resource
usage is tracked with flexible chargeback rates and reporting
by user, department, cost center, or cluster
� Multi-cluster grid capabilities manage and share workload
across multiple remote clusters to meet growing workload
demand or surges
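The showback/chargeback capability above boils down to metering usage per cost center and applying flexible rates; a minimal sketch with made-up numbers (not Moab's actual accounting engine):

```python
# Illustrative chargeback sketch (hypothetical rates and usage records):
# core-hour usage is metered per cost center and billed at flexible rates.
usage_records = [("chemistry", 1200), ("physics", 800), ("chemistry", 300)]
rates_cents = {"chemistry": 5, "physics": 8}  # cents per core-hour

bill = {}
for cost_center, core_hours in usage_records:
    bill[cost_center] = bill.get(cost_center, 0) + core_hours * rates_cents[cost_center]

print(bill)  # {'chemistry': 7500, 'physics': 6400} (in cents)
```

The same metered records can drive either showback (reporting only) or chargeback (actual billing), which is why the two are offered as variants of one feature.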
Guarantee Services to the Business
Job and resource failures in enterprise HPC systems lead to de-
layed results, missed organizational opportunities, and missed
objectives. Moab HPC Suite – Enterprise Edition intelligently
automates workload and resource uptime in the HPC system to
ensure that workloads complete successfully and reliably, avoid
failures, and guarantee services are delivered to the business.
The Enterprise benefits from these features:
� Intelligent resource placement prevents job failures with
granular resource modeling, meeting workload requirements
and avoiding at-risk resources
� Auto-response to failures and events with configurable ac-
tions to pre-failure conditions, amber alerts, or other metrics
and monitors
� Workload-aware future maintenance scheduling that helps
maintain a stable HPC system without disrupting workload
productivity
� Real-world expertise for fast time-to-value and system uptime
with included implementation, training, and 24/7 remote sup-
port services
Auto SLA Enforcement
Moab HPC Suite – Enterprise Edition uses the powerful Moab in-
telligence engine to optimally schedule and dynamically adjust
workload to consistently meet service level agreements (SLAs),
guarantees, or business priorities. This automatically ensures
that the right workloads are completed at the optimal times,
taking into account the complex mix of departments,
priorities, and SLAs to be balanced. Moab provides the following
benefits:
� Usage accounting and budget enforcement that schedules
resources and reports on usage in line with resource sharing
agreements and precise budgets (includes usage limits, usage
reports, auto budget management, and dynamic fair share
policies)
� SLA and priority policies that make sure the highest priority
workloads are processed first (includes Quality of Service and
hierarchical priority weighting)
� Continuous plus future scheduling that ensures priorities and
guarantees are proactively met as conditions and workloads
change (e.g. future reservations, pre-emption)
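How such SLA and priority policies combine can be sketched as a weighted priority score (the weights, factors, and job names are made-up examples; Moab's actual policy engine is far richer):

```python
# Illustrative sketch of priority-policy scheduling: each job's priority is
# a weighted sum of its Quality of Service level, its fair-share standing,
# and how long it has waited. (Weights and inputs are hypothetical.)
def job_priority(qos_level, fairshare_credit, wait_hours,
                 w_qos=1000, w_fs=10, w_wait=1):
    return w_qos * qos_level + w_fs * fairshare_credit + w_wait * wait_hours

jobs = {
    "urgent-analysis": job_priority(qos_level=5, fairshare_credit=2, wait_hours=1),
    "batch-run":       job_priority(qos_level=1, fairshare_credit=9, wait_hours=30),
}
# The high-QoS job is scheduled first despite its shorter wait:
print(max(jobs, key=jobs.get))  # urgent-analysis (5021 vs. 1120)
```

Weighting QoS far above the other factors is one way to express "highest priority workloads are processed first" while fair-share and wait-time terms still break ties among equal-QoS jobs.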
Moab HPC Suite – Grid Option
Grid- and Cloud-Ready Workload Management
The benefits of a traditional HPC environment can be extended
to more efficiently manage and meet workload demand with
the multi-cluster grid and HPC cloud management capabilities in
Moab HPC Suite – Enterprise Edition. Capabilities include:
� Self-service job submission portal enables users to easily sub-
mit and track HPC cloud jobs at any time from any location
with little training required
� Pay-for-use show back or chargeback so actual resource usage
is tracked with flexible chargeback rates reporting by user, de-
partment, cost center, or cluster
� On-demand workload-optimized node provisioning dynami-
cally provisions nodes based on policies, to better meet work-
load needs.
� Manage and share workload across multiple remote clusters
to meet growing workload demand or surges by adding on the
Moab HPC Suite – Grid Option.
Choose the Moab HPC Suite That’s Right for You
The following chart provides a high-level comparison of the dif-
ferences in key use cases and capabilities enabled by the differ-
ent editions of Moab HPC Suite to intelligently
manage and optimize HPC workload efficiency. Customers can
use this chart to help determine which edition will best address
their HPC workload management challenges, maximize their
workload and resource productivity, and help them meet their
service level performance goals. To explore any use cases or ca-
pability compared in the chart further, see the HPC Workload
Management Use Cases.
� Basic Edition is best for static, non-dynamic basic clusters of
all sizes whose operators want to optimize workload through-
put and utilization while balancing priorities
� Enterprise Edition is best for large clusters with dynamic,
mission-critical workloads and complex service levels and
priorities to balance. It is also for organizations who want to
integrate or who require HPC Cloud capabilities for their HPC
cluster systems. Enterprise Edition provides additional use
cases these organizations need, such as:
� Accounting for usage of cluster resources for usage budgeting
and pay-for-use showback or chargeback
� Workload optimized node provisioning
� Heterogeneous, multi-cluster workload management
� Workload-aware power management
� Application Portal Edition is suited for customers who need an
application-centric job submission experience to simplify and
extend high performance computing to more users without re-
quiring specialized HPC training. It provides all the benefits of
the Enterprise Edition plus a technical computing application
portal that supports most
common applications in manufacturing, oil & gas, life sciences,
academic, government and other industries.
� Remote Visualization Edition is suited for customers with
many users, internally or external to their organization, of
3D/CAE/CAD and other visualization applications who want
to reduce the hardware, networking and management costs
of running their visualization workloads. This edition also
provides the added benefits of improved workforce produc-
tivity and collaboration while optimizing the utilization and
guaranteeing shared resource access for users moving from
technical workstations to more efficient shared resources in
a technical compute cloud. It provides all the benefits of the
Enterprise Edition plus these additional capabilities and ben-
efits.
transtec HPC solutions are designed for maximum flexibility
and ease of management. We not only offer our customers
the most powerful and flexible cluster management solution
out there, but also provide them with customized setup and
site-specific configuration. Whether a customer needs a dynamic
Linux-Windows dual-boot solution, unified management
of different clusters at different sites, or the fine-tuning of the
Moab scheduler for implementing fine-grained policy configu-
ration – transtec not only gives you the framework at hand,
but also helps you adapt the system according to your special
needs. Needless to say, when customers are in need of special
trainings, transtec will be there to provide customers, adminis-
trators, or users with specially adjusted Educational Services.
Having many years of experience in High Performance Comput-
ing enabled us to develop efficient concepts for installing and
deploying HPC clusters. For this, we leverage well-proven 3rd
party tools, which we assemble into a total solution and adapt
and configure according to the customer’s requirements.
We manage everything from the complete installation and
configuration of the operating system, necessary middleware
components like job and cluster management systems up to the
customer’s applications.
Moab HPC Suite Use Case | Basic | Enterprise | Application Portal | Remote Visualization

Productivity Acceleration
Massive multifaceted scalability | x | x | x | x
Workload optimized resource allocation | x | x | x | x
Optimized, intelligent scheduling of workload and resources | x | x | x | x
Optimized accelerator scheduling & management (Intel MIC & GPGPU) | x | x | x | x
Simplified user job submission & mgmt. – Application Portal | x | x | x | x
Administrator dashboard and management reporting | x | x | x | x
Unified workload management across heterogeneous cluster(s) | x (single cluster only) | x | x | x
Workload optimized node provisioning | – | x | x | x
Visualization Workload Optimization | – | – | – | x
Workload-aware auto-power management | – | x | x | x

Uptime Automation
Intelligent resource placement for failure prevention | x | x | x | x
Workload-aware maintenance scheduling | x | x | x | x
Basic deployment, training and premium (24x7) support service | Standard support only | x | x | x
Auto response to failures and events | x (limited, no re-boot/provision) | x | x | x

Auto SLA Enforcement
Resource sharing and usage policies | x | x | x | x
SLA and priority policies | x | x | x | x
Continuous and future reservations scheduling | x | x | x | x
Usage accounting and budget enforcement | – | x | x | x

Grid- and Cloud-Ready
Manage and share workload across multiple clusters in a wide area grid | x (w/ Grid Option) | x (w/ Grid Option) | x (w/ Grid Option) | x (w/ Grid Option)
User self-service: job request & management portal | x | x | x | x
Pay-for-use showback and chargeback | – | x | x | x
Workload optimized node provisioning & re-purposing | – | x | x | x
Multiple clusters in single domain
Moab HPC Suite – Application Portal Edition

Moab HPC Suite – Application Portal Edition provides easy sin-
gle-point access to common technical applications, data, resourc-
es, and job submissions to accelerate project completion and colla-
boration for end users while reducing computing costs. It enables
more of your internal staff and partners to collaborate from any-
where using HPC applications and resources without any special-
ized HPC training while your intellectual property stays secure.
The optimization policies Moab provides significantly reduce
the cost and complexity of keeping up with compute demand
by optimizing resource sharing and utilization for higher through-
put while still balancing different service levels and priorities.
Remove the Barriers to Harnessing HPC

Organizations of all sizes and types are looking to harness the
potential of high performance computing (HPC) to enable and
accelerate more competitive product design and research. To do
this, you must find a way to extend the power of HPC resources to
designers, researchers and engineers not familiar with or trained
on using the specialized HPC technologies. Simplifying the user
access to and interaction with applications running on more
efficient HPC clusters removes the barriers to discovery and in-
novation for all types of departments and organizations. Moab
HPC Suite – Application Portal Edition provides easy single-point
access to common technical applications, data, HPC resources,
and job submissions to accelerate research analysis and collab-
oration process for end users while reducing computing costs.
HPC Ease of Use Drives Productivity

Moab HPC Suite – Application Portal Edition improves ease of
use and productivity for end users using HPC resources with sin-
gle-point access to their common technical applications, data,
resources, and job submissions across the design and research
process. It simplifies the specification and running of jobs on HPC
resources to a few simple clicks on an application specific tem-
plate with key capabilities including:
• Application-centric job submission templates for common
manufacturing, energy, life-sciences, education and other in-
dustry technical applications
• Support for batch and interactive applications, such as sim-
ulations or analysis sessions, to accelerate the full project or
design cycle
• No special HPC training required to enable more users to start
accessing HPC resources with the intuitive portal
• Distributed data management avoids unnecessary remote file
transfers for users, storing data optimally on the network for
easy access and protection, and optimizing file transfer to/
from user desktops if needed
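Behind such an application-specific template, the portal ultimately submits an ordinary batch job to the workload manager. The following is a minimal sketch of the kind of script a template might generate, in TORQUE/PBS syntax; the application, resource amounts and file names are invented for illustration:

```shell
#!/bin/bash
#PBS -N crash_sim              # job name shown in monitoring views
#PBS -l nodes=2:ppn=8          # request 2 nodes with 8 cores each
#PBS -l walltime=04:00:00      # wall-clock limit for the run
#PBS -o crash_sim.out          # collect stdout here

cd "$PBS_O_WORKDIR"            # start in the submission directory
# the module name and solver invocation are site-specific assumptions
module load lsdyna
mpirun -np 16 ls-dyna i=input.k
```

The template fields map onto the resource lines and input files, so the end user never has to touch such a script directly.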
Accelerate Collaboration While Maintaining Control

More and more projects are done across teams and with partner
and industry organizations. Moab HPC Suite – Application Portal
Edition enables you to accelerate this collaboration between
individuals and teams while maintaining control as additional
users, inside and outside the organization, can easily and secure-
ly access project applications, HPC resources and data to speed
project cycles. Key capabilities and benefits you can leverage are:
• Encrypted and controllable access, by class of user, services,
application or resource, for remote and partner users improves
collaboration while protecting infrastructure and intellectual
property
• Enable multi-site HPC services, available anytime and any-
where, accessible through many different types of devices via
a standard web browser
• Reduced client requirements for projects mean new project
members can quickly contribute without being limited by
their desktop capabilities
Reduce Costs with Optimized Utilization and Sharing

Moab HPC Suite – Application Portal Edition reduces the costs of
technical computing by optimizing resource utilization and the
sharing of resources including HPC compute nodes, application
licenses, and network storage. The patented Moab intelligence
engine and its powerful policies deliver value in key areas of
utilization and sharing:
• Maximize HPC resource utilization with intelligent, opti-
mized scheduling policies that pack and enable more work
to be done on HPC resources to meet growing and changing
demand
• Optimize application license utilization and access by sharing
application licenses in a pool, re-using and optimizing usage
with allocation policies that integrate with license managers
for lower costs and better service levels to a broader set of
users
• Usage budgeting and priority enforcement with usage
accounting and priority policies that ensure resources are
shared in line with service level agreements and project
priorities
• Leverage and fully utilize centralized storage capacity instead
of duplicative, expensive, and underutilized individual work-
station storage
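In Moab terms, sharing and budgeting rules like these are expressed in the scheduler configuration. A hypothetical moab.cfg fragment (the group names and percentages are invented; check parameter spellings against your Moab release):

```
FSPOLICY             DEDICATEDPS      # charge fairshare by dedicated processor-seconds
FSWEIGHT             100              # weight fairshare usage into job priority
USERCFG[DEFAULT]     FSTARGET=10.0    # default fairshare target per user
GROUPCFG[crashteam]  FSTARGET=40.0 PRIORITY=1000
GROUPCFG[DEFAULT]    MAXPROC=64       # cap concurrent processors per group
```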
Key Intelligent Workload Management Capabilities:
• Massive multi-point scalability
• Workload-optimized allocation policies and provisioning
• Unified workload management across heterogeneous clusters
The Application Portal Edition easily extends the power of HPC to your users and projects to drive innovation and competitive advantage.
• Optimized, intelligent scheduling
• Optimized scheduling and management of accelerators (both
Intel MIC and GPGPUs)
• Administrator dashboards and reporting tools
• Workload-aware auto power management
• Intelligent resource placement to prevent job failures
• Auto-response to failures and events
• Workload-aware future maintenance scheduling
• Usage accounting and budget enforcement
• SLA and priority policies
• Continuous plus future scheduling
Drive Your Innovation and Competitive Advantage

Leverage Adaptive Computing’s HPC expertise and award-win-
ning Moab product to access or expand the opportunities HPC
can deliver your organization. Your engineers, designers, and sci-
entists will no longer have to be HPC experts to use the power
of HPC to take their projects to whole new levels of speed, inno-
vation, and competitive advantage. Moab HPC Suite – Application
Portal Edition gives you the benefits of enhanced user produc-
tivity and tools with optimized workload management for HPC
system efficiency to reduce costs as well.
Managing the World’s Top Systems, Ready to Manage Yours

Moab manages the largest, most scale-intensive and complex
high performance computing environments in the world. This in-
cludes many of the Top500, leading commercial, and innovative
research HPC systems. Adaptive Computing is the largest suppli-
er of HPC workload management software. Adaptive’s Moab HPC
Suite and TORQUE rank #1 and #2 as the most used scheduling/
workload management software at HPC sites and Adaptive is the
#1 vendor used across unique HPC sites.
Moab HPC Suite – Remote Visualization Edition

Moab HPC Suite – Remote Visualization Edition significantly
reduces the costs of visualization technical computing while
improving the productivity, collaboration and security of the de-
sign and research process. It enables you to create easy, central-
ized access to shared visualization resources in a technical com-
pute cloud to reduce the hardware, network and management
costs of visualization technical computing instead of expensive
and underutilized individual workstations.
The solution centralizes applications, data and compute togeth-
er with users only transferring pixels instead of data to do their
remote simulations and analysis. This enables a wider range of
users to collaborate and be more productive at any time, on the
same data, from anywhere without any data transfer time lags
or security issues. Moab’s optimized workload management
policies maximize resource utilization and guarantee shared re-
source access to users. With Moab HPC Suite – Remote Visuali-
zation Edition you can turn your user’s data and ideas into new
products and competitive advantage cheaper and faster than
ever before.
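A back-of-the-envelope comparison illustrates the pixels-instead-of-data point; the figures below are invented but plausible:

```python
# Compare shipping a full result dataset to a remote workstation with
# streaming compressed pixels from a central visualization server.

link_mbit_s = 100          # assumed WAN link to the remote user
dataset_gb = 50            # assumed size of the simulation results

# time to download the dataset before any work can start
download_s = dataset_gb * 8 * 1000 / link_mbit_s

# assumed bit rate of a compressed 1080p interactive session stream
stream_mbit_s = 20

print(f"dataset download: {download_s / 60:.0f} min before work starts")
print(f"pixel stream: {stream_mbit_s / link_mbit_s:.0%} of the link, interactive at once")
```

With these assumptions the user waits over an hour for the dataset, while the pixel stream uses a fifth of the link and is interactive immediately.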
Removing Inefficiencies in Current 3D Visualization Methods

Using 3D and 2D visualization, valuable data is interpreted and
new innovations are discovered. There are numerous inefficien-
cies in how current technical computing methods support users
and organizations in achieving these discoveries in a cost effec-
tive way. High-cost workstations and their software are often
not fully utilized by the limited user(s) that have access to them.
They are also costly to manage, and they do not keep pace with
workload technology requirements for long. In addition, the work-
station model requires slow transfers of large data sets between
workstations. This is costly and inefficient in terms of network band-
width, user productivity, and team collaboration, while posing
increased data and security risks. There is a solution that re-
moves these inefficiencies using new technical cloud computing
models. Moab HPC Suite – Remote Visualization Edition signifi-
cantly reduces the costs of visualization technical computing
while improving the productivity, collaboration and security of
the design and research process.
Reduce the Costs of Visualization Technical Computing

Moab HPC Suite – Remote Visualization Edition significantly re-
duces the hardware, network and management costs of visual-
ization technical computing. It enables you to create easy, cen-
tralized access to shared visualization compute, application and
data resources. These compute, application, and data resources
reside in a technical compute cloud in your data center instead
of in expensive and underutilized individual workstations.
• Reduce hardware costs by consolidating expensive individual
technical workstations into centralized visualization servers
for higher utilization by multiple users; reduce additive or up-
grade technical workstation or specialty hardware purchases,
such as GPUs, for individual users
• Reduce management costs by moving from remote user work-
stations that are difficult and expensive to maintain, upgrade,
and back up to centrally managed visualization servers that
require less admin overhead
• Decrease network access costs and congestion as significant-
ly lower loads of just compressed, visualized pixels are mov-
ing to users, not full data sets
• Reduce energy usage as centralized visualization servers
consume less energy for the user demand met
• Reduce data storage costs by consolidating data into common
storage node(s) instead of under-utilized individual storage
Improve Collaboration, Productivity and Security

With Moab HPC Suite – Remote Visualization Edition, you can im-
prove the productivity, collaboration and security of the design
and research process by only transferring pixels instead of data
to users to do their simulations and analysis. This enables a wider
range of users to collaborate and be more productive at any time,
on the same data, from anywhere without any data transfer time
lags or security issues. Users also have improved immediate ac-
cess to specialty applications and resources – like GPUs – they
might need for a project so they are no longer limited by person-
al workstation constraints or to a single working location.
• Improve workforce collaboration by enabling multiple users
to access and collaborate on the same interactive application
data at the same time from almost any location or device,
with little or no training needed
• Eliminate data transfer time lags for users by keeping data
close to the visualization applications and compute resources
instead of transferring back and forth to remote workstations;
only smaller compressed pixels are transferred
• Improve security with centralized data storage so data no
longer gets transferred to where it shouldn’t, gets lost, or gets
inappropriately accessed on remote workstations; only pixels
get moved
Maximize Utilization and Shared Resource Guarantees

Moab HPC Suite – Remote Visualization Edition maximizes your
resource utilization and scalability while guaranteeing shared
resource access to users and groups with optimized workload
management policies. Moab’s patented policy intelligence en-
gine optimizes the scheduling of the visualization sessions
across the shared resources to maximize standard and specialty
resource utilization, helping them scale to support larger vol-
umes of visualization workloads. These intelligent policies also
guarantee shared resource access to users to ensure a high ser-
vice level as they transition from individual workstations.
• Guarantee shared visualization resource access for users
with priority policies and usage budgets that ensure they
receive the appropriate service level. Policies include ses-
sion priorities and reservations, number of users per node,
fairshare usage policies, and usage accounting and budgets
across multiple groups or users.
• Maximize resource utilization and scalability by packing vis-
ualization workload optimally on shared servers using Moab
allocation and scheduling policies. These policies can include
optimal resource allocation by visualization session charac-
teristics (like CPU, memory, etc.), workload packing, number of
users per node, and GPU policies, etc.
• Improve application license utilization to reduce software
costs and improve access with a common pool of 3D visuali-
zation applications shared across multiple users, with usage
optimized by intelligent Moab allocation policies, integrated
with license managers.
• Optimized scheduling and management of GPUs and other
accelerators to maximize their utilization and effectiveness
• Enable multiple Windows or Linux user sessions on a single
machine managed by Moab to maximize resource and power
utilization.
• Enable dynamic OS re-provisioning on resources, managed
by Moab, to optimally meet changing visualization applica-
tion workload demand and maximize resource utilization and
availability to users.
Innovation and Competitive Advantage at a Lower Cost

Leverage Adaptive Computing’s HPC and cloud expertise and
award winning Moab product to more efficiently access and ac-
celerate the opportunities technical computing can deliver your
organization. Your engineers, designers, and scientists will be
able to move from data and ideas to competitive products and
new discoveries faster and at a lower cost than ever before. Moab
HPC Suite – Remote Visualization Edition gives you the benefits
of more efficient, next generation visualization methods with
enhanced user productivity and collaboration as well.
The Remote Visualization Edition makes visualization more cost effective and productive with an innovative technical compute cloud.
Optimizing Accelerators with Moab HPC Suite
Harnessing the Power of Intel Xeon Phi and GPGPUs
The mathematical acceleration delivered by many-core General
Purpose Graphics Processing Units (GPGPUs) offers significant
performance advantages for many classes of numerically inten-
sive applications. Parallel computing tools such as NVIDIA’s CUDA
and the openCL framework have made harnessing the power of
these technologies much more accessible to developers, result-
ing in the increasing deployment of hybrid GPGPU-based sys-
tems and introducing significant challenges for administrators
and workload management systems.
The new Intel Xeon Phi coprocessors offer breakthrough perfor-
mance for highly parallel applications with the benefit of leverag-
ing existing code and flexibility in programming models for fast-
er application development. Moving an application to Intel Xeon
Phi coprocessors has the promising benefit of requiring substan-
tially less effort and lower power usage. Such a small investment
can reap big performance gains for both data-intensive and
numerical or scientific computing applications that can make
the most of the Intel Many Integrated Cores (MIC) technology.
So it is no surprise that organizations are quickly embracing
these new coprocessors in their HPC systems. The main goal is to
create breakthrough discoveries, products and research that im-
prove our world. Harnessing and optimizing the power of these
new accelerator technologies is key to doing this as quickly as
possible. Moab HPC Suite helps organizations easily integrate ac-
celerator and coprocessor technologies into their HPC systems,
optimizing their utilization for the right workloads and prob-
lems they are trying to solve.
Moab HPC Suite automatically schedules hybrid systems incor-
porating Intel Xeon Phi coprocessors and GPGPU accelerators,
optimizing their utilization as just another resource type with
policies. Organizations can choose the accelerators that best
meet their different workload needs.
Managing Hybrid Accelerator Systems

Hybrid accelerator systems add a new dimension of manage-
ment complexity when allocating workloads to available re-
sources. In addition to the traditional needs of aligning workload
placement with software stack dependencies, CPU type, memo-
ry, and interconnect requirements, intelligent workload manage-
ment systems now need to consider:
• Workload’s ability to exploit Xeon Phi or GPGPU technology
• Additional software dependencies, including topology-based
allocation
• Current health and usage status of available Xeon Phis or
GPGPUs
• Resource configuration for type and number of Xeon Phis or
GPGPUs

Moab HPC Suite auto-detects which types of accelerators are
where in the system to reduce the management effort and costs
as these processors are introduced and maintained together in
an HPC system. This gives customers maximum choice
and performance in selecting the accelerators that work best for
each of their workloads, whether Xeon Phi, GPGPU or a hybrid
mix of the two.
Optimizing Intel Xeon Phi and GPGPU Utilization

System and application diversity is one of the first issues work-
load management software must address in optimizing acceler-
ator utilization. The ability of current codes to use GPGPUs and
Xeon Phis effectively, ongoing cluster expansion, and costs usu-
ally mean only a portion of a system will be equipped with one
or a mix of both types of accelerators. Moab HPC Suite has a two-
fold role in reducing the complexity of their administration and
optimizing their utilization. First, it must be able to automatical-
ly detect new GPGPUs or Intel Xeon Phi coprocessors in the envi-
ronment and their availability without the cumbersome burden
of manual administrator configuration. Second, and most impor-
tantly, it must accurately match Xeon Phi-bound or GPGPU-en-
abled workloads with the appropriately equipped resources in
addition to managing contention for those limited resources
according to organizational policy. Moab’s powerful allocation
and prioritization policies can ensure the right jobs from users
and groups get to the optimal accelerator resources at the right
time. This keeps the accelerators at peak utilization for the right
priority jobs.
Moab’s policies are a powerful ally to administrators in auto de-
termining the optimal Xeon Phi coprocessor or GPGPU to use
and which ones to avoid when scheduling jobs. These allocation poli-
cies can be based on any of the GPGPU or Xeon Phi metrics such
as memory (to ensure job needs are met), temperature (to avoid
hardware errors or failures), utilization, or other metrics. Use the
following metrics in policies to optimize allocation of Intel Xeon
Phi coprocessors:
• Number of Cores
• Number of Hardware Threads
• Physical Memory
• Free Physical Memory
• Swap Memory
• Free Swap Memory
• Max Frequency (in MHz)

Use the following metrics in policies to optimize allocation of
GPGPUs and for management diagnostics:

• Error counts: single/double-bit, commands to reset counts
• Temperature
• Fan speed
• Memory: total, used, utilization
• Utilization percent
• Metrics time stamp
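The kind of metric-driven allocation decision described above can be sketched in a few lines. This is an illustrative model only, with invented device metrics and thresholds, not Moab's actual implementation:

```python
# Rank available GPGPUs by utilization and free memory, skipping devices
# whose temperature suggests a risk of hardware errors or failures.

TEMP_LIMIT_C = 85  # assumed threshold above which a device is avoided

def pick_gpu(gpus):
    """Return the id of the least-loaded eligible GPU, or None."""
    eligible = [g for g in gpus if g["temp_c"] < TEMP_LIMIT_C]
    if not eligible:
        return None  # no safe device: the job waits
    # lowest utilization first, then most free memory
    best = min(eligible, key=lambda g: (g["util_pct"], -g["mem_free_mb"]))
    return best["id"]

gpus = [
    {"id": "gpu0", "temp_c": 91, "util_pct": 10, "mem_free_mb": 4000},
    {"id": "gpu1", "temp_c": 70, "util_pct": 55, "mem_free_mb": 6000},
    {"id": "gpu2", "temp_c": 72, "util_pct": 20, "mem_free_mb": 2500},
]
print(pick_gpu(gpus))  # gpu0 is too hot, so the least-utilized cool device wins
```

A real scheduler folds many more metrics and per-group policies into the same decision; the point is that allocation is driven by live device telemetry rather than static node lists.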
Improve GPGPU Job Speed and Success

The GPGPU drivers supplied by vendors today allow multiple jobs
to share a GPGPU without corrupting results, but sharing GPG-
PUs and other key GPGPU factors can significantly impact the
speed and success of GPGPU jobs and the final level of service
delivered to the system users and the organization.
Moab HPC Suite optimizes accelerator utilization with policies
that ensure the right ones get used for the right user and group
jobs at the right time.
Consider the example of a user estimating the time for a GPGPU
job based on exclusive access to a GPGPU, but the workload man-
ager allowing the GPGPU to be shared when the job is scheduled.
The job will likely exceed the estimated time and be terminated
by the workload manager unsuccessfully, leaving a very unsatis-
fied user and the organization, group or project they represent.
Moab HPC Suite provides the ability to schedule GPGPU jobs to
run exclusively and finish in the shortest time possible for certain
types, or classes of jobs or users to ensure job speed and success.
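With a TORQUE-style resource manager underneath Moab, exclusive accelerator access of this kind is typically requested at submission time. A hypothetical submission using TORQUE's GPU mode modifiers (the script name is invented):

```shell
# one node, two CPU cores, one GPU placed in exclusive-process compute mode
qsub -l nodes=1:ppn=2:gpus=1:exclusive_process gpgpu_job.sh
```

The scheduler then guarantees the job sole use of the device, so the user's runtime estimate based on exclusive access holds.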
Optimize Performance and Innovation

Adaptive Computing’s HPC expertise and award-winning Moab
product help you optimize the utilization and performance from
your accelerators so you get from calculations to innovation
faster:
• Simplify management: auto-detection of GPGPUs or Intel
Xeon Phi coprocessors
• Maximize utilization: policies favor GPGPU- or Xeon Phi-
oriented jobs over other jobs for accelerator-equipped
resources
• Maximize job success: policies allocate the optimal accelerator,
in the optimal mode, at a given time
• Maximize service level: policies designate priority scheduling
for users, groups, and classes using limited accelerator
resources
Remote Visualization and Workflow Optimization

It’s human nature to want to ‘see’ the results from simulations, tests, and analy-
ses. Up until recently, this has meant ‘fat’ workstations on many user desktops. This
approach provides CPU power when the user wants it – but as dataset size increases,
there can be delays in downloading the results.
Also, sharing the results with colleagues means gathering around the workstation – not
always possible in this globalized, collaborative workplace.
Remote Visualization
NICE EnginFrame: A Technical Computing Portal

Aerospace, Automotive and Manufacturing

The complex factors involved in CAE range from compute-in-
tensive data analysis to worldwide collaboration between de-
signers, engineers, OEMs and suppliers. To accommodate these
factors, Cloud (both internal and external) and Grid Computing
solutions are increasingly seen as a logical step to optimize IT
infrastructure usage for CAE. It is no surprise then, that automo-
tive and aerospace manufacturers were the early adopters of
internal Cloud and Grid portals.
Manufacturers can now develop more “virtual products” and
simulate all types of designs, fluid flows, and crash simulations.
Such virtualized products and more streamlined collaboration
environments are revolutionizing the manufacturing process.
With NICE EnginFrame in their CAE environment engineers can
take the process even further by connecting design and simula-
tion groups in “collaborative environments” to get even greater
benefits from “virtual products”. Thanks to EnginFrame, CAE en-
gineers can have a simple, intuitive collaborative environment
that takes care of issues related to:
• Access & Security – where an organization must give access
to external and internal entities such as designers, engi-
neers and suppliers.
• Distributed collaboration – simple and secure connection of
design and simulation groups distributed worldwide.
• Time spent on IT tasks – by eliminating time and resources spent
using cryptic job submission commands or acquiring knowl-
edge of underlying compute infrastructures, engineers can
spend more time concentrating on their core design tasks.
EnginFrame’s web based interface can be used to access the
compute resources required for CAE processes. This means ac-
cess to job submission & monitoring tasks, input & output data
associated with industry standard CAE/CFD applications for
Fluid-Dynamics, Structural Analysis, Electro Design and Design
Collaboration (like Abaqus, Ansys, Fluent, MSC Nastran, PAM-
Crash, LS-Dyna, Radioss) without cryptic job submission com-
mands or knowledge of underlying compute infrastructures.
EnginFrame has a long history of usage in some of the most
prestigious manufacturing organizations worldwide including
Aerospace companies like AIRBUS, Alenia Space, CIRA, Galileo
Avionica, Hamilton Sundstrand, Magellan Aerospace, MTU and
Automotive companies like Audi, ARRK, Brawn GP, Bridgestone,
Bosch, Delphi, Elasis, Ferrari, FIAT, GDX Automotive, Jaguar-Lan-
dRover, Lear, Magneti Marelli, McLaren, P+Z, RedBull, Swagelok,
Suzuki, Toyota, TRW.
Life Sciences and Healthcare

NICE solutions are deployed in the Life Sciences sector at compa-
nies like BioLab, Partners HealthCare, Pharsight and the M.D. An-
derson Cancer Center; also in leading research projects like DEISA
or LitBio in order to allow easy and transparent use of computing
resources without any insight into the HPC infrastructure.
The Life Science and Healthcare sectors have some very strict
requirements when choosing an IT solution like EnginFrame, for
instance:
• Security – to meet the strict security & privacy requirements
of the biomedical and pharmaceutical industry, any solution
needs to take account of multiple layers of security and au-
thentication.
• Industry-specific software – ranging from the simplest custom
tool to the more general purpose free and open middlewares.
EnginFrame’s modular architecture allows different Grid
middlewares and software (including leading Life Science appli-
cations like Schroedinger Glide, EPFL RAxML, the BLAST family,
Taverna, and R) to be exploited. Users can compose elementary
services into complex applications and “virtual experiments”,
then run and monitor these workflows via a standard Web browser.
EnginFrame also has highly tunable resource sharing and fine-
grained access control where multiple authentication systems
(like Active Directory, Kerberos/LDAP) can be exploited simulta-
neously.
Growing complexity in globalized teams

HPC systems and Enterprise Grids deliver unprecedented time-
to-market and performance advantages to many research and
corporate customers, struggling every day with compute and
data intensive processes.
This often generates or transforms massive amounts of jobs and
data that need to be handled and archived efficiently to deliv-
er timely information to users distributed across multiple locations
with different security concerns.
Poor usability of such complex systems often negatively im-
pacts users’ productivity, and ad-hoc data management often
increases information entropy and dissipates knowledge and
intellectual property.
The solution

Solving distributed computing issues for our customers, we
understand that a modern, user-friendly web front-end to HPC
and Grids can drastically improve engineering productivity, if
properly designed to address the specific challenges of the Tech-
nical Computing market.
NICE EnginFrame overcomes many common issues in the areas
of usability, data management, security and integration, open-
ing the way to a broader, more effective use of the Technical
Computing resources.
The key components of the solution are:
• A flexible and modular Java-based kernel, with clear separa-
tion between customizations and core services
• Powerful data management features, reflecting the typical
needs of engineering applications
• Comprehensive security options and a fine-grained authori-
zation system
• Scheduler abstraction layer to adapt to different workload
and resource managers
• Responsive and competent support services
End users can typically enjoy the following improvements:
• User-friendly, intuitive access to computing resources, using
a standard Web browser
• Application-centric job submission
• Organized access to job information and data
• Increased mobility and reduced client requirements
On the other side, the Technical Computing Portal delivers signif-
icant added-value for system administrators and IT:
• Reduced training needs to enable users to access the
resources
• Centralized configuration and immediate deployment of
services
• Comprehensive authorization to access services and
information
• Reduced support calls and submission errors
Coupled with our Remote Visualization solutions, our custom-
ers quickly deploy end-to-end engineering processes on their
Intranet, Extranet or Internet.
Remote Visualization
Desktop Cloud Visualization
Solving distributed computing issues for our customers, it is
easy to understand that a modern, user-friendly web front-end
to HPC and grids can drastically improve engineering productiv-
ity, if properly designed to address the specific challenges of the
Technical Computing market.
Remote Visualization

Increasing dataset complexity (millions of polygons, interacting
components, MRI/PET overlays) means that as time comes to
upgrade and replace the workstations, the next generation of
hardware needs more memory, more graphics processing, more
disk, and more CPU cores. This makes the workstation expen-
sive, in need of cooling, and noisy.
Innovation in the field of remote 3D processing now allows com-
panies to address these issues moving applications away from
the Desktop into the data center. Instead of pushing data to the
application, the application can be moved near the data.
Instead of mass workstation upgrades, remote visualization
allows incremental provisioning, on-demand allocation, better
management and efficient distribution of interactive sessions
and licenses. Racked workstations or blades typically have low-
er maintenance, cooling, replacement costs, and they can ex-
tend workstation (or laptop) life as “thin clients”.
The solution
Leveraging their expertise in distributed computing and web-
based application portals, NICE offers an integrated solution to
access, load balance and manage applications and desktop ses-
sions running within a visualization farm. The farm can include
both Linux and Windows resources, running on heterogeneous
hardware.
The core of the solution is the EnginFrame visualization plug-in,
which delivers web-based services to access and manage applica-
tions and desktops published in the farm. This solution has been
integrated with:
• NICE Desktop Cloud Visualization (DCV)
• HP Remote Graphics Software (RGS)
• RealVNC
• TurboVNC and VirtualGL
• NoMachine NX
Coupled with these third party remote visualization engines
(which specialize in delivering high frame-rates for 3D graphics),
the NICE offering for Remote Visualization solves the issues of
user authentication, dynamic session allocation, session man-
agement and data transfers.
End users can enjoy the following improvements:
• Intuitive, application-centric web interface to start, control
and re-connect to a session
• Single sign-on for batch and interactive applications
• All data transfers from and to the remote visualization farm
are handled by EnginFrame
• Built-in collaboration, to share sessions with other users
• The load and usage of the visualization cluster is monitored
in the browser
The solution also delivers significant added-value for the sys-
tem administrators:
� No need of SSH / SCP / FTP on the client machine
� Easy integration into identity services, Single Sign-On (SSO),
Enterprise portals
� Automated data life cycle management
� Built-in user session sharing, to facilitate support
� Interactive sessions are load balanced by a scheduler (LSF,
Moab or Torque) to achieve optimal performance and re-
source usage
� Better control and use of application licenses
� Monitor, control and manage users’ idle sessions
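The session-brokering step behind the scheduler integration can be sketched as follows. This is an illustrative model, not NICE EnginFrame code: the host names, capacities and the least-loaded selection policy are invented for the example.

```python
# Illustrative sketch of a session broker: before handing an
# interactive session to the real scheduler (LSF, Moab or Torque),
# pick the visualization host with the most free session slots.

def pick_host(hosts):
    """Return the host with the most free session slots.

    `hosts` maps host name -> dict with 'sessions' and 'capacity'.
    """
    def free_slots(item):
        name, info = item
        return info["capacity"] - info["sessions"]

    candidates = [h for h in hosts.items() if free_slots(h) > 0]
    if not candidates:
        raise RuntimeError("no free visualization slots")
    return max(candidates, key=free_slots)[0]

farm = {
    "vis01": {"sessions": 7, "capacity": 8},
    "vis02": {"sessions": 3, "capacity": 8},
    "vis03": {"sessions": 8, "capacity": 8},   # full, filtered out
}
print(pick_host(farm))  # -> vis02
```

In practice the broker also accounts for GPU load and license availability; the point is only that users never pick a host themselves.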
Desktop Cloud Visualization
NICE Desktop Cloud Visualization (DCV) is an advanced technology
that enables Technical Computing users to remotely access
2D/3D interactive applications over a standard network.
Engineers and scientists are immediately empowered by taking
full advantage of high-end graphics cards, fast I/O performance
and large-memory nodes hosted in a “Public or Private 3D Cloud”,
rather than waiting for the next upgrade of their workstations. The
DCV protocol adapts to heterogeneous networking infrastruc-
tures like LAN, WAN and VPN, to deal with bandwidth and latency
constraints. All applications run natively on the remote machines,
which can be virtualized and share the same physical GPU.
In a typical visualization scenario, a software application sends
a stream of graphics commands to a graphics adapter through
an input/output (I/O) interface. The graphics adapter renders
the data into pixels and outputs them to the local display as
a video signal. When using NICE DCV, the scene geometry and
graphics state are rendered on a central server, and pixels are
sent to one or more remote displays.
This approach requires the server to be equipped with one or
more GPUs, which are used for the OpenGL rendering, while the
client software can run on “thin” devices.
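The render-compress-transmit idea described above can be illustrated with a minimal sketch. This is a conceptual model, not DCV code: the frame generator stands in for GPU rendering, and zlib stands in for the real video codec (DCV uses hardware H.264 encoding).

```python
# Conceptual sketch of pixel streaming: the scene is rendered
# server-side and only compressed pixels travel to the client.
import zlib

WIDTH, HEIGHT = 320, 240

def render_frame(t):
    """Server side: stand-in for the GPU rendering an RGB frame at time t."""
    frame = bytearray(WIDTH * HEIGHT * 3)
    for i in range(0, len(frame), 3):
        frame[i] = (i // 3 + t) % 256  # animated red gradient
    return bytes(frame)

def encode(frame):
    """Server side: compress pixels before sending over the wire."""
    return zlib.compress(frame)

def decode(payload):
    """Client side: decompress and hand pixels to the display."""
    return zlib.decompress(payload)

frame = render_frame(t=0)
payload = encode(frame)
assert decode(payload) == frame     # the client sees identical pixels
print(f"raw {len(frame)} B -> wire {len(payload)} B")
```

The client never receives geometry or OpenGL state, only pixels, which is why it can run on "thin" devices without a GPU.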
The NICE DCV architecture consists of:
� a DCV server, equipped with one or more GPUs, used for
OpenGL rendering
� one or more DCV end stations, running on “thin clients”, used
only for visualization
� heterogeneous networking infrastructures (like LAN, WAN
and VPN), with optimized balancing of quality vs. frame rate
NICE DCV Highlights:
� enables high-performance remote access to interactive
2D/3D software applications over low-bandwidth/high-latency
networks
� supports multiple heterogeneous operating systems (Windows, Linux)
� enables GPU sharing
� supports 3D acceleration for OpenGL applications running
on virtual machines
� supports multi-user collaboration via session sharing
� enables an attractive return on investment through resource
sharing and consolidation in data centers (GPU, memory, CPU)
� keeps the data secure in the data center, reducing data
transfers and saving time
� enables right-sizing of system allocation based on users’
dynamic needs
Remote Visualization
Desktop Cloud Visualization
“The amount of data resulting from e.g. simulations in CAE or other
engineering environments can be in the Gigabyte range. It is obvious
that remote post-processing is one of the most urgent topics to be
tackled. NICE EnginFrame provides exactly that, and our customers
are impressed that such great technology enhances their workflow so
significantly.”
Robin Kienecker, HPC Sales Specialist
� Facilitates application deployment: all applications, updates
and patches are instantly available to everyone, without any
changes to original code
Business Benefits
The business benefits of adopting NICE DCV can be summarized
into four categories:
Category: Productivity
� Increase business efficiency
� Improve team performance by ensuring real-time collaboration
with colleagues and partners, anywhere
� Reduce IT management costs by consolidating workstation
resources to a single point of management
� Save money and time on application deployment
� Let users work from anywhere with an Internet connection

Category: Business Continuity
� Move graphics processing and data to the datacenter – not
the laptop/desktop
� Cloud-based platform support enables you to scale the
visualization solution “on-demand” to extend business, grow
new revenue, and manage costs

Category: Data Security
� Guarantee secure and auditable use of remote resources
(applications, data, infrastructure, licenses)
� Allow real-time collaboration with partners while protecting
Intellectual Property and resources
� Restrict access by class of user, service, application, and resource

Category: Training Effectiveness
� Enable multiple users to follow application procedures
alongside an instructor in real time
� Enable collaboration and session sharing among remote users
(employees, partners, and affiliates)
NICE DCV is perfectly integrated into EnginFrame Views, lever-
aging 2D/3D capabilities over the web, including the ability to
share an interactive session with other users for collaborative
working.
Cloud Computing With transtec and NICE
Cloud Computing is a style of computing in which dynamically
scalable and often virtualized resources are provided as a service
over the Internet. Users need not have knowledge of, expertise in,
or control over the technology infrastructure in the “cloud” that
supports them. The concept incorporates technologies that have
the common theme of reliance on the Internet for satisfying the
computing needs of the users. Cloud Computing services usually
provide applications online that are accessed from a web brows-
er, while the software and data are stored on the servers.
Companies or individuals engaging in Cloud Computing do not
own the physical infrastructure hosting the software platform
in question. Instead, they avoid capital expenditure by renting
usage from a third-party provider (except for the case of ‘Private
Cloud’ - see below). They consume resources as a service, paying
only for the resources they use. Many Cloud Computing offerings
have adopted the utility computing model, which
is analogous to how traditional utilities like electricity are con-
sumed, while others are billed on a subscription basis.
The main advantage offered by Cloud solutions is the reduction
of infrastructure and maintenance costs.
By not owning the hardware and software, Cloud users avoid
capital expenditure by renting usage from a third-party provider.
Customers pay for only the resources they use.
The advantage for the Cloud provider is that sharing computing
power among multiple tenants improves utilization rates, as
servers are not left idle, which can reduce costs and increase
efficiency.
Public cloud
Public cloud or external cloud describes cloud computing in the
traditional mainstream sense, whereby resources are dynam-
ically provisioned on a fine-grained, self-service basis over the
Internet, via web applications/web services, from an off-site
third-party provider who shares resources and bills on a fine-
grained utility computing basis.
Private cloud
Private cloud and internal cloud describe offerings deploying
cloud computing on private networks. These solutions aim to
“deliver some benefits of cloud computing without the pitfalls”,
addressing concerns about data security, corporate governance,
and reliability. On the other hand, users still have to buy, deploy,
manage, and maintain them, and as such do not benefit from
lower up-front capital costs and less hands-on management.
Architecture
The majority of cloud computing infrastructure, today, consists
of reliable services delivered through data centers and built on
servers with different levels of virtualization technologies. The
services are accessible anywhere that has access to networking
infrastructure. The Cloud appears as a single point of access
for all the computing needs of consumers. The offerings need
to meet the quality-of-service requirements of customers and
typically offer service level agreements, while at the same time
going beyond the typical limitations of conventional infrastructure.
The architecture of the computing platform proposed by NICE
(fig. 1) differs from the others in some interesting ways:
� you can deploy it on an existing IT infrastructure, because it
is completely decoupled from the hardware infrastructure
� it has a high level of modularity and configurability, making it
easily customizable for the user’s needs
� based on the NICE EnginFrame technology, it is easy to build
graphical web-based interfaces to provide several applica-
tions, as web services, without needing to code or compile
source programs
� it utilises the existing methodology in place for authentica-
tion and authorization
Further, because the NICE Cloud solution is built on advanced
IT technologies, including virtualization and workload manage-
ment, the execution platform is dynamically able to allocate,
monitor, and configure a new environment as needed by the
application, inside the Cloud infrastructure. The NICE platform
offers these important properties:
� Incremental Scalability: the quantity of computing and
storage resources, provisioned to the applications, changes
dynamically depending on the workload
� Reliability and Fault-Tolerance: because of the virtualiza-
tion of the hardware resources and the multiple redundant
hosts, the platform adjusts the resources needed from the
applications, without disruption during disasters or crashes
� Service Level Agreement: the use of advanced systems for
the dynamic allocation of the resources, allows the guaran-
tee of service level, agreed across applications and services;
� Accountability: the continuous monitoring of the resourc-
es used by each application (and user), allows the setup of
services that users can access in a pay-per-use mode, or sub-
scribing to a specific contract. In the case of an Enterprise
Cloud, this feature allows costs to be shared among the cost
centers of the company.
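The accountability property amounts to simple metering arithmetic: usage records are rolled up into pay-per-use charges per cost center. A hedged sketch, with invented rates and usage records:

```python
# Illustrative chargeback sketch: continuous per-user resource
# monitoring rolled up per cost center. The rate and records are
# invented for the example, not taken from any real platform.

RATE_PER_CORE_HOUR = 0.05  # assumed price per core-hour

usage = [
    {"user": "alice", "cost_center": "CAE", "core_hours": 1200},
    {"user": "bob",   "cost_center": "CAE", "core_hours": 300},
    {"user": "carol", "cost_center": "CFD", "core_hours": 800},
]

def chargeback(records, rate):
    """Sum charges per cost center from raw usage records."""
    totals = {}
    for r in records:
        cc = r["cost_center"]
        totals[cc] = totals.get(cc, 0.0) + r["core_hours"] * rate
    return totals

print(chargeback(usage, RATE_PER_CORE_HOUR))
```

The same rollup, keyed by user instead of cost center, yields the pay-per-use billing mode mentioned above.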
transtec has long-term experience in engineering environments,
especially the CAD/CAE sector. This allows us to provide customers
in this area with solutions that greatly enhance their workflow
and minimize time-to-result.
This, together with transtec’s offerings of all kinds of services,
allows our customers to fully focus on their productive work,
and have us do the environmental optimizations.
Graphics-accelerated virtual desktops and applications
NVIDIA GRID offers large corporations the ability to offload
graphics processing from the CPU to the GPU in virtualised envi-
ronments, allowing the data center manager to deliver true PC
graphics-rich experiences to more users for the first time.
Benefits of NVIDIA GRID for IT:
� Leverage industry-leading remote visualization solutions like
NICE DCV
� Add your most graphics-intensive users to your virtual solu-
tion
� Improve the productivity of all users
Benefits of NVIDIA GRID for users:
� Highly responsive windows and rich multimedia experiences
� Access to all critical applications, including the most 3D-intensive
� Access from anywhere, on any device
NVIDIA GRID Boards
NVIDIA’s Kepler architecture-based GRID boards are specifically
designed to enable rich graphics in virtualised environments.
GPU Virtualisation
GRID boards feature NVIDIA Kepler-based GPUs that, for the first
time, allow hardware virtualisation of the GPU. This means mul-
tiple users can share a single GPU, improving user density while
providing true PC performance and compatibility.
Low-Latency Remote Display
NVIDIA’s patented low-latency remote display technology great-
ly improves the user experience by reducing the lag that users
feel when interacting with their virtual machine.
With this technology, the virtual desktop screen is pushed
directly to the remoting protocol.
H.264 Encoding
The Kepler GPU includes a high-performance H.264 encoding
engine capable of encoding simultaneous streams with supe-
rior quality. This provides a giant leap forward in cloud server
efficiency by offloading the CPU from encoding functions and
allowing the encode function to scale with the number of GPUs
in a server.
Maximum User Density
GRID boards have an optimised multi-GPU design that helps to
maximise user density. The GRID K1 board features 4 GPUs and
16 GB of graphics memory, allowing it to support up to 100 users
on a single board.
Power Efficiency
GRID boards are designed to provide data center-class power
efficiency, including the revolutionary new streaming multi-
processor, called “SMX”. The result is an innovative, proven solu-
tion that delivers revolutionary performance-per-watt for the
enterprise data centre.
24/7 Reliability
GRID boards are designed, built, and tested by NVIDIA for 24/7
operation. Working closely with leading server vendors ensures
GRID cards perform optimally and reliably for the life of the sys-
tem.
Widest Range of Virtualisation Solutions
GRID cards enable GPU-capable virtualisation solutions like
XenServer or Linux KVM, delivering the flexibility to choose from
a wide range of proven solutions.
Technical Specifications

                         GRID K1 Board            GRID K2 Board
Number of GPUs           4 x entry Kepler GPUs    2 x high-end Kepler GPUs
Total NVIDIA CUDA cores  768                      3072
Total memory size        16 GB DDR3               8 GB GDDR5
Max power                130 W                    225 W
Board length             10.5”                    10.5”
Board height             4.4”                     4.4”
Board width              Dual slot                Dual slot
Display IO               None                     None
Aux power                6-pin connector          8-pin connector
PCIe                     x16                      x16
PCIe generation          Gen3 (Gen2 compatible)   Gen3 (Gen2 compatible)
Cooling solution         Passive                  Passive
Virtual GPU Technology
NVIDIA GRID vGPU brings the full benefit of NVIDIA hardware-ac-
celerated graphics to virtualized solutions. This technology pro-
vides exceptional graphics performance for virtual desktops
equivalent to local PCs when sharing a GPU among multiple users.
NVIDIA GRID vGPU is the industry’s most advanced technology
for sharing true GPU hardware acceleration between multiple vir-
tual desktops – without compromising the graphics experience.
Application features and compatibility are exactly the same as
they would be at the desk.
With GRID vGPU technology, the graphics commands of each vir-
tual machine are passed directly to the GPU, without translation
by the hypervisor. This allows the GPU hardware to be time-sliced
to deliver the ultimate in shared virtualised graphics performance.
vGPU Profiles Mean Customised, Dedicated Graphics Memory
Take advantage of vGPU Manager to assign just the right amount
of memory to meet the specific needs of each user. Every virtual
desktop has dedicated graphics memory, just like they would
at their desk, so they always have the resources they need to
launch and use their applications.
vGPU Manager enables up to eight users to share each physical
GPU, assigning the graphics resources of the available GPUs to
virtual machines in a balanced approach. Each NVIDIA GRID K1
graphics card has up to four GPUs, allowing 32 users to share a
single graphics card.
NVIDIA GRID    Virtual GPU  Graphics  Max Displays  Max Resolution  Max Users Per   Use Case
Graphics Board Profile      Memory    Per User      Per Display     Graphics Board
GRID K2        K260Q        2,048 MB  4             2560×1600       4               Designer/Power User
GRID K2        K240Q        1,024 MB  2             2560×1600       8               Designer/Power User
GRID K2        K220Q        512 MB    2             2560×1600       16              Designer/Power User
GRID K2        K200         256 MB    2             1900×1200       16              Knowledge Worker
GRID K1        K140Q        1,024 MB  2             2560×1600       16              Power User
GRID K1        K120Q        512 MB    2             2560×1600       32              Power User
GRID K1        K100         256 MB    2             1900×1200       32              Knowledge Worker

Up to 8 users are supported per physical GPU, depending on the vGPU profile.
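The "max users" column follows from simple arithmetic over the frame-buffer sizes quoted in the board specifications. A sketch, assuming each GPU's share of the board memory is divided by the profile size and capped at eight users per physical GPU:

```python
# Density arithmetic behind the vGPU profile table (illustrative
# model consistent with the figures above, not NVIDIA's tooling).

def users_per_board(fb_per_gpu_mb, profile_mb, gpus, cap=8):
    """Users per board = min(frame buffer / profile, cap) x GPU count."""
    per_gpu = min(fb_per_gpu_mb // profile_mb, cap)
    return per_gpu * gpus

# GRID K1: 16 GB across 4 GPUs = 4096 MB per GPU
print(users_per_board(4096, 512, gpus=4))   # K120Q -> 32 users
# GRID K2: 8 GB across 2 GPUs = 4096 MB per GPU
print(users_per_board(4096, 2048, gpus=2))  # K260Q -> 4 users
```

The cap is why the 256 MB K100 profile also tops out at 32 users on a K1 board rather than 64.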
Intel Cluster Ready – A Quality Standard for HPC Clusters
Intel Cluster Ready is designed to create predictable expectations for users and
providers of HPC clusters, primarily targeting customers in the commercial and
industrial sectors. These are not the experimental “test-bed” clusters used for
computer science and computer engineering research, nor the high-end “capability”
clusters, closely targeted at specific computing requirements, that power high-energy
physics at the national labs or other specialized research organizations.
Intel Cluster Ready seeks to advance HPC clusters used as computing resources in pro-
duction environments by providing cluster owners with a high degree of confidence
that the clusters they deploy will run the applications their scientific and engineering
staff rely upon to do their jobs. It achieves this by providing cluster hardware, software,
and system providers with a precisely defined basis for their products to meet their
customers’ production cluster requirements.
What are the Objectives of ICR?
The primary objective of Intel Cluster Ready is to make clusters
easier to specify, easier to buy, easier to deploy, and easier to
develop applications for. A key feature of ICR
is the concept of “application mobility”, which is defined as the
ability of a registered Intel Cluster Ready application – more cor-
rectly, the same binary – to run correctly on any certified Intel
Cluster Ready cluster. Clearly, application mobility is important
for users, software providers, hardware providers, and system
providers.
� Users want to know the cluster they choose will reliably run
the applications they rely on today, and will rely on tomorrow
� Application providers want to satisfy the needs of their cus-
tomers by providing applications that reliably run on their
customers’ cluster hardware and cluster stacks
� Cluster stack providers want to satisfy the needs of their cus-
tomers by providing a cluster stack that supports their cus-
tomers’ applications and cluster hardware
� Hardware providers want to satisfy the needs of their cus-
tomers by providing hardware components that support
their customers’ applications and cluster stacks
� System providers want to satisfy the needs of their custom-
ers by providing complete cluster implementations that reli-
ably run their customers’ applications
Without application mobility, each group above must either try
to support all combinations, which they have neither the time
nor resources to do, or pick the “winning combination(s)” that
best supports their needs, and risk making the wrong choice.
The Intel Cluster Ready definition of application mobility
supports all of these needs by going beyond pure portability
(re-compiling and linking a unique binary for each platform) to
application binary mobility (running the same binary on multi-
ple platforms), by more precisely defining the target system.
“The Intel Cluster Checker allows us to certify that our transtec HPC
clusters are compliant with an independent high quality standard.
Our customers can rest assured: their applications run as they expect.”
Marcus Wiedemann, HPC Solution Engineer
[Figure 1: The Intel Cluster Ready stack – registered applications (CFD, Crash, Climate, QCD, Bio, …) running on a single solution platform: Intel MPI Library and Intel MKL run-times, Linux cluster tools (Intel selected), and fabrics (InfiniBand/OFED, Gigabit and 10Gbit Ethernet) on the Intel Xeon processor platform; optional value add by the individual platform integrator (Intel, OEMs, platform integrators); Intel development tools (C++, Intel Trace Analyzer and Collector, MKL, etc.).]
A further aspect of application mobility is to ensure that regis-
tered Intel Cluster Ready applications do not need special pro-
gramming or alternate binaries for different message fabrics.
Intel Cluster Ready accomplishes this by providing an MPI imple-
mentation supporting multiple fabrics at runtime; through this,
registered Intel Cluster Ready applications obey the “message
layer independence property”. Stepping back, the unifying con-
cept of Intel Cluster Ready is “one-to-many,” that is
� One application will run on many clusters
� One cluster will run many applications
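The "message layer independence" idea, one binary with the fabric chosen at run time, can be sketched as follows. The class and fabric names are invented for illustration; the real mechanism is Intel MPI selecting a fabric at launch via its runtime configuration.

```python
# Illustrative sketch: the application is written against one
# messaging interface, and the concrete fabric is selected at run
# time, so no per-fabric binaries are needed.

class Fabric:
    def send(self, rank, data):
        raise NotImplementedError

class TcpFabric(Fabric):
    name = "tcp"
    def send(self, rank, data):
        return f"tcp->{rank}:{len(data)}B"

class InfiniBandFabric(Fabric):
    name = "ofa"
    def send(self, rank, data):
        return f"ofa->{rank}:{len(data)}B"

FABRICS = {f.name: f for f in (TcpFabric(), InfiniBandFabric())}

def select_fabric(requested):
    """Pick the requested fabric, falling back to TCP."""
    return FABRICS.get(requested, FABRICS["tcp"])

# Same application code, different wire: chosen by environment/config.
fabric = select_fabric("ofa")
print(fabric.send(1, b"halo-exchange"))  # ofa->1:13B
```

Swapping `"ofa"` for `"tcp"` changes the transport without touching, or rebuilding, the application code, which is exactly the property the specification demands of registered applications.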
How is one-to-many accomplished? Looking at Figure 1, you see
the abstract Intel Cluster Ready “stack” components that al-
ways exist in every cluster, i.e., one or more applications, a clus-
ter software stack, one or more fabrics, and finally the underly-
ing cluster hardware. The remainder of that picture (to the right)
shows the components in greater detail.
Applications, on the top of the stack, rely upon the various APIs,
utilities, and file system structure presented by the underly-
ing software stack. Registered Intel Cluster Ready applications
are always able to rely upon the APIs, utilities, and file system
structure specified by the Intel Cluster Ready Specification; if an
application requires software outside this “required” set, then
Intel Cluster Ready requires the application to provide that soft-
ware as a part of its installation.
To ensure that this additional per-application software doesn’t
conflict with the cluster stack or other applications, Intel
Cluster Ready also requires the additional software to be in-
stalled in application-private trees, so the application knows
how to find that software while not interfering with other
applications. While this may well cause duplicate software to
be installed, the reliability provided by the duplication far out-
weighs the cost of the duplicated files.
A prime example supporting this comparison is the removal of a
common file (library, utility, or other) that is unknowingly need-
ed by some other application – such errors can be insidious to
repair even when they cause an outright application failure.
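The application-private-tree rule can be sketched as follows. The paths and helper names are illustrative, not part of the specification; the point is that each application resolves its extra software under its own prefix instead of shared system locations.

```python
# Illustrative sketch of application-private install trees: each
# registered application bundles its extra libraries under its own
# prefix and finds them there, so removing one application can
# never break another. Paths are invented for the example.

def private_prefix(app, version, root="/opt/apps"):
    """Per-application install tree, e.g. /opt/apps/mysolver/3.2"""
    return f"{root}/{app}/{version}"

def app_environment(app, version):
    """Environment fragment pointing only into the app's own tree."""
    prefix = private_prefix(app, version)
    return {
        "PATH": f"{prefix}/bin",
        "LD_LIBRARY_PATH": f"{prefix}/lib",
    }

env = app_environment("mysolver", "3.2")
print(env["LD_LIBRARY_PATH"])  # /opt/apps/mysolver/3.2/lib
```

Two applications that bundle different versions of the same library each load their own copy, at the cost of some duplicated files, the trade-off the text argues is worthwhile.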
Cluster platforms, at the bottom of the stack, provide the APIs,
utilities, and file system structure relied upon by registered ap-
plications. Certified Intel Cluster Ready platforms ensure the
APIs, utilities, and file system structure are complete per the Intel
Cluster Ready Specification; certified clusters are able to provide
them by various means as they deem appropriate. Because of the
clearly defined responsibilities ensuring the presence of all soft-
ware required by registered applications, system providers have
a high confidence that the certified clusters they build are able
to run any certified applications their customers rely on. In ad-
dition to meeting the Intel Cluster Ready requirements, certified
clusters can also provide their added value, that is, other features
and capabilities that increase the value of their products.
How Does Intel Cluster Ready Accomplish its Objectives?
At its heart, Intel Cluster Ready is a definition of the cluster as a
parallel application platform, as well as a tool to certify an ac-
tual cluster to the definition. Let’s look at each of these in more
detail, to understand their motivations and benefits.
A definition of the cluster as a parallel application platform
The Intel Cluster Ready Specification is very much written as the
requirements for, not the implementation of, a platform upon
which parallel applications, more specifically MPI applications,
can be built and run. As such, the specification doesn’t care
whether the cluster is diskful or diskless, fully distributed or
single system image (SSI), built from “Enterprise” distributions
or community distributions, fully open source or not. Perhaps
more importantly, with one exception, the specification doesn’t
have any requirements on how the cluster is built; that one
exception is that compute nodes must be built with automated
tools, so that new, repaired, or replaced nodes can be rebuilt
identically to the existing nodes without any manual interac-
tion, other than possibly initiating the build process.
Some items the specification does care about include:
� The ability to run both 32- and 64-bit applications, including
MPI applications and X-clients, on any of the compute nodes
� Consistency among the compute nodes’ configuration, capa-
bility, and performance
� The identical accessibility of libraries and tools across
the cluster
� The identical access by each compute node to permanent
and temporary storage, as well as users’ data
� The identical access to each compute node from the head node
� The MPI implementation provides fabric independence
� All nodes support network booting and provide a remotely
accessible console
The specification also requires that the runtimes for specific In-
tel software products are installed on every certified cluster:
� Intel Math Kernel Library
� Intel MPI Library Runtime Environment
� Intel Threading Building Blocks
This requirement does two things. First and foremost, main-
line Linux distributions do not necessarily provide a sufficient
software stack to build an HPC cluster – such specialization is
beyond their mission. Secondly, the requirement ensures that
programs built with this software will always work on certified
clusters and enjoy simpler installations. As these runtimes are
directly available from the web, the requirement does not cause
additional costs to certified clusters. It is also very important
to note that this does not require certified applications to use
these libraries nor does it preclude alternate libraries, e.g., other
MPI implementations, from being present on certified clusters.
Quite clearly, an application that requires, e.g., an alternate MPI,
must also provide the runtimes for that MPI as a part of its in-
stallation.
A tool to certify an actual cluster to the definition
The Intel Cluster Checker, included with every certified Intel
Cluster Ready implementation, is used in four modes in the life
of a cluster:
� To certify a system provider’s prototype cluster as a valid im-
plementation of the specification
� To verify to the owner that the just-delivered cluster is a “true
copy” of the certified prototype
� To ensure the cluster remains fully functional, reducing ser-
vice calls not related to the applications or the hardware
� To help software and system providers diagnose and correct
actual problems to their code or their hardware.
While these are critical capabilities, in all fairness, this greatly
understates the capabilities of Intel Cluster Checker. The tool
verifies not only that the cluster is configured correctly, but also
that it is performing as expected. To do this, per-node and
cluster-wide static and dynamic tests are made of the hardware
and software.
[Figure: Intel Cluster Checker – a cluster definition and configuration XML file feeds the Cluster Checker engine, which runs test modules and parallel operations checks across the nodes and reports pass/fail results and diagnostics to STDOUT and a logfile.]
The static checks ensure the systems are configured consist-
ently and appropriately. As one example, the tool will ensure
the systems are all running the same BIOS versions as well as
having identical configurations among key BIOS settings. This
type of problem – differing BIOS versions or settings – can be
the root cause of subtle problems such as differing memory con-
figurations that manifest themselves as differing memory band-
widths only to be seen at the application level as slower than ex-
pected overall performance. As is well-known, parallel program
performance can be very much governed by the performance
of the slowest components, not the fastest. In another static
check, the Intel Cluster Checker will ensure that the expected
tools, libraries, and files are present on each node, identically
located on all nodes, and identically implemented on all
nodes. This ensures that each node has the minimal software
stack required by the specification, as well as an identical
software stack across the compute nodes.
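Such a static consistency check can be sketched as follows. This is not Intel Cluster Checker code, and the node facts are invented; the idea is simply that the majority value wins and deviating nodes are flagged.

```python
# Illustrative static consistency check in the spirit of the text:
# compare per-node facts (BIOS version, library checksums) and
# report nodes that deviate from the majority. Data is invented.
from collections import Counter

nodes = {
    "node01": {"bios": "2.1b", "libc_md5": "a3f0"},
    "node02": {"bios": "2.1b", "libc_md5": "a3f0"},
    "node03": {"bios": "2.0a", "libc_md5": "a3f0"},  # differing BIOS
}

def find_outliers(nodes, key):
    """Majority value wins; nodes deviating from it are flagged."""
    values = Counter(info[key] for info in nodes.values())
    majority, _ = values.most_common(1)[0]
    return {n: info[key] for n, info in nodes.items()
            if info[key] != majority}

print(find_outliers(nodes, "bios"))      # {'node03': '2.0a'}
print(find_outliers(nodes, "libc_md5"))  # {}
```

A BIOS mismatch caught this way is exactly the kind of root cause that would otherwise only surface as mysteriously slow application runs.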
A typical dynamic check ensures consistent system perfor-
mance, e.g., via the STREAM benchmark. This particular test en-
sures processor and memory performance is consistent across
compute nodes, which, like the BIOS setting example above, can
be the root cause of overall slower application performance. An
additional check with STREAM can be made if the user config-
ures an expectation of benchmark performance; this check will
ensure that performance is not only consistent across the clus-
ter, but also meets expectations.
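A dynamic check in this spirit can be sketched as follows; the tolerance, bandwidth figures, and function names are illustrative, not Intel Cluster Checker's actual logic.

```python
# Illustrative dynamic check inspired by the STREAM test described
# above: flag nodes whose measured memory bandwidth deviates from
# the cluster median, or misses a configured expectation.

def check_bandwidth(results_mb_s, tolerance=0.05, expected=None):
    """results_mb_s maps node -> STREAM triad bandwidth (MB/s).

    Returns a dict of failing nodes with a reason string.
    """
    values = sorted(results_mb_s.values())
    median = values[len(values) // 2]
    failures = {}
    for node, bw in results_mb_s.items():
        if abs(bw - median) / median > tolerance:
            failures[node] = f"inconsistent ({bw} vs median {median})"
        elif expected is not None and bw < expected:
            failures[node] = f"below expectation ({bw} < {expected})"
    return failures

stream = {"node01": 41200, "node02": 40900,
          "node03": 33800, "node04": 41050}
print(check_bandwidth(stream, expected=40000))
```

Here node03's bandwidth is flagged as inconsistent with the rest of the cluster, the "slowest component governs parallel performance" problem the text describes.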
Going beyond processor performance, the Intel MPI Bench-
marks are used to ensure the network fabric(s) are performing
properly and, with a configuration that describes expected
performance levels, up to the cluster provider’s performance
expectations. Network inconsistencies due to poorly perform-
ing Ethernet NICs, InfiniBand HCAs, faulty switches, and loose or
faulty cables can be identified. Finally, the Intel Cluster Checker
is extensible, enabling additional tests to be added supporting
additional features and capabilities.
[Figure: The Intel Cluster Ready program – a specification at the hub, surrounded by reference designs, cluster platform software tools, demand creation, support & training, and ISV enabling, covering configuration, software, interconnect, server platform, and the Intel processor.]
This enables the Intel Cluster Checker to not only support the minimal requirements of the Intel Cluster Ready Specification, but the full cluster as delivered to the customer.

Conforming hardware and software
The preceding was primarily related to the builders of certified clusters and the developers of registered applications. For end users who want to purchase a certified cluster to run registered applications, the ability to identify registered applications and certified clusters is most important, as that will reduce their effort to evaluate, acquire, and deploy the clusters that run their applications, and then keep that computing resource operating properly, with full performance, directly increasing their productivity.

Intel, XEON and certain other trademarks and logos appearing in this brochure are trademarks or registered trademarks of Intel Corporation.

Intel Cluster Ready Builds HPC Momentum
With the Intel Cluster Ready (ICR) program, Intel Corporation set out to create a win-win scenario for the major constituencies in the high-performance computing (HPC) cluster market. Hardware vendors and independent software vendors (ISVs) stand to win by being able to assure both buyers and users that their products will work well together straight out of the box.

System administrators stand to win by being able to meet corporate demands to push HPC competitive advantages deeper into their organizations while satisfying end users’ demands for reliable HPC cycles, all without increasing IT staff. End users stand to win by being able to get their work done faster, with less downtime, on certified cluster platforms. Last but not least, with ICR, Intel has positioned itself to win by expanding the total addressable market (TAM) and reducing time to market for the company’s microprocessors, chip sets, and platforms.

The Worst of Times
For a number of years, clusters were largely confined to government and academic sites, where contingents of graduate students and midlevel employees were available to help program and maintain the unwieldy early systems. Commercial firms lacked this low-cost labor supply and mistrusted the favored cluster operating system, open source Linux, on the grounds that no single party could be held accountable if something went wrong with it. Today, cluster penetration in the HPC market is deep and wide, extending from systems with a handful of processors to some of the world’s largest supercomputers, and from under $25,000 to tens or hundreds of millions of dollars in price. Clusters increasingly pervade every HPC vertical market: biosciences, computer-aided engineering, chemical engineering, digital content creation, economic/financial services, electronic design automation, geosciences/geo-engineering, mechanical design, defense, government labs, academia, and weather/climate.
But IDC studies have consistently shown that clusters remain
difficult to specify, deploy, and manage, especially for new and
less experienced HPC users. This should come as no surprise,
given that a cluster is a set of independent computers linked to-
gether by software and networking technologies from multiple
vendors. Clusters originated as do-it-yourself HPC systems. In
the late 1990s users began employing inexpensive hardware to
cobble together scientific computing systems based on the “Be-
owulf cluster” concept first developed by Thomas Sterling and
Donald Becker at NASA. From their Beowulf origins, clusters have
evolved and matured substantially, but the system manage-
ment issues that plagued their early years remain in force today.
The Need for Standard Cluster Solutions
The escalating complexity of HPC clusters poses a dilemma for
many large IT departments that cannot afford to scale up their
HPC-knowledgeable staff to meet the fast-growing end-user de-
mand for technical computing resources. Cluster management
is even more problematic for smaller organizations and busi-
ness units that often have no dedicated, HPC-knowledgeable
staff to begin with.
The ICR program aims to address burgeoning cluster complexity
by making available a standard solution (aka reference archi-
tecture) for Intel-based systems that hardware vendors can use
to certify their configurations and that ISVs and other software
vendors can use to test and register their applications, system
software, and HPC management software. The chief goal of this
voluntary compliance program is to ensure fundamental hard-
ware-software integration and interoperability so that system
administrators and end users can confidently purchase and de-
ploy HPC clusters, and get their work done, even in cases where
no HPC-knowledgeable staff are available to help.
The ICR program wants to prevent end users from having to
become, in effect, their own systems integrators. In smaller or-
ganizations, the ICR program is designed to allow overworked
IT departments with limited or no HPC expertise to support HPC
user requirements more readily. For larger organizations with
dedicated HPC staff, ICR creates confidence that required user
applications will work, eases the problem of system adminis-
tration, and allows HPC cluster systems to be scaled up in size
without scaling support staff. ICR can help drive HPC cluster re-
sources deeper into larger organizations and free up IT staff to
focus on mainstream enterprise applications (e.g., payroll, sales,
HR, and CRM).
The program is a three-way collaboration among hardware ven-
dors, software vendors, and Intel. In this triple alliance, Intel pro-
vides the specification for the cluster architecture implementa-
tion, and then vendors certify the hardware configurations and
register software applications as compliant with the specifica-
tion. The ICR program’s promise to system administrators and
end users is that registered applications will run out of the box
on certified hardware configurations.
ICR solutions are compliant with the standard platform archi-
tecture, which starts with 64-bit Intel Xeon processors in an In-
tel-certified cluster hardware platform. Layered on top of this
foundation are the interconnect fabric (Gigabit Ethernet, Infini-
Band) and the software stack: Intel-selected Linux cluster tools,
an Intel MPI runtime library, and the Intel Math Kernel Library.
Intel runtime components are available and verified as part of
the certification (e.g., Intel tool runtimes) but are not required
to be used by applications. The inclusion of these Intel runtime
components does not exclude any other components a systems
vendor or ISV might want to use. At the top of the stack are In-
tel-registered ISV applications.
At the heart of the program is the Intel Cluster Checker, a vali-
dation tool that verifies that a cluster is specification compliant
and operational before ISV applications are ever loaded. After
the cluster is up and running, the Cluster Checker can function
as a fault isolation tool in wellness mode. Certification needs
to happen only once for each distinct hardware platform, while
verification – which determines whether a valid copy of the
specification is operating – can be performed by the Cluster
Checker at any time.
Cluster Checker is an evolving tool that is designed to accept
new test modules. It is a productized tool that ICR members ship
with their systems. Cluster Checker originally was designed for
homogeneous clusters but can now also be applied to clusters
with specialized nodes, such as all-storage sub-clusters. Cluster
Checker can isolate a wide range of problems, including network
or communication problems.
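The core idea of the Cluster Checker – verifying that every node reports the same configuration "fingerprint" as a reference node – can be sketched in a few lines. The node records and field names below are hypothetical, not actual Cluster Checker data:

```python
# Minimal sketch of a cluster-uniformity check in the spirit of the
# Intel Cluster Checker: every compute node must report the same
# hardware/software fingerprint as the reference (first) node.
# The node records below are hypothetical, not Cluster Checker output.

def check_uniformity(nodes, fields):
    """Return a list of (hostname, field, value, expected) mismatches."""
    reference = nodes[0]
    problems = []
    for node in nodes[1:]:
        for field in fields:
            if node.get(field) != reference.get(field):
                problems.append((node["host"], field,
                                 node.get(field), reference.get(field)))
    return problems

nodes = [
    {"host": "node01", "cpu": "Xeon E5-2680", "kernel": "3.10.0", "mpi": "5.0.3"},
    {"host": "node02", "cpu": "Xeon E5-2680", "kernel": "3.10.0", "mpi": "5.0.3"},
    {"host": "node03", "cpu": "Xeon E5-2680", "kernel": "3.10.2", "mpi": "5.0.3"},
]

# node03 runs a different kernel and would be flagged
mismatches = check_uniformity(nodes, ["cpu", "kernel", "mpi"])
```

The real tool goes far beyond such static comparison – it also runs functional and performance tests – but the uniformity check captures why a single deviating node is easy to find and report.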
The transtec Benchmarking Center
transtec offers their customers a new and fascinating way to
evaluate transtec’s HPC solutions in real world scenarios. With
the transtec Benchmarking Center, solutions can be explored in
detail with the actual applications the customers will later run
on them. Intel Cluster Ready makes this feasible by simplifying
system maintenance and allowing clean systems to be set up
easily, as often as needed.
As High-Performance Computing (HPC) systems are utilized for
numerical simulations, more and more advanced clustering
technologies are being deployed. Because of their performance,
price/performance and energy efficiency advantages, clusters
now dominate all segments of the HPC market and continue
to gain acceptance. HPC computer systems have become far
more widespread and pervasive in government, industry, and
academia. However, rarely does the client have the possibility
to test their actual application on the system they are planning
to acquire.
transtec HPC solutions get used by a wide variety of clients.
Among those are most of the large users of compute power at
German and other European universities and research centers
as well as governmental users like the German army’s compute
center and clients from the high tech, the automotive and oth-
er sectors. transtec HPC solutions have demonstrated their
value in more than 500 installations. Most of transtec’s cluster
systems are based on SUSE Linux Enterprise Server, Red Hat
Enterprise Linux, CentOS, or Scientific Linux. With xCAT for ef-
ficient cluster deployment, and Moab HPC Suite by Adaptive
Computing for high-level cluster management, transtec is able
to efficiently deploy and ship easy-to-use HPC cluster solutions
with enterprise-class management features. Moab has proven
to provide easy-to-use workload and job management for small
systems as well as the largest cluster installations worldwide.
However, when selling clusters to governmental customers as
well as other large enterprises, it is often required that the cli-
ent can choose from a range of competing offers. Many times
there is a fixed budget available and competing solutions are
compared based on their performance towards certain custom
benchmark codes.
So, in 2007 transtec decided to add another layer to their already
wide array of competence in HPC – ranging from cluster
deployment and management through the latest CPU, board and
network technology to HPC storage systems. In transtec's HPC
Lab, the systems are assembled; transtec uses Intel Cluster
Ready to facilitate verification, documentation, and testing,
including the final tests, throughout the actual build process.
At the benchmarking center transtec can now offer a set of small
clusters with the “newest and hottest technology” through
Intel Cluster Ready. A standard installation infrastructure
gives transtec a quick and easy way to set systems up accord-
ing to their customers’ choice of operating system, compilers,
workload management suite, and so on. With Intel Cluster Ready
there are prepared standard set-ups available with verified per-
formance at standard benchmarks while the system stability is
guaranteed by our own test suite and the Intel Cluster Checker.
The Intel Cluster Ready program is designed to provide a com-
mon standard for HPC clusters, helping organizations design
and build seamless, compatible and consistent cluster config-
urations. Integrating the standards and tools provided by this
program can help significantly simplify the deployment and
management of HPC clusters.
Parallel NFS – The New Standard for HPC Storage
HPC computation results in the terabyte range are not uncommon. The problem
in this context is not so much storing the data at rest, but the performance of the
necessary copying back and forth in the course of the computation job flow and
the dependent job turn-around time.
For interim results during a job runtime or for fast storage of input and results data, par-
allel file systems have established themselves as the standard to meet the ever-increas-
ing performance requirements of HPC storage systems. Parallel NFS is about to become
the new standard framework for a parallel file system.
[Sidebar: application areas – Engineering, Life Sciences, Automotive, Price Modelling, Aerospace, CAE, Data Analytics]
[Figure: a classical NFS setup – cluster nodes as NFS clients, a NAS head as the NFS server]
Yesterday’s Solution: NFS for HPC Storage
The original Network File System (NFS) developed by Sun
Microsystems at the end of the eighties – now available in version 4.1
– has been established for a long time as a de-facto standard for
the provisioning of a global namespace in networked computing.
A very widespread HPC cluster solution includes a central master
node acting simultaneously as an NFS server, with its local file
system storing input, interim and results data and exporting
them to all other cluster nodes.
There is of course an immediate bottleneck in this method: When
the load of the network is high, or where there are large num-
bers of nodes, the NFS server can no longer keep up delivering or
receiving the data. In high-performance computing especially,
the nodes are interconnected with at least Gigabit Ethernet,
so the aggregate throughput is well above what an NFS server
with a Gigabit interface can achieve. Even a powerful network
connection of the NFS server to the cluster, for example with
10-Gigabit Ethernet, is only a temporary solution to this
problem until the next cluster upgrade. The fundamental problem
remains: this solution is not scalable. In addition, NFS is a
difficult protocol to cluster in terms of load balancing: either
you have to ensure that multiple NFS servers accessing the same
data are constantly synchronised, at the cost of a noticeable
drop in performance, or you manually partition the global
namespace, which is also time-consuming. NFS is not suitable
for dynamic load balancing, as on paper it appears to be
stateless but in reality is stateful.
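The bottleneck can be made concrete with a back-of-the-envelope calculation; the node counts and link speeds below are illustrative, not measurements:

```python
# Back-of-the-envelope: aggregate client demand vs. one NFS server link.
# All numbers are illustrative, not measurements.

def server_share_mbps(num_nodes, node_link_mbps, server_link_mbps):
    """Per-node throughput (Mbit/s) when all nodes read from one server."""
    aggregate_demand = num_nodes * node_link_mbps
    if aggregate_demand <= server_link_mbps:
        return node_link_mbps            # server link is not the bottleneck
    return server_link_mbps / num_nodes  # fair share of the server link

# 64 nodes with Gigabit Ethernet against a single Gigabit NFS server:
share_1g = server_share_mbps(64, 1000, 1000)    # ~15.6 Mbit/s per node
# Upgrading the server to 10 GbE helps, but only by a constant factor:
share_10g = server_share_mbps(64, 1000, 10000)  # ~156 Mbit/s per node
```

Whatever the server link speed, the per-node share shrinks linearly as nodes are added – which is exactly the scalability problem described above.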
Today’s Solution: Parallel File Systems
For some time, powerful commercial products have been
available to meet the high demands on an HPC storage sys-
tem. The open-source solutions BeeGFS from the Fraunhofer
Competence Center for High Performance Computing or Lus-
tre are widely used in the Linux HPC world, and also several
other free as well as commercial parallel file system solutions
exist.
What is new is that the time-honoured NFS is to be upgraded,
including a parallel version, into an Internet Standard with
the aim of interoperability between all operating systems.
The original problem statement for parallel NFS access was
written by Garth Gibson, a professor at Carnegie Mellon Uni-
versity and founder and CTO of Panasas. Gibson was already a
renowned figure being one of the authors contributing to the
original paper on RAID architecture from 1988.
The influence of the original Gibson and Panasas proposal is
clearly noticeable in the design of pNFS. The powerful HPC file system
developed by Gibson and Panasas, ActiveScale PanFS, with
object-based storage devices functioning as central components,
is basically the commercial continuation of the “Network-Attached
Secure Disk (NASD)” project, also developed by Garth Gibson at
Carnegie Mellon University.
Parallel NFS
Parallel NFS (pNFS) is gradually emerging as the future standard
to meet requirements in the HPC environment. From the indus-
try’s as well as the user’s perspective, the benefits of utilising
standard solutions are indisputable: besides protecting end
user investment, standards also ensure a defined level of inter-
operability without restricting the choice of products availa-
ble. As a result, less user and administrator training is required
which leads to simpler deployment and at the same time, a
greater acceptance.
As part of the NFS 4.1 Internet Standard, pNFS will not only
adopt the semantics of NFS in terms of cache consistency or
security, it also represents an easy and flexible extension of
the NFS 4 protocol. pNFS is optional; in other words, NFS 4.1
implementations do not have to include pNFS as a feature.
The NFS 4.1 Internet Standard is published as IETF RFC 5661.
The pNFS protocol supports a separation of metadata and data:
a pNFS cluster comprises so-called storage devices which store
the data from the shared file system and a metadata server
(MDS), called Director Blade with Panasas – the actual NFS 4.1
server. The metadata server keeps track of which data is stored
on which storage devices and how to access the files, the so-
called layout. Besides these “striping parameters”, the MDS
also manages other metadata including access rights or similar,
which is usually stored in a file’s inode.
The layout types define which Storage Access Protocol is used
by the clients to access the storage devices. Up until now, three
potential storage access protocols have been defined for pNFS:
file, block and object-based layouts, the first described directly
in RFC 5661, the latter two in RFC 5663 and RFC 5664, respectively.
Last but not least, a Control Protocol is also used by the MDS
and storage devices to synchronise status data. This protocol is
deliberately unspecified in the standard to give manufacturers
certain flexibility. The NFS 4.1 standard does however specify cer-
tain conditions which a control protocol has to fulfil, for exam-
ple, how to deal with the change/modify time attributes of files.
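The role of the layout can be illustrated with the round-robin striping arithmetic a client performs once it holds the striping parameters; the stripe size, device names, and offsets below are invented:

```python
# Sketch of what a pNFS client does with a layout: map a byte offset to
# (storage device, offset within that device's object) using the
# striping parameters handed out by the metadata server.
# Stripe size and device names are illustrative.

def locate(offset, stripe_unit, devices):
    """Return (device, offset within that device's object)."""
    stripe_number = offset // stripe_unit
    device = devices[stripe_number % len(devices)]
    # full stripe units already placed on this device, plus the remainder
    local_offset = (stripe_number // len(devices)) * stripe_unit \
                   + offset % stripe_unit
    return device, local_offset

devices = ["osd0", "osd1", "osd2"]
# with 64 KiB stripe units, byte 200000 falls into stripe unit 3,
# which round-robin places back on the first device
dev, off = locate(200_000, 65_536, devices)
```

Because this mapping is pure arithmetic, every client holding the same layout can compute it independently and access the storage devices in parallel, without asking the MDS per request.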
What’s New in NFS 4.1?
NFS 4.1 is a minor update to NFS 4, and adds new features to it.
One of the optional features is parallel NFS (pNFS) but there is
other new functionality as well.
One of the technical enhancements is the use of sessions, a
persistent server object, dynamically created by the client. By
means of sessions, the state of an NFS connection can be stored,
no matter whether the connection is live or not. Sessions sur-
vive temporary downtimes both of the client and the server.
Each session has a so-called fore channel, which is the connec-
tion from the client to the server for all RPC operations, and op-
tionally a back channel for RPC callbacks from the server that
now can also be realized through firewall boundaries. Sessions
can be trunked to increase bandwidth. Besides session
trunking there is also client ID trunking, which groups
several sessions under the same client ID.
By means of sessions, NFS can be seen as a truly stateful
protocol with so-called “Exactly-Once Semantics” (EOS).
Previously, a necessary but unspecified reply cache within the
NFS server was used to handle identical RPC operations that had
been sent several times. This statefulness in reality is not very robust,
however, and sometimes leads to the well-known stale NFS han-
dles. In NFS 4.1, the reply cache is now a mandatory part of the
NFS implementation, storing the server replies to RPC requests
persistently on disk.
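The mechanics of Exactly-Once Semantics can be sketched with a toy reply cache keyed by session, slot and sequence number – a strong simplification of the NFS 4.1 slot-table mechanism, not a protocol implementation:

```python
# Toy reply cache illustrating Exactly-Once Semantics (EOS): a retried
# RPC carrying a known (session, slot, sequence) triple receives the
# cached reply instead of being executed again. Strongly simplified
# from the NFS 4.1 slot-table mechanism; names are invented.

class ReplyCache:
    def __init__(self):
        self.cache = {}        # (session, slot, seq) -> cached reply
        self.executions = 0    # how many times work actually ran

    def call(self, session, slot, seq, operation):
        key = (session, slot, seq)
        if key in self.cache:          # duplicate: replay, don't re-run
            return self.cache[key]
        self.executions += 1
        reply = operation()
        self.cache[key] = reply
        return reply

rc = ReplyCache()
counter = {"n": 0}

def create_file():                     # a non-idempotent operation
    counter["n"] += 1
    return "ok"

first = rc.call("sess-1", 0, 7, create_file)
retry = rc.call("sess-1", 0, 7, create_file)   # client resent the RPC
# the operation ran exactly once despite two transmissions
```

NFS 4.1 additionally requires the server to keep this cache persistently, which is what makes the semantics survive server restarts.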
Another new feature of NFS 4.1 is delegation for directories: NFS
clients can be given temporary exclusive access to directories.
Previously, this was possible only for regular files. With the
forthcoming version 4.2 of the NFS standard, federated filesys-
tems will be added as a feature, which represents the NFS coun-
terpart of Microsoft’s DFS (distributed filesystem).
Some Proposed NFSv4.2 features
NFSv4.2 promises many features that end-users have been
requesting, and that make NFS more relevant as not only an
“every day” protocol, but one that has application beyond the
data center.
Server Side Copy
Server-Side Copy (SSC) removes one leg of a copy operation. In-
stead of reading entire files or even directories of files from one
server through the client, and then writing them out to another,
SSC permits the destination server to communicate directly to
the source server without client involvement, and removes the
limitations on server to client bandwidth and the possible con-
gestion it may cause.
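On Linux, applications can request such copy offload through `os.copy_file_range` (Python 3.8+); on an NFSv4.2 mount the kernel may translate the call into a true server-side copy, so the data never crosses the client's network link. The sketch below demonstrates the call on ordinary local temporary files:

```python
import os
import tempfile

# A conventional copy reads every byte from the source server and
# writes it back out to the destination through the client.
# os.copy_file_range (Python 3.8+, Linux kernel >= 4.5) hands the byte
# movement to the kernel instead; on an NFSv4.2 mount the kernel may
# turn it into a server-side copy. Demonstrated here on local files.

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"simulation output " * 1000)
    src_path = f.name
dst_path = src_path + ".copy"

with open(src_path, "rb") as fin, open(dst_path, "wb") as fout:
    remaining = os.fstat(fin.fileno()).st_size
    if hasattr(os, "copy_file_range"):           # Linux-only syscall
        while remaining > 0:
            n = os.copy_file_range(fin.fileno(), fout.fileno(), remaining)
            if n == 0:
                break
            remaining -= n
    else:                                        # portable fallback
        fout.write(fin.read())

with open(src_path, "rb") as a, open(dst_path, "rb") as b:
    identical = a.read() == b.read()
os.unlink(src_path)
os.unlink(dst_path)
```

The application-visible result is the same as a read/write loop; the difference is purely where the bytes travel.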
Application Data Blocks (ADB)
ADB allows definition of the format of a file; for example, a VM
image or a database. This feature will allow initialization of data
stores; a single operation from the client can create a 300GB da-
tabase or a VM image on the server.
Guaranteed Space Reservation & Hole Punching
As storage demands continue to increase, various efficiency
techniques can be employed to give the appearance of a large
virtual pool of storage on a much smaller storage system. Thin
provisioning (where space appears available and reserved, but
is not committed) is commonplace, but often problematic to
manage in fast growing environments. The guaranteed space
reservation feature in NFSv4.2 will ensure that, regardless of the
thin provisioning policies, individual files will always have space
available for their maximum extent.
While such guarantees are a reassurance for the end-user, they
don’t help the storage administrator in his or her desire to fully
utilize all the available storage. In support of better storage ef-
ficiencies, NFSv4.2 will introduce support for sparse files. Com-
monly called “hole punching”, deleted and unused parts of files
are returned to the storage system’s free space pool.
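The bookkeeping behind hole punching can be sketched with a toy extent map – a simplification, not the actual NFSv4.2 operation; the block size and file size below are invented:

```python
# Toy extent map illustrating hole punching: the logical file size is
# unchanged, but blocks inside the punched range are returned to the
# free pool. A simplification of real sparse-file support; block size
# and file size are invented.

class SparseFile:
    def __init__(self, size, block=4096):
        self.size = size                       # logical size, fixed
        self.block = block
        nblocks = (size + block - 1) // block
        self.allocated = set(range(nblocks))   # start fully allocated

    def punch_hole(self, offset, length):
        """Deallocate whole blocks inside [offset, offset+length)."""
        first = (offset + self.block - 1) // self.block   # round up
        last = (offset + length) // self.block            # round down
        freed = {b for b in self.allocated if first <= b < last}
        self.allocated -= freed
        return len(freed) * self.block   # bytes returned to free pool

f = SparseFile(1_000_000)                 # ~1 MB file, 245 blocks
freed = f.punch_hole(100_000, 500_000)    # punch out the middle
# logical size is untouched, physical usage shrinks
```

Only whole blocks are deallocated, which is why the offsets are rounded inward – the same reason real file systems punch holes at block granularity.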
[Figure: pNFS architecture – pNFS clients access the storage devices via a Storage Access Protocol and the metadata server via NFS 4.1; MDS and storage devices communicate via the Control Protocol]
pNFS supports backwards compatibility with non-pNFS compat-
ible NFS 4 clients. In this case, the MDS itself gathers data from
the storage devices on behalf of the NFS client and presents the
data to the NFS client via NFS 4. The MDS acts as a kind of proxy
server – which is e.g. what the Director Blades from Panasas do.
pNFS Layout Types
If storage devices act simply as NFS 4 file servers, the file layout
is used. It is the only storage access protocol directly specified
in the NFS 4.1 standard. Besides the stripe sizes and stripe lo-
cations (storage devices), it also includes the NFS file handles
which the client needs to use to access the separate file areas.
The file layout is compact and static: the striping information
does not change even if changes are made to the file, enabling
multiple pNFS clients to cache the layout simultaneously and to
avoid synchronisation overhead between clients and the MDS or
between the MDS and storage devices.
File system authorisation and client authentication can be well
implemented with the file layout. When using NFS 4 as the stor-
age access protocol, client authentication merely depends on
the security flavor used – when using the RPCSEC_GSS security
flavor, client access is kerberized, for example – and the server
controls access authorization using specified ACLs and cryp-
tographic processes.
In contrast, the block/volume layout uses volume identifiers
and block offsets and extents to specify a file layout. SCSI
block commands are used to access storage devices. As the
block distribution can change with each write access, the
layout must be updated more frequently than with the file
layout.
Block-based access to storage devices does not offer any secure
authentication option for the accessing SCSI initiator. Secure
SAN authorisation is possible with host granularity only, based
on World Wide Names (WWNs) with Fibre Channel or iSCSI
Qualified Names (IQNs) with iSCSI. The server cannot enforce access
control governed by the file system. Instead, a pNFS client
essentially abides by the access rights voluntarily; the
storage device has to trust the pNFS client – a fundamental
access control problem that is a recurrent issue in the NFS pro-
tocol history.
The object layout is syntactically similar to the file layout, but it
uses the SCSI object command set for data access to so-called
Object-based Storage Devices (OSDs) and is heavily based on
the DirectFLOW protocol of the ActiveScale PanFS from Pana-
sas. From the very start, Object-Based Storage Devices were
designed for secure authentication and access. So-called
capabilities, issued by the MDS to the pNFS clients, are used for
object access. The ownership of such a capability represents the
authoritative access right to an object.
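The capability mechanism can be sketched as a token the MDS signs and the storage device verifies without contacting the MDS; HMAC stands in here for the OSD protocol's actual integrity scheme, and all identifiers are invented:

```python
import hashlib
import hmac

# Sketch of capability-based object access: the metadata server issues
# a signed capability naming the object and the granted rights; the OSD
# checks the signature with a secret shared with the MDS and never has
# to consult the MDS per request. HMAC stands in for the OSD protocol's
# real integrity scheme; all identifiers are invented.

SECRET = b"shared-between-mds-and-osd"

def issue_capability(object_id, rights):
    """MDS side: sign (object, rights) into a capability."""
    token = f"{object_id}:{rights}".encode()
    sig = hmac.new(SECRET, token, hashlib.sha256).hexdigest()
    return token, sig

def osd_allows(token, sig, object_id, requested):
    """OSD side: verify the signature, then check object and rights."""
    expected = hmac.new(SECRET, token, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, sig):
        return False                      # forged or altered capability
    obj, rights = token.decode().split(":")
    return obj == object_id and requested in rights

token, sig = issue_capability("0x2a17", "rw")
ok_read = osd_allows(token, sig, "0x2a17", "r")        # granted
forged = osd_allows(token, "0" * 64, "0x2a17", "w")    # rejected
```

The design choice illustrated here is exactly the one the text describes: authorization decisions stay with the MDS, but enforcement is delegated to the storage device via an unforgeable token.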
pNFS can be upgraded to integrate other storage access pro-
tocols and operating systems, and storage manufacturers also
have the option to ship additional layout drivers for their pNFS
implementations.
The Panasas file system uses parallel and redundant access to
object storage devices (OSDs), per-file RAID, distributed metada-
ta management, consistent client caching, file locking services,
and internal cluster management to provide a scalable, fault tol-
erant, high performance distributed file system.
The clustered design of the storage system and the use of cli-
ent-driven RAID provide scalable performance to many concur-
rent file system clients through parallel access to file data that
is striped across OSD storage nodes. RAID recovery is performed
in parallel by the cluster of metadata managers, and declus-
tered data placement yields scalable RAID rebuild rates as the
storage system grows larger.
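Per-file RAID-5 can be sketched with the XOR arithmetic a client performs when computing parity and a metadata manager performs when rebuilding a lost unit – a strong simplification of the real layout:

```python
# Sketch of per-file RAID-5: the parity unit is the XOR of the data
# units of one stripe, so any single lost unit can be rebuilt from the
# surviving units. A strong simplification of client-driven RAID;
# the four-byte unit size is purely illustrative.

def xor_units(units):
    """Bytewise XOR of equally sized byte strings."""
    out = bytearray(len(units[0]))
    for unit in units:
        for i, byte in enumerate(unit):
            out[i] ^= byte
    return bytes(out)

# one stripe: three data units on three storage nodes, plus parity
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_units(data)

# the node holding the second unit fails; rebuild it from the
# surviving data units plus the parity unit
rebuilt = xor_units([data[0], data[2], parity])
```

Because the parity is kept per file rather than per disk, different files can use different stripe widths, and a rebuild only has to touch the stripes of affected files – the property the text credits for scalable, declustered rebuilds.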
Panasas HPC Storage
Panasas ActiveStor parallel storage takes a very different
approach by allowing compute clients to read and write directly
to the storage, entirely eliminating filer head bottlenecks and
allowing single file system capacity and performance to scale
linearly to extreme levels using a proprietary protocol called Di-
rectFlow. Panasas has actively shared its core knowledge with
a consortium of storage industry technology leaders to create
an industry standard protocol which will eventually replace the
need for DirectFlow. This protocol, called parallel NFS (pNFS),
is now an optional extension of the NFS v4.1 standard.
The Panasas system is a production system that provides file
service to some of the largest compute clusters in the world, in
scientific labs, in seismic data processing, in digital animation
studios, in computational fluid dynamics, in semiconductor
manufacturing, and in general purpose computing environ-
ments. In these environments, hundreds or thousands of file
system clients share data and generate very high aggregate I/O
load on the file system. The Panasas system is designed to sup-
port several thousand clients and storage capacities in excess
of a petabyte.
Having many years’ experience in deploying parallel file sys-
tems like Lustre or BeeGFS from smaller scales up to hundreds
of terabytes capacity and throughputs of several gigabytes
per second, transtec chose Panasas, the leader in HPC storage
solutions, as the partner for providing highest performance and
scalability on the one hand, and ease of management on the
other.
Therefore, with Panasas as the technology leader, and tran-
stec’s overall experience and customer-oriented approach,
customers can be assured of getting the best possible HPC
storage solution available.
[Figure: Panasas system components – compute nodes run the PanFS client (iSCSI/OSD to the storage nodes, RPC to the manager nodes); manager nodes run the SysMgr cluster manager and NFS/CIFS gateways; storage nodes run OSDFS]
The unique aspects of the Panasas system are its use of per-file,
client-driven RAID, its parallel RAID rebuild, its treatment of dif-
ferent classes of metadata (block, file, system) and a commodity
parts based blade hardware with integrated UPS. Of course, the
system has many other features (such as object storage, fault
tolerance, caching and cache consistency, and a simplified man-
agement model) that are not unique, but are necessary for a
scalable system implementation.
Panasas File System Background
The storage cluster is divided into storage nodes and manager
nodes at a ratio of about 10 storage nodes to 1 manager node,
although that ratio is variable. The storage nodes implement an
object store, and are accessed directly from Panasas file system
clients during I/O operations. The manager nodes manage the
overall storage cluster, implement the distributed file system se-
mantics, handle recovery of storage node failures, and provide an
exported view of the Panasas file system via NFS and CIFS. The
following Figure gives a basic view of the system components.
Object Storage
An object is a container for data and attributes; it is analogous
to the inode inside a traditional UNIX file system implementa-
tion. Specialized storage nodes called Object Storage Devices
(OSD) store objects in a local OSDFS file system. The object in-
terface addresses objects in a two-level (partition ID/object ID)
namespace. The OSD wire protocol provides byte-oriented
access to the data, attribute manipulation, creation and deletion
of objects, and several other specialized operations. Panasas
uses an iSCSI transport to carry OSD commands that are very
similar to the OSDv2 standard currently in progress within SNIA
and ANSI-T10.
The Panasas file system is layered over the object storage. Each file
is striped over two or more objects to provide redundancy and high
bandwidth access. The file system semantics are implemented by
metadata managers that mediate access to objects from clients
of the file system. The clients access the object storage using the
iSCSI/OSD protocol for Read and Write operations. The I/O op-
erations proceed directly and in parallel to the storage nodes,
bypassing the metadata managers. The clients interact with the
out-of-band metadata managers via RPC to obtain access capa-
bilities and location information for the objects that store files.
Object attributes are used to store file-level attributes, and di-
rectories are implemented with objects that store name to ob-
ject ID mappings. Thus the file system metadata is kept in the
object store itself, rather than being kept in a separate database
or some other form of storage on the metadata nodes.
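Keeping the metadata in the object store itself can be sketched with a directory object that maps names to (partition ID, object ID) pairs in the two-level namespace described above; all IDs below are invented:

```python
# Sketch: a directory is itself an object whose payload maps file names
# to (partition ID, object ID) pairs in the OSD's two-level namespace,
# so file system metadata lives in the object store rather than in a
# separate database. All IDs are invented for illustration.

class ObjectStore:
    def __init__(self):
        self.objects = {}                 # (partition, oid) -> payload

    def put(self, partition, oid, payload):
        self.objects[(partition, oid)] = payload

    def get(self, partition, oid):
        return self.objects[(partition, oid)]

store = ObjectStore()
store.put(1, 0x10, b"...file data...")          # a regular file's object
store.put(1, 0x01, {"results.dat": (1, 0x10)})  # the directory object

def lookup(store, dir_ref, name):
    """Resolve a name via the directory object, then fetch the file."""
    directory = store.get(*dir_ref)
    return store.get(*directory[name])

payload = lookup(store, (1, 0x01), "results.dat")
```

A name lookup is thus just two object reads – one for the directory object, one for the file object it points to.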
System Software Components
The major software subsystems are the OSDFS object storage
system, the Panasas file system metadata manager, the Pana-
sas file system client, the NFS/CIFS gateway, and the overall
cluster management system.
• The Panasas client is an installable kernel module that runs in-
side the Linux kernel. The kernel module implements the stand-
ard VFS interface, so that the client hosts can mount the file sys-
tem and use a POSIX interface to the storage system.
• Each storage cluster node runs a common platform that is based
on FreeBSD, with additional services to provide hardware moni-
toring, configuration management, and overall control.
• The storage nodes use a specialized local file system (OSD-
FS) that implements the object storage primitives. They
implement an iSCSI target and the OSD command set. The
OSDFS object store and iSCSI target/OSD command proces-
sor are kernel modules. OSDFS is concerned with traditional
block-level file system issues such as efficient disk arm utili-
zation, media management (i.e. error handling), high through-
put, as well as the OSD interface.
• The cluster manager (SysMgr) maintains the global configura-
tion, and it controls the other services and nodes in the storage
cluster. There is an associated management application that pro-
vides both a command line interface (CLI) and an HTML interface
(GUI). These are all user level applications that run on a subset
of the manager nodes. The cluster manager is concerned with
membership in the storage cluster, fault detection, configuration
management, and overall control for operations like software
upgrade and system restart.
• The Panasas metadata manager (PanFS) implements the file
system semantics and manages data striping across the object
storage devices. This is a user level application that runs on every
manager node. The metadata manager is concerned with distrib-
uted file system issues such as secure multi-user access, main-
taining consistent file- and object-level metadata, client cache
coherency, and recovery from client, storage node, and metadata
server crashes. Fault tolerance is based on a local transaction log
that is replicated to a backup on a different manager node.
• The NFS and CIFS services provide access to the file system
for hosts that cannot use our Linux installable file system
client. The NFS service is a tuned version of the standard
FreeBSD NFS server that runs inside the kernel. The CIFS ser-
vice is based on Samba and runs at user level. In turn, these
services use a local instance of the file system client, which
runs inside the FreeBSD kernel. These gateway services run
on every manager node to provide a clustered NFS and CIFS
service.
Commodity Hardware Platform
The storage cluster nodes are implemented as blades that are
very compact computer systems made from commodity parts.
The blades are clustered together to provide a scalable plat-
form. The OSD StorageBlade module and metadata manager Di-
rectorBlade module use the same form factor blade and fit into
the same chassis slots.
Storage Management
Traditional storage management tasks involve partitioning
available storage space into LUNs (i.e., logical units that are one
or more disks, or a subset of a RAID array), assigning LUN own-
ership to different hosts, configuring RAID parameters, creating
file systems or databases on LUNs, and connecting clients to the
correct server for their storage. This can be a labor-intensive
scenario. Panasas provides a simplified model for storage
management that shields the storage administrator from these
kinds of details and allows a single, part-time administrator to
manage systems that are hundreds of terabytes in size.
The Panasas storage system presents itself as a file system with
a POSIX interface, and hides most of the complexities of storage
management. Clients have a single mount point for the entire
system. The /etc/fstab file references the cluster manager, and
from that the client learns the location of the metadata service
instances. The administrator can add storage while the system
is online, and new resources are automatically discovered. To
manage available storage, Panasas introduces two basic stor-
age concepts: a physical storage pool called a BladeSet, and a
logical quota tree called a Volume. The BladeSet is a collection
of StorageBlade modules in one or more shelves that comprise a
RAID fault domain. Panasas mitigates the risk of large fault do-
mains with the scalable rebuild performance described below
in the text. The BladeSet is a hard physical boundary for the vol-
umes it contains. A BladeSet can be grown at any time, either by
adding more StorageBlade modules, or by merging two existing
BladeSets together.
The Volume is a directory hierarchy that has a quota constraint
and is assigned to a particular BladeSet. The quota can be
changed at any time, and capacity is not allocated to the Vol-
ume until it is used, so multiple volumes compete for space
within their BladeSet and grow on demand. The files in those
volumes are distributed among all the StorageBlade modules
in the BladeSet. Volumes appear in the file system name space
as directories. Clients have a single mount point for the whole
storage system, and volumes are simply directories below the
mount point. There is no need to update client mounts when the
admin creates, deletes, or renames volumes.
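The relationship between BladeSets and Volumes – a hard physical pool, soft per-volume quotas, and space consumed only on use – can be sketched as follows; the names and sizes are invented:

```python
# Sketch of the BladeSet/Volume model: the BladeSet is a hard physical
# capacity pool; Volumes are quota trees that compete for that pool and
# consume space only as data is written. Names and sizes are invented.

class BladeSet:
    def __init__(self, capacity):
        self.capacity = capacity     # hard physical boundary
        self.used = 0

class Volume:
    def __init__(self, bladeset, quota):
        self.bladeset = bladeset
        self.quota = quota           # soft limit, changeable at any time
        self.used = 0

    def write(self, nbytes):
        if self.used + nbytes > self.quota:
            return False             # volume quota exceeded
        if self.bladeset.used + nbytes > self.bladeset.capacity:
            return False             # physical pool exhausted
        self.used += nbytes
        self.bladeset.used += nbytes
        return True

pool = BladeSet(100)                 # shared physical capacity
home = Volume(pool, quota=80)
scratch = Volume(pool, quota=80)     # quotas may oversubscribe the pool

ok1 = home.write(60)                 # fits both quota and pool
ok2 = scratch.write(50)              # within quota, but pool lacks space
```

Oversubscribed quotas are the point of the model: volumes grow on demand and only the BladeSet's physical capacity is a hard limit.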
Automatic Capacity Balancing
Capacity imbalance occurs when expanding a BladeSet (i.e., add-
ing new, empty storage nodes), merging two BladeSets, and re-
placing a storage node following a failure. In the latter scenario,
the imbalance is the result of the RAID rebuild, which uses spare
capacity on every storage node rather than dedicating a specif-
ic “hot spare” node. This provides better throughput during re-
build, but causes the system to have a new, empty storage node
after the failed storage node is replaced. The system automati-
cally balances used capacity across storage nodes in a BladeSet
using two mechanisms: passive balancing and active balancing.
Passive balancing changes the probability that a storage node
will be used for a new component of a file, based on its availa-
ble capacity. This takes effect when files are created, and when
their stripe size is increased to include more storage nodes. Ac-
tive balancing is done by moving an existing component object
from one storage node to another, and updating the storage
map for the affected file. During the transfer, the file is transpar-
ently marked read-only by the storage management layer, and
the capacity balancer skips files that are being actively written.
Capacity balancing is thus transparent to file system clients.
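Passive balancing can be illustrated with a short simulation (illustrative Python, not Panasas code): a new component object lands on a storage node with probability proportional to that node's free capacity, so an empty replacement blade fills up faster than its neighbors.

```python
# Sketch of passive capacity balancing: weighted random placement of new
# component objects, biased toward storage nodes with more free capacity.
import random

def pick_node(free_capacity, rng=random):
    """Weighted random choice: emptier nodes are picked more often."""
    total = sum(free_capacity.values())
    r = rng.uniform(0, total)
    node = None
    for node, free in free_capacity.items():
        r -= free
        if r <= 0:
            return node
    return node  # floating-point edge case: fall back to the last node

free = {"sb1": 1.0, "sb2": 1.0, "sb3": 8.0}  # sb3 is a new, empty blade
rng = random.Random(42)
counts = {n: 0 for n in free}
for _ in range(10_000):
    counts[pick_node(free, rng)] += 1
# sb3 ends up receiving roughly 80% of new object placements
```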
Object RAID and Reconstruction
Panasas protects against loss of a data object or an entire stor-
age node by striping files across objects stored on different
storage nodes, using a fault-tolerant striping algorithm such as
RAID-1 or RAID-5. Small files are mirrored on two objects, and
larger files are striped more widely to provide higher bandwidth
and less capacity overhead from parity information. The per-file
RAID layout means that parity information for different files is
not mixed together, and easily allows different files to use differ-
ent RAID schemes alongside each other. This property and the
security mechanisms of the OSD protocol make it possible to
enforce access control over files even as clients access storage
nodes directly. It also enables what is perhaps the most novel
aspect of our system, client-driven RAID. That is, the clients are
responsible for computing and writing parity. The OSD security
mechanism also allows multiple metadata managers to manage
objects on the same storage device without heavyweight coor-
dination or interference from each other.
Client-driven, per-file RAID has four advantages for large-scale
storage systems. First, by having clients compute parity for
their own data, the XOR power of the system scales up as the
number of clients increases. We measured XOR processing dur-
ing streaming write bandwidth loads at 7% of the client’s CPU,
[Figure: declustered parity groups – 4-element parity groups, indicated by letters, distributed across eight storage devices]
with the rest going to the OSD/iSCSI/TCP/IP stack and other file
system overhead. Moving XOR computation out of the storage
system into the client requires some additional work to handle
failures. Clients are responsible for generating good data and
good parity for it. Because the RAID equation is per-file, an er-
rant client can only damage its own data. However, if a client
fails during a write, the metadata manager will scrub parity to
ensure the parity equation is correct.
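The parity arithmetic behind client-driven RAID-5 is plain XOR. A minimal sketch (not the actual DirectFLOW protocol) shows both the write-side computation the client performs and why any single lost stripe unit can be recovered:

```python
# Sketch of client-side RAID-5 parity: the client XORs the data stripe
# units and writes the parity object itself, so XOR capacity scales with
# the number of clients.
from functools import reduce

def compute_parity(stripe_units):
    """XOR all stripe units together; units must be equal length."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)),
                  stripe_units)

def rebuild_unit(surviving_units, parity):
    """Any single lost unit is the XOR of the parity with the survivors."""
    return compute_parity(list(surviving_units) + [parity])

units = [b"AAAA", b"BBBB", b"CCCC"]
parity = compute_parity(units)
lost = units[1]
recovered = rebuild_unit([units[0], units[2]], parity)
assert recovered == lost
```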
The second advantage of client-driven RAID is that clients can
perform an end-to-end data integrity check. Data has to go
through the disk subsystem, through the network interface on
the storage nodes, through the network and routers, through
the NIC on the client, and all of these transits can introduce er-
rors with a very low probability. Clients can choose to read pari-
ty as well as data, and verify parity as part of a read operation. If
errors are detected, the operation is retried. If the error is persis-
tent, an alert is raised and the read operation fails. By checking
parity across storage nodes within the client, the system can
ensure end-to-end data integrity. This is another novel property
of per-file, client-driven RAID.
Third, per-file RAID protection lets the metadata managers re-
build files in parallel. Although parallel rebuild is theoretically
possible in block-based RAID, it is rarely implemented. This is due
to the fact that the disks are owned by a single RAID controller,
even in dual-ported configurations. Large storage systems have
multiple RAID controllers that are not interconnected. Since the
SCSI Block command set does not provide fine-grained synchro-
nization operations, it is difficult for multiple RAID controllers
to coordinate a complicated operation such as an online rebuild
without external communication. Even if they could, without
connectivity to the disks in the affected parity group, other RAID
controllers would be unable to assist. Even in a high-availability
configuration, each disk is typically only attached to two differ-
ent RAID controllers, which limits the potential speedup to 2x.
When a StorageBlade module fails, the metadata managers that
own Volumes within that BladeSet determine what files are af-
fected, and then they farm out file reconstruction work to every
other metadata manager in the system. Metadata managers re-
build their own files first, but if they finish early or do not own
any Volumes in the affected BladeSet, they are free to aid other
metadata managers. Declustered parity groups spread out the
I/O workload among all StorageBlade modules in the BladeSet.
The result is that larger storage clusters reconstruct lost data
more quickly.
The fourth advantage of per-file RAID is that unrecoverable
faults can be constrained to individual files. The most com-
monly encountered double-failure scenario with RAID-5 is an
unrecoverable read error (i.e., grown media defect) during the
reconstruction of a failed storage device. The second storage
device is still healthy, but it has been unable to read a sector,
which prevents rebuild of the sector lost from the first drive and
potentially the entire stripe or LUN, depending on the design of
the RAID controller. With block-based RAID, it is difficult or im-
possible to directly map any lost sectors back to higher-level file
system data structures, so a full file system check and media
scan will be required to locate and repair the damage. A more
typical response is to fail the rebuild entirely. RAID control-
lers monitor drives in an effort to scrub out media defects and
avoid this bad scenario, and the Panasas system does media
scrubbing, too. However, with high capacity SATA drives, the
chance of encountering a media defect on drive B while rebuild-
ing drive A is still significant. With per-file RAID-5, this sort of
double failure means that only a single file is lost, and the spe-
cific file can be easily identified and reported to the administra-
tor. While block-based RAID systems have been compelled to
introduce RAID-6 (i.e., fault tolerant schemes that handle two
failures), the Panasas solution is able to deploy highly reliable
RAID-5 systems with large, high performance storage pools.
RAID Rebuild Performance
RAID rebuild performance determines how quickly the system
can recover data when a storage node is lost. Short rebuild
times reduce the window in which a second failure can cause
data loss. There are three techniques to reduce rebuild times: re-
ducing the size of the RAID parity group, declustering the place-
ment of parity group elements, and rebuilding files in parallel
using multiple RAID engines.
The rebuild bandwidth is the rate at which reconstructed data
is written to the system when a storage node is being recon-
structed. The system must read N times as much as it writes,
depending on the width of the RAID parity group, so the overall
throughput of the storage system is several times higher than
the rebuild rate. A narrower RAID parity group requires fewer
read and XOR operations to rebuild, so will result in a higher re-
build bandwidth.
However, it also results in higher capacity overhead for parity
data, and can limit bandwidth during normal I/O. Thus, selection
of the RAID parity group size is a trade-off between capacity
overhead, on-line performance, and rebuild performance. Un-
derstanding declustering is easier with a picture. In the figure
on the left, each parity group has 4 elements, which are indicat-
ed by letters placed in each storage device. They are distributed
among 8 storage devices. The ratio between the parity group
size and the available storage devices is the declustering ratio,
which in this example is ½. In the picture, capital letters repre-
sent those parity groups that all share the second storage node.
If the second storage device were to fail, the system would have
to read the surviving members of its parity groups to rebuild
the lost elements. You can see that the other elements of those
parity groups occupy about ½ of each other storage device. For
this simple example you can assume each parity element is the
same size so all the devices are filled equally. In a real system,
the component objects will have various sizes depending on the
overall file size, although each member of a parity group will be
very close in size. There will be thousands or millions of objects
on each device, and the Panasas system uses active balancing
to move component objects between storage nodes to level ca-
pacity. Declustering means that rebuild requires reading a sub-
set of each device, with the proportion being approximately the
same as the declustering ratio. The total amount of data read is
the same with and without declustering, but with declustering
it is spread out over more devices. When writing the reconstruct-
ed elements, two elements of the same parity group cannot be
located on the same storage node. Declustering leaves many
storage devices available for the reconstructed parity element,
and randomizing the placement of each file’s parity group lets
the system spread out the write I/O over all the storage. Thus
declustering RAID parity groups has the important property of
taking a fixed amount of rebuild I/O and spreading it out over
more storage devices.
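The declustering arithmetic above can be made concrete with a small worked sketch (illustrative only): the fraction of each surviving device that must be read during rebuild is roughly the declustering ratio, while the total amount of data read stays the same.

```python
# Worked example of declustered parity-group arithmetic.

def declustering_ratio(group_size, num_devices):
    """Ratio between the parity group size and the available devices."""
    return group_size / num_devices

def per_device_read_fraction(group_size, num_devices):
    # The surviving (group_size - 1) elements of each affected parity
    # group are spread across the other (num_devices - 1) devices.
    return (group_size - 1) / (num_devices - 1)

# The 4-wide groups on 8 devices from the figure: ratio 1/2, and each
# surviving device contributes a bit less than half of its contents.
ratio = declustering_ratio(4, 8)        # 0.5
frac = per_device_read_fraction(4, 8)   # 3/7, roughly 0.43

# Without declustering (groups span exactly as many devices as their
# size), every surviving device in the group must be read in full:
assert per_device_read_fraction(4, 4) == 1.0
```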
Having per-file RAID allows the Panasas system to divide the
work among the available DirectorBlade modules by assigning
different files to different DirectorBlade modules. This division
is dynamic with a simple master/worker model in which meta-
data services make themselves available as workers, and each
metadata service acts as the master for the volumes it imple-
ments. By doing rebuilds in parallel on all DirectorBlade mod-
ules, the system can apply more XOR throughput and utilize the
additional I/O bandwidth obtained with declustering.
Metadata Management
There are several kinds of metadata in the Panasas system.
These include the mapping from object IDs to sets of block ad-
dresses, mapping files to sets of objects, file system attributes
such as ACLs and owners, file system namespace information
(i.e., directories), and configuration/management information
about the storage cluster itself.
Block-level Metadata
Block-level metadata is managed internally by OSDFS, the file
system that is optimized to store objects. OSDFS uses a floating
block allocation scheme where data, block pointers, and object
descriptors are batched into large write operations. The write
buffer is protected by the integrated UPS, and it is flushed to
disk on power failure or system panics. Fragmentation was an
issue in early versions of OSDFS that used a first-fit block allo-
cator, but this has been significantly mitigated in later versions
that use a modified best-fit allocator.
OSDFS stores higher level file system data structures, such as
the partition and object tables, in a modified BTree data struc-
ture. Block mapping for each object uses a traditional direct/
indirect/double-indirect scheme. Free blocks are tracked by
a proprietary bitmap-like data structure that is optimized for
copy-on-write reference counting, part of OSDFS’s integrated
support for object- and partition-level copy-on-write snapshots.
Block-level metadata management consumes most of the cycles
in file system implementations. By delegating storage manage-
ment to OSDFS, the Panasas metadata managers have an or-
der of magnitude less work to do than the equivalent SAN file
system metadata manager that must track all the blocks in the
system.
File-level Metadata
Above the block layer is the metadata about files. This includes
user-visible information such as the owner, size, and modification
time, as well as internal information that identifies which ob-
jects store the file and how the data is striped across those
objects (i.e., the file’s storage map). Our system stores this file
metadata in object attributes on two of the N objects used to
store the file’s data. The rest of the objects have basic attrib-
utes like their individual length and modify times, but the high-
er-level file system attributes are only stored on the two attrib-
ute-storing components.
File names are implemented in directories similar to traditional
UNIX file systems. Directories are special files that store an array
of directory entries. A directory entry identifies a file with a tuple
of <serviceID, partitionID, objectID>, and also includes two <os-
dID> fields that are hints about the location of the attribute stor-
ing components. The partitionID/objectID is the two-level ob-
ject numbering scheme of the OSD interface, and Panasas uses
a partition for each volume. Directories are mirrored (RAID-1)
in two objects so that the small write operations associated
with directory updates are efficient.
Clients are allowed to read, cache and parse directories, or they
can use a Lookup RPC to the metadata manager to translate
a name to a <serviceID, partitionID, objectID> tuple and the
<osdID> location hints. The serviceID provides a hint about the
metadata manager for the file, although clients may be redirect-
ed to the metadata manager that currently controls the file.
The osdID hint can become out-of-date if reconstruction or
active balancing moves an object. If both osdID hints fail, the
metadata manager has to multicast a GetAttributes to the stor-
age nodes in the BladeSet to locate an object. The partitionID
and objectID are the same on every storage node that stores a
component of the file, so this technique will always work. Once
the file is located, the metadata manager automatically updates
the stored hints in the directory, allowing future accesses to by-
pass this step.
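The hint-based lookup can be sketched as a toy model. The tuple fields mirror the directory entry described in the text, while `locate` and `holds` are invented names; `holds` stands in for an OSD GetAttributes check, and the final loop stands in for the multicast to the BladeSet.

```python
from collections import namedtuple

# Directory entry as described in the text: an object identity plus two
# osdID hints for the attribute-storing components.
DirEntry = namedtuple("DirEntry",
                      ["serviceID", "partitionID", "objectID",
                       "osd_hint1", "osd_hint2"])

def locate(entry, bladeset_osds, holds):
    """Find the OSD holding the object; report whether a hint worked."""
    for osd in (entry.osd_hint1, entry.osd_hint2):
        if holds(osd, entry.partitionID, entry.objectID):
            return osd, True            # hint is still fresh
    # Both hints stale (object moved by rebuild or active balancing):
    # fall back to asking every storage node in the BladeSet.
    for osd in bladeset_osds:
        if holds(osd, entry.partitionID, entry.objectID):
            return osd, False           # caller updates the stored hints
    raise FileNotFoundError(entry.objectID)

objects = {("osd7", 3, 99)}             # the object has moved to osd7
holds = lambda osd, part, obj: (osd, part, obj) in objects
entry = DirEntry("svc1", 3, 99, "osd1", "osd2")
osd, hint_fresh = locate(entry, [f"osd{i}" for i in range(1, 9)], holds)
```

Because the partitionID/objectID pair is identical on every storage node holding a component of the file, the fallback search always succeeds if the object exists at all.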
[Figure: Creating a file – the numbered RPC steps (1 CREATE from the client, 3 CREATE to the OSDs, 6 REPLY, 7 WRITE, 8 log cleanup) between the Client, the MetadataServer with its transaction log (op-log, cap-log, reply cache), and the OSDs]
File operations may require several object operations. The figure
on the left shows the steps used in creating a file. The metadata
manager keeps a local journal to record in-progress actions so it
can recover from object failures and metadata manager crashes
that occur when updating multiple objects. For example, creat-
ing a file is a fairly complex task that requires updating the parent
directory as well as creating the new file. There are 2 Create OSD
operations to create the first two components of the file, and 2
Write OSD operations, one to each replica of the parent direc-
tory. As a performance optimization, the metadata server also
grants the client read and write access to the file and returns the
appropriate capabilities to the client as part of the FileCreate
results. The server makes a record of these write capabilities to
support error recovery if the client crashes while writing the file.
Note that the directory update (step 7) occurs after the reply,
so that many directory updates can be batched together. The
deferred update is protected by the op-log record that gets de-
leted in step 8 after the successful directory update.
The metadata manager maintains an op-log that records the
object create and the directory updates that are in progress.
This log entry is removed when the operation is complete. If the
metadata service crashes and restarts, or a failure event moves
the metadata service to a different manager node, then the op-
log is processed to determine what operations were active at
the time of the failure. The metadata manager rolls the opera-
tions forward or backward to ensure the object store is consist-
ent. If no reply to the operation has been generated, then the op-
eration is rolled back. If a reply has been generated but pending
operations are outstanding (e.g., directory updates), then the
operation is rolled forward.
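The roll-forward/roll-back rule can be condensed into a few lines. This is a deliberately simplified sketch with invented field names, keeping only the decision described above: whether a reply was already sent determines the fate of each logged operation.

```python
def recover(op_log):
    """Decide the fate of each in-progress operation after a crash."""
    actions = {}
    for op_id, op in op_log.items():
        if not op["replied"]:
            # No reply was ever generated: undo the partial work.
            actions[op_id] = "roll back"
        else:
            # Reply sent but deferred work (e.g. the batched directory
            # update) is still outstanding: finish it.
            actions[op_id] = "roll forward"
    return actions

op_log = {
    17: {"replied": False},  # crashed before the reply (step 6)
    18: {"replied": True},   # crashed between the reply and steps 7/8
}
```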
The write capability is stored in a cap-log so that when a meta-
data server starts it knows which of its files are busy. In addition
to the “piggybacked” write capability returned by FileCreate,
the client can also execute a StartWrite RPC to obtain a separate
write capability. The cap-log entry is removed when the client
releases the write cap via an EndWrite RPC. If the client reports
an error during its I/O, then a repair log entry is made and the file
is scheduled for repair. Read and write capabilities are cached
by the client over multiple system calls, further reducing meta-
data server traffic.
System-level Metadata
The final layer of metadata is information about the overall sys-
tem itself. One possibility would be to store this information in
objects and bootstrap the system through a discovery protocol.
The most difficult aspect of that approach is reasoning about
the fault model. The system must be able to come up and be
manageable while it is only partially functional. Panasas chose
instead a model with a small replicated set of system managers,
each of which stores a replica of the system configuration metadata.
Each system manager maintains a local database, outside of
the object storage system. Berkeley DB is used to store tables
that represent our system model. The different system manager
instances are members of a replication set that uses Lamport's
part-time parliament (PTP) protocol to make decisions and up-
date the configuration information. Clusters are configured with
one, three, or five system managers so that the voting quorum
has an odd number and a network partition will cause a minori-
ty of system managers to disable themselves.
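The choice of one, three, or five system managers follows directly from majority voting; a two-line sketch makes the partition behavior concrete:

```python
def has_quorum(reachable, total):
    """PTP-style decisions require a strict majority of system managers."""
    return reachable > total // 2

# A 5-manager set split 3/2 by a network partition:
assert has_quorum(3, 5)        # the majority side keeps making decisions
assert not has_quorum(2, 5)    # the minority side disables itself

# An even set of 4 could split 2/2, leaving no side with quorum --
# hence the odd replica counts:
assert not has_quorum(2, 4)
```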
System configuration state includes both static state, such
as the identity of the blades in the system, as well as dynam-
ic state such as the online/offline state of various services and
error conditions associated with different system components.
Each state update decision, whether it is updating the admin
password or activating a service, involves a voting round and an
update round according to the PTP protocol. Database updates
are performed within the PTP transactions to keep the data-
bases synchronized. Finally, the system keeps backup copies of
the system configuration databases on several other blades to
guard against catastrophic loss of every system manager blade.
Blade configuration is pulled from the system managers as part
of each blade’s startup sequence. The initial DHCP handshake
conveys the addresses of the system managers, and thereafter
the local OS on each blade pulls configuration information from
the system managers via RPC.
“Outstanding in the HPC world, the ActiveStor solutions provided by
Panasas are undoubtedly the only HPC storage solutions that com-
bine highest scalability and performance with a convincing ease of
management.“
Christoph Budziszewski HPC Solution Architect
The cluster manager implementation has two layers. The lower
level PTP layer manages the voting rounds and ensures that par-
titioned or newly added system managers will be brought up-to-
date with the quorum. The application layer above that uses the
voting and update interface to make decisions. Complex system
operations may involve several steps, and the system manager
has to keep track of its progress so it can tolerate a crash and
roll back or roll forward as appropriate.
For example, creating a volume (i.e., a quota-tree) involves file
system operations to create a top-level directory, object oper-
ations to create an object partition within OSDFS on each Stor-
ageBlade module, service operations to activate the appropri-
ate metadata manager, and configuration database operations
to reflect the addition of the volume.
Recovery is enabled by having two PTP transactions. The initial
PTP transaction determines if the volume should be created, and
it creates a record about the volume that is marked as incom-
plete. Then the system manager does all the necessary service
activations, file and storage operations. When these all com-
plete, a final PTP transaction is performed to commit the oper-
ation. If the system manager crashes before the final PTP trans-
action, it will detect the incomplete operation the next time
it restarts, and then roll the operation forward or backward.
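The intent/commit pattern can be sketched in a few lines (invented names; the dictionary stands in for the replicated configuration database):

```python
class SysMgr:
    """Toy model of the two-PTP-transaction pattern for volume creation."""
    def __init__(self):
        self.records = {}   # stands in for the replicated PTP database

    def begin_create(self, vol):
        # First PTP transaction: decide to create and record the intent.
        self.records[vol] = "incomplete"

    def finish_create(self, vol):
        # Final PTP transaction: all service, file, and storage
        # operations have completed, so commit the record.
        self.records[vol] = "complete"

    def recover(self):
        """On restart, incomplete records are rolled forward or back."""
        return [v for v, s in self.records.items() if s == "incomplete"]

m = SysMgr()
m.begin_create("scratch")
m.finish_create("scratch")      # normal path: intent, side effects, commit
m.begin_create("home")          # crash before the final transaction...
assert m.recover() == ["home"]  # ...is detected at the next restart
```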
BeeGFS – A Parallel File System
BeeGFS is a highly flexible network file system, designed especially for the scalability and performance required by HPC systems.
Clients communicate with storage servers over the network (TCP/IP, or InfiniBand with native verbs support). Furthermore, BeeGFS is a parallel file system: as more servers are added, their capacity and performance are aggregated in a single namespace. That way, file system performance and capacity can be scaled to the level required by HPC applications.
BeeGFS combines multiple storage servers to provide a highly
scalable shared network file system with striped file contents.
This way, it allows users to overcome the tight performance lim-
itations of single servers, single network interconnects, a limit-
ed number of hard drives etc. In such a system, high throughput
demands of large numbers of clients can easily be satisfied, but
even a single client can profit from the aggregated performance
of all the storage servers in the system.
Key Aspects:
- Maximum Flexibility
- BeeGFS supports a wide range of Linux distributions as well as Linux kernels
- Storage servers run on top of an existing local file system using the POSIX interface
- Clients and servers can be added to an existing system without downtime
- BeeGFS supports multiple networks and dynamic failover between different networks
- BeeGFS provides support for metadata and file contents mirroring
This is made possible by a separation of metadata and file con-
tents. While storage servers are responsible for storing stripes of
the actual contents of user files, metadata servers do the coordi-
nation of file placement and striping among the storage servers
and inform the clients about certain file details when necessary.
When accessing file contents, BeeGFS clients directly contact the
storage servers to perform file I/O and communicate with multi-
ple servers simultaneously, giving your applications truly paral-
lel access to the file data. To keep the metadata access latency
(e.g. directory lookups) at a minimum, BeeGFS allows you to also
distribute the metadata across multiple servers, so that each of
the metadata servers stores a part of the global file system name-
space.
The following picture shows the system architecture and roles
within a BeeGFS instance. Note that although all roles in the pic-
ture are running on different hosts, it is also possible to run any
combination of client and servers on the same machine.
A very important differentiator for BeeGFS is its ease of use. The philosophy behind BeeGFS is to keep the hurdles as low as possible and make parallel file systems usable by as many people as possible for their work. BeeGFS was designed to work with any local file system (such as ext4 or xfs) for data storage, as long as it is POSIX compliant. That way, system administrators can choose the local file system they prefer and are experienced with, and do not need to learn new tools. This makes it much easier to adopt parallel storage technology – especially for sites with limited resources and manpower.
BeeGFS needs four main components to work:
- the ManagementServer
- the ObjectStorageServer
- the MetaDataServer
- the file system client
There are two supporting daemons:
- The file system client needs a "helper daemon" that runs on the client.
- The Admon daemon may run in the storage cluster and gives system administrators a better view of what is going on. It is not a necessary component, and a BeeGFS system is fully operable without it.
ManagementServer (MS)
The MS is the component of the file system that makes sure all other processes can find each other. It is the first daemon that has to be set up, and all configuration files of a BeeGFS installation have to point to the same MS. There is always exactly one MS in a system. The MS maintains a list of all file system components – this includes clients, MetaDataServers, MetaDataTargets, StorageServers and StorageTargets.
MetaDataServer (MDS)
The MDS stores the file system's metadata. BeeGFS implements a fully scalable architecture for
MetaData and allows a practically unlimited number of MDS.
Each MDS has exactly one MetaDataTarget (MDT). Each directory
in the global file system is attached to exactly one MDS, which handles its content. Since the assignment of directories to MDS is random, BeeGFS can make efficient use of a (very) large number of MDS.
As long as the number of directories is significantly larger than
the number of MDS (which is normally the case), there will be a
roughly equal share of load on each of the MDS.
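Why random assignment balances load can be seen with a short simulation (illustrative only; BeeGFS's actual assignment mechanism is internal – a hash is used here simply as a stand-in for a random but deterministic mapping):

```python
# Sketch: with far more directories than metadata servers, a random
# directory-to-MDS assignment gives each MDS a near-equal share.
import hashlib

def mds_for_dir(path, num_mds):
    """Deterministic pseudo-random assignment of a directory to an MDS."""
    digest = hashlib.sha256(path.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_mds

counts = [0] * 4
for i in range(10_000):
    counts[mds_for_dir(f"/data/project{i}", 4)] += 1
# with 10,000 directories on 4 MDS, each MDS owns close to 2,500 of them
```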
The MDS is a fully multithreaded daemon; the number of threads to start has to be tuned to the hardware in use. Factors to consider here are the number of CPU cores in the MDS, the number of clients, the workload, and the number and type of disks forming the MDT. In general, serving metadata needs some CPU power – and especially if one wants to make use of fast underlying storage such as SSDs, the CPUs should not be undersized, or they become the bottleneck.
ObjectStorageServer (OSS)
The OSS is the main service to store file contents. Each OSS may have one or more ObjectStorageTargets (OSTs), where an OST is a RAID set (or LUN) with a local file system (such as xfs or ext4) on top. A typical OST consists of 6 to 12 hard drives in RAID6 – so an OSS with 36 drives might be organized into 3 OSTs, each a RAID6 of 12 drives.
The OSS is a user space daemon that is started on the appro-
priate machine. It will work with any local file system that
is POSIX compliant and supported by the Linux distribution.
The underlying file system may be picked according to the work-
load – or personal preference and experience.
Like the MDS, the OSS is a fully multithreaded daemon, and as with the MDS, the number of threads has to be chosen with the underlying hardware in mind. The most important factor for the number of OSS threads is the performance and number of the OSTs that the OSS is serving. In contrast to the MDS, the traffic on the OSTs should be (and normally is) sequential to a large extent.
File System Client
BeeGFS comes with a native client that runs in Linux – it is a kernel module that has to be compiled to match the running kernel. The client is an open-source product made available under the GPL. It has to be installed on all hosts that should access BeeGFS with maximum performance.
The client needs two services to start:
fhgfs-helperd: This daemon provides certain auxiliary functionalities for the fhgfs-client. The helperd does not require any additional configuration – it is simply accessed by the fhgfs-client running on the same host. The helperd mainly performs DNS lookups for the client kernel module and provides the capability to write the log file.
fhgfs-client: This service loads the client kernel module, (re-)compiling it first if necessary. The recompile is done by an automatic build process that is triggered when, for example, the kernel version changes. This keeps the file system easy to use: after applying a (minor or major) kernel update, the BeeGFS client will still start up after a reboot, recompile, and the file system will be available.
Striping
One of the key functionalities of BeeGFS is striping. In a BeeGFS file system, each directory has (amongst others) two very important properties defining how files in these directories are handled.
Numtargets defines the number of targets a file is striped across. If "4" is selected, each file gets four OSTs assigned, across which the data of that file is stored.
The chunksize, on the other hand, specifies how much data is stored on one assigned target before the client moves on to the next target.
The goal of striping is to improve both the performance of a single file and the available capacity. Assuming a target is 30 TB in size and can move data at 500 MB/s, a file striped across four targets can grow to 120 TB and be accessed at 2 GB/s. Striping allows you to store more and store it faster!
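The numtargets/chunksize interplay can be sketched as a simple round-robin mapping. This is illustrative only: the 512 KiB chunk size is just an example value, and the real target assignment happens per file inside BeeGFS.

```python
# Sketch of BeeGFS-style striping arithmetic: consecutive chunks
# round-robin over a file's assigned targets, multiplying both the
# capacity and the bandwidth available to a single file.

CHUNK = 512 * 1024  # example chunk size: 512 KiB

def target_for_offset(offset, numtargets):
    """Which of the file's assigned targets stores this byte offset."""
    return (offset // CHUNK) % numtargets

# The capacity/bandwidth example from the text: 4 targets of 30 TB at
# 500 MB/s each give a single file up to 120 TB at about 2 GB/s.
capacity_tb = 4 * 30
bandwidth_gbs = 4 * 500 / 1000

assert target_for_offset(0, 4) == 0
assert target_for_offset(CHUNK, 4) == 1
assert target_for_offset(4 * CHUNK, 4) == 0  # wraps back to target 0
```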
Maximum Scalability
- BeeGFS supports distributed file contents with flexible striping across the storage servers as well as distributed metadata
- Best-in-class client throughput of 3.5 GB/s write and 4.0 GB/s read with a single I/O stream on FDR InfiniBand
- Best-in-class metadata performance with linear scalability through dynamic metadata namespace partitioning
- Best-in-class storage throughput with flexible choice of the underlying file system to perfectly fit the storage hardware
Maximum Usability
BeeGFS was designed with easy installation and administration in mind. BeeGFS requires no kernel patches (the client is a patchless kernel module, the server components are userspace daemons), comes with graphical cluster installation tools, and allows you to add more clients and servers to the running system whenever you want. The graphical administration and monitoring system enables users to handle typical management tasks in an intuitive way: cluster installation, storage service management, live throughput and health monitoring, file browsing, striping configuration and more.
Features:
- Data and metadata for each file are separated into stripes and distributed across the existing servers to aggregate server speed
- Secure authentication process between client and server to protect sensitive data
- Fair I/O option on user level to prevent a single user with multiple requests from stalling other users' requests
- Automatic network failover, i.e. if InfiniBand is down, BeeGFS automatically uses Ethernet if available
- Optional server failover for external shared memory resources – high availability
„transtec has successfully deployed the first petabyte filesystem
with BeeGFS. And it is very impressive to see how BeeGFS combines
performance with reliability.”
Thomas Gebert HPC Solution Architect
transtec has an outstanding history of parallel filesystem implementations spanning more than 12 years. Starting with some of the first Lustre implementations – some of them extending to nearly 1 petabyte gross capacity, an enormous size 10 years ago – through the deployment of many enterprise-ready commercial solutions, we have now successfully deployed dozens of BeeGFS environments of various sizes. In particular, transtec has been the first provider of a 1-petabyte filesystem based on BeeGFS, then known as FraunhoferFS, or FhGFS.
No matter what the customer’s requirements are: transtec has
knowledge in all parallel filesystem technologies and is able to
provide the customer with an individually sized and tailor-made
solution that meets their demands concerning capacity and per-
formance, and above all – because this is what customers take
as a matter of course – is stable and reliable at the same time.
� Runs on non-standard architectures, e.g. PowerPC or Intel
Xeon Phi
� Provides a coherent mode, in which it is guaranteed that
changes to a file or directory by one client are always immedi-
ately visible to other clients.
Performance:
BeeGFS was designed for extreme scalability. In a testbed with
20 servers and up to 640 client processes (32x the number of
metadata servers), BeeGFS delivers a sustained file creation rate
of more than 500,000 creates per second, making it possible to
create one billion files in as little as about 30 minutes. In the
same testbed system with 20 servers - each equipped with a sin-
gle node local performance of 1332 MB/s (write) and 1317 MB/s
(read) – and 160 client processes, BeeGFS reached a sustained
throughput of 25 GB/s – which is 94.7 percent of the maximum
theoretical local write and 94.1 percent of the maximum theoret-
ical local read throughput.
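The quoted testbed figures are easy to sanity-check with a back-of-the-envelope calculation, using only the numbers stated above:

```python
# Sanity check of the BeeGFS testbed figures quoted in the text.
creates_per_second = 500_000
seconds_for_1e9_files = 1_000_000_000 / creates_per_second
print(seconds_for_1e9_files / 60)        # about 33 minutes

servers = 20
local_write_mb_s = 1332                  # per-server local write speed
local_read_mb_s = 1317                   # per-server local read speed
max_write = servers * local_write_mb_s   # 26640 MB/s theoretical aggregate
max_read = servers * local_read_mb_s     # 26340 MB/s theoretical aggregate
# 94.7% of the write maximum and 94.1% of the read maximum both land
# at roughly 25 GB/s, matching the sustained throughput stated above.
print(round(max_write * 0.947), round(max_read * 0.941))
```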
Mirroring

BeeGFS provides support for metadata and file contents mirroring. Mirroring capabilities are integrated into the normal BeeGFS services, so that no separate services or third-party tools are needed. Both types of mirroring (metadata and file contents) can be used independently of each other. The initial mirroring implementation enables data restore after a storage target or metadata server is lost. It does not provide failover or high availability capabilities yet, so the file system will not automatically remain accessible while a server is down. Those features will be added in later BeeGFS releases.
Mirroring can be enabled on a per-directory basis, so that some
data in the file system can be mirrored while other data might
not be mirrored. Primary servers/targets and mirror servers/tar-
gets are chosen individually for each new directory/file. Thus,
there are no passive (i.e. mirror-only) servers/targets in a file sys-
tem instance.
The granularity of metadata mirroring is a complete directory,
so that the entries of a single directory are always mirrored on
the same mirror server, but metadata for different directories is
mirrored on different metadata servers. The granularity of file
contents mirroring is the individual file. Each file has its own set of primary and mirror targets, even for files in the same directory. Both types of mirroring can be used with an even or odd
number of servers or storage targets.
Metadata is mirrored asynchronously, i.e. the client operation
completes before the redundant copy has been transferred to
the mirror server. File contents are mirrored synchronously, i.e. the client operation completes after both copies of the data have been transferred to the servers. File contents mirroring is based on a
hybrid client- and server-side approach, where the client decides
whether it sends the mirror copy directly to the other server or
whether the storage server is responsible for forwarding the data to the mirror.
BeeGFS Dynamic Failover

Built on scalable multithreaded core components with native
Infiniband support, file system nodes can serve Infiniband and
Ethernet (or any other TCP-enabled network) connections at the
same time and automatically switch to a redundant connection
path in case any of them fails.
NVIDIA GPU Computing – The CUDA Architecture

The CUDA parallel computing platform is now widely deployed, with thousands of GPU-accelerated applications and thousands of published research papers, and a complete range of CUDA tools and ecosystem solutions is available to developers.
What is GPU Accelerated Computing?

GPU-accelerated computing is the use of a graphics processing
unit (GPU) together with a CPU to accelerate scientific, analytics,
engineering, consumer, and enterprise applications. Pioneered
in 2007 by NVIDIA, GPU accelerators now power energy-efficient
datacenters in government labs, universities, enterprises, and
small-and-medium businesses around the world. GPUs are accel-
erating applications in platforms ranging from cars, to mobile
phones and tablets, to drones and robots.
How GPUs Accelerate Applications

GPU-accelerated computing offers unprecedented application
performance by offloading compute-intensive portions of the
application to the GPU, while the remainder of the code still runs
on the CPU. From a user’s perspective, applications simply run
significantly faster.
CPU versus GPU

A simple way to understand the difference between a CPU and
GPU is to compare how they process tasks. A CPU consists of a
few cores optimized for sequential serial processing while a GPU
has a massively parallel architecture consisting of thousands
of smaller, more efficient cores designed for handling multiple
tasks simultaneously.
NVIDIA GPU Computing
What is GPU Computing?
“We are very proud to be one of the leading providers of Tesla sys-
tems who are able to combine the overwhelming power of NVIDIA
Tesla systems with the fully engineered and thoroughly tested
transtec hardware to a total Tesla-based solution.”
Bernd Zell, Senior HPC System Engineer
CPU multiple cores GPU thousands of cores
GPUs have thousands of cores to process parallel workloads
efficiently
Kepler GK110/210 GPU Computing Architecture

As the demand for high performance parallel computing increas-
es across many areas of science, medicine, engineering, and fi-
nance, NVIDIA continues to innovate and meet that demand with
extraordinarily powerful GPU computing architectures.
NVIDIA’s GPUs have already redefined and accelerated High Per-
formance Computing (HPC) capabilities in areas such as seismic
processing, biochemistry simulations, weather and climate mod-
eling, signal processing, computational finance, computer aided
engineering, computational fluid dynamics, and data analysis.
NVIDIA’s Kepler GK110/210 GPUs are designed to help solve the
world’s most difficult computing problems.
SMX Processing Core Architecture

Each of the Kepler GK110/210 SMX units features 192 single-preci-
sion CUDA cores, and each core has fully pipelined floating-point
and integer arithmetic logic units. Kepler retains the full IEEE
754-2008 compliant single and double-precision arithmetic in-
troduced in Fermi, including the fused multiply-add (FMA) oper-
ation. One of the design goals for the Kepler GK110/210 SMX was
to significantly increase the GPU’s delivered double precision
performance, since double precision arithmetic is at the heart of
many HPC applications. Kepler GK110/210’s SMX also retains the
special function units (SFUs) for fast approximate transcenden-
tal operations as in previous-generation GPUs, providing 8x the
number of SFUs of the Fermi GF110 SM.
Similar to GK104 SMX units, the cores within the new GK110/210
SMX units use the primary GPU clock rather than the 2x shader
clock. Recall the 2x shader clock was introduced in the G80 Tes-
la architecture GPU and used in all subsequent Tesla and Fermi
architecture GPUs.
Running execution units at a higher clock rate allows a chip
to achieve a given target throughput with fewer copies of the
execution units, which is essentially an area optimization,
but the clocking logic for the faster cores is more power-hungry.
For Kepler, the priority was performance per watt. While NVIDIA
made many optimizations that benefitted both area and power,
they chose to optimize for power even at the expense of some
added area cost, with a larger number of processing cores run-
ning at the lower, less power-hungry GPU clock.
Quad Warp Scheduler The SMX schedules threads in groups of 32 parallel threads
called warps. Each SMX features four warp schedulers and eight
instruction dispatch units, allowing four warps to be issued and
executed concurrently.
Kepler’s quad warp scheduler selects four warps, and two inde-
pendent instructions per warp can be dispatched each cycle. Un-
like Fermi, which did not permit double precision instructions to
be paired with other instructions, Kepler GK110/210 allows dou-
ble precision instructions to be paired with other instructions.
New ISA Encoding: 255 Registers per Thread

The number of registers that can be accessed by a thread has been quadrupled in GK110, allowing each thread access to up to 255 registers. Codes that exhibit high register pressure or spilling behavior in Fermi may see substantial speedups as a result of the increased available per-thread register count. A compel-
ling example can be seen in the QUDA library for performing lat-
tice QCD (quantum chromodynamics) calculations using CUDA.
QUDA fp64-based algorithms see performance increases of up to 5.3x due to the ability to use many more registers per thread and fewer spills to local memory.
GK210 further improves this, doubling the overall register file ca-
pacity per SMX as compared to GK110. In doing so, it allows appli-
cations to more readily make use of higher numbers of registers
per thread without sacrificing the number of threads that can
fit concurrently per SMX. For example, a CUDA kernel using 128 registers per thread on GK110 is limited to 512 out of a possible 2048
concurrent threads per SMX, limiting the available parallelism.
Each Kepler SMX contains 4 Warp Schedulers, each with dual Instruction Dispatch Units. A single Warp Scheduler Unit is shown above.
GK210 doubles the concurrency automatically in this case, which
can help to cover arithmetic and memory latencies, improving
overall efficiency.
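The register-pressure arithmetic behind this example can be checked directly. The 64K-register file size for GK110 is the per-SMX figure implied by the 512-of-2048 example above:

```python
# Register-pressure arithmetic for the example in the text: how many
# threads fit concurrently on one SMX at 128 registers per thread.
MAX_THREADS_PER_SMX = 2048

def concurrent_threads(register_file_size, regs_per_thread):
    """Threads that fit, capped by the hardware thread limit."""
    return min(MAX_THREADS_PER_SMX, register_file_size // regs_per_thread)

gk110 = concurrent_threads(64 * 1024, 128)      # GK110: 512 threads
gk210 = concurrent_threads(2 * 64 * 1024, 128)  # GK210: doubled file, 1024
print(gk110, gk210)
```

Doubling the register file (GK210) doubles the concurrent thread count at the same per-thread register usage, which is exactly the automatic concurrency improvement described above.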
Shuffle Instruction

To further improve performance, Kepler implements a new Shuf-
fle instruction, which allows threads within a warp to share data.
Previously, sharing data between threads within a warp required
separate store and load operations to pass the data through
shared memory. With the Shuffle instruction, threads within
a warp can read values from other threads in the warp in just
about any imaginable permutation.
Shuffle supports arbitrary indexed references – i.e. any thread
reads from any other thread. Useful shuffle subsets, including next-thread (offset up or down by a fixed amount) and XOR “butterfly” style permutations among the threads in a warp, are also available as CUDA intrinsics. Shuffle offers a performance advan-
tage over shared memory, in that a store-and-load operation is
carried out in a single step. Shuffle also can reduce the amount of
shared memory needed per thread block, since data exchanged
at the warp level never needs to be placed in shared memory. In
the case of FFT, which requires data sharing within a warp, a 6%
performance gain can be seen just by using Shuffle.
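The XOR “butterfly” pattern can be emulated in plain Python to show why it is useful: a warp-wide reduction needs only log2(32) = 5 exchange steps and no shared memory. This is a host-side sketch, not CUDA code; on the GPU each step would be a single `__shfl_xor` intrinsic.

```python
# Pure-Python sketch of the warp-level "butterfly" exchange that
# Kepler's shuffle instruction performs in hardware: each of the 32
# lanes reads the value of the lane whose index differs in one bit.
WARP_SIZE = 32

def shfl_xor(values, mask):
    """Each lane i receives values[i ^ mask] (no shared memory needed)."""
    return [values[i ^ mask] for i in range(WARP_SIZE)]

# Butterfly reduction: after 5 steps, every lane holds the warp sum.
vals = list(range(WARP_SIZE))
for mask in (16, 8, 4, 2, 1):
    vals = [a + b for a, b in zip(vals, shfl_xor(vals, mask))]
print(vals[0])  # 0 + 1 + ... + 31 = 496
```

Because the exchanged data never touches shared memory, each step is a single store-and-load-free operation, which is the performance advantage described above.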
Atomic Operations

Atomic memory operations are important in parallel pro-
gramming, allowing concurrent threads to correctly perform
read-modify-write operations on shared data structures. Atomic
operations such as add, min, max, and compare-and-swap are
atomic in the sense that the read, modify, and write operations
are performed without interruption by other threads.
Atomic memory operations are widely used for parallel sorting,
reduction operations, and building data structures in parallel
without locks that serialize thread execution. Throughput of
global memory atomic operations on Kepler GK110/210 is substantially improved compared to the Fermi generation. Atomic
operation throughput to a common global memory address is
improved by 9x to one operation per clock.
Atomic operation throughput to independent global address-
es is also significantly accelerated, and logic to handle address
conflicts has been made more efficient. Atomic operations can
often be processed at rates similar to global load operations.
This speed increase makes atomics fast enough to use frequent-
ly within kernel inner loops, eliminating the separate reduction
passes that were previously required by some algorithms to con-
solidate results.
This example shows some of the variations possible using the new Shuffle instruction in Kepler.
Kepler GK110 also expands the native support for 64-bit atomic
operations in global memory. In addition to atomicAdd, atomic-
CAS, and atomicExch (which were also supported by Fermi and
Kepler GK104), GK110 supports the following:
� atomicMin
� atomicMax
� atomicAnd
� atomicOr
� atomicXor
Other atomic operations which are not supported natively (for
example 64-bit floating point atomics) may be emulated using
the compare-and-swap (CAS) instruction.
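The CAS-loop idiom mentioned here can be sketched in Python. The `Memory` class below is purely illustrative (a lock stands in for the hardware CAS primitive); real CUDA code would loop on `atomicCAS` over the 64-bit pattern of the double.

```python
import threading

# Sketch of emulating a 64-bit floating-point atomic add with
# compare-and-swap. Memory is a stand-in for a global-memory word.
class Memory:
    def __init__(self, value=0.0):
        self.value = value
        self._lock = threading.Lock()  # models the hardware CAS primitive

    def cas(self, expected, new):
        """Atomically: if value == expected, store new. Return old value."""
        with self._lock:
            old = self.value
            if old == expected:
                self.value = new
            return old

def atomic_add_double(mem, delta):
    """Retry until our CAS wins, i.e. nobody raced us between read and write."""
    old = mem.value
    while True:
        assumed = old
        old = mem.cas(assumed, assumed + delta)
        if old == assumed:  # CAS succeeded
            return
```

The loop retries only when another thread updated the word between the read and the CAS, so uncontended adds complete in a single attempt.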
Texture Improvements

The GPU’s dedicated hardware Texture units are a valuable re-
source for compute programs with a need to sample or filter
image data. The texture throughput in Kepler is significantly in-
creased compared to Fermi – each SMX unit contains 16 texture
filtering units, a 4x increase vs the Fermi GF110 SM.
In addition, Kepler changes the way texture state is managed. In
the Fermi generation, for the GPU to reference a texture, it had
to be assigned a “slot” in a fixed-size binding table prior to grid
launch.
The number of slots in that table ultimately limits how many
unique textures a program can read from at run time. Ultimately,
a program was limited to accessing only 128 simultaneous tex-
tures in Fermi.
With bindless textures in Kepler, the additional step of using
slots isn’t necessary: texture state is now saved as an object in
memory and the hardware fetches these state objects on de-
mand, making binding tables obsolete.
This effectively eliminates any limits on the number of unique
textures that can be referenced by a compute program. Instead,
programs can map textures at any time and pass texture handles
around as they would any other pointer.
Dynamic Parallelism

In a hybrid CPU-GPU system, enabling a larger amount of paral-
lel code in an application to run efficiently and entirely within
the GPU improves scalability and performance as GPUs increase
in perf/watt. To accelerate these additional parallel portions of
the application, GPUs must support more varied types of parallel
workloads. Dynamic Parallelism is introduced with Kepler GK110
and also included in GK210. It allows the GPU to generate new
work for itself, synchronize on results, and control the schedul-
ing of that work via dedicated, accelerated hardware paths, all
without involving the CPU.
Fermi was very good at processing large parallel data structures
when the scale and parameters of the problem were known at
kernel launch time. All work was launched from the host CPU,
would run to completion, and return a result back to the CPU.
The result would then be used as part of the final solution, or
would be analyzed by the CPU which would then send additional
requests back to the GPU for additional processing.
In Kepler GK110/210 any kernel can launch another kernel, and
can create the necessary streams, events and manage the de-
pendencies needed to process additional work without the need
for host CPU interaction. This architectural innovation makes it
easier for developers to create and optimize recursive and data-dependent execution patterns, and allows more of a program to
be run directly on GPU. The system CPU can then be freed up for
additional tasks, or the system could be configured with a less
powerful CPU to carry out the same workload.
Dynamic Parallelism allows more varieties of parallel algorithms
to be implemented on the GPU, including nested loops with
differing amounts of parallelism, parallel teams of serial con-
trol-task threads, or simple serial control code offloaded to the
GPU in order to promote data-locality with the parallel portion
of the application. Because a kernel has the ability to launch ad-
ditional workloads based on intermediate, on-GPU results, pro-
grammers can now intelligently load-balance work to focus the
bulk of their resources on the areas of the problem that either
require the most processing power or are most relevant to the
solution. One example would be dynamically setting up a grid
for a numerical simulation – typically grid cells are focused in re-
gions of greatest change, requiring an expensive pre-processing
pass through the data.
Alternatively, a uniformly coarse grid could be used to prevent
wasted GPU resources, or a uniformly fine grid could be used to
ensure all the features are captured, but these options risk miss-
ing simulation features or “over-spending” compute resources
on regions of less interest. With Dynamic Parallelism, the grid
resolution can be determined dynamically at runtime in a da-
ta-dependent manner. Starting with a coarse grid, the simulation
can “zoom in” on areas of interest while avoiding unnecessary
calculation in areas with little change. Though this could be
accomplished using a sequence of CPU-launched kernels, it
would be far simpler to allow the GPU to refine the grid itself by
Dynamic Parallelism allows more parallel code in an application to be launched directly by the GPU onto itself (right side of image) rather than requiring CPU intervention (left side of image).
analyzing the data and launching additional work as part of a
single simulation kernel, eliminating interruption of the CPU and
data transfers between the CPU and GPU.
Hyper-Q

One of the challenges in the past has been keeping the GPU supplied with an optimally scheduled load of work from multiple streams. The Fermi architecture supported 16-way concur-
rency of kernel launches from separate streams, but ultimately
the streams were all multiplexed into the same hardware work
queue. This allowed for false intra-stream dependencies, requir-
ing dependent kernels within one stream to complete before ad-
ditional kernels in a separate stream could be executed. While
this could be alleviated to some extent through the use of a
breadth-first launch order, as program complexity increases, this
can become more and more difficult to manage efficiently.
Kepler GK110/210 improve on this functionality with their Hy-
per-Q feature. Hyper-Q increases the total number of connec-
tions (work queues) between the host and the CUDA Work Dis-
tributor (CWD) logic in the GPU by allowing 32 simultaneous,
hardware-managed connections (compared to the single con-
nection available with Fermi). Hyper-Q is a flexible solution that
allows connections from multiple CUDA streams, from multiple
Message Passing Interface (MPI) processes, or even from multiple
Image attribution Charles Reid
threads within a process. Applications that previously encoun-
tered false serialization across tasks, thereby limiting GPU utiliza-
tion, can see up to a 32x performance increase without changing
any existing code. Each CUDA stream is managed within its own
hardware work queue, inter-stream dependencies are optimized,
and operations in one stream will no longer block other streams,
enabling streams to execute concurrently without needing to
specifically tailor the launch order to eliminate possible false de-
pendencies.
Hyper-Q offers significant benefits for use in MPI-based parallel
computer systems. Legacy MPI-based algorithms were often cre-
ated to run on multi-core CPU systems, with the amount of work
assigned to each MPI process scaled accordingly. This can lead to
a single MPI process having insufficient work to fully occupy the
GPU. While it has always been possible for multiple MPI process-
es to share a GPU, these processes could become bottlenecked
by false dependencies. Hyper-Q removes those false dependen-
cies, dramatically increasing the efficiency of GPU sharing across
MPI processes.
Grid Management Unit – Efficiently Keeping the GPU Utilized

New features introduced with Kepler GK110, such as the ability
for CUDA kernels to launch work directly on the GPU with Dynam-
ic Parallelism, required that the CPU-to-GPU workflow in Kepler
offer increased functionality over the Fermi design. On Fermi, a
grid of thread blocks would be launched by the CPU and would
always run to completion, creating a simple unidirectional flow
of work from the host to the SMs via the CUDA Work Distributor
(CWD) unit. Kepler GK110/210 improve the CPU-to-GPU workflow
by allowing the GPU to efficiently manage both CPU and CUDA
created workloads.
NVIDIA discussed the ability of the Kepler GK110 GPU to allow kernels to launch work directly on the GPU, and it’s important to understand the changes made in the Kepler GK110 architecture
to facilitate these new functions. In Kepler GK110/210, a grid can
be launched from the CPU just as was the case with Fermi, how-
ever new grids can also be created programmatically by CUDA
within the Kepler SMX unit. To manage both CUDA-created and
host-originated grids, a new Grid Management Unit (GMU) was
introduced in Kepler GK110. This control unit manages and pri-
oritizes grids that are passed into the CWD to be sent to the SMX
units for execution.
The CWD in Kepler holds grids that are ready to dispatch, and
it is able to dispatch 32 active grids, which is double the capac-
ity of the Fermi CWD. The Kepler CWD communicates with the
GMU via a bi-directional link that allows the GMU to pause the
dispatch of new grids and to hold pending and suspended grids
until needed. The GMU also has a direct connection to the Kepler
SMX units to permit grids that launch additional work on the
GPU via Dynamic Parallelism to send the new work back to GMU
to be prioritized and dispatched. If the kernel that dispatched the
additional workload pauses, the GMU will hold it inactive until
the dependent work has completed.
The redesigned Kepler HOST to GPU workflow shows the new Grid Management Unit, which allows it to manage the actively dispatching grids, pause dispatch, and hold pending and suspended grids.
transtec has strived to develop well-engineered GPU Computing solutions from the very beginning of the Tesla era. From
High-Performance GPU Workstations to rack-mounted Tesla
server solutions, transtec has a broad range of specially de-
signed systems available.
As an NVIDIA Tesla Preferred Provider (TPP), transtec is able to
provide customers with the latest NVIDIA GPU technology as
well as fully-engineered hybrid systems and Tesla Preconfig-
ured Clusters. Thus, customers can be assured that transtec’s
large experience in HPC cluster solutions is seamlessly brought
into the GPU computing world. Performance Engineering made
by transtec.
NVIDIA GPUDirect

When working with a large amount of data, increasing the data
throughput and reducing latency is vital to increasing compute
performance. Kepler GK110/210 support the RDMA feature in
NVIDIA GPUDirect, which is designed to improve performance
by allowing direct access to GPU memory by third-party devices
such as IB adapters, NICs, and SSDs.
When using CUDA 5.0 or later, GPUDirect provides the following
important features:
� Direct memory access (DMA) between NIC and GPU without
the need for CPU-side data buffering.
� Significantly improved MPI_Send/MPI_Recv efficiency between
GPU and other nodes in a network.
� Eliminates CPU bandwidth and latency bottlenecks
� Works with a variety of 3rd-party network, capture, and storage
devices
Applications like reverse time migration (used in seismic imaging
for oil & gas exploration) distribute the large imaging data across
several GPUs. Hundreds of GPUs must collaborate to crunch the
data, often communicating intermediate results. GPUDirect enables much higher aggregate bandwidth for this GPU-to-GPU communication scenario within a server and across servers with the P2P and RDMA features. Kepler GK110 also supports other GPUDirect features such as Peer-to-Peer and GPUDirect for Video.

GPUDirect RDMA allows direct access to GPU memory from 3rd-party devices such as network adapters, which translates into direct transfers between GPUs across nodes as well.
Kepler Memory Subsystem – L1, L2, ECC

Kepler’s memory hierarchy is organized similarly to Fermi. The
Kepler architecture supports a unified memory request path for
loads and stores, with an L1 cache per SMX multiprocessor. Ke-
pler GK110 also enables compiler-directed use of an additional
new cache for read-only data, as described below.
Configurable Shared Memory and L1 Cache

In the Kepler GK110 architecture, as in the previous generation
Fermi architecture, each SMX has 64 KB of on-chip memory that
can be configured as 48 KB of Shared memory with 16 KB of L1
cache, or as 16 KB of shared memory with 48 KB of L1 cache.
Kepler now allows for additional flexibility in configuring the
allocation of shared memory and L1 cache by permitting a 32 KB/
32 KB split between shared memory and L1 cache.
Configurable Shared Memory and L1 Cache
To support the increased throughput of each SMX unit, the
shared memory bandwidth for 64 bit and larger load operations
is also doubled compared to the Fermi SM, to 256 bytes per core
clock. For the GK210 architecture, the total amount of configu-
rable memory is doubled to 128 KB, allowing a maximum of 112
KB shared memory and 16 KB of L1 cache. Other possible memory
configurations are 32 KB L1 cache with 96 KB shared memory, or
48 KB L1 cache with 80 KB of shared memory.
This increase allows a similar improvement in concurrency of
threads as is enabled by the register file capacity improvement
described above.
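The configurable splits listed above all partition one fixed on-chip pool, which is easy to sanity-check:

```python
# The shared-memory / L1 splits from the text all sum to the fixed
# per-SMX on-chip pool: 64 KB on GK110, 128 KB on GK210.
GK110_POOL_KB = 64
GK210_POOL_KB = 128

gk110_splits = [(48, 16), (32, 32), (16, 48)]   # (shared KB, L1 KB)
gk210_splits = [(112, 16), (96, 32), (80, 48)]

assert all(s + l == GK110_POOL_KB for s, l in gk110_splits)
assert all(s + l == GK210_POOL_KB for s, l in gk210_splits)
print("all splits partition the pool exactly")
```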
48 KB Read-Only Data Cache

In addition to the L1 cache, Kepler introduces a 48 KB cache for
data that is known to be read-only for the duration of the func-
tion. In the Fermi generation, this cache was accessible only by
the Texture unit. Expert programmers often found it advanta-
geous to load data through this path explicitly by mapping their
data as textures, but this approach had many limitations.
In Kepler, in addition to significantly increasing the capacity of
this cache along with the texture horsepower increase, we de-
cided to make the cache directly accessible to the SM for general
load operations. Use of the read-only path is beneficial because
it takes both load and working set footprint off of the Shared/L1
cache path. In addition, the Read-Only Data Cache’s higher tag
bandwidth supports full speed unaligned memory access pat-
terns among other scenarios.
Use of the read-only path can be managed automatically by the
compiler or explicitly by the programmer. Access to any variable
or data structure that is known to be constant through program-
mer use of the C99 standard “const __restrict” keyword may be
tagged by the compiler to be loaded through the Read-Only Data
Cache. The programmer can also explicitly use this path with the
__ldg() intrinsic.
NVIDIA, GeForce, Tesla, CUDA, PhysX, GigaThread, NVIDIA Par-
allel Data Cache and certain other trademarks and logos ap-
pearing in this brochure, are trademarks or registered trade-
marks of NVIDIA Corporation.
Improved L2 Cache

The Kepler GK110/210 GPUs feature 1536 KB of dedicated L2
cache memory, double the amount of L2 available in the Fermi
architecture. The L2 cache is the primary point of data unifica-
tion between the SMX units, servicing all load, store, and texture
requests and providing efficient, high speed data sharing across
the GPU.
The L2 cache on Kepler offers up to 2x the bandwidth per clock available in Fermi. Algorithms for which data addresses are not
known beforehand, such as physics solvers, ray tracing, and
sparse matrix multiplication especially benefit from the cache
hierarchy. Filter and convolution kernels that require multiple
SMs to read the same data also benefit.
Memory Protection Support

Like Fermi, Kepler’s register files, shared memories, L1 cache, L2
cache and DRAM memory are protected by a Single-Error Correct
Double-Error Detect (SECDED) ECC code. In addition, the Read-
Only Data Cache supports single-error correction through a par-
ity check; in the event of a parity error, the cache unit automati-
cally invalidates the failed line, forcing a read of the correct data
from L2.
ECC checkbit fetches from DRAM necessarily consume some
amount of DRAM bandwidth, which results in a performance
difference between ECC-enabled and ECC-disabled operation,
especially on memory bandwidth-sensitive applications. Kepler
GK110 implements several optimizations to ECC checkbit fetch
handling based on Fermi experience. As a result, the ECC on-vs-
off performance delta has been reduced by an average of 66%,
as measured across our internal compute application test suite.
CUDA

CUDA is a combination hardware/software platform that
enables NVIDIA GPUs to execute programs written with C,
C++, Fortran, and other languages. A CUDA program invokes
parallel functions called kernels that execute across many
parallel threads. The programmer or compiler organizes
these threads into thread blocks and grids of thread blocks.
Each thread within a thread block executes an instance of
the kernel. Each thread also has thread and block IDs within
its thread block and grid, a program counter, registers, per-
thread private memory, inputs, and output results.
A thread block is a set of concurrently executing threads that
can cooperate among themselves through barrier synchro-
nization and shared memory. A thread block has a block ID
within its grid. A grid is an array of thread blocks that execute
the same kernel, read inputs from global memory, write re-
sults to global memory, and synchronize between dependent
kernel calls. In the CUDA parallel programming model, each
thread has a per-thread private memory space used for regis-
ter spills, function calls, and C automatic array variables. Each
thread block has a per-block shared memory space used for
inter-thread communication, data sharing, and result sharing
in parallel algorithms. Grids of thread blocks share results in
Global Memory space after kernel-wide global synchroniza-
tion.
CUDA Hardware Execution

CUDA’s hierarchy of threads maps to a hierarchy of proces-
sors on the GPU; a GPU executes one or more kernel grids; a
streaming multiprocessor (SM on Fermi / SMX on Kepler) ex-
ecutes one or more thread blocks; and CUDA cores and other
execution units in the SMX execute thread instructions. The SMX
executes threads in groups of 32 threads called warps.
While programmers can generally ignore warp execution for
functional correctness and focus on programming individual
scalar threads, they can greatly improve performance by having
threads in a warp execute the same code path and access memory
with nearby addresses.
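The index arithmetic implied by this hierarchy is straightforward. A Python sketch, with `blockIdx`, `blockDim` and `threadIdx` modeled as plain integers for a 1-D launch:

```python
# How CUDA's thread hierarchy turns into flat indices (1-D launch).
WARP_SIZE = 32

def global_thread_id(block_idx, block_dim, thread_idx):
    """Flat index of a thread across the whole grid."""
    return block_idx * block_dim + thread_idx

def warp_and_lane(thread_idx):
    """Which warp within its block a thread belongs to, and its lane."""
    return thread_idx // WARP_SIZE, thread_idx % WARP_SIZE

# Thread 70 of block 2, with 256 threads per block:
print(global_thread_id(2, 256, 70))  # 582
print(warp_and_lane(70))             # warp 2, lane 6
```

Threads with the same warp index execute together, which is why the guidance above favors warp-uniform code paths and nearby memory addresses.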
CUDA Hierarchy of threads, blocks, and grids, with corresponding per-thread private, per-block shared, and per-application global memory spaces.
NVIDIA GPU Computing
A Quick Refresher on CUDA
CUDA Toolkit

The NVIDIA CUDA Toolkit provides a comprehensive development
environment for C and C++ developers building GPU-accelerated
applications. The CUDA Toolkit includes a compiler for NVIDIA
GPUs, math libraries, and tools for debugging and optimizing the
performance of your applications. You’ll also find programming
guides, user manuals, API reference, and other documentation to
help you get started quickly accelerating your application with
GPUs.
Dramatically simplify parallel programming
64-bit ARM Support
� Develop or recompile your applications to run on 64-bit ARM
systems with NVIDIA GPUs
Unified Memory
� Simplifies programming by enabling applications to access
CPU and GPU memory without the need to manually copy
data. Read more about unified memory.
Drop-in Libraries and cuFFT callbacks
� Automatically accelerate applications’ BLAS and FFTW calcu-
lations by up to 8X by simply replacing the existing CPU librar-
ies with the GPU-accelerated equivalents.
� Use cuFFT callbacks for higher performance custom process-
ing on input or output data.
Multi-GPU scaling
� cublasXT – a new BLAS GPU library that automatically scales
performance across up to eight GPUs in a single node, deliver-
ing over nine teraflops of double precision performance per
node, and supporting larger workloads than ever before (up
to 512GB). The re-designed FFT GPU library scales up to 2 GPUs
in a single node, allowing larger transform sizes and higher
throughput.
Improved Tools Support
� Support for Microsoft Visual Studio 2013.
� Improved debugging for CUDA FORTRAN applications
� Replay feature in Visual Profiler and nvprof
� nvprune utility to optimize the size of object files
The new Tesla K80 GPU Accelerator based on the GK210 GPU
A dual-GPU board that combines 24 GB of memory with blazing-fast memory bandwidth and up to 2.91 Tflops double-precision performance with NVIDIA GPU Boost, the Tesla K80 GPU is designed for the most demanding computational tasks. It’s ideal for single- and double-precision workloads that not only require leading compute performance but also demand high data throughput.
Intel Xeon Phi Coprocessor – The Architecture
Intel Many Integrated Core (Intel MIC) architecture combines many Intel CPU cores onto
a single chip. Intel MIC architecture is targeted for highly parallel, High Performance
Computing (HPC) workloads in a variety of fields such as computational physics, chem-
istry, biology, and financial services. Today such workloads are run as task parallel pro-
grams on large compute clusters.
What is the Intel Xeon Phi coprocessor?
Intel Xeon Phi coprocessors are PCI Express form factor add-in
cards that work synergistically with Intel Xeon processors to
enable dramatic performance gains for highly parallel code –
up to 1.2 double-precision teraFLOPS (floating point operations
per second) per coprocessor. Manufactured using Intel’s indus-
try-leading 22 nm technology with 3-D Tri-Gate transistors, each
coprocessor features more cores, more threads, and wider vector
execution units than an Intel Xeon processor. The high degree of
parallelism compensates for the lower speed of each core to de-
liver higher aggregate performance for highly parallel workloads.
What applications can benefit from the Intel Xeon Phi
coprocessor?
While a majority of applications (80 to 90 percent) will contin-
ue to achieve maximum performance on Intel Xeon processors,
certain highly parallel applications will benefit dramatically by
using Intel Xeon Phi coprocessors. To take full advantage of Intel
Xeon Phi coprocessors, an application must scale well to over 100
software threads and either make extensive use of vectors or ef-
ficiently use more local memory bandwidth than is available on
an Intel Xeon processor. Examples of segments with highly paral-
lel applications include: animation, energy, finance, life sciences,
manufacturing, medical, public sector, weather, and more.
When do I use Intel Xeon processors and Intel Xeon Phi
coprocessors?
Think “reuse” rather than “recode”
Since languages, tools, and applications are compatible across Intel Xeon processors and coprocessors, you can think “reuse” rather than “recode”.
Single programming model for all your code
The Intel Xeon Phi coprocessor gives developers a hardware de-
sign point optimized for extreme parallelism, without requiring
them to re-architect or rewrite their code. No need to rethink the
entire problem or learn a new programming model; simply rec-
ompile and optimize existing code using familiar tools, libraries,
and runtimes.
Performance multiplier
By maintaining a single source code between Intel Xeon proces-
sors and Intel Xeon Phi coprocessors, developers optimize once
for parallelism but maximize performance on both processor
and coprocessor.
Execution flexibility
Intel Xeon Phi is designed from the ground up for high-perfor-
mance computing. Unlike a GPU, a coprocessor can host an oper-
ating system, be fully IP addressable, and support standards such
as Message Passing Interface (MPI). Additionally, it can operate in
multiple execution modes:
� “Symmetric” mode: Workload tasks are shared between the
host processor and coprocessor
� “Native” mode: The workload resides entirely on the coprocessor, which essentially acts as a separate compute node
� “Offload” mode: Workload resides on the host processor, and
parts of the workload are sent out to the coprocessor as needed
Multiple Intel Xeon Phi coprocessors can be installed in a sin-
gle host system. Within a single system, the coprocessors can
communicate with each other through the PCIe peer-to-peer
interconnect without any intervention from the host. Similarly,
A Family of Coprocessors for Diverse Needs
Intel Xeon Phi coprocessors provide up to 61 cores, 244
threads, and 1.2 teraflops of performance, and they come
in a variety of configurations to address diverse hardware,
software, workload, performance, and efficiency require-
ments. They also come in a variety of form factors, including
a standard PCIe x16 form factor (with active, passive, or no
thermal solution), and a dense form factor that offers addi-
tional design flexibility.
� The Intel Xeon Phi Coprocessor 3100 family provides out-
standing parallel performance. It is an excellent choice for
compute-bound workloads, such as Monte Carlo, Black-Scholes, HPL, LifeSc, and many others. Active and passive
cooling options provide flexible support for a variety of
server and workstation systems.
� The Intel Xeon Phi Coprocessor 5100 family is opti-
mized for high-density computing and is well-suited for
workloads that are memory-bandwidth bound, such as
STREAM, memory-capacity bound, such as ray-tracing, or
both, such as reverse time migration (RTM). These copro-
cessors are passively cooled and have the lowest thermal
design power (TDP) of the Intel Xeon Phi product family.
� The Intel Xeon Phi Coprocessor 7100 family provides the
most features and the highest performance and memory
capacity of the Intel Xeon Phi product family. This family
supports Intel Turbo Boost Technology 1.0, which increases core frequencies during peak workloads when thermal
conditions allow. Passive and no thermal solution options
enable powerful and innovative computing solutions.
the coprocessors can also communicate through a network card
such as InfiniBand or Ethernet, without any intervention from
the host.
The Intel Xeon Phi coprocessor is primarily composed of process-
ing cores, caches, memory controllers, PCIe client logic, and a very
high bandwidth, bidirectional ring interconnect. Each core comes
complete with a private L2 cache that is kept fully coherent by a
global-distributed tag directory. The memory controllers and the
PCIe client logic provide a direct interface to the GDDR5 mem-
ory on the coprocessor and the PCIe bus, respectively. All these
components are connected together by the ring interconnect.
Microarchitecture of the first-generation Intel Xeon Phi product, codenamed “Knights Corner”
Each core in the Intel Xeon Phi coprocessor is designed to be pow-
er efficient while providing a high throughput for highly parallel
workloads. A closer look reveals that the core uses a short in-or-
der pipeline and is capable of supporting 4 threads in hardware.
It is estimated that the cost to support IA architecture legacy is a
mere 2% of the area costs of the core and is even less at a full chip
or product level. Thus the cost of bringing the Intel Architecture
legacy capability to the market is very marginal.
Vector Processing Unit
An important component of the Intel Xeon Phi coprocessor’s
core is its vector processing unit (VPU). The VPU features a novel
512-bit SIMD instruction set, officially known as Intel Initial Many
Core Instructions (Intel IMCI). Thus, the VPU can execute 16 sin-
gle-precision (SP) or 8 double-precision (DP) operations per cycle.
The VPU also supports Fused Multiply-Add (FMA) instructions and
hence can execute 32 SP or 16 DP floating point operations per
cycle. It also provides support for integers.
Vector units are very power efficient for HPC workloads. A single
operation can encode a great deal of work and does not incur en-
ergy costs associated with fetching, decoding, and retiring many
instructions. However, several improvements were required to
support such wide SIMD instructions. For example, a mask register
was added to the VPU to allow per lane predicated execution. This
helped in vectorizing short conditional branches, thereby im-
proving the overall software pipelining efficiency. The VPU also
supports gather and scatter instructions, which are simply non-
unit stride vector memory accesses, directly in hardware. Thus
for codes with sporadic or irregular access patterns, vector scat-
ter and gather instructions help in keeping the code vectorized.
The VPU also features an Extended Math Unit (EMU) that can ex-
ecute transcendental operations such as reciprocal, square root,
and log, thereby allowing these operations to be executed in a
vector fashion with high bandwidth. The EMU operates by calcu-
lating polynomial approximations of these functions.
The Interconnect
The interconnect is implemented as a bidirectional ring. Each
direction is comprised of three independent rings. The first,
largest, and most expensive of these is the data block ring. The
data block ring is 64 bytes wide to support the high bandwidth
requirement due to the large number of cores. The address ring
is much smaller and is used to send read/write commands and
memory addresses. Finally, the smallest ring and the least expen-
sive ring is the acknowledgement ring, which sends flow control
and coherence messages.
When a core accesses its L2 cache and misses, an address request
is sent on the address ring to the tag directories. The memory ad-
dresses are uniformly distributed amongst the tag directories
on the ring to provide a smooth traffic characteristic on the ring.
If the requested data block is found in another core’s L2 cache,
a forwarding request is sent to that core’s L2 over the address
ring and the request block is subsequently forwarded on the
data block ring. If the requested data is not found in any caches,
a memory address is sent from the tag directory to the memory
controller.
The figure below shows the distribution of the memory con-
trollers on the bidirectional ring. The memory controllers are
symmetrically interleaved around the ring. There is an all-to-all
mapping from the tag directories to the memory controllers. The
addresses are evenly distributed across the memory controllers,
thereby eliminating hotspots and providing a uniform access
pattern which is essential for a good bandwidth response.
During a memory access, whenever an L2 cache miss occurs on a
core, the core generates an address request on the address ring
and queries the tag directories. If the data is not found in the tag
directories, the core generates another address request and que-
ries the memory for the data. Once the memory controller fetch-
es the data block from memory, it is returned back to the core
over the data ring. Thus during this process, one data block, two
address requests (and by protocol, two acknowledgement mes-
sages) are transmitted over the rings. Since the data block rings
are the most expensive and are designed to support the required
data bandwidth, it is necessary to increase the number of less ex-
pensive address and acknowledgement rings by a factor of two
to match the increased bandwidth requirement caused by the
higher number of requests on these rings.
Other Design Features
Other micro-architectural optimizations incorporated into the
Intel Xeon Phi coprocessor include a 64-entry second-level Trans-
lation Lookaside Buffer (TLB), simultaneous data cache loads and
stores, and 512KB L2 caches. Lastly, the Intel Xeon Phi coproc-
essor implements a 16 stream hardware prefetcher to improve
the cache hits and provide higher bandwidth. The figure below
shows the net performance improvements for the SPECfp 2006
benchmark suite for a single core, single thread runs. The results
indicate an average improvement of over 80% per cycle not in-
cluding frequency.
Per-Core ST Performance Improvement (per cycle)
Interleaved Memory Access
Interconnect: 2x AD/AK
Caches
The Intel MIC architecture invests more heavily in L1 and L2 caches compared to GPU architectures. The Intel Xeon Phi coproces-
sor implements a leading-edge, very high bandwidth memory
subsystem. Each core is equipped with a 32KB L1 instruction
cache and 32KB L1 data cache and a 512KB unified L2 cache.
These caches are fully coherent and implement the x86 memory
order model. The L1 and L2 caches provide an aggregate band-
width that is approximately 15 and 7 times, respectively, faster
compared to the aggregate memory bandwidth. Hence, effective
utilization of the caches is key to achieving peak performance on
the Intel Xeon Phi coprocessor. In addition to improving band-
width, the caches are also more energy efficient for supplying
data to the cores than memory. The figure below shows the ener-
gy consumed per byte of data transferred from the memory, and
L1 and L2 caches. In the exascale compute era, caches will play a
crucial role in achieving real performance under restrictive pow-
er constraints.
Stencils
Stencils are common in physics simulations and are classic exam-
ples of workloads which show a large performance gain through
efficient use of caches.
Caches – For or Against?
Stencils are typically employed in simulation of physical sys-
tems to study the behavior of the system over time. When these
workloads are not programmed to be cache-blocked, they will be
bound by memory bandwidth. Cache blocking promises substan-
tial performance gains given the increased bandwidth and ener-
gy efficiency of the caches compared to memory. Cache blocking
improves performance by blocking the physical structure or the
physical system such that the blocked data fits well into a core’s
L1 and/or L2 caches. For example, during each time-step, the
same core can process the data which is already resident in the
L2 cache from the last time step, and hence does not need to be
fetched from the memory, thereby improving performance. Addi-
tionally, the cache coherence further aids the stencil operation
by automatically fetching the updated data from the nearest
neighboring blocks which are resident in the L2 caches of other
cores. Thus, stencils clearly demonstrate the benefits of efficient
cache utilization and coherence in HPC workloads.
Power Management
Intel Xeon Phi coprocessors are not suitable for all workloads.
In some cases, it is beneficial to run the workloads only on the
host. In such situations where the coprocessor is not being used,
it is necessary to put the coprocessor in a power-saving mode.
To conserve power, as soon as all four threads on a core are
halted, the clock to the core is gated.
Once the clock has been gated for some programmable time, the
core power gates itself, thereby eliminating any leakage. At any
point, any number of the cores can be powered down or powered
up as shown in the figure below. Additionally, when all the cores
are power gated and the uncore detects no activity, the tag direc-
tories, the interconnect, L2 caches and the memory controllers
are clock gated.
At this point, the host driver can put the coprocessor into a deep-
er sleep or an idle state, wherein all the uncore is power gated,
the GDDR is put into a self-refresh mode, the PCIe logic is put in
a wait state for a wakeup and the GDDR-IO is consuming very lit-
tle power. These power management techniques help conserve
power and make Intel Xeon Phi coprocessor an excellent candi-
date for data centers.
Summary
The Intel Xeon Phi coprocessor provides high performance and performance per watt for highly parallel HPC workloads, while not requiring a new programming model, API, language, or restrictive memory model. It is able to do this with an array of general-purpose cores with multiple thread contexts, wide vector units, caches, and high-bandwidth on-die and memory interconnects.
Knights Corner is the first Intel Xeon Phi product in the MIC ar-
chitecture family of processors from Intel, aimed at enabling the
exascale era of computing.
Intel Xeon Phi Coprocessor
An Outlook on Knights Landing
and Omni Path
Knights Landing
Knights Landing is the codename for Intel’s 2nd generation In-
tel Xeon Phi Product Family, which will deliver massive thread
parallelism, data parallelism and memory bandwidth – with
improved single-thread performance and Intel Xeon processor
binary-compatibility in a standard CPU form factor.
Additionally, Knights Landing will offer integrated Intel Om-
ni-Path fabric technology, and also be available in the tradition-
al PCIe coprocessor form factor.
Performance
3+ TeraFLOPS of double-precision peak theoretical performance
per single socket node.
Integration
Intel Omni Scale fabric integration
High-performance on-package memory (MCDRAM)
� Over 5x STREAM performance vs. DDR4, over 400 GB/s
� Up to 16GB at launch
� NUMA support
� Over 5x energy efficiency vs. GDDR5
� Over 3x density vs. GDDR5
� In partnership with Micron Technology
� Flexible memory modes, including cache and flat
Server Processor
� Standalone bootable processor (running host OS) and a PCIe
coprocessor (PCIe end-point device)
� Platform memory capacity comparable to Intel Xeon
Processors
� Reliability (“Intel server-class reliability”)
� Power Efficiency (Over 25% better than discrete coprocessor)
with over 10 GF/W
� Density (3+ KNL with fabric in 1U)
Microarchitecture
� Based on Intel’s 14 nanometer manufacturing technology
� Binary compatible with Intel Xeon Processors
� Support for Intel Advanced Vector Extensions 512
(Intel AVX-512)
� 3x Single-Thread Performance compared to Knights Corner
� Cache-coherency
� 60+ cores in a 2D Mesh architecture
� 4 Threads / Core
� Deep Out-of-Order Buffers
� Gather/scatter in hardware
� Advanced Branch Prediction
� High cache bandwidth
� Most of today’s parallel optimizations carry forward to KNL
� Multiple NUMA domain support per socket
Roadmap
� Knights Hill is the codename for the 3rd generation of the
Intel Xeon Phi product family
� Based on Intel’s 10 nanometer manufacturing technology
� Integrated 2nd generation Intel Omni-Path Fabric
Omni Path – the Next-Generation Fabric
Intel Omni-Path Architecture is designed to deliver the perfor-
mance for tomorrow’s high performance computing (HPC) work-
loads and the ability to scale to tens – and eventually hundreds
– of thousands of nodes at a cost-competitive price to today’s
fabrics.
This next-generation fabric from Intel builds on the Intel True
Scale Fabric with an end-to-end solution, including PCIe adapters,
silicon, switches, cables, and management software. Customers
deploying Intel True Scale Fabric today will be able to migrate to
Intel Omni-Path Architecture through an Intel upgrade program.
In the near future, Intel will integrate the Intel Omni-Path Host
Fabric Interface onto future generations of Intel Xeon processors
and Intel Xeon Phi processors to address several key challenges
of HPC computing centers:
� Performance: Processor capacity and memory bandwidth are
scaling faster than system I/O. Intel Omni-Path Architecture
will accelerate message passing interface (MPI) rates for to-
morrow’s HPC.
� Cost and density: More components on a server limit density
and increase fabric cost. An integrated fabric controller helps
eliminate the additional costs and required space of discrete
cards, enabling higher server density.
� Reliability and power: Discrete interface cards consume
many watts of power. Integrated onto the processor, the In-
tel Omni-Path Host Fabric Interface will draw less power with
fewer discrete components.
For software applications, Intel Omni-Path Architecture will
maintain consistency and compatibility with existing Intel True
Scale Fabric and InfiniBand APIs by working through the open
source OpenFabrics Alliance (OFA) software stack. The OFA soft-
ware stack works with leading Linux distribution releases.
Intel is providing a clear path to addressing the issues of tomor-
row’s HPC with Intel Omni-Path Architecture.
InfiniBand High-Speed Interconnect
InfiniBand (IB) is an efficient I/O technology that provides high-speed data transfers
and ultra-low latencies for computing and storage over a highly reliable and scalable
single fabric. The InfiniBand industry standard ecosystem creates cost effective hard-
ware and software solutions that easily scale from generation to generation.
InfiniBand is a high-bandwidth, low-latency network interconnect solution that has
grown tremendous market share in the High Performance Computing (HPC) cluster
community. InfiniBand was designed to take the place of today’s data center network-
ing technology. In the late 1990s, a group of next generation I/O architects formed an
open, community driven network technology to provide scalability and stability based
on successes from other network designs.
Today, InfiniBand is a popular and widely used I/O fabric among customers within the
Top500 supercomputers: major Universities and Labs; Life Sciences; Biomedical; Oil and
Gas (Seismic, Reservoir, Modeling Applications); Computer Aided Design and Engineer-
ing; Enterprise Oracle; and Financial Applications.
InfiniBand was designed to meet the evolving needs of the
high performance computing market. Computational science
depends on InfiniBand to deliver:
� High Bandwidth: Supports host connectivity of 10Gbps with
Single Data Rate (SDR), 20Gbps with Double Data Rate (DDR),
and 40Gbps with Quad Data Rate (QDR), all while offering an
80Gbps switch for link switching
� Low Latency: Accelerates the performance of HPC and enter-
prise computing applications by providing ultra-low latencies
� Superior Cluster Scaling: Point-to-point latency remains low
as node and core counts scale – 1.2 µs. Highest real message rate per adapter: each PCIe x16 adapter drives 26 million messag-
es per second. Excellent communications/computation over-
lap among nodes in a cluster
� High Efficiency: InfiniBand allows reliable protocols like Re-
mote Direct Memory Access (RDMA) communication to occur
between interconnected hosts, thereby increasing efficiency
� Fabric Consolidation and Energy Savings: InfiniBand can
consolidate networking, clustering, and storage data over a
single fabric, which significantly lowers overall power, real
estate, and management overhead in data centers. Enhanced
Quality of Service (QoS) capabilities support running and
managing multiple workloads and traffic classes
� Data Integrity and Reliability: InfiniBand provides the high-
est levels of data integrity by performing Cyclic Redundancy
Checks (CRCs) at each fabric hop and end-to-end across the
fabric to avoid data corruption. To meet the needs of mission
critical applications and high levels of availability, InfiniBand
provides fully redundant and lossless I/O fabrics with auto-
matic failover path and link layer multi-paths
InfiniBand
The Architecture
Intel is a global leader and technology innovator in high performance networking, including adapters, switches, and ASICs.
Components of the InfiniBand Fabric
InfiniBand is a point-to-point, switched I/O fabric architecture.
Point-to-point means that each communication link extends
between only two devices. Both devices at each end of a link
have full and exclusive access to the communication path. To go
beyond a point and traverse the network, switches come into
play. By adding switches, multiple points can be interconnected
to create a fabric. As more switches are added to a network, ag-
gregated bandwidth of the fabric increases. By adding multiple
paths between devices, switches also provide a greater level of
redundancy.
The InfiniBand fabric has four primary components, which are
explained in the following sections:
� Host Channel Adapter
� Target Channel Adapter
� Switch
� Subnet Manager
Host Channel Adapter
This adapter is an interface that resides within a server and com-
municates directly with the server’s memory and processor as
well as the InfiniBand Architecture (IBA) fabric. The adapter guar-
antees delivery of data, performs advanced memory access, and
can recover from transmission errors. Host channel adapters can
communicate with a target channel adapter or a switch. A host
channel adapter can be a standalone InfiniBand card or it can
be integrated on a system motherboard. Intel TrueScale Infini-
Band host channel adapters outperform the competition with
the industry’s highest message rate. Combined with the lowest
MPI latency and highest effective bandwidth, Intel host channel
adapters enable MPI and TCP applications to scale to thousands
of nodes with unprecedented price performance.
Target Channel Adapter
This adapter enables I/O devices, such as disk or tape storage, to
be located within the network independent of a host computer.
Target channel adapters include an I/O controller that is specific
to its particular device’s protocol (for example, SCSI, Fibre Chan-
nel (FC), or Ethernet). Target channel adapters can communicate
with a host channel adapter or a switch.
Switch
An InfiniBand switch allows many host channel adapters and
target channel adapters to connect to it and handles network
traffic. The switch looks at the “local route header” on each
packet of data that passes through it and forwards it to the ap-
propriate location. The switch is a critical component of the In-
finiBand implementation that offers higher availability, higher
aggregate bandwidth, load balancing, data mirroring, and much
more. A group of switches is referred to as a Fabric. If a host com-
puter is down, the switch still continues to operate. The switch
also frees up servers and other devices by handling network
traffic.
Figure 1: Typical InfiniBand High Performance Cluster
(Chart: PUMA elapsed time – lower is better – at 16, 32, and 64 cores for MPI v3.1 vs. MPI v4.0, showing a 35% improvement.)
InfiniBand
Intel MPI Library 4.0 Performance
Introduction
Intel’s latest MPI release, Intel MPI 4.0, is now optimized to work
with Intel’s TrueScale InfiniBand adapter. Intel MPI 4.0 can now
directly call Intel’s TrueScale Performance Scale Messaging
(PSM) interface. The PSM interface is designed to optimize MPI
application performance. This means that organizations will be
able to achieve a significant performance boost with a combina-
tion of Intel MPI 4.0 and Intel TrueScale InfiniBand.
Solution
Intel tuned and optimized its latest MPI release – Intel MPI Library 4.0 – to improve performance when used with Intel TrueScale InfiniBand. With MPI Library 4.0,
applications can make full use of High Performance Computing
(HPC) hardware, improving the overall performance of the appli-
cations on the clusters.
“Effective fabric management has become the most important factor
in maximizing performance in an HPC cluster. With IFS 7.0, Intel has
addressed all of the major fabric management issues in a product
that in many ways goes beyond what others are offering.”
Michael Wirth, HPC Presales Specialist
Intel MPI Library 4.0 Performance
Intel MPI Library 4.0 uses the high performance MPI-2 specifica-
tion on multiple fabrics, which results in better performance for
applications on Intel architecture-based clusters.
This library enables quick delivery of maximum end user perfor-
mance, even if there are changes or upgrades to new intercon-
nects – without requiring major changes to the software or op-
erating environment. This high-performance, message-passing
interface library develops applications that can run on multiple
cluster fabric interconnects chosen by the user at runtime.
Testing
Intel used the Parallel Unstructured Maritime Aerodynamics
(PUMA) Benchmark program to test the performance of Intel
MPI Library versions 4.0 and 3.1 with Intel TrueScale InfiniBand.
The program analyses internal and external non-reacting com-
pressible flows over arbitrarily complex 3D geometries. PUMA is
written in ANSI C, and uses MPI libraries for message passing.
Results
The test showed that MPI performance can improve by more
than 35 percent using Intel MPI Library 4.0 with Intel TrueScale
InfiniBand, compared to using Intel MPI Library 3.1.
Intel TrueScale InfiniBand Adapters
Intel TrueScale InfiniBand Adapters offer scalable performance,
reliability, low power, and superior application performance.
These adapters ensure superior performance of HPC appli-
cations by delivering the highest message rate for multicore
compute nodes, the lowest scalable latency, large node count
clusters, the highest overall bandwidth on PCI Express Gen1
platforms, and superior power efficiency.
Purpose: Compare the performance of Intel MPI Library 4.0 and 3.1 with Intel InfiniBand
Benchmark: PUMA Flow
Cluster: Intel/IBM iDataPlex cluster
System Configuration: The NET track IBM Q-Blue Cluster/iDataPlex nodes were configured as follows:
Processor: Intel Xeon CPU X5570 @ 2.93 GHz
Memory: 24GB (6x4GB) @ 1333MHz (DDR3)
QDR InfiniBand Switch: Intel Model 12300, firmware version 6.0.0.0.33
QDR InfiniBand Host Channel Adapter: Intel QLE7340, software stack 5.1.0.0.49
Operating System: Red Hat Enterprise Linux Server release 5.3
Kernel: 2.6.18-128.el5
File System: IFS Mounted
The Intel TrueScale 12000 family of Multi-Protocol Fab-
ric Directors is the most highly integrated cluster com-
puting interconnect solution available. An ideal solution for HPC, database clustering, and grid utility computing
applications, the 12000 Fabric Directors maximize cluster
and grid computing interconnect performance while sim-
plifying and reducing the cost of operating a data center.
Subnet Manager
The subnet manager is an application responsible for config-
uring the local subnet and ensuring its continued operation.
Configuration responsibilities include managing switch setup
and reconfiguring the subnet if a link goes down or a new one
is added.
How High Performance Computing Helps Vertical Applications
Enterprises that want to do high performance computing must
balance the following scalability metrics as the core size and
the number of cores per node increase:
� Latency and Message Rate Scalability: must allow near line-
ar growth in productivity as the number of compute cores is
scaled.
� Power and Cooling Efficiency: as the cluster is scaled, power
and cooling requirements must not become major concerns
in today’s world of energy shortages and high-cost energy.
InfiniBand offers the promise of low latency, high bandwidth,
and unmatched scalability demanded by high performance
computing applications. IB adapters and switches that per-
form well on these key metrics allow enterprises to meet their
high-performance and MPI needs with optimal efficiency. The
IB solution allows enterprises to quickly achieve their compute
and business goals.
tech BOX: InfiniBand – The Architecture
transtec HPC solutions excel through their easy management
and high usability, while maintaining high performance and
quality throughout the whole lifetime of the system. As clusters
scale, issues like congestion mitigation and Quality-of-Service
can make a big difference in whether the fabric performs up to
its full potential.
With the intelligent choice of Intel InfiniBand products, trans-
tec remains true to combining the best components together
to provide the full best-of-breed solution stack to the customer.
transtec HPC engineering experts are always available to fine-
tune customers’ HPC cluster systems and InfiniBand fabrics to
get the maximum performance while at the same time provide
them with an easy-to-manage and easy-to-use HPC solution.
InfiniBand
Intel Fabric Suite 7
Maximizing Investments in High Performance Computing
Around the world and across all industries, high performance
computing (HPC) is used to solve today’s most demanding com-
puting problems. As today’s high performance computing chal-
lenges grow in complexity and importance, it is vital that the
software tools used to install, configure, optimize, and manage
HPC fabrics also grow more powerful. Today’s HPC workloads
are too large and complex to be managed using software tools
that are not focused on the unique needs of HPC. Designed spe-
cifically for HPC, Intel Fabric Suite 7 is a complete fabric manage-
ment solution that maximizes the return on HPC investments,
by allowing users to achieve the highest levels of performance,
efficiency, and ease of management from InfiniBand-connected
HPC clusters of any size.
Highlights
� Intel FastFabric and Fabric Viewer integration with leading
third-party HPC cluster management suites
� Simple but powerful Fabric Viewer dashboard for monitor-
ing fabric performance
� Intel Fabric Manager integration with leading HPC workload-
management suites that combine virtual fabrics and
compute
� Quality of Service (QoS) levels that maximize fabric efficien-
cy and application performance
� Smart, powerful software tools that make Intel TrueScale
Fabric solutions easy to install, configure, verify, optimize,
and manage
� Congestion control architecture that reduces the effects of
fabric congestion caused by low credit availability that can
result in head-of-line blocking
� Powerful fabric routing methods – including adaptive and
dispersive routing – that optimize traffic flows to avoid or
eliminate congestion, maximizing fabric throughput
� Intel Fabric Manager’s advanced design ensures fabrics of all
sizes and topologies – from fat-tree to mesh and torus – scale
to support the most demanding HPC workloads
Superior Fabric Performance and Simplified Management are
Vital for HPC
As HPC clusters scale to take advantage of multi-core and
GPU-accelerated nodes attached to ever-larger and more com-
plex fabrics, simple but powerful management tools are vital for
maximizing return on HPC investments.
Intel Fabric Suite 7 provides the performance and management
tools for today’s demanding HPC cluster environments. As clus-
ters grow larger, management functions, from installation and
configuration to fabric verification and optimization, are vital
in ensuring that the interconnect fabric can support growing
workloads. Besides fabric deployment and monitoring, IFS optimizes
the performance of message passing applications – with features
from advanced routing algorithms to quality of service (QoS) – to
ensure all HPC resources are optimally utilized.
Scalable Fabric Performance
� Purpose-built for HPC, IFS is designed to make HPC clusters
faster, easier, and simpler to deploy, manage, and optimize
� The Intel TrueScale Fabric on-load host architecture exploits
the processing power of today’s faster multi-core proces-
sors for superior application scaling and performance
� Policy-driven vFabric virtual fabrics optimize HPC resource
utilization by prioritizing and isolating compute and storage
traffic flows
� Advanced fabric routing options – including adaptive and
dispersive routing – distribute traffic across all potential
links, improving overall fabric performance and lowering
congestion to improve throughput and latency
Scalable Fabric Intelligence
� Routing intelligence scales linearly as Intel TrueScale Fabric
switches are added to the fabric
� Intel Fabric Manager can initialize fabrics having several
thousand nodes within seconds
� Advanced and optimized routing algorithms overcome the
limitations of typical subnet managers
� Smart management tools quickly detect and respond to
fabric changes, including isolating and correcting problems
that can result in unstable fabrics
Standards-based Foundation
Intel TrueScale Fabric solutions are compliant with all industry
software and hardware standards. Intel Fabric Suite (IFS) re-
defines IBTA management by coupling powerful management
tools with intuitive user interfaces. InfiniBand fabrics built using
IFS deliver the highest levels of fabric performance, efficiency,
and management simplicity, allowing users to realize the full
benefits from their HPC investments.
Major components of Intel Fabric Suite 7 include:
� FastFabric Toolset
� Fabric Viewer
� Fabric Manager
Intel FastFabric Toolset
Ensures rapid, error-free installation and configuration of Intel
True Scale Fabric switch, host, and management software tools.
Guided by an intuitive interface, users can easily install, config-
ure, validate, and optimize HPC fabrics.
Key features include:
� Automated host software installation and configuration
� Powerful fabric deployment, verification, analysis and repor-
ting tools for measuring connectivity, latency and bandwidth
� Automated chassis and switch firmware installation and
update
� Fabric route and error analysis tools
� Benchmarking and tuning tools
� Easy-to-use fabric health checking tools
Intel Fabric Viewer
A key component that provides an intuitive, Java-based user
interface with a “topology-down” view for fabric status and di-
agnostics, with the ability to drill down to the device layer to
identify and help correct errors. IFS 7 includes fabric dashboard,
a simple and intuitive user interface that presents vital fabric
performance statistics.
Key features include:
� Bandwidth and performance monitoring
� Device and device group level displays
� Definition of cluster-specific node groups so that displays
can be oriented toward end-user node types
� Support for user-defined hierarchy
� Easy-to-use virtual fabrics interface
Intel Fabric Manager
Provides comprehensive control of administrative functions us-
ing a commercial-grade subnet manager. With advanced routing
algorithms, powerful diagnostic tools, and full subnet manager
failover, Fabric Manager simplifies subnet, fabric, and individual
component management, making even the largest fabrics easy
to deploy and optimize.
Key features include:
� Designed and optimized for large fabric support
� Integrated with both adaptive and dispersive routing systems
� Congestion control architecture (CCA)
� Robust failover of subnet management
� Path/route management
� Fabric/chassis management
� Fabric initialization in seconds, even for very large fabrics
� Performance and fabric error monitoring
Adaptive Routing
While most HPC fabrics are configured to have multiple paths
between switches, standard InfiniBand switches may not be ca-
pable of taking advantage of them to reduce congestion. Adap-
tive routing monitors the performance of each possible path,
and automatically chooses the least congested route to the des-
tination node.
Unlike other implementations that rely on a purely subnet
manager-based approach, the intelligent path selection capabilities
within Fabric Manager, a key part of IFS and Intel 12000 series
switches, scale as your fabric grows larger and more complex.
Key adaptive routing features include:
� Highly scalable–adaptive routing intelligence scales as
the fabric grows
� Hundreds of real-time adjustments per second per switch
� Topology awareness through Intel Fabric Manager
� Supports all InfiniBand Trade Association* (IBTA*)-compliant
adapters
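At its core, the behavior described above amounts to sampling per-path congestion and steering traffic onto the least-loaded route. A minimal sketch in Python (path names and load values are made up; this is not Intel's implementation):

```python
# Toy model of adaptive routing: each destination is reachable over
# several inter-switch paths; pick the currently least congested one.
congestion = {"path_a": 0.72, "path_b": 0.15, "path_c": 0.40}

def choose_path(paths, load):
    # Real switches re-evaluate this hundreds of times per second.
    return min(paths, key=lambda p: load[p])

best = choose_path(["path_a", "path_b", "path_c"], congestion)
print(best)  # path_b: currently the least congested route

congestion["path_b"] = 0.95  # path_b saturates; the next choice adapts
print(choose_path(["path_a", "path_b", "path_c"], congestion))  # path_c
```

The hard part in practice is not the selection rule but keeping the load estimates fresh and topology-aware as the fabric scales, which is what Fabric Manager contributes.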
Dispersive Routing
One of the critical roles of the fabric management is the initial-
ization and configuration of routes through the fabric between
a single pair of nodes. Intel Fabric Manager supports a variety
of routing methods, including defining alternate routes that
disperse traffic flows for redundancy, performance, and load
balancing. Instead of sending all packets to a destination on a
single path, Intel dispersive routing distributes traffic across all
possible paths. Once received, packets are reassembled in their
proper order for rapid, efficient processing. By leveraging the
entire fabric to deliver maximum communications performance
for all jobs, dispersive routing ensures optimal fabric efficiency.
Key dispersive routing features include:
� Fabric routing algorithms that provide the widest separa-
tion of paths possible
� Fabric “hotspot” reductions to avoid fabric congestion
� Balances traffic across all potential routes
� May be combined with vFabrics and adaptive routing
� Very low latency for small messages and time sensitive con-
trol protocols
� Messages can be spread across multiple paths–Intel Perfor-
mance Scaled Messaging (PSM) ensures messages are reas-
sembled correctly
� Supports all leading MPI libraries
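The spread-and-reassemble behavior described above can be sketched as follows. This is a toy model of the idea only; PSM's actual packet formats and wire protocol are not shown, and the names are illustrative:

```python
# Toy model of dispersive routing: a message is split into packets,
# sprayed round-robin across all available paths, and reassembled
# at the receiver using per-packet sequence numbers.
def disperse(message, paths, chunk=4):
    packets = [(seq, message[i:i + chunk])
               for seq, i in enumerate(range(0, len(message), chunk))]
    # Round-robin assignment of packets to paths.
    return {seq: (paths[seq % len(paths)], data) for seq, data in packets}

def reassemble(received):
    # Packets may arrive out of order; sequence numbers restore order.
    return "".join(data for _, (path, data) in sorted(received.items()))

sent = disperse("ALLDATAACROSSTHEFABRIC", ["p0", "p1", "p2"])
assert reassemble(sent) == "ALLDATAACROSSTHEFABRIC"
```

Spreading one flow over every path trades strict in-order delivery for full fabric utilization; the sequence numbers are what make that trade safe.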
Mesh and Torus Topology Support
Fat-tree configurations are the most common topology used in
HPC cluster environments today, but other topologies are gain-
ing broader use as organizations try to create increasingly larg-
er, more cost-effective fabrics. Intel is leading the way with full
support of emerging mesh and torus topologies that can help re-
duce networking costs as clusters scale to thousands of nodes.
IFS has been enhanced to support these emerging options–from
failure handling that maximizes performance, to even higher re-
liability for complex networking environments.
Numascale
Numascale’s NumaConnect technology enables computer system vendors to build
scalable servers with the functionality of enterprise mainframes at the cost level of
clusters. The technology unites all the processors, memory and IO resources in the sys-
tem in a fully virtualized environment controlled by standard operating systems.
NumaConnect Background
Systems based on NumaConnect will efficiently support all
classes of applications using shared memory or message pass-
ing through all popular high level programming models. System
size can be scaled to 4k nodes where each node can contain mul-
tiple processors. Memory size is limited by the 48-bit physical
address range provided by the Opteron processors resulting in a
total system main memory of 256 TBytes.
At the heart of NumaConnect is NumaChip; a single chip that
combines the cache coherent shared memory control logic with
an on-chip 7 way switch. This eliminates the need for a separate,
central switch and enables linear capacity and cost scaling.
The continuing trend with multi-core processor chips is enabling
more applications to take advantage of parallel processing. Nu-
maChip leverages the multi-core trend by enabling applications
to scale seamlessly without the extra programming effort re-
quired for cluster computing. All tasks can access all memory
and IO resources. This is of great value to users and the ultimate
way to virtualization of all system resources.
All high speed interconnects now use the same kind of physical
interfaces resulting in almost the same peak bandwidth. The
differentiation is in latency for the critical short transfers, func-
tionality and software compatibility. NumaConnect differenti-
ates from all other interconnects through the ability to provide
unified access to all resources in a system and utilize caching
techniques to obtain very low latency.
Key Facts:
� Scalable, directory based Cache Coherent Shared Memory
interconnect for Opteron
� Attaches to coherent HyperTransport (cHT) through HTX con-
nector, pick-up module or mounted directly on main board
� Configurable Remote Cache for each node
� Full 48-bit physical address space (256 TBytes)
� Up to 4k (4096) nodes
� ≈1 microsecond MPI latency (ping-pong/2)
� On-chip, distributed switch fabric for 2 or 3 dimensional torus
topologies
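The address-space figures above check out arithmetically. A back-of-the-envelope confirmation (not Numascale code; the constant names are ours):

```python
# Verify the stated limits: a 48-bit physical address space, with
# 12 bits of it used as the node identifier (per the NumaConnect text).
PHYS_ADDR_BITS = 48
NODE_ID_BITS = 12

total_bytes = 2 ** PHYS_ADDR_BITS
max_nodes = 2 ** NODE_ID_BITS

assert total_bytes == 256 * 2 ** 40   # 256 TBytes of addressable memory
assert max_nodes == 4096              # "up to 4k nodes"
print(total_bytes // 2 ** 40, "TB across", max_nodes, "nodes")
```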
Expanding the capabilities of multi-core processors
Semiconductor technology has reached a level where proces-
sor frequency can no longer be increased much due to power
consumption with corresponding heat dissipation and thermal
handling problems. Historically, processor frequency scaled at
approximately the same rate as transistor density and resulted
in performance improvements for most all applications with
no extra programming efforts. Processor chips are now instead
being equipped with multiple processors on a single die. Utiliz-
ing the added capacity requires software that is prepared for
parallel processing. This is quite obviously simple for individual
and separated tasks that can be run independently, but is much
more complex for speeding up single tasks.
The complexity for speeding up a single task grows with the log-
ic distance between the resources needed to do the task, i.e. the
fewer resources that can be shared, the harder it is. Multi-core
processors share the main memory and some of the cache
levels, i.e. they are classified as Symmetrical Multi Processors
(SMP). Modern processor chips are also equipped with signals
and logic that allow connecting to other processor chips still
maintaining the same logic sharing of memory. The practical
limit is at two to four processor sockets before the overheads
reduce performance scaling instead of increasing it. This is nor-
mally restricted to a single motherboard.
Currently, scaling beyond the single/dual SMP motherboards is
done through some form of network connection using Ethernet
or a higher speed interconnect like InfiniBand. This requires pro-
cesses running on the different compute nodes to communicate
through explicit messages. With this model, programs that need
to be scaled beyond a small number of processors have to be
written in a more complex way where the data can no longer
be shared among all processes, but need to be explicitly decom-
posed and transferred between the different processors’ mem-
ories when required.
NumaConnect uses a scalable approach to sharing all memory
based on distributed directories to store information about
shared memory locations. This means that programs can be
scaled beyond the limit of a single motherboard without any
changes to the programming principle. Any process running on
any processor in the system can use any part of the memory re-
gardless if the physical location of the memory is on a different
motherboard.
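The contrast between the two models can be shown in miniature. The Python below is purely didactic (threads stand in for processors sharing memory; an explicit queue stands in for message passing); it is not NumaConnect or MPI code:

```python
# Shared memory: every worker reads/writes the same array in place.
# Message passing: data must be decomposed, sent, and merged back.
import threading, queue

data = list(range(8))

# --- Shared-memory style: threads update the common array directly.
def shared_worker(lo, hi):
    for i in range(lo, hi):
        data[i] *= 2                     # direct load/store into shared data

threads = [threading.Thread(target=shared_worker, args=(i, i + 4))
           for i in (0, 4)]
for t in threads: t.start()
for t in threads: t.join()
assert data == [0, 2, 4, 6, 8, 10, 12, 14]

# --- Message-passing style: explicit decomposition and transfer.
def mp_worker(chunk, outbox):
    outbox.put([x * 2 for x in chunk])   # result must be sent back

outbox = queue.Queue()
chunks = [[0, 1, 2, 3], [4, 5, 6, 7]]    # dataset partitioned by hand
for c in chunks:
    mp_worker(c, outbox)                 # stands in for send + compute
merged = outbox.get() + outbox.get()     # and explicit reassembly
```

The second half is the extra code the text refers to: partitioning, transfer, and reassembly all become the programmer's responsibility once memory is no longer shared.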
NumaConnect Value Proposition
NumaConnect enables significant cost savings in three dimen-
sions; resource utilization, system management and program-
mer productivity.
According to long time users of both large shared memory sys-
tems (SMPs) and clusters in environments with a variety of ap-
plications, the former provide a much higher degree of resource
utilization due to the flexibility of all system resources. They
indicate that large mainframe SMPs can easily be kept at more
than 90% utilization and that clusters seldom can reach more
than 60-70% in environments running a variety of jobs. Better
compute resource utilization also contributes to more efficient
use of the necessary infrastructure, with power consumption
and cooling as the most prominent factors (accounting for approxi-
mately one third of the overall cost) and floor-space as a sec-
ondary aspect.
Regarding system management, NumaChip can reduce the
number of individual operating system images significantly.
In a system with 100Tflops computing power, the number of
system images can be reduced from approximately 1,400 to
40, a reduction factor of 35. Even if each of those 40 OS images
requires somewhat more resources for management than the
1,400 smaller ones, the overall savings are significant.
Parallel processing in a cluster requires explicit message pass-
ing programming whereas shared memory systems can utilize
compilers and other tools that are developed for multi-core pro-
cessors. Parallel programming is a complex task and programs
written for message passing normally contain 50%-100% more
code than programs written for shared memory processing.
Since all programs contain errors, the probability of errors in
message passing programs is 50%-100% higher than for shared
memory programs. A significant amount of software develop-
ment time is consumed by debugging errors further increasing
the time to complete development of an application.
In principle, servers are multi-tasking, multi-user machines that
are fully capable of running multiple applications at any given
time. Small servers are very cost-efficient measured by a peak
price/performance ratio because they are manufactured in very
high volumes and use many of the same components as desk-
side and desktop computers. However, these small to medium
sized servers are not very scalable. The most widely used config-
uration has 2 CPU sockets that hold from 4 to 16 CPU cores each.
They cannot be upgraded without changing to a different main
board that also normally requires a larger power supply and a
different chassis. In turn, this means that careful capacity plan-
ning is required to optimize cost and if compute requirements
increase, it may be necessary to replace the entire server with a
bigger and much more expensive one since the price increase is
far from linear.
NumaChip contains all the logic needed to build Scale-Up sys-
tems based on volume manufactured server components. This
drives the cost per CPU core down to the same level while offer-
ing the same capabilities as the mainframe type servers. Where
IT budgets are in focus the price difference is obvious and Nu-
maChip represents a compelling proposition to get mainframe
capabilities at the cost level of high-end cluster technology. The
expensive mainframes still include some features for dynamic
system reconfiguration that NumaChip systems will not offer in-
itially. Such features depend on operating system software and
can also be implemented in NumaChip based systems.
Technology
Multi-core processors and shared memory
Shared memory programming for multi-processing boosts pro-
grammer productivity since it is easier to handle than the alter-
native message passing paradigms. Shared memory programs
are supported by compiler tools and require less code than the
alternatives resulting in shorter development time and fewer
program bugs. The availability of multi-core processors on all
major platforms including desktops and laptops is driving more
programs to take advantage of the increased performance
potential. NumaChip offers seamless scaling within the same
programming paradigm regardless of system size from a single
processor chip to systems with more than 1,000 processor chips.
Other interconnect technologies that do not offer cc-NUMA
capabilities require that applications are written for message
passing, resulting in larger programs with more bugs and cor-
respondingly longer development time while systems built with
NumaChip can run any program efficiently.
Virtualization
The strong trend of virtualization is driven by the desire of ob-
taining higher utilization of resources in the datacenter. In short,
it means that any application should be able to run on any serv-
er in the datacenter so that each server can be better utilized
by combining more applications on each server dynamically
according to user loads. Commodity server technology repre-
sents severe limitations in reaching this goal.
One major limitation is that the memory requirements of any
given application need to be satisfied by the physical server
that hosts the application at any given time. In turn, this means
that if any application in the datacenter shall be dynamically
executable on all of the servers at different times, all of the serv-
ers must be configured with the amount of memory required by
the most demanding application, but only the one running the
app will actually use that memory. This is where the mainframes
excel since these have a flexible shared memory architecture
where any processor can use any portion of the memory at any
given time, so they only need to be configured to be able to han-
dle the most demanding application in one instance.
NumaChip offers the exact same feature, by providing any appli-
cation with access to the aggregate amount of memory in the
system. In addition, it also offers all applications access to all I/O
devices in the system through the standard virtual view provid-
ed by the operating system.
The two distinctly different architectures of clusters and main-
frames are shown in Figure 1. In Clusters processes are loosely
coupled through a network like Ethernet or InfiniBand. An appli-
cation that needs to utilize more processors or I/O than those
present in each server must be programmed to do so from the
beginning. In the main-frame, any application can use any re-
source in the system as a virtualized resource and the compiler
can generate threads to be executed on any processor.
In a system interconnected with NumaChip, all processors can
access all the memory and all the I/O resources in the system
in the same way as on a mainframe. NumaChip provides a fully
virtualized hardware environment with shared memory and I/O
and with the same ability as mainframes to utilize compiler gen-
erated parallel processes and threads.
Figure 1: Clustered Architecture vs. Mainframe Architecture
Operating Systems
Systems based on NumaChip can run standard operating sys-
tems that handle shared memory multiprocessing. Numascale
provides a bootstrap loader that is invoked after power-up and
performs initialization of the system by setting up node address
routing tables. When the standard bootstrap loader is launched,
the system will appear as a large unified shared memory system.
Cache Coherent Shared Memory
The big differentiator for NumaConnect compared to other
high-speed interconnect technologies is the shared memory
and cache coherency mechanisms. These features allow programs
to access any memory location and any memory mapped I/O
device in a multiprocessor system with a high degree of efficiency.
It provides scalable systems with a unified programming model
that stays the same from the small multi-core machines used in
laptops and desktops to the largest imaginable single system
image machines that may contain thousands of processors.
There are a number of pros for shared memory machines that
lead experts to hold the architecture as the holy grail of computing
compared to clusters:
� Any processor can access any data location through direct
load and store operations – easier programming, less code to
write and debug
� Compilers can automatically exploit loop level parallelism –
higher efficiency with less human effort
� System administration relates to a unified system as opposed
to a large number of separate images in a cluster – less effort
to maintain
� Resources can be mapped and used by any processor in the
system – optimal use of resources in a virtualized environment
� Process scheduling is synchronized through a single, real-time
clock – avoids serialization of scheduling associated with
asynchronous operating systems in a cluster and the corresponding
loss of efficiency
Scalability and Robustness
The initial design aimed at scaling to very large numbers of
processors with a 64-bit physical address space: 16 bits for the
node identifier and 48 bits of address within each node. The
current implementation for Opteron is limited by the global
physical address space of 48 bits, with 12 bits used to address
4,096 physical nodes for a total physical address range of 256
Terabytes.
A directory based cache coherence protocol was developed to
handle scaling with a significant number of nodes sharing data,
avoiding overloading the interconnect between nodes with
coherency traffic, which would seriously reduce real data
throughput.
NumaChip System Architecture
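A directory-based protocol of the kind described can be sketched in miniature. States and naming below are illustrative, not NumaChip's actual protocol:

```python
# Miniature directory-based coherence: the directory tracks which
# nodes hold a copy of each cache line, so an invalidation goes only
# to the actual sharers instead of being broadcast fabric-wide.
class Directory:
    def __init__(self):
        self.sharers = {}           # cache line -> set of nodes with a copy

    def read(self, node, line):
        # A read adds the node to the sharer set for that line.
        self.sharers.setdefault(line, set()).add(node)

    def write(self, node, line):
        # A write invalidates every other sharer - and only those -
        # which keeps coherency traffic from swamping the interconnect.
        invalidated = self.sharers.get(line, set()) - {node}
        self.sharers[line] = {node}
        return invalidated

d = Directory()
d.read("node0", 0x100)
d.read("node1", 0x100)
d.read("node2", 0x200)
victims = d.write("node1", 0x100)
assert victims == {"node0"}          # node2 shares a different line: untouched
assert d.sharers[0x100] == {"node1"}
```

The point of the directory is visible in the last step: only `node0` is disturbed by the write, whereas a snooping broadcast would have interrupted every node in the system.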
The basic ring topology with distributed switching allows a
number of different interconnect configurations that are more
scalable than most other interconnect switch fabrics. This also
eliminates the need for a centralized switch and includes inher-
ent redundancy for multidimensional topologies.
Functionality is included to manage robustness issues associ-
ated with high node counts and extremely high requirements
for data integrity with the ability to provide high availability for
systems managing critical data in transaction processing and
realtime control. All data that may exist in only one copy is ECC
protected, with automatic scrubbing after detected single-bit
errors and automatic background scrubbing to avoid accumulation
of single-bit errors.
Integrated, distributed switching
NumaChip contains an on-chip switch to connect to other nodes
in a NumaChip based system, eliminating the need to use a
centralized switch. The on-chip switch can connect systems in
one, two or three dimensions. Small systems can use one dimension,
medium-sized systems two, and large systems all three, providing
efficient and scalable connectivity between processors.
The two- and three-dimensional topologies (called Torus) also
have the advantage of built-in redundancy as opposed to sys-
tems based on centralized switches, where the switch repre-
sents a single point of failure.
The distributed switching reduces the cost of the system since
there is no extra switch hardware to pay for. It also reduces the
amount of rack space required to hold the system as well as the
power consumption and heat dissipation from the switch hard-
ware and the associated power supply energy loss and cooling
requirements.
Block Diagram of NumaChip
Multiple links allow flexible system configurations in multi-dimensional topologies
Numascale
Redefining Scalable OpenMP and MPI
Shared Memory Advantages
Multi-processor shared memory processing has long been the
preferred method for creating and running technical computing
codes. Indeed, this computing model now extends from a user’s
dual core laptop to 16+ core servers. Programmers often add
parallel OpenMP directives to their programs in order to take
advantage of the extra cores on modern servers. This approach
is flexible and often preserves the “sequential” nature of the
program (pthreads can of course also be used, but OpenMP is
much easier to use). To extend programs beyond a single server,
however, users must use the Message Passing Interface (MPI) to
allow the program to operate across a high-speed interconnect.
Interestingly, the advent of multi-core servers has created a par-
allel asymmetric computing model, where programs must map
themselves to networks of shared memory SMP servers. This
asymmetric model introduces two levels of communication,
local within a node, and distant to other nodes. Programmers
often create pure MPI programs that run across multiple cores
on multiple nodes. While a pure MPI program does represent
the greatest common denominator, better performance may be
sacrificed by not utilizing the local nature of multi-core nodes.
Hybrid (combined MPI/OpenMP) models have been able to pull
more performance from cluster hardware, but often introduce
programming complexity and may limit portability. Clearly, users
prefer writing software for large shared memory systems to
MPI programming. This preference becomes more pronounced
when large data sets are used. In a large SMP system the data
are simply used in place, whereas in a distributed memory clus-
ter the dataset must be partitioned across compute nodes.
In summary, shared memory systems have a number of highly
desirable features that offer ease of use and cost reduction over
traditional distributed memory systems:
� Any processor can access any data location through direct load
and store operations, allowing easier programming (less time
and training) for end users, with less code to write and debug.
� Compilers, such as those supporting OpenMP, can automatically
exploit loop level parallelism and create more efficient codes,
increasing system throughput and better resource utilization.
� System administration of a unified system (as opposed to
a large number of separate images in a cluster) results in
reduced effort and cost for system maintenance.
� Resources can be mapped and used by any processor in
the system, with optimal use of resources in a single image
operating system environment.
Shared Memory as a Universal Platform
Although the advantages of shared memory systems are clear,
the actual implementation of such systems “at scale” has been
difficult prior to the emergence of NumaConnect technolo-
gy. There have traditionally been limits to the size and cost of
shared memory SMP systems, and as a result the HPC commu-
nity has moved to distributed memory clusters that now scale
into the thousands of cores. Distributed memory programming
occurs within the MPI library, where explicit communication
pathways are established between processors (i.e., data is es-
sentially copied from machine to machine). A large number of
existing applications use MPI as a programming model. Fortu-
nately, MPI codes can run effectively on shared memory sys-
tems. Optimizations have been built into many MPI versions
that recognize the availability of shared memory and avoid full
message protocols when communicating between processes.
Shared memory programming using OpenMP has been useful on
small-scale SMP systems such as commodity workstations and
servers. Providing large-scale shared memory environments for
these codes, however, opens up a whole new world of perfor-
mance capabilities without the need for re-programming. Using
NumaConnect technology, scalable shared memory clusters are
capable of efficiently running both large-scale OpenMP and MPI
codes without modification.
Record-Setting OpenMP Performance
In the HPC community NAS Parallel Benchmarks (NPB) have
been used to test the performance of parallel computers
(http://www.nas.nasa.gov/publications/npb.html). The benchmarks are
a small set of programs derived from Computational Fluid Dy-
namics (CFD) applications that were designed to help evaluate
the performance of parallel supercomputers. Problem sizes in
NPB are predefined and indicated as different classes (currently
A through F, with F being the largest).
Reference implementations of NPB are available in common-
ly-used programming models such as MPI and OpenMP, which
make them ideal for measuring the performance of both distrib-
uted memory and SMP systems. These benchmarks were com-
piled with Intel ifort version 14.0.0. (Note: the current-generated
code is slightly faster, but Numascale is working on NumaCon-
nect optimizations for the GNU compilers and thus suggests
using gcc and gfortran for OpenMP applications.) For the follow-
ing tests, the NumaConnect Shared Memory benchmark system
has 1TB of memory and 256 cores. It utilizes eight servers, each
equipped with two AMD Opteron 6380 CPUs (2.5 GHz), each with
16 cores and 128GB of memory. Figure One shows the results for
running NPB-SP (Scalar Penta-diagonal solver) over a range of 16
to 121 cores using OpenMP for the Class D problem size.
Figure Two shows results for the NPB-LU benchmark (Lower-Up-
per Gauss-Seidel solver) over a range of 16 to 121 cores, using
OpenMP for the Class D problem size.
Figure Three shows the NAS-SP benchmark E-class scaling per-
fectly from 64 processes (using affinity 0-255:4) to 121 process-
es (using affinity 0-241:2). Results indicate that larger problems
scale better on NumaConnect systems, and it was noted that
NASA has never seen OpenMP E Class results with such a high
number of cores.
OpenMP NAS Parallel results for NPB-SP (Class D)
OpenMP NAS Parallel results for NPB-LU (Class D)
Numascale
Redefining Scalable OpenMP and MPI
OpenMP applications cannot run on InfiniBand clusters without
additional software layers and kernel modifications. The Numa-
Connect cluster runs a standard Linux kernel image.
Surprisingly Good MPI Performance
Despite the excellent OpenMP shared memory performance
that NumaConnect can deliver, applications have historically
been written using MPI. The performance of these applications
is presented below. As mentioned, the NumaConnect system
can easily run MPI applications.
Figure Four compares NumaConnect with FDR InfiniBand on
NPB-SP (Class D). The results indicate that NumaConnect
performance is superior to that of a traditional InfiniBand
distributed-memory cluster. MPI tests were run with OpenMPI and
gfortran 4.8.1 using the same hardware mentioned above.
Both industry-standard OpenMPI and MPICH2 work in shared
memory mode. Numascale has implemented their own version
of the OpenMPI BTL (Byte Transfer Layer) to optimize the com-
munication by utilizing non-polluting store instructions. MPI
messages require data to be moved, and in a shared memory en-
vironment there is no reason to use standard instructions that
implicitly result in cache pollution and reduced performance.
This results in very efficient message passing and excellent MPI
performance.
Similar results are shown in Figure Five for the NAS-LU (Class D).
NumaConnect’s performance over InfiniBand may be one of the
more startling results for the NAS benchmarks. Recall again that
OpenMP applications cannot run on InfiniBand clusters without
additional software layers and kernel modifications.
OpenMP NAS results for NPB-SP (Class E)
NPB-SP comparison of NumaConnect to FDR InfiniBand
NPB-LU comparison of NumaConnect to FDR InfiniBand
Numascale
Big Data
Big Data applications – once limited to a few exotic disciplines
– are steadily becoming the dominant feature of modern com-
puting. In industry after industry, massive datasets are being
generated by advanced instruments and sensor technology.
Consider just one example, next generation DNA sequencing
(NGS). Annual NGS capacity now exceeds 13 quadrillion base
pairs (the As, Ts, Gs, and Cs that make up a DNA sequence). Each
base pair represents roughly 100 bytes of data (raw, analyzed,
and interpreted). Turning the swelling sea of genomic data into
useful biomedical information is a classic Big Data challenge,
one of many, that didn’t exist a decade ago.
This mainstreaming of Big Data is an important transformational
moment in computation. Datasets in the 10-to-20 Terabytes (TB)
range are increasingly common. New and advanced algorithms
for memory-intensive applications in Oil & Gas (e.g. seismic data
processing), finance (real-time trading), social media (databases),
and science (simulation and data analysis), to name but a few,
are hard or impossible to run efficiently on commodity clusters.
The challenge is that traditional cluster computing based on
distributed memory – which was so successful in bringing down
the cost of high performance computing (HPC) – struggles when
forced to run applications where memory requirements exceed
the capacity of a single node. Increased interconnect latencies,
longer and more complicated software development, inefficient
system utilization, and additional administrative overhead are
all adverse factors. Conversely, traditional mainframes running
shared memory architecture and a single instance of the OS have
always coped well with Big Data crunching jobs.
Numascale’s Solution
Until now, the biggest obstacle to wider use of shared memory
computing has been the high cost of mainframes and high-end
‘super-servers’. Given the ongoing proliferation of Big Data appli-
cations, a more efficient and cost-effective approach to shared
memory computing is needed. Now, Numascale, a four-year-old
spin-off from Dolphin Interconnect Solutions, has developed a
technology, NumaConnect, which turns a collection of standard
servers with separate memories and IO into a unified system
that delivers the functionality of high-end enterprise servers and
mainframes at a fraction of the cost.
Numascale Technology snapshot (board and chip):
NumaConnect links commodity servers together to form a single
unified system where all processors can coherently access and
share all memory and I/O. The combined system runs a single
instance of a standard operating system like Linux. At the heart
of NumaConnect is NumaChip – a single chip that combines the
cache coherent shared memory control logic with an on-chip
7-way switch. This eliminates the need for a separate, central
switch and enables linear capacity and cost scaling.
Systems based on NumaConnect support all classes of applica-
tions using shared memory or message passing through all pop-
ular high level programming models. System size can be scaled
to 4k nodes where each node can contain multiple processors.
Memory size is limited only by the 48-bit physical address range
provided by the Opteron processors resulting in a record-break-
ing total system main memory of 256 TBytes. (For details of Nu-
mascale technology see: The NumaConnect White Paper)
The result is an affordable, shared memory computing option to
tackle data-intensive applications. NumaConnect-based systems
running with entire data sets in memory are “orders of magni-
tude faster than clusters or systems based on any form of existing
mass storage devices and will enable data analysis and decision
support applications to be applied in new and innovative ways,”
says Kåre Løchsen, Numascale founder and CEO.
The big differentiator for NumaConnect compared to other high-
speed interconnect technologies is the shared memory and
cache coherency mechanisms. These features allow programs to
access any memory location and any memory mapped I/O device
in a multiprocessor system with a high degree of efficiency. They
provide scalable systems with a unified programming model that
stays the same from the small multi-core machines used in lap-
tops and desktops to the largest imaginable single system image
machines that may contain thousands of processors and tens to
hundreds of terabytes of main memory.
Early adopters are already demonstrating performance gains
and costs savings. A good example is Statoil, the global energy
company based in Norway. Processing seismic data requires
massive amounts of floating point operations and is normally
performed on clusters. Broadly speaking, this kind of processing
is done by programs developed for a message-passing paradigm
(MPI). Not all algorithms suit the message-passing paradigm,
however; the amount of code required is large, and the develop-
ment and debugging process is complex.
Numascale’s implementation of shared memory has many inno-
vations. Its on-chip switch, for example, can connect systems in
one, two, or three dimensions (e.g. a 2D or 3D torus). Small sys-
tems can use one dimension, medium-sized systems two, and large
systems all three, providing efficient and scalable
connectivity between processors. The distributed switching re-
duces the cost of the system since there is no extra switch hard-
ware to pay for. It also reduces the amount of rack space required
to hold the system as well as the power consumption and heat
dissipation from the switch hardware and the associated power
supply energy loss and cooling requirements.
Numascale
Cache Coherence
Multiprocessor systems with caches and a shared memory space
need to resolve the problem of keeping shared data coherent.
This means that the most recently written data to a memory lo-
cation from one processor needs to be visible to other proces-
sors in the system immediately after a synchronization point.
The synchronization point is normally implemented as a barrier
through a lock (semaphore) where the process that stores data
releases a lock after finishing the store sequence.
When the lock is released, the most recent version of the data
that was stored by the releasing process can be read by another
process. If there were no individual caches or buffers in the data
paths between processors and the shared memory system, this
would be a relatively trivial task. The presence of caches chang-
es this picture dramatically. Even if the caches were so-called
write-through, in which case all store operations proceed all
the way to main memory, it would still mean that the proces-
sor wanting to read shared data could have copies of previous
versions of the data in the cache and such a read operation
would return a stale copy seen from the producer process un-
less certain steps are taken to avoid the problem.
Early cache systems would simply require a cache clear opera-
tion to be issued before reading the shared memory buffer in
which case the reading process would encounter cache misses
and automatically fetch the most recent copy of the data from
main memory. This could be done without too much loss of per-
formance when the speed difference between memory, cache
and processors was relatively small.
As these differences grew and write-back caches were intro-
duced, such coarse-grained coherency mechanisms were no
longer adequate. A write-back cache will only store data back
to main memory upon cache line replacements, i.e. when the
cache line will be replaced due to a cache miss of some sort. To
accommodate this functionality, a new cache line state, “dirty”,
was introduced in addition to the traditional “hit/miss” states.
Numascale
NUMA and ccNUMA
The term Non-Uniform Memory Access (NUMA) means that memory
accesses from a given processor in a multiprocessor system will
have different access time depending on the physical location
of the memory. The “cc” stands for Cache Coherence (i.e., all
memory references from all processors automatically return the
latest updated data from any cache in the system).
Modern multiprocessor systems where processors are con-
nected through a hierarchy of interconnects (or buses) will per-
form differently depending on the relative physical location of
processors and the memory being accessed. Most small multi-
processor systems used to be symmetric (SMP – Symmetric Mul-
ti-Processor) having a couple of processors hooked up to a main
memory system through a central bus where all processors had
equal priority and the same distance from the memory.
This was changed through AMD’s introduction of HyperTransport
with on-chip DRAM controllers. When more than one such
processor chip (normally also with multiple CPU cores per chip)
are connected, the criteria for being categorized as NUMA are
fulfilled. This is illustrated by the fact that reading or writing
data from or to the local memory is indeed quite a bit faster
than performing the same operations on the memory that is
controlled by the other processor chip.
Seen from a programming point of view, the uniform, global
shared address space is a great advantage since any operand or
variable can be directly referenced from the program through a
single load or store operation. This is both simple and incurs no
overhead. In addition, the large physical memory capacity elimi-
nates the need for data decomposition and the need for explicit
message passing between processes. It is also of great impor-
tance that the programming model can be maintained through-
out the whole range of system sizes from small desktop systems
with a couple of processor cores to huge supercomputers with
thousands of processors.
The Non-Uniformity seen from the program is only coupled with
the access time difference between different parts of the physi-
cal memory. Allocation of physical memory is handled by the op-
erating system (OS) and modern OS kernels include features for
optimizing allocation of memory to match with the scheduling
of processors to execute the program. If a programmer wishes
to influence memory allocation and process scheduling in any
way, OS calls to pin-down memory and reserve processors can
be inserted in the code.
This is of course less desirable than to let the OS handle it auto-
matically because it increases the efforts required to port the
application between different operating systems. In normal
cases with efficient caching of remote accesses, the automatic
handling by the OS will be sufficient to ensure high utilization of
processors even if the memory is physically scattered between
the processing nodes.
IBM Platform HPC – Enterprise-Ready Cluster & Workload Management
High performance computing (HPC) is becoming a necessary tool for organizations to
speed up product design, scientific research, and business analytics. However, there
are few software environments more complex to manage and utilize than modern high
performance computing clusters.
Therefore, addressing the problem of complexity in cluster management is a key aspect
of leveraging HPC to improve time to results and user productivity.
Introduction
Clusters based on the Linux operating system have become in-
creasingly prevalent at large supercomputing centers and con-
tinue to make significant in-roads in commercial and academic
settings. This is primarily due to their superior price/performance
and flexibility, as well as the availability of commercial applica-
tions that are based on the Linux OS.
Ironically, the same factors that make Linux a clear choice for
high performance computing often make the operating system
less accessible to smaller computing centers. These organiza-
tions may have Microsoft Windows administrators on staff, but
have little or no Linux or cluster management experience. The
complexity and cost of cluster management often outweigh the
benefits that make open, commodity clusters so compelling. Not
only can HPC cluster deployments be difficult, but the ongoing
need to deal with heterogeneous hardware and operating sys-
tems, mixed workloads, and rapidly evolving toolsets make de-
ploying and managing an HPC cluster a daunting task.
These issues create a barrier to entry for scientists and research-
ers who require the performance of an HPC cluster, but are limit-
ed to the performance of a workstation. This is why ease of use is
now mandatory for HPC cluster management. This paper reviews
the most complete and easy-to-use cluster management solu-
tion, Platform HPC, which is now commercially available from
Platform Computing.
The cluster management challenge
To provide a proper HPC application environment, system admin-
istrators need to provide a full set of capabilities to their users,
as shown below. These capabilities include cluster provisioning
and node management, application workload management, and
an environment that makes it easy to develop, run and manage
distributed parallel applications.
Modern application environments tend to be heterogeneous;
some workloads require Windows compute hosts while others
require particular Linux operating systems or versions.
The ability to change a node’s operating system on-the-fly in re-
sponse to changing application needs - referred to as adaptive
scheduling - is important since it allows system administrators to
maximize resource use, and present what appears to be a larger
resource pool to cluster users.
Learning how to use a command line interface to power-up, pro-
vision and manage a cluster is extremely time-consuming. Ad-
ministrators therefore need remote, web-based access to their
HPC environment that makes it easier for them to install and
manage an HPC cluster. An easy-to-use application-centric web
interface can have tangible benefits including improved produc-
tivity, reduced training requirements, reduced error rates, and
secure remote access.
While there are several cluster management tools that address
parts of these requirements, few address them fully, and some
tools are little more than collections of discrete open-source
software components.
Some cluster toolkits focus largely on the problem of cluster pro-
visioning and management. While they clearly simplify cluster
deployment, administrators wanting to make changes to node
configurations or customize their environment will quickly find
themselves hand-editing XML configuration files or writing their
own shell scripts. Third-party workload managers and various
open-source MPI libraries might be included as part of a distribu-
tion. However, these included components are loosely integrat-
ed and often need to be managed separately from the cluster
manager. As a result the cluster administrator needs to learn how
to utilize each additional piece of software in order to manage
the cluster effectively.
Other HPC solutions are designed purely for application work-
load management. While these are all capable workload manag-
ers, most do not address cluster management, application
integration, or adaptive scheduling at all. If such capabil-
ities exist they usually require the purchase of additional soft-
ware products.
Parallel job management is also critical. One of the primary rea-
sons that customers deploy HPC clusters is to maximize applica-
tion performance. Processing problems in parallel is a common
way to achieve performance gains. The choice of MPI, its scala-
bility, and the degree to which it is integrated with various OFED
drivers and high performance interconnects has a direct impact
on delivered application performance. Furthermore, if the work-
load manager does not incorporate specific parallel job manage-
ment features, busy cluster users and administrators can find
themselves manually cleaning up after failed MPI jobs or writing
their own shell scripts to do the same.
Complexity is a real problem. Many small organizations or depart-
ments grapple with a new vocabulary full of cryptic commands,
configuring and troubleshooting Anaconda kickstart scripts,
finding the correct OFED drivers for specialized hardware, and
configuring open source monitoring systems like Ganglia or Na-
gios. Without an integrated solution administrators may need
to deal with dozens of distinct software components, making
managing HPC cluster implementations extremely tedious and
time-consuming.
Essential components of an HPC cluster solution
Re-thinking HPC clusters
Clearly these challenges demand a fresh approach to HPC cluster
management. Platform HPC represents a “re-think” of how HPC
clusters are deployed and managed. Rather than addressing only
part of the HPC management puzzle, Platform HPC addresses all
facets of cluster management. It provides:
- A complete, easy-to-use cluster management solution
- Integrated application support
- User-friendly, topology-aware workload management
- Robust workload and system monitoring and reporting
- Dynamic operating system multi-boot (adaptive scheduling)
- GPU scheduling
- Robust commercial MPI library (Platform MPI)
- Web-based interface for access anywhere
Most complete HPC cluster management solution
Platform HPC makes it easy to deploy, run and manage HPC clusters
while meeting the most demanding requirements for application
performance and predictable workload management. It is a com-
plete solution that provides a robust set of cluster management
capabilities, from cluster provisioning and management to work-
load management and monitoring. The easy-to-use unified web
portal provides a single point of access into the cluster, making it
easy to manage your jobs and optimize application performance.
Platform HPC is more than just a stack of software; it is a fully
integrated and certified solution designed to ensure ease of use
and simplified troubleshooting.
Integrated application support
High performing, HPC-optimized MPI libraries come integrated
with Platform HPC, making it easy to get parallel applications up
and running. Scripting guidelines and job submission templates
for commonly used commercial applications simplify job submis-
sion, reduce setup time and minimize operation errors. Once the ap-
plications are up and running, Platform HPC improves application
performance by intelligently scheduling resources based on
workload characteristics.

“Platform HPC and Platform LSF have been known for many years as
highest-quality, enterprise-ready cluster and workload management
solutions, and we have many customers in academia and industry
relying on them.”
Robin Kienecker, HPC Sales Specialist
Fully certified and supported
Platform HPC unlocks cluster management to provide the easiest
and most complete HPC management capabilities while reducing
overall cluster cost and improving administrator productivity. It is
based on the industry’s most mature and robust workload manager,
Platform LSF, making it the most reliable solution on the market.
Other solutions are typically a collection of open-source tools, which
may also include pieces of commercially developed software. They
lack key HPC functionality and vendor support, relying on the ad-
ministrator’s technical ability and time to implement. Platform HPC
is a single product with a single installer and a unified web-based
management interface. With the best support in the HPC industry,
Platform HPC provides the most complete solution for HPC cluster
management.
Complete solution
Platform HPC provides a complete set of HPC cluster manage-
ment features. In this section we’ll explore some of these unique
capabilities in more detail.
Easy to use, cluster provisioning and management
With Platform HPC, administrators can quickly provision and
manage HPC clusters with unprecedented ease. It ensures max-
imum uptime and can transparently synchronize files to cluster
nodes without any downtime or re-installation.
Fast and efficient software installation – Platform HPC can be
installed on the head node and takes less than one hour using
three different mechanisms:
- Platform HPC DVD
- Platform HPC ISO file
- Platform partner’s factory-install bootable USB drive
Installing software on cluster nodes is simply a matter of associ-
ating cluster nodes with flexible provisioning templates through
the web-based interface.
Flexible provisioning – Platform HPC offers multiple options for
provisioning Linux operating environments that include:
- Package-based provisioning
- Image-based provisioning
- Diskless node provisioning
Large collections of hosts can be provisioned using the same pro-
visioning template. Platform HPC automatically manages details
such as IP address assignment and node naming conventions
that reflect the position of cluster nodes in data center racks.
Unlike competing solutions, Platform HPC deploys multiple op-
erating systems and OS versions to a cluster simultaneously. This
includes Red Hat Enterprise Linux, CentOS, Scientific Linux, and
SUSE Linux Enterprise Server. This provides administrators with
greater flexibility in how they serve their user communities and
means that HPC clusters can grow and evolve incrementally as
requirements change.
IBM Platform HPC
What’s New in IBM Platform LSF 8
Written with Platform LSF administrators in mind, this brief pro-
vides a short explanation of significant changes in Platform’s lat-
est release of Platform LSF, with a specific emphasis on schedul-
ing and workload management features.
About IBM Platform LSF 8
Platform LSF is the most powerful workload manager for de-
manding, distributed high performance computing environ-
ments. It provides a complete set of workload management ca-
pabilities, all designed to work together to reduce cycle times
and maximize productivity in mission-critical environments.
This latest Platform LSF release delivers improvements in perfor-
mance and scalability while introducing new features that sim-
plify administration and boost user productivity. This includes:
- Guaranteed resources – Aligns business SLAs with infra-
structure configuration for simplified administration and
configuration
- Live reconfiguration – Provides simplified administration
and enables agility
- Delegation of administrative rights – Empowers line-of-busi-
ness owners to take control of their own projects
- Fairshare & pre-emptive scheduling enhancements – Fine-
tunes key production policies
Platform LSF 8 Features
Guaranteed Resources Ensure Deadlines are Met
In Platform LSF 8, resource-based scheduling has been extended
to guarantee resource availability to groups of jobs. Resources
can be slots, entire hosts or user-defined shared resources such
as software licenses.
As an example, a business unit might guarantee that it has access
to specific types of resources within ten minutes of a job being
submitted, even while sharing resources between departments.
This facility ensures that lower priority jobs using the needed re-
sources can be pre-empted in order to meet the SLAs of higher
priority jobs. Because jobs can be automatically attached to an
SLA class via access controls, administrators can enable these
guarantees without requiring that end-users change their job
submission procedures, making it easy to implement this capa-
bility in existing environments.
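As a sketch of how such a guarantee is expressed (section and keyword names per the LSF configuration reference; the pool name, SLA name, and slot count are illustrative):

```
# lsb.resources -- guarantee 16 slots to the 'design_sla' service class
Begin GuaranteedResourcePool
NAME          = design_pool
TYPE          = slots
DISTRIBUTION  = ([design_sla, 16])
LOAN_POLICIES = QUEUES[all]   # idle guaranteed slots may be loaned out
End GuaranteedResourcePool
```

A loan policy like the one shown lets other workload borrow idle guaranteed slots, which LSF reclaims (by pre-emption if necessary) when the SLA needs them.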
Live Cluster Reconfiguration
Platform LSF 8 incorporates a new live reconfiguration capabil-
ity, allowing changes to be made to clusters without the need
to re-start LSF daemons. This is useful to customers who need
to add hosts, adjust sharing policies or re-assign users between
groups “on the fly”, without impacting cluster availability or run-
ning jobs.
Changes to the cluster configuration can be made via the bconf
command line utility, or via new API calls. This functionality can
also be integrated via a web-based interface using Platform Ap-
plication Center. All configuration modifications are logged for
a complete audit history, and changes are propagated almost
instantaneously. The majority of reconfiguration operations are
completed in under half a second.
With Live Reconfiguration, down-time is reduced, and adminis-
trators are free to make needed adjustments quickly rather than
wait for scheduled maintenance periods or non-peak hours. In
cases where users are members of multiple groups, controls can
be put in place so that a group administrator can only control
jobs associated with their designated group rather than impact-
ing jobs related to another group submitted by the same user.
Delegation of Administrative Rights
With Platform LSF 8, the concept of group administrators has
been extended to enable project managers and line of business
managers to dynamically modify group membership and fair-
share resource allocation policies within their group. The ability
to make these changes dynamically to a running cluster is made
possible by the Live Reconfiguration feature.
These capabilities can be delegated selectively depending on the
group and site policy. Different group administrators can man-
age jobs, control sharing policies or adjust group membership.
More Flexible Fairshare Scheduling Policies
To enable better resource sharing flexibility with Platform LSF 8,
the algorithms used to tune dynamically calculated user priori-
ties can be adjusted at the queue level. These algorithms can vary
based on department, application or project team preferences.
The fairshare parameters ENABLE_HIST_RUN_TIME and HIST_HOURS
enable administrators to control the degree to which LSF
considers prior resource usage when determining user priority.
The flexibility of Platform LSF 8 has also been improved by al-
lowing a similar “decay rate” to apply to currently running jobs
(RUN_TIME_DECAY), either system-wide or at the queue level.
This is most useful for customers with long-running jobs, where
setting this parameter results in a more accurate view of real re-
source use for the fairshare scheduling to consider.
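In configuration terms this might look as follows (a sketch: HIST_HOURS is a cluster-wide lsb.params setting, while the other parameters shown can be applied per queue; all values are illustrative):

```
# lsb.params -- cluster-wide decay window for historical usage
HIST_HOURS = 10

# lsb.queues -- per-queue fairshare tuning
Begin Queue
QUEUE_NAME           = normal
FAIRSHARE            = USER_SHARES[[default, 1]]
ENABLE_HIST_RUN_TIME = Y   # include finished jobs' run time in priority
RUN_TIME_DECAY       = Y   # decay the run time of running jobs too
End Queue
```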
Performance & Scalability Enhancements
Platform LSF has been extended to support an unparalleled
scale of up to 100,000 cores and 1.5 million queued jobs for
very high throughput EDA workloads. Even higher scalability is
possible for more traditional HPC workloads. Specific areas of im-
provement include the time required to start the master-batch
daemon (MBD), bjobs query performance, job submission and
job dispatching as well as impressive performance gains result-
ing from the new Bulk Job Submission feature. In addition, on
very large clusters with large numbers of user groups employing
fairshare scheduling, the memory footprint of the master batch
scheduler in LSF has been reduced by approximately 70% and
scheduler cycle time has been reduced by 25%, resulting in better
performance and scalability.
More Sophisticated Host-based Resource Usage for Parallel Jobs
Platform LSF 8 provides several improvements to how resource
use is tracked and reported with parallel jobs. Accurate tracking
of how parallel jobs use resources such as CPUs, memory and
swap, is important for ease of management, optimal scheduling
and accurate reporting and workload analysis. With Platform LSF
8 administrators can track resource usage on a per-host basis and
an aggregated basis (across all hosts), ensuring that resource use
is reported accurately. Additional details such as the running
PIDs and PGIDs of distributed parallel jobs are reported, which
simplifies manual cleanup (if necessary) and the development of
scripts for managing parallel jobs. These improvements in resource usage reporting
are reflected in LSF commands including bjobs, bhist and bacct.
Improved Ease of Administration for Mixed Windows and
Linux Clusters
The lspasswd command in Platform LSF enables Windows LSF
users to advise LSF of changes to their Windows level passwords.
With Platform LSF 8, password synchronization between envi-
ronments has become much easier to manage because the Win-
dows passwords can now be adjusted directly from Linux hosts
using the lspasswd command. This allows Linux users to conven-
iently synchronize passwords on Windows hosts without need-
ing to explicitly log in to the host.
Bulk Job Submission
When submitting large numbers of jobs with different resource
requirements or job level settings, Bulk Job Submission allows
for jobs to be submitted in bulk by referencing a single file con-
taining job details.
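As an illustrative sketch (file and job names hypothetical; option details per the LSF reference), each line of the submission file holds one job's bsub options, and the whole batch goes in with a single `bsub -pack` call:

```
# jobs.txt -- one job request per line
-q normal -n 8  -o sp.%J   ./sp.D.x
-q normal -n 16 -o lu.%J   ./lu.D.x
-q short  -n 1  -o post.%J ./postprocess.sh
```

Submitting `bsub -pack jobs.txt` then dispatches all the requests in one bulk operation, avoiding the per-job submission overhead of repeated bsub invocations.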
Simplified configuration changes – Platform HPC simplifies ad-
ministration and increases cluster availability by allowing chang-
es such as new package installations, patch updates, and changes
to configuration files to be propagated to cluster nodes automat-
ically without the need to re-install those nodes. It also provides a
mechanism whereby experienced administrators can quickly per-
form operations in parallel across multiple cluster nodes.
Repository snapshots / trial installations – Upgrading software
can be risky, particularly in complex environments. If a new soft-
ware upgrade introduces problems, administrators often need to
rapidly “rollback” to a known good state. With other cluster man-
agers this can mean having to re-install the entire cluster. Platform
HPC incorporates repository snapshots, which are “restore points”
for the entire cluster. Administrators can snapshot a known good
repository, make changes to their environment, and easily revert
to a previous “known good” repository in the event of an unfore-
seen problem. This powerful capability takes the risk out of cluster
software upgrades.
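A restore-point mechanism can be illustrated in miniature (plain Python; the package names and versions are invented, and this is not how Platform HPC implements repository snapshots internally):

```python
# Illustrative only: a repository that records labelled "restore points"
# and can revert to a known good state after a risky change.

import copy

class Repository:
    def __init__(self, packages):
        self.packages = dict(packages)   # package name -> version
        self._snapshots = {}

    def snapshot(self, label):
        """Record the current repository state under a label."""
        self._snapshots[label] = copy.deepcopy(self.packages)

    def revert(self, label):
        """Roll back to a previously recorded known-good state."""
        self.packages = copy.deepcopy(self._snapshots[label])

repo = Repository({"kernel": "3.10", "mpi": "8.1"})
repo.snapshot("known-good")
repo.packages["mpi"] = "9.0-beta"   # risky upgrade
repo.revert("known-good")           # unforeseen problem: roll back
print(repo.packages)
```

The value of the real feature is that the rollback applies to the whole cluster repository, so no node re-installation is needed.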
New hardware integration – When new hardware is added
to a cluster it may require new or updated device drivers that
are not supported by the OS environment on the installer
node. This means that a newly updated node may not net-
work boot and provision until the head node on the cluster is
updated with a new operating system; a tedious and disrup-
tive process. Platform HPC includes a driver patching utility
that allows updated device drivers to be inserted into exist-
ing repositories, essentially future proofing the cluster, and
providing a simplified means of supporting new hardware
without needing to re-install the environment from scratch.
Software updates with no re-boot – Some cluster managers al-
ways re-boot nodes when updating software, regardless of how
minor the change. This is a simple way to manage updates.
However, scheduling downtime can be difficult and disruptive.
Platform HPC performs updates intelligently and selectively so that compute
nodes continue to run even as non-intrusive updates are applied.
The repository is automatically updated so that future installations
include the software update. Changes that require the re-installa-
tion of the node (e.g. upgrading an operating system) can be made
in a “pending” state until downtime can be scheduled.
User-friendly, topology aware workload management
Platform HPC includes a robust workload scheduling capability,
which is based on Platform LSF - the industry’s most powerful,
comprehensive, policy driven workload management solution for
engineering and scientific distributed computing environments.
By scheduling workloads intelligently according to policy,
Platform HPC improves end user productivity with minimal sys-
tem administrative effort. In addition, it allows HPC user teams to
easily access and share all computing resources, while reducing
time between simulation iterations.
GPU scheduling – Platform HPC provides the capability to sched-
ule jobs to GPUs as well as CPUs. This is particularly advantageous
in heterogeneous hardware environments as it means that admin-
istrators can configure Platform HPC so that only those jobs that
can benefit from running on GPUs are allocated to those resourc-
es. This frees up CPU-based resources to run other jobs. Using the
unified management interface, administrators can monitor the
GPU performance as well as detect ECC errors.
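The scheduling policy can be sketched as a simple host filter (illustrative Python only; host and job names are invented, and this is not the Platform HPC scheduler):

```python
# Illustrative only: route jobs that need GPUs to GPU-equipped hosts
# and let CPU-only jobs land anywhere, freeing GPU hosts for the jobs
# that actually benefit from them.

def pick_host(job, hosts):
    """Return the first host satisfying the job's GPU requirement."""
    for name, info in hosts.items():
        if job["gpus"] <= info["free_gpus"]:
            return name
    return None  # no host can satisfy the request right now

hosts = {
    "cpu-node01": {"free_gpus": 0},
    "gpu-node01": {"free_gpus": 2},
}

print(pick_host({"name": "cfd-run", "gpus": 0}, hosts))  # CPU-only job
print(pick_host({"name": "md-run", "gpus": 1}, hosts))   # needs a GPU host
```

Because the CPU-only job matches the first (CPU) host, the GPU host stays free for GPU work, which is the policy the text describes.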
Unified management interface
Competing cluster management tools either do not have a web-
based interface or require multiple interfaces for managing dif-
ferent functional areas. In comparison, Platform HPC includes a
single unified interface through which all administrative tasks
can be performed including node-management, job-manage-
ment, jobs and cluster monitoring and reporting. Using the uni-
fied management interface, even cluster administrators with
very little Linux experience can competently manage a state of
the art HPC cluster.
Job management – While command line savvy users can contin-
ue using the remote terminal capability, the unified web portal
makes it easy to submit, monitor, and manage jobs. As changes
are made to the cluster configuration, Platform HPC automatical-
ly re-configures key components, ensuring that jobs are allocat-
ed to the appropriate resources.
The web portal is customizable and provides job data manage-
ment, remote visualization and interactive job support.
Workload/system correlation – Administrators can correlate
workload information with system load, so that they can make
timely decisions and proactively manage compute resources
against business demand. When it’s time for capacity planning,
the management interface can be used to run detailed reports
and analyses which quantify user needs and remove the guess
work from capacity expansion.
Simplified cluster management – The unified management con-
sole is used to administer all aspects of the cluster environment.
It enables administrators to easily install, manage and monitor
their cluster. It also provides an interactive environment to easily
package software as kits for application deployment as well as
pre-integrated commercial application support. One of the key
features of the interface is an operational dashboard that pro-
vides comprehensive administrative reports. As the image illus-
trates, Platform HPC enables administrators to monitor and report
on key performance metrics such as cluster capacity, available
memory and CPU utilization. This enables administrators to easily
identify and troubleshoot issues. The easy to use interface saves
the cluster administrator time, and means that they do not need to
become an expert in the administration of open-source software
components. It also reduces the possibility of errors and time lost
due to incorrect configuration. Cluster administrators enjoy the
best of both worlds – easy access to a powerful, web-based cluster
manager without the need to learn and separately administer all
the tools that comprise the HPC cluster environment.
Robust Commercial MPI library
Platform MPI – In order to make it easier to get parallel applica-
tions up and running, Platform HPC includes the industry’s most
robust and highest performing MPI implementation, Platform MPI.
Platform MPI provides consistent performance at application run-
time and for application scaling, resulting in top performance re-
sults across a range of third-party benchmarks.
Open Source MPI – Platform HPC also includes various other in-
dustry standard MPI implementations. This includes MPICH1,
MPICH2 and MVAPICH1, which are optimized for cluster hosts con-
nected via InfiniBand, iWARP or other RDMA based interconnects.
Integrated application support
Job submission templates – Platform HPC comes complete with
job submission templates for ANSYS Mechanical, ANSYS Fluent,
ANSYS CFX, LS-DYNA, MSC Nastran, Schlumberger ECLIPSE, Sim-
ulia Abaqus, NCBI Blast, NWChem, ClustalW, and HMMER. By
configuring these templates based on the application settings
in your environment, users can start using the cluster without
writing scripts.
IBM Platform HPC
Enterprise-Ready Cluster &
Workload Management
Scripting Guidelines – Cluster users who run homegrown or
open-source applications can use the Platform HPC scripting
guidelines. These user-friendly interfaces help minimize job sub-
mission errors. They are also self-documenting, enabling users to
create their own job submission templates.
Benchmark tests – Platform HPC also includes standard bench-
mark tests to ensure that your cluster will deliver the best perfor-
mance without manual tuning.
Flexible OS provisioning
Platform HPC can deploy multiple operating system versions con-
currently on the same cluster and, based on job resource require-
ments, dynamically boot the Linux or Windows operating system
required to run the job. Administrators can also use a web inter-
face to manually switch nodes to the required OS to meet applica-
tion demands, providing them with the flexibility to support spe-
cial requests and accommodate unanticipated changes. Rather
than being an extra-cost item as it is with other HPC management
suites, this capability is included as a core feature of Platform HPC.
Commercial Service and support
Certified cluster configurations – Platform HPC is tested and
certified on all partner hardware platforms. By qualifying each
platform individually and providing vendor-specific software
with optimized libraries and drivers that take maximum advan-
tage of unique hardware features, Platform Computing has es-
sentially done the integration work in advance.
As a result, clusters can be deployed quickly and predictably with
minimal effort. As a testament to this, Platform HPC is certified
under the Intel Cluster Ready program.
Enterprise class service and support – Widely regarded as hav-
ing the best HPC support organization in the business, Platform
Computing is uniquely able to support an integrated HPC plat-
form. Because support personnel have direct access to the Plat-
form HPC developers, Platform Computing is able to offer a high-
er level of support and ensure that any problems encountered
are resolved quickly and efficiently.
Summary
Platform HPC is the ideal solution for deploying and managing
state of the art HPC clusters. It makes cluster management sim-
ple, enabling analysts, engineers and scientists from organiza-
tions of any size to easily exploit the power of Linux clusters.
Unlike other HPC solutions that address only parts of the HPC
management challenge, Platform HPC uniquely addresses all
aspects of cluster and workload management, including:
• Easy-to-use cluster provisioning and management
• User-friendly, topology aware workload management
• Unified management interface
• Robust commercial MPI library
• Integrated application support
• Flexible OS provisioning
• Commercial HPC service and support
By providing simplified management over the entire lifecycle of
a cluster, Platform HPC has a direct and positive impact on pro-
ductivity while helping to reduce complexity and cost.
The comprehensive web-based management interface, and fea-
tures like repository snapshots and the ability to update soft-
ware packages on the fly means that state-of-the-art HPC clus-
ters can be provisioned and managed even by administrators
with little or no Linux administration experience.
Capability / Feature overview – Platform HPC

Cluster Provisioning and Management
• Initial cluster provisioning
• Multiple provisioning methods
• Web-based cluster management
• Node updates with no re-boot
• Repository snapshots
• Flexible node templates
• Multiple OS and OS versions

Workload Management & Application Integration
• Integrated workload management
• HPC libraries & toolsets
• NVIDIA CUDA SDK support
• Web-based job management
• Web-based job data management
• Multi-boot based on workload
• Advanced parallel job management
• Commercial application integrations

MPI Libraries
• Commercial grade MPI

Workload and system monitoring, reporting and correlation
• Workload monitoring
• Workload reporting
• System monitoring & reporting
• Workload and system load correlation
• Integration with 3rd party management tools
IBM Platform MPI 8.1
Benefits
• Superior application performance
• Reduced development and support costs
• Faster time-to-market
• The industry’s best technical support
Features
• Supports the widest range of hardware, networks and operating systems
• Distributed by over 30 leading commercial software vendors
• Change interconnects or libraries with no need to re-compile
• Seamless compatibility across Windows and Linux environments
• Ensures a production quality implementation
Ideal for:
• Enterprises that develop or deploy parallelized software applications on HPC clusters
• Commercial software vendors wanting to improve application performance over the widest range of computer hardware, interconnects and operating systems
The Standard for Scalable, Parallel Applications
Platform MPI is a high performance, production-quality imple-
mentation of the Message Passing Interface (MPI). It is widely
used in the high performance computing (HPC) industry and is
considered the de facto standard for developing scalable, parallel
applications. Platform MPI maintains full backward compatibili-
ty with HP-MPI and Platform MPI applications and incorporates
advanced CPU affinity features, dynamic selection of interface
libraries, superior workload manager integrations and improved
performance and scalability.
Platform MPI supports the broadest range of industry standard
platforms, interconnects and operating systems helping ensure
that your parallel applications can run anywhere.
Focus on portability
Platform MPI allows developers to build a single executable that
transparently leverages the performance features of any type of
interconnect, thereby providing applications with optimal laten-
cy and bandwidth for each protocol. This reduces development
effort, and enables applications to use the “latest and greatest”
technologies on Linux or Microsoft Windows without the need to
re-compile and re-link applications.
Platform MPI is optimized for both distributed (DMP) and shared
memory (SMP) environments and provides a variety of flexible
CPU binding strategies for processes and threads, enabling bet-
ter performance on multi–core environments. With this capa-
bility memory and cache conflicts are managed by more intelli-
gently distributing the load among multiple cores. With support
for Windows HPC Server 2008 and the Microsoft job scheduler,
as well as other Microsoft operating environments, Platform MPI
allows developers targeting Windows platforms to enjoy the
benefits of a standard portable MPI and avoid proprietary lock-in.
General Parallel File System (GPFS)
Big Data, Cloud Storage: it doesn’t matter what you call it, there is certainly increasing
demand to store larger and larger amounts of unstructured data.
The IBM General Parallel File System (GPFS™) has always been considered a pioneer of
big data storage and continues today to introduce industry-leading storage
technologies.
Since 1998 GPFS has led the industry with many technologies that make the storage
of large quantities of file data possible. The latest version continues that tradition:
GPFS 3.5 represents a significant milestone in the evolution of big data management.
GPFS 3.5 introduces some revolutionary new features that clearly demonstrate IBM’s
commitment to providing industry leading storage solutions.
What is GPFS?
GPFS is more than clustered file system software; it is a full
featured set of file management tools. This includes advanced
storage virtualization, integrated high availability, automated
tiered storage management and the performance to effectively
manage very large quantities of file data. GPFS allows a group of
computers concurrent access to a common set of file data over
a common SAN infrastructure, a network or a mix of connection
types. The computers can run any mix of AIX, Linux or Windows
Server operating systems. GPFS provides storage management,
information life cycle management tools, centralized adminis-
tration and allows for shared access to file systems from remote
GPFS clusters providing a global namespace.
A GPFS cluster can be a single node, two nodes providing a high
availability platform supporting a database application, for
example, or thousands of nodes used for applications like the
modeling of weather patterns. The largest existing configurations
exceed 5,000 nodes. GPFS has been available since 1998 and has
been field proven for more than 14 years on some of the world’s
most powerful supercomputers to provide reliability and effi-
cient use of infrastructure bandwidth.
GPFS was designed from the beginning to support high perfor-
mance parallel workloads and has since been proven very effec-
tive for a variety of applications. Today it is installed in clusters
supporting big data analytics, gene sequencing, digital media
and scalable file serving. These applications are used across
many industries including financial, retail, digital media, bio-
technology, science and government. GPFS continues to push
technology limits by being deployed in very demanding large
environments. You may not need multiple petabytes of data to-
day, but you will, and when you get there you can rest assured
GPFS has already been tested in these environments. This leader-
ship is what makes GPFS a solid solution for any size application.
Supported operating systems for GPFS Version 3.5 include AIX,
Red Hat, SUSE and Debian Linux distributions and Windows Serv-
er 2008.
The file system
A GPFS file system is built from a collection of arrays that contain
the file system data and metadata. A file system can be built from
a single disk or contain thousands of disks storing petabytes of
data. Each file system can be accessible from all nodes within the
cluster. There is no practical limit on the size of a file system. The
architectural limit is 2⁹⁹ bytes. As an example, current GPFS cus-
tomers are using single file systems up to 5.4PB in size and others
have file systems containing billions of files.
Application interfaces
Applications access files through standard POSIX file system
interfaces. Since all nodes see all of the file data, applications can
scale out easily. Any node in the cluster can concurrently read or
update a common set of files. GPFS maintains the coherency and
consistency of the file system using sophisticated byte range
locking, token (distributed lock) management and journaling.
This means that applications using standard POSIX locking se-
mantics do not need to be modified to run successfully on a GPFS
file system.
In addition to standard interfaces GPFS provides a unique set
of extended interfaces which can be used to provide advanced
application functionality. Using these extended interfaces an
application can determine the storage pool placement of a file,
create a file clone and manage quotas. These extended interfaces
provide features in addition to the standard POSIX interface.
Performance and scalability
GPFS provides unparalleled performance for unstructured data.
GPFS achieves high performance I/O by:
• Striping data across multiple disks attached to multiple nodes.
• High performance metadata (inode) scans.
• Supporting a wide range of file system block sizes to match
I/O requirements.
• Utilizing advanced algorithms to improve read-ahead and
write-behind IO operations.
• Using block level locking based on a very sophisticated scalable
token management system to provide data consistency while
allowing multiple application nodes concurrent access to the files.
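The first of these points, wide striping, reduces at its core to a round-robin mapping of file blocks to disks, sketched below (illustrative Python; the NSD names are invented, and real GPFS striping is considerably more sophisticated):

```python
# Illustrative only: round-robin striping of file blocks across the
# disks of a storage pool, so every disk carries an equal share of the
# I/O load.

def stripe(num_blocks, disks):
    """Map file block numbers to disks round-robin."""
    return {b: disks[b % len(disks)] for b in range(num_blocks)}

disks = ["nsd1", "nsd2", "nsd3", "nsd4"]
layout = stripe(8, disks)
print(layout)  # each of the four disks receives two of the eight blocks
```

Because every disk participates in every large file, aggregate bandwidth grows with the number of disks in the pool.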
When creating a GPFS file system you provide a list of raw devices
and they are assigned to GPFS as Network Shared Disks (NSD). Once
an NSD is defined all of the nodes in the GPFS cluster can access the
disk, using a local disk connection or using the GPFS NSD network
protocol for shipping data over a TCP/IP or InfiniBand connection.
GPFS token (distributed lock) management coordinates access to
NSD’s ensuring the consistency of file system data and metadata
when different nodes access the same file. Token management
responsibility is dynamically allocated among designated nodes
in the cluster. GPFS can assign one or more nodes to act as token
managers for a single file system. This allows greater scalabil-
ity when you have a large number of files with high transaction
workloads. In the event of a node failure the token management
responsibility is moved to another node.
All data stored in a GPFS file system is striped across all of the disks
within a storage pool, whether the pool contains 2 LUNs or 2,000
LUNs. This wide data striping allows you to get the best performance
for the available storage. When disks are added to or removed from
a storage pool existing file data can be redistributed across the new
storage to improve performance. Data redistribution can be done
automatically or can be scheduled. When redistributing data you
can assign a single node to perform the task to control the impact
on a production workload or have all of the nodes in the cluster par-
ticipate in data movement to complete the operation as quickly as
possible. Online storage configuration is a good example of an en-
terprise class storage management feature included in GPFS.
To achieve the highest possible data access performance GPFS
recognizes typical access patterns including sequential, reverse
sequential and random optimizing I/O access for these patterns.
Along with distributed token management, GPFS provides scalable
metadata management by allowing all nodes of the cluster access-
ing the file system to perform file metadata operations. This feature
distinguishes GPFS from other cluster file systems which typically
have a centralized metadata server handling fixed regions of the
file namespace. A centralized metadata server can often become a
performance bottleneck for metadata intensive operations, limiting
scalability and possibly introducing a single point of failure. GPFS
solves this problem by enabling all nodes to manage metadata.
Administration
GPFS provides an administration model that is easy to use and is
consistent with standard file system administration practices while
providing extensions for the clustering aspects of GPFS. These func-
tions support cluster management and other standard file system
administration functions such as user quotas, snapshots and ex-
tended access control lists. GPFS administration tools simplify clus-
ter-wide tasks. A single GPFS command can perform a file system
function across the entire cluster and most can be issued from any
node in the cluster. Optionally you can designate a group of admin-
istration nodes that can be used to perform all cluster administra-
tion tasks, or only authorize a single login session to perform admin
commands cluster-wide. This allows for higher security by reducing
the scope of node to node administrative access.
Rolling upgrades allow you to upgrade individual nodes in the clus-
ter while the file system remains online. Rolling upgrades are sup-
ported between two major version levels of GPFS (and service levels
within those releases). For example you can mix GPFS 3.4 nodes with
GPFS 3.5 nodes while migrating between releases.
Quotas enable the administrator to manage file system usage by us-
ers and groups across the cluster. GPFS provides commands to gen-
erate quota reports by user, group and on a sub-tree of a file system
called a fileset. Quotas can be set on the number of files (inodes) and
the total size of the files. New in GPFS 3.5, you can define per-fileset
user and group quotas, which allows for more options in quota
configuration. In addition to traditional quota management, the
GPFS policy engine can be used to query the file system metadata and
generate customized space usage reports. An SNMP interface al-
lows monitoring by network management applications. The SNMP
agent provides information on the GPFS cluster and generates traps
when events occur in the cluster. For example, an event is generated
when a file system is mounted or if a node fails. The SNMP agent
runs on Linux and AIX. You can monitor a heterogeneous cluster as
long as the agent runs on a Linux or AIX node.
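A per-fileset quota check can be sketched as follows (illustrative Python; the limits, names and accounting are invented and this is not the GPFS implementation):

```python
# Illustrative only: enforce per-fileset, per-user limits on both the
# number of files (inodes) and the total bytes used, the two quantities
# GPFS quotas can be set on.

quotas = {  # (fileset, user) -> limits; figures are invented
    ("projects", "alice"): {"files": 1000, "bytes": 10 * 2**30},
}

def over_quota(fileset, user, files_used, bytes_used):
    """True if either the file-count or byte limit is exceeded."""
    limit = quotas.get((fileset, user))
    if limit is None:
        return False  # no quota defined for this fileset/user pair
    return files_used > limit["files"] or bytes_used > limit["bytes"]

print(over_quota("projects", "alice", 500, 2 * 2**30))   # within limits
print(over_quota("projects", "alice", 1500, 2 * 2**30))  # too many files
```

The per-fileset key is the point: the same user can have different limits in different filesets, which is the flexibility GPFS 3.5 adds.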
You can customize the response to cluster events using GPFS
callbacks. A callback is an administrator-defined script that is
executed when an event occurs, for example, when a file system is
unmounted or a file system is low on free space. Callbacks can
be used to create custom responses to GPFS events and integrate
these notifications into various cluster monitoring tools. GPFS
provides support for the Data Management API (DMAPI) interface
which is IBM’s implementation of the X/Open data storage man-
agement API. This DMAPI interface allows vendors of storage man-
agement applications such as IBM Tivoli Storage Manager (TSM)
and High Performance Storage System (HPSS) to provide Hierarchi-
cal Storage Management (HSM) support for GPFS.
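The callback idea, an administrator-defined action fired when a cluster event occurs, can be sketched like this (illustrative Python; the event and field names are invented and the mechanism shown is not GPFS code):

```python
# Illustrative only: a registry of administrator-defined actions that
# run when a named cluster event fires, in the spirit of GPFS callbacks.

callbacks = {}
alerts = []  # where our example action records notifications

def register(event, action):
    """Attach an administrator-defined action to an event name."""
    callbacks.setdefault(event, []).append(action)

def fire(event, **details):
    """Invoke every action registered for this event."""
    for action in callbacks.get(event, []):
        action(**details)

# Example response: record a low-space alert for a monitoring tool.
register("lowDiskSpace", lambda fs, pct_free: alerts.append((fs, pct_free)))

fire("lowDiskSpace", fs="/gpfs/data", pct_free=3)
print(alerts)
```

In GPFS the registered action is an external script, but the event-to-action dispatch follows the same pattern.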
GPFS supports POSIX and NFSv4 access control lists (ACLs). NFSv4
ACLs can be used to serve files using NFSv4, but can also be used in
other deployments, for example, to provide ACL support to nodes
running Windows. To provide concurrent access from multiple
operating system types GPFS allows you to run mixed POSIX and
NFS v4 permissions in a single file system and map user and group
IDs between Windows and Linux/UNIX environments. File systems
may be exported to clients outside the cluster through NFS. GPFS is
often used as the base for a scalable NFS file service infrastructure.
The GPFS clustered NFS (cNFS) feature provides data availability to
NFS clients by providing NFS service continuation if an NFS server
fails. This allows a GPFS cluster to provide scalable file service
by providing simultaneous access to a common set of data from
multiple nodes. The clustered NFS tools include monitoring of file
services and IP address fail over. GPFS cNFS supports NFSv3 only.
You can export a GPFS file system using NFSv4 but not with cNFS.
Data availability
GPFS is fault tolerant and can be configured for continued access
to data even if cluster nodes or storage systems fail. This is
accomplished through robust clustering features and support for
synchronous and asynchronous data replication.
GPFS software includes the infrastructure to handle data con-
sistency and availability. This means that GPFS does not rely on
external applications for cluster operations like node failover.
The clustering support goes beyond who owns the data or who
has access to the disks. In a GPFS cluster all nodes see all of the
data and all cluster operations can be done by any node in the
cluster with a server license. All nodes are capable of perform-
ing all tasks. What tasks a node can perform is determined by the
type of license and the cluster configuration.
As a part of the built-in availability tools GPFS continuously mon-
itors the health of the file system components. When failures are
detected appropriate recovery action is taken automatically. Ex-
tensive journaling and recovery capabilities are provided which
maintain metadata consistency when a node holding locks or
performing administrative services fails.
Snapshots can be used to protect the file system’s contents
against a user error by preserving a point in time version of the
file system or a sub-tree of a file system called a fileset. GPFS im-
plements a space efficient snapshot mechanism that generates a
map of the file system or fileset at the time the snapshot is taken.
New data blocks are consumed only when the file system data
has been deleted or modified after the snapshot was created.
This is done using a redirect-on-write technique (sometimes
called copy-on-write). Snapshot data is placed in existing storage
pools simplifying administration and optimizing the use of ex-
isting storage. The snapshot function can be used with a backup
program, for example, to run while the file system is in use and
still obtain a consistent copy of the file system as it was when the
snapshot was created. In addition, snapshots provide an online
backup capability that allows files to be recovered easily from
common problems such as accidental file deletion.
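The redirect-on-write behaviour can be modelled in a few lines (illustrative Python; real GPFS operates on disk blocks within storage pools, not Python lists):

```python
# Illustrative only: a snapshot that initially records just a block map.
# Extra space is consumed only when live data is later modified, at
# which point the old block is preserved for the snapshot.

class FileSystem:
    def __init__(self, blocks):
        self.blocks = list(blocks)   # live data
        self.snap = None             # snapshot block map
        self.preserved = {}          # old blocks kept for the snapshot

    def snapshot(self):
        self.snap = list(range(len(self.blocks)))  # a map, no copies yet
        self.preserved = {}

    def write(self, idx, data):
        if self.snap is not None and idx not in self.preserved:
            self.preserved[idx] = self.blocks[idx]  # keep old block once
        self.blocks[idx] = data

    def read_snapshot(self, idx):
        return self.preserved.get(idx, self.blocks[idx])

fs = FileSystem(["a", "b", "c"])
fs.snapshot()
fs.write(1, "B")   # only now is extra space consumed, for one block
print(fs.read_snapshot(1), fs.blocks[1])
```

Unmodified blocks are shared between the snapshot and the live file system, which is why the mechanism is space efficient.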
Data Replication
For an additional level of data availability and protection, synchro-
nous data replication is available for file system metadata and
data. GPFS provides a very flexible replication model that allows
you to replicate a file, set of files, or an entire file system. The rep-
lication status of a file can be changed using a command or by us-
ing the policy based management tools. Synchronous replication
allows for continuous operation even if a path to an array, an array
itself or an entire site fails.
Synchronous replication is location aware which allows you to
optimize data access when the replicas are separated across a
WAN. GPFS has knowledge of what copy of the data is “local” so
read-heavy applications can get local data read performance even
when data is replicated over a WAN. Synchronous replication works
well for many workloads by replicating data across storage arrays
within a data center, within a campus or across geographical dis-
tances using high quality wide area network connections.
When wide area network connections are not high performance
or are not reliable, an asynchronous approach to data replication
is required. GPFS 3.5 introduces a feature called Active File Man-
agement (AFM). AFM is a distributed disk caching technology de-
veloped at IBM Research that allows the expansion of the GPFS
global namespace across geographical distances. It can be used to
provide high availability between sites or to provide local “copies”
of data distributed to one or more GPFS clusters. For more details
on AFM see the section entitled Sharing data between clusters.
For a higher level of cluster reliability GPFS includes advanced
clustering features to maintain network connections. If a network
connection to a node fails GPFS automatically tries to reestablish
the connection before marking the node unavailable. This can pro-
vide for better uptime in environments communicating across a
WAN or experiencing network issues.
Using these features along with a high availability infrastructure
ensures a reliable enterprise class storage solution.
GPFS Native RAID (GNR)
Larger disk drives and larger file systems are creating challeng-
es for traditional storage controllers. Current RAID 5 and RAID 6
based arrays do not address the challenges of Exabyte scale stor-
age performance, reliability and management. To address these
challenges GPFS Native RAID (GNR) brings storage device man-
agement into GPFS. With GNR GPFS can directly manage thou-
sands of storage devices. These storage devices can be individual
disk drives or any other block device eliminating the need for a
storage controller.
GNR employs a de-clustered approach to RAID. The de-clustered
architecture reduces the impact of drive failures by spreading
data over all of the available storage devices improving appli-
cation IO and recovery performance. GNR provides very high
reliability through an 8+3 Reed-Solomon based RAID code that di-
vides each block of a file into 8 parts and associated parity. This
algorithm scales easily starting with as few as 11 storage devices
and growing to over 500 per storage pod. Spreading the data over
many devices helps provide predictable storage performance and
fast recovery times measured in minutes rather than hours in the
case of a device failure.
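The arithmetic behind the 8+3 layout is worth making explicit (illustrative Python based only on the figures stated above):

```python
# Arithmetic for the 8+3 Reed-Solomon layout described above: each
# block is split into 8 data strips plus 3 parity strips, so the code
# survives 3 concurrent device failures.

data_strips, parity_strips = 8, 3
width = data_strips + parity_strips   # stripe width: 11 devices minimum
efficiency = data_strips / width      # usable fraction of raw capacity
failures_tolerated = parity_strips

print(width, failures_tolerated, round(efficiency, 3))
```

An 8/11 usable fraction (roughly 73%) with triple-failure tolerance compares favourably with RAID 6, which tolerates only two failures per array.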
In addition to performance improvements GNR provides ad-
vanced checksum protection to ensure data integrity. Checksum
information is stored on disk and verified all the way to the NSD
client.
Information lifecycle management (ILM) toolset
GPFS can help you to achieve data lifecycle management efficien-
cies through policy-driven automation and tiered storage man-
agement. The use of storage pools, filesets and user-defined poli-
cies provide the ability to better match the cost of your storage to
the value of your data.
Storage pools are used to manage groups of disks within a file
system. Using storage pools you can create tiers of storage by
grouping disks based on performance, locality or reliability
characteristics. For example, one pool could contain high
performance solid state drives (SSDs) and another more economical
7,200 RPM disk
storage. These types of storage pools are called internal storage
pools. When data is placed in or moved between internal storage
pools all of the data management is done by GPFS. In addition to
internal storage pools GPFS supports external storage pools.
External storage pools are used to interact with an external stor-
age management application including IBM Tivoli Storage Man-
ager (TSM) and High Performance Storage System (HPSS). When
moving data to an external pool GPFS handles all of the metada-
ta processing then hands the data to the external application for
storage on alternate media, tape for example. When using TSM or
HPSS data can be retrieved from the external storage pool on de-
mand, as a result of an application opening a file or data can be
retrieved in a batch operation using a command or GPFS policy.
A fileset is a sub-tree of the file system namespace and provides
a way to partition the namespace into smaller, more manageable
units.
Filesets provide an administrative boundary that can be used to
set quotas, take snapshots, define AFM relationships and be used
in user defined policies to control initial data placement or data
migration. Data within a single fileset can reside in one or more
storage pools. Where the file data resides and how it is managed
once it is created is based on a set of rules in a user defined policy.
There are two types of user defined policies in GPFS: file placement
and file management. File placement policies determine in which
storage pool file data is initially placed. File placement rules are
defined using attributes of a file known when a file is created such
as file name, fileset or the user who is creating the file. For example
a placement policy may be defined that states ‘place all files with
names that end in .mov onto the near-line SAS based storage pool
and place all files created by the CEO onto the SSD based storage
pool’ or ‘place all files in the fileset ‘development’ onto the SAS
based storage pool’.
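Such placement rules are written in GPFS's SQL-like policy language. The sketch below is illustrative only; the rule and pool names ('nearline_sas', 'ssd', 'sas', 'system') are invented, and the user ID standing in for "the CEO" is hypothetical:

```sql
/* Hypothetical placement policy; pool, fileset and rule names are examples */
RULE 'movies'  SET POOL 'nearline_sas' WHERE LOWER(NAME) LIKE '%.mov'
RULE 'ceo'     SET POOL 'ssd'          WHERE USER_ID = 1001
RULE 'devtree' SET POOL 'sas'          FOR FILESET ('development')
RULE 'default' SET POOL 'system'   /* a default placement rule is required */
```

Placement rules are evaluated top to bottom and the first matching rule determines the pool for a newly created file.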
Once files exist in a file system, file management policies can be
used for file migration, deletion, changing file replication status
or generating reports. You can use a migration policy to transpar-
ently move data from one storage pool to another without chang-
ing the file’s location in the directory structure. Similarly you can
use a policy to change the replication status of a file or set of files,
allowing fine grained control over the space used for data availa-
bility. You can use migration and replication policies together, for
example a policy that says: ‘migrate all of the files located in the
subdirectory /database/payroll which end in *.dat and are greater
than 1 MB in size to storage pool #2 and un-replicate these files’.
File deletion policies allow you to prune the file system, deleting
files as defined by policy rules. Reporting on the contents of a file
system can be done through list policies. List policies allow you
to quickly scan the file system metadata and produce information
listing selected attributes of candidate files. File management pol-
icies can be based on more attributes of a file than placement pol-
icies because once a file exists there is more known about the file.
For example, file management policies can utilize attributes such
as last access time, size of the file or a mix of user and file size. This
may result in policies like: ‘Delete all files with a name ending in
.temp that have not been accessed in the last 30 days’, or ‘Migrate
all files owned by Sally that are larger than 4GB to the SATA stor-
age pool’.
General Parallel File System (GPFS)
What is GPFS?
Rule processing can be further automated by including attrib-
utes related to a storage pool instead of a file using the thresh-
old option. Using thresholds you can create a rule that moves
files out of the high performance pool if it is more than 80%
full, for example. The threshold option comes with the ability to
set high, low and pre-migrate thresholds. Pre-migrated files are
files that exist on disk and are migrated to tape. This method
is typically used to allow disk access to the data while allow-
ing disk space to be freed up quickly when a maximum space
threshold is reached. This means that GPFS begins migrating
data at the high threshold, until the low threshold is reached.
If a pre-migrate threshold is set GPFS begins copying data un-
til the pre-migrate threshold is reached. This allows the data to
continue to be accessed in the original pool until it is quickly
deleted to free up space the next time the high threshold is
reached. Thresholds allow you to fully utilize your highest per-
formance storage and automate the task of making room for
new high priority content.
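In GPFS's policy language, such threshold-driven housekeeping might look like the following sketch (pool names and threshold values are invented for illustration):

```sql
/* Hypothetical management rules: migrate when the pool passes 80% full,
   stop at 50%, pre-migrate down to 30%; then prune stale scratch files */
RULE 'spill' MIGRATE FROM POOL 'ssd' THRESHOLD(80,50,30)
             WEIGHT(KB_ALLOCATED) TO POOL 'nearline_sas'
RULE 'purge' DELETE WHERE NAME LIKE '%.temp'
             AND DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME) > 30
```

Management rules like these are typically run by the policy engine on demand or on a schedule, rather than at file creation time.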
Policy rule syntax is based on the SQL 92 syntax standard and
supports multiple complex statements in a single rule enabling
powerful policies. Multiple levels of rules can be applied to a file
system, and rules are evaluated in order for each file when the
policy engine executes allowing a high level of flexibility. GPFS
provides unique functionality through standard interfaces, an
example of this is extended attributes. Extended attributes are a
standard POSIX facility.
GPFS has long supported the use of extended attributes, though
in the past they were not commonly used, in part because of per-
formance concerns. In GPFS 3.4, a comprehensive redesign of the
extended attributes support infrastructure was implemented,
resulting in significant performance improvements. In GPFS 3.5,
extended attributes are accessible by the GPFS policy engine al-
lowing you to write rules that utilize your custom file attributes.
Executing file management operations requires the ability to
efficiently process the file metadata. GPFS includes a high per-
formance metadata scan interface that allows you to efficiently
process the metadata for billions of files. This makes the GPFS ILM
toolset a very scalable tool for automating file management. This
high performance metadata scan engine employs a scale-out ap-
proach. The identification of candidate files and data movement
operations can be performed concurrently by one or more nodes
in the cluster. GPFS can spread rule evaluation and data move-
ment responsibilities over multiple nodes in the cluster pro-
viding a very scalable, high performance rule processing engine.
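The scale-out scan described above can be modeled with a toy sketch. Here ordinary Python threads stand in for cluster nodes, the inode space is a plain list, and the predicate is a hypothetical size rule; none of this reflects GPFS internals:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain

# Toy model of a scale-out policy scan: the inode space is split into
# slices and each worker (standing in for a cluster node) evaluates the
# rule predicate over its own slice.

def scan_slice(records):
    """Return candidate files in one slice (example rule: size > 1 MB)."""
    return [name for name, size in records if size > 1_000_000]

def parallel_scan(records, workers=4):
    # Deal the records out round-robin, one slice per worker.
    slices = [records[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sorted(chain.from_iterable(pool.map(scan_slice, slices)))

files = [("a.dat", 2_000_000), ("b.tmp", 10), ("c.dat", 5_000_000)]
print(parallel_scan(files))  # ['a.dat', 'c.dat']
```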
Cluster configurations
GPFS supports a variety of cluster configurations independent of
which file system features you use. Cluster configuration options
can be characterized into four basic categories:
• Shared disk
• Network block I/O
• Synchronously sharing data between clusters
• Asynchronously sharing data between clusters
Shared disk
A shared disk cluster is the most basic environment. In this con-
figuration, the storage is directly attached to all machines in the
cluster as shown in Figure 1. The direct connection means that
each shared block device is available concurrently to all of the
nodes in the GPFS cluster. Direct access means that the storage
is accessible using a SCSI or other block level protocol using a
SAN, InfiniBand, iSCSI, Virtual IO interface or other block level IO
connection technology. Figure 1 illustrates a GPFS cluster where
all nodes are connected to a common fibre channel SAN. This
example shows a fibre channel SAN though the storage attach-
ment technology could be InfiniBand, SAS, FCoE or any other.
The nodes are connected to the storage using the SAN and to
each other using a LAN. Data used by applications running on
the GPFS nodes flows over the SAN and GPFS control informa-
tion flows among the GPFS instances in the cluster over the LAN.
This configuration is optimal when all nodes in the cluster need
the highest performance access to the data. For example, this is
a good configuration for providing network file service to client
systems using clustered NFS, high-speed data access for digital
media applications or a grid infrastructure for data analytics.
Network-based block IO
As data storage requirements increase and new storage and con-
nection technologies are released a single SAN may not be a suf-
ficient or appropriate choice of storage connection technology.
In environments where every node in the cluster is not attached
to a single SAN, GPFS makes use of an integrated network block
device capability. GPFS provides a block level interface over
TCP/IP networks called the Network Shared Disk (NSD) protocol.
Whether using the NSD protocol or a direct attachment to the
SAN the mounted file system looks the same to the application,
GPFS transparently handles I/O requests.
GPFS clusters can use the NSD protocol to provide high speed data
access to applications running on LAN-attached nodes. Data is
served to these client nodes from one or more NSD servers. In this
configuration, disks are attached only to the NSD servers. Each
NSD server is attached to all or a portion of the disk collection.
Figure 1: SAN attached Storage
With GPFS you can define up to eight NSD servers per disk and
it is recommended that at least two NSD servers are defined for
each disk to avoid a single point of failure.
GPFS uses the NSD protocol over any TCP/IP capable network
fabric. On Linux GPFS can use the VERBS RDMA protocol on com-
patible fabrics (such as InfiniBand) to transfer data to NSD cli-
ents. The network fabric does not need to be dedicated to GPFS;
but should provide sufficient bandwidth to meet your GPFS
performance expectations and for applications which share the
bandwidth. GPFS has the ability to define a preferred network
subnet topology, for example designate separate IP subnets for
intra-cluster communication and the public network. This pro-
vides for a clearly defined separation of communication traffic
and allows you to increase the throughput and possibly the
number of nodes in a GPFS cluster. Allowing access to the same
disk from multiple subnets means that all of the NSD clients do
not have to be on a single physical network. For example you
can place groups of clients onto separate subnets that access
a common set of disks through different NSD servers so not all
NSD servers need to serve all clients. This can reduce the net-
working hardware costs and simplify the topology reducing
support costs, providing greater scalability and greater overall
performance.
An example of the NSD server model is shown in Figure 2. In this
configuration, a subset of the total node population is defined
as NSD server nodes. The NSD server is responsible for the abstraction
of disk data blocks across a TCP/IP or InfiniBand VERBS
(Linux only) based network. The fact that the disks are remote
is transparent to the application. Figure 2 shows an example of
a configuration where a set of compute nodes are connected
to a set of NSD servers using a high-speed interconnect or an
IP-based network such as Ethernet. In this example, data to the
NSD servers flows over the SAN and both data and control infor-
mation to the clients flow across the LAN. Since the NSD servers
are serving data blocks from one or more devices data access is
similar to a SAN attached environment in that data flows from
all servers simultaneously to each client. This parallel data ac-
cess provides the best possible throughput to all clients. In ad-
dition it provides the ability to scale up the throughput even to
a common data set or even a single file.
The choice of how many nodes to configure as NSD servers is
based on performance requirements, the network architecture
and the capabilities of the storage subsystems. High bandwidth
LAN connections should be used for clusters requiring signifi-
cant data transfer rates. This can include 1Gbit or 10 Gbit Ether-
net. For additional performance or reliability you can use link ag-
gregation (EtherChannel or bonding), networking technologies
like source based routing or higher performance networks such
as InfiniBand.
The choice between SAN attachment and network block I/O is a
performance and economic one. In general, using a SAN provides
the highest performance; but the cost and management com-
plexity of SANs for large clusters is often prohibitive. In these
cases network block I/O provides an option. Network block I/O
is well suited to grid computing and clusters with sufficient net-
work bandwidth between the NSD servers and the clients. For
example, an NSD protocol based grid is effective for web applica-
tions, supply chain management or modeling weather patterns.
Figure 2: Network block IO
Mixed Clusters
The last two sections discussed shared disk and network attached
GPFS cluster topologies. You can mix these storage attachment
methods within a GPFS cluster to better match the IO require-
ments to the connection technology. A GPFS node always tries to
find the most efficient path to the storage. If a node detects a block
device path to the data it is used. If there is no block device path
then the network is used. This capability can be leveraged to pro-
vide additional availability. If a node is SAN attached to the storage
and there is an HBA failure, for example, GPFS can fail over to using
the network path to the disk. A mixed cluster topology can provide
direct storage access to non-NSD server nodes for high performance
operations including backups or data ingest.
Sharing data between clusters
There are two methods available to share data across GPFS clusters:
GPFS multi-cluster and a new feature called Active File Management
(AFM).
Figure 3: Mixed Cluster Architecture
GPFS Multi-cluster allows you to utilize the native GPFS protocol to
share data across clusters. Using this feature you can allow other
clusters to access one or more of your file systems and you can
mount file systems that belong to other GPFS clusters for which
you have been authorized. A multi-cluster environment allows the
administrator to permit access to specific file systems from another
GPFS cluster. This feature is intended to allow clusters to share data
at higher performance levels than file sharing technologies like NFS
or CIFS. It is not intended to replace such file sharing technologies
which are optimized for desktop access or for access across unrelia-
ble network links. Multi-cluster capability is useful for sharing across
multiple clusters within a physical location or across locations. Clus-
ters are most often attached using a LAN, but in addition the cluster
connection could include a SAN. Figure 3 illustrates a multi-cluster
configuration with both LAN and mixed LAN and SAN connections.
In Figure 3, Cluster B and Cluster C need to access the data from
Cluster A. Cluster A owns the storage and manages the file system.
It may grant access to file systems which it manages to remote clus-
ters such as Cluster B and Cluster C. In this example, Cluster B and
Cluster C do not have any storage but that is not a requirement. They
could own file systems which may be accessible outside their clus-
ter. Commonly in the case where a cluster does not own storage, the
nodes are grouped into clusters for ease of management.
When the remote clusters need access to the data, they mount the
file system by contacting the owning cluster and passing required
security checks. In Figure 3, Cluster B accesses the data through the
NSD protocol. Cluster C accesses data through an extension of the
SAN. For Cluster C, access to the data is similar to the nodes in the
host cluster. Control and administrative traffic flows through the IP
network shown in Figure 3 and data access is direct over the SAN.
Both types of configurations are possible and as in Figure 3 can be
mixed as required. Multi-cluster environments are well suited to
sharing data across clusters belonging to different organizations
for collaborative computing, grouping sets of clients for adminis-
trative purposes or implementing a global namespace across sep-
arate locations. A multi-cluster configuration allows you to connect
GPFS clusters within a data center, across campus or across reliable
WAN links. For sharing data between GPFS clusters across less re-
liable WAN links or in cases where you want a copy of the data in
multiple locations you can use a new feature introduced in GPFS 3.5
called Active File Management.
Active File Management (AFM) allows you to create associations be-
tween GPFS clusters. Now the location and flow of file data between
GPFS clusters can be automated. Relationships between GPFS clus-
ters using AFM are defined at the fileset level. A fileset in a file sys-
tem can be created as a “cache” that provides a view to a file system
in another GPFS cluster called the “home.” File data is moved into
a cache fileset on demand. When a file is read in the cache fileset,
the file data is copied from the home into the cache fileset. Data
consistency and file movement into and out of the cache is man-
aged automatically by GPFS.
Cache filesets can be read-only or writeable. Cached data is local-
ly read or written. On a read, if the data is not in the “cache” then
GPFS automatically creates a copy of the data. When data is writ-
ten into the cache the write operation completes locally then GPFS
asynchronously pushes the changes back to the home location.
You can define multiple cache filesets for each home data source.
The number of cache relationships for each home is limited only by
the bandwidth available at the home location. Placing a quota on a
cache fileset causes the data to be cleaned (evicted) out of the cache
automatically based on the space available. If you do not set a quota
a copy of the file data remains in the cache until manually evicted
or deleted.

Figure 4: Multi-cluster
You can use AFM to create a global namespace within a data center,
across a campus or between data centers located around the world.
AFM is designed to enable efficient data transfers over wide area
network (WAN) connections. When file data is read from home into
cache that transfer can happen in parallel within a node called a
gateway or across multiple gateway nodes.
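The read-through, write-back behavior described above can be modeled with a toy sketch. This is illustrative only: real AFM handles consistency, revalidation and parallel WAN transfers inside GPFS, none of which is modeled here.

```python
# Toy write-back model of an AFM cache fileset.
class CacheFileset:
    def __init__(self, home):
        self.home = home          # "home" cluster: path -> data
        self.cache = {}           # locally cached copies
        self.dirty = []           # writes queued for asynchronous push

    def read(self, path):
        if path not in self.cache:            # cache miss: fetch from home
            self.cache[path] = self.home[path]
        return self.cache[path]

    def write(self, path, data):
        self.cache[path] = data               # write completes locally...
        self.dirty.append(path)               # ...and is queued for home

    def flush(self):
        for path in self.dirty:               # asynchronous push to home
            self.home[path] = self.cache[path]
        self.dirty.clear()

home = {"/data1/a": "v1"}
c = CacheFileset(home)
assert c.read("/data1/a") == "v1"   # pulled into the cache on first read
c.write("/data1/a", "v2")           # local write completes immediately
assert home["/data1/a"] == "v1"     # home not yet updated
c.flush()                           # asynchronous push catches home up
assert home["/data1/a"] == "v2"
```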
Using AFM you can now create a truly world-wide global namespace.
Figure 5 is an example of a global namespace built using AFM. In
this example each site owns 1/3 of the namespace, store3 “owns”
/data5 and /data6 for example. The other sites have cache filesets
that point to the home location for the data owned by the other
clusters. In this example, store3's filesets /data1-4 cache data owned by the
other two clusters. This provides the same namespace within each
GPFS cluster providing applications one path to a file across the en-
tire organization.
You can achieve this type of global namespace with multi-cluster or
AFM; which method you choose depends on the quality of the WAN
link and the application requirements.
Figure 5: Global Namespace using AFM
What's new in GPFS Version 3.5
For those who are familiar with GPFS 3.4, this section provides a
list of what is new in GPFS Version 3.5.
Active File Management
When GPFS was introduced in 1998 it represented a revolution
in file storage. For the first time a group of servers could share
high performance access to a common set of data over a SAN or
network. The ability to share high performance access to file data
across nodes was the introduction of the global namespace.
Later GPFS introduced the ability to share data across multiple
GPFS clusters. This multi-cluster capability enabled data sharing
between clusters allowing for better access to file data.
This further expanded the reach of the global namespace from
within a cluster to across clusters spanning a data center or a
country. There were still challenges to building a multi-cluster
global namespace. The big challenge is working with unreliable
and high latency network connections between the servers.
Active File Management (AFM) in GPFS addresses the WAN band-
width issues and enables GPFS to create a world-wide global name-
space. AFM ties the global namespace together asynchronous-
ly providing local read and write performance with automated
namespace management. It allows you to create associations be-
tween GPFS clusters and define the location and flow of file data.
High Performance Extended Attributes
GPFS has long supported the use of extended attributes, though
in the past they were not commonly used, in part because of
performance concerns. In GPFS 3.4, a comprehensive redesign of
the extended attributes support infrastructure was implement-
ed, resulting in significant performance improvements. In GPFS
3.5, extended attributes are accessible by the GPFS policy engine
allowing you to write rules that utilize your custom file attrib-
utes.
Now an application can use standard POSIX interfaces to man-
age extended attributes and the GPFS policy engine can utilize
these attributes.
Independent Filesets
Effectively managing a file system with billions of files requires
advanced file management technologies. GPFS 3.5 introduces
a new concept called the independent fileset. An independent
fileset has its own inode space. This means that an independent
fileset can be managed similar to a separate file system but still
allow you to realize the benefits of storage consolidation.
An example of an efficiency introduced with independent file-
sets is improved policy execution performance. If you use an in-
dependent fileset as a predicate in your policy, GPFS only needs
to scan the inode space represented by that fileset, so if you have
1 billion files in your file system and a fileset has an inode space
of 1 million files, the scan only has to look at 1 million inodes. This
instantly makes the policy scan much more efficient. Independ-
ent filesets enable other new fileset features in GPFS 3.5.
Fileset Level Snapshots
Snapshot granularity is now at the fileset level in addition to file
system level snapshots.
Fileset Level Quotas
User and group quotas can be set per fileset.
File Cloning
File clones are space efficient copies of a file where two instances
of a file share data they have in common and only changed
blocks require additional storage. File cloning is an efficient way
to create a copy of a file, without the overhead of copying all of
the data blocks.
IPv6 Support
IPv6 support in GPFS means that nodes can be defined using mul-
tiple addresses, both IPv4 and IPv6.
GPFS Native RAID
GPFS Native RAID (GNR) brings storage RAID management into
the GPFS NSD server. With GNR GPFS directly manages JBOD
based storage. This feature provides greater availability, flexibili-
ty and performance for a variety of application workloads.
GNR implements a Reed-Solomon based de-clustered RAID tech-
nology that can provide high availability and keep drive failures
from impacting performance by spreading the recovery tasks
over all of the disks. Unlike standard Network Shared Disk (NSD)
data access GNR is tightly integrated with the storage hardware.
For GPFS 3.5 GNR is available on the IBM Power 775 Supercom-
puter platform.
Glossary
ACML (“AMD Core Math Library”)
A software development library released by AMD. This library
provides useful mathematical routines optimized for AMD pro-
cessors. Originally developed in 2002 for use in high-performance
computing (HPC) and scientific computing, ACML allows nearly
optimal use of AMD Opteron processors in compute-intensive
applications. ACML consists of the following main components:
• A full implementation of Level 1, 2 and 3 Basic Linear Algebra
Subprograms (→ BLAS), with optimizations for AMD Opteron
processors.
• A full suite of Linear Algebra (→ LAPACK) routines.
• A comprehensive suite of Fast Fourier Transforms (FFTs) in sin-
gle-, double-, single-complex and double-complex data types.
• Fast scalar, vector, and array math transcendental library
routines.
• Random Number Generators in both single- and double-pre-
cision.
AMD offers pre-compiled binaries for Linux, Solaris, and Win-
dows available for download. Supported compilers include gfor-
tran, Intel Fortran Compiler, Microsoft Visual Studio, NAG, PathS-
cale, PGI compiler, and Sun Studio.
BLAS (“Basic Linear Algebra Subprograms“)
Routines that provide standard building blocks for performing
basic vector and matrix operations. The Level 1 BLAS perform
scalar, vector and vector-vector operations, the Level 2 BLAS
perform matrix-vector operations, and the Level 3 BLAS per-
form matrix-matrix operations. Because the BLAS are efficient,
portable, and widely available, they are commonly used in the
development of high quality linear algebra software, e.g. → LA-
PACK. Although a model Fortran implementation of the BLAS
is available from netlib in the BLAS library, it is not expected to
perform as well as a specially tuned implementation on most
high-performance computers – on some machines it may give
much worse performance – but it allows users to run → LAPACK
software on machines that do not offer any other implementa-
tion of the BLAS.
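The three levels can be illustrated with plain-Python reference implementations; a tuned BLAS computes the same results with blocked, vectorized kernels, so these loops show only the mathematics, not the performance technique:

```python
# The three BLAS levels as minimal reference operations.

def axpy(a, x, y):
    """Level 1 (vector-vector): y := a*x + y."""
    return [a * xi + yi for xi, yi in zip(x, y)]

def gemv(A, x):
    """Level 2 (matrix-vector): y := A*x."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]

def gemm(A, B):
    """Level 3 (matrix-matrix): C := A*B."""
    return [[sum(aik * bkj for aik, bkj in zip(row, col))
             for col in zip(*B)] for row in A]

print(axpy(2, [1, 2], [10, 20]))            # [12, 24]
print(gemv([[1, 0], [0, 1]], [3, 4]))       # [3, 4]
print(gemm([[1, 2], [3, 4]], [[1, 0], [0, 1]]))  # [[1, 2], [3, 4]]
```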
Cg (“C for Graphics”)
A high-level shading language developed by Nvidia in close col-
laboration with Microsoft for programming vertex and pixel
shaders. It is very similar to Microsoft’s → HLSL. Cg is based on
the C programming language and although they share the same
syntax, some features of C were modified and new data types
were added to make Cg more suitable for programming graph-
ics processing units. This language is only suitable for GPU pro-
gramming and is not a general programming language. The Cg
compiler outputs DirectX or OpenGL shader programs.
CISC (“complex instruction-set computer”)
A computer instruction set architecture (ISA) in which each in-
struction can execute several low-level operations, such as
a load from memory, an arithmetic operation, and a memory
store, all in a single instruction. The term was retroactively
coined in contrast to reduced instruction set computer (RISC).
The terms RISC and CISC have become less meaningful with the
continued evolution of both CISC and RISC designs and imple-
mentations, with modern processors also decoding and split-
ting more complex instructions into a series of smaller internal
micro-operations that can thereby be executed in a pipelined
fashion, thus achieving high performance on a much larger sub-
set of instructions.
cluster
Aggregation of several, mostly identical or similar systems to
a group, working in parallel on a problem. Previously known
as Beowulf Clusters, HPC clusters are composed of commodity
hardware, and are scalable in design. The more machines are
added to the cluster, the more performance can in principle be
achieved.
control protocol
Part of the → parallel NFS standard
CUDA driver API
Part of → CUDA
CUDA SDK
Part of → CUDA
CUDA toolkit
Part of → CUDA
CUDA (“Compute Uniform Device Architecture”)
A parallel computing architecture developed by NVIDIA. CUDA
is the computing engine in NVIDIA graphics processing units
or GPUs that is accessible to software developers through in-
dustry standard programming languages. Programmers use
“C for CUDA” (C with NVIDIA extensions), compiled through a
PathScale Open64 C compiler, to code algorithms for execution
on the GPU. CUDA architecture supports a range of computation-
al interfaces including → OpenCL and → DirectCompute. Third
party wrappers are also available for Python, Fortran, Java and
Matlab. CUDA works with all NVIDIA GPUs from the G8X series
onwards, including GeForce, Quadro and the Tesla line. CUDA
provides both a low level API and a higher level API. The initial
CUDA SDK was made public on 15 February 2007, for Microsoft
Windows and Linux. Mac OS X support was later added in ver-
sion 2.0, which supersedes the beta released February 14, 2008.
CUDA is the hardware and software architecture that enables
NVIDIA GPUs to execute programs written with C, C++, Fortran,
→ OpenCL, → DirectCompute, and other languages. A CUDA pro-
gram calls parallel kernels. A kernel executes in parallel across
a set of parallel threads. The programmer or compiler organizes
these threads in thread blocks and grids of thread blocks. The
GPU instantiates a kernel program on a grid of parallel thread
blocks. Each thread within a thread block executes an instance
of the kernel, and has a thread ID within its thread block, pro-
gram counter, registers, per-thread private memory, inputs, and
output results.
[Diagram: threads grouped into thread blocks within grids, with per-thread private local memory, per-block shared memory, and per-application context global memory]
A thread block is a set of concurrently executing threads that
can cooperate among themselves through barrier synchroniza-
tion and shared memory. A thread block has a block ID within
its grid. A grid is an array of thread blocks that execute the same
kernel, read inputs from global memory, write results to global
memory, and synchronize between dependent kernel calls. In
the CUDA parallel programming model, each thread has a per-
thread private memory space used for register spills, function
calls, and C automatic array variables. Each thread block has a
per-block shared memory space used for inter-thread communi-
cation, data sharing, and result sharing in parallel algorithms.
Grids of thread blocks share results in global memory space af-
ter kernel-wide global synchronization.
CUDA’s hierarchy of threads maps to a hierarchy of processors
on the GPU; a GPU executes one or more kernel grids; a stream-
ing multiprocessor (SM) executes one or more thread blocks;
and CUDA cores and other execution units in the SM execute
threads. The SM executes threads in groups of 32 threads called
a warp. While programmers can generally ignore warp ex-
ecution for functional correctness and think of programming
one thread, they can greatly improve performance by having
threads in a warp execute the same code path and access mem-
ory in nearby addresses. See the main article “GPU Computing”
for further details.
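The index arithmetic a CUDA kernel performs can be simulated in plain Python. The loop-based launcher below is only a sketch of the execution model (blocks and threads run sequentially here, and warps are not modeled); the kernel and array names are invented:

```python
# Simulation of the CUDA thread hierarchy: every thread in every block
# runs the same kernel and derives a unique global index from its
# (blockIdx, threadIdx) pair.

def launch(kernel, grid_dim, block_dim, *args):
    for block_idx in range(grid_dim):        # one iteration per thread block
        for thread_idx in range(block_dim):  # one per thread in the block
            kernel(block_idx, thread_idx, block_dim, *args)

def vec_add(block_idx, thread_idx, block_dim, a, b, out):
    i = block_idx * block_dim + thread_idx   # the canonical CUDA global index
    if i < len(out):                         # bounds guard, as in a real kernel
        out[i] = a[i] + b[i]

a, b = [1, 2, 3, 4, 5], [10, 20, 30, 40, 50]
out = [0] * 5
launch(vec_add, 2, 4, a, b, out)             # 2 blocks x 4 threads = 8 threads
print(out)  # [11, 22, 33, 44, 55]
```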
DirectCompute
An application programming interface (API) that supports gen-
eral-purpose computing on graphics processing units (GPUs)
on Microsoft Windows Vista or Windows 7. DirectCompute is
part of the Microsoft DirectX collection of APIs and was initially
released with the DirectX 11 API but runs on both DirectX 10
and DirectX 11 GPUs. The DirectCompute architecture shares a
range of computational interfaces with → OpenCL and → CUDA.
ETL (“Extract, Transform, Load”)
A process in database usage and especially in data warehousing
that involves:
• Extracting data from outside sources
• Transforming it to fit operational needs (which can include
quality levels)
• Loading it into the end target (database or data warehouse)
The first part of an ETL process involves extracting the data from
the source systems. In many cases this is the most challenging
aspect of ETL, as extracting data correctly will set the stage for
how subsequent processes will go. Most data warehousing proj-
ects consolidate data from different source systems. Each sepa-
rate system may also use a different data organization/format.
Common data source formats are relational databases and flat
files, but may include non-relational database structures such
as Information Management System (IMS) or other data struc-
tures such as Virtual Storage Access Method (VSAM) or Indexed
Sequential Access Method (ISAM), or even fetching from outside
sources such as through web spidering or screen-scraping.
Streaming the extracted data and loading it on-the-fly into the
destination database is another way of performing ETL when
no intermediate data storage is required. In general, the
goal of the extraction phase is to convert the data into a single
format which is appropriate for transformation processing.
An intrinsic part of the extraction involves parsing the ex-
tracted data to check whether it meets an expected
pattern or structure. If not, the data may be rejected entirely or
in part.
Glossary
The transform stage applies a series of rules or functions to the
extracted data from the source to derive the data for loading
into the end target. Some data sources will require very little or
even no manipulation of data. In other cases, one or more of the
following transformation types may be required to meet the
business and technical needs of the target database.
The load phase loads the data into the end target, usually the
data warehouse (DW). Depending on the requirements of the
organization, this process varies widely. Some data warehouses
overwrite existing information with cumulative information,
refreshing the extracted data on a daily, weekly or monthly
basis. Other DWs (or even other parts of the same DW) may add
new data in a historicized form, for example hourly. To illus-
trate, consider a DW required to maintain sales records of the
last year: it overwrites any data older than a year with newer
data, while entries within that one-year window are kept in
historicized form. The timing and scope to replace or append are
strategic design choices dependent on the time available and
the business needs. More complex systems can maintain a his-
tory and audit trail of all changes to the data loaded in the DW.
As the load phase interacts with a database, the constraints de-
fined in the database schema – as well as in triggers activated
upon data load – apply (for example, uniqueness, referential in-
tegrity, mandatory fields), which also contribute to the overall
data quality performance of the ETL process.
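As an illustrative sketch, the three ETL phases can be compressed into a few lines of Python, with an in-memory SQLite database standing in for the warehouse target. The table name, column names and sample data are hypothetical:

```python
import csv, io, sqlite3

# Hypothetical CSV export from a source system (the "extract" input).
raw = io.StringIO("id,name,revenue\n1, Alice ,1200\n2,Bob,\n3,Carol,950\n")

# Extract: parse the flat file into rows.
rows = list(csv.DictReader(raw))

# Transform: normalize whitespace and types; reject rows failing the
# expected pattern (here: revenue must be present and numeric).
clean = []
for r in rows:
    if r["revenue"].strip().isdigit():
        clean.append((int(r["id"]), r["name"].strip(), int(r["revenue"])))

# Load: write conforming rows into the target, where the schema's
# constraints (primary key, NOT NULL) still apply.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY,"
           " name TEXT NOT NULL, revenue INTEGER)")
db.executemany("INSERT INTO sales VALUES (?, ?, ?)", clean)
total = db.execute("SELECT SUM(revenue) FROM sales").fetchone()[0]
print(total)  # 2150 (the row with a missing revenue was rejected)
```

Real ETL pipelines add staging areas, incremental loads and audit trails; the phase boundaries, however, are exactly these three.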
FFTW (“Fastest Fourier Transform in the West”)
A software library for computing discrete Fourier transforms
(DFTs), developed by Matteo Frigo and Steven G. Johnson at the
Massachusetts Institute of Technology. FFTW is known as the
fastest free software implementation of the Fast Fourier trans-
form (FFT) algorithm (upheld by regular benchmarks). It can
compute transforms of real and complex-valued arrays of arbi-
trary size and dimension in O(n log n) time.
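FFTW itself is a highly tuned C library; the divide-and-conquer idea behind the O(n log n) bound can be sketched in a few lines of Python (a textbook radix-2 Cooley-Tukey recursion, not FFTW's actual implementation):

```python
import cmath

def fft(x):
    """Recursive radix-2 Cooley-Tukey DFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])          # DFT of even-indexed samples
    odd = fft(x[1::2])           # DFT of odd-indexed samples
    out = [0] * n
    for k in range(n // 2):
        # Twiddle factor combines the two half-size transforms.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

# A constant signal concentrates all energy in frequency bin 0.
spec = fft([1.0, 1.0, 1.0, 1.0])
print([round(abs(c), 6) for c in spec])  # [4.0, 0.0, 0.0, 0.0]
```

Each recursion level does O(n) work over log n levels, which is where the O(n log n) figure in the definition comes from.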
floating point standard (IEEE 754)
The most widely-used standard for floating-point computa-
tion, and is followed by many hardware (CPU and FPU) and soft-
ware implementations. Many computer languages allow or re-
quire that some or all arithmetic be carried out using IEEE 754
formats and operations. The current version is IEEE 754-2008,
which was published in August 2008; the original IEEE 754-1985
was published in 1985. The standard defines arithmetic formats,
interchange formats, rounding algorithms, operations, and ex-
ception handling. The standard also includes extensive recom-
mendations for advanced exception handling, additional opera-
tions (such as trigonometric functions), expression evaluation,
and for achieving reproducible results. The standard defines
single-precision (4-byte), double-precision (8-byte), and 16-byte
quadruple-precision floating point formats; IEEE 754-2008 also
added the 2-byte half-precision (binary16) interchange format,
which had been proposed during the 754r revision.
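The precision difference between the formats is easy to observe. The following sketch round-trips the same value through the binary32 and binary64 interchange formats using Python's struct module:

```python
import struct

# Round-trip 0.1 through the binary32 (single) and binary64 (double)
# interchange formats defined by IEEE 754.
single = struct.unpack('<f', struct.pack('<f', 0.1))[0]
double = struct.unpack('<d', struct.pack('<d', 0.1))[0]

print(f"{single:.17g}")  # 0.10000000149011612  (24-bit significand)
print(f"{double:.17g}")  # 0.10000000000000001  (53-bit significand)

# The formats also reserve encodings for exceptional values such as
# infinities and NaNs, which round-trip as well.
inf = struct.unpack('<f', struct.pack('<f', float('inf')))[0]
print(inf)  # inf
```

Since 0.1 has no finite binary expansion, both formats store a rounded value; the single-precision error is simply visible much earlier.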
FraunhoferFS (FhGFS)
A high-performance parallel file system from the Fraunhofer
Competence Center for High Performance Computing. Built on
scalable multithreaded core components with native → Infini-
Band support, file system nodes can serve → InfiniBand and
Ethernet (or any other TCP-enabled network) connections at the
same time and automatically switch to a redundant connection
path in case any of them fails. One of the most fundamental
concepts of FhGFS is the strict avoidance of architectural bottle-
necks. Striping file contents across multiple storage servers is
only one part of this concept. Another important aspect is the
distribution of file system metadata (e.g. directory information)
across multiple metadata servers. Large systems and metadata
intensive applications in general can greatly profit from the lat-
ter feature.
FhGFS requires no dedicated file system partition on the servers
– it uses existing partitions, formatted with any of the standard
Linux file systems, e.g. XFS or ext4. For larger networks, it is also
possible to create several distinct FhGFS file system partitions
with different configurations. FhGFS provides a coherent mode,
in which it is guaranteed that changes to a file or directory by
one client are always immediately visible to other clients.
Global Arrays (GA)
A library developed by scientists at Pacific Northwest National
Laboratory for parallel computing. GA provides a friendly API
for shared-memory programming on distributed-memory com-
puters for multidimensional arrays. The GA library is a predeces-
sor to the GAS (global address space) languages currently being
developed for high-performance computing. The GA toolkit has
additional libraries including a Memory Allocator (MA), Aggre-
gate Remote Memory Copy Interface (ARMCI), and functionality
for out-of-core storage of arrays (ChemIO). Although GA was ini-
tially developed to run with TCGMSG, a message passing library
that came before the → MPI standard (Message Passing Inter-
face), it is now fully compatible with → MPI. GA includes simple
matrix computations (matrix-matrix multiplication, LU solve)
and works with → ScaLAPACK. Sparse matrices are available but
the implementation is not optimal yet. GA was developed by Jar-
ek Nieplocha, Robert Harrison and R. J. Littlefield. The ChemIO li-
brary for out-of-core storage was developed by Jarek Nieplocha,
Robert Harrison and Ian Foster. The GA library is incorporated
into many quantum chemistry packages, including NWChem,
MOLPRO, UTChem, MOLCAS, and TURBOMOLE. The GA toolkit is
free software, licensed under its own custom license.
Globus Toolkit
An open source toolkit for building computing grids developed
and provided by the Globus Alliance, currently at version 5.
GMP (“GNU Multiple Precision Arithmetic Library”)
A free library for arbitrary-precision arithmetic, operating on
signed integers, rational numbers, and floating point numbers.
There are no practical limits to the precision except the ones
implied by the available memory in the machine GMP runs on
(operand dimension limit is 2^31 bits on 32-bit machines and 2^37
bits on 64-bit machines). GMP has a rich set of functions, and
the functions have a regular interface. The basic interface is
for C but wrappers exist for other languages including C++, C#,
OCaml, Perl, PHP, and Python. In the past, the Kaffe Java virtual
machine used GMP to support Java built-in arbitrary precision
arithmetic. This feature has been removed from recent releases,
causing protests from people who claim that they used Kaffe
solely for the speed benefits afforded by GMP. As a result, GMP
support has been added to GNU Classpath. The main target ap-
plications of GMP are cryptography applications and research,
Internet security applications, and computer algebra systems.
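GMP's C API is out of scope here, but the core idea of arbitrary-precision integer arithmetic can be demonstrated with Python's built-in bignum support (CPython uses its own implementation, not GMP, yet the behaviour mirrors GMP's mpz type):

```python
# Results grow as needed, limited only by available memory --
# no overflow at the machine word size.
factorial_100 = 1
for i in range(2, 101):
    factorial_100 *= i

print(factorial_100.bit_length())  # 525 -- far beyond any 64-bit word
print(factorial_100 % 10**24)      # 0 -- 100! has exactly 24 trailing zeros
```

A fixed-width 64-bit integer overflows already at 21!, which is why cryptography and computer algebra systems, GMP's main target applications, need bignum arithmetic.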
GotoBLAS
Kazushige Goto’s implementation of → BLAS.
grid (in CUDA architecture)
Part of the → CUDA programming model
GridFTP
An extension of the standard File Transfer Protocol (FTP) for use
with Grid computing. It is defined as part of the → Globus toolkit,
under the organisation of the Global Grid Forum (specifically, by
the GridFTP working group). The aim of GridFTP is to provide a
more reliable and high performance file transfer for Grid com-
puting applications. This is necessary because of the increased
demands of transmitting data in Grid computing – it is frequent-
ly necessary to transmit very large files, and this needs to be
done fast and reliably. GridFTP is the answer to the problem of
incompatibility between storage and access systems. Previous-
ly, each data provider would make their data available in their
own specific way, providing a library of access functions. This
made it difficult to obtain data from multiple sources, requiring
a different access method for each, and thus dividing the total
available data into partitions. GridFTP provides a uniform way
of accessing the data, encompassing functions from all the dif-
ferent modes of access, building on and extending the univer-
sally accepted FTP standard. FTP was chosen as a basis for it
because of its widespread use, and because it has a well defined
architecture for extensions to the protocol (which may be dy-
namically discovered).
Hierarchical Data Format (HDF)
A set of file formats and libraries designed to store and orga-
nize large amounts of numerical data. Originally developed at
the National Center for Supercomputing Applications, it is cur-
rently supported by the non-profit HDF Group, whose mission
is to ensure continued development of HDF5 technologies, and
the continued accessibility of data currently stored in HDF. In
keeping with this goal, the HDF format, libraries and associated
tools are available under a liberal, BSD-like license for general
use. HDF is supported by many commercial and non-commer-
cial software platforms, including Java, MATLAB, IDL, and Py-
thon. The freely available HDF distribution consists of the li-
brary, command-line utilities, test suite source, Java interface,
and the Java-based HDF Viewer (HDFView). There currently exist
two major versions of HDF, HDF4 and HDF5, which differ signifi-
cantly in design and API.
HLSL (“High Level Shader Language“)
The High Level Shader Language or High Level Shading Lan-
guage (HLSL) is a proprietary shading language developed by
Microsoft for use with the Microsoft Direct3D API. It is analo-
gous to the GLSL shading language used with the OpenGL stan-
dard. It is very similar to the NVIDIA Cg shading language, as it
was developed alongside it.
HLSL programs come in three forms: vertex shaders, geometry
shaders, and pixel (or fragment) shaders. A vertex shader is ex-
ecuted for each vertex that is submitted by the application, and
is primarily responsible for transforming the vertex from object
space to view space, generating texture coordinates, and calcu-
lating lighting coefficients such as the vertex’s tangent, binor-
mal and normal vectors. When a group of vertices (normally 3,
to form a triangle) come through the vertex shader, their out-
put position is interpolated to form pixels within its area; this
process is known as rasterisation. Each of these pixels comes
through the pixel shader, whereby the resultant screen colour
is calculated. Optionally, an application using a Direct3D10
interface and Direct3D10 hardware may also specify a geometry
shader. This shader takes as its input the three vertices of a tri-
angle and uses this data to generate (or tessellate) additional
triangles, which are each then sent to the rasterizer.
InfiniBand
Switched fabric communications link primarily used in HPC and
enterprise data centers. Its features include high throughput,
low latency, quality of service and failover, and it is designed to
be scalable. The InfiniBand architecture specification defines a
connection between processor nodes and high performance
I/O nodes such as storage devices. Like → PCI Express, and many
other modern interconnects, InfiniBand offers point-to-point
bidirectional serial links intended for the connection of pro-
cessors with high-speed peripherals such as disks. On top of
the point to point capabilities, InfiniBand also offers multicast
operations as well. It supports several signalling rates and, as
with PCI Express, links can be bonded together for additional
throughput.
The SDR serial connection’s signalling rate is 2.5 gigabit per sec-
ond (Gbit/s) in each direction per connection. DDR is 5 Gbit/s
and QDR is 10 Gbit/s. FDR is 14.0625 Gbit/s and EDR is 25.78125
Gbit/s per lane. For SDR, DDR and QDR, links use 8B/10B encod-
ing – every 10 bits sent carry 8 bits of data – making the use-
ful data transmission rate four-fifths the raw rate. Thus single,
double, and quad data rates carry 2, 4, or 8 Gbit/s useful data,
respectively. For FDR and EDR, links use 64B/66B encoding – ev-
ery 66 bits sent carry 64 bits of data.
Implementers can aggregate links in units of 4 or 12, called 4X or
12X. A 12X QDR link therefore carries 120 Gbit/s raw, or 96 Gbit/s
of useful data. As of 2009 most systems use a 4X aggregate, im-
plying a 10 Gbit/s (SDR), 20 Gbit/s (DDR) or 40 Gbit/s (QDR) con-
nection. Larger systems with 12X links are typically used for
cluster and supercomputer interconnects and for inter-switch
connections.
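The useful data rates quoted above follow directly from the raw signalling rates and the encoding efficiencies. A few lines of Python reproduce the figures in the text (rates and encodings as given there):

```python
# Useful data rate = raw signalling rate x encoding efficiency x lanes.
# SDR/DDR/QDR use 8b/10b encoding (8 payload bits per 10 line bits);
# FDR/EDR use the denser 64b/66b encoding.
raw_per_lane = {"SDR": 2.5, "DDR": 5.0, "QDR": 10.0,
                "FDR": 14.0625, "EDR": 25.78125}   # Gbit/s
efficiency = {"SDR": 8/10, "DDR": 8/10, "QDR": 8/10,
              "FDR": 64/66, "EDR": 64/66}

def useful_gbits(rate, lanes):
    """Useful (payload) bandwidth in Gbit/s for a given rate and lane count."""
    return raw_per_lane[rate] * efficiency[rate] * lanes

print(useful_gbits("QDR", 4))   # 32.0 -- the common 4X QDR "40 Gbit/s" link
print(useful_gbits("QDR", 12))  # 96.0 -- matches the 12X figure in the text
```

Marketing names usually quote the raw rate (40 Gbit/s QDR 4X), while applications see the payload rate (32 Gbit/s).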
The single data rate switch chips have a latency of 200 nano-
seconds, DDR switch chips have a latency of 140 nanoseconds
and QDR switch chips have a latency of 100 nanoseconds. The
end-to-end MPI latency ranges from 1.07 microseconds to 2.6
microseconds, depending on the host channel adapter.
As of 2009 various InfiniBand host channel adapters (HCA) exist
in the market, each with different latency and bandwidth char-
acteristics. InfiniBand also provides RDMA capabilities for low
CPU overhead. The latency for RDMA operations is less than 1
microsecond.
See the main article “InfiniBand” for further description of In-
finiBand features.
Intel Integrated Performance Primitives (Intel IPP)
A multi-threaded software library of functions for multimedia
and data processing applications, produced by Intel. The library
supports Intel and compatible processors and is available for
Windows, Linux, and Mac OS X operating systems. It is available
separately or as a part of Intel Parallel Studio. The library takes
advantage of processor features including MMX, SSE, SSE2,
SSE3, SSSE3, SSE4, AES-NI and multicore processors. Intel IPP is
divided into four major processing groups: Signal (with linear ar-
ray or vector data), Image (with 2D arrays for typical color spac-
es), Matrix (with n x m arrays for matrix operations), and Cryp-
tography. Half the entry points are of the matrix type, a third
are of the signal type and the remainder are of the image and
cryptography types. Intel IPP functions operate on a number of
data types, including 8u (8-bit unsigned), 8s (8-bit signed), 16s,
32f (32-bit floating-point), 64f, etc. Typically, an application
developer works with only one dominant data type for most
processing functions, converting between input to processing
to output formats at the end points. Version 5.2 was introduced
June 5, 2007, adding code samples for data compression, new
video codec support, support for 64-bit applications on Mac OS
X, support for Windows Vista, and new functions for ray-tracing
and rendering. Version 6.1 was released with the Intel C++ Com-
piler on June 28, 2009 and Update 1 for version 6.1 was released
on July 28, 2009.
Intel Threading Building Blocks (TBB)
A C++ template library developed by Intel Corporation for writ-
ing software programs that take advantage of multi-core pro-
cessors. The library consists of data structures and algorithms
that allow a programmer to avoid some complications arising
from the use of native threading packages such as POSIX →
threads, Windows → threads, or the portable Boost Threads in
which individual → threads of execution are created, synchro-
nized, and terminated manually. Instead the library abstracts
access to the multiple processors by allowing the operations
to be treated as “tasks”, which are allocated to individual cores
dynamically by the library’s run-time engine, and by automat-
ing efficient use of the CPU cache. A TBB program creates, syn-
chronizes and destroys graphs of dependent tasks according
to algorithms, i.e. high-level parallel programming paradigms
(a.k.a. Algorithmic Skeletons). Tasks are then executed respect-
ing graph dependencies. This approach groups TBB in a family
of solutions for parallel programming aiming to decouple the
programming from the particulars of the underlying machine.
Intel TBB is available commercially as a binary distribution with
support and in open source in both source and binary forms.
Version 4.0 was introduced on September 8, 2011.
iSER (“iSCSI Extensions for RDMA“)
A protocol that maps the iSCSI protocol over a network that
provides RDMA services (like → iWARP or → InfiniBand). This
permits data to be transferred directly into SCSI I/O buffers
without intermediate data copies. The Datamover Architecture
(DA) defines an abstract model in which the movement of data
between iSCSI end nodes is logically separated from the rest of
the iSCSI protocol. iSER is one Datamover protocol. The inter-
face between the iSCSI and a Datamover protocol, iSER in this
case, is called Datamover Interface (DI).
InfiniBand data rates by link width (see the → InfiniBand entry):
      SDR        DDR        QDR        FDR         EDR
1X    2 Gbit/s   4 Gbit/s   8 Gbit/s   14 Gbit/s   25 Gbit/s
4X    8 Gbit/s   16 Gbit/s  32 Gbit/s  56 Gbit/s   100 Gbit/s
12X   24 Gbit/s  48 Gbit/s  96 Gbit/s  168 Gbit/s  300 Gbit/s
(SDR: Single, DDR: Double, QDR: Quadruple, FDR: Fourteen,
EDR: Enhanced Data Rate)
iWARP (“Internet Wide Area RDMA Protocol”)
An Internet Engineering Task Force (IETF) update of the RDMA
Consortium’s → RDMA over TCP standard. This latter standard is
zero-copy transmission over legacy TCP. Because a kernel imple-
mentation of the TCP stack is a tremendous bottleneck, a few
vendors now implement TCP in hardware. This additional hard-
ware is known as the TCP offload engine (TOE). TOE itself does
not prevent copying on the receive side, and must be combined
with RDMA hardware for zero-copy results. The main compo-
nent is the Data Direct Protocol (DDP), which permits the ac-
tual zero-copy transmission. The transmission itself is not per-
formed by DDP, but by TCP.
kernel (in CUDA architecture)
Part of the → CUDA programming model
LAM/MPI
A high-quality open-source implementation of the → MPI speci-
fication, including all of MPI-1.2 and much of MPI-2. Superseded
by the → OpenMPI implementation.
LAPACK (“linear algebra package”)
Routines for solving systems of simultaneous linear equations,
least-squares solutions of linear systems of equations, eigen-
value problems, and singular value problems. The original goal
of the LAPACK project was to make the widely used EISPACK and
→ LINPACK libraries run efficiently on shared-memory vector
and parallel processors. LAPACK routines are written so that as
much as possible of the computation is performed by calls to
the → BLAS library. While → LINPACK and EISPACK are based on
the vector operation kernels of the Level 1 BLAS, LAPACK was
designed at the outset to exploit the Level 3 BLAS. Highly effi-
cient machine-specific implementations of the BLAS are avail-
able for many modern high-performance computers. The BLAS
enable LAPACK routines to achieve high performance with por-
table software.
layout
Part of the → parallel NFS standard. Currently three types of
layout exist: file-based, block/volume-based, and object-based,
the latter making use of → object-based storage devices
LINPACK
A collection of Fortran subroutines that analyze and solve linear
equations and linear least-squares problems. LINPACK was de-
signed for supercomputers in use in the 1970s and early 1980s.
LINPACK has been largely superseded by → LAPACK, which has
been designed to run efficiently on shared-memory, vector su-
percomputers. LINPACK makes use of the → BLAS libraries for
performing basic vector and matrix operations.
The LINPACK benchmarks are a measure of a system’s float-
ing point computing power and measure how fast a computer
solves a dense N by N system of linear equations Ax=b, which is a
common task in engineering. The solution is obtained by Gauss-
ian elimination with partial pivoting, with 2/3•N³ + 2•N² floating
point operations. The result is reported in millions of floating
point operations per second (MFLOP/s, sometimes simply called
MFLOPS).
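The benchmark's core operation, Gaussian elimination with partial pivoting, can be sketched in plain Python. This is illustrative only; real LINPACK/HPL code is blocked and vectorized via the → BLAS:

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting on a dense system Ax = b."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix [A|b]
    for col in range(n):
        # Partial pivoting: swap the row with the largest remaining
        # entry in this column onto the diagonal, for stability.
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        # Eliminate the column below the pivot.
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    # Back-substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

print(solve([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0]))  # [0.8, 1.4]
```

The triple loop is where the (2/3)·N³ flops come from; the benchmark divides that count by the elapsed time to report MFLOP/s.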
LNET
Communication protocol in → Lustre
logical object volume (LOV)
A logical entity in → Lustre
Lustre
An object-based → parallel file system
management server (MGS)
A functional component in → Lustre
MapReduce
A framework for processing highly distributable problems
across huge datasets using a large number of computers
(nodes), collectively referred to as a cluster (if all nodes use the
same hardware) or a grid (if the nodes use different hardware).
Computational processing can occur on data stored either in a
filesystem (unstructured) or in a database (structured).
“Map” step: The master node takes the input, divides it into
smaller sub-problems, and distributes them to worker nodes. A
worker node may do this again in turn, leading to a multi-level
tree structure. The worker node processes the smaller problem,
and passes the answer back to its master node.
“Reduce” step: The master node then collects the answers to all
the sub-problems and combines them in some way to form the
output – the answer to the problem it was originally trying to
solve.
MapReduce allows for distributed processing of the map and
reduction operations. Provided each mapping operation is in-
dependent of the others, all maps can be performed in parallel
– though in practice it is limited by the number of independent
data sources and/or the number of CPUs near each source. Simi-
larly, a set of ‘reducers’ can perform the reduction phase – pro-
vided all outputs of the map operation that share the same key
are presented to the same reducer at the same time. While this
process can often appear inefficient compared to algorithms
that are more sequential, MapReduce can be applied to signifi-
cantly larger datasets than “commodity” servers can handle – a
large server farm can use MapReduce to sort a petabyte of data
in only a few hours. The parallelism also offers some possibility
of recovering from partial failure of servers or storage during
the operation: if one mapper or reducer fails, the work can be
rescheduled – assuming the input data is still available.
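A minimal word-count sketch illustrates the map, shuffle and reduce steps described above. The function names are illustrative, not any particular framework's API:

```python
from collections import defaultdict

# Map step: each "worker" turns its chunk of input into (key, value) pairs.
def map_chunk(text):
    return [(word, 1) for word in text.split()]

# Shuffle: group all values sharing a key, as the framework does
# before handing each group to a single reducer.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce step: combine each key's values into the final answer.
def reduce_group(key, values):
    return key, sum(values)

chunks = ["to be or", "not to be"]  # input split across workers
mapped = [p for chunk in chunks for p in map_chunk(chunk)]
result = dict(reduce_group(k, v) for k, v in shuffle(mapped).items())
print(result)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Because each map call touches only its own chunk, the map phase parallelizes trivially; the shuffle is the only global data movement.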
metadata server (MDS)
A functional component in → Lustre
metadata target (MDT)
A logical entity in → Lustre
MKL (“Math Kernel Library”)
A library of optimized, math routines for science, engineering,
and financial applications developed by Intel. Core math func-
tions include → BLAS, → LAPACK, → ScaLAPACK, Sparse Solvers,
Fast Fourier Transforms, and Vector Math. The library supports
Intel and compatible processors and is available for Windows,
Linux and Mac OS X operating systems.
MPI, MPI-2 (“message-passing interface”)
A language-independent communications protocol used to
program parallel computers. Both point-to-point and collective
communication are supported. MPI remains the dominant mod-
el used in high-performance computing today. There are two
versions of the standard that are currently popular: version 1.2
(shortly called MPI-1), which emphasizes message passing and
has a static runtime environment, and MPI-2.1 (MPI-2), which in-
cludes new features such as parallel I/O, dynamic process man-
agement and remote memory operations. MPI-2 specifies over
500 functions and provides language bindings for ANSI C, ANSI
Fortran (Fortran90), and ANSI C++. Interoperability of objects de-
fined in MPI was also added to allow for easier mixed-language
message passing programming. A side effect of MPI-2 stan-
dardization (completed in 1996) was clarification of the MPI-1
standard, creating the MPI-1.2 level. MPI-2 is mostly a superset
of MPI-1, although some functions have been deprecated. Thus
MPI-1.2 programs still work under MPI implementations com-
pliant with the MPI-2 standard. The MPI Forum reconvened in
2007, to clarify some MPI-2 issues and explore developments for
a possible MPI-3.
MPICH2
A freely available, portable → MPI 2.0 implementation, main-
tained by Argonne National Laboratory
MPP (“massively parallel processing”)
So-called MPP jobs are computer programs with several parts
running on several machines in parallel, often calculating simu-
lation problems. The communication between these parts can
e.g. be realized by the → MPI software interface.
MS-MPI
Microsoft → MPI 2.0 implementation shipped with Microsoft
HPC Pack 2008 SDK, based on and designed for maximum com-
patibility with the → MPICH2 reference implementation.
MVAPICH2
An → MPI 2.0 implementation based on → MPICH2 and devel-
oped by the Department of Computer Science and Engineer-
ing at Ohio State University. It is available under BSD licens-
ing and supports MPI over InfiniBand, 10GigE/iWARP and
RDMAoE.
NetCDF (“Network Common Data Form”)
A set of software libraries and self-describing, machine-inde-
pendent data formats that support the creation, access, and
sharing of array-oriented scientific data. The project homep-
age is hosted by the Unidata program at the University Cor-
poration for Atmospheric Research (UCAR). They are also the
chief source of NetCDF software, standards development,
updates, etc. The format is an open standard. NetCDF Clas-
sic and 64-bit Offset Format are an international standard
of the Open Geospatial Consortium. The project is actively
supported by UCAR. The recently released (2008) version 4.0
greatly enhances the data model by allowing the use of the
→ HDF5 data file format. Version 4.1 (2010) adds support for
C and Fortran client access to specified subsets of remote
data via OPeNDAP. The format was originally based on the
conceptual model of the NASA CDF but has since diverged
and is not compatible with it. It is commonly used in clima-
tology, meteorology and oceanography applications (e.g.,
weather forecasting, climate change) and GIS applications. It
is an input/output format for many GIS applications, and for
general scientific data exchange. The NetCDF C library, and
the libraries based on it (Fortran 77 and Fortran 90, C++, and
all third-party libraries) can, starting with version 4.1.1, read
some data in other data formats. Data in the → HDF5 format
can be read, with some restrictions. Data in the → HDF4 for-
mat can be read by the NetCDF C library if created using the
→ HDF4 Scientific Data (SD) API.
NetworkDirect
A remote direct memory access (RDMA)-based network interface
implemented in Windows Server 2008 and later. NetworkDirect
uses a more direct path from → MPI applications to networking
hardware, resulting in very fast and efficient networking. See the
main article “Windows HPC Server 2008 R2” for further details.
NFS (Network File System)
A network file system protocol originally developed by Sun Mi-
crosystems in 1984, allowing a user on a client computer to ac-
cess files over a network in a manner similar to how local stor-
age is accessed. NFS, like many other protocols, builds on the
Open Network Computing Remote Procedure Call (ONC RPC)
system. The Network File System is an open standard defined in
RFCs, allowing anyone to implement the protocol.
Sun used version 1 only for in-house experimental purposes.
When the development team added substantial changes to NFS
version 1 and released it outside of Sun, they decided to release
the new version as V2, so that version interoperation and RPC
version fallback could be tested. Version 2 of the protocol (de-
fined in RFC 1094, March 1989) originally operated entirely over
UDP. Its designers meant to keep the protocol stateless, with
locking (for example) implemented outside of the core protocol.
Version 3 (RFC 1813, June 1995) added:
� support for 64-bit file sizes and offsets, to handle files larger
than 2 gigabytes (GB)
� support for asynchronous writes on the server, to improve
write performance
� additional file attributes in many replies, to avoid the need
to re-fetch them
� a READDIRPLUS operation, to get file handles and attributes
along with file names when scanning a directory
� assorted other improvements
At the time of introduction of Version 3, vendor support for TCP
as a transport-layer protocol began increasing. While several
vendors had already added support for NFS Version 2 with TCP
as a transport, Sun Microsystems added support for TCP as a
transport for NFS at the same time it added support for Version 3.
Using TCP as a transport made using NFS over a WAN more fea-
sible.
Version 4 (RFC 3010, December 2000; revised in RFC 3530, April
2003), influenced by AFS and CIFS, includes performance im-
provements, mandates strong security, and introduces a state-
ful protocol. Version 4 became the first version developed with
the Internet Engineering Task Force (IETF) after Sun Microsys-
tems handed over the development of the NFS protocols.
NFS version 4 minor version 1 (NFSv4.1) was approved by the
IESG and published as an RFC in January 2010. The NFSv4.1
specification aims to provide protocol support to take advan-
tage of clustered server deployments, including the ability to
provide scalable parallel access to files distributed among
multiple servers. NFSv4.1 adds the parallel NFS (pNFS) capabil-
ity, which enables data access parallelism. The NFSv4.1 protocol
defines a method of separating the filesystem meta-data from
the location of the file data; it goes beyond the simple name/
data separation by striping the data amongst a set of data serv-
ers. This is different from the traditional NFS server which holds
the names of files and their data under the single umbrella of
the server.
In addition to pNFS, NFSv4.1 provides sessions, directory del-
egation and notifications, multi-server namespace, access con-
trol lists (ACL/SACL/DACL), retention attributions, and SECIN-
FO_NO_NAME. See the main article “Parallel Filesystems” for
further details.
Current work is being done in preparing a draft for a future ver-
sion 4.2 of the NFS standard, including so-called federated file-
systems, which constitute the NFS counterpart of Microsoft’s
distributed filesystem (DFS).
NUMA (“non-uniform memory access”)
A computer memory design used in multiprocessors, where the
memory access time depends on the memory location relative
to a processor. Under NUMA, a processor can access its own local
memory faster than non-local memory, that is, memory local to
another processor or memory shared between processors.
object storage server (OSS)
A functional component in → Lustre
object storage target (OST)
A logical entity in → Lustre
object-based storage device (OSD)
An intelligent evolution of disk drives that can store and serve
objects rather than simply place data on tracks and sectors.
This task is accomplished by moving low-level storage func-
tions into the storage device and accessing the device through
an object interface. Unlike a traditional block-oriented device
providing access to data organized as an array of unrelated
blocks, an object store allows access to data by means of stor-
age objects. A storage object is a virtual entity that groups data
together that has been determined by the user to be logically
related. Space for a storage object is allocated internally by the
OSD itself instead of by a host-based file system. OSDs man-
age all necessary low-level storage, space management, and
security functions. Because there is no host-based metadata
for an object (such as inode information), the only way for an
application to retrieve an object is by using its object identifier
(OID). The SCSI interface was modified and extended by the OSD
Technical Work Group of the Storage Networking Industry Asso-
ciation (SNIA) with varied industry and academic contributors,
resulting in a draft standard to T10 in 2004. This standard was
ratified in September 2004 and became the ANSI T10 SCSI OSD
V1 command set, released as INCITS 400-2004. The SNIA group
continues to work on further extensions to the interface, such
as the ANSI T10 SCSI OSD V2 command set.
OLAP cube (“Online Analytical Processing”)
A set of data, organized in a way that facilitates non-predeter-
mined queries for aggregated information, or in other words,
online analytical processing. OLAP is one of the computer-
based techniques for analyzing business data that are collec-
tively called business intelligence. OLAP cubes can be thought
of as extensions to the two-dimensional array of a spreadsheet.
For example, a company might wish to analyze some financial
data by product, by time-period, by city, by type of revenue and
cost, and by comparing actual data with a budget. These addi-
tional methods of analyzing the data are known as dimensions.
Because there can be more than three dimensions in an OLAP
system, the term hypercube is sometimes used.
OpenCL (“Open Computing Language”)
A framework for writing programs that execute across hetero-
geneous platforms consisting of CPUs, GPUs, and other proces-
sors. OpenCL includes a language (based on C99) for writing ker-
nels (functions that execute on OpenCL devices), plus APIs that
are used to define and then control the platforms. OpenCL pro-
vides parallel computing using task-based and data-based par-
allelism. OpenCL is analogous to the open industry standards
OpenGL and OpenAL, for 3D graphics and computer audio,
respectively. Originally developed by Apple Inc., which holds
trademark rights, OpenCL is now managed by the non-profit
technology consortium Khronos Group.
OpenMP (“Open Multi-Processing”)
An application programming interface (API) that supports
multi-platform shared memory multiprocessing programming
in C, C++ and Fortran on many architectures, including Unix and
Microsoft Windows platforms. It consists of a set of compiler di-
rectives, library routines, and environment variables that influ-
ence run-time behavior.
Jointly defined by a group of major computer hardware and
software vendors, OpenMP is a portable, scalable model that
gives programmers a simple and flexible interface for develop-
ing parallel applications for platforms ranging from the desk-
top to the supercomputer.
An application built with the hybrid model of parallel program-
ming can run on a computer cluster using both OpenMP and
Message Passing Interface (MPI), or more transparently through
the use of OpenMP extensions for non-shared memory systems.
OpenMPI
An open source → MPI-2 implementation that is developed and
maintained by a consortium of academic, research, and indus-
try partners.
parallel NFS (pNFS)
A → parallel file system standard, optional part of the current →
NFS standard 4.1. See the main article “Parallel Filesystems” for
further details.
PCI Express (PCIe)
A computer expansion card standard designed to replace the
older PCI, PCI-X, and AGP standards. Introduced by Intel in 2004,
PCIe (or PCI-E, as it is commonly called) is the latest standard
for expansion cards that is available on mainstream computers.
PCIe, unlike previous PC expansion standards, is structured
around point-to-point serial links; a pair of links (one in each
direction) makes up a lane, rather than a shared parallel bus.
These lanes are routed by a hub on the main-board acting as a
crossbar switch. The per-direction bandwidth for common lane
counts is:

Lanes  PCIe 1.x  PCIe 2.x  PCIe 3.0  PCIe 4.0
x1     256 MB/s  512 MB/s  1 GB/s    2 GB/s
x2     512 MB/s  1 GB/s    2 GB/s    4 GB/s
x4     1 GB/s    2 GB/s    4 GB/s    8 GB/s
x8     2 GB/s    4 GB/s    8 GB/s    16 GB/s
x16    4 GB/s    8 GB/s    16 GB/s   32 GB/s
x32    8 GB/s    16 GB/s   32 GB/s   64 GB/s

This dynamic point-to-point behavior allows more than one pair
of devices to communicate with each other at the same time. In
contrast, older PC interfaces had all devices permanently wired
to the same bus; therefore, only one device
could send information at a time. This format also allows “chan-
nel grouping”, where multiple lanes are bonded to a single de-
vice pair in order to provide higher bandwidth. The number of
lanes is “negotiated” during power-up or explicitly during op-
eration. By making the lane count flexible, a single standard can
provide for the needs of high-bandwidth cards (e.g. graphics
cards, 10 Gigabit Ethernet cards and multiport Gigabit Ethernet
cards) while also being economical for less demanding cards.
Unlike preceding PC expansion interface standards, PCIe is a
network of point-to-point connections. This removes the need
for “arbitrating” the bus or waiting for the bus to be free and
allows for full duplex communications. This means that while
standard PCI-X (133 MHz 64 bit) and PCIe x4 have roughly the
same data transfer rate, PCIe x4 will give better performance
if multiple device pairs are communicating simultaneously or
if communication within a single device pair is bidirectional.
Specifications of the format are maintained and developed by a
group of more than 900 industry-leading companies called the
PCI-SIG (PCI Special Interest Group). In PCIe 1.x, each lane carries
approximately 250 MB/s. PCIe 2.0, released in late 2007, adds a
Gen2-signalling mode, doubling the rate to about 500 MB/s. On
November 18, 2010, the PCI Special Interest Group officially pub-
lished the finalized PCI Express 3.0 specification to its members
to build devices based on this new version of PCI Express, which
allows for a Gen3-signalling mode at about 1 GB/s per lane.
On November 29, 2011, the PCI-SIG announced plans to proceed to
PCI Express 4.0 featuring 16 GT/s, still on copper technology.
Additionally, active and idle power optimizations are to be in-
vestigated. Final specifications are expected to be released in
2014/15.
PETSc (“Portable, Extensible Toolkit for Scientific Computation”)
A suite of data structures and routines for the scalable (parallel)
solution of scientific applications modeled by partial differential
equations. It employs the → Message Passing Interface (MPI) stan-
dard for all message-passing communication. The current version
of PETSc is 3.2, released September 8, 2011. PETSc is intended
for use in large-scale application projects; many ongoing compu-
tational science projects are built around the PETSc libraries. Its
careful design allows advanced users to have detailed control
over the solution process. PETSc includes a large suite of parallel
linear and nonlinear equation solvers that are easily used in ap-
plication codes written in C, C++, Fortran and now Python. PETSc
provides many of the mechanisms needed within parallel appli-
cation code, such as simple parallel matrix and vector assembly
routines that allow the overlap of communication and computa-
tion. In addition, PETSc includes support for parallel distributed
arrays useful for finite difference methods.
process
→ thread
PTX (“parallel thread execution”)
Parallel Thread Execution (PTX) is a pseudo-assembly language
used in NVIDIA’s CUDA programming environment. The ‘nvcc’
compiler translates code written in CUDA, a C-like language, into
PTX, and the graphics driver contains a compiler which translates
the PTX into native machine instructions that run on the GPU’s
processing cores.
RDMA (“remote direct memory access”)
Allows data to move directly from the memory of one computer
into that of another without involving either one‘s operating
system. This permits high-throughput, low-latency networking,
which is especially useful in massively parallel computer clus-
ters. RDMA extends the concept of direct memory access (DMA)
across the network. RDMA
supports zero-copy networking by enabling the network adapt-
er to transfer data directly to or from application memory, elimi-
nating the need to copy data between application memory and
the data buffers in the operating system. Such transfers require
no work to be done by CPUs, caches, or context switches, and
transfers continue in parallel with other system operations.
When an application performs an RDMA Read or Write request,
the application data is delivered directly to the network, reduc-
ing latency and enabling fast message transfer. Common RDMA
implementations include → InfiniBand, → iSER, and → iWARP.
RISC (“reduced instruction-set computer”)
A CPU design strategy emphasizing the insight that simplified
instructions that “do less” may still provide for higher perfor-
mance if this simplicity can be utilized to make instructions ex-
ecute very quickly. Opposite: → CISC.
ScaLAPACK (“scalable LAPACK”)
Library including a subset of → LAPACK routines redesigned
for distributed memory MIMD (multiple instruction, multiple
data) parallel computers. It is currently written in a Single-
Program-Multiple-Data style using explicit message passing
for interprocessor communication. ScaLAPACK is designed for
heterogeneous computing and is portable to any computer
that supports → MPI. The fundamental building blocks of the
ScaLAPACK library are distributed memory versions (PBLAS) of
the Level 1, 2 and 3 → BLAS, and a set of Basic Linear Algebra
Communication Subprograms (BLACS) for communication tasks
that arise frequently in parallel linear algebra computations. In
the ScaLAPACK routines, all interprocessor communication oc-
curs within the PBLAS and the BLACS. One of the design goals of
ScaLAPACK was to have the ScaLAPACK routines resemble their
→ LAPACK equivalents as much as possible.
service-oriented architecture (SOA)
An approach to building distributed, loosely coupled applica-
tions in which functions are separated into distinct services
that can be distributed over a network, combined, and reused.
See the main article “Windows HPC Server 2008 R2” for further
details.
single precision/double precision
→ floating point standard
SMP (“shared memory processing”)
So-called SMP jobs are computer programs with several
parts running on the same system and accessing a shared
memory region. SMP jobs are usually implemented as
→ multi-threaded programs. The communication between the
individual threads can be realized e.g. by the → OpenMP software
interface standard, but also in a non-standard way by means of
native UNIX interprocess communication mechanisms.
SMP (“symmetric multiprocessing”)
A multiprocessor or multicore computer architecture where
two or more identical processors or cores can connect to a
single shared main memory in a completely symmetric way, i.e.
each part of the main memory has the same distance to each of
the cores. Opposite: → NUMA
storage access protocol
Part of the → parallel NFS standard
STREAM
A simple synthetic benchmark program that measures sustain-
able memory bandwidth (in MB/s) and the corresponding com-
putation rate for simple vector kernels.
streaming multiprocessor (SM)
Hardware component within the → Tesla GPU series
subnet manager
Application responsible for configuring the local → InfiniBand
subnet and ensuring its continued operation.
superscalar processors
A superscalar CPU architecture implements a form of parallel-
ism called instruction-level parallelism within a single proces-
sor. It thereby allows faster CPU throughput than would other-
wise be possible at the same clock rate. A superscalar processor
executes more than one instruction during a clock cycle by si-
multaneously dispatching multiple instructions to redundant
functional units on the processor. Each functional unit is not
a separate CPU core but an execution resource within a single
CPU such as an arithmetic logic unit, a bit shifter, or a multiplier.
Tesla
NVIDIA‘s third brand of GPUs, based on high-end GPUs from the
G80 series onwards. Tesla is NVIDIA‘s first dedicated General Purpose
GPU. Because of the very high computational power (measured
in floating point operations per second or FLOPS) compared to
recent microprocessors, the Tesla products are intended for the
HPC market. The primary function of Tesla products is to aid in
simulations, large scale calculations (especially floating-point
calculations), and image generation for professional and scien-
tific fields, with the use of → CUDA. See the main article “NVIDIA
GPU Computing” for further details.
thread
A thread of execution is a fork of a computer program into two
or more concurrently running tasks. The implementation of
threads and processes differs from one operating system to an-
other, but in most cases, a thread is contained inside a process.
On a single processor, multithreading generally occurs by multi-
tasking: the processor switches between different threads. On
a multiprocessor or multi-core system, the threads or tasks will
generally run at the same time, with each processor or core run-
ning a particular thread or task. Threads are distinguished from
processes in that processes are typically independent, while
threads exist as subsets of a process. Whereas processes have
separate address spaces, threads share their address space,
which makes inter-thread communication much easier than
classical inter-process communication (IPC).
thread (in CUDA architecture)
Part of the → CUDA programming model
thread block (in CUDA architecture)
Part of the → CUDA programming model
thread processor array (TPA)
Hardware component within the → Tesla GPU series
10 Gigabit Ethernet
The fastest of the Ethernet standards, first published in 2002
as IEEE Std 802.3ae-2002. It defines a version of Ethernet with
a nominal data rate of 10 Gbit/s, ten times as fast as Gigabit
Ethernet. Over the years several 802.3 standards relating to
10GbE have been published, which later were consolidated
into the IEEE 802.3-2005 standard. IEEE 802.3-2005 and the other
amendments have been consolidated into IEEE Std 802.3-2008.
10 Gigabit Ethernet supports only full-duplex links, which can
be connected by switches. Half-duplex operation and CSMA/
CD (carrier sense multiple access with collision detection) are not
supported in 10GbE. The 10 Gigabit Ethernet standard encom-
passes a number of different physical layer (PHY) standards. As
of 2008, 10 Gigabit Ethernet was still an emerging technology with
only 1 million ports shipped in 2007, and it remains to be seen
which of the PHYs will gain widespread commercial acceptance.
warp (in CUDA architecture)
Part of the → CUDA programming model
Always keep in touch with the latest news. Visit us on the Web!
Here you will find comprehensive information about HPC, IT solutions for the datacenter, services and
high-performance, efficient IT systems.
Subscribe to our technology journals, E-News or the transtec newsletter and always stay up to date.
www.transtec.de/en/hpc
transtec Germany
Tel +49 (0) 7121/2678 - 400
www.transtec.de
transtec Switzerland
Tel +41 (0) 44 / 818 47 00
www.transtec.ch
transtec United Kingdom
Tel +44 (0) 1295 / 756 500
www.transtec.co.uk
ttec Netherlands
Tel +31 (0) 24 34 34 210
www.ttec.nl
transtec France
Tel +33 (0) 3.88.55.16.00
www.transtec.fr
Texts and concept: Dr. Oliver Tennert | [email protected]
Layout and design: Silke Lerche | [email protected]
Fanny Schwarz | [email protected]
© transtec AG, March 2015
The graphics, diagrams and tables found herein are the intellectual property of transtec AG and may be reproduced or published only with its express permission.
No responsibility will be assumed for inaccuracies or omissions. Other names or logos may be trademarks of their respective owners.