
How the IBM Platform LSF Architecture Accelerates Technical Computing


Executive Summary

Advances in High Performance Computing (HPC) have resulted in dramatic improvements in application processing performance across a wide range of disciplines, including manufacturing, finance, geological, life and earth sciences and many more. This mainstreaming of HPC has driven solution providers towards innovative Technical Computing solutions that are faster, scalable, reliable, and secure.

Today, these mission-critical technical computing clusters are challenged with reducing cost and managing complexity. Besides cost and complexity, the data explosion in technical computing has transformed compute-intensive application workloads into workloads that are both compute- and data-intensive. There continues to be an unrelenting appetite to solve newer problems that are larger and even more complex. This is straining technical computing environments beyond current limits. While today’s technical computing application demands are growing, there are newer applications across several domains that now demand HPC-scale solutions. These newer business problems include fraud detection, anti-terrorist analysis, social and biological network analysis, semantic analysis, drug discovery and epidemiology, weather and climate modeling, oil exploration, and power grid management [1].

Although most technical computing environments are quite sophisticated, many IT organizations cannot fully utilize the available processing capacity in order to address newer business needs adequately. For these organizations, effective resource management and job submission is an extremely complex process that needs to meet stringent service level agreement (SLA) requirements across multiple departments. This demands higher levels of shared infrastructure utilization and better application processing throughput, while keeping costs lower. It is hard to optimize the execution of a wide range of applications using clusters and ensure high resource utilization given diverse workloads, business priorities and application resource needs.

To address these complex technical computing needs, IBM® Platform™ LSF® is successfully deployed across many industries and is continuously evolving to address contemporary needs. The flagship product of the IBM Platform Computing portfolio, IBM Platform LSF provides comprehensive, intelligent, policy-driven scheduling features that enable users to fully utilize all their IT infrastructure resources while ensuring optimal application performance.

This whitepaper describes key architectural aspects of IBM Platform LSF, including its use model, scheduling architecture, other core components and installation architecture. It highlights the product’s architectural strengths that help address current business challenges by optimizing the use of shared HPC resources. The target audience includes chief technical officers (CTOs), technical evaluators and purchase decision makers, who need to understand the architectural capabilities of LSF and relate them to business benefits such as containing operational and infrastructure costs while increasing scale, utilization, productivity and resource sharing in technical computing environments.

[1] Big Data in HPC – Back to the future, http://blogs.amd.com/work/2011/04/13/big-data-in-hpc-back-to-the-future/


Sponsored by IBM 

Srini Chari, Ph.D., MBA

October, 2012

mailto:[email protected]

Cabot Partners Group, Inc., 100 Woodcrest Lane, Danbury CT 06810, www.cabotpartners.com

Cabot Partners: Optimizing Business Value


Introduction – Tuning Technical Computing Tasks 

Advances in HPC and technical computing have resulted in dramatic improvements in application processing performance across a wide range of disciplines. Although most technical computing environments are quite sophisticated, many IT organizations find it challenging to maximize productivity with available processing capacity and meet newer business needs adequately.

Today, HPC clusters typically consist of hundreds or thousands of compute servers, storage and network interconnect components. These require substantial investment and drive up capital, personnel and operating costs. For maximum Return on Investment (ROI), these technical computing environments must be shared across several users and departments within an organization. The ever-increasing computing demands on a continuously growing compute cluster require fair sharing and effective utilization of raw clustered compute capability. Sharing is made possible through intelligent workload and resource management that includes job scheduling and fine-grained control over shared resources. Effective workload and resource management boosts cluster resource utilization and the Quality of Service (QoS) necessary for meeting business priorities and SLAs.

Technical compute cluster owners need to manage their existing deployed applications and also plan for new business and application requirements. Maximizing throughput [2] and maintaining optimal application performance are primary challenges that are hard to address simultaneously. High throughput requires elimination of load imbalance among constituent compute nodes in a cluster. Optimal application performance necessitates reduction in communication overhead by appropriately mapping application workload to the best available compute resources in the cluster. Such needs are addressed by workload management solutions that typically consist of a resource manager and a job scheduler. Together, these prevent jobs from competing with each other for limited shared resources in large clusters.
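As a rough illustration of the two goals discussed above, the following Python sketch computes throughput (jobs completed per unit of time) and a simple load-imbalance ratio for a set of nodes. The function names and the imbalance measure (busiest node load divided by mean node load) are illustrative assumptions and are not LSF metrics.

```python
from statistics import mean


def throughput(jobs_completed: int, elapsed_hours: float) -> float:
    """Jobs completed per hour."""
    if elapsed_hours <= 0:
        raise ValueError("elapsed_hours must be positive")
    return jobs_completed / elapsed_hours


def load_imbalance(node_loads: list) -> float:
    """Busiest node load divided by mean load; 1.0 means perfectly balanced.

    This ratio is an assumed, simplified measure chosen for illustration.
    """
    avg = mean(node_loads)
    if avg == 0:
        return 1.0
    return max(node_loads) / avg


balanced = [4.0, 4.0, 4.0, 4.0]   # same total work, spread evenly
skewed = [10.0, 2.0, 2.0, 2.0]    # same total work, one overloaded node

print(throughput(120, 8.0))       # 15.0 jobs per hour
print(load_imbalance(balanced))   # 1.0
print(load_imbalance(skewed))     # 2.5
```

In the skewed case the overloaded node finishes long after the others, so the cluster as a whole completes fewer jobs per hour even though the total work is the same.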

IBM Platform LSF is a powerful and comprehensive technical computing workload management platform that supports diverse workloads, across several industry verticals, on a computationally distributed system. It has proven capabilities such as the ability to scale to thousands of nodes, built-in high availability, intelligent job scheduling and sophisticated yet simple-to-use resource management capabilities that improve management of shared clusters. Features such as effective monitoring and fine-grained control over workload scheduling policies are well suited for multiple lines of business users within an organization. By maximizing the use of heterogeneous resources in a shared computing environment, LSF ensures that resource allocation is always aligned with business priorities. System utilization and QoS improve as job throughput and application performance are maximized. This reduces cycle times and maximizes productivity in mission-critical HPC environments.

This whitepaper covers key aspects of the IBM Platform LSF architecture and how this architecture is optimized to address technical computing challenges. Highlights include its use model, scheduling architecture, other core components and installation architecture, which together help optimize the use of shared resources. This paper aims to give CTOs, technical evaluators and purchase decision makers a perspective on how the architectural capabilities of LSF are well equipped to address today’s HPC challenges specific to their business. Also included are the latest LSF features and benefits and how these help in containing operational and infrastructure costs while increasing scale, utilization, productivity and resource sharing in technical computing organizations.

[2] Throughput – number of jobs completed per unit of time

Sidebar: Technical computing environments challenged to maximize productivity.

Sidebar: Intelligent workload and resource management are needed to maximize ROI and guarantee stringent SLAs.

Sidebar: IBM Platform LSF intelligently schedules and guarantees completion of workloads across a distributed, heterogeneous, virtualized IT environment.


The IBM Platform LSF Architecture

IBM Platform LSF provides resource-aware scheduling through its highly scalable and reliable architecture with built-in availability features. It has a comprehensive set of intelligent, policy-driven scheduling capabilities that enable full utilization of distributed cluster compute resources. The LSF architecture is geared to address technical computing challenges faced by users as well as administrators. Together with IBM Platform Application Center, LSF allows users to schedule complex workloads through easy-to-use interfaces. With LSF, administrators can easily manage shared cluster resources up to petaflop scale while increasing application throughput and maintaining optimum performance levels and QoS consistent with business requirements and priorities. Its modular architecture is unique and provides both higher scalability and flexibility by clearly separating the key elements of job scheduling and resource management that are critical for HPC workload management needs. These key elements are:

• Task placement policies that govern the exchange of load information among cluster nodes and are used in decision making for task placement on cluster nodes
• Mechanisms for transparent remote execution of scheduled jobs
• Interfaces that support load sharing applications, and
• Performance optimization of highly scalable HPC applications.

The following sections highlight how LSF works, how users access its key features, the LSF scheduling architecture and its other core elements. Then, we briefly describe the installation architecture, indicating where each LSF component is active within a cluster and how it helps in job scheduling and resource management tasks.

 

LSF Cluster Use Model

This section describes how a typical IBM Platform LSF cluster is accessed and used. Individual compute resources in a technical computing organization are usually grouped into one or more clusters that are managed by LSF. Figure 1 shows this cluster use model, and how the job management and the resource management roles are played by different nodes in an LSF cluster. One machine in the cluster is selected by LSF as the “master” node or master host. The master node plays a key role in the resource management and job scheduling functions of workload management. The other nodes in the cluster act as slave nodes and can be harnessed by the scheduler, through its scheduling algorithms, for executing jobs.

Master Nodes: When nodes start up, LSF uses intelligent, fault-tolerant algorithms for master node selection. During system operation, if the master node fails, LSF ensures that another node takes the place of the master, thus keeping the master node highly available and system services accessible to users at all times. Job scheduling decisions are governed by business priorities and policies that are set up by the LSF system administrator.
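The failover behaviour described above can be pictured with a small Python sketch. It assumes an ordered list of master candidates and selects the first one that is currently reachable. The candidate list, the host names and the `elect_master` helper are hypothetical; the sketch does not reproduce LSF's actual master selection algorithm.

```python
from typing import Optional


def elect_master(candidates: list, alive: set) -> Optional[str]:
    """Return the first candidate (in priority order) that is alive, or None."""
    for host in candidates:
        if host in alive:
            return host
    return None


candidates = ["hostA", "hostB", "hostC"]   # hypothetical priority order
alive = {"hostA", "hostB", "hostC"}

print(elect_master(candidates, alive))     # hostA

alive.discard("hostA")                     # simulate failure of the current master
print(elect_master(candidates, alive))     # hostB takes over
```

Because the selection is a pure function of the candidate order and the set of live hosts, every surviving node that evaluates it with the same inputs arrives at the same new master.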

Figure 1: LSF cluster use model (source: IBM) 


Sidebar: The modular IBM Platform LSF architecture provides users and administrators better flexibility and scalability with separation of scheduling and resource management elements.

Sidebar: An intelligent master-slave model for scheduling and management improves reliability and performance.


Users connect to a distributed system via a client and submit their jobs to the job submission node. As these user jobs queue up, the master decides where to dispatch each job for execution, based on the resources required and the current availability of resources among slave nodes.

Slave Nodes: Each slave machine in the system periodically collects its own “vital signs”, or load information, and reports it back to the master. Detailed information on the load index [3] for each node in the distributed system is analyzed and used for scheduling decisions in order to reduce job turnaround time and increase system throughput. LSF has unique algorithms for smart dissemination of load index and resource usage information to optimize system scalability and reliability. These algorithms are proven to scale up to thousands of nodes.
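To make the role of load information more concrete, the following Python sketch shows one way a master could rank hosts from their reported load indices. The `LoadReport` fields, the `pick_host` helper and the ranking rule (shortest run queue first, then most free memory) are assumptions chosen for illustration; they are not LSF data structures or the LSF placement algorithm.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoadReport:
    """Load indices a node might report to the master (hypothetical fields)."""
    host: str
    run_queue: float    # average run-queue length, lower is better
    free_mem_mb: int    # free memory in MB
    free_slots: int     # unused job slots


def pick_host(reports: list, mem_needed_mb: int) -> Optional[str]:
    """Choose the least-loaded host that can satisfy the job's memory need."""
    eligible = [r for r in reports
                if r.free_slots > 0 and r.free_mem_mb >= mem_needed_mb]
    if not eligible:
        return None
    # Prefer the shortest run queue; break ties by the most free memory.
    best = min(eligible, key=lambda r: (r.run_queue, -r.free_mem_mb))
    return best.host


reports = [
    LoadReport("node1", run_queue=0.5, free_mem_mb=2000, free_slots=2),
    LoadReport("node2", run_queue=0.2, free_mem_mb=1000, free_slots=4),
    LoadReport("node3", run_queue=0.1, free_mem_mb=8000, free_slots=0),
]

print(pick_host(reports, 500))     # node2: eligible and least loaded
print(pick_host(reports, 1500))    # node1: node2 lacks memory, node3 has no slots
print(pick_host(reports, 16000))   # None: no host qualifies
```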

Workload Execution: LSF has a remote execution component that starts or stops the jobs on the assigned slave node. Once the scheduled jobs complete on slave nodes, the completion results and job status are communicated to the user. LSF also generates reports on resource usage and detailed job execution logs. Users can obtain job execution results on a local node, transparently, as if they were executing those jobs locally. LSF frees users from having to decide which nodes are best for executing a job while allowing administrators to set up policies for job execution logic that are best suited to business needs.

There are options to checkpoint a job that is running on a slave node, or to move a running job to a different slave node and then resume execution. This feature can help to temporarily suspend running jobs, free up resources for any critical jobs, and then resume jobs from the last execution point instead of having to restart them from the beginning, thus improving system flexibility and utilization.
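The idea of resuming from the last execution point can be sketched in Python as follows. This is an application-level checkpoint written to a JSON file, used here only to illustrate the concept; the `run_job` function, the file format and the `max_steps` parameter are hypothetical and do not represent the checkpointing mechanism that LSF itself provides.

```python
import json
import os
import tempfile


def run_job(total_steps, ckpt_path, max_steps=None):
    """Sum the integers 0..total_steps-1, saving a checkpoint after every step.

    If max_steps is given, stop after that many steps in this call to
    simulate the job being suspended. Returns (next_step, partial_sum).
    """
    state = {"next_step": 0, "partial_sum": 0}
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)          # resume from the saved state

    done = 0
    while state["next_step"] < total_steps:
        if max_steps is not None and done >= max_steps:
            break                          # simulated suspension
        state["partial_sum"] += state["next_step"]
        state["next_step"] += 1
        done += 1
        with open(ckpt_path, "w") as f:
            json.dump(state, f)            # persist progress
    return state["next_step"], state["partial_sum"]


with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, "job.ckpt")
    print(run_job(10, ckpt, max_steps=4))  # (4, 6): suspended after 4 steps
    print(run_job(10, ckpt))               # (10, 45): resumed, not restarted
```

The second call could equally run on a different node, provided the checkpoint file is on storage that both nodes can reach.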

LSF Scheduler

Scheduling is a key component of any workload and resource management solution. Figure 2 shows the central component of the LSF scheduling architecture, which provides support for multiple scheduling policies. When a job is submitted to LSF, many factors control when and where the job starts to run. These factors include the active time window of the queues or hosts, resource requirements of the job, availability of eligible hosts, various job slot limits, job dependency conditions, fair-share constraints and load conditions.
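A few of the factors listed above can be combined into a simple eligibility check, sketched below in Python. The `Queue` and `Job` classes and the `can_dispatch` function are hypothetical names introduced for this example. Only three factors are modelled (queue time window, slot limit and job dependencies), and the time window is assumed not to span midnight.

```python
from dataclasses import dataclass, field
from datetime import time


@dataclass
class Queue:
    name: str
    open_from: time        # start of the queue's active window
    open_until: time       # end of the window (same day, by assumption)
    slot_limit: int        # maximum job slots the queue may use
    slots_in_use: int = 0


@dataclass
class Job:
    job_id: int
    slots: int
    depends_on: list = field(default_factory=list)   # job ids that must finish first


def can_dispatch(job: Job, queue: Queue, now: time, finished: set) -> bool:
    """True only if every modelled condition allows the job to start now."""
    in_window = queue.open_from <= now <= queue.open_until
    has_slots = queue.slots_in_use + job.slots <= queue.slot_limit
    deps_done = all(dep in finished for dep in job.depends_on)
    return in_window and has_slots and deps_done


day = Queue("day", time(8, 0), time(18, 0), slot_limit=8, slots_in_use=6)

print(can_dispatch(Job(1, slots=2), day, time(9, 30), set()))                  # True
print(can_dispatch(Job(2, slots=4), day, time(9, 30), set()))                  # False: slot limit
print(can_dispatch(Job(3, slots=1, depends_on=[1]), day, time(9, 30), set()))  # False: dependency
print(can_dispatch(Job(3, slots=1, depends_on=[1]), day, time(9, 30), {1}))    # True
print(can_dispatch(Job(1, slots=2), day, time(20, 0), set()))                  # False: window closed
```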

[3] Load Index: LSF defines a load index for each type of resource. Load index quantifies each node’s loading condition. Depending on the nature of the resource, some possibilities are queue length, utilization, or the amount of free resource. Reference: Utopia – a load sharing facility for a large scale heterogeneous system, http://cse.unl.edu/~lwang/project/Utopia_A%20Load%20Sharing%20Facility%20for%20Large,%20Heterogeneous%20Distributed%20Computer%20Systems.pdf

Figure 2: LSF scheduling architecture (source: IBM) 

Sidebar: Smart scheduling algorithms reduce time to results and maximize throughput while improving reliability.

Sidebar: The LSF scheduler supports multiple policies aligned with business needs.


One unique architectural feature of the LSF scheduler is that it allows multiple scheduling policies to coexist in the same system. This means that to make scheduling decisions, LSF accommodates multiple scheduling approaches that can run concurrently and be used in any combination, including user-defined custom scheduling approaches. The LSF scheduler plug-in API can be used to customize existing scheduling policies or implement new ones that can operate with existing LSF scheduler plug-in modules. These custom scheduling policies can influence, modify, or override LSF scheduling decisions, thus empowering administrators to model job scheduling decisions aligned with business priorities. The scheduler plug-in architecture is fully external and modular; new scheduling policies can be prototyped and deployed without changing the compiled code of LSF.
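The coexistence of several policies can be illustrated with a minimal plug-in chain in Python. Each policy is a plain function that filters or reorders the candidate hosts, and a custom policy is added without modifying the core `schedule` function. All names here (`schedule`, `require_gpu`, `prefer_low_load`, `exclude_hosts`) are hypothetical; this sketch is not the LSF scheduler plug-in API, which is a separate interface.

```python
def require_gpu(job, hosts):
    """Filter policy: keep only GPU hosts when the job asks for one."""
    if job.get("needs_gpu"):
        return [h for h in hosts if h["gpu"]]
    return hosts


def prefer_low_load(job, hosts):
    """Ordering policy: least-loaded hosts first."""
    return sorted(hosts, key=lambda h: h["load"])


def exclude_hosts(names):
    """Factory for a site-specific custom policy that bans certain hosts."""
    def policy(job, hosts):
        return [h for h in hosts if h["name"] not in names]
    return policy


def schedule(job, hosts, policies):
    """Apply each registered policy in turn; the first surviving host wins."""
    candidates = list(hosts)
    for policy in policies:
        candidates = policy(job, candidates)
    return candidates[0]["name"] if candidates else None


hosts = [
    {"name": "n1", "load": 0.7, "gpu": True},
    {"name": "n2", "load": 0.1, "gpu": False},
    {"name": "n3", "load": 0.3, "gpu": True},
]
policies = [require_gpu, prefer_low_load]

print(schedule({"needs_gpu": True}, hosts, policies))   # n3
print(schedule({}, hosts, policies))                     # n2

# A custom policy is plugged in without touching schedule() itself.
policies.append(exclude_hosts({"n3"}))
print(schedule({"needs_gpu": True}, hosts, policies))   # n1
```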

LSF Core Components 

LSF takes job requirements as inputs, finds the best resources to run the job, schedules and executes the job, and monitors its progress. Jobs always run according to host load and site policies. This section provides an overview of some of the core components of LSF and their key role in job scheduling and resource management functions. LSF is a layer of software services on top of UNIX and Windows operating systems that creates a single pool of networked compute and storage resources.

This layered service model (Figure 3) provides a resource management framework to allocate, manage and use resources as a single entity. The three basic components of this layer are LSF Base, LSF Batch and LSF Libraries. Together, they help in distributing work across existing heterogeneous IT resources, creating a shared, scalable, and fault-tolerant infrastructure that delivers faster and more reliable workload performance.

LSF Base provides basic load-sharing services for the distributed system such as resource usage information, host selection, job placement advice, transparent remote execution of jobs and remote file operations. These services are provided through the following sub-components:

• Load Information Manager (LIM)
• Process Information Manager (PIM)
• Remote Execution Server (RES)
• LSF Base application programming interface (API)
• Utilities such as lstools, lstcsh and lsmake.

LSF Batch extends LSF Base services to provide a batch job processing system along with load balancing and policy-driven resource allocation control. To provide this functionality, LSF Batch uses the following LSF Base services:

• Resource and load information from LIM to perform load balancing activities
• Cluster configuration information and master LIM election service from LIM
• RES for interactive batch job execution
• Remote file operation service provided by RES for file transfer.

Figure 3: LSF services - high level architecture (source: IBM)

Sidebar: The LSF scheduler minimizes latencies for short jobs while improving performance for long jobs.

Sidebar: LSF core components help in distributing work across existing heterogeneous IT resources, creating a shared, scalable, and fault-tolerant infrastructure.

LSF Libraries provide APIs for distributed computing application developers to access job scheduling and resource management functions. There are two LSF libraries: LSLIB and LSBLIB.

• LSLIB is the core library that provides basic workload management services to applications across a shared cluster and is a runtime library to easily develop load sharing applications.
• LSLIB implements a high-level procedural interface that allows applications to interact with LIM and RES. The other library, LSBLIB, is the batch library; it provides the batch services that are required to submit, control, manipulate, and queue jobs on system nodes.

LSF Installation Architecture

LSF consists of a number of servers or daemon processes that run with root privileges on each participating host (Figure 4) in the system and a comprehensive set of utilities that are built on top of the LSF API. There are multiple LSF processes running on each host in the distributed system. The type and number of processes running depend on whether the host is a master host, a compute or slave host, or one of the master node candidates, as shown in Figure 5.

Sidebar: LSF libraries provide APIs for application developers to access job scheduling and resource management functionality of LSF.

Sidebar: LSF consists of a number of servers or daemon processes that run with root privileges on each participating host.

Figure 4: LSF daemons and their functions in scheduling & resource management (source: IBM)


On each participating host in an LSF cluster, an instance of LIM runs and exchanges load information with its peers on other hosts and provides applications and associated tasks with a list of hosts that are best for execution. Multiple resources on each host and the resource demands of each application are considered in LIM placement decisions. In addition to helping LSF make placement decisions, LIM also provides load information to those applications that make their own placement decisions. Besides LIM, RES is another server or daemon on each host. RES provides the mechanisms for transparent remote execution of arbitrary tasks. Typically, after placement advice has been obtained from LIM, a stream connection is established between the local application and its remote task through RES on the target host. This is followed by remote task initiation. LSF supports several models of remote execution to meet the diverse functional and performance requirements of applications. A LIM and a RES run on every Platform LSF server host. They interface with the host’s operating system to give users a uniform, host-independent environment. Figure 6 shows sample job submission steps, for regular as well as batch jobs that run on an LSF system, and the various interactions between LSF components during job submission and execution.

Figure 5: Installation architecture with various LSF processes running on different nodes in an LSF managed cluster (source: IBM)

Figure 6: Interactions between various LSF components during job submission and execution (source: IBM)

Sidebar: LSF supports several models of remote execution to meet the diverse functional and performance requirements of applications.

LSF Architectural Strengths

Sidebar: The LSF modular structure even allows parts of the system to be used independent of other parts.

The architectural strength of LSF results from its modular structure, which even allows parts of the system to be used independently of other parts. For instance, a task can be executed on a remote host specified by the user so that LSLIB can contact the remote RES component, without needing the LIM component. Similarly, load information and placement advice from LIM may be obtained for purposes other than remote execution. Another advantage of the LSF architecture is that the policies and mechanisms of load sharing may be changed independently of each other as well as independently of the applications running on the system. This provides significant fine-grained control over resource sharing and job scheduling.
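The separation between placement advice and remote execution can be sketched in Python as two interchangeable callables. In this illustration the `advise` function stands in for placement advice and the `execute` function stands in for the remote execution mechanism. These names, and the `run_task` helper, are hypothetical and do not correspond to LSLIB calls.

```python
from typing import Callable, Optional


def run_task(task: str,
             execute: Callable[[str, str], str],
             host: Optional[str] = None,
             advise: Optional[Callable[[], str]] = None) -> str:
    """Run a task on an explicit host, or ask the advisor when none is given."""
    if host is None:
        if advise is None:
            raise ValueError("need either an explicit host or a placement advisor")
        host = advise()               # placement policy decides
    return execute(host, task)        # execution mechanism is unchanged either way


def fake_execute(host: str, task: str) -> str:
    """Stand-in for a remote execution mechanism (runs nothing remotely)."""
    return f"{task} ran on {host}"


def least_loaded() -> str:
    """Stand-in for placement advice based on made-up load figures."""
    loads = {"n1": 0.9, "n2": 0.2}
    return min(loads, key=loads.get)


# User names the host: the advisor is not involved at all.
print(run_task("render", fake_execute, host="n1"))            # render ran on n1

# No host given: placement advice is consulted first.
print(run_task("render", fake_execute, advise=least_loaded))  # render ran on n2
```

Swapping in a different `advise` function changes the placement policy without touching the execution path, which mirrors the independence of policy and mechanism described above.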

While LSF manages distributed system sharing and job scheduling complexities with its smart architecture, it also provides simple, easy-to-use interfaces that improve the productivity of both users and administrators and boost collaboration in technical computing organizations. The highly available single master node concept for managing an entire cluster simplifies distributed systems management and frees up domain experts to focus on value-added work instead of tedious job scheduling and system management tasks. At higher scale, LSF deploys a hierarchical master node concept internally, but all that complexity is hidden and does not impact its simplified use model.

Users can access systems with thousands of nodes that could be spread across geographies through additional LSF components such as LSF Multi-Cluster. LSF is architected to run on a variety of x86 hardware and operating environments, including the latest generation of IBM System x servers, and is also certified on IBM Power Systems servers running the AIX and Linux operating systems.

IBM Platform LSF Benefits

LSF allows multiple users to share heterogeneous assets more effectively in a shared computing environment. Consequently, people are more productive, projects are completed earlier and, because computer utilization is better, infrastructure costs are contained.

By consolidating compute resources from multiple, distributed systems, workload can be distributed more efficiently across an organization’s geographically dispersed technical computing assets. With this capability, effective sharing of resources can be extended from a single cluster to enable flexible hierarchical or peer-to-peer workload distributions between multiple clusters.

LSF improves efficiency by removing the problem of underutilized compute resources, enabling local administrators to retain control of their own assets while still permitting remote systems to tap into idle capacity.

Cluster-level capabilities in LSF transparently extend to the grid. This makes it exceptionally fast and cost-efficient to deploy on grids, eliminating the need for sites to implement an expensive, customized scheduling layer to share resources between clusters.

With simple interfaces and a plug-in modular architecture, LSF lowers the learning curve and increases cluster user productivity, reduces application integration and training costs, and speeds up job completion by eliminating manual job submission errors through automation. Technical computing users obtain faster results and complete more jobs using shared cluster resources at lower costs.

Sidebar: IBM Platform LSF: Complete, powerful, scalable Workload Management Solution.

Benefits:

• Advanced, feature-rich workload scheduling
• Robust set of add-on features
• Integrated application support
• Policy & Resource aware scheduling
• Resource consolidation for maximum performance
• Automation & Advanced self management
• Thousands of concurrent users & jobs
• Optimal utilization, less infrastructure costs
• Better user productivity, faster time to results
• Best TCO – flexible control, multiple policies, robust capabilities, administrator productivity

Sidebar: The LSF smarter scheduling advantages:

• Higher throughput at a lower cost
• Flexibility to address changing business needs
• Better asset utilization & ROI
• Better service levels to end-users
• Increased automation & reduced manual intervention


In short, LSF equips technical computing environments to achieve the following benefits:

• Obtain higher-quality results faster
• Reduce infrastructure and management costs
• Easily adapt to changing user requirements

Conclusions

Flexibility, scalability and agility are the key requirements of technical computing environments⁴. Technical computing users typically run varied applications and workloads on clusters and large-scale distributed systems. These workloads may be performance-sensitive, compute-intensive, data-intensive, or a combination of these.

To support large technical computing clusters, customers are challenged by manual tasks and cumbersome tools, by integration issues, and by the need for multiple dedicated personnel to develop and maintain custom integration between various tools and applications. This increases costs and business risks because much of the mission-critical functionality could be expensive or time-consuming to realize. Instead of focusing on core high-value tasks, administrators could also be consumed by mundane manual systems management tasks. These environments demand reliability as well as scalability from the underlying IT infrastructure. At the same time, budgetary constraints and competitive pressures make it imperative to increase resource utilization and improve infrastructure sharing efficiencies to achieve better collaboration, productivity and faster time to results.

In such large-scale distributed systems, computing resources are made available to users through dynamic and transparent load sharing provided by IBM Platform LSF. Through its transparent remote job execution, LSF harnesses powerful remote hosts to improve application performance, enabling users to access resources from anywhere in the system. The IBM Platform LSF product family has the broadest set of capabilities in the industry, which are tightly integrated and fully supported by IBM. As part of an even broader portfolio of offerings from IBM and IBM Business Partners, LSF can be packaged with more engineering, integration and process capabilities. This further enhances the productivity of technical computing users, enabling them to focus more on their core business, engineering or scientific tasks. It also reduces future strategic risk as the business evolves.

The IBM Platform LSF architecture is geared to create a scalable, reliable, highly utilized and manageable shared infrastructure for technical computing environments, with powerful resource management and scheduling solutions cutting across cluster silos. Its modular architecture provides much-needed flexibility and fine-grained control while speeding up job turnaround times and improving productivity. Simple interfaces and easy customization features of LSF and complementary products reduce complexity and management costs, and they facilitate better collaboration, tighter integration and alignment of scheduling and resource management tasks with business objectives and priorities. LSF is architected to place workloads optimally, based not only on whether a cluster machine is capable of running a workload, but also on a determination of which host is best able to run it while ensuring that broader business policies and requirements are met.
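The distinction between a host that can run a workload and the host that is best able to run it can be illustrated with a small two-stage selection. The following is only a minimal sketch in Python and not LSF code: the `Host` and `Job` types, their fields, and the "least loaded wins" ranking rule are all hypothetical, and the business-policy checks mentioned above are left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Host:
    # Hypothetical host record; field names are illustrative only.
    name: str
    free_slots: int
    free_mem_gb: float
    load: float  # normalized load, lower is better


@dataclass(frozen=True)
class Job:
    # Hypothetical job request.
    name: str
    slots: int
    mem_gb: float


def eligible(job: Job, hosts: List[Host]) -> List[Host]:
    """Stage 1: keep only hosts that are capable of running the job."""
    return [h for h in hosts
            if h.free_slots >= job.slots and h.free_mem_gb >= job.mem_gb]


def best_host(job: Job, hosts: List[Host]) -> Optional[Host]:
    """Stage 2: among capable hosts, pick the least loaded (ties by name)."""
    candidates = eligible(job, hosts)
    if not candidates:
        return None  # no capable host: the job would stay pending
    return min(candidates, key=lambda h: (h.load, h.name))


hosts = [
    Host("hostA", free_slots=8, free_mem_gb=16.0, load=0.9),
    Host("hostB", free_slots=4, free_mem_gb=32.0, load=0.2),
    Host("hostC", free_slots=2, free_mem_gb=64.0, load=0.1),
]
job = Job("sim1", slots=4, mem_gb=8.0)
chosen = best_host(job, hosts)
print(chosen.name if chosen else "job stays pending")
```

In this toy data, hostC is the least loaded host but lacks the slots, so it is filtered out in stage 1 and the choice falls to hostB rather than the busier hostA.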

IBM Platform LSF lowers operating costs by smartly matching the limited supply of shared resources with application demands and business priorities through features such as guaranteed resources, live re-configuration, fair-share and pre-emptive scheduling enhancements, and better performance and scalability. IBM continues to enhance the capabilities of LSF and LSF add-on components. Clients can expect IBM to deliver capabilities to deploy new LSF add-on components on demand to keep up with the ever-changing requirements of the technical computing marketplace.

4 Trends from the trenches: Bio IT World 2012 http://www.slideshare.net/chrisdag/2012-trends-from-the-trenches 

Technical computing organizations need flexibility, scalability, and agility at lower costs and risks.

LSF can be packaged with engineering, integration and processes so that technical computing organizations can become more productive and focus on their core business, engineering or scientific tasks.

LSF lowers IT costs smartly by matching the limited supply of shared resources with application demands and business priorities.

LSF is tuned to technical computing now and in the future.


Appendix: What’s new in LSF Version 9.1? 

IBM Platform LSF virtualizes heterogeneous IT infrastructure and offers customers complete freedom of choice. Through fully integrated and certified applications, custom application integration, and support for a wide variety of operating systems, it ensures that current investments are preserved while providing the strategic benefit of freedom of choice to run the best platform for the best job. The current LSF release (Version 9.1) delivers improvements in performance and scalability over prior versions while introducing several additional new features that simplify administration and boost productivity of cluster users.

New Features in LSF: Functional details and Business Benefits

Performance and scalability

Functional details:
• Improved query response (~10 ms), decrease in scheduling cycle, memory optimization, decreased start/restart time, parallel start-up/restart.
• LSF has been extended to support an unparalleled scale of up to 160,000 cores and two million queued jobs for very high throughput EDA workloads.
• On very large clusters with large numbers of user groups employing fair-share scheduling, the memory footprint of the master batch scheduler in LSF has been reduced by approximately 70% and scheduler cycle time has been reduced by 25%.

Business benefits:
• Faster job turnaround times. Faster time to results.
• For a very large fair-share tree (e.g. 4K user groups, 500 users with -g), job election performance has been improved 10x.

Better usability & manageability

Functional details:
• Clearer reporting of resource usage and pending reasons.
• Better alternative job resource options for timely job execution.
• Enhanced process tracking: LSF 9.1 leverages kernel cgroup functionality to replace/improve existing functionality for process tracking and topology CPU/memory enforcement.
• Fast detection of hung hosts/jobs, directory management. A new multi-threaded communication mechanism allows faster detection of unavailable hosts.

Business benefits:
• Speeds up troubleshooting, faster detection of failed or hung jobs, self-tuning, and better admin productivity.
• Protection against user-initiated actions that can result in denial of service.
• Timely job turnaround with alternate resources.

Scheduling enhancements

Functional details:
• LSF 8 provided a guaranteed resource scheduling feature for groups of jobs based on slots (cores). LSF 9.1 extends this feature to more complex resource guarantees by supporting multi-dimensional packages. A package is a combination of slots and memory. This enables SLA scheduling to consider memory in addition to cores.
• Besides numerous multi-cluster scheduling enhancements, such as enhanced interoperability across clusters and exchange of all load information between all clusters, it also provides CPU and memory affinity.
• LSF 9.1 also provides alternative or time-based resource requirements to better align with business priorities with much finer-grained control.

Business benefits:
• LSF scheduling enhancements make the cluster more stable and reliable.
• Better job control and more accurate, lightweight CPU and memory accounting, even for runaway and short job processes.
• Fine-grained tuning and customization of infrastructure sharing policies ensure flexibility and agility in resource sharing that match closely with evolving business requirements.

IBM Platform - Advanced Edition Architecture

Functional details:
• The new LSF - Advanced Edition architecture separates user interaction from scheduling, and divides the compute resource into a number of execution clusters, while presenting it to the users as a single cluster.

Business benefits:
• This new architecture delivers the expected increase in performance with the increase in capacity, resulting in a consistent user experience with scale.
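The multi-dimensional packages described under the scheduling enhancements above can be made concrete with a little arithmetic: a host supplies only as many packages as its scarcer dimension allows. The sketch below is an illustration under stated assumptions, not LSF configuration or LSF code; the `Package` type, the helper names, and the sample host figures are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Package:
    # Hypothetical package definition: a bundle of slots and memory.
    slots: int
    mem_gb: int


def packages_on_host(free_slots: int, free_mem_gb: int, pkg: Package) -> int:
    """Whole packages a host can supply, limited by the scarcer dimension."""
    return min(free_slots // pkg.slots, free_mem_gb // pkg.mem_gb)


def guarantee_met(hosts: List[Tuple[int, int]], pkg: Package,
                  guaranteed: int) -> Tuple[int, bool]:
    """Total packages across hosts and whether the guarantee is covered."""
    total = sum(packages_on_host(slots, mem, pkg) for slots, mem in hosts)
    return total, total >= guaranteed


pkg = Package(slots=2, mem_gb=8)
# (free slots, free memory in GB) per host; sample values only.
hosts = [(16, 32), (8, 64), (4, 8)]

total, ok = guarantee_met(hosts, pkg, guaranteed=9)
slot_only = sum(slots // pkg.slots for slots, _ in hosts)
print(total, ok, slot_only)
```

With these sample numbers, counting slots alone suggests room for 14 two-slot jobs, while the package view shows that memory limits the pool to 9. That gap is the reason a guarantee that considers memory as well as cores gives a more realistic picture for SLA scheduling.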

LSF add-on modules have also been enhanced in the latest version 9.1. The LSF License Scheduler more efficiently handles parallel jobs in which each rank checks out a license directly, and it does not need a restart to apply configuration changes. There are enhanced filtering and drill-down capabilities in IBM Platform RTM, along with support for IBM General Parallel File System (GPFS) monitoring. LSF Process Manager now supports non-LSF batch systems and the IBM Platform Symphony product. IBM Platform Application Center has an improved interface with IBM Platform Analytics, and the latter now supports Tableau (v8), Vertica (5.1) and the latest BI reporting capabilities.

IBM Platform LSF V9.1 delivers significantly enhanced performance, scalability, manageability and usability as well as new scheduling capabilities. The new Platform LSF – Advanced Edition provides more than three times the scalability of prior versions of LSF, enabling clients to consolidate their compute resources to achieve maximum flexibility and utilization.

For clients looking to improve service levels and utilization with a dynamic, shared HPC cloud environment, IBM Platform Dynamic Cluster V9.1 is now available as an add-on to IBM Platform LSF. Platform Dynamic Cluster turns static Platform LSF clusters into dynamic, shared cloud infrastructure. By automatically changing the composition of clusters to meet ever-changing workload demands, service levels are improved and organizations can do more work with less infrastructure. With smart policies and numerous features such as live job migration and checkpoint-restart, Platform Dynamic Cluster enables clients to realize improved utilization, better reliability, and increased productivity, while reducing administrator workload.


The new IBM Platform Session Scheduler V9.1 is designed to work with Platform LSF to provide high-throughput, low-latency scheduling for a wide range of workloads. It is particularly well suited to environments that run high volumes of short-duration jobs, and where users require faster and more predictable job turnaround times. Unlike traditional batch schedulers that make resource allocation decisions for every job submission, Platform Session Scheduler enables users to specify resource allocation decisions only once for multiple jobs in a user session, providing users with their own virtual private cluster. With this more efficient scheduling model, users benefit from higher job throughput and faster response times while cluster administrators realize an overall improvement in cluster utilization.
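The difference between deciding on resources for every job and deciding once per session can be shown with a toy model that simply counts allocation decisions. This is a rough sketch under simplifying assumptions, not the Session Scheduler implementation or its interface: the `BatchScheduler` class and both `run_*` functions are hypothetical, and tasks run sequentially in-process purely to keep the example self-contained.

```python
from collections import deque


class BatchScheduler:
    """Toy cluster scheduler that counts its allocation decisions."""

    def __init__(self, free_slots):
        self.free_slots = free_slots
        self.decisions = 0

    def allocate(self, slots):
        # Every call stands for one full scheduling decision.
        self.decisions += 1
        if slots > self.free_slots:
            raise RuntimeError("not enough free slots")
        self.free_slots -= slots
        return slots

    def release(self, slots):
        self.free_slots += slots


def run_per_job(scheduler, tasks):
    """Traditional model: each task is a job with its own allocation."""
    results = []
    for task in tasks:
        slots = scheduler.allocate(1)
        results.append(task())
        scheduler.release(slots)
    return results


def run_in_session(scheduler, tasks, session_slots):
    """Session model: allocate once, then dispatch inside that allocation."""
    slots = scheduler.allocate(session_slots)
    queue = deque(tasks)
    results = []
    while queue:
        # Run up to `slots` tasks per round without asking the scheduler again.
        for _ in range(min(slots, len(queue))):
            results.append(queue.popleft()())
    scheduler.release(slots)
    return results


# 1000 short tasks, as in a high-volume, short-duration workload.
tasks = [lambda i=i: i * i for i in range(1000)]
per_job = BatchScheduler(free_slots=8)
session = BatchScheduler(free_slots=8)

same = run_per_job(per_job, tasks) == run_in_session(session, tasks, session_slots=4)
print(same, per_job.decisions, session.decisions)
```

Both runs produce the same results, but the per-job model consults the scheduler once per task while the session model consults it once in total. For very short jobs, that per-job decision overhead is the cost the session model is meant to avoid.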

To learn more about current IBM Platform LSF product features, visit:

http://www-03.ibm.com/systems/technicalcomputing/platformcomputing/products/lsf/index.html  

Copyright © 2012 Cabot Partners Group, Inc. All rights reserved. Other companies’ product names, trademarks, or service marks are used herein for identification only and belong to their respective owners. All images and supporting data were obtained from IBM or from public sources. The information and product recommendations made by the Cabot Partners Group are based upon public information and sources and may also include personal opinions both of the Cabot Partners Group and others, all of which we believe to be accurate and reliable. However, as market conditions change and are not within our control, the information and recommendations are made without warranty of any kind. The Cabot Partners Group, Inc. assumes no responsibility or liability for any damages whatsoever (including incidental, consequential or otherwise), caused by your use of, or reliance upon, the information and recommendations presented herein, nor for any inadvertent errors which may appear in this document. This document was developed with IBM funding. Although the document may utilize publicly available material from various vendors, including IBM, it does not necessarily reflect the positions of such vendors on the issues addressed in this document.