Realizing a shared, multi-tenant infrastructure for Big Data and Analytic applications with IBM® InfoSphere® BigInsights and IBM Platform Computing™

Last revised: April 19, 2014

By: Gord Sissons, Steven Sit, Eric Fiala, Michael Feiman


Contents

Document History
Introduction
Disclaimers and limitations
About the customer described in this use case
Industry Challenges
  Impact on Information Technology
The Big Data Environment
  Hardware Infrastructure
  The Software Environment
  Customer Requirements
Installing InfoSphere BigInsights for Multi-tenant services
  Installation steps
  Accessing the Platform Symphony Management Console
  Accessing the Platform Symphony knowledge center
Platform Symphony Concepts
An example of configuring a cluster for multi-tenancy
  Adding users to run MapReduce applications
  Provide access to the BigInsights / Platform Computing cluster
  Understanding Platform Symphony Impersonation
  Configuring OS groups for the multitenant environment
  Submitting a test job as a user to verify the configuration
  Associating BigInsights with a Symphony Application
  Enabling Symphony Repository Services
  Adding a new Application / Tenant
  Configuring application properties
  Associating applications with consumers
  Accessing Consumer Definitions
  Manually editing Consumer Tree definitions
  Controlling access to applications and consumers
  Determining the execution user for a consumer
Configuring Sharing Policies
Summary

Document History

Date of this revision is Saturday April 19, 2014

Revision  Date            Summary of changes
0.9       March 23, 2014  Initial draft
0.95      April 19, 2014  Incorporate many valuable comments from Steven Sit based on his direct client experience – thank you Steven.

Introduction

This document is written for IBM and partner architects. It is intended to be a guide for those working

with customers deploying IBM InfoSphere BigInsights and other Hadoop offerings together with IBM

Platform Symphony. While this paper describes the details of one customer implementation, we believe

that this use case is relevant to others as well. Challenges related to Hadoop multitenancy are faced by

customers across multiple industries.

The target audience for this document includes:

• Architects responsible for deploying big data or analytic workloads

• Technical users looking for ways to deploy Hadoop on shared clusters

• IBM architects, ISVs or business partners interested in building multitenant Big Data environments to help customers reduce infrastructure requirements and save cost

This paper does not delve into YARN. YARN is another important (but less mature) technology that

delivers some of the capabilities described herein. It is important for IBM customers to understand that

IBM BigInsights is a safer choice in the sense that it supports open source technologies like YARN while

simultaneously offering more advanced capabilities. IBM’s view is that clients can best determine what capabilities they need, and IBM InfoSphere BigInsights gives them that flexibility: the best of a 100% open source distribution along with significant value-added capability.

In the customer example documented here, the business advantage of using proprietary capabilities

(IBM Platform Symphony) dramatically outweighed the benefits of being “pure” from an open source

standpoint. The client was able to consolidate roughly 30 applications onto a shared infrastructure and

avoid significant incremental capital expense that would have been required to set up separate clusters

had the client decided to proceed with open source YARN only.

Disclaimers and limitations

The details of the customer implementation are proprietary and confidential. As such, while we can

describe what was done technically, we cannot share details of how this customer used particular

applications. As a result, the examples provided herein are meant to explain qualitatively what was

achieved by the customer without betraying confidential information. The details and screenshots in this



document are not from the customer environment. They have been reproduced on a small test cluster

to explain particular capabilities that the client chose to take advantage of.

About the customer described in this use case

The customer described in this paper is a full-service financial service provider. They offer a broad range

of products to their clients including insurance, banking, investing, real estate, retirement planning,

wealth management and health insurance. Like many in the financial services sector, this customer is

increasingly deploying Hadoop based applications to augment their data warehouse. They are motivated

by the following imperatives:

• The need to leverage big data analytics to make better business decisions, improve customer relations and develop innovative new products and services

• The need to contain or reduce costs (the cost of storing and processing data on a Hadoop cluster is an order of magnitude less than persisting the same data in their data warehouse)

• The desire to architect their environment as a shared service to avoid each line of business building their own discrete analytic environments on premise or in the cloud

Industry Challenges

Like many industries, the sector represented by this client is going through significant change. As a full-spectrum provider, the client is disproportionately impacted by regulation. As a bank, not only are they subject to various provisions in legislation like Dodd-Frank, but they are also impacted by insurance industry requirements such as the NAIC’s Risk Management and Own Risk and Solvency Assessment Model Act (RMORSA) and other initiatives around Enterprise Risk Management that have arisen in response to the financial crisis of 2008.

Of particular consequence is the Volcker rule, a US Senate bill that would give regulators the ability to

limit or prohibit certain types of proprietary trading activities. While the legislation is directed at retail

banks, this client will be impacted across their insurance and wealth management businesses where

proprietary trading is important to maximizing investment gains.

As if this tsunami of new regulation were not enough, fundamental changes driven by external factors are taking place in the insurance industry as well. Among these factors are disruptive new technologies: big data, social and mobile technologies are prominent drivers of change. Some specific challenges to the business are:

• Driven by high-profile events and the increased frequency of natural catastrophes, contingent business interruption (CBI) modeling is emerging as a priority for insurance firms.

• Dramatic changes driven by technology are promising to fundamentally change auto-insurance. Among these factors are collision avoidance technologies that promise to shift liability from drivers to manufacturers, social media technologies enabling insurers to seek out and market to lower risk consumer pools, and advances in GPS and vehicle telematics that promise to provide insurers with more granular data on which to base risk assessments.

• Technological advances are leading to an explosion in available information and in firms that aggregate such information to help insurers better quantify risk.

• Widespread consumer use of mobile and social technologies is causing firms to rethink how they promote their brand and provide services to both their customers and agents/advisors.

• Advances in analytic techniques are making it easier for insurers to collect, process and visualize information. This is extending beyond core actuarial techniques to include approaches like predictive analytics, natural language processing, social network analysis and simulation-based analytics.

• Additionally, new technologies are changing how information is stored and processed. Distributed file systems and clustered technologies like Hadoop can provide a significant per-terabyte cost advantage over traditional warehouses. Because of these cost advantages, and because the framework is well suited to storing and processing unstructured or semi-structured data, this customer and similar firms are embracing Hadoop as a platform for many new applications.

We point this out because risk management, which relies heavily on Monte Carlo simulation and actuarial modeling, and big data analytics are converging. Both depend on scaled-out infrastructure. Firms that understand this convergence can obtain a cost advantage relative to their competitors.

Impact on Information Technology

Both the regulatory challenges described above and the technological shifts and business pressures are driving the need for greater data processing and analytic capacity.

• Traditional data warehouses cannot scale cost-efficiently to manage the vast amounts of data being collected and processed, nor can they handle the raw volumes of unstructured data involved.

• Organizations need more agile application development methodologies and toolsets that allow them to evolve data schemas and applications on the fly as they continuously incorporate new sources of data into their models.

A one-to-one mapping between applications and infrastructure is no longer practical. Many applications

(Hadoop, scenario generation, Monte Carlo simulation and ETL processing) rely on distributed

infrastructure that scales horizontally. Replicating this clustered infrastructure for each line of business

and each application would be cost prohibitive.



The Big Data Environment

Hardware Infrastructure

The physical infrastructure deployed by this client is shown pictorially in Figure 1. While there are actually four identical 16-node clusters, only the production environment is shown here. The server infrastructure is based on an IBM System x reference architecture for InfoSphere BigInsights. Each cluster node has 12 CPUs, over 60 GB of memory and 12 locally connected physical disks. The production cluster has 192 TB of disk and approximately 1 TB of memory.
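As a quick sanity check on these figures, the arithmetic can be worked through in a few lines of shell. This is only a sketch: the 64 GB per-node memory value is an assumption (the text says only "over 60 GB"), and the per-disk size is inferred by assuming the 192 TB is raw capacity spread evenly across all disks.

```shell
# Back-of-the-envelope sizing for one 16-node cluster, using figures from the text.
nodes=16
disks_per_node=12
total_disk_tb=192
mem_per_node_gb=64   # assumption: the text says only "over 60 GB"

disk_per_node_tb=$(( total_disk_tb / nodes ))            # disk capacity per node
disk_size_tb=$(( disk_per_node_tb / disks_per_node ))    # implied size of each disk
total_mem_gb=$(( nodes * mem_per_node_gb ))              # aggregate cluster memory

echo "disk per node:     ${disk_per_node_tb} TB"
echo "implied disk size: ${disk_size_tb} TB"
echo "total memory:      ${total_mem_gb} GB"
```

Under these assumptions the aggregate memory comes to 1024 GB, which agrees with the "approximately 1 TB" quoted above.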

A unique feature of this environment is that the cluster is shared by approximately 30 different user groups across several lines of business.

Figure 1: Physical infrastructure for shared Hadoop Platform

The Software Environment

The Linux based infrastructure supports multiple big data and analytic applications.

Among these applications are:


• IBM InfoSphere BigInsights (providing core Hadoop services)

• Datameer (for data visualization)

• IBM Tealeaf (customer experience analytics platform)

• Open source Sqoop 1.2.4, used to perform bulk data transfers to and from various data sources including an operational data warehouse and the production Hadoop cluster

• Various MapReduce streaming applications where, for convenience of development, Map and Reduce logic is expressed as Perl scripts

• Many in-house developed Java applications

• Various ETL scripts running in and out of the Hadoop MapReduce framework

The IBM furnished software environment is comprised of the following major components:

• IBM InfoSphere BigInsights Enterprise Edition

• IBM Platform Symphony Advanced Edition (the software is bundled with BigInsights Enterprise Edition for a single tenant, and this client has purchased a production license)

• IBM GPFS FPO (providing a POSIX compliant file system that fully preserves HDFS semantics)

Customer Requirements

This customer requires a multi-tenant environment for several business reasons listed below.

• They wish to share infrastructure between multiple departments and lines of business both to boost capacity (by allowing departments to tap capacity not being used by others) and to reduce costs by avoiding the need for separate physical environments.

• They need the ability to guarantee service levels to different tenants to ensure that business critical applications can run in a predictable fashion. For example, ETL or specific database load operations must run within an overnight batch window.

• Because many services are long-running, agile pre-emption is required to make sharing practical, so that urgent jobs do not need to wait behind long running jobs on the cluster.

• The client needs to ensure that data is segmented between different tenants on the shared environment for security and privacy reasons.

• Finally, the client requires multi-tenancy for technical reasons that are sometimes overlooked. As the environment evolves, they need the flexibility to deploy different versions of software components that may have specific dependencies. A specific example is this client’s requirement to use a more recent version of open-source Sqoop, distinct from the version included in BigInsights 2.1.0.1, the version deployed at the time of this writing.



Different Hadoop vendors have different definitions of what they mean by multi-tenancy, so it is

important that we not confuse the multitenant capabilities offered by IBM in Platform Symphony with

open source offerings like YARN, which is much less capable. While YARN is an important technology

being supported by IBM, the capabilities of YARN are well behind those described here.

Installing InfoSphere BigInsights for Multi-tenant services

Realizing a multitenant environment for BigInsights or other applications requires the use of IBM

Platform Symphony Advanced Edition. A run-time version of IBM Platform Symphony Advanced Edition

that enables a single tenant is included with IBM InfoSphere BigInsights Enterprise Edition 2.1 or later.

The Platform Symphony resource manager and workload manager are referred to in the BigInsights documentation as Adaptive MapReduce for historical reasons. Clients wanting the multitenant capabilities described in this document will need to license a full version of Platform Symphony Advanced Edition.

Note that licensing is not enforced by the software directly. Customers can pilot these multitenant

capabilities using only the software included in the BigInsights 2.1 Enterprise Edition or later release

along with appropriate patches.

Installation steps

Fortunately, it is getting steadily easier to have these products work together. While manual configuration was required in prior releases, as of BigInsights 2.1 EE a simple patch can be applied to unlock all of the features of Platform Symphony Advanced Edition and have it work with BigInsights. For future releases starting in the spring of 2014, full functionality of Platform Symphony will be provided “out of the box” with BigInsights with no requirement for a patch. (Please note that customers will still need to license the software before using it in production.)

The high-level steps to implement InfoSphere BigInsights 2.1 (or later) with IBM Platform Symphony

Advanced Edition are as follows:

Install IBM InfoSphere BigInsights Enterprise Edition by following the installation instructions.

When installing BigInsights it is important to install Adaptive MapReduce. This is the choice that

causes the Platform Symphony software to be installed and configured with BigInsights.

To do this, you will need to edit a file in the installation directory called install.properties before

starting the BigInsights installation process as shown below:

# set AdaptiveMR.Enable to true if you want to install AdaptiveMR
# instead of Apache MapReduce
AdaptiveMR.Enable=true

# set AdaptiveMR.HA.Enable to true if you want to install AdaptiveMR
# High Availability, this will also install AdaptiveMR instead of Apache
# MapReduce
AdaptiveMR.HA.Enable=true
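Before starting the installer, it can be useful to confirm that the edit actually took effect. The following is a minimal sketch of such a check; the helper name prop_value is hypothetical and is not part of BigInsights. It assumes the simple key=value layout shown above.

```shell
# prop_value FILE KEY: print the value of KEY from a Java-style properties file.
# Comment lines (starting with #) are ignored; if KEY appears more than once,
# the last assignment wins. Hypothetical helper, not shipped with BigInsights.
prop_value() {
    awk -F= -v k="$2" '
        /^[[:space:]]*#/ { next }                      # skip comment lines
        {
            name = $1
            gsub(/^[[:space:]]+|[[:space:]]+$/, "", name)
            if (name == k) {
                val = substr($0, index($0, "=") + 1)   # everything after the first "="
                gsub(/^[[:space:]]+|[[:space:]]+$/, "", val)
                found = val
            }
        }
        END { if (found != "") print found }
    ' "$1"
}

# Example, run from the BigInsights installation directory:
#   [ "$(prop_value install.properties AdaptiveMR.Enable)" = "true" ] ||
#       echo "AdaptiveMR.Enable is not true" >&2
```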



For multitenant environments, GPFS FPO is recommended; however, Symphony can be configured to support multiple tenants regardless of whether HDFS or GPFS FPO is chosen as the cluster file system.

BigInsights can be installed by using a web-based installation process. The web-based installer generates an XML file that governs the installation, whether it is run via the GUI or optionally via the install.sh shell script. The name of this file will vary depending on how the software is installed, but as of release 2.1 the file is called either simple-fullinstall.xml or fullinstall.xml.

The reason we mention this is that an apparent bug in BigInsights 2.1 caused the XML tag <apache-mapred> to be set to true even when Adaptive MapReduce was requested in the install.properties file above. It is worth confirming that <apache-mapred> is set to false in the simple-fullinstall.xml or fullinstall.xml file before proceeding:

[biadmin@biginsights]$ grep "apache-mapred" simple-fullinstall.xml

<apache-mapred>false</apache-mapred>

[biadmin@biginsights]$
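The same check can be scripted so that it fails loudly instead of relying on a visual inspection of the grep output. This is a line-oriented sketch, not a real XML parser: it assumes the element opens and closes on one line, as in the output above, and the helper name xml_tag_value is hypothetical.

```shell
# xml_tag_value FILE TAG: print the text of the first <TAG>...</TAG> element
# that opens and closes on a single line. Hypothetical helper for illustration.
xml_tag_value() {
    sed -n "s:.*<$2>\([^<]*\)</$2>.*:\1:p" "$1" | head -n 1
}

# Example:
#   if [ "$(xml_tag_value simple-fullinstall.xml apache-mapred)" != "false" ]; then
#       echo "apache-mapred is not false; check the generated XML" >&2
#   fi
```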

As you proceed with the installation, you should see the BigInsights installation script install the

“HAManager” software components as part of the installation. This is where the Platform

Symphony software is located that supports HA functionality and Adaptive MapReduce

functionality. You can watch for this either through the web installation GUI or by checking the

installation log file.

If you are installing BigInsights 2.1 Enterprise Edition you will need to install a patch by following the procedure documented in the publication “Enabling the full functionality of IBM Platform Symphony in your BigInsights 2.1 cluster”¹. This document is freely downloadable for users with an IBM developerWorks ID.

You can download a small patch for Platform Symphony 6.1.0.1 (the Symphony version included

in BigInsights 2.1) from https://www.ibm.com/support/fixcentral/ following instructions in the

document referenced above. At the time of this writing you can find and download the needed

package from Fix Central by searching for “Platform Symphony” and downloading the package

named “sym-6.1.0.1-build225866”. This package applies to both 64-bit Linux on Intel and

IBM PowerLinux machines. Later versions of BigInsights will not require this patch.

Follow the instructions in the README file. If you are installing the patch as user “root” on the

BigInsights cluster, it would be a good idea to source the BigInsights environment before

attempting to install the patch since the patch procedure assumes the environment variables are

already set.

1 This documentation can be obtained from: https://www.ibm.com/developerworks/community/wikis/form/api/wiki/ee59a95e-5867-4deb-90af-6bed6b0759b8/page/91903357-0a7d-4a96-bb70-520fb2acdc1b/attachment/52d79fbe-dc37-42f0-be3f-5f4b75f14a05/media/Enable%20the%20full%20functionality%20of%20IBM%20Platform%20Symphony%20in%20BigInsight%202.1%20Cluster.pdf


[biadmin@biginsights opt]$ cd /opt/ibm/biginsights/conf

[biadmin@biginsights conf]$ . biginsights-env.sh

[biadmin@biginsights conf]$ echo $EGO_TOP

/opt/ibm/biginsights/HAManager/data

[biadmin@biginsights conf]$
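Because the patch procedure assumes variables such as EGO_TOP are already set, a small guard can catch a forgotten environment script before the patch is attempted. The sketch below is one way to do this; require_env is a hypothetical helper name, and EGO_TOP is the only variable the text above confirms.

```shell
# require_env NAME...: return non-zero (and report on stderr) if any of the
# named variables is unset or empty in the current shell.
# Hypothetical helper for illustration.
require_env() {
    local name value missing=0
    for name in "$@"; do
        case $name in
            ''|[!A-Za-z_]*|*[!A-Za-z0-9_]*)
                echo "invalid variable name: $name" >&2
                return 2 ;;
        esac
        eval "value=\${$name:-}"   # indirect lookup; name was validated above
        if [ -z "$value" ]; then
            echo "required variable $name is not set" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example, before running the patch installer as root:
#   . /opt/ibm/biginsights/conf/biginsights-env.sh
#   require_env EGO_TOP || exit 1
```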

When this patch is applied, the multitenant capabilities of IBM Platform Symphony will become

functional and will be accessible through the Platform Symphony graphical user interface.

When BigInsights is installed, the BigInsights web console by default is available on port 8080 on the

BigInsights management host (as long as BigInsights services are started).

Check the status of the cluster using this command:

$ /opt/ibm/biginsights/bin/status.sh

If necessary, start BigInsights (which will also start Platform Symphony services):

$ /opt/ibm/biginsights/bin/start-all.sh

While logged in as the BigInsights administrator, if Symphony is properly installed with BigInsights you

should be able to run Symphony specific commands. As an example, the user biadmin should be able to

run the following command:

$ egosh service list

This command will list various software services associated with Symphony and show their status.

When the Platform Computing components are installed (Adaptive MapReduce), the Platform Computing resource manager (EGO) is used to persist BigInsights services. You will notice that Symphony services are associated with a consumer called “/Management”. If you are running HDFS, HDFS services like the DataNode and Secondary NameNode are associated with an “/HDFS” consumer. The MapReduce shuffle service is started on compute hosts in the cluster.

[biadmin@biginsights ~]$ egosh service list
SERVICE  STATE    ALLOC CONSUMER RGROUP RESOURCE SLOTS SEQ_NO INST_STATE ACTI
derbydb  DEFINED        /Manage* Manag*
purger   DEFINED        /Manage* Manag*
plc      DEFINED        /Manage* Manag*
WEBGUI   STARTED  54    /Manage* Manag* biginsi* 1     1      RUN        121
RS       DEFINED        /Manage* Manag*
Seconda* DEFINED        /HDFS/S*
MRSS     STARTED  55    /Comput* MapRe* biginsi* 1     1      RUN        120
DataNode DEFINED        /HDFS/D*
SD       STARTED  56    /Manage* Manag* biginsi* 1     1      RUN        119
Service* DEFINED        /Manage* Manag*
WebServ* DEFINED        /Manage* Manag*
NameNode DEFINED        /HDFS/N*
[biadmin@biginsights ~]$
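On a larger cluster this listing gets long, so it can help to filter it by state. The sketch below relies only on the first two columns (SERVICE and STATE) being laid out as in the listing above; the function name services_in_state is hypothetical.

```shell
# services_in_state STATE: read `egosh service list` output on stdin and print
# the names of services whose STATE column matches (e.g. STARTED or DEFINED).
# The header row is skipped. Hypothetical helper for illustration.
services_in_state() {
    awk -v want="$1" '$1 != "SERVICE" && $2 == want { print $1 }'
}

# Example:
#   egosh service list | services_in_state STARTED
```

Note that egosh truncates long names with an asterisk (for example Seconda*), so the names printed are the truncated forms.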



Accessing the Platform Symphony Management Console

The Platform Symphony console will usually be on the same host if you follow the installation

recommendations above, but will be on a different port. Port 18080 is the default. You should be able to

log into the Platform Symphony management console at http://<master-host>:18080/platform. The

default administrator login for Platform Symphony is “Admin / Admin”.

In production clusters there will normally be multiple Platform Symphony management hosts. Setting

this up is beyond the scope of this paper and is covered in the Platform Symphony documentation.

Figure 2: Logging into the Platform Symphony Management Console

If you are having trouble connecting to the Symphony web console you can use the command “egosh

service view WEBGUI” to see details about the web service.

The WEBGUI service should be started automatically by EGO, but if it becomes necessary to start or

stop the service, you can use the following commands:

$ egosh logon

Enter Admin / Admin as the username and the password when prompted

$ egosh service start WEBGUI

$ egosh service stop WEBGUI

The WEBGUI service is implemented using Apache Tomcat.

If there are problems with the WEBGUI you can inspect the logs at ${EGO_TOP}/gui/logs/catalina.out

for information about what might be wrong with the service.



If you cannot connect to the Symphony console, the connection may be blocked by your firewall configuration. You can disable your firewall temporarily to see if this is the cause.

# service iptables stop

If you are not sure what port or host the Platform Symphony GUI was installed on, you should be able to

find it in the XML file that governs the BigInsights installation process (described earlier).

This XML file is generated by the web-based installation process. Platform Symphony related setup

details are found under the “high-availability” section of the XML file that governs the installation process.

<high-availability>

<configure>false</configure>

<master-nodes/>

<baseport>7869</baseport>

<web-port>18080</web-port>

<log-directory>var/ibm/biginsights/ps-mapred/logs</log-directory>

<preferred-ip-mask/>

..

<max-retries>3</max-retries>

<failover>failover</failover>

</high-availability>
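If the port is not known, it can be pulled out of this section with standard tools. The following is a line-oriented sketch that assumes the element layout shown above (one element per line); the function names are hypothetical, and the /platform path and the 18080 default are taken from the console URL given earlier.

```shell
# ha_web_port FILE: print the <web-port> value found inside the
# <high-availability> section of a BigInsights install XML file.
# Hypothetical helper; not a real XML parser.
ha_web_port() {
    sed -n '/<high-availability>/,/<\/high-availability>/p' "$1" |
        sed -n 's:.*<web-port>\([0-9][0-9]*\)</web-port>.*:\1:p' |
        head -n 1
}

# symphony_console_url FILE HOST: build the console URL for HOST, falling
# back to the default port 18080 if no <web-port> is found.
symphony_console_url() {
    local file="$1" host="$2" port
    port=$(ha_web_port "$file")
    echo "http://${host}:${port:-18080}/platform"
}

# Example:
#   symphony_console_url fullinstall.xml "$(hostname -f)"
```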

Once a user logs in to the Platform Symphony console on port 18080, they will see the main Platform

Symphony dashboard. This view is mostly used to monitor the high level status of the various

applications and tenants on a Platform Symphony cluster.

For BigInsights users, most of the action will center around the “MapReduce Workload” screen

accessible under “Quick Links”.



Figure 3: View of the Platform Symphony console when logged in as an Administrator

Accessing the Platform Symphony knowledge center

Once you are able to access the Platform Symphony console above, you may want to access the

Platform Symphony Knowledge Center and bookmark it in your browser. The knowledge center is

accessible in a pull down menu under the question mark in the top bar on the Platform Symphony web

interface.

The knowledge center aggregates all of the various Platform Symphony documentation into a

searchable interface. This will prove handy as you learn about Platform Symphony.

A direct link to the knowledge center can be found at this URL (depending on the hostname where the

web interface is running).

http://<masterhost-name>:18080/doc/symphony/6.1/index.html

The command egosh service list shown earlier will show the name of the host running the web interface (listed as the WEBGUI service) if you are running on a cluster with multiple master hosts.

The Platform Symphony knowledge center, in particular the documentation dealing with the Platform

Symphony MapReduce framework, will be useful to BigInsights administrators since if you are using

Adaptive MapReduce you are in fact using the Platform Symphony MapReduce framework.



Figure 4: Platform Symphony Knowledge Center

Platform Symphony Concepts

While the reader of this document is likely to be familiar with Hadoop and various commercial

distributions, they may be less familiar with IBM Platform Symphony. IBM Platform Symphony is a

commercial grid workload and resource management solution that has been used to share resources

among diverse applications in multitenant environments for over a decade. Platform Symphony is

widely deployed as a shared services infrastructure in some of the world’s largest investment banks.

As a quick primer on some of the terminology referenced in this document, some definitions are offered below. Interested readers may also want to review a document called “IBM Platform Symphony Foundations”, available at http://publibfp.dhe.ibm.com/epubs/pdf/c2750652.pdf .

• Session Manager – Service-oriented applications in Platform Symphony are managed by a session manager. The session manager is responsible for dispatching tasks to service instances, and collecting and assembling results. The Symphony session manager provides a function similar in concept to a Hadoop application manager, although it has considerably more capabilities. Platform Symphony implements job tracker functionality using the session manager. In this paper the terms job tracker, application manager and session manager are used interchangeably. While the concept of multiple concurrent application managers in Hadoop is new with YARN, Platform Symphony has always featured a multitenant design.

• Resource Groups – Unlike Hadoop clusters, Platform Symphony does not make assumptions about the capabilities of hosts that participate in the cluster. While Hadoop generally assumes that member nodes are 64-bit Linux hosts running Java, Platform Symphony supports a variety of hardware platforms and operating environments. Platform Symphony allows hosts to be grouped in flexible ways into different resource groups, and different types of applications can share these underlying resource groups in flexible ways.

• Applications – The term application can be a little bit confusing as it is applied to Platform Symphony. Symphony views an application as the combination of the client-side and service-side code that comprise a distributed application. This is a more expansive definition than most people are used to. By this definition an instance of BigInsights might be viewed as a single application. Examples of Platform Symphony applications are custom applications written in C++, a commercial ISV application like IBM Algorithmics, Calypso or Murex, or a commercial or open source Hadoop application like Cloudera, BigInsights or open source Hadoop. Platform Symphony views applications as being an instance of middleware. Various client side tools associated with a particular version of Hadoop (Pig, Hive, Sqoop, etc.) can all run against a single Hadoop application definition. An important concept for those not familiar with Symphony is that Symphony provisions service instances associated with different applications dynamically. As a result, there is nothing technically stopping a Platform Symphony cluster from supporting multiple instances of Hadoop and non-Hadoop environments concurrently.

Application profiles – As explained above, applications in Symphony are flexible and highly

configurable constructs. An Application Profile in Symphony defines the characteristics of an

application and various behaviors at runtime.

Consumers – From the viewpoint of a resource manager, an application or tenant on the cluster

is defined as something that needs particular types of resources at runtime. Platform Symphony

uses the term “consumer” to define these consumers of resources and provides capabilities to

define hierarchical consumer trees and express business rules about how consumers share

various types of resources collected into resource groups. The leaf nodes in consumer trees map

to a Symphony application.

Services – Services are the portions of applications that run on cluster nodes. In a Hadoop

context, administrators likely think of services as equating to a task tracker that runs Map and

Reduce logic. Here again, Symphony takes a broader view. Symphony services are generic. A

service may be a task-tracker associated with a particular version of Hadoop or it may be

something else entirely. When the MapReduce framework is used in Platform Symphony, the Hadoop service-side code that implements the task tracker logic is dynamically provisioned by

Symphony. Symphony owes its name to this ability to orchestrate a variety of services quickly

and dynamically according to sophisticated sharing policies.

Sessions – A session in Symphony equates to the notion of a job in Hadoop. A client application

in Symphony normally opens a connection to the cluster, selects an application and opens a session. Behind the scenes Symphony will provision a Symphony Session Manager to manage

the lifecycle of the job. A single Symphony Session Manager may support multiple sessions

(Hadoop jobs) concurrently. A Hadoop job is a special case of a Symphony job. The Hadoop

client will start a session manager that provides JobTracker functionality. Platform Symphony

actually uses the job tracker and task tracker code provided in a Hadoop distribution; however, it

uses its own low-latency middleware to more efficiently orchestrate these services on a shared

cluster.

Repositories – As explained previously, Platform Symphony dynamically orchestrates service-

side code in response to application demand. The binary code that comprises an application

service is stored in a Symphony repository. Normally for Symphony applications, Symphony

services are distributed to compute nodes from a repository service. For Hadoop applications,

code can be distributed either via the repository service, or it can be distributed via the HDFS /

GPFS FPO file system.

Tasks – Symphony jobs are collections of tasks. Symphony jobs are managed by a session

manager that runs on a management host. The session manager makes sure that instances of

the needed service are running on compute nodes / data nodes on the cluster. Service instances run under the control of a Symphony Service Instance Manager (SIM). MapReduce jobs in Symphony work the same way, but in this case the Symphony service is essentially the Hadoop task tracker logic. On Hadoop clusters, slots are normally designated as running either map logic or reduce logic. In Symphony, again, this is fluid. Because services are orchestrated dynamically, service instances can run either map or reduce tasks. This is an

advantage because it allows full utilization of the cluster as the job progresses. At the start of a

job the majority of slots can be allocated to map tasks while towards the end of the job the

function of slots can be shifted to perform the reduce function.



An example of configuring a cluster for multi-tenancy

In this section we describe the step-by-step procedure to set up multiple tenants in the BigInsights environment. To provide a realistic multitenant scenario, the diagram roughly models our actual customer environment, with names changed to protect client confidentiality.

The actual environment is more complex with hundreds of users, dozens of groups and approximately

thirty different applications planned, but the application sharing is similar to the diagram below. This

diagram maps to the “Consumer Tree” in Platform Symphony. Consumer is a term used from the

resource manager’s perspective. The resource manager views an application as a consumer of

resources, and the resource manager is responsible for allocating requested resources according to

policies that will be described shortly.

Figure 5 - an example consumer hierarchy for applications and departments

By default, BigInsights (which is just a single application on the cluster) maps to a single application and an associated consumer called “MapReduce61” (the name corresponds to the version of Platform Symphony used to support MapReduce processing in BigInsights, in this case 6.1.0.1). This is done so that Symphony can accommodate the versions of MapReduce provided in future versions of BigInsights and allow versions to co-exist. This is the first consumer in the consumer tree above.

In the production environment the customer has specific needs:

They wish to structure “sub-consumers” under the BigInsights consumer definition

(MapReduce61). This gives the cluster administrator the ability to have different run-time

characteristics for different BigInsights applications. It also allows us to set up configurable sharing policies between our different applications and groups, control which users are allowed to access which applications, and ensure security between tenants by having different applications run under different user IDs if desired.

In this example, under the BigInsights tenant (MapReduce61) we have several different

applications. We’ve arbitrarily called them “MR_AppA” through “MR_AppN” although in the real

environment these are the names of the client’s business applications. Note that we need to

configure each application (tenant) so that it runs under a different operating system level user-

id for security isolation. We also want to control in a granular way which users and groups have

access to these various applications.

Also, as shown in figure 4, the client has additional applications used by particular lines of business that they would also like to deploy on the same cluster. Examples include some Sqoop workloads, DataMeer, IBM Tealeaf, various in-house developed streaming applications and others. In this particular customer implementation all of these applications happen to share the BigInsights MapReduce infrastructure; however, it is important to understand that this need not be the case. As we’ll see shortly, these applications can be totally different and still be configured to share infrastructure.

Adding users to run MapReduce applications

In our example we want to show how multiple users, grouped arbitrarily into one or more groups for security management, can access tenant applications subject to access controls.

We create some sample cluster users for our illustration. These names represent individual cluster

users. For some lines of business, application administrators may choose to create a shared login like

“fraud” for a group authorized to use a particular fraud analytics application.

InfoSphere BigInsights has a recommended procedure for adding users. When using Platform Symphony together with BigInsights, it is recommended that users follow the procedures covered in the BigInsights documentation and use the createosusers.sh tool included in the BigInsights distribution to automate the creation of OS-level users. Doing this ensures that users can access the BigInsights console to run applications deployed using the BigInsights application framework.

For convenience, the BigInsights infocenter is available on the public internet. For information on adding users in BigInsights, see: http://www-01.ibm.com/support/knowledgecenter/SSPT3X_2.1.1/com.ibm.swg.im.infosphere.biginsights.admin.doc/doc/bi_admin_add_users.html?lang=en

The specific procedures will depend on whether you are authenticating access via flat files, LDAP, PAM

or PAM+LDAP. In the example below we are using flat files for simplicity.

To create users known to BigInsights, edit the following file:

$BIGINSIGHTS_HOME/console/conf/security/biginsights_users.xml






Add users as shown below.

<?xml version="1.0" encoding="UTF-8"?>

<server>

<featureManager/>

<basicRegistry id="basic" realm="Auth">

<user name="hadoop" password="passw0rd"/>

<user name="biadmin" password="temp4now"/>

<user name="sysadmin2" password="passw0rd"/>

<user name="appadmin2" password="passw0rd"/>

<user name="sysadmin1" password="passw0rd"/>

<user name="appadmin1" password="passw0rd"/>

<user name="dataadmin2" password="passw0rd"/>

<user name="dataadmin1" password="passw0rd"/>

<user name="user3" password="passw0rd"/>



<user name="vivian" password="temp4now"/>

<user name="gord" password="temp4now"/>

<user name="eric" password="temp4now"/>

<user name="michael" password="temp4now"/>

<user name="vince" password="temp4now"/>

<user name="steven" password="temp4now"/>

<user name="tiffany" password="temp4now"/>

<user name="appA" password="temp4now"/>

<user name="appB" password="temp4now"/>

<user name="appC" password="temp4now"/>

</basicRegistry>

</server>

The next step is to define groups and associate users with groups in the file $BIGINSIGHTS_HOME/console/conf/security/biginsights_groups.xml. This is an example only. The specifics will depend on how you wish to structure your own users and groups.

<?xml version="1.0" encoding="UTF-8"?>

<server>

<featureManager/>

<basicRegistry id="basic" realm="Auth">

<group name="supergroup" gid="4000">

<member name="hadoop" uid="4000"/>

<member name="biadmin" uid="200"/>

</group>

<group name="appAdmins" gid="4100">

<member name="appA" uid="4100"/>

<member name="appB" uid="4101"/>

<member name="appC" uid="4102"/>

</group>

<group name="sysAdmins" gid="4200">

<member name="sysadmin1" uid="4200"/>

<member name="sysadmin2" uid="4201"/>

</group>



<group name="dataAdmins" gid="4300">

<member name="dataadmin1" uid="4300"/>

<member name="dataadmin2" uid="4301"/>

</group>

<group name="users" gid="4400">

<member name="vivian" uid="6001"/>

<member name="gord" uid="6002"/>

<member name="eric" uid="6003"/>

<member name="michael" uid="6004"/>

<member name="vince" uid="6005"/>

<member name="steven" uid="6006"/>

<member name="tiffany" uid="6007"/>

</group>

<group name="groupA" gid="5000">








</group>

<group name="groupB" gid="5001">








</group>

<group name="groupC" gid="5002">








</group>

</basicRegistry>

</server>

In addition to having user IDs that map to individuals, I may want particular applications to execute on the cluster under a specific user ID. For example, if my application is called “appA” I may want to have it execute under a Linux user ID with the same name for simplicity. To accommodate this, notice that we’ve added application-specific users to the biginsights_users.xml file in the example above.
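Since each application is intended to run under its own OS-level user ID, it can be worth checking the groups file for uids that are shared by more than one member name before the accounts are created. The following is a minimal sketch and not part of BigInsights: the function name duplicate_uids is hypothetical, and it assumes each member element sits on a single line with the name attribute before the uid attribute, as in the example above.

```bash
# Print every uid that is assigned to more than one distinct member name.
# Assumes one <member name="..." uid="..."/> element per line.
duplicate_uids() {
    local groups_xml="$1"
    # Emit "uid name" pairs, drop repeated identical pairs (a user may
    # legitimately appear in several groups), then report repeated uids.
    sed -n 's/.*<member name="\([^"]*\)" uid="\([^"]*\)".*/\2 \1/p' "$groups_xml" |
        sort -u |
        awk '{ print $1 }' |
        sort |
        uniq -d
}
```

Running duplicate_uids against $BIGINSIGHTS_HOME/console/conf/security/biginsights_groups.xml prints nothing when every uid maps to a single member name.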

You can add users using operating system facilities, but if you do, these users will not be recognized as

having credentials within the BigInsights web interface. They will still work with Symphony and the

BigInsights Hadoop framework however.



The example below shows how additional users can be added at the OS level. These users will be unable to log in to the BigInsights console.

# useradd fred

# useradd george

# useradd frank

Once you have edited the BigInsights XML files to define users and groups as shown above, you are

ready to run the createosusers.sh script to create these accounts and groups at the operating system

level as well.

Run the createosusers.sh script as user “biadmin”.

# createosusers.sh $BIGINSIGHTS_HOME/console/conf/security/biginsights_groups.xml $BIGINSIGHTS_HOME/console/conf/security/biginsights_users.xml <biadmin's password>

By following the procedure above to create users and groups, you will be able to run and monitor jobs from both the BigInsights console and the Platform Symphony console.

Figure 6 - user Tiffany known as a BigInsights user is known to the Platform Symphony GUI


Figure 7 - user Tiffany and others can also run jobs via the BigInsights console.

Provide access to the BigInsights / Platform Computing cluster

For each operating system user who will be submitting jobs, make sure that their .bashrc file (or

equivalent depending on your shell) in the user’s home directory is configured to source the BigInsights

environment as shown below. If you have followed the procedures above, this should be done for you

automatically. We include these details because you may have additional users not known to BigInsights

that require access to Platform Symphony.

Sourcing the BigInsights environment will ensure that various shell variables like $PATH and

$CLASSPATH as well as environment variables specific to BigInsights and Platform Symphony are in the

environment when the user logs on. This will allow them to immediately run both BigInsights and

Symphony commands. If you are adding many users outside the procedure recommended above to add

BigInsights users, and you want them all to have access to the cluster, it will be faster to adjust the

system-wide template for .bashrc file (in /etc/skel) or adjust the common /etc/bashrc depending on

your preference.

If you have followed the instructions above, this step may not be necessary, but it is a good idea to

check that when users login they are inheriting an environment appropriate for running BigInsights jobs

and that they have access to the Platform Symphony environment.

In our case we want both our named users and the user IDs that our applications will run under in Symphony (see the concept of impersonation explained later) to source the environment and be able to run commands.

[root@biginsights gord]# cat .bashrc

# .bashrc

# Source global definitions

if [ -f /etc/bashrc ]; then
    . /etc/bashrc
fi



# User specific aliases and functions

# source the environment for BigInsights and Platform Symphony

source /opt/ibm/biginsights/conf/biginsights-env.sh
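If the same .bashrc is distributed to hosts where BigInsights may not be installed, an unguarded source line prints an error at every login. A minimal sketch of a guarded variant follows. The function name source_if_readable is our own and is not part of BigInsights or Platform Symphony; the path in the usage comment is the one from the listing above.

```bash
# Source an environment file only if it exists and is readable.
# Warns on stderr and returns non-zero when the file is missing.
source_if_readable() {
    local env_file="$1"
    if [ -r "$env_file" ]; then
        . "$env_file"
    else
        echo "warning: $env_file not readable; environment not loaded" >&2
        return 1
    fi
}

# Usage in .bashrc (path taken from the listing above):
#   source_if_readable /opt/ibm/biginsights/conf/biginsights-env.sh
```

The same function can be placed in /etc/bashrc if the environment is to be made available system-wide.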

You should be able to su to your created user ID after this and run Symphony or BigInsights commands. Below we see that I can run a Symphony command confirming that my environment is set up correctly. Note that with the installation of BigInsights we are entitled to use Platform Symphony Advanced Edition, which is the version of Symphony that supports the Hadoop MapReduce framework. We are not entitled to use some of the other add-on products listed.

[root@biginsights /]# su - gord

[gord@biginsights ~]$ egosh entitlement info

Symphony Edition : Advanced

Desktop Harvesting : Not Entitled

Server Harvesting : Not Entitled

Virtual Server Harvesting : Not Entitled

GPU : Not Entitled

[gord@biginsights ~]$

After following the procedure above, it is a good idea to make sure that our /etc/group file reflects the setup we’ve configured in the BigInsights XML files.

In /etc/group, define the users that will be allowed to submit workloads on behalf of each group.

This is a very simple example. In reality, different users would belong to different groups and these

group names would be meaningful in the context of how the customer organizes their business.

groupA:x:5000:vivian,gord,eric,michael,vince,steven,biadmin

groupB:x:5001:vivian,gord,eric,michael,vince,steven,biadmin

groupC:x:5002:vivian,gord,eric,michael,vince,steven,biadmin

groupD:x:5003:vivian,gord,eric,michael,vince,steven,biadmin

groupF:x:5004:vivian,gord,eric,michael,vince,steven,biadmin

groupG:x:5005:vivian,gord,eric,michael,vince,steven,biadmin

groupH:x:5006:vivian,gord,eric,michael,vince,steven,biadmin

groupI:x:5007:vivian,gord,eric,michael,vince,steven,biadmin
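To confirm that a given user really is listed for a given group, the group file can be checked directly. The sketch below is illustrative only: user_in_group is a hypothetical helper, it reads a file in the standard group format (name:password:gid:member,member,...), and it looks only at the member list in the fourth field, so a user's primary group is not considered.

```bash
# Return 0 if USER appears in the member list of GROUP in a group-format file.
# The third argument defaults to /etc/group.
user_in_group() {
    local user="$1" group="$2" group_file="${3:-/etc/group}"
    awk -F: -v g="$group" -v u="$user" '
        $1 == g {
            n = split($4, members, ",")
            for (i = 1; i <= n; i++) if (members[i] == u) found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "$group_file"
}
```

For example, `user_in_group gord groupA && echo ok` prints ok when gord is listed as a member of groupA in /etc/group.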

Understanding Platform Symphony Impersonation

Now is a good time to explain the concept of “impersonation” in Platform Symphony. Symphony has

two different workload execution modes:

Simple Workload Execution Mode

Advanced Workload Execution Mode

This is normally an installation option with Platform Symphony. BigInsights Enterprise Edition installation

automatically installs Platform Symphony in Advanced Workload Execution Mode. This term is

frequently abbreviated as WEM in the Symphony documentation. In Advanced Workload Execution Mode, core Symphony services run as root, and application administrators are able to control the user ID that clustered applications run under.



Our approach to security hinges on this concept of impersonation in Symphony and we will see shortly

how we configure our applications to run under specific user credentials and control what users have

access to what applications and resources. The section called “Security within the MapReduce

framework” in the MapReduce user guide in the Platform Symphony documentation discusses this in

detail.

The customer that this paper is modeled after employs Kerberos authentication for their MapReduce jobs to ensure security and to ensure that a service supporting impersonation cannot be spoofed. Configuring Kerberos is beyond the scope of this short document, but customers will be pleased that this capability exists. Symphony is frequently deployed in secure environments where these capabilities are important.

Configuring OS groups for the multitenant environment

For users making use of Platform Symphony (both named users and the user IDs that applications will

run under via impersonation) these IDs need to be part of the OS group that owns the BigInsights (and

by extension the Symphony) installation.

In our installation, BigInsights was installed as part of the “biadmin” group, so we adjust the group

membership so that each application ID that Symphony jobs will run under is a part of the BigInsights

group.

biadmin:x:0:root,biadmin,gord,eric,vivian,appA,appB,appC,appD,appE,appF,appG

bin:x:1:root,bin,daemon

daemon:x:2:root,bin,daemon

..

If you are unsure what group BigInsights was installed under, issue a command like

$ ls -al ${EGO_TOP}

You will see the user and group that own each file. This will vary depending on how you installed

BigInsights but the default group is biadmin.

Submitting a test job as a user to verify the configuration

As we mentioned before, by default BigInsights is configured to use an Application called MapReduce61

which maps to the consumer called /MapReduceConsumer/MapReduce61.

I should be able to login to any of the accounts created, and run a sample Hadoop job. The sleep

command included with the BigInsights examples is a convenient Hadoop application for testing the

MapReduce framework. This command submits variable numbers of Map and Reduce tasks that simply

sleep for variable amounts of time. The example below submits two mappers that will sleep for 2 seconds (2,000 msec) followed by ten reducers that will also sleep for 2 seconds (-rt 2000).

Besides being a useful validation that everything is working, this test illustrates the performance

advantage of using Platform Symphony as the MapReduce framework over open-source Hadoop.


Platform Symphony can run tests like this, made up of short-running map and reduce tasks, dramatically faster than open source Hadoop, often more than ten times faster, even when a competing cluster is configured with a short polling interval.

Note that as the test Hadoop job runs, everything is identical to open source Hadoop (it is actually the BigInsights-supplied Hadoop classes that are running) except that our JobTracker logic in Hadoop is running inside a Symphony Session Manager.

Note also that the running job is given a Platform Symphony job ID (job_ssm_0401 in this example).

Because Platform Symphony is managing the job execution, it is able to manage this job as well as other

jobs on the cluster including non-Hadoop jobs.

[gord@biginsights ~]$ hadoop jar ${HADOOP_HOME}/hadoop-example.jar sleep -m 2 -r 10 -mt 2000 -rt 2000

14/03/15 13:14:25 INFO internal.MRJobSubmitter: Connected to JobTracker(SSM)

14/03/15 13:14:26 INFO internal.MRJobSubmitter: Job <Sleep job> submitted, job id <401>

14/03/15 13:14:26 INFO internal.MRJobSubmitter: Job will not verify intermediate data integrity using checksum.

14/03/15 13:14:26 INFO mapred.JobClient: Running job: job_ssm_0401

14/03/15 13:14:27 INFO mapred.JobClient: map 0% reduce 0%







14/03/15 13:14:59 INFO mapred.JobClient: Job complete: job_ssm_0401

14/03/15 13:15:00 INFO mapred.JobClient: Counters: 18

14/03/15 13:15:00 INFO mapred.JobClient: Shuffle Errors

14/03/15 13:15:00 INFO mapred.JobClient: WRONG_PATH=0

14/03/15 13:15:00 INFO mapred.JobClient: CONNECTION=0

14/03/15 13:15:00 INFO mapred.JobClient: IO_ERROR=0

14/03/15 13:15:00 INFO mapred.JobClient: FileSystemCounters

14/03/15 13:15:00 INFO mapred.JobClient: FILE_BYTES_WRITTEN=5146

14/03/15 13:15:00 INFO mapred.JobClient: Map-Reduce Framework

14/03/15 13:15:00 INFO mapred.JobClient: Reduce input groups=400

14/03/15 13:15:00 INFO mapred.JobClient: Combine output records=0

14/03/15 13:15:00 INFO mapred.JobClient: Map output records=400

14/03/15 13:15:00 INFO mapred.JobClient: SHUFFLED_MAPS=20

14/03/15 13:15:00 INFO mapred.JobClient: Reduce shuffle bytes=2440

14/03/15 13:15:00 INFO mapred.JobClient: Combine input records=0

14/03/15 13:15:00 INFO mapred.JobClient: Spilled Records=800

14/03/15 13:15:00 INFO mapred.JobClient: SPLIT_RAW_BYTES=0

14/03/15 13:15:00 INFO mapred.JobClient: Map output bytes=1600

14/03/15 13:15:00 INFO mapred.JobClient: Reduce input records=400

14/03/15 13:15:00 INFO mapred.JobClient: GC_TIME_MILLIS=0

14/03/15 13:15:00 INFO mapred.JobClient: FAILED_SHUFFLE=0

14/03/15 13:15:00 INFO mapred.JobClient: MERGED_MAP_OUTPUTS=20

14/03/15 13:15:00 INFO mapred.JobClient: Reduce output records=0




As this job runs, we can monitor it in the Symphony GUI by using the QuickLinks menu and selecting “MapReduce Workload” to open the MapReduce workload screen. As the MapReduce job runs, you will see a view like the one shown in figure 8.

Figure 8 - monitoring our job using the Platform Symphony web interface

Note that the submitted job is associated with the application MapReduce6.1 (the application that BigInsights submits jobs to by default).

You can also launch jobs via the standard BigInsights Web GUI and watch them run either from within

the BigInsights console or from within the Platform Symphony Web interface.

Figure 9: Launching a terasort job from BigInsights

The Terasort example in BigInsights uses Oozie to manage the sequence of running the teragen application to generate the dataset to be sorted, followed by Terasort itself.


As the jobs run in the BigInsights context, we see them running in Platform Symphony associated with the MapReduce6.1 application that BigInsights is bound to.

Any BigInsights application that exercises the MapReduce framework, including services like Hive, Pig, Big SQL, BigSheets and others, will work with Symphony in this same way.

Figure 10 - Platform Symphony monitoring Terasort job run from BigInsights

Associating BigInsights with a Symphony Application

We’ve mentioned a few times that BigInsights is associated with the Symphony MapReduce6.1

application and customers frequently ask where this association is made.

[biadmin@biginsights ~]$ cd $HADOOP_CONF_DIR

[biadmin@biginsights hadoop-conf]$ cat pmr-site.xml

<?xml version="1.0"?>







<configuration>

<property>

<name>mapreduce.application.name</name>

<value>MapReduce6.1</value>

<description>The mapreduce application name.</description>

</property>

<property>

<name>mapreduce.map.skip.commit.task</name>

<value>false</value>

</property>

In the BigInsights directory $HADOOP_CONF_DIR, the file pmr-site.xml holds the name of the Symphony application that BigInsights submits jobs to, and you can change it there. It is important to have this flexibility because over time customers may end up with different versions of BigInsights along with other applications co-existing on the same cluster.
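To see which Symphony application a given configuration directory is bound to without opening the file, the property can be read from the command line. The following is a rough sketch rather than a supported tool: get_hadoop_property is a hypothetical helper, and it assumes the usual Hadoop layout in which each <name> element is followed by its <value> element, with each element on a single line.

```bash
# Print the <value> that follows a given <name> in a Hadoop-style XML file.
# Prints nothing if the property is not present.
get_hadoop_property() {
    local file="$1" name="$2"
    awk -v key="$name" '
        /<name>/ {
            n = $0; sub(/.*<name>/, "", n); sub(/<\/name>.*/, "", n)
            found = (n == key)
        }
        found && /<value>/ {
            v = $0; sub(/.*<value>/, "", v); sub(/<\/value>.*/, "", v)
            print v
            exit
        }
    ' "$file"
}
```

With the pmr-site.xml shown above, `get_hadoop_property "$HADOOP_CONF_DIR/pmr-site.xml" mapreduce.application.name` would print the application name configured there.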



Enabling Symphony Repository Services

By default, when Platform Symphony is installed the repository service in Symphony is disabled. The

function of the repository service is to store the application services and distribute the code that

implements services dynamically to service instances on the cluster.

The MapReduce framework in Platform Symphony by default distributes the application service code

(specifically the application logic that implements the task tracker functionality and Jar files that

implement map and reduce logic) by copying them to HDFS with a high block replication factor so that

the files will be accessible on all nodes.

If you are planning to add and remove application profiles or consumers in Symphony, you will need to start the Symphony repository service. Otherwise you will encounter errors, as some of these services assume that the repository service in Symphony is running.

This can be done through the web interface by following these steps:

From the QuickLinks menu select system services

For the service abbreviated as RS, select “Start” from the Actions pull-down menu

After you refresh the GUI view you should see the service has started on a master host



Figure 11 - Managing system services in Platform Symphony

The system services view is useful. This shows a list of system services that EGO is managing. Note that

EGO is managing not only native Platform Symphony services, but BigInsights services as well.

Adding a new Application / Tenant

Fundamental to the design of BigInsights 2.1 (and Open Source Hadoop) is the idea that there is only a

single instance of a Hadoop cluster.

Platform Symphony, however, supports multiple applications sharing the same cluster. It is also flexible enough to support multiple instances of an application environment like BigInsights, although configuring this is outside the scope of this paper.

Examples of tenants we may want to add might be:

A native Symphony application written to the Platform Symphony APIs

A batch-oriented workload (when Platform LSF is installed as an add-on to Platform Symphony)

A distinct Hadoop MapReduce environment

Third party applications like SAS, MatLab or Revolution R


A separate Hadoop MapReduce application instance that shares resources with other applications and uses the same Hadoop binaries and file system instance.

In this example we are showing the last case where multiple Hadoop applications share resources.

From the Platform Symphony Dashboard:

Use the QuickLinks menu and select Resources

Select Workload / MapReduce / Application profiles from the pull down menu

There will already be an application profile defined for MapReduce6.1. This is installed automatically with Symphony and is the application profile that is used by BigInsights by default.

To add a new application profile to support a new tenant, click the “Add” button. The screen shown in figure 12 will appear.

Figure 12 - Adding a new Application definition

We supply the following parameters:

Our application name (Sqoop) – We require this tenant to use a different version of Sqoop than the version included with BigInsights, as mentioned earlier

We define the user ID that starts the job tracker and runs jobs – This is the impersonation feature described earlier. This particular application will run under the OS ID appB.



Symphony has 10,000 priority levels. By default we are going to submit Sqoop jobs as having a

low priority.

We configure user accounts that have access to this application. Note that we’ve provided all

users in GroupA access to the application along with named operating system and Platform

Symphony users.

Based on this information, Platform Symphony adds an application named Sqoop with a set of

reasonable defaults for a Hadoop MapReduce job. To make sure that our new application is working, as

a user entitled to use the application I can submit a test job as I did before.

Note that in this case I am specifying that I want to have the job handled by a different MapReduce application definition, so I specify Sqoop as the application name on the command line.

Test the new application consumer by submitting a job as before.

[gord@biginsights ~]$ hadoop jar ${HADOOP_HOME}/hadoop-example.jar sleep -Dmapreduce.application.name=Sqoop -m 2 -r 10 -mt 2000 -rt 2000



job id <1>












14/03/13 12:33:09 INFO mapred.JobClient: Counters: 18

..

What has changed is that in figure 13 we see that our job is now running under our separate application definition called Sqoop.
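When several application profiles have been added, the same smoke test can be repeated for each of them. The sketch below only prints the command line for a given application name so that it can be reviewed before being run. The helper name sleep_job_cmd is hypothetical; the jar name and options are the ones used in the example above, and HADOOP_HOME is assumed to be set by the BigInsights environment.

```bash
# Print (do not run) the sleep-job command for a given Symphony application.
# Arguments: application name, then optional map count, reduce count,
# map sleep time (msec) and reduce sleep time (msec).
sleep_job_cmd() {
    local app="$1" maps="${2:-2}" reduces="${3:-10}"
    local map_ms="${4:-2000}" reduce_ms="${5:-2000}"
    local jar="${HADOOP_HOME}/hadoop-example.jar"
    printf 'hadoop jar %s sleep -Dmapreduce.application.name=%s -m %s -r %s -mt %s -rt %s\n' \
        "$jar" "$app" "$maps" "$reduces" "$map_ms" "$reduce_ms"
}
```

For example, `for app in MapReduce6.1 Sqoop; do sleep_job_cmd "$app"; done` prints one command per application, which can then be run by a user entitled to use that application.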

This shows the basic process of adding the new application profile for a MapReduce job to Symphony to

support our additional tenants. The next step of course is to edit the configuration of the tenant as

necessary to suit the unique needs of the application. For example, my requirement may be as simple as re-pointing some environment variables to different installation and configuration directories for Sqoop for jobs submitted to this application.

[biadmin@biginsights hadoop-conf]$ set | grep SQOOP

SQOOP_CONF_DIR=/opt/ibm/biginsights/sqoop/conf

SQOOP_HOME=/opt/ibm/biginsights/sqoop

[biadmin@biginsights hadoop-conf]$



Note that below my Job ID has reset to “1” since this is the first job associated with this particular

application tenant.

Figure 13 - Sleep job running under newly created application definition

Under the “Workload” / “MapReduce” / “Application Profiles” we can define as many separate

applications as we’d like. The view below shows additional applications added using the same process detailed for the Sqoop application.

Figure 14 - Available MapReduce Application Profiles

Only MapReduce applications appear because “Application Profiles” has been selected from the MapReduce submenu. Figure 15 shows a similar view of “Applications”, accessible from the same workload dropdown menu, except that instead of looking at Application Profiles I’m looking at a dashboard of the applications themselves with job-related status.


Figure 15 - Dashboard of MapReduce applications

Configuring application properties

When a new application profile is created, a default template is used that represents reasonable settings for a MapReduce workload. The next step is to configure application profiles to meet the unique requirements of each application workload.

In the Platform Symphony reference manual accessible from the knowledge center, application profiles

are covered in detail. Some of the more commonly configured settings are shown below.

To configure application properties for Sqoop, modify the application profile by selecting “Workload” /

“MapReduce” / “Application Profiles” from the top menu on the MapReduce applications screen. Select

the application profile definition for Sqoop created earlier and select Modify.

A new window will appear that allows detailed settings for the application to be changed. This web interface edits the application service profile definitions (discussed shortly) that are stored in the directory $EGO_TOP/data/soam/profiles on the Platform Symphony master host. Enabled profiles reside in a subdirectory called “enabled” and disabled profiles reside in a subdirectory called “disabled”.
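Because the profiles are ordinary files on the master host, their state can also be checked from the shell. The sketch below is an illustration under a stated assumption: it assumes that each profile is stored as a file named after the application with an .xml extension, which is not confirmed by this paper and should be checked against your installation. The helper name profile_state is hypothetical.

```bash
# Report whether an application profile is enabled, disabled or unknown.
# ASSUMPTION: profiles are stored as <application name>.xml files under the
# "enabled" and "disabled" subdirectories of the profiles directory.
profile_state() {
    local profiles_dir="$1" app="$2"
    if [ -e "$profiles_dir/enabled/$app.xml" ]; then
        echo enabled
    elif [ -e "$profiles_dir/disabled/$app.xml" ]; then
        echo disabled
    else
        echo unknown
        return 1
    fi
}
```

For example, `profile_state "$EGO_TOP/data/soam/profiles" Sqoop` would print enabled if the Sqoop profile created earlier is active and stored under that naming convention.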

The first tab in the interface, called Application Profile, allows application profile settings to be adjusted. The second tab, labeled Users, provides an opportunity to modify the users and groups that will have access to the application profile.



Figure 16 - Application Profile

Some important tips about Application Profiles:

- Application Profile names must be unique.
- An Application Profile can be associated with only a single consumer.
- In the consumer tree, MapReduce applications are by default placed under the MapReduceConsumer tree.
- You can find templates for various application profiles in the directory $SOAM_HOME/6.1/Samples/Templates. The term SOAM in Symphony refers to the service-oriented application middleware on which the MapReduce service is implemented.

The application profile can be viewed in an Advanced Configuration, a Basic Configuration or in a

Dynamic Configuration Update mode. The Dynamic Configuration Update mode is not covered here, but

essentially it allows an administrator to register a profile fragment (part of an application profile)

modifying either the session types or services sections of the profile.

The General settings area contains settings such as where metadata associated with jobs and job history is stored, the default service definition to be used (MapReduce for MapReduce applications) and resource requirements.



Resource requirements are an important concept in Symphony. In this simple example, by using the syntax “select(!mg)”, we are essentially saying: run this service on any host that is not tagged as a member of the management group.

Resource requirement selections in Symphony are flexible and are covered in the Symphony documentation. I can use SQL-like resource requirement strings to specify the types of resources I would like to use in a granular way. If, for example, I know that a particular application runs best on a large-memory PowerLinux machine, I can express a requirement (or preference) for this application with an appropriate resource requirement string.

select(!mg) && select(PowerResourceGroup) && select(maxmem > 8000 && maxswp >= 16000)

The example above indicates that this service requires hosts that are part of a Power-based resource group, are not management hosts, and have at least 8GB of physical memory and 16GB of swap space available.
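To make the selection logic concrete, here is a minimal Python sketch of how a requirement like the one above narrows down a set of hosts. It does not call any Symphony API. The Host record, the host names and the matches function are hypothetical, and the attribute names simply mirror the maxmem and maxswp terms in the string (assumed here to be expressed in MB).

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical host record for illustration; not a Symphony data structure."""
    name: str
    maxmem: int                                # physical memory, assumed MB
    maxswp: int                                # swap space, assumed MB
    groups: set = field(default_factory=set)   # tags / resource groups


def matches(host: Host) -> bool:
    """Mirror of: select(!mg) && select(PowerResourceGroup)
    && select(maxmem > 8000 && maxswp >= 16000)."""
    return (
        "mg" not in host.groups
        and "PowerResourceGroup" in host.groups
        and host.maxmem > 8000
        and host.maxswp >= 16000
    )


# Made-up inventory used only to exercise the filter.
hosts = [
    Host("master1", 32000, 32000, {"mg"}),
    Host("power1", 64000, 16000, {"PowerResourceGroup"}),
    Host("power2", 8000, 16000, {"PowerResourceGroup"}),
    Host("x86a", 64000, 32000, {"ComputeHosts"}),
]

eligible = [h.name for h in hosts if matches(h)]
print(eligible)
```

Only power1 passes all three clauses: master1 is a management host, power2 does not have more than 8000 MB of memory, and x86a is not in the Power resource group.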

Pre-starting application services is a useful feature in Symphony. Application services refer to the

Symphony session manager (SSM) as well as service instance managers and service instances associated

with the application. As a reminder, with MapReduce workloads the SSM can be viewed as an

Application Manager. This is the component that implements the JobTracker logic. Service instances

will load TaskTracker logic appropriate to the version of Hadoop and will start map or reduce tasks

appropriate to the application.

If you have many applications and are frequently sharing slots, pre-starting applications may not be useful. By default Symphony will start SSMs automatically as clients connect and request services from the middleware. As resources are assigned to applications, Symphony will dynamically provision the needed service code and start services as appropriate.

Pre-starting applications is useful for applications that need to respond quickly. You can control the number of slots (each slot can support a map or reduce task) that are pre-started by default.

Figure 17 - Optionally have an application pre-allocate services

A key thing to understand about the Platform Symphony session manager is that it is fully multithreaded and can accommodate multiple sessions at the same time. A session equates to a job submitted by a MapReduce user. Each job maps to a session, and each session may have a large number of tasks.



When multiple users are concurrently submitting jobs to the same application, the scheduling policy controls how resources are shared. The R_Proportion policy specifies that resources are shared in proportion to the priority of each job, which is often the most sensible choice.

As an example, if I had 5000 slots allocated to this application consumer definition and JobA was

submitted to the application with priority 4000 and JobB was submitted with priority 1000, Symphony

would run both workloads concurrently under the same application definition giving 80% of available

resources to JobA. Unlike standard Hadoop where resource assignments are static while the job is

executing, Symphony can respond quickly at run-time to re-balance resource allocations between jobs.
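The arithmetic behind this example can be written out as a short Python sketch. It only illustrates proportional division by priority. The function name is hypothetical, and the real scheduler presumably weighs additional factors (such as how many tasks each job still has pending) that are not modeled here.

```python
def proportional_shares(total_slots, priorities):
    """Split total_slots across jobs in proportion to their priorities.

    Integer division is used, so a few slots can be left unassigned
    when the numbers do not divide evenly.
    """
    total_priority = sum(priorities.values())
    return {
        job: total_slots * priority // total_priority
        for job, priority in priorities.items()
    }


# The scenario from the text: 5000 slots, JobA at priority 4000, JobB at 1000.
shares = proportional_shares(5000, {"JobA": 4000, "JobB": 1000})
print(shares)

# Re-balancing at run-time: once JobB finishes, JobA can take every slot.
after_job_b = proportional_shares(5000, {"JobA": 4000})
print(after_job_b)
```

With these inputs JobA receives 4000 of the 5000 slots (80%) and JobB receives 1000. Re-running the calculation when the set of active jobs changes is a simple way to picture the run-time re-balancing described above.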

Note that since each SSM maps to an application (a MapReduce application in this case) this scheduling

policy controls how multiple jobs running in the same application context share resources. A separate

resource sharing plan discussed shortly controls how sharing is implemented more broadly between

applications and tenants.

The term application can be confusing to users not familiar with Symphony. Symphony is referring to an application in the context of the Hadoop services themselves: the binary code that comprises BigInsights services like the JobTracker and the TaskTracker. It is not referring to the actual application code written by users that runs on the Hadoop framework. In this case, a single Symphony application can run different user applications within the same Hadoop MapReduce context.

Figure 18 - controlling how multiple jobs associated with an application share resources

The Symphony application profile definition provides precise control over how MapReduce workloads

run, and this is useful to advanced users (in our experience most sites running Hadoop are already quite

advanced and will appreciate this).

A nice feature of Symphony is that because the execution logic is provisioned dynamically, slots are interchangeable between mappers and reducers. The settings in Figure 19 allow this to be configured, along with preferences for default ratios between mappers and reducers and precise configuration on a per resource group basis.



Figure 19 - MapReduce Settings associated with an Application

Symphony allows multiple service definitions to exist for each application, and the service definition section provides granular control over this capability. This is useful for applications written to Platform Symphony’s native APIs and may be useful for Hadoop developers. For BigInsights it is not necessary to change this setting because Platform has already implemented a service called “RunMapReduce” that is started by service instance managers to handle MapReduce workloads. The process of starting this service is automatic for the MapReduce service. The service itself can be found in the directory ${EGO_TOP}/soam/mapreduce/6.1/linux2.6-glibc2.3-x86_64/etc. Note that the Start Command in Figure 20 allows for operating system specific implementations of a service definition for an application.

Figure 20 - configuring service definitions for the application

In the application profile definition, an administrator can control environment variables associated with the application. This is an important capability for ensuring multitenancy. By using environment variables I can control what applications run in granular ways. If I choose, I could have an application profile that associates itself with a separate Hadoop instance by defining application-specific variables such as $HADOOP_HOME and $HADOOP_CONF_DIR that reference different software versions and different configuration files.

I can also resolve technical issues that often occur where particular applications depend on particular versions or distributions of the Java run-time environment by defining $JAVA_HOME to point to the version of Java needed by a specific application.
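The effect of per-application environment settings can be pictured with a small Python sketch. This is not how the application profile stores its settings. The application names AppA and AppB, the install paths and the build_env helper are all hypothetical placeholders chosen for illustration.

```python
def build_env(base_env, overrides):
    """Return a new environment: base_env with per-application overrides applied."""
    env = dict(base_env)   # copy so the shared base is never modified
    env.update(overrides)
    return env


# Hypothetical values for illustration; the install paths are placeholders.
APP_ENVIRONMENTS = {
    "AppA": {
        "HADOOP_HOME": "/opt/example/hadoop-a",
        "HADOOP_CONF_DIR": "/opt/example/hadoop-a/conf",
        "JAVA_HOME": "/opt/example/java-a",
    },
    "AppB": {
        "HADOOP_HOME": "/opt/example/hadoop-b",
        "HADOOP_CONF_DIR": "/opt/example/hadoop-b/conf",
        "JAVA_HOME": "/opt/example/java-b",
    },
}

base = {"PATH": "/usr/bin:/bin", "JAVA_HOME": "/usr/java/default"}
env_a = build_env(base, APP_ENVIRONMENTS["AppA"])
env_b = build_env(base, APP_ENVIRONMENTS["AppB"])
print(env_a["JAVA_HOME"], env_b["JAVA_HOME"])
```

Each application sees its own Hadoop and Java locations while the host-wide defaults stay untouched, which is the isolation property the environment settings are meant to provide.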

Figure 21 - configuring the environment for the application

This is a good time to mention that while much of the discussion in Hadoop centers on Java because

Hadoop itself is written in Java, Symphony supports heterogeneous applications. It does not matter

whether application clients or services are written in C/C++, Java, scripting languages or even C# in

Microsoft .NET environments. The versatility to handle all types of workloads is what makes Symphony

powerful as a multitenant environment.

Another unique capability that Symphony brings to Hadoop is the notion of “Recoverable sessions”. This concept does not exist in open source Hadoop, where the JobTracker is implemented in a simplistic way. If the JobTracker fails at run-time, in standard Hadoop the job needs to be re-started.

The Symphony SOAM middleware however has long supported the notion of journaling transactions so

that Hadoop MapReduce jobs become inherently recoverable. If the software service running the

JobTracker logic fails (and re-starts on the same host or a different host) the Symphony job can recover

from where it left off. This is a major advantage for customers that have long-running Hadoop jobs that

need to complete within specific batch windows.
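The general idea of journaling and recovery can be sketched in a few lines of Python. This is a generic illustration of the technique, not a description of how the SOAM middleware is implemented. The run_job function, the task names and the in-memory list standing in for a durable journal are all assumptions made for the example.

```python
def run_job(tasks, journal, fail_after=None):
    """Run tasks in order, recording each completed task in the journal.

    Tasks already present in the journal are skipped, which is what makes
    a restarted job resume instead of starting over. fail_after simulates
    a failure of the job manager after that many tasks in this run.
    """
    executed = []
    for task in tasks:
        if task in journal:
            continue                      # completed before the failure
        if fail_after is not None and len(executed) == fail_after:
            raise RuntimeError("simulated job manager failure")
        executed.append(task)             # stand-in for running the task
        journal.append(task)              # record completion
    return executed


tasks = ["map-1", "map-2", "map-3", "reduce-1"]
journal = []                              # stand-in for a durable journal

try:
    run_job(tasks, journal, fail_after=2)
except RuntimeError:
    pass                                  # the manager "restarts" here

resumed = run_job(tasks, journal)
print(resumed)
```

After the simulated failure the second run executes only map-3 and reduce-1, because the first two tasks are already recorded in the journal.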



This and other points of configurability are very important for specific workloads. As another example, if

I have execution logic where the reducer is multi-threaded I can control the ratio of reducer services to

slots thereby giving a reducer multiple slots if it can take advantage of them.

Figure 22 - configuring session behaviors in an SSM / Application Manager

Associating applications with consumers

The last section provided some details on how application profiles are used in Symphony to customize

applications to support multi-tenancy. In the Symphony architecture, resources are not actually

allocated to applications directly. They are allocated to Consumer definitions which in turn map to

applications.

This is an important distinction because while the application space is essentially “flat” (I have multiple applications and flavors of applications of different types), the structure of consumers is usually hierarchical. This is because most organizational structures are hierarchical.

- A bank may have several lines of business, each with various departments or application groups.
- A service provider may have multiple tenant customers, and may provide different application services for each tenant.
- A government agency may have different divisions, each running different applications with a particular need to segment data access.



Symphony allows consumer trees to be set up in flexible ways to accommodate the needs of almost any

organization. A key concept to understand is that the leaf-nodes of consumer trees are linked to the

application definitions we looked at in the previous section.

Accessing Consumer Definitions

To view consumer definitions, from the MapReduce screen in Symphony select “Resources / Resource

Planning / Consumers”. This is the interface that is used to manage the Consumer Tree.

Setting up the consumer tree is reasonably straightforward. The left side panel is used to control where

you are on the tree and the right side of the interface allows one to perform operations relative to that

segment on the tree.

Recall from our scenario earlier that we had multiple groups running Datameer workloads for which we wanted to enforce sharing policies. Datameer workloads also have specific setup dependencies that are different from those of BigInsights workloads, so the Datameer workloads require their own application profile. We also wanted to provide isolation between the work done by different Datameer application user groups. To achieve this, we have defined sub-consumers under Datameer with a consumer appropriate for each group. We can also control which users have access to each consumer. Note the hierarchical notion of consumers in Symphony.

Figure 23 - A populated consumer tree in Symphony

The leaf nodes of the consumer tree under Datameer each link to a specific application profile. The association between an application and its position in the consumer tree is made in the application profile.
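The relationship between a hierarchical consumer tree and a flat set of application profiles can be modeled with a short Python sketch. The structure below is not the format Symphony uses to persist consumers. The group names GroupA and GroupB and the Datameer profile names are hypothetical.

```python
# Nested dicts are consumers; a string value at a leaf names the
# application profile linked to that leaf consumer (names are made up).
consumer_tree = {
    "MapReduceConsumer": {
        "Sqoop": "Sqoop",
        "Datameer": {
            "GroupA": "DatameerGroupA",
            "GroupB": "DatameerGroupB",
        },
    },
}


def leaf_paths(tree, prefix=""):
    """Yield (consumer_path, application_profile) for every leaf consumer."""
    for name, child in tree.items():
        path = f"{prefix}/{name}"
        if isinstance(child, dict):
            yield from leaf_paths(child, path)   # descend into sub-consumers
        else:
            yield path, child


links = list(leaf_paths(consumer_tree))
for path, profile in links:
    print(path, "->", profile)

# Each application profile is associated with only a single consumer.
profiles = [profile for _, profile in links]
assert len(profiles) == len(set(profiles))
```

Walking the tree produces one path per leaf, and the final check reflects the earlier tip that an application profile can be associated with only a single consumer.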



Figure 24 - MapReduce applications

Manually editing Consumer Tree definitions

Advanced users may find it easier to manually edit the consumer tree.

Platform Symphony stores consumer tree definitions in $EGO_TOP/kernel/conf in the file

ConsumerTrees.xml.

If you hand edit this file, you will need to restart EGO services to bring the web-based view into

synchronization with the actual contents of the XML files where these settings are persisted.
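Because a hand-edited file is easy to break, it can be worth confirming that the XML is still well-formed before restarting services. The Python sketch below is one way to do that with the standard library. It checks syntax only and makes no assumption about the ConsumerTrees.xml schema. The check_well_formed helper and the sample strings are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def check_well_formed(source):
    """Return (True, root_tag) if source parses as XML, else (False, error text).

    source may be a file path or an open file-like object.
    """
    try:
        root = ET.parse(source).getroot()
    except ET.ParseError as err:
        return False, str(err)
    return True, root.tag


# Generic samples; these do not reflect the real ConsumerTrees.xml layout.
good = check_well_formed(io.StringIO("<root><child/></root>"))
bad = check_well_formed(io.StringIO("<root><child></root>"))
print(good)
print(bad[0])
```

In practice the function would be pointed at the file under $EGO_TOP/kernel/conf rather than at an in-memory string. A False result means the file should be fixed before EGO services are restarted.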



After editing the ConsumerTrees.xml file as shown above, while logged in as the cluster administrator

(biadmin) please stop and restart EGO services using the BigInsights scripts below to make sure that

changes are reflected in the Platform Symphony console.

$ stop.sh HAManager

$ start.sh HAManager

Controlling access to applications and consumers

In the Sqoop consumer definition above, the built-in Symphony user “Admin” has administrative responsibility for the consumer. Several other users are listed as being able to access the application associated with the consumer. The user eric is not a member of the list of permitted users. If an unauthorized user attempts to submit a job against the application definition (Sqoop) associated with this Sqoop consumer, they see an error as shown below, as expected.

[eric@biginsights ~]$ hadoop jar ${HADOOP_HOME}/hadoop-example.jar sleep -


java.io.IOException: interrupted
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:1068)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:1032)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1575)
    at org.apache.hadoop.examples.SleepJob.run(SleepJob.java:174)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
..
Caused by: java.lang.InterruptedException: Domain <VEM>: Security error: User: eric is not authorized to perform this operation.

If an authorized user (gord) submits the same workload, note that it runs successfully.

[gord@biginsights ~]$ hadoop jar ${HADOOP_HOME}/hadoop-example.jar sleep -




job id <102>















Determining the execution user for a consumer

Earlier we explained that by using impersonation, Symphony can control the user IDs that different

application services run under. In the case of the Sqoop application defined earlier, we had set the

application user to appB and this is reflected in the ConsumerTrees.xml definition.

We can verify that impersonation is taking place and that processes are running under the expected

user ID by monitoring the process tree while executing MapReduce jobs like the one above.

To monitor the process tree, use a command like:

$ watch 'ps -ef | grep appB'

As you run the job, you will see the SSM start up unless it is pre-started or the SSM is lingering on a management host waiting for another job. In this example, services are running on the same node as the master host, so we see the service instance managers and service instances starting locally to manage the job. On a larger cluster you would need to watch the compute hosts to validate that the services are starting as expected and running under the correct user ID.
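When the same check has to be repeated on several hosts, filtering on the owner column is more precise than a plain grep, which also matches the grep process itself. The Python sketch below parses text in the layout produced by ps -ef. The processes_for_user helper is hypothetical, and the sample text is made up for the example rather than captured from a cluster.

```python
def processes_for_user(ps_output, user):
    """Return (pid, command) pairs for rows of `ps -ef` style text owned by user."""
    rows = []
    for line in ps_output.splitlines()[1:]:      # skip the header row
        # Columns: UID PID PPID C STIME TTY TIME CMD (CMD may contain spaces)
        fields = line.split(None, 7)
        if len(fields) == 8 and fields[0] == user:
            rows.append((int(fields[1]), fields[7]))
    return rows


# Made-up sample input, not real cluster output.
SAMPLE = """\
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 09:00 ?        00:00:01 /sbin/init
appB      4242  4100  0 10:15 ?        00:00:03 java -Xmx512m ExampleService
gord      4300  4200  0 10:16 pts/0    00:00:00 grep appB
"""

owned = processes_for_user(SAMPLE, "appB")
print(owned)
```

On a live host the input text could come from subprocess.run(["ps", "-ef"], capture_output=True, text=True).stdout. Note that the grep process owned by gord is excluded because the match is on the owner column only.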

Figure 25 - verify that services are running under the expected user IDs

We can use the pstree command on the management host to understand the process tree.



Figure 26 - pstree can be used to show the process hierarchy

On compute hosts, services are managed by the pem process.

In response to a workload requirement, pem launches a sim process (service instance manager) which in turn runs a service instance. In this case the service instance is RunMapReduceService, since this is a Symphony MapReduce workload.

Figure 27 - process hierarchy on the execution host

When configuring several consumers and applications as we have shown here, it may also be faster to hand-edit the XML-based application profile files.



To access XML application profiles, check the directory $EGO_TOP/data/soam/profiles. The associated

XML profiles will exist in subdirectories with names corresponding to their state. For example Sqoop.xml

can be found in an “enabled” subdirectory since the application is enabled and accepting workload.
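A quick way to see which profiles are in which state is to list the subdirectories of the profiles directory. The Python sketch below does this and, so that it can be run without a cluster, demonstrates it against a throwaway directory that imitates the layout described above. The profiles_by_state helper is hypothetical and the file content written in the demo is a placeholder.

```python
import tempfile
from pathlib import Path


def profiles_by_state(profiles_dir):
    """Map each state subdirectory (e.g. enabled, disabled) to its XML profile names."""
    result = {}
    for state_dir in sorted(Path(profiles_dir).iterdir()):
        if state_dir.is_dir():
            result[state_dir.name] = sorted(p.stem for p in state_dir.glob("*.xml"))
    return result


# Demo against a temporary directory that imitates the layout described above.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "enabled").mkdir()
    (root / "disabled").mkdir()
    (root / "enabled" / "Sqoop.xml").write_text("<placeholder/>")
    listing = profiles_by_state(root)

print(listing)
```

On a real master host the function would be called with the path $EGO_TOP/data/soam/profiles instead of the temporary directory.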

Configuring Sharing Policies



Summary

In this document we’ve described a customer use case involving a multitenant implementation of

InfoSphere BigInsights that permits the following:

- Concurrent execution of different Hadoop applications (including different versions of code) on the same physical cluster
- Dynamic sharing of resources between tenants in a fashion that maximizes performance and resource utilization while respecting individual SLAs
- Support for applications other than Hadoop MapReduce to maximize flexibility and allow capital investments to be re-purposed for multiple requirements
- Security isolation between tenants, removing a major barrier to sharing in many commercial organizations

These advances in our view are significant. While Hadoop is advancing, competing open source and

commercial distributions are many years away from offering true multitenancy and practical solutions

for supporting multiple workloads on a shared infrastructure. The economic arguments in favor of

resource sharing are compelling. Analytic applications are increasingly comprised of multiple software

components that rely on distributed services. Rather than deploying separate “silos” of application

infrastructure, Platform Symphony provides the option to consolidate these different application

instances on a common foundation thus increasing infrastructure utilization, boosting service levels and

helping significantly reduce costs.
