
EMC ISILON HADOOP STARTER KIT Deploying IBM BigInsights v 4.0 with EMC ISILON

Release 1.0

October, 2015

EMC Isilon Hadoop Starter Kit for IBM BigInsights


EMC ISILON HADOOP STARTER KIT FOR IBM BIGINSIGHTS 2

To learn more about how EMC products, services, and solutions can help solve your

business and IT challenges, contact your local representative or authorized reseller,

visit www.emc.com, or explore and compare products in the EMC Store.

Copyright © 2015 EMC Corporation. All Rights Reserved.

EMC believes the information in this publication is accurate as of its publication date.

The information is subject to change without notice.

The information in this publication is provided “as is.” EMC Corporation makes no

representations or warranties of any kind with respect to the information in this

publication, and specifically disclaims implied warranties of merchantability or fitness

for a particular purpose.

Use, copying, and distribution of any EMC software described in this publication

requires an applicable software license.

For the most up-to-date listing of EMC product names, see EMC Corporation

Trademarks on EMC.com.

EMC are registered trademarks or trademarks of EMC, Inc. in the United States

and/or other jurisdictions. All other trademarks used herein are the property of their

respective owners.

http://www.emc.com/contact-us/contact-us.esp

http://www.emc.com/

https://store.emc.com/?EMCSTORE_CPP




Contents

INTRODUCTION
  IBM & EMC Technology Highlights
  Audience
  Apache Hadoop Projects
  IBM Open Platform and the Ambari Manager
  Isilon Scale-Out NAS for HDFS
  Overview of Isilon Scale-Out NAS for Big Data

PRE-INSTALLATION CHECKLIST
  Supported Software Versions
  Hardware Requirements and Suggested Hadoop Service Layout

INSTALLATION OVERVIEW
  Prerequisites
    Isilon Scale-Out NAS or Isilon OneFS Simulator
    Linux
    Networking
    DNS
    Other
  Prepare Isilon
    Assumptions
    SmartConnect for HDFS
    OneFS Access Zones
    Sharing Data between Access Zones
    User & Group ID’s
    Configuring Isilon for HDFS
    Create DNS Records for Isilon
  Prepare Linux Compute Nodes
    Linux Operating System packages needed for IBM BigInsights:
    Enable NTP on all Linux Compute nodes
    Disable SELinux on each node if enabled before installing Ambari.
    Check UMASK Settings
    Set ulimit Properties
    Kernel Modifications
    Create IBM BigInsights Hadoop Users and Groups
    Configure Passwordless SSH
    Additional Linux Packages to Install
    Test DNS Resolution
    Edit sudoers file on all Linux compute nodes.

INSTALLING IBM OPEN PLATFORM (OP)
  Download IBM Open Platform Software
  Create IBM Open Platform Repository
  Validating IBM Open Platform Install
  Adding a Hadoop User
  Additional Service Tests
    HDFS
    YARN/MAPREDUCE
    HIVE
    HBASE
  Ambari Service Check

INSTALLING IBM VALUE PACKAGES
  Before You Begin
  Installation Procedure
  Select IBM BigInsights Service to Install
  Installing BigInsights Home
  Configure Knox
  Installing BigSheets
  Installing Big SQL
  Connecting to Big SQL
    Running JSqsh
    Connection setup
    Commands and queries
    Command and query edit
    Configuration variables
  Installing Text Analytics
  Installing Big R
    IBM BigInsights Online Tutorials

SECURITY CONFIGURATION AND ADMINISTRATION
  Setting up HTTPS for Ambari
  Configuring SSL support for HBase REST gateway with Knox
  Overview of Kerberos
  Enabling Kerberos for IBM Open Platform
  Manually generating keytabs for Kerberos authentication
  Setting up Active Directory or LDAP authentication in Ambari
  Enabling Kerberos for HDFS on Isilon
    Using MIT Kerberos 5
  Running the Ambari Kerberos Wizard
  Trouble Shooting and Support


EMC Isilon Hadoop Starter Kit for IBM BigInsights v 4.0

This document describes how to create a Hadoop environment that uses IBM® Open Platform with Apache Hadoop and EMC® Isilon® scale-out network-attached storage (NAS) as HDFS-accessible shared storage. It also covers the installation and configuration of the IBM BigInsights Value Packages.

Introduction

IBM & EMC Technology Highlights

The IBM® Open Platform with Apache Hadoop consists entirely of Apache Hadoop

open source components, such as Apache Ambari, YARN, Spark, Knox, Slider, Sqoop,

Flume, Hive, Oozie, HBase, ZooKeeper, and more. After installing IBM Open Platform, you

can install additional IBM value-add service modules.

These value-add service modules are installed separately. They include IBM BigInsights® Analyst, IBM BigInsights Data Scientist, and the IBM BigInsights Enterprise Management module, which extend IBM Open Platform to accelerate the conversion of all types of data into business insight and action.

The EMC® Isilon® Scale-Out Network-Attached Storage (NAS) platform provides Hadoop

clients with direct access to big data through a Hadoop Distributed File System (HDFS) interface.

Powered by the distributed EMC Isilon OneFS® operating system, an EMC Isilon cluster

delivers a powerful yet simple and highly efficient storage platform with native HDFS

integration to accelerate analytics, gain new flexibility, and avoid the costs of a separate

Hadoop infrastructure.




Audience

This document is intended for IT program managers, IT architects, developers, and IT management who need to deploy IBM BigInsights v 4.0 with EMC Isilon OneFS v 7.2.0.3 for HDFS storage. If a physical EMC Isilon cluster is not available, download the free EMC Isilon OneFS Simulator, which can be installed as a virtual machine for integration testing and training purposes. See http://www.emc.com/getisilon for the EMC Isilon OneFS Simulator.

Apache Hadoop Projects

Apache Hadoop is an open source, batch data processing system for enormous amounts of

data. Hadoop runs as a platform that provides cost-effective, scalable infrastructure for

building Big Data analytic applications. All Hadoop clusters contain a distributed file system

called the Hadoop Distributed File System (HDFS) and a computation layer called

MapReduce.

The Apache Hadoop project contains the following subprojects:

• Hadoop Distributed File System (HDFS) – A distributed file system that provides

high-throughput access to application data.

• Hadoop MapReduce – A software framework for writing applications to reliably

process large amounts of data in parallel across a cluster.

Hadoop is supplemented by an ecosystem of Apache projects, such as Pig, Hive, Sqoop,

Flume, Oozie, Slider, HBase, ZooKeeper, and more that extend the value of Hadoop and

improve its usability.

Version 2 of Apache Hadoop introduces YARN, a sub-project of Hadoop that separates the

resource management and processing components. YARN was born of a need to enable a

broader array of interaction patterns for data stored in HDFS beyond MapReduce. The YARN-

based architecture of Hadoop 2.0 provides a more general processing platform that is not

constrained to MapReduce.

For full details of the Apache Hadoop project see http://hadoop.apache.org/.



IBM Open Platform and the Ambari Manager

The IBM Open Platform with Apache Hadoop enables Enterprise Hadoop by providing the

complete set of essential Hadoop capabilities required for any enterprise. Utilizing YARN at

its core, it provides capabilities for several functional areas including Data Management,

Data Access, Data Governance, Integration, Security and Operations.

IBM Open Platform delivers the core elements of Hadoop - scalable storage and distributed

computing – as well as all of the necessary enterprise capabilities such as security, high

availability and integration with a broad range of hardware and software solutions.

Apache Ambari is an open operational framework for provisioning, managing and monitoring

Apache Hadoop clusters.

As of version 4.0 of IBM Open Platform, Ambari can be used to set up and deploy Hadoop

clusters for nearly any task. Ambari can provision, manage and monitor every aspect of a

Hadoop deployment.

More information on IBM Open Platform can be found at:

http://www-01.ibm.com/software/data/infosphere/hadoop/enterprise.html

Isilon Scale-Out NAS for HDFS

EMC Isilon is the only scale-out NAS platform natively integrated with the Hadoop

Distributed File System (HDFS). Using HDFS as an over-the-wire protocol, you can deploy a

powerful, efficient, and flexible data storage and analytics ecosystem.

In addition to native integration with HDFS, EMC Isilon storage easily scales to support

massively large Hadoop analytics projects. Isilon scale-out NAS also offers unmatched

simplicity, efficiency, flexibility, and reliability that you need to maximize the value of your

Hadoop data storage and analytics workflow investment.



Overview of Isilon Scale-Out NAS for Big Data

The EMC Isilon scale-out platform combines modular hardware with unified software to

provide the storage foundation for data analysis. Isilon scale-out NAS is a fully distributed

system that consists of nodes of modular hardware arranged in a cluster. The distributed

Isilon OneFS operating system combines the memory, I/O, CPUs, and disks of the nodes into

a cohesive storage unit to present a global namespace as a single file system.

The nodes work together as peers in a shared-nothing hardware architecture with no single

point of failure. Every node adds capacity, performance, and resiliency to the cluster and

each node acts as a Hadoop namenode and datanode.

The namenode daemon is a distributed process that runs on all the nodes in the cluster. A

compute client can connect to any node through HDFS.

As nodes are added, the file system expands dynamically and redistributes data, eliminating

the work of partitioning disks and creating volumes. The result is a highly efficient and

resilient storage architecture that brings all the advantages of an enterprise scale-out NAS

system to storing data for analysis.

With traditional direct-attached storage, the required ratio of CPU, RAM, and disk space depends on the workload, which makes it difficult to size a Hadoop cluster before you have had a chance to measure your MapReduce workload. Expanding data sets also make upfront sizing decisions problematic. Isilon scale-out NAS is well suited to this situation: it lets you increase CPUs, RAM, and disk space by adding nodes, so that storage capacity and performance match the demands of a changing Hadoop workload.

An Isilon cluster optimizes data protection. OneFS protects data more efficiently and reliably than HDFS. The HDFS protocol, by default, replicates each block of data three times. In contrast, OneFS stripes the data across the cluster and protects it with forward error correction codes, which consume less space than replication while providing better protection.
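The difference in space consumption can be illustrated with simple arithmetic. The sketch below compares the raw capacity needed for three-way replication with that of a forward error correction stripe. The 8+2 layout (8 data units protected by 2 FEC units) is a hypothetical value chosen only for illustration, and the function names are not part of any product; actual OneFS overhead depends on the cluster size and the configured protection level.

```shell
#!/bin/sh
# Raw capacity (TB) needed to hold usable_tb of data with N full copies.
raw_tb_replicated() {
    usable_tb=$1; copies=$2
    echo $(( usable_tb * copies ))
}

# Raw capacity (TB) for a stripe of data_units plus fec_units.
# Integer math, rounded up to the next whole TB.
# The stripe layout is an assumption for illustration only.
raw_tb_fec() {
    usable_tb=$1; data_units=$2; fec_units=$3
    echo $(( (usable_tb * (data_units + fec_units) + data_units - 1) / data_units ))
}

# 100 TB of data: three copies versus a hypothetical 8+2 stripe.
echo "3x replication: $(raw_tb_replicated 100 3) TB raw"
echo "8+2 FEC stripe: $(raw_tb_fec 100 8 2) TB raw"
```

Under these assumptions, replication needs 300 TB of raw capacity while the striped layout needs 125 TB.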




Pre-installation Checklist

Supported Software Versions

The environment used for this document consists of the following software versions:

• Ambari 1.7.0_IBM

• IBM Open Platform v 4.0.0.0

• Isilon OneFS 7.2.0.3 with patch-159065

• All of IBM BigInsights v 4.0 value packs, i.e. Business Analyst, Data Scientist, and Enterprise Management

______________________________________________________________________

Note: IBM BigInsights v 4.0 requires OneFS v 7.2.0.3 with patch-159065.

OneFS version 7.2.0.4 should also work, as should version 7.2.1.1 when available.

Do not install IBM BigInsights with OneFS versions lower than 7.2.0.3.

See EMC Isilon Supportability and Compatibility Guide for the latest compatibility updates:

https://support.emc.com/docu44518_Isilon-Supportability-and-Compatibility-Guide.pdf?language=en_US

Hardware Requirements and Suggested Hadoop Service Layout

Detailed system requirements for IBM BigInsights compute nodes can be found at:

http://www-01.ibm.com/support/docview.wss?uid=swg27027565

In a multi-node IBM BigInsights cluster without high availability, at least one management node is suggested if performance is not a concern. If performance is a concern, consider configuring at least three management nodes. If you use the BigInsights - Big SQL service, consider configuring four management nodes. If you use a high availability environment, consider six management nodes. Use the following list as a guide for the nodes in your IBM/EMC cluster. A suggested layout is shown in Table 1 for both non-high availability and high availability deployments.

________________________________________________________________________________________

Note: With both deployment options, EMC Isilon provides namenode, secondary namenode and datanode functions for the entire cluster. Do not designate any compute node as a namenode, secondary namenode, or datanode in any aspect of the IBM BigInsights configuration.

Table 1. Suggested Service Layout

Non-High availability

Management node 1: Ambari, PostgreSQL, Knox, Zookeeper, Hive, Spark, Spark History Server, BigInsights Home, BigSheets, Big R, BigSQL Headnode, Text Analytics

Management node 2: Resource Manager, HBase Master, Zookeeper, Oozie, Ambari monitoring service

Management node 3: Job history server, Zookeeper, App Timeline Server, Kafka

Management node 4: Big SQL Scheduler, Hive Server (MySQL), MySQL metastore, Hive/Oozie metastore, WebHCat Server, Data Server Manager

High availability

Management node 1: Ambari, PostgreSQL, Spark, Spark History Server, BigSQL Headnode

Management node 2: Resource Manager, Zookeeper, Oozie, Ambari monitoring service, BigInsights Home

Management node 3: Resource Manager (standby), Job history server, Zookeeper, App Timeline Server, Kafka, Oozie (Standby)

Management node 4: Big SQL Scheduler, HBase Master (standby), Hive Server, MySQL Server, Hive metastore, WebHCat Server, Data Server Manager

Management node 5: Big SQL Headnode (Standby), Big SQL Scheduler (Standby), HBase Master, Hive Server (Standby), Hive Metastore (Standby), Journal Node, Zookeeper


Installation Overview

The installation process described in this document consists of the following steps:

1. Confirm prerequisites.

2. Prepare your network infrastructure including DNS.

3. Prepare your Isilon cluster.

4. Prepare Linux compute nodes.

5. Install Ambari Server.

6. Use Ambari Manager to deploy IBM Open Platform to compute nodes.

7. Install IBM BigInsights Value Packages.

8. Perform key functional tests.

Prerequisites

Isilon Scale-Out NAS or Isilon OneFS Simulator

For low-capacity, non-performance testing of Isilon, the EMC Isilon OneFS Simulator can

be used instead of a cluster of physical Isilon appliances. This can be downloaded for free

from http://www.emc.com/getisilon.

Refer to the EMC Isilon OneFS Simulator Install Guide for details. Be sure to follow the

section for running the virtual nodes in VMware ESX. Only a single virtual node is required,

but adding nodes will allow you to explore other features such as data

protection, SmartPools (tiering), and SmartConnect (network load balancing).

For physical Isilon nodes, you should have already completed the console-based

installation process for your first Isilon node and added two other nodes for a

minimum of 3 Isilon nodes.

You should have OneFS version 7.2.0.3 + patch 159065 installed on Isilon.


You must obtain a OneFS HDFS license code and install it on your Isilon cluster. You can

get your free OneFS HDFS license from:

http://www.emc.com/campaign/isilon-hadoop/index.htm.

It is recommended, but not required, to have a SmartConnect Advanced license for

your Isilon cluster.

To allow for scripts and other small files to be easily shared between all nodes in your

environment, it is highly recommended to enable NFS (Unix Sharing) on your Isilon

cluster. By default, the entire /ifs directory is already exported and this can remain

unchanged. This document assumes that a single Isilon cluster is used for this NFS

export as well as for HDFS. However, there is no requirement that the NFS export be

on the same Isilon cluster that you are using for HDFS.

Linux

Red Hat Enterprise Linux (RHEL) Server 6 (Update 5 minimum) or comparable

CentOS Server.

100 GB root partition.

At a minimum, 96 GB of RAM for production environments. The more RAM, the better

for Hadoop.

Networking

For the best performance, a single 10 Gigabit Ethernet switch should connect to at

least one 10 Gigabit port on each Linux host. Additionally, the same switch should

connect to at least one 10 Gigabit port on each Isilon node.

A single dedicated layer-2 network can be used to connect all hosts and Isilon nodes,

although multiple networks can be used for increased security, monitoring, and

robustness.


At least an entire /24 IP address block should be allocated to your network. This will

allow a DNS reverse lookup zone to be delegated to your Hadoop DNS server.
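The name of the reverse lookup zone follows directly from the /24 network address: the first three octets are reversed and suffixed with in-addr.arpa. The sketch below derives the name for the 10.111.129.0/24 network that appears in the Isilon examples in this document. The function name reverse_zone_24 is hypothetical.

```shell
#!/bin/sh
# Print the reverse lookup zone for a /24 network address.
# Runs in a subshell so the temporary variables do not leak.
reverse_zone_24() (
    # Split the dotted-quad address on "." into its octets.
    IFS=. read -r o1 o2 o3 rest <<EOF
$1
EOF
    printf '%s.%s.%s.in-addr.arpa\n' "$o3" "$o2" "$o1"
)

reverse_zone_24 10.111.129.0
```

For the example network this prints 129.111.10.in-addr.arpa, which is the zone your DNS administrator would delegate.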

If using the EMC Isilon OneFS Simulator, you will need at least two static IP addresses

(one for the node’s ext-1 interface, another for the SmartConnect service IP). Each

additional Isilon node will require an additional IP address.

At a minimum, you will need to allocate to your Isilon cluster one IP address per

Access Zone per Isilon node. In general, you will need one Access Zone for each

separate Hadoop cluster that will use Isilon for HDFS storage.

For the best possible load balancing during an Isilon node failure scenario, the

recommended number of IP addresses is given by the formula below. Of course, this

is in addition to any IP addresses used for non-HDFS pools.

# of IP addresses = 2 * (# of Isilon Nodes) * (# of Access Zones)

For example, 20 IP addresses are recommended for 5 Isilon nodes and 2 Access Zones.
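The minimum and recommended address counts described above can be checked with a few lines of shell. This is a minimal sketch; the function names are hypothetical.

```shell
#!/bin/sh
# Minimum: one IP address per Access Zone per Isilon node.
minimum_ips() {
    nodes=$1; zones=$2
    echo $(( nodes * zones ))
}

# Recommended: 2 * (# of Isilon nodes) * (# of Access Zones).
recommended_ips() {
    nodes=$1; zones=$2
    echo $(( 2 * nodes * zones ))
}

# 5 Isilon nodes and 2 Access Zones, as in the example above.
echo "minimum:     $(minimum_ips 5 2)"
echo "recommended: $(recommended_ips 5 2)"
```

Remember that these counts cover HDFS pools only; addresses used for non-HDFS pools come on top.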

This document will assume that Internet access is available to all servers to download

various components from Internet repositories.

DNS

A DNS server is required and you must have the ability to create DNS records and

zone delegations.

It is recommended that your DNS server delegate a subdomain to your Isilon cluster.

For instance, DNS requests for subnet0-pool0.isiloncluster1.example.com or

isiloncluster1.example.com should be delegated to the Service IP defined on your

Isilon cluster.

To allow for a convenient way of changing the HDFS Namenode used by all Hadoop

applications and services, create a DNS record for your Isilon cluster’s HDFS

Namenode service. This should be a CNAME alias to your Isilon SmartConnect zone.

Specify a TTL of 1 minute to allow for quick changes. For example, create a CNAME

record for mycluster1-hdfs.example.com that targets

subnet0-pool0.isiloncluster1.example.com. If you later want to redirect all HDFS I/O to another

cluster or a different pool on the same Isilon cluster, you simply need to change the

DNS record and restart all Hadoop services.
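For illustration, the CNAME record described above might look as follows in standard zone-file syntax. The sketch below only prints the record and does not modify any DNS server; it assumes a DNS server that accepts master-file records, and the function name cname_record is hypothetical. A TTL of 60 seconds corresponds to the recommended 1 minute.

```shell
#!/bin/sh
# Print a zone-file CNAME record.
#   $1 = alias name, $2 = target name, $3 = TTL in seconds
# Trailing dots mark both names as fully qualified.
cname_record() {
    printf '%s. %s IN CNAME %s.\n' "$1" "$3" "$2"
}

cname_record mycluster1-hdfs.example.com \
    subnet0-pool0.isiloncluster1.example.com 60
```

Once such a record is live, a query like `dig +short mycluster1-hdfs.example.com CNAME` should return the SmartConnect zone name.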

Other

Three scripts that help automate new IBM BigInsights installations with EMC Isilon

are available for download from http://www.github.com/bonibruno/BigInsights:

1. bi_create_users.sh – use this script to create the users and groups on all the

Linux nodes before beginning installation.

2. isilon_create_users.sh – use this script to create the users and groups on

Isilon before beginning installation. You must first create your access zone

(described later in this document, e.g. ibm), before running this script.

3. isilon_create_directories.sh – run this after the script above.

More information on the use of these scripts is provided in the installation section of this

document.

Prepare Isilon

Assumptions

This document makes the assumptions listed below. These are not necessarily

requirements but they are usually valid and simplify the process.

It is assumed that you are not using a directory service such as Active

Directory for Hadoop users and groups.

It is assumed that you are not using Kerberos authentication for Hadoop.



SmartConnect for HDFS

A best practice for HDFS on Isilon is to utilize two SmartConnect IP address pools for each

access zone. One IP address pool should be used by Hadoop clients to connect to the HDFS

namenode service on Isilon and it should use the dynamic IP allocation method to

minimize connection interruptions in the event that an Isilon node fails.

____________________________________________________________________

Note: Dynamic IP allocation requires a SmartConnect Advanced license.

____________________________________________________________________

A Hadoop client uses a specific SmartConnect IP address pool simply by using its zone

name (DNS name) in the HDFS URI:

For example, hdfs://subnet0-pool1.isiloncluster1.example.com:8020
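As a small sketch of how the URI is put together, the helper below combines a SmartConnect zone name, a port, and a path. The function name hdfs_uri is hypothetical, and the default port of 8020 is taken from the example above.

```shell
#!/bin/sh
# Build an HDFS URI from a SmartConnect zone name.
#   $1 = zone (DNS) name, $2 = port (default 8020), $3 = path (default /)
hdfs_uri() {
    zone=$1; port=${2:-8020}; path=${3:-/}
    printf 'hdfs://%s:%s%s\n' "$zone" "$port" "$path"
}

hdfs_uri subnet0-pool1.isiloncluster1.example.com

# On a Hadoop client, the result could be passed to the file system shell:
#   hdfs dfs -ls "$(hdfs_uri subnet0-pool1.isiloncluster1.example.com)"
```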

A second IP address pool should be used for HDFS datanode connections, and it should also use the dynamic IP allocation method. To assign specific SmartConnect IP address pools for datanode connections, use the “isi hdfs racks modify” command. If the network is flat, there is no need to use “isi hdfs racks modify”; the default configuration will suffice.

If IP addresses are limited and you have a SmartConnect Advanced license, you may

choose to use a single dynamic pool for namenode and datanode connections. This may

result in uneven utilization of Isilon nodes.

If you do not have a SmartConnect Advanced license, you may choose to use a single

static pool for namenode and datanode connections. This may result in some failed HDFS

connections in the event of a node failure.

For more information, see the EMC Isilon Best Practices for Hadoop Data Storage white paper online at:

https://www.emc.com/collateral/white-papers/h13926-wp-emc-isilon-hadoop-best-practices-onefs72.pdf


OneFS Access Zones

Access zones on OneFS are a way to select a distinct configuration for the OneFS cluster

based on the IP address that the client connects to. For HDFS, this configuration includes

authentication methods, HDFS root path, and authentication providers (AD, LDAP, local,

etc.). By default, OneFS includes a single access zone called System.

If you will only have a single Hadoop cluster connecting to your Isilon cluster, then you can

use the System access zone with no additional configuration. However, to have more than

one Hadoop cluster connect to your Isilon cluster, it is best to have each Hadoop cluster

connect to a separate OneFS access zone. This will allow OneFS to present each Hadoop

cluster with its own HDFS namespace and an independent set of users.

For more information, see the Security and Compliance for Scale-out Hadoop Data Lakes

white paper.

To view your current list of access zones and the IP pools associated with them:

    isiloncluster1-1# isi zone zones list
    Name   Path
    ------------
    System /ifs
    ------------
    Total: 1

    isiloncluster1-1# isi networks list pools -v
    subnet0:pool0
      In Subnet: subnet0
      Allocation: Static
      Ranges: 1
        10.111.129.115-10.111.129.126
      Pool Membership: 4
        1:10gige-1 (up)
        2:10gige-1 (up)
        3:10gige-1 (up)
        4:10gige-1 (up)
      Aggregation Mode: Link Aggregation Control Protocol (LACP)
      Access Zone: System (1)
      SmartConnect:
        Suspended Nodes : None
        Auto Unsuspend ... 0
        Zone : subnet0-pool0.isiloncluster1.lab.example.com
        Time to Live : 0
        Service Subnet : subnet0
        Connection Policy: Round Robin
        Failover Policy : Round Robin
        Rebalance Policy : Automatic Failback




To create a new access zone and an associated IP address pool:

    isiloncluster1-1# mkdir -p /ifs/isiloncluster1/zone1
    isiloncluster1-1# isi zone zones create --name zone1 \
        --path /ifs/isiloncluster1/zone1
    isiloncluster1-1# isi networks create pool --name subnet0:pool1 \
        --ranges 10.111.129.127-10.111.129.138 --ifaces 1-4:10gige-1 \
        --access-zone zone1 --zone subnet0-pool1.isiloncluster1.lab.example.com \
        --sc-subnet subnet0 --dynamic
    Creating pool 'subnet0:pool1': OK
    Saving: OK

____________________________________________________________________

Note: If you do not have a SmartConnect Advanced license, omit the --dynamic option.

____________________________________________________________________
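Before creating a pool, it can help to confirm how many addresses a range actually holds, for example to check that it provides at least one address per node interface. The following is a minimal sketch in POSIX shell; the function names ip_to_int and range_size are illustrative and are not part of OneFS.

```shell
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1            # split the address on dots into $1..$4
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Count the addresses in a "first-last" range, inclusive.
range_size() {
  first=${1%-*}
  last=${1#*-}
  echo $(( $(ip_to_int "$last") - $(ip_to_int "$first") + 1 ))
}

range_size 10.111.129.127-10.111.129.138   # prints 12
```

The same function applied to the pool0 range shown earlier (10.111.129.115-10.111.129.126) also reports 12 addresses.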

Sharing Data between Access Zones

By default, the data in one access zone cannot be accessed by users in another access zone.

In certain cases, however, you may need to make the same data set available to more

than one Hadoop compute cluster. Using fully qualified HDFS paths, e.g.

hdfs://zone1-hdfs.example.com/hadoop/dir1, can make a data set available across two or more

access zones.

With fully qualified HDFS paths, the data sets do not cross access zones. Instead, the

Hadoop jobs can access the data sets from a common shared HDFS namespace. For

instance, you can selectively share data between two or more access zones based on

referential links and file/directory permissions.




User & Group IDs

Isilon clusters and Hadoop servers each have their own mapping of user IDs (uid) to user

names and group IDs (gid) to group names. When Isilon is used only for HDFS storage by

the Hadoop servers, the IDs do not need to match. This is because the HDFS

protocol only refers to users and groups by their names, and never their numeric IDs.

In contrast, the NFS protocol refers to users and groups by their numeric IDs. Although

NFS is rarely used in traditional Hadoop environments, the high-performance, enterprise-

class, and POSIX-compatible NFS functionality of Isilon makes NFS a compelling protocol

for certain workflows. If you expect to use both NFS and HDFS on your Isilon cluster (or

simply want to be open to the possibility in the future), it is highly recommended to

maintain consistent names and numeric IDs for all users and groups on Isilon and your

Hadoop servers. In a multi-tenant environment with multiple Hadoop clusters, numeric IDs

for users in different clusters should be distinct.

For instance, the user bigsql in Hadoop cluster 1 may have ID 1013 and this same ID will

be used in the Isilon access zone for Hadoop cluster 1 as well as every server in Hadoop

cluster 1. The user bigsql in Hadoop cluster 2 may have ID 710 and this ID will be used in

the Isilon access zone for Hadoop cluster 2 as well as every server in Hadoop cluster 2.
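One way to check ID consistency is to compare passwd-format snapshots taken from each system. The sketch below assumes such snapshots already exist as files (for instance, the output of getent passwd saved on each Linux server); the file names and the functions ids_for_user and same_ids are hypothetical, and producing an equivalent listing from Isilon is left to the OneFS tooling.

```shell
# Print "uid:gid" for a user from a passwd-format file (name:x:uid:gid:...).
# Exits non-zero when the user is not present.
ids_for_user() {
  awk -F: -v user="$1" '$1 == user { print $3 ":" $4; found = 1 } END { exit !found }' "$2"
}

# Succeed only if the user has the same uid and gid in both files.
# Returns 2 when the user is missing from either file.
same_ids() {
  a=$(ids_for_user "$1" "$2") || return 2
  b=$(ids_for_user "$1" "$3") || return 2
  [ "$a" = "$b" ]
}

# Example (hypothetical snapshot files):
#   same_ids bigsql node1.passwd node2.passwd || echo "bigsql IDs differ"
```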

Configuring Isilon for HDFS

_____________________________________________________________________

Note: In the steps below, replace zone1 with System to use the default System access

zone or you may specify the name of a new access zone that you previously created.

______________________________________________________________________

1. Open a web browser to your Isilon cluster’s web administration page. If you

don’t know the URL, simply point your browser to:

https://isilon_node_ip_address:8080





The isilon_node_ip_address is any IP address on any Isilon node that is in the System

Access Zone. This usually corresponds to the ext-1 interface of any Isilon node.

2. Log in with your root account. You specified the root password when you configured

your first node using the console.

3. Check, and edit as necessary, your NTP settings. Click Cluster Management ->

General Settings -> NTP.




1. SSH into any node in your Isilon cluster as root.

2. Confirm that your Isilon cluster is at OneFS version 7.2.0.3.

isiloncluster1-1# isi version

Isilon OneFS v7.2.0.3 ...

3. For OneFS version 7.2.0.3, you must have patch-159065 installed. You can view

the list of patches you have installed with:

# isi pkg info

patch-159065: This patch adds support for the Ambari 1.7.0_IBM Server.

4. Install the patch if needed:

[user@workstation ~]$ scp patch-159065.tgz root@mycluster1-hdfs:/tmp

isiloncluster1-1# gunzip < /tmp/patch-159065.tgz | tar -xvf -

isiloncluster1-1# isi pkg install patch-159065.tar

Preparing to install the package...

Checking the package for installation...

Installing the package

Committing the installation...

Package successfully installed.

5. Verify your HDFS license.

isiloncluster1-1# isi license

Module License Status Configuration Expiration Date

------ -------------- ------------- ---------------

HDFS Evaluation Not Configured November12, 2016




6. Create the HDFS root directory. This is usually called hadoop and must be within

the access zone directory.

isiloncluster1-1# mkdir -p /ifs/isiloncluster1/zone1/hadoop

7. Set the HDFS root directory for the access zone.

isiloncluster1-1# isi zone zones modify zone1 \

--hdfs-root-directory /ifs/isiloncluster1/zone1/hadoop

8. Set the HDFS block size used for reading from Isilon.

isiloncluster1-1# isi hdfs settings modify --default-block-size 128M

9. Create an indicator file so that we can easily determine when we are looking at your

Isilon cluster via HDFS.

isiloncluster1-1# touch \

/ifs/isiloncluster1/zone1/hadoop/THIS_IS_ISILON_isiloncluster1_zone1

10. Copy the scripts (isilon_create_users.sh & isilon_create_directories.sh) you

downloaded from http://www.github.com/bonibruno/BigInsights to Isilon.

[user@workstation ~]$ scp isilon_create_*.sh \

root@isilon_node_ip_address:/ifs/isiloncluster1/scripts

11. Execute the script isilon_create_users.sh. This script will create all required

users and groups for IBM BigInsights v 4.0.

Warning: The script isilon_create_users.sh will create local user and group accounts on

your Isilon cluster for Hadoop services. If you are using a directory service such as Active

Directory and you want these users and groups to be defined in your directory service,

then DO NOT run this script.

Instead, refer to the OneFS documentation and EMC Isilon Best Practices for Hadoop Data

Storage.





Script Usage:

isilon_create_users.sh --dist <DIST> [--startgid <GID>] [--startuid <UID>] [--zone <ZONE>]

dist - This will correspond to your Hadoop distribution: bi4.0

startgid - Group IDs will begin with this value. For example: 1000

startuid - User IDs will begin with this value. This is generally the same as startgid. For

example: 1000

zone - Access Zone name. For example: zone1

isiloncluster1-1# bash /ifs/isiloncluster1/scripts/isilon_create_users.sh \

--dist bi4.0 --startgid 1000 --startuid 1000 --zone zone1

Example output of script is shown below:

Info: Hadoop distribution: bi

Info: groups will start at GID 1000

Info: users will start at UID 1000

Info: will put users in zone: zone1

Info: HDFS root: /ifs/isiloncluster1/hadoop

Failed to add member UID:1001 to group GROUP:hadoop: User is already in local group

SUCCESS -- Hadoop users created successfully!

Done!




______________________________________________________________________

Note: The “User is already in local group” message is expected. This user corresponds to

the hadoop user, which is already in the hadoop group.

12. Execute the script isilon_create_directories.sh. This script will create all

required directories with the appropriate ownership and permissions.

Script Usage:

isilon_create_directories.sh --dist <DIST> [--fixperm] [--zone <ZONE>]

dist - This will correspond to your Hadoop distribution: bi4.0

fixperm - Updates ownership and permissions on hadoop directories.

zone - Access Zone name. For example: zone1

isiloncluster1-1# bash /ifs/isiloncluster1/scripts/isilon_create_directories.sh \

--dist bi4.0 --fixperm --zone zone1

13. Map the hdfs user to the Isilon superuser. This will allow the hdfs user to chown

(change ownership of) all files during IBM BigInsights installation.

______________________________________________________________________

Warning: The command below will restart the HDFS service on Isilon to ensure that any

cached user mapping rules are flushed. This will temporarily interrupt any HDFS

connections coming from other Hadoop clusters.

______________________________________________________________________

isiloncluster1-1# isi zone zones modify --user-mapping-rules="hdfs=>root" --zone zone1

isiloncluster1-1# isi services isi_hdfs_d disable ; isi services isi_hdfs_d enable

The service ‘isi_hdfs_d’ has been disabled.

The service ‘isi_hdfs_d’ has been enabled.




Create DNS Records for Isilon

You will now create the required DNS records that will be used to access your Isilon

cluster.

1. Create a delegation record so that DNS requests for the zone

isiloncluster1.example.com are delegated to the Service IP that will be defined on

your Isilon cluster. The Service IP can be any unused static IP address in your lab

subnet.

2. Create a CNAME alias for your Isilon SmartConnect zone. For example, create a

CNAME record for mycluster1-hdfs.example.com that targets subnet0-

pool0.isiloncluster1.example.com.

3. Test name resolution.

[user@workstation ~]$ ping mycluster1-hdfs.example.com

PING subnet0-pool0.isiloncluster1.example.com (10.11.12.13) 56(84) bytes of data.

64 bytes from 10.11.12.13: icmp_seq=1 ttl=64 time=1.15 ms

Prepare Linux Compute Nodes

Linux Operating System packages needed for IBM BigInsights:

1. Compatibility Libraries

2. Networking Tools

3. Perl Support

4. Ruby Support

5. Web Services add on

6. PHP Support

7. Web Server




8. Mysql*

9. PostGres*

10. snmp support

11. Development Tools

12. Korn Shell

Enable NTP on all Linux Compute nodes

1. Edit /etc/ntp.conf file and add your NTP Server.

2. Start NTP: service ntpd start

3. Enable NTP at boot: chkconfig --level 2345 ntpd on

Disable SELinux on each node if enabled before installing Ambari.

1. Edit /etc/selinux/config

2. Set SELINUX=disabled

3. Reboot

____________________________________________________________________

Note: SELinux enforcement can be turned off temporarily (permissive mode, until the next reboot) with the setenforce 0 command.

____________________________________________________________________
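To confirm the persistent setting on a node, the SELINUX= value can be read back from the configuration file. This is a minimal sketch; the function name selinux_mode is illustrative, and the default path is the one named in the steps above.

```shell
# Print the value of the last uncommented SELINUX= line in an SELinux config file.
# SELINUXTYPE= lines are not matched because "=" must follow SELINUX directly.
selinux_mode() {
  sed -n 's/^[[:space:]]*SELINUX=[[:space:]]*\([A-Za-z]*\).*/\1/p' \
    "${1:-/etc/selinux/config}" | tail -n 1
}

# Example: warn when a node is not yet configured as the guide requires.
#   [ "$(selinux_mode)" = "disabled" ] || echo "SELinux is not disabled in the config file"
```

The getenforce command reports the mode currently in effect, which can differ from the file until the node is rebooted.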

Check UMASK Settings

The umask setting on each node should be 0022 in /etc/profile and /etc/bashrc.

Modify the existing umask entry if needed, e.g. umask 0022.
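A small check can flag files whose umask entries differ from the required value. The sketch below is an assumption-laden helper, not part of BigInsights: umask_ok is a hypothetical name, and it treats 022 and 0022 as equivalent.

```shell
# Succeed only if the file contains at least one umask line
# and every umask line sets 0022 (or the equivalent 022).
umask_ok() {
  awk '
    $1 == "umask" { seen = 1; if ($2 != "0022" && $2 != "022") bad = 1 }
    END { exit !(seen && !bad) }
  ' "$1"
}

# Example: report which of the two files still needs editing.
#   for f in /etc/profile /etc/bashrc; do
#     umask_ok "$f" || echo "$f: umask entry needs attention"
#   done
```

Running umask with no arguments in a fresh login shell shows the value actually in effect.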




Set ulimit Properties

1. Edit /etc/security/limits.d/90-nproc.conf

#set for all users

* hard nofile 65536

* soft nofile 65536

* hard nproc 65536

* soft nproc 65536
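The entries can be read back from the file to verify an edit. This is a minimal sketch; limit_value is a hypothetical helper that expects the four-column layout (domain, type, item, value) used above.

```shell
# Print the value configured for a domain/type/item in a limits.conf-style file.
# Prints nothing when no matching entry exists; the last match wins.
limit_value() {
  awk -v d="$2" -v t="$3" -v i="$4" '
    $1 == d && $2 == t && $3 == i { v = $4 }
    END { if (v != "") print v }
  ' "$1"
}

# Example:
#   limit_value /etc/security/limits.d/90-nproc.conf '*' hard nofile
```

In a new login session, ulimit -n and ulimit -u show the effective soft limits for open files and processes, and ulimit -Hn and ulimit -Hu show the hard limits.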

Kernel Modifications

1. Edit /etc/sysctl.conf and add the following:

vm.swappiness=5

kernel.pid_max=4194303

net.ipv6.conf.all.disable_ipv6 = 1

net.ipv6.conf.lo.disable_ipv6 = 1

net.ipv4.ip_local_port_range = 1024 64000
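To verify the file after editing, each key can be read back and compared with the intended value. The following sketch assumes the "key = value" layout shown above (spaces around "=" optional); sysctl_conf_value is a hypothetical helper name.

```shell
# Print the value assigned to a key in a sysctl.conf-style file.
# Comment lines starting with # or ; are skipped; the last assignment wins.
sysctl_conf_value() {
  awk -F= -v key="$2" '
    /^[ \t]*[#;]/ { next }
    {
      k = $1; gsub(/[ \t]/, "", k)
      if (k == key) {
        v = $0
        sub(/^[^=]*=[ \t]*/, "", v)   # drop everything up to the first "="
        sub(/[ \t]+$/, "", v)         # trim trailing whitespace
        val = v
      }
    }
    END { if (val != "") print val }
  ' "$1"
}

# Example:
#   sysctl_conf_value /etc/sysctl.conf vm.swappiness
```

The settings take effect after sysctl -p is run (or after a reboot), and sysctl -n followed by a key name prints the value the running kernel is using.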

Create IBM BigInsights Hadoop Users and Groups

Create required users on all Linux nodes. It is recommended to create all Hadoop users

before installing IBM BigInsights. Use the bi_create_users.sh script obtained from:

http://www.github.com/bonibruno/BigInsights

[user@workstation ~]$ scp bi_create_users.sh [node1]:/root

Run the script, e.g. # ./bi_create_users.sh

Repeat above for all nodes.





Configure Passwordless SSH

Configure passwordless SSH for all Linux nodes.

1. Create Authentication SSH Keys

ssh-keygen -f id_rsa -t rsa -N ""

2. Create .ssh directories on all nodes

ssh root@[node1]

mkdir -p .ssh

cd .ssh

Upload generated keys to all hosts:

cat id_rsa.pub | ssh root@[node1] 'cat >> .ssh/authorized_keys'

Repeat above for all nodes.

3. Set permissions on .ssh directory

ssh root@[node1] "chmod 700 .ssh; chmod 640 .ssh/authorized_keys"
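Because these steps are repeated for every node, they can be driven from a list of host names. The sketch below is one possible way to do that and rests on assumptions: the hosts file (one name per line, # for comments) and the functions list_hosts and push_key are hypothetical. By default it only prints the commands; set DRY_RUN=0 to execute them.

```shell
# Print host names from a file, skipping blank lines and # comments.
list_hosts() {
  grep -Ev '^[[:space:]]*(#|$)' "$1"
}

# For each host: create .ssh, append the public key, and set the permissions
# used in the steps above. Prints the commands unless DRY_RUN=0.
push_key() {
  hosts_file=$1
  pubkey=$2
  remote='mkdir -p .ssh && cat >> .ssh/authorized_keys && chmod 700 .ssh && chmod 640 .ssh/authorized_keys'
  list_hosts "$hosts_file" | while read -r host; do
    if [ "${DRY_RUN:-1}" = 1 ]; then
      echo "ssh root@$host '$remote' < $pubkey"
    else
      # ssh reads the key from its own stdin, so it does not consume the host list.
      ssh "root@$host" "$remote" < "$pubkey"
    fi
  done
}

# Example:
#   push_key hosts.txt id_rsa.pub             # preview
#   DRY_RUN=0 push_key hosts.txt id_rsa.pub   # run, prompting for each root password
```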

Additional Linux Packages to Install

Install the following packages on all Linux compute nodes.

deltarpm
python-deltarpm
createrepo
pam-1.1.1-17.el6.i686.rpm
mysql-connector-java-5.1.17-6.el6.noarch.rpm
ksh
nc
libdbi
libstdc
libaio
java-1.7.0-openjdk-devel
python-paramiko
python-rrdtool-1.4.5-1.el6.rfx.x86_64
snappy-1.0.5-1.el6.x86_64
web-ui-framework

Install the above packages using the yum install command.

Test DNS Resolution

Make sure all compute nodes resolve with a fully qualified domain name.

Ping each host with the associated FQDN and make sure it is reachable by FQDN.
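The check can be scripted across all nodes. This is a minimal sketch under stated assumptions: is_fqdn and check_hosts are hypothetical helper names, and "fully qualified" is approximated as "contains at least one dot and no empty labels".

```shell
# Succeed if the name looks fully qualified: at least one dot, no empty labels.
is_fqdn() {
  case $1 in
    ''|.*|*.|*..*) return 1 ;;
    *.*) return 0 ;;
    *) return 1 ;;
  esac
}

# Report, for each name given, whether it is fully qualified and resolvable.
check_hosts() {
  for h in "$@"; do
    if ! is_fqdn "$h"; then
      echo "$h: not fully qualified"
    elif ! getent hosts "$h" > /dev/null; then
      echo "$h: does not resolve"
    else
      echo "$h: ok"
    fi
  done
}

# Example:
#   check_hosts host1.example.com host2.example.com
```

On each node, hostname -f should also print that node's own fully qualified name.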

Edit sudoers file on all Linux compute nodes

1. Edit /etc/sudoers

## Additions needed for IBM BigInsights

hadoop ALL=(ALL) NOPASSWD: ALL

bigsql ALL=(ALL) NOPASSWD: ALL

Check IBM’s BigInsights Website for more info on preparing Linux nodes.

http://www-01.ibm.com/support/knowledgecenter/SSPT3X_4.0.0/com.ibm.swg.im.infosphere.biginsights.install.doc/doc/install_prepare.html

Installing IBM Open Platform (OP)

Download IBM Open Platform Software

Log into the IBM Passport Advantage web portal with your IBM assigned credentials and

download the following packages onto the designated Ambari server node:

• BI-AH-1.0.0.1-IOP-4.0.x86_64.bin

• IOP-4.0.0.0.x86_64.rpm

• iop-4.0.0.0.x86_64.tar.gz

• iop-utils-1.0-iop-4.0.x86_64.tar.gz




Create IBM Open Platform Repository

The IBM Open Platform with Apache Hadoop uses the repository-based Ambari installer.

You have two options for specifying the location of the repository from which Ambari

obtains the component packages.

The IBM Open Platform with Apache Hadoop installation includes OpenJDK 1.7.0. During

installation, you can either install the version provided or make sure Java™ 7 is installed

on all nodes in the cluster.

1. Log in to your Linux cluster as root, or as a user with root privileges.

2. Ensure that the nc package is installed on all nodes:

yum install -y nc

If you installed the Basic Server option on your server, the nc package might not be

installed, which might cause failures on the datanodes of the IBM Open Platform with

Apache Hadoop.

3. Locate the IOP-4.0.0.0.x86_64.rpm file you downloaded from the download site. Run the

following command to install the ambari.repo file into /etc/yum.repos.d:

yum install IOP-4.0.0.0.x86_64.rpm

If using a mirror repository, edit the file /etc/yum.repos.d/ambari.repo and replace

baseurl=http://ibm-open-platform.ibm.com/repos/Ambari/RHEL6/x86_64/1.7

with your mirror URL. For example,

baseurl=http://<web.server>/repos/Ambari/RHEL6/x86_64/1.7/

Disable the gpgcheck in the ambari.repo file. To disable signature validation,

change gpgcheck=1 to gpgcheck=0.

Alternatively, you can keep gpgcheck on and change the public key file location to the

mirror Ambari repository. To do this, change the following




gpgkey=http://ibm-open-platform.ibm.com/repos/Ambari/RHEL6/x86_64/1.7/BI-GPG-KEY.public

to the following:

gpgkey=http://<web.server>/repos/Ambari/RHEL6/x86_64/1.7/BI-GPG-KEY.public
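The host substitution can also be applied with sed rather than by hand. The sketch below assumes the mirror reproduces the same /repos/... directory layout as the public server; use_mirror is a hypothetical helper name. It rewrites both the baseurl and gpgkey lines and leaves gpgcheck untouched, which matches the "keep gpgcheck on" alternative.

```shell
# Point every URL in a repository file at a mirror host, in place.
# A backup of the original is kept alongside it with a .bak suffix.
use_mirror() {
  repo_file=$1
  mirror_host=$2
  sed -i.bak "s|http://ibm-open-platform\.ibm\.com/|http://$mirror_host/|g" "$repo_file"
}

# Example (mirror.example.com is a placeholder):
#   use_mirror /etc/yum.repos.d/ambari.repo mirror.example.com
```

Since the repoinfo.xml and ambari.properties files edited in later steps reference the same public host name, the same substitution may be applicable there as well.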

4. Clean the yum cache on each node so that the right packages from the remote repository

are seen by your local yum.

>sudo yum clean all

5. Install the Ambari server on the intended management node, using the following

command:

>sudo yum install ambari-server

Accept the install defaults.

6. If you are using a mirror repository, after you install the Ambari server, update the

following file with the mirror repository URLs.

/var/lib/ambari-server/resources/stacks/BigInsights/4.0/repos/repoinfo.xml

In the file, change the information from the Original content to the Modified content.

Original content:

<os type="redhat6">
  <repo>
    <baseurl>http://ibm-open-platform.ibm.com/repos/IOP/RHEL6/x86_64/4.0</baseurl>
    <repoid>IOP-4.0</repoid>
    <reponame>IOP</reponame>
  </repo>
  <repo>
    <baseurl>http://ibm-open-platform.ibm.com/repos/IOP-UTILS/RHEL6/x86_64/1.0</baseurl>
    <repoid>IOP-UTILS-1.0</repoid>
    <reponame>IOP-UTILS</reponame>
  </repo>
</os>

Modified content:

<os type="redhat6">
  <repo>
    <baseurl>http://<web.server>/repos/IOP/RHEL6/x86_64/4.0</baseurl>
    <repoid>IOP-4.0</repoid>
    <reponame>IOP</reponame>
  </repo>
  <repo>
    <baseurl>http://<web.server>/repos/IOP-UTILS/RHEL6/x86_64/1.0</baseurl>
    <repoid>IOP-UTILS-1.0</repoid>
    <reponame>IOP-UTILS</reponame>
  </repo>
</os>

Edit the /etc/ambari-server/conf/ambari.properties file. Change the information from the

Original content to the Modified content.

Original content:

jdk1.7.url=http://ibm-open-platform.ibm.com/repos/IOP-UTILS/RHEL6/x86_64/1.0/openjdk/jdk-1.7.0.tar.gz

Modified content:

jdk1.7.url=http://<web.server>/repos/IOP-UTILS/RHEL6/x86_64/1.0/openjdk/jdk-1.7.0.tar.gz

7. Set up the Ambari server, using the following command:

>sudo ambari-server setup

Accept the setup preferences.

A Java JDK is installed as part of the Ambari server setup. However, the Ambari server

setup also allows you to reuse an existing JDK. The command is:

ambari-server setup -j /full/path/to/JDK

The JDK path set by the -j parameter must be the same on each node in the cluster.

8. Start the Ambari server, using the following command:

>sudo ambari-server start




9. If the Ambari server had been installed on your node previously, the node may contain

old cluster information. Reset the Ambari server to clean up its cluster information in the

database, using the following commands:

>sudo ambari-server stop

>sudo ambari-server reset

>sudo ambari-server start

10. Access the Ambari web user interface from a web browser by using the server name

(the fully qualified domain name, or the short name) on which you installed the software,

and port 8080. For example, enter abc.com:8080.

You can use any available port other than 8080 that will allow you to connect to the

Ambari server. In some networks, port 8080 is already in use. To use another port, do

the following:

a. Edit the ambari.properties file:

vi /etc/ambari-server/conf/ambari.properties

b. Add a line in the file to select another port:

client.api.port=8081

c. Save the file and restart the Ambari server:

ambari-server restart
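Steps a and b can be made repeatable with a small helper that replaces the property if it is already present and appends it otherwise. This is a sketch, not an Ambari tool: set_property is a hypothetical name, and it assumes simple key=value lines whose values contain no "|" or "&" characters.

```shell
# Set key=value in a properties-style file, replacing an existing line
# for the key or appending a new one. Keeps a .bak copy when replacing.
set_property() {
  file=$1
  key=$2
  value=$3
  # Escape dots so the key is matched literally in the regular expressions.
  pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
  if grep -q "^$pattern=" "$file"; then
    sed -i.bak "s|^$pattern=.*|$key=$value|" "$file"
  else
    printf '%s=%s\n' "$key" "$value" >> "$file"
  fi
}

# Example:
#   set_property /etc/ambari-server/conf/ambari.properties client.api.port 8081
#   ambari-server restart
```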

11. Log in to the Ambari server with the default username and password: admin/admin.

The default username and password is required only for the first login. You can

configure users and groups after the first login to the Ambari web interface.




12. On the Welcome page, click Launch Install Wizard.

13. On the Get Started page, enter a name for the cluster you want to create. The name

cannot contain blank spaces or special characters. Click Next.

14. You will deploy IBM Open Platform for Apache Hadoop with EMC Isilon. Ambari Server

allows for the immediate usage of an Isilon cluster for all HDFS services (NameNode and

DataNode); no reconfiguration will be necessary once the IBM Open Platform install is

completed.

1. SSH into Isilon as root and configure the Ambari Agent.

isiloncluster1-1# isi zone zones modify zone1 --hdfs-ambari-namenode mycluster1-hdfs.example.com

isiloncluster1-1# isi zone zones modify zone1 --hdfs-ambari-server manager-svr-1.example.com




15. On the Select Stack page, click the Stack version you want to install (BigInsights™ 4.0).

Click Next.

16. On the Install Options page, in Target Hosts, add the list of Linux hosts that the

Ambari server will manage and on which the IBM Open Platform with Apache Hadoop software

will be deployed, one node per line. For example, enter

host1.example.com

host2.example.com

host3.example.com

host4.example.com

In Host Registration Information, select one of the two options:

Provide the SSH Private Key to automatically register hosts




Click SSH Private Key. The private key file is /root/.ssh/id_rsa, where the root user

installed the Ambari server. Click Choose File to find the private key file you installed

previously. You should have retained a copy of the SSH private key (.ssh/id_rsa) in your

local directory when you set up password-less SSH. Copy and paste the key into the text

box manually. Click the Register and Confirm button.

____________________________________________________________________

Note: After the Linux hosts register, click the Back button and select Perform manual

registration for Isilon. Do not use SSH for Isilon.

____________________________________________________________________

Isilon has an ambari-agent within OneFS and needs to be manually registered in Ambari.

After registering Isilon manually, click the Next button. You should see the Ambari

agents on both your Linux hosts and Isilon become registered.

17. On the Confirm Hosts page, you check that the correct hosts for your cluster have been

located and that those hosts have the correct directories, packages, and processes to

continue the installation.

If hosts were selected in error, click the check boxes next to the hosts you want to

remove. Click Remove Selected. To remove a single host, click Remove in

the Action column.

If warnings are found during the check process, you can click Click here to see the

warnings to see what caused the warnings. The Host Checks page identifies any issues

with the hosts. For example, a host may have Transparent Huge Pages or Firewall issues.

You can ignore errors related to user names and groups as we pre-created the

users in the pre-installation steps of this document.

After you resolve the issues, click Rerun Checks on the Host Checks page. When you

have confirmed the hosts, click Next.

18. On the Choose Services page, select the services you want to install.




Ambari shows a confirmation message to install the required service dependencies. For

example, when selecting Oozie only, the Ambari web interface shows messages for

accepting YARN/MR2, HDFS and Zookeeper installations. It also shows Nagios and

Ganglia for monitoring and alerting, but they are not required services.

19. On the Assign Masters page, assign NameNode and SNameNode components to the

Isilon SmartConnect address e.g. mycluster1-hdfs.example.com. The rest of the services

can be deployed per the recommended services layout - refer back to Table 1. Make

sure you assign Namenode and SNameNode only to the Isilon SmartConnect

address and none of the Linux nodes, e.g. only mycluster1-hdfs.example.com. Click

Next.

On the Assign Slaves and Clients page, assign the components to Linux hosts in your

cluster and make sure DataNode is assigned only to Isilon.

Assign Client to the client nodes. Click Next.

Tip: If you anticipate adding the Big SQL service at some later time, you must include all

clients on all the anticipated Big SQL worker nodes. Big SQL specifically needs the HDFS,

Hive, HBase, Sqoop, HCat, and Oozie clients.

20. On the Customize Services page, select configuration settings for the services selected.

Default values are filled in automatically when available and they are the recommended

values. The installation wizard prompts you for required fields (such as password entries)

by displaying a number in a circle next to an installed service.

Assign passwords to Hive, Oozie, and any other selected services that require them.

The following settings should be checked:

• YARN Node Manager log-dirs

• YARN Node Manager local-dirs

• HBase local directory

• ZooKeeper directory




• Oozie Data Dir

• Storm storm.local.dir

Click the number and enter the requested information in the field outlined in red. Make

sure that the service port that is set is not already used by another component. For

example, the Knox gateway port is, by default, set as 8443. But, when the Ambari server

is set up with HTTPs, and the SSL port is set up using 8443, then you must change the

Knox gateway port to some other value.

____________________________________________________________________

Note: If you are working in an LDAP environment where users are set up centrally by the

LDAP administrator and therefore, already exist, selecting the defaults can cause the

installation to fail. Open the Misc tab, and check the box to ignore user modification

errors.

21. When you have completed the configuration of the services, click Next.

22. On the Review page, verify that your settings are correct. Click Deploy.

23. The Install, Start, and Test page shows the progress of the installation. The progress

bar at the top of the page gives the overall status while the main section of the page

gives the status for each host. Logs for a specific task can be displayed by clicking on the

task. Click the link in the Message column to find out what tasks have been completed for

a specific host or to see the warnings that have been encountered. When the message

"Successfully installed and started the services" appears, click Next.

24. On the Summary page, review the accomplished tasks. Click Complete to go to the IBM

Open Platform with Apache Hadoop dashboard.

Validating IBM Open Platform Install

Ambari provides service checks for all the supported services. These checks run

automatically after each service installation, or they can be run manually at any time. You




can access the Ambari web interface and use the Services View to make sure all the

components pass their checks successfully.

The following steps provide another way to validate your installation.

1. As the root user on a node on which Apache Hadoop is installed, enter the following

command to become the ambari-qa user:

su - ambari-qa

2. As the ambari-qa user, run the following command:

export HADOOP_MR_DIR=/usr/iop/current/hadoop-mapreduce-client

# Generate data with 1000 rows. Each row is about 100 bytes.
yarn jar $HADOOP_MR_DIR/hadoop-mapreduce-examples.jar teragen 1000 /tmp/tgout

# Sort data
yarn jar $HADOOP_MR_DIR/hadoop-mapreduce-examples.jar terasort /tmp/tgout /tmp/tsout

# Validate data
yarn jar $HADOOP_MR_DIR/hadoop-mapreduce-examples.jar teravalidate /tmp/tsout /tmp/tvout

If the job is successful, you will see a log record similar to the following:

INFO mapreduce.Job: Job job_id completed successfully
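The three jobs can be chained so that a failure in one stops the sequence. The following is a minimal sketch; run_terasuite is a hypothetical name, and the YARN variable is an assumption added here so that the commands can be previewed (for example with YARN=echo) without a cluster.

```shell
# Run teragen, terasort, and teravalidate in order, stopping at the first failure.
# $1 is the number of rows to generate.
run_terasuite() {
  rows=$1
  jar="${HADOOP_MR_DIR:-/usr/iop/current/hadoop-mapreduce-client}/hadoop-mapreduce-examples.jar"
  yarn_cmd=${YARN:-yarn}
  "$yarn_cmd" jar "$jar" teragen "$rows" /tmp/tgout &&
  "$yarn_cmd" jar "$jar" terasort /tmp/tgout /tmp/tsout &&
  "$yarn_cmd" jar "$jar" teravalidate /tmp/tsout /tmp/tvout
}

# Example:
#   YARN=echo run_terasuite 1000   # preview the three commands
#   run_terasuite 1000             # run them as the ambari-qa user
```

The jobs fail if their output directories already exist, so a rerun typically needs /tmp/tgout, /tmp/tsout, and /tmp/tvout removed first, e.g. with hdfs dfs -rm -r -skipTrash.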

Browse to your cluster on port 8088 to see the results of your validation tests, e.g.

http://x.x.x.x:8088/cluster.




Adding a Hadoop User

You must add a user account for each Linux user that will submit MapReduce jobs. The

procedure below can be used to add a user named hduser1 as an example.

1. Add user to Isilon.

isiloncluster1-1# isi auth groups create hduser1 --zone zone1 --provider local

isiloncluster1-1# isi auth users create hduser1 --primary-group hduser1 --zone zone1 \

--provider local --home-directory /ifs/isiloncluster1/zone1/hadoop/user/hduser1

2. Add user to Hadoop nodes.

[root@mycluster1-master-0 ~]# adduser hduser1

3. Create the user’s home directory on HDFS.

[root@mycluster1-master-0 ~]# sudo -u hdfs hdfs dfs -mkdir -p /user/hduser1

[root@mycluster1-master-0 ~]# sudo -u hdfs hdfs dfs -chown hduser1:hduser1 \

/user/hduser1

[root@mycluster1-master-0 ~]# sudo -u hdfs hdfs dfs -chmod 755 /user/hduser1
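When several users need HDFS home directories, the commands in step 3 can be generated per user name. This sketch covers only step 3 and only prints the commands; hdfs_home_commands is a hypothetical helper, and the Isilon and Linux accounts from steps 1 and 2 are assumed to exist already.

```shell
# Print the HDFS commands that create a home directory for each user name given,
# owned by that user and group, with mode 755.
hdfs_home_commands() {
  for user in "$@"; do
    echo "sudo -u hdfs hdfs dfs -mkdir -p /user/$user"
    echo "sudo -u hdfs hdfs dfs -chown $user:$user /user/$user"
    echo "sudo -u hdfs hdfs dfs -chmod 755 /user/$user"
  done
}

# Example: review the output, then pipe it to a shell to execute.
#   hdfs_home_commands hduser1 hduser2
#   hdfs_home_commands hduser1 hduser2 | sh
```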

Additional Service Tests

The tests below should be performed to ensure a proper installation. Perform the tests in the

order shown. You must create the Hadoop user hduser1 before proceeding.

HDFS

[root@mycluster1-master-0 ~]# sudo -u hdfs hdfs dfs -ls /

Found 5 items

-rw-r--r-- 1 root hadoop 0 2014-08-05 05:59 /THIS_IS_ISILON

drwxr-xr-x - hbase hbase 148 2014-08-05 06:06 /hbase

drwxrwxr-x - solr solr 0 2014-08-05 06:07 /solr

drwxrwxrwt - hdfs supergroup 107 2014-08-05 06:07 /tmp

drwxr-xr-x - hdfs supergroup 184 2014-08-05 06:07 /user

[root@mycluster1-master-0 ~]# sudo -u hdfs hdfs dfs -put -f /etc/hosts /tmp

[root@mycluster1-master-0 ~]# sudo -u hdfs hdfs dfs -cat /tmp/hosts

127.0.0.1 localhost

[root@mycluster1-master-0 ~]# sudo -u hdfs hdfs dfs -rm -skipTrash /tmp/hosts




[root@mycluster1-master-0 ~]# su - hduser1

[hduser1@mycluster1-master-0 ~]$ hdfs dfs -ls /

Found 5 items

-rw-r--r-- 1 root hadoop 0 2014-08-05 05:59 /THIS_IS_ISILON

drwxr-xr-x - hbase hbase 148 2014-08-05 06:28 /hbase

drwxrwxr-x - solr solr 0 2014-08-05 06:07 /solr

drwxrwxrwt - hdfs supergroup 107 2014-08-05 06:07 /tmp

drwxr-xr-x - hdfs supergroup 209 2014-08-05 06:39 /user

[hduser1@mycluster1-master-0 ~]$ hdfs dfs -ls

...

YARN/MAPREDUCE

[hduser1@mycluster1-master-0 ~]$ hadoop jar \

/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \

pi 10 1000

...

Estimated value of Pi is 3.14000000000000000000

[hduser1@mycluster1-master-0 ~]$ hadoop fs -mkdir in

You can put any file into the in directory. It will be used as the data source for subsequent tests.

[hduser1@mycluster1-master-0 ~]$ hadoop fs -put -f /etc/hosts in

[hduser1@mycluster1-master-0 ~]$ hadoop fs -ls in

...

[hduser1@mycluster1-master-0 ~]$ hadoop fs -rm -r out

[hduser1@mycluster1-master-0 ~]$ hadoop jar \

/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \

wordcount in out

...

[hduser1@mycluster1-master-0 ~]$ hadoop fs -ls out

Found 4 items

-rw-r--r-- 1 hduser1 hduser1 0 2014-08-05 06:44 out/_SUCCESS

-rw-r--r-- 1 hduser1 hduser1 24 2014-08-05 06:44 out/part-r-00000



[hduser1@mycluster1-master-0 ~]$ hadoop fs -cat out/part*

localhost 1

127.0.0.1 1

Browse to the YARN Resource Manager GUI http://mycluster1-master-0.example.com:8088/

Browse to the MapReduce History Server GUI http://mycluster1-master-0.lab.example.com:19888/.

In particular, confirm that you can view the complete logs for task attempts.





HIVE

[hduser1@mycluster1-master-0 ~]$ hadoop fs -mkdir -p sample_data/tab1

[hduser1@mycluster1-master-0 ~]$ cat - > tab1.csv

1,true,123.123,2012-10-24 08:55:00

2,false,1243.5,2012-10-25 13:40:00

3,false,24453.325,2008-08-22 09:33:21.123

4,false,243423.325,2007-05-12 22:32:21.33454

5,true,243.325,1953-04-22 09:11:33

Type <Control+D>.

[hduser1@mycluster1-master-0 ~]$ hadoop fs -put -f tab1.csv sample_data/tab1

[hduser1@mycluster1-master-0 ~]$ hive

hive>

DROP TABLE IF EXISTS tab1;

CREATE EXTERNAL TABLE tab1

(

id INT,

col_1 BOOLEAN,

col_2 DOUBLE,

col_3 TIMESTAMP

)

ROW FORMAT DELIMITED FIELDS TERMINATED BY ','

LOCATION '/user/hduser1/sample_data/tab1';

DROP TABLE IF EXISTS tab2;

CREATE TABLE tab2

(

id INT,

col_1 BOOLEAN,

col_2 DOUBLE,

month INT,

day INT

)

ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

INSERT OVERWRITE TABLE tab2

SELECT id, col_1, col_2, MONTH(col_3), DAYOFMONTH(col_3)

FROM tab1 WHERE YEAR(col_3) = 2012;

...

OK

Time taken: 28.256 seconds

hive> show tables;

OK




tab1

tab2

Time taken: 0.889 seconds, Fetched: 2 row(s)

hive> select * from tab1;

OK

1 true 123.123 2012-10-24 08:55:00

2 false 1243.5 2012-10-25 13:40:00

3 false 24453.325 2008-08-22 09:33:21.123

4 false 243423.325 2007-05-12 22:32:21.33454

5 true 243.325 1953-04-22 09:11:33


hive> select * from tab2;

OK

1 true 123.123 10 24

2 false 1243.5 10 25


hive> select * from tab1 where id=1;

OK

1 true 123.123 2012-10-24 08:55:00


hive> select * from tab2 where id=1;

OK

1 true 123.123 10 24


hive> exit;
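The interactive cat step at the start of this test can be scripted with a here-document, which avoids typing the rows and pressing Control+D. The sketch below also previews locally which rows the YEAR(col_3) = 2012 filter should keep, so the tab2 output can be checked against it. The temporary directory and variable names are illustrative assumptions. The upload runs only if a hadoop client is on the PATH.

```shell
#!/usr/bin/env bash
# Write the five sample rows with a here-document instead of "cat -".
workdir=$(mktemp -d)
cat > "$workdir/tab1.csv" <<'EOF'
1,true,123.123,2012-10-24 08:55:00
2,false,1243.5,2012-10-25 13:40:00
3,false,24453.325,2008-08-22 09:33:21.123
4,false,243423.325,2007-05-12 22:32:21.33454
5,true,243.325,1953-04-22 09:11:33
EOF

# Local preview of the Hive filter: field 4 is the timestamp, so keep the
# ids of rows whose timestamp starts with "2012-".
expected_tab2_ids=$(awk -F, '$4 ~ /^2012-/ { print $1 }' "$workdir/tab1.csv" | paste -sd, -)
echo "rows expected in tab2: $expected_tab2_ids"

# Upload only when a Hadoop client is actually present on this host.
if command -v hadoop >/dev/null 2>&1; then
  hadoop fs -mkdir -p sample_data/tab1
  hadoop fs -put -f "$workdir/tab1.csv" sample_data/tab1
fi
```

The ids printed by the preview should match the rows returned by select * from tab2.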

HBASE

[hduser1@mycluster1-master-0 ~]$ hbase shell

hbase(main):001:0> create 'test', 'cf'

0 row(s) in 3.3680 seconds

=> Hbase::Table - test

hbase(main):002:0> list 'test'

TABLE

test


=> ["test"]

hbase(main):003:0> put 'test', 'row1', 'cf:a', 'value1'


hbase(main):004:0> put 'test', 'row2', 'cf:b', 'value2'


hbase(main):005:0> scan 'test'

ROW COLUMN+CELL

row1 column=cf:a,timestamp=1407542488028,value=value1

row2 column=cf:b,timestamp=1407542499562,value=value2


hbase(main):006:0> get 'test', 'row1'

COLUMN CELL

cf:a timestamp=1407542488028,value=value1


hbase(main):007:0> quit

Ambari Service Check

Ambari has built-in functional tests for each component. These are executed automatically

when you install your cluster with Ambari. To execute them after installation, select the service

in Ambari, click the Service Actions button, and select Run Service Check.
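A service check can also be requested without the GUI. The following is a sketch only, assuming the Ambari REST API accepts a POST to /api/v1/clusters/<cluster>/requests with a command named <SERVICE>_SERVICE_CHECK. That naming holds for many services but not all, so verify it for your Ambari version. The function names and the AMBARI_USER and AMBARI_PASS variables are hypothetical. Port 8080 is the default Ambari port.

```shell
#!/usr/bin/env bash
# Build the JSON body for an Ambari "run service check" request.
# $1 = service name as Ambari knows it (for example HDFS).
# Assumption: the command is named "<SERVICE>_SERVICE_CHECK".
service_check_payload() {
  local svc=$1
  printf '{"RequestInfo":{"context":"%s Service Check","command":"%s_SERVICE_CHECK"},"Requests/resource_filters":[{"service_name":"%s"}]}\n' \
    "$svc" "$svc" "$svc"
}

# POST the request. $1 = Ambari host, $2 = cluster name, $3 = service name.
# Credentials come from AMBARI_USER / AMBARI_PASS (hypothetical variables).
run_service_check() {
  curl -s -u "${AMBARI_USER:-admin}:${AMBARI_PASS:-admin}" \
       -H 'X-Requested-By: ambari' -X POST \
       -d "$(service_check_payload "$3")" \
       "http://$1:8080/api/v1/clusters/$2/requests"
}
```

For example: run_service_check mycluster1-master-0.example.com mycluster1 HDFS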


__________________________________________________________________


Installing IBM Value Packages

Before You Begin

Please note that the “BigInsights Analyst” and “BigInsights Data Scientist” value packages have been sanity tested on EMC Isilon, but have not been performance profiled or tested under load with Isilon version 7.2.0.3. EMC and IBM BigInsights plan to validate these components under load as part of future integration efforts. Please refer to the EMC – IBM BigInsights Joint Support Statement for further details.

You must acquire the software from Passport Advantage. The acquired software has a *.bin

extension. The name of the *.bin file depends on whether the BigInsights Analyst or the

BigInsights Data Scientist module was downloaded.

When you run the *.bin file, configuration files are copied to the appropriate locations so that Ambari can see the value-add services as available. When adding the value-add services through Ambari, additional software packages can be downloaded. If the Hadoop cluster cannot directly access the internet, a local mirror repository can be created.

Where you perform the following steps depends on whether the Hadoop cluster has

direct internet access.

If the Hadoop cluster has direct access to the internet, perform the steps from the

Ambari server of the Hadoop cluster.

If the Hadoop cluster does not have direct internet access, perform the steps from

a Linux host with direct internet access. Then, transfer the files, as required, to a

local repository mirror.


__________________________________________________________________


Installation Procedure

1. Update the permissions on the downloaded *.bin file to enable execute.

chmod +x <package_name>.bin

2. Run the *.bin file to extract and install the services in the module.

./<package_name>.bin

where <package_name> is BI-Analyst-xxxxx.bin for the Analyst module or BI-DS-xxxxx.bin for the Data Scientist module.

3. After the prompt, agree to the license terms. Reply yes or y to continue the installation.

4. After the prompt, choose if you want to do an online (option 1) or offline

(option 2) install.

a. Online install will lay out the Ambari service configuration files and

update the repository locations in the Ambari server file. Skip to step 6.

b. Offline install initiates a download of files to set up a local repository mirror. A subdirectory called BigInsights is created, and the RPMs and associated files are placed in the BigInsights/packages directory.

5. Set up a local repository.

A local repository is required if the Hadoop cluster cannot connect directly to the internet,

or if you wish to avoid multiple downloads of the same software when installing services

across multiple nodes. In the following steps, the host that performs the repository mirror

function is called the repository server. If you do not have an additional Linux host, you

can use one of the Hadoop management nodes. The repository server must be accessible

over the network by the Hadoop cluster. The repository server requires an HTTP web

server. The following instructions describe how to set up a repository server by using a

Linux host with an Apache HTTP server.

a. On the repository server, if the Apache HTTP server is not installed,

install it:




yum install httpd

b. On the repository server, ensure that the createrepo package is

installed.

c. On the repository server, create a directory for your value-add

repository, such as <mirror web server document

root>/repos/valueadds. For example, for Apache httpd, the default is

/var/www/html/repos.

mkdir -p /var/www/html/repos/valueadds

d. By selecting Option 2 in step 4, RPMs were downloaded to a

subdirectory called BigInsights/packages. Copy all of the RPMs to the

mirror web server location, <your.mirror.web.server.document

root>/repos/valueadds directory.

cp BigInsights/packages/* /var/www/html/repos/valueadds/

e. Start this web server. If you use Apache httpd, start it by using either of

the following commands:

apachectl start or service httpd start

f. Test your local repository by browsing to the web directory:

http://<your.mirror.web.server>/repos/valueadds

You should see all of the files that you copied to the repository server.

g. On the repository server, run the createrepo command to initialize the

repository:

createrepo /var/www/html/repos/valueadds

h. In the BigInsights/packages directory, find the RPM to install on the

Ambari Server host of the Hadoop cluster:

BigInsights Analyst

BI-Analyst-X.X.X.X-IOP-X.X.x86_64.rpm




BigInsights Data Scientist

BI-DS-X.X.X.X-IOP-X.X.x86_64.rpm

Tip: The BigInsights Data Scientist module also entitles you to the features of the

BigInsights Analyst module. Therefore, consider doing the yum install for both of the RPM

packages.

Then, copy the file to the Ambari Server host and install the RPMs by using the following

commands:

sudo yum install <BI-xxx-1.0.0.1-IOP...>.rpm

i. On the Ambari Server node, navigate to the /var/lib/ambari-server/resources/stacks/BigInsights/<version_number>/repos/repoinfo.xml file. If the file does not exist, create it. Ensure the <baseurl> element for the BIGINSIGHTS-VALUEPACK <repo> entry points to your repository server. Remember, there might be multiple <repo> sections. Make sure that the URL you tested in step 5.f matches exactly the value indicated in the <baseurl> element. For example, the repoinfo.xml might look like the following content after you change http://ibm-open-platform.ibm.com/repos/BigInsights-Valuepacks/ to become http://<your.mirror.web.server>/repos/valueadds:

<repo>

<baseurl>http://<your.mirror.web.server>/repos/valueadds</baseurl>

<repoid>BIGINSIGHTS-VALUEPACK</repoid>

<reponame>BIGINSIGHTS-VALUEPACK</reponame>

</repo>

Note: The new <repo> section might appear as a single line.

Tip: If you later find an error in this configuration file, make corrections and run the

following command:



yum clean all

Then, restart the Ambari server.

j. When the module is installed, restart the Ambari server.


k. Open the Ambari web interface and log in. The default address is the

following URL:

http://<server-name>:8080

The default login name is admin and the default password is admin.

l. Click Actions > Add Service. The list of services shows the services that you previously added as well as the BigInsights services that you can now add.
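Steps 5.c through 5.g and the repoinfo.xml change in step 5.i can be collected into two small shell functions. This is a sketch under stated assumptions: the function names are hypothetical, the package directory and document root are passed in rather than hard-coded, and the mirror URL is assumed not to contain the characters # or &, which are special to the sed expression used here.

```shell
#!/usr/bin/env bash
# Copy the downloaded files into the web server's document tree and index them.
# $1 = directory holding the RPMs (BigInsights/packages in this guide)
# $2 = repository directory (for example /var/www/html/repos/valueadds)
publish_rpms() {
  local src=$1 dest=$2
  mkdir -p "$dest" || return 1
  cp "$src"/* "$dest"/ || return 1
  # createrepo builds the repodata/ metadata that yum needs (step 5.g).
  if command -v createrepo >/dev/null 2>&1; then
    createrepo "$dest"
  else
    echo "createrepo not found; install it and rerun" >&2
  fi
}

# Point the value-pack <baseurl> in repoinfo.xml at the mirror.
# $1 = path to repoinfo.xml, $2 = mirror URL. A .bak copy is kept.
point_repo_at_mirror() {
  local file=$1 url=$2
  cp "$file" "$file.bak" || return 1
  # Replace the IBM-hosted URL (with or without a trailing slash).
  sed "s#http://ibm-open-platform.ibm.com/repos/BigInsights-Valuepacks/\{0,1\}#$url#" \
      "$file.bak" > "$file"
}
```

After point_repo_at_mirror has run on the Ambari Server host, run yum clean all and restart the Ambari server as described in the steps above.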


__________________________________________________________________


Select IBM BigInsights Service to Install

Select the service that you want to install and deploy. Even though your module might

contain multiple services, install the specific service that you want and the BigInsights™

Home service. Installing one value-add service at a time is recommended. Follow the

service-specific installation instructions for more information.

At the conclusion of installing all the IBM BigInsights Services, the Ambari GUI Software

List should have green check marks next to each service as shown below:


__________________________________________________________________


Installing BigInsights Home

The BigInsights Home service is the main interface to launch BigInsights - BigSheets,

BigInsights - Text Analytics, and BigInsights - Big SQL.

The BigInsights Home service requires Knox to be installed, configured and started.

Open a browser and access the Ambari server dashboard. The following is the default URL:

http://<server-name>:8080


The default user name is admin, and the default password is admin.

In the Ambari dashboard, click Actions > Add Service.

In the Add Service Wizard > Choose Services, select the BigInsights – BigInsights Home

service. Click Next. If you do not see the option for BigInsights – BigInsights Home, follow the

instructions described in Installing the BigInsights value-add packages.

In the Assign Masters page, select a Management node (edge node) that your users can

communicate with. BigInsights Home is a web application that your users must be able to open

with a web browser.

In the Assign Slaves and Clients page, make selections to assign slaves and clients.

The nodes that you select will have JSQSH (an open source, command line interface to SQL for

Big SQL and other database engines) and SFTP client. Select nodes that might be used to ingest

data as an SFTP client, where you might want to work with Big SQL scripts, or other databases

interactively.

Click Next to review any options that you might want to customize.

Click Deploy.

If the BigInsights – BigInsights Home service fails to install, run the

remove_value_add_services.sh cleanup script. The following code is an example command:

cd /usr/ibmpacks/bin/<version>


./remove_value_add_services.sh -u admin -p admin -x 8080 -s WEBUIFRAMEWORK -r

For more information about cleaning the value-add service environment, see Removing

BigInsights value-add services.

After installation is complete, click Next > Complete.

Configure Knox

The Apache Knox gateway is a system that provides a single point of authentication and access for Apache Hadoop services on the compute nodes in a cluster; however, authentication to HDFS services is controlled entirely by Isilon OneFS.

The Knox gateway simplifies Hadoop security for users that access the cluster and execute jobs

and operators that control access and manage the cluster. The gateway runs as a server, or a

cluster of servers, providing centralized access to one or more Hadoop clusters.

In IBM® Open Platform with Apache Hadoop, Knox is a service that you start, stop, and

configure in the Ambari web interface.

Users access the following BigInsights™ value added components through Knox by going to the

IBM BigInsights home service.

https://<knox_host>:<knox_port>/<knox_gateway_path>/default/BigInsightsWeb/index.html

BigSheets

Text Analytics

Big SQL

Knox supports only REST API calls for the following Hadoop services:

WebHCat




Oozie

HBase

Hive

Yarn

Click the Knox service from the Ambari web interface to see the summary page.

Select Service Actions > Restart All to restart it and all of its components.

If you are using LDAP, you must also start LDAP if it is not already started.

Click the BigInsights Home service in the Ambari User Interface.

Select Service Actions > Restart All to restart it and all of its components.

Open the BigInsights Home page from a web browser.

The URL for BigInsights Home is:

https://<knox_host>:<knox_port>/<knox_gateway_path>/default/BigInsightsWeb/index.html

where:

knox_host

The host where Knox is installed and running

knox_port

The port where Knox is listening (by default this is 8443)

knox_gateway_path

The value entered in the gateway.path field in the Knox configuration (by default this is

'gateway')




For example, the URL might look like the following address:

https://myhost.company.com:8443/gateway/default/BigInsightsWeb/index.html

If you are using the Knox Demo LDAP, a default user ID and password is created for you. When

you access the web page, use the following preset credentials:

User Name = guest Password = guest-password
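The URL pattern above is easy to mistype, so it can help to assemble it from its three parts. This is a minimal sketch. The function name bi_home_url is hypothetical, and the defaults are the Knox defaults stated above (port 8443, gateway path 'gateway').

```shell
#!/usr/bin/env bash
# Assemble the BigInsights Home URL from its three parts.
# $1 = knox_host (required)
# $2 = knox_port (defaults to 8443)
# $3 = knox_gateway_path (defaults to "gateway")
bi_home_url() {
  local host=$1 port=${2:-8443} path=${3:-gateway}
  printf 'https://%s:%s/%s/default/BigInsightsWeb/index.html\n' "$host" "$port" "$path"
}

# Reproduces the example address from this section.
bi_home_url myhost.company.com
```

Pass a second and third argument only if the Knox port or the gateway.path value was changed from its default.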

Installing BigSheets

To extend the power of the Open Platform for Apache Hadoop, install and deploy the BigInsights

BigSheets service, which is the IBM spreadsheet interface for big data.

1. Open a browser and access the Ambari server dashboard. The following is the default URL:

http://<server-name>:8080



2. In the Ambari Dashboard, click Actions > Add Service.

3. In the Add Service Wizard, Choose Services, select the BigInsights -

BigSheets service, and if you have not already installed the BigInsights Home service,

select that as well. Click Next.

If you do not see BigInsights – BigSheets service, you need to install the appropriate

module and restart Ambari as described in Installing the BigInsights value-add packages.

4. In the Assign Masters page, decide on which node of your cluster you want to run the

specified BigSheets master.

5. In the Assign Slaves and Clients page, all the defaults are automatically accepted and the next page automatically appears. The BigSheets service does not have any slaves and clients. The Assign Slaves and Clients page will show and be skipped immediately during install. This is the expected behavior.

6. In the Customize Services page, accept the recommended configurations for the

BigSheets service, or customize the configuration by expanding the configuration files

and modifying the values. In the Advanced bigsheets-user-config section, make sure that you enter the following information:

a. In the bigsheets.user field, leave the default user name, which is bigsheets.

b. In the bigsheets.password field, type a valid password.

c. In the bigsheets.userid field, type a valid user ID to use for the bigsheets service user. This user ID is created across all of the nodes of the cluster, and must be unique across all nodes of the cluster.

d. Click Next.

7. In the Advanced bigsheets-ambari-config section, in the ambari.password field,

type the correct Ambari administration password.

8. You can review your selections in the Review page before accepting them. If you want

to modify any values, click the Back button. If you are satisfied with your setup,

click Deploy.

9. In the Install, Start and Test page, the BigSheets service is installed and verified. If

you have multiple nodes, you can see the progress on each node. When the installation is

complete, either view the errors or warnings by clicking the link, or click Next to see a

summary and then the new service added to the list of services.

10. Click Complete.

If the BigInsights – BigSheets service fails to install, run the remove_value_add_services.sh cleanup script. The following code is an example of the command:


./remove_value_add_services.sh -u admin -p admin -x 8080 -s BIGSHEETS -r





11. After you install BigInsights - BigSheets, you must restart the HDFS, MapReduce2, YARN,

Knox, Nagios and Ganglia client services.

a. For each service that requires restart, select the service.

b. Click Service Actions.

c. Click Restart All.

12. Access the BigInsights - BigSheets service from the BigInsights Home service.

o If the BigInsights Home service has not yet been added, see Installing

BigInsights Home.

o If the BigInsights Home service has been installed, it must be restarted so

the BigInsights - BigSheets icon will display.

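13. Launch the BigInsights Home service by typing the following address in your browser:

https://<knox_host>:<knox_port>/<knox_gateway_path>/default/BigInsightsWeb/index.html

Where:

knox_host

The host where Knox is installed and running

knox_port

The port where Knox is listening (by default this is 8443)

knox_gateway_path

The value entered in the gateway.path field in the Knox configuration (by default this is 'gateway')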





Installing Big SQL

To extend the power of the Open Platform for Apache Hadoop, install and deploy the BigInsights

- Big SQL service, which is the IBM SQL interface to the Hadoop-based platform, IBM Open

Platform with Apache Hadoop.


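1. Open a browser and access the Ambari server dashboard. The following is the default URL:

http://<server-name>:8080

The default user name is admin, and the default password is admin.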

2. In the Ambari web interface, click Actions > Add Service.

3. In the Add Service Wizard, Choose Services, select the BigInsights - Big SQL service, and the BigInsights Home service. Click Next.

If you do not see the option to select the BigInsights - Big SQL service, complete the steps in Installing the BigInsights value-add packages.

4. In the Assign Masters page, decide which nodes of your cluster you want to run the

specified components, or accept the default nodes. Follow these guidelines:

o For the Big SQL monitoring and editing tool, make sure that the Data Server

Manager (DSM) is assigned to the same node that is assigned to the Big SQL Head

node.

5. Click Next.

6. In the Assign Slaves and Clients page, accept the defaults, or make specific

assignments for your nodes. Follow these guidelines:

o Select the non-head nodes for the Big SQL Worker components. You must select at

least one node as the worker node.

o Select all nodes for the CLIENT. This puts JSqsh and SFTP clients on the nodes.




7. In the Customize Services page, accept the recommended configurations for the Big

SQL service, or customize the configuration by expanding the configuration files and

modifying the values. Make sure that you have a

valid bigsql_user and bigsql_user_password (see reference screen below) and

user_id (created by the bi_create_users.sh script) in the appropriate fields in

the Advanced bigsql-users-env section.

8.


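9. You can review your selections in the Review page before accepting them. If you want to modify any values, click the Back button. If you are satisfied with your setup, click Deploy.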

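10. In the Install, Start and Test page, the Big SQL service is installed and verified. If you have multiple nodes, you can see the progress on each node. When the installation is complete, either view the errors or warnings by clicking the link, or click Next to see a summary and then the new service added to the list of services.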



If the BigInsights – Big SQL service fails to install, run the remove_value_add_services.sh cleanup script. The following code is an example of the command:





./remove_value_add_services.sh -u admin -p admin -x 8080 -s BIGSQL -r



11. A web application interface for Big SQL monitoring and editing is available to your end-

users to work with Big SQL. You access this monitoring utility from the IBM BigInsights

Home service. If you have not added the BigInsights Home service yet, do that now.

12. Restart the Knox Service. Also start the Knox Demo LDAP service if you have not

configured your own LDAP.

13. Restart the BigInsights Home services.

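14. To run SQL statements from the Big SQL monitoring and editing tool, type the following address in your browser to open the BigInsights Home service:

https://<knox_host>:<knox_port>/<knox_gateway_path>/default/BigInsightsWeb/index.html

Where:

knox_host

The host where Knox is installed and running

knox_port

The port where Knox is listening (by default this is 8443)

knox_gateway_path

The value entered in the gateway.path field in the Knox configuration (by default this is 'gateway')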



If you use the Knox Demo LDAP service, the default credential is:

userid = guest

password = guest-password






Your end users can also use the JSqsh client, which is a component of

the BigInsights - Big SQL service.

15. If the BigInsights - Big SQL service shows as unavailable, there might have been a

problem with post-installation configuration. Run the following commands

as root (or sudo) where the Big SQL monitoring utility (DSM) server is installed:

a. Run the dsmKnoxSetup script:

cd /usr/ibmpacks/bigsql/<version-number>/dsm/1.1/ibm-datasrvrmgr/bin/

./dsmKnoxSetup.sh -knoxHost <knox-host>

where <knox-host> is the node where the Knox gateway service is running.

b. Make sure that you do not stop and restart the Knox gateway service within Ambari. If you do, then run the dsmKnoxSetup script again.

c. Restart the BigInsights Home service so that the Big SQL monitoring utility (DSM) can be accessed from the BigInsights Home interface.

16. For HBase, do the following post-installation steps:

a. For all nodes where HBase is installed, check that the symlinks to hive-serde.jar and hive-common.jar in the hbase/lib directory are valid.

To verify the symlinks are created and valid:

namei /usr/iop/<version-number>/hbase/lib/hive-serde.jar

namei /usr/iop/<version-number>/hbase/lib/hive-common.jar

If they are not valid, do the following steps:

cd /usr/iop/<version-number>/hbase/lib

rm -rf hive-serde.jar

rm -rf hive-common.jar

ln -s /usr/iop/<version-number>/hive/lib/hive-serde.jar hive-serde.jar

ln -s /usr/iop/<version-number>/hive/lib/hive-common.jar hive-common.jar

b. After installing the Big SQL service and fixing the symlinks, restart the HBase service from the Ambari web interface.
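The symlink check and repair in step 16 can be wrapped in a function and run on each HBase node. This is a sketch, not an IBM-supplied script. The function name ensure_jar_link and the IOP_VERSION variable are hypothetical, and IOP_VERSION stands in for the <version-number> placeholder used in the paths above.

```shell
#!/usr/bin/env bash
# Make sure $2/$1 is a symlink that resolves to an existing file. If it is
# missing, dangling, or not a symlink, recreate it pointing at $3/$1.
# $1 = jar name, $2 = hbase lib directory, $3 = hive lib directory
ensure_jar_link() {
  local jar=$1 hbase_lib=$2 hive_lib=$3
  if [ -L "$hbase_lib/$jar" ] && [ -e "$hbase_lib/$jar" ]; then
    echo "ok: $jar"
    return 0
  fi
  rm -f "$hbase_lib/$jar"
  ln -s "$hive_lib/$jar" "$hbase_lib/$jar" && echo "relinked: $jar"
}

# IOP_VERSION is a placeholder for <version-number>; nothing runs until it is set.
if [ -n "${IOP_VERSION:-}" ]; then
  for jar in hive-serde.jar hive-common.jar; do
    ensure_jar_link "$jar" "/usr/iop/$IOP_VERSION/hbase/lib" "/usr/iop/$IOP_VERSION/hive/lib"
  done
fi
```

If any link is reported as relinked, restart the HBase service from the Ambari web interface afterward.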



After you add Big SQL worker nodes, make sure that you stop and then restart the Hive service.

Connecting to Big SQL

You can run Big SQL queries from Java SQL Shell (JSqsh), or from the IBM Data Server

Manager. You can also run queries from a client application, such as IBM Data Studio,

that uses JDBC or ODBC drivers. You must identify a running Big SQL server and

configure either a JDBC or ODBC driver.

For more information about JSqsh, or IBM Data Studio, see the related topics in the

IBM® BigInsights™ Knowledge Center.

Running JSqsh

JSqsh is installed in /usr/ibmpacks/common-utils/current/jsqsh/bin. Change to that directory

and type ./jsqsh to open the JSqsh shell:

cd /usr/ibmpacks/common-utils/current/jsqsh/bin

./jsqsh

You can then run any JSqsh commands from the prompt.

Connection setup

To use the JSqsh command shell, you can use the default connections or define and test a

connection to the Big SQL server.

1. The first time that you open the JSqsh command shell, a configuration wizard is started. When you are at the JSqsh command prompt, type \drivers to determine the available drivers.

a. On the driver selection screen, select the Big SQL instance that you want to run. Note: Big SQL is designated as DB2 in this example:

    Name     Target               Class

-   -------  -------------------  --------------------------------------------

...

2   *db2     IBM Data Server(DB2  com.ibm.db2.jcc.DB2Driver

b. Verify the port, server, and user name. Run \setup and click C to define a

password for the connection. The username must have database administration

privileges, or must be granted those privileges by the Big SQL administrator.

c. Test the connection to the Big SQL server.

d. Save and name this connection.

2. Generally, you can access JSqsh from /usr/ibmpacks/common-utils/current/jsqsh/bin with the following command:

./jsqsh --driver=db2 --user=<username> --password=<user_password>

3. Open the saved configuration wizard any time by typing \setup while in the command interface, or ./jsqsh --setup when you open the command interface.

4. Specify the saved connection name when you start the JSqsh command shell to establish a connection:

./jsqsh name

5. Use the \connect command when you are already inside the JSqsh shell to establish a connection at the JSqsh prompt:

\connect name

Commands and queries

At the JSqsh command prompt, you can run JSqsh commands or database server commands.

JSqsh commands usually begin with a backslash (\) character.

JSqsh commands accept command-line arguments and allow for common shell activities, such

as I/O redirection and pipes.

For example, consider this set of commands:

1> select * from t1

2> where c1 > 10

3> \go --style csv > /tmp/t1.csv




Because the commands do not begin with a backslash character, the first two commands are

assumed to be SQL statements, and are sent to the Big SQL server.

The \go command sends the statements to run on the server. The \go command has a built-in

alias so that you can omit the backslash. Additionally, you can specify a trailing semicolon to

indicate that you want to run a statement, for example:

1> select * from t1

2> where c1 > 10;

The --style option in the \go command indicates that the display shows comma-separated

values (CSV). The \go form is most useful if you provide additional arguments to affect how

the query is run. Changing the display style is an example of this feature.

The redirection operator (>) specifies that the results of the command are sent to a file

called /tmp/t1.csv.

A set of frequently run commands does not require the leading backslash. Any JSqsh command

can be aliased to another name (without a leading backslash, if you choose), by using

the \alias command. For example, if you want to be able to type bye to leave the JSqsh shell,

you establish that word as the alias for the \quit command:

\alias bye='\quit'

You can run a script that contains one or more SQL statements. For example, assume that you

have a file called mySQL.sql. That file contains these statements:

select tabschema, tabname from syscat.tables fetch first 5 rows only;

select tabschema, colname, colno, typename, length from syscat.columns fetch first 10 rows only;

You can start JSqsh and run the script at the same time with this command:

/usr/ibmpacks/common-utils/current/jsqsh/bin/jsqsh bigsql < /home/bigsql/mySQL.sql

The redirection operator tells JSqsh to read the commands from the file located in the /home/bigsql directory and then run the statements within the file.


__________________________________________________________________


Command and query edit

The JSqsh command shell uses the JLine2 library, which allows you to edit previously entered commands and queries. You can use the arrow keys and the command-line editing features to move within and edit the command or query on the current line.

The JLine2 library provides the same key bindings (vi and emacs) as the GNU Readline library.

In addition, it attempts to apply any custom key maps that you created in a

GNU Readline configuration file, (.inputrc) in the local file system $HOME/ directory.

In addition to individual line editing, the JSqsh command shell remembers the 50 most recently

run statements, which you can view by using the \history command:

1> \history

(1) use tpch;

(2) select count(*) from lineitem

Previously run statements are prefixed with a number in parentheses. You use this number to

recall that query by using the JSqsh recall operator (!), for example:

1> !2

1> select count(*) from lineitem

2>

The ! recall operator has the following behavior:

!! Recalls the previously run statement.

!5 Recalls the fifth query from history.

!-2 Recalls the query from two prior runs.

You can also edit queries that span multiple lines by using the \buf-edit command,

which pulls the current query into an external editor, for example:

1> select id, count(*)

2> from t1, t2

3> where t1.c1 = t2.c2

4> \buf-edit

The query is opened in an external editor (/usr/bin/vi by default; you can specify a different editor in the $EDITOR environment variable). When you close the editor, the edited query is entered at the JSqsh command shell prompt.


The JSqsh command shell provides built-in aliases, vi and emacs, for the \buf-edit command. The following commands, for example, open the query in the vi editor:

1> select id, count(*)

2> from t1, t2

3> where t1.c1 = t2.c2

4> vi

Configuration variables

You can use the \set command to list or define values for a number of configuration

variables, for example:

1> \set

If you want to redefine the prompt in the command shell, you run the following command

with the prompt option:

1> \set prompt='foo $lineno> '

foo 1>

Every JSqsh configuration variable has built-in help available:

1> \help prompt

If you want to permanently set a specific variable, you can do so by editing

your $HOME/.jsqsh/sqshrc file and including the appropriate \set command in it.
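Appending a \set line to the startup file by hand can leave duplicates if it is done more than once. The sketch below adds a line only when that exact line is not already present. The function name persist_jsqsh_setting is hypothetical. The default file path is the $HOME/.jsqsh/sqshrc location named above.

```shell
#!/usr/bin/env bash
# Append a line to a JSqsh startup file unless that exact line is already there.
# $1 = line to add (for example a \set command)
# $2 = file (defaults to $HOME/.jsqsh/sqshrc)
persist_jsqsh_setting() {
  local line=$1 file=${2:-$HOME/.jsqsh/sqshrc}
  mkdir -p "$(dirname "$file")" || return 1
  touch "$file" || return 1
  # -x matches whole lines only; -F treats the text literally, which matters
  # because the line contains a backslash and a dollar sign.
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}
```

For example, to keep the prompt setting shown earlier: persist_jsqsh_setting "\\set prompt='foo \$lineno> '"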


__________________________________________________________________


Installing Text Analytics

The Text Analytics service provides powerful text extraction capabilities. You can extract

structured information from unstructured and semi-structured text.

It is recommended that you make sure that the python-paramiko package is installed prior to

installing the Text Analytics service.

yum install python-paramiko

You will be selecting a Master node for Text Analytics, and this node should contain

the python-paramiko package. The master node is the node where Text Analytics Web Tooling

and Text Analytics Runtime are both installed.


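1. Open a browser and access the Ambari server dashboard. The following is the default URL:

http://<server-name>:8080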



2. In the Ambari dashboard, click Actions > Add Service.

3. In the Add Service Wizard, Choose Services, select the BigInsights - Text

Analytics service.

If you do not see the option to select the BigInsights - Text Analytics service, complete the steps in Installing the BigInsights value-add packages.

4. To assign master nodes, select the Text Analytics Master server Node.

5. Click Next. The Assign Slaves and Clients page displays.

6. Assign slave and client components to the hosts on which you want them to run. An

asterisk (*) after a host name indicates the host is assigned a master component.

a. To assign slave nodes and clients, click All on the Clients column.

The client package that is installed contains runtime binaries that are needed to

run Text Analytics. This client needs to be installed on all datanodes that belong to

your cluster.

Client nodes will install only the Text Analytics Runtime artifacts (/usr/ibmpacks/current/text-analytics-runtime). Choose one or more clients. You do not have to choose the Master node as a client since it already installs Text Analytics Runtime.

7. Click Next and select BigInsights - Text Analytics.

8. Expand Advanced ta-database-config and enter the password in the database.password field. Recommended configurations for the service are completed automatically but you can edit these default settings as desired.

By default, the database server is MySQL. There are two options:

o database.create.new = Yes (default)

a. You must enter the password for the database.

b. You must ensure that the default port, 32050, is free. You can change the port to any free port.

c. You can change the database.username, but any changes to the database.hostname are ignored.

o database.create.new = N

a. You must enter the database.hostname, database.port (where the existing database server instance is running), database.user and database.password. Ensure that this user has full access to create a database in the existing database server instance that you specify. If it is a remote MySQL server instance, ensure in particular that the user has all the permissions needed to access the remote instance. Ensure that the server instance is up and running so that the Text Analytics service can be started successfully.

9. Click Next and in the Review screen that opens, click Deploy.


10. After installation is complete, click Next > Complete.

11. After the installation is successful, click Next and Complete.

If the BigInsights - Text Analytics service fails to install, run the remove_value_add_services.sh cleanup script. The following code is an example command:

remove_value_add_services.sh -u admin -p admin -x 8080 -s TEXTANALYTICS -r



12. The Text Analytics directory on all nodes where Text Analytics components are installed is created with world-writable permissions, which are not required. Change the permissions to rwxr-xr-x on all nodes to improve security:

chmod go-w /usr/ibmpacks/text-analytics-runtime

13. Restart the Knox service. If you have not configured LDAP service, start the Knox

Demo LDAP service.

14. Open the BigInsights Home and launch Text Analytics at the following address:







x.html

Where:

knox_host


knox_port


knox_gateway_path


'gateway')



If you use the Knox Demo LDAP service and have not modified the default

configuration, the default credential to log into the BigInsights - Home service is:

userid = guest

password = guest-password

Note: If you do not see the Text Analytics service from BigInsights Home, restart

the BigInsights Home service in the Ambari interface.

At this point, BigInsights Home should show all three BigInsights services.


Installing Big R

To extend the power of the Open Platform for Apache Hadoop, install and deploy the Big R

service, which is the IBM R extension, to the Hadoop-based platform, IBM Open Platform with

Apache Hadoop.

1. Open a browser and access the Ambari server dashboard. The following is the

default URL.


The default user name is admin, and the default password is admin.

2. In the Ambari web interface, click Actions > Add Service.




3. Optional: If you do not already have the R Service installed, you can add it now. Big R

service depends on the R statistics environment and the following three R packages:

base64enc, rJava and data.table. If these have been installed on all nodes in the cluster,

this step can be skipped. Otherwise, you can choose to install the above dependencies

with your own approach, or, if your cluster has external network access, you can use the

following R service to install these dependencies.

a. In the Add Service Wizard, Choose Services, select the R service and

click Next.

b. In the Assign Slaves and Clients page, for client nodes, mark all of the nodes as the R Client node and click Next.

c. In the Customize Services page, accept the recommended configurations for the

R service, or customize the configuration by expanding the configuration files and

modifying the values.

Make sure that you read the R license, and indicate acceptance by typing Y in the field accept.R.Licenses. The value is case sensitive, so make sure you type an uppercase letter. The R Licenses field contains a URL where you can find the licensing information.

In the user.R.packages field, you must ensure that the following required packages are listed: base64enc, rJava, and data.table.

In the user.R.repository field, enter the preferred repository. The default

is epel-release, which uses the EPEL repository, but you can also type a

different repository by entering a URL, such

as http://repos.domain.com/repos.

Note: When installing R from the EPEL repository, you might have the

following GPG key error: GPG key retrieval failed: [Errno 14] Could not

open/read

If you receive this error, you can import the key with the following rpm

command, then retry: rpm --import

d. Click Next and in the Review Page that opens, click Deploy.




e. If R deployment fails, review and correct the errors before reattempting the installation. Remove the R service from Ambari and delete the RSERV server by using the following command:

curl -u [uid]:[pwd] -H "X-Requested-By: ambari" -X DELETE http://[hostname]:8080/api/v1/clusters/[cluster name]/services/RSERV

where

[uid]:[pwd]

The Ambari administrator user ID and password.

[hostname]

The correct host name for your environment.

8080

The port number 8080 is the default. Modify this according to your environment.

[cluster name]

The correct name of your cluster.

The following command is an example:

curl -u admin:admin -H "X-Requested-By: ambari" -X DELETE http://my_host.localdomain:8080/api/v1/clusters/my_cluster/services/RSERV

f. In the Summary page, click Complete. When you return to the Ambari Dashboard Services tab, the R service is now listed.

4. In the Add Service Wizard, Choose Services page, select the Big R service and

click Next.

5. In the Assign Masters page, decide which nodes of your cluster you want to run the

specified components, or accept the default nodes. You must assign the Big R Connector

to the same node that is running the MapReduce2 Client service, which is a required

service that runs MapReduce2 Hadoop jobs. Click Next.




6. In the Assign Slaves and Clients page, accept the defaults, or make specific

assignments for your nodes. For client nodes, mark all of the nodes as the Big R

Client node and click Next.

7. In the Customize Services page, default Big R environment variables are set in

the bigr-env template field. Review these entries for accuracy and completeness. Make

any necessary changes and click Next



click Deploy.

9. In the Install, Start and Test page, the Big R service is installed and verified. If you

have multiple nodes, you can see the progress on each node. When the installation is



If the BigInsights – Big R service fails to install, run

the remove_value_add_services.sh cleanup script. The following code is an example

of the command:


./remove_value_add_services.sh -u admin -p admin -x 8080 -s BIGR -r



10. Advise your end users that the service is deployed and ready for their use by having

them launch the Value Added packages welcome page.

11. In the Summary page, click Complete.
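If the R service deployment fails, the cleanup call is an Ambari REST request whose URL is built from the host, port, cluster name, and service name. The following is a minimal sketch, not part of the product, that assembles that URL and prints the curl command for review instead of running it. The function names are hypothetical, and the host and cluster values are the placeholder values from the example command. Passing only a user name to curl -u makes curl prompt for the password, which keeps it out of the shell history.

```shell
# Hypothetical helpers: build the Ambari REST URL for a service and print
# the DELETE command so that it can be reviewed before it is run.
ambari_service_url() {
  local host="$1" port="$2" cluster="$3" service="$4"
  printf 'http://%s:%s/api/v1/clusters/%s/services/%s\n' \
    "$host" "$port" "$cluster" "$service"
}

print_delete_command() {
  local user="$1" url="$2"
  # With "-u user" (no password), curl asks for the password interactively.
  printf 'curl -u %s -H "X-Requested-By: ambari" -X DELETE %s\n' "$user" "$url"
}

# Placeholder values taken from the example command; replace with your own.
url="$(ambari_service_url my_host.localdomain 8080 my_cluster RSERV)"
print_delete_command admin "$url"
```

This sketch assumes that the cluster name contains no spaces or other characters that need URL encoding.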

Running BigInsights - Big R as the YARN application master

You must set the Linux Container Executor as the default executor in the yarn-site.xml file so that the owner changes to the bigr server user (the application process owner).




1. In the Ambari web interface, from the YARN service Configs page, scroll down to find the Advanced yarn-site section and expand it.

2. Change the yarn.nodemanager.container-executor.class property to have the

following value:

org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor

3. In the Custom yarn-site section, click Add Property to add the following properties:

Property name                                                           Value
yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user    yarn
yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users   false

4. Make sure that the property yarn.nodemanager.linux-container-executor.group has the value hadoop.

5. Click Save in the Configs page to save your configuration changes.

6. Make sure that the directories on ALL the nodes set in the Node Manager section for the properties yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs are owned by yarn:hadoop.

Run the following commands on ALL nodes:

$ echo "yarn.nodemanager.linux-container-executor.group=hadoop" >> /etc/hadoop/conf/container-executor.cfg

$ echo "banned.users=hdfs,yarn,mapred,bin" >>

$ echo "min.user.id=1000" >>

$ chown root:hadoop /etc/hadoop/conf/container-executor.cfg

$ chown root:hadoop /usr/iop/4.0.0.0/hadoop-yarn/bin/container-executor

$ chmod 6050 /usr/iop/4.0.0.0/hadoop-yarn/bin/container-executor

7. Make sure that the user ID with which the Big R connection is made (by using bigr.connect) is present on ALL nodes, and that the user belongs to the groups users and hadoop. If the user does not exist, run the following command as the root user on ALL nodes:

$ useradd -G users,hadoop someuser

8. In the SystemML configuration file, /usr/ibmpacks/current/bigr/machine-learning/SystemML-config.xml, set the following property:

dml.yarn.appmaster

value: true

9. You can optionally update the MapReduce configuration to get better performance:

a. In the Ambari web interface, from the MapReduce2 service Configs page, scroll down to find the Advanced map-red site section and expand it.

b. Update the property mapreduce.task.io.sort.mb to 384. This should be approximately three times the HDFS block size.

Note: If the property is not available, add it to the Custom map-red site.

10. Click Save in the Configs page to save your configuration changes.
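The container-executor.cfg commands in this procedure append with >>, so running them a second time leaves duplicate entries in the file. The following is a minimal sketch of an idempotent alternative. The set_cfg helper is hypothetical, and writing all three keys to the same file is an assumption based on the first echo command. By default the sketch writes to a temporary file; set CFG=/etc/hadoop/conf/container-executor.cfg and run it as root, before the chown commands, to apply it for real.

```shell
# set_cfg KEY VALUE FILE (hypothetical helper): replace an existing
# "KEY=" line in FILE, or append "KEY=VALUE" when the key is absent.
# Assumes VALUE contains no '|', '&', or backslash characters.
set_cfg() {
  local key="$1" value="$2" file="$3"
  touch "$file"
  if grep -q "^${key}=" "$file"; then
    # Write through a temporary copy, then copy the content back so the
    # original file keeps its owner and permissions.
    sed "s|^${key}=.*|${key}=${value}|" "$file" > "${file}.tmp" &&
      cat "${file}.tmp" > "$file" &&
      rm -f "${file}.tmp"
  else
    printf '%s=%s\n' "$key" "$value" >> "$file"
  fi
}

# Defaults to a scratch file so that the sketch is safe to try out.
CFG="${CFG:-$(mktemp)}"

# Running the same settings twice still leaves one line per key.
for pass in 1 2; do
  set_cfg yarn.nodemanager.linux-container-executor.group hadoop "$CFG"
  set_cfg banned.users hdfs,yarn,mapred,bin "$CFG"
  set_cfg min.user.id 1000 "$CFG"
done

cat "$CFG"
```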

For information about using BigInsights - Big R, see Analyzing data with IBM BigInsights Big R:

https://www-01.ibm.com/support/knowledgecenter/SSPT3X_4.0.0/com.ibm.swg.im.infosphere.biginsights.analyze.doc/doc/t_analyze_data_bigr.html?lang=en-us#t_analyze_data_bigR

IBM BigInsights Online Tutorials

Learn how to use BigInsights™ by completing online tutorials, which use real data and teach you to run applications. Complete the tutorials in any order.

https://www-01.ibm.com/support/knowledgecenter/SSPT3X_4.0.0/com.ibm.swg.im.infosphere.biginsights.tut.doc/doc/tut_Introduction.html

You can find additional information, tutorials, and articles about BigInsights, Hadoop, and related components at Hadoop Dev.

http://developer.ibm.com/hadoop/docs/tutorials/


Security Configuration and Administration

IBM® Open Platform with Apache Hadoop security includes perimeter security, authentication, and authorization. Authenticate, authorize, and protect your data by using the steps and recommendations listed in this section. This document covers security for the Isilon HDFS storage, the resources that you use in YARN, and the cluster infrastructure.

Setting up HTTPS for Ambari

You can limit access to the Ambari Web interface to HTTPS connections.

Before you begin

The Ambari server must not be running when you are performing this task.

You must provide a certificate. You can use a self-signed certificate for initial trials, but these

certificates are not suitable for production environments.

The certificate you use must be PEM-encoded, not DER-encoded. If you attempt to use a DER-

encoded certificate, the following error appears.

unable to load certificate

140109766494024:error:0906D06C:PEM routines:PEM_read_bio:no start line:pem_lib.c

:698:Expecting: TRUSTED CERTIFICATE

You can use the following command to convert a DER-encoded certificate to a PEM-encoded certificate. cert.crt is the DER-encoded certificate, and cert.pem is the resulting PEM-encoded certificate.

openssl x509 -inform der -in cert.crt -outform pem -out cert.pem
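Before running the conversion, it can help to check whether a certificate file is already PEM-encoded. A PEM certificate is a text file that contains a BEGIN CERTIFICATE (or BEGIN TRUSTED CERTIFICATE) header line, whereas a DER file is binary. The following is a minimal sketch with a hypothetical is_pem helper. The two sample files are fabricated stand-ins, not real certificates, and exist only to show both outcomes.

```shell
# is_pem FILE (hypothetical helper): succeed when FILE contains a PEM
# certificate header line.
is_pem() {
  grep -Eq -e '-----BEGIN (TRUSTED )?CERTIFICATE-----' "$1"
}

# Fabricated sample inputs for illustration only.
sample_pem="$(mktemp)"
sample_der="$(mktemp)"
printf '%s\n' '-----BEGIN CERTIFICATE-----' 'placeholder-body' \
  '-----END CERTIFICATE-----' > "$sample_pem"
printf '\060\202\001\012' > "$sample_der"   # binary bytes, no PEM header

for f in "$sample_pem" "$sample_der"; do
  if is_pem "$f"; then
    echo "$f: PEM-encoded"
  else
    echo "$f: not PEM-encoded, convert it before using it with Ambari"
  fi
done
```

The check only looks for the header line. It does not validate the certificate itself.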

Procedure

1. Log in to the Ambari server host.

Note: Make sure Ambari server is not running.

2. Locate the certificate that you want to use.

You can use the following example to create a temporary self-signed certificate. Replace $wserver with the Ambari server host name.

openssl genrsa -out $wserver.key 2048

openssl req -new -key $wserver.key -out $wserver.csr

openssl x509 -req -days 365 -in $wserver.csr -signkey $wserver.key -out $wserver.crt

3. Run the following command and answer the prompts that appear.

ambari-server setup-security

a. At the Security setup options prompt, type 1.

b. When asked whether you want to configure HTTPS, type y.

c. Select the port that you want to use for SSL. The default is 8443.

________________________________________________________________

Note: Make sure that you choose a port that is not being used by any services on

the machine. For example, the default port for Knox is also 8443.

__________________________________________________________

d. Provide the path to your certificate and your private key.

e. Provide the password for the private key.

Configuring SSL support for HBase REST gateway with Knox

By using Knox, you can make the components of your Hadoop cluster, such as HBase, Hive, and Oozie, securely accessible to a large number of users. Follow these steps to use SSL for the connection between Knox and a Hadoop component such as HBase.

Many of the services in IBM® Open Platform with Apache Hadoop use Knox to allow more users

to make use of the data and queries in Hadoop without compromising on security. Only a

handful of administrators are allowed to connect directly to their Hadoop clusters, while end-

users are routed through Knox.


Knox acts as a reverse proxy between end users and Hadoop, so each request makes two connection hops between the client and the Hadoop cluster. The first connection is between the client and Knox. Knox comes with SSL support for this connection. The second connection is between Knox and a given Hadoop component, such as HBase, which requires some configuration.

Procedure

1. You must have a certificate, either self-signed or one signed by a Certificate Authority

(CA).

Trusted SSL certificates are issued by Certificate Authorities (CAs). A self-signed certificate is signed by the same entity whose identity it certifies, that is, with its own private key.

The examples use a self-signed certificate, but this might not be suitable for your

production environment.

a. Configure SSL on the HBase REST server. This example uses a self-signed certificate. An SSL certificate issued by a Certificate Authority (CA) makes the configuration steps even easier.

i. Log in to the HBase REST server. As the HBase user (su hbase), create a keystore to hold the SSL certificate.

export HOST_NAME=`hostname`

keytool -genkey -keyalg RSA -alias selfsigned -keystore hbase.jks \
  -storepass password -validity 360 -keysize 2048 \
  -dname "CN=$HOST_NAME, OU=Eng, O=MyCompany, L=Central City, ST=CA, C=US" \
  -keypass password

Make sure the common name portion of the certificate matches the host where the certificate will be deployed. For example, when the host that runs HBase is sandbox.MyCompany.com, the self-signed SSL certificate in the example uses this value as the CN:

Owner: CN=sandbox.MyCompany.com, OU=Eng, O=MC, L=CC, ST=CA, C=US

Issuer: CN=sandbox.MyCompany.com, OU=Eng, O=MC, L=CC, ST=CA, C=US

You can now use this self-signed certificate with HBase.

ii. Skip this step if you use a Certificate Authority signed certificate. Self-signed

certificates are rejected during SSL handshake. If you use a self-signed

certificate, export the certificate and put it in the cacerts file of the JRE that

is used by Knox. On the machine that is running HBase, export the HBase

SSL certificate into a file hbase.crt:

keytool -exportcert -file hbase.crt -keystore hbase.jks -alias selfsigned -storepass password

iii. Copy the hbase.crt file to the node that is running Knox. Then run the following command:

keytool -import -file hbase.crt -keystore /<your_jdk_path>/jre/lib/security/cacerts -storepass changeit -alias selfsigned

Make sure the path to the cacerts file points to the cacerts of the JDK that is

used to run the Knox gateway. The default cacerts password is changeit.

2. Configure the HBase REST Server for SSL.

a. Use the Ambari web interface to update the Hadoop configuration properties:

<property>

<name>hbase.rest.ssl.enabled</name>

<value>true</value>

</property>

<property>

<name>hbase.rest.ssl.keystore.store</name>

<value>/path/to/keystore/created/hbase.jks</value>

</property>




<property>

<name>hbase.rest.ssl.keystore.password</name>

<value>password</value>

</property>

<property>

<name>hbase.rest.ssl.keystore.keypassword</name>

<value>password</value>

</property>

b. Click Save in the Ambari configuration page.

c. Restart the HBase REST server by clicking the HBase service in the Ambari web interface. You can also type the following command in the Linux terminal window:

sudo /usr/iop/current/hbase-client/bin/hbase-daemon.sh stop rest && sudo /usr/iop/current/hbase-client/bin/hbase-daemon.sh start rest -p 8091

3. Verify the HBase REST server over SSL. Replace localhost with the hostname of your

HBase REST server.

curl -H "Accept: application/json" -k https://localhost:8091/

The command should display the tables in your HBase environment:

{"table":[{"name":"ambarismoketest"}]}

4. Configure Knox to point to HBase over SSL and then restart Knox.

Change the URL of the HBase service for your Knox topology to HTTPS. Make sure that the host matches the host of the HBase REST server.

<service>

<role>WEBHBASE</role>

<url>https://sandbox.MyCompany.com:8091</url>

</service>
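After editing the topology, a quick check can confirm that the WEBHBASE service really points to an HTTPS URL. The following is a minimal sketch with a hypothetical webhbase_url helper. It assumes that the role and url elements are laid out on separate lines, with the role first, as in the example above, and it uses a temporary sample file rather than a real Knox topology file.

```shell
# webhbase_url FILE (hypothetical helper): print the <url> value of the
# service whose <role> is WEBHBASE.
webhbase_url() {
  awk '
    /<role>WEBHBASE<\/role>/ { in_service = 1 }
    in_service && /<url>/ {
      sub(/.*<url>/, "")      # drop everything up to the opening tag
      sub(/<\/url>.*/, "")    # drop the closing tag and what follows
      print
      exit
    }
  ' "$1"
}

# Sample topology fragment copied from the example; not a full topology.
topology="$(mktemp)"
cat > "$topology" <<'EOF'
<service>
  <role>WEBHBASE</role>
  <url>https://sandbox.MyCompany.com:8091</url>
</service>
EOF

url="$(webhbase_url "$topology")"
case "$url" in
  https://*) echo "WEBHBASE uses HTTPS: $url" ;;
  *)         echo "WEBHBASE is not using HTTPS: $url" ;;
esac
```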




Overview of Kerberos

To ensure secure access in Hadoop, you need strong authentication and a reliable way to establish the identity of a user.

When users successfully identify themselves, then that identity can be propagated throughout

the Hadoop cluster. Those users can access resources or work with applications on the cluster.

The Hadoop cluster resources, such as Hosts and Services, also must authenticate with each

other to avoid potential malicious systems or daemons that pretend to be trusted components

of the cluster to gain access to data.

Hadoop uses Kerberos as the basis for strong authentication and identity propagation for both

users and services. Kerberos is a third party authentication mechanism, in which users and

services rely on a third party - the Kerberos server - to authenticate each to the other. The

Kerberos server itself is known as the Key Distribution Center (KDC). The KDC has three

parts:

Principals

A database of the users and services that the server knows about and their respective

Kerberos passwords.

Authentication Server (AS)

An AS performs the initial authentication and issues a Ticket Granting Ticket (TGT).

Ticket Granting Server (TGS)

A TGS issues subsequent service tickets based on the initial TGT.

The basic flow is illustrated by the following steps:

1. A user principal requests authentication from the AS.

2. The AS returns a TGT that is encrypted by using the Kerberos password of the user

principal. This password is known only to the user principal and the AS.




3. The user principal decrypts the TGT locally by using its Kerberos password, and

from that point forward, until the ticket expires, the user principal can use the TGT

to get service tickets from the TGS.

4. Service tickets are what allow a principal to access various services.

Because cluster resources (hosts or services) cannot provide a password each time to decrypt

the TGT, they use a special file, called a keytab. The keytab contains the authentication

credentials of the resource principal. The set of hosts, users, and services over which the

Kerberos server has control is called a realm.

Each service and sub-service in Hadoop must have its own principal. A principal name in a given

realm consists of a primary name and an instance name. The instance name is the fully

qualified domain name (FQDN) of the host that runs that service.

_________________________________________________________________________

Note: The HDFS service is handled entirely by Isilon, so make sure that the fully qualified Isilon Hadoop zone name is used as the instance name for the HDFS service.
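To make the naming rule concrete, a service principal has the form primary/instance@REALM. The following is a minimal sketch that splits such a name into its three parts with shell parameter expansion. The function names are hypothetical, and the sample principal uses the placeholder zone and realm names from this guide. The sketch assumes that the input always includes the @REALM part.

```shell
# Hypothetical helpers that split "primary/instance@REALM".
principal_realm() {
  printf '%s\n' "${1##*@}"        # text after the last '@'
}

principal_primary() {
  local name="${1%@*}"            # strip "@REALM"
  printf '%s\n' "${name%%/*}"     # text before the first '/'
}

principal_instance() {
  local name="${1%@*}"
  case "$name" in
    */*) printf '%s\n' "${name#*/}" ;;  # text after the first '/'
    *)   printf '\n' ;;                 # user principals have no instance
  esac
}

# Placeholder principal for the HDFS service on an Isilon Hadoop zone.
p="hdfs/isilonzone.domain.name@YOUR_REALM.COM"
echo "primary:  $(principal_primary "$p")"
echo "instance: $(principal_instance "$p")"
echo "realm:    $(principal_realm "$p")"
```

For the HDFS service, the instance part is the value to compare against the fully qualified Isilon Hadoop zone name.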

As services do not log in with a password to acquire their tickets, the authentication credentials

of their principal are stored in a keytab file. This file is extracted from the Kerberos database

and stored locally in a secured directory with the service principal on the service component

host.

In addition to the Hadoop Service Principals, Ambari also requires a set of Ambari

Principals to perform service checks and alert health checks. Keytab files for the Ambari, or

headless, principals reside on each cluster host, just as keytab files for the service principals.




Terminology

The following terms are useful in understanding Kerberos:

Key Distribution Center

The trusted source for authentication in a Kerberos-enabled environment.

Kerberos KDC Server

The server that serves as the KDC.

Kerberos Client

Any machine in the cluster that authenticates against the KDC.

Principal

The unique name of a user or service that authenticates against the KDC.

Keytab

A file that includes one or more principals and their keys.

Realm

The Kerberos network that includes a KDC and a number of Clients.

KDC Admin Account

An administrative account that is used by Ambari to create principals and generate

keytabs in the KDC.

Kerberos Descriptor

A JSON-formatted text file that contains information Ambari needs to enable or disable Kerberos for a stack and its services. This file must be named kerberos.json. It must be in the root directory of the relevant stack or service. Kerberos Descriptors are meant to be hierarchical such that details in the stack-level descriptor can be overwritten or updated by details in the service-level descriptors.




Enabling Kerberos for IBM Open Platform

You begin setting up Kerberos by enabling it from the Ambari web interface. To use Kerberos

authentication in IBM® Open Platform with Apache Hadoop, you must generate principals and

keytabs for each of the services on each node where you installed the product.

Before you begin

1. You must have the latest supported Red Hat Enterprise Linux (RHEL) packages to enable

and use Kerberos – krb5-server, krb5-workstation and krb5-libs.

2. Deploy the Java Cryptography Extension (JCE) security policy files on the Ambari server

and on all hosts in the cluster. Depending on the JDK that you selected during the

installation of IBM Open Platform with Apache Hadoop the JCE policy files might already

be downloaded and installed onto the server.

a. Stop the Ambari server:

ambari-server stop

b. Make sure you have access to the policy file archive.

c. From the Ambari server and on each host in the cluster, add the unlimited security

policy JCE jars to $JAVA_HOME/jre/lib/security/. For example, run the following

command to extract the policy jars into the JDK that is installed on your host:

unzip -o -j -q UnlimitedJCEPolicyJDK7.zip -d /usr/jdk64/jdk1.version/jre/lib/security/

d. Restart the Ambari server.


3. Ambari automatically creates principals in the KDC and generates keytabs. Therefore,

you must have the Kerberos Admin Account credentials available when running the

Kerberos wizard.

4. Use an existing Active Directory installation with Kerberos.

a. Make sure that the Ambari server and cluster hosts have network access to, and can resolve the DNS names of, the Domain Controllers.

b. Configure the LDAP or Active Directory authentication connectivity.

c. Make sure that the Active Directory User container for principals is created and is available. For example, "OU=Hadoop,OU=People,dc=apache,dc=org"

Manually generating keytabs for Kerberos authentication

You use the kadmin.local command-line interface to generate keytabs for IBM® Open Platform with Apache Hadoop services. All Kerberos-enabled services need a keytab file to authenticate to the Key Distribution Center (KDC).

You can also use the kadmin command-line interface, which is available on Kerberos client nodes and KDC server nodes. The kadmin interface connects to the Kerberos administration server that the kadmind service runs, whereas the kadmin.local command-line interface directly accesses the KDC database.

To generate keytabs for services that contain the HTTP principal, you use the ktadd command with the -norandkey option in the kadmin.local command-line interface. This option tells ktadd not to randomize the keys, so the keys and their version numbers remain unchanged.

____________________________________________________________________

Note: If your version of Kerberos does not support this option, or if you cannot use the

kadmin.local shell, then create your keytabs with the ktadd command and use

the ktutil command to merge keytabs that you create.

You must generate keytabs for the following services to configure them with Kerberos HTTP

authentication. If two or more of these services run on the same host, then all running services

on that host must use the same HTTP principal and key for their HTTP endpoints. Hadoop, HBase, HttpFS, and Oozie require an HTTP principal.

Procedure

1. From the Linux shell, as the root user start the kadmin.local or kadmin command-line

interface.


Important: If you have root access to your KDC machine, log in to the KDC machine as root and use the kadmin.local command-line interface to generate principals and keytabs. If you do not have root access to the KDC machine, use the kadmin command-line interface on any Kerberos configured machine to generate principals and keytabs.

kadmin.local

2. Create the principal and keytab for each of the IBM Open Platform with Apache

Hadoop services. For each service, you must enter

the domain.name and YOUR_REALM.COM parameters.

domain.name - The fully qualified domain name of the cluster node where the server

component is running. The domain.name must be lowercase characters.

YOUR_REALM.COM - The name of the Kerberos realm where you are installing IBM Open Platform with Apache Hadoop. Kerberos realm names are typically in all uppercase characters to differentiate them from any similar DNS domain that the realm is associated with.
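The case rules for the two parameters are easy to get wrong when principals are typed by hand. The following is a minimal sketch that lowercases the host name, uppercases the realm, and prints the addprinc and ktadd lines for one service so that they can be pasted into kadmin.local. The function names and the sample host and realm are hypothetical. The service name is left untouched because primaries such as HTTP are case sensitive.

```shell
# make_principal SERVICE FQDN REALM (hypothetical helper): build
# "service/fqdn@REALM" with a lowercase FQDN and an uppercase realm.
make_principal() {
  local service="$1" fqdn realm
  fqdn="$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')"
  realm="$(printf '%s' "$3" | tr '[:lower:]' '[:upper:]')"
  printf '%s/%s@%s\n' "$service" "$fqdn" "$realm"
}

# Print, but do not run, the kadmin commands for one service principal.
print_kadmin_commands() {
  local principal
  principal="$(make_principal "$1" "$2" "$3")"
  printf 'addprinc -randkey %s\n' "$principal"
  printf 'ktadd -k %s.keytab %s\n' "$1" "$principal"
}

# Sample values only; replace them with your own host and realm.
print_kadmin_commands flume Node1.Example.com example.com
```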

Flume

On every Kerberos configured node that runs a Flume agent that writes to HDFS, generate a keytab file that contains entries for the Flume agent principal.

a. On each host where a Flume agent runs, create the Flume principal and keytab file, and then copy the keytab to the respective host under ../conf/security/keytabs/flume.keytab.

addprinc -randkey flume/domain.name@YOUR_REALM.COM

ktadd -k flume.keytab flume/domain.name@YOUR_REALM.COM

b. Check to ensure that Flume agent principal information was added to the keytab file.

klist -e -k -t flume.keytab

c. Ensure that the flume.keytab file is only readable by the Flume user.

sudo chown flume:biadmin ../conf/security/keytabs/flume.keytab

sudo chmod 400 ../conf/security/keytabs/flume.keytab

d. To enable the Flume agent to store data on a secure HDFS, add the following parameters to the Flume configuration file, flume-conf.properties.template, which exists in the ../flume/conf directory. You can rename this configuration file to generate your own configuration file for Flume.

agentName.sinks.sinkName.type = HDFS

agentName.sinks.sinkName.hdfs.kerberosPrincipal = flume/domain.name@YOUR_REALM.COM

agentName.sinks.sinkName.hdfs.kerberosKeytab = keytab_path

agentName

Name of the Flume agent that you are configuring for Kerberos authentication.

sinkName

Name of the HDFS sink that you are configuring. The sink type must be HDFS.

keytab_path

Path to the Flume keytab. The default path is ../conf/security/keytabs/flume.keytab.

When you start the Flume agent, specify the --conf-file option to point to the Flume configuration file that you modified. For example:

$FLUME_HOME/bin/flume-ng agent --conf-file flume-conf.properties.template --name myAgentName -Dflume.root.logger=INFO,console

Hadoop

On every Kerberos configured node that runs a Hadoop server, generate a keytab file for HDFS, MapReduce, and HTTP services. The HDFS keytab file must contain entries for the HDFS principal and the HTTP principal. The MapReduce keytab file must contain entries for the MapReduce principal and the HTTP principal. Both Hadoop and HBase use the HTTP keytab file. On each node, the HTTP principal must be the same in all keytab files.

e. Run the following commands on every host in your cluster that runs a Hadoop server or an HBase server.

addprinc -randkey HTTP/domain.name@YOUR_REALM.COM

ktadd -norandkey -k http.domain.name.keytab HTTP/domain.name@YOUR_REALM.COM

f. Run the following commands on every host in your cluster where Hadoop servers run. Create principals and keytabs for HDFS services including the NameNode, Secondary NameNode, and DataNodes. In all cases the instance name will point to the FQDN of the Isilon Hadoop Zone, e.g. hdfs/Isilon-[email protected]. If you plan to enable high availability with the Quorum Journal Manager (QJM), create principals and keytabs for JournalNodes.

addprinc -randkey hdfs/domain.name@YOUR_REALM.COM

Tip: You can add keytabs for NFS high availability.

i. Add the following principals.

addprinc -randkey hdfs/isilon.zonename@YOUR_REALM.COM

addprinc -randkey HTTP/virtual.hostname@YOUR_REALM.COM

ii. Add the NFS principals and key to every HA node.

ktadd -norandkey -k hdfs.domain.name.keytab hdfs/isilonzone.domain.name@YOUR_REALM.COM

ktadd -norandkey -k http.domain.name.keytab HTTP/

ktadd -norandkey -k hdfs.isilonzone.domain.name.keytab HTTP/

ktadd -norandkey -k hdfs.domain.name.keytab hdfs/isilonzone.domain.name@YOUR_REALM.COM HTTP/isilonzone.domain.name@YOUR_REALM.COM

Check to ensure that the HDFS and HTTP principal information was added to the keytab file.

klist -e -k -t hdfs.isilonzone.domain.name.keytab

g. Run the following commands on every host in your cluster where Hadoop servers run, including the JobTracker and TaskTracker.

addprinc -randkey mapred/domain.name@YOUR_REALM.COM

ktadd -norandkey -k mapred.domain.name.keytab mapred/domain.name@YOUR_REALM.COM

Check to ensure that MapReduce and HTTP principal information was added to the keytab file.

klist -e -k -t mapred.domain.name.keytab

HBase

h. On every Kerberos configured node that runs HBase, including the primary and secondary servers, generate a keytab file that contains entries for the HBase principal.

addprinc -randkey hbase/domain.name@YOUR_REALM.COM

ktadd -k hbase.domain.name.keytab hbase/domain.name@YOUR_REALM.COM

i. Check to ensure that HBase principal information was added to the keytab file.

klist -e -k -t hbase.domain.name.keytab

Hive

j. On every Kerberos-configured node that runs a Hive JDBC server, generate a Hive keytab file that contains entries for the Hive principal.

addprinc -randkey hive/domain.name@YOUR_REALM.COM

ktadd -k hive.domain.name.keytab hive/domain.name@YOUR_REALM.COM

k. Check to ensure that the Hive principal information was added to the keytab file.

klist -e -k -t hive.domain.name.keytab

HttpFS

l. On every Kerberos-configured node that runs the HttpFS server, generate a keytab file that contains entries for the HttpFS principal and an HTTP principal.

addprinc -randkey httpfs/isilonzone.domain.name@YOUR_REALM.COM

addprinc -randkey HTTP/isilonzone.domain.name@YOUR_REALM.COM

ktadd -norandkey -k httpfs.domain.name.keytab httpfs/isilonzone.domain.name@YOUR_REALM.COM HTTP/isilonzone.domain.name@YOUR_REALM.COM

Oozie

m. On every Kerberos-configured node that runs Oozie, generate a keytab file that contains entries for the Oozie principal and an HTTP principal.

addprinc -randkey oozie/domain.name@YOUR_REALM.COM

addprinc -randkey HTTP/domain.name@YOUR_REALM.COM

ktadd -norandkey -k oozie.domain.name.keytab oozie/domain.name@YOUR_REALM.COM HTTP/domain.name@YOUR_REALM.COM

n. Check to ensure that the Oozie and HTTP principal information was added to the keytab file.

klist -e -k -t oozie.domain.name.keytab

ZooKeeper

o. On every Kerberos-configured node that runs ZooKeeper, generate a keytab file that contains entries for the ZooKeeper principal.

addprinc -randkey zookeeper/domain.name@YOUR_REALM.COM

ktadd -k zookeeper.domain.name.keytab zookeeper/domain.name@YOUR_REALM.COM

p. Check to ensure that the ZooKeeper principal information was added to the keytab file.

klist -e -k -t zookeeper.domain.name.keytab
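The per-service steps above follow one pattern: create a principal with a random key, export it to a keytab, then list the keytab. As a minimal sketch of that pattern, the helper below only prints the commands for one service so they can be reviewed before being typed into kadmin. The function name print_keytab_cmds and its three arguments are hypothetical and not part of the original kit; the keytab naming simply copies the service.host.keytab convention used above.

```shell
#!/bin/sh
# Hypothetical helper (an illustration, not part of the starter kit).
# Prints, but does not run, the commands for one service principal.
# Arguments: service name, host FQDN, Kerberos realm.
print_keytab_cmds() {
  svc=$1
  fqdn=$2
  realm=$3
  principal="${svc}/${fqdn}@${realm}"
  keytab="${svc}.${fqdn}.keytab"
  # kadmin commands: create the principal, then export its key to a keytab
  echo "addprinc -randkey ${principal}"
  echo "ktadd -k ${keytab} ${principal}"
  # shell command: list keytab entries with encryption types and timestamps
  echo "klist -e -k -t ${keytab}"
}

print_keytab_cmds hbase domain.name YOUR_REALM.COM
```

The sketch uses the plain ktadd -k form shown for HBase, Hive, and ZooKeeper. Where the steps above use -norandkey, keep in mind that MIT Kerberos accepts that option only in kadmin.local, that is, when run directly on the KDC host.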

Setting up Active Directory or LDAP authentication in Ambari

Lightweight Directory Access Protocol (LDAP) is a protocol that is used to read from and write to directory services such as the Active Directory database. By default, Ambari uses an internal database as the user store for authentication and authorization. You can instead configure LDAP or Active Directory (AD) external authentication.

Before you begin

An LDAP client must be installed on the Ambari server host.

The Ambari server must not be running when you are performing this task.

The following table describes the properties and values that are required to set up LDAP

authentication.

Table 1. Ambari server LDAP properties

Property                                  Values                     Description
authentication.ldap.primaryUrl            server:port                The hostname and port for the LDAP or AD server. For example, my.ldap.server:389.
authentication.ldap.secondaryUrl          server:port                The hostname and port for the secondary LDAP or AD server. For example, my.secondary.ldap.server:389. This value is optional.
authentication.ldap.useSSL                true or false              If true, use SSL when connecting to the LDAP or AD server.
authentication.ldap.usernameAttribute     [LDAP attribute]           The attribute for username. For example, uid.
authentication.ldap.baseDn                [Distinguished Name]       The root Distinguished Name to search in the directory for users. For example, ou=people,dc=hadoop,dc=apache,dc=org.
authentication.ldap.bindAnonymously       true or false              If true, bind to the LDAP or AD server anonymously.
authentication.ldap.managerDn             [Full Distinguished Name]  If Bind anonymous is set to false, the Distinguished Name ("DN") for the manager. For example, uid=hdfs,ou=people,dc=hadoop,dc=apache,dc=org.
authentication.ldap.managerPassword       [password]                 If Bind anonymous is set to false, the password for the manager.
authentication.ldap.userObjectClass       [LDAP Object Class]        The object class that is used for users. For example, organizationalPerson.
authentication.ldap.groupObjectClass      [LDAP Object Class]        The object class that is used for groups. For example, groupOfUniqueNames.
authentication.ldap.groupMembershipAttr   [LDAP attribute]           The attribute for group membership. For example, uniqueMember.
authentication.ldap.groupNamingAttr       [LDAP attribute]           The attribute for group name.




____________________________________________________________________

Note: If you are going to set bindAnonymously to false (the default), make sure that you have

an LDAP Manager name and password set up. If you are going to use SSL, make sure you have

already set up your certificate and keys.
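Before running the setup, it can help to confirm that the collected values actually resolve a user in the directory. The sketch below prints an ldapsearch command built from those values so it can be reviewed and then run by hand. It assumes the OpenLDAP client tools provide the LDAP client on the Ambari server host, which the original text does not specify; the function name ldap_check_cmd and the user someuser are hypothetical.

```shell
#!/bin/sh
# Hypothetical pre-check (an illustration, not part of the starter kit).
# Prints an ldapsearch command that does a simple bind (-x) as the manager DN,
# prompts for its password (-W), and looks up one user under the base DN.
# Arguments: server:port, base DN, manager DN, username attribute, user name.
ldap_check_cmd() {
  url=$1
  base_dn=$2
  manager_dn=$3
  user_attr=$4
  user=$5
  printf 'ldapsearch -x -H ldap://%s -b "%s" -D "%s" -W "(%s=%s)"\n' \
    "$url" "$base_dn" "$manager_dn" "$user_attr" "$user"
}

ldap_check_cmd my.ldap.server:389 \
  "ou=people,dc=hadoop,dc=apache,dc=org" \
  "uid=hdfs,ou=people,dc=hadoop,dc=apache,dc=org" \
  uid someuser
```

If the printed command returns the expected entry when run, the primary URL, base DN, manager DN, and username attribute are probably the right values to give at the corresponding prompts.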

To manage authorization and permissions against your users and groups, you must synchronize

those LDAP users and groups in the Ambari database.

If the LDAP server certificate is signed by a trusted Certificate Authority, you do not need to

import the certificate into Ambari. If the LDAP server certificate is self-signed, or is signed by

an unrecognized certificate authority such as an internal certificate authority, you must import

the certificate and create a keystore file.

Procedure

1. Stop the Ambari server.

ambari-server stop

2. If required, create a keystore file.

a. Create a directory for the keystore file. For example, type mkdir /keys to create a directory called keys.

b. Create the keystore file. For example, type the following command to create the

keystore file ldaps-keystore.jks in the keys directory.

$JAVA_HOME/bin/keytool -import -trustcacerts -alias root -file $PATH_TO_YOUR_LDAPS_CERT -keystore /keys/ldaps-keystore.jks

c. When prompted, set a password.

The password is needed when you are setting up LDAP or AD authentication in

Ambari.




3. Run the following LDAP setup command, and answer the prompts with the information that you previously collected.

ambari-server setup-ldap

Note: Prompts marked with an asterisk are required values.

4. At the Primary URL* prompt, type the server URL and port.

5. At the Secondary URL prompt, type the secondary server URL and port.

6. At the Use SSL* prompt, type your value.

If you are using LDAPS, type true.

7. At the User name attribute* prompt, type your value. The default value is uid.

8. At the Base DN* prompt, type your value.

9. At the Bind anonymously* prompt, type your value.

10. If you have set bindAnonymously to false, at the Manager DN* prompt, type your value.

11. At the Enter the Manager Password* prompt, type the password for your LDAP

manager.

12. At the Enter the userObjectClass* prompt, type the object class that is used for

users.

13. At the Enter the groupObjectClass * prompt, type the object class that is used for

groups.

14. At the Enter the groupMembershipAttr * prompt, type the attribute for group

membership.

15. At the Enter the groupNamingAttr * prompt, type the attribute for group name.

16. If you set Use SSL* to true in step 6, the prompt Do you want to provide custom

TrustStore for Ambari? appears.





o If you are using a self-signed certificate that you do not want imported to the existing JDK keystore, type y.

This option is more secure. For example, you might want only Ambari to use this certificate, and not any other applications run by the JDK on the same host.

When you select this option, other prompts appear.

At the TrustStore type prompt, type jks.

At the Path to TrustStore file prompt, type /keystore_directory/ldaps-keystore.jks.

At the Password for TrustStore prompt, type the password that you defined for the keystore.

o If you are using a self-signed certificate that you want to import and store in the existing, default JDK keystore, type n.

This option is less secure.

When you select this option, do the following.

If necessary, convert the SSL certificate to X.509 format by executing the following command:

openssl x509 -in slapd.pem -out slapd.crt

where slapd.crt is the path to the X.509 certificate.

Import the SSL certificate to the existing keystore, such as the default JRE certificate store, by typing the following command:

/usr/jdk64/jdk1.7.0_45/bin/keytool -import -trustcacerts -file slapd.crt -keystore /usr/jdk64/jdk1.7.0_45/jre/lib/security/cacerts

In this example, Ambari is set up to use JDK 1.7, so the certificate must be imported into the JDK 7 keystore.

17. Review your settings, and if they are correct, select y.




18. Restart the Ambari server.

19. Synchronize your LDAP users and groups into the Ambari database.

o To synchronize a specific set of users and groups, type the following command:

ambari-server sync-ldap --users users.txt --groups groups.txt

where users.txt and groups.txt are files that contain comma-separated users and

groups.

Note: Group membership is determined using the group membership attribute that you specified when you ran ambari-server setup-ldap.

o If you have synchronized a specific set of users and groups, type the following

command to synchronize only those entities that are in Ambari with LDAP. Users

are removed from Ambari if they no longer exist in LDAP, and group membership

in Ambari is updated to match LDAP.

ambari-server sync-ldap --existing

Note: Group membership is determined using the group membership attribute that you specified when you ran ambari-server setup-ldap.

o To import all entities with matching LDAP user and group object classes into

Ambari, type the following command:

ambari-server sync-ldap --all

________________________________________________________

Note: Use this option only if you are sure that you want to synchronize all users

and groups from LDAP into Ambari. Isilon will also need to be configured for LDAP

authentication for this synchronization to work across the entire cluster.

_________________________________________________________
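The --users and --groups options described above expect files that contain comma-separated names. As a minimal sketch, assuming the names are kept one per line in a working list, the helper below joins them into a single comma-separated line. The function name to_csv_line and the sample names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (an illustration, not part of the starter kit).
# Reads names from standard input, one per line, drops blank lines,
# and prints them as a single comma-separated line.
to_csv_line() {
  grep -v '^[[:space:]]*$' | paste -s -d, -
}

printf 'alice\nbob\ncarol\n' | to_csv_line
```

Redirecting the output to users.txt (and doing the same for a group list into groups.txt) produces the two files named in the sync-ldap example above.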




Additional User Privileges

Initially, the users you have enabled all have Ambari User privileges. Ambari Users can read

metrics, view service status and configuration, and browse job information. If you want users to

be able to start or stop services, modify configurations, and run smoke tests, you must give the

users administrator privileges.

Enabling Kerberos for HDFS on Isilon

Using MIT Kerberos 5

This section explains how to set up an Isilon cluster to authenticate HDFS connections with a

stand-alone MIT Kerberos 5 key distribution center. The following instructions assume that you

have already set up a Kerberos system with a resolvable hostname for the KDC and a

resolvable hostname for the KDC admin server. It is assumed that your KDC is running on the Ambari server, that each KDC has a different realm name, that the Hadoop client setup for Kerberos is complete on the compute nodes, and that you have one KDC per zone.

__________________________________________________________________________

Note: AES encryption must be disabled in krb5.conf and RC4/DES should be listed as the only

supported encryption type on server and clients:

kdc.conf

supported_enctypes = RC4-HMAC:normal DES-CBC-MD5:normal DES-CBC-CRC:normal

__________________________________________________________________________
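The encryption-type requirement in the note above can be checked mechanically before continuing. The sketch below inspects the supported_enctypes line of a KDC configuration file and reports whether any AES type is still listed. The function name check_enctypes is hypothetical, and the path in the usage comment is only the usual location on Red Hat style systems, which the original text does not state.

```shell
#!/bin/sh
# Hypothetical check (an illustration, not part of the starter kit).
# Argument: path to a KDC configuration file.
# Returns 0 if supported_enctypes is present and lists no AES types,
# 1 if an AES type is still listed, 2 if the setting cannot be found.
check_enctypes() {
  line=$(grep -i '^[[:space:]]*supported_enctypes' "$1" 2>/dev/null) || {
    echo "no supported_enctypes line found in $1" >&2
    return 2
  }
  if printf '%s\n' "$line" | grep -qi 'aes'; then
    echo "AES encryption type still listed: $line" >&2
    return 1
  fi
  echo "ok: $line"
}

# Usage (path is an assumption): check_enctypes /var/kerberos/krb5kdc/kdc.conf
```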

Note: Deleting principals from Isilon does not remove them from KDC.

Procedure

Connect with SSH as root to any node in your Isilon cluster and run the following commands to

configure Isilon for Kerberos.

1. To prevent automatic SPN generation in the System zone, set the ‘All Auth Providers’ setting on the System zone to ‘No’.

isi zone zones modify --zone=system --all-auth-providers=No




2. Add the KDC to the Isilon cluster. Each KDC needs a unique name:

isi auth krb5 create --realm=EXAMPLE.COM --admin-server=kdc.example.com --kdc=kdc.example.com --user=kadmin/admin --password=isi

3. To verify and list all the auth providers for the cluster run:

isi auth status

4. Modify the zone to use the authentication provider:

isi zone zones modify --zone=zone-example --add-auth-provider=krb5:EXAMPLE.COM

5. Verify the zone information with the view command:

isi zone zones view --zone=zone-example

6. Create the Isilon SPNs for the zone. The format must be hdfs/<cluster hostname/SC name>@REALM and HTTP/<cluster hostname/SC name>@REALM.

isi auth krb5 spn create --provider-name=EXAMPLE.COM --spn=hdfs/[email protected] --user=kadmin/admin --password=isi

isi auth krb5 spn create --provider-name=EXAMPLE.COM --spn=HTTP/[email protected] --user=kadmin/admin --password=isi

7. Verify SPN creation:

isi auth krb5 spn list --provider-name=EXAMPLE.COM

8. Lastly, create the proxy users:

o isi hdfs proxyusers create oozie --zone=zone-example --add-user=ambari-qa

o isi hdfs proxyusers create hive --zone=zone-example --add-user=ambari-qa

o isi hdfs proxyusers create zookeeper --zone=zone-example --add-user=ambari-qa

o isi hdfs proxyusers create flume --zone=zone-example --add-user=ambari-qa

o isi hdfs proxyusers create hadoop --zone=zone-example --add-user=ambari-qa

o isi hdfs proxyusers create hbase --zone=zone-example --add-user=ambari-qa

9. Before proceeding with this step, finish the Kerberos setup on the compute nodes and complete the Ambari Security Wizard. After everything has finished installing, configure the Isilon zone to allow only secure connections with the following command:

o isi zone zones modify --zone=zone-example --hdfs-authentication=kerberos_only
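Steps 6 and 8 repeat the same two commands with only the service name changing. As a minimal sketch, the helper below prints those commands for a given zone, realm, and SmartConnect (SC) name so they can be reviewed before being pasted into an Isilon SSH session. The function name print_isilon_krb5_cmds and the host name isilon.example.com are hypothetical, the <kadmin-password> text is a placeholder to replace by hand, and only flags that appear in the steps above are used.

```shell
#!/bin/sh
# Hypothetical helper (an illustration, not part of the starter kit).
# Prints, but does not run, the SPN and proxy-user commands from steps 6 and 8.
# Arguments: access zone name, Kerberos realm, cluster hostname or SC name.
print_isilon_krb5_cmds() {
  zone=$1
  realm=$2
  sc_name=$3
  # One SPN per service prefix, in the form <prefix>/<SC name>@<REALM>
  for prefix in hdfs HTTP; do
    echo "isi auth krb5 spn create --provider-name=${realm} --spn=${prefix}/${sc_name}@${realm} --user=kadmin/admin --password=<kadmin-password>"
  done
  # One proxy user per service account listed in step 8
  for svc in oozie hive zookeeper flume hadoop hbase; do
    echo "isi hdfs proxyusers create ${svc} --zone=${zone} --add-user=ambari-qa"
  done
}

print_isilon_krb5_cmds zone-example EXAMPLE.COM isilon.example.com
```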

______________________________________________________________________

Note: During the Ambari Security Wizard (next section), it is very important to configure the HDFS principals (namenode, snamenode, datanode) as, for example, hdfs/[email protected]. All three principals must point to the FQDN of the configured Isilon Hadoop Zone, followed by @REALM_NAME.

___________________________________________________________________________

Running the Ambari Kerberos Wizard

_________________________________________________________________________

Note: Make sure you complete the Enabling Kerberos for HDFS on Isilon setup (described in the preceding section) before completing the Ambari Kerberos Wizard.

_________________________________________________________________________

Your cluster might use a primary KDC and one or more secondary KDCs to ensure continued

availability of Kerberos-enabled services. In this configuration, each KDC contains a copy of the

Kerberos database. The primary KDC contains the writeable copy of the realm database, which

is replicated on each of the secondary KDCs.

The Kerberos realm must trust the server. In Kerberos configuration files, your realm is

typically identified in uppercase characters to differentiate it from any similar DNS domain that

the realm is associated with.
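The uppercase convention can be illustrated with a one-line conversion from a DNS domain to the matching realm name. This is only a sketch of the naming convention; the function name realm_from_domain is hypothetical, and a realm does not have to match any DNS domain.

```shell
#!/bin/sh
# Hypothetical helper (an illustration, not part of the starter kit).
# Prints the conventional realm name for a DNS domain by uppercasing it.
realm_from_domain() {
  printf '%s\n' "$1" | tr '[:lower:]' '[:upper:]'
}

realm_from_domain example.com   # prints EXAMPLE.COM
```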

__________________________________________________________________________

Note: To use Kerberos, you must install a few basic packages on the machines in your cluster

or build and install the packages from scratch. If you need to build the packages yourself, you

can download the latest version from the MIT website.

If your system uses a package management system, you can install the following packages to

use a generic version of Kerberos:

krb5-workstation must be installed on all client systems. This package contains the basic Kerberos programs, in addition to Kerberos-enabled versions of the telnet and ftp applications.

krb5-server must be installed on all server and secondary server systems. This package provides the programs that must be installed on a Kerberos 5 server or server replica.

krb5-libs must be installed on all client and server systems. This package contains the shared libraries that are used by Kerberos on all clients and services.

pam_krb5 must be installed on all client systems. This package provides a pluggable authentication module (PAM) that enables Kerberos authentication.
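The package list above can be captured as a small lookup by host role, which is convenient when checking many machines. The sketch below only prints the package names given above for a role; the function name required_krb5_packages and the role names client and server are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (an illustration, not part of the starter kit).
# Prints the Kerberos packages listed above for a host role.
# Argument: "client" or "server".
required_krb5_packages() {
  case "$1" in
    client) echo "krb5-workstation krb5-libs pam_krb5" ;;
    server) echo "krb5-server krb5-libs" ;;
    *) echo "unknown role: $1" >&2; return 1 ;;
  esac
}

required_krb5_packages client
```

On an RPM-based system, each printed name could then be passed to rpm -q to see whether the package is already installed. A host that acts as both KDC and client needs the packages for both roles.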

Procedure

1. In the Ambari web dashboard, click Admin > Kerberos on the menu bar.

2. Click Enable Kerberos.

3. Select the type of KDC that you want to use and confirm that you meet the prerequisites.




4. Provide information about the KDC and admin account in the configuration page.

5. Install the Kerberos client. The wizard page shows you the progress, but you can also see the

progress of the install in the file /var/log/ambari-server/ambari-server.log.

The Kerberos clients are installed on the hosts and the access to the KDC is tested by testing

that Ambari can create a principal, generate a keytab and distribute that keytab.

6. Configure the Kerberos identities that are used by Hadoop.




7. Kerberize the cluster.




____________________________________________________________________

Note: Make sure Isilon is configured for Kerberos before configuring HDFS in the Ambari Security Wizard (see Enabling Kerberos for HDFS on Isilon). Click through the wizard until you get to the screen that configures the principals.

Note: Isilon does not convert principal names to short names using rules, so do not use aliases (e.g. rm instead of yarn).

o Realm name

o Hdfs -> namenode hdfs/[email protected]

o Hdfs -> secondarynamenode hdfs/[email protected]

o Hdfs -> datanode hdfs/[email protected]

o Yarn -> resourceManager yarn/_HOST

o Yarn -> nodemanager yarn/_HOST

o Mapreduce2 -> history server principal -> mapred/_HOST







8. The final step to enable Kerberos is called Start and Test Services. If you see an error that

indicates some services failed to start and execute tests successfully, you might learn

more about the issue by clicking Start and Test Services. If you see a Check HBase failure

error message, such as ERROR: Can't get master address from ZooKeeper; znode data

== null, work around this issue by manually restarting the HBase service. After manually

restarting, retry the Start and Test Services.

Troubleshooting and Support

To isolate and resolve problems with BigInsights®, you can use the troubleshooting and

support information online. This information contains instructions for using the problem-

determination resources that are provided with BigInsights.

https://www-01.ibm.com/support/knowledgecenter/SSPT3X_4.1.0/com.ibm.swg.im.infosphere.biginsights.trb.doc/doc/troubleshooting.html