#RememberRuddy
_____________________________
EMC ISILON HADOOP STARTER KIT Deploying IBM BigInsights v 4.0 with EMC ISILON
Release 1.0
October, 2015
EMC Isilon Hadoop Starter Kit for IBM BigInsights
__________________________________________________________________
EMC ISILON HADOOP STARTER KIT FOR IBM BIGINSIGHTS 2
To learn more about how EMC products, services, and solutions can help solve your
business and IT challenges, contact your local representative or authorized reseller,
visit www.emc.com, or explore and compare products in the EMC Store
Copyright © 2015 EMC Corporation. All Rights Reserved.
EMC believes the information in this publication is accurate as of its publication date.
The information is subject to change without notice.
The information in this publication is provided “as is.” EMC Corporation makes no
representations or warranties of any kind with respect to the information in this
publication, and specifically disclaims implied warranties of merchantability or fitness
for a particular purpose.
Use, copying, and distribution of any EMC software described in this publication
requires an applicable software license.
For the most up-to-date listing of EMC product names, see EMC Corporation
Trademarks on EMC.com.
EMC are registered trademarks or trademarks of EMC, Inc. in the United States
and/or other jurisdictions. All other trademarks used herein are the property of their
respective owners.
Contents
INTRODUCTION
  IBM & EMC Technology Highlights
  Audience
  Apache Hadoop Projects
  IBM Open Platform and the Ambari Manager
  Isilon Scale-Out NAS for HDFS
  Overview of Isilon Scale-Out NAS for Big Data
PRE-INSTALLATION CHECKLIST
  Supported Software Versions
  Hardware Requirements and Suggested Hadoop Service Layout
INSTALLATION OVERVIEW
  Prerequisites
    Isilon Scale-Out NAS or Isilon OneFS Simulator
    Linux
    Networking
    DNS
    Other
  Prepare Isilon
    Assumptions
    SmartConnect for HDFS
    OneFS Access Zones
    Sharing Data between Access Zones
    User & Group ID’s
    Configuring Isilon for HDFS
    Create DNS Records for Isilon
  Prepare Linux Compute Nodes
    Linux Operating System packages needed for IBM BigInsights:
    Enable NTP on all Linux Compute nodes
    Disable SELinux on each node if enabled before installing Ambari.
    Check UMASK Settings
    Set ulimit Properties
    Kernel Modifications
    Create IBM BigInsights Hadoop Users and Groups
    Configure Passwordless SSH
    Additional Linux Packages to Install
    Test DNS Resolution
    Edit sudoers file on all Linux compute nodes.
INSTALLING IBM OPEN PLATFORM (OP)
  Download IBM Open Platform Software
  Create IBM Open Platform Repository
  Validating IBM Open Platform Install
  Adding a Hadoop User
  Additional Service Tests
    HDFS
    YARN/MAPREDUCE
    HIVE
    HBASE
  Ambari Service Check
INSTALLING IBM VALUE PACKAGES
  Before You Begin
  Installation Procedure
  Select IBM BigInsights Service to Install
  Installing BigInsights Home
  Configure Knox
  Installing BigSheets
  Installing Big SQL
  Connecting to Big SQL
    Running JSqsh
    Connection setup
    Commands and queries
    Command and query edit
    Configuration variables
  Installing Text Analytics
  Installing Big R
    IBM BigInsights Online Tutorials
SECURITY CONFIGURATION AND ADMINISTRATION
  Setting up HTTPS for Ambari
  Configuring SSL support for HBase REST gateway with Knox
  Overview of Kerberos
  Enabling Kerberos for IBM Open Platform
  Manually generating keytabs for Kerberos authentication
  Setting up Active Directory or LDAP authentication in Ambari
  Enabling Kerberos for HDFS on Isilon
    Using MIT Kerberos 5
  Running the Ambari Kerberos Wizard
  Trouble Shooting and Support
EMC Isilon Hadoop Starter Kit for IBM BigInsights v 4.0
This document describes how to create a Hadoop environment using IBM® Open Platform with Apache Hadoop and EMC® Isilon® scale-out network-attached storage (NAS) as HDFS-accessible shared storage. It also covers the installation and configuration of the IBM BigInsights Value Packages.
Introduction
IBM & EMC Technology Highlights
The IBM® Open Platform with Apache Hadoop consists entirely of Apache Hadoop
open source components, such as Apache Ambari, YARN, Spark, Knox, Slider, Sqoop,
Flume, Hive, Oozie, HBase, ZooKeeper, and more. After installing IBM Open Platform, you
can install additional IBM value-add service modules.
These value-add service modules are installed separately. They include IBM
BigInsights® Analyst, IBM BigInsights Data Scientist, and the IBM BigInsights Enterprise
Management module, which extend IBM Open Platform to accelerate the conversion of
all types of data into business insight and action.
The EMC® Isilon® Scale-Out Network-Attached Storage (NAS) platform provides Hadoop
clients with direct access to big data through a Hadoop Distributed File System (HDFS) interface.
Powered by the distributed EMC Isilon OneFS® operating system, an EMC Isilon cluster
delivers a powerful yet simple and highly efficient storage platform with native HDFS
integration to accelerate analytics, gain new flexibility, and avoid the costs of a separate
Hadoop infrastructure.
Audience
This document is intended for IT program managers, IT architects, developers, and IT
managers who want to deploy IBM BigInsights v4.0 with EMC Isilon OneFS v 7.2.0.3 for
HDFS storage. If a physical EMC Isilon cluster is not available, download the free EMC Isilon
OneFS Simulator, which can be installed as a virtual machine for integration testing and
training purposes. The EMC Isilon OneFS Simulator is available at http://www.emc.com/getisilon.
Apache Hadoop Projects
Apache Hadoop is an open source, batch data processing system for enormous amounts of
data. Hadoop runs as a platform that provides cost-effective, scalable infrastructure for
building Big Data analytic applications. All Hadoop clusters contain a distributed file system
called the Hadoop Distributed File System (HDFS) and a computation layer called
MapReduce.
The Apache Hadoop project contains the following subprojects:
• Hadoop Distributed File System (HDFS) – A distributed file system that provides
high-throughput access to application data.
• Hadoop MapReduce – A software framework for writing applications to reliably
process large amounts of data in parallel across a cluster.
Hadoop is supplemented by an ecosystem of Apache projects, such as Pig, Hive, Sqoop,
Flume, Oozie, Slider, HBase, ZooKeeper, and more, that extend the value of Hadoop and
improve its usability.
Version 2 of Apache Hadoop introduces YARN, a sub-project of Hadoop that separates the
resource management and processing components. YARN was born of a need to enable a
broader array of interaction patterns for data stored in HDFS beyond MapReduce. The YARN-
based architecture of Hadoop 2.0 provides a more general processing platform that is not
constrained to MapReduce.
For full details of the Apache Hadoop project see http://hadoop.apache.org/.
IBM Open Platform and the Ambari Manager
The IBM Open Platform with Apache Hadoop enables Enterprise Hadoop by providing the
complete set of essential Hadoop capabilities required for any enterprise. Utilizing YARN at
its core, it provides capabilities for several functional areas including Data Management,
Data Access, Data Governance, Integration, Security and Operations.
IBM Open Platform delivers the core elements of Hadoop (scalable storage and distributed
computing) as well as the necessary enterprise capabilities, such as security, high
availability, and integration with a broad range of hardware and software solutions.
Apache Ambari is an open operational framework for provisioning, managing and monitoring
Apache Hadoop clusters.
As of version 4.0 of IBM Open Platform, Ambari can be used to set up and deploy Hadoop
clusters for nearly any task. Ambari can provision, manage and monitor every aspect of a
Hadoop deployment.
More information on IBM Open Platform can be found at:
http://www-01.ibm.com/software/data/infosphere/hadoop/enterprise.html
Isilon Scale-Out NAS for HDFS
EMC Isilon is the only scale-out NAS platform natively integrated with the Hadoop
Distributed File System (HDFS). Using HDFS as an over-the-wire protocol, you can deploy a
powerful, efficient, and flexible data storage and analytics ecosystem.
In addition to native integration with HDFS, EMC Isilon storage scales to support
very large Hadoop analytics projects. Isilon scale-out NAS also offers the
simplicity, efficiency, flexibility, and reliability that you need to maximize the value of your
Hadoop data storage and analytics workflow investment.
Overview of Isilon Scale-Out NAS for Big Data
The EMC Isilon scale-out platform combines modular hardware with unified software to
provide the storage foundation for data analysis. Isilon scale-out NAS is a fully distributed
system that consists of nodes of modular hardware arranged in a cluster. The distributed
Isilon OneFS operating system combines the memory, I/O, CPUs, and disks of the nodes into
a cohesive storage unit to present a global namespace as a single file system.
The nodes work together as peers in a shared-nothing hardware architecture with no single
point of failure. Every node adds capacity, performance, and resiliency to the cluster and
each node acts as a Hadoop namenode and datanode.
The namenode daemon is a distributed process that runs on all the nodes in the cluster. A
compute client can connect to any node through HDFS.
As nodes are added, the file system expands dynamically and redistributes data, eliminating
the work of partitioning disks and creating volumes. The result is a highly efficient and
resilient storage architecture that brings all the advantages of an enterprise scale-out NAS
system to storing data for analysis.
With traditional direct-attached storage, the required ratio of CPU, RAM, and disk space
depends on the workload, which makes it difficult to size a Hadoop cluster before you
have had a chance to measure your MapReduce workload. Expanding data sets also make
upfront sizing decisions problematic. Isilon scale-out NAS suits this situation well:
you can increase CPUs, RAM, and disk space by adding nodes, dynamically matching
storage capacity and performance to the demands of a changing Hadoop workload.
An Isilon cluster optimizes data protection. OneFS protects data more efficiently and
reliably than HDFS. By default, HDFS stores three copies of each block of data. In
contrast, OneFS stripes the data across the cluster and protects it with forward error
correction codes, which consume less space than replication while providing better protection.
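The space difference between replication and forward error correction can be illustrated with a little arithmetic. The sketch below is only an illustration: the function names are hypothetical, and the 8+2 stripe is an assumed example layout, not a OneFS default or a figure taken from this document.

```python
def replication_overhead(copies: int) -> float:
    """Extra raw capacity needed per unit of user data when every
    block is stored `copies` times (3 copies -> 2.0, i.e. 200%)."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return float(copies - 1)


def fec_overhead(data_units: int, parity_units: int) -> float:
    """Extra raw capacity needed per unit of user data for a stripe of
    `data_units` data units protected by `parity_units` FEC units."""
    if data_units < 1 or parity_units < 0:
        raise ValueError("invalid stripe layout")
    return parity_units / data_units


if __name__ == "__main__":
    # HDFS default: three copies of each block.
    print(f"3x replication: {replication_overhead(3):.0%} overhead")
    # Assumed example stripe: 8 data units + 2 FEC units.
    print(f"8+2 FEC stripe: {fec_overhead(8, 2):.0%} overhead")
```

Under these assumptions, three-way replication needs 200% extra raw capacity, while the assumed 8+2 stripe needs 25%. The actual OneFS overhead depends on the cluster size and the protection level configured.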
Pre-installation Checklist
Supported Software Versions
The environment used for this document consists of the following software versions:
• Ambari 1.7.0_IBM
• IBM Open Platform v 4.0.0.0
• Isilon OneFS 7.2.0.3 with patch-159065
• All IBM BigInsights v 4.0 value packs, i.e. Business Analyst, Data
Scientist, and Enterprise Management
______________________________________________________________________
Note: IBM BigInsights v 4.0 requires OneFS v 7.2.0.3 with patch-159065.
OneFS version 7.2.0.4 should also work, as should version 7.2.1.1 when it becomes available.
Do not install IBM BigInsights with OneFS versions lower than 7.2.0.3.
See the EMC Isilon Supportability and Compatibility Guide for the latest compatibility updates:
https://support.emc.com/docu44518_Isilon-Supportability-and-Compatibility-Guide.pdf?language=en_US
Hardware Requirements and Suggested Hadoop Service Layout
Detailed system requirements for IBM BigInsights compute nodes can be found at:
http://www-01.ibm.com/support/docview.wss?uid=swg27027565
In a multi-node IBM BigInsights cluster without high availability, one management node
is sufficient if performance is not a concern. If performance is a concern, consider
configuring at least three management nodes. If you use the BigInsights - Big SQL
service, consider configuring four management nodes. In a high availability environment,
consider six management nodes. Use the following list as a guide for the nodes in your
IBM/EMC cluster. A suggested layout is shown in Table 1 for both non-high availability
and high availability deployments.
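The sizing guidance above can be written as a small decision function. This is a sketch only: the function and parameter names are hypothetical, and the order in which the conditions are checked (high availability first, then Big SQL, then performance) is an assumption, since the text does not say how the suggestions combine.

```python
def suggested_management_nodes(high_availability: bool,
                               performance_sensitive: bool = False,
                               uses_big_sql: bool = False) -> int:
    """Return the suggested number of management nodes.

    Assumed precedence: high availability, then Big SQL, then
    performance. The document gives each suggestion independently.
    """
    if high_availability:
        return 6
    if uses_big_sql:
        return 4
    if performance_sensitive:
        return 3
    return 1


if __name__ == "__main__":
    print(suggested_management_nodes(high_availability=False))
    print(suggested_management_nodes(False, performance_sensitive=True))
    print(suggested_management_nodes(False, uses_big_sql=True))
    print(suggested_management_nodes(high_availability=True))
```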
________________________________________________________________________________________
Note: With both deployment options, EMC Isilon provides namenode, secondary namenode and datanode functions for the entire cluster. Do not designate any compute node as a namenode, secondary namenode, or datanode in any aspect of the IBM BigInsights configuration.
Table 1. Suggested Service Layout
Non-High availability
  Management node 1: Ambari, PostgreSQL, Knox, Zookeeper, Hive, Spark, Spark History Server, BigInsights Home, BigSheets, Big R, BigSQL Headnode, Text Analytics
  Management node 2: Resource Manager, HBase Master, Zookeeper, Oozie, Ambari monitoring service
  Management node 3: Job history server, Zookeeper, App Timeline Server, Kafka
  Management node 4: Big SQL Scheduler, Hive Server (MySQL), MySQL metastore, Hive/Oozie metastore, WebHCat Server, Data Server Manager
High availability
  Management node 1: Ambari, PostgreSQL, Spark, Spark History Server, BigSQL Headnode
  Management node 2: Resource Manager, Zookeeper, Oozie, Ambari monitoring service, BigInsights Home
  Management node 3: Resource Manager (standby), Job history server, Zookeeper, App Timeline Server, Kafka, Oozie (Standby)
  Management node 4: Big SQL Scheduler, HBase Master (standby), Hive Server, MySQL Server, Hive metastore, WebHCat Server, Data Server Manager
  Management node 5: Big SQL Headnode (Standby), Big SQL Scheduler (Standby), HBase Master, Hive Server (Standby), Hive Metastore (Standby), Journal Node, Zookeeper
Installation Overview
The installation process described in this document consists of the following steps:
1. Confirm prerequisites.
2. Prepare your network infrastructure including DNS.
3. Prepare your Isilon cluster.
4. Prepare Linux compute nodes.
5. Install Ambari Server.
6. Use Ambari Manager to deploy IBM Open Platform to compute nodes.
7. Install IBM BigInsights Value Packages.
8. Perform key functional tests.
Prerequisites
Isilon Scale-Out NAS or Isilon OneFS Simulator
For low-capacity, non-performance testing of Isilon, the EMC Isilon OneFS Simulator can
be used instead of a cluster of physical Isilon appliances. This can be downloaded for free
from http://www.emc.com/getisilon.
Refer to the EMC Isilon OneFS Simulator Install Guide for details. Be sure to follow the
section for running the virtual nodes in VMware ESX. Only a single virtual node is required,
but adding nodes lets you explore other features such as data
protection, SmartPools (tiering), and SmartConnect (network load balancing).
For physical Isilon nodes, you should have already completed the console-based
installation process for your first Isilon node and added two other nodes for a
minimum of 3 Isilon nodes.
You should have OneFS version 7.2.0.3 + patch 159065 installed on Isilon.
You must obtain a OneFS HDFS license code and install it on your Isilon cluster. You can
get your free OneFS HDFS license from:
http://www.emc.com/campaign/isilon-hadoop/index.htm.
It is recommended, but not required, to have a SmartConnect Advanced license for
your Isilon cluster.
To allow for scripts and other small files to be easily shared between all nodes in your
environment, it is highly recommended to enable NFS (Unix Sharing) on your Isilon
cluster. By default, the entire /ifs directory is already exported and this can remain
unchanged. This document assumes that a single Isilon cluster is used for this NFS
export as well as for HDFS. However, there is no requirement that the NFS export be
on the same Isilon cluster that you are using for HDFS.
Linux
• Red Hat Enterprise Linux (RHEL) Server 6 (Update 5 minimum) or comparable
CentOS Server.
• 100 GB root partition.
• At a minimum, 96 GB of RAM for production environments. The more RAM the better
for Hadoop.
Networking
For the best performance, a single 10 Gigabit Ethernet switch should connect to at
least one 10 Gigabit port on each Linux host. Additionally, the same switch should
connect to at least one 10 Gigabit port on each Isilon node.
A single dedicated layer-2 network can be used to connect all hosts and Isilon nodes,
although multiple networks can be used for increased security, monitoring, and
robustness.
At least an entire /24 IP address block should be allocated to your network. This will
allow a DNS reverse lookup zone to be delegated to your Hadoop DNS server.
If using the EMC Isilon OneFS Simulator, you will need at least two static IP addresses
(one for the node’s ext-1 interface, another for the SmartConnect service IP). Each
additional Isilon node will require an additional IP address.
At a minimum, you will need to allocate to your Isilon cluster one IP address per
Access Zone per Isilon node. In general, you will need one Access Zone for each
separate Hadoop cluster that will use Isilon for HDFS storage.
For the best possible load balancing during an Isilon node failure scenario, the
recommended number of IP addresses is given by the formula below. Of course, this
is in addition to any IP addresses used for non-HDFS pools.
# of IP addresses = 2 * (# of Isilon Nodes) * (# of Access Zones)
For example, 20 IP addresses are recommended for 5 Isilon nodes and 2 Access Zones.
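The two IP address counts described above (the minimum of one address per Access Zone per node, and the recommended count from the formula) can be computed as follows. This is a minimal sketch; the function names are hypothetical.

```python
def minimum_ip_count(isilon_nodes: int, access_zones: int) -> int:
    """Minimum: one IP address per Access Zone per Isilon node."""
    if isilon_nodes < 1 or access_zones < 1:
        raise ValueError("node and zone counts must be at least 1")
    return isilon_nodes * access_zones


def recommended_ip_count(isilon_nodes: int, access_zones: int) -> int:
    """Recommended: 2 * (# of Isilon nodes) * (# of Access Zones).

    Addresses for non-HDFS pools are not included.
    """
    return 2 * minimum_ip_count(isilon_nodes, access_zones)


if __name__ == "__main__":
    # The example from the text: 5 Isilon nodes and 2 Access Zones.
    print(minimum_ip_count(5, 2))      # 10
    print(recommended_ip_count(5, 2))  # 20
```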
This document assumes that all servers have Internet access so that they can download
various components from Internet repositories.
DNS
A DNS server is required and you must have the ability to create DNS records and
zone delegations.
It is recommended that your DNS server delegate a subdomain to your Isilon cluster.
For instance, DNS requests for subnet0-pool0.isiloncluster1.example.com or
isiloncluster1.example.com should be delegated to the Service IP defined on your
Isilon cluster.
To allow for a convenient way of changing the HDFS Namenode used by all Hadoop
applications and services, create a DNS record for your Isilon cluster’s HDFS
Namenode service. This should be a CNAME alias to your Isilon SmartConnect zone.
Specify a TTL of 1 minute to allow for quick changes. For example, create a CNAME
record for mycluster1-hdfs.example.com that targets
subnet0-pool0.isiloncluster1.example.com. If you later want to redirect all HDFS I/O to another
cluster or a different pool on the same Isilon cluster, you simply need to change the
DNS record and restart all Hadoop services.
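As an illustration of the records described in this section, the sketch below prints them in BIND-style zone file syntax. It is not taken from the original guide: the function name, the service IP host name (isiloncluster1-sip.example.com), and the address 192.0.2.10 are placeholder assumptions, and your DNS server may use a different format.

```python
def isilon_dns_records(delegated_zone: str, service_ip_host: str,
                       service_ip: str, hdfs_alias: str,
                       smartconnect_zone: str,
                       ttl_seconds: int = 60) -> list[str]:
    """Build zone-file lines for the SmartConnect delegation and the
    HDFS namenode alias (CNAME with a short TTL)."""
    return [
        # Address record for the SmartConnect service IP (placeholder name).
        f"{service_ip_host}. IN A {service_ip}",
        # Delegate the Isilon subdomain to the SmartConnect service IP.
        f"{delegated_zone}. IN NS {service_ip_host}.",
        # Alias used by Hadoop clients; a 60-second TTL allows quick changes.
        f"{hdfs_alias}. {ttl_seconds} IN CNAME {smartconnect_zone}.",
    ]


if __name__ == "__main__":
    for line in isilon_dns_records(
            delegated_zone="isiloncluster1.example.com",
            service_ip_host="isiloncluster1-sip.example.com",  # hypothetical
            service_ip="192.0.2.10",                           # placeholder
            hdfs_alias="mycluster1-hdfs.example.com",
            smartconnect_zone="subnet0-pool0.isiloncluster1.example.com"):
        print(line)
```

Redirecting HDFS I/O to another pool would then mean changing only the CNAME target and restarting the Hadoop services.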
Other
Three scripts are available at http://www.github.com/bonibruno/BigInsights to help
automate new IBM BigInsights installations with EMC Isilon:
1. bi_create_users.sh – use this script to create the users and groups on all the
Linux nodes before beginning installation.
2. isilon_create_users.sh – use this script to create the users and groups on
Isilon before beginning installation. You must first create your access zone
(for example, ibm), as described later in this document, before running this script.
3. isilon_create_directories.sh – run this script after isilon_create_users.sh.
More information on the use of these scripts is provided in the installation section of this
document.
Prepare Isilon
Assumptions
This document makes the assumptions listed below. These are not necessarily
requirements but they are usually valid and simplify the process.
• It is assumed that you are not using a directory service such as Active
Directory for Hadoop users and groups.
• It is assumed that you are not using Kerberos authentication for Hadoop.
SmartConnect for HDFS
A best practice for HDFS on Isilon is to utilize two SmartConnect IP address pools for each
access zone. One IP address pool should be used by Hadoop clients to connect to the HDFS
namenode service on Isilon and it should use the dynamic IP allocation method to
minimize connection interruptions in the event that an Isilon node fails.
____________________________________________________________________
Note: Dynamic IP allocation requires a SmartConnect Advanced license.
____________________________________________________________________
A Hadoop client uses a specific SmartConnect IP address pool simply by using its zone
name (DNS name) in the HDFS URI:
For example, hdfs://subnet0-pool1.isiloncluster1.example.com:8020
A second IP address pool should be used for HDFS datanode connections and it should also
use the dynamic IP allocation method. To assign specific SmartConnect IP address pools for
datanode connections, you will use the “isi hdfs racks modify” command. If the network
is flat, there is no need to use “isi hdfs racks modify”; the default configuration will suffice.
If IP addresses are limited and you have a SmartConnect Advanced license, you may
choose to use a single dynamic pool for namenode and datanode connections. This may
result in uneven utilization of Isilon nodes.
If you do not have a SmartConnect Advanced license, you may choose to use a single
static pool for namenode and datanode connections. This may result in some failed HDFS
connections in the event of a node failure.
For more information, see the EMC Isilon Best Practices for Hadoop Data Storage white paper
online at:
https://www.emc.com/collateral/white-papers/h13926-wp-emc-isilon-hadoop-best-practices-onefs72.pdf
OneFS Access Zones
Access zones on OneFS are a way to select a distinct configuration for the OneFS cluster
based on the IP address that the client connects to. For HDFS, this configuration includes
authentication methods, HDFS root path, and authentication providers (AD, LDAP, local,
etc.). By default, OneFS includes a single access zone called System.
If you will only have a single Hadoop cluster connecting to your Isilon cluster, then you can
use the System access zone with no additional configuration. However, to have more than
one Hadoop cluster connect to your Isilon cluster, it is best to have each Hadoop cluster
connect to a separate OneFS access zone. This will allow OneFS to present each Hadoop
cluster with its own HDFS namespace and an independent set of users.
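The idea that the access zone is chosen by the IP address a client connects to can be modeled with a simple range lookup. This is a conceptual sketch only, not how OneFS is implemented; the function name is hypothetical, and the pool ranges are the example ranges used for the System zone and zone1 in this document.

```python
import ipaddress

# (access zone, first IP in pool, last IP in pool): example values only.
POOLS = [
    ("System", "10.111.129.115", "10.111.129.126"),
    ("zone1", "10.111.129.127", "10.111.129.138"),
]


def zone_for_ip(ip: str, pools=POOLS):
    """Return the access zone whose IP pool contains `ip`, or None."""
    addr = ipaddress.ip_address(ip)
    for zone, first, last in pools:
        if ipaddress.ip_address(first) <= addr <= ipaddress.ip_address(last):
            return zone
    return None


if __name__ == "__main__":
    print(zone_for_ip("10.111.129.120"))  # System
    print(zone_for_ip("10.111.129.130"))  # zone1
    print(zone_for_ip("10.111.129.200"))  # None
```

A Hadoop cluster that connects through the second pool would therefore see the HDFS root path, users, and authentication settings configured for zone1 rather than those of System.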
For more information, see the Security and Compliance for Scale-out Hadoop Data Lakes
white paper.
To view your current list of access zones and the IP pools associated with them:
isiloncluster1-1# isi zone zones list
Name Path
------------
System /ifs
------------
Total: 1
isiloncluster1-1# isi networks list pools -v
subnet0:pool0
In Subnet: subnet0
Allocation: Static
Ranges: 1
10.111.129.115-10.111.129.126
Pool Membership: 4
1:10gige-1 (up)
2:10gige-1 (up)
3:10gige-1 (up)
4:10gige-1 (up)
Aggregation Mode: Link Aggregation Control Protocol (LACP)
Access Zone: System (1)
SmartConnect:
Suspended Nodes : None
Auto Unsuspend ... 0
Zone : subnet0-pool0.isiloncluster1.lab.example.com
Time to Live : 0
Service Subnet : subnet0
Connection Policy: Round Robin
Failover Policy : Round Robin
Rebalance Policy : Automatic Failback
To create a new access zone and an associated IP address pool:
isiloncluster1-1# mkdir -p /ifs/isiloncluster1/zone1
isiloncluster1-1# isi zone zones create --name zone1 \
--path /ifs/isiloncluster1/zone1
isiloncluster1-1# isi networks create pool --name subnet0:pool1 \
--ranges 10.111.129.127-10.111.129.138 --ifaces 1-4:10gige-1 \
--access-zone zone1 --zone subnet0-pool1.isiloncluster1.lab.example.com \
--sc-subnet subnet0 --dynamic
Creating pool 'subnet0:pool1': OK
Saving: OK
____________________________________________________________________
Note: If you do not have a SmartConnect Advanced license, you will need to omit the
--dynamic option.
____________________________________________________________________
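Before choosing the range for a new pool, it can help to confirm how many addresses the range contains and that it does not collide with an existing pool. The following is a minimal bash sketch, not part of OneFS; the function names ip_to_int, range_size, and ranges_overlap are illustrative, and the ranges are the ones used in the examples above.

```shell
#!/usr/bin/env bash
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
  local IFS=.
  local a b c d
  read -r a b c d <<< "$1"
  echo $(( (a << 24) + (b << 16) + (c << 8) + d ))
}

# Print the number of addresses in a range written as START-END.
range_size() {
  local start=${1%-*} end=${1#*-}
  echo $(( $(ip_to_int "$end") - $(ip_to_int "$start") + 1 ))
}

# Succeed (exit status 0) if two START-END ranges share at least one address.
ranges_overlap() {
  local s1 e1 s2 e2
  s1=$(ip_to_int "${1%-*}"); e1=$(ip_to_int "${1#*-}")
  s2=$(ip_to_int "${2%-*}"); e2=$(ip_to_int "${2#*-}")
  (( s1 <= e2 && s2 <= e1 ))
}

# Ranges taken from the pool0 listing and the pool1 create command above.
pool0=10.111.129.115-10.111.129.126
pool1=10.111.129.127-10.111.129.138

echo "pool1 addresses: $(range_size "$pool1")"
if ranges_overlap "$pool0" "$pool1"; then
  echo "pool0 and pool1 overlap"
else
  echo "pool0 and pool1 do not overlap"
fi
```

The check is purely arithmetic and does not query the cluster, so it can be run on any workstation with bash before the pool is created.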
Sharing Data between Access Zones
By default, the data in one access zone cannot be accessed by users in another access zone.
In certain cases, however, you may need to make the same data set available to more
than one Hadoop compute cluster. Using fully qualified HDFS paths, e.g.
hdfs://zone1-hdfs.example.com/hadoop/dir1, can render a data set available across two or
more access zones.
With fully qualified HDFS paths, the data sets do not cross access zones. Instead, the
Hadoop jobs can access the data sets from a common shared HDFS namespace. For
instance, you can selectively share data between two or more access zones based on
referential links and file/directory permissions.
User & Group IDs
Isilon clusters and Hadoop servers each have their own mapping of user IDs (uid) to user
names and group IDs (gid) to group names. When Isilon is used only for HDFS storage by
the Hadoop servers, the IDs do not need to match. This is due to the fact that the HDFS
protocol only refers to users and groups by their names, and never their numeric IDs.
In contrast, the NFS protocol refers to users and groups by their numeric IDs. Although
NFS is rarely used in traditional Hadoop environments, the high-performance, enterprise-
class, and POSIX-compatible NFS functionality of Isilon makes NFS a compelling protocol
for certain workflows. If you expect to use both NFS and HDFS on your Isilon cluster (or
simply want to be open to the possibility in the future), it is highly recommended to
maintain consistent names and numeric IDs for all users and groups on Isilon and your
Hadoop servers. In a multi-tenant environment with multiple Hadoop clusters, numeric IDs
for users in different clusters should be distinct.
For instance, the user bigsql in Hadoop cluster 1 may have ID 1013 and this same ID will
be used in the Isilon access zone for Hadoop cluster 1 as well as every server in Hadoop
cluster 1. The user bigsql in Hadoop cluster 2 may have ID 710 and this ID will be used in
the Isilon access zone for Hadoop cluster 2 as well as every server in Hadoop cluster 2.
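One way to check that numeric IDs are consistent is to compare passwd-format listings (name:x:uid:gid:...) collected from two systems. The sketch below assumes you have saved such listings to files, for example with getent passwd on each Hadoop server; the function name uid_mismatches, the file names, and the sample UIDs are illustrative only.

```shell
#!/usr/bin/env bash
# Print users whose numeric UID differs between two passwd-format files.
# Output format: name uid_in_first_file uid_in_second_file
uid_mismatches() {
  awk -F: 'NR == FNR { uid[$1] = $3; next }
           ($1 in uid) && uid[$1] != $3 { print $1, uid[$1], $3 }' "$1" "$2"
}

# Demo with two small sample listings (hypothetical data).
node1=$(mktemp)
node2=$(mktemp)
printf '%s\n' 'bigsql:x:1013:1013::/home/bigsql:/bin/bash' \
              'hdfs:x:1001:1001::/home/hdfs:/bin/bash' > "$node1"
printf '%s\n' 'bigsql:x:1014:1014::/home/bigsql:/bin/bash' \
              'hdfs:x:1001:1001::/home/hdfs:/bin/bash' > "$node2"

uid_mismatches "$node1" "$node2"
rm -f "$node1" "$node2"
```

An empty result means every user name present in both listings has the same UID. The same approach works for group listings, since group files use the same field positions for name and numeric ID.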
Configuring Isilon for HDFS
_____________________________________________________________________
Note: In the steps below, replace zone1 with System to use the default System access
zone or you may specify the name of a new access zone that you previously created.
______________________________________________________________________
1. Open a web browser to your Isilon cluster’s web administration page. If you
don’t know the URL, simply point your browser to:
https://isilon_node_ip_address:8080
The isilon_node_ip_address is any IP address on any Isilon node that is in the System
Access Zone. This usually corresponds to the ext-1 interface of any Isilon node.
2. Log in with your root account. You specified the root password when you configured
your first node using the console.
3. Check, and edit as necessary, your NTP settings. Click Cluster Management ->
General Settings -> NTP.
1. SSH into any node in your Isilon cluster as root.
2. Confirm that your Isilon cluster is at OneFS version 7.2.0.3.
isiloncluster1-1# isi version
Isilon OneFS v7.2.0.3 ...
3. For OneFS version 7.2.0.3, you must have patch-159065 installed. You can view
the list of patches you have installed with:
# isi pkg info
patch-159065: This patch adds support for the Ambari 1.7.0_IBM Server.
4. Install the patch if needed:
[user@workstation ~]$ scp patch-159065.tgz root@mycluster1-hdfs:/tmp
isiloncluster1-1# gunzip < /tmp/patch-159065.tgz | tar -xvf -
isiloncluster1-1# isi pkg install patch-159065.tar
Preparing to install the package...
Checking the package for installation...
Installing the package
Committing the installation...
Package successfully installed.
5. Verify your HDFS license.
isiloncluster1-1# isi license
Module License Status Configuration Expiration Date
------ -------------- ------------- ---------------
HDFS Evaluation Not Configured November 12, 2016
6. Create the HDFS root directory. This is usually called hadoop and must be within
the access zone directory.
isiloncluster1-1# mkdir -p /ifs/isiloncluster1/zone1/hadoop
7. Set the HDFS root directory for the access zone.
isiloncluster1-1# isi zone zones modify zone1 \
--hdfs-root-directory /ifs/isiloncluster1/zone1/hadoop
8. Set the HDFS block size used for reading from Isilon.
isiloncluster1-1# isi hdfs settings modify --default-block-size 128M
9. Create an indicator file so that we can easily determine when we are looking at your
Isilon cluster via HDFS.
isiloncluster1-1# touch \
/ifs/isiloncluster1/zone1/hadoop/THIS_IS_ISILON_isiloncluster1_zone1
10. Copy the scripts (isilon_create_users.sh & isilon_create_directories.sh) you
downloaded from http://www.github.com/bonibruno/BigInsights to Isilon.
[user@workstation ~]$ scp isilon_create_*.sh \
root@isilon_node_ip_address:/ifs/isiloncluster1/scripts
11. Execute the script isilon_create_users.sh. This script will create all required
users and groups for IBM BigInsights v 4.0.
Warning: The script isilon_create_users.sh will create local user and group accounts on
your Isilon cluster for Hadoop services. If you are using a directory service such as Active
Directory and you want these users and groups to be defined in your directory service,
then DO NOT run this script.
Instead, refer to the OneFS documentation and EMC Isilon Best Practices for Hadoop Data
Storage.
Script Usage:
isilon_create_users.sh --dist <DIST> [--startgid <GID>] [--startuid <UID>] [--zone <ZONE>]
dist - This will correspond to your Hadoop distribution - bi4.0
startgid - Group IDs will begin with this value. For example: 1000
startuid - User IDs will begin with this value. This is generally the same as startgid. For
example: 1000
zone - Access Zone name. For example: zone1
isiloncluster1-1# bash /ifs/isiloncluster1/scripts/isilon_create_users.sh \
--dist bi4.0 --startgid 1000 --startuid 1000 --zone zone1
Example output of script is shown below:
Info: Hadoop distribution: bi
Info: groups will start at GID 1000
Info: users will start at UID 1000
Info: will put users in zone: zone1
Info: HDFS root: /ifs/isiloncluster1/hadoop
Failed to add member UID:1001 to group GROUP:hadoop: User is already in local group
SUCCESS -- Hadoop users created successfully!
Done!
______________________________________________________________________
Note: The “User is already in local group” message is expected. This user corresponds to
the hadoop user, which is already in the hadoop group.
______________________________________________________________________
12. Execute the script isilon_create_directories.sh. This script will create all
required directories with the appropriate ownership and permissions.
Script Usage:
isilon_create_directories.sh --dist <DIST> [--fixperm] [--zone <ZONE>]
dist - This will correspond to your Hadoop distribution – bi4.0
fixperm - Updates ownership and permissions on hadoop directories.
zone - Access Zone name. For example: zone1
isiloncluster1-1# bash /ifs/isiloncluster1/scripts/isilon_create_directories.sh \
--dist bi4.0 --fixperm --zone zone1
13. Map the hdfs user to the Isilon superuser. This will allow the hdfs user to chown
(change ownership of) all files during IBM BigInsights installation.
______________________________________________________________________
Warning: The command below will restart the HDFS service on Isilon to ensure that any
cached user mapping rules are flushed. This will temporarily interrupt any HDFS
connections coming from other Hadoop clusters.
______________________________________________________________________
isiloncluster1-1# isi zone zones modify --user-mapping-rules="hdfs=>root" --zone zone1
isiloncluster1-1# isi services isi_hdfs_d disable ; isi services isi_hdfs_d enable
The service ‘isi_hdfs_d’ has been disabled.
The service ‘isi_hdfs_d’ has been enabled.
Create DNS Records for Isilon
You will now create the required DNS records that will be used to access your Isilon
cluster.
1. Create a delegation record so that DNS requests for the zone
isiloncluster1.example.com are delegated to the Service IP that will be defined on
your Isilon cluster. The Service IP can be any unused static IP address in your lab
subnet.
2. Create a CNAME alias for your Isilon SmartConnect zone. For example, create a
CNAME record for mycluster1-hdfs.example.com that targets
subnet0-pool0.isiloncluster1.example.com.
3. Test name resolution.
[user@workstation ~]$ ping mycluster1-hdfs.example.com
PING subnet0-pool0.isiloncluster1.example.com (10.11.12.13) 56(84) bytes of data.
64 bytes from 10.11.12.13: icmp_seq=1 ttl=64 time=1.15 ms
Prepare Linux Compute Nodes
Linux Operating System packages needed for IBM BigInsights:
1. Compatibility Libraries
2. Networking Tools
3. Perl Support
4. Ruby Support
5. Web Services add on
6. PHP Support
7. Web Server
8. MySQL*
9. PostgreSQL*
10. SNMP support
11. Development Tools
12. Korn Shell
Enable NTP on all Linux Compute nodes
1. Edit the /etc/ntp.conf file and add your NTP server.
2. Start NTP: service ntpd start
3. Enable NTP at boot: chkconfig --level 2345 ntpd on
Disable SELinux on each node if enabled before installing Ambari.
1. Edit /etc/selinux/config
2. Set SELINUX=disabled
3. Reboot
____________________________________________________________________
Note: SELinux can be disabled temporarily with the “setenforce 0” command.
____________________________________________________________________
Check UMASK Settings
The umask on each node should be set to 0022 in /etc/profile and /etc/bashrc.
If a umask entry already exists, modify it as needed, e.g. "umask 0022".
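A quick way to confirm the umask entry on a node is to search the two files for an active setting. The sketch below is one possible check, not an IBM or EMC tool; the function name has_umask_0022 is illustrative, and it accepts both "umask 022" and "umask 0022" because the two forms are equivalent.

```shell
#!/usr/bin/env bash
# Succeed if FILE contains an active (uncommented) "umask 022" or "umask 0022" line.
# Leading whitespace is allowed because the entry is often nested inside an if block.
has_umask_0022() {
  grep -Eq '^[[:space:]]*umask[[:space:]]+0?022[[:space:]]*$' "$1"
}

# Report on the two files named above, skipping any that are not readable.
for f in /etc/profile /etc/bashrc; do
  [ -r "$f" ] || continue
  if has_umask_0022 "$f"; then
    echo "$f: umask 0022 found"
  else
    echo "$f: umask 0022 not found"
  fi
done
```

Running umask in a fresh login shell remains the definitive test, since other startup files can override the value.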
Set ulimit Properties
1. Edit /etc/security/limits.d/90-nproc.conf
#set for all users
* hard nofile 65536
* soft nofile 65536
* hard nproc 65536
* soft nproc 65536
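Because a limits file needs both a soft and a hard entry for each item, a small check can catch a missing line before installation. The following sketch assumes the wildcard (*) domain used above; the function name check_limits is illustrative.

```shell
#!/usr/bin/env bash
# Report any missing soft/hard entries for nofile and nproc in a limits file.
# Prints one "missing: <type> <item>" line per absent entry.
# Returns 0 only when all four entries are present.
check_limits() {
  local file=$1 missing=0 type item
  for item in nofile nproc; do
    for type in soft hard; do
      if ! grep -Eq "^[[:space:]]*\*[[:space:]]+${type}[[:space:]]+${item}[[:space:]]+[0-9]+" "$file"; then
        echo "missing: ${type} ${item}"
        missing=1
      fi
    done
  done
  return "$missing"
}

# Check the file edited above, if it exists on this node.
conf=/etc/security/limits.d/90-nproc.conf
if [ -r "$conf" ]; then
  if check_limits "$conf"; then
    echo "$conf: all four entries present"
  fi
fi
```

After logging in again, ulimit -n and ulimit -u show the limits that are actually in effect for the session.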
Kernel Modifications
1. Edit /etc/sysctl.conf and add the following:
vm.swappiness=5
kernel.pid_max=4194303
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
net.ipv4.ip_local_port_range = 1024 64000
Create IBM BigInsights Hadoop Users and Groups
Create required users on all Linux nodes. It is recommended to create all Hadoop users
before installing IBM BigInsights. Use the bi_create_users.sh script obtained from:
http://www.github.com/bonibruno/BigInsights
[user@workstation ~]$ scp bi_create_users.sh [node1]:/root
Run the script, e.g. # ./bi_create_users.sh
Repeat above for all nodes.
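Repeating the copy and run for every node can be scripted. The sketch below is one way to do it and defaults to a dry run that only prints the commands; the helper names run and create_users_on_nodes, the DRY_RUN variable, and the host names are assumptions for illustration. It also assumes root SSH access to each node.

```shell
#!/usr/bin/env bash
# With DRY_RUN=1 (the default here) commands are printed instead of executed.
DRY_RUN=${DRY_RUN:-1}

# Run a command, or print it when in dry-run mode.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Copy bi_create_users.sh to /root on each node given as an argument and run it there.
create_users_on_nodes() {
  local node
  for node in "$@"; do
    run scp bi_create_users.sh "root@${node}:/root"
    run ssh "root@${node}" "cd /root && bash ./bi_create_users.sh"
  done
}

# Example host names; replace with your own compute nodes.
create_users_on_nodes host1.example.com host2.example.com
```

Set DRY_RUN=0 in the environment once the printed commands look right for your environment.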
Configure Passwordless SSH
Configure passwordless SSH for all Linux nodes.
1. Create Authentication SSH Keys
ssh-keygen -f id_rsa -t rsa -N ""
2. Create .ssh directories on all nodes
ssh root@[node1]
mkdir -p .ssh
cd .ssh
Upload generated keys to all hosts:
cat id_rsa.pub | ssh root@[node1] 'cat >> .ssh/authorized_keys'
Repeat above for all nodes.
3. Set permissions on .ssh directory
ssh root@[node1] "chmod 700 .ssh; chmod 640 .ssh/authorized_keys"
Additional Linux Packages to Install
Install the following packages on all Linux compute nodes.
deltarpm
python-deltarpm
createrepo
pam-1.1.1-17.el6.i686.rpm
mysql-connector-java-5.1.17-6.el6.noarch.rpm
ksh
nc
libdbi
libstdc
libaio
java-1.7.0-openjdk-devel
python-paramiko
python-rrdtool-1.4.5-1.el6.rfx.x86_64
snappy-1.0.5-1.el6.x86_64
web-ui-framework
Install the above packages using the yum install command.
Test DNS Resolution
Make sure all compute nodes resolve with a fully qualified domain name.
Ping each host with the associated FQDN and make sure it is reachable by FQDN.
Edit sudoers file on all Linux compute nodes.
1. Edit /etc/sudoers
## Additions needed for IBM BigInsights
hadoop ALL=(ALL) NOPASSWD: ALL
bigsql ALL=(ALL) NOPASSWD: ALL
Check IBM’s BigInsights Website for more info on preparing Linux nodes.
http://www01.ibm.com/support/knowledgecenter/SSPT3X_4.0.0/com.ibm.swg.im.infosphere.biginsights.install.doc/doc/install_prepare.html
Installing IBM Open Platform (OP)
Download IBM Open Platform Software
Log into the IBM Passport Advantage web portal with your IBM assigned credentials and
download the following packages onto the designated Ambari server node:
• BI-AH-1.0.0.1-IOP-4.0.x86_64.bin
• IOP-4.0.0.0.x86_64.rpm
• iop-4.0.0.0.x86_64.tar.gz
• iop-utils-1.0-iop-4.0.x86_64.tar.gz
Create IBM Open Platform Repository
The IBM Open Platform with Apache Hadoop uses the repository-based Ambari installer.
You have two options for specifying the location of the repository from which Ambari
obtains the component packages.
The IBM Open Platform with Apache Hadoop installation includes OpenJDK 1.7.0. During
installation, you can either install the version provided or make sure Java™ 7 is installed
on all nodes in the cluster.
1. Log in to your Linux cluster as root, or as a user with root privileges.
2. Ensure that the nc package is installed on all nodes:
yum install -y nc
If you installed the Basic Server option on your server, the nc package might not be
installed, which might cause the IBM Open Platform with Apache Hadoop data nodes to
fail.
3. Locate the IOP-4.0.0.0.x86_64.rpm file you downloaded from the download site. Run the
following command to install the ambari.repo file into /etc/yum.repos.d:
yum install IOP-4.0.0.0.x86_64.rpm
If using a mirror repository, edit the file /etc/yum.repos.d/ambari.repo and replace
baseurl=http://ibm-open-platform.ibm.com/repos/Ambari/RHEL6/x86_64/1.7
with your mirror URL. For example,
baseurl=http://<web.server>/repos/Ambari/RHEL6/x86_64/1.7/
Disable the gpgcheck in the ambari.repo file. To disable signature validation,
change gpgcheck=1 to gpgcheck=0.
Alternatively, you can keep gpgcheck on and change the public key file location to the
mirror Ambari repository. To do this, change the following
gpgkey=http://ibm-open-platform.ibm.com/repos/Ambari/RHEL6/x86_64/1.7/BI-GPG-KEY.public
to the following:
gpgkey=http://<web.server>/repos/Ambari/RHEL6/x86_64/1.7/BI-GPG-KEY.public
4. Clean the yum cache on each node so that the right packages from the remote repository
are seen by your local yum.
>sudo yum clean all
5. Install the Ambari server on the intended management node, using the following
command:
>sudo yum install ambari-server
Accept the install defaults.
6. If you are using a mirror repository, after you install the Ambari server, update the
following file with the mirror repository URLs.
/var/lib/ambari-server/resources/stacks/BigInsights/4.0/repos/repoinfo.xml
In the file, change the information from the Original content to the Modified content
Original content:
<os type="redhat6">
<repo>
<baseurl>http://ibm-open-platform.ibm.com/repos/IOP/RHEL6/x86_64/4.0</baseurl>
<repoid>IOP-4.0</repoid>
<reponame>IOP</reponame>
</repo>
<repo>
<baseurl>http://ibm-open-platform.ibm.com/repos/IOP-UTILS/RHEL6/x86_64/1.0</baseurl>
<repoid>IOP-UTILS-1.0</repoid>
<reponame>IOP-UTILS</reponame>
</repo>
</os>
Modified content:
<os type="redhat6">
<repo>
<baseurl>http://<web.server>/repos/IOP/RHEL6/x86_64/4.0</baseurl>
<repoid>IOP-4.0</repoid>
<reponame>IOP</reponame>
</repo>
<repo>
<baseurl>http://<web.server>/repos/IOP-UTILS/RHEL6/x86_64/1.0</baseurl>
<repoid>IOP-UTILS-1.0</repoid>
<reponame>IOP-UTILS</reponame>
</repo>
</os>
Edit the /etc/ambari-server/conf/ambari.properties file. Change the information from the
Original content to the Modified content.
Original content:
jdk1.7.url=http://ibm-open-platform.ibm.com/repos/IOP-UTILS/RHEL6/x86_64/1.0/openjdk/jdk-1.7.0.tar.gz
Modified content:
jdk1.7.url=http://<web.server>/repos/IOP-UTILS/RHEL6/x86_64/1.0/openjdk/jdk-1.7.0.tar.gz
7. Set up the Ambari server, using the following command:
>sudo ambari-server setup
Accept the setup preferences.
A Java JDK is installed as part of the Ambari server setup. However, the Ambari server
setup also allows you to reuse an existing JDK. The command is:
ambari-server setup -j /full/path/to/JDK
The JDK path set by the -j parameter must be the same on each node in the cluster.
8. Start the Ambari server, using the following command:
>sudo ambari-server start
9. If the Ambari server had been installed on your node previously, the node may contain
old cluster information. Reset the Ambari server to clean up its cluster information in the
database, using the following commands:
>sudo ambari-server stop
>sudo ambari-server reset
>sudo ambari-server start
10. Access the Ambari web user interface from a web browser by using the server name
(the fully qualified domain name, or the short name) on which you installed the software,
and port 8080. For example, enter abc.com:8080.
You can use any available port other than 8080 that will allow you to connect to the
Ambari server. In some networks, port 8080 is already in use. To use another port, do
the following:
a. Edit the ambari.properties file:
vi /etc/ambari-server/conf/ambari.properties
b. Add a line in the file to select another port:
client.api.port=8081
c. Save the file and restart the Ambari server:
ambari-server restart
11. Log in to the Ambari server with the default username and password: admin/admin.
The default username and password is required only for the first login. You can
configure users and groups after the first login to the Ambari web interface.
12. On the Welcome page, click Launch Install Wizard.
13. On the Get Started page, enter a name for the cluster you want to create. The name
cannot contain blank spaces or special characters. Click Next.
14. You will deploy IBM Open Platform for Apache Hadoop with EMC Isilon. Ambari Server
allows for the immediate usage of an Isilon cluster for all HDFS services (NameNode and
DataNode), so no reconfiguration will be necessary once the IBM Open Platform install is
completed.
1. SSH into Isilon as root and configure the Ambari Agent.
isiloncluster1-1# isi zone zones modify zone1 \
--hdfs-ambari-namenode mycluster1-hdfs.example.com
isiloncluster1-1# isi zone zones modify zone1 \
--hdfs-ambari-server manager-svr-1.example.com
15. On the Select Stack page, click the Stack version you want to install (BigInsights™ 4.0).
Click Next.
16. On the Install Options page, in Target Hosts, list the Linux hosts that the
Ambari server will manage and on which the IBM Open Platform with Apache Hadoop
software will be deployed, one node per line. For example, enter
host1.example.com
host2.example.com
host3.example.com
host4.example.com
In Host Registration Information, select one of the two options:
Provide the SSH Private Key to automatically register hosts
Click SSH Private Key. The private key file is /root/.ssh/id_rsa, where the root user
installed the Ambari server. Click Choose File to find the private key file you installed
previously. You should have retained a copy of the SSH private key (.ssh/id_rsa) in your
local directory when you set up password-less SSH. Copy and paste the key into the text
box manually. Click the Register and Confirm button.
____________________________________________________________________
Note: After the Linux hosts register, click the back button and select Perform manual
registration for Isilon. Do not use SSH for Isilon.
____________________________________________________________________
Isilon has an ambari-agent within OneFS and needs to be manually registered in Ambari.
After registering Isilon manually, click the Next button. You should see the Ambari
agents on both your Linux hosts and Isilon become registered.
17. On the Confirm Hosts page, check that the correct hosts for your cluster have been
located and that those hosts have the correct directories, packages, and processes to
continue the installation.
If hosts were selected in error, click the check boxes next to the hosts you want to
remove. Click Remove Selected. To remove a single host, click Remove in
the Action column.
If warnings are found during the check process, you can click Click here to see the
warnings to see what caused the warnings. The Host Checks page identifies any issues
with the hosts. For example, a host may have Transparent Huge Pages or Firewall issues.
You can ignore errors related to user names and groups as we pre-created the
users in the pre-installation steps of this document.
After you resolve the issues, click Rerun Checks on the Host Checks page. When you
have confirmed the hosts, click Next.
18. On the Choose Services page, select the services you want to install.
Ambari shows a confirmation message to install the required service dependencies. For
example, when selecting Oozie only, the Ambari web interface shows messages for
accepting YARN/MR2, HDFS and Zookeeper installations. It also shows Nagios and
Ganglia for monitoring and alerting, but they are not required services.
19. On the Assign Masters page, assign NameNode and SNameNode components to the
Isilon SmartConnect address e.g. mycluster1-hdfs.example.com. The rest of the services
can be deployed per the recommended services layout - refer back to Table 1. Make
sure you assign NameNode and SNameNode only to the Isilon SmartConnect
address and none of the Linux nodes, e.g. only mycluster1-hdfs.example.com. Click
Next.
On the Assign Slaves and Clients page, assign the components to Linux hosts in your
cluster and make sure DataNode is assigned only to Isilon.
Assign Client to the client nodes. Click Next.
Tip: If you anticipate adding the Big SQL service at some later time, you must include all
clients on all the anticipated Big SQL worker nodes. Big SQL specifically needs the HDFS,
Hive, HBase, Sqoop, HCat, and Oozie clients.
20. On the Customize Services page, select configuration settings for the services selected.
Default values are filled in automatically when available and they are the recommended
values. The installation wizard prompts you for required fields (such as password entries)
by displaying a number in a circle next to an installed service.
Assign passwords to Hive, Oozie, and any other selected services that require them.
The following settings should be checked:
• YARN Node Manager log-dirs
• YARN Node Manager local-dirs
• HBase local directory
• ZooKeeper directory
• Oozie Data Dir
• Storm storm.local.dir
Click the number and enter the requested information in the field outlined in red. Make
sure that the service port that is set is not already used by another component. For
example, the Knox gateway port is, by default, set as 8443. But, when the Ambari server
is set up with HTTPs, and the SSL port is set up using 8443, then you must change the
Knox gateway port to some other value.
____________________________________________________________________
Note: If you are working in an LDAP environment where users are set up centrally by the
LDAP administrator and therefore already exist, selecting the defaults can cause the
installation to fail. Open the Misc tab, and check the box to ignore user modification
errors.
____________________________________________________________________
21. When you have completed the configuration of the services, click Next.
22. On the Review page, verify that your settings are correct. Click Deploy.
23. The Install, Start, and Test page shows the progress of the installation. The progress
bar at the top of the page gives the overall status while the main section of the page
gives the status for each host. Logs for a specific task can be displayed by clicking on the
task. Click the link in the Message column to find out what tasks have been completed for
a specific host or to see the warnings that have been encountered. When the message
"Successfully installed and started the services" appears, click Next.
24. On the Summary page, review the accomplished tasks. Click Complete to go to the IBM
Open Platform with Apache Hadoop dashboard.
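The hand edits described in steps 3, 6, and 10 above (pointing a repository file at a mirror, and adding a line to ambari.properties) can also be scripted. The following is a minimal sketch under stated assumptions: it relies on GNU sed, the function names use_mirror and set_property are illustrative, and web.server.example.com stands in for your own mirror host. The demo works on scratch files so that nothing under /etc is touched.

```shell
#!/usr/bin/env bash
# Point the baseurl= and gpgkey= lines of a yum .repo file at a mirror host.
# Usage: use_mirror FILE MIRROR_HOST
use_mirror() {
  local file=$1 mirror=$2
  sed -i -E "s#^((baseurl|gpgkey)=http://)ibm-open-platform\.ibm\.com/#\1${mirror}/#" "$file"
}

# Set KEY=VALUE in a properties file: replace the line if KEY exists, append otherwise.
# Usage: set_property FILE KEY VALUE
set_property() {
  local file=$1 key=$2 value=$3
  local pattern=${key//./\\.}   # escape dots so they match literally
  if grep -q "^${pattern}=" "$file"; then
    sed -i "s|^${pattern}=.*|${key}=${value}|" "$file"
  else
    echo "${key}=${value}" >> "$file"
  fi
}

# Demo on scratch copies.
work=$(mktemp -d)
printf '%s\n' \
  'baseurl=http://ibm-open-platform.ibm.com/repos/Ambari/RHEL6/x86_64/1.7' \
  'gpgkey=http://ibm-open-platform.ibm.com/repos/Ambari/RHEL6/x86_64/1.7/BI-GPG-KEY.public' \
  > "$work/ambari.repo"
: > "$work/ambari.properties"

use_mirror "$work/ambari.repo" web.server.example.com
set_property "$work/ambari.properties" client.api.port 8081

cat "$work/ambari.repo" "$work/ambari.properties"
rm -rf "$work"
```

To apply the same functions to the real files, pass /etc/yum.repos.d/ambari.repo or /etc/ambari-server/conf/ambari.properties instead of the scratch paths, keep a backup copy first, and restart the Ambari server after changing ambari.properties.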
Validating IBM Open Platform Install
Ambari provides service checks for all the supported services. These checks run
automatically after each service installation, or they can be run manually at any time. You
can access the Ambari web interface and use the Services View to make sure all the
components pass their checks successfully.
The following steps provide another way to validate your installation.
1. As the root user on a node on which Apache Hadoop is installed, enter the following
command to become the ambari-qa user:
su - ambari-qa
2. As the ambari-qa user, run the following command:
export HADOOP_MR_DIR=/usr/iop/current/hadoop-mapreduce-client
# Generate data with 1000 rows. Each row is about 100 bytes.
yarn jar $HADOOP_MR_DIR/hadoop-mapreduce-examples.jar teragen 1000 /tmp/tgout
# Sort data
yarn jar $HADOOP_MR_DIR/hadoop-mapreduce-examples.jar terasort /tmp/tgout /tmp/tsout
# Validate data
yarn jar $HADOOP_MR_DIR/hadoop-mapreduce-examples.jar teravalidate /tmp/tsout /tmp/tvout
If the job is successful, you will see a log record similar to the following:
INFO mapreduce.Job: Job job_id completed successfully
Browse to your cluster on port 8088 to see the results of your validation tests, e.g.
http://x.x.x.x:8088/cluster.
[Figure: example YARN test results]
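If you capture the output of each validation job to a file, the success record quoted above can be checked automatically. This is a small sketch rather than an Ambari or Hadoop feature; the function name job_succeeded and the log file are illustrative, and the sample line is the record quoted in the text.

```shell
#!/usr/bin/env bash
# Succeed if a captured job log contains the MapReduce success record.
job_succeeded() {
  grep -Eq 'INFO mapreduce\.Job: Job [^ ]+ completed successfully' "$1"
}

# Demo with a scratch log holding the record quoted above.
# On a real cluster the log could be captured with, for example:
#   yarn jar $HADOOP_MR_DIR/hadoop-mapreduce-examples.jar teragen 1000 /tmp/tgout 2>&1 | tee teragen.log
log=$(mktemp)
echo 'INFO mapreduce.Job: Job job_id completed successfully' > "$log"

if job_succeeded "$log"; then
  echo "job completed successfully"
else
  echo "success record not found"
fi
rm -f "$log"
```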
Adding a Hadoop User
You must add a user account for each Linux user that will submit MapReduce jobs. The
procedure below can be used to add a user named hduser1 as an example.
1. Add user to Isilon.
isiloncluster1-1# isi auth groups create hduser1 --zone zone1 --provider local
isiloncluster1-1# isi auth users create hduser1 --primary-group hduser1 --zone zone1 \
--provider local --home-directory /ifs/isiloncluster1/zone1/hadoop/user/hduser1
2. Add user to Hadoop nodes.
[root@mycluster1-master-0 ~]# adduser hduser1
3. Create the user’s home directory on HDFS.
[root@mycluster1-master-0 ~]# sudo -u hdfs hdfs dfs -mkdir -p /user/hduser1
[root@mycluster1-master-0 ~]# sudo -u hdfs hdfs dfs -chown hduser1:hduser1 \
/user/hduser1
[root@mycluster1-master-0 ~]# sudo -u hdfs hdfs dfs -chmod 755 /user/hduser1
Additional Service Tests
The tests below should be performed to ensure a proper installation. Perform the tests in the
order shown. You must create the Hadoop user hduser1 before proceeding.
HDFS
[root@mycluster1-master-0 ~]# sudo -u hdfs hdfs dfs -ls /
Found 5 items
-rw-r--r-- 1 root hadoop 0 2014-08-05 05:59 /THIS_IS_ISILON
drwxr-xr-x - hbase hbase 148 2014-08-05 06:06 /hbase
drwxrwxr-x - solr solr 0 2014-08-05 06:07 /solr
drwxrwxrwt - hdfs supergroup 107 2014-08-05 06:07 /tmp
drwxr-xr-x - hdfs supergroup 184 2014-08-05 06:07 /user
[root@mycluster1-master-0 ~]# sudo -u hdfs hdfs dfs -put -f /etc/hosts /tmp
[root@mycluster1-master-0 ~]# sudo -u hdfs hdfs dfs -cat /tmp/hosts
127.0.0.1 localhost
[root@mycluster1-master-0 ~]# sudo -u hdfs hdfs dfs -rm -skipTrash /tmp/hosts
[root@mycluster1-master-0 ~]# su - hduser1
[hduser1@mycluster1-master-0 ~]$ hdfs dfs -ls /
Found 5 items
-rw-r--r-- 1 root hadoop 0 2014-08-05 05:59 /THIS_IS_ISILON
drwxr-xr-x - hbase hbase 148 2014-08-05 06:28 /hbase
drwxrwxr-x - solr solr 0 2014-08-05 06:07 /solr
drwxrwxrwt - hdfs supergroup 107 2014-08-05 06:07 /tmp
drwxr-xr-x - hdfs supergroup 209 2014-08-05 06:39 /user
[hduser1@mycluster1-master-0 ~]$ hdfs dfs -ls
...
YARN/MAPREDUCE
[hduser1@mycluster1-master-0 ~]$ hadoop jar \
/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
pi 10 1000
...
Estimated value of Pi is 3.14000000000000000000
[hduser1@mycluster1-master-0 ~]$ hadoop fs -mkdir in
You can put any file into the in directory. It will be used as the data source for subsequent tests.
[hduser1@mycluster1-master-0 ~]$ hadoop fs -put -f /etc/hosts in
[hduser1@mycluster1-master-0 ~]$ hadoop fs -ls in
...
[hduser1@mycluster1-master-0 ~]$ hadoop fs -rm -r out
[hduser1@mycluster1-master-0 ~]$ hadoop jar \
/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
wordcount in out
...
[hduser1@mycluster1-master-0 ~]$ hadoop fs -ls out
Found 4 items
-rw-r--r-- 1 hduser1 hduser1 0 2014-08-05 06:44 out/_SUCCESS
-rw-r--r-- 1 hduser1 hduser1 24 2014-08-05 06:44 out/part-r-00000
-rw-r--r-- 1 hduser1 hduser1 0 2014-08-05 06:44 out/part-r-00001
-rw-r--r-- 1 hduser1 hduser1 0 2014-08-05 06:44 out/part-r-00002
[hduser1@mycluster1-master-0 ~]$ hadoop fs -cat out/part*
localhost 1
127.0.0.1 1
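The wordcount result can be cross-checked against the local copy of the input file. The following is a minimal sketch, not part of the original procedure; the function name local_wordcount is hypothetical, and it assumes that splitting on whitespace matches how the example job tokenizes its input.

```shell
#!/bin/sh
# Count whitespace-separated words in a local file and print "word<TAB>count",
# the same two-column layout shown by `hadoop fs -cat out/part*`.
local_wordcount() {
    tr -s '[:space:]' '\n' < "$1" |    # put each word on its own line
        grep -v '^$' |                 # drop empty lines
        sort | uniq -c |               # count occurrences of each word
        awk '{ print $2 "\t" $1 }'     # reorder columns to: word, count
}

# Same input file that was uploaded to the in directory above.
[ -r /etc/hosts ] && local_wordcount /etc/hosts
```

The row order can differ from the MapReduce output because the job writes one part file per reducer; compare the word/count pairs rather than the ordering.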
Browse to the YARN Resource Manager GUI at http://mycluster1-master-0.example.com:8088/.
Browse to the MapReduce History Server GUI at http://mycluster1-master-0.lab.example.com:19888/.
In particular, confirm that you can view the complete logs for task attempts.
HIVE
[hduser1@mycluster1-master-0 ~]$ hadoop fs -mkdir -p sample_data/tab1
[hduser1@mycluster1-master-0 ~]$ cat - > tab1.csv
1,true,123.123,2012-10-24 08:55:00
2,false,1243.5,2012-10-25 13:40:00
3,false,24453.325,2008-08-22 09:33:21.123
4,false,243423.325,2007-05-12 22:32:21.33454
5,true,243.325,1953-04-22 09:11:33
Type <Control+D>.
[hduser1@mycluster1-master-0 ~]$ hadoop fs -put -f tab1.csv sample_data/tab1
[hduser1@mycluster1-master-0 ~]$ hive
hive>
DROP TABLE IF EXISTS tab1;
CREATE EXTERNAL TABLE tab1
(
id INT,
col_1 BOOLEAN,
col_2 DOUBLE,
col_3 TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hduser1/sample_data/tab1';
DROP TABLE IF EXISTS tab2;
CREATE TABLE tab2
(
id INT,
col_1 BOOLEAN,
col_2 DOUBLE,
month INT,
day INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
INSERT OVERWRITE TABLE tab2
SELECT id, col_1, col_2, MONTH(col_3), DAYOFMONTH(col_3)
FROM tab1 WHERE YEAR(col_3) = 2012;
...
OK
Time taken: 28.256 seconds
hive> show tables;
OK
tab1
tab2
Time taken: 0.889 seconds, Fetched: 2 row(s)
hive> select * from tab1;
OK
1 true 123.123 2012-10-24 08:55:00
2 false 1243.5 2012-10-25 13:40:00
3 false 24453.325 2008-08-22 09:33:21.123
4 false 243423.325 2007-05-12 22:32:21.33454
5 true 243.325 1953-04-22 09:11:33
Time taken: 1.083 seconds, Fetched: 5 row(s)
hive> select * from tab2;
OK
1 true 123.123 10 24
2 false 1243.5 10 25
Time taken: 0.094 seconds, Fetched: 2 row(s)
hive> select * from tab1 where id=1;
OK
1 true 123.123 2012-10-24 08:55:00
Time taken: 15.083 seconds, Fetched: 1 row(s)
hive> select * from tab2 where id=1;
OK
1 true 123.123 10 24
Time taken: 13.094 seconds, Fetched: 1 row(s)
hive> exit;
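The sample file for the Hive test can also be created without an interactive cat session, and the expected contents of tab2 can be previewed locally before running the INSERT. This is a sketch only; the function names write_tab1_csv and preview_2012_rows are hypothetical, and the awk filter merely imitates the WHERE YEAR(col_3) = 2012 clause on the raw text.

```shell
#!/bin/sh
# Write the five sample rows from the Hive test to the given path.
write_tab1_csv() {
    cat > "$1" <<'EOF'
1,true,123.123,2012-10-24 08:55:00
2,false,1243.5,2012-10-25 13:40:00
3,false,24453.325,2008-08-22 09:33:21.123
4,false,243423.325,2007-05-12 22:32:21.33454
5,true,243.325,1953-04-22 09:11:33
EOF
}

# Print id,col_1,col_2,month,day for rows whose timestamp falls in 2012.
preview_2012_rows() {
    awk -F, '{
        split($4, d, /[- ]/)    # d[1] = year, d[2] = month, d[3] = day
        if (d[1] == 2012) print $1 "," $2 "," $3 "," (d[2] + 0) "," (d[3] + 0)
    }' "$1"
}

write_tab1_csv tab1.csv && preview_2012_rows tab1.csv
```

The two rows printed should correspond to the rows returned by select * from tab2 in the session above.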
HBASE
[hduser1@mycluster1-master-0 ~]$ hbase shell
hbase(main):001:0> create 'test', 'cf'
0 row(s) in 3.3680 seconds
=> Hbase::Table - test
hbase(main):002:0> list 'test'
TABLE
test
1 row(s) in 0.0210 seconds
=> ["test"]
hbase(main):003:0> put 'test', 'row1', 'cf:a', 'value1'
0 row(s) in 0.1320 seconds
hbase(main):004:0> put 'test', 'row2', 'cf:b', 'value2'
0 row(s) in 0.0120 seconds
hbase(main):005:0> scan 'test'
ROW COLUMN+CELL
row1 column=cf:a,timestamp=1407542488028,value=value1
row2 column=cf:b,timestamp=1407542499562,value=value2
2 row(s) in 0.0510 seconds
hbase(main):006:0> get 'test', 'row1'
COLUMN CELL
cf:a timestamp=1407542488028,value=value1
1 row(s) in 0.0240 seconds
hbase(main):007:0> quit
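The same HBase commands can be collected in a script file and passed to hbase shell, which is convenient when the test has to be repeated. The following sketch assumes that the hbase command is on the PATH of the test user; the file name hbase_smoke_test.txt is hypothetical.

```shell
#!/bin/sh
# Hypothetical location for the command file.
script="${TMPDIR:-/tmp}/hbase_smoke_test.txt"

# The commands from the interactive session above, ending with exit so that
# the shell does not stay open after the script finishes.
cat > "$script" <<'EOF'
create 'test', 'cf'
list 'test'
put 'test', 'row1', 'cf:a', 'value1'
put 'test', 'row2', 'cf:b', 'value2'
scan 'test'
get 'test', 'row1'
exit
EOF

# Run the commands only where HBase is actually installed.
if command -v hbase >/dev/null 2>&1; then
    hbase shell "$script"
else
    echo "hbase not found on PATH; commands written to $script"
fi
```

Note that the script uses plain ASCII single quotes; typographic quotes copied from a formatted document are rejected by the HBase shell.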
Ambari Service Check
Ambari has built-in functional tests for each component. These are executed automatically
when you install your cluster with Ambari. To execute them after installation, select the service
in Ambari, click the Service Actions button, and select Run Service Check.
Installing IBM Value Packages
Before You Begin
Please note that the “BigInsights Analyst” and “BigInsights Data Scientist” value packages have been
sanity tested on EMC Isilon, but have not been performance profiled or tested under load with
Isilon version 7.2.0.3. EMC and IBM BigInsights plan to validate these components under load
as part of future integration efforts. Please refer to the EMC – IBM BigInsights Joint Support
Statement for further details.
You must acquire the software from Passport Advantage. The acquired software has a *.bin
extension. The name of the *.bin file depends on whether the BigInsights Analyst or the
BigInsights Data Scientist module was downloaded.
When you run the *.bin file, configuration files are copied to the appropriate locations so
that Ambari sees the value-add services as available. When you add the value-add
services through Ambari, additional software packages can be downloaded. If the
Hadoop cluster cannot directly access the internet, a local mirror repository can be
created.
Where you perform the following steps depends on whether the Hadoop cluster has
direct internet access.
If the Hadoop cluster has direct access to the internet, perform the steps from the
Ambari server of the Hadoop cluster.
If the Hadoop cluster does not have direct internet access, perform the steps from
a Linux host with direct internet access. Then, transfer the files, as required, to a
local repository mirror.
Installation Procedure
1. Update the permissions on the downloaded *.bin file to enable execute.
chmod +x <package_name>.bin
2. Run the *.bin file to extract and install the services in the module.
./<package_name>.bin
where <package_name> is BI-Analyst-xxxxx.bin for the Analyst module or BI-DS-
xxxxx.bin for the Data Scientist module.
3. After the prompt, agree to the license terms. Reply yes or y to continue the installation.
4. After the prompt, choose if you want to do an online (option 1) or offline
(option 2) install.
a. Online install will lay out the Ambari service configuration files and
update the repository locations in the Ambari server file. Skip to step 6.
b. Offline install initiates a download of files to set up a local repository
mirror. A subdirectory called BigInsights is created, and the RPMs and
associated files are placed in the BigInsights/packages directory.
5. Set up a local repository.
A local repository is required if the Hadoop cluster cannot connect directly to the internet,
or if you wish to avoid multiple downloads of the same software when installing services
across multiple nodes. In the following steps, the host that performs the repository mirror
function is called the repository server. If you do not have an additional Linux host, you
can use one of the Hadoop management nodes. The repository server must be accessible
over the network by the Hadoop cluster. The repository server requires an HTTP web
server. The following instructions describe how to set up a repository server by using a
Linux host with an Apache HTTP server.
a. On the repository server, if the Apache HTTP server is not installed,
install it:
yum install httpd
b. On the repository server, ensure that the createrepo package is
installed.
c. On the repository server, create a directory for your value-add
repository, such as <mirror web server document
root>/repos/valueadds. For example, for Apache httpd, the default document root is
/var/www/html.
mkdir -p /var/www/html/repos/valueadds
d. By selecting Option 2 in step 4, RPMs were downloaded to a
subdirectory called BigInsights/packages. Copy all of the RPMs to the
mirror web server location, <your.mirror.web.server.document
root>/repos/valueadds directory.
cp BigInsights/packages/* /var/www/html/repos/valueadds/
e. Start this web server. If you use Apache httpd, start it by using either of
the following commands:
apachectl start or service httpd start
f. Test your local repository by browsing to the web directory:
http://<your.mirror.web.server>/repos/valueadds
You should see all of the files that you copied to the repository server.
g. On the repository server, run the createrepo command to initialize the
repository:
createrepo /var/www/html/repos/valueadds
h. In the BigInsights/packages directory, find the RPM to install on the
Ambari Server host of the Hadoop cluster:
BigInsights Analyst
BI-Analyst-X.X.X.X-IOP-X.X.x86_64.rpm
BigInsights Data Scientist
BI-DS-X.X.X.X-IOP-X.X.x86_64.rpm
Tip: The BigInsights Data Scientist module also entitles you to the features of the
BigInsights Analyst module. Therefore, consider doing the yum install for both of the RPM
packages.
Then, copy the file to the Ambari Server host and install the RPMs by using the following
commands:
sudo yum install <BI-xxx-1.0.0.1-IOP...>.rpm
i. On the Ambari Server node, navigate to the
/var/lib/ambari-server/resources/stacks/BigInsights/<version_number>/repos/repoinfo.xml
file. If the file does not exist, create it. Ensure that the <baseurl>
element for the BIGINSIGHTS-VALUEPACK <repo> entry points to your
repository server. Remember, there might be multiple <repo> sections.
Make sure that the URL you tested in step 5.f matches exactly the value
indicated in the <baseurl> element. For example, the repoinfo.xml
might look like the following content after you change
http://ibm-open-platform.ibm.com/repos/BigInsights-Valuepacks/ to become
http://your.mirror.web.server/repos/valueadds:
<repo>
<baseurl>http://<your.mirror.web.server>/repos/valueadds</baseurl>
<repoid>BIGINSIGHTS-VALUEPACK</repoid>
<reponame>BIGINSIGHTS-VALUEPACK</reponame>
</repo>
Note: The new <repo> section might appear as a single line.
Tip: If you later find an error in this configuration file, make corrections and run the
following command:
yum clean all
Then, restart the Ambari server.
j. When the module is installed, restart the Ambari server.
ambari-server restart
k. Open the Ambari web interface and log in. The default address is the
following URL:
http://<server-name>:8080
The default login name is admin and the default password is admin.
l. Click Actions > Add Service. In the list of services, you will see the services
that you previously added as well as the BigInsights services that
you can now add.
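The <repo> entry described in step 5.i can be generated from the mirror host name rather than typed by hand, which helps keep the <baseurl> identical to the URL tested in step 5.f. The following is a sketch under stated assumptions: the function name render_valuepack_repo and the host mirror.example.com are hypothetical, and the output still has to be placed in the correct <repo> position of repoinfo.xml manually.

```shell
#!/bin/sh
# Print the BIGINSIGHTS-VALUEPACK <repo> entry for a given mirror host.
# The /repos/valueadds path matches the directory created in step 5.c.
render_valuepack_repo() {
    mirror_host=$1
    cat <<EOF
<repo>
<baseurl>http://${mirror_host}/repos/valueadds</baseurl>
<repoid>BIGINSIGHTS-VALUEPACK</repoid>
<reponame>BIGINSIGHTS-VALUEPACK</reponame>
</repo>
EOF
}

# Example with a placeholder host name.
render_valuepack_repo mirror.example.com
```

After editing repoinfo.xml, run yum clean all and restart the Ambari server as described in the steps above.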
Select IBM BigInsights Service to Install
Select the service that you want to install and deploy. Even though your module might
contain multiple services, install the specific service that you want and the BigInsights™
Home service. Installing one value-add service at a time is recommended. Follow the
service specific installation instructions for more information.
At the conclusion of installing all the IBM BigInsights Services, the Ambari GUI Software
List should have green check marks next to each service as shown below:
Installing BigInsights Home
The BigInsights Home service is the main interface to launch BigInsights - BigSheets,
BigInsights - Text Analytics, and BigInsights - Big SQL.
The BigInsights Home service requires Knox to be installed, configured and started.
Open a browser and access the Ambari server dashboard. The following is the default URL:
http://<server-name>:8080
The default user name is admin, and the default password is admin.
In the Ambari dashboard, click Actions > Add Service.
In the Add Service Wizard > Choose Services, select the BigInsights – BigInsights Home
service. Click Next. If you do not see the option for BigInsights – BigInsights Home, follow the
instructions described in Installing the BigInsights value-add packages.
In the Assign Masters page, select a Management node (edge node) that your users can
communicate with. BigInsights Home is a web application that your users must be able to open
with a web browser.
In the Assign Slaves and Clients page, make selections to assign slaves and clients.
The nodes that you select will have JSqsh (an open source, command-line interface to SQL for
Big SQL and other database engines) and an SFTP client. Select nodes that might be used to ingest
data as an SFTP client, or where you might want to work with Big SQL scripts or other databases
interactively.
Click Next to review any options that you might want to customize.
Click Deploy.
If the BigInsights – BigInsights Home service fails to install, run the
remove_value_add_services.sh cleanup script. The following code is an example command:
cd /usr/ibmpacks/bin/<version>
./remove_value_add_services.sh -u admin -p admin -x 8080 -s WEBUIFRAMEWORK -r
For more information about cleaning the value-add service environment, see Removing
BigInsights value-add services.
After installation is complete, click Next > Complete.
Configure Knox
The Apache Knox gateway is a system that provides a single point of authentication and access
for Apache Hadoop services on the compute nodes in a cluster; however, authentication to HDFS
services is controlled entirely by Isilon OneFS.
The Knox gateway simplifies Hadoop security for users that access the cluster and execute jobs
and operators that control access and manage the cluster. The gateway runs as a server, or a
cluster of servers, providing centralized access to one or more Hadoop clusters.
In IBM® Open Platform with Apache Hadoop, Knox is a service that you start, stop, and
configure in the Ambari web interface.
Users access the following BigInsights™ value added components through Knox by going to the
IBM BigInsights home service.
https://<knox_host>:<knox_port>/<knox_gateway_path>/default/BigInsightsWeb/index.html
BigSheets
Text Analytics
Big SQL
Knox supports only REST API calls for the following Hadoop services:
WebHCat
Oozie
HBase
Hive
Yarn
Click the Knox service from the Ambari web interface to see the summary page.
Select Service Actions > Restart All to restart it and all of its components.
If you are using LDAP, you must also start LDAP if it is not already started.
Click the BigInsights Home service in the Ambari User Interface.
Select Service Actions > Restart All to restart it and all of its components.
Open the BigInsights Home page from a web browser.
The URL for BigInsights Home is:
https://<knox_host>:<knox_port>/<knox_gateway_path>/default/BigInsightsWeb/index.html
where:
knox_host
The host where Knox is installed and running
knox_port
The port where Knox is listening (by default this is 8443)
knox_gateway_path
The value entered in the gateway.path field in the Knox configuration (by default this is
'gateway')
For example, the URL might look like the following address:
https://myhost.company.com:8443/gateway/default/BigInsightsWeb/index.html
If you are using the Knox Demo LDAP, a default user ID and password are created for you. When
you access the web page, use the following preset credentials:
User Name = guest
Password = guest-password
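The BigInsights Home URL can be assembled from the three Knox values described above. The following sketch uses the documented defaults (port 8443, gateway path gateway) when the optional arguments are omitted; the function name biginsights_home_url is hypothetical.

```shell
#!/bin/sh
# Build the BigInsights Home URL from the Knox host, port, and gateway path.
# Port and gateway path fall back to the defaults stated in the text.
biginsights_home_url() {
    knox_host=$1
    knox_port=${2:-8443}
    knox_gateway_path=${3:-gateway}
    printf 'https://%s:%s/%s/default/BigInsightsWeb/index.html\n' \
        "$knox_host" "$knox_port" "$knox_gateway_path"
}

# Example using the host name from the text.
biginsights_home_url myhost.company.com
```

If you changed gateway.path or the Knox port in the Knox configuration, pass those values as the second and third arguments.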
Installing BigSheets
To extend the power of the Open Platform for Apache Hadoop, install and deploy the BigInsights
BigSheets service, which is the IBM spreadsheet interface for big data.
1. Open a browser and access the Ambari server dashboard. The following is the default
URL.
http://<server-name>:8080
The default user name is admin, and the default password is admin.
2. In the Ambari Dashboard, click Actions > Add Service.
3. In the Add Service Wizard, Choose Services, select the BigInsights -
BigSheets service, and if you have not already installed the BigInsights Home service,
select that as well. Click Next.
If you do not see BigInsights – BigSheets service, you need to install the appropriate
module and restart Ambari as described in Installing the BigInsights value-add packages.
4. In the Assign Masters page, decide on which node of your cluster you want to run the
specified BigSheets master.
5. In the Assign Slaves and Clients page, all of the defaults are accepted automatically
and the next page appears. The BigSheets service does not have any slaves or
clients, so the Assign Slaves and Clients page is shown and then skipped immediately
during installation. This is the expected behavior.
6. In the Customize Services page, accept the recommended configurations for the
BigSheets service, or customize the configuration by expanding the configuration files
and modifying the values. In the Advanced bigsheets-user-config section, make sure
that you enter the following information:
a. In the bigsheets.user field, leave the default user name, which is bigsheets.
b. In the bigsheets.password field, type a valid password.
c. In the bigsheets.userid field, type a valid user ID to use for the bigsheets service
user. This user ID is created across all of the nodes of the cluster, and must be
unique across all nodes of the cluster.
d. Click Next.
7. In the Advanced bigsheets-ambari-config section, in the ambari.password field,
type the correct Ambari administration password.
8. You can review your selections in the Review page before accepting them. If you want
to modify any values, click the Back button. If you are satisfied with your setup,
click Deploy.
9. In the Install, Start and Test page, the BigSheets service is installed and verified. If
you have multiple nodes, you can see the progress on each node. When the installation is
complete, either view the errors or warnings by clicking the link, or click Next to see a
summary and then the new service added to the list of services.
10. Click Complete.
If the BigInsights – BigSheets service fails to install, run
the remove_value_add_services.sh cleanup script. The following code is an example of
the command:
cd /usr/ibmpacks/bin/<version>
./remove_value_add_services.sh -u admin -p admin -x 8080 -s BIGSHEETS -r
For more information about cleaning the value-add service environment, see Removing
BigInsights value-add services.
11. After you install BigInsights - BigSheets, you must restart the HDFS, MapReduce2, YARN,
Knox, Nagios and Ganglia client services.
a. For each service that requires restart, select the service.
b. Click Service Actions.
c. Click Restart All.
12. Access the BigInsights - BigSheets service from the BigInsights Home service.
o If the BigInsights Home service has not yet been added, see Installing
BigInsights Home.
o If the BigInsights Home service has been installed, it must be restarted so
the BigInsights - BigSheets icon will display.
13. Launch the BigInsights Home service by typing the following address in your browser:
https://<knox_host>:<knox_port>/<knox_gateway_path>/default/BigInsightsWeb/index.html
Where:
knox_host
The host where Knox is installed and running
knox_port
The port where Knox is listening (by default this is 8443)
knox_gateway_path
The value entered in the gateway.path field in the Knox configuration (by default this is
'gateway')
For example, the URL might look like the following address:
https://myhost.company.com:8443/gateway/default/BigInsightsWeb/index.html
Installing Big SQL
To extend the power of the Open Platform for Apache Hadoop, install and deploy the BigInsights
- Big SQL service, which is the IBM SQL interface to the Hadoop-based platform, IBM Open
Platform with Apache Hadoop.
1. Open a browser and access the Ambari server dashboard. The following is the default
URL.
http://<server-name>:8080
The default user name is admin, and the default password is admin.
2. In the Ambari web interface, click Actions > Add Service.
3. In the Add Service Wizard, Choose Services, select the BigInsights - Big
SQL service, and the BigInsights Home service. Click Next.
If you do not see the option to select the BigInsights - Big SQL service, complete the
steps in Installing the BigInsights value-add packages.
4. In the Assign Masters page, decide which nodes of your cluster you want to run the
specified components, or accept the default nodes. Follow these guidelines:
o For the Big SQL monitoring and editing tool, make sure that the Data Server
Manager (DSM) is assigned to the same node that is assigned to the Big SQL Head
node.
5. Click Next.
6. In the Assign Slaves and Clients page, accept the defaults, or make specific
assignments for your nodes. Follow these guidelines:
o Select the non-head nodes for the Big SQL Worker components. You must select at
least one node as the worker node.
o Select all nodes for the CLIENT. This puts JSqsh and SFTP clients on the nodes.
7. In the Customize Services page, accept the recommended configurations for the Big
SQL service, or customize the configuration by expanding the configuration files and
modifying the values. Make sure that you have a
valid bigsql_user and bigsql_user_password (see reference screen below) and
user_id (created by the bi_create_users.sh script) in the appropriate fields in
the Advanced bigsql-users-env section.
8.
9. You can review your selections in the Review page before accepting them. If you want
to modify any values, click the Back button. If you are satisfied with your setup,
click Deploy.
10. In the Install, Start and Test page, the Big SQL service is installed and verified. If you
have multiple nodes, you can see the progress on each node. When the installation is
complete, either view the errors or warnings by clicking the link, or click Next to see a
summary and then the new service added to the list of services.
If the BigInsights – Big SQL service fails to install, run
the remove_value_add_services.sh cleanup script. The following code is an example of
the command:
cd /usr/ibmpacks/bin/<version>
./remove_value_add_services.sh -u admin -p admin -x 8080 -s BIGSQL -r
For more information about cleaning the value-add service environment, see Removing
BigInsights value-add services.
11. A web application interface for Big SQL monitoring and editing is available to your end-
users to work with Big SQL. You access this monitoring utility from the IBM BigInsights
Home service. If you have not added the BigInsights Home service yet, do that now.
12. Restart the Knox Service. Also start the Knox Demo LDAP service if you have not
configured your own LDAP.
13. Restart the BigInsights Home services.
14. To run SQL statements from the Big SQL monitoring and editing tool, type the following
address in your browser to open the BigInsights Home service:
https://<knox_host>:<knox_port>/<knox_gateway_path>/default/BigInsightsWeb/index.html
Where:
knox_host
The host where Knox is installed and running
knox_port
The port where Knox is listening (by default this is 8443)
knox_gateway_path
The value entered in the gateway.path field in the Knox configuration (by default this is
'gateway')
For example, the URL might look like the following address:
https://myhost.company.com:8443/gateway/default/BigInsightsWeb/index.html
If you use the Knox Demo LDAP service, the default credential is:
userid = guest
password = guest-password
Your end users can also use the JSqsh client, which is a component of
the BigInsights - Big SQL service.
15. If the BigInsights - Big SQL service shows as unavailable, there might have been a
problem with post-installation configuration. Run the following commands
as root (or sudo) where the Big SQL monitoring utility (DSM) server is installed:
a. Run the dsmKnoxSetup script:
cd /usr/ibmpacks/bigsql/<version-number>/dsm/1.1/ibm-datasrvrmgr/bin/
./dsmKnoxSetup.sh -knoxHost <knox-host>
where <knox-host> is the node where the Knox gateway service is running.
b. Make sure that you do not stop and restart the Knox gateway service within
Ambari. If you do, then run the dsmKnoxSetup script again.
c. Restart the BigInsights Home service so that the Big SQL monitoring utility
(DSM) can be accessed from the BigInsights Home interface.
16. For HBase, do the following post-installation steps:
a. For all nodes where HBase is installed, check that the symlinks to hive-serde.jar
and hive-common.jar in the hbase/lib directory are valid.
To verify that the symlinks are created and valid:
namei /usr/iop/<version-number>/hbase/lib/hive-serde.jar
namei /usr/iop/<version-number>/hbase/lib/hive-common.jar
If they are not valid, do the following steps:
cd /usr/iop/<version-number>/hbase/lib
rm -rf hive-serde.jar
rm -rf hive-common.jar
ln -s /usr/iop/<version-number>/hive/lib/hive-serde.jar hive-serde.jar
ln -s /usr/iop/<version-number>/hive/lib/hive-common.jar hive-common.jar
b. After installing the Big SQL service and fixing the symlinks, restart the HBase
service from the Ambari web interface.
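The symlink check in step 16 can be scripted so that it is easy to repeat on every HBase node. The following sketch reports whether each link exists and resolves to an existing file; the function name check_hive_links is hypothetical, and the directory argument must be replaced with the actual /usr/iop/<version-number>/hbase/lib path on your system.

```shell
#!/bin/sh
# Report whether the hive-serde.jar and hive-common.jar symlinks in the given
# HBase lib directory exist and point to existing files.
# Returns 0 when both links are valid, 1 otherwise.
check_hive_links() {
    lib_dir=$1
    status=0
    for jar in hive-serde.jar hive-common.jar; do
        link="$lib_dir/$jar"
        # -L: the path is a symlink; -e: its target exists
        if [ -L "$link" ] && [ -e "$link" ]; then
            echo "OK      $link"
        else
            echo "BROKEN  $link"
            status=1
        fi
    done
    return "$status"
}

# Example call (substitute your version number):
#   check_hive_links /usr/iop/<version-number>/hbase/lib
```

A non-zero return value indicates that the rm and ln -s commands shown in step 16 should be run on that node.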
After you add Big SQL worker nodes, make sure that you stop and then restart the Hive service.
Connecting to Big SQL
You can run Big SQL queries from Java SQL Shell (JSqsh), or from the IBM Data Server
Manager. You can also run queries from a client application, such as IBM Data Studio,
that uses JDBC or ODBC drivers. You must identify a running Big SQL server and
configure either a JDBC or ODBC driver.
For more information about JSqsh, or IBM Data Studio, see the related topics in the
IBM® BigInsights™ Knowledge Center.
Running JSqsh
JSqsh is installed in /usr/ibmpacks/common-utils/current/jsqsh/bin. Change to that directory
and type ./jsqsh to open the JSqsh shell:
cd /usr/ibmpacks/common-utils/current/jsqsh/bin
./jsqsh
You can then run any JSqsh commands from the prompt.
Connection setup
To use the JSqsh command shell, you can use the default connections or define and test a
connection to the Big SQL server.
1. The first time that you open the JSqsh command shell, a configuration wizard is started.
When you are at the JSqsh command prompt, type \drivers to determine the available
drivers.
a. On the driver selection screen, select the Big SQL instance that you want to run.
Note: Big SQL is designated as DB2 in this example:
Name Target Class
- ------- ------------------- --------------------------------------------
...
2 *db2 IBM Data Server(DB2 com.ibm.db2.jcc.DB2Driver
b. Verify the port, server, and user name. Run \setup and press C to define a
password for the connection. The username must have database administration
privileges, or must be granted those privileges by the Big SQL administrator.
c. Test the connection to the Big SQL server.
d. Save and name this connection.
2. Generally, you can access JSqsh from /usr/ibmpacks/common-utils/current/jsqsh/bin
with the following command:
./jsqsh --driver=db2 --user=<username> --password=<user_password>
3. Open the saved configuration wizard any time by typing \setup while in the command
interface, or ./jsqsh --setup when you open the command interface.
4. Specify the connection name when you start the JSqsh command shell to establish a
connection:
./jsqsh name
5. Use the \connect command when you are already inside the JSqsh shell to establish a
connection at the JSqsh prompt:
\connect name
Commands and queries
At the JSqsh command prompt, you can run JSqsh commands or database server commands.
JSqsh commands usually begin with a backslash (\) character.
JSqsh commands accept command-line arguments and allow for common shell activities, such
as I/O redirection and pipes.
For example, consider this set of commands:
1> select * from t1
2> where c1 > 10
3> \go --style csv > /tmp/t1.csv
Because the commands do not begin with a backslash character, the first two commands are
assumed to be SQL statements, and are sent to the Big SQL server.
The \go command sends the statements to run on the server. The \go command has a built-in
alias so that you can omit the backslash. Additionally, you can specify a trailing semicolon to
indicate that you want to run a statement, for example:
1> select * from t1
2> where c1 > 10;
The --style option in the \go command indicates that the display shows comma-separated
values (CSV). The \go form is most useful if you provide additional arguments to affect how
the query is run. Changing the display style is an example of this feature.
The redirection operator (>) specifies that the results of the command are sent to a file
called /tmp/t1.csv.
A set of frequently run commands does not require the leading backslash. Any JSqsh command
can be aliased to another name (without a leading backslash, if you choose), by using
the \alias command. For example, if you want to be able to type bye to leave the JSqsh shell,
you establish that word as the alias for the \quit command:
\alias bye='\quit'
You can run a script that contains one or more SQL statements. For example, assume that you
have a file called mySQL.sql. That file contains these statements:
select tabschema, tabname from syscat.tables fetch first 5 rows only;
select tabschema, colname, colno, typename, length from syscat.columns fetch first 10 rows only;
You can start JSqsh and run the script at the same time with this command:
/usr/ibmpacks/common-utils/current/jsqsh/bin/jsqsh bigsql < /home/bigsql/mySQL.sql
The redirection operator tells JSqsh to read the commands from the file located in
the /home/bigsql directory, and then run the statements within the file.
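The script file and the JSqsh invocation can be combined in one small shell script. The following sketch writes the two statements shown above to a file and runs them only if JSqsh is found at the installation path given in this section; the saved connection name bigsql is taken from the example command, and the temporary file location is an assumption.

```shell
#!/bin/sh
# Location for the SQL script (assumption: a temporary directory is acceptable).
sql_file="${TMPDIR:-/tmp}/mySQL.sql"

# Each statement ends with a semicolon so that JSqsh runs it without \go.
cat > "$sql_file" <<'EOF'
select tabschema, tabname from syscat.tables fetch first 5 rows only;
select tabschema, colname, colno, typename, length from syscat.columns fetch first 10 rows only;
EOF

# JSqsh path as documented in the Running JSqsh section.
jsqsh_bin=/usr/ibmpacks/common-utils/current/jsqsh/bin/jsqsh

if [ -x "$jsqsh_bin" ]; then
    # bigsql is the name of a previously saved connection.
    "$jsqsh_bin" bigsql < "$sql_file"
else
    echo "jsqsh not found at $jsqsh_bin; statements written to $sql_file"
fi
```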
Command and query edit
The JSqsh command shell uses the JLine2 library, which allows you to edit previously entered
commands and queries. You can use the arrow keys to move the cursor and to
edit the command or query on the current line.
The JLine2 library provides the same key bindings (vi and emacs) as the GNU Readline library.
In addition, it attempts to apply any custom key maps that you created in a
GNU Readline configuration file (.inputrc) in your $HOME directory.
In addition to individual line editing, the JSqsh command shell remembers the 50 most recently
run statements, which you can view by using the \history command:
1> \history
(1) use tpch;
(2) select count(*) from lineitem
Previously run statements are prefixed with a number in parentheses. You use this number to
recall that query by using the JSqsh recall operator (!), for example:
1> !2
1> select count(*) from lineitem
2>
The ! recall operator has the following behavior:
!! Recalls the previously run statement.
!5 Recalls the fifth query from history.
!-2 Recalls the query from two prior runs.
You can also edit queries that span multiple lines by using the \buf-edit command,
which pulls the current query into an external editor, for example:
1> select id, count(*)
2> from t1, t2
3> where t1.c1 = t2.c2
4> \buf-edit
The query is opened in an external editor. The default is /usr/bin/vi, but you can
specify a different editor in the $EDITOR environment variable. When you close the
editor, the edited query is entered at the JSqsh command shell prompt.
The JSqsh command shell provides built-in aliases, vi and emacs, for the \buf-edit
command. The following commands, for example, open the query in the vi editor:
1> select id, count(*)
2> from t1, t2
3> where t1.c1 = t2.c2
4> vi
Configuration variables
You can use the \set command to list or define values for a number of configuration
variables, for example:
1> \set
If you want to redefine the prompt in the command shell, you run the following command
with the prompt option:
1> \set prompt='foo $lineno> '
foo 1>
Every JSqsh configuration variable has built-in help available:
1> \help prompt
If you want to permanently set a specific variable, you can do so by editing
your $HOME/.jsqsh/sqshrc file and including the appropriate \set command in it.
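Adding a \set line to the sqshrc file can be scripted so that repeated runs do not duplicate the entry. The following is a minimal sketch; the function name persist_jsqsh_setting is hypothetical, and the prompt value is the one used in the example above.

```shell
# Append a line to a file only if that exact line is not already present.
# Arguments: path to the sqshrc file, the full \set line to store.
persist_jsqsh_setting() {
  rc_file=$1
  setting=$2
  mkdir -p "$(dirname "$rc_file")"
  touch "$rc_file"
  # -x matches whole lines, -F treats the setting as a fixed string.
  if ! grep -q -x -F -- "$setting" "$rc_file"; then
    printf '%s\n' "$setting" >> "$rc_file"
  fi
}

# Example call (left commented out so that nothing is changed by default):
# persist_jsqsh_setting "$HOME/.jsqsh/sqshrc" "\\set prompt='foo \$lineno> '"
```

The backslash and the dollar sign are escaped in the example call so that the shell stores them literally instead of expanding them.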
Installing Text Analytics
The Text Analytics service provides powerful text extraction capabilities. You can extract
structured information from unstructured and semi-structured text.
It is recommended that you make sure that the python-paramiko package is installed prior to
installing the Text Analytics service.
yum install python-paramiko
You will select a Master node for Text Analytics, and this node should have
the python-paramiko package installed. The master node is the node where Text Analytics
Web Tooling and Text Analytics Runtime are both installed.
1. Open a browser and access the Ambari server dashboard. The following is the default
URL.
http://<server-name>:8080
The default user name is admin, and the default password is admin.
2. In the Ambari dashboard, click Actions > Add Service.
3. In the Add Service Wizard, Choose Services, select the BigInsights - Text
Analytics service.
If you do not see the option to select the BigInsights - Text Analytics service,
complete the steps in Installing the BigInsights value-add packages.
4. To assign master nodes, select the Text Analytics Master server Node.
5. Click Next. The Assign Slaves and Clients page displays.
6. Assign slave and client components to the hosts on which you want them to run. An
asterisk (*) after a host name indicates the host is assigned a master component.
a. To assign slave nodes and clients, click All in the Clients column.
The client package that is installed contains runtime binaries that are needed to
run Text Analytics. This client needs to be installed on all datanodes that belong to
your cluster.
Client nodes install only the Text Analytics Runtime artifacts
(/usr/ibmpacks/current/text-analytics-runtime). Choose one or more clients. You
do not have to choose the Master node as a client because it already installs Text
Analytics Runtime.
7. Click Next and select BigInsights - Text Analytics.
8. Expand Advanced ta-database-config and enter the password in the
database.password field. Recommended configurations for the service are completed
automatically, but you can edit these default settings as desired.
By default, the database server is MySQL. There are two options:
o database.create.new = Yes (default)
a. You must enter the password for the database.
b. You must ensure that the default port, 32050, is free. You can change the
port to any free port.
c. You can change the database.username, but any changes to
the database.hostname are ignored.
o database.create.new = No
a. You must enter the database.hostname, database.port (where the
existing database server instance is
running), database.user and database.password. Ensure that the user
and password have full access to create a database in the existing database
server instance you specify. Especially if it is a remote MySQL server
instance, ensure that all permissions are given to the user and password to
access this remote instance. Ensure that the server instance is up and
running so that the Text Analytics service can be started successfully.
9. Click Next and in the Review screen that opens, click Deploy.
10. After installation is complete, click Next > Complete.
11. After the installation is successful, click Next and Complete.
If the BigInsights - Text Analytics service fails to install, run
the remove_value_add_services.sh cleanup script. The following code is an example
command:
cd /usr/ibmpacks/bin/<version>
./remove_value_add_services.sh -u admin -p admin -x 8080 -s TEXTANALYTICS -r
For more information about cleaning the value-add service environment, see Removing
BigInsights value-add services.
12. The Text Analytics directory on all nodes where Text Analytics components are installed
is created with world-writable permissions, which are not required. Change the
permissions to rwxr-xr-x on all nodes to improve security:
chmod go-w /usr/ibmpacks/text-analytics-runtime
13. Restart the Knox service. If you have not configured LDAP service, start the Knox
Demo LDAP service.
14. Open the BigInsights Home and launch Text Analytics at the following address:
https://<knox_host>:<knox_port>/<knox_gateway_path>/default/BigInsightsWeb/index.html
Where:
knox_host
The host where Knox is installed and running
knox_port
The port where Knox is listening (by default this is 8443)
knox_gateway_path
The value entered in the gateway.path field in the Knox configuration (by default this is
'gateway')
For example, the URL might look like the following address:
https://myhost.company.com:8443/gateway/default/BigInsightsWeb/index.html
If you use the Knox Demo LDAP service and have not modified the default
configuration, the default credential to log into the BigInsights - Home service is:
userid = guest
password = guest-password
Note: If you do not see the Text Analytics service from BigInsights Home, restart
the BigInsights Home service in the Ambari interface.
At this point, BigInsights Home should show all three BigInsights services.
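The BigInsights Home address described in this section is built from the Knox host, port, and gateway path. The following is a minimal sketch that assembles it; the function name biginsights_home_url is hypothetical, the defaults 8443 and gateway are the ones stated above, and the host name is the example value from the text.

```shell
# Build the BigInsights Home address behind Knox (hypothetical helper).
# Arguments: Knox host, optional port (default 8443), optional gateway path
# (default "gateway").
biginsights_home_url() {
  knox_host=$1
  knox_port=${2:-8443}
  knox_gateway_path=${3:-gateway}
  printf 'https://%s:%s/%s/default/BigInsightsWeb/index.html\n' \
    "$knox_host" "$knox_port" "$knox_gateway_path"
}

biginsights_home_url myhost.company.com
```

Passing a second and third argument covers clusters where the Knox port or the gateway.path value was changed from the default.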
Installing Big R
To extend the power of IBM Open Platform with Apache Hadoop, install and deploy the Big R
service, which is the IBM R extension for this Hadoop-based platform.
1. Open a browser and access the Ambari server dashboard. The following is the
default URL.
http://<server-name>:8080
The default user name is admin, and the default password is admin.
2. In the Ambari web interface, click Actions > Add Service.
3. Optional: If you do not already have the R Service installed, you can add it now. Big R
service depends on the R statistics environment and the following three R packages:
base64enc, rJava and data.table. If these have been installed on all nodes in the cluster,
this step can be skipped. Otherwise, you can choose to install the above dependencies
with your own approach, or, if your cluster has external network access, you can use the
following R service to install these dependencies.
a. In the Add Service Wizard, Choose Services, select the R service and
click Next.
b. In the Assign Slaves and Clients page, for client nodes, mark all of the nodes as
the R Client node and click Next.
c. In the Customize Services page, accept the recommended configurations for the
R service, or customize the configuration by expanding the configuration files and
modifying the values.
Make sure that you read the R license, and indicate acceptance by
typing Y in the accept.R.Licenses field. The value is case sensitive, so make
sure you type an uppercase letter. The R Licenses field contains a URL
where you can find the licensing information.
In the user.R.packages field, you must ensure that the following required
packages are listed: base64enc, rJava, and data.table.
In the user.R.repository field, enter the preferred repository. The default
is epel-release, which uses the EPEL repository, but you can also type a
different repository by entering a URL, such
as http://repos.domain.com/repos.
Note: When installing R from the EPEL repository, you might have the
following GPG key error: GPG key retrieval failed: [Errno 14] Could not
open/read
If you receive this error, you can import the key with the following rpm
command, then retry: rpm --import
d. Click Next and in the Review Page that opens, click Deploy.
e. If R deployment fails, review and correct the errors before reattempting the
installation. Remove the R service from Ambari and delete the RSERV server by
using the following command:
curl -u [uid]:[pwd] -H "X-Requested-By: ambari" \
-X DELETE http://[hostname]:8080/api/v1/clusters/[cluster name]/services/RSERV
where
[uid]:[pwd]
The Ambari administrator user ID and password.
[hostname]
The correct host name for your environment.
8080
The port number 8080 is the default. Modify this according to your environment.
[cluster name]
The correct name of your cluster.
The following command is an example:
curl -u admin:admin -H "X-Requested-By: ambari" \
-X DELETE http://my_host.localdomain:8080/api/v1/clusters/my_cluster/services/RSERV
f. In the Summary page, click Complete. When you return to the Ambari Dashboard
Services tab, the R service is now listed.
4. In the Add Service Wizard, Choose Services page, select the Big R service and
click Next.
5. In the Assign Masters page, decide which nodes of your cluster you want to run the
specified components, or accept the default nodes. You must assign the Big R Connector
to the same node that is running the MapReduce2 Client service, which is a required
service that runs MapReduce2 Hadoop jobs. Click Next.
6. In the Assign Slaves and Clients page, accept the defaults, or make specific
assignments for your nodes. For client nodes, mark all of the nodes as the Big R
Client node and click Next.
7. In the Customize Services page, default Big R environment variables are set in
the bigr-env template field. Review these entries for accuracy and completeness. Make
any necessary changes and click Next.
8. You can review your selections in the Review page before accepting them. If you want
to modify any values, click the Back button. If you are satisfied with your setup,
click Deploy.
9. In the Install, Start and Test page, the Big R service is installed and verified. If you
have multiple nodes, you can see the progress on each node. When the installation is
complete, either view the errors or warnings by clicking the link, or click Next to see a
summary and then the new service added to the list of services.
If the BigInsights – Big R service fails to install, run
the remove_value_add_services.sh cleanup script. The following code is an example
of the command:
cd /usr/ibmpacks/bin/<version>
./remove_value_add_services.sh -u admin -p admin -x 8080 -s BIGR -r
For more information about cleaning the value-add service environment, see Removing
BigInsights value-add services.
10. Advise your end users that the service is deployed and ready for their use by having
them launch the Value Added packages welcome page.
11. In the Summary page, click Complete.
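If a service has to be removed through the Ambari REST interface after a failed deployment, the address used by the DELETE call can be assembled from its parts. The following is a minimal sketch; the function name ambari_service_url is hypothetical, the host and cluster names are placeholder values, and the curl command is only printed for review, not run.

```shell
# Build the Ambari REST address for one service (hypothetical helper).
# Arguments: host, cluster name, service name, optional port (default 8080).
ambari_service_url() {
  printf 'http://%s:%s/api/v1/clusters/%s/services/%s\n' \
    "$1" "${4:-8080}" "$2" "$3"
}

url=$(ambari_service_url my_host.localdomain my_cluster RSERV)

# Print the command so that it can be checked before it is run by hand.
echo "curl -u admin:admin -H \"X-Requested-By: ambari\" -X DELETE $url"
```

Printing the command first is a deliberate choice: a DELETE against the wrong cluster or service name is hard to undo.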
Running BigInsights - Big R as the YARN application master
You must set the Linux Container Executor as the default executor in the yarn-site.xml
file to change the owner to the bigr server user (the application process owner).
1. In the Ambari web interface, from the YARN service Configs page, scroll down to
find the Advanced yarn-site section and expand it.
2. Change the yarn.nodemanager.container-executor.class property to have the
following value:
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor
3. In the Custom yarn-site section, click Add Property.
4. Add the following properties:
Property name                                                          Value
yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user    yarn
yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users   false
5. Make sure that the property yarn.nodemanager.linux-container-executor.group
has the value hadoop.
6. Click Save in the Configs page to save your configuration changes.
7. Make sure that the directories on ALL the nodes set in the Node Manager section
for the properties yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs
have yarn:hadoop ownership.
On ALL nodes, run the following commands:
$ echo "yarn.nodemanager.linux-container-executor.group=hadoop" >> /etc/hadoop/conf/container-executor.cfg
$ echo "banned.users=hdfs,yarn,mapred,bin" >> /etc/hadoop/conf/container-executor.cfg
$ echo "min.user.id=1000" >> /etc/hadoop/conf/container-executor.cfg
$ chown root:hadoop /etc/hadoop/conf/container-executor.cfg
$ chown root:hadoop /usr/iop/4.0.0.0/hadoop-yarn/bin/container-executor
$ chmod 6050 /usr/iop/4.0.0.0/hadoop-yarn/bin/container-executor
8. Make sure that the user ID with which the BigR connection is made (by using
bigr.connect) is present on ALL nodes, and that the user belongs to groups users,
hadoop. If the user does not exist, run the following command as the root user on
ALL nodes:
$ useradd -G users,hadoop someuser
9. Open the SystemML configuration file,
/usr/ibmpacks/current/bigr/machine-learning/SystemML-config.xml.
10. Set the dml.yarn.appmaster property to true.
11. You can optionally update the MapReduce configuration to get better
performance:
a. In the Ambari web interface, from the MapReduce2 service Configs page,
scroll down to find the Advanced map-red site section and expand it.
b. Update the property mapreduce.task.io.sort.mb to 384. This should be
approximately three times the HDFS block size.
Note: If the property is not available, add it to the Custom map-red site.
12. Click Save in the Configs page to save your configuration changes.
For information about using BigInsights - Big R, see Analyzing data with IBM BigInsights
Big R.
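The container-executor.cfg entries in this procedure are appended with echo, so running the commands a second time leaves duplicate lines in the file. The following is a minimal sketch of a setter that can be rerun safely; the function name set_cfg_key is hypothetical, and the example writes to a scratch file rather than to /etc/hadoop/conf/container-executor.cfg, which requires root access.

```shell
# Set key=value in a config file, replacing any earlier line for that key.
# Note: every line containing "key=" is dropped before the new line is added.
set_cfg_key() {
  cfg_file=$1
  cfg_key=$2
  cfg_value=$3
  cfg_tmp=$(mktemp)
  if [ -f "$cfg_file" ]; then
    # grep exits non-zero when nothing is left to print; that is not an error.
    grep -v -F -- "${cfg_key}=" "$cfg_file" > "$cfg_tmp" || true
  fi
  printf '%s=%s\n' "$cfg_key" "$cfg_value" >> "$cfg_tmp"
  # Write through the existing file so its owner and mode are kept.
  cat "$cfg_tmp" > "$cfg_file"
  rm -f "$cfg_tmp"
}

# Demonstration against a scratch file.
cfg=$(mktemp)
set_cfg_key "$cfg" yarn.nodemanager.linux-container-executor.group hadoop
set_cfg_key "$cfg" banned.users hdfs,yarn,mapred,bin
set_cfg_key "$cfg" min.user.id 1000
set_cfg_key "$cfg" min.user.id 1000   # a repeated call does not add a line
cat "$cfg"
```

On a real node, the function would be called as root with the container-executor.cfg path in place of the scratch file.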
IBM BigInsights Online Tutorials
Learn how to use BigInsights™ by completing online tutorials, which use real data and teach you to run applications. Complete the tutorials in any order.
https://www-01.ibm.com/support/knowledgecenter/SSPT3X_4.0.0/com.ibm.swg.im.infosphere.biginsights.tut.doc/doc/tut_Introduction.html
You can find additional information, tutorials, and articles about BigInsights, Hadoop, and
related components at Hadoop Dev.
http://developer.ibm.com/hadoop/docs/tutorials/
Security Configuration and Administration
IBM® Open Platform with Apache Hadoop security includes perimeter security, authentication, and authorization. Authenticate, authorize, and protect your data by using the steps and
recommendations listed in this section. This document covers security for the Isilon HDFS storage, the resources that you use in YARN, and the cluster infrastructure.
Setting up HTTPS for Ambari
You can limit access to the Ambari Web interface to HTTPS connections.
Before you begin
The Ambari server must not be running when you are performing this task.
You must provide a certificate. You can use a self-signed certificate for initial trials, but these
certificates are not suitable for production environments.
The certificate you use must be PEM-encoded, not DER-encoded. If you attempt to use a
DER-encoded certificate, the following error appears.
unable to load certificate
140109766494024:error:0906D06C:PEM routines:PEM_read_bio:no start line:pem_lib.c:698:Expecting: TRUSTED CERTIFICATE
You can use the following command to convert a DER-encoded certificate to a PEM-encoded
certificate. In this command, cert.crt is the DER-encoded certificate, and cert.pem is the
resulting PEM-encoded certificate. The -inform der option tells openssl to read the input
as DER.
openssl x509 -inform der -in cert.crt -outform pem -out cert.pem
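Before converting a certificate, it can help to check whether the file is already PEM-encoded. The following is a minimal sketch that looks for the PEM header line; the function name is_pem_file is hypothetical, and the check only inspects the text of the file, it does not validate the certificate.

```shell
# Succeed when the file contains a PEM header line such as
# "-----BEGIN CERTIFICATE-----". DER files are binary and have no such line.
is_pem_file() {
  grep -q -e '^-----BEGIN ' "$1"
}

# Example: report the encoding of cert.crt, the file name used in the text.
if [ -f cert.crt ]; then
  if is_pem_file cert.crt; then
    echo "cert.crt is already PEM-encoded"
  else
    echo "cert.crt is not PEM-encoded; convert it first"
  fi
fi
```

A file that fails this check is a candidate for the openssl conversion described in this section.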
Procedure
1. Log into the Ambari server host.
Note: Make sure Ambari server is not running.
2. Locate the certificate that you want to use.
You can use the following example to create a temporary self-signed certificate.
Replace $wserver with the Ambari server host name.
openssl genrsa -out $wserver.key 2048
openssl req -new -key $wserver.key -out $wserver.csr
openssl x509 -req -days 365 -in $wserver.csr -signkey $wserver.key -out $wserver.crt
3. Run the following command and answer the prompts that appear.
ambari-server setup-security
a. At the Security setup options prompt, type 1.
b. When asked whether you want to configure HTTPS, type y.
c. Select the port that you want to use for SSL. The default is 8443.
________________________________________________________________
Note: Make sure that you choose a port that is not being used by any services on
the machine. For example, the default port for Knox is also 8443.
__________________________________________________________
d. Provide the path to your certificate and your private key.
e. Provide the password for the private key.
Configuring SSL support for HBase REST gateway with Knox
With Knox, Hadoop services such as HBase, Hive, and Oozie can be made securely accessible
to a large number of users. Follow these steps to use SSL for the connection between Knox
and a Hadoop component such as HBase.
Many of the services in IBM® Open Platform with Apache Hadoop use Knox to allow more users
to make use of the data and queries in Hadoop without compromising on security. Only a
handful of administrators are allowed to connect directly to their Hadoop clusters, while end-
users are routed through Knox.
Knox acts as a reverse proxy between end users and Hadoop, so each request makes two
connection hops between the client and the Hadoop cluster. The first connection is between
the client and Knox.
Knox comes with SSL support for this connection. The second connection is between Knox and a
given Hadoop component, such as HBase, which requires some configuration.
Procedure
1. You must have a certificate, either self-signed or one signed by a Certificate Authority
(CA).
Trusted SSL certificates are issued by Certificate Authorities (CAs). A self-signed
certificate is signed by the same entity whose identity it certifies, that is, with
its own private key.
The examples use a self-signed certificate, but this might not be suitable for your
production environment.
a. Configure SSL on the HBase REST server. This example uses a self-signed
certificate; an SSL certificate issued by a Certificate Authority (CA) makes the
configuration steps even easier.
i. Log in to the HBase REST server. As the HBase user (su hbase), create a
keystore to hold the SSL certificate.
export HOST_NAME=`hostname`
keytool -genkey -keyalg RSA -alias selfsigned -keystore hbase.jks \
-storepass password \
-validity 360 -keysize 2048 \
-dname "CN=$HOST_NAME, OU=Eng, O=MyCompany, L=Central City, ST=CA, C=US" \
-keypass password
Make sure the common name portion of the certificate matches the host
where the certificate will be deployed. For example, when the host that runs
HBase is sandbox.MyCompany.com, the self-signed SSL certificate in
the example uses this value as the CN: sandbox.MyCompany.com.
Owner: CN=sandbox.MyCompany.com, OU=Eng, O=MC, L=CC, ST=CA, C=US
Issuer: CN=sandbox.MyCompany.com, OU=Eng, O=MC, L=CC, ST=CA, C=US
You can now use this self-signed certificate with HBase.
ii. Skip this step if you use a Certificate Authority signed certificate. Self-signed
certificates are rejected during SSL handshake. If you use a self-signed
certificate, export the certificate and put it in the cacerts file of the JRE that
is used by Knox. On the machine that is running HBase, export the HBase
SSL certificate into a file hbase.crt:
keytool -exportcert -file hbase.crt \
-keystore hbase.jks -alias selfsigned -storepass password
iii. Copy the hbase.crt file to the node that is running Knox. Then run the
following command:
keytool -import -file hbase.crt -keystore \
/<your_jdk_path>/jre/lib/security/cacerts \
-storepass changeit -alias selfsigned
Make sure the path to the cacerts file points to the cacerts of the JDK that is
used to run the Knox gateway. The default cacerts password is changeit.
2. Configure the HBase REST Server for SSL.
a. Use the Ambari web interface to update the Hadoop configuration properties:
<property>
<name>hbase.rest.ssl.enabled</name>
<value>true</value>
</property>
<property>
<name>hbase.rest.ssl.keystore.store</name>
<value>/path/to/keystore/created/hbase.jks</value>
</property>
<property>
<name>hbase.rest.ssl.keystore.password</name>
<value>password</value>
</property>
<property>
<name>hbase.rest.ssl.keystore.keypassword</name>
<value>password</value>
</property>
b. Click Save in the Ambari configuration page.
c. Restart the HBase REST server by clicking the HBase service in the Ambari web
interface. You can also type the following command in the Linux terminal window:
sudo /usr/iop/current/hbase-client/bin/hbase-daemon.sh stop rest && \
sudo /usr/iop/current/hbase-client/bin/hbase-daemon.sh start rest -p 8091
3. Verify the HBase REST server over SSL. Replace localhost with the hostname of your
HBase REST server.
curl -H "Accept: application/json" -k https://localhost:8091/
The command should display the tables in your HBase environment:
{"table":[{"name":"ambarismoketest"}]}
4. Configure Knox to point to HBase over SSL and then restart Knox.
Change the URL of the HBase service for your Knox topology to HTTPS. Make sure that
the host matches the host of the HBase REST server.
<service>
<role>WEBHBASE</role>
<url>https://sandbox.MyCompany.com:8091</url>
</service>
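The common name requirement from step 1 can be checked with a short script. The following is a minimal sketch, assuming that the Owner line has been copied as text from the keytool output; the function names extract_cn and cn_matches_host are hypothetical.

```shell
# Print the value of the CN attribute from a distinguished-name line.
extract_cn() {
  printf '%s\n' "$1" | sed -n 's/^.*CN=\([^,]*\).*$/\1/p'
}

# Succeed when the CN equals the given host name, ignoring case.
cn_matches_host() {
  cn=$(extract_cn "$1" | tr '[:upper:]' '[:lower:]')
  host=$(printf '%s\n' "$2" | tr '[:upper:]' '[:lower:]')
  [ -n "$cn" ] && [ "$cn" = "$host" ]
}

owner='Owner: CN=sandbox.MyCompany.com, OU=Eng, O=MC, L=CC, ST=CA, C=US'
if cn_matches_host "$owner" sandbox.mycompany.com; then
  echo "CN matches host"
else
  echo "CN does not match host"
fi
```

The comparison is case-insensitive because host names are not case sensitive, while the CN in the example uses mixed case.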
Overview of Kerberos
To ensure secure access in Hadoop, you need strong authentication and a reliable way to
establish the identity of a user.
When users successfully identify themselves, then that identity can be propagated throughout
the Hadoop cluster. Those users can access resources or work with applications on the cluster.
The Hadoop cluster resources, such as Hosts and Services, also must authenticate with each
other to avoid potential malicious systems or daemons that pretend to be trusted components
of the cluster to gain access to data.
Hadoop uses Kerberos as the basis for strong authentication and identity propagation for both
users and services. Kerberos is a third-party authentication mechanism, in which users and
services rely on a third party - the Kerberos server - to authenticate each to the other. The
Kerberos server itself is known as the Key Distribution Center (KDC). The KDC has three
parts:
Principals
A database of the users and services that the server knows about and their respective
Kerberos passwords.
Authentication Server (AS)
An AS performs the initial authentication and issues a Ticket Granting Ticket (TGT).
Ticket Granting Server (TGS)
A TGS issues subsequent service tickets based on the initial TGT.
The basic flow is illustrated by the following steps:
1. A user principal requests authentication from the AS.
2. The AS returns a TGT that is encrypted by using the Kerberos password of the user
principal. This password is known only to the user principal and the AS.
3. The user principal decrypts the TGT locally by using its Kerberos password, and
from that point forward, until the ticket expires, the user principal can use the TGT
to get service tickets from the TGS.
4. Service tickets are what allow a principal to access various services.
Because cluster resources (hosts or services) cannot provide a password each time to decrypt
the TGT, they use a special file, called a keytab. The keytab contains the authentication
credentials of the resource principal. The set of hosts, users, and services over which the
Kerberos server has control is called a realm.
Each service and sub-service in Hadoop must have its own principal. A principal name in a given
realm consists of a primary name and an instance name. The instance name is the fully
qualified domain name (FQDN) of the host that runs that service.
_________________________________________________________________________
Note: The HDFS service is handled entirely by Isilon, so it is very
important to use the fully qualified Isilon Hadoop Zone Name as the instance
name for the HDFS service.
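In standard Kerberos notation, a principal is written as primary/instance@REALM. The following is a minimal sketch that composes such a name; the function name kerberos_principal is hypothetical, and the zone name isilon-zone.example.com and realm EXAMPLE.COM are placeholder values, not values from this guide.

```shell
# Compose a Kerberos principal name in the form primary/instance@REALM.
# Arguments: primary name, instance name (an FQDN), realm.
kerberos_principal() {
  printf '%s/%s@%s\n' "$1" "$2" "$3"
}

# For HDFS, the instance is the fully qualified Isilon zone name (placeholder).
kerberos_principal hdfs isilon-zone.example.com EXAMPLE.COM
```

For services that run on compute nodes, the second argument would be the FQDN of the host that runs the service.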
As services do not log in with a password to acquire their tickets, the authentication credentials
of their principal are stored in a keytab file. This file is extracted from the Kerberos database
and stored locally in a secured directory with the service principal on the service component
host.
In addition to the Hadoop Service Principals, Ambari also requires a set of Ambari
Principals to perform service checks and alert health checks. Keytab files for the Ambari, or
headless, principals reside on each cluster host, just as keytab files for the service principals.
Terminology
The following terms are useful in understanding Kerberos:
Key Distribution Center
The trusted source for authentication in a Kerberos-enabled environment.
Kerberos KDC Server
The server that serves as the KDC.
Kerberos Client
Any machine in the cluster that authenticates against the KDC.
Principal
The unique name of a user or service that authenticates against the KDC.
Keytab
A file that includes one or more principals and their keys.
Realm
The Kerberos network that includes a KDC and a number of Clients.
KDC Admin Account
An administrative account that is used by Ambari to create principals and generate
keytabs in the KDC.
Kerberos Descriptor
A JSON-formatted text file that contains information Ambari needs to enable or disable
Kerberos for a stack and its services. This file must be named kerberos.json. It must
be in the root directory of the relevant stack or service. Kerberos Descriptors are meant
to be hierarchical such that details in the stack-level descriptor can be overwritten or
updated by details in the service-level descriptors.
Enabling Kerberos for IBM Open Platform
You begin setting up Kerberos by enabling it from the Ambari web interface. To use Kerberos
authentication in IBM® Open Platform with Apache Hadoop, you must generate principals and
keytabs for each of the services on each node where you installed the product.
Before you begin
1. You must have the latest supported Red Hat Enterprise Linux (RHEL) packages to enable
and use Kerberos – krb5-server, krb5-workstation and krb5-libs.
2. Deploy the Java Cryptography Extension (JCE) security policy files on the Ambari server
and on all hosts in the cluster. Depending on the JDK that you selected during the
installation of IBM Open Platform with Apache Hadoop, the JCE policy files might already
be downloaded and installed onto the server.
a. Stop the Ambari server:
ambari-server stop
b. Make sure you have access to the policy file archive.
c. From the Ambari server and on each host in the cluster, add the unlimited security
policy JCE jars to $JAVA_HOME/jre/lib/security/. For example, run the following
command to extract the policy jars into the JDK that is installed on your host:
unzip -o -j -q UnlimitedJCEPolicyJDK7.zip -d /usr/jdk64/jdk1.version/jre/lib/security/
d. Restart the Ambari server.
ambari-server restart
3. Ambari automatically creates principals in the KDC and generates keytabs. Therefore,
you must have the Kerberos Admin Account credentials available when running the
Kerberos wizard.
4. If you use an existing Active Directory installation with Kerberos:
a. Make sure that the Ambari server and cluster hosts have network access to, and can
resolve the DNS names of, the Domain Controllers.
b. Configure the LDAP or Active Directory authentication connectivity.
c. Make sure that the Active Directory User container for principals is created and is
available. For example, "OU=Hadoop,OU=People,dc=apache,dc=org"
Manually generating keytabs for Kerberos authentication
You use the kadmin.local command-line interface to generate keytabs for IBM® Open Platform
with Apache Hadoop services. All Kerberos-enabled services need a keytab file to authenticate
to the Key Distribution Center (KDC).
You can also use the kadmin command-line interface that can be used on Kerberos client nodes
and KDC server nodes. The kadmind service starts the Kerberos administration server,
whereas the kadmin.local command-line interface directly accesses the KDC database.
To generate keytabs for services that contain the HTTP principal, you use the ktadd command
with the -norandkey option in the kadmin.local command-line interface. This option indicates
to not randomize the keytabs. The keytabs and their version numbers remain unchanged.
____________________________________________________________________
Note: If your version of Kerberos does not support this option, or if you cannot use the
kadmin.local shell, then create your keytabs with the ktadd command and use
the ktutil command to merge keytabs that you create.
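The merge mentioned in the note can be scripted. The sketch below builds the text of a ktutil session using the rkt (read keytab), wkt (write keytab), and quit commands. The function name and the keytab file names in the usage note are hypothetical.

```python
def ktutil_merge_script(source_keytabs, merged_keytab):
    """Build the command text for one ktutil session that merges keytabs.

    rkt reads each source keytab into ktutil's in-memory list, wkt writes
    the combined list to the merged keytab, and quit ends the session.
    """
    if not source_keytabs:
        raise ValueError("at least one source keytab is required")
    lines = ["rkt " + path for path in source_keytabs]
    lines.append("wkt " + merged_keytab)
    lines.append("quit")
    return "\n".join(lines) + "\n"
```

On a host where the Kerberos client tools are installed, the returned text could be passed to ktutil on standard input, for example with subprocess.run(["ktutil"], input=script, text=True). A call such as ktutil_merge_script(["hdfs.service.keytab", "http.service.keytab"], "hdfs.merged.keytab") uses placeholder file names.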
Hadoop, HBase, HttpFS, and Oozie require an HTTP principal. You must generate keytabs for
these services to configure them with Kerberos HTTP authentication. If two or more of these
services run on the same host, then all running services on that host must use the same HTTP
principal and key for their HTTP endpoints.
Procedure
1. From the Linux shell, as the root user, start the kadmin.local or kadmin command-line
interface.
Important: If you have root access to your KDC machine, log in to the KDC machine as
root and use the kadmin.local command-line interface to generate principals and keytabs.
If you do not have root access to the KDC machine, use the kadmin command-line
interface on any Kerberos configured machine to generate principals and keytabs.
kadmin.local
2. Create the principal and keytab for each of the IBM Open Platform with Apache
Hadoop services. For each service, you must enter the domain.name and YOUR_REALM.COM
parameters.
domain.name - The fully qualified domain name of the cluster node where the server
component is running. The domain.name must be in lowercase characters.
YOUR_REALM.COM - The name of the Kerberos realm where you are installing IBM
Open Platform with Apache Hadoop. Kerberos realm names are typically in all
uppercase characters to differentiate them from any similar DNS domain that the realm is
associated with.
Option Description
Flume
On every Kerberos configured node that runs a Flume agent that writes to HDFS, generate a
keytab file that contains entries for the Flume agent principal.
a. On each host where a Flume agent runs, create the Flume principal and keytab file, and
then copy the keytab to the respective host under ../conf/security/keytabs/flume.keytab.
addprinc -randkey flume/domain.name@YOUR_REALM.COM
ktadd -k flume.keytab flume/domain.name@YOUR_REALM.COM
b. Check to ensure that Flume agent principal information was added to the keytab file.
klist -e -k -t flume.keytab
c. Ensure that the flume.keytab file is only readable by the Flume user.
sudo chown flume:biadmin ../conf/security/keytabs/flume.keytab
sudo chmod 400 ../conf/security/keytabs/flume.keytab
d. To enable the Flume agent to store data on a secure HDFS, add the following parameters
to the Flume configuration file, flume-conf.properties.template, which exists in the
../flume/conf directory. You can rename this configuration file to generate your own
configuration file for Flume.
agentName.sinks.sinkName.type = HDFS
agentName.sinks.sinkName.hdfs.kerberosPrincipal = flume/domain.name@YOUR_REALM.COM
agentName.sinks.sinkName.hdfs.kerberosKeytab = keytab_path
agentName
Name of the Flume agent that you are configuring for Kerberos authentication.
sinkName
Name of the HDFS sink that you are configuring. The sink type must be HDFS.
keytab_path
Path to the Flume keytab. The default path is ../conf/security/keytabs/flume.keytab.
When you start the Flume agent, specify the --conf-file option to point to the Flume
configuration file that you modified. For example,
$FLUME_HOME/bin/flume-ng agent --conf-file flume-conf.properties.template --name myAgentName -Dflume.root.logger=INFO,console
Hadoop
On every Kerberos configured node that runs a Hadoop server, generate a keytab file for HDFS,
MapReduce, and HTTP services. The HDFS keytab file must contain entries for the HDFS
principal and the HTTP principal. The MapReduce keytab file must contain entries for the
MapReduce principal and the HTTP principal. Both Hadoop and HBase use the HTTP keytab
file. On each node, the HTTP principal must be the same in all keytab files.
e. Run the following commands on every host in your cluster that runs a Hadoop server or an
HBase server.
addprinc -randkey HTTP/domain.name@YOUR_REALM.COM
ktadd -norandkey -k http.domain.name.keytab HTTP/domain.name@YOUR_REALM.COM
f. Run the following commands on every host in your cluster where Hadoop servers run. Create
principals and keytabs for HDFS services including the NameNode, Secondary NameNode, and
DataNodes. In all cases the instance name points to the FQDN of the Isilon Hadoop zone, for
example hdfs/Isilon-[email protected]. If you plan to enable high availability with the Quorum
Journal Manager (QJM), create principals and keytabs for JournalNodes.
addprinc -randkey hdfs/domain.name@YOUR_REALM.COM
Tip: You can add keytabs for NFS high availability.
i. Add the following principals.
addprinc -randkey hdfs/isilon.zonename@YOUR_REALM.COM
addprinc -randkey HTTP/virtual.hostname@YOUR_REALM.COM
ii. Add the NFS principals and key to every HA node.
ktadd -norandkey -k hdfs.domain.name.keytab hdfs/isilonzone.domain.name@YOUR_REALM.COM
ktadd -norandkey -k http.domain.name.keytab HTTP/isilonzone.domain.name@YOUR_REALM.COM
ktadd -norandkey -k hdfs.isilonzone.domain.name.keytab HTTP/isilonzone.domain.name@YOUR_REALM.COM
ktadd -norandkey -k hdfs.domain.name.keytab hdfs/isilonzone.domain.name@YOUR_REALM.COM HTTP/isilonzone.domain.name@YOUR_REALM.COM
Check to ensure that the HDFS and HTTP principal information was added to the keytab file.
klist -e -k -t hdfs.isilonzone.domain.name.keytab
g. Run the following commands on every host in your cluster where Hadoop servers run,
including the JobTracker and TaskTracker.
addprinc -randkey mapred/domain.name@YOUR_REALM.COM
ktadd -norandkey -k mapred.domain.name.keytab mapred/domain.name@YOUR_REALM.COM HTTP/domain.name@YOUR_REALM.COM
Check to ensure that MapReduce and HTTP principal information was added to the keytab file.
klist -e -k -t mapred.domain.name.keytab
HBase
h. On every Kerberos configured node that runs HBase, including the primary and secondary
servers, generate a keytab file that contains entries for the HBase principal.
addprinc -randkey hbase/domain.name@YOUR_REALM.COM
ktadd -k hbase.domain.name.keytab hbase/domain.name@YOUR_REALM.COM
i. Check to ensure that HBase principal information was added to the keytab file.
klist -e -k -t hbase.domain.name.keytab
Hive
j. On every Kerberos configured node that runs a Hive JDBC server, generate a Hive keytab
file that contains entries for the Hive principal.
addprinc -randkey hive/domain.name@YOUR_REALM.COM
ktadd -k hive.domain.name.keytab hive/domain.name@YOUR_REALM.COM
k. Check to ensure that Hive principal information was added to the keytab file.
klist -e -k -t hive.domain.name.keytab
HttpFS
l. On every Kerberos configured node that runs the HttpFS server, generate a keytab file that
contains entries for the HttpFS principal and an HTTP principal.
addprinc -randkey httpfs/isilonzone.domain.name@YOUR_REALM.COM
addprinc -randkey HTTP/isilonzone.domain.name@YOUR_REALM.COM
ktadd -norandkey -k httpfs.domain.name.keytab httpfs/isilonzone.domain.name@YOUR_REALM.COM HTTP/isilonzone.domain.name@YOUR_REALM.COM
Oozie
m. On every Kerberos configured node that runs Oozie, generate a keytab file that contains
entries for the Oozie principal and an HTTP principal.
addprinc -randkey oozie/domain.name@YOUR_REALM.COM
addprinc -randkey HTTP/domain.name@YOUR_REALM.COM
ktadd -norandkey -k oozie.domain.name.keytab oozie/domain.name@YOUR_REALM.COM HTTP/domain.name@YOUR_REALM.COM
n. Check to ensure that Oozie and HTTP principal information was added to the keytab file.
klist -e -k -t oozie.domain.name.keytab
ZooKeeper
o. On every Kerberos configured node that runs ZooKeeper, generate a keytab file that
contains entries for the ZooKeeper principal.
addprinc -randkey zookeeper/domain.name@YOUR_REALM.COM
ktadd -k zookeeper.domain.name.keytab zookeeper/domain.name@YOUR_REALM.COM
p. Check to ensure that ZooKeeper principal information was added to the keytab file.
klist -e -k -t zookeeper.domain.name.keytab
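The addprinc and ktadd lines in the table follow a regular pattern, so they can be generated per service and host instead of typed by hand. The following Python sketch is one way to do that. The function names are illustrative, and forcing the host to lowercase and the realm to uppercase is an assumption based on the parameter descriptions in step 2. The output is meant to be reviewed and pasted into a kadmin.local session, where -norandkey is available.

```python
def principal(service, host, realm):
    """Format service/host@REALM with a lowercase host and an uppercase realm."""
    return "{}/{}@{}".format(service, host.lower(), realm.upper())


def kadmin_commands(service, host, realm, with_http=False):
    """Return the addprinc and ktadd lines for one service on one host.

    With with_http=True the HTTP principal is created as well and both keys
    are exported with -norandkey, following the pattern of the Oozie row.
    """
    names = [principal(service, host, realm)]
    if with_http:
        names.append(principal("HTTP", host, realm))
    keytab = "{}.{}.keytab".format(service, host.lower())
    commands = ["addprinc -randkey " + name for name in names]
    option = "-norandkey " if with_http else ""
    commands.append("ktadd {}-k {} {}".format(option, keytab, " ".join(names)))
    return commands
```

For example, kadmin_commands("hive", "node1.example.com", "EXAMPLE.COM") yields one addprinc line and one ktadd line for a hypothetical host node1.example.com.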
Setting up Active Directory or LDAP authentication in Ambari
Lightweight Directory Access Protocol (LDAP) is an interface that is used to read from
and write to the Active Directory database. By default, Ambari uses an internal database as the
user store for authentication and authorization. You can configure LDAP or Active Directory (AD)
external authentication instead.
Before you begin
An LDAP client must be installed on the Ambari server host.
The Ambari server must not be running when you are performing this task.
The following table describes the properties and values that are required to set up LDAP
authentication.
Table 1. Ambari server LDAP properties
Property | Values | Description
authentication.ldap.primaryUrl | server:port | The hostname and port for the LDAP or AD server. For example, my.ldap.server:389.
authentication.ldap.secondaryUrl | server:port | The hostname and port for the secondary LDAP or AD server. For example, my.secondary.ldap.server:389. This value is optional.
authentication.ldap.useSSL | true or false | If true, use SSL when connecting to the LDAP or the AD server.
authentication.ldap.usernameAttribute | [LDAP attribute] | The attribute for username. For example, uid.
authentication.ldap.baseDn | [Distinguished Name] | The root Distinguished Name to search in the directory for users. For example, ou=people,dc=hadoop,dc=apache,dc=org.
authentication.ldap.bindAnonymously | true or false | If true, bind to the LDAP or AD server anonymously.
authentication.ldap.managerDn | [Full Distinguished Name] | If Bind anonymous is set to false, the Distinguished Name (“DN”) for the manager. For example, uid=hdfs,ou=people,dc=hadoop,dc=apache,dc=org.
authentication.ldap.managerPassword | [password] | If Bind anonymous is set to false, the password for the manager.
authentication.ldap.userObjectClass | [LDAP Object Class] | The object class that is used for users. For example, organizationalPerson.
authentication.ldap.groupObjectClass | [LDAP Object Class] | The object class that is used for groups. For example, groupOfUniqueNames.
authentication.ldap.groupMembershipAttr | [LDAP attribute] | The attribute for group membership. For example, uniqueMember.
authentication.ldap.groupNamingAttr | [LDAP attribute] | The attribute for group name.
____________________________________________________________________
Note: If you are going to set bindAnonymously to false (the default), make sure that you have
an LDAP Manager name and password set up. If you are going to use SSL, make sure you have
already set up your certificate and keys.
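Before running the setup command, it can help to check that the values collected for Table 1 are complete. The Python sketch below is a hypothetical pre-flight check and not part of Ambari. It assumes the values are held as strings in a dictionary keyed by the Table 1 property names, and it treats every property except the secondary URL as required, which is an assumption drawn from the prompts marked with an asterisk in the procedure.

```python
PREFIX = "authentication.ldap."

# Properties assumed to be required in every configuration.
ALWAYS_REQUIRED = [
    "primaryUrl", "useSSL", "usernameAttribute", "baseDn", "bindAnonymously",
    "userObjectClass", "groupObjectClass", "groupMembershipAttr", "groupNamingAttr",
]
# Properties needed only when the bind is not anonymous.
MANAGER_ONLY = ["managerDn", "managerPassword"]


def missing_ldap_settings(props):
    """Return the full names of required properties that are absent or empty."""
    names = list(ALWAYS_REQUIRED)
    # bindAnonymously is treated as false when it is not supplied.
    bind_anonymously = str(props.get(PREFIX + "bindAnonymously", "false")).lower()
    if bind_anonymously != "true":
        names += MANAGER_ONLY
    return [PREFIX + name for name in names if props.get(PREFIX + name) in (None, "")]
```

An empty result means that every value the sketch checks for is present; it does not validate the values against the directory server.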
To manage authorization and permissions against your users and groups, you must synchronize
those LDAP users and groups in the Ambari database.
If the LDAP server certificate is signed by a trusted Certificate Authority, you do not need to
import the certificate into Ambari. If the LDAP server certificate is self-signed, or is signed by
an unrecognized certificate authority such as an internal certificate authority, you must import
the certificate and create a keystore file.
Procedure
1. Stop the Ambari server.
ambari-server stop
2. If required, create a keystore file.
a. Create a directory for the keystore file. For example, type mkdir /keys to create a
directory called keys.
b. Create the keystore file. For example, type the following command to create the
keystore file ldaps-keystore.jks in the keys directory.
$JAVA_HOME/bin/keytool -import -trustcacerts -alias root -file $PATH_TO_YOUR_LDAPS_CERT -keystore /keys/ldaps-keystore.jks
c. When prompted, set a password.
The password is needed when you are setting up LDAP or AD authentication in
Ambari.
3. Run the following LDAP setup command, and answer the prompts with the information
that you previously collected.
ambari-server setup-ldap
Note: Prompts marked with an asterisk are required values.
4. At the Primary URL* prompt, type the server URL and port.
5. At the Secondary URL prompt, type the secondary server URL and port.
6. At the Use SSL* prompt, type your value.
If you are using LDAPS, type true.
7. At the User name attribute* prompt, type your value. The default value is uid.
8. At the Base DN* prompt, type your value.
9. At the Bind anonymously* prompt, type your value.
10. If you have set bindAnonymously to false, at the Manager DN* prompt, type your
value.
11. At the Enter the Manager Password* prompt, type the password for your LDAP
manager.
12. At the Enter the userObjectClass* prompt, type the object class that is used for
users.
13. At the Enter the groupObjectClass* prompt, type the object class that is used for
groups.
14. At the Enter the groupMembershipAttr* prompt, type the attribute for group
membership.
15. At the Enter the groupNamingAttr* prompt, type the attribute for group name.
16. If you set Use SSL* to true in step 6, the prompt Do you want to provide custom
TrustStore for Ambari? appears.
o If you are using a self-signed certificate that you do not want imported to the
existing JDK keystore, type y.
This option is more secure. For example, you want only Ambari to use this
certificate, and not any other applications run by JDK on the same host.
When you select this option, other prompts appear.
At the TrustStore type prompt, type jks.
At the Path to TrustStore file prompt, type /keystore_directory/ldaps-keystore.jks.
At the Password for TrustStore prompt, type the password that you
defined for the keystore.
o If you are using a self-signed certificate that you want to import and store in the
existing, default JDK keystore, type n.
This option is less secure.
When you select this option, do the following.
If necessary, convert the SSL certificate to X.509 format by executing the
following command:
openssl x509 -in slapd.pem -out slapd.crt
where slapd.crt is the path to the X.509 certificate.
Import the SSL certificate to the existing keystore, such as the default jre
certificates store, by typing the following command:
/usr/jdk64/jdk1.7.0_45/bin/keytool -import -trustcacerts -file slapd.crt -keystore /usr/jdk64/jdk1.7.0_45/jre/lib/security/cacerts
In this example, Ambari is set up to use JDK 1.7, so the certificate must
be imported into the JDK 7 keystore.
17. Review your settings, and if they are correct, select y.
18. Restart the Ambari server.
19. Synchronize your LDAP users and groups into the Ambari database.
o To synchronize a specific set of users and groups, type the following command:
ambari-server sync-ldap --users users.txt --groups groups.txt
where users.txt and groups.txt are files that contain comma-separated users and
groups.
Note: Group membership is determined using the group membership attribute
that you specified when you ran ambari-server setup-ldap.
o If you have synchronized a specific set of users and groups, type the following
command to synchronize only those entities that are in Ambari with LDAP. Users
are removed from Ambari if they no longer exist in LDAP, and group membership
in Ambari is updated to match LDAP.
ambari-server sync-ldap --existing
Note: Group membership is determined using the group membership attribute
that you specified when you ran ambari-server setup-ldap.
o To import all entities with matching LDAP user and group object classes into
Ambari, type the following command:
ambari-server sync-ldap --all
________________________________________________________
Note: Use this option only if you are sure that you want to synchronize all users
and groups from LDAP into Ambari. Isilon will also need to be configured for LDAP
authentication for this synchronization to work across the entire cluster.
_________________________________________________________
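The users.txt and groups.txt files used for a selective synchronization hold comma-separated names. The following Python sketch writes both files and returns the matching argument list for the sync command. The function name and the sample names in the usage note are hypothetical, and writing each file as a single line with no trailing newline is an assumption.

```python
from pathlib import Path


def write_sync_lists(users, groups, directory="."):
    """Write users.txt and groups.txt as comma-separated lists.

    Returns the argument list for the selective sync command, pointing at
    the two files that were written.
    """
    base = Path(directory)
    users_file = base / "users.txt"
    groups_file = base / "groups.txt"
    users_file.write_text(",".join(users))
    groups_file.write_text(",".join(groups))
    return ["ambari-server", "sync-ldap",
            "--users", str(users_file), "--groups", str(groups_file)]
```

For example, write_sync_lists(["alice", "bob"], ["hadoop-users"]) creates the two files in the current directory; the returned list can be run on the Ambari server host, for instance with subprocess.run.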
Additional User Privileges
Initially, the users you have enabled all have Ambari User privileges. Ambari Users can read
metrics, view service status and configuration, and browse job information. If you want users to
be able to start or stop services, modify configurations, and run smoke tests, you must give the
users administrator privileges.
Enabling Kerberos for HDFS on Isilon
Using MIT Kerberos 5
This section explains how to set up an Isilon cluster to authenticate HDFS connections with a
stand-alone MIT Kerberos 5 key distribution center (KDC). The following instructions assume
that you have already set up a Kerberos system with a resolvable hostname for the KDC and a
resolvable hostname for the KDC admin server. They also assume that your KDC is running on
the Ambari server, that each KDC has a different realm name, that you have one KDC per zone,
and that the Hadoop client setup for Kerberos is complete on the compute nodes.
__________________________________________________________________________
Note: AES encryption must be disabled in krb5.conf and RC4/DES should be listed as the only
supported encryption type on server and clients:
kdc.conf
supported_enctypes = RC4-HMAC:normal DES-CBC-MD5:normal DES-CBC-CRC:normal
__________________________________________________________________________
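The encryption-type requirement in the note can be checked mechanically. The Python sketch below scans the text of a kdc.conf file and reports any AES entries on a supported_enctypes line. The function name is illustrative, and the sketch assumes that each entry has the enctype:salt form shown in the example above.

```python
def aes_enctypes(kdc_conf_text):
    """Return the AES entries found on any supported_enctypes line.

    Commented-out lines are ignored because their key does not match
    supported_enctypes exactly.
    """
    found = []
    for line in kdc_conf_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "supported_enctypes":
            for entry in value.split():
                if entry.lower().startswith("aes"):
                    found.append(entry)
    return found
```

An empty result indicates that no AES encryption types are listed; the same function can be applied to the corresponding enctype settings if they use the same layout.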
Note: Deleting principals from Isilon does not remove them from KDC.
Procedure
Connect with SSH as root to any node in your Isilon cluster and run the following commands to
configure Isilon for Kerberos.
1. To prevent automatic SPN generation in the System zone, set the ‘All Auth Providers’
setting on the System zone to ‘No’.
isi zone zones modify --zone=system --all-auth-providers=No
2. Add the KDC to the Isilon cluster. Each KDC needs a unique name:
isi auth krb5 create --realm=EXAMPLE.COM --admin-server=kdc.example.com --kdc=kdc.example.com --user=kadmin/admin --password=isi
3. To verify and list all the auth providers for the cluster run:
isi auth status
4. Modify the zone to use the authentication provider:
isi zone zones modify --zone=zone-example --add-auth-provider=krb5:EXAMPLE.COM
5. Verify the zone information with the view command:
isi zone zones view --zone=zone-example
6. Create the Isilon SPNs for the zone. The format needs to be hdfs/<cluster hostname/SC
name>@REALM and HTTP/<cluster hostname/SC name>@REALM
isi auth krb5 spn create --provider-name=EXAMPLE.COM --spn=hdfs/[email protected] --user=kadmin/admin --password=isi
isi auth krb5 spn create --provider-name=EXAMPLE.COM --spn=HTTP/[email protected] --user=kadmin/admin --password=isi
7. Verify SPN creation:
isi auth krb5 spn list --provider-name=EXAMPLE.COM
8. Lastly, create the proxy users:
o isi hdfs proxyusers create oozie --zone=zone-example --add-user=ambari-qa
o isi hdfs proxyusers create hive --zone=zone-example --add-user=ambari-qa
o isi hdfs proxyusers create zookeeper --zone=zone-example --add-user=ambari-qa
o isi hdfs proxyusers create flume --zone=zone-example --add-user=ambari-qa
o isi hdfs proxyusers create hadoop --zone=zone-example --add-user=ambari-qa
o isi hdfs proxyusers create hbase --zone=zone-example --add-user=ambari-qa
9. Before proceeding to this step, finish the Kerberos setup on the compute nodes and
complete the Ambari Security Wizard. After everything has finished installing, configure the
Isilon zone to allow only secure connections with the following command:
o isi zone zones modify --zone=zone-example --hdfs-authentication=kerberos_only
______________________________________________________________________
Note: When you run the Ambari Security Wizard (next section), it is very important to set the
HDFS principals (namenode, snamenode, datanode) to the Isilon zone principal, for example
hdfs/[email protected]. All three principals must take the form
hdfs/<FQDN of the configured Isilon Hadoop zone>@REALM_NAME.
___________________________________________________________________________
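The SPN and proxy-user commands in steps 6 and 8 differ only in the zone, realm, and service names, so they can be generated for review before being run on an Isilon node. The Python sketch below only builds command strings and uses only the options that appear in the steps above. The function names and the zone FQDN in the usage note are hypothetical, and the password is placed on the command line as in the example in step 6.

```python
def isilon_spn_commands(realm, zone_fqdn, admin_user, password):
    """Return the hdfs and HTTP SPN creation commands for one zone FQDN."""
    commands = []
    for service in ("hdfs", "HTTP"):
        spn = "{}/{}@{}".format(service, zone_fqdn, realm)
        commands.append(
            "isi auth krb5 spn create --provider-name={} --spn={} "
            "--user={} --password={}".format(realm, spn, admin_user, password)
        )
    return commands


def proxyuser_commands(zone, services, member="ambari-qa"):
    """Return one proxy-user creation command per service account."""
    return [
        "isi hdfs proxyusers create {} --zone={} --add-user={}".format(name, zone, member)
        for name in services
    ]
```

For example, isilon_spn_commands("EXAMPLE.COM", "zone1.example.com", "kadmin/admin", "isi") returns the two commands for a hypothetical SmartConnect name zone1.example.com.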
Running the Ambari Kerberos Wizard
_________________________________________________________________________
Note: Make sure you complete the Enabling Kerberos for HDFS on Isilon setup (described in the
preceding section) before completing the Ambari Kerberos Wizard.
_________________________________________________________________________
Your cluster might use a primary KDC and one or more secondary KDCs to ensure continued
availability of Kerberos-enabled services. In this configuration, each KDC contains a copy of the
Kerberos database. The primary KDC contains the writeable copy of the realm database, which
is replicated on each of the secondary KDCs.
The Kerberos realm must trust the server. In Kerberos configuration files, your realm is
typically identified in uppercase characters to differentiate it from any similar DNS domain that
the realm is associated with.
__________________________________________________________________________
Note: To use Kerberos, you must install a few basic packages on the machines in your cluster
or build and install the packages from scratch. If you need to build the packages yourself, you
can download the latest version from the MIT website.
If your system uses a package management system, you can install the following packages to
use a generic version of Kerberos:
o krb5-workstation must be installed on all client systems. This package contains basic
Kerberos programs, in addition to Kerberos-enabled versions of the telnet and ftp
applications.
o krb5-server must be installed on all server and secondary server systems. This package
provides the programs that must be installed on a Kerberos 5 server or server replica.
o krb5-libs must be installed on all client and server systems. This package contains the
shared libraries that are used by Kerberos on all clients and services.
o pam_krb5 must be installed on all client systems. This package provides a pluggable
authentication module (PAM) that enables Kerberos authentication.
Procedure
1. From the Ambari web dashboard, from the menu bar, click Admin > Kerberos.
2. Click Enable Kerberos.
3. Select the type of KDC that you want to use and confirm that you meet the prerequisites.
4. Provide information about the KDC and admin account in the configuration page.
5. Install the Kerberos client. The wizard page shows you the progress, but you can also see the
progress of the install in the file /var/log/ambari-server/ambari-server.log.
The Kerberos clients are installed on the hosts and the access to the KDC is tested by testing
that Ambari can create a principal, generate a keytab and distribute that keytab.
6. Configure the Kerberos identities that are used by Hadoop.
7. Kerberize the cluster.
____________________________________________________________________
Note: Make sure Isilon is configured for Kerberos before configuring HDFS in the Ambari
Security Wizard; see Enabling Kerberos for HDFS on Isilon. Click through the wizard until you
get to the screen that configures the principals.
Note: Isilon does not convert principal names to short names using rules, so do not use
aliases (for example, rm instead of yarn).
o Realm name
o Hdfs -> namenode hdfs/[email protected]
o Hdfs -> secondarynamenode hdfs/[email protected]
o Hdfs -> datanode hdfs/[email protected]
o Yarn -> resourceManager yarn/_HOST
o Yarn -> nodemanager yarn/_HOST
o Mapreduce2 -> history server principal -> mapred/_HOST
8. The final step to enable Kerberos is called Start and Test Services. If you see an error that
indicates some services failed to start and execute tests successfully, you might learn
more about the issue by clicking Start and Test Services. If you see a Check HBase failure
error message, such as ERROR: Can't get master address from ZooKeeper; znode data
== null, work around this issue by manually restarting the HBase service. After manually
restarting, retry the Start and Test Services.
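The principal settings entered in step 7 are a common source of start failures, because all three HDFS principals must name the Isilon zone. The Python sketch below is a hypothetical check that can be run against the values before they are submitted in the wizard. The function name, the dictionary keys, and the zone FQDN in the usage note are illustrative.

```python
def mismatched_hdfs_principals(principals, zone_fqdn, realm):
    """Return the HDFS components whose principal is not hdfs/<zone_fqdn>@<realm>.

    `principals` maps the component names namenode, secondarynamenode, and
    datanode to the principal strings that will be entered in the wizard.
    """
    expected = "hdfs/{}@{}".format(zone_fqdn, realm)
    components = ("namenode", "secondarynamenode", "datanode")
    return [name for name in components if principals.get(name) != expected]
```

An empty result means that the three values agree with the zone FQDN and realm that were passed in, for example "zone1.example.com" and "EXAMPLE.COM".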
Troubleshooting and Support
To isolate and resolve problems with BigInsights®, you can use the troubleshooting and
support information online. This information contains instructions for using the
problem-determination resources that are provided with BigInsights.
https://www-01.ibm.com/support/knowledgecenter/SSPT3X_4.1.0/com.ibm.swg.im.infosphere.biginsights.trb.doc/doc/troubleshooting.html