View Hadoop Administration Course at www.edureka.co/hadoop-admin
Secure your Hadoop Cluster with Kerberos
Slide 2 | www.edureka.co/hadoop-admin | Twitter @edurekaIN, Facebook /edurekaIN, use #askEdureka for questions
Objectives
At the end of this module, you will be able to
» Hadoop cluster introduction
» Recommended configuration for a cluster
» Hadoop cluster running modes
» Hadoop security with Kerberos
» Hadoop Admin responsibilities
» Demo on Kerberos
Slide 3
Hadoop Core Components
Hadoop 2.x Core Components
» HDFS (storage): NameNode (master), DataNodes (slaves), plus a SecondaryNameNode
» YARN (processing): Resource Manager (master), Node Managers (slaves)
Slide 4
Hadoop Cluster: A Typical Use Case

» Active NameNode: 64 GB RAM, 1 TB hard disk, Xeon processor with 8 cores, 3 x 10 Gb/s Ethernet, 64-bit CentOS, redundant power supply
» Standby NameNode (optional): 64 GB RAM, 1 TB hard disk, Xeon processor with 8 cores, 3 x 10 Gb/s Ethernet, 64-bit CentOS, redundant power supply
» Secondary NameNode: 32 GB RAM, 1 TB hard disk, Xeon processor with 4 cores, 3 x 10 Gb/s Ethernet, 64-bit CentOS, redundant power supply
» DataNodes (each): 16 GB RAM, 6 x 2 TB hard disks, Xeon processor with 2 cores, 3 x 10 Gb/s Ethernet, 64-bit CentOS
Slide 5
Planning cluster growth around storage capacity is often a sound approach.
Cluster Growth Based On Storage Capacity
» Data grows by approximately 5 TB per week
» HDFS is set up to replicate each block three times
» Thus, 15 TB of extra storage space is required per week
» Assuming machines with 5 x 3 TB hard drives, that equates to one new machine required each week
» Assume overheads to be 30%
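The arithmetic above can be sketched quickly. The figures (5 TB/week, 3x replication, 5 x 3 TB drives, 30% overhead) come from the slide; the function name is just illustrative.

```python
import math

# Capacity-driven growth estimate; all figures are from the slide above.
WEEKLY_DATA_TB = 5          # raw data ingested per week
REPLICATION_FACTOR = 3      # HDFS replicates each block three times
DRIVES_PER_NODE = 5         # hard drives per slave machine
DRIVE_SIZE_TB = 3           # capacity of each drive

def machines_per_week(overhead=0.0):
    """New slave machines needed per week to absorb the replicated data."""
    need_tb = WEEKLY_DATA_TB * REPLICATION_FACTOR                 # 15 TB/week
    usable_tb = DRIVES_PER_NODE * DRIVE_SIZE_TB * (1 - overhead)  # per machine
    return math.ceil(need_tb / usable_tb)

print(machines_per_week())      # raw capacity only: 1 machine per week
print(machines_per_week(0.30))  # with 30% overhead reserved: 2 machines per week
```

Note how the 30% overhead turns "one machine per week" into two: only 10.5 TB of each machine's 15 TB remains usable for HDFS data.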
Slide 6
Slave Nodes: Recommended Configuration
Higher-performance vs. lower-performance components: save the money, buy more nodes!

General 'base' configuration for a slave node (depends on requirements):
» 4 x 1 TB or 2 TB hard drives, in a JBOD (Just a Bunch Of Disks) configuration
» Do not use RAID!
» 2 x quad-core CPUs
» 24-32 GB RAM
» Gigabit Ethernet

Special configuration: multiples of (1 hard drive + 2 cores + 6-8 GB RAM) generally work well for many types of applications.

"A cluster with more nodes performs better than one with fewer, slightly faster nodes"
Slide 7
Slave Nodes: More Details (RAM)
» Generally, each Map or Reduce task will take 1 GB to 2 GB of RAM
» Slave nodes should not be using virtual memory
» RULE OF THUMB! Total number of tasks = 1.5 x number of processor cores
» Ensure enough RAM is present to run all tasks, plus the DataNode and TaskTracker daemons, plus the operating system
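A rough sizing sketch for the rule of thumb above. The worst-case 2 GB per task and the 4 GB daemon/OS headroom are assumed figures, not numbers from the slide.

```python
# Slave-node RAM sizing sketch. The 2 GB/task default takes the worst case
# of the 1-2 GB range; the 4 GB daemon/OS headroom is an assumption.

def slave_ram_needed_gb(cores, ram_per_task_gb=2.0, daemon_os_gb=4.0):
    total_tasks = int(1.5 * cores)  # rule of thumb: 1.5 x processor cores
    return total_tasks * ram_per_task_gb + daemon_os_gb

# A 2 x quad-core slave (8 cores) yields 12 task slots
print(slave_ram_needed_gb(8))  # 28.0 GB
```

This lands comfortably inside the 24-32 GB range recommended for the base slave configuration.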
Slide 8
Master Node Hardware Recommendations
A master node requires carrier-class hardware (not commodity hardware):
» Dual power supplies
» Dual Ethernet cards (bonded to provide failover)
» RAID-configured hard drives
» At least 32 GB of RAM
Slide 9
Hadoop Cluster Modes
Hadoop can run in any of the following three modes:

» Standalone (or Local) Mode: no daemons; everything runs in a single JVM; has no DFS; suitable for running MapReduce programs during development
» Pseudo-Distributed Mode: Hadoop daemons run on the local machine
» Fully-Distributed Mode: Hadoop daemons run on a cluster of machines
Slide 10
Configuration Files
Configuration Filename | Description
hadoop-env.sh, yarn-env.sh | Settings for the Hadoop daemons' process environment
core-site.xml | Configuration settings for Hadoop Core, such as I/O settings that are common to both HDFS and YARN
hdfs-site.xml | Configuration settings for the HDFS daemons: the NameNode and the DataNodes
yarn-site.xml | Configuration settings for the Resource Manager and Node Manager
mapred-site.xml | Configuration settings for MapReduce applications
slaves | A list of machines (one per line) that each run a DataNode and Node Manager
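As an illustration of two of these files, here are minimal fragments. `fs.defaultFS` and `dfs.replication` are standard Hadoop property names; the hostname and port are placeholder values.

```xml
<!-- core-site.xml: minimal illustrative fragment (hostname/port are placeholders) -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>

<!-- hdfs-site.xml: minimal illustrative fragment -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```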
Slide 11
Hadoop 2.x Configuration Files – Apache Hadoop
» core-site.xml – Core
» hdfs-site.xml – HDFS
» yarn-site.xml – YARN
» mapred-site.xml – MapReduce
Slide 12
Security
» The Hadoop ecosystem has only partially adopted Kerberos; many services remain unprotected and use trivial authentication schemes.
» YARN provides service-level authorization and web-proxy capabilities.
» Most security tools fail to scale and perform in big-data environments.
Slide 13
Security – Simple Flow
Security Risks
» Insufficient authentication: services do not authenticate users
» No privacy and no integrity: insecure network transport, no message-level security
» Arbitrary code execution: no user verification for MapReduce code execution, so malicious users could submit a job

[Flow diagram: the Client submits to the Resource Manager; Node Managers launch Tasks; both the Client and the Tasks read from and write to HDFS]
Slide 14
Kerberos to the rescue
Network authentication protocol
Developed at MIT in the mid-1980s
Available as open source or in supported commercial software
Slide 15
Kerberos Design Requirements
Interactions between hosts and clients should be encrypted.
Must be convenient for users (or they won’t use it).
Protect against intercepted credentials.
Kerberos is based on the secret-key distribution model:
» Keys are the basis of authentication in Kerberos
» A key is typically a short sequence of bytes
» The same key is used to both encrypt and decrypt

Encryption: plaintext + encryption key = ciphertext
Decryption: ciphertext + decryption key = plaintext
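A toy sketch of the shared-secret idea above: the same key both encrypts and decrypts. XOR is used here purely for readability; Kerberos itself uses real ciphers such as AES.

```python
# Toy symmetric cipher: applying XOR with the same key twice restores the
# plaintext, which is the property the secret-key model relies on.
# NOT secure; for illustration only.

def xor_cipher(data: bytes, key: bytes) -> bytes:
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

key = b"shared-secret"          # the single key both sides hold
plaintext = b"attack at dawn"

ciphertext = xor_cipher(plaintext, key)   # plaintext + key -> ciphertext
recovered = xor_cipher(ciphertext, key)   # ciphertext + same key -> plaintext

assert recovered == plaintext
```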
Slide 16
Kerberos to the rescue
Kerberos Integration
» User authentication
» User and group access-control lists at the cluster level
» Tokens: Delegation Token, Job Token, Block Access Token
» Simple Authentication and Security Layer (SASL) with the RPC digest mechanism

The flow between the Client, the Kerberos Key Distribution Center (Authentication Server + Ticket Granting Server) and the Server:
1. Authentication: get a TGT from the Authentication Server
2. Authorization: get a service ticket from the Ticket Granting Server
3. Service request: start the service session with the Server
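The three exchanges above can be sketched as a toy model. A real KDC encrypts every ticket; here tickets are plain dicts, and the class, principal names and service names are all illustrative.

```python
# Toy model of the three Kerberos exchanges. Real tickets are encrypted
# and time-limited; plain dicts keep the flow readable here.

class ToyKDC:
    def __init__(self, users, services):
        self.users = users          # principal -> password
        self.services = services    # known service principals

    # Step 1: Authentication - the client proves its identity, gets a TGT
    def authenticate(self, principal, password):
        if self.users.get(principal) != password:
            raise PermissionError("authentication failed")
        return {"type": "TGT", "principal": principal}

    # Step 2: Authorization - the TGT is exchanged for a service ticket
    def grant_service_ticket(self, tgt, service):
        if tgt.get("type") != "TGT" or service not in self.services:
            raise PermissionError("invalid TGT or unknown service")
        return {"type": "service", "principal": tgt["principal"], "service": service}

# Step 3: Service request - the server accepts a ticket naming itself
def start_session(ticket, service):
    return ticket.get("type") == "service" and ticket.get("service") == service

kdc = ToyKDC({"alice@EXAMPLE.COM": "pw"}, {"hdfs/namenode"})
tgt = kdc.authenticate("alice@EXAMPLE.COM", "pw")
st = kdc.grant_service_ticket(tgt, "hdfs/namenode")
print(start_session(st, "hdfs/namenode"))  # True
```

The point of the indirection is that the client's password is used only once, against the Authentication Server; every later service contact uses tickets.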
Slide 17
Kerberos Applications
Within networks and small sets of networks, Kerberos provides:
» Authentication
» Authorization
» Confidentiality
Slide 18
DEMO
Slide 19
Hadoop Admin Responsibilities
» Responsible for implementation and administration of the Hadoop infrastructure
» Testing HDFS, Hive, Pig and MapReduce access for applications
» Cluster maintenance tasks such as backup, recovery, upgrade and patching
» Performance tuning and capacity planning for clusters
» Monitoring the Hadoop cluster and deploying security
LIVE Online Class
Class Recording in LMS
24/7 Post Class Support
Module Wise Quiz
Project Work
Verifiable Certificate
Slide 20
How it Works?
Questions
Slide 21
Slide 22
Course Topics
Module 1 » Hadoop Cluster Administration
Module 2 » Hadoop Architecture and Cluster Setup
Module 3 » Hadoop Cluster: Planning and Managing
Module 4 » Backup, Recovery and Maintenance
Module 5 » Hadoop 2.0 and High Availability
Module 6 » Advanced Topics: QJM, HDFS Federation and Security
Module 7 » Oozie, HCatalog/Hive and HBase Administration
Module 8 » Project: Hadoop Implementation