Facing Enterprise-specific Challenges – Utility Programming in Hadoop
吳育儒 Fann Wu
Who am I?
• Fann Wu 吳育儒
• Sr. Engineer, SPN, Trend Micro
• Hadoop Cluster Admin
• Splunk Cluster Admin
• Monitor Admin
• Handyman (水電工)
Architecture, Operation, Troubleshooting, Automation, Performance Tuning
Agenda
• How to manage a big cluster
• How to manage a big Hadoop cluster
• Datacenter & AWS
How to manage a big cluster
Kung fu (武功)
• Basic principles (心法)
• Basic techniques (招式)
Principles (心法)
Techniques (招式)
The best technique is no technique (無招勝有招)
Principles (心法)
• Manage hundreds of servers as if they were one server
• Care for hundreds of servers like a girlfriend
• Only stable servers let you sleep well
• Don't forget the Kuai Kuai snacks (放乖乖 — the Taiwanese tradition of placing "Kuai Kuai" snacks on machines so they behave)
Cluster
http://www.quuxlabs.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
Technique (招式) – Shell script
Practical homebrew tooling!!!
Technique (招式) – PSSH
PSSH provides parallel versions of OpenSSH and related tools. Included are pssh, pscp, prsync, pnuke, and pslurp.
https://code.google.com/p/parallel-ssh/
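A typical pssh run might look like this; the hosts.txt file, root login, and the commands are illustrative assumptions (passwordless SSH to every host is required):

```shell
# Run a command on every host listed in hosts.txt, 32 at a time,
# printing each host's output inline (-i).
pssh -h hosts.txt -l root -p 32 -i 'uptime'

# Push the same file to all hosts with pscp.
pscp -h hosts.txt -l root /etc/hosts /etc/hosts
```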
Technique (招式) – SaltStack
• Roles:
  – Salt Master
  – Salt Minions
  – Web UI
Technique (招式) – SaltStack
• Install SaltStack from EPEL (CentOS 6)
Technique (招式) – SaltStack
• Configure SaltStack and start the SaltStack services
• Master: add "interface: 192.168.50.8" to /etc/salt/master
• Minion: add "master: 192.168.50.8" to /etc/salt/minion
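The install-and-configure steps can be sketched as follows; the IP matches the slides, and the package/service names are the standard EPEL ones for CentOS 6:

```shell
# Packages come from EPEL on CentOS 6.
yum install -y epel-release
yum install -y salt-master    # on the master only
yum install -y salt-minion    # on every minion

# Master: bind to the management interface, then start the daemon.
echo 'interface: 192.168.50.8' >> /etc/salt/master
service salt-master start && chkconfig salt-master on

# Minion: point at the master, then start the daemon.
echo 'master: 192.168.50.8' >> /etc/salt/minion
service salt-minion start && chkconfig salt-minion on
```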
Technique (招式) – SaltStack
• List unaccepted keys, then accept all keys
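A minimal key-handling sketch with salt-key, run on the master:

```shell
# List all keys; new minions show up under "Unaccepted Keys".
salt-key -L

# Accept every pending key without prompting (verify fingerprints
# first in a production cluster).
salt-key -A -y
```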
Technique (招式) – SaltStack
• You can then use the salt command to control the whole cluster
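A few illustrative salt commands; the hostname glob and service name are assumptions, not from the deck:

```shell
# Verify that every minion responds.
salt '*' test.ping

# Run an arbitrary command on all minions at once.
salt '*' cmd.run 'uptime'

# Target a subset by hostname glob, e.g. restart a service on
# every datanode (glob and service name are hypothetical).
salt 'datanode*' service.restart hadoop-hdfs-datanode
```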
Technique (招式) – SaltStack UI: Halite
Configuration Management
Chef
Ansible
Puppet
Puppet Web - Foreman
Wait… where is Hadoop?
How to manage a big Hadoop cluster
Trend Micro Hadoop
• Servers: hundreds of servers
• Users: about 200 accounts
• Daily input data: 2 TB
• Daily jobs: hundreds of jobs
We need… Hadoop as a Service
• Central Management
• Automation
• High Availability
• Customization
Hadoop Ecosystem + Puppet (IT automation software) = Hadooppet
Hadooppet: a project for deploying the Trend Micro Hadoop distribution on a large cluster
So…
CLUSTER DEPLOYMENT BY DISTRIBUTION / ENVIRONMENT
• POC, Staging, Production
• All-in-one VM, AWS EC2 deployment
CLUSTER DEPLOYMENT
• Package installation
• Configuration adjustment
CLUSTER OPERATION
• Add new Hadoop node/client
• Account management
• Process management
SANITY CHECK
• DFSIO, YCSB, etc.
• Sample applications
Hadooppet
WE CAN EASILY DEPLOY HUNDREDS OF SERVERS WITHIN ONE HOUR
Hadoop Security
Hadoop Security – Without security
• From any machine that can access Hadoop:
• [root@hackserver opt]# su hdfs
• [hdfs@hackserver opt]$ hadoop fs -rmr /
• Say goodbye to your data
Any resemblance to actual events is purely coincidental (如有雷同純屬巧合)
Hadoop Security - Kerberos
Hadoop Security – Kerberos
• Without authentication
• With authentication
Kerberos Common Problem
• Problem: "Clock skew too great while getting initial credentials"
• Solution: use date or ntpdate to sync the time
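Kerberos rejects requests when clocks differ by more than the allowed skew (5 minutes by default), so a one-shot sync usually clears the error. The NTP server below is illustrative; use your site's time source:

```shell
# One-shot clock sync against an NTP server.
ntpdate pool.ntp.org

# Or set the clock by hand when no NTP server is reachable.
date -s '2015-09-01 12:00:00'

# Keep the clock synced so the skew does not come back (CentOS 6).
chkconfig ntpd on && service ntpd start
```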
Hadoop Security – Folder Permission• POSIX permissions
• POSIX ACLs (Access Control Lists)
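A sketch of both mechanisms on HDFS; the paths, users, and groups are made up, and ACL support requires dfs.namenode.acls.enabled=true in hdfs-site.xml:

```shell
# Classic POSIX permission bits and ownership.
hadoop fs -chmod 750 /user/alice/project
hadoop fs -chown alice:analytics /user/alice/project

# POSIX ACLs: grant one extra user read/execute without widening
# the group, then inspect the result.
hdfs dfs -setfacl -m user:bob:r-x /user/alice/project
hdfs dfs -getfacl /user/alice/project
```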
Hadoop Security – More security options, but still in incubation (Apache Knox)
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_Knox_Gateway_Admin_Guide/content/ch01.html
Hadoop HA
Sleep well – HA
• Kerberos HA
• LDAP HA
• NameNode HA
• JobTracker HA / ResourceManager HA
• HBase HA
Metric/Monitoring tools
Without metric/monitoring tools
Know the present and the future – Ganglia
Know the real problem – Nagios
Know the real problem – Nagios mail
Find the detailed logs and generate fancy reports – Splunk
Job collector
User Mapper/Reducer usage
Failed Job Summary
Cluster Mapper Usage/Pending
Cluster Reducer Usage/Pending
Other Tools
Offline Image Viewer
• Transforms the fsimage from binary to text
• cat fsimage.txt
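The conversion step might look like this; the paths are illustrative, and the processor name varies by Hadoop version (older releases expose the Delimited processor through hdfs oiv or oiv_legacy):

```shell
# Dump the binary fsimage to a delimited text file.
hdfs oiv -i /data/dfs/name/current/fsimage -o /tmp/fsimage.txt -p Delimited

# Then inspect it like any text file.
head /tmp/fsimage.txt
```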
Total Number of Files for Each User – Pig
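The deck does this aggregation with Pig; as an illustrative stand-in, the same per-user file count over a (fake, heavily simplified) fsimage text dump can be sketched with awk — a real dump has more columns, so adjust the field index:

```shell
# Fake two-column dump: path, then owning user.
cat > /tmp/fsimage.txt <<'EOF'
/user/alice/a.txt alice
/user/alice/b.txt alice
/user/bob/c.txt bob
EOF

# Count files per user -- the same group-and-count the Pig job performs.
awk '{count[$2]++} END {for (u in count) print u, count[u]}' \
    /tmp/fsimage.txt | sort
# → alice 2
# → bob 1
```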
Common DataNode Decommission Process
• Add the DataNodes' hostnames to the dfs.exclude file
• On the NameNode host, run hdfs dfsadmin -refreshNodes
• Check the web UI to see whether the state has changed to "Decommission In Progress" for the DataNodes being decommissioned (1–2 days)
• When all the DataNodes report their state as "Decommissioned", you can shut down the decommissioned nodes
• Replace the crashed HDD, reboot the server, and re-configure the HDD from the RAID card (20 mins)
• Mount the HDD
• Start the DataNode service
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.5/bk_system-admin-guide/content/admin_decommission-slave-nodes-2-1.html
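The first two steps can be sketched as follows; the hostname and exclude-file path are assumptions — the real path is whatever dfs.hosts.exclude points at in hdfs-site.xml:

```shell
# On the NameNode host: add the node to the exclude file and tell
# the NameNode to re-read it.
echo 'datanode042.example.com' >> /etc/hadoop/conf/dfs.exclude
hdfs dfsadmin -refreshNodes

# Watch progress: the node moves from "Decommission In Progress"
# to "Decommissioned" in the report (and in the web UI).
hdfs dfsadmin -report | grep -A 2 'datanode042'
```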
Trend Micro DataNode Decommission (HDD hot swap)
• Replace the crashed HDD
• Stop the DataNode and umount the broken mount point (5 mins)
• Re-initialize the HDD from the RAID card settings
• Check /var/log/messages; Linux will auto-rescan the device
• Mount the repaired mount point
• Start the DataNode service
HBase Canary Tool
• Contributor: Scott Miao, Trend Micro
• Purpose: check every table's first region on each RegionServer
https://issues.apache.org/jira/browse/HBASE-7525
HBase Canary Tool
• Usage:
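Invocation might look like this; the exact class name and options vary by HBase version, so check HBASE-7525 and your release's docs:

```shell
# Scan the first region of every table in the cluster.
hbase org.apache.hadoop.hbase.tool.Canary

# Limit the check to specific tables (table names are illustrative).
hbase org.apache.hadoop.hbase.tool.Canary usertable metrics_table
```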
• Result
HappyBase + Thrift
• What is HappyBase
• Purpose:
  – Check the response time of every region on a RegionServer
  – Check the response time of every region of a table
http://happybase.readthedocs.org/en/latest/
Datacenter & AWS
How we tested the EMR POC
If your EMR cluster is running 24x7…
Reduce EMR cost
• 100 nodes running 1 hour costs the same as 1 node running 100 hours
• AWS charges by the hour
• If you don't care about job stability, use Spot Instances to save cost
• Use Reserved Instances to save cost
• Use EMR auto scaling
• Pilot-run your application to estimate how many machines, and what size, you need
• Get your monthly cost from the AWS calculator
• http://calculator.s3.amazonaws.com/index.html
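A quick sanity check of the "100 nodes × 1 hour == 1 node × 100 hours" claim, using a made-up per-node-hour price (real prices come from the AWS calculator linked above):

```shell
# Hypothetical price: $0.26 per node-hour (instance cost + EMR fee).
awk 'BEGIN {
  price = 0.26
  printf "100 nodes x 1 hour : $%.2f\n", price * 100 * 1
  printf "1 node x 100 hours : $%.2f\n", price * 1 * 100
}'
# → 100 nodes x 1 hour : $26.00
# → 1 node x 100 hours : $26.00
```

Same bill either way, but the wide cluster finishes 100 times sooner, which is why scaling out for short jobs is usually the better deal on hourly billing.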
Datacenter Cost by Service (Storage)
• Application size / (server HDD space × 0.75) × server cost / 2
Datacenter Cost by Service (Computing)
• ((used Map slots + used Reduce slots) / (total Map slots + total Reduce slots)) × total server cost / 2
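Plugging made-up numbers into both formulas — 300 TB of application data, 24 TB raw disk per server at 75% usable, $6,000 per server, 400 of 1,000 slots used, $600,000 total server cost, with the "/2" splitting each server's cost evenly between storage and computing:

```shell
awk 'BEGIN {
  # Storage: app size / usable disk per server * server cost / 2
  storage = 300 / (24 * 0.75) * 6000 / 2
  # Computing: used-slot fraction * total server cost / 2
  computing = (400 / 1000) * 600000 / 2
  printf "storage  : $%.0f\n", storage
  printf "computing: $%.0f\n", computing
}'
# → storage  : $50000
# → computing: $120000
```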
#TrendInsight
Thank you!
WE ARE HIRING! WELCOME TO JOIN TREND MICRO!