Facing Enterprise-specific Challenges – Utility Programming in Hadoop
吳育儒 Fann Wu
Who am I?
• Fann Wu 吳育儒
• Sr. Engineer, SPN, Trend Micro
• Hadoop Cluster Admin
• Splunk Cluster Admin
• Monitor Admin
• Handyman (水電工)
Architecture, Operation, Troubleshooting, Automation, Performance Tuning
Agenda
• How to manage a big cluster
• How to manage a big Hadoop cluster
• Datacenter & AWS
How to manage a big cluster
Kung fu (武功)
• Basic principles (心法)
• Basic techniques (招式)
Principles (心法)
Techniques (招式)
The best technique is no technique (無招勝有招)
Principles (心法)
• Manage hundreds of servers as if they were one server
• Care for hundreds of servers like a girlfriend
• Only stable servers let you sleep well
• Don't forget the Kuai Kuai snacks (放乖乖 — the Taiwanese tradition of placing "Kuai Kuai" snacks on machines so they behave)
Cluster
http://www.quuxlabs.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
Technique (招式) – Shell script
Practical homebrew tooling!!!
Technique (招式) – PSSH
PSSH provides parallel versions of OpenSSH and related tools. Included are pssh, pscp, prsync, pnuke, and pslurp.
https://code.google.com/p/parallel-ssh/
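A typical pssh run might look like this; the hosts.txt file, root login, and the commands are illustrative assumptions (passwordless SSH to every host is required):

```shell
# Run a command on every host listed in hosts.txt, 32 at a time,
# printing each host's output inline (-i).
pssh -h hosts.txt -l root -p 32 -i 'uptime'

# Push the same file to all hosts with pscp.
pscp -h hosts.txt -l root /etc/hosts /etc/hosts
```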
Technique (招式) – SaltStack
• Roles:
  – Salt Master
  – Salt Minions
  – Web UI
Technique (招式) – SaltStack
• Install SaltStack from EPEL (CentOS 6)
Technique (招式) – SaltStack
• Configure SaltStack and start the SaltStack services
• Master: add "interface: 192.168.50.8" to /etc/salt/master
• Minion: add "master: 192.168.50.8" to /etc/salt/minion
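The install-and-configure steps can be sketched as follows; the IP matches the slides, and the package/service names are the standard EPEL ones for CentOS 6:

```shell
# Packages come from EPEL on CentOS 6.
yum install -y epel-release
yum install -y salt-master    # on the master only
yum install -y salt-minion    # on every minion

# Master: bind to the management interface, then start the daemon.
echo 'interface: 192.168.50.8' >> /etc/salt/master
service salt-master start && chkconfig salt-master on

# Minion: point at the master, then start the daemon.
echo 'master: 192.168.50.8' >> /etc/salt/minion
service salt-minion start && chkconfig salt-minion on
```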
Technique (招式) – SaltStack
• List unaccepted keys, then accept all keys
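A minimal key-handling sketch with salt-key, run on the master:

```shell
# List all keys; new minions show up under "Unaccepted Keys".
salt-key -L

# Accept every pending key without prompting (verify fingerprints
# first in a production cluster).
salt-key -A -y
```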
Technique (招式) – SaltStack
• You can then use the salt command to control the whole cluster
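A few illustrative salt commands; the hostname glob and service name are assumptions, not from the deck:

```shell
# Verify that every minion responds.
salt '*' test.ping

# Run an arbitrary command on all minions at once.
salt '*' cmd.run 'uptime'

# Target a subset by hostname glob, e.g. restart a service on
# every datanode (glob and service name are hypothetical).
salt 'datanode*' service.restart hadoop-hdfs-datanode
```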
Technique (招式) – SaltStack UI: Halite
Configuration Management
Chef
Ansible
Puppet
Puppet Web - Foreman
Wait… where is Hadoop?
How to manage a big Hadoop cluster
Trend Micro Hadoop
• Servers: hundreds of servers
• Users: about 200 accounts
• Daily input data: 2 TB
• Daily jobs: hundreds of jobs
We need… Hadoop as a Service
• Central Management
• Automation
• High Availability
• Customization
Hadoop Ecosystem + Puppet (IT automation software) = Hadooppet
Hadooppet: a project for deploying the Trend Micro Hadoop distribution on a large cluster
So…
CLUSTER DEPLOYMENT BY DISTRIBUTION / ENVIRONMENT
• POC, Staging, Production
• All-in-one VM, AWS EC2 deployment
CLUSTER DEPLOYMENT
• Package installation
• Configuration adjustment
CLUSTER OPERATION
• Add new Hadoop node/client
• Account management
• Process management
SANITY CHECK
• DFSIO, YCSB, etc.
• Sample applications
Hadooppet
WE CAN EASILY DEPLOY HUNDREDS OF SERVERS WITHIN ONE HOUR
Hadoop Security
Hadoop Security – Without security
• From any machine that can access Hadoop:
• [root@hackserver opt]# su hdfs
• [hdfs@hackserver opt]$ hadoop fs -rmr /
• Say goodbye to your data
Any resemblance to actual events is purely coincidental (如有雷同純屬巧合)
Hadoop Security - Kerberos
Hadoop Security – Kerberos
• Without authentication
• With authentication
Kerberos Common Problem
• Problem: "Clock skew too great while getting initial credentials"
• Solution: use date or ntpdate to sync the time
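Kerberos rejects requests when clocks differ by more than the allowed skew (5 minutes by default), so a one-shot sync usually clears the error. The NTP server below is illustrative; use your site's time source:

```shell
# One-shot clock sync against an NTP server.
ntpdate pool.ntp.org

# Or set the clock by hand when no NTP server is reachable.
date -s '2015-09-01 12:00:00'

# Keep the clock synced so the skew does not come back (CentOS 6).
chkconfig ntpd on && service ntpd start
```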
Hadoop Security – Folder Permission• POSIX permissions
• POSIX ACLs (Access Control Lists)
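A sketch of both mechanisms on HDFS; the paths, users, and groups are made up, and ACL support requires dfs.namenode.acls.enabled=true in hdfs-site.xml:

```shell
# Classic POSIX permission bits and ownership.
hadoop fs -chmod 750 /user/alice/project
hadoop fs -chown alice:analytics /user/alice/project

# POSIX ACLs: grant one extra user read/execute without widening
# the group, then inspect the result.
hdfs dfs -setfacl -m user:bob:r-x /user/alice/project
hdfs dfs -getfacl /user/alice/project
```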
Hadoop Security – More security options, but still in incubation (Apache Knox)
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_Knox_Gateway_Admin_Guide/content/ch01.html
Hadoop HA
Sleep well – HA
• Kerberos HA
• LDAP HA
• NameNode HA
• JobTracker HA / ResourceManager HA
• HBase HA
Metric/Monitoring tools
Without metric/monitoring tools
Know the present and the future – Ganglia
Know the real problem – Nagios
Know the real problem – Nagios mail
Find the detailed logs and generate fancy reports – Splunk
Job collector
User Mapper/Reducer usage
Failed Job Summary
Cluster Mapper Usage/Pending
Cluster Reducer Usage/Pending
Other Tools
Offline Image Viewer
• Transforms the fsimage from binary to text
• cat fsimage.txt
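The conversion step might look like this; the paths are illustrative, and the processor name varies by Hadoop version (older releases expose the Delimited processor through hdfs oiv or oiv_legacy):

```shell
# Dump the binary fsimage to a delimited text file.
hdfs oiv -i /data/dfs/name/current/fsimage -o /tmp/fsimage.txt -p Delimited

# Then inspect it like any text file.
head /tmp/fsimage.txt
```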
Total Number of Files for Each User – Pig
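The deck does this aggregation with Pig; as an illustrative stand-in, the same per-user file count over a (fake, heavily simplified) fsimage text dump can be sketched with awk — a real dump has more columns, so adjust the field index:

```shell
# Fake two-column dump: path, then owning user.
cat > /tmp/fsimage.txt <<'EOF'
/user/alice/a.txt alice
/user/alice/b.txt alice
/user/bob/c.txt bob
EOF

# Count files per user -- the same group-and-count the Pig job performs.
awk '{count[$2]++} END {for (u in count) print u, count[u]}' \
    /tmp/fsimage.txt | sort
# → alice 2
# → bob 1
```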
Common DataNode Decommission Process
• Add the DataNodes' hostnames to the dfs.exclude file
• On the NameNode host, run hdfs dfsadmin -refreshNodes
• Check the web UI to see whether the state has changed to "Decommission In Progress" for the DataNodes being decommissioned (1–2 days)
• When all the DataNodes report their state as "Decommissioned", you can shut down the decommissioned nodes
• Replace the crashed HDD, reboot the server, and re-configure the HDD from the RAID card (20 mins)
• Mount the HDD
• Start the DataNode service
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.5/bk_system-admin-guide/content/admin_decommission-slave-nodes-2-1.html
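The first two steps can be sketched as follows; the hostname and exclude-file path are assumptions — the real path is whatever dfs.hosts.exclude points at in hdfs-site.xml:

```shell
# On the NameNode host: add the node to the exclude file and tell
# the NameNode to re-read it.
echo 'datanode042.example.com' >> /etc/hadoop/conf/dfs.exclude
hdfs dfsadmin -refreshNodes

# Watch progress: the node moves from "Decommission In Progress"
# to "Decommissioned" in the report (and in the web UI).
hdfs dfsadmin -report | grep -A 2 'datanode042'
```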
Trend Micro DataNode Decommission (HDD hot swap)
• Replace the crashed HDD
• Stop the DataNode and umount the broken mount point (5 mins)
• Re-initialize the HDD from the RAID card settings
• Check /var/log/messages; Linux will auto-rescan the device
• Mount the repaired mount point
• Start the DataNode service
HBase Canary Tool
• Contributor: Scott Miao, Trend Micro
• Purpose: check every table's first region on each RegionServer
https://issues.apache.org/jira/browse/HBASE-7525
HBase Canary Tool
• Usage:
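Invocation might look like this; the exact class name and options vary by HBase version, so check HBASE-7525 and your release's docs:

```shell
# Scan the first region of every table in the cluster.
hbase org.apache.hadoop.hbase.tool.Canary

# Limit the check to specific tables (table names are illustrative).
hbase org.apache.hadoop.hbase.tool.Canary usertable metrics_table
```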
• Result
HappyBase + Thrift
• What is HappyBase
• Purpose:
  – Check the response time of every region on a RegionServer
  – Check the response time of every region of a table
http://happybase.readthedocs.org/en/latest/
Datacenter & AWS
How we tested the EMR POC
If your EMR cluster is running 24x7…
Reduce EMR cost
• 100 nodes running 1 hour costs the same as 1 node running 100 hours
• AWS charges by the hour
• If you don't care about job stability, use Spot Instances to save cost
• Use Reserved Instances to save cost
• Use EMR auto scaling
• Pilot-run your application to estimate how many machines, and what size, you need
• Get your monthly cost from the AWS calculator
• http://calculator.s3.amazonaws.com/index.html
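A quick sanity check of the "100 nodes × 1 hour == 1 node × 100 hours" claim, using a made-up per-node-hour price (real prices come from the AWS calculator linked above):

```shell
# Hypothetical price: $0.26 per node-hour (instance cost + EMR fee).
awk 'BEGIN {
  price = 0.26
  printf "100 nodes x 1 hour : $%.2f\n", price * 100 * 1
  printf "1 node x 100 hours : $%.2f\n", price * 1 * 100
}'
# → 100 nodes x 1 hour : $26.00
# → 1 node x 100 hours : $26.00
```

Same bill either way, but the wide cluster finishes 100 times sooner, which is why scaling out for short jobs is usually the better deal on hourly billing.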
Datacenter Cost by Service (Storage)
• Application size / (server HDD space × 0.75) × server cost / 2
Datacenter Cost by Service (Computing)
• ((used Map slots + used Reduce slots) / (total Map slots + total Reduce slots)) × total server cost / 2
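Plugging made-up numbers into both formulas — 300 TB of application data, 24 TB raw disk per server at 75% usable, $6,000 per server, 400 of 1,000 slots used, $600,000 total server cost, with the "/2" splitting each server's cost evenly between storage and computing:

```shell
awk 'BEGIN {
  # Storage: app size / usable disk per server * server cost / 2
  storage = 300 / (24 * 0.75) * 6000 / 2
  # Computing: used-slot fraction * total server cost / 2
  computing = (400 / 1000) * 600000 / 2
  printf "storage  : $%.0f\n", storage
  printf "computing: $%.0f\n", computing
}'
# → storage  : $50000
# → computing: $120000
```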
#TrendInsight
Thank you!
WE ARE HIRING! WELCOME TO JOIN TREND MICRO!