1. How to Monitor the $H!T out of Hadoop
- Developing a comprehensive, open approach to monitoring Hadoop clusters
2. Relevant Hadoop Information
- Hardware and software failures are common
- Redundant components: DataNode, TaskTracker
- Non-redundant components: NameNode, JobTracker, SecondaryNameNode
- Fast-evolving technology (best practices are still settling)
3. Monitoring Software
- Red/Yellow/Green alerts, escalations
- De facto standard, widely deployed
- Pluggable with shell scripts/external apps
4. Cacti
- Performance-graphing system
- Template system for graph types
- Shell script/external program data sources
5. 6. hadoop-cacti-jtg
- JMX fetching code with kick-off scripts
- Cacti templates for Hadoop
- Premade Nagios check scripts
- Helper/batch/automation scripts
7. Hadoop JMX
8. A Sample Cluster, Part 1
- DerbyDB (Hive) on the SecondaryNameNode
9. A Sample Cluster, Part 2
10. Prerequisites
- Nagios (installed from DAG RPMs)
- Cacti (installed from several RPMs)
- Liberal network access to the cluster
11. Alerts & Escalations
- X nodes * Y services = less sleep
- Review alerts daily, weekly, and monthly
12. Wake Me Ups
- Disk full (a big, big headache)
- RAID array issues (failed disk)
- Don't find out too late that something has stopped working
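The stock Nagios check_disk plugin handles the disk-full case, but the idea reduces to a threshold check on a percentage from df. A minimal sketch (thresholds, mount point, and the function name are illustrative, not from the deck):

```shell
#!/bin/sh
# Classify disk usage the way a Nagios check would:
# exit 0 = OK, 1 = WARNING, 2 = CRITICAL.
check_disk_pct() {
  pct=$1; warn=$2; crit=$3
  if [ "$pct" -ge "$crit" ]; then
    echo "DISK CRITICAL: ${pct}% used"; return 2
  elif [ "$pct" -ge "$warn" ]; then
    echo "DISK WARNING: ${pct}% used"; return 1
  else
    echo "DISK OK: ${pct}% used"; return 0
  fi
}

# In a real check the percentage would come from df, e.g.:
#   pct=$(df -P /data | awk 'NR==2 { sub(/%/, "", $5); print $5 }')
check_disk_pct 42 80 90
# prints: DISK OK: 42% used
```

The exit codes follow the Nagios plugin convention, so the same script can back a check_command definition.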
13. Don't Wake Me Ups
- Warning: currently a failed disk will take down the DataNode (see JIRA)
- Slaves are expendable (up to a point)
14. Monitoring Battle Plan
- Add Hadoop-specific alarms
- e.g. FilesTotal > 1,000,000 or LiveNodes < 50%
15. The Basics: Nagios
- Load-based alarms are somewhat useless: 389% CPU load is not necessarily a bad thing in Hadoopville
16. The Basics: Cacti
17. Disk Utilization
18. RAID Tools
- hpacucli (not a Street Fighter move)
- Alerts on RAID events (NameNode)
- Dell, Sun, and other vendor-specific tools
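On HP hardware the RAID-event alert can wrap hpacucli output. A hedged sketch; the exact hpacucli subcommand and its output format should be verified against your controller, and the function name is mine:

```shell
#!/bin/sh
# Scan controller status text and flag any "Status:" line that is not OK.
# raid_status_ok reads the status report on stdin; returns 0 if healthy.
raid_status_ok() {
  ! grep -i 'status:' | grep -iv ': *ok' | grep -q .
}

# Real usage (assumption -- confirm the subcommand on your system):
#   hpacucli ctrl all show status | raid_status_ok || echo "RAID CRITICAL"
```

Hooked into Nagios, a non-zero return from the pipeline becomes the CRITICAL state on the NameNode host.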
19. Before You Jump In
- X nodes * Y checks = lots of work
- About 3 nodes into the process: "Wait!!! I need some interns!!!"
- Solution: S.I.C.C.T. (Semi-Intelligent Configuration-Cloning Tools)
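The cloning idea can be as simple as generating Nagios host definitions from a node list instead of hand-editing X * Y stanzas. A sketch; the host names and the generic-host template are my own placeholders, not from the deck:

```shell
#!/bin/sh
# Emit one Nagios host definition per slave node read from stdin,
# so adding a node means adding one line to a list, not a config stanza.
gen_hosts() {
  while read -r node; do
    [ -z "$node" ] && continue
    printf 'define host {\n  use generic-host\n  host_name %s\n  address %s\n}\n' \
      "$node" "$node"
  done
}

printf 'hadoopdata1\nhadoopdata2\n' | gen_hosts
# prints a "define host { ... }" stanza for each node
```

The same loop extends naturally to service definitions, which is where the X nodes * Y checks multiplication actually hurts.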
20. Nagios
21. Cacti
- Answers: "How well is it running?"
22. Monitoring Battle Plan Thus Far
- Ping, disk !!!!!!Done!!!!!!
- Add Hadoop-specific alarms
- e.g. FilesTotal > 1,000,000 or LiveNodes < 50%
23. Add Hadoop Specific Alarms
- Hadoop Components with a Web Interface
- check_http + regex = simple + effective
24. nagios_check_commands.cfg
- (Future) Newer Hadoop will have XML status

define command {
    command_name  check_remote_namenode
    command_line  $USER1$/check_http -H $HOSTADDRESS$ -u http://$HOSTADDRESS$:$ARG1$/dfshealth.jsp -p $ARG1$ -r NameNode
}

define service {
    service_description  check_remote_namenode
    use                  generic-service
    host_name            hadoopname1
    check_command        check_remote_namenode!50070
}

25. Monitoring Battle Plan
- Add Hadoop-specific alarms
- e.g. FilesTotal > 1,000,000 or LiveNodes < 50%
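Outside of Nagios, the check_http + regex idea is easy to try by hand with curl: fetch the status page, then pass or fail on a regex over the body. A sketch mirroring the dfshealth.jsp check above; the helper name is mine:

```shell
#!/bin/sh
# Pass/fail a fetched status page by regex, like check_http -r does.
# $1 = regex; page body is read from stdin.
page_matches() {
  grep -q "$1"
}

# Real usage (requires a reachable NameNode web UI):
#   curl -sf http://hadoopname1:50070/dfshealth.jsp | page_matches 'NameNode' \
#     && echo OK || echo CRITICAL
printf '<html><title>Hadoop NameNode</title></html>' | page_matches 'NameNode' && echo OK
# prints: OK
```

This catches the "daemon is up but serving garbage" case that a bare ping or port check misses.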
26. JMX Graphing
27. JMX Graphing
28. JMX Graphing
29. JMX Graphing
30. 31. Standard Java JMX
32. Monitoring Battle Plan Thus Far
- Start with the basics !!!!!!Done!!!!!!
- Add Hadoop-specific alarms !Done!
- e.g. FilesTotal > 1,000,000 or LiveNodes < 50%
33. Add JMX-Based Alarms
- hadoop-cacti-jtg is flexible
- Write your own check logic
34. Quick JMX Base Walkthrough
- url, user, pass, and object are specified from the CLI
- wantedVariables and wantedOperations are supplied by inheritance
- fetch() and output() are provided
35. Extend for NameNode
36. Extend for Nagios
37. Monitoring Battle Plan
- Start with the basics !DONE!
- Add Hadoop-specific alarms !DONE!
- Add JMX-based alarms !DONE!
- e.g. FilesTotal > 1,000,000 or LiveNodes < 50%
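Once the JMX fetcher hands back values, the FilesTotal / LiveNodes rules above reduce to a threshold check with Nagios exit codes. A sketch; how the values reach the script is an assumption (here they arrive as plain arguments), and the function name is mine:

```shell
#!/bin/sh
# Apply the deck's example thresholds: alarm when FilesTotal > 1,000,000
# or when live nodes fall below 50% of configured nodes.
# Nagios exit codes: 0 = OK, 2 = CRITICAL.
namenode_check() {
  files_total=$1; live_nodes=$2; total_nodes=$3
  if [ "$files_total" -gt 1000000 ]; then
    echo "CRITICAL: FilesTotal=${files_total}"; return 2
  fi
  # live/total < 50%  <=>  2 * live < total (integer-safe)
  if [ $((live_nodes * 2)) -lt "$total_nodes" ]; then
    echo "CRITICAL: only ${live_nodes}/${total_nodes} nodes live"; return 2
  fi
  echo "OK: FilesTotal=${files_total}, ${live_nodes}/${total_nodes} nodes live"
}

namenode_check 250000 48 50
# prints: OK: FilesTotal=250000, 48/50 nodes live
```

In practice the two numbers would come from the hadoop-cacti-jtg fetch() output rather than the command line.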
38. Review
39. The Future
- JMX coming to JobTracker and TaskTracker (0.21)
- Collect and graph jobs running
- Collect and graph map/reduce per node
- Profile specific jobs in Cacti?