Open Source Enterprise Monitoring with Zabbix
Alexei Vladishev, Founder of Zabbix

  • Open Source Enterprise Monitoring with Zabbix

    Alexei Vladishev, Founder of Zabbix

  • What is Zabbix:

      • Zabbix overview
      • Highlights of Zabbix features
      • Monitoring of large distributed environments
      • Zabbix Roadmap


  • Zabbix overview

  • Why should we use monitoring?

    Most important reasons:
      • To warn and act in case of any problems. Downtime is very expensive!
      • To identify and fix problems ASAP, before customers start calling
      • To make the work of IT staff more productive
      • To automate routine tasks and checks of resource availability
      • To plan hardware resources: capacity planning and trends
      • To measure and analyse the quality of provided and used services (SLA)

    A good monitoring system makes us confident our business is running!

  • Zabbix is celebrating its 8th anniversary!

      • 1998: the choice was HP OpenView, IBM, BMC, which are expensive to buy and maintain
      • How to name it? ABCDE...Zabbix!
      • April 2001: the first public release, Zabbix 1.0alpha1
      • April 2004: the first stable release, Zabbix 1.0
      • April 2005: the company Zabbix SIA was established (commercial support)

    Zabbix today. We have made good progress!
      • Zabbix 1.6.4
      • 500 downloads per day
      • 15,000 forum users
      • The Zabbix company is growing
      • 20 Zabbix partners (Europe, Japan, the US)


  • What is Zabbix?

    Zabbix is an Open Source distributed monitoring system capable of monitoring availability and performance of servers, network devices, applications.

    Zabbix functionality:
      • Agent-less and agent-based monitoring
      • Auto-discovery
      • Escalations and repeated notifications
      • Pro-active monitoring, remote actions
      • WEB monitoring
      • Graphs, maps, screens
      • IT Services (SLA), reports
      • Distributed monitoring, IPv6
      • and more!

  • Zabbix: main components

    Server:
      • Zabbix core, system logic
      • Data processing, escalations

    WEB front-end:
      • Access to historical data
      • Configuration

    Agent:
      • Server data collection, actions

    Proxy:
      • Remote data collection

  • Technical details

    Important technical decisions:
      • WEB front-end for data visualisation and configuration
      • Written in the C language, with a PHP front-end. No Java/Python/Perl/Ruby on the server and agent side!
      • No fork(); native system calls are used instead
      • Support of virtually all platforms (Linux, *BSD, Solaris, AIX, HP-UX, Windows, ...)
      • Choice of database engines: MySQL, PostgreSQL, Oracle, SQLite
      • We do not reuse Nagios, RRD, Cacti

    Key principles of Zabbix development:
      • Keep things simple (KISS), yet be very flexible
      • Maintain low hardware requirements; monitoring should not affect production

  • Why would we choose Zabbix?

    What makes Zabbix so special?
      • All-in-one solution, but only when it comes to monitoring!
      • All historical data, trends and configuration are stored in a database
      • Ready for monitoring of small and LARGE distributed environments
      • True Open Source (GPLv2) solution, no commercial versions
      • All logic is on the server side; agents are for data collection only
      • Extremely flexible! Triggers, escalations, new checks, screens, and more
      • Designed to deal with unstable communications
      • Full support of IPv6

  • How to monitor

    Service checks: FTP, SSH, HTTP, SMTP, DNS, ...

    Zabbix Agent:
      • Active and passive checks
      • Monitoring of logs, event logs
      • Easy to extend
      • Remote command execution
      • Extremely efficient!

    Other: WMI, JMX, Nagios plugins

    SNMP v1, v2, v3:
      • Network devices
      • Normally NET-SNMP for servers
      • Monitoring of applications (Oracle, Weblogic, Websphere, PostgreSQL, MySQL, ...)
      • SNMP traps

    IPMI:
      • Monitoring of hardware
      • Remote management (reboot, reset, halt)

  • Use of Zabbix agent

    Active checks:
      • Highly efficient
      • Buffering of collected data

    Passive checks:
      • Requires polling on the Zabbix server side
      • Additional performance hit because of polling and network bandwidth

  • Zabbix Highlights

  • Mmm... Triggers!

    A trigger is a flexible logical expression used to define a problem condition.
      • The status (value) of a trigger represents the system state
      • A change of trigger value generates events
      • It is one of the ways to deal with flapping

    CPU load is too high:
      • {host:cpuload.last(0)}>5
      • {host:cpuload.min(300)}>2
      • {host:cpuload.min(300)}>2 & {host:cpuuser.min(300)}>50
      • {host:cpuload.min(300)}>2 & {host2:backup.last(0)}=0

    We decide how to define "CPU load is too high", not Zabbix itself!
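The difference between the last(0) and min(300) forms of the expressions above can be sketched in Python. This is an illustration only, not Zabbix code: the helper names `last` and `min_over`, the sample values, and the 60-second sampling interval are all assumptions made for the example.

```python
from typing import List, Tuple

# (timestamp in seconds, CPU load) samples; values are made up for illustration.
Sample = Tuple[int, float]

def last(samples: List[Sample]) -> float:
    """Most recent value, in the spirit of last(0)."""
    return samples[-1][1]

def min_over(samples: List[Sample], now: int, period: int) -> float:
    """Smallest value seen in the last `period` seconds, in the spirit of min(300)."""
    window = [v for t, v in samples if now - period < t <= now]
    return min(window)

# A short spike: load jumps to 6 for one sample.
spike = [(0, 1.0), (60, 1.2), (120, 6.0)]

# The last()-style condition fires on the spike...
spike_last = last(spike) > 5
# ...while the min()-style condition stays quiet, because the load
# was not above 2 for the whole 300-second window.
spike_min = min_over(spike, now=120, period=300) > 2

# A load that stays above 2 for the full window does satisfy the min() form.
busy = [(t, 3.5) for t in range(0, 300, 60)]
sustained_min = min_over(busy, now=240, period=300) > 2

print(spike_last, spike_min, sustained_min)  # True False True
```

Requiring the minimum over a period to exceed the threshold means a single noisy sample cannot flip the trigger, which is one way to read the slide's remark about flapping.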

  • Dependencies

    They are used to:
      • Avoid notifications for dependent problems
      • Define dependencies between different problems (related to networks, applications, anything). No host dependencies!

    Examples (each problem depends on the next one):
      • Server is down → Switch1 is down → Switch2 is down
      • WEB App is down → MySQL is not responsive → No free disk space on /tmp
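A minimal sketch of how such dependencies can suppress notifications, assuming that a problem is silenced whenever a problem it depends on (directly or indirectly) is active. The `DEPENDS_ON` map and both function names are hypothetical, and the direction of each dependency is inferred from the two example chains.

```python
from typing import Dict, List, Set

# Hypothetical dependency map: each problem lists the problems it depends on.
# The chains follow the two examples; the direction is an assumption.
DEPENDS_ON: Dict[str, List[str]] = {
    "Server is down": ["Switch1 is down"],
    "Switch1 is down": ["Switch2 is down"],
    "Switch2 is down": [],
    "WEB App is down": ["MySQL is not responsive"],
    "MySQL is not responsive": ["No free disk space on /tmp"],
    "No free disk space on /tmp": [],
}

def is_suppressed(problem: str, active: Set[str]) -> bool:
    """True if any problem this one depends on, directly or not, is active."""
    for parent in DEPENDS_ON.get(problem, []):
        if parent in active or is_suppressed(parent, active):
            return True
    return False

def to_notify(active: Set[str]) -> List[str]:
    """Active problems whose notifications are not suppressed."""
    return sorted(p for p in active if not is_suppressed(p, active))

# Switch2 fails and takes Switch1 and the server with it:
# only the root cause produces a notification.
print(to_notify({"Server is down", "Switch1 is down", "Switch2 is down"}))
# ['Switch2 is down']
```

The map is keyed by problem rather than by host, which mirrors the slide's point that dependencies are defined between problems and not between hosts.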

  • Escalations

    Different scenarios:
      • Delayed notifications
      • Repeated notifications
      • Execution of commands
      • Escalation to other users
      • Recovery messages
      • Different actions for acknowledged and unacknowledged events

    Example (reaction to a failed WEB check):
      • Increase step every 5 minutes
      • Step 1-3: Send message to Unix Admins
      • Step 3-5: Send message to Boss if not ACK
      • Step 6: Restart Apache if not ACK
      • Step 7: Reboot server if not ACK
      • Step 10: Send message to all if not ACK
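The example escalation can be written down as data plus a small lookup, which makes the timing explicit. This is a sketch under stated assumptions, not the Zabbix implementation: the `Operation` type and function names are hypothetical, and it assumes step 1 starts the moment the problem is detected.

```python
from dataclasses import dataclass
from typing import List

STEP_SECONDS = 5 * 60  # "Increase step every 5 minutes"

@dataclass
class Operation:
    first_step: int
    last_step: int
    action: str
    only_if_not_acked: bool

# The steps of the example, written as data.
OPERATIONS = [
    Operation(1, 3, "Send message to Unix Admins", False),
    Operation(3, 5, "Send message to Boss", True),
    Operation(6, 6, "Restart Apache", True),
    Operation(7, 7, "Reboot server", True),
    Operation(10, 10, "Send message to all", True),
]

def current_step(seconds_since_problem: int) -> int:
    """Assumption: step 1 starts immediately, then one step per STEP_SECONDS."""
    return seconds_since_problem // STEP_SECONDS + 1

def due_actions(seconds_since_problem: int, acknowledged: bool) -> List[str]:
    """Actions that apply at the current step, honouring the 'if not ACK' rule."""
    step = current_step(seconds_since_problem)
    return [
        op.action
        for op in OPERATIONS
        if op.first_step <= step <= op.last_step
        and not (op.only_if_not_acked and acknowledged)
    ]

print(due_actions(0, acknowledged=False))        # step 1
print(due_actions(10 * 60, acknowledged=False))  # step 3
print(due_actions(25 * 60, acknowledged=True))   # step 6, already acknowledged
```

At step 3 both the admins and the boss are notified, because the two ranges in the example overlap; once the event is acknowledged, the restart at step 6 is skipped.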

  • Visualisation: Dashboard

    Favourite resources:
      • Maps
      • Graphs
      • Screens

    High-level view:
      • Problems by host group
      • Zabbix statistics
      • List of the latest issues
      • WEB monitoring info
      • Auto-discovery

  • Visualisation: Graphs

    Immediate access:
      • Any period of time
      • Easy time-navigation
      • Two mouse-click zooming
      • Problem conditions displayed
      • Non-working time is marked
      • Not generated in advance!

    Graph types:
      • Standard (dots, lines, colors)
      • Stacked
      • Pie

  • Visualisation: Screens

    Different blocks:
      • Graphs
      • Maps
      • Plain text data
      • List of problems
      • High level stats

    Slide shows:
      • Combination of screens
      • Displayed one after another

  • WEB monitoring

    Goals:
      • Monitoring of user experience
      • Support of complex scenarios
      • Performance monitoring
      • Availability monitoring

    Example:
      • Step 1: Access home page
      • Step 2: Login (POST, GET)
      • Step 3: Run report
      • Step 4: Logout
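A multi-step scenario like the one in the example can be sketched as an ordered list of steps that stops at the first failure and records how long each step took. Everything here is illustrative: the `Step` type, the `run_scenario` function and the example.com URLs are hypothetical, and the HTTP call is injected as a plain function so the sketch runs without a network.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Step:
    name: str
    method: str
    url: str
    expected_status: int = 200

# The fetch function is injected: it takes (method, url) and returns a status code.
Fetch = Callable[[str, str], int]
StepResult = Tuple[str, bool, float]  # (step name, step ok, seconds taken)

def run_scenario(steps: List[Step], fetch: Fetch) -> Tuple[bool, List[StepResult]]:
    """Run the steps in order and stop at the first failing step."""
    results: List[StepResult] = []
    for step in steps:
        started = time.monotonic()
        status = fetch(step.method, step.url)
        elapsed = time.monotonic() - started
        step_ok = status == step.expected_status
        results.append((step.name, step_ok, elapsed))
        if not step_ok:
            return False, results
    return True, results

# Hypothetical URLs for the four steps of the example.
SCENARIO = [
    Step("Access home page", "GET", "http://example.com/"),
    Step("Login", "POST", "http://example.com/login"),
    Step("Run report", "GET", "http://example.com/report"),
    Step("Logout", "GET", "http://example.com/logout"),
]

def fake_fetch(method: str, url: str) -> int:
    # Stand-in for a real HTTP client: pretend the report page is broken.
    return 500 if url.endswith("/report") else 200

ok, results = run_scenario(SCENARIO, fake_fetch)
print(ok, [name for name, step_ok, _ in results if not step_ok])
# False ['Run report']
```

The per-step timing covers the performance goal and the pass/fail flag covers availability; a real check would replace `fake_fetch` with an actual HTTP request.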

  • IT Services

    Goals:
      • Business level monitoring
      • SLA monitoring
      • We care about services
      • Escalation of problems
      • Root cause of the problem

    Tree structure based on:
      • Dependencies
      • Physical location
      • Type of service, etc
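A small sketch of the service-tree idea: problems propagate upwards, the deepest problem is reported as the root cause, and SLA is the share of time a service was available. The `Service` class, the "down if any child is down" rule and the example tree are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Service:
    name: str
    problem: bool = False  # set by monitoring on the service itself
    children: List["Service"] = field(default_factory=list)

    def is_down(self) -> bool:
        """Assumed rule: down if the service or any of its children has a problem."""
        return self.problem or any(c.is_down() for c in self.children)

    def root_causes(self) -> List[str]:
        """Names of the deepest services with a problem under this node."""
        causes = [name for c in self.children for name in c.root_causes()]
        if not causes and self.problem:
            causes = [self.name]
        return causes

def sla_percent(downtime_seconds: float, period_seconds: float) -> float:
    """Share of the period during which the service was available."""
    return 100.0 * (period_seconds - downtime_seconds) / period_seconds

# Hypothetical tree: a shop that depends on a web application and on mail.
mysql = Service("MySQL", problem=True)
web_app = Service("WEB App", children=[mysql])
shop = Service("Online shop", children=[web_app, Service("Mail")])

print(shop.is_down(), shop.root_causes())  # True ['MySQL']
print(round(sla_percent(3600, 30 * 24 * 3600), 3))  # one hour down in 30 days
```

The business-level node ("Online shop") is what an SLA report would be about, while the root-cause list points at the technical problem underneath it.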

  • User management

    Authentication:
      • Standard: Zabbix database
      • LDAP (Active Directory)
      • Apache (Kerberos, Unix, etc)

    Permissions:
      • Depend on user type
      • User group level permissions

    Also:
      • Notifications-only user groups

  • Extending Zabbix

    New Zabbix agent-side check:
      UserParameter=mysql.qps,mysqladmin -uroot status|cut -f9 -d:
      UserParameter=sum[*],echo $1+$2|bc
    Examples: mysql.qps = 456, sum[4,5] = 9

    New notification methods:
      • Just a matter of writing a shell script (voice generation, Skype call, anything)

    New server side checks:
      • Just a matter of writing a shell script
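To show what the two agent-side commands compute, here is a rough Python equivalent of the `cut` and `bc` pipelines. It is a sketch only: the `STATUS` line is a made-up sample in the assumed shape of `mysqladmin status` output, and the function names are hypothetical.

```python
def cut_field(text: str, index: int, delimiter: str = ":") -> str:
    """Return the 1-based field, similar to `cut -f<index> -d<delimiter>`."""
    return text.split(delimiter)[index - 1]

# Made-up sample in the assumed shape of `mysqladmin status` output.
STATUS = (
    "Uptime: 8123  Threads: 2  Questions: 3704374  Slow queries: 0  "
    "Opens: 91  Flush tables: 1  Open tables: 64  "
    "Queries per second avg: 456.03"
)

def mysql_qps(status_line: str) -> float:
    """Counterpart of mysql.qps: the ninth colon-separated field."""
    return float(cut_field(status_line, 9))

def user_sum(first: str, second: str) -> int:
    """Counterpart of sum[*]: parameters arrive as strings, like $1 and $2."""
    return int(first) + int(second)

print(mysql_qps(STATUS))   # 456.03
print(user_sum("4", "5"))  # 9
```

Counting fields by colon works because the value of interest follows the eighth colon in the status line; the check breaks if the output format changes, which is worth keeping in mind for any check built on text parsing.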

  • Monitoring of large environments

  • Our environment

    Situation:
      • Several thousands of servers and network devices
      • Distributed across 2-100 data centers or branches
      • Centralised monitoring is required

  • Zabbix: several approaches

    1 Server:
      • One Zabbix server does everything

    Many Proxies:
      • One Zabbix server
      • One Proxy per data center or company branch

    Distributed:
      • One Zabbix server per data center
      • More effort to maintain
      • Can be used with Proxies

  • What is Proxy?

    Proxy is a data collector. It is also used for auto-discovery.

    Advantages:
      • Makes architecture easier
      • Does not require significant resources
      • Offloads Zabbix server

  • Proxy: how does it work?

    Connection loss processing:
      • Data is buffered in the Proxy database
      • Will be sent on connection recovery
      • No notifications about local problems!

    General:
      • Data collection only
      • Fully managed via WEB front-end
      • Configuration is stored on the Zabbix server side
      • All connections are initiated by Proxy
      • Collection of thousands of values per second
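The buffering behaviour on connection loss can be sketched with a local SQLite table: values are stored first and deleted only after the server has accepted them. This is an illustration under assumptions, not the Proxy's actual schema or protocol; the `ProxyBuffer` class, the table layout and the `send` callback are hypothetical.

```python
import sqlite3
from typing import Callable, List, Tuple

Value = Tuple[int, str, str]  # (timestamp, item key, value)

class ProxyBuffer:
    """Store collected values locally and forward them when the server is reachable."""

    def __init__(self) -> None:
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE buffer ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "clock INTEGER, item_key TEXT, value TEXT)"
        )

    def collect(self, clock: int, item_key: str, value: str) -> None:
        self.db.execute(
            "INSERT INTO buffer (clock, item_key, value) VALUES (?, ?, ?)",
            (clock, item_key, value),
        )

    def pending(self) -> int:
        return self.db.execute("SELECT COUNT(*) FROM buffer").fetchone()[0]

    def flush(self, send: Callable[[List[Value]], bool]) -> int:
        """Try to send everything buffered, oldest first.

        `send` returns False when the connection is down; nothing is
        deleted in that case, so the data waits for the next attempt.
        """
        rows = self.db.execute(
            "SELECT id, clock, item_key, value FROM buffer ORDER BY id"
        ).fetchall()
        if not rows:
            return 0
        if not send([(clock, key, value) for _, clock, key, value in rows]):
            return 0
        self.db.execute("DELETE FROM buffer WHERE id <= ?", (rows[-1][0],))
        return len(rows)

received: List[Value] = []

def server_down(batch: List[Value]) -> bool:
    return False

def server_up(batch: List[Value]) -> bool:
    received.extend(batch)
    return True

proxy = ProxyBuffer()
proxy.collect(100, "system.cpu.load", "1.2")
proxy.collect(160, "system.cpu.load", "1.4")

sent_while_down = proxy.flush(server_down)     # connection lost: data stays buffered
sent_after_recovery = proxy.flush(server_up)   # connection back: buffer is drained
print(sent_while_down, sent_after_recovery, proxy.pending())  # 0 2 0
```

Because `flush` is always called from the proxy side, the sketch also reflects the point that all connections are initiated by the Proxy.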

  • Distributed monitoring

    Basic attributes:
      • Tree-like structure
      • Node is a Zabbix server
      • Nodes are platform independent

    Management:
      • Two-way replication of configuration
      • Parent node controls child nodes

  • Processing of connection loss

    What will stop working?
      • Data sending to parent node
      • Synchronisation of configuration

    Everything else will keep working!

  • Thousands of devices: solutions

    Problems and solutions:
      • Huge data volume: use database partitions for historical data
      • Integration with existing systems: LDAP authentication, notification methods to open tickets, XML import/export for configuration management and inventory
      • Maintenance: templates, mass updates
      • Upgrades: all Zabbix components are compatible within one major release (1.6.x)

  • Choice of the best schema

    1 Server:
      • Getting used to Zabbix
      • Adopt Open Source

    Many Proxies:
      • Adding Proxies

    Distributed:
      • Distributed monitoring
      • Depends on the requirements: local administration, full-featured monitoring when there is no connection between data centers (branches)

  • Zabbix Roadmap

  • General directions

      • Better integration
      • REST API/RPC
      • Better scalability
      • Flexible Dashboard
      • Personalization (widgets)
      • Infrastructure for widgets
      • Business level monitoring

  • Questions?

    Today and tomorrow I am around!
