62
Open-Falcon A Distributed and High-Performance Monitoring System Yao-Wei Ou & Lai Wei 2017/05/22

Open-Falcon: A Distributed and High-Performance Monitoring System

Embed Size (px)

Citation preview

Page 1: Open-Falcon: A Distributed and High-Performance Monitoring System

Open-FalconA Distributed and High-Performance Monitoring System

Yao-Wei Ou & Lai Wei 2017/05/22

Page 2: Open-Falcon: A Distributed and High-Performance Monitoring System

Let us begin with a little story…

Page 3: Open-Falcon: A Distributed and High-Performance Monitoring System

Grafana PR#3787

[feature] Add Open-Falcon datasource

Page 4: Open-Falcon: A Distributed and High-Performance Monitoring System

“I'm sorry but we will not merge any new datasources unless they are very popular.”

- Grafana

Page 5: Open-Falcon: A Distributed and High-Performance Monitoring System
Page 6: Open-Falcon: A Distributed and High-Performance Monitoring System
Page 7: Open-Falcon: A Distributed and High-Performance Monitoring System

Open-Falcon,now is one of the most popular monitoring systems in China.

Page 8: Open-Falcon: A Distributed and High-Performance Monitoring System

About us

Laiwei✦ Technical director of DiDi, China.✦ Founder of Open-Falcon software.✦ Core maintainer of Open-Falcon

community.✦ Focus on service reliability, DevOps,

Cloud computing, etc.7

Yaowei✦ Director of WiFire oversea R&D center.✦ Core maintainer of Open-Falcon

organization.✦ Leads the development of CDN

monitoring system in Fastweb.✦ Focus on CDN and Blockchain.

Page 9: Open-Falcon: A Distributed and High-Performance Monitoring System

Outline

‣ Motivation

‣ Features

‣ Architecture

‣ Comparison

‣ Community

8

Page 10: Open-Falcon: A Distributed and High-Performance Monitoring System

Motivation(1/3)

There are already so many outstanding open source monitoring systems. Why do we reinvent a wheel? And how does it become the most popular monitoring system in China?

9

Page 11: Open-Falcon: A Distributed and High-Performance Monitoring System

Motivation(2/3)‣ Zabbix

•difficult to scale out (~2000)• Database can grow very large, very quickly if not tuned properly.• not very resource friendly: a lot of connections will be made.• we have to build several zabbix cluster to deal with the rapid business grow.

•a bit of a learning curve• Tuning and tweaking can be a lot of work• more complex to setup• hard to extend• manually configuration

‣ 2015 open sourced by Xiaomi SRE, China, under Apache License Version 2.0•to replace Zabbix

10

Page 12: Open-Falcon: A Distributed and High-Performance Monitoring System

Motivation(3/3)

‣ key factors of an enterprise class monitoring system.

11

Scalability Performance User-OrientedFlexibilityHigh

Availability

Page 13: Open-Falcon: A Distributed and High-Performance Monitoring System

Features

Page 14: Open-Falcon: A Distributed and High-Performance Monitoring System

Scalability‣ Scalable monitoring system is necessary to support rapid business

growth. Each module of Open-Falcon is super easy to scale horizontally.

‣ Supports up to hundreds of million transactions per minute (query/judge/store/search).

‣ Can easily support over 100,000 hosts.

13

Page 15: Open-Falcon: A Distributed and High-Performance Monitoring System

Performance‣ With  RRA(Round Robin Archive) mechanism, the one-year history

data  of 100+ metrics could be  returned in just few seconds.

‣ stores 10+ years historical metrics.

14

Page 16: Open-Falcon: A Distributed and High-Performance Monitoring System

High Availability‣ No critical single point of failure

‣ Easy to operate and deploy

15

Page 17: Open-Falcon: A Distributed and High-Performance Monitoring System

Flexibility‣ Falcon-agent has already 400+ built-in server metrics. Users can

collect their customized metrics by writing plugins or just simply run a script/program to relay metrics to falcon-agent.

‣ Extensive architecture.

‣ Customizable metrics.

‣ Abundant APIs.

16

Page 18: Open-Falcon: A Distributed and High-Performance Monitoring System

Efficiency‣ For easier management of alerting rules, Open-Falcon supports

strategy, expression, template inheritance, and multiple alerting method, and callback for recovery.

‣ Auto discovery of endpoints and counters.

‣ API support.

17

Page 19: Open-Falcon: A Distributed and High-Performance Monitoring System

User-Oriented‣ Supports Grafana Datasource.

18

‣ Open-Falcon could present multi-dimension graph, including user-defined dashboard/screen.

Page 20: Open-Falcon: A Distributed and High-Performance Monitoring System

Grafana with Open-Falcon Screenshot (1/3)

Page 21: Open-Falcon: A Distributed and High-Performance Monitoring System

Grafana with Open-Falcon Screenshot (2/3)

Page 22: Open-Falcon: A Distributed and High-Performance Monitoring System

Grafana with Open-Falcon Screenshot (3/3)

Page 23: Open-Falcon: A Distributed and High-Performance Monitoring System

Architecture

Page 24: Open-Falcon: A Distributed and High-Performance Monitoring System

Components

23

COLLECT STORE

JUDGE

PRESENT

NOTIFY

Page 25: Open-Falcon: A Distributed and High-Performance Monitoring System

Components

23

COLLECT STORE

JUDGE

PRESENT

NOTIFY

Page 26: Open-Falcon: A Distributed and High-Performance Monitoring System

Components

23

COLLECT STORE

JUDGE

PRESENT

NOTIFY

Page 27: Open-Falcon: A Distributed and High-Performance Monitoring System

Components

23

COLLECT STORE

JUDGE

PRESENT

NOTIFY

Falcon-Agent Proxy-Gateway aggregator nodata

graph redis MySQL hbs

Falcon-Dashboard Grafana Falcon-API

alarmjudge

Page 28: Open-Falcon: A Distributed and High-Performance Monitoring System

Center Status

24

Page 29: Open-Falcon: A Distributed and High-Performance Monitoring System

Agent1 Agent2 Agent3 Agent4 … AgentN

UICDashboard Portal2 2 2

AlarmQuery

Graph 20

Transfer 60

Judge 20

5 1 HBS

Sender

Link

5

1

1

Before: Too many modules

Nodata

Aggregator

Tasks

Alarm FE

Page 30: Open-Falcon: A Distributed and High-Performance Monitoring System

After

Page 31: Open-Falcon: A Distributed and High-Performance Monitoring System

After

UICDashboard Portal

Alarm FE

Page 32: Open-Falcon: A Distributed and High-Performance Monitoring System

After

Alarm API Gateway

Graph

Transfer (Queue)

Judge HBS(Control)

UICDashboard Portal

Alarm FEDashboard

Page 33: Open-Falcon: A Distributed and High-Performance Monitoring System

After

Central

Status

Alarm API Gateway

Graph

Transfer (Queue)

Judge HBS(Control)

UICDashboard Portal

Alarm FE

Falcon-Plus

Dashboard

Page 34: Open-Falcon: A Distributed and High-Performance Monitoring System

After

Central

Status

Alarm API Gateway

Graph

Transfer (Queue)

Judge HBS(Control)

UICDashboard Portal

Alarm FE

Falcon-Plus

Dashboard

Agent1 Agent2 Agent3 Agent4 … AgentN

Page 35: Open-Falcon: A Distributed and High-Performance Monitoring System

Grafana

After

Central

StatusInfluxDB Cassandra OpenTSDB ELK

Alarm API Gateway

Graph

Transfer (Queue)

Judge HBS(Control)

UICDashboard Portal

Alarm FE

Falcon-Plus

Dashboard

Agent1 Agent2 Agent3 Agent4 … AgentN

Page 36: Open-Falcon: A Distributed and High-Performance Monitoring System

Design Philosophy

27

Page 37: Open-Falcon: A Distributed and High-Performance Monitoring System

Design Philosophy

27

MICROSERVICES10 major modules

individual deployed

Page 38: Open-Falcon: A Distributed and High-Performance Monitoring System

Design Philosophy

27

PUSHagent as a push proxy

MICROSERVICES10 major modules

individual deployed

Page 39: Open-Falcon: A Distributed and High-Performance Monitoring System

Design Philosophy

27

PUSHagent as a push proxy

RRDTOOLwith consistent hash

MICROSERVICES10 major modules

individual deployed

Page 40: Open-Falcon: A Distributed and High-Performance Monitoring System

Design Philosophy

27

BINARYGo static binary

PUSHagent as a push proxy

RRDTOOLwith consistent hash

MICROSERVICES10 major modules

individual deployed

Page 41: Open-Falcon: A Distributed and High-Performance Monitoring System

Design Philosophy

27

BINARYGo static binary

PUSHagent as a push proxy

RRDTOOLwith consistent hash

DEPLOYMENTmass deployment by ops-updater

MICROSERVICES10 major modules

individual deployed

Page 42: Open-Falcon: A Distributed and High-Performance Monitoring System

Design Philosophy

27

BINARYGo static binary

PUSHagent as a push proxy

RRDTOOLwith consistent hash

GRAFANAopen-falcon datasource

DEPLOYMENTmass deployment by ops-updater

MICROSERVICES10 major modules

individual deployed

Page 43: Open-Falcon: A Distributed and High-Performance Monitoring System

Comparison

Page 44: Open-Falcon: A Distributed and High-Performance Monitoring System

Compared to Prometheus

29

OPEN-FALCON PROMETHEUS

Abundant APIs Metrics API

Push Model: Auto Discovery Pull Model: Manual configuration

Easy to scale out Harder to scale out

simple alert management of own dashboard Alertmanager offers grouping, deduplicationand silencing functionality

Faster query performance of RRA Slower, Recording rules

Simple shellscript as plugin A bit learning curve to write exporter and collector

Limited expression PromQL

Page 45: Open-Falcon: A Distributed and High-Performance Monitoring System

Compared to Prometheus

29

OPEN-FALCON PROMETHEUS

Abundant APIs Metrics API

Push Model: Auto Discovery Pull Model: Manual configuration

Easy to scale out Harder to scale out

simple alert management of own dashboard Alertmanager offers grouping, deduplicationand silencing functionality

Faster query performance of RRA Slower, Recording rules

Simple shellscript as plugin A bit learning curve to write exporter and collector

Limited expression PromQL

Page 46: Open-Falcon: A Distributed and High-Performance Monitoring System

Compared to Prometheus

29

OPEN-FALCON PROMETHEUS

Abundant APIs Metrics API

Push Model: Auto Discovery Pull Model: Manual configuration

Easy to scale out Harder to scale out

simple alert management of own dashboard Alertmanager offers grouping, deduplicationand silencing functionality

Faster query performance of RRA Slower, Recording rules

Simple shellscript as plugin A bit learning curve to write exporter and collector

Limited expression PromQL

Page 47: Open-Falcon: A Distributed and High-Performance Monitoring System

Compared to Prometheus

29

OPEN-FALCON PROMETHEUS

Abundant APIs Metrics API

Push Model: Auto Discovery Pull Model: Manual configuration

Easy to scale out Harder to scale out

simple alert management of own dashboard Alertmanager offers grouping, deduplicationand silencing functionality

Faster query performance of RRA Slower, Recording rules

Simple shellscript as plugin A bit learning curve to write exporter and collector

Limited expression PromQL

Page 48: Open-Falcon: A Distributed and High-Performance Monitoring System

Compared to Prometheus

29

OPEN-FALCON PROMETHEUS

Abundant APIs Metrics API

Push Model: Auto Discovery Pull Model: Manual configuration

Easy to scale out Harder to scale out

simple alert management of own dashboard Alertmanager offers grouping, deduplicationand silencing functionality

Faster query performance of RRA Slower, Recording rules

Simple shellscript as plugin A bit learning curve to write exporter and collector

Limited expression PromQL

Page 49: Open-Falcon: A Distributed and High-Performance Monitoring System

Compared to Prometheus

29

OPEN-FALCON PROMETHEUS

Abundant APIs Metrics API

Push Model: Auto Discovery Pull Model: Manual configuration

Easy to scale out Harder to scale out

simple alert management of own dashboard Alertmanager offers grouping, deduplicationand silencing functionality

Faster query performance of RRA Slower, Recording rules

Simple shellscript as plugin A bit learning curve to write exporter and collector

Limited expression PromQL

Page 50: Open-Falcon: A Distributed and High-Performance Monitoring System

Compared to Prometheus

29

OPEN-FALCON PROMETHEUS

Abundant APIs Metrics API

Push Model: Auto Discovery Pull Model: Manual configuration

Easy to scale out Harder to scale out

simple alert management of own dashboard Alertmanager offers grouping, deduplicationand silencing functionality

Faster query performance of RRA Slower, Recording rules

Simple shellscript as plugin A bit learning curve to write exporter and collector

Limited expression PromQL

Page 51: Open-Falcon: A Distributed and High-Performance Monitoring System

Compared to Prometheus

29

OPEN-FALCON PROMETHEUS

Abundant APIs Metrics API

Push Model: Auto Discovery Pull Model: Manual configuration

Easy to scale out Harder to scale out

simple alert management of own dashboard Alertmanager offers grouping, deduplicationand silencing functionality

Faster query performance of RRA Slower, Recording rules

Simple shellscript as plugin A bit learning curve to write exporter and collector

Limited expression PromQL

Page 52: Open-Falcon: A Distributed and High-Performance Monitoring System

Community

Page 53: Open-Falcon: A Distributed and High-Performance Monitoring System

Ecosystem (1/2)

31

✦ Banking

✦ IaaS, SaaS

✦ CDN

✦ O2O

✦ Social

✦ Entertainment

✦ …

Page 54: Open-Falcon: A Distributed and High-Performance Monitoring System

Ecosystem (2/2)

32

OS

Plugins

Switch Hadoop HBase Docker Redis

MongoDB GPU RabbitMQ HAProxy Nginx

JMX LVS Tomcat WebSphere IIS

UI

Page 55: Open-Falcon: A Distributed and High-Performance Monitoring System

Join us

33

Github

Homepage

Contact us

https://github.com/open-falcon

http://open-falcon.org

[email protected]

Wechat

Page 56: Open-Falcon: A Distributed and High-Performance Monitoring System

Summary

Page 57: Open-Falcon: A Distributed and High-Performance Monitoring System

100+ companies

Page 58: Open-Falcon: A Distributed and High-Performance Monitoring System

40,000+ servers

Page 59: Open-Falcon: A Distributed and High-Performance Monitoring System

400+ built-in metrics

Page 60: Open-Falcon: A Distributed and High-Performance Monitoring System

5,000+ users

Page 61: Open-Falcon: A Distributed and High-Performance Monitoring System

3 seconds

Page 62: Open-Falcon: A Distributed and High-Performance Monitoring System

Q&A

https://github.com/open-falcon/open-falcon