Seminar CNAF

Exploiting open source tools to realize a new monitoring infrastructure at CERN

Pedro Andrade – CERN IT/CF

Overview

• Agile Infrastructure

• Monitoring Project

• Solutions and Technologies

• Producers

• Transport

• Archive

• Query and Analytics

• Real-time Analytics

• Notifications

3/17/2014 CNAF Seminar 2

Agile Infrastructure

Challenges

• New data centre in Budapest since 2013

• Additional capacity required in view of physics needs

• Local on-site maintenance for installations/repairs

Challenges

• Be ready to handle 15’000 servers

• Growing number of users of CERN’s facilities and higher computing requirements as data rates increase

• Staff numbers are fixed, no more people

• Materials budget decreasing, no more money

• Legacy tools are high maintenance and brittle

• Deploy new services within hours

Challenges

• “We Are Not Special”

• Move to commonly used open source tools

• Focus on strong communities and momentum

• Stop re-inventing tools (“not invented here” syndrome)

• Implement clouds at scale

• Aim for 90% infrastructure virtualised

• Ecosystem solutions rather than writing from scratch

• Request to delivery in a coffee break

Agile Infrastructure

• Activity started in 2012

• Remodel IT services

• Move to a more horizontal approach

• Layered model: IaaS, PaaS, SaaS

• Services, Configuration, Installation, Hardware

• Virtualisation is key

• Improve efficiency

• Operational, Resources

Agile Infrastructure

[Figure: Agile Infrastructure tool chain. Components shown: Bamboo; Koji, Mock; AIMS/PXE; Foreman; Yum repo; Pulp; Puppet-DB; mcollective, yum; JIRA; Lemon / Hadoop / Elastic Search / Kibana; git; OpenStack Nova; Hardware database; Puppet; Active Directory / LDAP]

Monitoring Project

Challenges

• Several independent monitoring activities in IT

• High level services are interdependent

• Understanding performance more important

• Move to a virtualized dynamic infrastructure

• Preserve our investment in monitoring

Shared architecture & tool-chain components

Objectives

• Deliver solutions for the shared architecture

• Work with all IT monitoring teams

• Deliver simple adoption: PaaS

• Better exploit IT resources

• While at the same time

• Mix and match open source solutions

• Exploit new tools from the Agile Infrastructure

• Retire old tools: Lemon DB, Lemon Web, LAS, etc.

Architecture

Process Improvements

• Establish Agile methodology

• Well defined sprints with clear targets

• Iterative evolution, continuous feedback

• Exploit Open Source tools

• Best fit, large adoption, active community

• Fast to adopt, accept limitations, easily replaced

• Look at DevOps

• Quality Assurance processes

• Continuous Integration processes

Technologies

• Many options available!

Technologies

Producers

Motivation

• Preserve sensors/probes knowledge

• Many years writing sensors for Lemon

• Integrate other data sources

• Most likely service specific monitoring data

Selected Technology: Lemon + Others

Lemon Producer

• Same old lemon agent

• Running in all data centre nodes

• Lemon agent extended with lemon forwarder

• Send notifications to ActiveMQ

• Send metrics to Flume

• Send syslog to Flume

Other Producers

• Must follow common monitoring specification

• Metric v3.0 and Notification v2.0

• Can use monitoring-data-model to create new metrics and notifications and validate them

• Messages can be sent

• To ActiveMQ using a stomp client

• To Flume gateway using a flume agent

• Planning to evaluate Collectd later this year
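As a rough illustration of the "stomp client" path above, the sketch below serializes a single STOMP SEND frame carrying a JSON body. The destination name and the notification fields are assumptions made up for this example; the real field set is defined by the Notification v2.0 specification, which is not reproduced in these slides.

```python
import json


def build_stomp_send_frame(destination: str, payload: dict) -> bytes:
    """Serialize one STOMP SEND frame with a JSON body."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    header_lines = [
        "SEND",
        f"destination:{destination}",
        "content-type:application/json",
        f"content-length:{len(body)}",
    ]
    # A blank line separates the headers from the body; a NUL byte ends the frame.
    return ("\n".join(header_lines) + "\n\n").encode("utf-8") + body + b"\x00"


# Hypothetical notification and destination, for illustration only.
notification = {"type": "HW", "host": "node001.example.org", "state": "open"}
frame = build_stomp_send_frame("/topic/monitoring.notifications", notification)
```

A real producer would first open a TCP connection to the broker and exchange CONNECT/CONNECTED frames before sending; an existing STOMP client library would normally handle that handshake.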

Transport

Motivation

• Collect operations data

• Lemon metrics and syslog

• 3rd party applications and services

• Scalable transport layer

• Large data volume

• Easy integration with other technologies

Selected Technology: Flume

Flume

• Distributed service for collecting large data sets

• Robust and fault tolerant

• Horizontally scalable

• Many ready to be used input/output plugins

• Java based, Apache license

• Cloudera is the main contributor

• Using their releases

• Less frequent but more stable releases

Flume

• Flume event

• Payload + set of string headers

• Flume agent

• JVM process hosting “source to sink” flows
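To make the "payload + set of string headers" structure concrete, here is a minimal sketch that builds events in the JSON shape accepted by Flume's stock HTTP source with its default JSON handler (a list of objects with "headers" and "body"). The header names and the metric text are assumptions chosen for illustration.

```python
import json


def make_flume_event(body: str, **headers) -> dict:
    """Build one event: a text payload plus string-to-string headers."""
    # Flume headers are strings, so every header value is coerced to str.
    return {"headers": {key: str(value) for key, value in headers.items()}, "body": body}


def to_http_source_payload(events: list) -> str:
    """Encode a batch of events as the JSON array the HTTP source expects."""
    return json.dumps(events)


# Hypothetical headers: 'hostgroup' and 'priority' are not taken from the slides.
event = make_flume_event("cpu_load=0.42", hostgroup="batch", priority=3)
payload = to_http_source_payload([event])
```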

Flume

• Many ready-to-be-used plugins

• Sources: Avro, JMS, Spool, Syslog, HTTP, etc.

• Interceptors: decorate events, filter events

• Channels: Memory, File, JDBC

• Sinks: Avro, Thrift, ElasticSearch, HDFS, File, etc.

• Custom sources/sinks can be implemented
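The roles of source, interceptor, channel, and sink can be pictured with a toy model in plain Python. This is only a conceptual sketch of the flow; it is not the Flume API (Flume components are Java classes wired together by configuration), and the helper names are invented for the example.

```python
from collections import deque
from typing import Callable, Iterable, List, Optional

Event = dict  # {"headers": {...}, "body": "..."}
Interceptor = Callable[[Event], Optional[Event]]


def add_header(key: str, value: str) -> Interceptor:
    """Interceptor that decorates every event with one extra header."""
    def intercept(event: Event) -> Optional[Event]:
        return {"headers": {**event["headers"], key: value}, "body": event["body"]}
    return intercept


def drop_if_body_contains(text: str) -> Interceptor:
    """Interceptor that filters out events whose body contains `text`."""
    def intercept(event: Event) -> Optional[Event]:
        return None if text in event["body"] else event
    return intercept


def run_flow(source: Iterable[Event], interceptors: List[Interceptor],
             sink: Callable[[Event], None]) -> None:
    channel = deque()  # stands in for a memory channel buffering events
    for event in source:
        for intercept in interceptors:
            event = intercept(event)
            if event is None:  # event was filtered out
                break
        else:
            channel.append(event)
    while channel:  # the sink drains the channel in arrival order
        sink(channel.popleft())


delivered = []
run_flow(
    [{"headers": {}, "body": "DEBUG noise"}, {"headers": {}, "body": "disk full"}],
    [drop_if_body_contains("DEBUG"), add_header("layer", "gateway")],
    delivered.append,
)
```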

Flume

• Routing is static

• On demand subscriptions are not possible

• Requires reconfiguration and restart

• No authN and authZ features

• But secure transport available

• Java process on client side

• Small memory footprint would be nicer

Deployment

• Running flume 1.3, latest is flume 1.4

Deployment

• 1st layer: Flume Data publisher

• Deployed in all data centre nodes

• 2nd layer: Flume Gateway

• 20 VMs aggregating events

• 3rd layer: Flume ElasticSearch

• 10 VMs inserting to ElasticSearch

• 3rd layer: Flume Hadoop HDFS

• 10 VMs inserting to Hadoop HDFS

Feedback

• Sizing flume layers needs some tuning

• Available sources/sinks saved a lot of time

Archive

Motivation

• Store operations raw data

• Long term archival required

• Allow future data replay to other tools

• Feed real-time engine

• Offline processing of collected data

• Security data? Syslog data?

Selected Technology: Hadoop/HDFS

Hadoop/HDFS

• Hadoop is a framework that allows the distributed processing of large data sets

• HDFS is a distributed filesystem designed to run on commodity hardware

• Suitable for applications with large data sets

• Designed for batch processing, not interactive use

• High throughput preferred to low latency access

Hadoop/HDFS

• Small files not welcome: blocks of 64 MB or 128 MB

• Limit of tens of millions of files per cluster

• NameNode holds the file map in memory

• Transparent compression not available

• Raw text could take much less space

• Real-time data access is not possible

Deployment

• Production cluster

• ~200 TB available in 5 data nodes

• 6.3 TB stored since mid July 2013

• Data organized by hostgroup (cluster)

• Daily jobs to aggregate data by month

• Large files preferred to many small files
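The daily aggregation job described above can be sketched as a planning step that groups daily files into one target file per hostgroup and month. The directory layout and file names below are assumptions for illustration; the slides do not describe the actual HDFS paths, and the merge itself is left out.

```python
from collections import defaultdict
from datetime import date


def monthly_target(hostgroup: str, day: date) -> str:
    # Hypothetical layout: one directory per hostgroup, one file per month.
    return f"/monitoring/{hostgroup}/{day:%Y-%m}.log"


def plan_monthly_merge(daily_files):
    """Map each monthly target file to the daily files that feed it."""
    plan = defaultdict(list)
    for hostgroup, day, path in sorted(daily_files):
        plan[monthly_target(hostgroup, day)].append(path)
    return dict(plan)


# (hostgroup, day, path) tuples; all paths are made up for the example.
daily_files = [
    ("batch", date(2013, 7, 16), "/incoming/batch/2013-07-16.log"),
    ("batch", date(2013, 7, 15), "/incoming/batch/2013-07-15.log"),
    ("batch", date(2013, 8, 1), "/incoming/batch/2013-08-01.log"),
    ("storage", date(2013, 7, 15), "/incoming/storage/2013-07-15.log"),
]
plan = plan_monthly_merge(daily_files)
```

Grouping this way yields a few large files per hostgroup instead of one file per day, which matches the preference for large files stated above.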

Query & Analytics

Motivation

• Real-time queries based on clear API

• Dynamic dashboards creation

• Rich user-friendly dashboards

• Horizontally scalable and easy to deploy

• Limited data retention policy

• Handle different data types in the same way

Selected Technology: ElasticSearch + Kibana

ElasticSearch

• Distributed RESTful search & analytics engine

• Real time data acquisition and indexing

• Automatically balanced shards and replicas

• Schema free, document oriented (JSON)

• No prior data declaration required

• Automatic data type discovery

• Distributed under Apache license

ElasticSearch

• Full text search

• Apache Lucene is used to provide full text search

• Not only text: integer/long, float/double, boolean, etc.

• RESTful JSON API

$ curl -XGET http://es-search:9200/_cluster/health?pretty=true
{
  "cluster_name" : "itmon-es",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 11,
  "number_of_data_nodes" : 8,
  "active_primary_shards" : 2990,
  "active_shards" : 8970,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0
}
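The same health check can be scripted. The sketch below fetches the cluster health document over HTTP and derives two simple indicators from it; the function names and the choice of indicators are assumptions for this example, while the sample values are the ones shown in the response above.

```python
import json
from urllib.request import urlopen


def fetch_health(base_url: str) -> dict:
    """GET <base_url>/_cluster/health and decode the JSON response."""
    with urlopen(f"{base_url}/_cluster/health") as response:
        return json.load(response)


def summarize_health(health: dict) -> dict:
    """Reduce the health document to a pass/fail flag and a copy count."""
    primaries = health["active_primary_shards"]
    return {
        "healthy": health["status"] == "green" and health["unassigned_shards"] == 0,
        # Total active shard copies (primaries included) per primary shard.
        "copies_per_primary": health["active_shards"] / primaries if primaries else 0.0,
    }


# Values copied from the sample response; no network access is needed here.
sample = {
    "cluster_name": "itmon-es",
    "status": "green",
    "active_primary_shards": 2990,
    "active_shards": 8970,
    "unassigned_shards": 0,
}
summary = summarize_health(sample)
```

Against a live cluster one would call `summarize_health(fetch_health("http://es-search:9200"))`, using the host name from the curl example.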

ElasticSearch Limitations

• Requires a lot of RAM, mainly on data nodes

• IO intensive, careful deployment required

• Shards re-initialisation takes some time (~1h)

• Lots of shards and replicas per index, lots of indexes

• Not frequent operation, only after full cluster reboot

• Authentication not built-in (“bricolage”)

• Apache+Shibboleth on top of Jetty plugin

Kibana

• “Make sense of a mountain of logs”

• Designed to analyse logs

• Perfectly fits timestamped data (e.g. metrics)

• Profits from ElasticSearch search/analyse features

• No coding required

• Simply point & click to build your own dashboard

• Fully integrated and supported by ElasticSearch

• Started as separate project

Kibana

• Built with AngularJS

• JavaScript MVC for client-side rich application

• Developed and maintained by

• No backend: web server delivers static files

• JS directly queries ElasticSearch

• Easy to install and configure

• “git clone” OR “tar -xvzf” OR ElasticSearch plugin

• 1-line config file to point to the ElasticSearch cluster

• Saves its own configuration in ElasticSearch

Kibana

ElasticSearch Deployment

• Production cluster

• Running ElasticSearch 0.90.7

• 2 master nodes (16GB RAM, 8 cores)

• 1 search node (16GB RAM, 8 cores)

• 8 data nodes (48GB RAM, 24 cores, 500GB SSD)

• Monitoring: ElasticHQ, BigDesk, and Head

• Indexes structure

• One index per day with 30 days TTL

• 10 shards per index, 3 replicas per shard
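A one-index-per-day layout with a 30-day retention can be handled with simple name arithmetic. The sketch below is a minimal illustration; the index prefix and the `prefix-YYYY.MM.DD` naming pattern are assumptions, since the slides do not give the actual index names, and the deletion call itself is left out.

```python
from datetime import date, timedelta


def index_name(prefix: str, day: date) -> str:
    # Assumed naming pattern: <prefix>-YYYY.MM.DD
    return f"{prefix}-{day:%Y.%m.%d}"


def expired_indexes(existing, prefix: str, today: date, ttl_days: int = 30):
    """Return the daily indexes of `prefix` that fall outside the retention window."""
    keep = {index_name(prefix, today - timedelta(days=n)) for n in range(ttl_days)}
    return sorted(
        name for name in existing
        if name.startswith(prefix + "-") and name not in keep
    )


# Hypothetical index names for illustration.
existing = ["lemon-2014.03.17", "lemon-2014.02.16", "lemon-2014.02.15", "syslog-2014.01.01"]
to_delete = expired_indexes(existing, "lemon", date(2014, 3, 17))
```

Dropping a whole daily index is cheaper than deleting individual documents, which is one common reason for choosing a per-day layout.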

Kibana Deployment

• Based on ElasticSearch plugin

• Running v3.pre-4

• Deployed together with search node

• Profits from Jetty authentication

• Different endpoints for AuthN

• Public (read only)

• Private (read write)

Feedback

• Easy to deploy and manage

• Robust, fast, and rich API

• Easy query language (DSL)

• More features with aggregation framework

• Released with ElasticSearch v1.0
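To give a feel for the query DSL and the aggregation framework, the sketch below builds a request body that filters one hostgroup over a time range and averages a metric per time bucket. The field names (`hostgroup`, `@timestamp`, `value`) are assumptions, the `interval` parameter follows the 1.x-era `date_histogram` syntax, and aggregations require ElasticSearch 1.0 or later rather than the 0.90.7 release mentioned for the production cluster.

```python
import json


def build_metric_query(hostgroup: str, start: str, end: str, interval: str = "1h") -> dict:
    """Build a search body: filter by hostgroup and time, then average per bucket."""
    return {
        "size": 0,  # only the aggregation results are wanted, not the hits
        "query": {
            "bool": {
                "must": [
                    {"term": {"hostgroup": hostgroup}},
                    {"range": {"@timestamp": {"gte": start, "lte": end}}},
                ]
            }
        },
        "aggs": {
            "per_interval": {
                "date_histogram": {"field": "@timestamp", "interval": interval},
                "aggs": {"avg_value": {"avg": {"field": "value"}}},
            }
        },
    }


query = build_metric_query("batch", "2014-03-16T00:00:00Z", "2014-03-17T00:00:00Z")
body = json.dumps(query)  # would be sent as the body of a _search request
```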

Feedback

• Easy to deploy and use

• Very cool user interface

• Fits many use cases: text (syslog), metrics (lemon)

• Many “panels” available: tables, charts, hits, etc.

• Very active community and growing

• A bit limited feature set

• Many developments ongoing

Notifications

Motivation

• Modular tools to manage notifications

• Notifications delivered to multiple endpoints

• Automatic SNOW tickets / Central dashboard / etc.

• More efficient handling of notifications

• Enable SMs to improve automation of their services

• Improve routing of SNOW tickets

• Avoid wasting time in multiple (fake) hops

• Make visible problems previously hidden from SMs

• Allow others to publish/consume notifications

GNI

• General Notifications Infrastructure

• Manage all data centre notifications

• Messaging consumers integrating with other tools

• Multiple notification types: HW, APP, OS, NC

• Notifications delivered as SNOW Incidents

• Incidents assigned to appropriate support unit

• Incidents masking per notification type

• Notifications stored in ElasticSearch

• Visible via a dedicated Kibana dashboard
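The routing and masking behaviour listed above can be sketched as a small function that turns a notification into an incident record or suppresses it. Everything concrete here is an assumption for illustration: the support unit names, the incident fields, and the notification fields are invented and do not reflect the actual GNI or SNOW data models.

```python
from typing import Optional

# Hypothetical mapping from notification type to support unit.
SUPPORT_UNITS = {"HW": "unit-hw", "APP": "unit-app", "OS": "unit-os", "NC": "unit-nc"}


def route(notification: dict, masked_types: set) -> Optional[dict]:
    """Return an incident record for the notification, or None if its type is masked."""
    kind = notification["type"]
    if kind in masked_types:
        return None  # masking: no incident is created for this type
    return {
        "assignment_group": SUPPORT_UNITS[kind],
        "short_description": f"[{kind}] {notification['summary']}",
    }


incident = route({"type": "HW", "summary": "disk failure on node001"}, masked_types={"NC"})
suppressed = route({"type": "NC", "summary": "node002 unreachable"}, masked_types={"NC"})
```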

Deployment

• 3 VMs for messaging clients + ES cluster

• Using other IT services: ActiveMQ, SNOW

Real-time Analytics

Motivation

• Real-time analytics engine

• Automatic generation of curated data

• Easy to use under different contexts

• First target is aggregation of notifications

• Online machine learning, ETL, etc.

• Adopt open source tool

• Good candidates: Spark, Storm, ?

• Easy integration with current tools
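As a rough picture of what "aggregation of notifications" could mean, the sketch below counts notifications per fixed time window, hostgroup, and type. It is plain Python rather than Spark or Storm, and the window size and field names are assumptions made for the example.

```python
from collections import Counter


def aggregate(notifications, window_seconds: int = 300) -> dict:
    """Count notifications per (window start, hostgroup, type)."""
    counts = Counter()
    for item in notifications:
        # Align each timestamp to the start of its fixed-size window.
        window_start = item["timestamp"] - item["timestamp"] % window_seconds
        counts[(window_start, item["hostgroup"], item["type"])] += 1
    return dict(counts)


# Hypothetical stream of notifications with epoch-second timestamps.
stream = [
    {"timestamp": 0, "hostgroup": "batch", "type": "HW"},
    {"timestamp": 120, "hostgroup": "batch", "type": "HW"},
    {"timestamp": 299, "hostgroup": "batch", "type": "OS"},
    {"timestamp": 300, "hostgroup": "batch", "type": "HW"},
]
totals = aggregate(stream)
```

A streaming engine would compute the same counts incrementally as events arrive instead of over a finished list.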

Summary

Summary

Before                            After
Many central services             More platform services
Notifications limited to lemon    Generic notifications producers
Inefficient ticket routing        Flexible ticket routing
Limited to lemon metrics          Open to any monitoring data
Complex data access               Easy data access
Central lemon dashboard           Dashboard instances per application
Limited offline analytics         Batch analytics in HDFS
No real-time analytics            New real-time analytics tools

Summary

• New shared monitoring architecture

• Being adopted by all IT monitoring activities

• Selected technologies look good

• Flume, ES, Kibana, HDFS

• Happy to get your feedback on these and others

• Don’t forget the cultural changes

• Agile methodology, DevOps, PaaS, etc.

• As important as the technology changes
