Superb Supervision of Short-lived Servers with Sensu

Paul O’Connor


Yelp’s Mission

Connecting people with great local businesses.


Short-Lived Servers

• Servers in auto-scaling groups
• Short-lived batch servers that exist only while the batch runs
• Ensuring the latest image build


Why Sensu?

• Designed to be pluggable / extensible
• Arbitrary check metadata
• Simple model
• Components do exactly one thing
• Ruby
• Not afraid to extend (or fork!)

https://laur.ie/blog/2014/02/why-ill-be-letting-nagios-live-on-a-bit-longer-thank-you-very-much/

“Sensu has so many moving parts that I wouldn’t be able to sleep at night unless I set up a Nagios instance to make sure they were all running.”


How to use Sensu

• Don’t use all of this!
• Standalone checks only
• Default in Puppet module


Sensu Data Flow

• Sensu client runs checks locally on each machine
• Results are published to RabbitMQ Cluster
• Sensu Server in H/A Cluster
  • Processes check results from RabbitMQ
  • Invokes appropriate handlers
  • Writes state to Redis
• Redis + Redis Sentinel
  • 2+ instances in each cluster
  • Read by the Sensu API
• Every layer is behind HAProxy
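
The data flow above can be sketched in a few lines of Python. This is only an illustration of the shape of the pipeline, not Sensu's implementation: an in-process `queue.Queue` stands in for RabbitMQ, a plain dict stands in for Redis, and the function names (`run_check`, `process_results`) and host names are made up for the example. Calling handlers only on a non-zero status is also a simplification.

```python
import json
import queue


def run_check(client_name, check_name, command_result):
    """Client side: wrap a locally executed check in a result message."""
    status, output = command_result
    return json.dumps({
        "client": client_name,
        "check": {"name": check_name, "status": status, "output": output},
    })


def process_results(transport, state_store, handlers):
    """Server side: drain the transport, record state, call handlers on failures."""
    while True:
        try:
            message = transport.get_nowait()
        except queue.Empty:
            break
        result = json.loads(message)
        key = f"{result['client']}/{result['check']['name']}"
        state_store[key] = result["check"]["status"]  # stand-in for Redis
        if result["check"]["status"] != 0:
            for handler in handlers:
                handler(result)


transport = queue.Queue()  # stand-in for the RabbitMQ cluster
state = {}
alerts = []

# Two hypothetical clients publish results; the server processes both.
transport.put(run_check("web1", "check_disk_slash", (0, "DISK OK")))
transport.put(run_check("web2", "check_disk_slash", (2, "DISK CRITICAL")))
process_results(transport, state, [alerts.append])
```

After the run, `state` holds the latest status per client and check, and `alerts` holds only the failing result from `web2`.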


Mutually Assured Monitoring

• Multiple independent Sensu clusters per data centre/environment
  • 2+ RabbitMQ Servers
  • 2+ Redis Servers
  • 2+ Sensu Server/API Servers
• The clusters monitor each other


Machine Readable Config

• /etc/sensu/conf.d/checks/$check_name.json
• One check per file
• Extensible with arbitrary metadata
• Hash merge
• Never edit by hand!
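
The "one check per file" plus "hash merge" idea can be illustrated with a small recursive merge. The sketch below is an assumption about the general technique, not Sensu's own loader: `deep_merge` and `load_config` are names invented here, and the second check (`check_cron`) is a hypothetical example.

```python
import json


def deep_merge(base, override):
    """Return a new dict where nested dicts are merged key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def load_config(fragments):
    """Merge a sequence of JSON documents into one configuration hash."""
    config = {}
    for text in fragments:
        config = deep_merge(config, json.loads(text))
    return config


# Each string plays the role of one file under conf.d/checks/.
disk = '{"checks": {"check_disk_slash": {"interval": 300, "page": true}}}'
cron = '{"checks": {"check_cron": {"interval": 60, "team": "operations"}}}'
config = load_config([disk, cron])
```

Because both fragments share the top-level `checks` key, the merge keeps both check definitions side by side, which is what makes one file per check practical.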


Let Puppet Do The Work

• Puppet is already working in the environment
• It knows everything about every node in the environment
• Puppet is human readable

monitoring_check

    monitoring_check { 'check_disk_slash':
      page          => true,
      check_every   => '5m',
      alert_after   => '30m',
      realert_every => 10,
      runbook       => 'http://wiki/check_disk/slash',
      command       => '/usr/lib/nagios/plugins/check_disk -c 10% -K 10 -p /',
    }



sensu::check

• monitoring_check wraps this
• Writes a JSON file for each check
• Comment safe

    {
      "checks": {
        "check_disk_slash": {
          "standalone": true,
          "handlers": [ "default" ],
          "command": "/usr/lib/nagios/plugins/check_disk -c 5% -K 5 -p /",
          "dependencies": [ ],
          "interval": 300,
          "timeout": 300,
          "alert_after": 300,
          "realert_every": "10",
          "runbook": "http://wiki/disk-slash",
          "team": "operations",
          "irc_channels": "operations-notifications",
          "notification_email": "undef",
          "ticket": false,
          "project": false,
          "page": true,
          "tip": "Try: sudo du -h -x --max-depth=1 /",
          "tags": [ ]
        }
      }
    }



Check Scripts

• Same as Nagios checks
  • Simple text output
  • POSIX exit codes
• Result sent to Sensu Server, along with check definition
  • Includes all custom metadata
  • Custom handlers process the extra data
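
A check script in the Nagios style only has to print one line of text and return an exit status: 0 for OK, 1 for WARNING, 2 for CRITICAL, and 3 for UNKNOWN. The following is a minimal sketch of such a check, not the `check_disk` plugin used in the slides; the function names and the 20% / 10% thresholds are assumptions chosen for illustration.

```python
import shutil

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_free_space(free_percent, warn_below=20.0, crit_below=10.0):
    """Map a free-space percentage to a (status, message) pair."""
    if free_percent < crit_below:
        return CRITICAL, f"DISK CRITICAL - {free_percent:.1f}% free"
    if free_percent < warn_below:
        return WARNING, f"DISK WARNING - {free_percent:.1f}% free"
    return OK, f"DISK OK - {free_percent:.1f}% free"


def check_disk(path="/"):
    """Measure free space on path; report UNKNOWN if it cannot be measured."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as error:
        return UNKNOWN, f"DISK UNKNOWN - {error}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero capacity"
    return evaluate_free_space(100.0 * usage.free / usage.total)


def main(path="/"):
    """Print the one-line result and return the status to use as exit code."""
    status, message = check_disk(path)
    print(message)
    return status
```

A script built this way would typically end with `sys.exit(main())` so that the monitoring client sees the status as the process exit code.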


Handlers

• base
• JIRA
• email
• irc
• pagerduty
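
One way to picture how custom metadata could drive these handlers is a small routing function that reads the fields carried in the check definition. The slides do not show the routing rules, so everything below is an assumption: the function name `select_handlers`, the rule that a ticket needs both `ticket` and `project`, and the treatment of the literal string "undef" as "not set".

```python
def select_handlers(check):
    """Pick notification channels from a check's custom metadata (hypothetical rules)."""
    selected = []
    if check.get("page"):
        selected.append("pagerduty")
    if check.get("ticket") and check.get("project"):
        selected.append("jira")
    if check.get("irc_channels"):
        selected.append("irc")
    email = check.get("notification_email")
    if email and email != "undef":
        selected.append("email")
    return selected


# Metadata fields as they appear in the generated check JSON.
check = {
    "page": True,
    "ticket": False,
    "project": False,
    "irc_channels": "operations-notifications",
    "notification_email": "undef",
}
routes = select_handlers(check)
```

With the metadata shown, only the paging and IRC channels are selected under these assumed rules.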


How Do The Checks Get Executed?

• Each machine runs the client
• Client is managed entirely by Puppet

Situational Awareness


Single Source of Truth

• DNS is canonical source for sensu servers
• Configure things in one place

    # Use DNS to detect if this server is a sensu server
    $local_sensu_server_ips_array = gethostbyname2array("sensu.local-${::habitat}.yelpcorp.com")

    $array_intersection = intersection($::all_ipaddresses, $local_sensu_server_ips_array)

    # If one of our ipaddresses are in the dns entries, we must be a sensu server!
    if size($array_intersection) > 0 {
      $is_sensu_server = true
      $local_sensu_server_array = gethostbyaddr2array($local_sensu_server_ips_array)
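
The same "ask DNS whether I am a Sensu server" test can be sketched outside Puppet. This is an illustrative Python version, assuming the standard library's `socket.gethostbyname_ex` as the resolver; the function names, the `fake_resolver` stub, and the host name and addresses used with it are invented for the example.

```python
import socket


def resolve_ips(hostname, resolver=socket.gethostbyname_ex):
    """Return the set of IPv4 addresses behind hostname, or an empty set on failure."""
    try:
        _name, _aliases, addresses = resolver(hostname)
    except OSError:
        return set()
    return set(addresses)


def is_sensu_server(local_ips, server_hostname, resolver=socket.gethostbyname_ex):
    """True when any local address appears in the DNS record for the servers."""
    return bool(set(local_ips) & resolve_ips(server_hostname, resolver))


def fake_resolver(hostname):
    """Stub resolver so the example runs without network access."""
    return (hostname, [], ["10.0.0.5", "10.0.0.6"])


def broken_resolver(hostname):
    """Stub resolver that simulates a failed lookup."""
    raise socket.gaierror("name not known")


on_server = is_sensu_server(["10.0.0.5", "192.168.1.2"], "sensu.example.internal", fake_resolver)
on_client = is_sensu_server(["10.0.0.9"], "sensu.example.internal", fake_resolver)
```

Passing the resolver in as a parameter keeps DNS as the only source of truth in production while letting the logic be exercised with a stub.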



Automatic Monitoring

• Cron Jobs - check if a job was completed successfully
• cron::d

    if $staleness_threshold {
      $actual_command = "/nail/sys/bin/success_wrapper '${reporting_name}' ${command}"

      cron::staleness_check { $reporting_name:
        threshold => $staleness_threshold,
        params    => $staleness_check_params,
        user      => $user,
      }
    } else {
      $actual_command = $command
    }
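
The slides do not show what `/nail/sys/bin/success_wrapper` does internally. A plausible reading, given that a staleness check is declared alongside it, is that the wrapper runs the job and records the time of the last successful run. The sketch below implements that guess: `run_and_record` is a hypothetical name, and touching a marker file named after the job is an assumption.

```python
import os
import subprocess
import sys
import tempfile


def run_and_record(reporting_name, command, state_dir):
    """Run command; if it exits 0, refresh the marker file's modification time."""
    completed = subprocess.run(command)
    if completed.returncode == 0:
        marker = os.path.join(state_dir, reporting_name)
        with open(marker, "a"):
            pass  # create the marker if it does not exist yet
        os.utime(marker, None)  # set mtime to "now"
    return completed.returncode


# Demonstration with a temporary directory and two trivial jobs.
state_dir = tempfile.mkdtemp()
ok_status = run_and_record("nightly_report", [sys.executable, "-c", "pass"], state_dir)
bad_status = run_and_record(
    "broken_job", [sys.executable, "-c", "import sys; sys.exit(3)"], state_dir
)
```

Only the successful job leaves a marker behind, so a later freshness check can tell a job that failed from one that never ran at all.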


    define cron::staleness_check(
      $threshold,
      $params,
      $user,
    ) {

      $threshold_s = cron_human_time_to_seconds($threshold)

      # Check whether we are fresh five times per threshold, not to exceed 1 hour
      if $threshold_s / 5 > 3600 {
        $check_every = 3600
      } else {
        $check_every = $threshold_s / 5
      }

      $check_title = "${name}_staleness"

      $overrides = {
        'command'     => "/nail/sys/bin/cron_staleness_check ${name} ${threshold_s}",
        'check_every' => $check_every,
        'needs_sudo'  => true,
        'alert_after' => '2m',
      }

      $check_data = {
        "${check_title}" => merge( $params, $overrides )
      }
      create_resources('monitoring_check', $check_data)

      file { "/nail/run/success_wrapper/${name}":
        ensure => 'file',
        owner  => $user,
        mode   => '640',
      } -> Monitoring_check[$check_title]
    }
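
Two pieces of this define are easy to restate as a worked example: the interval rule (five checks per threshold, capped at one hour) and the freshness test itself. The interval function mirrors the Puppet arithmetic above. The freshness test is an assumption about what `cron_staleness_check` might do, since its source is not shown: here it compares the marker file's modification time with the threshold.

```python
import os
import time

MAX_INTERVAL_S = 3600  # never check less often than once an hour


def check_interval(threshold_s):
    """Five checks per threshold period, capped at one hour (integer seconds)."""
    return min(threshold_s // 5, MAX_INTERVAL_S)


def staleness_status(marker_path, threshold_s, now=None):
    """Return (status, message): 0 if the marker is fresh, 2 if stale or missing."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(marker_path)
    except OSError:
        return 2, f"CRITICAL: {marker_path} is missing"
    if age > threshold_s:
        return 2, f"CRITICAL: last success {int(age)}s ago (threshold {threshold_s}s)"
    return 0, f"OK: last success {int(age)}s ago"
```

For a 10-minute threshold the check runs every 120 seconds; for a daily job the cap applies and it runs hourly.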



Automatic Remediation

• Make the computer try something before paging
• Try it repeatedly if necessary

    monitoring_check { 'check_syslogd':
      page                => true,
      check_every         => '5m',
      alert_after         => '10m',
      realert_every       => 10,
      runbook             => 'http://wiki/syslogd',
      command             => '/usr/lib/nagios/plugins/check_proc syslogd',
      remediation_action  => '/etc/init.d/syslogd start',
      remediation_retries => 1,
    }
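
The "try something before paging" behaviour can be sketched as a small decision function. The slides only show the two parameters, so the logic here is an assumption about how they might be interpreted: run `remediation_action` until `remediation_retries` attempts have been made, then page. The function name and the fake runner used in the demonstration are invented for the example.

```python
import subprocess


def respond_to_failure(check, attempts_so_far, run=subprocess.call, page=print):
    """Handle one failing result; return the updated remediation attempt count."""
    action = check.get("remediation_action")
    retries = check.get("remediation_retries", 0)
    if action and attempts_so_far < retries:
        run(action, shell=True)  # try to fix the problem automatically
        return attempts_so_far + 1
    page(f"{check['name']} still failing after {attempts_so_far} remediation attempt(s)")
    return attempts_so_far


# Demonstration with a fake runner so no real command is executed.
ran, paged = [], []


def fake_run(command, shell=False):
    ran.append(command)
    return 0


check = {
    "name": "check_syslogd",
    "remediation_action": "/etc/init.d/syslogd start",
    "remediation_retries": 1,
}
attempts = respond_to_failure(check, 0, run=fake_run, page=paged.append)
attempts = respond_to_failure(check, attempts, run=fake_run, page=paged.append)
```

The first failure triggers the restart; the second failure, with retries exhausted, produces the page.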

Server Maintenance

• Don’t alert the on-call engineer if someone is working on a server

rc0.d

    function is_sensu_cli_available {
      which sensu-cli >/dev/null
      return $?
    }

    deregister () {
      if is_sensu_cli_available ; then
        # By this run level, sensu-client should be stopped, but we can stop anyway.
        service sensu-client stop 2>/dev/null
        sensu-cli client delete `/opt/puppet-omnibus/bin/facter fqdn`
      fi
    }

    case "$1" in
      start)
        echo "$0 does nothing on start."
        ;;
      stop)
        # Refuse to run on any runlevel except 0.
        if runlevel | grep -q 0; then
          deregister
        fi
        ;;
    esac


rc6.d

    function is_sensu_cli_available {
      which sensu-cli >/dev/null
      return $?
    }

    function silence_for {
      if is_sensu_cli_available; then
        if sensu-cli stash show "silence/${fqdn}" | grep -q "total items"; then
          echo "Silence already detected for ${fqdn}. Not replacing the existing silence"
        else
          echo "Automatically silencing this host, ${fqdn} for $1 seconds in sensu..."
          sensu-cli silence "${fqdn}" --expire "$1" --reason "Reboot initiated."
        fi
      else
        echo "sensu-cli is unavailable"
      fi
    }

    case "$1" in
      start)
        echo "Sensu silence does nothing on start"
        exit 0
        ;;
      stop)
        # Refuse to run on any runlevel except 6.
        echo "Sensu silenced by reboot " | logger -s -t sensu-silence
        silence_for 1800 2>&1 | logger -s -t sensu-silence
        ;;
    esac



Cluster Checks

• Assert some % of machines are healthy
• Use to reduce alert noise
• If a cluster becomes unavailable, you want someone to be paged
• If one machine becomes unavailable, it’s not a problem - open a JIRA ticket to get it fixed in core hours
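
A cluster check of this kind boils down to a percentage comparison over the latest per-host results. The sketch below is an illustration of that idea rather than the check used at Yelp; the function name, the 80% default threshold, and the host names are assumptions.

```python
def cluster_status(host_statuses, min_healthy_percent=80.0):
    """Aggregate per-host exit codes into one (status, message) for the cluster."""
    if not host_statuses:
        return 3, "UNKNOWN: no hosts reported"
    total = len(host_statuses)
    healthy = sum(1 for status in host_statuses.values() if status == 0)
    percent = 100.0 * healthy / total
    if percent < min_healthy_percent:
        return 2, f"CRITICAL: {healthy}/{total} hosts healthy ({percent:.0f}%)"
    return 0, f"OK: {healthy}/{total} hosts healthy ({percent:.0f}%)"


# One failed host out of five stays at the threshold; two failed out of three does not.
mostly_fine = cluster_status({"web1": 0, "web2": 0, "web3": 0, "web4": 2, "web5": 0})
degraded = cluster_status({"web1": 0, "web2": 2, "web3": 2})
```

Only the aggregate result would page; the single failing host in the first case can be left to a ticket, matching the policy in the bullets above.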

@YelpEngineering

fb.com/YelpEngineers

engineeringblog.yelp.com

github.com/yelp