paul-oconnor

Short-Lived Servers
• Servers in auto-scaling groups
• Short batch servers existing while the batch runs
• Ensuring latest image build

Why Sensu?
• Designed to be pluggable / extensible
• Arbitrary check metadata
• Simple model
• Components do exactly one thing
• Ruby
• Not afraid to extend (or fork!)
https://laur.ie/blog/2014/02/why-ill-be-letting-nagios-live-on-a-bit-longer-thank-you-very-much/
“Sensu has so many moving parts that I wouldn’t be able to sleep at night unless I set up a Nagios instance to make sure they were all running.”

Sensu Data Flow
• Sensu client runs checks locally on each machine
  • Results are published to RabbitMQ Cluster
• Sensu Server in H/A Cluster
  • Processes check results from RabbitMQ
  • Invokes appropriate handlers
  • Writes state to Redis
• Redis + Redis Sentinel
  • 2+ instances in each cluster
  • Read by the Sensu API
• Every layer is behind HAProxy
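The flow above can be sketched as a toy, in-process model. This is not Sensu code: a `queue.Queue` stands in for RabbitMQ, a plain dict stands in for Redis, and the names `run_check`, `process_results`, and `alert_on_failure` are invented for illustration.

```python
import queue
import subprocess

def run_check(client, name, command, results):
    """Client side: run the check locally and publish the result."""
    proc = subprocess.run(command, shell=True, capture_output=True, text=True)
    results.put({
        "client": client,
        "check": name,
        "status": proc.returncode,
        "output": proc.stdout.strip(),
    })

def process_results(results, state, handlers):
    """Server side: drain the queue, record state, invoke every handler."""
    while not results.empty():
        event = results.get()
        state[(event["client"], event["check"])] = event  # stand-in for Redis
        for handler in handlers:
            handler(event)

results = queue.Queue()  # stand-in for the RabbitMQ cluster
state = {}               # stand-in for Redis, which the API would read
alerts = []

def alert_on_failure(event):
    """A trivial handler: remember every non-OK result."""
    if event["status"] != 0:
        alerts.append(event)

run_check("web1", "check_ok", "echo 'DISK OK'", results)
run_check("web1", "check_bad", "echo 'DISK CRITICAL'; exit 2", results)
process_results(results, state, [alert_on_failure])
```

The point of the sketch is the separation of roles: the client only runs and publishes, the server only consumes and dispatches, and the state store is the only thing an API would need to read.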

Mutually Assured Monitoring
• Multiple independent Sensu clusters per data centre/environment
  • 2+ RabbitMQ Servers
  • 2+ Redis Servers
  • 2+ Sensu Server/API Servers
• Each cluster monitors the others
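A minimal sketch of the "everyone watches everyone else" arrangement, assuming a full mesh between clusters. The cluster names and the `monitoring_pairs` helper are hypothetical.

```python
def monitoring_pairs(clusters):
    """Return (watcher, target) pairs so every cluster watches every other."""
    return [(w, t) for w in clusters for t in clusters if w != t]

pairs = monitoring_pairs(["sensu-a", "sensu-b", "sensu-c"])
for watcher, target in pairs:
    print(f"{watcher} monitors {target}")
```

With this layout, losing any single cluster still leaves every surviving cluster watched by at least one other.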

Machine Readable Config
• /etc/sensu/conf.d/checks/$check_name.json
• One check per file
• Extensible with arbitrary metadata
• Hash merge
• Never edit by hand!
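"Hash merge" means the per-check files combine key by key into one configuration. A minimal sketch, assuming a recursive merge where later files win on conflicting keys. `deep_merge` and `load_checks` are hypothetical helpers, and Sensu's own loader may treat arrays and conflicts differently.

```python
import json

def deep_merge(base, override):
    """Recursively merge two dicts; values in `override` win on conflict."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

def load_checks(documents):
    """Merge a sequence of JSON strings (one per file) into one config."""
    config = {}
    for text in documents:
        config = deep_merge(config, json.loads(text))
    return config

# Two one-check "files", as they might appear under conf.d/checks/.
files = [
    '{"checks": {"check_disk_slash": {"interval": 300, "page": true}}}',
    '{"checks": {"check_syslogd": {"interval": 300, "page": false}}}',
]
config = load_checks(files)
```

Because each file contributes its own key under `checks`, one check per file never collides with another, which is what makes generating the files mechanically safe.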

Let Puppet Do The Work
• Puppet is already working in the environment
• It knows everything about every node in the environment
• Puppet is human readable

monitoring_check

    monitoring_check { 'check_disk_slash':
      page          => true,
      check_every   => '5m',
      alert_after   => '30m',
      realert_every => 10,
      runbook       => 'http://wiki/check_disk/slash',
      command       => '/usr/lib/nagios/plugins/check_disk -c 10% -K 10 -p /',
    }

    {
      "checks": {
        "check_disk_slash": {
          "standalone": true,
          "handlers": [
            "default"
          ],
          "command": "/usr/lib/nagios/plugins/check_disk -c 5% -K 5 -p /",
          "dependencies": [],
          "interval": 300,
          "timeout": 300,
          "alert_after": 300,
          "realert_every": "10",
          "runbook": "http://wiki/disk-slash",
          "team": "operations",
          "irc_channels": "operations-notifications",
          "notification_email": "undef",
          "ticket": false,
          "project": false,
          "page": true,
          "tip": "Try: sudo du -h -x --max-depth=1 /",
          "tags": []
        }
      }
    }
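The Puppet resource takes human-friendly durations such as '5m', while the generated JSON carries plain seconds such as 300. A rough sketch of that translation step follows. `human_time_to_seconds` and `render_check` are hypothetical names, and the accepted suffixes (s, m, h, d) are an assumption rather than something the slides specify.

```python
import re

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

def human_time_to_seconds(value):
    """Convert '5m' or '2h' to seconds; bare integers pass through."""
    if isinstance(value, int):
        return value
    match = re.fullmatch(r"(\d+)([smhd])", value.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {value!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]

def render_check(name, params):
    """Build a Sensu-style check hash from monitoring_check-like params."""
    return {"checks": {name: {
        "command": params["command"],
        "interval": human_time_to_seconds(params["check_every"]),
        "alert_after": human_time_to_seconds(params["alert_after"]),
        "page": params["page"],
        "runbook": params["runbook"],
    }}}

rendered = render_check("check_disk_slash", {
    "command": "/usr/lib/nagios/plugins/check_disk -c 10% -K 10 -p /",
    "check_every": "5m",
    "alert_after": "30m",
    "page": True,
    "runbook": "http://wiki/check_disk/slash",
})
```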

Check Scripts
• Same as Nagios checks
  • Simple text output
  • POSIX exit codes
• Result sent to Sensu Server, along with check definition
  • Includes all custom metadata
  • Custom handlers process the extra data
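A check in this style is just a program that prints one line and exits with a Nagios-convention code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). A minimal sketch of a disk check follows. The thresholds, message format, and `check_disk` function are illustrative and are not the Nagios plugin's own behaviour.

```python
import shutil

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_disk(path="/", warn_free_pct=20.0, crit_free_pct=10.0, usage=None):
    """Return (exit_code, message) for free space on `path`.

    `usage` lets a caller inject a (total, used, free) tuple, e.g. in tests.
    """
    try:
        total, _used, free = usage or shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    free_pct = 100.0 * free / total
    if free_pct < crit_free_pct:
        return CRITICAL, f"DISK CRITICAL - {free_pct:.1f}% free on {path}"
    if free_pct < warn_free_pct:
        return WARNING, f"DISK WARNING - {free_pct:.1f}% free on {path}"
    return OK, f"DISK OK - {free_pct:.1f}% free on {path}"
```

A standalone script would print the message and call `sys.exit(code)`; the client then ships both the output and the exit status to the server.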

How Do The Checks Get Executed?
• Each machine runs the client
• Client is managed entirely by Puppet

Single Source of Truth
• DNS is canonical source for sensu servers
• Configure things in one place

    # Use DNS to detect if this server is a sensu server
    $local_sensu_server_ips_array = gethostbyname2array("sensu.local-${::habitat}.yelpcorp.com")

    $array_intersection = intersection($::all_ipaddresses, $local_sensu_server_ips_array)

    # If one of our IP addresses is in the DNS entries, we must be a sensu server!
    if size($array_intersection) > 0 {
      $is_sensu_server = true
      $local_sensu_server_array = gethostbyaddr2array($local_sensu_server_ips_array)
      # ...
    }
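The same idea outside Puppet: resolve the well-known DNS name, then intersect the answer with the machine's own addresses. A rough sketch follows. `resolve_ips` and `is_sensu_server` are hypothetical helpers, and the resolver is injectable so the logic can be exercised without real DNS.

```python
import socket

def resolve_ips(hostname, resolver=socket.gethostbyname_ex):
    """Return the set of IPv4 addresses behind a DNS name."""
    _name, _aliases, addresses = resolver(hostname)
    return set(addresses)

def is_sensu_server(local_ips, server_ips):
    """True when any local address appears in the DNS record."""
    return bool(set(local_ips) & set(server_ips))

# Example with a fake resolver; the name and addresses are placeholders.
def fake_resolver(hostname):
    return (hostname, [], ["10.0.0.1", "10.0.0.2"])

server_ips = resolve_ips("sensu.example.internal", fake_resolver)
on_server = is_sensu_server(["127.0.0.1", "10.0.0.2"], server_ips)
on_client = is_sensu_server(["127.0.0.1", "10.0.0.9"], server_ips)
```

Because membership is derived from DNS, adding a server is a single change to the DNS record; nothing in the node configuration has to list servers by hand.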

Automatic Monitoring
• Cron Jobs - check if a job was completed successfully
• cron::d

    if $staleness_threshold {
      $actual_command = "/nail/sys/bin/success_wrapper '${reporting_name}' ${command}"

      cron::staleness_check { $reporting_name:
        threshold => $staleness_threshold,
        params    => $staleness_check_params,
        user      => $user,
      }
    } else {
      $actual_command = $command
    }
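The wrapper's job can be sketched as: run the cron command, and only if it exits 0, refresh a per-job stamp file. The real `/nail/sys/bin/success_wrapper` is not shown in the slides, so the stamp-file mechanism, the `run_dir` parameter, and the job names below are assumptions.

```python
import os
import subprocess
import tempfile

def success_wrapper(reporting_name, command, run_dir):
    """Run `command`; on exit status 0, touch run_dir/reporting_name."""
    status = subprocess.call(command, shell=True)
    if status == 0:
        stamp = os.path.join(run_dir, reporting_name)
        with open(stamp, "a"):
            pass               # create the stamp file if it is missing
        os.utime(stamp, None)  # set its mtime to the current time
    return status

# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as run_dir:
    ok_status = success_wrapper("nightly_backup", "exit 0", run_dir)
    ok_stamped = os.path.exists(os.path.join(run_dir, "nightly_backup"))
    bad_status = success_wrapper("broken_job", "exit 3", run_dir)
    bad_stamped = os.path.exists(os.path.join(run_dir, "broken_job"))
```

The wrapper returns the command's own exit status, so cron's behaviour is unchanged; the stamp file is a side channel that a separate check can inspect.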

    define cron::staleness_check(
      $threshold,
      $params,
      $user,
    ) {
      $threshold_s = cron_human_time_to_seconds($threshold)

      # Check whether we are fresh five times per threshold, not to exceed 1 hour
      if $threshold_s / 5 > 3600 {
        $check_every = 3600
      } else {
        $check_every = $threshold_s / 5
      }

      $check_title = "${name}_staleness"

      $overrides = {
        'command'     => "/nail/sys/bin/cron_staleness_check ${name} ${threshold_s}",
        'check_every' => $check_every,
        'needs_sudo'  => true,
        'alert_after' => '2m',
      }

      $check_data = {
        "$check_title" => merge($params, $overrides)
      }
      create_resources('monitoring_check', $check_data)

      file { "/nail/run/success_wrapper/${name}":
        ensure => 'file',
        owner  => $user,
        mode   => '640',
      } ->
      Monitoring_check[$check_title]
    }
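Two pieces of logic sit behind this define: the check interval is one fifth of the threshold, capped at one hour, and the check itself decides whether the job's last success is older than the threshold. A minimal sketch of both follows. The body of `cron_staleness_check` is not shown in the slides, so comparing a stamp file's mtime against the threshold is an assumption, as are the function names.

```python
import os
import time

def check_interval(threshold_s, cap_s=3600):
    """Check five times per threshold, but never less often than hourly."""
    return min(threshold_s // 5, cap_s)

def staleness_status(stamp_path, threshold_s, now=None,
                     getmtime=os.path.getmtime):
    """Return (exit_code, message): 0 if fresh, 2 if stale or missing."""
    now = time.time() if now is None else now
    try:
        age = now - getmtime(stamp_path)
    except OSError:
        return 2, f"CRITICAL: {stamp_path} is missing"
    if age > threshold_s:
        return 2, f"CRITICAL: last success {age:.0f}s ago (limit {threshold_s}s)"
    return 0, f"OK: last success {age:.0f}s ago"
```

For example, a job expected to succeed daily (86400 s) would be checked hourly because 86400 / 5 exceeds the cap, while a job with a one-hour threshold would be checked every 720 seconds.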

Automatic Remediation
• Make the computer try something before paging
• Try it repeatedly if necessary

    monitoring_check { 'check_syslogd':
      page                => true,
      check_every         => '5m',
      alert_after         => '10m',
      realert_every       => 10,
      runbook             => 'http://wiki/syslogd',
      command             => '/usr/lib/nagios/plugins/check_proc syslogd',
      remediation_action  => '/etc/init.d/syslogd start',
      remediation_retries => 1,
    }
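One way a handler could honour `remediation_action` and `remediation_retries` is to run the action until the retry budget is spent and only then page. This is a sketch under assumptions: the slides do not describe how the real handler tracks attempts, so the `remediation_attempts` counter on the event and the `handle_failure` function are invented for illustration.

```python
import subprocess

def handle_failure(event, run=subprocess.call, page=print):
    """Try the check's remediation action before paging a human."""
    check = event["check"]
    action = check.get("remediation_action")
    retries = int(check.get("remediation_retries", 0))
    attempts = event.get("remediation_attempts", 0)
    if action and attempts < retries:
        run(action, shell=True)                    # attempt the fix
        event["remediation_attempts"] = attempts + 1
        return "remediated"
    if check.get("page"):
        page(f"PAGE: {check['name']} still failing")
        return "paged"
    return "ignored"
```

Called once per failing result, the first failure triggers the restart and a later failure of the same check falls through to paging, which matches "make the computer try something before paging".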

rc0.d

    function is_sensu_cli_available {
      which sensu-cli >/dev/null
      return $?
    }

    deregister () {
      if is_sensu_cli_available ; then
        # By this run level, sensu-client should be stopped, but we can stop anyway.
        service sensu-client stop 2>/dev/null
        sensu-cli client delete "$(/opt/puppet-omnibus/bin/facter fqdn)"
      fi
    }

    case "$1" in
      start)
        echo "$0 does nothing on start."
        ;;
      stop)
        # Refuse to run on any runlevel except 0.
        if runlevel | grep -q 0; then
          deregister
        fi
        ;;
    esac

rc6.d

    function is_sensu_cli_available {
      which sensu-cli >/dev/null
      return $?
    }

    function silence_for {
      if is_sensu_cli_available; then
        if sensu-cli stash show "silence/${fqdn}" | grep -q "total items"; then
          echo "Silence already detected for ${fqdn}. Not replacing the existing silence"
        else
          echo "Automatically silencing this host, ${fqdn} for $1 seconds in sensu..."
          sensu-cli silence "${fqdn}" --expire "$1" --reason "Reboot initiated."
        fi
      else
        echo "sensu-cli is unavailable"
      fi
    }

    case "$1" in
      start)
        echo "Sensu silence does nothing on start"
        exit 0
        ;;
      stop)
        # Refuse to run on any runlevel except 6.
        echo "Sensu silenced by reboot " | logger -s -t sensu-silence
        silence_for 1800 2>&1 | logger -s -t sensu-silence
        ;;
    esac

Cluster Checks
• Assert some % of machines are healthy
• Use to reduce alert noise
• If a cluster becomes unavailable, you want someone to be paged
• If one machine becomes unavailable, it’s not a problem - open a JIRA ticket to get it fixed in core hours
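The page-versus-ticket decision above can be sketched as a percentage threshold over per-host results. The 80% default, the function names, and the routing labels are illustrative assumptions, not values from the slides.

```python
def cluster_status(host_statuses, min_healthy_pct=80.0):
    """Return (exit_code, message) for a dict of host -> check exit code."""
    total = len(host_statuses)
    if total == 0:
        return 3, "UNKNOWN: no hosts reported"
    healthy = sum(1 for status in host_statuses.values() if status == 0)
    pct = 100.0 * healthy / total
    if pct < min_healthy_pct:
        return 2, f"CRITICAL: {pct:.0f}% healthy ({healthy}/{total})"
    return 0, f"OK: {pct:.0f}% healthy ({healthy}/{total})"

def route(host_statuses, min_healthy_pct=80.0):
    """Page when the cluster is unhealthy; ticket for isolated failures."""
    code, _message = cluster_status(host_statuses, min_healthy_pct)
    if code == 2:
        return "page"
    if code == 3 or any(status != 0 for status in host_statuses.values()):
        return "ticket"
    return "none"
```

With five hosts and one failure the cluster is still 80% healthy, so the result is a ticket; with two of three hosts down the cluster drops to 33% and someone gets paged.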