itai-frenkel
Monitoring Patterns for Mitigating Technical Risk
@Forter
#1 risk: Slow or bad (500) API responses
Auto-healing (because humans are slow): SLA, Failover, Degradation, Throttling
Alerting: Detect, Filter, Alert, Diagnostics
SLA                Performance      Data Loss    Business Logic
TX Processing      Low Latency      Nope         Best Effort
Stream Processing  High Throughput  Best Effort  Best Effort
Batch Processing   High Volume      Nope         Reconciliation
Automatic Failover
- http fencing (Incapsula)
- http load balancing (ELB)
- instance restart (Scaling Group)
- process restart (upstart)
exceptions bubble up and crash
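The "exceptions bubble up and crash" approach only works if something restarts the crashed process. A minimal sketch of an upstart job that does this (job name, paths, and limits are hypothetical, not Forter's actual config):

```
# /etc/init/api-worker.conf  (hypothetical job name)
description "API worker - restarted automatically on crash"
start on runlevel [2345]
stop on runlevel [016]
respawn                 # restart the process whenever it exits
respawn limit 10 60     # give up after 10 crashes within 60 seconds
exec /usr/bin/node /opt/api/server.js
```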
Graceful Degradation
nginx (lua)
expressjs (nodejs)
storm (java)
Stability
Code Changes
Throttling (without back-pressure): request priority is reduced when TX/sec > threshold
Different priority → Different queue → Different worker
Lower priority inside the queue for test probes
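The demotion scheme above can be sketched in a few lines of Python. This toy version collapses the separate per-priority queues into a single priority heap for brevity; the class and parameter names are illustrative, not Forter's implementation:

```python
import heapq
import itertools

NORMAL, LOW = 0, 1            # lower number = higher priority
_counter = itertools.count()  # tie-breaker keeps FIFO order within a priority

class ThrottlingQueue:
    """Demote requests to low priority when the transaction rate
    exceeds a threshold; test probes always run at low priority."""

    def __init__(self, tx_per_sec_threshold):
        self.threshold = tx_per_sec_threshold
        self.heap = []

    def enqueue(self, request, tx_per_sec, is_test_probe=False):
        # Demote when overloaded, and always demote synthetic test probes
        # so they never delay real customer transactions.
        demote = is_test_probe or tx_per_sec > self.threshold
        priority = LOW if demote else NORMAL
        heapq.heappush(self.heap, (priority, next(_counter), request))

    def dequeue(self):
        return heapq.heappop(self.heap)[2]
```

Real transactions enqueued under the threshold are served first; test probes and overload-time requests drain only afterwards.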
Detect -> Filter -> Alert -> Manual Diagnostics
Alerting
Detection
filter & route
alert
diagnostics
Redundancy:
CloudWatch/CollectD - fast, no root cause
App events (exceptions) - too noisy, root cause
Pingdom probes - low coverage, reliable
Internal probes - better coverage, false alarms
CloudWatch → PagerDuty alert (no Riemann)
System test → PagerDuty alert (Riemann needed)
filter tests using a state machine

(tagged "apisRegression"
  (pagerduty-test-dispatch "1234567892ed295d91"))

(defn pagerduty-test-dispatch [key]
  (let [pd (pagerduty key)]
    (changed-state {:init "passed"}
      (where (state "passed") (:resolve pd))
      (where (state "failed") (:trigger pd)))))
re-open manually resolved alert

(tagged "apisRegression"
  (pagerduty-test-dispatch "1234567892ed295d91"))

(defn pagerduty-test-dispatch [key]
  (let [pd (pagerduty key)]
    (sdo
      (changed-state {:init "passed"}
        (where (state "passed") (:resolve pd)))
      (where (state "failed")
        (by [:host :service]
          (throttle 1 60 (:trigger pd)))))))
Diagnostics - Storm topology timing
Diagnostics - Storm timelines
#2 risk: Slowing down the merchant's website
Alerting
- Monitor each and every browser
- Aggregations (per browser type)
- Notify on thresholds
Monitoring our javascript snippet
- Timeouts
- Exceptions by browser
- Exception aggregation
- Monitoring new versions
Riemann's Index (server monitoring)
key (host+service)    event                   TTL
10.0.0.1-redisfree    {"metric": "5"}          60
10.0.0.1-probe1       {"state": "failed"}     300
10.0.0.2-probe1       {"state": "passed"}     300
Riemann's Index
key (host+service)    event                      TTL
199.25.1.1-1234       {"state": "loaded"}        300
199.25.2.1-4567       {"state": "downloaded"}    300
199.25.3.1-8901       {"state": "loaded"}        300
For our use case: host=browser-ip, service=cookie
Riemann's state machine (index)
- stores the last event and creates "expired" events (TTL)
- (by [:host :service] stream) creates a new stream for each host/service
- (by-host-service stream) - Forter's fork only; also closes the stream when the TTL expires
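The indexing behavior described above can be illustrated with a toy Python sketch: keep the last event per host/service key, and emit an "expired" copy of it once its TTL passes. Names and the event shape are illustrative; this is not Riemann's actual implementation:

```python
import time

class Index:
    """Toy Riemann-style index: last event per (host, service) key,
    with TTL-based expiry that emits synthetic "expired" events."""

    def __init__(self):
        self.store = {}  # (host, service) -> (event, deadline)

    def update(self, event, now=None):
        """Store the event and reset its expiry deadline."""
        now = time.time() if now is None else now
        key = (event["host"], event["service"])
        self.store[key] = (event, now + event.get("ttl", 60))

    def expire(self, now=None):
        """Drop entries past their deadline; return them as "expired" events."""
        now = time.time() if now is None else now
        expired = []
        for key, (event, deadline) in list(self.store.items()):
            if now >= deadline:
                expired.append({**event, "state": "expired"})
                del self.store[key]
        return expired
```

This is what makes the load-time calculation below possible: a cookie that was "downloaded" but never "loaded" eventually surfaces as an "expired" event instead of silently disappearing.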
(defn calc-load-time [& children]
  (by-host-service
    (changed :state {:pairs? true}
      (smap
        (fn [[previous current]]
          (cond
            (and (= (:state previous) "downloaded")
                 (= (:state current) "loaded"))
            (assoc previous :metric (- (:time current) (:time previous)))

            (and (= (:state previous) "downloaded")
                 (= (:state current) "expired"))
            (assoc previous :metric (* JS_TIMEOUT 1000))))
        children))))
(defn aggregate-by-browser [& children]
  (by [:browser]
    (fixed-time-window 60
      (sdo
        (smap folds/median (tag "median-load-time" children))
        (smap folds/count (tag "load-count" children))))))
#3 risk: Wrong decision (approve/decline)
Alerting: Anomaly detection
Motivation
- Control false alarms mathematically
- Threshold per customer
- Threshold seasonality
Alert me if the probability that we decline more than k out of n transactions, given probability p, is 1 in a million (t = 0.0001%)
n - number of txs (30 minutes)
k - number of declined txs (30 minutes)
p - per-customer declined/total (24 hours)
t - alert threshold
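Under the binomial model, the alert condition above is a direct tail-probability computation: alert when seeing more than k declines out of n is essentially impossible by chance under the customer's baseline rate p. A Python sketch (function names are illustrative):

```python
from math import comb

def decline_tail_probability(k, n, p):
    """P(X > k) for X ~ Binomial(n, p): the probability of observing
    MORE than k declines out of n transactions by chance alone."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k + 1, n + 1))

def should_alert(k, n, p, t=1e-6):
    """Alert when the observed decline count is so unlikely under the
    customer's 24-hour baseline p that it signals a real anomaly."""
    return decline_tail_probability(k, n, p) < t
```

For example, 20 declines in a 30-minute window of 100 transactions for a customer whose baseline decline rate is 2% has a tail probability far below one in a million, so it alerts; 5 declines at a 5% baseline is expected behavior and stays quiet.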
Binomial Distribution Assumption: external events are uncorrelated
What happens when a customer retries the same Tx because the first one was declined?
Questions? email: [email protected]
http://tech.forter.com
http://www.softwarearchitectureaddict.com