Monitoring Patterns for Mitigating Technical Risk @Forter




#1 risk: Slow or bad (500) API responses

Auto-healing (because humans are slow)
SLA, Failover, Degradation, Throttling

Alerting
Detect, Filter, Alert, Diagnostics


SLA

                   Performance      Data Loss    Business Logic
TX Processing      Low Latency      Nope         Best Effort
Stream Processing  High Throughput  Best Effort  Best Effort
Batch Processing   High Volume      Nope         Reconciliation


Automatic Failover
http fencing (Incapsula)
http load balancing (ELB)
instance restart (Scaling Group)
process restart (upstart)

exceptions bubble up and crash


Graceful Degradation

nginx (lua)

expressjs (nodejs)

storm (java)

Stability

Code Changes


Throttling (without back-pressure)
request priority reduced when TX/sec > threshold

Different priority → Different queue → Different worker

lower priority inside queue for test probes
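
The throttling rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `PriorityRouter`, the one-second sliding window used to estimate TX/sec, and the two queue names are hypothetical and do not come from the slides. Nothing is rejected or blocked (no back-pressure); requests above the threshold are merely demoted to a queue served by a different worker, and test probes wait behind real traffic inside whichever queue they land in.

```python
import heapq
import itertools
import time
from collections import deque


class PriorityRouter:
    """Hypothetical sketch: demote requests to a low-priority queue when busy."""

    def __init__(self, threshold_per_sec, clock=time.monotonic):
        self.threshold_per_sec = threshold_per_sec
        self.clock = clock
        self.arrivals = deque()                  # arrival times in the last second
        self.queues = {"high": [], "low": []}    # one heap per priority level
        self.seq = itertools.count()             # keeps FIFO order within a level

    def submit(self, request, is_test_probe=False):
        """Enqueue a request and return the name of the queue it went to."""
        now = self.clock()
        # Slide the one-second window used to estimate TX/sec.
        while self.arrivals and now - self.arrivals[0] >= 1.0:
            self.arrivals.popleft()
        self.arrivals.append(now)

        name = "low" if len(self.arrivals) > self.threshold_per_sec else "high"
        # False sorts before True, so real traffic is served before test probes.
        heapq.heappush(self.queues[name], (is_test_probe, next(self.seq), request))
        return name

    def take(self, name):
        """Called by the worker dedicated to queue `name`; None when empty."""
        queue = self.queues[name]
        return heapq.heappop(queue)[2] if queue else None
```

Each queue would be drained by its own worker calling `take("high")` or `take("low")`, so a burst of demoted requests cannot starve the high-priority path.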


Detect -> Filter -> Alert -> Manual Diagnostics

Alerting


Detection


filter & route


alert


diagnostics


Redundancy
CloudWatch/CollectD - fast, no root cause
App events (exceptions) - too noisy, root cause
Pingdom probes - low coverage, reliable
Internal probes - better coverage, false alarms


cloudwatch → pagerduty alert

(no riemann)


system test → pagerduty alert

(riemann needed)


filter tests using a state machine


filter tests using a state machine

    (tagged "apisRegression"
      (pagerduty-test-dispatch "1234567892ed295d91"))

    (defn pagerduty-test-dispatch
      [key]
      (let [pd (pagerduty key)]
        ;; only pass events whose state changed; the initial state is "passed"
        (changed-state {:init "passed"}
          (where (state "passed") (:resolve pd))
          (where (state "failed") (:trigger pd)))))


re-open manually resolved alert


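re-open manually resolved alert

    (tagged "apisRegression"
      (pagerduty-test-dispatch "1234567892ed295d91"))

    (defn pagerduty-test-dispatch
      [key]
      (let [pd (pagerduty key)]
        (sdo
          ;; resolve only on a state change to "passed"
          (changed-state {:init "passed"}
            (where (state "passed") (:resolve pd)))
          ;; while failing, re-trigger at most once per 60 seconds per
          ;; host/service, which re-opens an alert that was resolved by hand
          (where (state "failed")
            (by [:host :service]
              (throttle 1 60 (:trigger pd)))))))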
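
To make the behaviour of this stream easier to follow, here is a small Python simulation of the same logic. It is a sketch, not Riemann itself: the class name `ProbeAlertDispatcher`, the event dictionaries, and the `trigger`/`resolve` callbacks are hypothetical stand-ins, and the 60-second limit is approximated by remembering the time of the last trigger per host/service rather than by Riemann's own `throttle` windowing.

```python
class ProbeAlertDispatcher:
    """Hypothetical sketch of: resolve on change to passed, throttled re-trigger."""

    def __init__(self, trigger, resolve, retrigger_interval=60):
        self.trigger = trigger
        self.resolve = resolve
        self.retrigger_interval = retrigger_interval
        self.last_state = {}     # (host, service) -> last seen state
        self.last_trigger = {}   # (host, service) -> time of the last trigger

    def handle(self, event):
        key = (event["host"], event["service"])
        # Unknown probes are assumed to have passed, like {:init "passed"}.
        previous = self.last_state.get(key, "passed")
        self.last_state[key] = event["state"]

        if event["state"] == "passed":
            # Resolve only when the state actually changed to "passed".
            if previous != "passed":
                self.resolve(event)
        elif event["state"] == "failed":
            # Every failed event may trigger, but at most once per interval,
            # so an alert that was resolved by hand is re-opened.
            last = self.last_trigger.get(key)
            if last is None or event["time"] - last >= self.retrigger_interval:
                self.last_trigger[key] = event["time"]
                self.trigger(event)
```

With a probe that fails at t=10, 40 and 70 and passes at t=80, this triggers at 10 and 70 and resolves at 80; the failure at 40 is suppressed by the 60-second limit.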


Diagnostics - Storm topology timing


Diagnostics - Storm timelines


#2 risk: Slowing down merchant's website

Alerting
Monitor each and every browser
Aggregations (per browser type)
Notify on thresholds


Monitoring our javascript snippet

Timeouts
Exceptions by browser
Exception aggregation
Monitoring new versions


Riemann's Index (server monitoring)

key (host+service)   event                 TTL
10.0.0.1-redisfree   {"metric":"5"}        60
10.0.0.1-probe1      {"state":"failed"}    300
10.0.0.2-probe1      {"state":"passed"}    300


Riemann's Index

key (host+service)   event                     TTL
199.25.1.1-1234      {"state":"loaded"}        300
199.25.2.1-4567      {"state":"downloaded"}    300
199.25.3.1-8901      {"state":"loaded"}        300

For our use case: host=browser-ip, service=cookie


Riemann's state machine

(index) stores last event and creates expired events (TTL)

(by [:host :service] stream) creates a new stream for each host/service

(by-host-service stream) - Forter's fork only; also closes the stream when TTL expires
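
The role of the index (keep the last event per host/service, emit an "expired" event once its TTL runs out) can be illustrated with a small Python sketch. This is not Riemann's implementation; the class name `Index`, the dictionary event format, and the explicit `expire(now)` call are assumptions made for the example.

```python
class Index:
    """Hypothetical sketch: last event per (host, service) with TTL expiry."""

    def __init__(self):
        self.events = {}   # (host, service) -> last event seen

    def update(self, event):
        """Store the event, replacing any earlier one for the same key."""
        self.events[(event["host"], event["service"])] = event

    def expire(self, now):
        """Drop events older than their TTL and return 'expired' copies."""
        expired = []
        for key, event in list(self.events.items()):
            if now - event["time"] > event["ttl"]:
                del self.events[key]
                expired.append({**event, "state": "expired", "time": now})
        return expired
```

In the browser use case, a "downloaded" event that is never followed by "loaded" eventually comes back from `expire` with state "expired", which is what lets a downstream stream treat it as a timeout.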


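    (defn calc-load-time [& children]
      (by-host-service
        (changed :state {:pairs? true}
          ;; children is a seq, so apply spreads it into smap's child streams
          (apply smap
            (fn [[previous current]]
              (cond
                ;; downloaded, then loaded: metric is the elapsed time
                (and (= (:state previous) "downloaded")
                     (= (:state current) "loaded"))
                (assoc previous :metric (- (:time current) (:time previous)))

                ;; downloaded, then expired (never loaded): report the timeout
                (and (= (:state previous) "downloaded")
                     (= (:state current) "expired"))
                (assoc previous :metric (* JS_TIMEOUT 1000))))
            children))))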


    (defn aggregate-by-browser [& children]
      (by [:browser]
        (fixed-time-window 60
          (sdo
            ;; apply spreads the children seq into tag's child streams
            (smap folds/median (apply tag "median-load-time" children))
            (smap folds/count (apply tag "load-count" children))))))
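
For readers less familiar with Riemann streams, the aggregation above amounts to grouping load-time events by browser and by 60-second window, then reporting the median and the count for each group. A minimal Python sketch of that computation follows. The function name and event format are hypothetical, and the windows here are aligned to multiples of 60 seconds, which is a simplification that may differ from how Riemann's `fixed-time-window` anchors its windows.

```python
from collections import defaultdict
from statistics import median


def aggregate_by_browser(events, window=60):
    """Group events by (browser, window start) and summarise their load times."""
    buckets = defaultdict(list)
    for event in events:
        # Assumption: windows are aligned to multiples of `window` seconds.
        window_start = int(event["time"] // window) * window
        buckets[(event["browser"], window_start)].append(event["metric"])
    return {
        key: {"median-load-time": median(values), "load-count": len(values)}
        for key, values in buckets.items()
    }
```

A threshold check per browser type can then be applied to the resulting median, for example alerting when one browser's median load time is far above the others.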


#3 risk: Wrong decision (approve/decline)

Alerting
Anomaly detection


Motivation
Control false alarms mathematically
Threshold per customer
Threshold seasonality


Alert me if
the probability that we decline more than k out of n transactions, given decline probability p,
is less than 1 in a million (t=0.0001%)

n   number of tx (30 minutes)
k   number of declined txs (30 minutes)
p   per customer declined/total (24 hours)
t   alert threshold
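
Under the binomial model, the probability of more than k declines out of n transactions is the upper tail P(X > k) = sum over i from k+1 to n of C(n, i) p^i (1-p)^(n-i). The following Python sketch computes that tail and compares it with the threshold t. The function names are hypothetical, and the exact comparison (strictly more than k, alert when the tail falls below t) is one reading of the slide rather than a confirmed detail of the production rule.

```python
from math import comb


def decline_tail_probability(n, k, p):
    """P(X > k) for X ~ Binomial(n, p): chance of more than k declines in n tx."""
    return sum(
        comb(n, i) * p**i * (1 - p) ** (n - i)
        for i in range(k + 1, n + 1)
    )


def should_alert(n, k, p, t=1e-6):
    """Alert when the observed decline count is that unlikely under rate p."""
    return decline_tail_probability(n, k, p) < t
```

For a customer whose 24-hour decline rate is 5%, seeing 50 declines in 100 transactions is far below the one-in-a-million threshold and would alert, while 5 declines in 100 would not.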


Binomial Distribution Assumption
External events are uncorrelated

What happens when a customer retries the same Tx because the first one was declined?


Questions?
email [email protected]

http://tech.forter.com
http://www.softwarearchitectureaddict.com