itai-frenkel
Monitoring Patterns for Mitigating Technical Risk
@Forter
#1 risk: Slow or bad (500) API responses
Auto-healing (because humans are slow): SLA, Failover, Degradation, Throttling
Alerting: Detect, Filter, Alert, Diagnostics
SLA                Performance      Data Loss    Business Logic
TX Processing      Low Latency      Nope         Best Effort
Stream Processing  High Throughput  Best Effort  Best Effort
Batch Processing   High Volume      Nope         Reconciliation
Automatic Failover
- http fencing (Incapsula)
- http load balancing (ELB)
- instance restart (Scaling Group)
- process restart (upstart)
exceptions bubble up and crash
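The "exceptions bubble up and crash" approach only works if something restarts the crashed process. A minimal sketch of an upstart job that does this (job name, paths, and limits are hypothetical, not Forter's actual config):

```
# /etc/init/api-worker.conf  (hypothetical job name)
description "API worker - restarted automatically on crash"
start on runlevel [2345]
stop on runlevel [016]
respawn                 # restart the process whenever it exits
respawn limit 10 60     # give up after 10 crashes within 60 seconds
exec /usr/bin/node /opt/api/server.js
```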
Graceful Degradation
nginx (lua)
expressjs (nodejs)
storm (java)
Stability
Code Changes
Throttling (without back-pressure): request priority is reduced when TX/sec > threshold
Different priority → Different queue → Different worker
Lower priority inside the queue for test probes
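The demotion scheme above can be sketched in a few lines of Python. This toy version collapses the separate per-priority queues into a single priority heap for brevity; the class and parameter names are illustrative, not Forter's implementation:

```python
import heapq
import itertools

NORMAL, LOW = 0, 1            # lower number = higher priority
_counter = itertools.count()  # tie-breaker keeps FIFO order within a priority

class ThrottlingQueue:
    """Demote requests to low priority when the transaction rate
    exceeds a threshold; test probes always run at low priority."""

    def __init__(self, tx_per_sec_threshold):
        self.threshold = tx_per_sec_threshold
        self.heap = []

    def enqueue(self, request, tx_per_sec, is_test_probe=False):
        # Demote when overloaded, and always demote synthetic test probes
        # so they never delay real customer transactions.
        demote = is_test_probe or tx_per_sec > self.threshold
        priority = LOW if demote else NORMAL
        heapq.heappush(self.heap, (priority, next(_counter), request))

    def dequeue(self):
        return heapq.heappop(self.heap)[2]
```

Real transactions enqueued under the threshold are served first; test probes and overload-time requests drain only afterwards.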
Detect -> Filter -> Alert -> Manual Diagnostics
Alerting
Detection
filter & route
alert
diagnostics
Redundancy:
CloudWatch/CollectD - fast, no root cause
App events (exceptions) - too noisy, root cause
Pingdom probes - low coverage, reliable
Internal probes - better coverage, false alarms
CloudWatch → PagerDuty alert (no Riemann)
System test → PagerDuty alert (Riemann needed)
filter tests using a state machine

(tagged "apisRegression"
  (pagerduty-test-dispatch "1234567892ed295d91"))

(defn pagerduty-test-dispatch [key]
  (let [pd (pagerduty key)]
    (changed-state {:init "passed"}
      (where (state "passed") (:resolve pd))
      (where (state "failed") (:trigger pd)))))
re-open manually resolved alert

(tagged "apisRegression"
  (pagerduty-test-dispatch "1234567892ed295d91"))

(defn pagerduty-test-dispatch [key]
  (let [pd (pagerduty key)]
    (sdo
      (changed-state {:init "passed"}
        (where (state "passed") (:resolve pd)))
      (where (state "failed")
        (by [:host :service]
          (throttle 1 60 (:trigger pd)))))))
Diagnostics - Storm topology timing
Diagnostics - Storm timelines
#2 risk: Slowing down the merchant's website
Alerting
- Monitor each and every browser
- Aggregations (per browser type)
- Notify on thresholds
Monitoring our javascript snippet
- Timeouts
- Exceptions by browser
- Exception aggregation
- Monitoring new versions
Riemann's Index (server monitoring)
key (host+service)    event                   TTL
10.0.0.1-redisfree    {"metric": "5"}          60
10.0.0.1-probe1       {"state": "failed"}     300
10.0.0.2-probe1       {"state": "passed"}     300
Riemann's Index
key (host+service)    event                      TTL
199.25.1.1-1234       {"state": "loaded"}        300
199.25.2.1-4567       {"state": "downloaded"}    300
199.25.3.1-8901       {"state": "loaded"}        300
For our use case: host=browser-ip, service=cookie
Riemann's state machine (index)
- stores the last event and creates "expired" events (TTL)
- (by [:host :service] stream) creates a new stream for each host/service
- (by-host-service stream) - Forter's fork only; also closes the stream when the TTL expires
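The indexing behavior described above can be illustrated with a toy Python sketch: keep the last event per host/service key, and emit an "expired" copy of it once its TTL passes. Names and the event shape are illustrative; this is not Riemann's actual implementation:

```python
import time

class Index:
    """Toy Riemann-style index: last event per (host, service) key,
    with TTL-based expiry that emits synthetic "expired" events."""

    def __init__(self):
        self.store = {}  # (host, service) -> (event, deadline)

    def update(self, event, now=None):
        """Store the event and reset its expiry deadline."""
        now = time.time() if now is None else now
        key = (event["host"], event["service"])
        self.store[key] = (event, now + event.get("ttl", 60))

    def expire(self, now=None):
        """Drop entries past their deadline; return them as "expired" events."""
        now = time.time() if now is None else now
        expired = []
        for key, (event, deadline) in list(self.store.items()):
            if now >= deadline:
                expired.append({**event, "state": "expired"})
                del self.store[key]
        return expired
```

This is what makes the load-time calculation below possible: a cookie that was "downloaded" but never "loaded" eventually surfaces as an "expired" event instead of silently disappearing.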
(defn calc-load-time [& children]
  (by-host-service
    (changed :state {:pairs? true}
      (smap
        (fn [[previous current]]
          (cond
            (and (= (:state previous) "downloaded")
                 (= (:state current) "loaded"))
            (assoc previous :metric (- (:time current) (:time previous)))

            (and (= (:state previous) "downloaded")
                 (= (:state current) "expired"))
            (assoc previous :metric (* JS_TIMEOUT 1000))))
        children))))
(defn aggregate-by-browser [& children]
  (by [:browser]
    (fixed-time-window 60
      (sdo
        (smap folds/median (tag "median-load-time" children))
        (smap folds/count (tag "load-count" children))))))
#3 risk: Wrong decision (approve/decline)
Alerting: Anomaly detection
Motivation
- Control false alarms mathematically
- Threshold per customer
- Threshold seasonality
Alert me if the probability that we decline more than k out of n transactions, given probability p, is 1 in a million (t = 0.0001%)
n - number of txs (30 minutes)
k - number of declined txs (30 minutes)
p - per-customer declined/total (24 hours)
t - alert threshold
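Under the binomial model, the alert condition above is a direct tail-probability computation: alert when seeing more than k declines out of n is essentially impossible by chance under the customer's baseline rate p. A Python sketch (function names are illustrative):

```python
from math import comb

def decline_tail_probability(k, n, p):
    """P(X > k) for X ~ Binomial(n, p): the probability of observing
    MORE than k declines out of n transactions by chance alone."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k + 1, n + 1))

def should_alert(k, n, p, t=1e-6):
    """Alert when the observed decline count is so unlikely under the
    customer's 24-hour baseline p that it signals a real anomaly."""
    return decline_tail_probability(k, n, p) < t
```

For example, 20 declines in a 30-minute window of 100 transactions for a customer whose baseline decline rate is 2% has a tail probability far below one in a million, so it alerts; 5 declines at a 5% baseline is expected behavior and stays quiet.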
Binomial Distribution Assumption: external events are uncorrelated
What happens when a customer retries the same Tx because the first one was declined?
Questions? email: [email protected]
http://tech.forter.com
http://www.softwarearchitectureaddict.com