© 2015 ROBERT BIGOS
Capacity rolling disaster challenge
Robert Bigos [email protected] +48 665-168-240
http://www.slideshare.net/RobertBigos
https://pl.linkedin.com/in/robertbigos
@bigosr
Agenda
Why is it so important?
Let's talk about:
• typical monitoring dashboards
• statistics and visualization basics
• queueing basics
• mathematician vs physicist
• complexity
• measurements, time and big data
• pattern visualizations
• lessons learned
Why is capacity so important?
Source: presenter studies for top enterprises in Poland.
Source: http://s134.photobucket.com/user/charlesfrith/media/disaster.gif.html
Capacity vs performance
Capacity = fuel
Performance = speed and altitude
Capacity and performance management: how quickly and safely you can reach your planned destinations
Why is capacity so important?
Source: presenter studies for top enterprises in Poland.
Source: http://www.skybrary.aero/index.php/James_Reason_HF_Model
Root cause analysis?
Source: presenter studies for top enterprises in Poland.
Source: Józef Tischner "The Highlander's History of Philosophy"
"the truth, the whole truth and the bullshit truth!"
Why is capacity so important?
Source: presenter studies for top enterprises in Poland.
is there a capacity for growth?
is there a capacity for change?
is there a capacity for backup?
is there a capacity to restore?
is it calculated?
is it tested?
what about quality expectations?
what about financial aspects?
time!
DR plan
There is no magic button
Source: http://make-everything-ok.com/
CritSit
… it is a long story…
• DR procedures
• automation
• people
• training
• testing
• communications
• leadership
• …
Typical "time-graph-centric" performance dashboard
Source: Demo site dashboard grafana.org
Peeping through the keyhole
Computer vs human scale: if every nanosecond of computer time is counted as one human second, 5 mins = 5*60/10^-9/(60*60*24*365) ≈ 9513 years
few objects, few variables, no dependencies, no relations…
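The scale comparison on this slide can be reproduced with a few lines of arithmetic. This is only a sketch: the function name `human_scale_years` and the choice of one nanosecond as the machine "tick" are assumptions made for illustration, taken from the 10^-9 factor in the slide's formula.

```python
# Assumption: one nanosecond of machine time is counted as one human second.
NANOSECOND = 1e-9
SECONDS_PER_YEAR = 60 * 60 * 24 * 365


def human_scale_years(interval_seconds: float, machine_tick: float = NANOSECOND) -> float:
    """Convert a monitoring interval into 'human-scale' years.

    Each machine tick inside the interval is stretched to one second,
    and the resulting number of seconds is expressed in years.
    """
    ticks = interval_seconds / machine_tick
    return ticks / SECONDS_PER_YEAR


# A typical 5-minute sampling interval, as on the slide.
sample_years = human_scale_years(5 * 60)
print(f"A 5-minute sample spans about {sample_years:.0f} human-scale years")
```

Seen this way, a single 5-minute average hides thousands of "years" of machine activity, which is the keyhole the slide title refers to.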
Quiz
- Sir, we cannot calculate this from our observations because we have to divide X by 0.
(try to find the physicist's answer)
- Change 0 -> 0.0001 and show me a picture.
… fail fast and try to understand the big picture …
Quiz
▪ What if: your sequential program spends 75% of its time on the server and 25% on storage, and we replace the storage with one that is 5 times faster? Please calculate the percentage improvement in speed.
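One way to check an answer to this quiz is Amdahl's law, which the deck links to later. The sketch below assumes the program is strictly sequential, so server and storage time simply add up; the function name `amdahl_speedup` is illustrative, not part of the original slides.

```python
def amdahl_speedup(improved_fraction: float, factor: float) -> float:
    """Overall speedup when `improved_fraction` of the run time
    becomes `factor` times faster and the rest is unchanged."""
    new_time = (1.0 - improved_fraction) + improved_fraction / factor
    return 1.0 / new_time


# Quiz numbers: 25% of the time is storage, storage becomes 5x faster.
speedup = amdahl_speedup(improved_fraction=0.25, factor=5.0)
speed_gain_pct = (speedup - 1.0) * 100.0        # improvement in speed
time_saved_pct = (1.0 - 1.0 / speedup) * 100.0  # reduction in elapsed time

print(f"speedup: {speedup:.2f}x")
print(f"speed improvement: {speed_gain_pct:.0f}%")
print(f"elapsed time saved: {time_saved_pct:.0f}%")
```

The two percentages differ because "faster by" and "shorter by" are measured against different baselines, which is an easy place to slip when reading vendor claims.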
Queues and buffers basics
• Response time depends on service time and queueing time.
• Queue length depends on arrival rate and service time
• Utilization shows how busy the server is: work_time/measured_time or arrival rate * service time
• When utilization reaches saturation (100%), response time goes to infinity in some cases…
Source: http://perfdynamics.blogspot.com/2010/03/bandwidth-vs-latency-world-is-curved.html Graph provided by Neil Gunther. Thanks!
To see more:
http://en.wikipedia.org/wiki/Amdahl's_law
http://en.wikipedia.org/wiki/Little's_law
http://www.perfdynamics.com/Manifesto/gcaprules.html
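The relationship between utilization and response time can be sketched numerically. The example assumes the simplest textbook case, a single-server M/M/1 queue, where response time = service time / (1 - utilization); real storage systems have many queues, so treat this as an illustration of the curve's shape, not a model of any particular device. The function names and the 5 ms service time are made up for the example.

```python
def utilization(arrival_rate: float, service_time: float) -> float:
    """Utilization = arrival rate * service time (as on the slide)."""
    return arrival_rate * service_time


def response_time(arrival_rate: float, service_time: float) -> float:
    """M/M/1 response time (assumption: one server, random arrivals).

    Returns infinity once the server is saturated.
    """
    rho = utilization(arrival_rate, service_time)
    if rho >= 1.0:
        return float("inf")
    return service_time / (1.0 - rho)


SERVICE_TIME = 0.005  # hypothetical 5 ms per request

for rate in (100, 180, 198, 250):  # requests per second
    rho = utilization(rate, SERVICE_TIME)
    r = response_time(rate, SERVICE_TIME)
    print(f"rate={rate:>3}/s  utilization={rho:.2f}  response={r * 1000:.1f} ms")
```

Doubling the load from 50% to nearly 100% utilization does not double the response time; it multiplies it many times over, which is why averages on a dashboard can look calm shortly before a system stalls.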
TOTAL_PORT_IO_RATE,READ_TRANSFER_SIZE,TOTAL_PORT_TO_LOCAL_NODE_IO_RATE,PORT_TO_DISK_RECEIVE_DATA_RATE
TOTAL_PORT_DATA_RATE,WRITE_TRANSFER_SIZE,PORT_TO_REMOTE_NODE_SEND_IO_RATE,TOTAL_PORT_TO_DISK_DATA_RATE
TOTAL_PORT_TRANSFER_SIZE,TOTAL_TRANSFER_SIZE,PORT_TO_REMOTE_NODE_RECEIVE_IO_RATE,PORT_TO_LOCAL_NODE_SEND_DATA_RATE
PORT_SPEED,RECORD_MODE_READ_IO_RATE,TOTAL_PORT_TO_REMOTE_NODE_IO_RATE,PORT_TO_LOCAL_NODE_RECEIVE_DATA_RATE
READ_IO_RATE_OVERALL,RECORD_MODE_READ_CACHE_HIT_PERC,PORT_TO_HOST_SEND_DATA_RATE,TOTAL_PORT_TO_LOCAL_NODE_DATA_RATE
WRITE_IO_RATE_OVERALL,DISK_TO_CACHE_TRANSFER_RATE,PORT_TO_HOST_RECEIVE_DATA_RATE,PORT_TO_REMOTE_NODE_SEND_DATA_RATE
TOTAL_IO_RATE_OVERALL,CACHE_TO_DISK_TRANSFER_RATE,TOTAL_PORT_TO_HOST_DATA_RATE,PORT_TO_REMOTE_NODE_RECEIVE_DATA_RATE
READ_CACHE_HIT_PERC_OVERALL,WRITE_CACHE_DELAY_PERCENTAGE,PORT_TO_DISK_SEND_DATA_RATE,TOTAL_PORT_TO_REMOTE_NODE_DATA_RATE
WRITE_CACHE_HIT_PERC_OVERALL,WRITE_CACHE_DELAY_IO_RATE,PORT_TO_DISK_RECEIVE_DATA_RATE,PORT_TO_LOCAL_NODE_SEND_RESPONSE_TIME
TOTAL_CACHE_HIT_PERC_OVERALL,BACKEND_READ_IO_RATE,TOTAL_PORT_TO_DISK_DATA_RATE,OVERALL_PORT_TO_LOCAL_NODE_RESPONSE_TIME
READ_DATA_RATE,BACKEND_WRITE_IO_RATE,PORT_TO_LOCAL_NODE_SEND_DATA_RATE,PORT_TO_LOCAL_NODE_SEND_QUEUE_TIME
WRITE_DATA_RATE,TOTAL_BACKEND_IO_RATE,PORT_TO_LOCAL_NODE_RECEIVE_DATA_RATE,PORT_TO_LOCAL_NODE_RECEIVE_QUEUE_TIME
TOTAL_DATA_RATE,BACKEND_READ_DATA_RATE,TOTAL_PORT_TO_LOCAL_NODE_DATA_RATE,OVERALL_PORT_TO_LOCAL_NODE_QUEUE_TIME
READ_TRANSFER_SIZE,BACKEND_WRITE_DATA_RATE,PORT_TO_REMOTE_NODE_SEND_DATA_RATE,PORT_TO_REMOTE_NODE_SEND_RESPONSE_TIME
WRITE_TRANSFER_SIZE,TOTAL_BACKEND_DATA_RATE,PORT_TO_REMOTE_NODE_RECEIVE_DATA_RATE,OVERALL_PORT_TO_REMOTE_NODE_RESPONSE_TIME
TOTAL_TRANSFER_SIZE,BACKEND_READ_RESPONSE_TIME,TOTAL_PORT_TO_REMOTE_NODE_DATA_RATE,PORT_TO_REMOTE_NODE_SEND_QUEUE_TIME
READ_IO_RATE_OVERALL,BACKEND_WRITE_RESPONSE_TIME,OVERALL_PORT_BANDWIDTH_PERCENTAGE,PORT_TO_REMOTE_NODE_RECEIVE_QUEUE_TIME
WRITE_IO_RATE_OVERALL,OVERALL_BACKEND_RESPONSE_TIME,LOSS_OF_SYNC_RATE,OVERALL_PORT_TO_REMOTE_NODE_QUEUE_TIME
TOTAL_IO_RATE_OVERALL,BACKEND_READ_TRANSFER_SIZE,INVALID_TRANSMISSION_WORD_RATE,PEAK_READ_RESPONSE_TIME
READ_CACHE_HIT_PERC_OVERALL,BACKEND_WRITE_TRANSFER_SIZE,PORT_SEND_BANDWIDTH_PERCENTAGE,PEAK_WRITE_RESPONSE_TIME
WRITE_CACHE_HIT_PERC_OVERALL,OVERALL_BACKEND_TRANSFER_SIZE,PORT_RECEIVE_BANDWIDTH_PERCENTAGE,LOSS_OF_SYNC_RATE
TOTAL_CACHE_HIT_PERC_OVERALL,PORT_SEND_IO_RATE,BUFFER_TO_BUFFER_PERCENTAGE,INVALID_TRANSMISSION_WORD_RATE
READ_DATA_RATE,PORT_RECEIVE_IO_RATE,PORT_SPEED,OVERALL_HOST_ATTRIBUTED_RESPONSE_TIME_PERCENTAGE
WRITE_DATA_RATE,TOTAL_PORT_IO_RATE,READ_IO_RATE_OVERALL,PEAK_BACKEND_READ_RESPONSE_TIME
TOTAL_DATA_RATE,PORT_SEND_DATA_RATE,WRITE_IO_RATE_OVERALL,PEAK_BACKEND_WRITE_RESPONSE_TIME
READ_TRANSFER_SIZE,PORT_WRITE_DATA_RATE,TOTAL_IO_RATE_OVERALL,PEAK_BACKEND_READ_QUEUE_TIME
WRITE_TRANSFER_SIZE,TOTAL_PORT_DATA_RATE,READ_CACHE_HIT_PERC_OVERALL,PEAK_BACKEND_WRITE_QUEUE_TIME
TOTAL_TRANSFER_SIZE,PORT_SEND_RESPONSE_TIME,WRITE_CACHE_HIT_PERC_OVERALL,READ_IO_RATE_OVERALL
REAL_SPACE,PORT_RECEIVE_RESPONSE_TIME,TOTAL_CACHE_HIT_PERC_OVERALL,WRITE_IO_RATE_OVERALL
ND_WRITE_TRANSFER_SIZE,REAL_SPACE
PORT_SPEED,WRITE_RESPONSE_TIME,OVERALL_BACKEND_TRANSFER_SIZE,IO_DENSITY
READ_IO_RATE_NORMAL,TOTAL_RESPONSE_TIME,PORT_SEND_IO_RATE,PORT_SEND_PACKET_RATE
READ_IO_RATE_SEQUENTIAL,READ_TRANSFER_SIZE,PORT_RECEIVE_IO_RATE,PORT_WRITE_PACKET_RATE
READ_IO_RATE_OVERALL,WRITE_TRANSFER_SIZE,TOTAL_PORT_IO_RATE,TOTAL_PORT_PACKET_RATE
WRITE_IO_RATE_NORMAL,TOTAL_TRANSFER_SIZE,PORT_SEND_DATA_RATE,PORT_SEND_DATA_RATE
WRITE_IO_RATE_SEQUENTIAL,RECORD_MODE_READ_IO_RATE,PORT_WRITE_DATA_RATE,PORT_WRITE_DATA_RATE
WRITE_IO_RATE_OVERALL,RECORD_MODE_READ_CACHE_HIT_PERC,TOTAL_PORT_DATA_RATE,TOTAL_PORT_DATA_RATE
TOTAL_IO_RATE_NORMAL,DISK_TO_CACHE_TRANSFER_RATE,READAHEAD_PERCENTAGE_OF_CACHE_HITS,PORT_PEAK_SEND_DATA_RATE
TOTAL_IO_RATE_SEQUENTIAL,CACHE_TO_DISK_TRANSFER_RATE,DIRTY_WRITE_PERCENTAGE_OF_CACHE_HITS,PORT_PEAK_RECEIVE_DATA_RATE
TOTAL_IO_RATE_OVERALL,WRITE_CACHE_DELAY_PERCENTAGE,WRITE_CACHE_FLUSH_THROUGH_PERCENTAGE,PORT_SEND_PACKET_SIZE
READ_CACHE_HIT_PERC_NORMAL,WRITE_CACHE_DELAY_IO_RATE,WRITE_CACHE_FLUSH_THROUGH_IO_RATE,PORT_RECEIVE_PACKET_SIZE
READ_CACHE_HIT_PERC_SEQUENTIAL,REAL_SPACE,PORT_TO_HOST_RECEIVE_IO_RATE,OVERALL_PORT_PACKET_SIZE
READ_CACHE_HIT_PERC_OVERALL,IO_DENSITY,TOTAL_PORT_TO_HOST_IO_RATE,LOSS_OF_SYNC_RATE
Queues: storage example
▪ Total IO rate * Response Time = Queue length (population)
▪ Total IO rate * Service Time = Utilization
▪ Queue length / (1 + Queue length) = Utilization
▪ Service Time - never used
▪ Each IO has its own characteristics: r/w, size, cache, seq/rand; ~30 variables (some of them represent subqueues)
▪ Many of the "monitored" variables are calculated!
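The three formulas on this slide chain together, so a value that is not reported directly can be derived from two that are. The sketch below assumes a single M/M/1 queue (that is where queue length / (1 + queue length) = utilization holds) and uses invented sample numbers; the function names are hypothetical and do not correspond to any vendor counter.

```python
def queue_length(io_rate: float, response_time: float) -> float:
    """Little's law: population = throughput * response time."""
    return io_rate * response_time


def utilization_from_queue(population: float) -> float:
    """M/M/1 assumption: utilization = N / (1 + N)."""
    return population / (1.0 + population)


def implied_service_time(io_rate: float, response_time: float) -> float:
    """Utilization law rearranged: service time = utilization / throughput."""
    rho = utilization_from_queue(queue_length(io_rate, response_time))
    return rho / io_rate


# Hypothetical sample: 2000 IO/s at 2 ms average response time.
io_rate = 2000.0
resp = 0.002

n = queue_length(io_rate, resp)
rho = utilization_from_queue(n)
svc = implied_service_time(io_rate, resp)

print(f"queue length: {n:.1f}")
print(f"utilization:  {rho:.0%}")
print(f"service time: {svc * 1000:.2f} ms")
```

Because such derived figures inherit every assumption of the model, it helps to know which columns in a monitoring export are measured and which are calculated.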
Queues and buffers basics
Source: FerdinandLutz channel, "Stay in queue", youtube.com
"Home made supercomputer"
IT infrastructure in the enterprise is like a "home made supercomputer": very complicated and interconnected. Designed by the business department, acquired by the procurement department, implemented and managed by the IT department, used by unpredictable users…
Quality always depends on design and implementation
Automation requires standardization!
Complexity of hardware:
Source: Anvaka github user site. 100k packages and 200k connections
>400 mln, just in IBM database
Source: IBM interoperability site
Complexity of software … npm
Source: Anvaka github user site. 100k packages and 200k connections
Known unknowns and unknown unknowns
"…Reports that say that something hasn't happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns -- the ones we don't know we don't know…"
Donald Rumsfeld, February 12th, 2002 DOD News Briefing
Source: http://www.defense.gov/transcripts/transcript.aspx?transcriptid=2636
Small BigData
VOLUME, VARIETY, VELOCITY
Source: Andrew Clay Shafer - Devops, microservices and platforms, oh my
A small bank has about 48 devices (storage, SAN) with 2011 monitored components. Six days of sampling at 5-minute intervals collected 3.37 million records:
Time, Device_id, Group_ID, Component_id (FACTORs), 34 variables (on average)
about 1.6 GB in an RDBMS
That's just the enterprise storage and FC network!
You have to add layers: LAN, WAN, servers, VM, DB, App
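The record count quoted on this slide can be sanity-checked with simple arithmetic. The sketch assumes every component reports exactly once per 5-minute interval, which gives an upper bound; the slide's 3.37 million is the observed figure. Reading "1.6 GB" as 1.6e9 bytes is also an assumption, and the function name is illustrative.

```python
def expected_records(components: int, days: int, interval_minutes: int) -> int:
    """Upper bound on records if every component reports at every interval."""
    samples_per_component = days * 24 * 60 // interval_minutes
    return components * samples_per_component


COMPONENTS = 2011           # monitored components (from the slide)
OBSERVED_RECORDS = 3.37e6   # records actually collected (from the slide)
VARIABLES_PER_RECORD = 34   # average number of variables (from the slide)
STORED_BYTES = 1.6e9        # assumption: "1.6 GB" read as 1.6e9 bytes

upper_bound = expected_records(COMPONENTS, days=6, interval_minutes=5)
coverage = OBSERVED_RECORDS / upper_bound
total_values = OBSERVED_RECORDS * VARIABLES_PER_RECORD
bytes_per_record = STORED_BYTES / OBSERVED_RECORDS

print(f"upper bound:      {upper_bound:,} records")
print(f"observed share:   {coverage:.1%}")
print(f"measured values:  {total_values / 1e6:.0f} million")
print(f"bytes per record: {bytes_per_record:.0f}")
```

The upper bound comes out at roughly 3.47 million records, so the observed count is consistent with a small share of missing samples. Each extra layer (LAN, WAN, servers, VM, DB, App) multiplies the component count in the same formula.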
Big picture understanding
Source: DDoS attack by China & Ukraine
Big picture understanding
Source: An Actual 160 Gbps DDoS Attack Being Mitigated by Prolexic's Global
Big picture understanding
Source: Sense of Patterns - Animations from Mahir M. Yavuz
Example: datacenter
Source: Christina Delimitrou Presented April 3rd, 2014 at @TwitterOSS #conf
Reserved vs. used 2009
Source: Christina Delimitrou Presented April 3rd, 2014 at @TwitterOSS #conf
Example: leaders datacenter
Source: Christina Delimitrou Presented April 3rd, 2014 at @TwitterOSS #conf
2009
CFO dream / reality issue
Utilization heatmap
GREEN - not used - losing money
RED - business and customer waiting
YELLOW - perfect balance
About 700 volumes on enterprise Tier 1 and 2, 500 TB
Y = 24 h
X = 7 days of observations
Utilization
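The layout of such a heatmap (24 hourly rows, 7 daily columns) can be built from raw 5-minute samples with plain bucketing. This is a sketch on synthetic data, not the presenter's tooling: the function names, the start date, and the 30% / 80% colour thresholds are all assumptions chosen only to make the example concrete.

```python
from datetime import datetime, timedelta


def heatmap_grid(samples, start):
    """Average value per cell; rows = hour of day (24), columns = day (7)."""
    sums = [[0.0] * 7 for _ in range(24)]
    counts = [[0] * 7 for _ in range(24)]
    for ts, value in samples:
        day = (ts.date() - start.date()).days
        if 0 <= day < 7:
            sums[ts.hour][day] += value
            counts[ts.hour][day] += 1
    return [
        [sums[h][d] / counts[h][d] if counts[h][d] else None for d in range(7)]
        for h in range(24)
    ]


def colour(value, low=0.3, high=0.8):
    """Map utilization to the slide's legend (thresholds are hypothetical)."""
    if value is None:
        return "NO DATA"
    if value < low:
        return "GREEN"   # not used, losing money
    if value > high:
        return "RED"     # business and customer waiting
    return "YELLOW"      # balance


# Synthetic week: busy (90%) during office hours, idle (10%) otherwise.
start = datetime(2015, 1, 5)
samples = []
for step in range(7 * 24 * 12):  # one sample every 5 minutes
    ts = start + timedelta(minutes=5 * step)
    samples.append((ts, 0.9 if 9 <= ts.hour < 17 else 0.1))

grid = heatmap_grid(samples, start)
for hour in (3, 10):
    print(hour, [colour(v) for v in grid[hour]])
```

Rendering the grid with a plotting library is then a single call; the bucketing step is where choices such as averaging versus taking the peak change what the picture says.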
Read response time heatmap
Y = 24 h
X = 7 days of observations
GREEN - ok?
RED - ?
About 700 volumes on enterprise Tier 1 and 2, 500 TB
Peak read response time heatmap
>100 ms
Y = 24 h
X = 7 days of observations
GREEN - ok
RED - slow or very slow
GREY - potential "timeouts"
About 700 volumes on enterprise Tier 1 and 2, 500 TB
Peak read response time heatmap
Y = 24 h
X = 7 days of observations
GREEN - ok?
RED - slow or very slow
GREY - potential "timeouts"
About 700 volumes on enterprise Tier 1 and 2, 500 TB
Peak write response time heatmap
>100 ms
Y = 24 h
X = 7 days of observations
GREEN - ok
RED - slow or very slow
GREY - potential "timeouts"
About 700 volumes on enterprise Tier 1 and 2, 500 TB
Lessons learned?
• data comes from the devil
• date/time/log format - wow!
• models come from God (in the past)
• visualizations come from God (it is the future)
• it is all about scale (computer vs human)
• understand the big picture, dive deep if you have to…
• try to keep standards
• break the rules (physicist vs mathematician)
• good enough is better than perfect
• understanding is more important than precision
• patterns/motion/colors - common human understanding
• There is No Single Version of the Truth …
Lessons learned?
• a back-up plan should be an important part of the plan
• underutilization can be a planned goal
• a rolling disaster mostly starts in a non-critical environment and propagates to all connected environments (like cholesterol)
• root cause analysis is challenging
• logs show "directions", or one version of the truth
• no backup, no fun!
• There is No Single Version of the Truth …
If you need more…
Robert Bigos [email protected] +48 665-168-240
http://www.slideshare.net/RobertBigos
https://pl.linkedin.com/in/robertbigos
@bigosr