High Availability and Scalability: Too Expensive! Architectures for Future Enterprise Systems

Eberhard Wolff - @ewolff

Eberhard Wolff
Freelance Consultant / Trainer
Head of Technology Advisory Board, adesso AG


The Dream

Photo: http://www.vaxman.de/





Where Are We?


Non-functional Requirements


Availability

Performance


Performance

Availability


Availability: Traditional Approach


•  Buy highly reliable hardware

•  Build a small cluster
   •  2 machines

•  Maybe add a stand-by data center


•  Eventually system will fail

•  …and you are in real trouble


True Story

•  “Machine rebooted overnight.”
•  “Several times.”
•  “No idea how often.”
•  “No idea why…”


Let’s look at an example



•  Server fails
•  Application fails
•  No service to the customer

•  Can we do better?



What You Have Just Seen


•  Failing systems do not impact the user
•  Failing systems are just restarted
•  Restarts happen automatically

•  Systems run in different data centers
   •  i.e. eu-west-1a / b / c


[Diagram: Elastic Load Balancer in front of System EU West 1a, System EU West 1b, and System EU West 1c]


What It Takes…

•  Virtualization
   •  + API to start new servers
•  Watchdog to detect failed servers
•  Redundant data centers if needed
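The ingredients above can be sketched in a few lines. This is a minimal illustration, not the setup used in the demo: `FakeCloud` is a hypothetical in-memory stand-in for a virtualization API, and `watchdog_pass` is an assumed name for one run of the watchdog.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class FakeCloud:
    """Hypothetical stand-in for a virtualization API that can start servers."""
    servers: dict = field(default_factory=dict)   # server id -> healthy?
    _ids: count = field(default_factory=count)

    def start_server(self) -> str:
        server_id = f"server-{next(self._ids)}"
        self.servers[server_id] = True
        return server_id

    def terminate(self, server_id: str) -> None:
        self.servers.pop(server_id, None)

    def is_healthy(self, server_id: str) -> bool:
        return self.servers.get(server_id, False)


def watchdog_pass(cloud: FakeCloud, desired: int) -> list:
    """One watchdog run: discard failed servers, then top up to the desired count."""
    started = []
    for server_id in list(cloud.servers):
        if not cloud.is_healthy(server_id):
            cloud.terminate(server_id)           # failed servers are not repaired
    while len(cloud.servers) < desired:
        started.append(cloud.start_server())     # a fresh server takes over
    return started


cloud = FakeCloud()
watchdog_pass(cloud, desired=3)                  # initial fleet: server-0 to server-2
cloud.servers["server-1"] = False                # simulate a crash
replacements = watchdog_pass(cloud, desired=3)
print(replacements)                              # ['server-3']
```

In a real setup the watchdog would run periodically and the health check would be a network probe. Both are left out here.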


Can be implemented in your datacenter!

I have none.

So I used the Amazon Cloud


Alternatives


Hardware

•  As cheap as it gets

•  Not highly available

•  Availability in Software


Traditional Servers



Highly customized

Hard to reproduce


•  Depends on details
•  True story:
   •  Order of patch installations matters


Stateful


Redundancy in Hardware


Traditional Servers


Phoenix Servers


Easy to create a new server


Reliably reproducible


Stateless


Stateless

•  No data is lost
•  New server can take load immediately


Redundancy in Software


Implementations

•  Might use a VM image
•  …or a PaaS
•  …or provisioning tools


Provisioning Tools


•  Easy to create test environments
•  …with other software versions


Chaos Monkey

•  Tool by Netflix
   •  Video streaming
   •  #1 in Internet usage in the US
•  Kills random machines
   •  To ensure the system survives hardware failures
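The idea can be illustrated with a toy model. This is not Netflix's tool: `Fleet` and `chaos_monkey` are hypothetical names, and the fleet is an in-memory set rather than real machines.

```python
import random


class Fleet:
    """Hypothetical fleet of stateless servers behind a load balancer."""

    def __init__(self, names):
        self.alive = set(names)

    def handle_request(self) -> str:
        if not self.alive:
            raise RuntimeError("no server left: outage")
        return random.choice(sorted(self.alive))   # any live server can answer


def chaos_monkey(fleet: Fleet, rng: random.Random) -> str:
    """Kill one randomly chosen live server and return its name."""
    victim = rng.choice(sorted(fleet.alive))
    fleet.alive.discard(victim)
    return victim


fleet = Fleet(["eu-west-1a", "eu-west-1b", "eu-west-1c"])
victim = chaos_monkey(fleet, random.Random(42))

# The system passes the test if requests are still served after the kill.
served_by = {fleet.handle_request() for _ in range(100)}
print(f"killed {victim}, still served by {sorted(served_by)}")
```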


Would you rather rely on…

…highly available hardware

…or a Chaos Monkey tested system?


Resilience


Performance

Availability


Availability

Performance


Performance: Traditional Approach


•  Estimate
   •  #Users
   •  Use Cases
   •  Data volume
   •  Etc.

•  Add a little bit

•  Order servers


Performance: Problems


Problem: Estimate & Scaling

•  Performance hard to estimate
•  Coarse grained scaling
•  Backfires


True Story

•  Initial estimate wrong
•  Just need a little more
•  Cluster: two servers
•  Add one
•  About 50% higher costs
•  Ordering / installing a server takes time
•  Bad performance until the server is delivered
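The 50% figure follows from simple arithmetic if every server costs the same, which is an assumption here. The hypothetical helper `cost_increase` shows how the step size shrinks as the cluster grows.

```python
def cost_increase(servers_before: int, servers_added: int = 1) -> float:
    """Relative cost increase, assuming every server costs the same."""
    return servers_added / servers_before


for n in (2, 4, 10, 20):
    print(f"{n} -> {n + 1} servers: +{cost_increase(n):.0%} cost")
# 2 -> 3 servers: +50% cost
# 4 -> 5 servers: +25% cost
# 10 -> 11 servers: +10% cost
# 20 -> 21 servers: +5% cost
```

Many small servers allow finer steps than a two-node cluster.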


Problem: Load Peak

•  Business has load peaks
   •  e.g. events that people register for
•  Need to have enough hardware for load peaks
•  Costly


Problem: Testing

•  Testing
   •  Need production-like infrastructure
•  Prohibitive costs
   •  Only needed during tests



[Diagram: Elastic Load Balancer in front of one System EU West 1b instance and three System EU West 1c instances]


What You Have Just Seen

•  System tunes itself depending on load
•  Same approach as for availability
   •  + Watchdog for load
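A watchdog for load boils down to a sizing rule. The sketch below is one possible rule, not the one used in the demo. The function name, the capacity figure and the limits are assumptions for illustration.

```python
import math


def desired_servers(total_load: float, capacity_per_server: float,
                    min_servers: int = 2, max_servers: int = 20) -> int:
    """Number of servers needed for the current load, kept within fixed limits."""
    needed = math.ceil(total_load / capacity_per_server)
    return max(min_servers, min(max_servers, needed))


# Assumed capacity: 100 requests/s per server.
print(desired_servers(50, 100))     # 2  (never below the minimum, for availability)
print(desired_servers(250, 100))    # 3
print(desired_servers(5000, 100))   # 20 (capped to limit costs)
```

The result feeds the same mechanism as the availability watchdog: start servers when the number is too low, stop them when it is too high.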


Easy to create a new server ✔

Reliably reproducible ✔

Redundancy in Software ✔

Stateless ?


Stateless

•  Stateless web servers: best practice
•  Some Java frameworks don’t follow this approach
•  Can store HTTP session externally
   •  e.g. RDBMS, NoSQL, Cache
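Storing the session externally might look like the sketch below. `ExternalSessionStore` is a hypothetical in-memory stand-in for an RDBMS, NoSQL database or cache, and `WebServer` is an assumed name, not a framework class.

```python
import json


class ExternalSessionStore:
    """Hypothetical stand-in for an RDBMS, NoSQL database or cache."""

    def __init__(self):
        self._rows = {}

    def load(self, session_id: str) -> dict:
        return json.loads(self._rows.get(session_id, "{}"))

    def save(self, session_id: str, session: dict) -> None:
        self._rows[session_id] = json.dumps(session)


class WebServer:
    """Stateless server: keeps no session data between requests."""

    def __init__(self, name: str, store: ExternalSessionStore):
        self.name = name
        self.store = store

    def add_to_cart(self, session_id: str, item: str) -> list:
        session = self.store.load(session_id)         # fetch state per request
        session.setdefault("cart", []).append(item)
        self.store.save(session_id, session)          # write it back immediately
        return session["cart"]


store = ExternalSessionStore()
server_a = WebServer("a", store)
server_a.add_to_cart("session-1", "book")
del server_a                                          # server A fails
server_b = WebServer("b", store)                      # a fresh server takes over
cart = server_b.add_to_cart("session-1", "pen")
print(cart)                                           # ['book', 'pen']
```

Because the cart lives in the store, the replacement server can continue the session without data loss.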


What about Databases?


Databases

•  Often assumed to be just “fast and scalable”
•  Large scale doable, e.g. Data Warehouse
•  Often use traditional approach
   •  Cluster with two nodes
   •  Highly available hardware


Database: Problems

•  Availability
   •  Highly available hardware
•  Performance
   •  Limited scaling
•  Costly


Databases

•  New approaches
•  Used by NoSQL databases
•  But also e.g. MySQL
•  …or in system architecture


Databases

•  Replication
   •  Read performance
   •  Availability
•  Sharding
   •  Spread data across servers
   •  Write performance
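Both ideas fit into a small generic sketch. It is not tied to any particular database: hash-based placement, synchronous replication and round-robin reads are simplifying assumptions, and all names are hypothetical.

```python
import hashlib
from itertools import cycle


def shard_for(key: str, shard_count: int) -> int:
    """Pick a shard by hashing the key (stable across processes)."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class Shard:
    """One shard with several replicas holding the same data."""

    def __init__(self, replica_count: int = 3):
        self.replicas = [{} for _ in range(replica_count)]
        self._next_reader = cycle(range(replica_count))

    def write(self, key, value) -> None:
        for replica in self.replicas:        # every replica gets a copy: availability
            replica[key] = value

    def read(self, key):
        # Reads rotate across replicas: read performance
        return self.replicas[next(self._next_reader)].get(key)


cluster = [Shard(), Shard()]                 # writes are spread across shards


def put(key, value) -> None:
    cluster[shard_for(key, len(cluster))].write(key, value)


def get(key):
    return cluster[shard_for(key, len(cluster))].read(key)


put("customer:42", {"name": "Alice"})
print(get("customer:42"))                    # {'name': 'Alice'}
```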


Scaling MongoDB

[Diagram: Shard 1 and Shard 2, each with Replica 1, Replica 2, Replica 3]


Availability

[Diagram: Shard 1 and Shard 2, each with Replica 1, Replica 2, Replica 3]


Scaling MongoDB

[Diagram: Shard 1, Shard 2 and Shard 3, each with Replica 1, Replica 2, Replica 3]


Scaling MongoDB

[Diagram: Shard 1 and Shard 2, each with Replica 1, Replica 2, Replica 3, plus a “?”]


Replicas & Shards

•  Easy to understand
•  But: Coarse grained scaling
•  Adding another shard means
   •  Moving lots of data
   •  Adding quite some servers
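A rough estimate of what one more shard costs, under stated assumptions: data is evenly distributed, every replica runs on its own server, and rebalancing moves only as much data as needed for an even spread. The helper name is hypothetical.

```python
from fractions import Fraction


def cost_of_adding_shard(shards: int, replicas_per_shard: int):
    """Extra servers and minimum share of data to move when adding one shard."""
    extra_servers = replicas_per_shard            # the new shard needs all its replicas
    data_to_move = Fraction(1, shards + 1)        # its fair share after rebalancing
    return extra_servers, data_to_move


extra, moved = cost_of_adding_shard(shards=2, replicas_per_shard=3)
print(extra, moved)                               # 3 1/3
```

Going from two to three shards with three replicas each thus means three new servers and moving at least a third of all data.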


Amazon Dynamo Model

[Diagram: four servers, each holding three shards]

Server A: Shard1, Shard3, Shard4
Server B: Shard2, Shard1, Shard4
Server C: Shard3, Shard2, Shard1
Server D: Shard4, Shard2, Shard3



[Diagram: a New Server joins the cluster (surviving labels: Shard1, Shard3, Shard4)]


Amazon Dynamo Model

•  Published in the Dynamo paper
•  Implementations: Riak, Cassandra etc.
•  Fine grained scaling
•  Can immediately write to new node
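Fine grained scaling in this model rests on consistent hashing. The sketch below is a simplified illustration, not the implementation of Riak or Cassandra: it omits replication, and `Ring`, the number of virtual nodes and the key names are assumptions.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class Ring:
    """Consistent hash ring; each server owns many small slices of the ring."""

    def __init__(self, vnodes: int = 64):
        self.vnodes = vnodes
        self._points = []                       # sorted list of (hash, server)

    def add_server(self, server: str) -> None:
        for i in range(self.vnodes):
            bisect.insort(self._points, (_hash(f"{server}#{i}"), server))

    def owner(self, key: str) -> str:
        # First point clockwise from the key's hash, wrapping around at the end
        idx = bisect.bisect(self._points, (_hash(key), ""))
        return self._points[idx % len(self._points)][1]


ring = Ring()
for name in ("Server A", "Server B", "Server C", "Server D"):
    ring.add_server(name)

keys = [f"key-{i}" for i in range(10_000)]
before = {k: ring.owner(k) for k in keys}
ring.add_server("New Server")
after = {k: ring.owner(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved) / len(keys):.0%} of keys moved, all to the new server")
```

Only the keys taken over by the new server move, roughly a fifth of the data for a fifth server. The other servers keep what they had.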


Hardware

•  Not highly reliable

•  Scales by distributing load across servers

•  No NAS, SAN, RAID…

•  As cheap as it gets


Sum Up

•  Virtualization
•  + Phoenix server
•  = Better availability
•  = Better performance
•  = Lower costs
•  Stateless servers
•  NoSQL


Thank You!
