ali-kheyrollahi
@aliostad
/// ASOS in numbers
2016 Turnover → £15 bln
Active Customers → 12 M
New Products/wk → 4 k
Unique Visits/mo → 123 M
Page Views/day → 95 M
Platform Teams → 40
/// why microservices
> Scaling people, not the solution
> Decentralising decision centres => Agility
> Frequent deployment => Agility
> Reduced complexity of each ms (Divide/Conquer) => Agility
> Overall solution complex but ...
/// anecdote
Often you can measure your success in implementing Microservice Architecture not by the number of services you build, but by the number you decommission.
/// microservices vs soa
                                | SOA                      | Microservices
Main Goal                       | Architectural Decoupling | Agility
Audience                        | Mainly Architecture      | Business (Everyone)
Set out to solve                | Architectural Coupling   | Scaling People, Frequent Deployment
Organisational Structure Impact | Minimal                  | Huge
Service Cardinality             | Usually up to a dozen    | >40 (Commonly >100)
When to do                      | Always                   | teams > ~5*
* Debatable. There are articles and discussions on this very topic
/// microservice challenges
> Very difficult to build a complete mental picture of solution
> When things go wrong, need to know where before why
> Potentially increased latency
> Performance outliers intractable to solve
> A complete mind-shift requiring a new operating model
/// performance outliers
Microservice A: 99th Percentile = 500ms
Microservice B: 99th Percentile = 500ms
        | A   | B   | Total
<1s     | 99% | 99% | 98.01%
>500ms  | 1%  | 99% | 0.99%
>500ms  | 99% | 1%  | 0.99%
>1s     | 1%  | 1%  | 0.01%
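The arithmetic behind these figures can be checked with a short sketch. It assumes A and B are called one after the other and that their latencies are independent, which the slide implies but does not state. The function name is illustrative only.

```python
from itertools import product


def outcome_probabilities(p_slow_a: float, p_slow_b: float) -> dict:
    """Probability of each (A slow?, B slow?) combination, assuming independence."""
    result = {}
    for a_slow, b_slow in product([False, True], repeat=2):
        p_a = p_slow_a if a_slow else 1 - p_slow_a
        p_b = p_slow_b if b_slow else 1 - p_slow_b
        result[(a_slow, b_slow)] = p_a * p_b
    return result


# Each service exceeds its 99th percentile on 1% of calls.
probs = outcome_probabilities(0.01, 0.01)

# Chance that at least one of the two calls is slower than its 99th percentile.
p_any_slow = 1 - probs[(False, False)]
print(f"both fast: {probs[(False, False)]:.2%}, at least one slow: {p_any_slow:.2%}")
```

Under the same independence assumption, a chain of n such services has at least one slow call with probability 1 - 0.99^n, so the share of slow requests grows with every hop.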
/// ActivityId
> Every customer request matters
> Every request is unique
> Every request creates a chain (or tree) of calls/events
> Activities are correlated
> You need an ActivityId (or CorrelationId) to link calls/events
/// ActivityId - HTTP
Request
GET /api/v2/foo HTTP/1.1
host: foo.com
activity-id: 96c5a1f106ce468ebcca8303ed7464bd
Response
200 OK
activity-id: 96c5a1f106ce468ebcca8303ed7464bd
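A minimal sketch of how a service might propagate the header: reuse the incoming activity-id if the caller sent one, otherwise start a new chain, and attach the same value to every downstream call and to the response. The helper names are hypothetical; only the `activity-id` header name comes from the slide.

```python
import uuid
from typing import Optional

ACTIVITY_HEADER = "activity-id"


def resolve_activity_id(incoming_headers: dict) -> str:
    """Reuse the caller's activity id, or start a new chain if none was sent."""
    for name, value in incoming_headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == ACTIVITY_HEADER and value:
            return value
    return uuid.uuid4().hex  # 32 hex characters, the same shape as the slide's example


def outgoing_headers(activity_id: str, extra: Optional[dict] = None) -> dict:
    """Headers for a downstream call (or the response) carrying the same id."""
    headers = dict(extra or {})
    headers[ACTIVITY_HEADER] = activity_id
    return headers


incoming = {"Host": "foo.com", "Activity-Id": "96c5a1f106ce468ebcca8303ed7464bd"}
activity_id = resolve_activity_id(incoming)
downstream = outgoing_headers(activity_id, {"Accept": "application/json"})
```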
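/// Failure
[Diagram: Microservice A (1% chance of failure) calling Microservice B (1% chance of failure). A failed call (X) is followed by a wait (back-off); a second failed call (X) is followed by a longer wait (back-off longer).]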
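The retry-with-back-off pattern in the diagram can be sketched as follows. The doubling delay, the default values and the function name are assumptions for illustration; the slide only says that each wait is longer than the previous one.

```python
import time


def call_with_backoff(operation, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(); after each failure wait base_delay, then 2x, 4x, ... and retry."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last failure to the caller
            sleep(base_delay * (2 ** attempt))
```

If failures were independent with a 1% chance each, three attempts would all fail roughly once in a million calls. Real failures are often correlated, which is why the waits grow instead of retrying immediately.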
/// Preemptive Timeout
[Diagram: Microservice A calling Microservice B. Each attempt has a short timeout; an attempt that times out (X) is abandoned and retried.]
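A rough sketch of a preemptive timeout, assuming the call can be run on a worker thread: the caller stops waiting after a short timeout and issues a fresh attempt instead of waiting for a slow outlier. The function name and default values are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def call_with_short_timeout(operation, timeout_s=0.2, max_attempts=3):
    """Give up on a slow attempt after timeout_s and try again."""
    executor = ThreadPoolExecutor(max_workers=max_attempts)
    try:
        for _ in range(max_attempts):
            future = executor.submit(operation)
            try:
                return future.result(timeout=timeout_s)
            except FutureTimeout:
                continue  # stop waiting; the slow attempt keeps running in the background
        raise TimeoutError(f"no response within {timeout_s}s after {max_attempts} attempts")
    finally:
        # Do not block on abandoned attempts when leaving the function.
        executor.shutdown(wait=False)
```

Abandoned attempts still consume resources on both sides, and an exception raised by the operation itself is passed straight to the caller, so this only makes sense for calls that are safe to repeat.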
/// Blame Game
“If there is a single place where you can play blame game, instead of collective responsibility, it is in Microservices troubleshooting”
/// Did you say IO??
[Diagram: Microservice and its outbound dependencies: DB, API, Cache]
Measure... every time your code goes out of your process
/// Recording Methods
> Explicitly by calling record()
> Asking the library to record a closure
> Aspect-oriented
Java (spf4j)
private static final MeasurementRecorder recorder =
    RecorderFactory.createScalableCountingRecorder(forWhat, unitOfMeasurement, sampleTimeMillis);
…
recorder.record(measurement);
.NET (PerfIt)
var ins = new SimpleInstrumentor(new InstrumentationInfo()
{
    Counters = CounterTypes.StandardCounters,
    Description = "test",
    InstanceName = "Test instance",
    CategoryName = TestCategory
});
ins.Instrument(() => Thread.Sleep(100), "test...");
Java and .NET
@PerformanceMonitor(warnThresholdMillis=1, errorThresholdMillis=100, recorderSource = RecorderSourceInstance.Rs5m.class)
[PerfItFilter("PerfItTests", InstanceName = "Test")]
public string Get()
{
    return Guid.NewGuid().ToString();
}
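The three recording styles can be illustrated in one place with a small sketch. This is not the spf4j or PerfIt API; the `Recorder` class and its method names are hypothetical, and a decorator stands in for the aspect-oriented style.

```python
import functools
import time
from collections import defaultdict


class Recorder:
    """Collects call durations (in milliseconds) per named measurement."""

    def __init__(self):
        self.samples = defaultdict(list)

    def record(self, name, millis):
        """Style 1: the caller measures and records explicitly."""
        self.samples[name].append(millis)

    def instrument(self, name, func):
        """Style 2: the recorder times a closure on the caller's behalf."""
        start = time.perf_counter()
        try:
            return func()
        finally:
            # Recorded whether the closure returns or raises.
            self.record(name, (time.perf_counter() - start) * 1000)

    def monitored(self, name):
        """Style 3: declarative, applied as a decorator on the method."""
        def wrap(func):
            @functools.wraps(func)
            def inner(*args, **kwargs):
                return self.instrument(name, lambda: func(*args, **kwargs))
            return inner
        return wrap


recorder = Recorder()
recorder.record("db.query", 12.5)
value = recorder.instrument("cache.get", lambda: "cached")


@recorder.monitored("api.get")
def get():
    return "payload"


get()
```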
/// Publishing Methods
> Local file (various to logstash)
> TCP and HTTP (many, to zipkin, influxdb)
> UDP (statsd, collectd to graphite, logstash)
> Raising Kernel-level event (Windows ETW)
> Local communication (statsd)
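As one concrete case of the UDP option, a timing sample can be sent to a statsd daemon as a single plain-text datagram. The sketch below assumes a statsd-compatible listener on the conventional port 8125; the function names and the metric name are made up for illustration.

```python
import socket


def statsd_timing(name: str, millis: float) -> bytes:
    """Format a timing sample in the statsd line format: <name>:<value>|ms."""
    return f"{name}:{millis:g}|ms".encode("ascii")


def publish(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget: one UDP datagram, no reply expected."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


sample = statsd_timing("foo.api.get", 54.2)
```

Because nothing is acknowledged, a slow or absent collector does not slow down the service being measured; the price is that samples can be lost silently.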
/// tri-state
> Closed: traffic can flow normally
> Open: traffic does not flow
> Half-open: circuit breaker tests the waters again
[State diagram: states Closed, Open and Half-open; transitions labelled Failure, Wait timeout and Test]
/// Netflix Hystrix
RequestVolumeThreshold
ErrorThresholdPercentage
SleepWindowInMilliseconds
TimeInMilliseconds
NumBuckets
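To show how the first three settings interact with the three states, here is a deliberately simplified circuit breaker. It is not Hystrix's implementation: the counters never roll over, so the rolling statistics window that TimeInMilliseconds and NumBuckets configure is left out. The class and the parameter names are illustrative, loosely mirroring the settings above.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, request_volume_threshold=20, error_threshold_percentage=50,
                 sleep_window_s=5.0, clock=time.monotonic):
        self.request_volume_threshold = request_volume_threshold
        self.error_threshold_percentage = error_threshold_percentage
        self.sleep_window_s = sleep_window_s
        self.clock = clock
        self.state = "closed"
        self.requests = 0
        self.errors = 0
        self.opened_at = 0.0

    def call(self, operation):
        if self.state == "open":
            if self.clock() - self.opened_at < self.sleep_window_s:
                raise CircuitOpenError("circuit is open")
            self.state = "half-open"  # sleep window elapsed: let one test call through
        try:
            result = operation()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        if self.state == "half-open":
            self._reset()  # the test call succeeded: close the circuit
        else:
            self.requests += 1

    def _on_failure(self):
        if self.state == "half-open":
            self._trip()  # the test call failed: open again
            return
        self.requests += 1
        self.errors += 1
        enough_traffic = self.requests >= self.request_volume_threshold
        error_pct = 100 * self.errors / self.requests
        if enough_traffic and error_pct >= self.error_threshold_percentage:
            self._trip()

    def _trip(self):
        self.state = "open"
        self.opened_at = self.clock()

    def _reset(self):
        self.state = "closed"
        self.requests = 0
        self.errors = 0
```

The volume threshold keeps a single early failure from opening the circuit on its own: the error percentage only counts once enough requests have been seen.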
/// Fallback
> Custom: e.g. serve content from a local cache (status 206)
> Silent: return null/no-data/empty (status 200/204)
> Fail-fast: Customer experience is important (status 5xx)
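A compact sketch of the three fallback styles, using the status codes from the bullets. The function signature, the in-memory `cache` dictionary and the choice of 503 as the 5xx code are assumptions for illustration.

```python
def fetch_with_fallback(fetch, cache, key, strategy="fail-fast"):
    """Return (status, body); on failure apply the chosen fallback strategy."""
    try:
        return 200, fetch(key)
    except Exception:
        if strategy == "custom" and key in cache:
            return 206, cache[key]  # possibly stale content from a local cache
        if strategy == "silent":
            return 204, None        # no data, but not an error for the caller
        return 503, None            # fail fast and let the caller decide


def broken_fetch(key):
    raise ConnectionError("dependency unavailable")


local_cache = {"product-1": {"name": "cached product"}}
custom = fetch_with_fallback(broken_fetch, local_cache, "product-1", strategy="custom")
silent = fetch_with_fallback(broken_fetch, local_cache, "product-2", strategy="silent")
fail_fast = fetch_with_fallback(broken_fetch, local_cache, "product-2")
```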
/// Health Endpoints
Ping: returns a success code when invoked
Canary: returns a connectivity status and latency on the service and dependencies
“… none of them invoke any application code”
/// Ping
Request
GET /api/health HTTP/1.1
host: foo.com
Response
200 OK
Response
500 Server Error
/// Canary
Request
GET /api/canary HTTP/1.1
host: foo.com
Response
200 OK
{
[Nested Structure]
}
/// ChirpResult
{ "serviceName": "foo", "latency": "00:00:00.0542172", "statusCode": 200, "isCritical": true }
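One way a canary endpoint could assemble such results is sketched below: each dependency check is timed and reported with the fields shown above, and the results are nested under the service's own entry. The `dependencies` key and the rule that only critical dependencies affect the overall status are assumptions, not taken from the slides.

```python
import time
from datetime import timedelta


def chirp(service_name, check, is_critical=True):
    """Run one connectivity check and report it with the ChirpResult fields."""
    start = time.perf_counter()
    try:
        check()
        status = 200
    except Exception:
        status = 500
    latency = timedelta(seconds=time.perf_counter() - start)
    return {
        "serviceName": service_name,
        # Python renders this as H:MM:SS.ffffff, slightly different from the .NET format above.
        "latency": str(latency),
        "statusCode": status,
        "isCritical": is_critical,
    }


def canary(service_name, dependencies):
    """dependencies: iterable of (name, check, is_critical) tuples."""
    results = [chirp(name, check, critical) for name, check, critical in dependencies]
    # Assumed rule: only a failing critical dependency makes the canary fail.
    healthy = all(r["statusCode"] == 200 for r in results if r["isCritical"])
    return {
        "serviceName": service_name,
        "statusCode": 200 if healthy else 500,
        "dependencies": results,
    }


def unreachable():
    raise ConnectionError("cannot connect")


report = canary("foo", [
    ("SQL-BazActiveDatabase", lambda: None, True),
    ("Dependency-Api", unreachable, False),
])
```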
/// AOP / Declarative (c#)
[AzureStorageCanary("Foo-AzureStorage-BarDatabaseServer", "config-key-for-cn")]
[SqlCanary("SQL-BazActiveDatabase", null, typeof(SqlConnectionFactory))]
[CanaryEndpointCanary("Dependency-Api", "config-key-for-endpoint")]
public class CanaryController : CanaryBaseController
{
    … // some boilerplate code
}
/// Wrap-up
> If you have more than ~5 teams, consider Microservices
> Logging/Monitoring/Alerting: single most important asset
> Use ActivityId Propagator to correlate (consider zipkin)
> Cloud is a jungle™. Without retry/timeout you won’t survive
> Monitor and measure all calls to external services (blame game)
> Protect your systems with circuit-breakers (and isolation)
> Canary helps you detect connectivity from customer view
Thomas Wood: Daisy Picture
Thomas Au: Thermometer Picture
Torbakhopper: Cables Picture
Dam Picture - Japan
Hsiung: Lights Picture
Health Endpoint in API Design