Code Shaming; Anti Patterns at Work Silicon Valley Code Camp – October 2014 Mark Simms (@mabsimms) Principal Group Program Manager Windows Azure Customer Advisory Team

SVCC: Code Shaming and Antipatterns


DESCRIPTION

Presentation from Silicon Valley Code Camp 2014, on subtle anti-patterns that show up in cloud services under load.


Page 1: SVCC: Code Shaming and Antipatterns

Code Shaming; Anti Patterns at Work
Silicon Valley Code Camp – October 2014

Mark Simms (@mabsimms)
Principal Group Program Manager
Windows Azure Customer Advisory Team

Page 2: SVCC: Code Shaming and Antipatterns

Designing resilient large-scale services requires careful design and architecture choices.

In this session we will explore key scenarios extracted from customer engagements, and what happens @ big scale.

Azure Customer Advisory Team (CAT) Works with internal and external customers to build out some of the largest applications on Azure

Get our hands dirty on all aspects of delivery; design, implementation and all too often firefighting

This is meant to be an interactive discussion – if you don’t ask questions, we will!

This session will be customer stories, patterns & code.

We will get deeply nerdy with .NET and Azure services.

Setting the stage

Page 3: SVCC: Code Shaming and Antipatterns

A large web site, processing asynchronous work


Azure Cloud Service

Web Role

Page 4: SVCC: Code Shaming and Antipatterns

100k+ connected devices publishing activity reports

Target end to end latency (including cellular link) – 8 seconds

Target throughput 5000 messages / second

Connected device(s) service, asynchronous processing

Azure Cloud Service

Web Role Worker

Service Bus

Page 5: SVCC: Code Shaming and Antipatterns

Batch receiving messages for throughput

Flag completion for individual messages

Connected device(s) service, asynchronous processing

Page 6: SVCC: Code Shaming and Antipatterns

Serialized processing – increasing latency

Batching receive for chunky communication – needed to meet throughput goals

Processing messages in sequence drives up latency

[Diagram: Service Bus Queue → Message Batch → Process Messages → Process Message → Process Message → ... (messages handled one after another)]

Page 7: SVCC: Code Shaming and Antipatterns

Switch to parallel processing
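The switch from sequential to parallel handling of a received batch can be sketched as below. This is a minimal Python illustration rather than the .NET code from the talk: `handle`, `process_batch_sequential`, `process_batch_parallel` and the `max_concurrency` parameter are hypothetical names, and the sleep stands in for real per-message work.

```python
import asyncio


async def handle(message: str) -> str:
    # Stand-in for real per-message work (assumed to be I/O bound here).
    await asyncio.sleep(0.01)
    return message.upper()


async def process_batch_sequential(batch):
    # Each message waits for all earlier ones, so the latency of the
    # last message grows linearly with the batch size.
    return [await handle(m) for m in batch]


async def process_batch_parallel(batch, max_concurrency=8):
    # Run handlers concurrently, but cap how many are in flight so a
    # large batch cannot exhaust threads, sockets or memory.
    gate = asyncio.Semaphore(max_concurrency)

    async def guarded(m):
        async with gate:
            return await handle(m)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(m) for m in batch))
```

With 16 messages and `max_concurrency=4`, the parallel version finishes in roughly four handler durations instead of sixteen, while the cap keeps resource use bounded.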

[Diagram: Service Bus Queue → Message Batch → Process Messages, fanning out to several Process Message steps running in parallel]

Page 8: SVCC: Code Shaming and Antipatterns

Initial performance very smooth

App quickly spikes to 100% CPU on all cores

Execution time spikes to minutes!

Something isn’t right

Page 9: SVCC: Code Shaming and Antipatterns

Most threads blocked in FindEntry of Dictionary

Using a Dictionary to look up the message handlers

What does windbg say?
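The slides do not spell out the fix. A common cause of this symptom in .NET is a shared `Dictionary<TKey, TValue>` being modified while other threads use it, since that type is not safe for concurrent writes. One way to avoid the problem is to build the handler table once, before any worker starts, and never mutate it afterwards (in .NET, `ConcurrentDictionary` is the other usual option). The Python sketch below illustrates the fixed-table idea only; the handler names and message types are hypothetical.

```python
from types import MappingProxyType


def handle_status(payload):
    return ("status", payload)


def handle_alert(payload):
    return ("alert", payload)


def build_dispatch_table():
    # Built once at startup, before any worker runs.
    table = {"status": handle_status, "alert": handle_alert}
    # Expose a read-only view so no worker can add entries later.
    return MappingProxyType(table)


HANDLERS = build_dispatch_table()


def dispatch(message_type, payload):
    handler = HANDLERS.get(message_type)
    if handler is None:
        # Unknown types fail loudly instead of being registered lazily.
        raise KeyError(f"no handler for message type {message_type!r}")
    return handler(payload)
```

Because the table is never written after startup, lookups from many workers need no locking.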

Page 10: SVCC: Code Shaming and Antipatterns

Large variations in avg/max latency

After time, processing rate drops to ~5 msg / second

CPU at ~ 0%

Something still isn’t right

[Chart: Variation in Message Processing. Avg, Min and Max processing time for Message Type 1 through Message Type 8; time axis from 00:00.0 to 00:30.2]

Page 11: SVCC: Code Shaming and Antipatterns

What does PerfView have to say?

http://channel9.msdn.com/Series/PerfView-Tutorial/Tutorial-12-Wall-Clock-Time-Investigation-Basics

System.Core!System.Dynamic.Utils.TypeExtensions.GetParametersCached

Page 12: SVCC: Code Shaming and Antipatterns

Looks simple enough…

• Required messaging exchange patterns for queuing (pub/sub, competing consumer)

• Partitioning and load balancing (affinity) for queue resources

• Latency vs. throughput – batching

• Resources vs. latency – bounding concurrency of task execution

• Message dispatch – dynamic vs. fixed function tables

• Poison messages, retries

• Idempotent processing

Asynchronous & queue based processing

Cloud Service Boundary

Load Balancer

Web Servers

Database

App Servers

Azure Queue(s)

Page 13: SVCC: Code Shaming and Antipatterns

(Very) Large scale website, backed by 500 Azure SQL databases

Physically collapsed web/app tiers to reduce latency

What can happen during periods of extreme success?

Large website, scale-out relational data storage


Azure Cloud Service

Web Role

500 databases

Page 14: SVCC: Code Shaming and Antipatterns

Each cloud service has a single public IP (VIP)

Each Azure SQL Database cluster also has a single public IP

120 web role instances, 500 databases

Connection pool default size = 100

What’s the limit?

Large website, scale-out relational data storage

Azure Load Balancer

DB1 DB2 DB3

SrcIp SrcPort DestIp DestPort

A.B.C.D 1 E.F.G.H 1433

A.B.C.D 2 E.F.G.H 1433
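The numbers on this slide can be put side by side with a little arithmetic. The sketch below assumes, for illustration only, that all databases are reached through a single destination IP and port; the slide says each cluster has one public IP but not how the 500 databases are spread across clusters. A TCP source port is a 16-bit number, so one source IP can hold at most 65,535 concurrent flows to a given destination IP and port, and the usable range in practice is smaller.

```python
WEB_ROLE_INSTANCES = 120
DATABASES = 500
POOL_SIZE = 100            # default maximum connections per pool
MAX_SOURCE_PORTS = 65_535  # 16-bit TCP port space, port 0 excluded


def worst_case_connections(instances, databases, pool_size):
    # One pool per (instance, database) pair, each filled to its maximum.
    return instances * databases * pool_size


def max_flows_per_destination(source_ips=1):
    # Upper bound on concurrent flows from the given number of source
    # IPs to a single destination IP and port.
    return source_ips * MAX_SOURCE_PORTS


demand = worst_case_connections(WEB_ROLE_INSTANCES, DATABASES, POOL_SIZE)
ceiling = max_flows_per_destination()
per_instance_share = ceiling // WEB_ROLE_INSTANCES
```

Under these assumptions the pools could ask for 6,000,000 connections while a single VIP can carry at most 65,535 flows to one destination, which leaves about 546 connections per web role instance.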

Page 15: SVCC: Code Shaming and Antipatterns

(Very) Large scale website, leveraging an external service for content moderation

Protected the external service dependency with a retry policy

On average called in 0.5% of service calls

Large website, leveraging external services


Azure Cloud Service

Web Role

500 databases

Content moderation

service

Page 16: SVCC: Code Shaming and Antipatterns

Too much trust in downstream services and client proxies

Not bounding non-deterministic calls

Blocking synchronous operations

No load shedding

Unintended consequences
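Two of the gaps listed above, unbounded calls and no load shedding, can be sketched together. This is a minimal Python illustration, not the code from the engagement: `BoundedCaller`, `Overloaded`, `max_in_flight` and `timeout_s` are hypothetical names, and `make_call` stands for any zero-argument function returning an awaitable call to the downstream service.

```python
import asyncio


class Overloaded(Exception):
    """Raised when a call is rejected instead of queued."""


class BoundedCaller:
    def __init__(self, max_in_flight, timeout_s):
        self._max_in_flight = max_in_flight
        self._timeout_s = timeout_s
        self._in_flight = 0

    async def call(self, make_call):
        # Load shedding: fail fast rather than pile up waiting requests.
        if self._in_flight >= self._max_in_flight:
            raise Overloaded("too many downstream calls in flight")
        self._in_flight += 1
        try:
            # Bound the call: a slow dependency costs at most timeout_s.
            return await asyncio.wait_for(make_call(), self._timeout_s)
        finally:
            self._in_flight -= 1
```

A caller would typically catch `Overloaded` and `asyncio.TimeoutError` and fall back to a degraded response, so that a rarely used dependency cannot hold up every web request.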

[Chart: Web Request Response Latency. Series: Avg Latency; y-axis: Response Latency in seconds, 0 to 450; x-axis: 1 to 29]

Page 17: SVCC: Code Shaming and Antipatterns

Rich clients (mobile and desktop) publishing documents for processing

Using Shared Access Signature (SAS) tokens for direct writes to storage

Looks like a good design…

Large website, asynchronous document processing


Azure Cloud Service

Web Role Worker

Azure Storage Account

Blob

Queue

Page 18: SVCC: Code Shaming and Antipatterns

Storage account URI is “hard coded” into the client application

Need to update all 100k+ client applications to change storage account

Large website, asynchronous document processing
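The slide names the problem but not a remedy. One possible approach is to have clients ask the service where to upload, so the storage account can change without a client update. The sketch below is a hypothetical Python illustration: `issue_upload_target`, the `example.invalid` account URLs and the HMAC token are all made up for the example, and the token is not the real Shared Access Signature format.

```python
import hashlib
import hmac
import time

# Hypothetical server-side secret and configuration.
SECRET = b"demo-secret"


def issue_upload_target(client_id, document_id, accounts, now=None, ttl_s=300):
    """Return a short-lived upload URL chosen from server-side configuration."""
    issued_at = time.time() if now is None else now
    expires = int(issued_at + ttl_s)
    # Stable choice of account per client, driven by the configured list.
    digest = hashlib.sha256(client_id.encode("utf-8")).hexdigest()
    account = accounts[int(digest, 16) % len(accounts)]
    path = f"/uploads/{client_id}/{document_id}"
    # Illustrative signature only; a real service would issue a SAS token.
    token = hmac.new(SECRET, f"{path}|{expires}".encode("utf-8"),
                     hashlib.sha256).hexdigest()
    return f"{account}{path}?expires={expires}&token={token}"
```

Changing the `accounts` list on the server redirects all new uploads; the 100k+ installed clients only need to know the service endpoint.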

Page 19: SVCC: Code Shaming and Antipatterns

Design Choices & Challenges

Page 20: SVCC: Code Shaming and Antipatterns

Devices and Services workload – connected embedded devices and applications streaming data to the cloud

• 100k+ devices, growing 50k / month

• Regional affinity (North America only)

Optimize for the most stringent case

Simplicity is king

No one, true solution

Exploration – Data Design

Query | Throughput | Latency | Reach

Every 30 seconds, each device publishes a status update (location, health, etc.) | 4k – 100k msgs/sec | 2000 – 5000 ms | Single device

Every 10 minutes, a batch job retrieves all of the status updates delivered in the past 10 minutes | 2M msgs / 10 minutes | 2 minutes | All devices

On an ad-hoc basis, a user may request the current status and recent history of all of their devices | 15 requests / second | 500 ms | Limited device set

On an ad-hoc basis, a user may request a historical time range of all of their devices | 5 requests / second | 750 ms | Limited device set

Page 21: SVCC: Code Shaming and Antipatterns

Cannot fulfill with a single database
• Exceeds transactional throughput limit
• Data growth will exceed practical size limits

Insert heavy workload
• Pressure on transaction log

Partitioning keys?
• Device ID, User account?

Partitioning approach
• Bucket, range, lookup?

Option 1: Relational – Considerations and Challenges

Page 22: SVCC: Code Shaming and Antipatterns

Periodic query spike on bulk reporting
• Impact to online operations (30M+ rows)

Rebalancing
• Moving data between partitions / databases

Distribution of reference data (relational model)
• Keeping in sync

Impact of noisy neighbors (Azure SQL DB)
• Variable latency, pushback under heavy load

Cost of management (SQL IaaS)
• Cost of automation for patching, maintenance

Option 1: Relational – Considerations and Challenges

Page 23: SVCC: Code Shaming and Antipatterns

Inserting large volumes of streaming data into a data store
• Data store is governed on number of operations (transactions)

Trade consistency for throughput – enqueue, batch and publish
• Get: increased throughput, shift work to “cheap” resource (app memory)
• Give up: full durability (potential data loss)

Tackling the Insert Challenge
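The enqueue, batch and publish trade can be sketched as a small in-memory buffer that turns many single inserts into fewer bulk operations. This is a minimal Python illustration under stated assumptions: `BatchingWriter`, `max_items` and `max_age_s` are hypothetical names, and `flush` is whatever function performs the bulk write.

```python
import time


class BatchingWriter:
    def __init__(self, flush, max_items=100, max_age_s=1.0, clock=time.monotonic):
        self._flush = flush          # called with a list of records
        self._max_items = max_items
        self._max_age_s = max_age_s
        self._clock = clock
        self._buffer = []
        self._oldest = None

    def add(self, record):
        if not self._buffer:
            self._oldest = self._clock()
        self._buffer.append(record)
        too_big = len(self._buffer) >= self._max_items
        too_old = self._clock() - self._oldest >= self._max_age_s
        if too_big or too_old:
            self.flush()

    def flush(self):
        # One store operation for the whole batch instead of one per record.
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._flush(batch)
```

Anything still in the buffer is lost if the process dies, which is the durability given up in exchange for throughput. The age check here only runs when a record arrives; a production version would also flush on a timer and on shutdown.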

Page 24: SVCC: Code Shaming and Antipatterns

Challenge: know that your site is having issues before Twitter does
• This is not a randomly chosen anecdote.

Instrument, collect, analyze - react
• Best: buy your way to victory (AppDynamics, New Relic, etc.)
• Also need to instrument application effectively for “contextual” data (aka logging)

Tackling the Insight Challenge

Page 25: SVCC: Code Shaming and Antipatterns

Instrument for production logging
• If you didn’t log & capture it, it didn’t happen

Implement inter-service monitoring and alerting
• Nothing interesting happens on a single instance

Run-time configurable logging
• Enable activation (capture or delivery) of additional channels at run-time

Getting logging right
• All logging must be asynchronous
• Buffer and filter before pushing to remote service or store

Instrumenting Applications
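The asynchronous, buffered and filtered logging described above can be sketched with Python's standard `logging.handlers.QueueHandler` and `QueueListener`. The talk itself uses .NET EventSource, so this is an analogy rather than the demo code; `build_async_logger`, `min_level` and `capacity` are hypothetical names.

```python
import logging
import logging.handlers
import queue


def build_async_logger(name, target_handler, min_level=logging.WARNING,
                       capacity=10_000):
    # Bounded buffer between the application and the slow sink.
    log_queue = queue.Queue(maxsize=capacity)
    # The application thread only enqueues; it never waits on the sink.
    queue_handler = logging.handlers.QueueHandler(log_queue)
    # Filter before delivery: the sink only receives min_level and above.
    target_handler.setLevel(min_level)
    listener = logging.handlers.QueueListener(
        log_queue, target_handler, respect_handler_level=True)
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.addHandler(queue_handler)
    logger.propagate = False
    listener.start()  # background thread drains the queue
    return logger, listener
```

Calling `target_handler.setLevel(logging.DEBUG)` later turns on more detail at run time without a restart, and `listener.stop()` drains what is queued before shutdown. When the queue is full, new records are dropped rather than blocking the caller.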

Page 26: SVCC: Code Shaming and Antipatterns

Bringing down a production system with logging…

Page 27: SVCC: Code Shaming and Antipatterns

Demo: Instrumenting Applications with Event Source

Page 28: SVCC: Code Shaming and Antipatterns

STB Readiness

Option 2: Compositional Azure Storage

This isn’t a relational workload
• Per-device insert and lookup
• Periodic batch transfer

Per-device lookup
• Natural fit for table storage
• Device ID = Pk
• Data type = Rk

Periodic batch transfer
• Natural fit for blob storage
• Instance + Timestamp = blob id
• Buffer and write into blocks
• Roll over on time interval (10 min)

[Diagram: binary data stream (0101 1101 0111 ...) flowing through a time/space buffer into Table Storage and Blob Storage]

Table Storage: Pk={Device;Day}, Rk={Timestamp}, Payload={fields}

Blob Storage: Uri={Minute;Instance}, Payload={JSON Data}

Querying by device
• By time - direct { PkRk } lookup
• By day - direct { Pk }, max of 2880 records per partition

Batch transfer by time frame
• Parallel download of all blobs matching timeframe pattern

Adding scale capacity
• 20k operations per storage account,
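The key scheme on this slide can be sketched as a few helper functions. The exact string formats below are assumptions made for illustration; the slide only gives the components (Pk={Device;Day}, Rk={Timestamp}, Uri={Minute;Instance}). The 2880 figure follows from one report every 30 seconds per device.

```python
from datetime import datetime, timezone

REPORT_INTERVAL_S = 30  # each device reports every 30 seconds


def table_keys(device_id, ts):
    # Partition key: device plus day. Row key: time of day.
    partition_key = f"{device_id};{ts:%Y%m%d}"
    row_key = f"{ts:%H%M%S}"
    return partition_key, row_key


def blob_name(instance_id, ts, window_min=10):
    # One blob per writer instance per roll-over window (10 minutes).
    window_start = ts.minute - ts.minute % window_min
    return f"{ts:%Y%m%d%H}{window_start:02d};{instance_id}"


def rows_per_partition_per_day(interval_s=REPORT_INTERVAL_S):
    # 86,400 seconds per day divided by the reporting interval.
    return 24 * 60 * 60 // interval_s


sample = datetime(2014, 10, 11, 13, 5, 30, tzinfo=timezone.utc)
```

For the sample timestamp, `table_keys("dev42", sample)` gives `("dev42;20141011", "130530")`, `blob_name("web_0", sample)` gives `"201410111300;web_0"`, and `rows_per_partition_per_day()` gives 2880. A by-day query reads a single partition, and the batch job can list blobs by their time prefix.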

Page 29: SVCC: Code Shaming and Antipatterns

Azure Storage Account - Blob

Max blob size (block): 200 GB (50k blocks)
Max block size: 4 MB
Max blob size (page): 1 TB
Max page size: 512 bytes
Max bandwidth / blob: 480 Mbps
Latency bounds (per operation): 100 ms nominal, 1-3 sec during load balancing
Scale-out unit: Blob
Scale-out impedance: Low

• Use the appropriate blob type (prefer block blobs with immutable / append-only data)

• Use the largest practical block size (note: network performance may require smaller blocks for “long-haul”)

• For partial reads use 64 KB block size to maximize throughput

Scale

• Use Async Copy API for copying blobs between accounts, providers, etc.

Page 30: SVCC: Code Shaming and Antipatterns

Azure Storage Account - Table

Max operations / second per partition: 5000
Max row size (names + data): 1 MB
Max column size (byte[] or string): 64 KB
Maximum number of rows: N/A (up to storage account size limit)
Scale-out unit: Table partition
Scale-out impedance: Low

• Use appropriate partition keys to co-locate data (for query or batch operations) or break data into more partitions (for throughput)

• Avoid use of table storage for applications requiring non-trivial aggregation or function projection

• Store multiple types in same table for normalized queries (do not denormalize table storage schema!)

• Avoid large scans (can be very expensive!); explore use of separate (partially consistent) index table

Scale

• Leverage multiple storage accounts (not multiple tables) to increase operations/second
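Spreading load across storage accounts needs a stable way to map a partition key to an account. The sketch below is one possible approach in Python, using a hash and a modulo; `pick_account` and `accounts_needed` are hypothetical names, and the 20k operations per account figure is taken from the earlier "Adding scale capacity" slide.

```python
import hashlib
import math

OPS_PER_ACCOUNT = 20_000  # per-account figure quoted earlier in the deck


def pick_account(partition_key, accounts):
    # Use a stable hash; Python's built-in hash() varies between runs.
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return accounts[int.from_bytes(digest[:8], "big") % len(accounts)]


def accounts_needed(target_ops_per_second, ops_per_account=OPS_PER_ACCOUNT):
    # Smallest number of accounts whose combined limit covers the target.
    return math.ceil(target_ops_per_second / ops_per_account)
```

Modulo hashing remaps most keys when the account list changes, so a lookup table or range map may suit workloads that expect to add accounts over time.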

Page 31: SVCC: Code Shaming and Antipatterns

Azure Storage Account - Queues

Max messages in a queue: N/A (up to storage account size limit)
Max lifetime of a message: 1 week (auto purged)
Max message size: 64 KB
Max throughput: 2000 messages / second
Scale-out unit: Queue
Scale-out impedance: Medium

• Optimize storage format to reduce message size / avoid 64 KB limit (for larger messages leverage Service Bus or Queues + Blob)

• Retrieve messages in batches to increase throughput

• Use dequeue count on message for poison messages

Scale

• Leverage multiple queues to increase messages / second

• Vertical partitioning: split queues by function

• Horizontal partitioning: split messages between queues (round robin/direct assignment)
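Horizontal partitioning and poison-message handling can be sketched as follows. This Python illustration uses plain lists as stand-ins for queues; `RoundRobinPublisher`, `route_received`, `queues_needed` and the `MAX_DEQUEUE_COUNT` threshold are hypothetical, and the 2000 messages per second figure comes from the table above.

```python
import itertools
import math

MAX_DEQUEUE_COUNT = 5        # assumed retry threshold, not from the slides
MSGS_PER_QUEUE_PER_S = 2000  # per-queue throughput quoted above


def queues_needed(target_msgs_per_second, per_queue=MSGS_PER_QUEUE_PER_S):
    return math.ceil(target_msgs_per_second / per_queue)


class RoundRobinPublisher:
    def __init__(self, queues):
        self._cycle = itertools.cycle(queues)

    def publish(self, message):
        # Each message goes to the next queue in turn.
        target = next(self._cycle)
        target.append(message)
        return target


def route_received(message, dequeue_count, process, poison_sink):
    # A message received too many times has probably failed repeatedly;
    # set it aside instead of retrying forever.
    if dequeue_count > MAX_DEQUEUE_COUNT:
        poison_sink.append(message)
        return "poison"
    process(message)
    return "processed"
```

For the 5000 messages per second target mentioned earlier, `queues_needed(5000)` returns 3.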

Page 32: SVCC: Code Shaming and Antipatterns

Services site for mobile device applications
• 1M+ users at launch, 1M+ users added per month
• Front ended by Android, iOS, Windows Phone

Personalized information feeds and data sets
• Examples: browsing history, shopping cart
• Assuming up to 30% of user base can be online at any point in time
• Maximum response latency 250 ms @ 99th percentile

User centric web application

Page 33: SVCC: Code Shaming and Antipatterns

Where are the scalability bottlenecks?

Where are the availability and failure points?

Where are the key insight and instrumentation points?

Tearing apart the architecture

[Diagram: a Cloud Service with a Front End Web Role (four instances), a Caching Role (two instances) and a Worker Role (one instance), backed by Databases (four DBs) and Storage (two storage accounts)]

Page 34: SVCC: Code Shaming and Antipatterns

Demo: Implementing an information publishing site

Page 35: SVCC: Code Shaming and Antipatterns

Recap

Know the numbers – platform scalability targets
• Compute, storage, networking and platform services
• Scalability == capacity * efficiency

Watch out for shared resources and contention points
• At high load and concurrency “interesting” things happen
• Default to asynchronous, bound all calls

Insight is power – measuring and observation of behavior
• Without rich telemetry and instrumentation – down to the call level – apps are running blind
• Buy your way to victory, leverage asynchronous and structured logging

Page 36: SVCC: Code Shaming and Antipatterns

Resources

Failsafe: Building scalable, resilient cloud services
http://channel9.msdn.com/Series/FailSafe

Cloud Service Fundamentals - Reference code for Azure
http://code.msdn.microsoft.com/windowsazure/ContosoSocial-in-Windows-8dd9052c

 

Page 37: SVCC: Code Shaming and Antipatterns

© 2013 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.