AWS LAMBDA
from the TRENCHES
what you should know before you go to production
hi, I’m Yan Cui
AWS user since 2009
apr, 2016
hidden complexities and dependencies
low utilisation to leave room for traffic spikes
EC2 scaling is slow, so scale earlier
lots of cost for unused resources
up to 30 mins for deployment
deployment required downtime
“lead time to someone saying thank you is the only reputation metric that matters.”
- Dan North
“what would good
look like for us?”
DEPLOYMENTS SHOULD...
be small
be fast
have zero downtime
have no lock-step
FEATURES SHOULD...
be deployable independently
be loosely-coupled
WE WANT TO...
minimise cost for unused resources
minimise ops effort
reduce tech mess
deliver visible improvements faster
nov, 2016
170 Lambda functions in prod
1.2 GB deployment packages in prod
95% cost saving vs EC2
15x no. of prod releases per month
[timeline, axis: time] is a good fit? → 1st function in prod! → 170 functions
ALERTING, CI / CD, TESTING, LOGGING, MONITORING
SECURITY, DISTRIBUTED TRACING, CONFIG MANAGEMENT
evolving the PLATFORM
rebuilt search
Legacy Monolith → Amazon Kinesis → AWS Lambda → Amazon CloudSearch
Amazon API Gateway → AWS Lambda → Amazon CloudSearch
new analytics pipeline
Legacy Monolith → Amazon Kinesis → AWS Lambda → Google BigQuery
1 developer, 2 days from design to production
(his 1st serverless project)
“nothing ever got done this fast at Skype!”
- Chris Twamley
Rebuilt with Lambda
[architecture diagram: BigQuery, grapheneDB]
getting PRODUCTION READY
CHOOSE A FRAMEWORK
DEPLOYMENT
TESTING
Level of Testing
1. Unit: do our objects do the right thing? are they easy to work with?
2. Integration: does our code work against code we can’t change? (test by invoking the handler)
3. Acceptance: does the whole system work?
[diagram: unit, integration and acceptance tests plotted against feedback and confidence]
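Testing by invoking the handler can be as simple as calling the function with a hand-built event. A minimal sketch, assuming a hypothetical `handler` that answers an API Gateway style event; the handler, the event shape and the field names are illustrative and not taken from the talk. In a real integration test the handler would talk to the real downstream services rather than mocks.

```javascript
// Hypothetical handler: returns a greeting for the `name` query parameter.
// In a real project this would be the function deployed to Lambda.
const handler = async (event) => {
  const name = (event.queryStringParameters || {}).name || "world";
  return { statusCode: 200, body: JSON.stringify({ message: `hello, ${name}` }) };
};

// Integration-style test: build an event by hand and invoke the handler
// directly, without going through API Gateway.
const invokeHandler = async () => {
  const event = { queryStringParameters: { name: "yubl" } };
  const response = await handler(event);
  return { statusCode: response.statusCode, body: JSON.parse(response.body) };
};

invokeHandler().then((res) => console.log(res));
```

The same test can later be pointed at the deployed endpoint to serve as an acceptance test.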
“…We find that tests that mock external libraries often need to be complex to get the code into the right state for the functionality we need to exercise.
The mess in such tests is telling us that the design isn’t right but, instead of fixing the problem by improving the code, we have to carry the extra complexity in both code and test…”
Don’t Mock Types You Can’t Change
“…The second risk is that we have to be sure that the behaviour we stub or mock matches what the external library will actually do…
Even if we get it right once, we have to make sure that the tests remain valid when we upgrade the libraries…”
Don’t Mock Types You Can’t Change
Don’t Mock Services You Can’t Change
“…Wherever possible, an acceptance test should exercise the system end-to-end without directly calling its internal code.
An end-to-end test interacts with the system only from the outside: through its interface…”
Testing End-to-End
[diagram: the rebuilt search pipeline (Legacy Monolith, Amazon Kinesis, AWS Lambda, Amazon CloudSearch, Amazon API Gateway, AWS Lambda), annotated with "Test Input" and "Validate"]
CI + CD PIPELINE
“the earlier you consider CI + CD, the more time you save in the long run”
- me
“…We prefer to have the end-to-end tests exercise both the system and the process by which it’s built and deployed…
This sounds like a lot of effort (it is), but has to be done anyway repeatedly during the software’s lifetime…”
Testing End-to-End
“deployment scripts that only live on the CI box are a disaster waiting to happen”
- me
Jenkins build config deploys and tests:
1. unit + integration tests
2. deploy
3. acceptance tests
build.sh allows repeatable builds both locally and on CI
[pipeline stages: Auto, Auto, Manual]
LOGGING
2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae GOT is off air, what do I do now?
fields: UTC Timestamp, API Gateway Request Id, your log message
[screenshot labels: function name, date, function version]
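The three fields of a log line can be pulled apart with a small parser. This is a sketch only: `parseLogLine` is a hypothetical helper, and it assumes the fields are separated by whitespace and that everything after the request id is the message.

```javascript
// Splits a Lambda log line into timestamp, request id and message.
// Returns null when the line does not have at least three fields.
const parseLogLine = (line) => {
  const match = /^(\S+)\s+(\S+)\s+(.*)$/.exec(line);
  if (!match) return null;
  const [, timestamp, requestId, message] = match;
  return { timestamp, requestId, message };
};

const parsed = parseLogLine(
  "2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae GOT is off air, what do I do now?"
);
console.log(parsed);
```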
LOG OVERLOAD
CENTRALISE LOGS
MAKE THEM EASILY SEARCHABLE
the ELK stack (Elasticsearch + Logstash + Kibana)
CloudWatch Logs → AWS Lambda → ELK stack
CloudWatch Events
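A function subscribed to CloudWatch Logs receives the log events as base64-encoded, gzipped JSON under `event.awslogs.data`. The sketch below decodes that payload into plain records; what happens next (shipping them to Logstash or Elasticsearch) is left out, and the sample log group and stream names are made up for the demo.

```javascript
const zlib = require("zlib");

// Decode the CloudWatch Logs subscription payload: base64 -> gunzip -> JSON.
const decodeLogsEvent = (event) => {
  const compressed = Buffer.from(event.awslogs.data, "base64");
  const payload = JSON.parse(zlib.gunzipSync(compressed).toString("utf8"));
  // Keep only the fields a log shipper would typically forward.
  return payload.logEvents.map((e) => ({
    logGroup: payload.logGroup,
    logStream: payload.logStream,
    timestamp: e.timestamp,
    message: e.message,
  }));
};

// Build a sample event the same way, so the sketch can run locally.
const sample = {
  logGroup: "/aws/lambda/hello",
  logStream: "2016/07/12/[$LATEST]abc",
  logEvents: [{ id: "1", timestamp: 1468326277571, message: "GOT is off air" }],
};
const sampleEvent = {
  awslogs: { data: zlib.gzipSync(JSON.stringify(sample)).toString("base64") },
};
console.log(decodeLogsEvent(sampleEvent));
```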
DISTRIBUTED TRACING
“my followers didn’t receive my new post!”
- a user
where could the problem be?
correlation IDs*
* eg. request-id, user-id, yubl-id, etc.
ROLL YOUR OWN CLIENTS
kinesis client
http client
sns client
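One way a hand-rolled client can carry correlation IDs is to copy them from the incoming request onto every outgoing one. A minimal sketch for the HTTP case, assuming the IDs travel as headers with an `x-correlation-` prefix; the prefix and both helper names are assumptions made for illustration.

```javascript
// Assumption: correlation IDs travel as HTTP headers with this prefix.
const PREFIX = "x-correlation-";

// Pick the correlation IDs out of the incoming request headers.
const extractCorrelationIds = (headers) => {
  const ids = {};
  for (const [key, value] of Object.entries(headers || {})) {
    if (key.toLowerCase().startsWith(PREFIX)) ids[key.toLowerCase()] = value;
  }
  return ids;
};

// A hand-rolled client merges them into the headers of each outgoing call.
const withCorrelationIds = (ids, requestHeaders = {}) => ({ ...requestHeaders, ...ids });

const incoming = {
  "X-Correlation-Request-Id": "req-1",
  "X-Correlation-User-Id": "user-42",
  Accept: "application/json",
};
const ids = extractCorrelationIds(incoming);
const outgoing = withCorrelationIds(ids, { "content-type": "application/json" });
console.log(outgoing);
```

The Kinesis and SNS clients would do the same thing with record payloads or message attributes instead of headers.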
X-RAY
MONITORING + ALERTING
“where do I install monitoring agents?”
you can’t
CloudWatch:
• invocation Count
• error Count
• latency
• throttling
• granular to the minute
• support custom metrics
Datadog:
• same metrics as CW
• better dashboard
• support custom metrics
https://www.datadoghq.com/blog/monitoring-lambda-functions-datadog/
“how do I batch up and send logs in the background?”
you can’t (kinda)
console.log("hydrating yubls from db…");
console.log("fetching user info from user-api");
console.log("MONITORING|1489795335|27.4|latency|user-api-latency");
console.log("MONITORING|1489795335|8|count|yubls-served");
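format: MONITORING|timestamp|metric value|metric type|metric name
the first two lines are logs, the last two are metrics
CloudWatch Logs → AWS Lambda → logs → ELK stack
CloudWatch Logs → AWS Lambda → metrics → CloudWatch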
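The function that receives the log stream can separate metrics from ordinary logs by looking for the `MONITORING|` prefix. A minimal sketch of that routing step; `parseMetric` and `route` are hypothetical names, and publishing the results to CloudWatch or the ELK stack is left out.

```javascript
// Parses "MONITORING|timestamp|value|type|name" lines; anything else
// is treated as an ordinary log message and yields null.
const parseMetric = (message) => {
  const parts = message.split("|");
  if (parts.length !== 5 || parts[0] !== "MONITORING") return null;
  const [, timestamp, value, type, name] = parts;
  return { timestamp: Number(timestamp), value: Number(value), type, name };
};

// Route each message: metrics one way, plain logs the other.
const route = (messages) => {
  const metrics = [];
  const logs = [];
  for (const m of messages) {
    const metric = parseMetric(m);
    if (metric) metrics.push(metric);
    else logs.push(m);
  }
  return { metrics, logs };
};

const routed = route([
  "hydrating yubls from db…",
  "MONITORING|1489795335|27.4|latency|user-api-latency",
  "MONITORING|1489795335|8|count|yubls-served",
]);
console.log(routed);
```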
DASHBOARDS
SET ALARMS
TRACK APP-LEVEL METRICS
Not Only CloudWatch
“you really don't want your monitoring system to fail at the same time as the system it monitors”
- me
CONFIG MANAGEMENT
easily and quickly propagate config changes
CENTRALISED CONFIG SERVICE
[diagram: config service goes here]
CLIENT LIBRARY
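A client library for such a service might cache values for a short time, so that config changes propagate quickly without a call to the service on every invocation. This is a sketch under assumptions: the TTL-based cache is one possible design rather than the one used in the talk, and `createConfigClient` and `fetchConfig` (standing in for the HTTPS call to the config API) are hypothetical.

```javascript
// Hypothetical client library: caches values from a config service and
// refreshes them once `ttlMs` has elapsed. `now` is injectable for tests.
const createConfigClient = (fetchConfig, ttlMs, now = Date.now) => {
  let cached = null;
  let expiresAt = 0;
  return {
    get: async (key) => {
      if (cached === null || now() >= expiresAt) {
        cached = await fetchConfig();
        expiresAt = now() + ttlMs;
      }
      return cached[key];
    },
  };
};

// Usage with a fake fetcher; a real one would call the config API.
let fetchCount = 0;
const client = createConfigClient(async () => ({ fetches: ++fetchCount }), 60000);
client.get("fetches").then((v) => console.log("fetches so far:", v));
```

A shorter TTL propagates changes faster at the cost of more calls to the config service.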
sensitive data should be encrypted in-flight, and at rest
(credentials, connection string, etc.)
[diagram, storing config: role-based access → config API → encrypt with KMS → encrypted at rest]
[diagram, reading config: role-based access → config API → decrypt with KMS → HTTPS, encrypted in-flight]
access to config API can be controlled with IAM roles*
*http://amzn.to/2mxTOyH
FRAMEWORK PLUG-INS
serverless-plugin-kmsvariables
serverless-secrets
serverless-meta-sync
PRO TIPS
MAP TIMEOUTS TO HTTP 504
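The decision behind this mapping can be sketched as a small function. In API Gateway it is normally configured as a regular expression on the integration response rather than written in code, so the snippet below only illustrates the rule. It assumes a timed-out invocation reports an error message containing "Task timed out after"; `statusCodeFor` is a hypothetical name.

```javascript
// Assumption: a timed-out invocation surfaces an error message that
// contains this text.
const TIMEOUT_PATTERN = /Task timed out after/;

// Choose the HTTP status for a (possibly absent) Lambda error message.
const statusCodeFor = (errorMessage) => {
  if (!errorMessage) return 200; // no error
  if (TIMEOUT_PATTERN.test(errorMessage)) return 504; // gateway timeout
  return 500; // any other failure
};

console.log(statusCodeFor("994f18f9 Task timed out after 6.00 seconds"));
```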
AVOID 128MB FOR PRODUCTION
continuous timeout loop…
AVOID COLDSTARTS
functions are unloaded if idle for a while
noticeable coldstart time (package size matters)
CloudWatch Event → ping → AWS Lambda
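For the pings to be cheap, the function has to recognise them and return before doing real work. A minimal sketch, relying on the fact that scheduled CloudWatch Events arrive with `source` set to `"aws.events"`; the handler and its return values are hypothetical.

```javascript
// Scheduled CloudWatch Events arrive with source "aws.events".
const isPing = (event) => Boolean(event) && event.source === "aws.events";

const handler = async (event) => {
  if (isPing(event)) {
    // Keep-warm invocation: return before doing any real work.
    return { warmed: true };
  }
  // Hypothetical real work for ordinary invocations.
  return { statusCode: 200, body: "real work done" };
};

handler({ source: "aws.events" }).then((res) => console.log(res));
```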
HEALTH CHECKS?
even then…
functions are recycled every 4 hours
https://www.iopipe.com/2016/09/understanding-aws-lambda-coldstarts/
Coldstarts happen, with few exceptions, 4 hours from the creation of a host VM.
AVOID HARD ASSUMPTIONS ABOUT FUNCTION LIFETIME
USE STATE FOR OPTIMISATION
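In practice this means treating module-level state as a cache that may or may not be there. A minimal sketch, assuming a hypothetical `createClient` that stands in for expensive setup such as opening a connection: the client is created lazily and reused on warm invocations, and nothing breaks if the container is recycled and the state is lost.

```javascript
// Module-level state survives between invocations on a warm container,
// but can vanish at any time, so it is used purely as a cache.
let cachedClient = null;
let initCount = 0;

// Hypothetical expensive setup, e.g. opening a connection.
const createClient = () => {
  initCount += 1;
  return { query: (id) => `result for ${id}` };
};

// Create the client on first use, reuse it afterwards.
const getClient = () => {
  if (cachedClient === null) cachedClient = createClient();
  return cachedClient;
};

const handler = async (event) => getClient().query(event.id);

handler({ id: "42" }).then((res) => console.log(res));
```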
CLEAN UP OLD PACKAGES
max 50 MB deployment package size
max 75 GB total deployment package size*
* limit is per AWS region
Janitor Monkey
USE RECURSION FOR LONG RUNNING TASKS
max 5 mins execution time
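The recursion pattern can be sketched as follows: work through the input until the remaining execution time gets low, then hand the unprocessed rest to a fresh invocation of the same function. `context.getRemainingTimeInMillis()` is part of the Lambda context object; `makeHandler`, `processItem`, `invokeSelf` and the 10 second safety margin are assumptions made for illustration.

```javascript
// Stop and recurse once less than this much execution time is left.
const SAFETY_MARGIN_MS = 10000;

// `invokeSelf` is a hypothetical helper that would asynchronously invoke
// this same function with the given payload.
const makeHandler = (processItem, invokeSelf) => async (event, context) => {
  const remaining = [...event.items];
  while (remaining.length > 0) {
    if (context.getRemainingTimeInMillis() < SAFETY_MARGIN_MS) {
      // Not enough time left: pass the unprocessed items on and stop.
      await invokeSelf({ items: remaining });
      return { status: "recursed", left: remaining.length };
    }
    await processItem(remaining.shift());
  }
  return { status: "done", left: 0 };
};
```

Passing the remaining state in the payload keeps each invocation independent of any local state from the previous one.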
CONSIDER PARTIAL FAILURES
“AWS Lambda polls your stream and invokes your Lambda function. Therefore, if
a Lambda function fails, AWS Lambda attempts to process the erring batch of
records until the time the data expires…”
http://docs.aws.amazon.com/lambda/latest/dg/retries-on-errors.html
should function fail on partial/any failures?
use local state to facilitate partial retries
DLQ after max attempts
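The last two points can be combined in one sketch: remember which records of a batch already succeeded so a retry skips them, and give up on a record after a fixed number of attempts by sending it to a dead-letter queue. Everything here is an assumption made for illustration: the `id` field, `MAX_ATTEMPTS`, and the `processRecord` and `sendToDlq` callbacks. Local state is lost on a coldstart, so record handling still needs to be idempotent.

```javascript
// Local (per-container) state used to make retries of a batch cheaper.
const processed = new Set();
const attempts = new Map();
const MAX_ATTEMPTS = 3;

const processBatch = async (records, processRecord, sendToDlq) => {
  for (const record of records) {
    if (processed.has(record.id)) continue; // succeeded on an earlier attempt
    const n = (attempts.get(record.id) || 0) + 1;
    attempts.set(record.id, n);
    try {
      await processRecord(record);
    } catch (err) {
      if (n < MAX_ATTEMPTS) throw err; // fail the batch so it is retried
      await sendToDlq(record); // max attempts reached: give up on this record
    }
    processed.add(record.id);
    attempts.delete(record.id);
  }
};
```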
PROCESS SQS WITH RECURSIVE FUNCTIONS
AVOID HOT KINESIS STREAMS
“Each shard can support up to 5 transactions per second for reads, up to a maximum total data
read rate of 2 MB per second.”
http://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html
“If your stream has 100 active shards, there will be 100 Lambda functions running concurrently. Then, each
Lambda function processes events on a shard in the order that they arrive.”
http://docs.aws.amazon.com/lambda/latest/dg/concurrent-executions.html
when no. of processors goes up…
ReadProvisionedThroughputExceeded
can have too many Kinesis read operations…
ReadRecords.IteratorAge
unpredictable spikes in read ‘latency’…
can kinda workaround…
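A rough back-of-the-envelope check shows why more processors on one stream leads to `ReadProvisionedThroughputExceeded`. The sketch assumes every subscribed function polls each shard at the same fixed rate (once per second by default here); the actual polling behaviour may differ, and only the limit of 5 reads per second per shard comes from the Kinesis documentation quoted above.

```javascript
const SHARD_READ_LIMIT_PER_SEC = 5; // from the Kinesis limits quoted above

// Assumption: every subscribed function polls each shard at the same
// fixed rate; the real polling behaviour may differ.
const readsPerShardPerSec = (subscribers, pollsPerSecEach = 1) =>
  subscribers * pollsPerSecEach;

const exceedsReadLimit = (subscribers, pollsPerSecEach = 1) =>
  readsPerShardPerSec(subscribers, pollsPerSecEach) > SHARD_READ_LIMIT_PER_SEC;

for (const n of [1, 5, 6]) {
  console.log(`${n} subscribers -> over limit: ${exceedsReadLimit(n)}`);
}
```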
@theburningmonk
theburningmonk.com
github.com/theburningmonk
Yubl’s journey to Serverless
part 1 : overview http://theburningmonk.com/2016/12/yubls-road-to-serverless-architecture-part-1/
part 2 : test + CI/CD http://theburningmonk.com/2017/02/yubls-road-to-serverless-architecture-part-2/
part 3 : ops http://theburningmonk.com/2017/03/yubls-road-to-serverless-architecture-part-3/
QUESTIONS?