(DVO313) Building Next-Generation Applications with Amazon ECS

  • Published on
    19-Jan-2017

Transcript

<ul><li><p>© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.</p><p>Matt DeBergalis</p><p>@debergalis</p><p>Cofounder and VP of Product</p><p>October 2015</p><p>DVO313</p><p>Building Next-Generation Applications </p><p>with Amazon ECS</p></li><li><p>What to expect from the session</p><p> Overview of the connected client application architecture, where rich </p><p>web + mobile clients maintain persistent network connections to cloud </p><p>microservices, and Meteor, a JavaScript application platform for building </p><p>these apps.</p><p> Discussion of the unique devops requirements needed to manage </p><p>connected client apps and microservices.</p><p> Reasons for delivering Galaxy, Meteor's cloud runtime, on Amazon ECS.</p><p> Deep dive into our use of Amazon ECS.</p></li><li><p>Open source</p><p>The JavaScript app </p><p>platform, for mobile </p><p>and web</p><p>Open source (MIT)</p><p>10th most starred </p><p>project on GitHub</p><p>Fully supported</p><p>Galaxy runtime</p><p>Deploy, operate, and </p><p>monitor apps and </p><p>services</p><p>Built on Amazon ECS</p><p>Launched to public on </p><p>10/5</p><p>Team of over 30, </p><p>hundreds of OSS </p><p>contributors</p><p>$30M+ raised from </p><p>Andreessen Horowitz, </p><p>Matrix, others</p><p>100+ development </p><p>and training partners</p><p>Cloud platform Complete ecosystem</p></li><li><p>JavaScript CLR JVM</p><p>Meteor: a JavaScript application platform</p></li><li><p>Galaxy</p><p>Proxy</p><p>App A (dead) Galaxy Server Galaxy Server</p><p>E C S C L U S T E R</p><p>E L B</p><p>Proxy Scheduler</p><p>A Z 1 A Z 2</p><p>App A v2</p><p>App A v2</p><p>App A v2</p><p>App A v2</p><p>App A (dead)</p><p>App A (dead)</p><p>App A (dead)</p></li><li><p>WEBSITES</p><p>Links and forms</p><p>Page-based</p><p>Viewed in a browser</p><p>APPS</p><p>Modern UI/UX</p><p>No refresh button</p><p>Browser, mobile, and more</p></li><li><p>WEBSITES</p><p>Stateless</p><p>Request / response</p><p>Presentation on the 
wire</p><p>APPS</p><p>Stateful</p><p>Publish / subscribe</p><p>Data on the wire</p></li><li><p>The architecture required for modern </p><p>apps is different</p></li><li><p>Mainframe PC Web Connected</p><p>client</p></li><li><p>1. Stateful clients and servers with </p><p>persistent network connections.</p><p>2. Reactive architecture where data </p><p>is pushed to clients in real time.</p><p>3. Code running on the client and</p><p>in the cloud. The app spans the </p><p>network.</p><p>Connected client</p></li><li><p>Connected client app</p><p>Cloud</p><p>Client</p><p>Application</p><p>microservice Billing Geo</p></li><li><p>HTML </p><p>Templates</p><p>App </p><p>Logic</p><p>Microservices Database</p><p>Reactive UI update system</p><p>Native mobile container</p><p>Speculative client-side updates</p><p>Client-side data store</p><p>Custom data sync protocol</p><p>Realtime database monitoring</p><p>Build &amp; update system</p><p>Assemble it yourself</p><p>Off the shelf</p><p>Build / integrate</p><p>Modern app architecture</p></li><li><p>HTML </p><p>Templates</p><p>App </p><p>Logic</p><p>Microservices Database</p><p>HTML </p><p>Templates</p><p>App </p><p>Logic</p><p>Microservices Database</p><p>Reactive UI update system</p><p>Native mobile container</p><p>Speculative client-side updates</p><p>Client-side data store</p><p>Custom data sync protocol</p><p>Realtime database monitoring</p><p>Build &amp; update system</p><p>Open-source</p><p>JavaScript app platform</p><p>Off the shelf</p><p>Build / integrate</p><p>Assemble it yourself With Meteor</p><p>Modern app architecture</p></li><li><p>JavaScript is the only reasonable </p><p>language for cross platform app </p><p>development</p></li><li><p>The architecture required for </p><p>modern apps is different</p></li><li><p>The devops required for </p><p>modern apps is different</p></li><li><p>The devops required for </p><p>modern apps is more complex</p></li><li><p> Persistent, stateful connections.</p><p> Seamless application 
updates </p><p>hot code push.</p><p> Client tracking and metrics.</p><p> Complex array of microservices.</p><p> Mobile considerations (builds, push </p><p>notifications).</p><p>Connected client devops</p></li><li><p> Scalable multi-tenant: 100k users,</p><p>1MM processes, 100M sessions.</p><p> Accessible to developers without </p><p>sophisticated devops background.</p><p> Suitable for expert teams and complex </p><p>apps.</p><p> High availability of user apps and the </p><p>Galaxy infrastructure.</p><p> Online updates of all Galaxy components.</p><p> Smooth path to customer-managed cloud.</p><p> Use off-the-shelf parts wherever possible.</p><p>Design requirements</p></li><li><p>C o n n e c t e d c l i e n t m a n a g e m e n t</p><p>Application logic and services</p><p>Metrics Hot code deploy Session mgmt</p><p>Container management</p><p>IaaS resources</p><p>Web</p><p>Galaxy: connected client management</p><p>I n f r a s t r u c t u r e</p><p>Mobile Device</p></li><li><p> O(100k) independent user processes that need isolation.</p><p> Granular, efficient essential in multi-tenant.</p><p> Surprisingly important: fast spin-up.</p><p> Speed and responsiveness are an essential part of a great </p><p>developer experience.</p><p> Fast spin-up lets us build around a single-shot container </p><p>model.</p><p> Layering as a path to user-supplied binaries.</p><p>Containers and orchestration</p></li><li><p> Lots of exciting options here: ECS, Kubernetes, Marathon, </p><p> Service argument is compelling. Same case we make for Galaxy to </p><p>our customers.</p><p> Integration with other parts of AWS saves us time and code. </p><p>Example: services automatically register containers with Elastic Load </p><p>Balancing (ELB). 
</p><p> Support for multiple Availability Zones.</p><p> Bottom line: ECS got us to market faster than the alternatives.</p><p>ECS container management</p></li><li><p>Implementation</p></li><li><p>Logs</p><p>Metrics</p><p>Galaxy UI</p><p>App</p><p>images</p><p>App</p><p>state</p><p>Cluster 1</p><p>Manager</p><p>app app</p><p>app app</p><p>Cluster 2</p><p>app app</p><p>app app</p><p>Cluster 3</p><p>app app</p><p>app app Developer Admin</p><p>Manager</p><p>Manager</p><p>F R O N T E N D B A C K E N D</p></li><li><p>Galaxy</p><p>E C S C L U S T E R</p><p>A Z 1 A Z 2</p><p>App A App A App A App A</p></li><li><p>Galaxy</p><p>E C S C L U S T E R</p><p>E L B</p><p>A Z 1 A Z 2</p><p>App A App A App A App A</p><p>Proxy Proxy</p></li><li><p>Galaxy</p><p>E C S C L U S T E R</p><p>E L B</p><p>A Z 1 A Z 2</p><p>App A App A App A App A</p><p>App C Service B Service B</p><p>Proxy Scheduler Proxy Scheduler Proxy</p></li><li><p>Galaxy</p><p>Proxy</p><p>E C S C L U S T E R</p><p>E L B</p><p>Proxy</p><p>A Z 1 A Z 2</p><p>App A App A App A App A</p><p>Scheduler</p><p>App C Service B Service B</p><p>Galaxy UI Galaxy UI</p></li><li><p>Deeper Dive</p><p>Custom scheduler</p><p>Connected client proxy</p><p>User metrics</p></li><li><p> Need fine-grained control over how individual tasks are allocated to </p><p>container instances and across Availability Zones.</p><p> Container health depends on high-level behavior of app processes, not </p><p>just low-level checks.</p><p> Need rate limits and backoff policy when restarting application </p><p>containers. (Not our code; potentially not the same policy for all users.)</p><p> Users need visibility into container health.</p><p> Need to ensure that system-essential containers (proxy, Galaxy UI) can </p><p>be scheduled even if resources are over-committed.</p><p>Scheduling containers</p></li><li><p>The ECS default scheduler is not designed to do these kinds of things.</p><p>That's okay! 
Instead, ECS provides cluster state and task </p><p>management APIs needed to write our own. ~1,500 lines of Go.</p><p> High availability app containers must be </p><p>distributed across Availability Zones.</p><p> App containers should be evenly </p><p>distributed across instances in an </p><p>Availability Zone.</p><p> Container instances should be roughly </p><p>equally loaded.</p><p> Each container instance must have </p><p>space to run a proxy and a scheduler.</p><p>Also implements rate-limiting, application health checks, and </p><p>coordinated version updates.</p><p>Writing a custom scheduler</p></li><li><p>Logs</p><p>Metrics</p><p>Galaxy UI</p><p>App</p><p>images</p><p>App</p><p>state</p><p>Cluster 1</p><p>Scheduler</p><p>app app</p><p>app app</p><p>Cluster 2</p><p>app app</p><p>app app</p><p>Cluster 3</p><p>app app</p><p>app app Developer Admin</p><p>Scheduler</p><p>Scheduler</p><p>F R O N T E N D B A C K E N D</p><p>State sync</p></li><li><p>policies</p><p>scheduler ECS</p><p>API App state</p><p>Desired config</p><p>StartTask</p><p>StopTask</p><p>ListTasks</p><p>DescribeTasks</p><p>Actual config</p><p>[]</p><p>State sync</p></li><li><p> To ensure the scheduler stays alive, we create an ECS service </p><p>calling for exactly 1 scheduler task.</p><p> If the scheduler goes down, crashed containers will no longer be </p><p>restarted, and users won't be able to launch new containers or stop </p><p>old ones. Reasonable failure mode.</p><p> We're considering changing to a keep running model,</p><p>using Amazon DynamoDB to broker a leadership election between </p><p>the set.</p><p>Scheduling the scheduler</p></li><li><p> Manages the persistent connection between clients and</p><p>the appropriate application backend / microservice process.</p><p> Implements stable sessions + coordinated version updates.</p><p> Share-nothing architecture. 
Any proxy can serve any request.</p><p> High availability: multiple proxies in multiple Availability Zones.</p><p> Scheduled as an ECS service; binds to ELB.</p><p>Connected client proxy</p></li><li><p>Proxy</p><p>App A Galaxy ServerApp A App A App A Galaxy Server</p><p>E C S C l u s t e r</p><p>E L B</p><p>Proxy</p><p>App B App C App D</p><p>Scheduler</p><p>ELB routes traffic to any proxy. Any proxy can route to any app container. </p><p>ELB routes traffic on ports 80 and 443. The ELB is configured in TCP pass-through mode so that we can use </p><p>WebSockets.</p><p>A Z 1 A Z 2</p><p>Stable sessions</p></li><li><p>Proxy</p><p>App A Galaxy ServerApp A App A App A Galaxy Server</p><p>E C S C l u s t e r</p><p>E L B</p><p>Proxy</p><p>App B App C App D</p><p>Scheduler</p><p>A Z 1 A Z 2</p><p>Proxy routes initial request to random container, and applies a cookie to the client with the ID of the selected container. </p><p>On subsequent connections (XHR or interrupted WebSocket), proxy uses cookie to determine backend.</p><p>Stable sessions</p></li><li><p>Proxy</p><p>Galaxy ServerApp A App A App A Galaxy Server</p><p>E C S C l u s t e r</p><p>E L B</p><p>Proxy</p><p>App B App C App D</p><p>Scheduler</p><p>A Z 1 A Z 2</p><p>If desired backend is unavailable, proxy selects new backend and reapplies a cookie.</p><p>App A (dead)</p><p>App A</p><p>Stable sessions</p></li><li><p>Proxy</p><p>App A v1 Galaxy Server</p><p>App A v1</p><p>App A v1</p><p>App A v1</p><p>Galaxy Server</p><p>E C S C l u s t e r</p><p>E L B</p><p>Proxy Scheduler</p><p>A Z 1 A Z 2</p><p>v1</p><p>App updates require the cooperation of the scheduler and proxy components.</p><p>Coordinated version updates</p></li><li><p>Proxy</p><p>App A v1 Galaxy Server</p><p>App A v1</p><p>App A v1</p><p>App A v1</p><p>Galaxy Server</p><p>E C S C l u s t e r</p><p>E L B</p><p>Proxy Scheduler</p><p>A Z 1 A Z 2</p><p>App A v2</p><p>App A v2</p><p>App A v2</p><p>App A v2</p><p>v1</p><p>First step is to spin up new containers 
in parallel with the old.</p><p>(This can be done in a rolling fashion, not shown here.)</p><p>Coordinated version updates</p></li><li><p>Proxy</p><p>App A (dead) Galaxy Server Galaxy Server</p><p>E C S C l u s t e r</p><p>E L B</p><p>Proxy Scheduler</p><p>A Z 1 A Z 2</p><p>App A v2</p><p>App A v2</p><p>App A v2</p><p>App A v2</p><p>App A (dead)</p><p>App A v1</p><p>App A v1</p><p>v1</p><p>Once new containers pass health checks, scheduler starts to tear down old client connections and the old </p><p>containers.</p><p>Coordinated version updates</p></li><li><p>Proxy</p><p>App A (dead) Galaxy Server Galaxy Server</p><p>E C S C l u s t e r</p><p>E L B</p><p>Proxy Scheduler</p><p>A Z 1 A Z 2</p><p>App A v2</p><p>App A v2</p><p>App A v2</p><p>App A v2</p><p>App A (dead)</p><p>App A (dead)</p><p>App A (dead)</p><p>v1 v2</p><p>Proxy recognizes code update in progress, ignores session cookie, and routes client to new container</p><p>(establishing new stable session).</p><p>Coordinated version updates</p></li><li><p> Galaxy collects metrics on CPU, memory, network traffic, </p><p>and a count of connected clients from each running app.</p><p> collector process (one per container instance) streams container </p><p>metrics via Docker Remote API, and polls proxy metrics on a known </p><p>port.</p><p> aggregator process (singleton) polls each collector, computes </p><p>aggregate rollups (hourly, daily), stores each time series in </p><p>DynamoDB.</p><p> Aggregator expires old metrics. 
Tables are sharded by time range.</p><p> Galaxy server reads directly from DynamoDB.</p><p>Metrics</p></li><li><p>"With the amount of growth we have seen after our launch last year, keeping </p><p>the servers alive has been an uphill battle until Galaxy came along."</p><p> Tigran Sloyan, Codefights</p><p>"Galaxy solved many of the ongoing challenges we had with our previous </p><p>server stack: load balancing across sticky sessions, scaling processes, etc."</p><p> Shawn Young, Classcraft</p><p>Loosely coupled architecture working well for us</p><p>High availability strategy works: apps stayed up during IaaS outages</p><p>Our experience so far</p></li><li><p>Multiple clusters</p><p>Additional AWS regions</p><p>On-prem (customer-supplied IAM credentials)</p><p>Free tier and instance cost optimizations</p><p> and more </p><p>What's next</p></li><li><p>The JavaScript app platform</p><p>www.meteor.com</p><p>Galaxy available now!</p></li><li><p>Thank you!</p><p>Matt DeBergalis @debergalis</p></li><li><p>Remember to complete </p><p>your evaluations!</p></li></ul>
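
The "State sync" slide describes the scheduler comparing a desired config against the actual config and issuing StartTask / StopTask calls. The following is a minimal sketch of that comparison step in Go, not Galaxy's actual scheduler. The `Task`, `Action`, and `Reconcile` names and the per-app container counts are hypothetical, and the ECS API calls themselves are left out: the function only returns the list of starts and stops a caller would then send.

```go
package main

import (
	"fmt"
	"sort"
)

// Task is a hypothetical, simplified view of one running container.
type Task struct {
	ID  string
	App string
}

// Action is one step a caller would translate into a StartTask or
// StopTask request. TaskID is set only for "stop" actions.
type Action struct {
	Op     string // "start" or "stop"
	App    string
	TaskID string
}

// Reconcile compares the desired container count per app with the tasks
// that are actually running and returns the actions needed to converge.
// Apps that are running but absent from desired are stopped entirely.
func Reconcile(desired map[string]int, actual []Task) []Action {
	running := map[string][]string{}
	for _, t := range actual {
		running[t.App] = append(running[t.App], t.ID)
	}

	// Collect every app name seen on either side, sorted so the
	// resulting action list is deterministic.
	seen := map[string]bool{}
	var apps []string
	for app := range desired {
		if !seen[app] {
			seen[app] = true
			apps = append(apps, app)
		}
	}
	for app := range running {
		if !seen[app] {
			seen[app] = true
			apps = append(apps, app)
		}
	}
	sort.Strings(apps)

	var actions []Action
	for _, app := range apps {
		ids := running[app]
		sort.Strings(ids)
		want := desired[app]
		// Too few containers: request one start per missing container.
		for i := len(ids); i < want; i++ {
			actions = append(actions, Action{Op: "start", App: app})
		}
		// Too many containers: stop the surplus, highest IDs first kept last.
		for i := want; i < len(ids); i++ {
			actions = append(actions, Action{Op: "stop", App: app, TaskID: ids[i]})
		}
	}
	return actions
}

func main() {
	desired := map[string]int{"app-a": 3, "service-b": 1}
	actual := []Task{
		{ID: "t1", App: "app-a"},
		{ID: "t2", App: "service-b"},
		{ID: "t3", App: "service-b"},
		{ID: "t4", App: "app-c"},
	}
	for _, a := range Reconcile(desired, actual) {
		fmt.Printf("%s %s %s\n", a.Op, a.App, a.TaskID)
	}
}
```

A real scheduler of the kind the slides describe would also apply the placement policies (spreading across Availability Zones, balancing instance load) and the rate limits and backoff mentioned earlier before acting on this list; those are omitted here to keep the sketch short.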
