Evolving the Netflix API
Katharina Probst
Engineering Manager, API
October 2015
What is Netflix?
> 1000 Devices
Is it significant?
❏ Netflix accounts for almost 35% of peak downstream traffic in the US
❏ Almost 70 million subscribers worldwide and growing
Source: http://www.sandvine.com/news/global_broadband_trends.asp
We’re going global!
Source: https://help.netflix.com/en/node/14164
Recent additions: Spain, Portugal, Italy
Current availability
Netflix Originals
Do we need a Netflix API?
API
Personalization Engine, User Info, Ratings, Similar Movies, A/B Test Engine, …
Uses
❏ Discovery
❏ Signup
❏ Playback
❏ Internal teams only
API
Goals
❏ Flexibility
❏ Resiliency
❏ Scalability
❏ Excellent tools
API
Lots of devices, lots of variety
Different interaction models
And just to make things a little more interesting…
❏ A/B tests
❏ profiles
❏ localization
What we felt we had
What we needed
❏ Reduce network chattiness
❏ Support device optimizations
❏ Enable faster development for internal users
Remote API: GET /users/{user_id}/lists
Local method: apiGateway.getLists(userId)
Discrete HTTP requests pay network tax repeatedly
Single, optimized request; pay network tax once
Client data assembly logic pushed to server
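The difference between the two request patterns can be sketched as follows. This is a minimal illustration, not the Netflix API: `CountingNetwork`, `home_screen`, and the three fine-grained functions are hypothetical names, and the payloads are made up. The only point is that moving data assembly to the server turns several round trips into one.

```python
# Hypothetical illustration: all names and payloads below are made up.
class CountingNetwork:
    """Stands in for the device-to-server network and counts round trips."""

    def __init__(self):
        self.round_trips = 0

    def call(self, handler, *args):
        self.round_trips += 1  # every remote call pays the "network tax"
        return handler(*args)


# Fine-grained operations, running in-process on the server.
def get_user(user_id):
    return {"id": user_id, "name": "user-%d" % user_id}


def get_lists(user_id):
    return ["Continue Watching", "Top Picks"]


def get_ratings(user_id):
    return {"title-1": 4}


def home_screen(user_id):
    """Coarse-grained endpoint: the data assembly happens on the server."""
    return {
        "user": get_user(user_id),
        "lists": get_lists(user_id),
        "ratings": get_ratings(user_id),
    }


def chatty_client(net, user_id):
    """Device assembles the screen itself: three remote calls."""
    return {
        "user": net.call(get_user, user_id),
        "lists": net.call(get_lists, user_id),
        "ratings": net.call(get_ratings, user_id),
    }


def optimized_client(net, user_id):
    """Device asks for the assembled screen: one remote call."""
    return net.call(home_screen, user_id)
```

Both clients end up with the same data; only the number of round trips differs.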
Add server-side scripting capability
❏ Enable independent development & device optimization
❏ Profit
Impact on velocity and collaboration
❏ UI (script) changes can happen independently
❏ Script changes can be pushed to running servers, so decoupled from API push schedule
❏ Server+UI changes usually involve API team
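[Architecture diagram. Client side: Application. Across the Internet, server side: ELB, Zuul, then the API Layer (Scriptable Backend with endpoints such as /tv/home, plus the Java Service Layer with RxJava and Hystrix), which calls the Mid-tier Services. Teams shown: UI Teams, API Team, Service Teams.]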
Goals
❏ Flexibility
❏ Resiliency
❏ Scalability
❏ Excellent tools
API
Hystrix: resilience patterns for distributed systems
https://github.com/Netflix/Hystrix
Hystrix Primer
❏ Protection from and control over latency and failure from dependencies
❏ Stop cascading failures in a complex distributed system
❏ Fail fast and rapidly recover
❏ Fall back and gracefully degrade
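The "fail fast" and "fall back" ideas above can be sketched with a generic circuit breaker. This is not the Hystrix API (Hystrix is a Java library with its own command classes, thread pools, and metrics); the class name, thresholds, and injectable clock below are assumptions made for illustration.

```python
import time


class CircuitBreaker:
    """Generic circuit breaker with a fallback; a sketch, not the Hystrix API."""

    def __init__(self, failure_threshold=3, reset_after_s=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock          # injectable so the behavior can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, command, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()   # fail fast: do not touch the dependency
            self.opened_at = None   # wait is over: let one trial call through
        try:
            result = command()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open the circuit
            return fallback()       # degrade gracefully instead of failing
        self.failures = 0           # a success resets the failure count
        return result
```

A caller wraps each dependency call, for example `breaker.call(fetch_ratings, lambda: [])`, where `fetch_ratings` is a hypothetical function that may raise. While the circuit is open the dependency is not called at all, which is what keeps one slow or failing service from tying up the caller.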
[Diagram sequence: the API calls its dependencies: Personalization Engine, Similar Movies, Movie Metadata, Ratings, User Info, Instant Queue, A/B Test Engine.]
Beware cascading failure!
Local fallback avoids cascading failure: the API returns a fallback response
Use FIT (Failure Injection Testing) to test such failures
Goals
❏ Flexibility
❏ Resiliency
❏ Scalability
❏ Excellent tools
API
Autoscaling & Capacity Management
http://nflx.it/1LvqLUi
AWS controls are reactive and do not scale up fast enough
Fine-grained control with Scryer complements AWS controls
❏ Faster scale-up, improved cost
❏ Use reactive policy for organic scale down
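One way to picture how a predictive signal and a reactive policy can work together is sketched below. This is not Scryer's algorithm or an AWS API: the function names, the per-instance throughput, and the one-step scale-down rule are all assumptions chosen for illustration. The prediction sets a capacity floor ahead of the load, and the reactive signal handles unexpected spikes and gradual scale-down.

```python
import math


def instances_for(rps, rps_per_instance):
    """Instances needed to serve a given request rate (at least one)."""
    return max(1, math.ceil(rps / rps_per_instance))


def plan_capacity(current, observed_rps, predicted_rps,
                  rps_per_instance=100, scale_down_step=1):
    """Return the instance count for the next interval (hypothetical policy)."""
    # Predictive part: raise the floor before the predicted load arrives.
    floor = instances_for(predicted_rps, rps_per_instance)
    # Reactive part: follow observed load, shrinking one step at a time.
    reactive_target = instances_for(observed_rps, rps_per_instance)
    if reactive_target < current:
        reactive_target = max(reactive_target, current - scale_down_step)
    return max(floor, reactive_target)
```

For example, with 10 instances, 900 observed requests per second, and 2000 predicted, the plan jumps to 20 instances before the load arrives. When both signals are low, capacity shrinks by one step per interval.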
Goals
❏ Flexibility
❏ Resiliency
❏ Scalability
❏ Excellent tools
API
Run 1% of your traffic on the new code and see how it does
❏ Response codes: 2xx, 4xx, 5xx
❏ latency
❏ network
❏ busy threads
❏ load
❏ ...
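A canary check of this kind can be sketched as a comparison between a control group on the old code and the canary on the new code. The metric names, the sample values, and the 10% tolerance are hypothetical, and all metrics are assumed to be "lower is better"; a real canary analysis would use more careful statistics.

```python
def canary_passes(control, canary, max_relative_increase=0.10):
    """Compare 'lower is better' metrics; return (passed, regressions).

    A metric regresses when the canary value exceeds the control value
    by more than max_relative_increase. Note that a control value of 0
    allows no increase at all.
    """
    regressions = {}
    for name, baseline in control.items():
        observed = canary[name]
        allowed = baseline * (1 + max_relative_increase)
        if observed > allowed:
            regressions[name] = (baseline, observed)
    return (not regressions, regressions)


# Hypothetical measurements from the control and canary clusters.
control = {"error_rate_5xx": 0.010, "latency_ms_p99": 200.0, "busy_threads": 40.0}
good_canary = {"error_rate_5xx": 0.0105, "latency_ms_p99": 205.0, "busy_threads": 41.0}
slow_canary = {"error_rate_5xx": 0.010, "latency_ms_p99": 260.0, "busy_threads": 40.0}
```

`canary_passes(control, good_canary)` passes, while `slow_canary` is flagged for its latency regression and would not proceed to a full push.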
So you’ve run a canary. Now what?
Control vs. canary
Successful canary → red/black push
Continuous Delivery
http://techblog.netflix.com/2015/09/moving-from-asgard-to-spinnaker.html
Quickly see status of all clusters
Script Management
Deployment & Ops
Real-time analysis
http://www.slideshare.net/g9yuayon/qcon-talk-on-netflix-mantis-a-stream-processing-system
Submit a query, see requests in real time.
Looking ahead - current challenges
❏ Breaking up the monolith
❏ Script isolation
❏ Thin client libraries
❏ New interaction models
Looking ahead
Source: http://techcrunch.com/2014/03/08/success-reality-and-the-myth-of-up-and-to-the-right/
Breaking up the monolith
● > 900 active endpoints
● ~ 30 client libraries
● 78 thread pools
● high memory usage
Script isolation & node
❏ Groovy scripts run as part of API process
❏ UI teams would like to use other languages (in particular node.js)
[Diagram labels: API remote service layer, Service client libraries, UI/device scripts (node), Falcor]
var response = model.get("todos[0..2]['name','done']");
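The Falcor path in the call above names several values at once. The sketch below shows how such a path set fans out into concrete paths, assuming the two-dot range `0..2` is inclusive and using the JSON form of the path set (a range as a `from`/`to` object, a key list as an array). The helper names are hypothetical and this is not Falcor's implementation.

```python
from itertools import product


def expand_key_set(key_set):
    """Turn one path-set segment into the list of concrete keys it covers."""
    if isinstance(key_set, dict):   # range object, inclusive on both ends
        return list(range(key_set["from"], key_set["to"] + 1))
    if isinstance(key_set, list):   # explicit list of keys (may hold ranges)
        return [key for item in key_set for key in expand_key_set(item)]
    return [key_set]                # a single key


def expand_path_set(path_set):
    """Expand a path set into every concrete path it refers to."""
    segments = [expand_key_set(segment) for segment in path_set]
    return [list(path) for path in product(*segments)]


# JSON form of the path string todos[0..2]['name','done']
paths = expand_path_set(["todos", {"from": 0, "to": 2}, ["name", "done"]])
```

Here `paths` holds six entries, from `["todos", 0, "name"]` to `["todos", 2, "done"]`, so one request covers the name and done flag of three todos.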
Thin client libraries
❏ Many client libraries contain a lot of business logic and have a lot of dependencies
❏ Move business logic and dependencies to server
New interaction models
❏ request/response
❏ request/stream
❏ fire-and-forget
❏ event subscription
❏ channel
http://reactivesocket.io
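Three of the interaction models listed above can be contrasted with a small in-process sketch. This is not the ReactiveSocket API: the `Responder` class and its payloads are hypothetical, there is no network involved, and event subscription and channel are left out for brevity.

```python
import asyncio


class Responder:
    """Toy in-process endpoint; a real one would sit behind a connection."""

    def __init__(self):
        self.events = []

    async def request_response(self, title_id):
        # One request, exactly one response.
        return {"id": title_id, "title": "title-%d" % title_id}

    async def request_stream(self, first, count):
        # One request, a finite stream of responses.
        for title_id in range(first, first + count):
            yield {"id": title_id}

    def fire_and_forget(self, event):
        # One request, no response; the caller does not wait for a result.
        self.events.append(event)


async def demo():
    responder = Responder()
    one = await responder.request_response(1)
    many = [item async for item in responder.request_stream(10, 3)]
    responder.fire_and_forget({"type": "impression", "id": 1})
    return one, many, responder.events


one, many, events = asyncio.run(demo())
```

The shapes differ only in how many responses flow back: one, many, or none.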
In the beginning...
Katharina Probst | www.linkedin.com/in/katharinaprobst