The Data Dichotomy:
Rethinking Data & Services with Streams
What are microservices really about?
GUI
UI Service
Orders Service
Returns Service
Fulfilment Service
Payment Service
Stock Service
Splitting the Monolith: from a single process / code base / deployment to many processes / code bases / deployments
Orders Service
Autonomy?
Orders Service
Stock Service
Email Service
Fulfilment Service
Each service is independently deployable
Independence is where services get
their value
Allows Scaling
but not just in terms of data/load/throughput...
Scaling in terms of people
What happens when we grow?
Companies are inevitably a collection of applications. They must work together to some degree.
Interconnection is an afterthought
FTP / Enterprise Messaging etc
[Diagram: the service's Internal World, surrounded by the External World, separated by the Service Boundary]
Services Force Us To Consider The External World
External World is something we should Design For
But this independence
comes at a cost $$$
[Diagram: a Statement Object calls getOpenOrders() on an Orders Object]
Less Surface Area = Less Coupling
Encapsulate State
Encapsulate Functionality
Consider Two Objects in one address space
Encapsulation => Loose Coupling
[Diagram: Change 1 and Change 2 are made together, then redeployed once]
Singly-deployable apps are easy
[Diagram: a Statement Service calls getOpenOrders() on an Orders Service; each is independently deployable]
Synchronized changes are painful
Services work best where requirements are isolated in a single
bounded context
[Diagram: a Business Service calls authorise() on a Single Sign On service]
It’s unlikely that a Business Service would need the internal SSO state / function to change
Single Sign On
SSO has a tightly bounded context
But business services are
different
Catalog Authorisation
Most business services share the same core stream of facts.
Most services live in here
The futures of business services
are far more tightly intertwined.
We need encapsulation to hide internal state and stay loosely coupled. But we need the freedom to slice & dice shared data like any other dataset.
But data systems have little to do
with encapsulation
[Diagram: a Service keeps its data on the inside; a Database exposes its data on the outside]
A service interface hides data. A database interface amplifies it: databases amplify the data they hold.
The data dichotomy Data systems are about exposing data.
Services are about hiding it.
This affects services in one of
two ways
getOpenOrders(fulfilled=false, deliveryLocation=CA, orderValue=100, operator=GreaterThan)
Either (1) we constantly add to the interface, as datasets grow
getOrder(Id)
getOrder(UserId)
getAllOpenOrders()
getAllOrdersUnfulfilled(Id)
getAllOrders()
(1) Services can end up looking like kooky, home-grown databases
...and DATA amplifies this service boundary
problem
(2) Give up and move whole datasets en masse
getAllOrders()
Data Copied Internally
This leads to a different problem
Data diverges over time
Orders Service
Stock Service
Email Service
Fulfilment Service
The more mutable copies,
the more data will diverge over time
[Flowchart: Start here → nice, neat services → service contract too limiting → can we change both services together, easily? NO → broaden the contract (Eek $$$) → frack it! Just give me ALL the data → is it a shared database? YES → data diverges (many different versions of the same facts) → let's encapsulate → let's centralise to one copy → back to nice, neat services]
Cycle of inadequacy:
These forces compete in the systems we build
[Diagram: the competing forces of divergence, accessibility and encapsulation, on a spectrum: a shared database gives better accessibility; service interfaces give better independence]
No perfect solution
Is there a better way?
[Diagram: data on the inside of the Service Boundary, surrounded by data on the outside]
Make data-on-the-outside a first-class citizen
Separate Reads & Writes with Immutable Streams
CQRS
• Writes => “Normalized” into each service
• Reads => “Denormalized” into Services
• One canonical set of streams
[Diagram: the canonical Orders stream, shared by the Orders, Stock, Fulfilment, UI and Fraud Services]
Writes enter via the Orders Service (which manages the workflow of orders)
Readers use streams, materialized in each service
Important: data is not retained, it’s just a view over those centralized streams
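The "view, not a copy" idea can be sketched in a few lines of Python (the event shape and the `materialize` helper are invented for illustration): each service folds the canonical stream into a local lookup, and because the stream remains the source of truth, the view can be dropped and rebuilt at any time.

```python
# Illustrative sketch: a service-local view materialized from a canonical,
# immutable stream of order events. The view holds no data of its own;
# dropping it and replaying the stream reproduces it exactly.

def materialize(stream):
    """Fold an ordered stream of order events into a key -> latest-state view."""
    view = {}
    for event in stream:
        view[event["order_id"]] = event  # later events overwrite earlier ones
    return view

# A canonical stream of facts (hypothetical shape).
orders_stream = [
    {"order_id": 1, "status": "CREATED"},
    {"order_id": 2, "status": "CREATED"},
    {"order_id": 1, "status": "SHIPPED"},
]

view = materialize(orders_stream)
assert view[1]["status"] == "SHIPPED"

# The view is disposable: rebuilding from the stream yields the same result.
assert materialize(orders_stream) == view
```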
Use a Stream Processing tool-kit
Kafka is a Streaming Platform
[Diagram: the Kafka platform, with the Log at its core, Producer and Consumer APIs, Connectors, and a Streaming Engine]
The Log
Shard on the way in
[Diagram: Producing Services → Kafka → Consuming Services]
Each shard is a queue
Consumers share load
Scaling becomes a concern of the broker, not the service
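A minimal Python sketch of this sharding model (simplified stand-ins: real Kafka uses murmur2 hashing on the key bytes and pluggable group assignors, not the toy hash and round-robin below): events for one key always land in one shard, and consumers in a group split the shards between them.

```python
# Illustrative sketch of key-based sharding and consumer load-sharing.

NUM_PARTITIONS = 4

def partition_for(key):
    # Stable hash stand-in: all events for a key go to the same partition,
    # so per-key ordering is preserved.
    return sum(key.encode()) % NUM_PARTITIONS

assert partition_for("order-42") == partition_for("order-42")

def assign(partitions, consumers):
    """Round-robin partition assignment across a consumer group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

assignment = assign(list(range(NUM_PARTITIONS)), ["email-1", "email-2"])
# Each consumer owns a disjoint share of the partitions.
assert assignment == {"email-1": [0, 2], "email-2": [1, 3]}
```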
Load Balanced Services
Fault Tolerant Services
Build ‘Always On’ Services
Rely on Fault Tolerant Broker
Reset to any point in the shared narrative
Rewind & Replay
Compacted Log (retains only the latest version)
[Diagram: three keys holding versions 1-3, 1-2 and 1-5; after compaction only the latest version of each key remains]
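Compaction can be sketched as a simple fold over the log (illustrative only: real Kafka compacts log segments in the background and also honours delete tombstones):

```python
# Illustrative sketch of log compaction: for each key, only the latest
# version survives.

def compact(log):
    """Return a compacted log: one (key, value) entry per key, latest wins."""
    latest = {}
    for key, value in log:  # the log is read in offset order
        latest[key] = value
    return list(latest.items())

log = [("A", "v1"), ("B", "v1"), ("A", "v2"), ("B", "v2"), ("A", "v3")]
assert compact(log) == [("A", "v3"), ("B", "v2")]
```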
Service Backbone: scalable, fault tolerant, concurrent, strongly ordered, retentive
A place to keep the data-on-the-outside as an immutable narrative
Now add Stream Processing
Kafka’s Streaming API A general, embeddable streaming engine
What is Stream Processing?
Max(price) From orders where ccy=‘GBP’ over 1 day window emitting every second
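That query can be sketched in Python to make the semantics concrete (the event shape and tumbling-window bucketing are invented for illustration; a real streaming engine emits incremental updates every second rather than a final result):

```python
# Illustrative sketch of: max(price) from orders where ccy = 'GBP'
# over a 1-day window.

from collections import defaultdict

DAY = 24 * 60 * 60  # window size in seconds

def max_price_per_day(orders):
    windows = defaultdict(lambda: float("-inf"))
    for o in orders:
        if o["ccy"] != "GBP":
            continue  # the WHERE clause
        window_start = o["ts"] - o["ts"] % DAY  # tumbling 1-day window
        windows[window_start] = max(windows[window_start], o["price"])
    return dict(windows)

orders = [
    {"ts": 100,      "ccy": "GBP", "price": 10.0},
    {"ts": 200,      "ccy": "USD", "price": 99.0},  # filtered out
    {"ts": 300,      "ccy": "GBP", "price": 12.5},
    {"ts": DAY + 50, "ccy": "GBP", "price": 7.0},   # next day's window
]
assert max_price_per_day(orders) == {0: 12.5, DAY: 7.0}
```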
DB Engine designed to process streams
Features: similar to database query engine
Join, Filter, Aggregate, View, Window
View
Window
Avg(o.time – p.time) From orders, payment, user Group by user.region over 1 day window emitting every second
Stateful Stream Processing
[Diagram: streams flow into a stream processing engine, which derives a "table" consumed by domain logic]
[Diagram: a stream joined with a compacted stream: stream data plus stream-tabular data feed domain logic, producing tables / views, all on Kafka]
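A stream-table join can be sketched like this (hypothetical event shapes; in Kafka Streams this is a KStream-KTable join): the compacted stream is materialized into a table, and each stream event is enriched with its matching row.

```python
# Illustrative sketch of a stream-table join: a payments stream is
# enriched against a table materialized from a compacted customers stream.

def build_table(compacted_stream):
    """Latest value per key: the 'table' side of the join."""
    return {k: v for k, v in compacted_stream}

def join(stream, table, key_field):
    """Enrich each stream event with the matching table row (inner join)."""
    for event in stream:
        row = table.get(event[key_field])
        if row is not None:
            yield {**event, **row}

customers = [("c1", {"region": "EU"}), ("c2", {"region": "US"})]
payments = [
    {"customer": "c1", "amount": 20},
    {"customer": "c3", "amount": 5},  # no matching row: dropped
]

table = build_table(customers)
enriched = list(join(payments, table, "customer"))
assert enriched == [{"customer": "c1", "amount": 20, "region": "EU"}]
```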
Event Driven Service
Join shared streams from many other services
[Diagram: an Email Service uses Kafka Streams (KSTREAMS) to join the Orders, Payments and Stock streams, plus a Legacy App's stream, from Kafka]
Shared State is only cached in the service, so there is no way to diverge
[Diagram: the Orders, Payments and Stock streams live in Kafka; the Email Service only caches them locally via KSTREAMS]
Tool for accessing shared, retentive streams
[Diagram: shared, retentive streams: customer orders, catalogue, payments]
But sometimes you have to move data
Replicate it, so both copies are identical
Keep Immutable
Take only the data you need today
Confluent's Connectors make this easier
[Diagram: a Connector pulls data from Oracle into the Log; another Connector pushes it out to Cassandra]
So…
When building services consider more than just
REST
The data dichotomy Data systems are about exposing data.
Services are about hiding it.
Remember:
Shared data is a reality for most organisations
Catalog
Embrace the data that both lives and flows between services
Avoid data services that have complex / amplifying
interfaces
Share Data via Immutable Streams
Embed Function into each service
Distributed Log → Event Driven Service
(A) Shared Database: data & function centralized
(B) Messaging: data & function everywhere
(C) Service Interfaces: data & function in owning services
Middle-ground between shared database, messaging and service interfaces
[Diagram: the middle ground of the spectrum: shared database (better accessibility) ↔ service interfaces (better independence)]
Balance the Data Dichotomy
RIDER Principles
• Reactive - Leverage Asynchronicity. Focus on the now.
• Immutable - Build an immutable, shared narrative.
• Decentralized - No GOD services. Receiver driven.
• Evolutionary - Use only the data you need today.
• Retentive - Rely on the central event log, indefinitely.
Event Driven Service Backbone
The Log (streams & tables)
Ingestion Services
Services with polyglot persistence
Simple Services
Streaming Services
Tune in next time
• How do we actually build these things?
• Putting the micro into microservices
Discount code: kafcom17. Use the Apache Kafka community discount code to get $50 off at www.kafka-summit.org. Kafka Summit New York: May 8. Kafka Summit San Francisco: August 28.
Presented by:
Twitter: @benstopford
Blog of this talk at http://confluent.io/blog