Confidential

Kafka Streams: The New Smart Kid On The Block
The Stream Processing Engine of Apache Kafka

Eno Thereska, eno@confluent.io (@enothereska)
Big Data London 2016
Slide contributions: Michael Noll
Apache Kafka and Kafka Streams API
What is Kafka Streams: Unix analogy

$ cat < in.txt | grep "apache" | tr a-z A-Z > out.txt

Kafka Core | Kafka Connect | Kafka Streams
When to use Kafka Streams

• Mainstream application development
• When running a cluster would suck
• Microservices
• Fast Data apps for small and big data
• Large-scale continuous queries and transformations
• Event-triggered processes
• Reactive applications
• The “T” in ETL
• <and more>

• Use case examples
  • Real-time monitoring and intelligence
  • Customer 360-degree view
  • Fraud detection
  • Location-based marketing
  • Fleet management
  • <and more>
Some use cases in the wild & external articles

• Applying Kafka Streams for internal message delivery pipeline at LINE Corp.
  • http://developers.linecorp.com/blog/?p=3960
  • Kafka Streams in production at LINE, a social platform based in Japan with 220+ million users
• Microservices and reactive applications
  • https://speakerdeck.com/bobbycalderwood/commander-decoupled-immutable-rest-apis-with-kafka-streams
• User behavior analysis
  • https://timothyrenner.github.io/engineering/2016/08/11/kafka-streams-not-looking-at-facebook.html
• Containerized Kafka Streams applications in Scala
  • https://www.madewithtea.com/processing-tweets-with-kafka-streams.html
• Geo-spatial data analysis
  • http://www.infolace.com/blog/2016/07/14/simple-spatial-windowing-with-kafka-streams/
• Language classification with machine learning
  • https://dzone.com/articles/machine-learning-with-kafka-streams
Architecture comparison: use case example

Real-time dashboard for security monitoring
“Which of my data centers are under attack?”
Architecture comparison: use case example

Before: undue complexity, heavy footprint, many technologies, split ownership with conflicting priorities (your code runs as a “Job” on someone else’s cluster)
1. Capture business events in Kafka
2. Must process events with a separate cluster (e.g. Spark)
3. Must share latest results through separate systems (e.g. MySQL)
4. Other apps access the latest results by querying these DBs

With Kafka Streams: simplified, app-centric architecture that puts app owners in control (your code is just your app)
1. Capture business events in Kafka
2. Process events with standard Java apps that use Kafka Streams
3. Now other apps can directly query the latest results

• Conflicting priorities: infrastructure teams vs. product teams
• Complexity: a lot of moving pieces that are also complex individually
• Is all this part of the solution or part of your problem?
How do I install Kafka Streams?

• There is and there should be no “installation” – Build Apps, Not Clusters!
• It’s a library. Add it to your app like any other library.

<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-streams</artifactId>
    <version>0.10.0.1</version>
</dependency>
How do I package and deploy my apps? How do I …?

• Whatever works for you. Stick to what you/your company think is the best way.
• Kafka Streams integrates well with what you already have.
• Why? Because an app that uses Kafka Streams is… a normal Java app.
Available APIs
API option 1: Kafka Streams DSL (declarative)

KStream<Integer, Integer> input = builder.stream("numbers-topic");

// Stateless computation
KStream<Integer, Integer> doubled = input.mapValues(v -> v * 2);

// Stateful computation
KTable<Integer, Integer> sumOfOdds = input
    .filter((k, v) -> v % 2 != 0)
    .selectKey((k, v) -> 1)
    .groupByKey()
    .reduce((v1, v2) -> v1 + v2, "sum-of-odds");

The preferred API for most use cases.
The DSL particularly appeals to users:
• When familiar with Spark, Flink
• When fans of Scala or functional programming
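To see what the stateful part of the snippet above computes, here is the same sum-of-odds logic traced with plain Java streams – an illustrative analogy (the class and method names are invented for this sketch), not the Kafka Streams runtime:

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java analogy of the sum-of-odds topology:
// keep only the odd values (filter), ignore the original keys
// (selectKey to a constant), and fold everything into one sum (reduce).
public class SumOfOddsSketch {

    public static int sumOfOdds(List<Integer> values) {
        return values.stream()
                .filter(v -> v % 2 != 0)      // filter((k, v) -> v % 2 != 0)
                .mapToInt(Integer::intValue)
                .sum();                       // reduce((v1, v2) -> v1 + v2)
    }

    public static void main(String[] args) {
        // 1 + 3 + 5 = 9
        System.out.println(sumOfOdds(Arrays.asList(1, 2, 3, 4, 5)));
    }
}
```

The real DSL version differs in one important way: the KTable result is continuously updated as new records arrive, whereas this sketch computes over a fixed list.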
API option 2: Processor API (imperative)

class PrintToConsoleProcessor implements Processor<K, V> {

    @Override
    public void init(ProcessorContext context) {}

    @Override
    public void process(K key, V value) {
        System.out.println("Received record with " +
            "key=" + key + " and value=" + value);
    }

    @Override
    public void punctuate(long timestamp) {}

    @Override
    public void close() {}
}

Full flexibility but more manual work.
The Processor API appeals to users:
• When familiar with Storm, Samza
  • Still, check out the DSL!
• When requiring functionality that is not yet available in the DSL
“My WordCount is better than your WordCount” (?)

Kafka vs. Spark

These isolated code snippets are nice (and actually quite similar), but they are not very meaningful. In practice, we also need to read data from somewhere, write data back to somewhere, etc. – but we can see none of this here.
WordCount in Kafka

(Code screenshot: the WordCount topology in the Kafka Streams DSL.)
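The slide shows the actual DSL code as a screenshot; as a textual stand-in, here is the same word-count logic in plain Java. All names here are invented for the sketch, and it covers the counting logic only, not the Kafka Streams topology:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for the WordCount topology:
// lower-case each line, split it into words (flatMapValues in the DSL),
// then count occurrences per word (groupBy + count into a KTable).
public class WordCountSketch {

    public static Map<String, Long> countWords(List<String> lines) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords(Arrays.asList("hello kafka streams", "hello kafka")));
    }
}
```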
Compared to: WordCount in Spark 2.0

(Code screenshot with callouts 1–3.)
Runtime model leaks into processing logic (here: interfacing from Spark with Kafka)
Compared to: WordCount in Spark 2.0 (continued)

(Code screenshot with callouts 4–5.)
Runtime model leaks into processing logic (driver vs. executors)
Kafka Streams key concepts
Key concepts
Kafka Core Kafka Streams
Streams meet Tables
http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple
http://docs.confluent.io/current/streams/concepts.html#duality-of-streams-and-tables
Motivating example: continuously compute current users per geo-region

Real-time dashboard: “How many users younger than 30y, per region?”

Two input topics feed the computation:
• user-locations (owned by the mobile team)
• user-prefs (owned by the web team)

Current user records, e.g.: alice → Asia, 25y, …; bob → Europe, 46y, …
As records flow in, the tables and the dashboard update continuously:
• A new record arrives in user-locations: alice → Europe.
• The joined profile table now reads alice → Europe, 25y, … (bob → Europe, 46y, … is unchanged).
• The dashboard adjusts the per-region counts: −1 for Asia (8 → 7), +1 for Europe (5 → 6).
Same data, but different use cases require different interpretations
alice San Francisco
alice New York City
alice Rio de Janeiro
alice Sydney
alice Beijing
alice Paris
alice Berlin
Use case 1: Frequent traveler status?
Use case 2: Current location?
Same data, but different use cases require different interpretations

Use case 1: Frequent traveler status? → “Alice has been to SFO, NYC, Rio, Sydney, Beijing, Paris, and finally Berlin.”
Use case 2: Current location? → “Alice is in Berlin right now.” (on the slide, the earlier cities are crossed out)
Streams meet Tables

When you need…          | read the topic into a | so that it is interpreted as a | with messages interpreted as | Example
All the values of a key | KStream               | record stream                  | INSERT (append)              | All the places Alice has ever been to
Latest value of a key   | KTable                | changelog stream               | UPDATE (overwrite existing)  | Where Alice is right now
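The two interpretations can be mimicked with plain Java collections: a List for the record-stream (KStream) reading and a Map for the changelog-stream (KTable) reading. This is an analogy to illustrate the duality, not the real API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Two interpretations of the same topic of (user, place) records:
// - record stream (KStream):    every record is an INSERT -> keep all values
// - changelog stream (KTable):  every record is an UPDATE -> keep latest value
public class StreamTableDuality {

    public static List<String> asStream(List<String[]> records) {
        List<String> allPlaces = new ArrayList<>();
        for (String[] r : records) {
            allPlaces.add(r[1]);        // append: all the places Alice has been to
        }
        return allPlaces;
    }

    public static Map<String, String> asTable(List<String[]> records) {
        Map<String, String> latest = new HashMap<>();
        for (String[] r : records) {
            latest.put(r[0], r[1]);     // overwrite: where Alice is right now
        }
        return latest;
    }

    public static void main(String[] args) {
        List<String[]> topic = Arrays.asList(
            new String[]{"alice", "San Francisco"},
            new String[]{"alice", "New York City"},
            new String[]{"alice", "Berlin"});
        System.out.println(asStream(topic));
        System.out.println(asTable(topic));
    }
}
```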
Motivating example: continuously compute current users per geo-region

KTable<UserId, Location> userLocations = builder.table("user-locations-topic");
KTable<UserId, Prefs> userPrefs = builder.table("user-preferences-topic");
Motivating example: continuously compute current users per geo-region

KTable<UserId, Location> userLocations = builder.table("user-locations-topic");
KTable<UserId, Prefs> userPrefs = builder.table("user-preferences-topic");

// Merge into detailed user profiles (continuously updated)
KTable<UserId, UserProfile> userProfiles =
    userLocations.join(userPrefs, (loc, prefs) -> new UserProfile(loc, prefs));

KTable userProfiles, e.g.: alice → Europe, 25y, …; bob → Europe, 46y, …
Motivating example: continuously compute current users per geo-region

KTable<UserId, Location> userLocations = builder.table("user-locations-topic");
KTable<UserId, Prefs> userPrefs = builder.table("user-preferences-topic");

// Merge into detailed user profiles (continuously updated)
KTable<UserId, UserProfile> userProfiles =
    userLocations.join(userPrefs, (loc, prefs) -> new UserProfile(loc, prefs));

// Compute per-region statistics (continuously updated)
KTable<Location, Long> usersPerRegion = userProfiles
    .filter((userId, profile) -> profile.age < 30)
    .groupBy((userId, profile) -> profile.location)
    .count();

KTable usersPerRegion (continuously updated), e.g.: Asia 8, Europe 5, Africa 3 → after Alice moves to Europe: Asia 7, Europe 6, Africa 3
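What the join + filter + groupBy + count pipeline computes can be checked with a plain-Java equivalent over two in-memory “tables” (all names here are invented for this sketch; the real pipeline updates its result continuously instead of computing once):

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java analogy of the userProfiles join and the usersPerRegion count:
// join the two tables on user id, drop users aged 30 or older,
// then count the remaining users per region.
public class UsersPerRegionSketch {

    public static Map<String, Long> usersPerRegion(Map<String, String> userLocations,
                                                   Map<String, Integer> userAges) {
        Map<String, Long> counts = new HashMap<>();
        for (Map.Entry<String, String> e : userLocations.entrySet()) {
            Integer age = userAges.get(e.getKey());        // inner join on user id
            if (age != null && age < 30) {                 // filter(profile.age < 30)
                counts.merge(e.getValue(), 1L, Long::sum); // groupBy(location).count()
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, String> locations = new HashMap<>();
        locations.put("alice", "Europe");
        locations.put("bob", "Europe");
        Map<String, Integer> ages = new HashMap<>();
        ages.put("alice", 25);
        ages.put("bob", 46);
        // Only alice is younger than 30, so Europe counts 1.
        System.out.println(usersPerRegion(locations, ages));
    }
}
```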
Motivating example: continuously compute current users per geo-region

(Diagram: the full pipeline once more – the user-locations and user-prefs input topics, the joined user profiles, and the per-region counts on the dashboard – with each continuously updated dataset labeled as a KTable.)
Streams meet Tables – in the Kafka Streams DSL
Kafka Streams key features
Key features in 0.10

• Native, 100%-compatible Kafka integration
Native, 100% compatible Kafka integration
Read from Kafka
Write to Kafka
Key features in 0.10

• Native, 100%-compatible Kafka integration
• Secure stream processing using Kafka’s security features
• Elastic and highly scalable
• Fault-tolerant
Scalability, fault tolerance, elasticity
Key features in 0.10

• Native, 100%-compatible Kafka integration
• Secure stream processing using Kafka’s security features
• Elastic and highly scalable
• Fault-tolerant
• Stateful and stateless computations
Stateful computations

• Stateful computations like aggregations or joins require state
  • We already showed a join example in the previous slides.
  • Windowing a stream is stateful, too, but let’s ignore this for now.
  • Example: count() will cause the creation of a state store to keep track of counts
• State stores in Kafka Streams
  • … are per stream task for isolation (think: share-nothing)
  • … are local for best performance
  • … are replicated to Kafka for elasticity and for fault-tolerance
• Pluggable storage engines
  • Default: RocksDB (a key-value store), to allow for local state that is larger than available RAM
  • Further built-in options available: in-memory store
  • You can also use your own, custom storage engine
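The replication idea can be sketched in plain Java: a HashMap plays the local state store and a List plays the changelog topic. This is a conceptual sketch with invented names; real changelog topics are compacted Kafka topics, which this ignores:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a local state store replicated to a changelog:
// every put is applied locally AND appended to the changelog "topic".
// After a failure, replaying the changelog rebuilds the exact state.
public class ChangelogStateStore {

    private final Map<String, Long> local = new HashMap<>();
    private final List<String[]> changelog; // the shared "topic"; survives the instance

    public ChangelogStateStore(List<String[]> changelog) {
        this.changelog = changelog;
    }

    public void put(String key, long value) {
        local.put(key, value);
        changelog.add(new String[]{key, Long.toString(value)});
    }

    public Long get(String key) {
        return local.get(key);
    }

    // Recovery: replay the changelog into a fresh, empty store.
    public static ChangelogStateStore restore(List<String[]> changelog) {
        ChangelogStateStore store = new ChangelogStateStore(changelog);
        for (String[] entry : changelog) {
            store.local.put(entry[0], Long.parseLong(entry[1]));
        }
        return store;
    }

    public static void main(String[] args) {
        List<String[]> log = new ArrayList<>();
        ChangelogStateStore store = new ChangelogStateStore(log);
        store.put("alice", 1);
        store.put("alice", 2); // later update overwrites the earlier value on replay
        store.put("bob", 1);
        // Simulate a crash: rebuild state purely from the changelog.
        System.out.println(ChangelogStateStore.restore(log).get("alice"));
    }
}
```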
State management with built-in fault-tolerance

(Diagram sequence: local state stores – e.g. alice 1 → alice 2, bob 1, charlie 3 – are continuously backed up to a changelog topic in Kafka and restored from it after a failure. This is a bit simplified.)
Key features in 0.10

• Native, 100%-compatible Kafka integration
• Secure stream processing using Kafka’s security features
• Elastic and highly scalable
• Fault-tolerant
• Stateful and stateless computations
• Interactive queries
Interactive Queries

Before (0.10.0): must use external systems to share latest results
1. Capture business events in Kafka
2. Process the events with Kafka Streams
3. Must use external systems to share latest results
4. Other apps query external systems for latest results

After (0.10.1): simplified, more app-centric architecture
1. Capture business events in Kafka
2. Process the events with Kafka Streams
3. Now other apps can directly query the latest results
Key features in 0.10

• Native, 100%-compatible Kafka integration
• Secure stream processing using Kafka’s security features
• Elastic and highly scalable
• Fault-tolerant
• Stateful and stateless computations
• Interactive queries
• Time model
• Windowing
• Supports late-arriving and out-of-order data
• Millisecond processing latency, no micro-batching
• At-least-once processing guarantees (exactly-once is in the works as we speak)
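The time model and windowing bullets can be sketched in plain Java: each record is assigned to a tumbling window derived from its own event timestamp, which is why a late or out-of-order record still lands in (and updates) the window it belongs to. A conceptual sketch with invented names, not the DSL’s windowing API:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of event-time tumbling windows: count records per window,
// where the window is computed from the record's own timestamp.
public class TumblingWindowSketch {

    public static Map<Long, Long> countPerWindow(long[] timestamps, long windowSizeMs) {
        Map<Long, Long> counts = new HashMap<>();
        for (long ts : timestamps) {
            long windowStart = ts - (ts % windowSizeMs); // tumbling window assignment
            counts.merge(windowStart, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Note the out-of-order timestamp 999: it still updates window [0, 1000).
        System.out.println(countPerWindow(new long[]{0, 500, 1000, 2500, 999}, 1000));
    }
}
```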
Wrapping Up
Where to go from here

• Kafka Streams is available in Confluent Platform 3.0 and in Apache Kafka 0.10
  • http://www.confluent.io/download
• Kafka Streams demos: https://github.com/confluentinc/examples
  • Java 7, Java 8+ with lambdas, and Scala
  • WordCount, Interactive Queries, Joins, Security, Windowing, Avro integration, …
• Confluent documentation: http://docs.confluent.io/current/streams/
  • Quickstart, Concepts, Architecture, Developer Guide, FAQ
• Recorded talks
  • Introduction to Kafka Streams: http://www.youtube.com/watch?v=o7zSLNiTZbA
  • Application Development and Data in the Emerging World of Stream Processing (higher-level talk): https://www.youtube.com/watch?v=JQnNHO5506w
Thank You