Confidential

Kafka Streams: The New Smart Kid On The Block
The Stream Processing Engine of Apache Kafka

Eno Thereska, eno@confluent.io (@enothereska)
Big Data London 2016
Slide contributions: Michael Noll
Apache Kafka and Kafka Streams API
What is Kafka Streams: Unix analogy

$ cat < in.txt | grep "apache" | tr a-z A-Z > out.txt

Kafka Core | Kafka Connect | Kafka Streams
When to use Kafka Streams

• Mainstream application development
• When running a cluster would suck
• Microservices
• Fast Data apps for small and big data
• Large-scale continuous queries and transformations
• Event-triggered processes
• Reactive applications
• The “T” in ETL
• <and more>

• Use case examples
  • Real-time monitoring and intelligence
  • Customer 360-degree view
  • Fraud detection
  • Location-based marketing
  • Fleet management
  • <and more>
Some use cases in the wild & external articles

• Applying Kafka Streams for internal message delivery pipeline at LINE Corp.
  • http://developers.linecorp.com/blog/?p=3960
  • Kafka Streams in production at LINE, a social platform based in Japan with 220+ million users
• Microservices and reactive applications
  • https://speakerdeck.com/bobbycalderwood/commander-decoupled-immutable-rest-apis-with-kafka-streams
• User behavior analysis
  • https://timothyrenner.github.io/engineering/2016/08/11/kafka-streams-not-looking-at-facebook.html
• Containerized Kafka Streams applications in Scala
  • https://www.madewithtea.com/processing-tweets-with-kafka-streams.html
• Geo-spatial data analysis
  • http://www.infolace.com/blog/2016/07/14/simple-spatial-windowing-with-kafka-streams/
• Language classification with machine learning
  • https://dzone.com/articles/machine-learning-with-kafka-streams
Architecture comparison: use case example

Real-time dashboard for security monitoring
“Which of my data centers are under attack?”
Architecture comparison: use case example

Before: undue complexity, heavy footprint, many technologies, split ownership with conflicting priorities (your code runs as a “Job” on someone else’s cluster)
1. Capture business events in Kafka
2. Must process events with a separate cluster (e.g. Spark)
3. Must share latest results through separate systems (e.g. MySQL)
4. Other apps access the latest results by querying these DBs

With Kafka Streams: simplified, app-centric architecture that puts app owners in control (your code is just your app)
1. Capture business events in Kafka
2. Process events with standard Java apps that use Kafka Streams
3. Now other apps can directly query the latest results

• Conflicting priorities: infrastructure teams vs. product teams
• Complexity: a lot of moving pieces that are also complex individually
• Is all this part of the solution or part of your problem?
How do I install Kafka Streams?

• There is and there should be no “installation” – Build Apps, Not Clusters!
• It’s a library. Add it to your app like any other library.

<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-streams</artifactId>
    <version>0.10.0.1</version>
</dependency>
How do I package and deploy my apps? How do I …?

• Whatever works for you. Stick to what you/your company think is the best way.
• Kafka Streams integrates well with what you already have.
• Why? Because an app that uses Kafka Streams is… a normal Java app.
Available APIs
API option 1: Kafka Streams DSL (declarative)

KStream<Integer, Integer> input = builder.stream("numbers-topic");

// Stateless computation
KStream<Integer, Integer> doubled = input.mapValues(v -> v * 2);

// Stateful computation
KTable<Integer, Integer> sumOfOdds = input
    .filter((k, v) -> v % 2 != 0)
    .selectKey((k, v) -> 1)
    .groupByKey()
    .reduce((v1, v2) -> v1 + v2, "sum-of-odds");

The preferred API for most use cases.
The DSL particularly appeals to users:
• When familiar with Spark, Flink
• When fans of Scala or functional programming
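To see what the stateful part of the snippet above computes, here is the same sum-of-odds logic traced with plain Java streams – an illustrative analogy (the class and method names are invented for this sketch), not the Kafka Streams runtime:

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java analogy of the sum-of-odds topology:
// keep only the odd values (filter), ignore the original keys
// (selectKey to a constant), and fold everything into one sum (reduce).
public class SumOfOddsSketch {

    public static int sumOfOdds(List<Integer> values) {
        return values.stream()
                .filter(v -> v % 2 != 0)      // filter((k, v) -> v % 2 != 0)
                .mapToInt(Integer::intValue)
                .sum();                       // reduce((v1, v2) -> v1 + v2)
    }

    public static void main(String[] args) {
        // 1 + 3 + 5 = 9
        System.out.println(sumOfOdds(Arrays.asList(1, 2, 3, 4, 5)));
    }
}
```

The real DSL version differs in one important way: the KTable result is continuously updated as new records arrive, whereas this sketch computes over a fixed list.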
API option 2: Processor API (imperative)

class PrintToConsoleProcessor implements Processor<K, V> {

    @Override
    public void init(ProcessorContext context) {}

    @Override
    public void process(K key, V value) {
        System.out.println("Received record with " +
            "key=" + key + " and value=" + value);
    }

    @Override
    public void punctuate(long timestamp) {}

    @Override
    public void close() {}
}

Full flexibility but more manual work.
The Processor API appeals to users:
• When familiar with Storm, Samza
  • Still, check out the DSL!
• When requiring functionality that is not yet available in the DSL
“My WordCount is better than your WordCount” (?)

Kafka vs. Spark

These isolated code snippets are nice (and actually quite similar), but they are not very meaningful. In practice, we also need to read data from somewhere, write data back to somewhere, etc. – but we can see none of this here.
WordCount in Kafka

(Code screenshot: the WordCount topology in the Kafka Streams DSL.)
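The slide shows the actual DSL code as a screenshot; as a textual stand-in, here is the same word-count logic in plain Java. All names here are invented for the sketch, and it covers the counting logic only, not the Kafka Streams topology:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for the WordCount topology:
// lower-case each line, split it into words (flatMapValues in the DSL),
// then count occurrences per word (groupBy + count into a KTable).
public class WordCountSketch {

    public static Map<String, Long> countWords(List<String> lines) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords(Arrays.asList("hello kafka streams", "hello kafka")));
    }
}
```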
Compared to: WordCount in Spark 2.0

(Code screenshot with callouts 1–3.)
Runtime model leaks into processing logic (here: interfacing from Spark with Kafka)
Compared to: WordCount in Spark 2.0 (continued)

(Code screenshot with callouts 4–5.)
Runtime model leaks into processing logic (driver vs. executors)
Kafka Streams key concepts
Key concepts
Kafka Core Kafka Streams
Streams meet Tables
http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple
http://docs.confluent.io/current/streams/concepts.html#duality-of-streams-and-tables
Motivating example: continuously compute current users per geo-region

Real-time dashboard: “How many users younger than 30y, per region?”

Two input topics feed the computation:
• user-locations (owned by the mobile team)
• user-prefs (owned by the web team)

Current user records, e.g.: alice → Asia, 25y, …; bob → Europe, 46y, …
As records flow in, the tables and the dashboard update continuously:
• A new record arrives in user-locations: alice → Europe.
• The joined profile table now reads alice → Europe, 25y, … (bob → Europe, 46y, … is unchanged).
• The dashboard adjusts the per-region counts: −1 for Asia (8 → 7), +1 for Europe (5 → 6).
Same data, but different use cases require different interpretations
alice San Francisco
alice New York City
alice Rio de Janeiro
alice Sydney
alice Beijing
alice Paris
alice Berlin
Use case 1: Frequent traveler status?
Use case 2: Current location?
Same data, but different use cases require different interpretations

Use case 1: Frequent traveler status? → “Alice has been to SFO, NYC, Rio, Sydney, Beijing, Paris, and finally Berlin.”
Use case 2: Current location? → “Alice is in Berlin right now.” (on the slide, the earlier cities are crossed out)
Streams meet Tables

When you need…          | read the topic into a | so that it is interpreted as a | with messages interpreted as | Example
All the values of a key | KStream               | record stream                  | INSERT (append)              | All the places Alice has ever been to
Latest value of a key   | KTable                | changelog stream               | UPDATE (overwrite existing)  | Where Alice is right now
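The two interpretations can be mimicked with plain Java collections: a List for the record-stream (KStream) reading and a Map for the changelog-stream (KTable) reading. This is an analogy to illustrate the duality, not the real API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Two interpretations of the same topic of (user, place) records:
// - record stream (KStream):    every record is an INSERT -> keep all values
// - changelog stream (KTable):  every record is an UPDATE -> keep latest value
public class StreamTableDuality {

    public static List<String> asStream(List<String[]> records) {
        List<String> allPlaces = new ArrayList<>();
        for (String[] r : records) {
            allPlaces.add(r[1]);        // append: all the places Alice has been to
        }
        return allPlaces;
    }

    public static Map<String, String> asTable(List<String[]> records) {
        Map<String, String> latest = new HashMap<>();
        for (String[] r : records) {
            latest.put(r[0], r[1]);     // overwrite: where Alice is right now
        }
        return latest;
    }

    public static void main(String[] args) {
        List<String[]> topic = Arrays.asList(
            new String[]{"alice", "San Francisco"},
            new String[]{"alice", "New York City"},
            new String[]{"alice", "Berlin"});
        System.out.println(asStream(topic));
        System.out.println(asTable(topic));
    }
}
```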
Motivating example: continuously compute current users per geo-region

KTable<UserId, Location> userLocations = builder.table("user-locations-topic");
KTable<UserId, Prefs> userPrefs = builder.table("user-preferences-topic");
Motivating example: continuously compute current users per geo-region

KTable<UserId, Location> userLocations = builder.table("user-locations-topic");
KTable<UserId, Prefs> userPrefs = builder.table("user-preferences-topic");

// Merge into detailed user profiles (continuously updated)
KTable<UserId, UserProfile> userProfiles =
    userLocations.join(userPrefs, (loc, prefs) -> new UserProfile(loc, prefs));

KTable userProfiles, e.g.: alice → Europe, 25y, …; bob → Europe, 46y, …
Motivating example: continuously compute current users per geo-region

KTable<UserId, Location> userLocations = builder.table("user-locations-topic");
KTable<UserId, Prefs> userPrefs = builder.table("user-preferences-topic");

// Merge into detailed user profiles (continuously updated)
KTable<UserId, UserProfile> userProfiles =
    userLocations.join(userPrefs, (loc, prefs) -> new UserProfile(loc, prefs));

// Compute per-region statistics (continuously updated)
KTable<Location, Long> usersPerRegion = userProfiles
    .filter((userId, profile) -> profile.age < 30)
    .groupBy((userId, profile) -> profile.location)
    .count();

KTable usersPerRegion (continuously updated), e.g.: Asia 8, Europe 5, Africa 3 → after Alice moves to Europe: Asia 7, Europe 6, Africa 3
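What the join + filter + groupBy + count pipeline computes can be checked with a plain-Java equivalent over two in-memory “tables” (all names here are invented for this sketch; the real pipeline updates its result continuously instead of computing once):

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java analogy of the userProfiles join and the usersPerRegion count:
// join the two tables on user id, drop users aged 30 or older,
// then count the remaining users per region.
public class UsersPerRegionSketch {

    public static Map<String, Long> usersPerRegion(Map<String, String> userLocations,
                                                   Map<String, Integer> userAges) {
        Map<String, Long> counts = new HashMap<>();
        for (Map.Entry<String, String> e : userLocations.entrySet()) {
            Integer age = userAges.get(e.getKey());        // inner join on user id
            if (age != null && age < 30) {                 // filter(profile.age < 30)
                counts.merge(e.getValue(), 1L, Long::sum); // groupBy(location).count()
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, String> locations = new HashMap<>();
        locations.put("alice", "Europe");
        locations.put("bob", "Europe");
        Map<String, Integer> ages = new HashMap<>();
        ages.put("alice", 25);
        ages.put("bob", 46);
        // Only alice is younger than 30, so Europe counts 1.
        System.out.println(usersPerRegion(locations, ages));
    }
}
```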
Motivating example: continuously compute current users per geo-region

(Diagram: the full pipeline once more – the user-locations and user-prefs input topics, the joined user profiles, and the per-region counts on the dashboard – with each continuously updated dataset labeled as a KTable.)
Streams meet Tables – in the Kafka Streams DSL
Kafka Streams key features
Key features in 0.10

• Native, 100%-compatible Kafka integration
Native, 100% compatible Kafka integration
Read from Kafka
Write to Kafka
Key features in 0.10

• Native, 100%-compatible Kafka integration
• Secure stream processing using Kafka’s security features
• Elastic and highly scalable
• Fault-tolerant
Scalability, fault tolerance, elasticity
Key features in 0.10

• Native, 100%-compatible Kafka integration
• Secure stream processing using Kafka’s security features
• Elastic and highly scalable
• Fault-tolerant
• Stateful and stateless computations
Stateful computations

• Stateful computations like aggregations or joins require state
  • We already showed a join example in the previous slides.
  • Windowing a stream is stateful, too, but let’s ignore this for now.
  • Example: count() will cause the creation of a state store to keep track of counts
• State stores in Kafka Streams
  • … are per stream task for isolation (think: share-nothing)
  • … are local for best performance
  • … are replicated to Kafka for elasticity and for fault-tolerance
• Pluggable storage engines
  • Default: RocksDB (a key-value store), to allow for local state that is larger than available RAM
  • Further built-in options available: in-memory store
  • You can also use your own, custom storage engine
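The replication idea can be sketched in plain Java: a HashMap plays the local state store and a List plays the changelog topic. This is a conceptual sketch with invented names; real changelog topics are compacted Kafka topics, which this ignores:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a local state store replicated to a changelog:
// every put is applied locally AND appended to the changelog "topic".
// After a failure, replaying the changelog rebuilds the exact state.
public class ChangelogStateStore {

    private final Map<String, Long> local = new HashMap<>();
    private final List<String[]> changelog; // the shared "topic"; survives the instance

    public ChangelogStateStore(List<String[]> changelog) {
        this.changelog = changelog;
    }

    public void put(String key, long value) {
        local.put(key, value);
        changelog.add(new String[]{key, Long.toString(value)});
    }

    public Long get(String key) {
        return local.get(key);
    }

    // Recovery: replay the changelog into a fresh, empty store.
    public static ChangelogStateStore restore(List<String[]> changelog) {
        ChangelogStateStore store = new ChangelogStateStore(changelog);
        for (String[] entry : changelog) {
            store.local.put(entry[0], Long.parseLong(entry[1]));
        }
        return store;
    }

    public static void main(String[] args) {
        List<String[]> log = new ArrayList<>();
        ChangelogStateStore store = new ChangelogStateStore(log);
        store.put("alice", 1);
        store.put("alice", 2); // later update overwrites the earlier value on replay
        store.put("bob", 1);
        // Simulate a crash: rebuild state purely from the changelog.
        System.out.println(ChangelogStateStore.restore(log).get("alice"));
    }
}
```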
State management with built-in fault-tolerance

(Diagram sequence: local state stores – e.g. alice 1 → alice 2, bob 1, charlie 3 – are continuously backed up to a changelog topic in Kafka and restored from it after a failure. This is a bit simplified.)
Key features in 0.10

• Native, 100%-compatible Kafka integration
• Secure stream processing using Kafka’s security features
• Elastic and highly scalable
• Fault-tolerant
• Stateful and stateless computations
• Interactive queries
Interactive Queries

Before (0.10.0): must use external systems to share latest results
1. Capture business events in Kafka
2. Process the events with Kafka Streams
3. Must use external systems to share latest results
4. Other apps query external systems for latest results

After (0.10.1): simplified, more app-centric architecture
1. Capture business events in Kafka
2. Process the events with Kafka Streams
3. Now other apps can directly query the latest results
Key features in 0.10

• Native, 100%-compatible Kafka integration
• Secure stream processing using Kafka’s security features
• Elastic and highly scalable
• Fault-tolerant
• Stateful and stateless computations
• Interactive queries
• Time model
• Windowing
• Supports late-arriving and out-of-order data
• Millisecond processing latency, no micro-batching
• At-least-once processing guarantees (exactly-once is in the works as we speak)
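The time model and windowing bullets can be sketched in plain Java: each record is assigned to a tumbling window derived from its own event timestamp, which is why a late or out-of-order record still lands in (and updates) the window it belongs to. A conceptual sketch with invented names, not the DSL’s windowing API:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of event-time tumbling windows: count records per window,
// where the window is computed from the record's own timestamp.
public class TumblingWindowSketch {

    public static Map<Long, Long> countPerWindow(long[] timestamps, long windowSizeMs) {
        Map<Long, Long> counts = new HashMap<>();
        for (long ts : timestamps) {
            long windowStart = ts - (ts % windowSizeMs); // tumbling window assignment
            counts.merge(windowStart, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Note the out-of-order timestamp 999: it still updates window [0, 1000).
        System.out.println(countPerWindow(new long[]{0, 500, 1000, 2500, 999}, 1000));
    }
}
```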
Wrapping Up
Where to go from here

• Kafka Streams is available in Confluent Platform 3.0 and in Apache Kafka 0.10
  • http://www.confluent.io/download
• Kafka Streams demos: https://github.com/confluentinc/examples
  • Java 7, Java 8+ with lambdas, and Scala
  • WordCount, Interactive Queries, Joins, Security, Windowing, Avro integration, …
• Confluent documentation: http://docs.confluent.io/current/streams/
  • Quickstart, Concepts, Architecture, Developer Guide, FAQ
• Recorded talks
  • Introduction to Kafka Streams: http://www.youtube.com/watch?v=o7zSLNiTZbA
  • Application Development and Data in the Emerging World of Stream Processing (higher-level talk): https://www.youtube.com/watch?v=JQnNHO5506w
Thank You