Kafka Streams is the component of the Kafka ecosystem which can be used for stream processing.
This involves taking messages from Kafka, usually as they are produced in real time, and processing or responding to them in some way. Common examples of this include:
In many situations, we would process the message, then write a new message back to Kafka. For instance, the average order value or the transformed message described above could go onto another topic ready to be processed elsewhere. In other instances, the stream could terminate after processing as in the email example.
Of course, we could achieve the same patterns using the traditional Kafka Consumer API. Kafka streams is a slightly higher level of abstraction than that, and focusses more on the data processing and transformation side in an opinionated way.
Why Is Kafka Streams Different?
Kafka Steams is one of a number of options for stream processing frameworks, with alternatives including Flink, Google Cloud Dataflow and Spark Streams. Kafka Streams does however have some compelling benefits.
Firstly, no cluster is required to execute the Kafka Streams job. Instead, they simply run as a standalone service embedded directly within your application. This can avoid the need to build and operate a cluster, simplifying your estate.
Kafka Streams is naturally tightly integrated with Kafka, and makes use of the Kafka primitives for performance and failover in a very integrated way. This includes partitions, rebalancing, offset management etc giving us a high degree of control of how our application interacts with Kafka whilst giving us the same guarantees about exactly once processing.
Operational and environmental concerns like these tend to be the biggest difference when choosing a stream processing framework. The actual process of writing the stream processing is generally similar, being based around a DSL which provides operations over the data stream.
Kafka Streams DSL Walkthrough
As with many of the stream processing engines, Kafka Streams exposes a Domain Specific Language (DSL) for developing the processing topologies. Though intended to simplify things, this declarative style of programming does take some getting used to for people new to it.
The first thing we might need to do is define a stream representing our inbound data. In the example below, we have defined a stream tied to the topic customer_orders, where the key and value are of type Strings and will be serialised and deserialised by String “Serdes” objects.
KStream<String, String> customerOrdersStream = builder.stream("customer_orders", Consumed.with(Serdes.String(), Serdes.String()));
Type handling and serialisation lead to most. of the complexity in Kafka streams, but are necessary to move between binary data on Kafka and types we can work with in the JVM.
In this example, we were were receiving events in the format <customer_id>:<product_id>, and the aim was to count the number of orders for each custoemr in a given our tie window. The first task is to do a simple grouByKey operation every time there is an event to ask how many orders we had for each customer_id.
KGroupedStream<String, String> ordersByCustomer = customerOrdersStream.groupByKey();
Next, we group the events into time windows of 5 seconds each using the default tumbling window semantics.
TimeWindowedKStream<String, String> windowedStream = ordersByCustomer.windowedBy(TimeWindows.of(Duration.ofSeconds(5)));
For each of the groupings within each window, we then count how many entries we have and change it the event to a pair of <customer_id>:<count> records.
KTable<Windowed<String>, Long> count = windowedStream.count();
Finally, we prepare the message to be sent out to a new topic. Because our client was expecting a string, we mapped the long object storing the count back to a string before publishing.
KStream<String, String> mapped = count.toStream().map((key, val) -> new KeyValue<>(key.key(), val.toString()));
Finally, the object is pushed to a new Kafka topic called orders_by_customer.
This was a very simple example of taking a stream of orders, then counting how many orders each customer has placed in 5 minute windows. Though there is some boilerplate code around this to create the connection and tear it down cleanly, in just 5 lines of code we have a very high performing windowed aggregation which would be quite tricky to derive in traditional database environments.
Kafka Streams tends to be our go to stream processing engine as a simple but powerful solution which keeps you in the Kafka ecosystem. The DSL does take some getting used to even for experienced Java developers, but this is applicable to the entire stream processing world. Overall Kafka Streams is one of the fastest and lightest frameworks to begin the stream processing journey with.