A case for Kafka Streams or perhaps Spark Structured Streaming?

I’ve been contracted as an independent advisor and subject matter expert in Apache Kafka, Apache Spark and Scala.

We ran a series of architecture design sessions where we hoped to answer questions about which tools to use for the project and how to use them.

One thing was certain, however: we’d be using Apache Kafka. Nobody in the room doubted it (and the team was more than happy with the choice).

We knew, however, that Kafka alone would not solve all our problems. Given that I’m passionate about Apache Spark, the client more or less expected that my first recommendation, as soon as we found out that some computation was required, would be to use Apache Spark. At least I thought they’d “count” on me to make that decision. With me in the room, it seemed just a matter of time before it came up.

At some point we found we’d need to aggregate the messages in one topic and publish the results to another: a simple stateful aggregation over a stream of messages in a Kafka topic, with the results published to another topic. If you said we were building a Kafka-centric, topic-oriented solution, you’d be 100% correct.

Given these requirements, we considered the following tools (only ones with Java or Scala support):

  1. Kafka Consumer and Producer APIs
  2. Akka Streams + Reactive Kafka
  3. Kafka Streams
  4. Spark Structured Streaming

We quickly crossed out option 2, Akka Streams + Reactive Kafka: we had no experience with it, and my limited understanding was that it was too low-level and lacked features the others offered out of the box through their higher-level APIs (esp. Spark Structured Streaming).

I was against proposing Spark Structured Streaming, as I believed the other options could do the job just fine without the extra mentoring cost.

We developed a solution with the Kafka Consumer and Producer APIs with ease. It took us less than two hours, and we kept adding new features until we realized we’d need to keep some state that would still be available after a failure.
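To make that limitation concrete, here is a minimal sketch of this kind of consumer/producer loop. It is not the code from the session. It assumes kafka-clients 2.0 or later (for poll(Duration)), Scala 2.12 or later, a broker at localhost:9092, and hypothetical names for the topics (input, output) and the consumer group (aggregation-poc). The running counts live in an in-memory map, which is exactly what disappears when the process dies.

```scala
import java.time.Duration
import java.util.{Collections, Properties}

import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.{StringDeserializer, StringSerializer}

import scala.collection.mutable

// Illustrative only: broker address, topic names and group id are assumptions.
object InMemoryCounts extends App {
  val consumerProps = new Properties()
  consumerProps.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  consumerProps.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "aggregation-poc")
  consumerProps.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
  consumerProps.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)

  val producerProps = new Properties()
  producerProps.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  producerProps.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
  producerProps.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)

  val consumer = new KafkaConsumer[String, String](consumerProps)
  val producer = new KafkaProducer[String, String](producerProps)
  consumer.subscribe(Collections.singletonList("input"))

  // The aggregation state: a count per message key, held only in memory.
  // A crash or restart loses it, and the counts start again from zero.
  val counts = mutable.Map.empty[String, Long].withDefaultValue(0L)

  while (true) {
    val records = consumer.poll(Duration.ofMillis(500))
    records.forEach { record =>
      val key = String.valueOf(record.key) // a null key becomes the string "null"
      counts(key) += 1
      // Publish the updated count for this key to the output topic.
      producer.send(new ProducerRecord[String, String]("output", key, counts(key).toString))
    }
  }
}
```

Making this state survive a failure means persisting it somewhere and restoring it on startup, which is the work Kafka Streams and Spark Structured Streaming take on for you.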

That requirement left us with two options: Kafka Streams and Spark Structured Streaming.

Because I knew nothing about Kafka Streams, I was about to propose Spark Structured Streaming. There was a strong feeling, however, that Kafka Streams might be a better fit since it is more Kafka-centric. After all, the purpose of Kafka Streams is to do aggregations and the like on datasets from topics. And the team lead asked to see how far it could take us.

That was my very first encounter with Kafka Streams.

I must admit I did not want to spend my life with yet another Spark SQL-like tech. You should have seen my face when the tech lead asked “Jacek, can you develop a PoC with Kafka Streams?”

“Ouch,” I thought, and said “Yes. Indeed.” After all, I could develop it in Scala, so what could possibly go wrong?!

We took the sample from the Kafka Streams API documentation and spun up a single-broker Kafka cluster.

Kafka Streams Hello World-like Application

To our surprise, the example did not work. We got no output in the sink (another topic) when reading it with kafka-console-consumer (we think it could have been something with the console consumer, and we have not sorted it out).
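The sample we ran is not reproduced here. For orientation only, a stateful count per key, read from one topic and written to another, might look roughly like the sketch below. It uses the Kafka Streams Java API from Scala and assumes Kafka Streams 1.0 or later (for StreamsBuilder), a broker at localhost:9092, and hypothetical names for the topics (input, output) and the application id. It is neither the documentation sample nor the code from that session.

```scala
import java.util.Properties

import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.kstream.Produced
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}

// Illustrative only: broker address, topic names and application id are assumptions.
object CountPerKey extends App {
  val props = new Properties()
  props.setProperty(StreamsConfig.APPLICATION_ID_CONFIG, "aggregation-poc")
  props.setProperty(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  // Default serdes: keys and values in the input topic are treated as strings.
  props.setProperty(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass.getName)
  props.setProperty(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass.getName)

  val builder = new StreamsBuilder()

  builder
    .stream[String, String]("input")
    .groupByKey() // group records by their existing key
    .count()      // stateful: one running count per key, kept in a state store
    .toStream()   // turn the table of counts into a stream of updates
    .to("output", Produced.`with`(Serdes.String(), Serdes.Long()))

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()

  // Close the application cleanly on JVM shutdown.
  sys.addShutdownHook(streams.close())
}
```

Note that count() produces Long values, so a console consumer reading the output topic has to be configured with a Long deserializer for values to print them legibly. That is one thing worth checking when a sink looks empty or garbled; it is not a diagnosis of what happened here.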

We spent half an hour trying to fix it until I said “Enough” followed by “Let me show you how to do it using Spark Structured Streaming.” I did not actually want to give up so quickly, but felt I had no choice.

As you may have figured, it worked right off the bat. I used spark-shell and managed to develop a simple transformation in Spark Structured Streaming with datasets from Kafka in under a minute (I’ve been doing it for months, so there was no surprise whatsoever).
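For readers who have not seen it, a spark-shell session of that kind could look like the sketch below. It is an illustration, not a transcript of what I typed. It assumes Spark 2.2 with the Kafka source on the classpath, a broker at localhost:9092, hypothetical topic names input and output, and a hypothetical checkpoint directory.

```scala
// Paste into spark-shell started with the Kafka source, e.g.
//   spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0
// Broker address, topic names and checkpoint path are assumptions.
import org.apache.spark.sql.functions.count

// Read the input topic as an unbounded streaming Dataset.
val messages = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "input")
  .load()

// Kafka keys arrive as binary; cast to string and count records per key.
val counts = messages
  .selectExpr("CAST(key AS STRING) AS key")
  .groupBy("key")
  .agg(count("*").as("total"))

// The Kafka sink expects a string or binary "value" column (and an optional "key").
val query = counts
  .selectExpr("key", "CAST(total AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "output")
  .option("checkpointLocation", "/tmp/aggregation-poc-checkpoint")
  .outputMode("update") // emit only the keys whose count changed in a batch
  .start()
```

The checkpoint location is where Spark keeps the query's progress and aggregation state, so restarting the query with the same location lets it resume after a failure. Calling query.stop() ends the stream.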

The lesson learnt is that Spark Structured Streaming working right off the bat does not necessarily mean we should give up on Kafka Streams, as the solution is so heavily topic-oriented. It’s just that we didn’t have much luck today. Hope dies last, so we’re going to give Kafka Streams a shot tomorrow. Stay tuned!

Contact me at jacek@japila.pl if you want to deep dive into Apache Spark 2.2 (and Spark Structured Streaming in particular) or Apache Kafka in your projects and use them in the most efficient way.

Follow @jaceklaskowski on twitter to learn more about the latest and greatest of Apache Spark 2.2.0 and the upcoming Apache Kafka 1.0.0 (in 144-char-long chunks and up to 4 pictures).

IT Freelancer for Apache Spark, Delta Lake, Apache Kafka, Kafka Streams