What is Apache Kafka?
Why, When, and How to use Apache Kafka.
April 27, 2021
Kafka means different things to different people. The way we prefer to look at Kafka is "A temporary place to store, consume, and forward data between different services."
Why not just transport the data between the services directly, for example, through a REST API?
That's a great question.
The answer is the buffer approach. Please have a look at the following architecture -
Each component relies on the output of the component before it. If something fails in the middle, the data is lost. And what if I want to add another module? For example, one that collects data for statistical purposes? It would be painful.
What will it look like with Kafka?
Apache Kafka acts as a buffer between services.
While the data resides in Kafka, it can be (if you configure it so) secured and made resistant to failures, and it can be consumed in different ways by multiple consumers.
In case we want to add a new module we developed, such as statistics, it would simply look like this:
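To make the buffer idea concrete in code, here is a minimal sketch in plain Python. It is not the Kafka API: `TopicLog` is a hypothetical in-memory stand-in for a topic, and the group names `billing` and `statistics` are made up for illustration. It only shows why a new consumer can be added without touching the producer or the existing consumers: each consumer group keeps its own read position over the same stored data.

```python
from collections import defaultdict


class TopicLog:
    """Toy stand-in for a Kafka topic: an append-only list of messages."""

    def __init__(self):
        self._messages = []
        # Each consumer group tracks its own read position (offset).
        self._offsets = defaultdict(int)

    def produce(self, message):
        self._messages.append(message)

    def consume(self, group, max_messages=10):
        """Return unread messages for `group` and advance its offset."""
        start = self._offsets[group]
        batch = self._messages[start:start + max_messages]
        self._offsets[group] = start + len(batch)
        return batch


topic = TopicLog()
for event in ["order-1", "order-2", "order-3"]:
    topic.produce(event)

# The existing service reads everything written so far.
billing_batch = topic.consume("billing")

# A statistics module added later starts from offset 0 and sees the same
# data, without the producer or the billing service changing at all.
statistics_batch = topic.consume("statistics")
```

Reading does not remove anything from the log, so the second group gets the full history while the first group simply continues from where it stopped.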
Apache Kafka’s real-world adoption is exploding, and it claims to dominate the world of stream data. It has a huge developer community all over the world that keeps on growing.
Pain points to consider
Apache Kafka has great power, but to make it production-ready, several key elements need to be taken into consideration -
1. Shorten your messages
Apache Kafka works best with messages under 10KB. Keeping messages that small is not an easy task when it comes to big data, but we highly recommend doing so to fully utilize resources.
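One way to act on this guideline is to measure the serialized size of a message before producing it. The sketch below assumes messages are serialized as compact UTF-8 JSON. The helper names are hypothetical, and the 10KB threshold is the rule of thumb from the text, not a limit enforced by Kafka.

```python
import json

# The 10KB guideline from the text above, not a Kafka-enforced limit.
MAX_MESSAGE_BYTES = 10 * 1024


def encoded_size(payload):
    """Size in bytes of the payload once serialized as compact UTF-8 JSON."""
    return len(json.dumps(payload, separators=(",", ":")).encode("utf-8"))


def is_within_guideline(payload, limit=MAX_MESSAGE_BYTES):
    """True if the serialized payload is at or under the size guideline."""
    return encoded_size(payload) <= limit


small = {"user_id": 42, "action": "login"}
large = {"user_id": 42, "blob": "x" * 20_000}

small_ok = is_within_guideline(small)
large_ok = is_within_guideline(large)
```

A message that fails the check is a candidate for trimming unused fields, compressing, or storing the bulky part elsewhere and sending only a reference to it.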
2. Apache Kafka cannot transform your data
When your use case or the consuming service requires the raw schema of the data to be changed before it can be handled, you need to code the transformation yourself. Kafka will not be able to do it for you.
The KafkaStreams client can host that code: it allows us to perform continuous computation on input coming from one or more input topics and sends output to zero, one, or more output topics. Internally a KafkaStreams instance contains a normal KafkaProducer and KafkaConsumer instance that is used for reading input and writing output.
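The read, transform, write pattern described here can be sketched in a few lines of plain Python. This is not Kafka Streams (which is a Java library). The two lists stand in for an input and an output topic, and both record schemas are hypothetical. The point is only that the reshaping logic is code you write and run yourself.

```python
import json


def transform(raw_value):
    """Reshape one raw record into the schema the consumer expects.

    Both the input and the output schema are hypothetical examples.
    """
    record = json.loads(raw_value)
    return json.dumps(
        {
            "customer_id": record["id"],
            "full_name": f'{record["first"]} {record["last"]}',
        },
        sort_keys=True,
    )


def run_once(input_topic, output_topic):
    """Read every record from the input, transform it, write it to the output."""
    for raw_value in input_topic:
        output_topic.append(transform(raw_value))


# Plain lists standing in for an input topic and an output topic.
input_topic = ['{"id": 7, "first": "Ada", "last": "Lovelace"}']
output_topic = []
run_once(input_topic, output_topic)
```

In a real pipeline the same `transform` function would sit between a consumer reading the input topic and a producer writing the output topic.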
3. Managing Apache Kafka is COMPLICATED
As of today, free UI-based management tools for Apache Kafka are limited, and most of the DevOps engineers I have worked with use scripting tools. However, it can be tedious for beginners to jump into Apache Kafka scripting tools without taking the time for training. The learning curve is steep, and it takes some time to get moving and to integrate Kafka into big running systems.
4. No data/application-level monitoring
The existing tools mentioned above will provide infrastructure-level monitoring for the Kafka brokers, ZooKeeper, memory, and CPU utilization, but not for broken data pipelines, upstream changes, failed consumers or producers, and more.
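One common application-level signal is consumer lag: how far a consumer group's committed offset is behind the newest offset in each partition. The sketch below is a minimal illustration with hard-coded numbers. In a real deployment the offsets would come from a Kafka client or admin tool, and the function names and the threshold here are assumptions.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag: newest offset in the log minus the committed offset.

    A partition with no committed offset is treated as committed at 0.
    """
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in end_offsets.items()
    }


def stalled_partitions(end_offsets, committed_offsets, threshold):
    """Partitions whose lag exceeds the (hypothetical) alert threshold."""
    lag = consumer_lag(end_offsets, committed_offsets)
    return sorted(p for p, value in lag.items() if value > threshold)


# Hard-coded example offsets, keyed by partition number.
end_offsets = {0: 1500, 1: 1480, 2: 900}
committed = {0: 1495, 1: 200, 2: 900}

lag = consumer_lag(end_offsets, committed)
stalled = stalled_partitions(end_offsets, committed, threshold=100)
```

A lag that keeps growing on one partition usually points at a failed or stuck consumer, which CPU and memory graphs alone would not reveal.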
5. Data can still be lost!
Apache Kafka is probably the most popular tool for distributed asynchronous messaging. This is mainly due to its high throughput, low latency, scalability, centralized design, and real-time abilities. Much of this comes from splitting each topic into partitions and keeping replicas of every partition on multiple brokers.
However, when Kafka is misconfigured, there is a high chance of data loss when machines or processes fail, and they will fail.
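As a rough illustration of what "misconfiguration" can mean, the sketch below checks a handful of settings that are commonly tightened for durability: the producer's `acks`, the topic's replication factor, `min.insync.replicas`, and `unclean.leader.election.enable`. The setting names are real Kafka configuration keys, but the `find_risky_settings` helper and its thresholds are assumptions that reflect common rules of thumb, not an official checklist. The helper also requires `acks` to be set explicitly, which is a simplification.

```python
def find_risky_settings(producer_conf, topic_conf, replication_factor):
    """Return warnings for settings that commonly lead to data loss.

    The thresholds are rules of thumb, not values mandated by Kafka.
    """
    warnings = []
    # Assumption: a missing `acks` is reported, regardless of client defaults.
    if str(producer_conf.get("acks", "")) not in ("all", "-1"):
        warnings.append("producer acks is not 'all'")
    if replication_factor < 2:
        warnings.append("topic has no replicas")
    if int(topic_conf.get("min.insync.replicas", 1)) < 2:
        warnings.append("min.insync.replicas is below 2")
    unclean = str(topic_conf.get("unclean.leader.election.enable", "false"))
    if unclean.lower() == "true":
        warnings.append("unclean leader election is enabled")
    return warnings


risky = find_risky_settings(
    {"acks": "1"},
    {"min.insync.replicas": "1"},
    replication_factor=1,
)
safe = find_risky_settings(
    {"acks": "all"},
    {"min.insync.replicas": "2", "unclean.leader.election.enable": "false"},
    replication_factor=3,
)
```

With the first set of values, a single broker failure can lose acknowledged messages. With the second, a write is acknowledged only after at least two replicas have it.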
Can I use Strech?
Yes, you can.
Instead of protecting, managing, monitoring, scaling, and writing a lot of code, all for a single use case, you can use Strech to create all the functionality around Kafka. Let's see what it looks like by taking a "Fraud detection" use case as an example -
Each square is a piece of code written by the user. Kafka provides only stateful temporary storage and an API.
The green squares can be provided by the user or by Strech, while all the rest is provided by Strech (except Kafka itself).