Apache Kafka — Getting Started with Kafka
Introduction
Kafka is a word that gets heard a lot nowadays… A lot of leading digital companies seem to use it as well. But what is it actually?
Kafka was originally developed at LinkedIn in 2011 and has improved a lot since then. Nowadays it is a whole platform, allowing you to redundantly store absurd amounts of data, run a message bus with huge throughput (millions of messages per second) and do real-time stream processing on the data that goes through it, all at once.
This is all well and good, but stripped down to its core, Kafka is a distributed, horizontally-scalable, fault-tolerant commit log.
Those were some fancy words. Let’s go through them one by one and see what they mean; afterwards, we will dive into how it all works.
Distributed
A distributed system is one which is split into multiple running machines, all of which work together in a cluster to appear as one single node to the end user. Kafka is distributed in the sense that it stores, receives and sends messages on different nodes (called brokers).
I have a thorough introduction on this topic as well.
The benefits to this approach are high scalability and fault-tolerance.
Horizontally-scalable
Let’s define the term vertical scalability first. Say, for instance, you have a traditional database server which is starting to get overloaded. The way to solve this is to simply increase the resources (CPU, RAM, SSD) of the server. This is called vertical scaling — adding more resources to the machine. There are two big disadvantages to scaling upwards:
There are limits defined by the hardware. You cannot scale upwards indefinitely.
It usually requires downtime, something which big corporations cannot afford.
Horizontal scalability solves the same problem by throwing more machines at it. Adding a new machine does not require downtime, nor are there any limits to the number of machines you can have in your cluster. The catch is that not all systems support horizontal scalability, as they are not designed to work in a cluster, and those that are are usually more complex to work with.
Horizontal scaling becomes much cheaper after a certain threshold
Fault-tolerant
Something that emerges in non-distributed systems is that they have a single point of failure (SPoF). If your single database server fails (as machines do) for whatever reason, you’re screwed.
Distributed systems are designed in such a way as to accommodate failures in a configurable way. In a 5-node Kafka cluster, you can have it continue working even if 2 of the nodes are down. It is worth noting that fault-tolerance comes at a direct tradeoff with performance: the more fault-tolerant your system is, the less performant it is.
Commit Log
A commit log (also referred to as a write-ahead log or transaction log) is a persistent, ordered data structure which only supports appends. You cannot modify or delete records from it. It is read from left to right and guarantees item ordering.
Sample illustration of a commit log
- Are you telling me that Kafka is such a simple data structure?
In many ways, yes. This structure is at the heart of Kafka and is invaluable, as it provides ordering, which in turn provides deterministic processing. Both of which are non-trivial problems in distributed systems.
Kafka actually stores all of its messages to disk (more on that later) and having them ordered in the structure lets it take advantage of sequential disk reads.
Reads and writes take constant time O(1) (given the record ID), which, compared to other structures’ O(log N) operations on disk, is a huge advantage, as each disk seek is expensive.
Reads and writes do not block one another: writing does not lock reading and vice-versa (as opposed to balanced trees).
These two points have huge performance benefits, since the data size is completely decoupled from performance. Kafka has the same performance whether you have 100KB or 100TB of data on your server.
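To make the idea concrete, here is a minimal, illustrative sketch of an append-only log in Java. This is not how Kafka implements its log (Kafka persists records to disk in segment files); it only demonstrates the semantics described above: append-only writes and O(1) reads by offset.

```java
import java.util.ArrayList;
import java.util.List;

// A minimal in-memory sketch of an append-only commit log.
// Records are only ever appended; the offset of a record is simply
// its position in the log, which makes reads by offset O(1).
public class CommitLog {
    private final List<String> records = new ArrayList<>();

    // Append a record to the end of the log and return its offset.
    public synchronized long append(String record) {
        records.add(record);
        return records.size() - 1;
    }

    // Read the record stored at a given offset in constant time.
    public String read(long offset) {
        return records.get((int) offset);
    }

    public static void main(String[] args) {
        CommitLog log = new CommitLog();
        long first = log.append("user-1 logged in");
        long second = log.append("user-2 logged in");
        System.out.println(log.read(first));   // records can be re-read at any time
        System.out.println(log.read(second));  // but never modified or deleted
    }
}
```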
How does it work?
Applications (producers) send messages (records) to a Kafka node (broker), and those messages are processed by other applications called consumers. The messages get stored in a topic, and consumers subscribe to the topic to receive new messages.
As topics can get quite big, they get split into partitions of a smaller size for better performance and scalability. (ex: say you were storing user login requests, you could split them by the first character of the user’s username)
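As an illustration, here is a minimal producer sketch using Kafka’s Java client. The broker address (localhost:9092) and the topic name (user-logins) are assumptions made up for this example; the point is that records carrying the same key are routed to the same partition.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LoginProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // address of a broker in the cluster (assumed)
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key always land in the same partition,
            // so all events for one user stay ordered relative to each other.
            producer.send(new ProducerRecord<>("user-logins", "alice", "alice logged in"));
            producer.send(new ProducerRecord<>("user-logins", "bob", "bob logged in"));
        }
    }
}
```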
Kafka guarantees that all messages inside a partition are ordered in the sequence in which they arrived. The way you identify a specific message is through its offset, which you can look at as a normal array index: a sequence number which is incremented for each new message in a partition.
Kafka follows the principle of a dumb broker and smart consumer. This means that Kafka does not keep track of which records have been read by the consumer and then delete them, but rather stores them for a set amount of time (e.g. one day) or until some size threshold is met. Consumers themselves poll Kafka for new messages and say which records they want to read. This allows them to increment/decrement the offset they’re at as they wish, thus being able to replay and reprocess events.
It is worth noting that consumers actually belong to consumer groups, which have one or more consumer processes inside. In order to avoid two processes reading the same message twice, each partition is tied to only one consumer process per group.
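The sketch below shows a consumer joining a group and polling for records; it also rewinds to the beginning of its assigned partitions to illustrate the replay capability mentioned above. The group id, topic name and broker address are example values carried over from the producer sketch, and a real application would typically handle rebalances with a listener rather than relying on the first poll.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "login-processors");  // consumers with the same group.id share the partitions
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("user-logins"));

            // The first poll joins the group and gets partitions assigned.
            consumer.poll(Duration.ofMillis(500));

            // Because the broker never deletes records on read, the consumer can
            // rewind its offsets at any time and replay everything from the start.
            consumer.seekToBeginning(consumer.assignment());

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```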
Recoverable
Now assume we have a topic replicated across three brokers and that Broker 2 has failed. Brokers 1 and 3 can still serve the data for the topic, so a replication factor of 3 is always a good idea: it allows one broker to be taken down for maintenance and another one to fail unexpectedly at the same time. If we are planning to use Kafka as storage, we also need to be aware of the configurable retention period for every topic. If we don’t take care of this setting, we might lose our data; according to the docs: “For example, if the retention policy is set to two days, then for the two days after a record is published, it is available for consumption, after which it will be discarded to free up space.”
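As an illustration, such a topic could be created programmatically with Kafka’s Java AdminClient. The topic name, partition count and 7-day retention below are example values chosen for the sketch, not recommendations.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateDurableTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed local broker

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 3: the topic survives the loss of
            // one broker for maintenance plus one unexpected failure.
            NewTopic topic = new NewTopic("user-logins", 3, (short) 3)
                    .configs(Map.of(
                            // keep records for 7 days instead of the broker default
                            TopicConfig.RETENTION_MS_CONFIG,
                            String.valueOf(7L * 24 * 60 * 60 * 1000)));
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```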
Consistent
Consider two services that each keep their own copy of a customer’s contact information, say an Orders service and an Accounts service. It can be said that the Order’s Customer contact information is eventually consistent with the Account’s Customer contact information, by way of Kafka. Eventual consistency is a model that favours high availability and informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value. Kafka makes the following guarantees about data consistency:
(1) Messages sent to a topic partition will be appended to the commit log in the order they are sent.
(2) A single consumer instance will see messages in the order they appear in the log.
(3) A message is ‘committed’ when all in-sync replicas for its partition have applied it to their log, and only committed messages are ever given out to consumers.
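Guarantee (3) is what a producer relies on when it asks for strong durability. Below is a sketch of a producer configured with acks=all, so the broker only acknowledges a write once every in-sync replica has it; the topic name and broker address are the same assumed example values used earlier.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SafeProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // acks=all: the broker acknowledges the write only after every
        // in-sync replica has it, i.e. once the message is "committed".
        props.put("acks", "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("user-logins", "alice", "contact info updated"),
                    (metadata, exception) -> {
                        if (exception == null) {
                            System.out.println("committed at offset " + metadata.offset());
                        } else {
                            exception.printStackTrace();
                        }
                    });
        }
    }
}
```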
Why has it seen so much use?
High performance, availability and scalability alone are not strong enough reasons for a company to adopt a new technology. There are other systems which boast similar properties, but none have become so widely used. Why is that?
The reason Kafka has grown in popularity (and continues to do so) is one key thing — businesses nowadays benefit greatly from event-driven architecture. This is because the world has changed — an enormous (and ever-growing) amount of data is being produced and consumed by many different services (Internet of Things, Machine Learning, Mobile, Microservices).
A single real-time event broadcasting platform with durable storage is the cleanest way to achieve such an architecture. Imagine what kind of a mess it would be if streaming data to/from each service used a different technology specifically catered to it.
This, paired with the fact that Kafka provides the appropriate characteristics for such a generalized system (durable storage, event broadcast, table and stream primitives, abstraction via KSQL, open-source, actively developed), makes it an obvious choice for companies.