These days, numerous applications follow a microservices architecture, and many of them manage large amounts of data (user activity, logs, metrics, etc.) that constantly travels back and forth between microservices. Integrating all this information can produce a series of problems, such as synchronizing, scaling and processing the data.
A potential solution would be to apply a traditional messaging system, such as queues or the publish-subscribe pattern. With queues, each message is read by a single consumer, which allows consumption to be scaled; with publish-subscribe, each message is sent to all subscribers, so scaling consumption in the same way is not possible. And with both systems, no guarantee can be made about the order in which messages are consumed.
Therefore, we need something that guarantees greater reliability and security when reading messages. This is where Apache Kafka comes into play: a distributed, redundant system for event management, initially developed at LinkedIn and open-sourced in 2011. Kafka allows you to combine the benefits of queues and publish-subscribe, as well as guarantee that messages are consumed in order.
We can summarize the main benefits of Kafka in the following points:
● It allows multiple producers and consumers. So-called consumer groups make it possible for the same message to be consumed by several consumers (one per group).
● It guarantees the order of the messages consumed (within each partition, as explained below).
● It temporarily holds data on disk. Messages are not deleted after being consumed; rather, they are kept on disk until the user-defined retention policy is met, either after a set amount of time or upon reaching a certain capacity (see the sketch after this list).
● It is both vertically and horizontally scalable and is fault tolerant.
● It provides high availability and lets applications manage large amounts of data in real time, day after day.
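As an illustration of the retention point above, the following is a minimal sketch that creates a topic with a user-defined retention policy, using the kafka-python client (one of several available libraries); the broker address, topic name and values are illustrative assumptions, not prescriptions.

```python
from kafka.admin import KafkaAdminClient, NewTopic

# Connect to a (hypothetical) local broker.
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Retain messages for 7 days, or until each partition holds ~1 GiB,
# whichever limit is reached first.
topic = NewTopic(
    name="user-activity",
    num_partitions=3,
    replication_factor=1,
    topic_configs={
        "retention.ms": "604800000",      # 7 days, in milliseconds
        "retention.bytes": "1073741824",  # 1 GiB per partition
    },
)
admin.create_topics([topic])
```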
Kafka has client libraries for many languages, such as Java, Python and Node.js, thereby facilitating communication between applications written in different languages.
How does it work?
Apache Kafka works based on a series of fundamental elements:
Messages or records: The minimum unit of information in Kafka is the message, an array of bytes with no specific format, comparable to a row (tuple) in a database. Thus, it can be used to transmit messages between microservices written in different languages. Each message carries among its metadata an integer (the offset) that uniquely identifies it within its partition.
You normally define a schema (e.g., JSON or XML) for these messages to make them easier to read and to provide consistency.
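For instance, a producer can apply such a schema consistently through a serializer. Here is a minimal sketch using the kafka-python client; the broker address, topic name and payload fields are illustrative assumptions.

```python
import json

from kafka import KafkaProducer

# Serialize every value as UTF-8 JSON so that all messages on the topic
# follow the same implicit schema.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

producer.send("user-activity", {"user_id": 42, "action": "login"})
producer.flush()  # block until the message has actually been sent
```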
Topics and partitions: Kafka messages are classified into topics, which are similar to tables in a database. Topics, in turn, are divided into partitions, where messages are stored in order of arrival, each identified by its offset. This guarantees the chronological order of the messages within a partition, but not across the topic as a whole.
Brokers and clusters: A broker is a Kafka server in charge of receiving messages from producers and committing them to logs – the disk storage structure used by Kafka. Brokers are also responsible for serving consumer requests. A cluster is nothing more than a group of brokers, one of which acts as the cluster controller, performing administrative tasks such as partition management.
A partition can be replicated across several brokers, so that the replicas provide message redundancy. Producers and consumers act on the replica known as the leader, and if its broker fails, one of the other replicas becomes the new leader. Partitions are also scalable: each one can be hosted on a different broker, allowing topics to be scaled horizontally.
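To see where a topic's partitions, leaders and replicas live, the cluster metadata can be queried. Below is a sketch with kafka-python's admin client; note that the exact shape of the returned metadata depends on the client version.

```python
from kafka.admin import KafkaAdminClient

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Each entry describes one topic: its partitions and, for each partition,
# the leader broker and the replica set.
for metadata in admin.describe_topics(["user-activity"]):
    print(metadata)
```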
Producers: They produce messages that are stored in a certain partition of a topic. By default, the partition is chosen by applying a hash to the message key (each message is composed of a key and a value), so the user can set a specific key to write to a particular partition of a topic.
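Here is a minimal sketch of keyed production, again with kafka-python (topic name, key and broker address are illustrative): messages sharing a key are hashed to the same partition and therefore keep their relative order.

```python
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=str.encode,
    value_serializer=str.encode,
)

# All events for the key "router-7" hash to the same partition,
# so they are stored (and later read) in this order.
for event in ("boot", "config-change", "reboot"):
    producer.send("device-events", key="router-7", value=event)

# A partition can also be chosen explicitly, bypassing the key hash.
producer.send("device-events", value="manual-event", partition=0)
producer.flush()
```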
Consumers: They read messages from one or several topics in the order in which they were stored. The message offset helps consumers keep track of where they left off in order to avoid losses after a system reboot.
Consumers can belong to a consumer group, identified by a group id. Within a group, each partition is consumed by exactly one member, while consumers from different groups can consume the same partition. In this way, consumers can be scaled horizontally to consume topics with many messages, and partitions are rebalanced among the remaining consumers in the event of a failure.
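For example, the sketch below joins a consumer group (group id, topic and broker address are illustrative assumptions). Running the same code in a second process with the same group_id would split the topic's partitions between the two consumers.

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "device-events",
    bootstrap_servers="localhost:9092",
    group_id="metrics-processors",   # members of this group share partitions
    auto_offset_reset="earliest",    # where to start when no offset is stored
)

# The offset identifies each record within its partition, letting the group
# resume where it left off after a restart.
for record in consumer:
    print(record.partition, record.offset, record.value)
```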
Apache ZooKeeper
This software, developed by Apache as a centralized configuration storage service, is fundamental to Kafka working correctly. Some of the key functions of ZooKeeper in the Kafka ecosystem include:
• Electing a controller from among the brokers. The controller is responsible for maintaining the leader/follower relationship for all partitions.
• Maintaining information about the topics: list of existing topics, how many partitions, where the replicas are, etc.
• Maintaining a list of each cluster’s active brokers.
At Teldat, we use Apache Kafka in our SD-WAN solution to communicate microservices and efficiently manage large amounts of device-related data.