Admios Initial Letter

CTO Cheat Sheet: Apache Kafka

Our collection of knowledge, best practices, and tips we’ve learned over the past 15 years.

What is Apache Kafka?

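Apache Kafka is an open-source distributed event streaming platform. It was originally created at LinkedIn and is now developed as an Apache Software Foundation project, with Confluent among its major contributors. Kafka helps you decouple data streams and systems to achieve a few goals: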

Distributed, resilient, fault-tolerant architecture
Horizontal scalability
High performance

What Problem Does it Help Solve?

Kafka is used as a transport mechanism for data moving between systems. Here are some common applications:

Messaging systems
Activity tracking
Metrics gathering from different locations
Log aggregation
Stream processing
Decoupling of system dependencies

As Netflix's engineers describe it: "Netflix embraces Apache Kafka as the de-facto standard for its eventing, messaging, and stream processing needs. Kafka acts as a bridge for all point-to-point and Netflix Studio wide communications. It provides us with the high durability and linearly scalable, multi-tenant architecture required for operating systems at Netflix."

Basic Concepts

Topics: a topic is a particular stream of data (similar to a table in a database)
Topics are split into partitions
Each partition is ordered
Each message within a partition gets an incremental ID, called an offset
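
The relationship between topics, partitions, and offsets can be sketched with a toy in-memory model. This is only an illustration under simplifying assumptions: the Partition and Topic classes and the "page-views" topic name are hypothetical and are not part of any Kafka client API.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """An append-only, ordered log; a message's offset is its position."""
    messages: list = field(default_factory=list)

    def append(self, message: str) -> int:
        self.messages.append(message)
        return len(self.messages) - 1  # offset assigned to this message


@dataclass
class Topic:
    """A named stream of data split into a fixed number of partitions."""
    name: str
    partitions: list

    @classmethod
    def create(cls, name: str, num_partitions: int) -> "Topic":
        return cls(name, [Partition() for _ in range(num_partitions)])


# Hypothetical topic with two partitions.
topic = Topic.create("page-views", num_partitions=2)
first = topic.partitions[0].append("user-1 viewed /home")
second = topic.partitions[0].append("user-1 viewed /pricing")
other = topic.partitions[1].append("user-2 viewed /home")
print(first, second, other)  # offsets are per partition: 0 1 0
```

Note that offsets restart at 0 in every partition, so ordering is only meaningful within a single partition, not across the whole topic.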


A Kafka cluster is composed of multiple brokers (servers)
Each broker is identified by its ID

Producers write data to topics
Producers automatically know which broker and partition to write to
If a broker fails, producers recover automatically
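
One common way for a producer to decide where a message goes is to hash the message key, so that messages with the same key always land in the same partition. The sketch below assumes that approach. The choose_partition function is hypothetical, and zlib.crc32 is a stand-in hash picked for this example, not the hash that Kafka's own partitioner uses.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition index deterministically.

    zlib.crc32 is a stand-in hash chosen for this sketch; it is not
    the hash function Kafka's own partitioner uses.
    """
    return zlib.crc32(key) % num_partitions


NUM_PARTITIONS = 3  # assumed partition count for the example
for key in (b"user-1", b"user-2", b"user-1"):
    # The same key maps to the same partition on every call.
    print(key.decode(), "->", choose_partition(key, NUM_PARTITIONS))
```

Because the mapping depends on the partition count, changing the number of partitions would move keys to different partitions in a scheme like this.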

Consumers read data from topics

Consumers read data from a topic
Consumers know which broker to read from
If a broker fails, consumers know how to recover
Data is read in order within each partition
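
One ingredient of consumer recovery is remembering how far a partition has been read. The toy sketch below shows in-order reads and resuming from a saved offset. The ToyConsumer class is hypothetical, a plain Python list stands in for a partition, and real clients handle offset storage and broker failover for you.

```python
class ToyConsumer:
    """Reads one partition in order and remembers its position."""

    def __init__(self, partition: list, committed_offset: int = 0):
        self.partition = partition
        self.position = committed_offset  # next offset to read

    def poll(self, max_messages: int = 2) -> list:
        """Return up to max_messages messages, in offset order."""
        batch = self.partition[self.position:self.position + max_messages]
        self.position += len(batch)
        return batch


# A plain list stands in for one partition of a topic.
partition = ["m0", "m1", "m2", "m3", "m4"]

consumer = ToyConsumer(partition)
first_batch = consumer.poll()
committed = consumer.position  # in practice, saved somewhere durable

# Simulate a crash: a new consumer resumes from the committed offset.
restarted = ToyConsumer(partition, committed_offset=committed)
second_batch = restarted.poll()
print(first_batch, second_batch)  # ['m0', 'm1'] ['m2', 'm3']
```

Resuming from the saved offset means no message is skipped, and the order within the partition is preserved across the restart.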

Kafka vs. Storm, Hadoop, and Spark

Storm

Distributed real-time processing
Stateless; data is streamed
Stream abstraction
Micro-batch processing (through the Trident API)

Kafka

It is a distributed message broker
It is about transferring messages; data is stored in the filesystem
Uses the publish-subscribe paradigm
Stream processing

Hadoop

Distributed processing
State-based; data is static and stored
MapReduce cluster computing paradigm
Batch processing

Spark

Distributed processing
Stateless / Stateful
Resilient distributed dataset (RDD)
Batch processing
