Admios

CTO Cheat Sheet: Apache Kafka

Our collection of knowledge, best practices, and tips we’ve learned over the past 15 years.

What is Apache Kafka?

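Apache Kafka is an open-source distributed event-streaming platform. It was originally created at LinkedIn and is now maintained as an Apache Software Foundation project, with Confluent as one of its major contributors. Kafka helps you decouple data streams from the systems that produce and consume them, with a few goals in mind: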

  • Distributed, resilient architecture, fault tolerant
  • Horizontal scalability
  • High Performance

What Problem Does it Help Solve? 

Kafka is used as a transport mechanism for data moving between systems. Here are some common applications:

  • Messaging systems
  • Activity tracking
  • Gathering metrics from different locations
  • Gathering logs
  • Stream processing
  • Decoupling of system dependencies

In Netflix's own words: "Netflix embraces Apache Kafka as the de-facto standard for its eventing, messaging, and stream processing needs. Kafka acts as a bridge for all point-to-point and Netflix Studio wide communications. It provides us with the high durability and linearly scalable, multi-tenant architecture required for operating systems at Netflix."

Basic Concepts

  • Topics: a topic is a particular stream of data (similar to a table in a database)
  • Topics are split into partitions
  • Each partition is ordered
  • Each message within a partition gets an incremental ID, called an offset
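
To make the topic, partition, and offset vocabulary concrete, here is a minimal in-memory sketch. It is a toy model only and not the Kafka client API: the `ToyTopic` class, its `append` method, and the "page-views" topic name are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTopic:
    """Hypothetical in-memory stand-in for a topic (not a real Kafka class)."""
    name: str
    num_partitions: int
    partitions: list = field(init=False)

    def __post_init__(self):
        # Each partition is its own ordered, append-only list of messages.
        self.partitions = [[] for _ in range(self.num_partitions)]

    def append(self, partition: int, message: str) -> int:
        """Append a message to one partition and return its offset."""
        log = self.partitions[partition]
        log.append(message)
        # The offset is the message's position within this partition only.
        return len(log) - 1


topic = ToyTopic("page-views", num_partitions=2)
first = topic.append(0, "user-1 viewed /home")
second = topic.append(0, "user-1 viewed /pricing")
other = topic.append(1, "user-2 viewed /home")
```

In this sketch offsets grow independently in each partition, so ordering holds within a partition but not across the whole topic.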


  • A Kafka cluster is composed of multiple brokers (servers)
  • Each broker is identified by its ID
  • Producers write data to topics
  • Producers automatically know which broker and partition to write to
  • If a broker fails, producers recover automatically
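
One way a producer can "know" where to write is to derive the partition from the message key, so that every message with the same key lands in the same partition. The sketch below illustrates that idea in memory. It is not the Kafka client: `ToyProducer` and `choose_partition` are hypothetical names, and `zlib.crc32` is only a stand-in for whatever hash a real client applies to the key.

```python
import zlib


def choose_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (crc32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class ToyProducer:
    """Hypothetical in-memory producer, for illustration only."""

    def __init__(self, num_partitions: int):
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, key: str, value: str):
        """Append the record and return (partition, offset)."""
        partition = choose_partition(key, len(self.partitions))
        log = self.partitions[partition]
        log.append((key, value))
        return partition, len(log) - 1


producer = ToyProducer(num_partitions=3)
p1, o1 = producer.send("user-42", "login")
p2, o2 = producer.send("user-42", "logout")
```

Because the hash is deterministic, both records for "user-42" go to the same partition and keep their relative order there.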

From the consumer side

  • Read data from a topic
  • Know which broker to read from
  • If a broker fails, consumers know how to recover
  • Data is read in order within each partition
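
Consumer recovery can be pictured as remembering the last offset processed and resuming from it. The following is a minimal in-memory sketch under that assumption, not the Kafka consumer API: `ToyConsumer`, `poll`, and `committed_offset` are hypothetical names, and the partition is modeled as a plain list.

```python
class ToyConsumer:
    """Hypothetical in-memory consumer reading a single partition."""

    def __init__(self, partition_log, committed_offset=0):
        self.log = partition_log
        # Position of the next message to read.
        self.position = committed_offset

    def poll(self, max_records=10):
        """Return the next messages in offset order and advance the position."""
        batch = self.log[self.position:self.position + max_records]
        self.position += len(batch)
        return batch


log = ["m0", "m1", "m2", "m3", "m4"]

consumer = ToyConsumer(log)
first_batch = consumer.poll(max_records=2)
committed = consumer.position  # offset saved before the simulated failure

# Simulate a restart: a new consumer resumes from the saved offset.
restarted = ToyConsumer(log, committed_offset=committed)
rest = restarted.poll()
```

The restarted consumer picks up exactly where the first one stopped, which is why reads stay ordered within the partition across failures in this model.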

Kafka vs. Storm, Hadoop, and Spark

Storm

  • Distributed real-time processing
  • Stateless; data is streamed
  • Stream abstraction
  • Micro-batch processing

Kafka

  • Distributed message broker
  • Focused on transferring messages; data is stored on the filesystem
  • Uses the publish-subscribe paradigm
  • Stream processing

Hadoop

  • Distributed processing
  • State-based; data is static and stored
  • MapReduce cluster computing paradigm
  • Batch Processing

Spark

  • Distributed processing
  • Stateless / Stateful
  • Resilient distributed dataset (RDD)
  • Batch processing
