CTO Cheat Sheet: Apache Storm

What is Storm?

Apache Storm is a free and open source distributed real-time computation system. It makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing.

What Problem Does it Solve?

Storm is built for workloads where data arrives continuously and has to be processed as it arrives, rather than collected first and processed later. Here are some typical use cases:

  • Real-time processing
  • Machine learning
  • Business intelligence
  • Big data analytics
  • Log monitoring and auditing

Basic Concepts

  • Topology: A topology defines the workload for real-time stream processing. It is a graph of spouts and bolts connected by streams, with at least one spout feeding one or more bolts. It is like a MapReduce job in Hadoop, except that a MapReduce job eventually finishes while a topology runs until it is killed.
  • Stream: The stream is the core abstraction in Storm. A stream is an unbounded sequence of tuples that is processed and created in parallel in a distributed fashion. (Each tuple carries data emitted by a spout or a bolt and is passed on to other bolts, which transform it further.)
  • Spout: A source of streams. (For example, a spout can pull data from social media such as Twitter, Instagram, or Facebook.)
  • Bolts: All processing in topologies is done in bolts. Bolts can do anything from filtering, functions, aggregations, and joins to talking to databases, and more. (For example, one bolt can filter Twitter data by some criterion, such as keeping only tweets in English or tweets about a trending event. Another bolt can then store those tweets in a repository, or send them to an external service and wait for the result before passing the data on to the next bolt.)
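
The concepts above can be sketched in a few lines of plain Python. This is only a model of the data flow, not the Apache Storm API: the names tweet_spout, english_filter_bolt, split_words_bolt, and count_bolt are hypothetical, the "tweets" are a small in-memory sample, and everything runs sequentially in one process, whereas a real topology runs in parallel across a cluster and does not terminate on its own.

```python
from collections import Counter
from typing import Iterable, Iterator

# A "tuple" is modelled as a dict of named fields (an assumption for this sketch).
TupleData = dict


def tweet_spout() -> Iterator[TupleData]:
    """Spout: the source of the stream. A fixed sample stands in for a live feed."""
    sample = [
        {"lang": "en", "text": "storm makes streams easy"},
        {"lang": "es", "text": "hola storm"},
        {"lang": "en", "text": "streams never end"},
    ]
    yield from sample


def english_filter_bolt(stream: Iterable[TupleData]) -> Iterator[TupleData]:
    """Bolt: filtering. Only tuples whose language is English are passed on."""
    for tup in stream:
        if tup["lang"] == "en":
            yield tup


def split_words_bolt(stream: Iterable[TupleData]) -> Iterator[TupleData]:
    """Bolt: function. Emits one new tuple per word in the tweet text."""
    for tup in stream:
        for word in tup["text"].split():
            yield {"word": word}


def count_bolt(stream: Iterable[TupleData]) -> Counter:
    """Bolt: aggregation. Keeps a running count per word."""
    counts: Counter = Counter()
    for tup in stream:
        counts[tup["word"]] += 1
    return counts


def run_topology() -> Counter:
    """The 'topology': one spout wired to a chain of three bolts."""
    return count_bolt(split_words_bolt(english_filter_bolt(tweet_spout())))


word_counts = run_topology()
print(word_counts)
```

Each bolt only sees the tuples emitted by the step before it, which is the property that lets a real cluster spread spouts and bolts over many machines.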

Storm vs. Kafka, Hadoop, and Spark

Storm

  • Distributed real-time processing
  • Stateless; data is streamed
  • Stream abstraction
  • Processes one tuple at a time (micro-batching is available through the Trident API)

Kafka

  • Distributed message broker
  • Transfers messages; data is stored on the filesystem
  • Uses the publish-subscribe paradigm
  • Stream processing

Hadoop

  • Distributed processing
  • State-based; data is static and stored
  • MapReduce cluster computing paradigm
  • Batch Processing

Spark

  • Distributed processing
  • Stateless / Stateful
  • Resilient distributed dataset (RDD)
  • Batch processing (stream processing is available as micro-batches through Spark Streaming)
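
The processing styles named in this comparison differ mainly in when a result becomes available. The following plain-Python sketch illustrates that difference with a simple count over a short list of events. It is a toy model under stated assumptions: the function names per_tuple, micro_batch, and batch are hypothetical, and none of the code uses the Storm, Kafka, Hadoop, or Spark APIs.

```python
from collections import Counter
from itertools import islice

# A short, finite list stands in for incoming events (an assumption for this sketch).
events = ["a", "b", "a", "c", "a", "b"]


def per_tuple(stream):
    """Tuple-at-a-time: update the counts and emit a result after every event."""
    counts = Counter()
    for event in stream:
        counts[event] += 1
        yield dict(counts)


def micro_batch(stream, size):
    """Micro-batching: collect `size` events, then emit one result per batch."""
    counts = Counter()
    iterator = iter(stream)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            break
        counts.update(chunk)
        yield dict(counts)


def batch(dataset):
    """Batch: read the whole stored dataset, then produce a single result."""
    return dict(Counter(dataset))


per_tuple_results = list(per_tuple(events))
micro_batch_results = list(micro_batch(events, size=3))
batch_result = batch(events)

print(len(per_tuple_results), len(micro_batch_results), batch_result)
```

All three approaches end with the same counts. What changes is how many intermediate results are emitted along the way, and therefore how soon a consumer can see one.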

