With the rise of social media, large amounts of data must be processed to power real-time views.
Stream processing
Stream processing is used to process large amounts of data with low latency and high throughput. Classical solutions such as MapReduce may not be efficient here, since MapReduce is essentially batch processing: it must wait for the entire computation over a large dataset to complete before results can be viewed (there are no intermediate results).
So I chose one of the most popular Apache projects, Storm, to investigate its inner mechanisms and components.
Apache Storm
Storm stream processing is built from the following entities: Tuples, Streams, Spouts, Bolts, and Topologies.
Tuples
A tuple is basically an ordered list of elements.
Streams
A stream is a potentially unbounded sequence of tuples.
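These two definitions can be sketched in plain Python (this is illustrative, not the Storm API): a tuple is just an ordered list of values, and a stream is modeled as a generator that could yield forever. The `sensor_stream` source and its values are made up for the example.

```python
import itertools

def sensor_stream():
    """Yield an unbounded stream of (reading_id, value) tuples."""
    for i in itertools.count():
        yield (i, i * 0.5)  # hypothetical sensor readings

# A consumer only ever sees a finite prefix of the unbounded stream:
first_three = list(itertools.islice(sensor_stream(), 3))
print(first_three)  # [(0, 0.0), (1, 0.5), (2, 1.0)]
```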
Spouts
A spout is a Storm entity (process) that acts as a source of streams, often reading from a crawler or a database.
Bolts
A bolt is a Storm entity (process) that processes input streams and outputs new streams for other bolts. In other words, this is the entity that actually processes the data.
Topologies
A topology is a directed graph of spouts and bolts; it corresponds to a Storm "application". A topology may contain cycles.
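A topology can be pictured as an adjacency list over named components. The sketch below is plain Python (not Storm's TopologyBuilder API), and the component names are invented for illustration.

```python
# Directed graph: each component maps to the components that
# consume its output stream.
topology = {
    "tweet-spout": ["split-bolt"],   # spout feeds a bolt
    "split-bolt":  ["count-bolt"],   # bolt feeds another bolt
    "count-bolt":  [],               # terminal bolt, no consumers
}

def downstream(component):
    """Return the components subscribed to this component's stream."""
    return topology[component]

print(downstream("tweet-spout"))  # ['split-bolt']
```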
Bolts Operations
Filter: only tuples that satisfy a condition pass through.
Join: output all pairs of tuples from streams A and B that satisfy a condition.
Apply/Transform: apply a function to each tuple in the input stream, modifying it.
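The three operations above can be sketched over plain Python iterables (illustrative only, not the Storm bolt API; a real streaming join would operate over windows, so both join inputs are kept finite here for simplicity):

```python
def filter_bolt(stream, predicate):
    """Filter: only tuples satisfying the condition pass through."""
    return (t for t in stream if predicate(t))

def transform_bolt(stream, fn):
    """Apply/Transform: apply a function to each tuple."""
    return (fn(t) for t in stream)

def join_bolt(stream_a, stream_b, condition):
    """Join: all pairs from A and B satisfying the condition."""
    return [(a, b) for a in stream_a for b in stream_b if condition(a, b)]

evens = list(filter_bolt([1, 2, 3, 4], lambda t: t % 2 == 0))
print(evens)  # [2, 4]
```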
Bolts are parallelizable. Each incoming tuple is assigned to a bolt instance according to a grouping strategy. Popular ones are: all grouping (every bolt instance receives every tuple), shuffle grouping (tuples are distributed evenly), and fields grouping (tuples are assigned according to a subset of their fields, so tuples with equal values in those fields go to the same instance).
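Fields grouping is typically implemented by hashing the chosen field subset, so equal field values always route to the same task. A minimal sketch of that idea (illustrative; Storm's actual implementation differs in detail):

```python
def fields_grouping(tup, field_indices, num_tasks):
    """Pick a task index from a subset of the tuple's fields."""
    key = tuple(tup[i] for i in field_indices)
    return hash(key) % num_tasks

# Tuples sharing the same value in field 0 land on the same task:
t1 = fields_grouping(("storm", 1), [0], 4)
t2 = fields_grouping(("storm", 99), [0], 4)
print(t1 == t2)  # True
```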
Storm Cluster
Master Node:
- runs a daemon called Nimbus
- distributes code around the cluster and assigns tasks to machines
- monitors for failures in the cluster
Worker Node:
- runs a daemon called Supervisor
- listens for work assigned to its machine
ZooKeeper System:
- coordinates communication between Nimbus and the Supervisors
- all state of the Supervisors and Nimbus is kept here
Tuple-wise API
Emit(tuple, output): emits a new tuple on an output stream; the input tuple is passed as the first argument if the new tuple should be anchored to it.
Ack(tuple): acknowledges that processing of a tuple has finished.
Fail(tuple): fails the spout tuple at the root of the tuple tree when anything goes wrong (e.g., an exception from the database), so it can be replayed.
Every tuple must eventually be acked or failed; since each tracked tuple consumes resources, forgetting to do so leads to a memory leak.
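The leak is easy to see with a toy tracker that holds state per pending tuple until ack or fail releases it (illustrative bookkeeping only, not Storm's XOR-based acker algorithm):

```python
pending = {}  # tuple id -> tuple, held until ack/fail

def emit(tup_id, tup):
    pending[tup_id] = tup       # resources reserved per tracked tuple

def ack(tup_id):
    pending.pop(tup_id, None)   # processing finished, release state

def fail(tup_id):
    return pending.pop(tup_id, None)  # caller replays from the spout

emit(1, ("hello",))
emit(2, ("world",))
ack(1)                          # tuple 2 is never acked or failed...
print(len(pending))             # 1 -- leaked state grows unbounded
```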
- Author: Yu Wan
- Link: https://cyanh1ll.github.io/2021/01/15/Stream/
- Copyright: CYANH1LL