With the rise of social media, large amounts of data must be processed to power real-time views.
Stream processing
Stream processing is used to process large amounts of data with low latency and high throughput. Classical solutions such as MapReduce may not be efficient here, since MapReduce is essentially batch processing: it must wait for the entire computation over a large dataset to complete before results can be viewed (there are no intermediate results).
So I chose one of the most popular Apache projects, Storm, to investigate its inner mechanisms and components.
Apache Storm
Storm stream processing is built from the following entities: Tuples, Streams, Spouts, Bolts, and Topologies.
Tuples
A tuple is basically an ordered list of elements.
Streams
A stream is a potentially unbounded sequence of tuples.
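These two definitions can be sketched in plain Python (this is illustrative, not the Storm API): a tuple is just an ordered list of values, and a stream is modeled as a generator that could yield forever. The `sensor_stream` source and its values are made up for the example.

```python
import itertools

def sensor_stream():
    """Yield an unbounded stream of (reading_id, value) tuples."""
    for i in itertools.count():
        yield (i, i * 0.5)  # hypothetical sensor readings

# A consumer only ever sees a finite prefix of the unbounded stream:
first_three = list(itertools.islice(sensor_stream(), 3))
print(first_three)  # [(0, 0.0), (1, 0.5), (2, 1.0)]
```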
Spouts
A spout is a Storm entity (process) that acts as a source of streams, often reading from a crawler or a database.
Bolts
A bolt is a Storm entity (process) that processes input streams and outputs new streams for other bolts. In other words, this is the entity that actually processes the data.
Topologies
A topology is a directed graph of spouts and bolts; it corresponds to a Storm "application". A topology may contain cycles.
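A topology can be pictured as an adjacency list over named components. The sketch below is plain Python (not Storm's TopologyBuilder API), and the component names are invented for illustration.

```python
# Directed graph: each component maps to the components that
# consume its output stream.
topology = {
    "tweet-spout": ["split-bolt"],   # spout feeds a bolt
    "split-bolt":  ["count-bolt"],   # bolt feeds another bolt
    "count-bolt":  [],               # terminal bolt, no consumers
}

def downstream(component):
    """Return the components subscribed to this component's stream."""
    return topology[component]

print(downstream("tweet-spout"))  # ['split-bolt']
```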
Bolts Operations
Filter: only tuples that satisfy a condition pass through.
Join: output all pairs of tuples from streams A and B that satisfy a condition.
Apply/Transform: apply a function to each tuple in the input stream, modifying it.
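The three operations above can be sketched over plain Python iterables (illustrative only, not the Storm bolt API; a real streaming join would operate over windows, so both join inputs are kept finite here for simplicity):

```python
def filter_bolt(stream, predicate):
    """Filter: only tuples satisfying the condition pass through."""
    return (t for t in stream if predicate(t))

def transform_bolt(stream, fn):
    """Apply/Transform: apply a function to each tuple."""
    return (fn(t) for t in stream)

def join_bolt(stream_a, stream_b, condition):
    """Join: all pairs from A and B satisfying the condition."""
    return [(a, b) for a in stream_a for b in stream_b if condition(a, b)]

evens = list(filter_bolt([1, 2, 3, 4], lambda t: t % 2 == 0))
print(evens)  # [2, 4]
```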
Bolts are parallelizable. Each incoming tuple is assigned to a bolt instance according to a grouping strategy. Popular ones are: all grouping (every bolt instance receives every tuple), shuffle grouping (tuples are distributed evenly), and fields grouping (tuples are assigned according to a subset of their fields, so tuples with equal values in those fields go to the same instance).
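Fields grouping is typically implemented by hashing the chosen field subset, so equal field values always route to the same task. A minimal sketch of that idea (illustrative; Storm's actual implementation differs in detail):

```python
def fields_grouping(tup, field_indices, num_tasks):
    """Pick a task index from a subset of the tuple's fields."""
    key = tuple(tup[i] for i in field_indices)
    return hash(key) % num_tasks

# Tuples sharing the same value in field 0 land on the same task:
t1 = fields_grouping(("storm", 1), [0], 4)
t2 = fields_grouping(("storm", 99), [0], 4)
print(t1 == t2)  # True
```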
Storm Cluster
Master Node:
- runs a daemon called Nimbus
- distributes code around the cluster and assigns tasks to machines
- monitors for failures in the cluster
Worker Node:
- runs a daemon called Supervisor
- listens for work assigned to its machine
ZooKeeper System:
- coordinates communication between Nimbus and the Supervisors
- all state of the Supervisors and Nimbus is kept here
Tuple-wise API
Emit(tuple, output): emits a new tuple on an output stream; the input tuple is passed as the first argument if the new tuple should be anchored to it.
Ack(tuple): acknowledges that processing of a tuple has finished.
Fail(tuple): fails the spout tuple at the root of the tuple tree when anything goes wrong (e.g., an exception from the database), so it can be replayed.
Every tuple must eventually be acked or failed; since each tracked tuple consumes resources, forgetting to do so leads to a memory leak.
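The leak is easy to see with a toy tracker that holds state per pending tuple until ack or fail releases it (illustrative bookkeeping only, not Storm's XOR-based acker algorithm):

```python
pending = {}  # tuple id -> tuple, held until ack/fail

def emit(tup_id, tup):
    pending[tup_id] = tup       # resources reserved per tracked tuple

def ack(tup_id):
    pending.pop(tup_id, None)   # processing finished, release state

def fail(tup_id):
    return pending.pop(tup_id, None)  # caller replays from the spout

emit(1, ("hello",))
emit(2, ("world",))
ack(1)                          # tuple 2 is never acked or failed...
print(len(pending))             # 1 -- leaked state grows unbounded
```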
- Author: Yu Wan
- Link: https://cyanh1ll.github.io/2021/01/15/Stream/
- Copyright: CYANH1LL