Implement a scalable statistical aggregation system using Akka
Scala by the Bay, 12 Nov 2016
Stanley Nguyen, Vu Ho
Email Security@Symantec Singapore
The system
Provides a service that answers time-series analytical questions (COUNT, TOP-K, SET MEMBERSHIP, CARDINALITY) over a dynamic set of data streams, using a statistical approach.
Motivation
The system collects data from multiple sources in a streaming log format. Some common questions in an Email Anti-Abuse system:
- Most frequent items (IP, domain, sender, etc.)
- Number of unique items
- Have we seen an item before?
=> We need to be able to answer such questions in a timely manner.
Data statistics
- 6K email logs/second
- One email log is flattened out into subevents:
  - IP, sender, sender domain, etc.
  - Time period (last 5 minutes, 1 hour, 4 hours, 1 day, 1 week, etc.)
- Total: ~200K messages/second (so each log fans out into roughly 200K / 6K ≈ 33 subevents across fields and time windows)
Sketching data structures
- How many times have we seen a certain IP? -> Count-Min Sketch (CMS): counting things + top-K
- How many unique senders have we seen yesterday? -> HyperLogLog (HLL): set cardinality
- Did we see a certain IP last month? -> Bloom Filter (BF): set membership
All of these trade a small, bounded error for large savings in SPACE / SPEED.
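As a concrete illustration of the CMS idea, here is a minimal count-min sketch in plain Scala. This is a toy for exposition only; the system itself uses the stream-lib / Algebird implementations, and the width/depth values are arbitrary illustrative choices.

```scala
import scala.util.hashing.MurmurHash3

// Toy count-min sketch: depth independent hash rows of width counters each.
final class CountMin(width: Int, depth: Int) {
  private val table = Array.ofDim[Long](depth, width)

  // Row-specific hash, mapped into [0, width).
  private def bucket(item: String, row: Int): Int = {
    val h = MurmurHash3.stringHash(item, row)
    ((h % width) + width) % width
  }

  // Increment one counter per row for this item.
  def add(item: String): Unit =
    (0 until depth).foreach(row => table(row)(bucket(item, row)) += 1)

  // Take the minimum over rows: hash collisions can only inflate counters,
  // so the estimate is always >= the true count.
  def estimate(item: String): Long =
    (0 until depth).map(row => table(row)(bucket(item, row))).min
}
```

Because the estimate is a minimum over independent rows, CMS never undercounts, which is why it is safe for "most frequent item" style queries.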
What is available vs. what we try to solve
- Available: data structures for finding cardinality (i.e. counting things), set membership, and top-K elements -- solved by using stream-lib / Twitter Algebird.
- What we try to solve: a dynamic, reactive, distributed system for answering those cardinality, set-membership, and top-K questions.
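The set-membership structure (BF) can likewise be sketched in a few lines of plain Scala. Again, this is a toy illustration, not the stream-lib / Algebird implementation the system relies on, and the sizes are arbitrary.

```scala
import scala.collection.mutable.BitSet
import scala.util.hashing.MurmurHash3

// Toy Bloom filter: `hashes` independent hash functions over a bit array.
final class Bloom(bits: Int, hashes: Int) {
  private val set = new BitSet(bits)

  // i-th hash of the item, mapped into [0, bits).
  private def idx(item: String, i: Int): Int = {
    val h = MurmurHash3.stringHash(item, i)
    ((h % bits) + bits) % bits
  }

  // Set one bit per hash function.
  def add(item: String): Unit =
    (0 until hashes).foreach(i => set += idx(item, i))

  // May report a false positive, but never a false negative.
  def mightContain(item: String): Boolean =
    (0 until hashes).forall(i => set(idx(item, i)))
}
```

The one-sided error is what makes a BF a good fit for "did we see this IP before?": a "no" answer is always exact.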
Sketching data structures
[Architecture diagram: Producers feed a MasterActor, which routes subevents to per-type Akka actors (a CMS actor for IPs, an HLL actor for senders, a BF actor for domains); each actor initializes a stream-lib instance and updates its internal count-min table. Open question on the slide: BACK PRESSURE?]
Splitter Hub
Splits the stream, based on event type, to a dynamic set of downstream consumers. Consumers are actors which implement the CMS, BF, HLL, etc. logic. Not available in akka-stream.

Splitter Hub API
Similar to akka-stream's built-in BroadcastHub, but different in its back-pressure implementation. [[SplitterHub]].source can be supplied with a predicate/selector function to return a filtered subset of the data.
[Diagram: each Consumer attaches to the Splitter through its selector.]
Splitter Hub
The [[Source]] can be materialized any number of times: each materialization creates a new consumer, which registers with the hub and then receives the items matching its selector function from upstream. Consumers can be added at run time.
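The routing idea (each registered consumer supplies a selector and then receives only the matching items) can be sketched in plain Scala. This is a conceptual toy, not the actual SplitterHub API: the real one is a custom akka-stream hub, and this sketch ignores back pressure entirely.

```scala
import scala.collection.mutable

// Conceptual stand-in for the SplitterHub routing logic.
final class SplitterHub[T] {
  // Each consumer is a selector predicate plus a sink for matching items.
  private var consumers = Vector.empty[(T => Boolean, mutable.Buffer[T])]

  // Register a new consumer at run time, mirroring how each
  // materialization of the hub's Source registers with the hub.
  def source(selector: T => Boolean): mutable.Buffer[T] = {
    val buf = mutable.Buffer.empty[T]
    consumers :+= (selector -> buf)
    buf
  }

  // Fan one upstream element out to every consumer whose selector matches.
  def push(elem: T): Unit =
    consumers.foreach { case (sel, buf) => if (sel(elem)) buf += elem }
}
```

For example, a consumer registered with `_._1 == "ip"` on a stream of (eventType, value) pairs would see only the IP subevents, which is how one hub can drive CMS, BF, and HLL consumers for different event types.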
Consumers
- Can be either local or remote; managed by a coordination actor.
- Each implements a specific data structure (CMS/BF/HLL) for a particular event type over a specific time range.
- Responsibilities:
  - Answer a specific query.
  - Regularly persist a serialization of the internal data structure (count-min table, etc.).
[Diagram: a COUNT-QUERY arrives at the coordination actor, which forwards it by ref to the matching CMSConsumer; the CMS and HLL consumers hang off the SplitterHub and write DB snapshots of their state.]
Scaling out
- What if the data does not fit in one machine?
- What if a server crashes?
- How do we maintain back pressure end-to-end?
Akka stream TCP
- Back pressure and reliable delivery are handled by the kernel (TCP).
- For each worker, we create a source for each message type it is responsible for, using the SplitterHub source() API.
- Each source is connected to a TCP connection and streamed to the worker.
- Back pressure is maintained across the network.
[Diagram: SplitterHub sources for IP and domain events piped (~>) into TCP.bind() connections.]
Master failover
- The Coordinator is a single point of failure.
- Run multiple Coordinator actors as a Cluster Singleton.
- Workers communicate with the master (heartbeat) using Cluster Client.
Worker failover
- Each worker persists all events to a DB journal + snapshot, via Akka Persistence, with Redis storing the journal + snapshots.
- When a worker goes down, its keys are re-distributed and the master redirects traffic to the other workers.
- CMS actors are restored on the new worker from snapshot + journal.
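The recovery scheme (restore the latest snapshot, then replay only the journal events recorded after it) can be sketched in plain Scala. The Snapshot/Event types and the simple counter state below are hypothetical stand-ins for exposition; the real system uses Akka Persistence with Redis.

```scala
// Hypothetical event-sourcing types standing in for the persisted state.
final case class Snapshot(counts: Map[String, Long], upToSeqNr: Long)
final case class Event(item: String, seqNr: Long)

// Rebuild the counter state on a new worker: start from the snapshot,
// then replay only the journal events newer than the snapshot.
def recover(snapshot: Snapshot, journal: Seq[Event]): Map[String, Long] =
  journal
    .filter(_.seqNr > snapshot.upToSeqNr)
    .foldLeft(snapshot.counts) { (state, ev) =>
      state.updated(ev.item, state.getOrElse(ev.item, 0L) + 1)
    }
```

Filtering by sequence number is what keeps recovery idempotent: events already folded into the snapshot are never counted twice.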
Benchmark
- Akka-stream on a single node: 100K+ msg/second (one msg-type)
- Akka-stream on a remote node (remote TCP): 15-20K msg/second (one msg-type)
- Akka-stream on a remote node (remote TCP) with Akka persistent journal: 2000+ msg/second (one msg-type)
Conclusion
Our system is responsive, reactive, scalable, and resilient.
Future work:
- Make workers metric-agnostic
- Scale out the master
- Exactly-once delivery for workers
- More flexible filtering using SplitterHub