Implement a scalable statistical aggregation system using Akka

Scala by the Bay, 12 Nov 2016

Stanley Nguyen, Vu Ho

Email Security@Symantec Singapore

The system

Provides a service to answer time-series analytical questions such as COUNT, TOPK, SET MEMBERSHIP, and CARDINALITY on a dynamic set of data streams, using a statistical approach.

Motivation

- The system collects data from multiple sources in streaming log format.
- Some common questions in an Email Anti-Abuse system:
  - Most frequent items (IP, domain, sender, etc.)
  - Number of unique items
  - Have we seen an item before?

=> Need to be able to answer such questions in a timely manner.

Data statistics

- 6K email logs/second
- One email log is flattened out into subevents:
  - IP, sender, sender domain, etc.
  - Time period (last 5 minutes, 1 hour, 4 hours, 1 day, 1 week, etc.)
- Total: ~200K messages/second

Challenges

Our system needs to be: Responsive · Space efficient · Reactive · Extensible · Scalable · Resilient

Sketching data structures

- How many times have we seen a certain IP? → Count-Min Sketch (CMS): counting things + top-k
- How many unique senders have we seen yesterday? → HyperLogLog (HLL): set cardinality
- Did we see a certain IP last month? → Bloom Filter (BF): set membership

These sketches trade exactness for SPACE / SPEED.

What is available: the data structures for counting, cardinality, set membership, and top-k elements, solved by using stream-lib / Twitter Algebird.

What we try to solve: a dynamic, reactive, distributed system for answering those counting, cardinality, set-membership, and top-k queries.
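As a rough illustration of what the libraries already provide, here is a minimal, self-contained sketch using Twitter Algebird (API names per algebird-core 0.13.x and may differ in other versions; all parameter values are illustrative, not the authors' settings):

```scala
// build.sbt (assumption): libraryDependencies += "com.twitter" %% "algebird-core" % "0.13.5"
import com.twitter.algebird._

object SketchExamples extends App {
  // Count-Min Sketch with heavy hitters: approximate counts + top-k style queries
  val cmsMonoid = TopPctCMS.monoid[String](eps = 0.001, delta = 1e-8, seed = 1, heavyHittersPct = 0.01)
  val cms = cmsMonoid.create(Seq("1.2.3.4", "1.2.3.4", "5.6.7.8"))
  println(cms.frequency("1.2.3.4").estimate) // ~2: how many times have we seen this IP?
  println(cms.heavyHitters)                  // the most frequent items

  // HyperLogLog: approximate set cardinality
  val hllMonoid = new HyperLogLogMonoid(bits = 12)
  val hll = Seq("alice@a.com", "bob@b.com", "alice@a.com")
    .map(s => hllMonoid.create(s.getBytes("UTF-8")))
    .reduce(hllMonoid.plus)
  println(hll.estimatedSize)                 // ~2: how many unique senders?

  // Bloom Filter: approximate set membership (no false negatives)
  val bf = BloomFilter[String](numEntries = 100000, fpProb = 0.01).create("1.2.3.4", "5.6.7.8")
  println(bf.contains("1.2.3.4").isTrue)     // true: did we see this IP?
}
```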

Sketching data structures (inside an actor)

- Initialize a stream-lib count-min instance.
- Update the internal count-min instance on each event, as sketched below.
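A minimal sketch of such a consumer, assuming stream-lib's CountMinSketch (the Observe/CountQuery message names and the sketch parameters are illustrative, not the authors' code):

```scala
import akka.actor.Actor
import com.clearspring.analytics.stream.frequency.CountMinSketch

// Hypothetical messages for illustration
final case class Observe(key: String)
final case class CountQuery(key: String)

class CmsActor extends Actor {
  // Initialize the stream-lib count-min instance (eps / confidence / seed are illustrative)
  private val cms = new CountMinSketch(0.001, 0.999, 1)

  def receive: Receive = {
    case Observe(key)    => cms.add(key, 1)                   // update the internal count-min instance
    case CountQuery(key) => sender() ! cms.estimateCount(key) // answer with the approximate count
  }
}
```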

Responsive · Space efficient · Reactive · Extensible · Scalable · Resilient

Akka Actor (first attempt)

[Diagram: a Producer sends events to a MasterActor, which routes them to per-metric actors: CMS IpActor, HLL SenderActor, BF DomainActor. BACK PRESSURE?]

Akka Stream: GraphDSL

[Diagram: a graph of FlowShape nodes wired together with GraphDSL.]

Using GraphDSL, each event is a tuple of (msg-type, @timestamp, key, value), as in the sketch below.
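A minimal GraphDSL sketch of this shape (the event tuple follows the slide; the broadcast-plus-filter wiring and the sink names are illustrative assumptions, not the authors' graph):

```scala
import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, ClosedShape}
import akka.stream.scaladsl._

object GraphDslSketch extends App {
  implicit val system: ActorSystem = ActorSystem("sketches")
  implicit val mat: ActorMaterializer = ActorMaterializer()

  // Event shape from the slide: (msg-type, @timestamp, key, value)
  type Event = (String, Long, String, Long)

  val source = Source(List[Event](
    ("ip", 1L, "1.2.3.4", 1L),
    ("sender", 2L, "alice@a.com", 1L)
  ))

  // Stand-ins for the CMS / HLL consumers
  val cmsSink = Sink.foreach[Event](e => println(s"CMS <- $e"))
  val hllSink = Sink.foreach[Event](e => println(s"HLL <- $e"))

  val graph = RunnableGraph.fromGraph(GraphDSL.create() { implicit b =>
    import GraphDSL.Implicits._
    val bcast = b.add(Broadcast[Event](2))
    source ~> bcast
    bcast.out(0).filter(_._1 == "ip")     ~> cmsSink
    bcast.out(1).filter(_._1 == "sender") ~> hllSink
    ClosedShape
  })

  graph.run()
}
```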

GraphDSL - Limitations

The graph topology is FIXED: once materialized, stages cannot be added or removed.

Our design – Dynamic stream

MergeHub

- Provided by Akka Stream (see the sketch after this list).
- Allows a dynamic set of TCP producers.

SplitterHub

- Splits the stream based on event type to a dynamic set of downstream consumers.
- Consumers are actors which implement the CMS, BF, HLL, etc. logic.
- Not available in akka-stream.
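For the MergeHub side, a minimal sketch using the built-in Akka Stream API (the consumer sink and buffer size are illustrative):

```scala
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{MergeHub, Sink, Source}

object MergeHubSketch extends App {
  implicit val system: ActorSystem = ActorSystem("hub")
  implicit val mat: ActorMaterializer = ActorMaterializer()

  type Event = (String, Long, String, Long) // (msg-type, @timestamp, key, value)

  // Materializing MergeHub.source yields a Sink that any number of
  // producers can attach to at run time.
  val consumer = Sink.foreach[Event](println)
  val toHub: Sink[Event, NotUsed] =
    MergeHub.source[Event](perProducerBufferSize = 16).to(consumer).run()

  // Dynamic producers (in the real system these are TCP connections):
  Source.single(("ip", 1L, "1.2.3.4", 1L)).runWith(toHub)
  Source.single(("sender", 2L, "alice@a.com", 1L)).runWith(toHub)
}
```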

SplitterHub API

- Similar to the built-in akka-stream BroadcastHub, but with a different back-pressure implementation.
- [[SplitterHub]].source can be supplied with a predicate/selector function to return a filtered subset of the data.

[Diagram: the Splitter feeds each Consumer through its selector.]

Splitter Hub’s Implementation

SplitterHub

- The [[Source]] can be materialized any number of times; each materialization creates a new consumer which is registered with the hub and then receives the items matching its selector function from upstream.
- Consumers can be added at run time.
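SplitterHub itself is the authors' custom stage, so its code is not reproduced here. As a rough stand-in, the built-in BroadcastHub plus a filter gives the same selector semantics (though not the same back-pressure behaviour, which is precisely what SplitterHub changes):

```scala
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{BroadcastHub, Keep, Sink, Source}

object SplitterStandIn extends App {
  implicit val system: ActorSystem = ActorSystem("hub")
  implicit val mat: ActorMaterializer = ActorMaterializer()

  type Event = (String, Long, String, Long) // (msg-type, @timestamp, key, value)

  val events: Source[Event, NotUsed] = Source(List(
    ("ip", 1L, "1.2.3.4", 1L),
    ("sender", 2L, "alice@a.com", 1L)
  ))

  // Materializing the hub once yields a Source that can be materialized many
  // times; each materialization registers a new consumer at run time.
  val hubSource: Source[Event, NotUsed] =
    events.toMat(BroadcastHub.sink(bufferSize = 256))(Keep.right).run()

  // Each consumer supplies a selector to receive only its subset:
  hubSource.filter(_._1 == "ip").runWith(Sink.foreach(e => println(s"CMS <- $e")))
  hubSource.filter(_._1 == "sender").runWith(Sink.foreach(e => println(s"HLL <- $e")))
}
```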

Consumers

- Can be either local or remote; managed by the coordination actor.
- Each implements a specific data structure (CMS/BF/HLL) for a particular event type over a specific time range.
- Responsibilities:
  - Answer a specific query.
  - Regularly persist a serialization of the internal data structure (count-min table, etc.).

[Diagram: the SplitterHub feeds a CMSConsumer and an HLLConsumer; a COUNT-QUERY arrives at the coordination actor, which forwards it to the matching consumer's ref; consumers write DB snapshots.]

Responsive · Space efficient · Reactive · Extensible · Scalable · Resilient

Scaling out

- What if the data does not fit in one machine?
- What if a server crashes?
- How do we maintain back pressure end-to-end?

Scaling out

- Akka Stream TCP: handled by the kernel (back-pressure, reliable).
- For each worker, we create a source for each message type it is responsible for, using the SplitterHub source() API.
- Each source is connected to a TCP connection and streamed to the worker.
- Back pressure is maintained across the network (see the sketch after this list).

[Diagram: Tcp().bind() ~> SplitterHub ~> per-type sources (IP, domain) ~> remote workers]
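A minimal sketch of the TCP leg, assuming newline-framed string events (host names, the port, and the framing are illustrative; the worker-side sketch update is left abstract):

```scala
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Framing, Sink, Source, Tcp}
import akka.util.ByteString

object TcpSketch extends App {
  implicit val system: ActorSystem = ActorSystem("tcp")
  implicit val mat: ActorMaterializer = ActorMaterializer()

  // Worker side: accept connections; back pressure propagates to the master over TCP.
  Tcp().bind("0.0.0.0", 9000).runForeach { conn =>
    val handler = Framing.delimiter(ByteString("\n"), maximumFrameLength = 1024)
      .map(_.utf8String)
      .map { event =>
        // update the local CMS/HLL/BF consumer here
        ByteString.empty // nothing to send back in this sketch
      }
    conn.handleWith(handler)
  }

  // Master side: stream one message type to the worker responsible for it.
  val ipEvents: Source[String, _] = Source(List("1.2.3.4", "5.6.7.8")) // stand-in for a SplitterHub source
  ipEvents
    .map(e => ByteString(e + "\n"))
    .via(Tcp().outgoingConnection("worker-1", 9000))
    .runWith(Sink.ignore)
}
```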

Master-Worker communication

Master failover

- The Coordinator is a single point of failure.
- Run multiple Coordinator actors as a Cluster Singleton.
- Workers communicate with the master (heartbeats) using Cluster Client.
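A minimal sketch of the Cluster Singleton wiring with akka-cluster-tools (CoordinatorActor and the actor names are illustrative, not the authors' code):

```scala
import akka.actor.{Actor, ActorSystem, PoisonPill, Props}
import akka.cluster.singleton._

class CoordinatorActor extends Actor {
  def receive: Receive = { case _ => /* route queries, manage workers */ }
}

object CoordinatorSingleton {
  def start(system: ActorSystem): Unit = {
    // One manager per node; the cluster guarantees a single running Coordinator,
    // with hand-over to another node if the current host dies.
    system.actorOf(
      ClusterSingletonManager.props(
        singletonProps = Props[CoordinatorActor],
        terminationMessage = PoisonPill,
        settings = ClusterSingletonManagerSettings(system)),
      name = "coordinator")

    // Cluster members reach the singleton through a proxy
    // (workers outside the cluster would use Cluster Client instead).
    system.actorOf(
      ClusterSingletonProxy.props(
        singletonManagerPath = "/user/coordinator",
        settings = ClusterSingletonProxySettings(system)),
      name = "coordinatorProxy")
  }
}
```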

Worker failover

- Workers persist all events to a DB journal + snapshots:
  - Akka Persistence.
  - Redis for storing the journal + snapshots.
- When a worker goes down, its keys are re-distributed and the master redirects traffic to other workers.
- CMS actors are restored on the new worker from the snapshot + journal (see the sketch below).
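A minimal sketch of such a persistent consumer, assuming Akka Persistence and stream-lib's CountMinSketch serialization helpers (class and message names are illustrative; the snapshot interval is arbitrary, and the Redis journal/snapshot plugins are configured elsewhere):

```scala
import akka.persistence.{PersistentActor, SnapshotOffer}
import com.clearspring.analytics.stream.frequency.CountMinSketch

final case class Observe(key: String)

class PersistentCmsActor(id: String) extends PersistentActor {
  override def persistenceId: String = s"cms-$id"

  private var cms = new CountMinSketch(0.001, 0.999, 1)
  private var sinceSnapshot = 0

  override def receiveCommand: Receive = {
    case Observe(key) =>
      persist(Observe(key)) { e =>          // journal the event first
        cms.add(e.key, 1)
        sinceSnapshot += 1
        if (sinceSnapshot >= 10000) {       // regularly snapshot the serialized count-min table
          saveSnapshot(CountMinSketch.serialize(cms))
          sinceSnapshot = 0
        }
      }
  }

  override def receiveRecover: Receive = {
    case SnapshotOffer(_, bytes: Array[Byte]) => cms = CountMinSketch.deserialize(bytes) // restore from snapshot
    case Observe(key)                         => cms.add(key, 1)                         // then replay the journal
  }
}
```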

Benchmark

- Akka Stream on a single node: 100K+ msg/second (one msg-type)
- Akka Stream on a remote node (remote TCP): 15-20K msg/second (one msg-type)
- Akka Stream on a remote node (remote TCP) with an Akka Persistence journal: 2,000+ msg/second (one msg-type)

Conclusion

Our system is: Responsive · Reactive · Scalable · Resilient

Future work:
- Make workers metric-agnostic
- Scale out the master
- Exactly-once delivery for workers
- More flexible filtering using SplitterHub

Q&A