Towards an adaptive and eventually self-healing framework for geo-distributed real-time data ingestion

Angad Singh, InMobi
The problem domain
Scale
● 15 billion events per day (post filtering)
● 1.5+ billion users, 200 million per day
● 4 geographically distributed data centers (DCs)
● a user's request may land on a non-local DC

Ingestion requirements
● multiple tenants, multiple schemas per tenant
● batch, stream, micro-batch and on-demand ingestion
● 20+ streams, 100+ data types
● need to ingest, transform, validate and aggregate this data
● need to ingest streaming data in real time (<1 min) for ad-serving/targeting use cases (strict SLA)
The problem domain
Usage/serving requirements
● need to pivot this data by user, activity type and other primary keys
● serve an aggregated view (profile) at the end in <5ms p99 latency
● need both real-time serving of the view
● as well as batch summaries for analytics, inference algorithms, feedback loops
● need to be resilient to failure; absolutely no room for data loss/lag in ingestion

Data arrival, volume and velocity
● data may be received out of order, or duplicated
● data can arrive in periodic batches, in real time/streaming, or once in a while
● data may arrive in bursts or trickle slowly in some streams (autoscale)
● user data may be received in any DC, but needs to be collectively available in a single DC
The problem domain
Multi-tenancy
● Quotas
● Rate limiting/SLAs (a rate-limiting sketch follows this list)
● Isolation

Manageability
● need to be self-serve, flexible for specific changes in the flow, easily deployable
● may need online migration, reprocessing, etc. of data
● hassle-free schema evolution across the stack
● monitoring, visibility and operability for all of the above
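The per-tenant quota and rate-limiting requirement could be met with token buckets. A minimal sketch, assuming Guava's RateLimiter; the tenant ids, quota values and method names are invented for illustration, since the deck does not show the framework's actual mechanism:

import com.google.common.util.concurrent.RateLimiter;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical per-tenant rate limiting via token buckets (Guava RateLimiter).
public class TenantRateLimiter {
    private final ConcurrentHashMap<String, RateLimiter> limiters = new ConcurrentHashMap<>();

    // eventsPerSecond would come from the tenant's configured quota/SLA
    public void registerTenant(String tenantId, double eventsPerSecond) {
        limiters.put(tenantId, RateLimiter.create(eventsPerSecond));
    }

    // Returns false when the tenant is over quota; the caller can then
    // shed, buffer, or demote the event to batch ingestion.
    public boolean tryIngest(String tenantId) {
        RateLimiter limiter = limiters.get(tenantId);
        return limiter != null && limiter.tryAcquire();
    }
}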
The architecture
[Diagram: the serving layer is an Aerospike cluster (user store) fronted by an API with rate limiting/quotas; incoming writes are deduped, aggregated and passed through business rules, and ad serving reads the result in <5ms at 99.95% success. The user store emits notifications to pubsub (Kafka), consumed by notification listeners (Storm) for real-time enrichment on user engagement, and takes periodic dumps into an offline snapshot store (HDFS) that feeds batch inference jobs (MR/Spark) and an analytics engine (cubes, lens). Beneath it, the ingestion layer chains adaptors → routers → sinks (MR/Storm) over upstream batch and streaming ingestion sources, spanning local, remote and global DCs, all orchestrated/managed by the Ingestion Service.]
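To make the adaptor → router → sink chain concrete, here is a minimal Storm topology skeleton, assuming 0.9.x-era backtype.storm packages; AdaptorSpout, RouterBolt and SinkBolt are hypothetical stand-ins for the pluggable components the Ingestion Service would wire up from a flow definition:

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class IngestionFlowTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // adaptor: pulls raw events from an upstream source (e.g. a Kafka topic)
        builder.setSpout("adaptor", new AdaptorSpout(), 4);
        // router: validates/transforms events and decides where they should go
        builder.setBolt("router", new RouterBolt(), 8)
               .shuffleGrouping("adaptor");
        // sink: writes to the serving store / HDFS; fields grouping keeps one
        // user's events on the same sink task
        builder.setBolt("sink", new SinkBolt(), 8)
               .fieldsGrouping("router", new Fields("userid"));

        Config conf = new Config();
        conf.setNumWorkers(4);
        StormSubmitter.submitTopology("ingestion-flow", conf, builder.createTopology());
    }
}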
Architecture
[Diagram: DC1 (global) runs adaptors and routers on MR; DC2 and DC3 (slaves) run them on Storm. Each DC holds user-colo metadata in Aerospike, kept in sync here by custom replication. Incoming data contains a userid, and sinks resolve its home DC with getColo(userid). A global colo tagger (Storm) handles lookups: if a tag is found, the data already has an owning DC; if not, the tagger writes a new tag. Tagged data is shipped to its owning DC by a Kafka-data-replicator (Storm topology). Each DC's user store exposes an API over History and Profile data, with profiles replicated across DCs via XDR.]
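The tagging decision in the diagram reduces to a read-check-write against the user-colo metadata. A sketch of that logic, where ColoMetadataStore, getColo and writeColo are assumed names standing in for the Aerospike-backed lookup, not the real API:

public class ColoTagger {
    interface ColoMetadataStore {
        String getColo(String userId);              // null => tag not found
        void writeColo(String userId, String colo); // "write tag"
    }

    private final ColoMetadataStore store;
    private final String localColo;                 // e.g. "dc1"

    ColoTagger(ColoMetadataStore store, String localColo) {
        this.store = store;
        this.localColo = localColo;
    }

    // Returns the DC that owns this user; assigns the local DC on first sight.
    public String tag(String userId) {
        String colo = store.getColo(userId);        // "tag found" path
        if (colo == null) {
            colo = localColo;                       // first sighting: adopt this DC
            store.writeColo(userId, colo);          // replicated to the other DCs
        }
        // the Kafka-data-replicator topology then ships tagged data to `colo`
        return colo;
    }
}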
Cross-DC architecture
[Diagram: same topology as above, except the user-colo metadata in Aerospike is replicated across DCs via XDR rather than the custom replication path; the tagger flow (getColo(userid), tag found / tag not found / write tag), the Kafka-data-replicator (Storm topology) and the per-DC user stores (API, History, Profile, XDR for profiles) are unchanged.]
Comparison to map-reduce

[Diagram mapping the ingestion pipeline onto map-reduce stages: Mapper, Partitioner, Shuffler, Reducer.]
The ingestion layer
Current Features
Business-agnostic APIs
● Built on simple RESTful APIs: Schema, Feed, Sink, Source, Flow, Driver, Data router, Adaptor
● Unified APIs for batch, streaming and micro-batch ingestion
● Self-serve system that provides rule validation, metrics, etc., and makes expressing sources, sinks and flows easy with a custom DSL (a hypothetical registration call is sketched below)
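Since the APIs are RESTful, registering a flow could look roughly like the following; the endpoint, host and JSON shape are invented for illustration and are not the framework's actual contract:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class FlowRegistration {
    public static void main(String[] args) throws Exception {
        // Hypothetical flow spec tying a registered source and sink to a schema version
        String flowSpec = "{"
            + "\"name\": \"clicks-to-userstore\","
            + "\"source\": \"clicks-stream\","
            + "\"sink\": \"user-store\","
            + "\"schema\": \"click-event:v3\""
            + "}";
        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://ingestion-service/api/flows").openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(flowSpec.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("Flow API responded: " + conn.getResponseCode());
    }
}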
Platform-agnostic Flow Execution
● Pluggable execution engine (storm, hadoop, spark) - provides a Driver API
● Uses falcon for batch scheduling, in-built scheduler for streaming drivers (storm, etc.)
Serialization support
● Pluggable schema serde support (thrift, avro)
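A pluggable serde layer typically reduces to a small contract like the one below; this interface is a sketch of what such a contract could look like, not the framework's actual API:

import java.io.IOException;

// Hypothetical serde contract; thrift and avro would each provide an implementation.
public interface SchemaSerde<T> {
    byte[] serialize(T record) throws IOException;
    T deserialize(byte[] payload) throws IOException;
    String schemaId();   // ties the payload back to a registered schema version
}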
Current Features
Schema management
● Schema is a first-class citizen
● Contracts between source, sink and flow are all based on and validated against schema
● Schema versioning and compatibility checks (illustrated after this list)
● Error-free schema evolution across data flows
● Clean abstractions to centrally manage all the schemas, data sources/feeds, sinks (key-value store, HDFS, etc.) and data flows (storm topologies, MR jobs) that are part of the ingestion pipelines
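For the avro case, version-compatibility checks can lean on Avro's built-in checker. A minimal sketch of such a gate (the framework's own check isn't shown in the deck):

import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;
import org.apache.avro.SchemaCompatibility.SchemaCompatibilityType;

public class SchemaGate {
    // True if data written with writerSchema can still be read with readerSchema,
    // i.e. the proposed evolution is backward compatible.
    public static boolean canEvolve(Schema readerSchema, Schema writerSchema) {
        SchemaCompatibility.SchemaPairCompatibility result =
                SchemaCompatibility.checkReaderWriterCompatibility(readerSchema, writerSchema);
        return result.getType() == SchemaCompatibilityType.COMPATIBLE;
    }
}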
Manageability, operability
● All entities - schemas, sinks and flows - can be updated online without any downtime
● Retries, error handling, metrics, orchestration hooks, etc. come standard
Out-of-the-box support for
● Cross-colo flow chaining
● Data routing
● Transformation, validation, conversion
● All based on pluggable code
The problems we’ve seen
Storm
● as usual, a lot of knobs to tune based on a lot of metrics: workers, threads, tasks, acks, max spout pending, buffer sizes, xmx, num slots, execute/process/ack latency, capacity, etc. (see the config sketch after this list)
● debugging storm topologies isn't easy: threads, workers, shared logs, shuffling of data between workers, netty, the ack system, etc.
● storm (0.9.x) doesn't like heterogeneous load: unbalanced distribution between supervisors; heavy topologies can choke each other; rebalancing is not fully resource-aware (1.x tries to solve this)
● no rolling upgrades; supervisor failures cause unrecoverable errors
● zookeeper issues: too many executors leads to worker heartbeat update failures to zk
● storm-kafka issue: the storm-kafka spout is unaware of purging (earliestOffset update)
● storm-kafka issue: invisible data loss
● retries should be done cautiously
● etc.
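For reference, several of these knobs map onto a 0.9.x-era backtype.storm Config as shown below; the values are placeholders for illustration, not tuning recommendations:

import backtype.storm.Config;

public class TuningKnobs {
    static Config tunedConfig() {
        Config conf = new Config();
        conf.setNumWorkers(4);                                         // workers / num slots used
        conf.setNumAckers(2);                                          // the ack system
        conf.setMaxSpoutPending(1000);                                 // caps in-flight tuples
        conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS, "-Xmx2g");          // xmx per worker
        conf.put(Config.TOPOLOGY_EXECUTOR_RECEIVE_BUFFER_SIZE, 16384); // buffer sizes
        conf.put(Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS, 60);            // replay/ack timeout
        return conf;
    }
}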
Kafka
● topic deletion asynchronous, slow
● tuning num partitions manually
● bad consumers can cause excessive logging on brokers
Features under development
● Autoscaling flows - rebalance a storm topology based on spout lag, priority and current throughput (or bolt capacity), driven by runtime metrics or linear regression on historical metrics (decision loop sketched after this list)
● Streaming and batch compaction/dedup of data based on domain-specific rules
● Automatic fallback from streaming to batch ingestion in case of huge backlogs, for low-priority ingestions
● Dynamic rerouting/sharding of data between DCs for load balancing cross-DC flows
● Eventual self-correction of data based on validations on the aggregated view (data received from multiple streams)
● Data lineage/auditing
● Backfill management
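A sketch of the lag-based autoscaling decision. Metrics and Rebalancer are hypothetical seams: the first stands in for the metric source (e.g. Kafka log-end offsets minus spout-committed offsets), the second for a call to Storm's rebalance facility; the thresholds and doubling/halving policy are placeholders:

public class FlowAutoscaler {
    interface Metrics { long fetchSpoutLag(String topology); }
    interface Rebalancer { void requestRebalance(String topology, int executors); }

    private final Metrics metrics;
    private final Rebalancer rebalancer;

    FlowAutoscaler(Metrics metrics, Rebalancer rebalancer) {
        this.metrics = metrics;
        this.rebalancer = rebalancer;
    }

    // Periodically called per flow: scale out when the backlog grows,
    // scale back in when the spout has fully caught up.
    void check(String topology, int currentExecutors, long lagThreshold) {
        long lag = metrics.fetchSpoutLag(topology);
        if (lag > lagThreshold) {
            // scale out with the backlog (in practice capped by cluster slots)
            rebalancer.requestRebalance(topology, currentExecutors * 2);
        } else if (lag == 0 && currentExecutors > 1) {
            rebalancer.requestRebalance(topology, currentExecutors / 2);
        }
    }
}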