Scalable Streaming Data Pipelines with Redis Avram Lyon Scopely / @ajlyon / github.com/avram LA Redis Meetup / April 18, 2016


Page 1: Scalable Streaming Data Pipelines with Redis

Scalable Streaming Data Pipelines with Redis

Avram Lyon Scopely / @ajlyon / github.com/avram

LA Redis Meetup / April 18, 2016

Page 2: Scalable Streaming Data Pipelines with Redis

Scopely

• Mobile games publisher and developer

• Diverse set of games

• Independent studios around the world

Page 3: Scalable Streaming Data Pipelines with Redis

What kind of data?

• App opened

• Killed a walker

• Bought something

• Heartbeat

• Memory usage report

• App error

• Declined a review prompt

• Finished the tutorial

• Clicked on that button

• Lost a battle

• Found a treasure chest

• Received a push message

• Finished a turn

• Sent an invite

• Scored a Yahtzee

• Spent 100 silver coins

• Anything else any game designer or developer wants to learn about

Page 4: Scalable Streaming Data Pipelines with Redis

How much?

Recently:

Peak: 2.8 million events / minute

2.4 billion events / day

Page 5: Scalable Streaming Data Pipelines with Redis

[Diagram: Primary Data Stream: Collection → Kinesis → Enrichment, fanning out to Warehousing, Realtime Monitoring, and a Public API (Kinesis)]

Page 6: Scalable Streaming Data Pipelines with Redis

[Diagram: Studios A, B, and C send events to Collection over HTTP and SQS; Collection writes to Kinesis with SQS as failover, and uses Redis for caching app configurations and system configurations]

Page 7: Scalable Streaming Data Pipelines with Redis

[Diagram: the primary Kinesis stream (with SQS failover) feeds the Enricher; enriched events flow over a second Kinesis stream to the Data Warehouse Forwarder (to S3) and to Ariel (Realtime, with aggregation); Enricher, Forwarder, and Ariel each keep idempotence state; Elasticsearch appears as a possible destination]

Page 8: Scalable Streaming Data Pipelines with Redis

Kinesis: a short aside

Page 9: Scalable Streaming Data Pipelines with Redis

Kinesis

• Distributed, sharded streams. Akin to Kafka.

• Get an iterator over the stream— and checkpoint with current stream pointer occasionally.

• Workers coordinate shard leases and checkpoints in DynamoDB (via KCL)

[Diagram: Shard 0 | Shard 1 | Shard 2]

Page 10: Scalable Streaming Data Pipelines with Redis

Shard 0

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22

Checkpointing

Given: Worker checkpoints every 5

Page 11: Scalable Streaming Data Pipelines with Redis

Shard 0

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22

Checkpointing

Given: Worker checkpoints every 5

K Worker A

Page 12: Scalable Streaming Data Pipelines with Redis

Shard 0

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22

Checkpointing

Given: Worker checkpoints every 5

K Worker A 🔥

Page 13: Scalable Streaming Data Pipelines with Redis

Shard 0

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22

Checkpointing

Checkpoint for Shard 0: 10 Given: Worker checkpoints every 5

K Worker A 🔥

Page 14: Scalable Streaming Data Pipelines with Redis

Shard 0

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22

Checkpointing

Checkpoint for Shard 0: 10 Given: Worker checkpoints every 5

K Worker A 🔥

K Worker B

Page 15: Scalable Streaming Data Pipelines with Redis

Shard 0

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22

Checkpointing

Checkpoint for Shard 0: 10 Given: Worker checkpoints every 5

K Worker A 🔥

K Worker B
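The frames above are the whole at-least-once story: Worker A checkpointed at record 10 and died after processing record 14, so Worker B resumes from the checkpoint and records 11-14 are processed twice. A toy simulation of that arithmetic (not KCL code; the function name is invented):

```python
def replay_after_crash(records, checkpoint_interval, crash_after):
    """Simulate checkpoint-based recovery: a worker checkpoints every
    `checkpoint_interval` records and crashes after processing
    `crash_after` records; return the records the replacement worker
    reprocesses, i.e. the duplicates a downstream consumer will see."""
    checkpoint = 0
    for i, _rec in enumerate(records[:crash_after], start=1):
        if i % checkpoint_interval == 0:
            checkpoint = i  # durable checkpoint (DynamoDB, via KCL)
    # the replacement worker resumes from the last durable checkpoint
    return records[checkpoint:crash_after]
```

With the slide's numbers, records 11-14 come back a second time, which is exactly why the pipeline needs the idempotence layer described next.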

Page 16: Scalable Streaming Data Pipelines with Redis

Auxiliary Idempotence

• Idempotence keys at each stage

• Redis sets of idempotence keys by time window

• Gives resilience against various types of failures

Page 17: Scalable Streaming Data Pipelines with Redis

Auxiliary Idempotence

Page 18: Scalable Streaming Data Pipelines with Redis

Auxiliary Idempotence

• Gotcha: Set expiry is O(N)

• Broke up into small sets, partitioned by first 2 bytes of md5 of idempotence key
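A minimal sketch of such a partitioned idempotence set, assuming a redis-py-style client; the key layout, helper names, and TTL are invented for illustration:

```python
import hashlib

def idempotence_set_key(idem_key: str, window: str) -> str:
    # Partition by the first 2 bytes (4 hex chars) of the MD5 of the
    # idempotence key, so each time window holds many small sets instead
    # of one huge one whose expiry-time deletion is O(N).
    partition = hashlib.md5(idem_key.encode()).hexdigest()[:4]
    return f"idem:{window}:{partition}"

def seen_before(redis_client, idem_key: str, window: str, ttl: int = 2 * 3600) -> bool:
    """SADD returns 0 when the member already existed: a duplicate."""
    key = idempotence_set_key(idem_key, window)
    added = redis_client.sadd(key, idem_key)
    redis_client.expire(key, ttl)  # windows age out on their own
    return added == 0
```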

Page 19: Scalable Streaming Data Pipelines with Redis

[Diagram (repeated): Collection → Kinesis → Enrichment, fanning out to Warehousing, Realtime Monitoring, and a Public API (Kinesis)]

Page 20: Scalable Streaming Data Pipelines with Redis

[Diagram (repeated): the primary Kinesis stream (with SQS failover) feeds the Enricher; enriched events flow over a second Kinesis stream to the Data Warehouse Forwarder (to S3) and to Ariel (Realtime, with aggregation); Enricher, Forwarder, and Ariel each keep idempotence state; Elasticsearch appears as a possible destination]

Page 21: Scalable Streaming Data Pipelines with Redis

1. Deserialize

2. Reverse deduplication

3. Apply changes to application properties

4. Get current device and application properties

5. Generate Event ID

6. Emit.

Collection Kinesis

Enrichment

Page 22: Scalable Streaming Data Pipelines with Redis

1. Deserialize

2. Reverse deduplication

3. Apply changes to application properties

4. Get current device and application properties

5. Generate Event ID

6. Emit.

Collection Kinesis

Enrichment

Idempotence Key: Device Token + API Key + Event Batch Sequence + Event Batch Session
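The six steps read naturally as one function. A hedged sketch (the field names such as `batch_session` and `prop_changes`, and the in-memory dedupe set standing in for the Redis idempotence sets, are invented for illustration):

```python
import json
import uuid

def enrich(raw: bytes, seen_ids: set, device_props: dict):
    """Hypothetical sketch of the six enrichment steps."""
    event = json.loads(raw)                             # 1. deserialize
    idem = (event["device_token"], event["api_key"],
            event["batch_session"], event["batch_seq"])
    if idem in seen_ids:                                # 2. reverse deduplication
        return None
    seen_ids.add(idem)
    device_props.update(event.pop("prop_changes", {}))  # 3. apply property changes
    event["properties"] = dict(device_props)            # 4. attach current properties
    event["event_id"] = uuid.uuid4().hex                # 5. generate event ID
    return event                                        # 6. emit downstream
```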

Page 23: Scalable Streaming Data Pipelines with Redis

Now we have a stream of well-described, denormalized event facts.

Page 24: Scalable Streaming Data Pipelines with Redis

Preparing for Warehousing (SDW Forwarder)

[Diagram: enriched event data from Kinesis is buffered into slices (e.g. dice app open, bees level complete, slots payment, held for 0:01 to 0:10), emitted by time or emitted by size, then sent to SQS]

Each slice carries:

• Game

• Event Name

• Superset of properties in batch

• Data

Page 25: Scalable Streaming Data Pipelines with Redis

Preparing for Warehousing (SDW Forwarder)

[Diagram (repeated): enriched event data from Kinesis is buffered into slices (e.g. dice app open, bees level complete, slots payment, held for 0:01 to 0:10), emitted by time or emitted by size, then sent to SQS]

Each slice carries:

• Game

• Event Name

• Superset of properties in batch

• Data

Idempotence Key: Event ID
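The forwarder's buffering rule, emit a slice when it grows too large or too old, can be sketched like this (class name, thresholds, and the flushed-slice shape are assumptions, not the real SDW Forwarder):

```python
import time

class SliceBuffer:
    """Buffer events per (game, event_name); flush a slice when it reaches
    max_events or max_age_s, whichever comes first."""
    def __init__(self, max_events=500, max_age_s=10, clock=time.monotonic):
        self.max_events = max_events
        self.max_age_s = max_age_s
        self.clock = clock
        self.slices = {}  # (game, event_name) -> (first_seen, [events])

    def add(self, game, name, event):
        """Returns a flushed slice when the size threshold is hit, else None."""
        _first, events = self.slices.setdefault((game, name), (self.clock(), []))
        events.append(event)
        if len(events) >= self.max_events:
            return self._flush((game, name))
        return None

    def flush_expired(self):
        """Emit every slice older than max_age_s ("emitted by time")."""
        now = self.clock()
        due = [k for k, (first, _) in self.slices.items()
               if now - first >= self.max_age_s]
        return [self._flush(k) for k in due]

    def _flush(self, key):
        _, events = self.slices.pop(key)
        return {"game": key[0], "event": key[1], "data": events}
```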

Page 26: Scalable Streaming Data Pipelines with Redis

K

But everything can die!

dice app open

bees level complete

slots payment

Shudder

ASG

SNS

SQS

Page 27: Scalable Streaming Data Pipelines with Redis

K

But everything can die!

dice app open

bees level complete

slots payment

Shudder

ASG

SNS

SQS

HTTP: “Prepare to Die!”

Page 28: Scalable Streaming Data Pipelines with Redis

K

But everything can die!

dice app open

bees level complete

slots payment

Shudder

ASG

SNS

SQS

HTTP: “Prepare to Die!”

emit!

emit!

emit!

Page 29: Scalable Streaming Data Pipelines with Redis

Pipeline to HDFS

• Partitioned by event name and game, buffered in-memory and written to S3

• Picked up every hour by Spark job

• Converts to Parquet, loaded to HDFS

Page 30: Scalable Streaming Data Pipelines with Redis

A closer look at Ariel

Page 31: Scalable Streaming Data Pipelines with Redis

Live Metrics (Ariel)

Enriched event data from Kinesis:

name: game_end time: 2015-07-15 10:00:00.000 UTC _devices_per_turn: 1.0 event_id: 12345 device_token: AAAA user_id: 100

name: game_end time: 2015-07-15 10:01:00.000 UTC _devices_per_turn: 14.1 event_id: 12346 device_token: BBBB user_id: 100

name: Cheating Games predicate: _devices_per_turn > 1.5 target: event_id type: DISTINCT id: 1

name: Cheating Players predicate: _devices_per_turn > 1.5 target: user_id type: DISTINCT id: 2

name: game_end time: 2015-07-15 10:01:00.000 UTC _devices_per_turn: 14.1 event_id: 12347 device_token: BBBB user_id: 100

PFADD /m/1/2015-07-15-10-00 12346
PFADD /m/1/2015-07-15-10-00 12347
PFADD /m/2/2015-07-15-10-00 BBBB
PFADD /m/2/2015-07-15-10-00 BBBB

PFCOUNT /m/1/2015-07-15-10-00 → 2
PFCOUNT /m/2/2015-07-15-10-00 → 1

Configured Metrics
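Matching events against the configured metrics reduces to: evaluate each metric's predicate, then PFADD the event's target field into that metric's time-windowed key. A sketch using the slide's key layout (`/m/<metric id>/<time window>`); the command-tuple shape and metric-dict fields are invented:

```python
def pfadd_commands(metrics, event, window):
    """For each configured DISTINCT metric whose predicate matches the
    event, build the PFADD that would be issued against Redis."""
    commands = []
    for metric in metrics:
        if metric["predicate"](event):
            key = f"/m/{metric['id']}/{window}"
            commands.append(("PFADD", key, event[metric["target"]]))
    return commands
```

Run against the slide's example, the `_devices_per_turn: 14.1` event matches both "Cheating" metrics, while the `1.0` event matches neither.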

Page 32: Scalable Streaming Data Pipelines with Redis

Dashboards

Alarms

Page 33: Scalable Streaming Data Pipelines with Redis
Page 34: Scalable Streaming Data Pipelines with Redis
Page 35: Scalable Streaming Data Pipelines with Redis
Page 36: Scalable Streaming Data Pipelines with Redis
Page 37: Scalable Streaming Data Pipelines with Redis

HyperLogLog

• High-level algorithm (four-bullet-point version stolen from my colleague, Cristian)

• The first b bits of the hashed value are used as an index pointer (Redis uses b = 14, i.e. m = 16384 registers)

• The rest of the hash is inspected for the run of zeroes it starts with (N)

• The register pointed to by the index is replaced with max(currentValue, N + 1)

• An estimator function is used to calculate the approximate cardinality

http://content.research.neustar.biz/blog/hll.html
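The four bullets can be made concrete in a toy register update (illustrative only: Redis uses a different 64-bit hash plus dense/sparse register encodings, and the estimator step is omitted here):

```python
import hashlib

B = 14        # index bits, as in Redis
M = 1 << B    # 16384 registers

def hll_add(registers, value: str):
    """Update one register for `value`; returns the register index touched."""
    # Toy 64-bit hash taken from MD5 (not what Redis uses).
    h = int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")
    index = h & (M - 1)     # low b bits pick the register
    rest = h >> B
    zeros = 0               # run of zeroes at the start of the rest
    while rest & 1 == 0 and zeros < 64 - B:
        zeros += 1
        rest >>= 1
    registers[index] = max(registers[index], zeros + 1)
    return index
```

Because the update is a `max`, adding the same value again changes nothing, which is what makes PFADD naturally idempotent per element.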

Page 38: Scalable Streaming Data Pipelines with Redis

Live Metrics (Ariel)

Enriched event data from Kinesis:

name: game_end time: 2015-07-15 10:00:00.000 UTC _devices_per_turn: 1.0 event_id: 12345 device_token: AAAA user_id: 100

name: game_end time: 2015-07-15 10:01:00.000 UTC _devices_per_turn: 14.1 event_id: 12346 device_token: BBBB user_id: 100

name: Cheating Games predicate: _devices_per_turn > 1.5 target: event_id type: DISTINCT id: 1

name: Cheating Players predicate: _devices_per_turn > 1.5 target: user_id type: DISTINCT id: 2

name: game_end time: 2015-07-15 10:01:00.000 UTC _devices_per_turn: 14.1 event_id: 12347 device_token: BBBB user_id: 100

PFADD /m/1/2015-07-15-10-00 12346
PFADD /m/1/2015-07-15-10-00 12347
PFADD /m/2/2015-07-15-10-00 BBBB
PFADD /m/2/2015-07-15-10-00 BBBB

PFCOUNT /m/1/2015-07-15-10-00 → 2
PFCOUNT /m/2/2015-07-15-10-00 → 1

Configured Metrics

We can count different things

Page 39: Scalable Streaming Data Pipelines with Redis

Ariel

[Diagram: workers consume enriched events from Kinesis, record collector idempotence keys in Redis, and PFADD into aggregation keys; a web layer answers questions like “Are installs anomalous?” with PFCOUNT]

Page 40: Scalable Streaming Data Pipelines with Redis

Pipeline Delay

• Pipelines back up

• Dashboards get outdated

• Alarms fire!

Page 41: Scalable Streaming Data Pipelines with Redis

Alarm Clocks

• Push timestamp of current events to per-game pub/sub channel

• Take 99th percentile age as delay

• Use that time for alarm calculations

• Overlay delays on dashboards
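The delay computation itself is small. A sketch, assuming the recent event timestamps have already been collected from the per-game pub/sub channel (nearest-rank percentile shown; the talk does not specify which percentile convention Ariel uses):

```python
import math

def p99_delay(event_timestamps, now):
    """Pipeline delay = 99th percentile age of recently seen events.
    Alarm calculations then evaluate metrics as of `now - delay`."""
    ages = sorted(now - ts for ts in event_timestamps)
    rank = math.ceil(0.99 * len(ages))  # nearest-rank method
    return ages[rank - 1]
```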

Page 42: Scalable Streaming Data Pipelines with Redis

Ariel, now with clocks

[Diagram: as before, Kinesis → workers (idempotence, PFADD aggregation) with a web layer issuing PFCOUNT for “Are installs anomalous?”, plus an Event Clock fed by the workers and read by the web layer]

Page 43: Scalable Streaming Data Pipelines with Redis

Ariel 1.0

• ~30K metrics configured

• Aggregation into 30-minute buckets

• 12KB/30min/metric

Page 44: Scalable Streaming Data Pipelines with Redis

Ariel 1.0

• ~30K metrics configured

• Aggregation into 30-minute buckets

• 12KB/30min/metric

Page 45: Scalable Streaming Data Pipelines with Redis

Challenges

• Dataset size. RedisLabs non-cluster max = 100GB

• Packet/s limits: 250K in EC2-Classic

• Alarm granularity

Page 46: Scalable Streaming Data Pipelines with Redis

Challenges

• Dataset size. RedisLabs non-cluster max = 100GB

• Packet/s limits: 250K in EC2-Classic

• Alarm granularity

Page 47: Scalable Streaming Data Pipelines with Redis

Hybrid Datastore: Requirements

• Need to keep HLL sets to count distinct

• Redis is relatively finite

• HLL outside of Redis is messy

Page 48: Scalable Streaming Data Pipelines with Redis

Hybrid Datastore: Plan

• Move older HLL sets to DynamoDB

• They’re just strings!

• Cache reports aggressively

• Fetch backing HLL data from DynamoDB as needed on web layer, merge using on-instance Redis
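Since an HLL value is an ordinary Redis string, the merge path can be sketched with SET plus PFMERGE/PFCOUNT on the on-instance scratchpad. Assumes a redis-py-style client; the function name, scratch key, and cleanup policy are invented:

```python
def count_distinct(scratch, hot_keys, cold_blobs):
    """Merge live HLL keys with cold HLL strings pulled from DynamoDB.

    scratch    -- redis-py-style client for the on-instance scratchpad Redis
    hot_keys   -- HLL key names already present in Redis
    cold_blobs -- {key name: raw HLL string fetched from DynamoDB}
    """
    for key, blob in cold_blobs.items():
        scratch.set(key, blob)                   # an HLL value is just a string
    scratch.pfmerge("tmp:merge", *hot_keys, *list(cold_blobs))
    total = scratch.pfcount("tmp:merge")
    scratch.delete("tmp:merge", *cold_blobs)     # clean up the scratchpad
    return total
```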

Page 49: Scalable Streaming Data Pipelines with Redis

Ariel, now with hybrid datastore

[Diagram: as before, plus DynamoDB holding the migrated old HLL data; Redis adds report caches and a merge scratchpad so the web layer can combine hot Redis data with cold DynamoDB data]

Page 50: Scalable Streaming Data Pipelines with Redis

Much less memory…

Page 51: Scalable Streaming Data Pipelines with Redis

Redis Roles

• Idempotence

• Configuration Caching

• Aggregation

• Clock

• Scratchpad for merges

• Cache of reports

Page 52: Scalable Streaming Data Pipelines with Redis

Other Considerations

• Multitenancy. We run parallel stacks and give games an assigned affinity, to insulate from pipeline delays

• Backfill. System is forward-looking only; can replay Kinesis backups to backfill, or backfill from warehouse

Page 53: Scalable Streaming Data Pipelines with Redis

Thanks!

Questions?

scopely.com/jobs

@ajlyon
[email protected]
github.com/avram