Using Druid for interactive count-distinct queries at scale @ NMC

Yakir Buskilla + Itai Yaffe, Nielsen


Introduction

Yakir Buskilla

● Software Architect

● Focusing on Big Data and Machine Learning problems

Itai Yaffe

● Big Data Infrastructure Developer

● Dealing with Big Data challenges for the last 5 years

Nielsen Marketing Cloud (NMC)

● eXelate was acquired by Nielsen 2 years ago

● A leader in the Ad Tech and Marketing Tech industry

● What do we do?

○ Data as a Service (DaaS)

○ Software as a Service (SaaS)

NMC high-level architecture

The need

● Nielsen Marketing Cloud business question

○ How many unique devices have we encountered:

■ over a given date range

■ for a given set of attributes (segments, regions, etc.)

● Find, in real time, the number of distinct elements in a data stream that may contain repeated elements

The need

● Store everything

● Store only 1 bit per device

○ 10B devices: 1.25 GB/day

○ 10B devices * 80K attributes: 100 TB/day

● Approximate
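The storage figures for the one-bit-per-device option can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 GB = 10^9 bytes) and one bit per (device, attribute) pair; the function name is illustrative only.

```python
DEVICES = 10_000_000_000   # 10B devices (figure from the slide)
ATTRIBUTES = 80_000        # 80K attributes (figure from the slide)
BITS_PER_BYTE = 8


def bit_vector_bytes(devices: int, attributes: int = 1) -> int:
    """Bytes needed to store one bit per (device, attribute) pair."""
    return devices * attributes // BITS_PER_BYTE


single = bit_vector_bytes(DEVICES)            # one bit vector per day
full = bit_vector_bytes(DEVICES, ATTRIBUTES)  # one bit vector per attribute per day

print(f"{single / 1e9:.2f} GB/day")
print(f"{full / 1e12:.0f} TB/day")
```

Under these assumptions the two results match the 1.25 GB/day and 100 TB/day quoted on the slide, which is why an exact bit vector per attribute does not scale here.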

Possible solutions

Bit Vector, Approx.

Our journey

● Elasticsearch

○ Indexing data

■ 250 GB of daily data, 10 hours

■ Affects query time

○ Querying

■ Low concurrency

■ Scans on all the shards of the corresponding index

What we tried

● Preprocessing

● Statistical algorithms (e.g. HyperLogLog)

● K Minimum Values (KMV)

● Estimate set cardinality

● Supports set-theoretic operations

● ThetaSketch mathematical framework - generalization of KMV

ThetaSketch

KMV intuition
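The intuition can be sketched in code. If n distinct items are hashed uniformly into [0, 1), the k-th smallest hash value lands near k/n, so n can be estimated as (k - 1) divided by that value. The following is a minimal illustration of plain KMV, not the ThetaSketch implementation used by Druid; the MD5-based hash and the function names are assumptions made for the example.

```python
import hashlib
import heapq


def hash01(item: str) -> float:
    """Map an item to a pseudo-uniform float in [0, 1) (first 8 bytes of MD5)."""
    digest = hashlib.md5(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_estimate(items, k: int = 1024) -> float:
    """Estimate the number of distinct items from the k smallest hash values."""
    heap = []        # max-heap (negated values) holding the k smallest hashes
    members = set()  # hashes currently in the heap, used to skip duplicates
    for item in items:
        h = hash01(item)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # New hash is smaller than the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct items: count is exact
    kth_smallest = -heap[0]
    return (k - 1) / kth_smallest


# Hypothetical stream: 50,000 distinct device IDs, each seen twice.
stream = [f"device-{i}" for i in range(50_000)] * 2
estimate = kmv_estimate(stream)
print(round(estimate))
```

Only k hash values are kept regardless of how many devices pass through, which is what makes the approach attractive compared with storing every device ID.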

ThetaSketch error

Number of Std Dev      1        2
Confidence Interval    68.27%   95.45%
16,384                 0.78%    1.56%
32,768                 0.55%    1.10%
65,536                 0.39%    0.78%
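The error figures on this slide are consistent with a relative standard error of roughly 1/sqrt(k), where k is the sketch size in the first column. That relationship is an observation about the numbers rather than a formula quoted from the slides, and the helper name below is illustrative.

```python
import math


def relative_error(k: int, std_devs: int = 1) -> float:
    """Approximate relative error of a sketch of size k at the given number of std devs."""
    return std_devs / math.sqrt(k)


for k in (16_384, 32_768, 65_536):
    one_sigma = relative_error(k, 1)
    two_sigma = relative_error(k, 2)
    print(f"{k:>6,}  {one_sigma:.2%}  {two_sigma:.2%}")
```

Doubling the sketch size therefore reduces the error by a factor of about 1.41, not 2, which is worth keeping in mind when trading accuracy against storage.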

“Very fast highly scalable columnar data-store”

Roll-up

ThetaSketchAggregator

Timestamp    Attribute   Device ID
2016-11-15   11111       3a4c1f2d84a5c179435c1fea86e6ae02
2016-11-15   22222       3a4c1f2d84a5c179435c1fea86e6ae02
2016-11-15   11111       5dd59f9bd068f802a7c6dd832bf60d02
2016-11-15   22222       5dd59f9bd068f802a7c6dd832bf60d02
2016-11-15   333333      5dd59f9bd068f802a7c6dd832bf60d02

Timestamp    Attribute   Count Distinct
2016-11-15
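The roll-up on this slide can be illustrated with the five raw rows shown. The sketch below uses exact Python sets in place of ThetaSketch objects, purely to show how rows collapse to one entry per (timestamp, attribute); it assumes all five rows share the 2016-11-15 timestamp, and the function name is made up for the example.

```python
from collections import defaultdict

# Raw rows from the slide: (timestamp, attribute, device ID).
rows = [
    ("2016-11-15", "11111", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2016-11-15", "22222", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2016-11-15", "11111", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2016-11-15", "22222", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2016-11-15", "333333", "5dd59f9bd068f802a7c6dd832bf60d02"),
]


def roll_up(raw_rows):
    """Group rows by (timestamp, attribute) and count the distinct devices in each group."""
    groups = defaultdict(set)  # an exact set stands in for a sketch here
    for timestamp, attribute, device_id in raw_rows:
        groups[(timestamp, attribute)].add(device_id)
    return {key: len(devices) for key, devices in groups.items()}


rolled = roll_up(rows)
for (timestamp, attribute), count in sorted(rolled.items()):
    print(timestamp, attribute, count)
```

Five input rows become three rolled-up rows. In Druid the per-group set is replaced by a fixed-size sketch, so the stored row count depends on the number of attribute values per time bucket rather than on the number of devices.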

Druid architecture

How do we use Druid

Guidelines and pitfalls

● Setup is not easy

● Monitoring your system

● Data modeling

○ Reduce the number of intersections

○ Different datasources for different use cases

Timestamp    Attribute        Count Distinct
2016-11-15   US               XXXXXX
             Porsche Intent   XXXXXX
             ...              ...

Timestamp    Attribute        Region   Count Distinct
2016-11-15   Porsche Intent   US       XXXXXX
             ...              ...      ...

● Query optimization

○ Combine multiple queries into single query

○ Use filters
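One possible reading of these two guidelines is a single native Druid query that computes several sketches at once through filtered aggregators, then intersects them in a post-aggregation. The sketch below builds such a query as a Python dictionary. The datasource name `device_attributes`, the dimension `attribute`, and the metric `device_sketch` are hypothetical, and the JSON field names follow the DataSketches Theta extension as I understand it, so they should be checked against the documentation for the Druid version in use.

```python
import json


def filtered_sketch(name: str, attribute: str) -> dict:
    """A thetaSketch aggregator restricted to rows with one attribute value."""
    return {
        "type": "filtered",
        "filter": {"type": "selector", "dimension": "attribute", "value": attribute},
        "aggregator": {"type": "thetaSketch", "name": name, "fieldName": "device_sketch"},
    }


def build_query(attr_a: str, attr_b: str) -> dict:
    """One timeseries query that estimates the devices having both attributes."""
    return {
        "queryType": "timeseries",
        "dataSource": "device_attributes",      # hypothetical datasource name
        "granularity": "all",
        "intervals": ["2016-11-15/2016-11-16"],
        # Top-level filter so only the relevant rows are scanned.
        "filter": {"type": "in", "dimension": "attribute", "values": [attr_a, attr_b]},
        "aggregations": [filtered_sketch("a", attr_a), filtered_sketch("b", attr_b)],
        "postAggregations": [{
            "type": "thetaSketchEstimate",
            "name": "devices_in_both",
            "field": {
                "type": "thetaSketchSetOp",
                "name": "a_and_b",
                "func": "INTERSECT",
                "fields": [
                    {"type": "fieldAccess", "fieldName": "a"},
                    {"type": "fieldAccess", "fieldName": "b"},
                ],
            },
        }],
    }


query = build_query("11111", "22222")
print(json.dumps(query, indent=2))
```

The resulting JSON would be sent as the body of a query to the Broker. Compared with issuing one query per attribute and intersecting on the client, this keeps the set operation inside Druid and needs a single round trip.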

● Batch Ingestion

○ EMR Tuning

■ 140-node cluster

● 85% spot instances => ~80% cost reduction

○ Druid input file format - Parquet vs CSV

■ Reduced indexing time by 4x

■ Reduced storage used by 10x

● Community

Summary

10TB/day

4 Hours/day

15GB/day

280ms-350ms

$55K/month

250GB/day

10 Hours/day

2.5TB (total)

500ms-6000ms

$80K/month

THANK YOU!
