Using Druid for interactive count distinct queries at scale @ NMC



Yakir Buskilla + Itai Yaffe
Nielsen


Introduction

Yakir Buskilla
Software Architect
Focusing on Big Data and Machine Learning problems

Itai Yaffe
Big Data Infrastructure Developer
Dealing with Big Data challenges for the last 5 years

Intro of us + NMC

Nielsen Marketing Cloud (NMC)
eXelate was acquired by Nielsen 2 years ago
A leader in the Ad Tech and Marketing Tech industry
What do we do?
  Data as a Service (DaaS)
  Software as a Service (SaaS)

DaaS = a marketplace for device-level data, connecting buyers and sellers
SaaS = the Nielsen Marketing Cloud platform, which helps brands connect with their customers using our big data sets and our analytics tools

NMC high-level architecture

Our serving layer (Front End) aggregates data from various online + offline sources
We aggregate around 10B events per day

The need

Nielsen Marketing Cloud business question
How many unique devices have we encountered:
  over a given date range
  for a given set of attributes (segments, regions, etc.)
Find the number of distinct elements in a data stream, which may contain repeated elements, in real time

Past
Mention cardinality and real-time dashboard
Explain the need to union and intersect


Possible solutions
Naive: store everything
Bit vector: store only 1 bit per device
  10B devices: 1.25 GB/day
  10B devices * 80K attributes: 100 TB/day
Approx.: approximate
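The bit-vector figures on the slide follow from simple arithmetic. A quick check, assuming decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes) and one bit vector per attribute per day:

```python
# Back-of-the-envelope storage for the bit-vector approach (1 bit per device).
DEVICES = 10_000_000_000   # 10B devices, from the slide
ATTRIBUTES = 80_000        # 80K attributes, from the slide

bytes_per_vector = DEVICES / 8                      # 8 bits per byte
gb_per_vector = bytes_per_vector / 1e9              # decimal gigabytes
tb_all_attributes = bytes_per_vector * ATTRIBUTES / 1e12

print(f"one bit vector per day: {gb_per_vector:.2f} GB")
print(f"one vector per attribute per day: {tb_all_attributes:.0f} TB")
```

A single vector is cheap, but multiplying by the number of attributes is what pushes the approach to roughly 100 TB/day.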

- Bit vector: Elasticsearch / Redis are examples of such a system

Our journey
Elasticsearch
Indexing data
  250 GB of daily data, 10 hours
  Affects query time
Querying
  Low concurrency
  Scans on all the shards of the corresponding index

We tried to introduce a new cluster dedicated to indexing only, and then use backup and restore to the second cluster
This method was very expensive and only partially helpful
Tuning for better performance also didn't help much

What we tried
Preprocessing
Statistical algorithms (e.g. HyperLogLog)

Preprocessing
- Too many combinations
- The formula length is not bounded (show some numbers)

HyperLogLog
- Implementation in Elasticsearch was too slow (done at query time)
- Set operations increase the error dramatically

ThetaSketch
K Minimum Values (KMV)
  Estimate set cardinality
  Supports set-theoretic operations
ThetaSketch mathematical framework: a generalization of KMV
(Figure: two Venn diagrams of sets X and Y)

Unions and intersections increase the error
The problematic case is the intersection of a very small set with a very big set

KMV intuition
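The KMV idea can be sketched in a few lines: hash every item to a pseudo-uniform value in [0, 1), keep only the k smallest hash values, and estimate the number of distinct items as (k - 1) divided by the k-th smallest value. This is a minimal illustration, not the DataSketches implementation; the MD5-based hash and the function names are assumptions made for the example.

```python
import hashlib
import heapq

def hash01(item: str) -> float:
    """Map an item to a pseudo-uniform float in [0, 1) (MD5 is an illustrative choice)."""
    digest = hashlib.md5(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def kmv_estimate(items, k=1024):
    """Estimate the number of distinct items while keeping only k hash values."""
    heap = []        # max-heap (negated values) holding the k smallest hashes seen
    members = set()  # hashes currently in the heap, so repeats are skipped
    for item in items:
        h = hash01(item)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # New hash is smaller than the current k-th minimum: swap it in.
            evicted = -heapq.heapreplace(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct items: the count is exact
    kth_min = -heap[0]
    return (k - 1) / kth_min

# 100,000 distinct device IDs, each seen three times.
stream = [f"device-{i}" for i in range(100_000)] * 3
print(round(kmv_estimate(stream)))
```

The intuition: if n distinct hashes are spread uniformly over [0, 1), the k-th smallest sits near k / n, so inverting it recovers n. Repeated items hash to the same value and do not change the result.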

Number of Std Dev       1         2
Confidence Interval     68.27%    95.45%
K = 16,384              0.78%     1.56%
K = 32,768              0.55%     1.10%
K = 65,536              0.39%     0.78%

ThetaSketch error

The larger the K, the smaller the error
However, a larger K means more memory & storage needed
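The error figures on the ThetaSketch error slide are consistent with a rule of thumb where the relative standard error is about 1 / sqrt(K). The snippet below is a sketch under that assumption (the exact bound used by a given sketch library may differ slightly); `relative_error` is a name made up for the example.

```python
import math

def relative_error(k: int, num_std_dev: int = 1) -> float:
    """Approximate relative error of a sketch with k entries, assuming ~1/sqrt(k)."""
    return num_std_dev / math.sqrt(k)

for k in (16_384, 32_768, 65_536):
    one_sigma = relative_error(k, 1)
    two_sigma = relative_error(k, 2)
    print(f"K={k:>6,}  1 std dev: {one_sigma:.2%}  2 std dev: {two_sigma:.2%}")
```

Under this rule, halving the error requires four times the K, which is the memory and storage trade-off mentioned in the notes.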

Druid: a very fast, highly scalable columnar data store


So we talked about statistical algorithms, which is nice, but we needed a practical solution
Druid supports the ThetaSketch algorithm out of the box (OOTB)

Roll-up

Raw events:
Timestamp    Attribute   Device ID
2016-11-15   11111       3a4c1f2d84a5c179435c1fea86e6ae02
2016-11-15   22222       3a4c1f2d84a5c179435c1fea86e6ae02
2016-11-15   11111       5dd59f9bd068f802a7c6dd832bf60d02
2016-11-15   22222       5dd59f9bd068f802a7c6dd832bf60d02
2016-11-15   33333       5dd59f9bd068f802a7c6dd832bf60d02

ThetaSketch aggregator

After roll-up:
Timestamp    Attribute   Count Distinct
2016-11-15   11111       2
2016-11-15   22222       2
2016-11-15   33333       1

Timeseries database - first thing you need to know about Druid

Column types:
  Timestamp
  Dimensions
  Metrics
Together they comprise a Datasource

Aggregation is done at ingestion time (the outcome is much smaller in size)
At query time, it's closer to a key-value search
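To make the roll-up concrete, here is a small simulation using the sample rows from the roll-up slide. It groups raw events by (timestamp, attribute) and keeps one aggregate per group. An exact Python set stands in for the ThetaSketch here, which is an assumption made for readability; Druid would store a bounded-size sketch instead.

```python
from collections import defaultdict

# (timestamp, attribute, device_id) rows, as on the roll-up slide.
raw_events = [
    ("2016-11-15", "11111", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2016-11-15", "22222", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2016-11-15", "11111", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2016-11-15", "22222", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2016-11-15", "33333", "5dd59f9bd068f802a7c6dd832bf60d02"),
]

def roll_up(events):
    """Collapse raw events into one aggregate per (timestamp, attribute)."""
    groups = defaultdict(set)  # exact set used as a stand-in for a sketch
    for timestamp, attribute, device_id in events:
        groups[(timestamp, attribute)].add(device_id)
    return groups

rolled = roll_up(raw_events)
for (timestamp, attribute), devices in sorted(rolled.items()):
    print(timestamp, attribute, len(devices))
```

Five raw rows become three stored rows, and the device IDs themselves are no longer needed at query time, which is why queries behave more like key lookups.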

Druid architecture

We have 3 types of processes: ingestion, querying, management
All processes are decoupled and scalable
Ingestion (real time, e.g. from Kafka; batch: talk about deep storage, how data is aggregated at ingestion time)
Querying (brokers, historicals, query performance during ingestion)
Lambda architecture

How do we use Druid

Explain the tuple and what is happening during the aggregation

Guidelines and pitfalls

Setup is not easy

Setup is not easy
Separate config/servers/tuning
Caused the deployment to take a few months
Use the Druid recommendation for production configuration

Guidelines and pitfalls
Monitoring your system

Monitoring your system
Druid has built-in support for Graphite (exports many metrics)

Guidelines and pitfalls
Data modeling
  Reduce the number of intersections
  Different datasources for different use cases



Old data model:
Timestamp    Attribute        Count Distinct
2016-11-15   US               XXXXXX
2016-11-15   Porsche Intent   XXXXXX
...          ...              ...

New data model:
Timestamp    Attribute        Region   Count Distinct
2016-11-15   Porsche Intent   US       XXXXXX
...          ...              ...      ...

Data modeling
If using Theta sketch, reduce the number of intersections (show a slide of the old and new data model).
It didn't solve all use cases, but it gives you an idea of how you can approach the problem
Different datasources, e.g. lower accuracy for faster queries vs. higher accuracy with slightly slower queries
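One way to read the data-modeling advice, sketched as Druid native queries built in Python. With a single attribute column, "US and Porsche Intent" needs a sketch intersection at query time; with region as its own dimension, the same question becomes a plain filter. The datasource, dimension, and metric names below are hypothetical placeholders, and the two-model interpretation is an assumption based on the notes; the `thetaSketch`, `thetaSketchSetOp`, and `thetaSketchEstimate` types come from Druid's DataSketches extension.

```python
import json

def old_model_query(interval):
    """Single attribute dimension: needs an INTERSECT of two sketches."""
    def sketch_for(name, value):
        return {
            "type": "filtered",
            "filter": {"type": "selector", "dimension": "attribute", "value": value},
            "aggregator": {"type": "thetaSketch", "name": name,
                           "fieldName": "device_sketch"},  # hypothetical metric name
        }
    return {
        "queryType": "timeseries",
        "dataSource": "devices_by_attribute",  # hypothetical datasource
        "granularity": "all",
        "intervals": [interval],
        "aggregations": [sketch_for("us", "US"),
                         sketch_for("porsche", "Porsche Intent")],
        "postAggregations": [{
            "type": "thetaSketchEstimate",
            "name": "unique_devices",
            "field": {
                "type": "thetaSketchSetOp",
                "name": "us_and_porsche",
                "func": "INTERSECT",
                "fields": [{"type": "fieldAccess", "fieldName": "us"},
                           {"type": "fieldAccess", "fieldName": "porsche"}],
            },
        }],
    }

def new_model_query(interval):
    """Region is a dimension: filters only, no intersection needed."""
    return {
        "queryType": "timeseries",
        "dataSource": "devices_by_attribute_and_region",  # hypothetical datasource
        "granularity": "all",
        "intervals": [interval],
        "filter": {"type": "and", "fields": [
            {"type": "selector", "dimension": "attribute", "value": "Porsche Intent"},
            {"type": "selector", "dimension": "region", "value": "US"},
        ]},
        "aggregations": [{"type": "thetaSketch", "name": "unique_devices",
                          "fieldName": "device_sketch"}],
    }

print(json.dumps(new_model_query("2016-11-15/2016-11-16"), indent=2))
```

The second form avoids the intersection entirely, so it avoids the extra error that intersections introduce, at the cost of a wider datasource.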

Guidelines and pitfalls

Query optimization
  Combine multiple queries into a single query
  Use filters

Combine multiple queries over the REST API
There can be billions of rows, so filter the data as part of the query
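A minimal sketch of both tips together: instead of sending one query per attribute, send a single query with one filtered aggregator per attribute, plus a top-level filter so that only relevant rows are scanned. The broker address, datasource, and column names are assumptions for illustration, and the request is built but not sent so the example runs offline.

```python
import json
import urllib.request

BROKER_URL = "http://localhost:8082/druid/v2/"  # hypothetical broker address

def combined_query(attributes, interval):
    """One query returning a distinct count per attribute."""
    attributes = list(attributes)
    return {
        "queryType": "timeseries",
        "dataSource": "devices_by_attribute",  # hypothetical datasource
        "granularity": "all",
        "intervals": [interval],
        # Top-level filter: skip rows for attributes nobody asked about.
        "filter": {"type": "in", "dimension": "attribute", "values": attributes},
        "aggregations": [
            {
                "type": "filtered",
                "filter": {"type": "selector", "dimension": "attribute", "value": a},
                "aggregator": {"type": "thetaSketch",
                               "name": f"unique_devices_{a}",
                               "fieldName": "device_sketch"},  # hypothetical metric
            }
            for a in attributes
        ],
    }

def build_request(query):
    """Wrap the query in an HTTP POST for the broker's native query endpoint."""
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        BROKER_URL, data=body,
        headers={"Content-Type": "application/json"}, method="POST",
    )

query = combined_query(["11111", "22222", "33333"], "2016-11-15/2016-11-16")
request = build_request(query)
# urllib.request.urlopen(request) would send it to a running broker.
print(json.dumps(query, indent=2))
```

One round trip replaces three, and the shared filter is applied once for all aggregators.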

Guidelines and pitfalls
Batch ingestion
EMR tuning
  140-node cluster
  85% spot instances => ~80% cost reduction
Druid input file format: Parquet vs CSV
  Reduced indexing time by 4x
  Reduced used storage by 10x

EMR tuning (spot instances (80% cost reduction), Druid MR prod config)
Use Parquet

The picture here - maybe money??

Guidelines and pitfalls
Community


(Comparison figure, labels only: 4 hours/day; DRUID; 250 GB/day; 10 hours/day; 2.5 TB (total))




Ingestion doesn't affect queries, plus sub-second response for even 100s or 1000s of concurrent queries
Cost is for the entire solution (Druid cluster, EMR, etc.)
With Druid and ThetaSketch, we've improved our ingestion volume, query performance, and concurrency by an order of magnitude at a lower cost, compared to our old solution
(We've achieved a more performant, scalable, cost-effective solution)