How Cloudflare analyzes >1M DNS queries per second
Tom Arnfeld (and Marek Vavrusa)
- 100+ data centers globally
- 2.5B monthly unique visitors
- >10% of Internet requests every day
- ≤3M DNS queries/second
- 6M+ websites, apps & APIs in 150 countries
- 5M+ HTTP requests/second
Anatomy of a DNS query

$ dig www.cloudflare.com

; <<>> DiG 9.8.3-P1 <<>> www.cloudflare.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 36582
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.cloudflare.com.  IN  A

;; ANSWER SECTION:
www.cloudflare.com.  5  IN  A  198.41.215.162
www.cloudflare.com.  5  IN  A  198.41.214.162

;; Query time: 34 msec
;; SERVER: 192.168.1.1#53(192.168.1.1)
;; WHEN: Sat Sep 2 10:48:30 2017
;; MSG SIZE rcvd: 68
30+ fields
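A dig answer line like the ones above can be broken into structured fields; a minimal sketch, with field names chosen for illustration rather than taken from Cloudflare's actual log schema:

```python
# Hypothetical sketch: split one dig-style answer line into structured fields.
def parse_answer(line):
    # e.g. "www.cloudflare.com. 5 IN A 198.41.215.162"
    name, ttl, rrclass, rrtype, rdata = line.split()
    return {"name": name.rstrip("."), "ttl": int(ttl),
            "class": rrclass, "type": rrtype, "rdata": rdata}

rec = parse_answer("www.cloudflare.com. 5 IN A 198.41.215.162")
```

The real log messages carry far more than these five fields (30+ per query), but the shape is the same: one flat record per answered query.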
Cloudflare DNS Server
Log Forwarder
HTTP & Other Edge Services
AnycastDNS
Logs from all edge services and all PoPs are shipped over TLS to be processed
Logs are received and de-multiplexed
Logs are written into various kafka topics
Log messages are serialized with Cap'n'Proto
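The de-multiplexing step can be sketched as a routing function from decoded log messages to Kafka topics. The topic names, service names, and message fields below are hypothetical; real messages are Cap'n Proto, decoded here as plain dicts:

```python
# Sketch: route each decoded log message to a Kafka topic by originating
# service. Names are illustrative, not Cloudflare's actual topic layout.
TOPIC_BY_SERVICE = {
    "dns": "logs.dns",
    "http": "logs.http",
}

def route(message):
    # Fall back to a catch-all topic for services without a dedicated one.
    return TOPIC_BY_SERVICE.get(message["service"], "logs.other")
```

In the real pipeline the forwarder would hand the serialized message, keyed this way, to a Kafka producer.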
What did we want?
- Multidimensional query analytics
- Complex ad-hoc queries
- Capable of current and expected future scale
- Gracefully handle late arriving log data
- Roll-ups/aggregations for long term storage
- Highly available and replicated architecture
- ≤3M queries per second
- 100+ edge points of presence
- 20+ query dimensions
- 5+ years of stored aggregations
Kafka, Apache Spark and Parquet
Download and filter data from Kafka using Apache Spark
Converted into Parquet and written to HDFS
- Scanning the firehose is slow and adding filters is time-consuming
- Offline analysis is difficult with large amounts of data
- Not a fast or friendly user experience
- Doesn't work for customers
Let's aggregate everything... with streams

Raw events:

Timestamp            | QName              | QType | RCODE
2017/01/01 01:00:00  | www.cloudflare.com | A     | NODATA
2017/01/01 01:00:01  | api.cloudflare.com | AAAA  | NOERROR

Aggregated:

Time Bucket       | QName              | QType | RCODE   | Count | p50 Response Time
2017/01/01 01:00  | www.cloudflare.com | A     | NODATA  | 5     | 0.4876ms
2017/01/01 01:00  | api.cloudflare.com | AAAA  | NOERROR | 10    | 0.5231ms
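The roll-up above can be sketched as grouping raw queries into one-minute buckets keyed by (qname, qtype, rcode), keeping a count plus the response times needed for the p50. Field names are illustrative:

```python
from collections import defaultdict
from statistics import median

def aggregate(events):
    # Bucket events by minute and dimension key, collecting response times.
    buckets = defaultdict(list)
    for e in events:
        minute = e["ts"][:16]  # "2017/01/01 01:00:00" -> "2017/01/01 01:00"
        buckets[(minute, e["qname"], e["qtype"], e["rcode"])].append(e["rt_ms"])
    # Reduce each bucket to (count, p50 response time).
    return {k: (len(v), median(v)) for k, v in buckets.items()}

events = [
    {"ts": "2017/01/01 01:00:00", "qname": "www.cloudflare.com",
     "qtype": "A", "rcode": "NODATA", "rt_ms": 0.4},
    {"ts": "2017/01/01 01:00:05", "qname": "www.cloudflare.com",
     "qtype": "A", "rcode": "NODATA", "rt_ms": 0.6},
]
agg = aggregate(events)
```

A real stream processor holds these buckets as windowed state and flushes them when the window closes, rather than batching all events in memory.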
Let’s aggregate everything... with streams
- Counters
- Total number of queries
- Query types
- Response codes
- Top-n query names
- Top-n query sources
- Response time/size quantiles
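An exact top-n counter illustrates the idea behind the top-n query name and source counters. At Cloudflare's volume an exact counter per name is too big to keep, which is why the pipeline later leans on approximate top-k; this is the exact, in-memory analogue:

```python
from collections import Counter

# Exact top-n over a small sample of query names (illustrative data).
qnames = ["a.example", "b.example", "a.example",
          "c.example", "a.example", "b.example"]
top2 = Counter(qnames).most_common(2)
```

For unbounded, high-cardinality streams, sketch structures such as SpaceSaving bound the memory at the cost of approximate counts.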
Aggregating with Spark Streaming
Produce low-cardinality aggregates with Spark Streaming
- Spark experience in-house, though Java/Scala
- Batch-oriented, and needs a DB to serve online queries
- Difficult to support ad-hoc analysis
- Low-resolution aggregates
- Scanning raw data is slow
- Late-arriving data is hard to handle
Spark Streaming + CitusDB
Produce low-cardinality aggregates with Spark Streaming
Insert aggregate rows into a CitusDB cluster for reads
- Distributed time-series DB
- Existing deployments of CitusDB
- SQL API
- High-cardinality aggregations are tricky due to insert performance
- Late-arriving data is hard to handle
Apache Flink + (CitusDB?)
Produce low-cardinality aggregates with Flink
Insert aggregate rows into a CitusDB cluster for reads
- Dataflow API and support for stream watermarks
- SQL API
- Checkpoint performance issues
- High-cardinality aggregations are tricky due to insert performance
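Stream watermarks are what make late-arriving data tractable: a window only closes once the watermark (max event time seen, minus an allowed lateness) passes its end. A minimal sketch, with integer-second timestamps and constants chosen for illustration:

```python
ALLOWED_LATENESS = 60  # seconds an event may lag behind the max seen
WINDOW = 60            # one-minute tumbling windows

def closed_windows(events):
    """Yield (window_start, count) once the watermark passes a window's end."""
    open_windows, max_ts = {}, 0
    for ts in events:
        max_ts = max(max_ts, ts)
        watermark = max_ts - ALLOWED_LATENESS
        w = ts - ts % WINDOW
        if w + WINDOW <= watermark:
            continue  # too late: this window already closed, event is dropped
        open_windows[w] = open_windows.get(w, 0) + 1
        # Emit and forget every window the watermark has now passed.
        for start in sorted(list(open_windows)):
            if start + WINDOW <= watermark:
                yield start, open_windows.pop(start)
```

Events arriving within the lateness bound still land in the right bucket; anything later is dropped (Flink can alternatively route such events to a side output).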
Druid
Insert into a cluster of Druid nodes
- Insertion rate couldn't keep up in our initial tests
- Estimated costs of a suitable cluster were prohibitively high
- Seemed performant for random reads, but not the best we'd seen
- Operational complexity seemed high
Let’s aggregate everything... with streams
- Raw data isn't easily queried ad-hoc
- Backfilling new aggregates is impossible, or very difficult, without custom tools
- A stream can't serve actual queries
- Can be costly for high-cardinality dimensions
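The cost of pre-aggregating high-cardinality dimensions is easy to see with back-of-envelope arithmetic: the worst-case number of aggregate rows per time bucket is the product of the dimensions' cardinalities. The numbers below are illustrative, not Cloudflare's real cardinalities:

```python
from math import prod

# Hypothetical cardinalities for three dimensions.
qtypes, rcodes, qnames = 50, 20, 1_000_000

# Worst case: one aggregate row per combination, per time bucket.
rows_per_bucket = prod([qtypes, rcodes, qnames])  # 10^9 rows
```

With a million distinct query names, a single minute bucket can in principle need a billion aggregate rows, which is why pre-aggregation only pays off for low-cardinality dimensions.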
*https://clickhouse.yandex/docs/en/introduction/what_is_clickhouse.html
ClickHouse
- Tabular, column-oriented data store
- Single binary, clustered architecture
- Familiar SQL query interface
- Lots of very useful built-in aggregation functions
- Raw log data stored for 3 months (~7 trillion rows)
- Aggregated data stored indefinitely (1m and 1h aggregations across 3 dimensions)
Cloudflare DNS Server
Log Forwarder
HTTP & Other Edge Services
AnycastDNS
Log messages are serialized with Cap’n’Proto
Logs from all edge services and all PoPs are shipped over TLS to be processed
Logs are written into various kafka topics
Logs are received and de-multiplexed
Go Inserters write the data in parallel
Multi-tenant ClickHouse cluster stores data
Initial table design

ClickHouse Cluster:
- TinyLog: dnslogs_2016_01_01_14_30_pN
- ReplicatedMergeTree: dnslogs_2016_01_01
- ReplicatedMergeTree: dnslogs_2016_01
- ReplicatedMergeTree: dnslogs_2016

- Raw logs are inserted into sharded tables
- A sidecar process aggregates data into day/month/year tables
First attempt in prod.

ClickHouse Cluster:
- ReplicatedMergeTree: r{0,2}.dnslogs

- Raw logs are inserted into one replicated, sharded table
- Multiple r{0,2} databases to better pack the cluster with shards and replicas
Speeding up typical queries
- SUM() and COUNT() over a few low-cardinality dimensions
- Global overview (trends and monitoring)
- Storing intermediate state for non-additive functions
Today...

ClickHouse Cluster:
- ReplicatedMergeTree: r{0,2}.dnslogs
- ReplicatedAggregatingMergeTree: dnslogs_rollup_X

- Raw logs are inserted into one replicated, sharded table
- Multiple r{0,2} databases to better pack the cluster with shards and replicas
- Aggregate tables for long-term storage
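The rollup tables keep intermediate state because non-additive functions don't compose: per-minute p50s cannot be combined into an hourly p50, but partial states can be merged. ClickHouse's AggregatingMergeTree stores such partial aggregation states; this sketch models the simplest case, merging (count, sum) pairs to recover an hourly average:

```python
def merge_states(states):
    """Combine per-minute (count, sum) partial states into one hourly state."""
    count = sum(c for c, _ in states)
    total = sum(s for _, s in states)
    return count, total

# Two one-minute buckets: (query count, summed response time in ms).
minutes = [(5, 2.0), (10, 6.0)]
count, total = merge_states(minutes)
hourly_avg = total / count
```

Note that averaging the two per-minute averages (0.4 and 0.6) would give the wrong answer, since the buckets have different counts; quantiles need richer mergeable state (a sketch or sample) for the same reason.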
October 2016: Began evaluating technologies and architecture; 1 instance in Docker
November 2016: Prototype ClickHouse cluster with 3 nodes, inserting a sample of data
December 2016: Finalized schema, deployed a production ClickHouse cluster of 6 nodes; ClickHouse visualisations with Superset and Grafana
Spring 2017: TopN, IP prefix matching, Go native driver, Analytics library, pkey in monotonic functions
August 2017: Migrated to a new cluster with multi-tenancy; growing interest among other Cloudflare engineering teams, worked on standard tooling
ClickHouse Today… 12 Trillion Rows

Multi-tenant ClickHouse cluster:
- 33 nodes
- 8M+ row insertions/second
- 4GB+ insertion throughput/second
- 2PB+ of RAID-0 spinning disks
SELECT table, sum(rows) AS total
FROM system.cluster_parts
WHERE database = 'r0'
GROUP BY table
ORDER BY total DESC
┌─table──────────────────────────────┬─────────────total─┐│ ███████████████ │ 9,051,633,001,267 ││ ████████████████████ │ 2,088,851,716,078 ││ ███████████████████ │ 847,768,860,981 ││ ██████████████████████ │ 259,486,159,236 ││ … │ … │
Contributions to ClickHouse
- TopK(n) aggregates: https://github.com/yandex/ClickHouse/pull/754
- TrieDictionaries (IP prefix): https://github.com/yandex/ClickHouse/pull/785
- SpaceSaving: internal storage for StringRef{}: https://github.com/yandex/ClickHouse/pull/925
- Bug fixes to the Go native driver: https://github.com/kshvakov/clickhouse
- sumMap(key, value): https://github.com/yandex/ClickHouse/pull/1250
Other Contributions
- Grafana plugin: https://github.com/vavrusa/grafana-sqldb-datasource (see also https://github.com/Vertamedia/clickhouse-grafana)
- SQLAlchemy (Superset): https://github.com/cloudflare/sqlalchemy-clickhouse
Python w/ Jupyter Notebooks

import requests
import pandas as pd
from timeit import default_timer as timer

def ch(q, host='127.0.0.1', port=9001):
    start = timer()
    r = requests.get(
        'https://%s:%d/' % (host, port),
        params={'user': 'xxx', 'query': q + '\nFORMAT TabSeparatedWithNames'},
        stream=True)
    end = timer()

    if not r.ok:
        raise RuntimeError(r.text)

    print('Query finished in %.02fs' % (end - start))
    return pd.read_csv(r.raw, sep="\t")
Check it: blog.cloudflare.com/how-cloudflare-analyzes-1m-dns-queries-per-second

Thanks!
@tarnfeld @vavrusam
https://cloudflare.com/careers/departments/engineering