The Data Driven Network Real-Time Data at LinkedIn...Kafka Hadoop Pinot Architecture Queries Raw Data Samza Multi-tenant Declarative specification Role-specific assignment Seamless

The Data Driven Network

Kapil Surlaker Director of Engineering

Real-Time Data at LinkedIn

Kapil Surlaker and Shirshanka Das XLDB 2016

2

What does real-time data mean?

Real-time Ingestion

Stream Processing

Real-time Serving

Batch Processing

Real-time Data: Use Cases

In-product analytics Reporting Features for ML models (impression discounting) Standardization Monitoring, Alerting

Member-Facing

Business-Facing

Data

corp.linkedin.com

Transformlinkedin.com

Every pipeline looks like this

Ingest Process ServeCreate[ ]

The Unified Metrics PlatformHow Single source of truth for metrics Centralized multi-tenant pipeline Easy on boarding

Why Disjointed efforts, unreliable systems Fragmented data pipelines across online and offline Unpredictable SLA across all systems Diminished trust in any dashboard

WorkflowMetric

Definition

Sandbox

Code Repository

Metric Owner

System Jobs

Build

Core Metrics Job

Central Team, Relevant

Stakeholders

1. iterate

2. create 3. review

4. check in

Metric Definition

Name Description TagsOwners

Dataset

Dimensions

TimeScript

Metrics

Entity Ids

Tier

Formulas

Entity Dimensions

Input Datasets

Time windows

UMP Data FlowUmp Monitor

Primary Data

(tracking, databases, external)

UMP Raw Data

UMP Aggregated

Data Relevance

Experiment analysis

Ad-hoc

Metrics Script

Data Prep agg cube

dimension verify

HDFS + Pinot

Dashboards

…

Ingest Process ServeCreate

Espresso Kafka Databus

Real-time data: DatabasesEspresso - key-document storage - secondary indexes - transactional updates

to related data - real-time change-

stream

Apps

Espresso DB

Databus

Downstream

Scale - XXX TB un-replicated - 1M+ qps, 0.1M+ wps - XXXX machines

Real-time Data: Tracking

Kafka (Tracking)Client-side Tracking

Tracking Frontend

Services

Downstream

Kafka - high-volume pub-sub

- tunable guarantees

- Scale - ~1.5T messages/day

- ~17M messages/sec

- XXX machines

What about our acquisitions?

Slideshare

Bizo

Lynda

…

Where does this data live?

RESTSFTPJDBC

What about other integrations?

• Salesforce • Google

Doubleclick • Responsys • Eloqua • …

Where does this data live?

RESTSFTP

JDBC


Source Diversity

Batch +

StreamingData

Quality

Requirements

Gobblin Architecture

17

Source

Work Unit

Work Unit

Work Unit

Extract

Extract

Extract

Convert

Convert

Convert

Quality

Quality

Quality

Write

Write

Write

Data Publish

Task

Task

Task

Solving for real-timeInefficiencies in batch

YARN based

Apache Helix

Continuous

Auto-scaling

YARN

Helix

Executor 1

Executor 2

Executor 3

HDFS

Stream Source

Current ActivityOpen source @ github.com/linkedin/gobblin

Adopted by LinkedIn, Intel, Swisscom, NerdWallet,…

@LinkedIn ~100 TB per day @ LinkedIn

Hundreds of datasets

~20 different sources

Under Development Metadata-driven

Hadoop-Hadoop copies


Hadoop Spark Samza

Transformation engines @ LinkedIn

@ LinkedInUse-cases

- Machine Learning

- Photon ML (Machine learning library on Spark)

- Much faster iteration times

- Larger feature sets

By the Numbers

- ~XXX machines

- ~XXXX unique flows

Improvement

Feeds relevance 2 h to 14 m (9 x)

Jobs relevance 32 m to 1.3 m (24 x)

Ads relevance 24 h to 45 m (32 x)

Communication relevance 18 h to 30 m (36 x) with 10 x more features

@ LinkedInUse-cases

- Standardization

- Email optimization

- Site-speed

By the Numbers

- ~XXX machines

- ~XXX jobs

What

- Stream processing framework

- Apache YARN for resource allocation

- Apache Kafka for stream storage

- Local state optimizations using RocksDB

There’s more out there…

Heron

2.0

Streams


P not

Real-time. Interactive.

Slice and Dice metrics

Precompute!

Device Geo View

Android US 1

Android IN 1

iOS US 1

Dimension View

Android 2

iOS 1

US 2

IN 1

Android,US 1

iOS,US 1

Android,IN 1

More dimensions!Device Geo Carrier View

Android US ATT 1

Android IN Reliance 1

iOS US Verizon 1

Dimension View

Android 2

iOS 1

US 2

IN 1

ATT 1

Reliance 1

Verizon 1

Android,US 1

... ...

ChallengesHorizontally scalable Low latency Data freshness Fault tolerance OLAP features

SQL-like interface

(minus joins)

Sub-second query latency

Data load from Hadoop

and Kafka

Capabilities

Pinot Data Flow

Kafka Hadoop

Samza Process

Pinot

minuteshour +

Pinot@LinkedIn

Site-‐facing Apps Reporting dashboards Monitoring

In production since 2012 Open sourced in 2015 @ github.com/linkedin/pinot

http://github.com/linkedin/pinot

(S)QL: Filters and AggsSELECT count(*) FROM companyFollowHistoricalEvents WHERE entityId = 121011 AND 'day' >= 15949 AND 'day' <= 15963 AND paid = 'y’ AND action = 'stop'

(S)QL: Group BySELECT count(*) FROM companyFollowHistoricalEvents WHERE entityId = 121011 AND 'day' >= 15949 AND 'day' <= 15963 AND paid = 'y’ GROUP BY action

(S)QL: ORDER BY and LIMITSELECT * FROM companyFollowHistoricalEvents WHERE entityId = 121011 AND entityId = 1000 AND action = 'start' ORDER BY creationTime DESC LIMIT 1

Broker Helix

Real time Historical

Kafka Hadoop

Pinot Architecture

Queries

Raw Data Samza

Multi-tenantDeclarative specification Role-specific assignment Seamless cluster expansion Powered by Apache Helix

“numCopies”: 2 }

{ “resourceName”: “MyStore”,“numDataNodes”: 4,“numBrokers”: 2,

Scale Largest cluster: xxx nodes # of stores: xxx

Columnar Storage

Forward Index

Fast but needs a ton of RAM

Single-node OptimizationsDisk-based structures Indexes: inverted, bitmap, hybrid File optimization Compression: dictionary, p4delta Multi-valued columns, skip lists Vectorization: ~7x latency improvement Query rewrite Sketching algorithms: HyperLogLog

To pre-compute or not?

Data aware pre-computation

Speeding up the cycle

Form hypothesis Query Repeat

Form hypothesis Query Repeat

OR …

Breaking the cycle


Hadoop Spark Samza

PinotEspresso Kafka Databus

Real-Time Data at LinkedIn

Kapil Surlaker @kapilsurlaker

49

Shirshanka Das @shirshanka

Thanks!

Documents

The Data Driven Network Real-Time Data at LinkedIn...Kafka Hadoop Pinot Architecture Queries Raw Data Samza Multi-tenant Declarative specification Role-specific assignment Seamless