Data Storage Infra at LinkedIn
Yan Yan, Staff Software Engineer
Today’s Agenda
1. LinkedIn Overview
2. Data Infra at LinkedIn
3. Espresso – Distributed Document Store
4. Ambry – Distributed Object Store
5. Venice – Derived Data Platform
6. Summary
• 546 million members
• > 100 million monthly active users
• Over 200 countries
ADVANCE MY CAREER
• Get the right job
• Build meaningful relationships
• Establish & manage my reputation
• Research & contact people
• Stay well informed
Data Infra at LinkedIn
Common Data Patterns
• Online – activity that should be reflected immediately
• Nearline – activity that should be reflected “soon”
• Offline – ETL processing, generally updated in batches
[Diagram: UI → Business Service Layer → Data Service Layer. Online data storage serves the data service layer; an event buffer feeds a streaming pipeline into nearline data storage; CDC and an offline ETL pipeline feed offline storage and the data analytics platform.]
Espresso: Distributed Document Store (LinkedIn Online Data Solution)
Why Espresso
Scalability vs. feature set: Oracle sits at the rich-feature, low-scalability end; Voldemort scales well but offers only a K-V model. Espresso fills the gap:
• Consistency is important
• A K-V model does not align with full application needs
• The full Oracle data model and query complexity are not needed
• So: build a new data store that is consistent, scalable, indexed, and richer than K-V
Espresso’s Design Goals
• Scalable and elastic
• Read-after-write consistency
• Structured data with schemas
• Secondary indexes
• Transactional updates to inter-related data
• Multi-datacenter support
• Seamless integration with nearline and offline systems
Espresso - Architecture
[Diagram: clients send requests through routers to storage nodes, each running an API server on top of MySQL. Helix, backed by ZooKeeper, manages the cluster. A Kafka-based data replicator ships changes to remote data centers; a snapshot service writes backups to backup storage, and a streaming process feeds Hadoop in the offline data center. Online, nearline, and offline paths are shown alongside the control plane.]
Espresso – RESTful API
• Get: GET /database/table/resource_id
• Create: PUT /database/table/resource_id {record}
• Update: POST /database/table/resource_id {field:value}
• Hierarchy Get: GET /database/table/resource_id/sub-resource_id
• Query: GET /database/table/resource_id?query=“field:pattern”
Transactional Updates
Update records sharing the same resource_id in different tables via multi-post:
• /database/table1/id1 {field:value}
• /database/table2/id1/sub-id1 {field:value}
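The operation-to-HTTP mapping above can be sketched as a tiny helper. This is a hypothetical function with made-up database and table names, not LinkedIn's actual client library:

```python
# Illustrative sketch of Espresso's REST-style operations (hypothetical
# client; URL patterns follow the slide above, not a real endpoint).

def espresso_request(op, database, table, resource_id, sub_id=None,
                     body=None, query=None):
    """Map a logical operation to an HTTP (method, path, body) triple."""
    path = f"/{database}/{table}/{resource_id}"
    if sub_id is not None:
        path += f"/{sub_id}"          # hierarchical sub-resource
    if query is not None:
        path += f"?query={query}"     # secondary-index query
    method = {"get": "GET", "create": "PUT", "update": "POST"}[op]
    return method, path, body

# A read and a partial update against the same record:
print(espresso_request("get", "MemberDB", "Profiles", "123"))
print(espresso_request("update", "MemberDB", "Profiles", "123",
                       body={"headline": "Engineer"}))
```

Because both operations share the same resource_id, a multi-post could bundle them into one transactional request.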
Espresso – MySQL Mapping
[Diagram: an Espresso database's tables are distributed by partition key across MySQL instances; each partition (es_identity_1, es_identity_2, …) holds its own copy of Table1, Table2, … on its assigned MySQL instance.]
Espresso – Data Distribution
[Diagram: partitions P1–P4 are spread across Nodes 1–3, each partition with master, slave, and offline replicas; e.g. P4 has its master on Node1, its slave on Node2, and an offline replica on Node3. Helix tracks the live instances and publishes the resulting external view.]
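The external view can be pictured as a per-partition map from node to state. A minimal sketch, using the P4 assignment from the slide (the dictionary layout is illustrative, not Helix's actual data structure):

```python
# Sketch of a Helix-style "external view": for each partition, a map
# from node to replica state (MASTER / SLAVE / OFFLINE).

external_view = {
    "P4": {"Node1": "MASTER", "Node2": "SLAVE", "Node3": "OFFLINE"},
}

def master_of(partition):
    """Return the node currently mastering a partition."""
    nodes = external_view[partition]
    return next(n for n, s in nodes.items() if s == "MASTER")

print(master_of("P4"))  # → Node1
```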
Espresso – Cluster Expansion
[Diagram: Node4 joins the cluster and appears in Helix's live-instance list; Helix rebalances partitions onto it (e.g. P1: Master on Node1, Slaves on Node2 and Node3, Offline replica on Node4) and republishes the external view.]
Espresso – Node Failover
[Diagram sequence: Node3 fails and disappears from Helix's live-instance list. Partitions mastered on Node3 fail over (e.g. P1: Master moves to Node4, Slave stays on Node2), and the external view is updated. Replacement replicas of Node3's partitions (P1, P3, P4) are then rebuilt on the surviving nodes by replaying from Kafka.]
Ambry: Distributed Object Storage (LinkedIn Online Data Solution)
Object Storage Use Cases
• Media: image, video, audio
• Documents: docs, spreadsheets, slides
• Backup: database backups
• Static content: JS, CSS, templates
Before Ambry
Media server:
• Monolithic
• Not scalable
• No full control
• Expensive
Ambry
Distributed object storage system:
• Immutable blobs
• Geo-distributed, horizontally scalable
• Unstructured data
• Multi-master
• Cost effective
Ambry - Architecture
[Diagram: in each data center, Ambry clients and CDNs (over HTTP) hit a tier of HTTP + routing services in front of storage services, coordinated by a cluster manager; DataCenter1 and DataCenter2 are linked by cross-DC replication.]
Ambry - PUT Operation
1. Client PUTs data
2. Routing service chooses a partition and generates a blob id
3. Data is written to 3 replicas
4. Router waits for at least 2 nodes to respond successfully
5. Blob id is returned to the client
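The 2-of-3 write quorum above can be sketched with in-memory stand-ins for the storage services. Blob-id generation is simplified to a UUID; this is an assumption for illustration, not Ambry's actual id format:

```python
# Minimal sketch of the quorum write: write to all replicas, succeed
# once a majority (2 of 3) acknowledge.
import uuid

REPLICAS = [dict(), dict(), dict()]   # three storage-service replicas

def put_blob(data, min_acks=2):
    blob_id = str(uuid.uuid4())       # step 2: generate blob id
    acks = 0
    for replica in REPLICAS:          # step 3: write to all 3 replicas
        try:
            replica[blob_id] = data
            acks += 1
        except Exception:
            pass    # a failed replica would be repaired by replication
    if acks < min_acks:               # step 4: need at least 2 acks
        raise IOError("quorum not reached")
    return blob_id                    # step 5: return blob id to client

bid = put_blob(b"profile-photo-bytes")
print(sum(bid in r for r in REPLICAS))  # → 3
```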
Ambry - GET Operation
1. Client GETs blob_id
2. Routing service chooses the partition based on blob_id
3. Data is read from 2 replicas
4. Router waits for at least 1 node's successful response
5. Data is returned to the client
Ambry - Large Blob PUT Operation
[Diagram: the client streams a large blob to the routing service, which splits it into chunks blob1 … blobN and writes each chunk as a regular blob across the storage services (steps 2 … N+1); it then stores a meta blob listing blob_id1 … blob_idN (steps N+2, N+3) and returns the meta blob's id to the client.]
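The chunking scheme can be sketched as follows. The chunk size, id format, and meta-blob encoding are illustrative assumptions, not Ambry's actual values:

```python
# Sketch of large-blob chunking: split into fixed-size chunks, store
# each as a regular blob, then store a "meta blob" listing chunk ids.

CHUNK_SIZE = 4  # bytes; tiny, for demonstration only
store = {}
counter = 0

def put_small(data):
    global counter
    counter += 1
    blob_id = f"blob{counter}"        # illustrative id scheme
    store[blob_id] = data
    return blob_id

def put_large(data):
    chunk_ids = [put_small(data[i:i + CHUNK_SIZE])
                 for i in range(0, len(data), CHUNK_SIZE)]
    return put_small(",".join(chunk_ids).encode())  # meta blob

def get_large(meta_id):
    ids = store[meta_id].decode().split(",")
    return b"".join(store[i] for i in ids)          # reassemble chunks

mid = put_large(b"0123456789")
print(get_large(mid))  # → b'0123456789'
```

A GET of a large blob therefore costs one meta-blob read plus one read per chunk.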
Replication
• Multi-master replication
• Asynchronous
• Pull based
• Intra-colo and cross-colo replication
Ambry – Replication
Each node has an append-only log holding blobs at increasing offsets (… BlobId:50, BlobId:30, BlobId:70, BlobId:40 at offsets 640, 700, 770, 850; log end 900), plus a journal mapping offsets to blob ids:

Offset  Blob id
640     50
700     30
770     70
850     40
Ambry – Replication
[Diagram: Node1's log holds blobs 50, 30, 70, 40 (offsets 640–900); Node2's holds 50, 30, 80, 90 (offsets 640–940). Node1 asks Node2 for blob ids from offset 700; Node2 replies with 30, 80, 90. Node1 already has 30, so it fetches blobs 80 and 90 and appends them to its log.]
Ambry – Replication
[Diagram: in the reverse direction, Node2 asks Node1 for blob ids from offset 700 and receives 30, 70, 40, 80, 90; it is missing only 70 and 40, fetches those blobs, and appends them to its log.]
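The pull-based exchange above reduces to a journal diff. A minimal sketch with the same offsets and blob ids as the example (the data structures are illustrative, not Ambry's):

```python
# Pull-based replication sketch: a node asks a peer for blob ids after
# its last-synced offset, diffs them against its own store, and fetches
# only the missing blobs.

node1 = {"journal": [(640, 50), (700, 30), (770, 70), (850, 40)]}
node2 = {"journal": [(640, 50), (700, 30), (770, 80), (890, 90)]}

def ids_after(node, offset):
    """Blob ids the peer has written at or after the given offset."""
    return [bid for off, bid in node["journal"] if off >= offset]

def missing_on(local, remote, offset):
    """Ids the remote advertises that the local node lacks."""
    have = {bid for _, bid in local["journal"]}
    return [bid for bid in ids_after(remote, offset) if bid not in have]

print(missing_on(node1, node2, 700))  # Node1 fetches → [80, 90]
print(missing_on(node2, node1, 700))  # Node2 fetches → [70, 40]
```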
Venice: Derived Data Platform (LinkedIn Nearline + Offline Data Solution)
Kinds of Data

Primary Data
• Source of truth
• Example use case: Profile
• Example systems: SQL, document stores, K-V stores

Derived Data
• Derived by computing over primary data
• Example use case: People You May Know
• Example systems: search indices, graph databases, K-V stores
Derived Data Lifecycle Today
[Diagram: apps emit events into an events buffer; stream processing updates online storage in near real time, while offline storage feeds batch jobs whose results are loaded back into online storage.]
Lambda Architecture + Venice
[Diagram: stream processing (via Kafka) and batch processing (via Hadoop) both feed Venice, which serves the app.]
Venice Features
• Dataset versioning
• High-throughput ingestion from Hadoop and Samza
• Automatic cluster management
• Multi-DC, multi-cluster, multi-tenant
• Run as a service
Venice Data Model
[Diagram: a storage node holds stores; each store has versions (e.g. Store A's V1, V2, and current Version 3); each version has partitions; each partition has replicas (R1, R2, R3) holding records.]
• Store
• Version
• Partition
• Replica
• Record (Avro)
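The containment hierarchy can be sketched with nested dictionaries. Names, partition counts, and the hash-based partitioner are illustrative assumptions, not Venice's actual classes:

```python
# Sketch of the store → version → partition → record hierarchy, with
# records hashed to partitions by key (simplified; no replicas shown).

def partition_for(key, num_partitions):
    return hash(key) % num_partitions   # simple hash partitioning

stores = {
    "store_a": {
        "current_version": 3,
        "versions": {3: {p: [] for p in range(2)}},  # 2 partitions
    }
}

def put(store_name, key, value):
    s = stores[store_name]
    version = s["versions"][s["current_version"]]
    version[partition_for(key, len(version))].append((key, value))

put("store_a", "member:1", {"pymk": [2, 3]})
total = sum(len(recs)
            for recs in stores["store_a"]["versions"][3].values())
print(total)  # → 1
```

Versioning the data at the store level is what makes atomic swaps between dataset pushes possible (shown later under Version Swapping).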
Venice Components
[Diagram: a Hadoop push job and Samza push data into the storage nodes (push data flow); the controller handles metadata operations; clients read through the router (read data flow).]
Venice Batch Mode
[Diagram: an Azkaban job launches a map-reduce push job on Hadoop that writes the dataset into a Kafka cluster; the Venice controllers (one per cluster) create the new store version and have the storage nodes consume it from Kafka; once ingestion completes, the Venice router serves reads from the new version (steps 1–8).]
Venice Version Swapping
[Diagram: the Hadoop push job writes store-v8 into its own Kafka topic while the router still serves store-v7; when the push completes, the router swaps to v8 and the oldest version (v6) is retired.]
Venice Hybrid Mode
Goals:
• Merge batch and streaming data
• Minimize application complexity
• Multi-version support

Write-time merge:
• Hadoop writes into store-version topics
• Samza writes into a Real-Time Buffer topic (RTB)
• The RTB gets replayed into store-version topics
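The write-time merge can be sketched as follows: each version topic first receives the batch push, then the RTB overlay, so streaming writes override batch records with the same key. Topic names and the dict-based "topics" are illustrative stand-ins for Kafka:

```python
# Sketch of the RTB replay: batch data lands directly in store-version
# topics; streaming writes go to a real-time buffer that is replayed
# into every live store version.

version_topics = {"store_v7": {}, "store_v8": {}}
rtb = []  # real-time buffer, written by the stream processor

def batch_push(version, records):
    version_topics[version].update(records)   # Hadoop push job

def stream_write(key, value):
    rtb.append((key, value))                  # Samza write

def replay_rtb():
    for version in version_topics:            # replay into all versions
        for key, value in rtb:
            version_topics[version][key] = value  # stream wins on key

batch_push("store_v8", {"m1": "batch", "m2": "batch"})
stream_write("m1", "fresh")
replay_rtb()
print(version_topics["store_v8"]["m1"])  # → fresh
```

Replaying into every live version is what keeps a freshly pushed version up to date before the router swaps to it.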
Venice Hybrid Mode
[Diagram: Samza writes into the RTB topic while the Hadoop push job writes store-v8; the RTB is replayed into both store-v7 (currently serving reads via the router) and store-v8, so the new version is already up to date when the router swaps to it.]
Summary

Espresso
• Document store
• Online data
• Get/Put/Transactional
• Expansion and failover

Ambry
• Object store
• Online immutable data
• Get/Put/Large blob PUT
• Multi-master replication

Venice
• K-V store
• Derived data
• Get/Push
• Batch + streaming
Learn more: engineering.linkedin.com/blog
Backup Slides
Online Data
• Member Profile Update
• Post to a Group
• Social Gestures (Comment/Like/Share)
Nearline Data
• Standardization
• Search Index Update
• Network Update Stream
Offline Data
• People You May Know
• Who Viewed My Profile
• Jobs You May Be Interested In
Why Espresso

Oracle
• Structured data schema
• Strong consistency support
• Difficult/expensive to run at Internet scale

Voldemort
• Simpler data model (K-V)
• Write availability
• Eventual consistency
• Scales well and cheaply
Espresso – Cross-DC Replication
[Diagram: in each datacenter, clients write through routers to storage nodes; a Kafka cluster and data replicator in each datacenter ship changes to the other via cross-DC replication.]
Cross-DC Replication
• Boomerang elimination
• Conflict resolution: last write wins
• Unique id generation: user-selectable options
• Data consistency checker
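Last-write-wins can be sketched as a deterministic merge over timestamped writes. The (timestamp, origin) tie-break is a common LWW scheme assumed here for illustration; the slide does not specify Espresso's exact format:

```python
# Last-write-wins sketch for cross-DC conflict resolution: each write
# carries a timestamp plus an origin id used to break ties, so both
# datacenters converge on the same winner.

def lww_merge(local, remote):
    """Keep whichever write has the larger (timestamp, origin) pair."""
    return max(local, remote, key=lambda w: (w["ts"], w["origin"]))

a = {"value": "from-dc1", "ts": 100, "origin": 1}
b = {"value": "from-dc2", "ts": 100, "origin": 2}
print(lww_merge(a, b)["value"])  # → from-dc2 (tie broken by origin)
```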
Ambry – Data Distribution
[Diagram: partitions P1–P3 are each replicated across Node1, Node2, and Node3.]

Partition  Status
1          Read-write
2          Read-write
3          Read-write
Ambry – Cluster Expansion
[Diagram: Node4–Node6 join, each replicating the new partitions P4–P6; full partitions P1 and P2 are sealed read-only while new writes go to read-write partitions.]

Partition  Status
1          Read-only
2          Read-only
3          Read-write
4          Read-write
5          Read-write
6          Read-write
Ambry – Storage Layout
[Diagram: a 400GB append-only log holds blobs 50, 30, 70, 40 at offsets 640, 700, 770, 850; the log end offset is 900. The index is split into segments (index segment1, segment2, segment3); the current segment covers start offset 700 to end offset 900 and maps blob ids to offsets, sorted by blob id:]

blob id  offset  TTL
id 30    700     ∞
id 40    850     1/1/16
id 70    770     ∞
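A lookup in a sorted index segment can be sketched with a binary search. The values match the example table above; the segment layout is a simplified illustration:

```python
# Sketch of an index-segment lookup: entries are sorted by blob id, so
# finding a blob's log offset within a segment is a binary search.
import bisect

segment = {
    "start": 700, "end": 900,
    "ids":     ["id30", "id40", "id70"],   # sorted by blob id
    "offsets": [700,    850,    770],      # parallel log offsets
}

def find_offset(seg, blob_id):
    """Binary-search a segment; return the log offset or None."""
    i = bisect.bisect_left(seg["ids"], blob_id)
    if i < len(seg["ids"]) and seg["ids"][i] == blob_id:
        return seg["offsets"][i]
    return None

print(find_offset(segment, "id70"))  # → 770
```

A full lookup would scan segments newest-first, checking each segment's Bloom filter before searching it.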
Storage Optimization
• O(1) disk I/O for writes
• Bloom filter for index segments
• Rely on OS page cache
• Zero copy for gets
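A tiny Bloom-filter sketch shows why a negative lookup can skip an index segment without touching disk. The sizing and double-hashing scheme are illustrative assumptions, not Ambry's implementation:

```python
# Minimal Bloom filter: set k bit positions per key; if any position is
# unset on lookup, the key is definitely absent (no false negatives,
# occasional false positives).

class Bloom:
    def __init__(self, bits=256, hashes=3):
        self.bits, self.hashes = [0] * bits, hashes

    def _positions(self, key):
        h1, h2 = hash(key), hash(key + "salt")   # double hashing
        return [(h1 + i * h2) % len(self.bits)
                for i in range(self.hashes)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        return all(self.bits[p] for p in self._positions(key))

bf = Bloom()
bf.add("id30")
bf.add("id70")
print(bf.might_contain("id30"))  # → True
print(bf.might_contain("zz99"))  # almost certainly False: skip segment
```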
Venice Read/Write API
• Derived-data K-V store: single Get, batch Get
• High-throughput ingestion from Hadoop, Samza, or both (hybrid)
Venice Scale
• Large scale: multi-datacenter, multi-cluster
• Run “as a service”: self-service onboarding, each cluster is multi-tenant, resource isolation
Venice Tradeoffs
All writes go through Kafka:
• Scalable
• Burst tolerant
• Asynchronous
• No native “read your writes” semantics
Global Replication
[Diagram: a parent controller coordinates per-datacenter controllers; the Hadoop push job writes once, and Kafka mirror makers replicate the data across datacenter boundaries to the storage nodes in each datacenter.]