Data Storage Infra at LinkedIn
Yan Yan, Staff Software Engineer
Today’s Agenda
1. LinkedIn Overview
2. Data Infra at LinkedIn
3. Espresso – Distributed Document Store
4. Ambry – Distributed Object Store
5. Venice – Derived Data Platform
6. Summary
• 546 million members
• > 100 million monthly active users
• Over 200 countries
ADVANCE MY CAREER
• Get the right job
• Build meaningful relationships
• Establish & manage my reputation
• Research & contact people
• Stay well informed
Data Infra at LinkedIn
Common Data Patterns
• Online – activity that should be reflected immediately
• Nearline – activity that should be reflected “soon”
• Offline – ETL processing, generally updated in batches
[Diagram: UI → Business Service Layer → Data Service Layer. Online data storage serves the data service layer; an event buffer feeds a streaming pipeline into nearline data storage; CDC and an offline ETL pipeline feed offline storage and the data analytics platform.]
Espresso: Distributed Document Store (LinkedIn Online Data Solution)
Why Espresso
Scalability vs. feature set: Oracle sits at the rich-feature, low-scalability end; Voldemort scales well but offers only a K-V model. Espresso fills the gap:
• Consistency is important
• A K-V model does not align with full application needs
• The full Oracle data model and query complexity are not needed
• So: build a new data store that is consistent, scalable, indexed, and richer than K-V
Espresso’s Design Goals
• Scalable and elastic
• Read-after-write consistency
• Structured data with schemas
• Secondary indexes
• Transactional updates to inter-related data
• Multi-datacenter support
• Seamless integration with nearline and offline systems
Espresso - Architecture
[Diagram: clients send requests through routers to storage nodes, each running an API server on top of MySQL. Helix, backed by ZooKeeper, manages the cluster. A Kafka-based data replicator ships changes to remote data centers; a snapshot service writes backups to backup storage, and a streaming process feeds Hadoop in the offline data center. Online, nearline, and offline paths are shown alongside the control plane.]
Espresso – RESTful API
• Get: GET /database/table/resource_id
• Create: PUT /database/table/resource_id {record}
• Update: POST /database/table/resource_id {field:value}
• Hierarchy Get: GET /database/table/resource_id/sub-resource_id
• Query: GET /database/table/resource_id?query=“field:pattern”
Transactional Updates
Update records sharing the same resource_id in different tables via multi-post:
• /database/table1/id1 {field:value}
• /database/table2/id1/sub-id1 {field:value}
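The operation-to-HTTP mapping above can be sketched as a tiny helper. This is a hypothetical function with made-up database and table names, not LinkedIn's actual client library:

```python
# Illustrative sketch of Espresso's REST-style operations (hypothetical
# client; URL patterns follow the slide above, not a real endpoint).

def espresso_request(op, database, table, resource_id, sub_id=None,
                     body=None, query=None):
    """Map a logical operation to an HTTP (method, path, body) triple."""
    path = f"/{database}/{table}/{resource_id}"
    if sub_id is not None:
        path += f"/{sub_id}"          # hierarchical sub-resource
    if query is not None:
        path += f"?query={query}"     # secondary-index query
    method = {"get": "GET", "create": "PUT", "update": "POST"}[op]
    return method, path, body

# A read and a partial update against the same record:
print(espresso_request("get", "MemberDB", "Profiles", "123"))
print(espresso_request("update", "MemberDB", "Profiles", "123",
                       body={"headline": "Engineer"}))
```

Because both operations share the same resource_id, a multi-post could bundle them into one transactional request.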
Espresso – MySQL Mapping
[Diagram: an Espresso database's tables are distributed by partition key across MySQL instances; each partition (es_identity_1, es_identity_2, …) holds its own copy of Table1, Table2, … on its assigned MySQL instance.]
Espresso – Data Distribution
[Diagram: partitions P1–P4 are spread across Nodes 1–3, each partition with master, slave, and offline replicas; e.g. P4 has its master on Node1, its slave on Node2, and an offline replica on Node3. Helix tracks the live instances and publishes the resulting external view.]
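The external view can be pictured as a per-partition map from node to state. A minimal sketch, using the P4 assignment from the slide (the dictionary layout is illustrative, not Helix's actual data structure):

```python
# Sketch of a Helix-style "external view": for each partition, a map
# from node to replica state (MASTER / SLAVE / OFFLINE).

external_view = {
    "P4": {"Node1": "MASTER", "Node2": "SLAVE", "Node3": "OFFLINE"},
}

def master_of(partition):
    """Return the node currently mastering a partition."""
    nodes = external_view[partition]
    return next(n for n, s in nodes.items() if s == "MASTER")

print(master_of("P4"))  # → Node1
```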
Espresso – Cluster Expansion
[Diagram: Node4 joins the cluster and appears in Helix's live-instance list; Helix rebalances partitions onto it (e.g. P1: Master on Node1, Slaves on Node2 and Node3, Offline replica on Node4) and republishes the external view.]
Espresso – Node Failover
[Diagram sequence: Node3 fails and disappears from Helix's live-instance list. Partitions mastered on Node3 fail over (e.g. P1: Master moves to Node4, Slave stays on Node2), and the external view is updated. Replacement replicas of Node3's partitions (P1, P3, P4) are then rebuilt on the surviving nodes by replaying from Kafka.]
Ambry: Distributed Object Storage (LinkedIn Online Data Solution)
Object Storage Use Cases
• Media: image, video, audio
• Documents: docs, spreadsheets, slides
• Backup: database backups
• Static content: JS, CSS, templates
Before Ambry
Media server:
• Monolithic
• Not scalable
• No full control
• Expensive
Ambry
Distributed object storage system:
• Immutable blobs
• Geo-distributed, horizontally scalable
• Unstructured data
• Multi-master
• Cost effective
Ambry - Architecture
[Diagram: in each data center, Ambry clients and CDNs (over HTTP) hit a tier of HTTP + routing services in front of storage services, coordinated by a cluster manager; DataCenter1 and DataCenter2 are linked by cross-DC replication.]
Ambry - PUT Operation
1. Client PUTs data
2. Routing service chooses a partition and generates a blob id
3. Data is written to 3 replicas
4. Router waits for at least 2 nodes to respond successfully
5. Blob id is returned to the client
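The 2-of-3 write quorum above can be sketched with in-memory stand-ins for the storage services. Blob-id generation is simplified to a UUID; this is an assumption for illustration, not Ambry's actual id format:

```python
# Minimal sketch of the quorum write: write to all replicas, succeed
# once a majority (2 of 3) acknowledge.
import uuid

REPLICAS = [dict(), dict(), dict()]   # three storage-service replicas

def put_blob(data, min_acks=2):
    blob_id = str(uuid.uuid4())       # step 2: generate blob id
    acks = 0
    for replica in REPLICAS:          # step 3: write to all 3 replicas
        try:
            replica[blob_id] = data
            acks += 1
        except Exception:
            pass    # a failed replica would be repaired by replication
    if acks < min_acks:               # step 4: need at least 2 acks
        raise IOError("quorum not reached")
    return blob_id                    # step 5: return blob id to client

bid = put_blob(b"profile-photo-bytes")
print(sum(bid in r for r in REPLICAS))  # → 3
```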
Ambry - GET Operation
1. Client GETs blob_id
2. Routing service chooses the partition based on blob_id
3. Data is read from 2 replicas
4. Router waits for at least 1 node's successful response
5. Data is returned to the client
Ambry - Large Blob PUT Operation
[Diagram: the client streams a large blob to the routing service, which splits it into chunks blob1 … blobN and writes each chunk as a regular blob across the storage services (steps 2 … N+1); it then stores a meta blob listing blob_id1 … blob_idN (steps N+2, N+3) and returns the meta blob's id to the client.]
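The chunking scheme can be sketched as follows. The chunk size, id format, and meta-blob encoding are illustrative assumptions, not Ambry's actual values:

```python
# Sketch of large-blob chunking: split into fixed-size chunks, store
# each as a regular blob, then store a "meta blob" listing chunk ids.

CHUNK_SIZE = 4  # bytes; tiny, for demonstration only
store = {}
counter = 0

def put_small(data):
    global counter
    counter += 1
    blob_id = f"blob{counter}"        # illustrative id scheme
    store[blob_id] = data
    return blob_id

def put_large(data):
    chunk_ids = [put_small(data[i:i + CHUNK_SIZE])
                 for i in range(0, len(data), CHUNK_SIZE)]
    return put_small(",".join(chunk_ids).encode())  # meta blob

def get_large(meta_id):
    ids = store[meta_id].decode().split(",")
    return b"".join(store[i] for i in ids)          # reassemble chunks

mid = put_large(b"0123456789")
print(get_large(mid))  # → b'0123456789'
```

A GET of a large blob therefore costs one meta-blob read plus one read per chunk.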
Replication
• Multi-master replication
• Asynchronous
• Pull based
• Intra-colo and cross-colo replication
Ambry – Replication
Each node has an append-only log holding blobs at increasing offsets (… BlobId:50, BlobId:30, BlobId:70, BlobId:40 at offsets 640, 700, 770, 850; log end 900), plus a journal mapping offsets to blob ids:

Offset  Blob id
640     50
700     30
770     70
850     40
Ambry – Replication
[Diagram: Node1's log holds blobs 50, 30, 70, 40 (offsets 640–900); Node2's holds 50, 30, 80, 90 (offsets 640–940). Node1 asks Node2 for blob ids from offset 700; Node2 replies with 30, 80, 90. Node1 already has 30, so it fetches blobs 80 and 90 and appends them to its log.]
Ambry – Replication
[Diagram: in the reverse direction, Node2 asks Node1 for blob ids from offset 700 and receives 30, 70, 40, 80, 90; it is missing only 70 and 40, fetches those blobs, and appends them to its log.]
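The pull-based exchange above reduces to a journal diff. A minimal sketch with the same offsets and blob ids as the example (the data structures are illustrative, not Ambry's):

```python
# Pull-based replication sketch: a node asks a peer for blob ids after
# its last-synced offset, diffs them against its own store, and fetches
# only the missing blobs.

node1 = {"journal": [(640, 50), (700, 30), (770, 70), (850, 40)]}
node2 = {"journal": [(640, 50), (700, 30), (770, 80), (890, 90)]}

def ids_after(node, offset):
    """Blob ids the peer has written at or after the given offset."""
    return [bid for off, bid in node["journal"] if off >= offset]

def missing_on(local, remote, offset):
    """Ids the remote advertises that the local node lacks."""
    have = {bid for _, bid in local["journal"]}
    return [bid for bid in ids_after(remote, offset) if bid not in have]

print(missing_on(node1, node2, 700))  # Node1 fetches → [80, 90]
print(missing_on(node2, node1, 700))  # Node2 fetches → [70, 40]
```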
Venice: Derived Data Platform (LinkedIn Nearline + Offline Data Solution)
Kinds of Data

Primary Data
• Source of truth
• Example use case: Profile
• Example systems: SQL, document stores, K-V stores

Derived Data
• Derived by computing over primary data
• Example use case: People You May Know
• Example systems: search indices, graph databases, K-V stores
Derived Data Lifecycle Today
[Diagram: apps emit events into an events buffer; stream processing updates online storage in near real time, while offline storage feeds batch jobs whose results are loaded back into online storage.]
Lambda Architecture + Venice
[Diagram: stream processing (via Kafka) and batch processing (via Hadoop) both feed Venice, which serves the app.]
Venice Features
• Dataset versioning
• High-throughput ingestion from Hadoop and Samza
• Automatic cluster management
• Multi-DC, multi-cluster, multi-tenant
• Run as a service
Venice Data Model
[Diagram: a storage node holds stores; each store has versions (e.g. Store A's V1, V2, and current Version 3); each version has partitions; each partition has replicas (R1, R2, R3) holding records.]
• Store
• Version
• Partition
• Replica
• Record (Avro)
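The containment hierarchy can be sketched with nested dictionaries. Names, partition counts, and the hash-based partitioner are illustrative assumptions, not Venice's actual classes:

```python
# Sketch of the store → version → partition → record hierarchy, with
# records hashed to partitions by key (simplified; no replicas shown).

def partition_for(key, num_partitions):
    return hash(key) % num_partitions   # simple hash partitioning

stores = {
    "store_a": {
        "current_version": 3,
        "versions": {3: {p: [] for p in range(2)}},  # 2 partitions
    }
}

def put(store_name, key, value):
    s = stores[store_name]
    version = s["versions"][s["current_version"]]
    version[partition_for(key, len(version))].append((key, value))

put("store_a", "member:1", {"pymk": [2, 3]})
total = sum(len(recs)
            for recs in stores["store_a"]["versions"][3].values())
print(total)  # → 1
```

Versioning the data at the store level is what makes atomic swaps between dataset pushes possible (shown later under Version Swapping).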
Venice Components
[Diagram: a Hadoop push job and Samza push data into the storage nodes (push data flow); the controller handles metadata operations; clients read through the router (read data flow).]
Venice Batch Mode
[Diagram: an Azkaban job launches a map-reduce push job on Hadoop that writes the dataset into a Kafka cluster; the Venice controllers (one per cluster) create the new store version and have the storage nodes consume it from Kafka; once ingestion completes, the Venice router serves reads from the new version (steps 1–8).]
Venice Version Swapping
[Diagram: the Hadoop push job writes store-v8 into its own Kafka topic while the router still serves store-v7; when the push completes, the router swaps to v8 and the oldest version (v6) is retired.]
Venice Hybrid Mode
Goals:
• Merge batch and streaming data
• Minimize application complexity
• Multi-version support

Write-time merge:
• Hadoop writes into store-version topics
• Samza writes into a Real-Time Buffer topic (RTB)
• The RTB gets replayed into store-version topics
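The write-time merge can be sketched as follows: each version topic first receives the batch push, then the RTB overlay, so streaming writes override batch records with the same key. Topic names and the dict-based "topics" are illustrative stand-ins for Kafka:

```python
# Sketch of the RTB replay: batch data lands directly in store-version
# topics; streaming writes go to a real-time buffer that is replayed
# into every live store version.

version_topics = {"store_v7": {}, "store_v8": {}}
rtb = []  # real-time buffer, written by the stream processor

def batch_push(version, records):
    version_topics[version].update(records)   # Hadoop push job

def stream_write(key, value):
    rtb.append((key, value))                  # Samza write

def replay_rtb():
    for version in version_topics:            # replay into all versions
        for key, value in rtb:
            version_topics[version][key] = value  # stream wins on key

batch_push("store_v8", {"m1": "batch", "m2": "batch"})
stream_write("m1", "fresh")
replay_rtb()
print(version_topics["store_v8"]["m1"])  # → fresh
```

Replaying into every live version is what keeps a freshly pushed version up to date before the router swaps to it.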
Venice Hybrid Mode
[Diagram: Samza writes into the RTB topic while the Hadoop push job writes store-v8; the RTB is replayed into both store-v7 (currently serving reads via the router) and store-v8, so the new version is already up to date when the router swaps to it.]
Summary

Espresso
• Document store
• Online data
• Get/Put/Transactional
• Expansion and failover

Ambry
• Object store
• Online immutable data
• Get/Put/Large blob PUT
• Multi-master replication

Venice
• K-V store
• Derived data
• Get/Push
• Batch + streaming
Learn more: engineering.linkedin.com/blog
Backup Slides
Online Data
• Member Profile Update
• Post to a Group
• Social Gestures (Comment/Like/Share)
Nearline Data
• Standardization
• Search Index Update
• Network Update Stream
Offline Data
• People You May Know
• Who Viewed My Profile
• Jobs You May Be Interested In
Why Espresso

Oracle
• Structured data schema
• Strong consistency support
• Difficult/expensive to run at Internet scale

Voldemort
• Simpler data model (K-V)
• Write availability
• Eventual consistency
• Scales well and cheaply
Espresso – Cross-DC Replication
[Diagram: in each datacenter, clients write through routers to storage nodes; a Kafka cluster and data replicator in each datacenter ship changes to the other via cross-DC replication.]
Cross-DC Replication
• Boomerang elimination
• Conflict resolution: last write wins
• Unique id generation: user-selectable options
• Data consistency checker
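Last-write-wins can be sketched as a deterministic merge over timestamped writes. The (timestamp, origin) tie-break is a common LWW scheme assumed here for illustration; the slide does not specify Espresso's exact format:

```python
# Last-write-wins sketch for cross-DC conflict resolution: each write
# carries a timestamp plus an origin id used to break ties, so both
# datacenters converge on the same winner.

def lww_merge(local, remote):
    """Keep whichever write has the larger (timestamp, origin) pair."""
    return max(local, remote, key=lambda w: (w["ts"], w["origin"]))

a = {"value": "from-dc1", "ts": 100, "origin": 1}
b = {"value": "from-dc2", "ts": 100, "origin": 2}
print(lww_merge(a, b)["value"])  # → from-dc2 (tie broken by origin)
```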
Ambry – Data Distribution
[Diagram: partitions P1–P3 are each replicated across Node1, Node2, and Node3.]

Partition  Status
1          Read-write
2          Read-write
3          Read-write
Ambry – Cluster Expansion
[Diagram: Node4–Node6 join, each replicating the new partitions P4–P6; full partitions P1 and P2 are sealed read-only while new writes go to read-write partitions.]

Partition  Status
1          Read-only
2          Read-only
3          Read-write
4          Read-write
5          Read-write
6          Read-write
Ambry – Storage Layout
[Diagram: a 400GB append-only log holds blobs 50, 30, 70, 40 at offsets 640, 700, 770, 850; the log end offset is 900. The index is split into segments (index segment1, segment2, segment3); the current segment covers start offset 700 to end offset 900 and maps blob ids to offsets, sorted by blob id:]

blob id  offset  TTL
id 30    700     ∞
id 40    850     1/1/16
id 70    770     ∞
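A lookup in a sorted index segment can be sketched with a binary search. The values match the example table above; the segment layout is a simplified illustration:

```python
# Sketch of an index-segment lookup: entries are sorted by blob id, so
# finding a blob's log offset within a segment is a binary search.
import bisect

segment = {
    "start": 700, "end": 900,
    "ids":     ["id30", "id40", "id70"],   # sorted by blob id
    "offsets": [700,    850,    770],      # parallel log offsets
}

def find_offset(seg, blob_id):
    """Binary-search a segment; return the log offset or None."""
    i = bisect.bisect_left(seg["ids"], blob_id)
    if i < len(seg["ids"]) and seg["ids"][i] == blob_id:
        return seg["offsets"][i]
    return None

print(find_offset(segment, "id70"))  # → 770
```

A full lookup would scan segments newest-first, checking each segment's Bloom filter before searching it.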
Storage Optimization
• O(1) disk I/O for writes
• Bloom filter for index segments
• Rely on OS page cache
• Zero copy for gets
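A tiny Bloom-filter sketch shows why a negative lookup can skip an index segment without touching disk. The sizing and double-hashing scheme are illustrative assumptions, not Ambry's implementation:

```python
# Minimal Bloom filter: set k bit positions per key; if any position is
# unset on lookup, the key is definitely absent (no false negatives,
# occasional false positives).

class Bloom:
    def __init__(self, bits=256, hashes=3):
        self.bits, self.hashes = [0] * bits, hashes

    def _positions(self, key):
        h1, h2 = hash(key), hash(key + "salt")   # double hashing
        return [(h1 + i * h2) % len(self.bits)
                for i in range(self.hashes)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        return all(self.bits[p] for p in self._positions(key))

bf = Bloom()
bf.add("id30")
bf.add("id70")
print(bf.might_contain("id30"))  # → True
print(bf.might_contain("zz99"))  # almost certainly False: skip segment
```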
Venice Read/Write API
• Derived-data K-V store: single Get, batch Get
• High-throughput ingestion from Hadoop, Samza, or both (hybrid)
Venice Scale
• Large scale: multi-datacenter, multi-cluster
• Run “as a service”: self-service onboarding, each cluster is multi-tenant, resource isolation
Venice Tradeoffs
All writes go through Kafka:
• Scalable
• Burst tolerant
• Asynchronous
• No native “read your writes” semantics
Global Replication
[Diagram: a parent controller coordinates per-datacenter controllers; the Hadoop push job writes once, and Kafka mirror makers replicate the data across datacenter boundaries to the storage nodes in each datacenter.]