Upload
others
View
7
Download
0
Embed Size (px)
Citation preview
The Data Driven Network
Kapil Surlaker Director of Engineering
Real-Time Data at LinkedIn
Kapil Surlaker and Shirshanka Das XLDB 2016
2
What does real-time data mean?
Real-time Ingestion
Stream Processing
Real-time Serving
Batch Processing
Real-time Data: Use Cases
In-product analytics Reporting Features for ML models (impression discounting) Standardization Monitoring, Alerting
Member-Facing
Business-Facing
Data
corp.linkedin.com
Transformlinkedin.com
Every pipeline looks like this
Ingest Process ServeCreate[ ]
The Unified Metrics PlatformHow Single source of truth for metrics Centralized multi-tenant pipeline Easy on boarding
Why Disjointed efforts, unreliable systems Fragmented data pipelines across online and offline Unpredictable SLA across all systems Diminished trust in any dashboard
WorkflowMetric
Definition
Sandbox
Code Repository
Metric Owner
System Jobs
Build
Core Metrics Job
Central Team, Relevant
Stakeholders
1. iterate
2. create 3. review
4. check in
Metric Definition
Name Description TagsOwners
Dataset
Dimensions
TimeScript
Metrics
Entity Ids
Tier
Formulas
Entity Dimensions
Input Datasets
Time windows
UMP Data FlowUmp Monitor
Primary Data
(tracking, databases, external)
UMP Raw Data
UMP Aggregated
Data Relevance
Experiment analysis
Ad-hoc
Metrics Script
Data Prep agg cube
dimension verify
HDFS + Pinot
Dashboards
…
Ingest Process ServeCreate
Espresso Kafka Databus
Real-time data: DatabasesEspresso - key-document storage - secondary indexes - transactional updates
to related data - real-time change-
stream
Apps
Espresso DB
Databus
Downstream
Scale - XXX TB un-replicated - 1M+ qps, 0.1M+ wps - XXXX machines
Real-time Data: Tracking
Kafka (Tracking)Client-side Tracking
Tracking Frontend
Services
Downstream
Kafka - high-volume pub-sub
- tunable guarantees
- Scale - ~1.5T messages/day
- ~17M messages/sec
- XXX machines
What about our acquisitions?
Slideshare
Bizo
Lynda
…
Where does this data live?
RESTSFTPJDBC
What about other integrations?
• Salesforce • Google
Doubleclick • Responsys • Eloqua • …
Where does this data live?
RESTSFTP
JDBC
Ingest Process ServeCreate
Source Diversity
Batch +
StreamingData
Quality
Requirements
Gobblin Architecture
17
Source
Work Unit
Work Unit
Work Unit
Extract
Extract
Extract
Convert
Convert
Convert
Quality
Quality
Quality
Write
Write
Write
Data Publish
Task
Task
Task
Solving for real-timeInefficiencies in batch
YARN based
Apache Helix
Continuous
Auto-scaling
YARN
Helix
Executor 1
Executor 2
Executor 3
HDFS
Stream Source
Current ActivityOpen source @ github.com/linkedin/gobblin
Adopted by LinkedIn, Intel, Swisscom, NerdWallet,…
@LinkedIn ~100 TB per day @ LinkedIn
Hundreds of datasets
~20 different sources
Under Development Metadata-driven
Hadoop-Hadoop copies
Ingest Process ServeCreate
Hadoop Spark Samza
Transformation engines @ LinkedIn
@ LinkedInUse-cases
- Machine Learning
- Photon ML (Machine learning library on Spark)
- Much faster iteration times
- Larger feature sets
By the Numbers
- ~XXX machines
- ~XXXX unique flows
Improvement
Feeds relevance 2 h to 14 m (9 x)
Jobs relevance 32 m to 1.3 m (24 x)
Ads relevance 24 h to 45 m (32 x)
Communication relevance 18 h to 30 m (36 x) with 10 x more features
@ LinkedInUse-cases
- Standardization
- Email optimization
- Site-speed
By the Numbers
- ~XXX machines
- ~XXX jobs
What
- Stream processing framework
- Apache YARN for resource allocation
- Apache Kafka for stream storage
- Local state optimizations using RocksDB
There’s more out there…
Heron
2.0
Streams
Ingest Process ServeCreate
P not
Real-time. Interactive.
Slice and Dice metrics
Precompute!
Device Geo View
Android US 1
Android IN 1
iOS US 1
Dimension View
Android 2
iOS 1
US 2
IN 1
Android,US 1
iOS,US 1
Android,IN 1
More dimensions!Device Geo Carrier View
Android US ATT 1
Android IN Reliance 1
iOS US Verizon 1
Dimension View
Android 2
iOS 1
US 2
IN 1
ATT 1
Reliance 1
Verizon 1
Android,US 1
... ...
ChallengesHorizontally scalable Low latency Data freshness Fault tolerance OLAP features
SQL-like interface
(minus joins)
Sub-second query latency
Data load from Hadoop
and Kafka
Capabilities
Pinot Data Flow
Kafka Hadoop
Samza Process
Pinot
minuteshour +
Pinot@LinkedIn
Site-‐facing Apps Reporting dashboards Monitoring
In production since 2012 Open sourced in 2015 @ github.com/linkedin/pinot
(S)QL: Filters and AggsSELECT count(*) FROM companyFollowHistoricalEvents WHERE entityId = 121011 AND 'day' >= 15949 AND 'day' <= 15963 AND paid = 'y’ AND action = 'stop'
(S)QL: Group BySELECT count(*) FROM companyFollowHistoricalEvents WHERE entityId = 121011 AND 'day' >= 15949 AND 'day' <= 15963 AND paid = 'y’ GROUP BY action
(S)QL: ORDER BY and LIMITSELECT * FROM companyFollowHistoricalEvents WHERE entityId = 121011 AND entityId = 1000 AND action = 'start' ORDER BY creationTime DESC LIMIT 1
Broker Helix
Real time Historical
Kafka Hadoop
Pinot Architecture
Queries
Raw Data Samza
Multi-tenantDeclarative specification Role-specific assignment Seamless cluster expansion Powered by Apache Helix
“numCopies”: 2 }
{ “resourceName”: “MyStore”,“numDataNodes”: 4,“numBrokers”: 2,
Scale Largest cluster: xxx nodes # of stores: xxx
Columnar Storage
Forward Index
Fast but needs a ton of RAM
Single-node OptimizationsDisk-based structures Indexes: inverted, bitmap, hybrid File optimization Compression: dictionary, p4delta Multi-valued columns, skip lists Vectorization: ~7x latency improvement Query rewrite Sketching algorithms: HyperLogLog
To pre-compute or not?
Data aware pre-computation
Speeding up the cycle
Form hypothesis Query Repeat
Form hypothesis Query Repeat
OR …
Breaking the cycle
Ingest Process ServeCreate
Hadoop Spark Samza
PinotEspresso Kafka Databus
Real-Time Data at LinkedIn
Kapil Surlaker @kapilsurlaker
49
Shirshanka Das @shirshanka
Thanks!