DESCRIPTION
Speakers: Jonathan Wu (LinkedIn), Praveen Neppalli Naga (LinkedIn), Chi-Yi Kuan (LinkedIn)
Category: Hadoop in Action
LinkedIn processes an enormous number of events each day. This data is critically important to the data analysts, engineers, business experts, and data scientists who seek a deep understanding of the interactions within LinkedIn’s professional social graph. They use it to derive insights and performance metrics, which lead to better business decisions on products, marketing, sales, and other functional areas. Areas of interest include Email, Growth, Engagement, and Trending metrics.
Internal tools have traditionally been developed for a specific need, optimized for a single business use case, and not interoperable. The engineering challenge is to let business users easily access and organize huge amounts of data in a comprehensive way, and to get flexibly and quickly to the insights they need through graphs and charts. The data needs to be sufficiently granular to serve different needs, the interface needs to be intuitive and simple, and the infrastructure needs to be high-performance so that users can manipulate large amounts of data quickly.
The LinkedIn Business Analytics and Data Analytics Infrastructure teams met this challenge with an integrated stack that includes an interactive analytics infrastructure and a self-serve data visualization front end. The user interface lets users build charts, tables, and queries to suit highly customized reporting needs on any device. The back-end infrastructure is based on Hadoop, which leverages LinkedIn’s investment in highly scalable, data-rich systems. The combined solution makes it possible to visualize, slice, dice, and drill through billions of records and hundreds of dimensions quickly and at scale.
In this talk, you will learn the background of the data challenges that LinkedIn faced, how the teams came together to construct the solution, and the underlying stack structure powering this solution.
Photo: Courtesy of O'Reilly Conference on Flickr
How LinkedIn Democratizes Big Data Visualization
Jonathan Wu
Praveen Neppalli Naga
Chi-Yi Kuan
313,000,000 Members
End of Q2 2014
25,000,000,000 Page Views
Q2 2014
3,000,000+ Endorsements
3,500,000+ Companies
What can we do with LinkedIn data?
Sales
Talent flow between companies
Product & engineering
Is it simple?
Member attributes, page view events data
Photo Credit: https://www.flickr.com/photos/johnjoh/1060267344
Data is the new vineyard
Data infra: collect & prepare data
Collect & Prepare Data: MySQL, Oracle, Kafka + Hadoop
Serve Data: Pinot
Taste Data: Easy-to-use visualization
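Data Computation
[Diagram: Hadoop data computation stack]
• Sources: Kafka, data stores
• ETL
• Storage: HDFS
• Resource management: YARN
• Execution engines: Map-Reduce, Spark, Tez
• Languages: Pig, Hive, Cubert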
Data infra: Serve data
Categories of interactive analytics products
Products for members/customers with real-time interactive analytics
• Who’s Viewed Your Profile
• Ads Reporting
• Jobs Analytics
Interactive business analytics for internal use
• How feature X is performing
Real-time business monitoring
• Page view changes across mobile devices in different regions
Requirements for real-time interactive analytics
• Slice and dice billions of records, hundreds of dimensions
• End-to-end freshness of minutes, not hours
• Sub-second query response times
e.g. Which are the top regions that contribute to my profile views? Which industries in those regions?
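The example question above is a slice (top regions) followed by a drill-down (industries within one region). A minimal sketch of that pattern on a handful of in-memory rows; the event data, the column names, and the `top_values` helper are all hypothetical and say nothing about how Pinot executes such a query:

```python
from collections import Counter

# Hypothetical profile-view events: (viewer_region, viewer_industry).
views = [
    ("Bay Area", "Internet"),
    ("Bay Area", "Software"),
    ("Bay Area", "Internet"),
    ("New York", "Finance"),
    ("New York", "Internet"),
    ("London", "Finance"),
]

# Position of each dimension inside an event tuple (illustrative schema).
COLUMNS = {"region": 0, "industry": 1}


def top_values(events, dimension, filters=None, limit=3):
    """Count events per value of `dimension` after applying equality filters."""
    filters = filters or {}
    counts = Counter(
        event[COLUMNS[dimension]]
        for event in events
        if all(event[COLUMNS[name]] == value for name, value in filters.items())
    )
    return counts.most_common(limit)


# Slice: which regions contribute most views?
top_regions = top_values(views, "region")
# Drill down: which industries inside the leading region?
top_industries = top_values(
    views, "industry", filters={"region": top_regions[0][0]}
)
print(top_regions)
print(top_industries)
```

At billions of rows a linear scan like this cannot meet a sub-second target, which is the motivation for the indexed, distributed design described next.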
What is Pinot?
Pinot is the distributed analytics infrastructure that serves interactive analytics products at LinkedIn.
• Data indexes: compressed columnar indexes (supports mmap and in-memory)
• Distributed system: Apache Helix for cluster management
• Ingestion: Apache Kafka (for near real-time) and Hadoop
Data Indexes
Single Value Index
• Fixed bit length encoding
• Sorted Index
• Secondary Sorted Index
Multi Value Index
• Multi-value fixed bit length encoding
• BitMap Multi-value Index
Inverted Index
• P4Delta
• Modified P4Delta
• BitMap
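Two of the index types listed above, fixed bit length encoding and a bitmap inverted index, can be illustrated on a single column. This is a toy sketch under stated assumptions: the function names are hypothetical, Python integers stand in for packed bit arrays and bitmaps, and nothing here reflects Pinot's actual storage format. P4Delta is not covered.

```python
def build_dictionary(values):
    """Map each distinct value to a small integer id (sorted for determinism)."""
    return {value: i for i, value in enumerate(sorted(set(values)))}


def pack_fixed_bits(ids, bits):
    """Forward index: store every id in exactly `bits` bits of one integer."""
    packed = 0
    for position, value_id in enumerate(ids):
        packed |= value_id << (position * bits)
    return packed


def read_fixed_bits(packed, bits, position):
    """Read back the id stored for row `position`."""
    return (packed >> (position * bits)) & ((1 << bits) - 1)


def build_bitmap_index(ids):
    """Inverted index: id -> bitmap whose bit r is set when row r holds that id."""
    bitmaps = {}
    for row, value_id in enumerate(ids):
        bitmaps[value_id] = bitmaps.get(value_id, 0) | (1 << row)
    return bitmaps


column = ["US", "IN", "US", "UK", "IN", "US"]
dictionary = build_dictionary(column)              # IN -> 0, UK -> 1, US -> 2
ids = [dictionary[value] for value in column]
bits = max(1, (len(dictionary) - 1).bit_length())  # bits needed for the largest id
packed = pack_fixed_bits(ids, bits)
bitmaps = build_bitmap_index(ids)

# Filter "country = US" by reading one bitmap instead of scanning the column.
us_bitmap = bitmaps[dictionary["US"]]
us_rows = [row for row in range(len(column)) if (us_bitmap >> row) & 1]
print(us_rows)
```

With three distinct values each row costs 2 bits in the forward index, and an equality filter touches a single bitmap. Compressed posting formats such as P4Delta address the case where bitmaps become large or sparse.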
Cluster Management
• Create Resources
• Update resource metadata
• Expand/Contract partitions dynamically
• Query Router
Data Ingestion
Kafka for Realtime
Hadoop for Historical
High Level Architecture
[Diagram: PINOT]
• Data sources: Hadoop (historical), Kafka (real-time)
• Cluster manager: Controller, Helix, Zookeeper
• Brokers: Broker 1, Broker 2
• Servers: Server 1, Server 2, Server 3
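The broker and server tiers in the diagram suggest a scatter-gather query path: a broker sends the query to each server holding data and merges the partial results. The slides do not spell this out, so the sketch below is an assumption, and the `Server` and `Broker` classes are hypothetical stand-ins rather than Pinot components:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


class Server:
    """Holds a subset of the rows and answers a count-by-dimension query locally."""

    def __init__(self, rows):
        self.rows = rows

    def count_by(self, dimension):
        return Counter(row[dimension] for row in self.rows)


class Broker:
    """Fans a query out to every server and merges the partial counts."""

    def __init__(self, servers):
        self.servers = servers

    def count_by(self, dimension):
        # Scatter: query all servers concurrently.
        with ThreadPoolExecutor(max_workers=len(self.servers)) as pool:
            partials = list(
                pool.map(lambda server: server.count_by(dimension), self.servers)
            )
        # Gather: sum the per-server counts into one result.
        total = Counter()
        for partial in partials:
            total.update(partial)
        return total


servers = [
    Server([{"device": "ios"}, {"device": "android"}]),
    Server([{"device": "ios"}]),
    Server([{"device": "web"}, {"device": "ios"}]),
]
broker = Broker(servers)
result = broker.count_by("device")
print(result)
```

Counts merge by simple addition, which is why this kind of aggregate distributes cleanly across servers.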
Core Features
Low latency and high QPS OLAP queries with real-time ingestion
Support complex dimensions
Operational simplicity
Data bootstrapping & reconciliation
Usage @ LinkedIn
About 18 member-facing products on LinkedIn.com
Internal reporting
Open source … coming soon
Reporting UI: serve & taste data
Business need
“I want to access big data without running SQL”
Start a new dashboard with one click
Select what metrics/dimensions you want
Charts are rendered in just a few seconds
Zoom into a single chart
Filter on various dimensions
Access everywhere
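One way a self-serve front end can spare users from writing queries is to turn their metric, dimension, and filter selections into a query on their behalf. A minimal sketch, assuming a generic SQL-like target; `ChartRequest`, `to_query`, and the table and column names are hypothetical and do not describe LinkedIn's tool or Pinot's query language:

```python
from dataclasses import dataclass, field


@dataclass
class ChartRequest:
    """What a user picked in the dashboard UI (illustrative fields)."""

    table: str
    metrics: list
    dimensions: list
    filters: dict = field(default_factory=dict)


def to_query(request):
    """Build a SQL-like aggregation query from the user's selections.

    Values are interpolated directly to keep the sketch short; a real tool
    would validate names and bind filter values as parameters.
    """
    select = request.dimensions + [f"SUM({m}) AS {m}" for m in request.metrics]
    query = f"SELECT {', '.join(select)} FROM {request.table}"
    if request.filters:
        conditions = [
            f"{name} = '{value}'" for name, value in sorted(request.filters.items())
        ]
        query += " WHERE " + " AND ".join(conditions)
    if request.dimensions:
        query += " GROUP BY " + ", ".join(request.dimensions)
    return query


request = ChartRequest(
    table="page_views",
    metrics=["views"],
    dimensions=["region"],
    filters={"device": "mobile"},
)
query = to_query(request)
print(query)
```

Adding a filter or zooming into one chart then amounts to editing the request object and regenerating the query, with no SQL exposed to the user.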
Enterprise analytics portal
Portal that connects dashboards, internal reports, and internal wiki pages
Summary
• Scale of the data
• Pinot for interactive analysis
• Self-service visualization for insights
We are hiring
Jonathan Wu www.linkedin.com/in/jiyewu
Praveen Neppalli Naga www.linkedin.com/in/pneppalli
Chi-Yi Kuan www.linkedin.com/in/chiyikuan
650-605-2184
650-962-3299
650-426-6301