
How LinkedIn Democratizes Big Data Visualization

DESCRIPTION

Speakers: Jonathan Wu (LinkedIn), Praveen Neppalli Naga (LinkedIn), Chi-Yi Kuan (LinkedIn)

Category: Hadoop in Action

LinkedIn processes an enormous number of events each day. This data is of critical importance to the data analysts, engineers, business experts, and data scientists who seek a deep understanding of the interactions within LinkedIn’s professional social graph. They use this data to derive insights and performance metrics, which lead to better business decisions on products, marketing, sales, and other functional areas. Areas of interest include Email, Growth, Engagement, and Trending metrics.

Development of internal tools has traditionally been driven by specific needs and optimized for a single business use case, which left the tools non-interoperable. The engineering challenge is to let business users easily access and organize huge amounts of data in a comprehensive way, and to get flexibly and quickly to the insights they need through graphs and charts. The data needs to be sufficiently granular to serve different needs, the interface needs to be intuitive and simple, and the infrastructure needs to be high-performance so that users can manipulate large amounts of data quickly.

The LinkedIn Business Analytics and Data Analytics Infrastructure teams solved this challenge with an integrated stack that includes an interactive analytics infrastructure and a self-serve data visualization front end. The user interface lets users build charts, tables, and queries to suit highly customized reporting needs on any device. The back-end infrastructure is based on Hadoop, which leverages LinkedIn’s investment in highly scalable, data-rich systems. The combined solution makes it possible to visualize, slice, dice, and drill through billions of records and hundreds of dimensions at high speed.

In this talk, you will learn the background of the data challenges that LinkedIn faced, how the teams came together to construct the solution, and the underlying stack structure powering this solution.

Citation preview

Page 1: How LinkedIn Democratizes Big Data Visualization

Photo: Courtesy of O'Reilly Conference on Flickr

How LinkedIn Democratizes

Big Data Visualization

Page 2: How LinkedIn Democratizes Big Data Visualization

How LinkedIn Democratizes

Big Data Visualization

Jonathan Wu

Praveen Neppalli Naga

Chi-Yi Kuan

Page 3: How LinkedIn Democratizes Big Data Visualization

313,000,000 Members

End of Q2 2014

Page 4: How LinkedIn Democratizes Big Data Visualization

25,000,000,000 Page Views

Q2 2014

Page 5: How LinkedIn Democratizes Big Data Visualization

3,000,000+ Endorsements

Page 6: How LinkedIn Democratizes Big Data Visualization

3,500,000+ Companies

Page 7: How LinkedIn Democratizes Big Data Visualization

What can we do with LinkedIn data?

Page 8: How LinkedIn Democratizes Big Data Visualization

Sales

Talent flow between companies

Page 9: How LinkedIn Democratizes Big Data Visualization

Product & engineering

Page 10: How LinkedIn Democratizes Big Data Visualization

Is it simple?

Member attributes Page View events data

Page 11: How LinkedIn Democratizes Big Data Visualization

Photo Credit: https://www.flickr.com/photos/johnjoh/1060267344

Data is the new vineyard

Page 12: How LinkedIn Democratizes Big Data Visualization

Photo Credit: https://www.flickr.com/photos/johnjoh/1060267344

Data is the new vineyard

Page 13: How LinkedIn Democratizes Big Data Visualization

Data infra: collect & prepare data

Collect & Prepare Data: MySQL, Oracle, Kafka + Hadoop

Serve Data: Pinot

Taste Data: Easy-to-use visualization

Page 14: How LinkedIn Democratizes Big Data Visualization

Data Computation

[Diagram: Hadoop data computation stack]

ETL

Pig, Hive, Cubert

Map-Reduce, Spark, Tez

YARN

HDFS

Kafka, Data Stores

Page 15: How LinkedIn Democratizes Big Data Visualization

Data infra: Serve data

Collect & Prepare Data: Kafka + Hadoop

Serve Data: Pinot

Taste Data: Easy-to-use visualization

Page 16: How LinkedIn Democratizes Big Data Visualization

Categories of interactive analytics products

Products for members/customers with real-time interactive analytics
• Who’s Viewed Your Profile
• Ads Reporting
• Jobs Analytics

Interactive business analytics for internal use
• How feature X is performing

Real-time business monitoring
• Page view changes across mobile devices in different regions

Page 17: How LinkedIn Democratizes Big Data Visualization

Requirements for real-time interactive analytics

Slice and dice billions of records, hundreds of dimensions

End-to-end freshness of minutes, not hours

Sub-second query response times

e.g. Which are the top regions that contribute to my profile views? Which industries in those regions?
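
The example question above (top regions, then industries within those regions) can be sketched as a filtered group-by. This is a minimal in-memory illustration only, not how Pinot executes queries; the event records, field names, and the `top_values` helper are hypothetical.

```python
from collections import Counter

# Hypothetical profile-view events; field names and values are illustrative only.
events = [
    {"region": "SF Bay Area", "industry": "Internet"},
    {"region": "SF Bay Area", "industry": "Software"},
    {"region": "SF Bay Area", "industry": "Internet"},
    {"region": "New York", "industry": "Finance"},
    {"region": "New York", "industry": "Internet"},
    {"region": "London", "industry": "Finance"},
]


def top_values(records, dimension, n, filters=None):
    """Count records per value of `dimension`, keeping only records that match `filters`."""
    filters = filters or {}
    counts = Counter(
        r[dimension]
        for r in records
        if all(r[k] == v for k, v in filters.items())
    )
    return counts.most_common(n)


# Slice: which regions contribute the most profile views?
top_regions = top_values(events, "region", n=2)

# Dice: within the top region, which industries contribute the most?
best_region = top_regions[0][0]
industries = top_values(events, "industry", n=2, filters={"region": best_region})
```

Each drill-down step is the same aggregation with one more filter, which is why sub-second response times matter for interactive use: a user issues many such queries in quick succession.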

Page 18: How LinkedIn Democratizes Big Data Visualization

Pinot

Distributed analytics infrastructure that serves interactive analytics products at LinkedIn.

Page 19: How LinkedIn Democratizes Big Data Visualization

What is Pinot?

Data Indexes: Compressed columnar indexes (supports mmap and in-memory)

Distributed System: Apache Helix for cluster management

Ingestion: Apache Kafka (for near real-time) and Hadoop

Page 20: How LinkedIn Democratizes Big Data Visualization

Data Indexes

Single Value Index
• Fixed bit length encoding
• Sorted Index
• Secondary Sorted Index

Multi Value Index
• Multi-value fixed bit length encoding
• BitMap Multi-value Index

Inverted Index
• P4Delta
• Modified P4Delta
• BitMap
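
Two of the index types listed above, fixed bit length encoding and a bitmap inverted index, can be sketched in a few lines. This is a toy illustration under stated assumptions, not Pinot's on-disk format: the dictionary step, the use of Python integers as bitmaps, and every function name here are hypothetical.

```python
def build_dictionary(values):
    """Map each distinct value to a small integer id (sorted for determinism)."""
    distinct = sorted(set(values))
    return {v: i for i, v in enumerate(distinct)}, distinct


def bits_needed(cardinality):
    """Number of bits required to represent ids 0..cardinality-1."""
    return max(1, (cardinality - 1).bit_length())


def pack_fixed(ids, width):
    """Pack ids into one big integer, `width` bits per row."""
    packed = 0
    for pos, value_id in enumerate(ids):
        packed |= value_id << (pos * width)
    return packed


def read_fixed(packed, width, pos):
    """Read the id stored at row `pos`."""
    return (packed >> (pos * width)) & ((1 << width) - 1)


def build_bitmap_index(ids, cardinality):
    """For each id, build a bitmap (Python int) with bit `row` set where the id occurs."""
    bitmaps = [0] * cardinality
    for row, value_id in enumerate(ids):
        bitmaps[value_id] |= 1 << row
    return bitmaps


def rows_of(bitmap):
    """List the row numbers whose bits are set."""
    return [row for row in range(bitmap.bit_length()) if (bitmap >> row) & 1]


# Hypothetical single-value column with five rows.
column = ["US", "IN", "US", "UK", "IN"]
to_id, distinct = build_dictionary(column)
ids = [to_id[v] for v in column]

width = bits_needed(len(distinct))   # 3 distinct values fit in 2 bits per row
packed = pack_fixed(ids, width)

bitmaps = build_bitmap_index(ids, len(distinct))
# A filter such as "country is US or UK" becomes a bitwise OR of two bitmaps.
us_or_uk = bitmaps[to_id["US"]] | bitmaps[to_id["UK"]]
matching_rows = rows_of(us_or_uk)
```

Fixed-width ids keep the forward index compact and give constant-time access to any row, while bitmaps turn filters into cheap bitwise operations.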

Page 21: How LinkedIn Democratizes Big Data Visualization

Cluster Management

• Create Resources

• Update Resource metadata

• Expand/Contract partitions dynamically

• Query Router

Page 22: How LinkedIn Democratizes Big Data Visualization

Data Ingestion

Kafka for Realtime

Hadoop for Historical
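
One way to combine the two ingestion paths at query time is to pick a time boundary and read each period from only one source. The sketch below assumes that approach for illustration; the slides do not describe how Pinot reconciles the two paths, and the row layout and `merged_count` helper are hypothetical.

```python
def merged_count(historical, realtime, boundary):
    """Sum counts, taking rows up to `boundary` from historical and later rows from realtime."""
    total = sum(count for ts, count in historical if ts <= boundary)
    total += sum(count for ts, count in realtime if ts > boundary)
    return total


# Hypothetical (hour, page_view_count) rows.
historical = [(1, 100), (2, 120), (3, 90)]   # loaded from Hadoop in batch
realtime = [(3, 88), (4, 40), (5, 15)]       # consumed from Kafka; hour 3 overlaps

# Assumption: the boundary is the newest hour the batch data covers.
boundary = max(ts for ts, _ in historical)
total = merged_count(historical, realtime, boundary)
```

Hour 3 appears in both sources but is counted only once, from the batch side, so overlapping data does not inflate the result.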

Page 23: How LinkedIn Democratizes Big Data Visualization

High Level Architecture

[Diagram: Pinot high-level architecture]

Data sources: Hadoop (historical), Kafka (realtime)

Cluster manager: Controller, Helix, Zookeeper

Brokers: Broker 1, Broker 2

Servers: Server 1, Server 2, Server 3
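
The broker and server tiers in the diagram suggest a scatter-gather pattern: a broker sends the query to the servers holding the data and merges their partial results. The sketch below assumes that pattern; the `Server` and `Broker` classes and the sample rows are hypothetical, and a real system would fan out over the network in parallel rather than in a local loop.

```python
from collections import Counter


class Server:
    """Holds a subset of the rows and answers partial aggregations."""

    def __init__(self, rows):
        self.rows = rows

    def count_by(self, dimension):
        return Counter(row[dimension] for row in self.rows)


class Broker:
    """Fans a query out to every server and merges the partial results."""

    def __init__(self, servers):
        self.servers = servers

    def count_by(self, dimension, top_n):
        merged = Counter()
        for server in self.servers:
            # Counter.update adds counts instead of overwriting them.
            merged.update(server.count_by(dimension))
        return merged.most_common(top_n)


# Hypothetical page-view rows spread over three servers.
servers = [
    Server([{"device": "ios"}, {"device": "android"}]),
    Server([{"device": "ios"}, {"device": "ios"}]),
    Server([{"device": "android"}, {"device": "web"}]),
]
broker = Broker(servers)
result = broker.count_by("device", top_n=2)
```

Because counts are additive, each server can aggregate locally and the broker only merges small partial results instead of raw rows.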

Page 24: How LinkedIn Democratizes Big Data Visualization

Core Features

Low latency and high QPS OLAP queries with real-time ingestion

Support complex dimensions

Operational simplicity

Data bootstrapping & reconciliation

Page 25: How LinkedIn Democratizes Big Data Visualization

Usage @ LinkedIn

About 18 member-facing products on LinkedIn.com

Internal reporting

Open source... coming soon

Page 26: How LinkedIn Democratizes Big Data Visualization

Reporting UI: serve & taste data

Collect & Prepare Data: Kafka + Hadoop

Serve Data: Pinot

Taste Data: Easy-to-use visualization

Page 27: How LinkedIn Democratizes Big Data Visualization

Business need

I want to access big data without running SQL

Page 28: How LinkedIn Democratizes Big Data Visualization

Start a new dashboard with one click

Page 29: How LinkedIn Democratizes Big Data Visualization

Select what metrics/dimensions you want

Page 30: How LinkedIn Democratizes Big Data Visualization

Charts are rendered in just a few seconds

Page 31: How LinkedIn Democratizes Big Data Visualization

Zoom into a single chart

Page 32: How LinkedIn Democratizes Big Data Visualization

Filter on various dimensions
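
A self-serve front end of this kind has to turn the user's selection of metrics, dimensions, and filters into a query on their behalf. The sketch below shows one way that translation might look; the `build_query` helper, the table and column names, and the generic SQL-like output are all hypothetical and are not the actual reporting UI or Pinot's query language.

```python
def build_query(table, metrics, dimensions, filters=None):
    """Render a chart selection as a SQL-like aggregation query string.

    Illustration only: values are interpolated directly, so this must not be
    used with untrusted input.
    """
    select = dimensions + [f"SUM({m}) AS {m}" for m in metrics]
    query = f"SELECT {', '.join(select)} FROM {table}"
    if filters:
        # Sort filters so the generated text is deterministic.
        conditions = [f"{col} = '{val}'" for col, val in sorted(filters.items())]
        query += " WHERE " + " AND ".join(conditions)
    if dimensions:
        query += " GROUP BY " + ", ".join(dimensions)
    return query


# Hypothetical chart: page views by region, filtered to mobile devices.
query = build_query(
    table="page_views",
    metrics=["views"],
    dimensions=["region"],
    filters={"device": "mobile"},
)
```

The user only picks items from lists, and the tool owns the query text, which is what lets business users reach the data without writing SQL themselves.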

Page 33: How LinkedIn Democratizes Big Data Visualization

Access everywhere

Page 34: How LinkedIn Democratizes Big Data Visualization

Enterprise analytics portal

Portal that connects dashboards, internal reports, and internal Wiki pages

Page 35: How LinkedIn Democratizes Big Data Visualization

Summary

Scale of the data

Pinot for interactive analysis

Self-service visualization for insights

Page 36: How LinkedIn Democratizes Big Data Visualization
Page 37: How LinkedIn Democratizes Big Data Visualization

We are hiring

Jonathan Wu, [email protected], www.linkedin.com/in/jiyewu

Praveen Neppalli Naga, [email protected], www.linkedin.com/in/pneppalli

Chi-Yi Kuan, [email protected], www.linkedin.com/in/chiyikuan

650-605-2184, 650-962-3299, 650-426-6301