How LinkedIn Democratizes Big Data Visualization


DESCRIPTION

Speakers: Jonathan Wu (LinkedIn), Praveen Neppalli Naga (LinkedIn), Chi-Yi Kuan (LinkedIn)

Category: Hadoop in Action

LinkedIn processes an enormous number of events each day. This data is critically important to the data analysts, engineers, business experts, and data scientists who seek a deep understanding of the interactions within LinkedIn’s professional social graph. They use it to derive insights and performance metrics, which lead to better business decisions on products, marketing, sales, and other functional areas. Areas of interest include Email, Growth, Engagement, and Trending metrics.

Internal tools have traditionally been developed for a specific need, optimized for one business use case, and not interoperable. The engineering challenge is to let business users easily access and organize huge amounts of data in a comprehensive way, and to get flexibly and quickly to the insights they need through graphs and charts. The data needs to be granular enough to serve different needs, the interface needs to be intuitive and simple, and the infrastructure needs to be high-performance so that users can manipulate large amounts of data quickly.

The LinkedIn Business Analytics and Data Analytics Infrastructure teams solved this challenge with an integrated stack that includes an interactive analytics infrastructure and a self-serve data visualization front end. The user interface lets users build charts, tables, and queries to suit highly customized reporting needs on any device. The back-end infrastructure is based on Hadoop, which leverages LinkedIn’s investment in highly scalable, data-rich systems. The combined solution makes it possible to visualize, slice, dice, and drill through billions of records and hundreds of dimensions quickly and at scale.
In this talk, you will learn the background of the data challenges that LinkedIn faced, how the teams came together to construct the solution, and the underlying stack structure powering this solution.


Photo: Courtesy of O'Reilly Conference on Flickr

How LinkedIn Democratizes Big Data Visualization

Jonathan Wu

Praveen Neppalli Naga

Chi-Yi Kuan

313,000,000 Members (end of Q2 2014)

25,000,000,000 Page Views (Q2 2014)

3,000,000+ Endorsements

3,500,000+ Companies

What can we do with LinkedIn data?

Sales

Talent flow between companies

Product & engineering

Is it simple?

Member attributes Page View events data

Photo Credit: https://www.flickr.com/photos/johnjoh/1060267344

Data is the new vineyard


Data infra: collect & prepare data

Collect & Prepare Data: MySQL, Oracle, Kafka + Hadoop

Serve Data: Pinot

Taste Data: Easy-to-use visualization

Data Computation (diagram)

Sources: Kafka, Data Stores

ETL into Hadoop

HDFS

YARN

Map-Reduce, Spark, Tez

Pig, Hive, Cubert

Data infra: Serve data

Collect & Prepare Data: Kafka + Hadoop

Serve Data: Pinot

Taste Data: Easy-to-use visualization

Categories of interactive analytics products

Products for members/customers with real-time interactive analytics

• Who’s Viewed Your Profile

• Ads Reporting

• Jobs Analytics

Interactive business analytics for internal use

• How feature X is performing

Real-time business monitoring

• Page view changes across mobile devices in different regions

Requirements for real-time interactive analytics

Slice and dice billions of records, hundreds of dimensions

End-to-end freshness of minutes, not hours

Sub-second query response times

e.g. Which are the top regions that contribute to my profile views? Which industries in those regions?
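
The example question above is a slice (top regions) followed by a drill-down (industries within one region). A minimal in-memory sketch of that query shape follows. The event records, field names, and the `top_values` helper are hypothetical and only illustrate the idea; this is not how Pinot stores or queries data.

```python
from collections import Counter

# Hypothetical profile-view events; the field names are illustrative only.
events = [
    {"region": "SF Bay Area", "industry": "Internet"},
    {"region": "SF Bay Area", "industry": "Internet"},
    {"region": "SF Bay Area", "industry": "Finance"},
    {"region": "New York", "industry": "Finance"},
    {"region": "New York", "industry": "Media"},
    {"region": "London", "industry": "Finance"},
]


def top_values(rows, dimension, limit=2, **filters):
    """Count rows per value of `dimension`, keeping only rows that match `filters`."""
    matching = (r for r in rows if all(r[k] == v for k, v in filters.items()))
    return Counter(r[dimension] for r in matching).most_common(limit)


# Slice: which regions contribute the most views?
top_regions = top_values(events, "region")

# Drill down: which industries within the leading region?
top_industries = top_values(events, "industry", region="SF Bay Area")
```

At billions of records a full scan like this would not meet a sub-second target, which is why the deck goes on to describe columnar indexes and a distributed serving layer.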

Pinot

Distributed Analytics Infrastructure that serves Interactive Analytics products at LinkedIn.

What is Pinot?

Data Indexes: Compressed columnar indexes (supports mmap and in-memory)

Distributed System: Apache Helix for cluster management

Ingestion: Apache Kafka (for near real-time) and Hadoop

Data Indexes

Single Value Index

• Fixed bit length encoding

• Sorted Index

• Secondary Sorted Index

Multi Value Index

• Multi-value Fixed bit length encoding

• BitMap Multi-value Index

Inverted Index

• P4Delta

• Modified P4Delta

• BitMap
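
Two of the ideas named above, fixed bit length encoding and a bitmap inverted index, can be sketched in a few lines. This is a simplified illustration, not Pinot's actual on-disk format: the function names are hypothetical, the column is packed into one Python integer, and each bitmap is an uncompressed integer (P4Delta compression is not shown).

```python
def build_column(values):
    """Dictionary-encode a single-value column and pack the ids at a fixed bit width."""
    dictionary = sorted(set(values))
    ids = {v: i for i, v in enumerate(dictionary)}
    # Every id uses the same number of bits, enough for the largest id.
    bits = max(1, (len(dictionary) - 1).bit_length())
    packed = 0
    for row, v in enumerate(values):
        packed |= ids[v] << (row * bits)
    return dictionary, bits, packed


def read_row(column, row):
    """Random access: shift to the row's slot and mask off `bits` bits."""
    dictionary, bits, packed = column
    return dictionary[(packed >> (row * bits)) & ((1 << bits) - 1)]


def build_bitmap_index(values):
    """Inverted index: one integer per distinct value, bit i set when row i holds it."""
    index = {}
    for row, v in enumerate(values):
        index[v] = index.get(v, 0) | (1 << row)
    return index


def rows_of(bitmap):
    """List the row numbers whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


countries = ["US", "IN", "US", "UK", "IN", "US"]
devices = ["ios", "web", "web", "ios", "ios", "web"]

column = build_column(countries)
by_country = build_bitmap_index(countries)
by_device = build_bitmap_index(devices)

# A filter on two dimensions becomes a bitwise AND of two bitmaps.
us_web_rows = rows_of(by_country["US"] & by_device["web"])
```

With three distinct countries each row needs two bits, and the combined filter touches only the bitmaps rather than scanning every row.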

Cluster Management

• Create Resources

• Update Resource metadata

• Expand/Contract partitions dynamically

• Query Router

Data Ingestion

Kafka for Realtime

Hadoop for Historical

High Level Architecture (diagram)

Data sources: Hadoop (historical), Kafka (realtime)

PINOT

CLUSTER MANAGER: Controller, Helix, Zookeeper

Brokers: Broker 1, Broker 2

Servers: Server 1, Server 2, Server 3
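
The broker and server layers in the diagram suggest a scatter-gather pattern: a broker sends the query to the servers and merges their partial results. The sketch below shows that pattern under stated assumptions. The `Server` and `Broker` classes are hypothetical, every server is queried (no partition-aware routing), and the Controller, Helix, and Zookeeper roles are left out. The slides do not describe Pinot's actual query protocol.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


class Server:
    """Holds a subset of the rows and answers count-by-dimension queries locally."""

    def __init__(self, rows):
        self.rows = rows

    def count_by(self, dimension):
        return Counter(r[dimension] for r in self.rows)


class Broker:
    """Fans a query out to every server and merges the partial counts."""

    def __init__(self, servers):
        self.servers = servers

    def count_by(self, dimension):
        # Scatter: query all servers concurrently.
        with ThreadPoolExecutor() as pool:
            partials = list(pool.map(lambda s: s.count_by(dimension), self.servers))
        # Gather: counts are additive, so partial results merge by summation.
        return sum(partials, Counter())


servers = [
    Server([{"region": "US"}, {"region": "IN"}]),
    Server([{"region": "US"}]),
    Server([{"region": "UK"}, {"region": "US"}]),
]
broker = Broker(servers)
totals = broker.count_by("region")
```

Counts merge cleanly because they are additive. Aggregations such as distinct counts or percentiles need a different merge step.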

Core Features

Low latency and high QPS OLAP Queries with real-time ingestion

Support complex dimensions

Operational simplicity

Data bootstrapping & reconciliation

Usage @ LinkedIn

About 18 member-facing products on LinkedIn.com

Internal Reporting

Open source: coming soon

Reporting UI: serve & taste data

Collect & Prepare Data: Kafka + Hadoop

Serve Data: Pinot

Taste Data: Easy-to-use visualization

Business need

"I want to access big data without running SQL"

Start a new dashboard with one click

Select what metrics/dimensions you want

Charts are rendered in just a few seconds

Zoom into a single chart

Filter on various dimensions

Access everywhere

Enterprise analytics portal

Portal that connects dashboards, internal reports, and internal Wiki pages

Summary

Scale of the data

Pinot for interactive analysis

Self-service visualization for insights

We are hiring

Jonathan Wu jowu@linkedin.com www.linkedin.com/in/jiyewu

Praveen Neppalli Naga pneppalli@linkedin.com www.linkedin.com/in/pneppalli

Chi-Yi Kuan ckuan@linkedin.com www.linkedin.com/in/chiyikuan

650-605-2184

650-962-3299

650-426-6301