Using Apache Kylin @ eBay - 08/19 Big Data Application Meetup, Talk #1

Using Apache Kylin for Large Scale Data Analytics @

eBay

http://kylin.io

Agenda Big Data @ eBay What’s Apache Kylin? Cube Segments &

Streaming Q&A

Big Data @ eBay

$28BGMV VIA MOBILE

(2014)

266MMOBILE APP

GLOBALLY

1BLISTINGS CREATED

VIA MOBILE

157MACTIVE BUYERS

25MACTIVE SELLERS

800MACTIVE LISTINGS

8.8MNEW LISTINGS

EVERY WEEK

CHECKOUT

WATCH LIST

SHARE

LOCAL

PREFERENCES

BID

100’s PB 9 Quarters of historical

data

100MQUERIES/MONTH

20TBBehavior Data Capture

every day

CLICK

SEARCHDA

TA

Hadoop@ eBay

1-10 nodes2007

100+ nodes

1000s + core1 PB

2010

20111000+ node10,000+ core10+ PB

3000+ node10,000+ core

50+ PB2012

2013/201410,000 nodes150,000+ cores170 PB +> 2000 users2009

50+ nodes

Extreme OLAP Engine for Big DataApache Kylin is an open source Distributed Analytics Engine from eBay that provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets

What’s Kylin

kylin / ˈkiːˈlɪn / 麒麟--n. (in Chinese art) a mythical animal of composite form

• Open Sourced on Oct 1st, 2014• Accepted as Apache Incubator Project on Nov 25th, 2014

Business Needs for Big Data Analysis

Sub-second query latency on billions of rows ANSI SQL for both analysts and engineers Full OLAP capability to offer advanced functionality Seamless Integration with BI Tools

Support of high cardinality and high dimensions High concurrency – thousands of end users Distributed and scale out architecture for large data

volume

Apache Kylin AdoptionExternal

25+ contributors in community

Adoption:In Production: eBay, Baidu MapsEvaluation: Huawei, Bloomberg Law, British Gas, JD.com

Microsoft, Tableau… eBay Use Cases

Case Cube Size Raw RecordsUser Session Analysis 14+TB 75+ billion rowsTraffic Analysis 21+ TB 20+ billion rowsBehavior Analysis 560+ GB 1.2+ billion rows

Kylin Architecture Overview

9

Cube Builder (MapReduce…)

SQL

Low Latency - SecondsRouting

3rd Party App(Web App, Mobile…)

Metadata

SQL-Based Tool(BI Tools: Tableau…)

Query Engine

HadoopHive

REST API JDBC/ODBC

Online Analysis Data Flow Offline Data Flow

Clients/Users interactive with Kylin via SQL

OLAP Cube is transparent to users

Star Schema Data Key Value Data

Data Cube

OLAPCubes(HBase)

SQL

REST Server

Data

Sou

rce

Abst

racti

on Engine

Abstraction

Stor

age

Abst

racti

on

Cube Storage in HBase

OLAP Cube – Balance between Space and Time

time, item

time, item, location

time, item, location, supplier

time item location supplier

time, location

Time, supplier

item, locationitem, supplier

location, supplier

time, item, supplier

time, location, supplier

item, location, supplier

0-D(apex) cuboid

1-D cuboids

2-D cuboids

3-D cuboids

4-D(base) cuboid

Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells

1. (9/15, milk, Urbana, Dairy_land) - <time, item, location, supplier>2. (9/15, milk, Urbana, *) - <time, item, location>3. (*, milk, Urbana, *) - <item, location>4. (*, milk, Chicago, *) - <item, location>5. (*, milk, *, *) - <item>

Cuboid = one combination of dimensions Cube = all combination of dimensions (all cuboids)

Dynamic data management framework. Formerly known as Optiq, Calcite is an Apache incubator project, used by

Apache Drill and Apache Hive, among others. http://optiq.incubator.apache.org

Querying the Cube

http://optiq.incubator.apache.org/

http://optiq.incubator.apache.org/

Metadata SPI Provide table schema from Kylin metadata

Optimization Rules Translate the logic operators into kylin operators

Relational Operators Find right cube Translate SQL into storage engine API call Generate physical execute plan by linq4j java implementation

Result Enumerator Translate storage engine result into java implementation result.

SQL Function Add HyperLogLog for distinct count Implement date time related functions (i.e. Quarter)

Query Engine based on Calcite

Full Cube Pre-aggregate all dimension combinations “Curse of dimensionality”: N dimension cube has 2N cuboid.

Partial Cubes To avoid dimension explosion, we divide the dimensions into

different aggregation groups 2N+M+L 2N + 2M + 2L

For cube with 30 dimensions, if we divide these dimensions into 3 group, the cuboid number will reduce from 1 Billion to 3 Thousands

230 210 + 210 + 210

Tradeoff between online aggregation and offline pre-aggregation

Optimizing Cube Builds – Full Cube vs. Partial Cube

Optimizing Cube Builds: Partial Cubes

Optimizing Cube Building – Incremental Building

Optimizing Cube Builds: Improved Parallelism• The current algorithm

– Many MRs, the number of dimensions– Huge shuffles, aggregation at reduce side, 100x of total cube size

Full Data

0-D Cuboid

1-D Cuboid

2-D Cuboid

3-D Cuboid

4-D CuboidMR

MR

MR

MR

MR

Optimizing Cube Builds: Improved Parallelism

• The to-be algorithm, 30%-50% faster– 1 round MR– Reduced shuffles, map side aggregation, 20x total cube

size– Hourly incremental build done in tens of minutes

Data Split

Cube Segment

Data Split

Cube Segment

Data Split

Cube Segment

……

Final Cube

Merge Sort(Shuffle)

mapper mapper mapper

Historical DataRecent Data

Kafka Topics

SQL Query

Optimizing Cube Builds: Streaming

Inverted Index

Hybrid Storage Interface

Cube Segments

Use Case: SEO Operational Dashboard• eBay Site

– ebay.com, ebay.co.uk, ebay.de• Buyer Country

– US, CN, RU• Search Engine

– Google, Bing, Yahoo!• Referrer

– google.com, google.co.uk• Page

– Search, View Item, Product• User Experience

– Desktop, Mobile APP, mWeb

• Visits, GMB $, GMB share, conversion rate, bounce rate, # of view items, # of bought items etc.

Dimensions

Measurements

Q & A

Technology

Using Apache Kylin @ eBay - 08/19 Big Data Application Meetup, Talk #1