High-Performance Techniques for Big Data Computing
in Internet Services
Zhiwei Xu Institute of Computing Technology (ICT)
Chinese Academy of Sciences (CAS) www.ict.ac.cn, [email protected]
This research is supported in part by the National Basic Research Program of China (Grant 2011CB302502)
and the Strategic Priority Program of Chinese Academy of Sciences (Grant XDA06010400)
SC12 November 14, 2012
Outline
• Internet services are supercomputing
• Little's law is as important as Amdahl's law
  – Need to incorporate exascale, power, and energy
• Example research problems
  – A data placement problem
  – A data indexing problem
  – A data communication problem
• A new 10-year CAS NICT project
HPC vs. Internet Services
• Both have vibrant markets & communities
  – HPC: ~$1 billion; Internet services: O($10 billion)
  – MOST invested in HPC and Cloud initiatives (2001-2015)
• Internet services are supercomputing!
  – They use large systems
    • Sequoia: 98K nodes, 1.6 PB memory, 55 PB filesystem
    • Google datacenter systems (estimated): 800K-1M servers (nodes), ~16 PB memory, hundreds of PB of disks
    • Tencent datacenter systems (estimated): ~200K nodes, ~3.2 PB memory, ~200 PB disks
  – Their sciences are emerging
    • SC used to be for the physical universe
    • SC is also for the human-cyber-physical ternary universe
Alexa Top Sites (2012.10.21)
1. Google 2. Facebook 3. YouTube 4. Yahoo! 5. Baidu 6. Wikipedia 7. Windows Live 8. Twitter 9. QQ (Tencent) 10. Amazon 13. Taobao 16. Sina 23. eBay
HPC vs. Internet Services
• HPC has a focused multi-decade goal
  – A history-proof metric (flops) and ranking site (Top500.org)
  – Provides concrete R&D objectives and roadmaps
    • 1990: Gflops; 2000: Tflops; 2010: Pflops; 2020: Eflops
  – Facilitates a worldwide, broad-based HPC community
    • devices, systems, software, and application people
    • academia, industry, governments, volunteers
    • funding, R&D, use, and education issues
Jack Dongarra, On the Future of High Performance Computing: How to Think for Peta and Exascale Computing, SCI Institute, University of Utah, February 10, 2012.
Petaflops book, MPI, Blocks in Linpack
Chinese HPC Development Benefited from This International Goal
By 2020: an Exaflops (10^18) datacenter serving hundreds of millions (10^8) of users, with on the order of 100 M (10^8) lines of system software code and 100 M (10^8) W of power.
Needs: maintain growth in performance, but control power & system software complexity.
[Chart: growth over time of world Top-1 computer speed (flops), ICT computer speed (flops), ICT computer system software size (LOC), and ICT computer power (W).]
The User Experience Mantra
• "A function (or performance level) does not exist if users do not experience it well."
  – Internet services companies in China
• A profound implication: it is not horizontal anymore, and systems researchers must
  – consider communities and ecosystems (e.g., Hadoop, HBase, S4)
  – have access to workload data and ecosystems
Otherwise, a Pflops computer is only a sub-Tflops system!
[Figure: evolution of the computing stack (User Experience, Service, Application, Middleware, System Software, Machines, Components) across three eras.]
• 1955-1980, Vertical: one vendor spans the whole stack (IBM, DEC, ...).
• 1980-2005, Horizontal: each layer is served by different vendors — Service: EDS, Andersen, ...; Application: MS Office, SAP, ...; Middleware: Oracle, BEA, ...; System software: Windows, Unix, ...; Machines: HP, Cisco, Dell, ...; Components: Intel, Seagate, ....
• 2005-2030, End-to-end ecosystems: the layers remain horizontal — Service: IBM, Accenture, ...; Application: MS Office, SAP, ...; Middleware: Oracle, BEA, LAMP, ...; System software: Android, Windows, Linux, ...; Machines: HP, Cisco, Lenovo, ...; Components: Intel, ARM, Seagate, ... — while companies such as Apple and Tencent build end-to-end ecosystems centered on user experience.
Software bloat in matrix multiply (MxM) — relative execution time, normalized to BLAS = 1: vectorized code 36X, C 107X, Java 1816X, PHP 51090X.
Advancing Computer Systems without Technology Progress, M. D. Hill and C. Kozyrakis, ISAT Outbrief, 2012
R&D Issues: HPC vs. Internet Services
• Big data issues for Internet services
  – Data sizes: GBs to TBs vs. PBs to EBs, or trillions of records
  – Performance goals: 1 Eflops @ 20 MW vs. 1 EB/hour @ 20 MW?
  – Scalability R&D has progressed better than efficiency R&D
• Common research issues: how to effectively
  – Exploit parallelism (millions to a billion threads)
  – Utilize locality (temporal, spatial, request, data, etc.)
  – Reduce communication overheads
[Chart: Jim Gray's Sort Benchmark results, 1996-2012; y-axis is execution time in minutes (log scale, 1-1000), with data sizes growing from TB to PB to 10PB.]
Jim Gray's Sort Benchmark: speed improved 150X in 11 years (1998-2004: 5X in 6 years; 2004-2009: 33X in 5 years), while data size increased 10000X in 13 years.
Breakthroughs are needed to sort 100PB by 2015 and 1EB by 2020, within a few hours, at a 20MW power budget.
Internet Services Workloads
A service normally serves two types of data computing workloads
• Batch (backend, offline): data mining, machine learning
  – Performance metric for batch workloads: scale of processed data
    • TB/PB/10PB sorted; 1M/10M dimensions learned in a minute/hour/day
• Customer-facing (frontend, online): request serving, transactions, analytics
  – Performance metric for customer-facing workloads: the Amazon triple
    • (Simultaneous Requests, Percentile, Response Time) = (100K, 99.9%, 300ms)
[Diagram: a search service. The online side serves requests with thousands of threads and latency < 20ms; each search request can generate 10K-way parallelism. The backend side crawls and indexes trillions of pages from the Internet, and runs ranking, data mining (1-100PB), and machine learning (sparse matrices with 1-10M dimensions).]
[Plot: queuing theory view of latency vs. throughput; latency rises sharply near saturation, and a request fails when its latency exceeds the user experience threshold.]
Relate Multiple Performance Metrics to Energy
• Borrow from Little's law and the Internet hourglass
• Focus on "threads per second" as a proxy for the performance goals
  – Subject to latency, power, and energy constraints
  – A thread is a schedulable sequence of instruction executions with its own program counter
    • POSIX thread, HW thread, Java thread, CUDA thread of a GPU, Hadoop task, etc.
  – "Threads per second" serves as the neck of the performance-metrics hourglass
[Figure: the performance-metrics hourglass. Top: applications and macro-benchmark metrics (e.g., pageviews per day). Neck: threads per second. Bottom: micro-benchmark scores, instructions per second, operations per second (e.g., ExaFlop/s).]
[Figure: thread execution timeline over [0, T]. Thread τ_i starts at time t_i, has latency w_i (finishing at t_i + w_i), and executes f_i flop; worker threads, app framework threads, and system threads are distinguished. At the sampled instant the parallelism is Para(t) = 6 and the system draws power P(t).]
Assumptions and Observations
• Assume N threads {τ_1, …, τ_N} are executed in a computer system in time period [0, T], where
  – power and energy are additive; inactive threads consume no power
• Definitions of some average quantities
  – Throughput λ: threads per second, averaged over [0, T]
  – Parallelism L: number of active threads, averaged over [0, T]
  – Latency W: latency of a thread, averaged over {τ_1, …, τ_N}
  – Power P: Watts consumed by the system, averaged over [0, T]
  – Energy E: Joules consumed by a thread, averaged over {τ_1, …, τ_N}
• Observations
  – Little's Law: λ = L / W
  – New observations
    • λ = P / E (Throughput = system Power / thread Energy)
    • λ = L × (E/W) × (1/E), i.e., Throughput = Parallelism × Watts per thread × Threads per Joule
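A one-step derivation of the new observation λ = P / E, using only the definitions and the additivity assumption above (total energy over [0, T] equals system power times T, and also N times the per-thread average energy):

\[
P \cdot T \;=\; \sum_{i=1}^{N} E_i \;=\; N \cdot E
\quad\Longrightarrow\quad
\lambda \;=\; \frac{N}{T} \;=\; \frac{P}{E}.
\]

Rewriting Little's law λ = L / W as L × (E/W) × (1/E) gives the second identity.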
Connecting to ExaFlops@20MW
• Definitions
  – Work F: flop per thread, averaged over {τ_1, …, τ_N}
  – Speed S: flop per second
• Observation
  – S = F × λ = L × F × (E/W) × (1/E)
  – Speed = Parallelism × Work × Watts per thread × Threads per Joule
  – 1 Eflops = 1 billion threads × 1 billion flop × (<20 mW per thread) × (>1000 threads per Joule)
\[
F = \frac{1}{N}\sum_{i=1}^{N} f_i ,
\qquad
S = \frac{1}{T}\sum_{i=1}^{N} f_i ,
\qquad \text{where } f_i \text{ is the flop count of thread } \tau_i .
\]
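As a sanity check of the identity, with assumed, illustrative numbers (not figures from the slides): take L = 10^9 threads, F = 10^9 flop per thread, W = 1 s, and P = 20 MW. Then

\[
\lambda = L/W = 10^{9}\ \text{threads/s},\qquad
E = P/\lambda = 20\ \text{mJ per thread},\qquad
E/W = 20\ \text{mW per thread},\qquad
1/E = 50\ \text{threads per Joule},
\]
\[
S = L \times F \times (E/W) \times (1/E)
  = 10^{9} \times 10^{9} \times 0.02 \times 50
  = 10^{18}\ \text{flop/s} = 1\ \text{Eflops at a 20 MW budget}.
\]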
A Billion Thread Parallelism Needed by 2020 for High-End Datacenter Computers (DCC)
• How big "peak L" was/is/will be
  – 2000: kilo threads
  – 2010: million threads
  – 2020: billion threads
• Performance/energy needs to improve 100-1000X
  – increase parallelism 100-1000X
  – reduce data movement cost
Attributes of a DCC     | 2010       | 2020
Daily PV (billion)      | 4-7        | 20-100
Active threads per PV   | 1000       | 10,000
Peak-to-average ratio   | 2-10       | 2-15
Peak-hour parallelism   | ~1 million | ~1 billion
A Data Placement Problem
• How to place 1-1000PB of data among thousands of nodes to allow fast data warehouse operations?
• RCFile
  – In production use at many companies: Facebook, Taobao, Netflix, Twitter, Yahoo, LinkedIn, AOL, Salesforce.com, etc.
Yongqiang He, Rubao Lee, Yin Huai, Zheng Shao, Namit Jain, Xiaodong Zhang, Zhiwei Xu: RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems. ICDE 2011: 1199-1208 or http://en.wikipedia.org/wiki/RCFile
[Diagram: an HDFS cluster with a NameNode and DataNodes 1-3, holding an example relation with columns A, B, C, D and five rows (A: 101-105, B: 201-205, C: 301-305, D: 401-405) distributed across the nodes.]
R&D Issues
• Research problem
  – Find a data placement structure with both optimal read efficiency and optimal communication overhead
• Development problem
  – How to fit into the data computing community?
  – Utilize the Apache ecosystem (HDFS, MapReduce)
                       | Row-store   | Column-store      | Ideal
Read effort            | 1           | i/n (optimal)     | i/n (optimal)
Communication overhead | 0 (optimal) | β (0% ≤ β ≤ 100%) | 0 (optimal)
RCFile data layout in HDFS blocks: the relation is partitioned into row groups, and within each row group the data are stored column by column.
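To make the layout concrete, here is a minimal sketch in Python (illustrative only; the real RCFile is a Java storage format in the Hive/Hadoop stack, and the helper names below are hypothetical) of horizontal partitioning into row groups with column-wise storage and per-column compression inside each group:

```python
import zlib

def to_rcfile_row_groups(rows, columns, rows_per_group=4):
    """Partition a relation into row groups; within each group, values are
    laid out column by column and compressed per column (mirroring RCFile's
    horizontal-then-vertical partitioning)."""
    groups = []
    for start in range(0, len(rows), rows_per_group):
        group_rows = rows[start:start + rows_per_group]
        group = {}
        for col in columns:
            # All values of one column in this row group are stored contiguously,
            # so a query can read and decompress only the columns it needs.
            col_values = "\x01".join(str(r[col]) for r in group_rows)
            group[col] = zlib.compress(col_values.encode())
        groups.append(group)
    return groups

def read_columns(groups, wanted_columns):
    """A projection query touches only the requested columns of each row group."""
    for group in groups:
        yield {c: zlib.decompress(group[c]).decode().split("\x01")
               for c in wanted_columns}

# Example relation matching the slide's A/B/C/D illustration.
rows = [{"A": 100 + i, "B": 200 + i, "C": 300 + i, "D": 400 + i}
        for i in range(1, 6)]
groups = to_rcfile_row_groups(rows, ["A", "B", "C", "D"])
print(list(read_columns(groups, ["A", "C"])))
```

Keeping all columns of a row group in one HDFS block avoids the cross-node tuple reconstruction of a pure column store (the β overhead in the table above), while the column-wise layout within a group gives a column store's i/n read effort.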
Testing results on a Facebook workload: plain text format 10.11MB; compressed row-store format 2.13MB; RCFile format 1.8MB.
A Data Indexing Problem
• How to index 10-1000 billion data records to allow efficient multi-attribute range queries?
  – Records stored in Distributed Ordered Tables (DOTs), e.g., BigTable, HBase
  – Multi-attribute range query, e.g.:
    • SELECT A,B,C,D FROM table WHERE B > 21 AND B < 24 AND C > 31
[Diagram: an HBase cluster with an HMaster and HRegionServers 1-3, storing an example table with columns A-E and rows (11,21,31,41,51) … (15,25,35,45,55).]
Numbers of an Ideal Index Scheme
• Total data table: n rows, c columns; result: m rows, q columns; k indexes
• S = scan latency per record, R = random read latency
• r replicas, N nodes, failure probability p per node

Requirements           | Secondary index   | Clustering index | Replicated secondary index | Replicated clustering index | Ideal
Touched data (# cells) | m+m*q             | m*q (optimal)    | m+m*q                      | m*q (optimal)               | m*q (optimal)
Operation latency      | m*S+m*R (worst)   | m*S (optimal)    | m*S+m*R (worst)            | m*S (optimal)               | m*S (optimal)
Storage cost (# cells) | n*c+n*k (optimal) | n*c*(k+1)        | (n*c+n*k)*r                | n*c*(k+1)*r (worst)         | n*c+n*k (optimal)
Failure probability    | N*p (worst)       | N*p (worst)      | N*p^r (optimal)            | N*p^r (optimal)             | N*p^r (optimal)
CCIndex: Complementary Clustering Index
• Touched data: optimal
• Operation latency: optimal
• Storage cost: good
• Failure probability: near optimal
CCITs: Complementary Clustering Index Tables
  • Complementary, not replicated
  • Reduce storage cost from n*(k+1)*c*r to n*(k+1)*(c+r)
  • Scan a CCIT for a range query
  • In case of a failure, scan a CCT and read another CCIT
CCTs: Complementary Check Tables (r replicas)
[Example: the base table has columns A-E with rows (11,21,31,41,51) … (15,25,35,45,55). The CCIT on B is keyed by B_A (e.g., 21_11, 22_12, …) and stores full rows reordered by B; the CCIT on C is keyed by C_A (e.g., 31_11, …). The lightweight CCTs, kept with r replicas, store only the row key and the indexed columns — (A, B, C) for the base table, (B_A, C) and (C_A, B) for the two CCITs — so a damaged region can be recovered by scanning a CCT and reading another CCIT.]
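To illustrate the key construction, here is a minimal sketch in Python (illustrative only; the real CCIndex runs on HBase, and the helpers below are hypothetical) showing how the CCIT on B turns the range predicate on B into one contiguous scan, while the remaining predicate on C is filtered from the same rows:

```python
from bisect import bisect_left, bisect_right

def ccit_key(col_value, row_key):
    # A CCIT row key concatenates the indexed value with the base row key,
    # e.g. B=21 and base row key A=11 become "21_11", as in the slide's example.
    # (A real implementation would use a fixed-width or binary-sortable encoding
    # so that lexicographic order matches numeric order.)
    return f"{col_value}_{row_key}"

# Base table keyed by A, with columns B-E, as in the example.
base = {11: {"B": 21, "C": 31, "D": 41, "E": 51},
        12: {"B": 22, "C": 32, "D": 42, "E": 52},
        13: {"B": 23, "C": 33, "D": 43, "E": 53},
        14: {"B": 24, "C": 34, "D": 44, "E": 54},
        15: {"B": 25, "C": 35, "D": 45, "E": 55}}

# CCIT on B: a full copy of every row, clustered (sorted) by the concatenated key,
# so a range predicate on B becomes one contiguous scan instead of random reads.
ccit_b = sorted((ccit_key(row["B"], a), a, row) for a, row in base.items())
keys_b = [k for k, _, _ in ccit_b]

def range_query_b(low, high, extra_pred):
    """SELECT ... WHERE low < B < high AND extra_pred(row), via the CCIT on B."""
    start = bisect_right(keys_b, f"{low}_~")   # skip all keys with B == low
    stop = bisect_left(keys_b, f"{high}_")     # stop before any key with B == high
    return [(a, row) for _, a, row in ccit_b[start:stop] if extra_pred(row)]

# The slide's query: SELECT A,B,C,D FROM table WHERE B > 21 AND B < 24 AND C > 31
print(range_query_b(21, 24, lambda r: r["C"] > 31))
```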
Application at Taobao
• Taobao Magic Cube: an online data analytics service for merchants
• Migrated to CCIndex on HBase in two months without changing hardware
• Data analyzed: 7 days → 3 months; data scale: >10 billion records
• Throughput increased 7x; latency decreased 57%
Taobao ranked #13 in the Alexa Top Sites (2012.10.21).
[Screenshot: Taobao Magic Cube analytics by week and season; e.g., women's clothing sold ¥3.56B in the last month to 19.8M buyers.]
The Big Data Communication Problem
• Data computing lacks high-performance communication support
  – E.g., Hadoop uses RPC, HTTP, and NIO for communication, much slower than MPI
  – New work: the Hadoop-R project at OSU, RDMA-based design of HDFS (SC12 paper)
• System power consumption is not proportional to application usage
  – An example: sorting 2 GB on a 4-core, 8-thread computer with Hadoop
    • 32 worker threads, 202 system processes, and 2 Hadoop processes
    • 121W/194W = 62.3% of power "wasted", even when CPU utilization approaches 0%
• Needs summary
  – An MPI-like communication library to support data computing
  – More efficient than RPC, HTTP, NIO
  – Supporting key-value communication, not MPI's buffer-to-buffer communication
  – Supporting multiple data computing modes
  – As easy to use as MapReduce
Direct MPI Use Not Easy Or Scalable
WordCount via MapReduce: scalable over 1GB, 1TB, 1PB …

// MapReduce
map(String lineno, String contents) {
  for each word w in contents {
    EmitIntermediate(w, 1);
  }
}
reduce(String key, int value) {
  increment(key, value);
}

// MPI
process mapper:
  1st> load input
  2nd> parse tokens
  3rd> MPI_Send (serialization)
  …
process reducer:
  1st> MPI_Recv (deserialization)
  2nd> increment
  3rd> save output
  …

LOC: 110 (MapReduce) vs. LOC: 850 (MPI)
Desired Sort Code via MPI-D: Scalable and Easy to Write
[Code outline: init → rank/size → send → recv → finalize; 33 lines of code, scaling over 1 GB, 1 TB, 1 PB.]
Basic Ideas of MPI-D
• Observation from the WordCount example: map emits EmitIntermediate(w, 1) and reduce consumes (key, value) pairs; the communication is essentially send(key, value) / recv(key, value)
• KV + DOTA → MPI-D; DOTA is inspired by the DOT model
Y. Huai, R. Lee, S. Zhang, C.H. Xia, and X. Zhang, "DOT: a matrix model for analyzing, optimizing and deploying software for big data analytics in distributed systems", Proceedings of 2nd ACM Symposium on Cloud Computing (SOCC 2011)
Rethink BSP from a Big Data Viewpoint
• Implicit synchronization
• Vertex: data; edge: operation
• DOTA
  – 4 layers: Data, Operation, Transfer, Aggregation
  – Defense of the Ancients
[Diagram: the four DOTA layers. D layer: input/output data D1 … Dn; O layer: computation operations O1 … On; T layer: communication operations T11 … Tnm moving intermediate data; A layer: aggregation operations A1 … Am.]
DOTA – Matrix Representation
• Capture the nature of big data computing in the DOTA model
• Computation matrices O and A
  – diagonal matrices, representing independent parallel computations
• Communication matrix T
  – a full matrix, representing communication between every O operation and every A operation
• Example: MapReduce
  – O → Map, T → Shuffle, A → Reduce
In matrix form (writing the application of an operation to data as multiplication and the grouping of intermediate data as addition), a DOTA computation over data chunks D_i, computation operations O_i and A_j, and transfer operations t_{i,j} is:

\[
\begin{bmatrix} D_1 & D_2 & \cdots & D_n \end{bmatrix}
\begin{bmatrix} O_1 & & & \\ & O_2 & & \\ & & \ddots & \\ & & & O_n \end{bmatrix}
\begin{bmatrix} t_{1,1} & t_{1,2} & \cdots & t_{1,m} \\ t_{2,1} & t_{2,2} & \cdots & t_{2,m} \\ \vdots & \vdots & \ddots & \vdots \\ t_{n,1} & t_{n,2} & \cdots & t_{n,m} \end{bmatrix}
\begin{bmatrix} A_1 & & & \\ & A_2 & & \\ & & \ddots & \\ & & & A_m \end{bmatrix}
\]
\[
= \begin{bmatrix} O_1(D_1) & \cdots & O_n(D_n) \end{bmatrix}
\begin{bmatrix} t_{1,1} & \cdots & t_{1,m} \\ \vdots & \ddots & \vdots \\ t_{n,1} & \cdots & t_{n,m} \end{bmatrix}
\begin{bmatrix} A_1 & & \\ & \ddots & \\ & & A_m \end{bmatrix}
\]
\[
= \Big[\; A_1\big(t_{1,1}(O_1(D_1)),\ldots,t_{n,1}(O_n(D_n))\big) \;\;\cdots\;\; A_m\big(t_{1,m}(O_1(D_1)),\ldots,t_{n,m}(O_n(D_n))\big) \;\Big]
= \begin{bmatrix} R_1 & \cdots & R_m \end{bmatrix}
\]
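As a concrete toy instance (Python, with illustrative names not taken from the slides), the same decomposition applied to word count: O is the per-chunk map, T is the key-hash routing of intermediate data, and A is the per-partition reduce.

```python
from collections import defaultdict

# D layer: n input data chunks.
D = ["big data big", "data computing", "big computing systems"]

def O(chunk):                       # O layer: independent map on one chunk
    return [(w, 1) for w in chunk.split()]

def T(intermediate, m):             # T layer: route each (key, value) to one of m aggregators
    routed = [defaultdict(list) for _ in range(m)]
    for kv_list in intermediate:
        for k, v in kv_list:
            routed[hash(k) % m][k].append(v)
    return routed

def A(partition):                   # A layer: independent reduce within one aggregator
    return {k: sum(vs) for k, vs in partition.items()}

m = 2
result = [A(p) for p in T([O(d) for d in D], m)]   # [D_1 .. D_n] · diag(O) · T · diag(A)
print(result)   # word counts, split across the m aggregation partitions R_1 .. R_m
```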
DOTA Is a Bipartite Communication Model
The “4D” features of DOTA – Dichotomic, Dynamic, Data-centric, and Diversified
[Diagram: the bipartite communication model. Communicator O contains tasks O1 … On; Communicator A contains tasks A1 … Am; intermediate data D1 … Dm move between them. The legend distinguishes finished, current, and future tasks, data movement, and task movement.]
Proposed Specification of MPI-D
• Three groups of library functions
  – MPI_D_INIT(), MPI_D_FINALIZE()
  – MPI_D_SEND(key, value), MPI_D_RECV(key, value)
  – MPI_D_COMM_SIZE(comm, size), MPI_D_COMM_RANK(comm, rank)
• Five constants for data computing modes
  – Common, MapReduce, Iteration, Stream, and UserDefined
• Predefined structures
  – Two predefined communicators: COMM_BIPARTITE_O, COMM_BIPARTITE_A
  – A set of reserved configuration keys
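Since MPI-D is a proposed specification, no runnable library is assumed here. The following Python toy only simulates the programming style (init → rank/size → send → recv → finalize, as outlined in the earlier "Desired Sort Code via MPI-D" slide); its method names merely mirror MPI_D_INIT / MPI_D_SEND / MPI_D_RECV / MPI_D_FINALIZE and are not the proposed C-style interface.

```python
from collections import defaultdict

class ToyMPID:
    """A single-process stand-in for an MPI-D-like runtime: routes (key, value)
    pairs from O-side tasks to A-side tasks by key range. Illustrative only."""
    def __init__(self, o_size, a_size):
        self.o_size, self.a_size = o_size, a_size
        self.inbox = defaultdict(list)            # intermediate KV data per A rank

    def send(self, key, value):                   # cf. MPI_D_SEND(key, value)
        dest = key * self.a_size // 100           # range partition; keys in [0, 100)
        self.inbox[dest].append((key, value))

    def recv(self, a_rank):                       # cf. MPI_D_RECV(key, value)
        # The runtime delivers keys to each A task in sorted order.
        yield from sorted(self.inbox[a_rank], key=lambda kv: kv[0])

def sort_job(records, o_size=3, a_size=2):
    rt = ToyMPID(o_size, a_size)                  # cf. MPI_D_INIT with a mode constant
    for o_rank in range(o_size):                  # O tasks: read a split, emit KV pairs
        for key in records[o_rank::o_size]:
            rt.send(key, None)
    out = []
    for a_rank in range(a_size):                  # A tasks: receive keys, already ordered
        out.extend(k for k, _ in rt.recv(a_rank))
    return out                                    # cf. MPI_D_FINALIZE()

print(sort_job([42, 7, 93, 15, 68, 3, 77, 21]))   # globally sorted output
```

Because the key ranges assigned to A ranks are themselves ordered, concatenating the per-rank outputs yields a globally sorted result, which is what keeps the program as short as the slide's roughly 33-line outline suggests.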
Testing Environment
• Cluster: 2 nodes (1 namenode + 1 datanode)
• Hardware
  – CPU: Intel(R) Xeon(R) E5620 @ 2.40GHz, 4 cores, 8 threads
  – Memory: 16GB
  – Disk: 250GB SATA
• Software
  – Kernel: Linux 2.6.18-128.el5 x86_64
  – OS: CentOS 5.3 (Final)
  – Java: 1.6.0_24
  – Hadoop: version 0.20.2
• Power analyzer
  – Fluke 4000, rated power 2200W
  – Fluke Norma View software
  – Sampling frequency: real time (3-4 samples/second)
Common & MapReduce Modes
• Sort 2GB on an 8-thread node
• Compared to Hadoop
  – MPID-MR saves 49.5% time, 43.7% energy
  – MPID-CM saves 81.8% time, 77.5% energy
[Power traces: execution time 99 sec for Hadoop, 50 sec for MPID-MR, 18 sec for MPID-CM.]
The Iteration Mode
• MPID-PageRank saves 71.7% time, 61.7% energy
• MPID-K-means saves 46.1% time, 32.5% energy
[Power traces for PageRank and K-means: Hadoop execution times 364 sec and 529 sec vs. MPID-IT execution times 103 sec and 286 sec.]
The Stream Mode
• Top-K benchmark (k=10); 1000 messages per second; 100B per message; 120K messages in total
• S4 drops 71% of messages; MPID-ST drops 4%
• Average latency: S4 7.9 seconds vs. MPID-ST 1.2 seconds
• MPID-ST saves 84.8% latency, 82.4% energy per message
[Charts: latency distribution and average latency for Yahoo! S4 vs. MPID-ST.]
Conclusions
• Datacenter computers
  – have scaled to handle 1-100PB of data
  – will need to handle EB of data and trillions of records, with higher efficiency and energy efficiency
• A datacenter computer by 2020 will need to
  – provide 1 billion threads
  – reduce data movement cost by orders of magnitude, via utilizing locality and efficient communication
• Community efforts are needed in
  – roadmaps, e.g., EB@20MW by 2020
  – data computing models and architectures
  – open source system software & application frameworks
    • http://mpi-d.github.com
The Chinese Academy of Sciences NICT Project
• New-generation ICT
  – 10-year research project (2012-2021)
  – 19 institutes, over 200 faculty members
  – Aims at China's needs in 2020-2050
• Human-cyber-physical ternary computing
  – A key component: cloud-sea computing systems capable of handling ZB of data
    • Billion-thread cloud servers for EB data processing
    • GB-TB terminal devices (human facing)
    • KB-GB sensor nodes (physical-world facing)
References
• Zhiwei Xu: Measuring Green IT in Society. IEEE Computer 45(5): 83-85 (2012)
• Zhiwei Xu: How much power is needed for a billion-thread high-throughput server? Frontiers of Computer Science 6(4): 339-346 (2012)
• Zhiwei Xu, Guojie Li: Computing for the masses. Commun. ACM 54(10): 129-137 (2011)
• Yongqiang He, Rubao Lee, Yin Huai, Zheng Shao, Namit Jain, Xiaodong Zhang, Zhiwei Xu: RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems. ICDE 2011: 1199-1208
• Xiaoyi Lu, Bing Wang, Li Zha, Zhiwei Xu: Can MPI Benefit Hadoop and MapReduce Applications? ICPP Workshops 2011: 371-379
• Qi Guo, Tianshi Chen, Yunji Chen, Zhi-Hua Zhou, Weiwu Hu, Zhiwei Xu: Effective and Efficient Microprocessor Design Space Exploration Using Unlabeled Design Configurations. IJCAI 2011: 1671-1677
• Yongqiang Zou, Jia Liu, Shicai Wang, Li Zha, Zhiwei Xu: CCIndex: A Complemental Clustering Index on Distributed Ordered Tables for Multi-dimensional Range Queries. NPC 2010: 247-261
Thank you! (谢谢!)