High-Performance Techniques for Big Data Computing
in Internet Services
Zhiwei Xu Institute of Computing Technology (ICT)
Chinese Academy of Sciences (CAS) www.ict.ac.cn, [email protected]
This research is supported in part by the National Basic Research Program of China (Grant 2011CB302502)
and the Strategic Priority Program of Chinese Academy of Sciences (Grant XDA06010400)
SC12 November 14, 2012
Outline
• Internet services are supercomputing
• Little's law is as important as Amdahl's law
  – Need to incorporate exascale, power, and energy
• Example research problems
  – A data placement problem
  – A data indexing problem
  – A data communication problem
• A new 10-year CAS NICT project
HPC vs. Internet Services
• Both have vibrant markets & communities
  – HPC: ~$1 billion; Internet services: O($10 billion)
  – MOST invested in HPC and Cloud initiatives (2001-2015)
• Internet services are supercomputing!
  – They use large systems
    • Sequoia: 98K nodes, 1.6 PB memory, 55 PB filesystem
    • Google datacenter systems (estimated): 800K-1M servers (nodes), ~16 PB memory, hundreds of PB of disks
    • Tencent datacenter systems (estimated): ~200K nodes, ~3.2 PB memory, ~200 PB disks
  – Their sciences are emerging
    • SC used to be for the physical universe
    • SC is also for the human-cyber-physical ternary universe
Alexa Top Sites (2012.10.21)
1. Google 2. Facebook 3. YouTube 4. Yahoo! 5. Baidu 6. Wikipedia 7. Windows Live 8. Twitter 9. QQ (Tencent) 10. Amazon 13. Taobao 16. Sina 23. eBay
HPC vs. Internet Services
• HPC has a focused multi-decade goal
  – A history-proof metric (flops) and ranking site (Top500.org)
  – Provides concrete R&D objectives and roadmaps
    • 1990: Gflops; 2000: Tflops; 2010: Pflops; 2020: Eflops
  – Facilitates a worldwide, broad-based HPC community
    • devices, systems, software, and application people
    • academia, industry, governments, volunteers
    • funding, R&D, use, and education issues
Jack Dongarra, On the Future of High Performance Computing: How to Think for Peta and Exascale Computing, SCI Institute, University of Utah, February 10, 2012.
Petaflops book, MPI, Blocks in Linpack
Chinese HPC Development Benefited from This International Goal
By 2020: an Exaflops (10^18) datacenter serving hundreds of millions (10^8) of users, with on the order of 100 M (10^8) lines of system software code and 100 M (10^8) W of power.
Needs: maintain growth in performance, but control power & system software complexity.
[Chart: growth over time of world Top-1 computer speed (flops), ICT computer speed (flops), ICT computer system software size (LOC), and ICT computer power (W).]
The User Experience Mantra
• "A function (or performance level) does not exist if users do not experience it well."
  – Internet services companies in China
• A profound implication: it is not horizontal anymore, and systems researchers must
  – consider communities and ecosystems (e.g., Hadoop, HBase, S4)
  – have access to workload data and ecosystems
Otherwise, a Pflops computer is only a sub-Tflops system!
[Figure: evolution of the computing stack (User Experience, Service, Application, Middleware, System Software, Machines, Components) across three eras.]
• 1955-1980, Vertical: one vendor spans the whole stack (IBM, DEC, ...).
• 1980-2005, Horizontal: each layer is served by different vendors — Service: EDS, Andersen, ...; Application: MS Office, SAP, ...; Middleware: Oracle, BEA, ...; System software: Windows, Unix, ...; Machines: HP, Cisco, Dell, ...; Components: Intel, Seagate, ....
• 2005-2030, End-to-end ecosystems: the layers remain horizontal — Service: IBM, Accenture, ...; Application: MS Office, SAP, ...; Middleware: Oracle, BEA, LAMP, ...; System software: Android, Windows, Linux, ...; Machines: HP, Cisco, Lenovo, ...; Components: Intel, ARM, Seagate, ... — while companies such as Apple and Tencent build end-to-end ecosystems centered on user experience.
Software bloat in matrix multiply (MxM) — relative execution time, normalized to BLAS = 1: vectorized code 36X, C 107X, Java 1816X, PHP 51090X.
Advancing Computer Systems without Technology Progress, M. D. Hill and C. Kozyrakis, ISAT Outbrief, 2012
R&D Issues: HPC vs. Internet Services
• Big data issues for Internet services
  – Data sizes: GBs to TBs vs. PBs to EBs, or trillions of records
  – Performance goals: 1 Eflops @ 20 MW vs. 1 EB/hour @ 20 MW?
  – Scalability R&D has progressed better than efficiency R&D
• Common research issues: how to effectively
  – Exploit parallelism (millions to a billion threads)
  – Utilize locality (temporal, spatial, request, data, etc.)
  – Reduce communication overheads
[Chart: Jim Gray's Sort Benchmark results, 1996-2012; y-axis is execution time in minutes (log scale, 1-1000), with data sizes growing from TB to PB to 10PB.]
Jim Gray's Sort Benchmark: speed improved 150X in 11 years (1998-2004: 5X in 6 years; 2004-2009: 33X in 5 years), while data size increased 10000X in 13 years.
Breakthroughs are needed to sort 100PB by 2015 and 1EB by 2020, within a few hours, at a 20MW power budget.
Internet Services Workloads
A service normally serves two types of data computing workloads
• Batch (backend, offline): data mining, machine learning
  – Performance metric for batch workloads: scale of processed data
    • TB/PB/10PB sorted; 1M/10M dimensions learned in a minute/hour/day
• Customer-facing (frontend, online): request serving, transactions, analytics
  – Performance metric for customer-facing workloads: the Amazon triple
    • (Simultaneous Requests, Percentile, Response Time) = (100K, 99.9%, 300ms)
[Diagram: a search service. The online side serves requests with thousands of threads and latency < 20ms; each search request can generate 10K-way parallelism. The backend side crawls and indexes trillions of pages from the Internet, and runs ranking, data mining (1-100PB), and machine learning (sparse matrices with 1-10M dimensions).]
[Plot: queuing theory view of latency vs. throughput; latency rises sharply near saturation, and a request fails when its latency exceeds the user experience threshold.]
Relate Multiple Performance Metrics to Energy
• Borrow from Little's law and the Internet hourglass
• Focus on "threads per second" as a proxy for the performance goals
  – Subject to latency, power, and energy constraints
  – A thread is a schedulable sequence of instruction executions with its own program counter
    • POSIX thread, HW thread, Java thread, CUDA thread of a GPU, Hadoop task, etc.
  – "Threads per second" serves as the neck of the performance-metrics hourglass
[Figure: the performance-metrics hourglass. Top: applications and macro-benchmark metrics (e.g., pageviews per day). Neck: threads per second. Bottom: micro-benchmark scores, instructions per second, operations per second (e.g., ExaFlop/s).]
[Figure: thread execution timeline over [0, T]. Thread τ_i starts at time t_i, has latency w_i (finishing at t_i + w_i), and executes f_i flop; worker threads, app framework threads, and system threads are distinguished. At the sampled instant the parallelism is Para(t) = 6 and the system draws power P(t).]
Assumptions and Observations
• Assume N threads {τ_1, …, τ_N} are executed in a computer system in time period [0, T], where
  – power and energy are additive; inactive threads consume no power
• Definitions of some average quantities
  – Throughput λ: threads per second, averaged over [0, T]
  – Parallelism L: number of active threads, averaged over [0, T]
  – Latency W: latency of a thread, averaged over {τ_1, …, τ_N}
  – Power P: Watts consumed by the system, averaged over [0, T]
  – Energy E: Joules consumed by a thread, averaged over {τ_1, …, τ_N}
• Observations
  – Little's Law: λ = L / W
  – New observations
    • λ = P / E (Throughput = system Power / thread Energy)
    • λ = L × (E/W) × (1/E), i.e., Throughput = Parallelism × Watts per thread × Threads per Joule
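A one-step derivation of the new observation λ = P / E, using only the definitions and the additivity assumption above (total energy over [0, T] equals system power times T, and also N times the per-thread average energy):

\[
P \cdot T \;=\; \sum_{i=1}^{N} E_i \;=\; N \cdot E
\quad\Longrightarrow\quad
\lambda \;=\; \frac{N}{T} \;=\; \frac{P}{E}.
\]

Rewriting Little's law λ = L / W as L × (E/W) × (1/E) gives the second identity.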
Connecting to ExaFlops@20MW
• Definitions
  – Work F: flop per thread, averaged over {τ_1, …, τ_N}
  – Speed S: flop per second
• Observation
  – S = F × λ = L × F × (E/W) × (1/E)
  – Speed = Parallelism × Work × Watts per thread × Threads per Joule
  – 1 Eflops = 1 billion threads × 1 billion flop × (<20 mW per thread) × (>1000 threads per Joule)
\[
F = \frac{1}{N}\sum_{i=1}^{N} f_i ,
\qquad
S = \frac{1}{T}\sum_{i=1}^{N} f_i ,
\qquad \text{where } f_i \text{ is the flop count of thread } \tau_i .
\]
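As a sanity check of the identity, with assumed, illustrative numbers (not figures from the slides): take L = 10^9 threads, F = 10^9 flop per thread, W = 1 s, and P = 20 MW. Then

\[
\lambda = L/W = 10^{9}\ \text{threads/s},\qquad
E = P/\lambda = 20\ \text{mJ per thread},\qquad
E/W = 20\ \text{mW per thread},\qquad
1/E = 50\ \text{threads per Joule},
\]
\[
S = L \times F \times (E/W) \times (1/E)
  = 10^{9} \times 10^{9} \times 0.02 \times 50
  = 10^{18}\ \text{flop/s} = 1\ \text{Eflops at a 20 MW budget}.
\]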
A Billion Thread Parallelism Needed by 2020 for High-End Datacenter Computers (DCC)
• How big "peak L" was/is/will be
  – 2000: kilo threads
  – 2010: million threads
  – 2020: billion threads
• Performance/energy needs to improve 100-1000X
  – increase parallelism 100-1000X
  – reduce data movement cost
Attributes of a DCC     | 2010       | 2020
Daily PV (billion)      | 4-7        | 20-100
Active threads per PV   | 1000       | 10,000
Peak-to-average ratio   | 2-10       | 2-15
Peak-hour parallelism   | ~1 million | ~1 billion
A Data Placement Problem
• How to place 1-1000PB of data among thousands of nodes to allow fast data warehouse operations?
• RCFile
  – In production use at many companies: Facebook, Taobao, Netflix, Twitter, Yahoo, LinkedIn, AOL, Salesforce.com, etc.
Yongqiang He, Rubao Lee, Yin Huai, Zheng Shao, Namit Jain, Xiaodong Zhang, Zhiwei Xu: RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems. ICDE 2011: 1199-1208 or http://en.wikipedia.org/wiki/RCFile
[Diagram: an HDFS cluster with a NameNode and DataNodes 1-3, holding an example relation with columns A, B, C, D and five rows (A: 101-105, B: 201-205, C: 301-305, D: 401-405) distributed across the nodes.]
R&D Issues
• Research problem
  – Find a data placement structure with both optimal read efficiency and optimal communication overhead
• Development problem
  – How to fit into the data computing community?
  – Utilize the Apache ecosystem (HDFS, MapReduce)
                       | Row-store   | Column-store      | Ideal
Read effort            | 1           | i/n (optimal)     | i/n (optimal)
Communication overhead | 0 (optimal) | β (0% ≤ β ≤ 100%) | 0 (optimal)
RCFile data layout in HDFS blocks: the relation is partitioned into row groups, and within each row group the data are stored column by column.
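To make the layout concrete, here is a minimal sketch in Python (illustrative only; the real RCFile is a Java storage format in the Hive/Hadoop stack, and the helper names below are hypothetical) of horizontal partitioning into row groups with column-wise storage and per-column compression inside each group:

```python
import zlib

def to_rcfile_row_groups(rows, columns, rows_per_group=4):
    """Partition a relation into row groups; within each group, values are
    laid out column by column and compressed per column (mirroring RCFile's
    horizontal-then-vertical partitioning)."""
    groups = []
    for start in range(0, len(rows), rows_per_group):
        group_rows = rows[start:start + rows_per_group]
        group = {}
        for col in columns:
            # All values of one column in this row group are stored contiguously,
            # so a query can read and decompress only the columns it needs.
            col_values = "\x01".join(str(r[col]) for r in group_rows)
            group[col] = zlib.compress(col_values.encode())
        groups.append(group)
    return groups

def read_columns(groups, wanted_columns):
    """A projection query touches only the requested columns of each row group."""
    for group in groups:
        yield {c: zlib.decompress(group[c]).decode().split("\x01")
               for c in wanted_columns}

# Example relation matching the slide's A/B/C/D illustration.
rows = [{"A": 100 + i, "B": 200 + i, "C": 300 + i, "D": 400 + i}
        for i in range(1, 6)]
groups = to_rcfile_row_groups(rows, ["A", "B", "C", "D"])
print(list(read_columns(groups, ["A", "C"])))
```

Keeping all columns of a row group in one HDFS block avoids the cross-node tuple reconstruction of a pure column store (the β overhead in the table above), while the column-wise layout within a group gives a column store's i/n read effort.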
Testing results on a Facebook workload: plain text format 10.11MB; compressed row-store format 2.13MB; RCFile format 1.8MB.
A Data Indexing Problem
• How to index 10-1000 billion data records to allow efficient multi-attribute range queries?
  – Records stored in Distributed Ordered Tables (DOTs), e.g., BigTable, HBase
  – Multi-attribute range query, e.g.:
    • SELECT A,B,C,D FROM table WHERE B > 21 AND B < 24 AND C > 31
[Diagram: an HBase cluster with an HMaster and HRegionServers 1-3, storing an example table with columns A-E and rows (11,21,31,41,51) … (15,25,35,45,55).]
Numbers of an Ideal Index Scheme
• Total data table: n rows, c columns; result: m rows, q columns; k indexes
• S = scan latency per record, R = random read latency
• r replicas, N nodes, failure probability p per node

Requirements           | Secondary index   | Clustering index | Replicated secondary index | Replicated clustering index | Ideal
Touched data (# cells) | m+m*q             | m*q (optimal)    | m+m*q                      | m*q (optimal)               | m*q (optimal)
Operation latency      | m*S+m*R (worst)   | m*S (optimal)    | m*S+m*R (worst)            | m*S (optimal)               | m*S (optimal)
Storage cost (# cells) | n*c+n*k (optimal) | n*c*(k+1)        | (n*c+n*k)*r                | n*c*(k+1)*r (worst)         | n*c+n*k (optimal)
Failure probability    | N*p (worst)       | N*p (worst)      | N*p^r (optimal)            | N*p^r (optimal)             | N*p^r (optimal)
CCIndex: Complementary Clustering Index
• Touched data: optimal
• Operation latency: optimal
• Storage cost: good
• Failure probability: near optimal
CCITs: Complementary Clustering Index Tables
  • Complementary, not replicated
  • Reduce storage cost from n*(k+1)*c*r to n*(k+1)*(c+r)
  • Scan a CCIT for a range query
  • In case of a failure, scan a CCT and read another CCIT
CCTs: Complementary Check Tables (r replicas)
[Example: the base table has columns A-E with rows (11,21,31,41,51) … (15,25,35,45,55). The CCIT on B is keyed by B_A (e.g., 21_11, 22_12, …) and stores full rows reordered by B; the CCIT on C is keyed by C_A (e.g., 31_11, …). The lightweight CCTs, kept with r replicas, store only the row key and the indexed columns — (A, B, C) for the base table, (B_A, C) and (C_A, B) for the two CCITs — so a damaged region can be recovered by scanning a CCT and reading another CCIT.]
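To illustrate the key construction, here is a minimal sketch in Python (illustrative only; the real CCIndex runs on HBase, and the helpers below are hypothetical) showing how the CCIT on B turns the range predicate on B into one contiguous scan, while the remaining predicate on C is filtered from the same rows:

```python
from bisect import bisect_left, bisect_right

def ccit_key(col_value, row_key):
    # A CCIT row key concatenates the indexed value with the base row key,
    # e.g. B=21 and base row key A=11 become "21_11", as in the slide's example.
    # (A real implementation would use a fixed-width or binary-sortable encoding
    # so that lexicographic order matches numeric order.)
    return f"{col_value}_{row_key}"

# Base table keyed by A, with columns B-E, as in the example.
base = {11: {"B": 21, "C": 31, "D": 41, "E": 51},
        12: {"B": 22, "C": 32, "D": 42, "E": 52},
        13: {"B": 23, "C": 33, "D": 43, "E": 53},
        14: {"B": 24, "C": 34, "D": 44, "E": 54},
        15: {"B": 25, "C": 35, "D": 45, "E": 55}}

# CCIT on B: a full copy of every row, clustered (sorted) by the concatenated key,
# so a range predicate on B becomes one contiguous scan instead of random reads.
ccit_b = sorted((ccit_key(row["B"], a), a, row) for a, row in base.items())
keys_b = [k for k, _, _ in ccit_b]

def range_query_b(low, high, extra_pred):
    """SELECT ... WHERE low < B < high AND extra_pred(row), via the CCIT on B."""
    start = bisect_right(keys_b, f"{low}_~")   # skip all keys with B == low
    stop = bisect_left(keys_b, f"{high}_")     # stop before any key with B == high
    return [(a, row) for _, a, row in ccit_b[start:stop] if extra_pred(row)]

# The slide's query: SELECT A,B,C,D FROM table WHERE B > 21 AND B < 24 AND C > 31
print(range_query_b(21, 24, lambda r: r["C"] > 31))
```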
Application at Taobao
• Taobao Magic Cube: an online data analytics service for merchants
• Migrated to CCIndex on HBase in two months without changing hardware
• Data analyzed: 7 days → 3 months; data scale: >10 billion records
• Throughput increased 7x; latency decreased 57%
Taobao ranked #13 in the Alexa Top Sites (2012.10.21).
[Screenshot: Taobao Magic Cube analytics by week and season; e.g., women's clothing sold ¥3.56B in the last month to 19.8M buyers.]
The Big Data Communication Problem
• Data computing lacks high-performance communication support
  – E.g., Hadoop uses RPC, HTTP, and NIO for communication, much slower than MPI
  – New work: the Hadoop-R project at OSU, RDMA-based design of HDFS (SC12 paper)
• System power consumption is not proportional to application usage
  – An example: sorting 2 GB on a 4-core, 8-thread computer with Hadoop
    • 32 worker threads, 202 system processes, and 2 Hadoop processes
    • 121W/194W = 62.3% of power "wasted", even when CPU utilization approaches 0%
• Needs summary
  – An MPI-like communication library to support data computing
  – More efficient than RPC, HTTP, NIO
  – Supporting key-value communication, not MPI's buffer-to-buffer communication
  – Supporting multiple data computing modes
  – As easy to use as MapReduce
Direct MPI Use Not Easy Or Scalable
WordCount via MapReduce: scalable over 1GB, 1TB, 1PB …

// MapReduce
map(String lineno, String contents) {
  for each word w in contents {
    EmitIntermediate(w, 1);
  }
}
reduce(String key, int value) {
  increment(key, value);
}

// MPI
process mapper:
  1st> load input
  2nd> parse tokens
  3rd> MPI_Send (serialization)
  …
process reducer:
  1st> MPI_Recv (deserialization)
  2nd> increment
  3rd> save output
  …

LOC: 110 (MapReduce) vs. LOC: 850 (MPI)
Desired Sort Code via MPI-D: Scalable and Easy to Write
[Code outline: init → rank/size → send → recv → finalize; 33 lines of code, scaling over 1 GB, 1 TB, 1 PB.]
Basic Ideas of MPI-D
• Observation from the WordCount example: map emits EmitIntermediate(w, 1) and reduce consumes (key, value) pairs; the communication is essentially send(key, value) / recv(key, value)
• KV + DOTA → MPI-D; DOTA is inspired by the DOT model
Y. Huai, R. Lee, S. Zhang, C.H. Xia, and X. Zhang, "DOT: a matrix model for analyzing, optimizing and deploying software for big data analytics in distributed systems", Proceedings of 2nd ACM Symposium on Cloud Computing (SOCC 2011)
Rethink BSP from a Big Data Viewpoint
• Implicit synchronization
• Vertex: data; edge: operation
• DOTA
  – 4 layers: Data, Operation, Transfer, Aggregation
  – Defense of the Ancients
[Diagram: the four DOTA layers. D layer: input/output data D1 … Dn; O layer: computation operations O1 … On; T layer: communication operations T11 … Tnm moving intermediate data; A layer: aggregation operations A1 … Am.]
DOTA – Matrix Representation
• Capture the nature of big data computing in the DOTA model
• Computation matrices O and A
  – diagonal matrices, representing independent parallel computations
• Communication matrix T
  – a full matrix, representing communication between every O operation and every A operation
• Example: MapReduce
  – O → Map, T → Shuffle, A → Reduce
In matrix form (writing the application of an operation to data as multiplication and the grouping of intermediate data as addition), a DOTA computation over data chunks D_i, computation operations O_i and A_j, and transfer operations t_{i,j} is:

\[
\begin{bmatrix} D_1 & D_2 & \cdots & D_n \end{bmatrix}
\begin{bmatrix} O_1 & & & \\ & O_2 & & \\ & & \ddots & \\ & & & O_n \end{bmatrix}
\begin{bmatrix} t_{1,1} & t_{1,2} & \cdots & t_{1,m} \\ t_{2,1} & t_{2,2} & \cdots & t_{2,m} \\ \vdots & \vdots & \ddots & \vdots \\ t_{n,1} & t_{n,2} & \cdots & t_{n,m} \end{bmatrix}
\begin{bmatrix} A_1 & & & \\ & A_2 & & \\ & & \ddots & \\ & & & A_m \end{bmatrix}
\]
\[
= \begin{bmatrix} O_1(D_1) & \cdots & O_n(D_n) \end{bmatrix}
\begin{bmatrix} t_{1,1} & \cdots & t_{1,m} \\ \vdots & \ddots & \vdots \\ t_{n,1} & \cdots & t_{n,m} \end{bmatrix}
\begin{bmatrix} A_1 & & \\ & \ddots & \\ & & A_m \end{bmatrix}
\]
\[
= \Big[\; A_1\big(t_{1,1}(O_1(D_1)),\ldots,t_{n,1}(O_n(D_n))\big) \;\;\cdots\;\; A_m\big(t_{1,m}(O_1(D_1)),\ldots,t_{n,m}(O_n(D_n))\big) \;\Big]
= \begin{bmatrix} R_1 & \cdots & R_m \end{bmatrix}
\]
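As a concrete toy instance (Python, with illustrative names not taken from the slides), the same decomposition applied to word count: O is the per-chunk map, T is the key-hash routing of intermediate data, and A is the per-partition reduce.

```python
from collections import defaultdict

# D layer: n input data chunks.
D = ["big data big", "data computing", "big computing systems"]

def O(chunk):                       # O layer: independent map on one chunk
    return [(w, 1) for w in chunk.split()]

def T(intermediate, m):             # T layer: route each (key, value) to one of m aggregators
    routed = [defaultdict(list) for _ in range(m)]
    for kv_list in intermediate:
        for k, v in kv_list:
            routed[hash(k) % m][k].append(v)
    return routed

def A(partition):                   # A layer: independent reduce within one aggregator
    return {k: sum(vs) for k, vs in partition.items()}

m = 2
result = [A(p) for p in T([O(d) for d in D], m)]   # [D_1 .. D_n] · diag(O) · T · diag(A)
print(result)   # word counts, split across the m aggregation partitions R_1 .. R_m
```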
DOTA Is a Bipartite Communication Model
The “4D” features of DOTA – Dichotomic, Dynamic, Data-centric, and Diversified
[Diagram: the bipartite communication model. Communicator O contains tasks O1 … On; Communicator A contains tasks A1 … Am; intermediate data D1 … Dm move between them. The legend distinguishes finished, current, and future tasks, data movement, and task movement.]
Proposed Specification of MPI-D
• Three groups of library functions
  – MPI_D_INIT(), MPI_D_FINALIZE()
  – MPI_D_SEND(key, value), MPI_D_RECV(key, value)
  – MPI_D_COMM_SIZE(comm, size), MPI_D_COMM_RANK(comm, rank)
• Five constants for data computing modes
  – Common, MapReduce, Iteration, Stream, and UserDefined
• Predefined structures
  – Two predefined communicators: COMM_BIPARTITE_O, COMM_BIPARTITE_A
  – A set of reserved configuration keys
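Since MPI-D is a proposed specification, no runnable library is assumed here. The following Python toy only simulates the programming style (init → rank/size → send → recv → finalize, as outlined in the earlier "Desired Sort Code via MPI-D" slide); its method names merely mirror MPI_D_INIT / MPI_D_SEND / MPI_D_RECV / MPI_D_FINALIZE and are not the proposed C-style interface.

```python
from collections import defaultdict

class ToyMPID:
    """A single-process stand-in for an MPI-D-like runtime: routes (key, value)
    pairs from O-side tasks to A-side tasks by key range. Illustrative only."""
    def __init__(self, o_size, a_size):
        self.o_size, self.a_size = o_size, a_size
        self.inbox = defaultdict(list)            # intermediate KV data per A rank

    def send(self, key, value):                   # cf. MPI_D_SEND(key, value)
        dest = key * self.a_size // 100           # range partition; keys in [0, 100)
        self.inbox[dest].append((key, value))

    def recv(self, a_rank):                       # cf. MPI_D_RECV(key, value)
        # The runtime delivers keys to each A task in sorted order.
        yield from sorted(self.inbox[a_rank], key=lambda kv: kv[0])

def sort_job(records, o_size=3, a_size=2):
    rt = ToyMPID(o_size, a_size)                  # cf. MPI_D_INIT with a mode constant
    for o_rank in range(o_size):                  # O tasks: read a split, emit KV pairs
        for key in records[o_rank::o_size]:
            rt.send(key, None)
    out = []
    for a_rank in range(a_size):                  # A tasks: receive keys, already ordered
        out.extend(k for k, _ in rt.recv(a_rank))
    return out                                    # cf. MPI_D_FINALIZE()

print(sort_job([42, 7, 93, 15, 68, 3, 77, 21]))   # globally sorted output
```

Because the key ranges assigned to A ranks are themselves ordered, concatenating the per-rank outputs yields a globally sorted result, which is what keeps the program as short as the slide's roughly 33-line outline suggests.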
Testing Environment
• Cluster: 2 nodes (1 namenode + 1 datanode)
• Hardware
  – CPU: Intel(R) Xeon(R) E5620 @ 2.40GHz, 4 cores, 8 threads
  – Memory: 16GB
  – Disk: 250GB SATA
• Software
  – Kernel: Linux 2.6.18-128.el5 x86_64
  – OS: CentOS 5.3 (Final)
  – Java: 1.6.0_24
  – Hadoop: version 0.20.2
• Power analyzer
  – Fluke 4000, rated power 2200W
  – Fluke Norma View software
  – Sampling frequency: real time (3-4 samples/second)
Common & MapReduce Modes
• Sort 2GB on an 8-thread node
• Compared to Hadoop
  – MPID-MR saves 49.5% time, 43.7% energy
  – MPID-CM saves 81.8% time, 77.5% energy
[Power traces: execution time 99 sec for Hadoop, 50 sec for MPID-MR, 18 sec for MPID-CM.]
The Iteration Mode
• MPID-PageRank saves 71.7% time, 61.7% energy
• MPID-K-means saves 46.1% time, 32.5% energy
[Power traces for PageRank and K-means: Hadoop execution times 364 sec and 529 sec vs. MPID-IT execution times 103 sec and 286 sec.]
The Stream Mode
• Top-K benchmark (k=10); 1000 messages per second; 100B per message; 120K messages in total
• S4 drops 71% of messages; MPID-ST drops 4%
• Average latency: S4 7.9 seconds vs. MPID-ST 1.2 seconds
• MPID-ST saves 84.8% latency, 82.4% energy per message
[Charts: latency distribution and average latency for Yahoo! S4 vs. MPID-ST.]
Conclusions
• Datacenter computers
  – have scaled to handle 1-100PB of data
  – will need to handle EB of data and trillions of records, with higher efficiency and energy efficiency
• A datacenter computer by 2020 will need to
  – provide 1 billion threads
  – reduce data movement cost by orders of magnitude, via utilizing locality and efficient communication
• Community efforts are needed in
  – roadmaps, e.g., EB@20MW by 2020
  – data computing models and architectures
  – open source system software & application frameworks
    • http://mpi-d.github.com
The Chinese Academy of Sciences NICT Project
• New-generation ICT
  – 10-year research project (2012-2021)
  – 19 institutes, over 200 faculty members
  – Aims at China's needs in 2020-2050
• Human-cyber-physical ternary computing
  – A key component: cloud-sea computing systems capable of handling ZB of data
    • Billion-thread cloud servers for EB data processing
    • GB-TB terminal devices (human facing)
    • KB-GB sensor nodes (physical-world facing)
References
• Zhiwei Xu: Measuring Green IT in Society. IEEE Computer 45(5): 83-85 (2012)
• Zhiwei Xu: How much power is needed for a billion-thread high-throughput server? Frontiers of Computer Science 6(4): 339-346 (2012)
• Zhiwei Xu, Guojie Li: Computing for the masses. Commun. ACM 54(10): 129-137 (2011)
• Yongqiang He, Rubao Lee, Yin Huai, Zheng Shao, Namit Jain, Xiaodong Zhang, Zhiwei Xu: RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems. ICDE 2011: 1199-1208
• Xiaoyi Lu, Bing Wang, Li Zha, Zhiwei Xu: Can MPI Benefit Hadoop and MapReduce Applications? ICPP Workshops 2011: 371-379
• Qi Guo, Tianshi Chen, Yunji Chen, Zhi-Hua Zhou, Weiwu Hu, Zhiwei Xu: Effective and Efficient Microprocessor Design Space Exploration Using Unlabeled Design Configurations. IJCAI 2011: 1671-1677
• Yongqiang Zou, Jia Liu, Shicai Wang, Li Zha, Zhiwei Xu: CCIndex: A Complemental Clustering Index on Distributed Ordered Tables for Multi-dimensional Range Queries. NPC 2010: 247-261
Thank you! (谢谢!)