Addressing shared resource contention in datacenter servers. Colloquium talk by Sergey Blagodurov, http://www.sfu.ca/~sba70/, Stony Brook University, Fall 2013.

Slide 1: Addressing shared resource contention in datacenter servers
Colloquium talk by Sergey Blagodurov, http://www.sfu.ca/~sba70/
Stony Brook University, Fall 2013

Slide 2: My research (40,000-feet view)
Academic research at Simon Fraser University:
- I am finishing my PhD with Prof. Alexandra Fedorova.
- My work is on scheduling in High Performance Computing (HPC) clusters.
- I prototype better datacenters!
Industrial research at Hewlett-Packard Laboratories:
- I am a Research Associate in the Sustainable Ecosystems Research Group.
- My work is on designing a net-zero energy cloud infrastructure.

Slide 3: Why are datacenters important?
#1 Dematerialization:
- Online shopping means less driving.
- Working from home.
- Digital content delivery.

Slide 4: Why are datacenters important?
#2 Moving into the cloud.

Slide 5: Why are datacenters important?
#3 Increasing demand for supercomputers:
- The biggest scientific discoveries.
- Tremendous cost savings.
- Medical innovations.

Slide 6: Why do research in datacenters?
Datacenters use lots of energy:
- Consumption rose by 60% in the last five years.
- More than the entire country of Mexico!
- Now ~1-2% of world electricity.
Typical electricity costs per year (a back-of-the-envelope check follows below):
- Google (>500K servers, ~72 MW): $38M
- Microsoft (>200K servers, ~68 MW): $36M
- Sequoia (~100K nodes, 8 MW): $7M
Datacenters consume lots of energy, and it is getting worse!
[Image: seawater hydro-electric storage on Okinawa, Japan]
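The cost figures above are consistent with a simple power-times-hours-times-price estimate. A minimal back-of-the-envelope check, assuming a flat electricity price of about $0.06/kWh (the talk does not state the actual rates):

```python
# Back-of-the-envelope check of the annual electricity cost figures.
# Assumption: a flat price of ~$0.06/kWh; real contracts differ per site.
HOURS_PER_YEAR = 24 * 365   # 8760 hours
PRICE_PER_KWH = 0.06        # USD, assumed

def annual_cost_usd(power_mw: float) -> float:
    """Annual electricity cost for a constant power draw given in megawatts."""
    kwh_per_year = power_mw * 1000 * HOURS_PER_YEAR
    return kwh_per_year * PRICE_PER_KWH

for name, mw in [("Google (~72 MW)", 72), ("Microsoft (~68 MW)", 68), ("Sequoia (~8 MW)", 8)]:
    print(f"{name}: ~${annual_cost_usd(mw) / 1e6:.0f}M/year")
```

At this assumed rate the estimate reproduces the ~$38M and ~$36M figures for Google and Microsoft; the $7M figure quoted for Sequoia implies a higher per-kWh rate at that site.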
Slide 7: Why do research in datacenters?
A 20 MW datacenter that runs 24/7 for one year is equivalent to:
- 23K cars in annual greenhouse gas emissions.
- The CO2 emissions from the electricity use of 15K homes for one year.
A single datacenter generates as much greenhouse gas as a small city!

Slide 8: Where do datacenters spend energy?
- Servers: 70-90%.
- Cooling and other infrastructure: 10-30%.
- CPU and memory are the biggest consumers.

Slide 9: An AMD Opteron 8356 (Barcelona) domain
[Diagram: cores 0-3, each with private L1 and L2 caches, share an L3 cache, a System Request Interface, a crossbar switch, a memory controller attached to memory node 0, and HyperTransport links to the other domains; together they form NUMA domain 0.]

Slide 10: An AMD Opteron system with 4 domains
[Diagram: four NUMA domains (0-3), each with four cores and their private L1/L2 caches, a shared L3 cache, a memory controller (MC) with a local memory node, and HyperTransport (HT) links connecting the domains; 16 cores in total, four per domain (e.g., cores 0, 4, 8, 12 in domain 0).]
(A sketch of how this OS-visible layout can be queried at runtime follows below.)
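A minimal sketch of how the domain/core layout above can be read on a running system, assuming a Linux host and the standard sysfs layout under /sys/devices/system/node (this is not part of the talk, only an illustration):

```python
# Print the NUMA domains visible to the OS and the cores attached to each,
# by reading the standard Linux sysfs layout (assumes a Linux host).
import glob
import os

def numa_topology():
    topology = {}
    for node_dir in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
        node_id = int(os.path.basename(node_dir)[len("node"):])
        with open(os.path.join(node_dir, "cpulist")) as f:
            topology[node_id] = f.read().strip()   # e.g. "0,4,8,12" or "0-3"
    return topology

if __name__ == "__main__":
    for node, cpus in numa_topology().items():
        print(f"NUMA node {node}: cores {cpus}")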
Slide 11: Contention for the shared last-level cache (CA)
[Same four-domain diagram, highlighting threads within one domain competing for the shared L3 cache.]

Slide 12: Contention for the memory controller (MC)
[Same diagram, highlighting threads competing for a domain's memory controller.]

Slide 13: Contention for the inter-domain interconnect (IC)
[Same diagram, highlighting traffic competing for the HyperTransport links between domains.]

Slide 14: Remote access latency (RL)
[Same diagram, highlighting thread A accessing memory that resides on a remote node.]

Slide 15: Isolating memory controller contention (MC)
[Diagram: a placement of threads A and B, with their memory kept on memory node 0, used to isolate memory controller contention from the other factors.]

Slide 16: Dominant degradation factors
Memory controller (MC) and interconnect (IC) contention are the key factors hurting performance.

Slide 17: Contention-aware scheduling
- Characterization method: given two threads, decide whether they will hurt each other's performance if co-scheduled.
- Scheduling algorithm: separate threads that are expected to interfere.

Slide 18: Characterization method
Limited observability: we do not know for sure whether threads compete and how severely!
Trial and error is infeasible on large systems:
- We cannot try all possible combinations.
- Even sampling becomes difficult.
A good trade-off: measure the LLC miss rate!
- Threads are assumed to interfere if they both have high miss rates.
- The heuristic does not account for the impact of cache contention itself.

Slide 19: Miss rate as a predictor for contention penalty
[Chart.]

Slide 20: Server-level scheduling
Goal: isolate threads that compete for shared resources, and pull their memory to the local node upon migration.
- Sort threads by LLC miss rate.
- Migrate competing threads, along with their memory, to different domains.
[Diagram: threads sorted by LLC miss rate (A, B, X, Y, ...) and the most memory-intensive ones spread across domains 1 and 2, with their memory migrated along; see the sketch after this slide.]
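A minimal sketch of the spreading idea on Slide 20, not the actual Clavis code: per-thread LLC miss rates are taken as given (in practice they would come from hardware performance counters), the pids, miss-rate values, and domain-to-core map below are hypothetical, and pinning uses the standard Linux os.sched_setaffinity call. The memory-migration step from the slide is omitted here.

```python
# Contention-aware spreading: sort threads by LLC miss rate and deal them out
# round-robin across NUMA domains so the most memory-intensive threads end up
# in different domains. A sketch of the idea, not the Clavis scheduler.
import os

# Hypothetical input: pid -> LLC misses per instruction, measured elsewhere.
miss_rates = {1201: 0.021, 1202: 0.001, 1203: 0.017, 1204: 0.002}

# Hypothetical topology: domain id -> cores (cf. the 4-domain Opteron above).
domains = {0: [0, 4, 8, 12], 1: [1, 5, 9, 13]}

def spread(miss_rates, domains):
    """Return a pid -> core placement that separates high-miss-rate threads."""
    ordered = sorted(miss_rates, key=miss_rates.get, reverse=True)
    domain_ids = list(domains)
    placement, next_core = {}, {d: 0 for d in domain_ids}
    for i, pid in enumerate(ordered):
        d = domain_ids[i % len(domain_ids)]        # alternate between domains
        placement[pid] = domains[d][next_core[d]]
        next_core[d] += 1
    return placement

for pid, core in spread(miss_rates, domains).items():
    try:
        os.sched_setaffinity(pid, {core})          # pin the thread to its core
    except OSError:
        pass                                       # the example pids may not exist
```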
Slide 21: Server-level results
[Charts: SPEC CPU 2006, SPEC MPI 2007, and LAMP results.]

Slide 22: Possibilities of datacenter-wide scheduling
[Diagram: six compute nodes (0-5) connected by the datacenter network, each with its own cores and memory nodes; jobs A, B, C, and D are distributed across the nodes.]

Slide 23: Clavis-HPC features
Contention-aware cluster scheduling:
- See: online detection of contention, communication overhead, and power consumption.
- Think: approximate an optimal cluster schedule (cast the problem as a multi-objective one).
- Do: use low-overhead virtualization (OpenVZ) to migrate jobs across the nodes.

Slide 24: Enumeration tree search
Finding an optimal schedule:
- A Branch-and-Bound enumeration search tree.
- An implementation using the Choco solver minimizes a weighted sum of the objectives.

Slide 25: Solver evaluation
[Chart: solver evaluation with a custom branching strategy; see the sketch after this slide.]
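Slides 24-25 describe a Branch-and-Bound search implemented with the Choco constraint solver (Java). As an illustration of the weighted-sum objective only, here is a sketch that scores candidate job-to-node assignments and brute-forces a tiny search space; the weights, the intensity and communication numbers, and the per-term models are made-up placeholders rather than the Clavis-HPC formulation, and plain enumeration stands in for the pruned Branch-and-Bound search.

```python
# Illustrative weighted-sum objective for cluster scheduling: score each
# candidate assignment of jobs to nodes and keep the minimum. The weights
# and per-term models below are placeholders, not the Clavis-HPC ones.
from itertools import product

JOBS = ["A", "B", "C", "D"]
NODES = [0, 1]
W_CONTENTION, W_COMM, W_POWER = 1.0, 0.5, 0.2      # assumed weights

# Hypothetical per-job memory intensity and pairwise communication volume.
intensity = {"A": 0.9, "B": 0.8, "C": 0.1, "D": 0.2}
comm = {("A", "B"): 5.0, ("C", "D"): 4.0}

def score(assign):
    contention = sum(
        intensity[i] * intensity[j]
        for i in JOBS for j in JOBS
        if i < j and assign[i] == assign[j])        # co-located memory hogs
    communication = sum(
        vol for (i, j), vol in comm.items()
        if assign[i] != assign[j])                  # traffic crossing nodes
    power = len(set(assign.values()))               # nodes kept powered on
    return W_CONTENTION * contention + W_COMM * communication + W_POWER * power

best = min((dict(zip(JOBS, placement))
            for placement in product(NODES, repeat=len(JOBS))),
           key=score)
print(best, score(best))
```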
Slide 26: Cluster-wide scheduling (a case for HPC)
[Diagrams: the vanilla HPC framework versus Clavis-HPC.]

Slide 27: Results
[Charts.]

Slide 28: What's the impact?
Faster execution saves money:
- A datacenter with a $30M electricity bill.
- 20% less energy due to faster execution.
- $6M/year savings!

Slide 29: What's next?
Eric Schmidt, former CEO of Google: "Every two days now we create as much data as we did from the dawn of civilization up until 2003."
Big Data: Big Money, Big Responsibility.

Slide 30: Big Data has many facets

Slide 31: Use case: sensor data from a cross-country flight

Slide 32: Future research directions
#1 Memory hierarchy in the Exascale era
[Diagram: today's compute node (cores plus DRAM memory nodes) will turn into a node combining cores, memory nodes, PCRAM, FLASH, and software-defined storage.]

Slide 33: Future research directions
#2 Big Data placement and Big Data analysis.

Slide 34: Future research directions
#3 How to choose a datacenter for a given Big Data analytic task? A cloud? An HPC cluster? A warehouse? Something else?

Slide 35: Conclusion
In a nutshell:
- Datacenters are the platform of choice.
- Datacenter servers are major energy consumers.
- Energy is wasted because of resource contention.
- I address resource contention automatically and on the fly.
Future plans: Big Data retrieval and analysis.

Slide 36: Any [time for] questions?
Addressing shared resource contention in datacenter servers.

Slide 37: Clavis-HPC framework
1) The user connects to the HPC cluster via a client and submits a job with a PBS script. The user can characterize the job with a contention metric (devil, comm-devil).
2) The Resource Manager (RM) on the head node receives the submission request and passes it to the Job Scheduler (JS).
3) JS determines which jobs execute on which containers and passes the scheduling decision to RM.
4) RM starts/stops the jobs on the given containers.
5) The virtualized jobs execute in the containers under the contention-aware user-level scheduler (Clavis-DINO). They access cluster storage to get their input files and store the results.
6) RM generates a contention-aware report about resource usage in the cluster during the last scheduling interval.
7) Users or sysadmins analyze the contention-aware resource usage report.
8) Users can checkpoint their jobs (OpenVZ snapshots).
9) Sysadmins can perform automated job migration across the nodes through OpenVZ live migration and can dynamically consolidate the workload on fewer nodes, turning the rest off to save power.
10) RM passes the contention-aware resource usage report to JS.
Components: clients (tablet, laptop, desktop, etc.); head node (RM, JS, Clavis-HPC); centralized cluster storage (NFS, Lustre); cluster network (Ethernet, InfiniBand); monitoring (JS GUI) and control (IPMI, iLO3, etc.); compute nodes with contention monitors (Clavis), OpenVZ containers, libraries (OpenMPI, etc.), and RM daemons (pbs_mom).

Slide 38: Clavis-HPC additional results
[Charts: results of the contention-aware experiments.]

Slide 39: Cluster-wide scheduling (a case for HPC)
[Diagrams: the vanilla HPC framework versus Clavis-HPC (repeat of Slide 26).]

Slide 40: Where do datacenters spend energy?
Servers: 70-90%; cooling and other infrastructure: 10-30%. CPU and memory are the biggest consumers (repeat of Slide 8).

Slide 41: Cloud datacenter workloads
Critical (preferred access to resources): RUBiS, WikiBench.
Non-critical:
- Datacenter batch load: Swaptions, Facesim, FDS.
- HPC jobs: LU, BT, CG.

Slide 42: Automated collocation
Server under-utilization is a long-standing problem:
- It increases both CapEx and OpEx costs.
- Even for modern servers, energy efficiency at 30% load can be less than half the efficiency at 100% load.
Solution:
- Collocate critical and non-critical applications.
- Manage resource access through the Linux control group (cgroup) mechanisms.
Work-conserving vs. non-work-conserving collocation:
- Managing with caps (limits) vs. managing with weights (priorities).
- Improving isolation vs. improving server utilization.

Slide 43: What's the impact?
Automated collocation enables net-zero energy usage.

Slide 44: Workload collocation using static prioritization
[Charts: Scenario A (Swaptions, Facesim, FDS) and Scenario B (LU, BT, CG).]

Slide 45: Workload collocation during spikes
Weight-based collocation: tolerable critical-workload performance loss.

Slide 46: Workload collocation during spikes
A weight value twice as high for one process compared to another means twice as many CPU cycles (see the sketch after this slide).
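As an illustration of the weight-based (work-conserving) side of the trade-off above: with the cgroup v1 cpu controller, cpu.shares is a relative weight, so giving the critical group twice the shares of the batch group yields roughly twice the CPU cycles under contention while still letting the batch group soak up idle cycles. A minimal sketch, assuming a Linux host with the v1 cpu controller mounted at /sys/fs/cgroup/cpu and root privileges (on cgroup v2 the analogous knob is cpu.weight); the group names and share values are illustrative.

```python
# Weight-based collocation with cgroup v1 cpu.shares: the critical group gets
# twice the weight of the batch group, i.e. roughly 2x the cycles under load.
# Assumes Linux, the v1 cpu controller mounted at /sys/fs/cgroup/cpu, and root.
import os

CGROUP_ROOT = "/sys/fs/cgroup/cpu"
GROUPS = {"critical": 2048, "batch": 1024}          # relative weights (2:1)

def setup_groups():
    for name, shares in GROUPS.items():
        path = os.path.join(CGROUP_ROOT, name)
        os.makedirs(path, exist_ok=True)            # cgroupfs creates the knobs
        with open(os.path.join(path, "cpu.shares"), "w") as f:
            f.write(str(shares))

def place(pid: int, group: str):
    """Move a running process into one of the groups."""
    with open(os.path.join(CGROUP_ROOT, group, "tasks"), "w") as f:
        f.write(str(pid))

if __name__ == "__main__":
    setup_groups()
    # e.g. place(rubis_pid, "critical"); place(swaptions_pid, "batch")
```

The cap-based (non-work-conserving) alternative from Slide 42 would instead set hard limits (e.g., cpu.cfs_quota_us and cpu.cfs_period_us), trading server utilization for stronger isolation.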
Slide 47: Future research directions
#4 What storage organization is the most suitable for each datacenter type (cloud, HPC cluster, warehouse): key/value stores, parallel databases, or filesystems?

Slide 48: Data warehouse project
Data assurance for power delivery networks.
[Diagram: records from meters A, B, and C flow into the data warehouse; assurance rules applied to the records flag that meter C is broken.]

Slide 49: Increasing prediction accuracy
The LLC miss rate works, but it is not very accurate. What if we want a more accurate metric?
- Then we need to profile many performance counters simultaneously and build a model that predicts the degradation.
- We would have to train the model beforehand on a representative workload.
- The need to train the model is the price of higher accuracy!

Slide 50: Devising an accurate metric (outline)
[Our solution: outline.]

Slide 51: Devising an accurate metric (outline)
[Our solution: outline, continued.]

Slides 52-55: Devising an accurate metric (methodology)
[Four methodology slides; figures only.]

Slide 56: Devising an accurate metric (model)
The REPTree module in Weka:
- Creates a tree with each attribute placed in a tree node.
- The branches of the tree are the values that the attribute takes.
- The leaf stores the degradation (obtained during the training stage).
(An analogous sketch follows after the results below.)

Slide 57: Devising an accurate metric (results)
- Intel: 340 recordable core events; 19 core events selected; average prediction error: 16%.
- AMD: 208 recordable core events and 223 recordable chip events; 32 core events and 8 chip events selected; average prediction error: 13%.
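The model on Slide 56 is Weka's REPTree, which is Java. As a hedged analogue in Python, the sketch below trains a generic regression tree to map performance-counter vectors to measured degradation; scikit-learn's DecisionTreeRegressor stands in for REPTree, and the counter names and values are invented for illustration.

```python
# Train a regression tree that maps performance-counter readings to the
# degradation an application suffers when co-scheduled. A stand-in for the
# Weka REPTree model; the counter names and values below are illustrative.
from sklearn.tree import DecisionTreeRegressor

# Each row: counter readings for one training application (e.g., per 1K instructions).
# Assumed columns: [llc_misses, dram_accesses, prefetches]
X_train = [
    [12.0, 9.5, 3.1],
    [0.4, 0.2, 0.1],
    [7.8, 6.0, 2.2],
    [1.1, 0.9, 0.3],
]
# Measured degradation (%) of each training application under contention.
y_train = [41.0, 2.0, 27.0, 5.0]

model = DecisionTreeRegressor(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# Predict the degradation of a new, unseen application from its counters.
print(model.predict([[6.5, 5.1, 1.8]]))
```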