A High-Fidelity Temperature Distribution Forecasting System for Data Centers Guoliang Xing Assistant Professor Department of Computer Science and Engineering

  • Slide 1
  • A High-Fidelity Temperature Distribution Forecasting System for Data Centers Guoliang Xing Assistant Professor Department of Computer Science and Engineering Michigan State University
  • Slide 2
  • Cyber-Physical Systems Cyber-physical systems are engineered systems that are built from and depend upon the synergy of computational and physical components 1 Many critical application domains Medical, auto, energy, transportation #1 national priority for Networking and IT Research and Development (NITRD) NITRD Review report by President's Council of Advisors on Science and Technology (PCAST) titled Leadership Under Challenge: Information Technology R&D in a Competitive World, 2007 1 NSF Cyber-physical systems solicitation 13502 2
  • Slide 3
  • Our CPS Projects Data center thermal monitoring Real-time volcano monitoring Aquatic process profiling Robotic fish, Smart Microsystems Lab, MSU Tungurahua Volcano, Ecuador Volcano Monitoring Sensors Data Center Monitoring, HPCC, MSU Harmful Algae Bloom in Lake Mendota in Wisconsin, 1999 3
  • Slide 4
  • Outline Data center thermal monitoring Background System design Testbed evaluation Real-time volcano monitoring Barcode streaming for smartphones 4
  • Slide 5
  • Motivation Data centers are critical computing infrastructure 509,147 data centers worldwide, 285 million sq. ft. 1 2.8M hours of downtime, 142 billion direct loss/year 1 23% of server outages are heat-induced shutdowns An aerial view of EMC's new data center in Durham, North Carolina 2 An EMC data center 2 1 Emerson Network Power, State of the Data Centers 2011, 2 http://www.datacenterknowledge.com/archives/2011/09/15/emc-opens-new-cloud-data-center-in-nc/. 5
  • Slide 6
  • Motivation Many data centers are overcooled Low AC set-points, high server fan speeds Excessive cooling energy up to 50% or more of total power consumption Rapid increase of energy use in data centers From 2005 to 2010, electricity use in data centers grew 36% (US) and 56% (worldwide) 1 An estimated 2% of the US electricity budget 1 1 Jonathan G. Koomey, Growth in data center electricity use 2005 to 2010, Analytics Press, 2011. 6
  • Slide 7
  • Temperature Forecasting Predict server temperature evolution Identify potential hot spots Enable high CRAC set-points for energy saving Temperature at inlets/outlets indicates hotspots Inlets Outlets 7 cool air hot air
  • Slide 8
  • Requirements High-fidelity Prediction 1 °C prediction error Long prediction horizons (e.g., 10 minutes) Coverage: normal conditions & emergencies (e.g., AC failures) Timeliness and low overhead Real-time online prediction Decouple from infrastructure in data center 8
  • Slide 9
  • Challenges Complex air and thermal dynamics Highly dynamic workloads Physical failures ACs, servers, fans Row 1 Row 2 Raised-floor cold air Server exhaust 12-day CPU utilization data of one rack (64 servers with 512 CPU cores) in High Performance Computer Center at Michigan State University 9
  • Slide 10
  • Related Work Data-driven prediction approach Collect in situ sensor data Construct prediction model (parameter learning) Regression, neural networks, etc. Real-time prediction Limitation Require extensive training Rare but critical physical failures in data centers? 10
  • Slide 11
  • Related Work Computational Fluid Dynamics (CFD) modeling Spatially discretized geometry model Iteratively solve partial differential equations Limitation Inaccuracy, high compute complexity error 11
  • Slide 12
  • System Architecture CFD + Wireless Sensing + Data-driven Prediction Preserve realistic physical characteristics in training data Capture dynamics by in situ sensing and real-time prediction Data Center Calibration Sensing (CPU, fan speed, temperature, airflow) Sensing (CPU, fan speed, temperature, airflow) Real-time Prediction CFD Modeling geometric model (server/rack dimension and placement) 12
  • Slide 13
  • Thermal Sensing Inlet / Outlet Temperature Sensing Air velocity CRAC Temp CPU utilization Fan speed Temperature Airflow velocity LAN 13
  • Slide 14
  • CFD Modeling & Calibration Data Center Calibration Sensing (CPU, fan speed, temperature, airflow) Sensing (CPU, fan speed, temperature, airflow) Real-time Prediction CFD Modeling 14
  • Slide 15
  • CFD Modeling & Calibration Polynomial Calibration Physical Geometry Model CFD Modeling Steady/Transient CFD Steady Sensor Data Calibration coefficients Temperature from CFD Calibration order Training: sensor reading Runtime: calibrated temperature Transient: t, t+3 min, t+6 min 15
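The polynomial calibration step on this slide can be illustrated with a minimal sketch. It assumes that calibration means a least-squares fit mapping CFD-predicted temperature to the sensor reading at the same location, with the calibration order as the polynomial degree. The function names `fit_polynomial` and `calibrate` and the sample values are hypothetical, not taken from the talk.

```python
def fit_polynomial(cfd_temps, sensor_temps, order):
    """Least-squares fit of sensor = sum_k c[k] * cfd**k for k = 0..order."""
    n = order + 1
    # Normal equations A c = b derived from the Vandermonde matrix.
    a = [[sum(x ** (i + j) for x in cfd_temps) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(cfd_temps, sensor_temps)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(a[i][k] * coeffs[k] for k in range(i + 1, n))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs


def calibrate(coeffs, cfd_temp):
    """Runtime step: map a raw CFD temperature to a calibrated temperature."""
    return sum(c * cfd_temp ** k for k, c in enumerate(coeffs))


# Training (hypothetical data): sensor readings paired with CFD output in °C.
cfd = [20.0, 22.0, 25.0, 30.0]
sensor = [1.5 + 0.9 * x for x in cfd]
coeffs = fit_polynomial(cfd, sensor, order=1)
calibrated = calibrate(coeffs, 27.0)
```

In this sketch the training phase consumes sensor readings and the runtime phase applies the stored coefficients, mirroring the "Training: sensor reading / Runtime: calibrated temperature" labels on the slide.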
  • Slide 16
  • Real-time Prediction Data Center Calibration Sensing (CPU, fan speed, temperature, airflow) Sensing (CPU, fan speed, temperature, airflow) Real-time Prediction CFD Modeling 16
  • Slide 17
  • Real-time Prediction Training Linear Prediction Model Prediction 17
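As a rough illustration of the training and prediction stages, the sketch below fits one linear model per prediction horizon. The slide does not list the model inputs, so this example assumes a single feature (the current temperature) and ordinary least squares. The names `fit_horizon_model` and `predict` and the sample series are hypothetical.

```python
def fit_horizon_model(series, horizon):
    """Fit T[t + horizon] ~ a * T[t] + b by ordinary least squares.

    `horizon` is a positive number of sampling steps.
    """
    xs = series[:-horizon]
    ys = series[horizon:]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b


def predict(model, current_temp):
    """Apply a trained (a, b) pair to the latest temperature reading."""
    a, b = model
    return a * current_temp + b


# Training on a hypothetical temperature trace (°C), two steps ahead.
trace = [20.0, 21.0, 22.0, 23.0, 24.0, 25.0]
model = fit_horizon_model(trace, horizon=2)
forecast = predict(model, 30.0)
```

A fuller model would presumably add the other sensed quantities (CPU utilization, fan speed, airflow) as features; that extension is left out here to keep the sketch short.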
  • Slide 18
  • Single-rack Experiment Testbed configuration 30 temperature sensors Telosb, Iris 2 airflow sensors AccuSense F333 15 servers Dell PowerEdge 850 Western Scientific Controlled CPU utilization Temperature sensor Ceiling vent airflow sensor Insulation Temperature sensor Airflow sensor AC inlet 18
  • Slide 19
  • Experiment Results Multi-horizon prediction CFD-assisted prediction Error increases with horizon 19
  • Slide 20
  • Production Data Center Experiment Testbed configuration 5 racks, 229 servers, 2016 cores 4 in-row CRAC units 35 temperature sensors 4 airflow sensors Dynamic CPU utilization Airflow sensor Temperature sensor Chained Temp. sensor In-row CRACs 20
  • Slide 21
  • Experiment Results Long-term experiment (12 days) Outlet Inlet 21
  • Slide 22
  • Outline Data center thermal monitoring Real-time volcano monitoring Background Quality-driven earthquake detection Deployment and evaluation Barcode streaming for smartphones 22
  • Slide 23
  • Volcano Hazards 7% of the world population lives near active volcanoes 20 - 30 explosive eruptions/year Eruption in Chile, 6/4, 2011 $68 M instant damage, $2.4 B future relief. www.boston.com/bigpicture/2011/06/volcano_erupts_in_chile.html 23 Eruptions in Iceland 2010 A week-long airspace closure [Wikipedia]
  • Slide 24
  • Volcano Monitoring Traditional seismometer Expensive (~ $10K), bulky, difficult to install, up to a dozen nodes for most active volcanoes! Data collection and retrieval ~10G data in a month Processing Detection, timing, localization 4D Tomography computation Real-time, 3D fluid dynamics of a volcano conduit system Extremely computation-intensive 24
  • Slide 25
  • VolcanoSRI Project Large-scale, long-term deployment Up to 500 nodes on an active volcano in Ecuador Sampling @ 100 Hz, several-month lifetime Collaborative in-network processing Detection, timing, localization 4D tomography computation The tentative deployment map in Ecuador (Photo credits: Prof. Jonathan Lees) 25
  • Slide 26
  • Challenge 1: Spatial Diversity Complicated physical process Highly dynamic magnitude Dynamic source location Two earthquakes on Mt St Helens 26
  • Slide 27
  • Challenge 2: Frequency Diversity Responsive to P-wave within [1 Hz, 10 Hz] Freq. spectrum changes with signal magnitude [1 Hz, 5 Hz] [5 Hz, 10 Hz] Signal energy: X 10000 X 100 27
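The per-band signal energy compared on this slide can be computed from a discrete Fourier transform, as in the minimal sketch below. It uses a plain DFT rather than an FFT for brevity, treats each band as half-open, and takes the 100 Hz sampling rate from the VolcanoSRI slide. The function name `band_energy` and the synthetic 3 Hz test tone are assumptions for illustration.

```python
import cmath
import math


def band_energy(samples, sample_rate_hz, low_hz, high_hz):
    """Sum of squared DFT magnitudes for bins with low_hz <= f < high_hz."""
    n = len(samples)
    energy = 0.0
    for k in range(n // 2 + 1):
        freq = k * sample_rate_hz / n
        if low_hz <= freq < high_hz:
            coeff = sum(
                s * cmath.exp(-2j * math.pi * k * i / n)
                for i, s in enumerate(samples)
            )
            energy += abs(coeff) ** 2
    return energy


# One second of a synthetic 3 Hz tone sampled at 100 Hz.
tone = [math.sin(2 * math.pi * 3 * i / 100) for i in range(100)]
low_band = band_energy(tone, 100, 1, 5)    # [1 Hz, 5 Hz)
high_band = band_energy(tone, 100, 5, 10)  # [5 Hz, 10 Hz)
```

For this tone nearly all of the energy lands in the lower band; a sensor could compare the two values to decide which band carries the signal.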
  • Slide 28
  • Approach Overview Select sensors with best signal qualities FFT (computation-intensive) Local detection Decision fusion sensor selection decision fusion system decision FFT seismic sensor 1 0 1 28 avoid raw data transmission
  • Slide 29
  • Smartphone-based Node Seismometer Geospace Geophone model GS-11D LG GT540 Android 1.6 IOIO board Amplifier External GPS GPS antenna 29
  • Slide 30
  • Field Deployment First deployment on Tungurahua, Ecuador Six nodes, one week, 8/2012 30
  • Slide 31
  • Results 19 days 3.9 months 5% detect prob. Signal collected by permanent seismometer Signal collected by our node Centralized processing Data collection w/ compression STA/LTA Heuristic seismic detection algorithm Weighted decision fusion No sensor selection 31
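STA/LTA, named above as the heuristic baseline, compares a short-term average of signal energy with a long-term average and flags an event when the ratio exceeds a threshold. The sketch below is a generic version of that idea, not the configuration used in the deployment. The window lengths, the threshold, and the names `sta_lta` and `detect` are assumptions.

```python
def sta_lta(samples, sta_len, lta_len):
    """STA/LTA energy ratio for every window ending at index lta_len - 1 or later."""
    energy = [s * s for s in samples]
    ratios = []
    for end in range(lta_len, len(energy) + 1):
        sta = sum(energy[end - sta_len:end]) / sta_len
        lta = sum(energy[end - lta_len:end]) / lta_len
        ratios.append(sta / lta if lta > 0 else 0.0)
    return ratios


def detect(samples, sta_len, lta_len, threshold):
    """Indices of samples at which the STA/LTA ratio exceeds the threshold."""
    ratios = sta_lta(samples, sta_len, lta_len)
    # ratios[0] belongs to the window that ends at sample index lta_len - 1.
    return [i for i, r in enumerate(ratios, start=lta_len - 1) if r > threshold]


# Synthetic trace: background noise, a short burst, then noise again.
trace = [0.1] * 50 + [5.0] * 5 + [0.1] * 20
hits = detect(trace, sta_len=5, lta_len=50, threshold=3.0)
```

The burst starts at sample 50, which is also the first index reported in `hits` for these settings.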
  • Slide 32
  • Outline Data center thermal monitoring Real-time volcano monitoring Barcode streaming for smartphones Background Barcode streaming Implementation and evaluation 32
  • Slide 33
  • Wireless payment -Preserve security and privacy 33 Advertisement -Broadcast brochures, coupons and maps (e.g., retail stores, museums) Data exchange -Transfer small pieces of information between smartphones (e.g., contacts, photos) PayPal inStore App
  • Slide 34
  • 34 QR code [1] Low capacity (typically 50 chars) HCCB [2] (High Capacity Color Barcode) High decoding overhead [1] ISO/IEC 18004:2006. Automatic identification and data capture techniques - QR code 2005 bar code symbology specification. [2] D. Parikh and G. Jancke. Localization and segmentation of a 2D high capacity color barcode. In Applications of Computer Vision, 2008. Not suitable for high-rate streaming
  • Slide 35
  • 35 High capacity & fast decoding rate Smart frame Corner Tracker Timing Reference Blocks Code area Blocks with 4 orthogonal colors Single barcode capacity up to 20 Kbits (4 inch)
  • Slide 36
  • Typical received barcode image Original barcode Distorted barcode image Poor image quality -Low quality camera -Small size and low resolution screen -Relative movement Perspective distortion Severe blur in captured images Limited computation resource - Need to capture and process up to 30 images per second 36
  • Slide 37
  • 37 Blur usually occurs along the border of blocks with different colors Typical barcode image captured by smartphone camera
  • Slide 38
  • 38 Color ordering Goal: Group blocks with same color to reduce border length.
  • Slide 39
  • 39 Implementation - Android 2.3.3 Gingerbread - Sender: 56 KB storage, 5MB RAM, Nexus S (4 inch screen, 800x480) - Receiver: 72 KB storage, 3.5~12MB RAM, HTC Inspire (8MP camera) 200Kbps throughput under various settings - Block size - View angle, alignment - Screen refreshing rate, camera resolution - Mobility, distance - Ambient lighting, screen brightness Nexus S HTC Inspire
  • Slide 40
  • Future Work Data center monitoring Workload scheduling, power optimization Volcano monitoring Signal processing: timing and localization System building: power management and programming interfaces Barcode streaming for smartphones Security of light channel and user authentication 40
  • Slide 41
  • Acknowledgement Group members Tian Hao (Ph.D, 2010-), Yu Wang (Ph.D, 2010-), Jun Huang (Ph.D, 2009-), Ruogu Zhou (Ph.D, 2009-), Dennis Philips (Ph.D, 2009-), Jinzhu Chen (Ph.D, 2010-), Mohammad-Mahdi Moazzami (Ph.D, 2011-), Fatme El-Moukaddem (Ph.D, co-supervised with Dr. Eric Torng), Rui Tan (Postdoc) National Science Foundation CDI, VolcanoSRI, 2011-2015 (in collaboration with WenZhan Song @ Georgia State University, Jonathan Lees @ University of North Carolina, Chapel Hill) CAREER, performance-critical sensor networks, PI, 2010-2015. ECCS, aquatic sensor networks, PI, 2010-2013 (in collaboration with Xiaobo Tan @ MSU) CNS, real-time and performance control of networked sensor system, MSU PI, 2012-2015 (in collaboration with Xiaorui Wang @ Ohio State) CNS, Interference in crowded spectrum, MSU PI, 2009-2012 (in collaboration with Gang Zhou @ William & Mary) 41
  • Slide 42
  • Representative Publications J. Chen, R. Tan, Y. Wang, G. Xing, X. Wang, X. Wang, B. Punch, D. Colbry, A High-Fidelity Temperature Distribution Forecasting System for Data Centers, The 33rd IEEE Real-Time Systems Symposium (RTSS), 2012, acceptance ratio: 35/157=22% R. Tan, G. Xing, J. Chen, W. Song, R. Huang, Quality-driven Volcanic Earthquake Detection using Wireless Sensor Networks, 31st IEEE Real-Time Systems Symposium (RTSS), 2010. T. Hao, R. Zhou, G. Xing, COBRA: Color Barcode Streaming for Smartphone Systems, The 10th International Conference on Mobile Systems, Applications, and Services (MobiSys), 2011, acceptance ratio: 32 / 182 = 17.5% J. Huang, G. Xing, G. Zhou, R. Zhou, Beyond Co-existence: Exploiting WiFi White Space for ZigBee Performance Assurance, The 18th IEEE International Conference on Network Protocols (ICNP), 2010, acceptance ratio: 31/170 = 18.2%, Best Paper Award (1 out of 170 submissions). R. Zhou, Y. Xiong, G. Xing, L. Sun, J. Ma, ZiFi: Wireless LAN Discovery via ZigBee Interference Signatures, The 16th Annual International Conference on Mobile Computing and Networking (MobiCom), acceptance ratio: 33/233=14.2%. S. Liu, G. Xing, H. Zhang, J. Wang, J. Huang, M. Sha, L. Huang, Passive Interference Measurement in Wireless Sensor Networks, The 18th IEEE International Conference on Network Protocols (ICNP), acceptance ratio: 31/170 = 18.2%, Best Paper Candidate (6 out of 170 submissions). X. Xu, L. Gu, J. Wang, G. Xing, Negotiate Power and Performance in the Reality of RFID Systems, The 8th Annual IEEE International Conference on Pervasive Computing and Communications (PerCom), acceptance ratio: 27/227=12%, Best Paper Candidate (3 out of 227 submissions). 42
  • Slide 43
  • 43 Streaming barcodes between screen and camera receiver sender Real-time visible light communication (VLC) system for off-the-shelf smartphones -Encode info into color barcodes -Stream barcodes from screen to camera -High communication throughput (70~200 kbps for 4 inch, 800x640 screen)
  • Slide 44
  • Quality-driven Earthquake Detection Assured false alarm rate & detection probability Real-time detection Temporal resolution: 1s Long network lifetime Avoid raw data transmission 44
  • Slide 45
  • Encode data into barcodes and display on the screen PRE-PROCESSING Color enhancement Blur assessment CODE EXTRACTION Code Scan Smart Frame detection Sender Receiver CODE GENERATION Motion-aware coding Blur-aware color ordering 45
  • Slide 46
  • CODE GENERATION Motion-aware coding Sender Receiver Select and enhance the received images PRE-PROCESSING Color enhancement Blur assessment CODE EXTRACTION Code Scan Smart Frame detection 46 Blur-aware color ordering
  • Slide 47
  • Sender Receiver Extract data from enhanced images CODE EXTRACTION Code Scan Smart Frame detection CODE GENERATION Motion-aware coding PRE-PROCESSING Color enhancement Blur assessment 47 Blur-aware color ordering
  • Slide 48
  • Decision Fusion at BS Extended majority rule: decide 1 if (# of positive local decisions) / (total # of sensors) > threshold Closed-form detection performance $P_F = f(P_{F1}, P_{F2}, \ldots, P_{FN})$, $P_D = f(P_{D1}, P_{D2}, \ldots, P_{DN})$ $P_{Fi}$ / $P_{Di}$: false alarm rate / detection prob. of sensor $i$ 48
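The fusion rule on this slide can be sketched as follows. The slide does not give the form of f or the details of the extended rule, so the sketch assumes a plain threshold on the fraction of positive local decisions and statistically independent sensors. Under those assumptions the system-level probability is the tail of a Poisson binomial distribution. The names `fuse` and `system_probability` are hypothetical.

```python
def fuse(local_decisions, threshold):
    """Decide 1 when the fraction of positive local decisions exceeds threshold."""
    return 1 if sum(local_decisions) / len(local_decisions) > threshold else 0


def system_probability(local_probs, threshold):
    """Probability that the fused decision is 1, assuming independent sensors.

    Pass per-sensor false alarm rates to obtain the system false alarm rate,
    or per-sensor detection probabilities to obtain the system detection
    probability.
    """
    # dist[k] is the probability that exactly k sensors report 1.
    dist = [1.0]
    for p in local_probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1 - p)
            nxt[k + 1] += mass * p
        dist = nxt
    n = len(local_probs)
    return sum(mass for k, mass in enumerate(dist) if k / n > threshold)


decision = fuse([1, 0, 1], threshold=0.5)
p_system = system_probability([0.5, 0.5, 0.5], threshold=0.5)
```

With three sensors and a threshold of 0.5, two positive local decisions are enough for the base station to decide 1.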
  • Slide 49
  • 49 Small block size can achieve higher throughput (>200 kbps) at the cost of lower decoding rate (99.5%) at the cost of lower throughput (