This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore...

View
219
Download
1
Category

Documents

Tags:

Preview:

Citation preview

This work was performed under the auspices of the U.S. Department of Energy by Lawrence

Livermore National Laboratory under Contract DE-AC52-07NA27344.

Petascale Tools Workshop August 2014

LLNL-PRES-xxxxxx

Gremlins’ the Sequel – the Horrors of Exascale

Judit Gimenez, BSCMartin Schulz, LLNL

http://scalability.llnl.gov/

http://www.bsc.es/

Lawrence Livermore National LaboratoryPetascale Tools Workshop

Judit Gimenez and Martin Schulz

Can we make a Petascale class machine

behave like what we expect Exascale machines to look like?

• Limit Resources (power, memory, network, I/O, …)• Increase compute/bandwidth ratios• Increase fault rates and lower MTBF rates

In short: release GREMLINs into a petascale machine

Goal: Emulation Platform for the Co-Design process• Evaluate proxy-apps and compare to baseline• Determine bounds of behaviors proxy apps can tolerate• Drive changes in proxy apps to counter-act exascale properties

How (not) to Build an Exascale Machine

Lawrence Livermore National LaboratoryPetascale Tools Workshop

Judit Gimenez and Martin Schulz

Power• Impact of changes in frequency/voltage• Impact of limits in available power per machine/rack/node/core

Memory• Restrictions in bandwidth• Reduction of cache size• Limitations of memory size

Resiliency• Injection of faults to understand impact of faults• Notification of “fake” faults to test recovery

Noise• Injection of controlled or random noise events• Crosscut summarizing the effects of previous GREMLINs

Broad Classes of GREMLINs

Lawrence Livermore National LaboratoryPetascale Tools Workshop

Judit Gimenez and Martin Schulz

Power• Impact of changes in frequency/voltage• Impact of limits in available power per machine/rack/node/core

Memory• Restrictions in bandwidth• Reduction of cache size• Limitations of memory size

Resiliency• Injection of faults to understand impact of faults• Notification of “fake” faults to test recovery

Noise• Injection of controlled or random noise events• Crosscut summarizing the effects of previous GREMLINs

Broad Classes of GREMLINs

Lawrence Livermore National LaboratoryPetascale Tools Workshop

Judit Gimenez and Martin Schulz

Using RAPL to install power caps• Exposes chip

variations• Turns homogenous

machines into inhomogeneous ones

Optimal configuration under a power cap• Widely differing

performance• Application specific

characteristics• Need for models

Results from the Power Gremlins

Lawrence Livermore National LaboratoryPetascale Tools Workshop

Judit Gimenez and Martin Schulz

Low-level infrastructure• Libmsr: user-level API to enable access to MSRs (incl. RAPL

capping)• Msr-safe: kernel module to make MSR access “safe”

Current status• Support for Intel Sandy Bridge• More CPUs (incl. AMD) in progress• Code released on github: https://github.com/scalability-llnl/

libmsr• Inclusion into RHEL pending• Deployed on TLCC cluster cab

Analysis update• Full system characterization (see Barry’s talk)• Application analysis in progress

Update on Power Gremlin Infrastructure

https://github.com/scalability-llnl/libmsr

https://github.com/scalability-llnl/libmsr

Lawrence Livermore National LaboratoryPetascale Tools Workshop

Judit Gimenez and Martin Schulz

Scheduling research• Find optimal configurations for each code• Balance processor inhomogeneity• Understand relationship to load balancing• Integration into FLUX

New Gremlins• Artificially introduce noise events• Network Gremlins

— Limit network bandwidth or increase latency— Inject controlled cross traffic

Adaptation of the Gremlins to new programming models• Initially developed for MPI (using PnMPI as base)• First new target: OmpSs

Next Steps: Feeding the Gremlins After Midnight

Lawrence Livermore National LaboratoryPetascale Tools Workshop

Judit Gimenez and Martin Schulz

BSC Experiments with Gremlins

Gremlins integrated with • OmpSs runtime• Extrae instrumentation (online mode)

Analysis of Gremlins’ impact• Living with Gremlins! Measure applications’ sensitivity• Growing our Gremlins! uniform vs. non-uniform populations• Have Gremlins side-effects?

— Do they increase/affect variability? — Should not affect other resources

First results• Up to now playing with memory Gremlins

Lawrence Livermore National LaboratoryPetascale Tools Workshop

Judit Gimenez and Martin Schulz

Integrating Gremlins in OmpSs runtime system

Gremlins launched at runtime initialization but remain transparent

Each gremlin thread is exclusively pinned on one core

These processors then become inaccessible by the runtime

Runtime parameters can be used to • enable/disable gremlins, • define number of gremlin threads, resource type • how much of that resource a single gremlin thread should

use

Marc Casas

Lawrence Livermore National LaboratoryPetascale Tools Workshop

Judit Gimenez and Martin Schulz

Current Work

Identify sensitive tasks• Classify how sensitive they are• Match them with tasks that can be run concurrently and are

less sensitive to the respective resource type

Implement smart Scheduler in OmpSs• Modify OmpSs scheduler to identify resource sensitive tasks

with the use of gremlin threads• Implement a scheduler that takes this information into

account when scheduling tasks for execution

Lawrence Livermore National LaboratoryPetascale Tools Workshop

Judit Gimenez and Martin Schulz

Results L3 cache capacity interference

0 gremlin - 20MB

1 gremlin - 15MB

2 gremlin - 12MB

3 gremlin - 7MB

4 gremlin - 4MB

Lawrence Livermore National LaboratoryPetascale Tools Workshop

Judit Gimenez and Martin Schulz

Results with memory BW interference

0 gremlin - 40GB/s

1 gremlin - 37.2GB/s

2 gremlin - 34.3GB/s

3 gremlin - 31.5GB/s

4 gremlin - 28.7GB/s

Lawrence Livermore National LaboratoryPetascale Tools Workshop

Judit Gimenez and Martin Schulz

CGPOP sensitivity LLC cache size and bw

Without gremlins

Limiting size (4MB)

Limiting bandwidth (28.7 GB/s)

Lawrence Livermore National LaboratoryPetascale Tools Workshop

Judit Gimenez and Martin Schulz

CGPOP sensitivity LLC cache – variability / unbalance

4lnj8dmatvec

8hyn9lpcg

Limiting size (4MB)

Limiting bandwidth (28.7 GB/s)

Lawrence Livermore National LaboratoryPetascale Tools Workshop

Judit Gimenez and Martin Schulz

CGPOP sensitivity LLC cache – unbalance along time

8hyn9lpcg

Limiting size (4MB)

Limiting bandwidth (28.7 GB/s)

Lawrence Livermore National LaboratoryPetascale Tools Workshop

Judit Gimenez and Martin Schulz

CGPOP sensitivity LLC cache – tracking evolution

4lnj8dmatvec

8hyn9lpcgahyn9lpcg3hyn9lpcg

Limiting bandwidth

Lawrence Livermore National LaboratoryPetascale Tools Workshop

Judit Gimenez and Martin Schulz

Integrating Gremlins in Extrae

Extrae online analysis mode• Based on MRNet

Gremlins API to activate them locally• All Gremlins launched at initialization time

First experiments with LLC cache size gremlins• Periodic increase of Gremlins• Unbalanced steal of resources

Lawrence Livermore National LaboratoryPetascale Tools Workshop

Judit Gimenez and Martin Schulz

LULESH sensitivity to LLC cache size

Can we extract insight from chaos?• Unbalanced Gremlins creation

Lawrence Livermore National LaboratoryPetascale Tools Workshop

Judit Gimenez and Martin Schulz

LULESH sensitivity to LLC cache size – tracking evol.

L3/in

str.

ra

tio

Lawrence Livermore National LaboratoryPetascale Tools Workshop

Judit Gimenez and Martin Schulz

Conclusions

Different regions show different sensitivity to resource reductions

Asynchrony affects actual sharing of resources• Today happens without control variability

Detailed analysis detects increases on variability and potential non-uniform impact

Recommended

ASCR Update August 24, 2010This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. LLNL-PRES-XXXXXX

Documents

Attachment 1 to Modification 093 - NNSS · 2016-08-24 · Attachment 1 to Modification 093 DE-AC52-06NA25946 A. Part I, The Schedule, Section B, Supplies or Service and Prices/Costs,

Documents

LLNL-PRES-668705 This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344

Documents

Swiss Re Finance (Jersey) Limited 2019 Annual Report7c197c95-ff33-4b35-ac52-efc6567… · Net income/loss attributable to common shareholder 182 – 247 –236 Premiums earned and

Documents

General Model SM21/ SR20/ SC21/ SM29/SC29/ SM60 ... · Industrial Electrodes for pH/Redox GS 12B6J1-E-E 17th Edition ... AL4 AL6 SR20 AC22 AP24 AP26 SR20 AC52 AC32 AS52 AAP26 AGP26

Documents

Data Practice Presentation5c4c0291-ac52-42f3-a15c... · Machine Learning Algorithm development to support data-driven predictions and decision-making Blockchain Distributed ledger

Documents

· NSN 7540-01-152-8070 Previous edition unusable . Los Alamos National Security, LLC Contract No. DE-AC52-06NA25396 Modification No. 201 Page 2 of 2 Prime Contract Section J, Appendix

Documents

Lawrence Livermore National Laboratory · PDF fileLawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 . ... 8555 reactions 712 reactions 70 60 50 40 30 ... 2008

Documents

CMBE GFDL AM2 vs. CMBE - ASR · This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344

Documents

cbsegames.incbsegames.in/Results_18/National Volleyball 2019.pdfPublic School, Sendhwa (MP.) GROUP UNDER-19 UNDER-19 UNDER-19 UNDER-19 UNDER-17 UNDER-17 UNDER-17 UNDER-17 GIRLS GIRLS

Documents

Power System Co-simulation at LLNL: Lessons Learned and ... · National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC Power System Co-simulation

Documents

Responding to the Software Crisis in DOE Scientific Computingideas-productivity.org/wordpress/wp-content/uploads/2015/09/02-asc-pope.pdfNational Laboratory under contract DE-AC52-07NA27344

Documents

This work was performed under the auspices of the Lawrence Livermore National Security, LLC, (LLNS) under Contract No. DE-AC52-07NA27344 Lawrence Livermore

Documents

LLNL-PRES-?????? This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344

Documents

LLNL-PRES-638575 This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344

Documents

Employee Benefits Orientation...This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344

Documents

ceodelhi.nic.inceodelhi.nic.in/WriteReadData/AssemblyConstituency/AC52/...ELECTORAL ROLL 2016, STATE- (U05) DELHI No. Name and Reservation Status of Assembly Constituency : 52 - TUGHLAKABAD,

Documents

Mie Scattering Analysiskmbr267/MieReport2NSF.pdfThis report was authored by the National Security Technologies, LLC, under Contract No. DE-AC52-06NA25946 with the U.S. Department of

Documents

Under Down Under

Documents

MODEL NO. WE12538M 12.5 HP 38 Inch Lawn Tractordl.owneriq.net/7/7911ddb9-b879-4cb6-ac52-2944bab74575.pdf · MODEL NO. WE12538M 12.5 HP 38 Inch Lawn Tractor For Parts and Service,

Documents