Integration and Performance of New Technologies
in the CMS Simulation
Kevin Pedro (FNAL), on behalf of the CMS Collaboration
November 5, 2019
• High Luminosity LHC will have an instantaneous luminosity of 5–7.5×10^34 cm^-2 s^-1 → more data, more radiation, more pileup
• CMS detector will be upgraded to cope with these conditions
o Including High Granularity Calorimeter (HGCal) in the endcap region
• Simulating HGCal requires 10× more geometry volumes → 40–60% increase in simulation time
• HGCal also requires more accurate physics lists to simulate precisely measured particle showers
• Reconstruction scales with pileup, requires more complex algorithms → larger fraction of total CPU usage in Run 4 vs. Run 2
Run 4 simulation needs to provide more accuracy in more complicated geometry… w/ smaller fraction of total CPU usage vs. Run 2
Simulation Challenges
CHEP 2019, Kevin Pedro (FNAL)
[Figure: HGCal endcap calorimeter, with CE-E and CE-H sections]
• CMS currently uses a modified physics list, FTFP_BERT_EMM:
o Simplified multiple scattering model for most regions (not HCAL, HGCal)
o Brings ~15% reduction in simulation CPU usage vs. default FTFP_BERT
Accurate simulation of showers in HGCal requires more involved models
• Test case: custom physics list FTFP_BERT_EMN:
o Goudsmit-Saunderson model for e+e– multiple scattering below 100 MeV
o Angular generator for bremsstrahlung
o More accurate Compton scattering model
Simulation takes 2–3× longer w/ new physics list
Simulating HGCal Physics
Relative simulation time by physics list (Geant4 10.5.ref08):

Sample         Physics list     Run 2   Run 4 (range)
Minimum Bias   FTFP_BERT_EMM    1.00    1.18–1.24
Minimum Bias   FTFP_BERT_EMN    1.06    2.01–2.15
ttbar          FTFP_BERT_EMM    1.00    1.64–1.75
ttbar          FTFP_BERT_EMN    1.14    2.97–3.25
A. Gheata
• CMS constantly tests technical improvements and physics-preserving approximations to speed up simulation
Led to 4–6× faster simulation vs. Geant4 default settings
o Major contributions: Russian Roulette algorithm, forward shower libraries, specific cuts/models per detector region
o New products like VecGeom also important (7–14% speedup)
• However, even this will not suffice for the demands of HL-LHC
• Enter GeantV: Vectorized Transport Engine
o Track-level parallelism & basketization of similar tracks
o Better usage of modern CPU registers by improving data locality
Single instruction, multiple data (SIMD) vectorization
Simulation R&D
• Important to test R&D products in experiment software framework:
1. Engage in co-development between R&D team and experiments to prevent divergences or incompatibilities
2. Measure any potential CPU penalties from running in full software framework
3. Estimate human cost for experiments to adapt to new interfaces when considering migrating to new, backward-incompatible tools
Tests of GeantV integration have met these goals
• Not currently planning to migrate CMS simulation to GeantV
o These tests should be viewed as a demonstrator of feasibility
o Also provide independent measurements of any speedup
GeantV compared to Geant4 w/ VecGeom (“moving target”)
Testing GeantV in CMSSW
• Generate events in CMSSW framework, convert HepMC to GeantV format
• Build CMSSW geometry natively and pass to GeantV engine
o Currently using TGeo; could be improved with DD4hep
• Using constant magnetic field, limited EM-only physics list
• Calorimeter scoring adapted
• Run GeantV using CMSSW ExternalWork feature:
o Asynchronous, non-blocking, task-based processing
• Hits output in CMS format, immediately suitable for digitization etc.
Elements of GeantV Integration
[Diagram: a CMSSW thread calls acquire(); GeantV runs as external processing while the thread does other work; produce() then collects the results]
• Sensitive detectors and scoring trickiest to adapt
o Necessary to test “full chain” (simulation → digitization → reconstruction)
o Significantly more complicated than multithreaded Geant4
• Duplicate scoring class objects per event per thread, then aggregate → 4 streams × 4 threads = 16 objects
o GeantV TaskData supports this approach
Geant4 vs. GeantV Scoring
[Diagrams: in Geant4, each event has its own sensitive detector (SD) objects turning particles into hits; Geant4 shares memory, but each event is processed in a separate thread. In GeantV, each event is processed in multiple threads, mixed in with other events, so the mapping of SDs and hits to events is not automatic.]
• Each ScoringClass object has instance of sensitive detector class
o Some memory overhead from duplicated class members (can be minimized)
• Merged ScoringClass output copied to cache attached to Event object
o Cache used by CMSSW to write output
GeantV Data Aggregation
[Diagram: the RunManager spawns threads, each with a TaskData holding per-thread data with one ScoringClass per concurrent event; the UserApplication merges each event's ScoringClass objects, via a TaskDataHandle, into the EventScoringCache]
Goal: use exact same scoring code for Geant4 and GeantV
• Problem: totally incompatible APIs
o Example: G4Step::GetTotalEnergyDeposit() vs. geant::Track::Edep()
• Solution: template wrappers with unified interfaces, e.g. StepWrapper<T>::getEnergyDeposit()
o Scoring code only calls accessors of wrapper, which stores pointer to T
o No branching or virtual table → minimize overhead
o Wrappers for Run, Event, Step, Volume → collected in single Traits class
• Scoring class templated on Traits class
Preserve 1000s of lines of scoring code
o Wrappers & data aggregation approach make integration straightforward
Path to Integration
• Generate 50 GeV electrons w/ random η, φ direction
• Energy, time, # hits agree within ~5% between Geant4 and GeantV
Physics Validation
[Validation plots: ECAL barrel, Geant4 vs. GeantV]
• Settings:
o GeantV pre-beta-7+ (63468c9b) w/ vectorized multiple scattering, field
o Generate & reuse 500 events w/ 2 electrons, E = 50 GeV, random η & φ
o Keep # events / thread constant, keep unused threads busy, disable output
o CMS 2018 (Run 2) geometry, calorimeter scoring enabled
• Machine: Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz, 35840 KB cache, 28 cores, sse4.2 instructions (from CERN OpenLab)
• Track wall clock time & memory with CMSSW TimeMemoryInfo tool
o Measures VSIZE, RSS per event
o Calculate throughput and speedup from wall time
o Multithreaded jobs are necessary for efficient use of computing grids
• To characterize CMSSW performance results, first run built-in GeantV FullCMS standalone test (single thread, settings as similar as possible)
o GeantV: 526.796 s, Geant4: 842.76 s
→ 1.6× speedup (standalone)
Performance Tests
• speedup = throughput(threads=N) / throughput(threads=1)
• G4 has better scaling w/ # threads than GV
Time Performance
• GV 1.7× faster than G4 single thread, still ~1.3× faster in multithreaded mode
• GV single track mode similar to basketized version
[Plots: throughput [evt/s] and throughput ratio GV/G4 vs. # threads]
• Memory grows ~linearly w/ # threads (expected)
• GV uses more memory than G4 (expected)
• GV single track and basketized modes use similar memory
Memory Performance
[Plots: RSS memory and RSS ratio GV/G4 vs. # threads]
• CMS upgrade simulation for HL-LHC may be 2–3× slower than Run 2
o Driven by HGCal (increased geometry volumes, more precise physics list)
• CMS integration tests of GeantV met all goals
1. Co-development ensured compatible threading models & interfaces
2. Similar speedup measured in full experiment software framework
3. Efficient path to integration established
• GeantV prototype is a useful demonstrator:
o Benefits from rewriting/modernizing simulation code
Primary speedup from smaller compiled code → fewer instruction cache misses
o Use of abstraction libraries for efficient code on multiple platforms
o R&D products being integrated into Geant4 (e.g. VecGeom)
Next step: GeantX project for HPCs w/ GPUs
Conclusions
• Results and R&D presented here are the products of years of work by many scientists, developers, etc. – a (multi-) team effort!
• Thanks to:
o Geant4 Collaboration
o GeantV R&D Team
o CMS Simulation Group
o CMS Core Software Group
o HEP Software Foundation
o Support from Intel, Fermilab, and CERN OpenLab
Acknowledgements
• M. Hildreth et al., “CMS Full Simulation for Run-2”, J. Phys. Conf. Ser. 664 (2015) 072022.
• HEP Software Foundation, “A Roadmap for HEP Software and Computing R&D for the 2020s”, Comp. Soft. Big Sci. 3 (2019) 7, arXiv:1712.06982.
• HEP Software Foundation, “Detector Simulation White Paper”, HSF-CWP-2017-07, arXiv:1803.04165, October 2017.
• D. Elvira et al., “CMS Simulation in the HL-LHC Era”, HSF-CWP-011, January 2017.
• J. Apostolakis et al., “Towards a high performance geometry library for particle-detector simulations”, J. Phys. Conf. Ser. 608 (2015) 012023.
• K. Pedro, “Current and Future Performance of the CMS Simulation”, Eur. Phys. J. Web Conf. 214 (2019) 02036.
• K. Pedro, “Integration of new simulation technologies in the experiments”, Joint HSF/OSG/WLCG Workshop, JLab, March 2019.
• A. Gheata, “Design, implementation and performance results of the GeantV prototype”, HSF meeting, October 2019.
• G. Amadio et al., “GeantV: Concurrent particle transport simulation in HEP”, in progress, for submission to Comp. Soft. Big Sci.
References
• CMSSW (GitHub/cms-sw)
o CMS Offline Software, ~6 million LOC
• VecCore (GitHub/root-project)
o SIMD abstraction library
o Supports backends: Vc, UME::SIMD, CUDA
• VecMath (GitHub/root-project)
o Vectorized math utilities
o Built on top of VecCore
• VecGeom (CERN/GitLab)
o Vectorized geometry and navigation, multi-particle interface
• GeantV (CERN/GitLab)
o pre-beta-7 tag now available!
• install-geant (GitHub/kpedro88), SimGVCore (GitHub/kpedro88)
o Test repositories to install and integrate GeantV in CMSSW
Repositories
• Phase 1 upgrades began during Run 2 and will be in operation through the end of Run 3 (installation finishes during Long Shutdown 2)
• Phase 2 upgrades will be in operation during Runs 4, 5 (installation during Long Shutdown 3)
CMS & LHC Upgrade Schedule
[Timeline: Phase 0, Phase 1, Phase 2 (HL-LHC), with a “we are here” marker]
• CMS full simulation uses Geant4
• Run 2: Sim is 40% of total CPU time in CMS → most expensive “step” in MC production (vs. generation, digitization, reconstruction)
• Largest contributors to CPU usage in Geant4: geometry, magnetic field, EM physics
CMS Simulation Overview
[Pie chart: Geant4 10.0p02 CPU usage (similar in other versions) — slices of ~60%, ~15%, ~10% among Geometry/Field, EM physics, Had. physics, CMS user actions, and other]
• HF shower library, Russian Roulette have largest impacts
• VecGeom, mag. field improvements entered production in past ~year
o Enabled by validating and using latest Geant4 versions
• Cumulative effects: overall, simulation is 6.2× (4.1×) faster for minimum bias (ttbar) vs. default Geant4 settings
CMS full simulation is at least 8× faster than ATLAS
CMS Simulation Optimizations
Relative CPU usage:

Configuration            MinBias   ttbar
No optimizations         1.00      1.00
Static library           0.95      0.93
Production cuts          0.93      0.97
Tracking cut             0.69      0.88
Time cut                 0.95      0.97
Shower library           0.60      0.74
Russian roulette         0.75      0.71
FTFP_BERT_EMM            0.87      0.83
VecGeom (scalar)         0.87      0.93
Mag. field step, track   0.92      0.90
All optimizations        0.16      0.24
Setup:
• TBB controls running modules
• Concurrent processing of multiple events
• Separate helper thread to control external process
• Can wait until enough work is buffered before running external process
External Work in CMSSW (1)
Acquire:
• Module acquire() method called
• Pulls data from event
• Copies data to buffer
• Buffer includes callback to start next phase of module running
External Work in CMSSW (2)
Work starts:
• External process runs
• Data pulled from buffer
• Next waiting modules can run (concurrently)
External Work in CMSSW (3)
Work finishes:
• Results copied to buffer
• Callback puts module back into queue
External Work in CMSSW (4)
Produce:
• Module produce() method is called
• Pulls results from buffer
• Data used to create objects to put into event
External Work in CMSSW (5)
• Observed speedup: 2.0 ± 0.5
o Primary source of speedup actually reduced instruction cache misses
Modern codebase → smaller compiled library
Speedup depends strongly on CPU cache size
o Vectorization gives ~15–30% depending on instruction set (sse vs. avx)
Significant portion of operations still not vectorized
Speedup for EM physics alone: 1.5–3× on Haswell, 2–4× on Skylake w/ AVX2
Amdahl’s law limits overall speedup
Inconsistent support for AVX on worldwide grid
o More details in backup
GeantV Status
GeantV Basketizing Efficiency
CMS application benchmark:
• 100 GeV isotropic e-
• 100 primaries, 16 event slots
• Field type: CMS map
• 1 thread, performance mode

• Bad vectorization efficiency overall
• Up to 10–12% depending on platform
• Large overhead, sometimes bigger than vectorization
• Dispatching to many small basketizers (geometry) not effective
• Good use cases: field and MSC (larger % FLOP)
Fractions of total scalar execution time
Xeon® CPU E5-2630 v3 @ 2.40 GHz, 16 cores, 2×32 KB L1/core, 256 KB L2, 20 MB L3
+ Only post-step sampling of physics models
* Only querying distance to boundary and safety
x Only MSC position/direction correction calculation
o Best configuration for vectorization (Field/Physics/MSC)
Measurement errors < 0.5%
• Overhead: Bo = (T_BE – T_scalar) / T_scalar
• Observed efficiency: Be = (T_scalar – T_vector) / T_scalar
• Vectorization efficiency: Bv = Be + Bo
A. Gheata
Basketizing Overhead vs. Arch
Geant4 (10.4.p03) vs. GeantV (beta), 10 GeV electron × 1000 events (1 thread, 10 measurements), 16 event slots
Overhead seems to largely increase for smaller (data) cache size
A. Gheata
Performance Summary
Geant4 (10.4.p03) vs. GeantV (beta), 10 GeV electron × 1000 events (1 thread, 10 measurements)
strk (single track mode): emulation of Geant4 style tracking
CPU performance of G4/GV varies significantly over different platforms
A. Gheata