
Page 1

Status report from the Deferred Trigger Study Group

John Baines, Giovanna Lehmann Miotto, Wainer Vandelli, Werner Wiedenmann, Eric Torrence, Armin Nairz

Page 2

Use cases
Deferred Triggers:
• Subset of events stored in the DAQ system & processed later in the run, in a separate stream
• Potentially useful when CPU is saturated at the start of a fill
• Broadly two different classes of use case:
– Deferred HLT processing: deferred stream based on L1. Build the event, cache it, then run the HLT later.
• Caching at ~5-10 kHz, deferred processing ~1 s/event, rejection ~50-100
• High cache rate => need high replay rate => shorter per-event processing time
• EB rate for deferred + prompt must fit in the budget (~20 kHz for the 2nd-generation ROS)
• e.g. cache all L1 multi-jet events (3-5 kHz for 4J20 at 2-3×10^34) & run topoclustering
• e.g. cache L1 multi-jet and/or high-pT dilepton triggers & run HLT tracking for a displaced-vertex trigger
– Post-HLT processing: deferred stream based on the HLT result. Very similar to the L4 case.
• Caching at ~0.5-1 kHz, deferred processing ~10 s/event, rejection ~5-10
• Lower cache rate => lower replay rate => longer per-event processing time
• Could be used to increase efficiency for the same Tier-0 rate: apply a looser selection in the HLT, then the deferred trigger runs a slower offline selection & applies tighter cuts
• e.g. deferred stream for triggers requiring full-event EF tracking, e.g. MET, b-jet, Tau
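As a back-of-envelope check of the rate arithmetic above, the sketch below verifies the event-building budget and the per-event time budget. The 20k-core farm figure is taken from the requirements table later in this note; the prompt event-building rate is a hypothetical placeholder, since it is not quoted here.

```python
# Back-of-envelope check of the deferred-trigger rate arithmetic.
# FARM_CORES comes from the requirements table later in this note;
# PROMPT_EB_RATE_HZ is a hypothetical placeholder.

FARM_CORES = 20_000
EB_BUDGET_HZ = 20_000       # ~20 kHz for the 2nd-generation ROS
PROMPT_EB_RATE_HZ = 10_000  # assumed prompt event-building rate

def fits_eb_budget(cache_rate_hz):
    """Deferred + prompt event building must fit in the ROS budget."""
    return cache_rate_hz + PROMPT_EB_RATE_HZ <= EB_BUDGET_HZ

def per_event_budget_s(replay_rate_hz):
    """Higher replay rate => shorter per-event processing budget."""
    return FARM_CORES / replay_rate_hz

print(fits_eb_budget(5_000))        # deferred-HLT case: 5 kHz cache -> True
print(per_event_budget_s(20_000))   # ~1 s/event at 20 kHz replay
print(per_event_budget_s(2_500))    # ~8 s/event at 2.5 kHz replay
```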

Page 3

Deferred Trigger Processing Options
Processing options:
• Inter-fill processing: only process the deferred stream between physics fills
– Different processes from the normal prompt triggers (file-based, like offline debug-stream recovery)
– Started between fills when the farm is relatively idle; stopped when a new fill starts
=> Baseline option
• In-fill plus inter-fill processing: attempt to also make use of spare CPU capacity later in the run
– Little gain if the LHC applies luminosity levelling
– Could be in competition with end-of-fill triggers
– Dynamic partitioning of the HLT farm:
• Different processes for normal prompt processing & for processing cached events
• Need to dynamically vary the partitioning as CPU usage changes for prompt processing
• Delay to reconfigure the partition and start/stop processes
=> Significant difficulties for DAQ => Disfavoured option
– Variable deferral fraction: still only inter-fill processing of cached events, but add the ability to process some fraction of the deferred triggers promptly in the normal trigger processes
• Mechanism similar to prescales, used to update the fraction of events cached during the run
• Relatively small change online, but events from the same LB are split between prompt and deferred files
=> Significant additional complexity for Tier-0 => Disfavoured option

Page 4

Storage options
• Distributed storage: local disks of the HLT nodes
+ Potentially large, ~1600 TB, but not RAID disk => not secure
+ Distributed => playback not limited by data rates from disk
- Book-keeping & operations difficulties; can't balance the load for playback
=> Disfavoured option
• Central storage: expand the existing SFO
+ Secure storage; much higher fault tolerance
+ Can balance the load across the farm during playback
+ Straightforward book-keeping
+ Minimises the changes needed to the current system
- Playback limited to data rates of ~5 GB/s (2.5 kHz event rate)
=> Baseline option
• Clustered storage: per-rack SFO-like disk servers
• Lower number of disks than the distributed scheme => retains some of the advantages of the central scheme
• More distributed than the central scheme => higher playback rates => solution for higher rates
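The bandwidth and event-rate figures above are mutually consistent if the average event size is ~2 MB, as implied by the kHz and GB/s pairs in the rate tables later in this note; a minimal conversion sketch:

```python
# Convert disk bandwidth to a sustainable event rate.
# The ~2 MB average event size is implied by the rate tables in this
# note (0.5 kHz <-> 1 GB/s), not quoted directly.

EVENT_SIZE_GB = 0.002  # ~2 MB/event

def rate_khz(bandwidth_gb_per_s):
    return bandwidth_gb_per_s / EVENT_SIZE_GB / 1e3

print(rate_khz(5.0))  # central-storage playback: 2.5 kHz
print(rate_khz(2.0))  # current SFO read bandwidth: 1.0 kHz
```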

Page 5

Order of Magnitude Cost Estimate
Baseline:
• Inter-fill processing, central storage
• 1 kHz caching rate, 2.5 kHz playback
• 8 s/event processing time
• 210 TB disk cache
• Hardware cost: ~100 kCHF
• Possible use case: EF full-scan for MET/Tau/b-jet

High-Rate System (Baseline × 10):
• Inter-fill processing, clustered storage
• 10 kHz caching rate, 25 kHz playback
• 0.8 s/event processing time (a factor of 10 less)
• 2100 TB disk cache
• Hardware cost: ~1 MCHF
• Possible use cases: multi-jets with topoclustering; displaced-vertex trigger with L2 ID full-scan for multi-jets or high-pT muons

In both cases the processing power is equivalent to 40% of the current farm capacity, with a wall-time to process of <30 hours, based on 2012 fill data (could be longer for a more efficient LHC).

Effort needed:
• 3.5 SY for online s/w infrastructure changes + 0.25 SY for Tier-0 s/w infrastructure changes
• Excludes the effort to develop, configure & install hardware, and operational effort
Time-scale:
• 1 year for s/w development + commissioning during an extended break, e.g. a winter shutdown
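The two disk-cache sizes correspond to roughly the same number of hours of buffered data, consistent with the <30-hour wall-time quoted above; a quick check (again assuming ~2 MB/event, so 1 kHz of caching is 2 GB/s):

```python
# Express each disk cache as the hours of caching it can absorb.
# Assumes ~2 MB/event, i.e. 1 kHz caching = 2 GB/s, as in the tables.

for cache_tb, rate_gb_per_s in ((210, 2.0), (2100, 20.0)):
    hours = cache_tb * 1e12 / (rate_gb_per_s * 1e9) / 3600
    print(f"{cache_tb} TB at {rate_gb_per_s} GB/s -> ~{hours:.0f} h of data")
```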

Page 6

Summary
• A deferred stream could have significant benefits for a CPU-limited farm, BUT:
– Deferred-stream processing is only suitable for specific use cases (low rate, high processing time) and is much less flexible than normal prompt processing
=> preferable to address the need for added CPU by upgrading nodes or adding racks
– Significant cost: both hardware & effort
• Preferred scheme is inter-fill processing:
– In-fill processing is unattractive due to the added complexity:
• online, for dynamic partitioning, or
• offline, for the variable deferral fraction
• Central or clustered storage preferred
• A baseline infrastructure could provide:
– up to 2.5 kHz deferred-stream rate
– 8 s/event for processing
– processing completed within 48 hours (under 2012 operating conditions)
• In the case of a more efficient LHC, the deferred-stream rate would need to be lowered

Page 7

Additional Material

Page 8

Introduction
Deferred Triggers:
• Subset of events stored in the DAQ system & processed later in the run

Two processing options considered:
• Inter-fill processing: only process the deferred stream between physics fills
• Dynamic processing: process both in-fill and inter-fill, attempting to also make use of spare CPU capacity later in the run
– Potential competition with end-of-fill triggers

[Figure: ~50% decrease after 4 hours]

Page 9

Assumptions
• Events are built before being cached
– may contain an intermediate HLT result in case the HLT is run before caching
• The deferred stream consists of a specific subset of triggers:
– must not include triggers needed by the calibration stream to produce constants for the bulk processing
• Deferred triggers are output to a separate stream
• The deferred stream needs:
– different constants, possibly from a different run
– separate monitoring, since it relates to past rather than current conditions
– independence from the state of the ongoing run
=> Need separate processes for deferred-stream processing
=> File-based processing is the most straightforward
=> Need to partition the farm between prompt and deferred processing and dynamically balance resources
– relatively straightforward in the inter-fill scheme
– difficult in the dynamic scheme
=> The inter-fill scheme is the baseline

Page 10

Disk size & Total Processing time
Inter-fill scheme: includes delays due to the pausing of reprocessing during subsequent physics fills.
[Plots: disk usage by the deferred stream (TB); wall-time to process the deferred stream (hours)]
Results of Eric's model, based on 2012 fill information.
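The model itself is not reproduced here, but its structure is straightforward to sketch: accumulate the deferred stream on disk during fills and drain it between fills. The toy below uses an idealised pattern of identical fills, whereas the real model used the recorded 2012 fill schedule, so the numbers are illustrative only.

```python
# Toy fill-cycle model: cache during fills, play back between fills.
# The fill pattern is an illustrative assumption; the actual model was
# driven by the 2012 fill schedule.

def simulate(cache_hz, playback_hz, fill_h=10.0, gap_h=5.0, n_fills=10):
    backlog = 0.0  # events waiting on disk
    peak = 0.0     # peak backlog, which drives the required disk size
    for i in range(n_fills):
        backlog += cache_hz * fill_h * 3600        # caching during the fill
        peak = max(peak, backlog)
        if i < n_fills - 1:                        # inter-fill playback
            backlog = max(0.0, backlog - playback_hz * gap_h * 3600)
    tail_h = backlog / playback_hz / 3600          # drain after the last fill
    peak_tb = peak * 2e6 / 1e12                    # ~2 MB/event
    return peak_tb, tail_h

peak_tb, tail_h = simulate(cache_hz=1_000, playback_hz=2_500)
print(f"peak disk ~{peak_tb:.0f} TB, ~{tail_h:.0f} h to drain the last fill")
```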

Page 11

Time to process
[Plots for (caching, playback) = (0.5, 2.5), (1, 2.5) and (2.5, 2.5) kHz]
Inter-fill scheme: includes delays due to the pausing of reprocessing during subsequent physics fills.

Page 12

Disk Usage
[Plots for (caching, playback) = (0.5, 2.5), (1, 2.5) and (2.5, 2.5) kHz]
Inter-fill scheme: includes the effect of delays due to the pausing of reprocessing during subsequent physics fills.

Page 13

Requirements, some examples: inter-fill processing

Caching [kHz (GB/s)] | Playback [kHz (GB/s)] | Max. wall-time to process [h] | Max. disk usage [TB] | Avg. HLT processing time [s/event] | Effective increase in farm capacity [% of 20k cores]
---|---|---|---|---|---
0.5 (1) | 2.5 (5) | 23 | 85 | 8 | 20%
1 (2), Baseline | 2.5 (5) | 29 | 210 | 8 | 40%
2.5 (5) | 2.5 (5) | 49 | 660 | 8 | 100%
10 (20), High-Rate, clustered storage | 25 (50) | 29 | 2100 | 0.8 | 40%
10 (20), clustered storage | 10 (20) | 49 | 2640 | 2 | 100%

Notes:
• Wall-time and disk usage are from the model.
• Average HLT processing time = 20k cores / playback rate.
• Effective increase in farm capacity = HLT processing time × caching rate / 20k = caching rate / playback rate.
• Current SFO: 6×21 TB + 3×10 TB disks => 156 TB; write: 1.6 GB/s; read: 2 GB/s.
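The last two columns follow directly from the 20k-core farm figure; a minimal sketch reproducing them from the caching and playback rates:

```python
# Reproduce the derived columns of the table above from the two rates.
CORES = 20_000  # farm size used in the table

def hlt_time_s(playback_khz):
    """Average HLT processing time = cores / playback rate."""
    return CORES / (playback_khz * 1e3)

def capacity_gain(caching_khz, playback_khz):
    """Effective increase in farm capacity
    = HLT time * caching rate / cores = caching / playback."""
    return caching_khz / playback_khz

for caching, playback in ((0.5, 2.5), (1, 2.5), (2.5, 2.5), (10, 25), (10, 10)):
    print(f"{caching:>4} -> {playback:>4} kHz: "
          f"{hlt_time_s(playback):.1f} s/event, "
          f"+{capacity_gain(caching, playback):.0%}")
```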

Page 14

In-fill & Inter-fill processing
• Dynamic partitioning of the farm has to dynamically take into account changes in the CPU requirement
• Each change imposes delays to configure & start/abort processes => hard!
• Relatively small potential gains (except in the special case below):

Caching [kHz (GB/s)] | Playback [kHz (GB/s)] | Max. wall-time to process [h] | Max. disk usage [TB]
---|---|---|---
0.5 (1) | 2.5 (5) | 0.8 (c.f. 23 inter-fill only) | 14 (c.f. 85)
1 (2) | 2.5 (5) | 25 (c.f. 29) | 113 (c.f. 210)
1.5 (3) | 2.5 (5) | 31 | 253

Special case: in-fill processing rate = caching rate. Assumes 20% of the farm is used for prompt processing after 4 hours.

Would it be possible to use a mechanism similar to the end-of-fill triggers? Define a variable deferral fraction:
• set to 1 at the start of the run
• set to e.g. 0.8 during the run => 80% of deferred triggers cached, 20% processed promptly
• big disadvantage: events from the same lumi block end up in output files produced up to 48 hours apart
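A minimal sketch of what such a deferral fraction could look like online; the counter-based selection is an assumption by analogy with prescales, since the note does not specify the implementation:

```python
# Prescale-style variable deferral fraction: each event is either cached
# (deferred) or handed to the normal prompt trigger processes. The
# counter-based scheme is illustrative only.

class DeferralFraction:
    def __init__(self, fraction=1.0):
        self.fraction = fraction  # updatable during the run, like a prescale
        self._accum = 0.0

    def defer(self):
        """True if this event should be cached for deferred processing."""
        self._accum += self.fraction
        if self._accum >= 1.0:
            self._accum -= 1.0
            return True
        return False

sel = DeferralFraction(fraction=0.8)  # cache 80%, process 20% promptly
decisions = [sel.defer() for _ in range(10)]
print(decisions.count(True), "of 10 events cached")  # 8 of 10
```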

Page 15

DAQ & HLT
• Activation of deferred-stream processing should be automatic
– but it can be stopped/aborted by an expert
• Error handling should not normally require operator intervention
– but alert an expert if the system cannot restart correctly
• Must be possible to rapidly stop the partition when needed
– and to restart again from this point when CPU becomes available
• Need to define the action in case the disks become full:
– stop the deferred stream
– exceptionally, transfer events unprocessed to Tier-0? (if the rate is ~500 Hz)
• An extensive book-keeping framework is needed:
– to drive the playback
– to account for possible data losses
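The note does not specify the book-keeping design; the sketch below shows one plausible shape for the per-file records such a framework might track, with invented field names and states:

```python
# Illustrative per-file book-keeping record for driving playback and
# accounting for losses. All field names and states are hypothetical.

from dataclasses import dataclass
from enum import Enum

class FileState(Enum):
    CACHED = "cached"        # written to deferred storage
    REPLAYING = "replaying"  # assigned to an HLT node for playback
    PROCESSED = "processed"  # HLT result written; file may be deleted
    LOST = "lost"            # e.g. disk failure: must be accounted for

@dataclass
class DeferredFile:
    run_number: int
    lumi_block: int
    path: str
    n_events: int
    state: FileState = FileState.CACHED

# Playback driver: take the oldest cached file first, so lumi blocks
# complete (and can be shipped to Tier-0) in order.
backlog = [
    DeferredFile(200001, 102, "/defer/r200001_lb102_0001.data", 4800),
    DeferredFile(200001, 101, "/defer/r200001_lb101_0001.data", 5000),
]
next_file = min((f for f in backlog if f.state is FileState.CACHED),
                key=lambda f: (f.run_number, f.lumi_block))
next_file.state = FileState.REPLAYING
print(next_file.path)  # -> /defer/r200001_lb101_0001.data
```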

Page 16

Tier-0
• While it is technically possible to deal with delays of >48 hours, anything that deviates from the standard workflow is significant extra work
=> should keep within 48 hours except in very rare exceptions
• Important that output files are LB-aware, i.e. closed at LB boundaries
• In the case of the clustered or distributed options, a significant addition to Tier-0 would be needed to merge files:
– multi-step RAW-file merging needed (more complicated than the current one-step process)
– currently ~10 files per LB; could be ~200 smaller files for clustered storage (even more for distributed storage)
• Completeness of the dataset is an issue: completeness is relied upon in many places
– e.g. the RAW merging job is only defined for complete data
– would need to adapt the Tier-0 workflow to enable processing of the prompt stream with only partially complete LBs
• Extra infrastructure is needed if, in exceptional circumstances, unprocessed events are streamed to Tier-0:
– complete HLT processing & re-streaming needed offline, similar to debug reprocessing but on a much bigger scale (~10M events c.f. a few hundred)
– retroactive insertion of the processed data into the handshake DB
– merging of the many small files produced
– need to add to the files from the truncated online processing before bulk reconstruction
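The LB-aware requirement amounts to never letting events from two lumi blocks share an output file. A minimal sketch of a writer that rolls files at LB boundaries (the file-naming convention is invented):

```python
# Minimal LB-aware writer: close the current output file whenever the
# lumi block changes, so no file ever spans an LB boundary. The naming
# convention is invented for illustration.

class LBAwareWriter:
    def __init__(self, prefix):
        self.prefix = prefix
        self.current_lb = None
        self.fh = None

    def write(self, lb, event_bytes):
        if lb != self.current_lb:  # LB boundary: roll the output file
            if self.fh:
                self.fh.close()
            self.fh = open(f"{self.prefix}_lb{lb:04d}.data", "ab")
            self.current_lb = lb
        self.fh.write(event_bytes)

    def close(self):
        if self.fh:
            self.fh.close()

w = LBAwareWriter("deferred_run0001")
for lb, payload in [(1, b"evt"), (1, b"evt"), (2, b"evt")]:
    w.write(lb, payload)
w.close()  # leaves ..._lb0001.data and ..._lb0002.data on disk
```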

Page 17

DQ
• Online monitoring should be separate
• Offline: it should be possible to treat the deferred stream in the same way as other streams
=> deferred triggers must be adequately represented in the express stream
• Deferred stream available for bulk processing within 48 hours of run-end
• Need a stream-dependent good-run list