Supplemental Material: Data on Tevatron Complex Reliability

Elliott McCrory
25 February 2004


Reliability Data (Supplemental Material) – McCrory 2

Presentation Outline

• Main talk is elsewhere
• The data here are supplemental to that talk

Downtime Logger, D18

[Figure: Console Application: Screen Shot]

Web Page: 4 Weeks of Downtime

Click Here…

Focus on one subsystem: BPS

Entries like these are used as the basis for the analysis

Tevatron Complex Subsystems

BDIAG BMAG BMISC BPS BRF BVAC BWATR CMISC CNET CR&LK CTIME EXPAR FDIAG FMISC FPS FWATR

LDIAG LMAG LMISC LPS LRF LVAC LWATR MIDIAG MIMISC MIPS MIRF MISC MIVAC MIWATR MIMAG MIXFER

POWER RADTRIP RRMISC SAFETY TCRYO TDIAG TMAG TMISC TPS TQPM TQUEN TRF TVAC TWATR WATER

NTF PACC (L) PBCOOL PBDIAG PBMAG PBMISC PBPS PBRF PBTRGT PBVAC PBWATR PBXFER

Definition of Terms

MTBF
  Mean Time Between Failure
  Simple arithmetic average of the time between failures
λ
  Used to characterize the probability of an interval in the classic random queuing problem
    • E.g., the arrival interval of cars at a toll booth
  λ = 1/<t> = 1/σ
    • Actually, not σ, but R.M.S.
  Calculated here as 2/(MTBF + σ(MTBF))
Average Downtime Fraction
  = [Σ (time down)] / (specified interval)
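The three definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the interval and downtime values are hypothetical, the function names are invented here, and σ is taken to be the population standard deviation of the failure intervals (the slide notes that an R.M.S. value was actually used, so the original calculation may differ in detail).

```python
from statistics import mean, pstdev


def mtbf(intervals):
    """Simple arithmetic average of the time between failures."""
    return mean(intervals)


def failure_rate(intervals):
    """lambda = 2 / (MTBF + sigma(MTBF)), as quoted on the slide.

    Assumption: sigma is the population standard deviation of the intervals.
    """
    return 2.0 / (mean(intervals) + pstdev(intervals))


def downtime_fraction(downtimes, interval_length):
    """Sum of time down divided by the length of the specified interval."""
    return sum(downtimes) / interval_length


# Hypothetical data, in hours (not taken from the downtime logger).
intervals_h = [2.0, 4.0, 6.0, 8.0]
downtimes_h = [0.5, 1.0, 0.5]

print("MTBF:", mtbf(intervals_h))
print("lambda:", failure_rate(intervals_h))
print("downtime fraction:", downtime_fraction(downtimes_h, 100.0))
```

For a truly exponential distribution the mean and σ are equal, so 2/(MTBF + σ) reduces to 1/MTBF; averaging the two is one way to soften the effect of data that only roughly follow that distribution.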

MTBF For Each Major System: Linac

[Figure: running average MTBF (hours) versus days since 1/1/2003, with running averages over 120 days and over 30 days. The Fall 2003 shutdown is marked. "Good" and "Bad" arrows indicate the preferred direction.]

MTBF For Each Major System: Booster

[Figure: running average MTBF (minutes) versus days since 1/1/2003. "Good" and "Bad" arrows indicate the preferred direction.]

MTBF For Each Major System: PBar

[Figure: running average MTBF (minutes) versus days since 1/1/2003. Annotation: "A lot less downtime". "Good" and "Bad" arrows indicate the preferred direction.]

MTBF For Each Major System: Main Injector

[Figure: running average MTBF (minutes) versus days since 1/1/2003. "Good" and "Bad" arrows indicate the preferred direction.]

MTBF For Each Major System: Tevatron

[Figure: running average MTBF (minutes) versus days since 1/1/2003. "Good" and "Bad" arrows indicate the preferred direction.]

MTBF For Each Major System: Controls

[Figure: running average MTBF (minutes) versus days since 1/1/2003. "Good" and "Bad" arrows indicate the preferred direction.]

Tevatron as 2 Machines

• The machines are called “Low Beta” and “Not Low Beta”
    • It is “Low Beta” if the energy is greater than 900 GeV
    • Otherwise, it is called “Not Low Beta”
• Only one machine exists at a time
• Time between failures
    • Only count time in which the machine “exists”
    • That is, count minutes at 980 to determine time between failures at 980
    • Ditto for “Not Low Beta”
• Eliminate obvious bad points
    • E.g., 5000+ minutes of shutdown between failures at 150 for TeV this winter.
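One way to read the counting rule above is as a per-machine clock that only advances while that machine "exists". The sketch below is a hypothetical illustration, not the analysis code behind the slides: the segment format, the function name, and the sample numbers are assumptions, and the labels T150 and T980 are borrowed from the chart legend on the following slide. The optional cutoff mimics the removal of obvious bad points such as a 5000+ minute shutdown.

```python
from collections import defaultdict


def intervals_by_machine(segments, max_interval=None):
    """Return the time between failures for each machine.

    segments: iterable of (machine, minutes, ended_in_failure) tuples in
    chronological order. Only minutes spent in a machine count toward that
    machine's clock. Intervals longer than max_interval are discarded.
    """
    clock = defaultdict(float)   # minutes accumulated since the last failure
    result = defaultdict(list)   # recorded failure intervals per machine
    for machine, minutes, ended_in_failure in segments:
        clock[machine] += minutes
        if ended_in_failure:
            if max_interval is None or clock[machine] <= max_interval:
                result[machine].append(clock[machine])
            clock[machine] = 0.0
    return dict(result)


# Hypothetical timeline, in minutes.
segments = [
    ("T150", 30, False),
    ("T980", 600, True),
    ("T150", 45, True),
    ("T980", 300, False),
    ("T150", 20, False),
    ("T980", 500, True),
    ("T150", 6000, True),   # shutdown-length gap, dropped by the cutoff
]

print(intervals_by_machine(segments, max_interval=5000))
```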

Tevatron as 2 Machines, MTBF

[Figure: failure interval versus date, 1/1/2003 to 2/5/2004, vertical axis from 0 to 180. Series: Failure Interval, T150; Failure Interval, T980; and a 10-period moving average of each.]

MTBF is getting longer!
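The trend lines in this chart are 10-period moving averages. A minimal sketch of a trailing moving average follows; it assumes the common spreadsheet convention that each point averages the current value and the nine before it, which may not match the charting tool exactly. The function name and input values are hypothetical.

```python
def moving_average(values, window=10):
    """Trailing moving average: one output per full window of inputs."""
    averages = []
    for end in range(window - 1, len(values)):
        chunk = values[end - window + 1 : end + 1]
        averages.append(sum(chunk) / window)
    return averages


# Hypothetical failure intervals; the first output appears at the 10th point.
failure_intervals = list(range(1, 13))
print(moving_average(failure_intervals))
```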

λ(t) For Each Major System: Linac

[Figure: λ(t) versus days since 1/1/2003, with running averages over 120 days and over 30 days. The vertical axis is labeled "Running average MTBF, hours" in the source. "Bad" and "Good" arrows indicate the preferred direction.]

λ(t) For Each Major System: Booster

[Figure: λ(t) versus days since 1/1/2003. The vertical axis is labeled "Running average MTBF, hours" in the source. "Bad" and "Good" arrows indicate the preferred direction.]

λ(t) For Each Major System: Main Injector

[Figure: λ(t) versus days since 1/1/2003. The vertical axis is labeled "Running average MTBF, hours" in the source. "Bad" and "Good" arrows indicate the preferred direction.]

λ(t) For Each Major System: PBar

[Figure: λ(t) versus days since 1/1/2003. The vertical axis is labeled "Running average MTBF, hours" in the source. "Bad" and "Good" arrows indicate the preferred direction.]

λ(t) For Each Major System: Tevatron

[Figure: λ(t) versus days since 1/1/2003. The vertical axis is labeled "Running average MTBF, hours" in the source. "Bad" and "Good" arrows indicate the preferred direction.]

λ(t) For Each Major System: Controls

[Figure: λ(t) versus days since 1/1/2003. The vertical axis is labeled "Running average MTBF, hours" in the source. "Bad" and "Good" arrows indicate the preferred direction.]

Average Downtime Fraction: Linac

[Figure: running average downtime duration (minutes) versus days since 1/1/2003, with running averages over 120 days and over 30 days. "Bad" and "Good" arrows indicate the preferred direction.]

Average Downtime Fraction: Booster

[Figure: running average downtime duration (minutes) versus days since 1/1/2003. "Bad" and "Good" arrows indicate the preferred direction.]

Average Downtime Fraction: Main Injector

[Figure: running average downtime duration (minutes) versus days since 1/1/2003. "Bad" and "Good" arrows indicate the preferred direction.]

Average Downtime Fraction: PBar

[Figure: running average downtime duration (minutes) versus days since 1/1/2003. "Bad" and "Good" arrows indicate the preferred direction.]

Average Downtime Fraction: Tevatron

[Figure: running average downtime duration (minutes) versus days since 1/1/2003. "Bad" and "Good" arrows indicate the preferred direction.]

Average Downtime Fraction: Controls

[Figure: running average downtime duration (minutes) versus days since 1/1/2003. "Bad" and "Good" arrows indicate the preferred direction.]

Radioactive Decay

[Figure: radioactive decay of a substance with a half-life of 20. Substance remaining (0 to 1.2) versus time (0 to 100).]

λ e^(-λt) is the frequency distribution of the classic random queuing problem

F(t) = 2^(-t/τ)
     = e^(-ln(2) t/τ)
     ≡ e^(-λt)

λ = ln(2)/τ ≈ 0.0347
<t> = σ = 1/λ ≈ 28.9
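The numbers on this slide follow directly from the half-life. A small check, assuming a half-life τ of 20 in the same arbitrary time units as the plot (the variable and function names are illustrative only):

```python
import math

HALF_LIFE = 20.0                      # tau, from the slide's plot title

lam = math.log(2) / HALF_LIFE         # decay constant, lambda = ln(2) / tau
mean_life = 1.0 / lam                 # <t> = sigma = 1 / lambda


def remaining(t):
    """Fraction of the substance remaining at time t: exp(-lambda * t)."""
    return math.exp(-lam * t)


print(f"lambda = {lam:.4f}")
print(f"<t> = {mean_life:.1f}")
print(f"remaining after one half-life = {remaining(HALF_LIFE):.3f}")
```

The same function written as `2 ** (-t / HALF_LIFE)` gives identical values, which is the equivalence the slide states.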

Controls Downtimes

Data dump of all Controls downtimes since 1-1-2003

Time | Duration | Min | System | Description
1/2/03 11:30 | 10 min. | 10 | CR&LK | Orbmp timer card work
1/3/03 8:30 | 1.00 hrs. | 60 | CMISC | CONS2 getting eaten by rtvscan, localcon, winvmenue, auto-reboots
1/8/03 16:19 | 47 min. | 47 | CNET | Unable to talk to Low Beta QPMs - network problem.
1/28/03 1:55 | 55 min. | 55 | CR&LK | Booster beam permit down.
1/30/03 12:30 | 1.77 hrs. | 106.2 | CMISC | TLG got sick - leaky memory is bad.
2/10/03 10:27 | 1 min. | 1 | CMISC | MI abort on MECAR
2/13/03 7:18 | 33 min. | 33 | CR&LK | MI correctors in crate 24 coming back with calculation errors
2/17/03 2:15 | 55 min. | 55 | CR&LK | MI CAMAC fiber repeater PS @ MI40 is dead.
2/19/03 17:17 | 58 min. | 58 | CR&LK | TeV crate A8 PS and fan pack failure. Both were replaced.
2/19/03 18:37 | 41 min. | 41 | CR&LK | TeV $A8 PS replacement was a bad spare. A workable spare was installed.
2/21/03 16:30 | 14 min. | 14 | CMISC | Booster FE @ >100% of bandwidth
3/6/03 13:30 | 1.50 hrs. | 90 | CMISC | BBM violation: 120 GeV to AP0. No beam was present.
3/27/03 2:50 | 50 min. | 50 | CMISC | False BBM violation. 8GeV-MI integration problem.
4/5/03 23:55 | 15 min. | 15 | CNET | No shot zone held off proton loading due to known network problem
4/8/03 8:00 | 1.43 hrs. | 85.8 | CR&LK | E48 kicker not charging for reverse tune up. 465 card replaced.
4/11/03 4:16 | 1.73 hrs. | 103.8 | CTIME | Tev Crate $3D TClock problems, replace 178 card
4/18/03 15:50 | 2.93 hrs. | 175.8 | CNET | A2 thermometry crate networking repair.
4/25/03 4:40 | 3.17 hrs. | 190.2 | CR&LK | C:DQ0 trip traced to bad crate controller for Tev $D9
4/25/03 16:25 | 9 min. | 9 | CMISC | MI LLRF needed a reboot.
4/25/03 19:23 | 35 min. | 35 | CMISC | MI LLRF needed a reboot.
4/26/03 21:06 | 12 min. | 12 | CTIME | rebooting TLG front end
5/16/03 11:58 | 40 min. | 40 | CMISC | T:STORE and C:FILE set to wrong store number (2511)
5/22/03 7:25 | 21 min. | 21 | CMISC | Proton loading controls problem.
6/9/03 0:08 | 55 min. | 55 | CR&LK | Damper problems due to bad PS in Booster crate $93
6/20/03 3:50 | 10 min. | 10 | CR&LK | Bad 170 card in Pbar crate $57.
6/20/03 10:01 | 1 min. | 1 | CTIME | Pbar trip due to TLG $29 spacing.
6/24/03 22:15 | 1.25 hrs. | 75 | CR&LK | Tev E1 vacuum controls problem.
6/27/03 9:27 | 5 min. | 5 | CR&LK | Booster off due to lost power in crates $40 and $41
7/3/03 23:30 | 2.92 hrs. | 175.2 | CMISC | Communications lost with all TeV sector frig VME's.
7/7/03 13:55 | 21 min. | 21 | CR&LK | 453 card failure; MI-3; crate $85
7/8/03 20:40 | 47 min. | 47 | CR&LK | Booster MADC #8 failure. Chassis changed out.
7/14/03 14:14 | 13 min. | 13 | CR&LK | Lost Crates 70, 71, 72 when a circuit breaker tripped when used by a heat gun
7/17/03 4:16 | 2.77 hrs. | 166.2 | CMISC | Pbar MADC #14 P.S. failure
7/25/03 9:29 | 5 min. | 5 | CR&LK | I:HT848 C453 card replacement: MI3, C84, slot 5
7/25/03 10:27 | 2 min. | 2 | CR&LK | MI node crate $84 changeout
7/25/03 10:56 | 4 min. | 4 | CR&LK | MI node crate $84 troubleshooting
7/25/03 20:25 | 55 min. | 55 | CR&LK | TeV crate $CC power supply changeout
7/30/03 21:04 | 5.02 hrs. | 301.2 | CR&LK | Quench due to A1 CIA failure.
7/31/03 15:44 | 6 min. | 6 | CMISC | Camac link Driver for Switchyard front end failed.
8/4/03 20:13 | 55 min. | 55 | CR&LK | PBar Crate $5B takes down stacking
8/5/03 8:25 | 2.17 hrs. | 130.2 | CMISC | FCC without power affecting network. MiniBoone DAC specifically.
8/10/03 9:20 | 59 min. | 59 | CR&LK | C:S7B1A replacement.
8/13/03 17:16 | 44 min. | 44 | CTIME | Booster EAPS trip due to BRF1. 220 sec timeline was the cause.
8/14/03 10:00 | 2.00 hrs. | 120 | CMISC | VAX OACs restarts due to Database hardware problem.
8/18/03 9:00 | 1.83 hrs. | 109.8 | CR&LK | Bad PS in Tev crate $C5.
8/27/03 11:43 | 14 min. | 14 | CR&LK | SY CAMAC link down, held off MiniBooNE beam permit.
8/28/03 20:00 | 4.73 hrs. | 283.8 | CTIME | MiniBooNE toroid timing problems at MI10
1/5/04 18:12 | 7 min. | 7 | CR&LK | MI crate $65 needed a restore after a 453 card swap
1/7/04 21:10 | 24 min. | 24 | CR&LK | C:SFB2 465 card swap problems (dip switch incorrectly set)
1/13/04 9:40 | 1.25 hrs. | 75 | CMISC | MiniBooNe DAQ problems
1/14/04 12:55 | 12 min. | 12 | CTIME | HVF12 PS problems due to timeline.
1/20/04 10:30 | 30 min. | 30 | CMISC | TRF gates will not come on.
1/22/04 7:01 | 9 min. | 9 | CR&LK | C467 replaced for I:LAM41.
1/22/04 23:45 | 1.93 hrs. | 115.8 | CMISC | Alarms screen not operating
1/31/04 17:42 | 7 min. | 7 | CTIME | P1 line trip due to TLG mixup.
2/4/04 16:00 | 1.50 hrs. | 90 | CMISC | MiniBooNE and stacking beam inhibited due to HEP inhibit ($53) generated.
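The Min column of the downtime dump is the Duration column converted to minutes, and entries can be totaled per subsystem code. The sketch below shows that conversion on the first five entries of the dump; the function names and the tuple layout are assumptions made for illustration, not the logger's actual format.

```python
from collections import defaultdict


def duration_minutes(text):
    """Convert a duration such as '47 min.' or '1.77 hrs.' to minutes."""
    value, unit = text.split()
    if unit.startswith("hr"):
        return float(value) * 60.0
    if unit.startswith("min"):
        return float(value)
    raise ValueError(f"unrecognized duration unit: {unit!r}")


def downtime_by_system(entries):
    """Sum downtime minutes for each subsystem code."""
    totals = defaultdict(float)
    for duration, system in entries:
        totals[system] += duration_minutes(duration)
    return dict(totals)


# First five entries of the dump: (Duration, System).
entries = [
    ("10 min.", "CR&LK"),
    ("1.00 hrs.", "CMISC"),
    ("47 min.", "CNET"),
    ("55 min.", "CR&LK"),
    ("1.77 hrs.", "CMISC"),
]

for system, minutes in sorted(downtime_by_system(entries).items()):
    print(f"{system}: {minutes:.1f} min")
```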

λ For All Systems

[Figure: bar chart of λ for each system, log scale from 0.001 to 1.]

System | λ
Booster | 0.136
Controls | 0.009
Experiment | 0.006
Linac | 0.183
Main Inj | 0.044
Pbar | 0.022
Tevatron | 0.037
Misc | 0.043

Probability of no failure in a given hour = 1 - λ

E.g., Linac up for 5 hours: (1 - 0.183)^5 = 0.364
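The Linac example can be checked numerically. The sketch below computes the hour-by-hour survival probability used on the slide and, for comparison, the continuous exponential form e^(-λt) from the radioactive decay slide. The function names are illustrative, and treating λ as a per-hour failure probability is the slide's own approximation.

```python
import math

LINAC_LAMBDA = 0.183   # failures per hour, from the chart


def survival_discrete(lam, hours):
    """Probability of no failure in each of `hours` consecutive hours."""
    return (1.0 - lam) ** hours


def survival_exponential(lam, hours):
    """Continuous-time counterpart: exp(-lambda * t)."""
    return math.exp(-lam * hours)


print(f"discrete:    {survival_discrete(LINAC_LAMBDA, 5):.3f}")
print(f"exponential: {survival_exponential(LINAC_LAMBDA, 5):.3f}")
```

The two forms agree closely when λ is small and diverge as λ grows, which is why the Linac, with the largest λ, shows the most visible difference.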

Quality of λ Fit Assumption

[Figure: bar chart of Q for Booster, Controls, Experiment, Linac, Main Inj, Pbar, Tevatron, and Misc, vertical axis from 0 to 1.2.]

Q = <t>/σ
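For an exponential distribution the mean and the standard deviation are equal, so Q should be close to 1 when the random-failure assumption holds. The sketch below illustrates this with simulated intervals; the sample size, seed, rate, and function name are arbitrary assumptions, and σ is taken as the population standard deviation.

```python
import random
from statistics import mean, pstdev


def fit_quality(intervals):
    """Q = <t> / sigma; near 1 for exponentially distributed intervals."""
    return mean(intervals) / pstdev(intervals)


rng = random.Random(2004)                                  # fixed seed
simulated = [rng.expovariate(0.183) for _ in range(20000)]  # exponential
regular = [9.0, 10.0, 11.0]                                 # nearly periodic

print(f"Q, simulated exponential: {fit_quality(simulated):.2f}")
print(f"Q, nearly periodic:       {fit_quality(regular):.2f}")
```

Values of Q well above 1 suggest failures that are more regular than random, and values well below 1 suggest clustering.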

Store Duration

[Figure: dN/dT versus store duration (hours). Series: Failures; Intentional: 2003; Intentional: 2004.]