ALMA Integrated Computing Team ICT Coordination and Planning Meeting #2 Santiago 28-29 January 2014 Alarm system A.Caproni


ALMA Integrated Computing Team

ICT Coordination and Planning Meeting #2, Santiago, 28-29 January 2014

Alarm system

A.Caproni

ICT-CPM2 28-29 January 2014

Alarm system status

According to operators, the alarm panel is useless:
  - Too many alarms
  - Stale alarms
  - False alarms

Result of a 4h profiling by Patricio (mid Nov 2013): ~31k alarms
  - By state: ACTIVE 16103, TERMINATE 15407
  - By priority: Pri 0: 41, Pri 1: 1820, Pri 2: 500, Pri 3: 29149
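To put the profiling numbers in perspective, the sketch below recomputes the total, the average event rate, and the share of each priority from the figures on this slide. It assumes the two state counts and the four priority counts describe the same set of alarm events over the full 4h window; the variable and function names are illustrative only.

```python
# Figures copied from the 4h profiling on this slide.
by_state = {"ACTIVE": 16103, "TERMINATE": 15407}
by_priority = {0: 41, 1: 1820, 2: 500, 3: 29149}

total = sum(by_state.values())
# Assumption: both breakdowns count the same events, so they must agree.
assert total == sum(by_priority.values())

hours = 4
rate_per_minute = total / (hours * 60)


def share(count, total):
    """Percentage of the total, rounded to one decimal."""
    return round(100.0 * count / total, 1)


shares = {pri: share(n, total) for pri, n in by_priority.items()}

print(f"total events: {total}")
print(f"events per minute: {rate_per_minute:.0f}")
for pri, pct in shares.items():
    print(f"priority {pri}: {pct}%")
```

The two breakdowns agree on 31510 events, which is about 131 events per minute, and more than 92% of them are at priority 3.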

Insufficient coverage: Scripts and tools not provided by ALMA computing

Snapshot - 1

Snapshot - 2

Snapshot - 3

AS improvement plan (proposal)

  - Show only “real alarms”, remove the others (trust)
  - Useful documentation in panel (twiki?)
  - Fix most chattering alarms
      - DGCK:*:1, DGCK:*:4
      - FLOOG,*,7
  - Fix stale alarms
      - Manager,*,1
      - LO2BBpX:*:1, LO2BBpX:*:10, LO2BBpX:*:11
      - WCA:*:1
  - Improve system startup and device initialization
  - Profile during operations like array creation/destruction, total power…
  - TMCDB configuration (input from System Engineering for BACI props)
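As an illustration of how chattering and stale alarms could be picked out of a recorded alarm stream, here is a minimal sketch. It is not the alarm system's own logic. It assumes each event is a `(timestamp_seconds, triplet, state)` tuple with `state` being `"ACTIVE"` or `"TERMINATE"`. The thresholds, the function names, and the member names in the sample data are all hypothetical.

```python
from collections import defaultdict


def find_chattering(events, max_events, window_s):
    """Return triplets with more than max_events events inside any
    sliding window of window_s seconds."""
    times = defaultdict(list)
    for ts, triplet, _state in sorted(events):
        times[triplet].append(ts)
    chattering = set()
    for triplet, ts_list in times.items():
        start = 0
        for end, ts in enumerate(ts_list):
            # Shrink the window until it spans at most window_s seconds.
            while ts - ts_list[start] > window_s:
                start += 1
            if end - start + 1 > max_events:
                chattering.add(triplet)
                break
    return chattering


def find_stale(events, now, max_age_s):
    """Return triplets whose last known state is ACTIVE and has not
    changed for more than max_age_s seconds."""
    last = {}
    for ts, triplet, state in sorted(events):
        last[triplet] = (ts, state)
    return {t for t, (ts, state) in last.items()
            if state == "ACTIVE" and now - ts > max_age_s}


# Made-up sample data; member names are placeholders for the '*' wildcard.
events = [
    (0, "DGCK:member-a:1", "ACTIVE"),
    (1, "DGCK:member-a:1", "TERMINATE"),
    (2, "DGCK:member-a:1", "ACTIVE"),
    (3, "DGCK:member-a:1", "TERMINATE"),
    (5, "WCA:member-b:1", "ACTIVE"),
]
chattering = find_chattering(events, max_events=3, window_s=10)
stale = find_stale(events, now=3600, max_age_s=1800)
print(chattering, stale)
```

With this sample the DGCK alarm is flagged as chattering (four events within 10 s) and the WCA alarm as stale (still ACTIVE after almost an hour). Sensible thresholds would have to come from profiling real operations.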

AS improvement plan (proposal)

  - ACS next improvements
      - Alarm server to dump alarms on files (ICT-1908)
          - Offline profiling
          - Correlate alarms and logs while debugging (?)
          - After-the-fact GUIs and tools
      - Alarm panel to group alarms belonging to the same array (ICT-1760)
  - Nominate an “Alarm System Manager”
      - Regularly profile the AS
      - Check and update the documentation
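The offline profiling and per-array grouping proposed above could look roughly like the sketch below. The dump format is an assumption: the slides do not specify what ICT-1908 would write, so a CSV file with `timestamp,family,member,code,state` columns is used here for illustration. The member-to-array mapping, the array name, and all function names are hypothetical as well.

```python
import csv
import io
from collections import Counter, defaultdict


def load_dump(text):
    """Parse a CSV alarm dump (assumed format) into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def top_offenders(rows, n):
    """Return the n most frequent (family, member, code) triplets."""
    counts = Counter((r["family"], r["member"], r["code"]) for r in rows)
    return counts.most_common(n)


def group_by_array(rows, member_to_array):
    """Group alarm rows by the array their member belongs to."""
    groups = defaultdict(list)
    for r in rows:
        array = member_to_array.get(r["member"], "unassigned")
        groups[array].append(r)
    return dict(groups)


# Made-up sample dump; timestamps and member names are placeholders.
SAMPLE = """timestamp,family,member,code,state
0,FLOOG,member-a,7,ACTIVE
1,FLOOG,member-a,7,TERMINATE
2,WCA,member-b,1,ACTIVE
"""

rows = load_dump(SAMPLE)
offenders = top_offenders(rows, 1)
groups = group_by_array(rows, {"member-a": "Array001"})
print(offenders)
print({name: len(items) for name, items in groups.items()})
```

Running such a profile regularly on the dumped files would give an Alarm System Manager a ranked list of the noisiest alarms without touching the running system.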

Scalability

  - ACS handed over to OSF after fixing persistence and NCs
      - RTI/DDS tested with 48 antennas
  - Number of alarms expected to grow with more antennas
  - Alarm system performance
      - AS persists alarms in memory
      - Already decoupled from source NC
  - ACS “new” AlarmSource API
      - Avoids resending an alarm if its state did not change
      - Enable/disable alarm sending
      - Queuing of alarms
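The three AlarmSource behaviours listed on this slide (suppressing unchanged states, enable/disable, and queuing) can be pictured with the sketch below. This is not the ACS AlarmSource API: the class name, the method names, and the choice to keep only the latest queued state per alarm are assumptions made for illustration.

```python
class SketchAlarmSource:
    """Hypothetical stand-in showing dedup, enable/disable and queuing."""

    def __init__(self, send):
        self._send = send        # callable(triplet, active) that publishes
        self._last_state = {}    # triplet -> last state actually published
        self._enabled = True
        self._queuing = False
        self._queue = {}         # triplet -> latest state seen while queuing

    def set_alarm(self, triplet, active):
        if not self._enabled:
            return               # sending disabled: drop the request
        if self._queuing:
            self._queue[triplet] = active
            return
        self._publish(triplet, active)

    def _publish(self, triplet, active):
        # Skip the send if the state did not change since the last one.
        # A triplet never seen before always counts as a change.
        if self._last_state.get(triplet) == active:
            return
        self._last_state[triplet] = active
        self._send(triplet, active)

    def enable(self):
        self._enabled = True

    def disable(self):
        self._enabled = False

    def start_queuing(self):
        self._queuing = True

    def flush(self):
        """Stop queuing and publish the latest queued state per alarm."""
        self._queuing = False
        pending, self._queue = self._queue, {}
        for triplet, active in pending.items():
            self._publish(triplet, active)


sent = []
src = SketchAlarmSource(lambda t, a: sent.append((t, a)))
alarm = "Manager:member-a:1"     # made-up member name
src.set_alarm(alarm, True)
src.set_alarm(alarm, True)       # same state: suppressed
src.set_alarm(alarm, False)
src.disable()
src.set_alarm(alarm, True)       # dropped while disabled
src.enable()
src.start_queuing()
src.set_alarm(alarm, True)
src.set_alarm(alarm, False)      # overwrites the queued True
src.flush()                      # False equals last published state
print(sent)
```

In this run only two messages reach the sender, even though seven calls were made. Fewer messages per source is what helps when the number of antennas, and with it the number of alarms, grows.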